Courrier des statistiques N2 - 2019
RMéS: INSEE’s Statistical Metadata Repository
Understanding what a statistical result covers in order to facilitate its interpretation or reuse is what statistical metadata is all about. They are of many kinds: definitions, classifications, characteristics of the source used to produce the result, etc. To make this information accessible, INSEE’s Statistical Metadata Repository, RMéS, structures it and centralizes it, ensuring the consistency of the information through identification rules. The implementation of international standards for describing metadata facilitates their use, automatically or manually. This also allows sharing with other entities. The environment for communicating with the repository has also been set up: services are offered throughout the life cycle of a statistical operation, both to supply the repository and to access its content. The use of RMéS goes beyond simple reuse: some client applications can rely on metadata to automatically produce components of the statistical process, for example to specify questionnaires, or even eventually describe controls or dissemination products. Cross-functional and structuring for statistical production, RMéS appears as a potential vector for coordination within the French Official Statistical System.
- Documenting to make sense of the figures
- Wide-ranging statistical metadata ...
- ... Standardised, accessible and re-usable ...
- ... Centralised within a repository: RMéS
- An existing wealth of content
- Repository service offerings
- Towards active metadata
- One repository for the whole Official Statistical System?
- Box 1. An example of metadata used in the 2016 ICT survey*
- Box 2. RMéS has replaced the Structured Data Documentation system (DDS)
With the introduction in the 1980s of its Structured Data Documentation System (Dispositif de Documentation Structurée – DDS), INSEE highlighted the importance of all information used to gain a better understanding of statistical output. Without information about concepts, classifications, variables or code lists, statistical output could not be used or interpreted correctly, nor paired with other data – simply put, we would not know what they are measuring. Such information is data about… data: the metadata, which need to be developed and made available in an efficient manner. This article explores how INSEE’s Statistical Metadata Repository (Référentiel de MÉtadonnées Statistiques, or RMéS), which replaced DDS in 2018, addresses this need.
Documenting to make sense of the figures
INSEE’s motto, “measuring, understanding”, underlines its commitment to produce high-quality data that can be used to inform public debate. This commitment is reflected in INSEE’s surveys, use of administrative data sources and range of information systems. These sources can be used to produce figures that are published in reports and releases available on the INSEE website, which are in turn used by the media, researchers and other users.
Yet these figures, when taken in isolation, do not provide any information. To become “data”, they must be accompanied by a suite of information that allows us to understand and interpret them correctly. This information often goes unnoticed as it is interspersed with commentary and remarks that often accompany data analysed in a publication or used in a newspaper article. Take the example of the 2018 summer tourist season review, featured in publication No. 134 (November 2018): “Over the six months of the 2018 summer tourist season, there were 311 million overnight stays in short-stay tourist accommodation in metropolitan France, an increase of 1.3% on the 2017 summer season.” The commentary gives the essential details in order to determine at first glance what the 1.3% refers to. It is a comparison with the same period in the previous year; the period in question is the “summer season”, whose definition is given in the article (April to September inclusive). This comparison refers to the number of “overnight stays”, which is also defined in the article. The territory in question is metropolitan France. The study covers short-stay tourist accommodation, which is made explicit under “Definitions” on the INSEE website, and refers to a number of sub-sectors in division 55 (Accommodation) of the French Classification of Activities (NAF), also available on the INSEE website.
Users are given details that allow them to determine what the figures (i.e. 311 million, 1.3%) refer to and the categories that they use, and can thereby understand what the 1.3% figure conveys. More informed users can also find further information on the INSEE website regarding the visitor surveys used to arrive at the final figures and which are referenced as sources in the article. They can find out the number and type of units surveyed, the number and type of questionnaires, response rates, adjustment methods or release formats. In our example, the details used to understand the figures are presented in the form of explanatory commentary in the publication. However, we can go one step further: the items of information that define and describe data, such as concepts (e.g. overnight stay) or classifications (e.g. French Classification of Activities), are in themselves data that need to be isolated and managed. These are known as “metadata” and are used in a number of areas of the statistical production process.
Wide-ranging statistical metadata ...
The above examples are important metadata for statistics, but many more exist, and take a variety of forms. Due to their heterogeneity, gaining a firm grasp of the concept can sometimes prove difficult. For clarity, we can however outline a number of ways in which they can be characterised.
Metadata can be distinguished by the type of information they provide (Box 1): some provide a detailed definition of what a figure refers to (e.g. classifications, concepts, code lists), others provide qualitative information about statistical output (e.g. methodology used, details about a data series, response rate, seasonal adjustment methods), while others clarify the production process (e.g. collection dates, release calendar and frequency). Together, from different perspectives, they give users a clear idea about the subject of discussion, in accordance with the principle of clarity in the European Statistics Code of Practice. However, descriptions of management guidelines or procedures are not included in statistical metadata, as they are not essential to understanding the output, despite the value of such operational documentation for managers and research analysts.
Metadata can vary in scope. Some are cross-functional and may be used by a number of data producers, existing entirely separately from a statistical operation. This is the case for most classifications (Guibert et alii, 1971), which are, by nature, shared, and for concepts that ensure consistency in statistical output (e.g. primary occupation, minimum age, urban area, etc.). Some on the other hand are specific to an operation: its description or the questionnaire in the case of a survey, the survey design (where applicable), as well as variables and code lists. In the space between cross-functional and specific, some may however be shared between multiple statistical operations. This is the case for some variables and code lists that are pooled for multiple surveys.
Metadata for a statistical operation can also be distinguished by the time when they are collected. They are not “collected” in one step; they are instead inputted by the producer throughout the life cycle of an operation. Consequently:
- When specifying needs, the producer may detail the objectives of the planned statistical operation, its position in the information system and its main characteristics, such as sectoral coverage, the date and method of collection or geographic coverage. Where applicable, the producer may also request the creation of concepts during this step.
- When designing and building collection tools, the producer specifies the variables that he/she would like to collect (e.g. turnover/sales, staff numbers, etc.) and any relevant organisation or procedures (e.g. code lists). For this step, the designer may reuse existing concepts or classifications.
- During collection, processing and analysis, the producer may use metadata that are of use in qualifying statistics that emerge from the process, such as the number of units surveyed, response rate, procedures for control, imputation, adjustment and non-response.
- During the dissemination phase, the producer may refine and expand the sections of the metadata that relate to the publications in question.
After following these steps, the producer has gradually built up the range of metadata that can be used to understand the output. They can then be used in quality reports for Eurostat, or on the INSEE website.
... Standardised, accessible and re-usable ...
For metadata to act as mediator between figures and their users, they must not only exist but also be as easily accessible as the figures themselves. As we have seen, information linked to statistics is sometimes incorporated within notes and commentary. Such information cannot be used in isolation; accessing it requires human interpretation. However, most statistical output is not accompanied by specific commentary. It is communicated through vast data files made available on the INSEE website or sent directly from machine to machine using interfaces. Metadata must therefore be easily accessible, visible, standardised, consistent over time and suited to each transmission method. Accessibility takes on two aspects:
- A shared model for describing metadata: given the volume and range of metadata, it is necessary to model their description to facilitate dialogue between producers and users. A shared model would indicate to users where to find a given piece of information. For the hundred or so metadata that might describe statistical operations, a Europe-wide standard was introduced in 2015, arranging them into approximately twenty groups, such as “Unit of measure”, “Reference period”, “Frequency of dissemination”, “Cost and burden”, “Statistical processing”, etc.
- Standardised formats and language: adopting standards for metadata description ensures that they can be reused, whether to search for and select variables and classifications or to input into client applications. It is essential for linkage between various sources. For example, for surveys, an international consortium defined a technical documentation standard (DDI, or Data Documentation Initiative, launched in 1995). Standardising this documentation involved modelling a range of statistical items (e.g. questions, questionnaires, variables, code lists, etc.) and their relationships, all in the form of XML documents. Standardisation began as conceptual, insofar as each metadata type has a well-defined position within the conceptual metadata descriptive model. But it should also include a technical format for exchange between systems. Conceptual and technical standardisation favour reuse via systems, using APIs (Application Programming Interfaces). DDI is not the only example: the SDMX (Statistical Data and Metadata eXchange) is supported by international bodies, including Eurostat, in order to promote statistical data and metadata transmission and exchange internationally. These standards have been widely adopted by national statistics institutes.
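To make the idea of XML-based documentation more concrete, the following minimal sketch serialises one variable and its code list as a small XML document, then parses it back as a client application might. It uses only Python's standard library. The element and attribute names (`MetadataDocument`, `Variable`, `CodeList`, `Code`, `codeListRef`) and the identifiers are simplified illustrations invented for this example; they do not follow the actual DDI or SDMX schemas.

```python
import xml.etree.ElementTree as ET


def build_metadata_document(variable_id, label, code_list_id, codes):
    """Return an XML string describing one variable and its code list.

    Element names are illustrative only; this is not the real DDI schema.
    """
    root = ET.Element("MetadataDocument")

    # The code list is described once and referenced by its identifier.
    code_list = ET.SubElement(root, "CodeList", id=code_list_id)
    for value, code_label in codes:
        code = ET.SubElement(code_list, "Code", value=value)
        code.text = code_label

    # The variable points to the code list instead of copying its content.
    variable = ET.SubElement(
        root, "Variable", id=variable_id, codeListRef=code_list_id
    )
    ET.SubElement(variable, "Label").text = label

    return ET.tostring(root, encoding="unicode")


xml_text = build_metadata_document(
    variable_id="VAR_INTERNET_ACCESS",  # hypothetical identifier
    label="Does the enterprise have internet access?",
    code_list_id="CL_YES_NO",  # hypothetical identifier
    codes=[("1", "Yes"), ("2", "No")],
)

# A client application can parse the document and resolve the reference.
parsed = ET.fromstring(xml_text)
variable = parsed.find("Variable")
referenced = parsed.find(f"CodeList[@id='{variable.get('codeListRef')}']")
labels = [code.text for code in referenced.findall("Code")]
```

Because the variable refers to the code list by identifier, the same code list could be shared by several variables or surveys, which is the kind of reuse that standardised relationships between items are meant to allow.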
... Centralised within a repository: RMéS
As mentioned above, statistical metadata are essential to understanding the statistical output that they describe. By standardising metadata, they can be accessed and reused easily. The producer of a single statistical operation need not look any further. But for an organisation such as INSEE, which handles a large number of statistical operations, centralisation is necessary. It ensures uniformity in the representation of cross-functional metadata and the consistency of information provided, using identification rules.
Management of “ownership rights” involves designating a manager responsible for updating each item of metadata. This characteristic is particularly helpful for cross-functional metadata such as classifications and concepts. So whilst multiple services use the same item of metadata, only one (the “owner”) is authorised to modify it. For example, the Tourism Unit in the INSEE Occitanie Regional Office “owns” the concept of “overnight stay”. Cross-functional data may be updated once by the “owner” for all operations that use them. Centralisation and standardisation of metadata provide the opportunity for further pooling, such as the use of the same management systems, in order to create, modify and input non-specific metadata for any statistical operation.
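The ownership rule can be illustrated with a minimal in-memory sketch. The class names, identifiers and placeholder definitions below are hypothetical and are not taken from RMéS itself; the sketch only shows the principle that every unit may read a shared concept while a single owner may modify it.

```python
from dataclasses import dataclass


@dataclass
class Concept:
    identifier: str
    definition: str
    owner: str  # the single unit allowed to modify this concept


class ConceptRepository:
    """In-memory stand-in for a centralised repository (hypothetical)."""

    def __init__(self):
        self._concepts = {}

    def register(self, concept):
        # Identification rule: one identifier designates one concept.
        if concept.identifier in self._concepts:
            raise ValueError(f"identifier already used: {concept.identifier}")
        self._concepts[concept.identifier] = concept

    def read(self, identifier):
        # Any unit may read a shared concept.
        return self._concepts[identifier]

    def update_definition(self, identifier, new_definition, requested_by):
        concept = self._concepts[identifier]
        # Only the owner may modify; all users then see the same update.
        if requested_by != concept.owner:
            raise PermissionError(
                f"{requested_by} does not own concept {identifier}"
            )
        concept.definition = new_definition


repo = ConceptRepository()
repo.register(
    Concept("overnight-stay", "Initial definition (placeholder).", "Tourism Unit")
)
repo.update_definition(
    "overnight-stay", "Revised definition (placeholder).", requested_by="Tourism Unit"
)

try:
    repo.update_definition(
        "overnight-stay", "Unauthorised change.", requested_by="Another unit"
    )
    rejected = False
except PermissionError:
    rejected = True
```

In this sketch the update made by the owner is immediately visible to every reader, whereas the attempt by another unit is refused and leaves the definition unchanged.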
This centralisation function (“single point entry”) for statistical metadata, formerly provided by the Structured Data Documentation System (DDS; Box 2) at INSEE, has since been replaced by the Statistical Metadata Repository (RMéS). One leading expert in metadata repositories is Bo Sundgren, Swedish mathematician and Professor of information processing at the Stockholm School of Economics between 1987 and 2005. This repository is primarily a single location for recording items, with shared administration rules. Furthermore, INSEE has deliberately incorporated the aforementioned international standards for statistical metadata description within this repository to facilitate their reuse (see previous point) and pave the way for exchange with other statistics institutes, Ministerial Statistical Departments, and national or international partner organisations.
An existing wealth of content
First deployed in 2018, RMéS has taken over cross-functional metadata previously contained in DDS. It contains approximately 1,200 concepts (e.g. turnover, margin rate, poverty rate, disposable household income, living zone, etc.). Each concept has an identified “owner” within the Official Statistical System, responsible for the relevant definition. RMéS also contains six classification “series”. Classifications are designed to structure statistical production: they are references, validated by expert groups, that are used to classify economic and social information. For instance, INSEE plays a central role in defining and updating the classification of professions and socio-professional groups (Professions et Catégories Socioprofessionnelles – PCS). INSEE also administers the classifications for activities and products in France (Nomenclature d’activités française – NAF and Classification des produits française – CPF), in line with equivalent EU and international classifications.
RMéS has gradually incorporated metadata more specific to certain statistical systems but which can nonetheless be shared. These include variables, code lists and the survey questionnaire (or questionnaires, in the case of multimodal collection). For example, the Labour force survey in 2018 consisted of 1,200 variables with a description in RMéS.
All metadata are related to statistical operations, including surveys, systems for incorporating administrative data (e.g. the 2018 survey of hotel attendance, the 2015 Annual Declaration of Social Data), or other types of operations that use new data sources such as cash register data.
RMéS features approximately 130 series of statistical operations, which may be dated. They are classified based on a tree structure for families/series/operations which is used to organise the repository. For example, the survey on usage of information and communication technologies and e-Commerce in enterprises with less than 10 persons employed (TIC – TPE) in 2016, referred to as the “ICT-microenterprise 2016”, is an operation linked to the “ICT-microenterprise” series which, together with the “ICT enterprises” and “ICT households” series, make up the “ICT” family.
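The families/series/operations tree can be pictured as a nested structure. The sketch below encodes only the ICT example given above; the helper function names are hypothetical, and the series whose operations are not named in the text are left empty rather than filled in.

```python
# family -> series -> list of operations, using the ICT example from the text
repository_tree = {
    "ICT": {
        "ICT-microenterprise": ["ICT-microenterprise 2016"],
        "ICT enterprises": [],  # operations not listed in the text
        "ICT households": [],   # operations not listed in the text
    }
}


def locate_operation(tree, operation_name):
    """Return the (family, series) an operation belongs to, or None."""
    for family, series_map in tree.items():
        for series, operations in series_map.items():
            if operation_name in operations:
                return family, series
    return None


def list_series(tree, family):
    """Return the names of the series of a family, in alphabetical order."""
    return sorted(tree[family])


location = locate_operation(repository_tree, "ICT-microenterprise 2016")
ict_series = list_series(repository_tree, "ICT")
```

Walking the tree from an operation up to its series and family is what allows metadata defined at a higher level to be found from any dated operation.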
The repository was set up using metadata from DDS. It has yet to be supplemented by metadata linked to new operations or which were not included in DDS. At the same time, a comprehensive review of metadata quality and governance must be undertaken to ensure that the repository continues to be appropriate and viable over the long term.
Repository service offerings
RMéS is not a stand-alone database. For it to function effectively, it must have the management tools to facilitate updates and ongoing access. It is an environment (e.g. management applications, services) that has been gradually implemented around the repository (Figure 1). As a result, producers can now input metadata related to their operations only once to enable a range of uses. Downstream, the repository’s most basic function is to provide information that it contains to a range of users (human or machine) in a variety of formats.
Among the first users are the producers themselves, who need the statistical metadata for their operation in order to automatically generate the documents required by the production process in official statistics, including requests for an opportunity statement from the National Council for Statistical Information (CNIS) and submissions to the Committee for the National Statistical Quality Label (these functions are planned for RMéS). Producers may also contribute to the dictionary of variables using the corresponding metadata that they have previously inputted in RMéS.
As statistical metadata can be inputted throughout the life of an operation, when complete RMéS will contain all information required to produce a quality report for bodies such as Eurostat in the case of European statistics, or any type of documentation regarding the operation. The repository therefore seeks to record dissemination files, in particular microdata supplied by the Secure Access Data Center (Centre d’accès sécurisé aux données – CASD). The portal used to access files held by the CASD replicates the structure adopted by RMéS.
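As a rough illustration of how a report could be assembled once the metadata are complete, the sketch below checks which sections of a report are still empty and only builds the text when none are missing. The section names are the five groups quoted earlier from the Europe-wide standard; the function names, the operation name and all values are placeholders invented for this example, and the real standard contains many more groups.

```python
REPORT_SECTIONS = [
    "Unit of measure",
    "Reference period",
    "Frequency of dissemination",
    "Cost and burden",
    "Statistical processing",
]


def missing_sections(operation_metadata, sections=REPORT_SECTIONS):
    """Return the sections that are absent or blank for an operation."""
    return [s for s in sections if not operation_metadata.get(s, "").strip()]


def build_report(operation_name, operation_metadata, sections=REPORT_SECTIONS):
    """Assemble a plain-text report, refusing to do so if sections are missing."""
    missing = missing_sections(operation_metadata, sections)
    if missing:
        raise ValueError(f"missing sections: {', '.join(missing)}")
    lines = [f"Quality report: {operation_name}"]
    for section in sections:
        lines.append(f"{section}: {operation_metadata[section]}")
    return "\n".join(lines)


# Metadata are inputted progressively over the life of the operation.
metadata = {
    "Unit of measure": "placeholder",
    "Reference period": "placeholder",
}
still_missing = missing_sections(metadata)

metadata.update(
    {
        "Frequency of dissemination": "placeholder",
        "Cost and burden": "placeholder",
        "Statistical processing": "placeholder",
    }
)
report = build_report("Example operation", metadata)
```

The same stored metadata could feed other documents by changing only the list of sections, which is the practical meaning of inputting information once for several uses.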
By providing information for the “Definitions, methods and quality” page on the INSEE website, RMéS helps to ensure widespread availability of statistical metadata, in accordance with the European Statistics Code of Practice. Information is supplied automatically to ensure coordination in publication of data or reports as well as documentation of the statistical operation in question. A specific section is dedicated to concepts and classifications. The “Data sources and indicators” section offers access to descriptions of operations and where applicable an image of the survey questionnaire.
More generally, the repository can supply any system that uses statistical metadata, provided that suitable interfaces have been developed. For example, it already provides the classification of activities via an API released alongside the Sirene API on INSEE’s API portal. Other client applications may “connect” to the repository to retrieve cross-functional (e.g. classifications and concepts) or operation-specific metadata. By preventing duplication and multiple separate operations, this connection favours consistency and facilitates updates to the Statistical Information System.
To perfect statistical metadata storage and extraction processes, a repository consultation portal is set to be developed.
Towards active metadata
The use of repository metadata can extend beyond their function of describing and helping to understand, with access via a dedicated interface or client applications. New tools therefore aim to exploit the exhaustive and standardised nature of metadata to automatically generate components of the statistical process. They take on a new role to some extent, no longer simply “information to facilitate understanding of statistics” but becoming “data used in the production process”; hence the notion of active metadata.
Figure 1. The RMéS statistical metadata repository
With this approach, adopted in advance of the survey process, a graphical questionnaire design interface (known as “Pogues”) has been developed to allow survey designers to build a questionnaire that complies with international standards, without the need to be familiar with the technical properties of the formats. Simply put, survey designers define the structure of their questionnaire in terms of series and modules. They input the questions that they would like to ask survey units, as well as the response format. Using the Pogues application, they can generate on request the relevant metadata in accordance with standards (e.g. variables, code lists and, where applicable, classifications). A second tool, “Eno”, is used to generate collection formats (e.g. PDF questionnaires, online forms) and to show the designer the format produced in a transparent manner. The generator produces formats that adopt a standard template with appropriate formatting. Formats may then be stored on a dedicated platform (e.g. Coltrane, a collection platform for business surveys) and collection can then begin. In this example, the collection format generator “activates” the metadata.
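The principle of a generator that “activates” metadata can be sketched in a few lines. The example below is not the Pogues or Eno implementation: the dictionary layout, the variable names and the plain-text output format are assumptions made for illustration. It shows a questionnaire described purely as metadata, from which both a collection format and the list of variables are derived.

```python
# A questionnaire described as metadata (hypothetical structure and names).
questionnaire = {
    "title": "Example survey",
    "modules": [
        {
            "name": "Internet use",
            "questions": [
                {
                    "variable": "INTERNET_ACCESS",
                    "text": "Does the enterprise have internet access?",
                    "response": {"type": "code_list", "ref": "CL_YES_NO"},
                },
                {
                    "variable": "STAFF_COUNT",
                    "text": "How many persons are employed?",
                    "response": {"type": "integer"},
                },
            ],
        }
    ],
}
code_lists = {"CL_YES_NO": [("1", "Yes"), ("2", "No")]}


def render_text_form(questionnaire, code_lists):
    """Generate a plain-text collection format from the metadata."""
    lines = [questionnaire["title"]]
    number = 0
    for module in questionnaire["modules"]:
        lines.append(f"== {module['name']} ==")
        for question in module["questions"]:
            number += 1
            lines.append(f"{number}. {question['text']} [{question['variable']}]")
            response = question["response"]
            if response["type"] == "code_list":
                # Closed question: one tick box per code of the code list.
                for value, label in code_lists[response["ref"]]:
                    lines.append(f"   ( ) {value} - {label}")
            else:
                # Open question: a blank field annotated with its type.
                lines.append(f"   ____ ({response['type']})")
    return "\n".join(lines)


def extract_variables(questionnaire):
    """Derive the list of collected variables from the same description."""
    return [
        question["variable"]
        for module in questionnaire["modules"]
        for question in module["questions"]
    ]


form = render_text_form(questionnaire, code_lists)
variables = extract_variables(questionnaire)
```

Since the form and the variable list come from a single description, a change to a question or code list is made once and carried through to every generated product.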
This questionnaire design interface using statistical metadata shows that this approach can produce instant benefits for designers. It is a shared benefit: quality improvements through more consistent, traceable and transparent metadata, as well as the benefits of pooling. Such an approach contributes to efficiency gains and higher quality more generally. Work is ongoing to determine the extent to which the principle of active metadata could be used to describe controls or specify dissemination products.
One repository for the whole Official Statistical System?
Official Statistics Services are generally placing a focus on metadata. Ministerial Statistical Departments already have read access, along with all internet users, to the warehouse that hosts concepts and classifications, and the repository can be viewed on the INSEE website. INSEE and Ministerial Statistical Departments work closely to harmonise shared concepts and validate shared operation documentation.
However, the more ambitious long-term aim is to extend write access to Ministerial Statistical Departments, for a number of reasons:
- Ministerial Statistical Departments own certain concepts and their role thereby entails validation of information added by the INSEE Quality Unit to the repository;
- They may also manage specific classifications in their area of expertise but which are used more widely within the Official Statistical System (e.g. classification of offences, occupational groups (“Familles professionnelles”), etc.);
- Some may want to give a description of their sources in the repository, from which EU-format quality reports can be extracted;
- Lastly, some would like to have access to the application used to define questionnaires used in surveys.
Identifying the needs of Ministerial Statistical Departments is now the subject of ongoing discussion, in order to determine what responses could be given and the related framework. This involves increasing the use of metadata as part of the Official Statistical System’s quality strategy. In this context, and based on close cooperation for which arrangements are as yet undefined, RMéS is expected to play a key role over the coming years.
Box 2. RMéS has replaced the Structured Data Documentation system (DDS)
INSEE has placed a focus on metadata for some time. In the 1980s, INSEE had designed an innovative Dictionary of Statistical Data, which was updated in the early 2000s to become the Structured Data Documentation System (Dispositif de Documentation Structurée – DDS). This system has played a vital role in managing and storing documentation from a wide range of statistical operations. DDS was also used in data feeds to the INSEE website and to produce the dictionary of codes, which is of value to producers and research analysts. It did not however impose one single restrictive model, hence a plethora of models that impeded pooling. As DDS hosted a large number of individual databases, the lack of interlinkage led to greater risk of inconsistencies. The emergence of international standards, combined with the growth of exchanges between organisations, led to an overhaul of the system.
Published on: 27/06/2019
The SIMS (Single Integrated Metadata Structure) was adopted in November 2015 by the European Statistical System Committee (ESSC). It will be gradually phased into regulations and will be a requirement in quality reporting structures. It will facilitate comparisons of data produced in accordance with EU regulations.
Application programming interfaces that enable a software application to provide services to another software application.
By delegation of the project manager, the Department of Sectoral Economic Outlooks within the Business Statistics Directorate.
The adoption of the EU standard for statistical operation description in 2015 highlighted the “once for all purposes” principle.
The CASD is a GIP (groupement d’intérêt public, or public interest group) that provides researchers with a secure tool to access highly detailed individual data once approved by the statistical secrecy committee, and therefore most often subject to additional security restrictions.
In this case the DDI standard, as already mentioned.
Further reading
COTTON, Franck and DUFFES, Guillaume, 2010. SDMX : Un standard pour l’échange de données et de métadonnées statistiques. INSEE internal document.
COTTON, Franck, MARTIN, Mélanie and TAILHURAT, Romain, 2018. Report on the implementation of three statistical services – ESSnet SCFE (Sharing common functionalities in the ESS) – Deliverable D3-1 [online]. [Accessed 8 April 2019].
CROSNIER, Dominique, 2000. Le nouveau DDS de l’Insee. In: Courrier des statistiques [online]. March 2000. No. 93, pp. 10-17. [Accessed 24 May 2019].
DESROSIERES, Alain, 2008. Les catégories socioprofessionnelles. In: Courrier des statistiques [online]. November 2008. No. 125, pp. 13-15. [Accessed 24 May 2019].
DUBOIS, Thomas and KEROUANTON, Marie-Hélène, 2018. The French statistical Metadata Repository, RMéS : managing metadata throughout the whole statistical process. In: European conference on quality in official statistics, Krakow [online]. [Accessed 8 April 2019].
DUFFES, Guillaume, 2014. Use of standards at Insee. INSEE internal document: note for the Workshop of the Modernisation Committee on Standards : International Collaboration for Standards-Based Modernisation, Geneva, Switzerland, 5 – 7 May 2015.
EUROSTAT, 2015. ESS handbook for quality reports. In: Theme 1 : General and regional statistics [online]. 2014 edition, Luxembourg, Publications Office of the European Union, Manuals and guidelines. [Accessed 8 April 2019].
GREGORY, Arofan, 2011. The Data Documentation Initiative (DDI) : An Introduction for National Statistical Institutes. In: Open Data Foundation [online]. July 2011. [Accessed 8 April 2019].
GREGORY, Arofan and HEUS, Pascal, 2007. DDI and SDMX : Complementary, Not Competing, Standards, Version 1.0. In: Open Data Foundation [online]. [Accessed 8 April 2019].
GUIBERT, Bernard, LAGANIER, Jean and VOLLE, Michel, 1971. Essai sur les nomenclatures industrielles. In: Économie et statistique [online]. February 1971. No. 20, pp. 23-36. [Accessed 24 May 2019].
POULAIN, Claude, 1983. Le dictionnaire des données d’une production statistique. In: Courrier des statistiques [online]. July 1983. No. 27, pp. 25-29. [Accessed 24 May 2019].
ROUPPERT, Benoît and KEROUANTON, Marie-Hélène, 2014. Projet de Référentiel de Métadonnées Statistiques. INSEE internal document: note for the investment committee, No. 181/ DG75/MHK.