Courrier des statistiques N9 - 2023

This ninth issue of the Courrier des statistiques is marked by a number of highly technical papers and by topics that are unusual for the review.
It all begins with a story: the history of Official Statistics, from the angle of democratic debate, in the 40 years since the creation of INSEE.
In order to contribute to the public debate, INSEE has recently innovated with the introduction of "distributional national accounts", which make it possible to better analyse the distribution of growth and its impact on household incomes. The second paper explains the principles, mechanisms and prospects.
The theme then changes, with two papers on data confidentiality. One sets out the legal framework, the risks associated with breaches of confidentiality and the subtleties of applying statistical confidentiality in a changing context. The other, more operational, explains the rationale behind the "non‑significant statistical code" (CSNS) and how it facilitates the matching of different sources while ensuring the protection of individual data.
The last three papers deal with related topics that are important in a "world of data". We begin with data formats, a topic that is not often discussed, but one that statistics cannot ignore. Choosing and managing the right formats is essential when statisticians use third‑party data sources. The paper on the integration of administrative data reveals an automated processing pipeline driven by metadata, a prerequisite for more traditional statistical production. Finally, the CNAV (National Old‑Age Pension Fund) explains the importance of formalised and documented exchange standards automatically generating control tools for better monitoring of data quality in the field of social security.

Published on: 29/10/2024
Courrier des statistiques, October 2024

The Non-Significant Statistical Code (CSNS): a service to facilitate file matching

Yves-Laurent Bénichou, Lead Data Scientist, INSEE innovation lab, INSEE, Lionel Espinasse, Deputy Director of the Demography Department, DSDS, INSEE, and Séverine Gilles, at the time of writing, Statistical Project Manager, DSDS, INSEE

File matching considerably increases the possibilities of studying economic and social phenomena. The Non-Significant Statistical Code (CSNS) was defined by the Law for a Digital Republic of 2016 in order to allow the implementation of file matching for statistical purposes without using the NIR (or national insurance number), thus ensuring a high level of personal data protection. The general principle is to use a matching key calculated from an irreversible encryption of the national insurance number. This new service offered by INSEE to official statistical services applies to a wide variety of administrative or survey files. An innovative method has been developed to identify people on the basis of their identity. At the end of the process, the reliability of the identification is measured by quality indicators.

The first uses are promising. For example, the CSNS can contribute to the analysis of the integration of young graduates by facilitating the matching of data from the education system and the Ministry of Labour. It also helps to measure the impact of the ecological transition according to household categories by matching vehicle register data and income information.

With the development of efficient information systems in many sectors, file matching has become a very powerful means of enriching data. It allows various types of information from different universes to be linked and is less expensive to implement than field surveys. The Non-Significant Statistical Code (Code statistique non signifiant, CSNS) was defined by the 2016 Law for a Digital Republic to allow such matching to take place while maintaining confidentiality by limiting the use of the NIR.

Two examples illustrate the improvements brought by matching. First, to understand what becomes of higher education students after completing their studies, the relevant data to cross-reference are education data, held by the Ministry of Higher Education, and employment data, held by the Ministry of Labour. Second, in a different field, comparing data on the motor vehicle fleet and its energy consumption with the income of vehicle owners makes it possible to determine the impact of fuel price increases on different population categories. By linking information collected by different bodies, file matching therefore considerably increases the possibilities for studying economic and social phenomena.

Matching two files relies on an identifier that is common to both. The records pertaining to the same individuals can then be linked and the information concerning them from the two files pooled. Ideally, the identifier is formal in nature, is held within a register and makes it possible to distinguish between individuals in a unique and unambiguous manner. The NIR, i.e. the identification number in the National Register for the Identification of Individuals (RNIPP), commonly referred to as the “national social insurance number”, is a very good example of an identifier (Espinasse and Roux, 2022).

However, certain rules must also be respected, particularly with regard to citizens’ expectations concerning the protection of their personal data. Prior to 2018, in order to meet such expectations, each instance of file matching based on the NIR for the purposes of Official Statistics required, in particular, a decree of the Council of State issued on the basis of a reasoned and published opinion of the National Commission for Information Technology and Civil Liberties (CNIL). In view of the growing demand for matched data for the purposes of designing and evaluating public policies, it became necessary to consider a process that is simpler from a legal standpoint and that respects the principle of data minimisation and the special status conferred on the NIR by the Data Protection Act (Law on data processing, data files and individual liberties). The idea was therefore put forward to no longer use the NIR directly as a matching identifier, but instead to use a non-significant identifier, the “hashed NIR”, which retains the technical properties of an identifier while making it impossible to link it directly back to a person’s identity.

A legislative innovation for a digital Republic: the CSNS, a service offered by INSEE

In this regard, a specific system has been put in place with the creation of the Non-Significant Statistical Code (CSNS). More than just a simple code, the CSNS is also an INSEE service aimed at facilitating the matching of files on individuals within the Official Statistical Service (service statistique public, SSP), while guaranteeing a higher level of protection of personal data than when using the NIR.

The Law for a Digital Republic specifies that institutions belonging to the Official Statistical Service may benefit from the CSNS. Therefore, the service provided by INSEE is also aimed at Ministerial Statistical Offices (MSOs). Matching can be performed between files from two MSOs, one MSO and INSEE, or two INSEE units. These configurations can easily be extended to three or four partners.

INSEE’s service includes a technical dimension for the purposes of identifying individuals and ensuring the irreversible encryption of the NIR, and an organisational dimension with the provision of an application allowing users to submit their requests and retrieve their results.

Any processing carried out is recorded in the processing activities register provided for in Article 30 of the General Data Protection Regulation (GDPR) and must be published. Each time data is processed, the National Council for Statistical Information (Conseil national de l’information statistique, CNIS) must also be informed so that it can evaluate this new system. The list of processing activities is published on the CNIS website.

The principle of data minimisation is also subject to specific scrutiny: only the information strictly necessary for CSNS calculations is exchanged between INSEE’s partner organisations, with INSEE subsequently destroying the data entrusted to it as soon as the calculation in question is complete.

Encrypting the national social insurance number: the CSNS as a code

The CSNS is intended to serve a number of purposes in various fields. Rather than limiting its use to the matching of a few specific files, it must be possible for it to be used by all Ministerial Statistical Offices for all their data sources. It therefore relies on a population register that covers the entire population living in France, the National Register for the Identification of Individuals (RNIPP), which collates all national social insurance numbers.

The general principle of the CSNS is that it irreversibly encrypts national social insurance numbers (NIR) to create a matching key, which ensures that it is impossible to individually identify data subjects.

As each person has a unique NIR, the CSNS calculation will always give the same result, regardless of which file it is applied to, and therefore allows for matching without the need to know the identity of the persons in question.

Lastly, the calculation of the CSNS relies on a fully automated process. This principle was laid down during the design phase of the project. Owners of files have access to a dedicated application and can independently submit their requests and retrieve their results. They receive CSNS files produced without any human intervention, including for the handling of even the most complex cases (see below). Such automation not only offers the advantage of making the cost of providing this service manageable for INSEE, thereby allowing it to provide this service free of charge, but also saves on processing time. However, it does require standardisation of the process of identifying people and the process of evaluating the quality of the results. It also means that the requesters are responsible for performing the data preparation work, which varies depending on the specific features of each file.

Two options for adapting to needs

The wealth and quality of information regarding the identity of individuals varies depending on the files to be matched. Some already contain the NIR, while others do not, but instead collate identity traits, such as last name, first name(s), sex and date and place of birth. Two options are therefore offered (figure 1). The CSNS can either be calculated on the basis of the NIR or on the basis of identity traits.

Figure 1 - How is the CSNS obtained?

 


Calculations based on the NIR simply require irreversible encryption of the NIR. Calculations based on identity traits are more common, and require an additional step. The NIRs of the individuals in question must first be found using their identity traits. This is referred to as the “identification step”. It is then simply a case of encrypting the NIRs found. This latter option is more complex to implement and returns results with varying reliability depending on the quality of the identity data entered.

It works as follows: each file owner submits the NIRs or identity traits of the individuals in their file to INSEE’s CSNS service.

They are transmitted in a secure manner via a dedicated application that is only accessible to requesters approved by the Official Statistical Service. In return, each requester receives a CSNS calculated for each of the individuals within their file (box 1). The CSNS will always be the same for the same individual, regardless of the source in which this individual appears and regardless of the year in which the CSNS is calculated. Using this shared matching key, partners are then able to match their files.
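As a minimal illustration of this last step, the sketch below joins two small files on their CSNS column. The file names, field names and CSNS values are invented for the example; in practice the CSNS values would be those returned by the service to each file owner.

```python
# Hypothetical extracts: each owner holds its own variables plus the CSNS
# returned by the service (the CSNS values below are made up).
education_file = [
    {"csns": "a3f1", "diploma": "master"},
    {"csns": "9c2e", "diploma": "bachelor"},
    {"csns": "77b0", "diploma": "doctorate"},
]
employment_file = [
    {"csns": "9c2e", "contract": "permanent"},
    {"csns": "a3f1", "contract": "fixed-term"},
    {"csns": "e410", "contract": "permanent"},
]


def match_on_csns(left, right):
    """Inner join of two lists of records on their 'csns' field."""
    right_by_key = {row["csns"]: row for row in right}
    matched = []
    for row in left:
        other = right_by_key.get(row["csns"])
        if other is not None:
            # Merge the variables of both files for the same individual.
            matched.append({**row, **other})
    return matched


matched_rows = match_on_csns(education_file, employment_file)
for row in matched_rows:
    print(row)
```

Neither file needs to contain a NIR or any identity trait for the join to work: the CSNS alone links the records.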

Box 1: The CSNS in practice

The service that calculates the CSNS on the basis of the NIR has been available since October 2021, while the service offering identification based on identity traits was launched in October 2022. It is reserved for use by the Official Statistical Service.

To gain access, each body of the Official Statistical Service must first sign a subcontracting agreement with INSEE, which sets out the rights and obligations of each party and indicates the operating principles of the process. This agreement is valid for 5 years. Once it is signed, the body is free to make as many requests as it wishes. The service is provided free of charge.

However, the CSNS calculation phase is just one of the phases of the matching process. It forms part of a more general approach in which two file owners match their data, make use of the result of that matching and disseminate it. In its capacity as CSNS operator, INSEE offers a subcontracting service to these file owners for the production of a shared matching key, but is not involved in any other aspects.

In particular, at least one of the file owners must act as controller within the meaning of the GDPR and must perform all the necessary steps in this regard, in particular the production of a Data Protection Impact Assessment (DPIA) if necessary.

CSNS processing is therefore usually a six-step process:

  • establishment of an agreement between the two file owners, setting out the conditions under which their data will be matched and the use of those data;
  • declaration of processing by the controller(s);
  • recording of the processing in the work programme submitted to the CNIS;
  • request from each owner to INSEE for a CSNS calculation and return of the result of the calculation, together with quality indicators, from INSEE to each owner;
  • matching of data by the file owners according to the conditions they have defined in their agreement;
  • retention by each owner of their data in accordance with the rules set out in the legislation and reiterated in the subcontracting agreement.

The practical means of submitting a CSNS calculation request are simple. Once the authorisations have been verified, the requester submits their NIR or identity traits file via a dedicated online application. They then receive the CSNS in return, together with their quality indicators. Each requester can repeat their requests as many times as they wish, particularly if they make improvements to their input file.

The encryption technique used complies with stringent security requirements and takes place in two stages. The first stage in the calculation of the CSNS is to hash the NIR by running it through a program that irreversibly converts the NIR into a string of non-significant characters. The second is to encrypt this “hashed NIR” using a secret key. This two-stage operation ensures that it is impossible to trace the CSNS back to the NIR. Even if the encryption key were to be revealed, it would only allow the hashed NIR to be decrypted, not the NIR itself.
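The two-stage principle can be sketched as follows. The source does not name the algorithms used, so everything here is an assumption made for illustration: SHA-256 stands in for the hashing step, and a toy XOR keystream stands in for the keyed encryption step (it is not secure and is used only because it is deterministic and reversible). The function names, the key and the NIR are fictitious.

```python
import hashlib

SECRET_KEY = b"demo-key-not-for-real-use"  # fictitious key, for illustration only


def hash_nir(nir: str) -> bytes:
    """Stage 1: irreversible hash of the NIR (SHA-256 is an assumption)."""
    return hashlib.sha256(nir.encode("utf-8")).digest()


def _keystream(secret_key: bytes, length: int) -> bytes:
    """Derive a deterministic keystream from the key (toy construction)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out.extend(hashlib.sha256(secret_key + counter.to_bytes(4, "big")).digest())
        counter += 1
    return bytes(out[:length])


def toy_encrypt(data: bytes, secret_key: bytes) -> bytes:
    """Stage 2 stand-in: XOR with the keystream. NOT a secure cipher.

    XOR is its own inverse, so the same function also decrypts.
    """
    stream = _keystream(secret_key, len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


def compute_csns(nir: str, secret_key: bytes) -> str:
    """Hash the NIR, then encrypt the hash with the secret key."""
    return toy_encrypt(hash_nir(nir), secret_key).hex()


fictitious_nir = "1900575056001"
csns = compute_csns(fictitious_nir, SECRET_KEY)

# Someone holding the key can undo stage 2 only: they recover the hash,
# never the NIR itself.
recovered = toy_encrypt(bytes.fromhex(csns), SECRET_KEY)
print(recovered == hash_nir(fictitious_nir))
```

Because both stages are deterministic, the same NIR always yields the same CSNS, which is what makes the code usable as a matching key.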

After matching, the CSNS must be retained by each owner in a secure and isolated manner in a file that does not include any socio-demographic variables, NIRs or identity traits. The only element allowing for the CSNS to be linked back to the individual in the future for the purposes of subsequent matching is a serial number. In addition, the retention period must be proportionate to current and potential future needs.

If the requester does not have the NIR, but only identity traits, an identification stage is required before the NIR hashing and encryption process.

Retrieving NIRs based on identity traits: the identification stage

The process of identifying the NIR based on identity traits involves the use of an engine specifically developed for calculating the CSNS. There is already a process for identifying individuals within the RNIPP, but this is used for administrative purposes (tax services, for example). There is no permitted margin for error in these configurations, which have administrative consequences for people’s lives. Identification is only confirmed if the match between the identity traits to be verified and the identity traits in the RNIPP is certain. As a result, this search for an exact match sometimes leads to a relatively high identification failure rate.

However, statistical needs are not the same as administrative needs. Out of a population of several thousands or several hundreds of thousands of people, a few identification errors would not have any significant impact on the final statistical results (and no impact on the individuals themselves).

It may therefore be in our best interests to accept a few approximations in the matching of identity traits if this makes it possible to increase the identification rate, subject to a small but acceptable margin for error. The challenge is to strike a balance between maximising identification and minimising errors. This is all the more important when the statistical sources used are surveys. In general, survey respondents are less careful when filling in their identity data for a survey than they would be when filling in an administrative form. It is particularly the case with the optical reading process for paper questionnaires, such as for the population census, which may add uncertainty regarding the quality of the identity data entered. That is why a specific identification engine has been developed for the CSNS.

This identification engine was built partly on the basis of matching theory (Christen, 2012; Fellegi et al., 2014), but also takes account of the specific nature of the population register used. The data preparation phase is the responsibility of the requesting body and not INSEE’s CSNS team, although some actions are necessary at the start of the calculation process in order to optimally adapt the data to the engine’s operation.

In addition, the population register’s very large volume (the RNIPP contains 130 million records) resulted in three steps being established for identifying a NIR with the aim of optimising processing times depending on the difficulty of the cases in question. In this regard, the principle adopted was to process simple cases with less time-consuming processes and to reserve complex processes for the cases that actually need them.

A sequence of steps, from the simplest to the most complex

In order to find a NIR, a sequence of three steps is established, each based on the various levels of complexity of the search (figure 2).

Figure 2 - (Fictitious) example of finding a NIR (national social insurance number)

 


The first step is an “exact” query. The identifying elements (last name, first names, date of birth, geographical code of the place of birth) are searched for in the RNIPP for exact matches. All of the first names must also be exact. During this step, identification follows a fundamental principle: it can only take place if only one echo, i.e. one record returned from the RNIPP, is a candidate for identification. Searches are also performed using previous names (where people have changed them) or using aliases (often a married name when listed in the RNIPP).

Then comes a second step known as “simple queries”. This allows for some “relaxed constraints” on certain variables. Here, again, only echoes without competitors are retained. Five successive relaxations of the identification constraints are implemented:

  • a simple relaxed constraint on first names: the identification elements are all identical except for first names, for which identification is permitted on just one rather than all;
  • a simple relaxed constraint on the geographical code of the place of birth: the identification elements are all identical except for the geographical code of the place of birth; for first names, the first of these must be exact;
  • a simple and total relaxed constraint on first names: the identification elements are all identical except for the first names;
  • a simple and total relaxed constraint on the last name: the identification elements are all identical except for the last name; for first names, the first of these must be exact. This relaxed constraint on the last name may appear surprising at first, since the last name would appear to be the determining element for an identity. However, tests have shown that many identification failures were caused by using a married name in the user files, while this may be unknown in the RNIPP (the married name is not a mandatory variable in the RNIPP). This query therefore makes it possible to identify many married women who declare their married name in surveys rather than their birth name;
  • identification by switching the last name and the first first name: the identification elements are all identical with the exception of the last name and first first name fields, which are switched for the identification search.
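The cascade of exact and simple queries can be sketched as follows, on a tiny in-memory register. This is an illustration under stated assumptions, not INSEE’s engine: the `Person` type, the step labels and the sample records are invented; the first relaxation is interpreted as “the first first name must match”; and a query for which a step returns zero or several echoes is simply passed on to the next step.

```python
from typing import NamedTuple, Optional, Tuple


class Person(NamedTuple):
    last_name: str
    first_names: Tuple[str, ...]
    birth_date: str   # e.g. "1990-05-12"
    birth_place: str  # geographical code of the place of birth
    nir: str = ""     # only filled in for register records


def _first(p: Person) -> str:
    return p.first_names[0] if p.first_names else ""


def _same_birth(q: Person, r: Person) -> bool:
    return q.birth_date == r.birth_date and q.birth_place == r.birth_place


# Ordered from the strictest query to the most relaxed one.
STEPS = [
    ("exact", lambda q, r: q.last_name == r.last_name
        and q.first_names == r.first_names and _same_birth(q, r)),
    ("first names relaxed", lambda q, r: q.last_name == r.last_name
        and _first(q) == _first(r) and _same_birth(q, r)),
    ("birth place relaxed", lambda q, r: q.last_name == r.last_name
        and _first(q) == _first(r) and q.birth_date == r.birth_date),
    ("first names ignored", lambda q, r: q.last_name == r.last_name
        and _same_birth(q, r)),
    ("last name ignored", lambda q, r: _first(q) == _first(r)
        and _same_birth(q, r)),
    ("names switched", lambda q, r: q.last_name == _first(r)
        and _first(q) == r.last_name and _same_birth(q, r)),
]


def identify(query: Person, register) -> Optional[Tuple[str, str]]:
    """Return (nir, step) for the first step yielding exactly one echo."""
    for step_name, matches in STEPS:
        echoes = [r for r in register if matches(query, r)]
        if len(echoes) == 1:  # only echoes without competitors are retained
            return echoes[0].nir, step_name
    return None  # left for the more complex processing


REGISTER = [  # fictitious records and placeholder NIRs
    Person("martin", ("justine", "marie"), "1990-05-12", "75056", "NIR-A"),
    Person("durand", ("paul",), "1985-03-02", "69123", "NIR-B"),
    Person("durand", ("paul",), "1985-03-02", "13055", "NIR-C"),
]

# A respondent who declared a married name instead of her birth name.
print(identify(Person("dupont", ("justine",), "1990-05-12", "75056"), REGISTER))
```

The two “durand” records show why the uniqueness rule matters: a query without a place of birth matches both of them at the “birth place relaxed” step, so no identification is made at this stage.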

These successive relaxations generally add many more records to the identifications made during the exact query stage. For the administrative files tested, in which identity traits are entered with high quality, these first two steps generally allow for the identification of more than 95% of individuals. Conversely, in the case of survey files, where the quality of the entry of identity traits is poorer, these two steps identify fewer individuals (80%).

Following these first two steps, the most complex cases still need to be identified. Some were deliberately excluded from the selection of previous queries, as their identity data are subject to a degree of caution: these are persons who have declared their date of birth as 1 January or 31 December, as well as those with a date of birth (day/month) of 00/00. Identifications resulting in a person with a different sex than the one declared are also excluded from the previous selections. The remaining complex cases are simply the people who were not identified during the previous steps.

An innovative method for the most complex cases

In order to process these cases, a third and final “approximate value” query allows for the constraints on several variables to be relaxed simultaneously. The person to be identified matches multiple echoes within the RNIPP and the aim is to choose the one that has the most characteristics in common with the person to be identified. The selection is based on a ranking of echoes, which is achieved by calculating a score. The score is a “points” total, with points being allocated based on matches between the RNIPP echo and the person to be identified.

These scores are first assigned to fully identical identity elements. For example, matches with the same geographical code for the place of birth are allocated 20 points; matches with the same day and month of birth are allocated 20 points; matches with the same last name are allocated 10 points; matches with the same first first name are allocated 10 points, etc.
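A minimal sketch of this points-based ranking is given below. Only the four weights quoted above come from the text; the other weights implied by “etc.” are not known and are left out. The field names, the helper functions and the sample echoes are hypothetical.

```python
# Weights quoted in the text; other criteria exist but are not specified.
WEIGHTS = {
    "birth_place": 20,
    "birth_day_month": 20,
    "last_name": 10,
    "first_first_name": 10,
}


def score(query: dict, echo: dict) -> int:
    """Sum the points of every field that is present and strictly identical."""
    total = 0
    for field, points in WEIGHTS.items():
        value = query.get(field)
        if value and value == echo.get(field):
            total += points
    return total


def rank_echoes(query: dict, echoes: list) -> list:
    """Return (score, echo) pairs, best score first."""
    return sorted(((score(query, e), e) for e in echoes),
                  key=lambda pair: pair[0], reverse=True)


query = {"birth_place": "75056", "birth_day_month": "12/05",
         "last_name": "martin", "first_first_name": "justine"}
echoes = [  # fictitious candidate records
    {"id": 1, "birth_place": "75056", "birth_day_month": "12/05",
     "last_name": "martin", "first_first_name": "justin"},
    {"id": 2, "birth_place": "92050", "birth_day_month": "12/05",
     "last_name": "martin", "first_first_name": "justine"},
    {"id": 3, "birth_place": "75056", "birth_day_month": "12/05",
     "last_name": "martins", "first_first_name": "justine"},
]

ranking = rank_echoes(query, echoes)
for points, echo in ranking:
    print(points, echo["id"])
```

In this example echoes 1 and 3 obtain the same total, which illustrates how strict equality on whole fields can leave candidates tied.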

However, this method alone cannot always differentiate properly between echoes, and tied scores may occur. In such cases, an additional criterion is added to narrow down the searches and calculate more detailed scores.

The second search criterion involves filtering the echoes from the RNIPP based on fragments of last names or first names that they have in common with the person to be identified. These fragments are strings of characters of varying length (between 3 and 5 characters). For example, the first name Justine is made up of five 3-character strings: jus – ust – sti – tin – ine. This technique, referred to as the “n-gram method”, makes it possible to effectively differentiate between the various echoes with elements in common with the identity traits of a person being sought. It also makes it possible to counteract some typing or spelling errors. Next, each matched string is allocated points, which are added to the score. These operations are performed using the Elasticsearch software (box 2).
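The n-gram decomposition itself is easy to reproduce. The sketch below splits a name into character n-grams and counts the fragments shared by two names; awarding one point per shared fragment is an assumption made for the example, since the actual points are computed within Elasticsearch and are not detailed here.

```python
def char_ngrams(text: str, n: int) -> list:
    """Return the consecutive n-character fragments of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def shared_ngram_points(a: str, b: str, sizes=(3, 4, 5), points_per_match=1) -> int:
    """Count distinct n-grams common to both strings, for each n-gram size.

    One point per shared fragment is a hypothetical weighting.
    """
    total = 0
    for n in sizes:
        common = set(char_ngrams(a, n)) & set(char_ngrams(b, n))
        total += points_per_match * len(common)
    return total


print(char_ngrams("Justine", 3))
print(shared_ngram_points("justine", "justin"))   # likely typo of the same name
print(shared_ngram_points("justine", "martine"))  # a different first name
```

A misspelt name still shares most of its fragments with the correct one, whereas a genuinely different name shares only a few, which is what allows tied echoes to be separated.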

Box 2: A search engine at the core of the identification engine within the RNIPP

In recent years, the range of tools available to statisticians has been significantly enhanced: programming languages such as R, Python and Julia, high-performance module libraries to process data cleaning, transformation and analysis, machine learning, deep learning, visualisation, language processing, etc.

These developments have been accompanied by equally innovative technological advances in the way that data are stored, organised and processed: widespread availability of SQL database management systems, NoSQL databases, new storage formats and distributed processing. A great many technological advances have been made, the majority of which are open-source, and they allow for the effective processing of big data.

Among these tools, search engines have benefited significantly from all of these recent developments, have been involved in the success stories of the biggest Internet players and are now an essential part of everyday life. We all make conscious use of them each and every day when we open our browser, but we also use them subconsciously, since applications for social media, online shopping, maps, transport, and music and video streaming all use their real-time search and analysis capabilities on a very large scale.

Identifying people within the RNIPP means matching identity traits files, which may contain up to several million rows, against the register of individuals, which contains more than 130 million rows, within reasonable processing times. For this task, the CSNS project team quickly turned to a solution based on the Elasticsearch search engine, for a number of reasons:

1) A powerful and scalable technical architecture

Elasticsearch works on the basis of a distributed infrastructure: multiple instances (or nodes) may be launched on a set of servers (or server farm) and collaborate with one another. Depending on the volume of data involved, they may be split into multiple partitions and then distributed and replicated across multiple nodes in order to deliver high performance and security thanks to the search engine’s load-sharing features and high availability.

Should one or more nodes fail within the farm, the system continues to operate in fail-safe mode, without any impact other than increased response times, provided the remaining nodes are able to access at least one operational version of the data.

Elasticsearch works on the basis of an expandable architecture: the processing power offered by the farm can be adapted to changing needs; it is possible to dynamically expand the farm by adding servers.

2) Advanced text processing and analysis functions

Elasticsearch offers native functions, such as full-text search, analytical tools for processing and standardising the text, synonym search, geospatial data analysis and segmentation into lexical units which makes it possible to split sentences into words or n-grams of words or characters.

3) Flexible architecture

Elasticsearch can be easily integrated into an IT project, as it is easy to query using any programming language, including Java, R and Python, to name just those included in INSEE’s IT master plan.

In addition to all the mechanisms and features described above, the Elasticsearch search engine’s strength lies in its ability to find information from among millions of rows in real time. To do so, it “cheats” a little: everything is pre-calculated during the “indexation” phase while the data are being loaded; “everything” should be understood to mean all of the possible n-grams, versions of words without special characters, without capital letters, without meaningless words, etc. This phase is very time-consuming (6 hours for the RNIPP) and is performed once per month in the case of the CSNS to incorporate recent updates to the RNIPP.
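To give an idea of what such pre-calculation looks like in practice, the sketch below builds an index definition and a search request as plain Python dictionaries, using Elasticsearch’s documented `ngram` tokenizer together with the `lowercase` and `asciifolding` token filters. It is not INSEE’s configuration: the analyzer, tokenizer and field names, as well as the query, are assumptions chosen for the example, and nothing is sent to a server.

```python
import json

# Index definition: names are split into 3- to 5-character n-grams at
# indexing time, after lowercasing and removal of accents.
index_body = {
    "settings": {
        # Needed because max_gram - min_gram exceeds the default limit of 1.
        "index": {"max_ngram_diff": 2},
        "analysis": {
            "tokenizer": {
                "name_ngrams": {  # hypothetical tokenizer name
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 5,
                    "token_chars": ["letter"],
                }
            },
            "analyzer": {
                "name_analyzer": {  # hypothetical analyzer name
                    "type": "custom",
                    "tokenizer": "name_ngrams",
                    "filter": ["lowercase", "asciifolding"],
                }
            },
        },
    },
    "mappings": {
        "properties": {  # hypothetical field names
            "last_name": {"type": "text", "analyzer": "name_analyzer"},
            "first_names": {"type": "text", "analyzer": "name_analyzer"},
            "birth_day_month": {"type": "keyword"},
            "birth_place": {"type": "keyword"},
        }
    },
}

# Search request: each clause that matches contributes to the relevance score.
search_body = {
    "query": {
        "bool": {
            "should": [
                {"match": {"last_name": "martin"}},
                {"match": {"first_names": "justine"}},
                {"term": {"birth_day_month": "12/05"}},
                {"term": {"birth_place": "75056"}},
            ]
        }
    }
}

print(json.dumps(index_body, indent=2))
print(json.dumps(search_body, indent=2))
```

The two dictionaries correspond to the request bodies that would be passed when creating the index and when searching it.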

Then, when a search query is submitted, the server receiving it distributes it across the servers in the farm and a relevance score is calculated for each record matching the query. This score is based on the similarity module used by Elasticsearch and configured by INSEE to evaluate the similarities between search terms and indexed terms. The records with the highest scores are considered to be the most relevant and are suggested first in the search results.

In conclusion, the solution chosen for the CSNS allows it to benefit from the efficiency of a search engine that integrates a combination of several factors: pre-calculation of values for all search fields, distribution of information across multiple servers, distribution of queries across all of these servers to increase processing power and, lastly, a similarity engine to produce the relevance scores for the results.

The echo from the RNIPP with the highest score is selected and the CSNS will then be calculated by encrypting the NIR associated with that echo as described in the previous steps.

An obligation: measuring the quality of the identification

The regulatory provisions also stipulate that the CSNS service must produce a measure of the quality of the identification of individuals. This is essential to allowing users to assess the reliability of future matching and to adapt any statistical processing accordingly. The quality of the identification is assessed from two angles: the general quality of the identity traits file used as an input for the process and the quality of the identification for each individual record.

The quality of the input file is the responsibility of the requester, but the CSNS service gives them information to help to improve it. For each of the variables “last name, first names, year of birth, day and month of birth and geographical code of the place of birth”, indicators are provided with regard to the rates of anomalies and missing values. This information is provided directly by the application made available to users. This gives them the freedom to test multiple versions of their file. Following a first analysis, they can thus identify the variables that need to be improved. This analysis can be repeated as many times as necessary.
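The kind of indicator described here can be sketched as follows: for each variable, the share of missing values and the share of values that break a validity rule. The validity rules below are hypothetical (the actual anomaly definitions used by the service are not given in the text), as are the field names and the sample rows.

```python
import re

FIELDS = ["last_name", "first_names", "birth_year", "birth_day_month", "birth_place"]

# Hypothetical validity rules, for illustration only.
RULES = {
    "birth_year": re.compile(r"^\d{4}$"),
    "birth_day_month": re.compile(r"^\d{2}/\d{2}$"),
    "birth_place": re.compile(r"^\d[\dAB]\d{3}$"),
}


def quality_report(rows: list) -> dict:
    """Return missing-value and anomaly rates for each identity variable."""
    if not rows:
        raise ValueError("the input file is empty")
    report = {}
    for field in FIELDS:
        values = [(row.get(field) or "").strip() for row in rows]
        missing = sum(1 for v in values if not v)
        rule = RULES.get(field)
        anomalies = sum(1 for v in values if v and rule and not rule.match(v))
        report[field] = {
            "missing_rate": missing / len(rows),
            "anomaly_rate": anomalies / len(rows),
        }
    return report


rows = [  # fictitious input file
    {"last_name": "martin", "first_names": "justine", "birth_year": "1990",
     "birth_day_month": "12/05", "birth_place": "75056"},
    {"last_name": "durand", "first_names": "", "birth_year": "85",
     "birth_day_month": "02/03", "birth_place": "69123"},
    {"last_name": "", "first_names": "paul", "birth_year": "1985",
     "birth_day_month": "00/00", "birth_place": "2A004"},
    {"last_name": "petit", "first_names": "anne", "birth_year": "1970",
     "birth_day_month": None, "birth_place": "7505"},
]

report = quality_report(rows)
for field, rates in report.items():
    print(field, rates)
```

Running such a report on successive versions of a file shows directly which variables have improved and which still need work.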

The quality of the identification of each record is the responsibility of the CSNS service and is measured in terms of false positives. A false positive is an identification that results in a different NIR being found than the one belonging to the person being sought. This means that we retrieve the wrong person and matching will combine data from two different people. Conversely, false negatives are persons who were not identified when they should have been. The measure of the quality of the individual identification of each record is presented on the basis of a false-positive rate. The lower the rate, the better the expected quality.

The strategy consists in minimising identification errors (having few false positives) and providing a transparent indication of risk for each record. The decision has therefore been made to identify the entire input file, with the exception of a few extreme cases in which the data are too fragmented, and to indicate an estimated false positive probability for each record. This estimate takes the form of a quality indicator ranging from “perfectly reliable” (1) to “not reliable” (7).

The user remains in control of their decision as to whether or not to accept the identification and the associated CSNS offered. Depending on their objectives, they may reconsider cases that they deem to be too uncertain, either by improving identification using other information (such as address, which is not included in the CSNS process), or by rejecting these identifications and processing these data with adjustment techniques similar to those used to process non-responses.

An alternative approach might have been to only return CSNS for identifications deemed to be of good quality; however, prior consultation with future users highlighted the need to have as much information as possible, even of lower quality, to leave open the possibility of reprocessing and improving it, provided that an assessment of the degree of reliability is provided for each record.

A quality measurement suitable for different identification methods

Practical arrangements for calculating quality indicators must factor in two limitations. The first one is that the identification methods used differ depending on the stage of the process, ranging from exact identification to an identification made using approximate values. The second is that in order to accurately calculate a false-positive rate, it is necessary to have the NIRs of the persons to be identified in order to compare them with the NIRs found by the engine. However, these files are not particularly common and do not make up the majority of the files for which the CSNS service will be used. In addition, if a user has the NIR, the CSNS will be calculated by means of simple hashing and encryption without the need to complete the identification step based on identity traits.

During the CSNS project phase, the false-positive rates were therefore calibrated using files containing the NIR, with a view to defining average rates for each stage of the process. For example, calculations were performed on the Nominative Social Declarations (déclarations sociales nominatives, DSN): the NIRs found following identification using identity traits were compared with the actual NIRs contained within these files, which allowed false positive rates to be calculated across a large volume of data. Similar empirical calculations were also carried out on the Permanent Demographic Sample (Échantillon démographique permanent, EDP), on the annual census surveys for which the NIR was found by a non-CSNS tool (Jabot and Treyens, 2018), and on files covering integration schemes. The average false positive rates calculated during this calibration were taken as a reference to determine the quality level of each step.
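The calibration described above can be illustrated with a minimal sketch. It assumes a file in which each record carries the stage at which it was identified, the actual NIR and the NIR found by the engine. The function name, the record layout and the choice of identified records as the denominator are assumptions made for illustration, not a description of the CSNS implementation.

```python
from collections import defaultdict

def false_positive_rates(records):
    """Return the false-positive rate per identification stage.

    Each record is a (stage, true_nir, found_nir) tuple. found_nir is None
    when no identification was returned; such records are left out of the
    rate (assumption: the denominator is the number of identified records).
    """
    identified = defaultdict(int)
    wrong = defaultdict(int)
    for stage, true_nir, found_nir in records:
        if found_nir is None:
            continue  # not identified: cannot be a false positive
        identified[stage] += 1
        if found_nir != true_nir:
            wrong[stage] += 1  # the engine returned someone else's NIR
    return {stage: wrong[stage] / identified[stage] for stage in identified}

# Made-up calibration records (placeholders, not real NIRs)
sample = [
    ("exact", "A", "A"),
    ("exact", "B", "B"),
    ("simple", "C", "C"),
    ("simple", "D", "X"),        # wrong person found
    ("approximate", "E", None),  # no identification returned
    ("approximate", "F", "F"),
]
rates = false_positive_rates(sample)
```

Averaging such rates over several reference files would give one reference value per stage, which is the role the calibration plays in the text.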

Based on this method, it seems that average false positive rates vary from 0 to 2% when a person has been identified using exact or simple queries (first two steps) (figure 3).

Figure 3 - The primary quality criterion is the false positive rate: the lower it is, the better the quality.

 


However, when it comes to the “approximate value” query method (final stage), the quality measure requires a specific approach, even though the principle of seeking out false positives remains the same. The score calculated for each individual offers interesting information, but this cannot be used directly. With the n-gram method, the longer the word, the more character strings it contains and the higher its potential score. It is therefore impossible to establish a direct link between the score and the probability of returning a false positive. However, very low scores often correspond to false positives. This information must therefore be combined with information from another source.
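The length effect can be seen in a small sketch. It assumes character trigrams and a raw score equal to the number of shared n-grams; the actual scoring used by the identification engine is not described here, so both functions are hypothetical.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def raw_score(query, candidate, n=3):
    """Count the n-grams shared by the two strings (unnormalised)."""
    return len(ngrams(query, n) & ngrams(candidate, n))

# Two perfect matches, yet very different raw scores (made-up names)
short_score = raw_score("martin", "martin")
long_score = raw_score("martinez-dubois", "martinez-dubois")
```

Both comparisons are exact matches, but the longer name reaches a much higher raw score simply because it contains more trigrams. A given score value therefore cannot be read directly as a probability of error.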

The ratio of the score of the echo selected (the best) to that of the second best revealed itself as another piece of information worth exploiting. Indeed, tests have shown that the smaller the difference between the top two echoes, the greater the probability of a false positive. In other words, if the two best echoes found for a person have a similar score, it is likely that the difference in proximity with the original identity traits is too small to be significant. This high proximity runs the risk of selecting the wrong person and introducing a false positive. For that reason, false positive rates calculated empirically using the same files as for the previous steps were classified according to a dual scale of values: one for the score and the other for the ratio of the “score of the top echo to the score of the second echo”.
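A minimal sketch of this second piece of information, assuming the engine returns a list of numeric scores, one per echo. The function name and the handling of single-echo cases are assumptions.

```python
def top_two_ratio(scores):
    """Ratio of the best echo score to the second-best one.

    Returns None when fewer than two echoes were returned or when the
    second-best score is zero, since no ratio can be computed then.
    """
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2 or ranked[1] == 0:
        return None
    return ranked[0] / ranked[1]

clear_winner = top_two_ratio([12, 4, 3])  # best echo far ahead of the next
near_tie = top_two_ratio([12, 11, 3])     # two echoes almost equally close
```

A ratio close to 1, as in the second call, signals that the engine could hardly tell the two best echoes apart, which is the situation associated with a higher risk of false positives.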

In the end, the seven-level quality indicator, which ranges from “perfectly reliable” to “not reliable”, is calculated as follows. Identifications found during the exact query stage or upon the first simple query with relaxed constraints on the first name are classified as “perfectly reliable”, with a false positive probability of less than 1%. Those found by means of the other simple queries are classified as “very reliable”, with a false positive probability of 1 to 2%. Identifications made during the “approximate value” stage have indicators ranging from “very reliable” to “not reliable”, depending on how they have been classified based on their score and the value of the ratio between the scores of the top and second search results (Malherbe, 2022) (figure 4).
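The classification rule can be sketched as follows. The treatment of the exact and simple stages follows the text; the thresholds applied to the score and to the ratio in the “approximate value” stage are invented for illustration, since the actual dual scale is not given here. Stage names and the function name are also assumptions.

```python
LABELS = {
    1: "Perfectly reliable", 2: "Very reliable", 3: "Reliable",
    4: "Somewhat reliable", 5: "Not very reliable",
    6: "Very unreliable", 7: "Not reliable",
}

def quality_level(stage, score=None, ratio=None):
    """Return a quality level from 1 (best) to 7 (worst)."""
    if stage in ("exact", "first_simple"):
        return 1  # false positive probability below 1%
    if stage == "other_simple":
        return 2  # false positive probability of 1 to 2%
    if stage != "approximate":
        raise ValueError(f"unknown stage: {stage!r}")
    if score is None:
        raise ValueError("approximate identifications need a score")
    # Hypothetical dual scale: low scores and near-ties degrade the level.
    score_penalty = 0 if score >= 20 else 1 if score >= 10 else 2
    if ratio is None or ratio >= 1.5:
        ratio_penalty = 0  # single echo or clear winner
    elif ratio >= 1.1:
        ratio_penalty = 1
    else:
        ratio_penalty = 3  # the two best echoes are nearly tied
    return min(2 + score_penalty + ratio_penalty, 7)

level = quality_level("approximate", score=15, ratio=1.2)
label = LABELS[level]
```

In a real calibration, each cell of the score-by-ratio grid would be assigned the level matching the false positive rate observed empirically for that cell.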

Figure 4 - The primary quality criterion is the false positive rate: more options for “approximate value” queries.

 


The final stage of the quality measurement is the evaluation of the file as a whole. This last phase provides the user with information regarding the share of records falling into each of the seven quality levels: this information is provided automatically within the dedicated application.
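As an illustration of this file-level summary, the sketch below computes the share of records at each of the seven levels from a list of per-record levels. The function name and input format are assumptions; the dedicated application produces this information itself.

```python
from collections import Counter

def level_shares(levels):
    """Share of records at each quality level, from 1 to 7."""
    if not levels:
        raise ValueError("no records to summarise")
    counts = Counter(levels)
    total = len(levels)
    return {level: counts.get(level, 0) / total for level in range(1, 8)}

# Made-up per-record quality levels for a small file
shares = level_shares([1, 1, 1, 2, 2, 4, 7, 1])
```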

In addition, users are provided with other tools to help them to understand the quality of the identifications made with their file. In particular, an age distribution is calculated for the initial population of the file input into the process and for the identified population output by the application. By comparing the two, it is possible to identify whether a particular sub-population is over-represented among the failed identifications. Similar information is provided to allow for comparisons between populations born in France and those born abroad, since the quality of civil status data can sometimes differ.
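The comparison of the two age distributions can be sketched as follows, assuming ages in years and 20-year bands. The function names, the band width and the sample data are hypothetical.

```python
from collections import Counter

def age_band_shares(ages, band_width=20):
    """Share of persons in each age band (keyed by the band's lower bound)."""
    if not ages:
        raise ValueError("empty population")
    bands = Counter((age // band_width) * band_width for age in ages)
    total = sum(bands.values())
    return {band: count / total for band, count in sorted(bands.items())}

def share_gaps(input_ages, identified_ages, band_width=20):
    """Difference between identified and input shares for each age band."""
    before = age_band_shares(input_ages, band_width)
    after = age_band_shares(identified_ages, band_width)
    return {band: after.get(band, 0.0) - before[band] for band in before}

# Made-up ages: two of the four oldest persons were not identified
input_ages = [25, 30, 45, 50, 85, 90, 95, 88]
identified_ages = [25, 30, 45, 50, 85, 90]
gaps = share_gaps(input_ages, identified_ages)
```

A negative gap for a band, as for the oldest band in this example, means that this sub-population is over-represented among the failed identifications.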

These quality indicators have been tested by volunteer MSOs and they were found to be useful and relevant for future matching.

Promising first uses

By facilitating matching between different sources, the CSNS helps to expand the possibilities for analysing economic and social phenomena. The data pool is considerable and the first uses are revealing the potential of this new process.

During the CSNS project phase, four Ministerial Statistical Offices (DARES, DREES, SDES and SIES) actively participated in a number of tests, in particular to offer suggestions and evaluate the robustness of the methodological choices and the relevance of the quality indicator calculations. The following examples show the diverse nature of the topics covered and the interest in terms of knowledge of society and the evaluation of public policies.

The InserSup project led by SIES in collaboration with DARES aims to more effectively measure the professional integration of higher education graduates. The objective is to produce professional integration indicators, broken down by educational establishment and degree, covering all graduates, and to make this information available to pupils and students on Parcoursup, MonMaster, Affelnet, Onisep, etc. to help them to choose the course they wish to follow. It also aims to provide regional stakeholders and employers with information regarding the relationship between training and employment, and more generally to inform the public debate on professional integration.

The CSNS contributes to this project by facilitating the matching of administrative sources from various different originators without having to resort to specific surveys. The sources used are the various student registers held by SIES and the Nominative Social Declarations (Déclarations sociales nominatives — DSN) (DARES), which include a large amount of information regarding employees. The use of the non-employees database, population census files and tax files is also envisaged. The service provided by the CSNS therefore makes it possible to increase the number of possible matches between different sources of observations, to reduce the amount of time it takes to make them available and to offer comprehensive information for the coverage in question.

More generally, obtaining the CSNS for the 40 million employees within the Labour Movements Statistical Information System (système d’information statistique sur les mouvements de main-d’œuvre — SISMMO) created by DARES will make it possible in particular to study student employment by means of matching with higher education registration databases.

Moving to a completely different area, the CSNS could help to better measure the impact of the ecological transition on the different categories of households. Matching data from the Statistical Register of Road Vehicles (Répertoire statistique des véhicules routiers, RSVERO) with data on income, precise location and characteristics of households taken from FIDELI opens up a new field for analysis; for example, the link between vehicle characteristics (horsepower, fuel type) and vehicle owner income could be better understood. Such information is essential for defining and evaluating public policies in the areas of fuel poverty and the ecological transition.

As regards social affairs, the CSNS will allow for the enrichment of data from surveys covering autonomy and dependence. Matching the survey on capabilities, help and resources for seniors (Capacités, aides et ressources des seniors, CARE) with information regarding benefits such as the personalised autonomy allowance (allocation personnalisée d’autonomie, APA) and housing benefits (aide sociale à l’hébergement, ASH) will make it possible to monitor changes in the dependency of seniors two years after the survey. At the same time, by reconciling data from the Everyday Life and Health Survey (Vie quotidienne et santé, VQS) with data from social security schemes and data on professional integration and employment (in particular the Nominative Social Declaration), it will be possible to link the management of a loss of autonomy with professional integration and to tackle the issue of end-of-career disability. In the longer term, the CSNS will also open up opportunities for studies and analyses in a number of areas, such as the future of children under the care of child welfare services, the career paths of the beneficiaries of the Active Solidarity Income (revenu de solidarité active, RSA) or the APA, etc. For such purposes, the comprehensive nature of the matched administrative sources will provide regularly updated territorial data.

All of these new analytical possibilities offered by the CSNS may be of interest to the research world, beyond the limited scope of the Official Statistical Service (SSP) (Gadouche, 2019). Therefore, although the CSNS calculation process is reserved for the SSP, the end results of the matching can be made more widely available, just without including the CSNS itself.

Furthermore, the 2016 Law for a Digital Republic also provided for a counterpart intended for research. An identical operation for encrypting the national social insurance number to create a non-significant code that can be used as a matching key will be a new option offered to researchers. This differs from the CSNS in that this research code is linked to a specific research project and cannot be used for any other project. This means that an individual with a non-significant code within a research project will not have the same code within a different research project.
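The property described above, the same person receiving unrelated codes in different projects, can be illustrated with a keyed hash. This is only a sketch: it uses HMAC-SHA256 from the Python standard library as a stand-in, whereas the actual cryptographic scheme is not specified here, and the key names and the NIR value are made up.

```python
import hashlib
import hmac

def project_code(nir, project_key):
    """Derive a non-significant code from a NIR with a project-specific key.

    Assumption: a keyed hash (HMAC-SHA256) stands in for the real
    hashing and encryption steps, which are not described in the article.
    """
    return hmac.new(project_key, nir.encode("utf-8"), hashlib.sha256).hexdigest()

fake_nir = "1234567890123"  # made-up placeholder, not a real NIR

code_a = project_code(fake_nir, b"key-project-A")
code_a_again = project_code(fake_nir, b"key-project-A")
code_b = project_code(fake_nir, b"key-project-B")
```

Within project A the code is stable, so files can be matched on it. Between projects A and B the codes differ, so files from the two projects cannot be linked through them.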

Just a few months on from the full launch of the CSNS service in October 2022, this new system is already being used by five Ministerial Statistical Offices as well as by several INSEE units. The systematic and annual integration of the CSNS is scheduled for several files that are widely used by official statistics: FIDELI (tax income), annual census surveys (enquêtes annuelles de recensement — EAR), files taken from the Nominative Social Declaration (Déclaration sociale nominative — DSN), the Permanent Demographic Sample (Échantillon démographique permanent — EDP), etc. The expression “to CSNSify a file” is spreading among statisticians. Although it is not particularly elegant, it heralds the introduction of good habits that will make the matching process simpler and smoother in the future, while also contributing to the continual expansion of the data sources necessary to provide accurate observations of our society.

Legal references


Matching is the process of linking together third-party statistical data sources on the basis of personal data, which allows for the creation of new files containing some or all of the variables contained in each of the original source files.

See the legal bases at the end of the article.

The NIR is the registration number in the National Register for the Identification of Natural Persons (RNIPP) and is more commonly referred to as the “national social insurance number”.

Other methods can be used, in particular by reconciling identity traits (see below), but these are more complex to implement as part of a standardised and industrialised process.

See the legal bases at the end of the article.

See the legal bases at the end of the article.

The CSNS is required by Article 34 of the Law for a Digital Republic of 7 October 2016 and its conditions for use are specified by a decree of the Council of State (No 2016-1930 of 28 December 2016) and by a decree of the Minister for the Economy (dated 28 September 2020) (see Legal bases).

See the legal bases at the end of the article.

The CSNS calculated are valid for 10 years from 2022, unless a security breach is detected. At the end of the 10-year period (or sooner in the event of a breach), the NIR encryption keys will be changed. As part of the renewal process, a correspondence table will be produced for the new and the old CSNS.

An echo is the return of information following a search of a person’s identity traits. If the identity traits being searched for are incomplete or very common, several people may meet the search criteria and, in such cases, several echoes may be returned.

“Relaxed constraints”: reduce the number of matching identity components needed to confirm an identification.

Persons actually born on 1 January or 31 December will then be identified during the final step.

Perfectly reliable (1)/Very reliable (2)/Reliable (3)/Somewhat reliable (4)/Not very reliable (5)/Very unreliable (6)/Not reliable (7).

Direction de la recherche, des études, de l’évaluation et des statistiques (Directorate of Research, Studies, Evaluation and Statistics — DREES), the Ministerial Statistical Office covering healthcare and social affairs.

The Directorate of Research, Economic Studies and Statistics (Direction de l’animation de la recherche, des études et des statistiques — DARES) is the Ministerial Statistical Office covering labour; the Directorate of Research, Studies, Evaluation and Statistics (Direction de la recherche, des études, de l’évaluation et des statistiques — DREES) is the Ministerial Statistical Office covering healthcare and social affairs; the Data and Statistical Studies Department (Service des données et études statistiques — SDES) is the Ministerial Statistical Office covering housing, construction, transportation, energy, the environment and sustainable development; the Information Systems and Statistical Studies Sub-Directorate (Sous-direction des systèmes d’information et des études statistiques — SIES) is the Ministerial Statistical Office covering higher education and research.

Parcoursup is a web platform aimed at collecting and managing the placement wishes of future higher education students in France; MonMaster is a platform on which students can consult all available master’s degrees and submit applications for the first year of their master’s; the Affelnet procedure makes it possible to allocate pupils in their final year of middle school places at the high schools in their local area; Onisep is a government operator that produces and disseminates all information regarding training and trades.

Fichiers démographiques sur les logements et les individus (demographic files on dwellings and individuals).

Articles 7 to 9 of Decree No 2016-1930 of 28 December 2016 simplifying the advance formalities relating to processing for statistical or research purposes (see legal bases).

Further reading

CHRISTEN Peter, 2012. Data Matching. Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. [accessed 14 March 2023].

ESPINASSE Lionel and ROUX Valérie, 2022. Le Répertoire national d’identification des personnes physiques (RNIPP) au cœur de la vie administrative française. In: Courrier des statistiques. [online]. November 2022. Insee. N° 8, pp.72-92. [accessed 14 March 2023].

FELLEGI, Ivan, SUNTER, Alan, JARO, 2014. Approach to Record Linkage (Method). [accessed 14 March 2023].

GADOUCHE Kamel, 2019. Le centre d’accès sécurisé aux données (CASD), un service pour la data science et la recherche scientifique. In: Courrier des statistiques. [online]. December 2019. Insee. N° 3, pp.76-92. [accessed 14 March 2023].

JABOT P. and TREYENS P.E., 2018. Proposition d’un nouvel appariement de l’enquête CARE par identification du plus proche écho. Actes des journées de méthodologie statistique 2018. [accessed 14 March 2023].

MALHERBE Lucas, 2022. Méthodologie des appariements individuels. JMS 2022. [accessed 14 March 2023].