Courrier des statistiques N9 - 2023

This ninth issue of the Courrier des statistiques is marked by a number of highly technical papers and by topics that are unusual for the review.
It all begins with a story: the history of official statistics, seen from the angle of democratic debate, since the creation of INSEE in 1946.
In order to contribute to the public debate, INSEE has recently innovated with the introduction of "distributional national accounts", which make it possible to better analyse the distribution of growth and its impact on household incomes. The second paper explains the principles, mechanisms and prospects.
The theme then changes, with two papers on data confidentiality. One sets out the legal framework, the risks associated with breaches of confidentiality and the subtleties of applying statistical confidentiality in a changing context. The other, more operational, explains the rationale behind the "non‑significant statistical code" (CSNS) and how it facilitates the matching of different sources while ensuring the protection of individual data.
The last three papers deal with related topics that are important in a "world of data". We begin with data formats, a topic that is not often discussed, but one that statistics cannot ignore. Choosing and managing the right formats is essential when statisticians use third‑party data sources. The paper on the integration of administrative data reveals an automated processing pipeline driven by metadata, a prerequisite for more traditional statistical production. Finally, the CNAV (National Old‑Age Pension Fund) explains the importance of formalised and documented exchange standards automatically generating control tools for better monitoring of data quality in the field of social security.

Published on: 29/10/2024
Alexis Dondon and Pierre Lamarche
Courrier des statistiques - October 2024

Which formats for which data?

To be intelligible to its users and to fulfil its function of conveying information, data must be structured. This structure is formalised through a data model, which in turn determines the choice of format. Data formats are varied and address specific problems, depending on the context of use. Ideally, the standards behind the data models and the associated formats are shared by as many users as possible. For statistical data, these problems concern the objects about which the data are meant to convey information, but also data documentation (i.e. metadata) and the wish to pair the data with specific software solutions particularly well suited to processing them.

On this last point, the emergence of open-source solutions over the last few decades has made it possible to address several issues simultaneously: securing data import, efficiency of data processing, replicability, etc. In particular, formats such as Parquet are integrated into software solutions accessible to all and have been adopted by an increasingly large community convinced of their advantages. Nevertheless, there is no clear-cut answer to the choice of a format: choices are made after a precise analysis of the needs at each step of the data's life cycle. In this way, the choice of a format is the concrete expression of a requirement driven by standards specific to each of these phases.
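To make the idea of "securing data import" more concrete, here is a minimal sketch in Python, using only the standard library. It is not Parquet: the small JSON container with an embedded schema is a hypothetical stand-in for any format that carries its own type metadata. The records, the `naive_infer` helper and the `SCHEMA` dictionary are illustrative assumptions, not taken from the paper.

```python
import csv
import io
import json

# Made-up records: an identifier with a leading zero and a count.
ROWS = [
    {"code": "01001", "population": 832},
    {"code": "75056", "population": 2133111},
]


def naive_infer(value: str):
    """Guess a type from raw text, as a CSV reader with type inference might."""
    try:
        return int(value)
    except ValueError:
        return value


def csv_round_trip(rows):
    """Write rows to CSV (plain text, no types), then read them back with inference."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["code", "population"])
    writer.writeheader()
    writer.writerows(rows)
    buf.seek(0)
    return [{k: naive_infer(v) for k, v in row.items()} for row in csv.DictReader(buf)]


# Hypothetical schema stored alongside the data (the "metadata").
SCHEMA = {"code": "string", "population": "integer"}
CASTS = {"string": str, "integer": int}


def typed_round_trip(rows, schema):
    """Store values together with their declared types, then restore them on read."""
    payload = json.dumps(
        {"schema": schema, "rows": [[str(r[c]) for c in schema] for r in rows]}
    )
    doc = json.loads(payload)
    names = list(doc["schema"])
    return [
        {n: CASTS[doc["schema"][n]](v) for n, v in zip(names, row)}
        for row in doc["rows"]
    ]


if __name__ == "__main__":
    print(csv_round_trip(ROWS)[0])            # the code has lost its leading zero
    print(typed_round_trip(ROWS, SCHEMA)[0])  # the code is still a string
```

In the CSV round trip, the identifier is silently reinterpreted as a number, whereas the version that travels with its schema returns the records unchanged. Formats such as Parquet embed this kind of type information in the file itself, which is one reason they help secure imports.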
