The Information Commissioner’s Office (ICO), together with the Financial Conduct Authority and the Alan Turing Institute, hosted a joint roundtable to discuss the challenges of synthetic data validation in the financial sector. As a result, the Research Paper “Exploring Synthetic Data Validation – Privacy, Utility and Fidelity” was recently published on the ICO website.
Synthetic data is information that is artificially created and can be used as a stand-in for real-life datasets. It is generated from ‘real’ data using a model that is trained to reproduce the characteristics and structure of the original data. This means that analysis of the synthetic data can deliver results very similar to those obtained from the original data.
According to the paper, synthetic data “can help organisations to better harness the power of their data by mitigating data quality issues, modelling new and emerging scenarios, and protecting commercially sensitive data”. Synthetic data can also enhance privacy by replacing or removing personally identifiable information. However, because the synthetic data must remain faithful to the original data, there is a tension between these two goals. The discussion below outlines how to evaluate synthetic datasets in order to address this tension.
Elements of a synthetic dataset: the relationship with the original data
As stated in the ICO Paper, synthetic data “uses a mathematical model or algorithm to generate statistically realistic, but ‘artificial’ data”, which can deliver data-driven insights from the original data, e.g., financial transactions.
Breaking that definition down into its components, synthetic data can be characterised as follows:
- It is derived from a real-world system (the original data) through a generation process;
- It is generated by a mathematical model or an algorithm;
- The generation process preserves the statistical properties of the original data;
- The data must be generated in a way that sufficiently obscures individuals’ personal information, since that information is not relevant to the purpose of the synthetic dataset.
Drawing from the definition above, synthetic data may make a significant contribution to accelerating innovation, as it replicates key features of the original data while supporting compliance with data protection legislation. To undertake a synthetic data generation process, three main aspects should be assessed: utility, fidelity and privacy.
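To make the generation step concrete, the following is a minimal, purely illustrative sketch in Python. It fits a multivariate Gaussian to a hypothetical numeric table and samples new rows from it, preserving means and pairwise correlations but not individual records. Real-world generators (for example copulas or deep generative models) are considerably more sophisticated, and the column names and figures below are assumptions for illustration only, not drawn from the ICO paper.

```python
import numpy as np
import pandas as pd

def generate_synthetic(original: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Fit a multivariate Gaussian to the original numeric data and sample new,
    artificial rows that preserve its means and pairwise correlations."""
    rng = np.random.default_rng(seed)
    mean = original.mean().to_numpy()
    cov = original.cov().to_numpy()
    samples = rng.multivariate_normal(mean, cov, size=n_rows)
    return pd.DataFrame(samples, columns=original.columns)

# Hypothetical "original" transaction data (illustrative only).
rng = np.random.default_rng(42)
original = pd.DataFrame({
    "amount": rng.gamma(shape=2.0, scale=50.0, size=1_000),
    "balance": rng.normal(loc=2_000, scale=500, size=1_000),
})

synthetic = generate_synthetic(original, n_rows=1_000)
print(synthetic.describe())  # similar summary statistics, but no real records
```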
Attributes of synthetic datasets and their evaluation: how to drive adoption of synthetic data while meeting privacy requirements?
The generation of a synthetic dataset requires the assessment of three attributes:
- Utility refers to the synthetic data’s usefulness for a specific task or set of tasks (such as testing algorithms or first-stage mobile app development);
- Fidelity measures how closely the synthetic data matches the original data (a simple per-column check is sketched after this list);
- Privacy “measures the risk that specific individuals (or other sensitive data) can be reidentified from the synthetic dataset”.
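As an illustration of how fidelity might be checked in practice, the sketch below compares the marginal distribution of each shared column in the hypothetical original and synthetic frames from the earlier example, using the two-sample Kolmogorov-Smirnov statistic. This is just one of many possible fidelity metrics and is not one prescribed by the paper.

```python
from scipy.stats import ks_2samp

def fidelity_report(original, synthetic):
    """Two-sample KS statistic per shared column: 0 means identical marginals,
    values near 1 mean the synthetic column looks nothing like the original."""
    return {
        col: ks_2samp(original[col], synthetic[col]).statistic
        for col in original.columns.intersection(synthetic.columns)
    }

print(fidelity_report(original, synthetic))
```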
With regard to privacy, the roundtable report recommends the use of a risk-based model, under which it will be necessary to clarify:
- the extent to which data subjects can be identified in the synthetic data; and
- which personal data would be revealed upon successful identification.
However, the risk of data disclosure cannot be eliminated entirely, which is why the paper recommends a risk-based model that accepts some level of inherent risk in generating and processing synthetic data. The approach should be iterative and should consider the specific circumstances of the synthetic dataset, such as:
- The level of linkability and inference risk with respect to the personal information in the original dataset (e.g., how easily the information in the synthetic dataset can be linked to the personal information in the original data);
- The release model (e.g., is the synthetic data shared within a secure environment? Will the dataset be published on a public website or on the company’s intranet/platform?);
- The users who will access these datasets (e.g., are they internal or external? Are levels of access clearly defined?);
- The time, cost, and technologies/techniques available for re-identification.
The paper also recommends post-generation testing to confirm that the data protection obligations relating to the synthetic dataset are sufficiently met.
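One simple form such post-generation testing could take, purely as an illustration, is a distance-to-closest-record (DCR) check: synthetic rows that sit very close to a real record may effectively reproduce it. The sketch below assumes the numeric frames from the earlier examples and uses scikit-learn’s nearest-neighbour search; it is a rough proxy (in practice columns should also be normalised) and does not replace the full, risk-based re-identification assessment recommended by the paper.

```python
from sklearn.neighbors import NearestNeighbors

def distance_to_closest_record(original, synthetic):
    """Euclidean distance from each synthetic row to its nearest original row."""
    nn = NearestNeighbors(n_neighbors=1).fit(original.to_numpy())
    distances, _ = nn.kneighbors(synthetic.to_numpy())
    return distances.ravel()

dcr = distance_to_closest_record(original, synthetic)
# Distances at or near zero suggest the generator has effectively copied a real record.
print(f"minimum distance to a real record: {dcr.min():.3f}")
```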
The trade-off between utility/fidelity and privacy
Concerns can arise about how utility and fidelity requirements can be reconciled with data protection requirements. For example, fidelity demands that the synthetic data closely resemble the original data, while privacy requires the synthetic dataset to be free of bias as well as of any information through which a data subject may be identified.
According to roundtable participants, in order to navigate the trade-off between utility/fidelity and privacy, organisations should:
- First, assess the use case to evaluate both accuracy and privacy requirements. Once the use case is defined, it may be helpful to group potential use cases by type of fidelity (i.e., which features of the original data the synthetic data will replicate) and then to generate multiple synthetic datasets. A consistent system of synthetic datasets makes it easier to evaluate privacy guarantees.
- Then, identify specific privacy requirements for each group of datasets. Once the datasets are grouped by type of fidelity, organisations will need to ensure, in line with the GDPR, that the processing is compatible with the original purpose of the original data (e.g., do they have the informed consent of the data subjects for such processing? Are the data subjects sufficiently informed that their data may be processed to create synthetic data?).
How to encourage the adoption of synthetic data
Synthetic data has enormous potential, as it can deliver data-driven insights in a privacy-preserving manner. However, concerns regarding data protection represent a barrier to its adoption. In this regard, the paper suggests three interdependent approaches for encouraging the adoption of synthetic data:
- Use case studies: drawing on what was outlined above, an in-depth analysis of the use case in scope should be documented, defining how utility was measured, the fidelity considerations and the privacy trade-offs.
- Establish best practices: standards, frameworks and guidance would provide practical advice and facilitate understanding of how to validate synthetic data from utility, fidelity and privacy perspectives.
- Role of regulators: regulators will play an important role in supporting the implementation of synthetic data and in overcoming barriers to its adoption.
Conclusion
Synthetic data undoubtedly has strong potential, as it can significantly accelerate innovation by combining secure research environments with privacy guarantees. Although there is a tension between fidelity and privacy, as this article has outlined, there are approaches that can minimise the risk to data subjects when generating synthetic datasets.
Although the paper refers solely to the implementation of synthetic data in the financial sector, synthetic data is gaining traction in other fields that handle large-scale datasets, such as health.
However, synthetic data is still a nascent technology, and challenges such as those summarised above remain to be resolved.
Trilateral’s Data Protection and Cyber-risk team have data protection specialists with extensive expertise and experience in supporting public and private sector organisations that deal with the latest technologies from a data protection perspective, including synthetic data and its assessment. Please get in touch with our advisors to discuss your organisation’s requirements.