What Privacy Officers Need to Know About Synthetic Data
Regulation, cost, and other factors can limit access to data for analysis, and with it the many benefits that access brings. This impedes not only the efforts of organizations to rationalize resources (e.g., workforce analytics, resource allocations, and consumer insights), but also research into health, medicine, the social sciences, and other endeavors that benefit society at large.
Finding a balance between privacy and the potential benefits of sharing personally identifiable or sensitive data (e.g., personal health information) seems an intractable problem, putting privacy advocates and those who wish to use data for commercial or research purposes at loggerheads.
There is, however, a rapidly emerging solution to the use and sharing of data across organizations and, indeed, across borders: data synthesis.
Synthetic data, as its name implies, is not actual data taken from real-world events or individuals’ attributes. Rather, it is data generated by a computer (i.e., by synthetic data generation tools) to match the key statistical properties of the real sample data. Importantly, the resulting data is not the actual data in pseudonymized or anonymized form. It is artificial data that does not map back to any actual natural person.
The ability to create and share data that provides insights into cohorts and segments without impinging on privacy has profound implications for the adtech and martech ecosystem, which is struggling with privacy-centric moves by Apple, Google, and other platforms.
During the Spokes Privacy Conference, Dr. Khaled El Emam, CEO of Replica Analytics, provided a highly accessible technical overview of Data Synthesis and Synthetic Data. He was joined by Mike Hintze, Partner at Hintze Law PLLC and a Future of Privacy Forum Senior Fellow.
How Synthetic Data is Generated
Beginning with the source dataset (the actual data, say of individual health characteristics or consumer actions), you create a model of that dataset, typically using various AI and machine learning techniques. The resultant model “captures the patterns and statistical properties of that source data set,” explains Khaled. You then apply that model to generate the synthetic data. Because the synthetic data is produced from a model, there is no one-to-one mapping between the synthetic records (e.g., individual consumers) created and the “real” source records. In other words, privacy is protected.
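To make the fit-then-generate idea concrete, here is a minimal sketch in Python. It is not the method used by any particular synthesis tool: it assumes a toy two-column dataset (age and systolic blood pressure, invented for illustration) and swaps in a simple bivariate Gaussian where real tools would use richer machine learning models. The names `fit_model` and `generate` are hypothetical. A model this simple, fitted to so few rows, offers no privacy guarantee by itself; the point is only the workflow.

```python
import math
import random

# Hypothetical "source" records: (age, systolic blood pressure). Invented values.
source = [(34, 118), (45, 125), (52, 131), (61, 140),
          (29, 115), (70, 148), (48, 127), (56, 135)]

def fit_model(rows):
    """Step ii: summarize the source data as means, spreads, and correlation."""
    n = len(rows)
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    cov = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}

def generate(model, n, seed=0):
    """Step iii: sample new records from the fitted model, not from the source rows."""
    rng = random.Random(seed)
    rho = model["rho"]
    resid = math.sqrt(max(0.0, 1.0 - rho * rho))  # guard against rounding past 1
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        y = model["my"] + model["sy"] * (rho * z1 + resid * z2)
        out.append((x, y))
    return out

model = fit_model(source)
synthetic = generate(model, 10_000)
refit = fit_model(synthetic)  # the synthetic data should reproduce the model's statistics
```

Refitting the model on the synthetic records (`refit`) is one quick check that the key statistics survived, even though no synthetic row was copied from a source row.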
Figure 1 below illustrates the process going from i) source data through ii) model creation, and iii) the application of that model to create the synthetic data set. (The images on the right-hand side are actually fakes built from fitted models that were trained on images of real people.)
The synthesized consumers are in a sense the ultimate deep fake. They are fake, but as El Emam notes, they “have the same properties as real people, [are] very realistic, and they work quite well.”
The Simulator Exchange
Khaled sees simulators as the next evolution in the application of synthetic data. “These generative models, also called simulators, can be put in a catalog or marketplace exchange,” said Khaled. Access to those simulators can be provided to data consumers “who would select the simulators based on the types of information, or datasets, they want, generating datasets on demand.” (For an example of this, see the open-source synthetic patient generator Synthea™.)
Furthermore, the data generation is scalable: 100, 1,000, or 10 million records can be generated once the model has been built. The implications for data-intensive sectors like adtech are meaningful: imagine comparing simulated consumer profiles based on, for example, zero-party data, versus profiles created from cookies and pixels.
Again, access to actual data is not provided, only the simulators (i.e., the data models). And “providing access to different types of data simulators is an effective way to enable broad access to data like consumer profiles within, or external to, the organization.”
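As a rough illustration of what such a catalog could look like in code, the sketch below registers a simulator under a name and lets a data consumer request any number of records on demand. Everything here is hypothetical: the `CATALOG` registry, the `register` decorator, the `consumer_profiles` simulator, and its hard-coded parameters (which, in practice, would come from a model fitted to real data) are assumptions for illustration and do not describe any existing exchange.

```python
import random
from typing import Callable, Dict, List

# A simulator takes a record count and a seed, and returns synthetic records.
Simulator = Callable[[int, int], List[dict]]

CATALOG: Dict[str, Simulator] = {}  # hypothetical catalog: name -> simulator

def register(name: str):
    """Add a simulator to the catalog under the given name."""
    def wrap(fn: Simulator) -> Simulator:
        CATALOG[name] = fn
        return fn
    return wrap

@register("consumer_profiles")
def consumer_profiles(n: int, seed: int = 0) -> List[dict]:
    # Parameters are invented here; a real simulator would hold a fitted model.
    rng = random.Random(seed)
    segments = ["bargain", "loyal", "occasional"]
    return [
        {"age": max(18, round(rng.gauss(42, 12))), "segment": rng.choice(segments)}
        for _ in range(n)
    ]

def generate_on_demand(name: str, n: int, seed: int = 0) -> List[dict]:
    """What a data consumer calls: pick a simulator, choose how many records."""
    return CATALOG[name](n, seed)

small = generate_on_demand("consumer_profiles", 100)
large = generate_on_demand("consumer_profiles", 100_000)
```

The consumer only ever touches the simulator, never the source records, and the same simulator serves a request for 100 records or 100,000.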
Are there Privacy Risks with Synthetic Data?
As illustrated in Figure 1, the source data, which contains the personally identifiable or sensitive information, is used only to create a model, which is “fitted” to that data and then used to create the synthetic data. Again, there is no one-to-one relationship between synthetic and source records. The likelihood that data could be “re-identified” is very small, but it cannot be said to be zero.
So how much risk is there?
El Emam notes that there are privacy models for evaluating the risk of re-identification of synthetic datasets.
Figure 3 (below) details the results of such a re-identification risk model performed on actual health datasets.
The commonly used risk threshold for what is deemed an acceptably non-identifiable dataset for health data is 0.09. “It is used by the European Medicines Agency, Health Canada, and [the] Centers for Medicare and Medicaid [Services].” Figure 3 models the re-identification risk for both the original (real) dataset and the synthesized dataset. The results show that the risk of re-identification is significantly lower for the synthesized dataset than for the original dataset (and well below the risk threshold).
“This is a common pattern that we’ve seen if synthetic data is generated well (i.e., not overfitted or biased) and using generally accepted approaches,” says Khaled. Furthermore, he adds that “these are the privacy risks without including any additional controls…If you add security controls, privacy controls, contractual controls, to manage residual risk then your overall risk of re-identification can be significantly lower.”
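The comparison against a threshold can be sketched with a deliberately simplified risk measure. The code below is not the risk model behind Figure 3; it assumes a basic proxy in which each record’s risk is one divided by the number of records sharing its quasi-identifier values (an “equivalence class”), averaged over the dataset. The function name `average_risk`, the field names, and the toy records are all hypothetical.

```python
from collections import Counter

THRESHOLD = 0.09  # the threshold quoted above

def average_risk(records, quasi_identifiers):
    """Mean of 1 / (equivalence class size) across all records.

    A simplified proxy: a record that is unique on its quasi-identifiers
    scores 1.0; a record hidden among 12 look-alikes scores 1/12.
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    return sum(1 / class_sizes[k] for k in keys) / len(keys)

qis = ["age_band", "sex"]

# Toy dataset A: two groups of 12 indistinguishable records each.
grouped = ([{"age_band": "40-49", "sex": "F"}] * 12
           + [{"age_band": "50-59", "sex": "M"}] * 12)

# Toy dataset B: every record is unique on its quasi-identifiers.
unique = [{"age_band": f"{a}-{a + 9}", "sex": s}
          for a in (20, 30, 40, 50) for s in ("F", "M")]

risk_grouped = average_risk(grouped, qis)  # 2 classes / 24 records
risk_unique = average_risk(unique, qis)    # 8 classes / 8 records

grouped_ok = risk_grouped <= THRESHOLD
unique_ok = risk_unique <= THRESHOLD
```

Under this proxy, groups of 12 fall just under the threshold (1/12 is about 0.083), while a dataset of entirely unique records sits at the maximum risk of 1.0.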
How Well does Synthetic Data Work?
There’s always a privacy-utility tradeoff. “If you maximize privacy or get very low privacy,” says El Emam, “there is going to be an impact on your data utility.”
That said, the level of concordance between the actual data and the synthetic data generated from models of that actual data is quite high.
Figure 4 below shows a mortality by age group analysis using the Canadian COVID-19 data. As Khaled explains, “factoring out all the other confounders we can see strong concordance between real and synthetic datasets in terms of the estimated mortality with a 95% confidence interval. The numbers across the top are the ‘confidence interval overlaps’ indicating how much overlap there was between real and synthetic dataset confidence intervals for each age group.”
As the results clearly show, the synthetic data closely resembles the actual data, demonstrating a high utility rate while maintaining privacy protections.
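For readers who want to see how a “confidence interval overlap” might be computed, here is a minimal sketch. The article does not say which formula was used for Figure 4, so this assumes one common definition: the length of the shared range, expressed as a fraction of each interval’s length and then averaged, with non-overlapping intervals scored as zero. The intervals use a simple normal approximation for a proportion, and the counts are invented.

```python
import math

def proportion_ci(events, n, z=1.96):
    """Approximate 95% confidence interval for a proportion (normal approximation)."""
    p = events / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return (max(0.0, p - half_width), min(1.0, p + half_width))

def ci_overlap(a, b):
    """Average share of each interval covered by the other; 0.0 if they do not overlap."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    if hi <= lo:
        return 0.0
    shared = hi - lo
    return 0.5 * (shared / (a[1] - a[0]) + shared / (b[1] - b[0]))

# Hypothetical mortality counts for one age group (not the Figure 4 data).
real_ci = proportion_ci(events=120, n=1000)
synthetic_ci = proportion_ci(events=112, n=1000)
overlap = ci_overlap(real_ci, synthetic_ci)
```

A value near 1 means the real and synthetic estimates are close to interchangeable for that age group; a value near 0 means an analyst would draw different conclusions from the two datasets.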
The Legal Considerations of Synthetic Data
There are three central legal questions that must be addressed when it comes to synthetic data, says Mike Hintze:
- As the act of creating synthetic data relies on using actual data – which is personal data under most privacy laws – what are the obligations that apply to that ‘act’ because it is processing actual personal data?
- Because very few organizations will have this capability in-house, you’re likely going to be working with a third party. So, what issues arise as a result of sharing that original dataset containing personal data with a third party? And,
- How, if at all, is the resulting synthetic dataset regulated under privacy law?
Clearly the use of the original dataset to generate or evaluate synthetic datasets is regulated under data protection law, says Hintze, and “there are a number of obligations to think about when it comes to generating synthetic data,” primarily under the GDPR, “and those laws that are based on it.”
First, do you have a legal basis for this type of processing of the personal information? Do you need consent of the data subjects to use actual personal data to generate a synthetic data set? “The answer around consent is likely ‘no,’” opines Hintze, but there are other legal bases that can come into play and will almost certainly apply. Under the GDPR, the most likely (and most obvious) one is the legitimate interest legal basis.
Relying on the legitimate interest basis under the GDPR requires a balancing test. The test weighs the interest of the data controller and/or a third party in processing the personal data against the risk to the fundamental rights and freedoms of the data subject.
As Khaled described, there are a lot of really compelling use cases for synthetic data…and in fact, the interests of other parties, such as researchers, are likely to be very, very high while the risks to the data subject are likely to be very, very low.
There are other laws that might come into play, such as HIPAA, when healthcare data is used. But while HIPAA “requires individual authorization for a lot of different data uses, there are a number of exceptions for which individual authorization is not required,” reminds Hintze. “While synthetic data is notably different than de-identified data,” argues Hintze, “this language in HIPAA that uses de-identification as an example should pretty clearly apply” to synthesizing data, “so there is a very compelling case for saying [synthetic data] falls within one or more of the HIPAA exceptions.”
The second fairly straightforward consideration concerns the obligations of the data controller sharing data with a third party who will generate the synthetic data on the controller’s behalf.
“Presuming that the entity creating synthetic data from a real database is not doing anything else with that original dataset, this would fall into the provisions under most privacy laws that allow for the use of service providers,” says Hintze. “Under the GDPR that would be a ‘data processor,’ under the CCPA that would be a ‘service provider,’ and under HIPAA that would be a ‘business associate.’”
The final question, then: is the synthetic dataset itself regulated by privacy law?
“I would say that if you go through the definitions of personal information under most privacy laws, [synthetic data] is going to fall outside the definition of personal data,” says Hintze. “If this data is not data about an actual individual, it’s not personal data. That means it can really be used for any purpose, including being shared publicly.”
Because synthetic data is not data relating to a natural person (under the GDPR), to a particular consumer or household (under the CCPA), or to an individual (under HIPAA), it is not personally identifiable or sensitive information and falls outside the scope of these privacy laws.
“Is the future of data sharing synthetic data?” asks Khaled El Emam. “The evidence that’s accumulating around privacy and utility is quite high. One of the main advantages of data synthesis is that the process is largely automated so it addresses some of the challenges with historical techniques that have been used for creating non-identifiable data which requires a lot of skill and expertise to execute.”
“Although synthetic data first started to be used in the ‘90s, an abundance of computing power and storage space [has] brought more widespread use.” (Dilmegani, 2021)
A confidence interval is a range of values describing the uncertainty surrounding an estimate in terms of its relationship to the actual numbers. In other words, does the model you created accurately represent what it purports to represent?