What Privacy Officers Need to Know About Synthetic Data
Finding a balance between privacy and the potential benefits of sharing personally identifiable or sensitive data (e.g., personal health information) can seem an intractable problem, putting privacy advocates and those who wish to use data for commercial or research purposes at loggerheads.
There is, however, a rapidly emerging solution to the use and sharing of data across organizations and, indeed, across borders: data synthesis.
Synthetic data, as its name implies, is not actual data taken from real-world events or individuals’ attributes. Rather, it is data generated by a computer – i.e., by synthetic data generation tools – that matches the key statistical properties of the real sample data. Importantly, the resulting data is not actual data that has been pseudonymized or anonymized. It is artificial data that does not map back to any actual natural person.
The ability to create and share data that provides insights into cohorts and segments without impinging on privacy has profound implications for the adtech and martech ecosystem struggling with privacy-centric moves by Apple, Google, and other platforms.
During the Spokes Privacy Conference, Dr. Khaled El Emam, CEO of Replica Analytics, provided a highly accessible technical overview of Data Synthesis and Synthetic Data. He was joined by Mike Hintze, Partner at Hintze Law PLLC and a Future of Privacy Forum Senior Fellow.
How Synthetic Data is Generated
Figure 1 below illustrates the process: from i) source data, through ii) model creation, to iii) the application of that model to create the synthetic dataset. (The images on the right-hand side are fakes generated by fitted models that were trained on images of real people.)
The synthesized consumers are in a sense the ultimate deep fake. They are fake, but as El Emam notes, they “have the same properties as real people, [are] very realistic, and they work quite well.”
The Simulator Exchange
Furthermore, the data generation is scalable: 100, 1,000, or 10 million records can be generated once the model has been built. The implications for data-intensive sectors like adtech are meaningful: imagine comparing simulated consumer profiles based on, for example, zero-party data, versus profiles created from cookies and pixels.
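The fit-once, sample-at-any-scale workflow described above can be sketched in a few lines. This is a deliberately minimal illustration using a multivariate Gaussian as the fitted model and made-up "customer" data; real synthesis tools use far richer models, but the two-step shape – fit a model to the source data, then sample as many artificial records as needed – is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the real source data: age and annual spend
# for 500 customers (randomly generated for this sketch).
real = np.column_stack([
    rng.normal(45, 12, 500),      # age
    rng.normal(2000, 600, 500),   # annual spend
])

# Step 1: fit a simple generative model -- here just the empirical
# mean and covariance, i.e. a multivariate Gaussian.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Step 2: sample any number of synthetic records from the model.
# None of these rows corresponds to a real individual.
def synthesize(n):
    return rng.multivariate_normal(mean, cov, size=n)

synthetic = synthesize(10_000)
print(synthetic.shape)  # (10000, 2)
```

Note that only `mean` and `cov` – the "simulator" – are needed to generate data; the source records never leave the fitting step.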
Again, access to actual data is not provided, only the simulators (i.e., the data models). And “providing access to different types of data simulators is an effective way to enable broad access to data like consumer profiles within, or external to, the organization.”
Are there Privacy Risks with Synthetic Data?
As illustrated in Figure 1, the source data – which contains the personally identifiable or sensitive information – is used only to create a model, which is then “fitted” and subsequently used to create the synthetic data. Again, there is no one-to-one relationship. The likelihood that the data could be “re-identified” is very small, but one cannot say zero.
So how much risk is there?
El Emam notes that there are privacy models for evaluating the risk of re-identification of synthetic datasets.
Figure 3 (below) details the results of such a re-identification risk model performed on actual health datasets.
The commonly used risk threshold for deeming a health dataset acceptably non-identifiable is 0.09. “It is used by the European Medicines Agency, Health Canada, and the Centers for Medicare and Medicaid.” Figure 3 models the re-identification risk for both the original (real) dataset and the synthesized dataset. The results show that the risk of re-identification is significantly lower for the synthesized dataset than for the original dataset (and well below the risk threshold).
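The article does not specify the exact risk model used, but one widely used average re-identification risk estimate is the mean of 1/k across all records, where k is the size of each record's "equivalence class" (the group of records sharing the same quasi-identifiers). A toy sketch with entirely made-up records, compared against the 0.09 threshold cited above:

```python
from collections import Counter

# Hypothetical records, reduced to quasi-identifiers: (age band, ZIP3, sex).
records = [
    ("30-39", "100", "F"), ("30-39", "100", "F"), ("30-39", "100", "M"),
    ("40-49", "100", "M"), ("40-49", "100", "M"), ("40-49", "100", "M"),
    ("50-59", "200", "F"), ("50-59", "200", "F"), ("30-39", "200", "M"),
]

# Size of each equivalence class.
class_sizes = Counter(records)

# Average re-identification risk: mean of 1/k over all records,
# where k is the size of the record's equivalence class.
avg_risk = sum(1 / class_sizes[r] for r in records) / len(records)

THRESHOLD = 0.09  # threshold cited in the article for health data
print(f"average risk = {avg_risk:.3f}")  # high here: the toy dataset is tiny
print(avg_risk <= THRESHOLD)
```

With only nine records the risk is far above the threshold, which illustrates why small or sparse datasets are hard to release directly and why well-generated synthetic data scores so much lower.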
“This is a common pattern that we’ve seen if synthetic data is generated well (i.e., not overfitted or biased) and using generally accepted approaches,” says Khaled. Furthermore, he adds, “these are the privacy risks without including any additional controls… If you add security controls, privacy controls, and contractual controls to manage residual risk, then your overall risk of re-identification can be significantly lower.”
How Well does Synthetic Data Work?
That said, the level of concordance between the actual data and the synthetic data generated from models fitted to it is actually quite high.
Figure 4 below shows a mortality-by-age-group analysis using the Canadian COVID-19 data. As Khaled explains, “factoring out all the other confounders we can see strong concordance between real and synthetic datasets in terms of the estimated mortality with a 95% confidence interval. The numbers across the top are the ‘confidence interval overlaps’ indicating how much overlap there was between real and synthetic dataset confidence intervals for each age group.”
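The confidence-interval overlap statistic mentioned above can be computed per estimate; a common formulation takes the overlapping length as a fraction of each interval and averages the two fractions. A minimal sketch, with made-up intervals (not the Canadian COVID-19 figures):

```python
def ci_overlap(real_ci, synth_ci):
    """Overlap between two confidence intervals, from 0 (disjoint)
    to 1 (identical), averaged over both intervals."""
    lo = max(real_ci[0], synth_ci[0])
    hi = min(real_ci[1], synth_ci[1])
    overlap = max(0.0, hi - lo)
    return 0.5 * (overlap / (real_ci[1] - real_ci[0]) +
                  overlap / (synth_ci[1] - synth_ci[0]))

# Hypothetical 95% CIs for mortality in one age group:
# real data vs. synthetic data.
print(round(ci_overlap((0.02, 0.06), (0.03, 0.07)), 2))  # 0.75
```

Values near 1 across all age groups are what "strong concordance" looks like numerically: the synthetic data supports essentially the same statistical conclusions as the real data.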
As the results clearly show, the synthetic data closely resembles the actual data, demonstrating high utility while maintaining privacy protections.
The Legal Considerations of Synthetic Data
There are three central legal questions that must be addressed when it comes to synthetic data, says Mike Hintze:
- As the act of creating synthetic data relies on using actual data – which is personal data under most privacy laws – what obligations apply to that act of processing?
- Because very few organizations will have this capability in-house, you are likely to be working with a third party. What issues arise from sharing the original dataset containing personal data with that third party? And,
- How, if at all, is the resulting synthetic dataset itself regulated under privacy law?
Clearly the use of the original dataset to generate or evaluate synthetic datasets is regulated under data protection law, says Hintze, and “there are a number of obligations to think about when it comes to generating synthetic data,” primarily under the GDPR, “and those laws that are based on it.”
First, do you have a legal basis for this type of processing of the personal information? Do you need the consent of the data subjects to use actual personal data to generate a synthetic dataset? The answer around consent is likely “no,” opines Hintze, but there are other legal bases that can come into play and will almost certainly apply. Under the GDPR, the most likely (and obvious) one is the legitimate interest legal basis.
Relying on the legitimate interest basis under the GDPR requires a balancing test: the legitimate interests of the data controller and/or a third party in the processing of the personal data are weighed against the risks to the fundamental rights and freedoms of the data subject.
As Khaled described, there are a lot of really compelling use cases for synthetic data…and in fact, the interest of other parties such as researchers are likely to be very, very high while the risks to the data subject are likely to be very, very low.
There are other laws that might come into play, such as HIPAA, when healthcare data is used. But while HIPAA “requires individual authorization for a lot of different data uses, there are a number of exceptions for which individual authorization is not required,” reminds Hintze. “While synthetic data is notably different from de-identified data,” argues Hintze, “this language in HIPAA that uses de-identification as an example should pretty clearly apply” to synthesizing data, “so there is a very compelling case for saying [synthetic data] falls within one or more of the HIPAA exceptions.”
“Presuming that the entity creating synthetic data from a real database is not doing anything else with that original dataset, this would fall into the provisions under most privacy laws that allow for the use of service providers,” says Hintze. “Under the GDPR that would be a ‘data processor,’ under the CCPA that would be a ‘service provider,’ and under HIPAA that would be a ‘business associate.’”
The final question then, is the synthetic dataset itself regulated by privacy law?
“I would say that if you go through the definitions of personal information under most privacy laws, [synthetic data] is going to fall outside the definition of personal data. If this data is not data about an actual individual, it’s not personal data. That means it can really be used for any purpose, including being shared publicly.”
“Is the future of data sharing synthetic data?” asks Khaled El Emam. “The evidence that’s accumulating around privacy and utility is quite high. One of the main advantages of data synthesis is that the process is largely automated, so it addresses some of the challenges with historical techniques that have been used for creating non-identifiable data, which require a lot of skill and expertise to execute.”
“Although synthetic data first started to be used in the ’90s, an abundance of computing power and storage space [has] brought more widespread use” (Dilmegani, 2021).
A confidence interval is a range of values describing the uncertainty surrounding an estimate in terms of its relationship to the actual numbers. In other words, does the model you created accurately represent what it purports to represent?