Synthetic Data: Useful, Privacy-Risk-Free Data

Computer vision models need to be trained on vast data sets, and synthetic data—images generated using the same CGI software as big budget movies and games—can train that AI without compromising anyone’s personal information.

The people have spoken. They want stricter privacy guarantees when it comes to the collection, use, and dissemination of their personal details.

Traditionally, the problem has been that compiling useful data sets means intruding on people’s privacy, while guaranteeing privacy means settling for smaller or lower-quality data sets, or stripping them of information to the point that they are no longer useful.

How can we increase both data utility and privacy?

At the risk of simplifying a complex problem: synthetic data is the solution.

Let’s take a step back. First, why do we need more data? To train AI.[1] AI shapes our present and our future: we are already applying it to self-driving cars, robot surgeons, and virtual assistants. Machine learning and deep learning, as subsets of AI, make up a new programming paradigm, in which engineers ask how a computer can learn its own rules just by looking at data. With machine learning, humans supply the data along with the answers expected from that data, and the computer works out the rules that connect them (this is the AI, so to speak). The resulting model can then be deployed on new data to produce original answers. Bottom line: all else being equal, the more data a model can train on, the better it tends to perform.
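To make that paradigm concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption: the toy numbers, and the names learn_threshold and predict, are invented for this example. We hand the computer data plus the expected answers, and it picks its own rule (a single cut-off value) instead of us hard-coding one.

```python
# Hypothetical toy data: (feature value, expected answer) pairs.
examples = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]


def learn_threshold(pairs):
    """Pick the cut-off that misclassifies the fewest training pairs."""
    values = sorted(x for x, _ in pairs)
    # Candidate rules: midpoints between neighbouring feature values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum((x > threshold) != bool(y) for x, y in pairs)

    return min(candidates, key=errors)


def predict(threshold, x):
    """Apply the learned rule to a value the model has never seen."""
    return int(x > threshold)


threshold = learn_threshold(examples)
print(threshold, predict(threshold, 5.0))
```

Real models learn millions of parameters rather than one cut-off, but the division of labor is the same: people supply examples and answers, and the rule comes out of the data.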

So, to push technological development, we need more data. But not just any data – we need quality data. A model will only be as learned as the data on which it is trained.[2]

Which leads us to our second question: where do we get the data now? Today, the norm is to use real data sets. Walk down any street in San Francisco and you are all but guaranteed to see at least one car outfitted with sensors and cameras, gathering data to train its autonomous vehicle brethren. Also par for the course is data scraped off the internet. That old picture you uploaded to that website you built in third grade? Publicly available, so yeah, that’s fair game.

There are many problems with using real data to train AI. Some are technical: labelling and annotating data is a tedious, imprecise manual exercise that falls short of the detail and richness needed for the increasingly complex tasks we demand of our AI. Beyond that, using real data to train models is rife with privacy risks, especially with the rise of comprehensive privacy regimes like the GDPR in Europe and the CCPA in California. To counteract these risks, real data must undergo a de-identification process, which, as mentioned above, reduces the utility of the data set.

De-identification, sometimes referred to as anonymization, strips a data set of personal identifiers. What is anonymized, and how, matters: if the data elements used to identify an individual are removed from a data set, the remaining data becomes nonpersonal information, and privacy and data protection laws generally do not apply. But the data set is now less rich and offers less information on which an AI can train.
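As a rough illustration of what stripping identifiers looks like in practice, here is a small Python sketch. The column names, the sample record, and the deidentify helper are all hypothetical; real pipelines decide which fields count as identifiers case by case.

```python
# Assumed set of direct identifiers; a real data set needs its own review.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}


def deidentify(record, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of the record with the direct identifiers removed."""
    return {key: value for key, value in record.items() if key not in identifiers}


# Invented sample record, for illustration only.
records = [
    {"name": "Resident A", "email": "a@example.com", "phone": "555-0100",
     "zip": "94107", "birth_year": 1984, "purchase": "bike"},
]

stripped = [deidentify(r) for r in records]
print(stripped[0])
```

Note what survives: ZIP code, birth year, and the purchase itself. Those leftover fields are where the remaining utility lives.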

Further, while there is a regulatory distinction between de-identified/anonymized information and pseudonymized data (the legal term for data that can be reversed to re-identify individuals), the truth of the matter is that all anonymized data is subject to reversal. The only real bar is the state of technology at any given point in time. Anonymized data today becomes pseudonymized data tomorrow as AI becomes better at re-identifying data points. In the future, algorithms will likely be capable of linking seemingly innocuous data points to construct very intimate profiles of us.
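Linking does not even require sophisticated AI. The sketch below, in Python, shows the basic idea under invented assumptions: a "de-identified" data set and a separate public list that happen to share two innocuous fields. All records, field names, and the link helper are hypothetical.

```python
# Hypothetical released data: names removed, other fields kept.
deidentified = [
    {"zip": "94107", "birth_year": 1984, "purchase": "insulin"},
    {"zip": "94110", "birth_year": 1990, "purchase": "hiking boots"},
    {"zip": "94110", "birth_year": 1990, "purchase": "tent"},
]

# Hypothetical public list (think of a directory or social profile dump).
public = [
    {"name": "Resident A", "zip": "94107", "birth_year": 1984},
    {"name": "Resident B", "zip": "94110", "birth_year": 1990},
    {"name": "Resident C", "zip": "94110", "birth_year": 1990},
]


def link(released, auxiliary, keys):
    """Re-attach names wherever the shared fields match exactly one person."""
    index = {}
    for person in auxiliary:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    matches = []
    for row in released:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique match means the row is re-identified
            matches.append((names[0], row))
    return matches


matches = link(deidentified, public, keys=("zip", "birth_year"))
print(matches)
```

Here one row is re-identified because its combination of ZIP code and birth year is unique; the other two are protected only because two people happen to share theirs. Add one more shared field and that protection can disappear.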

And thus our third question: where can we get data that is useful and not inevitably subject to re-identification? Enter synthetic data.

Synthetic data is useful: because it is computer-generated, it inherently comes with pixel-perfect labels and annotations, and it has the potential to cover all edge cases, using ML techniques to augment real distributions.
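Why do the labels come for free? Because the program that draws the image already knows where everything is. The Python sketch below is a deliberately tiny stand-in for a CGI pipeline (the render_scene function and its parameters are invented for illustration): it draws one rectangle on a blank grid and emits the matching mask and bounding box in the same pass.

```python
import random


def render_scene(width, height, rng):
    """Draw one filled rectangle; return the image, its mask, and its box."""
    w = rng.randint(2, width // 2)
    h = rng.randint(2, height // 2)
    x0 = rng.randint(0, width - w)
    y0 = rng.randint(0, height - h)
    brightness = rng.randint(100, 255)

    image = [[0] * width for _ in range(height)]
    mask = [[0] * width for _ in range(height)]
    for y in range(y0, y0 + h):
        for x in range(x0, x0 + w):
            image[y][x] = brightness  # the "pixels"
            mask[y][x] = 1            # the label, written in the same step
    return image, mask, (x0, y0, w, h)


rng = random.Random(0)  # seeded so the scene is reproducible
image, mask, box = render_scene(16, 12, rng)
print(box)
```

No human annotator is involved, so there is nothing to mislabel, and no real person appears in the scene. Varying the random seed yields as many labeled examples as needed.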

Synthetic data also erases privacy concerns. With real data, we can snooze the consequences: try to strip it (generalize and suppress it) to the point where, today, we can no longer identify the discrete real data points within the set. But this is a temporary band-aid. Synthetic data is fake data; it contains no personal identifiers that could be susceptible to re-identification down the road. Synthetic data guarantees privacy by changing the paradigm and getting rid of any need to use real data.
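For readers curious what "generalize and suppress" means for real data, here is a hedged Python sketch. The records, the field names, the helper functions, and the choice of k = 2 are all assumptions made for illustration, loosely in the spirit of k-anonymity rather than any specific regulation or tool.

```python
from collections import Counter


def generalize(record):
    """Coarsen details: keep a 3-digit ZIP prefix and the decade of birth."""
    return {
        "zip": record["zip"][:3] + "**",
        "birth_decade": record["birth_year"] // 10 * 10,
        "purchase": record["purchase"],
    }


def suppress_rare(rows, k):
    """Drop rows whose (zip, birth_decade) group has fewer than k members."""
    def group(row):
        return (row["zip"], row["birth_decade"])

    counts = Counter(group(row) for row in rows)
    return [row for row in rows if counts[group(row)] >= k]


# Invented sample records.
records = [
    {"zip": "94107", "birth_year": 1984, "purchase": "bike"},
    {"zip": "94110", "birth_year": 1988, "purchase": "tent"},
    {"zip": "10001", "birth_year": 1972, "purchase": "boots"},
]

released = suppress_rare([generalize(r) for r in records], k=2)
print(released)
```

The cost is visible right away: one of three records is gone, and the two that remain have lost their exact ZIP codes and birth years. That is the utility trade-off real data forces on us.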

So yes, this is a simplification of a complex problem, but synthetic data may be how we strike the balance between privacy and utility.

With synthetic data, we can have our cake and eat it too: more precise, accurate, and complex AI (which necessitates detailed data), and guaranteed privacy.

[1] In the words of Francois Chollet, AI & deep learning researcher and developer of Keras: “A concise definition of the field of [AI] would be as follows: the effort to automate intellectual tasks normally performed by humans. AI is a general field that encompasses machine learning and deep learning…” See Chollet, F. Deep Learning with Python. Manning Publications (2017).
[2] Moreover, models are notoriously ‘stupid’: they will find the path of least resistance and follow that until taught differently. AI models are creatures of statistics – they will output the statistics of the data set they were trained on. Check out this nifty convolutional neural network (CNN) visualizer and see for yourself: https://www.cs.ryerson.ca/~aharley/vis/conv/. Because of this reality, anyone training AI needs to be very careful and cognizant of the limits and biases inherent in each data set.
