Over the last year or so, interest in methods for data synthesis has grown rapidly. These methods can be applied to many different types of data, such as simulated environments for autonomous vehicles, text, images, audio and video. Our main focus here is on structured data of interest to the life sciences industry.
The main problem that data synthesis methods solve is data access. While increasingly strict privacy regulations are being implemented in jurisdictions around the world, the demand for access to more data, in terms of both granularity and volume, continues to grow. Synthetic data is becoming an increasingly good way to satisfy both requirements, because such data cannot be accurately mapped back to real individuals.
The two biggest use cases in that context are generating data for AI and machine learning projects, and generating data for testing statistical programs. These can be for internal consumption within the organisation or for external sharing with partners and collaborators.
The datasets can be from clinical trials or they can be real-world data (RWD). The RWD can originate from healthcare providers and research registries, or be collected directly from patients (e.g. patient support programmes).
We can try to create synthetic data by specifying the distributions and attributes of the data and then generating data points through simple simulation. For example, we may generate a thousand records that are 80% female and 20% male, and 40% African American and 60% White, simply by specifying these parameters, as sketched below. The challenge is that we need to know, and accurately specify, what all the distributions and attributes are. Recall that clinical trial datasets, and some RWD, can have thousands of variables; specifying each one accurately and realistically can be challenging.
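As a minimal sketch in base R, using the proportions from the example above (the variable names are our own), simple simulation from specified marginals looks like this:

```r
# Simple simulation: draw each variable independently from its
# specified marginal distribution.
set.seed(42)
n <- 1000

sim <- data.frame(
  gender = sample(c("Female", "Male"), n, replace = TRUE,
                  prob = c(0.80, 0.20)),
  race   = sample(c("African American", "White"), n, replace = TRUE,
                  prob = c(0.40, 0.60))
)

# Confirm the simulated marginals match the specification
prop.table(table(sim$gender))
prop.table(table(sim$race))
```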
More importantly, a valid synthetic dataset also needs to model the relationships among the variables. It is not enough for the marginal distributions to reflect the real data; the internal structure, correlations and interactions have to be realistic as well. Going back to our example, what should the correlation between race and gender be? That information is needed to synthesise observations with an accurate internal structure, as the sketch below illustrates.
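To capture that structure with simple simulation, we would have to specify the full joint distribution rather than the marginals alone. A sketch, with hypothetical cell probabilities chosen so that the marginals above still hold:

```r
# Sample from a specified joint distribution of gender and race.
# The cell probabilities below are hypothetical; in practice they
# would have to be estimated from real data.
set.seed(42)
joint <- expand.grid(
  gender = c("Female", "Male"),
  race   = c("African American", "White")
)
joint$prob <- c(0.35, 0.05, 0.45, 0.15)  # sums to 1; marginals stay 80/20 and 40/60

idx <- sample(nrow(joint), 1000, replace = TRUE, prob = joint$prob)
sim <- joint[idx, c("gender", "race")]

# The cross-tabulation now reflects the specified association,
# not independence between the two variables
prop.table(table(sim$gender, sim$race))
```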
Therefore, the second approach is to start from a real dataset, build a machine learning model that characterises it, and then use that model to generate the synthetic data. This captures the marginal distributions, as well as the internal structure, correlations and interactions, of the original dataset. That is the approach we are interested in exploring.
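As an illustration of this model-based approach, here is a minimal sketch using the open-source synthpop R package and its bundled public survey dataset SD2011. This is one of several possible tools, not necessarily the one used in the workshop:

```r
# Model-based synthesis: fit models to a real dataset, then draw
# synthetic records from those models.
library(synthpop)

# SD2011 is a public survey dataset that ships with synthpop
real <- SD2011[, c("sex", "age", "edu", "income")]

# syn() fits a sequence of conditional models (CART by default),
# capturing marginals and the relationships between variables
synth <- syn(real, seed = 42)

head(synth$syn)  # the generated synthetic records
```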
During the workshop on 22nd September (09:00–13:00 EDT), attendees will get an applied look at data synthesis, with an introduction to methods for generating and evaluating synthetic datasets. This is intended to be a workshop for data scientists who are familiar with R. The basic set-up will be a mix of instruction and hands-on programming in R.
Through this workshop you will learn:
- The basic concepts of data synthesis
- How to synthesise datasets, with examples and hands-on exercises in R
- How to evaluate the utility of synthetic data, with examples and hands-on exercises in R
- How to evaluate the privacy risks in synthetic data, with examples and hands-on exercises in R
- Practical tips for implementing data synthesis
The first hands-on part of the workshop will focus on using existing generative models to synthesise datasets. These models have already been built from real datasets; for this exercise, we will use generative models built from public datasets.
Following that, attendees will generate synthetic datasets themselves, experimenting with changing some of the synthesis parameters. The utility of the generated synthetic data will also be evaluated, and various utility reports can be produced; these reports indicate how good the synthetic datasets are. Part of the lecture component will be to go over the utility reports and understand how to interpret them.
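As a sketch of what such a utility check can look like, again using the synthpop package purely as an illustration:

```r
# Compare the synthetic data against the real data it was built from.
library(synthpop)

real  <- SD2011[, c("sex", "age", "edu", "income")]
synth <- syn(real, seed = 42)

# Side-by-side distributions of real vs synthetic values
compare(synth, real)

# A global utility measure based on propensity scores (pMSE):
# lower values mean the synthetic data is harder to distinguish
# from the real data
utility.gen(synth, real)
```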
At the end of this workshop, attendees will have a concrete, applied understanding of how to synthesise clinical data and how to use generative models, which will get them started on solving their data access challenges.
The computing environment we will use is Jupyter Notebooks. Participants are expected to have a working knowledge of the R programming language to be able to complete the hands-on component of the workshop.
The specific agenda will be as follows:
- What is synthetic data? (20 mins)
- When to synthesise data (20 mins)
- Q&A
- Hands-on exercise: Using simulators to generate synthetic data (30 mins)
- 30-min break
- Evaluating the utility of synthetic data (20 mins)
- Q&A
- Methods for data synthesis (30 mins)
- Q&A
- Hands-on exercise: Generating datasets (30 mins)
- Assessing privacy risks in synthetic data (20 mins)
- Q&A