The greatest value of synthetic data in 2023

Developing successful AI and ML models requires access to large amounts of high-quality data. However, collecting such data is challenging because:

  • Many business problems that AI/ML models can solve require access to sensitive customer data, such as personally identifiable information (PII) or protected health information (PHI). The collection and use of sensitive data raises privacy concerns and leaves businesses vulnerable to data breaches. For this reason, privacy regulations such as GDPR and CCPA limit the collection and use of personal data and impose fines on companies that violate these regulations.
  • Some types of data are expensive or infrequent to collect. For example, collecting data that represents a variety of real-world road events for self-driving cars can be prohibitively expensive. Bank fraud, on the other hand, is an example of a rare event: precisely because fraud is rare, collecting enough data to develop machine learning models that predict it is challenging.

As a result, enterprises are turning to a data-centric approach to AI/ML development, including synthetic data, to solve these problems.

Generating synthetic data is inexpensive compared to collecting large datasets and can support AI/deep learning model development or software testing without compromising customer privacy. It is estimated that by 2024, 60% of the data used to develop AI and analytics projects will be generated synthetically.

The use of AI-generated synthetic data has grown rapidly since 2020, and synthetic data is expected to exceed real data in AI models by the 2030s.

Figure 1: Synthetic data will exceed real data in the future

What is synthetic data?

Synthetic data, as the name suggests, is data that is artificially created rather than generated by actual events. It is often created with the help of algorithms and is used in a wide range of activities, including as test data for new products and tools, model validation, and AI model training. Synthetic data is a type of data augmentation.

Why is synthetic data important now?

Synthetic data is important because it can be generated to meet specific needs or conditions that are not available in existing (real) data. This is useful in many situations, for example:

  • When privacy requirements limit the availability of data or how the data can be used
  • Testing a product for release requires data, but this data either doesn’t exist or is unavailable to testers
  • Machine learning algorithms require training data that, as in the case of self-driving cars, can be expensive to generate in real life

While synthetic data first began to be used in the 1990s, abundant computing power and storage space in the 2010s led to wider use of synthetic data.

What are its applications?

Industries that can benefit from synthetic data:

  • Automotive and robotics
  • Financial services
  • Healthcare
  • Manufacturing
  • Security
  • Social media

Business functions that can benefit from synthetic data include:

  • Marketing
  • Machine learning
  • Agile development and DevOps
  • Human resources

Synthetic data allows us to continue developing new and innovative products and solutions for which the required data would otherwise not exist or be available.

Comparing the performance of synthetic data and real data

Data is used in applications, and the most direct measure of data quality is the validity of the data at the time of use. Machine learning is one of the most common data use cases today. MIT scientists wanted to measure whether machine learning models derived from synthetic data could perform as well as models built from real data. In a 2017 study, they divided data scientists into two groups: one used synthetic data and the other used real data. 70% of the time, the group using synthetic data was able to produce results comparable to the group using real data. This gives synthetic data an advantage over other privacy-enhancing technologies (PETs), such as data masking and anonymization.

Benefits of synthetic data

Being able to generate data that simulates the real thing seems like an endless way to create test and development scenarios. While there is a lot of truth to this, it is important to remember that any synthetic model derived from data can only replicate specific properties of that data, which means it can ultimately only model general trends.

That said, synthetic data offers clear benefits compared to real data:

  • Overcome real data usage restrictions: Real data may be subject to usage restrictions due to privacy rules or other regulations. Synthetic data eliminates this problem by replicating all the important statistical properties of real data without exposing it.

These benefits suggest that as our data becomes more complex and more tightly protected, the creation and use of synthetic data will only grow.

Synthetic data generation/creation 101

When determining the best way to create synthetic data, it's important to first consider the type of synthetic data you intend to have. There are three broad categories to choose from, each with different advantages and disadvantages:

Fully synthetic: This data does not contain any original data. This means that it is almost impossible to re-identify any single unit, while all variables remain fully available.

Partially synthetic: Only sensitive values are replaced with synthetic data, which relies heavily on imputation models. This results in less model dependence than full synthesis, but it does mean that some disclosure is possible because the true values are retained in the dataset.

Hybrid synthetic: Hybrid synthetic data is derived from both real and synthetic data. The underlying distribution of the original data is studied and the nearest neighbor of each data point is found: for each record of the real data, a close record in the synthetic data is selected, and the two are combined to generate hybrid data while preserving the relationships and integrity between the other variables in the dataset.
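To make partial synthesis concrete, here is a minimal hypothetical sketch using pandas and numpy (the dataset and column names are invented for illustration): only the sensitive salary column is replaced with draws from a distribution fitted to it, while the non-sensitive columns keep their true values.

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Hypothetical dataset: 'age' and 'city' are non-sensitive, 'salary' is sensitive.
df = pd.DataFrame({
    "age": rng.integers(22, 65, size=1000),
    "city": rng.choice(["Boston", "Austin", "Denver"], size=1000),
    "salary": rng.lognormal(mean=11, sigma=0.4, size=1000),
})

# Partial synthesis: fit a simple imputation model (here, a log-normal
# distribution) to the sensitive column and replace its values with draws
# from that model. All other columns keep their true values.
log_salary = np.log(df["salary"])
synthetic_salary = rng.lognormal(mean=log_salary.mean(),
                                 sigma=log_salary.std(),
                                 size=len(df))
df_partial = df.assign(salary=synthetic_salary)

print(df_partial.head())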

General strategies for constructing synthetic data include:

Drawing numbers from a distribution: This method works by observing real statistical distributions and sampling fake data that replicates them. It can also include creating generative models.
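A minimal sketch of this strategy, assuming numpy and scipy are available: fit a normal distribution to observed data, then sample new synthetic values that replicate its statistics.

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Pretend these are real, observed measurements.
real_data = rng.normal(loc=170.0, scale=8.0, size=5000)

# Fit a parametric distribution to the real data...
mu, sigma = stats.norm.fit(real_data)

# ...and draw synthetic samples that replicate its statistics.
synthetic_data = stats.norm.rvs(loc=mu, scale=sigma, size=5000,
                                random_state=rng)

print(f"real mean/std:      {real_data.mean():.2f} / {real_data.std():.2f}")
print(f"synthetic mean/std: {synthetic_data.mean():.2f} / {synthetic_data.std():.2f}")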

Agent-based modeling: To obtain synthetic data in this approach, a model is created to explain the observed behavior and then the same model is used to reproduce random data. It emphasizes understanding the impact of interactions between agents on the entire system.
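A toy agent-based sketch (the behavioral rule and all parameters are hypothetical): each agent follows a simple purchase rule influenced by its neighbors, and the logged behavior becomes the synthetic dataset.

import numpy as np

rng = np.random.default_rng(seed=1)

def simulate_shoppers(n_agents=100, n_steps=50, base_buy_prob=0.05):
    """Toy agent-based model: each agent's chance of buying rises with
    the share of agents that bought in the previous step."""
    bought_last = np.zeros(n_agents, dtype=bool)
    log = []  # the synthetic dataset: (step, agent, bought)
    for step in range(n_steps):
        # Social influence: observed purchases nudge the buy probability.
        influence = bought_last.mean()
        p = base_buy_prob + 0.3 * influence
        bought = rng.random(n_agents) < p
        for agent, b in enumerate(bought):
            log.append((step, agent, int(b)))
        bought_last = bought
    return log

synthetic_events = simulate_shoppers()
print(synthetic_events[:5])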

Deep learning models: Variational autoencoders (VAEs) and generative adversarial networks (GANs) are synthetic data generation techniques that increase data utility by feeding the model with more data. Feel free to learn more about how data augmentation and synthetic data support deep learning.

Challenges of synthetic data

While synthetic data offers various benefits that can streamline an organization’s data science projects, it also has limitations:

  • Possibly missing outliers: Synthetic data can only mimic real-world data; it is not an exact replica. Therefore, synthetic data may not cover some of the outliers present in the original data (see the sketch below). Yet outliers can matter more than regular data points, as Nassim Nicholas Taleb explains in depth in his book The Black Swan.
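This limitation is easy to demonstrate with a hypothetical sketch: "real" data is drawn from a heavy-tailed distribution, a thin-tailed normal distribution is fitted to it, and the synthetic sample visibly misses the extreme values even though the bulk statistics match.

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)

# "Real" data with heavy tails (Student's t), so extreme outliers occur.
real_data = stats.t.rvs(df=3, size=100_000, random_state=rng)

# Synthesize by fitting a normal distribution, which has thin tails.
mu, sigma = stats.norm.fit(real_data)
synthetic_data = stats.norm.rvs(loc=mu, scale=sigma, size=100_000,
                                random_state=rng)

# The bulk statistics match, but the extremes do not.
print(f"real max:      {real_data.max():.1f}")
print(f"synthetic max: {synthetic_data.max():.1f}")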

Machine Learning and Synthetic Data: Building AI


Figure 2: The relationship between ML and synthetic data

The role of synthetic data in machine learning is rapidly increasing. This is because machine learning algorithms are trained using large amounts of data that may be difficult to obtain or generate without synthetic data. It can also play an important role in creating algorithms for image recognition and similar tasks that are becoming the baseline for artificial intelligence.

There are several other benefits of using synthetic data to help develop machine learning:

  • Ease of generating data once an initial synthesis model/environment has been established
  • Label accuracy that would be expensive or even impossible to obtain by hand
  • Flexibility of the synthesis environment, which can be adjusted as needed to improve the model
  • Ability to replace data containing sensitive information

Two synthetic data use cases that are widely adopted in their respective machine learning communities are:

Autonomous driving simulation

Learning through real-life experiments is hard in life, and it is hard for algorithms as well.

  • It is especially hard for those who end up being hit by a self-driving car, as in the fatal Uber crash in Arizona. [2] While Uber has scaled back its operations in Arizona, it should probably ramp up simulations to train its models.

To minimize data generation costs, industry leaders such as Google have relied on simulations to create millions of hours of synthetic driving data to train their algorithms. [3]

Generative Adversarial Network (GAN)

These networks, also called GANs or generative adversarial neural networks, were introduced in 2014 by Ian Goodfellow et al. and are a major breakthrough in image generation and recognition. A GAN consists of a discriminator network and a generator network: the generator produces synthetic images that are as close to reality as possible, while the discriminator learns to distinguish real images from synthetic ones. The two networks are trained against each other, each improving as it learns to beat the other.
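A minimal GAN sketch in PyTorch (assumed installed; the architecture, 1-D Gaussian target, and hyperparameters are illustrative choices, not the original paper's setup): the generator learns to produce samples resembling the real distribution while the discriminator learns to tell them apart.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Generator: maps random noise to synthetic 1-D samples.
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
# Discriminator: scores how "real" a sample looks.
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(2000):
    real = torch.randn(64, 1) * 1.5 + 4.0   # "real" data: N(4, 1.5)
    fake = G(torch.randn(64, 8))

    # Train the discriminator to separate real from fake.
    d_loss = (loss_fn(D(real), torch.ones(64, 1)) +
              loss_fn(D(fake.detach()), torch.zeros(64, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Train the generator to fool the discriminator.
    g_loss = loss_fn(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

samples = G(torch.randn(1000, 8)).detach()
print(f"synthetic mean/std: {samples.mean().item():.2f} / {samples.std().item():.2f}")

After enough steps, the mean and standard deviation of the generated samples approach those of the real distribution.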

While this method is popular among neural networks used in image recognition, its uses extend beyond neural networks and it can also be applied to other machine learning methods. It is often called Turing learning, as a reference to the Turing test, in which a human talks to an invisible speaker and tries to determine whether it is a machine or a human.

Synthetic data tools

Tools related to synthetic data are usually developed to meet one of the following needs:

  • Test data for software development and similar purposes
  • Training data for machine learning models

The UnrealSynth synthetic data generator uses the real-time rendering capabilities of Unreal Engine to build realistic three-dimensional scenes and automatically generates images and annotation data for training AI models such as YOLO. The synthetic data generated by UnrealSynth can be used for training and validating deep learning models, which can greatly improve the efficiency of object detection tasks in industry-specific scenarios such as hard hat detection, traffic sign detection, construction machinery detection, vehicle detection, pedestrian detection, and ship detection.

Steps to generate synthetic data with UnrealSynth:

1. After adding a GLB file to the scene, you can configure the UnrealSynth synthetic data generation parameters. The parameters are as follows:

  • Model category: the object class recorded in the synth.yaml file of the generated dataset
  • Environment change: changes the scene background
  • Number of screenshots: the number of images generated in the images directory of the dataset; half of the total goes into the train directory and half into the val directory
  • Number of objects: the number of objects in the scene (currently up to 5 are supported; the model category is selected at random)
  • Random rotation: objects in the scene are rotated at random angles
  • Random height: the height at which objects in the scene are randomly placed
  • Screenshot resolution: the resolution of the images in the generated dataset
  • Zoom: object scaling and resizing

2. After clicking [OK], two folders and a yaml file will be automatically generated in the local directory …\UnrealSynth\Windows\UnrealSynth\Content\UserData: an images folder, a labels folder, and a synth.yaml file.

UnrealSynth\Windows\UnrealSynth\Content\UserData
    |- images
        |- train
            |- 0.png
            |- 1.png
            |- 2.png
            |- ...
        |- val
            |- 0.png
            |- 1.png
            |- 2.png
            |- ...
    |- labels
        |- train
            |- 0.txt
            |- 1.txt
            |- 2.txt
            |- ...
        |- val
            |- 0.txt
            |- 1.txt
            |- 2.txt
            |- ...
    |- synth.yaml
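The article does not show the contents of the generated synth.yaml, but dataset configurations in the ultralytics YAML format typically look like the hypothetical example below (the path and class name are placeholders, not values produced by UnrealSynth):

# Hypothetical synth.yaml in the ultralytics dataset format
path: UnrealSynth/Windows/UnrealSynth/Content/UserData  # dataset root
train: images/train   # training images, relative to path
val: images/val       # validation images, relative to path

names:                # class indices and names recorded by UnrealSynth
  0: excavator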

3. Model training: there are three ways to train a model after the dataset is generated: using Python scripts, using the command line, or using online services.

The first is to use a Python script. You need to install the ultralytics package first (pip install ultralytics). The training code is as follows (the three model = YOLO(...) lines show alternative ways to initialize the model; keep only the one you need):

from ultralytics import YOLO

# Load a model
model = YOLO('yolov8n.yaml') # build a new model from YAML
model = YOLO('yolov8n.pt') # load a pretrained model (recommended for training)
model = YOLO('yolov8n.yaml').load('yolov8n.pt') # build from YAML and transfer weights

# Train the model
results = model.train(data='synth.yaml', epochs=100, imgsz=640)
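Once training finishes, the resulting weights can be used for prediction. A brief sketch, assuming ultralytics' default output directory runs/detect/train and a hypothetical test image name:

from ultralytics import YOLO

# Load the best weights produced by the training run above
# (runs/detect/train is ultralytics' default output directory).
model = YOLO('runs/detect/train/weights/best.pt')

# Run inference on a hypothetical test image.
results = model.predict('test_image.png')
for box in results[0].boxes:
    print(box.cls, box.conf, box.xyxy)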

The second is to use the command line, which requires the yolo command-line tool (installed along with the ultralytics package). To train on the generated dataset, pass data=synth.yaml:

# Build a new model from YAML and start training from scratch
yolo detect train data=synth.yaml model=yolov8n.yaml epochs=100 imgsz=640

# Start training from a pretrained *.pt model
yolo detect train data=synth.yaml model=yolov8n.pt epochs=100 imgsz=640

# Build a new model from YAML, transfer pretrained weights to it and start training
yolo detect train data=synth.yaml model=yolov8n.yaml pretrained=yolov8n.pt epochs=100 imgsz=640

The third option is to use Ultralytics HUB or other online training tools.

Reprint: The greatest utilization value of synthetic data in 2023 (mvrlink.com)
