Synthetic Data | AI's Next Feedstock?

Your next investment opportunity?

7 months ago   •   13 min read

By Alan Shi, Michael Donaldson,


What happens when we run out of data? Or we can't use some valuable datasets due to privacy issues.

What will our AI models learn from once we hit "peak data"? Maybe we'll never reach peak data, same as we never reached "peak oil".

But we're hearing rumbles of AI models running out of internet to train on.

Enter Synthetic Data. The category has emerged as a picks-and-shovels play for those looking to invest in the AI boom without trying to pick an LLM player or pay exorbitant prices.

Well, maybe some exorbitant prices - Scale AI is in talks with Accel to raise hundreds of millions at a $13B valuation (up from $7.3B in 2021) but they're rumored to be doing $675M in Revenue which would give it at ~20x multiple, which in AI land isn't crazy. Scale sells its data generation services to leading AI model companies like OpenAI, Cohere, and Nvidia.

As we'll see below, the synthetic data category has gone from an interesting new tech category in the late 2010s to a critical enabling technology for advanced AI research and ML modeling.

Table of Contents:

  1. Introduction to Synthetic Data
  2. End Markets & Use Cases
  3. Synthetic Data Market Map
  4. Investment Considerations

💡
Synthetic data refers to artificially generated data that mimics the statistical properties and patterns of real-world data, without containing any actual real data. It is created using advanced machine learning techniques, particularly generative models like generative adversarial networks (GANs)

Unite AI explains:

"When it comes to GAN models, they are called “adversarial” networks due to the fact that GANs are actually two networks that compete with each other. The generator is responsible for generating synthetic data, while the second network (the discriminator) operates by comparing the generated data with a real dataset and tries to determine which data is fake. When the discriminator catches fake data, the generator is notified of this and it makes changes to try and get a new batch of data by the discriminator. In turn, the discriminator becomes better and better at detecting fakes. The two networks are trained against each other, with fakes becoming more lifelike all the time."

  • In scenarios where real data is limited, biased, or outdated, synthetic data provides a way to generate realistic data for analysis, forecasting, and decision-making. For example, generating synthetic patient data for medical research or financial data for risk modeling.
  • Industries dealing with sensitive personal data like healthcare, finance, and telecommunications can leverage synthetic data to adhere to data privacy regulations (GDPR, HIPAA, etc.).
  • Synthetic tech firms help their customers by building a synthetic data lake and impose the "shape" of the original data onto a set of synthetic or fake data, as discussed by the Chief Commercial Officer at MDClone in this clip.

▶ Gretel.ai, a synthetic data company based in Palo Alto, which raised a $50M Series B in 2021, shows below how customers can leverage their platform to perform analyses that yield reliable insights leveraging their synthetic data platform.

💰 The synthetic data market was already growing before the latest AI boom in 2022-2024 as evidenced by the tangential Digital Twins market having a CAGR of 58% which was projected to reach $50 billion by 2025, according to McKinsey.

The recent frenzy to develop and deploy AI models has rapidly pulled forward demand. According to Gartner, in 2021 only 1% of all data used to train AI was synthetic but they predict that number could be up to 60% in 2024.

âŹ‡ïž In the following sections, we highlight a few companies that are well-positioned to become market leaders and take advantage of AI's voracious demand for data.

End Markets & Use Cases

Training AI/ML Models

One of the primary use cases of synthetic data is training AI and machine learning models when real data is scarce, sensitive, or difficult to obtain. Synthetic data can augment or replace real data for tasks like:

  1. Training computer vision models on synthetic images/videos for object detection, segmentation, etc.
  2. Training natural language processing models on synthetic text data.
  3. Synthetic data comes with perfect ground truth annotations, eliminating the need for manual labeling which is time-consuming, expensive and prone to errors.

Healthcare & Life Sciences

Synthetic health data accelerates clinical research while protecting patient privacy by generating artificial records that resemble real EHR data without including any actual patient information.

Organizations also generate synthetic data to augment AI training datasets by boosting low sample sizes, balancing between classes, filling in missing fields, and simulating new examples for underrepresented medical conditions.

  1. Synthetic healthcare data can help in simulating patient data for clinical trials, facilitating more efficient initial testing phases and protocol development.
  2. Enable the safe analysis of electronic health records (EHR) by using synthetic patient data, maintaining patient privacy while extracting valuable insights for healthcare improvement.
  3. Reduce the time required to access data

Some of the largest Healthcare and Insurance providers have begun exploring how to use synthetic data. Anthem stated in 2022 that it was working with Alphabet to create a platform for synthetic data to validate and train AI algorithms that identify things like fraudulent claims or abnormalities in a person’s health records, and those AI algorithms will then be able to run on real-world member data.

The plan is to use algorithms and statistical models to generate approximately 1.5 to 2 petabytes of synthetic data, including artificially generated data sets of medical histories, healthcare claims and other key medical data, created in partnership with Google Cloud.

The ultimate goal, he said, is to validate and train AI algorithms on large amounts of data, while reducing privacy issues surrounding personal medical information - Anil Bhatt, CIO Anthem/Elevance.

Finance & Public Sector

Synthetic data usage offers financial organizations a privacy-preserving solution to enhance risk assessment, fraud detection, algorithm training and software development.

  1. For Banks: Improve fraud, anti-money-laundring and anomaly detection models
  2. For Insurers: Personalized customer insights based on high-quality synthetic data
  3. For FinTechs: Secure algorithm training by maximizing the data and minimizing the privacy risks.
  4. Regulatory compliance with data protection regulations
  5. Population Health & Security analyses around image identification for satellite imagery

Below are a few examples of how leading synthetic data companies are servicing their customers.

Source: Mostly AI

Syntho.ai, a startup based in the Netherlands, has seen considerable market pull from the banking sector across the globe for its synthetic data products, highlighting the global nature of the opportunity for vendors in the space.

Syntheticaic, sells its Rapid Automation Image Categorization (RAIC) platform to customers looking to analyze large sets of geospatial data. Using the RAIC platform they were the first company to find the infamous Chinese spy balloon by analyzing satellite imagery. RAIC found the ballon in two minutes. The company raised a $15M Series B in Feb 2024, with IBM Ventures and Booz Allen participating.

Epistemix, a synthetic data company based in Salt Lake City, built its initial product around public health simulations. As described by their CEO & Co-Founder, John Cordier:

our software is like a hyper-realistic SimCity where you can evaluate the impact strategies, policies, and interventions have on the populations and places in the simulation before you take action in the real world.

The current iteration of their product allows customers to create agent-based models to simulate outcomes of policy decisions and is built on top of a synthetic population of the United States that is statistically accurate down to the census block level.

Source: Epistemix FRED model

Synthetic Data Market Map

There's been significant activity in the space over the last few years, with the first generation of companies carving out niches in specific end markets (Banking, Healthcare, etc.)

  • A lot of companies started in Europe in response to the GDPR rules forcing EU companies and US companies with European consumer data to tread more carefully when analyzing their data (e.g, Mostly AI).
  • The next generation of companies in the space are looking to take advantage of the recent explosion in AI companies and models, positioning themselves as more software infrastructure companies (e.g., Tonic.ai)

Mostly AI has an active blog commenting on the space, below they've outlined a more exhaustive list of competitors and players entering the space.

Mostly AI's Synthetic Data Market Map

Some of these companies are classified by Matt Turk at FirstMark in his annual The 2024 MAD (ML, AI & Data) Landscape in the 'Data Generation & Labelling' category.


Investment Considerations

Despite what looks like an attractive market, synthetic data isn't without its limitations and there remain key risks and open questions for anyone investing or building in the category.

❓Quality of Synthetic Data and ability to draw inferences

    • Synthetic data may not capture the complexity of real-world datasets and can potentially omit important details or relationships needed for accurate predictions.
    • A 1B parameter synthetic data set based on a 10M parameter real data set may generate insights that are incrementally helpful on those 10M parameters but said insights could be vastly different when compared to a 1B parameter real data set.

🧠 Disintermediation Risk from Companies closer to the data

    • Platforms and technologies, such as data warehouses/lakes, that sit closer to customers' data may build their own synthetic data tooling which would disintermediate some of the pure play synthetic data vendors
    • Modern Healthcare Data platforms and EHRs such as Zus, and Tuva offer synthetic patient data tools for Health Tech companies to run analyses on their patient population.
    • Larger software vendors may scoop up smaller teams via M&A or partnerships, increasing competition and leaving synthetic players at risk of being viewed as a feature (ex: Aetion acquired Replica, Facebook acquired AI.Reverie and Datavant acquired Syntegra)

đŸ„Š Competitive Market with at least 5 companies raising $40M+

    • Questionable as to whether we would need net new Seed or Series A stage companies in the space unless there is a unique angle
    • Identifying technical differentiation among competitors becoming more difficult

đŸ’” Consumption-based pricing models may make predicting revenue difficult

    • Larger vendors will be able to require minimum spends but smaller companies may struggle to generate enough revenue to warrant investment.
    • Price erosion possible overtime due to competition and commoditization of technology
    • Tech lends itself to requiring services/onboarding which may lower gross margins until fully self-service models are available

If you are investing or building in the space please reach out at michael@laborcapital.co if you'd like to review deal opportunities or provide additional commentary to the article.

Contributing Editor: Thanks to Alan Shi, a sophomore at Washington University in St. Louis, for his help researching this article. Alan is part of the Labor Capital Fellows Program.


Additional Resources

HelixNano
Clinical trial digital twin service Unlearn.AI raises $50M
Unlearn.AI, a startup developing ‘digital twin’ technology for clinical trial applications, has raised $60 million in a venture round led by Insight Partners.
MDClone snags $63M series C for synthetic data analysis platform
Health data company MDClone grabbed $63 million in a series C funding round to power its continued global growth. | Health data company MDClone banked $63 million in series C funding to power growth for its self-service platform, which generates synthetic data to expand access to research while protecting patient privacy.
Advancing AI to power the future of medicine
Welcome to the future of medicine
The People in This Medical Research Are Fake. The Innovations Are Real.
‘Synthetic-data technology,’ by creating artificial patient populations, has the potential to speed up innovations without compromising privacy.
Home
Syntegra creates accurate, privacy-preserved synthetic data that bridges the gap between your organization’s data privacy and data science needs, allowing you to take a data-centric approach to innovation in patient care and improved clinical outcomes.
Anthem Looks to Fuel AI Efforts With Petabytes of Synthetic Data
The idea of synthetic data in enterprises has been around for decades, but it’s picking up steam as companies begin making more use of artificial intelligence and have greater need for faster access to better data sets.
Self service AI generated synthetic data platform
Explore the self service AI generated synthetic data platform now to accelerate your data-driven tech solutions!
Synthetic Data Guide
Synthetic Data Guide. Whatever ‘maturity level’ you are in with your organization, Syntho offers guidance
AI-generated Synthetic Data, easy and fast access to high quality data? - Synthetic data software
%
Leeds spinout Slingshot Simulations secures ÂŁ3 million Series A funding round
Slingshot Simulations, a pioneering digital twin and decision intelligence company, has secured a ÂŁ3 million Series A round of investment from Northern Gritsto
Digital twins: From one twin to the enterprise metaverse
Forward Networks raises $50M for digital twin technology to help model, manage and secure complex enterprise networks
Digital twinning tech is a clever way to model and run complex systems. Now a startup building this for enterprises has raised a round to grow.
MOSTLY AI Raises $25M to Bring Synthetic Data to Every Enterprise
MOSTLY AI, the pioneers of synthetic data raised $25M Series B funding led by Molten Ventures with participation from Earlybird, 42CAP and Citi Ventures.
MOSTLY AI raises $25 million to further commercialize synthetic data in Europe and the US
Synthetic data startup MOSTLY raised a $25 million Series B round led by British VC firm Molten Ventures, with participation from new investor Citi Ventures and existing investors 42CAP and Earlybird.
Everything that happened in the synthetic data space in 2022
Read about the new players, market developments, and perspectives on how the synthetic data space evolved in 2022.
The market for synthetic data is bigger than you think
To understand what’s happening, but also what’s coming if synthetic data does get more broadly adopted, we talked to various CEOs and VCs over the last few months.

https://www.gartner.com/en/newsroom/press-releases/2023-08-01-gartner-identifies-top-trends-shaping-future-of-data-science-and-machine-learning

Spread the word