From Synthetic Data to the Reshaping of Approaches to Representation


The introduction of artificial intelligence (AI) into our lives, and particularly its generative form, has raised numerous ethical and social issues over the years—ranging from concerns about privacy protection to the reproduction of stereotypes, and more recently to matters regarding the safeguarding of intellectual property. All these instances, among others, share a common element: data. Datasets play a crucial role both in the design phase of AI models and in the generation of their outputs. Specifically, at this historical juncture, it is especially relevant to reflect on a particular category of data: synthetic data.

Why Does Data Matter?

First, it is necessary to address the question of data itself. As is well known, AI systems can only be designed if a large dataset is available to train the algorithm. From the training of these models to the moment they generate outputs, the breadth and quality of the datasets available to them significantly shape their processing and generative capabilities.

Such datasets can come from various sources. They might be assembled through web scraping, which allows for the indiscriminate collection of data (for example, from social media), later labeled by human workers, who are often poorly paid. This is the method used by nearly all companies that have developed generative AI models, such as OpenAI and Google.

Alternatively, data can be meticulously assembled by humans, as in the case of AlphaFold, the AI system that predicts the 3D structure of proteins and which was awarded the Nobel Prize in Chemistry in 2024. AlphaFold was made possible because, starting in the 1970s, scientists worldwide created a public database containing the 3D structures of over 170,000 proteins.

Some of the Problems that Derive from Data

Data, however, also lies at the heart of some of the most well-documented problems in AI. Consider the inability of certain facial recognition systems to accurately identify Black individuals, famously investigated by Buolamwini and Gebru in 2018; the tendency of the COMPAS recidivism prediction system to overestimate the likelihood of re-offending for incarcerated people of color compared to white inmates; and the more recent, widely publicized cases of occupational stereotyping by various text-to-image models. These issues have consistently been traced back to gaps within the datasets. As a result, such systems do not reflect the world itself, but rather the way the world is perceived by the particular segment of it from which their data was drawn.

Further problems concern the quality and availability of such data. First, obtaining a dataset of sufficiently high quality is a lengthy and costly process. Second, even if it were possible to fully cleanse an existing dataset (which is not currently feasible), we must still contend with the fact that data, like everything else, is finite. Rather than relying on scarce or hard-to-access datasets, or limiting development to those already available, the current trend is to turn to artificially generated data, also known as synthetic data.

The Solution: Synthetic Data

Synthetic data is data that has been artificially generated by AI systems on the basis of a real dataset. The methods for generating such data are numerous and highly complex, and a detailed treatment lies beyond the scope of this contribution. What matters here is that the generated data is not a simple redistribution of the original data: it is genuinely new, previously unobserved data, generated in such a way as to preserve the statistical characteristics of the original dataset.
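To make the contrast between mere redistribution and genuinely new data concrete, here is a deliberately minimal sketch in Python. It substitutes a toy Gaussian model for the far more complex generators used in practice (GANs, variational autoencoders, diffusion models), and all data and names are hypothetical:

```python
# Toy illustration: synthetic points are newly sampled, not copies,
# yet they follow the statistics of the "real" dataset.
import random
import statistics

random.seed(0)

# Pretend these are real measurements (e.g., ages in a survey).
real = [random.gauss(40, 10) for _ in range(5000)]

# A minimal generative "model": estimate the statistics of the real data...
mu = statistics.mean(real)
sigma = statistics.stdev(real)

# ...and sample brand-new, previously unobserved values from them.
synthetic = [random.gauss(mu, sigma) for _ in range(5000)]

# No synthetic value is a direct copy of a real one,
# but the overall distribution is preserved.
assert not set(real) & set(synthetic)
print(round(statistics.mean(synthetic)), round(statistics.stdev(synthetic)))
```

Even in this toy setting, no synthetic value repeats a real one, yet the statistical shape of the original survives: precisely the property that makes synthetic data useful, and, as argued later in this piece, epistemologically ambiguous.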

Synthetic data can serve multiple purposes, two of which will be addressed here. The first involves the use of synthetic data to “fill in” gaps within an existing dataset. In the aforementioned case analyzed by Buolamwini and Gebru, the problem lay in the insufficient representation of individuals with dark skin tones within the datasets used to train facial recognition systems. In such cases, in the absence of synthetic data, the only solution would be to invest time and financial resources in collecting real-world data in order to rebalance the initial dataset. With synthetic data, by contrast, it becomes possible to simply expand the dataset by generating additional data points that ensure greater representation of Black individuals. Here, the issue of representational imbalance (a problem that, in other contexts, has contributed to the emergence of stereotypes surrounding people of color) is addressed not by striving to represent empirical reality, but rather by instructing a machine to generate a statistically balanced version of it.
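The rebalancing step itself can be sketched in a few lines of Python. The example below is purely hypothetical: real pipelines use learned generators, whereas this sketch simply jitters existing minority samples (loosely echoing the SMOTE oversampling technique) to show how a gap in representation is filled synthetically rather than by new data collection:

```python
# Hypothetical sketch: rebalancing an imbalanced dataset with synthetic points.
import random

random.seed(1)

# Imbalanced toy dataset: (feature value, group label).
data = [(random.gauss(0, 1), "majority") for _ in range(900)]
data += [(random.gauss(3, 1), "minority") for _ in range(100)]

minority = [x for x, g in data if g == "minority"]
deficit = 900 - len(minority)

# Generate new minority points near existing ones, instead of
# collecting more real-world data.
synthetic = [(random.choice(minority) + random.gauss(0, 0.1), "minority")
             for _ in range(deficit)]

balanced = data + synthetic
counts = {g: sum(1 for _, grp in balanced if grp == g)
          for g in ("majority", "minority")}
print(counts)  # both groups are now equally represented
```

The machine does not go out and observe more of the world; it manufactures a statistically balanced version of the world it was already given, which is exactly the move discussed above.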

The second application concerns the use of synthetic data to make datasets containing sensitive information usable. This is the case, for example, with datasets containing banking, medical, or other personal data that cannot be employed in their original form due to privacy protections. As previously noted, data constitutes one of the most valuable resources in the field of AI. Rather than forgoing the use of a carefully constructed dataset, one can instead instruct the system to generate a new dataset that preserves the statistical distribution of the original, while excluding any personally identifiable information.
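A minimal, purely illustrative sketch of this second use follows, assuming toy records and offering no formal privacy guarantee (production systems rely on dedicated techniques such as differential privacy):

```python
# Illustrative sketch: build a synthetic table that keeps the statistical
# shape of sensitive records while containing no identifying fields.
import random
import statistics

random.seed(2)

# Hypothetical sensitive records: "name" is identifying,
# "amount" is the statistically useful signal.
records = [{"name": f"patient_{i}", "amount": random.gauss(250, 50)}
           for i in range(1000)]

amounts = [r["amount"] for r in records]
mu, sigma = statistics.mean(amounts), statistics.stdev(amounts)

# Synthetic records: identifiers are dropped entirely; values are resampled
# from the fitted distribution, so no row maps back to a real individual.
synthetic = [{"amount": random.gauss(mu, sigma)} for _ in range(1000)]

assert all("name" not in row for row in synthetic)
```

The carefully constructed dataset is not discarded: its statistical distribution lives on in a new table that, at least in this naive sketch, no longer points to any real person.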

Is Synthetic Data as Great as It Appears?

This raises an important epistemological question about the status of synthetic data: it is not real, but rather realistic. It is generated by machines—with all the implications of delegating agency to technological systems whose operations we do not fully control—yet it is used as if it were real data. From this perspective, synthetic datasets can be considered simulacra, in the sense theorized by Jean Baudrillard in his 1981 Simulacra and Simulation. For the French philosopher, simulacra are representations of reality that have lost their connection to the original reality they are supposed to represent. Baudrillard’s concept of the simulacrum is deeply rooted in his analysis of how modern society has shifted away from representing reality, creating “hyperreality”: a condition in which simulations and copies take precedence over real experiences and authentic meaning. As synthetic data gradually replaces real data in the training of AI systems (it tends to be cleaner and free from the noise that affects empirical data), we are witnessing the manipulation of reality through a growing preference for a synthetic one.

It is therefore crucial to ask ourselves what consequences might arise. Is there a possibility that we will reach a point at which synthetic reality is so far removed from the actual one that we will nonetheless have to adhere to it? Will we base our perception of reality on the one generated by synthetic data, or will we still be able to tell the difference?

The Issue of Communication and Perception

This leads us to the question of how this data is perceived. In both public and technical discourse, synthetic data is often presented as objective and bias-free, as if it were merely a tool for observing the world in an impartial and aseptic way. However, this is not exactly the case. As Shanley and colleagues illustrate, generating synthetic data involves choices about what to represent, how to represent it, and the social, cultural, and political contexts in which data is produced and interpreted. This was already true of empirical data collection, and it is even more pronounced for synthetic data, which is the product of algorithms designed by social actors who carry with them worldviews, interests, and cultural assumptions. The paradox is that precisely because it lacks a real-world referent, synthetic data seems even “cleaner,” more unbiased, and safer. It is said to be ethical because it does not violate privacy and does not directly discriminate; in reality, however, what it transmits is the ideological framework that governed its generation. In short, its objectivity is a discursive effect, not an intrinsic property of the data.

A Critical Stance on Synthetic Data

As has emerged, the issue of synthetic data is highly complex and demands critical reflection. From my perspective, resorting to tools such as synthetic datasets—which, as true simulacra, are detached from reality and constitute a realistic projection rather than an actual representation—ultimately benefits only those who design AI models. While AI companies profit from the appearance of inclusivity that synthetic data provides, which reflects positively on their brand reputation, they are not actually committed to solving the problem of representing marginalized communities, thereby further depriving real individuals of the power of self-representation. In line with other studies on synthetic data, this contribution argues for the need to maintain a critical attitude towards these tools, as their use entails concrete consequences and reveals a specific stance on the part of the companies that deploy them.

In conclusion, synthetic data is reshaping not only the technical practices of AI development but also our relationship with reality itself. Far from being neutral, it carries the assumptions of those who produce it, raising urgent epistemological and ethical questions about the worlds we build through data.

Carla Fissardi

Carla Fissardi is a PhD student in Semiotics at the University of Palermo. She graduated with a Master's degree in Semiotics with a thesis on the presence of stereotypes and bias in images generated by text-to-image AI. She is currently working on generative AI and its framing in the cultural and social imaginary.
