
Synthetic data has been around for a long time, decades even. But as KPMG’s Fabiana Clemente points out, “That doesn’t mean there aren’t a lot of misconceptions.” Fabiana sat down with Ben to clarify some of the current applications of synthetic data and new directions the field is taking—working with offshore teams when privacy controls just don’t allow you to share actual datasets, improving fraud detection, building simulation models of the physical world, enabling multi-agent architectures. The takeaway? Whether your data’s synthetic or from the real world, success often comes down to the processes you’ve established to build data solutions. Watch now.
About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2026, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.
Check out other episodes of this podcast on the O’Reilly learning platform or follow us on YouTube, Spotify, Apple, or wherever you get your podcasts.
Transcript
This transcript was created with the help of AI and has been lightly edited for clarity.
00.47
All right. Today we have Fabiana Clemente, senior director and distinguished engineer at KPMG. Fabiana, welcome to the podcast.
00.57
Thank you. It’s a pleasure to be here.
01.00
Our main topic today is synthetic data. We’ll try to focus on that, but obviously we may get derailed here and there. I think it’s fair to say at this point most listeners have heard of this notion of synthetic data. Some have probably even tried to generate their own or used a tool. But obviously you’re much more hands-on and much more active on a day-to-day basis when it comes to synthetic data. So maybe we’ll start, Fabiana, if you can describe the top two to three use cases where synthetic data seems to work right now.
01.46
Yeah, that's a good start. And yes, it's true that a lot of users have already heard of synthetic data before. That does not mean there aren't a lot of misconceptions. But we can delve into that a bit later on.
In a nutshell, synthetic data is any data that is not collected from real-world events. With that definition in mind, we can think of a whole spectrum of use cases and applications, from the low-hanging fruit of test data management (data that will allow you to test systems) all the way to more intelligent use cases where you need to support the development of AI agents, and everything in between. You can also think of synthetic data as a privacy-preserving way for you to have access to data.
So it's a broad scope, and it is by no means all served by the same technology. Of course, it will vary depending on your application and use case and on what you want and expect to gain from synthetic data generation.
02.56
When you talk about AI applications, most people think of things like coding and programming and maybe customer support, things like that. What would be the equivalent for synthetic data? What are the most cited examples? If you were to give a talk, and you’re pressed to give examples where synthetic data is being used, what would be the top two most common reasons for using [it]?
03.34
Yeah, the ones that I mentioned are the most common. So one of them is, "OK, I have a real dataset. I want to share this with my offshore team, but I can't." The data can't leave the country, but I still want to keep some level of structure, and also the correlations. So you go for synthetic data instead. And here you use synthetic replicas, which are a type of synthetic data.
Or you are developing your own AI agents, and you are looking into improving your training, your evals. And then you leverage synthetic data to construct the whole system and change the epistemics around your AI agents. So I would say those two are fundamentally different, but they are true applications of how synthetic data can help nowadays.
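To make the synthetic replica idea concrete, here is a minimal sketch of one common technique for numeric tables, a Gaussian copula: it preserves each column's marginal distribution and the pairwise correlations while producing entirely new rows. The function and the toy income/spend data are invented for illustration; this is not a description of KPMG's method or any particular product.

```python
import numpy as np
from scipy import stats

def synthetic_replica(real: np.ndarray, n_rows: int, seed: int = 0) -> np.ndarray:
    """Gaussian-copula replica of a numeric table: keep marginals and correlations."""
    rng = np.random.default_rng(seed)
    n, d = real.shape

    # 1. Map each column to normal scores through its empirical ranks.
    ranks = np.argsort(np.argsort(real, axis=0), axis=0) + 1
    scores = stats.norm.ppf(ranks / (n + 1))

    # 2. Estimate the correlation structure in normal-score space.
    corr = np.corrcoef(scores, rowvar=False)

    # 3. Sample new rows from that multivariate normal, then invert back
    #    to each column's original marginal via empirical quantiles.
    u = stats.norm.cdf(rng.multivariate_normal(np.zeros(d), corr, size=n_rows))
    return np.column_stack([np.quantile(real[:, j], u[:, j]) for j in range(d)])

# Toy table with two correlated columns standing in for the real dataset.
rng = np.random.default_rng(42)
income = rng.lognormal(10, 0.5, size=2000)
spend = 0.3 * income + rng.normal(0, 500, size=2000)
real = np.column_stack([income, spend])

fake = synthetic_replica(real, n_rows=2000)
print(np.corrcoef(real, rowvar=False)[0, 1])  # real correlation
print(np.corrcoef(fake, rowvar=False)[0, 1])  # preserved in the replica
```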
04.32
You’ve been working on synthetic data for a while. What’s one or two examples where synthetic data solved a problem and it actually surprised you?
04.47
Surprised me? I wouldn't say it surprised me, but it's probably the best way to leverage it. One of them, which I just mentioned, was really to enable offshore teams to have access to a dataset that is similar and, in this case, develop analytic solutions on top of it, for example. And that one is. . . Usually you think about how companies are restricted from sharing data with external entities. But you sometimes don't think about how an external entity can still be the same company, just in a different country.
05.37
On the other hand, I would say that I have also seen cases where synthetic data helped a lot in improving the results of fraud detection, which, to an extent, is not an obvious path for improving your results when it comes to fraud detection.
06.05
So for teams that don’t have a lot of experience with synthetic data, what are, let’s say, the two most common mistakes?
06.15
Oh, that's a good one. Yeah. I would say that the biggest mistake I've seen is perhaps oversimplifying the complexity of synthetic data. And I'm not saying synthetic data complexity in a bad way. But as with anything that leverages data, you need planning. You need to think about "What do you want to get as an outcome?" So even if you are just building a test dataset to test a software application, you need to plan: "What use cases do you really want to cover with the synthetic data?"
And usually people have this expectation that synthetic data is just "Click on a button. It'll do exactly everything I want—it's simple, and it's just dummy data. So it's very easy to do." That, I'd say, is one of the biggest mistakes I have observed.
07.17
And the second one, I would say, is not understanding [that] there are different methodologies and different types of synthetic data that you can leverage, and being able to select the correct one for their objectives. And these are two fundamental [concepts]. They are not technical, if you ask me. They are really around requirements, and understanding the technology that you want to leverage.
07.46
Is it fair to say that, historically, a few years ago, synthetic data—my impression at least, and I guess this was before ChatGPT—tended to be around computer vision, images, those kinds of things? So these days, [what are the] data modalities? Basically, across the board, everyone is trying to do synthetic data. I mean, even people in robotics are doing synthetic data at this point. But what's the dominant type of data that people are. . .?
08.28
I would say that the first data type that leveraged synthetic data was actually structured data, way before text or images, if you think about [it]. We have been doing that for probably more than 50 years. And I do think that images evolved quite interestingly in the last 10 years or so, as did text.
And I would say that nowadays, if you think about it, text is probably the type of synthetic data that is dominating the market. That doesn't mean the space of synthetic data for text is well-defined or well-structured, because now anyone considers synthetic data to be just. . . It's an issue of oversimplifying: The output of an LLM can be considered synthetic data, but that does not mean it's well-structured or is actually being used and leveraged correctly for the task at hand.
But definitely text is dominating nowadays.
09.45
So without synthetic data, normally what you would do is say, "OK, I want to build a model; here's some historical data, or, in the case of finance, here are historical trades and financial data." And then I'll build the model, test the model out, and then deploy to production. But obviously things can go wrong even in the scenario I painted. You can have drift—so the real world changes, and what you built your model on is no longer the same. Or you may have ended up kind of. . . The sample you created your model from was biased, and so on and so forth.
Obviously the same problems will occur with synthetic data. So what are some of the common technical problems for synthetic data, I guess, is the question.
10.50
I wouldn't say it's a technical problem of synthetic data. It's a technical problem of data in general. What you just described is definitely a fundamental problem of how the processes around building data solutions are defined.
11.00
But it could be the case, Fabiana, that your data is perfectly fine, but your synthetic data tool was bad. And so then the synthetic data that was generated was bad.
11.21
No, I wouldn’t say. . . And again, that goes exactly [back] to my initial point: You also can end up with good data and end up with a crappy model. And that’s a you problem. That’s a problem of you not understanding how models behave.
11.42
But surely, just like models and model-building tools, there are synthetic data generation tools that are better than others. So I guess, what should people look for in terms of the tools they're using?
11.59
It depends a lot on the use case, on the end application, right?
12.04
Yeah. That’s a reasonable answer.
12.07
And it’s an answer that nobody likes to hear. But for me that’s the true answer: It depends. And you need to be aware of what you want, in order to search for the right parameters and functionalities that you are looking for.
12.27
But basically, synthetic data becomes a part of the workflow, just like real data, right? So what you would do in order to harden whatever model or analytics that you’re building with real data, you would apply the same hardening steps, if you’re using synthetic data.
12.52
100%. And I think it's very important that you have what you would call a governance process around what you consider a synthetic dataset that is ready for you to leverage.
There are evaluation metrics that you should put in place, and those evaluation metrics will depend on the type of data that you are leveraging but also on the use case that you are building for. And those processes are really important. You should also make sure that the people who are leveraging synthetic data are well trained on it. Because as you said, yes, training a model [on] synthetic data can lead to potential mistakes that you don't want to propagate. And those mistakes usually stem exactly from the lack of governance processes around how to generate synthetic [data], when to generate it, from where, from what, and for what. Having those metrics and that assurance is, I think, essential for companies adopting synthetic data generation on a daily basis.
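Fabiana doesn't prescribe specific metrics, but as a minimal sketch of what such a governance gate could look like for a numeric table: per-column Kolmogorov-Smirnov distances plus the largest gap between the real and synthetic correlation matrices, checked against pass/fail thresholds. The function name and the thresholds here are illustrative assumptions, not a standard.

```python
import numpy as np
from scipy import stats

def evaluate_synthetic(real: np.ndarray, synth: np.ndarray,
                       ks_threshold: float = 0.1,
                       corr_threshold: float = 0.1) -> dict:
    """Gate a synthetic dataset before it enters a pipeline: per-column
    KS distance plus drift in the correlation matrix."""
    ks = [stats.ks_2samp(real[:, j], synth[:, j]).statistic
          for j in range(real.shape[1])]
    corr_gap = np.max(np.abs(np.corrcoef(real, rowvar=False)
                             - np.corrcoef(synth, rowvar=False)))
    return {
        "ks_per_column": ks,
        "max_corr_gap": float(corr_gap),
        "approved": max(ks) < ks_threshold and corr_gap < corr_threshold,
    }

# Example usage with a stand-in "synthetic" set that closely tracks the real one.
rng = np.random.default_rng(1)
real = rng.normal(size=(1000, 3))
fake = real + rng.normal(scale=0.05, size=(1000, 3))
print(evaluate_synthetic(real, fake))
```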
14.04
With the rise of foundation models and generative AI, you know a few of the trends: There are things like agents, multimodality, reasoning. So let's take them one at a time. So agents. . . Obviously, agents is a broad topic, but at the simplest level, you have an agent that does one thing well, but even that one thing may involve multiple steps, could involve tool calls and things like this. Are people starting to use synthetic data as part of their agent-building process?
14.52
I wouldn’t generalize to everyone across the industry, but I would say that we have evidence that some companies are definitely adopting [synthetic data]. Meta, OpenAI. . .
15.12
So it sounds like really advanced companies.
15.15
Yes, exactly. And I was about to say that. Even xAI: They are all leveraging synthetic data, and all of them are betting on leveraging synthetic data to enable a different, structured exploration of knowledge spaces.
Exactly what you said: An AI agent, or a multi-agent system, will require reasoning, a multistep kind of framework. And usually your knowledge base is not structur[ed that] way, or it's less structured if you go and check. So synthetic data is actually one of the pieces that is helping keep those knowledge spaces well-structured, in a way that can optimize the outcome from agents, for example, or even change how models actually acquire understanding.
16.15
So in the traditional way we used to think about building an AI system, we collect the data, we build the model, we have an output. . . A lot of those more sophisticated companies are actually already thinking a different way, right? The AI, especially the agents, will need to learn or be developed in a different way, where you have a hypothesis, you want to cover that hypothesis with your data, you want to model it, you want to evaluate that hypothesis and make sure that your systems are updated.
And that's where synthetic data is actually helping drive change. This is what we call acceleration through epistemic development, where synthetic data is the main tool to achieve it. But that is the general way we understand sophisticated companies are using it; I wouldn't dare to say that everyone in the industry is using it that way.
17.15
Yeah, yeah, yeah. So one of the more interesting things in this area is this emerging body of practice around agent optimization. And the key insight there is that you can boost your agent a lot by just rewiring the agent graph without upgrading your model. So now you've got a bunch of open source projects, ranging from TextGrad to DSPy, OpenEvolve, GEPA. . .all designed to do a lot of these things.
And I would imagine, even as you're optimizing your agent, you're gonna want to run this agent through a bunch of scenarios that don't exist in your dataset—and that could involve even edge cases. And now that these agents are actually, as we discussed, doing a bunch of things, using a bunch of tools—that space is kind of broad, and I doubt you would have that historical data handy anyway—you would need tools that allow you to know, with confidence, that you've optimized this agent properly and that it's ready to at least be rolled out, even in a limited way.
18.50
Exactly, exactly. What you just described is exactly this need for a change of paradigm, right? We used to think that we need to learn by exposure, by learning from historical data. We definitely now need to have our systems learn by construction and be able to test them right away. And that's where I think synthetic data is actually a very good (and a needed) accelerator. And I'm just glad that AI agents brought that perspective, because. . . This perspective already existed. It was just harder to conceptualize and see the value, because it's very abstract.
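One simple way to picture the scenario generation Ben describes is as a cross product over axes of variation, with edge cases injected deliberately rather than waiting for them to show up in logs. A minimal sketch follows; the expense-report agent, the axes, and every name in it are hypothetical.

```python
import itertools
import json
import random

# Axes of variation for a hypothetical expense-report agent; every value
# here is invented for illustration.
USERS = ["new_hire", "contractor", "executive"]
TOOL_FAULTS = [None, "timeout", "malformed_response", "auth_expired"]
AMOUNTS = [0.0, 49.99, 10_000.0, -5.0]  # edge cases included on purpose

def generate_scenarios(seed: int = 0):
    """Enumerate synthetic eval scenarios, including combinations unlikely to
    exist in any historical log (e.g., negative amount plus a tool fault),
    so the optimized agent is exercised beyond its collected data."""
    rng = random.Random(seed)
    for user, fault, amount in itertools.product(USERS, TOOL_FAULTS, AMOUNTS):
        yield {
            "user_type": user,
            "injected_tool_fault": fault,
            "amount": amount,
            "utterance": f"File an expense for ${amount:.2f}",
            "expect_escalation": amount < 0 or amount >= 10_000,
            "trace_id": rng.getrandbits(32),
        }

# 3 users x 4 faults x 4 amounts = 48 scenarios; print the first three.
for scenario in itertools.islice(generate_scenarios(), 3):
    print(json.dumps(scenario))
```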
19.32
If you think of all the agents at least on the business side, right, so server side, the coding agents, actually a lot of these business agents are coming out of China. Since I spent a lot of time in China in the past, I’ve been talking to a bunch of people there, and I guess, the reason that the Chinese companies are moving to the West is it’s much easier to charge people in the West than in China.
So for whatever reason, they’re here; they’re building these tools that will automate a bunch of things. Right. So the canonical example would be, “Create a PowerPoint presentation based on the following specs and blah, blah, blah.” But if you can imagine these business process agents becoming more and more complex, hitting more and more tools, it’s just impossible to think that you would have all of that historical data handy anyway, so you would really need a way to simulate the behavior of these agents.
20.45
And one question I have, Fabiana: One of the things that you keep reading about, and I guess is generally true of millennials, is chatbots becoming kind of true friends or companions or even romantic partners.
It got me thinking. If that's happening, in order to harden this chatbot, you would need to simulate data where the chatbot is now starting to detect emotion, emotional response—you know, not just plain text. As you're testing these chatbots, you have to inject all sorts of emotional scenarios, because now it's acting like someone's friend. So have you heard of emotion being part of synthetic data generation somehow?
21.52
Not really. And I’m probably a bit more skeptical when it comes to emotion. I understand your point. It depends on what you consider emotion.
22.05
I’m skeptical too. I’m not sure if it’s happening. I’m just speculating that because the interaction is becoming emotional to some degree, there must be some people attempting to generate data that has an emotional dimension somehow. I’m just making this up, by the way.
22.30
Yeah, yeah, yeah. [laughs] No, I bet it's a possibility, and I wouldn't be surprised if someone was doing that. Emotions have been one of the focuses of AI; we've always heard about sentiment analysis. So I wouldn't be surprised. I'm not aware [of any] myself. But as I told you, I'm really skeptical that even synthetic data could be helpful on that side.
Perhaps you can create better boundaries, if that makes sense. But still, there's always a limited capability of these models to really understand beyond syntax. And that's where I still stand. Even if someone told me they were able to get better results, I [would think] that those better results were achieved in a very specific, narrow kind of situation. Though. . .
Well, we have heard the stories of people [who] are very happy with bots, that they never felt more companionship than [with] the bots they have right now. So there’s a lot of nuance there. [laughs]
23.51
One of the things that brought synthetic data back into the headlines maybe 12 or 18 months ago was that there was suddenly a lot of talk about "We're running out of data. All these models are being trained on internet data, but everyone has basically vacuumed up all of that data. So now we have to distinguish our model or make our models even better."
Obviously scaling laws have multiple dimensions. There's compute; there's data. But since data is running out, we need synthetic data, right? On the other hand, though, a lot of people raised the possibility that AI trained on AI data is going to lead to some sort of model collapse. So what have you heard recently in terms of the concerns around. . .
You know, obviously "There's no such thing as a free lunch. . ." So everything you use has potential disadvantages. And the disadvantage that people bring up, Fabiana, [is] if you're training models on synthetic data, then that's going to degrade the model over time, because basically it's a loop, right? The model's capability of generating synthetic data is limited by the model itself. So therefore, you know. . .
25.42
And that's under the assumption that the synthetic data we are talking about is generated by LLMs. We can't forget that there's way more to synthetic data. There are simulations, and simulations [have been] used for quite some time with very good results. They were used for studies of COVID vaccination. They're used every day for weather, and they work. But of course there's a limitation. I agree there's no free lunch. I wouldn't say it degrades the capability of the model, but I would definitely say there's a plateau.
Because unless you are making assumptions based on what you know (cases where there's no collected data but the behavior actually happens), unless you introduce new behaviors, the fact that we are generating the same data around the same behaviors means you will reach a plateau. But I also think that's one of the things that AIs like LLMs will always have a problem with, regardless: They are always dependent on having seen a lot of data.
And we know that that plateau will eventually be reached. And then we have a totally different problem: How, mathematically, can we solve this bottleneck? And on that side, I don't think synthetic data will be the answer anymore.
27.32
What we just discussed focuses mainly on LLMs and foundation models involving text. But one area that people seem particularly excited about these days is foundation models for the physical world, primarily robotics. In that world, it seems like there are a few general approaches people are taking. One is [to] actually collect data, but obviously they don't have the same internet-scale data that you have for LLMs.
Second, you generate data by having humans do a task, and you just capture it on video, and that's how you collect data. And then the third approach is simulation. So basically, now that you've collected human data, maybe you can use simulations to expand the amount of data you have. The critics say that simulations are fine, but there's still a gap between [the] simulation [and] real data.
I mean, these are, you know, people like Rodney Brooks, one of the granddaddies of robotics. So it seems like, in certain areas like that, synthetic data may still need work, no?
29.12
I wouldn't say "may still need work," but I would say that it definitely needs to be explored more. It's more on that side. Because I know companies that work specifically on synthetic data for robotics, and they are having very good results.
And I understand that a lot of people. . .
29.39
We have to have them talk to Rodney. [laughs]
29.41
Perhaps. Because we have to be pragmatic. You want to develop robots and solutions for automation. But data collection is expensive and time-consuming. And it's very hard to get all the movements that you want to capture collected naturally.
Having said that, simulation is great. Synthetic data can help in, you know, building a bridge between the real data and the simulations. In some cases, it won't cover 100%, but it will cover perhaps 80% to 90%. And sometimes it's better to have 80% of the cases [covered] than having the 20% covered by real data. I think it's more a pragmatic approach here, and [in] real-world scenarios, a lot of times the 80% is very good. Excellent, actually.
30.42
So in closing, going back to the topic of agents: Obviously, people tend to get ahead of themselves—people are still working on single agents that do very narrow tasks. But then on the other hand, there's already a lot of talk about multi-agents, and obviously multi-agents introduce a lot more complexity, particularly if the agents are communicating. So there are communication challenges between those agents. What are some of the new tools that you're hearing about that specifically target multi-agents, or the scale that agents have introduced to synthetic data?
31.34
Not new tools, actually. But of course, we have been actively working on—and a lot of the vendors in synthetic data that already work with this type of data are exploring—covering new scenarios and new features. A lot of these agents rely, for example, on document processing. So there are new solutions for document generation, which is highly helpful.
One of the things that I also like is, for example, in market research, there are all these synthetic personas required nowadays to accelerate hypothesis testing—learning speeds, for example—which is very interesting. Or there are solutions being developed to help with reasoning structure for bots. So those are, I wouldn't say specific tools that are coming out, but definitely solutions that are being developed targeting the needs and requirements of testing multi-agent architectures.
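To make the synthetic-persona idea concrete, here is a deliberately small sketch: sample persona attributes, then turn each record into a prompt that could seed one simulated interview. The attribute names and the uniform sampling are illustrative assumptions; a real system would condition on survey or CRM marginals, and this isn't a description of any vendor's product.

```python
import random

# Hypothetical attribute space for market-research personas.
ATTRS = {
    "age_band": ["18-24", "25-34", "35-49", "50-64", "65+"],
    "region": ["urban", "suburban", "rural"],
    "price_sensitivity": ["low", "medium", "high"],
}

def sample_personas(n: int, seed: int = 0) -> list:
    """Draw lightweight persona records; uniform sampling keeps the sketch simple."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in ATTRS.items()} for _ in range(n)]

# Turn each persona into a prompt that could seed a simulated interview.
for p in sample_personas(5):
    print(f"You are a {p['age_band']} {p['region']} consumer with "
          f"{p['price_sensitivity']} price sensitivity. React to the offer.")
```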
32.46
Yeah. It seems like there's. . . Like there's a group out of Meta that—I don't know how real this is—released a paper that basically uses Ray for scale and orchestration, specifically [to] increase throughput, mainly to generate synthetic data for multi-agent scenarios. I'm not sure. It seems like, according to the paper, they're actually using this, but I'm not sure if anyone else is using it.
33.41
Yeah, but that's. . . Companies will use it in different ways, right? That's an architectural solution for a problem they have. They want to augment the throughput, test the system loads. And how to apply synthetic data generation will be a decision for the different engineering teams.
Testing throughput, testing system capabilities: Well, we have been using synthetic data that way for decades now. It's just a change of paradigm. And by the way, it's not really a change, because if we think about multi-agents just as we think about microservices from the 2010s, it's the same concept; it's the same needs. It's just a shift in terms of tools.
It's just that instead of being applied to software engineering, you are actually applying this to AI-driven solutions. So I see a lot of change in that area, on tooling, even, for example, authentication for agents; we are seeing a lot of solutions exactly for that. But it's not something specific to synthetic data. It's more in the broader sense of architectural solutions to deliver multi-agent systems.
35.01
Yeah. And also it seems like it fits into the natural tooling that’s happening in multimodal data and data for generative AI in general in that you need high throughput, but you also need efficient utilization of a lot of resources between GPUs and CPUs and fine-grained utilization, because, basically, these are precious computing resources.
And with that, thank you, Fabiana.
35.37
Thank you, Ben. Thank you for having me. This was a pleasure.