
CoSyn works by leveraging the language skills of open-source AI models to create training data that teaches other AI models how to read complex, text-rich images. Credit: Yue Yang
In the race to develop AI that understands complex images such as financial forecasts, medical diagrams and nutrition labels, a capability AI needs to work independently in everyday settings, closed-source systems like ChatGPT and Claude are currently setting the pace. But no one outside their makers knows how those models were trained or what data they used.
Now, Penn Engineering researchers and the Allen Institute for AI (Ai2) have developed a new approach to training open-source models: using AI to create scientific figures, charts and tables that teach other AI systems how to interpret complex visual information.
Their tool, CoSyn (short for Code-Guided Synthesis), taps the coding skills of open-source AI models to render text-rich images and generate relevant questions and answers, giving other AI systems the data they need to learn how to "see" and understand scientific material.
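To make the idea concrete, here is a minimal, hypothetical sketch of code-guided synthesis under stated assumptions: a text-only model writes chart-rendering code, the code is executed to produce an image, and the same model then writes question-answer pairs grounded in the code it authored. The `ask_llm` helper, the prompts and the canned outputs are placeholders for illustration, not the authors' actual pipeline.

```python
# Illustrative sketch of code-guided synthesis, not the authors' implementation:
# a text-only model writes chart-rendering code, the code is executed to produce
# an image, and the same model then writes Q&A pairs grounded in that code.
import subprocess
import textwrap

def ask_llm(prompt: str) -> str:
    """Stand-in for a call to an open-source language model.

    Returns canned output so the sketch runs end to end; in practice the prompt
    would be routed to a locally hosted open-weights model.
    """
    if "matplotlib" in prompt:
        return textwrap.dedent("""\
            import matplotlib
            matplotlib.use("Agg")
            import matplotlib.pyplot as plt

            quarters = ["Q1", "Q2", "Q3", "Q4"]
            revenue = [1.2, 1.5, 1.1, 1.9]
            plt.bar(quarters, revenue)
            plt.title("Fictional quarterly revenue ($M)")
            plt.savefig("chart.png")
        """)
    return "Q: Which quarter had the highest revenue? A: Q4, at $1.9M."

# Step 1: ask the model to write code that renders a text-rich image.
chart_code = ask_llm(
    "Write a self-contained matplotlib script that saves a bar chart of "
    "quarterly revenue for a fictional company to 'chart.png'."
)

# Step 2: execute the generated code to render the image
# (the subprocess requires matplotlib to be installed).
with open("render_chart.py", "w", encoding="utf-8") as f:
    f.write(chart_code)
subprocess.run(["python", "render_chart.py"], check=True)

# Step 3: because the model can read the code it just wrote, it can produce
# grounded question-answer pairs without ever "seeing" the rendered image.
qa_pairs = ask_llm(
    "Given this chart-rendering code, write question-answer pairs that test "
    "whether a vision-language model understood the chart:\n\n" + chart_code
)
print(qa_pairs)
```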
As the researchers detail in a paper at ACL 2025, one of the world's leading AI conferences, CoSyn-trained models match or exceed their proprietary peers.
"It's like taking a student who's good at writing and asking them to teach someone else how to draw," says Yue Yang (GrEng'25), a co-author of the paper and now a research scientist in Ai2's Perceptual Reasoning and Interaction Research (PRIOR) group. "Essentially, it shifts the strengths of open-source AI from text to vision."
Synthetic images, real results
The resulting dataset, called CoSyn-400K, contains over 400,000 synthetic images and 2.7 million sets of corresponding instructions across a variety of categories, including scientific charts, chemical structures and user-interface screenshots. Models trained on CoSyn data surpassed top proprietary systems like GPT-4V and Gemini 1.5 Flash across a suite of seven benchmark tests.
In one particularly striking case, the researchers synthetically generated just 7,000 nutrition labels and trained a model for NutritionQA, a new benchmark they created. That small, targeted dataset allowed the model to beat others trained on millions of real images.
"Training AI with CoSyn is extremely efficient," says Mark Yatskar, assistant professor in Computer and Information Science (CIS) and co-advisor of Yang's doctoral work. "We show that synthetic data helps models generalize to real-world scenarios that may be unique to a person's needs, like reading a nutrition label for someone with low vision."
Scaling and diversifying datasets
Creating hundreds of thousands of useful and diverse training examples posed unique challenges.
To reach the necessary scale, co-author Ajay Patel, a doctoral student in CIS, developed DataDreamer, a software library that automates the entire data-generation process. DataDreamer allowed the team to prompt language models in parallel, enabling large-scale production of synthetic images and instructions.
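The parallelism itself can be pictured with a generic sketch like the one below. It deliberately does not use DataDreamer's actual API; it only illustrates the kind of fan-out such a library automates, and `ask_llm` is the same hypothetical stand-in as in the earlier sketch.

```python
# Generic illustration of prompting a model in parallel. This does NOT use
# DataDreamer's real API; it only shows the fan-out such a library automates.
from concurrent.futures import ThreadPoolExecutor

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call (see the earlier sketch)."""
    return f"[synthetic example generated for: {prompt!r}]"

# One prompt per training example to be synthesized.
prompts = [
    f"Write code that renders a synthetic chart about topic #{i}"
    for i in range(100)
]

# Fan the prompts out across worker threads so many examples are produced at once.
with ThreadPoolExecutor(max_workers=8) as pool:
    examples = list(pool.map(ask_llm, prompts))

print(f"{len(examples)} synthetic examples generated")
```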
To avoid repetition, the team used "personas," short character profiles such as "a sci-fi novelist" or "a chemistry teacher," to guide the AI's responses and shape the content and tone of each example. By embedding these personas in its prompts, CoSyn generates richer and more diverse training data across a wide range of domains.
"AI models tend to repeat themselves unless they're guided toward different perspectives," explains Patel. "Personas give us a scalable way to do that, and the results speak for themselves."
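A minimal sketch of persona-conditioned prompting follows; the personas and prompt wording here are illustrative assumptions, not the team's actual templates.

```python
# Minimal sketch of persona-conditioned prompting. The personas and wording are
# illustrative; embedding a persona in the prompt is what shifts each example's
# domain and tone, diversifying the resulting synthetic data.
personas = ["a sci-fi novelist", "a chemistry teacher", "a sports statistician"]
task = ("Describe a chart you might create, then write three question-answer "
        "pairs about it.")

def build_prompt(persona: str) -> str:
    # The persona prefix steers the same underlying task in a different direction.
    return f"You are {persona}. {task}"

for persona in personas:
    prompt = build_prompt(persona)
    print(prompt)
    # In practice each prompt would be sent to an open-source LLM, as in the
    # earlier sketches, and the responses collected as training data.
```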
Leveling the playing field for open-source AI
By building their tools entirely from open-source models, the researchers hope to democratize access to powerful vision-language training methods without the ethical and legal challenges surrounding web scraping and copyrighted content.
"This is a step toward AI that helps us make new scientific discoveries," adds Chris Callison-Burch, a CIS professor who co-advised Yang and now advises Patel. "It opens the door to AI systems that can reason about scientific documents and can help a wide range of people, from university students to researchers."
From understanding to action
The team has released the complete CoSyn code and dataset to the public, inviting the global research community to build on their work.
Yang is already pioneering synthetic data that helps AI not only understand images but also interact with them, acting as intelligent digital agents that can click buttons, fill out forms and assist users with daily tasks.
"In the long run, we want AI that can not only explain the world, but also act in it," Yang says. "This is one way of teaching it that."
More information: Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation, Yueyang1996.github.io/papers/cosyn.pdf
Provided by the University of Pennsylvania
Citation: AI vision, reinvented: Vision-language models gain clearer sight through synthetic training data (2025, July 21), retrieved July 22, 2025 from https://techxplore.com/news/2025-07-ai-vision-vision-reinvented-language-gain.html
This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.