
CoSyn works by leveraging the language skills of open-source AI models to create training data that teaches other AI models how to read complex, text-rich images. Credit: Yue Yang
In the race to develop AI that understands complex images such as financial forecasts, medical diagrams and nutrition labels, a capability AI needs to work independently in everyday settings, closed-source systems like ChatGPT and Claude are currently setting the pace. But no one outside their makers knows how those models were trained or what data they used.
Now, Penn Engineering researchers and the Allen Institute for AI (Ai2) have developed a new approach to training open-source models: using AI to create scientific figures, charts and tables that teach other AI systems how to interpret complex visual information.
Their tool, CoSyn (short for Code-Guided Synthesis), taps the coding skills of open-source AI models to render text-rich images and generate relevant questions and answers, giving other AI systems the data they need to learn how to "see" and understand scientific material.
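To make the idea concrete, here is a minimal, hypothetical sketch of code-guided synthesis under stated assumptions: a text-only model writes chart-rendering code, the code is executed to produce an image, and the same model then writes question-answer pairs grounded in the code it authored. The `ask_llm` helper, the prompts and the canned outputs are placeholders for illustration, not the authors' actual pipeline.

```python
# Illustrative sketch of code-guided synthesis, not the authors' implementation:
# a text-only model writes chart-rendering code, the code is executed to produce
# an image, and the same model then writes Q&A pairs grounded in that code.
import subprocess
import textwrap

def ask_llm(prompt: str) -> str:
    """Stand-in for a call to an open-source language model.

    Returns canned output so the sketch runs end to end; in practice the prompt
    would be routed to a locally hosted open-weights model.
    """
    if "matplotlib" in prompt:
        return textwrap.dedent("""\
            import matplotlib
            matplotlib.use("Agg")
            import matplotlib.pyplot as plt

            quarters = ["Q1", "Q2", "Q3", "Q4"]
            revenue = [1.2, 1.5, 1.1, 1.9]
            plt.bar(quarters, revenue)
            plt.title("Fictional quarterly revenue ($M)")
            plt.savefig("chart.png")
        """)
    return "Q: Which quarter had the highest revenue? A: Q4, at $1.9M."

# Step 1: ask the model to write code that renders a text-rich image.
chart_code = ask_llm(
    "Write a self-contained matplotlib script that saves a bar chart of "
    "quarterly revenue for a fictional company to 'chart.png'."
)

# Step 2: execute the generated code to render the image
# (the subprocess requires matplotlib to be installed).
with open("render_chart.py", "w", encoding="utf-8") as f:
    f.write(chart_code)
subprocess.run(["python", "render_chart.py"], check=True)

# Step 3: because the model can read the code it just wrote, it can produce
# grounded question-answer pairs without ever "seeing" the rendered image.
qa_pairs = ask_llm(
    "Given this chart-rendering code, write question-answer pairs that test "
    "whether a vision-language model understood the chart:\n\n" + chart_code
)
print(qa_pairs)
```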
As the researchers detail in a paper at ACL 2025, one of the world's leading AI conferences, CoSyn-trained models match or exceed their proprietary peers.
"It's like taking a student who's good at writing and asking them to teach someone else how to draw," says Yue Yang (GrEng'25), a co-author of the paper and now a research scientist in Ai2's Perceptual Reasoning and Interaction Research (PRIOR) group. "Essentially, it shifts the strengths of open-source AI from text to vision."
Synthetic images, real results
The resulting dataset, called CoSyn-400K, contains over 400,000 synthetic images and 2.7 million sets of corresponding instructions across a variety of categories, including scientific charts, chemical structures and user-interface screenshots. Models trained on CoSyn data surpassed top proprietary systems like GPT-4V and Gemini 1.5 Flash across a suite of seven benchmark tests.
In one particularly striking case, the researchers synthetically generated just 7,000 nutrition labels and trained a model for NutritionQA, a new benchmark they created. That small, targeted dataset allowed the model to beat others trained on millions of real images.
"Training AI with CoSyn is extremely efficient," says Mark Yatskar, assistant professor in Computer and Information Science (CIS) and co-advisor of Yang's doctoral work. "We show that synthetic data helps models generalize to real-world scenarios that may be unique to a person's needs, like reading a nutrition label for someone with low vision."
Scaling and diversifying datasets
Creating hundreds of thousands of useful and diverse training examples posed unique challenges.
To reach the necessary scale, co-author Ajay Patel, a doctoral student in CIS, developed DataDreamer, a software library that automates the entire data-generation process. DataDreamer allowed the team to prompt language models in parallel, enabling large-scale production of synthetic images and instructions.
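The parallelism itself can be pictured with a generic sketch like the one below. It deliberately does not use DataDreamer's actual API; it only illustrates the kind of fan-out such a library automates, and `ask_llm` is the same hypothetical stand-in as in the earlier sketch.

```python
# Generic illustration of prompting a model in parallel. This does NOT use
# DataDreamer's real API; it only shows the fan-out such a library automates.
from concurrent.futures import ThreadPoolExecutor

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call (see the earlier sketch)."""
    return f"[synthetic example generated for: {prompt!r}]"

# One prompt per training example to be synthesized.
prompts = [
    f"Write code that renders a synthetic chart about topic #{i}"
    for i in range(100)
]

# Fan the prompts out across worker threads so many examples are produced at once.
with ThreadPoolExecutor(max_workers=8) as pool:
    examples = list(pool.map(ask_llm, prompts))

print(f"{len(examples)} synthetic examples generated")
```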
To avoid repetition, the team used "personas," short character profiles such as "a sci-fi novelist" or "a chemistry teacher," to guide the AI's responses and shape the content and tone of each example. By embedding these personas in its prompts, CoSyn generates richer and more diverse training data across a wide range of domains.
"AI models tend to repeat themselves unless they're guided toward different perspectives," explains Patel. "Personas give us a scalable way to do that, and the results speak for themselves."
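A minimal sketch of persona-conditioned prompting follows; the personas and prompt wording here are illustrative assumptions, not the team's actual templates.

```python
# Minimal sketch of persona-conditioned prompting. The personas and wording are
# illustrative; embedding a persona in the prompt is what shifts each example's
# domain and tone, diversifying the resulting synthetic data.
personas = ["a sci-fi novelist", "a chemistry teacher", "a sports statistician"]
task = ("Describe a chart you might create, then write three question-answer "
        "pairs about it.")

def build_prompt(persona: str) -> str:
    # The persona prefix steers the same underlying task in a different direction.
    return f"You are {persona}. {task}"

for persona in personas:
    prompt = build_prompt(persona)
    print(prompt)
    # In practice each prompt would be sent to an open-source LLM, as in the
    # earlier sketches, and the responses collected as training data.
```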
Leveling the playing field for open-source AI
By building their tools entirely from open-source models, the researchers hope to democratize access to powerful vision-language training methods without the ethical and legal challenges surrounding web scraping and copyrighted content.
"This is a step toward AI that helps us make new scientific discoveries," adds Chris Callison-Burch, a CIS professor who co-advised Yang and now advises Patel. "It opens the door to AI systems that can reason about scientific documents and can help a wide range of people, from university students to researchers."
From understanding to action
The team has released the complete CoSyn code and dataset to the public, inviting the global research community to build on their work.
Yang is already pioneering synthetic data that helps AI not only understand images but also interact with them, acting as intelligent digital agents that can click buttons, fill out forms and assist users with daily tasks.
"In the long run, we want AI that can not only explain the world, but also act in it," Yang says. "This is one way of teaching it that."
More information: Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation, Yueyang1996.github.io/papers/cosyn.pdf
Provided by the University of Pennsylvania
Citation: AI vision, reinvented: Vision-language models gain clearer sight through synthetic training data (2025, July 21), retrieved July 22, 2025 from https://techxplore.com/news/2025-07-ai-vision-vision-reinvented-language-gain.html
This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.