
Examples of ARC and BBH tasks that the model solves successfully only after applying test-time training. Credit: arXiv (2024). DOI: 10.48550/arxiv.2411.07279
For all their impressive abilities, large language models (LLMs) often fall short when given challenging new tasks that require complex reasoning skills.
While an accounting firm's LLM might excel at summarizing financial reports, that same model could fail unexpectedly if tasked with predicting market trends or identifying fraudulent transactions.
To make LLMs more adaptable, MIT researchers investigated how a particular training technique can be strategically deployed to boost a model's performance on unfamiliar, difficult problems.
They show that test-time training, a method that involves temporarily updating some of a model's inner workings during deployment, can lead to a sixfold improvement in accuracy. The researchers developed a framework for implementing a test-time training strategy that uses examples of the new task to maximize these gains.
Their work could improve a model's flexibility, enabling an off-the-shelf LLM to adapt to complex tasks that require planning or abstraction. This could make LLMs more accurate in many applications that require logical deduction, from medical diagnostics to supply chain management.
"Genuine learning, which is what we did here with test-time training, is something these models can't do on their own after they are shipped. They can't gain new skills or get better at a task. But we have shown that pushing the model a little bit toward actual learning can lead to major improvements in performance," says Ekin Akyürek PhD '25, the lead author of the study.
Akyürek is joined on the paper by graduate students Mehul Damani, Linlu Qiu, Han Guo, and Jyothish Pari; Adam Zweiger; and senior authors Yoon Kim, an assistant professor of Electrical Engineering and Computer Science (EECS) and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL), and Jacob Andreas, an associate professor in EECS and a member of CSAIL.
The study will be presented at the International Conference on Machine Learning (ICML 2025), held in Vancouver from July 13 to 19. The paper is also available on the arXiv preprint server.
Tackling hard domains
LLM users often try to improve a model's performance on a new task with a technique called in-context learning, in which they feed the model a few examples of the new task as text prompts to guide its outputs.
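As a rough illustration of the idea (the toy task and prompt below are invented for this article, not taken from the paper), in-context learning amounts to packing a few worked examples into the prompt and letting the frozen model continue the pattern:

```python
# Toy illustration of in-context learning: the new task is conveyed only
# through a few worked examples placed in the prompt; no weights are updated.
few_shot_prompt = (
    "Task: reverse each word.\n"
    "Input: cat -> Output: tac\n"
    "Input: book -> Output: koob\n"
    "Input: lamp -> Output:"
)
# The frozen model is simply asked to continue this text with any standard
# text-generation API; its parameters stay exactly as they were shipped.
print(few_shot_prompt)
```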
However, in-context learning doesn't always work for problems that require logic and reasoning.
The MIT researchers investigated how test-time training can be used in combination with in-context learning to boost performance on these challenging tasks. Test-time training involves updating some of the model's parameters, the internal variables it uses to make predictions, using a small amount of new data specific to the task at hand.
The researchers explored how test-time training interacts with in-context learning, and studied design choices that maximize the performance improvements one can coax out of a general-purpose LLM.
"We find that test-time training is a much stronger form of learning. While simply providing examples can modestly boost accuracy, actually updating the model with those examples can lead to significantly better performance, particularly in challenging domains," Damani says.
In-context learning requires a small set of task examples, including problems and their solutions. The researchers use these examples to create the task-specific dataset needed for test-time training.
To expand the size of this dataset, they create new inputs by slightly changing the problems and solutions in the examples, such as by flipping some input data horizontally. They find that training the model on the outputs of this new dataset leads to the best performance.
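A minimal sketch of that augmentation idea, assuming the task examples are small 2D grids (as in ARC-style puzzles) and using a horizontal flip as the transformation; the helper names are illustrative, not taken from the paper's code:

```python
# Build a larger test-time training set by applying a simple geometric
# transform (horizontal flip) to each demonstration problem and its solution.
def flip_horizontal(grid):
    """Mirror a 2D grid (a list of rows) left to right."""
    return [list(reversed(row)) for row in grid]

def augment(examples):
    """Return the original (problem, solution) pairs plus flipped copies."""
    augmented = list(examples)
    for problem, solution in examples:
        augmented.append((flip_horizontal(problem), flip_horizontal(solution)))
    return augmented

# Toy demonstration pair: the "solution" is the problem grid mirrored.
demos = [([[1, 0, 0], [0, 1, 0]], [[0, 0, 1], [0, 1, 0]])]
print(augment(demos))  # two pairs: the original and its flipped variant
```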
In addition, the researchers update only a small number of model parameters using a technique called low-rank adaptation, which improves the efficiency of the test-time training process.
"This is important because the method needs to be efficient if it is going to be deployed in the real world. We find that you can get huge improvements in accuracy with a very small amount of parameter training," Akyürek says.
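One way to realize this in practice (a sketch only, built on the open-source Hugging Face `transformers` and `peft` libraries rather than the authors' own code, with a small `gpt2` model as a stand-in) is to wrap the base model with low-rank adapters so that only a tiny fraction of parameters is trainable:

```python
# Low-rank adaptation (LoRA): freeze the base weights and train only small
# low-rank adapter matrices added to selected layers.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "gpt2"  # stand-in model; the paper works with larger LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=8,                        # rank of the adapter matrices
    lora_alpha=16,              # scaling factor for the adapter updates
    target_modules=["c_attn"],  # attention projection layers in GPT-2
    lora_dropout=0.0,
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters

# `model` can now be fine-tuned on the augmented task examples; the original
# weights stay frozen, which keeps each per-task update cheap.
```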
Developing new skills
Because test-time training is employed on a per-instance basis, a user would need to do it for every individual task, so streamlining the process is important. The model updates are also only temporary: after making a prediction, the model reverts to its original form.
A model that usually takes less than a minute to answer a query might take five to 10 minutes with test-time training, Akyürek adds.
"We wouldn't want to do this for all user queries, but it is useful if you have a very hard task that you want the model to solve well. There also might be tasks that are too challenging for an LLM to solve without this method," he says.
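Put together, the per-query workflow looks roughly like the sketch below. The two helper stubs stand in for "fine-tune a temporary LoRA adapter on the task data" and "generate with the adapter attached"; they are hypothetical placeholders, not a real API:

```python
# Schematic per-query workflow for test-time training; all helpers are stubs.
def train_lora_adapter(base_model, dataset):
    return "tiny-adapter"  # stub: would return trained low-rank adapter weights

def generate(base_model, query, adapter=None):
    return f"answer to {query!r} using {adapter}"  # stub: would run the LLM

def answer_with_test_time_training(base_model, task_examples, query):
    adapter = train_lora_adapter(base_model, task_examples)   # temporary update
    prediction = generate(base_model, query, adapter=adapter)
    # The adapter is simply discarded afterwards; the base weights were never
    # changed, so the model reverts to its original, general-purpose form.
    return prediction

print(answer_with_test_time_training("base-llm", [("problem", "solution")], "new puzzle"))
```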
The researchers tested their approach on two benchmark datasets of extremely complex problems, such as IQ puzzles. It boosted accuracy as much as sixfold over techniques that use only in-context learning.
Tasks that involved structured patterns or made use of completely unfamiliar types of data showed the largest performance improvements.
"For simpler tasks, in-context learning might be fine, but updating the parameters themselves can potentially develop a new skill in the model," Damani says.
In the future, the researchers hope to use these insights to develop models that can continually learn.
The long-term goal is an LLM that, given a query, can automatically determine whether it needs to update its parameters using test-time training or whether it can solve the task with in-context learning, and then implement the best test-time training strategy without the need for human intervention.
More information: Ekin Akyürek et al, The Surprising Effectiveness of Test-Time Training for Few-Shot Learning, arXiv (2024). DOI: 10.48550/arxiv.2411.07279
Journal information: arXiv
Provided by Massachusetts Institute of Technology
This story has been republished courtesy of MIT News (web.mit.edu/newsoffice/), a popular site that covers news about MIT research, innovation and education.
Citation: Test-time training could lead to better LLMs for complex reasoning (July 8, 2025) retrieved from https://techxplore.com/news/2025-07-llms-complex.html.
This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.
