
Three types of attention masks and the corresponding graph G used in the analysis (self-loops are omitted for clarity). A directed edge from token j to token i indicates that token i attends to token j. The node highlighted in yellow (Definition 3.1) represents a token that can be attended to, directly or indirectly, by all other tokens in the sequence. As depicted in the top row, the graph-theoretic formulation captures both the direct and indirect contributions of tokens to the overall context, providing a comprehensive view of token interactions under multi-layer attention. Credit: arXiv (2025). DOI: 10.48550/arxiv.2502.01951
Research shows that large language models (LLMs) tend to overemphasize information at the beginning and end of a document or conversation while neglecting the middle.
This “position bias” means that if a lawyer uses an LLM-powered virtual assistant to retrieve a specific phrase in a 30-page affidavit, the LLM is more likely to find the right text when it appears on the first or last pages.
Researchers at MIT have discovered the mechanism behind this phenomenon.
They created a theoretical framework to study how information flows through the machine learning architecture that forms the backbone of LLMs. They found that certain design choices controlling how the model processes input data can cause position bias.
Their experiments revealed that model architecture, in particular the way it spreads information across the input words, can give rise to or intensify position bias, and that training data also contribute to the problem.
This work is published on the arXiv preprint server.
In addition to pinpointing the origins of position bias, their framework can be used to diagnose and correct it in future model designs.
This could lead to more reliable chatbots that stay on topic during long conversations, medical AI systems that reason more equitably when processing troves of patient data, and code assistants that pay close attention to every part of a program.
“These models are black boxes, so as an LLM user, you probably don’t know that position bias can make your model inconsistent. You just feed it your documents in whatever order you want and expect it to work. But by better understanding the underlying mechanisms of these black-box models, we can improve them by addressing these limitations,” says Xinyi Wu, a graduate student in the MIT Laboratory for Information and Decision Systems (LIDS) and first author of the paper.
Her co-authors include Yifei Wang, an MIT postdoc, as well as senior authors Stefanie Jegelka, an associate professor of electrical engineering and computer science (EECS) and a member of IDSS and the Computer Science and Artificial Intelligence Laboratory (CSAIL), and Ali Jadbabaie, a professor in the Department of Civil and Environmental Engineering, a core faculty member of IDSS, and a principal investigator in LIDS. The research will be presented at the International Conference on Machine Learning.
Analyzing attention
LLMs such as Claude, Llama, and GPT-4 are powered by a type of neural network architecture known as a transformer. Transformers are designed to process sequential data, encoding a sentence into chunks called tokens and then learning the relationships between tokens to predict the next word.
These models are very good at this because of the attention mechanism, which uses interconnected layers of data-processing nodes to make sense of context by allowing tokens to selectively focus on, or attend to, related tokens.
But if every token can attend to every other token in a 30-page document, the computation quickly becomes intractable. So, when engineers build transformer models, they often employ attention masking techniques that limit the words a token can attend to. For instance, a causal mask only allows each word to attend to the words that came before it.
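To make the idea concrete, here is a minimal sketch (not the researchers' code) of scaled dot-product attention with a causal mask, written in PyTorch; the tensor shapes and function name are illustrative only:

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask.

    q, k, v have shape (seq_len, d), one row per token. Each token may
    only attend to itself and to tokens that come before it.
    """
    seq_len, d = q.shape
    scores = q @ k.T / d ** 0.5                                    # pairwise similarity scores
    future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    scores = scores.masked_fill(future, float("-inf"))             # block attention to later tokens
    weights = F.softmax(scores, dim=-1)                            # each row sums to 1 over allowed positions
    return weights @ v

# Example: 5 tokens with 8-dimensional embeddings
q = k = v = torch.randn(5, 8)
print(causal_attention(q, k, v).shape)  # torch.Size([5, 8])
```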
Engineers also use positional encodings to help the model understand where each word sits in a sentence, which improves performance.
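One widely used scheme is the sinusoidal positional encoding from the original transformer paper; the short sketch below is included only as background and is not necessarily the encoding analyzed in this study:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding (Vaswani et al., 2017).

    Returns an array of shape (seq_len, d_model) that is added to the token
    embeddings so the model can distinguish positions.
    """
    positions = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                  # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)    # one angle per (position, frequency)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

print(sinusoidal_positional_encoding(4, 8).round(2))
```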
The MIT researchers built a graph-based theoretical framework to explore how these modeling choices, attention masks and positional encodings, affect position bias.
“Everything is coupled and intertwined within the attention mechanism, so it is very hard to study. Graphs are a flexible language to describe the dependencies among words within the attention mechanism and to trace them across multiple layers,” says Wu.
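As a rough illustration of that idea (our sketch, not the paper's framework verbatim), an attention mask can be read as a directed graph and composed across layers to see which tokens can, even indirectly, influence which others:

```python
import networkx as nx

def mask_to_graph(mask):
    """Interpret an attention mask as a directed graph.

    mask[i][j] == 1 means token i may attend to token j, drawn as an edge
    j -> i (information flows from j into i's representation).
    """
    n = len(mask)
    g = nx.DiGraph()
    g.add_nodes_from(range(n))
    for i in range(n):
        for j in range(n):
            if mask[i][j]:
                g.add_edge(j, i)
    return g

# Causal mask over 5 tokens: token i may attend to tokens 0..i
n = 5
causal = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
g = mask_to_graph(causal)

# Stacking attention layers roughly composes these edges, so token j can
# indirectly influence token i whenever a directed path from j to i exists.
# With a causal mask, the first token reaches every other token, which hints
# at the bias toward the start of the sequence.
for target in range(n):
    sources = nx.ancestors(g, target) | {target}
    print(f"token {target} is influenced by tokens {sorted(sources)}")
```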
Their theoretical analysis suggested that causal masking gives the model an inherent bias toward the beginning of the input, even when that bias does not exist in the data.
If the earlier words are relatively unimportant to a sentence’s meaning, causal masking can still cause the transformer to pay more attention to its beginning.
“While it is often true that earlier words and later words in a sentence are more important, these biases can be extremely harmful if an LLM is used on a task that is not natural language generation, like ranking or information retrieval,” says Wu.
As a model grows, with additional layers of the attention mechanism, this bias is amplified because earlier parts of the input are used more frequently in the model’s reasoning process.
They also found that using positional encodings to link words more strongly to nearby words can mitigate position bias. The technique refocuses the model’s attention in the right place, but its effect can be diluted in models with more attention layers. And these design choices are only one cause of position bias; some can come from the training data a model uses to learn how to prioritize words in a sequence.
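One common way to tie attention more strongly to nearby words is to penalize attention scores by token distance, in the spirit of relative-position schemes such as ALiBi; the sketch below illustrates the general idea rather than the specific encodings studied in the paper:

```python
import torch
import torch.nn.functional as F

def attention_with_distance_decay(q, k, v, slope=0.5):
    """Causal attention whose scores are penalized by token distance.

    Subtracting slope * |i - j| from each score (an ALiBi-style bias) makes
    every token favor nearby tokens over distant ones.
    """
    seq_len, d = q.shape
    scores = q @ k.T / d ** 0.5
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).abs()                # |i - j|
    scores = scores - slope * distance                            # penalize far-away tokens
    future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    scores = scores.masked_fill(future, float("-inf"))            # keep the causal mask
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(6, 8)
print(attention_with_distance_decay(q, k, v).shape)  # torch.Size([6, 8])
```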
“If you know your data are biased in a certain way, then you should also fine-tune your model in addition to adjusting your modeling choices,” says Wu.
Lost in the middle
After establishing their theoretical framework, the researchers conducted experiments in which they systematically varied the position of the correct answer within text sequences for an information retrieval task.
The experiments revealed a “lost-in-the-middle” phenomenon, in which retrieval accuracy followed a U-shaped pattern: the model performed best when the correct answer was placed at the beginning of the sequence, performance declined as the answer approached the middle, and it rebounded slightly when the answer was near the end.
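A position-sweep probe of this kind can be set up roughly as follows; the sketch is ours, and `query_model`, the prompt format, and the scoring rule are placeholders rather than the authors' actual benchmark:

```python
import random

def build_context(fact, filler_sentences, position):
    """Insert the target fact at a relative position (0.0 = start, 1.0 = end)."""
    docs = list(filler_sentences)
    docs.insert(int(position * len(docs)), fact)
    return " ".join(docs)

def accuracy_at_position(query_model, fact, question, answer, filler, position, trials=20):
    """Estimate retrieval accuracy when the fact sits at a given relative position."""
    correct = 0
    for _ in range(trials):
        random.shuffle(filler)                                    # vary the distractor order
        prompt = build_context(fact, filler, position) + "\n\nQuestion: " + question
        if answer.lower() in query_model(prompt).lower():         # crude containment check
            correct += 1
    return correct / trials

# Sweeping position from 0.0 to 1.0 typically traces the U-shaped curve described
# above: highest accuracy at the start, a dip in the middle, partial recovery at the end.
```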
Ultimately, their research suggests that using a different masking technique, removing extra layers from the attention mechanism, or strategically employing positional encodings can reduce position bias and improve a model’s accuracy.
“By combining theory and experiments, we were able to look at the consequences of model design choices that weren’t clear at the time. If you want to use a model in high-stakes applications, you must know when it will work, when it won’t, and why,” says Jadbabaie.
In the future, the researchers want to further explore the effects of positional encodings and study how position bias could be strategically exploited in certain applications.
“These researchers offer a rare theoretical lens into the attention mechanism at the heart of the transformer model. They provide a compelling analysis that clarifies longstanding quirks in transformer behavior, showing that attention mechanisms, especially with causal masks, inherently bias models toward the beginning of a sequence,” says Amin Saberi, professor and director of the Stanford University Center for Computational Market Design, who was not involved with this work.
More information: Xinyi Wu et al, On the Emergence of Position Bias in Transformers, arXiv (2025). DOI: 10.48550/arxiv.2502.01951
Journal information: arXiv
Provided by Massachusetts Institute of Technology
This story has been republished courtesy of MIT News (web.mit.edu/newsoffice/), a popular site that covers news about MIT research, innovation and education.
Citation: Lost in the middle: AI position bias arises from LLM architecture and training data (2025, June 17) retrieved from https://techxplore.com/news/2025-06
This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.