
Processing differences with and without hardware accelerators. Credit: Electronics (2024). doi:10.3390/Electronics13234683
Large language models (LLMs) such as BERT and GPT are driving major advances in artificial intelligence, but their size and complexity usually demand powerful servers and cloud infrastructure. Running these models directly on devices, without relying on external computation, has remained a difficult technical challenge.
A research team at Sejong University has developed a new hardware solution that could help change that. The work is published in the journal Electronics.
The Scalable Transformer Accelerator Unit (STAU) is designed to run a variety of transformer-based language models efficiently on embedded systems. It dynamically adapts to different input sizes and model structures, making it particularly suitable for real-time, on-device AI.
At the heart of the STAU is a variable systolic array (VSA) architecture that performs the matrix operations at the core of transformer workloads in a way that adapts to the input sequence length. By streaming input data row by row while loading weights in parallel, the system reduces memory stalls and improves throughput. This is especially important for LLMs, where sentence lengths and token sequences vary widely between tasks.
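To make the row-streaming idea concrete, here is a minimal software sketch of a weight-stationary, row-streamed matrix multiply. It is an illustration only, not the STAU's actual hardware design: the function name, matrix sizes, and NumPy simulation are assumptions made for clarity.

```python
import numpy as np

def row_streamed_matmul(activations: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Software model of a weight-stationary, row-streamed matrix multiply.

    The weight matrix is loaded once (as if pre-loaded into the PE array) and
    rows of `activations` -- one per token -- are streamed through one at a
    time, so the amount of work tracks the input sequence length.
    """
    seq_len, d_model = activations.shape
    assert weights.shape[0] == d_model, "inner dimensions must match"
    out = np.empty((seq_len, weights.shape[1]), dtype=np.float32)
    for t in range(seq_len):            # stream one token (row) per step
        out[t] = activations[t] @ weights
    return out

# The same routine handles short and long token sequences without reconfiguration.
w = np.random.rand(64, 64).astype(np.float32)          # weights loaded in parallel
short_seq = np.random.rand(8, 64).astype(np.float32)   # 8 tokens
long_seq = np.random.rand(128, 64).astype(np.float32)  # 128 tokens
assert np.allclose(row_streamed_matmul(short_seq, w), short_seq @ w, atol=1e-4)
assert np.allclose(row_streamed_matmul(long_seq, w), long_seq @ w, atol=1e-4)
```

In a weight-stationary design like this, only the streamed rows scale with sequence length, which is why variable-length inputs do not require reloading or reconfiguring the array.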
In benchmark tests published in Electronics, the accelerator demonstrated a 3.45x speedup over CPU-only execution while maintaining numerical accuracy above 97%. When processing long sequences, it reduced total computation time by more than 68%.
Continued optimization since then has further improved the system's performance. According to the team, recent internal testing has achieved speedups of up to 5.18x, highlighting the architecture's long-term scalability.

Top module architecture. Credit: Electronics (2024). doi:10.3390/Electronics13234683

Processing element (PE) and variable systolic array (VSA) architecture. Credit: Electronics (2024). doi:10.3390/Electronics13234683
The researchers also redesigned the softmax function, a key stage of the transformer pipeline. Because softmax typically relies on exponentiation and normalization, it often becomes a bottleneck; the team replaced it with a lightweight Radix-2 approach that relies on shift-and-add operations. This reduces hardware complexity without compromising output quality.
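The article does not spell out the exact circuit, but the flavor of a Radix-2, shift-and-add softmax can be sketched in a few lines. In the sketch below (an assumption for illustration, not the paper's implementation), e^x is rewritten as 2^(x·log2 e); the integer part of the exponent maps to a bit shift and the fractional part is handled with a cheap linear term, so no exponential unit is needed.

```python
import numpy as np

LOG2_E = 1.4426950408889634  # log2(e): rewrites e^x as 2^(x * log2(e))

def radix2_softmax(x: np.ndarray) -> np.ndarray:
    """Shift-and-add style softmax approximation (illustrative sketch only)."""
    y = (x - x.max()) * LOG2_E          # stabilize, then convert to a base-2 exponent
    i = np.floor(y)                     # integer part -> implemented as a bit shift
    f = y - i                           # fractional part in [0, 1)
    pow2 = np.ldexp(1.0 + f, i.astype(np.int32))  # 2^f ~= 1 + f, shifted by i bits
    return pow2 / pow2.sum()

scores = np.array([2.3, -0.7, 0.1, 1.5])
print(np.round(radix2_softmax(scores), 4))
exact = np.exp(scores - scores.max())
print(np.round(exact / exact.sum(), 4))  # the approximation tracks the exact values closely
```

Because the normalization divides the approximate terms by their own sum, the small per-element error of the 2^f ~= 1 + f step largely cancels, which is one reason base-2 softmax variants can preserve output quality at much lower hardware cost.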
To further simplify the computation, the system uses a custom 16-bit floating-point format tailored specifically to transformer workloads. The format eliminates the need for layer normalization, another common performance bottleneck, and contributes to a more efficient and streamlined data path.
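The article does not give the exponent/mantissa split of the custom format, so the sketch below uses a bfloat16-style truncation of IEEE float32 purely to illustrate how a 16-bit datapath trades a small amount of precision for much simpler hardware; the format choice and function name are assumptions, not the paper's design.

```python
import numpy as np

def to_custom_fp16(x: np.ndarray) -> np.ndarray:
    """Quantize float32 values to a 16-bit floating-point format (bfloat16-style).

    Keeps the sign bit, the 8-bit exponent, and the top 7 mantissa bits of the
    IEEE float32 encoding, rounding half up on the dropped bits. This is only a
    stand-in for the paper's unspecified custom format.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    rounded = bits + np.uint32(0x8000)             # round on the 16 dropped bits
    return (rounded & np.uint32(0xFFFF0000)).view(np.float32)

vals = np.array([3.14159, -0.001234, 512.7], dtype=np.float32)
q = to_custom_fp16(vals)
print(q)                                           # quantized 16-bit values
print(np.abs((vals - q) / vals))                   # relative error stays below ~0.4%
```

Halving the word width roughly halves multiplier area and memory bandwidth per operand, which is why narrow custom formats are a common lever in embedded accelerator designs.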
The STAU was implemented on a Xilinx FPGA (VMK180) and controlled by an embedded Arm Cortex-R5 processor. This hybrid design lets developers support a variety of transformer models, including those used in LLMs, without any hardware changes.
The team sees the work as a step toward making advanced language models more accessible and deployable across a wider range of platforms, enabling real-time AI execution, better privacy, and low-latency responses.
"The STAU architecture demonstrates that transformer models, even larger ones, can be practical for on-device applications," says author Seok-Woo Chang. "It provides a foundation for building scalable and efficient intelligent systems."
More information: Seok-Woo Chang et al, Scalable Transformer Accelerator with Variable Systolic Array for Multiple Models in Voice Assistant Applications, Electronics (2024). doi:10.3390/Electronics13234683
Provided by Sejong University
Citation: Scalable transformer accelerator enables devices to run large language models (2025, July 21) retrieved July 21, 2025 from https://news/2025-07.
This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.