
Oaken’s quantization algorithm consists of three components: (a) threshold-based online-offline hybrid quantization, (b) group-shift quantization, and (c) dense-and-sparse encoding. Credit: Proceedings of the 52nd Annual International Symposium on Computer Architecture (2025). doi:10.1145/3695053.3731019
Modern generative AI models such as OpenAI’s ChatGPT-4 and Google’s Gemini 2.5 require not only high memory bandwidth but also large memory capacity. This is why generative AI cloud operators such as Microsoft and Google purchase hundreds of thousands of Nvidia GPUs.
As a solution to the core challenges of building such high-performance AI infrastructure, Korean researchers have developed NPU (neural processing unit) core technology that improves the inference performance of generative AI models by an average of more than 60% while consuming approximately 44% less power than the latest GPUs.
Professor Jongse Park's research team from the KAIST School of Computing, in collaboration with HyperAccel Inc., developed the high-performance, low-power NPU core technology, which specializes in generative AI clouds such as ChatGPT.
The techniques proposed by the research team were presented by Ph.D. student Minsu Kim and Dr. Seongmin Hong of HyperAccel Inc., the paper's co-first authors, at the 2025 International Symposium on Computer Architecture (ISCA 2025), held in Tokyo from June 21–25.
The key objective of this study is to improve the performance of large-scale generative AI services by lightening the inference process while minimizing accuracy loss and resolving memory bottlenecks. The research has been highly regarded for its integrated design of AI semiconductors and AI system software, both key components of AI infrastructure.
While existing GPU-based AI infrastructures require multiple GPU devices to meet high bandwidth and capacity demands, this technology enables the same level of AI infrastructure with fewer NPU devices through KV cache quantization. The KV cache accounts for most memory usage, so quantizing it significantly reduces the cost of building a generative AI cloud.
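To make the idea concrete, the sketch below shows plain group-wise low-bit quantization of a KV-cache tensor in NumPy. This is only an illustration of why quantization shrinks the cache (16-bit values become 4-bit codes plus small per-group metadata); Oaken's published algorithm additionally uses threshold-based online-offline hybrid quantization and group-shift steps not shown here, and the group size and shapes are made up for the example.

```python
import numpy as np

def quantize_kv_groupwise(x, group_size=32, bits=4):
    """Group-wise asymmetric quantization of a KV-cache tensor.

    Illustrative sketch only: Oaken's actual scheme adds outlier
    thresholds and group-shift handling on top of this basic idea.
    """
    flat = x.reshape(-1, group_size)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    scale = np.where(scale == 0, 1.0, scale)  # guard all-equal groups
    q = np.round((flat - lo) / scale).astype(np.uint8)  # 4-bit codes
    return q, scale, lo

def dequantize_kv_groupwise(q, scale, lo, shape):
    """Reconstruct an approximate FP tensor from codes and metadata."""
    return (q * scale + lo).reshape(shape)

# Hypothetical key cache: 4 heads x 128 tokens x 64 head dims, FP16.
kv = np.random.randn(4, 128, 64).astype(np.float16)
q, scale, lo = quantize_kv_groupwise(kv.astype(np.float32))
recon = dequantize_kv_groupwise(q, scale, lo, kv.shape)
err = np.abs(recon - kv.astype(np.float32)).mean()
```

Even this naive version cuts the cache payload roughly 4x (4-bit codes versus 16-bit floats), at the cost of a small per-group scale and offset; the hybrid and outlier-aware steps in the paper exist precisely to keep the resulting accuracy loss low.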

Overall Oaken accelerator architecture. Credit: Proceedings of the 52nd Annual International Symposium on Computer Architecture (2025). doi:10.1145/3695053.3731019
The researchers designed the technology to integrate with the memory interface without modifying the operational logic of existing NPU architectures. The hardware architecture not only implements the proposed quantization algorithm but also employs page-level memory management techniques to efficiently utilize limited memory bandwidth and capacity, and introduces new encoding techniques optimized for the quantized KV cache.
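One of the encoding ideas named in the figure caption, dense-and-sparse encoding, can be sketched loosely as follows: values above a calibrated magnitude threshold (outliers) are stored separately in full precision, while the remaining dense bulk is left for low-bit quantization. This is a simplification with a made-up threshold, not the published design.

```python
import numpy as np

def split_dense_sparse(x, threshold):
    """Split a tensor into a dense low-magnitude part and a sparse
    outlier part, loosely mimicking dense-and-sparse encoding.

    `threshold` stands in for an offline-calibrated value.
    """
    mask = np.abs(x) > threshold
    sparse_idx = np.flatnonzero(mask)       # positions of outliers
    sparse_val = x.flat[sparse_idx]         # kept in full precision
    dense = np.where(mask, 0.0, x)          # safe to quantize at low bits
    return dense, sparse_idx, sparse_val

def merge_dense_sparse(dense, sparse_idx, sparse_val):
    """Recombine the two streams into the original tensor."""
    out = dense.copy()
    out.flat[sparse_idx] = sparse_val
    return out

x = np.random.randn(1024).astype(np.float32)
dense, idx, vals = split_dense_sparse(x, threshold=2.5)
recon = merge_dense_sparse(dense, idx, vals)
```

The design rationale is that outliers are rare but dominate quantization error; pulling them into a small sparse side channel lets the dense remainder use very few bits without large accuracy loss.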
Furthermore, when building an NPU-based AI cloud, the high performance and low power consumption of NPUs are expected to significantly reduce operational costs compared with the latest GPUs.
Professor Jongse Park said, "Through collaboration with HyperAccel Inc., this research found a solution in lightweight algorithms for generative AI inference and successfully developed core NPU technology that can solve the memory problem.
"This technology demonstrates the potential to implement high-performance, low-power infrastructure specialized for generative AI, and is expected to play an important role not only in AI cloud data centers but also in AI transformation (AX) environments represented by dynamically executable AI such as agentic AI."
More information: Minsu Kim et al, Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization, Proceedings of the 52nd Annual International Symposium on Computer Architecture (2025). doi:10.1145/3695053.3731019
Provided by Korea Advanced Institute of Science and Technology (KAIST)
Citation: AI cloud infrastructure made faster and greener: NPU core improves inference performance by over 60% (2025, July 7), retrieved July 8, 2025 from https://techxplore.com/news/2025-07-Ai-cloud-infrastuture-faster-greener.htmll
This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.
