With the widespread deployment of long-context LLMs, the KV cache has emerged as a critical bottleneck because it grows linearly with sequence length. We present ShadowKV, a high-throughput long-context LLM inference system that stores the low-rank key cache on the GPU and offloads the value cache to the CPU, reducing the memory footprint to allow larger batch sizes and longer sequences. Evaluating ShadowKV on a broad range of benchmarks, including RULER, LongBench, and Needle In A Haystack, and on models such as Llama-3.1-8B-Instruct, Llama-3-8B-Instruct-1M, GLM-4-9B-Chat-1M, Yi-9B-200K, Phi-3-Mini-128K-Instruct, and Qwen2-7B-128K-Instruct, we show that it supports up to 6x larger batch sizes and boosts throughput by up to 3.04x on an A100 GPU without sacrificing accuracy, even surpassing the performance achievable with an infinite batch size under the assumption of infinite GPU memory.
To demonstrate the efficiency of ShadowKV, we deploy it in real-world large-batch serving scenarios. Measuring decoding throughput across different models, we show that ShadowKV can support up to 6x larger batch sizes and boost throughput by up to 3.04x. Our efficiency evaluation covers Llama-3.1-8B-Instruct, Llama-3-8B-Instruct-1M, GLM-4-9B-Chat-1M, and Yi-9B-200K on an A100. The full-attention baseline uses the largest batch size that fits entirely on the GPU; we also report full attention at the same batch size as ShadowKV and at an infinite batch size, assuming infinite GPU memory. Throughput is reported in tokens/s, with batch sizes given in parentheses.
| Model | Context | Full Attention | ShadowKV | Gain | Full Attention (same batch / Inf) |
|---|---|---|---|---|---|
| Llama-3-8B-1M (8 KV heads) | 60K | 160.62 (8) | 455.14 (48) | 2.83x | 168.72 (48) / 273.07 (Inf) |
| | 122K | 80.77 (4) | 239.51 (24) | 2.97x | 83.05 (24) / 134.30 (Inf) |
| | 244K | 40.37 (2) | 119.01 (12) | 2.95x | 52.00 (12) / 67.15 (Inf) |
| Llama-3.1-8B (8 KV heads) | 60K | 160.93 (8) | 472.77 (48) | 2.94x | 168.72 (48) / 273.07 (Inf) |
| | 122K | 80.78 (4) | 245.90 (24) | 3.04x | 83.05 (24) / 134.30 (Inf) |
| GLM-4-9B-1M (4 KV heads) | 60K | 241.05 (12) | 615.89 (50) | 2.56x | 266.24 (50) / 436.91 (Inf) |
| | 122K | 122.67 (6) | 293.40 (25) | 2.39x | 158.83 (25) / 214.87 (Inf) |
| | 244K | 61.13 (3) | 136.51 (12) | 2.23x | 78.84 (12) / 107.44 (Inf) |
| Yi-9B-200K (4 KV heads) | 60K | 204.81 (10) | 544.36 (42) | 2.66x | 271.21 (42) / 364.09 (Inf) |
| | 122K | 101.44 (5) | 260.03 (21) | 2.56x | 133.53 (21) / 179.06 (Inf) |
| | 244K | 46.74 (2) | 118.55 (10) | 2.54x | 65.79 (10) / 89.53 (Inf) |
The ShadowKV algorithm is divided into two main phases: pre-filling and decoding. The pre-filling phase performs low-rank decomposition of the pre-RoPE key cache, offloads the value cache, and constructs landmarks to facilitate subsequent high-throughput decoding. The decoding phase performs accurate KV selection and efficient sparse KV cache reconstruction.
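As a concrete illustration, the sketch below mirrors the pre-filling phase in PyTorch. The tensor shapes, `rank`, and `chunk_size` values are illustrative assumptions rather than the exact ShadowKV implementation, and the optimized CUDA kernels are omitted.

```python
import torch

def prefill_compress(pre_rope_keys, post_rope_keys, values, rank=160, chunk_size=8):
    """Illustrative pre-fill step for one layer (not the actual ShadowKV kernels).

    pre_rope_keys / post_rope_keys / values: [seq_len, num_kv_heads, head_dim], on GPU.
    """
    s, h, d = pre_rope_keys.shape

    # 1) Low-rank decomposition of the pre-RoPE key cache, flattened across
    #    KV heads so a single truncated SVD captures the shared subspace.
    k_flat = pre_rope_keys.reshape(s, h * d).float()
    U, S, Vh = torch.linalg.svd(k_flat, full_matrices=False)
    k_a = U[:, :rank]                    # [s, rank], stays on GPU
    k_b = S[:rank, None] * Vh[:rank, :]  # [rank, h * d], stays on GPU
    # Keys are later reconstructed on demand as (k_a @ k_b), followed by RoPE.

    # 2) Offload the full value cache to pinned CPU memory for async gathers.
    values_cpu = values.cpu().pin_memory()

    # 3) Landmarks: per-chunk means of the post-RoPE keys, kept on GPU.
    s_trim = (s // chunk_size) * chunk_size
    chunks = post_rope_keys[:s_trim].reshape(-1, chunk_size, h, d)
    landmarks = chunks.mean(dim=1)       # [num_chunks, num_kv_heads, head_dim]

    return (k_a, k_b), values_cpu, landmarks
```

After pre-filling, only the low-rank factors, the landmarks, a few outlier chunks, and a small sparse KV cache remain on the GPU, which is what frees memory for the larger batch sizes in the table above.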
In our paper, we show that ShadowKV reduces the GPU memory footprint of the KV cache by over 6x without accuracy degradation across a wide range of models and evaluation benchmarks, using only a minimal sparse KV cache budget.
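To see where that headroom comes from, a quick back-of-the-envelope calculation (assuming Llama-3.1-8B's configuration of 32 layers, 8 KV heads, head dimension 128, and fp16 storage) shows how quickly the full KV cache consumes an 80 GB A100:

```python
# Full-attention KV cache size, assuming Llama-3.1-8B: 32 layers, 8 KV heads,
# head dim 128, fp16 (2 bytes). The leading factor 2 accounts for keys and values.
layers, kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2
seq_len, batch = 122_000, 4

kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len * batch
print(f"{kv_bytes / 2**30:.1f} GiB")  # ~59.6 GiB, already most of an 80 GB A100
```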
As illustrated in the figure, ShadowKV enhances long-context LLM inference throughput by offloading the value cache to the CPU while keeping a low-rank key cache, landmarks, and outliers on the GPU. During decoding, it employs landmarks for efficient sparse attention, reducing both computation and data movement. ShadowKV uses a limited KV budget to achieve high accuracy, theoretically reaching over 7 TB/s of equivalent bandwidth on an A100, and empirically boosts generation throughput by 3.04x for Llama-3.1-8B on a batch of 122K-token contexts.
Our design of ShadowKV is inspired by two critical empirical observations regarding LLMs when dealing with long contexts, detailed as follows.
As shown in the figure below, we observe that pre-RoPE keys are exceptionally low-rank compared to the layer inputs, post-RoPE keys, values, key weight matrix, and value weight matrix. Moreover, pre-RoPE keys from different sequences share little of their low-rank subspace, whereas a sequence and its continuation share it strongly, enabling high compression rates within each sequence.
Meanwhile, in long-context LLM inference, the quadratic scaling of attention computation with sequence length makes the linear cost of low-rank decomposition during pre-filling negligible. This observation motivates us to store the low-rank keys and offload the values to reduce the memory footprint for larger batch sizes and longer sequences.
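This low-rank property is easy to check empirically. The snippet below is a sketch, assuming the per-layer caches have been captured as `[seq_len, num_kv_heads * head_dim]` matrices; it measures how much spectral energy the top singular values retain, which stays close to 1.0 at small ranks for pre-RoPE keys but not for the other matrices.

```python
import torch

def energy_at_rank(x: torch.Tensor, rank: int = 160) -> torch.Tensor:
    """Fraction of squared spectral energy captured by the top-`rank` singular
    values; a value near 1.0 means the matrix is effectively low-rank."""
    s = torch.linalg.svdvals(x.float())
    return (s[:rank] ** 2).sum() / (s ** 2).sum()

# Hypothetical usage with caches captured from one layer during pre-filling:
# for name, mat in [("pre-RoPE K", pre_k), ("post-RoPE K", post_k), ("V", v)]:
#     print(name, f"{energy_at_rank(mat).item():.4f}")
```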
To further reduce the latency overhead of sparse attention, which includes fetching the selected value cache from the CPU and reconstructing the corresponding key cache, an accurate KV selection method is needed that minimizes the sparse KV cache budget while maintaining accuracy. We find that most of the post-RoPE key cache exhibits spatial locality, with high cosine similarity between adjacent tokens, except for a few outliers.
This finding suggests that for the majority of chunks, we can keep the chunk mean as a compressed landmark and use it to accurately select a minimal set of important KV pairs (1.56%) during decoding. Outlier chunks, which may contain dense or critical information and are difficult to approximate, are retained to ensure accuracy. Given their relatively small number (0.2-0.3%), storing them on the GPU is feasible without affecting memory capacity. Furthermore, as shown in the figure, the temporal locality of the KV cache allows a cache policy to further reduce the decoding latency overhead by 60% with optimized kernels.
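The decoding-time selection can be sketched as follows. The shapes are simplified to a single query per KV head, the cache policy and fused kernels are omitted, and the function and parameter names are illustrative rather than taken from the ShadowKV codebase.

```python
import torch

def select_chunks(query, landmarks, outlier_mask, budget_chunks):
    """Pick the top-scoring chunks per KV head via landmark approximation.

    query:        [num_kv_heads, head_dim]             current decoding query
    landmarks:    [num_chunks, num_kv_heads, head_dim]  chunk means of post-RoPE keys
    outlier_mask: [num_chunks] bool, True for outlier chunks already resident on GPU
    """
    # Approximate each chunk's attention score by the query-landmark dot product.
    scores = torch.einsum("hd,chd->hc", query, landmarks)  # [num_kv_heads, num_chunks]
    # Outlier chunks are always kept on the GPU, so exclude them from selection.
    scores = scores.masked_fill(outlier_mask[None, :], float("-inf"))
    return scores.topk(budget_chunks, dim=-1).indices      # [num_kv_heads, budget]
```

The selected indices drive two transfers: the matching value chunks are fetched from CPU memory, and the corresponding keys are rebuilt from the low-rank factors before RoPE is applied, while the cache policy exploits temporal locality to avoid re-fetching recently selected chunks.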
Leveraging the ShadowKV framework, we enable efficient long-context LLM inference for larger batch sizes and longer sequences, making long-context LLM serving more practical. ShadowKV can be further integrated with various KV quantization methods, enhancing its performance by reducing the KV cache bit-width. Our experiments demonstrate that ShadowKV supports up to 6x larger batch sizes and improves throughput by up to 3.04x on an A100 across various long-context models. ShadowKV holds great promise for improving long-context LLM inference, and we look forward to staying engaged with the community to further advance this field.
@article{sun2024shadowkv,
title={ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference},
author={Sun, Hanshi and Chang, Li-Wen and Bao, Wenlei and Zheng, Size and Zheng, Ningxin and Liu, Xin and Dong, Harry and Chi, Yuejie and Chen, Beidi},
journal={arXiv preprint arXiv:2410.21465},
year={2024}
}