A Performance Characterization Study of LLM Inferences on GPU Memory Tiering

Abstract

This study investigates the performance of serving large language models (LLMs), focusing on the high-bandwidth interconnect between the GPU and CPU of an NVIDIA Grace Hopper Superchip. This hardware's architecture features a GPU-centric memory tiering system, comprising a performance tier in GPU memory and a capacity tier in host memory. We revisit a conventional pipelined execution for LLM inference, utilizing host memory connected via NVLinks alongside GPU memory. For the Llama-3.1 8B base (FP16) model, such a GPU-centric tiered memory system meets the target latency requirements for both prefill and decoding while improving throughput compared to the in-memory case, where all model weights are stored in GPU memory. However, even with a high-bandwidth interconnect, meeting latency constraints for large models such as the 70B and 405B FP16 models remains challenging. To address this, we explore the efficacy of model quantization (e.g., AWQ) combined with pipelined execution. Our evaluation reveals that model quantization together with pipelined execution is a viable solution for serving large models. For the Llama-3.1 70B and 405B AWQ models, we show that pipelined execution achieves 1.6× and 2.9× throughput improvements, respectively, compared to the in-memory-only case, while meeting the latency constraint.
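To make the pipelined-execution idea concrete, the sketch below overlaps host-to-GPU weight transfers with per-layer compute using two CUDA streams in PyTorch. This is a minimal sketch of the general technique under simplifying assumptions, not the implementation evaluated in the thesis: `host_layers`, `apply_layer`, and the tensor shapes are hypothetical placeholders, and the per-layer weights are kept in pinned host memory so the asynchronous copies can proceed off the compute stream.

```python
# Minimal sketch (not the thesis's implementation) of pipelined execution over a
# GPU-centric memory tier: per-layer weights live in pinned host memory, and while
# layer i computes on the GPU, layer i+1's weights are copied host -> GPU on a
# separate CUDA stream so the interconnect transfer overlaps with compute.
# `host_layers`, `apply_layer`, and all shapes are illustrative placeholders.
import torch

def apply_layer(weight: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
    # Stand-in for a transformer layer; a real layer runs attention/MLP kernels.
    return hidden @ weight

def pipelined_forward(host_layers, hidden):
    copy_stream = torch.cuda.Stream()               # dedicated stream for weight prefetch
    compute_stream = torch.cuda.current_stream()

    staged = host_layers[0].to("cuda", non_blocking=True)   # stage the first layer's weights
    for i in range(len(host_layers)):
        if i + 1 < len(host_layers):
            with torch.cuda.stream(copy_stream):
                # Asynchronous host->GPU copy of the next layer, overlapped with compute.
                prefetched = host_layers[i + 1].to("cuda", non_blocking=True)
        hidden = apply_layer(staged, hidden)         # compute layer i on the compute stream
        if i + 1 < len(host_layers):
            compute_stream.wait_stream(copy_stream)  # ensure the prefetch has landed
            staged = prefetched
    return hidden

if __name__ == "__main__":
    # Pinned host memory lets the non_blocking copies run asynchronously.
    host_layers = [torch.randn(4096, 4096).pin_memory() for _ in range(4)]
    hidden = torch.randn(8, 4096, device="cuda")
    print(pipelined_forward(host_layers, hidden).shape)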

Table of Contents

1 Introduction
2 Challenges and opportunities
2.1 Limited GPU Memory for Serving LLMs
2.2 LLM Inference with Memory Connected through NVLinks
2.3 Model Quantization
3 Performance Characterization for Serving LLMs on a GPU-centric Memory Tiering System
3.1 Llama-3.1 8B
3.2 Llama-3.1 70B
3.3 Llama-3.1 405B
4 Future works
Bibliography
