A Performance Characterization Study of LLM Inferences on GPU Memory Tiering

Abstract

This study investigates the performance of serving large language models (LLMs), focusing on the high-bandwidth interconnect between the GPU and CPU of an NVIDIA Grace Hopper Superchip. This hardware's architecture features a GPU-centric memory tiering system, comprising a performance tier in GPU memory and a capacity tier in host memory. We revisit a conventional pipelined execution for LLM inference, utilizing host memory connected via NVLinks alongside GPU memory. For the Llama-3.1 8B base (FP16) model, such a GPU-centric tiered memory system meets the target latency requirements for both prefill and decoding while improving throughput compared to the in-memory case, where all model weights are stored in GPU memory. However, even with a high-bandwidth interconnect, meeting latency constraints for large models such as the 70B and 405B FP16 models remains challenging. To address this, we explore the efficacy of model quantization (e.g., AWQ) combined with pipelined execution. Our evaluation reveals that model quantization together with pipelined execution is a viable solution for serving large models. For the Llama-3.1 70B and 405B AWQ models, we show that pipelined execution achieves 1.6× and 2.9× throughput improvements, respectively, compared to the in-memory-only case, while meeting the latency constraint.
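To make the pipelined-execution idea concrete, the sketch below overlaps host-to-GPU weight transfers with per-layer compute using two CUDA streams in PyTorch. This is a minimal sketch of the general technique under simplifying assumptions, not the implementation evaluated in the thesis: `host_layers`, `apply_layer`, and the tensor shapes are hypothetical placeholders, and the per-layer weights are kept in pinned host memory so the asynchronous copies can proceed off the compute stream.

```python
# Minimal sketch (not the thesis's implementation) of pipelined execution over a
# GPU-centric memory tier: per-layer weights live in pinned host memory, and while
# layer i computes on the GPU, layer i+1's weights are copied host -> GPU on a
# separate CUDA stream so the interconnect transfer overlaps with compute.
# `host_layers`, `apply_layer`, and all shapes are illustrative placeholders.
import torch

def apply_layer(weight: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
    # Stand-in for a transformer layer; a real layer runs attention/MLP kernels.
    return hidden @ weight

def pipelined_forward(host_layers, hidden):
    copy_stream = torch.cuda.Stream()               # dedicated stream for weight prefetch
    compute_stream = torch.cuda.current_stream()

    staged = host_layers[0].to("cuda", non_blocking=True)   # stage the first layer's weights
    for i in range(len(host_layers)):
        if i + 1 < len(host_layers):
            with torch.cuda.stream(copy_stream):
                # Asynchronous host->GPU copy of the next layer, overlapped with compute.
                prefetched = host_layers[i + 1].to("cuda", non_blocking=True)
        hidden = apply_layer(staged, hidden)         # compute layer i on the compute stream
        if i + 1 < len(host_layers):
            compute_stream.wait_stream(copy_stream)  # ensure the prefetch has landed
            staged = prefetched
    return hidden

if __name__ == "__main__":
    # Pinned host memory lets the non_blocking copies run asynchronously.
    host_layers = [torch.randn(4096, 4096).pin_memory() for _ in range(4)]
    hidden = torch.randn(8, 4096, device="cuda")
    print(pipelined_forward(host_layers, hidden).shape)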

Table of Contents

1 Introduction
2 Challenges and opportunities
2.1 Limited GPU Memory for Serving LLMs
2.2 LLM Inference with Memory Connected through NVLinks
2.3 Model Quantization
3 Performance Characterization for Serving LLMs on a GPU-centric Memory Tiering System
3.1 Llama-3.1 8B
3.2 Llama-3.1 70B
3.3 Llama-3.1 405B
4 Future works
Bibliography
