Design and Performance Analysis of Hybrid Caching Architectures in Distributed Storage for HPC Applications

Abstract

High-performance computing (HPC) applications generate massive volumes of scientific data, placing significant pressure on distributed storage systems. As simulation scales continue to grow, backend latency, metadata amplification, and small-object write overhead increasingly limit end-to-end I/O throughput. Caching has therefore become a critical performance strategy, yet the relative effectiveness of different caching mechanisms, such as inbuilt distributed caching (IDC) in Ceph, external FUSE-based caching (EFC) through Alluxio, and multi-tier Hybrid designs, remains insufficiently understood under realistic, trace-driven HPC workloads.
This thesis presents a comprehensive evaluation of three caching architectures for HPC storage systems: (i) Ceph tiering operating under writeback and readonly modes (IDC), (ii) Alluxio Tier-1 caching using four read/write policies (CACHE, CACHE_PROMOTE, ASYNC_THROUGH, CACHE_THROUGH), and (iii) a Hybrid two-tier design that layers Alluxio over a Ceph SSD cache tier. Using 100 GB of HACC-IO cosmology traces, chosen for their realistic patterns of sequential writes, metadata-intensive bursts, and read-after-write access, we analyze bandwidth, latency, IOPS, cache-hit ratio, promotion/eviction behavior, and backend I/O amplification across cache sizes of 30, 50, and 100 GB.
The results reveal two major asymmetries. First, for write-intensive phases, the Hybrid configuration with Alluxio ASYNC_THROUGH and Ceph writeback achieves the highest performance, outperforming EFC and IDC across all metrics by providing two-stage buffering and deep write coalescing. Second, for read-dominated phases, EFC using CACHE_PROMOTE yields the strongest Tier-1 locality, while Hybrid with Ceph writeback achieves near-EFC locality but with dramatically reduced backend HDD load, converting most miss traffic to SSD-tier operations.
Across all experiments, caching policy alignment was found to influence performance more strongly than cache capacity. Misaligned policies (e.g., CACHE_THROUGH with Ceph readonly) suppressed caching benefits entirely, whereas aligned policies produced substantial performance gains.
Overall, this thesis provides one of the first detailed, empirical characterizations of multi-tier caching interactions under realistic HPC workloads. The findings demonstrate that EFC is ideal for maximizing read locality, IDC is effective for backend absorption during writes, and the Hybrid configuration offers the most balanced solution for mixed workloads. These insights can guide the design of next-generation caching strategies for large-scale scientific computing infrastructures.

Table of Contents

1 Introduction
1.1 Motivation
1.2 Problem Statement
1.3 Research Objectives
1.4 Scope of Study
1.5 Contributions
1.6 Thesis Organization
2 Background and Motivation
2.1 Distributed Storage Systems in HPC
2.2 Inbuilt Distributed Caching (IDC): Ceph Tiering
2.3 External FUSE-Based Caching (EFC): Alluxio
2.4 Distributed Storage Performance Challenges
2.5 Qualitative Comparison of IDC and EFC
2.6 Related Work
2.6.1 Alternative HPC Storage Architectures
2.6.2 Research on Multi-Tier Caching
3 System Architecture and Caching Frameworks
3.1 Overview of the Storage Stack
3.2 Hybrid Caching Architecture
3.3 Data Flow in the Hybrid System
3.3.1 Read Path
3.3.2 Write Path
3.4 Policy Interaction and Design Implications
3.5 Summary
4 Research Methodology
4.1 Experimental Design
4.2 Testbed Environment
4.2.1 Hardware Configuration
4.2.2 Software Configuration
4.3 Workload Description
4.4 Caching Configurations
4.4.1 Inbuilt Distributed Caching (IDC)
4.4.2 External FUSE-Based Caching (EFC)
4.4.3 Hybrid Caching
4.5 Evaluation Metrics
4.6 Experimental Procedure
4.7 Summary
5 Results and Analysis
5.1 Sequential Write Performance
5.1.1 IDC Writeback Mode
5.1.2 EFC with ASYNC_THROUGH
5.1.3 EFC with CACHE_THROUGH
5.1.4 Hybrid: ASYNC_THROUGH + Writeback
5.1.5 Hybrid: CACHE_THROUGH + Readonly
5.2 Sequential Read Performance
5.2.1 EFC with CACHE_PROMOTE
5.2.2 EFC with CACHE
5.2.3 IDC Readonly and Writeback Modes
5.2.4 Hybrid Read Performance
5.3 Cache Hit and Promotion Behavior
5.4 Backend Disk Activity
5.5 Total Runtime
5.6 Summary of Findings
6 Discussion
6.1 Architectural Trade-offs: IDC vs. EFC
6.1.1 Ceph IDC: The Cost of Consistency
6.1.2 Alluxio EFC: The Benefit of Locality
6.2 The Hybrid Solution: Optimizing the Trade-offs
6.2.1 Synergy in the Write Path
6.2.2 Protection in the Read Path
6.3 Analysis of Policy Misalignment
6.4 Guidelines for HPC System Architects
6.5 Summary
7 Conclusion and Future Work
7.1 Summary of Findings
7.1.1 EFC (Alluxio) as a High-Locality Read Accelerator
7.1.2 IDC (Ceph Tiering) as a Backend Absorption Layer
7.1.3 Hybrid Caching as the Most Balanced Architecture
7.1.4 Policy Coordination is More Important Than Cache Size
7.1.5 HACC-IO as a Realistic Driver of Storage Behavior
7.2 Research Contributions
7.3 Limitations
7.4 Future Work
7.4.1 Broader HPC Application Coverage
7.4.2 Adaptive, Policy-Aware Multi-Tier Caching
7.4.3 Deeper Analysis of Object Granularity and Promotion
7.4.4 Hybrid Caches with More Than Two Tiers
7.4.5 Scalability and Multi-User Evaluation
7.4.6 Cache Consistency, Fault Tolerance, and Recovery
7.5 Final Remarks
References
