An Efficient Fault-Tolerant and Reliable Data Integrity Framework for Object-Based Big Data Transfer Systems

Abstract

Data has overwhelmed the digital world in terms of volume, variety, and velocity. Individuals, business organizations, computational science simulations, and experiments produce huge volumes of data on a daily basis. Often, this data is shared among geographically distributed data centers for storage and analysis. However, data transfer tools face unprecedented challenges in moving such huge volumes of data across geo-distributed data centers in a timely manner. Faults are among the major challenges in distributed environments: hardware, network, and software may fail at any instant. Thus, high-speed, fault-tolerant data transfer frameworks are vital for moving data efficiently between data centers. In this thesis, we propose a novel Bloom filter-based data-aware probabilistic fault tolerance (DAFT) mechanism to recover efficiently from such failures. We also propose a data- and layout-aware fault tolerance (DLFT) mechanism to handle the false-positive matches of DAFT effectively. We evaluate the data transfer and recovery time overheads of the proposed fault tolerance mechanisms and their effect on overall data transfer performance. The experimental results demonstrate that the DAFT and DLFT mechanisms recover from faults efficiently while minimizing memory, storage, computation, and recovery time overheads. Furthermore, we observe negligible impact on overall data transfer performance.

Protecting the integrity of data against failures of the various intermediate components along the end-to-end data transfer path is a salient feature of big data transfer tools. Although most of these components provide some degree of data integrity, they are either too expensive or inefficient at recovering corrupted data. This necessitates application-level end-to-end integrity verification during data transfer. However, owing to the sheer size of the data, supporting end-to-end integrity verification in big data transfer tools incurs computational, memory, and storage overheads. In this thesis, we propose a cross-referencing Bloom filter (CRBF) based data integrity verification framework for big data transfer systems. This framework has three advantages over state-of-the-art data integrity techniques: lower computation overhead, lower memory overhead, and zero false-positive errors for a restricted number of elements. We evaluate the computation, memory, recovery time, and false-positive overheads of the proposed framework and compare them with state-of-the-art solutions. The evaluation results show that the proposed framework detects and recovers from integrity errors efficiently while eliminating the false positives of the Bloom filter data structure. In addition, we observe negligible computation, memory, and recovery overheads for all workloads.
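The cross-referencing idea can be illustrated with a simplified sketch. This is not the thesis's exact CRBF construction (which is detailed in Chapter 4); it only shows the underlying principle, assuming each transferred object's checksum is recorded in two Bloom filters with independent hash families, so that an object is reported as recorded only when both filters agree:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k salted SHA-256 hashes over an m-bit array."""

    def __init__(self, m_bits=4096, k_hashes=3, salt=""):
        self.m, self.k, self.salt = m_bits, k_hashes, salt
        self.bits = [False] * m_bits

    def _positions(self, item):
        # Derive k bit positions from salted digests of the item.
        for i in range(self.k):
            digest = hashlib.sha256(f"{self.salt}:{i}:{item}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # Never False for an added item; may be True for one never added.
        return all(self.bits[pos] for pos in self._positions(item))

class CrossReferencedFilter:
    """Two Bloom filters with independent hash families; an item is
    reported present only when both agree, so a false positive must
    occur in both filters at once."""

    def __init__(self):
        self.left = BloomFilter(salt="L")
        self.right = BloomFilter(salt="R")

    def add(self, checksum):
        self.left.add(checksum)
        self.right.add(checksum)

    def __contains__(self, checksum):
        return checksum in self.left and checksum in self.right

# The sender records the checksum of every transferred object; on
# verification (or after a failure) the receiver re-derives each
# object's checksum and probes the filter.
recorded = CrossReferencedFilter()
for payload in [b"object-0", b"object-1", b"object-2"]:
    recorded.add(hashlib.sha256(payload).hexdigest())

assert hashlib.sha256(b"object-1").hexdigest() in recorded  # recorded item
```

With independent hash families, a false positive requires a spurious match in both filters simultaneously, so the combined false-positive probability is roughly the product of the two individual rates; the thesis's CRBF design goes further and eliminates false positives entirely for a restricted number of elements.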

Contents

1 Introduction 1
1.1 Object-Based Big Data Transfer Systems 2
1.2 Fault Tolerance 5
1.3 Data Integrity 8
1.4 Contributions of This Dissertation 10
2 Background 12
2.1 Big Data Transfer Tools 12
2.2 Motivation 15
2.3 Bloom Filter Data Structure 17
2.3.1 Performance Optimizations of Bloom Filter 19
2.4 Related Work 20
3 Data and Layout-Aware Fault Tolerance Framework 23
3.1 Illustration of Bloom Filter 23
3.2 Proposed Fault Tolerance Framework 26
3.2.1 System Architecture 26
3.2.2 Data Aware Fault Tolerance (DAFT) 28
3.2.3 Data and Layout Aware Fault Tolerance (DLFT) 31
3.3 Data and Layout Aware Fault Tolerance Evaluation 34
3.3.1 Testbed and Workload Specifications 34
3.3.1.1 Testbed 34
3.3.1.2 Workload 35
3.3.1.3 Thread Configuration 36
3.3.1.4 Recovery Time 36
3.3.2 Performance Evaluation 37
3.3.2.1 Data Transfer Time 37
3.3.2.2 Recovery Time 38
3.3.2.3 False-Positive Matches of DAFT 41
3.3.2.4 Recovery Time Overhead and False-Positives of DLFT 44
3.3.2.5 Space Overhead Analysis 45
4 Cross-Referencing Bloom Filter-Based Data Integrity Framework 47
4.1 System Architecture 47
4.2 Implementation Details 52
4.3 Memory Overhead Analysis 55
4.3.1 Memory Requirements of Data Integrity Solution Based on Standard Bloom Filter 55
4.3.2 Memory Requirements of Proposed Data Integrity Solution 56
4.4 False-Positive Rate Analysis 57
4.4.1 False-Positive Rate Analysis of Data Integrity Solution Based on Standard Bloom Filter 57
4.4.2 False-Positive Rate Analysis of Data Integrity Solution Based on CRBF 57
4.5 Cross-Referencing Bloom Filter-Based Data Integrity Framework Evaluation 59
4.5.1 Testbed and Workload Specifications 59
4.5.1.1 Testbed 59
4.5.1.2 Workload 59
4.5.2 Performance Evaluation 60
4.5.2.1 Data Transfer Rate 60
4.5.2.2 Memory Overhead 61
4.5.2.3 Recovery Overhead and False-Positive Error Analysis 63
5 Conclusions 65
References 67
