aHDFS: An Erasure-Coded Data Archival System for Hadoop Clusters (Nov-2017)


The main aim of this project is to propose an erasure-coded data archival system called aHDFS for Hadoop clusters, in which erasure codes are employed to archive data replicas in the Hadoop Distributed File System (HDFS).

Proposed system:

To tackle this problem, we develop an erasure-coded data archival system called aHDFS, which archives rarely accessed data in large-scale data centers to minimize storage cost. One way to reduce storage cost is to convert a 3X-replica-based storage system into erasure-coded storage. It still makes sense to maintain 3X replicas for frequently accessed data; importantly, however, managing non-popular data with erasure-coded schemes saves storage capacity without imposing an adverse performance penalty. A significant portion of the data in data centers is considered non-popular, because data access frequency inevitably declines over time; evidence shows that most data are accessed only within a short period of their lifetime.
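The storage saving from converting 3X replication into erasure coding can be sketched with simple overhead arithmetic. The RS(6, 3) parameters below are an assumption chosen for illustration (they match a common HDFS erasure-coding policy), not a detail specified by aHDFS itself:

```python
# Compare raw storage overhead of n-way replication vs. a (k, m)
# Reed-Solomon erasure code. RS(6, 3) is an assumed example policy.

def replication_overhead(replicas: int) -> float:
    """Raw bytes stored per logical byte under n-way replication."""
    return float(replicas)

def erasure_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Raw bytes stored per logical byte under a (k, m) erasure code:
    k data blocks plus m parity blocks together hold k blocks of user data."""
    return (data_blocks + parity_blocks) / data_blocks

rep = replication_overhead(3)   # 3.0x raw storage for HDFS-style 3X replication
ec = erasure_overhead(6, 3)     # 1.5x raw storage for RS(6, 3)
savings = 1 - ec / rep          # fraction of raw capacity saved

print(f"3X replication: {rep:.1f}x raw storage")
print(f"RS(6, 3) erasure coding: {ec:.1f}x raw storage")
print(f"Raw capacity saved by archiving with erasure codes: {savings:.0%}")
```

Under these assumed parameters, archiving cold data with RS(6, 3) halves the raw capacity consumed relative to keeping three full replicas, which is why converting only non-popular data is attractive.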

The following three factors motivate us to develop the erasure-code-based archival system for Hadoop clusters:
o A pressing need to lower storage cost,
o The high cost-effectiveness of erasure-coded storage, and
o The popularity of Hadoop computing platforms.