Paper accepted at Blockchain: Research and Applications
BlockHDFS: Blockchain-integrated Hadoop distributed file system for secure provenance traceability
The Hadoop Distributed File System (HDFS) is one of the most widely used distributed file systems for big data analysis in frameworks such as Hadoop. HDFS allows users to manage large volumes of data on low-cost commodity hardware. However, vulnerabilities in HDFS can be exploited for malicious purposes. This underscores the need for robust security when sharing files in Hadoop, together with a trusted mechanism for checking the authenticity of shared files. In this paper, we aim to improve the security of HDFS using a blockchain-enabled approach (hereafter referred to as BlockHDFS). Specifically, BlockHDFS uses the enterprise-level Hyperledger Fabric platform and leverages files' metadata to build trusted data security and traceability in HDFS.
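To make the idea of anchoring file metadata in a ledger concrete, the following is a minimal illustrative sketch, not the BlockHDFS implementation and not the Hyperledger Fabric API. It assumes an in-memory, hash-chained list as a stand-in for the blockchain, and the metadata fields (`path`, `owner`, `length`, `mtime`) as well as the names `MetadataLedger`, `record`, `verify_chain`, and `is_authentic` are hypothetical choices made for illustration only.

```python
import hashlib
import json


def metadata_digest(metadata: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the metadata."""
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def _link(prev_hash: str, digest: str) -> str:
    """Hash that binds an entry to its predecessor."""
    return hashlib.sha256((prev_hash + digest).encode("utf-8")).hexdigest()


class MetadataLedger:
    """In-memory stand-in for a blockchain ledger (assumption: NOT Fabric)."""

    GENESIS = "0" * 64

    def __init__(self) -> None:
        self.entries = []

    def record(self, metadata: dict) -> dict:
        """Append a hash-chained entry for one file's metadata."""
        prev_hash = self.entries[-1]["entry_hash"] if self.entries else self.GENESIS
        digest = metadata_digest(metadata)
        entry = {
            "path": metadata["path"],
            "metadata_digest": digest,
            "prev_hash": prev_hash,
            "entry_hash": _link(prev_hash, digest),
        }
        self.entries.append(entry)
        return entry

    def verify_chain(self) -> bool:
        """Recompute every link; any edited entry breaks the chain."""
        prev_hash = self.GENESIS
        for entry in self.entries:
            expected = _link(prev_hash, entry["metadata_digest"])
            if entry["prev_hash"] != prev_hash or entry["entry_hash"] != expected:
                return False
            prev_hash = entry["entry_hash"]
        return True

    def is_authentic(self, metadata: dict) -> bool:
        """Compare metadata against the latest recorded digest for its path."""
        digest = metadata_digest(metadata)
        for entry in reversed(self.entries):
            if entry["path"] == metadata["path"]:
                return entry["metadata_digest"] == digest
        return False  # path was never recorded


# Usage: record metadata, then check a later copy against the ledger.
ledger = MetadataLedger()
meta = {"path": "/data/a.csv", "owner": "alice", "length": 1024, "mtime": 1700000000}
ledger.record(meta)
ledger.record({"path": "/data/b.csv", "owner": "bob", "length": 10, "mtime": 1700000001})
print(ledger.verify_chain(), ledger.is_authentic(meta))
```

In this sketch, a change to a file's metadata yields a different digest, so the authenticity check fails, and any edit to a stored entry invalidates the chain from that point on. A real deployment would replace the in-memory list with ledger transactions submitted through Fabric chaincode.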