Stackable Operator for Apache HDFS
The Stackable operator for Apache HDFS (Hadoop Distributed File System) is used to set up HDFS in high-availability mode. HDFS is a distributed file system designed to store and manage massive amounts of data across multiple machines in a fault-tolerant manner. The operator depends on the Stackable Operator for Apache ZooKeeper, which provides the ZooKeeper cluster that coordinates the active and standby NameNodes.
Getting started
Follow the Getting started guide, which walks you through installing the Stackable HDFS and ZooKeeper operators, setting up ZooKeeper and HDFS, and writing a file to HDFS to verify that everything works correctly.
Afterwards you can consult the Usage guide to learn more about tailoring your HDFS configuration to your needs, or have a look at the demos for some example setups.
Operator model
The operator manages the HdfsCluster custom resource. The cluster implements three roles:

- DataNode - responsible for storing the actual data.
- JournalNode - responsible for maintaining a shared edit log so that a standby NameNode can take over in case the active NameNode fails. For details see: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
- NameNode - responsible for keeping track of HDFS blocks and providing access to the data.
The operator creates the following K8S objects per role group defined in the custom resource:

- Service - a ClusterIP Service used for intra-cluster communication.
- ConfigMap - HDFS configuration files like `core-site.xml`, `hdfs-site.xml` and `log4j.properties` are defined here and mounted in the Pods.
- StatefulSet - where the replica count, volume mounts and more for each role group are defined.
In addition, a NodePort Service is created for each Pod labeled with `hdfs.stackable.tech/pod-service=true` that exposes all container ports to the outside world (from the perspective of K8S).
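As an illustration, one way to attach that label declaratively is via a `podOverrides` fragment on a role group, a mechanism Stackable operators generally support for patching the generated Pod templates. This is a minimal sketch, not the prescribed mechanism; verify `podOverrides` support in your operator version:

```yaml
# Sketch: labeling all NameNode Pods of a role group so that a per-Pod
# NodePort Service is created for them. podOverrides accepts a
# PodTemplateSpec fragment (assumed; check your operator version).
spec:
  nameNodes:
    roleGroups:
      default:
        podOverrides:
          metadata:
            labels:
              hdfs.stackable.tech/pod-service: "true"
```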
In the custom resource you can specify the number of replicas per role group (NameNode, DataNode or JournalNode). A minimal working configuration requires:

- 2 NameNodes (HA)
- 1 JournalNode
- 1 DataNode (the DataNode count should be at least the `clusterConfig.dfsReplication` factor)
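Putting this together, a minimal HdfsCluster could look like the sketch below. The cluster name, ZooKeeper discovery ConfigMap name and product version are placeholders for illustration; consult the HdfsCluster CRD documentation linked at the end of this page for the authoritative schema.

```yaml
# Minimal sketch of an HdfsCluster in high-availability mode.
# Names and version are placeholders; adjust to your environment.
apiVersion: hdfs.stackable.tech/v1alpha1
kind: HdfsCluster
metadata:
  name: simple-hdfs                     # hypothetical cluster name
spec:
  image:
    productVersion: 3.4.0               # one of the supported versions below
  clusterConfig:
    dfsReplication: 1                   # must not exceed the DataNode count
    zookeeperConfigMapName: simple-zk   # discovery ConfigMap of your ZooKeeper setup
  nameNodes:
    roleGroups:
      default:
        replicas: 2                     # two NameNodes for HA
  journalNodes:
    roleGroups:
      default:
        replicas: 1
  dataNodes:
    roleGroups:
      default:
        replicas: 1
```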
The operator creates a service discovery ConfigMap for the HDFS instance. The discovery ConfigMap contains the `core-site.xml` and `hdfs-site.xml` files.
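A client Pod can consume these files by mounting the discovery ConfigMap and pointing Hadoop tooling at it. The sketch below assumes the discovery ConfigMap is named after the HdfsCluster (here the hypothetical `simple-hdfs`); the client image and paths are placeholders:

```yaml
# Sketch: mounting the discovery ConfigMap into a client container so that
# Hadoop tools pick up core-site.xml and hdfs-site.xml via HADOOP_CONF_DIR.
apiVersion: v1
kind: Pod
metadata:
  name: hdfs-client                      # hypothetical client Pod
spec:
  containers:
    - name: client
      image: my-hadoop-client:latest     # placeholder client image
      env:
        - name: HADOOP_CONF_DIR
          value: /stackable/conf/hdfs
      volumeMounts:
        - name: hdfs-discovery
          mountPath: /stackable/conf/hdfs
  volumes:
    - name: hdfs-discovery
      configMap:
        name: simple-hdfs                # discovery ConfigMap, assumed to be named after the cluster
```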
Dependencies
HDFS depends on Apache ZooKeeper for coordinating the active and standby NameNodes. You can run a ZooKeeper cluster with the Stackable Operator for Apache ZooKeeper. Additionally, the Stackable Commons Operator, Stackable Secret Operator and Stackable Listener Operator are required.
Demos
Two demos that use HDFS are available.
hbase-hdfs-cycling-data loads a cycling dataset from S3 into HDFS and then uses HBase to analyze it.
jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data showcases the integration between HDFS and Jupyter. New York Taxi data is stored in HDFS and analyzed in a Jupyter notebook.
Supported versions
The Stackable operator for Apache HDFS currently supports the HDFS versions listed below. To use a specific HDFS version in your HdfsCluster, you have to specify an image - this is explained in the Product image selection documentation. The operator also supports running images from a custom registry or running entirely customized images; both of these cases are explained under Product image selection as well.
- 3.4.0 (LTS)
- 3.3.6 (deprecated) - Please note that there is a known issue related to NameNode bootstrapping which can happen in rare cases. It is therefore recommended to use 3.3.4 until the problem is resolved.
- 3.3.4 (deprecated)
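For example, a version from the list above is pinned in the custom resource via `spec.image`. This is a sketch of the common case; the `custom` field for fully customized images is an assumption based on the Stackable image selection convention, so verify it against the Product image selection documentation:

```yaml
# Sketch: pinning a supported HDFS version in the HdfsCluster resource.
spec:
  image:
    productVersion: 3.4.0                       # one of the versions listed above
    # custom: my.registry.example/hadoop:3.4.0  # optional: fully customized image (verify field name)
```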
Useful links
- The hdfs-operator GitHub repository
- The operator feature overview in the feature tracker
- The HdfsCluster CRD documentation