Setting up a Hadoop development environment can be a major time sink. Between setting up virtual machines, installing IDEs, installing Hadoop packages, and configuring the various daemons, a new development environment can easily take a full day or longer. It's a one-time drill, and we've heard of companies that treat the task as an initiation rite. Still, as engineers, if we see something that can be automated, we automate it. We've also seen occasions where what you really want is a throw-away environment that you can spin up quickly, and spin down just as quickly later on.
The Kiji BentoBox provides a complete Hadoop development environment in an easy-to-use tarball. At its core, the BentoBox includes a pseudo-distributed Hadoop and HBase cluster. When you start up a BentoBox, it automatically runs an HDFS instance, a MapReduce JobTracker and TaskTracker, a ZooKeeper instance, and an HBase cluster, all within a single process on one node. In the rest of this article, we'll take a tour of how the BentoBox works and how you can use it to speed up your Hadoop and HBase development process.
Configuring Environment Variables
In order to use a BentoBox, you need to have a few environment variables set up to make sure that your development environment accesses the correct configuration files and home directories. Conveniently, the BentoBox provides an environment script to do just that:
jubjubbird:kiji-bento-1.0.0-rc3 natty$ source bin/kiji-env.sh
Set KIJI_HOME=/Users/natty/Downloads/kiji-bento-1.0.0-rc3/bin/..
Added kiji and kiji-schema-shell binaries to PATH.
Set BENTO_CLUSTER_HOME=/Users/natty/kiji-bento-1.0.0-rc3/cluster/bin/..
Set HADOOP_HOME=/Users/natty/kiji-bento-1.0.0-rc3/cluster/bin/../lib/hadoop-2.0.0-mr1-cdh4.1.2
Set HADOOP_CONF_DIR=/Users/natty/kiji-bento-1.0.0-rc3/cluster/bin/../lib/hadoop-2.0.0-mr1-cdh4.1.2/conf
Set HBASE_HOME=/Users/natty/kiji-bento-1.0.0-rc3/cluster/bin/../lib/hbase-0.92.1-cdh4.1.2
Set HBASE_CONF_DIR=/Users/natty/kiji-bento-1.0.0-rc3/cluster/bin/../lib/hbase-0.92.1-cdh4.1.2/conf
Added Hadoop, HBase, and bento-cluster binaries to PATH.
The source command above (with the full path to kiji-env.sh) can be added to your .profile or .bashrc file to make sure that all the variables are set whenever a new terminal is opened. The environment script also adds the Hadoop, HBase, and Kiji scripts to your PATH, so you can run the normal shell commands and be sure that it's the Bento cluster being accessed.
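For example, you might append the source line to your .bashrc. The install path below is an assumption for illustration; use wherever you actually unpacked your BentoBox:

```shell
# Append the environment setup to your shell profile so each new terminal
# picks it up. The install path here is an assumption -- substitute the
# directory where you unpacked your own BentoBox.
echo 'source ~/kiji-bento-1.0.0-rc3/bin/kiji-env.sh' >> ~/.bashrc
```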
(Auto-)Configuring a BentoBox
Setting up a BentoBox is a quick exercise, since it auto-configures all of the clusters that it runs. When the BentoBox is started, it runs some sanity checks to make sure that the appropriate ports are open, and if the default ports are available, it generates a set of configuration files that use them.
jubjubbird:kiji-bento-1.0.0-rc3 natty$ bento start
Configuring bento-cluster...
Checking if the Hadoop/HBase default ports are open...
All default ports open. Using these in the Hadoop/HBase configuration for your cluster.
Writing Hadoop configuration files to:
/Users/natty/kiji-bento-1.0.0-rc3/cluster/bin/../lib/hadoop-2.0.0-mr1-cdh4.1.2/conf
Writing HBase configuration files to:
/Users/natty/kiji-bento-1.0.0-rc3/cluster/bin/../lib/hbase-0.92.1-cdh4.1.2/conf
Configuration files successfully written!
After your clusters have (re)started, you can visit their web interfaces:
HDFS NameNode:        http://localhost:50070
MapReduce JobTracker: http://localhost:50030
HBase Master:         http://localhost:60010
Starting bento-cluster...
Waiting for clusters to start...
bento-cluster started.
If the default ports are not available, the BentoBox will prompt for a set of usable ports. If the ports need to be changed later on, you can do so using the bento config command:
jubjubbird:kiji-bento-1.0.0-rc3 natty$ bento config --prompt
Running bento-cluster port configuration utility.
Hadoop configuration directory is:
/Users/natty/kiji-bento-1.0.0-rc3/cluster/bin/../lib/hadoop-2.0.0-mr1-cdh4.1.2/conf
HBase configuration directory is:
/Users/natty/kiji-bento-1.0.0-rc3/cluster/bin/../lib/hbase-0.92.1-cdh4.1.2/conf
Please enter a value for each port, or press enter to use suggestion.
HDFS NameNode
HDFS NameNode UI
MapReduce JobTracker
MapReduce JobTracker UI
HBase Master UI
Zookeeper client port
Writing bento-managed configuration.
After your clusters have started, you can visit their web interfaces:
HDFS NameNode:        http://localhost:50072
MapReduce JobTracker: http://localhost:50032
HBase Master:         http://localhost:60012
Configuration complete.
Using the Bento Clusters
Once a Bento cluster is up and running, you can use it the same way you would use any real cluster. You can access the Hadoop and HBase clusters through the standard Hadoop command-line utilities, as well as the Kiji CLI and the Kiji Schema Shell for managing table schemas.
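For instance, with kiji-env.sh sourced, the ordinary hadoop filesystem commands go straight to the Bento HDFS instance. The directory and file names below are illustrative; the guard simply makes the sketch copy-paste safe on machines where no Bento cluster is on the PATH:

```shell
# Everyday HDFS commands against the Bento cluster. Skips gracefully when
# the hadoop CLI is not available (i.e., kiji-env.sh has not been sourced).
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -mkdir /user/natty/demo           # create a directory in Bento HDFS
  hadoop fs -put mydata.txt /user/natty/demo/ # copy a local file into it
  hadoop fs -ls /user/natty/demo              # list it back
fi
```

The hbase shell and kiji commands work the same way once the environment script has put them on your PATH.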
If you ever need to start from a blank Hadoop cluster, the relevant files to remove are located in $BENTO_HOME/cluster/state. Simply stop the cluster with `bento stop`, remove the state directory, and restart the cluster. The Bento cluster will restart with a fresh, empty HDFS.
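The reset sequence can be sketched as follows. The default BENTO_HOME value is an assumption for illustration; the guards keep the sketch harmless when the bento script is not on the PATH:

```shell
# Reset the Bento cluster to a fresh, empty HDFS.
# BENTO_HOME is assumed to point at the unpacked BentoBox root.
BENTO_HOME="${BENTO_HOME:-$HOME/kiji-bento-1.0.0-rc3}"
if command -v bento >/dev/null 2>&1; then
  bento stop                         # shut the cluster down cleanly
fi
rm -rf "$BENTO_HOME/cluster/state"   # remove HDFS, HBase, and ZooKeeper state
if command -v bento >/dev/null 2>&1; then
  bento start                        # comes back up with an empty HDFS
fi
```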
There are two main ways to build an application on top of a BentoBox. Each BentoBox comes packaged with the Hadoop, HBase, and Kiji libraries needed to build an application. You can put these libraries on an application's classpath manually, although the recommended approach is to use a Maven project to manage application dependencies.
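The manual route might look like the sketch below. MyKijiApp.java is a hypothetical application source file, and the exact lib/ layout may vary between BentoBox releases; HADOOP_HOME, HBASE_HOME, and KIJI_HOME are set by kiji-env.sh:

```shell
# Manual-classpath sketch: compile a hypothetical MyKijiApp.java against the
# jars bundled with the BentoBox. Directory layout is an assumption; check
# your own install for where the jars actually live.
APP_CP="$HADOOP_HOME/*:$HADOOP_HOME/lib/*:$HBASE_HOME/*:$HBASE_HOME/lib/*:$KIJI_HOME/lib/*"
if [ -f MyKijiApp.java ]; then
  javac -cp "$APP_CP" MyKijiApp.java
fi
```

With Maven, you would instead declare the same Hadoop, HBase, and Kiji artifacts as dependencies in your pom.xml and let Maven resolve and version them for you.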
By using a BentoBox, you can drastically reduce the time necessary to set up a Hadoop and HBase development environment. Once the BentoBox is downloaded, setting it up and starting the cluster takes minutes, and you can be off and running quickly. You can download your BentoBox here and get started building Kiji applications right away.