Hadoop, Hive, HBase & PostgreSQL on Ubuntu

Today I got Hive working on Hadoop. There were a few speed bumps along the way, so I'm writing this to help anyone else who travels down this path. I'll assume you're installing Cloudera CDH3, following their instructions, and generally know what you're doing; this post only describes the specific errors I ran into.

I have a 3 node Hadoop cluster running Cloudera CDH3 on Ubuntu Linux. Two of the machines (data nodes) are old Pentium 4 single core 1 GHz with 512MB RAM running Ubuntu 10.10 [they belong in a museum but amazingly, Ubuntu runs well and they make usable Hadoop data nodes]. The main machine (name node) is a Pentium 4 dual core 3.2 GHz with 4 GB RAM running Ubuntu 12.04. I installed CDH3 (Hadoop & HBase) via tarballs a while back and updated it to CDH3U5 today.

NOTE: my machines have a "hadoop" user and all the Hadoop files & executables are owned by and run as this user.

THE UPGRADE

I wanted to upgrade to CDH4 and spent a couple of hours trying to do so, but I couldn't get it to work. Eventually I discovered that Cloudera supports CDH4 on Ubuntu only on 64-bit systems! My old machines are 32-bit, so that was out of the question. Instead, I updated to the latest version of CDH3. Since my prior version was also CDH3, the upgrade was simply unpacking the tarballs and copying my config over.
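
If you're not sure whether a machine is 32- or 64-bit, a quick check from the shell:

    uname -m    # i686 = 32-bit, x86_64 = 64-bit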

ADDING HIVE

I downloaded the CDH3U5 Hive tarball (0.7.1) and expanded it. I also set $HIVE_HOME and added $HIVE_HOME/bin to the PATH in the "hadoop" user's shell profile. I already have PostgreSQL 9.1 installed and use it for several things, including some Ruby projects, so I wanted to use it for the Hive metastore. Here's what I did to configure it:

Connect to Postgres as "sa" (I used the pgAdmin console) and create the Hive users:

  • hive: Hive admin user, use this to run the create table schema
  • hiveuser: Hive login, use this for hive process at runtime

I followed the Cloudera install procedure, allowing "hive" to create tables but restricting "hiveuser".
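
I didn't keep my exact pgAdmin clicks, but the equivalent from psql is roughly this sketch (the "hive" password is a placeholder; pick your own):

    # run as the Postgres superuser ("sa" on my box)
    psql -U sa -d postgres -c "CREATE ROLE hive LOGIN PASSWORD 'hivepass' CREATEDB;"   # 'hivepass' is a placeholder
    psql -U sa -d postgres -c "CREATE ROLE hiveuser LOGIN PASSWORD 'hiveuser123';"     # matches hive-site.xml below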

Next, connect to Postgres as "hive" (again I used the pgAdmin console), create a DB called "metastore", and run this script to create the Hive schema:

    $HIVE_HOME/scripts/metastore/upgrade/postgres/hive-schema-0.7.0.postgres.sql
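
From the command line instead of pgAdmin, that step would look something like this (the GRANT follows the Cloudera procedure of letting "hiveuser" read and write, but not alter, the metastore tables):

    createdb -U hive metastore
    psql -U hive -d metastore -f $HIVE_HOME/scripts/metastore/upgrade/postgres/hive-schema-0.7.0.postgres.sql
    # PG 9.0+ syntax; grant DML (but not DDL) on the new tables to the runtime user
    psql -U hive -d metastore -c "GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO hiveuser;"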

Create the $HIVE_HOME/conf/hive-site.xml file. Start with hive-default.xml, copy it to hive-site.xml, then edit it and create or replace the following properties:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:postgresql://clements3:5432/metastore</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>org.postgresql.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hiveuser</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hiveuser123</value>
    </property>
    <property>
      <name>datanucleus.autoCreateSchema</name>
      <value>false</value>
    </property>
    <property>
      <name>datanucleus.fixedDatastore</name>
      <value>true</value>
    </property>
    <property>
      <name>hive.metastore.local</name>
      <value>true</value>
    </property>
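
Before firing up Hive, it's worth a sanity check that the credentials and URL in hive-site.xml actually work, e.g.:

    # should list the metastore tables (DBS, TBLS, ...) without a password or connection error
    psql -h clements3 -U hiveuser -d metastore -c '\dt'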

Next I attempted to run Hive. Simply run "hive" - it lives in the $HIVE_HOME/bin directory. This worked - I got a hive prompt. Next I tried the simplest Hive command I can imagine: "show databases;". This failed. I'll describe the errors one by one.

error:
FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
NestedThrowables: java.lang.reflect.InvocationTargetException
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
NOTE: the hive log (/tmp/hadoop/hive.log) had a strange error:
Caused by: java.lang.UnsupportedClassVersionError: org/postgresql/Driver : Unsupported major.minor version 51.0

cause:
the postgresql client JAR was built with JDK 1.7, but Hadoop was running on JDK 1.6

solution:
JAR that does not work: postgresql-9.1-903.jdbc3.jar. Crack it open and read META-INF/MANIFEST.MF; it says it was built with JDK 1.7.
JAR that does work: postgresql-9.1-903.jdbc4.jar. Crack it open and read META-INF/MANIFEST.MF; it says it was built with JDK 1.6.
The postgresql driver download site has it backwards: it says jdbc3 is built with JDK 1.6 and jdbc4 is built with JDK 1.7, but the reverse is true.
NOTE: Java 1.7 class files are version 51.0 and Java 1.6 class files are version 50.0, which helps in decoding the "Unsupported major.minor" error message above.
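
To "crack open" a JAR without fully extracting it, you can read the manifest with unzip. A quick sketch (I'm assuming the driver belongs in $HIVE_HOME/lib, where Hive picks up its classpath JARs):

    # print the manifest and check which JDK built the JAR
    unzip -p postgresql-9.1-903.jdbc4.jar META-INF/MANIFEST.MF | grep Created-By
    # drop the working JDK 1.6 driver onto Hive's classpath
    cp postgresql-9.1-903.jdbc4.jar $HIVE_HOME/lib/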

error:
FAILED: Error in metadata: MetaException(message:Got exception: java.net.ConnectException Call to clements3/192.168.1.27:9000 failed on connection exception: java.net.ConnectException: Connection refused)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

cause:
Hadoop wasn't running.

solution:
start Hadoop before starting Hive
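
For a tarball install like mine, run as the "hadoop" user, that looks roughly like this:

    # start HDFS (NameNode + DataNodes) and MapReduce (JobTracker + TaskTrackers)
    $HADOOP_HOME/bin/start-dfs.sh
    $HADOOP_HOME/bin/start-mapred.sh
    jps    # verify the daemons are actually up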

error:
FAILED: Error in metadata: MetaException(message:Got exception: javax.jdo.JDODataStoreException Error executing JDOQL query "SELECT "THIS"."NAME" AS NUCORDER0 FROM "DBS" "THIS" WHERE (LOWER("THIS"."NAME") LIKE ? ESCAPE '\\' ) ORDER BY NUCORDER0 " : ERROR: invalid escape string
Hint: Escape string must be empty or one character..)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

cause:
PostgreSQL 9.1 turns on "standard_conforming_strings" by default. For Hive to work you need to turn it off.

solution:
edit postgresql.conf and add the following line:

    standard_conforming_strings = off
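
The change takes effect on new connections after a reload. An alternative I haven't tried: scope the setting to just the metastore database instead of server-wide:

    sudo service postgresql reload
    # untested alternative: per-database instead of in postgresql.conf (run as superuser)
    psql -U sa -d postgres -c "ALTER DATABASE metastore SET standard_conforming_strings = off;"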

After all that, Hive is working!

ADDING HBASE

Hive + HDFS is the first step. Hive + HBase is the next step. The great thing about HBase is it works in two modes:

  • Deep: as a BigTable implementation, a low-latency multidimensional map for put/get calls
  • Flat: as a Hive table, for full table scans and joins with other Hive tables

I already had HBase running on my mini-cluster. Getting Hive to work on HBase required a few more steps.

First, I created an HBase table "census": the row key is the state abbreviation ("WA", "CA", etc.), and the column family yearPop stores KV pairs, where the Key is a year and the Value is that state's population for that year. I loaded some data into the table: 9 states with 3 entries each, for 2000, 2011 and 2012.
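
For reference, creating and loading a table like that from the HBase shell looks like this (the population numbers here are made-up placeholders, not my real data):

    hbase shell
    # then, at the hbase shell prompt:
    create 'census', 'yearPop'
    put 'census', 'WA', 'yearPop:2000', '5900000'
    put 'census', 'WA', 'yearPop:2011', '6800000'
    put 'census', 'WA', 'yearPop:2012', '6900000'
    get 'census', 'WA'    # the low-latency "Deep" access path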

Next I created a Hive table for this HBase table, like this:

    create external table census(key string, pop2000 int, pop2011 int, pop2012 int)
    stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    with serdeproperties("hbase.columns.mapping" = ":key,yearPop:2000,yearPop:2011,yearPop:2012")
    tblproperties("hbase.table.name" = "census");

This ran with no errors. So I tested it with a simple query:

    select * from census;

This returned all 9 rows of the table. So I tried another query:

    select count(*) from census;

The job fired off, but all the Mappers failed. The job error log showed a "cannot instantiate class" exception. Long story short, I tracked down each class to the JAR that contains it. Each was defined in a JAR file in the $HIVE_HOME/lib directory. I ended up adding 3 JARs to the $HADOOP_HOME/lib directory, each from the $HIVE_HOME/lib directory. From the $HADOOP_HOME/lib directory I did this:

    ln -s ../../hive/lib/hbase-0.90.6-cdh3u5.jar hbase.jar
    ln -s ../../hive/lib/hive-hbase-handler-0.7.1-cdh3u5.jar hive-hbase-handler.jar
    ln -s ../../hive/lib/zookeeper-3.3.1.jar zookeeper.jar
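
The "long story" part of tracking a class to its JAR can be scripted; a loop like this reports which JAR in $HIVE_HOME/lib defines a given class (HBaseStorageHandler here):

    for j in $HIVE_HOME/lib/*.jar; do
      unzip -l "$j" | grep -q 'HBaseStorageHandler' && echo "$j"
    done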

With those links in place, the above job succeeded. Next I tried a slightly more complex Hive query:

    select key, (pop2011 / pop2000) as grow1, (pop2012 / pop2011) as grow2
    from census
    order by grow2;

This query succeeded too.

NOTE: during the Mapper task, the 2 slow nodes in my Hadoop cluster failed to connect to HBase; they got some kind of timeout. The log files suggest they simply aren't responding fast enough, because they're old, slow machines that belong in a museum. The neat thing is, Hadoop automatically rescheduled the task on one of the faster machines and it succeeded. This is the fault tolerance of Hadoop's distributed processing in action, and it was fun to see.

CONCLUSION

I now have CDH3U5 running, with Hadoop, HBase and Hive all interconnected. Next step: Mahout!