Monday, 22 April 2013

MySQL Applier For Hadoop: Implementation

This is a follow-up post describing the implementation details of the Hadoop Applier and the steps to install and configure it. The Hadoop Applier integrates MySQL with Hadoop by replicating INSERTs to HDFS in real time, so the data can be consumed by the data stores working on top of Hadoop. You can learn more about the design rationale and prerequisites in the previous post.

Design and Implementation:

Hadoop Applier replicates rows inserted into a table in MySQL to the Hadoop Distributed File System (HDFS). It uses the API provided by libhdfs, a C library for manipulating files in HDFS.

The library comes pre-compiled with Hadoop distributions.
The Applier connects to the MySQL master (or reads a binary log generated by MySQL) and:
  • fetches the row insert events occurring on the master
  • decodes these events and extracts the data inserted into each field of the row
  • uses content handlers to get the data into the required format and appends it to a text file in HDFS.

Schema equivalence is a simple mapping:

Databases are mapped to separate directories, with their tables as sub-directories. Data inserted into each table is written to text files (named, for example, datafile1.txt) in HDFS. The data can be comma separated, or any other delimiter can be used; the delimiter is configurable through command line arguments.

The diagram explains the mapping between MySQL and HDFS schema.

MySQL to HDFS mapping
The file in which the data is stored is named datafile1.txt here; you can name it anything you want. The working directory where this data file goes is base_dir/db_name.db/tb_name.
The timestamp at which the event occurs is included as the first field in each row inserted in the text file.
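
To make the mapping concrete, here is a minimal sketch of how the HDFS path for a table could be composed. The function name hdfs_data_path and its parameters are hypothetical and are not taken from the Applier sources; only the base_dir/db_name.db/tb_name layout and the datafile1.txt name come from the description above.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not taken from the Applier sources: composes the HDFS
// path base_dir/db_name.db/tb_name/file_name described in the text above.
std::string hdfs_data_path(const std::string &base_dir,
                           const std::string &db_name,
                           const std::string &tb_name,
                           const std::string &file_name= "datafile1.txt")
{
  // A database and a table name are both needed to locate the data file.
  assert(!db_name.empty() && !tb_name.empty());
  return base_dir + "/" + db_name + ".db/" + tb_name + "/" + file_name;
}
```

For example, hdfs_data_path("/user/hive/warehouse", "db1", "t1") gives "/user/hive/warehouse/db1.db/t1/datafile1.txt".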

The implementation follows these steps:

- Connect to the MySQL master using the interfaces to the binary log

#include "binlog_api.h"

Binary_log binlog(create_transport(mysql_uri.c_str()));

- Register content handlers


// Table_index is a subclass of the Content_handler class in the Binlog API
Table_index table_event_hdlr;
Applier replay_hndlr(&table_event_hdlr, &sqltohdfs_obj);

- Start an event loop and wait for the events to occur on the master

while (true)
{
  // Pull events from the master. This is the heartbeat of the event listener.
  Binary_log_event *event;
  binlog.wait_for_next_event(&event);
}

- Decode the event using the content handler interfaces

class Applier : public mysql::Content_handler

Applier(Table_index *index, HDFSSchema *mysqltohdfs_obj)
   m_table_index= index;
   m_hdfs_schema= mysqltohdfs_obj;
mysql::Binary_log_event *process_event(mysql::Row_event *rev)
   int table_id= rev->table_id;
   typedef std::map<long int, mysql::Table_map_event *> Int2event_map;
   Int2event_map::iterator ti_it= m_table_index->find(table_id);

- Each row event contains multiple rows and fields.
Iterate one row at a time using Row_iterator.

mysql::Row_event_set rows(rev, ti_it->second);
mysql::Row_event_set::iterator it= rows.begin();
do {
  mysql::Row_of_fields fields= *it;
  long int timestamp= rev->header()->timestamp;
  if (rev->get_event_type() == mysql::WRITE_ROWS_EVENT)
    table_insert(db_name, table_name, fields, timestamp, m_hdfs_schema);
} while (++it != rows.end());

- Get the field data separated by field delimiters and row delimiters.

Each row contains a vector of Value objects. The converter allows us to transform the value into another representation.

mysql::Row_of_fields::const_iterator field_it= fields.begin();

mysql::Converter converter;
std::ostringstream data;
data << timestamp;
do {
  std::string str;
  // Convert the field value into its string representation.
  converter.to(str, *field_it);

  data << sqltohdfs_obj->hdfs_field_delim;
  data << str;
} while (++field_it != fields.end());
data << sqltohdfs_obj->hdfs_row_delim;
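
As a self-contained illustration of the row layout produced by this step, the sketch below formats one row without depending on the Binlog API. The function format_row is a hypothetical stand-in, and it assumes the field values have already been converted to strings; the defaults follow the ctrl-A and LINE FEED delimiters described in this post.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, not part of the Applier sources: builds one output
// line as <timestamp><delim><field 1><delim><field 2>...<row delimiter>.
std::string format_row(long int timestamp,
                       const std::vector<std::string> &fields,
                       const std::string &field_delim= "\x01",  // ctrl-A
                       const std::string &row_delim= "\n")      // LINE FEED
{
  // Without a row delimiter, consecutive rows could not be told apart.
  assert(!row_delim.empty());
  std::ostringstream data;
  data << timestamp;  // the event timestamp is always the first field
  for (std::vector<std::string>::const_iterator it= fields.begin();
       it != fields.end(); ++it)
    data << field_delim << *it;
  data << row_delim;
  return data.str();
}
```

With a comma as the field delimiter, a row holding the values 1 and abc at timestamp 1366600000 is written as "1366600000,1,abc" followed by a line feed.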

- Connect to the HDFS file system. 
If not provided, the connection information (user name, password, host and port) is read from the XML configuration file, hadoop-site.xml.

hdfsFS m_fs= hdfsConnect(host.c_str(), port);

- Create the directory structure in HDFS. 
Set the working directory and open the file in append mode.

hdfsSetWorkingDirectory(m_fs, (stream_dir_path.str()).c_str());
const char* write_path= "datafile1.txt";
hdfsFile writeFile;

- Append data at the end of the file.

writeFile= hdfsOpenFile(m_fs, write_path, O_WRONLY|O_APPEND, 0, 0, 0);
tSize num_written_bytes = hdfsWrite(m_fs, writeFile, (void*)data, strlen(data));

Install and Configure:
Follow these steps to install and run the Applier:
1. Download a Hadoop release (I am using 1.0.4), then configure and install it (for the purpose of the demo, install it in pseudo-distributed mode). Flag '' must be set to true while configuring HDFS (hdfs-site.xml). Since append is not supported in Hadoop 1.x, set the flag '' to true.

2. Set the environment variable $HADOOP_HOME to point to the Hadoop installation directory.

3. CMake doesn't come with a 'find' module for libhdfs. Ensure that the 'FindHDFS.cmake' is in the CMAKE_MODULE_PATH. You can download a copy here

4. Edit the file 'FindHDFS.cmake', if necessary, to have HDFS_LIB_PATHS set as a path to, and HDFS_INCLUDE_DIRS have the path pointing to the location of hdfs.h.
For 1.x versions, the library path is $ENV{HADOOP_HOME}/c++/Linux-i386-32/lib, and the header files are in $ENV{HADOOP_HOME}/src/c++/libhdfs. For 2.x releases, the libraries and header files can be found in $ENV{HADOOP_HOME}/lib/native and $ENV{HADOOP_HOME}/include respectively.

For versions 1.x, this patch will fix the paths:

  --- a/cmake_modules/FindHDFS.cmake
  +++ b/cmake_modules/FindHDFS.cmake
  @@ -11,6 +11,7 @@ exec_program(hadoop ARGS version OUTPUT_VARIABLE
   # currently only looking in HADOOP_HOME
   find_path(HDFS_INCLUDE_DIR hdfs.h PATHS
  +  $ENV{HADOOP_HOME}/src/c++/libhdfs/
     # make sure we don't accidentally pick up a different version
  @@ -26,9 +27,9 @@ endif()
   message(STATUS "Architecture: ${arch_hint}")
   if ("${arch_hint}" STREQUAL "x64")
  -  set(HDFS_LIB_PATHS $ENV{HADOOP_HOME}/lib/native)
  +  set(HDFS_LIB_PATHS $ENV{HADOOP_HOME}/c++/Linux-amd64-64/lib)
   else ()
  -  set(HDFS_LIB_PATHS $ENV{HADOOP_HOME}/lib/native)
  +  set(HDFS_LIB_PATHS $ENV{HADOOP_HOME}/c++/Linux-i386-32/lib)
   endif ()

5. Since libhdfs is a JNI-based API, it requires the JNI header files and libraries to build. If a module FindJNI.cmake exists in the CMAKE_MODULE_PATH and JAVA_HOME is set, the headers will be included and the libraries will be linked. If not, you will need to include the headers and load the libraries separately (modify LD_LIBRARY_PATH).

6. Build and install the library 'libreplication', to be used by Hadoop Applier, using CMake.
  • Download a copy of Hadoop Applier from
  • 'mysqlclient' library is required to be installed in the default library paths. You can either download and install it (you can get a copy here), or set the environment variable $MYSQL_DIR to point to the parent directory of MySQL source code. Make sure to run cmake on MySQL source directory.
                $export MYSQL_DIR=/usr/local/mysql-5.6
  • Run the 'cmake' command on the parent directory of the Hadoop Applier source. This will generate the necessary Makefiles. Make sure to set the cmake option ENABLE_DOWNLOADS=1, which will install Google Test, required to run the unit tests.
              $cmake . -DENABLE_DOWNLOADS=1
  • Run 'make' and 'make install' to build and install. This will install the library 'libreplication' which is to be used by Hadoop Applier.
7. Make sure to set the CLASSPATH to all the hadoop jars needed to run Hadoop itself.

         $export PATH=$HADOOP_HOME/bin:$PATH

         $export CLASSPATH=$(hadoop classpath)

8. The code for Hadoop Applier can be found in /examples/mysql2hdfs, in the Hadoop Applier repository. To compile, you can simply load the libraries (modify LD_LIBRARY_PATH if required), and run the command “make happlier” on your terminal. This will create an executable file in the mysql2hdfs directory.

.. and then you are done!

Now run Hadoop DFS (namenode and datanode), and start a MySQL server as master with row-based replication. You can use the mtr rpl suite for testing purposes: $MySQL-5.6/mysql-test$./mtr --start --suite=rpl --mysqld=--binlog_format='ROW' --mysqld=--binlog_checksum=NONE. Then start Hive (optional) and run the executable ./happlier, optionally providing the MySQL and HDFS URIs and other available command line options (./happlier --help for more info).

There are useful filters as command line options to the Hadoop applier.
Options and their use:

-r, --field-delimiter=DELIM
    Use DELIM instead of ctrl-A for field delimiter. DELIM can be a string or an ASCII value in the format '\nnn'.
    Escape sequences are not allowed.
    Use: Provide the string by which the fields in a row will be separated. By default, it is set to ctrl-A.

-w, --row-delimiter=DELIM
    Use DELIM instead of LINE FEED for row delimiter. DELIM can be a string or an ASCII value in the format '\nnn'.
    Escape sequences are not allowed.
    Use: Provide the string by which the rows of a table will be separated. By default, it is set to LINE FEED (\n).

-d, --databases=DB_LIST
    DB_LIST is made up of one database name, or many names separated by commas.
    Each database name can be optionally followed by table names.
    The table names must follow the database name, separated by HYPHENS.
    Example: -d=db_name1-table1-table2,dbname2-table1,dbname3
    Use: Import entries for some databases, optionally including only specified tables.

-f, --fields=LIST
    Similar to the cut command, LIST is made up of one range, or many ranges separated by commas. Each range is one of:
      N     N'th byte, character or field, counted from 1
      N-    from N'th byte, character or field, to end of line
      N-M   from N'th to M'th (included) byte, character or field
      -M    from first to M'th (included) byte, character or field
    Use: Import entries for only some fields in a table.

-h, --help
    Use: Display help.

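
To illustrate the LIST syntax accepted by --fields, here is a minimal sketch of a parser for the four range forms (N, N-, N-M, -M). It is not the Applier's own option parser; the names Field_range, parse_field_list and field_selected are hypothetical, and the sketch assumes the ranges select table fields counted from 1.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// One inclusive, 1-based range of field numbers (hypothetical type).
struct Field_range
{
  std::size_t first;
  std::size_t last;
};

// Parses a cut-style LIST such as "1,3-4,6-" into ranges.
// Throws std::invalid_argument on an empty, zero-based or reversed range.
std::vector<Field_range> parse_field_list(const std::string &list)
{
  const std::size_t open_end= std::numeric_limits<std::size_t>::max();
  std::vector<Field_range> ranges;
  std::istringstream in(list);
  std::string token;
  while (std::getline(in, token, ','))
  {
    if (token.empty() || token == "-")
      throw std::invalid_argument("empty range in field list");
    const std::size_t dash= token.find('-');
    Field_range r;
    if (dash == std::string::npos)
      r.first= r.last= std::stoul(token);                  // N
    else
    {
      // "-M" starts at the first field; "N-" runs to the end of the row.
      r.first= (dash == 0) ? 1 : std::stoul(token.substr(0, dash));
      r.last= (dash + 1 == token.size())
                  ? open_end : std::stoul(token.substr(dash + 1));
    }
    if (r.first == 0 || r.last < r.first)
      throw std::invalid_argument("invalid range: " + token);
    ranges.push_back(r);
  }
  return ranges;
}

// Returns true if field number n (counted from 1) falls in any range.
bool field_selected(const std::vector<Field_range> &ranges, std::size_t n)
{
  assert(n >= 1);  // fields are counted from 1
  for (std::size_t i= 0; i < ranges.size(); ++i)
    if (n >= ranges[i].first && n <= ranges[i].last)
      return true;
  return false;
}
```

With the list "1,3-4,6-", fields 1, 3, 4 and every field from 6 onwards are selected, while fields 2 and 5 are skipped.
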
Integration with HIVE:
Hive runs on top of Hadoop. It is sufficient to install Hive only on the Hadoop master node.
Take note of the default data warehouse directory, set as a property in hive-default.xml.template configuration file. This must be the same as the base directory into which Hadoop Applier writes.

Since the Applier does not import DDL statements, you have to create similar schemas on both MySQL and Hive, i.e. set up a similar database in Hive using HiveQL (Hive Query Language). Since timestamps are inserted as the first field in HDFS files, you must take this into account while creating tables in Hive.

SQL Query Hive Query
CREATE TABLE t ( time_stamp INT, i INT)

Now, when any row is inserted into a table in a MySQL database, a corresponding entry is made in the Hive table. Watch the demo to get a better understanding.
The demo has no audio and is meant to be followed in conjunction with the blog. You can also create an external table in Hive and load data into the tables; it's your choice!
Watch the Hadoop Applier  Demo >>

Limitations of the Applier:
In the first version we support only WRITE_ROWS_EVENT, i.e. only insert statements are replicated.
We have considered adding support for deletes, updates and DDLs as well, but they are more complicated to handle and we are not sure how much interest there is in this.
We would very much appreciate your feedback on requirements - please use the comments section of this blog to let us know!

The Hadoop Applier is compatible with MySQL 5.6; however, it does not import events if binlog checksums are enabled. Make sure to set binlog_checksum to NONE on the master, and run the server in row-based replication mode.

This innovation includes dedicated contribution from Neha Kumari, Mats Kindahl and Narayanan Venkateswaran. Without them, this project would not be a success.

Give it a try! You can download a copy from and get started NOW!


  1. Julien Duponchelle 22 May 2013 13:16

    Hi Shubhangi, it seems we are working on similar projects.

    I think my last commit may interest you; it is the implementation of checksum:

    Perhaps we can exchange by mail (

    1. Hi Julien,

      It is a very interesting project, thank you for the offer.
      However, we have implemented checksum already, and also, our development is in C++ .

  2. Hi Shubhangi

    Great work... When will the update/delete feature be available?

  3. Great work!! :D

    One question: does the 'event' memory returned by binlog.wait_for_next_event(&event); need to be released?
    I find examples/mysql2hdfs/ releasing memory with 'delete event;', while the rest of the examples do not (memory leak?).

    1. Hi!

      Yes, it is the responsibility of the caller to release the memory allocated to the event, as done in examples/mysql2hdfs.

      Also, as you correctly detected, it is a bug in the example programs.
      It has been reported, and will be fixed in the release.
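
The ownership rule can be sketched as follows. This is only an illustration with stand-in types: Fake_event and Fake_binlog are hypothetical and merely mimic the shape of binlog.wait_for_next_event(&event), so that the example compiles without the Binlog API. Wrapping the raw pointer in std::unique_ptr is one possible alternative to calling 'delete event;' by hand.

```cpp
#include <cassert>
#include <memory>

// Stand-in for an event type; counts live objects so leaks are visible.
struct Fake_event
{
  static int live;
  Fake_event() { ++live; }
  ~Fake_event() { --live; }
};
int Fake_event::live= 0;

// Stand-in for the binlog: hands out heap-allocated events owned by the caller.
struct Fake_binlog
{
  int remaining;
  // Returns 0 and sets *event on success, non-zero when no events are left.
  int wait_for_next_event(Fake_event **event)
  {
    if (remaining == 0)
      return 1;
    --remaining;
    *event= new Fake_event();
    return 0;
  }
};

// Drains the stream; each event is freed when 'owner' goes out of scope.
int consume_all(Fake_binlog &binlog)
{
  int processed= 0;
  Fake_event *raw= 0;
  while (binlog.wait_for_next_event(&raw) == 0)
  {
    assert(raw != 0);
    std::unique_ptr<Fake_event> owner(raw);  // takes over the caller's duty
    ++processed;
  }
  return processed;
}
```

After consume_all returns, the Fake_event::live counter is back to zero, which is the behaviour an event loop needs in order not to leak.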

  4. Great Work!

  5. Hello Shubhangi !

    Nice project, which is much needed and will eliminate a few layers in complex workflows.

    Waiting for the project to mature :)

    All the very best !!


  6. Shubhangi,

    Please let me know if it is open source; I want to contribute :)


    1. Hi Sandeep!

      Thank you for trying out the product, and it is great to see you willing to contribute to it. Thanks!

      The product is GPL, so it is open source by the definition of the Open Source Initiative, and you can definitely contribute to it. To start with patches, you have to sign the OCA, Oracle contributor agreement.

      However, please note that we are refactoring the code, and I am not sure if your patches would work.

  7. Hi Shubhangi,

    Interesting work !!

    I wanted to ask how the translation from the MySQL schema to the Hive schema is done. Does it have to be done as an offline process for both systems separately, or will simply creating the schema in one system, say MySQL, allow us to replicate data to HDFS and also put an auto-generated Hive schema on top of that?


    1. Hi Ravi!

      Thank you for trying out the applier, and bringing this to attention!

      The translation is to be done offline.
      You need to create similar schemas in both MySQL and Hive.

      Creating a schema in MySQL (a DDL operation) will not generate a schema in Hive automatically; it requires manual effort. Please note that replication of DDL statements is not supported in this release.

      Also, the schema equivalence is required before starting the real-time replication from MySQL to Hive.

      Please reply on the thread in case you have any other questions.

      Hope this helps,

  8. Please could you confirm that 'mixed' binlog format is not supported at this point?


    1. Hi!

      Yes, at this point, only row format is supported by the Applier. Mixed mode is not completely supported, i.e. in this mode, only the inserts which are mapped as (table map+row events) in MySQL will be replicated.

      Thank you for the question. May I ask for a use case where this is a requirement? It can help us shape the future releases of the Applier.

      Thank you,

  9. Shubhangi,

    Great idea !
    You are filling another gap (the real-time integration) between RDBMS and Hadoop.


  10. Shubhangi,

    I've run cmake successfully, but running make failed:

    /opt/mysql-hadoop-applier-0.1.0/src/value.cpp: In function 'int32_t mysql::calc_field_size(unsigned char, const unsigned char*, uint32_t)':
    /opt/mysql-hadoop-applier-0.1.0/src/value.cpp:151: error: 'MYSQL_TYPE_TIME2' was not declared in this scope
    /opt/mysql-hadoop-applier-0.1.0/src/value.cpp:157: error: 'MYSQL_TYPE_TIMESTAMP2' was not declared in this scope
    /opt/mysql-hadoop-applier-0.1.0/src/value.cpp:163: error: 'MYSQL_TYPE_DATETIME2' was not declared in this scope
    make[2]: *** [src/CMakeFiles/replication_static.dir/value.cpp.o] Error 1
    make[1]: *** [src/CMakeFiles/replication_static.dir/all] Error 2
    make: *** [all] Error 2


    1. Hi!

      Thank you for trying the Applier!

      Sorry, the issue with the compilation is because you are using the libmysqlclient library released with MySQL-5.5.

      Since the data types MYSQL_TYPE_TIME2, MYSQL_TYPE_TIMESTAMP2 and MYSQL_TYPE_DATETIME2 are defined only in the latest release of MySQL (5.6 GA), the 'make' command fails.

      This is a bug, and has been reported.

      In order to compile, I suggest using either the latest released version of the MySQL source code (5.6), or the latest GA release of the Connector/C library ( -6.1.1).
      You can get a copy of the connector here.

      Hope it helps.
      Please reply on the thread if you are still facing issues!


    2. Shubhangi,

      Thank you, I've successfully compiled the applier after changing MySQL to 5.6. But running the applier produces an error saying "Can't connect to the master.":
      [root@localhost mysql2hdfs]# ./happlier --field-delimiter=, mysql://root@ hdfs://localhost:9000
      The default data warehouse directory in HDFS will be set to /usr/hive/warehouse
      Change the default data warehouse directory? (Y or N) N
      Connected to HDFS file system
      The data warehouse directory is set as /user/hive/warehouse
      Can't connect to the master.


    3. Hi Chuanliang,

      The above error means that the applier is not able to connect to the MySQL master. It can be due to two reasons:
      - The MySQL server is not started on port 3306 on localhost
      - You may not have the permissions to connect to the master as the root user.

      To be sure, could you try opening a separate mysql client, and connect to the server using the same params, i.e. user=root, host=localhost, port=3306?

      Hope it Helps.


    4. Hi Shubhangi,

      I tried this command,

      mysql -uroot -hlocalhost -P3306

      It connects to MySQL successfully.


    5. Hi Chuanliang,

      That is good, my above suspicions were wrong. Sorry about that.

      Can you please check the following for the MySQL server:
      - binlog_checksum is set to NONE:
        start the server with the command line option --binlog_checksum=NONE
      - Logging into the binary log is enabled:
        start the server with the command line option --log-bin=binlog-name

      Also, specify --binlog_format=ROW in order to replicate into HDFS.

      Thank you,

    6. Hi Shubhangi,

      Yes, I ran the mtr rpl suite as you've written in this post, and data can be replicated to Hive. But this is a MySQL test run. In order to make the real server run:

      1. I configured these options in /etc/my.cnf like this,


      2. "service mysql start" start mysql server

      3. It can produce binlog file under mysql/data/

      4. But when I use applier,
      ./happlier --field-delimiter=, mysql://root@ hdfs://localhost:9000

      errors occur:

      [root@localhost mysql2hdfs]# ./happlier --field-delimiter=, mysql://root@ hdfs://localhost:9000
      The default data warehouse directory in HDFS will be set to /usr/hive/warehouse
      Change the default data warehouse directory? (Y or N) N
      Connected to HDFS file system
      The data warehouse directory is set as /user/hive/warehouse
      # A fatal error has been detected by the Java Runtime Environment:
      # SIGSEGV (0xb) at pc=0x00007f1ed83b647d, pid=17282, tid=139770481727264
      # JRE version: 6.0_31-b04
      # Java VM: Java HotSpot(TM) 64-Bit Server VM (20.6-b01 mixed mode linux-amd64 compressed oops)
      # Problematic frame:
      # C [] std::string::compare(std::string const&) const+0xd
      # An error report file with more information is saved as:
      # /opt/mysql-hadoop-applier-0.1.0/examples/mysql2hdfs/hs_err_pid17282.log
      # If you would like to submit a bug report, please visit:
      # The crash happened outside the Java Virtual Machine in native code.
      # See problematic frame for where to report the bug.


    7. Hi Chuanliang!

      Good that you can run it using mtr. :)
      The problem is that the MySQL master is mis-configured, since you do not provide server-id in the cnf file. Please add that too in the conf file.

      The file /etc/my.cnf should contain at least

      server-id=2 #please note that this can be anything other than 1, since applier uses 1 to act as a slave (code in src/tcp_driver.cpp), so MySQL server cannot have the same id.

      Hope it helps.


    8. Hi Shubhangi,

      Thank you for your reply, with your suggestion the error is gone. :)

      And I have some other questions:

      1. How can Applier connect to a server with password?
      2. If I want to collect data from more than one MySQL server at the same time, how can I implement it with the Applier? Should I write a shell script to set up many Applier connections together? Can you give me some advice?


    9. Hi Chuanliang,

      Please find the answers below:

      1. You need to pass the MySQL uri to the Applier in the following format user[:password]@host[:port]
      For example: ./happlier user:password@localhost:13000

      2. Yes, that is possible. However, one instance of the applier can connect to only one MySQL server at a time. In order to collect data from multiple servers, you need to start multiple instances of the applier (you can use the same executable happlier simultaneously for all the connections).

      Yes, you may write a shell script to start a MySQL server and applier pair for each of the servers you want to collect data from.

      Also, it would be very interesting to improve the applier so that a single instance can connect to more than one server at a time. We might consider this for future releases. Thank you for the idea. If I may ask, can you please provide the use case where you require this implementation?
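
To illustrate the URI format from answer 1, user[:password]@host[:port], here is a minimal sketch of how such a string can be split into its parts. It is not the parser used by the Applier; the names Mysql_uri and parse_mysql_uri are hypothetical, accepting an optional mysql:// prefix is an assumption, and so is falling back to port 3306 when no port is given.

```cpp
#include <cassert>
#include <exception>
#include <string>

// Hypothetical container for the parts of user[:password]@host[:port].
struct Mysql_uri
{
  std::string user;
  std::string password;
  std::string host;
  unsigned long port;
};

// Splits the URI into its parts; returns false if user or host is missing
// or the port is not a number. An optional "mysql://" prefix is skipped.
bool parse_mysql_uri(std::string uri, Mysql_uri &out)
{
  const std::string scheme= "mysql://";
  if (uri.compare(0, scheme.size(), scheme) == 0)
    uri.erase(0, scheme.size());

  // The last '@' separates the credentials from the server location.
  const std::size_t at= uri.rfind('@');
  if (at == std::string::npos || at == 0 || at + 1 == uri.size())
    return false;
  assert(at < uri.size());
  const std::string credentials= uri.substr(0, at);
  const std::string location= uri.substr(at + 1);

  const std::size_t cred_colon= credentials.find(':');
  out.user= credentials.substr(0, cred_colon);
  out.password= (cred_colon == std::string::npos)
                    ? std::string() : credentials.substr(cred_colon + 1);

  const std::size_t loc_colon= location.rfind(':');
  out.host= location.substr(0, loc_colon);
  out.port= 3306;  // assumed default when no port is given
  if (loc_colon != std::string::npos)
  {
    try
    {
      out.port= std::stoul(location.substr(loc_colon + 1));
    }
    catch (const std::exception &)
    {
      return false;
    }
  }
  return !out.user.empty() && !out.host.empty();
}
```

For the example above, user:password@localhost:13000 is split into user "user", password "password", host "localhost" and port 13000.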

      Thank you,

    10. Hi Shubhangi,

      We are a game company that operates many mobile and internet games. Most of the games use MySQL as their database. The data produced by games and players grows daily. The game operations department needs information about the games in order to make marketing decisions.

      To store and analyze this huge amount of data, we use Hadoop. At first we used Sqoop to collect and import data from multiple MySQL servers, and developed a system to manage all the collection tasks: creating tasks according to a plan, viewing the progress, and reporting. However, in order not to affect the running games, collection always runs at night, so there is a long delay before we get status information about the games. Then I found the Applier. I think the real-time replication approach is great, so I want to replace our collection system with the Applier.

      This is our use case. :)

      I'm looking forward to seeing the Applier's future releases. And if the Applier can support connecting to multiple servers from a single instance, maybe you can also provide a tool to manage and control the process.

      Thank you,

    11. Hi Chuanliang!

      Thanks a lot for the wonderful explanation. The Applier is aimed to solve issues with exactly such delays involved in the operation.

      It is a very valid use case for the Applier, and I can clearly mark out the requirement of supporting data feed from multiple servers. Thanks once again, this will help us decide on the future road map for the Applier.

      Please stay tuned for updates on the product, and send us any other feedback you have.


    12. Hi Shubhangi,

      Thank you too for providing us such an amazing product. If you have updates for the product, I'll be very pleased to try it. My E-mail,


  11. Hi, when I execute the cmake command at step 6 I get the following error. Please advise, as I'm new to the applier.

    Warning: Source file "/home/thilanj/hadoop/mysql-hadoop-applier-0.1.0/src/basic_transaction_parser.cpp" is listed multiple times for target "replication_shared".

    1. Hi Archfiend!

      Thank you for trying out the Applier!

      The warning is generated because of the inclusion of the name "basic_transaction_parser.cpp" twice in the cmake file while setting the targets for the library.( code: mysql-hadoop-applier-0.1.0/src/CMakeLists.txt : line no. 5 and line no. 7)

      This is our fault, thank you for pointing this out. This will be fixed in the next release.

      For now, to fix it I request you to do the following:

      -please modify the file mysql-hadoop-applier-0.1.0/src/CMakeLists.txt to remove any one of the two names (i.e. basic_transaction_parser.cpp , either from line no.7 or line no. 5)

      -execute rm CMakeCache.txt from the base dir (/home/thilanj/hadoop/mysql-hadoop-applier-0.1.0), if it exists

      - run cmake again.

      Thank you once again. Please let me know if it works out for you.


    2. Hi Shubhangi,

      Thanks for the fast reply. On a separate note, I'm using hortonworks hadoop. Is this compatible with it?



    3. Cont.

      I've been trying to execute the "make -j8" command as in the video tutorial, but I get the following set of errors. The files mentioned in the error log are already there, but I still get this error. Please advise.

      Error Log:

      Scanning dependencies of target replication_shared
      Scanning dependencies of target replication_static
      [ 3%] [ 7%] [ 10%] [ 14%] [ 17%] [ 21%] [ 25%] [ 28%] Building CXX object src/CMakeFiles/replication_shared.dir/access_method_factory.cpp.o
      Building CXX object src/CMakeFiles/replication_shared.dir/field_iterator.cpp.o
      Building CXX object src/CMakeFiles/replication_static.dir/access_method_factory.cpp.o
      Building CXX object src/CMakeFiles/replication_shared.dir/row_of_fields.cpp.o
      Building CXX object src/CMakeFiles/replication_static.dir/field_iterator.cpp.o
      Building CXX object src/CMakeFiles/replication_static.dir/row_of_fields.cpp.o
      Building CXX object src/CMakeFiles/replication_shared.dir/basic_transaction_parser.cpp.o
      Building CXX object src/CMakeFiles/replication_shared.dir/binlog_driver.cpp.o
      In file included from /home/thilanj/hadoop/mysql-hadoop-applier-0.1.0/include/binlog_driver.h:25,
      from /home/thilanj/hadoop/mysql-hadoop-applier-0.1.0/include/access_method_factory.h:24,
      from /home/thilanj/hadoop/mysql-hadoop-applier-0.1.0/src/access_method_factory.cpp:20:
      /home/thilanj/hadoop/mysql-hadoop-applier-0.1.0/include/protocol.h:24:23: error: my_global.h: No such file or directory
      In file included from /home/thilanj/hadoop/mysql-hadoop-applier-0.1.0/include/binlog_driver.h:25,
      from /home/thilanj/hadoop/mysql-hadoop-applier-0.1.0/include/access_method_factory.h:24,
      from /home/thilanj/hadoop/mysql-hadoop-applier-0.1.0/src/access_method_factory.cpp:20:
      /home/thilanj/hadoop/mysql-hadoop-applier-0.1.0/include/protocol.h:24:23: error: my_global.h: No such file or directory
      /home/thilanj/hadoop/mysql-hadoop-applier-0.1.0/include/protocol.h:25:19: error: mysql.h: No such file or directory
      /home/thilanj/hadoop/mysql-hadoop-applier-0.1.0/include/protocol.h:26:21: error: m_ctype.h: No such file or directory
      /home/thilanj/hadoop/mysql-hadoop-applier-0.1.0/include/protocol.h:27:24: error: sql_common.h: No such file or directory
      In file included from /home/thilanj/hadoop/mysql-hadoop-applier-0.1.0/include/value.h:24,
      from /home/thilanj/hadoop/mysql-hadoop-applier-0.1.0/include/field_iterator.h:25,
      from /home/thilanj/hadoop/mysql-hadoop-applier-0.1.0/src/field_iterator.cpp:20:

    4. Hi Archfiend!

      Can you please mention how you are addressing the dependency on libmysqlclient: using the MySQL source code, or the Connector/C library directly?

      Please make sure of the following:
      If you are using MySQL source code for the mysqlclient library, make sure
      1. The MySQL source code is built (i.e. run the cmake and make command on MySQL source code)
      2. Set the environment variable MYSQL_DIR to point to the base directory of this source code (please note, do not give the path up to the lib directory, but only the base directory)
      3. Please check that the file 'my_global.h' is present in the path $MYSQL_DIR/include and the library in $MYSQL_DIR/lib
      4. Delete CMakeCache.txt from the Hadoop Applier base directory
      (rm CMakeCache.txt)
      5. Run cmake and make again.

      If you are using the library directly, make sure that you have the following:
      If not explicitly specified,
      1. The above mentioned files (my_global.h) must be in the standard header paths where the compiler looks for. (/usr/lib)
      2. The library should be in the standard library paths

      3. rm CMakeCache.txt
      4. cmake and make again

      Hope it helps!
      Please reply in case if you still face issues.

      Thank you,

  12. Have you considered replication to HBase, utilizing the versioning capability ( to allow a high fidelity history to be maintained to support time series analysis etc?

  13. Hi, when I install with "make happlier", I see error like :

    hadoop-2.2.0/lib/native/ could not read symbols: File in wrong format

    How can I fix it?

    1. Hi,
      Thank you for trying out the applier!

      I am not sure, but it looks like the linker error occurs because the library version may be incompatible with the happlier build.

      Can you please make sure that the library is compiled for the type 32 bit (or 64 bit), same as the rest of your object files, i.e. the happlier and

      Please continue the thread if you are still facing issues.


  14. This comment has been removed by the author.

  15. Dear Shubhangi,
    When I run the command "make happlier", I see the result:
    [ 77%] Built target replication_static
    Scanning dependencies of target happlier
    [ 83%] Building CXX object examples/mysql2hdfs/CMakeFiles/happlier.dir/mysql2hdfs.cpp.o
    [ 88%] Building CXX object examples/mysql2hdfs/CMakeFiles/happlier.dir/table_index.cpp.o
    [ 94%] Building CXX object examples/mysql2hdfs/CMakeFiles/happlier.dir/hdfs_schema.cpp.o
    [100%] Building CXX object examples/mysql2hdfs/CMakeFiles/happlier.dir/table_insert.cpp.o
    Linking CXX executable happlier
    /usr/lib/jvm/java-6-openjdk/jre/lib/amd64/ undefined reference to `awt_Unlock@SUNWprivate_1.1'
    /usr/lib/jvm/java-6-openjdk/jre/lib/amd64/ undefined reference to `awt_GetComponent@SUNWprivate_1.1'
    /usr/lib/jvm/java-6-openjdk/jre/lib/amd64/ undefined reference to `awt_Lock@SUNWprivate_1.1'
    /usr/lib/jvm/java-6-openjdk/jre/lib/amd64/ undefined reference to `awt_GetDrawingSurface@SUNWprivate_1.1'
    /usr/lib/jvm/java-6-openjdk/jre/lib/amd64/ undefined reference to `awt_FreeDrawingSurface@SUNWprivate_1.1'
    collect2: ld returned 1 exit status
    make[3]: *** [examples/mysql2hdfs/happlier] Error 1
    make[2]: *** [examples/mysql2hdfs/CMakeFiles/happlier.dir/all] Error 2
    make[1]: *** [examples/mysql2hdfs/CMakeFiles/happlier.dir/rule] Error 2
    make: *** [happlier] Error 2

    Can you please tell me how to fix that error?


    1. I had the same issue building this on a headless linux VM. To resolve, I manually linked $JAVA_HOME/jre/lib//xawt/ to $JAVA_HOME/jre/lib/

      then installed Xvfb and Xtst. Then it built fine.

  16. Hi Sidus,

    Thank you for trying out the Applier.
    As I see it, the problem occurs while linking to the libjawt libraries.

    Can you please make sure of the following:
    1. Do you have JAVA_HOME set?
    2. Do you have CLASSPATH set to point to the jars required to run Hadoop itself?
    (command ~: export CLASSPATH=$(hadoop classpath) )
    3. Can you please try running Hadoop and check if it runs fine?

    Hope that helps.
    Please reply in case it doesn't solve the issue.

    1. Hi,

      My JAVA_HOME is currently set to /usr/lib/jvm/java-6-openjdk

      And my CLASSPATH is:

      My Hadoop is running fine. My hadoop version is 1.2.1, is there any problem with it?

  17. Hi Sidus,

    The paths look correct.

    I am not sure whether there are issues with Hadoop-1.2.1; I have not tested it yet.

    Maybe installing Oracle JDK (I use 1.7.0_03) instead of OpenJDK would help.

    Thank you,

  18. Greetings!

    First, thanks for the very interesting code.
    This will be very useful.

    At the moment there are some problems, one of which is that I found a case where it is skipping the wrong table. The issue seems to be that mysqld can change the numerical table_id associated with a given table (I think that in my particular case this was associated with a restart of mysqld and a re-reading of the logs from before the restart). Anyway, looking at the code from Table_index::process_event, which processes the TABLE_MAP_EVENTs, two issues arise:

    1) If the table_id associated with the map event is already registered, the code ignores the update. Should the update cause the map to, well, update?

    2) Is there a memory leak? Who deletes the table map event objects?

    For reference, the code is quoted below.


    mysql::Binary_log_event *Table_index::process_event(mysql::Table_map_event *tm)
    if (find(tm->table_id) == end())

    /* Consume this event so it won't be deallocated beneath our feet */
    return 0;

    1. Hi Mike!

      Thank you for trying out the applier! It is very encouraging to see you take an interest in the code base. This will help us overcome the shortcomings faster.

      Regarding the question:

      1. Only committed transactions are written into the binary log. If the server restarts, we don't apply binary logs. Therefore, the table ID will not change or be updated for a map event already written to the binary log.

      2. Yes, there is a memory leak in the code here. This is a bug, thank you for pointing it out.

      You may report it on , under the category 'Server: Binlog'. Or, I can do it on your behalf. Once it is reported, we will be able to commit a patch for the same.

      Thank you once again.
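
      For readers following the two questions above, here is a minimal sketch of an index that replaces an entry when a table_id is registered again and frees the event it displaces. It is not the Applier's code: `Table_map_event` below is a hypothetical stand-in for `mysql::Table_map_event`, and `Owning_table_index` and `register_event` are names invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for mysql::Table_map_event; only the fields used here.
struct Table_map_event {
  std::uint64_t table_id;
  std::string db_name;
  std::string table_name;
};

// Sketch of an index that owns the map events it stores (assumed design,
// not the Table_index class shipped with the Applier).
class Owning_table_index {
public:
  // Takes ownership of tm. A later event with the same table_id replaces
  // the earlier one, and unique_ptr frees the one that was replaced.
  void register_event(Table_map_event *tm) {
    assert(tm != nullptr);
    std::unique_ptr<Table_map_event> &slot = m_index[tm->table_id];
    if (slot.get() != tm)  // re-registering the same object is a no-op
      slot.reset(tm);
  }

  // Returns the current mapping for table_id, or nullptr if it is unknown.
  const Table_map_event *lookup(std::uint64_t table_id) const {
    auto it = m_index.find(table_id);
    return it == m_index.end() ? nullptr : it->second.get();
  }

  std::size_t size() const { return m_index.size(); }

private:
  std::map<std::uint64_t, std::unique_ptr<Table_map_event>> m_index;
};
```

      Because the index owns every event it holds, the events are released when an entry is replaced or when the index itself is destroyed, which is one way to answer the "who deletes" question.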

  19. Greetings!

    Again, thank you for the very interesting and promising software.

    I am not an expert on MySQL, but I've been reading the documentation and the source code you provided.

    Looking at this URL:

    It apparently is the responsibility of the replication server to keep track of where it is in the bin-logs. Otherwise, if the replication server is restarted, it will begin reading as far back as it possibly can.

    The overloads to Binary_log_driver::connect seem to provision for this, but this feature does not seem to be used in the example code.

    Am I overlooking something, or might this be a future enhancement?

    Thank you.

    Mike Albert

    1. Hi Mike,

      I am sorry about the delay in the reply.
      Thank you for trying out the Applier and for looking into the code base as well as the documentation. This will be very helpful in improving the Applier!

      Yes, it is the responsibility of the replication server to keep track of where it is in the bin-logs. As you correctly mention, if the position is not tracked, a restarted server will begin reading from the start.

      The Applier currently suffers from this issue. If restarted, the Applier reads again from the first log. This is a feature enhancement, and should be addressed. Thank you once again for pointing this out!

      Please feel free to report it on , under the category 'Server: Binlog', marking the severity as 4 (feature request). Or, I can do it on your behalf. Once it is reported, we will be able to commit a patch for the same.

      Thank you once again.
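
      As a rough illustration of the position tracking discussed above, the sketch below saves the last applied binary log file name and offset to a small checkpoint file and reads it back on startup. Nothing here is part of the Applier: `Binlog_position`, `save_position` and `load_position` are hypothetical names, the code assumes a POSIX system where rename replaces an existing file, and it assumes binary log file names contain no whitespace.

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical record of how far the binary log has been applied.
struct Binlog_position {
  std::string file;      // binary log file name
  unsigned long offset;  // byte offset within that file
};

// Writes pos to path through a temporary file followed by a rename, so a
// crash in the middle of a write leaves the previous checkpoint in place.
bool save_position(const std::string &path, const Binlog_position &pos) {
  const std::string tmp = path + ".tmp";
  {
    std::ofstream out(tmp.c_str(), std::ios::trunc);
    if (!out) return false;
    out << pos.file << ' ' << pos.offset << '\n';
    out.flush();
    if (!out) return false;
  }  // the stream is closed here, before the rename
  return std::rename(tmp.c_str(), path.c_str()) == 0;
}

// Reads a checkpoint written by save_position. Returns false when no usable
// checkpoint exists; the caller would then start from the first log.
bool load_position(const std::string &path, Binlog_position *pos) {
  std::ifstream in(path.c_str());
  if (!in) return false;
  Binlog_position loaded;
  if (!(in >> loaded.file >> loaded.offset)) return false;
  *pos = loaded;
  return true;
}
```

      On startup, a loaded position would be handed to whichever Binlog API call accepts a starting file and offset; the checkpoint would be saved again after each batch of rows has been flushed to HDFS.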


  20. Greetings,
    when I run the "make" command I get the following error:

    mysql-hadoop-applier-0.1.0/src/tcp_driver.cpp:41:25: fatal error: openssl/evp.h: No such file or directory
    compilation terminated.
    make[2]: *** [src/CMakeFiles/replication_static.dir/tcp_driver.cpp.o] Error 1
    make[1]: *** [src/CMakeFiles/replication_static.dir/all] Error 2
    make: *** [all] Error 2

    I have tried installing libssl-dev with no luck (some people fixed similar problems with this lib).

    Can you help me with this error?

    Thank you.

  21. Hi,

    IMHO installing libssl-dev should solve the above problem, but if it's not working, you can just comment out these two lines in tcp_driver.cpp:
    #include <openssl/evp.h>
    #include <openssl/rand.h>

    Then try compiling your code. These header files were used before but are no longer required; they will be removed from the code in the next release.

    1. Thanks for the answer.
      I solved the problem by reinstalling libssl-dev.

      But now I'm stuck at the make happlier step. I get this error:

      Linking CXX executable happlier
      CMakeFiles/happlier.dir/hdfs_schema.cpp.o: In function `HDFSSchema::HDFSSchema(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)':
      hdfs_schema.cpp:(.text+0xa0): undefined reference to `hdfsConnect'
      hdfs_schema.cpp:(.text+0xdb): undefined reference to `hdfsConnectAsUser'
      CMakeFiles/happlier.dir/hdfs_schema.cpp.o: In function `HDFSSchema::~HDFSSchema()':
      hdfs_schema.cpp:(.text+0x2d0): undefined reference to `hdfsDisconnect'
      CMakeFiles/happlier.dir/hdfs_schema.cpp.o: In function `HDFSSchema::HDFS_data_insert(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*)':
      hdfs_schema.cpp:(.text+0x4a3): undefined reference to `hdfsSetWorkingDirectory'
      hdfs_schema.cpp:(.text+0x524): undefined reference to `hdfsExists'
      hdfs_schema.cpp:(.text+0x55a): undefined reference to `hdfsOpenFile'
      hdfs_schema.cpp:(.text+0x58d): undefined reference to `hdfsOpenFile'
      hdfs_schema.cpp:(.text+0x5fc): undefined reference to `hdfsWrite'
      hdfs_schema.cpp:(.text+0x680): undefined reference to `hdfsFlush'
      hdfs_schema.cpp:(.text+0x6d5): undefined reference to `hdfsCloseFile'
      collect2: ld returned 1 exit status
      make[3]: *** [examples/mysql2hdfs/happlier] Error 1
      make[2]: *** [examples/mysql2hdfs/CMakeFiles/happlier.dir/all] Error 2
      make[1]: *** [examples/mysql2hdfs/CMakeFiles/happlier.dir/rule] Error 2
      make: *** [happlier] Error 2

      Any idea what could be the problem?



    2. Hi Carlos,

      Thank you for trying the applier.

      From the errors, it seems that the applier is not able to find the shared library, ''.
      Which Hadoop version are you using? The library comes precompiled for 32-bit systems with Hadoop, but you need to compile it yourself for 64-bit systems.

      Please try locating it on your system (inside HADOOP_HOME) and make sure the path to it is in LD_LIBRARY_PATH.

      You may also check the contents of the file CMakeCache.txt to see where the applier is searching for the library.

      Hope that helps.

      Thank you,

  22. Hi,
    Great work! Could you please answer these questions?
    Does the Applier work with Hadoop 2.2.X?
    When will the Applier support updates and deletes?
    When will the Applier be ready for production systems?

  23. Hi Murali,

    Thanks for trying the applier.
    1) Yes, it works with Hadoop 2.2.X, but you might need to change the library and include paths in the FindHDFS.cmake file.
    2) We have considered adding update and delete, but there are no concrete plans yet.
    3) I am sorry but we have not decided on that yet.

  24. # A fatal error has been detected by the Java Runtime Environment:
    # SIGSEGV (0xb) at pc=0x0000003c6909c47d, pid=4932, tid=140660958218016
    # JRE version: OpenJDK Runtime Environment (7.0_55-b13) (build 1.7.0_55-mockbuild_2014_04_16_12_11-b00)
    # Java VM: OpenJDK 64-Bit Server VM (24.51-b03 mixed mode linux-amd64 compressed oops)
    # Problematic frame:
    # C [] std::string::compare(std::string const&) const+0xd

    To resolve it I had to change the log-bin configuration in my.cnf.

  25. Hi Shubhangi,

    BUG 71277 has been fixed in v0.2.0.
    Is that available now for download?

  26. Hi everyone,

    I am trying to understand and use the Hadoop Applier for my project. I ran through all the steps; however, I am having some problems. I don't have a strong software background in general, so my apologies in advance if something seems trivial. To make it easier, I will list all the steps concisely here, following the install and configure tutorial.

    Here is my system setup:

    Hadoop Applier package : mysql-hadoop-applier-0.1.0
    Hadoop version : 1.2.1
    Java version : 1.7.0_51
    libhdfs: present
    cmake: 2.8
    libmysqlclient: mysql-connector-c-6.1.3-linux-glibc2.5-x86_64
    gcc : 4.8.2
    MySQL Server: 5.6.17 (downloaded as source code, then cmake, make and install)
    FindHDFS.cmake: Downloaded online
    FindJNI.cmake: Already present in Cmake modules

    My env variables in bashrc are as follows:

    # JAVA HOME directory setup
    export JAVA_HOME="/usr/lib/java/jdk1.8.0_05"
    set PATH="$PATH:$JAVA_HOME/bin"

    export PATH

    export HADOOP_HOME="/home/srai/Downloads/hadoop-1.2.1"
    export PATH="$PATH:$HADOOP_HOME/bin"

    #Home Directiry configuration
    export HIVE_HOME="/usr/lib/hive"
    export "PATH=$PATH:$HIVE_HOME/bin"

    export MYSQL_DIR="/usr/local/mysql"
    export "PATH=$PATH:$MYSQL_DIR/bin"

    export PATH

    1) & 2) Hadoop is downloaded. I can run and stop all the hdfs and mapred daemons correctly. My hadoop version is 1.2.1. My $HADOOP_HOME environment variable is set in .bashrc file as "/home/srai/Downloads/hadoop-1.2.1"

    3) & 4) I downloaded a FindHDFS.cmake file and modified it according to the patch which was listed. I placed this file under the directory "/usr/share/cmake-2.8/Modules". I thought that if I placed it under the modules directory, CMAKE_MODULE_PATH would be able to find it. I am not sure if this is correct, or how and where to update CMAKE_MODULE_PATH in CMakeLists.txt.

    5) FindJNI.cmake was already present in the directory /usr/share/cmake-2.8/Modules, so I didn't change or modify it. My JAVA_HOME env variable is set in the bashrc file as "/usr/lib/java/jdk1.8.0_05". I didn't modify or touch LD_LIBRARY_PATH.

    6) Downloaded the Hadoop Applier and mysql-connector-c. Since the tutorial says " 'mysqlclient' library is required to be installed in the default library paths", I moved the files of mysql-connector-c to /usr/lib/mysql-connector-c. I also declared a variable $MYSQL_DIR to point to "/usr/local/mysql".

  27. I ran the cmake command on the parent directory of the Hadoop Applier source; however, I get errors. Below is the complete log:

    sudo cmake . -DENABLE_DOWNLOADS=1 mysql-hadoop-applier-0.1.0
    [sudo] password for srai:
    -- Tests from subdirectory 'tests' added
    Adding test test-basic
    Adding test test-transport
    CMake Warning at examples/mysql2lucene/CMakeLists.txt:3 (find_package):
    By not providing "FindCLucene.cmake" in CMAKE_MODULE_PATH this project has
    asked CMake to find a package configuration file provided by "CLucene", but
    CMake did not find one.

    Could not find a package configuration file provided by "CLucene" with any
    of the following names:


    Add the installation prefix of "CLucene" to CMAKE_PREFIX_PATH or set
    "CLucene_DIR" to a directory containing one of the above files. If
    "CLucene" provides a separate development package or SDK, be sure it has
    been installed.

    -- Architecture: x64
    -- HDFS_LIB_PATHS: /c++/Linux-amd64-64/lib
    -- HDFS includes and libraries NOT found.Thrift support will be disabled (, HDFS_INCLUDE_DIR-NOTFOUND, HDFS_LIB-NOTFOUND)
    CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
    Please set them or make sure they are set and tested correctly in the CMake files:
    used as include directory in directory /home/srai/Downloads/mysql-hadoop-applier-0.1.0/examples/mysql2hdfs
    linked by target "happlier" in directory /home/srai/Downloads/mysql-hadoop-applier-0.1.0/examples/mysql2hdfs
    linked by target "happlier" in directory /home/srai/Downloads/mysql-hadoop-applier-0.1.0/examples/mysql2hdfs

    -- Configuring incomplete, errors occurred!
    See also "/home/srai/Downloads/CMakeFiles/CMakeOutput.log".

    I don't understand how cmake is using my environment variables to find these files. As I mentioned, I am a newbie, so if someone can help me compile the Hadoop Applier I will really appreciate it.


  28. Hi Suleman,

    Thank you for the detailed message, and thank you for trying out the applier!

    Everything looks fine except for one issue.
    The error is that cmake is unable to find the libraries correctly.
    The HDFS_LIB_PATHS is set as "/c++/Linux-amd64-64/lib", but it should be

    This implies that the variable HADOOP_HOME is not set, on the terminal where you are running cmake.

    Before executing the cmake command can you run
    echo $HADOOP_HOME
    and see that the output is
    /home/srai/Downloads/hadoop-1.2.1 ?

    Hope that helps. Please notify us in case you are still having an error.

    Thank you,

  29. Hi Shubhangi,

    Thank you for your response.

    I double-checked my HADOOP_HOME path by running echo $HADOOP_HOME, and the output is /home/srai/Downloads/hadoop-1.2.1.

    I am not sure why $ENV{HADOOP_HOME}/src/c++/libhdfs/ does not prefix my hadoop home to this path. Can it be because CMAKE_MODULE_PATH cannot find FindHDFS.cmake and FindJNI.cmake? I put the FindHDFS.cmake in the modules under /usr/share/cmake-2.8/Modules and FindJNI.cmake was already there. Also I don't define or use LD_LIBRARY_PATH anywhere.

    I modified FindHDFS.cmake as suggested in step 4 of the tutorial. This might seem silly; however, I am not sure what this means or where this modification will go:

    --- a/cmake_modules/FindHDFS.cmake
    +++ b/cmake_modules/FindHDFS.cmake

    Also, if you can elaborate a bit more on steps 7 and 8, I will really appreciate it.


    1. Hi,

      No, cmake is able to find FindHDFS.cmake and FindJNI.cmake, hence you are getting the suffix (src/c++/libhdfs). You have put it in the correct place.

      Not defining LD_LIBRARY_PATH is fine, it will not be a problem.

      The modification
      --- a/cmake_modules/FindHDFS.cmake
      +++ b/cmake_modules/FindHDFS.cmake
      is just an indication that you have to modify the file FindHDFS.cmake. It need not go anywhere.

      Steps 7 and 8:
      1. Run these two commands on the terminal:
      export PATH=$HADOOP_HOME/bin:$PATH
      export CLASSPATH=$(hadoop classpath)

      2. Run the command 'make happlier'.

      3. If the above command gives an error, modify LD_LIBRARY_PATH.
      export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/your path to the library libhdfs

      4. cd /home/srai/Downloads/mysql-hadoop-applier-0.1.0/examples/mysql2hdfs
      5. ./happlier mysql://root@ hdfs://localhost:9000

      Hope that helps!


    2. Hi Shubhangi,

      Thanks for your response. I was able to run the 'cmake' command on the parent directory of the Hadoop Applier source. In my case I ran "sudo cmake . -DENABLE_DOWNLOADS=1 mysql-hadoop-applier-0.1.0" and then "make happlier" from the terminal.

      I got the following output:

      -- Architecture: x64
      -- HDFS_LIB_PATHS: /c++/Linux-amd64-64/lib
      -- sh: 1: hadoop: not found
      -- HDFS_INCLUDE_DIR: /home/srai/Downloads/hadoop-1.2.1/src/c++/libhdfs
      -- HDFS_LIBS: /home/srai/Downloads/hadoop-1.2.1/c++/Linux-amd64-64/lib/
      -- JNI_INCLUDE_DIRS=/usr/lib/java/jdk1.8.0_05/include;/usr/lib/java/jdk1.8.0_05/include/linux;/usr/lib/java/jdk1.8.0_05/include
      -- JNI_LIBRARIES=/usr/lib/java/jdk1.8.0_05/jre/lib/amd64/;/usr/lib/java/jdk1.8.0_05/jre/lib/amd64/server/
      -- Configuring done
      -- Generating done
      -- Build files have been written to: /home/srai/Downloads

      srai@ubuntu:~/Downloads$ make happlier
      Built target replication_static
      Built target happlier

      I then do : export PATH=$HADOOP_HOME/bin:$PATH
      export CLASSPATH=$(hadoop classpath)

      change my working directory to /home/srai/Downloads/mysql-hadoop-applier-0.1.0/examples/mysql2hdfs

      However, when I try to run

      ./happlier --field-delimiter=, mysql://root@ hdfs://localhost:9000 (taken from your video)

      I get bash: ./happlier: No such file or directory

      When running make happlier, it says "Built target happlier", so I don't know why it's not being generated. Can you provide some insight?


    3. Hi,

      Can you check if happlier is at the following location: /home/srai/Downloads/examples/mysql2hdfs?

      $ cd /home/srai/Downloads/examples/mysql2hdfs
      $ ./happlier --field-delimiter=, mysql://root@ hdfs://localhost:9000

      Hope that Helps,