Apache Sqoop – Data Lake for Enterprises Book

Apache Sqoop is one of the primary frameworks for data acquisition in the Hadoop ecosystem and has long been the dominant tool for this capability. It is one of the main technologies used to transfer data between Hadoop and structured data stores such as relational databases (RDBMS) and traditional data warehouses. Hadoop on its own does not integrate easily with these traditional stores, and Sqoop makes that integration straightforward. Sqoop supports bulk transfer of data from these stores and also integrates easily with Hadoop-based systems like Apache Oozie, Apache HBase and Apache Hive.
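
As a rough illustration of such a bulk transfer, the sketch below assembles a `sqoop import` command line in Python without running it. The flags used (`--connect`, `--username`, `--password-file`, `--table`, `--target-dir`, `--num-mappers`, `--split-by`) are options of the Sqoop 1 import tool; the JDBC URL, user, paths and table name are hypothetical placeholders, and the helper `build_sqoop_import` is made up for this example.

```python
import shlex


def build_sqoop_import(jdbc_url, username, password_file, table,
                       target_dir, num_mappers=4, split_by=None):
    """Return the argument list for a `sqoop import` invocation.

    Nothing is executed here; the list can be passed to subprocess.run()
    on a machine where the Sqoop client is installed.
    """
    argv = [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC URL of the source database
        "--username", username,
        "--password-file", password_file,  # file that holds the password
        "--table", table,                  # source table to import
        "--target-dir", target_dir,        # HDFS directory for the output
        "--num-mappers", str(num_mappers), # number of parallel map tasks
    ]
    if split_by is not None:
        # Column used to divide the table between the map tasks
        argv += ["--split-by", split_by]
    return argv


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    cmd = build_sqoop_import(
        jdbc_url="jdbc:mysql://db.example.com/sales",
        username="etl_user",
        password_file="/user/etl/.db_password",
        table="orders",
        target_dir="/data/lake/raw/orders",
        num_mappers=4,
        split_by="order_id",
    )
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems if it is later handed to `subprocess.run`.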

Apache Sqoop can serve many of the data transfer requirements in a Data Lake that uses HDFS as the main storage for data arriving from various systems. The points below list some of the cases where Apache Sqoop makes the most sense:

  • For regular batch and micro-batch transfers of data between an RDBMS and Hadoop (HDFS/Hive/HBase), use Apache Sqoop. It is one of the main and most widely used technologies in the data acquisition layer.
  • To transfer data from NoSQL data stores such as MongoDB and Cassandra into the Hadoop file system (HDFS), where a suitable connector is available.
  • For enterprises with a large number of applications whose stores are based on RDBMS, Sqoop is one of the best options for transferring data into the Data Lake.
  • Hadoop is a de facto standard for storing massive volumes of data, and Sqoop makes it easy to transfer data into HDFS from traditional databases.
  • Use Sqoop when batch processing is acceptable and performance is required, as it can split a data transfer and run the parts in parallel.
  • Sqoop has the concept of connectors, so if your enterprise has diverse business applications with different data stores, Sqoop is an ideal choice.
Figure: Capability of Apache Sqoop in a Data Lake
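
The point above about splitting and parallelizing a transfer can be sketched as follows. This is a simplified illustration of the idea, not Sqoop's actual implementation: it assumes an integer split column whose minimum and maximum values are already known, and divides that range into contiguous chunks, one per parallel task. The function names and the column `order_id` are hypothetical.

```python
def split_ranges(min_val, max_val, num_splits):
    """Divide the inclusive range [min_val, max_val] into contiguous chunks.

    Returns a list of (start, end) tuples, both ends inclusive. Chunk sizes
    differ by at most one, and no more chunks are produced than there are
    distinct values in the range.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    if max_val < min_val:
        raise ValueError("max_val must not be smaller than min_val")

    total = max_val - min_val + 1
    num_splits = min(num_splits, total)
    base, extra = divmod(total, num_splits)

    ranges = []
    start = min_val
    for i in range(num_splits):
        # The first `extra` chunks take one additional value each
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def where_clauses(column, ranges):
    """Turn each range into the WHERE condition one task would use."""
    return [f"{column} >= {lo} AND {column} <= {hi}" for lo, hi in ranges]


if __name__ == "__main__":
    chunks = split_ranges(1, 100, 4)
    print(chunks)  # [(1, 25), (26, 50), (51, 75), (76, 100)]
    for clause in where_clauses("order_id", chunks):
        print(clause)
```

An even split like this only balances the work when the values of the split column are spread fairly uniformly; a heavily skewed column would leave some tasks with far more rows than others.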

Chapter 5 of the book “Data Lake for Enterprises” covers both the theoretical and the coding aspects of Apache Sqoop in the context of developing an enterprise-grade Data Lake.

More details on the book can be found here.

Tomcy John


Blogger & Author at javacodebook
He is an Enterprise Java Specialist holding a degree in Engineering (B-Tech) with over 10 years of experience in several industries. He has been working as Principal Architect at Emirates Group IT since 2005. Prior to this he worked with Oracle Corporation and Ernst & Young. He specializes in various web technologies and acts as chief mentor and architect, helping the organization adopt Spring as its corporate standard.