Chapter 6 in the book “Data Lake for Enterprises” aims to cover another technology being used in the Data Acquisition layer namely Apache Flume. After reading this chapter you will have clear idea on Flume usage in the architecture and also would have gained enough details on full working of Flume. You would also have hands on working with Flume and would also have progressed further in our journey to implement Data Lake and realize the Single Customer View (SCV) use case.
Stream data are the data which are generated by a variety of business application and external application (these days almost all social sites) continuously and in fast pace, usually having a small payload. These are real time data which comes one after the other and makes sense when processed in a sequential manner. For an enterprise analysing these data and then responding appropriately can be a business model and this can indeed transform their way of working. Looking at these data in real time fashion and then personalizing according to customer needs can indeed be very rewarding for the customer, but will also bring financial gains to the business and can also increase customer experience (intangible benefits).
Conceptual view of working of Flume is as shown in the below figure.
Apache Flume is a very important component in our Data Lake implementation and the main difference between Sqoop and Flume is as shown in the figure below.
Below figure shows how an advanced Flume architecture would look like in purview of a Data Lake for an enterprise.
Apache Sqoop is the one of the primary frameworks which has been widely used as it is a part of Hadoop ecosystem and has been very dominant for this capability. Apache Sqoop is one of the main technologies used to transfer data to and from structured data stores such as RDBMS and traditional data warehouses to Hadoop. Apache Hadoop finds it very hard to talk to these traditional stores and Sqoop helps to do that integration very easily. Sqoop helps in bulk transfer of data from these stores also integrates easily with Hadoop based systems like Apache Oozie, Apache HBase and Apache Hive.
Apache Sqoop could be employed for many of the data transfer requirement in a Data Lake, which does have HDFS as the main data storage for incoming data from various systems. Below points gives some of the cases where Apache Sqoop makes more sense:
For regular batch and micro-batch to transfer data to and from RDBMS to Hadoop (HDFS/Hive/HBase), use Apache Sqoop. Apache Sqoop is one of the main and widely used technology in the data acquisition layer.
For transferring data from NoSQL data stores like MongoDB and Cassandra into Hadoop file system.
Enterprises having good amount of applications whose stores as based on RDBMS, Sqoop is a best option to transfer data into Data Lake.
Hadoop is a de-facto standard for storing massive data. Sqoop allows to transfer data easily into HDFS from traditional database with ease.
Use Sqoop when batch processing is acceptable and performance is required as it is able to split and parallelize data transfer.
Sqoop has concept of connectors and if your enterprise has diverse business applications with different data stores, Sqoop is an ideal choice.