Tag Archives: Data Lake

Apache Flume – Data Lake for Enterprises Book

Chapter 6 of the book “Data Lake for Enterprises” covers another technology used in the Data Acquisition layer, namely Apache Flume. After reading this chapter you will have a clear idea of where Flume fits in the architecture and a detailed understanding of how it works. You will also have worked hands-on with Flume and progressed further in our journey to implement a Data Lake and realize the Single Customer View (SCV) use case.

Stream data is data generated continuously and at a fast pace by a variety of business applications and external applications (these days almost all social sites), usually with a small payload. It is real-time data that arrives one event after another and makes sense when processed sequentially. For an enterprise, analysing this data and responding appropriately can be a business model in itself and can transform its way of working. Looking at this data in real time and personalizing according to customer needs can be very rewarding for the customer, while also bringing financial gains to the business and improving customer experience (an intangible benefit).

The figure below gives a conceptual view of how Flume works.

Conceptual view of working of Flume
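
Since the figure itself is not reproduced here, the flow can be approximated in a few lines of Python. Flume's documentation describes an agent as a source that receives events, a channel that buffers them, and a sink that delivers them onward. The sketch below only imitates that flow in memory; `Channel`, `run_agent`, and the sample events are hypothetical names chosen for illustration and are not part of Flume.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List


class Channel:
    """Bounded in-memory buffer between a source and a sink (illustrative only)."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._events: Deque[str] = deque()

    def put(self, event: str) -> bool:
        """Buffer one event; return False when the channel is full."""
        if len(self._events) >= self.capacity:
            return False
        self._events.append(event)
        return True

    def take_batch(self, size: int) -> List[str]:
        """Remove and return up to `size` events in arrival order."""
        batch: List[str] = []
        while self._events and len(batch) < size:
            batch.append(self._events.popleft())
        return batch


def _drain(channel: Channel, sink: Callable[[List[str]], None], batch_size: int) -> int:
    """Hand one batch to the sink and return how many events it contained."""
    batch = channel.take_batch(batch_size)
    if batch:
        sink(batch)
    return len(batch)


def run_agent(
    source_events: Iterable[str],
    channel: Channel,
    sink: Callable[[List[str]], None],
    batch_size: int,
) -> int:
    """Move every source event through the channel to the sink, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    delivered = 0
    for event in source_events:
        # When the channel is full, let the sink drain a batch before retrying.
        while not channel.put(event):
            delivered += _drain(channel, sink, batch_size)
    # Flush whatever is still buffered once the source is exhausted.
    while True:
        count = _drain(channel, sink, batch_size)
        if count == 0:
            break
        delivered += count
    return delivered
```

For example, passing a list's `extend` method as the sink collects the events in the same order in which the source produced them, which matches the sequential processing that stream data needs.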

Apache Flume is a very important component in our Data Lake implementation. The main difference between Sqoop and Flume is shown in the figure below.

Sqoop and Flume

The figure below shows what an advanced Flume architecture would look like in the context of a Data Lake for an enterprise.

Advanced Flume Architecture

More details on the book can be found here.

If you like the post, please share it on as many social channels as possible and help spread the word… 🙂

Thanks in advance

One of the co-authors of the book “Data Lake for Enterprises”.


Apache Sqoop – Data Lake for Enterprises Book

Apache Sqoop is one of the primary frameworks for data transfer in the Hadoop ecosystem and has been widely used and very dominant for this capability. It is one of the main technologies used to transfer data between Hadoop and structured data stores such as RDBMSs and traditional data warehouses. Apache Hadoop finds it very hard to talk to these traditional stores, and Sqoop makes that integration easy. Sqoop helps with bulk transfer of data from these stores and also integrates easily with Hadoop-based systems like Apache Oozie, Apache HBase and Apache Hive.

Apache Sqoop can be employed for many of the data transfer requirements in a Data Lake that has HDFS as the main storage for incoming data from various systems. The points below list some of the cases where Apache Sqoop makes the most sense:

  • For regular batch and micro-batch transfers of data between an RDBMS and Hadoop (HDFS/Hive/HBase), use Apache Sqoop. It is one of the main and most widely used technologies in the data acquisition layer.
  • For transferring data from NoSQL data stores like MongoDB and Cassandra into the Hadoop file system.
  • For enterprises with a good number of applications whose stores are based on RDBMS, Sqoop is one of the best options for transferring data into the Data Lake.
  • Hadoop is a de facto standard for storing massive data. Sqoop makes it easy to transfer data into HDFS from traditional databases.
  • Use Sqoop when batch processing is acceptable and performance is required, as it is able to split and parallelize data transfer.
  • Sqoop has a concept of connectors, so if your enterprise has diverse business applications with different data stores, Sqoop is an ideal choice.

Figure: Capability of Apache Sqoop in a Data Lake
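
The point above about splitting and parallelizing a transfer can be illustrated with a small Python sketch. Sqoop's user guide describes dividing the range between the minimum and maximum values of a split column among the parallel map tasks (the `--split-by` and `--num-mappers` options). The code below is only an approximation of that idea and is not Sqoop's actual splitter; `split_ranges`, `range_predicates`, and the `customer_id` column are hypothetical names used for illustration.

```python
from typing import List, Tuple


def split_ranges(min_id: int, max_id: int, num_mappers: int) -> List[Tuple[int, int]]:
    """Divide the inclusive range [min_id, max_id] into contiguous inclusive
    sub-ranges of near-equal size, one per parallel task."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if max_id < min_id:
        raise ValueError("max_id must not be smaller than min_id")
    total = max_id - min_id + 1
    # Never create more ranges than there are key values to cover.
    parts = min(num_mappers, total)
    base, extra = divmod(total, parts)
    ranges: List[Tuple[int, int]] = []
    start = min_id
    for i in range(parts):
        # The first `extra` ranges take one additional key each.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def range_predicates(column: str, ranges: List[Tuple[int, int]]) -> List[str]:
    """Render each range as a WHERE-clause fragment that one task could run."""
    return [f"{column} >= {lo} AND {column} <= {hi}" for lo, hi in ranges]


if __name__ == "__main__":
    # Hypothetical table with customer_id values from 1 to 100, read by 4 tasks.
    for predicate in range_predicates("customer_id", split_ranges(1, 100, 4)):
        print(predicate)
```

Because the ranges do not overlap and together cover every key, each task can read its slice independently. This also shows why an evenly distributed split column matters: a skewed column would leave some tasks with far more rows than others.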

Chapter 5 of the book “Data Lake for Enterprises” covers both the theoretical and coding aspects of Apache Sqoop in the context of developing an enterprise-grade Data Lake.

More details on the book can be found here.
