He is an Enterprise Java Specialist holding a degree in Engineering (B-Tech) with over 10 years of experience in several industries. He's currently working as Principal Architect at Emirates Group IT since 2005. Prior to this he has worked with Oracle Corporation and Ernst & Young. His main specialization is on various web technologies and acts as chief mentor and Architect to facilitate incorporating Spring as Corporate Standard in the organization.
View all posts by Tomcy John →
Recently I had a chance to read more on so called “Pyramid Principle”. Thanks to my mentor who wanted to discuss this very topic in my next mentoring session.
When i heard this topic for the first time, to be honest i was thinking of hierarchy in an organization which is often attributed to be of pyramid structure.
When i searched in Google, i got some very good pointers on what exactly this is and that moment itself i thought i should write a quick 100 word blog post on what i understood on this topic for my fellow colleagues.
Ok, coming to the point, the concept “Pyramid Principle” refers to an approach by which you can communicate something to someone, in a more methodical, concise and adoptable (yes, something which others are more happy to adopt) fashion.
Usually these kind of principle (complex as everyone would say.. :)) is employed to higher management who, in general doesn’t have much time with them. To be fair to them, they do process large amount of data and is entrusted to take top decisions under very little time (yes, that’s why they earn more money.. :)).
When you want to communicate something, usually follow the steps as below (advocated by Pyramid Principle):
Start with the answer (yes, your first slide can be the answer itself which the management has asked from you). This is often against your usual way of communication, as in the past, you give facts and figures first and then come to a particular conclusion. Yes, reverse the approach for you to be heard and accomplish what you would like to communicate. If the person whom you are communicating, has already parsed good amount of information in past on similar topic, just the first slide would be good enough for him/her to take a decision and move on.
After giving the answer, now its time to group things together and get into a bit more detail with very high quality facts and figures. The best way to get your higher management to listen to you is that, the facts should be grouped and it should not be more than three groups (just a rule which has seen success in past – scientifically proven as you can say). Grouping can be done in many different ways and in general can be classified as:
Time based – convey according to how it happened
Rank based – higher to low rank
Structured – break the main one into three main parts and present it
I think now you know why is it called “Pyramid Principle”. If not, what i understood is, start by giving pointed answer to a question (top of the pyramid) and then drill down as needed with more and more detail. The base of the pyramid would contain more finer details with more figures and facts.
The question is, should you use this for higher management only? The answer is, no. It can be even used when you write a simple mail (reply to a question obviously). This can also be used while answering a question from your higher management or even from your colleagues.
Simple, concise and to the point answer is always appreciated. Its a sign of leaders and i would persuade you to practice it right away.
If you like this topic, please share by clicking on various options in this blog post and help spread this.
When reviewing a product technically and architecturally, what are the important aspects that you can think off is listed below (with my experience). The list is just my compilation and in no way exhaustive. It also is not very structurally arranged but these aspects are quite important when such a review is being conducted. If this is useful information that you are looking for, please comment and i will make sure to expand each item more in detail, either as a new blog post or keep adding additional points in this same blog.
Chapter 6 in the book “Data Lake for Enterprises” aims to cover another technology being used in the Data Acquisition layer namely Apache Flume. After reading this chapter you will have clear idea on Flume usage in the architecture and also would have gained enough details on full working of Flume. You would also have hands on working with Flume and would also have progressed further in our journey to implement Data Lake and realize the Single Customer View (SCV) use case.
Stream data are the data which are generated by a variety of business application and external application (these days almost all social sites) continuously and in fast pace, usually having a small payload. These are real time data which comes one after the other and makes sense when processed in a sequential manner. For an enterprise analysing these data and then responding appropriately can be a business model and this can indeed transform their way of working. Looking at these data in real time fashion and then personalizing according to customer needs can indeed be very rewarding for the customer, but will also bring financial gains to the business and can also increase customer experience (intangible benefits).
Conceptual view of working of Flume is as shown in the below figure.
Apache Flume is a very important component in our Data Lake implementation and the main difference between Sqoop and Flume is as shown in the figure below.
Below figure shows how an advanced Flume architecture would look like in purview of a Data Lake for an enterprise.
Apache Sqoop is the one of the primary frameworks which has been widely used as it is a part of Hadoop ecosystem and has been very dominant for this capability. Apache Sqoop is one of the main technologies used to transfer data to and from structured data stores such as RDBMS and traditional data warehouses to Hadoop. Apache Hadoop finds it very hard to talk to these traditional stores and Sqoop helps to do that integration very easily. Sqoop helps in bulk transfer of data from these stores also integrates easily with Hadoop based systems like Apache Oozie, Apache HBase and Apache Hive.
Apache Sqoop could be employed for many of the data transfer requirement in a Data Lake, which does have HDFS as the main data storage for incoming data from various systems. Below points gives some of the cases where Apache Sqoop makes more sense:
For regular batch and micro-batch to transfer data to and from RDBMS to Hadoop (HDFS/Hive/HBase), use Apache Sqoop. Apache Sqoop is one of the main and widely used technology in the data acquisition layer.
For transferring data from NoSQL data stores like MongoDB and Cassandra into Hadoop file system.
Enterprises having good amount of applications whose stores as based on RDBMS, Sqoop is a best option to transfer data into Data Lake.
Hadoop is a de-facto standard for storing massive data. Sqoop allows to transfer data easily into HDFS from traditional database with ease.
Use Sqoop when batch processing is acceptable and performance is required as it is able to split and parallelize data transfer.
Sqoop has concept of connectors and if your enterprise has diverse business applications with different data stores, Sqoop is an ideal choice.
In this article by Tomcy John, Pankaj Misra, the authors of the book, Data Lake For Enterprises, we will learn about how the data in landscape of Big Data solutions can be made in near real time and certain practices that can be adopted for realizing Lambda Architecture in context of Data Lake.
The concept of a Data Lake in an enterprise was driven by certain challenges that Enterprises were facing with the way the data was handled, processed and stored. Initially all the individual applications in the Enterprise, via a natural evolution cycle, started maintaining huge amounts of data into themselves with almost no reuse to other applications in the same Enterprise. These created information silos across various applications. As the next step of evolution, these individual applications started exposing this data across the organization as a data mart access layer over central data warehouse. While Data Mart solved one part of the problem, other problems still persisted. These problems were more about data governance, data ownership, data accessibility which were required to be resolved so as to have better availability of enterprise relevant data. This is where a need was felt to have Data Lakes, that could not only make such data available but also could store any form of data and process it so that data is analyzed and kept ready for consumption by consumer applications. In this chapter we will look at some of the criticals aspects of a Data Lake and understand why does it matter for an Enterprise.
If we need to define the term Data Lake, it can be defined as a vast repository of variety of enterprise wide raw information that can be acquired, processed, analyzed and delivered. The information thus handled could be any type of information ranging from structured, semi-structured data to completely unstructured data. Data Lake is expected to be able to derive Enterprise relevant meaning and insights from this information using various analysis and machine learning algorithms.
Lambda Architecture and Data Lake
Lambda Architecture as a pattern provides ways and means to perform highly scalable, performant, distributed computing on large sets of data and yet provide consistent (eventually) data with required processing both in batch as well as in near real time. Lambda architecture defines ways and means to enable scale out architecture across various data load profiles in an enterprise, with low latency expectations.
The architecture pattern became significant with the emergence of big data and enterprise’s focus on real-time analytics and digital transformation. The pattern named Lambda (symbol λ) is indicative of a way by which data comes from two places (Batch and Speed – the curved parts of the Lambda Symbol) which then combines and served through the serving layer (the line merging from the curved part)
The main layers constituting the Lambda layer are shown below.
In the above high level representation, data is fed to both the batch and speed layer. The batch layer keeps producing and re-computing views at every set batch interval. The speed layer also creates the relevant real-time/ speed views. The serving layer orchestrates the query by querying both the batch and speed layer, merges it and sends the result back as results.
A practical realization of such a Data Lake, can be illustrated as shown below. The figure below shows multiple technologies used for such a realization, however once the data is acquired from multiple sources and queued in messaging layer for ingestion, the Lambda architecture pattern in form of of ingestion layer, batch layer and speed layer springs into action.
Figure 03: Layers in Data Lake
Data Acquisition Layer
In an organization, data exists in various forms which can be classified as structured data, semi-structured data or as unstructured data.
One of the key roles expected from the acquisition layer is to be able convert the data into messages that can be further processed in a data lake, hence the acquisition layer is expected to be flexible to accommodate variety of schema specifications at the same time must have a fast connect mechanism to seamlessly push all the translated data messages into the data lake. A typical flow can be represented as shown below.
Figure 04: Data Acquisition Layer
The messaging layer would form the Message Oriented Middleware (MOM) for the data lake architecture, and hence would be the primary layer for decoupling the various layers with each other, but with guaranteed delivery of messages.
The other aspect of a messaging layer is its ability to enqueue and dequeue messages, as is the case with most of the messaging frameworks. Most of the messaging frameworks provide enqueue and dequeue mechanisms to manage publishing and consumption of messages respectively. Every messaging frameworks provides its own set of libraries to connect to its resources(queues/topics).
Figure 05: Message Queue
Additionally the messaging layer also can perform the role of data stream producer which can converted the queued data into continuous streams of data which can be passed on for data ingestion.
Data Ingestion Layer
A fast ingestion layer is one of the key layers in Lambda Architecture pattern. This layer needs to ensure how fast can data be delivered into working models of Lambda Architecture. The data ingestion layer is responsible for consuming the messages from the messaging layer and perform the required transformation for ingesting them into the lambda layer (batch and speed layer) such that the transformed output conforms to the expected storage or processing formats.
Figure 06: Data Ingestion Layer
The batch processing layer of Lambda architecture is expected to process the ingested data in batches so as to have optimum utilization of system resources, at the same time, long running operations may be applied to the data to ensure high quality of data output, which is also known as Modelled data. The conversion of raw data to a modelled data is the primary responsibility of this layer, wherein the modelled data is the data model which can be served by serving layers of Lambda architecture.
While Hadoop as a framework has multiple components that can process data as a batch, each data processing in Hadoop is a map reduce process. A Map and Reduce paradigm of process execution is not a new paradigm, rather it has been used in many application ever since mainframe systems came into existence. It is based on “Divide and Rule” and stems from the traditional multi-threading model. The primary mechanism here is to divide the batch across multiple processes and then combine/reduce output of all the processes into a single output.
Figure 07: Batch Processing
Speed (Near Real Time) Data Processing
This layer is expected to perform near real time processing on data received from ingestion layer. Since the processing is expected to be in near real time, such data processing will need to be quick, fast and efficient, with support and design for high concurrency scenarios and eventually consistent outcome. The real-time processing was often dependent on data like the look-up data and reference data, hence there was a need to have a very fast data layer such that any look-up or reference data does not adversely impact the real-time nature of the processing. Near real time data processing pattern is not very different from the way it is done in batch mode, but the primary difference being that the data is processed as soon as it is available for processing and does not have to be batched, as shown below.
Figure 08: Speed (Near Real Time) Processing
Data Storage Layer
The data storage layer is very eminent in the Lambda Architecture pattern as this layer defines the reactivity of the overall solution to the incoming event/data streams. The storage, in context of Lambda architecture driven data lake can be classified broadly into non-indexed and indexed data storage. Typically, the batch processing is performed on non-indexed data stored as data blocks for faster batch processing, while speed (near real time processing) is performed on indexed data which can be accessed randomly and supports complex search patterns by means of inverted indexes. Both of these models are depicted below.
Figure 09: Non-Indexed and Indexed Data Storage Examples
Lambda In Action
Once all the layers in lambda architecture have performed their respective roles, the data can be exported, exposed via services and can be delivered through other protocols from the data lake. This can also include merging the high quality processed output from batch processing with indexed storage, using technologies and frameworks, so as to provide enriched data for near real time requirements as well with interesting visualizations.
Figure 10: Lambda in action
In this article we have briefly discussed a practical approach towards implementing a Data Lake for Enterprises by leveraging Lambda architecture pattern
Disclaimer: I am one of the co-authors of this book. Its shameless promotion of our own work.
Data is becoming important for many enterprises and it has now become pivotal in many aspects. In fact, companies are transforming themselves with data at its core. This book will start by introducing data, its relevance to enterprises, and how enterprises can make use of data to transform digitally. To make use of data, enterprises need repositories, and in this modern age, these aren’t called data warehouses; instead they are called Data Lake.
As we can see today, we have good number of use cases that are leveraging big data technologies. The concept of a Data Lake existed for quite sometime, but recently it has been getting real traction in enterprises. This book gives a hands-on, full-fledged, working Data Lake using the latest big data technologies, following well-established architectural patterns.
The book will bring Data Lake and Lambda architecture together and help the reader to actually operationalize these in their own enterprise. It will introduce a number of Big Data technologies at a high level, but will not be an authoritative reference on any of these topics, as they are vast in nature and worthy of a book in itself.
As part of Microservices Hackathon (a technical hackathon), was fortunate enough to work with some like minded folks and with the help of an existing project in Github, was able to create and extend a full-fledged microservices based project using Spring Boot, Spring Cloud and obviously Netflix-OSS.
The project can be found in the below link location: