Data, data everywhere – Prediction One

This blog is a continuation of my previous one, which can be read here.

In this blog, similar to the first one, I continue my rant and, more importantly, predict some aspects which I believe will become the norm going forward.

I am sure I am not the only one thinking this way… but being a blogger myself, I thought I would just offload what I have in my mind… for my own reading when I grow old, so I can be proud that I once predicted this and now it's all happening 😀.

With regard to browser applications/websites, below is my prediction number one.


At the moment, many sites ask for your permission, or rather display a notice saying that they use cookies to store information. This will, and should, be extended to other storage mechanisms like local storage, session storage, IndexedDB and so on.

Most importantly, this will not be a one-time affair. What I mean is that today the notice is displayed just once, and when you say yes (grant permission), websites keep using that as an excuse to keep storing things. Today they don't state clearly what they are storing. Going forward, I expect a rule will be enforced requiring websites to convey to the user what is being stored, and this message will need to pop up every time any of the properties kept in these storage mechanisms changes.
I am sure it might turn into a big nuisance, but some people might not mind it. If a website wants you to say yes to everything and allow it permanently, it will use the tactic of flooding you with such prompts until you say "don't ask me for the next year". It's also important that every time there is a change, the prompt specifically says what is changing, what the old and new values are, and the reason for storing it. There should also be an option to disallow certain data if the user doesn't give the website owner permission.
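Just to make the idea concrete, here is a rough sketch (in plain JavaScript) of what such a consent-aware storage wrapper could look like. This is purely hypothetical, nothing like it exists or is mandated today, and the function and key names are made up for illustration:

```javascript
// Hypothetical sketch: a consent-aware wrapper around localStorage.
// Nothing like this is mandated today; names are made up for illustration.
function consentAwareSet(key, newValue, reason) {
  const oldValue = window.localStorage.getItem(key);
  if (oldValue === newValue) {
    return; // nothing changed, no need to bother the user
  }
  // Tell the user exactly what is changing, the old and new values,
  // and why the site wants to store it.
  const allowed = window.confirm(
    `This site wants to store "${key}".\n` +
    `Old value: ${oldValue}\nNew value: ${newValue}\n` +
    `Reason: ${reason}\nAllow?`
  );
  if (allowed) {
    window.localStorage.setItem(key, newValue);
  }
}

// Example usage:
consentAwareSet('theme', 'dark', 'Remember your preferred colour scheme');
```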

That's too much, isn't it?… Wait, this will be something useful going forward. People are indeed realizing the value of privacy policies, and they wouldn't mind (in my view).


I already have my second prediction lined up in my head. Soon I will try to offload that as well, in another follow-up blog post.

Thanks for reading. Keep sending your views.

Note: I am in no way disregarding the importance of a data lake for an enterprise, in which enterprise data is aggregated and collected in one place for various analyses, both real-time and in batch mode. My book, “Data Lake for Enterprises”, in fact advocates that. But I am against collecting huge amounts of data from customers and then using it for the company's own gain without properly educating the customer (a customer right, in my view). If they willingly accept transferring their data (the so-called privacy-related data), I am all in for that.


Data, data everywhere…feeling a bit uncomfortable

Recently one of my colleagues came to me and said that he had searched for something on a popular search engine, and after that everything he did online (browsing other sites, social media, etc.) seemed to know this and started showing content similar to what he had been searching for earlier.

Even though the site domain varied, other sites knew what he had searched for and started showing very personalized content (yes, I do know that if a site uses AdSense, Google would have already figured out what to show so that the user actually clicks on those advertisements). How is this possible? Do these sites, on different domains, share data with each other? Doesn't the rule that one domain shouldn't share anything with another hold good here? Isn't that very basic browser security?

Another colleague once told me about a similar incident in which she was looking for a piece of furniture. She had a clear picture in mind of what she wanted. She used image search on one of the popular search engines, but unfortunately couldn't find what she was looking for and gave up.

A few hours later she was browsing through some of the famous social media sites and BOOM: these sites started showing exactly the image she had been looking for. The exact piece of furniture.

Do these sites sell personal data to each other and earn money? 😀
In both incidents this can be seen in a positive sense, in that they were indeed getting content more relevant to what they were looking for.

BUT… they were both skeptical and fearful of how much each of those sites knows about you as a person.
Most of these sites capture a lot of data from you without your knowledge: the so-called behavioral data (what you browsed, when you browsed, which areas of the site you clicked, touched or even looked at), data about your machine (operating system, system details, etc.) and browser (which browser, version, which features are available and so on), along with data you have given full access to without knowing much about the privacy implications, such as location.
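To give a feel for how little effort this takes, here is a small plain-JavaScript example of the kind of machine and browser details any page can read silently, with no prompt at all. The properties used are standard browser APIs; how a real tracker combines and ships them off is, of course, my speculation:

```javascript
// A few of the machine/browser details any page can read silently,
// with no permission prompt at all.
const fingerprint = {
  userAgent: navigator.userAgent,             // browser and OS details
  language: navigator.language,               // preferred language
  platform: navigator.platform,               // operating system hint
  cpuCores: navigator.hardwareConcurrency,    // number of logical CPU cores
  screenSize: `${screen.width}x${screen.height}`,
  timezoneOffset: new Date().getTimezoneOffset() // rough location hint
};
console.log(fingerprint);

// Location is different: the browser does ask once, but after you click
// "Allow" the site can keep reading it.
// navigator.geolocation.getCurrentPosition(pos => console.log(pos.coords));
```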

In the near future, I am sure these big sites could be consulted for a person's "good conduct certificate" (which sites you visit, at what time of day you browse, what your browsing traits are at different times, and so on). Looking at such data, these sites could also predict in advance whether someone has criminal tendencies or other traits that are very hard to read from someone's face; for example, that this person has recently started looking at some undesirable sites and has been searching for content that suggests certain negative traits.

The data collected never gets erased, even after you die, and could still be used and linked to your children's accounts to predict their behaviour and other personal characteristics. Even though I laugh while I write this, they could link parent and child accounts and state some characteristics of a kid well in advance: if the father shows criminal traits, the child could be predicted to show similar traits in the future. 🙂 Sorry, I am taking this too far.

Have I started to make you think? If so, my post is a success. Let me know your views.

I am going to write a few more posts on the same topic and also predict certain things which will become the norm going forward.

You would already know about cookie policies… 😀 Don't laugh…

Cookies are just one storage mechanism in the browser… heard of local storage… session storage… IndexedDB…?

No one asks for permission when they want to write to these storage mechanisms… you yourself have already given permission for them to write onto your disk… the so-called data which they will use later on… I am not really saying it's bad… but what's the point of a cookie policy if these aspects are not also regulated… I guess. Just a thought…
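To illustrate the point, here is a small plain-JavaScript sketch of how a page can quietly write to all of these storage mechanisms; no prompt is shown to the user for any of it. The key and database names are made up for illustration:

```javascript
// Any script on the page can do this without asking you anything.
window.localStorage.setItem('visitorId', 'abc-123');      // survives browser restarts
window.sessionStorage.setItem('lastSearch', 'furniture'); // lives for the tab session

// IndexedDB: a full client-side database, again with no prompt.
const request = window.indexedDB.open('trackingDb', 1);
request.onupgradeneeded = () => {
  request.result.createObjectStore('events', { autoIncrement: true });
};
request.onsuccess = () => {
  const db = request.result;
  db.transaction('events', 'readwrite')
    .objectStore('events')
    .add({ page: location.pathname, at: Date.now() });
};
```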


If you would like to read some of the predictions, please follow the links below:

For Prediction One click here.

For Prediction Two click here.

For Prediction Three click here.

For Prediction Four click here.

A thought on browsers and their tracking can be read here.


Writing performant JavaScript – A Hacky thought using asm.js

Disclaimer: I haven't tried this myself. I got this weird thought while going through some blogs.
The question is: how do you write very performant Node.js (JavaScript) code?
Imagine you have a Node-based microservice that does exactly one thing, and does it in the most efficient fashion. Once written, such programs don't have to be tweaked much and constitute a piece of the framework around which other components work. If that sounds confusing, don't worry; just think of a scenario in which you would want to write the most performant Node code for your project.
Writing highly performant code is very difficult. One approach I can think of is generating the best possible JavaScript by converting well-written C or C++ code using Emscripten. Crazy?
Sort of.. 🙂
Let me explain…
Have you heard of asm.js? If not, it's a strict subset of JavaScript which is very efficient and performs close to native code. Generally it's not handwritten; rather, it is generated.
The C or C++ code you write (following all the best practices) is passed to Emscripten, which converts it into asm.js code. Sounds Greek?
OK, let me get into a bit more detail… First you write C or C++ code, and it is converted to so-called LLVM bytecode. Now you will ask: what on earth is LLVM? It started off years ago as the "Low Level Virtual Machine", but it now has so many sub-projects under its umbrella that it is no longer just what the abbreviation suggests. Using Clang you convert the C or C++ code to LLVM bytecode and then pass it on to Emscripten, which converts the LLVM bytecode into highly performant JavaScript code (asm.js).
So this can be thought of as an approach to writing highly performant Node (JavaScript) code which runs as a microservice.
If you feel confused at this stage, don't worry. Just grasp the important points below:
  • asm.js – a strict subset of JavaScript which is highly performant. At the end of the day, it's plain JavaScript.
  • LLVM – C or C++ code is converted to LLVM bytecode using Clang.
  • Emscripten – takes in LLVM bytecode and converts it into asm.js.
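Just to make these points a bit more concrete, below is a toy, hand-written module in the asm.js style. In practice you would not write this by hand; you would let Emscripten generate it from C or C++ as described above (for example via its emcc tool). Treat this only as a sketch of what such code looks like:

```javascript
// Toy example of the asm.js style. Real asm.js modules are normally
// generated by Emscripten from C/C++, not written by hand.
function FastMath(stdlib, foreign, heap) {
  "use asm"; // tells the engine this module follows the asm.js subset

  function add(a, b) {
    a = a | 0;            // type annotation: a is a 32-bit integer
    b = b | 0;            // type annotation: b is a 32-bit integer
    return (a + b) | 0;   // result is also a 32-bit integer
  }

  return { add: add };
}

// It is still plain JavaScript, so it runs anywhere:
const math = FastMath(globalThis, {}, new ArrayBuffer(0x10000));
console.log(math.add(2, 3)); // 5
```

Even in an engine that doesn't treat asm.js specially, this runs as ordinary JavaScript; engines that recognize the "use asm" directive can compile it ahead of time for near-native performance.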
Let me know your thoughts using the comment section.
If you feel this is quite a good thought, spread the word on various social networks by clicking the appropriate icons.
Shameless Advertisement/Promotion… 🙂
Would you like to read a book on Data Lakes (Big Data)? I am a co-author of a book named "Data Lake for Enterprises", published by Packt Publishing.
You can buy it on Amazon here.
If you would like to see more of what is in this book, please visit the book's dedicated website here.


Best way to persuade/communicate – Pyramid Principle

Recently I had a chance to read more on the so-called "Pyramid Principle". Thanks to my mentor, who wanted to discuss this very topic in my next mentoring session.

When I heard the term for the first time, to be honest I was thinking of the hierarchy in an organization, which is often described as a pyramid structure.

When I searched on Google, I got some very good pointers on what exactly it is, and at that moment I thought I should write a quick 100-word blog post on what I understood about this topic for my fellow colleagues.

OK, coming to the point: the "Pyramid Principle" refers to an approach by which you can communicate something to someone in a more methodical, concise and adoptable (yes, something which others are happier to adopt) fashion.

Usually this kind of principle (complex, as everyone would say.. :)) is employed with higher management, who in general don't have much time. To be fair to them, they do process large amounts of information and are entrusted with taking top-level decisions in very little time (yes, that's why they earn more money.. :)).

When you want to communicate something, follow the steps below (as advocated by the Pyramid Principle):

  • Start with the answer (yes, your first slide can be the answer itself, which the management has asked you for). This is often the opposite of the usual way of communicating, where you give facts and figures first and then come to a conclusion. Reverse the approach so that you are heard and accomplish what you want to communicate. If the person you are communicating with has already absorbed a good amount of information on a similar topic in the past, the first slide alone may be enough for them to take a decision and move on.
  • After giving the answer, it's time to group things together and get into a bit more detail with high-quality facts and figures. The best way to get higher management to listen to you is to group the facts, and into no more than three groups (just a rule of thumb which has seen success in the past). Grouping can be done in many different ways and in general can be classified as:
    • Time based – convey things in the order in which they happened
    • Rank based – from higher to lower rank
    • Structured – break the main point into three parts and present them

I think you now know why it is called the "Pyramid Principle". If not: what I understood is that you start by giving a pointed answer to a question (the top of the pyramid) and then drill down as needed into more and more detail. The base of the pyramid contains the finer details, with more figures and facts.

The question is, should you use this with higher management only? The answer is no. It can even be used when you write a simple mail (a reply to a question, obviously), or when answering a question from your higher management or even from your colleagues.

A simple, concise and to-the-point answer is always appreciated. It's a sign of a leader, and I would urge you to practice it right away.

If you like this topic, please share it by clicking the various options in this blog post and help spread the word.


Checklist when you are reviewing a product – technically and architecturally

When reviewing a product technically and architecturally, the important aspects to think of are listed below (from my experience). The list is just my compilation and in no way exhaustive. It is also not very structurally arranged, but these aspects are quite important when such a review is being conducted. If this is information you find useful, please comment and I will expand each item in more detail, either as a new blog post or by adding points to this same one.

  1. Technical Standards alignment
  2. Maintainability aspects (Architectural patterns)
  3. Code Review, Coding standards
  4. Documentation
    1. System Architecture (Architecture Documentation)
      1. Technology View (versions of all software)
        1. Logical architecture (Technically fully explained)
        2. Third party products used, if so Licensing details
      2. Data View
      3. Deployment View
      4. System component Interaction (Component diagram)
    2. Detailed Design Document
    3. Code Documentation
    4. Road Map (Software and technology used)
    5. Details of various exposed web services
    6. Details of other exposed interfaces
  5. Issue tracking system
    1. A dump is required; it gives you
      1. Project Health
      2. Various other metrics
  6. Basic SDLC followed
  7. Basic Configuration management followed
    1. Source Control
    2. Build mechanism
    3. Deployment mechanism
  8. Modularity of code
    1. OSGi capability (deploying, starting, restarting modules individually)
  9. Performance and availability
    1. Load testing data
    2. Typical deployment time
  10. Logging and Auditing
    1. Transaction auditing
    2. Transaction logging
  11. Non-Functional requirements
    1. Document detailing this
    2. Parameters considered
    3. Any drawbacks
  12. Security
    1. Aspects considered
  13. Architecture overview
    1. Various layers (Client layer, Protocol adapter layer, service layer, business service layer, persistence layer, external interface layer)
    2. Various technology used in each layer
    3. Presentation tier, business tier, database tier, enterprise storage
    4. Components (Functionality – Tools mapping)
      1. Persistence
      2. Transaction management
      3. Job Management
      4. Security
      5. Locking
      6. Audit
      7. Caching
      8. Logging
      9. Web Presentation
      10. Software Distribution
      11. Reports
      12. Health Check & Monitoring
    5. Interface and messaging
      1. Support (web Services, XML, Proprietary)
      2. Modes supported (Email, FTP, MQ, TIBCO)
    6. Connection pooling
    7. Encryption
    8. Performance
    9. Distributed DB
    10. DB backup mechanism
    11. Inter module communication
      1. Dependency, coupling and cohesion
    12. ESB
  14. Architecture framework
    1. Objectives
    2. Approach
    3. Principles
  15. Customization carried for each client
    1. How is source code for each client maintained
    2. Code customization and reuse
    3. Product stack
  16. Standard SDLC in case of complex business process which encompasses multiple components/modules
  17. How are different modules maintained
    1. Teams
    2. Team size
    3. Team composition
  18. Business validation
    1. Approach followed
    2. Declarative or code based
  19. Any commonly available standards used during design, e.g. IATA
  20. Can existing application be migrated to this product
    1. SDLC followed
    2. Steps carried out
  21. Integration of system with external legacy systems
    1. Strategy followed
    2. Interface design mechanism
  22. Does it support user preferences?
    1. Favorite screens
    2. Various defaults like date formats, time formats etc.
  23. Application level basic setup configurations
    1. Configuration based
    2. Code based
  24. Authentication and authorization
    1. Level of authorization
    2. Screen based and functionality based
    3. Screen opening in view only mode
    4. Editable based on user role
  25. Internationalization
  26. Workflow
    1. Technology used
  27. Emails
    1. Technology used
  28. Branding for various customers
    1. SDLC followed
    2. How much time does it take to do minimal brand changes?
    3. Can customers make brand changes on their own?
    4. How can various mails and other configurations (user agreements, disclaimers) be customized?
  29. Any content management system used?
  30. How is web session maintained?
    1. Offloaded to DB?
    2. Memory?
  31. Instant messenger support (web chat)
  32. Specific printers support (Dot matrix etc.)
  33. Barcode generation support
    1. Technology/third party software used
  34. How is various master data taken care of?
    1. External sources
    2. Internally maintained
    3. If external customers require data to be sourced from external sources, is it supported?
  35. Different types of data integration mechanism used
    1. Web Services
    2. DB links etc.
  36. Business intelligence capabilities
  37. Data purging mechanism used
    1. Strategy followed
    2. Operational & archive DB
  38. Application hosting models used
  39. Details of exposed web services
  40. Testing capabilities
  41. Integration with ESBs
  42. System exceptions, error handling and monitoring
    1. Exception classification
    2. Details available for debugging and root cause analysis
      1. User details
      2. Transaction details
      3. Severity
      4. Name of the server in clustered environment
      5. Transaction type – Asynchronous and synchronous
  43. Development environments – explain the process followed for:
    1. Test
    2. Stage
    3. Production
  44. Transaction metering (quantity)
  45. TPS and Response time monitoring
  46. Clustering capabilities (Session replication)
    1. Scalability
    2. High-availability
    3. Load balancing
    4. Failover
    5. Fault tolerance
      1. Oracle data grid
      2. Storage level replication
  47. System performance and scalability
  48. Load test methodology – Process used
    1. Smoke test – to understand system behavior
    2. Single instance stress test – to understand the first breaking point
    3. Load test – simulating real life usage
    4. Endurance test – Assess the behavior of the application over longer periods
    5. Application profiling – to understand root cause of the problems caused
  49. Application benchmarking – How is it done?
    1. Users
    2. Machines
    3. CPU utilization
    4. Statistics
    5. SQL statements per second
    6. Transactions per second
    7. Availability percentage
    8. Business transactions per month
  50. Usability considerations
  51. Encryption methodologies used
    1. One way encryption – default algorithm used?
    2. Symmetric (private key) encryption – default algorithm used?
    3. Asymmetric (public key-private key) encryption – default algorithm used?
  52. User authentication mechanisms
    1. Active directory
    2. Single sign on mechanisms
    3. LDAP
  53. Certified platforms
    1. Operating systems
    2. Servers
    3. DB
    4. Browsers
  54. Recommended deployment models available
  55. Approximate planned outage
    1. Time
    2. Process
  56. Roadmap and planned software migrations
  57. Lessons learnt – how are they documented?
  58. Bandwidth requirements
    1. Minimum
  59. Workstation configuration
    1. Minimum
  60. Source code maintenance
    1. Tools used
  61. Hosting models available
  62. Horizontal and vertical scaling capabilities
  63. Mobile support available?
  64. Rolling back of implemented delivery – Any process?
  65. Ensuring IT security standards – how is it achieved?
  66. For PCI-relevant solution components, are they certified according to PCI DSS?
  67. Training materials
    1. For train the trainer
  68. Disaster Recovery
    1. Deployment topologies
    2. Testing methodologies for testing DR


Apache Flume – Data Lake for Enterprises Book

Chapter 6 in the book “Data Lake for Enterprises” aims to cover another technology used in the data acquisition layer, namely Apache Flume. After reading this chapter you will have a clear idea of Flume's place in the architecture and enough detail on how Flume works end to end. You will also have hands-on experience with Flume and will have progressed further in our journey to implement the Data Lake and realize the Single Customer View (SCV) use case.

Streaming data is data generated continuously and at a fast pace by a variety of business applications and external applications (these days almost all social sites), usually with a small payload. It is real-time data which arrives one event after another and makes sense when processed sequentially. For an enterprise, analysing this data and responding appropriately can be a business model in itself and can transform its way of working. Looking at this data in real time and personalizing according to customer needs can be very rewarding for the customer, bring financial gains to the business, and improve customer experience (an intangible benefit).

A conceptual view of how Flume works is shown in the figure below.

Figure: Conceptual view of the working of Flume

Apache Flume is a very important component in our Data Lake implementation, and the main differences between Sqoop and Flume are shown in the figure below.

Figure: Sqoop and Flume

The figure below shows what an advanced Flume architecture would look like in the context of a Data Lake for an enterprise.

Figure: Advanced Flume architecture

More details on the book can be found here.

If you like the post, share it and help spread the word on as many social channels as possible… 🙂

Thanks in advance

One of the co-authors of the book “Data Lake for Enterprises”.


Apache Sqoop – Data Lake for Enterprises Book

Apache Sqoop is one of the primary frameworks for this capability; it is widely used as part of the Hadoop ecosystem and has been dominant in this space. It is one of the main technologies used to transfer data between structured data stores, such as RDBMSs and traditional data warehouses, and Hadoop. Hadoop finds it very hard to talk to these traditional stores, and Sqoop makes that integration easy. Sqoop handles bulk transfer of data from these stores and also integrates easily with Hadoop-based systems like Apache Oozie, Apache HBase and Apache Hive.

Apache Sqoop can be employed for many of the data transfer requirements in a Data Lake, which has HDFS as the main storage for incoming data from various systems. The points below give some of the cases where Apache Sqoop makes the most sense:

  • For regular batch and micro-batch transfers of data between an RDBMS and Hadoop (HDFS/Hive/HBase), use Apache Sqoop. It is one of the main and most widely used technologies in the data acquisition layer.
  • For transferring data from NoSQL data stores like MongoDB and Cassandra into the Hadoop file system.
  • For enterprises with a good number of applications whose stores are based on an RDBMS, Sqoop is the best option for transferring data into the Data Lake.
  • Hadoop is the de facto standard for storing massive amounts of data, and Sqoop allows data to be transferred into HDFS from traditional databases with ease.
  • Use Sqoop when batch processing is acceptable and performance is required, as it is able to split and parallelize the data transfer.
  • Sqoop has the concept of connectors, so if your enterprise has diverse business applications with different data stores, it is an ideal choice.
Figure: Capability of Apache Sqoop in a Data Lake

Chapter 5 in the book “Data Lake for Enterprises” covers both the theoretical and coding aspects of Apache Sqoop in the context of developing an enterprise-grade Data Lake.

More details on the book can be found here.
