Skip to main content

Lambda architcture λ

· 3 min read

I've been doing some research recently about architectures for large scale data analysis systems. An idea that appears quite often when discussing this problem is lambda architecture.

Data aggregation

The idea is quite simple. Intuitive approach to analytics is to gather data as it comes and then aggregate data for better performance of analytic queries. E.g. when users do reports by date range, pre-aggregate all sales/usage numbers per day and then produce the result for given date range by making the sum of aggregates for each data in the range. If you have let's say 10k transactions per day, that approach will create only 1 record per day. Of course in reality you'd probably need many aggregates for different dimensions, to enable data filtering, but still you will probably have much less dimensions than the number of aggregated rows.

Aggregation is not the only way to increase query performance. It could be any kind of pre-computing like batch processing, indexing, caching etc. This layer in lambda architecture is called "serving layer" as it's goal is to server for the analytical queries as a fast source of information.

Speed layer

This approach has a significant downside - aggregated results are available after a delay. In the example above the aggregated data required to perform analytical queries, will be available on the next day. Lambda architecture mitigates that problem by introducing so called speed-layer. In our example that would be the layer keeping data for current day. The amount of data for a single day is relatively small and probably does not require aggregates to be queried or can fit into a relatively small number of fast and more expensive machines (e.g. using in-memory computing)

Queries

Analytical queries combine results from 2 sources: aggregates and speed layer. The speed layer can be also used to crate aggregates for the next day. Once data is aggregated it can be removed from speed layer to free the resources.

Master data

Let's do not forget that besides speed layer and aggregates, there is also a so called master data that contains all raw, not aggregated records. This dataset in lambda architecture should be append-only

Technology

This architecture is technology-agnostic. For example you can build all the layers on top of SQL servers. But typically a distributed file system like  HDFS would be used as master data. MapReduce pattern would be used for batch processing the maser data. Technologies like Apache HBase or ElephantDb would be used to query the serving layer. And Apache Storm would be used for the speed layer. Those choices could be quite common in the industry but technology stack can vary a lot from project to project or company to company.

Sources

http://lambda-architecture.net/ https://amplitude.com/blog/2015/08/25/scaling-analytics-at-amplitude https://databricks.com/glossary/lambda-architecture