What is a Data Lake & how is it different from a Data Warehouse?
Enterprises hold huge volumes of data that is often highly varied and difficult to make sense of. A data lake keeps that data in its purest form so it can be used for many different purposes.
Traditionally, data warehouses have helped us use the data we have, but in the age of Big Data this model no longer serves us well.
Data Warehouse
We take data from structured data sources, do some ETL and structuring, and, based on a predefined data model, create data marts for reporting and OLAP cubes for slicing, dicing & visualization.
This process needs an understanding of the incoming data (source, type, anomalies, cardinality) along with the business requirements to make it work.
It also assumes that the business already understands its requirements. So most of the time goes into understanding the what, where and how of the data rather than into actual analysis.
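To make the ETL step concrete, here is a minimal Python sketch of the warehouse-style flow described above. The table names, columns and transformations are hypothetical, chosen only to show the extract, transform and load stages; a real pipeline would use a proper staging database and scheduler.

```python
# Minimal ETL sketch (illustrative only): extract rows from a transactional
# source, transform them to fit a predefined model, load them into a data mart.
import sqlite3

source = sqlite3.connect(":memory:")     # stands in for the transactional source
warehouse = sqlite3.connect(":memory:")  # stands in for the data warehouse

# Extract: pull raw order rows from the source system.
source.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER, country TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, 1250, "in"), (2, 900, "IN"), (3, 4300, "us")])
rows = source.execute("SELECT id, amount_cents, country FROM orders").fetchall()

# Transform: apply the predefined model (normalize country codes, convert units).
transformed = [(oid, cents / 100.0, country.upper()) for oid, cents, country in rows]

# Load: write the structured result into a reporting data mart.
warehouse.execute("CREATE TABLE mart_orders (id INTEGER, amount REAL, country TEXT)")
warehouse.executemany("INSERT INTO mart_orders VALUES (?, ?, ?)", transformed)
print(warehouse.execute(
    "SELECT country, SUM(amount) FROM mart_orders GROUP BY country").fetchall())
```

Notice that the schema and transformations have to be decided before any data is loaded, which is exactly why so much upfront understanding of the data is needed.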
Data Lake
Here the data sources can be structured or unstructured. We extract and load all types of data into raw data storage. This is persistent storage that can hold data at scale (volume).
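In contrast to the warehouse flow, the lake only extracts and loads; nothing is transformed on the way in. A small sketch of that landing step is below, with an assumed folder layout (raw zone partitioned by source and arrival date) that is purely illustrative.

```python
# Extract-and-load sketch (assumed layout): land incoming data in the raw zone
# exactly as received, partitioned by source system and arrival date.
import datetime
import json
import pathlib
import shutil

RAW_ZONE = pathlib.Path("data-lake/raw")  # hypothetical raw storage root

def land(source_name: str, payload_path: pathlib.Path) -> pathlib.Path:
    """Copy an incoming file into the raw zone without transforming it."""
    today = datetime.date.today().isoformat()
    target_dir = RAW_ZONE / source_name / today
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / payload_path.name
    shutil.copy(payload_path, target)  # data kept in its purest form
    return target

# Example: a JSON export from some upstream system lands untouched in the lake.
incoming = pathlib.Path("orders_export.json")
incoming.write_text(json.dumps([{"id": 1, "amount": 12.5}]))
print(land("orders_api", incoming))
```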
Components of a data lake:
- A sandbox environment: For understanding & exploring data, creating prototypes & use cases.
- A batch processing engine: For converting raw data into structured data used for reporting (a minimal sketch follows this list).
- Real-time processing: Ingests and processes streaming data as it arrives.
- Cataloging & curating: Assesses the value of data based on its source, quality & lineage. Helps in deciding which data set to use for a particular analysis and provides metadata.
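The batch processing engine mentioned above is where raw files become report-friendly tables. The sketch below assumes the hypothetical raw-zone layout from the earlier landing example and writes a curated CSV; a real lake would more likely use Spark or Hadoop MapReduce, but the shape of the job is the same.

```python
# Batch-job sketch (hypothetical paths): read everything landed in the raw zone
# and produce a structured, report-friendly table in the curated zone.
import csv
import json
import pathlib

RAW_ZONE = pathlib.Path("data-lake/raw/orders_api")
CURATED = pathlib.Path("data-lake/curated/orders.csv")
CURATED.parent.mkdir(parents=True, exist_ok=True)

with CURATED.open("w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["id", "amount"])  # structured schema for reporting
    for raw_file in sorted(RAW_ZONE.rglob("*.json")):
        for record in json.loads(raw_file.read_text()):
            writer.writerow([record["id"], record["amount"]])
```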
Lambda architecture: The hybrid approach of combining batch & real-time processing is called Lambda. The batch layer is the slow layer, whereas the speed layer takes care of fast incoming data. Once data has passed through the speed layer it goes on for batch processing, which makes room for more data to come into the speed layer.
The batch layer & the speed layer are combined when queried for reporting.
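A toy illustration of that query-time combination is below. The batch view and speed view contents are made-up numbers; the point is only that a query serves the sum of what the slow batch layer has already computed and what the speed layer holds for data the batch layer has not reached yet.

```python
# Lambda query-time merge sketch (hypothetical data): combine the precomputed
# batch view with the speed layer's view of recent, not-yet-batched events.
from collections import Counter

batch_view = Counter({"IN": 43.0, "US": 12.5})  # precomputed by the batch layer
speed_view = Counter({"IN": 2.0})               # recent events in the speed layer

def query(country: str) -> float:
    """Serve the combined result of the batch and speed layers."""
    return batch_view[country] + speed_view[country]

print(query("IN"))  # 45.0
```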
Glossary
ETL: Extract, Transform, Load (ETL) first extracts the data from a pool of data sources, which are typically transactional databases. The data is held in a temporary staging database. Transformation operations are then performed, to structure and convert the data into a suitable form for the target data warehouse system. The structured data is then loaded into the warehouse, ready for analysis.
Persistent storage: Storage that keeps data available even when power is off, like a hard disk. The volume of data, the availability requirements & the need for distributed compute make it complex storage.
Batch processing: Typically uses Hadoop MapReduce.
Immutable data: Data that cannot be changed once stored.
Master data set: This is where all batch-processed data is stored. This data is immutable.
Atomic data: Data that cannot be broken down any further, as opposed to calculated metrics like revenue.
Timestamp: The recorded time of an event in a log, used for organizing data.
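These last few glossary terms fit together, and a small sketch may help tie them up. The file path and event fields below are hypothetical: atomic, timestamped events are only ever appended to the master data set, and calculated metrics such as revenue are derived from it later rather than stored in it.

```python
# Sketch of an immutable master data set (illustrative): append-only log of
# atomic, timestamped events; derived metrics are computed on demand.
import json
import pathlib
import time

MASTER = pathlib.Path("data-lake/master/events.log")  # hypothetical location
MASTER.parent.mkdir(parents=True, exist_ok=True)

def append_event(event: dict) -> None:
    """Append an atomic event with its timestamp; existing lines are never edited."""
    event = {**event, "ts": time.time()}
    with MASTER.open("a") as log:
        log.write(json.dumps(event) + "\n")

append_event({"order_id": 1, "amount": 12.5})
append_event({"order_id": 2, "amount": 9.0})

# Revenue is a calculated metric, derived from the atomic events on demand.
revenue = sum(json.loads(line)["amount"] for line in MASTER.open())
print(revenue)
```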