What is a Data Lake

  • Data lake is a central repository that holds big data from many different data sources
  • Can be structured, semi-structured or unstructured data
  • To ingest data as quickly as possible and make it available asap
  • Used extensively for machine learning and analytical solutions
  • Has to be secure and scale
  • Hardware should be inexpensive so you can store as much data as possible
  • Idea is to store as much data as possible so it can be made available to others, and they can make use of it later.
  • R&D on data products
  • Cannot always define the structure of the data


Data Lake vs Data Warehouse

Data LakeData Warehouse
Data is unstructured
Data is structured
Data Scientists or Data Analysts
Business Analysts
Stores data on the scale of petabytes
Used for batch processing, Business Intelligence, and Reporting
Stream Processing, Machine learning and real time analysis
Data size is generally small
Data is undefined, no relation between data
Data Warehouses contain historic and relational data


Gotchas of Data Lake

  • Starts with a good intention, but soon turns into a Data Swamp: Very hard to be useful
  • No versioning: Incompatible schemas between the data, and Different file types
  • No metadata associated; what's the usefulness of the data
  • Joins not possible between different data sets

ELT vs. ETL

ELTETL
Extract, Load, and Transform
Extract, Transform, and Load
Used for large amount of data
Used for small amount of data
Provides data lake support (Schema on read)Data warehouse solution
Data lake solution
Schema on write
Write the data first, and then determine the schema when loading the data
We define the schema, define the relationships, and then we load the data

Alternatives to components (S3/HDFS, Redshift, Snowflake etc.)

  • GCP -> Cloud Storage
  • AWS -> S3
  • Azure -> Azure Blob