Understand Data Engineering
Data Engineering Recap
Understand Data Engineering
Understand Data Engineering
- Data Workflow: collection & storage -> preparation -> exploration & visualization -> experimentation & prediction
- Volume, Variety, Velocity, Veracity, Value (How much?, What kind?, How frequent?, How accurate?, How useful?)
- Data Pipeline: Ingest, Process, Store, Need pipelines, Automate flow, Provide accurate data
- Structured vs Semi-structured vs Unstructured Data
- Databases, Data Lakes > Data Warehouses; Data Catalog
- Processing Data: Batch (Spark, Hadoop, AWS EMR, Hive) vs Stream (Apache Nifi, Apache Storm, Kafka)
- Scheduling: Apache Airflow, Luigi
- Parallel Computing; Cloud Computing
- File Storage (AWS S3, Azure Blob Storage, GC Storage)
- Computation (AWS EC2, Azure VMs, GC Engine)
- Databases (AWS RDS, Azure SQL Db, GC SQL; AWS Redshift DW, Snowflake DW, GC Datastore NoSQL)
Learning (Postgre)SQL
- Relational Databases, Tables, Schemas, Data Types
- Queries
- SELECT, WHERE
- AS, DISTINCT, CREATE VIEW
- Filtering
- AND, OR, BETWEEN, IN, (NOT) LIKE, Widcards
- IS (NOT) NULL
- Aggregate Functions; Sorting and Grouping
- AVG(), SUM(), MIN(), MAX(), COUNT(), ROUND()
- ORDER BY(), GROUP BY(), HAVING
- INNER JOINs, USING
- LEFT, RIGHT, FULL JOIN; CROSS JOIN
- Self Joins ; Semi/ Anti Join
- UNION (ALL), INTERSECT, EXCEPT
- Subqueries inside SELECT, WHERE, FROM clauses
Relational Databases in SQL
- Data Types; CAST( - AS integer)
- CREATE TABLE
- ALTER TABLE - (ADD COLUMN, RENAME COLUMN, DROP COLUMN)
- ALTER TABLE - ALTER COLUMN - TYPE - (USING SUBSTRING firstname FROM 1 FOR 16)
- ALTER TABLE - ALTER COLUMN - (SET NOT NULL, DROP NOT NULL)
- INSERT INTO - SELECT DISTINCT
- CAST(fee AS integer)
- Only contain unique values
- ALTER TABLE - ADD CONSTRAINT uni_unq UNIQUE uni
- Primary Keys
- ALTER TABLE - ADD COLUMN id serial
- ALTER TABLE - ADD CONSTRAINT prof_pkey PRIMARY KEY (id)
- Surrogate Keys
- UPDATE cars SET id = CONCAT(make, model)
- ALTER TABLE - ADD CONSTRAINT id_pkey PRIMARY KEY (id)
- Foreign Keys
- ALTER TABLE - RENAME COLUMN - TO uni_id
- ALTER TABLE - ADD CONSTRAINT prof_fk FOREIGN KEY uni_id REFERENCES universities (id)
- Referential Integrity
- ALTER TABLE - ADD CONSTRAINT - FOREIGN KEY - REFERENCES - β¦
- β¦ON DELETE (NO ACTION, CASCADE, RESTRICT, SET NULL, SET DEFAULT)
Database Design
- Online Transaction Processing (OLTP - latest customer transactions) vs Online Analytical Processing (OLAP - most loyal customers)
- Structured Data: has schema, data types, relationships (SQL, tables in relational database)
- Semi-Structured Data: NoSQL, XML, JSON
- Unstructured Data: Photos, MP3, chat logs
- Data Warehouses
- denormalized schema, dimensional modeling, MPP > Data Marts
- Use AWS Redshift, Azure SQL DW, Gg Big Query
- Data Lakes
- cheaper; all types data
- Use Apache Spark, Hadoop
- ETL (Extract, Tranform, Load)
- (OLTP, APIs, Files, IoT Logs) -> Staging -> Data Warehouse -> ML, Data Marts, BI Tools
- ELT (Extract, Load, Transform)
- (OLTP, APIs, Files, IoT Logs) -> Data Lake-> DL, Warehouse, BI Tools
- Conceptual (Entity-relational diagrams), Logical (relational model, star schema), Physical (partitions, CPUs, indexes)
- Normalization (1NF, 2NF, 3NF)
- Using views; Materialized views
- CREATE VIEW -AS
- GRANT UPDATE ON ratings TO PUBLIC
- REVOKE INSERT ON films FROM db_user;
- UPDATE - SET - WHERE
- INSERT INTO - VALUES
- DROP VIEW - [CASCADE | RESTRICT]
- CREATE OR RPLACE VIEW - AS
- ALTER VIEW [IF EXISTS]
- CREATE ROLE - WITH
- ALTER ROLE
- Vertical and Horizontal Partitioning; Sharding
- Data Integration - Unified Data Model
- Sales financial data (PostgreSQL), Product behavioral data (MongoDB), Marketing contact data (CSV)
- Transformation tools - ETL
- Unified Data Model (RedShift)
- DBMS types
- SQL: PostgreSQL, Microsoft SQL Server, Oracle
- NoSQL:
- document store (MongoDB) - content management
- key-value store (Redis) - shopping cart online buyers
- columnar database (Cassandra) - big data analytics for speed
- graph database (Neo4j) - recommendations, social media data
Data Warehousing
- Data Lakes > Data Warehouses > Data Marts
- Business Requirements (DA, DS) -> Data Modeling (DE, DB Admin) -> ETL Design and Develop (DE, DB Admin) -> BI Application (DA/DS) -> Test & Deploy (DE)
- Architecture: Top-down (Inmon) vs Bottom-up (Kimball)
- OLAP (e.g. total amount of sales) vs OLTP (e.g. when a customer places a new order)
- Fact vs Dimension Tables
- Kimballβs 4 Step Process
- Slowly changing dimensions
- Row vs Column Store
- ETL: separate computer system, store transformed data, lower cost, easy for PII security because sensitive data had been excluded
- ELT: near real-time processes, copy of raw in DW, errors donβt require new data pulls
- On-premise vs Cloud (AWS Redshift, Azure Synapse Analytics, Snowflake, GBQ)
Learning Snowflake(SQL)
- Architecture and Competitors (Databricks, AWS Redshift, Google Big Query)
- Advanced SQL Concepts
- NATURAL, LATERAL JOINs
- Subqueries and CTEs
- VARIANT data type; Handling JSONs
- PARSE_JSON , OBJECT_CONSTRUCT
Learning Python
- Variables, Data Types
- Lists, Dictionaries, Sets and Tuples
- IF - ELSE, and, or, (not) in; FOR, range(); WHILE, break
- Functions, Modules, Packages
- Docstrings
- Default, Keyword, Arbitrary Arguments (*args, **kwargs)
- Lambda Functions -
(lambda x, y: x**y)(2, 3)
- Errors
- TypeError, ValueError, Tracebacks
- try-except; raise
Importing Data & Cleaning Data
- Reading, writing to a text file
- Importing using Numpy, Pandas
- np. loadtxt(filename, delimiter= β , β , skiprows=1, usecols=[0, 2])
- pd.read_csv(filename)
- pd.ExcelFile(filename)
- SQL querying in Python; with pandas
1 2
from sqlalchemy import create_engine engine = create_engine('sqlite:///db.sqlite')
- HTTP requests
1 2 3
from urllib.request import urlretrieve, urlopen, Request # --- or --- import requests
- Scraping the Web
1 2
from bs4 import BeautifulSoup import requests
- JSONs, APIs -> practice Twitter API
Data Ingestion with Pandas
ETL & ELT with Python
Learning Git
Learning Apache Airflow
Cloud Computing
- On-premise vs Cloud (IaaS vs PaaS vs SaaS)
- AWS (Cloud - S3, EC2, RDS; Data - Redshift, Kinesis, SageMaker)
- Azure (Cloud - Blob Storage, VMs, SQL Db; Data - Data Lake Storage, Stream Analytics, Machine Learning)
- Google (Cloud - GC Storage, GC Compute Engine, GC SQL; Data - Big Query, Dataflow, AutoML)
Containerization, Virtualization and Orchestration
- Docker, Kubernetes
This post is licensed under CC BY 4.0 by the author.