
DSCI 6007: Distributed and Scalable Data Engineering

Welcome to Intro to Data Engineering!

See the current syllabus.

Schedule Overview

Organized by weeks and days:
(subject to change; see detailed schedule below)

  1. Welcome to Data Engineering
    1. Data Engineering Overview
    2. How the Internet Works
    3. Virtualization
    4. Linux
  2. The Cloud
    1. The Cloud
    2. Deployment
    3. Big Data Architecture
    4. Review Day - Project Data Due
  3. Parallel Processing
    1. Functional Programming
    2. Threading
    3. Multiprocessing
    4. Massively Parallel Processing
  4. MapReduce (Divide-and-conquer for Distributed Systems)
    1. The MapReduce Algorithm & Hadoop
    2. MapReduce Design Patterns
    3. Spark Overview
    4. Review Day - Project Proposals Due
  5. SQL (The Lingua Franca of Data)
    1. SQL Fundamentals
    2. Extract Transform Load
    3. Relational Data Modeling
    4. Advanced Querying
  6. Spark (What to add to your LinkedIn profile)
    1. Spark DataFrames
    2. Spark SQL
    3. Review Day
    4. Intro to Spark ML
  7. Streaming (Everyone has to have real-time)
    1. More Spark ML
    2. Spark Streaming
    3. Probabilistic Data Structures
    4. Review Day
  8. Final Project Presentations

Detailed Schedule

Week 1: Welcome to Data Engineering

Day Readings Notes Assignment
Monday Data Engineering Overview 1. Intro to Data Engineering
2. Intro to the Cloud
Connecting to the Cloud with Python
Tuesday How the Internet Works How the Web Works Generating Reports
Thursday Virtualization Virtualization & Docker Your Very Own Web Server
Friday *NIX Linux Linux Intro

Week 2: The Cloud

Day Readings Notes Assignment
Monday Introduction to Clouds The Cloud & AWS Move your Linux machine to the Cloud
Tuesday Provisioning EC2 & cron Automate More
Thursday I ♥ Logs Apache Kafka Drinking from the Firehose
Friday Projects Project Proposal Proposal

Week 3: Parallel Processing

Day Readings Notes Assignment
Monday Functional Programming Fun with Toolz
Tuesday Threading and Webscraping Threading and Webscraping
Thursday Intro to Multiprocessing Multiprocessing Demonstration Multiprocessing
Friday Scaling Out Distributed Computing Embarrassingly Parallel

Week 4: MapReduce (Divide-and-conquer for Distributed Systems)

Day Readings Notes Assignment
Monday HDFS and MapReduce MapReduce Scaling Out
Tuesday MapReduce Design Patterns Hadoop Ecosystem Meet MrJob
Thursday Introduction to Spark Apache Spark Spark on EMR
Friday Designing Big Data Systems Review Final Project Proposal

Week 5: SQL (The Lingua Franca of Data)

Day Readings Notes Assignment
Monday SQL Basics Databases and SQL Squashing Birds
Tuesday Relational Design Relational Database Modeling Data Modeling Practice
Thursday Drivers and Workers SQL: Advanced Querying Feeding the Elephant
Friday Tuning SQL Data Systems Architecture Advanced Querying

Week 6: Spark (What to add to your LinkedIn profile)

Day Readings Notes Assignment
Monday Spark DataFrames Spark DataFrames
Tuesday Programming with RDDs Spark SQL Spark SQL

