Module 1
Introduction to Data Engineering
Data engineering focuses on collecting, storing, and processing data for analytics. It supports data analysts, data scientists, and machine learning engineers by building and maintaining the underlying data infrastructure. Data engineers use a range of tools for storage, pipelines, and processing to ensure data is ready for machine learning and analytics.
Computer Science Fundamentals
Software development covers coding, Agile processes, DevOps, and essential tools like Git and Azure DevOps. Relational databases are crucial for data management, requiring SQL, data modeling, and normalization for efficient structuring.
Introduction to Python
This course covers Python fundamentals, starting with programming basics and setting up the development environment. Key topics include variables, loops, functions, data structures like lists, tuples, dictionaries, and sets, along with indexing, slicing, and file handling for JSON and CSV.
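As a small taste of these topics, here is a minimal sketch covering lists, slicing, dictionaries, and JSON/CSV file handling (the file names are illustrative, not from the course):

    import csv
    import json

    # Core data structures and slicing
    languages = ["Python", "SQL", "Bash"]      # list
    first_two = languages[:2]                  # slice -> ["Python", "SQL"]
    config = {"retries": 3, "timeout": 30}     # dictionary

    # Write and read JSON
    with open("config.json", "w") as f:
        json.dump(config, f)
    with open("config.json") as f:
        loaded = json.load(f)

    # Write and read CSV
    with open("languages.csv", "w", newline="") as f:
        csv.writer(f).writerows([[lang] for lang in languages])
    with open("languages.csv", newline="") as f:
        rows = list(csv.reader(f))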
Python for Data Engineers
This module covers advanced Python concepts like classes, modules, exception handling, and logging. It also includes data engineering topics such as datetime, JSON processing, unit testing, and Pandas for data manipulation. Additionally, it explores NumPy and working with APIs, databases, and data sources/sinks.
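To illustrate how several of these pieces fit together, a minimal sketch (the Measurement class and sensor names are invented for the example):

    import logging
    from datetime import datetime, timezone

    import pandas as pd

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    class Measurement:
        """A small class tying together classes, datetime, and Pandas."""
        def __init__(self, sensor: str, value: float):
            self.sensor = sensor
            self.value = value
            self.taken_at = datetime.now(timezone.utc)

    try:
        readings = [Measurement("s1", 21.5), Measurement("s2", 22.1)]
        df = pd.DataFrame([vars(m) for m in readings])  # Pandas for manipulation
        logger.info("Loaded %d readings", len(df))
    except Exception:
        logger.exception("Failed to build the readings DataFrame")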
Data Preparation & Cleaning for Machine Learning
This module covers the ML preparation checklist, handling missing values using Pandas and imputers, and encoding categorical variables. It also includes outlier detection, feature scaling, and selection techniques like correlation matrices and RFECV. Finally, it explores ML model validation through theoretical and practical approaches.
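For example, imputing, encoding, and scaling with Pandas and scikit-learn might look like the following sketch (the tiny DataFrame is invented; sparse_output requires scikit-learn 1.2+):

    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    df = pd.DataFrame({
        "age": [25, None, 34],                   # missing value to impute
        "city": ["Berlin", "Paris", "Berlin"],   # categorical to encode
    })

    # Fill the missing numeric value with the column mean
    df["age"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()

    # One-hot encode the categorical column
    city_encoded = OneHotEncoder(sparse_output=False).fit_transform(df[["city"]])

    # Scale the numeric feature to zero mean and unit variance
    df["age_scaled"] = StandardScaler().fit_transform(df[["age"]]).ravel()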
Docker Fundamentals
This module covers Docker fundamentals, comparing it with virtual machines and explaining key concepts like images, containers, and registries. It includes hands-on practice with pulling images, running containers, Docker Compose, and building images. Finally, it explores Docker in production, covering deployment, security best practices, and container management with Portainer.
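The hands-on practice presumably runs through the docker CLI; purely as an illustration, the same pull-and-run workflow through the Docker SDK for Python looks roughly like this (assumes a local Docker daemon):

    import docker  # pip install docker

    client = docker.from_env()

    # Equivalent of `docker pull hello-world`
    client.images.pull("hello-world")

    # Equivalent of `docker run --rm hello-world`; returns the container's output
    output = client.containers.run("hello-world", remove=True)
    print(output.decode())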
Successful Job Application
This module guides you through career paths, industry roles, and company types for data engineers. It covers building a strong personal brand on social media, documenting your work via GitHub and Medium, and finding quality datasets for projects. You’ll learn how to craft an effective CV, navigate the job market, and prepare for interviews, including technical assessments and salary negotiations.
SQL for Data Engineers
This module covers database fundamentals, including DBMS, SQLite, and the Chinook database. It introduces SQL basics like DML, DDL, joins, and aggregation, followed by advanced concepts such as transactions, subqueries, and window functions. Additionally, it provides insights into excelling as a data engineer, focusing on project phases, planning, platform design, risk management, testing, documentation, and operational improvements like monitoring and continuous enhancement.
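As a flavor of the SQL involved, here is a sketch that runs a join plus a window function against a local copy of the Chinook SQLite database (the file path is an assumption):

    import sqlite3

    conn = sqlite3.connect("chinook.db")  # your local Chinook file

    # Join, then a window function: each invoice with a per-customer running total
    query = """
    SELECT c.FirstName,
           i.InvoiceDate,
           i.Total,
           SUM(i.Total) OVER (
               PARTITION BY c.CustomerId ORDER BY i.InvoiceDate
           ) AS running_total
    FROM Customer c
    JOIN Invoice i ON i.CustomerId = c.CustomerId
    LIMIT 5;
    """
    for row in conn.execute(query):
        print(row)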
Module 2
Platform & Pipeline Design Fundamentals
This module covers the fundamentals of data platform and pipeline design, including key blueprints and essential engineering tools. It explores various pipeline types such as push, pull, batch, and streaming, along with stream analytics and Lambda architecture. Additionally, it delves into data visualization using Hive, Spark on Hadoop, and Spark Thrift Server.
Platform & Pipeline Security
This module covers essential security concepts, including network security with firewalls, proxies, and bastion hosts. It explores access management through IAM and LDAP, along with secure data transmission using HTTPS, SSH, SCP, and security tokens.
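To make the transmission piece concrete, here is a sketch of SSH and SFTP (the scp-style copy) using the paramiko library; the host, user, and paths are placeholders:

    import os
    import paramiko  # pip install paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(
        "bastion.example.com",
        username="deploy",
        key_filename=os.path.expanduser("~/.ssh/id_rsa"),
    )

    # Run a command over the encrypted SSH channel
    _, stdout, _ = client.exec_command("uname -a")
    print(stdout.read().decode())

    # Copy a file securely (SFTP, the modern equivalent of scp)
    sftp = client.open_sftp()
    sftp.put("report.csv", "/data/report.csv")
    sftp.close()
    client.close()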
Choosing Data Stores
This module covers the fundamentals of selecting data stores, comparing OLTP vs. OLAP and ETL vs. ELT processes. It explores relational databases, NoSQL options like document stores, time-series databases, search engines, wide-column stores, key-value stores, and graph databases. Finally, it discusses data warehouses and data lakes, helping you choose the right storage solution for your needs.
Data Modeling 1
This module covers the importance of data modeling and its role in structuring datasets effectively. It explores schema design across relational databases, wide-column stores, document stores, key-value stores, and data warehouses. A hands-on data modeling workshop is included to reinforce practical understanding.
Data Modeling 2: Relational Data Modeling
This module covers the fundamentals of relational data models, MySQL installation, and Workbench setup. It guides you through conceptual data modeling, discovering entities and attributes, and defining normalized relationships. You’ll learn to resolve different relationship types and gain hands-on experience by creating an ER diagram, building a physical data model, and populating a MySQL database using Workbench.
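For instance, a resolved one-to-many relationship places the foreign key on the "many" side. The course builds this in MySQL Workbench; the sketch below uses sqlite3 only so it runs stand-alone, and the schema is invented:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- One customer has many orders: the FK lives on the "many" side
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE customer_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        total       REAL
    );
    """)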
Module 3
Fundamental Tools
This module covers API fundamentals, including HTTP methods, response codes, and parameters. It guides you through setting up FastAPI with WSL2 and VS Code, designing schemas with OpenAPI and Swagger, and implementing CRUD operations. Finally, it explores deploying FastAPI using Docker, testing with Postman, and security best practices.
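A minimal FastAPI sketch with two CRUD endpoints (the Item model and in-memory store are invented; run with uvicorn):

    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel

    app = FastAPI()

    class Item(BaseModel):
        name: str
        price: float

    items: dict[int, Item] = {}  # in-memory store for the sketch

    @app.post("/items/{item_id}")
    def create_item(item_id: int, item: Item):
        items[item_id] = item
        return item

    @app.get("/items/{item_id}")
    def read_item(item_id: int):
        if item_id not in items:
            raise HTTPException(status_code=404, detail="Item not found")
        return items[item_id]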
Apache Airflow Workflow Orchestration
This module introduces Apache Airflow, covering its architecture, key concepts, and example pipelines. It includes hands-on setup using Docker, configuring services like WeatherAPI and Postgres, and creating DAGs with the TaskFlow API. You'll also learn how to retrieve data via APIs, write to databases, and implement parallel processing for efficient workflow orchestration.
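The shape of such a DAG with the TaskFlow API, as a sketch (Airflow 2.4+; the task bodies stand in for the real WeatherAPI call and Postgres write):

    from datetime import datetime

    from airflow.decorators import dag, task

    @dag(schedule="@hourly", start_date=datetime(2024, 1, 1), catchup=False)
    def weather_pipeline():
        @task
        def fetch() -> dict:
            # stand-in for the WeatherAPI request
            return {"city": "Berlin", "temp_c": 7.5}

        @task
        def store(reading: dict):
            # stand-in for the Postgres insert
            print(f"storing {reading}")

        store(fetch())

    weather_pipeline()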
Apache Spark Fundamentals
This module introduces Apache Spark, covering its scalability, architecture, and deployment options. It includes setting up the environment with Docker and Jupyter Notebook, followed by hands-on coding with RDDs, DataFrames, transformations, and actions. Practical exercises cover JSON transformations, schema handling, SparkSQL, and working with RDDs.
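A minimal PySpark sketch showing lazy transformations, an action, and the DataFrame/RDD boundary (the data is invented):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("fundamentals").getOrCreate()

    df = spark.createDataFrame(
        [("alice", 34), ("bob", 29)], schema="name string, age int"
    )

    # Transformations are lazy...
    adults = df.filter(F.col("age") > 30).withColumn("next_year", F.col("age") + 1)

    # ...and only an action triggers execution
    adults.show()

    # The same data through the low-level RDD API
    print(df.rdd.map(lambda row: row.name.upper()).collect())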
Databricks Fundamentals
This module covers the basics of Databricks, including its benefits, pricing, and setup. It walks through creating a Databricks workspace, managing AWS resources, and handling data imports. You’ll learn to process and visualize data using ETL jobs, explore data with notebooks, and connect Databricks to external BI tools like Power BI for advanced analytics.
Apache Kafka
This module introduces Kafka and message queue concepts, covering topics, partitions, brokers, and Zookeeper. You’ll set up a development environment using Bitnami Docker and learn to work with Kafka producers, consumers, and offsets. Finally, it explores how Kafka integrates into data platforms for real-world applications.
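A producer/consumer round trip, sketched with the kafka-python client (the broker address matches a typical local Docker setup; the topic name is invented):

    import json

    from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("sensor-readings", {"sensor": "s1", "value": 21.5})
    producer.flush()

    consumer = KafkaConsumer(
        "sensor-readings",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",  # start from the first offset
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for message in consumer:
        print(message.offset, message.value)
        break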
MongoDB Fundamentals
This module covers the basics of MongoDB, including document structures, schema design, and relational schema comparisons. You’ll set up a development environment, work with Mongo-Express, and learn CRUD operations using PyMongo. Finally, it explores MongoDB’s role in data platforms and its relevance in data science workflows.
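The four CRUD operations with PyMongo, as a sketch against a local instance (database and collection names are invented):

    from pymongo import MongoClient  # pip install pymongo

    client = MongoClient("mongodb://localhost:27017")
    users = client["appdb"]["users"]  # created lazily on first write

    user_id = users.insert_one({"name": "Ada", "role": "engineer"}).inserted_id  # create
    print(users.find_one({"_id": user_id}))                                      # read
    users.update_one({"_id": user_id}, {"$set": {"role": "data engineer"}})      # update
    users.delete_one({"_id": user_id})                                           # delete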
Log Analysis with Elasticsearch
This module introduces Elasticsearch, covering its fundamentals, ETL and streaming log analysis, and problem-solving approaches. You’ll get hands-on experience with setting up Elasticsearch, using its APIs, writing logs, and creating indices with Python. Finally, you’ll analyze logs using Kibana visualizations and dashboards before wrapping up with a summary.
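Writing a log document and searching it back with the Python client, as a sketch (local cluster and invented index name; elasticsearch 8.x client):

    from datetime import datetime, timezone

    from elasticsearch import Elasticsearch  # pip install elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Index a log document (the index is created on first write)
    es.index(index="app-logs", document={
        "level": "ERROR",
        "message": "payment service timeout",
        "@timestamp": datetime.now(timezone.utc).isoformat(),
    })

    # Search for error-level logs
    hits = es.search(index="app-logs", query={"match": {"level": "ERROR"}})
    for hit in hits["hits"]["hits"]:
        print(hit["_source"]["message"])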
Snowflake for Data Engineers
This module covers Snowflake fundamentals, its role in data warehousing, and platform integration. You’ll set up a Snowflake account, create warehouses, and load CSV data using internal stages. Learn to visualize data with dashboards, connect Power BI, and automate tasks like imports and table refreshes. Finally, integrate Snowflake with AWS S3 using external stages and Snowpipe before concluding with a summary.
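Loading a CSV through an internal (table) stage boils down to a PUT followed by a COPY INTO; a sketch with the Snowflake Python connector (all identifiers, paths, and credentials are placeholders):

    import snowflake.connector  # pip install snowflake-connector-python

    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...",
        warehouse="COMPUTE_WH", database="DEMO_DB", schema="PUBLIC",
    )
    cur = conn.cursor()

    # Upload the local file to the table's internal stage, then load it
    cur.execute("PUT file:///tmp/sales.csv @%SALES")
    cur.execute("COPY INTO SALES FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)")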
dbt for Data Engineers
This course introduces dbt, a modern data transformation tool, and its integration with Snowflake. You’ll set up dbt Core, work with SQL and Python models, configure sources, and implement tests. The course also covers dbt Cloud, job automation, CI/CD integration with GitHub, and documentation best practices before concluding with future outlooks.
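For the Python side, a dbt Python model is just a function returning a DataFrame; on Snowflake it receives a Snowpark session (dbt 1.3+; the stg_orders model and STATUS column are invented):

    # models/orders_completed.py
    def model(dbt, session):
        dbt.config(materialized="table")

        orders = dbt.ref("stg_orders")  # a Snowpark DataFrame on Snowflake

        # Whatever is returned becomes the model's table
        return orders.filter(orders["STATUS"] == "completed")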
Module 4
Hands-On Example Project
This project-based course integrates Kafka, Spark, MongoDB, FastAPI, and Streamlit to create a real-time data processing pipeline. You’ll learn how to prepare and stream data, build APIs, deploy services using Docker, and visualize data with Streamlit. The hands-on approach ensures practical experience with real-world data engineering and streaming use cases.
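The Streamlit layer of such a pipeline can be remarkably small; a sketch (the FastAPI endpoint URL is a placeholder):

    import requests
    import streamlit as st  # run with: streamlit run app.py

    st.title("Live sensor readings")

    # Fetch the latest processed records from the pipeline's API
    response = requests.get("http://localhost:8000/readings/latest")
    st.dataframe(response.json())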
Data Engineering on AWS
This course provides a comprehensive hands-on guide to building a real-time data engineering platform using AWS services like Lambda, API Gateway, Kinesis, S3, DynamoDB, Redshift, and Glue. You’ll learn data ingestion, streaming, storage, processing, and visualization while working with real-world datasets.
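Ingestion into Kinesis, for example, is a single boto3 call; a sketch (stream name, region, and payload are placeholders, and AWS credentials are assumed to be configured):

    import json

    import boto3

    kinesis = boto3.client("kinesis", region_name="eu-central-1")

    kinesis.put_record(
        StreamName="clickstream",
        Data=json.dumps({"user": "u42", "action": "page_view"}).encode("utf-8"),
        PartitionKey="u42",  # determines the shard the record lands on
    )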
Data Engineering on Azure
This project focuses on building a real-time data pipeline in Azure using services like Azure Functions, Blob Storage, Event Hubs, and Cosmos DB. Data is ingested, processed, and stored using Azure Functions, which are integrated with API Management for secure access. Event Hubs capture streaming data, and processed information is written to Cosmos DB. Finally, Power BI connects to Cosmos DB to visualize insights in interactive dashboards.
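Sending an event into Event Hubs from Python, as a sketch (connection string and hub name are placeholders):

    from azure.eventhub import EventData, EventHubProducerClient  # pip install azure-eventhub

    producer = EventHubProducerClient.from_connection_string(
        conn_str="Endpoint=sb://...", eventhub_name="telemetry"
    )

    with producer:
        batch = producer.create_batch()
        batch.add(EventData('{"device": "d1", "temp_c": 21.5}'))
        producer.send_batch(batch)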
Modern Data Warehouses & Data Lakes
This course covers modern data warehouses and data lakes, focusing on ETL/ELT processes and integration with cloud platforms. It includes hands-on implementation using GCP (BigQuery, Data Studio) and AWS (S3, Athena, Glue, Quicksight) to build data pipelines and dashboards. The course wraps up with a recap and a bonus lesson on configuring Redshift Spectrum with S3 for efficient querying.
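On the GCP side, querying a warehouse table from Python is a few lines; a sketch (project, dataset, and table names are invented, and GCP credentials are assumed to be configured):

    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()

    query = """
        SELECT station, AVG(temp_c) AS avg_temp
        FROM `my_project.weather.readings`
        GROUP BY station
    """
    for row in client.query(query).result():
        print(row.station, row.avg_temp)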
Machine Learning & Containerization on AWS
Storing & Visualizing Time-Series Data
This project focuses on building a time-series data pipeline using InfluxDB, Grafana, and external weather APIs. It covers schema design for relational and time-series databases, setting up an environment with Docker, and writing test and real-time air quality data to InfluxDB. Data visualization is implemented using Grafana dashboards, and user management is configured for different organizations. The project concludes with integrating weather API data and managing access permissions.
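Writing a point to InfluxDB 2.x from Python, as a sketch (URL, token, org, bucket, and the measurement itself are placeholders):

    from influxdb_client import InfluxDBClient, Point  # pip install influxdb-client
    from influxdb_client.client.write_api import SYNCHRONOUS

    client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
    write_api = client.write_api(write_options=SYNCHRONOUS)

    point = (
        Point("air_quality")          # measurement
        .tag("station", "berlin-01")  # indexed metadata
        .field("pm25", 12.4)          # the actual value
    )
    write_api.write(bucket="weather", record=point)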
Contact Tracing with Elasticsearch
This project involves building a data-driven application using Elasticsearch and Streamlit. It begins with setting up a relational vs. Elasticsearch-based dataset using San Francisco data, generating 100k fake users, and merging them for analysis. The Streamlit user interface is developed to query Elasticsearch with free text, zip codes, business IDs, and device tracking. The project concludes with a summary and future outlook on enhancements.
Data Engineering on Hadoop
This course introduces Big Data and Hadoop, covering its architecture, storage, and processing in HDFS. It explores Hadoop distributions, Apache Hive for data warehousing, and Apache Sqoop for importing/exporting data between RDBMS and Hadoop. Hands-on exercises help learners build practical skills in managing and processing large datasets efficiently.
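Querying Hive from Python goes through HiveServer2; a sketch with PyHive (the host, port, and flights table are assumptions):

    from pyhive import hive  # pip install pyhive

    conn = hive.Connection(host="localhost", port=10000, username="hive")
    cursor = conn.cursor()

    # A typical warehouse-style aggregation over a Hive table
    cursor.execute("SELECT year, COUNT(*) FROM flights GROUP BY year")
    for year, n in cursor.fetchall():
        print(year, n)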
Dockerized ETL with AWS, TDengine, and Grafana
This project focuses on building a time series data ETL pipeline using AWS, TDengine, and Grafana. It covers the fundamentals of time series databases, setting up a Weather API, and integrating TDengine with Python. The course also includes deploying a Lambda function using Docker, scheduling it with EventBridge, and visualizing data in Grafana for monitoring insights.
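The Lambda portion reduces to a handler that writes a reading; a rough sketch with the taospy client (host, credentials, and the weather.berlin table are assumptions, and the table is presumed to exist):

    import taos  # pip install taospy

    def handler(event, context):
        """Lambda entry point: write one weather reading to TDengine."""
        conn = taos.connect(host="tdengine.example.com", user="root", password="taosdata")
        cursor = conn.cursor()
        cursor.execute("INSERT INTO weather.berlin VALUES (NOW, 7.5, 81)")  # ts, temp_c, humidity
        conn.close()
        return {"statusCode": 200}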
Azure Data Pipelines with Terraform
This course introduces Azure cloud management and infrastructure automation using Terraform. It covers setting up Azure, managing resources, and deploying infrastructure with Terraform. You'll learn about Terraform commands, project structure, backend deployment, and using modules for efficient infrastructure provisioning. The course also includes deploying services using a Service Principal for secure authentication.