Data Engineering From Notebook To Production

Single Day

Q2 2024

Take a data project from prototype to production quality with the modern data stack

A high-level introduction to all of data engineering in three hours. In this fast-paced survey course, you’ll learn how to take a data project from prototype to a production-quality data platform. We’ll explore techniques to scale up to big data sets and scale out to more features and larger teams. We’ll discuss data architecture and design principles while exploring up-and-coming data tools.

  • Get a complete overview of today’s data engineering tools
  • Learn best practices for data architecture and design
  • Benefit whether you are new to data engineering or a practicing data engineer looking to keep up with current tools and methods

You’ll learn:

  • Essential tools and design principles for the modern data stack.
  • How to improve performance by matching storage format to query type.
  • Strategies for data observability and quality.

You’ll be able to:

  • Accelerate data warehouse development by writing reusable SQL with dbt.
  • Reduce time-to-market by running Jupyter Notebooks in production with Papermill.
  • Build robust, reliable data science pipelines with Apache Airflow.

Is this course for me?

  • Novice and experienced data engineers and MLOps engineers will learn cutting-edge best practices in this rapidly changing field.
  • Back-end engineers transitioning to data engineering will gain critical skills to work effectively on data-intensive applications.
  • Data scientists, data analysts and analytics engineers will understand the data platform development process.
  • Chief Data Officers (CDOs) will gain the necessary technical background to oversee data projects.
  • DevOps engineers will be able to deploy and monitor data systems.

Prerequisites

  • Familiarity with software engineering and architecture in a cloud environment.
  • Comfort with Python, SQL and Jupyter Notebooks is helpful but not required.
  • Some experience with back-end development, data science or databases is required. This is not an introductory course for new developers.

Schedule

Section 1: Data Modeling

  • Course Overview
  • Principles of Data Engineering: Review of design principles for architecting solutions on the modern data stack.
  • ETL or ELT? When ingesting data, store raw data early for later transformation. Use Airbyte to extract data from cloud APIs and Benthos to load and hydrate data. A minimal raw-first sketch follows this list.
  • Data Lakes: Data lakes are the original source of truth and the input/output location for data pipelines. Choose a storage format that matches your query types for improved performance; see the Parquet sketch after this list.
  • Data Warehouses: Data cubes are dead. Long live the distributed data warehouse. Effective ways to use these workhorse data stores. Why all SQL is not the same.
  • dbt: A closer look at dbt, a tool for SQL-based data transformation that is reshaping how data teams work by making SQL modular, reusable and version-controlled.
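
To make “store raw data early” concrete, here is a minimal Python sketch of the raw-first (ELT) pattern. The API URL, paths, and field names are hypothetical placeholders; in practice a tool like Airbyte handles the extraction.

    import json
    import os
    from datetime import datetime, timezone

    import requests

    RAW_DIR = "raw/events"  # hypothetical data lake prefix


    def extract() -> str:
        """Land the API response untouched; schema decisions are deferred."""
        resp = requests.get("https://api.example.com/events")  # placeholder URL
        resp.raise_for_status()
        os.makedirs(RAW_DIR, exist_ok=True)
        path = f"{RAW_DIR}/{datetime.now(timezone.utc):%Y%m%dT%H%M%S}.json"
        with open(path, "w") as f:
            f.write(resp.text)
        return path


    def transform(raw_path: str) -> list[dict]:
        """A separate, replayable step: reshape the stored raw data."""
        with open(raw_path) as f:
            events = json.load(f)
        return [{"id": e["id"], "ts": e["timestamp"]} for e in events]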
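
And to illustrate matching storage format to query type: a columnar format such as Parquet lets an analytical query read only the columns it touches. A small pandas sketch (file name and columns are illustrative; to_parquet assumes pyarrow or fastparquet is installed):

    import pandas as pd

    # A toy events table, written in a columnar format.
    df = pd.DataFrame({
        "user_id": [1, 2, 3],
        "country": ["DE", "US", "US"],
        "amount": [9.99, 4.50, 12.00],
    })
    df.to_parquet("events.parquet")

    # An analytical query touching two columns reads only those column
    # chunks, instead of scanning whole rows as a CSV would force.
    subset = pd.read_parquet("events.parquet", columns=["country", "amount"])
    print(subset.groupby("country")["amount"].sum())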

Section 2: Data Flow

  • Data Pipelines: Complex processing and transformation beyond what’s possible in a SQL-based data warehouse. Use pandas APIs in Spark with Koalas; see the pandas-on-Spark sketch after this list.
  • Dashboards and Discovery: Dashboards are the UI of data systems. Build custom dashboards with Apache Superset. Communicate data status and timeliness to end users.
  • Workflow Orchestration: How does anything happen? Workflow orchestration is a better approach than fragile and frustrating cron jobs.
  • Apache Airflow: A closer look at Apache Airflow, the leading orchestration engine. A minimal DAG sketch follows this list.
  • Streaming: The many varieties of “real time” and the (limited) role of streaming systems such as Kafka.
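
A taste of the pandas-on-Spark idea from the Data Pipelines topic: since Spark 3.2, Koalas ships inside Spark as pyspark.pandas, which is what this minimal sketch assumes. The data is a toy placeholder.

    import pyspark.pandas as ps  # Koalas was merged into Spark as pyspark.pandas

    # The familiar pandas API, but every operation is planned and executed
    # by Spark, so it scales beyond a single machine's memory.
    df = ps.DataFrame({
        "country": ["DE", "US", "US"],
        "amount": [9.99, 4.50, 12.00],
    })
    print(df.groupby("country")["amount"].sum().sort_index())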
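
And a minimal DAG using the TaskFlow API of recent Airflow 2.x releases, standing in for a chain of fragile cron jobs; the DAG name, schedule, and tasks are illustrative.

    from datetime import datetime

    from airflow.decorators import dag, task


    @dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
    def example_pipeline():
        @task
        def extract() -> list:
            return [1, 2, 3]  # placeholder for a real extraction step

        @task
        def load(rows: list) -> None:
            print(f"loaded {len(rows)} rows")

        load(extract())  # the data flow itself declares the dependency


    example_pipeline()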

Section 3: Data in Production

  • Notebooks in Production: Use Papermill to run a Jupyter Notebook like a command-line script and nbdev to create libraries; see the Papermill sketch after this list.
  • Data Quality: Bad data is toxic and invisible. Maintain data quality with Great Expectations and deequ; a Great Expectations sketch follows this list.
  • Testing: How to test software when there are no right answers. Write better tests with Coverage. Test DataFrames with pandera, as in the schema sketch after this list.
  • Debugging and Performance Tuning: Long iterations and lack of reproducibility make debugging hard. Data lineage can help. Be a scientist when tuning performance.
  • Operations: Achieve reproducibility with data CI/CD. Practice DevOps, not ClickOps!
  • Machine Learning Operations and Data Engineering: More similar than you might think. Monitor models with Evidently AI and track experiments with MLflow; see the MLflow sketch after this list.
  • The Future of Data: Emerging trends in data systems, including closing the loop with Reverse ETL.
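
The Papermill pattern from Notebooks in Production is a single call; the notebook names and parameters here are hypothetical.

    import papermill as pm

    # Execute a notebook like a script: parameters are injected into the
    # cell tagged "parameters", and a fully rendered copy is saved.
    pm.execute_notebook(
        "train_model.ipynb",            # hypothetical input notebook
        "runs/train_model_2024.ipynb",  # executed copy doubles as an audit log
        parameters={"learning_rate": 0.01, "epochs": 10},
    )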
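
For Data Quality, a minimal Great Expectations check using its classic pandas-backed API (recent releases reorganize this interface, so treat it as a sketch); the frame and expectations are toy examples.

    import great_expectations as ge
    import pandas as pd

    df = pd.DataFrame({"order_id": [1, 2, None], "amount": [10.0, -5.0, 7.5]})

    # Wrap the frame so expectations can be declared and checked directly.
    gdf = ge.from_pandas(df)
    r1 = gdf.expect_column_values_to_not_be_null("order_id")
    r2 = gdf.expect_column_values_to_be_between("amount", min_value=0)
    print(r1.success, r2.success)  # False, False: both bad rows are caught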
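
For Testing, pandera turns DataFrame assumptions into a declarative schema that a unit test can assert; the columns and checks are illustrative.

    import pandas as pd
    import pandera as pa

    # A contract for the frame a pipeline step must produce.
    schema = pa.DataFrameSchema({
        "order_id": pa.Column(int, unique=True),
        "amount": pa.Column(float, pa.Check.ge(0)),
    })


    def test_orders_conform():
        df = pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 7.5]})
        schema.validate(df)  # raises SchemaError on any violation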
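
And the core of MLflow experiment tracking is only a few lines; the experiment name, parameter, and metric are placeholders.

    import mlflow

    mlflow.set_experiment("churn-model")  # hypothetical experiment name

    # Each run records parameters and metrics for later comparison, making
    # model experiments as traceable as data pipeline runs.
    with mlflow.start_run():
        mlflow.log_param("learning_rate", 0.01)
        mlflow.log_metric("auc", 0.87)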
