Advanced · 3 days

Azure Databricks

Data engineering and machine learning with Apache Spark

Overview

Azure Databricks is Azure's managed Apache Spark platform for large-scale data engineering and machine learning workloads. This training covers Spark fundamentals, Delta Lake, Structured Streaming, MLflow for experiment tracking, and production deployment patterns.

What you'll learn

  • Provision and configure Databricks workspaces and clusters
  • Process large datasets efficiently using PySpark and Spark SQL
  • Build robust data pipelines using Delta Lake and medallion architecture
  • Process real-time data streams with Structured Streaming
  • Track ML experiments and deploy models using MLflow
  • Integrate Databricks with Azure Data Lake, Synapse, and Azure ML

Programme

Day 1 — Spark & Databricks fundamentals
  • Databricks architecture: clusters, notebooks, and the workspace
  • Apache Spark core concepts: RDDs, DataFrames, partitioning
  • PySpark: reading, transforming, and writing large datasets
  • Spark SQL: analytical queries at scale
  • Cluster configuration and cost optimisation
  • Hands-on: process a multi-GB dataset with PySpark transformations
Day 2 — Delta Lake & medallion architecture
  • Delta Lake: ACID transactions, time travel, and schema evolution
  • Medallion architecture: bronze, silver, and gold layers
  • Optimising Delta tables: Z-ordering, compaction, vacuuming
  • Delta Live Tables: declarative pipeline development
  • Structured Streaming: real-time data processing with Databricks
  • Hands-on: build a medallion lakehouse pipeline end to end
Day 3 — Machine learning & production patterns
  • MLflow: experiment tracking, model registry, and deployment
  • Feature engineering at scale with Databricks Feature Store
  • Training distributed ML models with Spark ML and scikit-learn
  • Model serving: REST endpoints from the Databricks model registry
  • CI/CD for Databricks: Repos, jobs, and automated testing
  • Hands-on: train, track, and deploy a classification model end to end

Who is this for?

  • Data engineers building large-scale data pipelines
  • Data scientists needing a scalable ML experimentation platform
  • Platform engineers evaluating Databricks for enterprise workloads

Prerequisites

  • Solid Python experience
  • Familiarity with SQL and data modelling
  • Basic understanding of distributed computing concepts is helpful

Tools & technologies covered

Azure Databricks, Apache Spark, PySpark, Delta Lake, MLflow, Structured Streaming, Delta Live Tables