Advanced · 3 days
Azure Databricks
Data engineering and machine learning with Apache Spark
Overview
Azure Databricks is a leading platform for large-scale data engineering and machine learning workloads on Azure. This training covers Spark fundamentals, Delta Lake, Structured Streaming, MLflow for experiment tracking, and production deployment patterns.
What you'll learn
- Provision and configure Databricks workspaces and clusters
- Process large datasets efficiently using PySpark and Spark SQL
- Build robust data pipelines using Delta Lake and medallion architecture
- Process real-time data streams with Structured Streaming
- Track ML experiments and deploy models using MLflow
- Integrate Databricks with Azure Data Lake, Synapse, and Azure ML
Programme
Day 1 — Spark & Databricks fundamentals
- Databricks architecture: clusters, notebooks, and the workspace
- Apache Spark core concepts: RDDs, DataFrames, partitioning
- PySpark: reading, transforming, and writing large datasets
- Spark SQL: analytical queries at scale
- Cluster configuration and cost optimisation
- Hands-on: process a multi-GB dataset with PySpark transformations
Day 2 — Delta Lake & medallion architecture
- Delta Lake: ACID transactions, time travel, and schema evolution
- Medallion architecture: bronze, silver, and gold layers
- Optimising Delta tables: Z-ordering, compaction, vacuuming
- Delta Live Tables: declarative pipeline development
- Structured Streaming: real-time data processing with Databricks
- Hands-on: build a medallion lakehouse pipeline end to end
Day 3 — Machine learning & production patterns
- MLflow: experiment tracking, model registry, and deployment
- Feature engineering at scale with Databricks Feature Store
- Training distributed ML models with Spark ML and scikit-learn
- Model serving: REST endpoints from the Databricks model registry
- CI/CD for Databricks: Repos, jobs, and automated testing
- Hands-on: train, track, and deploy a classification model end to end
Who is this for?
- Data engineers building large-scale data pipelines
- Data scientists needing a scalable ML experimentation platform
- Platform engineers evaluating Databricks for enterprise workloads
Prerequisites
- Solid Python experience
- Familiarity with SQL and data modelling
- Basic understanding of distributed computing concepts is helpful
Tools & technologies covered
Azure Databricks · Apache Spark · PySpark · Delta Lake · MLflow · Structured Streaming · Delta Live Tables
Not sure which course fits your team?
Talk to us — we'll match you to the right training path.