Advanced · 3 days
Azure HDInsight
Managed open-source analytics with Hadoop, Spark, Kafka, and HBase
Overview
Azure HDInsight is Microsoft's fully managed cloud distribution of open-source analytics frameworks — including Apache Spark, Apache Hadoop, Apache Kafka, Apache HBase, and Interactive Query (Hive LLAP). Designed for organisations with existing open-source workloads or compliance requirements that demand OSS tooling, this training covers cluster design, PySpark data engineering, real-time streaming with Kafka, NoSQL with HBase, and enterprise security with the Enterprise Security Package (ESP). It also includes an honest evaluation of migration paths to Azure Databricks and Microsoft Fabric.
What you'll learn
- Provision and configure HDInsight clusters for Spark, Hadoop, Kafka, and HBase workloads
- Process large datasets using Apache Spark with PySpark and Hive SQL
- Design real-time streaming pipelines using Apache Kafka on HDInsight
- Store and query wide-column NoSQL data with Apache HBase and the Apache Phoenix SQL layer
- Secure HDInsight clusters using the Enterprise Security Package and Microsoft Entra ID (formerly Azure Active Directory) integration
- Evaluate and plan migration paths to Azure Databricks and Microsoft Fabric
Programme
Day 1 — Cluster architecture & Apache Spark
- HDInsight architecture: cluster types, node roles, head nodes, and worker nodes
- Creating and configuring HDInsight clusters: sizing, auto-scale, and cost control strategies
- Azure Data Lake Storage Gen2 (ADLS Gen2) as the default storage for HDInsight clusters
- Apache Spark on HDInsight: PySpark data engineering, Jupyter notebooks, and Spark SQL
- Submitting batch jobs with spark-submit and monitoring with Apache Ambari
- Hands-on: process a large dataset with PySpark and write the results to ADLS Gen2 in Delta Lake format
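To give a flavour of the batch-submission topic on Day 1, here is a minimal Python sketch that assembles a spark-submit command for YARN cluster mode. The helper name `build_spark_submit`, the `abfss://` script path, and the executor sizes are hypothetical placeholders rather than course material or recommended settings. Only the spark-submit flags themselves are standard.

```python
import shlex


def build_spark_submit(app_path, app_args=(), num_executors=4,
                       executor_cores=2, executor_memory="4g", conf=None):
    """Return a spark-submit argument list for YARN cluster mode.

    All sizing defaults are illustrative assumptions, not tuning advice.
    """
    cmd = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--num-executors", str(num_executors),
        "--executor-cores", str(executor_cores),
        "--executor-memory", executor_memory,
    ]
    # Emit Spark properties in sorted order so the command is reproducible.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_path)   # the PySpark script to run
    cmd.extend(app_args)   # arguments passed through to the script
    return cmd


if __name__ == "__main__":
    # Hypothetical storage account, container, and script name.
    command = build_spark_submit(
        "abfss://jobs@exampleaccount.dfs.core.windows.net/etl/clean_events.py",
        app_args=["--run-date", "2024-01-31"],
        conf={"spark.sql.shuffle.partitions": "200"},
    )
    print(shlex.join(command))
```

Building the command as a list rather than a single string avoids shell-quoting mistakes when it is handed to a scheduler or to `subprocess.run`.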
Day 2 — Hadoop, Hive LLAP, HBase & Kafka
- Apache Hadoop and HDFS in the Azure context: when Hadoop is still the right choice
- Interactive Query (Hive LLAP): sub-second SQL analytics on terabyte datasets
- Connecting Hive to Power BI via DirectQuery for live interactive reporting
- Apache HBase on HDInsight: wide-column storage, Phoenix SQL layer, and real-world use cases
- Apache Kafka on HDInsight: topic design, producers, consumers, and MirrorMaker 2 replication
- Hands-on: build a Kafka-to-HBase streaming pipeline with a Hive reporting layer
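Row key design is one of the decisions a Kafka-to-HBase pipeline like the Day 2 hands-on has to make. The sketch below shows one common pattern, a salted key with a reversed timestamp, in plain Python. It is an illustration under stated assumptions, not necessarily the design used in the lab: the function name `row_key`, the bucket count of 16, the `|` separator, and the device-ID scheme are all hypothetical.

```python
import hashlib

SALT_BUCKETS = 16          # assumption: roughly matches the number of pre-split regions
MAX_MILLIS = 10**13 - 1    # largest 13-digit epoch-millisecond value


def row_key(device_id: str, event_millis: int, buckets: int = SALT_BUCKETS) -> bytes:
    """Build an HBase row key of the form <salt>|<device_id>|<reversed timestamp>."""
    # A stable hash of the device ID picks the salt bucket, so writes from many
    # devices spread across regions instead of piling onto one.
    digest = hashlib.md5(device_id.encode("utf-8")).digest()
    salt = digest[0] % buckets
    # Subtracting from a fixed maximum makes newer events sort first
    # within a device, because HBase orders rows lexicographically.
    reversed_ts = MAX_MILLIS - event_millis
    return f"{salt:02d}|{device_id}|{reversed_ts:013d}".encode("utf-8")


if __name__ == "__main__":
    older = row_key("sensor-1", 1_700_000_000_000)
    newer = row_key("sensor-1", 1_700_000_001_000)
    print(older, newer, newer < older)
```

Because the salt is derived from the device ID rather than chosen at random, all rows for one device share a prefix and can still be read with a single prefix scan.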
Day 3 — Enterprise security, governance & migration
- Enterprise Security Package (ESP): integrating HDInsight with Microsoft Entra ID (formerly Azure Active Directory)
- Apache Ranger: column-level security, row-level filtering, and data masking policies
- Network security: VNet integration, private clusters, and hub-spoke network topology
- Monitoring with Apache Ambari, Azure Monitor, and Log Analytics workspaces
- When to migrate: moving Spark workloads to Azure Databricks or Microsoft Fabric
- Hands-on: configure ESP on a Kafka cluster and apply Apache Ranger security policies
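As a rough picture of what a data masking policy looks like when managed as code, the sketch below builds a Hive column-masking policy as a Python dictionary and serialises it to JSON. The field names follow the Apache Ranger policy model as we understand it and should be checked against the Ranger version on your cluster. The helper name `masking_policy`, the service name, and the database, table, column, and group values are hypothetical.

```python
import json


def masking_policy(service, database, table, column, groups,
                   mask_type="MASK_SHOW_LAST_4"):
    """Return a Ranger-style data masking policy for one Hive column."""

    def resource(value):
        # Each resource entry targets exactly one named object.
        return {"values": [value], "isExcludes": False, "isRecursive": False}

    return {
        "service": service,
        "name": f"mask-{database}-{table}-{column}",
        "policyType": 1,  # 1 = data masking in Ranger's policy model
        "isEnabled": True,
        "resources": {
            "database": resource(database),
            "table": resource(table),
            "column": resource(column),
        },
        "dataMaskPolicyItems": [
            {
                "groups": list(groups),
                "accesses": [{"type": "select", "isAllowed": True}],
                "dataMaskInfo": {"dataMaskType": mask_type},
            }
        ],
    }


if __name__ == "__main__":
    # All names below are placeholders for illustration only.
    policy = masking_policy("example_hive", "sales", "customers",
                            "card_number", groups=["analysts"])
    print(json.dumps(policy, indent=2))
```

The sketch only constructs the policy document. Submitting it to Ranger, and the authentication that requires, is deliberately left out.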
Who is this for?
- Data engineers running open-source analytics workloads on Azure
- Platform architects managing existing HDInsight or on-premises Hadoop environments
- Teams responsible for migrating Hadoop or Spark workloads to Azure
- Engineers in regulated industries with OSS or open-standard requirements
Prerequisites
- Python or Scala programming experience
- Basic familiarity with the Linux command line
- Understanding of distributed computing concepts and SQL
Tools & technologies covered
Azure HDInsight, Apache Spark, Apache Hadoop, Apache Kafka, Apache HBase, Apache Hive LLAP, Apache Ranger, Azure Data Lake Storage Gen2, Power BI