Step 1: Install Java
Java is an essential prerequisite for installing and running Apache Spark, as Spark is built on the Java Virtual Machine (JVM).
Apache Spark runs on Java 8/11/17. You can download Java 8 from https://www.java.com/en/download/manual.jsp. Alternatively, for OpenJDK, visit the [OpenJDK download page](https://openjdk.java.net/install/index.html) to find the appropriate distribution.
Note: make sure to install Java in a path whose folder names contain no white spaces; otherwise you will get errors like
The system cannot find the path specified
or \Common was unexpected at this time, or other
strange errors.
For example, you can change the default installation destination, as shown in the snapshot below.
Set up the JAVA_HOME environment variable
Search for ‘Environment Variables’ in Windows search and the window below will appear.
Click on Environment Variables -> User variables -> New, and create a variable named JAVA_HOME with the value <full path of your Java installation>
Add JAVA_HOME to the Path variable
Go to the Path variable under User variables -> New and add %JAVA_HOME%\bin
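If you prefer the command line, the same variable can be set with setx from CMD. A minimal sketch, using a hypothetical JDK folder; substitute your own space-free install path:

```
:: Example path only -- point this at your actual JDK folder (no spaces in the path)
setx JAVA_HOME "C:\Java\jdk1.8.0_441"
```

Note that setx only affects newly opened CMD windows, and the Path entry itself is best added through the GUI above, since setx would rewrite the user Path with a merged copy of the user and system values.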
At this point you have downloaded and set up Java.
You can verify that Java is installed and the environment variables are properly set up by opening CMD and running the java -version command. The output should look like the following:
java version "1.8.0_441"
Java(TM) SE Runtime Environment (build 1.8.0_441-b07)
Java HotSpot(TM) 64-Bit Server VM (build 25.441-b07, mixed mode)
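If the version string does not appear, check what CMD is actually resolving; both commands below are standard Windows tools:

```
:: Show which java.exe is found first on PATH
where java
:: Echo the variable set earlier (open a new CMD window after setting it)
echo %JAVA_HOME%
```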
Step 2: Download Apache Spark
Visit the official Apache Spark website at spark.apache.org/downloads.html.
Download the latest version of Apache Spark. I am downloading version 3.5.5 by clicking the download link in step 3 on the page.
Also note the Apache Hadoop major version shown in step 2 on the page, because we will need it later; for me it is Hadoop 3.
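If you would rather script the download, Windows 10 and later ship curl.exe. A sketch assuming the 3.5.5 build for Hadoop 3 and the Apache archive mirror; verify the exact URL on the downloads page:

```
:: URL follows the Apache archive naming pattern for Spark 3.5.5 built for Hadoop 3
:: (an assumption -- confirm the link on spark.apache.org/downloads.html)
curl -L -o spark-3.5.5-bin-hadoop3.tgz https://archive.apache.org/dist/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz
```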
Step 3: Extract .tgz file
Extract the downloaded .tgz file into C:\spark. Below is the snapshot after extracting.
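Windows 10 and later also bundle bsdtar, so the extraction can be scripted from CMD. A sketch, assuming the archive name from Step 2; --strip-components=1 drops the top-level spark-3.5.5-bin-hadoop3 folder so the contents land directly in C:\spark:

```
:: Create the target folder and unpack the archive into it
mkdir C:\spark
tar -xzf spark-3.5.5-bin-hadoop3.tgz -C C:\spark --strip-components=1
```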
Step 4: Download winutils.exe
winutils.exe builds matching each Hadoop major version are commonly published in community GitHub repositories (for example, cdarlint/winutils); pick the build matching the Hadoop version noted in Step 2. I am downloading it and copying it into the
C:\spark\hadoop\bin
folder that I have created. The full path of the file will be C:\spark\hadoop\bin\winutils.exe
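As a quick sanity check that the binary actually runs on your machine, you can invoke it with no arguments; it typically prints its usage text, while a missing-DLL or bad-image error suggests a mismatched build or a missing Visual C++ runtime:

```
:: Should print winutils usage/help text if the build matches your system
C:\spark\hadoop\bin\winutils.exe
```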
Step 5: Setup Spark and Hadoop environment variables
Similar to the JAVA_HOME environment variable, we now need to set up the SPARK_HOME and HADOOP_HOME variables under User variables.
Search for ‘Environment Variables’ in Windows search.
Click on Environment Variables -> User variables -> New, and create a variable named HADOOP_HOME with the value <full path of your hadoop installation, up to but not including the bin folder>; with the layout above, that is C:\spark\hadoop.
Then, under User variables -> New, create a variable named SPARK_HOME with the value <full path of your spark installation, up to but not including the bin folder>; here, that is C:\spark.
Finally, set up the Path variable as below by adding entries for %SPARK_HOME%\bin and %HADOOP_HOME%\bin.
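If you prefer CMD here as well, a sketch with setx, assuming the C:\spark layout from the earlier steps (open a new CMD window afterwards for the variables to take effect):

```
:: Paths assume Spark was extracted to C:\spark and winutils sits in C:\spark\hadoop\bin
setx SPARK_HOME "C:\spark"
setx HADOOP_HOME "C:\spark\hadoop"
```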
Step 6: Test Apache Spark setup
Open CMD and type pyspark or spark-shell; you should see results like the following for PySpark.
```
C:\Users\MaxImtiaz>PySpark
Python 3.13.2 (tags/v3.13.2:4f8bb39, Feb 4 2025, 15:23:48) [MSC v.1942 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Unable to get Charset 'cp65001' for property 'sun.stderr.encoding', using default windows-1252 and continuing.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/18 20:52:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.5
      /_/

Using Python version 3.13.2 (tags/v3.13.2:4f8bb39, Feb 4 2025 15:23:48)
Spark context Web UI available at http://host.docker.internal:4040
Spark context available as 'sc' (master = local[*], app id = local-1742327548139).
SparkSession available as 'spark'.
```
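For a smoke test beyond the shell banner, you can also run the SparkPi example that ships with the distribution. A sketch from CMD; the examples jar name below assumes the default 3.5.5/Scala 2.12 build, so adjust it if the file under %SPARK_HOME%\examples\jars differs:

```
:: Runs the bundled SparkPi example locally; the output should include
:: a line like "Pi is roughly 3.14..."
spark-submit --class org.apache.spark.examples.SparkPi --master local[*] %SPARK_HOME%\examples\jars\spark-examples_2.12-3.5.5.jar 10
```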