Apache Spark: Setup and Use with IPython Notebooks
When it comes to exploring and analysing large amounts of data, few tools beat the Apache Spark + IPython Notebook combination. However, setting up Apache Spark for my own work involved a lot of trial and error, as most instructions assume prior knowledge of Spark's internal settings. I had to plough through documentation in various places before I could get a reasonably working Spark + IPython setup going. So here I chronicle my experience in setting up Apache Spark for use with IPython notebooks, attempting at each step to explain the rationale and the settings used.
Environment Assumptions
Although my aim is to hand-hold the reader through the setup process, I still have to make certain basic assumptions so that this post does not become overly long by trying to cover every possible setup condition.
- Linux environment
  - I will be using Ubuntu 16.04
  - Other Linux distributions would do just fine
  - The steps would be quite similar on Mac OS as well
- All commands are executed by the default ubuntu user. You can execute the commands as another user, but I would recommend that the account be part of the sudoers group.
Actually that's about all I would assume.
Steps Overview
1. Preparing the Linux environment
2. Installing Apache Spark
3. Starting IPython Notebook with Apache Spark
Step 1: Preparing the Linux environment
Install JDK by typing:
sudo apt-get install default-jdk
Then add JAVA_HOME to your environment by appending the following line to your .bashrc file. Note that JAVA_HOME should point at the JDK installation directory rather than the java binary itself; on Ubuntu, default-jdk installs under /usr/lib/jvm:
export JAVA_HOME=/usr/lib/jvm/default-java
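To confirm the JDK is installed and the variable is picked up, reload your shell and run a quick check (the exact version string will depend on your JDK):
source ~/.bashrc
echo $JAVA_HOME   # should print the JDK directory
java -version     # should report the installed OpenJDK version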
Next, you need to install Python. I recommend Anaconda's Python distribution. I am using Python 3 here, but Python 2 would do just fine as well.
Once you've downloaded the Anaconda Python package from here (say the package file name is Anaconda3-4.4.0-Linux-x86_64.sh), install it by typing:
cd <folder_where_you_saved_anaconda>
bash Anaconda3-4.4.0-Linux-x86_64.sh -p /usr/local/anaconda
The parameter -p tells the script to install Anaconda Python into the /usr/local/anaconda folder (the same path used in the ownership check below).
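If you declined the installer's prompt to prepend Anaconda to your PATH, add it yourself; a minimal sketch, assuming the /usr/local/anaconda prefix used above:
# append to ~/.bashrc so Anaconda's python is found before the system one
export PATH=/usr/local/anaconda/bin:$PATH
Then reload your shell and verify that which python points at /usr/local/anaconda/bin/python.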
Optional step
When the installation is done, do a quick check on the ownership of the Anaconda folder by typing:
ls -al /usr/local
See if the owner and owner group are both ubuntu (or whatever user name/group you are using). If not (it might say root if you decide to fully automate this installation process with a script), change the ownership to yourself by typing:
sudo chown -R ubuntu:ubuntu /usr/local/anaconda
This is only necessary if you intend to install extra packages using pip rather than conda (Anaconda's own package manager).
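A quick sanity check, again assuming the /usr/local/anaconda prefix, confirms that pip now installs into your Anaconda environment without sudo (requests is just an arbitrary example package):
which pip             # should print /usr/local/anaconda/bin/pip
pip install requests  # should succeed without sudo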
Step 2: Installing Apache Spark
Download the latest Spark package from the Spark homepage. I recommend the pre-built versions so that you do not have to build Spark from source yourself.
Once downloaded, untar the package into /usr/local as well by typing:
tar -xvf <path_to_spark>.tar.gz -C /usr/local/
Then add SPARK_HOME to your environment variables, again by appending to your .bashrc file:
export SPARK_HOME=/usr/local/<spark_root_folder>
And that's it! Spark is installed.
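A quick way to confirm the installation worked (the version banner printed will depend on the package you downloaded):
source ~/.bashrc
$SPARK_HOME/bin/spark-submit --version   # prints the Spark version banner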
Step 3: Starting IPython Notebook with Apache Spark
To start Spark with IPython, type in the following command (or save it in a shell script for reusability; see the sketch after the breakdown below):
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser" $SPARK_HOME/bin/pyspark --master local[2]
Let's break the command down. The first part, PYSPARK_DRIVER_PYTHON=jupyter, tells Spark to use Jupyter as the driver Python. The second part passes options to the driver Python and tells Spark to start a Jupyter notebook without automatically opening a browser window. The third part invokes Spark in Python in local mode with 2 cores.
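If you launch this often, a small wrapper script saves typing; a minimal sketch (the file name start-spark-notebook.sh is just a suggestion):
#!/bin/bash
# start-spark-notebook.sh - launch a Jupyter notebook backed by a local Spark driver
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser"
$SPARK_HOME/bin/pyspark --master local[2]
Make it executable with chmod +x start-spark-notebook.sh and run it with ./start-spark-notebook.sh.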
If you are deploying Spark to a standalone cluster of nodes, change --master local[2] to --master spark://<ip_address_of_master_node>:<master_node_listening_port> (the standalone master listens on port 7077 by default).
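For example, with a hypothetical master at 10.0.0.1 on the default port, the launch command becomes:
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser" $SPARK_HOME/bin/pyspark --master spark://10.0.0.1:7077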
Other notes if deploying to a cluster...
- Apache Spark configurations are stored in the following files in $SPARK_HOME/conf: spark-defaults.conf and spark-env.sh.
  - spark-defaults.conf stores the basic, cluster-wide configuration.
  - spark-env.sh stores node-specific configuration, for example when you want a particular worker node to have a slightly different configuration from the default. This file is read when Spark is started on the node itself.
  - Note that, per the Spark documentation, the spark.pyspark.python property takes precedence over the PYSPARK_PYTHON environment variable if both are set.
- It is useful to set spark.pyspark.driver.python and spark.pyspark.python to the proper Python path. This is especially true when using Anaconda as the driver Python, to make sure that Spark uses Anaconda instead of the default system Python. In this way, all the packages available in Anaconda will also be available to Spark; for example, one might want to use numpy functions in UDFs. If a worker node has a different Python path, set it in spark-env.sh via PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON (see the sketch after this list).
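As a concrete illustration, here is roughly what a worker-side override could look like in $SPARK_HOME/conf/spark-env.sh, assuming the driver uses Anaconda at /usr/local/anaconda while this worker only has a system Python 3 (both paths are assumptions; adjust them to your own layout):
# spark-env.sh on a worker whose Python lives somewhere other than the default
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/local/anaconda/bin/python
The cluster-wide equivalents go into $SPARK_HOME/conf/spark-defaults.conf as spark.pyspark.python and spark.pyspark.driver.python entries, written as one 'key value' pair per line.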