
Apache Spark: Setting Up and Using It with IPython Notebooks

When it comes to exploring and analysing large amounts of data, few tools beat the Apache Spark + IPython Notebook combination. However, setting up Apache Spark for my own work involved a lot of trial and error, as the available instructions tend to assume prior knowledge of Spark's internal settings. I had to plough through documentation in various places before I could get a reasonably working Spark + IPython setup going. So here I chronicle my experience setting up Apache Spark for use with IPython notebooks, attempting at each step to explain the rationale behind the settings used.

Environment Assumptions

Although my aim is to hand-hold the reader through the setup process, I still have to make certain basic assumptions so that this post does not become overly long by trying to cover every possible setup condition.

  1. Linux environment

  • I will be using Ubuntu 16.04

  • Other Linux distributions would do just fine

  • The steps would be quite similar on macOS as well

  2. All commands are executed by the default ubuntu user

  • You can execute the commands as another user but I would recommend that the account be part of the sudoers group.

Actually that's about all I would assume.

Steps Overview

  1. Preparing the Linux environment

  2. Installing Apache Spark

  3. Starting IPython Notebook with Apache Spark

Step 1: Preparing the Linux environment

Install JDK by typing:

sudo apt-get install default-jdk

Then add JAVA_HOME to your environment by appending the following line to your .bashrc file:

export JAVA_HOME=/usr/lib/jvm/default-java
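
Note that JAVA_HOME should point at the JDK's installation directory, not at the java binary itself; /usr/lib/jvm/default-java is the symlink that Ubuntu's default-jdk package maintains for this purpose. If you are unsure where the JDK landed on your system, the following quick check is only a sketch (the resolved path will vary with your distribution and JDK version), but it shows how to find the real directory behind the java command and confirm the variable is picked up:

# Resolve the real JDK location behind the java command;
# JAVA_HOME should be the directory above the bin/ folder in the output
readlink -f "$(which java)"

# Reload .bashrc and confirm Java is visible
source ~/.bashrc
echo $JAVA_HOME
java -version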

Next, you need to install Python. You can do this using Anaconda's Python distribution. I am using Python 3 here, but Python 2 would do just fine as well.

Once you've downloaded the Anaconda Python package from the Anaconda download page (say the package file name is Anaconda3-4.4.0-Linux-x86_64.sh), install it by typing:

cd <folder_where_you_saved_anaconda>

bash Anaconda3-4.4.0-Linux-x86_64.sh -p /usr/local/anaconda

The -p parameter tells the installer to put Anaconda Python into the /usr/local/anaconda folder.
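
To actually use this Anaconda Python from the command line, its bin directory also needs to be on your PATH. A minimal sketch, assuming the /usr/local/anaconda prefix used above (append this to your .bashrc next to JAVA_HOME):

export PATH=/usr/local/anaconda/bin:$PATH

# Quick sanity check: both should resolve to /usr/local/anaconda/bin
which python
which conda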

Optional step

When the installation is done, do a quick check on the ownership of the Anaconda folder by typing:

ls -al /usr/local

Check whether the owner and group are both ubuntu (or whatever user name/group you are using). If not (it might say root, for example if you decide to fully automate this installation with a script), change the ownership to yourself by typing:

sudo chown -R ubuntu:ubuntu /usr/local/anaconda

This is only necessary if you intend to install extra packages using pip rather than conda (Anaconda's own package manager).
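
With the ownership fixed, pip installs into the Anaconda folder work without sudo. For example (findspark here is just an illustrative package, handy for locating Spark from plain Python sessions; any pip-installable package behaves the same way):

# Runs with your own user's permissions, no sudo required
pip install findspark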

Step 2: Installing Apache Spark

Download the latest Spark package from the Spark downloads page. I recommend one of the pre-built versions (pre-built for Hadoop), so you do not have to compile Spark yourself.

Once downloaded, untar the package into /usr/local as well by typing (prefix the command with sudo if /usr/local is not writable by your user):

tar -xvf <path_to_spark>.tar.gz -C /usr/local/

Then add SPARK_HOME to your environment by appending the following line to your .bashrc file:

export SPARK_HOME=/usr/local/<spark_root_folder>

And that's it! Spark is installed.
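
As a concrete example (the version below is only an illustration; use whichever pre-built package you actually downloaded), the two commands above for a spark-2.2.0-bin-hadoop2.7.tgz download would be:

tar -xvf spark-2.2.0-bin-hadoop2.7.tgz -C /usr/local/
export SPARK_HOME=/usr/local/spark-2.2.0-bin-hadoop2.7

# Quick check that the installation works: this prints the Spark version banner
$SPARK_HOME/bin/spark-submit --version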

Step 3: Starting IPython Notebook with Apache Spark

To start Spark with IPython, type in the following command (or save it in a shell script for reusability; a sketch follows the explanation below):

PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser" $SPARK_HOME/bin/pyspark --master local[2]

Let's break the command down. The first part, PYSPARK_DRIVER_PYTHON=jupyter, tells Spark to launch the driver through the jupyter executable instead of a plain Python interpreter. The second part, PYSPARK_DRIVER_PYTHON_OPTS, passes arguments to that executable and tells it to start a notebook server without automatically opening a browser window. The third part invokes PySpark in local mode with 2 cores.
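
The reusable shell script mentioned above can be as small as the following sketch (start-pyspark-notebook.sh is just a placeholder name; adjust the master setting to your needs):

#!/bin/bash
# start-pyspark-notebook.sh: launch a Jupyter notebook backed by a local Spark driver
PYSPARK_DRIVER_PYTHON=jupyter \
PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser" \
$SPARK_HOME/bin/pyspark --master local[2]

Make it executable with chmod +x start-pyspark-notebook.sh and run it whenever you want a fresh notebook session.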

If you are deploying Spark to a standalone cluster of nodes, change --master local[2] to --master spark://<ip_address_of_master_node>:<master_node_listening_port> (the standalone master listens on port 7077 by default).
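
For instance, with a (purely hypothetical) master node at 192.168.1.10 listening on the default port 7077, the launch command becomes:

PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser" $SPARK_HOME/bin/pyspark --master spark://192.168.1.10:7077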

Other notes if deploying to a cluster...

  • Apache Spark configurations are stored in the following files in $SPARK_HOME/conf:

    • spark-defaults.conf

    • spark-env.sh

  • spark-defaults.conf stores the basic or cluster-wide configurations

  • spark-env.sh stores node-specific configuration, for example when you want a particular worker node to have a slightly different configuration from the default. This file is read when Spark starts on the node itself.

  • Note that the two files play different roles: spark-defaults.conf sets Spark properties, while spark-env.sh sets environment variables. For the Python path specifically, the spark.pyspark.* properties take precedence over the PYSPARK_* environment variables if both are set.

  • It is useful to set spark.pyspark.driver.python and spark.pyspark.python to the proper Python path (see the sketch after this list)

    • This is especially true when using Anaconda as the driver Python, to make sure that Spark uses Anaconda instead of the default system Python

    • In this way, all the packages available in Anaconda will also be available to Spark. For example, one might want to use numpy functions in UDFs.

    • If a worker node has a different Python path, set it in that node's spark-env.sh via PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON
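
As a concrete sketch of the Python-path settings above (the paths assume the /usr/local/anaconda install from earlier; adjust them to your own layout), the Spark properties go into $SPARK_HOME/conf/spark-defaults.conf and the environment variables into $SPARK_HOME/conf/spark-env.sh:

# $SPARK_HOME/conf/spark-defaults.conf  (Spark properties, whitespace-separated)
spark.pyspark.driver.python    /usr/local/anaconda/bin/python
spark.pyspark.python           /usr/local/anaconda/bin/python

# $SPARK_HOME/conf/spark-env.sh  (sourced as a shell script when Spark starts on a node)
export PYSPARK_DRIVER_PYTHON=/usr/local/anaconda/bin/python
export PYSPARK_PYTHON=/usr/local/anaconda/bin/python

In practice it is simplest to pick one mechanism per setting rather than relying on both at once.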