Installing Docker

Installing the required software


To run the examples and exercises from this course without the trouble of setting up a working Spark and BigDL environment yourself, we have prepared a pre-packaged solution so you can get hands-on with a minimal amount of setup. We achieve this through containers: pieces of software not entirely unlike virtual machines that let you create virtual, isolated environments with all the dependencies and software required to run the examples. To create and manage these containers we will use Docker, so you will need to install it if it is not already on your system. There is some setup involved in getting Docker up and running, so please follow the official instructions HERE to install it.
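If you want to check that Docker is working before moving on, you can, for example, print the installed version and run Docker's own test image; both are standard Docker commands, and the test image simply prints a confirmation message:

$ docker --version
$ docker run hello-world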


Preparing the Environment

Building the Docker Images


Once Docker is installed and running on your system, you need to build the container images for our environment. Container images are like snapshots of a system at a given point in time; they let you replicate and restore runtime environments almost instantly, at the expense of having to build them first.

To save you the hassle of building each image manually, we have created a build.sh script HERE that builds them in the right order, so you only have to run:

$ bash build.sh
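Once the script finishes, you can list the images it produced with the standard docker images command (the exact image names depend on the build script, so treat the output as illustrative):

$ docker images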

You can now use these images to deploy either a containerized Spark cluster with a master node and 3 worker nodes, or a BigDL Jupyter notebook server, so you can run the exercises.


Deploying the Spark Cluster

Setting up a containerized Spark cluster on your computer


To run the Spark exercises in a containerized cluster rather than in Standalone or Local mode, we have created a spark_cluster.sh script HERE that lets you create and manage said cluster.

You can start the cluster by passing the command deploy to the spark_cluster.sh script:

$ bash spark_cluster.sh deploy

The script then creates and runs 4 containers (1 master + 3 workers) and performs all the setup required for a running Spark environment. Once it is done, it displays information about the running containers and their addresses:

----------------------------------------------------
Cluster Information - Manager: STANDALONE
----------------------------------------------------
- HDFS:
> Namenode running @ http://172.18.0.2:9870
> DataNode 1 running @ http://172.18.0.3:9864
> DataNode 2 running @ http://172.18.0.4:9864
> DataNode 3 running @ http://172.18.0.5:9864
> DataNode 4 running @ http://172.18.0.6:9864
----------------------------------------------------
- Spark Standalone:
> Master running @ http://172.18.0.2:8080
> Worker 1 running @ http://172.18.0.3:8081
> Worker 2 running @ http://172.18.0.4:8082
> Worker 3 running @ http://172.18.0.5:8083
> Worker 4 running @ http://172.18.0.6:8084
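
You can also inspect the containers themselves at any time with the standard docker ps command, which lists every running container along with its name and status (the container names depend on the cluster script):

$ docker ps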

Getting the nodes' addresses

If you need the node information again at any point while the cluster is running, you can print it with the info command:

$ bash spark_cluster.sh info

Stopping the Spark cluster

Once you are done, you can stop the cluster with the stop command:

$ bash spark_cluster.sh stop

Starting a Jupyter notebook server for Spark

The cluster script can also set up a Jupyter notebook server in a separate container and connect it to the cluster, which is the most convenient way to start working with the notebooks. You can start the Jupyter server with the jupyter command:

$ bash spark_cluster.sh jupyter

Once the container is ready, it will print the URL you can use to connect to it.
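If you close the terminal and lose the URL, you can usually recover it from the container's output with docker logs (this assumes the notebook container is named spark-jupyter, as in the docker cp example below):

$ docker logs spark-jupyter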

Important: As of this writing, Jupyter notebooks require authentication to log in, so we have set spark as the notebook's password.

If you want to run the notebooks we provide instead of copying or writing the code yourself, you will need to copy them into the container at "/home/hadoop/spark-jupyter" using docker cp:

$ docker cp <path-to-notebooks> spark-jupyter:/home/hadoop/spark-jupyter
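
The same command also works in the other direction, so you can, for instance, copy an edited notebook back out of the container (the notebook name here is just an example):

$ docker cp spark-jupyter:/home/hadoop/spark-jupyter/my_notebook.ipynb .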


Preparing Intel BigDL

Setting up a Standalone BigDL Environment


Starting an Intel BigDL Jupyter notebook server

You can also quickly deploy a Jupyter notebook server with BigDL and all its required dependencies (although in Spark Standalone mode rather than in a cluster) through the start_bigdl.sh script HERE:

$ bash start_bigdl.sh
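As with the Spark notebook server, the connection URL is printed once the container is ready; if you miss it, you can usually recover it from the container's output with docker logs (assuming the container is named bigdl-jupyter, as in the stop command below):

$ docker logs bigdl-jupyter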

Important: As of this writing, Jupyter notebooks require authentication to log in, so we have set spark as the notebook's password.

If you want to run the notebooks we provide instead of copying or writing the code yourself, you will need to copy them into the container at "/home/bigdl" using docker cp:

$ docker cp <path-to-notebooks> bigdl-jupyter:/home/bigdl

Stopping the BigDL Jupyter notebook server

To stop the BigDL Jupyter notebook server, simply stop the Docker container:

$ docker stop bigdl-jupyter
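
Stopping the container does not necessarily delete it; if you also want to remove it from your system once you are completely done, you can use the standard docker rm command:

$ docker rm bigdl-jupyter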