Installing the required software
In order to run the examples and exercises from this course without going through all the trouble of setting up a working Spark and BigDL environment, we have prepared a pre-packaged solution so you can start the hands-on work with a minimal amount of setup. We achieve this through the use of containers, pieces of software not quite entirely unlike virtual machines that allow you to create virtual, isolated environments with all the dependencies and software necessary to run the examples. To create and manage containers we will use Docker, so you will need to install it if it is not already installed on your system. There is some amount of setup involved in getting Docker up and running, so please follow the official instructions HERE to install it.
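Once Docker is installed you can quickly check that it is working with a couple of standard Docker commands (these are not part of our scripts, just a sanity check):
$ docker --version
$ docker run hello-world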
Building the Docker Images
Once Docker is installed and running on your system you need to build the container images of our environment. Container images are like snapshots of a system at a given point in time and allow you to easily replicate and restore runtime environments almost instantly, at the expense of having to build them first.
To save you the hassle of building each image manually we have created a build.sh script HERE that will build them in order, so you only have to run:
$ bash build.sh
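If the build finishes without errors, you can list the resulting images with Docker's standard listing command (the exact image names and tags depend on the build script):
$ docker image ls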
Now you can use these images to deploy either a containerized Spark cluster with a master node and 4 worker nodes, or a BigDL Jupyter notebook server, so you can run the exercises.
Setting up a containerized Spark cluster on your computer
In order to run the Spark exercises in a containerized cluster rather than in Standalone or Local mode we have created a spark_cluster.sh script HERE that allows you to create and manage said cluster.
You can start the cluster by passing the command deploy to the spark_cluster.sh script:
$ bash spark_cluster.sh deploy
It then creates and runs 5 containers (1 master + 4 workers) and does all the setup required for a running Spark environment. Once it's done, it will display information about the running containers and their addresses:
----------------------------------------------------
Cluster Information - Manager: STANDALONE
----------------------------------------------------
- HDFS:
> Namenode running @ http://172.18.0.2:9870
> DataNode 1 running @ http://172.18.0.3:9864
> DataNode 2 running @ http://172.18.0.4:9864
> DataNode 3 running @ http://172.18.0.5:9864
> DataNode 4 running @ http://172.18.0.6:9864
----------------------------------------------------
- Spark Standalone:
> Master running @ http://172.18.0.2:8080
> Worker 1 running @ http://172.18.0.3:8081
> Worker 2 running @ http://172.18.0.4:8082
> Worker 3 running @ http://172.18.0.5:8083
> Worker 4 running @ http://172.18.0.6:8084
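You can also check that the containers are up and running with Docker itself (the container names are assigned by the script):
$ docker ps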
If at any point you need the node information again while the cluster is running, you can print it with the info command:
$ bash spark_cluster.sh info
Stopping the Spark cluster
Once you are done you can stop the cluster with the stop command:
$ bash spark_cluster.sh stop
Starting a Jupyter notebook server for Spark
The cluster script is also capable of setting up a Jupyter notebook server in a separate container and connecting it to the cluster, which is the most convenient way to start working with the notebooks. You can start the Jupyter server with the jupyter command:
$ bash spark_cluster.sh jupyter
Once the container is ready it will print the URL through which you can connect to it.
Important: As of the writing of this guide, Jupyter notebooks require authentication to log in, so we have set spark as the notebook's password.
If you want to run the notebooks we provide instead of copying or writing the code yourself, you will need to copy them into "/users/hadoop/spark-jupyter" inside the Jupyter container using docker cp. For example (replace the placeholders with the notebook file and the name of the Jupyter container created by the script):
$ docker cp <notebook.ipynb> <jupyter-container>:/users/hadoop/spark-jupyter/
Setting up a Standalone BigDL Environment
You can also quickly deploy a Jupyter notebook server with BigDL and all its required dependencies (although in Spark Standalone mode rather than in a cluster) through the start_bigdl.sh script HERE:
$ bash start_bigdl.sh
Important: As of the writing of this guide, Jupyter notebooks require authentication to log in, so we have set spark as the notebook's password.
If you want to run the notebooks we provide instead of copying or writing the code yourself, you will need to copy them into "/users/bigdl/" inside the container using docker cp. For example (replace the placeholder with the notebook file you want to copy; the container is named bigdl-jupyter):
$ docker cp <notebook.ipynb> bigdl-jupyter:/users/bigdl/
In order to stop the BigDL Jupyter notebook server you simply have to stop the Docker container:
$ docker stop bigdl-jupyter
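If you also want to remove the stopped container entirely (assuming the start script did not already run it with the --rm option), the standard Docker removal command can be used:
$ docker rm bigdl-jupyter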
Copyright © Barcelona Supercomputing Center, 2019-2020 - All Rights Reserved - AI in DataCenters