Wednesday, 26 April 2017

Set up and initial commands of my Docker Spark Toy


I want to use Spark with Jupyter and Python in a simple way, so that I can start getting results very soon. I am a data scientist, and I want to do things with the data.

I am going to use Docker containers. The idea is that I do not want to create and configure a VM for Spark, Anaconda and so on. I also want to learn more about Docker. So I am going to follow some tutorials to create my cool work environment.

In order to do that, I copy and adapt this really awesome blog post: http://maxmelnick.com/2016/06/04/spark-docker.html. I am also using this equally cool image: https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook, where you can find a lot of information about possible uses and how to do them. These two pieces are amazing.

Description of the process step by step


I have Ubuntu 16.04 and Docker installed (if you want to install it, see: https://docs.docker.com/engine/installation/linux/ubuntu/#recommended-extra-packages-for-trusty-1404 ).
To avoid typing sudo in front of every docker command, I add my user to the docker group:
sudo usermod -aG docker raf
In my ignorance I did not understand the previous line, so I did some research, which is in Annex A.

Create folder to save the code/notebooks to interact with Spark

cd /home/raf/Documents
mkdir spark-docker

Run docker container

Note: you should have the image!
For example, we are going to work with this image: jupyter/pyspark-notebook, which can be found at https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook. Then we can just write in the terminal: docker pull jupyter/pyspark-notebook. For more detail see Annex B.

See your Docker containers, stop them, delete them

This could be useful in case something goes wrong, so you can quickly start again.

To see the containers (the -a flag also lists the stopped ones): docker ps -a --no-trunc
To stop them: docker stop ID-of-the-container (or name-of-the-container)
To delete them: docker rm ID-of-the-container (or name-of-the-container)

docker run -it -p 8888:8888 -v /home/raf/Documents/spark-docker:/home/jovyan/work --name spark_2 jupyter/pyspark-notebook

What’s going on when we run that command?
  • The -it runs the container in interactive mode.
  • The -p 8888:8888 makes the container’s port 8888 accessible to the host (i.e., your local computer) on port 8888. This will allow us to connect to the Jupyter Notebook server since it listens on port 8888.
  • The -v /home/raf/Documents/spark-docker:/home/jovyan/work maps our spark-docker folder on the host to the container’s /home/jovyan/work working directory (i.e., the directory the Jupyter notebook will run from in the jupyter/docker-stacks images). This makes it so notebooks we create are accessible in our spark-docker folder on our local computer. It also allows us to make additional files such as data sources (e.g., CSV, Excel) accessible to our Jupyter notebooks.
  • The --name spark_2 gives the container the name spark_2, which allows us to refer to the container by name instead of ID in the future. For example, to stop it: docker stop spark_2, and to delete it: docker rm spark_2.
  • The final part of the command, jupyter/pyspark-notebook, tells Docker we want to run the container from the jupyter/pyspark-notebook image.
For more information about the docker run command, check out the Docker run reference: https://docs.docker.com/engine/reference/run/#name---name.
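As a small sketch of how the pieces of that command fit together, the following Python helper assembles the same docker run invocation as an argument list. The function name build_docker_run is hypothetical, and the container path /home/jovyan/work is an assumption based on the working directory used by the jupyter/docker-stacks images; adjust it if your image differs.

```python
import shlex


def build_docker_run(host_dir, container_dir="/home/jovyan/work",
                     port=8888, name="spark_2",
                     image="jupyter/pyspark-notebook"):
    """Return the `docker run` command as a list of arguments.

    container_dir is an assumption (the docker-stacks working directory).
    """
    return [
        "docker", "run",
        "-it",                                # interactive mode
        "-p", f"{port}:{port}",               # host port : container port
        "-v", f"{host_dir}:{container_dir}",  # host folder : container folder
        "--name", name,                       # name used later by stop / rm
        image,                                # image to start the container from
    ]


cmd = build_docker_run("/home/raf/Documents/spark-docker")
# Print the command as a single shell line; nothing is executed here.
print(shlex.join(cmd))
```

The list form can be handed to subprocess.run(cmd) if you prefer to start the container from Python rather than from the terminal.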
As you can see, you are provided with a URL; you just have to open it in your browser and start working. If you want to change the token, there are some clues in the GitHub repository of the original image.


Just write some code:
import pyspark
sc = pyspark.SparkContext('local[*]')

# do something to prove it works
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)

Everything is fine. Enjoy and good luck!


  1. The image with some documentation: https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook

Annex A. sudo usermod mystery

Apparently you should be very thoughtful when giving access to particular capabilities; a way to do that in Ubuntu is through the creation of different groups.
usermod - modify a user account
  -a, --append
          Add the user to the supplementary group(s). Use only with the -G
          option.

  -G, --groups GROUP1[,GROUP2,...[,GROUPN]]]
          A list of supplementary groups which the user is also a member of.
          Each group is separated from the next by a comma, with no
          intervening whitespace. The groups are subject to the same
          restrictions as the group given with the -g option.

          If the user is currently a member of a group which is not listed,
          the user will be removed from the group. This behaviour can be
          changed via the -a option, which appends the user to the current
          supplementary group list.

Takeaway: sudo usermod -aG docker raf appends (-a) the user raf (my user) to the supplementary group docker (-G). sudo is needed because only an administrator can make this kind of change.
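To make the difference between -G alone and -aG concrete, here is a minimal Python sketch that models the behaviour described in the man page excerpt. It does not call usermod; the function usermod_groups and the starting groups "sudo" and "adm" are hypothetical, chosen only for illustration.

```python
def usermod_groups(current, listed, append=False):
    """Model the supplementary groups a user ends up with after `usermod -G`.

    append=False models `-G` alone: the listed groups replace the current ones.
    append=True models `-aG`: the listed groups are added to the current ones.
    """
    current, listed = set(current), set(listed)
    return current | listed if append else listed


before = {"sudo", "adm"}  # hypothetical groups the user already belongs to

with_append = sorted(usermod_groups(before, {"docker"}, append=True))
without_append = sorted(usermod_groups(before, {"docker"}))

print(with_append)     # the user keeps the old groups and gains docker
print(without_append)  # the user is left with docker only
```

This is why the -a flag matters: without it, the user would be removed from every supplementary group that is not listed.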


Annex B: Your containers

You can see what you already have with: docker images. If you do not have what you want, you can search with: docker search key-word, which gives you a list of images. You can also search for the images on Google, where you may find more details, such as the version of the software and some useful information and advice.

When you decide you like an image, use docker pull, for example:
docker pull  jupyter/pyspark-notebook

Specific references

  1. http://blog.thoward37.me/articles/where-are-docker-images-stored/

Tuesday, 25 April 2017

Set up cluster: Spark 1.3.1 on Ubuntu 14.04

Set up cluster manually: Spark 1.3.1 on Ubuntu 14.04


Once the cluster machines are running, let's configure Spark 1.3.1. I assume Anaconda is installed to provide numpy.

Set up password-less SSH

Contact each node:
ssh -i /home/raf/Documents/Cloud/rvf_keele_cloud.pem ubuntu@
ssh -i /home/raf/Documents/Cloud/rvf_keele_cloud.pem ubuntu@

In the master

ubuntu@master:~$ ssh-keygen -t rsa -P ""
ubuntu@master:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
chmod 644 $HOME/.ssh/authorized_keys
ubuntu@master:~$ ssh localhost

with $HOME = /home/ubuntu

On workers:

Copy ~/.ssh/id_rsa.pub (the RSA key generated above) from your master to the worker, then use:
cat /home/ubuntu/.ssh/id_rsa.pub >> /home/ubuntu/.ssh/authorized_keys
chmod 644 /home/ubuntu/.ssh/authorized_keys
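The cat >> step appends the key every time it is run, so repeating it leaves duplicate lines. A minimal Python sketch of the same step that skips a key that is already present could look like this. The function authorize_key and the example key string are hypothetical; the demo writes to a temporary directory rather than to a real ~/.ssh.

```python
import os
import tempfile
from pathlib import Path


def authorize_key(public_key, ssh_dir):
    """Append public_key to ssh_dir/authorized_keys unless it is already there.

    Returns True if the key was added, False if it was already present.
    """
    ssh_dir = Path(ssh_dir)
    ssh_dir.mkdir(mode=0o700, parents=True, exist_ok=True)
    auth = ssh_dir / "authorized_keys"
    key = public_key.strip()
    text = auth.read_text() if auth.exists() else ""
    if key in (line.strip() for line in text.splitlines()):
        return False
    with auth.open("a") as fh:
        if text and not text.endswith("\n"):
            fh.write("\n")  # keep the new key on its own line
        fh.write(key + "\n")
    os.chmod(auth, 0o644)  # same permissions as the chmod command above
    return True


# Demo in a temporary directory with a made-up key.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / ".ssh"
    first = authorize_key("ssh-rsa AAAAEXAMPLE ubuntu@master", demo_dir)
    second = authorize_key("ssh-rsa AAAAEXAMPLE ubuntu@master", demo_dir)
    content = (demo_dir / "authorized_keys").read_text()

print(first, second)
```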

With Maven

Configuration file: conf/slaves file on your master

We need to add the hostname to /etc/hosts
  1. sudo vim /etc/hosts
  2. Add the Fully Qualified Host Name (FQHN) and the hostname (to find it, use the command "hostname -f")
  3. Add your node: p1-spark-node-001.novalocal p1-spark-node-001
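As a sketch of what such an entry looks like and how to check for it, the following Python snippet formats an /etc/hosts line and tests whether a name is already listed in some hosts text. The helper names hosts_entry and has_host are hypothetical, and the IP address 10.0.0.5 is a made-up placeholder; use the real address of your node.

```python
def hosts_entry(ip, fqhn, hostname):
    """Format one /etc/hosts line: IP, fully qualified name, short name."""
    return f"{ip} {fqhn} {hostname}"


def has_host(hosts_text, name):
    """Return True if name appears as a host name in the given hosts text."""
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split columns
        if name in fields[1:]:                  # the first column is the IP
            return True
    return False


# 10.0.0.5 is a placeholder IP, not taken from the original setup.
entry = hosts_entry("10.0.0.5", "p1-spark-node-001.novalocal", "p1-spark-node-001")
sample_hosts = "127.0.0.1 localhost\n" + entry + "\n"

print(entry)
print(has_host(sample_hosts, "p1-spark-node-001"))
```

On a real node you could pass Path("/etc/hosts").read_text() to has_host before editing the file.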

Start Master and Workers

Run sbin/start-all.sh.
The cluster manager’s web UI should appear at
http://masternode:8080 and show all your workers.

Start Master and Workers by Hand

Run some commands in the Master and in the Workers:


cd /opt/spark/bin/
./spark-class org.apache.spark.deploy.master.Master

Now you can check the URLs reported by the master: the one where you should send the jobs, and the Master UI.

I am going to visit the last one.
Remember that the port should be open. Since the address is the internal IP of my master, I use tunneling:
ssh -i /home/raf/Documents/Cloud/rvf_keele_cloud.pem -L 8200:localhost:8080 ubuntu@
Now you can check in the browser: http://localhost:8200/
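Before opening the browser, it can help to confirm that something is actually listening on the local end of the tunnel. Here is a minimal Python sketch, assuming the tunnel forwards local port 8200 as in the command above; the function name port_is_open is hypothetical. The self-check at the bottom uses a throwaway local listener so it runs without the cluster.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Self-check: listen on a free local port, probe it, then close and probe again.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
free_port = listener.getsockname()[1]
while_listening = port_is_open("127.0.0.1", free_port)
listener.close()
after_closing = port_is_open("127.0.0.1", free_port)

print(while_listening, after_closing)
```

With the tunnel running, port_is_open("localhost", 8200) should return True; if it returns False, the ssh -L command is probably not active.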


cd /opt/spark/bin/
./spark-class org.apache.spark.deploy.worker.Worker spark://

Now we can see in the browser the inclusion of the new worker:

To submit a job

You have to have a copy of all files in the same location on the master and on all workers. I have copied a folder with data and a Python script into /opt/spark/bin:

cd /opt/spark/bin
./spark-submit --master spark:// paper_cluster_spark_to_run/recommendations.py

To run it you have to open another SSH session: ssh -i /home/raf/Documents/Cloud/rvf_keele_cloud.pem ubuntu@
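One way to check that the master and the workers really hold the same files in the same relative locations is to compare a manifest of paths and checksums from each node. The following is only a sketch: the function manifest is hypothetical, and the demo simulates two nodes with temporary folders instead of connecting to the cluster.

```python
import hashlib
import tempfile
from pathlib import Path


def manifest(root):
    """Map each file's path relative to root to the SHA-256 of its content."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


# Demo: two temporary folders stand in for the master and one worker.
with tempfile.TemporaryDirectory() as tmp:
    master = Path(tmp) / "master"
    worker = Path(tmp) / "worker"
    for node in (master, worker):
        node.mkdir()
        (node / "recommendations.py").write_text("print('hello')\n")
    in_sync = manifest(master) == manifest(worker)

    # Change the worker's copy and compare again.
    (worker / "recommendations.py").write_text("print('changed')\n")
    out_of_sync = manifest(master) != manifest(worker)

print(in_sync, out_of_sync)
```

In practice you would run manifest on the job folder on each machine and compare the resulting dictionaries before calling spark-submit.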


  1. Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. (2015). Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media.