Wednesday, April 26, 2017

Set up and initial commands of my Docker Spark Toy


I want to use Spark with Jupyter and Python in a simple way, so I can start getting results very soon. I am a data scientist: I want to do things with the data.

I am going to use Docker containers. The idea is that I do not want to create and configure a VM for Spark, Anaconda and so on. I also want to learn more about Docker. So I am going to follow some tutorials to create my cool work environment.

To do that, I copied and adapted this really awesome blog post: http://maxmelnick.com/2016/06/04/spark-docker.html. I am also using this equally cool image: https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook, where you can find a lot of information about possible uses and how to do them. These two resources are amazing.

Description of the process step by step


I have Ubuntu 16.04 with Docker installed (if you want to install it, see: https://docs.docker.com/engine/installation/linux/ubuntu/#recommended-extra-packages-for-trusty-1404 ).
To avoid typing sudo in front of every docker command, I add my user to the docker group:
sudo usermod -aG docker raf
In my ignorance I did not understand the previous line, so I did some research, which is in Annex A.

Create a folder to save the code/notebooks to interact with Spark

cd /home/raf/Documents
mkdir spark-docker

Run docker container

Note: you need to have the image first!
For example, we are going to work with the jupyter/pyspark-notebook image, which can be found at https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook. We can get it by writing in the terminal: docker pull jupyter/pyspark-notebook. For more detail see Annex B.

See your Docker containers, stop them, delete them

This can be useful if something goes wrong, so you can quickly start again.

To see all containers, running and stopped: docker ps -a --no-trunc
To stop one: docker stop ID-of-the-container/name-of-the-container
To delete one: docker rm ID-of-the-container/name-of-the-container
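If you prefer to check containers from Python (for instance from a notebook on the host), the listing above can be sketched roughly as follows. This is only an illustration: the helper names parse_ps and list_containers are made up for this post, and it assumes the docker command-line tool is on your PATH. It uses the --format option of docker ps to get tab-separated fields.

```python
import shutil
import subprocess

# Go-template passed to `docker ps --format`: ID, name and status, tab-separated.
FORMAT = "{{.ID}}\t{{.Names}}\t{{.Status}}"


def parse_ps(text):
    """Turn the tab-separated output of `docker ps --format ...` into dicts."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        container_id, name, status = line.split("\t", 2)
        rows.append({"id": container_id, "name": name, "status": status})
    return rows


def list_containers():
    """Return all containers (running and stopped), or [] if docker is missing."""
    if shutil.which("docker") is None:
        return []
    result = subprocess.run(
        ["docker", "ps", "-a", "--no-trunc", "--format", FORMAT],
        capture_output=True,
        text=True,
        check=False,  # do not raise if the daemon is not reachable
    )
    return parse_ps(result.stdout)


for row in list_containers():
    print(row["name"], "->", row["status"])
```

From there, stopping or deleting a container is just a matter of passing its name to docker stop or docker rm.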

docker run -it -p 8888:8888 -v /home/raf/Documents/spark-docker:/home/jovyan/work --name spark2 jupyter/pyspark-notebook

What’s going on when we run that command?
  • The -it runs the container in interactive mode.
  • The -p 8888:8888 makes the container’s port 8888 accessible to the host (i.e., your local computer) on port 8888. This allows us to connect to the Jupyter Notebook server, since it listens on port 8888.
  • The -v /home/raf/Documents/spark-docker:/home/jovyan/work maps our spark-docker folder on the host to the container’s /home/jovyan/work directory (the work folder of the image’s default user, jovyan, which is visible from Jupyter). Note the host:container form: with a single path, Docker would only create an anonymous volume inside the container and nothing would be shared with the host. The mapping makes the notebooks we create accessible in our spark-docker folder on our local computer. It also allows us to make additional files such as data sources (e.g., CSV, Excel) accessible to our Jupyter notebooks.
  • The --name spark2 gives the container the name spark2, which allows us to refer to the container by name instead of ID in the future. For example, to stop it: docker stop spark2, and to delete it: docker rm spark2.
  • The final part of the command, jupyter/pyspark-notebook, tells Docker we want to run the container from the jupyter/pyspark-notebook image.
For more information about the docker run command, check out the Docker documentation: https://docs.docker.com/engine/reference/run/#name---name.
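To make the pieces of the command easier to see, here is a small sketch that assembles it from its parts in Python. The function build_run_command is made up for this post, and the container directory /home/jovyan/work is an assumption based on the usual layout of the jupyter/pyspark-notebook image, so adjust it if your image differs.

```python
import shlex


def build_run_command(host_dir,
                      container_dir="/home/jovyan/work",  # assumed notebook folder in the image
                      name="spark2",
                      port=8888,
                      image="jupyter/pyspark-notebook"):
    """Assemble the docker run command as a list of arguments."""
    return [
        "docker", "run",
        "-it",                                # interactive mode
        "-p", f"{port}:{port}",               # host port : container port
        "-v", f"{host_dir}:{container_dir}",  # host folder : container folder
        "--name", name,                       # name to refer to the container later
        image,                                # image to start the container from
    ]


cmd = build_run_command("/home/raf/Documents/spark-docker")
# Print the command as a single shell line that can be pasted into a terminal.
print(shlex.join(cmd))
```

The list form can also be handed directly to subprocess.run if you would rather start the container from a script than from the terminal.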
As you can see, you are provided with a URL; you just have to open it in your browser and start working. If you want to change the token, there are some clues in the GitHub repository of the original image.


Just write some code:
import pyspark
sc = pyspark.SparkContext('local[*]')

# do something to prove it works
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)
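
If you are wondering what the last line does: takeSample(False, 5) returns 5 elements picked at random without replacement. As a rough local analogy (plain Python, no Spark involved, so this is only an illustration and not what Spark does internally), it behaves like this:

```python
import random

# Same data as the RDD above, but as a plain Python range.
data = range(1000)

# Pick 5 distinct elements at random, i.e. sampling without replacement.
sample = random.sample(data, 5)
print(sample)
```

Each run prints a different list of five numbers between 0 and 999, which is what you should also see from the Spark version.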

Everything is fine. Enjoy and good luck!


  1. The image with some documentation: https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook

Annex A: The sudo usermod mystery

Apparently you should be careful when granting access to particular capabilities; one way to manage that in Ubuntu is by creating different groups.
usermod - modify a user account

-a, --append
          Add the user to the supplementary group(s). Use only with the -G
          option.

-G, --groups GROUP1[,GROUP2,...[,GROUPN]]]
          A list of supplementary groups which the user is also a member of.
          Each group is separated from the next by a comma, with no
          intervening whitespace. The groups are subject to the same
          restrictions as the group given with the -g option.

          If the user is currently a member of a group which is not listed,
          the user will be removed from the group. This behaviour can be
          changed via the -a option, which appends the user to the current
          supplementary group list.

Takeaway: sudo usermod -aG docker raf appends (-a) the user raf (my user) to the supplementary group docker (-G). sudo is needed because only an administrator can make this kind of change.
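
One thing I learned afterwards: the new group only applies to sessions started after the change, so you normally have to log out and back in before docker works without sudo. A small sketch to check whether the current session already has the group (the function in_group is just a name I made up; it relies on Python's standard grp and os modules, so it is Linux/Unix only):

```python
import grp
import os


def in_group(group_name):
    """Return True if the current process belongs to the named group."""
    try:
        gid = grp.getgrnam(group_name).gr_gid
    except KeyError:
        # The group does not exist on this machine.
        return False
    # os.getgroups() lists the supplementary groups of this process;
    # os.getegid() is its effective primary group.
    return gid in os.getgroups() or gid == os.getegid()


print("docker group active in this session:", in_group("docker"))
```

If this prints False right after running usermod, logging out and in again is the first thing to try.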

Annex B: Your containers

You can see which images you already have with: docker images. If you do not have what you want, you can search with: docker search key-word, which gives you a list of images. You can also search for the images on Google, where you may find more details, such as the software version and some useful information and advice.

When you have decided which image you like, pull it with docker pull, for example:
docker pull jupyter/pyspark-notebook

Specific references

  1. http://blog.thoward37.me/articles/where-are-docker-images-stored/
