Setup and initial commands for my Docker Spark toy
Motivation
I want to use Spark with Jupyter and Python in a simple way, so I can start getting results quickly. I am a data scientist, and I want to do things with the data.
I am going to use Docker containers: the idea is that I do not want to create and configure a VM for Spark, Anaconda and so on. I also want to learn more about Docker, so I am going to follow some tutorials to create my cool work environment.
To do that, I copied and adapted this really awesome blog post: http://maxmelnick.com/2016/06/04/spark-docker.html. I am also using this equally cool image: https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook, where you can find a lot of information about possible uses and how to do them. These two resources are amazing.
Description of the process step by step
Preliminaries
I have Ubuntu 16.04 with Docker installed (if you want to install it, see https://docs.docker.com/engine/installation/linux/ubuntu/#recommended-extra-packages-for-trusty-1404).
To avoid typing sudo in front of every docker command, I add my user to the docker group (log out and back in for the change to take effect):
sudo usermod -aG docker raf
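Since the group change only applies to new login sessions, it can help to check that it is active. Here is a minimal sketch, assuming a Linux host with Python 3. The helper name user_in_group is made up for this example and is not part of Docker:

```python
import grp
import os


def user_in_group(group_name: str) -> bool:
    """Return True if the current process belongs to the named Unix group."""
    try:
        gid = grp.getgrnam(group_name).gr_gid
    except KeyError:
        # The group does not exist on this machine at all.
        return False
    # os.getgroups() lists supplementary groups; os.getgid() is the primary one.
    return gid in os.getgroups() or gid == os.getgid()


print("docker group active:", user_in_group("docker"))
```

If this prints False right after running usermod, the session probably still has the old group list.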
Create a folder to save the code/notebooks that interact with Spark
cd /home/raf/Documents
mkdir spark-docker
Run the Docker container
Note: you need the image first!
For example, we are going to work with the jupyter/pyspark-notebook image, which can be found at https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook. To download it, just write in the terminal: docker pull jupyter/pyspark-notebook. For more detail see Annex B.
See your Docker containers, stop them, delete them
This is useful in case something goes wrong, so you can quickly start again.
To see all containers (running and stopped): docker ps -a --no-trunc
To stop one: docker stop ID-of-the-container (or name-of-the-container)
To delete one: docker rm ID-of-the-container (or name-of-the-container)
Now start the container:
docker run -it -p 8888:8888 -v /home/raf/Documents/spark-docker:/home/jovyan/work --name spark_2 jupyter/pyspark-notebook
What’s going on when we run that command?
- The -it runs the container in interactive mode.
- The -p 8888:8888 makes the container’s port 8888 accessible to the host (i.e., your local computer) on port 8888. This allows us to connect to the Jupyter Notebook server, since it listens on port 8888.
- The -v /home/raf/Documents/spark-docker:/home/jovyan/work maps our spark-docker folder on the host to the container’s /home/jovyan/work directory (the directory the docker-stacks documentation uses for notebooks). Both sides of the colon are needed: with only one path, Docker creates an anonymous volume inside the container and nothing is shared with the host. The mapping makes the notebooks we create accessible in our spark-docker folder on our local computer. It also allows us to make additional files such as data sources (e.g., CSV, Excel) accessible to our Jupyter notebooks.
- The --name spark_2 gives the container the name spark_2, which allows us to refer to the container by name instead of ID in the future:
  To stop it: docker stop spark_2
  To delete it: docker rm spark_2
- The final part of the command, jupyter/pyspark-notebook, tells Docker we want to run the container from the jupyter/pyspark-notebook image.
For more information about the docker run command, check out the Docker run reference: https://docs.docker.com/engine/reference/run/#name---name.
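To make the pieces of the docker run command easier to see, here is a minimal Python sketch that assembles the call from its parts. It assumes the container-side notebook directory is /home/jovyan/work, as documented for the jupyter/docker-stacks images. The function name build_run_command and its parameters are hypothetical, chosen only for this example:

```python
import shlex


def build_run_command(host_dir, name, image,
                      container_dir="/home/jovyan/work", port=8888):
    """Assemble the argument list for `docker run`, one flag at a time."""
    return [
        "docker", "run",
        "-it",                                # interactive, with a terminal
        "-p", f"{port}:{port}",               # host port : container port
        "-v", f"{host_dir}:{container_dir}",  # host folder : container folder
        "--name", name,                       # name used later for stop/rm
        image,                                # image to start from
    ]


cmd = build_run_command(
    host_dir="/home/raf/Documents/spark-docker",
    name="spark_2",
    image="jupyter/pyspark-notebook",
)
# Show the command as one shell line; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Keeping the arguments in a list avoids quoting problems if the folder name ever contains spaces.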
As you can see, you are provided with a URL; just open it in your browser and start working. If you want to change the token, there are some clues on the GitHub page of the original image.
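The URL printed in the terminal carries the login token as a query parameter. Here is a small sketch, assuming the URL has the form http://localhost:8888/?token=... The URL below is a made-up example, not real output, and extract_token is a hypothetical helper name:

```python
from urllib.parse import parse_qs, urlparse


def extract_token(url):
    """Return the value of the `token` query parameter, or None if absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("token")
    return values[0] if values else None


# Hypothetical URL in the shape the notebook server prints at startup.
example_url = "http://localhost:8888/?token=0123456789abcdef"
print(extract_token(example_url))
```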
Test
Just write some code:
import pyspark
sc = pyspark.SparkContext('local[*]')
# do something to prove it works
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)
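To know what to expect from that test, the same operation can be sketched in plain Python without Spark. takeSample(False, 5) draws five elements without replacement, which locally corresponds to random.sample. This is only an analogy for reading the output, not how Spark implements it:

```python
import random

data = range(1000)

# Draw 5 distinct elements, like rdd.takeSample(False, 5) (no replacement).
sample = random.sample(data, 5)
print(sample)
```

In the notebook you should likewise see a list of five different integers between 0 and 999, different on each run.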
Everything is fine. Enjoy and good luck!
References
- Really cool blog post: http://maxmelnick.com/2016/06/04/spark-docker.html
- The image with some documentation: https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook
Annex A. sudo usermod mystery
Apparently you should be very careful when giving access to particular capabilities; one way to do that in Ubuntu is by creating different groups. The relevant part of the usermod manual says:
usermod - modify a user account
-a, --append
Add the user to the supplementary group(s). Use only with the -G
option.
-G, --groups GROUP1[,GROUP2,...[,GROUPN]]]
A list of supplementary groups which the user is also a member of.
Each group is separated from the next by a comma, with no
intervening whitespace. The groups are subject to the same
restrictions as the group given with the -g option.
If the user is currently a member of a group which is not listed,
the user will be removed from the group. This behaviour can be
changed via the -a option, which appends the user to the current
supplementary group list.
Takeaway: sudo usermod -aG docker raf appends (-a) the user raf (my user) to the supplementary group docker (-G). sudo is needed because only an administrator can make that kind of change.
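The difference between -G alone and -aG can be illustrated with a toy model in Python. It only mimics the behaviour described in the man page excerpt above. The function apply_usermod_groups and the example group names are hypothetical:

```python
def apply_usermod_groups(current, requested, append):
    """Model how `usermod -G` (append=False) or `usermod -aG` (append=True)
    changes a user's supplementary groups."""
    if append:
        # -aG: keep the existing groups and add the requested ones.
        return sorted(set(current) | set(requested))
    # -G alone: the list is replaced, so unlisted groups are dropped.
    return sorted(set(requested))


groups = ["adm", "sudo"]
print(apply_usermod_groups(groups, ["docker"], append=True))   # ['adm', 'docker', 'sudo']
print(apply_usermod_groups(groups, ["docker"], append=False))  # ['docker']
```

This is why forgetting the -a can silently remove a user from groups such as sudo.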
Annex B: Your containers
You can see which images you already have with docker images. If you do not have what you want yet, you can search with docker search key-word, which gives you a list of images. You can also search for the images on Google, where you may find more details, such as the version of the software and some useful information and advice.
Once you have decided which image you like, pull it with docker pull, for example:
docker pull jupyter/pyspark-notebook
Specific references
- http://blog.thoward37.me/articles/where-are-docker-images-stored/