Tuesday, 25 April 2017

Set up cluster: Spark 1.3.1 on Ubuntu 14.04

Set up the cluster manually: Spark 1.3.1 on Ubuntu 14.04


Once the cluster machines are running, let's configure Spark 1.3.1. I assume Anaconda is installed to provide numpy.

Set up password-less SSH

Log in to each node:
ssh -i /home/raf/Documents/Cloud/rvf_keele_cloud.pem ubuntu@
ssh -i /home/raf/Documents/Cloud/rvf_keele_cloud.pem ubuntu@

On the master

ubuntu@master:~$ ssh-keygen -t rsa -P ""
ubuntu@master:~$ cat $HOME/.ssh/ >> $HOME/.ssh/authorized_keys
ubuntu@master:~$ chmod 644 $HOME/.ssh/authorized_keys
ubuntu@master:~$ ssh localhost

with $HOME = /home/ubuntu

On each worker:

Copy ~/.ssh/ from your master to the worker, then run:
cat /home/ubuntu/.ssh/ >> /home/ubuntu/.ssh/authorized_keys
chmod 644 /home/ubuntu/.ssh/authorized_keys
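The commands above leave out the key file names; here is a sketch of the same setup assuming OpenSSH's default names (id_rsa / id_rsa.pub). KEY_DIR stands in for ~/.ssh so the sketch does not clobber real keys; on an actual node you would use $HOME/.ssh.

```shell
# Password-less SSH sketch, assuming OpenSSH's default key names.
KEY_DIR=$(mktemp -d)                                     # stand-in for ~/.ssh
ssh-keygen -t rsa -P "" -f "$KEY_DIR/id_rsa" -q          # empty passphrase
cat "$KEY_DIR/id_rsa.pub" >> "$KEY_DIR/authorized_keys"  # authorize the key
chmod 644 "$KEY_DIR/authorized_keys"
```

On the workers you would append the master's public key to each worker's authorized_keys in the same way, so the master can SSH into them without a password.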


Configuration file: the conf/slaves file on your master lists the worker hostnames, one per line.

We need to add each hostname to /etc/hosts:
  1. sudo vim /etc/hosts
  2. Add the Fully Qualified Hostname (FQHN) and the hostname (look them up with the command "hostname -f")
  3. Add your node: p1-spark-node-001.novalocal p1-spark-node-001
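Using the node from step 3, the resulting /etc/hosts line might look like this (the IP address is a placeholder for your node's internal IP):

```
10.0.0.5   p1-spark-node-001.novalocal   p1-spark-node-001
```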

Start Master and Workers

Run sbin/ ; the cluster manager's web UI should appear at
http://masternode:8080 and show all your workers.

Start Master and Workers by Hand

Run the following commands on the master and on the workers:


cd /opt/spark/bin/
./spark-class org.apache.spark.deploy.master.Master

Starting the master prints two URLs: the spark:// URL where you submit jobs, and the master's web UI.

I am going to visit the latter.
Remember that the port must be open. The address is the internal IP of my master, so I use tunneling:
ssh -i /home/raf/Documents/Cloud/rvf_keele_cloud.pem -L 8200:localhost:8080 ubuntu@
Now you can check in the browser: http://localhost:8200/


cd /opt/spark/bin/
./spark-class org.apache.spark.deploy.worker.Worker spark://
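Putting the two launch commands together: the sketch below assembles them with placeholders, since the post elides the master's IP. MASTER_HOST is a placeholder for the master's internal IP, and 7077 is the standalone master's default port; the echoes just display the assembled commands.

```shell
# Sketch of the by-hand launch; MASTER_HOST is a placeholder value.
SPARK_HOME=/opt/spark
MASTER_HOST=10.0.0.5   # placeholder: your master's internal IP
MASTER_CMD="$SPARK_HOME/bin/spark-class org.apache.spark.deploy.master.Master"
WORKER_CMD="$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker spark://$MASTER_HOST:7077"
echo "on the master:  $MASTER_CMD"
echo "on each worker: $WORKER_CMD"
```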

Now we can see in the browser that the new worker has joined:

To submit a job

You need a copy of all the files in the same location on the master and on all workers. I have copied a folder with the data and a Python script into /opt/spark/bin:

cd /opt/spark/bin
./spark-submit --master spark:// paper_cluster_spark_to_run/

To run it, you have to open another SSH session:  ssh -i /home/raf/Documents/Cloud/rvf_keele_cloud.pem ubuntu@
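The full submit command takes the master's spark:// URL and the path to the script. Both values in this sketch are placeholders, since the post elides the actual IP and script name.

```shell
# Sketch of the submit command; MASTER_URL and APP are placeholders.
MASTER_URL="spark://10.0.0.5:7077"            # master's internal IP + default port
APP="paper_cluster_spark_to_run/my_job.py"    # hypothetical script name
SUBMIT="./spark-submit --master $MASTER_URL $APP"
echo "$SUBMIT"
```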


