Deep Learning (Spark, Caffe, GPU)

C++博客 (homepage original-content highlights)

from http://docs.continuum.io/anaconda-cluster/examples/spark-caffe

Description

To demonstrate running a distributed PySpark job on GPUs, this example uses Caffe, a neural network library. Below is a trivial example of using Caffe on a Spark cluster. Every task trains the same model independently, so the work is redundant, but it shows that neural networks can be trained on GPUs from within a Spark job.

For this example, we recommend the AMI ami-2cbf3e44 and the instance type g2.2xlarge. An example profile (to be placed in ~/.acluster/profiles.d/gpu_profile.yaml) is shown below:

name: gpu_profile
node_id: ami-2cbf3e44 # Ubuntu 14.04 - IS HVM - Cuda 6.5
user: ubuntu
node_type: g2.2xlarge
num_nodes: 3
provider: aws
plugins:
  - spark-yarn
  - notebook
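
If you prefer to create the profile from a script rather than by hand, the sketch below writes the same keys to the location named above. It is a minimal sketch that assumes Python 3.7+ (for ordered dicts) and avoids PyYAML by rendering this flat structure directly. The helpers render_profile and write_profile are hypothetical names, not part of acluster.

```python
import os

# Values copied from the example profile above; adjust them for your own cluster.
GPU_PROFILE = {
    "name": "gpu_profile",
    "node_id": "ami-2cbf3e44",  # Ubuntu 14.04 - IS HVM - Cuda 6.5
    "user": "ubuntu",
    "node_type": "g2.2xlarge",
    "num_nodes": 3,
    "provider": "aws",
    "plugins": ["spark-yarn", "notebook"],
}


def render_profile(profile):
    """Render scalars and one level of lists as simple YAML text (hypothetical helper)."""
    lines = []
    for key, value in profile.items():
        if isinstance(value, list):
            lines.append(f"{key}:")
            lines.extend(f"  - {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    return "\n".join(lines) + "\n"


def write_profile(profile, directory="~/.acluster/profiles.d"):
    """Write <directory>/<name>.yaml and return its path (hypothetical helper)."""
    directory = os.path.expanduser(directory)
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, profile["name"] + ".yaml")
    with open(path, "w") as f:
        f.write(render_profile(profile))
    return path
```

Calling write_profile(GPU_PROFILE) produces ~/.acluster/profiles.d/gpu_profile.yaml. The inline comment on node_id is not written to the file.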

Download

To run this example, download the spark-caffe.py example script or the spark-caffe.ipynb example notebook.

Installation

The Spark + YARN plugin can be installed on the cluster using the following command:

$ acluster install spark-yarn

Once the Spark + YARN plugin is installed, you can view the YARN UI in your browser using the following command:

$ acluster open yarn

Dependencies

First, Caffe and its dependencies must be bootstrapped on all of the nodes. We provide a bash script, bootstrap-caffe.sh, that installs Caffe from source. The following command uploads bootstrap-caffe.sh to all of the nodes and executes it in parallel:

$ acluster submit bootstrap-caffe.sh --all

After a few minutes, Caffe and its dependencies will be installed on the cluster nodes, and the job can be started.
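
Before starting the job, you may want to confirm that the build finished on every node. The sketch below makes two assumptions. First, that Caffe was built in /home/ubuntu/caffe, the directory the job script uses. Second, that the build produced build/tools/caffe, which train_lenet.sh invokes. The function caffe_ready is a hypothetical helper, not part of acluster or Caffe.

```python
import os
import socket

# Assumed layout: Caffe source tree at /home/ubuntu/caffe (as used by spark-caffe.py)
# and the command-line tool at build/tools/caffe inside it.
CAFFE_ROOT = "/home/ubuntu/caffe"
CAFFE_BINARY = os.path.join("build", "tools", "caffe")


def caffe_ready(caffe_root=CAFFE_ROOT):
    """Return (hostname, ready); ready is True if the caffe tool exists and is executable."""
    binary = os.path.join(caffe_root, CAFFE_BINARY)
    ready = os.path.isfile(binary) and os.access(binary, os.X_OK)
    return socket.gethostname(), ready
```

From PySpark, the same pattern as the noop host check applies: sc.parallelize(range(2), 2).map(lambda _: caffe_ready()).collect(). Spark does not guarantee that each partition lands on a different node, so treat the result as a spot check.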

Running the Job

Here is the complete script to run the Spark + GPU with Caffe example in PySpark:

# spark-caffe.py
from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.setMaster('yarn-client')
conf.setAppName('spark-caffe')
sc = SparkContext(conf=conf)

# Sanity check: report which hosts the two partitions ran on.
def noop(x):
    import socket
    return socket.gethostname()

rdd = sc.parallelize(range(2), 2)
hosts = rdd.map(noop).distinct().collect()
print(hosts)

# Train LeNet on MNIST with Caffe inside each task; return the exit code and logs.
def caffe_process(x):
    import os
    # Make the CUDA toolkit and the LMDB library visible to Caffe.
    os.environ['PATH'] = '/usr/local/cuda/bin' + ':' + os.environ['PATH']
    os.environ['LD_LIBRARY_PATH'] = '/usr/local/cuda/lib64:/home/ubuntu/pombredanne-https-gitorious.org-mdb-mdb.git-9cc04f604f80/libraries/liblmdb'
    import subprocess
    proc = subprocess.Popen('cd /home/ubuntu/caffe && bash ./examples/mnist/train_lenet.sh', shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return proc.returncode, out, err

rdd = sc.parallelize(range(2), 2)
ret = rdd.map(caffe_process).distinct().collect()
print(ret)

You can submit the script to the Spark cluster using the submit command.

$ acluster submit spark-caffe.py 

After the script completes, the trained Caffe model can be found at /home/ubuntu/caffe/examples/mnist/lenet_iter_10000.caffemodel on all of the compute nodes.
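
To confirm where the model ended up, a small per-task check can report whether the file exists on the node that ran it. This is a minimal sketch. The model path comes from the sentence above, and model_status is a hypothetical helper name.

```python
import os
import socket

# Model path stated in the text above.
MODEL_PATH = "/home/ubuntu/caffe/examples/mnist/lenet_iter_10000.caffemodel"


def model_status(_, path=MODEL_PATH):
    """Return (hostname, exists, size_in_bytes) for the trained model on this node."""
    if os.path.isfile(path):
        return socket.gethostname(), True, os.path.getsize(path)
    return socket.gethostname(), False, 0
```

Mapping it over the same two partitions, sc.parallelize(range(2), 2).map(model_status).collect(), lists one tuple per task.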



Posted by 蔡东赟 on 2015-10-14 17:25
Author: C++博客 (homepage original-content highlights)
Original article: Deep Learning (Spark, Caffe, GPU). Thanks to the original author for sharing.