This guide explains how to launch a Globus Provision instance with a Condor cluster and a Globus Online (GO) transfer endpoint. As an example, we will use Globus Online to transfer a set of Project Gutenberg texts into the cluster, do some simple processing on them using the Condor cluster, and transfer the result out with Globus Online.
In general, the setup described in this guide may of interest if you have a dataset that you want to do some processing on and want compute resources right away to do that processing (e.g., to test your code interactively, without having your compute jobs waiting in a queue on your organization’s local cluster). This is also an example where Globus Provision and Globus Online complement each other nicely: Globus Provision can set up a compute cluster for you on EC2 in just a few minutes, and Globus Online provides a hassle-free way of getting data in and out of the cluster. Plus, Globus Provision takes care of setting up the endpoint on the cluster for you; no need to worry about GridFTP, certificates, etc.
If you are starting from scratch, you will need to install Globus Provision first, and make sure you have the necessary EC2 account information. Just read our main Quickstart Guide, and stop right before section Creating and launching a Globus Provision instance (although, if you have time, it doesn’t hurt to work through the entire Quickstart)
This guide also assumes that you have a GO account and that you are familiar, at least, with GO’s web interface. If this is not the case, sign up for a GO account and take a look at the Globus Online Quickstart Guide
Let’s start by taking a look at the topology file we will be using (save this file as go-condor-ec2.conf):
[general]
deploy: ec2
domains: simple
[domain-simple]
users: gp-user
gridftp: yes
nis: yes
filesystem: nfs
condor: yes
condor-nodes: 4
go-endpoint: go-user#gp-test
go-auth: go
[ec2]
keypair: gp-key
keyfile: ~/.ec2/gp-key.pem
username: ubuntu
ami: latest-32bit
instance-type: t1.micro
[globusonline]
ssh-key: ~/.ssh/id_rsa
As you can see, this file is similar to the Quickstart’s example file, except for a couple extra options. First of all, make sure the [ec2] options have valid values for your EC2 account.
Next, we specify that we want this Globus Provision instance to include a Condor cluster:
condor: yes
condor-nodes: 4
Since we’ll have multiple hosts in our topology, it will be convenient for them to have shared accounts and a shared filesystem. We do this with the following option:
nis: yes
filesystem: nfs
Finally, we need to specify a few Globus Online options:
[domain-simple]
go-endpoint: go-user#gp-test
go-auth: go
[globusonline]
ssh-key: ~/.ssh/id_rsa_nopass
The go-endpoint option specifies the name of the endpoint that will be created for the instance’s GridFTP server (notice how, as we did in the Quickstart, we’ve specified gridftp: yes). Make sure you replace go-user with the name of your Globus Online user.
The go-auth option specifies that Globus Online will use your GO identity when accessing this endpoint. In other words, if you are logged into a GO interface (such as the GO website, or the GO CLI), GO will not ask you for any additional account information or passwords; it will simply contact the endpoint with your GO identity. To specify what GO users should be authorized to access the endpoint, just specify them in the users option (make sure you, at least, add your GO user; otherwise, you will not be able to work through the remainder of the guide).
Finally, since Globus Provision will be contacting GO to create an endpoint on your behalf, you must authorize an SSH key in your GO account, and specify the path of that SSH key in the ssh-key option. If you have not authorized your SSH key in Globus Online, take a quick detour to Authorizing an SSH or X.509 Identity in GO to learn how to do this.
Ok, we’re ready to actually launch this topology. The first step is to create a Globus Provision instance with that topology:
gp-instance-create -c go-condor-ec2.conf
This should immediately return the following:
Created new instance: gpi-65f00474
The gp-instance-create command doesn’t actually deploy the topology, or create the associated Globus Online endpoints, but simply validates that the topology is correct, and creates an entry for it in a database. This entry is called an instance. You can think of the topology as a specification of what you want to deploy and the instance as one particular deployment of that topology.
To actually launch this instance, we use the gp-instance-start command (make sure you use the identifier returned by gp-instance-create, not the one used in these examples):
gp-instance-start gpi-65f00474
This command will take a few minutes to do its job and, for a while, all you will see is the following:
Starting instance gpi-65f00474...
In a separate console, you can track the progress of the deployment using this command:
gp-instance-describe gpi-65f00474
You should first see something like this:
gpi-65f00474: Starting
Domain 'simple'
simple-condor Starting
simple-server Starting
simple-condor-wn4 Starting
simple-condor-wn3 Starting
simple-condor-wn2 Starting
simple-condor-wn1 Starting
simple-gridftp Starting
This command is telling us not just the status of the entire instance (Starting) but also of each individual host in the topology’s domains. Notice how we have a host for the Condor head node (simple-condor), four Condor worker nodes (simple-condor-wn1, etc.), a GridFTP server (simple-gridftp), and an NFS/NIS server (simple-server). Since we’re using EC2 micro-instance (notice how we specified instance-type: t1.micro in the topology file), this example will only cost $0.14 per hour to run. In fact, if you have a new Amazon Web Services account, you may be able to take advantage of their Free Usage Tier and get 750 hours on EC2 completely free.
After a short while, the output of gp-instance-describe will look like this:
gpi-65f00474: Configuring
Domain 'simple'
simple-condor Running (unconfigured) ec2-N-N-N-N.compute-1.amazonaws.com 10.X.X.X
simple-server Configuring ec2-N-N-N-N.compute-1.amazonaws.com 10.X.X.X
simple-condor-wn4 Running (unconfigured) ec2-N-N-N-N.compute-1.amazonaws.com 10.X.X.X
simple-condor-wn3 Running (unconfigured) ec2-N-N-N-N.compute-1.amazonaws.com 10.X.X.X
simple-condor-wn2 Running (unconfigured) ec2-N-N-N-N.compute-1.amazonaws.com 10.X.X.X
simple-condor-wn1 Running (unconfigured) ec2-N-N-N-N.compute-1.amazonaws.com 10.X.X.X
simple-gridftp Running (unconfigured) ec2-N-N-N-N.compute-1.amazonaws.com 10.X.X.X
At this point, all the hosts have started, and Globus Provision is in the process of configuring the simple-server host (Globus Provision can configure multiple hosts at the same time but, in this case, it cannot do so because we can’t configure the other hosts until the NFS/NIS server is runnign). Notice how, since all the hosts have started, we now know what their actual hostnames are. We will use this later to connect to that host.
When gp-instance-start finishes deploying the instance, it will show the following:
Starting instance gpi-65f00474... done!
Started instance in 2 minutes and 31 seconds
And gp-instance-describe will look like this:
gpi-65f00474: Running
Domain 'simple'
simple-condor Running ec2-N-N-N-N.compute-1.amazonaws.com 10.X.X.X
simple-server Running ec2-N-N-N-N.compute-1.amazonaws.com 10.X.X.X
simple-condor-wn4 Running ec2-N-N-N-N.compute-1.amazonaws.com 10.X.X.X
simple-condor-wn3 Running ec2-N-N-N-N.compute-1.amazonaws.com 10.X.X.X
simple-condor-wn2 Running ec2-N-N-N-N.compute-1.amazonaws.com 10.X.X.X
simple-condor-wn1 Running ec2-N-N-N-N.compute-1.amazonaws.com 10.X.X.X
simple-gridftp Running ec2-N-N-N-N.compute-1.amazonaws.com 10.X.X.X
You can use the Globus Online CLI to verify that the “gp-test” endpoint was correctly created:
$ ssh cli.globusonline.org endpoint-list -v gp-test
Name : gp-test
Host(s) : gsiftp://ec2-N-N-N-N.compute-1.amazonaws.com:2811
Subject(s) : /C=US/O=Globus Consortium/OU=Globus Connect Service/CN=f6ec9304-dc84-11e0-bc85-12313804ec2a
MyProxy Server : myproxy.globusonline.org
Credential Status : n/a
Credential Expires: n/a
Credential Subject: n/a
The host for the endpoint should match that of the simple-gridftp host.
Now, we will use Globus Online to transfer a dataset into the cluster we have just launched on EC2. You can transfer any dataset from an existing Globus Online endpoint, or transfer data from your laptop or other local machine using Globus Connect. As an example, we will use a collection of ebooks from Project Gutenberg. If you’d like to follow the example with the exact same dataset, you can download a tarball (31MB) with all the ebooks.
Once the dataset is available on a GO endpoint, you can transfer it to the cluster using the GO web interface or the CLI:
$ ssh cli.globusonline.org scp -r my-gc-endpoint:/~/ebooks/ gp-test:/nfs/scratch/
Task ID: 74f43426-dc8c-11e0-bc85-12313804ec2a
Type <CTRL-C> to cancel or bg<ENTER> to background
[XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX] 94/94 2.14 mbps
Notice how we’re transferring the entire ebooks directory into the /nfs/scratch directory. This is a shared scratch directory that can be accessed from any of the hosts in the cluster.
Note
Because of the way that NFS directories are mounted on Globus Provision instances, and the way Globus Online’s scp commands checks if a directory exists, the above CLI command may fail. If so, do the transfer using the GO web interface, which will first try to list the contents of /nfs/scratch, ensuring that it will be mounted before the transfer.
Another quick workaround, in case you want to use the CLI command, is to run ls /nfs/scratch on the simple-gridftp node. This will force the scratch directory to be mounted before the transfer.
To process the data, we will need to log into the Condor hear node to launch a series of jobs. When using the simple topology file, your public SSH key will be authorized by default in all the users specified in the users option (in fact, their passwords will be disabled, and using an SSH key will be the only way of logging into the hosts).
So, you should be able to log into the Condor head node like this (make sure you substitute the hostname with the one returned by gp-instance-describe, and myuser with the username you specified in users):
ssh myuser@ec2-N-N-N-N.compute-1.amazonaws.com
Once you’ve logged in, run the following:
condor_status
You should see the following output:
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
ec2-R-R-R-R.comput LINUX INTEL Unclaimed Idle 0.010 595 0+00:04:43
ec2-S-S-S-S.comput LINUX INTEL Unclaimed Idle 0.000 595 0+00:04:44
ec2-T-T-T-T.comput LINUX INTEL Unclaimed Idle 0.000 595 0+00:04:42
ec2-U-U-U-U.comput LINUX INTEL Unclaimed Idle 0.000 595 0+00:04:42
Total Owner Claimed Unclaimed Matched Preempting Backfill
INTEL/LINUX 4 0 0 4 0 0 0
Total 4 0 0 4 0 0 0
This shows that the Condor head node is running, and that it is aware of the four worker nodes in our instance.
Now, we need to specify what jobs we’re going to run on the cluster. We’re going to do some simple processing: for each ebook, count the number of words and report the wordcount and the title of the book. To do so, save the following as wc.sh in your home directory (on the cluster):
#!/bin/bash
FILE=$1
TITLE="`grep -m1 "Title:" $FILE | cut -f2- -d" "`"
WORDS=`wc --words $FILE | cut -f1 -d" "`
echo $WORDS $TITLE
Make it executable:
chmod u+x wc.sh
Next, we need to prepare a Condor submission file. Since we have to process nearly 100 files, we’ll use a simple script to generate the submission file. Save the following as gen_condor.sh:
echo 'Universe = vanilla'
echo 'Executable = wc.sh'
echo 'transfer_executable = false'
echo 'Log = wc.log'
echo 'Output = wc.$(Process).out'
echo 'Error = wc.$(Process).error'
for f in `ls /nfs/scratch/ebooks/`;
do
echo "Arguments = /nfs/scratch/ebooks/$f"
echo 'Queue'
done
And run the following:
bash gen_condor.sh > wc.submit
Now, let’s submit the jobs to the Condor cluster:
condor_submit wc.submit
You can use the condor_q command to track the progress of the jobs:
$ condor_q
-- Submitter: ec2-N-N-N-N.compute-1.amazonaws.com : <10.X.X.X:PPPPP> : ec2-N-N-N-N.compute-1.amazonaws.com
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
4.40 borja 9/11 16:57 0+00:00:00 R 0 0.0 wc.sh /nfs/scratch
4.41 borja 9/11 16:57 0+00:00:00 R 0 0.0 wc.sh /nfs/scratch
4.42 borja 9/11 16:57 0+00:00:00 R 0 0.0 wc.sh /nfs/scratch
4.43 borja 9/11 16:57 0+00:00:00 R 0 0.0 wc.sh /nfs/scratch
4.44 borja 9/11 16:57 0+00:00:00 I 0 0.0 wc.sh /nfs/scratch
4.45 borja 9/11 16:57 0+00:00:00 I 0 0.0 wc.sh /nfs/scratch
4.46 borja 9/11 16:57 0+00:00:00 I 0 0.0 wc.sh /nfs/scratch
4.47 borja 9/11 16:57 0+00:00:00 I 0 0.0 wc.sh /nfs/scratch
4.48 borja 9/11 16:57 0+00:00:00 I 0 0.0 wc.sh /nfs/scratch
[etc.]
After a short while, condor_q will report that there are no jobs running. That means the jobs have completed. We just need to concatenate the output of all the jobs to form the list of titles and wordcounts:
$ cat *.out | sort -gr > wc.txt
Let’s take a peek:
$ head wc.txt
1665403 The 2010 CIA World Factbook
904087 The Complete Works of William Shakespeare
824146 The King James Bible
568531 Les Miserables
565450 War and Peace
462169 The Count of Monte Cristo
428901 Don Quixote
383599 An Inquiry into the Nature and Causes of the Wealth of Nations
353778 The Brothers Karamazov
352684 Anna Karenina
Of course, the result file is small enough that you could just copy-paste it somewhere else, but you can also try transferring out of the cluster using Globus Online.
In this guide, you have launched a compute cluster on EC2 using Globus Provision, and staged data in and out with Globus Online. However, the kind of processing we’ve done is very simple, and could easily be done without the need for a Condor cluster. However, this guide lays the groundwork for you to experiment with other datasets, and with more complex forms of processing. Globus Provision and Globus Online take care of provisioning resources and moving data, so you can focus on your work!
If you’d like to learn more about Globus Provision, you may want to do the following:
Once you’re done tinkering, just log out of the Condor head node, and terminate your instance like this:
gp-instance-terminate gpi-65f00474
You will see the following:
Terminating instance gpi-65f00474... done!