GNU Parallel

GNU Parallel is a tool for executing commands in parallel on one or more nodes. If you put all of your commands in a file named commands.txt, GNU Parallel can dynamically distribute them across all of the nodes and cores that were requested by a PBS job.

Single-node Examples

You have a file named commands.txt containing a list of commands and want to run one command per core:

$ module load parallel
$ parallel < commands.txt
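The commands.txt file is plain text with one complete shell command per line. As a minimal sketch, such a file could be generated with a loop; the sample names and the echo commands below are hypothetical placeholders, not part of any real workflow:

```shell
# Build commands.txt with one command per line.
# sample1..sample3 are made-up names used only for illustration.
: > commands.txt                      # start with an empty file
for s in sample1 sample2 sample3; do
    echo "echo processing $s" >> commands.txt
done

cat commands.txt                      # inspect the list before running it
```

Each line is run as an independent command, so a line should not rely on variables or directory changes made by another line.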

GNU Parallel automatically identifies the number of cores on the node and runs one command per core. Use the --jobs option to specify a different number of concurrent commands:

$ module load parallel
$ parallel --jobs 2 < commands.txt

You want to run the same command (FastQC) on many (fastq) files, running one command per core:

$ module load parallel
$ module load fastqc
$ find ~/fastqfolder -name '*.fastq' | parallel "fastqc {}"

You want to run the same command (wc -l) on many files, running one command per core, saving the output to files named EXAMPLE.fastq.out in the same directory as the fastq files:

$ module load parallel
$ find ~/fastqfolder -name '*.fastq' | parallel "wc -l {} > {}.out"

You want to run the same command (wc -l) on many files, running one command per core, saving the output to files named EXAMPLE.fastq.out in a different directory:

$ module load parallel
$ find ~/fastqfolder -name '*.fastq' | parallel "wc -l {} > ~/output/{/}.out"

You want to run the same command (wc -l) on many files, running one command per core, saving the output to files named EXAMPLE.out in the current working directory:

$ module load parallel
$ find ~/fastqfolder -name '*.fastq' | parallel "wc -l {} > {/.}.out"
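The examples above rely on three replacement strings: {} (the input line), {/} (the input with its directory removed), and {/.} (the input with both its directory and its last extension removed). The sketch below reproduces those expansions with plain shell parameter expansion so that the naming can be checked without running anything; the path is a hypothetical example, and the final step only runs if a GNU Parallel executable is found.

```shell
# Hypothetical input path, used only for illustration.
f=/home/user/fastqfolder/sample1.fastq

full=$f              # {}   : the input line unchanged
base=${f##*/}        # {/}  : directory part removed
noext=${base%.*}     # {/.} : directory part and last extension removed

echo "{}   -> $full"
echo "{/}  -> $base"
echo "{/.} -> $noext"

# If GNU Parallel is installed, --dry-run prints the command it would run
# for this input without executing it.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    parallel --dry-run "wc -l {} > {/.}.out" ::: "$f"
fi
```

Running a command with --dry-run first is a cheap way to confirm that output files will land where you expect before launching a large batch.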

Multi-node Examples

If you pass GNU Parallel a file with a list of nodes, it will run jobs on each node. The PBS environment variable PBS_NODEFILE points to a file that lists all nodes allocated to the current job; however, each node is listed once for each core on the node. You therefore need to either tell GNU Parallel to run one job per node, or remove the duplicate node names from the node file.

You have a file named commands.txt containing a list of single-threaded commands and want to run one command per core on multiple nodes:

$ module load parallel
$ parallel --jobs 1 --sshloginfile $PBS_NODEFILE --workdir $PWD < commands.txt

This is equivalent to:

$ module load parallel
$ sort -u $PBS_NODEFILE > unique-nodelist.txt
$ parallel --sshloginfile unique-nodelist.txt --workdir $PWD < commands.txt
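To see what the sort -u step does, the sketch below builds a mock node file in the layout described above (each node listed once per core) and removes the duplicates. The node names node01 and node02 and the count of four cores per node are assumptions made for illustration; inside a real job the file named by PBS_NODEFILE would be used instead.

```shell
# Mock node file: two hypothetical nodes with four cores each,
# so each node name appears four times.
nodefile=$(mktemp)
for node in node01 node02; do
    for core in 1 2 3 4; do
        echo "$node"
    done
done > "$nodefile"

wc -l < "$nodefile"                          # number of entries, one per core
sort -u "$nodefile" > unique-nodelist.txt    # keep each node name once
cat unique-nodelist.txt                      # one line per node
```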

You have a file named commands.txt containing a list of multi-threaded commands and want to run one command per node on multiple nodes:

$ module load parallel
$ sort -u $PBS_NODEFILE > unique-nodelist.txt
$ parallel --jobs 1 --sshloginfile unique-nodelist.txt --workdir $PWD < commands.txt

Loading modules

Multi-node jobs are a little tricky because the remote nodes do not inherit the environment from the head node, so any modules loaded by the PBS script won’t be present on the remote nodes. Also, the module command is really just a shell alias, and aliases don’t work in the non-interactive bash sessions that are created on the remote nodes. One workaround is to include this environment variable definition in your PBS script after you have loaded your modules, but before you run GNU Parallel:

$ module load parallel
$ module load another_module
$ export PARALLEL="--workdir . --env PATH --env LD_LIBRARY_PATH --env LOADEDMODULES --env _LMFILES_ --env MODULE_VERSION --env MODULEPATH --env MODULEVERSION_STACK --env MODULESHOME --env OMP_DYNAMIC --env OMP_MAX_ACTIVE_LEVELS --env OMP_NESTED --env OMP_NUM_THREADS --env OMP_SCHEDULE --env OMP_STACKSIZE --env OMP_THREAD_LIMIT --env OMP_WAIT_POLICY"
$ parallel --jobs 1 --sshloginfile $PBS_NODEFILE --workdir $PWD < commands.txt
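Because the PARALLEL definition is long, it can be easier to maintain if it is assembled from a list of variable names. The following is a minimal sketch for bash or another POSIX shell, shown with only the module-related variables; the OMP_* names could be appended to the list in the same way. The helper variables vars and opts are names chosen here for illustration.

```shell
# Names of the environment variables to forward to the remote nodes.
vars="PATH LD_LIBRARY_PATH LOADEDMODULES _LMFILES_ MODULE_VERSION MODULEPATH MODULEVERSION_STACK MODULESHOME"

# Start with the working-directory option, then add one --env per variable.
opts="--workdir ."
for v in $vars; do
    opts="$opts --env $v"
done

export PARALLEL="$opts"
echo "$PARALLEL"        # print the assembled options for a quick check
```

GNU Parallel reads default options from the PARALLEL environment variable, so the resulting string has the same effect as writing the options out by hand.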