From NWChem
|
Clicked A Few Times
Threads 2
Posts 11
|
|
9:43:15 AM PDT - Thu, Jul 16th 2015 |
|
Our supercomputer administrators have recently switched from PBS to SLURM. For now they are supporting PBS submissions to SLURM, but I do not know their long-term plans for it. How difficult is it to add a new queueing system to ECCE? Where do I find the scripts to make it happen?
Matthew Asplund
|
|
|
|
Clicked A Few Times
Threads 2
Posts 11
|
|
10:41:17 AM PDT - Thu, Jul 16th 2015 |
|
Follow-up to my own post
|
I have edited the QueueManager file to create a new SLURM set of commands, but I am not certain whether I also have to edit something else so that the output of the SLURM commands is parsed correctly.
Matthew Asplund
|
|
|
|
Gets Around
Threads 14
Posts 111
|
|
8:14:08 PM PDT - Mon, Jul 20th 2015 |
|
Matthew,
Let me know how it goes. I'm (slowly) working on setting up SLURM on my cluster (Debian jessie doesn't package SGE anymore) and will try to get ECCE working with it.
|
|
|
|
Gets Around
Threads 14
Posts 111
|
|
4:54:56 AM PDT - Tue, Jul 28th 2015 |
|
I've set up SLURM on my cluster and have configured ECCE to work with it. See here: [1]
It works, but can probably be improved upon.
|
|
|
|
Clicked A Few Times
Threads 2
Posts 11
|
|
2:56:08 PM PDT - Wed, Jul 29th 2015 |
|
I actually edited the submit.site file to add explicit support for SLURM by adding the following lines:
SLURM {
#SBATCH --time=$wallTime
#SBATCH --ntasks=$totalprocs
#SBATCH --nodes=$nodes
#SBATCH -C 'avx'
#SBATCH --mem-per-cpu=4096M
}
I am still having problems with job monitoring, so I will try applying your changes to eccejobmonitor in my installation.
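To make it easier to see what the SLURM header from submit.site looks like once the placeholders are filled in, here is a minimal Python sketch. It is only an illustration: ECCE performs its own substitution, and the function name `render_header` and the sample values are assumptions, not part of ECCE.

```python
from string import Template

# Header template mirroring the SLURM block quoted from submit.site.
# string.Template happens to use the same $name placeholder syntax.
SLURM_HEADER = Template(
    "#SBATCH --time=$wallTime\n"
    "#SBATCH --ntasks=$totalprocs\n"
    "#SBATCH --nodes=$nodes\n"
    "#SBATCH -C 'avx'\n"
    "#SBATCH --mem-per-cpu=4096M\n"
)


def render_header(wall_time: str, total_procs: int, nodes: int) -> str:
    """Fill in the three placeholders (hypothetical helper, not ECCE code)."""
    return SLURM_HEADER.substitute(
        wallTime=wall_time, totalprocs=total_procs, nodes=nodes
    )
```

Calling `render_header("02:00:00", 16, 2)` gives the header text that could be compared against the top of a generated submit script to check that every placeholder was replaced.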
|
|
|
|
Gets Around
Threads 14
Posts 111
|
|
4:30:25 PM PDT - Wed, Jul 29th 2015 |
|
Matt,
The key to getting the job monitoring to work is to edit apps/scripts/eccejobmonitor.
Beware that $q contains the name of the queue manager in lower case, regardless of how you've defined it in QueueManagers.
Other than that, it was pretty straightforward (setting up SLURM itself was a bigger challenge), and I've been using it for a day and a bit now without issue.
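The lower-case point is easy to trip over, so here is a small Python sketch of the idea: normalise the queue manager name before comparing it, so that "SLURM", "Slurm" and "slurm" all select the same branch. This is not the code in eccejobmonitor; the table `STATUS_COMMANDS` and the function `status_command` are hypothetical names used only for illustration.

```python
from typing import List

# Hypothetical lookup table keyed by the lower-case queue manager name.
STATUS_COMMANDS = {
    "pbs": ["qstat"],
    "slurm": ["squeue", "--noheader", "--jobs"],
}


def status_command(queue_manager: str, job_id: str) -> List[str]:
    """Return the status command for a job, matching the name case-insensitively."""
    key = queue_manager.strip().lower()
    if key not in STATUS_COMMANDS:
        raise ValueError(f"unknown queue manager: {queue_manager!r}")
    return STATUS_COMMANDS[key] + [job_id]
```

A comparison against the spelling used in QueueManagers (for example "SLURM") would never match a lower-case value, which is why the lookup key is lower-cased first.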
|
Edited On 4:30:47 PM PDT - Wed, Jul 29th 2015 by Ohlincha
|
|
|
|
Clicked A Few Times
Threads 2
Posts 11
|
|
10:53:33 AM PST - Tue, Feb 9th 2016 |
|
I stopped working on this for a while, but am getting back to it. My problem right now is that I get the error "Unable to parse job id. Cannot monitor job." when I submit jobs. When I run the sbatch command to submit a job, it prints "Submitted batch job 9488438" (or whatever the job ID is). I tried writing a wrapper script to reduce the output to just the job ID, but that didn't help. Is there a way to trace what is actually happening during the submit process? I tried setting ECCE_DEBUG and ECCE_RCOM_LOGMODE, but that only outputs the ssh communication.
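For reference, this is roughly what extracting the job ID from sbatch output involves. The Python sketch below handles the default "Submitted batch job N" message and the bare form printed by `sbatch --parsable` (the job ID, optionally followed by ";" and a cluster name). It is an assumption-laden illustration, not ECCE's parser: how ECCE actually matches the job ID is not shown in this thread, and `parse_job_id` is a hypothetical name.

```python
import re
from typing import Optional

# Default sbatch message, e.g. "Submitted batch job 9488438".
_SBATCH_VERBOSE = re.compile(r"Submitted batch job (\d+)")
# Output of "sbatch --parsable": "9488438" or "9488438;clustername".
_SBATCH_PARSABLE = re.compile(r"^(\d+)(?:;\S*)?$")


def parse_job_id(output: str) -> Optional[str]:
    """Return the first job ID found in sbatch output, or None if there is none."""
    for line in output.splitlines():
        line = line.strip()
        match = _SBATCH_VERBOSE.search(line) or _SBATCH_PARSABLE.match(line)
        if match:
            return match.group(1)
    return None
```

Feeding the captured submit output through a check like this can help tell whether the text itself is unparseable (extra banner lines, an error message) or whether the problem lies in the pattern the monitoring side expects.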
|
|
|
|
Gets Around
Threads 14
Posts 111
|
|
4:44:15 AM PST - Sun, Feb 28th 2016 |
|
Add a
#SBATCH --output=slurm.out
line so that messages get logged.
I read it as the submission failing, i.e. the jobs never run?
Log on to the submit node and run the submit_xxxxxx file manually. See what happens and whether it runs. You might be able to narrow it down to either communication issues or something to do with SLURM.
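When running the submit step manually, it helps to capture the exit code, stdout and stderr separately, since the job ID message and any warnings may arrive on different streams. A minimal Python sketch, assuming Python 3.7 or later is available on the submit node; `run_and_capture` is a hypothetical helper and the command passed to it is whatever you would normally type at the prompt.

```python
import subprocess
from typing import List, Tuple


def run_and_capture(cmd: List[str], timeout: float = 60.0) -> Tuple[int, str, str]:
    """Run a command and return (exit code, stdout, stderr) as text."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return proc.returncode, proc.stdout, proc.stderr
```

For example, passing a list made of `"sbatch"` followed by the path to the generated submit script shows exactly what text comes back from the submission, including anything printed before or after the job ID line.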
|
|
|
|
Clicked A Few Times
Threads 2
Posts 11
|
|
2:18:07 PM PDT - Tue, May 10th 2016 |
|
Actually, the jobs submit and run just fine, but I get an error
ERROR: Unable to parse job id. Cannot monitor job.
WARNING: Launch aborted...
So, it is in the submit step that things are failing.
|
|
|