8:43:15 AM PDT - Thu, Jul 16th 2015
Our supercomputer administrators have recently switched from PBS to SLURM. For now they are still supporting PBS-style submissions under SLURM, but they do not know their long-term plans for that support. How difficult is it to add a new queueing system to ECCE? Where do I find the scripts to make it happen?
Matthew Asplund
9:41:17 AM PDT - Thu, Jul 16th 2015
Follow-up to my own post
I have edited the QueueManagers file to create a new set of SLURM commands, but I am not certain whether I also have to edit something to make parsing of the output from the SLURM commands work.
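For reference, the QueueManagers file appears to be organised as one stanza of name|key: value lines per queue manager. A SLURM stanza might look roughly like the sketch below; the key names are assumed to mirror the existing PBS stanza, so check the exact spelling (and the job-id parse settings in particular) against the file shipped with your ECCE installation, and SLURM also needs to be added to the manager list at the top of the file.

# Sketch only - verify the key names against the PBS entry in this file
SLURM|submitCommand:        sbatch ##script##
SLURM|cancelCommand:        scancel ##id##
SLURM|queryJobCommand:      squeue
SLURM|jobIdParseExpression: [0-9]+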
Matthew Asplund
7:14:08 PM PDT - Mon, Jul 20th 2015
Matthew,
let me know how it goes. I'm (slowly) working on setting up slurm on my cluster (Debian Jessie doesn't package SGE anymore) and will try to get ECCE working with it.
3:54:56 AM PDT - Tue, Jul 28th 2015
I've set up slurm on my cluster and have configured ECCE to work with it. See here: [1]
It works, but can probably be improved upon.
1:56:08 PM PDT - Wed, Jul 29th 2015
I actually edited the submit.site file to add explicit support for SLURM by adding the following lines to the file:
SLURM {
#SBATCH --time=$wallTime
#SBATCH --ntasks=$totalprocs
#SBATCH --nodes=$nodes
#SBATCH -C 'avx'
#SBATCH --mem-per-cpu=4096M
}
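If I understand the submit.site mechanism correctly, $wallTime, $totalprocs and $nodes are placeholders that get filled in from the launcher settings when the batch script is written, so for, say, a one-hour, 8-process, single-node run the generated header would come out along these lines (values illustrative only):
#SBATCH --time=1:00:00
#SBATCH --ntasks=8
#SBATCH --nodes=1
#SBATCH -C 'avx'
#SBATCH --mem-per-cpu=4096M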
I am still having problems with job monitoring, so I will try applying your changes to eccejobmonitor in my installation.
3:30:25 PM PDT - Wed, Jul 29th 2015
Matt,
the key to getting the job monitoring to work is to edit apps/scripts/eccejobmonitor.
Beware that $q contains the name of the queue manager in lower case, regardless of how you've defined it in QueueManagers.
Other than that, it was pretty straightforward (setting up SLURM itself was a bigger challenge), and I've been using it for a day and a bit now without issue.
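This is not the actual eccejobmonitor code, just a shell sketch of the kind of state check a slurm branch of the monitor has to make once it knows the numeric job id (squeue only lists jobs still in the queue, so sacct is the fallback for jobs that have finished):

#!/bin/bash
# Sketch only: report the state of one SLURM job, given its numeric id in $1.
id="$1"
# While the job is pending or running it shows up in squeue; %T prints the state.
state=$(squeue -h -j "$id" -o '%T' 2>/dev/null)
if [ -z "$state" ]; then
    # Once the job has left the queue, ask the accounting database instead.
    state=$(sacct -n -j "$id" -o State 2>/dev/null | head -n 1 | tr -d ' ')
fi
echo "${state:-UNKNOWN}"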
Edited On 3:30:47 PM PDT - Wed, Jul 29th 2015 by Ohlincha
9:53:33 AM PST - Tue, Feb 9th 2016
So, I stopped playing with this for a while, but am getting back to it. My problem right now is that I am getting the error "Unable to parse job id. Cannot monitor job." when I submit things. Now, when I run the sbatch command to submit a job, it returns the output "Submitted batch job 9488438" (or whatever the job ID is). I tried writing a wrapper script to reduce the output to just the job id, but that didn't help. Is there a way to track what is actually happening during the submit process? I tried setting ECCE_DEBUG and ECCE_RCOM_LOGMODE, but that just outputs the ssh communication.
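For reference, the wrapper mentioned above can be as simple as the sketch below (illustrative only, not part of ECCE); sbatch prints "Submitted batch job NNNN", and the grep strips everything except the number:

#!/bin/bash
# Illustrative wrapper: pass all arguments through to sbatch and
# print only the numeric job id from its "Submitted batch job NNNN" output.
out=$(sbatch "$@") || exit $?
echo "$out" | grep -oE '[0-9]+' | head -n 1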
3:44:15 AM PST - Sun, Feb 28th 2016
Add a
#SBATCH --output=slurm.out
line so that messages get logged.
I read this as the submission failing, i.e. the jobs never run?
Log onto the submit node and run the submit_xxxxxx file manually. See what happens and whether it runs. You might be able to narrow it down to either communication issues or something to do with slurm.
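Concretely, something along these lines on the submit node (submit_xxxxxx standing for whatever script ECCE staged in the job directory):

sbatch submit_xxxxxx      # does sbatch itself print "Submitted batch job NNNN"?
squeue -u "$USER"         # is the job actually queued or running?
cat slurm.out             # any messages, once --output=slurm.out is in place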
1:18:07 PM PDT - Tue, May 10th 2016
Actually, the jobs submit and run just fine, but I get the error
ERROR: Unable to parse job id. Cannot monitor job.
WARNING: Launch aborted...
So it is in the submit step that things are failing.