Solved: Problems with tcsh /csh syntax for long $PATH -- potential fix

From NWChem


Gets Around
Threads 13
Posts 100
While getting set up on a ROCKS (CentOS 5.6-based) system, I've stumbled across something that I think should be relatively easy to fix. It seems to affect tcsh but not bsd-csh.

I've got ECCE working perfectly now with my university cluster (either by direct connection to the front node, although I have the 32 proc limit -- see other threads -- or via port redirection), which is running Scientific Linux (Boron?), which is also CentOS/RHEL based.

I'm having some problems with the ROCKS cluster, however. I can do direct submissions to a node via node hopping, but can't submit to SGE on the front node because I get the behaviour below. I did, however, manage to submit to SGE on the front node by node hopping on the same node (mycluster.university.edu (eth0) -> mycluster.local (eth1)). That's all fine.

One of the issues I've found on the ROCKS cluster, but not on the Scientific Linux cluster, is that when I try to submit a job I get this:
+go+if ($?PATH) setenv PATH /opt/gridengine/bin/lx26-amd64/:/usr/bin/:/bin:/usr/sbin:/sbin:/usr/X11R6/bin:/usr/bin/X11:${PATH}
+go+setenv: Too many arguments.

The /opt/gridengine bit comes from the CONFIG.<> file (qmgrpath). Running the command in the remote terminal gives the same error, but only if ${PATH} is included. See below for more.

csh on both rocks/centos and scientific linux is really tcsh (softlinked /bin/csh -> tcsh). The tcsh versions (tcsh 6.14.00 (Astron) 2005-03-25) are identical on both clusters.

The reason for the 'too many arguments' seems to be that the PATH variable is already very long (1686 chars) on the ROCKS cluster but not on the Scientific Linux one (449 chars). So far, so bad.

There's a fix though: this can be dealt with by doing
setenv PATH "/opt/gridengine/bin/lx26-amd64/:/usr/bin/:/bin:/usr/sbin:/sbin:/usr/X11R6/bin:/usr/bin/X11:${PATH}"

instead.

I've tested the
setenv PATH "path1:path2:${PATH}"

syntax on bsd-csh (debian/wheezy standard csh) and in tcsh (centos -- on the ROCKS cluster) and it works on both, so patching shouldn't break anything.

I tried going through the ECCE source code myself, but didn't have too much luck in understanding what to change -- but it doesn't seem like it's something that SHOULD be too difficult to patch.

Interesting to note is that my debian systems, which use bsd-csh and not tcsh, have PATHs which contain up to 1562 chars without requiring "s.

Andy
Edited On 4:10:04 PM PDT - Sun, Mar 24th 2013 by Ohlincha

Gets Around
Threads 2
Posts 82
Hi Andy,

I think your suggested fix will correct the problem you are seeing, but my guess is that the underlying problem isn't the length of $PATH, but the same thing as the previous issue I was noting about having a space in the $PATH (the "channel 1" issue). Having paths with spaces is going to cause lots of havoc, and somehow your path coming into ECCE is set that way as part of the environment on the cluster where you are launching ECCE jobs. You might want to spend a bit of time looking into that to see if you can track it down (it may be something set up by the administrators of that cluster as a default for everyone), because I can't think of a valid reason to have a directory specification with a space in it on a Linux box. It will actually work if all the spaces are properly escaped, but it's very easy not to do that, and the result can only be trouble when running shell commands/scripts (inside or outside ECCE), as we've found.
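One way to track down the kind of entry described above is to list every $PATH component that contains whitespace. This is only a minimal sketch in Python, not something from ECCE; the function name `entries_with_whitespace` is made up for illustration, and it should be run on the cluster where the jobs are launched.

```python
import os


def entries_with_whitespace(path_value, sep=os.pathsep):
    """Return the entries of a PATH-style string that contain a space or tab."""
    return [
        entry
        for entry in path_value.split(sep)
        if any(ch.isspace() for ch in entry)
    ]


if __name__ == "__main__":
    # Print each suspicious entry with repr() so stray spaces are visible.
    for entry in entries_with_whitespace(os.environ.get("PATH", "")):
        print(repr(entry))
```

If this prints nothing, there are no spaces in $PATH as seen by that login environment, and the length of the variable becomes the more likely suspect.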

Regardless, I don't see a reason why your suggested change is unsafe, so I went ahead and made the change. It's probably a good idea anyway to quote the expression just to ensure it's treated as a single value should there be spaces or other unusual syntax. I did this for when both $PATH and $LD_LIBRARY_PATH are specified via setenv, since the latter could also be an issue.
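To illustrate why the quotes matter when the value contains a space, here is a rough sketch in Python. It uses `shlex`, which follows POSIX-style splitting rather than csh's own parser, so treat it as an approximation of what happens after ${PATH} has been expanded. The PATH value with a space in it is hypothetical.

```python
import shlex


def setenv_words(command):
    """Split a command line into words, roughly as a shell would."""
    return shlex.split(command)


# Hypothetical PATH value containing a space (an assumption for illustration).
path_with_space = "/usr/bin:/opt/my tools/bin"

unquoted = f"setenv PATH /opt/gridengine/bin/lx26-amd64/:{path_with_space}"
quoted = f'setenv PATH "/opt/gridengine/bin/lx26-amd64/:{path_with_space}"'

# Unquoted: the space splits the value, so setenv sees an extra argument.
print(len(setenv_words(unquoted)))
# Quoted: the whole value stays one word (setenv, PATH, value).
print(len(setenv_words(quoted)))
```

The unquoted form yields four words and the quoted form three, which matches the "Too many arguments" complaint going away once the value is quoted.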

There are now updated binary distributions and a new source code distribution for ECCE 6.3 with this change. This new ECCE 6.3 also includes the change I was mentioning last week where I don't try to find if xterm is already in $PATH before invoking it, which means you shouldn't see that little csh syntax error related to "channel 1". Actually, I could go back and put quotes around those as well and it would fix that syntax error too. But, I really don't think there is much value in the check for xterm anyway because with modern Linux operating systems, I can't imagine xterm not being in a standard directory. Finally note that there are no code changes related to the other couple fixes you requested (SCF max. iterations for DFT and order of atoms printed for MD systems). Those are going to take more time than I have available right now.

Thanks,
Gary

Gets Around
Threads 13
Posts 100
Cheers Gary,
I think the most important factor is that people know about the quirks ref SCF/DFT and atom order. Fixes can come a lot later. Since I'm not entirely foreign to python I might even be able to make an active contribution.

On behalf of the community, thank you for your hard work,
Andy

Gets Around
Threads 2
Posts 82
Thanks Andy.

Speaking of contributions, one thing I saw on your blog was that you had done some basic work at least to try to get ECCE to support Gaussian 09. That would actually be a great thing to get back into the ECCE distribution. We've had two other sites ask about it (I'm actually not sure if one of those wasn't you although I'm guessing not), but we don't have the funding to do it all here. As you probably found, it really is just a matter of figuring out if there are any deviations in the format of Gaussian input and output files for various theories and run types since the last Gaussian release we supported. We share all the same basic set of property parsing scripts between the different Gaussian versions (going back to G94 I think) and then just have unique ones when there are differences. Since the Gaussian developers know that a lot of add-on products depend upon them maintaining consistent file formats, it's usually not a big task. I'd like to maintain support for Gaussian 03 instead of replacing it, which means it's a bit more work than "remapping" Gaussian-03 to really invoke Gaussian-09 (not sure if that's how you did it already). Mostly it's a matter of finding those files (none should involve C++ coding -- all python, perl and some XML registry file changes) by searching for strings like "g03" and then doing some copy/paste.
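The search step described above could be sketched like this in Python. It is only an illustration: the function name `files_mentioning` is made up, and the directory to search is whatever your ECCE installation or source tree happens to be (not specified here). It reports files whose name or contents contain the search string.

```python
from pathlib import Path


def files_mentioning(root, needle="g03"):
    """Return files under root whose name or text contains needle."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            # errors="ignore" lets binary files be scanned without crashing.
            text = path.read_text(errors="ignore")
        except OSError:
            continue  # unreadable file, skip it
        if needle in path.name or needle in text:
            hits.append(path)
    return hits


# Example use (the directory name is a placeholder):
#   for hit in files_mentioning("/path/to/ecce"):
#       print(hit)
```

The same call with a different `needle`, such as "Gaussian-03", would catch files that use the longer spelling.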

Gary

Gets Around
Threads 13
Posts 100
Gary,
I've looked at the perl scripts in the past and they felt somewhat impenetrable. I didn't ask about it before, but I've been following the discussion re being banned by gaussian (think it was part of the centos/rhel and ecce not starting -- I'm having problems with that too on one of the clusters, but have given it a rest due to lecture preparations). I'll look slowly at the g09 issue again and see if I understand it this time. I think my 6 months of troubleshooting various programs may have prepared me better for it now.

Btw, not even pre-version 5 gaussview can read the g09 files properly, so I think the new format may be significantly different. Or they broke it on purpose to keep the license money coming.

/Andy

Gets Around
Threads 2
Posts 82
Andy,

Yes, running g09 here is not an option. But, with the ECCE calculation import feature, that's not as big of a deal as it would seem. We can test parsing out g09 output by doing this. It just means that another site needs to perform the g09 runs and send the output files back. The trick is to make sure the output files correspond to the type of input file that ECCE can setup itself rather than being a more generic g09 output file. That's true for NWChem as well, but even more so for Gaussian because we support a much smaller subset of the code functionality.

That would be really unfortunate if the output format did change significantly. I found this page, http://www.gaussian.com/g_tech/g_ur/a_gdiffs09.htm, but as you might guess, it's more about the new chemistry functionality than low-level file format changes. As you imply, they probably prefer not to provide much/any documentation on that anyway so there is more motivation for people to license their products.

I won't disagree with you that the code registration scripts aren't completely transparent, to put it mildly. That was one of the many things with ECCE designed once back in the late 1990s and then never revisited. It really is in need of a major design refactoring and would benefit from everything learned on ECCE and a few other software projects here at the lab in the past 15 years or so. The compute host/server registration would benefit from the same as well. Several years ago now I gave a presentation to the NWChem development team on how codes are registered in ECCE. Those slides were turned into a PDF that has been uploaded to our ECCE website. So if you haven't seen those before, they will give you an overview of what is involved, and maybe that would help you avoid getting lost once you start diving into the detail of the scripts themselves. The URL for that is http://ecce.pnl.gov/docs/code_reg_slides.pdf. There is also a code registration manual on the website at http://ecce.pnl.gov/docs/2864B-code_reg.pdf. That one, however, is rather out of date, but may prove useful if not taken too literally.

Gary

Gets Around
Threads 13
Posts 100
Gary,
Cheers for the links to the PDFs -- I've spent two hours today looking at the perl scripts, focussing on the gaussian-03.vib script. I have to admit defeat -- the programming looks clever, but obfuscated. I'm happy to provide g09 output files from ECCE runs if need be, but I am no friend of perl.

I've tried the new ECCE binaries and tail -f when doing node hopping works perfectly. It doesn't quite work with port-forwarding from what I can see though, but it's not critical.

I've got a new problem though:
With the new binaries I'm seeing an odd behaviour: if I submit a job via node hopping to a system with SGE everything goes fine in the sense that the job gets queued. However, ECCE doesn't show that the job is queued i.e. I still see a Blue Triangle rather than a Pale Green Dot. I've posted an example screenshot at http://verahill.blogspot.com.au/2012/06/troubleshooting-ecce.html

This seems like a new regression. Once the job starts the icon changes to a solid green dot as it should. Same if I log in remotely and qdel the job -- the job is recognised as being deleted by ecce and updated as such.

Case 1. If I submit to SGE on the same computer as ECCE, it works as it should.
Case 2. If I submit to SGE on a remote site via port forwarding, it works as it should.
Case 3. If I submit to SGE via node-hopping, it doesn't work.

Also, I see it echo a whole lot of things, including perl scripts -- I don't see this happen in Cases 1 and 2 above.
...
end
task dft optimize
2063+0 records in
4+1 records out
2063 bytes (2.1 kB) copied, 0.18434 seconds, 11.2 kB/s
CMDSTAT=0
+go+#!/bin/csh
#  ECCE Submit Script
#  Generated Mon Jun 11 13:23:02 EST 2012 with ECCE Version v6.3.
# 
267 bytes (267 B) copied, 0.180516 seconds, 1.5 kB/s
CMDSTAT=0
+go+# parse Descriptor for NWCHEM output file
#
# Due to the way nwchem outputs U* theory mos, and the fact that we
# want to only parse the last one, the mo-related parsing is a little
# messy.  A separate entry is required for alpha and beta properties.
# This applies to MO MOBETA ORBOCC ORBOCCBETA...
# Symmetry has been included.
#
[EGRADVEC]
Script=nwchem.egradvec
Begin=task_gradient%begin%total gradient
Frequency=all
End=task
[END]
..
8359+0 records in
16+1 records out
8359 bytes (8.4 kB) copied, 0.363564 seconds, 23.0 kB/s
CMDSTAT=0
+go+###############################################################################
#
# Filename:
#
#       eccejobmonitor
#
# Abstract:
#
#	This program implements a server that extracts data


Finally, I'm guessing that the "eccejobmonitor_went_bye_bye" below might be part of the puzzle. The output below is from submitting a job, which I then qdel on the remote server.
CMDSTAT=0
+go+exit; echo GOODBYE
date; echo CMDSTAT=$status
Sun Jun 10 20:00:36 PDT 2012
CMDSTAT=0
+go+uname -a; echo CMDSTAT=$status
Linux rupert.university.edu 2.6.18-238.19.1.el5xen #1 SMP Fri Jul 15 08:16:59 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
CMDSTAT=0
+go+if (-d /home/jdoe/.andy/jobs/testing/someone/performance-1) echo TRUE
TRUE
+go+cd /home/jdoe/.andy/jobs/testing/someone/performance-1
+go+perl eccejobmonitor -configFile eccejobmonitor.conf -jobId 382 -bookmark 0; echo eccejobmonitor_went_bye_bye
Creating remote shell:
machine (system)
remote shell ()
local shell (csh)
user name ()
password is 0 characters
Remote shell command:
arg 0: csh
arg 1: -fc
arg 2: echo +hi+ && csh -f
End remote shell command
+hi+
unalias precmd; set prompt=+go+; unset echo
% +go+unalias *
+go+date; echo CMDSTAT=$status
Mon Jun 11 13:00:37 EST 2012
CMDSTAT=0
+go+if (-d /tmp/ecce_andy/jobs/performance-1__aBh6Wr) echo TRUE
TRUE
+go+cd /tmp/ecce_andy/jobs/performance-1__aBh6Wr
+go+eccejobmonitor_went_bye_bye
+go+date; echo CMDSTAT=$status
Sun Jun 10 20:01:18 PDT 2012
CMDSTAT=0
+go+date; echo CMDSTAT=$status
Sun Jun 10 20:01:18 PDT 2012
CMDSTAT=0
+go+/bin/rm -f eccejobmonitor eccejobmonitor.conf eccejobmonitor.propbuf *.desc; echo CMDSTAT=$status
CMDSTAT=0
+go+exit; echo GOODBYE
GOODBYE
exit
[jdoe@rupert ~]$ exit
logout
Connection to rupert.local closed.
+go+exit; echo GOODBYE
GOODBYE
exit
[jdoe@rupert ~]$ exit
logout
Connection to rupert.university.edu closed.
Transferred: sent 164576, received 197496 bytes, in 66.5 seconds
Bytes per second: sent 2474.2, received 2969.1
exit; echo GOODBYE






[when I posted what follows below originally I hadn't yet solved it -- someone reading this later might find this helpful]
In addition, I suddenly had problems importing ecce output files -- it said "ERROR: Setup parse script NWChem.expt does not exist or is not executable.", even though everything that should be, is in PATH and the file is executable.

Part of the problem was that the ecce_env script (which is unchanged from the previous version) wasn't evaluated correctly (debian/csh):
if ( `echo $PATH | grep -c "${ECCE_HOME}/scripts/parsers"` == 0 ) then
Word too long.
[..]
if ( `echo $PATH | grep -c "/usr/sbin"` == 0 ) then
Word too long.
[..]
if ( `echo $PATH | grep -c ":.:"` == 0 ) then
Word too long.
[..]
if ( -x /home/andy/.ecce/ecce-6.3e/apps/rhel5-gcc4.1.2-m64/3rdparty/system/bin/python && `echo $PATH | grep -c "${ECCE_HOME}/${ECCE_SYSDIR}3rdparty/system/bin"` == 0 ) then
Word too long.


The problem was bsd-csh, which can only handle 1024 chars per line -- the "Word too long" was referring to the length of $PATH. tcsh isn't supposed to have these limitations.

The fix was simple (on debian):
sudo apt-get install tcsh
sudo update-alternatives --config csh
(then select tcsh at the prompt)
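If switching shells isn't an option, it may help to check how long $PATH actually is and whether repeated entries are inflating it. The sketch below is in Python and is not part of ECCE; `dedupe_path` is a made-up helper, and the 1024 figure is simply the limit quoted above, not something verified here.

```python
import os

# Limit quoted in the post for bsd-csh; treated as an assumption.
ASSUMED_WORD_LIMIT = 1024


def dedupe_path(path_value, sep=":"):
    """Drop repeated and empty entries, keeping first occurrences in order.

    Note: an empty entry normally means "current directory", so dropping
    it changes behaviour slightly.
    """
    seen = set()
    kept = []
    for entry in path_value.split(sep):
        if entry and entry not in seen:
            seen.add(entry)
            kept.append(entry)
    return sep.join(kept)


if __name__ == "__main__":
    current = os.environ.get("PATH", "")
    shorter = dedupe_path(current, os.pathsep)
    print("current length:", len(current))
    print("deduplicated length:", len(shorter))
    print("still over assumed limit:", len(shorter) > ASSUMED_WORD_LIMIT)
```

This only reports the numbers; actually changing $PATH would still have to happen in the shell start-up files.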
Edited On 12:40:10 AM PDT - Mon, Jun 11th 2012 by Ohlincha

Gets Around
Threads 13
Posts 100
I also noticed another thing: with the latest version of ECCE I seem to notice that 'noio' is now being included in the dft block when 'direct' is being chosen. Is that correct? Has this been changed in the newest version?
If memory serves me right, noio can be a dangerous thing as it makes it difficult to restart calculations that crash -- although this is experience earned using only nwchem and not ecce, which would presumably import data from the calculation continuously via the perl monitor script, so it may not be much of an issue.
/Andy

Gets Around
Threads 2
Posts 82
Quote:Ohlincha Jun 10th 10:25 pm
I also noticed another thing: with the latest version of ECCE I seem to notice that 'noio' is now being included in the dft block when 'direct' is being chosen. Is that correct? Has this been changed in the newest version?
If memory serves me right, noio can be a dangerous thing as it makes it difficult to restart calculations that crash -- although this is experience earned using only nwchem and not ecce, which would presumably import data from the calculation continuously via the perl monitor script, so it may not be much of an issue.
/Andy


Andy,

I took out "noio" as the direct DFT default and updated all the ECCE downloads. The comment in the $ECCE_HOME/scripts/parsers/ai.nwchem file that generates NWChem input files was that this was added just to save from writing out extra stuff the user probably doesn't care about for a direct calculation. But, I do agree with the philosophy that ECCE shouldn't haphazardly override NWChem defaults. Just search for "noio" in that ai.nwchem file and remove that line yourself and no need to download/install again. Plus, you can make any changes like that you like without even needing to recompile core ECCE code. If you think you made a change that the overall ECCE community would benefit from, let me know and I can make the change to our SVN repository and push out new downloads.

Gary

Just Got Here
Threads 0
Posts 1
Long Path Tool helped me in this situation.
http://PathTooDeep.com

Gets Around
Threads 2
Posts 82
Quote:Gabriel Nar Nov 10th 2:18 am
Long Path Tool helped me in this situation.
http://PathTooDeep.com


Thanks for that bit of useless spam regarding a Windows-specific tool that isn't applicable to our Linux software.

Gary

