Problems starting programs

General

mpirun -np 2 cpi

A: On Intel Paragons and IBM SP1 and SP2, there are many mutually exclusive ways to run parallel programs; each site can pick the approach(es) that it allows. The script mpirun tries one of the more common methods, but may make the wrong choice. Use the -v or -t option to mpirun to see how it is trying to run the program, and then compare this with the site-specific instructions for using your system. You may need to adapt the code in mpirun to meet your needs.

2. Q: When trying to run a program with, e.g., mpirun -np 4 cpi, I get

usage : mpirun [options] <executable> [<dstnodes>] [-- <args>]

mpirun [options] <schema>

mpirun

mpich

which mpirun

mpirun

mpich

mpirun

alias mpirun /usr/local/mpi/bin/mpirun

mpirun -dbx -np 1 foo

dbx

dbx version 3.19 Nov  3 1994 19:59:46 
Unexpected argument ignored: -sr 
/scr/MPI/me/PId8704 is not an executable

dbx

-sr

dbx

mpirun

-dbx

-gdb

-xxgdb

-dbx

4. Q: When attempting to run cpilog I get the following message:

ld.so.1: cpilog: fatal: libX11.so.4: can't open file: errno 2

Consider adding these lines to your .login (assuming C shell):

setenv OPENWINHOME /usr/openwin
setenv LD_LIBRARY_PATH /opt/SUNWspro/lib:/usr/openwin/lib

before

if ($?USER == 0 || $?prompt == 0) exit

A: If you opened the file before calling MPI_INIT, the behavior of MPI (not just mpich) is undefined. On the ch_p4 version, only process zero (in MPI_COMM_WORLD) will have the file open; the other processes will not have opened the file. Move the operations that open files and interact with the outside world to after MPI_INIT (and before MPI_FINALIZE).

6. Q: Programs seem to take forever to start.

A: Several problems can cause this. On systems with dynamically linked executables, the file system may be overwhelmed when many processors simultaneously request the dynamically linked parts of the executable (this has been measured as a problem with some DFS implementations). Try statically linking your application.

On workstation networks, long startup times can be due to the time used to start remote processes; see the discussion on the secure server.

Workstation Networks

mpirun

Permission denied

A: If you see something like this

% mpirun -np 2 cpi
Permission denied.

ch_p4

chameleon

rsh

tstmachines

-arch

sun4

tstmachines sun4

.rhosts

/etc/hosts.equiv

Using the secure server

rsh

remsh

If your system allows a .rhosts file, do the following:

.rhosts

chmod og-rwx .rhosts

.rhosts

host username

doe

a.our.org

b.our.org

.rhosts

a.our.org doe 
b.our.org doe
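The .rhosts steps above can be sketched as a small Bourne-shell helper. This is only an illustration: the function name write_rhosts and its argument order are assumptions made here, not part of MPICH. It writes one "host username" line per host and then applies the chmod og-rwx change shown above.

```shell
#!/bin/sh
# Sketch only: write "host username" lines to an .rhosts-style file and
# restrict its permissions. write_rhosts is a hypothetical helper name.
# Usage: write_rhosts FILE USERNAME HOST...
write_rhosts() {
    file=$1
    user=$2
    shift 2
    printf '' > "$file" || return 1      # create or truncate the file
    for host in "$@"; do
        printf '%s %s\n' "$host" "$user" >> "$file"
    done
    chmod og-rwx "$file"                 # same permission change as in the text
}
```

For the example user doe, the call would be write_rhosts "$HOME/.rhosts" doe a.our.org b.our.org. Note that the sketch overwrites any existing file, so back up a .rhosts that already has entries.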

On networks where .rhosts files are not allowed (such as the one in MCS at Argonne), use the p4 server to run on machines that are not trusted by the machine from which you initiate the job.

Finally, you may need to use a non-standard rsh command within MPICH. In that case, MPICH must be reconfigured with -rsh=command_name, and perhaps also with -rshnol if the remote shell command does not support the -l argument. Systems using Kerberos and/or AFS may need this.

2. Q: When I use mpirun, I get the message Try again.

A: If you see something like this

% mpirun -np 2 cpi
Try again.

rshd

The only fix for this is to have your system administrator look into the machine that is generating this message.

3. Q: When running the workstation version (-device=ch_p4), I get error messages of the form

stty: TCGETS: Operation not supported on socket

stty: tcgetattr: Permission denied

stty: : Can't assign requested address

.login

.cshrc

.profile

stty

tset

TERM

PROMPT

if ($?TERM) then
    eval `tset -s -e^\? -k^U -Q -I $TERM`
endif

if ($?USER == 0 || $?prompt == 0) exit

.cshrc

after

LD_LIBRARY_PATH

4. Q: When using mpirun I get strange output like

arch: No such file or directory

.cshrc

which hostname

.cshrc

5. Q: When I try to run my program, I get

p0_4652:  p4_error: open error on procgroup file (procgroup): 0

mpirun

ch_p4

Try the following:

Run the program using mpirun and the -t argument:

mpirun -t -np 1 foo

-t

-echo

mpirun -echo -np 1 foo

lib/<architecture>/<device>

bin

6. Q: When trying to run a program I get this message:

icy% mpirun -np 2 cpi -mpiversion
icy: icy: No such file or directory

/usr/lib/rsh

which rsh
ls /usr/*/rsh

/usr/lib

/usr/ucb

/usr/bin

/usr/lib

Another choice is to add a link to the remote shell in a directory that appears earlier in the search path. For example, I have /home/gropp/bin/sun4 early in my search path; I could use

cd /home/gropp/bin/sun4
ln -s /usr/bin/rsh rsh

/usr/bin/rsh
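To see which rsh the search path picks up first, and which other copies it hides, something like the following Bourne-shell sketch may help. The function name list_in_path is hypothetical. It prints every executable of the given name in search order, so the first line printed is the one that would actually be run.

```shell
#!/bin/sh
# Sketch only: print every executable called NAME found on a
# colon-separated search path, in search order.
# list_in_path is a hypothetical helper name.
# Usage: list_in_path NAME [SEARCHPATH]   (SEARCHPATH defaults to $PATH)
list_in_path() (
    # The body runs in a subshell, so changing IFS here does not leak out.
    name=$1
    IFS=:
    for dir in ${2:-$PATH}; do
        [ -n "$dir" ] || dir=.           # an empty entry means the current directory
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
)
```

Running list_in_path rsh before and after adding the link shows whether the new link now comes ahead of the other copies.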

7. Q: When trying to run a program I get this message:

trying normal rsh

A: Your remote shell command probably does not support the -l argument. Reconfigure MPICH with -rshnol.

8. Q: When I run my program, I get messages like

| ld.so: warning: /usr/lib/libc.so.1.8 has older revision than expected 9

One temporary fix that you can use is to add the link-time option to force static linking instead of dynamic linking. For some Sun workstations, the option is -Bstatic.

9. Q: Programs never get started. Even tstmachines hangs.

Check first that rsh works at all. For example, if you have workstations w1 and w2, and you are running on w1, try

rsh w2 true

rsh w1 true

rsh

true

permission denied

krcmd: No ticket file (tf_util) 
rsh: warning, using standard rsh: can't provide Kerberos auth data.

rsh

rsh/rshd
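The rsh check above can be repeated over several workstations with a short loop. The following is a minimal Bourne-shell sketch. The function name check_hosts and the RSH variable are assumptions made for this example: RSH defaults to rsh and can be set to remsh on systems that use that name.

```shell
#!/bin/sh
# Sketch only: run "true" on each host through the remote shell and
# report which hosts fail. check_hosts and RSH are hypothetical names.
# Usage: check_hosts HOST...
check_hosts() {
    failed=0
    for host in "$@"; do
        # </dev/null keeps the remote shell from reading our standard input.
        if ${RSH:-rsh} "$host" true </dev/null; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return "$failed"
}
```

For the two workstations in the example, the call would be check_hosts w1 w2. The sketch has no timeout, so a remote shell that hangs will also hang the loop, which at least identifies the host responsible.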

10. Q: When running the workstation version (-device=ch_p4), I get error messages of the form

more slaves than message queues

mpich

configure -device=ch_p4 -comm=shared

11. Q: My programs seem to hang in MPI_Init.

A: There are a number of ways that this can happen:

tstmachines

select

mpich

Another is if you use the library -ldxml (extended math library) on Digital Alpha systems. This has been observed to cause MPI_Init to hang. No workaround is known at this time; contact Digital for a fix if you need to use MPI and -ldxml together.

ch_p4

p0_2005:  p4_error: fork_p4: fork failed: -1
          p4_error: latest msg from perror: Error 0

ch_p4

ch_tcp

size

swap -l

size

swap -l

A similar problem can happen on IBM SPx using the ch_eui or ch_mpl device; the cause is the same but it originates within the IBM MPL library.

13. Q: Sometimes, I get the error

Exec format error. Wrong Architecture.

sync

mpirun

sync

14. Q: There seem to be two copies of my program running on each node. This doubles the memory requirement of my application. Is this normal?

A: Yes, this is normal. In the ch_p4 implementation, the second process is used to dynamically establish connections to other processes.

Intel Paragon

mpirun

A: Give mpirun the argument -paragontype nqs.

IBM RS6000

ch_p4

% mpirun -np 2 cpi 
Could not load program /home/me/mpich/examples/basic/cpi  
Could not load library libC.a[shr.o] 
Error was: No such file or directory

mpich

xlC

util/machines/machines.rs6000

xlC

mpich

xlc

gcc

2. Q: When trying to run on an IBM RS6000 with the ch_p4 device, I got

% mpirun -np 2 cpi 
Could not load program /home/me/mpich/examples/basic/cpi  
Could not load library libC.a[shr.o] 
Error was: No such file or directory

xlC

util/machines/machines.rs6000

xlC

xlc

gcc

IBM SP

$ mpirun -np 2 hello
ERROR: 0031-124 Couldn't allocate nodes for parallel execution. Exiting ...
ERROR: 0031-603 Resource Manager allocation for task: 0, node: me1.myuniv.edu, rc = JM_PARTIONCREATIONFAILURE
ERROR: 0031-635 Non-zero status -1 returned from pm_mgr_init

mpirun

poe

mpirun

dsh -av "ps aux | egrep -i 'poe|pmd|jmd'"

/tmp/jmd_err

2. Q: When trying to run on an IBM SPx, I get the message from mpirun:

ERROR: 0031-214  pmd: chdir </a/user/gamma/home/mpich/examples/basic>
ERROR: 0031-214  pmd: chdir </a/user/gamma/homempich/examples/basic>

mpirun

ksh

3. Q: When I tried to run my program on an IBM SPx, I got

ERROR : Cannot locate message catalog (pepoe.cat) using current NLSPATH 
INFO  : If NLSPATH is set correctly and catalog exists, check LANG or  
LC_MESSAGES variables 
(C) Opening of "pepoe.cat" message catalog failed

A: This is a problem in your system; contact your support staff. Have them look at (a) the value of NLSPATH and (b) the links from /usr/lib/nls/msg/prime to the appropriate language directory. The messages are not from MPICH; they are from the IBM POE/MPL code that the MPICH implementation is using.

4. Q: When trying to run on an IBM SP2, I get this message:

ERROR: 0031-124  Less than 2 nodes available from pool 0

MP_RETRY

MP_RETRYCOUNT

man poe

5. Q: When running on an IBM SP, my job generates the message

Message number 0031-254 not found in Message Catalog.

A: If your user name is eight characters long, you may be experiencing a bug in the IBM POE environment. The only fix at the time this was written was to use an account whose user name was seven characters or fewer. Ask your IBM representative about PMR 4017X (poe with userids of length eight fails) and the associated APAR IX56566.
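A quick way to check whether an account falls into this case is to test the length of the user name. The Bourne-shell sketch below assumes that any name longer than seven characters is affected. The function name poe_name_too_long is hypothetical.

```shell
#!/bin/sh
# Sketch only: succeed (return 0) when NAME is longer than seven
# characters, the case the text associates with the POE problem.
# poe_name_too_long is a hypothetical helper name.
# Usage: poe_name_too_long NAME
poe_name_too_long() {
    [ "${#1}" -gt 7 ]
}
```

For the current account, one might run poe_name_too_long "$(id -un)" && echo "user name may trigger the POE problem".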
