mpirun -np 2 cpi
either I get an error message or the program hangs.
A: On Intel Paragons and IBM SP1 and SP2, there are many mutually exclusive ways to run parallel programs; each site can pick the approach(es) that it allows. The script mpirun tries one of the more common methods, but may make the wrong choice. Use the -v or -t option to mpirun to see how it is trying to run the program, and then compare this with the site-specific instructions for using your system. You may need to adapt the code in mpirun to meet your needs.
2. Q:
When trying to run a program with, e.g., mpirun -np 4 cpi, I get
usage : mpirun [options] <executable> [<dstnodes>] [-- <args>]
or
mpirun [options] <schema>
A: You have a command named mpirun in your path ahead of the mpich version. Execute the command
which mpirun
to see which command named mpirun was actually found. The fix is to either change the order of directories in your path to put the mpich version of mpirun first, or to define an alias for mpirun that uses an absolute path. For example, in the csh shell, you might do
alias mpirun /usr/local/mpi/bin/mpirun
3. Q:
mpirun -dbx -np 1 foo
dbx does start up but this message appears:
dbx version 3.19 Nov 3 1994 19:59:46
Unexpected argument ignored: -sr
/scr/MPI/me/PId8704 is not an executable
A: Your version of dbx does not support the -sr argument; this is needed to give dbx the initial commands to execute. You will not be able to use mpirun with the -dbx argument. Try using -gdb or -xxgdb instead of -dbx if you have the GNU debugger.
4. Q:
When attempting to run cpilog I get the following message:
ld.so.1: cpilog: fatal: libX11.so.4: can't open file: errno 2
A: The X11 version that configure found isn't properly installed. This is a common problem with Sun/Solaris systems. One possibility is that your Solaris machines are running slightly different versions. You can try forcing static linking (-Bstatic on SunOS).
Consider adding these lines to your .login (assuming C shell):
setenv OPENWINHOME /usr/openwin
setenv LD_LIBRARY_PATH /opt/SUNWspro/lib:/usr/openwin/lib
(you may want to check with your system administrator first to make sure that the paths are correct for your system). Make sure that you add them before any line like
if ($?USER == 0 || $?prompt == 0) exit
A: If you opened the file before calling MPI_INIT, the behavior of MPI (not just mpich) is undefined. On the ch_p4 version, only process zero (in MPI_COMM_WORLD) will have the file open; the other processes will not have opened the file. Move the operations that open files and interact with the outside world to after MPI_INIT (and before MPI_FINALIZE).
6. Q:
Programs seem to take forever to start.
A: This can be caused by any of several problems. On systems with dynamically-linked executables, this can be caused by problems with the file system suddenly getting requests from many processors for the dynamically-linked parts of the executable (this has been measured as a problem with some DFS implementations). You can try statically linking your application.
On workstation networks, long startup times can be due to the time used to start remote processes; see the discussion on the secure server.
A:
If you see something like this
% mpirun -np 2 cpi
Permission denied.
when using the ch_p4 or chameleon device, it probably means that you do not have permission to use rsh to start processes. The script tstmachines can be used to test this. For example, if the architecture type (the -arch argument to configure) is sun4, then try
tstmachines sun4
If this fails, then you may need a .rhosts or /etc/hosts.equiv file (you may need to see your system administrator) or you may need to use the p4 server (see the section Using the secure server). Another possible problem is the choice of the remote shell program; some systems have several. Check with your systems administrator about which version of rsh or remsh you should be using.
If your system allows a .rhosts file, it should contain one line of the following form for each machine that you want to use:
host username
For example, if your username is doe and you want to use machines a.our.org and b.our.org, your .rhosts file should contain
a.our.org doe
b.our.org doe
Note the use of fully qualified host names (some systems require this).
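The host/username lines described above can be generated rather than typed by hand. The following is a minimal sketch in Bourne shell; the function name rhosts_lines is a hypothetical helper, not part of mpich, and the chmod step reflects the behavior of many rshd implementations rather than anything mpich requires.

```sh
# Print one ".rhosts" line ("host username") per host.
# Usage: rhosts_lines USERNAME HOST [HOST ...]
# rhosts_lines is a hypothetical helper name used only for illustration.
rhosts_lines() {
    user=$1
    shift
    for host in "$@"; do
        printf '%s %s\n' "$host" "$user"
    done
}

# Example, using the names from the text:
#   rhosts_lines doe a.our.org b.our.org >> "$HOME/.rhosts"
#   chmod 600 "$HOME/.rhosts"   # many rshd versions reject files writable by others
```

Check with your system administrator before relying on the permission setting; sites differ in what rshd accepts.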
On networks where the use of .rhosts files is not allowed (such as the one in MCS at Argonne), you should use the p4 server to run on machines that are not trusted by the machine from which you are initiating the job.
Finally, you may need to use a non-standard rsh command within MPICH. MPICH must be reconfigured with -rsh=command_name, and perhaps also with -rshnol if the remote shell command does not support the -l argument. Systems using Kerberos and/or AFS may need this.
2. Q:
When I use mpirun, I get the message Try again.
A:
If you see something like this
% mpirun -np 2 cpi
Try again.
it means that you were unable to start a remote job with the remote shell command on some machine, even though you would normally be able to. This may mean that the destination machine is very busy, out of memory, or out of processes. The man page for rshd may give you more information.
The only fix for this is to have your system administrator look into the machine that is generating this message.
3. Q:
When running the workstation version (-device=ch_p4), I get
error messages of the form
stty: TCGETS: Operation not supported on socket
or
stty: tcgetattr: Permission denied
or
stty: : Can't assign requested address
A: This means that one of your login startup scripts (i.e., .login and .cshrc or .profile) contains an unguarded use of the stty or tset program. For C shell users, one typical fix is to check that the variable TERM or PROMPT is initialized. For example,
if ($?TERM) then
    eval `tset -s -e^\? -k^U -Q -I $TERM`
endif
Another solution is to see if it is appropriate to add
if ($?USER == 0 || $?prompt == 0) exit
near the top of your .cshrc file (but after any code that sets up the runtime environment, such as library paths, e.g., LD_LIBRARY_PATH).
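The examples above are for the C shell. For users whose startup script is .profile, a comparable guard can be sketched in Bourne shell as follows. This is only an illustration: the function name stdin_is_terminal is hypothetical, and testing whether standard input is a terminal is an assumed stand-in for the csh test on $?prompt.

```sh
# Succeed only when standard input is attached to a terminal.
# stdin_is_terminal is a hypothetical helper name.
stdin_is_terminal() {
    [ -t 0 ]
}

if stdin_is_terminal; then
    # Place stty or tset calls here, for example: stty erase '^?'
    :
fi
```

A shell started by rsh to run a remote command normally has no terminal on standard input, so the guarded block is skipped in that case.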
4. Q:
When using mpirun I get strange output like
arch: No such file or directory
A: This is usually a problem in your .cshrc file. Try the shell command
which hostname
If you see the same strange output, then your problem is in your .cshrc file.
5. Q:
When I try to run my program, I get
p0_4652: p4_error: open error on procgroup file (procgroup): 0
A: This indicates that the mpirun program did not create the expected input to run the program. The most likely reason is that the mpirun command is trying to run a program built with device ch_p4 (workstation networks) as shared memory or some special system.
Try the following:
Run the program using mpirun and the -t argument:
mpirun -t -np 1 foo
This should show what mpirun would do (-t is for testing). Or you can use the -echo argument to see exactly what mpirun is doing:
mpirun -echo -np 1 foo
In general, you should select the mpirun in the lib/<architecture>/<device> directory over the one in the bin directory.
6. Q:
When trying to run a program I get this message:
icy% mpirun -np 2 cpi -mpiversion
icy: icy: No such file or directory
A: Your problem is that /usr/lib/rsh is not the remote shell program. Try the following:
which rsh
ls /usr/*/rsh
You probably have /usr/lib in your path ahead of /usr/ucb or /usr/bin. This picks the `restricted' shell instead of the `remote' shell. The easiest fix is to just remove /usr/lib from your path (few people need it); alternatively, you can move it to after the directory that contains the `remote' shell rsh.
Another choice would be to add a link in a directory earlier in the search
path to the remote shell. For example, I have /home/gropp/bin/sun4
early in my search path; I could use
cd /home/gropp/bin/sun4
ln -s /usr/bin/rsh rsh
there (assuming /usr/bin/rsh is the remote shell).
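Because which reports only the first match, it can help to list every rsh on the search path in the order the shell would consider them. The following Bourne shell sketch does that; path_matches is a hypothetical helper name, and the optional second argument exists only so that a path other than $PATH can be examined.

```sh
# List every executable file named $1 found in the directories of $2
# (default: $PATH), in search order.
# path_matches is a hypothetical helper name used only for illustration.
path_matches() {
    name=$1
    search=${2:-$PATH}
    old_ifs=$IFS
    IFS=:
    for dir in $search; do
        # An empty path component means the current directory.
        [ -n "$dir" ] || dir=.
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
    IFS=$old_ifs
}

# Example:
#   path_matches rsh
```

If /usr/lib/rsh appears before /usr/ucb/rsh or /usr/bin/rsh in the output, the restricted shell is the one being picked up.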
7. Q:
When trying to run a program I get this message:
trying normal rsh
A: You are using a version of the remote shell program that does not support the -l argument. Reconfigure with -rshnol and rebuild MPICH. You may suffer some loss of functionality if you try to run on systems where you have different user names.
8. Q:
When I run my program, I get messages like
ld.so: warning: /usr/lib/libc.so.1.8 has older revision than expected 9
A: You are trying to run on another machine with an out-dated version of the basic C library. For some reason, some manufacturers do not make the shared libraries compatible between minor (or even maintenance) releases of their software. You need to have your system administrator bring the machines to the same software level.
One temporary fix that you can use is to add the link-time option to force static linking instead of dynamic linking. For some Sun workstations, the option is -Bstatic.
9. Q:
Programs never get started. Even tstmachines hangs.
A:
Check first that rsh works at all. For example, if
you have workstations w1 and w2, and you are running on
w1, try
rsh w2 true
This should complete quickly. If it does not, try
rsh w1 true
(that is, use rsh to run true on the system that you are running on). If you get permission denied, see the help on that. If you get
krcmd: No ticket file (tf_util)
rsh: warning, using standard rsh: can't provide Kerberos auth data.
then your system has a faulty installation of rsh. Some FreeBSD systems have been observed with this problem. Have your system administrators correct the problem (often one of an inconsistent set of rsh/rshd programs).
10. Q:
When running the workstation version (-device=ch_p4), I get
error messages of the form
more slaves than message queues
A: This means that you are trying to run mpich in one mode when it was configured for another. In particular, you are specifying in your p4 procgroup file that several processes are to share memory on a particular machine, either by putting a number greater than 0 on the first line (where it signifies the number of local processes besides the original one), or a number greater than 1 on any of the succeeding lines (where it indicates the total number of processes sharing memory on that machine). You should either change your procgroup file to specify only one process per line, or reconfigure mpich with
configure -device=ch_p4 -comm=shared
which will reconfigure the p4 device so that multiple processes can share memory on each host. The reason this is not the default is that with this configuration you will see busy waiting on each workstation, as the device goes back and forth between selecting on a socket and checking the internal shared-memory queue.
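The rule above (no more than 0 on the first line, no more than 1 on later lines) can be checked mechanically. The following Bourne shell sketch assumes that the process count is the second whitespace-separated field on every line of the procgroup file and that blank lines and lines starting with # can be ignored; both are assumptions made for illustration, and procgroup_shared_lines is a hypothetical helper name.

```sh
# Print the lines of a procgroup file that request shared-memory processes.
# Assumption: the process count is the second field of each line.
# Usage: procgroup_shared_lines FILE
procgroup_shared_lines() {
    awk '
        NF == 0 || $1 ~ /^#/ { next }   # skip blank lines and comments
        { n++ }                         # n counts the remaining lines
        (n == 1 && $2 + 0 > 0) || (n > 1 && $2 + 0 > 1) {
            print NR ": " $0            # file line number, then the line
        }
    ' "$1"
}

# Example:
#   procgroup_shared_lines procgroup
```

No output means that the file asks for only one process per line, which is what a ch_p4 build without -comm=shared expects.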
11. Q:
My programs seem to hang in MPI_Init.
A: There are a number of ways that this can happen:
Another is if you use the library -ldxml (extended math library) on Digital Alpha systems. This has been observed to cause MPI_Init to hang. No workaround is known at this time; contact Digital for a fix if you need to use MPI and -ldxml together.
12. Q:
p0_2005: p4_error: fork_p4: fork failed: -1
p4_error: latest msg from perror: Error 0
A: The executable size of your program may be too large. When a ch_p4 or ch_tcp device program starts, it creates a copy of itself to handle certain communication tasks. Because of the way in which the code is organized, this (at least temporarily) is a full copy of your original program and occupies the same amount of space. Thus, if your program is over half as large as the maximum space available, you will get this error. On SGI systems, you can use the command size to get the size of the executable and swap -l to get the available space. Note that size gives you the size in bytes and swap -l gives you the size in 512-byte blocks. Other systems may offer similar commands.
A similar problem can happen on IBM SPx using the ch_eui or ch_mpl device; the cause is the same but it originates within the IBM MPL library.
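The comparison described for SGI systems mixes units (bytes from size, 512-byte blocks from swap -l), so it is easy to get wrong by hand. The following Bourne shell sketch performs the arithmetic on two numbers that you read off yourself; it does not parse the output of either command, and fork_may_fail is a hypothetical helper name.

```sh
# Succeed (exit status 0) if an executable of SIZE_BYTES, temporarily
# duplicated at startup, would need more than SWAP_BLOCKS 512-byte blocks.
# Usage: fork_may_fail SIZE_BYTES SWAP_BLOCKS
# fork_may_fail is a hypothetical helper name used only for illustration.
fork_may_fail() {
    size_bytes=$1
    swap_bytes=$(( $2 * 512 ))          # convert blocks to bytes
    [ $(( size_bytes * 2 )) -gt "$swap_bytes" ]
}

# Example: a 6,000,000-byte program with 20,000 free blocks (10,240,000 bytes)
#   if fork_may_fail 6000000 20000; then echo "program may be too large"; fi
```

This is only the rule of thumb from the answer above; actual memory use at startup also depends on what else is running on the machine.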
13. Q:
Sometimes, I get the error
Exec format error. Wrong Architecture.
A: You are probably using NFS (Network File System). NFS can fail to keep files updated in a timely way; this problem can be caused by creating an executable on one machine and then attempting to use it from another. Usually, NFS catches up with the existence of the new file within a few minutes. You can also try using the sync command. mpirun in fact tries to run the sync command, but on many systems, sync is only advisory and will not guarantee that the file system has been made consistent.
14. Q:
There seem to be two copies of my program running on each node. This doubles
the memory requirement of my application. Is this normal?
A: Yes, this is normal. In the ch_p4 implementation, the second process is used to dynamically establish connections to other processes.
A: Give mpirun the argument -paragontype nqs.
2. Q:
When trying to run on an IBM RS6000 with the ch_p4 device,
I got
% mpirun -np 2 cpi
Could not load program /home/me/mpich/examples/basic/cpi
Could not load library libC.a[shr.o]
Error was: No such file or directory
A: This means that MPICH was built with the xlC compiler but that some of the machines in your util/machines/machines.rs6000 file do not have xlC installed. Either install xlC or rebuild MPICH to use another compiler (either xlc or gcc; gcc has the advantage of never having any licensing restrictions).
1. Q:
$ mpirun -np 2 hello
ERROR: 0031-124 Couldn't allocate nodes for parallel execution. Exiting ...
ERROR: 0031-603 Resource Manager allocation for task: 0, node: me1.myuniv.edu, rc = JM_PARTIONCREATIONFAILURE
ERROR: 0031-635 Non-zero status -1 returned from pm_mgr_init
A: This means that either mpirun is trying to start jobs on your SP in a way different than your installation supports or that there has been a failure in the IBM software that manages the parallel jobs (all of these error messages are from the IBM poe command that mpirun uses to start the MPI job). Contact your system administrator for help in fixing this situation. Your system administrator can use
dsh -av "ps aux | egrep -i 'poe|pmd|jmd'"
from the control workstation to search for stray IBM POE jobs that can cause this behavior. The files /tmp/jmd_err on the individual nodes may also contain useful diagnostic information.
2. Q:
When trying to run on an IBM SPx, I get the message from
mpirun:
ERROR: 0031-214 pmd: chdir </a/user/gamma/home/mpich/examples/basic>
ERROR: 0031-214 pmd: chdir </a/user/gamma/homempich/examples/basic>
A: These are messages from the IBM system, not from mpirun. They may be caused by an incompatibility between POE, the automounter (especially AMD) and the shell, especially if you are using a shell other than ksh. There is no good solution; IBM often recommends changing your shell to ksh!
3. Q:
When I tried to run my program on an IBM SPx, I got
ERROR : Cannot locate message catalog (pepoe.cat) using current NLSPATH
INFO : If NLSPATH is set correctly and catalog exists, check LANG or LC_MESSAGES variables
(C) Opening of "pepoe.cat" message catalog failed
(and other variations that mention NLSPATH and ``message catalog'').
A: This is a problem in your system; contact your support staff. Have them look at (a) the value of NLSPATH and (b) the links from /usr/lib/nls/msg/prime to the appropriate language directory. The messages are not from MPICH; they are from the IBM POE/MPL code that the MPICH implementation is using.
4. Q:
When trying to run on an IBM SP2, I get this message:
ERROR: 0031-124 Less than 2 nodes available from pool 0
A: This means that the IBM POE/MPL system could not allocate the requested nodes when you tried to run your program; most likely, someone else was using the system. You can try to use the environment variables MP_RETRY and MP_RETRYCOUNT to cause the job to wait until the nodes become available. Use man poe to get more information.
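As an illustration of setting the two retry variables named above, here is a minimal Bourne shell fragment. The values, and the meanings given in the comments, are assumptions made for this sketch; confirm both against man poe on your system before using them.

```sh
# Illustrative values only; check "man poe" for the exact semantics.
MP_RETRY=60        # assumed meaning: seconds to wait between allocation attempts
MP_RETRYCOUNT=10   # assumed meaning: number of attempts before giving up
export MP_RETRY MP_RETRYCOUNT

# Then start the job as usual, for example:
#   mpirun -np 2 cpi
```

C shell users would use setenv instead of assignment and export.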
5. Q:
When running on an IBM SP, my job generates the message
Message number 0031-254 not found in Message Catalog.
and then dies.
A: If your user name is eight characters long, you may be experiencing a bug in the IBM POE environment. The only fix at the time this was written was to use an account whose user name was seven characters or less. Ask your IBM representative about PMR 4017X (poe with userids of length eight fails) and the associated APAR IX56566.