I had previously posted some notes on Compiling scalapack on SL5x, but I forgot to check whether the library actually worked, and now I need to use it. So I compiled it up and ran the test code that ships with scalapack.

[jtang@duo TESTING]$ mpirun -np 2 ./xdgsep
[duo:26526] *** An error occurred in MPI_Comm_group
[duo:26526] *** on communicator MPI_COMM_WORLD
[duo:26526] *** MPI_ERR_COMM: invalid communicator
[duo:26526] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 26525 on
node duo exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[duo:26524] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[duo:26524] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

The error didn't make much sense on its own, but it was blindingly obvious that I had not configured BLACS (and hence MPIBLACS) correctly.

Going back to my BLACS build directory, I ran the tests

[jtang@duo EXE]$ mpirun -np 2 ./xFbtest_MPI-LINUX-0 
BLACS WARNING 'No need to set message ID range due to MPI communicator.'
from {-1,-1}, pnum=0, Contxt=-1, on line 18 of file 'blacs_set_.c'.

BLACS WARNING 'No need to set message ID range due to MPI communicator.'
from {-1,-1}, pnum=1, Contxt=-1, on line 18 of file 'blacs_set_.c'.

 Sample BLACS tester run                                                         
==============================================
==============================================
BEGINNING BLACS TESTING, BLACS DEBUG LEVEL = 0
==============================================
==============================================
BLACS ERROR 'Illegal grid (2 x 2), #procs=2'
from {-1,-1}, pnum=0, Contxt=-1, on line -1 of file 'BLACS_GRIDINIT/BLACS_GRIDMAP'.

BLACS ERROR 'Illegal grid (2 x 2), #procs=2'
from {-1,-1}, pnum=1, Contxt=-1, on line -1 of file 'BLACS_GRIDINIT/BLACS_GRIDMAP'.

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 1473 on
node duo exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[duo:01472] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[duo:01472] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
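
Part of that output, incidentally, is just down to the process count: the tester asks for a 2 x 2 grid, which needs at least four ranks, so running it with four processes avoids that particular complaint, although the underlying problem was elsewhere, as described below

mpirun -np 4 ./xFbtest_MPI-LINUX-0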

It turned out that MPIBLACS required a different TRANSCOMM setting, as described at http://www.open-mpi.org/faq/?category=mpi-apps#blacs. After correcting this mistake I recompiled MPIBLACS, re-ran its tests, and then recompiled scalapack. With that fixed, all the tests pass.
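
For reference, the TRANSCOMM setting lives in BLACS's Bmake.inc. If I remember the FAQ correctly, it recommends the MPI-2 communicator-translation option for Open MPI, so the relevant line ends up looking something like this (a sketch of my setup, not a canonical recipe)

#  have BLACS translate Fortran communicator handles into C MPI_Comm
#  handles via the MPI-2 conversion routines (MPI_Comm_f2c and friends)
TRANSCOMM = -DUseMpi2

That translation is presumably also why the original failure showed up as an invalid communicator in MPI_Comm_group: with the wrong TRANSCOMM, BLACS hands the C MPI library communicator handles it cannot make sense of.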

On a side note, OpenMPI seemed to want to default to the InfiniBand verbs backend for communications. To run OpenMPI-compiled programs over TCP only, you can do this

mpirun --mca btl tcp,self -np 4 --hostfile hostfile ./myapp

you could also edit /etc/openmpi-mca-params.conf and set this option

btl = tcp,self

this will make OpenMPI stick to TCP (plus the self loopback) by default.
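
If you also want intra-node traffic to go over shared memory rather than TCP loopback, my understanding is that you just add the sm BTL to the list, something like

mpirun --mca btl tcp,sm,self -np 4 --hostfile hostfile ./myapp

component names can vary between OpenMPI releases, so check ompi_info to see what your install actually ships with.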

I guess the moral of the story is to run the tests to make sure things are working as expected.
