As mentioned in the previous email, I'm trying to debug what seems to be a PARPACK-related problem in a large application. I haven't got arpack-ng working yet, but I'll post the problem here in case it is a known one, and switch to arpack-ng once I have it running.
The application calls PARPACK from C++. As this is a sub-problem of the main problem (the application has to solve a number of small sub-problems before it can attack the main problem), only a few of the total number of CPUs participate in this communicator (usually 1-2 out of ~70), while the rest wait for the sub-problem to finish.
This usually works fine, but for *some* problems PZNAUPD doesn't return on one of the nodes, so the rest of the nodes in the sub-communicator get stuck at the next collective operation.
The code snippet in question, including a lot of debugging output and general mangling from trying to get it to work, looks like this (more text below):
It takes quite a few iterations to converge (more than the limit stored in iparam(7) = m_maxit!?), but it normally does converge in the end, and the solutions look good. However, it sometimes gets stuck at the very beginning of the loop:
> f = 1000 GHz (k0 = 20958, k = 20958)
> ncv = 40, nev = 8, lworkl = 5000
> Solving for propagating modes...
> Loop=0 ido=0 0/2
> PZNAUPD done, info= 0 0/2
> Loop=0 ido=0 1/2
This may happen after solving many other similar sub-problems successfully.
This is running on hopper.nersc.gov, using MPICH2 and GCC 4.5.3, with an old ARPACK/PARPACK (where in the source directory can I find the version number? I didn't download or set it up myself). The arpack-ng in the other mail was compiled with OpenMPI on Fedora 16 (my development machine; I want to get it working there before trying the cluster).
One interesting thing: if I change the number of CPUs (while keeping the number of nodes, and thus the total memory, the same, and usually NOT splitting these sub-problems over multiple CPUs), it works. So the problem specification itself should be good.