
PZNAUPD sometimes stuck on 1 of 2 CPUs

Kyrre Ness Sjøbæk

Hi,

As mentioned in the previous email, I'm trying to debug what seems to be a PARPACK-related problem in a large application. I haven't gotten arpack-ng to work yet, but I'll post the problem here anyway in case it is a known one, and switch to arpack-ng once I have it running.

The application calls PARPACK from C++. As this is a sub-problem of the main problem (the application has to solve a number of small sub-problems before it can attack the main problem), only a few of the total number of CPUs participate in this communication (usually 1-2 out of ~70), while the rest wait for the sub-problem to finish.
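
Roughly, the sub-communicator comm and the Fortran handle fcomm that PARPACK gets are set up like this (a simplified sketch, not the exact application code; participates and make_subproblem_comm are just illustrative names):

    #include <mpi.h>

    // Simplified sketch of how the sub-communicator could be built; not the
    // exact application code (participates / make_subproblem_comm are made up).
    // Ranks that work on this sub-problem get color 0, the others opt out.
    MPI_Fint make_subproblem_comm(bool participates, MPI_Comm& comm)
    {
      int worldrank = 0;
      MPI_Comm_rank(MPI_COMM_WORLD, &worldrank);

      int color = participates ? 0 : MPI_UNDEFINED;
      MPI_Comm_split(MPI_COMM_WORLD, color, worldrank, &comm);

      // PARPACK is Fortran, so it gets a Fortran communicator handle.
      // (Non-participants end up with comm == MPI_COMM_NULL.)
      return MPI_Comm_c2f(comm);
    }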

This usually works fine, but for *some* problems PZNAUPD doesn't return on one of the nodes, and the rest of the nodes in the sub-communicator then get stuck at the next collective operation.

The code snippet in question, including a lot of debugging output and general mangling from trying to get it to work, looks like this (more discussion below the code):

>   int    ido      = 0;
>   char   bmat[2]  = "I";//"G"; //Was "I"
>   int    nloc     = A.local_nrows();
>   char   which[3] = "SR";
>   int    nev      = numEig;
>   double tol      = m_tol;
>   int    ncv      = std::max(m_ncv, std::max(2 * nev, static_cast<int>(minNcv)));
>   int    ldv      = nloc;
>   int    lworkl   = ncv * (3 * ncv + 5);
>   int    info     = 0;
>   std::vector<CplxType> resid(nloc);
>   std::vector<CplxType> v(ldv * ncv);
>   std::vector<int>      iparam(11);
>   std::vector<int>      ipntr(14);
>   std::vector<CplxType> workd(3 * nloc);
>   std::vector<CplxType> workl(lworkl);
>   std::vector<double>   rwork(ncv);
>   iparam[0] = 1;
>   iparam[2] = m_maxit;
>   iparam[6] = 1;//2; // Was 1
>
>   //std::cout << y->size() << " " << nloc << std::endl << std::flush;
>
>   if (commrank == 0) {
>     std::cout << "f = " << k0 *C / (2. * PI * 1.e9) << " GHz (k0 = " << k0 << ", k = " << k << ")" << std::endl;
>     std::cout << "ncv = " << ncv << ", nev = " << nev << ", lworkl = " << lworkl << std::endl;
>     std::cout << "Solving for propagating modes..." << std::endl << std::flush;
>   }
>
>   MPI_Barrier(comm);
>
>     //Communicator OK
> //  for (int r = 0; r < commsize; r++) {
> //    int sendBuff = commrank+r;
> //    int globalCommsize = 0;
> //    MPI_Allreduce(&sendBuff,&globalCommsize,1,MPI_INT,MPI_SUM, comm);
> //    if (commrank == r) {
> //      sendBuff = 0;
> //      for (int i =0; i < commsize; i++) sendBuff += i+r;
> //      std::cout << "commtest " << commrank << "/" << commsize << " globalCommsize=" << globalCommsize << " " << sendBuff <<std::endl << std::flush;
> //    }
> //    sleep(1);
> //    MPI_Barrier(comm);
> //  }
>   //std::cout << "Reproduce fComm? " << (comm == MPI_Comm_f2c(fcomm)) << std::endl << std::flush; //OK
>
>   size_t loopidx = 0;
>   while (ido != 99) {
>     std::cout << "Loop=" << loopidx << " ido=" << ido << " " << commrank << "/" << commsize << std::endl <<std::flush; loopidx++;
>     _NAUPD(&fcomm, &ido, bmat, &nloc, which,
>            &nev, &tol, &resid[0], &ncv, &v[0], &ldv,
>            &iparam[0], &ipntr[0], &workd[0], &workl[0], &lworkl,
>            &rwork[0], &info);
>     std::cout << "PZNAUPD done, info= " << info << " " << commrank << "/" << commsize << std::endl << std::flush;
>     MPI_Barrier(comm);
>     std::cout << "Past PZNAUPD-barrier "  << commrank << "/" << commsize << std::endl << std::flush;
>
>     if (info == 1) {
>       break;
>     }
>
>     if ((ido == 1) || (ido == -1)) {
>       if (!commrank) std::cout << "memcpy1"<< std::flush;
>       memcpy(y->data(), &workd[ipntr[0] - 1], nloc * sizeof(CplxType));
>       //if (!commrank && loopidx==1) std::cout << " printVec:" << std::endl << std::flush;
>       //if (loopidx==1){std::cout << std::endl; y->print(std::cout); std::cout << std::flush;}
>       if (!commrank) std::cout << " matrixmult"<< std::flush;
>       A.mult(*y, *x);
>       if (!commrank) std::cout << " linsolver"<< std::flush;
>       m_linearsolver->solve(B, *x, *y);
>       if (!commrank) std::cout << " memcpy2" << std::endl<< std::flush;
>       memcpy(&workd[ipntr[1] - 1], y->data(), nloc * sizeof(CplxType));
>     } else if(ido == 99) {
>       std::cout << "ido was 99 " << commrank << "/" << commsize << std::endl<< std::flush;
>     }
>     else if (ido != 99) {
>       std::cout << "Warning: unexpected ido value: " << ido << " " << commrank << "/" << commsize << std::endl<< std::flush;
>       break;
>     }
>   }
>   if (!commrank) std::cout << "Loop Done." << std::endl<< std::flush;
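
After the loop, one would normally check info and then call PZNEUPD on the same communicator to extract the converged eigenpairs. A simplified sketch of that check, continuing with the variables from the snippet above (not the application's actual post-processing; the info codes are taken from the (P)ZNAUPD documentation):

    // Simplified sketch of the post-loop check; not the application's actual
    // post-processing. The info codes are from the (P)ZNAUPD documentation.
    if (info < 0) {
      // Fatal input/algorithm error, e.g. -3: NCV-NEV >= 2 not satisfied.
      if (!commrank) std::cerr << "PZNAUPD failed, info = " << info << std::endl;
    } else {
      if (info == 1 && !commrank)
        std::cout << "Hit the iteration limit: " << iparam[2]
                  << " Arnoldi iterations used, " << iparam[4]
                  << " Ritz values converged." << std::endl;
      // Next step: call PZNEUPD on the same fcomm to extract the eigenpairs.
    }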

Typical good output looks like this:

> f = 1000 GHz (k0 = 20958.5, k = 20958.5)
> ncv = 40, nev = 8, lworkl = 5000
> Solving for propagating modes...
> Loop=0 ido=0 0/2
> Loop=0 ido=0 1/2
> PZNAUPD done, info= 0 1/2
> PZNAUPD done, info= 0 0/2
> Past PZNAUPD-barrier 0/2
> Past PZNAUPD-barrier 1/2
> memcpy1 matrixmult linsolver memcpy2
> Loop=1 ido=1 1/2
> Loop=1 ido=1 0/2
> PZNAUPD done, info= 0 1/2
> PZNAUPD done, info= 0 0/2
> Past PZNAUPD-barrier 0/2
> Past PZNAUPD-barrier 1/2
> memcpy1 matrixmult linsolver memcpy2

It uses quite a few iterations to converge (many more than what is stored in iparam(3) = iparam[2] = m_maxit!? See the note below the failing output), but it normally does converge in the end, and the solutions look good. However, sometimes it gets stuck at the very beginning of the loop:
> f = 1000 GHz (k0 = 20958, k = 20958)
> ncv = 40, nev = 8, lworkl = 5000
> Solving for propagating modes...
> Loop=0 ido=0 0/2
> PZNAUPD done, info= 0 0/2
> Loop=0 ido=0 1/2
This may happen after solving many other similar sub-problems successfully.
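
A side note on the iteration count mentioned above: as far as I understand, the "Loop=" counter only counts reverse-communication returns (requests for y = OP*x), while iparam(3) (iparam[2] from C) is the number of implicitly restarted Arnoldi iterations (maximum on input, actual count on output), and each restart can use up to roughly ncv - nev matrix-vector products, so loopidx being much larger than m_maxit is probably expected. A sketch of a print that would show both counts after the loop:

    // Sketch only: compare the two counters after the loop.
    // iparam[2] (Fortran iparam(3)): Arnoldi restart iterations actually taken.
    // loopidx: number of reverse-communication returns, i.e. OP*x requests.
    if (!commrank)
      std::cout << "Arnoldi restarts used: " << iparam[2]
                << " (limit m_maxit = " << m_maxit << "), "
                << "OP*x requests: " << loopidx << std::endl;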

This is running on hopper.nersc.gov, using MPICH2 and GCC 4.5.3, with an old ARPACK/PARPACK (where in the source directory can I find the version number? I didn't download or set it up myself). The arpack-ng from the other mail was compiled with OpenMPI on Fedora 16 (my development machine; I want to get it going there before trying it on the cluster).

One interesting thing is that if I change the number of CPUs (while keeping the number of nodes, and thus the total memory, the same, and usually NOT splitting these sub-problems over multiple CPUs), it works, so the problem specification itself should be good.

Cheers,
Kyrre Sjøbæk