![]() ![]() |
Kyrre Ness Sjøbæk |
![]() |
Hi,
As mentioned in the previous email, I'm trying to debug what seems to be an PARPACK related problem in a large application. I haven't made arpack-ng work yet, but I'll still try to post it here in case it is a known problem, and switch to arpack-ng when I have this running. The application is calling PARPACK from C++. As this is a sub-problem of the main problem (the application have to solve a number of small sub-problems before it can attack the main problem), only a few of the total number of CPUs are participating in this communication (usually 1-2 out ~70), while the rest are waiting for the sub-problem to finish. This usually works fine, but for *some* problems, PZNAUPD doesn't return in one of the nodes, leading to the rest of the nodes in the sub-communicator getting stuck at the next collective operation. The code snippet in question looks like, including a lot of debugging output and general mangling trying to get it to work, looks like this: (more text below): > int ido = 0; > char bmat[2] = "I";//"G"; //Was "I" > int nloc = A.local_nrows(); > char which[3] = "SR"; > int nev = numEig; > double tol = m_tol; > int ncv = std::max(m_ncv, std::max(2 * nev, static_cast<int>(minNcv))); > int ldv = nloc; > int lworkl = ncv * (3 * ncv + 5); > int info = 0; > std::vector<CplxType> resid(nloc); > std::vector<CplxType> v(ldv * ncv); > std::vector<int> iparam(11); > std::vector<int> ipntr(14); > std::vector<CplxType> workd(3 * nloc); > std::vector<CplxType> workl(lworkl); > std::vector<double> rwork(ncv); > iparam[0] = 1; > iparam[2] = m_maxit; > iparam[6] = 1;//2; // Was 1 > > //std::cout << y->size() << " " << nloc << std::endl << std::flush; > > if (commrank == 0) { > std::cout << "f = " << k0 *C / (2. * PI * 1.e9) << " GHz (k0 = " << k0 << ", k = " << k << ")" << std::endl; > std::cout << "ncv = " << ncv << ", nev = " << nev << ", lworkl = " << lworkl << std::endl; > std::cout << "Solving for propagating modes..." << std::endl << std::flush; > } > > MPI_Barrier(comm); > > //Communicator OK > // for (int r = 0; r < commsize; r++) { > // int sendBuff = commrank+r; > // int globalCommsize = 0; > // MPI_Allreduce(&sendBuff,&globalCommsize,1,MPI_INT,MPI_SUM, comm); > // if (commrank == r) { > // sendBuff = 0; > // for (int i =0; i < commsize; i++) sendBuff += i+r; > // std::cout << "commtest " << commrank << "/" << commsize << " globalCommsize=" << globalCommsize << " " << sendBuff <<std::endl << std::flush; > // } > // sleep(1); > // MPI_Barrier(comm); > // } > //std::cout << "Reproduce fComm? " << (comm == MPI_Comm_f2c(fcomm)) << std::endl << std::flush; //OK > > size_t loopidx = 0; > while (ido != 99) { > std::cout << "Loop=" << loopidx << " ido=" << ido << " " << commrank << "/" << commsize << std::endl <<std::flush; loopidx++; > _NAUPD(&fcomm, &ido, bmat, &nloc, which, > &nev, &tol, &resid[0], &ncv, &v[0], &ldv, > &iparam[0], &ipntr[0], &workd[0], &workl[0], &lworkl, > &rwork[0], &info); > std::cout << "PZNAUPD done, info= " << info << " " << commrank << "/" << commsize << std::endl << std::flush; > MPI_Barrier(comm); > std::cout << "Past PZNAUPD-barrier " << commrank << "/" << commsize << std::endl << std::flush; > > if (info == 1) { > break; > } > > if ((ido == 1) || (ido == -1)) { > if (!commrank) std::cout << "memcpy1"<< std::flush; > memcpy(y->data(), &workd[ipntr[0] - 1], nloc * sizeof(CplxType)); > //if (!commrank && loopidx==1) std::cout << " printVec:" << std::endl << std::flush; > //if (loopidx==1){std::cout << std::endl; y->print(std::cout); std::cout << std::flush;} > if (!commrank) std::cout << " matrixmult"<< std::flush; > A.mult(*y, *x); > if (!commrank) std::cout << " linsolver"<< std::flush; > m_linearsolver->solve(B, *x, *y); > if (!commrank) std::cout << " memcpy2" << std::endl<< std::flush; > memcpy(&workd[ipntr[1] - 1], y->data(), nloc * sizeof(CplxType)); > } else if(ido == 99) { > std::cout << "ido was 99 " << commrank << "/" << commsize << std::endl<< std::flush; > } > else if (ido != 99) { > std::cout << "Warning: unexpected ido value: " << ido << " " << commrank << "/" << commsize << std::endl<< std::flush; > break; > } > } > if (!commrank) std::cout << "Loop Done." << std::endl<< std::flush; Typical good output looks like this: > f = 1000 GHz (k0 = 20958.5, k = 20958.5) > ncv = 40, nev = 8, lworkl = 5000 > Solving for propagating modes... > Loop=0 ido=0 0/2 > Loop=0 ido=0 1/2 > PZNAUPD done, info= 0 1/2 > PZNAUPD done, info= 0 0/2 > Past PZNAUPD-barrier 0/2 > Past PZNAUPD-barrier 1/2 > memcpy1 matrixmult linsolver memcpy2 > Loop=1 ido=1 1/2 > Loop=1 ido=1 0/2 > PZNAUPD done, info= 0 1/2 > PZNAUPD done, info= 0 0/2 > Past PZNAUPD-barrier 0/2 > Past PZNAUPD-barrier 1/2 > memcpy1 matrixmult linsolver memcpy2 It uses quite a few iterations to converge (more than what is stored in iparam(7) = iparam[6] = m_maxit!?), but it normally does converge in the end, and the solutions looks good. However, sometimes it gets stuck at the very beginning of the loop: > f = 1000 GHz (k0 = 20958, k = 20958) > ncv = 40, nev = 8, lworkl = 5000 > Solving for propagating modes... > Loop=0 ido=0 0/2 > PZNAUPD done, info= 0 0/2 > Loop=0 ido=0 1/2 This may happen after solving many other similar sub-problems successfully. This is running on hopper.nersc.gov, using MPICH2 MPI and GCC 4.5.3, and old ARPACK/PARPACK (where in the sources directory can I find the version number? I didn't get it or set it up myself.). The arpack-ng in the other mail was compiled with OpenMPI on Fedora 16 (my development machine - want to get it going there before I try on the cluster). One interesting thing is that if I change the number of CPUs (while keeping the number of nodes and thus the total memory the same, and usually NOT splitting these sub-problems over multiple CPUs), it works - so the problem specification should be good. Cheers, Kyrre Sjøbæk |
Free forum by Nabble | Edit this page |