MPI_Gatherv fails between bloch2wane and bloch2wanp

Post here questions related to issues encountered while running the EPW code

Moderator: stiwari

mjankousky
Posts: 2
Joined: Tue Apr 26, 2022 7:32 pm
Affiliation: Colorado School of Mines

Post by mjankousky »

Dear EPW developers,

I hope this message finds you well. I am running into a crash in either the bloch2wane or the bloch2wanp routine for a specific case, and I am hoping you can help me.

I am using EPW version 5.4 and Quantum ESPRESSO version 7.1 (downloaded from https://gitlab.com/QEF/q-e/-/tree/merge_7.1rc2).
When computing the electron mobility of zincblende GeC, the epw1 calculation crashes after the last bloch2wane step is written to epw1.out and before the first bloch2wanp step appears.
Below is the standard error output, which refers to “MPI_Gatherv failed.”

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

Abort(205092111) on node 192 (rank 192 in comm 0): Fatal error in PMPI_Gatherv: Other MPI error, error stack:
PMPI_Gatherv(416).....................: MPI_Gatherv failed(sbuf=0x7ffdb922c2c0, scount=5, MPI_CHAR, rbuf=(nil), rcnts=(nil), displs=(nil), datatype=MPI_CHAR, root=0, comm=comm=0xc400000e) failed
MPIDI_SHMGR_Gather_generic(1491)......:
MPIDI_NM_mpi_gatherv(506).............:
MPIR_Gatherv_allcomm_linear_ssend(120):
MPIC_Ssend(269).......................:
MPID_Ssend(614).......................:
MPIDI_OFI_send_normal(412)............:
MPIDI_OFI_send_handler(704)...........: OFI tagged inject failed (ofi_impl.h:704:MPIDI_OFI_send_handler:Connection timed out)
Abort(272200975) on node 60 (rank 60 in comm 0): Fatal error in PMPI_Gatherv: Other MPI error, error stack:
………
The error continues like this for a couple of pages, then
……….
Abort(473527567) on node 166 (rank 166 in comm 0): Fatal error in PMPI_Gatherv: Other MPI error, error stack:
PMPI_Gatherv(416).....................: MPI_Gatherv failed(sbuf=0x7ffcb1e97040, scount=5, MPI_CHAR, rbuf=(nil), rcnts=(nil), displs=(nil), datatype=MPI_CHAR, root=0, comm=comm=0xc400000e) failed
MPIDI_SHMGR_Gather_generic(1491)......:
MPIDI_NM_mpi_gatherv(506).............:
MPIR_Gatherv_allcomm_linear_ssend(120):
MPIC_Ssend(269).......................:
MPID_Ssend(614).......................:
MPIDI_OFI_send_normal(412)............:
MPIDI_OFI_send_handler(704)...........: OFI tagged inject failed (ofi_impl.h:704:MPIDI_OFI_send_handler:Connection timed out)
srun: error: c016: tasks 1-35: Killed
srun: error: c021: tasks 72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106: Killed
srun: error: c017: tasks 36-71: Killed
srun: error: c016: task 0: Killed

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

The case that fails has:
nk1=nk2=nk3=22, nq1=nq2=nq3=11, total number of cores = 360 (36 cores/node, 10 nodes)

Using a less dense k- and q-grid and fewer nodes for the identical system, the calculation succeeds:
nk1=nk2=nk3=8, nq1=nq2=nq3=4, total number of cores = 180 (36 cores/node, 5 nodes)
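
(For scale, the denser case has 22^3 = 10,648 coarse k-points and 11^3 = 1,331 coarse q-points, versus 8^3 = 512 and 4^3 = 64 in the smaller test, i.e. roughly a factor of 20 more points in each grid, so the arrays handled in the Bloch-to-Wannier steps are correspondingly much larger.)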

The decision to move from 5 nodes to 10 was made because, with the denser coarse k-grid, there was an apparent out-of-memory error earlier in the calculation. After the wannierization and the statement that the quadrupole tensor was correctly read, but before any q-points were listed in the output file, the job would hang (without leaving the queue) and an error similar to “srun: error: c023: tasks 36-71: Killed” would appear in the error file. Increasing the number of nodes resolved that issue, but the calculation then started to fail with the MPI_Gatherv error described above.
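
In case the memory side is relevant: one option we have not tried yet is keeping the 10 nodes but placing fewer MPI ranks on each node, so that each rank has more memory available. A sketch is below; the rank counts are only illustrative, and I am assuming -npool should still equal the total number of MPI ranks.

# Illustrative only: under-populate the 10 nodes (18 ranks per node instead of 36)
# so that each MPI rank has roughly twice the memory available.
srun --nodes=10 --ntasks-per-node=18 epw.x -npool 180 -input epw1.in > epw1.out
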
A similar calculation for a different system, computing the mobility of electrons for zincblende SnC on 10 nodes was successful with the same k-point and q-point density. The total number of electrons, k-points and q-points is equal for both systems.
I have also been able to successfully run the epw1 and epw2 calculations with the k-grids specified below for holes and electrons for other zincblende structures. I do not think that the issue is geometry, as forces and pressures for the scf calculation are approximately zero.

Relevant parameters:
We use structures relaxed in Quantum ESPRESSO for these calculations, with an energy cutoff of 100 Ry for each system. Norm-conserving pseudopotentials generated by the ONCVPSP code, as tuned in the PseudoDojo, were used. The coarse k-grid used to generate the Wannier-function interpolation of the band structure was 16x16x16 for holes and 22x22x22 for electrons; the coarse q-grids were taken to be half of the k-grids. Spin-orbit coupling, dipole-dipole, dipole-quadrupole, and quadrupole-quadrupole effects, and local velocity relaxation were included. Quadrupoles were computed using the ABINIT code, with the same ONCVPSP/PseudoDojo norm-conserving pseudopotentials but without the nonlinear core correction.
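
(For completeness, and assuming the standard EPW workflow here: the explicit k-point list for the nscf run that matches the coarse grid is generated with Wannier90's kmesh.pl utility, along the lines of the command below.)

# Assumed standard workflow: append the explicit 22x22x22 k-point list
# (K_POINTS crystal) to the nscf input; kmesh.pl ships with Wannier90.
kmesh.pl 22 22 22 >> nscf.in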

Thank you in advance for any advice on how to resolve this issue!

My epw1.in file is below:

--
&inputepw
prefix = 'gec'
amass(1) = 72.630
amass(2) = 12.01078,
outdir = './'

elph = .true.
epbwrite = .false.
epbread = .false.
epwwrite = .true.
epwread = .false.
etf_mem = 1
lpolar = .true. ! polar material
vme = 'wannier' ! equivalent to vme = .true. in older versions
use_ws = .false. ! Gamma centered Wigner-Seitz cell

lifc = .false. ! false by default; if true, use interatomic force constants from q2r
asr_typ = 'simple' ! 'simple' is the default kind of acoustic sum rule
lphase = .true. ! different from the default; fixes the gauge of the interpolated dynamical matrix and electronic Hamiltonian

nbndsub = 8
bands_skipped = 'exclude_bands = 1-18' ! exclude valence manifold

wannierize = .true. ! set to false to restart
num_iter = 50000
iprint = 2
dis_win_max = 25.0 ! the disentanglement window should include some bands above those strictly required for 8 wannier functions in the spin-polarized case
dis_win_min = 13.3
dis_froz_min = 13.3
dis_froz_max = 21.5

proj(1) = 'f=0,0,0:sp3' ! changed to match SiC in arxiv

wdata(1) = 'bands_plot = .true.'
wdata(2) = 'begin kpoint_path'
!! path below is changed, leads to different indexing
wdata(3) = 'L 0.50 0.00 0.00 G 0.00 0.00 0.00'
wdata(4) = 'G 0.00 0.00 0.00 X 0.50 0.50 0.00'
wdata(5) = 'X 0.50 0.50 0.00 U 0.625 0.625 0.25'
wdata(6) = 'K 0.75 0.375 0.375 G 0.00 0.00 0.00'
wdata(7) = 'end kpoint_path'
wdata(8) = 'bands_plot_format = gnuplot'
wdata(9) = 'guiding_centres = .true.'
wdata(10) = 'dis_num_iter = 3000' ! changed from 5000
wdata(11) = 'num_print_cycles = 10'
wdata(12) = 'dis_mix_ratio = 1.0'
wdata(13) = 'conv_tol = 1E-12'
wdata(14) = 'conv_window = 4'
wdata(15) = 'use_ws_distance = T'

elecselfen = .false.
phonselfen = .false.
a2f = .false.

fsthick = 100
nstemp = 1
temps = 1
degaussw = 0.001

dvscf_dir = './save'

band_plot = .true. ! generate dispersion plots; requires the LGXKG path input to run

prtgkk = .false. ! would print the electron-phonon vertex for each q- and k-point; slows down the calculation, so it is disabled here for speed and brevity.

filkf = './LGXKG4.txt'
filqf = './LGXKG4.txt'

nk1 = 22
nk2 = 22
nk3 = 22
nq1 = 11
nq2 = 11
nq3 = 11
/
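
For reference, the job is launched with one k-point pool per MPI rank, roughly as follows (the exact srun options are cluster-specific, so treat this as a sketch):

# 360 MPI ranks over 10 nodes (36 per node), with npool equal to the number of ranks
srun -n 360 epw.x -npool 360 -input epw1.in > epw1.out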

Best,
Matt