Dear EPW developers,
I hope this message finds you well. I am running into a problem in either the bloch2wane or the bloch2wanp routine for a specific case, and I hope you can help me.
I am using EPW version 5.4 and Quantum ESPRESSO version 7.1, downloaded from https://gitlab.com/QEF/qe//tree/merge_7.1rc2
When computing the electron mobility of zinc-blende GeC, the epw1 calculation crashes after the last bloch2wane step is written to epw1.out and before the first bloch2wanp step appears.
Below is the standard-error output, which refers to “MPI_Gatherv failed”:
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
Abort(205092111) on node 192 (rank 192 in comm 0): Fatal error in PMPI_Gatherv: Other MPI error, error stack:
PMPI_Gatherv(416).....................: MPI_Gatherv failed(sbuf=0x7ffdb922c2c0, scount=5, MPI_CHAR, rbuf=(nil), rcnts=(nil), displs=(nil), datatype=MPI_CHAR, root=0, comm=comm=0xc400000e) failed
MPIDI_SHMGR_Gather_generic(1491)......:
MPIDI_NM_mpi_gatherv(506).............:
MPIR_Gatherv_allcomm_linear_ssend(120):
MPIC_Ssend(269).......................:
MPID_Ssend(614).......................:
MPIDI_OFI_send_normal(412)............:
MPIDI_OFI_send_handler(704)...........: OFI tagged inject failed (ofi_impl.h:704:MPIDI_OFI_send_handler:Connection timed out)
Abort(272200975) on node 60 (rank 60 in comm 0): Fatal error in PMPI_Gatherv: Other MPI error, error stack:
………
The error continues like this for a couple of pages, then
……….
Abort(473527567) on node 166 (rank 166 in comm 0): Fatal error in PMPI_Gatherv: Other MPI error, error stack:
PMPI_Gatherv(416).....................: MPI_Gatherv failed(sbuf=0x7ffcb1e97040, scount=5, MPI_CHAR, rbuf=(nil), rcnts=(nil), displs=(nil), datatype=MPI_CHAR, root=0, comm=comm=0xc400000e) failed
MPIDI_SHMGR_Gather_generic(1491)......:
MPIDI_NM_mpi_gatherv(506).............:
MPIR_Gatherv_allcomm_linear_ssend(120):
MPIC_Ssend(269).......................:
MPID_Ssend(614).......................:
MPIDI_OFI_send_normal(412)............:
MPIDI_OFI_send_handler(704)...........: OFI tagged inject failed (ofi_impl.h:704:MPIDI_OFI_send_handler:Connection timed out)
srun: error: c016: tasks 135: Killed
srun: error: c021: tasks 72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106: Killed
srun: error: c017: tasks 3671: Killed
srun: error: c016: task 0: Killed
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
The case that fails has:
nk1 = nk2 = nk3 = 22, nq1 = nq2 = nq3 = 11, total number of cores = 360 (36 cores/node, 10 nodes)
With a less dense k- and q-grid and fewer nodes for the identical system, the calculation succeeds:
nk1 = nk2 = nk3 = 8, nq1 = nq2 = nq3 = 4, total number of cores = 180 (36 cores/node, 5 nodes)
The decision to move from 5 nodes to 10 was made because, with the denser coarse k-grid, there was an apparent out-of-memory error earlier in the calculation. After wannierization and the statement that the quadrupole tensor was correctly read, but before q-points started to be listed in the output file, the job would hang (without leaving the queue), and an error similar to “srun: error: c023: tasks 3671: Killed” would appear in the error file. Increasing the number of nodes resolved that issue, but the calculation then began to fail with the MPI_Gatherv error described above.
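For reference, the two job layouts described above correspond roughly to the SLURM sketch below. This is only an illustration: the binary path and the `-in` redirection are hypothetical, and it assumes EPW is run with one pool per MPI rank, as is customary for epw.x.

```shell
#!/bin/bash
# Failing layout: 10 nodes x 36 ranks/node = 360 MPI tasks
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=36

# Working layout (coarser grids): 5 nodes x 36 ranks/node = 180 MPI tasks
# #SBATCH --nodes=5
# #SBATCH --ntasks-per-node=36

# epw.x is customarily run with the number of pools equal to the number of MPI tasks
srun epw.x -npool 360 -in epw1.in > epw1.out
```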
A similar calculation for a different system, the electron mobility of zinc-blende SnC on 10 nodes, succeeded with the same k-point and q-point density. The total number of electrons, k-points, and q-points is the same for both systems.
I have also been able to run the epw1 and epw2 calculations successfully with the k-grids specified below for holes and electrons in other zinc-blende structures. I do not think the issue is the geometry, as the forces and pressures from the SCF calculation are approximately zero.
Relevant parameters:
We use structures relaxed in Quantum ESPRESSO for these calculations, with an energy cutoff of 100 Ry for each system. Norm-conserving pseudopotentials generated by the ONCVPSP code, as tuned in the PseudoDojo, were used. The coarse k-grid used to generate the Wannier-function interpolation of the band structure was 16x16x16 for holes and 22x22x22 for electrons; the coarse q-grids were taken to be half of the k-grids. Spin-orbit coupling; dipole-dipole, dipole-quadrupole, and quadrupole-quadrupole effects; and local velocity relaxation were considered. Quadrupoles were computed with the ABINIT code, using norm-conserving pseudopotentials generated by ONCVPSP and tuned in the PseudoDojo, but without the nonlinear core correction.
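For context, a minimal pw.x SCF input consistent with the parameters above might look like the following. This is a sketch only: the lattice constant, pseudopotential filenames, and convergence threshold are hypothetical placeholders, not the values actually used.

```fortran
&control
  calculation = 'scf'
  prefix = 'gec'
  outdir = './'
  pseudo_dir = './'
/
&system
  ibrav = 2          ! fcc lattice (zinc-blende)
  celldm(1) = 8.6    ! hypothetical lattice constant in Bohr
  nat = 2
  ntyp = 2
  ecutwfc = 100      ! 100 Ry cutoff, as stated above
/
&electrons
  conv_thr = 1.0d-10 ! hypothetical threshold
/
ATOMIC_SPECIES
  Ge 72.630 Ge.upf   ! hypothetical ONCVPSP/PseudoDojo filenames
  C  12.011 C.upf
ATOMIC_POSITIONS crystal
  Ge 0.00 0.00 0.00
  C  0.25 0.25 0.25
K_POINTS automatic
  22 22 22 0 0 0     ! electron coarse grid, as stated above
```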
Thank you in advance for any advice you have to resolve this issue!
My epw1.in file is below:

&inputepw
prefix = 'gec'
amass(1) = 72.630
amass(2) = 12.01078,
outdir = './'
elph = .true.
epbwrite = .false.
epbread = .false.
epwwrite = .true.
epwread = .false.
etf_mem = 1
lpolar = .true. ! polar material
vme = 'wannier' ! equivalent to .true. in older versions
use_ws = .false. ! Gamma-centered Wigner-Seitz cell
lifc = .false. ! false by default; if true, use interatomic force constants from q2r
asr_typ = 'simple' ! 'simple' is the default kind of acoustic sum rule
lphase = .true. ! differs from the default; fixes the gauge of the interpolated dynamical matrix and electronic Hamiltonian
nbndsub = 8
bands_skipped = 'exclude_bands = 118' ! exclude valence manifold
wannierize = .true. ! set to false to restart
num_iter = 50000
iprint = 2
dis_win_max = 25.0 ! the disentanglement window should include some bands above those strictly required for 8 Wannier functions in the spin-polarized case
dis_win_min = 13.3
dis_froz_min = 13.3
dis_froz_max = 21.5
proj(1) = 'f=0,0,0:sp3' ! changed to match SiC in arxiv
wdata(1) = 'bands_plot = .true.'
wdata(2) = 'begin kpoint_path'
!! path below is changed, leads to different indexing
wdata(3) = 'L 0.50 0.00 0.00 G 0.00 0.00 0.00'
wdata(4) = 'G 0.00 0.00 0.00 X 0.50 0.50 0.00'
wdata(5) = 'X 0.50 0.50 0.00 U 0.625 0.625 0.25'
wdata(6) = 'K 0.75 0.375 0.375 G 0.00 0.00 0.00'
wdata(7) = 'end kpoint_path'
wdata(8) = 'bands_plot_format = gnuplot'
wdata(9) = 'guiding_centres = .true.'
wdata(10) = 'dis_num_iter = 3000' ! changed from 5000
wdata(11) = 'num_print_cycles = 10'
wdata(12) = 'dis_mix_ratio = 1.0'
wdata(13) = 'conv_tol = 1E-12'
wdata(14) = 'conv_window = 4'
wdata(15) = 'use_ws_distance = T'
elecselfen = .false.
phonselfen = .false.
a2f = .false.
fsthick = 100
nstemp = 1
temps = 1
degaussw = 0.001
dvscf_dir = './save'
band_plot = .true. ! generates dispersion plots; requires the L-G-X-K-G path input to run
prtgkk = .false. ! prints the electron-phonon vertex for each q- and k-point; slows down the calculation, so it is excluded for speed and brevity at this time
filkf = './LGXKG4.txt'
filqf = './LGXKG4.txt'
nk1 = 22
nk2 = 22
nk3 = 22
nq1 = 11
nq2 = 11
nq3 = 11
/
Best,
Matt
MPI_Gatherv fails between bloch2wane and bloch2wanp

 Joined: Tue Apr 26, 2022 7:32 pm
 Affiliation: Colorado School of Mines