EPW stops when DEALLOCATE mapg, and other gmap crashes

Post here questions linked with issues encountered while running the EPW code

Moderator: stiwari

adenchfi
Posts: 26
Joined: Wed May 22, 2019 7:53 pm
Affiliation:

EPW stops when DEALLOCATE mapg, and other gmap crashes

Post by adenchfi »

Hello,
I'm using the development version of EPW. I've encountered a few issues. I'm not sure if they're edge cases or what, since these issues don't appear for the test suite examples. I can post the scf and nscf input files if necessary, but I only print my epw.in file at the bottom of my post.

Crash type 1: If I run epw with

Code: Select all

mpirun -np $SLURM_NTASKS epw.x -npools 18 -i ./epw.in > epw.out
with 1 node (36 processors) the calculation proceeds as normal until right after hitting the kmaps step. For the above run command, I get a crash on the

Code: Select all

DEALLOCATE(mapg, STAT = ierr)
line in rotate.f90. Here's the last part of my epw.out:

Code: Select all

     
     -------------------------------------------------------------------
     WANNIER      :   8549.48s CPU   8642.39s WALL (       1 calls)
     -------------------------------------------------------------------

     Calculating kgmap

     Progress kgmap: ########################################
     kmaps        :      0.14s CPU      0.26s WALL (       1 calls)
     Estimated size of gmap: ngxx = 7792
     Symmetries of Bravais lattice:   4
     Symmetries of crystal:           2

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 101560 RUNNING AT bdw-0192
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 101560 RUNNING AT bdw-0192
=   EXIT CODE: 6
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================
I made sure to use the same number of pools in my scf and nscf calculations as EPW.

Crash type 2: The error is slightly different if I don't use the -npools 18 modifier: running

Code: Select all

mpirun -np $SLURM_NTASKS epw.x -i ./epw.in > epw.out
produces a different crash, where nothing gets written to stdout, and my epw.out looks like

Code: Select all

     
     Calculating kgmap

     Progress kgmap: ########################################
     kmaps        :      0.01s CPU      0.04s WALL (       1 calls)
     Estimated size of gmap: ngxx =  337

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 168548 RUNNING AT bdw-0533
=   EXIT CODE: 6
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================
where I assume it hasn't gotten as far. I believe it's also crashing in something gmap-related, though, as the epw.out printout seems to stop at

Code: Select all

CALL readgmap(nkstot)
in elphon_shuffle_wrap.f90 (I have iverbosity == 1).

Crash type 3:
The epw.in file at the bottom of this post has my nscf run on an 8x8x8 k-mesh. I had originally planned on a 14x14x14 or 16x16x16 coarse k-mesh; when I tried that, pw2wan2epw crashed while calculating the MMN file, whereas pw2wannier90 in Quantum ESPRESSO doesn't crash. I don't get any stdout from those crashes. Running with etf_mem = 1 didn't help. I did this without the

Code: Select all

-npools N  
flag, but it worked in pw2wannier90.x (though it took a while). To achieve exponential localization I needed something like 90 bands in my disentanglement window, so it's possibly a memory issue that pw2wannier90.x handles better than EPW's pw2wan2epw.

Keep in mind, however, that I did not have these problems with EPW using norm-conserving pseudopotentials for LaTe3; unfortunately, those really don't work for the other rare-earth systems I'm studying.

Here is my epw.in:

Code: Select all


--
&inputepw
  prefix = 'DyTe3'
  outdir = './'
  iverbosity  = 1

  elph        = .true.
  kmaps       = .false.
  epbwrite    = .true.
  epbread     = .false.
  epwwrite    = .true.
  epwread     = .false.
  use_ws      = .true.
  etf_mem     = 0       ! let's see if it works
  !vme         = .false.


  !lifc        = .false.
  !asr_typ     = 'simple'


  wannierize  = .false. ! this is for the electrons, which we do ourselves
  num_iter    = 500
  iprint      = 2
  nbndsub     =  8
  !nbndskip    =  28  ! now removed from EPW
  bands_skipped = 'exclude_bands : 1-14'

  dis_win_min  = 0.0
  dis_win_max  = 38.0
  dis_froz_min = 7.12
  dis_froz_max = 7.22

 wdata(1) = 'dis_num_iter = 10000'
  wdata(2) = 'dis_mix_ratio = 0.90'
  wdata(3) = 'dis_conv_tol = 1E-7'
  wdata(4) = 'translate_home_cell = .true.'
  wdata(5) = 'write_xyz = .true.'
  wdata(6) = 'write_hr = .true.'
  wdata(7) = 'hr_cutoff = 0.005'
  wdata(8) = 'write_rmn = .true.'
  wdata(9) = 'use_ws_distance = .true.'
  wdata(10) = 'num_print_cycles = 50'
  wdata(11) = 'kmesh_tol = 0.0000001'
  wdata(12) = 'fermi_surface_plot = .true.'
  wdata(13) = 'fermi_surface_num_points = 41'
  wdata(14) = 'fermi_energy = 7.1340'

  proj(1) = 'f=0.9192953735,0.0807145153,0.7500000000:px:z=0.70710678,0,0.70710678:x=0.70710678,0,-0.70710678'
  proj(2) = 'f=0.0807046265,0.9192854847,0.2500000000:px:z=0.70710678,0,0.70710678:x=0.70710678,0,-0.70710678'
  proj(3) = 'f=0.5807034106,0.4192860631,0.7500000000:px:z=0.70710678,0,0.70710678:x=0.70710678,0,-0.70710678'
  proj(4) = 'f=0.4192965894,0.5807139369,0.2500000000:px:z=0.70710678,0,0.70710678:x=0.70710678,0,-0.70710678'
  proj(5) = 'f=0.9192953735,0.0807145153,0.7500000000:pz:z=0.70710678,0,0.70710678:x=0.70710678,0,-0.70710678'
  proj(6) = 'f=0.0807046265,0.9192854847,0.2500000000:pz:z=0.70710678,0,0.70710678:x=0.70710678,0,-0.70710678'
  proj(7) = 'f=0.5807034106,0.4192860631,0.7500000000:pz:z=0.70710678,0,0.70710678:x=0.70710678,0,-0.70710678'
  proj(8) = 'f=0.4192965894,0.5807139369,0.2500000000:pz:z=0.70710678,0,0.70710678:x=0.70710678,0,-0.70710678'


  elecselfen  = .true.
  phonselfen  = .false.
  a2f         = .false.
  prtgkk      = .false.
  nest_fn     = .true.   ! will give me the nesting function at least
  assume_metal = .true.
  !fsthick     = 0.7 ! eV
  eptemp      = 293 ! K
  degaussw    = 0.02 ! eV
  dvscf_dir   = '/lcrc/project/nickelates/adam-adenchfi/DyTe3_primitive/phon/save'

  !filkf       = './LGX.txt'
  !filqf       = './LGX.txt'

  nk1         = 8
  nk2         = 8
  nk3         = 8
  nkf1        = 8
  nkf2        = 8
  nkf3        = 8
 nq1         = 2
  nq2         = 2
  nq3         = 2
  nqf1        = 2
  nqf2        = 2
  nqf3        = 2
 /
  8 cartesian
   0.000000000   0.000000000   0.000000000
   0.000000000   0.000000000  -1.554493425
   0.000000000  -1.571586348   0.000000000
   0.000000000  -1.571586348  -1.554493425
  -0.500000000  -1.489927397   0.000000000
  -0.500000000  -1.489927397  -1.554493425
  -0.500000000  -3.061513745   0.000000000
  -0.500000000  -3.061513745  -1.554493425

hlee
Posts: 415
Joined: Thu Aug 03, 2017 12:24 pm
Affiliation: The University of Texas at Austin

Re: EPW stops when DEALLOCATE mapg, and other gmap crashes

Post by hlee »

Dear adenchfi:

Regarding your issues:
In EPW, there is one restriction relating the number of CPUs and the number of k pools (-npools): the number of k pools must equal the number of CPUs.
This restriction will be removed in EPW v6.0.

In other words, the restriction exists because in EPW the G vectors are not distributed among cores, while in pw2wannier90 they are.
Please check my previous post (viewtopic.php?p=3223#p3223).

Below is the excerpt:
As you can see, the main difference between pw2wannier90 and EPW is that in pw2wannier90 the plane waves (FFT grids) are distributed, which parallelizes the inverse Fourier transform (invFFT), whereas in EPW the invFFT is not parallelized since the plane waves (FFT grids) are not distributed.
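To make the prescription concrete, here is a minimal sketch of what it means for the launch commands; the SLURM variable and the fallback value of 36 are illustrative, not something EPW itself requires:

```shell
#!/bin/bash
# Sketch: honoring the pre-v6.0 EPW restriction that the number of
# k pools (-npools) must equal the number of MPI ranks.
NTASKS=${SLURM_NTASKS:-36}   # fall back to 36 ranks outside a SLURM job
NPOOLS=$NTASKS               # EPW restriction: -npools == number of MPI ranks
# Use the same pool count consistently for the pw.x and epw.x runs:
echo "mpirun -np $NTASKS pw.x  -npools $NPOOLS -i ./scf.in  > scf.out"
echo "mpirun -np $NTASKS pw.x  -npools $NPOOLS -i ./nscf.in > nscf.out"
echo "mpirun -np $NTASKS epw.x -npools $NPOOLS -i ./epw.in  > epw.out"
```

The script only echoes the commands, so it can be inspected before submission.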
Sincerely,

H. Lee

adenchfi
Posts: 26
Joined: Wed May 22, 2019 7:53 pm
Affiliation:

Re: EPW stops when DEALLOCATE mapg, and other gmap crashes

Post by adenchfi »

Hello H. Lee,

Thank you for the reply! I understand now how this difference in G-vectors would produce this different behavior in execution. I feel silly about the -npools thing, but it seems to not have fixed anything. I know you need to use the same -npools for the scf, nscf, and epw runs, but do you also need to use the same -npools for the ph.in run?

I tried the same set of calculations, now with the prescription of -npools = ncpus. I get a stack trace (here's a subset; it repeated for all the processors, and before it was an unreadable memory map):

Code: Select all

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

#7  0x4ED5D6 in __rotate_MOD_gmap_sym at rotate.f90:674 (discriminator 2)
#7  0x4ED5D6 in __rotate_MOD_gmap_sym at rotate.f90:674 (discriminator 2)
#9  0x4062F4 in MAIN__ at epw.f90:146
#8  0x424C34 in elphon_shuffle_wrap_ at elphon_shuffle_wrap.f90:484
#8  0x424C34 in elphon_shuffle_wrap_ at elphon_shuffle_wrap.f90:484
#8  0x424C34 in elphon_shuffle_wrap_ at elphon_shuffle_wrap.f90:484
#9  0x4062F4 in MAIN__ at epw.f90:146
#9  0x4062F4 in MAIN__ at epw.f90:146
#8  0x424C34 in elphon_shuffle_wrap_ at elphon_shuffle_wrap.f90:484
#8  0x424C34 in elphon_shuffle_wrap_ at elphon_shuffle_wrap.f90:484
#9  0x4062F4 in MAIN__ at epw.f90:146
#9  0x4062F4 in MAIN__ at epw.f90:146
#9  0x4062F4 in MAIN__ at epw.f90:146
and in epw.out:

Code: Select all

    Calculating kgmap

     Progress kgmap: ########################################
     kmaps        :      0.40s CPU      0.44s WALL (       1 calls)
     Estimated size of gmap: ngxx =15584
     Symmetries of Bravais lattice:   4
     Symmetries of crystal:           2

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 1381 RUNNING AT bdw-0126
=   EXIT CODE: 6
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================
I re-ran the program to no avail; the only thing different was that stdout got

Code: Select all

*** Error in `epw.x': free(): invalid pointer: 0x000000000f0c21e0 ***
messages as well, at the same point in deallocating the gmap,

Code: Select all

#7  0x4ED5D6 in __rotate_MOD_gmap_sym at rotate.f90:674 (discriminator 2)
#8  0x424C34 in elphon_shuffle_wrap_ at elphon_shuffle_wrap.f90:484
#8  0x424C34 in elphon_shuffle_wrap_ at elphon_shuffle_wrap.f90:484
#9  0x4062F4 in MAIN__ at epw.f90:146
My exact run command was

Code: Select all

mpirun -np $SLURM_NTASKS pw.x -npools 36 -i ./scf.in > scf.out
mpirun -np $SLURM_NTASKS pw.x -npools 36 -i ./nscf.in > nscf.out
mpirun -np $SLURM_NTASKS epw.x -npools 36 -i ./epw.in > epw.out
where I'm indeed using 1 node, 36 cpus.

Since I'm using -npools = ncpus, I'm not sure what the error is; again, it seems to run up to the DEALLOCATE of mapg and then crash.

I've compared the epw.in files between my previously working EPW runs of norm-conserving LaTe3 and my current epw.in for DyTe3 with PAW pseudopotentials, and I don't see anything that should change the execution (the only differences in the inputs are Wannier-related settings and the phonon q-points, which differ because of the different lattice parameters, but I pulled those from ph.out).

Could there be an issue with ibrav? Do I need to have ibrav=0 in my nscf calculation or my phonon calculation? Does ibrav=14 + PAW not work?
Or is it the large memory requirements? I see

Code: Select all

     G-vector sticks info
     --------------------
     sticks:   dense  smooth     PW     G-vecs:    dense   smooth      PW
     Sum        4823    2985   1117               143243    70047   16005
so maybe that's too many G-vectors? I'm not sure, to be honest. I'll keep tinkering and double-checking my input.

hlee
Posts: 415
Joined: Thu Aug 03, 2017 12:24 pm
Affiliation: The University of Texas at Austin

Re: EPW stops when DEALLOCATE mapg, and other gmap crashes

Post by hlee »

Dear adenchfi:

Thank you for your reply.

I guess that there are some issues with the number of symmetries.

However, in order to reproduce your problem, I need the following information:
(1) Which development version you are using: could you let me know the commit SHA of your version?
(2) Did you run the test suite in the package? There is one test for PAW in EPW, epw_trev_paw. Is there any failure in this test?
(3) All inputs and full outputs (scf.in, nscf.in, ph.in, epw.in).
(4) In particular, the full epw.out with high verbosity.

PS) I think this is not related to your issue. But, as you probably know, in PAW case, in addition to dvscf files you should import dvscf_paw_*, including the term due to PAW, from the converged phonon calculations.
Check the following page:
https://gitlab.com/QEF/q-e/-/merge_requests/841

Added after the first post
Additionally, I need the value of ngxxf. You can find this value in the first line of *.kgmap. It would help much if you upload the *.kgmap file.
I now suspect an issue related to an out-of-bounds array access, but in order to identify the problem exactly, I need all the information mentioned above.

Sincerely,

H. Lee

adenchfi
Posts: 26
Joined: Wed May 22, 2019 7:53 pm
Affiliation:

Re: EPW stops when DEALLOCATE mapg, and other gmap crashes

Post by adenchfi »

The epw_trev_paw test works fine (other than that I had to do the pp.py part manually, because the server's python script wanted the prefix wrapped in apostrophes, i.e. 'prefix' instead of plain prefix; I think it's a python2 vs. python3 thing).
PS) I think this is not related to your issue. But, as you probably know, in PAW case, in addition to dvscf files you should import dvscf_paw_*, including the term due to PAW, from the converged phonon calculations.
Check the following page:
https://gitlab.com/QEF/q-e/-/merge_requests/841
Let me do this first. I had been using an older version of pp.py and manually copying over/renaming files. If that doesn't work then I'll reply with the other info, but that very well could be it. If so, another very silly mistake.

hlee
Posts: 415
Joined: Thu Aug 03, 2017 12:24 pm
Affiliation: The University of Texas at Austin

Re: EPW stops when DEALLOCATE mapg, and other gmap crashes

Post by hlee »

As I already said, I think that dvscf_paw_* is not related to your problem.
I suspect that the problem is an out-of-bounds array access, and that it comes from the value of ngxxf, the first line of *.kgmap.
This number should be smaller than the number of soft grids on all cores.

In any case, I need all the information I mentioned in order to identify the problem exactly.
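As a rough sanity check in that spirit, one could compare ngxxf against the smooth ("soft") G-vector count reported by pw.x. This is a hedged sketch, not EPW's own check: the numbers are the ones quoted in this thread, and the exact bound EPW needs may be the per-core count rather than the total.

```shell
#!/bin/bash
# Hedged sketch: compare ngxxf (the first number in *.kgmap) with the
# smooth G-vector count from the "G-vector sticks info" block of pw.x output.
ngxxf=43602          # first line of DyTe3.kgmap (from this thread)
smooth_gvecs=70047   # "smooth" column of the G-vector sticks info (from this thread)
if [ "$ngxxf" -gt "$smooth_gvecs" ]; then
  echo "ngxxf exceeds the smooth G-vector count: possible out-of-bounds"
else
  echo "ngxxf fits within the total smooth G-vector count"
fi
```

In a real run the two numbers would be read from the *.kgmap file and the pw.x output rather than hard-coded.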

adenchfi
Posts: 26
Joined: Wed May 22, 2019 7:53 pm
Affiliation:

Re: EPW stops when DEALLOCATE mapg, and other gmap crashes

Post by adenchfi »

Hi H. Lee,

I've rerun using the new pp.py to organize the phonon data, and there is no difference. So here's a data dump:

Code: Select all

scf.in
&CONTROL
                       title = 'DyTe3_prim'
                 calculation = 'scf'
                restart_mode = 'from_scratch'
                  pseudo_dir = '/home/adenchfield/my_pseudo'
                  prefix = 'DyTe3'
               etot_conv_thr = 0.0001
               forc_conv_thr = 0.001
                       nstep = 400
                     tstress = .true.
                     tprnfor = .true.
 /
 &SYSTEM
                   ibrav = 14
                   celldm(1) =     25.66241802
                   celldm(2) =      1.00000145
                   celldm(3) =      0.32084816
                   celldm(6) =     -0.94855396
                         nat = 8
                        ntyp = 2
                     ecutwfc = 46
                     ecutrho = 291
                     occupations='smearing'
                     degauss=0.001
 /
  &ELECTRONS
            electron_maxstep = 500
                    conv_thr = 1.0D-8
              diago_thr_init = 1e-4
                 mixing_mode = 'plain'
                 mixing_beta = 0.7
                 mixing_ndim = 8
             diagonalization = 'david'
 /
&IONS
                ion_dynamics = 'bfgs'
 /
ATOMIC_SPECIES
   Dy  162.5000000000  Dy.pbe-spdn-kjpaw_psl.1.0.0.UPF
   Te  127.6000000000  Te.pbe-n-kjpaw_psl.1.0.0.UPF
ATOMIC_POSITIONS crystal
Dy            0.1726236958        0.8273655930        0.7500000000
Dy            0.8273763042        0.1726344070        0.2500000000
Te            0.9192953735        0.0807145153        0.7500000000
Te            0.0807046265        0.9192854847        0.2500000000
Te            0.5807034106        0.4192860631        0.7500000000
Te            0.4192965894        0.5807139369        0.2500000000
Te            0.2919444833        0.7080622139        0.7500000000
Te            0.7080555167        0.2919377861        0.2500000000

K_POINTS automatic
18  18  18  0 0 0

Code: Select all

nscf.in
&CONTROL
                       title = 'DyTe3_prim'
                 calculation = 'nscf'
                restart_mode = 'from_scratch'
                  pseudo_dir = '/home/adenchfield/my_pseudo'
                  prefix = 'DyTe3'
               etot_conv_thr = 0.0001
               forc_conv_thr = 0.001
                       nstep = 400
                       verbosity='high'
                     tstress = .true.
                     tprnfor = .true.
 /
 &SYSTEM
                       ibrav = 14
                   celldm(1) =     25.66241802
                   celldm(2) =      1.00000145
                   celldm(3) =      0.32084816
                   celldm(6) =     -0.94855396
                         nat = 8
                        ntyp = 2
                     ecutwfc = 45
                     ecutrho = 290
                     occupations='smearing'
                     degauss=0.001
                     nbnd  = 104
                     nosym = .true.
 /
 &ELECTRONS
            electron_maxstep = 500
                    conv_thr = 1.0D-9
              diago_thr_init = 1e-5
                 mixing_mode = 'plain'
                 mixing_beta = 0.7
                 mixing_ndim = 8
             diagonalization = 'david'
 /
&IONS
                ion_dynamics = 'bfgs'
 /

ATOMIC_SPECIES
   Dy  162.5000000000  Dy.pbe-spdn-kjpaw_psl.1.0.0.UPF
   Te  127.6000000000  Te.pbe-n-kjpaw_psl.1.0.0.UPF

ATOMIC_POSITIONS crystal
Dy            0.1726236958        0.8273655930        0.7500000000
Dy            0.8273763042        0.1726344070        0.2500000000
Te            0.9192953735        0.0807145153        0.7500000000
Te            0.0807046265        0.9192854847        0.2500000000
Te            0.5807034106        0.4192860631        0.7500000000
Te            0.4192965894        0.5807139369        0.2500000000
Te            0.2919444833        0.7080622139        0.7500000000
Te            0.7080555167        0.2919377861        0.2500000000

K_POINTS crystal
512
  0.00000000  0.00000000  0.00000000  1.953125e-03
  0.00000000  0.00000000  0.12500000  1.953125e-03
  0.00000000  0.00000000  0.25000000  1.953125e-03
  0.00000000  0.00000000  0.37500000  1.953125e-03
  0.00000000  0.00000000  0.50000000  1.953125e-03
  0.00000000  0.00000000  0.62500000  1.953125e-03
  0.00000000  0.00000000  0.75000000  1.953125e-03
  0.00000000  0.00000000  0.87500000  1.953125e-03
...
(it's an 8x8x8 kmesh generated by kmesh.pl)

The nscf.in for the phonons is the same, except I do K_POINTS automatic 8 8 8 0 0 0

Code: Select all

ph.in
--
&inputph
  epsil    = .false.,
  fildyn   = 'DyTe3.dyn',
  prefix   = 'DyTe3'
  nq1 = 2
  nq2 = 2
  nq3 = 2
  recover=.true.
  max_seconds=252000
  !qplot    = .true.
  !q_in_band_form = .false.
  ldisp    = .true.
  fildvscf = 'dvscf'
  search_sym=.true.
  tr2_ph   =  1.0d-10,
  !alpha_mix = 0.15
 /

Code: Select all

--
&inputepw
  prefix = 'DyTe3'
  outdir = './'
  iverbosity  = 1

  elph        = .true.
  kmaps       = .false.
  epbwrite    = .true.
  epbread     = .false.
  epwwrite    = .true.
  epwread     = .false.
  use_ws      = .true.
  etf_mem     = 0       ! let's see if it works
  !vme         = .false.


  !lifc        = .false.
  !asr_typ     = 'simple'


  wannierize  = .true. ! this is for the electrons, which we do ourselves
  num_iter    = 500
  iprint      = 2

  nbndsub     =  8
  !nbndskip    =  28  ! now removed from EPW?
  bands_skipped = 'exclude_bands : 1-14'

  dis_win_min  = 0.0
  dis_win_max  = 38.0
  dis_froz_min = 7.12
  dis_froz_max = 7.22

  wdata(1) = 'dis_num_iter = 10000'
  wdata(2) = 'dis_mix_ratio = 0.90'
  wdata(3) = 'dis_conv_tol = 1E-5'
  wdata(4) = 'translate_home_cell = .true.'
  wdata(5) = 'write_xyz = .true.'
  wdata(6) = 'write_hr = .true.'
  wdata(7) = 'hr_cutoff = 0.005'
  wdata(8) = 'write_rmn = .true.'
  wdata(9) = 'use_ws_distance = .true.'
  wdata(10) = 'num_print_cycles = 50'
  wdata(11) = 'kmesh_tol = 0.0000001'
  wdata(12) = 'fermi_surface_plot = .true.'
  wdata(13) = 'fermi_surface_num_points = 41'
  wdata(14) = 'fermi_energy = 7.1340'

  proj(1) = 'f=0.9192953735,0.0807145153,0.7500000000:px:z=0.70710678,0,0.70710678:x=0.70710678,0,-0.70710678'
  proj(2) = 'f=0.0807046265,0.9192854847,0.2500000000:px:z=0.70710678,0,0.70710678:x=0.70710678,0,-0.70710678'
  proj(3) = 'f=0.5807034106,0.4192860631,0.7500000000:px:z=0.70710678,0,0.70710678:x=0.70710678,0,-0.70710678'
  proj(4) = 'f=0.4192965894,0.5807139369,0.2500000000:px:z=0.70710678,0,0.70710678:x=0.70710678,0,-0.70710678'
  proj(5) = 'f=0.9192953735,0.0807145153,0.7500000000:pz:z=0.70710678,0,0.70710678:x=0.70710678,0,-0.70710678'
  proj(6) = 'f=0.0807046265,0.9192854847,0.2500000000:pz:z=0.70710678,0,0.70710678:x=0.70710678,0,-0.70710678'
  proj(7) = 'f=0.5807034106,0.4192860631,0.7500000000:pz:z=0.70710678,0,0.70710678:x=0.70710678,0,-0.70710678'
  proj(8) = 'f=0.4192965894,0.5807139369,0.2500000000:pz:z=0.70710678,0,0.70710678:x=0.70710678,0,-0.70710678'


  elecselfen  = .true.
  phonselfen  = .false.
  a2f         = .false.
  prtgkk      = .false.
  nest_fn     = .true.   ! will give me the nesting function at least
  assume_metal = .true.
 !fsthick     = 0.7 ! eV
  eptemp      = 293 ! K
  degaussw    = 0.02 ! eV

  dvscf_dir   = '/lcrc/project/nickelates/adam-adenchfi/DyTe3_primitive/phon/save'

  !filkf       = './LGX.txt'
  !filqf       = './LGX.txt'

  nk1         = 8  ! currently running 8x8x8, just making sure this time actually works. Might need 8x8x8 to match phonon computation
  nk2         = 8
  nk3         = 8

  nkf1        = 8
  nkf2        = 8
  nkf3        = 8
   nq1         = 2
  nq2         = 2
  nq3         = 2
  nqf1        = 2
  nqf2        = 2
  nqf3        = 2
 /
  8 cartesian
  0.000000000   0.000000000   0.000000000
  0.000000000   0.000000000  -1.558369542
  0.000000000  -1.579200165   0.000000000
  0.000000000  -1.579200165  -1.558369542
 -0.500000000  -1.497958742   0.000000000
 -0.500000000  -1.497958742  -1.558369542
 -0.500000000  -3.077158907   0.000000000
 -0.500000000  -3.077158907  -1.558369542
The first few lines of DyTe3.kgmap are

Code: Select all

       43602
     1    63
     2    63
     3    63
     4    63
     5    63
     6    63
     7    63
     8    63
     9    63
    10    63
    11    63
    12    63
    13    63
    14    63
    15    63
    16    63
    17    63
    18    63
    19    63
    20    63
    21    63
    22    63
    23    63
    24    63
    25    63
    26    63
    27    63
    ...
From my epw.out:

Code: Select all

     Calculating kgmap

     Progress kgmap: ########################################
     kmaps        :      0.40s CPU      0.49s WALL (       1 calls)
     Estimated size of gmap: ngxx =15584
     Symmetries of Bravais lattice:   4
     Symmetries of crystal:           2
and then nothing except a crash dumped to stdout (first few lines):

Code: Select all

*** Error in `epw.x': corrupted size vs. prev_size: 0x000000000e90ad40 ***

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
...
2af2f1dff000-2af2f1e00000 rw-s 00000000 00:13 81570014                   /dev/shm/shm-col-space-93568-E2-5A73372B6CFA0
2af2f1e00000-2af2f1e01000 rw-s 00001000 00:13 81570014                   /dev/shm/shm-col-space-93568-E2-5A73372B6CFA0
2af2f1e01000-2af2f1e02000 rw-s 00002000 00:13 81570014                   /dev/shm/shm-col-space-93568-E2-5A73372B6CFA0
2af2f1e02000-2af2f1e4a000 rw-s 00003000 00:13 81570014                   /dev/shm/shm-col-space-93568-E2-5A73372B6CFA0
2af2f1e4a000-2af2f1e92000 rw-s 0004b000 00:13 81570014                   /dev/shm/shm-col-space-93568-E2-5A73372B6CFA0
2af2f1e92000-2af2f604a000 rw-p 00000000 00:00 0
2af2f6306000-2af2f91b5000 r-xp 00000000 00:36 20336762844                /blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/in$
2af2f91b5000-2af2f93b5000 ---p 02eaf000 00:36 20336762844                /blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/in$
2af2f93b5000-2af2f93bc000 r--p 02eaf000 00:36 20336762844                /blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/in$
2af2f93bc000-2af2f93c6000 rw-p 02eb6000 00:36 20336762844                /blues/gpfs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/in$
2af2f93c6000-2af30f7c6000 rw-p 00000000 00:00 0
2af310000000-2af310021000 rw-p 00000000 00:00 0
2af310021000-2af314000000 ---p 00000000 00:00 0
7fff19cd8000-7fff19cfd000 rw-p 00000000 00:00 0                          [stack]
7fff19d0d000-7fff19d0f000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Program received signal SIGABRT: Process abort signal.

which leads into the same backtrace to rotate.f90.

The program hangs for a while though, presumably trying to get the backtrace.

The zipped kgmap file is apparently too large to attach; if you want the whole kgmap file, should I paste it all here? The output files are attached in a .zip file here:
outputs.zip
(182.97 KiB) Downloaded 254 times
So the ngxx printed in epw.out (shown above) is 15584, while the ngxxf in DyTe3.kgmap is 43602. Here's the G-grid data from epw.out:

Code: Select all

 
    G-vector sticks info
     --------------------
     sticks:   dense  smooth     PW     G-vecs:    dense   smooth      PW
     Sum        4823    2985   1117               143243    70047   16005

Code: Select all

     
     G cutoff = 4837.6385  ( 143243 G-vectors)     FFT grid: (144,144, 45)
     G cutoff = 3002.6722  (  70047 G-vectors)  smooth grid: (120,120,120)
Edit:
I don't recall the specific commit; I didn't git clone, I downloaded the tarball and installed from there. It should be either this one: https://gitlab.com/QEF/q-e/-/commit/0eb ... 23769ba248
or this one: https://gitlab.com/QEF/q-e/-/commit/6b7 ... ebb2d48ff7

Since you pointed out the version/commit # and I notice those commits are related to what I'm doing, I'm downloading the latest version of the develop branch and seeing if that helps.

hlee
Posts: 415
Joined: Thu Aug 03, 2017 12:24 pm
Affiliation: The University of Texas at Austin

Re: EPW stops when DEALLOCATE mapg, and other gmap crashes

Post by hlee »

Dear adenchfi:

Thank you for providing inputs and outputs.

In the output of EPW, the number of symmetry operations is 2 including q -> -q+G, but it shouldn't be (it should be 4 if q -> -q+G is not included).
First, I would suggest you remove nosym=.true. in nscf.in.
The nscf.in for the phonons is the same, except I do K_POINTS automatic 8 8 8 0 0 0
Are you referring to scf.in? In any case, I would suggest you not use nosym=.true.

Sincerely,

H. Lee

adenchfi
Posts: 26
Joined: Wed May 22, 2019 7:53 pm
Affiliation:

Re: EPW stops when DEALLOCATE mapg, and other gmap crashes

Post by adenchfi »

Hi hlee,

I used nosym=.true. because it used to be recommended for EPW back in 2018 because PW would produce extraneous k-points that EPW didn't handle, see http://indico.ictp.it/event/8301/sessio ... al/0/0.pdf. I assume that has been changed by now.

I got rid of nosym=.true. in nscf.in, and EPW appears to be progressing past its previous point where it would crash.

hlee
Posts: 415
Joined: Thu Aug 03, 2017 12:24 pm
Affiliation: The University of Texas at Austin

Re: EPW stops when DEALLOCATE mapg, and other gmap crashes

Post by hlee »

Dear adenchfi:
I used nosym=.true. because it used to be recommended for EPW back in 2018 because PW would produce extraneous k-points that EPW didn't handle, see http://indico.ictp.it/event/8301/sessio ... al/0/0.pdf. I assume that has been changed by now.
I think that nosym=.true. is not a good solution to the issue you mentioned; that issue is related to the following part.

Excerpt from setup.f90 in /PW/src

Code: Select all

  !
  ! ... Input k-points are assumed to be  given in the IBZ of the Bravais
  ! ... lattice, with the full point symmetry of the lattice.
  ! ... If some symmetries of the lattice are missing in the crystal,
  ! ... "irreducible_BZ" computes the missing k-points.
  !
  IF ( .NOT. lbands ) THEN
     CALL irreducible_BZ (nrot_, s, nsym, time_reversal, &
                          magnetic_sym, at, bg, npk, nkstot, xk, wk, t_rev)
  ELSE
     one = SUM (wk(1:nkstot))
     IF ( one > 0.0_dp ) wk(1:nkstot) = wk(1:nkstot) / one
  END IF
More importantly, in the recent development version of EPW,
(1) we no longer need the q-point list in epw.in; this information is automatically retrieved using the correct symmetry of the system;
(2) the last dimension of some arrays is now nsym rather than the previous 48 (the maximum number of symmetry operations), in order to avoid useless memory usage.
So we should not use nosym=.true.

For instance, now you don't need to include the q point list in epw.in:
From your epw.in

Code: Select all

 /
  8 cartesian
  0.000000000   0.000000000   0.000000000
  0.000000000   0.000000000  -1.558369542
  0.000000000  -1.579200165   0.000000000
  0.000000000  -1.579200165  -1.558369542
 -0.500000000  -1.497958742   0.000000000
 -0.500000000  -1.497958742  -1.558369542
 -0.500000000  -3.077158907   0.000000000
 -0.500000000  -3.077158907  -1.558369542
Lastly, I would like to add one comment:
From your nscf.in

Code: Select all

              diago_thr_init = 1e-5
In a non-self-consistent calculation, diago_thr_init behaves differently than in an scf calculation: in nscf calculations, the iterative diagonalization continues only up to the value of diago_thr_init.
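Given that behavior, the corresponding adjustment would be a tighter diago_thr_init in the nscf &ELECTRONS namelist. An illustrative fragment, not a prescription (the exact value to use depends on the accuracy you need):

```
 &ELECTRONS
             electron_maxstep = 500
                     conv_thr = 1.0D-9
               diago_thr_init = 1e-9   ! illustrative: tighter than the 1e-5 used above
 /
```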

Sincerely,

H. Lee
