Invalid memory reference -- double free or corruption (out) while doing MgB2 1x1x2 supercell calculation

al7
Affiliation: University of Illinois Chicago

Post by al7 »

Hi,
I am trying to run an anisotropic Eliashberg calculation on a 1x1x2 supercell of MgB2. The unit-cell calculation runs fine, but the 1x1x2 supercell calculation fails with the following error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7ff8e9680dbf in ???
#1  0x7ff8e96d3cb3 in ???
#2  0x54f91b in __pw2wan2epw_MOD_compute_amn_para
        at /global/common/software/nersc/pm-2021q4/sw/qe/pm-cpu/qe-7.0/EPW/src/pw2wan2epw.f90:1142
#3  0x556964 in __pw2wan2epw_MOD_pw2wan90epw
        at /global/common/software/nersc/pm-2021q4/sw/qe/pm-cpu/qe-7.0/EPW/src/pw2wan2epw.f90:103
#4  0x4227d9 in __wannierization_MOD_wann_run
        at /global/common/software/nersc/pm-2021q4/sw/qe/pm-cpu/qe-7.0/EPW/src/wannierization.f90:73
#5  0x40964d in epw
        at /global/common/software/nersc/pm-2021q4/sw/qe/pm-cpu/qe-7.0/EPW/src/epw.f90:133
#6  0x40945c in main
        at /global/common/software/nersc/pm-2021q4/sw/qe/pm-cpu/qe-7.0/EPW/src/epw.f90:20

When I ran the same supercell calculation earlier, it aborted with the following message and the same backtrace:

double free or corruption (out)
Program received signal SIGABRT: Process abort signal.

Since this error appears to be memory-related, I ran the job under valgrind as follows:

valgrind --leak-check=full srun epw.x $flags1 -nk 128 -input epw.in > epw.out 2> epw.err

However, I still do not understand what is causing the error. The valgrind output was:

==462528== Memcheck, a memory error detector
==462528== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==462528== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==462528== Command: srun epw.x -nimage 1 -npool 16 -nband 1 -ntg 1 -ndiag 1 -nk 128 -input epw.in
==462528==
munmap_chunk(): invalid pointer

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x7f85954a2dbf in ???
#1  0x7f85954a2d2b in ???
#2  0x7f85954a43e4 in ???
#3  0x7f85954e8c26 in ???
#4  0x7f85954f0cc9 in ???
#5  0x7f85954f0f9b in ???
#6  0x54f91b in __pw2wan2epw_MOD_compute_amn_para
        at /global/common/software/nersc/pm-2021q4/sw/qe/pm-cpu/qe-7.0/EPW/src/pw2wan2epw.f90:1142
#7  0x556964 in __pw2wan2epw_MOD_pw2wan90epw
        at /global/common/software/nersc/pm-2021q4/sw/qe/pm-cpu/qe-7.0/EPW/src/pw2wan2epw.f90:103
#8  0x4227d9 in __wannierization_MOD_wann_run
        at /global/common/software/nersc/pm-2021q4/sw/qe/pm-cpu/qe-7.0/EPW/src/wannierization.f90:73
#9  0x40964d in epw
        at /global/common/software/nersc/pm-2021q4/sw/qe/pm-cpu/qe-7.0/EPW/src/epw.f90:133
#10  0x40945c in main
        at /global/common/software/nersc/pm-2021q4/sw/qe/pm-cpu/qe-7.0/EPW/src/epw.f90:20
srun: error: nid006696: task 107: Aborted
srun: Terminating StepId=18135667.2
slurmstepd: error: *** STEP 18135667.2 ON nid004583 CANCELLED AT 2023-11-12T23:26:14 ***
srun: error: nid005797: tasks 48-63: Terminated
srun: error: nid004773: tasks 16-31: Terminated
srun: error: nid004583: tasks 0-15: Terminated
srun: error: nid006134: tasks 80-95: Terminated
srun: error: nid005672: tasks 32-47: Terminated
srun: error: nid005866: tasks 64-79: Terminated
srun: error: nid006696: tasks 96-106,108-111: Terminated
srun: error: nid006779: tasks 112-127: Terminated
srun: Force Terminated StepId=18135667.2
==462529==
==462529== HEAP SUMMARY:
==462529==     in use at exit: 684,400 bytes in 4,903 blocks
==462529==   total heap usage: 23,096 allocs, 18,193 frees, 12,955,997 bytes allocated
==462529==
==462529== 0 bytes in 1 blocks are possibly lost in loss record 9 of 1,191
==462529==    at 0x4A366A4: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==462529==    by 0x531959F: register_state (in /lib64/libc-2.31.so)
==462529==    by 0x531BE86: re_acquire_state_context (in /lib64/libc-2.31.so)
==462529==    by 0x5322CCB: re_compile_internal (in /lib64/libc-2.31.so)
==462529==    by 0x5328298: regcomp (in /lib64/libc-2.31.so)
==462529==    by 0x4F42B57: s_p_hashtbl_create_cnt (parse_config.c:209)
==462529==    by 0x4F4299B: s_p_hashtbl_create (parse_config.c:217)
==462529==    by 0x4F54335: _init_slurm_conf (read_config.c:3215)
==462529==    by 0x4F5A4F7: slurm_conf_init_load (read_config.c:3508)
==462529==    by 0x4F5A859: slurm_conf_init (read_config.c:3533)
==462529==    by 0x4EE6141: slurm_init (init.c:47)
==462529==    by 0x411614: srun (srun.c:176)
==462529==
==462529== 0 bytes in 1 blocks are possibly lost in loss record 10 of 1,191
==462529==    at 0x4A366A4: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==462529==    by 0x531959F: register_state (in /lib64/libc-2.31.so)
==462529==    by 0x531BE86: re_acquire_state_context (in /lib64/libc-2.31.so)
==462529==    by 0x5322E12: re_compile_internal (in /lib64/libc-2.31.so)
==462529==    by 0x5328298: regcomp (in /lib64/libc-2.31.so)
==462529==    by 0x4F42B57: s_p_hashtbl_create_cnt (parse_config.c:209)
==462529==    by 0x4F4299B: s_p_hashtbl_create (parse_config.c:217)
==462529==    by 0x4F54335: _init_slurm_conf (read_config.c:3215)
==462529==    by 0x4F5A4F7: slurm_conf_init_load (read_config.c:3508)
==462529==    by 0x4F5A859: slurm_conf_init (read_config.c:3533)
==462529==    by 0x4EE6141: slurm_init (init.c:47)
==462529==    by 0x411614: srun (srun.c:176)
==462529==
==462529== 4 bytes in 1 blocks are possibly lost in loss record 19 of 1,191
==462529==    at 0x4A366A4: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==462529==    by 0x5319466: re_node_set_insert (in /lib64/libc-2.31.so)
==462529==    by 0x531A78D: duplicate_node_closure (in /lib64/libc-2.31.so)
==462529==    by 0x531B847: calc_eclosure_iter (in /lib64/libc-2.31.so)
==462529==    by 0x5322A1C: re_compile_internal (in /lib64/libc-2.31.so)
==462529==    by 0x5328298: regcomp (in /lib64/libc-2.31.so)
==462529==    by 0x4F42B57: s_p_hashtbl_create_cnt (parse_config.c:209)
==462529==    by 0x4F4299B: s_p_hashtbl_create (parse_config.c:217)
==462529==    by 0x4F54335: _init_slurm_conf (read_config.c:3215)
==462529==    by 0x4F5A4F7: slurm_conf_init_load (read_config.c:3508)
==462529==    by 0x4F5A859: slurm_conf_init (read_config.c:3533)
==462529==    by 0x4EE6141: slurm_init (init.c:47)   

... (and so on...)                                                                                  
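
A note on reading the Memcheck log above: its Command: line begins with srun, and the leak records shown pass through slurm_init and srun.c, which suggests that valgrind instrumented the job launcher rather than epw.x itself. A common arrangement is to place valgrind after the launcher (srun valgrind ... epw.x ...) so that every MPI rank is instrumented; that is general practice and has not been tested on this system. The Python sketch below only illustrates the check. The function names and the list of launcher names are hypothetical and not part of valgrind or EPW.

```python
import re

def valgrind_targets(log_text):
    """Return {pid: command} for every '==PID== Command: ...' line of a Memcheck log."""
    targets = {}
    for line in log_text.splitlines():
        match = re.match(r"\s*==(\d+)==\s+Command:\s+(.*)", line)
        if match:
            targets[int(match.group(1))] = match.group(2).strip()
    return targets

def wraps_launcher(command, launchers=("srun", "mpirun", "mpiexec")):
    """True if the instrumented program is a job launcher (assumed list) and not the application."""
    tokens = command.split()
    if not tokens:
        return False
    # Compare only the executable name, ignoring any directory prefix.
    return tokens[0].rsplit("/", 1)[-1] in launchers

# The Command: line copied from the log above.
sample = ("==462528== Command: srun epw.x -nimage 1 -npool 16 -nband 1 "
          "-ntg 1 -ndiag 1 -nk 128 -input epw.in")

for pid, cmd in valgrind_targets(sample).items():
    kind = "launcher" if wraps_launcher(cmd) else "application"
    print(pid, kind, cmd.split()[0])
```

For the log above this classifies process 462528 as a launcher, so the loss records that follow most likely describe srun and Slurm configuration parsing, not EPW.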

The input and output files are below for reference. At this point I am not sure what is causing the error. Could a wrong choice of Wannier projections cause this, or the choice of k-grid?
The phonon calculation was done on a 6x6x3 q-grid.

epw.in:

--
&inputepw
  prefix      = 'MgB2_1x1x2',
  amass(1)    = 24.305,
  amass(2)    = 10.811
  outdir      = './'


  ep_coupling = .true.
  elph        = .true.
  epbwrite    = .true.
  epbread     = .false.

  epwwrite = .true.
  epwread  = .false.

  etf_mem     =  1

  nbndsub     =  5,

  wannierize  = .true.
  num_iter    = 500
  dis_froz_max= 8.8
  proj(1)     = 'B:pz'
  proj(2)     = 'f=0.5,1.0,0.25:s'
  proj(3)     = 'f=0.0,0.5,0.25:s'
  proj(4)     = 'f=0.5,0.5,0.25:s'
  proj(5)     = 'f=0.5,1.0,0.75:s'
  proj(6)     = 'f=0.0,0.5,0.75:s'
  proj(7)     = 'f=0.5,0.5,0.75:s'

  iverbosity  = 2

  eps_acustic = 2.0    ! Lowest boundary for the phonon frequency
  ephwrite    = .true. ! Writes .ephmat files used when eliashberg = .true.

  fsthick     = 0.4  ! eV
  degaussw    = 0.10 ! eV
  nsmear      = 1
  delta_smear = 0.04 ! eV

  degaussq     = 0.5 ! meV
  nqstep       = 500
  eliashberg  = .true.

  laniso = .true.
  limag = .true.
  lpade = .true.

  conv_thr_iaxis = 1.0d-4

  wscut = 1.0   ! eV   Upper limit over frequency integration/summation in the Eliashberg equations

  nstemp   = 1     ! Nr. of temps
  temps    = 15.00 ! K  Provide a list of temperatures, OR nstemp with temps = tempsmin tempsmax for evenly spaced mode

  nsiter   = 500

  muc     = 0.16

  dvscf_dir   = '../phonons/save'

  nk1         = 6
  nk2         = 6
  nk3         = 3

  nq1         = 6
  nq2         = 6
  nq3         = 3

  mp_mesh_k = .true.
  nkf1 = 20
  nkf2 = 20
  nkf3 = 20

  nqf1 = 20
  nqf2 = 20
  nqf3 = 20
 /
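
Since the question is whether the projections could be involved, one thing that can be checked directly from this input is how many projections it defines compared with nbndsub. Wannier90 generally expects the number of initial projections to match the number of Wannier functions. The Python sketch below counts them under stated assumptions: the orbital table is a small, incomplete subset of Wannier90's labels, the atom counts (2 Mg, 4 B) are taken from the ATOMIC_POSITIONS card of the scf input, and the function names are hypothetical. Whether any mismatch is related to the crash is not established here.

```python
import re

# Functions per orbital label (assumed, incomplete subset of Wannier90's labels).
ORBITALS = {"s": 1, "p": 3, "px": 1, "py": 1, "pz": 1,
            "d": 5, "sp": 2, "sp2": 3, "sp3": 4}

def count_projections(epw_in_text, atoms_per_species):
    """Count the initial projections implied by the proj(n) = 'site:orbitals' entries."""
    total = 0
    pattern = r"proj\(\d+\)\s*=\s*'([^:']+):([^']+)'"
    for site, orbitals in re.findall(pattern, epw_in_text):
        site = site.strip()
        # 'f=' / 'c=' name one explicit site; a species label covers all its atoms.
        n_sites = 1 if site.startswith(("f=", "c=")) else atoms_per_species[site]
        total += n_sites * sum(ORBITALS[o.strip()] for o in orbitals.split(";"))
    return total

def read_nbndsub(epw_in_text):
    """Return the integer value of nbndsub, or None if it is absent."""
    match = re.search(r"nbndsub\s*=\s*(\d+)", epw_in_text)
    return int(match.group(1)) if match else None

# Relevant lines copied from the epw.in above.
epw_in = """
  nbndsub     =  5,
  proj(1)     = 'B:pz'
  proj(2)     = 'f=0.5,1.0,0.25:s'
  proj(3)     = 'f=0.0,0.5,0.25:s'
  proj(4)     = 'f=0.5,0.5,0.25:s'
  proj(5)     = 'f=0.5,1.0,0.75:s'
  proj(6)     = 'f=0.0,0.5,0.75:s'
  proj(7)     = 'f=0.5,0.5,0.75:s'
"""

n_proj = count_projections(epw_in, {"Mg": 2, "B": 4})
print(f"projections defined: {n_proj}, nbndsub: {read_nbndsub(epw_in)}")
```

With these assumptions the input defines 10 projections (4 B pz plus 6 s) against nbndsub = 5, and epw.out lists only five initial projections.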

epw.out stops here:

     -------------------------------------------------------------------
     Wannierization on  6 x  6 x  3 electronic grid
     -------------------------------------------------------------------

     Spin CASE ( default = unpolarized )

     Initializing Wannier90


     Initial Wannier projections

     (   0.33333   0.66667   0.25000) :  l =   1 mr =   1
     (   0.66667   0.33333   0.75000) :  l =   1 mr =   1
     (   0.66667   0.33333   0.25000) :  l =   1 mr =   1
     (   0.33333   0.66667   0.75000) :  l =   1 mr =   1
     (   0.50000   1.00000   0.25000) :  l =   0 mr =   1

      - Number of bands is ( 12)
      - Number of total bands is ( 12)
      - Number of excluded bands is (  0)
      - Number of wannier functions is (  5)
      - All guiding functions are given

  Reading data about k-point neighbours

      - All neighbours are found

     AMN
      k points =   108 in  128 pools
            1 of    1 on ionode
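
The line "k points = 108 in 128 pools" in the output above means there are more pools than k-points. The short Python sketch below shows what an even block distribution would give in that case. The split rule (equal shares, with the remainder going to the first pools) is an assumption for illustration, and kpoints_per_pool is a hypothetical helper, not an EPW routine. Whether compute_amn_para tolerates pools without k-points has not been verified here.

```python
def kpoints_per_pool(nks, npool):
    """Split nks k-points over npool pools; the first (nks % npool) pools get one extra."""
    base, rest = divmod(nks, npool)
    return [base + 1 if i < rest else base for i in range(npool)]

# Values taken from the run above: 108 k-points (6 x 6 x 3) and -nk 128.
counts = kpoints_per_pool(108, 128)
empty = sum(1 for c in counts if c == 0)
print(f"{empty} of {len(counts)} pools hold no k-point")
```

Under this assumption 20 of the 128 pools receive no k-point at all, so rerunning with a pool count no larger than 108 (ideally a divisor of it) would be a cheap test.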

MgB2_1x1x2.wout stops here:

 Time to write kmesh            0.526 (sec)

 MgB2_1x1x2.nnkp written.
 Time to write kmesh            0.526 (sec)

 Finished setting up k-point neighbours.

 Exiting wannier_setup in wannier90 15:26:05

scf.in and nscf.in for reference:

 &control
    calculation='scf',
    prefix='MgB2_1x1x2',
    pseudo_dir = '../pp/',
    outdir='./',
    tprnfor = .true.,
    tstress = .true.,
    etot_conv_thr = 1.0d-5
    forc_conv_thr = 1.0d-4
 /
 &system
    ibrav = 0,
    nat=  6,
    ntyp = 2,
    ecutwfc = 40
    smearing = 'mp'
    occupations = 'smearing'
    degauss = 0.02
 /
 &electrons
    diagonalization = 'david'
    mixing_mode = 'plain'
    mixing_beta = 0.7
    conv_thr =  1.0d-9
 /
ATOMIC_SPECIES
 Mg  24.305  Mg.pz-n-vbc.UPF
 B   10.811  B.pz-vbc.UPF
ATOMIC_POSITIONS crystal
Mg           0.0000000000       0.0000000000       0.0000000000
Mg           0.0000000000       0.0000000000       0.5000000000
B            0.3333333400       0.6666666800       0.2500000000
B            0.6666666090       0.3333332950       0.7500000000
B            0.6666666090       0.3333332950       0.2500000000
B            0.3333333400       0.6666666800       0.7500000000
K_POINTS AUTOMATIC
12 12 6 0 0 0
CELL_PARAMETERS angstrom
      3.0829999447       0.0000000000       0.0000000000
     -1.5414999723       2.6699562720       0.0000000000
      0.0000000000       0.0000000000       7.0419998169

nscf.in:

 &control
    calculation='nscf',
    prefix='MgB2_1x1x2',
    pseudo_dir = '../pp/',
    outdir='./',
    tprnfor = .true.,
    tstress = .true.,
    etot_conv_thr = 1.0d-5
    forc_conv_thr = 1.0d-4
 /
 &system
    ibrav = 0,
    nat=  6,
    ntyp = 2,
    ecutwfc = 40
    smearing = 'mp'
    occupations = 'smearing'
    degauss = 0.02
 /
 &electrons
    diagonalization = 'david'
    mixing_mode = 'plain'
    mixing_beta = 0.7
    conv_thr =  1.0d-9
 /
ATOMIC_SPECIES
 Mg  24.305  Mg.pz-n-vbc.UPF
 B   10.811  B.pz-vbc.UPF
ATOMIC_POSITIONS crystal
Mg           0.0000000000       0.0000000000       0.0000000000
Mg           0.0000000000       0.0000000000       0.5000000000
B            0.3333333400       0.6666666800       0.2500000000
B            0.6666666090       0.3333332950       0.7500000000
B            0.6666666090       0.3333332950       0.2500000000
B            0.3333333400       0.6666666800       0.7500000000
CELL_PARAMETERS angstrom
      3.0829999447       0.0000000000       0.0000000000
     -1.5414999723       2.6699562720       0.0000000000
      0.0000000000       0.0000000000       7.0419998169
K_POINTS crystal
108
0.0   0.0   0.0   0.00925925925926
0.0   0.0   0.333333333333   0.00925925925926
0.0   0.0   0.666666666667   0.00925925925926
0.0   0.166666666667   0.0   0.00925925925926
0.0   0.166666666667   0.333333333333   0.00925925925926
0.0   0.166666666667   0.666666666667   0.00925925925926
0.0   0.333333333333   0.0   0.00925925925926
0.0   0.333333333333   0.333333333333   0.00925925925926
0.0   0.333333333333   0.666666666667   0.00925925925926
0.0   0.5   0.0   0.00925925925926
0.0   0.5   0.333333333333   0.00925925925926
0.0   0.5   0.666666666667   0.00925925925926
0.0   0.666666666667   0.0   0.00925925925926
0.0   0.666666666667   0.333333333333   0.00925925925926
0.0   0.666666666667   0.666666666667   0.00925925925926
0.0   0.833333333333   0.0   0.00925925925926
0.0   0.833333333333   0.333333333333   0.00925925925926
0.0   0.833333333333   0.666666666667   0.00925925925926
0.166666666667   0.0   0.0   0.00925925925926
0.166666666667   0.0   0.333333333333   0.00925925925926
0.166666666667   0.0   0.666666666667   0.00925925925926
0.166666666667   0.166666666667   0.0   0.00925925925926
0.166666666667   0.166666666667   0.333333333333   0.00925925925926
0.166666666667   0.166666666667   0.666666666667   0.00925925925926
0.166666666667   0.333333333333   0.0   0.00925925925926
0.166666666667   0.333333333333   0.333333333333   0.00925925925926
0.166666666667   0.333333333333   0.666666666667   0.00925925925926
0.166666666667   0.5   0.0   0.00925925925926
0.166666666667   0.5   0.333333333333   0.00925925925926
0.166666666667   0.5   0.666666666667   0.00925925925926
0.166666666667   0.666666666667   0.0   0.00925925925926
0.166666666667   0.666666666667   0.333333333333   0.00925925925926
0.166666666667   0.666666666667   0.666666666667   0.00925925925926
0.166666666667   0.833333333333   0.0   0.00925925925926
0.166666666667   0.833333333333   0.333333333333   0.00925925925926
...  (and so on...)
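
For reference, the truncated list above is a uniform 6x6x3 grid in crystal coordinates with equal weights of 1/108. A minimal Python sketch that regenerates such a list is given below, assuming the same ordering as the visible entries (third index varying fastest). The function names are hypothetical and the number formatting differs slightly from the list above.

```python
from itertools import product

def uniform_kmesh(n1, n2, n3):
    """Return (k1, k2, k3, weight) tuples for an unshifted n1 x n2 x n3 grid."""
    weight = 1.0 / (n1 * n2 * n3)
    # product() varies the last index fastest, matching the ordering of the list above.
    return [(i / n1, j / n2, k / n3, weight)
            for i, j, k in product(range(n1), range(n2), range(n3))]

def format_kpoints(points):
    """Format the points as a K_POINTS crystal card."""
    lines = ["K_POINTS crystal", str(len(points))]
    lines += ["  ".join(f"{value:.12f}" for value in point) for point in points]
    return "\n".join(lines)

mesh = uniform_kmesh(6, 6, 3)
print(format_kpoints(mesh[:4]))
```

The full card is obtained with format_kpoints(mesh); the grid passed here should be the same as nk1, nk2, nk3 in epw.in.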
