slurmstepd: error: Detected 2 oom-kill event(s)

Posted: Wed Dec 06, 2023 11:30 am
by lixuejie
Dear EPW users,

I am using EPW 7.1 to calculate the superconducting temperature of the YB2 compound. I keep getting this error:

"= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 508807 RUNNING AT cn14
= KILLED BY SIGNAL: 9 (Killed)"

The error file contains:
"slurmstepd: error: Detected 2 oom-kill event(s) in StepId=19809.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: cn14: task 0: Out Of Memory"

Here are my input files for scf, nscf, and EPW:

&control
calculation='scf'
prefix='YB2'
etot_conv_thr = 1.0d-5
forc_conv_thr = 1.0D-4, !Default: 1.0D-3 (a.u)
pseudo_dir='./'
outdir='./'
tprnfor = .true.,
tstress = .true.,
/

&system
ibrav=4,
celldm(1) = 6.224896721,
celldm(3) = 1.171172955,
nat=3,
ntyp=2,
ecutwfc=60,
smearing = 'mp'
occupations = 'smearing'
degauss = 0.02

/

&electrons
diagonalization = 'david'
mixing_mode = 'plain'
conv_thr=1.0d-9,
mixing_beta = 0.7,
/

ATOMIC_SPECIES
Y 88.906 Y_ONCV_PBE-1.0.upf
B 10.811 B_ONCV_PBE-1.0.upf

ATOMIC_POSITIONS crystal
Y 0.000000 0.000000 0.000000
B 0.666667 0.333333 0.500000
B 0.333333 0.666667 0.500000

K_POINTS automatic
12 12 12 0 0 0




&control
calculation='bands',
prefix='YB2',
pseudo_dir = './',
outdir='./',
tprnfor = .true.,
tstress = .true.,
etot_conv_thr = 1.0d-5
forc_conv_thr = 1.0d-4
/
&system
ibrav=4,
celldm(1) = 6.224896721,
celldm(3) = 1.171172955,
nat= 3,
ntyp = 2,
ecutwfc = 60
smearing = 'mp'
occupations = 'smearing'
degauss = 0.02
nbnd = 35
/
&electrons
diagonalization = 'david'
mixing_mode = 'plain'
mixing_beta = 0.7
conv_thr = 1.0d-9
/

ATOMIC_SPECIES
Y 88.906 Y_ONCV_PBE-1.0.upf
B 10.811 B_ONCV_PBE-1.0.upf

ATOMIC_POSITIONS crystal
Y 0.000000 0.000000 0.000000
B 0.666667 0.333333 0.500000
B 0.333333 0.666667 0.500000

K_POINTS crystal
1728
0.00000000 0.00000000 0.00000000 5.787037e-04
0.00000000 0.00000000 0.08333333 5.787037e-04
0.00000000 0.00000000 0.16666667 5.787037e-04
...

--
&inputepw
prefix = 'YB2',
amass(1) = 88.906,
amass(2) = 10.811
outdir = './'
max_memlt = 50

ep_coupling = .true.
elph = .true.
epbwrite = .true.
epbread = .false.

epwwrite = .true.
epwread = .false.
vme = .false.

etf_mem = 1

nbndsub = 7,

efermi_read = .true.
fermi_energy = 12.085876

wannierize = .true.
num_iter = 200

!dis_win_max = 14.8
!dis_win_min = 7.12
dis_froz_max = 12.32
dis_froz_min= 11.7

proj(1) = 'Y:dz2,dxy,dx2-y2'
proj(2) = 'B:pz,py'

wdata(1) = 'guiding_centres = .true.'
wdata(2) = 'dis_num_iter = 500'
wdata(3) = 'bands_plot = .true.'
wdata(4) = 'begin kpoint_path'
wdata(5) = 'G 0.0000000 0.0000000 0.0000000 K 0.3333333 0.3333333 0.0000000'
wdata(6) = 'K 0.3333333 0.3333333 0.0000000 M 0.5000000 0.0000000 0.0000000'
wdata(7) = 'M 0.5000000 0.0000000 0.0000000 G 0.0000000 0.0000000 0.0000000'
wdata(8) = 'G 0.0000000 0.0000000 0.0000000 A 0.0000000 0.0000000 0.5000000'
wdata(9) = 'A 0.0000000 0.0000000 0.5000000 H 0.3333333 0.3333333 0.5000000'
wdata(10) ='H 0.3333333 0.3333333 0.5000000 L 0.5000000 0.0000000 0.5000000'
wdata(11) ='L 0.5000000 0.0000000 0.5000000 A 0.0000000 0.0000000 0.5000000'
wdata(12) = 'end kpoint_path'
wdata(13) = 'bands_plot_format = gnuplot'
wdata(14)= 'use_ws_distance = T'
wdata(15)= 'conv_window = 4'
wdata(16) = 'kmesh_tol=0.00001'

iverbosity = 2

eps_acustic = 2.0 ! Lowest boundary for the phonon frequency
ephwrite = .true. ! Writes .ephmat files used when eliashberg = .true.

fsthick = 0.4 ! eV
degaussw = 0.10 ! eV
nsmear = 1
delta_smear = 0.04 ! eV

degaussq = 0.5 ! meV
nqstep = 500

eliashberg = .true.

laniso = .true.
limag = .true.
lpade = .true.

conv_thr_iaxis = 1.0d-4

wscut = 0.42 ! eV Upper limit over frequency integration/summation in the Eliashberg eq

nstemp = 15 ! Nr. of temps
temps = 2 30 ! K provide list of temperatures OR (nstemp and temps = tempsmin tempsmax for even space mode)

nsiter = 500

muc = 0.1

dvscf_dir = '../11.25e-p/save'

nk1 = 12
nk2 = 12
nk3 = 12

nq1 = 6
nq2 = 6
nq3 = 6

mp_mesh_k = .true.
nkf1 = 48
nkf2 = 48
nkf3 = 48

nqf1 = 24
nqf2 = 24
nqf3 = 24
/

I would be thankful if anybody could help me or offer some suggestions.

Best regards,
xj Li

Re: slurmstepd: error: Detected 2 oom-kill event(s)

Posted: Sat Dec 09, 2023 9:07 pm
by stiwari
Hi,

At first glance, it looks like the system on which you are running EPW does not have enough memory. If it is an HPC cluster, you can try increasing the number of nodes while keeping the total number of cores the same, so that each process has more memory available, and see whether the error goes away. Alternatively, reduce the fine grids (nkf1, nkf2, nkf3 and nqf1, nqf2, nqf3) and check whether the job then runs. If neither helps, please post your EPW output file (.out).
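To get a feel for how much smaller a reduced fine grid is, here is a rough Python sketch that only counts grid points. The 48x48x48 and 24x24x24 grids are taken from the posted input; the smaller test grids (24x24x24 and 12x12x12) are hypothetical values chosen for illustration, not a recommendation. These are full-grid counts: with mp_mesh_k = .true. the number of k-points actually used is lower, and the counts are not a memory estimate, only a size comparison.

```python
def grid_points(n1, n2, n3):
    """Number of points in a uniform n1 x n2 x n3 grid (no symmetry reduction)."""
    return n1 * n2 * n3


# Fine grids from the posted EPW input (nkf1..3 and nqf1..3).
nkf = (48, 48, 48)
nqf = (24, 24, 24)

# Hypothetical smaller grids for a first test run (an assumption, not from the thread).
nkf_test = (24, 24, 24)
nqf_test = (12, 12, 12)

full_k, full_q = grid_points(*nkf), grid_points(*nqf)
test_k, test_q = grid_points(*nkf_test), grid_points(*nqf_test)

print(f"posted grids: {full_k} k-points, {full_q} q-points")
print(f"test grids:   {test_k} k-points, {test_q} q-points")
# Ratio of (k-points x q-points) between the posted and the test grids.
print(f"k-q pair ratio: {(full_k * full_q) // (test_k * test_q)}")
```

Halving each grid dimension cuts the number of points in that grid by a factor of 8, so a test run of this kind is much lighter. If it completes, the grids can be increased step by step to check convergence against the memory available.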

Best regards,
Sabya.