I recently used the EPW package to solve the anisotropic Eliashberg equations for a system containing 27 atoms and 153 electrons in total, running on a cluster. Each node of the cluster has 24 GB of memory.
For the self-consistent calculation of the charge density, a 6*6*1 k-mesh was used (36 k-points in the irreducible BZ), and a 2*2*1 q-mesh (4 q-points in the IBZ) was adopted to compute the dynamical matrices.
I have obtained the *.ephmat, *.freq, *.egnv, and *.ikmap files in advance, which EPW needs to solve the equations. I then tried to continue the calculation and solve the equations with:
mpiexec -N 18 -n 36 epw.x -npool 36 < in.epw > out.epw
This runs 36 processes on 18 nodes, i.e. two processes per node, with 24 GB of memory available on each node.
However, the run always fails without any error message in the EPW output, which puzzles me. Below are my input file and the last part of the EPW output.
Input of EPW (part):
restart_freq = 100
iverbosity = 1
ep_coupling = .true.
elph = .false.
kmaps = .true.
epbwrite = .false.
epbread = .true.
system_2d = .true.
epwwrite = .false.
epwread = .true.
max_memlt = 12.0d0
nqstep = 500
eliashberg = .true.
limag = .true.
lpade = .true.
nk1 = 6
nk2 = 6
nk3 = 1
nq1 = 2
nq2 = 2
nq3 = 1
nkf1 = 50
nkf2 = 50
nkf3 = 1
nqf1 = 50
nqf2 = 50
nqf3 = 1
The last part of the EPW output is as follows:
===================================================================
Solve anisotropic Eliashberg equations
===================================================================
Finish reading .freq file
....
Nr k-points within the Fermi shell = 2500 out of 2500
9 bands within the Fermi window
Finish reading .egnv file
Max nr of q-points = 2500
Finish reading .ikmap files
Size of allocated memory per pool : ~= 8.6031 Gb
Start reading .ephmat files
After that, EPW stops, and the standard output of the job reads:
yhrun: error: cn14: task 5: Killed
yhrun: First task exited 60s ago
yhrun: tasks 0-4,6-15,17-32,34-35: running
yhrun: tasks 5,16,33: exited abnormally
yhrun: Terminating job step 9434701.0
slurmd[cn12]: *** STEP 9434701.0 KILLED AT 2017-12-22T22:02:30 WITH SIGNAL 9 ***
yhrun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[cn12]: *** STEP 9434701.0 KILLED AT 2017-12-22T22:02:30 WITH SIGNAL 9 ***
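For reference, here is a rough per-node memory estimate based on the numbers above. It is only a sketch: it assumes that each of the two processes on a node allocates the full amount that EPW reports per pool, it treats EPW's "Gb" and the 24 G per node as the same unit, and it does not include whatever extra memory reading the .ephmat files requires, which I do not know.

```python
# Rough per-node memory estimate from the numbers quoted in this post.
# Assumption: every pool (one MPI process here) allocates the full amount
# that EPW reports as "Size of allocated memory per pool".

node_memory = 24.0      # memory per node on the cluster
processes = 36          # from "-n 36"
nodes = 18              # from "-N 18"
per_pool = 8.6031       # reported by EPW just before reading .ephmat

procs_per_node = processes // nodes
used_per_node = procs_per_node * per_pool
headroom = node_memory - used_per_node

print(f"processes per node : {procs_per_node}")
print(f"allocated per node : {used_per_node:.2f}")
print(f"left per node      : {headroom:.2f}")
```

Under these assumptions, roughly 17.2 of the 24 available per node is already allocated before the .ephmat files are read, leaving less than 7 for both processes together. Since the tasks were killed with signal 9 exactly at that step, I suspect the nodes may be running out of memory, but I am not sure.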
Can anyone help me?
Best regards!