EPW stops without any notification

Post by hellolori »

Dear all,

I recently used the EPW package on a cluster to solve the anisotropic Eliashberg equations for a system containing 27 atoms and 153 electrons in total. Each node of the cluster has 24 GB of memory.

For the self-consistent calculation of the charge density, a 6*6*1 k-mesh was used (36 k-points in the irreducible BZ), and a 2*2*1 q-mesh (4 q-points in the IBZ) was adopted to determine the dynamical matrices.

I have obtained the *.ephmat, *.freq, *.egnv, and *.ikmap files in advance, which EPW needs to solve the equations. I then tried to continue the calculation with:

mpiexec -N 18 -n 36 epw.x  -npool 36  < in.epw > out.epw

This runs 36 processes on 18 nodes, i.e. two processes per node. Each node has 24 GB of memory.

However, the run always fails without any error message from EPW, which puzzles me. Below are part of my input file and the last part of the EPW output.
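
As a rough check of the layout described above, the memory share of each process can be worked out directly. This is only a sketch: the function name is hypothetical, and it assumes that the full 24 GB of a node is available to the job and is shared evenly among the processes on that node, which ignores the operating system and any MPI overhead.

```python
def memory_share_per_process_gb(node_memory_gb, n_processes, n_nodes):
    """Even share of node memory per MPI process (hypothetical helper)."""
    if n_processes % n_nodes != 0:
        raise ValueError("processes are not evenly distributed over nodes")
    processes_per_node = n_processes // n_nodes
    return node_memory_gb / processes_per_node


# Layout from the post: 36 processes on 18 nodes with 24 GB each.
share = memory_share_per_process_gb(24.0, 36, 18)
print(f"{share:.1f} GB per process")
```

Under these assumptions each process can use at most about 12 GB, which is the same value as max_memlt in the input, so nothing is left over on the node.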

Input of EPW (part):

restart_freq = 100
iverbosity = 1

ep_coupling = .true.
elph        = .false.

kmaps       = .true.
epbwrite    = .false.
epbread     = .true.
system_2d=.true.
epwwrite = .false.
epwread  = .true.

max_memlt=12.0d0
nqstep       = 500
eliashberg  = .true.
limag = .true.
lpade = .true.

nk1         = 6
nk2         = 6
nk3         = 1

nq1         = 2
nq2         = 2
nq3         = 1

nkf1 = 50
nkf2 = 50
nkf3 = 1

nqf1 = 50
nqf2 = 50
nqf3 = 1


The last part of the EPW output is as follows:

     ===================================================================
     Solve anisotropic Eliashberg equations
     ===================================================================


     Finish reading .freq file
     ....
     Nr k-points within the Fermi shell =      2500 out of      2500
     9 bands within the Fermi window

     Finish reading .egnv file

     Max nr of q-points =      2500

     Finish reading .ikmap files

     Size of allocated memory per pool : ~=    8.6031 Gb

     Start reading .ephmat files


After that, EPW stops, and the standard output of the job reads:

yhrun: error: cn14: task 5: Killed
yhrun: First task exited 60s ago
yhrun: tasks 0-4,6-15,17-32,34-35: running
yhrun: tasks 5,16,33: exited abnormally
yhrun: Terminating job step 9434701.0
slurmd[cn12]: *** STEP 9434701.0 KILLED AT 2017-12-22T22:02:30 WITH SIGNAL 9 ***
yhrun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[cn12]: *** STEP 9434701.0 KILLED AT 2017-12-22T22:02:30 WITH SIGNAL 9 ***


Can anyone help me?

Best regards!
sponce
Site Admin
Posts: 616
Joined: Wed Jan 13, 2016 7:25 pm
Affiliation: EPFL

Re: EPW stops without any notification

Post by sponce »

Hello,

It does indeed look like a memory problem.

Could you try
mpiexec -N 36 -n 36 epw.x -npool 36 < in.epw > out.epw

Also, are you sure that there is only 1 process per node? Can you log in to the node at runtime to verify this and also check the real amount of memory used?

Usually you can log in to HPC nodes with ssh NODE_NAME.
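
Once logged in to a node, the memory of each running process can also be read from the Linux /proc filesystem. The sketch below is an illustration and not part of EPW: it assumes a Linux node and that the executable appears under the name epw.x, and the helper names are hypothetical.

```python
import os


def parse_status_memory_kb(status_text):
    """Extract VmRSS (current) and VmHWM (peak) resident memory in kB
    from the text of a Linux /proc/<pid>/status file."""
    memory = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("VmRSS", "VmHWM"):
            memory[key] = int(value.split()[0])  # value looks like "  123456 kB"
    return memory


def report_processes(name="epw.x"):
    """Print resident memory of every process whose command name matches."""
    if not os.path.isdir("/proc"):
        print("no /proc filesystem on this machine")
        return
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as fh:
                if fh.read().strip() != name:
                    continue
            with open(f"/proc/{entry}/status") as fh:
                memory = parse_status_memory_kb(fh.read())
        except OSError:
            continue  # process ended or is not readable
        rss_gib = memory.get("VmRSS", 0) / 1024**2
        peak_gib = memory.get("VmHWM", 0) / 1024**2
        print(f"pid {entry}: RSS {rss_gib:.2f} GiB, peak {peak_gib:.2f} GiB")


if __name__ == "__main__":
    report_processes()
```

Running it a few times while the job is reading its input files shows how many epw.x processes really sit on the node and how close each one gets to its share of the memory.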

Best,
Samuel
Prof. Samuel Poncé
Chercheur qualifié F.R.S.-FNRS / Professeur UCLouvain
Institute of Condensed Matter and Nanosciences
UCLouvain, Belgium
Web: https://www.samuelponce.com
Re: EPW stops without any notification

Post by hellolori »

Dear Samuel,

Thank you very much for your useful suggestions!
I still have a question. Can we estimate the maximum memory per process needed to solve the anisotropic Eliashberg equations from the second-to-last line of the output, "Size of allocated memory per pool : ~= 8.6031 Gb", which is printed just before the .ephmat files are read (see below)? Each compute node has only 24 GB of memory, and I am afraid that this is not enough for the calculation.

Best,
Feipeng

     ===================================================================
     Solve anisotropic Eliashberg equations
     ===================================================================


     Finish reading .freq file
     ....
     Nr k-points within the Fermi shell =      2500 out of      2500
     9 bands within the Fermi window

     Finish reading .egnv file

     Max nr of q-points =      2500

     Finish reading .ikmap files

     Size of allocated memory per pool : ~=    8.6031 Gb

     Start reading .ephmat files
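
The figure in that output line can be pulled out automatically and compared with a memory budget. This is a sketch only: the regular expression is written against the single line shown above and may not match other EPW versions, the function name is hypothetical, and the 12 GB budget is an assumed even share of a 24 GB node between two processes, not a number reported by EPW. The line is printed before the .ephmat files are read, so it may not include memory allocated afterwards.

```python
import re

# Pattern written against the line shown in the output above.
ALLOCATED_RE = re.compile(
    r"Size of allocated memory per pool\s*:\s*~=\s*([0-9]*\.?[0-9]+)\s*Gb"
)


def allocated_memory_gb(output_text):
    """Return the last 'allocated memory per pool' value in Gb, or None."""
    matches = ALLOCATED_RE.findall(output_text)
    return float(matches[-1]) if matches else None


sample = "     Size of allocated memory per pool : ~=    8.6031 Gb\n"
allocated = allocated_memory_gb(sample)
budget_gb = 12.0  # assumed: 24 GB node shared by two processes
print(f"allocated {allocated} Gb, headroom {budget_gb - allocated:.4f} Gb")
```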

Re: EPW stops without any notification

Post by sponce »

Dear Feipeng,

Take a look at the newly introduced subroutine "system_mem_usage.f90".

This should accurately tell you how much memory is used. However, it is only called in some parts of the code (not in the Eliashberg part at the moment). You can therefore add a call to the subroutine there.

The other option is to log in interactively to the node and run "top" to see the memory usage.

Best,
Samuel