EPW job stops without error in output file

Posted: Mon Sep 22, 2025 10:25 pm
by tyj
Dear EPW community,
I'm using EPW v5.9 with QE 7.4.1. My EPW job stops without any error in the epw.out file. It stops right after it finishes calculating kgmap (input and output files are attached).

Code:

 Calculating kgmap

     Progress kgmap: ########################################
     kmaps        :      0.15s CPU      0.35s WALL (       1 calls)
  
     ........
      ===================================================================
     irreducible q point #   14
     ===================================================================
      .......
     q(  108 ) = (   0.3333333  -0.5773503  -0.2009170 )

     Band disentanglement is used: nbndsub =   60
     Use zone-centred Wigner-Seitz cells
     Number of WS vectors for electrons     1099
     Number of WS vectors for phonons      129
     Number of WS vectors for electron-phonon      129
     Maximum number of cores for efficient parallelization     2322
     Results may improve by using use_ws == .TRUE.

     

Then I checked my memory log file, which records the output of 'free -h':

Code:

               total        used        free      shared  buff/cache   available
Mem:           1.0Ti       594Gi       400Gi       696Mi        18Gi       412Gi
Swap:             0B          0B          0B
400
Sat Sep 20 00:26:01 EDT 2025
               total        used        free      shared  buff/cache   available
Mem:           1.0Ti       995Gi       7.8Gi       696Mi        12Gi        11Gi
Swap:             0B          0B          0B
7 8
Sat Sep 20 00:26:11 EDT 2025
               total        used        free      shared  buff/cache   available
Mem:           1.0Ti       128Gi       882Gi       696Mi       1.3Gi       878Gi
Swap:             0B          0B          0B
882
So it looks like the job stops because all the memory is used up? The free memory suddenly drops from 400Gi to 7.8Gi, and it is back to 882Gi ten seconds later.
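
To make a drop like this easier to spot in a long log, the 'free -h' samples can be scanned automatically. The following is only a minimal sketch: the function names (`to_gib`, `mem_samples`, `largest_drop`) are made up for illustration, and it assumes every sample contains a `Mem:` line in the same column order as the output above.

```python
import re

# Binary units printed by `free -h`, converted to GiB.
_UNIT_TO_GIB = {
    "B": 1 / 1024**3,
    "Ki": 1 / 1024**2,
    "Mi": 1 / 1024,
    "Gi": 1.0,
    "Ti": 1024.0,
}


def to_gib(field):
    """Convert a field such as '7.8Gi' or '1.0Ti' to a float in GiB."""
    match = re.fullmatch(r"([0-9.]+)(B|Ki|Mi|Gi|Ti)", field)
    if match is None:
        raise ValueError(f"unrecognised size field: {field!r}")
    return float(match.group(1)) * _UNIT_TO_GIB[match.group(2)]


def mem_samples(log_text):
    """Return one dict per 'Mem:' line, with sizes in GiB."""
    samples = []
    for line in log_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "Mem:":
            # Column order: total used free shared buff/cache available
            values = [to_gib(p) for p in parts[1:7]]
            samples.append(
                {"used": values[1], "free": values[2], "available": values[5]}
            )
    return samples


def largest_drop(samples, key="free"):
    """Return (drop_in_GiB, index) for the biggest fall between consecutive samples."""
    drops = [
        (before[key] - after[key], i)
        for i, (before, after) in enumerate(zip(samples, samples[1:]))
    ]
    return max(drops) if drops else None


# The three 'Mem:' lines from the log above.
LOG = """\
Mem:           1.0Ti       594Gi       400Gi       696Mi        18Gi       412Gi
Mem:           1.0Ti       995Gi       7.8Gi       696Mi        12Gi        11Gi
Mem:           1.0Ti       128Gi       882Gi       696Mi       1.3Gi       878Gi
"""

if __name__ == "__main__":
    drop, index = largest_drop(mem_samples(LOG))
    print(f"largest drop in free memory: {drop:.1f} GiB after sample {index}")
```

Passing `key="available"` instead of the default looks at the `available` column, which is usually the more relevant number because it accounts for reclaimable cache.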

In the HPC node log file, it has the following lines:

Code:

rank 0 died from signal 9
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
libpthread-2.31.s  000014CA28C7F910  Unknown               Unknown  Unknown
epw.x              000000000052809C  bloch2wannier_mp_         973  bloch2wannier.f90
epw.x              000000000046491C  wannier_mp_build_         317  wannier.f90
epw.x              0000000000409ECA  MAIN__                    209  epw.f90
epw.x              00000000004086BD  Unknown               Unknown  Unknown
libc-2.31.so       000014CA2243E1FD  __libc_start_main     Unknown  Unknown
epw.x              00000000004085EA  Unknown               Unknown  Unknown
It just shows that the process was killed, but doesn't say why. It looks like a memory issue? I've already set 'etf_mem = 2' to reduce the memory usage. If the value is 0, the job stops even earlier.
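
For a rough sense of scale, here is a back-of-the-envelope sketch. It assumes that the dominant array is a Wannier-representation electron-phonon matrix of shape (nbndsub, nbndsub, nrr_k, nmodes, nrr_g) stored as double-precision complex numbers (16 bytes each). That shape is an assumption and is not taken from the output above, and the helper name `epmat_gib` is hypothetical. The number of phonon modes (3 times the number of atoms) is not printed in the excerpt, so the estimate is given per mode.

```python
def epmat_gib(nbndsub, nrr_k, nrr_g, nmodes, bytes_per_element=16):
    """Estimated size in GiB of an array of shape
    (nbndsub, nbndsub, nrr_k, nmodes, nrr_g) with complex(8) elements.

    The array shape is an assumption made for this estimate only.
    """
    n_elements = nbndsub * nbndsub * nrr_k * nmodes * nrr_g
    return n_elements * bytes_per_element / 1024**3


# Values printed in epw.out above: nbndsub = 60,
# 1099 WS vectors for electrons, 129 for electron-phonon.
per_mode = epmat_gib(nbndsub=60, nrr_k=1099, nrr_g=129, nmodes=1)

if __name__ == "__main__":
    print(f"estimated size per phonon mode: {per_mode:.2f} GiB")
```

With these numbers the estimate is about 7.6 GiB per phonon mode, to be multiplied by the actual number of modes, and possibly again by the number of MPI processes if each process holds its own copy (also an assumption).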

The k-point grid of my nscf step is 12 12 6, because this gives Wannier-interpolated bands that match the DFT bands, with a smaller spread. The 6 6 3 and 9 9 5 grids don't match as well.

So I'm wondering how to solve this problem. I'm using one big-memory node with 1 TB of memory and 128 cores in total. I remember that a restart has to use the same number of cores. So if I use 2 nodes, or restart the job on a different machine, I think this will not work?

Best,
Yujia