EPW job stops without error in output file
Posted: Mon Sep 22, 2025 10:25 pm
Dear EPW community,
I'm using EPW v5.9 with QE 7.4.1. My EPW job stops without any error in the epw.out file. It stops right after finishing "Calculating kgmap" (input and output files are attached).
Then I checked my memory log file, which records the output of 'free -h'.
So it looks like the job uses up all the memory? The free memory suddenly drops from 400G to 7.8G.
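To make the drop easier to spot in a long log, the 'free -h' output can be parsed automatically. The following is a minimal sketch, not part of EPW or QE: the function names `to_gib`, `free_series` and `largest_drop` are hypothetical, and it assumes the log keeps the standard 'free -h' column order (total, used, free, shared, buff/cache, available). The sample lines are the three "Mem:" lines from the memory log.

```python
import re

# Binary size suffixes printed by `free -h`, converted to GiB.
_UNIT_TO_GIB = {
    "B": 1 / 1024**3,
    "Ki": 1 / 1024**2,
    "Mi": 1 / 1024,
    "Gi": 1.0,
    "Ti": 1024.0,
}


def to_gib(field):
    """Convert one `free -h` field such as '7.8Gi' to a float in GiB."""
    match = re.fullmatch(r"([0-9.]+)(B|Ki|Mi|Gi|Ti)", field)
    if match is None:
        raise ValueError(f"unrecognised size field: {field!r}")
    return float(match.group(1)) * _UNIT_TO_GIB[match.group(2)]


def free_series(log_text):
    """Return the 'free' column (in GiB) of every 'Mem:' line, in log order."""
    values = []
    for line in log_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "Mem:":
            # parts: Mem:, total, used, free, shared, buff/cache, available
            values.append(to_gib(parts[3]))
    return values


def largest_drop(series):
    """Return (sample index, drop in GiB) of the biggest fall between samples."""
    if len(series) < 2:
        raise ValueError("need at least two samples")
    drops = [(series[i] - series[i + 1], i + 1) for i in range(len(series) - 1)]
    drop, index = max(drops)
    return index, drop


sample = """\
Mem: 1.0Ti 594Gi 400Gi 696Mi 18Gi 412Gi
Mem: 1.0Ti 995Gi 7.8Gi 696Mi 12Gi 11Gi
Mem: 1.0Ti 128Gi 882Gi 696Mi 1.3Gi 878Gi
"""

series = free_series(sample)
index, drop = largest_drop(series)
print(f"free memory (GiB): {series}")
print(f"largest drop: {drop:.1f} GiB, reached at sample {index}")
```

On these three samples the largest fall is the one from 400 GiB to 7.8 GiB; the jump back to 882 GiB in the last sample is consistent with the job having exited by then.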
The HPC node log file only shows that the process was killed; it doesn't say why. It looks like a memory issue? I've already set 'etf_mem = 2' to reduce the memory use. With a value of 0, the job stops even earlier.
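The node log mentions two different signals: "signal 9" for rank 0 and SIGTERM in the forrtl message. A short sketch using Python's standard `signal` module shows which name belongs to which number on Linux. Signal 9 is SIGKILL, which a process cannot catch and which the Linux out-of-memory killer sends. Whether the OOM killer was actually responsible here is an assumption that only the kernel log or the scheduler's job record could confirm. The helper name `signal_name` is hypothetical.

```python
import signal


def signal_name(number):
    """Return the symbolic name of a signal number on this platform."""
    return signal.Signals(number).name


# Signal 9 is what the node log reports for rank 0;
# signal 15 (SIGTERM) is what the forrtl message reports.
for number in (9, 15):
    print(number, signal_name(number))
```

One possible reading, stated as a guess, is that rank 0 was killed first and the remaining ranks then received SIGTERM when the MPI launcher shut the job down.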
The k-point grid in my nscf step is 12 12 6, because this gives a good match between the Wannier and DFT bands and a smaller spread. The 6 6 3 and 9 9 5 grids don't match as well.
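For a sense of scale, the number of coarse k-points in each grid can be compared directly. This is plain arithmetic; the idea that memory in the Bloch-to-Wannier step grows roughly in proportion to the number of coarse k-points is an assumption, not something confirmed from the EPW source. The name `n_kpoints` is hypothetical.

```python
from math import prod

# Coarse k-point grids mentioned in the post.
grids = {
    "12x12x6": (12, 12, 6),
    "9x9x5": (9, 9, 5),
    "6x6x3": (6, 6, 3),
}


def n_kpoints(grid):
    """Total number of points in a uniform k-point grid."""
    return prod(grid)


reference = n_kpoints(grids["12x12x6"])
for label, grid in grids.items():
    n = n_kpoints(grid)
    # Ratio > 1 means the 12x12x6 grid has that many times more k-points.
    print(f"{label}: {n} k-points (12x12x6 has {reference / n:.2f}x as many)")
```

The 12 12 6 grid has 864 points, against 405 for 9 9 5 and 108 for 6 6 3, so under the linear-scaling assumption it would need about 8 times the memory of the 6 6 3 grid.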
So I'm wondering how to solve this problem. I'm using one big-memory node with 1T of memory and 128 cores in total. I remember that a restart has to use the same number of cores. So if I use 2 nodes, or restart the job on a different machine, I think this will not work?
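The memory available to each MPI rank on this node can be estimated with simple division. This sketch assumes the 1.0 TiB reported by 'free -h' is shared evenly between ranks, which is a simplification; whether EPW's memory use in this step is replicated on every rank or distributed across them is not established here. The name `gib_per_rank` is hypothetical.

```python
TOTAL_GIB = 1024.0  # 1.0 TiB, the total reported by `free -h` on the node


def gib_per_rank(total_gib, ranks):
    """Memory per MPI rank in GiB, assuming an even split across ranks."""
    if ranks <= 0:
        raise ValueError("ranks must be positive")
    return total_gib / ranks


# 128 is the core count of the node; 64 and 32 are illustrative alternatives.
for ranks in (128, 64, 32):
    print(f"{ranks} ranks: {gib_per_rank(TOTAL_GIB, ranks):.0f} GiB per rank")
```

With all 128 cores in use, each rank gets about 8 GiB under this assumption. Running fewer ranks on the same node raises that share, but it would also change the core count between runs, which matters for restarts.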
Best,
Yujia
epw.out excerpt:
Calculating kgmap
Progress kgmap: ########################################
kmaps : 0.15s CPU 0.35s WALL ( 1 calls)
........
===================================================================
irreducible q point # 14
===================================================================
.......
q( 108 ) = ( 0.3333333 -0.5773503 -0.2009170 )
Band disentanglement is used: nbndsub = 60
Use zone-centred Wigner-Seitz cells
Number of WS vectors for electrons 1099
Number of WS vectors for phonons 129
Number of WS vectors for electron-phonon 129
Maximum number of cores for efficient parallelization 2322
Results may improve by using use_ws == .TRUE.
Memory log ('free -h' output):
total used free shared buff/cache available
Mem: 1.0Ti 594Gi 400Gi 696Mi 18Gi 412Gi
Swap: 0B 0B 0B
400
Sat Sep 20 00:26:01 EDT 2025
total used free shared buff/cache available
Mem: 1.0Ti 995Gi 7.8Gi 696Mi 12Gi 11Gi
Swap: 0B 0B 0B
7 8
Sat Sep 20 00:26:11 EDT 2025
total used free shared buff/cache available
Mem: 1.0Ti 128Gi 882Gi 696Mi 1.3Gi 878Gi
Swap: 0B 0B 0B
882
HPC node log:
rank 0 died from signal 9
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpthread-2.31.s 000014CA28C7F910 Unknown Unknown Unknown
epw.x 000000000052809C bloch2wannier_mp_ 973 bloch2wannier.f90
epw.x 000000000046491C wannier_mp_build_ 317 wannier.f90
epw.x 0000000000409ECA MAIN__ 209 epw.f90
epw.x 00000000004086BD Unknown Unknown Unknown
libc-2.31.so 000014CA2243E1FD __libc_start_main Unknown Unknown
epw.x 00000000004085EA Unknown Unknown Unknown