EPW calculation is killed at dvscf calculation
Posted: Fri Apr 24, 2020 2:06 pm
Dear Sir,
I am running EPW calculations with QE-6.4 using the PBE functional with spin-orbit coupling (SOC). The phonon calculation has 5 irreducible q-points (25 q-points in total). The code stops at an arbitrary irreducible q-point while computing dvscf for the q-points in the star. I have resubmitted the calculation many times with different core counts (56, 84, and 112 cores), but it stops every time and I cannot understand the reason. Nothing is printed about the memory required for these calculations. Could you please help me?
The following is the output of the first run with 56 cores (2 nodes, 28 cores/node, 128 GB memory/node):
===================================================================
irreducible q point # 2
===================================================================
Symmetries of small group of q: 1
Number of q in the star = 6
List of q in the star:
1 0.000000000 0.230940108 0.000000000
2 -0.200000000 -0.115470054 0.000000000
3 0.200000000 -0.115470054 0.000000000
4 0.000000000 -0.230940108 0.000000000
5 0.200000000 0.115470054 0.000000000
6 -0.200000000 0.115470054 0.000000000
Dyn mat calculated from ifcs
q( 2 ) = ( 0.0000000 0.2309401 0.0000000 )
q( 3 ) = ( -0.2000000 -0.1154701 0.0000000 )
q( 4 ) = ( 0.2000000 -0.1154701 0.0000000 )
q( 5 ) = ( 0.0000000 -0.2309401 0.0000000 )
After this, the job is killed automatically.
The following is the output of the second run with 112 cores (4 nodes, 28 cores/node, 128 GB memory/node):
===================================================================
irreducible q point # 5
===================================================================
Symmetries of small group of q: 1
Number of q in the star = 6
List of q in the star:
1 0.200000000 0.577350269 0.000000000
2 -0.600000000 -0.115470054 0.000000000
3 0.400000000 -0.461880215 0.000000000
4 -0.200000000 -0.577350269 0.000000000
5 0.600000000 0.115470054 0.000000000
6 -0.400000000 0.461880215 0.000000000
Dyn mat calculated from ifcs
q( 20 ) = ( 0.2000000 0.5773503 0.0000000 )
After this, the job is killed automatically.
The following is the error file:
Intel(R) Parallel Studio XE 2017 Update 4 for Linux*
Copyright (C) 2009-2017 Intel Corporation. All rights reserved.
=>> PBS: job killed: node 1 (cn02) requested job die, code 15009
[mpiexec@cn01] control_cb (../../pm/pmiserv/pmiserv_cb.c:781): connection to proxy 1 at host cn02 failed
[mpiexec@cn01] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@cn01] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:500): error waiting for event
[mpiexec@cn01] main (../../ui/mpich/mpiexec.c:1130): process manager error waiting for completion
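To compare where each run stops, the last q-point reported in an output file can be pulled out with a small script. The following is only a sketch: the function name `last_q_point` and the regular expression are illustrative and are not part of QE or EPW, and it assumes the q-point lines have the same form as in the outputs above.

```python
import re

# Matches lines such as "q( 20 ) = ( 0.2000000 0.5773503 0.0000000 )".
# The pattern is an assumption based on the output excerpts in this post.
Q_LINE = re.compile(
    r"q\(\s*(\d+)\s*\)\s*=\s*\(\s*([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)\s*\)"
)


def last_q_point(text):
    """Return (index, (qx, qy, qz)) for the last q-point line in text, or None."""
    last = None
    for line in text.splitlines():
        match = Q_LINE.search(line)
        if match:
            index = int(match.group(1))
            coords = tuple(float(value) for value in match.groups()[1:])
            last = (index, coords)
    return last


if __name__ == "__main__":
    # Excerpt copied from the 56-core run shown above.
    sample = """Dyn mat calculated from ifcs
q( 2 ) = ( 0.0000000 0.2309401 0.0000000 )
q( 3 ) = ( -0.2000000 -0.1154701 0.0000000 )
q( 4 ) = ( 0.2000000 -0.1154701 0.0000000 )
q( 5 ) = ( 0.0000000 -0.2309401 0.0000000 )
"""
    print(last_q_point(sample))
```

Running it on the output of each job shows which q-point index was the last one written before the job was killed.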