
EPW calculation is killed at dvscf calculation

Posted: Fri Apr 24, 2020 2:06 pm
by Gautam Sharma Monty
Dear Sir,
I am running EPW calculations with QE-6.4 using the PBE functional with spin-orbit coupling (SOC). The phonon calculation has 5 irreducible q-points (25 q-points in total). The code stops at an arbitrary irreducible q-point while computing dvscf for the q-points in the star. I have resubmitted the calculation many times with different core counts (56, 84, and 112 cores), but it stops every time and I cannot understand why. Nothing is printed about the memory required for these calculations. Can you please help me out?


The following is the first output, with 56 cores (2 nodes, 28 cores/node, 128 GB memory/node):
===================================================================
irreducible q point # 2
===================================================================

Symmetries of small group of q: 1

Number of q in the star = 6
List of q in the star:
1 0.000000000 0.230940108 0.000000000
2 -0.200000000 -0.115470054 0.000000000
3 0.200000000 -0.115470054 0.000000000
4 0.000000000 -0.230940108 0.000000000
5 0.200000000 0.115470054 0.000000000
6 -0.200000000 0.115470054 0.000000000
Dyn mat calculated from ifcs

q( 2 ) = ( 0.0000000 0.2309401 0.0000000 )
q( 3 ) = ( -0.2000000 -0.1154701 0.0000000 )
q( 4 ) = ( 0.2000000 -0.1154701 0.0000000 )
q( 5 ) = ( 0.0000000 -0.2309401 0.0000000 )


After this, the job is killed automatically.


The following is the second output, with 112 cores (4 nodes, 28 cores/node, 128 GB memory/node):

===================================================================
irreducible q point # 5
===================================================================

Symmetries of small group of q: 1

Number of q in the star = 6
List of q in the star:
1 0.200000000 0.577350269 0.000000000
2 -0.600000000 -0.115470054 0.000000000
3 0.400000000 -0.461880215 0.000000000
4 -0.200000000 -0.577350269 0.000000000
5 0.600000000 0.115470054 0.000000000
6 -0.400000000 0.461880215 0.000000000
Dyn mat calculated from ifcs

q( 20 ) = ( 0.2000000 0.5773503 0.0000000 )


After this, the job is killed automatically.


The following is the error file:

Intel(R) Parallel Studio XE 2017 Update 4 for Linux*
Copyright (C) 2009-2017 Intel Corporation. All rights reserved.
=>> PBS: job killed: node 1 (cn02) requested job die, code 15009
[mpiexec@cn01] control_cb (../../pm/pmiserv/pmiserv_cb.c:781): connection to proxy 1 at host cn02 failed
[mpiexec@cn01] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@cn01] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:500): error waiting for event
[mpiexec@cn01] main (../../ui/mpich/mpiexec.c:1130): process manager error waiting for completion

Re: EPW calculation is killed at dvscf calculation

Posted: Fri Apr 24, 2020 10:37 pm
by hlee
Dear Gautam Sharma Monty:

I don't know much about the EPW included in QE v6.4.
Given the information provided, I can't give you a clear answer; I would just suggest trying the most recent (development) version of EPW at https://gitlab.com/QEF.

Also if you suspect the memory issue, you could try to use 4 nodes and 14 cores/node (total of 56 cores).
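
To illustrate the suggestion above: with the same total of 56 MPI processes, halving the number of processes per node doubles the memory each process can use. The short Python sketch below only does this arithmetic for the node size quoted in this thread (128 GB/node). It assumes a node's memory is shared evenly among its processes, which is a simplification; the function name is made up for this example.

```python
def mem_per_rank_gb(node_mem_gb, ranks_per_node):
    """Memory available to each MPI process if a node's RAM is shared evenly."""
    if ranks_per_node <= 0:
        raise ValueError("ranks_per_node must be positive")
    return node_mem_gb / ranks_per_node


# Layouts discussed in this thread, both on 128 GB nodes:
# the original run (2 nodes x 28) and the suggested one (4 nodes x 14).
for nodes, ranks_per_node in [(2, 28), (4, 14)]:
    total_ranks = nodes * ranks_per_node
    per_rank = mem_per_rank_gb(128, ranks_per_node)
    print(f"{nodes} nodes x {ranks_per_node} processes/node = "
          f"{total_ranks} processes, {per_rank:.2f} GB per process")
```

Under this assumption each process gets about 4.57 GB in the 2 x 28 layout and about 9.14 GB in the 4 x 14 layout.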

However, before any trials, you should make sure that you merged files (dvscf, etc) correctly if your phonon calculations were split (http://epw.phpbbhosts.co.uk/viewtopic.php?f=3&t=1270).
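
A quick sanity check of the collected files can be sketched in Python as below. This is only an illustration, not part of EPW: it assumes the files sit in one directory and are named `prefix.dvscf_q#` and `prefix.dyn_q#.xml` (the layout I believe EPW's pp.py script produces for SOC runs; adjust if yours differs), and it assumes that dvscf files for all irreducible q-points should have the same size, so an odd size hints at an incomplete or badly merged file. The function name, the directory `save`, and the prefix `pb` are hypothetical.

```python
from pathlib import Path


def check_save_dir(save_dir, prefix, nqirr, xml_dyn=True):
    """Return (missing file names, q indices whose dvscf size looks odd)."""
    save = Path(save_dir)
    dyn_suffix = ".xml" if xml_dyn else ""  # assumed naming for SOC runs
    missing, sizes = [], {}
    for iq in range(1, nqirr + 1):
        dvscf = save / f"{prefix}.dvscf_q{iq}"
        dyn = save / f"{prefix}.dyn_q{iq}{dyn_suffix}"
        for path in (dvscf, dyn):
            if not path.is_file():
                missing.append(path.name)
        if dvscf.is_file():
            sizes[iq] = dvscf.stat().st_size
    # Flag q-points whose dvscf size differs from the most common size.
    odd = []
    if sizes:
        values = list(sizes.values())
        reference = max(set(values), key=values.count)
        odd = [iq for iq, size in sizes.items() if size != reference]
    return missing, odd


if __name__ == "__main__":
    # Hypothetical directory and prefix; 5 irreducible q-points as in this thread.
    missing, odd = check_save_dir("save", "pb", 5)
    print("missing files:", missing)
    print("q-points with unusual dvscf size:", odd)
```

An empty result for both lists does not prove the merge was correct, but a non-empty one is a strong sign that something went wrong before EPW was started.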

Sincerely,

H. Lee

Re: EPW calculation is killed at dvscf calculation

Posted: Sun Apr 26, 2020 4:47 am
by Gautam Sharma Monty
Dear Sir,

I have collected the dvscf files correctly, and I have recomputed the phonon at q=5 so that it completes in a single run. So there should be no issue related to this.
Apart from this, I should point out a glitch between EPW and QE that is particular to SOC cases.
When we compute phonons with QE>6.2, it no longer generates dyn#.xml files, but EPW demands these files in xml format. So one has to go back to QE-6.2 and obtain the dyn files in xml format after recovering the phonons. I think EPW should also accept the normal format in SOC cases. In other words, ifc.q2r.xml should be abandoned, and ifc.q2r, which is generated using QE>6.2, should work instead.

Re: EPW calculation is killed at dvscf calculation

Posted: Mon Apr 27, 2020 4:53 pm
by hlee
Dear Gautam Sharma Monty:

Regarding the xml issue:
Did you try the EPW included in QE v6.5 or the most recent version of EPW? I didn't implement it, but I think this issue is addressed by a call to the subroutine check_is_xml_file.
Please check the subroutines read_ifc in io_epw.f90 and dynmat_asr in dynmat_asr.f90.

Sincerely,

H. Lee

Re: EPW calculation is killed at dvscf calculation

Posted: Mon May 04, 2020 5:30 am
by Gautam Sharma Monty
Thank you, Sir. I will check out the latest version.

Re: EPW calculation is killed at dvscf calculation

Posted: Tue May 05, 2020 1:29 pm
by Gautam Sharma Monty
Dear Sir,
I am trying EPW calculations with QE-6.5, using the phonon dvscf and dyn#.xml files computed with QE-6.4, but the calculation is getting killed every time. Could it be due to directory of phonon-dvscf files computed using qe-6.4. However, EPW runs with QE-6.4. Do I need to compute the phonons again with QE-6.5?

Re: EPW calculation is killed at dvscf calculation

Posted: Tue May 05, 2020 3:35 pm
by hlee
Dear Gautam Sharma Monty:

>Could it be due to directory of phonon-dvscf files computed using qe-6.4.

I didn't check it, but you can check it yourself by performing test calculations for a simple system, for example, Pb with spin-orbit coupling.

>Do I need to compute the phonons again with QE-6.5?

I still think that the main trouble might come from the step of merging the dvscf files, etc. Although you said there is no problem, I can't confirm it.
In your case, it is not easy for me to give you a clear answer.

Sincerely,

H. Lee

Re: EPW calculation is killed at dvscf calculation

Posted: Sat May 09, 2020 7:07 pm
by Gautam Sharma Monty
Dear Sir,

It was a memory issue, which was resolved by the suggestion in your reply: "Also if you suspect the memory issue, you could try to use 4 nodes and 14 cores/node (total of 56 cores)."
I am grateful for this.