EPW crashes after the Bloch2wane lines

mdogan
Posts: 59
Joined: Thu Jun 18, 2020 5:59 pm
Affiliation: UC Berkeley

EPW crashes after the Bloch2wane lines

Post by mdogan »

Hello,

I'm getting the following error right after the Bloch2wane lines in an EPW calculation:

Code:

     Bloch2wane:         29 /         32
     Bloch2wane:         30 /         32
     Bloch2wane:         31 /         32
     Bloch2wane:         32 /         32

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 12 PID 170291 RUNNING AT c112-073
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================
More details, including the EPW input file and the full error log, are in the attachments folder [https://www.dropbox.com/sh/5osljbb8j8n0 ... _Km-a?dl=0]. I have no idea what the root cause of this crash might be. Any help will be greatly appreciated!

Best,
Mehmet
hlee
Posts: 415
Joined: Thu Aug 03, 2017 12:24 pm
Affiliation: The University of Texas at Austin

Re: EPW crashes after the Bloch2wane lines

Post by hlee »

Dear Mehmet:

Code:

COMPLEX(KIND = DP) :: epmatwp_mem(nbnd, nbnd, nrr_k, nmodes)
I suspect the cause is the large complex-valued array epmatwp_mem declared in the subroutine ephbloch2wanp_mem.
In your case it needs 16 * 24 * 24 * 16391 * (24*3) = 10876280832 bytes, about 11 GB (16 bytes per double-precision complex element, nbnd = 24, nrr_k = 16391, nmodes = 24*3 = 72).
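
The estimate above can be reproduced with a short standalone program. This is only a sketch: the dimensions are the ones quoted in this post, and dp is assumed to be the same double-precision kind as DP in the declaration above.

```fortran
PROGRAM epmatwp_mem_size
  USE, INTRINSIC :: iso_fortran_env, ONLY: int64
  IMPLICIT NONE
  ! dp is ASSUMED to match the DP kind used in the declaration above.
  INTEGER, PARAMETER :: dp = KIND(1.0d0)
  ! Dimensions taken from the estimate in this post.
  INTEGER(int64), PARAMETER :: nbnd = 24_int64
  INTEGER(int64), PARAMETER :: nrr_k = 16391_int64
  INTEGER(int64), PARAMETER :: nmodes = 3_int64 * 24_int64
  COMPLEX(KIND = dp) :: z
  INTEGER(int64) :: bytes_per_element, nbytes

  ! STORAGE_SIZE returns bits; a double-precision complex is normally 128 bits.
  bytes_per_element = STORAGE_SIZE(z, KIND = int64) / 8_int64
  ! 64-bit integers are needed: the product overflows a default 32-bit integer.
  nbytes = bytes_per_element * nbnd * nbnd * nrr_k * nmodes

  WRITE(*, '(A, I0, A)') 'epmatwp_mem needs ', nbytes, ' bytes'
  WRITE(*, '(A, F0.2, A)') 'that is about ', REAL(nbytes, dp) / 1.0e9_dp, ' GB'
END PROGRAM epmatwp_mem_size
```

With 16 bytes per element this gives the same 10876280832 bytes as the hand calculation.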

However, you should first check line 1366 of bloch2wan.f90, which appears in the traceback of your error log:

Code:

 1 0x00000000005c46ee bloch2wan_mp_ephbloch2wanp_mem_()  /home1/02365/mdogan/qe-6.6-v3/EPW/src/bloch2wan.f90:1366
Sincerely,

H. Lee
mdogan

Re: EPW crashes after the Bloch2wane lines

Post by mdogan »

Dear H. Lee,

Thank you for looking into this! Line 1366 of bloch2wan.f90 simply says "len1(:) = zero" which doesn't appear helpful (at least to me). That subroutine appears to be called from line 514 of ephwann_shuffle.f90, which reads:

Code:

    IF (etf_mem > 0) THEN
      CALL ephbloch2wanp_mem(nbndsub, nmodes, xqc, nqc, irvec_k, irvec_g, nrr_k, nrr_g)
    ENDIF
I have etf_mem = 1 which should have more I/O and less memory requirement. I might try to set etf_mem = 0 to avoid this call, but it seems counterintuitive since that would require more memory. What do you think?

Best,
Mehmet
hlee

Re: EPW crashes after the Bloch2wane lines

Post by hlee »

Dear Mehmet:
mdogan wrote: Line 1366 of bloch2wan.f90 simply says "len1(:) = zero" which doesn't appear helpful (at least to me).
If line 1366 of bloch2wan.f90 is the statement "len1(:) = zero", which is the first statement in the subroutine ephbloch2wanp_mem that writes to memory, then my suspicion is right.
The solution depends on the details of the cluster and environment you are using, so you should consult your system administrator.
It also depends on your current memory usage, so the simple fix of "ulimit -s unlimited" might not work.
mdogan wrote: I have etf_mem = 1 which should have more I/O and less memory requirement. I might try to set etf_mem = 0 to avoid this call, but it seems counterintuitive since that would require more memory. What do you think?
I think that etf_mem = 0 (and likewise etf_mem = 2) will not work.
Instead you could try the allocatable array for epmatwp_mem in the subroutine ephbloch2wanp_mem.
If that doesn't work either, you might need further modifications to the code.

Sincerely,

H. Lee
mdogan

Re: EPW crashes after the Bloch2wane lines

Post by mdogan »

Dear H. Lee,

Thank you very much for your advice! It's unfortunate that I might have to consult my system administrator, which is usually a very slow process. I'd like to try a couple of things before submitting a help ticket.

1) Do you think simply increasing the number of nodes might solve this problem? I don't seem to have this problem for a smaller calculation (12-atom cell vs. 24-atom cell) so do you think it would be worth the try?

2) You say "Instead you could try the allocatable array for epmatwp_mem in the subroutine ephbloch2wanp_mem."
Could you explain a bit more how I would do this? How would I need to modify the code (bloch2wan.f90)?

Best,
Mehmet
hlee

Re: EPW crashes after the Bloch2wane lines

Post by hlee »

Dear Mehmet:
mdogan wrote: Do you think simply increasing the number of nodes might solve this problem? I don't seem to have this problem for a smaller calculation (12-atom cell vs. 24-atom cell) so do you think it would be worth the try?
It might work, but it might not. I think that in your case no solution is guaranteed to work, since the outcome depends on several things such as system details, the build environment, and current memory usage.
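
One way to judge whether more nodes could help is a rough per-node estimate. The sketch below rests on an assumption that is not confirmed in this thread: that every MPI rank holds its own full copy of epmatwp_mem. The node memory and ranks-per-node values are hypothetical placeholders, not the settings of the job discussed here.

```fortran
PROGRAM ranks_per_node_estimate
  USE, INTRINSIC :: iso_fortran_env, ONLY: int64
  IMPLICIT NONE
  ! Size of one copy of epmatwp_mem, from the estimate earlier in this thread.
  INTEGER(int64), PARAMETER :: array_bytes = 10876280832_int64
  ! HYPOTHETICAL values: replace with your own cluster and job settings.
  INTEGER(int64), PARAMETER :: node_mem_bytes = 128_int64 * 10_int64**9
  INTEGER(int64), PARAMETER :: ranks_per_node = 16_int64
  INTEGER(int64) :: needed_bytes, max_ranks

  ! ASSUMPTION: each rank on the node holds one full copy of the array.
  needed_bytes = ranks_per_node * array_bytes
  ! Integer division rounds down: copies that fit in one node's memory.
  max_ranks = node_mem_bytes / array_bytes

  WRITE(*, '(A, I0, A)') 'Copies on one node need ', needed_bytes, ' bytes'
  WRITE(*, '(A, I0)') 'Copies that fit in node memory: ', max_ranks
  IF (needed_bytes > node_mem_bytes) THEN
    WRITE(*, '(A)') 'This array alone would exceed the node memory.'
  ENDIF
END PROGRAM ranks_per_node_estimate
```

If that assumption holds, adding nodes while keeping the same number of ranks per node would leave the per-node footprint of this array unchanged, whereas running fewer ranks per node would lower it.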
mdogan wrote: 2) You say "Instead you could try the allocatable array for epmatwp_mem in the subroutine ephbloch2wanp_mem." Could you explain a bit more how I would do this? How would I need to modify the code (bloch2wan.f90)?
You can declare epmatwp_mem as an allocatable array, with matching ALLOCATE and DEALLOCATE statements, as below:

Code:

COMPLEX(KIND = DP), ALLOCATABLE :: epmatwp_mem(:, :, :, :)
instead of

Code:

COMPLEX(KIND = DP) :: epmatwp_mem(nbnd, nbnd, nrr_k, nmodes)
However, I would like to emphasize again that in your case no solution is guaranteed to work.
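
The declaration change needs an explicit ALLOCATE before the array is first used and a DEALLOCATE once it is no longer needed. A minimal standalone sketch of that pattern follows. It is not the actual EPW source: the small dimensions are hypothetical so the program runs anywhere, and dp is assumed to match EPW's DP kind. In the real subroutine the dimensions would come from its arguments.

```fortran
PROGRAM allocatable_sketch
  IMPLICIT NONE
  ! dp is ASSUMED to match the DP kind used in EPW.
  INTEGER, PARAMETER :: dp = KIND(1.0d0)
  ! HYPOTHETICAL small dimensions so that the sketch runs anywhere.
  INTEGER, PARAMETER :: nbnd = 4, nrr_k = 10, nmodes = 6
  ! Allocatable declaration replacing the fixed-shape one.
  COMPLEX(KIND = dp), ALLOCATABLE :: epmatwp_mem(:, :, :, :)
  INTEGER :: ierr

  ! Allocate before first use and check the returned status.
  ALLOCATE(epmatwp_mem(nbnd, nbnd, nrr_k, nmodes), STAT = ierr)
  IF (ierr /= 0) THEN
    WRITE(*, '(A, I0)') 'Error allocating epmatwp_mem, STAT = ', ierr
    STOP 1
  ENDIF

  ! Use the array exactly as before, here simply zeroing it.
  epmatwp_mem(:, :, :, :) = (0.0_dp, 0.0_dp)
  WRITE(*, '(A, I0, A)') 'Allocated ', SIZE(epmatwp_mem), ' elements'

  ! Release the memory once the array is no longer needed.
  DEALLOCATE(epmatwp_mem, STAT = ierr)
  IF (ierr /= 0) THEN
    WRITE(*, '(A, I0)') 'Error deallocating epmatwp_mem, STAT = ', ierr
    STOP 1
  ENDIF
END PROGRAM allocatable_sketch
```

Depending on the compiler and its flags, a fixed-shape local array of this kind may be placed on the stack, while an ALLOCATABLE array is normally taken from the heap. The change therefore affects where the roughly 11 GB is stored, not how much is needed, and on systems that overcommit memory the ALLOCATE may succeed while the process is still killed later.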

Sincerely,

H. Lee
mdogan

Re: EPW crashes after the Bloch2wane lines

Post by mdogan »

Dear H. Lee,

Thank you very much! I will try these things as soon as I am able to run calculations again on Frontera, which unfortunately is reserved for a week for large-scale jobs (1000s of nodes). I will let you know what happens. Thank you!

Best,
Mehmet