EPW crashes while reading ephmat files

Moderator: stiwari

mdogan
Posts: 59
Joined: Thu Jun 18, 2020 5:59 pm
Affiliation: UC Berkeley

Post by mdogan »

Hello,

I have a fairly large system with 24 hydrogen atoms. The q-grid interpolation runs successfully on 128 processors across 32 nodes, but when the run reaches the calculation of the Eliashberg spectral function, it crashes right after the line that says "Start reading .ephmat files". The error file reads:

forrtl: severe (67): input statement requires too much data, unit 110, file /scratch1/02365/mdogan/H.C2c24.P500.novdW.qe6.2/./scf.ephmat/ephmat2

I tried increasing the number of nodes to 64 (while keeping the number of processors at 128), but the error remains. Each of these ephmat files is about 430 MB and each node has 192 GB of memory, so I don't see how memory could be insufficient.

The error file is attached (as well as the other relevant files) at [https://www.dropbox.com/sh/zwjvc2hq600k ... 7bYxa?dl=0]. According to the error file, the problem originates in line 1289 of io_eliashberg.f90, which sits within many nested loops, so I can't distinguish what might be causing the issue. Any help will be greatly appreciated!

Best,
Mehmet
hpaudya1
Posts: 194
Joined: Tue Mar 21, 2017 7:11 pm
Affiliation:

Post by hpaudya1 »

Hi Mehmet,

This error can also occur when the "prefix.ephmat/ephmatXX" files were not written correctly in the previous run. I recommend rerunning the calculation so that the ephmatXX files are written again. Trying smaller meshes first would also help you gauge the memory requirement for such a large system.

Best,
Hari Paudyal

Post by mdogan »

Dear Hari,

Thank you very much for your suggestions! I know that the calculation was able to run when the fine q-grid was 8*8*4 instead of 16*16*8. Unfortunately, I didn't save the ephmat files from that run, so I don't know what their sizes were. I have run a related calculation, though, with a 12-atom cell, and it worked fine using 16 nodes and 64 processors. The ephmat files for that calculation are all 382 MB to 403 MB in size. The calculation that fails (24-atom cell) has ephmat files between 405 MB and 432 MB. If my understanding is correct, each processor handles one of these files. So, when I tried to use 64 nodes and 128 processors in the 24-atom calculation, the memory requirement per node was smaller than in the 12-atom calculation (2 proc/node vs. 4 proc/node). So I don't understand why this calculation fails while the other one doesn't. Does my analysis make sense?
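The per-node estimate in this post can be written out as a short calculation. This is only a rough sketch: it takes the poster's own assumption that each process holds one whole ephmat file in memory, and it assumes both runs used nodes with the 192 GB quoted earlier in the thread. The function name is made up for illustration.

```python
# Rough upper bound on ephmat data held per node.
# Assumption (the poster's own): each MPI process keeps one whole file in memory.

def ephmat_mb_per_node(processes, nodes, largest_file_mb):
    """Return the ephmat data per node in MB, using the largest file size."""
    procs_per_node = processes / nodes
    return procs_per_node * largest_file_mb


NODE_MEMORY_MB = 192 * 1000  # 192 GB per node, as quoted in the thread

runs = {
    "12-atom cell (works)": ephmat_mb_per_node(64, 16, 403),
    "24-atom cell (fails)": ephmat_mb_per_node(128, 64, 432),
}

for label, mb in runs.items():
    share = 100 * mb / NODE_MEMORY_MB
    print(f"{label}: {mb:.0f} MB per node ({share:.2f}% of node memory)")
```

Under these assumptions the failing run needs about 0.9 GB per node and the working run about 1.6 GB, both far below the node memory, which is consistent with the suspicion that memory is not the cause.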

I could try repeating the calculation, but because it's a very expensive calculation (32 nodes * ~1 week), I'd rather avoid that if possible, unless we cannot find any other plausible reason for the error than an ephmat file being incorrectly written. Let me know what you think. Thank you!

Best,
Mehmet
hlee
Posts: 415
Joined: Thu Aug 03, 2017 12:24 pm
Affiliation: The University of Texas at Austin

Post by hlee »

Dear Mehmet:

In general, this error is not related to the memory footprint. It can occur even with a small memory footprint, for example when a record actually contains 16 bytes of data but the READ statement tries to read 32 bytes from it.
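The record-length mismatch described here can be imitated outside Fortran. The sketch below is not EPW code. It assumes the common layout of a Fortran unformatted sequential file, in which each record is framed by a 4-byte little-endian length marker before and after the data (the usual default for ifort and gfortran on x86), and the helper names are hypothetical.

```python
import io
import struct


def write_record(stream, payload):
    """Write one record: length marker, data, length marker (assumed layout)."""
    marker = struct.pack("<i", len(payload))
    stream.write(marker + payload + marker)


def read_record(stream, nbytes_wanted):
    """Read one record, failing if more data is requested than it holds."""
    (length,) = struct.unpack("<i", stream.read(4))
    if nbytes_wanted > length:
        raise ValueError(
            f"input requires too much data: record holds {length} bytes, "
            f"{nbytes_wanted} requested"
        )
    payload = stream.read(length)
    stream.read(4)  # skip the trailing marker
    return payload[:nbytes_wanted]


buf = io.BytesIO()
write_record(buf, struct.pack("<2d", 1.0, 2.0))  # a 16-byte record

buf.seek(0)
print(struct.unpack("<2d", read_record(buf, 16)))  # matching request succeeds

buf.seek(0)
try:
    read_record(buf, 32)  # asks for 32 bytes from a 16-byte record
except ValueError as err:
    print("read failed:", err)
```

The failure depends only on the length stored in the record, not on how much memory is available, which is the point being made above.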

As Hari mentioned, a corrupted file is one possible cause, but other factors could also lead to this error.

I think the current file I/O mechanism in the superconducting module, which performs one I/O operation per element, is problematic in some cases and should be modified.

Sincerely,

H. Lee

Post by mdogan »

Dear H. Lee,

Thank you for weighing in! I have a few follow-up questions:

1) Is there another way of fixing this error short of fully repeating the calculation? For instance, would it be possible to regenerate the ephmat files from other files on disk, without running the full q-grid interpolation again?

2) Is such an I/O error common? If I should expect it to occur only rarely, I can just repeat the calculation and hope it doesn't recur. But I need to run this calculation 6 or 8 times on related systems, and each q-grid interpolation takes about a week on 32 nodes (2–3 weeks in total, depending on queue times), so if the error is not that rare, taking the gamble doesn't make much sense.

3) Relatedly, is there a way of decreasing the probability of this error happening? You say "there might be other factors"; what could they be? Would having fewer ephmat files help? I could try to run with 64 processors instead of 128 (I have a feeling I tried it before and the memory per processor wasn't enough so it failed). Or I could try to recompile the code in a different way. Do you happen to run EPW on TACC Frontera? If there is a compiling script available for that machine among the developer team, I could easily try recompiling the code.

More generally, what would you do if you were in my place? Essentially, these 6 or 8 very large calculations are the main part of this project, so I need to be able to run them one way or another. Please let me know what you suggest. Thank you very much!

Best,
Mehmet

Post by hlee »

Dear Mehmet:

First of all, I suggest checking whether your file is corrupted.
Please check it with the following code (I assume that the ephmat2 file is the one with the issue).
If you are using the Intel compilers, you might need the following compile option: ifort -assume byterecl.
If you don't encounter any error, please compare the printed file size with the real file size.

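PROGRAM check_ephmat
  ! Reads ephmat2 record by record to find where an unformatted read first fails.

  IMPLICIT NONE

  INTEGER, PARAMETER :: DP = selected_real_kind(14,200)
  INTEGER            :: i, ios, idummy
  REAL(DP)           :: g2

  OPEN(UNIT = 101, FILE = 'ephmat2', STATUS = 'old', FORM = 'unformatted', IOSTAT = ios)
  IF (ios /= 0) THEN
     PRINT *, 'Cannot open ephmat2, iostat = ', ios
     STOP
  END IF

  ! Header record: two integers
  READ(101, IOSTAT = ios) idummy, idummy
  IF (ios /= 0) THEN
     PRINT *, 'Cannot read the header record, iostat = ', ios
     STOP
  END IF

  ! Each following record holds a single value of |g|^2
  i = 0
  DO
     READ(101, IOSTAT = ios) g2
     i = i + 1
     IF (ios > 0) THEN
        ! Read error: record i is damaged
        PRINT *, 'Error occurs at ', i
        STOP
     ELSE IF (ios < 0) THEN
        ! End of file: i - 1 records were read successfully
        EXIT
     ELSE
        PRINT *, i, g2
     END IF
  END DO

  CLOSE(101)

  ! Header: 2 record markers + 2 integers; each data record: 2 markers + 1 real
  PRINT *, 'Total file size=', 4*2+4*2+(4*2+8)*(i-1)

  STOP

END PROGRAM check_ephmat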
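
For readers without a Fortran compiler at hand, a similar structural check can be sketched in Python. This is not part of EPW. It assumes the same layout as the program above, with 4-byte little-endian record markers framing every record, and `scan_records` is a hypothetical helper name. It only verifies that the record framing is consistent and does not inspect the stored values.

```python
import os
import struct
import sys

MARKER = 4  # assumed size in bytes of each record-length marker


def scan_records(path):
    """Walk the file record by record.

    Returns (complete_records, byte_offset, problem). problem is None when
    every record is framed consistently up to the end of the file.
    """
    size = os.path.getsize(path)
    count, offset = 0, 0
    with open(path, "rb") as f:
        while offset < size:
            f.seek(offset)
            head = f.read(MARKER)
            if len(head) < MARKER:
                return count, offset, "file ends inside a leading marker"
            (length,) = struct.unpack("<i", head)
            end = offset + MARKER + length + MARKER
            if length < 0 or end > size:
                return count, offset, f"implausible record length {length}"
            # Jump over the data and compare the trailing marker.
            f.seek(offset + MARKER + length)
            if f.read(MARKER) != head:
                return count, offset, "leading and trailing markers differ"
            count += 1
            offset = end
    return count, offset, None


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "ephmat2"
    if os.path.isfile(path):
        n, offset, problem = scan_records(path)
        if problem is None:
            print(f"{n} complete records, {offset} bytes, framing is consistent")
        else:
            print(f"{n} complete records, then at byte {offset}: {problem}")
    else:
        print(f"{path} not found")
```

Note that the record count here includes the header record, whereas the Fortran program counts only the records that follow it.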
Sincerely,

H. Lee

Post by mdogan »

Dear H. Lee,

Thank you! The program stopped with the output: "Error occurs at 3320497". So it looks like the file is corrupted, is that right?

Best,
Mehmet

Post by hlee »

Dear Mehmet:

Yes, it is corrupted.
Could you let me know the exact file size (in bytes) of ephmat2?

Sincerely,

H. Lee

Post by mdogan »

Dear H. Lee,

It is 442115728 bytes.

Best,
Mehmet

Post by hlee »

Dear Mehmet:

Based on this file size, ephmat2 contains 27632232 values of |g|^2, where g is the electron-phonon matrix element.
So the file corruption starts rather early.
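
The count above follows from the file size and the record layout used in the check program. As a short sketch, assuming a header record with two 4-byte integers, one 8-byte real per data record, and 4-byte markers before and after every record:

```python
# Assumed layout, taken from the size formula in the check program:
HEADER_BYTES = 4 + 2 * 4 + 4  # marker + two 4-byte integers + marker
RECORD_BYTES = 4 + 8 + 4      # marker + one 8-byte real + marker


def expected_g2_records(file_size_bytes):
    """Number of |g|^2 records implied by the file size under this layout."""
    body = file_size_bytes - HEADER_BYTES
    if body < 0 or body % RECORD_BYTES != 0:
        raise ValueError("size is not consistent with the assumed layout")
    return body // RECORD_BYTES


n_total = expected_g2_records(442115728)  # size reported in the thread
first_bad = 3320497                       # record where the read failed

print(n_total)
print(f"{first_bad / n_total:.1%}")
```

With the reported size this gives 27632232 records, and the first failing record sits about 12% of the way into the file.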

You should definitely rerun your calculation, but I am not confident that it will succeed.
I/O issues are unpredictable; they can be affected by other users' I/O.

Sincerely,

H. Lee