EPW crashes while reading ephmat files
Posted: Wed Jan 27, 2021 1:26 am
by mdogan
Hello,
I have a fairly large system with 24 hydrogen atoms. The q-grid interpolation runs successfully on 128 processors across 32 nodes, but when the run reaches the calculation of the Eliashberg spectral function, it crashes right after the line that says "Start reading .ephmat files". The error file reads:
forrtl: severe (67): input statement requires too much data, unit 110, file /scratch1/02365/mdogan/H.C2c24.P500.novdW.qe6.2/./scf.ephmat/ephmat2
I tried increasing the number of nodes to 64 (while keeping the number of processors at 128), but the error remains. These ephmat files are each about 430 MB and each node has 192 GB of memory, so I don't see how the memory could be insufficient.
The error file is attached (as well as the other relevant files) at [https://www.dropbox.com/sh/zwjvc2hq600k ... 7bYxa?dl=0]. According to the error file, the problem originates in line 1289 of io_eliashberg.f90, which sits within many nested loops, so I can't distinguish what might be causing the issue. Any help will be greatly appreciated!
Best,
Mehmet
Re: EPW crashes while reading ephmat files
Posted: Wed Jan 27, 2021 4:29 am
by hpaudya1
Hi Mehmet,
This also happens sometimes if a "prefix.ephmat/ephmatXX" file was not written correctly in the previous run. I recommend running once more and writing the ephmatXX files again. Trying smaller meshes for such a large system would also help to gauge the memory requirement.
Best,
Hari Paudyal
Re: EPW crashes while reading ephmat files
Posted: Wed Jan 27, 2021 6:05 am
by mdogan
Dear Hari,
Thank you very much for your suggestions! I know that the calculation was able to run when the fine q-grid was 8*8*4 instead of 16*16*8. Unfortunately, I didn't save the ephmat files from that run, so I don't know what their sizes were. I have run a related calculation though, with a 12-atom cell, and it worked fine using 16 nodes and 64 processors. The ephmat files for that calculation are all 382 MB to 403 MB in size. The calculation that fails (24-atom cell) has ephmat files that are between 405 MB and 432 MB. If my understanding is correct, each processor handles one of these files. So, when I tried to use 64 nodes and 128 processors in the 24-atom calculation, the memory requirement per node was smaller than in the 12-atom calculation (2 proc/node vs. 4 proc/node). So I don't understand why this calculation fails while the other one doesn't. Does my analysis make sense?
I could try repeating the calculation, but because it's a very expensive calculation (32 nodes * ~1 week), I'd rather avoid that if possible, unless we cannot find any other plausible reason for the error than an ephmat file being incorrectly written. Let me know what you think. Thank you!
Best,
Mehmet
Re: EPW crashes while reading ephmat files
Posted: Wed Jan 27, 2021 3:28 pm
by hlee
Dear Mehmet:
In general, this error is not related to the memory footprint. It can occur even with a low memory footprint, for example if only 16 bytes of data were actually written to a record but you try to read 32 bytes from that record.
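As a rough illustration of the record-length mismatch described above, here is a minimal Python sketch of how a sequential unformatted record is framed and why asking for more bytes than the record holds fails. It assumes 4-byte little-endian record markers before and after each payload (the usual default on x86 with common compilers, but an assumption here); `write_record` and `read_record` are hypothetical helper names, and the error text simply mimics the runtime message quoted in the first post.

```python
import io
import struct


def write_record(stream, payload: bytes) -> None:
    """Write one record as: length marker, payload, length marker."""
    marker = struct.pack("<i", len(payload))  # assumed 4-byte little-endian marker
    stream.write(marker + payload + marker)


def read_record(stream, expected_nbytes: int) -> bytes:
    """Read one record; fail if more data is requested than the record holds."""
    (nbytes,) = struct.unpack("<i", stream.read(4))
    payload = stream.read(nbytes)
    stream.read(4)  # skip the trailing marker
    if expected_nbytes > nbytes:
        raise ValueError(
            "input statement requires too much data: "
            f"record holds {nbytes} bytes, {expected_nbytes} requested"
        )
    return payload[:expected_nbytes]


# Write a record holding two 8-byte reals (16 bytes of payload) ...
buf = io.BytesIO()
write_record(buf, struct.pack("<2d", 1.0, 2.0))
buf.seek(0)

# ... then try to read four 8-byte reals (32 bytes) from that same record.
try:
    read_record(buf, 32)
    outcome = "ok"
except ValueError as err:
    outcome = str(err)
print(outcome)
```

The point of the sketch is that the failure depends only on the record length stored in the file versus the amount the reader requests, not on how much memory the node has.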
As Hari mentioned, this might be due to a corrupted file, but other factors might also lead to this error.
Basically, I think that the current file I/O mechanism (I/O per element) in the superconducting module is problematic in some cases, and it would be desirable to modify it.
Sincerely,
H. Lee
Re: EPW crashes while reading ephmat files
Posted: Wed Jan 27, 2021 6:43 pm
by mdogan
Dear H. Lee,
Thank you for weighing in! I have a few follow-up questions:
1) Is there another way of fixing this error short of fully repeating the calculation? For instance, would it be possible to regenerate the ephmat files from other files on disk, without running the full q-grid interpolation again?
2) Is such an I/O error common? If I should expect it to occur only rarely, then I can just repeat the calculation and hope that it doesn't recur. But I need to run this calculation 6 or 8 times on related systems, and each q-grid interpolation takes about a week on 32 nodes (depending on the queue times, 2–3 weeks in total), so if the error is not so rare, then taking the gamble doesn't really make much sense.
3) Relatedly, is there a way of decreasing the probability of this error happening? You say "there might be other factors"; what could they be? Would having fewer ephmat files help? I could try to run with 64 processors instead of 128 (I have a feeling I tried it before and the memory per processor wasn't enough so it failed). Or I could try to recompile the code in a different way. Do you happen to run EPW on TACC Frontera? If there is a compiling script available for that machine among the developer team, I could easily try recompiling the code.
More generally, what would you do if you were in my place? Essentially, these 6 or 8 very large calculations are the main part of this project, so I need to be able to run them one way or another. Please let me know what you suggest. Thank you very much!
Best,
Mehmet
Re: EPW crashes while reading ephmat files
Posted: Thu Jan 28, 2021 3:10 pm
by hlee
Dear Mehmet:
First of all, I would suggest checking whether your file is corrupted.
Please check it using the following code (I assume that ephmat2 is the file with the issue).
If you are using Intel compilers, you might need the following compile option: ifort -assume byterecl
If you don't encounter any error, please compare the printed file size with the actual file size.
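PROGRAM check_ephmat
  ! Reads ephmat2 record by record to find where (if anywhere) the file is damaged.
  IMPLICIT NONE
  INTEGER, PARAMETER :: DP = selected_real_kind(14,200)
  INTEGER :: i, ios, idummy
  REAL(DP) :: g2
  !
  OPEN(UNIT = 101, FILE = 'ephmat2', STATUS = 'old', FORM = 'unformatted', IOSTAT = ios)
  IF (ios /= 0) THEN
    PRINT *, 'Cannot open ephmat2, iostat = ', ios
    STOP
  END IF
  ! First record: two integers
  READ(101) idummy, idummy
  i = 0
  DO
    ! Every following record holds a single real value g2
    READ(101, IOSTAT = ios) g2
    i = i + 1
    IF (ios > 0) THEN
      ! Read error: record i could not be read
      PRINT *, 'Error occurs at ', i
      STOP
    ELSE IF (ios < 0) THEN
      ! End of file reached without a read error
      EXIT
    ELSE
      PRINT *, i, g2
    END IF
  END DO
  ! First record: 2 markers + 2 integers; each g2 record: 2 markers + 1 real
  PRINT *, 'Total file size=', 4*2+4*2+(4*2+8)*(i-1)
  CLOSE(101)
END PROGRAM check_ephmat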
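
If no Fortran compiler is at hand, a similar check can be sketched in Python by walking the record markers directly instead of decoding the values. This is only a sketch under stated assumptions: it assumes 4-byte little-endian record markers before and after every record, `scan_records` is a hypothetical helper name, and the two small files it builds are synthetic stand-ins with the same layout as the program above (one record of two integers, then one real per record), not real ephmat data.

```python
import os
import struct
import tempfile


def scan_records(path):
    """Walk the record markers; return (good_records, byte_offset, status)."""
    n_good, offset = 0, 0
    with open(path, "rb") as f:
        while True:
            head = f.read(4)
            if len(head) == 0:
                return n_good, offset, "clean end of file"
            if len(head) < 4:
                return n_good, offset, "truncated record"
            (nbytes,) = struct.unpack("<i", head)  # assumed marker format
            if nbytes < 0:
                return n_good, offset, "negative record length"
            f.seek(nbytes, os.SEEK_CUR)  # skip the payload without loading it
            tail = f.read(4)
            if len(tail) < 4:
                return n_good, offset, "truncated record"
            if tail != head:
                return n_good, offset, "marker mismatch"
            n_good += 1
            offset += 8 + nbytes


def _record(payload: bytes) -> bytes:
    marker = struct.pack("<i", len(payload))
    return marker + payload + marker


# Synthetic file: one record of two integers, then three records of one real each.
good = _record(struct.pack("<2i", 4, 2)) + b"".join(
    _record(struct.pack("<d", x)) for x in (0.1, 0.2, 0.3)
)
# Damage the trailing marker of the third record (bytes 44..47).
bad = bytearray(good)
bad[44:48] = struct.pack("<i", 12345)

with tempfile.TemporaryDirectory() as tmp:
    good_path = os.path.join(tmp, "ephmat_good")
    bad_path = os.path.join(tmp, "ephmat_bad")
    with open(good_path, "wb") as f:
        f.write(good)
    with open(bad_path, "wb") as f:
        f.write(bytes(bad))
    result_good = scan_records(good_path)
    result_bad = scan_records(bad_path)

print(result_good)
print(result_bad)
```

On a real file one would call `scan_records('ephmat2')` and compare the returned byte offset with the size reported by the file system. The count includes the first (integer) record, so the number of g2 records is one less.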
Sincerely,
H. Lee
Re: EPW crashes while reading ephmat files
Posted: Thu Jan 28, 2021 7:57 pm
by mdogan
Dear H. Lee,
Thank you! The program stopped with the output: "Error occurs at 3320497". So it looks like the file is corrupted, is that right?
Best,
Mehmet
Re: EPW crashes while reading ephmat files
Posted: Thu Jan 28, 2021 10:04 pm
by hlee
Dear Mehmet:
Yes, it is corrupted.
Could you let me know the exact file size (in bytes) of ephmat2?
Sincerely,
H. Lee
Re: EPW crashes while reading ephmat files
Posted: Thu Jan 28, 2021 10:11 pm
by mdogan
Dear H. Lee,
It is 442115728 bytes.
Best,
Mehmet
Re: EPW crashes while reading ephmat files
Posted: Thu Jan 28, 2021 10:24 pm
by hlee
Dear Mehmet:
The number of |g|^2 values stored in ephmat2 is 27632232, where g is the electron-phonon matrix element.
So the file corruption starts rather early: the first unreadable record (3320497) is only about 12% of the way through the file.
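The count above can be reproduced from the reported file size using the size formula printed by the check program. The short Python sketch below does that arithmetic; the layout (4-byte record markers, a first record with two 4-byte integers, then one 8-byte real per record) is taken from that formula, and the constant and function names are hypothetical.

```python
MARKER = 4                          # bytes per record marker (assumed)
HEADER_RECORD = 2 * MARKER + 2 * 4  # two markers + two 4-byte integers = 16
DATA_RECORD = 2 * MARKER + 8        # two markers + one 8-byte real = 16


def expected_values(file_size: int) -> int:
    """Number of g2 records implied by the file size, if the layout is intact."""
    body = file_size - HEADER_RECORD
    if body < 0 or body % DATA_RECORD != 0:
        raise ValueError("file size does not match the assumed record layout")
    return body // DATA_RECORD


n_values = expected_values(442115728)  # size reported for ephmat2
first_bad = 3320497                    # record index printed by the check program
bad_offset = HEADER_RECORD + (first_bad - 1) * DATA_RECORD
percent = round(100 * first_bad / n_values, 1)
print(n_values, bad_offset, percent)
```

Under these assumptions the size is consistent with 27632232 values, and the first unreadable record begins roughly 53 MB into a 442 MB file.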
You should definitely rerun your calculation, but I am not confident that it will succeed.
Basically, I/O issues are unpredictable; they can be affected by other users' I/O.
Sincerely,
H. Lee