system msg for write_line failure : Bad file descriptor

Posted: Thu Jun 15, 2017 8:47 am
by balabi
Hello, everyone!

I use Quantum ESPRESSO 6.1 and want to calculate electron-phonon coupling properties.

I compiled EPW with "make epw" in the QE 6.1 folder, and then wanted to run some tests.

I chose QE-6.1/EPW/tests/Inputs/t05, which corresponds to the MgB2 example according to the Readme.

I tried to run RUN.sh in serial mode (with the parallel part commented out), but got this error message:

application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1
:
system msg for write_line failure : Bad file descriptor


I went through the script step by step and found that the error occurs when running

../../../src/epw.x < epw_iso_real.in > epw_iso_real.out


What does this error message mean, and how can I fix it? Thank you for your help.

PS. I compiled QE and EPW with Intel Parallel Studio 2017u2. I also tried other Intel versions and got the same error.

Re: system msg for write_line failure : Bad file descriptor

Posted: Tue Jun 20, 2017 6:46 am
by balabi
Has no one been able to reply after 5 days? I am getting desperate.

Re: system msg for write_line failure : Bad file descriptor

Posted: Tue Jun 20, 2017 8:43 am
by sponce
Hello,

Intel MPI is a bit more challenging.

Maybe you can try ifort17 + openmpi?

For intel+impi I can suggest:

1) Are QE, and especially ph.x, OK? Does ph.x run correctly in parallel?

2) You can try going into EPW/src/, running "make" there, and then using the executable directly from EPW/src/epw.x

3) When you run sequentially, you might have to do something like
mpirun -np 1 ../../../src/epw.x -npool 1 < epw_iso_real.in > epw_iso_real.out

Indeed, since the code is compiled in parallel, it might not work if you run ./epw.x directly.
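
For scripting such runs, the command line from suggestion 3 can be assembled programmatically. This is only a minimal Python sketch: the helper name build_epw_command is hypothetical, and it assumes that the number of pools is set equal to the number of MPI processes, as in the commands quoted in this thread.

```python
import shlex


def build_epw_command(exe, input_file, output_file, nproc=1):
    """Return a shell command line for running an MPI-built epw.x.

    Assumption (not an official rule): -npool is set to the same value
    as -np, mirroring the commands quoted in this thread.
    """
    if nproc < 1:
        raise ValueError("nproc must be at least 1")
    parts = ["mpirun", "-np", str(nproc), exe, "-npool", str(nproc)]
    command = " ".join(shlex.quote(p) for p in parts)
    # Standard input and output are redirected exactly as in the forum post.
    return f"{command} < {shlex.quote(input_file)} > {shlex.quote(output_file)}"


print(build_epw_command("../../../src/epw.x", "epw_iso_real.in", "epw_iso_real.out"))
```

With nproc=1 this prints the same command as in suggestion 3.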

Best,

Samuel

Re: system msg for write_line failure : Bad file descriptor

Posted: Thu Jun 22, 2017 1:03 am
by balabi
Dear Samuel,

Thank you so much for your reply.

I am wondering why intel + Intel MPI could be a challenge. I compiled QE 6.1 with Intel Parallel Studio and have run it for several months without any problem. Besides, the EPW test farm page http://epw.org.uk/Main/TestFarm clearly lists the intel + Intel MPI combination.

I tried all of your suggestions:
1. ph.x runs correctly in both serial and parallel mode.
2. Running make directly under EPW/src and using EPW/src/epw.x makes no difference.
3. mpirun -np 1 ../../../src/epw.x -npool 1 < epw_iso_real.in > epw_iso_real.out also doesn't work. It only outputs a shorter error message: "application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0".

I should emphasize that epw_iso.in actually works. So I think the fact that epw_iso.in works while epw_iso_real.in fails is the key to finding out what is wrong.

I also just noticed that I had missed something important: a CRASH file is generated in the folder, saying

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
task # 0
from epw_readin : error # 19
reading input_epw namelist
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
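
A CRASH file like the one above can also be checked from a script before looking at MPI at all. The following is a minimal Python sketch, assuming the file follows the layout quoted here (a "from <routine> : error # <n>" line followed by a message line); the function name parse_crash is hypothetical.

```python
import re

# Matches lines such as "from epw_readin : error # 19".
CRASH_PATTERN = re.compile(r"from\s+(\S+)\s*:\s*error\s*#\s*(-?\d+)")


def parse_crash(text):
    """Return a list of (routine, error_number, message) tuples."""
    lines = [line.strip() for line in text.splitlines()]
    found = []
    for i, line in enumerate(lines):
        match = CRASH_PATTERN.search(line)
        if not match:
            continue
        # Take the next non-empty line as the message, unless a '%' rule comes first.
        message = ""
        for following in lines[i + 1:]:
            if following.startswith("%"):
                break
            if following:
                message = following
                break
        found.append((match.group(1), int(match.group(2)), message))
    return found


sample = """\
%%%%%%%%%%%%%%%%
task # 0
from epw_readin : error # 19
reading input_epw namelist
%%%%%%%%%%%%%%%%
"""
print(parse_crash(sample))
```

Here the routine name epw_readin and the message "reading input_epw namelist" point at the input file rather than at the MPI setup.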

Does this mean that epw_iso_real.in has a bug? I found that, for example, elinterp and phinterp are not defined on the EPW documentation page, but they appear in epw_iso_real.in.

So it turns out that MPI may not be the problem, and I need a confirmation: do the EPW/tests shipped with QE 6.1 have a bug or not?

Thank you so much for your help.

Re: system msg for write_line failure : Bad file descriptor

Posted: Thu Jun 22, 2017 8:23 am
by sponce
Dear balabi,

Yes, that's it. Some input variables have been removed and I forgot to update that test. The list of updated variables can be found at http://epw.org.uk/Main/About
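
An input file can be screened for such variables before running epw.x. The sketch below is a rough Python illustration, not part of EPW: the set SUSPECT_VARIABLES contains only the two names mentioned in this thread (elinterp and phinterp), the sample namelist values are made up, and the authoritative list is the one on the EPW page.

```python
import re

# Assumption: only the two names mentioned in this thread; extend the set
# from the official EPW list of input variable changes.
SUSPECT_VARIABLES = {"elinterp", "phinterp"}

# A variable name, optionally followed by an index such as (1), then '='.
ASSIGNMENT = re.compile(r"^\s*([A-Za-z_]\w*)\s*(?:\([^)]*\))?\s*=", re.MULTILINE)


def find_suspect_variables(namelist_text, suspects=SUSPECT_VARIABLES):
    """Return the sorted suspect variable names that are set in the input."""
    # Drop Fortran-style '!' comments (naive: also cuts a '!' inside a string).
    stripped = "\n".join(line.split("!", 1)[0] for line in namelist_text.splitlines())
    names = {m.group(1).lower() for m in ASSIGNMENT.finditer(stripped)}
    return sorted(names & {s.lower() for s in suspects})


# Hypothetical input fragment, for illustration only.
sample = """\
&inputepw
  prefix   = 'example'   ! made-up value
  elinterp = .true.
  phinterp = .true.
  nbndsub  = 2           ! made-up value
/
"""
print(find_suspect_variables(sample))
```

An empty result does not prove the input is valid; it only means none of the listed names were found.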

Thank you for reporting it.

Actually, only the tests in QE/test-suite/epw_* are tested automatically. Those should be correct.

For impi, I just meant that sometimes configure does not automatically find all the links (for MKL, for example) and you have to update the make.sys file manually. But that was not the issue here.

Best,

Samuel

Re: system msg for write_line failure : Bad file descriptor

Posted: Fri Jun 23, 2017 3:00 pm
by balabi
Dear Samuel,
Thank you so much for your reply.
I took your advice and ran the test suite.
I ran the tests with "make run-tests-epw-parallel" and "make run-tests-epw-serial" for parallel and serial respectively.
I tested two compiled versions and found many problems.

Intel Parallel Studio compiled version

  1. serial mode
    1. Only 18 out of 24 tests passed.
    2. Many of the errors are surprisingly large. For example,

    ERROR: absolute error 2.61e+02 greater than 5.00e-02. (Test: 439.026029.  Benchmark: 700.525115.)
       ERROR: relative error 3.73e-01 greater than 6.00e-04. (Test: 439.026029.  Benchmark: 700.525115.)
    isig
       ERROR: absolute error 2.61e+02 greater than 5.00e-02. (Test: 439.026029.  Benchmark: 700.525115.)
       ERROR: relative error 3.73e-01 greater than 6.00e-04. (Test: 439.026029.  Benchmark: 700.525115.)
    isig
       ERROR: absolute error 1.38e+01 greater than 5.00e-02. (Test: 36.255925.  Benchmark: 50.047645.)
       ERROR: relative error 2.76e-01 greater than 6.00e-04. (Test: 36.255925.  Benchmark: 50.047645.)
    isig
       ERROR: absolute error 2.04e+02 greater than 5.00e-02. (Test: 359.447991.  Benchmark: 563.927535.)
       ERROR: relative error 3.63e-01 greater than 6.00e-04. (Test: 359.447991.  Benchmark: 563.927535.)

  2. parallel mode
    1. Many more failures: only 6 out of 24 tests passed.
    2. Besides large errors like those in serial mode, I also got several of these:

    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
        Error in routine divide_et_impera (1):
        some nodes have no k-points
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    and
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
        Error in routine pw_readfile (1):
        error opening xml data file
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

    3. I even got a peculiar traceback in the middle, like this:

    *** Error in `/home/admin-pc/QE/qe-6.1-mklseq-traceback/qe-6.1/test-suite/..//bin/epw.x': free(): invalid next size (normal): 0x0000000004111420 ***
    ======= Backtrace: =========
    /lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x2b924816a7e5]
    /lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x2b924817337a]
    /lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x2b924817753c]
    /home/admin-pc/QE/qe-6.1-mklseq-traceback/qe-6.1/test-suite/..//bin/epw.x[0xcbf808]
    /home/admin-pc/QE/qe-6.1-mklseq-traceback/qe-6.1/test-suite/..//bin/epw.x[0x480ac4]
    .....
    ....
  3. the full serial result is here : https://drive.google.com/open?id=0By4WA ... GlFLVdsV0E
    the full parallel result is here : https://drive.google.com/open?id=0By4WA ... V96cmNOWnc
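
The absolute and relative errors quoted in the serial output above can be reproduced by hand. The following Python sketch assumes the relative error is the absolute difference divided by the benchmark value, which matches the quoted numbers; it is not the test suite's own code, and the function name compare is hypothetical.

```python
def compare(test, benchmark, abs_tol=5.00e-02, rel_tol=6.00e-04):
    """Return (abs_err, rel_err, abs_ok, rel_ok) for one test/benchmark pair.

    Assumption: relative error = |test - benchmark| / |benchmark|.
    The default tolerances are the ones shown in the quoted error lines.
    """
    abs_err = abs(test - benchmark)
    rel_err = abs_err / abs(benchmark)
    return abs_err, rel_err, abs_err <= abs_tol, rel_err <= rel_tol


# (Test, Benchmark) pairs copied from the error messages above.
pairs = [
    (439.026029, 700.525115),
    (36.255925, 50.047645),
    (359.447991, 563.927535),
]

for test, benchmark in pairs:
    abs_err, rel_err, abs_ok, rel_ok = compare(test, benchmark)
    print(f"abs {abs_err:.2e} (ok={abs_ok})  rel {rel_err:.2e} (ok={rel_ok})")
```

A relative deviation of roughly 28% to 37% is far beyond what compiler rounding differences would normally produce, which suggests a difference in the code or in the reference data rather than in the compiler.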


gfortran + Ubuntu's default openmpi version

  1. serial mode
    Only 22 out of 24 tests passed.
    This looks fine, and the errors are quite small.
  2. parallel mode
    I also got an MPI_ABORT error. What is more, the run did not finish; it halted at

     mpirun -np 8 /home/admin-pc/QE/qe-6.1-gfortran/qe-6.1/test-suite/..//bin/epw.x -npool 8 < epw2.in > test.out.220617-1.inp=epw2.in.args=3 2> test.err.220617-1.inp=epw2.in.args=3

    and stayed stuck for half an hour with the CPU at 100%. I think I have to abort it manually.
    full output is here https://pastebin.com/NbUwXNU4


question

So why are the errors so large, and why is the Intel version so bad? Can I trust the results of the Intel version? The question can also be reversed: judging from the YouTube video you probably compile with gfortran, so the reference data were probably generated with gfortran. But can I trust the gfortran results over the Intel ones? After all, Intel is a much more mature compiler. And what is wrong with the unfinished parallel gfortran run?

best regards

Re: system msg for write_line failure : Bad file descriptor

Posted: Fri Jun 23, 2017 5:53 pm
by sponce
Dear balabi,

There might indeed be an issue with your compilation.

So, first of all, although I showed the compilation with gfortran in the YouTube video, EPW is tested nightly on a test farm with multiple compilers: http://epw.org.uk/Main/TestFarm

As you can see, Intel 2015 works perfectly: http://129.67.86.21:8010/builders/EPW-f ... logs/stdio

In theory, the parallel version of the test suite should use 4 MPI processes, not 8. Did you modify this?
If so, you should not, because some test calculations might have fewer than 8 k-points. This explains why you get

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    Error in routine divide_et_impera (1):
    some nodes have no k-points
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


I therefore think that with 4 MPI processes you might get gfortran+openmpi to work properly.
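
The reasoning about pools and k-points can be illustrated with a simplified model. This Python sketch only splits k-points as evenly as possible over pools; it is not the actual divide_et_impera routine, and the k-point count of 6 is a made-up example value.

```python
def kpoints_per_pool(nkpoints, npool):
    """Split nkpoints as evenly as possible over npool pools (simplified model)."""
    if npool < 1:
        raise ValueError("npool must be at least 1")
    base, extra = divmod(nkpoints, npool)
    # The first 'extra' pools receive one additional k-point.
    return [base + (1 if i < extra else 0) for i in range(npool)]


def empty_pools(nkpoints, npool):
    """Return the indices of pools that would receive no k-points."""
    counts = kpoints_per_pool(nkpoints, npool)
    return [i for i, n in enumerate(counts) if n == 0]


nk = 6  # hypothetical number of k-points in a small test case
for npool in (4, 8):
    print(npool, kpoints_per_pool(nk, npool), empty_pools(nk, npool))
```

In this model, 4 pools all receive work, while with 8 pools two of them are left empty, which is the situation the "some nodes have no k-points" message describes.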

Regarding the Intel compilation, can you try intel+openmpi?

Best,

Samuel

Re: system msg for write_line failure : Bad file descriptor

Posted: Sat Jun 24, 2017 9:07 am
by balabi
Dear Samuel,

Thank you so much for your reply.

I think I may have found what is going wrong.

Firstly

You are right. I should not have changed the number of MPI processes from 4 to 8. After changing it back, the gfortran version gives the correct output.

Secondly

I found that the large errors are not due to the compiler or MPI. They are due to the bugfix.
On the page http://epw.org.uk/Main/DownloadAndInstall, you mention:
Due to a change in the symmetries of QE, a bugfix has been created

So we have to replace epw.f90 and elphon_shuffle_wrap.f90 with the bugfixed versions.
I did that, but only for the Intel-compiled version, and forgot to do it for the gfortran version. That is why the Intel results looked so "bad".
If I compile without the bugfix using Intel Parallel Studio, the result is nearly perfect, and serial and parallel execution give the same output. For the full output see https://pastebin.com/zWqTt9xx
21 out of 24 tests passed. But there is one mysterious failure, see lines 57 to 60; I don't know what is wrong.

question
  1. What is the error in lines 57 to 60?
  2. Could you please check if the bugfix is correct? Do we really need it?

best regards

Re: system msg for write_line failure : Bad file descriptor

Posted: Sat Jun 24, 2017 11:43 am
by sponce
Hello,

The release of version 6.1 of QE was a bit of a mess.

Just before the release, a change to the symmetries was made in QE. I did not see it in time, and it had an impact on EPW.

Unfortunately, for the release of QE 6.1, somebody regenerated the benchmark reference files of EPW with the mistake (this should not have been done).

As a result, some of the benchmark reference files are wrong, I'm afraid.
I only created the patch so that the code produces correct results, but did not create a patch for the benchmark files, to avoid making a mess.

You can download the development version of QE from GitHub: https://github.com/QEF/q-e

This version has the correct reference benchmark files for EPW and you do not need to apply any patch.

Best,

Samuel

Re: system msg for write_line failure : Bad file descriptor

Posted: Tue Jun 27, 2017 3:35 am
by balabi
Dear Samuel,

Thank you so much for your reply.

I have now tried the development version.

There are still discrepancies.

For the version compiled with Intel Parallel Studio 2017u2, the full output is at https://pastebin.com/SS4rVkL7
There are three failures. Note lines 140--165, where the error is large. What is wrong? Lines 57 and 75 also fail.

For the gfortran version, the full output is at https://pastebin.com/abnHpdKU , with only one failure and a small error; almost perfect.

What confuses me is that epw.f90 in this development version of QE shows version 4.1, but on the official site the latest version is 4.2. So which is really the latest version?

best regards