How to reduce load on parallel file system

Post here questions related to issues encountered while running the EPW code

ganwar
Posts: 5
Joined: Wed May 08, 2019 11:22 am
Affiliation:

How to reduce load on parallel file system

Post by ganwar »

Hi,

I'm supporting a user on our HPC facility running EPW from QE 6.3. Unfortunately, the jobs the user is running are generating a very high load on our parallel file system (GPFS), to the extent that a few (2-3) concurrent multi-node (3-10 nodes) jobs make the file system unusable for other users.

Does anyone have advice on reducing this IO load? I believe that with QE (pw.x) you can set wfcdir to a local disk (for per-process files) and outdir to the parallel file system, as well as set disk_io, to reduce the disk IO. However, for EPW everything seems to go via outdir, and setting it to a local disk for multi-node jobs results in MPI_FILE_OPEN errors. The pw.x arrangement I have in mind is sketched below.
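For pw.x, the split I have in mind looks something like this (a sketch only; the prefix and paths are placeholders for our system, and the exact disk_io behaviour should be checked against the QE 6.3 documentation):

Code: Select all

&CONTROL
  calculation = 'scf'
  prefix      = 'myrun'                  ! placeholder name
  outdir      = '/gpfs/scratch/myrun/'   ! shared parallel file system (placeholder path)
  wfcdir      = '/tmp/myrun/'            ! node-local disk for per-process wavefunction files (placeholder path)
  disk_io     = 'low'                    ! reduce what pw.x writes back to disk
/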

Any advice or suggestions would be welcome, apologies if I've misunderstood or missed something.

Thanks

sponce
Site Admin
Posts: 616
Joined: Wed Jan 13, 2016 7:25 pm
Affiliation: EPFL

Re: How to reduce load on parallel file system

Post by sponce »

Dear ganwar,

Thanks for your message.

Can you be more specific about the type of calculation the user is running?

A traditional EPW calculation will write .epb files (local on each node) as well as a .epmatwp1 file.
However, that latter file is then indeed read using MPI_READ (i.e., all the cores read the same big file). In our testing this should not stress the cluster too much. If it does, can you tell us which part of the code is responsible?

Now, if the user uses the newly implemented mobility calculation, then yes, everything is MPI_SEEK + MPI_WRITE.
I'm actually working on an alternative where everything is kept local for the mobility.

It would also help if you could get the user to send a typical input file that generates the problem, as well as the size of the XX.epmatwp1 file.

Thanks,
Samuel
Prof. Samuel Poncé
Chercheur qualifié F.R.S.-FNRS / Professeur UCLouvain
Institute of Condensed Matter and Nanosciences
UCLouvain, Belgium
Web: https://www.samuelponce.com

ganwar
Posts: 5
Joined: Wed May 08, 2019 11:22 am
Affiliation:

Re: How to reduce load on parallel file system

Post by ganwar »

Hi Samuel,

Thanks for your response.

I suspect the issue is really the .epb files, but given our walltime limit (48 hours) the user said they wanted to keep these in case the jobs don't reach the EPW phase in time (the restart pattern is sketched below). I had suggested setting etf_mem, but this causes the jobs to run out of RAM.
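For context, the restart pattern the user is relying on is roughly the following (a sketch based only on the epbread/epbwrite flags that appear in EPW input files; I haven't verified the exact combination against the QE 6.3 documentation):

Code: Select all

&inputepw
  ! first job: compute the coarse-grid matrix elements and write the .epb files
  epbwrite = .true.
  epbread  = .false.
  ! (all other settings unchanged)
/

&inputepw
  ! follow-up job: read the existing .epb files back instead of recomputing them
  epbwrite = .false.
  epbread  = .true.
  ! (all other settings unchanged)
/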

I've asked the user to look over your questions and either respond here or through me, so I'll let you know when I have more.

Thanks

sponce
Site Admin
Posts: 616
Joined: Wed Jan 13, 2016 7:25 pm
Affiliation: EPFL

Re: How to reduce load on parallel file system

Post by sponce »

Dear ganwar,

It would be very surprising for the generation of the .epb files to create a lot of load on your filesystem.
Those files are produced locally (i.e., each core within a node should be writing to its own scratch, with no communication between nodes).

They should indeed keep those files until they have generated the XX.epmatwp1 file. Once that is done, it is safe for them to remove all the .epb files.

Best wishes,
Samuel
Prof. Samuel Poncé
Chercheur qualifié F.R.S.-FNRS / Professeur UCLouvain
Institute of Condensed Matter and Nanosciences
UCLouvain, Belgium
Web: https://www.samuelponce.com

ganwar
Posts: 5
Joined: Wed May 08, 2019 11:22 am
Affiliation:

Re: How to reduce load on parallel file system

Post by ganwar »

Hi Samuel,

"Those files are produced locally (.i.e. each cores within a node should be writing on its own scratch with no communication between nodes)."

While the files are produced locally, they seem to be written to the same directory as outdir, which in our case is on the cluster-wide file system. It would be useful if there were a way to save those files in a location separate from outdir.

Having said that, though, I can see a job running at the moment which is only accessing a .epmatwp1 file, and that does seem to be generating a lot of IO, so I am probably wrong about the .epb files being the cause of the high load.

Thanks

Chathu
Posts: 1
Joined: Wed May 29, 2019 7:12 pm
Affiliation:

Re: How to reduce load on parallel file system

Post by Chathu »

ganwar wrote: I'm supporting a user on our HPC facility running EPW from QE 6.3 [...] for EPW it seems that everything goes via outdir, and setting it to a local disk for multi-node jobs results in MPI_FILE_OPEN errors.


Hi Samuel,

I am the HPC user mentioned here. Below is my input file for EPW.

Code: Select all
&inputepw
prefix = 'NbCoSn',

amass(1) = 92.90638
amass(2) = 58.933195
amass(3) = 118.71
! outdir = '/tmp/esscmv/NbCoSn/'
! dvscf_dir = '/tinisgpfs/home/csc/esscmv/bandstructure_qe/NbCoSn/EPW_2/save'
outdir = './'
dvscf_dir = './save'


elph = .true.
kmaps = .true.
epbwrite = .true.
epbread = .false.

epwwrite = .true.
epwread = .false.

nbndsub = 12
nbndskip = 0

wannierize = .true.
num_iter = 300
dis_win_max = 25
dis_win_min = 0
dis_froz_min= 14
dis_froz_max= 25

wdata(1) = 'bands_plot = .true.'
wdata(2) = 'begin kpoint_path'
wdata(3) = 'G 0.00 0.00 0.00 X 0.00 0.50 0.50'
wdata(4) = 'X 0.00 0.50 0.50 W 0.25 0.50 0.75'
wdata(5) = 'W 0.25 0.50 0.75 L 0.50 0.50 0.50'
wdata(6) = 'L 0.50 0.50 0.50 K 0.375 0.375 0.75'
wdata(7) = 'K 0.375 0.375 0.75 G 0.00 0.00 0.00'
wdata(8) = 'G 0.00 0.00 0.00 L 0.50 0.50 0.50'
wdata(9) = 'end kpoint_path'
wdata(10) = 'bands_plot_format = gnuplot'

iverbosity = 3
etf_mem = 1
restart=.true.
restart_freq=1000

elecselfen = .true.
delta_approx= .true.
phonselfen = .false.
efermi_read = .true.
fermi_energy= 16.4224


fsthick = 2.5 ! eV
eptemp = 300 ! K
degaussw = 0.05 ! eV

a2f = .false.


nkf1 = 48
nkf2 = 48
nkf3 = 48

nqf1 = 48
nqf2 = 48
nqf3 = 48

nk1 = 24
nk2 = 24
nk3 = 24

nq1 = 6
nq2 = 6
nq3 = 6
/
16 cartesian
0.000000000000000E+00 0.000000000000000E+00 0.000000000000000E+00
0.117851130197756E+00 0.117851130197756E+00 -0.117851130197756E+00
0.235702260395511E+00 0.235702260395511E+00 -0.235702260395511E+00
-0.353553390593267E+00 -0.353553390593267E+00 0.353553390593267E+00
0.235702260395511E+00 -0.654205191118227E-17 0.654205191118227E-17
0.353553390593267E+00 0.117851130197756E+00 -0.117851130197756E+00
-0.235702260395511E+00 -0.471404520791023E+00 0.471404520791023E+00
-0.117851130197756E+00 -0.353553390593267E+00 0.353553390593267E+00
0.261682076447291E-16 -0.235702260395511E+00 0.235702260395511E+00
0.471404520791023E+00 -0.130841038223645E-16 0.130841038223645E-16
-0.117851130197756E+00 -0.589255650988778E+00 0.589255650988778E+00
-0.261682076447291E-16 -0.471404520791023E+00 0.471404520791023E+00
-0.707106781186534E+00 0.000000000000000E+00 0.000000000000000E+00
-0.235702260395511E+00 -0.471404520791023E+00 0.707106781186534E+00
-0.117851130197756E+00 -0.353553390593267E+00 0.589255650988778E+00
-0.707106781186534E+00 0.235702260395511E+00 0.261682076447291E-16


I would be really grateful if you could have a look and see if there is anything I can change here to resolve this issue.

Regards,
Chathu

ganwar
Posts: 5
Joined: Wed May 08, 2019 11:22 am
Affiliation:

Re: How to reduce load on parallel file system

Post by ganwar »

Hi Samuel,

Just to add to Chathu's comment: I've managed to get a little detail from the storage system regarding the load on the file system for a job that's currently active. As far as I can tell, the high IO (at this particular stage of the job) appears to be due to the MPI tasks (128 of them) reading from the epmatwp1 file, which is currently 95 GB. I don't know if this is helpful at all.
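As a rough upper bound on the traffic (a back-of-the-envelope estimate, assuming the worst case in which each task ends up reading the entire file, per the MPI_READ behaviour described earlier in the thread):

\[ 128 \ \text{tasks} \times 95\ \mathrm{GB} \approx 12\ \mathrm{TB} \ \text{of aggregate reads per pass over the file} \]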

sponce
Site Admin
Posts: 616
Joined: Wed Jan 13, 2016 7:25 pm
Affiliation: EPFL

Re: How to reduce load on parallel file system

Post by sponce »

Hello,

Oh I see. Well, clearly the "coarse" grid is not coarse at all.
I can see that the user is using:

Code: Select all

nk1 = 24
nk2 = 24
nk3 = 24

which is very dense.

The whole idea of EPW is precisely to avoid such dense coarse grids by relying on the localization properties of maximally localized Wannier functions (MLWFs).
This grid seems like total overkill to me.
At most the user should use

Code: Select all

nk1 = 12
nk2 = 12
nk3 = 12


and would probably be fine with

Code: Select all

nk1 = 6
nk2 = 6
nk3 = 6


This will drastically reduce the size of the epmatwp1 file, to roughly 3 GB, which will significantly decrease the IO.
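As a rough scaling argument (assuming the dominant array in the file grows linearly with the number of coarse k-points nk1*nk2*nk3; other dimensions contribute as well, so treat this as an order-of-magnitude estimate):

\[ \mathrm{size} \propto n_{k1}\, n_{k2}\, n_{k3} \quad\Rightarrow\quad 95\ \mathrm{GB} \times \left(\tfrac{12}{24}\right)^{3} \approx 12\ \mathrm{GB}, \qquad 95\ \mathrm{GB} \times \left(\tfrac{6}{24}\right)^{3} \approx 1.5\ \mathrm{GB} \]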

Best wishes,
Samuel
Prof. Samuel Poncé
Chercheur qualifié F.R.S.-FNRS / Professeur UCLouvain
Institute of Condensed Matter and Nanosciences
UCLouvain, Belgium
Web: https://www.samuelponce.com

ganwar
Posts: 5
Joined: Wed May 08, 2019 11:22 am
Affiliation:

Re: How to reduce load on parallel file system

Post by ganwar »

Hi Samuel,

Apologies for not responding to your message earlier. I just wanted to confirm that, following your advice, Chathu has started running a coarser grid, which has significantly reduced the load on the file system. Thanks once again for your advice.
