Server overload due to many tasks reading from a very large epmatwp file
Posted: Wed Jan 06, 2021 10:48 pm
Hello,
I'm running EPW on the TACC Frontera supercomputer. During a q-point interpolation on 32 nodes with 128 processors, my job was canceled by the system admins because of "an OSS server overloading while reading a very large file from many tasks". The file in question is the epmatwp file, which is 396 GB. My queue access is disabled until I figure out how to avoid repeating the problem.
Is there a way to split this file up like the epb files, or some other method so that the tasks do not all read from the same file at the same time? I checked TACC's "Managing I/O" webpage [https://portal.tacc.utexas.edu/tutorials/managingio], but I wasn't sure whether any of the suggestions there were applicable. I would greatly appreciate any suggestions.
My epw input file is as follows:
electron-phonon calculation
&inputepw
outdir = './'
prefix = 'scf'
dvscf_dir = './save'
iverbosity = 1
ep_coupling = .true.
elph = .true.
kmaps = .true.
epbwrite = .false.
epbread = .false.
epwwrite = .true.
epwread = .true.
etf_mem = 1
wannierize = .false.
nbndsub = 24
num_iter = 200
dis_win_min = -20
dis_win_max = 80
dis_froz_min = -20
dis_froz_max = 20
proj(1) = 'H:s'
wdata(1) = 'bands_plot = .true.'
wdata(2) = 'begin kpoint_path'
wdata(3) = 'Z 0.00 0.00 0.50 Y 0.50 -0.50 0.00'
wdata(4) = 'Y 0.50 -0.50 0.00 H3 0.66666667 -0.33333333 0.50'
wdata(5) = 'H3 0.66666667 -0.33333333 0.50 T 0.50 -0.50 0.50'
wdata(6) = 'T 0.50 -0.50 0.50 Z 0.00 0.00 0.50'
wdata(7) = 'Z 0.00 0.00 0.50 G 0.00 0.00 0.00'
wdata(8) = 'G 0.00 0.00 0.00 R 0.00 -0.50 -0.50'
wdata(9) = 'R 0.00 -0.50 -0.50 S 0.00 -0.50 0.00'
wdata(10) = 'S 0.00 -0.50 0.00 H2 0.33333333 -0.66666667 0.50'
wdata(11) = 'end kpoint_path'
wdata(12) = 'bands_num_points = 80'
wdata(13) = 'kmesh_tol = 0.0001'
fsthick = 1
degaussw = 0.1
ephwrite = .true.
eliashberg = .true.
nsiter = 400
conv_thr_iaxis = 1.0d-3
wscut = 1.0
temps = 20 180
nstemp = 9
muc = 0.10
mp_mesh_k = .true.
restart = .true.
restart_step = 10
nkf1 = 64
nkf2 = 64
nkf3 = 32
nqf1 = 16
nqf2 = 16
nqf3 = 8
nk1 = 32
nk2 = 32
nk3 = 16
nq1 = 4
nq2 = 4
nq3 = 2
/
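For scale, the grid sizes implied by the input above can be tallied with a short script. This is only a sketch: the dictionary copies the nk/nq/nkf/nqf values from the namelist, the counts are full-grid totals before any symmetry reduction from mp_mesh_k, and the "even share" figure is a hypothetical back-of-the-envelope number that assumes the 396 GB file could be split perfectly across the 128 tasks, which is not something I know EPW does.

```python
import math

# Grid dimensions copied from the EPW namelist above.
grids = {
    "coarse k": (32, 32, 16),  # nk1, nk2, nk3
    "coarse q": (4, 4, 2),     # nq1, nq2, nq3
    "fine k": (64, 64, 32),    # nkf1, nkf2, nkf3
    "fine q": (16, 16, 8),     # nqf1, nqf2, nqf3
}

# Total number of points on each full grid (no symmetry reduction applied).
counts = {name: math.prod(dims) for name, dims in grids.items()}

for name, n in counts.items():
    print(f"{name} grid: {n} points")

# Values quoted in the post.
file_gb = 396   # size of the epmatwp file
n_tasks = 128   # MPI tasks in the cancelled job

# Hypothetical: share per task if the reads were split evenly.
# This is an assumption for illustration, not a description of EPW's I/O.
per_task_gb = file_gb / n_tasks
print(f"even share per task: {per_task_gb:.2f} GB")
```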
Thank you very much!
Best,
Mehmet