Server overload due to many tasks reading from a very large epmatwp file

Posted: Wed Jan 06, 2021 10:48 pm
by mdogan
Hello,

I'm running EPW on the TACC Frontera supercomputer. While I was running a q-point interpolation on 32 nodes and 128 processors, my job was canceled by the system admins due to "an OSS server overloading while reading a very large file from many tasks". The file in question is the epmatwp file, which is 396 GB. Until I figure out a way to avoid repeating the problem, my queue access is disabled.

Is there a way to divide this file up like the epb files? Or is there some other method so that not all the tasks read from the same file at the same time? I checked their "Managing I/O" webpage [https://portal.tacc.utexas.edu/tutorials/managingio] but I wasn't sure whether any of the suggestions there were applicable. I would greatly appreciate any suggestions.

My epw input file is as follows:

Code:

electron-phonon calculation
 &inputepw
    outdir = './'
    prefix = 'scf'
    dvscf_dir = './save'
    iverbosity = 1

    ep_coupling = .true.
    elph = .true.
    kmaps = .true.
    epbwrite = .false.
    epbread = .false.
    epwwrite = .true.
    epwread = .true.
    etf_mem = 1

    wannierize = .false.
    nbndsub = 24
    num_iter = 200
    dis_win_min = -20
    dis_win_max = 80
    dis_froz_min = -20
    dis_froz_max = 20
    proj(1) = 'H:s'
    wdata(1) = 'bands_plot = .true.'
    wdata(2) = 'begin kpoint_path'
    wdata(3) = 'Z  0.00  0.00  0.50  Y  0.50  -0.50  0.00'
    wdata(4) = 'Y  0.50  -0.50  0.00  H3  0.66666667  -0.33333333  0.50'
    wdata(5) = 'H3  0.66666667  -0.33333333  0.50  T  0.50  -0.50  0.50'
    wdata(6) = 'T  0.50  -0.50  0.50  Z  0.00  0.00  0.50'
    wdata(7) = 'Z  0.00  0.00  0.50  G  0.00  0.00  0.00'
    wdata(8) = 'G  0.00  0.00  0.00  R  0.00  -0.50  -0.50'
    wdata(9) = 'R  0.00  -0.50  -0.50  S  0.00  -0.50  0.00'
    wdata(10) = 'S  0.00  -0.50  0.00  H2  0.33333333  -0.66666667  0.50'
    wdata(11) = 'end kpoint_path'
    wdata(12) = 'bands_num_points = 80'
    wdata(13) = 'kmesh_tol = 0.0001'

    fsthick = 1
    degaussw = 0.1

    ephwrite = .true.
    eliashberg = .true.

    nsiter = 400
    conv_thr_iaxis = 1.0d-3
    wscut = 1.0

    temps = 20 180
    nstemp = 9

    muc = 0.10
    mp_mesh_k = .true.

    restart = .true.
    restart_step = 10

    nkf1 = 64
    nkf2 = 64
    nkf3 = 32
    nqf1 = 16
    nqf2 = 16
    nqf3 = 8

    nk1 = 32
    nk2 = 32
    nk3 = 16
    nq1 = 4
    nq2 = 4
    nq3 = 2

 /
Thank you very much!

Best,
Mehmet

Re: Server overload due to many tasks reading from a very large epmatwp file

Posted: Thu Jan 07, 2021 3:58 pm
by hlee
Dear Mehmet:

Unlike the prefix.epb* files, prefix.epmatwp is a single file. This removes the restriction that a restart must use the same number of cores, and the file is read and written with parallel I/O for efficiency, so prefix.epmatwp cannot be divided.

Instead, I would suggest that you use proper striping to distribute the prefix.epmatwp file across multiple OSTs (object storage targets), so that no single OST is stressed.

For details, please check the following page at https://frontera-portal.tacc.utexas.edu ... ide/files/ or contact the system admins of Frontera.
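
As a rough illustration of what striping involves on a Lustre file system, the sketch below only builds the commands and prints them; it does not run anything. `lfs setstripe` (with `-c` for stripe count and `-S` for stripe size) and `lfs getstripe` are standard Lustre tools, but everything else here is an assumption made for illustration: the helper name `stripe_commands` is hypothetical, the stripe count of 16 and stripe size of 16M are placeholder values rather than Frontera recommendations, and the file is assumed to be `scf.epmatwp` in `./` based on the `prefix` and `outdir` in the input above. Suitable values should come from the Frontera documentation or the system admins.

```python
import os
import shlex


def stripe_commands(directory, filename, stripe_count=16, stripe_size="16M"):
    """Build (but do not run) commands that re-stripe one large file.

    stripe_count and stripe_size are illustrative placeholders, not
    site-recommended values. A stripe_count of -1 asks Lustre to use
    all available OSTs.
    """
    if stripe_count != -1 and stripe_count < 1:
        raise ValueError("stripe_count must be a positive integer or -1")

    target = os.path.join(directory, filename)
    tmp = target + ".striped"  # hypothetical temporary name

    return [
        # Set the default layout for files created in this directory from now on.
        ["lfs", "setstripe", "-c", str(stripe_count), "-S", stripe_size, directory],
        # An existing file keeps its old layout, so copy it to pick up the new one.
        ["cp", target, tmp],
        # Replace the original with the re-striped copy.
        ["mv", tmp, target],
        # Inspect the resulting layout.
        ["lfs", "getstripe", target],
    ]


if __name__ == "__main__":
    for cmd in stripe_commands("./", "scf.epmatwp"):
        print(shlex.join(cmd))
```

Note that the copy step temporarily needs room for a second copy of the file. Setting the stripe layout on the output directory before the file is first written avoids the copy altogether.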

Sincerely,

H. Lee

Re: Server overload due to many tasks reading from a very large epmatwp file

Posted: Tue Jan 12, 2021 1:06 am
by mdogan
Dear H. Lee,

Thank you very much! After striping the large files, my access to the queues has been restored, and hopefully further issues will be avoided.

Best,
Mehmet