Server overload due to many tasks reading from a very large epmatwp file

Post here questions linked with issues encountered while running the EPW code

Moderator: stiwari

mdogan
Posts: 59
Joined: Thu Jun 18, 2020 5:59 pm
Affiliation: UC Berkeley

Server overload due to many tasks reading from a very large epmatwp file

Post by mdogan »

Hello,

I'm running EPW on the TACC Frontera supercomputer. While running a q-point interpolation on 32 nodes and 128 processors, my job was canceled by the system admins due to "an OSS server overloading while reading a very large file from many tasks". The file in question is the epmatwp file, whose size is 396 GB. Until I figure out a way to avoid repeating the problem, my queue access has been disabled.

Is there a way to divide this file up like the epb files? Or some other method, so that not all the tasks read from the same file at the same time? I checked their "Managing I/O" webpage [https://portal.tacc.utexas.edu/tutorials/managingio], but I wasn't sure whether any of the suggestions there were applicable. I would greatly appreciate any suggestions.

My epw input file is as follows:

Code:

electron-phonon calculation
 &inputepw
    outdir = './'
    prefix = 'scf'
    dvscf_dir = './save'
    iverbosity = 1

    ep_coupling = .true.
    elph = .true.
    kmaps = .true.
    epbwrite = .false.
    epbread = .false.
    epwwrite = .true.
    epwread = .true.
    etf_mem = 1

    wannierize = .false.
    nbndsub = 24
    num_iter = 200
    dis_win_min = -20
    dis_win_max = 80
    dis_froz_min = -20
    dis_froz_max = 20
    proj(1) = 'H:s'
    wdata(1) = 'bands_plot = .true.'
    wdata(2) = 'begin kpoint_path'
    wdata(3) = 'Z  0.00  0.00  0.50  Y  0.50  -0.50  0.00'
    wdata(4) = 'Y  0.50  -0.50  0.00  H3  0.66666667  -0.33333333  0.50'
    wdata(5) = 'H3  0.66666667  -0.33333333  0.50  T  0.50  -0.50  0.50'
    wdata(6) = 'T  0.50  -0.50  0.50  Z  0.00  0.00  0.50'
    wdata(7) = 'Z  0.00  0.00  0.50  G  0.00  0.00  0.00'
    wdata(8) = 'G  0.00  0.00  0.00  R  0.00  -0.50  -0.50'
    wdata(9) = 'R  0.00  -0.50  -0.50  S  0.00  -0.50  0.00'
    wdata(10) = 'S  0.00  -0.50  0.00  H2  0.33333333  -0.66666667  0.50'
    wdata(11) = 'end kpoint_path'
    wdata(12) = 'bands_num_points = 80'
    wdata(13) = 'kmesh_tol = 0.0001'

    fsthick = 1
    degaussw = 0.1

    ephwrite = .true.
    eliashberg = .true.

    nsiter = 400
    conv_thr_iaxis = 1.0d-3
    wscut = 1.0

    temps = 20 180
    nstemp = 9

    muc = 0.10
    mp_mesh_k = .true.

    restart = .true.
    restart_step = 10

    nkf1 = 64
    nkf2 = 64
    nkf3 = 32
    nqf1 = 16
    nqf2 = 16
    nqf3 = 8

    nk1 = 32
    nk2 = 32
    nk3 = 16
    nq1 = 4
    nq2 = 4
    nq3 = 2

 /
Thank you very much!

Best,
Mehmet
hlee
Posts: 415
Joined: Thu Aug 03, 2017 12:24 pm
Affiliation: The University of Texas at Austin

Re: Server overload due to many tasks reading from a very large epmatwp file

Post by hlee »

Dear Mehmet:

Unlike the prefix.epb* files, the prefix.epmatwp file is a single file: this removes the restriction that a restart must use the same number of cores, and the file is read and written with parallel I/O for efficiency. For these reasons, the prefix.epmatwp file cannot be divided.
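
As a quick check, you can see how the file is currently laid out on Frontera's Lustre file system with the standard lfs tool. A minimal sketch, assuming the file is named after prefix = 'scf' from the input above:

Code:

# Show how the file is distributed across OSTs (object storage targets).
# A stripe_count of 1 means the whole 396 GB file lives on a single OST,
# so every task's read hits the same storage server.
lfs getstripe ./scf.epmatwp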

Instead, I would suggest that you set up proper striping to distribute the prefix.epmatwp file across multiple OSTs (object storage targets), so that no single OST is stressed.
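
A minimal sketch of what that could look like, assuming a Lustre file system; the stripe count of 8, the 4 MiB stripe size, and the directory name striped_dir below are all illustrative, not Frontera's official recommendations. Note that changing the striping of a directory only affects files created in it afterwards, so an existing file has to be copied (or migrated) to pick up the new layout:

Code:

# Stripe new files in this directory across 8 OSTs with a 4 MiB stripe size
# (values here are placeholders; check Frontera's striping guidance).
lfs setstripe -c 8 -S 4m ./striped_dir

# An existing file keeps its old layout; copying it into the striped
# directory lets the copy inherit the new striping.
cp scf.epmatwp striped_dir/scf.epmatwp

# Alternatively, on recent Lustre versions, restripe the file in place.
lfs migrate -c 8 scf.epmatwp

# Verify the resulting layout.
lfs getstripe striped_dir/scf.epmatwp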

For details, please check the following page at https://frontera-portal.tacc.utexas.edu ... ide/files/ or contact the system admins of Frontera.

Sincerely,

H. Lee
mdogan
Posts: 59
Joined: Thu Jun 18, 2020 5:59 pm
Affiliation: UC Berkeley

Re: Server overload due to many tasks reading from a very large epmatwp file

Post by mdogan »

Dear H. Lee,

Thank you very much! After striping the large files, my access to the queues has been restored, and hopefully further issues will be avoided.

Best,
Mehmet