Possibly inefficient call to ZAXPY and proposed performance enhancement

Posted: Sun Jun 05, 2022 2:39 am
by dengtq
Dear EPW developers,

I would like to report a possibly inefficient call to ZAXPY in subroutine ephwan2blochp (and similarly in ephwan2blochp_mem) in wan2bloch.f90.

For example, as of QE-7.0 (and similarly in QE-7.1rc), at lines 1938-1939, for the case of use_ws == .true. and etf_mem == 1:

            CALL ZAXPY(nrr_k * 3 * nbnd, cfac(iw, na, ir), epmatw(iw, :, :, :), 1, &
                 eptmp(iw, :, :, 3 * (na - 1) + 1:3 * na), 1)
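The two array sections passed here are not contiguous in memory, so the compiler may create two temporary arrays and copy the data in and out around the call (this may depend on the system and compiler).

However, ZAXPY is already designed to handle such an operation without slicing: pass the first element of each section together with an increment, here dims, which must be the extent of the first array dimension for the two calls to be equivalent: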

            CALL ZAXPY(nrr_k * 3 * nbnd, cfac(iw, na, ir), epmatw(iw, 1, 1, 1), dims, &
                 eptmp(iw, 1, 1, 3 * (na - 1) + 1), dims)
which yields exactly the same results in my test.
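
Why the two calls touch the same elements can be checked with a little index arithmetic. The sketch below is Python rather than Fortran and is not EPW code. It assumes that epmatw has shape (dims, nbnd, nrr_k, 3), uses small hypothetical sizes, and introduces a hypothetical helper `offset` that returns the column-major position of an element:

```python
def offset(idx, shape):
    """0-based column-major (Fortran-order) offset of a 1-based index tuple."""
    off, stride = 0, 1
    for i, n in zip(idx, shape):
        off += (i - 1) * stride
        stride *= n
    return off


# Hypothetical small sizes, for illustration only
dims, nbnd, nrr_k = 4, 3, 5
shape = (dims, nbnd, nrr_k, 3)  # assumed shape of epmatw
iw = 2

# Offsets touched by the section epmatw(iw, :, :, :), in array element order
# (second index varies fastest, last index slowest)
slice_offsets = [
    offset((iw, j, k, l), shape)
    for l in range(1, 4)
    for k in range(1, nrr_k + 1)
    for j in range(1, nbnd + 1)
]

# Offsets touched by ZAXPY when given epmatw(iw, 1, 1, 1), n elements
# and an increment of dims
n = nrr_k * 3 * nbnd
start = offset((iw, 1, 1, 1), shape)
strided_offsets = [start + m * dims for m in range(n)]

same = slice_offsets == strided_offsets
print(same)
```

The same argument applies to eptmp, because the range 3 * (na - 1) + 1:3 * na in its last index is a contiguous block of three, so the strided walk that starts at eptmp(iw, 1, 1, 3 * (na - 1) + 1) covers exactly that section.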

By replacing all similar calls to ZAXPY, I achieved a speed-up of over 200% in my tests (the ephwann wall time dropped from about 878 s to about 292 s, roughly 3x). The tests ran on E5-2640 v4 @ 2.40GHz, 2 nodes, 32 cores per node, with the Intel compiler and MKL, on an FCC crystal with 3 atoms per unit cell interpolated to a 10^3 grid.

The interpolation run took about 13 minutes of CPU time before the modification:

     Electron-Phonon interpolation
     ephwann      :    792.80s CPU    878.24s WALL (       1 calls)
     ep-interp    :    775.42s CPU    860.44s WALL (     962 calls)

     DynW2B       :      0.49s CPU      0.56s WALL (     962 calls)
     HamW2B       :      0.72s CPU      0.84s WALL (    3854 calls)
     ephW2Bp      :    766.65s CPU    846.84s WALL (     962 calls)
     ephW2B       :      0.76s CPU      0.76s WALL (     408 calls)
     print_ibte   :      4.06s CPU      8.67s WALL (     962 calls)
     vmewan2bloch :      1.29s CPU      1.37s WALL (    1778 calls)
     vmewan2bloch :      1.29s CPU      1.37s WALL (    1778 calls)
and less than 4 minutes of CPU time after the modification:

     Electron-Phonon interpolation
     ephwann      :    216.97s CPU    291.82s WALL (       1 calls)
     ep-interp    :    199.92s CPU    274.59s WALL (     962 calls)

     DynW2B       :      0.46s CPU      0.51s WALL (     962 calls)
     HamW2B       :      0.68s CPU      0.73s WALL (    3854 calls)
     ephW2Bp      :    190.97s CPU    260.99s WALL (     962 calls)
     ephW2B       :      0.76s CPU      0.76s WALL (     408 calls)
     print_ibte   :      4.34s CPU      8.85s WALL (     962 calls)
     vmewan2bloch :      1.27s CPU      1.28s WALL (    1778 calls)
     vmewan2bloch :      1.27s CPU      1.28s WALL (    1778 calls)
The resulting mobilities are identical.
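
For reference, the speed-up factors implied by the two logs above can be worked out directly. This is only arithmetic on the reported timings, written as a short Python sketch. The dictionary keys are labels chosen here for illustration:

```python
# Timings in seconds copied from the two logs above: (before, after)
timings = {
    "ephwann CPU": (792.80, 216.97),
    "ephwann WALL": (878.24, 291.82),
    "ephW2Bp CPU": (766.65, 190.97),
    "ephW2Bp WALL": (846.84, 260.99),
}

# Speed-up factor = time before / time after
speedup = {name: before / after for name, (before, after) in timings.items()}

for name, factor in speedup.items():
    print(f"{name}: {factor:.2f}x")
```

The wall-time factors are the ones that matter for turnaround. The CPU-time factors come out somewhat larger in both rows.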

Since this subroutine (ephW2Bp) accounts for most of the CPU time in my test (766.65 s of 792.80 s before the change), and similarly in other cases using much denser fine grids, I believe the modification is important at least for similar cases. Further tests might be needed for different systems.

Thanks and Regards,
Tianqi Deng
dengtq@zju.edu.cn
Principal Investigator, HIC and MSE, Zhejiang University

Re: Possibly inefficient call to ZAXPY and proposed performance enhancement

Posted: Sun Jun 05, 2022 4:23 pm
by hlee
Dear Tianqi Deng:

Thank you very much for your fix.

Your fix clearly eliminates copy-in/copy-out here.

Although ZAXPY then has to work with a non-unit stride, dims is usually small while the temporary arrays needed for copy-in/copy-out are large, so your fix is more efficient.
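
The trade-off can be illustrated with a simplified model, again in Python rather than Fortran. It is only a sketch: the function names, sizes, and values are hypothetical, and the count of copied elements assumes that the x section is copied in and the y section is copied in and out. Actual compiler behaviour may differ, and a compiler may also copy x back.

```python
def strided_axpy(n, a, x, x0, incx, y, y0, incy):
    """In-place update y[y0 + i*incy] += a * x[x0 + i*incx] for i in range(n)."""
    for i in range(n):
        y[y0 + i * incy] += a * x[x0 + i * incx]
    return 0  # no elements copied to temporaries


def copy_in_out_axpy(n, a, x, x0, incx, y, y0, incy):
    """Same update through contiguous temporaries; returns elements copied."""
    tx = [x[x0 + i * incx] for i in range(n)]  # copy-in of the x section
    ty = [y[y0 + i * incy] for i in range(n)]  # copy-in of the y section
    for i in range(n):                         # unit-stride update
        ty[i] += a * tx[i]
    for i in range(n):                         # copy-out of the y section
        y[y0 + i * incy] = ty[i]
    return 3 * n


# Hypothetical sizes and data, for illustration only
dims, n = 4, 6
x = [complex(k, -k) for k in range(dims * n)]
y1 = [complex(0.5 * k, 1.0) for k in range(dims * n)]
y2 = list(y1)
a = complex(0.0, 1.0)

copied_direct = strided_axpy(n, a, x, 1, dims, y1, 1, dims)
copied_temp = copy_in_out_axpy(n, a, x, 1, dims, y2, 1, dims)
print(y1 == y2, copied_direct, copied_temp)
```

Both routes give the same result. In this model the temporary route moves 3 * n extra elements per call, with n = nrr_k * 3 * nbnd in the original code.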

Sincerely,

H. Lee

Re: Possibly inefficient call to ZAXPY and proposed performance enhancement

Posted: Mon Jun 06, 2022 1:23 am
by dengtq
Dear Dr Lee,

Thank you for confirming the fix, and I'm more than glad to contribute.

Attached is my own modified wan2bloch.f90 for your kind reference.

Best Regards,
Tianqi Deng
dengtq@zju.edu.cn
Principal Investigator, HIC and MSE, Zhejiang University