Possibly inefficient call to ZAXPY and proposed performance enhancement
Posted: Sun Jun 05, 2022 2:39 am
Dear EPW developers,
I would like to report a possibly inefficient call to ZAXPY in the subroutine ephwan2blochp (and similarly in ephwan2blochp_mem) in wan2bloch.f90.
For example, as of QE-7.0 (and similarly in QE-7.1rc), at lines 1938-1939, for the case of use_ws == .true. and etf_mem == 1:
Code:
CALL ZAXPY(nrr_k * 3 * nbnd, cfac(iw, na, ir), epmatw(iw, :, :, :), 1, &
           eptmp(iw, :, :, 3 * (na - 1) + 1:3 * na), 1)
Because these two array slices are not contiguous in memory, the compiler may create temporary copies for the call (this may depend on the system and compiler). However, ZAXPY is already designed to handle such a strided operation through its increment arguments, without any slicing:
Code:
CALL ZAXPY(nrr_k * 3 * nbnd, cfac(iw, na, ir), epmatw(iw, 1, 1, 1), dims, &
           eptmp(iw, 1, 1, 3 * (na - 1) + 1), dims)
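For completeness, here is a minimal, self-contained sketch of why the two forms are equivalent (the standard BLAS interface is ZAXPY(n, za, zx, incx, zy, incy)). The program name, array names, and extents below are made up purely for illustration and are not the EPW arrays; it only needs to be linked against a BLAS library such as MKL:
Code:
! Minimal sketch: a strided ZAXPY call visits exactly the same memory
! locations, in the same order, as the equivalent call on array slices.
! All names and extents here are hypothetical, for illustration only.
PROGRAM zaxpy_stride_demo
  IMPLICIT NONE
  INTEGER, PARAMETER :: dp = KIND(1.0d0)
  INTEGER, PARAMETER :: dims = 4, nbnd = 3, nrr = 5   ! made-up extents
  COMPLEX(KIND=dp)   :: x(dims, nbnd, nrr)
  COMPLEX(KIND=dp)   :: y_slice(dims, nbnd, nrr), y_stride(dims, nbnd, nrr)
  COMPLEX(KIND=dp)   :: alpha
  INTEGER            :: i, j, k, iw
  !
  ! Fill x with arbitrary but reproducible values
  DO k = 1, nrr
    DO j = 1, nbnd
      DO i = 1, dims
        x(i, j, k) = CMPLX(i + 10 * j, k, KIND=dp)
      ENDDO
    ENDDO
  ENDDO
  y_slice  = (0.0d0, 0.0d0)
  y_stride = (0.0d0, 0.0d0)
  alpha    = (0.5d0, -2.0d0)
  !
  DO iw = 1, dims
    ! 1) Array slices with unit increment: the slices are non-contiguous,
    !    so the compiler may generate temporary copy-in/copy-out arrays.
    CALL ZAXPY(nbnd * nrr, alpha, x(iw, :, :), 1, y_slice(iw, :, :), 1)
    ! 2) First element plus the leading dimension as the increment:
    !    same elements, same order, no temporaries.
    CALL ZAXPY(nbnd * nrr, alpha, x(iw, 1, 1), dims, y_stride(iw, 1, 1), dims)
  ENDDO
  !
  PRINT *, 'max |difference| = ', MAXVAL(ABS(y_slice - y_stride))
END PROGRAM zaxpy_stride_demo
This should print a difference of zero: a stride equal to the leading dimension walks through exactly the elements that the slice over the trailing dimensions selects.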
The strided form yields exactly the same results in my tests. By replacing all similar calls to ZAXPY, I achieved a speed-up of over 200% (E5-2640 v4 @ 2.40 GHz, 2 nodes, 32 cores per node, Intel compiler + MKL). The tests were on an FCC crystal with 3 atoms per unit cell, interpolated onto a 10^3 grid.
The interpolation run took about 13 minutes before the modification:
Code:
Electron-Phonon interpolation
ephwann      :    792.80s CPU    878.24s WALL (       1 calls)
ep-interp    :    775.42s CPU    860.44s WALL (     962 calls)
DynW2B       :      0.49s CPU      0.56s WALL (     962 calls)
HamW2B       :      0.72s CPU      0.84s WALL (    3854 calls)
ephW2Bp      :    766.65s CPU    846.84s WALL (     962 calls)
ephW2B       :      0.76s CPU      0.76s WALL (     408 calls)
print_ibte   :      4.06s CPU      8.67s WALL (     962 calls)
vmewan2bloch :      1.29s CPU      1.37s WALL (    1778 calls)
vmewan2bloch :      1.29s CPU      1.37s WALL (    1778 calls)
and less than 4 minutes of CPU time after the modification:
Code:
Electron-Phonon interpolation
ephwann      :    216.97s CPU    291.82s WALL (       1 calls)
ep-interp    :    199.92s CPU    274.59s WALL (     962 calls)
DynW2B       :      0.46s CPU      0.51s WALL (     962 calls)
HamW2B       :      0.68s CPU      0.73s WALL (    3854 calls)
ephW2Bp      :    190.97s CPU    260.99s WALL (     962 calls)
ephW2B       :      0.76s CPU      0.76s WALL (     408 calls)
print_ibte   :      4.34s CPU      8.85s WALL (     962 calls)
vmewan2bloch :      1.27s CPU      1.28s WALL (    1778 calls)
vmewan2bloch :      1.27s CPU      1.28s WALL (    1778 calls)
The resulting mobilities are identical.
Since this subroutine (ephW2Bp) takes the majority of the CPU time in my test (and similarly in other cases using much denser fine grids), I believe this modification is very important, at least for similar cases. Further tests might be needed for other systems.
Thanks and Regards,
Tianqi Deng
dengtq@zju.edu.cn
Principal Investigator, HIC and MSE, Zhejiang University