I would like to report a possibly inefficient call to ZAXPY in the subroutine ephwan2blochp (and similarly in ephwan2blochp_mem) in wan2bloch.f90.
For example, as of QE-7.0 (and similarly in QE-7.1rc), lines 1938-1939 read as follows for the case use_ws == .true. and etf_mem == 1:
CALL ZAXPY(nrr_k * 3 * nbnd, cfac(iw, na, ir), epmatw(iw, :, :, :), 1, &
eptmp(iw, :, :, 3 * (na - 1) + 1:3 * na), 1)
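However, ZAXPY is already designed to handle such strided operations without array slicing. Passing the first element together with an increment of dims should avoid the temporary copies that the compiler presumably creates for the non-contiguous slices: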
CALL ZAXPY(nrr_k * 3 * nbnd, cfac(iw, na, ir), epmatw(iw, 1, 1, 1), dims, &
eptmp(iw, 1, 1, 3 * (na - 1) + 1), dims)
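To make the equivalence of the two calls concrete, here is a minimal stand-alone sketch. The array shapes and sizes are toy assumptions (not the actual EPW dimensions), the names eptmp_slice and eptmp_stride are made up for the comparison, and the explicit interface is only there so the example compiles cleanly on its own. It needs to be linked against a BLAS library (for example with -lblas or MKL).

```fortran
PROGRAM zaxpy_stride_demo
  ! Toy check that the strided ZAXPY call updates the same elements as the
  ! sliced call. Shapes and sizes are illustrative assumptions, not EPW's.
  IMPLICIT NONE
  INTEGER, PARAMETER :: dp = KIND(1.0d0)
  INTEGER, PARAMETER :: dims = 4, nbnd = 4, nrr_k = 5, nat = 2
  COMPLEX(dp) :: epmatw(dims, nbnd, nrr_k, 3)
  COMPLEX(dp) :: eptmp_slice(dims, nbnd, nrr_k, 3 * nat)
  COMPLEX(dp) :: eptmp_stride(dims, nbnd, nrr_k, 3 * nat)
  COMPLEX(dp) :: cfac
  REAL(dp) :: re(dims, nbnd, nrr_k, 3), im(dims, nbnd, nrr_k, 3)
  INTEGER :: iw, na
  ! Explicit interface for the reference BLAS routine (assumed-size arrays),
  ! so that both an array section and a single element are valid arguments.
  INTERFACE
    SUBROUTINE ZAXPY(n, za, zx, incx, zy, incy)
      INTEGER, INTENT(IN) :: n, incx, incy
      COMPLEX(KIND(1.0d0)), INTENT(IN) :: za, zx(*)
      COMPLEX(KIND(1.0d0)), INTENT(INOUT) :: zy(*)
    END SUBROUTINE ZAXPY
  END INTERFACE
  !
  ! Fill the input with random complex numbers and start both outputs
  ! from the same non-zero values.
  CALL RANDOM_NUMBER(re)
  CALL RANDOM_NUMBER(im)
  epmatw = CMPLX(re, im, KIND = dp)
  eptmp_slice = (1.0_dp, -1.0_dp)
  eptmp_stride = eptmp_slice
  !
  DO na = 1, nat
    DO iw = 1, dims
      cfac = CMPLX(0.5_dp * iw, -0.25_dp * na, KIND = dp)
      ! Sliced form: the sections are non-contiguous, so the compiler has to
      ! pack them into contiguous temporaries before the call (increment 1).
      CALL ZAXPY(nrr_k * 3 * nbnd, cfac, epmatw(iw, :, :, :), 1, &
                 eptmp_slice(iw, :, :, 3 * (na - 1) + 1:3 * na), 1)
      ! Strided form: pass the first element and step through memory with
      ! the leading dimension as increment, so no temporaries are needed.
      CALL ZAXPY(nrr_k * 3 * nbnd, cfac, epmatw(iw, 1, 1, 1), dims, &
                 eptmp_stride(iw, 1, 1, 3 * (na - 1) + 1), dims)
    END DO
  END DO
  !
  ! Should be zero, or at rounding level at most.
  PRINT *, 'max |difference| =', MAXVAL(ABS(eptmp_slice - eptmp_stride))
END PROGRAM zaxpy_stride_demo
```

The increment equals the first dimension because Fortran stores arrays in column-major order, so consecutive elements of epmatw(iw, :, :, :) are exactly dims entries apart in memory.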
By replacing all similar calls to ZAXPY, I achieved a speed-up of over 200% in my tests (E5-2640 v4 @ 2.40GHz, 2 nodes, 32 cores per node, Intel compiler + MKL). The tests were on an FCC crystal with 3 atoms per unit cell, interpolated to a 10^3 grid.
The interpolation run took about 13 minutes before modification:
Electron-Phonon interpolation
ephwann : 792.80s CPU 878.24s WALL ( 1 calls)
ep-interp : 775.42s CPU 860.44s WALL ( 962 calls)
DynW2B : 0.49s CPU 0.56s WALL ( 962 calls)
HamW2B : 0.72s CPU 0.84s WALL ( 3854 calls)
ephW2Bp : 766.65s CPU 846.84s WALL ( 962 calls)
ephW2B : 0.76s CPU 0.76s WALL ( 408 calls)
print_ibte : 4.06s CPU 8.67s WALL ( 962 calls)
vmewan2bloch : 1.29s CPU 1.37s WALL ( 1778 calls)
vmewan2bloch : 1.29s CPU 1.37s WALL ( 1778 calls)
After the modification:
Electron-Phonon interpolation
ephwann : 216.97s CPU 291.82s WALL ( 1 calls)
ep-interp : 199.92s CPU 274.59s WALL ( 962 calls)
DynW2B : 0.46s CPU 0.51s WALL ( 962 calls)
HamW2B : 0.68s CPU 0.73s WALL ( 3854 calls)
ephW2Bp : 190.97s CPU 260.99s WALL ( 962 calls)
ephW2B : 0.76s CPU 0.76s WALL ( 408 calls)
print_ibte : 4.34s CPU 8.85s WALL ( 962 calls)
vmewan2bloch : 1.27s CPU 1.28s WALL ( 1778 calls)
vmewan2bloch : 1.27s CPU 1.28s WALL ( 1778 calls)
Since this subroutine (ephW2Bp) takes the majority of CPU time in my test (and similarly in other cases using much denser fine grids), I believe the modification is very important at least for similar cases. Further tests might be needed for different systems.
Thanks and Regards,
Tianqi Deng
Principal Investigator, HIC and MSE, Zhejiang University