Hello Nandan,
I cannot see your email.
In any case, you can post your error message here even if it's long.
Just use the [code] tag and it will be displayed in a smaller box with a scroll bar.
Best,
Samuel
segmentation fault for high q-grid
Moderator: stiwari
Re: segmentation fault for high q-grid
Prof. Samuel Poncé
Chercheur qualifié F.R.S.-FNRS / Professeur UCLouvain
Institute of Condensed Matter and Nanosciences
UCLouvain, Belgium
Web: https://www.samuelponce.com
Re: segmentation fault for high q-grid
Hi Samuel,
The error message is just multiple repetitions of the following:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
epw.x 0000000000E261D1 Unknown Unknown Unknown
epw.x 0000000000E24927 Unknown Unknown Unknown
epw.x 0000000000DC46B4 Unknown Unknown Unknown
epw.x 0000000000DC44C6 Unknown Unknown Unknown
epw.x 0000000000D57D6F Unknown Unknown Unknown
epw.x 0000000000D5EF9D Unknown Unknown Unknown
libpthread.so.0 00002AEED9CD2710 Unknown Unknown Unknown
epw.x 000000000096DFF7 Unknown Unknown Unknown
epw.x 00000000004631AD Unknown Unknown Unknown
epw.x 0000000000451C44 Unknown Unknown Unknown
epw.x 0000000000412A49 Unknown Unknown Unknown
epw.x 0000000000411CBE Unknown Unknown Unknown
libc.so.6 00002AEED9EFED5D Unknown Unknown Unknown
epw.x 0000000000411BC9 Unknown Unknown Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
epw.x 0000000000E261D1 Unknown Unknown Unknown
epw.x 0000000000E24927 Unknown Unknown Unknown
epw.x 0000000000DC46B4 Unknown Unknown Unknown
epw.x 0000000000DC44C6 Unknown Unknown Unknown
epw.x 0000000000D57D6F Unknown Unknown Unknown
epw.x 0000000000D5EF9D Unknown Unknown Unknown
libpthread.so.0 00002AD492260710 Unknown Unknown Unknown
epw.x 000000000096DFF7 Unknown Unknown Unknown
gan-work.e38018405
Regards,
Nandan.
Re: segmentation fault for high q-grid
The epw.out ends with:
No wavefunction gauge setting applied

-------------------------------------------------------------------
Using ./gan.ukk from disk
-------------------------------------------------------------------

Using kmap and kgmap from disk

Do not need to read .epb files; read .fmt files

band disentanglement is used: nbndsub = 16

Reading Hamiltonian, Dynamical matrix and EP vertex in Wann rep from file

Finished reading Wann rep data from file

Using uniform q-mesh: 40 40 40
Size of q point mesh for interpolation: 64000
Using uniform k-mesh: 30 30 30
Size of k point mesh for interpolation: 54000
Max number of k points per pool: 450
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[16793,1],108]
Exit code: 174
--------------------------------------------------------------------------
Not sure if that helps.
Nandan.
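To make the mpiexec message above a little more usable, here is a minimal Python sketch that pulls out the first failing process and its exit code. It assumes that the last number in the Open MPI process name (108 here) is the MPI rank; the names `first_failed_process` and `LOG_TAIL` are made up for this illustration. The exit code 174 matches the `forrtl: severe (174)` lines in the error file.

```python
import re

# Tail of the epw.out excerpt quoted in the post above.
LOG_TAIL = """\
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[16793,1],108]
  Exit code:    174
"""

# Assumption: the third number in the process name is the MPI rank.
PROCESS_RE = re.compile(r"Process name:\s*\[\[(\d+),(\d+)\],(\d+)\]")
EXIT_RE = re.compile(r"Exit code:\s*(\d+)")


def first_failed_process(text):
    """Return (rank, exit_code) from an mpiexec abort message, or None if absent."""
    proc = PROCESS_RE.search(text)
    code = EXIT_RE.search(text)
    if proc is None or code is None:
        return None
    return int(proc.group(3)), int(code.group(1))


rank, exit_code = first_failed_process(LOG_TAIL)
print(f"first failing rank: {rank}, exit code: {exit_code}")
```

Knowing which rank failed first can help when checking the memory of the node that rank was running on.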
Re: segmentation fault for high q-grid
Hello,
What is the content of the file "gan-work.e38018405"?
Is it what you showed above?
If so, it indeed does not give much information.
Is this crash specific to this calculation, or do other restarts work?
Cheers,
Samuel
Prof. Samuel Poncé
Chercheur qualifié F.R.S.-FNRS / Professeur UCLouvain
Institute of Condensed Matter and Nanosciences
UCLouvain, Belgium
Web: https://www.samuelponce.com
Re: segmentation fault for high q-grid
Yes, it is the gan-work.e38018405 file. This is happening all the time. Consider the following example, which is the latest restart to fail.
The input file is:
========================================
kmaps = .true.
etf_mem = .false.
epbwrite = .false.
epbread = .false.
epwwrite = .false.
epwread = .true.
nbndsub = 16
nbndskip = 0
wannierize = .false.
num_iter = 5000
iprint = 2
dis_froz_max= 17.5
dis_froz_min= 10
proj(1) = 'random'
proj(2) = 'N:sp3'
iverbosity = 3
elecselfen = .true.
phonselfen = .false.
parallel_k = .true.
parallel_q = .false.
fsthick = 1.d10
eptemp = 300.d0
degaussw = 0.002
dvscf_dir = '/mnt/home/tandonna/work/GaN_666_qmesh/phonons/inp/save'
filukk = './gan.ukk'
nk1 = 6
nk2 = 6
nk3 = 6
nq1 = 6
nq2 = 6
nq3 = 6
nkf1 = 30
nkf2 = 30
nkf3 = 30
nqf1 = 35
nqf2 = 35
nqf3 = 35
========================================
This was right after a calculation with the following fine q-mesh finished:
nqf = 30x30x30
No changes were made to the submit script. Again, the error message does not give much information.
epw.out ends with:
======================================
Using uniform q-mesh: 35 35 35
Size of q point mesh for interpolation: 42875
Using uniform k-mesh: 30 30 30
Size of k point mesh for interpolation: 54000
Max number of k points per pool: 676
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[48056,1],8]
Exit code: 174
--------------------------------------------------------------------------
======================================
and the error file has:
======================================
Error getting SCIF driver version
Inactive Modules:
1) Boost 2) R 3) TBB
The following have been reloaded with a version change:
1) OpenMPI/1.4.3 => OpenMPI/1.8.3
The following have been reloaded with a version change:
1) OpenMPI/1.8.3 => OpenMPI/1.10.0
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
epw.x 0000000000E261D1 Unknown Unknown Unknown
epw.x 0000000000E24927 Unknown Unknown Unknown
epw.x 0000000000DC46B4 Unknown Unknown Unknown
epw.x 0000000000DC44C6 Unknown Unknown Unknown
epw.x 0000000000D57D6F Unknown Unknown Unknown
epw.x 0000000000D5EF9D Unknown Unknown Unknown
libpthread.so.0 00002AC950C92710 Unknown Unknown Unknown
epw.x 000000000096DFF7 Unknown Unknown Unknown
epw.x 00000000004631AD Unknown Unknown Unknown
epw.x 0000000000451C44 Unknown Unknown Unknown
epw.x 0000000000412A49 Unknown Unknown Unknown
epw.x 0000000000411CBE Unknown Unknown Unknown
libc.so.6 00002AC950EBED5D Unknown Unknown Unknown
epw.x 0000000000411BC9 Unknown Unknown Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
epw.x 0000000000E261D1 Unknown Unknown Unknown
epw.x 0000000000E24927 Unknown Unknown Unknown
epw.x 0000000000DC46B4 Unknown Unknown Unknown
epw.x 0000000000DC44C6 Unknown Unknown Unknown
epw.x 0000000000D57D6F Unknown Unknown Unknown
epw.x 0000000000D5EF9D Unknown Unknown Unknown
libpthread.so.0 00002B2F41FA8710 Unknown Unknown Unknown
epw.x 000000000096DFF7 Unknown Unknown Unknown
epw.x 00000000004631AD Unknown Unknown Unknown
epw.x 0000000000451C44 Unknown Unknown Unknown
epw.x 0000000000412A49 Unknown Unknown Unknown
epw.x 0000000000411CBE Unknown Unknown Unknown
libc.so.6 00002B2F421D4D5D Unknown Unknown Unknown
epw.x 0000000000411BC9 Unknown Unknown Unknown
======================================
Nothing else is printed. Anything that I should try?
Regards,
Nandan.
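As a quick sanity check on the mesh sizes quoted in this post, the fine-grid point counts can be recomputed from the input values. This is only a sketch: `mesh_points` is a hypothetical helper, and the reading that the reported k-mesh size counts both k and k+q points is an assumption, not something stated in the output.

```python
def mesh_points(n1, n2, n3):
    """Number of points in a uniform n1 x n2 x n3 mesh."""
    return n1 * n2 * n3


# Values taken from the input file above.
nqf_points = mesh_points(35, 35, 35)   # nqf1, nqf2, nqf3
nkf_points = mesh_points(30, 30, 30)   # nkf1, nkf2, nkf3

# Values reported in epw.out.
reported_q = 42875
reported_k = 54000

print("q-mesh points:", nqf_points, "reported:", reported_q)
print("k-mesh points:", nkf_points, "reported:", reported_k)
# Assumption: the factor below reflects counting k and k+q points together.
print("reported_k / k-mesh points =", reported_k / nkf_points)
```

The q-mesh count agrees with the log (35^3 = 42875), while the reported k-mesh size is exactly twice 30^3 = 27000. Going from the 30x30x30 q-mesh that worked to 35x35x35 increases the number of q points by a factor of about 1.6.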
Re: segmentation fault for high q-grid
Hello,
Not that it should be a problem, but I notice that your fsthick = 1.d10 is very large.
Does the problem appear with small grids? If those work, it could be a memory issue.
What I usually do in that case is run an interactive job.
You can then launch the calculation interactively and monitor the memory usage with "top".
The crash should be right at the beginning, so it's easy to reproduce.
Best,
Samuel
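If watching "top" by hand is inconvenient, a small Python wrapper can report the peak memory of the launched command after it exits. This is a rough sketch under stated assumptions: it relies on the standard `resource` module (Unix only), `run_and_report_peak_rss` is a hypothetical helper name, and the demo command is a stand-in Python one-liner, not an actual EPW run. The value returned is the peak resident set size of the largest child process that has been waited for, in kilobytes on Linux, so it reflects a single process and not the sum over all MPI ranks.

```python
import resource
import subprocess
import sys


def run_and_report_peak_rss(cmd):
    """Run cmd to completion and return (exit_code, peak_rss).

    peak_rss is ru_maxrss for waited-for children of this process:
    kilobytes on Linux, bytes on macOS.
    """
    completed = subprocess.run(cmd)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return completed.returncode, usage.ru_maxrss


# Stand-in command that allocates and fills roughly 50 MB.
# In practice this would be replaced by the actual launch command.
demo_cmd = [sys.executable, "-c", "buf = b'x' * 50_000_000"]
exit_code, peak_rss = run_and_report_peak_rss(demo_cmd)
print(f"exit code: {exit_code}, peak RSS of largest child: {peak_rss}")
```

Comparing the reported peak for a small grid and for a larger one that still runs gives a rough idea of how memory grows before the crash.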
Prof. Samuel Poncé
Chercheur qualifié F.R.S.-FNRS / Professeur UCLouvain
Institute of Condensed Matter and Nanosciences
UCLouvain, Belgium
Web: https://www.samuelponce.com
Re: segmentation fault for high q-grid
Thanks for the reply. I got busy with other things and have not had time to look back at this.
I will try this out soon.
Regards,
Nandan.