severe (174): SIGSEGV, segmentation fault occurred. libpthread
Dear all,
I know that this error (the one I am most afraid of getting) can be caused by many things, so I will try to give as much information as possible. I have been working with ROMS for a while, and I am now using ROMS/TOMS version 3.7, revision 921. After the latest updates, when I try to run the model I get:
--------------------------------------------------------------------------------
Model Input Parameters: ROMS/TOMS version 3.7
Wednesday - September 19, 2018 - 5:10:14 PM
--------------------------------------------------------------------------------
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
oceanM 00000000008544E5 Unknown Unknown Unknown
oceanM 0000000000852107 Unknown Unknown Unknown
oceanM 0000000000801784 Unknown Unknown Unknown
oceanM 0000000000801596 Unknown Unknown Unknown
oceanM 00000000007B40A6 Unknown Unknown Unknown
oceanM 00000000007B7CA0 Unknown Unknown Unknown
libpthread.so.0 00007F9460F46790 Unknown Unknown Unknown
oceanM 000000000086E313 Unknown Unknown Unknown
oceanM 00000000007FCE18 Unknown Unknown Unknown
oceanM 000000000041A869 Unknown Unknown Unknown
oceanM 00000000004249A2 Unknown Unknown Unknown
oceanM 0000000000412CEC Unknown Unknown Unknown
oceanM 000000000040BAD2 Unknown Unknown Unknown
oceanM 000000000040B59C Unknown Unknown Unknown
oceanM 000000000040B45E Unknown Unknown Unknown
libc.so.6 00007F946093DD5D Unknown Unknown Unknown
oceanM 000000000040B369 Unknown Unknown Unknown
I am able to run the model with an older ROMS revision (ROMS/TOMS version 3.7, SVN revision 836M). I have checked with "ncdump -k" that the input files are NetCDF-4; it returns "netCDF-4" for all of them.
I am using ifort to compile the code. I am sorry, but I am not able to use another compiler because I am not the admin of the system. The build script is set up as follows:
setenv USE_MPI on # distributed-memory parallelism
# setenv USE_MPIF90 on # compile with mpif90 script
#setenv which_MPI mpich # compile with MPICH library
setenv which_MPI mpich2 # compile with MPICH2 library
## setenv which_MPI openmpi # compile with OpenMPI library
#setenv USE_OpenMP on # shared-memory parallelism
setenv FORT ifort
#setenv FORT gfortran
#setenv FORT pgi
#setenv USE_DEBUG on # use Fortran debugging flags
setenv USE_LARGE on # activate 64-bit compilation
setenv USE_NETCDF4 on # compile with NetCDF-4 library
setenv USE_PARALLEL_IO on # Parallel I/O with NetCDF-4/HDF5
I get the same result running the serial executable (oceanS). When I tried to activate USE_DEBUG, I got this error:
ld: cannot find -ldl
So I am not able to tell exactly which file is getting me into trouble.
I have been able to run the upwelling test case in parallel with MPI, so I am fairly sure that my problem is related to the NetCDF files I am using.
I would really appreciate it if you could point out the next steps to follow.
Thanks a lot,
-Francisco
Re: severe (174): SIGSEGV, segmentation fault occurred. libpthread
I have a similar problem and am running with gfortran for one domain and with an old code for the other domain. Sorry I don’t have a third fix.
Re: severe (174): SIGSEGV, segmentation fault occurred. libpthread
Thanks a lot, Kate. Let's see if someone can give us a clue about what's happening. Meanwhile, I will try to use the old code as you suggested.
- arango (Site Admin)
Re: severe (174): SIGSEGV, segmentation fault occurred. libpthread
Nowadays, severe segmentation errors are usually associated with the stack size, which is used for allocating automatic arrays. They are allocated on the stack or on the heap according to your choice of compiler options. I mentioned this in the last trac ticket.
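For illustration, here is a minimal, generic Fortran sketch of an automatic array; it is not ROMS source code, and the program name, array name, and sizes are made up for the example. With ifort's defaults such arrays live on the stack, so for large grids they can overflow the stack limit and abort with forrtl severe (174); compiling with -heap-arrays or raising the stack limit avoids that.
Code: Select all
! Generic illustration only, not ROMS source: "work" is an automatic
! array sized from the dummy arguments.  With ifort defaults it is
! placed on the stack; a 332x332x10 double-precision array is ~8.8 MB,
! and several of them per routine can exceed a small stack limit and
! abort with "forrtl: severe (174): SIGSEGV".  Compiling with
! -heap-arrays, or raising the limit with "ulimit -s unlimited" (bash)
! or "limit stacksize unlimited" (csh), avoids the overflow.
program stack_demo
  implicit none
  call fill_tile (332, 332, 10)
contains
  subroutine fill_tile (Im, Jm, N)
    integer, intent(in) :: Im, Jm, N
    real(kind=8) :: work(Im,Jm,N)   ! automatic array: on the stack by default
    work = 0.0_8
    print '(a,f8.2)', ' size of this automatic array (MB):', 8.0*real(Im*Jm*N)/1.0e6
  end subroutine fill_tile
end program stack_demo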
Re: severe (174): SIGSEGV, segmentation fault occurred. libpthread
Dear Arango,
Thanks a lot for your answer. I will talk with the administrator of our HPC system to try to configure the stack size properly.
Regards,
-Francisco
- arango (Site Admin)
Re: severe (174): SIGSEGV, segmentation fault occurred. libpthread
It is very simple, as I have mentioned several times before; I wrote a lot of information about this in the previous trac ticket. You just need to edit your login script and add one of the lines below:
Code: Select all
# in .cshrc, .tcshrc, etc.
limit stacksize unlimited
# or in .bashrc
ulimit -s unlimited
Re: severe (174): SIGSEGV, segmentation fault occurred. libpthread
Dear Arango,
I am sorry for not explaining it properly. I followed your advice from the ticket:
# in .cshrc, .tcshrc, etc.
limit stacksize unlimited
# or in .bashrc
ulimit -s unlimited
But I got the same error. The next step was to compile the model with the -heap-arrays option (I am using ifort), so I asked the administrator to do so. Although, as you point out in the ticket, it may slow down the computations, I hope it will help me locate the problem so that it can be solved in a better way.
Thanks a lot, I really appreciate your help.
-Francisco
Re: severe (174): SIGSEGV, segmentation fault occurred. libpthread
Today I have been able to run the model without errors using the -heap-arrays option, but it hurts performance considerably. Below you will find a comparison:
oceanM version 3.7 rev. 922 // 2 nested grids // 2 nodes, 16 CPUs/node // Total Elapsed CPU Time = 30138.171 sec
oceanM version 3.7 rev. 836 // 2 nested grids // 2 nodes, 16 CPUs/node // Total Elapsed CPU Time = 9568.668 sec
I would like to keep ROMS updated, but the performance penalty is too high. The memory limits in effect were:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256669
max locked memory (kbytes, -l) 4086160
max memory size (kbytes, -m) 65536000
open files (-n) 1048576
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
I will keep working to find the problem with my NetCDF files, which I was able to use with the older revision but which give me this error with the latest one. Any clues are really welcome.
- arango (Site Admin)
Re: severe (174): SIGSEGV, segmentation fault occurred. libpthread
Yes, your problem is the stack size per CPU, and it seems to be associated with the automatic arrays used in distributed-memory runs for I/O operations. This is not a ROMS problem per se, but a computer problem: there is not enough memory for the automatic arrays, allocated either on the stack or on the heap, that are used for scattering/gathering data during I/O.
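As a rough illustration of why the gathering step stresses memory per process, here is a generic MPI Fortran sketch; it is not the actual ROMS scatter/gather code, and the routine and variable names are invented for the example. The receive buffer for the full field is an automatic array whose size scales with the whole grid rather than with a single tile, and it comes from the stack unless the compiler places automatic arrays on the heap (for example, ifort -heap-arrays).
Code: Select all
! Generic illustration only (not the ROMS distribute routines): every
! rank passes its tile to a gather call, and the full-grid receive
! buffer is an automatic array, so its size grows with the grid, not
! with the tile.  It is allocated on every rank (stack by default).
program gather_demo
  use mpi
  implicit none
  integer :: ierr, myrank, nranks
  integer, parameter :: ntile = 1000          ! stand-in for one tile's point count
  real(kind=8) :: tile(ntile)
  call MPI_Init (ierr)
  call MPI_Comm_rank (MPI_COMM_WORLD, myrank, ierr)
  call MPI_Comm_size (MPI_COMM_WORLD, nranks, ierr)
  tile = real(myrank, kind=8)
  call gather_for_output (MPI_COMM_WORLD, tile, ntile, ntile*nranks, myrank)
  call MPI_Finalize (ierr)
contains
  ! Gather every rank's tile into a full-grid buffer on rank 0 for output.
  ! Assumes equal tile sizes for simplicity.
  subroutine gather_for_output (comm, tile, ntile, nglobal, myrank)
    integer, intent(in) :: comm, ntile, nglobal, myrank
    real(kind=8), intent(in) :: tile(ntile)
    real(kind=8) :: global(nglobal)   ! automatic array sized by the FULL grid
    integer :: ierr
    call MPI_Gather (tile, ntile, MPI_DOUBLE_PRECISION,          &
                     global, ntile, MPI_DOUBLE_PRECISION,        &
                     0, comm, ierr)
    if (myrank == 0) then
      ! rank 0 would write "global" to the output NetCDF file here
      print *, 'gathered', nglobal, 'values for output'
    end if
  end subroutine gather_for_output
end program gather_demo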
I see that you are using 32 CPUs. How big are all your grids? You said that you have two nested grids.
I updated the code today to report memory requirements. See the trac ticket for more information.
Re: severe (174): SIGSEGV, segmentation fault occurred. libpthread
Dear Arango,
I have updated the code to the latest revision. Now I am able to run the model without the -heap-arrays option, but it still takes more time than the older revision:
oceanM version 3.7 rev. 923 // 2 nested grids // 2 nodes, 16 CPUs/node // Total Elapsed CPU Time = 28580.836 sec
oceanM version 3.7 rev. 922 // 2 nested grids // 2 nodes, 16 CPUs/node // Total Elapsed CPU Time = 30138.171 sec
oceanM version 3.7 rev. 836 // 2 nested grids // 2 nodes, 16 CPUs/node // Total Elapsed CPU Time = 9568.668 sec
The memory report shows:
Code: Select all
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Dynamic and Automatic memory (MB) usage for Grid 01: 332x332x10 tiling: 4x8
tile Dynamic Automatic USAGE MPI-Buffers
0 44.95 19.63 64.59 8.92
1 45.33 19.63 64.97 8.92
2 45.33 19.63 64.97 8.92
3 45.14 19.63 64.78 8.92
4 46.48 19.63 66.11 8.92
5 46.90 19.63 66.53 8.92
6 46.90 19.63 66.53 8.92
7 46.69 19.63 66.32 8.92
8 46.48 19.63 66.11 8.92
9 46.90 19.63 66.53 8.92
10 46.90 19.63 66.53 8.92
11 46.69 19.63 66.32 8.92
12 46.48 19.63 66.11 8.92
13 46.90 19.63 66.53 8.92
14 46.90 19.63 66.53 8.92
15 46.69 19.63 66.32 8.92
16 46.48 19.63 66.11 8.92
17 46.90 19.63 66.53 8.92
18 46.90 19.63 66.53 8.92
19 46.69 19.63 66.32 8.92
20 46.48 19.63 66.11 8.92
21 46.90 19.63 66.53 8.92
22 46.90 19.63 66.53 8.92
23 46.69 19.63 66.32 8.92
24 46.48 19.63 66.11 8.92
25 46.90 19.63 66.53 8.92
26 46.90 19.63 66.53 8.92
27 46.69 19.63 66.32 8.92
28 45.33 19.63 64.97 8.92
29 45.72 19.63 65.36 8.92
30 45.72 19.63 65.36 8.92
31 45.53 19.63 65.16 8.92
SUM 1484.82 628.28 2113.11 285.58
Dynamic and Automatic memory (MB) usage for Grid 02: 222x189x10 tiling: 4x8
tile Dynamic Automatic USAGE MPI-Buffers
0 24.61 9.00 33.60 9.00
1 24.61 9.00 33.60 9.00
2 24.61 9.00 33.60 9.00
3 24.48 9.00 33.48 9.00
4 24.61 9.00 33.60 9.00
5 24.61 9.00 33.60 9.00
6 24.61 9.00 33.60 9.00
7 24.48 9.00 33.48 9.00
8 24.61 9.00 33.60 9.00
9 24.61 9.00 33.60 9.00
10 24.61 9.00 33.60 9.00
11 24.48 9.00 33.48 9.00
12 24.61 9.00 33.60 9.00
13 24.61 9.00 33.60 9.00
14 24.61 9.00 33.60 9.00
15 24.48 9.00 33.48 9.00
16 24.61 9.00 33.60 9.00
17 24.61 9.00 33.60 9.00
18 24.61 9.00 33.60 9.00
19 24.48 9.00 33.48 9.00
20 24.61 9.00 33.60 9.00
21 24.61 9.00 33.60 9.00
22 24.61 9.00 33.60 9.00
23 24.48 9.00 33.48 9.00
24 24.61 9.00 33.60 9.00
25 24.61 9.00 33.60 9.00
26 24.61 9.00 33.60 9.00
27 24.48 9.00 33.48 9.00
28 24.06 9.00 33.06 9.00
29 24.06 9.00 33.06 9.00
30 24.06 9.00 33.06 9.00
31 23.94 9.00 32.94 9.00
SUM 784.22 287.92 1072.14 287.92
TOTAL 2269.04 916.20 3185.24 573.50
I have been reviewing old model outputs and realized that in the older revision the -heap-arrays option was activated. Below you will find the compiler options used:
Code: Select all
Operating system : Linux
CPU/hardware : x86_64
Compiler system : ifort
Compiler command : /opt/intel/parallel_studio_xe_2016_update2/impi/5.1.3.181/intel64/bin/mpiifort
Compiler flags : -heap-arrays -fp-model precise -ip -O3 -free -free -free
SVN Root URL : https://www.myroms.org/svn/src/trunk
SVN Revision : 836M
==============================================================
Operating system : Linux
CPU/hardware : x86_64
Compiler system : ifort
Compiler command : /opt/intel/parallel_studio_xe_2016_update2/impi/5.1.3.181/intel64/bin/mpiifort
Compiler flags : -fp-model precise -ip -O3
MPI Communicator : 1140850688 PET size = 32
SVN Root URL : https://www.myroms.org/svn/src/trunk
SVN Revision : 923M
To answer your question about the grid sizes: I used to run 1 grid with 3 refined grids. After reading some of your tickets explaining the importance of testing for the best core configuration, I set up a test case with only 1 donor grid (Lm=332, Mm=332) and 1 refined grid (Lm=222, Mm=189), and ran tests changing the number of cores and the domain decomposition parameters. That is when I started to get the segmentation fault error.
Thanks a lot for your help,
-Francisco
- arango (Site Admin)
Re: severe (174): SIGSEGV, segmentation fault occurred. libpthread
I think that you need to read the following trac ticket and choose the MPI communication options that are more efficient in the computing environment that you are running on. You should check the profiling information that ROMS reports to standard output to see which regions of the code are slower. If -heap-arrays is faster, then use it. However, in our experience the -heap-arrays option for ifort is less efficient.
Re: severe (174): SIGSEGV, segmentation fault occurred. libpthread
Dear Arango,
I really appreciate your help. I was testing different configurations to try to speed up my runs, but then I ran into trouble with the segmentation fault.
I will talk with the administrator and analyze the output to see which regions of the code are slower.
Regarding the -heap-arrays option, it is quite strange. The latest revision (922) took three times longer than the older one (836), both using -heap-arrays. After the last update (923) I am able to run the model without -heap-arrays, but I am still getting worse performance than with 836M (nearly three times slower).
Regards,
-Francisco
- arango (Site Admin)
Re: severe (174): SIGSEGV, segmentation fault occurred. libpthread
I am going to try again for the last time: read trac ticket 747 carefully. In the older version of the code, we chose either the lower- or the higher-level MPI functions for exchanges. We no longer do that in the newer versions; you need to experiment and select which options are more efficient on your computer. The computer administrator cannot help you with that. You need to select the appropriate ROMS CPP options. If you don't know what I am talking about, you need to learn a little about the distributed-memory paradigm.
Re: severe (174): SIGSEGV, segmentation fault occurred. libpthread
Hi Arango,
I am sorry for bothering you. I was posting the results on the forum just in case they could help other users and perhaps get some feedback. I have started running the performance tests with the different configurations explained in the ticket, and I am trying to learn a little about the distributed-memory paradigm. I hope to get the same performance with the new revision as with the older one.
Thanks a lot,
-Francisco