tides_date.f90 variable definition

skastner
Posts: 21
Joined: Mon Mar 23, 2020 8:16 pm
Location: Western Washington University

tides_date.f90 variable definition

#1 Unread post by skastner »

Hello,

I'm getting an error in the startup process on a model that includes tides. The stack trace looks like this:
forrtl: severe (194): Run-Time Check Failure. The variable 'tides_date_$FOUNDIT' is being used in 'tides_date.f90(120,13)' without being defined
Image PC Routine Line Source
romsG 00000000024408B4 tides_date_ 120 tides_date.f90
romsG 00000000022CB301 read_phypar_ 2402 read_phypar.f90
romsG 0000000000E1EDEB inp_par_mod_mp_in 113 inp_par.f90
romsG 000000000040EBA1 roms_kernel_mod_m 99 roms_kernel.f90
romsG 00000000004115F0 MAIN__ 97 master.f90
romsG 000000000040E8E2 Unknown Unknown Unknown
libc-2.28.so 00007F65F764CCF3 __libc_start_main Unknown Unknown
romsG 000000000040E7EE Unknown Unknown Unknown

This is a model instance that I'm trying to bring up to date with the latest version of ROMS; I had previously been running version 3.8. I think I have correctly implemented what is suggested in this ticket: https://www.myroms.org/projects/src/ticket/896

I am attaching my .h, .in, error, and log files. I can send my tidal forcing file if necessary.

Any suggestions on ways forward?
Attachments
roms_almirante.in
(144.63 KiB) Downloaded 592 times
almirante.h
(3.09 KiB) Downloaded 607 times
almirantelog.txt
(10.88 KiB) Downloaded 593 times
almirante_error.txt
(194.29 KiB) Downloaded 640 times

robertson
Site Admin
Posts: 227
Joined: Wed Feb 26, 2003 3:12 pm
Location: IMCS, Rutgers University

Re: tides_date.f90 variable definition

#2 Unread post by robertson »

Could you also attach your tides_date.f90 from your Build_romsG directory?

skastner
Posts: 21
Joined: Mon Mar 23, 2020 8:16 pm
Location: Western Washington University

Re: tides_date.f90 variable definition

#3 Unread post by skastner »

Yes, sorry, here it is.

Thanks!
Attachments
tides_date.f90
(8.91 KiB) Downloaded 619 times

robertson
Site Admin
Posts: 227
Joined: Wed Feb 26, 2003 3:12 pm
Location: IMCS, Rutgers University

Re: tides_date.f90 variable definition

#4 Unread post by robertson »

I believe the issue is that your roms_almirante.in file is out of date. Several new parameters were added since 3.8. In particular, you are missing INP_LIB and OUT_LIB. Compare your .in file with one of the roms_*.in files in the ROMS/External folder of your new source code and add any missing parameters.
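
The comparison suggested above can be partly automated. What follows is only a minimal sketch, not a ROMS tool: it assumes parameters are assigned as `KEYWORD = value` or `KEYWORD == value` and that `!` starts a comment. The function names and the sample file contents (including the values) are made up for illustration; in practice you would read the two real files from disk.

```python
import re

# A keyword is the name (optionally with a parenthesized index) that
# precedes the first '=' or '==' on a non-comment line.
KEY_RE = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*(?:\([^)]*\))?)\s*={1,2}")


def read_keywords(text):
    """Return the set of parameter keywords assigned in .in-style text."""
    keys = set()
    for line in text.splitlines():
        line = line.split("!", 1)[0]  # drop '!' comments
        match = KEY_RE.match(line)
        if match:
            keys.add(match.group(1))
    return keys


def missing_keywords(mine, reference):
    """Keywords assigned in the reference text but absent from mine."""
    return sorted(read_keywords(reference) - read_keywords(mine))


# Illustrative stand-ins for a current roms_*.in and an older application file.
reference = """
! I/O library choice
     INP_LIB = 1
     OUT_LIB = 1
     VARNAME = varinfo.yaml
      NtileI == 2
"""
mine = """
     VARNAME = varinfo.dat   ! old-style file
      NtileI == 4
"""

print(missing_keywords(mine, reference))
```

This only reports keywords that are absent; parameters whose meaning or expected format changed between versions still need a manual diff.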

skastner
Posts: 21
Joined: Mon Mar 23, 2020 8:16 pm
Location: Western Washington University

Re: tides_date.f90 variable definition

#5 Unread post by skastner »

Ah, of course. This was the issue! Thanks, David.

skastner
Posts: 21
Joined: Mon Mar 23, 2020 8:16 pm
Location: Western Washington University

Re: tides_date.f90 variable definition

#6 Unread post by skastner »

A new problem has arisen, though. This seems to be some sort of I/O error with openmpi, although it's occurring in the set_tides.f90 code. Here's the stack trace:

[hpc3-21-18][[59635,1],230][btl_openib_component.c:3655:handle_wc] Unhandled work completion opcode is 136
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
romsG 000000000246E7AB Unknown Unknown Unknown
libpthread-2.28.s 00007FEE7150ACE0 Unknown Unknown Unknown
libmpi.so.40.20.3 00007FEE717460D0 Unknown Unknown Unknown
libmpi.so.40.20.3 00007FEE7176DFAF ompi_request_defa Unknown Unknown
libmpi.so.40.20.3 00007FEE717C5D2F ompi_coll_base_se Unknown Unknown
libmpi.so.40.20.3 00007FEE717C85FD ompi_coll_base_al Unknown Unknown
libmpi.so.40.20.3 00007FEE717822EE PMPI_Allreduce Unknown Unknown
libmpi_mpifh.so.4 00007FEE71AA8B59 mpi_allreduce_ Unknown Unknown
romsG 0000000001204B19 distribute_mod_mp 1604 distribute.f90
romsG 000000000066D7D7 set_tides_mod_mp_ 282 set_tides.f90
romsG 000000000064FA36 set_tides_mod_mp_ 65 set_tides.f90
romsG 000000000041761C main3d_ 187 main3d.f90
romsG 000000000040FEC5 roms_kernel_mod_m 191 roms_kernel.f90
romsG 0000000000411A4D MAIN__ 110 master.f90
romsG 000000000040E8E2 Unknown Unknown Unknown
libc-2.28.so 00007FEE70F69CF3 __libc_start_main Unknown Unknown
romsG 000000000040E7EE Unknown Unknown Unknown

Any thoughts? If this would be better in a new thread/different subforum let me know. Attaching the set_tides.f90, distribute.f90, log, error, and new .in files.
Attachments
distribute.f90
(222.42 KiB) Downloaded 599 times
set_tides.f90
(21.32 KiB) Downloaded 582 times
roms_almirante.in
(156.74 KiB) Downloaded 580 times
almirante_mpi_io_problem.txt
(1.53 KiB) Downloaded 592 times
almirantelog.txt
(150.16 KiB) Downloaded 617 times

wilkin
Posts: 922
Joined: Mon Apr 28, 2003 5:44 pm
Location: Rutgers University

Re: tides_date.f90 variable definition

#7 Unread post by wilkin »

Does the log really end abruptly like that, with no error reporting from ROMS?
John Wilkin: DMCS Rutgers University
71 Dudley Rd, New Brunswick, NJ 08901-8521, USA. ph: 609-630-0559 jwilkin@rutgers.edu

skastner
Posts: 21
Joined: Mon Mar 23, 2020 8:16 pm
Location: Western Washington University

Re: tides_date.f90 variable definition

#8 Unread post by skastner »

Yes, strangely. I'm attaching the output from the compilation process here, in case that's of use.
Attachments
almirante_buildlog.txt
(521.67 KiB) Downloaded 673 times

arango
Site Admin
Posts: 1368
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

Re: tides_date.f90 variable definition

#9 Unread post by arango »

I think the issue here is that something is missing in the configuration. Did you update varinfo.yaml? Most users ignore the trac updates to ROMS. We provide precise information in the trac tickets, with instructions on using new features. Too often we see postings here where we have to reconstruct and guess at the user's source code, input scripts, and NetCDF files.

skastner
Posts: 21
Joined: Mon Mar 23, 2020 8:16 pm
Location: Western Washington University

Re: tides_date.f90 variable definition

#10 Unread post by skastner »

Thanks for the suggestion--my .in file does include a path to the .yaml file, rather than a non-existent .dat file:

! Input variable information file name. This file needs to be processed
! first so all information arrays can be initialized properly.

VARNAME = /dfs6/pub/skastner/ROMS/trunk/ROMS/External/varinfo.yaml

This is indeed where the varinfo.yaml file is located.

Is this what you meant? I've tried to go back through the trac tickets as suggested, but I am not sure which tickets correspond to things I should change in my .in file. I have gone through the output of a "diff" command between my application .in file (roms_almirante.in, attached here) and the upwelling test case .in file (roms_upwelling.in, also attached here, which does not produce this error), and changed the glaring differences between the two.

The other trac ticket that looked like it could influence my model was this, on LuvSrc/LwSrc: https://www.myroms.org/projects/src/ticket/905
I don't use LwSrc, though, only LuvSrc.

It is possible that some of my metadata conflicts between the .yaml file and my netcdf forcing files---I will check this. I do find it hard to believe that a metadata conflict could cause this kind of error, though.
Attachments
roms_upwelling.in
(154.63 KiB) Downloaded 621 times
roms_almirante.in
(156.74 KiB) Downloaded 586 times

wilkin
Posts: 922
Joined: Mon Apr 28, 2003 5:44 pm
Location: Rutgers University

Re: tides_date.f90 variable definition

#11 Unread post by wilkin »

One perplexing thing is the way in which your job exited. There are quite extensive error trapping routines in ROMS so that you get some information on where the code failed before it exits. Yours just stopped, judging by the log you posted. It did not indicate a BLOW UP, or give a line number for the point of failure. This somewhat points to a system problem rather than a code problem.

That said, I notice some things in your log. Your shallowest depth is 0.1 m, and the thinnest vertical layer 5 mm (!) thick. You might consider setting the minimum depth in your bathymetry to be a bit deeper. But if this were the problem, and a vertical CFL violation occurred, ROMS would have reported this.
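
As a rough illustration of raising the minimum depth, here is a minimal sketch that clamps a small depth array. The threshold `hmin = 1.0` and the sample depths are hypothetical, chosen only to show the operation; a real grid would have its bathymetry read from and written back to the grid NetCDF file, and a suitable minimum depends on the application.

```python
def clip_depth(h, hmin):
    """Return a copy of the 2-D depth array h with every value at least hmin."""
    return [[max(depth, hmin) for depth in row] for row in h]


# Hypothetical depths in metres; 0.1 m matches the shallowest depth noted above.
h = [
    [0.1, 0.5, 2.0],
    [3.0, 0.05, 10.0],
]
hmin = 1.0  # assumed threshold, for illustration only

clipped = clip_depth(h, hmin)
print(clipped)
```

The original array is left untouched so the two can be compared before anything is written back to the grid file.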

You have a very large grid and a large number of processors (240). So you are computing across multiple nodes in a cluster. Can you test this model on a smaller number of cores?
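
When changing the core count, the tile partition in the .in file has to change with it, since NtileI times NtileJ must equal the number of MPI processes. A small sketch that lists the possible pairs; the function name is made up here, 240 is the count from this run, and the two smaller counts are just examples.

```python
def tile_options(n_procs):
    """List (NtileI, NtileJ) pairs whose product equals n_procs."""
    return [(i, n_procs // i) for i in range(1, n_procs + 1) if n_procs % i == 0]


# 240 is the process count of the failing run; 40 and 120 are example
# smaller counts to test with.
for n_procs in (40, 120, 240):
    print(n_procs, tile_options(n_procs))
```

Which pair works best depends on the grid dimensions, so this only narrows down the candidates.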

Have you followed our advice that the libraries you link to are compiled with the same compiler (and version) you are using for ROMS itself?

skastner
Posts: 21
Joined: Mon Mar 23, 2020 8:16 pm
Location: Western Washington University

Re: tides_date.f90 variable definition

#12 Unread post by skastner »

Thanks, John! I'll look into changing the bathymetry. I am using the same compilers to compile and for the run itself.

I'm going to run a few tests:

1.) Use a smaller number of cores
2.) Run in serial
3.) Use newer versions of openmpi and netcdf. Currently, I'm using openmpi 4.0.3 and netcdf 4.7.0. I have access to openmpi 4.1.2 and netcdf 4.8.1, which both use the 2022 version of ifort (I'm currently using 2020 for these modules).
4.) Change the bathymetry.

Do you think the version of openmpi matters between these two?

All that said, I'm re-running the model now as it was to replicate the error and it's not crashing, which is in some ways scarier than the error I was getting previously.

wilkin
Posts: 922
Joined: Mon Apr 28, 2003 5:44 pm
Location: Rutgers University

Re: tides_date.f90 variable definition

#13 Unread post by wilkin »

"All that said, I'm re-running the model now as it was to replicate the error and it's not crashing, which is in some ways scarier than the error I was getting previously."

The fact that your earlier run terminated with no error report from ROMS does suggest it was a crash of one of the processors. You should ask your sysadmin if they have error logs to help diagnose a processor failure at the time your run crashed. All the more reason to test on fewer cores while you debug the set-up.

skastner
Posts: 21
Joined: Mon Mar 23, 2020 8:16 pm
Location: Western Washington University

Re: tides_date.f90 variable definition

#14 Unread post by skastner »

Sorry for the slow response--I've tested the model with three processor counts: 1 node (40 CPUs), 3 nodes (120 CPUs), and 6 nodes (240 CPUs). The 6-node run crashes as described above. The 1-node run is excruciatingly slow (~2x real-time speed) and did not crash in the time I allowed it to run. The 3-node run is slightly faster (~5x real-time speed, still not fast enough) and also did not crash in the time I allowed it to run. It does throw this warning, though:

forrtl: warning (406): fort: (1): In call to MP_GATHER3D, an array temporary was created for argument #13

Image PC Routine Line Source
romsG 00000000024655EF Unknown Unknown Unknown
romsG 0000000000EA662C nf_fwrite4d_mod_m 183 nf_fwrite4d.f90
romsG 0000000000E0586A wrt_rst_mod_mp_wr 285 wrt_rst.f90
romsG 0000000000DF2F79 wrt_rst_mod_mp_wr 69 wrt_rst.f90
romsG 00000000004FB9C0 output_ 284 output.f90
romsG 0000000000419517 main3d_ 230 main3d.f90
romsG 000000000040FEC5 roms_kernel_mod_m 191 roms_kernel.f90
romsG 0000000000411A4D MAIN__ 110 master.f90
romsG 000000000040E8E2 Unknown Unknown Unknown
libc-2.28.so 00007F404E8C7CF3 __libc_start_main Unknown Unknown
romsG 000000000040E7EE Unknown Unknown Unknown

I'm attaching the full error file; the run timed out after 5 days.

This seems to be happening while writing the restart file? I don't think this is the same as what occurred above, so perhaps not related. It does have to do with the mpi setup, though.
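
For what it's worth, the two timings above give a rough idea of how the run scales. This is a back-of-the-envelope sketch using only the approximate figures from this post (~2x real time on 40 CPUs, ~5x on 120); the function name is made up for illustration.

```python
def parallel_efficiency(base_cores, base_rate, cores, rate):
    """Return (speedup, efficiency) of a run relative to a baseline run.

    Rates are model time simulated per unit of wall-clock time.
    """
    speedup = rate / base_rate
    efficiency = speedup / (cores / base_cores)
    return speedup, efficiency


# Approximate figures from the 1-node and 3-node runs described above.
speedup, efficiency = parallel_efficiency(40, 2.0, 120, 5.0)
print(f"speedup {speedup:.2f}x on {120 // 40}x the cores, efficiency {efficiency:.0%}")
```

So tripling the cores gives about a 2.5x speedup, or roughly 83% parallel efficiency, given how approximate the input numbers are.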
Attachments
slurmerror_romsv41_120cpu.txt
(3.9 MiB) Downloaded 651 times

arango
Site Admin
Posts: 1368
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

Re: tides_date.f90 variable definition

#15 Unread post by arango »

ROMS will run much faster if the code is optimized (executable romsM) than with debugging flags (executable romsG).
