Understanding ROMS output

Post ROMS benchmark results

Moderators: arango, robertson

Post Reply
Message
Author
s.celestino

Understanding ROMS output

#1 Unread post by s.celestino »

Hi everyone,

I'm trying to understand timing results in ROMS output. I'm running the 4DVAR WC13 test case. I want to know if I can find the lanczos algorithm time in this timing output, I can't find anything:

Code: Select all

Elapsed CPU time (seconds):

 Node   #  0 CPU:    6839.045
 Total:             13268.614

 Nonlinear model elapsed time profile, Grid: 01

  Allocation and array initialization ..............         0.147  ( 0.0022 %)
  Ocean state initialization .......................         0.013  ( 0.0002 %)
  Reading of input data ............................         0.240  ( 0.0035 %)
  Processing of input data .........................         0.271  ( 0.0040 %)
  Computation of vertical boundary conditions ......         0.041  ( 0.0006 %)
  Computation of global information integrals ......         0.785  ( 0.0115 %)
  Writing of output data ...........................         0.922  ( 0.0135 %)
  Model 2D kernel ..................................        28.575  ( 0.4178 %)
  2D/3D coupling, vertical metrics .................        38.537  ( 0.5635 %)
  Omega vertical velocity ..........................        21.385  ( 0.3127 %)
  Equation of state for seawater ...................        78.373  ( 1.1460 %)
  Atmosphere-Ocean bulk flux parameterization ......         2.074  ( 0.0303 %)
  GLS vertical mixing parameterization .............        21.278  ( 0.3111 %)
  3D equations right-side terms ....................         2.505  ( 0.0366 %)
  3D equations predictor step ......................         5.655  ( 0.0827 %)
  Pressure gradient ................................         2.413  ( 0.0353 %)
  Harmonic mixing of tracers, geopotentials ........         2.952  ( 0.0432 %)
  Harmonic stress tensor, S-surfaces ...............         1.200  ( 0.0175 %)
  Corrector time-step for 3D momentum ..............         3.105  ( 0.0454 %)
  Corrector time-step for tracers ..................         3.225  ( 0.0472 %)
                                              Total:       213.695    3.1246
and so on...
Anyway, can you monitor time with this routine in custom parts?

Code: Select all

 CALL wclock_on (ng, model, 36)
Thanks

User avatar
kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: Understanding ROMS output

#2 Unread post by kate »

It's tricky to add profiling of new parts. Start with mod_strings and see what's dimensioned Mregion and Nregion. You'll have to expand Nregion by the number of new things you want to monitor. For the ice code, I added a new chunk of things to monitor and added a new section of the report starting at Iceregion.

User avatar
shchepet
Posts: 188
Joined: Fri Nov 14, 2003 4:57 pm

Re: Understanding ROMS output

#3 Unread post by shchepet »

The profile above indicates that the most computationally expensive part is seawater Equation of State. Is this actually true?

User avatar
arango
Site Admin
Posts: 1368
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University
Contact:

Re: Understanding ROMS output

#4 Unread post by arango »

What do you mean by the Lanczos algorithm? Are you talking about the minimization routine, rpcg_lanzos (RPCG)? The Lanczos vectors are stored in a NetCDF and use the adjoint model iterations over all outer and inner loops.

I recently profiled WC13 with W4DPSAS: Nouter=2 and Ninner=50 on 4 CPUs on my Mac:

Code: Select all

Elapsed CPU time (seconds):

 Node   #  0 CPU:    2734.718
 Node   #  1 CPU:    2735.069
 Node   #  2 CPU:    2735.295
 Node   #  3 CPU:    2735.149
 Total:             22543.769

 Nonlinear model elapsed time profile, Grid: 01

  Allocation and array initialization ..............         0.094  ( 0.0009 %)
  Ocean state initialization .......................         0.015  ( 0.0001 %)
  Reading of input data ............................         0.450  ( 0.0041 %)
  Processing of input data .........................         0.596  ( 0.0055 %)
  Computation of vertical boundary conditions ......         0.086  ( 0.0008 %)
  Computation of global information integrals ......         1.635  ( 0.0149 %)
  Writing of output data ...........................         1.442  ( 0.0132 %)
  Model 2D kernel ..................................        37.881  ( 0.3462 %)
  2D/3D coupling, vertical metrics .................        76.908  ( 0.7030 %)
  Omega vertical velocity ..........................        60.869  ( 0.5564 %)
  Equation of state for seawater ...................       118.920  ( 1.0870 %)
  Atmosphere-Ocean bulk flux parameterization ......         0.971  ( 0.0089 %)
  GLS vertical mixing parameterization .............        28.591  ( 0.2613 %)
  3D equations right-side terms ....................         3.249  ( 0.0297 %)
  3D equations predictor step ......................         7.552  ( 0.0690 %)
  Pressure gradient ................................         2.072  ( 0.0189 %)
  Harmonic mixing of tracers, geopotentials ........         3.532  ( 0.0323 %)
  Harmonic stress tensor, S-surfaces ...............         1.700  ( 0.0155 %)
  Corrector time-step for 3D momentum ..............         3.660  ( 0.0335 %)
  Corrector time-step for tracers ..................         3.338  ( 0.0305 %)
                                              Total:       353.561    3.2317

 Tangent linear model elapsed time profile, Grid: 01

  Ocean state initialization .......................         0.567  ( 0.0052 %)
  Reading of input data ............................        41.545  ( 0.3797 %)
  Processing of input data .........................       183.269  ( 1.6752 %)
  Computation of vertical boundary conditions ......         2.944  ( 0.0269 %)
  Computation of global information integrals ......        32.821  ( 0.3000 %)
  Writing of output data ...........................         2.365  ( 0.0216 %)
  Model 2D kernel ..................................      2042.182  (18.6667 %)
  2D/3D coupling, vertical metrics .................        63.879  ( 0.5839 %)
  Omega vertical velocity ..........................        88.458  ( 0.8086 %)
  Equation of state for seawater ...................       249.450  ( 2.2801 %)
  3D equations right-side terms ....................       177.238  ( 1.6201 %)
  3D equations predictor step ......................       431.872  ( 3.9476 %)
  Pressure gradient ................................       119.952  ( 1.0964 %)
  Harmonic mixing of tracers, geopotentials ........       137.110  ( 1.2533 %)
  Harmonic stress tensor, S-surfaces ...............        77.808  ( 0.7112 %)
  Corrector time-step for 3D momentum ..............       273.983  ( 2.5044 %)
  Corrector time-step for tracers ..................       293.129  ( 2.6794 %)
                                              Total:      4218.573   38.5602

 Adjoint model elapsed time profile, Grid: 01

  Ocean state initialization .......................         0.425  ( 0.0039 %)
  Reading of input data ............................        42.638  ( 0.3897 %)
  Processing of input data .........................       197.069  ( 1.8013 %)
  Computation of vertical boundary conditions ......         3.994  ( 0.0365 %)
  Computation of global information integrals ......        94.125  ( 0.8604 %)
  Writing of output data ...........................         4.823  ( 0.0441 %)
  Model 2D kernel ..................................      3047.378  (27.8548 %)
  2D/3D coupling, vertical metrics .................       218.972  ( 2.0015 %)
  Omega vertical velocity ..........................       127.261  ( 1.1632 %)
  Equation of state for seawater ...................       170.072  ( 1.5546 %)
  3D equations right-side terms ....................       260.863  ( 2.3844 %)
  3D equations predictor step ......................       583.112  ( 5.3300 %)
  Pressure gradient ................................       169.630  ( 1.5505 %)
  Harmonic mixing of tracers, geopotentials ........       532.145  ( 4.8641 %)
  Harmonic stress tensor, S-surfaces ...............        86.664  ( 0.7922 %)
  Corrector time-step for 3D momentum ..............       363.944  ( 3.3267 %)
  Corrector time-step for tracers ..................       305.798  ( 2.7952 %)
                                              Total:      6208.914   56.7531

 Nonlinear model message Passage profile, Grid: 01

  Message Passage: 2D halo exchanges ...............         9.305  ( 0.0851 %)
  Message Passage: 3D halo exchanges ...............        61.940  ( 0.5662 %)
  Message Passage: 4D halo exchanges ...............         1.017  ( 0.0093 %)
  Message Passage: lateral boundary exchanges ......         1.622  ( 0.0148 %)
  Message Passage: data broadcast ..................         3.228  ( 0.0295 %)
  Message Passage: data reduction ..................         0.101  ( 0.0009 %)
  Message Passage: data gathering ..................         0.308  ( 0.0028 %)
  Message Passage: data scattering..................         1.469  ( 0.0134 %)
  Message Passage: point data gathering ............         0.073  ( 0.0007 %)
                                              Total:        79.063    0.7227

 Tangent linear model message Passage profile, Grid: 01

  Message Passage: 2D halo exchanges ...............       285.993  ( 2.6141 %)
  Message Passage: 3D halo exchanges ...............       101.907  ( 0.9315 %)
  Message Passage: 4D halo exchanges ...............        56.638  ( 0.5177 %)
  Message Passage: lateral boundary exchanges ......         6.171  ( 0.0564 %)
  Message Passage: data broadcast ..................        42.248  ( 0.3862 %)
  Message Passage: data reduction ..................         2.566  ( 0.0235 %)
  Message Passage: data gathering ..................         0.822  ( 0.0075 %)
  Message Passage: data scattering..................         9.092  ( 0.0831 %)
  Message Passage: point data gathering ............         0.589  ( 0.0054 %)
                                              Total:       506.028    4.6254

 Adjoint model message Passage profile, Grid: 01

  Message Passage: 2D halo exchanges ...............       392.205  ( 3.5850 %)
  Message Passage: 3D halo exchanges ...............       163.995  ( 1.4990 %)
  Message Passage: 4D halo exchanges ...............        58.984  ( 0.5392 %)
  Message Passage: lateral boundary exchanges ......         1.265  ( 0.0116 %)
  Message Passage: data broadcast ..................        34.569  ( 0.3160 %)
  Message Passage: data reduction ..................         3.558  ( 0.0325 %)
  Message Passage: data gathering ..................         5.227  ( 0.0478 %)
  Message Passage: data scattering..................         9.032  ( 0.0826 %)
  Message Passage: point data gathering ............         1.186  ( 0.0108 %)
                                              Total:       670.022    6.1244

 All percentages are with respect to total time =        10940.230
We need to be careful when changing ROMS profiling because of how the MPI phase is profiled. You cannot just add another number to the wclock routines. I currently looking at this, let me know what do you want to separate and I will take a lock at it. The routine rpcg_lanzos is called few times in 4DVAR but I think that the impact is very minimal when compared with the tangent linear and adjoint models that are iterated 100 times each in the above statistics.

s.celestino

Re: Understanding ROMS output

#5 Unread post by s.celestino »

kate wrote:It's tricky to add profiling of new parts. Start with mod_strings and see what's dimensioned Mregion and Nregion. You'll have to expand Nregion by the number of new things you want to monitor. For the ice code, I added a new chunk of things to monitor and added a new section of the report starting at Iceregion.
Thanks for the answer, I tried it already but it is don't work.

Thanks for the answer Hernan. I'm talking about the I4DVAR module, so I'm looking at the cgradient.F routine. I want to understand what the most computationally expensive routines are. But you said that that routine is called over each inner loop. Right? and not one time. So the minimization step is carried out by each tile? Maybe I deceived this routine:

Code: Select all

# ifdef PROFILE
      CALL wclock_on (ng, model, 36)
# endif
I was expecting to find some reference in the output for the cgradient routine.

User avatar
arango
Site Admin
Posts: 1368
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University
Contact:

Re: Understanding ROMS output

#6 Unread post by arango »

Yes, the primal formulation I4DVAR is very expensive because the minimization is done in the physical space (i.e. full application grid for the state vector, 10^5 to 10^7 points) so we need to do a lot of I/O manipulations in cgradient to compute and orthogonalize the Lanczos vectors. We need to read the full state vector from NetCDF files for every saved record over and over since we are not storing it in memory. So the bottleneck here is I/O and there is no way to accelerate it, expect saving 10-30 copies of the state vector in memory. Don't waste your time on this one. The 4DVAR primal formulation is phasing out because it is inefficient and cumbersome to use in other fancy algorithms.

Nowadays, we use the dual formulation algorithms (W4DPSAS or R4DVAR) because the minimization space is that spanned by the observations. It allows us to build fancy algorithms because the observations are just a single vector with no dependencies on the grid. For this one we have rpcg_lanczos which is very efficient and converges well. This allows building the adjoint of the full 4DVAR system that can be used to compute sophisticated diagnostics like observation sensitivities, etc.

Anyway, I can add the 4DVAR minimization to the profiling. In order to add profiling statistics, I will need to rename all the counters used for MPI. Let me do it. I probably will leave some blank counters for future use.

I need to update the code soon. Stay tuned.

Post Reply