Intel’s new i7 980x CPU gives disappointing speedup

Report or discuss software problems and other woes

Moderators: arango, robertson

BAN

Intel’s new i7 980x CPU gives disappointing speedup

#1 Unread post by BAN »

Hi
I have been running ROMS on a pc with an i7 930 CPU (4 core, 2.80GHz, 64 bit, triple channel DDR3 1333MHz RAM). The operating system is the 64-bit version of Ubuntu 10.04 (lucid).

Now I tried to run ROMS on an equivalent pc but with an i7 980x CPU (6 core, 3.33GHz) which is the present flagship from Intel regarding CPU performance (and price!!).

The results so far have been disappointing. On paper the CPU power is increased from 4*2.8=11.2 to 6*3.33=19.98 i.e. almost double. Parallel computations do not scale linearly, but I was hoping/expecting a speedup in the range of 30%, but this was not by any means realized.

Running an identical ROMS run with 4 threads (model domain partitioned into 4 sections) on the 930-pc and the 980x-pc, the speedup is merely 3.2% (the clock speed difference is 19%).

Running the same run with 6 threads (model domain partitioned into 6 sections) on the 980x-pc only gives a speedup of 11.5% compared to the 4-thread run on the 930-pc.

This tells me that either something strange has gone wrong, or that my ROMS application is more dependent on memory bandwidth (RAM speed, RAM accessed in parallel, speed of motherboard etc.) than brute CPU force (clock speed and number of cores).
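
The arithmetic behind these expectations can be laid out as a small sketch. All figures are the ones quoted in this post; the helper name `percent_gain` is only illustrative, and "core-GHz" is a paper metric, not a measurement.

```python
# Figures quoted in the post; nothing here is measured.
cores_930, clock_930 = 4, 2.80      # i7 930: cores, GHz
cores_980x, clock_980x = 6, 3.33    # i7 980x: cores, GHz


def percent_gain(new, old):
    """Relative gain of `new` over `old`, in percent."""
    return 100.0 * (new / old - 1.0)


paper_930 = cores_930 * clock_930       # 11.2 "core-GHz"
paper_980x = cores_980x * clock_980x    # 19.98 "core-GHz"

print(f"on-paper gain : {percent_gain(paper_980x, paper_930):.0f}%")
print(f"clock gain    : {percent_gain(clock_980x, clock_930):.0f}%")
print("observed gain : 3.2% (4 threads), 11.5% (6 threads)")
```

The gap between the on-paper gain and the observed one is what points away from the cores themselves and towards whatever feeds them.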

Now to the QUESTIONS:
- Does this observation (memory bandwidth vs. brute CPU force) agree with your experience with ROMS?
- Do you know any rules of thumb regarding what hardware is best to purchase - that is for the unfortunate of us who do not have cluster kind of money?
- Is it correct in my case, that the performance would be much better using DDR3 2000MHz RAM rather than the present DDR3 1333MHz RAM?

p.s. I tried different compiler optimizations which on paper should be faster (taking more advantage of the new hardware) than the default ROMS-Linux-ifort settings, but the default values turned out to be just as fast. For some reason ROMS would not compile with the option -ipo turned on!?

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: Intel’s new i7 980x CPU gives disappointing speedup

#2 Unread post by kate »

The person who could write a thesis in response to this is Sasha Shchepetkin, who would explain that the community ROMS does not do things in the best way for speed.

Beyond that, you have to consider things like memory bandwidth to the chip. Now the same memory bus has to serve six cores instead of four, but did the pipe get better? I don't know. Yes, our code is memory bandwidth limited.
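
As a rough illustration of the "same pipe, more cores" point, the theoretical peak bandwidth of the triple-channel memory described in this thread can be divided among the active cores. This is a sketch under assumptions: it uses the standard 64-bit (8-byte) DDR3 channel width and nominal data rates, and sustained bandwidth in practice is lower than these peaks. The function name `peak_bandwidth_gb_s` is hypothetical.

```python
def peak_bandwidth_gb_s(data_rate_mt_s, channels=3, bus_bytes=8):
    """Theoretical peak: transfers/s x bytes per transfer x channels."""
    return data_rate_mt_s * 1e6 * bus_bytes * channels / 1e9


for rate in (1333, 2000):
    total = peak_bandwidth_gb_s(rate)
    print(f"DDR3-{rate}: {total:.1f} GB/s total, "
          f"{total / 4:.1f} GB/s per core with 4 cores, "
          f"{total / 6:.1f} GB/s per core with 6 cores")
```

With the memory unchanged, going from four to six cores leaves each core with two thirds of the peak share it had before.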

We have several Linux clusters here and the per-core speed hasn't changed much in recent years. Instead, the idea is that you add more cores. The community ROMS is disappointing in its speedup beyond some modest number of cores, the number depending on the details of your problem. I believe there's going to have to be some fundamental change in how we do computing if we plan to take advantage of 1000+ cores. Then there's the parallel output and the parallel pre- and post-processing that we aren't doing yet.

ocecept
Posts: 42
Joined: Tue Jan 08, 2008 3:57 pm
Location: Universidade Federal do Ceará

Re: Intel’s new i7 980x CPU gives disappointing speedup

#3 Unread post by ocecept »

Hi Ban;

I'm wondering how many grid points (i x j) you have.
You probably already know this, but have a look at Hernan's comments about partitions and number of cores at viewtopic.php?t=1979

Your question came at a good time: I'm going to buy a computer with Intel i7 processors and I was curious to know how the Fortran compilers (and ROMS) deal with Intel's Hyper-Threading technology.

Comments from other people would be great.

Cheers;

BAN

Re: Intel’s new i7 980x CPU gives disappointing speedup

#4 Unread post by BAN »

Hi Kate and ocecept, and thank you for your quick replies.

As you agree that the memory bandwidth is an important issue, I’ll make the relatively modest investment and purchase the faster RAM.

The present model domain is not ideal for parallel tiling; it was initially derived for a serial Fortran77 application, where water depth was the only important factor (model domain size and rotation give least possible max depth i.e. as large time step as possible).

kurapov
Posts: 29
Joined: Fri Sep 05, 2003 4:49 pm
Location: COAS/OSU

Re: Intel’s new i7 980x CPU gives disappointing speedup

#5 Unread post by kurapov »

Ban, -- Thanks for the interesting statistics. Did you run the ROMS test in OpenMP or MPI mode?
-- Alex

BAN

Re: Intel’s new i7 980x CPU gives disappointing speedup

#6 Unread post by BAN »

Hi Alex

I first ran it using MPI on an old small Linux cluster that the University owns (five Dell 2600 2.6-GHz Xeon servers with a total of 10 dual core CPUs). When the cluster recently stopped working I tried to run the application in OpenMP mode on my Ubuntu i7 930 PC. To my surprise the 4-core PC was actually slightly faster than the cluster.
I have not tried to run the application in MPI mode on the PCs, but I assume that it would run slower compared to OpenMP mode.
My MPI experience from the cluster was that an application would run faster when adding nodes/CPUs, but activating more threads/cores on an already active node would not speed up the process; at best it would remain the same.

arango
Site Admin
Posts: 1368
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

Re: Intel’s new i7 980x CPU gives disappointing speedup

#7 Unread post by arango »

Benchmarking ROMS on a computer is not as simple as you may think. There are several things that you need to consider: application, grid size, I/O, number of threads/nodes, memory, cache size, tile partition, tile balancing, compiler, compiler flags, IEEE standard representation of floating-point operations, math processor, distributed-memory library (MPICH, MPICH2, OpenMPI), OpenMP shared-library version, number and type of jobs running in the computer, intra-processor communication, outside disk communications, time of the day, and so on.

ROMS comes with its own benchmark. It is an idealized Southern Ocean application and it is activated with the BENCHMARK option. By default there is no I/O. There are 3 grid sizes: 512x64x30 (benchmark1.in), 1024x128x30 (benchmark2.in), and 2048x256x30 (benchmark3.in). Notice that all the horizontal grid sizes are powers of two, so we can have endless, balanced tile partitions.

You need to carry out an ensemble of benchmarks to be statistically meaningful. They need to be carried out at different times of the day. We always need to avoid I/O to get realistic timings. You can, for instance, turn on I/O in the BENCHMARK application to see what I am talking about.

My experience with shared-memory and distributed-memory is that for small grids the timings are about the same and I haven't observed any statistical trends. You will be surprised that sometimes the distributed-memory is actually faster. As the grid gets larger and no longer fits in cache, the distributed-memory configuration is actually faster. This is because of the page faulting in shared-memory state global arrays after the cache size is exceeded. In distributed-memory, the state arrays are not global but of the size of the partition plus ghost points. An application has an optimal tile partition for a particular grid size. Once this is reached, the efficiency deteriorates due to excessive MPI communications. I have made this point countless times in this forum.

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: Intel’s new i7 980x CPU gives disappointing speedup

#8 Unread post by kate »

I would argue that you need to check the timings on your full realistic application, with I/O. We had one system that looked really pretty good with the BENCHMARK case. Then it was an absolute dog with my realistic setup. Doing the profiling showed that it was all in the I/O, where the base ROMS code had vectorized, but the netcdf library had not (notably the conversion to single precision for smaller output).

Turning on an ecosystem model with its dozen+ tracers will change things too. Suddenly the chunk of code taking the most time is the rotated mixing tensor for tracers as opposed to the 2-D timestepping.

arango
Site Admin
Posts: 1368
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

Re: Intel’s new i7 980x CPU gives disappointing speedup

#9 Unread post by arango »

True, but the I/O is not benchmarking the computer CPU but the connectivity to the disk where the files are read or written. This is the problem with parallel I/O. I agree that the NetCDF and HDF5 libraries are very inefficient and there is a lot of room for improvement. When doing I/O, it also depends on the frequency of the I/O. This is the killer.

Like I said, benchmarking is not trivial!

susonic
Posts: 171
Joined: Tue Aug 21, 2007 5:44 pm
Location: UST21 / Korea

Re: Intel’s new i7 980x CPU gives disappointing speedup

#10 Unread post by susonic »

Hi Mr.Ban,

Did you fix the problem?

I faced a similar problem to the one you mentioned above.

Recently, I tried the updated ROMS svn 514 and it shows better performance than the previous ROMS version.

Dr. Arango fixed some bugs in the parallel code and I believe that these fixes brought the better result.

Would you try that one?

-JH
Joonho Lee

shchepet
Posts: 188
Joined: Fri Nov 14, 2003 4:57 pm

Re: Intel’s new i7 980x CPU gives disappointing speedup

#11 Unread post by shchepet »

I just saw this thread of conversation and to my amusement found what I would characterize as
pristine naivety: have we not been through this before?

MPI vs. OpenMP .... influence of I/O, and, ... the usual award winning phrase that "benchmarking ROMS is not trivial..."

Contrary to popular belief it is trivial: Poor Man's computing at work -- there is nothing
new about it.

Furthermore, in my experience, i7-family CPU machines outperform (by at least a factor of 2)
all previous generations of CPUs, including Core 2 and 5400-series Xeons (that
is, ROMS running 8 threads on a Core i7 920 is faster than on 8 threads of dual quad-core Xeon 5420s).
And yes, it is 8 threads on a 4-core i7: hyperthreading makes its 4 cores appear to the
operating system as 8 CPUs, and unlike in the case of the Pentium 4 (remember that Intel
HT-technology gimmick?) this time it actually works(!). I do observe a 15...20% gain when
going from 4 to 8 threads. It is impressive.

First, let's eliminate irrelevant suspects: time spent in I/O is negligible: you are running
on a single machine.

Obviously OpenMP is faster than MPI in such conditions, but that is kind of trivial.

Now I have a set of questions:

old:

What is your grid size and what is your partition?

and new:

What are the BIOS settings of your machine?

FSB?

Memory speed?

Memory profile?

Memory timings?

Did you set your BIOS to "all default" or did you play with it?
Last edited by shchepet on Sun Jan 30, 2011 5:17 am, edited 1 time in total.

ce107
Posts: 10
Joined: Tue Jul 01, 2003 10:31 am
Location: MIT,EAPS

Re: Intel’s new i7 980x CPU gives disappointing speedup

#12 Unread post by ce107 »

shchepet wrote: Furthermore, in my experience, i7-family CPU machines outperform (by at least a factor of 2) all previous generations of CPUs, including Core 2 and 5400-series Xeons (that is, ROMS running 8 threads on a Core i7 920 is faster than on 8 threads of dual quad-core Xeon 5420s).
This is a statement that holds true more or less for the Nehalem/Westmere (Core i7 tick/tock) when it comes to other ocean models (e.g. MITgcm) as well and many other codes that exercise the memory subsystem extensively enough. Intel had been constricted performance-wise for years by sticking to the Front Side Bus and with QPI in the i7 it got things right for a change. A superior branch predictor helps as well but not so much for our type of codes.
shchepet wrote: And yes, it is 8 threads on a 4-core i7: hyperthreading makes its 4 cores appear to the operating system as 8 CPUs, and unlike in the case of the Pentium 4 (remember that Intel HT-technology gimmick?) this time it actually works(!). I do observe a 15...20% gain when going from 4 to 8 threads. It is impressive.
Again simultaneous multithreading (the academic non-Intel name for hyperthreading) is supposed to work when the execution pipelines have bubbles because of dependencies (either branch or load) - if you're waiting on main memory and you cannot saturate the main memory bus with one ocean model thread it can help you. The old Pentium4 for example could use up all of the FSB bandwidth with one thread just doing memory copies (and its implementation of hyperthreading had a few other limitations as well).
shchepet wrote: First, let's eliminate irrelevant suspects: time spent in I/O is negligible: you are running on a single machine.

Obviously OpenMP is faster than MPI in such conditions, but that is kind of trivial.
In certain cases OpenMP is going to be slower than MPI just by virtue of having less strictly enforced data ownership (all data is private to its MPI process while you have the potential for false sharing with OpenMP). So even within the same box, it might be beneficial to do OpenMP within a socket and MPI across sockets - ROMS does not have such a "hybrid" mode (yet).

BAN

Re: Intel’s new i7 980x CPU gives disappointing speedup

#13 Unread post by BAN »

Hi again, and thank you all for the interest and comments

To answer the questions from one end:

It appears that I’m using ROMS svn 511.
- I’ll follow the suggestion by susonic and update the code.

Grid size: 1398 x 726 x 1
Tiling: 3 x 2

Regarding the BIOS settings, I haven’t done anything at all (both PC's were purchased assembled and tested from the same vendor). I guess everything is default, but I do not know.

Some hardware information regarding both PC's which might be of interest:
- Motherboard: ASUS P6T SE, X58 chipset
- RAM: DDR3 1333MHz , in triple channel configuration.
I have just received new DDR3 RAM rated at 2000MHz (XMP), which I hope will give some extra speedup.

- I have not yet tested whether MPI for some reason is faster on this PC

shchepet
Posts: 188
Joined: Fri Nov 14, 2003 4:57 pm

Re: Intel’s new i7 980x CPU gives disappointing speedup

#14 Unread post by shchepet »

Grid size: 1398 x 726 x 1
Tiling: 3 x 2
This is not the best way to run ROMS.

Try to set tiling to 8 x 81 (yes, meaning eight by eighty one),
rerun it and report your finding back to this forum. We will
continue after that...

Motherboard: ASUS P6T SE, X58 chipset
RAM: DDR3 1333MHz , in triple channel configuration.
I have just received new DDR3 RAM rated at 2000MHz (XMP)...
This is a good choice of motherboard, although faster
memory would be useful (you obviously realized that
already). One somewhat confusing part about buying
memory is that one has to pay attention to latencies
and cooling. As a rule, an increase in frequency comes
with some penalty in CAS latencies, which partially
devalues the gain of the higher frequency. For example,
Corsair makes 1600MHz 8-8-8-24 memory, but going to higher
speed, say 1800MHz, it ends up set to 9-9-9-whatever.
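
The frequency-versus-latency trade-off can be made concrete by converting CAS latency from clock cycles to nanoseconds. A minimal sketch, assuming the usual DDR convention that the quoted "MHz" figure is the data rate and the memory clock is half of it; the function name `cas_latency_ns` is illustrative, and only the CAS figure is considered, not the other timings.

```python
def cas_latency_ns(data_rate_mt_s, cl_cycles):
    """CAS latency in nanoseconds; one clock period is 2000 / data_rate ns."""
    return cl_cycles * 2000.0 / data_rate_mt_s


# (data rate, CAS cycles): the two modules mentioned above plus the
# 2000MHz CL9 XMP profile reported elsewhere in this thread.
for rate, cl in ((1600, 8), (1800, 9), (2000, 9)):
    print(f"DDR3-{rate} CL{cl}: {cas_latency_ns(rate, cl):.2f} ns")
```

In this reading the 1800MHz CL9 module has the same absolute CAS latency as the 1600MHz CL8 one, so the faster module gains bandwidth but not latency.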

Also, higher memory clock speed typically requires higher
voltages (beyond the JEDEC standards), so in any case
you have to open the box, check exactly what kind of memory
you have, go to the manufacturer's web site and get some technical
document about recommended settings for that particular
memory module. No BIOS will do it for you. Neither will
the vendor/computer manufacturer.

Regarding the BIOS settings, I haven’t done anything at all
(both PC's were purchased assembled and tested from the same
vendor). I guess everything is default, but I do not know.
ASUS boards always have very rich BIOS. XMP profile is
NEVER enabled by default. Furthermore, your 1333MHz memory
may be clocked at 1066 because this is kind of default set
by Intel specifications for i7.

Even if your memory is not "XMP certified" you may take advantage
of manual timing settings. For example, on Intel DX58SO boards
I ended up not using the XMP profile, but rather set the memory to 1333
(even though it is 1600MHz memory) and instead cranked up the base
clock from 133 to 145MHz, while keeping the FSB locked to memory.
This overclocks both the FSB (hence the processor) and the memory (it
would be overclocked if it were 1333 memory, but in fact it ends up downclocked,
since it is 1600MHz memory set to run at 1450).
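
The 1450 figure follows from scaling the memory speed with the base clock. A small sketch, assuming the memory ratio implied by the BIOS setting (1333 at a nominal 133.33MHz base clock) stays fixed when the base clock is raised; `effective_memory_rate` is a hypothetical helper, not a BIOS or vendor formula.

```python
def effective_memory_rate(base_clock_mhz, nominal_rate=1333.0,
                          stock_base_clock=133.33):
    """Memory speed after a base-clock change, with the memory ratio held fixed."""
    ratio = nominal_rate / stock_base_clock   # about 10 for the 1333 setting
    return base_clock_mhz * ratio


print(round(effective_memory_rate(133.33)))  # stock setting
print(round(effective_memory_rate(145)))     # base clock raised to 145MHz
```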

Inspect the memory timings, but do not mess with them the
first time [your messing with the BIOS is far from over, so you will
revise memory timings later].

Go to the Boot section and make sure that Quick Boot and Logo are
both set to "Disable". Enable the Summary screen. This way you will
see what the machine is set to.

...also go to SouthBridge SATA settings and make sure that AHCI
mode is set to "Enabled". This is specifically important to
satisfy Hernan's concern about I/O. ...and, please ignore the
warning that you have to have the latest Windows XP (or 7, or whatever
it is called nowadays) to enable this feature: Linux is
perfectly capable of taking advantage of it.

arango
Site Admin
Posts: 1368
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

Re: Intel’s new i7 980x CPU gives disappointing speedup

#15 Unread post by arango »

Interesting information... Perhaps we need to create a page in WikiROMS that contains all this information for future reference, to avoid repeating this in the future.

BAN

Re: Intel’s new i7 980x CPU gives disappointing speedup

#16 Unread post by BAN »

Hi again shchepet
and thank you for your advice, which so far has given surprisingly high performance gains.

QUESTION: where does the 8x81 tiling come from, memory configurations?

So far I’ve been thinking the tiling should correspond to the number of cores/threads on the PC/Cluster but apparently I was very wrong, at least on the PC-part!

In order to get some perspective: I used the model to calculate the barotropic currents over a 40 day period (tiling: 3x2, number of threads: 6). Until now, this has taken on the order of 5-6 days to complete on the 980x-PC. As a base reference these runs take 8:32 (8min 32sec) to model one hour.

In order to make comparison of different setups of the PC and the model I only run the model for one hour in the following examples.

As startup effects of very short model runs will bias the model run-time, the following SETUP 1 reference runs were made:

SETUP 1 default
(ROMS svn 511, tiling: 3x2, number of threads: 6, RAM 1333MHz)
Run 1: duration 9:02
Run 2: duration 8:38
Run 3: duration 8:40
Run 4: duration 8:59
Average duration SETUP1: 8:50

SETUP 2 new svn version
(ROMS svn 514, tiling: 3x2, number of threads: 6, RAM 1333MHz)
Run 1: duration 8:41
Run 2: duration 8:41
Run 3: duration 8:39
Run 4: duration 8:47
Average duration SETUP2: 8:42

SETUP 3 new tiling, 6 threads
(ROMS svn 514, tiling: 8x81, number of threads: 6, RAM 1333MHz)
Run 1: duration 4:26
Run 2: duration 4:24
Run 3: duration 4:22
Run 4: duration 4:29
Average duration SETUP3: 4:25

SETUP 4 new tiling, 12 threads
(ROMS svn 514, tiling: 8x81, number of threads: 12, RAM 1333MHz)
Run 1: duration 4:26
Run 2: duration 4:25
Run 3: duration 4:25
Run 4: duration 4:27
Average duration SETUP4: 4:25

SETUP 5 faster RAM
(ROMS svn 514, tiling: 8x81, number of threads: 6, RAM 2000MHz)
Run 1: duration 4:02
Run 2: duration 4:02
Average duration SETUP5: 4:02

SETUP 6 faster RAM 12 threads
(ROMS svn 514, tiling: 8x81, number of threads: 12, RAM 2000MHz)
Run 1: duration 3:55
Run 2: duration 3:55
Run 3: duration 3:55
Average duration SETUP6: 3:55
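
For anyone who wants to repeat the comparison, the run durations listed above can be reduced to mean wall-clock times and speedups with a few lines. This is only a sketch: the durations are copied from the six setups, the helper `to_seconds` is illustrative, and exact means can differ by a second from hand-rounded averages.

```python
def to_seconds(mmss):
    """Convert a 'm:ss' duration string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


runs = {
    "SETUP 1": ["9:02", "8:38", "8:40", "8:59"],
    "SETUP 2": ["8:41", "8:41", "8:39", "8:47"],
    "SETUP 3": ["4:26", "4:24", "4:22", "4:29"],
    "SETUP 4": ["4:26", "4:25", "4:25", "4:27"],
    "SETUP 5": ["4:02", "4:02"],
    "SETUP 6": ["3:55", "3:55", "3:55"],
}

means = {name: sum(map(to_seconds, t)) / len(t) for name, t in runs.items()}
baseline = means["SETUP 1"]
for name, mean in means.items():
    print(f"{name}: {mean:6.1f} s, speedup vs SETUP 1 = {baseline / mean:.2f}x")
```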


. . . and an update . . . with some additional questions . . .
The XMP option is activated in BIOS, and it tells me that Profile #1 (which I now use) is
Profile Info: 2000MHz-9-9-10-27-1N-1.65V-1.70V
Profile #2 is:
Profile Info: 1866MHz-9-9-10-27-1N-1.65V-1.60V

Quick Boot and Logo are both set to "Disable". There is no option regarding the Summary screen, but I think it appears nevertheless. The only problem is that it scrolls by so fast that I cannot see what is at the beginning of the table.

Question: Is there some way that the summary screen can be accessed afterward as a text file?

Regarding:
...also go to SouthBridge SATA settings and make sure that AHCI
mode is set to "Enabled"
I’m unfamiliar with many of these hardware words/terms, and the closest I can find in the BIOS to the above is a setting in: [MAIN] then [Storage Configuration] then [Configure SATA] where the options are {[IDE],[RAID],[AHCI]}, by default the setting is IDE.

Question: Is this the one option that speeds up I/O?

I just want to be sure as I did some googling and some think that this option has to be set prior to installing the operating system (OS), otherwise there might be complications with the existing OS after this option has been changed!?

shchepet
Posts: 188
Joined: Fri Nov 14, 2003 4:57 pm

Re: Intel’s new i7 980x CPU gives disappointing speedup

#17 Unread post by shchepet »

So you have recovered some of the performance loss, but there is still a way to go.
QUESTION: where does the 8x81 tiling come from, memory configurations?
I made it up just as a first guess. Generally, for this kind of problem one wants to choose the tile size to reach the best possible compromise among all of the following:
(1) the size must be sufficiently small to fit into the processor cache [typically this means the outermost-level cache, L3 if the processor has it, or L2];
(2) "perimeter-vs-area" consideration: as tiling is introduced, a bit of redundant computation takes place along the subdivision lines (literally, certain provisional variables [e.g., fluxes, etc...] are computed twice -- when processing a boundary row of a tile, and then the adjacent boundary row of the neighboring tile). Consequently, if your tiles are too narrow, say only a few points wide, the cost of the extra computing along the sides may not be negligible;
(3) the length of the innermost loop [the i-loop in ROMS] must be large to ensure good pipelined execution on the processor.

Obviously it is hard to compromise and satisfy all three, but the situation is simplified a bit by realizing that the optimum geometry [size and shape] of tiles does not actually depend on your problem -- horizontal grid dimensions -- but is actually a function of CPU and computer hardware you use. Thus, when going to a larger problem, it makes sense to increase the number of tiles rather than their size.
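
The three considerations above can be explored numerically for the grid discussed in this thread. The sketch below is only a rough model: it assumes an as-even-as-possible split (the actual ROMS partitioning may round differently), counts a single 2-D double-precision array rather than the full working set, and uses the plain perimeter-to-area ratio as a stand-in for the redundant boundary work. All function names are hypothetical.

```python
import math


def tile_shape(lm, mm, ntile_i, ntile_j):
    """Largest tile (i-points, j-points) for an as-even-as-possible split."""
    return math.ceil(lm / ntile_i), math.ceil(mm / ntile_j)


def perimeter_to_area(ni, nj):
    """Boundary points per interior point; a crude proxy for redundant work."""
    return 2 * (ni + nj) / (ni * nj)


def array_kib(ni, nj, bytes_per_value=8):
    """Footprint of one 2-D double-precision array over a tile, in KiB."""
    return ni * nj * bytes_per_value / 1024


lm, mm = 1398, 726  # grid size quoted in the thread
for ntile_i, ntile_j in ((3, 2), (8, 81)):
    ni, nj = tile_shape(lm, mm, ntile_i, ntile_j)
    print(f"{ntile_i}x{ntile_j}: tile {ni}x{nj}, "
          f"{array_kib(ni, nj):.0f} KiB per 2-D array, "
          f"perimeter/area = {perimeter_to_area(ni, nj):.3f}")
```

The 8x81 partition keeps the i-direction long (good for the innermost loop) while shrinking each tile's footprint by two orders of magnitude, at the price of a much larger perimeter-to-area ratio.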

Some of these ideas are expressed in http://marine.rutgers.edu/po/Workshops/ ... petkin.pdf ...and do not be afraid that it is too old: sometimes fresh news is nothing but well-forgotten old news. What you observe on your i7 machine is the kind of behavior described on page 9 of that poster.

To get a more precise feeling about the optimal tiling for your machine and problem you have to run and compare more possibilities: not necessarily as many as in that plot, but a dozen would be useful to orient yourself.
[Configure SATA] where the options are {[IDE],[RAID],[AHCI]}, by default the setting is IDE.
Question: Is this the one option that speeds up I/O?
Is it the only option? Perhaps not. Advanced Host Controller Interface (AHCI) is Intel's term for a new standard for the SATA interface, which may or may not be supported by hard drives and operating systems. It basically enables SCSI-like behavior of SATA disks, like native queuing; simply put, it is a revision/expansion of SATA commands at the hardware level. If you buy a modern disk, it is fully supported, and modern Linux supports it as well. Old (in this context 5-year old) drives do not support it. That is why it is not enabled by default.

As far as I can tell, AHCI vs. non-AHCI makes a difference only under heavy load, and is barely noticeable in common use, i.e., running a model which does computing 99% of the time.


Now let's focus on other insignificant details which may affect your timings:

1. What version, branch and release of Linux do you have?

2. Are you running "desktop" or "server" kernel? What version?

uname -r ???

uname --all ???

In my experience one must use a server-type kernel for this kind of computation. Some Linux distributions, notably Mandriva, maintain two or three branches of their kernels, namely "kernel-desktop", "kernel-server", and "kernel-laptop" (although it appears that in the most recent release laptop and desktop are kind of merged). Their Linux installer automatically decides which one to install depending on what hardware it finds, and the desktop version is installed by default on most motherboards. I always un-install desktop and install server, no matter what the intended usage of the machine is.

The difference? Different optimization targets: server is optimized for maximum throughput; desktop is more about latencies of responses to user input. Basically, different priority policies in process management. When it is about running multi-threaded jobs, scheduling becomes important. I observe as much as 20% difference in performance on i7 just due to the kernel alone.


3. If a threaded job like ROMS is running inside a "konsole" window [konsole is the default terminal of KDE, very popular for its versatility] and printing its output to the screen, it actually runs slower than if it is running from an xterm, or if its standard output is redirected into a file. It is bizarre. I do not have an explanation for this. In any case, the most reliable timings are obtained when all output is redirected into a file, which should always be done.


4. What version of the Intel compiler do you use?

ifort -V ???

As of today, the current release is 11.1.073. If you are running anything older than that, please update.

5. Compiler FLAGS ??? is it

ifort -fpp2 -openmp -pc80 -axSSE4.2 -axSSE4.2 -auto -stack_temps -O3 -IPF_fma -ip ....

or something else?

In your first post you mentioned that you tried to use more advanced compiler options but they made no difference relative to "default". Is it still the case?

Note that this whole process may be iterative: once you fix one thing, you may end up adjusting what you did before. For example, your insensitivity to compiler options may be explained by the code being memory-bandwidth limited, so compiler optimizations of what is going on inside the processor core do not make much difference (since mostly it waits for data to be retrieved from main memory). But once that is fixed, the effect of different optimization levels may become noticeable.

6. Did you try to see whether #define/#undef CPP switch ASSUMED_SHAPE in file Include/globaldefs.h makes any difference?

Just edit that file and define or undefine it manually, replace

Code: Select all

#if !((defined G95 && defined I686) || defined UNICOS_SN)
# define ASSUMED_SHAPE
#endif
with

Code: Select all

# define ASSUMED_SHAPE
or

Code: Select all

# undef ASSUMED_SHAPE
My own experience is that, with the Intel compiler, the code actually runs faster if this switch is undefined. This is somewhat contrary to the recommended setting. But it needs to be checked whether this is still the case.

7. Do you limit or unlimit stacksize?

To check, type limit:

Code: Select all

  woossee:/home/alex 146> limit
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    8192 kbytes
coredumpsize 0 kbytes
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  1024 
memorylocked 64 kbytes
maxproc      30915
to change specific setting, type

Code: Select all

 
  woossee:/home/alex 147> limit stacksize 16M
to verify the effect of your change

Code: Select all

  woossee:/home/alex 148> limit
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    16384 kbytes
coredumpsize 0 kbytes
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  1024 
memorylocked 64 kbytes
maxproc      30915 
Some people in the ROMS community advocate unlimiting the stacksize with the command

Code: Select all

 
   woossee:/home/alex 149> unlimit stacksize
This prevents the model from crashing with a segmentation fault/stacksize violation. However, one should be aware that unlimiting stacksize has a downside: memory allocation by different threads is essentially serialized. The dilemma is that if a thread wants to reserve some memory for its private needs, the operating system must ensure that two threads do not attempt to grab the same chunk of memory. This can be done in two ways: (1) at startup time give each thread its own fixed-size range of addresses in which it is allowed to allocate memory without worrying what other threads are doing [you can do whatever you want, but within your range only ==> limited stacksize]; or (2) serialize the allocation process -- one thread at a time is allowed to call malloc, others wait. This leads to some performance degradation.
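
The same stack limit can also be inspected from a script, which is handy when runs are launched from a job wrapper rather than an interactive shell. A minimal sketch using Python's standard `resource` module (Unix only); the helper `describe` is illustrative, and the limits reported are those of the Python process and anything it launches.

```python
import resource


def describe(limit_bytes):
    """Format an rlimit value the way the shell listing above does."""
    if limit_bytes == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit_bytes // 1024} kbytes"


# getrlimit returns the (soft, hard) limits in bytes.
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print("stacksize (soft):", describe(soft))
print("stacksize (hard):", describe(hard))
```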

BAN

Re: Intel’s new i7 980x CPU gives disappointing speedup

#18 Unread post by BAN »

Once again thank you for the valuable advice, and I apologize for the long delay.

I’ve tried some different tiling options and, as I understand the tiling partition in ROMS, almost anything goes as long as the total number (NtileI * NtileJ) is a multiple of the number of threads (in my case 12).

In the present grid:
Lm (#points in I-direction) = 1398 = 2 * 3 * 233, and
Mm (#points in J-direction) = 726 = 2 * 3 * 11 * 11
As I understand it NtileI and NtileJ should preferably be made up of the prime factors of Lm and Mm, as this would generate equally sized tiles. Fortunately for me the overhead of having unequally sized tiles is not that big.
Based on the testing that I have done, the best choice is NtileI=68 and NtileJ=33. This choice is 8 sec. (7%) faster than 8*81 for 1000 modeled time-steps.
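
The factorization and the two conditions discussed here (even division of the grid, and a tile count that is a multiple of the thread count) are easy to check for any candidate partition. A short sketch, with `prime_factors` and `divides_evenly` as illustrative helpers and the candidate partitions taken from this thread:

```python
def prime_factors(n):
    """Prime factorization by trial division, smallest factor first."""
    factors, p = [], 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors


def divides_evenly(points, ntiles):
    """True if `points` grid points split into `ntiles` equal tiles."""
    return points % ntiles == 0


lm, mm, threads = 1398, 726, 12
print("Lm factors:", prime_factors(lm))
print("Mm factors:", prime_factors(mm))
for ntile_i, ntile_j in ((3, 2), (8, 81), (68, 33)):
    print(f"{ntile_i}x{ntile_j}: "
          f"I even: {divides_evenly(lm, ntile_i)}, "
          f"J even: {divides_evenly(mm, ntile_j)}, "
          f"tiles multiple of {threads} threads: "
          f"{(ntile_i * ntile_j) % threads == 0}")
```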

Reply to questions 1 and 2 (regarding the operating system):
I’m running Ubuntu 10.04 (Lucid Lynx) on both PC’s
‘uname -r’ gives:
2.6.32-25-generic
‘uname --all’ gives:
2.6.32-25-generic #44-Ubuntu SMP Fri Sep 17 2010 x86_64 GNU/Linux

So in short I use the 64bit desktop version and not a server version which you clearly recommend. As the 64bit-Ubuntu has issues with missing 32bit-libraries in the desktop version, I’m somewhat hesitant to plunge into installing the server version . . . (the most stable versions of Ubuntu are the ones with the largest number of users, and not that many use the server version) . . . but again up to 20% gain in performance is a lot. . .

Reply to question 3 (regarding how ROMS is executed):
I use the default bash terminal and all output is piped into a file (run_log). The usual way is the following:

Code: Select all

OMP_NUM_THEADS=12
export OMP_NUM_THEADS
nohup ./oceanO < ocean_CoarGrid.in > run_log
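
A launch sequence like this depends on the exact spelling of the environment variable: the OpenMP runtime reads `OMP_NUM_THREADS`, and a variable exported under any other name is ignored without a warning. A minimal pre-flight check might look like the sketch below; `find_suspicious_omp_vars` is a hypothetical helper and the list of known names is deliberately short, not exhaustive.

```python
import difflib
import os

# A few standard OpenMP environment variables (not an exhaustive list).
KNOWN = ["OMP_NUM_THREADS", "OMP_SCHEDULE", "OMP_DYNAMIC",
         "OMP_NESTED", "OMP_STACKSIZE"]


def find_suspicious_omp_vars(environ):
    """Map each unrecognized OMP_* variable to its closest known name (or None)."""
    suspicious = {}
    for name in environ:
        if name.startswith("OMP_") and name not in KNOWN:
            close = difflib.get_close_matches(name, KNOWN, n=1)
            suspicious[name] = close[0] if close else None
    return suspicious


print(find_suspicious_omp_vars(dict(os.environ)))
```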
Reply to question 4 (regarding ifort version):
‘ifort -V’ gives:
Fortran Intel 64 Compiler Professional for applications running on Intel 64, Version 11.1 Build 20100414, Package ID: l_cprof_p_11.1.072. . .
- So I’m apparently using the second newest package.

Reply to question 5 (regarding compiler flags)
The ROMS default gives following flags for my PC:

Code: Select all

-heap-arrays -fp-model precise -openmp -fpp -ip -O3 -xW -free
The only flags that I have tried to change are -ip, -O3 and -xW.
The -ipo option would not compile, so -ip remains.
The less aggressive options -O1 and -O2 did not give better results and the more aggressive option -fast did not compile. So again -O3 remains.
The updated hardware-specific option -axSSE4.2 gave slightly better performance (1-2%), so this option has now replaced -xW.

The new series of flags is thus:

Code: Select all

  -heap-arrays -fp-model precise -openmp -fpp -ip -O3 -axSSE4.2 -free
Reply to question 6 (regarding ASSUMED_SHAPE)
I made the changes you suggested, then recompiled and reran the model. The results were as close to equal as possible, with ‘#undef ASSUMED_SHAPE’ being 1 sec faster than ‘#define ASSUMED_SHAPE’.

Reply to question 7
On the old cluster I did use the unfortunate unlimited setting for the stack size. On the i7 PCs the default settings have been working, so I have not changed anything.
‘ulimit -a’ gives:

Code: Select all

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 20
file size               (blocks, -f) unlimited
pending signals                 (-i) 16382
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited     
So as long as ROMS runs with the default stack size there is no point in changing it, right? In Ubuntu/bash I can increase the stack size by setting e.g. ‘ulimit -s 16384’.

shchepet
Posts: 188
Joined: Fri Nov 14, 2003 4:57 pm

Re: Intel’s new i7 980x CPU gives disappointing speedup

#19 Unread post by shchepet »

BAN wrote: "So as long as ROMS runs with default stack size there is no point in changing it right?
In Ubuntu/bash I can increase the stack size by setting e.g. ‘ulimit -s 16384’."

This time you actually made me do my homework, because what you are reporting --
essentially insensitivity of your code performance with respect to most, if not all, of
the settings I asked you to test -- means that something is very wrong: I know
that there must be some sensitivity.

And this is what I found and ask you to check:

1. Edit file Compilers/Linux-ifort.mk and find a line which looks like

Code: Select all

FFLAGS := -heap-arrays -fp-model precise
and change it to

Code: Select all

FFLAGS := -no-heap-arrays -fp-model precise
i.e., just add "-no" in front of -heap-arrays. Recompile (meaning "make clean" followed
by "make") and rerun your tests using the best settings you know for your problem.
Adjust the stacksize limit as we previously discussed, i.e., your code should now be sensitive to it.
Compare with what you had before and report the result back to this board.
Include the -heap-arrays/-no-heap-arrays comparison.



2. It would be useful to deduce some absolute figures from your test results (i.e.,
we may achieve some gains relative to what you started with, but we have no idea
how good or bad it is, or when it is good enough that we should stop).

Therefore, since I cannot reproduce your configuration, but I know that this is
basically a 2D problem, I ask you to run a standard test problem which has been in
ROMS from the very beginning: the SOLITON problem. I modified it to be of
sufficient size, and therefore relevant for Core i7 testing.

I made everything preconfigured. Go to web directory
http://www.atmos.ucla.edu/~alex/ROMS/BAN/
and get file roms_v3.4_rev514_soliton_patch.tar.
Create a scratch directory and untar it there. There are five files inside and
a README file which explains what they are:

Code: Select all

-rw-r--r-- 1 alex users  5864 2010-11-14 16:52 README
-rw-r--r-- 1 alex users 19278 2010-11-13 20:07 makefile
-rw-r--r-- 1 alex users  5050 2010-11-14 16:31 Compilers/Linux-ifort.mk
-rw-r--r-- 1 alex users 85755 2010-11-13 22:43 ROMS/External/ocean_soliton.in
-rw-r--r-- 1 alex users 32748 2010-11-14 16:23 ROMS/Include/globaldefs.h
-rw-r--r-- 1 alex users   867 2010-11-14 14:21 ROMS/Include/soliton.h
-rw-r--r-- 1 alex users  4841 2010-11-14 14:54 ROMS/Utility/mp_routines.F
These are the only files I had to touch from the ROMS v3.4 rev. 514 SVN snapshot.
I do not think that the exact revision number matters, since these files have been
unchanged for a long time. Just compare these files with the standard ones and replace or
copy-paste as necessary. In principle you can substitute them all in your code: the
changes are minor and all explained in README. All you have to do is edit
Compilers/Linux-ifort.mk and make sure that the include and library directories
for netCDF are properly set for your machine. This application does not require
netCDF 4: I use -lnetcdf -lhdf5_hl -lhdf5 -lz because my netCDF library was compiled
with HDF 5 enabled, so I have to, but if you use netCDF 3.6.3 or netCDF 4.xx without
enabling HDF, you do not need -lhdf5_hl -lhdf5 -lz after -lnetcdf.

For reference: the size and the duration of the problem are

Code: Select all

          Lm == 768
          Mm == 256
          NtileI == 6
          NtileJ == 32
          NTIMES == 4800
          DT == 0.0255d0
It takes about 43.5 seconds wall clock time to run this test on a slightly overclocked
Core i7 machine [nominally 2.66 GHz Core i7 920 CPU overclocked to 2.88 GHz by changing
the base frequency from 133 to 144 MHz on an ASUS P6T Deluxe V2 with DDR3 1600
CAS 8 memory running at 1440 MHz] using all resources available, i.e. 8 threads.



3. It is instructive to compare the above with the performance of a very old code --
ROMS 1.9 -- Hernan's last pre-Fortran 90 code, dating back to 2003. I can configure
it to be mathematically identical to the above, so it literally goes through the same
stages and produces the same result.

Please go to web directory http://www.atmos.ucla.edu/~alex/ROMS/BAN/
and get the other file, roms_1.9.tar. Make a scratch directory and place
the tar file there. Go there and untar,

Code: Select all

 tar xvf roms_1.9.tar
Edit file Makedefs.IntelEM64T and adjust -I/netcdf/include/directory and
-L/netcdf/library/directory so it can find the netcdf.inc file and the netCDF library on
your machine, then compile it

Code: Select all

 make mpc
 make
Note that you have to compile "mpc" before you compile the code for the first time (it will
fail to compile otherwise), but you do not have to recompile it
after "make clean". Also note that this code allows parallel make, so you can replace
the second "make" with

Code: Select all

 make -j 4
and it takes just a few seconds on an i7 machine.


Then run it

Code: Select all

 roms < roms_soliton.in
or

Code: Select all

time roms < roms_soliton.in
and see how it goes. Report it back to this message board.

Note that unlike in modern ROMS 3.4, the grid dimensions Lm, Mm and the tiling parameters
NSUB_X, NSUB_E (equivalent of modern NtileI, NtileJ) are set in file param.h,
so you have to recompile every time you need to change any of them.
The time step size and other runtime parameters are controlled in roms_soliton.in, which
is a fixed-format file, so you have to be very careful not to mess up the alignment or
introduce blank lines.

BAN

Re: Intel’s new i7 980x CPU gives disappointing speedup

#20 Unread post by BAN »

Hi, and once again sorry for the long delay.

Tonight I unfortunately only had time to test the first part of what you suggested. The rest will come later. To begin at the end, I'm most grateful for the help and guidance you have already provided, and relative to where I began the simulations are running much faster now.

Answer to 1:
After re-compiling with the edited compiler flags (adding the '-no') I got a segmentation fault error. I doubled the stack size (by default the stack size is 8192 kbytes) and the model could run again. And yes, there is a clear difference and quite a good speedup, see the table below:

(the first two rows give the tiling, the third row gives the execution time in seconds using the default flags, and the fourth row gives the corresponding execution times using the added '-no' flag)

Code: Select all

NtileI 12  12  12  12  24  24  36  36  36
NtileJ 12  24  30  36  24  12  12  24  36
------------------------------------------
'def.' 81  73  71  71  71  78  79  77  80
'-no ' 96  60  62  64  74  66  72  70  74
A 15% decrease in computation time :D
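
For what it is worth, the 15% figure comes from comparing the best time in each row of the table (best against best). A minimal bash/awk sketch of that arithmetic follows; the function names `min_of` and `percent_decrease` are made up for illustration and are not part of ROMS.

```shell
#!/usr/bin/env bash
# Smallest value among the arguments (hypothetical helper name).
min_of() {
    printf '%s\n' "$@" | LC_ALL=C sort -n | head -n 1
}

# Percentage by which the run time drops when going from $1 to $2 seconds.
percent_decrease() {
    LC_ALL=C awk -v old="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", 100 * (old - new) / old }'
}

# Timings in seconds, copied from the two rows of the table above.
best_def=$(min_of 81 73 71 71 71 78 79 77 80)
best_no=$(min_of 96 60 62 64 74 66 72 70 74)

echo "best 'def.': $best_def s, best '-no': $best_no s"
echo "decrease: $(percent_decrease "$best_def" "$best_no") %"
```

With the numbers above this compares 71 s against 60 s, i.e. a decrease of about 15.5%.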

Answer to 2:
. . . is coming later

Answer to 3:
. . . is coming later

shchepet

Re: Intel’s new i7 980x CPU gives disappointing speedup

#21 Unread post by shchepet »

BAN wrote: "...and relative to where I began the simulations are running much faster now."

The point is that you know exactly where you started, but you do not know where
you are heading, in the sense that you do not know when it is good enough to stop. And I feel that
it will take several iterations from where we are right now.

First, last time you reported an execution time of about 4 minutes (3:55 at best, to be exact).
Now the best time is 60 seconds using 288 tiles (12x24). Are you running a different problem
now? Or a different duration of the same problem?

Secondly, the contrast between -heap-arrays and -no-heap-arrays you are observing
(roughly 70 --> 60 seconds) is about 15%, while I expected to see something close
to 25%. So it is kind of flatter, which makes me feel that your code has some sort
of "ballast" attached somewhere -- an expensive, computationally intense piece of code
which behaves perfectly "flat" (not sensitive to tiling, compiler flags, etc.) and
which takes about 10..20 seconds (out of 60) and dilutes all the sensitivities.

Looking at the upstream portion of this thread, as well as some of your other
posts: Are you doing tides by any chance? Meaning, do your computations in
these timing results go through the routine "set_tides.F"?

If the answer is "yes", this is a known CPU time waster. The problem is that it computes
a lot of sines and cosines in two-dimensional loops (covering the entire computational
domain), but the results of these computations are needed only at the perimeter.
This is aggravated by the fact that you are doing a 2D problem (SOLVE3D not defined).
[In a 3D setup the excessive cost of "set_tides" is mitigated by the mode-splitting ratio
"ndtfast".]

The offending loops are: starting at about line 393 of "set_tides.F",

Code: Select all

      Etide(:,:)=0.0_r8
      cff=2.0_r8*pi*(time(ng)-tide_start*day2sec)
      DO itide=1,NTC
        IF (Tperiod(itide).gt.0.0_r8) THEN
          omega=cff/Tperiod(itide)
          DO j=JstrR,JendR
            DO i=IstrR,IendR
              Etide(i,j)=Etide(i,j)+                                    &
     &                   ramp*SSH_Tamp(i,j,itide)*                      &
     &                   COS(omega-SSH_Tphase(i,j,itide))
#  ifdef MASKING
              Etide(i,j)=Etide(i,j)*rmask(i,j)
#  endif
            END DO
          END DO
        END IF
      END DO
and starting at about line 524,

Code: Select all

      Utide(:,:)=0.0_r8
      Vtide(:,:)=0.0_r8
      cff=2.0_r8*pi*(time(ng)-tide_start*day2sec)
      DO itide=1,NTC
        IF (Tperiod(itide).gt.0.0_r8) THEN
          omega=cff/Tperiod(itide)
          DO j=MIN(JstrR,Jstr-1),JendR
            DO i=MIN(IstrR,Istr-1),IendR
              angle=UV_Tangle(i,j,itide)-angler(i,j)
              Cangle=COS(angle)
              Sangle=SIN(angle)
              phase=omega-UV_Tphase(i,j,itide)
              Cphase=COS(phase)
              Sphase=SIN(phase)
              Uwrk(i,j)=UV_Tmajor(i,j,itide)*Cangle*Cphase-             &
     &                  UV_Tminor(i,j,itide)*Sangle*Sphase
              Vwrk(i,j)=UV_Tmajor(i,j,itide)*Sangle*Cphase+             &
     &                  UV_Tminor(i,j,itide)*Cangle*Sphase
            END DO
          END DO
          DO j=JstrR,JendR
            DO i=Istr,IendR
              Utide(i,j)=Utide(i,j)+                                    &
     &                   ramp*0.5_r8*(Uwrk(i-1,j)+Uwrk(i,j))
#  ifdef MASKING
              Utide(i,j)=Utide(i,j)*umask(i,j)
#  endif
            END DO
          END DO
          DO j=Jstr,JendR
            DO i=IstrR,IendR
              Vtide(i,j)=(Vtide(i,j)+                                   &
     &                    ramp*0.5_r8*(Vwrk(i,j-1)+Vwrk(i,j)))
#  ifdef MASKING
              Vtide(i,j)=Vtide(i,j)*vmask(i,j)
#  endif
            END DO
          END DO
        END IF
      END DO
[search for "DO itide=1,NTC" to find them.]

These loops compute arrays Etide, Utide, Vtide, which are then added to the boundary
arrays {zeta,ubar,vbar}_{west,east,south,north} = 12 permutations total, along the
perimeter.

Instead of computing Etide, Utide, Vtide over the 2D range of indices

i,j={ MIN(IstrR,Istr-1):IendR, MIN(JstrR,Jstr-1):JendR }

you should compute them along the perimeter only.
And, in principle, eliminate these scratch arrays altogether: the results of these
computations may be added directly to the boundary forcing arrays ubar_west, zeta_west, etc...

Transform loops

Code: Select all

      DO itide=1,NTC
        IF (Tperiod(itide).gt.0.0_r8) THEN
          omega=cff/Tperiod(itide)
          DO j=MIN(JstrR,Jstr-1),JendR
            DO i=MIN(IstrR,Istr-1),IendR
                   .....
                   .....
            END DO
          END DO  
        END IF
      END DO
into

Code: Select all

      IF (WESTERN_EDGE) THEN
        i=Istr-1             !<-- or appropriate i-index for boundary U-points
        DO itide=1,NTC
          IF (Tperiod(itide).gt.0.0_r8) THEN
            omega=cff/Tperiod(itide)
            DO j=MIN(JstrR,Jstr-1),JendR
                   .....
                   .....
            END DO
          END IF
        END DO
      END IF
and, similarly

Code: Select all

      IF (EASTERN_EDGE) THEN
         i=Iend+1
         ......

Code: Select all

      IF (SOUTHERN_EDGE) THEN
         j=Jstr-1             !<-- or appropriate j-index for boundary V-points
         ......

Code: Select all

      IF (NORTHERN_EDGE) THEN
         j=Jend+1
         ......
In doing so you will replace each of the 2D loops with four 1D-boundary loops,
so your code will be longer in the end in terms of number of lines, but you will
eliminate most of the computational cost, bringing it down to ~1% of the original.
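
As a rough plausibility check of the ~1% figure, here is a minimal bash/awk sketch. It assumes that the cost of the tidal loops is simply proportional to the number of grid points visited, so that the saving is the ratio of perimeter points, 2*(Lm+Mm), to area points, Lm*Mm. The function name `perimeter_fraction` is made up for illustration, and the 768 x 256 size is the SOLITON test grid from earlier in this thread, used only as an example because the size of the tidal grid is not given here.

```shell
#!/usr/bin/env bash
# Perimeter points as a percentage of all points of an Lm x Mm grid.
# Assumption: loop cost is proportional to the number of points visited.
perimeter_fraction() {
    LC_ALL=C awk -v lm="$1" -v mm="$2" \
        'BEGIN { printf "%.2f\n", 100 * 2 * (lm + mm) / (lm * mm) }'
}

# Example with the SOLITON grid dimensions (Lm = 768, Mm = 256).
perimeter_fraction 768 256
```

For that grid the ratio is 2048/196608, or roughly 1%. Smaller grids give a larger fraction, so the actual saving depends on the grid size.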

The easiest way to avoid confusion is to get rid of scratch arrays Etide, Utide, and
Vtide altogether along with the MPI exchange calls associated with these arrays
and work directly with ubar_west, etc...


Use the following routine as a template
http://www.atmos.ucla.edu/~alex/ROMS/BAN/set_tides.F
but do not substitute the file as a whole: it is semantically
incompatible with your code.


...and I am still waiting for answers to 2 and 3 from the previous exchange.

BAN

Re: Intel’s new i7 980x CPU gives disappointing speedup

#22 Unread post by BAN »

Answer to 2:

Running with FFLAGS := -no-heap-arrays -fp-model precise -openmp -fpp -ip -O3 -axSSE4.2 -free
and increased stack size (16384) using the same tiling (6 x 32):
12 threads, the time was: 39 sec.
8 threads, the time was: 44 sec.

Running with
FFLAGS := -pc80 -xSSE4.2 -auto -stack_temps -openmp -fpp -ip -O3 -free
and increased stack size (16384) using the same tiling (6 x 32) and 12 threads gave the same time, 39 sec. Using other tilings (e.g. 12x32 or 12x24) on 12 threads, the time went down to about 37.5 sec.

The speedup could, as you pointed out above, probably be larger if I were running on a server setup rather than the PC setup I'm presently using.

Answer to 3:

I usually run the current version of ROMS using the build.bash script, where netCDF is located by the following lines:
export NETCDF_INCDIR=/usr/local/include
export NETCDF_LIBDIR=/usr/local/lib

In Makedefs.IntelEM64T, I do not know what to change. If I delete all comment and suffix lines, here is what is left:
CPP = /lib/cpp -traditional -D_OPENMP -D__IFC
OMP_FLAG = -fpp2 -openmp
CFTFLAGS = -pc80 -xSSE4.2 -auto -stack_temps
CFT = ifort $(OMP_FLAG) $(CFTFLAGS) $(LARGE_MEM_FLAG)
LDR = $(CFT)
FFLAGS = -O3 -IPF_fma -ip -warn unused
LDFLAGS =
COMP_FILES =
LCDF = -lnetcdf -lhdf5 -lhdf5_hl
libncar = -lncarg -lncarg_gks -lncarg_c -lXpm -lX11 -lXext -lpng -lz
LIBNCAR = -L$(NCARG_ROOT)/lib -L/usr/lib64 $(libncar)

which line should I edit?

Answer to the new post:

Yes, I use set_tides, and thank you a lot for the information. I'll look into it.

and

Yes, in the latest reply I was using a different grid and a different duration.
I seem to have some trouble right now running the original version again (segmentation fault), but a summation of the gains reported in the posts above shows that the run-time should be about a third of what it originally was.

shchepet

Re: Intel’s new i7 980x CPU gives disappointing speedup

#23 Unread post by shchepet »

BAN wrote: "I usually run current version of ROMS using the build.bash script, where netcdf is located by the following lines:
export NETCDF_INCDIR=/usr/local/include
export NETCDF_LIBDIR=/usr/local/lib"
Then you should set

Code: Select all

CPP = /lib/cpp -traditional -D_OPENMP -D__IFC -I/usr/local/include

LCDF = -L/usr/local/lib -lnetcdf
and leave everything else as before.

Compiler/CPP flags are generally cumulative, so there are different ways to express
the same thing, e.g.,

FIRST_SET_FLAGS = -flag1 -flag2 -flag3
SECOND_SET_FLAGS = -flag4 -flag5
CFT = ifort $(FIRST_SET_FLAGS) $(SECOND_SET_FLAGS)

is the same as

CFT = ifort -flag1 -flag2 -flag3 -flag4 -flag5

Grouping flags together is a matter of style, but also convenience because you want to
change them as sets, e.g, when going from high optimization to debugging, you want to
drop -O3 with a few associated things, but keep your -pc80 -xSSE4.2 which defines
instruction set for your CPU. So these flags belong to different groups. Hernan's bash
script of today's ROMS does the same thing, but it is kind of hidden from users.

shchepet

Re: Intel’s new i7 980x CPU gives disappointing speedup

#24 Unread post by shchepet »


BAN wrote: "Answer to 2: Running with FFLAGS := -no-heap-arrays -fp-model precise -openmp -fpp -ip -O3 -axSSE4.2 -free and increased stack size (16384) using same tiling (6 x 32):
12 threads the time was: 39 sec.
8 threads the time was: 44 sec."

This is good news: it looks like hyperthreading works now.

Running 8 threads on your machine does not make any sense: you have 6 cores, and 8 is not
divisible by 6, so you get an imbalance, and possibly frequent thread migration from core
to core because the scheduler tries to balance the load between the cores. This leads to some
penalty, which is probably partially compensated by hyperthreading (since you have more
threads than physical cores).

What about 6 vs. 12 threads?

BAN wrote: "...Using other tiling (e.g. 12x32 or 12x24) on 12 threads the time went down to about: 37.5 sec"

This is a bit surprising (shortening the first dimension to below 100 = shortening vector
loops should slow it down), but anyway, always compare the best against the best.

What about 8x24 using 12 threads?

Also, note that different versions of the code may have different optimal tilings.

BAN

Re: Intel’s new i7 980x CPU gives disappointing speedup

#25 Unread post by BAN »

Hi, here is an update, with some surprises.

Starting with the last post:

I agree, 8 threads is far from optimal, and I just included this as it gave a “semi-comparable” reference time to the 8-threaded run-time you reported.

Tiling 8x24 takes 38.6 sec with 12 threads

and a surprise:
tiling 12x32 takes 37.5 sec with 12 threads and 32.8 sec with 6 threads ??
. . . something strange is going on with the hyperthreading.

Above I used the new flags
Compiler flags : -pc80 -xSSE4.2 -auto -stack_temps -openmp -fpp -ip -O3 -free

If I use the original flags
Compiler flags : -heap-arrays -fp-model precise -openmp -fpp -ip -O3 -xW -free
tiling 12x32 takes 54.8 sec with 12 threads and 53.6 sec with 6 threads

Using the modified flags
Compiler flags : -no-heap-arrays -fp-model precise -openmp -fpp -ip -O3 -axSSE4.2 -free
tiling 12x32 takes 38.8 sec with 12 threads and 35.6 sec with 6 threads

So there is a very good speedup between the original flags and the present ones, when running this application.


Something else that is somewhat strange is that all my other applications seem to need a very large stack size if they are to run with the new compiler flags. At the moment I therefore use unlimited stack size, although I know that this is not the best option. . . when running Ubuntu remotely I can only increase the size once per session, so these stack size iterations require patience. . .
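
One possible explanation for the once-per-session behaviour, offered as an assumption rather than a diagnosis: in bash, `ulimit -s N` without `-S` or `-H` sets both the soft and the hard limit, and an ordinary user cannot raise a hard limit again once it has been lowered. Changing only the soft limit with `ulimit -S -s` leaves the hard limit alone, so the value can be changed repeatedly within one session. A minimal sketch follows; the function names `stack_too_small` and `ensure_soft_stack_kb` are made up for illustration.

```shell
#!/usr/bin/env bash
# Succeeds (exit status 0) if the limit "$1" is smaller than "$2" kbytes.
# The string "unlimited" is never considered too small.
stack_too_small() {
    local current=$1 wanted_kb=$2
    if [ "$current" = "unlimited" ]; then
        return 1
    fi
    [ "$current" -lt "$wanted_kb" ]
}

# Raise only the SOFT stack limit of the current shell when it is too small.
# The hard limit is left untouched, so later calls can pick another value.
ensure_soft_stack_kb() {
    local wanted_kb=$1
    if stack_too_small "$(ulimit -S -s)" "$wanted_kb"; then
        ulimit -S -s "$wanted_kb"
    fi
}

echo "soft stack limit: $(ulimit -S -s), hard stack limit: $(ulimit -H -s)"
```

A run line could then look like `ensure_soft_stack_kb 16384 && ./oceanO < ocean_soliton.in > run_log`, with the size adjusted per application.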


How much has my reference 2D tidal run sped up so far?
I reran the reference problem with the new compiler options and unlimited stack size. It originally took 8:55 [min:sec] and now it takes 2:44 using 12 threads and 3:16 using 6 threads. So for this application the run time has been reduced to only 31% of the original run-time.
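
The 31% figure can be checked by converting the min:sec times to seconds. This is a small sketch only; the function names `to_seconds` and `remaining_percent` are made up for illustration.

```shell
#!/usr/bin/env bash
# Convert a "min:sec" wall-clock time into seconds.
to_seconds() {
    LC_ALL=C awk -v t="$1" 'BEGIN { split(t, p, ":"); print p[1] * 60 + p[2] }'
}

# New run time as a percentage of the old one (both given as min:sec).
remaining_percent() {
    LC_ALL=C awk -v old="$(to_seconds "$1")" -v new="$(to_seconds "$2")" \
        'BEGIN { printf "%.0f\n", 100 * new / old }'
}

remaining_percent 8:55 2:44   # original run vs. 12 threads
remaining_percent 8:55 3:16   # original run vs. 6 threads
```

That is 164 s out of 535 s for the 12-thread run (about 31%) and 196 s out of 535 s for the 6-thread run (about 37%).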


And finally to the status of running soliton with ROMS v1.9:
I ran into the following error:
. . .
mp_routines.f(56): remark #7712: This variable has not been used. [WTIME]
function my_wtime (wtime)
-------------------------^
/lib/cpp -traditional -D_OPENMP -D__IFC -I/usr/local/include -P analytical.F | mpc > analytical.f
analytical.F:446: error: missing binary operator before token "ISWAKE"
ifort -fpp2 -openmp -pc80 -xSSE4.2 -auto -stack_temps -c -O3 -IPF_fma -ip -warn unused analytical.f -o analytical.o
analytical.f(286): remark #7712: This variable has not been used. [X0]
& r, theta, twopi, val1, val2, x0,y0,rd_inner
. . .
ifort -fpp2 -openmp -pc80 -xSSE4.2 -auto -stack_temps -O3 -IPF_fma -ip -warn unused -o roms roms.o initial.o main2d.o . . . zetabc.o -L/usr/local/lib -lnetcdf
/opt/intel/Compiler/11.1/072/lib/intel64/for_main.o: In function `main':
/export/users/nbtester/efi2linux_nightly/branch-11_1/20100415_000000/libdev/frtl/src/libfor/for_main.c:(.text+0x38): undefined reference to `MAIN__'
make: *** [roms] Error 1
Any idea on what is going wrong?

shchepet

Re: Intel’s new i7 980x CPU gives disappointing speedup

#26 Unread post by shchepet »

.....
analytical.F:446: error: missing binary operator before token "ISWAKE"
.....
There is a typo on line 446. It says

Code: Select all

#if defined ANA_FSOBC && !defines ISWAKE
should be

Code: Select all

#if defined ANA_FSOBC && !defined ISWAKE
i.e., !defines --> !defined

On my machines it still compiles and runs correctly -- CPP produces an error
message, but does not quit despite the typo.
This piece of code in analytical.F is not needed for this application.

shchepet

Re: Intel’s new i7 980x CPU gives disappointing speedup

#27 Unread post by shchepet »

/opt/intel/Compiler/11.1/072/lib/intel64/for_main.o: In function `main':
/export/users/nbtester/efi2linux_nightly/branch-11_1/20100415_000000/libdev/frtl/src/libfor/for_main.c:(.text+0x38): undefined reference to `MAIN__'
This is caused by an attempt to compile code without compiling mpc first.

Explanation: If you just untar the tar file and type "make", it tries to compile the first target,
which is main.o, however because CPP output is piped through mpc, and mpc does not exist
yet, the result is an empty file "main.f". Then make quits, complaining about
/bin/sh: mpc: command not found
If, in response to that, you type "make mpc" followed by "make", everything compiles
to the very end, but because it takes the empty "main.f" leftover from the first attempt
to compile the code, the resultant "main.o" is also empty, so you end up with
undefined reference to `MAIN__'
error at the very end.

To fix: You must remove empty "main.f" and "main.o" files before making the second
attempt to compile the code. A sequence of commands

Code: Select all

make mpc
make clean
make
is sufficient to do that, i.e., a simple rule is that if, for whatever reason you decide
to recompile "mpc", always use "make clean" after that.

BAN

Re: Intel’s new i7 980x CPU gives disappointing speedup

#28 Unread post by BAN »

That was just it.
I did not use make clean between make mpc and make -j.

Here are some run-times from running soliton on ROMS_1.9

Code: Select all

NSUB_X        6     6     8     8     8     9     3    12    12    12
NSUB_E       32    64    24    36    48    32    32    32    24    48
12 threads 31.2  29.6  31.4  30.4  29.4  29.6  36.4  27.8  25.6  27.2
6 threads  28.6  26.8  26.6  25.8  38.6  24.4  34.4  23.0  29.2  32.4

shchepet

Re: Intel’s new i7 980x CPU gives disappointing speedup

#29 Unread post by shchepet »

...So what? The best results for the SOLITON problem are:
for ROMS 1.9
23.0 sec using 6 threads, 12x32 tiling
vs. for the new code
32.8 sec also 6 threads, 12x32 tiling,
Compiler flags: -pc80 -xSSE4.2 -auto -stack_temps -openmp -fpp -ip -O3 -free
The above slightly degrades to
35.6 sec with 6 threads, 12x32 tiling
Compiler flags: -no-heap-arrays -fp-model precise -openmp -fpp -ip -O3 -axSSE4.2 -free
and then dramatically degrades to
53.6 sec with 6 threads, 12x32 tiling
Compiler flags: -heap-arrays -fp-model precise -openmp -fpp -ip -O3 -xW -free
Obviously the compiler flag -heap-arrays causes a major crippling effect, which we have
identified and repaired. Also, hyperthreading does not yield any positive effect.
[It also appears that -fp-model precise makes the code run a bit slower, though
not as dramatically.]


But, even after fixing this and leaving hyperthreading behind for a moment (it seems
to affect both codes equally), it looks like your eight-year-old ROMS 1.9 code runs
significantly faster, about 1.4 times, than your newest code.

Do you want to ask why?


At this point it is worth verifying, by comparing step2d.F of both codes, that for this
particular problem they are indeed mathematically equivalent to the last operation.

...Independently of the above, it would be useful to limit the stacksize and verify that
1.9 can run with a very small stacksize.

BAN

Re: Intel’s new i7 980x CPU gives disappointing speedup

#30 Unread post by BAN »

Running Soliton with the default stack size (8192 kbytes) is no problem. The problem with large stack sizes is only with my other applications.

Just to be on the safe side I reran, with the default stack size, some of the cases reported above that were executed with unlimited stack size. There was no clear difference in the run-times.

It must nevertheless be mentioned that the run-times reported above are single run run-times and not averages over several runs, which would be the accurate procedure.

Here are a few runs from ROMS 1.9 with 12x32 tiling and 6 threads:
. . /ROMS_1.9$ time roms < roms_soliton.in > BAN_run_log
real 0m25.819s
real 0m25.419s
real 0m23.018s
real 0m25.819s
real 0m23.219s
real 0m23.219s
real 0m36.223s
real 0m23.019s

Here are some runs from the recent version of ROMS with 12x32 tiling and 6 threads:
. . /Test_Soliton$ time ./oceanO < ocean_soliton.in > run_log
real 0m33.020s
real 0m35.820s
real 0m36.420s
real 0m35.620s
real 0m32.820s
real 0m33.020s
real 0m41.222s
real 0m32.819s
real 0m44.824s
real 0m49.228s
real 0m47.024s
real 0m32.820s
real 0m33.419s

As you can see, there is the possibility that the PC wants to use its resources for other purposes and thus suddenly gives inexplicably long run-times. But the most frequent run-times are the lowest ones, and these seem to be equivalent to the run-times reported in the posts above.

Why there is this big difference in run-times between the old code and the new code is a good question!?

shchepet

Re: Intel’s new i7 980x CPU gives disappointing speedup

#31 Unread post by shchepet »

BAN wrote: "It must nevertheless be mentioned that the run-times reported above are single run
run-times and not averages over several runs, which would be the accurate procedure."

No, comparing averages is not the proper way to analyze it. Comparing the best against
the best would be more representative.

This is a somewhat different topic, but let's just have a brief explanation of what is going on.

I assume that you are not running firefox while doing these tests, and there are no other
competing applications. The best way is also to stop KDE and, in principle, to stop the X-server
completely -- just type telinit 3 as root while logged in remotely, and telinit 5 to restart the
X-server again -- say, if you use this machine solely for computing, with some other, older
Pentium 4 machine as a desktop (KDE, firefox, LaTeX editing, whatever).
So if you run top while the ROMS test is running, you get a solid 600% and
not a penny less if using 6 threads (1200% if using 12).

So the machine is perfectly clean, in the sense that there are no competing jobs running and taking
away CPUs/cores.

Then if you run the same test again and again, say 1000 times or more (just write a script),
to get enough statistics, and plot a histogram of the probability density of having a certain
runtime (say, you bin your results ranging from ~23 to ~36 seconds into 0.25-second
bins, and see how many runs out of 1000 end up in each bin), you will discover that
this histogram has not one, but two peaks. Perhaps even more complex.
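
A minimal sketch of such a script, under stated assumptions: the function names `collect_times` and `histogram` are made up for illustration, `date +%s.%N` assumes GNU date (as on Ubuntu), and the sample times fed to `histogram` below are the 13 run times for the recent ROMS version listed in post #30, used only to show the output format.

```shell
#!/usr/bin/env bash
# Run a command N times and print one wall-clock time (seconds) per line.
# Usage: collect_times N command [args...]
collect_times() {
    local n=$1 i=0 start end
    shift
    while [ "$i" -lt "$n" ]; do
        start=$(date +%s.%N)
        "$@" > /dev/null 2>&1
        end=$(date +%s.%N)
        LC_ALL=C awk -v s="$start" -v e="$end" 'BEGIN { printf "%.3f\n", e - s }'
        i=$((i + 1))
    done
}

# Read one run time per line on stdin and count the runs per bin of width $1 seconds.
# Output lines: "<lower edge of bin> <number of runs>", sorted by bin.
histogram() {
    LC_ALL=C awk -v w="$1" '
        NF  { count[int($1 / w)]++ }
        END { for (b in count) printf "%.2f %d\n", b * w, count[b] }
    ' | LC_ALL=C sort -n
}

# Example: bin the 13 timings from post #30 into 0.25-second bins.
printf '%s\n' 33.020 35.820 36.420 35.620 32.820 33.020 41.222 \
              32.819 44.824 49.228 47.024 32.820 33.419 | histogram 0.25
```

For a real test one would run something like `collect_times 1000 sh -c './oceanO < ocean_soliton.in' | histogram 0.25`; the `sh -c` wrapper is there so that each run reopens the input file. Even in the small sample above, five of the 13 runs fall into the two lowest bins (32.75 and 33.00), with the rest scattered above.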

The problem is that some of your test runs literally experience mutations, and become slower
by some finite amount than most other runs, which are "normal". With some finite probability,
albeit not a very large one. You have 6 physical cores, and you are running 6 threads. You expect
that each thread gets its own core and sits on it throughout the whole run. This is ideally what
should happen, and it happens most of the time. However, because of CPU hyperthreading your
machine appears to the operating system as a 12-CPU machine, and the Linux kernel does not
actually distinguish between real physical cores and virtual cores, so it allocates threads to
those CPU cores which it thinks are not busy at the moment -- it is designed to statistically
level the load. Thread allocation is actually a sequential process. The OpenMP job starts as
a single thread, then reaches the first parallel region, and creates child threads one at a time.
Because the kernel does not distinguish whether the core it thinks is free belongs to the pair
of virtual cores corresponding to the same physical core or to a different one, there is a chance
that it puts two threads onto two virtual cores which correspond to the same physical core.
This is like a mutation.

In the past, when hyperthreading first appeared in 2003 on "Prestonia"-core Xeon CPUs
(a Xeoned version of Pentium 4 "Northwood") and the follow-up "Nocona", the recommendation
was very plain and simple: go to the BIOS and disable hyperthreading. The machine always ran
faster when having two real CPUs than four virtual ones (using either 2 or 4 threads). Intel claims
that hyperthreading is useful because if a thread stalls because of a cache miss, the resources of
the core will be given to another thread, and therefore it utilizes cycles which would otherwise be
wasted waiting for the cache miss to be resolved. The problem is that there are trade-offs, and thread
switching is not free: it may flush from cache data previously loaded but not yet used by the thread
which was switched off, so it must be reloaded when that thread comes back. Besides, ROMS, as a
coarse-grained code, uses static scheduling in OpenMP, which assumes that the resources are available
when requested, and hyperthreading only confuses that. Maybe other codes behave differently, but
we were not able to extract anything useful from hyperthreading at that time.

The next generation of CPUs -- Core 2 Quad -- did not have hyperthreading.

It is back with i7, and, contrary to my pessimistic expectations, I was able to extract some
gain (but only in a 3D large-memory configuration). Your newest results obviously do not
support that. Perhaps the only way to verify whether the fluctuations of run time are caused
by mutations or by something else is to turn hyperthreading OFF in the BIOS and check that the
results become stable.
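
Short of changing the BIOS setting, one can at least see which logical CPUs are hyperthread siblings of the same physical core: Linux exposes this in /sys/devices/system/cpu/cpuN/topology/thread_siblings_list. A minimal sketch follows; the function name `list_ht_siblings` and its optional directory argument are made up for illustration (the argument exists only so that the function can be tried on a copy of the directory tree).

```shell
#!/usr/bin/env bash
# Print one line per physical core with the logical CPUs that share it,
# in the format reported by the kernel (for example "0,6" or "0-1").
list_ht_siblings() {
    local root=${1:-/sys/devices/system/cpu} f
    for f in "$root"/cpu[0-9]*/topology/thread_siblings_list; do
        if [ -r "$f" ]; then
            cat "$f"
        fi
    done | LC_ALL=C sort -u
}

list_ht_siblings
```

On a 6-core machine with hyperthreading enabled this should print six lines with two logical CPU numbers each; with hyperthreading disabled, each line should contain a single number.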

P.S.: The inconsistency of run times caused by mutations can also be observed on dual-CPU
Opteron boards, especially on "dual-dual" (dual-CPU dual-core Opteron 275, for example), and
this puzzled me for some time (I have such a machine). About a year ago I found an article
by Allan Porterfield et al., "Performance Consistency on Multi-socket AMD Opteron Systems",
explaining this: http://www.renci.org/wp-content/pub/tec ... -08-07.pdf
Of course, it has nothing to do with hyperthreading, but the mechanism is similar: the Linux
kernel does not distinguish between CPUs/cores belonging to the same or different CPU
sockets and, correspondingly, memory systems (these CPUs have their individual memory
systems, so some parts of memory are "closer" to one CPU than to the other, resulting in
an asymmetry which is ignored by the Linux kernel).

shchepet
Posts: 188
Joined: Fri Nov 14, 2003 4:57 pm

Re: Intel’s new i7 980x CPU gives disappointing speedup

#32 Unread post by shchepet »

Regarding the performance comparison between ROMS 1.9 vs. 3.4: what
you are observing can be called the Fortran 90 penalty. The fact that pre-Fortran 90
codes run faster than their supposedly more advanced successors actively using
new Fortran 90 features is nothing new, and it has been noticed throughout the ocean
modeling community, not just for ROMS but for other models as well.

Perhaps the earliest documented evidence of this dates back to 1999 (I saw it even
earlier, but cannot point to anything on the web), and it was initially attributed to the
immaturity of F90 compilers of that time. Go to http://www.gfdl.noaa.gov/ocean-model,
scroll down to MOM, download the MOM3 Manual, and read page 9. It reports
that an attempt to replace common blocks with Fortran 90 modules degraded the
code performance by as much as 30%.

Sifting through old posts on this Board, you can find a few posts dating back
to 2005 complaining that "ROMS 2.1 is again about 2 times slower than 1.8",
viewtopic.php?f=29&t=134&p=596#p596

Most likely related to this problem is a recent comparison of speed between
ROMS and POM, not in favor of ROMS (POM is a basic Fortran 77 style code)
viewtopic.php?f=14&t=1766&p=6346
[I do contend that, if properly used, ROMS (at least some of them) will
decisively outperform POM leaving it no chance whatsoever.]

The stacksize limitation problem is perhaps the most noticed, see
viewtopic.php?f=31&t=216
and
viewtopic.php?f=17&t=1794&p=6462
and
viewtopic.php?f=31&t=216&p=613
among numerous others.

Relative insensitivity of the computational performance of 2.x and later codes to
the compiler optimization settings (-O2 vs. -O3) is reported in posts dating back to 2005,
viewtopic.php?f=31&t=230&p=468
and
viewtopic.php?f=14&t=252
[Note that for the Intel Ifort compiler the flag -O is the same as -O2.] Both posts
report a significant increase in compilation time when changing from -O2 to -O3,
but no gain in execution performance.



A quick comparison of step2d.F files between the 1.9 and 3.4 codes illustrates
the stages of code evolution in the long run: 1.9 places all its global/shared arrays
into common blocks, which are stored in separate files and are included where
needed using CPP-style #include. As a result, the global arrays are never passed
as arguments, and there is no need to declare them inside a routine. This is
old-man's Fortran 90: to have F90-like functionality without having an F90 compiler.
Placing common blocks into include files guarantees consistency of their sizes and of
the meaning of the variables inside. Fortran common blocks map onto global variables in C:
one can declare an array in a piece of Fortran code and place it into a common block
named X, then write another piece of code in C which contains a declaration of
a global variable named X_ (i.e. the same name with a trailing underscore appended),
compile both pieces by their respective compilers with the -c option, then link the
resultant .o files together using either compiler. (Note that Fortran and C compilers
from the same vendor are normally bundled together and actually share most of their
libraries. The format of .o object and .a library files is common between the two,
so either compiler can handle .o's regardless of whether the original source codes are
written in Fortran or C.) Then the Fortran array can be accessed from inside C code as
X_, and vice versa. This trick is no secret: the core netCDF and MPI libraries are written
completely in C, but can be called from Fortran programs. [The only parts of the netCDF
and MPI packages written in Fortran are the Fortran 90 .mod interfaces.] Because placing
a variable into a common block is an ancient mechanism for creating a global variable,
compilers can handle common blocks efficiently with full optimization.

Conversely, the scratch arrays in 1.9 are always passed as arguments from the
driver routine to the driven. In fact, the functionality of the driver is 2-fold:
(1) to decode tile index tile into bounding indices istr,iend,jstr,jend; and
(2) to provide scratch workspace arrays for the internal use in such a way that
dynamic (or automatic array) allocation is completely avoided, and, at the same time,
to guarantee that each thread has its own workspace memory non-overlapping with
that of any other thread. Note the use of !$OMP THREADPRIVATE directive inside
the file "scratch.h". This means that everything placed into this common block is
statically allocated (it retains its assigned values throughout the entire run time),
but is private to individual threads.
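
The two jobs of the driver can be sketched roughly as follows. This is an illustration in Python, not the ROMS code: the partition rule inside tile_bounds, the names lm and mm for the grid extents, and the worker count_points_tile are all assumptions made for the example, and threading.local only stands in for the THREADPRIVATE common block.

```python
import threading

# Plays the role of the THREADPRIVATE common block: one buffer per thread,
# allocated once and kept for the rest of the run.
_scratch = threading.local()


def tile_bounds(tile, ntile_i, ntile_j, lm, mm):
    """Return (istr, iend, jstr, jend), 1-based and inclusive, for one tile.

    Hypothetical partition rule: tiles are numbered from 0, the i-direction
    varies fastest, and leftover points go to the lowest-numbered tiles.
    """
    def split(n, parts, k):
        base, extra = divmod(n, parts)
        start = k * base + min(k, extra) + 1
        end = start + base - 1 + (1 if k < extra else 0)
        return start, end

    istr, iend = split(lm, ntile_i, tile % ntile_i)
    jstr, jend = split(mm, ntile_j, tile // ntile_i)
    return istr, iend, jstr, jend


def scratch(size):
    """Return this thread's scratch buffer, allocating only on first use."""
    buf = getattr(_scratch, "buf", None)
    if buf is None or len(buf) < size:
        buf = [0.0] * size
        _scratch.buf = buf
    return buf


def count_points_tile(istr, iend, jstr, jend, work):
    """Stand-in for a _tile routine: work is used purely as scratch."""
    n = 0
    for j in range(jstr, jend + 1):
        for i in range(istr, iend + 1):
            work[n] = float(i + j)  # intermediate value, discarded on return
            n += 1
    return n


def count_points(tile, ntile_i, ntile_j, lm, mm):
    """Driver: (1) decode the tile index, (2) supply private scratch."""
    istr, iend, jstr, jend = tile_bounds(tile, ntile_i, ntile_j, lm, mm)
    size = (iend - istr + 1) * (jend - jstr + 1)
    return count_points_tile(istr, iend, jstr, jend, scratch(size))
```

The worker never allocates anything itself, and two threads calling count_points at the same time never share a buffer, which is the property the driver is there to guarantee.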

In 3.4 it is actually the other way around: all global arrays are passed as arguments,
while the scratch arrays are not passed as arguments, but instead are declared internally
and therefore became automatic arrays. Why? It is a long story. To make it short,
let's speculate on a procedure to convert 1.9 into 3.4. At first, one can convert each
include file into a Fortran 90 module declaration, and, correspondingly, replace each
CPP-style #include with Fortran 90 USE with that module name. Also move all the
USE statements to just above implicit none. This is straightforward and can be done
relatively quickly.

What to do about !$OMP THREADPRIVATE? Nothing. There was not much
choice here. As of 2005 and earlier (Open MP Specification Version 2.5, released in
May 2005, see http://openmp.org/wp/openmp-specifications/ for detail) the only way
to create threadprivate, statically allocated memory was to place the variable into
a common block and put a !$OMP THREADPRIVATE directive with the name of that
common block just below. There was nothing like a THREADPRIVATE module.
[The newer/current Open MP Specification Version 3.0, released in May 2008, see p. 81,
allows declaring individual variables as threadprivate. The same is formally true
for the Version 2.5 Specification, see p. 66, but it took some time -- a couple of years --
for compiler vendors to fully implement this. I personally reported a compiler issue
(Intel's term for a bug) to Intel when the compiler simply ignored a THREADPRIVATE
directive placed inside a module.] "Fortunately", from the mathematical point of
view, ROMS 1.9 did not rely on threadprivate arrays to transmit any information
in- and out-of a subroutine -- the arrays are used in purely scratch mode to store
intermediate results between consecutive loops. It does not matter what is the
state of these arrays on input and output. [Note: this is no longer true for the
present-day AGRIF and UCLA codes, both of which extensively use threadprivate
arrays to transmit data from one subroutine to another.] So back before 2005 the
choice was to either
(1) leave the !$OMP THREADPRIVATE common block declaration alone as it was
before (hence violate the principle of Fortran 90 purity -- no more common blocks), or
(2) discard this mechanism altogether, and let the automatic arrays be created/
destroyed every time a subroutine is entered/exited.
In the end the purity principle prevailed and the second option was selected.

After some time passed it was discovered that both steps -- converting common
blocks with global array declarations into Fortran 90 modules and getting rid of
THREADPRIVATE common blocks -- have a crippling effect on the code performance.

A common feeling (yes, we are entering speculative territory where nobody can
say anything for sure), influenced by computer scientists, is that this degradation of code
performance is caused mainly by the inhibition of compiler optimizations due to the
inability of the compiler to determine whether different elements of pointer-based
(or allocatable) arrays correspond to distinct memory addresses (this is often called
aliasing or pointer aliasing, and compiler engineers sometimes refer to it as an aliasing
hazard). A desperate attempt to counter this effect is to pass all of the global/
shared arrays (arrays which are declared in modules) through the arguments to each
subroutine rather than via Fortran 90 USE statements inside it, which
is nothing else but an attempt to fool the compiler by trying to "hide" the fact that
they are pointer-based and hoping that it will properly optimize the code.
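
Why possible aliasing blocks optimization can be seen on a toy loop. The sketch below is only an illustration (plain Python lists, hypothetical function names): the second routine does all loads before any store, which is the kind of reordering a vectorizing compiler would like to perform. The two routines agree when dst and src are distinct, and disagree when they are the same array, so a compiler that cannot rule out the overlap has to keep the slow, strictly ordered form.

```python
def shift_forward(dst, src):
    """dst[i] = src[i-1] + 1.0 for i >= 1, in strict index order."""
    for i in range(1, len(src)):
        dst[i] = src[i - 1] + 1.0


def shift_forward_reordered(dst, src):
    """Same assignments, but every load is done before any store."""
    loaded = [src[i - 1] for i in range(1, len(src))]
    for i, value in enumerate(loaded, start=1):
        dst[i] = value + 1.0


# Distinct arrays: both orderings give the same answer.
x, y = [0.0] * 4, [0.0] * 4
shift_forward(x, [0.0] * 4)
shift_forward_reordered(y, [0.0] * 4)

# Aliased arrays (dst is src): the orderings no longer agree, because in the
# ordered loop each store feeds the next load.
a = [0.0] * 4
shift_forward(a, a)
b = [0.0] * 4
shift_forward_reordered(b, b)
```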

[ Obviously, using F90 USE statements to pass the variables would result in a much
more compact, and, arguably, more elegant code than having up to a dozen arguments
to each subroutine -- e.g., compare step2d.F from 1.9 vs. step2d_LF_AM3.h from 3.4:
In 1.8 the first executable statement in the driven routine occurs on line 82, which is
#include "set_bounds.h". Everything above it is variable declarations and the driver
routine. In contrast, in 3.4 #include "set_bounds.h" is on line 588, so the declaratory
part of the code is much more massive and, arguably, ugly. However, this is not about
aesthetics any more: there is much more at stake here. ]

[ I believe that the performance difference of the common-block vs. F90-module versions of
the code is actually more sophisticated than just scaring off compiler optimizations by the
possibility of aliasing hazard. After all, running nm to see symbolic names inside a .o file compiled
from a source code with module declaration reveals some remapping of names and their
encapsulation into a kind of structure. But I am not aware of any literature which clearly states
whether this by itself may cause a performance hit. Clearly F90 modules allow more run-time
checking of array bounds, multidimensional shapes, etc., which may come at some price. ]


This passing of global arrays through arguments exposed the code to another
problem: Fortran 90 compilers sometimes tend to introduce temporary variables to
copy the data before passing it into the subroutine, and copy it back upon return.
If this happens, it slows down everything significantly. To minimize the unintended
copying in Fortran 90 one should use explicit interfaces for all procedures, either
with INTERFACE blocks, or with module USE statements, or by nesting one
procedure inside another with CONTAINS. The cause for copying/not copying is
quite complex and is related to the fact that F90 arrays, unlike F77, can be
noncontiguous in memory, see http://www.pathscale.com/node/151. For this
reason, modularization of subroutines was introduced into ROMS resulting in
the familiar structure,

Code: Select all

      MODULE step2d_mod
      PRIVATE
      PUBLIC  :: step2d
      CONTAINS
# include "step2d_LF_AM3.h"
      END MODULE step2d_mod
as it appears in modern-day ROMS. This, along with the use of ASSUMED_SHAPE
array declaration inside the working routine, and with some luck prevents the unwanted
copying. Some relevant experiences are reported here:
viewtopic.php?f=29&t=134&p=572&hilit=ASSUMED_SHAPE#p572

The above more or less completes the explanation of why 3.4 looks the way it does
today, but does not fully answer why 1.9 is still faster, at least for some configurations.
Nor does it explain why the code performance was gradually depleted step by step in front
of the eyes of so many, with so few voices complaining.

There has never been a deficit of outside advisors. The transition to F90 was influenced
by the appearance of various official Guidelines and Recommendations about how the
code should be written and maintained. Perhaps the earliest document of this kind is
European Standards for Writing and Documenting Exchangeable Fortran 90 Code,
written by Phillip Andrews (Met Office), Gerard Cats (KNMI/HIRLAM),
David Dent (ECMWF), Michael Gertz (DWD), Jean Louis Ricard (Météo-France)
http://www.scribd.com/doc/7058020/Europ ... an-90-Code
Note the date -- 23 October 1996 -- at that time SGI (our main computing platform of
that time, also see below) Fortran 90 compilers were in the process of transitioning from
non-existent to immature status. In contrast, MIPS F77 compilers were very
advanced, fully supporting multi-threading, software pipelining, and providing extensive
diagnostics (including software tools to check software pipelining and to measure
absolute performance of the code in terms of practical MFlops/sec -- a capability
unmatched today. Does anybody remember SGI pixie-profiler?)

[ After a long history of being fired, rehired, fired again, and then SGI's own demise, the very same
MIPS people started their own compiler company, PathScale http://www.pathscale.com.
Ekopath and all known optimizations are their product and motto. Back in the SGI/MIPS era their
compilers were able to extract up to 30...50% of the theoretical peak performance from an R10k
processor. In practice, a 200MHz Origin 200 matched or ran faster than a Sun Enterprise with twice
the clock speed. I doubt that this level of efficiency is ever matched today. ]


The above Guidelines were adopted by UK Metoffice Fortran 90 Standards,
http://research.metoffice.gov.uk/resear ... dards.html

Programming Guidelines for PARAMESH Software Development
http://www.physics.drexel.edu/~olson/pa ... Guide.html

I recall a similar official document from ONR/NRL dated back in 1998, but can
no longer find it on the web.

All the documents are consistent with each other, advocating code clarity, readability,
consistent indenting rules, proper on-line code documentation, and not using obsolescent
arithmetic ifs, gotos, non-integer loop indices, assigns, equivalences (use Fortran 90
pointers instead), etc. Common blocks must be eliminated from the modern codes
-- Fortran 90 modules are the better way to do it. Open MP-like directives existed in the
form of proprietary sets from different vendors, but were not standardized at that time,
so they were not discussed or even mentioned in Fortran 90 Guidelines. [All the directives
were similar in semantics to the original Cray Y-MP/C-90 directives from which they were
derived, but were always altered because of obvious copyright issues.] Nor was it a
concern of the authors of the Guidelines that certain F90 features were poorly supported
(resulting in significant performance penalties), not supported at all, and/or interfering
with parallel directives.

A modern, 2009 document Recommendations for Writing Fortran 95 Code
by S.-A. Boukabara and P. Van Delst.
http://projects.osd.noaa.gov/spsrb/stan ... un2009.pdf
among other things states:

p. 6: Do not use external routines (subroutine not contained within a module and not within
the CONTAINS statement of the main program) as in some cases, these functions need interface
blocks that would need to be updated each time the interface of the external routine is changed.

p. 7: No Common blocks. Modules are a better way to declare/store...

p. 7, below: No implicit changing of the shape of an array when passing it into a subroutine.
Although actually forbidden in the standard it was very common practice in FORTRAN 77 to pass
'n' dimensional arrays into a subroutine where they would, say, be treated as a 1 dimensional array.
This practice, though banned in Fortran 90, is still possible with external routines for which no Interface
block has been supplied. This only works because of assumptions made about how the data is stored.
[ Note that the ROMS 1.9 practice of passing private scratch arrays into a _tile
subroutine from its driver only narrowly escapes this on a technicality. It is prohibited
to pass a multidimensional array and treat it as one-dimensional inside. ROMS 1.9
does the opposite: it passes a one-dimensional (hence contiguous in memory) array
which is then treated as multidimensional inside. ]

The use of dynamic memory allocation was encouraged and even insisted on, especially
in the operational community: ideally one can provide an executable file called roms.exe
which can be used for any configuration without recompiling, just by reading grid dimensions
and all the configuration parameters from roms.in or from a netCDF file.

[ Here it should be noted that approximately at the same time at least two ocean
modeling communities -- POP/Los Alamos and MOM/GFDL sometime later -- decided
to get rid of C-preprocessor directives altogether: CPP was declared an "evil piece of
software". Fortran run-time if-statements were adopted instead. ]

However the encouragement of dynamic memory is not universal.

CCM4 code standard - NESL's Climate & Global Dynamics (CGD)
Coding Standard for CCM4

http://www.cgd.ucar.edu/cms/ccm4/codingstandard.shtml
and another set of Official Guidelines, NCAR/CCSM,
http://www.cesm.ucar.edu/working_groups ... node7.html
states:

Memory management: The use of dynamic memory allocation is not discouraged
because we realize that there are many situations in which run-time array sizing is
desirable. However, this type of memory allocation can cause performance problems on
some machines, and some debuggers get confused when trying to diagnose the contents
of such variables. Therefore, dynamic memory allocation is allowed only "when necessary".
The ability to run a code at a different spatial resolution without recompiling is not
considered to be an adequate reason to use dynamically allocated arrays.

The loss of performance was initially obscured by the failure to take Linux computing
seriously, at least in the US. While for younger people Linux is the only UNIX-like
computer they ever knew (the recent FreeBSD-based Mac OS is also similar, ...sort of), the
common attitude back around 2000 was that Linux was merely a cheap substitute for
a UNIX workstation. The most widely used everyday workstation environment of that era
was Sun, and the supercomputing environment was dominated by SGI Origin 2000, IBM SP2,
and, somewhat earlier, Cray T3D. The Cray C90, the most powerful machine some time before
and the easiest to use, was fading away, and so was the Cray J90 (a CMOS replica of the even
earlier Y-MP designed with cost in mind -- no more cryogenic cooling). All were UNIX systems,
http://www.youtube.com/watch?v=dFUlAQZB9Ng.
The strategy for code optimization was very different from what we have
now: on all these machines the ratio between the processor's ability to execute
floating-point operations and its ability to load/store numbers was much closer to 1:1,
because the contrast between the clock frequency inside the CPU core and that of memory was
nowhere close to what we had in PC architectures then and have today; and, at the same
time, logical operations were considered expensive relative to floating-point ones. Cache
effects were sometimes discussed at that time, but this had no practical consequences
for the design of ocean modeling codes. Linear algebra was a hot topic. SGI people were
promoting their out-of-order execution idea in the R10k and follow-up CPU architectures.
IBM engineers were very excited about the then-new Power4 processor and were talking
about hiding latencies by pre-fetching data before a cache miss occurs. Simply put,
if a miss occurs, record the address; on the second miss, retrieve it, then extrapolate the
address trying to predict where the next miss will occur (assuming that the memory
access pattern is regular, as it is in a very long do-loop) and immediately start
pre-loading data from there. Subsequently, if the data is needed, the stream is
confirmed, so keep extrapolating and pre-loading; if not (hence a new miss), cancel
the stream and recalculate the extrapolated address based on the new miss. This is all
implemented in hardware. Obviously, they assumed a very excessive
memory bandwidth for this strategy to work -- this would never be the
case for a PC-type architecture. Some of these influences still remain today,
even though they no longer have any basis. The truth is that as of the end of year
2000 a dual Pentium III machine with Linux and a proper code and compiler could match
an entry-level, dual-CPU Origin 200, which was at least 10 times more expensive.
However, nobody cared. While F90 compilers were reasonably mature for UNIX
workstations -- SGI, Sun, and IBM -- in the Linux world... forget about it: the only
compiler available at that time was g77. A free Intel compiler, IFC 5.0, became available
late in 2001, and this changed the equation completely. Still, it took several years
before it was taken seriously.
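
The pre-fetching scheme described above (first miss recorded, stride taken from the second, stream confirmed or cancelled afterwards) can be mimicked in a few lines. This is a toy model written for illustration, not a description of the actual Power4 hardware: it tracks a single stream, counts addresses in whole cache lines, takes the stride from the previous access, and ignores latency and bandwidth altogether. The class name StreamPrefetcher is made up.

```python
class StreamPrefetcher:
    """Toy model of one hardware prefetch stream (addresses in cache lines)."""

    def __init__(self):
        self.last = None        # previously accessed line
        self.stride = None      # current guess for the access stride
        self.prefetched = None  # line being pre-loaded, if any
        self.hits = 0
        self.misses = 0

    def access(self, line):
        """Access one line; return True if it had already been pre-loaded."""
        hit = line == self.prefetched
        if hit:
            # Stream confirmed: keep the stride and keep extrapolating.
            self.hits += 1
        else:
            # New miss: cancel the stream and re-derive the stride.
            self.misses += 1
            if self.last is not None:
                self.stride = line - self.last
        self.last = line
        # Extrapolate where the next access is expected and pre-load it.
        self.prefetched = None if self.stride is None else line + self.stride
        return hit


# A long regular sweep, as in a very long do-loop: only the first two
# accesses miss, everything after that has been pre-loaded.
p = StreamPrefetcher()
sweep = [p.access(line) for line in range(10)]
```

In this model a regular sweep costs two misses no matter how long it is, while every change of direction or stride costs two more. The model says nothing about whether memory can actually deliver the pre-loaded lines in time, which is exactly the bandwidth assumption questioned above.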

As far as I remember, conversion of ROMS to Fortran 90 was complete long before
it was tried for the first time on a Linux computer.

Did I not mention the Connection Machine? Remember the Nedry guy from the iconic
movie (http://www.youtube.com/watch?v=c602hzsL0VA)? Two Connection Machines
were donated by the Navy in the mid-199x: one to Rutgers and one to MIT. In both cases
the consequences were devastating. On one hand, these were the biggest computer
resources available to these groups and they were in-house: no queues, no waiting.
On the other, running codes on them required a very special programming style which was
not supported by software on any other machine/compiler/operating system, and
it was clear at that time that there would be no successor in development along that
line. [Thinking Machines Corporation filed for bankruptcy in Nov. 1994; the extensions
to the Fortran standard pioneered by TMC CM Fortran (notably the FORALL and WHERE
statements and the array syntax) were later incorporated into the F90 Standard and
eventually finalized in F95; however, hardware support of these was and still is very
poor -- during the Connection Machine era CPU, memory clocks, and frequencies of
signals transmitting data between the nodes were approximately the same, < 100MHz,
so the idea of having very many CPUs work under the condition of very tight
synchronization actually worked. Cache? What is cache? What would one need
it for?] I estimate that development of MITgcm was delayed by at least 3 years, and,
curiously enough, late in the 199x they were busy rewriting their CM F90 code into plain
F77 (absolutely no F90 extensions) in order to be able to run it on Linux clusters,
where the only available compiler was g77. The present-day MITgcm,
http://mitgcm.org/public/source_code.html
still stays pretty much this way as of Jan 25, 2011. It was learned the hard way.

[ Los Alamos POP code was rewritten into array syntax originally to run it
on a Connection Machine. It still remains in this form today. ]

...This review may continue, but it is already way too long.

The paradox of this overall situation is that if one were to start developing
an ocean modeling code from scratch today, most likely he would start writing
an advanced F90-style code, and, because there would be no legacy code to
compare with, the penalty in performance would never be noticed.

The point is that the different causes of performance loss tend to conceal each
other. I hope it is clear from the above in this thread: one loss causes another one
to go unnoticed. If someone's code has poor cache utilization, compiler optimizations
do not matter much: the code shows little or no sensitivity to -O3 vs. -O2 settings.
If the optimizations are crippled for whatever reason, then cache does not matter any
more, and the code is not sensitive to tiling. And if you have no tiling -- and most
ocean modeling codes do not have this capability -- then there is no way to discover
that it is sensitive to tiling, hence no awareness that cache matters. In order to lose
something, one must have something to lose in the first place. We have.

ezaron
Posts: 16
Joined: Mon Oct 26, 2009 3:06 am
Location: Oregon State University

Re: Intel’s new i7 980x CPU gives disappointing speedup

#33 Unread post by ezaron »

Alex,

Thanks a lot for writing this review and taking the time to engage with the original poster. It is no small task to assemble the list of examples and history into a narrative as you have done. And I think your final take-home point is very well said. If you would consider writing a topical piece for Ocean Modelling I imagine it would be an instant hit with new graduate students.

All the best,
Ed

mathieu

Re: Intel’s new i7 980x CPU gives disappointing speedup

#34 Unread post by mathieu »

In the 90s the talk was that well-optimized Fortran
libraries were about 30% faster than the same C code.
The common explanation was that in Fortran there are no
pointers and all arrays are directly declared, which gives
the compiler much room to optimize the code. This is as
opposed to C code, where pointer code essentially cannot
be optimized.
Most of the above explanations seem to refer to the use
of automatic arrays and pointers. Why was it felt necessary
for ROMS to use them?

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: Intel’s new i7 980x CPU gives disappointing speedup

#35 Unread post by kate »

Why knowingly change to a less efficient style? Perhaps this quote sums it up:
I well remember when this realization first came on me with full force. The EDSAC was on the top floor of the building and the tape-punching and editing equipment one floor below. [...] It was on one of my journeys between the EDSAC room and the punching equipment that "hesitating at the angles of stairs" the realization came over me with full force that a good part of the remainder of my life was going to be spent in finding errors in my own programs. -Maurice Wilkes
Do we spend our time staring at the code or waiting for the numbers to come out? Do we want the fastest code that's perhaps more challenging to debug? I'm sure each of us has a different answer to that - and to what "looks good".

shchepet
Posts: 188
Joined: Fri Nov 14, 2003 4:57 pm

Re: Intel’s new i7 980x CPU gives disappointing speedup

#36 Unread post by shchepet »


Do we spend our time staring at the code or waiting for the numbers to come
out? Do we want the fastest code that's perhaps more challenging to debug?
Kate, or perhaps anybody else: just compare step2d.F from v.3.4 vs. v.1.9 and
explain to me and to everybody else why v.3.4 is easier to debug than v.1.9?

arango
Site Admin
Posts: 1368
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

Re: Intel’s new i7 980x CPU gives disappointing speedup

#37 Unread post by arango »

Well, I didn't want to put my biases into the conversation here. There is a lot of very useful information here for fine-tuning computer hardware running ROMS.

I just wanted to note that we are talking about scientific coding and programming scientists. As programming scientists, we all have our biases and preferences. Everybody has them. Diversity makes our life more interesting. This is not that different from the political discourse that we see nowadays between political parties and their leaders. However, science is unbiased and we should follow the scientific methodology. Comparing elapsed time between two versions of code requires a more careful mathematical analysis. Usually during code evolution we enhance code by adding capabilities, which may require more computations. If a comparison such as the one mentioned above is to be made, we had better be sure that the number of floating-point operations is the same. Otherwise, we may reach the wrong conclusion.

I don't know how the timings reported here were collected or how the associated code differs mathematically between the two versions. One would have to look into the assembly code and start counting the floating-point operations. Measuring floating-point operations per second (FLOPS) is not trivial on today's cheap computer architectures. I recall that the CRAY computers gave us that kind of information when running programs about two decades ago. Accurate code benchmarking is difficult. There are too many parameters to consider. Then, we have the random computer architecture and software behaviors...

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: Intel’s new i7 980x CPU gives disappointing speedup

#38 Unread post by kate »

One could argue that since it is Hernan's goal that users never change the base code, it shouldn't matter how ugly it is. Fast code for all should be worth the pain of the few.

Sasha - have any other compilers caught up with old Fortran? Any good C codes out there? C++?

arango
Site Admin
Posts: 1368
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

Re: Intel’s new i7 980x CPU gives disappointing speedup

#39 Unread post by arango »

Actually, that is not my goal. The issue here is that I need to support all the ROMS algorithms in their totality. This includes all the adjoint-based algorithms. The nonlinear ROMS model is only a fraction of all the algorithms that we support. If you are curious, please check any of the 32 different drivers that we currently offer in the ROMS/Drivers sub-directory. The nonlinear model driver (nl_ocean.h) is just one of them. The problem is that changing any single line in the computational nonlinear kernel requires an equivalent change in its adjoint model (sub-directory Adjoint) and in the perturbation and finite-amplitude tangent linear models (sub-directories Tangent and Representer, respectively). This includes re-testing the symmetry of all these operators at each discrete point. This is extremely time consuming and we don't have many volunteers or experts to do so. However, changes are still coming, but at a slower pace. This is one of the reasons why I have delayed the release of the nesting capabilities. I have been re-writing and testing this development for more than a year now.

When the ROMS code was designed over 10 years ago, we did not foresee the algorithms that we offer today. There are still more complex drivers coming... During the development of all these new algorithms, we needed to rework several aspects of the code. For example, we cannot have any redundant computations or assignments anywhere because they will yield an incorrect adjoint. The private storage was reworked to guarantee the correct adjoint and symmetry within the linearized model operators. In the early days, we had private storage equivalence by C-preprocessing re-assignment of internal variable names. For example, in step2d.F for ROMS 1.8 we had:

Code: Select all

#define zwrk UFx
#define gzeta UFe
#define gzeta2 VFx
#define gzetaSA VFe
This use of private storage to minimize memory requirements is fatal and not possible in adjoint computations when the model is run backwards in time. Therefore, we had to re-write and change strategies. In my personal opinion, the continuous evolution of any ocean numerical model is essential for its survival. Otherwise, such models become stagnant and less attractive to the user community. Our scientific knowledge and literature are never stagnant and are always evolving. So nothing is written in stone and everything is always open for revision. One of the attractive properties of ROMS is that it offers an extensive set of options and capabilities to thousands of users in the community.

At the end, the model is offered free to the ocean modeling community worldwide. The user has a choice and we are not recommending anyone to use a particular version of ROMS or any other model.

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: Intel’s new i7 980x CPU gives disappointing speedup

#40 Unread post by kate »

Another question for Sasha: People are going to more and more processors these days, not faster processors. I've used about 100 at a time, but don't feel I can usefully use 1000. How do we get there?

I'm sure it wouldn't surprise you that our new cluster with two hex-core chips per node is slower per core than our old cluster with two dual-core chips per node.

shchepet
Posts: 188
Joined: Fri Nov 14, 2003 4:57 pm

Re: Intel’s new i7 980x CPU gives disappointing speedup

#41 Unread post by shchepet »

...This use of private storage to minimize memory requirements is fatal and
not possible in adjoint computations when the model is run backward in time...
Hernan, the horrendous practice in v.1.8 you are referring to is merely the
reuse of the same scratch memory for two different purposes in sequence.
It can be illustrated as follows:

Code: Select all

   real, dimension(TILE_SIZE_ARRAY) :: wrk1,wrk2,wrk3,wrk4
 ......
   do i,j=...
     wrk1(i,j)= [ scratch variables    ]
     wrk2(i,j)= [ needed for computing ]
     wrk3(i,j)= [ barotropic pressure- ]
     wrk4(i,j)= [ gradient terms       ]
   enddo
   do i,j=...
     rubar(i,j) = [ finite-difference expressions ]
     rvbar(i,j) = [ involving wrk1,wrk2,wrk3,wrk4 ]
   enddo
         !--> discard wrk1,wrk2,wrk3,wrk4 because their
         !    content is not needed beyond this point

   do i,j=...
     wrk1(i,j)= [ fluxes for  ]
     wrk2(i,j)= [ barotropic  ]
     wrk3(i,j)= [ advection   ]
     wrk4(i,j)= [ terms       ]
   enddo
   do i,j=...
     rubar(i,j) = rubar(i,j) + [ finite-difference expressions ]
     rvbar(i,j) = rvbar(i,j) + [ involving wrk1,wrk2,wrk3,wrk4 ]
   enddo
         !--> discard wrk1,wrk2,wrk3,wrk4
To make it more human-readable by giving the scratch variables more
interpretable names instead of the rather meaningless wrk1,wrk2,wrk3,wrk4, as
well as to highlight the purpose of each scratch variable and to limit the scope
of its existence/usage/need, the above was rewritten into

Code: Select all

   real, dimension(TILE_SIZE_ARRAY) :: UFx,UFe,VFx,VFe
 ......

#define zwrk UFx
#define gzeta UFe
#define gzeta2 VFx
#define gzetaSA VFe
   do i,j=...
     zwrk(i,j)=    [ scratch variables    ]
     gzeta(i,j)=   [ needed for computing  ]
     gzeta2(i,j)=  [ barotropic pressure- ]
     gzetaSA(i,j)= [ gradient terms       ]
   enddo
   do i,j=...
     rubar(i,j) = [ finite-difference expressions ]
     rvbar(i,j) = [ involving zwrk, gzeta,gzeta2,gzetaSA ]
   enddo
         !--> discard zwrk,gzeta,gzeta2,gzetaSA
#undef gzetaSA
#undef gzeta2
#undef gzeta
#undef zwrk


   do i,j=...
     UFx(i,j)= [ fluxes for  ]
     UFe(i,j)= [ barotropic  ]
     VFx(i,j)= [ advection   ]
     VFe(i,j)= [ terms       ]
   enddo
   do i,j=...
     rubar(i,j) = rubar(i,j) + [ finite-difference expressions ]
     rvbar(i,j) = rvbar(i,j) + [ involving UFx,UFe,VFx,VFe ]
   enddo
         !--> discard UFx,UFe,VFx,VFe
which is exactly as it is done in step2d.F from v.1.8.

In principle one could treat UFx,UFe,VFx,VFe exactly the same way, so
that they and zwrk,gzeta,gzeta2,gzetaSA are treated symmetrically, i.e., leave
the array declaration as in the original version, but apply #define/#undef
for both sets of scratch variables,

Code: Select all

   real, dimension(TILE_SIZE_ARRAY) :: wrk1,wrk2,wrk3,wrk4
...
#define zwrk wrk1
#define gzeta wrk2
#define gzeta2 wrk3
#define gzetaSA wrk4
   do i,j=...
     ....          compute, use, and discard
     ....          zwrk,gzeta,gzeta2,gzetaSA
   enddo
#undef gzetaSA
#undef gzeta2
#undef gzeta
#undef zwrk

#define UFx wrk1
#define UFe wrk2
#define VFe wrk3
#define VFx wrk4
   do i,j=...
     ....          compute, use, and discard
     ....               UFx,UFe,VFe,VFx
   enddo
#undef VFx
#undef VFe
#undef UFe
#undef UFx
In terms of substance the three versions above are exactly equivalent.
Which one to prefer is simply a matter of programming style and taste.
The point here is to give the human eye what it likes most (i.e.,
meaningful names), and to give the compiler/machine what they can handle
best (i.e., fewer arrays and less memory).

I simply do not buy the argument that this horrendous practice
complicates writing the adjoint code.
In fact, I actually see the opposite.
After all, the adjoint is about tracking dependencies in reverse, bottom-to-top
order. The second and the third versions of the code visually express the
fact that data dependencies transmitted through wrk1,...,wrk4 originate and
terminate within each of the two segments of the code, and are not transmitted
between the two. Therefore, the dependency chains are visually shortened.

If, for whatever reason, this explanation is not satisfactory, then one can
always revert to the first version of the code -- explicitly use array names
wrk1,....,wrk4. It is expressed in plain standard Fortran. It is not a big deal.


P.S.: The above also exposes the deficiency of Fortran, and, as a matter
of fact, any programming language. The comment line

Code: Select all

 !--> discard wrk1,wrk2,wrk3,wrk4
should be replaced with a compiler directive

Code: Select all

!$DIR DO_NOT_SAVE_FROM_CACHE_TO_MEMORY :: wrk1,wrk2,wrk3,wrk4
after the relevant enddo. Imagine wrk1,wrk2,wrk3,wrk4 go out of cache.
The machine will still handle them as any other variables, i.e., save them: the compiler
cannot tell whether they will be needed in the future or not, hence it will
do the useless operation of storing them. There must be a way to tell it not
to, but I am not aware of any such directive. C used to have the volatile attribute,
but it is kind of outmoded nowadays.

arango
Site Admin
Posts: 1368
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

Re: Intel’s new i7 980x CPU gives disappointing speedup

#42 Unread post by arango »

Writing the adjoint of simple codes is trivial. However, it becomes really tricky with advanced kernels, like the one in ROMS, which includes multiple time-levels, time-averaging, integration, quadratic or higher dependencies, recurrence, implicit algorithms, and so on. Sometimes, we need to save intermediate solutions when linearizing high order dependencies with the state (adjointable; time-dependent) variables. So we need to achieve a balance between storage or recomputing forward solutions in the intermediate computations. It requires a lot of skill and years of practice to do this accurately. Then, there is the issue of clarity and debugging that may take months. For example, the adjoint of biological models is very difficult and complex as the number of compartments increases. If you are curious, check ad_npzd_iron.h, for example. The semi-Lagrangian sinking is a nightmare.

I recall playing with F90 pointer association of variables, for example:

Code: Select all

      gzeta => SCRATCH(ng) % wrk2d(:,:,5)
the performance penalty was much higher than with automatic arrays. I don't recall the slowdown factor, but it was high for step2d. This pointer association (equivalence) implied copying and initialization in some compilers. This is not a good strategy to manage private arrays in step2d. I haven't checked the performance of this type of assignment in several years. Perhaps it is better now.

shchepet
Posts: 188
Joined: Fri Nov 14, 2003 4:57 pm

Re: Intel’s new i7 980x CPU gives disappointing speedup

#43 Unread post by shchepet »


People are going to more and more processors these days, not faster processors.
I've used about 100 at a time, but don't feel I can usefully use 1000. How do we get there?
Kate, do you mean "cores" or "CPUs" or MPI nodes or hardware nodes
(motherboards)? We routinely use 256 cores "in house", and the maximum
some of us have used is, I believe, 512 (personally I have used 384),
and I do not feel that I have hit the ceiling: the code runs faster on 384 than
on 256, and we do observe superlinear scaling from 128 to 256. But, of
course, we are running much larger problems than we used to. Typical
MPI subdomain sizes are 100 x 50 as seen from above, and we customarily
partition each MPI subdomain into 2 or 4 tiles -- this helps a bit.

Controlling your node placement (which MPI-node runs on which
motherboard) is absolutely essential. A plain

Code: Select all

mpiexec -np 256 roms 
is just a non-starter. The correct way to go is

Code: Select all

mpiexec -np 256 -machinefile machine_list  roms 
where "machine_list" is the list of hosts in the proper order. The point is
to partition your grid into machines first. Say -np 256 means 32 nodes of 8
cores each (two quad-core CPUs); then partition your grid first into 32 using
the best perimeter-to-area ratio idea (thus ignoring the fact that there are
multiple processors inside) -- these "macro-subdomains" are approximately
squares. Then partition them: usually one-dimensionally, 1x8 (most of the
time), seldom 2x4, if the squares are not squares but are elongated in the
XI-direction. Then create machine_list in such a way that MPI-ranks
belonging to the same "macro-subdomain" end up on the same motherboard.
Optionally, depending on their size and shape, partition the "micro-subdomains"
into tiles.
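
The placement recipe above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the host names are made up, the function name build_machine_list is hypothetical, and the sketch assumes that MPI ranks are numbered with the XI-direction tile index varying fastest (rank = i + j*ntile_i), which should be checked against the actual code before relying on it.

```python
def build_machine_list(hosts, macro_i, macro_j, micro_i, micro_j):
    """Return one host name per MPI rank, in rank order.

    hosts            -- macro_i*macro_j host names (hypothetical names)
    macro_i, macro_j -- partition of the grid into per-node macro-subdomains
    micro_i, micro_j -- partition of each macro-subdomain among its cores
    Assumption: rank = i + j*ntile_i, with i the XI-direction tile index.
    """
    if len(hosts) != macro_i * macro_j:
        raise ValueError("need exactly one host per macro-subdomain")
    ntile_i = macro_i * micro_i
    ntile_j = macro_j * micro_j
    machine_list = []
    for rank in range(ntile_i * ntile_j):
        i = rank % ntile_i           # tile index in XI
        j = rank // ntile_i          # tile index in ETA
        # Macro-subdomain (and therefore motherboard) that owns this tile.
        macro = (i // micro_i) + (j // micro_j) * macro_i
        machine_list.append(hosts[macro])
    return machine_list


if __name__ == "__main__":
    # Example only: 32 nodes arranged as 8x4 macro-subdomains,
    # each split 1x8 among the 8 cores of its node.
    hosts = ["node%02d" % k for k in range(32)]
    for name in build_machine_list(hosts, 8, 4, 1, 8):
        print(name)
```

Redirecting the printed lines into a file gives something that could be passed to -machinefile; whether repeated host names or a "host:count" syntax is expected depends on the MPI implementation in use.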

In-house the names of the compute nodes are known, so the whole procedure
is done by hand. At NCSA the names of the nodes are not known a priori
before the job starts, but NCSA provides an environment variable containing
the list of the nodes just when the job starts. So dump this into a file, read
it with a special Fortran program, and create machine_list. Then run mpiexec.

I/O is starting to worry me, but it is not a show-stopper yet. Thus far we have
by-passed the problem (not solved it): every MPI node writes its own
netCDF file. Post-processing tools to assemble the partial files into the
big one are provided, as well as tools to partition input netCDF files.
This way I/O becomes completely irrelevant as a scaling bottleneck (in house we
write onto the local scratch disk on each individual compute node; at NCSA
the Lustre filesystem is sufficiently fast, so it copes with it). Of course,
post-processing assembly requires baby-sitting your running job, but with
the number of cores we are using thus far, we are able to assemble data
faster than it is created by the running MPI job.

What to do with the data, besides admiring it using ncview, is another
problem. Most post-processing is done by Matlab, and this has certain
limits.

I'm sure it wouldn't surprise you that our new cluster with two hex-core chips per
node is slower per core than our old cluster with two dual-core chips per node.
Going from dual- to six-core CPUs sounds like a long time span to me.
Dual-, I guess, means either dual-core Opterons of the 27x-series or early
versions of 5000-series Xeons. That dates back to late 2006 -- early 2007.
Six-core means 5600-series Xeons -- the second generation of the Nehalem
core starting from the second half of 2010.

No, cores only become faster and faster, and so do memory systems.
But getting the best utilization of them became a bit harder lately,
although from the historical point of view the tendency is not monotone.

Let's do some basic estimates. Imagine a mid-2004 Pentium 4 "Prescott" core.
Single processor, single core, 3.2GHz, 800MHz FSB, 1M cache. Memory is
dual-channel DDR (first generation of DDR=DDR1, also known as DDR400 and
PC3200). PC3200 means 3.2GBytes/sec bandwidth per channel, so the machine
has a total bandwidth of 6.4GByte/sec, which translates into 800M/sec of
double-precision numbers [6.4 divided by 8 Bytes/number]. This is how
many numbers can go from main memory into L2 cache per second. At the same
time the CPU (meaning core) can load or store 1 double-precision number
per clock cycle, meaning that 1 number can go from L1 cache into registers
per clock cycle of the core. The clock speed is 3.2GHz, so our
beloved Pentium 4 can load/store 4 times as many numbers from/into
its registers as the memory bandwidth of its 875P Northbridge can
provide [3.2G/sec numbers vs. 800M/sec].

Not a very promising idea, if all you intend to do is linear algebra,

Code: Select all

     do i=1,some_very_large
        a(i)=b(i)+c(i)*d(i)
     enddo
assuming a dream compiler capable of prefetching everything in advance,
the above loop would run at only 1/16 of the theoretical computational
peak speed of the machine [Pentium 4 can do one multiply-add per clock
cycle in a pipelined loop, but each iteration moves four numbers (three
loads and one store) through a memory system that is already four times
too slow]. In practice, however, it will be even worse,
because it will go from one cache miss (stall) to the next, wasting about
40 clock cycles during each miss and loading 4 numbers from main memory
at a time.
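
The arithmetic behind the "4 times as many" and "1/16 of peak" figures can be written out as a small Python sketch. The function names are hypothetical; the inputs are the numbers quoted above (3.2GHz core, two channels at 3.2GBytes/sec each, 8 Bytes per number), and the count of four memory operations per iteration assumes that b, c, d are loaded and a is stored, with nothing reused from cache.

```python
BYTES_PER_NUMBER = 8  # double precision


def memory_numbers_per_second(channels, gbytes_per_sec_per_channel):
    """Numbers per second that main memory can deliver."""
    return channels * gbytes_per_sec_per_channel * 1e9 / BYTES_PER_NUMBER


def register_to_memory_ratio(cores, clock_ghz, channels,
                             gbytes_per_sec_per_channel):
    """How many more numbers the cores can load/store than memory can supply."""
    # One load or store per clock cycle per core, as stated in the post.
    loads_per_second = cores * clock_ghz * 1e9
    return loads_per_second / memory_numbers_per_second(
        channels, gbytes_per_sec_per_channel)


def fraction_of_peak(ratio, memory_ops_per_iteration):
    """Best-case fraction of peak for a streaming loop with no cache reuse."""
    return 1.0 / (ratio * memory_ops_per_iteration)


if __name__ == "__main__":
    ratio = register_to_memory_ratio(1, 3.2, 2, 3.2)
    print("ratio:", ratio)
    # a(i)=b(i)+c(i)*d(i): three loads and one store per multiply-add.
    print("fraction of peak:", fraction_of_peak(ratio, 4))
```

With the Pentium 4 numbers this gives a ratio of about 4 and a best-case fraction of about 1/16, before any cache-miss stalls are counted.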

Why 40 clock cycles?

CAS latency for typical DDR of that era is 2.5 (ranging from 2.0 for HyperX
to 3.0 for typical ECC memory), meaning that the actual latency is 2.5 periods
of the 200MHz memory clock. Hence 12.5 nanoseconds, which
is 40 cycles of the 3.2GHz CPU clock.

Why 4 numbers at a time?

Because its cache line is equivalent to 4 double-precision numbers. Very
simple. Each DIMM has 184 pins, 128 of which are signals. Dual-channel
means that there are 2*128 = 256 signaling wires going from memory to
the Northbridge chip. 256 wires = 256 bits transmitted simultaneously,
or 32 Bytes, or 4 REAL(kind=8) numbers.

...Anyway, from now on just keep in mind the "can load/store 4 times
as many numbers" ratio highlighted above. This is our reference point.
Obviously, the only way to keep the floating-point units of that CPU busy
is to perform mathematical operations again and again on the same numbers
loaded into L2 cache.

How do other designs compare with this?

Xeons contemporary to that 2004 Pentium 4, together with their E7505 chip
set, are actually much worse. The same CPU clock speed, but the FSB is 667MHz
instead of 800MHz, and the two CPUs share the same memory bandwidth.
The above 4 times as many becomes 9.6 times as many.

Compare it with slightly earlier Xeons of 2002-2003: the FSB was
even lower, 533 and 400MHz, and so was the memory speed (at that time and
earlier the memory clock was tied to the FSB, but Xeons, unlike Pentium 4, had a
BIOS-selectable step multiplier). It is easy to calculate that the ratios
were:

11.5 times as many for 3.06GHz/533MHz FSB Xeon, E7501 chip set,
533MHz dual-channel DDR memory.

14 times as many suffocating ratio for 2.8GHz/400MHz FSB "Prestonia"
core Xeon, i860 chip set, RDRAM memory, further penalized by its 45
nanosecond memory latency.

This overall observation of Xeons having lower FSB/memory speed than
Intel's single-processor CPUs contemporary to them is referred to as the
"Xeon gap" in Poor Man's Computing. The final version of the Xeon of that generation,
called "Nocona", actually caught up with its Pentium 4 counterpart in both
FSB and clock frequency, resulting in 8 times as many, obviously 2*4
because now 2 CPUs share the same memory bandwidth.

The first generation of AMD Opteron, debuted in 2004 as the single-core
200-series, was actually a much more balanced design. Say, Opteron 248
-- 2.2GHz, but now each CPU has its own dual-channel memory system,
DDR 400MHz, resulting in only 2.75 times as many. This was the most
successful machine of its time: while the measured single-processor
performance of running ROMS code was slightly slower than that of the 3.2GHz
Pentium 4 [comparing one Opteron (the second idle) vs. Pentium], 248s
decisively outperformed "Nocona" in a dual-CPU vs. dual-CPU comparison,
despite the latter having the much higher clock speed of 3.2GHz.

Dual-CPU Opteron 248 also outperformed the dual-CPU Macintosh G5 (made
of IBM PowerPC 4 CPUs), for those who remember it.

This balanced design was so successful that a drop-in replacement of
Opteron 248 with Opteron 275 (same 2.2GHz, but dual-core) resulted in
a still viable and competitive machine comparable to the Core 2 Quad Q6600
arriving in late summer 2007. The ratio is 5.5 times as many.

Going to more modern times, we first note that Pentium 4 becomes
Pentium D in 2006, and DDR memory becomes DDR2, so now it is
dual-core, but memory also improves by nearly a factor of two -- in practice
slightly less. So the proportionality changed only slightly (became worse).
Assuming a dual-core 3.2GHz Pentium D with dual-channel DDR2-667MHz
results in 4.8 times as many.

Dual-core Xeon 5000-series -- the first generation of Xeons designed to
fit into the new socket 771 -- had exactly the same ratio, 4.8 times as many.
This is because this time Intel took a radically different approach in designing
chipsets for dual Xeon motherboards: the memory system remains shared
between the two CPUs, but it becomes quad-channel, thus doubling
not only the aggregate bandwidth, but also the bandwidth available to each CPU
individually. The memory ended up being FBDIMM (FB = fully buffered), and
the engineering merit of this is that it uses serial signals to communicate
with the Northbridge chip, somewhat similar to SATA and PCI-express.
The point is that it would be hard to increase the number of wires (due to
doubling the number of channels) and still keep the arrival of signals travelling
through essentially a parallel bus in sync with each other. The
penalty here is the extra latency needed to convert the signals, extra cost,
and extra heat generated by each memory module (in practice it is actually
outrageous -- 4 DIMMs generate as much heat as 1 CPU).

Note that this is an overall well balanced design.

Core 2 Duo arrived sometime in late 2006, and Intel declared that Pentium D,
and the overall Pentium architecture, is a dead end. Intel said that it has
a better idea. The rumor was that Core 2 evolved from Core -- a family of
laptop CPUs, and Core evolved from .... the Pentium III "Tualatin" core.
Remember that back in 2001 Pentium 4 was nicknamed "another recount for Al
Gore"? This is because several times Intel contested that P4 was faster
than Athlon, and every time Athlon beat it. And, embarrassingly enough,
the practically measured computing performance per clock cycle of the
then new P4 was lower than that not only of Athlon, but also of Intel's
own earlier design, the "Tualatin" core (the final version of Pentium III).
It took about a year or more for Intel to rectify the issue back then.

Whatever the cause, and whether or not it is true that Core 2 is a deeply
redesigned Pentium III, from this moment the practical performance can
no longer be judged by CPU clock alone. A 2.4GHz Core 2 Duo is significantly
faster than a 3.2GHz Pentium D. And there is another reason to mention Al Gore
here: Core 2 Duo was Intel's first desktop CPU designed with
environmental consciousness in mind. The ratio? Formally speaking,
3.6 times as many, assuming DDR2-667 and a 2.4GHz clock. Again,
it can load one double-precision number from L1 to registers at each
clock cycle.

The situation changed for the worse when Core 2 Quad arrived in late
summer 2007. The ratio doubled, to 7.2 times as many, assuming a 2.4GHz
Q6600 with DDR2-667 (around Atmos. UCLA it is known as the "student
computer"). The raw performance is excellent: in practical ROMS computing
using all resources available it is six times faster than our
canonical Pentium 4, assuming optimal tiling. The scaling from using 1 to
2 threads is 1.9, and then going from 2 to 4 is another 1.3 ... 1.4, so
it is about 2.5 total when comparing single thread vs. 4 threads.

WRF people are not very happy: WRF does not scale well.

Quad-core Xeon 5400-series in 2008. Our primary "in house" workhorse,
the E5420, is a 2.5GHz/1333MHz FSB, 12MB cache quad-core Xeon,
based on essentially the Core 2 core. Quad-channel DDR2-667
FBDIMM memory. The ratio is 7.5 times as many. ...Actually it starts
looking like "Nocona" 5 years earlier. Of course, the giant 12MB L2 cache
mitigates this, but only if the code is optimized for cache.

Core i7 920. This is the first CPU of the i7 family. We have 4 of them here in our
group at UCLA. The clock speed is 2.66GHz, 1333MHz FSB, 8 MByte cache.
The clock speed is only slightly faster than the Q6600, and the cache size is the same
(although now it is unified and shared among all four cores, not 4MB + 4MB).
What is radically new is the triple-channel DDR3 memory, resulting in
3 times the bandwidth (a factor of 2 due to DDR3-1333MHz vs.
DDR2-667MHz, and another factor of 1.5 due to triple- vs. dual-channel).
The suffocating ratio is pushed back to 8/3=2.666 times as many.
This is a very balanced design!
In practical ROMS computations using
all resources available this machine is 2.1 times faster than the Q6600, and
it matches/outperforms a dual-E5420 machine as well.

Quad-core Xeon 5500-series ...I guess, the same as i7: ~2.5 times as
many, depending on clock speed. Unlike all previous Xeon designs, each
CPU has its own memory system (like Opterons), so having two CPUs on
the board also doubles the memory bandwidth. Though they are significantly
more expensive than an i7 with the same clock speed, so the clock speeds are
usually lower, hence the ratio.

Hexa-core Xeon 5600-series. Same as above, but now they are packing
six cores to share the same memory bandwidth. The ratio is ~4 times as
many, ...same as the Pentium 4 we started with.
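
For reference, the ratios quoted in this post can be reproduced with one formula. The sketch below is my own tabulation, not part of the original estimate: the function name suffocating_ratio is hypothetical, and it assumes that every memory channel delivers one 8-Byte number per transfer (consistent with PC3200 = 3.2GBytes/sec at 400M transfers/sec) and that each core can load or store one number per clock cycle. Only machines whose clock and memory configuration are stated explicitly above are included.

```python
def suffocating_ratio(cores, clock_mhz, channels, mega_transfers):
    """Register load/store rate divided by main-memory delivery rate.

    cores          -- cores sharing the memory system
    clock_mhz      -- core clock in MHz (one load/store per cycle assumed)
    channels       -- memory channels shared by those cores
    mega_transfers -- transfers per second per channel, in millions
                      (one 8-Byte number per transfer assumed)
    """
    return cores * clock_mhz / float(channels * mega_transfers)


MACHINES = [
    # name, cores, clock MHz, channels, MT/s
    ("Pentium 4 3.2GHz, dual-channel DDR400", 1, 3200, 2, 400),
    ("Opteron 248, dual-channel DDR400", 1, 2200, 2, 400),
    ("Opteron 275, dual-channel DDR400", 2, 2200, 2, 400),
    ("Pentium D 3.2GHz, dual-channel DDR2-667", 2, 3200, 2, 667),
    ("Core 2 Duo 2.4GHz, dual-channel DDR2-667", 2, 2400, 2, 667),
    ("Core 2 Quad Q6600, dual-channel DDR2-667", 4, 2400, 2, 667),
    ("2 x Xeon E5420, quad-channel DDR2-667", 8, 2500, 4, 667),
    ("Core i7 920, triple-channel DDR3-1333", 4, 2660, 3, 1333),
]

if __name__ == "__main__":
    for name, cores, clock, channels, mts in MACHINES:
        ratio = suffocating_ratio(cores, clock, channels, mts)
        print("%-45s %5.2f" % (name, ratio))
```

The printed values agree with the figures in the text to within rounding of the clock and memory speeds.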

I think this explains it all.

P.S.: CAS latency is not going anywhere during the same period of
time since early 2004: DDR2-667MHz CAS 5 translates into the same
~15 nanosecond delay as DDR400 CAS 3; and so does
DDR3-1333MHz CAS 9.
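
The conversion from CAS cycles to nanoseconds is easy to check. A minimal Python sketch, assuming that CAS latency is counted in cycles of the memory bus clock and that this clock runs at half the quoted DDR transfer rate; the function name is hypothetical.

```python
def cas_latency_ns(cas_cycles, mega_transfers):
    """CAS latency in nanoseconds for DDR-type memory.

    cas_cycles     -- CAS latency in memory bus clock cycles
    mega_transfers -- quoted data rate (400 for DDR400, 667 for DDR2-667, ...)
    """
    clock_mhz = mega_transfers / 2.0   # two transfers per bus clock cycle
    return cas_cycles / clock_mhz * 1000.0


if __name__ == "__main__":
    for name, cas, rate in [("DDR400 CAS 2.5", 2.5, 400),
                            ("DDR400 CAS 3", 3, 400),
                            ("DDR2-667 CAS 5", 5, 667),
                            ("DDR3-1333 CAS 9", 9, 1333)]:
        print("%-18s %5.1f ns" % (name, cas_latency_ns(cas, rate)))
```

All four cases land in the 12 to 15 nanosecond range, which is the point of the P.S.: the transfer rate more than tripled while the latency in absolute time barely moved.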

shchepet
Posts: 188
Joined: Fri Nov 14, 2003 4:57 pm

Re: Intel’s new i7 980x CPU gives disappointing speedup

#44 Unread post by shchepet »

Kate, Hernan, Bárður, and whoever cares,

Here is a little test program to illustrate finiteness of cache lines -- the fact
that cache_line is more than just one number, as well as the pitfalls associated
with memory system architecture. It is worth spending 15 minutes playing with it.

All the program does is perform an "in-place" matrix transpose of a fairly large
out-of-cache square matrix,

Code: Select all

        A(i,j) ---> A(j,i)
repeated a sufficient number of times (an odd number) to get reliable timing results.

The idea comes from an SGI training workshop I participated in 15 years ago while
still in Tallahassee, FL. I do not remember the last name of the SGI person who
gave the lecture, and I no longer have the "SGI Green Power Book" -- his lecture
notes. His first name is Jerry, and he was around 50 at that time. It was a 3-day
workshop, and he was talking all day long all three days.

But I remember the idea of what he wanted to show, so now I am making it up myself.

Just copy-paste the code below into a file called "transp.F", compile it,

Code: Select all

 ifort -o transp transp.F 
and run it as

Code: Select all

 transp 
or, for example,

Code: Select all

 transp 32 
(it can take one argument, an integer number).

(1) play with the number m=32 -- make it 8, 10, 16, 30, 32, 60, 100, 200, etc.
See how long it takes, and whether you find an optimum.

(2) recompile it using -O3

Code: Select all

 ifort -O3 -o transp transp.F 
and see whether it makes a difference relative to -O = -O2 (default).

(3) change matrix size,

Code: Select all

integer, parameter :: N=4100 
to N=4096 and also try N=4095, N=4097, N=4098, N=4099 and see whether
you observe any sensitivity. Use your best known m, as well as m=0, or just
play with it.
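
One possible reason to expect sensitivity to N, offered here as an assumption and not as part of the original exercise, is that walking along a row of a column-major array touches addresses N*4 Bytes apart, and for some N these addresses fall into very few sets of a set-associative cache. The Python sketch below counts the distinct sets touched; the cache parameters (32 KiB, 8-way, 64-Byte lines) are purely illustrative and should be replaced by those of the actual machine.

```python
def distinct_cache_sets(n, accesses=64, elem_bytes=4,
                        line_bytes=64, cache_bytes=32 * 1024, ways=8):
    """Count cache sets touched when walking one row of a column-major
    n x n array, i.e. A(i,1), A(i,2), ..., for a fixed i.

    All cache parameters are illustrative assumptions.
    """
    n_sets = cache_bytes // (line_bytes * ways)
    stride = n * elem_bytes            # Bytes between A(i,j) and A(i,j+1)
    touched = set()
    for k in range(accesses):
        line_index = (k * stride) // line_bytes
        touched.add(line_index % n_sets)
    return len(touched)


if __name__ == "__main__":
    for n in (4095, 4096, 4097, 4098, 4099, 4100):
        print(n, distinct_cache_sets(n))
```

In this toy model all 64 accesses for N=4096 land in a single set, while N=4100 spreads them over 16 sets; whether this shows up in the measured timings is exactly what the test is meant to reveal.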

If you choose to reply, please also specify details about your hardware
(CPU, memory type, number of channels), compiler, etc.

The code is as follows:

Code: Select all

      program transp

! A program to demonstrate effect of cache_line length going beyond
! storing just a single number.  It performs "in-place" transpose of
! a large-size (out of L2-cache) square matrix and reports time needed
! to do so.  Two  mathematically equivalent algorithms are compared
! against each other: "transp_simple" and "transp_blocked".  To use:
! compile and run it as
!
!       transp
! or
!       transp m
!
! where "m", an integer number, is the block size. Setting m=0 (same as
! having no argument) causes the use of "transp_simple", while m>0 uses
! "transp_blocked" which does the transpose using square blocks of
! m X m.  The purpose is to demonstrate that "m" matters and there
! is an optimal block size.

      implicit none
      integer, parameter :: N=4100
      integer(kind=4) :: A(N,N), error
      common /AA/ A
      character(len=8) arg
      integer bsize, iter, i,j, iargc
      integer iclk_start, iclk_end, iclk_rate, iclk_max

      if (iargc() == 1) then
        call getarg(1,arg)
        read(arg,'(I8)') bsize
        write(*,*) 'bsize=', bsize
      else
        bsize=0
      endif

      write(*,*) 'initializing...'
      do j=1,N
        do i=1,N
          A(i,j)=i+(j-1)*N
        enddo
      enddo

      call system_clock (iclk_start, iclk_rate, iclk_max)
      write(*,*) 'starting transpose...'
      do iter=1,161  !<-- this must be an odd number!!
        if (bsize.gt.0) then
          call transp_blocked(A, N, bsize)
        else
          call transp_simple(A,N)
        endif
      enddo
      call system_clock (iclk_end, iclk_rate, iclk_max)
      write(*,'(/1x,A,F12.4,1x,A/)') 'Elapsed Wall Clock Time =',
     &          dble(iclk_end-iclk_start)/dble(iclk_rate), 'sec'

      write(*,*) 'checking...'
      error=0
      do j=1,N
        do i=1,N
          error=max(error, abs(j+(i-1)*N -A(i,j)))
        enddo
      enddo
      write(*,*) 'error =', error
      stop
      end

      subroutine transp_simple(A, N)
      implicit none
      integer N,i,j
      integer(kind=4) :: A(N,N), tmp
      do j=1,N
        do i=1,j-1
          tmp=A(i,j)
          A(i,j)=A(j,i)
          A(j,i)=tmp
        enddo
      enddo
      return
      end

      subroutine transp_blocked(A, N, bsize)
      implicit none
      integer N, bsize, i,j, nblocks, ii,jj, istr,iend, jstr,jend
      integer(kind=4) :: A(N,N), tmp

      nblocks=(N+bsize-1)/bsize  !<-- division with roundoff up

      do ii=0,nblocks-1                 ! Processing blocks
        istr=1 + ii*bsize               ! located on the main
        iend=min(istr+bsize-1, N)       ! diagonal
        do j=istr,iend
          do i=istr,j-1
            tmp=A(i,j)
            A(i,j)=A(j,i)
            A(j,i)=tmp
          enddo
        enddo
      enddo

      do jj=0,nblocks-1                 ! Processing
        jstr=1 + jj*bsize               ! OFF-diagonal
        jend=min(jstr+bsize-1, N)       ! blocks
        do ii=0,jj-1
          istr=1 + ii*bsize
          iend=min(istr+bsize-1, N)
          do j=jstr,jend
            do i=istr,iend
              tmp=A(i,j)
              A(i,j)=A(j,i)
              A(j,i)=tmp
            enddo
          enddo
        enddo
      enddo
      return
      end
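
For readers who want to convince themselves that the two routines are mathematically equivalent before timing them, here is a line-by-line Python mirror of the same loops. It is only a correctness sketch: Python lists say nothing about cache behaviour, so it is not a substitute for the Fortran timing test, and the helper make_matrix is a hypothetical name.

```python
def transp_simple(a):
    """In-place transpose of a square list-of-lists matrix."""
    n = len(a)
    for j in range(n):
        for i in range(j):
            a[i][j], a[j][i] = a[j][i], a[i][j]


def transp_blocked(a, bsize):
    """Same result as transp_simple, visiting bsize x bsize blocks."""
    n = len(a)
    nblocks = (n + bsize - 1) // bsize   # division rounded up
    # Blocks on the main diagonal: swap only the pairs with i < j.
    for b in range(nblocks):
        lo, hi = b * bsize, min((b + 1) * bsize, n)
        for j in range(lo, hi):
            for i in range(lo, j):
                a[i][j], a[j][i] = a[j][i], a[i][j]
    # Off-diagonal blocks: swap each block with its mirror image.
    for jb in range(nblocks):
        jlo, jhi = jb * bsize, min((jb + 1) * bsize, n)
        for ib in range(jb):
            ilo, ihi = ib * bsize, min((ib + 1) * bsize, n)
            for j in range(jlo, jhi):
                for i in range(ilo, ihi):
                    a[i][j], a[j][i] = a[j][i], a[i][j]


def make_matrix(n):
    """n x n matrix with distinct entries, like the initialization above."""
    return [[i + j * n for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    n = 10
    for bsize in (1, 3, 4, 10, 16):   # includes sizes that do not divide n
        a, b = make_matrix(n), make_matrix(n)
        transp_simple(a)
        transp_blocked(b, bsize)
        print(bsize, a == b)
```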

arango
Site Admin
Posts: 1368
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

Re: Intel’s new i7 980x CPU gives disappointing speedup

#45 Unread post by arango »

Thank you. I will check sometime this week on the computers that I use for ROMS development, testing, and debugging. I have two 4-CPU iMacs, one at home and the other at the office. The one at the office is faster than the one at home. It will be interesting to see how fast it is; we bought them recently. I also have an 8-CPU Linux desktop. The nice thing is that I can control everything that runs on them. I have noticed that the iMacs give me a lot of variation in the elapsed time for the same application. I haven't been able to figure out why. I am not using any external communications to our storage disks.

balbin

Re: Intel’s new i7 980x CPU gives disappointing speedup

#46 Unread post by balbin »

Dear all:
Maybe this is the place to ask this question: Why do I get such different performance on similar machines?
I have two machines:
1.- HP xw8600, with 2 Intel Xeon Quad Core X5450@3 GHz, Ram 16 GB DDR-2@667 MHz and writing down to 2x15000 rpm raid 0 drives. Running Ubuntu 10.10 (maverick) 2.6.35-24 server.
WS> uname -a
Linux ieofisica 2.6.35-25-server #44-Ubuntu SMP Fri Jan 21 19:09:14 UTC 2011 x86_64 GNU/Linux

2.- Mac Pro (early 2009), with 2 Intel Xeon Quad Core X5500@2,26 GHz, Ram 16 GB DDR-3@1066 MHz and writing down to 2x7000 rpm raid 0 drives. Running Mac OS 10.6.6
Mac> uname -a
Darwin Mac-Pro-de-Rosa-Balbin.local 10.6.0 Darwin Kernel Version 10.6.0: Wed Nov 10 18:11:58 PST 2010; root:xnu-1504.9.26~3/RELEASE_X86_64 x86_64


They are running exactly the same problem, same input files. At the end (Total HP time) / (Total Mac Pro time) = 1.76

I will try Shchepet's checks, but I am not sure whether I am making some very basic mistake. I attach a summary of the output of both machines, showing only the lines where the differences are relevant. Thanks
Attachments
HP.txt
(3.55 KiB) Downloaded 1317 times
MacPro.txt
(3.57 KiB) Downloaded 1313 times

shchepet
Posts: 188
Joined: Fri Nov 14, 2003 4:57 pm

Re: Intel’s new i7 980x CPU gives disappointing speedup

#47 Unread post by shchepet »

HP.txt
Compiler flags : -heap-arrays -fp-model strict -openmp -fpp -ip -O3 -msse2 -free
Resolution, Grid 01: 0384x0176x030, Parallel Threads: 8, Tiling: 001x016
Compiler flags are not optimal: -heap-arrays ---> -no-heap-arrays, but you have to adjust the stacksize limit.
Instruction set: -msse2 --> -xSSE4.1 because your processor is a Xeon 5400-series, not a Pentium 4 "Northwood"
or Opteron 248.

Tiling is not optimal; my guess is that you may try 4x22 or 3x16, but overall this is worth
some experimentation.

...Read the earlier posts on this thread from the very beginning -- they may explain these issues.

MacPro.txt
Compiler flags : -heap-arrays -fp-model strict -openmp -fpp -ip -O3 -axP -free
Resolution, Grid 01: 0384x0176x030, Parallel Threads: 8, Tiling: 001x016
Exactly the same issues as above, except -axP --> -xSSE4.2

Mac Pro Hardware Setup
Mac Pro (early 2009), with 2 Intel Xeon Quad Core X5500@2,26 GHz, Ram 16 GB DDR-3@1066 MHz..
Intel 5500-series Xeons have a tri-channel memory controller. You have two of them,
so your total memory should be divisible by 6, say 12GB or 24GB.

Are you running it in tri-channel, or dual-channel, or some kind of mixed mode?
Can you verify in the BIOS or elsewhere that it is indeed a tri-channel configuration?

I have never seen the inside of a Mac Pro, but I do know that some of the LGA1366 board designs
are very confusing: they have tri-channel capability, but also have a 4th memory slot,
for example, Intel's own DX58SO,
http://www.newegg.com/Product/ImageGall ... otherboard
The instructions say that "for best memory performance" (i.e., tri-channel configuration)
install 3 identical sticks of memory into the blue slots and do not put anything into the black one.
This makes perfect sense. What does not make any sense is why Intel designed it to
have the black slot instead of having no black slot at all, or to have six (three blue and
three black, like most other LGA1366 boards). The same applies to 5500-series Xeon
boards.

...Chances are that you may have to open your Mac and remove the two excess
memory sticks to regain the tri-channel symmetry of each CPU's memory system.

Finally,
...running exactly the same problem... At the end (Total HP time) / (Total Mac Pro time) = 1.76
Assuming that all you do is linear algebra and that it is limited solely by memory access
(hence completely neglecting the time needed to do the arithmetic operations), and
assuming that your Mac Pro is tri-channel (needs to be confirmed), so you have a total
of 6 DDR3 channels vs. 4 DDR2 channels of the HP machine, the ratio should be:

Code: Select all

          6 * 1066MHz
         -------------- = 2.4
           4 * 667MHz
On the other hand, neglecting the time spent on memory access and assuming that all
the code does is arithmetic calculations (a grossly naive assumption in practice),
the ratio should be the ratio of CPU clock speeds times the number of cores,

Code: Select all

         8 * 2.26GHz
        -------------- = 0.753
          8 * 3.0GHz
In reality you should get something in between -- a kind of weighted product of the two
factors. A very crude model is that if your HP time is

Code: Select all

   X + Y
where X is computational time and Y memory access time, then your Mac Pro time should be

Code: Select all

        X          Y
     -------- +  ------
      0.753       2.4
Knowing the practical ratio of 1.76 you can even get an idea of what the X/Y ratio is for your code.

[Note this model is crude because it neglects the fact that computations and memory
accesses are overlapped in time; Y is really a kind of "cache miss time" -- time spent
when the processors/cores stall because of cache misses.]
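
Taking the crude model at face value, X and Y can be solved for directly. The Python sketch below normalizes the HP time to X + Y = 1 and uses the three numbers from this post (0.753, 2.4 and the measured 1.76); the function name is hypothetical, and the result inherits every simplification of the model, including the unconfirmed tri-channel assumption.

```python
def split_compute_and_memory(speedup, cpu_factor, mem_factor):
    """Solve  X + Y = 1  and  X/cpu_factor + Y/mem_factor = 1/speedup.

    speedup    -- (reference machine time) / (other machine time)
    cpu_factor -- relative arithmetic speed of the other machine
    mem_factor -- relative memory bandwidth of the other machine
    Returns (X, Y) as fractions of the reference machine's run time.
    """
    a = 1.0 / cpu_factor
    b = 1.0 / mem_factor
    if a == b:
        raise ValueError("model is degenerate when both factors are equal")
    x = (1.0 / speedup - b) / (a - b)
    return x, 1.0 - x


if __name__ == "__main__":
    x, y = split_compute_and_memory(1.76, 0.753, 2.4)
    print("X = %.3f  Y = %.3f  X/Y = %.2f" % (x, y, x / y))
```

With these inputs X comes out at roughly 0.17 and Y at roughly 0.83, i.e. in this model the HP run would spend most of its time waiting on memory.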

balbin

Re: Intel’s new i7 980x CPU gives disappointing speedup

#48 Unread post by balbin »

Thank you very much for your interest and for the answer.

I will check with the -no-heap-arrays flag, probably next week.
My stacksize limit is 16M
Mac> limit
cputime unlimited
filesize unlimited
datasize unlimited
stacksize 16384 kbytes
coredumpsize 0 kbytes
memoryuse unlimited
descriptors 256
memorylocked unlimited
maxproc 266

and
WS> limit
cputime unlimited
filesize unlimited
datasize unlimited
stacksize 16384 kbytes
coredumpsize 0 kbytes
memoryuse unlimited
vmemoryuse unlimited
descriptors 1024
memorylocked 64 kbytes
maxproc unlimited

Instruction set: -msse2 --> -xSSE4.1 because your processor is a Xeon 5400-series, not a Pentium4 "Northwood"
or Opteron 248.
Ok, I will. I am not an expert on this and I was only using Intel's suggestions for the deprecated options without understanding what was going on behind them. The same applies to the MacPro.

I tried to optimize tiling for the macPro and attached med_benchmark_tiles is what I got. I also played with number of threads, attached med_benchmark_threads, and I did not include the number I got using 16 threads for the 1x16 case because it was so bad that I did not take it into account. As I was happy with the results for the MacPro I used the same settings for the HP. Maybe this is not correct.
Intel 5500-series Xeons have a tri-channel memory controller. You have two of them,
so your total memory should be divisible by 6, say 12GB or 24GB.

Are you running it in tri-channel, or dual-channel, or some kind of mixed mode?
Can you verify in the BIOS or elsewhere that it is indeed a tri-channel configuration?
Good question. I have no idea, and I am not sure of the answer. These are the MacPro technical specs: http://support.apple.com/kb/SP506
And the only thing I could find regarding memory was a note in the manual that explains how to handle memory slots
Note: Populating slot 4 or 8 slightly drops maximum memory bandwidth, but depending on the applications used, overall system performance may benefit from the larger amount of memory.
http://manuals.info.apple.com/en_US/Mac ... Ms_DIY.pdf
It looks like 12GB is better than 16GB, at least under the circumstances of Bare Feats' benchmark. This suggests the Mac Pro is forced into dual channel mode with all slots filled. http://barefeats.com/nehal04.html
But this is what Intel says: http://www.intel.com/support/motherboar ... 011965.htm
I could not find any clear answer on the Apple site.
I have 16GB installed, but ROMS never uses more than 1GB. I only run out of memory using Matlab, and that is probably because I am doing something wrong. I will check everything again after removing 2 of my 8 DIMMs.
Up to now I was happy with the MacPro performance, but all the better if I can improve it. I would also like to improve the HP's performance. I will check your suggestions.
Thanks again.
Attachments
med_benchmark_threads.jpg
med_benchmark_tiles.jpg

robertson
Site Admin
Posts: 227
Joined: Wed Feb 26, 2003 3:12 pm
Location: IMCS, Rutgers University

Re: Intel’s new i7 980x CPU gives disappointing speedup

#49 Unread post by robertson »

This thread was getting off topic so I moved the discussion about optimization flag errors to a new thread:

viewtopic.php?t=2180

Please use the above topic for discussion related to ifort optimization flag issues on Intel i7 machines and reserve this topic for performance related posts.

Thank you.
