------- Forwarded Message

From: Chris Ruemmler <ruemmler@hpperf1.cup.hp.com>
Subject: HP machines for lmbench paper
To: staelin@hplms2.hpl.hp.com
Date: Sat, 11 Nov 95 1:23:18 PST

Carl,

I looked over the new lmbench paper and it looks really good.  The
organization is much better than before.  I have a few changes
plus things you should think about:

1).  In Figure 1 there are a few things:
     a.  For the Linux/i586 and Linux/i686 rows, you switched
         the multi/uni with the OS column.
Fixed.
     b.  In the OS column you may want to put the rev # (such
         as 10.01 for HP-UX) or just get rid of the rev.  Only
         SunOS has the rev (5.5).
Fixed.
     c.  For the CPU, you should put the processor type instead
         of PowerPC or PA.  So PowerPC should be PowerPC 604,
         PA should be HP-PA 7200, and Alpha should be Alpha
         21064, 21064A, or 21164 where appropriate.
Fixed.
     d.  I don't believe the cost of the DEC 8400 is 50k.  This
         is more like a 150k box.  Why is this box even included
         since you don't include its numbers in the other tables
         (except for latency)?  Is its performance that bad?
         I wouldn't think so.
Fixed.
     e.  What exactly does MP mean?  Does it mean that the box
         can be run MP, but only 1 processor was tested or does
         it mean it can be run MP and it was tested in some MP
         configuration?   If it means the latter, then you need
         to state the # of cpus used in the test.  If it means
         the former, then you need to explain why an MP box might
         have a disadvantage vs a strictly uniprocessor box (ie
         the Power2 path to memory gets wider and wider as you 
         add memory, you probably could not do this on an MP box!).
         Other things include not being able to run the bus as
         fast due to being able to accept more "loads".  One
         thing that might be nice to note is that the HP machine (819)
         has a processor/memory bus that runs at the processor 
         frequency (100 MHz, and 120MHz for the K410 which I am
         going to run for you).  I believe the next fastest
         bus may be UltraSparc's 83MHz bus.  So processor bus
         speed might be an interesting column.
         

2). In Figure 2, why is the value for the i686 so HIGH for 
    the "memory read" test?  Is something going wrong here?
    Has it broken the benchmark?  I would not expect it to be
    greater than the Power2.  It is also much greater than its
    bcopy number.
I need to double check w/ Intel on this.

3).  For the memory read latency stuff (Figure 6), the HP machine
     really has only a Level 1 cache and no Level 2 cache!  So its
     Level 1 cache is 256KB.  I'm not sure about Power2.  It might
     actually have a Level 1 cache (a small one), but it gets 
     lost in the wash due to Power2's very large cache line (I think
     it is 128 bytes at least, maybe 256).  
Fixed.

     The fact that both of these two machines have 1 cycle access time
     for a very LARGE area of memory relative to the other boxes makes
     them very good for Real world apps.  So both Power2 and HP
     have 256KB of data accessible in 1 cycle, while the others are 
     1 or 2 cycles for only 8-16KB!  I think you should probably point
     this out as one of the design trade-offs.  Also, the P6 and Ultra are
     good because they have a LARGE 256 or 512KB cache that is only
     6 cycles away (not as good as 1, but better than the others).  The
     Alpha at 300MHz is really hurting with a LARGE 2nd level cache at 22
     cycles away and main memory at 133 cycles away.  Kind of hard to hide
     that much latency!  Also the PowerPC604, i586 boxes, and the 
     SC1000 all have really bad L2 caches.
Need to add these comments.

     The memory latency and cache/memory organization is really 
     interesting in all of these machines.  The Power2 is by far
     the best overall (LARGE 1 cycle cache plus kick butt memory).
     Another thing to notice:  If you look at the results for the
     first memory latency test (a word at a time), you will notice
     that some boxes have really low latency even when accessing 8MB
     files.  Some of these boxes are the HP, Power2, and DEC 300MHz.
     The reasons for this can be quite different.  One reason may be
     a very large cache line (so you get 1 cycle hit on many words,
     and 1 long memory access on 1 word).  I believe the Power2 has
     128 byte (maybe 256 byte) cache lines (this machine can access
     data from memory at 2 GB/sec).  A second reason is because
     of very fast 1st level cache times.  The DEC @300MHz can access
     a cache word in only 3ns, so even if the line is 32 bytes you get
     (7*3 + 1*400)/8 = 53ns (which is about what it gets).  Finally,
     the CPU can be doing pre-fetching (which is what happens with the
     HP box).  So for HP (with a 32 byte cache line) you get 
     (7*10 + (430-7*10))/8 = 54ns.  If the HP box was not doing prefetch
     you would get (7*10 + 430)/8 = 63ns!  The HP value is right around
     50ns.  I also happen to know that the processor does data-prefetch.
     This is also another very interesting thing that you can observe
     from the memory latency test data.  

     You could actually write a whole paper just on the memory 
     read latency test! 

4).  Why aren't all machines included in all tables?
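
The per-word averages worked out in item 3 can be reproduced with a short
script.  This is only a sketch under the same assumptions as the text: a
32-byte line holding eight words, one miss per line, and (for the HP case)
prefetch that fully overlaps the seven hits with the memory access.  The
function name and its parameters are made up for illustration and are not
part of lmbench.

```python
def avg_word_latency_ns(hit_ns, miss_ns, words_per_line=8, prefetch=False):
    """Average latency per word when walking memory one word at a time.

    One word per cache line misses; the remaining words hit in cache.
    With prefetch=True the memory access is assumed to overlap the hits,
    so only the part of the miss not hidden behind them is counted.
    """
    hits = words_per_line - 1
    hit_time = hits * hit_ns
    if prefetch:
        exposed_miss = max(miss_ns - hit_time, 0)
    else:
        exposed_miss = miss_ns
    return (hit_time + exposed_miss) / words_per_line


# Numbers quoted in item 3 above.
dec_300 = avg_word_latency_ns(hit_ns=3, miss_ns=400)
hp_prefetch = avg_word_latency_ns(hit_ns=10, miss_ns=430, prefetch=True)
hp_no_prefetch = avg_word_latency_ns(hit_ns=10, miss_ns=430)

print(f"DEC 300MHz:       {dec_300:.1f} ns")
print(f"HP with prefetch: {hp_prefetch:.1f} ns")
print(f"HP no prefetch:   {hp_no_prefetch:.1f} ns")
```

Changing words_per_line to 32 or 64 gives a feel for how a 128 or 256 byte
line (as guessed for the Power2) pulls the average down further.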

I'm going to try and run the benchmark on a 1-way K210 (120MHz PA7200),
a 1-way T520 (120MHz PA7150), and a 1-way E55 (96MHz PA7100LC).  I
will do the machines in that order.  I am interested in general in
how these machines compare.  It is basically the best of the mid-range,
high-end, and low-end respectively for HP.

I am going in on Sunday to do this, so I should have results by Sunday
night.  We may want to not put some results in if they look bad.  I
imagine the K210 will NOT look bad.

--Chris

Thanks!  I reran the K210 and got much better results (looks like
I'm going to have to have a talk with our compiler guys!).  The 
new results are included at the end of this mail.  I think we 
will have the top pipe bandwidth #, much better bcopy result (74MB/sec),
and much more even memory results (both read and write over 110MB/sec).
We will still also have good latency and the best disk latency.

You can go ahead and definitely use the K210 result in your paper.
I will also try and get the E55 done with gcc.  I'm still not sure
about the T520 (we have announced it, but I don't think we are 
shipping it until the beginning of next year).  I'll try and rerun
on the T520 after I find out if we can use it or not on Monday.
I doubt that I'll be able to get on the 890 again.  You might
just want to use it as an example in the memory latency test only
to show an extreme (2MB of 1 cycle cache that is actually 
set-associative).

Actually, I'll try and run this again tomorrow under single user
mode if I can (I ran this from home, so I'm not sure if any of 
the tests may have been hit by any extra activity).

Date: Sat, 11 Nov 1995 22:45:10 -0800
From: Chris Ruemmler <ruemmler@hpperf1.cup.hp.com>
To: ruemmler@hpperf1.cup.hp.com, staelin@hplms2.hpl.hp.com
Subject: lmbench results for HP

Carl,

Here are the lmbench results.  Unfortunately we seem to have a 
compilation performance bug in our compiler, so some of the numbers
on the K210 are about 1/2 of what they should really be.  The
compiler is generating a weird sequence of instructions under
-O.  If I change the compilation to +O3 (instead of -O which is +O2),
then some numbers go up (pipe bandwidth, file re-read, bcopy hand,
and mem write) while others go down (memory re-read and memory read).
I talked with Larry, and he said just go with -O because he wants
to measure the "out of the box" performance.  So we will look a little
bad on the benchmarks above that improved with +O3.

Here is a description of the machines I tested (I also threw in an
890 just for fun.  You might want to only include it in the memory
latency section):

Vendor&Model  Uni/MP  OS           CPU           MHz  Year   Int92  List  BusSpeed
HP 9000/K210  MP      HP-UX 10.01  HP-PA 7200    120  1995   167    35K   120MHz
HP 9000/T520  MP      HP-UX 10.01  HP-PA 7150    120  1995   160    135K  60MHz
HP 9000/E55   Uni     HP-UX 10.01  HP-PA 7100LC   96  1994   108    10K   64MHz??
HP 9000/890   MP      HP-UX 9.04   HP-PA 7000     60  1991?  ?      ?     60MHz


It is important to use the Letter Names instead of the numbers, so here is
the conversion:

	HP 9000/859 == K210
	HP 9000/856 == E55
	HP 9000/892 == T520
	HP 9000/890 == 890  (we no longer sell this box and changed naming
                         schemes right around when we stopped selling it)

Now a little bit about each machine and where it shines:

9000/K210:

The K210 is a memory mover.  It can copy and move memory very well.
Unfortunately due to the compiler bug, the memory move speeds are
about 1/2 of what they should be.  For example, for Pipe bandwidth
I was able to get about 85 MB/sec with +O3, but get only 54 MB/sec
on the test with -O.  Also, the loop unrolled bcopy is 74 MB/sec with
+O3 but only 38 MB/sec with -O.  So you can see what I mean!
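
To put a number on "about 1/2", the two pairs of results quoted above can be
compared directly.  This is just a quick check on those four figures; the
dictionary layout and the helper name are illustrative.

```python
# MB/sec figures for the K210 quoted above: (with -O, with +O3).
results = {
    "pipe bandwidth": (54, 85),
    "bcopy (unrolled)": (38, 74),
}


def fraction_of_o3(with_o, with_o3):
    """Return the -O result as a fraction of the +O3 result."""
    return with_o / with_o3


for name, (with_o, with_o3) in results.items():
    print(f"{name}: -O reaches {fraction_of_o3(with_o, with_o3):.0%} of +O3")
```

The bcopy case comes out very close to one half, while pipe bandwidth with
-O is a bit under two thirds of the +O3 figure.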

The K210 really shines though on the memory latency test.  This is
an MP box, but has a memory latency of about 349ns!  So it beats
the DEC 8400 box that was claiming superior latency to the competition.
This is only 43 cycles of stall for main memory (which is much better
than DEC's 133 and the same as the uniprocessor UltraSparc which has
a 45 cycle latency).  The IBM machine is still the champ at only
17 cycles to main memory!
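
The cycle counts here are just the measured latency multiplied by the clock
rate.  A minimal sketch of the conversion, assuming the 120MHz K210 clock
and the 300MHz DEC clock mentioned earlier (the helper names are made up):

```python
def ns_to_cycles(latency_ns, cpu_mhz):
    """Convert a latency in nanoseconds to CPU clock cycles."""
    return latency_ns * cpu_mhz / 1000.0


def cycles_to_ns(cycles, cpu_mhz):
    """Convert a number of CPU clock cycles to nanoseconds."""
    return cycles * 1000.0 / cpu_mhz


k210_cycles = ns_to_cycles(349, 120)   # K210 main memory latency
dec_ns = cycles_to_ns(133, 300)        # DEC's 133 cycles at 300MHz

print(f"K210: 349 ns at 120MHz is about {k210_cycles:.0f} cycles")
print(f"DEC:  133 cycles at 300MHz is about {dec_ns:.0f} ns")
```

The K210 figure lands within a cycle or so of the 43 quoted above, so the
comparison in cycles is not sensitive to the exact nanosecond value used.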

The fork/exec times are much better than before, and the context
switch times are better or the same.

The networking latency numbers are much better and the TCP connection
time is the best number now.

Finally, the K210 has a very efficient and fast I/O subsystem, so its
disk I/O latency is great: 1109 usecs, which beats the SGI Indigo2
by a fair amount (1265 usecs).

The K210 has a 256KB d-cache and 256KB i-cache + a 2KB on-chip 
assist cache on the d-side (basically anti-thrash cache).  Both 
the assist and d-cache are accessed in parallel, so they look
like 1 cache (and both have a 1 cycle latency for a hit).

9000/T520:

This box is built for high-end OLTP, so its biggest feature is 
a HUGE 1st level cache.  The T520 has 1MB i-cache and 1MB d-cache 
both running at 120MHz, so it can access 2MB of data in 1 cache 
cycle!  Its main memory speed is slower than the K210 by quite a 
bit, but appears to be much faster than other MP boxes except for 
the DEC 8400.

The T520 does not appear to do well for bcopy tests, but if you look
closer, you will see that it performs very well on the small tests
because the data fits in 1MB, so it is only 1 cycle away.

I need to make sure that we can release these numbers (I think they
are fine, but marketing might be cautious).  So don't put in
this box until I give you the go ahead on Monday.

I thought this would be a good box, however, due to its very large
1st level cache and somewhat respectable main memory latency for 
a large SMP.

9000/E55:

This is the low end of our product line and features a combined
1MB 1st level cache (both instruction and data stored in the cache).
It is only a uniprocessor and has very good performance for the 
cost. 

It tends to get hurt by the large tests that flush the cache because
this also flushes the instruction cache!  So it actually does not
look very good on many tests.  Its main memory latency is low
(about 260ns) and it has a large 1st level cache.  Again, this
is a box designed for OLTP instead of a workstation designed for
CAD/Design.

9000/890:

This is the ultimate OLTP workhorse from the early 90's.  It has
a 2MB 2-way set-associative d-cache and 2MB 2-way set-associative
I-cache both 1 cycle away from the cpu!  It also has huge TLBs.
The latency test is really interesting because the machine gets
a 16ns latency for over 2MB of memory on the d-side, which is 
by far the largest of any of the machines tested.  There are not
many OLTP workloads that don't fit in its cache.

Other than that, not too interesting.  HP-UX 9.04 may have some
things that are better than 10.01 and some worse.  Just sift
through the data.

Overall I think this machine configuration aspect is really interesting:
You sort of have the following design points:

1).  HP  --> as much 1st level cache as possible (and with K210, low
             latency memory).  This makes your processor very efficient
             for code that fits in cache.  Due to locality principles 
             this is a big win.
2).  DEC+SGI --> Crank the processor frequency, but have a huge 2nd level
                 cache so you don't have to go to memory which would stall
                 the processor for a huge # of cycles.  CPI not as
                 good as HP, so you're not as efficient, but does save
                 you for apps that are really big and don't have much 
                 locality (although then 4MB might not even save you!).
3).  IBM 990 --> forget the cache, just make memory really close and 
                 suck it in fast.  This is more expensive than HP's method
                 but gets better results (more of memory has a low 
                 latency on average, especially for large apps).  It 
                 is also harder to make MP.
4).  SUN --> Kind of lost until UltraSparc.  Small onchip, 2nd level too slow.

5).  Intel --> P5 has same problem as SUN pre Ultra, but hey it's a PC chip.
               P6 is following more in HP's mold, but still too far away
               for 1.5 cache (36ns or 6 cycles).  This may not be too much 
               latency to handle given the out-of-order features in P6
               however.
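
One rough way to compare these design points is the usual average memory
access time calculation.  The sketch below uses the cycle counts quoted in
this thread (1 cycle and 43 cycles for the K210, 22 and 133 cycles for the
300MHz DEC, 1 cycle and 17 cycles for the IBM).  The miss rates, and the
2 cycle on-chip hit time used for the DEC, are invented purely to show the
shape of the trade-off; they are not measurements of any of these machines.

```python
def amat_cycles(levels, memory_cycles):
    """Average memory access time in cycles.

    levels is a list of (hit_cycles, miss_rate) pairs ordered from the
    cache closest to the CPU outward; memory_cycles is the cost of going
    all the way to main memory.
    """
    penalty = memory_cycles
    # Fold from the outermost cache inward: each level costs its hit
    # time plus its miss rate times the cost of the next level out.
    for hit_cycles, miss_rate in reversed(levels):
        penalty = hit_cycles + miss_rate * penalty
    return penalty


# HYPOTHETICAL miss rates, chosen only for illustration.
big_l1 = amat_cycles([(1, 0.02)], memory_cycles=43)               # HP style
small_l1_big_l2 = amat_cycles([(2, 0.10), (22, 0.10)], 133)       # DEC style
close_memory = amat_cycles([(1, 0.02)], memory_cycles=17)         # IBM style

for name, value in [("large L1", big_l1),
                    ("small L1 + large L2", small_l1_big_l2),
                    ("large L1 + close memory", close_memory)]:
    print(f"{name}: {value:.2f} cycles per access")
```

Raising the miss rates to mimic an application with poor locality shows why
the memory latency term dominates for the designs with distant main memory.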

Remember that the T520 might not be able to go into the paper, so 
I'll let you know.
