Kudos to you for your tenacity! However, I don't think this is the end of it.
I did a little research on strcmp's answer. Kernel 2.6.9 was released in 2004 and is standard with RHEL 4, which shipped with glibc 2.3.4. Pentium 3s were already old in 2004. RHEL 5 ships with kernel 2.6.18 and glibc 2.5.
So I did some benchmarks.
I followed strcmp's suggestion and used a "falling timer" method, where the loop starts when the time() call notes a change in seconds and ends at the next change. There's a 10 to 100 ms variance on either side of the fall, so I took an average of several runs. Then I divided the CPU speed (cycles/s) by the ops/s number to get "tics per op".
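For anyone who wants to repeat this, here is a minimal sketch of the falling timer idea in Python. It is not the code I ran. The names `ops_per_second`, `op` and `clock` are made up for illustration, and `os.getpid` is only a stand-in for whatever call you want to measure. Note that the clock is read once per iteration, so its cost is included in the count.

```python
import os
import time


def ops_per_second(op, clock=time.time):
    """Count calls to op() during one whole clock second.

    Falling-timer method: wait until the seconds value changes, then
    count iterations until it changes again.
    """
    current = int(clock())
    # Spin until the clock "falls" into a new second, so that counting
    # starts right at a second boundary.
    while True:
        window = int(clock())
        if window != current:
            break
    count = 0
    # Count operations for as long as we are still inside that second.
    while int(clock()) == window:
        op()
        count += 1
    return count


if __name__ == "__main__":
    # os.getpid is a hypothetical stand-in for the operation under test.
    runs = [ops_per_second(os.getpid) for _ in range(5)]
    print("average ops/s:", sum(runs) / len(runs))
```

Passing the clock in as a parameter is just a convenience so the loop can be checked against a fake clock without waiting on real seconds.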
- 2.6.18 / P3 / 800 MHz: 548300/s (average, 19 runs) = 1459 tics/op
- 2.6.18 / AMD Opteron 285 / 2.6 GHz: 1689138/s (average, 6 runs) = 1539 tics/op
- 2.6.18 / AMD Opteron 270 / 1.0 GHz: 974228/s (average, 7 runs) = 1026 tics/op
- 2.6.9 / Xeon / 3.6 GHz: 917196/s (average, 4 runs) = 3925 tics/op
- 2.6.9 / P3 / 1.25 GHz: 733927/s (average, 5 runs) = 1703 tics/op
- 2.6.9 / Xeon / 2.3 GHz: 1127894/s (average, 10 runs) = 2608 tics/op
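The tics/op figure is just the clock rate divided by the measured rate. A small sketch of that arithmetic, using two rows from the list above (the function name `tics_per_op` is only for illustration):

```python
def tics_per_op(cpu_hz, ops_per_sec):
    """CPU cycles spent per operation: cycles/s divided by ops/s."""
    return cpu_hz / ops_per_sec


# 2.6.18 / P3 / 800 MHz
print(round(tics_per_op(800_000_000, 548300)))    # 1459
# 2.6.9 / Xeon / 3.6 GHz
print(round(tics_per_op(3_600_000_000, 917196)))  # 3925
```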
For tics/op, smaller is better. So the 2.6.18 kernel is indeed faster than the 2.6.9 kernels. The Xeon is MUCH slower. Presumably the kernels were compiled for the lowest common denominator. No optimization flags were enabled, but there was a difference in compilers: the 2.6.9 hosts used gcc 3.4.6, while the newer ones used gcc 4.1.1. Also, it should be noted that we don't have an AMD running 2.6.9 or a Xeon running 2.6.18.
It may very well be that these kernels were not compiled optimally for the various architectures. That the Xeons are so much slower is quite surprising, given their typical use as HPC components.
Regardless, none of these results seems to answer the fundamental question: why is SCO so much faster??