My favorite quote in this area is:
"... premature optimization is the root of all evil." (Knuth, Donald. Structured Programming with go to Statements, ACM Journal Computing Surveys, Vol 6, No. 4, Dec. 1974. p.268.)
-- wikipedia article, see below
I have been fortunate enough to work on big iron for much of my professional life:
Control Data (CDC): 160, 1604, 6600 and follow-ons; 203, 205, ETA-10
Cray Research (CRI): CRAY-1, CRAY-2, CRAY X-MP
IBM: 3090 (AIX)
Thinking Machines (TMC): CM-2 (& 200), CM-5
You may have done your homework on performance issues, but if not, I suggest you look at -- a quick-and-dirty-off-the-top-of-my-head-list:
A Google search on faster program, optimized code, techniques and similar phrases
An older book that you might be able to find used is:
Title: High Performance Computing
Subtitle: RISC Architectures, Optimization & Benchmarks
Author: Charles Severance, Kevin Dowd
Date: July 2, 1998
Categories: high performance, optimization, programming, software design
Comments: 5 stars (4 reviews, Amazon, 2007.12)
Comments: ( I have 1st edition, 1993 )
Most of the suggestions listed above by posters are appropriate at some time in the optimization process. I have a few principles that I advise folks to think about:
-1) Does this program / process / code absolutely, positively need to be faster?
0) Make it run right before you make it faster
1) Spend most of your personal time finding the best algorithm. There is a story in Programming Pearls, by Jon Bentley, comparing an algorithm implemented in compiled Fortran on a Cray-1 with a better algorithm in interpreted Basic on a Radio Shack TRS-80. As you might guess, the Cray-1 crushed the TRS-80 -- at least at small problem sizes. As the size went up, the TRS-80 eventually overtook the mighty Cray-1, and for the largest size listed, the Cray would have taken 95 years, the TRS-80 5.4 hours.
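To make the arithmetic behind that story concrete, here is a rough sketch. It assumes the Cray-1 ran a cubic algorithm costing about 3.0 * n^3 nanoseconds and the TRS-80 a linear one costing about 19.5 * n milliseconds. Those constants are my recollection of Bentley's table; they do reproduce the 95 years and 5.4 hours quoted above at n = 1,000,000, but check them against the book before relying on them. The function names are made up for illustration.

```python
import math

# Assumed cost models (recalled from Bentley's table, not measured here):
#   Cray-1, cubic algorithm in Fortran:  3.0 * n**3 nanoseconds
#   TRS-80, linear algorithm in Basic:   19.5 * n milliseconds
NS_PER_SECOND = 1e9
SECONDS_PER_HOUR = 3600.0
SECONDS_PER_YEAR = 365.25 * 24 * SECONDS_PER_HOUR


def cray_seconds(n):
    """Cubic algorithm with a tiny constant factor."""
    return 3.0 * n**3 / NS_PER_SECOND


def trs80_seconds(n):
    """Linear algorithm with a huge constant factor."""
    return 19.5e-3 * n


def crossover_size():
    """Problem size at which both machines take equally long.

    3.0e-9 * n**3 == 19.5e-3 * n  =>  n == sqrt(19.5e-3 / 3.0e-9)
    """
    return math.sqrt(19.5e-3 / 3.0e-9)


if __name__ == "__main__":
    for n in (10, 1_000, 100_000, 1_000_000):
        print(f"n={n:>9,}  Cray-1: {cray_seconds(n):.3g} s   "
              f"TRS-80: {trs80_seconds(n):.3g} s")
    n = 1_000_000
    print(f"Cray-1 at n=1e6: {cray_seconds(n) / SECONDS_PER_YEAR:.0f} years")
    print(f"TRS-80 at n=1e6: {trs80_seconds(n) / SECONDS_PER_HOUR:.1f} hours")
    print(f"crossover near n = {crossover_size():.0f}")
```

Under these assumptions the two machines break even at a problem size of only about 2,550. Below that the Cray wins easily; above it, no amount of hardware rescues the cubic algorithm.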
Another story about algorithms has to do with advances in hardware. There are many algorithms that have been discarded because they were too slow -- at least on scalar machines. When parallel processing became a reality, some of those really inefficient algorithms turned out to be spectacularly useful on parallel boxes. The CM-2 (200) above had 32,000 processors, but they were bit-slice computers. Most people used the mode where they ganged them by 32s to get a 1,000 processor box -- quite respectable for that time in computing history. If you used the right algorithm applied to the right problem, that machine really cranked out results. (That was a "half-gallon" machine, the "one gallon" had 64K processors.)
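One textbook illustration of this effect (my choice of example, not necessarily one of the algorithms referred to above) is the running sum, or prefix scan. The Hillis-Steele formulation does far more additions than the obvious loop, so it looks wasteful on a scalar machine, but it finishes in about log2(n) rounds when every element has its own processor. The sketch below only simulates the rounds on one CPU and counts the work; the function names are hypothetical.

```python
def sequential_scan(values):
    """Running sum on one processor: n - 1 additions in n - 1 dependent steps."""
    out = list(values)
    adds = 0
    for i in range(1, len(out)):
        out[i] += out[i - 1]
        adds += 1
    return out, adds, max(len(out) - 1, 0)


def hillis_steele_scan(values):
    """Inclusive scan in ceil(log2 n) rounds.

    Each round is simulated with a list comprehension; on a data-parallel
    machine all elements of a round would be updated at the same time,
    one per processor.
    """
    out = list(values)
    n = len(out)
    adds = 0
    rounds = 0
    stride = 1
    while stride < n:
        # Every position i >= stride adds the value `stride` places to its left.
        out = [out[i] + out[i - stride] if i >= stride else out[i]
               for i in range(n)]
        adds += n - stride
        rounds += 1
        stride *= 2
    return out, adds, rounds


if __name__ == "__main__":
    data = list(range(1, 1025))
    seq_result, seq_adds, seq_steps = sequential_scan(data)
    par_result, par_adds, par_rounds = hillis_steele_scan(data)
    assert seq_result == par_result
    print(f"sequential:    {seq_adds} additions in {seq_steps} steps")
    print(f"Hillis-Steele: {par_adds} additions in {par_rounds} rounds")
```

For 1,024 elements the parallel form does 9,217 additions against 1,023 for the plain loop, roughly nine times the work, yet needs only 10 rounds instead of 1,023 dependent steps. Whether that trade is a win depends entirely on how many processors you have.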
2) Profile / instrument your code; obtain measurements to see where it is spending its time, then spend your precious time in those areas. A few years back, I did the opposite of what I had usually done. A client asked me to take a code that previously ran on a Cray and port it to run on a PC. It was far too complex a code to consider an algorithm change (although I suggested that their domain experts look at that). I profiled it and saw that it spent a lot of time doing IO. The best approach at that point was to allocate as much memory as feasible to a RAMdisk. That cut the real (wall-clock) time of the models I was running by 30%. We might have expected more, but this was all done at the filesystem-driver level, so the code itself did not need to be modified. If more had been needed, a RAID-0 across several disks would have been next.
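As a minimal sketch of what "profile first" looks like in practice, the following uses Python's standard cProfile and pstats modules. The workload (parse_records, score_records, run_model) is invented purely for illustration, with one deliberately quadratic stage so the profiler has something to find; substitute your own entry point.

```python
import cProfile
import io
import pstats


def parse_records(n):
    """Hypothetical cheap stage."""
    return [i * 2 for i in range(n)]


def score_records(records):
    """Hypothetical expensive stage: deliberately quadratic."""
    total = 0
    for a in records:
        for b in records:
            total += (a ^ b) & 1
    return total


def run_model(n=300):
    """Hypothetical entry point that chains the two stages."""
    records = parse_records(n)
    return score_records(records)


def profile_report(func, *args, top=5):
    """Run func under cProfile; return a text report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_report(run_model))
```

The report lists the functions with the largest cumulative time first, which tells you where tuning effort could pay off and, just as usefully, where it could not. Note that a CPU profile like this will not show time spent waiting on disk as clearly as a wall-clock measurement; for an IO-heavy code like the one described above, compare elapsed time against CPU time as well.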
If you have some money, perhaps all you need is more memory, or a box that has two or more CPUs, an account at a computing service bureau, etc. However, I suggest that you take a step back and consider all your options and possibilities, to avoid the premature optimization trap.