Just to inject another point of view: hergp and jllaigre are correct about threading.
To answer your 32/64-bit question: time it yourself on the same dataset with two different compiles, one 32-bit and one 64-bit. We did that on a different Solaris architecture and found only a small improvement.
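If it helps, here is a rough sketch of such a timing run in Python. The binary names (./cruncher32, ./cruncher64) are hypothetical, and I am assuming your program takes the dataset path as its only argument; adjust to suit. It keeps the best of a few runs so that one noisy run does not skew the comparison.

```python
import subprocess
import time


def time_command(cmd, runs=3):
    """Run cmd `runs` times and return the best wall-clock time in seconds."""
    best = None
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the program exits non-zero, so a crash
        # is never mistaken for a fast run.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        elapsed = time.perf_counter() - start
        if best is None or elapsed < best:
            best = elapsed
    return best


def compare_builds(builds, dataset, runs=3):
    """Return {label: best_seconds} for each build, all on the same dataset."""
    return {label: time_command([path, dataset], runs)
            for label, path in builds.items()}


# Example (hypothetical paths, assumed for illustration):
# results = compare_builds({"32-bit": "./cruncher32",
#                           "64-bit": "./cruncher64"}, "data.in")
# for label, secs in results.items():
#     print(f"{label}: {secs:.2f} s")
```

Wall-clock time is the simplest thing to compare. If the box is busy with other work, run it a few more times before drawing conclusions.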
If your number cruncher uses big arrays, it may be wasting CPU. Code that constantly forces the CPU to bring in pages of data, and to do a lot of searching through the cached pages, spends cycles on memory management instead of on the calculation itself.
Consider running trapstat while your code is running. That gave us the data to justify using a larger page size. This DOES NOT necessarily involve coding; a minor change to the way you invoke your code is all that is needed. See the ppgsz man page for a very simple way to do this. Do this if and only if trapstat reveals an MMU issue. Also, I do not know much about your architecture, so this may not be as beneficial as it was on our M4000.
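To make the "change how you invoke it" part concrete: the ppgsz man page documents the form ppgsz -o heap=SIZE,stack=SIZE,anon=SIZE command, and trapstat -T breaks the TLB misses down by page size. Below is a small sketch, in Python, of a helper that builds that command line. The helper name, the 4M value and the ./cruncher path are my own assumptions for illustration; the page sizes your hardware actually supports are listed by pagesize -a.

```python
def ppgsz_command(cmd, heap=None, stack=None, anon=None):
    """Build a ppgsz command line requesting preferred page sizes for cmd.

    Sizes are strings such as "4M" or "512K" (assumed values; check
    `pagesize -a` for what the machine supports).
    """
    opts = [f"{name}={size}"
            for name, size in (("heap", heap), ("stack", stack), ("anon", anon))
            if size]
    if not opts:
        # Nothing requested: run the program exactly as before.
        return list(cmd)
    return ["ppgsz", "-o", ",".join(opts)] + list(cmd)


# Example (hypothetical program and dataset):
# ppgsz_command(["./cruncher", "data.in"], heap="4M")
# gives ["ppgsz", "-o", "heap=4M", "./cruncher", "data.in"]
```

The resulting list can be handed to subprocess.run, or you can simply type the same command at the shell. Either way, rerun trapstat afterwards to see whether the TLB miss numbers actually dropped.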
Have a read:
Multiple Page Size Support - Siwiki