In my case, 'the fix' was to put a whole-core constraint on the most utilized LDoms (the databases) and keep their VCPU counts aligned to core boundaries.
For instance, a SPARC T5-2 has 256 available VCPUs (hardware threads), which translates into 32 cores, or 2 sockets with 16 cores each. Since each core carries 8 threads, for best performance one should allocate VCPUs in multiples of 8.
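To make the core-boundary rule concrete, here is a minimal sketch (the helper name and the 8-threads-per-core constant are my assumptions, matching the T5 layout described above) that rounds a requested VCPU count up to a whole-core multiple:

```python
# Hypothetical helper: round a requested VCPU count up to a whole-core
# boundary, assuming 8 hardware threads per core as on the SPARC T5.
THREADS_PER_CORE = 8

def whole_core_vcpus(requested: int, threads_per_core: int = THREADS_PER_CORE) -> int:
    """Smallest multiple of threads_per_core that satisfies the request."""
    if requested <= 0:
        raise ValueError("requested VCPU count must be positive")
    cores = -(-requested // threads_per_core)  # ceiling division
    return cores * threads_per_core

# A T5-2 has 256 threads = 32 cores (2 sockets x 16 cores x 8 threads).
assert whole_core_vcpus(12) == 16    # 12 VCPUs rounds up to 2 whole cores
assert whole_core_vcpus(256) == 256  # already core-aligned
```

Giving a domain 12 VCPUs would leave it straddling a core; rounding to 16 keeps each core dedicated to one domain.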
Be sure to reboot the hypervisor after such major changes.
Regarding HPVM (now vPars and Integrity VM), I would recommend using vPars, since they can only be configured in that manner (dedicated cores for the virtual machines and the hypervisor).
Integrity VM can suffer from such 'misconfiguration' as well, since it can share cores.
Often the cache is simply reloaded from the lower, slower levels on the new core, and the old core's copy is eventually snooped away once the data is modified. This means that even if a different core is free at a given instant, it can be better to wait briefly for the old core, which may not be 100% busy over the longer term. Also, some caches are keyed on virtual addresses rather than physical ones, and may be flushed when other processes use the core; for these, dispatching multiple threads of the same process in succession reduces cache flushing. So while you have asked for concurrent threads, the system may quietly make that less true in practice.
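The wait-for-the-old-core argument is about cache affinity, which you can also enforce by hand. A minimal sketch, assuming a Linux host where Python's `os.sched_setaffinity` is available (Solaris would use `pbind` or processor sets instead, and the choice of core 0 is arbitrary):

```python
# Pin the calling process to one core so its working set stays in that
# core's caches instead of being refilled after every migration.
import os

target_core = 0  # hypothetical choice; pick a core your domain owns
os.sched_setaffinity(0, {target_core})           # pid 0 = calling process
assert os.sched_getaffinity(0) == {target_core}  # scheduler keeps us here
```

Pinning trades scheduling flexibility for warm caches, which is essentially the same bargain the dispatcher makes when it waits for the old core.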
Hyperthreading pays off most for threads of the same process, as they share the same virtual address space. It is a nice way to increase utilization of CPU resources, at the cost of some added delay when the threads' needs collide. It is an interesting counterpoint to the trend in modern CPU design toward speculative operations, which waste 50% or more of the resource but speed up the critical thread. I find it reminiscent of the old Honeywell-800, where the CPU ran the instructions of up to eight threads more or less in rotation. (If you loaded the accumulator, it did not 'hunt', so many programmers used the accumulator as a register to hog the CPU and speed up their own thread.) I used to fix this stuff, before it crawled inside a chip!