Plus, if you repeatedly call malloc/free for chunks of widely varying, very large sizes, malloc will gladly fragment the heap to the point where it becomes less efficient. This is due in part to the fact that some OS flavors may reclaim memory after a free call, especially if other processes are asking for memory chunks. NUMA also plays into big-chunk operations.
Several years ago we ran a test on a non-prod Solaris 10 box with 64GB of memory. We malloced one single giant chunk up front and never called malloc again, reusing that chunk over and over for buffers of varying sizes. When we added the malloc/free calls back in, so that every operation got a "new" chunk, the same test code ran about 15% slower and spent most of that extra time in kernel mode.
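The reuse pattern was essentially this (a minimal sketch, not the original test code; POOL_SIZE and process() are just placeholders for the worst-case buffer size and the real per-buffer work):

    #include <stdlib.h>
    #include <string.h>

    #define POOL_SIZE (256UL * 1024 * 1024)   /* sized for the largest buffer you will ever need */

    static void process(char *buf, size_t len)
    {
        memset(buf, 0xAB, len);               /* stand-in for the real work */
    }

    int main(void)
    {
        char *pool = malloc(POOL_SIZE);       /* one malloc, up front */
        if (!pool)
            return 1;

        for (size_t i = 0; i < 10000; i++) {
            size_t len = (i % 1000 + 1) * 1024;   /* varying sizes, all <= POOL_SIZE */
            process(pool, len);               /* reuse the same chunk every time */
        }

        free(pool);                           /* one free, at the end */
        return 0;
    }

The slow variant is the same loop with a malloc(len)/free() pair inside it, which is where the extra kernel time showed up.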
NUMA really slows down access to large memory allocations because of locality issues: the system cannot relocate gigantic memory chunks to a more convenient node. Since you have a commodity CPU (multicore x86), NUMA is a concern.
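If you are on Linux, libnuma gives you some control over where a big chunk ends up. A rough sketch, assuming libnuma is installed (link with -lnuma); the test above was Solaris, which has its own locality APIs:

    #include <numa.h>      /* libnuma */
    #include <stdio.h>

    int main(void)
    {
        size_t size = 1UL << 30;                /* 1 GB, for illustration */

        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this box\n");
            return 1;
        }

        /* place the chunk on the node local to the calling thread,
           so the work does not go through a remote memory controller */
        void *chunk = numa_alloc_local(size);
        if (!chunk)
            return 1;

        /* ... run the memory-heavy work from threads on that same node ... */

        numa_free(chunk, size);
        return 0;
    }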
You need to look into CPU affinity for threads.
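On Linux/glibc that looks roughly like this (pthread_setaffinity_np is non-portable; Solaris has processor_bind() instead). Combined with the node-local allocation above, it keeps each thread next to the memory it touches:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static int pin_to_cpu(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);

        /* pin the calling thread; each worker thread would pin itself similarly */
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main(void)
    {
        if (pin_to_cpu(0) != 0) {
            fprintf(stderr, "failed to pin thread to CPU 0\n");
            return 1;
        }
        /* ... do the memory-heavy work here, staying on CPU 0 ... */
        return 0;
    }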
If you are reading from and then writing to memory regions that are far apart, pay attention to the order in which you touch neighboring memory, rather than doing something like copying the contents of arr[0] into arr[2000000] and then reading arr[1000000]. Each of those actions can mean reloading the L2 cache, as an example. These days, memory is an order of magnitude or more slower than your CPUs.
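A classic illustration of the same point (a toy example, not your code): both loops below touch exactly the same 64MB, but the second one jumps 16KB between consecutive accesses, so nearly every access pulls in a fresh cache line:

    #include <stdlib.h>

    #define ROWS 4096
    #define COLS 4096

    int main(void)
    {
        int *a = malloc((size_t)ROWS * COLS * sizeof *a);
        if (!a)
            return 1;

        /* cache-friendly: walk memory in the order it is laid out */
        for (int r = 0; r < ROWS; r++)
            for (int c = 0; c < COLS; c++)
                a[r * COLS + c] = r + c;

        /* cache-hostile: same work, but each step lands COLS * sizeof(int) = 16KB away */
        for (int c = 0; c < COLS; c++)
            for (int r = 0; r < ROWS; r++)
                a[r * COLS + c] = r + c;

        free(a);
        return 0;
    }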
Edit: You really should consider this article:
http://www.akkadia.org/drepper/cpumemory.pdf
It is somewhat old, but still completely applicable.