High Performance Computing Message Passing Interface (MPI) programming and tuning, MPI library installation and management, parallel administration tools, cluster monitoring, cluster optimization, and more HPC topics.

Memory Barriers for (Ubuntu) Linux (i686)

High Performance Computing


    #15
06-19-2010
Corona688 (Forum Staff)
Mead Rotor
 
Join Date: Aug 2005
Last Activity: 25 March 2017, 10:57 PM EDT
Location: Saskatchewan
Posts: 22,008
Thanks: 1,063
Thanked 4,127 Times in 3,819 Posts
Quote:
Originally Posted by gorga View Post
But with those atomic operations I can avoid sending threads to sleep and having to wake them up.
Your threads will mostly sleep anyway. Let me illustrate something for you.

Imagine a massive 100-core machine running 100,000 threads. Only 100 of them can run at once; the other 99.9% will sleep. Assuming 1000 context switches a second (the standard for desktop Linux; servers are usually 100), this machine will pick up each thread once per second, for one millisecond each.

When your 99,999 reader threads are busy, your writer gets one millisecond, once a second, to add jobs. If your 99,999 threads are out of work, your writer still gets one millisecond, once a second, to fill the job queue. But it needed 50 milliseconds' worth, so some readers starve. The more threads you have, the more jobs your writer needs to add, and the less time it has to do so!

Let's jack up its priority one increment -- once per second is just dumb -- and voila, no more starvation. The writer gets a full 50 milliseconds per second to fill the queue -- and 950 more to suck sludge with. Oops. A high-priority thread won't be switched out to run a low-priority one unless it blocks, so the writer is wasting 950/1000 milliseconds of an entire CPU.

Okay, maybe this super-mega-ultra-high-performance nonblocking queue writer can stand a tiny bit of blocking when we know it's sucking sludge. The polite way to tell the OS is usleep(), so we add usleep(1000) to the bottom of the loop. That's a little better, wasting just 50% CPU -- since it's still high priority, it gets shoved ahead in line, ahead of every other task, whenever usleep's done. Raising the usleep value brings the waste lower until it's almost nothing, riding the line. Perfect.

Then a disk interrupt happens. This throws off the timing, letting readers starve. Oops -- you can't trust usleep()'s accuracy in a multitasking system!

Okay, so priority wastes time, and Blocking Is Evil. What else could be tried? Well, to run the writer thread more often without changing its priority, have more writers... One writer gets 1/50 of the time it needs, so how about fifty of them? 50 threads is a drop in the bucket when we have 100,000 of them. But how to distribute jobs to the writers in a parallel way? Another, miniature super-mega-ultra-high-performance nonblocking queue system to feed the big one? It'd need many threads, or high system priority, to get them to the writers in time, and a really intelligent management system of some sort to keep the right numbers of threads around with the right delays to keep things flowing efficiently.

Assuming, of course, that any of this is remotely efficient when so much time is spent checking for readiness instead of doing actual work. If only there were a way to tell a thread to shut up and go away until there's anything for it to do...

There was an old lady who swallowed a spider,
That wiggled and wiggled and tickled inside her.
She swallowed the spider to catch the fly.
But I dunno why she swallowed that fly -
Perhaps she'll die.

You've got a mental (b)lock about (b)locking. You've got to keep perspective:
  1. In a loaded system, almost everything's sleeping all the time no matter how nonblocking your code is.
  2. Every thread that goes to sleep lets another thread wake.
  3. If it has to wait for something anyway, it's wasting time by not blocking, time another thread could've done work in.
  4. If your threads won't politely block when convenient, the OS impolitely pre-empts them when inconvenient, because:
  5. The scheduler isn't psychic. If you won't give any hints by potentially blocking, it's left free to pick the stupidest possible order and priority.
  6. You can't assume no starvation with the stupidest possible order and priority.
To sum up, blocking is there for a reason. Why wouldn't readers starve when they're 99,999 times more likely to get execution time than the writer? You're trying to build an Erlang here. Don't expect the same behavior in C.

Last edited by Corona688; 06-19-2010 at 03:36 PM..
    #16
06-20-2010
gorga (Registered User)
 
Join Date: Jun 2010
Last Activity: 21 June 2010, 10:45 AM EDT
Posts: 18
Thanks: 3
Thanked 0 Times in 0 Posts
That's an impressive reply, Corona; you made your case very persuasively, but it's not "my case". The problem is that you're not aware of the overall system -- its context, aims and design (which is not your fault, of course). I'll try and clarify a few things...

Quote:
Originally Posted by Corona688 View Post
Imagine a massive 100-core machine running 100,000 threads.
In my system, if I had a massive 100-core machine, I would only create 100 threads; with 8 cores, 8 threads; with 16 cores, 16 threads; and so on. There is a 1:1 mapping of cores to threads, where each thread runs on a designated core.

Yes, I am fully aware that in a multitasking OS there are still going to be more threads and processes running, and my threads will be put to sleep and context-switched out occasionally, but why would I increase the frequency of this by adding superfluous threads of my own? Sure, I could split the queues up between more threads, but that "illusion" of extra concurrency only buys fairness, which I'm not really concerned about anyway.

(Recall, the thread pool allows the creation of light-weight tasks, not threads. Threads are oblivious to the higher layer.)

Quote:
When your 99,999 reader threads are busy, your writer gets one millisecond, once a second, to add jobs. If your 99,999 threads are out of work, your writer still gets one millisecond, once a second, to fill the job queue. But it needed 50, so some starve.
But this doesn't accurately reflect what's happening in the system. "Reading" in this case is "executing tasks"; "writing" is when a "reader" "creates a new task" (on another core's task-queue). Readers are writers and vice versa. "Add jobs" and "filling a job queue" suggest that a task's work involves creating multiple sub-tasks on the other queues in one go, which isn't a frequent scenario. (A task may need to broadcast a message to n other tasks, but even then, "lock-stepping" through the recipients isn't required.)

Threads being "out of work" isn't a frequent scenario either. Approximately 95% of the time there's an abundance of tasks, (roughly) equally distributed; the remaining 5% of the time there's no work to do anyway.

Quote:
The more threads you have, the more jobs your writer needs to add, and the less time it has to do so!
No, this isn't an accurate description of the system's behaviour. A task (writer) doesn't necessarily create more tasks (jobs) just because there are more threads around. Like I said, imagine an n-ary tree with paths of equal length, where communication only takes place between nodes residing on the same path. (In many cases nodes on the same path will be running on the same underlying thread, so no synchronisation is necessary for those to communicate.)

Quote:
You've got a mental (b)lock about (b)locking. You've got to keep perspective: [...] To sum up, blocking is there for a reason.
Let me put it this way, I have a list of jobs to do, as I progress through the list I find some jobs aren't ready. I can either block/sleep etc and wait for the job to become ready, or I can move onto the next task and execute that one instead. If I build my system with locks, I'll adhere to the first approach, whereas what I'm trying to achieve (aided by atomic flags and memory barriers) is the second.

Surely you're not saying that the first one is an equivalent approach to exploiting scalable parallel execution just because pre-emption is something that occurs in the OS anyway?
    #17
06-20-2010
fpmurphy (Forum Staff)
who?
 
Join Date: Dec 2003
Last Activity: 12 June 2016, 11:03 PM EDT
Location: /dev/ph
Posts: 4,996
Thanks: 73
Thanked 475 Times in 437 Posts

Quote:
Originally Posted by gorga
In my system, if I had a massive 100-core machine, I would only create 100 threads; with 8 cores, 8 threads; 16 cores, 16 threads; and so on. There is a 1:1 mapping of cores to threads where each thread runs on a designated core.

Just out of curiosity how are you going to ensure that each thread runs on a designated core? And have you thought through the overhead of attempting to ensure that each thread runs on a designated core?
    #18
06-20-2010
gorga (Registered User)
 
Join Date: Jun 2010
Last Activity: 21 June 2010, 10:45 AM EDT
Posts: 18
Thanks: 3
Thanked 0 Times in 0 Posts
Quote:
Originally Posted by fpmurphy View Post
Just out of curiosity how are you going to ensure that each thread runs on a designated core? And have you thought through the overhead of attempting to ensure that each thread runs on a designated core?
At start-up I used the pthread_setaffinity_np function. Each pthread stays on its core for the lifetime of the application, i.e.:


Code:
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);  /* core_id: the core this thread is pinned to */

    ret_code = pthread_setaffinity_np(thread->tid, sizeof(cpu_set_t), &cpuset);
    if (ret_code != 0)
    {
        fprintf(stderr, "error assigning thread to core\n");
        exit(ret_code);
    }

Is there a problem with doing it this way? (When I tested the code, it appeared to be allocating threads to cores correctly.)
    #19
06-20-2010
Corona688 (Forum Staff)
Mead Rotor
 
Join Date: Aug 2005
Last Activity: 25 March 2017, 10:57 PM EDT
Location: Saskatchewan
Posts: 22,008
Thanks: 1,063
Thanked 4,127 Times in 3,819 Posts
Quote:
Originally Posted by gorga View Post
Let me put it this way, I have a list of jobs to do, as I progress through the list I find some jobs aren't ready. I can either block/sleep etc and wait for the job to become ready, or I can move onto the next task and execute that one instead.
...and when you don't assume best-case, you could be going through the next few thousand not-ready tasks. Meanwhile every idle worker's doing the same thing. I begin to understand why you're concerned about contention for memory.
Quote:
If I build my system with locks, I'll adhere to the first approach, whereas what I'm trying to achieve (aided by atomic flags and memory barriers) is the second.
But because it doesn't block, you'll be wasting time scanning the list anyway -- time that could have been spent doing actual work. And since your system looks as busy when idle as when it's actually working, you'll have a difficult time guessing how much. If I read you correctly, the jobs are all tiny. How tiny? How much more work is it to do a job than to scan the list? (And don't assume the list could never, ever become mostly empty; that's the goal, not the proof.) If they're even in the same ballpark, you're going to waste a worrying proportion of CPU time scanning your list.

Anyway, the queue needn't block the way you're describing. Put jobs in the queue when they become ready; don't just stick them there in advance. That way threads won't block when picking up jobs unless you're actually out of jobs -- in which case you want them to block. If the queue's big enough and jobs can be added fast enough, things run smoothly. You can also do other things to streamline the queue -- hand out jobs 16 at a time instead of one at a time, switch between multiple queues, etc.

Last edited by Corona688; 06-20-2010 at 09:33 PM..
    #20
06-21-2010
gorga (Registered User)
 
Join Date: Jun 2010
Last Activity: 21 June 2010, 10:45 AM EDT
Posts: 18
Thanks: 3
Thanked 0 Times in 0 Posts
Quote:
Originally Posted by Corona688 View Post
...and when you don't assume best-case, you could be going through the next few thousand not-ready tasks.
Bear in mind though that the "best" case is also the typical one here.

Quote:
Meanwhile every idle worker's doing the same thing. I begin to understand why you're concerned about contention for memory.
Not a problem here, the idle workers are searching through separate queues.

Quote:
But because it doesn't block, you'll be wasting time scanning the list anyway -- time that could have been spent doing actual work. And since your system looks as busy when idle as when it's actually working, you'll have a difficult time guessing how much.
Fair enough, in rare cases, it could behave like that. Perhaps the time gained by not blocking would be spent searching.

Quote:
If I read you correctly, the jobs are all tiny. How tiny? How much more work is it to do a job than to scan the list?
Varies, some tasks are tiny, short-lived tasks, others persist and involve a greater amount of work...at a rough guess 50-50.

Quote:
Put jobs in the queue when they become ready, don't just stick them there in advance, that way threads won't block when picking up jobs unless you're actually out of jobs -- in which case you want them to block.
Okay, let's explore that for a minute. This queue would have to be able to grow and shrink dynamically, right? So not only would there be a lock for accessing the queue, there'd also be a lock for allocating memory on the heap. In addition, repeatedly following a pointer into freshly allocated memory would likely cause repeated cache misses, which would also degrade performance.

So let's say that instead of allocating one task at a time, I allocate a chunk of tasks at a time -- say 100, using calloc for example. Now my tasks are contiguous in memory, and I've cut the heap contention to a hundredth while improving the cache hit rate, because I can bring multiple tasks into the cache in one go. Great, but many tasks don't persist for long, so why not recycle the memory in the list when a task has completed? Then I don't have to allocate new chunks quite so often.

Now how do I achieve that? With a flag to say the task slot is ready, perhaps. It has to be atomic, though, because it's shared between threads, and I have to ensure instructions are not reordered, by using memory barriers.

But I'm not sure about memory barriers... I know, I'll ask the folks on the Unix and Linux forum.

You see where this is leading.

Last edited by gorga; 06-21-2010 at 10:01 AM..
