Memory Barriers for (Ubuntu) Linux (i686)

06-19-2010

Registered User

23,310, 4,623

Join Date: Aug 2005

Last Activity: 7 July 2020, 11:47 AM EDT

Location: Saskatchewan

Posts: 23,310

Thanks Given: 1,331

Thanked 4,623 Times in 4,217 Posts

Quote:

Originally Posted by gorga

But with those atomic operations I can avoid sending threads to sleep and having to wake them up.

Your threads will mostly sleep anyway. Let me illustrate something for you.

Imagine a massive 100-core machine running 100,000 threads. Only 100 of them will run, 99.90% will sleep. Assuming 1000 context switches a second(the standard for desktop linux, servers are usually 100) this machine will pick up each thread once per second for one millisecond each.

When your 99,999 reader threads are busy, your writer gets one millisecond, once a second, to add jobs. If your 99,999 threads are out of work, your writer still gets one millisecond, once a second, to fill the job queue. But it needed 50, so some starve. The more threads you have, the more jobs your writer needs to add, and the less time it has to do so!

Let's jack up it's priority one increment, once per second is just dumb -- voila, no more starvation. The writer gets a full 50 milliseconds per second to fill the queue -- and 950 more to suck sludge with. Oops. A a high-priority thread won't be switched out to run a low one unless it blocks, so the writer's wasting 950/1000 milliseconds on an entire CPU.

Okay, maybe this super-mega-ultra-high-performance nonblocking queue writer can stand a tiny bit of blocking when we know it's sucking sludge. The polite way to tell the OS is usleep(), so we add usleep(1000) to the bottom of the loop. That's a little better, wasting just 50% CPU -- since it's still high priority it gets shoved ahead in line whenever usleep's done, every other task. Jacking the value of usleep brings it lower until it's wasting almost nothing, riding the line. Perfect.

Then a disk interrupt happens. This throws off the timing, letting readers starve. Oops -- you can't trust usleep()'s accuracy in a multitasking system!

Okay, so priority wastes time, and Blocking Is Evil. What else could be tried? Well, to run the writer thread more often without changing its priority, have more writers... One writer gets 1/50 of the time it needs, so how about fifty of them? 50 threads is a drop in the bucket when we have 100,000 of them. But how to distribute jobs to the writers in a parallel way? Another, miniature super-mega-ultra-high-performance nonblocking queue system to feed the big one? It'd need many threads, or high system priority, to get them to the writers in time, and a really intelligent management system of some sort to keep the right numbers of threads around with the right delays to keep things flowing efficiently.

Assuming, of course, that this is remotely efficiently when so much time is spent checking for readiness instead of doing actual work. If only there was a way to tell a thread to shut up and go away until there's anything for it to do...

There was an old lady who swallowed a spider,
That wiggled and wiggled and tickled inside her.
She swallowed the spider to catch the fly.
But I dunno why she swallowed that fly -
Perhaps she'll die.

You've got a mental (b)lock about (b)locking. You've got to keep perspective:

In a loaded system, almost everything's sleeping all the time no matter how nonblocking your code is.
Every thread that goes to sleep lets another thread wake.
If it has to wait for something anyway, it's wasting time by not blocking, time another thread could've done work in.
If your threads won't politely block when convenient, the OS impolitely pre-empts them when inconvenient, because:
The scheduler isn't psychic. If you won't give any hints by potentially blocking, it's left free to pick the stupidest possible order and priority.
You can't assume no starvation with the stupidest possible order and priority.

To sum up, blocking is there for a reason. Why wouldn't readers starve when they're 99,999 times more likely to get execution time than the writer? You're trying to build an Erlang here. Don't expect the same behavior in C.

Last edited by Corona688; 06-19-2010 at 04:36 PM..

Corona688

View Public Profile for Corona688

Visit Corona688's homepage!

Find all posts by Corona688

06-20-2010

Registered User

18, 0

Join Date: Jun 2010

Last Activity: 21 June 2010, 10:45 AM EDT

Posts: 18

Thanks Given: 3

Thanked 0 Times in 0 Posts

That's an impressive reply Corona, you made your case very persuasively, but it's not "my case". The problem is you're not aware of the overall system, its context, aims and design (which is not your fault of course). I'll try and clarify a few things...

Quote:

Originally Posted by Corona688

Imagine a massive 100-core machine running 100,000 threads.

In my system, if a had a massive 100-core machine, I would only create 100 threads, if I had a 8 cores- 8 threads, 16 cores -16 threads etc. There is a 1:1 mapping of cores to threads where each thread runs on a designated core.

Yes, I am fully aware that in a multitasking OS there are still going to be more threads and processes going on, and my threads will be put to sleep and context-switched out occasionally, but why would I increase the frequency of this by adding more superfluous threads? Sure, I could split queues up between more threads but the "illusion" will only allow fairness, which I'm not really concerned about anyway.

(Recall, the thread pool allows the creation of light-weight tasks, not threads. Threads are oblivious to the higher layer.)

Quote:

When your 99,999 reader threads are busy, your writer gets one millisecond, once a second, to add jobs. If your 99,999 threads are out of work, your writer still gets one millisecond, once a second, to fill the job queue. But it needed 50, so some starve.

But this doesn't really reflect accurately what's happening in the system. "Reading" in this case is "executing tasks", "writing" is when a "reader" "creates a new task" (on another core's task-queue). Readers are writers and visa-versa. "add jobs" and "filling a job queue" suggests a task's work involves creating multiple sub-tasks on the other queues in one instance, which isn't a frequent scenario (although, it may be that a task needs to broadcast a message to n other tasks, but in that case "lock-stepping" through the recipients still isn't required).

Talking about thread's being "out of work" isn't a frequent scenario either. Approx 95% of the time there's an affluence of tasks (roughly) equally distributed, the remaining 5% indicates there's no work to do anyway.

Quote:

The more threads you have, the more jobs your writer needs to add, and the less time it has to do so!

No, this isn't an accurate description of the behaviour of the system. A task (writer) doesn't necessarily create more tasks (jobs) just because there are more tasks (threads) around. Like I said, imagine an n-ary tree with paths of equal length and communication only takes place between nodes residing on the same path. (In many cases nodes on the same path will be running on the same underlying thread, hence no syncrhonisation is necessary for those to communicate).

Quote:

You've got a mental (b)lock about (b)locking. You've got to keep perspective:To sum up, blocking is there for a reason.

Let me put it this way, I have a list of jobs to do, as I progress through the list I find some jobs aren't ready. I can either block/sleep etc and wait for the job to become ready, or I can move onto the next task and execute that one instead. If I build my system with locks, I'll adhere to the first approach, whereas what I'm trying to achieve (aided by atomic flags and memory barriers) is the second.

Surely you're not saying that the first one is an equivalent approach to exploiting scalable parallel execution just because pre-emption is something that occurs in the OS anyway?

gorga

View Public Profile for gorga

Find all posts by gorga

06-20-2010

Registered User

4,996, 477

Join Date: Dec 2003

Last Activity: 12 June 2016, 11:03 PM EDT

Location: /dev/ph

Posts: 4,996

Thanks Given: 73

Thanked 477 Times in 439 Posts

Code:

In my system, if I had a massive 100-core machine, I would only create 100 threads, if I had a 8 cores- 8 threads, 16 cores -16 threads etc. There is a 1:1 mapping of cores to threads where each thread runs on a designated core.

Just out of curiosity how are you going to ensure that each thread runs on a designated core? And have you thought through the overhead of attempting to ensure that each thread runs on a designated core?

fpmurphy

View Public Profile for fpmurphy

Find all posts by fpmurphy

06-20-2010

Registered User

18, 0

Join Date: Jun 2010

Last Activity: 21 June 2010, 10:45 AM EDT

Posts: 18

Thanks Given: 3

Thanked 0 Times in 0 Posts

Quote:

Originally Posted by fpmurphy

Just out of curiosity how are you going to ensure that each thread runs on a designated core? And have you thought through the overhead of attempting to ensure that each thread runs on a designated core?

At start-up I used the pthread_setaffinity_np function. Each pthread exists on its core for the life-time of the application, i.e:

Code:

    ret_code = pthread_setaffinity_np(thread->tid, sizeof(cpu_set_t), &cpuset);
    if(ret_code != 0)
    {
        fprintf(stderr, "error assigning thread to core\n");
        exit(ret_code);
    }

Is there a problem with doing it this way? (When I tested the code it appeared to be allocating threads to cores, correctly).

gorga

View Public Profile for gorga

Find all posts by gorga

06-20-2010

Registered User

23,310, 4,623

Join Date: Aug 2005

Last Activity: 7 July 2020, 11:47 AM EDT

Location: Saskatchewan

Posts: 23,310

Thanks Given: 1,331

Thanked 4,623 Times in 4,217 Posts

Quote:

Originally Posted by gorga

Let me put it this way, I have a list of jobs to do, as I progress through the list I find some jobs aren't ready. I can either block/sleep etc and wait for the job to become ready, or I can move onto the next task and execute that one instead.

...and when you don't assume best-case, you could be going through the next few thousand not-ready tasks. Meanwhile every idle worker's doing the same thing. I begin to understand why you're concerned about contention for memory.

Quote:

If I build my system with locks, I'll adhere to the first approach, whereas what I'm trying to achieve (aided by atomic flags and memory barriers) is the second.

But because it doesn't block, you'll be wasting time scanning the list anyway, time that could have been spent doing actual work. And since your system's as busy idle as it is when actually busy you'll have a difficult time guessing how much. If I read you correctly, the jobs are all tiny. How tiny? How much more work is it to do a job than to scan the list? (And don't assume it could never, ever become mostly empty, that's the goal, not the proof.) If they're even in the same ballpark, you're going to be wasting a worrying proportion of CPU time scanning your list.

Anyway, the queue needn't block like you're describing. Put jobs in the queue when they become ready, don't just stick them there in advance, that way threads won't block when picking up jobs unless you're actually out of jobs -- in which case you want them to block. If the queue's big enough and jobs can be added fast enough, things can run smoothly. You can also do other things to streamline the queue -- hand out jobs 16 at a time instead of one at a time, switch between multiple queues, etc.

Last edited by Corona688; 06-20-2010 at 10:33 PM..

Corona688

View Public Profile for Corona688

Visit Corona688's homepage!

Find all posts by Corona688

06-21-2010

Registered User

18, 0

Join Date: Jun 2010

Last Activity: 21 June 2010, 10:45 AM EDT

Posts: 18

Thanks Given: 3

Thanked 0 Times in 0 Posts

Quote:

Originally Posted by Corona688

...and when you don't assume best-case, you could be going through the next few thousand not-ready tasks.

Bear in mind though that the "best" case is also the typical one here.

Quote:

Meanwhile every idle worker's doing the same thing. I begin to understand why you're concerned about contention for memory.

Not a problem here, the idle workers are searching through separate queues.

Quote:

But because it doesn't block, you'll be wasting time scanning the list anyway, time that could have been spent doing actual work. And since your system's as busy idle as it is when actually busy you'll have a difficult time guessing how much.

Fair enough, in rare cases, it could behave like that. Perhaps the time gained by not blocking would be spent searching.

Quote:

If I read you correctly, the jobs are all tiny. How tiny? How much more work is it to do a job than to scan the list?

Varies, some tasks are tiny, short-lived tasks, others persist and involve a greater amount of work...at a rough guess 50-50.

Quote:

Put jobs in the queue when they become ready, don't just stick them there in advance, that way threads won't block when picking up jobs unless you're actually out of jobs -- in which case you want them to block.

Okay, let's explore that for a minute. This queue would have to be able to dynamically grow and shrink right, so not only would there be a lock for accessing the queue, there'd also be a lock for allocating the memory on the heap. In addition, the repeated following of a pointer into allocated memory would likely cause repeated cache misses, which would also degrade performance.

So let's say, instead of allocating a task at a time, I allocate a chunk of tasks at a time, say 100 using calloc for example. Now my tasks are contiguous in memory and I've reduced the heap contention to a 100th while improving the cache hit rate cos I can bring multiple tasks into the cache in one go. Great, but many tasks don't persist for long, so why not recycle the memory in the list when a task has completed and then I don't have to allocate more chunks quite so often.

Now how do I achieve that, with a flag to say the task is ready perhaps, has to be atomic though because it's shared between threads and I have to ensure instructions are not reordered by using memory barriers.

But I'm not sure about memory barriers, I know, I'll ask the folks on the Unix forum...

You see where this is leading.

Last edited by gorga; 06-21-2010 at 11:01 AM..

gorga

View Public Profile for gorga

Find all posts by gorga

High Performance Computing

Memory Barriers for (Ubuntu) Linux (i686)

4 More Discussions You Might Find Interesting

1. Ubuntu

XP and Linux (Ubuntu) on same disk, Can I install Ubuntu on not-yet partitioned portion of disk?

Discussion started by: C.Weidemann

2. Programming

Getting the total virtual memory for ubuntu in c++

Discussion started by: xiaojesus

3. Linux

i686, x86 64, ppc

Discussion started by: superblacksmith

4. UNIX for Dummies Questions & Answers

Memory-waste in Ubuntu/Debian?

Discussion started by: riwa