

High Performance Computing: Message Passing Interface (MPI) programming and tuning, MPI library installation and management, parallel administration tools, cluster monitoring, cluster optimization, and more HPC topics.

Memory Barriers for (Ubuntu) Linux (i686)

High Performance Computing


#1  06-16-2010
gorga
Registered User
Join Date: Jun 2010
Posts: 18

Hi all,

(Hope this is the right forum for this question)

I have some multi-threaded C code (compiled with GCC 4.4.3) which accesses shared variables. Although I've marked those variables with volatile to guard against compiler re-ordering, I'm concerned that processor out-of-order execution may cause my code to fail, and I'm looking for a "low-cost" method of guaranteeing ordering is maintained in my code.

For example, I have something like...


Code:
memset(&task, 0, sizeof(task_t)); /* null memory */
task.id.prefix = prefix_id;
task.id.instance = instance_id;
/* write-memorybarrier required here */
task.state = task_ready;

Where I need to ensure that the "task state" is only set to "task_ready" after the previous stores have been committed. As the "task" is shared between threads, another thread seeing the state as "ready" may try to access its member variables, so it's vital that the task's "prefix" and "instance" have been updated first.
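For illustration, here is one way that barrier could be expressed in user space with GCC (assuming GCC >= 4.1, where "__sync_synchronize()" emits a full memory fence; the "task_t" layout below is a hypothetical reconstruction from the post, not the original code):

```c
#include <string.h>
#include <stdint.h>

typedef enum { task_empty = 0, task_ready } task_state_t;

typedef struct {            /* hypothetical layout based on the post */
    struct { uint32_t prefix; uint32_t instance; } id;
    volatile task_state_t state;
} task_t;

void publish_task(task_t *task, uint32_t prefix_id, uint32_t instance_id)
{
    memset(task, 0, sizeof(task_t));     /* null memory */
    task->id.prefix   = prefix_id;
    task->id.instance = instance_id;
    __sync_synchronize();                /* full barrier: id fields commit first */
    task->state = task_ready;            /* only now can readers see "ready" */
}
```

On x86 specifically, stores are not reordered with other stores, so a plain compiler barrier (asm volatile("" ::: "memory")) would be enough on the writer side; "__sync_synchronize()" is the portable but heavier choice.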

I know this is a common problem, and mutexes and semaphores provide built-in memory barriers to address it, but I'm trying to build a scalable application and I want to avoid their use if possible. I also know GCC provides built-in atomic operations, but I see they involve locking the data-bus. I've also heard about kernel primitives like "smp_wmb()", but I'm not sure how to incorporate anything equivalent into my user-space program, as they are platform dependent.

Therefore, can anyone provide pointers or advice on how best (in terms of scalability and speed) to guarantee that ordering is maintained?

Thanks.
#2  06-16-2010
Corona688
Forum Staff
Join Date: Aug 2005
Location: Saskatchewan
Posts: 22,518
I know you want to avoid mutexes, but this is really what they're for. You could try Linux futexes, which handle the uncontended case entirely in userspace, but I'm not sure what they do about memory barriers (this may involve a special mmap flag to create the futex pages, though I think futexes have changed several times since I began fiddling with them). Besides, modern Linux NPTL pthreads is built on futexes anyway, so you'd just be reinventing pthreads.

Besides, what if you need to move it to MIPS or something?

[edit] One thought does occur to me. How large are these structures? Could you perhaps reorder it to put prefix, instance, and state in order in memory? You could assemble the data in an MMX or SSE register, then overwrite several structure members in one assembly op.
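A sketch of that idea, sized down so the whole update fits one naturally aligned 64-bit store (on x86-64 an aligned 8-byte store is a single memory op; a 16-byte SSE store carries no atomicity guarantee, so three 32-bit fields would not get this for free — the names here are hypothetical):

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t instance;   /* payload member */
    uint32_t state;      /* 0 = empty, 1 = ready */
} slot_t;                /* 8 bytes total; keep it 8-byte aligned */

void publish_slot(volatile uint64_t *slot, uint32_t instance, uint32_t state)
{
    slot_t s = { instance, state };
    uint64_t word;
    memcpy(&word, &s, sizeof word);  /* pack both fields into one word */
    *slot = word;                    /* single aligned store publishes both at once */
}
```

A reader that loads the whole 64-bit word then either sees the old slot or the new one, never a half-written mix, which is the property the structure-reordering trick is after.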

Last edited by Corona688; 06-16-2010 at 03:15 PM..
#3  06-16-2010
gorga
Hi Corona, (small world!)

Quote:
Originally Posted by Corona688 View Post
You could try using Linux futexes, which handle the nonblocking case completely in userspace, but I'm not sure what that'd do for memory barriers
I hadn't heard of futexes until you mentioned them, but I did some reading and it seems they still use atomic instructions to update shared variables. In that case I could just use one of GCC's built-in atomic operations like "__sync_fetch_and_add" or "__sync_bool_compare_and_swap", as described here...

Atomic Builtins - Using the GNU Compiler Collection (GCC)

The thing with these is that they use the asm "lock" prefix, which asserts a hardware lock on the data-bus, effectively locking every other processor out of memory for the duration of the instruction. Because I'm writing an application that should scale to a system with many cores, I'm discouraged by this.
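For concreteness, the builtins in question look like this (GCC >= 4.1; on x86 each compiles to a lock-prefixed instruction and acts as a full memory barrier — the variable names are just for the example):

```c
#include <stdint.h>

volatile uint32_t counter = 0;
volatile uint32_t state   = 0;   /* 0 = empty, 1 = ready */

uint32_t bump(void)
{
    /* atomically add 1 and return the old value (lock xadd on x86) */
    return __sync_fetch_and_add(&counter, 1);
}

int try_publish(void)
{
    /* atomically set state to 1 only if it is still 0 (lock cmpxchg on x86);
     * returns nonzero if the swap happened */
    return __sync_bool_compare_and_swap(&state, 0, 1);
}
```

The compare-and-swap form is the usual building block for lock-free updates: retry in a loop until the swap succeeds.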

Quote:
Besides, what if you need to move it to MIPS or something?
Could be a possibility; I believe they've made advances into highly parallel architectures recently, but the project is at a research stage right now, so if I can get it to work well on x86 that's good enough for now. I like the sound of this idea though...

Quote:
Could you perhaps reorder it to put prefix, instance, and state in order in memory? You could assemble the data in an MMX or SSE register, then overwrite several structure members in one assembly op.
This could be a good solution, but I'm not sure how to do it. Do you have any examples of similar code as a guide?

Quote:
One thought does occur to me. How large are these structures?
prefix and instance are both uint32_t, while state is an enum (I guess that means it's a uint32_t also?).
#4  06-17-2010
Corona688
Small world, how so?
Quote:
Originally Posted by gorga View Post
I hadn't heard of futexes until you mentioned them, but I did some reading and it seems they still use atomic instructions to update shared variables.
Well, yes. It has to synchronize somehow. One way or another you must interrupt other cores with this change in status, or they may never know.
Quote:
In that case I could just use one of GCC's built-in atomic operations like "__sync_fetch_and_add" or "__sync_bool_compare_and_swap", as described here...

Atomic Builtins - Using the GNU Compiler Collection (GCC)
Wow, those are nice.

Quote:
The thing with these is they use the asm op-code "lock", which issues a hardware lock on the data-bus effectively locking every other process out of memory.
I think you're overreacting... Any memory I/O monopolizes the bus*, LOCK just guarantees one instruction gets two ops in a row.

Also: the original 8088 had only a few bytes of prefetch queue, so locking the bus stalled it almost instantly... The huge caches, multiple independent memory buses, and cache-coherency links in recent NUMA systems usually let cores keep going or find something else to do. I'm not sure LOCK XCHG even forces a real memory fetch anymore (might be simple to test; I'll try to get back to you on that).

Lastly, if you're doing no mutexing, what are you doing instead -- polling? That's not going to be more efficient; untold amounts of CPU will be expended on what amounts to a while(1) loop.

I really think pthreads is still what you're looking for. They've made it as fast as they know how, significantly changing the kernel to accommodate it.

* Exceptions exist for very special-purpose memory chips like video RAM.
#5  06-17-2010
gorga
Quote:
Originally Posted by Corona688 View Post
Small world, how so?
Noting your help on the Programming forum too!

Quote:
I think you're overreacting... Any memory I/O monopolizes the bus*, LOCK just guarantees one instruction gets two ops in a row.
Are you suggesting, then, that if I used such an instruction relatively frequently (say once in a loop of maybe 100 execution statements, per core), I shouldn't notice a significant drop in throughput of the application?

Quote:
I'm not sure LOCK XCHG even forces a real memory fetch anymore (might be simple to test; I'll try to get back to you on that).
You'd expect, though, that each core accessing the XCHG variable would have to get the value from memory as soon as it accessed it; otherwise what use would CMPXCHG be? Not sure about this area to be honest, (but I read that these atomic operations do create a memory barrier, so a core cannot reorder instructions across said barrier).

Quote:
Lastly, if you're doing no mutexing, what are you doing instead -- polling? That's not going to be more efficient, untold amounts of CPU will be expended on what amounts to a while(1) loop.
What I'm building is a thread-pool with n pthreads equal to the number of cores (so I am using pthreads). The pthreads continually execute a list of "lightweight tasks". The user can create tasks and send "messages" between them. (If you've ever used Erlang, something similar to the abstraction provided there but in my case using C).

The pthreads occasionally check the "value" of a task's "state" when they reach that task in the queue; if the "state" isn't "ready" they simply move on to the next task (hence the pthread has more work to do and isn't polling continuously). You see what this means: as long as a pthread "eventually" discovers a task is "ready" that's okay, even if it's not asap. It seems like a lock would be unnecessary here then, but a pthread shouldn't detect that the task state is "ready" before its other data members have been updated (hence the need for a memory barrier).
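The reader side of that scheme could be sketched with a compare-and-swap, so the barrier and a "only one pthread claims this task" guarantee come together in one operation (the "task_claimed" state and the field names are hypothetical additions for the example):

```c
#include <stdint.h>

enum { task_empty = 0, task_ready = 1, task_claimed = 2 };

typedef struct {
    uint32_t prefix;
    uint32_t instance;
    volatile uint32_t state;
} task_t;

/* Returns 1 if this thread won the task, 0 if it was not ready
 * (or another pthread got there first). The CAS is a full barrier,
 * so reads of prefix/instance after a win see the published values. */
int try_claim(task_t *t)
{
    return __sync_bool_compare_and_swap(&t->state, task_ready, task_claimed);
}
```

A pthread that loses the race just moves on to the next task in the queue, which matches the "eventually is fine" requirement above.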

If using these atomic operations isn't going to impact throughput, then great, they solve the problem; but even that seems like overkill when I only need to ensure that a handful of statements are executed in a certain order.
#6  06-17-2010
Corona688

Quote:
Originally Posted by gorga View Post
Are you suggesting then, that if I used such an instruction relatively frequently (say once in a loop of maybe a 100 execution statements, per core), I shouldn't notice a significant drop in throughput of the application?
Don't panic. 'lock' is not a mutex. A single memory bus always works as if it's "single-threaded", just because it's physically impossible for it to do anything else.

Imagine two threads running XCHG in an infinite loop. The order in which they're granted access to memory might end up as:

Code:
Iteration:  1     1     2     2     3     3
Thread A:   Read  ...   Write ...   Read  ...
Thread B:   ...   Read  ...   Write ...   Read

Since this isn't LOCK XCHG, the operation isn't guaranteed to be atomic. Reads and writes might happen interleaved like this, or in any order.

Now, with LOCK XCHG:

Code:
Iteration:  1     1     2     2     3     3
Thread A:   Read  Write ...   ...   Read  Write
Thread B:   ...   ...   Read  Write ...   ...

The same amount of waiting is happening; LOCK just forces an atomic order.
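On x86, GCC's "__sync_lock_test_and_set" compiles to exactly this xchg (an xchg with a memory operand is locked implicitly), so the atomic read-write pair above is what a minimal test-and-set spinlock rides on — a sketch, not a recommendation over pthreads:

```c
static volatile int lock_word = 0;

void spin_lock(volatile int *l)
{
    /* xchg: atomically store 1 and return the previous value */
    while (__sync_lock_test_and_set(l, 1))
        ;   /* spin until the previous value was 0 (lock was free) */
}

void spin_unlock(volatile int *l)
{
    __sync_lock_release(l);   /* store 0 with release semantics */
}
```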

Quote:
You'd expect that each core accessing the XCHG variable though would have to get the value from memory
Memory, or cache. In recent x86 systems, cores can update each other's caches; that's what HyperTransport is for: a shortcut between caches.
Quote:
(but I read that these atomic operations do create a memory barrier so a core cannot execute instructions either side of said barrier out of order).
Excellent.
Quote:
The pthreads occasionally check the "value" of a task "state", when they reach that task in the queue, therefore if the "state" isn't "ready" they simply move on to the next task (hence the pthread has more work to do and isn't polling continuously).
And when they run out of jobs completely?
Quote:
You see what this means, as long as a pthread "eventually" discovers a task is "ready" that's okay, even if it's not asap. It seems like a lock would be unnecessary here then, but a pthread shouldn't detect that the task state is "ready" before its other data members have been updated (hence the need for a memory barrier).
Your readers would spend a lot of time scanning a mostly-empty list, and your writer would spend a lot of time scanning a mostly-full one. That's a lot of time wasted. You'd be much better off just using a normal one-writer-many-reader queue. It won't block when there's lots of work, and will actually put threads to sleep when there's none, instead of everything spinlocking forever.
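A minimal sketch of that kind of queue with pthreads, where idle threads actually sleep in pthread_cond_wait instead of rescanning the list (bounded ring, one writer, many readers; the sizes and names are illustrative, not from this thread):

```c
#include <pthread.h>

#define QCAP 64

typedef struct {
    void *items[QCAP];
    int head, tail, count;
    pthread_mutex_t mu;
    pthread_cond_t nonempty;
} queue_t;

void queue_init(queue_t *q)
{
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->mu, NULL);
    pthread_cond_init(&q->nonempty, NULL);
}

void queue_push(queue_t *q, void *item)   /* single writer */
{
    pthread_mutex_lock(&q->mu);
    q->items[q->tail] = item;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->nonempty);    /* wake one sleeping reader */
    pthread_mutex_unlock(&q->mu);
}

void *queue_pop(queue_t *q)               /* many readers */
{
    pthread_mutex_lock(&q->mu);
    while (q->count == 0)
        pthread_cond_wait(&q->nonempty, &q->mu);  /* sleep, no spinning */
    void *item = q->items[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_mutex_unlock(&q->mu);
    return item;
}
```

The mutex is only held for a few pointer updates, and readers with nothing to do cost zero CPU, which is the point being made above.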
Quote:
If using these atomic operations isn't going to impact throughput, then great they solve the problem, but even that seems like overkill when I only need to ensure that a handful of statements are executed in a certain order.
Using spinlocks to handle a work queue, now that's overkill.

Last edited by Corona688; 06-17-2010 at 06:07 PM..
#7  06-17-2010
gorga
Quote:
Your readers could spend a lot of time scanning a mostly-empty list, and your writer could spend a lot of time scanning a mostly-full one.
I've accounted for that possibility in the elaborate design of the "task-queues", (trust me on this).

Quote:
That's a lot of time wasted, you'd be much better off just using a normal one-writer-many-reader queue. It won't block when there's lots of work, and will actually put threads to sleep when there's none, instead of everything spinlocking forever.
The "thread-pool" is customised in design, dedicated to the application sitting atop it. This application generates a high number of tasks and possesses much implicit parallelism. I originally used it with existing thread-pools, but found I needed more control over the allocation of "tasks" to "cores", hence I'm making my own. When there is no work in the pool it means there is no work in the application, so that's okay.

After reading your post, I think I'll run with the atomics, see how that fares. Thanks for your help (again).