Memory Barriers for (Ubuntu) Linux (i686)


 
# 8  
Old 06-17-2010
P.S. On a two-core single-CPU system, the overhead of XCHG vs LOCK XCHG with five separate processes:

Code:
$ ./a.out & ./a.out & ./a.out & ./a.out & ./a.out &
12225 !Lock     time = 0 M 8 S 657 ms 205 us = 0.116 Mops/s
12229 !Lock     time = 0 M 8 S 801 ms 676 us = 0.114 Mops/s
12227 !Lock     time = 0 M 8 S 896 ms 459 us = 0.112 Mops/s
12228 !Lock     time = 0 M 8 S 958 ms 739 us = 0.112 Mops/s
12226 !Lock     time = 0 M 9 S 157 ms 723 us = 0.109 Mops/s
12228 Lock      time = 0 M 8 S 610 ms 749 us = 0.116 Mops/s
12227 Lock      time = 0 M 8 S 719 ms 860 us = 0.115 Mops/s
12225 Lock      time = 0 M 9 S 49 ms 622 us = 0.111 Mops/s
12226 Lock      time = 0 M 8 S 608 ms 304 us = 0.116 Mops/s
12229 Lock      time = 0 M 9 S 48 ms 352 us = 0.111 Mops/s

The code is a million loops of this:
Code:
                        "LOOP1:                 \n"
                        "       xchg    %ebx, a \n"
                        "       xchg    %ebx, a \n"
                        "       xchg    %ebx, a \n"
                        "       xchg    %ebx, a \n"
                        "       xchg    %ebx, a \n"
                        "       loop    LOOP1   \n"

except run once with LOCK XCHG and once with plain XCHG. No significant difference.
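For anyone wanting to reproduce it, here is a rough, compilable sketch of such a harness; the timing helper, loop count and output format are illustrative, not the exact code that produced the numbers above:
Code:
/* bench.c -- rough sketch of an XCHG vs LOCK XCHG timing harness.
 * Not the exact program that produced the numbers above.
 * Build 32-bit, non-PIC:  gcc -m32 -O2 bench.c -o a.out
 */
#include <stdio.h>
#include <sys/time.h>

int a = 0;                              /* the word the XCHG targets */

static double now(void)                 /* wall-clock seconds */
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    const long loops = 1000000;
    double t0, t1;

    /* Plain XCHG.  Note that XCHG with a memory operand is implicitly
     * locked on x86 anyway, which is presumably why the LOCK prefix
     * makes no measurable difference. */
    t0 = now();
    __asm__ __volatile__(
        "       movl    %0, %%ecx       \n"
        "1:     xchg    %%ebx, a        \n"
        "       loop    1b              \n"
        : : "r"(loops) : "ecx", "ebx", "memory");
    t1 = now();
    printf("!Lock time = %f s = %.3f Mops/s\n", t1 - t0, loops / (t1 - t0) / 1e6);

    /* Explicit LOCK XCHG. */
    t0 = now();
    __asm__ __volatile__(
        "       movl    %0, %%ecx       \n"
        "2:     lock xchg %%ebx, a      \n"
        "       loop    2b              \n"
        : : "r"(loops) : "ecx", "ebx", "memory");
    t1 = now();
    printf("Lock  time = %f s = %.3f Mops/s\n", t1 - t0, loops / (t1 - t0) / 1e6);
    return 0;
}
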

---------- Post updated at 03:25 PM ---------- Previous update was at 03:14 PM ----------

Quote:
Originally Posted by gorga
I originally used it with existing thread-pools, but found I needed more control over the allocation of "tasks" to "cores"
How so?
Quote:
hence I'm making my own.
I don't see how using a different structure excludes pthreads. You wanted to avoid pthreads because it uses atomic ops, yet you're prepared to use atomic ops yourself instead? It's best to write portably if possible anyway.

Last edited by Corona688; 06-17-2010 at 06:42 PM..
# 9  
Old 06-17-2010
Quote:
Originally Posted by Corona688
No significant difference.


Looks like you're right, judging by those results. But if I ran it on 4/8/16/32 etc. cores, would it still be the case? I have a 4-core machine at work; I'll try it tomorrow. Although if LOCK just causes a "re-ordering" of bus access, I suppose theoretically it should impact throughput.

Quote:
How so?
I want to control which queue, and ultimately which pthread, runs which task, based on the fact that in the upper application some tasks communicate very frequently and some never. Also, there is much scope for assigning equal workloads across cores (think of an n-ary tree structure where the n paths are of equal length and communication is restricted to nodes of the same path). I looked at Threading Building Blocks "tasks" at first, but found them too blunt a tool for what I want.

Quote:
I don't see how using a different structure excludes pthreads. You wanted to avoid pthreads since it used atomic ops, and are prepared to use atomic ops instead?
Not really. I originally wanted to use pthreads, but they didn't offer the sheer number of threads and the "lightweightness" I needed (on the order of tens of thousands, many of them short-lived), while user-level threads like GNU threads don't offer multi-core exploitation because the kernel isn't involved. So what I've done, with some inspiration from "protothreads", is provide an abstraction on top of pthreads which gives me what I need in the form of lightweight tasks.

I wanted to avoid the pthreads synchronisation structures like mutexes, to sidestep their overhead and keep things scalable. There are ways to distribute work such that mutexes aren't necessary, as long as an ordering of instructions can be guaranteed; hence, following your advice, I'll try those atomic instructions from GCC.
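For the record, the GCC builtins in question look roughly like this (GCC 4.1 or later); the "pending tasks" counter and the hand-off scheme below are only an illustration, not my actual design:
Code:
/* Sketch of the __sync atomic builtins (GCC >= 4.1).  The "pending
 * tasks" counter here is illustrative only. */
#include <stdio.h>

static volatile int pending = 0;        /* tasks published but not yet taken */

void publish_task(void)
{
    __sync_synchronize();               /* full barrier: task data visible first */
    __sync_fetch_and_add(&pending, 1);  /* atomic increment */
}

int try_take_task(void)
{
    int old = pending;
    /* Atomically decrement only if there is something to take. */
    if (old > 0 && __sync_bool_compare_and_swap(&pending, old, old - 1))
        return 1;                       /* claimed one task */
    return 0;                           /* empty, or lost a race; caller retries */
}

int main(void)
{
    publish_task();
    printf("took = %d\n", try_take_task());
    return 0;
}
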

thanks!
# 10  
Old 06-18-2010
Quote:
but user-level threads like GNU threads don't offer multi-core exploitation because the kernel isn't involved.
Huh? NPTL is a 1:1 threading model. NGPT is an M:N threading model. Of course the kernel is involved.
# 11  
Old 06-18-2010
Quote:
Originally Posted by fpmurphy
Huh? NPTL is a 1:1 threading model. NGPT is an M:N threading model. Of course the kernel is involved.
I was referring to the older GNU Portable Threads...

GNU Pth - The GNU Portable Threads

"Pth doesn't require any kernel support, but can NOT benefit from multiprocessor machines."

Hmm...

"In practice, this is no problem, because multiprocessor systems are rare, and portability is almost more important than highest concurrency."

(Written some time ago I presume)

You're referring to Next Generation POSIX Threads, developed by IBM, though. I'm not sure, but wasn't this abandoned in favour of the simpler 1:1 threading in NPTL? Anyway, I needed more control than a standard POSIX API has on offer.
# 12  
Old 06-18-2010
The old system (usually known as LinuxThreads) has been abandoned for many moons now. It was designed to operate without modifying the kernel, which made it a very strange beast -- it created pretend-thread processes with 100% shared memory, did all communication with signal traps, and had some very... unique bugs that turned out to be fundamental design flaws (zombie threads! Wow!)

So, no. It was not a high-performance threading model. There are only a very few things (like uClibc in embedded systems) that still stick with it these days.

With kernel 2.5 and later, they built enough things into the kernel to let them do threading properly. I've shown you bits of its code -- some fundamental things are almost down to the instruction level. NPTL is much, much better, and I've found it quite good.

Quote:
I wanted to avoid the pthreads synchronisation structures like mutexes, to sidestep their overhead and keep things scalable.
Yes, and the overhead you were worried about comes from the same atomic operations you're hellbent on using now. I've looked in its code and shown you some of it; it's not bloated.
Quote:
There are ways to distribute work such that mutexes aren't necessary, as long as an ordering of instructions can be guaranteed; hence, following your advice, I'll try those atomic instructions from GCC.
I remain stolidly unconvinced that spinlocking is more efficient than blocking. If your writer really can keep up with your readers, a proper queue might not block at all even in pthreads.
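To illustrate that point, here is a minimal pthread queue along those lines; the fixed-size ring and the names are assumptions, but the mutex is only ever held for a few instructions and pthread_cond_wait is reached only when the queue is actually empty:
Code:
/* Sketch: a pthread work queue that sleeps only when empty.
 * Error handling and the queue-full check are omitted for brevity. */
#include <pthread.h>

#define QSIZE 1024

static void *slots[QSIZE];
static int head = 0, tail = 0, count = 0;
static pthread_mutex_t qlock    = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;

void queue_push(void *task)
{
    pthread_mutex_lock(&qlock);         /* uncontended case: one atomic op */
    slots[tail] = task;
    tail = (tail + 1) % QSIZE;
    count++;
    pthread_cond_signal(&nonempty);     /* cheap if nobody is asleep */
    pthread_mutex_unlock(&qlock);
}

void *queue_pop(void)
{
    void *task;
    pthread_mutex_lock(&qlock);
    while (count == 0)                  /* block only when there is no work */
        pthread_cond_wait(&nonempty, &qlock);
    task = slots[head];
    head = (head + 1) % QSIZE;
    count--;
    pthread_mutex_unlock(&qlock);
    return task;
}

int main(void)
{
    int x = 42;
    queue_push(&x);
    return *(int *)queue_pop() == 42 ? 0 : 1;
}
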
Quote:
But anyway, I needed something with more control than was on offer with something featuring a standard posix api.
Are you sure of that? You only discovered thread-specific data last week.

Last edited by Corona688; 06-18-2010 at 02:42 PM..
# 13  
Old 06-18-2010
Quote:
You're referring to Next Generation Posix Threads developed by IBM though. I'm not sure but wasn't this abandoned in favour of simpler 1:1 threading in NPTL? But anyway, I needed something with more control than was on offer with something featuring a standard posix api.
NGPT was developed by a small team which included some people from IBM, Bill Abt for example. The work was funded by IBM, so IBM got the kudos. Yes, NGPT was shelved at v2.2 in favor of NPTL. For a while we looked at scheduler activations (see New GNU thread library) but then moved away from that concept towards a 1:1 model as proposed by Ingo Molnar. NPTL has been proven in numerous benchmarks to satisfactorily scale to tens of thousands of threads of execution.

Frankly, if you wish to scale to tens of thousands of threads, you have lots more problems to worry about than what you have discussed so far. See The C10K problem, for example.

BTW, an alternative approach to scaling your application might be to use something like CUDA and offload the application to a GPU. Huge performance gains can be achieved if you can solve the parallel programming issues. Another approach might be to use a multithreaded parallel programming language such as Intel's Cilk.
# 14  
Old 06-18-2010
Quote:
Originally Posted by Corona688
Yes, and the overhead you were worried about comes from the same atomic operations you're hellbent on using now.
But with those atomic operations I can avoid sending threads to sleep and having to wake them up.
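Something like the following is what I mean; a hedged sketch only (the flag and the claim step are illustrative, not my real pool), showing a reader that polls instead of parking in the kernel:
Code:
/* Sketch of spin-waiting on an atomic flag instead of sleeping.
 * Illustrative only; a real pool would spin on its task queue. */
#include <stdio.h>

static volatile int task_ready = 0;

void post_task(void)
{
    __sync_synchronize();       /* make the task's data visible first */
    task_ready = 1;             /* aligned int store is atomic on x86 */
}

int wait_for_task(void)
{
    for (;;) {
        while (!task_ready)                     /* cheap read-only spin */
            __asm__ __volatile__("pause");      /* hyper-threading hint */
        if (__sync_bool_compare_and_swap(&task_ready, 1, 0))
            return 1;                           /* claimed the work */
    }
}

int main(void)
{
    post_task();
    printf("claimed = %d\n", wait_for_task());
    return 0;
}
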

Quote:
I remain stolidly unconvinced that spinlocking is more efficient than blocking.
I would agree if no other progress were being made, but to reiterate, the thread-pool I'm building is for dedicated use with a custom application which features a high number of tasks. 95% of the time I would expect my threads to have tasks to execute, and plenty of them. The remaining 5% would be a situation where there's no work to do, so it's not a problem.

Quote:
Are you sure of that? You only discovered thread-specific data last week.
Pretty sure; I've already built prototypes of my system in Erlang, and I found a need for such control.

---------- Post updated at 12:41 AM ---------- Previous update was at 12:30 AM ----------

Quote:
Originally Posted by fpmurphy
NPTL has been proven in numerous benchmarks to satisfactorily scale to tens of thousands of threads of execution.
Are you saying it's possible to create tens of thousands of pthreads on Linux? I tried this some time ago and couldn't generate anywhere near that many. Doesn't the kernel impose strict limits on the number of threads that can be created anyway?
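A small experiment along these lines, for anyone curious; the 64 KiB stack size is an arbitrary illustrative choice, and on 32-bit the usual practical ceiling is address space for thread stacks rather than a hard thread cap:
Code:
/* Sketch: how many threads can actually be created?  On i686 the usual
 * ceiling is address space for stacks (default ~8 MiB each), not a hard
 * kernel cap, though /proc/sys/kernel/threads-max and RLIMIT_NPROC also
 * apply.  The 64 KiB stack size below is an arbitrary illustrative value. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg)
{
    (void)arg;
    for (;;)
        sleep(60);              /* idle; just occupy a thread slot */
    return NULL;                /* not reached */
}

int main(void)
{
    pthread_attr_t attr;
    pthread_t tid;
    long n = 0;

    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 64 * 1024);   /* shrink from the default */

    while (pthread_create(&tid, &attr, worker, NULL) == 0)
        n++;

    printf("created %ld threads before hitting a limit\n", n);
    return 0;
}
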

Quote:
Frankly if you wish to scale to tens of thousands of threads you have lots more problems to worry about than what you have discussed so far. See The C10K problem for example.
Thanks, I have encountered similar problems. But to explain: in the "thread-pool" I'm not working at the abstraction level of pthreads, but rather with very lightweight tasks of execution that involve little more than a function call, some memory, and no context switching. Something along the lines of protothreads. The pthreads underneath simply run these tasks in a continuous loop.
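Roughly this shape, in other words; a sketch only, with a toy task list standing in for the real queues:
Code:
/* Sketch of a worker pthread running lightweight tasks in a loop.
 * A "task" is little more than a function pointer plus some state;
 * the toy task array stands in for the real per-core queues. */
#include <pthread.h>
#include <stdio.h>

struct task {
    void (*run)(void *state);       /* task body: runs to completion */
    void *state;                    /* its private memory, no own stack */
};

static void say(void *s) { printf("%s\n", (const char *)s); }

static struct task demo[] = {
    { say, (void *)"task one" },
    { say, (void *)"task two" },
};
static int next_task = 0;

static struct task *take_task(void)    /* the real version would be a queue */
{
    if (next_task < (int)(sizeof demo / sizeof demo[0]))
        return &demo[next_task++];
    return NULL;
}

static void *worker_loop(void *arg)
{
    (void)arg;
    struct task *t;
    while ((t = take_task()) != NULL)   /* real loop would spin or block */
        t->run(t->state);               /* plain call: no context switch */
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, worker_loop, NULL);
    pthread_join(tid, NULL);
    return 0;
}
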

Quote:
BTW, an alternative approach to scaling your application might be use something like CUDA and offload the application to a GPU.
Thanks, I have examined CUDA, OpenMP, Threading Building Blocks, etc., but the nature of what I'm doing involves an expansion of state, whereas these libraries typically feature a mapping of many parallel elements onto parallel resources, assigning a "thread" of execution to each iteration of a loop, for example.

As mentioned in the previous post, I began my work with Erlang, which offers tens of thousands of lightweight processes communicating via message passing, but I found I needed more control than was on offer.