Memory Barriers for (Ubuntu) Linux (i686)


 
# 8  
Old 06-17-2010
P.S. On a two-core single-CPU system, the overhead of XCHG vs LOCK XCHG with five separate processes:

Code:
$ ./a.out & ./a.out & ./a.out & ./a.out & ./a.out &
12225 !Lock     time = 0 M 8 S 657 ms 205 us = 0.116 Mops/s
12229 !Lock     time = 0 M 8 S 801 ms 676 us = 0.114 Mops/s
12227 !Lock     time = 0 M 8 S 896 ms 459 us = 0.112 Mops/s
12228 !Lock     time = 0 M 8 S 958 ms 739 us = 0.112 Mops/s
12226 !Lock     time = 0 M 9 S 157 ms 723 us = 0.109 Mops/s
12228 Lock      time = 0 M 8 S 610 ms 749 us = 0.116 Mops/s
12227 Lock      time = 0 M 8 S 719 ms 860 us = 0.115 Mops/s
12225 Lock      time = 0 M 9 S 49 ms 622 us = 0.111 Mops/s
12226 Lock      time = 0 M 8 S 608 ms 304 us = 0.116 Mops/s
12229 Lock      time = 0 M 9 S 48 ms 352 us = 0.111 Mops/s

The code is a million loops of this:
Code:
                        "LOOP1:                 \n"
                        "       xchg    %ebx, a \n"
                        "       xchg    %ebx, a \n"
                        "       xchg    %ebx, a \n"
                        "       xchg    %ebx, a \n"
                        "       xchg    %ebx, a \n"
                        "       loop    LOOP1   \n"

except run once with LOCK XCHG and once with plain XCHG. No significant difference.
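For anyone wanting to reproduce it, here is a rough, compilable sketch of such a harness; the timing helper, loop count and output format are illustrative, not the exact code that produced the numbers above:
Code:
/* bench.c -- rough sketch of an XCHG vs LOCK XCHG timing harness.
 * Not the exact program that produced the numbers above.
 * Build 32-bit, non-PIC:  gcc -m32 -O2 bench.c -o a.out
 */
#include <stdio.h>
#include <sys/time.h>

int a = 0;                              /* the word the XCHG targets */

static double now(void)                 /* wall-clock seconds */
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    const long loops = 1000000;
    double t0, t1;

    /* Plain XCHG.  Note that XCHG with a memory operand is implicitly
     * locked on x86 anyway, which is presumably why the LOCK prefix
     * makes no measurable difference. */
    t0 = now();
    __asm__ __volatile__(
        "       movl    %0, %%ecx       \n"
        "1:     xchg    %%ebx, a        \n"
        "       loop    1b              \n"
        : : "r"(loops) : "ecx", "ebx", "memory");
    t1 = now();
    printf("!Lock time = %f s = %.3f Mops/s\n", t1 - t0, loops / (t1 - t0) / 1e6);

    /* Explicit LOCK XCHG. */
    t0 = now();
    __asm__ __volatile__(
        "       movl    %0, %%ecx       \n"
        "2:     lock xchg %%ebx, a      \n"
        "       loop    2b              \n"
        : : "r"(loops) : "ecx", "ebx", "memory");
    t1 = now();
    printf("Lock  time = %f s = %.3f Mops/s\n", t1 - t0, loops / (t1 - t0) / 1e6);
    return 0;
}
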

---------- Post updated at 03:25 PM ---------- Previous update was at 03:14 PM ----------

Quote:
Originally Posted by gorga
I originally used it with existing thread-pools, but found I needed more control over the allocation of "tasks" to "cores"
How so?
Quote:
hence I'm making my own.
I don't see how using a different structure excludes pthreads. You wanted to avoid pthreads because it uses atomic ops, yet you're prepared to use atomic ops yourself instead? It's best to write portably if possible anyway.

Last edited by Corona688; 06-17-2010 at 06:42 PM..
# 9  
Old 06-17-2010
Quote:
Originally Posted by Corona688
No significant difference.


Looks like you're right, judging by those results. But if I ran it on 4/8/16/32 etc. cores, would it still be the case? I have a 4-core machine at work; I'll try it tomorrow. Although if LOCK just causes a "re-ordering" of bus access, I suppose theoretically it should impact throughput.

Quote:
How so?
I want to control which queue, and ultimately which pthread, runs which task, based on the fact that in the upper application some tasks communicate very frequently and some never. Also, there is much scope for assigning equal workloads across cores (think of an n-ary tree structure where the n paths are of equal length and communication is restricted to nodes of the same path). I looked at Threading Building Blocks "tasks" at first, but found them too blunt a tool for what I want.

Quote:
I don't see how using a different structure excludes pthreads. You wanted to avoid pthreads since it used atomic ops, and are prepared to use atomic ops instead?
Not really. I originally wanted to use pthreads, but they didn't offer the sheer number of threads and the "lightweightness" I needed (on the order of tens of thousands, many of them short-lived), while user-level threads like GNU threads don't offer multi-core exploitation because the kernel isn't involved. So what I've done, with some inspiration from "protothreads", is provide an abstraction on top of pthreads which gives me what I need in the form of lightweight tasks.

I wanted to avoid the pthreads synchronisation structures like mutexes, to sidestep their overhead and keep things scalable. There are ways to distribute work such that mutexes aren't necessary, as long as an ordering of instructions can be guaranteed; hence, following your advice, I'll try those atomic instructions from GCC.
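For the record, the GCC builtins in question look roughly like this (GCC 4.1 or later); the "pending tasks" counter and the hand-off scheme below are only an illustration, not my actual design:
Code:
/* Sketch of the __sync atomic builtins (GCC >= 4.1).  The "pending
 * tasks" counter here is illustrative only. */
#include <stdio.h>

static volatile int pending = 0;        /* tasks published but not yet taken */

void publish_task(void)
{
    __sync_synchronize();               /* full barrier: task data visible first */
    __sync_fetch_and_add(&pending, 1);  /* atomic increment */
}

int try_take_task(void)
{
    int old = pending;
    /* Atomically decrement only if there is something to take. */
    if (old > 0 && __sync_bool_compare_and_swap(&pending, old, old - 1))
        return 1;                       /* claimed one task */
    return 0;                           /* empty, or lost a race; caller retries */
}

int main(void)
{
    publish_task();
    printf("took = %d\n", try_take_task());
    return 0;
}
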

thanks!
# 10  
Old 06-18-2010
Quote:
but user-level threads like GNU threads don't offer multi-core exploitation because the kernel isn't involved.
Huh? NPTL is a 1:1 threading model. NGPT is an M:N threading model. Of course the kernel is involved.
# 11  
Old 06-18-2010
Quote:
Originally Posted by fpmurphy
Huh? NPTL is a 1:1 threading model. NGPT is an M:N threading model. Of course the kernel is involved.
I was referring to the older GNU Portable Threads...

GNU Pth - The GNU Portable Threads

"Pth doesn't require any kernel support, but can NOT benefit from multiprocessor machines."

Hmm...

"In practice, this is no problem, because multiprocessor systems are rare, and portability is almost more important than highest concurrency."

(Written some time ago I presume)

You're referring to Next Generation POSIX Threads, developed by IBM, though. I'm not sure, but wasn't this abandoned in favour of the simpler 1:1 threading in NPTL? Anyway, I needed more control than a standard POSIX API has on offer.
# 12  
Old 06-18-2010
The old system (usually known as LinuxThreads) has been abandoned for many moons now. It was designed to operate without modifying the kernel, which made it a very strange beast -- it created pretend-thread processes with 100% shared memory, did all communication with signal traps, and had some very... unique bugs that turned out to be fundamental design flaws (zombie threads! Wow!)

So, no. It was not a high-performance threading model. There are only a very few things (like uClibc in embedded systems) that still stick with it these days.

With kernel 2.5 and later, they built enough things into the kernel to let them do threading properly. I've shown you bits of its code -- some fundamental things are almost down to the instruction level. NPTL is much, much better, and I've found it quite good.

Quote:
I wanted to avoid the pthreads synchronisation structures like mutexes, to sidestep their overhead and keep things scalable.
Yes, and the overhead you were worried about comes from the same atomic operations you're hellbent on using now. I've looked in its code and shown you some of it; it's not bloated.
Quote:
There are ways to distribute work such that mutexes aren't necessary, as long as an ordering of instructions can be guaranteed; hence, following your advice, I'll try those atomic instructions from GCC.
I remain stolidly unconvinced that spinlocking is more efficient than blocking. If your writer really can keep up with your readers, a proper queue might not block at all even in pthreads.
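To illustrate that point, here is a minimal pthread queue along those lines; the fixed-size ring and the names are assumptions, but the mutex is only ever held for a few instructions and pthread_cond_wait is reached only when the queue is actually empty:
Code:
/* Sketch: a pthread work queue that sleeps only when empty.
 * Error handling and the queue-full check are omitted for brevity. */
#include <pthread.h>

#define QSIZE 1024

static void *slots[QSIZE];
static int head = 0, tail = 0, count = 0;
static pthread_mutex_t qlock    = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;

void queue_push(void *task)
{
    pthread_mutex_lock(&qlock);         /* uncontended case: one atomic op */
    slots[tail] = task;
    tail = (tail + 1) % QSIZE;
    count++;
    pthread_cond_signal(&nonempty);     /* cheap if nobody is asleep */
    pthread_mutex_unlock(&qlock);
}

void *queue_pop(void)
{
    void *task;
    pthread_mutex_lock(&qlock);
    while (count == 0)                  /* block only when there is no work */
        pthread_cond_wait(&nonempty, &qlock);
    task = slots[head];
    head = (head + 1) % QSIZE;
    count--;
    pthread_mutex_unlock(&qlock);
    return task;
}

int main(void)
{
    int x = 42;
    queue_push(&x);
    return *(int *)queue_pop() == 42 ? 0 : 1;
}
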
Quote:
But anyway, I needed something with more control than was on offer with something featuring a standard posix api.
Are you sure of that? You only discovered thread-specific data last week.

Last edited by Corona688; 06-18-2010 at 02:42 PM..
# 13  
Old 06-18-2010
Quote:
You're referring to Next Generation Posix Threads developed by IBM though. I'm not sure but wasn't this abandoned in favour of simpler 1:1 threading in NPTL? But anyway, I needed something with more control than was on offer with something featuring a standard posix api.
NGPT was developed by a small team which included some people from IBM, Bill Abt for example. The work was funded by IBM, so IBM got the kudos. Yes, NGPT was shelved at v2.2 in favor of NPTL. For a while we looked at scheduler activations (see New GNU thread library) but then moved away from that concept towards a 1:1 model as proposed by Ingo Molnar. NPTL has been proven in numerous benchmarks to satisfactorily scale to tens of thousands of threads of execution.

Frankly, if you wish to scale to tens of thousands of threads, you have lots more problems to worry about than what you have discussed so far. See The C10K problem, for example.

BTW, an alternative approach to scaling your application might be to use something like CUDA and offload the application to a GPU. Huge performance gains can be achieved if you can solve the parallel programming issues. Another approach might be to use a multithreaded parallel programming language such as Intel's Cilk.
# 14  
Old 06-18-2010
Quote:
Originally Posted by Corona688
Yes, and the overhead you were worried about comes from the same atomic operations you're hellbent on using now.
But with those atomic operations I can avoid sending threads to sleep and having to wake them up.
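Something like the following is what I mean; a hedged sketch only (the flag and the claim step are illustrative, not my real pool), showing a reader that polls instead of parking in the kernel:
Code:
/* Sketch of spin-waiting on an atomic flag instead of sleeping.
 * Illustrative only; a real pool would spin on its task queue. */
#include <stdio.h>

static volatile int task_ready = 0;

void post_task(void)
{
    __sync_synchronize();       /* make the task's data visible first */
    task_ready = 1;             /* aligned int store is atomic on x86 */
}

int wait_for_task(void)
{
    for (;;) {
        while (!task_ready)                     /* cheap read-only spin */
            __asm__ __volatile__("pause");      /* hyper-threading hint */
        if (__sync_bool_compare_and_swap(&task_ready, 1, 0))
            return 1;                           /* claimed the work */
    }
}

int main(void)
{
    post_task();
    printf("claimed = %d\n", wait_for_task());
    return 0;
}
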

Quote:
I remain stolidly unconvinced that spinlocking is more efficient than blocking.
I would agree if no other progress were being made, but to reiterate, the thread-pool I'm building is for dedicated use with a custom application which features a high number of tasks. 95% of the time I would expect my threads to have tasks to execute, and plenty of them. The remaining 5% would be a situation where there's no work to do, so it's not a problem.

Quote:
Are you sure of that? You only discovered thread-specific data last week.
Pretty sure; I've already built prototypes of my system in Erlang, and I found a need for such control.

---------- Post updated at 12:41 AM ---------- Previous update was at 12:30 AM ----------

Quote:
Originally Posted by fpmurphy
NPTL has been proven in numerous benchmarks to satisfactorily scale to tens of thousands of threads of execution.
Are you saying it's possible to create tens of thousands of pthreads on Linux? I tried this some time ago and couldn't generate anywhere near that many. Doesn't the kernel impose strict limits on the number of threads that can be created anyway?
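A small experiment along these lines, for anyone curious; the 64 KiB stack size is an arbitrary illustrative choice, and on 32-bit the usual practical ceiling is address space for thread stacks rather than a hard thread cap:
Code:
/* Sketch: how many threads can actually be created?  On i686 the usual
 * ceiling is address space for stacks (default ~8 MiB each), not a hard
 * kernel cap, though /proc/sys/kernel/threads-max and RLIMIT_NPROC also
 * apply.  The 64 KiB stack size below is an arbitrary illustrative value. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg)
{
    (void)arg;
    for (;;)
        sleep(60);              /* idle; just occupy a thread slot */
    return NULL;                /* not reached */
}

int main(void)
{
    pthread_attr_t attr;
    pthread_t tid;
    long n = 0;

    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 64 * 1024);   /* shrink from the default */

    while (pthread_create(&tid, &attr, worker, NULL) == 0)
        n++;

    printf("created %ld threads before hitting a limit\n", n);
    return 0;
}
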

Quote:
Frankly if you wish to scale to tens of thousands of threads you have lots more problems to worry about than what you have discussed so far. See The C10K problem for example.
Thanks, I have encountered similar problems. But to explain: in the "thread-pool" I'm not working at the abstraction level of pthreads, but rather with very lightweight tasks of execution that involve little more than a function call, some memory, and no context switching. Something along the lines of protothreads. The pthreads underneath simply run these tasks in a continuous loop.
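Roughly this shape, in other words; a sketch only, with a toy task list standing in for the real queues:
Code:
/* Sketch of a worker pthread running lightweight tasks in a loop.
 * A "task" is little more than a function pointer plus some state;
 * the toy task array stands in for the real per-core queues. */
#include <pthread.h>
#include <stdio.h>

struct task {
    void (*run)(void *state);       /* task body: runs to completion */
    void *state;                    /* its private memory, no own stack */
};

static void say(void *s) { printf("%s\n", (const char *)s); }

static struct task demo[] = {
    { say, (void *)"task one" },
    { say, (void *)"task two" },
};
static int next_task = 0;

static struct task *take_task(void)    /* the real version would be a queue */
{
    if (next_task < (int)(sizeof demo / sizeof demo[0]))
        return &demo[next_task++];
    return NULL;
}

static void *worker_loop(void *arg)
{
    (void)arg;
    struct task *t;
    while ((t = take_task()) != NULL)   /* real loop would spin or block */
        t->run(t->state);               /* plain call: no context switch */
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, worker_loop, NULL);
    pthread_join(tid, NULL);
    return 0;
}
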

Quote:
BTW, an alternative approach to scaling your application might be use something like CUDA and offload the application to a GPU.
Thanks, I have examined CUDA, OpenMP, Threading Building Blocks, etc., but the nature of what I'm doing involves an expansion of state, whereas these libraries typically feature a mapping of many parallel elements onto parallel resources, assigning a "thread" of execution to each iteration of a loop, for example.

As mentioned in the previous post, I began my work with Erlang, which offers tens of thousands of lightweight processes communicating via message passing, but I found I needed more control than was on offer.