Unix/Linux Go Back    


High Performance Computing Message Passing Interface (MPI) programming and tuning, MPI library installation and management, parallel administration tools, cluster monitoring, cluster optimization, and more HPC topics.

Memory Barriers for (Ubuntu) Linux (i686)

High Performance Computing


Closed    
 
Thread Tools Search this Thread Display Modes
    #8  
Old Unix and Linux 06-17-2010
Corona688 Corona688 is offline Forum Staff  
Mead Rotor
 
Join Date: Aug 2005
Last Activity: 18 August 2017, 6:48 PM EDT
Location: Saskatchewan
Posts: 22,336
Thanks: 1,116
Thanked 4,216 Times in 3,899 Posts
P.S. On a two-core single-CPU system, the overhead of XCHG vs LOCK XCHG with five seperate processes:


Code:
$ ./a.out & ./a.out & ./a.out & ./a.out & ./a.out &
12225 !Lock     time = 0 M 8 S 657 ms 205 us = 0.116 Mops/s
12229 !Lock     time = 0 M 8 S 801 ms 676 us = 0.114 Mops/s
12227 !Lock     time = 0 M 8 S 896 ms 459 us = 0.112 Mops/s
12228 !Lock     time = 0 M 8 S 958 ms 739 us = 0.112 Mops/s
12226 !Lock     time = 0 M 9 S 157 ms 723 us = 0.109 Mops/s
12228 Lock      time = 0 M 8 S 610 ms 749 us = 0.116 Mops/s
12227 Lock      time = 0 M 8 S 719 ms 860 us = 0.115 Mops/s
12225 Lock      time = 0 M 9 S 49 ms 622 us = 0.111 Mops/s
12226 Lock      time = 0 M 8 S 608 ms 304 us = 0.116 Mops/s
12229 Lock      time = 0 M 9 S 48 ms 352 us = 0.111 Mops/s

The code is a million loops of this:
Code:
                        "LOOP1:                 \n"
                        "       xchg    %ebx, a \n"
                        "       xchg    %ebx, a \n"
                        "       xchg    %ebx, a \n"
                        "       xchg    %ebx, a \n"
                        "       xchg    %ebx, a \n"
                        "       loop    LOOP1   \n"

except once with LOCK XCHG, one with just XCHG. No significant difference.

---------- Post updated at 03:25 PM ---------- Previous update was at 03:14 PM ----------

Quote:
Originally Posted by gorga View Post
I originally used it with existing thread-pools, but found I needed more control over the allocation of "tasks" to "cores"
How so?
Quote:
hence I'm making my own.
I don't see how using a different structure excludes pthreads. You wanted to avoid pthreads since it used atomic ops, and are prepared to use atomic ops instead? It's best to write portably if possible anyway.

Last edited by Corona688; 06-17-2010 at 05:42 PM..
Sponsored Links
    #9  
Old Unix and Linux 06-17-2010
gorga gorga is offline
Registered User
 
Join Date: Jun 2010
Last Activity: 21 June 2010, 10:45 AM EDT
Posts: 18
Thanks: 3
Thanked 0 Times in 0 Posts
Quote:
Originally Posted by Corona688 View Post
No significant difference.


Looks like you're right judging by those results. But if I ran it on 4/8/16/32 etc cores, would it still be the case? I have 4-core at work, I'll try it tomoro. Although if LOCK just causes a "re-ordering" of bus-access I suppose theoretically it should impact throughput.

Quote:
How so?
I want to control which queue, and ultimately which pthread runs which task, based on the fact that in the upper application, some tasks communicate very frequently and some never. Also, there is much scope for assigning equal work loads across cores, (think of a n-ary tree structure where the n-paths are of equal length and communication is restricted to nodes of the same path.) I looked at Threading Building Blocks "tasks" at first, but found it too blunt a tool for what I want.

Quote:
I don't see how using a different structure excludes pthreads. You wanted to avoid pthreads since it used atomic ops, and are prepared to use atomic ops instead?
Not really, I originally wanted to use pthreads but they didn't offer the high number of threads and "lightweightness" I needed, (in the order of 10s of 1000s, with many short-lived threads), but user-level threads like GNU threads don't offer multi-core exploitation because the kernel isn't involved. So what I've done, with some inspiration from "protothreads", is provide an abstraction on top of pthreads which provides what I need in the form of lightweight tasks.

I wanted to avoid the pthreads syncrhonisation structures like mutexes because I sought to avoid their overhead and keep it scalable. There are ways to distribute work such that mutexes aren't necessary as long as an ordering of instructions can be guaranteed, hence following your advice, I'll try those atomic instructions from GCC.

thanks!
Sponsored Links
    #10  
Old Unix and Linux 06-18-2010
fpmurphy's Unix or Linux Image
fpmurphy fpmurphy is offline Forum Advisor  
who?
 
Join Date: Dec 2003
Last Activity: 12 June 2016, 11:03 PM EDT
Location: /dev/ph
Posts: 4,996
Thanks: 73
Thanked 475 Times in 437 Posts
Quote:
but user-level threads like GNU threads don't offer multi-core exploitation because the kernel isn't involved.
Huh? NPTL is a 1:1 threading model. NGPT is an M:N threading model. Of course the kernel is involved.
    #11  
Old Unix and Linux 06-18-2010
gorga gorga is offline
Registered User
 
Join Date: Jun 2010
Last Activity: 21 June 2010, 10:45 AM EDT
Posts: 18
Thanks: 3
Thanked 0 Times in 0 Posts
Quote:
Originally Posted by fpmurphy View Post
Huh? NPTL is a 1:1 threading model. NGPT is an M:N threading model. Of course the kernel is involved.
I was referring to the older GNU Portable Threads...

GNU Pth - The GNU Portable Threads

"Pth doesn't require any kernel support, but can NOT benefit from multiprocessor machines."

Hmm...

"In practice, this is no problem, because multiprocessor systems are rare, and portability is almost more important than highest concurrency."

(Written some time ago I presume)

You're referring to Next Generation Posix Threads developed by IBM though. I'm not sure but wasn't this abandoned in favour of simpler 1:1 threading in NPTL? But anyway, I needed something with more control than was on offer with something featuring a standard posix api.
Sponsored Links
    #12  
Old Unix and Linux 06-18-2010
Corona688 Corona688 is offline Forum Staff  
Mead Rotor
 
Join Date: Aug 2005
Last Activity: 18 August 2017, 6:48 PM EDT
Location: Saskatchewan
Posts: 22,336
Thanks: 1,116
Thanked 4,216 Times in 3,899 Posts
The old system(usually known as linuxthreads) has been abandoned for many moons now. It was designed to operate without modifying the kernel, which made it a very strange beast -- it created pretend-thread processes with 100% shared memory, did all communication with signal traps, and had some very....unique bugs that turned out to be fundamental design flaws(zombie threads! Wow!)

So, no. It was not a high-performance threading model. There's only a very few things(like ulibc in embedded systems) that still stick with it these days.

With kernel 2.5 and later, they built enough things into the kernel to let them do threading properly. I've shown you bits of its code -- some fundamental things are almost down to the instruction level. NPTL is much, much better, and I've found it quite good.

Quote:
I wanted to avoid the pthreads syncrhonisation structures like mutexes because I sought to avoid their overhead and keep it scalable.
Yes, and the overhead you were worried about was the same atomic operations you're hellbent on using now. I've looked in its code and shown you some of it; it's not bloated.
Quote:
There are ways to distribute work such that mutexes aren't necessary as long as an ordering of instructions can be guaranteed, hence following your advice, I'll try those atomic instructions from GCC.
I remain stolidly unconvinced that spinlocking is more efficient than blocking. If your writer really can keep up with your readers, a proper queue might not block at all even in pthreads.
Quote:
But anyway, I needed something with more control than was on offer with something featuring a standard posix api.
Are you sure of that? You only discovered thread-specific data last week.

Last edited by Corona688; 06-18-2010 at 01:42 PM..
Sponsored Links
    #13  
Old Unix and Linux 06-18-2010
fpmurphy's Unix or Linux Image
fpmurphy fpmurphy is offline Forum Advisor  
who?
 
Join Date: Dec 2003
Last Activity: 12 June 2016, 11:03 PM EDT
Location: /dev/ph
Posts: 4,996
Thanks: 73
Thanked 475 Times in 437 Posts
Quote:
You're referring to Next Generation Posix Threads developed by IBM though. I'm not sure but wasn't this abandoned in favour of simpler 1:1 threading in NPTL? But anyway, I needed something with more control than was on offer with something featuring a standard posix api.
NGPT was developed by a small team which included some people - Bill Abt, for example - from IBM. Work was funded by IBM so IBM got the kludos. Yes NGPT was shelved at v2.2 in favor of NPTL. For a while we looked at scheduler activations (see New GNU thread library) but then moved away from that concept towards a 1x1 model as proposed by Ingo Molnar. NPTL has been proven in numerous benchmarks to satisfactorily scale to tens of thousands of threads of execution.

Frankly if you wish to scale to tens of thousands of threads you have lots more problems to worry about than what you have discussed so far. See The C10K problem for example.

BTW, an alternative approach to scaling your application might be use something like CUDA and offload the application to a GPU. Hugh performance gains can be achieved if you can solve the parallel programming issues. Another approach might be to use a multithreaded parallel programming language such as Intel's CILK.
Sponsored Links
    #14  
Old Unix and Linux 06-18-2010
gorga gorga is offline
Registered User
 
Join Date: Jun 2010
Last Activity: 21 June 2010, 10:45 AM EDT
Posts: 18
Thanks: 3
Thanked 0 Times in 0 Posts
Quote:
Originally Posted by Corona688 View Post
Yes, and the overhead you were worried about was the same atomic operations you're hellbent on using now.
But with those atomic operations I can avoid sending threads to sleep and having to wake them up.

Quote:
I remain stolidly unconvinced that spinlocking is more efficient than blocking.
I would agree if no other progress is being made, but to reiterate, the thread-pool I'm building is for dedicated use with a custom application which features high numbers of tasks. 95% of the time I would expect my threads to have tasks to execute, a plenty of them. The remaining 5% would be a situation where there's no work to do, so it's not a problem.

Quote:
Are you sure of that? You only discovered thread-specific data last week.
Pretty sure, I've already built prototypes of my system in Erlang and I found a need for such control.

---------- Post updated at 12:41 AM ---------- Previous update was at 12:30 AM ----------

Quote:
Originally Posted by fpmurphy View Post
NPTL has been proven in numerous benchmarks to satisfactorily scale to tens of thousands of threads of execution.
Are you saying it's possible to create tens of thousands of pthreads on linux? I tried this some time ago and I couldn't generate anywhere near that amount, doesn't the kernel impose strict limits on the number of threads that can be generated anyway?

Quote:
Frankly if you wish to scale to tens of thousands of threads you have lots more problems to worry about than what you have discussed so far. See The C10K problem for example.
Thanks, I have encountered similar problems. But to explain, I'm not working at the abstraction of pthreads in the "thread-pool" but rather very lightweight tasks of execution that involve little more than a function call, some memory and no context switching. Something along the lines of protothreads. The pthreads underneath simply run these tasks in a continous loop.

Quote:
BTW, an alternative approach to scaling your application might be use something like CUDA and offload the application to a GPU.
Thanks, I have examined CUDA and OpenMP, Threading Building Blocks etc, but the nature of what I'm doing involves an expansion of state, whereas these libraries typically feature a mapping of many parallel elements onto parallel resources, assigning a "thread" of execution to each iteration of a loop for example.

As mentioned in the previous post I began my work with Erlang, which offers 10s of 1000s of lightweight processes, communicating via message passing. But I found I needed more control than was on offer.
Sponsored Links
Closed

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Linux More UNIX and Linux Forum Topics You Might Find Helpful
Thread Thread Starter Forum Replies Last Post
i686, x86 64, ppc superblacksmith Linux 3 03-18-2008 08:19 PM
Memory-waste in Ubuntu/Debian? riwa UNIX for Dummies Questions & Answers 0 03-04-2006 04:01 PM



All times are GMT -4. The time now is 09:42 AM.