Multi thread awk command for faster performance


 
# 8  
Old 04-29-2012
sem is part of GNU parallel. You may want to limit the number of jobs running at the same time, so the script does not eat the system when other users want resources:
Code:
for dir in `find /home -type d -name "Folder*"`; do
   sem -j 10 first "$dir"
done
sem --wait


For CPU-intensive operations, sem (part of parallel) can control the number of CPU cores being used. It can also limit access to a resource, like a semaphore, hence the name. This example limits sem to the number of available cores on the system:

Code:
for dir in `find /home -type d -name "Folder*"`
do
  sem -j+0 first "$dir"
done
sem --wait

The sem --wait on the last line waits for all the other sem invocations to complete.

parallel is written in Perl, so it runs on systems with Perl 5.8 or higher.

http://ftp.gnu.org/gnu/parallel/
These 3 Users Gave Thanks to jim mcnamara For This Post:
# 9  
Old 04-29-2012
Your program already does use multiple processes -- the tr happens simultaneously -- but it's a bit of a waste really, since doing it in that fashion isn't faster than doing it inside awk.

awk is hardly a one-trick pony: you can run it once here to replace everything you've been doing by running awk, tr, and echo 10,000 times apiece. Starting small programs over and over carries a large cost, so this should speed things up a lot.

Perhaps something like this:

Code:
awk -v OFS="" 'FNR==1 { printf("\n") }; /<.*/ , /.*<\/.*>/' *.txt

This User Gave Thanks to Corona688 For This Post:
# 10  
Old 04-29-2012
I think that with thousands of files, globbing *.txt is a problem because of ARG_MAX on lots of UNIX platforms. Correct me if I missed something. I thought that was why the OP used find to start with.
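
One way to sidestep the ARG_MAX limit, as a rough sketch: let find hand the file names to awk in batches with -exec ... {} +. The function name flatten_tree is hypothetical, and the awk program is the ORS variant posted later in this thread. awk may be started more than once if the list is long, but FNR==1 still fires at the top of every file.

```shell
# flatten_tree is a hypothetical helper name, not something from the thread.
# It prints the XML of every *.txt file below directory $1, one file per line.
flatten_tree() {
    # find splits the file list into batches that fit within ARG_MAX and
    # runs awk once per batch, not once per file.
    find "$1" -type f -name '*.txt' -exec \
        awk -v ORS="" 'FNR==1 { printf("\n") }; /<.*/ , /.*<\/.*>/' {} +
    # The awk program never ends the last line, so finish it here.
    echo
}
```

Called as flatten_tree /some/dir, it writes an empty first line followed by one line per file, in whatever order find visits the files.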
# 11  
Old 04-29-2012
Actually, the OP didn't use find. I did, to get the list of directories containing the files to be processed :)
# 12  
Old 04-29-2012
Hi.

The purpose of my post was to:

1) clarify the difference between the terms thread and process, and

2) demonstrate that running simultaneous processes can be easy (but may require some tinkering with options, checking times for completion of a set of tasks, etc.).

I didn't know specifically about sem -- thanks Jim. Apparently version 20111122 didn't have it (at least on the install), but I see that ... GNU sem is an alias for GNU parallel --semaphore ... Other options can set the number of jobs relative to the number of CPUs or cores -- potentially very useful.

The man page for parallel contains lots of examples as well as comparisons between parallel and other utilities of the same kind, e.g. xargs, paexec, etc.
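
For comparison, here is a minimal sketch of the same idea with xargs, one of the alternatives named above. It assumes an xargs that supports -0 and -P (GNU and BSD versions do, but POSIX does not require them). The name flatten_parallel, the .flat suffix, and the directory layout are all hypothetical.

```shell
# flatten_parallel is a hypothetical helper.
#   $1 = directory to search, $2 = existing output directory,
#   $3 = maximum number of awk processes running at the same time
flatten_parallel() {
    prog='/<.*/ , /.*<\/.*>/'
    find "$1" -type f -name '*.txt' -print0 |
        # xargs appends one file name per invocation, so inside sh -c:
        #   $1 = awk program, $2 = output directory, $3 = input file
        xargs -0 -n 1 -P "$3" sh -c \
            'awk -v ORS="" "$1" "$3" > "$2/$(basename "$3").flat"' \
            sh "$prog" "$2"
}
```

Each input file gets its own one-line output file, so parallel jobs never write to the same place. Two inputs with the same base name in different directories would overwrite each other, and this still starts one awk per file, which is the cost the single-awk approach avoids.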

Best wishes ... cheers, drl

UPDATE:

I finally found sem in the parallel install directory, along with niceload, etc., including man pages.

( Edit 2: add sem discovery )
( Edit 1: correct minor typo )

Last edited by drl; 04-30-2012 at 11:52 AM..
# 13  
Old 04-30-2012
Thanks drl.

It helped me understand the concept.

---------- Post updated at 02:06 AM ---------- Previous update was at 01:11 AM ----------

Quote:
Originally Posted by Corona688
Code:
awk -v OFS="" 'FNR==1 { printf("\n") }; /<.*/ , /.*<\/.*>/' *.txt

Hi Corona,

Actually, I wanted the XML from each file to be on one single line.
That is why I was using the script that way.

Can you please let me know how to get the XML from one file onto one single line using the code above?
Code:
awk '/<.*/ , /.*<\/.*>/' "$i" | tr -d '\n'
echo -ne '\n'

Thanks,
Chetan.C
# 14  
Old 04-30-2012
Quote:
Originally Posted by chetan.c
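Actually, I wanted the XML from each file to be on one single line.
That is why I was using the script that way.
I know what you're trying to do, and thought my script did that, but it had an error: it set OFS (the output field separator) where it should have set ORS (the output record separator), so every line still ended in a newline.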

Code:
awk -v ORS="" 'FNR==1 { printf("\n") }; /<.*/ , /.*<\/.*>/' *.txt

Try it again, please.

It ought to do everything you were trying to do with 5 lines, 3 external programs, and a parallel program, but in one line with one program, and faster...
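
To see what the one-liner does, here is a small sketch that runs it on made-up sample data. The scratch directory, the two sample files, and the wrapper name flatten_all are assumptions for illustration only.

```shell
# Hypothetical sample data in a scratch directory.
demo_dir=$(mktemp -d)
printf '<root>\n<item>1</item>\n</root>\n' > "$demo_dir/a.txt"
printf '<b>\n2\n</b>\n' > "$demo_dir/b.txt"

# flatten_all is a hypothetical wrapper around the one-liner above.
flatten_all() {
    # ORS="" makes awk print the matched lines with nothing between them;
    # the FNR==1 rule prints a newline before the first line of each file.
    awk -v ORS="" 'FNR==1 { printf("\n") }; /<.*/ , /.*<\/.*>/' "$@"
    # awk leaves the last line unterminated, so end it here.
    echo
}

flatten_all "$demo_dir"/*.txt
```

Because the newline is printed before each file, the output starts with an empty line; after that, each file's XML sits on its own line.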