Making use of multiple cores for running sed and awk scripts


 
# 1  
Old 08-20-2011

Hi All,

After reading that the sort command in Linux can be made to use many processor cores with a simple script I found on the internet, I was wondering whether similar techniques can be applied to programs like awk and sed.

Code:
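#!/bin/bash
# Usage: psort filename <chunksize> <threads>
# In this example the file largefile.txt is split into chunks of 20 MB.
# The parts are sorted in 4 simultaneous processes before being merged.
#
# psort largefile.txt 20m 4
#
# by h.p.
# -C (GNU split) keeps lines whole; -b could cut a line in half at a chunk boundary.
split -C "$2" "$1" "$1.part"
suffix=sorttemp.$(date +%s)
nthreads=$3
i=0
for fname in "$1".part*
do
    i=$((i + 1))
    sort "$fname" > "$fname.$suffix" &
    # After every $nthreads background sorts, wait for the batch to finish.
    test $((i % nthreads)) -eq 0 && wait
done
wait
# Merge the already-sorted parts, then remove all temporary files.
sort -m "$1".part*."$suffix"
rm "$1".part*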

Previously, I used sort without the above script, and it took several minutes to sort a very large file. By default, the sort command uses only one processor core.

My school has just purchased a 16-core Linux server with 96 GB of RAM, so I am currently fiddling with it.

Now a thought comes to mind: can sed and awk be used in the same way, so that they make use of all 16 cores?

I ask because I once tried to process a huge Wikipedia dump that I downloaded from the internet. The XML file is 30 GB in size and contains some 3.5 million articles.

I then ran this script to extract the individual articles and store them in separate files:

Code:
awk '/<page>/{c++}{print > c ".dat"}' wikipedia_dump.xml

To my horror, it took about 10-12 days to complete the task. Is it possible to use awk in such a way that it uses all the cores of the processor and runs in a multi-threaded fashion? I ran the above awk script on the same new Linux server.
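
One possible contributor to the run time, offered as an assumption rather than a measurement: the one-liner never closes its output files, so awk ends up managing millions of open output streams (gawk works around the descriptor limit by closing and reopening files behind the scenes, while other awks may simply fail). A minimal sketch that closes each article file as soon as the next <page> starts is shown below. The function name split_pages and its output-directory argument are made up for illustration, and unlike the one-liner it skips the lines before the first <page>.

```shell
#!/bin/bash
# split_pages INPUT OUT_DIR  (hypothetical helper, not from the original post)
# Writes each <page> block of INPUT to OUT_DIR/N.dat and closes every file
# as soon as the next <page> begins, so only one output file is open at a time.
split_pages() {
    awk -v dir="$2" '
        /<page>/ {
            if (out != "") close(out)     # done with the previous article
            out = dir "/" (++c) ".dat"
        }
        out != "" { print > out }         # lines before the first <page> are skipped
    ' "$1"
}
```

It would be called as split_pages wikipedia_dump.xml articles, with the articles directory created beforehand. This is still a single awk process on a single core; it only removes the open-file overhead.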
# 2  
Old 08-20-2011
You can try GNU parallel. This command should help:
Code:
cat wikipedia_dump.xml | parallel --pipe --recstart '<page>' awk '...'

===

No... I'm afraid it will use a fresh counter on each chunk, and new files would overwrite older ones. You should somehow split your file into 16 files and then do something like this:
Code:
cat filelist | parallel -q awk '/<page>/{c++} {print > (FILENAME "-" c ".dat")}'
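
The split-then-process idea above could also be sketched without GNU parallel, using only awk and shell background jobs. This is a rough sketch under assumptions: the names chunk_pages and process_chunks are invented here, pages are dealt round-robin into N chunk files, and every line before the first <page> is dropped.

```shell
#!/bin/bash
# chunk_pages INPUT N PREFIX  (hypothetical helper)
# One sequential pass: page k goes to chunk file PREFIX.(k mod N), so no
# page is ever cut in half and at most N output files are open.
chunk_pages() {
    awk -v n="$2" -v prefix="$3" '
        /<page>/ { out = prefix "." (c++ % n) }
        out != "" { print > out }
    ' "$1"
}

# process_chunks PREFIX  (hypothetical helper)
# Runs one awk per chunk in the background. Output names include the chunk
# name, so the per-chunk counters cannot overwrite each other's files.
process_chunks() {
    local chunk
    for chunk in "$1".*; do
        awk '/<page>/ { if (out != "") close(out); out = FILENAME "-" (++c) ".dat" }
             { print > out }' "$chunk" &
    done
    wait    # block until every background awk has finished
}
```

For the 16-core case this would be run as chunk_pages wikipedia_dump.xml 16 chunk followed by process_chunks chunk. The chunking pass itself is sequential and reads the whole dump once, so how much time this saves depends on whether the job is limited by CPU or by disk.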

---------- Post updated at 10:56 PM ---------- Previous update was at 10:04 PM ----------

And there are almost one million articles; it would be hard to work with them as separate files. It would be better to use a streaming XML parser and put the information into a database. But that is another story, of course.

Last edited by yazu; 08-20-2011 at 12:15 PM.. Reason: Wrong advise
# 3  
Old 08-20-2011
Hi Shoaib,

Have you developed any tool to parse the Wikipedia XML dump?

Regards

Satheesh
# 4  
Old 08-20-2011
Hi,

Not any tool as such (I think sed and awk are the best tools for parsing the Wikipedia XML dump); I just used a simple regular expression to extract the Wikipedia articles from the one huge file that is available for download. The problem was that it took days to parse the entire dump, so I thought: why not parallelize the whole thing so that it runs faster?
Even after parsing, a lot of preprocessing still needs to be done, which I feel is easy: work out certain heuristics and then run sed or awk based on them.

But if you are looking for tools to parse the Wikipedia XML dump, you may look here:
Experiments on the English Wikipedia - gensim

Wikipedia Preprocessor (WikiPrep)

Hope this helps.
# 5  
Old 08-21-2011
Thank you, Shoaib. Please let me know if you succeed in running your code across multiple threads.

Thanks
Satheesh
# 6  
Old 08-21-2011
Quote:
Originally Posted by yazu
Code:
cat filelist | parallel awk ...

Useless use of cat. Yes, there are times when cat is useful, and this isn't one of them. If you want to keep the same left-to-right order, you can write < filename command to get the same effect.
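
A small sketch of the two forms side by side, using xargs and wc -l as stand-ins for the real command. The function names are hypothetical, and the argument is assumed to be a file with one path per line (no spaces in the paths).

```shell
#!/bin/bash
# list_sizes_cat / list_sizes_redirect are hypothetical names; each takes a
# file containing one path per line and runs wc -l on the listed paths.
list_sizes_cat()      { cat "$1" | xargs wc -l; }   # extra cat process feeds stdin
list_sizes_redirect() { < "$1" xargs wc -l; }       # the shell opens the file as stdin
```

Both produce the same output; the redirect form just saves one process and one pipe.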
# 7  
Old 08-22-2011
From GNU parallel man page:

Quote:
EXAMPLE: Rewriting a for-loop and a while-read-loop

Code:
for-loops like this:

  (for x in `cat list` ; do
    do_something $x
  done) | process_output
and while-read-loops like this:

  cat list | (while read x ; do
    do_something $x
  done) | process_output
can be written like this:

cat list | parallel do_something | process_output

If the processing requires more steps the for-loop like this:

Code:
(for x in `cat list` ; do
   no_extension=${x%.*};
   do_something $x scale $no_extension.jpg
   do_step2 <$x $no_extension
 done) | process_output
and while-loops like this:

 cat list | (while read x ; do
   no_extension=${x%.*};
   do_something $x scale $no_extension.jpg
   do_step2 <$x $no_extension
 done) | process_output
can be written like this:

cat list | parallel "do_something {} scale {.}.jpg ; do_step2 <{} {.}" | process_output

You can always send them a patch or report the bug to them.

Last edited by yazu; 08-22-2011 at 01:54 AM..