Shell Programming and Scripting: Multi thread awk command for faster performance. Post 302632149 by drl, Sunday 29th of April 2012, 09:50 AM
Hi.
Quote:
A thread is a lightweight process. The implementation of threads and processes differs from one operating system to another, but in most cases, a thread is contained inside a process. Multiple threads can exist within the same process and share resources such as memory, while different processes do not share these resources. -- excerpt from http://en.wikipedia.org/wiki/Thread_(computing)
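The distinction in the quote can be seen directly in the shell: a background job is a separate process, so a variable changed in the child never reaches the parent, unlike threads, which share memory. A minimal sketch (not from the original post):

```shell
#!/usr/bin/env bash
# Background jobs are separate processes: the subshell gets its own
# copy of "count", so the parent's value is unchanged after wait.

count=0
( count=100 ) &     # runs in a child process with its own memory
wait                # wait for the background job to finish
echo "count in parent: $count"   # still 0
```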
Here is a sample use of GNU parallel that counts file contents with wc:
Code:
#!/usr/bin/env bash

# @(#) s1	Demonstrate multiple processes simultaneously, with "GNU parallel".

# Helper functions: pe prints its arguments on one line, pl prints a
# separator heading, db writes debug output to stderr.
pe() { for _i; do printf "%s" "$_i"; done; printf "\n"; }
pl() { pe; pe "-----"; pe "$*"; }
db() { ( printf " db, "; for _i; do printf "%s" "$_i"; done; printf "\n" ) >&2; }
db() { : ; }    # redefine db as a no-op to silence debug output

# Local utility "context" reports environment and tool versions, if present.
C=$HOME/bin/context && [ -f "$C" ] && "$C" parallel

pl " Structure of directories:"
tree d1 d2

pl " Results of parallel processes:"
ls d1/* d2/* |
grep txt |
parallel --ungroup 'echo -n job {#}, process $$", wc = "; wc {}' |
align

exit 0

producing:
Code:
% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
parallel GNU parallel 20111122

-----
 Structure of directories:
d1
|-- a.txt
|-- b.txt
|-- binary-1.exe
`-- c.txt
d2
|-- frog-town.jpg
|-- x.txt
`-- y.txt

0 directories, 7 files

-----
 Results of parallel processes:
job 1, process 27495, wc =  4  16   70 d1/a.txt
job 2, process 27515, wc = 16  16  123 d1/b.txt
job 3, process 27535, wc = 26 265 1464 d1/c.txt
job 4, process 27555, wc =  4  16   70 d2/x.txt
job 5, process 27575, wc = 16  16  123 d2/y.txt

Each of the tasks ran as a separate process, as the distinct PIDs in the output show. The calling sequence for parallel is complex, so some experimentation may be useful. I have not tried it myself, but parallel also claims to be able to distribute jobs across different computers.
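For the thread's original question, speeding up an awk pass over many files, the same pattern applies: run one awk process per file. A minimal sketch, using xargs -P as a widely available stand-in for parallel (the pattern and *.txt filenames here are hypothetical):

```shell
#!/usr/bin/env bash
# Sketch (assumed filenames): count lines matching /pattern/ in each
# *.txt file, running up to 4 awk processes at once.
# The rough GNU parallel equivalent:
#   parallel -j4 "awk '/pattern/ { n++ } END { print FILENAME, n+0 }' {}" ::: *.txt

printf '%s\0' *.txt |
  xargs -0 -n 1 -P 4 awk '/pattern/ { n++ } END { print FILENAME, n+0 }'
```

Note that this only pays off when each awk job does enough work to outweigh the cost of starting a process per file; for a million tiny files, batching several files per awk invocation (drop the -n 1) is usually faster.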

The code for the (Perl) parallel script is at GNU Parallel - GNU Project - Free Software Foundation, http://www.gnu.org/software/parallel/

Best wishes ... cheers, drl
