Awk script gsub optimization

08-31-2010

I have created a shell script with the awk code below to replace special characters in an input file.

Code:

var=$(awk -v sc="$SpecialChar" -v rc="$ReplacementChar" '{
count+=gsub(sc,rc,$0)
print $0 > "'"$NEWFILE"'"

}
END{print count}
' "${INPUT_DIR}/$FILENAME")

The source file has 6 million records. This script handled 2 million records in 1 hour. That is very slow, and we need to optimise our processing.

Can anyone suggest an optimisation?

One thing we want to try is to split the source file into multiple pieces and then run the script against them in parallel in the background.

Is this a good way to optimise?
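For reference, the split-and-run-in-background idea could look roughly like the sketch below. The sample data, the `workdir` directory, the `part.` prefix, the `cntfile` variable and the two-line chunk size are all made up for illustration. Whether this is actually faster depends on the number of cores and on disk throughput, which nothing in this thread measures.

```shell
# Hypothetical sample data standing in for the real input file.
workdir=$(mktemp -d)
printf 'a#b\nc#d#e\nfgh\n#\n' > "$workdir/input.txt"
sc='#'   # assumed special character
rc='_'   # assumed replacement character

# Split into pieces of 2 lines each (use a far larger -l value for real data).
split -l 2 "$workdir/input.txt" "$workdir/part."

# Process every piece in the background; each job writes its own
# output file and its own count file.
for part in "$workdir"/part.??; do
  awk -v sc="$sc" -v rc="$rc" -v cntfile="$part.cnt" '
    { count += gsub(sc, rc); print }
    END { print (count + 0) > cntfile }
  ' "$part" > "$part.out" &
done
wait   # block until all background jobs have finished

# Reassemble the output in order and add up the per-piece counts.
cat "$workdir"/part.??.out > "$workdir/output.txt"
total=$(cat "$workdir"/part.??.cnt | awk '{ s += $1 } END { print s + 0 }')
echo "$total"   # prints 4 for the sample data
```

Splitting on line boundaries keeps the per-piece counts additive, because gsub works one record at a time.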

Akshay

08-31-2010

The problem is the $( ) construct with awk inside. I'm assuming, at least from what you posted, a child process gets created 2 million times. And each time it reads an input file? That will take forever.

If this were executed just once on a file it should run in a few seconds at most. Please post more of your code.

jim mcnamara

08-31-2010

One question: do you really need awk?

Why don't you use sed?

Something like:

Code:

sed 's/<replace>/<replace by>/g' <filename> > <new filename>

felipe.vinturin

08-31-2010

In addition to felipe's suggestion, you can use the variables like this:

Code:

sed "s/$SpecialChar/$ReplacementChar/g" "${INPUT_DIR}/$FILENAME" > "$NEWFILE"

I guess you want to count the number of replacements. Perhaps counting the changed characters after the fact might be faster, e.g.:

Code:

sed "s/$SpecialChar/$ReplacementChar/g" "${INPUT_DIR}/$FILENAME" > "$NEWFILE"
var=$(paste <(fold -w1 "${INPUT_DIR}/$FILENAME") <(fold -w1 "$NEWFILE") | sed '/\(.\)\t\1/d' | wc -l)

Alternatively, you could try mawk, which is a really fast awk.

Last edited by Scrutinizer; 08-31-2010 at 11:05 AM..

Scrutinizer

08-31-2010

We preferred awk's gsub because we had to capture the count of replacements in the file. sed and tr did not help with capturing counts.
Below is the actual script, with some changes.

Code:

# Remove temporary files.
function CleanUp
{}

# Function to check file presence.
function CheckFile
{}

# Function to replace special characters and count the number of replacements.
function ReplceSpecChar
{
count=0
NEWFILE="${OUTPUT_DIR}/${FILENAME}_NEW"
bteq<<EOF >>$LOG_FILE 2>&1
.logon xxxxx/yyyyy,zzzzz
.export file=${RELEASE_DIR}/specialchar.tmp
.set titledashes off
.set format off

SELECT special_character (TITLE '')
FROM DATABASE.TABLE
where FILE_NAME = '$FILENAME'


.export reset
EOF

SpecialChar=`cat specialchar.tmp`

bteq<<EOF >>$LOG_FILE 2>&1
.logon xxxxx/yyyyy,zzzzzz
.export file=${RELEASE_DIR}/replacementchar.tmp
.set titledashes off
.set format off

SELECT replaced_value (TITLE '')
FROM DATABASE.TABLE
where FILE_NAME = '$FILENAME'


.export reset
EOF

ReplacementChar=`cat replacementchar.tmp`

var=$(awk -v sc="$SpecialChar" -v rc="$ReplacementChar" '{
count+=gsub(sc,rc,$0)
print $0 > "'"$NEWFILE"'"

}
END{print count}
' "${INPUT_DIR}/$FILENAME")

echo "  ... No of replaced characters are ::: $var " >>$LOG_FILE
echo "$var" > ${RELEASE_DIR}/SpecCharCount.dat
echo "  ... New file created is           ::: $NEWFILE " >>$LOG_FILE
}


# Main script starts here...

if [ -z "$FILENAME" ]
then
                echo " ... No File name specified. Closing $0 ... " >>$LOG_FILE
                exit
fi
CheckFile
ReplceSpecChar
CleanUp
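
One point worth checking in a script like this: gsub treats its first argument as a regular expression, so a "special character" fetched from a table can behave differently from a literal character if it happens to be a regex metacharacter. The sketch below uses made-up values (`.` as the special character, `_` as the replacement) to show the difference. Wrapping the character in a bracket expression is one possible way to make it literal, not something the script above does.

```shell
line='a.b.c'   # hypothetical sample record

# As a regex, "." matches any character, so all five characters are replaced.
regex_count=$(printf '%s\n' "$line" |
  awk -v sc='.' -v rc='_' '{ print gsub(sc, rc) }')

# Inside a bracket expression the dot is literal, so only the two dots match.
literal_count=$(printf '%s\n' "$line" |
  awk -v sc='[.]' -v rc='_' '{ print gsub(sc, rc) }')

echo "$regex_count $literal_count"   # prints: 5 2
```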

---------- Post updated at 06:54 PM ---------- Previous update was at 05:14 PM ----------

Quote:

Originally Posted by Scrutinizer

In addition to felipe's suggestion, you can use the variables like this:

Code:

sed "s/$SpecialChar/$ReplacementChar/g" "${INPUT_DIR}/$FILENAME" > "$NEWFILE"

I guess you want to count the number of replacements. Perhaps counting the changed characters after the fact might be faster, e.g.:

Code:

sed "s/$SpecialChar/$ReplacementChar/g" "${INPUT_DIR}/$FILENAME" > "$NEWFILE"
var=$(paste <(fold -w1 "${INPUT_DIR}/$FILENAME") <(fold -w1 "$NEWFILE") | sed -n '/\(.\)\t\1/!p' | wc -l)

Alternatively, you could try mawk, which is a really fast awk.

What does the second line do here?

Code:

var=$(paste <(fold -w1 "${INPUT_DIR}/$FILENAME") <(fold -w1 "$NEWFILE") | sed -n '/\(.\)\t\1/!p' | wc -l)

Will it give me the number of replaced characters from the source file?

I tried to run the command against a sample file, but there seems to be an error in the code.

I changed the code as below, but it still shows an error.

Code:

sed "s/$SpecialChar/$ReplacementChar/g" "${INPUT_DIR}/$FILENAME" > "$NEWFILE"
var=$(paste <(fold -w1 "${INPUT_DIR}/$FILENAME") <(fold -w1 "$NEWFILE") | sed -n '/$.$\t\1/!p' | wc -l)

Last edited by Akshay; 08-31-2010 at 08:53 AM.. Reason: Replace QUOTE tags with CODE tags

Akshay

08-31-2010

The slowness is in fact due to the line

Code:

print $0 > "'"$NEWFILE"'"

but there is no subprocess involved here, just writing to a file instead of stdout. You can avoid the redirection by writing the counter to a temp file:

Code:

var=$(awk -v sc="$SpecialChar" -v rc="$ReplacementChar" '{
count+=gsub(sc,rc,$0)
print

}
END{print count > "/tmp/rep-chr-count"}
' "${INPUT_DIR}/$FILENAME" > "$NEWFILE"
cat /tmp/rep-chr-count
)

Another way to find the count of the special character, which should be fast, is:

Code:

tr -dc "$SpecialChar" < "${INPUT_DIR}/$FILENAME" |wc -c
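
As a small worked example of the tr approach, the sketch below counts a made-up special character (`#`) in a made-up sample file. Note that tr takes a set of characters rather than a regular expression and that wc -c counts bytes, so this assumes a single-byte character.

```shell
sample=$(mktemp)   # hypothetical sample file
printf 'a#b\nc#d#e\nfgh\n#\n' > "$sample"
sc='#'             # assumed special character

# -d deletes, -c complements the set: everything except "#" is removed,
# newlines included, so the remaining byte count is the number of matches.
count=$(tr -dc "$sc" < "$sample" | wc -c)
count=$((count))   # strip the padding some wc implementations add
echo "$count"      # prints 4
```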

binlib

08-31-2010

Quote:

Originally Posted by Akshay

What does the second line do here?

Code:

var=$(paste <(fold -w1 "${INPUT_DIR}/$FILENAME") <(fold -w1 "$NEWFILE") | sed -n '/\(.\)\t\1/!p' | wc -l)

Will it give me the number of replaced characters from the source file?

I tried to run the command against a sample file, but there seems to be an error in the code.

I changed the code as below, but it still shows an error.

I forgot to say that it only works in bash/ksh93, because it relies on process substitution, <( ). It does indeed calculate the number of characters that have been changed.
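
For shells without process substitution, a similar after-the-fact count can be sketched with cmp -l, which prints one line per differing byte. This is an alternative technique, not the pipeline quoted above. It assumes the special character and its replacement are each exactly one byte, so both files have the same length. The file names and sample data are made up.

```shell
orig=$(mktemp)   # hypothetical original file
new=$(mktemp)    # hypothetical output file
printf 'a#b\nc#d#e\n' > "$orig"
sed 's/#/_/g' "$orig" > "$new"

# cmp -l lists every byte position where the two files differ;
# counting those lines gives the number of changed characters.
changed=$(cmp -l "$orig" "$new" | wc -l)
changed=$((changed))   # strip the padding some wc implementations add
echo "$changed"        # prints 3
```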

Scrutinizer
