Awk script gsub optimization


 
# 1  
Old 08-31-2010

I have created a shell script that uses the awk code below to replace special characters in an input file.

Code:
var=$(awk -v sc="$SpecialChar" -v rc="$ReplacementChar" '{
count+=gsub(sc,rc,$0)
print $0 > "'"$NEWFILE"'"

}
END{print count}
' "${INPUT_DIR}/$FILENAME")
The source file has 6 million records, and this script took 1 hour to handle 2 million of them. That is far too slow, and we need to optimise the processing.

Can any guru suggest an optimisation?

One thing we want to try is to split the source file into multiple pieces and then run the script against them in the background, several at a time.

Is this a good way to optimise?
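
A rough sketch of the split-and-run-in-background idea from this post, in case it helps the discussion. It can only pay off if the job is CPU-bound and the machine has spare cores, which the thread has not established. The function name `parallel_gsub`, the lines-per-chunk argument and the temp-directory layout are made up for illustration, and `mktemp -d` is assumed to be available.

```shell
#!/bin/sh
# Hypothetical helper: split the input, run the same awk substitution on each
# piece in the background, then stitch the pieces together and sum the counts.
parallel_gsub() {
    # $1 = input file, $2 = output file, $3 = regex, $4 = replacement,
    # $5 = lines per chunk (one background awk is started per chunk)
    src=$1 dst=$2 sc=$3 rc=$4 lines=$5
    work=$(mktemp -d) || return 1

    split -l "$lines" "$src" "$work/part."

    for part in "$work"/part.*; do
        # Each awk writes its piece to $part.out and its count to $part.cnt.
        awk -v sc="$sc" -v rc="$rc" -v cnt="$part.cnt" '
            { n += gsub(sc, rc); print }
            END { print n + 0 > cnt }
        ' "$part" > "$part.out" &
    done
    wait    # block until every background awk has finished

    # Glob expansion is sorted, so the pieces come back in their original order.
    cat "$work"/part.*.out > "$dst"
    awk '{ total += $1 } END { print total + 0 }' "$work"/part.*.cnt

    rm -rf "$work"
}
```

It would be called as `var=$(parallel_gsub "${INPUT_DIR}/$FILENAME" "$NEWFILE" "$SpecialChar" "$ReplacementChar" 1500000)`, which starts four awk processes for a 6 million line file. An empty input file is not handled.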
# 2  
Old 08-31-2010
The problem is the $( ) construct with awk inside. I'm assuming, at least from what you posted, a child process gets created 2 million times. And each time it reads an input file? That will take forever.

If this were executed just once on a file it should run in a few seconds at most. Please post more of your code.
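
To make the distinction concrete, here is a small sketch of the two shapes the calling code could have. Nothing posted so far shows which one is in use; both function names are hypothetical.

```shell
#!/bin/sh
# Anti-pattern: the shell loops over the file and starts one awk per line,
# so a 2 million line file means 2 million child processes.
replace_per_line() {
    # $1 = input file, $2 = regex, $3 = replacement
    while IFS= read -r line; do
        printf '%s\n' "$line" |
            awk -v sc="$2" -v rc="$3" '{ gsub(sc, rc); print }'
    done < "$1"
}

# Preferred: a single awk process reads the whole file in one pass.
replace_single_pass() {
    # $1 = input file, $2 = regex, $3 = replacement
    awk -v sc="$2" -v rc="$3" '{ gsub(sc, rc); print }' "$1"
}
```

Both produce the same output; timing each with `time` on a sample of the real data would show whether process creation is the bottleneck.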
# 3  
Old 08-31-2010
One question: do you really need awk?

Why don't you use sed?

Something like:
Code:
sed 's/pattern/replacement/g' inputfile > outputfile

# 4  
Old 08-31-2010
In addition to what felipe said, you can use the variables like this:
Code:
sed "s/$SpecialChar/$ReplacementChar/g" "${INPUT_DIR}/$FILENAME" > "$NEWFILE"
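
One caveat with putting shell variables straight into `s///`: if the special character is something like `.`, `*`, `/` or `&`, sed treats it as regex or replacement syntax rather than as a literal. A minimal sketch of escaping both sides first, assuming `/` is the `s` delimiter and the values contain no newlines; the helper names `sed_escape_pattern` and `sed_escape_replacement` are made up here.

```shell
#!/bin/sh
# Escape a literal string so sed matches it verbatim as a basic regex.
sed_escape_pattern() {
    # Puts a backslash in front of ] [ \ / . * ^ $
    printf '%s\n' "$1" | sed 's|[][\\/.*^$]|\\&|g'
}

# Escape a literal string for the replacement half of s///.
sed_escape_replacement() {
    # Puts a backslash in front of \ / &
    printf '%s\n' "$1" | sed 's|[\\/&]|\\&|g'
}
```

Usage would then be `sed "s/$(sed_escape_pattern "$SpecialChar")/$(sed_escape_replacement "$ReplacementChar")/g" "${INPUT_DIR}/$FILENAME" > "$NEWFILE"`.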

I guess you want to count the number of replacements. Perhaps counting the changed characters after the fact might be faster, e.g.:
Code:
sed "s/$SpecialChar/$ReplacementChar/g" "${INPUT_DIR}/$FILENAME" > "$NEWFILE"
var=$(paste <(fold -w1 "${INPUT_DIR}/$FILENAME") <(fold -w1 "$NEWFILE") | sed '/\(.\)\t\1/d' | wc -l)
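
A different technique for the same after-the-fact count, offered only as a sketch: `cmp -l` prints one line for every byte position at which two files differ, so counting its output lines gives the number of changed bytes. This is only meaningful when the replacement has the same byte length as the text it replaces, so that both files stay aligned. The function name `count_changed_bytes` is made up.

```shell
#!/bin/sh
# Count the byte positions at which two files differ.
count_changed_bytes() {
    # $1 = original file, $2 = modified file
    # cmp -l lists every differing byte, one per line.
    n=$(cmp -l "$1" "$2" | wc -l)
    echo $((n))    # the arithmetic expansion strips any padding from wc
}
```

For example: `var=$(count_changed_bytes "${INPUT_DIR}/$FILENAME" "$NEWFILE")`.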

Alternatively, you could try mawk, which is a really fast awk.

Last edited by Scrutinizer; 08-31-2010 at 11:05 AM..
# 5  
Old 08-31-2010
We preferred awk's gsub because we have to capture the count of replacements in the file. sed and tr did not help us capture the counts.
Below is the actual script, with some changes.

Code:
# remove temporary files.
function CleanUp
{}

  # function to check file presence.
function CheckFile
{}

  # function to replace special characters and to count the no of replacements

function ReplceSpecChar
{
count=0
NEWFILE="${OUTPUT_DIR}/${FILENAME}_NEW"
bteq<<EOF >>$LOG_FILE 2>&1
.logon xxxxx/yyyyy,zzzzz
.export file=${RELEASE_DIR}/specialchar.tmp
.set titledashes off
.set format off

SELECT special_character (TITLE '')
FROM DATABASE.TABLE
where FILE_NAME = '$FILENAME'


.export reset
EOF

SpecialChar=$(cat "${RELEASE_DIR}/specialchar.tmp")

bteq<<EOF >>$LOG_FILE 2>&1
.logon xxxxx/yyyyy,zzzzzz
.export file=${RELEASE_DIR}/replacementchar.tmp
.set titledashes off
.set format off

SELECT replaced_value (TITLE '')
FROM DATABASE.TABLE
where FILE_NAME = '$FILENAME'


.export reset
EOF

ReplacementChar=$(cat "${RELEASE_DIR}/replacementchar.tmp")

var=$(awk -v sc="$SpecialChar" -v rc="$ReplacementChar" '{
count+=gsub(sc,rc,$0)
print $0 > "'"$NEWFILE"'"

}
END{print count}
' "${INPUT_DIR}/$FILENAME")

echo "  ... No of replaced characters are ::: $var " >>$LOG_FILE
echo "$var" > ${RELEASE_DIR}/SpecCharCount.dat
echo "  ... New file created is           ::: $NEWFILE " >>$LOG_FILE
}


# Main script starts here...

if [ -z "$FILENAME" ]
then
                echo " ... No File name specified. Closing $0 ... " >>$LOG_FILE
                exit
fi
CheckFile
ReplceSpecChar
CleanUp



---------- Post updated at 06:54 PM ---------- Previous update was at 05:14 PM ----------

Quote:
Originally Posted by Scrutinizer
In addition to what felipe said, you can use the variables like this:
Code:
sed "s/$SpecialChar"/$ReplacementChar/g" "${INPUT_DIR}/$FILENAME" > "$NEWFILE"

I guess you want to count the number of replacements. Perhaps counting the changed characters after the fact might be faster, e.g.:
Code:
sed "s/$SpecialChar"/$ReplacementChar/g" "${INPUT_DIR}/$FILENAME" > "$NEWFILE"
var=$(paste <(fold -w1 "${INPUT_DIR}/$FILENAME") <(fold -w1 "$NEWFILE") | sed -n '/\(.\)\t\1/!p' | wc -l)

Alternatively, you could try mawk, which is a really fast awk.
What does the second line do here?
Code:
var=$(paste <(fold -w1 "${INPUT_DIR}/$FILENAME") <(fold -w1 "$NEWFILE") | sed -n '/\(.\)\t\1/!p' | wc -l)

Will it give me the number of replaced characters from the source file?

I tried to run the command against a sample file, but there seems to be an error in the code.

I changed the code as below, but it still shows an error.
Code:
sed "s/$SpecialChar/$ReplacementChar/g" "${INPUT_DIR}/$FILENAME" > "$NEWFILE"
var=$(paste <(fold -w1 "${INPUT_DIR}/$FILENAME") <(fold -w1 "$NEWFILE") | sed -n '/\(.\)\t\1/!p' | wc -l)

Last edited by Akshay; 08-31-2010 at 08:53 AM.. Reason: Replace QUOTE tags with CODE tags
# 6  
Old 08-31-2010
The slowness is in fact due to the line
Code:
print $0 > "'"$NEWFILE"'"

but there is no subprocess involved here, just writing to a file instead of stdout. You can avoid the redirection by writing the counter to a temp file:

Code:
var=$(awk -v sc="$SpecialChar" -v rc="$ReplacementChar" '{
count+=gsub(sc,rc,$0)
print

}
END{print count > "/tmp/rep-chr-count"}
' "${INPUT_DIR}/$FILENAME" > "$NEWFILE"
cat /tmp/rep-chr-count
)
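
A variation on the same idea that avoids the fixed temp file name, sketched under the assumption that the awk in use accepts "/dev/stderr" as an output file (gawk and mawk do): the data goes to stdout and the counter goes to stderr, and the wrapper swaps stderr into the command substitution. The function name `gsub_count` is made up.

```shell
#!/bin/sh
# Write the substituted data to a file and print only the replacement count.
gsub_count() {
    # $1 = input file, $2 = output file, $3 = regex, $4 = replacement
    {
        awk -v sc="$3" -v rc="$4" '
            { n += gsub(sc, rc); print }           # data goes to stdout
            END { print n + 0 > "/dev/stderr" }    # counter goes to stderr
        ' "$1" > "$2"
    } 2>&1    # the group stderr becomes the function stdout
}
```

It would be used as `var=$(gsub_count "${INPUT_DIR}/$FILENAME" "$NEWFILE" "$SpecialChar" "$ReplacementChar")`. Note that any awk error message would end up in `var` as well.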

Another way (which should be fast) to count the special characters is:
Code:
tr -dc "$SpecialChar" < "${INPUT_DIR}/$FILENAME" |wc -c
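
For what it is worth, when the special character and its replacement are both single characters, tr could do the replacement as well as the count. `tr -dc` deletes everything except the listed characters, so `wc -c` is left counting only the occurrences. A sketch with a made-up function name; note that tr takes a set of characters, not a regex, and characters such as `\` or `-` have special meaning to it.

```shell
#!/bin/sh
# Replace one single character with another and print how many were replaced.
tr_replace_and_count() {
    # $1 = input file, $2 = output file, $3 = character, $4 = replacement
    tr "$3" "$4" < "$1" > "$2"
    # Second pass: drop every byte that is not $3, then count what is left.
    n=$(tr -dc "$3" < "$1" | wc -c)
    echo $((n))
}
```

This reads the input twice, so whether it beats a single awk pass would have to be measured on the real file.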

# 7  
Old 08-31-2010
Quote:
Originally Posted by Akshay
What does the second line do here?
Code:
var=$(paste <(fold -w1 "${INPUT_DIR}/$FILENAME") <(fold -w1 "$NEWFILE") | sed -n '/\(.\)\t\1/!p' | wc -l)

Will it give me the number of replaced characters from the source file?

I tried to run the command against a sample file, but there seems to be an error in the code.

I changed the code as below, but it still shows an error.
I forgot to say that it only works in bash/ksh93, because it relies on process substitution, <( ). It does indeed calculate the number of characters that have been changed.
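
For shells without process substitution, the same steps can be spelled out with temp files, which also shows what each stage does: `fold -w1` puts every character on its own line, `paste` joins line N of both files with a tab, and the filter keeps only the pairs that differ. This is a sketch rather than the exact command from the thread: it uses awk for the comparison instead of sed, assumes single-byte characters, assumes the replacement has the same length as the original text, and does not handle tab characters in the data. The function name `count_changed_chars` is made up.

```shell
#!/bin/sh
# Count the character positions at which two equally long files differ.
count_changed_chars() {
    # $1 = original file, $2 = modified file
    orig=$1 new=$2
    tmp=$(mktemp -d) || return 1
    fold -w1 "$orig" > "$tmp/a"    # one character per line
    fold -w1 "$new"  > "$tmp/b"
    # Each pasted line is "old<TAB>new"; count the lines where they differ.
    n=$(paste "$tmp/a" "$tmp/b" | awk -F '\t' '$1 != $2' | wc -l)
    rm -rf "$tmp"
    echo $((n))
}
```

For example: `var=$(count_changed_chars "${INPUT_DIR}/$FILENAME" "$NEWFILE")`.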