Linux shell programming performance issue


 
# 1  
Old 08-30-2014
Linux shell programming performance issue

Hi All,

Can anyone help me with this, please?

Replace strings in FILE1.txt using the lookup pairs in FILE2.txt. A FILE1.txt record qualifies only if at least one state name in it is repeated, and the replacement must start from the second occurrence within the record.

Condition: the order in which the records of FILE2.txt are searched is important.

So as soon as a FILE2.txt entry matches a FILE1.txt record, break the loop and do not check the remaining FILE2.txt records. See the example below.

FILE1.txt
----------
Code:
TEXAS CALIFORNIA TEXAS
DALLAS CALIFORNIA CALIFORNIA DALLAS DALLAS TEXAS

FILE2.txt
------------
Code:
TEXAS,TX
DALLAS,DA
CALIFORNIA,CA
NEWYORK,NY


output:
-------
Code:
TEXAS CALIFORNIA TX

(TEXAS is repeated, so its 2nd occurrence is replaced with TX)
Code:
DALLAS CALIFORNIA CALIFORNIA DA DA TEXAS

(DALLAS and CALIFORNIA both appear more than once, but since the order in FILE2.txt is important and DALLAS comes before CALIFORNIA there, only DALLAS is replaced; the loop breaks and CALIFORNIA is not searched again)

I have implemented this using a while loop and it works as expected, but FILE1.txt has millions of records and FILE2.txt has about 50, so it takes hours to complete. Is there an awk solution to speed this up?

Last edited by Don Cragun; 08-30-2014 at 04:06 AM.. Reason: Remove QUOTE tags; add CODE tags.
# 2  
Old 08-30-2014
Please show us your shell script.
# 3  
Old 08-31-2014
Maybe use perl instead; it has hash arrays, and you can use unique to remove redundant elements.
# 4  
Old 08-31-2014
Hi, thanks for looking into this. Here is the code.

Code:
echo "If a string is repeated, replace all occurrences except the FIRST one."

tot_cnt=`wc -l < $REP_FILE_PATH/$REP_FILE`    # total number of records (not used below)

# IFS='' read -r preserves leading/trailing spaces and backslashes
while IFS='' read -r line; do
    i=0
    while read line_1; do
        field[1]=`cut -d',' -f1 <<<"$line_1"`    # state name
        field[2]=`cut -d',' -f2 <<<"$line_1"`    # abbreviation
        cnt=`echo -n "$line" | grep -o "${field[1]}" | wc -l`

        if [[ "$cnt" -gt 1 ]]; then
            sed -e "s/${field[1]}/${field[2]}/2g" <<<"$line" >> tmp.txt
            break
        fi
        let i++
    done < file2.txt
done < file1.txt

# 5  
Old 08-31-2014
You don't need awk (or similar) to improve the performance of your script. Just by looking at it you can see that you run six commands (= six new processes) in the inner loop, times 50 for the lines in file2, times millions for the lines in file1 (and file2 is reopened millions of times, even though it is buffered/cached).

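As a rough back-of-the-envelope figure (purely illustrative: assuming one million lines in file1 and something on the order of 1 ms per process creation):
Code:
# 6 processes per inner iteration x 50 file2 lines x 1,000,000 file1 lines
echo $(( 6 * 50 * 1000000 ))          # 300000000 process creations in the worst case
echo $(( 6 * 50 * 1000000 / 1000 ))   # ~300000 seconds, i.e. roughly 83 hours, at 1 ms each
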
With your input data, and after cleaning out a few quirks in your code snippet, I find
Code:
time . XX
real    0m0.308s
user    0m0.192s
sys    0m0.119s

while
Code:
time . YY
real    0m0.014s
user    0m0.012s
sys    0m0.000s

with YY being
Code:
while IFS='' read -r line; do
    while IFS=, read field1 field2; do
        # count occurrences of $field1 by deleting them and comparing string lengths
        TMP=${line//$field1}
        if [ $(( (${#line} - ${#TMP}) / ${#field1} )) -gt 1 ]; then
            sed "s/$field1/$field2/2g" <<<"$line" >> tmp.txt
            break
        fi
    done < file2
done < file1

cat tmp.txt
TEXAS CALIFORNIA TX
DALLAS CALIFORNIA CALIFORNIA DA DA TEXAS

An even faster solution might be to use an array to hold file2's contents, and have the outer loop read file1, and an inner loop to iterate through the array doing the comparisons/modifications.

---------- Post updated at 22:00 ---------- Previous update was at 21:36 ----------

Modification using arrays; adapt to taste...:
Code:
unset i
# slurp file2 into two parallel arrays, indexed from 1
while IFS=, read field1[++i] field2[i]; do : ; done < file2

while IFS='' read -r line; do
    for (( i=1; i<=${#field1[@]}; i++ )); do
        TMP=${line//${field1[$i]}}
        if [ $(( (${#line} - ${#TMP}) / ${#field1[$i]} )) -gt 1 ]; then
            sed "s/${field1[$i]}/${field2[$i]}/2g" <<<"$line" >> tmp.txt
            break
        fi
    done
done < file1

Timing is similar to the first version; looks like the disk cache is quite powerful:
Code:
time . ZZ

real    0m0.015s
user    0m0.003s
sys    0m0.013s

# 6  
Old 09-03-2014
Assuming GNU sed....

Create the sed file using FILE2.txt:
Code:
sed 's#\([^,]*\),\(.*\)#s/\1/\2/2g;t#' <FILE2.txt >FILE2.sed

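For reference, with the sample FILE2.txt above the generated FILE2.sed looks like this; the t command branches past the remaining rules once a substitution succeeds, which implements the "stop after the first match" requirement:
Code:
s/TEXAS/TX/2g;t
s/DALLAS/DA/2g;t
s/CALIFORNIA/CA/2g;t
s/NEWYORK/NY/2g;t
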
Use that FILE2.sed as the commands for sed:
Code:
sed -f FILE2.sed <FILE1.txt >RESULT.txt

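Running this on the sample data should reproduce the expected output from post #1:
Code:
cat RESULT.txt
TEXAS CALIFORNIA TX
DALLAS CALIFORNIA CALIFORNIA DA DA TEXAS
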
# 7  
Old 09-04-2014
Previous solution optimized:
Code:
sed 's#.*#s,&,2g;t#' FILE2.txt >FILE2.sed
sed -f FILE2.sed FILE1.txt >RESULT.txt

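With the sample FILE2.txt this generates the following FILE2.sed; note how the comma already present in the input doubles as the delimiter of the s command:
Code:
s,TEXAS,TX,2g;t
s,DALLAS,DA,2g;t
s,CALIFORNIA,CA,2g;t
s,NEWYORK,NY,2g;t
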
Now an awk solution that does not do an RE substitution, but substitutes word by word:
Code:
awk 'NR==FNR {r[$1]=$2; next} ($1 in r) {for (i=2; i<=NF; i++) if ($i==$1) $i=r[$i]} 1' FS=, FILE2.txt FS=" " FILE1.txt

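For readability, here is the same awk program spelled out over several lines (functionally identical to the one-liner above):
Code:
awk '
    NR == FNR { r[$1] = $2; next }    # FILE2.txt: build lookup  state -> abbreviation
    ($1 in r) {                       # FILE1.txt: only if the first word is a known state
        for (i = 2; i <= NF; i++)     # scan the remaining words
            if ($i == $1) $i = r[$i]  # replace every repeat of the first word
    }
    1                                 # print each record, modified or not
' FS=, FILE2.txt FS=" " FILE1.txt
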