Speeding up search and replace in a for loop


 
# 8  
Old 06-18-2012
@pbluescript, is file.txt free format? Could you post a sample?
# 9  
Old 06-18-2012
Quote:
Originally Posted by hergp
Wow, I did not expect ex to be so much more efficient than sed.

Do you have the total run-time in seconds for the three approaches too, pbluescript?
Sure. These were all submitted to an LSF queue, and each node has a minimum of 8 cores with 2.8 GHz+ Intel Xeon CPUs and 16 GB RAM, running RHEL 5.3. Here is some extra info about each job, with the actual run time listed:

My method: 195,389 seconds
Max Memory : 5 MB
Max Swap : 266 MB
Max Processes : 5
Max Threads : 6

hergp's method: 209,240 seconds
Max Memory : 2676 MB
Max Swap : 2870 MB
Max Processes : 4
Max Threads : 5

alister's method: 42,573 seconds
Max Memory : 121 MB
Max Swap : 392 MB
Max Processes : 5
Max Threads : 6

When actual run time is used, the awk/ex method looks even better.

---------- Post updated at 08:52 AM ---------- Previous update was at 08:44 AM ----------

Quote:
Originally Posted by Scrutinizer
@pbluescript, is file.txt free format? Could you post a sample?
Sure. The actual commands I ran were slightly different from what I posted, as there are two places per line that could be changed, but I only wanted one of them to change.

sed in a for loop version:
Code:
sed -i "s#gene_id \"$OLD\"#gene_id \"$NEW\"#g" file.txt

alister's version:

Code:
awk -F, '{print "%s#gene_id \""$1"\"#gene_id \""$2"\"#g"} END {print "x"}' conversion.csv | ex -s file.txt
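
To clarify what that pipeline does: the awk program turns each conversion.csv row into one ex substitute command (the leading % addresses every line of file.txt), and the END block appends x so that ex writes the file and quits. For example, the first two rows of the conversion.csv sample below would produce this command stream for ex -s file.txt:

Code:
%s#gene_id "uc007afh.1"#gene_id "Lypla1"#g
%s#gene_id "uc007afg.1"#gene_id "Lypla1"#g
x

In practice there is one %s line per conversion.csv row, followed by the single x at the end.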

Here is a sample of what I started with:

Code:
chr1    mm9_knownGene   exon    3195985 3197398 0.000000        -       .       gene_id "uc007aet.1"; transcript_id "uc007aet.1";
chr1    mm9_knownGene   exon    3203520 3205713 0.000000        -       .       gene_id "uc007aet.1"; transcript_id "uc007aet.1";
chr1    mm9_knownGene   stop_codon      3206103 3206105 0.000000        -       .       gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   CDS     3206106 3207049 0.000000        -       2       gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   exon    3204563 3207049 0.000000        -       .       gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   CDS     3411783 3411982 0.000000        -       1       gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   exon    3411783 3411982 0.000000        -       .       gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   CDS     3660633 3661429 0.000000        -       0       gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   start_codon     3661427 3661429 0.000000        -       .       gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   exon    3660633 3661579 0.000000        -       .       gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";

Here is a sample of conversion.csv:

Code:
uc007afh.1,Lypla1
uc007afg.1,Lypla1
uc007afi.2,Tcea1
uc011wht.1,Tcea1
uc011whu.1,Tcea1
uc007afn.1,Atp6v1h
uc007afm.1,Atp6v1h
uc007afo.1,Oprk1
uc007afp.1,Oprk1
uc007afq.1,Oprk1

Here is a sample of the final result:

Code:
chr1    mm9_knownGene   exon    3195985 3197398 0.000000        -       .       gene_id "mKIAA1889"; transcript_id "uc007aet.1";
chr1    mm9_knownGene   exon    3203520 3205713 0.000000        -       .       gene_id "mKIAA1889"; transcript_id "uc007aet.1";
chr1    mm9_knownGene   stop_codon      3206103 3206105 0.000000        -       .       gene_id "Xkr4"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   CDS     3206106 3207049 0.000000        -       2       gene_id "Xkr4"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   exon    3204563 3207049 0.000000        -       .       gene_id "Xkr4"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   CDS     3411783 3411982 0.000000        -       1       gene_id "Xkr4"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   exon    3411783 3411982 0.000000        -       .       gene_id "Xkr4"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   CDS     3660633 3661429 0.000000        -       0       gene_id "Xkr4"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   start_codon     3661427 3661429 0.000000        -       .       gene_id "Xkr4"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   exon    3660633 3661579 0.000000        -       .       gene_id "Xkr4"; transcript_id "uc007aeu.1";


Last edited by Scrutinizer; 06-18-2012 at 10:38 AM.. Reason: code tags
# 10  
Old 06-18-2012
A marginal improvement to your original solution may be achieved by removing the echo commands and reading the fields in the read statement itself:

Code:
while IFS=\, read old new
do
  sed -i "s#\"$old\"#\"$new\"#g" file.txt
done < conversion.csv

# 11  
Old 06-18-2012
Quote:
Originally Posted by pbluescript
Sure. The actual commands I ran were slightly different than what I posted as there are two places per line that could be changed, but I only wanted one of them to change.

sed in a for loop version:
Code:
sed -i "s#gene_id \"$OLD\"#gene_id \"$NEW\"#g" file.txt

alister's version:

Code:
awk -F, '{print "%s#gene_id \""$1"\"#gene_id \""$2"\"#g"} END {print "x"}' conversion.csv | ex -s file.txt

If you only want to change the first occurrence per line, remove the trailing g (global) flag from the sed substitute command. Otherwise, after the first substitution in each line, the sed/ex regular expression engine will continue scanning the remainder of the line for a second possible match (in your sample commands, another "gene_id ..." occurrence) instead of immediately exiting the substitute command and moving on to the next line. I wouldn't expect much improvement, but who knows.


sed in a for loop version:
Code:
sed -i "s#\"$OLD\"#\"$NEW\"#" file.txt

alister's version:

Code:
awk -F, '{print "%s#\""$1"\"#\""$2"\"#"} END {print "x"}' conversion.csv | ex -s file.txt

You may want to re-include the "gene_id" text to tighten the match, but you definitely don't want to use the global substitution flag.
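
As a quick illustration, using one of your sample lines and the uc007aet.1 to mKIAA1889 mapping from your final result: without the g flag, only the first quoted ID on the line (the gene_id) is rewritten and the transcript_id is left alone.

Code:
printf '%s\n' 'gene_id "uc007aet.1"; transcript_id "uc007aet.1";' |
sed 's#"uc007aet.1"#"mKIAA1889"#'
# prints: gene_id "mKIAA1889"; transcript_id "uc007aet.1";
# with a trailing g flag, the transcript_id would be rewritten as well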

I edited those right here in the forum's web page text box, so if you are going to test them, it would be prudent to use a small data set first, just to rule out a non-fatal typo (with all those double quotes hanging around, I may have accidentally deleted one).

Regards,
Alister

Last edited by alister; 06-18-2012 at 10:49 AM..
# 12  
Old 06-18-2012
Since file.txt is structured, you could also try:
Code:
awk 'NR==FNR{A[$1]=$2;next} $2 in A{$2=A[$2]}1' FS=, conversion.csv FS=\" OFS=\" file.txt
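
For readability, here is the same program written out with comments. This is only a sketch of the one-liner above, with no change in behavior:

Code:
awk '
  NR == FNR { A[$1] = $2; next }  # 1st file (conversion.csv, FS=","): build old -> new map
  $2 in A   { $2 = A[$2] }        # 2nd file (file.txt, FS="\""): field 2 holds the gene_id value
  1                               # print every line; modified lines are rebuilt with OFS="\""
' FS=, conversion.csv FS=\" OFS=\" file.txt

Unlike the sed -i loop, this writes to standard output, so redirect to a temporary file and move it into place, e.g. add > file.new && mv file.new file.txt at the end.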

For optimum performance, also try mawk instead of awk if it is available.

Last edited by Scrutinizer; 06-18-2012 at 11:43 AM..
# 13  
Old 06-18-2012
@pbluescript

Scrutinizer's suggestion should leave you pleasantly surprised.

Regards,
Alister
# 14  
Old 06-18-2012
Quote:
Originally Posted by hergp
Wow, I did not expect ex to be so much more efficient than sed.

Do you have the total run-time in seconds for the three approaches too, pbluescript?
I don't think the difference lies so much in the sed/ex statements themselves, but rather in the while loop that contains command substitutions with external cut commands, which are very costly in a shell loop.
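
The kind of loop being replaced presumably looked something like this (a reconstruction, since the original was posted earlier in the thread; the point is the two command substitutions per input line, each of which forks an echo and a cut):

Code:
# hypothetical reconstruction of the costly loop
while read LINE
do
  OLD=$(echo "$LINE" | cut -d, -f1)
  NEW=$(echo "$LINE" | cut -d, -f2)
  echo "s#\"$OLD\"#\"$NEW\"#g"
done < conversion.csv > commands.txt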

We could replace the loop with one that does not use external commands:

Code:
while IFS=, read OLD NEW
do
  echo "s#\"$OLD\"#\"$NEW\"#g"
done < conversion.csv >commands.txt
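
The generated commands.txt can then be applied to file.txt in a single pass, for example with sed -f (or, with an x line appended, fed to ex -s as in alister's approach):

Code:
sed -f commands.txt file.txt > file.new && mv file.new file.txt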

I ran some tests on a file with 55,000 entries and the results were:
  • loop with cut commands: 4 minutes 55 seconds
  • loop without cut commands: 1.7 seconds

Last edited by Scrutinizer; 06-18-2012 at 11:36 AM..