Speeding up search and replace in a for loop


 
# 8  
Old 06-18-2012
@pbluescript, is file.txt free format? Could you post a sample?
# 9  
Old 06-18-2012
Quote:
Originally Posted by hergp
Wow, I did not expect ex to be so much more efficient than sed.

Do you have the total run-time in seconds for the three approaches too, pbluescript?
Sure. These were all submitted to an LSF queue, and each node has a minimum of 8 cores with 2.8 GHz+ Intel Xeon CPUs and 16 GB RAM, running RHEL 5.3. Here is some extra info about each job, with the actual run time listed:

My method: 195,389 seconds
Max Memory : 5 MB
Max Swap : 266 MB
Max Processes : 5
Max Threads : 6

hergp's method: 209,240 seconds
Max Memory : 2676 MB
Max Swap : 2870 MB
Max Processes : 4
Max Threads : 5

alister's method: 42,573 seconds
Max Memory : 121 MB
Max Swap : 392 MB
Max Processes : 5
Max Threads : 6

When actual run time is used, the awk/ex method looks even better.

---------- Post updated at 08:52 AM ---------- Previous update was at 08:44 AM ----------

Quote:
Originally Posted by Scrutinizer
@pbluescript, is file.txt free format? Could you post a sample?
Sure. The actual commands I ran were slightly different from what I posted, as there are two places per line that could be changed, but I only wanted one of them to change.

sed in a for loop version:
Code:
sed -i "s#gene_id \"$OLD\"#gene_id \"$NEW\"#g" file.txt

alister's version:

Code:
awk -F, '{print "%s#gene_id \""$1"\"#gene_id \""$2"\"#g"} END {print "x"}' conversion.csv | ex -s file.txt
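
To clarify what that pipeline does: the awk program turns each conversion.csv row into one ex substitute command (the leading % addresses every line of file.txt), and the END block appends x so that ex writes the file and quits. For example, the first two rows of the conversion.csv sample below would produce this command stream for ex -s file.txt:

Code:
%s#gene_id "uc007afh.1"#gene_id "Lypla1"#g
%s#gene_id "uc007afg.1"#gene_id "Lypla1"#g
x

In practice there is one %s line per conversion.csv row, followed by the single x at the end.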

Here is a sample of what I started with:

Code:
chr1    mm9_knownGene   exon    3195985 3197398 0.000000        -       .       gene_id "uc007aet.1"; transcript_id "uc007aet.1";
chr1    mm9_knownGene   exon    3203520 3205713 0.000000        -       .       gene_id "uc007aet.1"; transcript_id "uc007aet.1";
chr1    mm9_knownGene   stop_codon      3206103 3206105 0.000000        -       .       gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   CDS     3206106 3207049 0.000000        -       2       gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   exon    3204563 3207049 0.000000        -       .       gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   CDS     3411783 3411982 0.000000        -       1       gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   exon    3411783 3411982 0.000000        -       .       gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   CDS     3660633 3661429 0.000000        -       0       gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   start_codon     3661427 3661429 0.000000        -       .       gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   exon    3660633 3661579 0.000000        -       .       gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";

Here is a sample of conversion.csv:

Code:
uc007afh.1,Lypla1
uc007afg.1,Lypla1
uc007afi.2,Tcea1
uc011wht.1,Tcea1
uc011whu.1,Tcea1
uc007afn.1,Atp6v1h
uc007afm.1,Atp6v1h
uc007afo.1,Oprk1
uc007afp.1,Oprk1
uc007afq.1,Oprk1

Here is a sample of the final result:

Code:
chr1    mm9_knownGene   exon    3195985 3197398 0.000000        -       .       gene_id "mKIAA1889"; transcript_id "uc007aet.1";
chr1    mm9_knownGene   exon    3203520 3205713 0.000000        -       .       gene_id "mKIAA1889"; transcript_id "uc007aet.1";
chr1    mm9_knownGene   stop_codon      3206103 3206105 0.000000        -       .       gene_id "Xkr4"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   CDS     3206106 3207049 0.000000        -       2       gene_id "Xkr4"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   exon    3204563 3207049 0.000000        -       .       gene_id "Xkr4"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   CDS     3411783 3411982 0.000000        -       1       gene_id "Xkr4"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   exon    3411783 3411982 0.000000        -       .       gene_id "Xkr4"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   CDS     3660633 3661429 0.000000        -       0       gene_id "Xkr4"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   start_codon     3661427 3661429 0.000000        -       .       gene_id "Xkr4"; transcript_id "uc007aeu.1";
chr1    mm9_knownGene   exon    3660633 3661579 0.000000        -       .       gene_id "Xkr4"; transcript_id "uc007aeu.1";


Last edited by Scrutinizer; 06-18-2012 at 10:38 AM.. Reason: code tags
# 10  
Old 06-18-2012
A marginal improvement to your original solution may be achieved by removing the echo commands and reading the fields in the read statement itself:

Code:
while IFS=\, read old new
do
  sed -i "s#\"$old\"#\"$new\"#g" file.txt
done < conversion.csv

# 11  
Old 06-18-2012
Quote:
Originally Posted by pbluescript
Sure. The actual commands I ran were slightly different than what I posted as there are two places per line that could be changed, but I only wanted one of them to change.

sed in a for loop version:
Code:
sed -i "s#gene_id \"$OLD\"#gene_id \"$NEW\"#g" file.txt

alister's version:

Code:
awk -F, '{print "%s#gene_id \""$1"\"#gene_id \""$2"\"#g"} END {print "x"}' conversion.csv | ex -s file.txt

If you only want to change the first occurrence per line, remove the trailing g (global) flag from the sed substitute command. Otherwise, after the first substitution in each line, the sed/ex regular expression engine will continue scanning the remainder of the line for a second possible match (in your sample commands, another "gene_id ..." occurrence) instead of immediately exiting the substitute command and moving on to the next line. I wouldn't expect much improvement, but who knows.


sed in a for loop version:
Code:
sed -i "s#\"$OLD\"#\"$NEW\"#" file.txt

alister's version:

Code:
awk -F, '{print "%s#\""$1"\"#\""$2"\"#"} END {print "x"}' conversion.csv | ex -s file.txt

You may want to re-include the "gene_id" text to tighten the match, but you definitely don't want to use the global substitution flag.
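
As a quick illustration, using one of your sample lines and the uc007aet.1 to mKIAA1889 mapping from your final result: without the g flag, only the first quoted ID on the line (the gene_id) is rewritten and the transcript_id is left alone.

Code:
printf '%s\n' 'gene_id "uc007aet.1"; transcript_id "uc007aet.1";' |
sed 's#"uc007aet.1"#"mKIAA1889"#'
# prints: gene_id "mKIAA1889"; transcript_id "uc007aet.1";
# with a trailing g flag, the transcript_id would be rewritten as well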

I edited those right here in the forum's web page text box, so if you are going to test them, it would be prudent to use a small data set first, just to rule out a non-fatal typo (with all those double quotes hanging around, I may have accidentally deleted one).

Regards,
Alister

Last edited by alister; 06-18-2012 at 10:49 AM..
# 12  
Old 06-18-2012
Since file.txt is structured, you could also try:
Code:
awk 'NR==FNR{A[$1]=$2;next} $2 in A{$2=A[$2]}1' FS=, conversion.csv FS=\" OFS=\" file.txt
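
For readability, here is the same program written out with comments. This is only a sketch of the one-liner above, with no change in behavior:

Code:
awk '
  NR == FNR { A[$1] = $2; next }  # 1st file (conversion.csv, FS=","): build old -> new map
  $2 in A   { $2 = A[$2] }        # 2nd file (file.txt, FS="\""): field 2 holds the gene_id value
  1                               # print every line; modified lines are rebuilt with OFS="\""
' FS=, conversion.csv FS=\" OFS=\" file.txt

Unlike the sed -i loop, this writes to standard output, so redirect to a temporary file and move it into place, e.g. add > file.new && mv file.new file.txt at the end.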

For optimum performance, also try mawk instead of awk if it is available.

Last edited by Scrutinizer; 06-18-2012 at 11:43 AM..
# 13  
Old 06-18-2012
@pbluescript

Scrutinizer's suggestion should leave you pleasantly surprised.

Regards,
Alister
# 14  
Old 06-18-2012
Quote:
Originally Posted by hergp
Wow, I did not expect ex to be so much more efficient than sed.

Do you have the total run-time in seconds for the three approaches too, pbluescript?
I don't think the difference lies so much in the sed/ex statements themselves, but rather in the while loop that contains command substitutions with external cut commands, which are very costly in a shell loop.
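
The kind of loop being replaced presumably looked something like this (a reconstruction, since the original was posted earlier in the thread; the point is the two command substitutions per input line, each of which forks an echo and a cut):

Code:
# hypothetical reconstruction of the costly loop
while read LINE
do
  OLD=$(echo "$LINE" | cut -d, -f1)
  NEW=$(echo "$LINE" | cut -d, -f2)
  echo "s#\"$OLD\"#\"$NEW\"#g"
done < conversion.csv > commands.txt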

We could replace the loop with one that does not use external commands:

Code:
while IFS=, read OLD NEW
do
  echo "s#\"$OLD\"#\"$NEW\"#g"
done < conversion.csv >commands.txt
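
The generated commands.txt can then be applied to file.txt in a single pass, for example with sed -f (or, with an x line appended, fed to ex -s as in alister's approach):

Code:
sed -f commands.txt file.txt > file.new && mv file.new file.txt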

I ran some tests on a file with 55,000 entries and the results were:
  • loop with cut commands: 4 minutes 55 seconds
  • loop without cut commands: 1.7 seconds

Last edited by Scrutinizer; 06-18-2012 at 11:36 AM..