Molecular biologist requires help re: search / replace script Post: 302183287

Post 302183287 by gstuart on Tuesday 8th of April 2008 04:50:01 PM

Hi jim & era: I appreciate both your solutions very much! The summary below is rather long, but it illustrates my working through the problem - please be patient! ;-)

I think that we are close; however, the final output is still not quite what I want, as described below. Basically, I need to find duplicates (a pp b = b pp a, etc.) and summarize and count them (2 a pp b), including the unique lines (1 a pp a). However, I'd like this sorting, etc. to be based on three specific columns (here, $1, $2, $3 - in the source file), independently of any of the other fields.

I managed to save jim's awk script as a shell script, "search_replace.awk.sh", executable from the command line. For reference, I am using Cygwin on Windows XP Pro at work; I am an Ubuntu Linux user at home.

#!/usr/bin/bash
awk '{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)} else {printf("%s %s %s", $1, $2, $3)} for(i=4; i<=NF; i++) {printf(" %s", $i)} printf("\n") }' dummy_test_duplicates_file_2.txt | awk '{ arr[$1 $2 $3]++; if(m[$1 $2 $3]=="") {m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'

which is executed as follows:

$ sh search_replace.awk.sh

(Question: Is it possible to specify the input file from the command line, rather than within the script?)
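One possible answer, as a minimal sketch: inside a shell script, "$1" expands to the first command-line argument, so the file name can be passed in rather than hard-coded. The awk code below is jim's, unchanged; the function name swap_and_count is a hypothetical helper introduced only for illustration.

```shell
#!/bin/sh
# Sketch: read the input file name from the first command-line argument.
# swap_and_count is a hypothetical helper name; the awk code is jim's, unchanged.
swap_and_count() {
    awk '{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
           else {printf("%s %s %s", $1, $2, $3)}
           for(i=4; i<=NF; i++) {printf(" %s", $i)}
           printf("\n") }' "$1" |
    awk '{ arr[$1 $2 $3]++; if(m[$1 $2 $3]=="") {m[$1 $2 $3]=$0} }
         END { for(i in arr) {print arr[i], m[i]} }'
}

# "$1" here is the script's own first argument, e.g.
#   sh search_replace.awk.sh dummy_test_duplicates_file_2.txt
if [ "$#" -ge 1 ]; then
    swap_and_count "$1"
fi
```

Run this way, the same script can be pointed at any of the test files without editing it.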

This also works (same code, split over different lines)

#!/usr/bin/bash
awk '{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file_2.txt | awk ' { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'

but not this (identical code, split differently)

#!/usr/bin/bash
awk '{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file_2.txt |
awk ' { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'

$ sh search_replace.awk.sh
search_replace.awk.sh: line 5: $'\r': command not found

... Adding a backslash ( printf("\n") }' dummy_test_duplicates_file_2.txt | \ ) should - but does not - correct the problem ...
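Note: $'\r' in that message is bash's notation for a carriage return, so the script file probably has Windows (CRLF) line endings, which is common when a file is edited on Windows and run under Cygwin. A line that ends in "|" followed by a carriage return leaves the shell treating the carriage return as a command name. A minimal sketch of stripping the carriage returns with tr, assuming that is the cause here (the directory and file names are hypothetical and used only for the demonstration):

```shell
# Assumption: the script was saved with Windows (CRLF) line endings.
# demo_dir and the file names below are hypothetical, for illustration only.
demo_dir=$(mktemp -d)

# Build a tiny two-line script whose first line ends in "|" plus a carriage return.
printf 'echo one |\r\ncat\r\n' > "$demo_dir/crlf.sh"

# Delete every carriage return and write the result to a new file.
tr -d '\r' < "$demo_dir/crlf.sh" > "$demo_dir/fixed.sh"

# The cleaned copy now runs as a normal two-line pipeline.
fixed_output=$(sh "$demo_dir/fixed.sh")
echo "$fixed_output"
```

For the real script the equivalent would be tr -d '\r' < search_replace.awk.sh > some_new_file.sh, then running the new file.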

Ultimately, I modified this "search_replace.awk.sh" script by adding

BEGIN {OFS=FS="\t"}

to try to ensure tab-delimited input/output:

#!/usr/bin/bash
awk 'BEGIN {OFS=FS="\t"}
{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file_2.txt | awk ' { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'

I also compared the output from jim's code versus that from era's awk command,

$ awk '$3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file.txt

Using my source file (to be searched) - "dummy_test_duplicates_file.txt" (21 lines), I get the following:

#!/usr/bin/bash
awk 'BEGIN {OFS=FS="\t"}
{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file.txt | awk ' { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'

$ cat dummy_test_duplicates_file.txt | sort
a gi b
a pp a
a pp b
a pp b
a pp c
a pp d
a pp e
a pp e
b pp a
d pp a
t gi u
t gi v
t gi w
t gi x
t gi y
t gi z
t pp z
v gi t
y gi t
y gi t
z gi t

$ sh search_replace.awk.sh
3 y gi t
1 u gi t
2 d pp a
1 z pp t
2 z gi t
2 v gi t
2 e pp a
1 a pp a
1 w gi t
1 b gi a
3 b pp a
1 x gi t
1 c pp a

Compare to:

$ awk '$3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file.txt
1 t gi x
1 t gi v
1 t gi y
1 t gi z
1 a pp a
2 t gi y
2 a pp b
1 t gi z
1 a pp c
1 a pp d
1 a pp b
2 a pp e
1 a gi b
1 a pp d
1 t gi u
1 t pp z
1 t gi v
1 t gi w

This also works:

$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file.txt

Interestingly, adding a second ( BEGIN {OFS=FS="\t"} ),

#!/usr/bin/bash
awk 'BEGIN {OFS=FS="\t"}
{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file.txt | awk ' BEGIN {OFS=FS="\t"} { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'

gives the same output, formatted slightly differently and in a different line order:

$ sh search_replace.awk.sh
2 z gi t
2 d pp a
3 y gi t
1 c pp a
1 x gi t
3 b pp a
1 w gi t
1 a pp a
2 v gi t
1 b gi a
1 z pp t
1 u gi t
2 e pp a

$ sh search_replace.awk.sh | sort
1 a pp a
1 b gi a
1 c pp a
1 u gi t
1 w gi t
1 x gi t
1 z pp t
2 d pp a
2 e pp a
2 v gi t
2 z gi t
3 b pp a
3 y gi t

$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file.txt | sort
1 a gi b
1 a pp a
1 a pp c
1 t gi u
1 t gi w
1 t gi x
1 t pp z
2 a pp d
2 a pp e
2 t gi v
2 t gi z
3 a pp b
3 t gi y

====================

So far, this "appears" to be working great! :-)

However, when I add a fourth column (to see the effect - remember I want to sort, count "duplicates," etc. based only on three defined columns / fields), saved as "dummy_test_duplicates_file_3.txt"

$ cat dummy_test_duplicates_file_3.txt
a pp b This
a pp c column
a pp d contains
a pp e some
a pp b text
a pp e that
a gi b I
a pp a don't
b pp a want
d pp a to
t gi u affect
t gi v the
t gi w search/replace
t gi x operations.
t gi y
t gi z
z gi t
y gi t
v gi t
y gi t
t pp z

$ cat dummy_test_duplicates_file_3.txt | sort
a gi b I
a pp a don't
a pp b This
a pp b text
a pp c column
a pp d contains
a pp e some
a pp e that
b pp a want
d pp a to
t gi u affect
t gi v the
t gi w search/replace
t gi x operations.
t gi y
t gi z
t pp z
v gi t
y gi t
y gi t
z gi t

$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file_3.txt | sort
1 a gi b I
1 a pp a don't
1 a pp b This
1 a pp b text
1 a pp b want
1 a pp c column
1 a pp d contains
1 a pp d to
1 a pp e some
1 a pp e that
1 t gi u affect
1 t gi v
1 t gi v the
1 t gi w search/replace
1 t gi x operations.
1 t pp z
2 t gi z
3 t gi y

[Here I changed the source file in the shell script from "dummy_test_duplicates_file.txt" to "dummy_test_duplicates_file_3.txt".]

$ sh search_replace.awk.sh | sort
1 a pp a don't
1 b gi a I
1 b pp a This
1 b pp a text
1 b pp a want
1 c pp a column
1 d pp a contains
1 d pp a to
1 e pp a some
1 e pp a that
1 u gi t affect
1 v gi t
1 v gi t the
1 w gi t search/replace
1 x gi t operations.
1 z pp t
2 z gi t
3 y gi t

This (above) really isn't what I am looking for. Note that "a pp b" (or "b pp a" in the second example) is listed separately, 3 times, in the two different outputs above. The contents of the fourth column ($4 in the input file) are affecting the search / replace operation.
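A possible reason, judging from the code alone: era's one-liner indexes its array on $0, the whole line, so the fourth column becomes part of the key. In the two-stage script the first stage prints fields separated by spaces, so a second stage that splits on tabs may see each line as a single field. A minimal sketch that builds the key from the three fields only, assuming tab-delimited input (count_by_three, lo, hi, key and count are hypothetical names, and this version drops the extra columns from the output):

```shell
# Sketch: count lines by fields 1-3 only, treating "b pp a" and "a pp b" as equal.
# Assumes tab-delimited input; count_by_three is a hypothetical helper name.
count_by_three() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        # Put the smaller of $1/$3 first so both orderings share one key.
        if ($3 < $1) { lo = $3; hi = $1 } else { lo = $1; hi = $3 }
        key = lo OFS $2 OFS hi
        count[key]++            # fields 4..NF never enter the key
    }
    END { for (k in count) print count[k], k }' "$1"
}
```

It would be called as count_by_three dummy_test_duplicates_file_3.txt | sort, since awk does not promise any particular order for "for (k in count)".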

Additionally - What is the purpose of the "%s" (string) printf specification? It looks like each "%s" is filled in by one of the field variables that follow (e.g. printing fields $1, $2, $3 as strings)?
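For what it is worth, in awk's printf each "%s" is a placeholder that is replaced, in order, by the next argument formatted as a string; other conversions such as "%d" format the argument as an integer instead. A small sketch (the variable names s_demo and width_demo are hypothetical):

```shell
# Each %s is replaced by the next argument, left to right.
s_demo=$(printf 'b\tpp\ta\n' | awk -F'\t' '{ printf("%s %s %s\n", $3, $2, $1) }')
echo "$s_demo"        # a pp b

# %5s pads the string to a width of 5; %d prints a number as an integer.
width_demo=$(awk 'BEGIN { printf("[%s] [%5s] [%d]\n", "a", "b", 3.9) }')
echo "$width_demo"    # [a] [    b] [3]
```

So printf("%s %s %s", $3, $2, $1) prints field 3, field 2 and field 1 separated by single spaces, regardless of OFS.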

I also decided to change the "order" of the search fields $3, $1:

#!/usr/bin/bash
awk 'BEGIN {OFS=FS="\t"}
{ if($1 > $3) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file_3.txt | awk ' BEGIN {OFS=FS="\t"} { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'

e.g. if $1 (b) > $3 (a), print $3 (a) $2 $1 (b)

Thank you both once again for your different, unique solutions - I'd like to get this sorted out using one method or both (this is educational)!

Sincerely, Greg :-)
