Hi jim & era: I appreciate both your solutions very much! The summary below is rather long, but it illustrates my working through the problem - please be patient! ;-)
I think that we are close; however, the final output is still not quite what I want, as described below. Basically, I need to find duplicates (where a pp b = b pp a, etc.) and summarize and count them (e.g. 2 a pp b), along with the unique lines (e.g. 1 a pp a). However, I'd like this sorting and counting to be based on three specific columns (here, $1, $2, $3 in the source file), independent of any of the other fields.
I managed to save jim's awk script as a shell script, “search_replace.awk.sh”, executable from the command line. For reference, I am using Cygwin on Windows XP Pro at work; I am an Ubuntu Linux user at home.
#!/usr/bin/bash
awk '{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)} else {printf("%s %s %s", $1, $2, $3)} for(i=4; i<=NF; i++) {printf(" %s", $i)} printf("\n") }' dummy_test_duplicates_file_2.txt | awk '{ arr[$1 $2 $3]++; if(m[$1 $2 $3]=="") {m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'
which is executed as follows:
$ sh search_replace.awk.sh
(Question: Is it possible to specify the input file from the command line, rather than within the script?)
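(One thing I might try - just a sketch, not yet tested: in a shell script, "$1" holds the first command-line argument, so the filename could be passed in at run time instead of being hard-coded. The awk program below is a placeholder standing in for the real one:)

```shell
#!/usr/bin/bash
# Sketch: "$1" is the first argument given to the script, so the input
# file can be supplied on the command line.
# Usage: sh search_replace.awk.sh dummy_test_duplicates_file_2.txt
awk '{ print NF, $0 }' "$1"   # placeholder awk program; substitute the real one
```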
This also works (same code, split over different lines)
#!/usr/bin/bash
awk '{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file_2.txt | awk ' { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'
but not this (identical code, split differently)
#!/usr/bin/bash
awk '{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file_2.txt |
awk ' { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'
$ sh search_replace.awk.sh
search_replace.awk.sh: line 5: $'\r': command not found
... Adding a backslash after the pipe ( printf("\n") }' dummy_test_duplicates_file_2.txt | \ ) should - but does not - correct the problem (and in bash a line ending in a pipe continues onto the next line anyway, so the backslash shouldn't even be needed) ...
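(A guess at the cause: the $'\r' in the error message suggests the script file was saved with Windows-style CRLF line endings - in which case a trailing backslash ends up escaping the carriage return rather than the newline. Stripping the carriage returns might be worth a try:)

```shell
# Remove carriage returns (CRLF -> LF) from the script; dos2unix, if
# installed under Cygwin, does the same thing in place.
tr -d '\r' < search_replace.awk.sh > search_replace.fixed.sh
```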
Ultimately, I modified this “search_replace.awk.sh” script by adding
BEGIN {OFS=FS="\t"}
to try to ensure tab-delimited input/output:
#!/usr/bin/bash
awk 'BEGIN {OFS=FS="\t"}
{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file_2.txt | awk ' { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'
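(As I understand it - and please correct me if I'm wrong - FS controls how input lines are split into fields, while OFS is only inserted by print, or when awk rebuilds $0 after a field assignment; output produced with printf ignores OFS entirely, which may be why adding the BEGIN block changes so little here. A quick check:)

```shell
# FS splits the input on tabs; OFS is inserted by print (and when $0 is
# rebuilt after a field assignment), but printf output is unaffected by it.
printf 'a\tb\tc\n' | awk 'BEGIN { OFS = FS = "\t" } { $1 = $1; print NF, $0 }'
```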
I also compared the output from jim's code versus that from era's awk command,
$ awk '$3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file.txt
Using my source file (to be searched) - “dummy_test_duplicates_file.txt” (21 lines), I get the following:
#!/usr/bin/bash
awk 'BEGIN {OFS=FS="\t"}
{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file.txt | awk ' { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'
$ cat dummy_test_duplicates_file.txt | sort
a gi b
a pp a
a pp b
a pp b
a pp c
a pp d
a pp e
a pp e
b pp a
d pp a
t gi u
t gi v
t gi w
t gi x
t gi y
t gi z
t pp z
v gi t
y gi t
y gi t
z gi t
$ sh search_replace.awk.sh
3 y gi t
1 u gi t
2 d pp a
1 z pp t
2 z gi t
2 v gi t
2 e pp a
1 a pp a
1 w gi t
1 b gi a
3 b pp a
1 x gi t
1 c pp a
Compare to:
$ awk '$3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file.txt
1 t gi x
1 t gi v
1 t gi y
1 t gi z
1 a pp a
2 t gi y
2 a pp b
1 t gi z
1 a pp c
1 a pp d
1 a pp b
2 a pp e
1 a gi b
1 a pp d
1 t gi u
1 t pp z
1 t gi v
1 t gi w
This also works:
$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file.txt
Interestingly, adding a second BEGIN {OFS=FS="\t"} block (this time to the second awk command),
#!/usr/bin/bash
awk 'BEGIN {OFS=FS="\t"}
{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file.txt | awk ' BEGIN {OFS=FS="\t"} { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'
gives the same output, formatted slightly differently and in a different line order:
$ sh search_replace.awk.sh
2 z gi t
2 d pp a
3 y gi t
1 c pp a
1 x gi t
3 b pp a
1 w gi t
1 a pp a
2 v gi t
1 b gi a
1 z pp t
1 u gi t
2 e pp a
$ sh search_replace.awk.sh | sort
1 a pp a
1 b gi a
1 c pp a
1 u gi t
1 w gi t
1 x gi t
1 z pp t
2 d pp a
2 e pp a
2 v gi t
2 z gi t
3 b pp a
3 y gi t
$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file.txt | sort
1 a gi b
1 a pp a
1 a pp c
1 t gi u
1 t gi w
1 t gi x
1 t pp z
2 a pp d
2 a pp e
2 t gi v
2 t gi z
3 a pp b
3 t gi y
====================
So far, this "appears" to be working great! :-)
However, I then added a fourth column (to see the effect - remember, I want to sort, count "duplicates," etc. based only on the three defined columns/fields), saved as "dummy_test_duplicates_file_3.txt":
$ cat dummy_test_duplicates_file_3.txt
a pp b This
a pp c column
a pp d contains
a pp e some
a pp b text
a pp e that
a gi b I
a pp a don't
b pp a want
d pp a to
t gi u affect
t gi v the
t gi w search/replace
t gi x operations.
t gi y
t gi z
z gi t
y gi t
v gi t
y gi t
t pp z
$ cat dummy_test_duplicates_file_3.txt | sort
a gi b I
a pp a don't
a pp b This
a pp b text
a pp c column
a pp d contains
a pp e some
a pp e that
b pp a want
d pp a to
t gi u affect
t gi v the
t gi w search/replace
t gi x operations.
t gi y
t gi z
t pp z
v gi t
y gi t
y gi t
z gi t
$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file_3.txt | sort
1 a gi b I
1 a pp a don't
1 a pp b This
1 a pp b text
1 a pp b want
1 a pp c column
1 a pp d contains
1 a pp d to
1 a pp e some
1 a pp e that
1 t gi u affect
1 t gi v
1 t gi v the
1 t gi w search/replace
1 t gi x operations.
1 t pp z
2 t gi z
3 t gi y
[Here I changed the source file in the shell script from “dummy_test_duplicates_file.txt” to “dummy_test_duplicates_file_3.txt”.]
$ sh search_replace.awk.sh | sort
1 a pp a don't
1 b gi a I
1 b pp a This
1 b pp a text
1 b pp a want
1 c pp a column
1 d pp a contains
1 d pp a to
1 e pp a some
1 e pp a that
1 u gi t affect
1 v gi t
1 v gi t the
1 w gi t search/replace
1 x gi t operations.
1 z pp t
2 z gi t
3 y gi t
This (above) really isn't what I am looking for - note that “a pp b” (or “b pp a” in the second output) is listed separately, three times, in the two different solutions (outputs) above. The contents of the fourth column ($4 in the input file) are affecting the search/replace operation.
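(One idea for restricting the comparison to the first three fields - a sketch, run here on a tiny inline sample rather than the real file: canonicalize $1/$3, then build the array key from fields 1-3 only, joined with awk's SUBSEP so adjacent fields can't run together the way plain concatenation allows. Trailing columns then never enter into the count:)

```shell
# Inline sample standing in for dummy_test_duplicates_file_3.txt;
# the key is built from fields 1-3 only, so column 4 is ignored.
printf 'a\tpp\tb\tThis\nb\tpp\ta\ttext\na\tpp\ta\tx\n' |
awk 'BEGIN { FS = "\t" }
{
  lo = $1; hi = $3
  if (hi < lo) { t = lo; lo = hi; hi = t }   # so that a pp b == b pp a
  key = lo SUBSEP $2 SUBSEP hi               # fields 1-3 only
  count[key]++
  if (!(key in rep)) rep[key] = lo " " $2 " " hi
}
END { for (k in count) print count[k], rep[k] }' | sort
```

(rep[key] could instead store $0 to keep the trailing text of the first line seen for each key.)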
Additionally - what is the purpose of the “%s” (string) printf conversion specification? It looks like each one is being filled in with a field variable (e.g. printing fields $1, $2, $3 as strings)?
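(My current understanding, for what it's worth: %s means "format the corresponding argument as a string", and each %s in the format is replaced, in order, by the next argument. A tiny check:)

```shell
# Each %s is replaced in order by the next argument ($1, $2, $3 here).
echo "x y z" | awk '{ printf("%s-%s-%s\n", $1, $2, $3) }'
```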
I also decided to change the “order” of the search fields $3, $1:
#!/usr/bin/bash
awk 'BEGIN {OFS=FS="\t"}
{ if($1 > $3) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file_3.txt | awk ' BEGIN {OFS=FS="\t"} { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'
e.g. if $1 (b) > $3 (a), print $3 (a) $2 $1 (b)
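(To double-check the intended behavior of that if/else - that only the out-of-order case gets swapped - a minimal test on two hand-made lines:)

```shell
# b pp a should become a pp b; b pp c is already in order and stays put.
printf 'b\tpp\ta\nb\tpp\tc\n' |
awk 'BEGIN { FS = "\t" }
{ if ($1 > $3) printf("%s %s %s\n", $3, $2, $1)
  else         printf("%s %s %s\n", $1, $2, $3) }'
```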
Thank you both once again for your different, unique solutions - I'd like to get this sorted out using one (or both - this is educational!) of these methods!
Sincerely, Greg :-)