Molecular biologist requires help re: search / replace script


 
# 8  
Old 04-07-2008
jim: I don't understand. What happened to z gi t 60 for example?

greg: just to clarify, do you want weights for the triplets, or for the entire lines? I assumed triplets, but I guess jim understood differently.

Last edited by era; 04-07-2008 at 05:23 PM. Reason: Ah, Jim's canonicalization is reverse alphabetic, so some of the ones I wondered about are explained.
# 9  
Old 04-08-2008
Hi jim & era: I appreciate both your solutions very much! The summary below is rather long, but it illustrates my working through the problem - please be patient! ;-)

I think that we are close; however, the final output is still not quite what I want, as described below. Basically, I need to find duplicates (a pp b = b pp a, etc.) and summarize and count them (2 a pp b), including the unique lines (1 a pp a). However, I'd like the sorting and counting to be based only on three specific columns (here, $1, $2, $3 in the source file), independent of any of the other fields.

I managed to save jim's awk pipeline as a shell script, “search_replace.awk.sh”, executable from the command line. For reference, I am using Cygwin on Windows XP Pro at work; I am an Ubuntu Linux user at home.

#!/usr/bin/bash
awk '{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)} else {printf("%s %s %s", $1, $2, $3)} for(i=4; i<=NF; i++) {printf(" %s", $i)} printf("\n") }' dummy_test_duplicates_file_2.txt | awk '{ arr[$1 $2 $3]++; if(m[$1 $2 $3]=="") {m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'

which is executed as follows:

$ sh search_replace.awk.sh

(Question: Is it possible to specify the input file from the command line, rather than within the script?)

This also works (same code, split over different lines)

#!/usr/bin/bash
awk '{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file_2.txt | awk ' { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'

but not this (identical code, split differently)

#!/usr/bin/bash
awk '{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file_2.txt |
awk ' { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'

$ sh search_replace.awk.sh
search_replace.awk.sh: line 5: $'\r': command not found

... Adding a backslash for line continuation ( printf("\n") }' dummy_test_duplicates_file_2.txt | \ ) should correct the problem - but it does not ...

Ultimately, I modified this “search_replace.awk.sh” script by adding

BEGIN {OFS=FS="\t"}

to try to ensure tab-delimited input/output:

#!/usr/bin/bash
awk 'BEGIN {OFS=FS="\t"}
{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file_2.txt | awk ' { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'
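
(Side note: as far as I understand it, OFS only affects output that awk assembles itself - print with comma-separated arguments, or $0 after a field assignment - whereas printf writes its format string literally. So the first awk stage above still emits space-separated fields even with OFS="\t". A quick, illustrative check:)

Code:
$ echo "a b" | awk 'BEGIN {OFS="-"} { print $1, $2; printf("%s %s\n", $1, $2) }'
a-b
a b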

I also compared the output from jim's code versus that from era's awk command,

$ awk '$3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file.txt

Using my source file (to be searched) - “dummy_test_duplicates_file.txt” (21 lines), I get the following:

#!/usr/bin/bash
awk 'BEGIN {OFS=FS="\t"}
{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file.txt | awk ' { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'


$ cat dummy_test_duplicates_file.txt | sort
a gi b
a pp a
a pp b
a pp b
a pp c
a pp d
a pp e
a pp e
b pp a
d pp a
t gi u
t gi v
t gi w
t gi x
t gi y
t gi z
t pp z
v gi t
y gi t
y gi t
z gi t


$ sh search_replace.awk.sh
3 y gi t
1 u gi t
2 d pp a
1 z pp t
2 z gi t
2 v gi t
2 e pp a
1 a pp a
1 w gi t
1 b gi a
3 b pp a
1 x gi t
1 c pp a


Compare to:

$ awk '$3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file.txt
1 t gi x
1 t gi v
1 t gi y
1 t gi z
1 a pp a
2 t gi y
2 a pp b
1 t gi z
1 a pp c
1 a pp d
1 a pp b
2 a pp e
1 a gi b
1 a pp d
1 t gi u
1 t pp z
1 t gi v
1 t gi w

This also works:

$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file.txt
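
(If I am reading the awk behaviour correctly, this matters because assigning to a field makes awk rebuild $0 using OFS. Without OFS=FS="\t", the swapped lines come out space-separated while untouched lines keep their original tabs, so the same triplet can land in two different a[$0] buckets - which would explain why, e.g., "1 t gi v" appears twice in the earlier listing, assuming the input file really is tab-delimited. A quick check of the rebuild behaviour:)

Code:
$ printf 'x\ty\tz\n' | awk '{ $1 = $1; print ($0 == "x y z") }'
1
$ printf 'x\ty\tz\n' | awk 'BEGIN {OFS=FS="\t"} { $1 = $1; print ($0 == "x\ty\tz") }'
1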

Interestingly, adding a second BEGIN {OFS=FS="\t"} (this time to the second awk command),

#!/usr/bin/bash
awk 'BEGIN {OFS=FS="\t"}
{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file.txt | awk ' BEGIN {OFS=FS="\t"} { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'

gives the same output, formatted slightly differently and in a different line order:

$ sh search_replace.awk.sh
2 z gi t
2 d pp a
3 y gi t
1 c pp a
1 x gi t
3 b pp a
1 w gi t
1 a pp a
2 v gi t
1 b gi a
1 z pp t
1 u gi t
2 e pp a

$ sh search_replace.awk.sh | sort
1 a pp a
1 b gi a
1 c pp a
1 u gi t
1 w gi t
1 x gi t
1 z pp t
2 d pp a
2 e pp a
2 v gi t
2 z gi t
3 b pp a
3 y gi t

$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file.txt | sort
1 a gi b
1 a pp a
1 a pp c
1 t gi u
1 t gi w
1 t gi x
1 t pp z
2 a pp d
2 a pp e
2 t gi v
2 t gi z
3 a pp b
3 t gi y

====================

So far, this "appears" to be working great! :-)


However, the behaviour changes when I add a fourth column (to see the effect - remember, I want to sort and count "duplicates" based only on the three defined columns / fields), saved as "dummy_test_duplicates_file_3.txt":

$ cat dummy_test_duplicates_file_3.txt
a pp b This
a pp c column
a pp d contains
a pp e some
a pp b text
a pp e that
a gi b I
a pp a don't
b pp a want
d pp a to
t gi u affect
t gi v the
t gi w search/replace
t gi x operations.
t gi y
t gi z
z gi t
y gi t
v gi t
y gi t
t pp z


$ cat dummy_test_duplicates_file_3.txt | sort
a gi b I
a pp a don't
a pp b This
a pp b text
a pp c column
a pp d contains
a pp e some
a pp e that
b pp a want
d pp a to
t gi u affect
t gi v the
t gi w search/replace
t gi x operations.
t gi y
t gi z
t pp z
v gi t
y gi t
y gi t
z gi t


$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file_3.txt | sort
1 a gi b I
1 a pp a don't
1 a pp b This
1 a pp b text
1 a pp b want
1 a pp c column
1 a pp d contains
1 a pp d to
1 a pp e some
1 a pp e that
1 t gi u affect
1 t gi v
1 t gi v the
1 t gi w search/replace
1 t gi x operations.
1 t pp z
2 t gi z
3 t gi y

[Here I changed the source file in the shell script from “dummy_test_duplicates_file.txt” to “dummy_test_duplicates_file_3.txt”.]

$ sh search_replace.awk.sh | sort
1 a pp a don't
1 b gi a I
1 b pp a This
1 b pp a text
1 b pp a want
1 c pp a column
1 d pp a contains
1 d pp a to
1 e pp a some
1 e pp a that
1 u gi t affect
1 v gi t
1 v gi t the
1 w gi t search/replace
1 x gi t operations.
1 z pp t
2 z gi t
3 y gi t


This (above) really isn't what I am looking for - note that “a pp b” (or “b pp a” in the second output) is listed separately, three times, in each of the two solutions above. The contents of the fourth column ($4 in the input file) are affecting the search / replace operation.

Additionally - what is the purpose of the "%s" string specification in printf? It looks like each %s is matched to a field variable (e.g. printing fields $1, $2, $3 as strings)?

I also decided to change the “order” of the search fields $3, $1:

#!/usr/bin/bash
awk 'BEGIN {OFS=FS="\t"}
{ if($1 > $3) {printf("%s %s %s", $3, $2, $1)}
else {printf("%s %s %s", $1, $2, $3)}
for(i=4; i<=NF; i++) {printf(" %s", $i)}
printf("\n") }' dummy_test_duplicates_file_3.txt | awk ' BEGIN {OFS=FS="\t"} { arr[$1 $2 $3]++; if(m[$1 $2 $3]=="")
{m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'

e.g. if $1 (b) > $3 (a), print $3 (a) $2 $1 (b)

Thank you both once again for your different, unique solutions - I'd like to get this sorted out using one (or both - this is educational!) of these methods!

Sincerely, Greg :-)
# 10  
Old 04-08-2008
Quote:
Originally Posted by gstuart
(Question: Is it possible to specify the input file from the command line, rather than within the script?)
Absolutely. The command-line arguments are available in the shell script as $1, $2, $3 etc.; "$@" (with the quotes) is all the arguments. This is confusing if you've been playing a lot with awk, where the same notation refers to the fields of the current line.
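
For example, an untested sketch of your script taking the data file as its first argument (note that "$1" outside the single-quoted awk program is the shell's first argument, while $1 inside it is still awk's first field):

Code:
#!/usr/bin/bash
# usage: sh search_replace.awk.sh dummy_test_duplicates_file_2.txt
awk '{ if($3 > $1) {printf("%s %s %s", $3, $2, $1)} else {printf("%s %s %s", $1, $2, $3)}; for(i=4; i<=NF; i++) {printf(" %s", $i)}; printf("\n") }' "$1" | awk '{ arr[$1 $2 $3]++; if(m[$1 $2 $3]=="") {m[$1 $2 $3]=$0} } END { for(i in arr) {print arr[i], m[i]} }'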

Quote:
Originally Posted by gstuart
$ sh search_replace.awk.sh
search_replace.awk.sh: line 5: $'\r': command not found
I would speculate that this is a Cygwin problem. \r is the DOS carriage return character. Maybe you could save the script with Unix line endings and try again.

Quote:
Originally Posted by gstuart
Code:
$ awk 'BEGIN {OFS=FS="\t"}
$3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ }
END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file_3.txt | sort
1	a	gi	b	I
1	a	pp	a	don't
1	a	pp	b	This
1	a	pp	b	text
1	a	pp	b	want

Like I think I was saying before, replace $0 with $1 $2 $3 (or even $1 "\t" $2 "\t" $3) in a[$0]++
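
In other words, something along these lines (untested here, but it keys the count on the three columns only):

Code:
awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$1 "\t" $2 "\t" $3]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file_3.txt | sort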

Quote:
Additionally - what is the purpose of the "%s" string specification in printf? It looks like each %s is matched to a field variable (e.g. printing fields $1, $2, $3 as strings)?
Correct.
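
To illustrate, each %s in the format string consumes the next argument and prints it as a string, verbatim:

Code:
$ echo "a pp b" | awk '{ printf("%s / %s / %s\n", $1, $2, $3) }'
a / pp / b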

I hope I managed to address all your questions. Maybe an executive summary would help (-:
# 11  
Old 04-08-2008
Run dos2unix on that file; it may have been constructed under a DOS/Windows environment. You may find it reacts better.
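
For example (assuming dos2unix is available in your Cygwin installation; plain tr works too):

Code:
$ dos2unix search_replace.awk.sh
$ tr -d '\r' < search_replace.awk.sh > search_replace_unix.sh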
# 12  
Old 04-09-2008
Wednesday April 09, 2008

Quote: Originally Posted by gstuart
Code:

$ awk 'BEGIN {OFS=FS="\t"}
$3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ }
END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file_3.txt | sort
1 a gi b I
1 a pp a don't
1 a pp b This
1 a pp b text
1 a pp b want

[ Reply - era: ] "Like I think I was saying before, replace $0 with $1 $2 $3 (or even $1 "\t" $2 "\t" $3) in a[$0]++"
---------------------
Wow - Thanks era! This is wonderful - I'm not sure that I would have figured this out, in a timely manner! I'm going to read up a bit on awk and arrays. I think that this will be *very* helpful!

$ cat dummy_test_duplicates_file_3.txt
a pp b This
a pp c column
a pp d contains
a pp e some
a pp b text
a pp e that
a gi b I
a pp a don't
b pp a want
d pp a to
t gi u affect
t gi v the
t gi w search/replace
t gi x operations.
t gi y
t gi z
z gi t
y gi t
v gi t
y gi t
t pp z

$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file_3.txt | sort
1 a gi b I
1 a pp a don't
1 a pp b This
1 a pp b text
1 a pp b want
1 a pp c column
1 a pp d contains
1 a pp d to
1 a pp e some
1 a pp e that
1 t gi u affect
1 t gi v
1 t gi v the
1 t gi w search/replace
1 t gi x operations.
1 t pp z
2 t gi z
3 t gi y

If I understand this correctly, extending the array key to four or more fields (in this example) “screws up” the search for duplicates, because the remaining fields are not identical. Short of concatenating these fields, e.g. from the above output,

1 a pp b This
1 a pp b text
1 a pp b want

as

3 a pp b This; text; want

I don't see any “simple” work-around (?!) ... However, I can live with this, as I am critically interested in finding / counting / parsing duplicates among the $1, $2, $3 fields (in the input file: Gene A, interaction type, Gene B), and having the output list the unique relationships with the duplicate counts (“weights”)! This is perfectly acceptable to me, and the code provided by era (modified slightly, here),

awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$1 "\t" $2 "\t" $3]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file_3.txt | sort > out.tab

is simple and elegant! :-)
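
(That said, just for completeness, here is my rough, untested sketch of the concatenation idea - key on the first three tab-separated columns and collect any fourth-column text, joined by "; "; triplets with no annotation simply get an empty last column:)

Code:
awk 'BEGIN {OFS=FS="\t"}
$3 < $1 { t = $1; $1 = $3; $3 = t }
{ k = $1 "\t" $2 "\t" $3                 # canonicalized triplet as the key
  a[k]++                                 # count duplicates
  if ($4 != "") notes[k] = (notes[k] == "" ? $4 : notes[k] "; " $4) }
END { for (k in a) print a[k], k, notes[k] }' dummy_test_duplicates_file_3.txt | sort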

This is my understanding of era's code:

BEGIN {OFS=FS="\t"}                     # output and input field separators = tabs

$3 < $1 { t = $1; $1 = $3; $3 = t }     # if $3 < $1, swap them so the smaller gene name comes first,
                                        # e.g. "b gi a" ---> "a gi b" (one canonical form per pair)

{ a[$1 "\t" $2 "\t" $3]++ }             # use the canonicalized triplet as the array key;
                                        # ++ adds 1 to the count stored under that key, so
                                        # duplicates accumulate as each line is read

END { for (k in a) { print a[k], k } }  # "for (k in a)" loops over the array keys themselves (it is not
                                        # a C-style initialization/condition/increment for); k = each
                                        # triplet, a[k] = how many times that triplet was seen

# input file: dummy_test_duplicates_file_3.txt; piped to sort for display
# (or " > output_file.txt" to save the output directly to a file)

How is this actually counting the “duplicates”? *Following* that increment, the array itself must be keeping track of the counts, keyed on each triplet ... I need to read up on awk and arrays ...
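
(A minimal demonstration of the counting, separate from this data - each distinct key gets its own slot in the associative array, and ++ bumps that slot once per matching line; the order of "for (k in a)" is not guaranteed, hence the trailing sort in the real command:)

Code:
$ printf 'x\nx\ny\n' | awk '{ count[$0]++ } END { for (k in count) print count[k], k }'
2 x
1 y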

$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$1 "\t" $2 "\t" $3]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file_3.txt | sort
1 a gi b
1 a pp a
1 a pp c
1 t gi u
1 t gi w
1 t gi x
1 t pp z
2 a pp d
2 a pp e
2 t gi v
2 t gi z
3 a pp b
3 t gi y

$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$1 "\t" $2 "\t" $3]++ } END { for (k in a) { print a[k] } } ' dummy_test_duplicates_file_3.txt | sort
1
1
1
1
1
1
1
2
2
2
2
3
3

$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$1 "\t" $2 "\t" $3]++ } END { for (k in a) { print k } } ' dummy_test_duplicates_file_3.txt | sort
a gi b
a pp a
a pp b
a pp c
a pp d
a pp e
t gi u
t gi v
t gi w
t gi x
t gi y
t gi z
t pp z

$ awk 'BEGIN {OFS=FS="\t"} $3 < $1 { t = $1; $1 = $3; $3 = t } { a[$1 "\t" $2 "\t" $3]++ } END { for (k in a) { print a[k], k } } ' dummy_test_duplicates_file_3.txt | sort > out.tab

$ cat out.tab
1 a gi b
1 a pp a
1 a pp c
1 t gi u
1 t gi w
1 t gi x
1 t pp z
2 a pp d
2 a pp e
2 t gi v
2 t gi z
3 a pp b
3 t gi y

NOTE that the print command (above) adds a “new” field, in the first column position - the frequency (duplicate) count, or “weight”:

$ awk ' { print ($0) } ' out.tab
1 a gi b
1 a pp a
1 a pp c
1 t gi u
1 t gi w
1 t gi x
1 t pp z
2 a pp d
2 a pp e
2 t gi v
2 t gi z
3 a pp b
3 t gi y

$ awk ' { print ($1) } ' out.tab
1
1
1
1
1
1
1
2
2
2
2
3
3

$ awk ' { print ($2) } ' out.tab
a
a
a
t
t
t
t
a
a
t
t
a
t

$ awk ' { print ($3) } ' out.tab
gi
pp
pp
gi
gi
gi
pp
pp
pp
gi
gi
pp
gi

$ awk ' { print ($4) } ' out.tab
b
a
c
u
w
x
z
d
e
v
z
b
y

$ awk ' { print ($5) } ' out.tab














$ awk ' { print ($"[1-NF]") } ' out.tab
1 a gi b
1 a pp a
1 a pp c
1 t gi u
1 t gi w
1 t gi x
1 t pp z
2 a pp d
2 a pp e
2 t gi v
2 t gi z
3 a pp b
3 t gi y

THANK YOU all, once again, for your help - very much appreciated.

Sincerely, Greg S. :-)