Molecular biologist requires help re: search / replace script


 
# 1  
Old 04-07-2008

Hello - I was wondering if someone could help me? I have some basic knowledge of awk, etc., and can create simple scripts (e.g. a search_replace.awk file) that can be called from the command line:

$ awk -f search_replace.awk <file to be searched>

I have a tab-delimited table of data (text), essentially as follows (simplified):

a pp b
a pp c
a pp d
a pp e
a pp b
a pp e
a gi b
a pp a
b pp a
d pp a
t gi u
t gi v
t gi w
t gi x
t gi y
t gi z
z gi t
y gi t
v gi t
y gi t
t pp z

I want to be able to take each line, in succession, and search it against the entire file, removing duplicates. I know that I can easily do this using the uniq command (on a *sorted* file), but I also need to be able to identify mirror-image or reverse duplicates, e.g.

a pp b
a pp b
a pp b
b pp a
a pp b
b pp a


should be reduced to a single line,

a pp b

(since "b pp a" is 'the same' as "a pp b").

Is this clear?

Additionally, my actual file contains additional columns (fields per row); I would like to ignore (but keep) these extra fields, searching and replacing based only on (in the example above) fields $1, $2, $3. I think that it is possible to specify fields with regard to search / replace operations, etc.

Lastly (I know that I am asking a lot), it would be ideal if the output could also keep track of how many duplicated lines there were, adding a column of "weights" (1; 2; 3; 4; etc.) indicating the number of duplicates in the source file, with 1 = no duplicates, 2 = one duplicate, etc.

In the six-line example above, this would be

6 a pp b

I have played around with the command line and some simple scripts, but this is a little beyond my grasp. I'm guessing one solution would be a grep operation, piped to / from an awk or sed command, perhaps?

FYI, I am a molecular biologist / geneticist; I am trying to sort a file of perhaps 150,000-200,000 lines, each containing 7-8 fields, for loading into a data visualization / analysis program. In the example above, the first and third columns represent specific genes, with the middle (2nd) column establishing the relationship between the first and the second gene. Note that the relationship "pp" is different from "gi"; thus

a pp b

is different from

a gi b

The reason for all of this is that if I do not remove duplicate mappings (including the reverse or "mirror images," e.g. "a pp b" = "b pp a"), I get extra lines appearing in my analysis program (Cytoscape), which complicates the display (relationships between groups of genes). The reason that I asked for the "weights" is that I want to weight the edges (lines) connecting my nodes (genes) in Cytoscape, according to how many times each relationship was reported, from various assays (different types of experiments; independent analyses).

If anyone could suggest some solutions, that would be *very* much appreciated!

Thanking you all in advance,

Sincerely, Greg S. :-)
# 2  
Old 04-07-2008
Can you come up with a canonicalization for all these fields? Even if you can't keep them in canonical form, that would basically solve the immediate problem.

Let's say, for example, always sort them so $1 is alphabetically before $3. So swap any fields where $3 is before $1, then sort -u and all that.

Code:
awk '$3 < $1 { t = $1; $1 = $3;  $3 = t } { print }'

Extending this to keep counts for each canonical form in awk itself should be fairly trivial.

Code:
awk '$3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } '

Addendum: If you have additional fields you need to key on just $1$2$3 instead of $0, of course. Then you can't keep them in the weighted summary, though.

# 3  
Old 04-07-2008
Hi era - Thank you for your reply. In my example (above), the middle column (field $2) will be either "pp" or "gi" or "tf" (I didn't mention "tf," for simplicity) - at least for now.

Fields $1 and $3 will be genes that are uniquely named, sometimes containing a dash (e.g., YPR016W-A) but mostly of the form YGR110W, etc. (I work with yeast.) Each name is a unique "systematic" name (one per gene). None of the gene names contain spaces, thankfully.

I believe this is what you meant by canonicalization: Canonicalization - Wikipedia, the free encyclopedia

Could you kindly elaborate on the method(s) that you alluded to in your reply?

Thanks, era! Greg :-)

# 4  
Old 04-07-2008
Did you try the code I posted? Does it do roughly what you asked for? If not, can you tell what more you need?

Canonical representation basically means you come up with a "standard" which all other representations can be converted into. (I wrote that before looking at the first sentence in the Wikipedia article -- honest.)

What I suggested was to simply use a canonicalization which says they should be in alphabetic order, so you can compare them head to head; in other words, find a way to mask off anything which introduces an artificial difference when you know two forms to be identical. Now if you rewrite them in this way, then simple string equality can be used to test for equivalence.

The code I posted already does this, but doesn't handle lines with more than three fields. Just add a print statement in the a[...]++ block if you want to keep them, and add a frequency count at the end. You need a way to separate the frequency count from the raw data, though, so I'd just keep them in separate files and not add that print statement.
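In concrete terms, the two-file approach might look something like this (a sketch only; the file names "data.txt", "canonical.txt", and "counts.txt" are stand-ins, and it assumes tab-delimited input with the key in the first three fields):

```shell
# Stand-in sample input: gene, relation, gene, plus one extra column.
printf 'a\tpp\tb\t43\nb\tpp\ta\t43\na\tgi\tb\t42\n' > data.txt

# Pass 1: rewrite every line into canonical form (alphabetically smaller
# gene in field 1), keeping any extra fields, into its own file.
awk -F "\t" -v OFS="\t" '$3 < $1 { t = $1; $1 = $3; $3 = t } { print }' data.txt > canonical.txt

# Pass 2: count how often each canonical $1/$2/$3 key occurs; the counts
# go to a separate summary file, leaving the raw data untouched.
awk -F "\t" '{ a[$1 FS $2 FS $3]++ } END { for (k in a) print a[k], k }' canonical.txt > counts.txt
```

Here counts.txt would hold "2 a pp b" and "1 a gi b" (key fields tab-separated), while canonical.txt keeps all three input lines, extra fields included.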

PS. The Wikipedia link is slightly broken. You want Canonicalization - Wikipedia, the free encyclopedia
# 5  
Old 04-07-2008
Hello again ... Yes, I'm following what you're saying (era) - I tried out the code you suggested. I'll explore this approach further - thanks!

I'd also be interested in additional ideas (for comparison) ...

Greg :-)
# 6  
Old 04-07-2008
Quote:
# dummied up data file called filename:
csadev:/home/jmcnama> cat filename
a gi b 42
a pp a 43
a pp b 43
a pp b 43
a pp b 43
a pp b 43
a pp b 43
a pp b 43
a pp c 44
a pp d 45
a pp e 46
a pp e 46
b pp a 43
b pp a 43
b pp a 43
d pp a 50
t gi u 51
t gi u 51
t gi v 52
t gi v 52
t gi w 53
t gi w 53
t gi x 54
t gi x 54
t gi y 55
t gi y 55
t gi z 56
t gi z 56
t pp z 57
v gi t 58
v gi t 58
y gi t 59
y gi t 59
y gi t 59
z gi t 60
z gi t 60

# run the script
csadev:/home/jmcnama> t.awk
2 u gi t 51
2 d pp a 45
1 b gi a 42
4 z gi t 56
1 c pp a 44
5 y gi t 55
9 b pp a 43
1 z pp t 57
2 x gi t 54
1 a pp a 43
2 w gi t 53
4 v gi t 52
2 e pp a 46
script:
Code:
awk '{
    if ($3 > $1) { printf("%s %s %s", $3, $2, $1) }
    else { printf("%s %s %s", $1, $2, $3) }
    for (i = 4; i <= NF; i++) { printf(" %s", $i) }
    printf("\n")
}' filename | \
awk '{
    arr[$1 $2 $3]++
    if (m[$1 $2 $3] == "") { m[$1 $2 $3] = $0 }
}
END {
    for (i in arr)
        print arr[i], m[i]
}' > newfilename

# 7  
Old 04-07-2008
Just to finish that dangling remark about more fields after $3:

Code:
awk -F "\t" '$3 < $1 { t = $1; $1 = $3; $3 = t }
{ a[$1 "\t" $2 "\t" $3]++ }
END { for (k in a) { print a[k], k } } '

I added the -F "\t" to explicitly make this tab-delimited, and changed the array to key on $1 $2 $3 joined by tabs. Hope your awk can do that too.

Like I wrote above, this summarizes the counts only, and I'd keep it that way. It might be useful to convert all of your data to the canonical format, which the first script I posted should already do for you. (Maybe add the -F "\t" there too.)
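A complete run might look like this (a sketch; "input.txt" is a stand-in name, and the sample data just illustrates that a fourth column beyond the key doesn't affect the count):

```shell
# Stand-in sample: tab-delimited, with a fourth column beyond the key.
printf 'a\tpp\tb\t43\nb\tpp\ta\t43\n' > input.txt

# Same script as above: key on $1/$2/$3 joined by tabs.
awk -F "\t" '$3 < $1 { t = $1; $1 = $3; $3 = t }
{ a[$1 "\t" $2 "\t" $3]++ }
END { for (k in a) { print a[k], k } }' input.txt
# prints: 2 a pp b   (the key fields are tab-separated)
```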

Hope this helps.