Molecular biologist requires help re: search / replace script


 
# 1  
Old 04-07-2008

Hello - I was wondering if someone could help me? I have some basic knowledge of awk, etc., and can create simple scripts (e.g. a search_replace.awk file) that can be called from the command line:

$ awk -f search_replace.awk <file to be searched>

I have a tab-delimited table of data (text), essentially as follows (simplified):

a pp b
a pp c
a pp d
a pp e
a pp b
a pp e
a gi b
a pp a
b pp a
d pp a
t gi u
t gi v
t gi w
t gi x
t gi y
t gi z
z gi t
y gi t
v gi t
y gi t
t pp z

I want to be able to take each line, in succession, and search it against the entire file, removing duplicates. I know that I can easily do this using the uniq command (on a *sorted* file), but I also need to be able to identify mirror-image or reverse duplicates, e.g.

a pp b
a pp b
a pp b
b pp a
a pp b
b pp a


should be reduced to a single line,

a pp b

(since "b pp a" is 'the same' as "a pp b").

Is this clear?

Additionally, my actual file contains additional columns (fields per row); I would like to ignore (but keep) these extra fields, searching and replacing based only on (in the example above) fields $1, $2, $3. I think that it is possible to specify fields with regard to search / replace operations, etc.

Lastly (I know that I am asking a lot), it would be ideal if the output could also keep track of how many duplicated lines there were, adding a column of "weights" (1; 2; 3; 4; etc.) indicating the number of duplicates in the source file, with 1 = no duplicates, 2 = one duplicate, etc.

In the six-line example above, this would be

6 a pp b

I have played around with the command line and some simple scripts, but this is a little beyond my grasp. I'm guessing one solution would be a grep operation, piped to / from an awk or sed command, perhaps?

FYI, I am a molecular biologist / geneticist; I am trying to sort a file of perhaps 150,000-200,000 lines, each containing 7-8 fields, for loading into a data visualization / analysis program. In the example above, the first and third columns represent specific genes, with the middle (2nd) column establishing the relationship between the first and the second gene. Note that the relationship "pp" is different from "gi"; thus

a pp b

is different from

a gi b

The reason for all of this is that if I do not remove duplicate mappings (including the reverse or "mirror images," e.g. "a pp b" = "b pp a"), I get extra lines appearing in my analysis program (Cytoscape), which complicates the display (relationships between groups of genes). The reason that I asked for the "weights" is that I want to weight the edges (lines) connecting my nodes (genes) in Cytoscape, according to how many times each relationship was reported, from various assays (different types of experiments; independent analyses).

If anyone could suggest some solutions, that would be *very* much appreciated!

Thanking you all in advance,

Sincerely, Greg S. :-)
# 2  
Old 04-07-2008
Can you come up with a canonicalization for all these fields? Even if you can't keep them in canonical form, that would basically solve the immediate problem.

Let's say, for example, always sort them so $1 is alphabetically before $3. So swap any fields where $3 is before $1, then sort -u and all that.

Code:
awk '$3 < $1 { t = $1; $1 = $3;  $3 = t } { print }'

Extending this to keep counts for each canonical form in awk itself should be fairly trivial.

Code:
awk '$3 < $1 { t = $1; $1 = $3; $3 = t } { a[$0]++ } END { for (k in a) { print a[k], k } } '

Addendum: If you have additional fields you need to key on just $1$2$3 instead of $0, of course. Then you can't keep them in the weighted summary, though.

# 3  
Old 04-07-2008
Hi era - Thank you for your reply. In my example (above), the middle column (field $2) will be either "pp" or "gi" or "tf" (I didn't mention "tf," for simplicity) - at least for now.

Fields $1 and $3 will be genes that are uniquely named, sometimes containing a dash (e.g., YPR016W-A) but mostly of the form YGR110W, etc. (I work with yeast.) Each name is a unique "systematic" name (one per gene). None of the gene names contain spaces, thankfully.

I believe this is what you meant by canonicalization: Canonicalization - Wikipedia, the free encyclopedia

Could you kindly elaborate on the method(s) that you alluded to in your reply?

Thanks, era! Greg :-)

# 4  
Old 04-07-2008
Did you try the code I posted? Does it do roughly what you asked for? If not, can you tell what more you need?

Canonical representation basically means you come up with a "standard" which all other representations can be converted into. (I wrote that before looking at the first sentence in the Wikipedia article -- honest.)

What I suggested was to simply use a canonicalization which says they should be in alphabetic order, so you can compare them head to head; in other words, find a way to mask off anything which introduces an artificial difference when you know two forms to be identical. Now if you rewrite them in this way, then simple string equality can be used to test for equivalence.

The code I posted already does this, but doesn't handle lines with more than three fields. Just add a print statement in the a[...]++ block if you want to keep them, and add a frequency count at the end. You need a way to separate the frequency count from the raw data, though, so I'd just keep them in separate files and not add that print statement.
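In concrete terms, the two-file approach might look something like this (a sketch only; the file names "data.txt", "canonical.txt", and "counts.txt" are stand-ins, and it assumes tab-delimited input with the key in the first three fields):

```shell
# Stand-in sample input: gene, relation, gene, plus one extra column.
printf 'a\tpp\tb\t43\nb\tpp\ta\t43\na\tgi\tb\t42\n' > data.txt

# Pass 1: rewrite every line into canonical form (alphabetically smaller
# gene in field 1), keeping any extra fields, into its own file.
awk -F "\t" -v OFS="\t" '$3 < $1 { t = $1; $1 = $3; $3 = t } { print }' data.txt > canonical.txt

# Pass 2: count how often each canonical $1/$2/$3 key occurs; the counts
# go to a separate summary file, leaving the raw data untouched.
awk -F "\t" '{ a[$1 FS $2 FS $3]++ } END { for (k in a) print a[k], k }' canonical.txt > counts.txt
```

Here counts.txt would hold "2 a pp b" and "1 a gi b" (key fields tab-separated), while canonical.txt keeps all three input lines, extra fields included.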

PS. The Wikipedia link is slightly broken. You want Canonicalization - Wikipedia, the free encyclopedia
# 5  
Old 04-07-2008
Hello again ... Yes, I'm following what you're saying (era) - I tried out the code you suggested. I'll explore this approach further - thanks!

I'd also be interested in additional ideas (for comparison) ...

Greg :-)
# 6  
Old 04-07-2008
Quote:
# dummied up data file called filename:
csadev:/home/jmcnama> cat filename
a gi b 42
a pp a 43
a pp b 43
a pp b 43
a pp b 43
a pp b 43
a pp b 43
a pp b 43
a pp c 44
a pp d 45
a pp e 46
a pp e 46
b pp a 43
b pp a 43
b pp a 43
d pp a 50
t gi u 51
t gi u 51
t gi v 52
t gi v 52
t gi w 53
t gi w 53
t gi x 54
t gi x 54
t gi y 55
t gi y 55
t gi z 56
t gi z 56
t pp z 57
v gi t 58
v gi t 58
y gi t 59
y gi t 59
y gi t 59
z gi t 60
z gi t 60

# run the script
csadev:/home/jmcnama> t.awk
2 u gi t 51
2 d pp a 45
1 b gi a 42
4 z gi t 56
1 c pp a 44
5 y gi t 55
9 b pp a 43
1 z pp t 57
2 x gi t 54
1 a pp a 43
2 w gi t 53
4 v gi t 52
2 e pp a 46
script:
Code:
awk '{
    if ($3 > $1) { printf("%s %s %s", $3, $2, $1) }
    else { printf("%s %s %s", $1, $2, $3) }
    for (i = 4; i <= NF; i++) { printf(" %s", $i) }
    printf("\n")
}' filename | \
awk '{
    arr[$1 $2 $3]++
    if (m[$1 $2 $3] == "") { m[$1 $2 $3] = $0 }
}
END {
    for (i in arr)
        print arr[i], m[i]
}' > newfilename

# 7  
Old 04-07-2008
Just to finish that dangling remark about more fields after $3:

Code:
awk -F "\t" '$3 < $1 { t = $1; $1 = $3; $3 = t }
{ a[$1 "\t" $2 "\t" $3]++ }
END { for (k in a) { print a[k], k } } '

I added the -F "\t" to explicitly make this tab-delimited, and changed the array to key on $1 $2 $3 joined by tabs. Hope your awk can do that too.

Like I wrote above, this summarizes the counts only, and I'd keep it that way. It might be useful to convert all of your data to the canonical format, which the first script I posted should already do for you. (Maybe add the -F "\t" there too.)
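A complete run might look like this (a sketch; "input.txt" is a stand-in name, and the sample data just illustrates that a fourth column beyond the key doesn't affect the count):

```shell
# Stand-in sample: tab-delimited, with a fourth column beyond the key.
printf 'a\tpp\tb\t43\nb\tpp\ta\t43\n' > input.txt

# Same script as above: key on $1/$2/$3 joined by tabs.
awk -F "\t" '$3 < $1 { t = $1; $1 = $3; $3 = t }
{ a[$1 "\t" $2 "\t" $3]++ }
END { for (k in a) { print a[k], k } }' input.txt
# prints: 2 a pp b   (the key fields are tab-separated)
```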

Hope this helps.