Grouping matches by cols

09-10-2008

Registered User

7, 0

Join Date: Sep 2008

Last Activity: 11 September 2008, 1:25 PM EDT

Posts: 7

Thanks Given: 0

Thanked 0 Times in 0 Posts

Thank you both, Annihilannic and cfajohnson. I cannot try your code now but will do so first thing in the morning.

The results are from pairwise comparisons of genes: a table in which gene1 (the first column, A or B) matches gene2 (the second column, B or A) above a particular % identity cutoff. I filtered the analysis down to highly similar genes. Based on many years of doing gene analysis daily, I have a reasonable idea that above this cutoff the gene functions are either very similar or identical.

The rule is that the first time a pair is seen, the first element of the pair becomes the name of the group. I am just using a FIFO scheme here. It really does not matter scientifically whether A or B (gene1 or gene2) gets assigned here. It matters, however, that once a group label has been identified, that label is used consistently so that the same gene is not assigned to a different group. (Annihilannic trapped my mistakes smartly, very nice of you, special thanks. I will learn to stop working when my eyes are really blurry and my brain is fried.)

In my first example, we have A A, A B, and B B. This is because A matches B; otherwise we would only have A A and B B.
Since A A comes first in the list (A matches itself), we assign the group to be A. Reading further, we get to either B B or A B. If A and B match, A B or B A will come before B B. So B is assigned to group A, because A was seen earlier and already has a label, and when we later see B matching itself we again assign B to group A.
Alternatively, B B will be seen without either A B or B A (if B does not match A, in which case we only have A A and B B) and hence will be assigned to group B.
So even if B matches itself (B B), it also matches A (when you see either A B or B A), and A B is already assigned to A, so B's group will be A. If in a real example the pairs appear in the order B B, A B, A A, no harm is done: B will be the group label. So it would not matter scientifically even if we reversed the process and used the second-column match as the label, but we would then need to use the same grouping (and/or process of determining the grouping) consistently for other matches to both A and B as we move along.
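
The rule described above can be sketched in awk (wrapped in a shell function here). This is only a minimal sketch, not the code cfajohnson or Annihilannic posted; the function name group_pairs and the array grp are made up for illustration. It assumes two whitespace-separated columns, and that when both genes already carry different labels the first column's label is reported without merging the groups, a case the description above does not cover.

```shell
# Hypothetical helper: append a group label to each "gene1 gene2" line.
group_pairs() {
  awk '
    {
      a = $1; b = $2
      if (a in grp)      label = grp[a]   # first gene already has a group
      else if (b in grp) label = grp[b]   # second gene already has a group
      else               label = a        # unseen pair: first element names the group
      # Never overwrite an existing assignment, so a gene keeps its group.
      if (!(a in grp)) grp[a] = label
      if (!(b in grp)) grp[b] = label
      printf "%s\t%s\n", $0, label
    }
  ' "$@"
}
```

For the first example, `printf 'A A\nA B\nB B\n' | group_pairs` labels all three lines with A; fed in the order B B, A B, A A, the label would be B instead.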

Sorry, this "lecture" was unintended.
More questions? I will be very happy to answer.

gbalsu


09-10-2008


In the third paragraph in my previous post, in the last sentence (prior to the parentheses), by "same gene" I intend to say "a gene previously assigned in a pair to a group". Sorry again.

gbalsu


09-10-2008

Registered User

1,009, 2

Join Date: May 2008

Last Activity: 28 October 2009, 7:03 PM EDT

Location: Sydney, Australia

Posts: 1,009

Thanks Given: 0

Thanked 2 Times in 2 Posts

You can edit posts here to correct them, you know. :-) I'd be lost without that facility...

It makes sense to me... personally I think I would use an entirely different name for the groups to avoid confusing myself, maybe a group number. So, for example, A, B and X would be in group 1. That way you don't associate A with group 1 any more than you would X. But the end result is the same...
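
For what it is worth, the group-number idea might look like the following awk sketch. The name group_pairs_numbered and the counter n are hypothetical, input is assumed to be two whitespace-separated columns, and nothing here merges two groups that already have different numbers.

```shell
# Hypothetical helper: label each pair with a running group number.
group_pairs_numbered() {
  awk '
    {
      if ($1 in grp)      id = grp[$1]   # first gene already numbered
      else if ($2 in grp) id = grp[$2]   # second gene already numbered
      else                id = ++n       # unseen pair: open a new group
      if (!($1 in grp)) grp[$1] = id
      if (!($2 in grp)) grp[$2] = id
      printf "%s\t%d\n", $0, id
    }
  ' "$@"
}
```

With the pairs A A, A B, A X and C C, in that order, the first three lines would get group 1 and the last one group 2.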

Annihilannic


09-10-2008


Yeah, using a number for the group was where I originally wanted to be, but the problem is that it tells a biologist little beyond the fact that "you just reached Group_10000289".
My next step (which I know how to do) is to substitute the group IDs with the name of the gene. Then a biologist knows they can look for the "geneX" they are interested in to see how many groups, and how many members per group, the gene has. And once I have merged that information I can use it to create interfaces for my analysis where people can come and query, manipulate, add to, or analyze the results, etc.
Again, thanks for the valuable discussions. This helps me keep my project ideas clear in both the near and long term.
More after trying the code.

gbalsu


09-10-2008


Ok, now I have tried both scripts on my initial inputs (the alphabet pairs) and they both produced identical results.
The logical next step was to do the same on a much smaller subset of my comparison data that I am familiar with, to make sure I was getting what I expected from the scripts.
I hit a snag here. I saw that the outputs were slightly different and had errors. The issue was that the data had names like NC_008527.1|:1225155-1226045 and NC_008527.1|:c900661-899771. Note the difference between the two names: |: versus |:c. This was causing issues when the groups were being assigned. I hypothesized that this difference alone was causing the trouble, based on the group assignments for other pairs without |:c in their names.
I was then able to write a Perl script to move the c from |:c to the end of the name and re-sort the data. When I used this newly sorted data, both scripts produced accurate and identical results.
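
The renaming step could also be done with sed rather than Perl. A rough sketch, assuming every affected name looks like the example above (|:c followed by digits, a hyphen, and more digits); the function name move_c_to_end is made up, and this is not the Perl script mentioned in the post.

```shell
# Hypothetical helper: turn "|:c900661-899771" into "|:900661-899771c".
move_c_to_end() {
  # In a basic regular expression the | is a literal character.
  # \(...\) captures the start-end coordinates so the c can be re-attached after them.
  sed 's/|:c\([0-9][0-9]*-[0-9][0-9]*\)/|:\1c/g' "$@"
}
```

Something like `move_c_to_end pairs.txt | sort > pairs.sorted.txt` would then rewrite and re-sort the data, where both file names are placeholders.
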
Considering that I am a newbie to awk (I have used it for less than 10 hours now), I earnestly appreciate the favor that you both, cfajohnson and Annihilannic, have done me.
Call me old-fashioned, but I never forget the least tidbit of help anybody has ever given me. Many, many grateful thanks!!
One last question, is there a way to use "tab" as the column separator before adding the group name in place of space?

gbalsu


09-10-2008

Registered User

2,898, 136

Join Date: Mar 2007

Last Activity: 11 July 2016, 2:55 PM EDT

Location: Toronto, Canada

Posts: 2,898

Thanks Given: 0

Thanked 136 Times in 120 Posts

Quote:

Originally Posted by gbalsu

One last question, is there a way to use "tab" as the column separator before adding the group name in place of space?

Code:

{ printf "%s\t%s\n", $0, group }

cfajohnson

