Grouping matches by cols


 
# 8  
Old 09-10-2008
Thank you both, Annihilannic and cfajohnson. I cannot try your code now but will do so first thing in the morning.

The results are from pairwise comparisons of genes: a table of gene1 (A or B, the first column) matching gene2 (B or A, the second column) at or above a particular % identity cutoff. I filtered the analysis down to highly similar genes. Based on many years of doing gene analysis daily, I have a reasonable idea that above this cutoff the gene functions are either very similar or identical.

The rule is that the first time a pair is seen, the first element of the pair becomes the name of the group. I am just using a FIFO scheme here. It really does not matter scientifically whether A or B (gene1 or gene2) gets assigned here. It matters however that once a group label has been identified that label is consistently used so that the same gene is not assigned to a different group. (Annihilannic trapped my mistakes smartly, very nice of you, special thanks. I will learn to stop working when my eyes are really blurry and my brain is fried.)

In my first example, we have A A, A B, and B B. This is because A matches B; otherwise we would only have A A and B B.
Since we see A A first in the list (A matches itself), we assign the group to be A. Reading further, we get to either B B or A B. If A and B match, then A B or B A will come before B B. A already has a label, so B is assigned to group A, and when we later see B matching itself (B B), that line is also assigned to group A.
Alternatively, if B does not match A, B B will be seen without either A B or B A (we only have A A and B B), and B will be assigned to group B.
So even if B matches itself (B B), it also matches A (you see either A B or B A), and that pair is already assigned to A, so B's group will be A. If in a real example the lines appear in the order B B, A B, A A, no harm done: B will be the group label. Scientifically it would not matter even if we reversed the process and used the second column as the label, but we would then need to use the same grouping (and/or process of determining the grouping) consistently for other matches to both A and B as we move along.
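
For anyone reading along, here is a minimal awk sketch of the rule as described above. It is not the code posted by Annihilannic or cfajohnson; the function name group_pairs and the array name group are made up for illustration. It assumes whitespace-separated pairs, one per line, and it does not try to merge two groups if both genes of a pair were already labelled differently.

```shell
# group_pairs: read "gene1 gene2" pairs on stdin and append a group label.
# Hypothetical helper, written only to illustrate the first-seen rule.
group_pairs() {
  awk '
    {
      if ($1 in group) {
        g = group[$1]        # first gene already has a label: reuse it
      } else if ($2 in group) {
        g = group[$2]        # second gene already has a label: reuse it
      } else {
        g = $1               # new pair: first element names the group
      }
      # Label only genes that have no group yet, so labels never change.
      if (!($1 in group)) group[$1] = g
      if (!($2 in group)) group[$2] = g
      print $0, g
    }'
}

# The first example from this thread.
printf 'A A\nA B\nB B\n' | group_pairs
```

With the input A A, A B, B B every line gets the label A; with the order B B, A B, A A every line gets the label B, which matches the "no harm done" case.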

Sorry, this "lecture" was unintended.
More questions? I will be very happy to answer.
# 9  
Old 09-10-2008
In the third paragraph in my previous post, in the last sentence (prior to the parentheses), by "same gene" I intend to say "a gene previously assigned in a pair to a group". Sorry again.
# 10  
Old 09-10-2008
You can edit posts here to correct them you know. :-) I'd be lost without that facility...

It makes sense to me... personally I think I would use an entirely different name for the groups to avoid confusing myself, maybe a group number. So, for example, A, B and X would be in group 1. That way you don't associate A with group 1 any more than you would X. But the end result is the same...
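
A rough sketch of what numbering the groups could look like in awk. The names group_pairs_numbered, group and n are invented for this example, and the input is assumed to be whitespace-separated pairs, one per line. Groups are numbered in the order they are first seen, and two existing groups are never merged.

```shell
# group_pairs_numbered: like the first-seen rule, but the label is a counter.
# Hypothetical helper for illustration only.
group_pairs_numbered() {
  awk '
    {
      if ($1 in group) {
        g = group[$1]        # reuse the number already given to gene 1
      } else if ($2 in group) {
        g = group[$2]        # reuse the number already given to gene 2
      } else {
        g = ++n              # neither gene seen before: start a new group
      }
      if (!($1 in group)) group[$1] = g
      if (!($2 in group)) group[$2] = g
      print $0, g
    }'
}

# A, B and X end up in group 1; C gets group 2.
printf 'A A\nA B\nB X\nC C\n' | group_pairs_numbered
```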
# 11  
Old 09-10-2008
Yeah, using a number for the group was where I originally wanted to be but the problem is that it tells a biologist little beyond the fact that "you just reached Group_10000289".
My next step (which I know how to do) is to substitute the group IDs with the name of the gene. Then a biologist can look for the "geneX" of interest to see how many groups the gene has and how many members there are per group. And once I have merged that information I can use it to create interfaces for my analysis where people can come and query, manipulate, add to, or analyze the results, etc.
Again, thanks for the valuable discussions. This helps me keep my project ideas clear in both the near and long term.
More after trying the code.
# 12  
Old 09-10-2008
OK, now I have tried both scripts on my initial inputs (the alphabet pairs) and they both produced identical results.
The logical next step was to do the same on a much smaller subset of my comparison data that I am familiar with, to make sure I am getting what I expected from the scripts.
I hit a snag here. I saw that the outputs were slightly different and had errors. The issue was that the data had names like NC_008527.1|:1225155-1226045 and NC_008527.1|:c900661-899771. Note the difference between the two names: |: versus |:c. This was causing issues when the groups were being assigned. I hypothesized that this difference alone was causing the trouble, based on other group assignments to pairs without |:c in the names.
I was then able to write a Perl script to move the c from |:c to the end of the name and re-sort the data. When I used this newly sorted data, both scripts produced accurate and identical results.
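
The Perl script itself is not shown in this thread. As a rough equivalent, the same renaming could probably be done with sed. The function name move_c_suffix is made up, the pattern assumes every affected name has the form |:c followed by start-end digits, and the plain sort at the end is only a guess at how the data was re-sorted.

```shell
# move_c_suffix: turn "...|:c900661-899771" into "...|:900661-899771c".
# Hypothetical stand-in for the Perl script mentioned above.
move_c_suffix() {
  # In a basic regular expression the | is a literal character.
  sed 's/|:c\([0-9]*-[0-9]*\)/|:\1c/g'
}

# One pair using the two names quoted in the post, then re-sorted.
printf '%s\n' 'NC_008527.1|:1225155-1226045 NC_008527.1|:c900661-899771' |
  move_c_suffix | sort
```

Names without the c after |: pass through unchanged.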
Considering that I am a newbie to awk (I have used it for less than 10 h now), I earnestly appreciate the favor that you both, cfajohnson and Annihilannic, have done me.
Call me old-fashioned, but I never forget the least tidbit of help anybody has ever given me. Many, many grateful thanks!!
One last question, is there a way to use "tab" as the column separator before adding the group name in place of space?
# 13  
Old 09-10-2008
Quote:
Originally Posted by gbalsu
One last question, is there a way to use "tab" as the column separator before adding the group name in place of space?

Code:
{ printf "%s\t%s\n", $0, group }
