Grouping matches by cols


 
# 1  
Old 09-09-2008

Dear all,
I have a large file with about 10 million lines.
The first two columns hold identifiers that are matching partners of each other.
For example:
A A
A B
B B

or

A A
B A
B B

The matches may be separated by an unknown number of lines.

I want to group them and append a "group" value as the last column.

For example

A A A
A B A
B B A

or

A A A
B A A
B B A

Rest assured that only one of A B and B A will be present, never both.
A may have any number of matches in addition to B. In all cases I would like to name the group after the first partner of the first instance, i.e. A in this case.
Any help will be highly appreciated.
# 2  
Old 09-09-2008
Quote:
Originally Posted by gbalsu
Dear all,
I have a large file with about 10 million lines.
The first two columns hold identifiers that are matching partners of each other.
For example:
A A
A B
B B

or

A A
B A
B B

The matches may be separated by an unknown number of lines.

My intention is to group them and add a "group" value in the last col.

For example

A A A
A B A
B B A

or

A A A
B A A
B B A

How do you determine the group value? Why is the third line not B B B?
Quote:
Rest assured that only one of A B and B A will be present and not both.
Any help will be highly appreciated.
A may have any number of matches in addition to B. In all cases I would like to name the group after the first partner of the first instance, i.e. A in this case.

It would be helpful if you provided more examples from the file.

It might also help if you posted some real data in addition to the abbreviated, single-letter data.
# 3  
Old 09-09-2008
Group value is determined by the first pair to be detected by the script.

If A A was the first pair, A is the first group value.
If A B was the first pair, A is the first group value.
If B A was the first pair, B is the first group value.
If B B was the first pair, B is the first group value.

I am sorting a large gene comparison data set. For us it hardly matters which member names the "group", as long as the members are highly identical, as the results indicate. This is only one of several analysis steps in my project.

Here is one set of records from my data.

NC_002662.1|:1000271-1001206 NC_002662.1|:1000271-1001206 100.00 936 0 0 1 936 1 936 0.0 1814
NC_002662.1|:1000271-1001206 NC_008527.1|:1000752-1001687 88.60 947 86 21 1 936 1 936 0.0 957
NC_008527.1|:1000752-1001687 NC_008527.1|:1000752-1001687 100.00 936 0 0 1 936 1 936 0.0 1754
# 4  
Old 09-09-2008
So it seems like the "group value" is always the same as the first field? If that's the case, why do you need to add another field?
# 5  
Old 09-09-2008
No, if it was the first field all the time, I would never have posted this.
I kindly request you to look at my input again - if A B was encountered previously, when you next see B B it needs to be assigned to A.

I wanted to only provide a simple example but I guess I made it too simple and now appear not so smart.

Let's add some more.

Input

A A
A X
C D
E F
X L
A B
O O
P P
M N
B B

Output

A A A
A X A
C D C
E F E
X L X
A B A
O O O
T X X
E E E
P P P
M N M
B B A

My apologies, this is literally the first time I am posting questions in a programming forum. Please help me with further queries as you deem necessary.
# 6  
Old 09-09-2008
Try this:

Code:
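awk '
        # group[id] holds the group label already assigned to id
        $1 in group {
                print $0,group[$1]
                if ($2 in group) {
                        if (group[$1] != group[$2]) {
                                # both ids were grouped earlier under different labels;
                                # warn on stderr so the message stays out of the output
                                print $1" and "$2" are already in different groups!" > "/dev/stderr"
                        }
                } else {
                        group[$2]=group[$1]
                }
                next
        }
        # only $2 is known: $1 joins the group of $2
        $2 in group {
                print $0,group[$2]
                group[$1]=group[$2]
                next
        }
        # neither id is known: start a new group named after $1
        {
                group[$1]=$1
                group[$2]=$1
                print $0,group[$1]
        }
' inputfile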

I think you forgot to include the "T X" and "E E" lines in your example input data.

Note that the output is slightly different from yours, e.g. X L A and T X A rather than X L X and T X X, because X is already in group A by the time those lines are read:

Code:
A A A
A X A
C D C
E F E
X L A
A B A
O O O
T X A
E E E
P P P
M N M
B B A
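
The warning branch above fires when a later line links two ids that were already placed in different groups (for example A B, then C D, then B D). If such lines should merge the two groups instead, one option is a two-pass union-find. This is only a sketch under assumptions the thread does not state: the merged group keeps the label of whichever group was started first, every line has at least two fields, the input is a regular file that can be read twice, and group_pairs is a hypothetical name.

```shell
# group_pairs FILE  (hypothetical helper name)
# Pass 1 links the ids in columns 1 and 2; pass 2 prints each line with its final label.
group_pairs() {
    awk '
        # find(id): follow parent links up to the root label of the group
        function find(id,    root, up) {
            root = id
            while (parent[root] != root) root = parent[root]
            # path compression: point every visited id straight at the root
            while (parent[id] != root) { up = parent[id]; parent[id] = root; id = up }
            return root
        }
        # pass 1 (first read of the file): register new ids, then union the pair
        NR == FNR {
            if (!($1 in parent)) { parent[$1] = $1; order[$1] = ++n }
            if (!($2 in parent)) { parent[$2] = $2; order[$2] = ++n }
            a = find($1); b = find($2)
            if (a != b) {
                # assumption: the root that was seen first keeps its label
                if (order[a] < order[b]) parent[b] = a
                else parent[a] = b
            }
            next
        }
        # pass 2 (second read): append the final group label
        { print $0, find($1) }
    ' "$1" "$1"
}
```

It would be called as group_pairs inputfile > outputfile. Unlike the single-pass script, lines printed before a merge also carry the final label, at the cost of reading the file twice and keeping two arrays keyed by id in memory.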

# 7  
Old 09-09-2008
Quote:
Originally Posted by gbalsu
No, if it was the first field all the time, I would never have posted this.

So what is the rule for determining the group?
Quote:
I kindly request you to look at my input again - if A B was encountered previously, when you next see B B it needs to be assigned to A.

When I "next see B B"? I haven't seen it before.
Quote:
I wanted to only provide a simple example but I guess I made it too simple and now appear not so smart.

Let's add some more.

Input

A A
A X
C D
E F
X L
A B
O O
P P
M N
B B

Output

A A A
A X A
C D C
E F E
X L X
A B A
O O O
T X X

Why is that T X X and not T X T?
Quote:
E E E
P P P
M N M
B B A

Why is that last line B B A and not B B B?

Does this do what you want?
Code:
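awk '
# x[id] holds the group already assigned to id; reuse the group of
# whichever field was seen before, otherwise start a new group at $1
{ group = ($1 in x) ? x[$1] : (($2 in x) ? x[$2] : $1) }
{ print $0, group }
# remember the group for any id that is new on this line
!($1 in x) { x[$1] = group }
!($2 in x) { x[$2] = group }
' "$FILE"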
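
As a quick check on data shaped like the real file, the three sample records from post #3 can be fed through the single-pass rule. This is a sketch, not a replacement for the scripts in this thread: label_pairs is a hypothetical wrapper name, it reads from standard input, and it tests membership with "in", which is my assumption about the intended behaviour.

```shell
# label_pairs: hypothetical wrapper; reads pairs on stdin, appends the group label
label_pairs() {
    awk '
        # reuse the label of whichever id was seen before, else start a new group at $1
        { g = ($1 in grp) ? grp[$1] : (($2 in grp) ? grp[$2] : $1) }
        { print $0, g }
        # remember the label for any id that is new on this line
        !($1 in grp) { grp[$1] = g }
        !($2 in grp) { grp[$2] = g }
    '
}

# the three sample records from post #3, unchanged
label_pairs <<'EOF'
NC_002662.1|:1000271-1001206 NC_002662.1|:1000271-1001206 100.00 936 0 0 1 936 1 936 0.0 1814
NC_002662.1|:1000271-1001206 NC_008527.1|:1000752-1001687 88.60 947 86 21 1 936 1 936 0.0 957
NC_008527.1|:1000752-1001687 NC_008527.1|:1000752-1001687 100.00 936 0 0 1 936 1 936 0.0 1754
EOF
```

All three records should come out with NC_002662.1|:1000271-1001206 appended as a 13th field, because the second record links NC_008527.1|:1000752-1001687 to it before the third record is read.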
