Grouping matches by cols

09-09-2008

Dear all
I have a large file with about 10 million lines.
The first two columns contain matching partners.
For example:
A A
A B
B B

or

A A
B A
B B

The matches may be separated by an unknown number of lines.

My intention is to group them and add a "group" value in the last col.

For example

A A A
A B A
B B A

or

A A A
B A A
B B A

Rest assured that only one of A B and B A will be present, never both.
A may have any number of matches in addition to B. In all cases I would like to name the group after the first partner of the first instance, i.e. A in this case.
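
The assumption that a pair never appears in both orders can be checked before any grouping is attempted. The following is only a minimal sketch, assuming whitespace-separated columns; find_reversed is a hypothetical helper name, and the script keeps every distinct pair in memory, which may be heavy for 10 million lines.

Code:

```shell
#!/bin/sh
# find_reversed (hypothetical name): report pairs that occur in both orders,
# e.g. "A B" on one line and "B A" on a later line.
# Assumes whitespace-separated columns; the input path is passed as $1.
find_reversed() {
    awk '
    $1 != $2 {
        # The reversed pair was already seen on an earlier line.
        if (($2, $1) in pair) print $2, $1, "also appears as", $1, $2
        pair[$1, $2] = 1
    }
    ' "$1"
}
```

Running find_reversed on the data file prints nothing when the assumption holds.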
Any help will be highly appreciated.

gbalsu

09-09-2008

Quote:

Originally Posted by gbalsu

Dear all
I have a large file w. ~ 10 million lines.
The first two cols have matching partners.
For example:
A A
A B
B B

or

A A
B A
B B

The matches may be separated by an unknown number of lines.

My intention is to group them and add a "group" value in the last col.

For example

A A A
A B A
B B A

or

A A A
B A A
B B A

How do you determine the group value? Why is the third line not B B B?

Quote:

Rest assured that only one of A B and B A will be present and not both.
Any help will be highly appreciated.
A may have matches in addition to B and any number of them. But in all cases I would like to name the group with the first partner of the first instance, i.e. A in this case.

It would be helpful if you provided more examples from the file.

It might also help if you posted some real data in addition to the abbreviated, single-letter data.

cfajohnson

09-09-2008

Group value is determined by the first pair to be detected by the script.

If A A was the first pair, A is the first group value.
If A B was the first pair, A is the first group value.
If B A was the first pair, B is the first group value.
If B B was the first pair, B is the first group value.

I am sorting a large gene comparison data set. To us it hardly matters which member names the "group", as long as the members are highly identical, as the results indicate. This is only one of several analysis steps in my project.

Here is one set of instances of my data.

NC_002662.1|:1000271-1001206 NC_002662.1|:1000271-1001206 100.00 936 0 0 1 936 1 936 0.0 1814
NC_002662.1|:1000271-1001206 NC_008527.1|:1000752-1001687 88.60 947 86 21 1 936 1 936 0.0 957
NC_008527.1|:1000752-1001687 NC_008527.1|:1000752-1001687 100.00 936 0 0 1 936 1 936 0.0 1754

gbalsu

09-09-2008

So it seems like the "group value" is always the same as the first field? If that's the case, why do you need to add another field?

Annihilannic

09-09-2008

No, if it was the first field all the time, I would never have posted this.
I kindly request you to look at my input again - if A B was encountered previously, when you next see B B it needs to be assigned to A.

I wanted to only provide a simple example but I guess I made it too simple and now appear not so smart.

Let's add some more.

Input

A A
A X
C D
E F
X L
A B
O O
P P
M N
B B

Output

A A A
A X A
C D C
E F E
X L X
A B A
O O O
T X X
E E E
P P P
M N M
B B A

My apologies, this is literally the first time I am posting questions in a programming forum. Please help me with further queries as you deem necessary.

gbalsu

09-09-2008

Try this:

Code:

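awk '
        # group[] maps each identifier to the name of its group.
        # Case 1: the first field already belongs to a group.
        $1 in group {
                print $0,group[$1]
                if ($2 in group) {
                        # Both fields are known; warn if their groups disagree.
                        if (group[$1] != group[$2]) {
                                print $1" and "$2" are already in different groups!"
                        }
                } else {
                        group[$2]=group[$1]
                }
                next
        }
        # Case 2: only the second field is known, so the first one joins its group.
        $2 in group {
                print $0,group[$2]
                group[$1]=group[$2]
                next
        }
        # Case 3: neither field is known; start a new group named after field 1.
        {
                group[$1]=$1
                group[$2]=$1
                print $0,group[$1]
        }
' inputfile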

I think you forgot to include the "T X" and "E E" lines in your example input data.

Note that the output is slightly different: X L A and T X A rather than X L X and T X X, because X is already in group A:

Code:

A A A
A X A
C D C
E F E
X L A
A B A
O O O
T X A
E E E
P P P
M N M
B B A
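
One limitation of the single-pass script above: when a line links two identifiers that already sit in different groups, it can only print a warning. If merging the two groups is wanted instead (an assumption, since the thread does not say so), a union-find structure over two passes is one way to sketch it. The name group_file is hypothetical, and the file is read twice, so it cannot come from standard input.

Code:

```shell
#!/bin/sh
# group_file (hypothetical name): two-pass grouping with union-find.
# Pass 1 builds the groups, pass 2 appends the final group name to each line.
# The input path is passed as $1 and is read twice.
group_file() {
    awk '
    # find(v): follow parent links to the root of v, compressing the path.
    function find(v,    r, t) {
        r = v
        while (parent[r] != r) r = parent[r]
        while (parent[v] != r) { t = parent[v]; parent[v] = r; v = t }
        return r
    }
    # add(v): register v on first sight and remember its arrival order.
    function add(v) {
        if (!(v in parent)) { parent[v] = v; seen[v] = ++n }
    }
    # Pass 1 (NR == FNR): join the two identifiers on each line.
    NR == FNR {
        add($1); add($2)
        a = find($1); b = find($2)
        if (a != b) {
            # Keep the earlier-seen root, so the group is named after
            # the first identifier encountered.
            if (seen[a] < seen[b]) parent[b] = a
            else parent[a] = b
        }
        next
    }
    # Pass 2: print each line followed by the name of its group.
    { print $0, find($1) }
    ' "$1" "$1"
}
```

For the three lines C D, A B, B C this prints C D C, A B C, B C C, whereas the single-pass script would label A B as group A and then warn on B C.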

Annihilannic

09-09-2008

Quote:

Originally Posted by gbalsu

No, if it was the first field all the time, I would never have posted this.

So what is the rule for determining the group?

Quote:

I kindly request you to look at my input again - if A B was encountered previously, when you next see B B it needs to be assigned to A.

When I "next see B B"? I haven't seen it before.

Quote:

I wanted to only provide a simple example but I guess I made it too simple and now appear not so smart.

Let's add some more.

Input

A A
A X
C D
E F
X L
A B
O O
P P
M N
B B

Output

A A A
A X A
C D C
E F E
X L X
A B A
O O O
T X X

Why is that T X X and not T X T?

Quote:

E E E
P P P
M N M
B B A

Why is that last line B B A and not B B B?

Does this do what you want?

Code:

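awk '
# x[] maps each identifier to the name of its group.
# Use the group of field 1 if known, else that of field 2,
# else start a new group named after field 1.
{ group = ($1 in x) ? x[$1] : (($2 in x) ? x[$2] : $1) }
{ print $0, group }
# Record the group for any identifier not seen before.
!($1 in x) { x[$1] = group }
!($2 in x) { x[$2] = group }
' "$FILE"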

cfajohnson
