Merging tables: identifying common and unique elements

06-17-2013

[Solved] Merging tables: identifying common and unique elements

Hi all,

I know how to merge two tables and remove the duplicated lines based on one field (column 2). My next challenge is to flag, in a new column, the elements common to tables A and B, the elements in table A that are not present in table B, and vice versa. A simple count would be enough.

Here is a sample of my tables:

TABLE A:METHOD1

Method Chr:Start-End Gene_refgene
METHOD1 chr1:111111111-22222222 MUTYH
METHOD1 chr1:45794863-45794863 MUTYH
METHOD1 chr1:45794873-45794873 MUTYH
METHOD1 chr1:45794876-45794877 MUTYH

TABLE B:METHOD2

Method Chr:Start-End Gene_refgene
METHOD2 chr1:33333333-44444444 MUTYH
METHOD2 chr1:45794863-45794863 MUTYH
METHOD2 chr1:45794873-45794873 MUTYH
METHOD2 chr1:45794876-45794877 MUTYH

EXPECTED OUTPUT:

Method Chr:Start-End Gene_refgene Count
METHOD1 chr1:111111111-22222222 MUTYH 1
METHOD2 chr1:33333333-44444444 MUTYH 1
METHOD1 chr1:45794863-45794863 MUTYH 2
METHOD1 chr1:45794873-45794873 MUTYH 2
METHOD1 chr1:45794876-45794877 MUTYH 2

lsantome

06-17-2013

Q1: what partial solution do you have already? Please wrap your answer in "code" tags.
Q1a: are the two tables in two files?
Q2: are the tables "sorted", i.e. do identical lines always have the same line number?

MadeInGermany

06-17-2013

Hi MadeInGermany, thank you for your quick reply!

A1: Yes, each table is stored in its own file. I merge them in pairs, matched by a filename pattern, with the following code:

Code:

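# Derive each sample prefix (the text before the first "_") from the file names
for sample in $(for file in *.tab; do echo "${file%%_*}"; done | sort -u); do
    # Merge all files of this sample, keep columns 1-33, one line per column-2 value
    cat "$sample"* \
    | cut -f1-33 \
    | sort -u -k2,2 \
    > "$sample.tab"
done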

Explanation:
- The pattern defines which files are going to be merged
- Open files and select columns 1 to 33
- Sort rows based on column 2, removing duplicates
- Create an output file based on the pattern used in step one.
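
As a rough illustration of what the `sort -u -k2,2` step does, the sketch below rebuilds the two sample tables and runs the same pipeline on them. The file names `S1_method1.tab` and `S1_method2.tab` and the temporary directory are assumptions made for this example, not names from the thread.

```shell
#!/bin/sh
# Sketch only: file names and the temp directory are hypothetical.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Table A (tab-separated, with header)
printf '%s\t%s\t%s\n' \
    Method  Chr:Start-End           Gene_refgene \
    METHOD1 chr1:111111111-22222222 MUTYH \
    METHOD1 chr1:45794863-45794863  MUTYH \
    METHOD1 chr1:45794873-45794873  MUTYH \
    METHOD1 chr1:45794876-45794877  MUTYH > S1_method1.tab

# Table B (tab-separated, with header)
printf '%s\t%s\t%s\n' \
    Method  Chr:Start-End           Gene_refgene \
    METHOD2 chr1:33333333-44444444  MUTYH \
    METHOD2 chr1:45794863-45794863  MUTYH \
    METHOD2 chr1:45794873-45794873  MUTYH \
    METHOD2 chr1:45794876-45794877  MUTYH > S1_method2.tab

# Concatenate, keep columns 1-33, keep one line per distinct column-2 value.
merged=$(cat S1_method*.tab | cut -f1-33 | LC_ALL=C sort -u -k2,2)
printf '%s\n' "$merged"
```

The result has one line per distinct column-2 value, so each shared position appears once. The header row is sorted like any other line, so where it ends up depends on the collation in use.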

A2: No, identical lines do not have the same line number.

Thank you again

Best,

lsantome

06-18-2013

Code:

awk '{s[$2]=$0; c[$2]++} END {for (i in s) print s[i],c[i]}' *.tab

Because both arrays are indexed by field $2, a duplicate line overwrites the earlier entry in array s and increments its count in array c. Array s simply stores the whole line; leaving field $2 out of the stored value would save some memory.
At the end, the END block prints every element of array s (in an unspecified order) together with the matching counter from array c. The loop variable i holds the value of field $2; it is not printed separately because s[i] already contains the whole line.
For demonstration, here is a variant that consumes less memory but does not print field $3:

Code:

awk '{s[$2]=$1; c[$2]++} END {for (i in s) print s[i],i,c[i]}' *.tab
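
As a quick check, the following sketch runs the first one-liner on the two sample tables from the question. The file names `a.tab` and `b.tab` are assumptions, the header rows are left out for simplicity, and the output is piped through `sort` only because `for (i in s)` iterates in an unspecified order.

```shell
#!/bin/sh
# Sketch only: a.tab and b.tab are hypothetical file names; headers omitted.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

printf '%s\t%s\t%s\n' \
    METHOD1 chr1:111111111-22222222 MUTYH \
    METHOD1 chr1:45794863-45794863  MUTYH \
    METHOD1 chr1:45794873-45794873  MUTYH \
    METHOD1 chr1:45794876-45794877  MUTYH > a.tab

printf '%s\t%s\t%s\n' \
    METHOD2 chr1:33333333-44444444  MUTYH \
    METHOD2 chr1:45794863-45794863  MUTYH \
    METHOD2 chr1:45794873-45794873  MUTYH \
    METHOD2 chr1:45794876-45794877  MUTYH > b.tab

# s[key] keeps the last line seen for each column-2 key; c[key] counts its occurrences.
counted=$(awk '{s[$2]=$0; c[$2]++} END {for (i in s) print s[i],c[i]}' a.tab b.tab \
    | LC_ALL=C sort -k2,2)
printf '%s\n' "$counted"
```

Positions found in both files get a count of 2, and positions found in only one file get a count of 1. Because a later line overwrites an earlier one, the shared rows carry the method name of the last file read; guarding the assignment with `!($2 in s)` would keep the first one instead.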

Last edited by MadeInGermany; 06-18-2013 at 06:53 AM..

MadeInGermany

06-19-2013

Thank you MadeInGermany, it works nicely!

But let me ask one last question:

The table header is no longer in its original position. Any tips on how to fix it?

Thank you!

lsantome

06-19-2013

A quick hack is to print the first line of each file directly and then skip to the next input line (so with several input files the header is printed once per file).

Code:

awk 'FNR==1 {print; next} ...

You need GNU awk, nawk, or a POSIX awk for that.
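
One way the header hack might be combined with the counting one-liner from earlier in the thread is sketched below. The elided part of the command above is not known; this version additionally prints the header only once (for the first file) and appends a `Count` label, both of which are assumptions based on the expected output in the question. The file names `a.tab` and `b.tab` are hypothetical.

```shell
#!/bin/sh
# Sketch only: a.tab and b.tab are hypothetical file names.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

printf '%s\t%s\t%s\n' \
    Method  Chr:Start-End           Gene_refgene \
    METHOD1 chr1:111111111-22222222 MUTYH \
    METHOD1 chr1:45794863-45794863  MUTYH > a.tab

printf '%s\t%s\t%s\n' \
    Method  Chr:Start-End           Gene_refgene \
    METHOD2 chr1:33333333-44444444  MUTYH \
    METHOD2 chr1:45794863-45794863  MUTYH > b.tab

result=$(awk '
    # First line of every file is a header: print it for the first file only, then skip it.
    FNR == 1 { if (NR == 1) print $0, "Count"; next }
    # Data lines: remember the line and count occurrences per column-2 key.
    { s[$2] = $0; c[$2]++ }
    END { for (i in s) print s[i], c[i] }
' a.tab b.tab)
printf '%s\n' "$result"
```

The header is written while the files are being read and the data rows are written in the END block, so the header always comes first; the order of the data rows is still unspecified.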

MadeInGermany

06-20-2013

Brilliant!

Thank you so much!

lsantome

UNIX for Dummies Questions & Answers
