Merging tables: identifiying common and unique elements


 
Thread Tools Search this Thread
Top Forums UNIX for Dummies Questions & Answers Merging tables: identifiying common and unique elements
# 1  
Old 06-17-2013
[Solved] Merging tables: identifiying common and unique elements

Hi all,

I know how to merge two tables and to remove the duplicated lines based on a field (Column 2) . My next challenge is to be able to identify in a new column those common elements between table A & B, those elements in table A not present in table B and vice versa. A simple count would be enough.

Here is a sample of my tables:

TABLE A:METHOD1

Method Chr:Start-End Gene_refgene
METHOD1 chr1:111111111-22222222 MUTYH
METHOD1 chr1:45794863-45794863 MUTYH
METHOD1 chr1:45794873-45794873 MUTYH
METHOD1 chr1:45794876-45794877 MUTYH

TABLE B:METHOD2

Method Chr:Start-End Gene_refgene
METHOD2 chr1:33333333-44444444 MUTYH
METHOD2 chr1:45794863-45794863 MUTYH
METHOD2 chr1:45794873-45794873 MUTYH
METHOD2 chr1:45794876-45794877 MUTYH


EXPECTED OUTPUT:

Method Chr:Start-End Gene_refgene Count
METHOD1 chr1:111111111-22222222 MUTYH 1
METHOD2 chr1:33333333-44444444 MUTYH 1
METHOD1 chr1:45794863-45794863 MUTYH 2
METHOD1 chr1:45794873-45794873 MUTYH 2
METODO1 chr1:45794876-45794877 MUTYH 2
# 2  
Old 06-17-2013
Q1: what partial solution do you have already? Please wrap your answer in "code" tags.
Q1a: are the two tables in two files?
Q2: are the tables "sorted", i.e. do identical lines always have the same line number?
# 3  
Old 06-17-2013
Hi MadeInGermany, thank you for your quick reply!

A1: Yes, every table is contained in a single file. I merge them two by two, based on their filename (pattern) with the following code:

Code:
for sample in `for file in *.tab; do echo ${file/_*/}; done | sort | uniq`; do
    cat $sample* \
    | cut -f1-33 \
    | sort -u -k2,2 \
    > $sample.tab
done

Explanation:
- The pattern defines which files are going to be merged
- Open files and select columns 1 to 33
- Sort rows based on column 2, removing duplicates
- Create an output file based on the pattern used in step one.

A2: No, identical lines do not have the same line number

Thank you again

Best,

lsantome
# 4  
Old 06-18-2013
Code:
awk '{s[$2]=$0; c[$2]++} END {for (i in s) print s[i],c[i]}' *.tab

By indexing with field $2, a duplicate is overwritten in array s, and further increases the count in array c. Array s simply stores the whole line - it would save some memory to leave out the field $2.
At the end it prints all elements of array s (in a random order) together with the conter in array c (in the same order). The i variable is equal to the field $2 - not printed because s[i] is already the whole line.
For demonstration, here is a variant that consumes less memory but does not print field $3:
Code:
awk '{s[$2]=$1; c[$2]++} END {for (i in s) print s[i],i,c[i]}' *.tab


Last edited by MadeInGermany; 06-18-2013 at 06:53 AM..
This User Gave Thanks to MadeInGermany For This Post:
# 5  
Old 06-19-2013
thank you MadeIn Germany, it works nicely!

But let me ask you the last question:

The table header is not on its original position. Any tip to fix it?

Thank you!
# 6  
Old 06-19-2013
A quick hack is to directly print the 1st line of each file, and proceed with the next cycle.
Code:
awk 'FNR==1 {print; next} ...

You need GNU awk or nawk or Posix awk for that.
This User Gave Thanks to MadeInGermany For This Post:
# 7  
Old 06-20-2013
Brilliant!

thank you so much!
 
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Get unique elements from Array

I have an array code and output is below: echo $1 while read -r fline; do echo "%%%%%%$fline%%%%%" fmy_array+=("$fline") done <<< "$1" Output: CR30903 YU0007 SRIL CR30903 Yogesh SRIL %%%%%%CR30903 YU0007 SRIL%%%%% %%%%%%CR30903 Yogesh SRIL%%%%% ... (8 Replies)
Discussion started by: mohtashims
8 Replies

2. Shell Programming and Scripting

Merging two tables including multiple ocurrence of column identifiers and unique lines

I would like to merge two tables based on column 1: File 1: 1 today 1 green 2 tomorrow 3 red File 2: 1 a lot 1 sometimes 2 at work 2 at home 2 sometimes 3 new 4 a lot 5 sometimes 6 at work (4 Replies)
Discussion started by: BSP
4 Replies

3. Shell Programming and Scripting

Count common elements in a column

HI, I have a 3-column tab separated column (approx 1GB) in which I would like to count and output the frequency of all of the common elements in the 1st column. For instance: If my input was the following: dot is-big 2 dot is-round 3 dot is-gray 4 cat is-big 3 hot in-summer 5 My... (4 Replies)
Discussion started by: owwow14
4 Replies

4. Shell Programming and Scripting

Merging two files without any common pattern

Hi I have file1 as IJU_NSOMOW; SOWWOD_TWUIQ; and file2 as how are you?; fine there; Now my problem is i need the output file as IJU_NSOMOW; how are you?; SOWWOD_TWUIQ; fine there; (2 Replies)
Discussion started by: Priya Amaresh
2 Replies

5. Shell Programming and Scripting

Merging files with common IDs without JOIN

Hi, I am trying to merge information across 2 files. The first file is a "master" file, with all IDS. File 2 contains a subset of IDs of those in File 1. I would like to match up individuals in File 1 and File 2, and add information in File 2 to that of File 1 if they appear. However, if an... (3 Replies)
Discussion started by: hubleo
3 Replies

6. Shell Programming and Scripting

Creating array with non-duplicate / unique elements in ksh

Hi all, I have created 3 arrays which can have common elements in each like- arr_a contains str1 str2 str3 str4 str5 arr_b contains str3 str6 str7 str1 str8 arr_c contains str4 str9 str10 str2 each array is created with "set -A arr_name values" command. I want to create a resultant array-say... (1 Reply)
Discussion started by: sanzee007
1 Replies

7. Shell Programming and Scripting

Merging 2 files based on a common column

Hi All, I do have 2 files file 1 has 4 tab delimited columns 234 a c dfgyu 294 b g fih 302 c h jzh 328 z c san 597 f g son File 2 has 2 tab delimted columns 234 23 302 24 597 24 I want to merge file 2 with file 1 based on the data common in both files which is the first column so... (6 Replies)
Discussion started by: Lucky Ali
6 Replies

8. UNIX for Dummies Questions & Answers

Merging Tables by a column

Dear Friends, I really do not know Linux and I really would like to understand it because it does help to work with large data. I am reading this forum for 1 week to try a solution for my problem. I think that, using others post informations, I was almost there... I have 2 big tables... (4 Replies)
Discussion started by: lColli
4 Replies

9. Shell Programming and Scripting

Merging two files with a common column

Hi, I have two files file1 and file2. I have to merge the columns of those two files into file3 based on common column of two files. To be simple. file1: Row-id name1 13456 Rahul 16789 Vishal 18901 Karan file2 : Row-id place 18901 Mumbai ... (2 Replies)
Discussion started by: manneni prakash
2 Replies

10. Shell Programming and Scripting

find common elements in 2 files (for loop)

Hi, i'm new here (and to scripting too). I was hoping for some help in comparing two files. i have a file called 'file1' with a list of names in the following format: adam jones paul higgins kelly lowe i also have another file which may contain some of the names but with a lot of... (4 Replies)
Discussion started by: ibking
4 Replies
Login or Register to Ask a Question