Remove Duplicate by considering multiple columns


 
# 1  
05-31-2012

Hi friends,

My input:

Code:
chr1	exon	35204	35266	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	42357	42473	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	45261	45404	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	50701	50778	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	51380	51391	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	51649	51846	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	51961	52077	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	52462	52695	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	53305	53451	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	53497	53778	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	53914	54087	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	54187	54399	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	55691	55996	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	56045	56365	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	56636	56986	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	57161	57304	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	57335	57403	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	59371	59407	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	60822	60878	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	61836	61919	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	63192	63230	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	63393	63425	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	66019	66156	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	35204	35266	gene_id "GOLGB1"; transcript_id "GOLGB1_dup1";
chr1	exon	42357	42473	gene_id "GOLGB1"; transcript_id "GOLGB1_dup1";
chr1	exon	45261	45404	gene_id "GOLGB1"; transcript_id "GOLGB1_dup1";
chr1	exon	50701	50778	gene_id "GOLGB1"; transcript_id "GOLGB1_dup1";
chr1	exon	51380	51391	gene_id "GOLGB1"; transcript_id "GOLGB1_dup1";
chr1	exon	51649	51846	gene_id "GOLGB1"; transcript_id "GOLGB1_dup1";
chr1	exon	51961	52077	gene_id "GOLGB1"; transcript_id "GOLGB1_dup1";
chr1	exon	52462	52695	gene_id "GOLGB1"; transcript_id "GOLGB1_dup1";
chr1	exon	53305	53451	gene_id "GOLGB1"; transcript_id "GOLGB1_dup1";
chr1	exon	53497	53778	gene_id "GOLGB1"; transcript_id "GOLGB1_dup1";
chr1	exon	53914	54087	gene_id "GOLGB1"; transcript_id "GOLGB1_dup1";
chr1	exon	54187	54399	gene_id "GOLGB1"; transcript_id "GOLGB1_dup1";
chr1	exon	55691	55996	gene_id "GOLGB1"; transcript_id "GOLGB1_dup1";
chr1	exon	56045	56365	gene_id "GOLGB1"; transcript_id "GOLGB1_dup1";
chr1	exon	56636	56986	gene_id "GOLGB1"; transcript_id "GOLGB1_dup1";
chr1	exon	57161	57304	gene_id "GOLGB1"; transcript_id "GOLGB1_dup1";
chr1	exon	57335	57403	gene_id "GOLGB1"; transcript_id "GOLGB1_dup1";
chr1	exon	59371	59407	gene_id "GOLGB1"; transcript_id "GOLGB1_dup1";
chr1	exon	60822	60878	gene_id "GOLGB1"; transcript_id "GOLGB1_dup1";
chr1	exon	61836	61919	gene_id "GOLGB1"; transcript_id "GOLGB1_dup1";
chr1	exon	63192	63230	gene_id "GOLGB1"; transcript_id "GOLGB1_dup1";
chr1	exon	63393	63425	gene_id "GOLGB1"; transcript_id "GOLGB1_dup1";
chr1	exon	66019	66156	gene_id "GOLGB1"; transcript_id "GOLGB1_dup1";
chr1	exon	35204	35266	gene_id "GOLGB1"; transcript_id "GOLGB1_dup2";
chr1	exon	42357	42473	gene_id "GOLGB1"; transcript_id "GOLGB1_dup2";
chr1	exon	45261	45404	gene_id "GOLGB1"; transcript_id "GOLGB1_dup2";
chr1	exon	50701	50778	gene_id "GOLGB1"; transcript_id "GOLGB1_dup2";
chr1	exon	51380	51391	gene_id "GOLGB1"; transcript_id "GOLGB1_dup2";
chr1	exon	51649	51846	gene_id "GOLGB1"; transcript_id "GOLGB1_dup2";
chr1	exon	51961	52077	gene_id "GOLGB1"; transcript_id "GOLGB1_dup2";
chr1	exon	52462	52695	gene_id "GOLGB1"; transcript_id "GOLGB1_dup2";
chr1	exon	53305	53451	gene_id "GOLGB1"; transcript_id "GOLGB1_dup2";
chr1	exon	53497	53778	gene_id "GOLGB1"; transcript_id "GOLGB1_dup2";
chr1	exon	53914	54087	gene_id "GOLGB1"; transcript_id "GOLGB1_dup2";
chr1	exon	54187	54399	gene_id "GOLGB1"; transcript_id "GOLGB1_dup2";
chr1	exon	55691	55996	gene_id "GOLGB1"; transcript_id "GOLGB1_dup2";
chr1	exon	56045	56365	gene_id "GOLGB1"; transcript_id "GOLGB1_dup2";
chr1	exon	56636	56986	gene_id "GOLGB1"; transcript_id "GOLGB1_dup2";
chr1	exon	57161	57304	gene_id "GOLGB1"; transcript_id "GOLGB1_dup2";
chr1	exon	57335	57403	gene_id "GOLGB1"; transcript_id "GOLGB1_dup2";
chr1	exon	59371	59407	gene_id "GOLGB1"; transcript_id "GOLGB1_dup2";
chr1	exon	60822	60878	gene_id "GOLGB1"; transcript_id "GOLGB1_dup2";
chr1	exon	61836	61919	gene_id "GOLGB1"; transcript_id "GOLGB1_dup2";
chr1	exon	63192	63230	gene_id "GOLGB1"; transcript_id "GOLGB1_dup2";
chr1	exon	63393	63425	gene_id "GOLGB1"; transcript_id "GOLGB1_dup2";
chr1	exon	66019	66156	gene_id "GOLGB1"; transcript_id "GOLGB1_dup2";

My desired output:

Code:
chr1	exon	35204	35266	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	42357	42473	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	45261	45404	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	50701	50778	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	51380	51391	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	51649	51846	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	51961	52077	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	52462	52695	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	53305	53451	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	53497	53778	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	53914	54087	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	54187	54399	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	55691	55996	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	56045	56365	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	56636	56986	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	57161	57304	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	57335	57403	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	59371	59407	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	60822	60878	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	61836	61919	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	63192	63230	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	63393	63425	gene_id "GOLGB1"; transcript_id "GOLGB1";
chr1	exon	66019	66156	gene_id "GOLGB1"; transcript_id "GOLGB1";

I need to remove duplicate rows by comparing only the first 4 columns of this file. The delimiters are a mix of tabs and spaces. I would really appreciate an awk solution that lets me easily increase or decrease the number of columns considered for matching. That would make the task easier in the future, when I have to work on other files with more columns.

Thanks in advance.
# 2  
05-31-2012
Code:
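# the commas join the fields with SUBSEP, so "ab" "c" and "a" "bc" cannot collide as they would with $1$2$3$4
awk '!a[$1,$2,$3,$4]++' filename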

This User Gave Thanks to pravin27 For This Post:
# 3  
05-31-2012
I had tried
Quote:
awk '!a[$0]++'
but that keys on the whole line, so the rows that differ only in transcript_id were all kept.

My brain just failed to think of something this simple.

Gracias my friend!
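
The original question also asked for a way to change which columns are compared without editing the awk program each time. One possible sketch, not taken from the replies above, is a small shell wrapper that takes the columns as a comma-separated list. The function name dedup_by_cols and the variable names cols, c, key and seen are made up for illustration. It assumes awk's default field splitting (runs of tabs and spaces), which matches the input shown in post #1, and it keeps the first line seen for each key.

```shell
# dedup_by_cols COLS [FILE...]
# COLS is a comma-separated list of field numbers, e.g. 1,2,3,4.
# Prints the first line seen for each distinct combination of those fields.
# (Hypothetical helper; the name and interface are assumptions.)
dedup_by_cols() {
    cols=$1
    shift
    awk -v cols="$cols" '
    BEGIN { n = split(cols, c, ",") }   # c[1..n] holds the field numbers
    {
        key = ""
        for (i = 1; i <= n; i++)
            key = key SUBSEP $(c[i])    # SUBSEP keeps field boundaries distinct
    }
    !seen[key]++                        # true only the first time a key appears
    ' "$@"
}
```

With that function defined, `dedup_by_cols 1,2,3,4 filename` should behave like the one-liner in post #2, and comparing on more or fewer columns only means changing the list, for example `dedup_by_cols 1,3,4 filename`.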