Remove duplicate columns in input file


 
# 1  
Old 02-22-2011

Hello,

I have an input file which looks like this:

Code:
2 C:G 17    -0.14 8.75 33.35
3 G:C 16   -2.28 0.98 28.22
4 C:G 15    0.39 11.06 29.31
5 G:C 14    2.64 5.17 36.07
6 G:C 13   -0.65 2.05 21.94
7 C:G 11   138.96 21.64 14.40
9 C:G 27   -2.40 6.95 27.98
10 C:G 26  2.89 15.60 34.33
11 G:C 25  113.42 -64.17 31.45
13 C:G 6    -2.64 5.17 36.07
14 C:G 5    -0.39 11.06 29.31
15 G:C 4    2.28 0.98 28.22
16 C:G 3    0.14 8.75 33.35
17 G:C 2   98.28 -81.15 -2.89
19 A:G 3    2.66 -5.60 -84.70
22 A:C 14  6.17 4.41 64.81
23 A:G 6  11.76 -18.78 -32.24
24 A:C 13  -3.54 6.70 25.06
25 C:G 11  -2.89 15.60 34.33
26 G:C 10  2.40 6.95 27.98
27 G:C 9

Columns 1 and 3 contain numeric pairs such as 2-17, 3-16, 4-15, etc., but some of these pairs appear more than once. How can I filter the file so that only unique pairs remain, with no redundant ones?

Last edited by Franklin52; 02-23-2011 at 03:25 AM.. Reason: Please use code tags, thank you
# 2  
Old 02-22-2011
Are 2-17 and 17-2 considered the same?
# 3  
Old 02-22-2011
Exactly.

Yes, that's what I mean: 2-17 and 17-2, and all such pairs, count as the same.
# 4  
Old 02-22-2011
One idea could be to gather columns 1 and 3 into a single column that contains the concatenation of the min and max values of the pair:
Code:
while read a b c d; do if (( $a > $c ));then e=$c; c=$a; a=$e ;fi ; echo "$a$c $b $d";done <infile | sort -k 1n -u

The output could then be sorted and filtered further (sort -u -k ... and uniq ...).
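
A minimal sketch of one such follow-up, assuming columns 1 and 3 always hold integers: each line is tagged with a normalized `lo-hi` key, only the first line per key is kept, and the tag is cut off again so the original columns survive. The function name `dedupe_pairs` and the tab-separated tag are illustrative choices, not something from the one-liner above.

```shell
# Hypothetical helper: dedupe_pairs FILE
# Keeps the first line seen for each unordered (col1, col3) pair.
dedupe_pairs() {
    while read -r a b c rest; do
        # Skip blank or incomplete lines (fewer than three columns).
        [ -n "$c" ] || continue
        # Normalize the pair so that 17/2 and 2/17 get the same key.
        if [ "$a" -gt "$c" ]; then lo=$c hi=$a; else lo=$a hi=$c; fi
        # Tag: "lo-hi<TAB>original columns"; the dash keeps the two numbers apart.
        printf '%s-%s\t%s %s %s%s\n' "$lo" "$hi" "$a" "$b" "$c" "${rest:+ $rest}"
    done < "$1" |
    awk -F'\t' '!seen[$1]++' |   # keep only the first line per key
    cut -f2-                     # drop the tag again
}
```

Called as `dedupe_pairs infile`, this prints the surviving lines in input order, with the spacing between the leading columns collapsed to single spaces.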

---------- Post updated at 07:49 PM ---------- Previous update was at 07:39 PM ----------

Code:
# cat tst2
2 C:G 17 -0.14 8.75 33.35
3 G:C 16 -2.28 0.98 28.22
4 C:G 15 0.39 11.06 29.31
5 G:C 14 2.64 5.17 36.07
6 G:C 13 -0.65 2.05 21.94
7 C:G 11 138.96 21.64 14.40
9 C:G 27 -2.40 6.95 27.98
10 C:G 26 2.89 15.60 34.33
11 G:C 25 113.42 -64.17 31.45
13 C:G 6 -2.64 5.17 36.07
14 C:G 5 -0.39 11.06 29.31
15 G:C 4 2.28 0.98 28.22
16 C:G 3 0.14 8.75 33.35
17 G:C 2 98.28 -81.15 -2.89
19 A:G 3 2.66 -5.60 -84.70
22 A:C 14 6.17 4.41 64.81
23 A:G 6 11.76 -18.78 -32.24
24 A:C 13 -3.54 6.70 25.06
25 C:G 11 -2.89 15.60 34.33
26 G:C 10 2.40 6.95 27.98
27 G:C 9

Code:
# while read a b c d; do if (( $a > $c )); then e=$c; c=$a; a=$e ;fi ; echo "$a$c $b $d"; done <tst2 | sort -k 1n -u
217 C:G -0.14 8.75 33.35
316 C:G 0.14 8.75 33.35
319 A:G 2.66 -5.60 -84.70
415 C:G 0.39 11.06 29.31
514 C:G -0.39 11.06 29.31
613 C:G -2.64 5.17 36.07
623 A:G 11.76 -18.78 -32.24
711 C:G 138.96 21.64 14.40
927 C:G -2.40 6.95 27.98
1026 C:G 2.89 15.60 34.33
1125 C:G -2.89 15.60 34.33
1324 A:C -3.54 6.70 25.06
1422 A:C 6.17 4.41 64.81


Last edited by ctsgnb; 02-22-2011 at 02:47 PM..
# 5  
Old 02-22-2011
Do you want something like this?
Code:
awk -F"[:| ]" '{a[$1$2$3$4]=$0;b[NR]=$0}{if($4$3$2$1 in a) c[NR]=1}END{for(i=1;i<=NR;i++) {if(!(i in c)) print b[i]}}' file
2 C:G 17 -0.14 8.75 33.35
3 G:C 16 -2.28 0.98 28.22
4 C:G 15 0.39 11.06 29.31
5 G:C 14 2.64 5.17 36.07
6 G:C 13 -0.65 2.05 21.94
7 C:G 11 138.96 21.64 14.40
9 C:G 27 -2.40 6.95 27.98
10 C:G 26 2.89 15.60 34.33
11 G:C 25 113.42 -64.17 31.45
19 A:G 3 2.66 -5.60 -84.70
22 A:C 14 6.17 4.41 64.81
23 A:G 6 11.76 -18.78 -32.24
24 A:C 13 -3.54 6.70 25.06
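
For anyone who finds the one-liner hard to follow, here is a spelled-out sketch of what it appears to do: remember every line under a key built from residue, base, base, residue, and flag a line when its mirrored key has already been stored. Splitting on whitespace plus `split()` on the colon, the `SUBSEP` separators, and the name `drop_reversed` are substitutions made for readability, not the code posted above.

```shell
# Hypothetical wrapper: drop_reversed FILE
drop_reversed() {
    awk '
    {
        split($2, base, ":")          # "C:G" -> base[1]="C", base[2]="G"
        fwd = $1 SUBSEP base[1] SUBSEP base[2] SUBSEP $3
        rev = $3 SUBSEP base[2] SUBSEP base[1] SUBSEP $1
        seen[fwd] = 1                 # remember this line by its forward key
        line[NR] = $0
        if (rev in seen)              # mirror image already stored
            skip[NR] = 1
    }
    END {
        # Print the unflagged lines in their original order.
        for (i = 1; i <= NR; i++)
            if (!(i in skip))
                print line[i]
    }' "$1"
}
```

Note that with this approach `17 G:C 2` counts as a duplicate of `2 C:G 17` only because the base letters are mirrored as well; an exact repeat in the same orientation would not be dropped.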

# 6  
Old 02-22-2011
Code:
awk '{s=$1+0>$3?$1$3:$3$1} s in a {next} !a[s]' file

A more readable, maintainable, ungolfed version:
Code:
awk '
{
    s = (($1+0 > $3) ? $1$3 : $3$1)
    if (s in a)
        next
    a[s] = 1
    print
}
' file

Regards,
Alister
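
One caveat with plain concatenation is that different pairs can collide: 123 with 45 and 1234 with 5 both give `12345`. A hedged variant, assuming the same column layout, uses awk's comma subscript (which joins the parts with `SUBSEP`) so the two numbers stay distinct. The wrapper name `unique_pairs` and the `NF >= 3` guard are illustrative additions.

```shell
# Hypothetical wrapper: unique_pairs FILE
unique_pairs() {
    awk '
    NF >= 3 {
        lo = $1 + 0                   # force numeric values
        hi = $3 + 0
        if (lo > hi) { t = lo; lo = hi; hi = t }
        # seen[lo, hi] uses SUBSEP between the parts, so 45/123 and 5/1234 differ.
        if (!seen[lo, hi]++)
            print
    }' "$1"
}
```

As with the versions above, the first line of each pair is the one that is printed.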
# 7  
Old 02-22-2011
Thanks a lot

Thank you all for the useful input.

All the scripts work perfectly; I only made a few minor changes in places to adjust the output format.