Remove duplicate columns in input file

02-22-2011


hello,

I have an input file which looks like this:

Code:

2 C:G 17    -0.14 8.75 33.35
3 G:C 16   -2.28 0.98 28.22
4 C:G 15    0.39 11.06 29.31
5 G:C 14    2.64 5.17 36.07
6 G:C 13   -0.65 2.05 21.94
7 C:G 11   138.96 21.64 14.40
9 C:G 27   -2.40 6.95 27.98
10 C:G 26  2.89 15.60 34.33
11 G:C 25  113.42 -64.17 31.45
13 C:G 6    -2.64 5.17 36.07
14 C:G 5    -0.39 11.06 29.31
15 G:C 4    2.28 0.98 28.22
16 C:G 3    0.14 8.75 33.35
17 G:C 2   98.28 -81.15 -2.89
19 A:G 3    2.66 -5.60 -84.70
22 A:C 14  6.17 4.41 64.81
23 A:G 6  11.76 -18.78 -32.24
24 A:C 13  -3.54 6.70 25.06
25 C:G 11  -2.89 15.60 34.33
26 G:C 10  2.40 6.95 27.98
27 G:C 9

In columns 1 and 3 there are numeric pairs such as 2-17, 3-16, 4-15, etc., but some of these pairs are duplicated. How can I filter the file so that only unique pairs remain, with no redundant ones?


linux_usr


02-22-2011


Are 2-17 and 17-2 considered the same?

ctsgnb


02-22-2011


exactly....

Yes, that's what I mean: 2-17 and 17-2, and all such pairs, are the same.

linux_usr


02-22-2011


One idea would be to merge columns 1 and 3 into a single column containing the concatenation of the minimum and maximum values of the pair:

Code:

while read a b c d; do if (( $a > $c ));then e=$c; c=$a; a=$e ;fi ; echo "$a$c $b $d";done <infile | sort -k 1n -u

The result could then be sorted and filtered further (sort -u -k ... and uniq ...).


Code:

# cat tst2
2 C:G 17 -0.14 8.75 33.35
3 G:C 16 -2.28 0.98 28.22
4 C:G 15 0.39 11.06 29.31
5 G:C 14 2.64 5.17 36.07
6 G:C 13 -0.65 2.05 21.94
7 C:G 11 138.96 21.64 14.40
9 C:G 27 -2.40 6.95 27.98
10 C:G 26 2.89 15.60 34.33
11 G:C 25 113.42 -64.17 31.45
13 C:G 6 -2.64 5.17 36.07
14 C:G 5 -0.39 11.06 29.31
15 G:C 4 2.28 0.98 28.22
16 C:G 3 0.14 8.75 33.35
17 G:C 2 98.28 -81.15 -2.89
19 A:G 3 2.66 -5.60 -84.70
22 A:C 14 6.17 4.41 64.81
23 A:G 6 11.76 -18.78 -32.24
24 A:C 13 -3.54 6.70 25.06
25 C:G 11 -2.89 15.60 34.33
26 G:C 10 2.40 6.95 27.98
27 G:C 9

Code:

# while read a b c d; do if (( $a > $c )); then e=$c; c=$a; a=$e ;fi ; echo "$a$c $b $d"; done <tst2 | sort -k 1n -u
217 C:G -0.14 8.75 33.35
316 C:G 0.14 8.75 33.35
319 A:G 2.66 -5.60 -84.70
415 C:G 0.39 11.06 29.31
514 C:G -0.39 11.06 29.31
613 C:G -2.64 5.17 36.07
623 A:G 11.76 -18.78 -32.24
711 C:G 138.96 21.64 14.40
927 C:G -2.40 6.95 27.98
1026 C:G 2.89 15.60 34.33
1125 C:G -2.89 15.60 34.33
1324 A:C -3.54 6.70 25.06
1422 A:C 6.17 4.41 64.81
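One caveat with a key built by plain concatenation: without a separator, two different pairs can yield the same key (2 and 17 give 217, but so would 7 and 21 if the smaller value were two digits, e.g. 21 and 7 give 721 while 7 and 21 is the same pair, and 2 and 17 versus 21 and 7 differ only by where the digits split). A minimal sketch of the same idea with a `-` between the two values, assuming a POSIX shell and whitespace-separated input; the file name `pairs.txt` and its three rows are made up for illustration:

```shell
# Hypothetical sample data: 2-17 and 17-2 are the same pair; 21-7 is a different one.
cat > pairs.txt <<'EOF'
2 C:G 17 -0.14 8.75 33.35
17 G:C 2 98.28 -81.15 -2.89
21 A:G 7 1.00 2.00 3.00
EOF

# Build a "min-max" key with an explicit separator, then keep one line per key.
result=$(
  while read -r a b c d; do
    # Swap so that a holds the smaller value and c the larger one.
    if [ "$a" -gt "$c" ]; then e=$c; c=$a; a=$e; fi
    echo "$a-$c $b $d"
  done < pairs.txt | sort -t' ' -k1,1 -u
)
echo "$result"
```

Which of the two lines sharing a key survives `sort -u` depends on the sort implementation, so this form suits cases where either representative of a pair is acceptable.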


ctsgnb


02-22-2011


Is this what you want?

Code:

awk -F"[:| ]" '{a[$1$2$3$4]=$0;b[NR]=$0}{if($4$3$2$1 in a) c[NR]=1}END{for(i=1;i<=NR;i++) {if(!(i in c)) print b[i]}}' file
2 C:G 17 -0.14 8.75 33.35
3 G:C 16 -2.28 0.98 28.22
4 C:G 15 0.39 11.06 29.31
5 G:C 14 2.64 5.17 36.07
6 G:C 13 -0.65 2.05 21.94
7 C:G 11 138.96 21.64 14.40
9 C:G 27 -2.40 6.95 27.98
10 C:G 26 2.89 15.60 34.33
11 G:C 25 113.42 -64.17 31.45
19 A:G 3 2.66 -5.60 -84.70
22 A:C 14 6.17 4.41 64.81
23 A:G 6 11.76 -18.78 -32.24
24 A:C 13 -3.54 6.70 25.06

yinyuemi


02-22-2011


Code:

awk '{s=$1+0>$3?$1$3:$3$1} s in a {next} !a[s]' file

A more readable, maintainable, ungolfed version:

Code:

awk '
{
    s = (($1+0 > $3) ? $1$3 : $3$1)
    if (s in a)
        next
    a[s] = 1
    print
}
' file

Regards,
Alister
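For inputs with larger numbers, a key made by joining the two fields directly can collide: the pairs (121, 1) and (12, 11) both become "1211". A possible variant, offered as a sketch rather than a definitive fix, joins the values with awk's built-in `SUBSEP` and forces both sides of the comparison to be numeric. The file name `pairs.txt`, the array name `seen`, and the sample rows are assumptions for illustration:

```shell
# Hypothetical sample data: (121, 1) and (12, 11) both concatenate to "1211".
cat > pairs.txt <<'EOF'
121 C:G 1 0.10 0.20 0.30
12 G:C 11 0.40 0.50 0.60
1 G:C 121 0.70 0.80 0.90
EOF

result=$(awk '
{
    # Order the pair numerically and join it with SUBSEP so keys cannot collide.
    key = ($1+0 > $3+0) ? $1 SUBSEP $3 : $3 SUBSEP $1
    if (key in seen)
        next            # pair already printed, in either order
    seen[key] = 1
    print
}
' pairs.txt)
echo "$result"
```

Here the first two rows are kept as distinct pairs and the third is dropped as the reverse of the first. Input lines are printed unchanged and in their original order.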

alister


02-22-2011


thanks a lot

Thank you all for the useful input.

All the scripts work perfectly; I only made a few minor changes in places to adjust the output format.

linux_usr
