Filter file to remove duplicate values in first column


 
# 1
Old 07-23-2016

Hello,

I have a script that is generating a tab-delimited output file.

Code:
num     Name            PCA_A1     PCA_A2       PCA_A3
0       compound_00     -3.5054     -1.1207     -2.4372
1       compound_01     -2.2641     0.4287      -1.6120
3       compound_03     -1.3053     1.8495      -1.0224
0       compound_00     -3.5054     -1.1207     -2.4372
4       compound_04     -1.1845     -0.3377     -2.9453
7       compound_07     -0.2988     1.3539      -1.6114
8       compound_08     2.6872     -1.3726      -5.9732
9       compound_09     -1.4546     -0.8284     -3.5016
4       compound_04     -1.1845     -0.3377     -2.9453
7       compound_07     -0.2988     1.3539      -1.6114
8       compound_08     2.6872     -1.3726      -5.9732

I need to trim this down so that there are no duplicates in the first column. Strictly speaking, the entire row is a duplicate, but I don't see any reason to look at anything other than the index value. There is no particular rationale to the order, and there could be any number of duplicates of a given row.

The final results should look like this:
Code:
num     Name            PCA_A1     PCA_A2       PCA_A3
0       compound_00     -3.5054     -1.1207     -2.4372
1       compound_01     -2.2641     0.4287      -1.6120
3       compound_03     -1.3053     1.8495      -1.0224
4       compound_04     -1.1845     -0.3377     -2.9453
7       compound_07     -0.2988     1.3539      -1.6114
8       compound_08     2.6872     -1.3726      -5.9732
9       compound_09     -1.4546     -0.8284     -3.5016

I need one, and only one, instance of each index value ("num" column value) in the file, not just the lines with num values that appear only once. There always seems to be some confusion about that in discussions of "unique" lines.

The only thing I could think of was to sort the rows on the num column value and then loop through, checking whether the num value was equal to that of the previous line; if not, copy the line to a new array, and so on (a rough sketch of that approach is below).
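
Something like this is what I had in mind, assuming the header is the first line (output.txt and deduped.txt are just placeholder names):
Code:
# keep the header, sort the body numerically on column 1, and
# print a line only when its first field differs from the previous one
{
  head -n 1 output.txt
  tail -n +2 output.txt | sort -n -k1,1 |
      awk 'NR==1 || $1 != prev { print } { prev = $1 }'
} > deduped.txt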

Any suggestions? There always seems to be some simple one-line solution that I don't know about.

LMHmedchem
# 2  
Old 07-23-2016
Hello LMHmedchem,

Could you please try the following and let me know if it helps you.
Code:
awk 'NR==1{print;next} {A[$1]=$0;C=C<$1?$1:C} END{for(i=0;i<=C;i++){if(A[i]){print A[i]}}}' Input_file
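
The same logic, spelled out with comments (a minimal expansion of the one-liner above; note that it assumes the index values are non-negative integers and prints in ascending index order rather than input order):
Code:
awk '
NR==1 { print; next }            # print the header line unchanged
{
    A[$1] = $0                   # remember one line per index value
    C = (C < $1) ? $1 : C        # track the largest index seen
}
END {
    for (i=0; i<=C; i++)         # walk the index range in order
        if (i in A) print A[i]   # print each stored line exactly once
}' Input_file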

Output will be as follows.
Code:
num     Name            PCA_A1     PCA_A2       PCA_A3
0       compound_00     -3.5054     -1.1207     -2.4372
1       compound_01     -2.2641     0.4287      -1.6120
3       compound_03     -1.3053     1.8495      -1.0224
4       compound_04     -1.1845     -0.3377     -2.9453
7       compound_07     -0.2988     1.3539      -1.6114
8       compound_08     2.6872     -1.3726      -5.9732
9       compound_09     -1.4546     -0.8284     -3.5016

Thanks,
R. Singh

# 3  
Old 07-23-2016
It is always worthwhile to comb through these fora for similar problems and their solutions; several related threads are listed at the bottom of this page (at least three of which solve your problem), and more may be found with a search, helping you to help yourself.

Anyway, try
Code:
awk '!T[$1]++' file
num     Name            PCA_A1     PCA_A2       PCA_A3
0       compound_00     -3.5054     -1.1207     -2.4372
1       compound_01     -2.2641     0.4287      -1.6120
3       compound_03     -1.3053     1.8495      -1.0224
4       compound_04     -1.1845     -0.3377     -2.9453
7       compound_07     -0.2988     1.3539      -1.6114
8       compound_08     2.6872     -1.3726      -5.9732
9       compound_09     -1.4546     -0.8284     -3.5016
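
The condition !T[$1]++ is true only the first time a given first-field value is seen: T[$1] starts at zero, so its negation is true and the line is printed, and the post-increment then marks the value as seen. Input order is preserved. Written out long-hand, it is equivalent to:
Code:
awk '{
    if (!seen[$1]) {     # first occurrence of this index value
        print            # keep the line; later duplicates are skipped
        seen[$1] = 1     # mark the value as seen
    }
}' file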

# 4  
Old 07-23-2016
Hi.

Seems like sort with the unique option works for me:
Code:
#!/usr/bin/env bash

# @(#) s1       Demonstrate remove all identical lines, sort.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f "$C" ] && "$C" sort pass-fail

FILE=${1-data1}

pl " Input data file $FILE:"
head "$FILE"

pl " Expected output:"
head expected-output.txt

pl " Results:"
sort -u -k1,1 "$FILE" |
tee f1

pass-fail f1 expected-output.txt

exit 0

producing
Code:
$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.4 (jessie) 
bash GNU bash 4.3.30
sort (GNU coreutils) 8.23
pass-fail - ( local: RepRev 1.6, ~/bin/pass-fail, 2016-07-23 )

-----
 Input data file data1:
0       compound_00     -3.5054     -1.1207     -2.4372
1       compound_01     -2.2641     0.4287      -1.6120
3       compound_03     -1.3053     1.8495      -1.0224
0       compound_00     -3.5054     -1.1207     -2.4372
4       compound_04     -1.1845     -0.3377     -2.9453
7       compound_07     -0.2988     1.3539      -1.6114
8       compound_08     2.6872     -1.3726      -5.9732
9       compound_09     -1.4546     -0.8284     -3.5016
4       compound_04     -1.1845     -0.3377     -2.9453
7       compound_07     -0.2988     1.3539      -1.6114

-----
 Expected output:
0       compound_00     -3.5054     -1.1207     -2.4372
1       compound_01     -2.2641     0.4287      -1.6120
3       compound_03     -1.3053     1.8495      -1.0224
4       compound_04     -1.1845     -0.3377     -2.9453
7       compound_07     -0.2988     1.3539      -1.6114
8       compound_08     2.6872     -1.3726      -5.9732
9       compound_09     -1.4546     -0.8284     -3.5016

-----
 Results:
0       compound_00     -3.5054     -1.1207     -2.4372
1       compound_01     -2.2641     0.4287      -1.6120
3       compound_03     -1.3053     1.8495      -1.0224
4       compound_04     -1.1845     -0.3377     -2.9453
7       compound_07     -0.2988     1.3539      -1.6114
8       compound_08     2.6872     -1.3726      -5.9732
9       compound_09     -1.4546     -0.8284     -3.5016

-----
 Comparison of 7 created lines with 7 lines of desired results:
 Succeeded -- files (computed) f1 and (standard) expected-output.txt have same content.

The pass-fail code is basically just a wrapper around cmp for some extra checking and reporting.
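
Note that data1 above has no header line. With the OP's header present, sort would order it among the data rows (in the C locale "num" sorts after the digits, so the header would land last). A minimal sketch to keep the header in place, assuming it is the first line of the file (file and deduped are placeholder names):
Code:
# print the header untouched, then sort the body on the numeric
# first field and keep one line per key value
{ head -n 1 file; tail -n +2 file | sort -u -k1,1n; } > deduped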

Best wishes ... cheers, drl

9 More Discussions You Might Find Interesting

1. Shell Programming and Scripting: CSV file: filter duplicate records from column 1 & another column having unique record (started by as7951; 7 replies)
2. Shell Programming and Scripting: Filter duplicate records from CSV file with condition on one column (started by as7951; 5 replies)
3. Shell Programming and Scripting: Find duplicate values in specific column and delete all the duplicate values (started by sajmar; 4 replies)
4. Shell Programming and Scripting: Remove duplicate values in a column (not in the file) (started by ratheeshjulk; 4 replies)
5. Shell Programming and Scripting: Identify duplicate values at first column in CSV file (started by deadyetagain; 4 replies)
6. Linux: Filter a .CSV file based on the 5th column values (started by dhruuv369; 2 replies)
7. Shell Programming and Scripting: Check to identify duplicate values at first column in CSV file (started by avikaljain; 4 replies)
8. UNIX for Dummies Questions & Answers: [SOLVED] Remove lines that have duplicate values in column two (started by pathunkathunk; 5 replies)
9. Shell Programming and Scripting: Filter/remove duplicate .dat file with certain criteria (started by mukeshguliao; 6 replies)