finding duplicates in columns and removing lines

04-24-2008

I am trying to figure out how to scan a file like so:

1 ralphs office","555-555-5555","ralph@mail.com","www.ralph.com
2 margies office","555-555-5555","ralph@mail.com","www.ralph.com
3 kims office","555-555-5555","kims@mail.com","www.ralph.com
4 tims office","555-555-5555","tims@mail.com","www.ralph.com

and end up with this:

1 ralphs office","555-555-5555","ralph@mail.com","www.ralph.com
3 kims office","555-555-5555","kims@mail.com","www.ralph.com
4 tims office","555-555-5555","tims@mail.com","www.ralph.com

Specifically, I need to look for duplicates in column 3 of a CSV file and, when a duplicate is found, remove every later line that repeats that column 3 value. In the example above, line two is removed (filtered out).

Does anyone know whether the Unix uniq command can be used for this, or Perl? uniq doesn't seem to have a delimiter flag; as far as I can tell it only works by character count.

Thanks!
Totus

Last edited by totus; 04-24-2008 at 05:31 PM..

totus

04-24-2008

Code:

awk -F, '! mail[$3]++' inputfile

Jean-Pierre.

aigles

04-24-2008

You're kidding me...

How does that work? I'm only vaguely familiar with awk.

totus

04-24-2008

awk has associative arrays. The key for the mail array is field #3 ($3).
The first time a given $3 shows up, the value of mail[$3] is zero (unset), so !mail[$3] is true and awk prints the line, since a pattern with no action prints by default. The post-increment mail[$3]++ then sets that array element to one. The next time the same $3 appears, mail[$3] already has a value of 1, so the line is not printed.

!mail[$3] only evaluates true when mail[$3] == 0, so when it is 1, 2, 3 ... it evaluates as false.
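
To spell the one-liner out, here is a longer sketch of the same idiom. The names `sample`, `seen` and `dedupe_col3` are only illustrative and do not come from the posts above; the sample rows are the ones from the first post.

```shell
# Sample rows copied from the first post.
sample='1 ralphs office","555-555-5555","ralph@mail.com","www.ralph.com
2 margies office","555-555-5555","ralph@mail.com","www.ralph.com
3 kims office","555-555-5555","kims@mail.com","www.ralph.com
4 tims office","555-555-5555","tims@mail.com","www.ralph.com'

# Long form of: awk -F, '! mail[$3]++'
dedupe_col3() {
  awk -F, '
    {
      if (seen[$3] == 0)   # first time this value of field 3 appears
        print              # keep the line
      seen[$3]++           # count it, so later lines with the same key are skipped
    }' "$@"
}

printf '%s\n' "$sample" | dedupe_col3
```

With the sample above this should keep lines 1, 3 and 4 and drop line 2, the same as the short form.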

jim mcnamara

04-24-2008

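With the 'uniq' command you can skip leading fields before comparing (-1 is the older spelling of -f 1):

uniq -f 1 [inputfile]

Keep in mind that uniq counts blank-separated fields, not comma-separated ones, and only compares adjacent lines.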

Hope this helps.
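
For comparison, a small sketch of how uniq behaves on the sample from the first post, plus one possible workaround that is not from this thread: `sort -t, -k3,3 -u`. The sort variant assumes the original line order does not need to be preserved, and POSIX does not say which line of a duplicate group survives, so treat it as an option to test rather than a drop-in answer. The variable name `sample` is illustrative.

```shell
# Sample rows copied from the first post.
sample='1 ralphs office","555-555-5555","ralph@mail.com","www.ralph.com
2 margies office","555-555-5555","ralph@mail.com","www.ralph.com
3 kims office","555-555-5555","kims@mail.com","www.ralph.com
4 tims office","555-555-5555","tims@mail.com","www.ralph.com'

# uniq -f 1 skips the first blank-separated field, then compares the rest of
# each line with the previous line. No two adjacent lines match here, so all
# four lines come back unchanged.
printf '%s\n' "$sample" | uniq -f 1

# sort can use a comma delimiter: sort on field 3 only and keep one line per key.
# The output is ordered by the email column, not by the original line order.
printf '%s\n' "$sample" | LC_ALL=C sort -t, -k3,3 -u
```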

in2nix4life

04-24-2008

Quote:

Originally Posted by aigles

Code:

awk -F, '! mail[$3]++' inputfile

Jean-Pierre.

Jean-Pierre,

This seemed to work, but I noticed that a few duplicates seem to be left behind. How does awk know what the delimiter is? $3 is the field, but I'm not clear on the delimiter. Would the same work with tabs as the delimiter?

Cheers!

totus

04-24-2008

Hi Totus,

In aigles' solution the delimiter is "," (that is what -F, sets).
So if you have tabs/spaces, I think you can use it as
awk -F " " '!mail[$4]++' inputfile

(The logic is that you have to specify the column correctly; I hope you noticed that I am using $4.)

-ilan
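
A sketch for the tab case, assuming the file has exactly one tab between columns. `-F '\t'` makes awk split on tabs only, so a space inside a field such as "ralphs office" does not shift the column numbers and the email stays in $3. `-F " "` is the same as awk's default whitespace splitting, which is why the email ends up in $4 in the command above, and only for rows whose first column has exactly two words. The `tolower()` call is a guess at why a few duplicates might be left behind (differences in letter case); nothing in this thread confirms the cause. The rows and the names `sample_tsv`, `seen` and `dedupe_tsv_col3` are made up for illustration.

```shell
# Hypothetical tab-separated rows; the second email differs from the first
# only in letter case.
sample_tsv=$(printf '%s\t%s\t%s\n' \
  'ralphs office'  '555-555-5555' 'ralph@mail.com' \
  'margies office' '555-555-5555' 'Ralph@Mail.com' \
  'kim'            '555-555-5555' 'kims@mail.com')

dedupe_tsv_col3() {
  # Split on tabs only and use the lowercased third field as the key.
  awk -F '\t' '!seen[tolower($3)]++' "$@"
}

printf '%s\n' "$sample_tsv" | dedupe_tsv_col3
```

On these rows the "margies office" line should be dropped, because its email matches the first row once case is ignored.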

ilan
