finding duplicates in columns and removing lines


 
# 8  
Old 04-25-2008
Quote:
Originally Posted by ilan
Hi Totus,

In aigles' solution the delimiter is ",". So, if your file uses tabs/spaces instead, I think you can use it as
awk -F " " '!mail[$4]++' inputfile

(The logic is that you have to specify the column correctly; I hope you noticed that I am using $4.)

-ilan
Thanks ilan, I think I got it. To use tabs as the delimiter in awk, it's awk -F"\t".
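So for a tab-separated file, the whole command from the example above would presumably become:

Code:
# keep only the first line seen for each value in column 4
awk -F"\t" '!mail[$4]++' inputfile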

Thanks everyone for your help, it was greatly appreciated.
# 9  
Old 04-29-2008
Code:
awk 'BEGIN{FS="\""}   # split fields on the double quotes; for this data $5 is the email address
{
a[$5]++
if (a[$5]<=1)         # i.e. this is the first line seen with this email
	print
}' file

# 10  
Old 04-29-2008
Hi,

I have an idea for resolving this: take the unique values of the third column with the help of awk, sort and uniq, then grep for each of those values in the original file in a for loop, using head -1 so that only the first of the duplicate entries is kept.


Here, I put the entries in a file named duplicate.txt:

$ cat duplicate.txt
1 ralphs office","555-555-5555","ralph@mail.com","www.ralph.com
2 margies office","555-555-5555","ralph@mail.com","www.ralph.com
3 kims office","555-555-5555","kims@mail.com","www.ralph.com
4 tims office","555-555-5555","tims@mail.com","www.ralph.com

I followed the procedure below at the command prompt:

$ for id in `awk -F, '{print $3}' duplicate.txt | sort | uniq`
do
grep "${id}" duplicate.txt | head -1
done

The output I got was as follows:

3 kims office","555-555-5555","kims@mail.com","www.ralph.com
1 ralphs office","555-555-5555","ralph@mail.com","www.ralph.com
4 tims office","555-555-5555","tims@mail.com","www.ralph.com
$


Hope this works...


Thanks,
Aketi.
# 11  
Old 04-29-2008
I have a similar, but not identical, problem.

I have data like this; it's sorted by the 2nd field (the TID):
envoy,90000000000000634600010001,04/11/2008,23:19:27,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/12/2008,04:23:45,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/12/2008,23:14:25,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/13/2008,04:23:39,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/13/2008,22:41:58,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/13/2008,22:42:44,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/13/2008,22:49:43,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/13/2008,22:50:45,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/13/2008,22:53:23,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/14/2008,12:38:40,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/14/2008,12:52:22,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000693200010001,04/17/2008,09:07:09,RB00060,0009,ENVOY,ERROR,26
envoy,90000000000000693200010001,04/18/2008,10:27:13,RB00083,0009,ENVOY,ERROR,26
envoy,90000000000000693200010001,04/18/2008,11:36:27,RB00084,0009,ENVOY,ERROR,26
envoy,90000000000001034800010001,04/01/2008,23:59:15,RB00294,0030,ENVOY,ERROR,57
envoy,90000000000001034800010001,04/02/2008,23:59:12,RB00295,0030,ENVOY,ERROR,57
envoy,90000000000001034800010001,04/03/2008,23:59:11,RB00296,0030,ENVOY,ERROR,57
envoy,90000000000001034800010001,04/04/2008,23:59:08,RB00297,0030,ENVOY,ERROR,57
envoy,90000000000001034800010001,04/05/2008,23:59:04,RB00297,0030,ENVOY,ERROR,57
envoy,90000000000001034800010001,04/06/2008,22:59:06,RB00297,0030,ENVOY,ERROR,57

I want to do the following:
Check the second field to see if the TID is the same as on the previous line. If the TID has been seen before, then check the 7th field to see if that is also the same as on the previous line. If both are the same, I want to remove the line and increment a counter.

My ideal output would look something like this.
11,envoy,90000000000000634600010001,04/11/2008,23:19:27,RB00266,0015,DETAIL,ERROR,
3,envoy,90000000000000693200010001,04/17/2008,09:07:09,RB00060,0009,ENVOY,ERROR,26

etc.

I figure I actually need an awk script rather than a one-liner. The other option is to treat the last 3 fields as a single field, compare on the TID field and the error field, and then split them back into 3 fields on output.

Any thoughts? I've looked at other examples of removing dups with awk and they're mostly one-liners.

I'd love to get some explanation of WHY it works so that I can modify it if need be.
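For what it's worth, here is the kind of untested sketch I have in mind (it assumes the input is comma-separated and already sorted by TID, as above; yourfile stands in for the real file name):

Code:
awk -F, '
# same TID and same 7th field as the line that started the group:
# drop this line and bump the counter
$2 == tid && $7 == type { count++; next }
# otherwise a new group starts: flush the previous one first
NR > 1 { print count "," line }
{ tid = $2; type = $7; line = $0; count = 1 }
END { if (NR) print count "," line }
' yourfile

The idea is that tid, type and line always hold the values from the first line of the current group, so each incoming line only has to be compared against those; the count is printed in front of that first line once the group ends.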
# 12  
Old 04-29-2008
kinksville,

Please don't hijack someone else's thread; start a new thread for your problem.

Thanks.
# 13  
Old 04-29-2008
Sorry about that; I'm happy to start a new thread. I just hadn't wanted to post something that was already being answered.
# 14  
Old 05-16-2008

Hi Guys...

Please could you help me with the following?

aaaa bbbb cccc sdsd
aaaa bbbb cccc qwer

As you can see, the 2 lines match in three fields...
How can I delete this duplicate? I mean, how do I delete the second line if 3 of its fields match the previous one?
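Maybe something like this untested awk one-liner would do it, where file stands for your input; it keeps a line only the first time the combination of its first three fields is seen:

Code:
awk '!seen[$1,$2,$3]++' file

For later lines that repeat the same first three fields, seen[...] is already non-zero, so the pattern is false and the line is not printed.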

Thanks