finding duplicates in columns and removing lines


 
# 8  
Old 04-25-2008
Quote:
Originally Posted by ilan
Hi Totus,

In aigles' solution the delimiter is ",". So, if your file uses tabs/spaces instead, I think you can use it as
awk -F " " '!mail[$4]++' inputfile

(The logic is that you have to specify the column correctly; I hope you noticed that I am using $4.)

-ilan
Thanks ilan, I think I got it. To use tabs as the delimiter in awk, it's awk -F"\t".
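So for a tab-separated file, the whole command from the example above would presumably become:

Code:
# keep only the first line seen for each value in column 4
awk -F"\t" '!mail[$4]++' inputfile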

Thanks everyone for your help, it was greatly appreciated.
# 9  
Old 04-29-2008
Code:
awk 'BEGIN{FS="\""}   # split fields on the double quotes; for this data $5 is the email address
{
a[$5]++
if (a[$5]<=1)         # i.e. this is the first line seen with this email
	print
}' file

# 10  
Old 04-29-2008
Hi,

I have an idea for resolving this: take the unique values of the third column with the help of awk, sort and uniq, then grep for each of those values in the original file in a for loop, using head -1 so that only the first of the duplicate entries is kept.


Here, I put the entries in a file named duplicate.txt:

$ cat duplicate.txt
1 ralphs office","555-555-5555","ralph@mail.com","www.ralph.com
2 margies office","555-555-5555","ralph@mail.com","www.ralph.com
3 kims office","555-555-5555","kims@mail.com","www.ralph.com
4 tims office","555-555-5555","tims@mail.com","www.ralph.com

I followed the procedure below at the command prompt:

$ for id in `awk -F, '{print $3}' duplicate.txt | sort | uniq`
do
grep "${id}" duplicate.txt | head -1
done

The output I got was as follows:

3 kims office","555-555-5555","kims@mail.com","www.ralph.com
1 ralphs office","555-555-5555","ralph@mail.com","www.ralph.com
4 tims office","555-555-5555","tims@mail.com","www.ralph.com
$


Hope this works...


Thanks,
Aketi.
# 11  
Old 04-29-2008
I have a similar, but not identical, problem.

I have data like this; it's sorted by the 2nd field (the TID):
envoy,90000000000000634600010001,04/11/2008,23:19:27,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/12/2008,04:23:45,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/12/2008,23:14:25,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/13/2008,04:23:39,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/13/2008,22:41:58,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/13/2008,22:42:44,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/13/2008,22:49:43,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/13/2008,22:50:45,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/13/2008,22:53:23,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/14/2008,12:38:40,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/14/2008,12:52:22,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000693200010001,04/17/2008,09:07:09,RB00060,0009,ENVOY,ERROR,26
envoy,90000000000000693200010001,04/18/2008,10:27:13,RB00083,0009,ENVOY,ERROR,26
envoy,90000000000000693200010001,04/18/2008,11:36:27,RB00084,0009,ENVOY,ERROR,26
envoy,90000000000001034800010001,04/01/2008,23:59:15,RB00294,0030,ENVOY,ERROR,57
envoy,90000000000001034800010001,04/02/2008,23:59:12,RB00295,0030,ENVOY,ERROR,57
envoy,90000000000001034800010001,04/03/2008,23:59:11,RB00296,0030,ENVOY,ERROR,57
envoy,90000000000001034800010001,04/04/2008,23:59:08,RB00297,0030,ENVOY,ERROR,57
envoy,90000000000001034800010001,04/05/2008,23:59:04,RB00297,0030,ENVOY,ERROR,57
envoy,90000000000001034800010001,04/06/2008,22:59:06,RB00297,0030,ENVOY,ERROR,57

I want to do the following:
Check the second field to see if the TID is the same as on the previous line. If the TID has been seen before, then check the 7th field to see if that is also the same as on the previous line. If both are the same, I want to remove the line and increment a counter.

My ideal output would look something like this.
11,envoy,90000000000000634600010001,04/11/2008,23:19:27,RB00266,0015,DETAIL,ERROR,
3,envoy,90000000000000693200010001,04/17/2008,09:07:09,RB00060,0009,ENVOY,ERROR,26

etc.

I figure I actually need an awk script rather than a one-liner. The other option is to treat the last 3 fields as a single field, compare on the TID field and the error field, and then split them back into 3 fields on output.

Any thoughts? I've looked at other examples of removing dups with awk and they're mostly one-liners.

I'd love to get some explanation of WHY it works so that I can modify it if need be.
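For what it's worth, here is the kind of untested sketch I have in mind (it assumes the input is comma-separated and already sorted by TID, as above; yourfile stands in for the real file name):

Code:
awk -F, '
# same TID and same 7th field as the line that started the group:
# drop this line and bump the counter
$2 == tid && $7 == type { count++; next }
# otherwise a new group starts: flush the previous one first
NR > 1 { print count "," line }
{ tid = $2; type = $7; line = $0; count = 1 }
END { if (NR) print count "," line }
' yourfile

The idea is that tid, type and line always hold the values from the first line of the current group, so each incoming line only has to be compared against those; the count is printed in front of that first line once the group ends.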
# 12  
Old 04-29-2008
kinksville,

Please don't hijack someone else's thread; start a new thread for your problem.

Thanks.
# 13  
Old 04-29-2008
Sorry about that; I'm happy to start a new thread. I just hadn't wanted to post something that was already being answered.
# 14  
Old 05-16-2008

Hi Guys...

Please could you help me with the following?

aaaa bbbb cccc sdsd
aaaa bbbb cccc qwer

As you can see, the 2 lines match in three fields...
How can I delete this duplicate? I mean, how do I delete the second line if 3 of its fields match the previous one?
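Maybe something like this untested awk one-liner would do it, where file stands for your input; it keeps a line only the first time the combination of its first three fields is seen:

Code:
awk '!seen[$1,$2,$3]++' file

For later lines that repeat the same first three fields, seen[...] is already non-zero, so the pattern is false and the line is not printed.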

Thanks