Filter/remove duplicate records from a .dat file with certain criteria
I am a beginner in Unix and have been asked to write a script that filters (removes) duplicate data from a .dat file. The file is very large, containing billions of records.
The contents of the file look like the samples below. The first, second, and third elements together constitute the primary key.
Thus, from these entries:
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40342424,OTC,mart_rec,98, ,0
30002157,40342424,OTC,mart_rec,100, ,0
only the first one is valid (the complete line may or may not be duplicated).
Similarly, from these:
30002157,40343369,OTC,mart_rec,95, ,0
30002157,40343369,OTC,mart_rec,99, ,0
only first entry is valid, i.e.,
30002157,40343369,OTC,mart_rec,95, ,0
I need to write a script that creates a new file (by manipulating the input file) in which only the first occurrence of each key combination is kept and the rest are ignored. I cannot simply sort the file, because sorting may place a later occurrence of a combination before the first occurrence.
I would be grateful if any of you could advise me on how to do this.
I am trying this in the /usr/bin/csh shell, and it gives an error:
magu1@gmmagappu1% awk -F',' '{key=$1$2$3;if(key in a) next; else a[$1$2$3]=$0; print a[$1$2$3]} UnixEg.dat
unmatched '
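The "unmatched '" error comes from the command itself: the awk program is opened with a single quote but never closed before the filename, so csh sees an unterminated string. Below is a minimal sketch of a corrected version. The filename `UnixEg.dat` is taken from the attempted command, the sample rows are the ones from the post, and `UnixEg_dedup.dat` is a hypothetical output name. Joining the key fields with the field separator (rather than plain concatenation as in `$1$2$3`) avoids accidental collisions between different keys.

```shell
# Recreate the sample input from the post (illustrative only)
cat > UnixEg.dat <<'EOF'
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40342424,OTC,mart_rec,98, ,0
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40343369,OTC,mart_rec,95, ,0
30002157,40343369,OTC,mart_rec,99, ,0
EOF

# Keep only the first line seen for each $1,$2,$3 combination.
# seen[key]++ is 0 (false) the first time a key appears, so the
# leading ! makes the default print action fire only on first sight;
# later occurrences increment the counter and are skipped.
awk -F',' '!seen[$1 FS $2 FS $3]++' UnixEg.dat > UnixEg_dedup.dat
```

This preserves the original line order, so no pre-sorting is needed. One caveat for a file with billions of records: awk holds every distinct key in memory, so the `seen` array can grow very large if the number of distinct keys is itself huge.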