Filter/remove duplicate .dat file with certain criteria


 
# 1  
Old 03-14-2011

I am a beginner in Unix, but I have been asked to write a script to filter (remove duplicates from) data in a .dat file. The file is very large, containing billions of records.

The contents of the file look like this:
Code:
 
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40343369,OTC,mart_rec,95, ,0
30002157,40342424,OTC,mart_rec,98, ,0
30002157,40343369,OTC,mart_rec,99, ,0
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40343369,OTC,mart_rec,100, ,0
30002157,40345665,OTC,mart_rec,100, ,0
30002157,40345665,OTC,mart_rec,100, ,0

The first, second, and third fields together constitute a primary key.
Thus, from these entries
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40342424,OTC,mart_rec,98, ,0
30002157,40342424,OTC,mart_rec,100, ,0

only the first one is valid (the complete line may or may not be duplicated).

Similarly, from these,
30002157,40343369,OTC,mart_rec,95, ,0
30002157,40343369,OTC,mart_rec,99, ,0
only the first entry is valid, i.e.,
30002157,40343369,OTC,mart_rec,95, ,0

I need to write a script which creates a file (by manipulating the input file) like this:

Code:
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40343369,OTC,mart_rec,95, ,0
30002157,40345665,OTC,mart_rec,100, ,0

The first occurrence of each combination is kept; the rest are ignored. Thus, I cannot even sort the file, because that may place a second occurrence of a combination before the first occurrence.

I would be grateful if any of you could advise me how I can do this.

I hope I have explained the problem clearly.
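Side note on scale (a sketch, not from the thread): a pure in-memory awk approach keeps one array entry per distinct key, which may be a lot of memory for billions of records. A sort-based alternative can still keep the FIRST occurrence of each key: tag each line with its line number, sort by the key fields with the line number as tiebreaker, keep the first line per key, then restore the original order. The filename and field positions match the example data above.

```shell
# Keep the first occurrence of each (field1,field2,field3) key, in
# original order, without holding every line's key in memory at once.
printf '%s\n' \
  '30002157,40342424,OTC,mart_rec,100, ,0' \
  '30002157,40343369,OTC,mart_rec,95, ,0' \
  '30002157,40342424,OTC,mart_rec,98, ,0' |
nl -ba -s',' |                          # prepend line number: "  1,<row>"
sort -t',' -k2,2 -k3,3 -k4,4 -k1,1n |   # sort by key fields, line no. breaks ties
awk -F',' '!seen[$2,$3,$4]++' |         # keep first line per key
sort -t',' -k1,1n |                     # restore original input order
cut -d',' -f2-                          # strip the line number again
# Output:
# 30002157,40342424,OTC,mart_rec,100, ,0
# 30002157,40343369,OTC,mart_rec,95, ,0
```

In exchange for the extra sorting passes, `sort` can spill to temporary files on disk, so this handles inputs far larger than memory.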
# 2  
Old 03-14-2011
Code:
awk -F',' '{key=$1$2$3;if(key in a) next; else a[$1$2$3]=$0; print a[$1$2$3]} yourFile

Tested with your example data here; it returned:
Code:
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40343369,OTC,mart_rec,95, ,0
30002157,40345665,OTC,mart_rec,100, ,0
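A shorter equivalent of this one-liner (a common awk idiom, not the poster's exact script): print a line only the first time its key is seen. Using `seen[$1,$2,$3]` joins the fields with awk's SUBSEP, so keys like "1","23" stay distinct from "12","3", which plain concatenation (`$1$2$3`) would merge.

```shell
# First-occurrence filter on the composite key of fields 1-3.
printf '%s\n' \
  '30002157,40342424,OTC,mart_rec,100, ,0' \
  '30002157,40343369,OTC,mart_rec,95, ,0' \
  '30002157,40342424,OTC,mart_rec,98, ,0' |
awk -F',' '!seen[$1,$2,$3]++'
# Output:
# 30002157,40342424,OTC,mart_rec,100, ,0
# 30002157,40343369,OTC,mart_rec,95, ,0
```

`!seen[key]++` is true only on the first lookup of a key, and the bare pattern makes awk print the line, so first occurrences pass through and later ones are dropped.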

# 3  
Old 03-14-2011
Thanks for the quick response.

I am trying this in the /usr/bin/csh shell. It gives an error:
magu1@gmmagappu1% awk -F',' '{key=$1$2$3;if(key in a) next; else a[$1$2$3]=$0; print a[$1$2$3]} UnixEg.dat
unmatched '

Do I have to run it in some other shell?
# 4  
Old 03-14-2011
Hi, I checked my post. It was my fault: one ' was missing. No idea how that happened, since I did a copy & paste... sorry.
Try the following line:
Code:
awk -F',' '{key=$1$2$3;if(key in a) next; else a[$1$2$3]=$0; print a[$1$2$3]}' yourFile

# 5  
Old 03-14-2011
It seems fine to me, but it's still not working.

I am trying this (in the csh shell):
awk -F',' '{key=$1$2$3;if(key in a) next; else a[$1$2$3]=$0; print a[$1$2$3]}' UnixEg.dat


and getting:
awk: syntax error near line 1
awk: illegal statement near line 1
awk: illegal statement near line 1

# 6  
Old 03-14-2011
It works here:
Code:
kent$ echo "30002157,40342424,OTC,mart_rec,100, ,0 
dquote> 30002157,40343369,OTC,mart_rec,95, ,0
dquote> 30002157,40342424,OTC,mart_rec,98, ,0
dquote> 30002157,40343369,OTC,mart_rec,99, ,0
dquote> 30002157,40342424,OTC,mart_rec,100, ,0
dquote> 30002157,40343369,OTC,mart_rec,100, ,0
dquote> 30002157,40345665,OTC,mart_rec,100, ,0
dquote> 30002157,40345665,OTC,mart_rec,100, ,0" | awk -F',' '{key=$1$2$3;if(key in a) next; else a[$1$2$3]=$0; print a[$1$2$3]}'

30002157,40342424,OTC,mart_rec,100, ,0
30002157,40343369,OTC,mart_rec,95, ,0
30002157,40345665,OTC,mart_rec,100, ,0

I tried it in zsh and bash; both worked. I don't have csh installed.
Which awk do you have? gawk?
# 7  
Old 03-15-2011
Hi, I tried these two options in csh and bash:

Code:
 
cat UnixEg.dat | awk -F',' '{key=$1$2$3;if(key in a) next; else a[$1$2$3]=$0; print a[$1$2$3]}'
 
awk -F',' '{key=$1$2$3;if(key in a) next; else a[$1$2$3]=$0; print a[$1$2$3]}' UnixEg.dat

In both cases I am getting:

Code:
awk: syntax error near line 1
awk: illegal statement near line 1
awk: illegal statement near line 1


---------- Post updated at 11:22 PM ---------- Previous update was at 09:47 PM ----------

Found the solution:

https://www.unix.com/shell-programmin...laination.html
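Editor's note (an assumption, since the linked page is truncated): the `syntax error near line 1` message is typical of the old 1977-vintage `/usr/bin/awk` shipped on Solaris, which does not understand constructs like `key in array`. The usual fix is to run the same logic with `nawk` or the POSIX awk instead; the paths below are the standard Solaris locations.

```shell
# Same first-occurrence logic as the accepted answer, run under nawk;
# SUBSEP-joined keys avoid collisions from plain concatenation.
nawk -F',' '{ key = $1 SUBSEP $2 SUBSEP $3; if (key in a) next; a[key] = $0; print }' UnixEg.dat

# Or, with the XPG4 (POSIX) awk on Solaris:
/usr/xpg4/bin/awk -F',' '{ key = $1 SUBSEP $2 SUBSEP $3; if (key in a) next; a[key] = $0; print }' UnixEg.dat
```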