Filter/remove duplicate records from a .dat file with certain criteria
I am a beginner in Unix and have been asked to write a script that filters (removes) duplicate data from a .dat file. The file is very large, containing billions of records.
The contents of the file look like the samples below. The first, second, and third elements together constitute the primary key.
Thus, from these entries:
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40342424,OTC,mart_rec,98, ,0
30002157,40342424,OTC,mart_rec,100, ,0
only the first one is valid (the complete line may or may not be duplicated).
Similarly, from these:
30002157,40343369,OTC,mart_rec,95, ,0
30002157,40343369,OTC,mart_rec,99, ,0
only first entry is valid, i.e.,
30002157,40343369,OTC,mart_rec,95, ,0
I need to write a script that creates a new file (by manipulating the input file) in which only the first occurrence of each key combination is kept and the rest are ignored. I cannot simply sort the file, because sorting may place a later occurrence of a combination before the first occurrence.
I would be grateful if any of you could advise me on how to do this.
I am trying this in the /usr/bin/csh shell, and it gives an error:
magu1@gmmagappu1% awk -F',' '{key=$1$2$3;if(key in a) next; else a[$1$2$3]=$0; print a[$1$2$3]} UnixEg.dat
unmatched '
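The "unmatched '" error comes from the command itself: the awk program is opened with a single quote but never closed before the filename, so csh sees an unterminated string. Below is a minimal sketch of a corrected version. The filename `UnixEg.dat` is taken from the attempted command, the sample rows are the ones from the post, and `UnixEg_dedup.dat` is a hypothetical output name. Joining the key fields with the field separator (rather than plain concatenation as in `$1$2$3`) avoids accidental collisions between different keys.

```shell
# Recreate the sample input from the post (illustrative only)
cat > UnixEg.dat <<'EOF'
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40342424,OTC,mart_rec,98, ,0
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40343369,OTC,mart_rec,95, ,0
30002157,40343369,OTC,mart_rec,99, ,0
EOF

# Keep only the first line seen for each $1,$2,$3 combination.
# seen[key]++ is 0 (false) the first time a key appears, so the
# leading ! makes the default print action fire only on first sight;
# later occurrences increment the counter and are skipped.
awk -F',' '!seen[$1 FS $2 FS $3]++' UnixEg.dat > UnixEg_dedup.dat
```

This preserves the original line order, so no pre-sorting is needed. One caveat for a file with billions of records: awk holds every distinct key in memory, so the `seen` array can grow very large if the number of distinct keys is itself huge.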