Posted by mukeshguliao on Monday, 14 March 2011, 08:51 AM
Filter/remove duplicate .dat file with certain criteria

I am a beginner in Unix. I have been asked to write a script to filter (remove duplicate) data from a .dat file. The file is very large, containing billions of records.

The contents of the file look like this:
Code:
 
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40343369,OTC,mart_rec,95, ,0
30002157,40342424,OTC,mart_rec,98, ,0
30002157,40343369,OTC,mart_rec,99, ,0
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40343369,OTC,mart_rec,100, ,0
30002157,40345665,OTC,mart_rec,100, ,0
30002157,40345665,OTC,mart_rec,100, ,0

The first, second, and third fields together constitute the primary key.
Thus, from these entries
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40342424,OTC,mart_rec,98, ,0
30002157,40342424,OTC,mart_rec,100, ,0

only the first one is valid (the complete line may or may not be an exact duplicate).

Similarly, from these,
30002157,40343369,OTC,mart_rec,95, ,0
30002157,40343369,OTC,mart_rec,99, ,0
only the first entry is valid, i.e.,
30002157,40343369,OTC,mart_rec,95, ,0

I need to write a script which creates a file (by manipulating the input file) like this:

Code:
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40343369,OTC,mart_rec,95, ,0
30002157,40345665,OTC,mart_rec,100, ,0

The first occurrence of each combination is kept and the rest are ignored. Thus, I cannot even sort the file, because sorting may place a second occurrence of a combination before its first occurrence.

I would be grateful if any of you could advise me on how to do this.

I hope I have explained the problem clearly.
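
A minimal sketch of one possible approach, assuming awk is available, the file is strictly comma-separated, and the first three fields always form the key (input.dat and output.dat are placeholder file names):

Code:
# Print a line only the first time its (field1, field2, field3) key is seen,
# preserving the original order of the file.
awk -F',' '!seen[$1 FS $2 FS $3]++' input.dat > output.dat

Because awk reads the file in order and only remembers which keys have already appeared, the first occurrence of each combination is kept without sorting the input; the main cost is the in-memory array of keys, which may be significant for a file with billions of records.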
 
