Sponsored Content
Full Discussion: Checking file for duplicates
Top Forums Shell Programming and Scripting Checking file for duplicates Post 302422724 by pxy2d1 on Wednesday 19th of May 2010 08:16:55 AM
Old 05-19-2010
Checking file for duplicates

Hi all,

I am due to start receiving a weekly csv containing around 6 million rows. I need to do some processing on this file and then send it on elsewhere.

My problem is that after week 1 the files that I will receive are likely to contain data already received in previous files and I need to strip this data out before sending on.

Initially my plan was to keep a list of each row of data sent and then to check if each row in a new file is already present in my sent list. However it soon became clear that at week 2 I would be checking each of 6 million rows to see if they appeared on a list of 6 million already sent, but at week 5 would be checking against 30 million rows.

I was hoping that someone may have a more efficient way to achieve this.

It is likely that the data will start to be purged after week10 so I would say a max sent list of around 60 million rows.

Any ideas would be appreciated
 

10 More Discussions You Might Find Interesting

1. UNIX for Dummies Questions & Answers

Avoid Duplicates in a file

Hi Gurus, I had a question regarding avoiding duplicates.i have a file abc.txt abc.txt ------- READER_1_1_1> HIER_28056 XML Reader: Error occurred while parsing:; line number ; column number READER_1_3_1> Sun Mar 23 23:52:48 2008 READER_1_3_1> HIER_28056 XML Reader: Error occurred while... (7 Replies)
Discussion started by: pssandeep
7 Replies

2. Shell Programming and Scripting

Remove duplicates from a file

Hi, I need to remove duplicates from a file. The file will be like this 0003 10101 20100120 abcdefghi 0003 10101 20100121 abcdefghi 0003 10101 20100122 abcdefghi 0003 10102 20100120 abcdefghi 0003 10103 20100120 abcdefghi 0003 10103 20100121 abcdefghi Here if the first colum and... (6 Replies)
Discussion started by: gpaulose
6 Replies

3. UNIX for Dummies Questions & Answers

CSV file:Find duplicates, save original and duplicate records in a new file

Hi Unix gurus, Maybe it is too much to ask for but please take a moment and help me out. A very humble request to you gurus. I'm new to Unix and I have started learning Unix. I have this project which is way to advanced for me. File format: CSV file File has four columns with no header... (8 Replies)
Discussion started by: arvindosu
8 Replies

4. Shell Programming and Scripting

Removing Duplicates from file

Hi Experts, Please check the following new requirement. I got data like the following in a file. FILE_HEADER 01cbbfde7898410| 3477945| home| 1 01cbc275d2c122| 3478234| WORK| 1 01cbbe4362743da| 3496386| Rich Spare| 1 01cbc275d2c122| 3478234| WORK| 1 This is pipe separated file with... (3 Replies)
Discussion started by: tinufarid
3 Replies

5. Shell Programming and Scripting

Duplicates in an XML file

Hi All, I have an xml file that contains information like this <ID>574922<COMMENT>TEXT TEXT TEXT</COMMENT></ID> <ID>574922<COMMENT>TEXT TEXT TEXT</COMMENT></ID> <ID>412659<COMMENT>TEXT TEXT TEXT TEXT TEXT</COMMENT></ID> <ID>873520<COMMENT>TEXT</COMMENT></ID>... (5 Replies)
Discussion started by: TasosARISFC
5 Replies

6. Shell Programming and Scripting

Remove the partial duplicates by checking the length of a field

Hi Folks - I'm quite new to awk and didn't come across such issues before. The problem statement is that, I've a file with duplicate records in 3rd and 4th fields. The sample is as below: aaaaaa|a12|45|56 abbbbaaa|a12|45|56 bbaabb|b1|51|45 bbbbbabbb|b2|51|45 aaabbbaaaa|a11|45|56 ... (3 Replies)
Discussion started by: asyed
3 Replies

7. Programming

[Solved] Removing duplicates from the file and saving as new file

Dear All I have 200 data files and each files has many duplicates. I am looking for the automated awk script such that it checks and removes the duplicates from the each file and saving them as new files for all 200 files in the respective folder. For example my data looks like this.. ... (12 Replies)
Discussion started by: bala06
12 Replies

8. UNIX for Dummies Questions & Answers

Remove duplicates from a file

Can u tell me how to remove duplicate records from a file? (11 Replies)
Discussion started by: saga20
11 Replies

9. UNIX for Dummies Questions & Answers

Removing duplicates from a file

Hi All, I am merging files coming from 2 different systems ,while doing that I am getting duplicates entries in the merged file I,01,000131,764,2,4.00 I,01,000131,765,2,4.00 I,01,000131,772,2,4.00 I,01,000131,773,2,4.00 I,01,000168,762,2,2.00 I,01,000168,763,2,2.00... (5 Replies)
Discussion started by: Sri3001
5 Replies

10. Shell Programming and Scripting

Removing duplicates from new file

i hav two files like i want to remove/delete all the duplicate lines in file2 which are viz unix,unix2,unix3.I have tried previous post also,but in that complete line must be similar.In this case i have to verify first column only regardless what is the content in succeeding columns. (3 Replies)
Discussion started by: sagar_1986
3 Replies
funtablerowget(3)						SAORD Documentation						 funtablerowget(3)

NAME
FunTableRowGet - get Funtools rows SYNOPSIS
#include <funtools.h> void *FunTableRowGet(Fun fun, void *rows, int maxrow, char *plist, int *nrow) DESCRIPTION
The FunTableRowGet() routine retrieves rows from a Funtools binary table or raw event file, and places the values of columns selected by FunColumnSelect() into an array of user structs. Selected column values are automatically converted to the specified user data type (and to native data format) as necessary. The first argument is the Fun handle associated with this row data. The second rows argument is the array of user structs into which the selected columns will be stored. If NULL is passed, the routine will automatically allocate space for this array. (This includes proper allocation of pointers within each struct, if the "@" pointer type is used in the selection of columns. Note that if you pass NULL in the second argument, you should free this space using the standard free() system call when you are finished with the array of rows.) The third maxrow argument specifies the maximum number of rows to be returned. Thus, if rows is allocated by the user, it should be at least of size maxrow*sizeof(evstruct). The fourth plist argument is a param list string. Currently, the keyword/value pair "mask=transparent" is supported in the plist argument. If this string is passed in the call's plist argument, then all rows are passed back to the user (instead of just rows passing the filter). This is only useful when FunColumnSelect() also is used to specify "$region" as a column to return for each row. In such a case, rows found within a region have a returned region value greater than 0 (corresponding to the region id of the region in which they are located), rows passing the filter but not in a region have region value of -1, and rows not passing any filter have region value of 0. Thus, using "mask=transparent" and the returned region value, a program can process all rows and decide on an action based on whether a given row passed the filter or not. The final argument is a pointer to an int variable that will return the actual number of rows returned. The routine returns a pointer to the array of stored rows, or NULL if there was an error. (This pointer will be the same as the second argument, if the latter is non-NULL). /* get rows -- let routine allocate the row array */ while( (buf = (Ev)FunTableRowGet(fun, NULL, MAXROW, NULL, &got)) ){ /* process all rows */ for(i=0; i<got; i++){ /* point to the i'th row */ ev = buf+i; /* rearrange some values. etc. */ ev->energy = (ev->pi+ev->pha)/2.0; ev->pha = -ev->pha; ev->pi = -ev->pi; } /* write out this batch of rows */ FunTableRowPut(fun2, buf, got, 0, NULL); /* free row data */ if( buf ) free(buf); } As shown above, successive calls to FunTableRowGet() will return the next set of rows from the input file until all rows have been read, i.e., the routine behaves like sequential Unix I/O calls such as fread(). See evmerge example code for a more complete example. Note that FunTableRowGet() also can be called as FunEventsGet(), for backward compatibility. SEE ALSO
See funtools(7) for a list of Funtools help pages version 1.4.2 January 2, 2008 funtablerowget(3)
All times are GMT -4. The time now is 07:59 PM.
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.
Privacy Policy