Matching 10 Million file records with 10 Million in other file | Unix Linux Forums | Shell Programming and Scripting

#1  06-12-2012
vguleria (Registered User)

Dear All,

I have two files, each containing 10 million records in delimited text (CSV-style). One file is input.txt, the other is status.txt.

input.txt contains several fields, including one unique id field (a primary key, so to speak).
status.txt contains only two fields: 1. the unique id and 2. a status.

Problem: match each id from input.txt against the ids in status.txt and update/log the status accordingly in an output file.

Requirement: an efficient algorithm that produces the result in minimal time. I tried Perl, but the system hangs during processing. Please suggest a workable way to do this. Is it doable in Perl, or in C/C++/Java?

Thanks.
#2  06-12-2012
methyl (Advisor)
What Operating System and version are you running?

How big are the files?
Are either or both of the files in sorted order?
Do the records in each file match one-for-one?
Does the output order matter?

Can you post sample input and matching output?

Is this an extract from a database where it might be easier to work on the data while it is still in the database?
#3  06-12-2012
Chubler_XL (Forum Advisor)
Additional questions:

Do you need to run this match frequently, or is it a one-off job?
How frequently are the data files updated?
Are new records just appended to the files, or are they completely re-written?
#4  06-13-2012
vguleria (Registered User)
The OS is Linux. It's a one-time job (run occasionally); these are offline files and are not being updated. I need to build a process for future requirements.

The data is not in a DB; these are actually application log files.
Each file is roughly 1.5 GB. Right now I'm only thinking about the best approach to complete the task.
I tried using Perl hashes, but that didn't work; I guess keeping that much data in memory isn't possible, so the algorithm has to be really efficient here.


Sample files:
Input.txt

Code:
20.04.2012 11.08.44;RECV;APPNAME@HOSTNAME06:11496059192;processed;Location;contact;status;email_id;2
20.04.2012 11.08.44;RECV;APPNAME@HOSTNAME06:11496059168;processed;Location;contact;status;email_id;1
20.04.2012 11.08.44;RECV;APPNAME@HOSTNAME06:11496059220;processed;Location;contact;status;email_id;2

Status.txt

Code:
APPNAME@HOSTNAME06:11496059192;SUCCESS
APPNAME@HOSTNAME06:11496059224;SUCCESS
APPNAME@HOSTNAME06:11496059168;FAILURE
APPNAME@HOSTNAME06:11496059220;FAILURE
APPNAME@HOSTNAME06:11496059193;SUCCESS

I need to update the status field in input.txt with the status (SUCCESS/FAILURE) from status.txt.
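For reference, the textbook approach for exactly this shape of problem is a hash join in awk: load status.txt into an associative array, then stream input.txt once, rewriting the status field on a match. A minimal sketch against the posted samples (field positions are assumed from the sample lines above, and this has only been tried on tiny stand-in files, not at 10-million-row scale):

```shell
cd "$(mktemp -d)"

cat > status.txt <<'EOF'
APPNAME@HOSTNAME06:11496059192;SUCCESS
APPNAME@HOSTNAME06:11496059168;FAILURE
EOF

cat > input.txt <<'EOF'
20.04.2012 11.08.44;RECV;APPNAME@HOSTNAME06:11496059192;processed;Location;contact;status;email_id;2
20.04.2012 11.08.44;RECV;APPNAME@HOSTNAME06:11496059168;processed;Location;contact;status;email_id;1
20.04.2012 11.08.44;RECV;APPNAME@HOSTNAME06:11496059999;processed;Location;contact;status;email_id;2
EOF

# Pass 1 (status.txt): build the hash  s[id] = SUCCESS/FAILURE.
# Pass 2 (input.txt): if the id in field 3 is in the hash, overwrite field 7.
# Unmatched input lines pass through unchanged.
awk -F';' -v OFS=';' '
  NR==FNR { s[$1] = $2; next }
  $3 in s { $7 = s[$3] }
  { print }
' status.txt input.txt > output.txt
```

Whether 10 million keys fit in awk's hash depends on available RAM (the keys here are ~30 bytes each), so treat this as the thing to try first, not a guarantee.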

#5  06-13-2012
methyl (Advisor)
Any comment about the order of the data in the files, and whether there is a one-for-one match between the two files (in which case the paste command might be suitable)?

Edit: Posts crossed. I can see that neither file is in any particular order and that your sample does not show a one-for-one match.

It's going to be necessary to sort both files. Does the order of the final output data matter?

#6  06-13-2012
vguleria (Registered User)
Quote:
Originally Posted by methyl View Post
Any comment about the order of the data in the files, and whether there is a one-for-one match between the two files (in which case the paste command might be suitable)?

Edit: Posts crossed. I can see that neither file is in any particular order and that your sample does not show a one-for-one match.

It's going to be necessary to sort both files. Does the order of the final output data matter?
Each id matches at most once: an id from input.txt is not always found in status.txt, but when it is found there is only one match.
No, the order doesn't matter here.
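Since order doesn't matter, the sort route methyl mentions can be taken all the way with sort(1) and join(1), which work out-of-core and never hold all 10 million rows in memory at once. A hedged sketch on tiny stand-in files (field numbers assumed from the samples earlier in the thread):

```shell
cd "$(mktemp -d)"
export LC_ALL=C        # one consistent collation for both sort and join

cat > status.txt <<'EOF'
APPNAME@HOSTNAME06:11496059192;SUCCESS
APPNAME@HOSTNAME06:11496059220;FAILURE
EOF

cat > input.txt <<'EOF'
20.04.2012 11.08.44;RECV;APPNAME@HOSTNAME06:11496059220;processed;Location;contact;status;email_id;2
20.04.2012 11.08.44;RECV;APPNAME@HOSTNAME06:11496059192;processed;Location;contact;status;email_id;2
EOF

sort -t';' -k3,3 input.txt  > input.sorted    # input's id is field 3
sort -t';' -k1,1 status.txt > status.sorted   # status's id is field 1

# Join input field 3 against status field 1; -o rebuilds the input record
# with field 7 swapped for the status. Unmatched input lines are dropped
# (add "-a 1" to keep them).
join -t';' -1 3 -2 1 \
     -o 1.1,1.2,1.3,1.4,1.5,1.6,2.2,1.8,1.9 \
     input.sorted status.sorted > output.txt
```

sort spills to temporary files when its buffer fills, so two 1.5 GB files should be no problem; the trade-off versus the hash approach is the cost of the two sorts.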
#7  06-13-2012
Lem (Registered User)
Quote:
Originally Posted by vguleria View Post

Code:
APPNAME@HOSTNAME06:11496059192;SUCCESS
APPNAME@HOSTNAME06:11496059224;SUCCESS

Are these numbers (e.g. :11496059224;) unique identifiers, or can there be two or more lines with the same number? If they're unique identifiers, I think you could try something like this.

First of all, since status.txt is too big, let's split it into many "tiny" files:


Code:
split -l 100000 status.txt tinyfile

Then, here we go:


Code:
IFS=";:"                       # split on both ';' and ':'
declare -a status
for file in tinyfile*; do
  # load this chunk of status.txt: status[id] = SUCCESS/FAILURE
  while read -r x y z; do
     status[$y]=$z
  done < "$file"
  # stream input.txt, replacing the status field (h) where the id (d) matches
  while read -r a b c d e f g h i l; do
     h=${status[$d]}
     [[ -z $h ]] || printf '%s;%s;%s:%s;%s;%s;%s;%s;%s;%s\n' "$a" "$b" "$c" "$d" "$e" "$f" "$g" "$h" "$i" "$l"
  done < input.txt >> output.txt
  unset status
done

I haven't tried it, so I don't know how fast or slow it can be.