Search and replace ---A huge number of files
Top Forums Shell Programming and Scripting Search and replace ---A huge number of files
# 1  
Old 06-06-2013

Hello Friends,

I have the below scenario in my current project. Please suggest which tool (perl, python, etc.) is best for this scenario, or whether I should go for a programming language (C/Java) instead.

(1) I will have a very big file (information about 200 million subscribers will be stored in it). This is static data and will change once a month. Fields in this file will be like: AAA, BBB, CCC.

(2) I have to process input data (around 100 million records per hour; fields: AAA, XXX, YYY, ZZZ, etc.).
For each input record, a lookup needs to be made against the file from step (1) to produce the output: AAA, XXX, YYY, ZZZ, CCC. (The lookup will be keyed on field "AAA".)

Any suggestion on how to process each and every input record against such a big static file?

Regards,
Ravi
# 2  
Old 06-06-2013
The very best tool for this is a database application: MySQL, Oracle, etc. Create an indexed table from your "big file" and update it once a month. You gain scalability, meaning you can write one small DB app and run many separate parallel processes. Or threads.

Otherwise you would need a hash of 200 million records to do real-time lookups. Not that this is impossible; it just seems like an unstable or error-prone approach to me.
Plus it may not scale well as load increases.

So, with no database you need major hash support in your app, and tons of free memory:
Code:
200 million * [big file record size]

probably way more than 4 GB; even at ~100 bytes per record that is already around 20 GB.

perl, ruby, or C will work either with or without a DB. Shell/awk will not work well at all.
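To illustrate the no-database approach described above, here is a minimal sketch in Python. It assumes pipe-delimited files and uses the field names from the thread (AAA, BBB, CCC and AAA, XXX, YYY, ZZZ); the delimiter and field layout are assumptions, not something stated in the thread.

```python
def build_lookup(static_lines):
    """Build an in-memory map AAA -> CCC from the static file.
    With 200 million records this dict alone can need tens of GB of RAM."""
    lookup = {}
    for line in static_lines:
        aaa, bbb, ccc = line.rstrip("\n").split("|")
        lookup[aaa] = ccc
    return lookup

def enrich(input_lines, lookup):
    """Append CCC to each input record, keyed on its first field (AAA)."""
    for line in input_lines:
        fields = line.rstrip("\n").split("|")
        ccc = lookup.get(fields[0], "")   # empty string if AAA not found
        yield "|".join(fields + [ccc])

# Tiny demonstration with made-up data:
static = ["A1|b1|c1", "A2|b2|c2"]
inputs = ["A1|x|y|z", "A3|x|y|z"]
print(list(enrich(inputs, build_lookup(static))))
# -> ['A1|x|y|z|c1', 'A3|x|y|z|']
```

The dict lookup itself is O(1) per record; the real constraint, as noted above, is holding the whole static file in memory at once.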
# 3  
Old 06-06-2013
Hi Jim,

Thanks for the suggestion.

I got your point and was thinking in the same way.

However, if I store the "static data" in a DB, then while processing each and every record (input dynamic data, which is around 100 million per hour) I have to do a DB lookup for each and every record...!

Isn't that expensive?
# 4  
Old 06-06-2013
I agree with Jim's point of view. In such cases I am lucky: I go see my friends one level lower, and since I am responsible for their architecture I have some favors when needed. I use SAS... but SAS costs $$$.
What is the expected file size?
# 5  
Old 06-06-2013
Hi Vbe ,

Not sure which file you are referring to here: expected file size of what?

The static file will have 200 million records, with each record around … characters. This file I can store in the DB, which is a one-time task (once a month, of course).

Now, my worry is that I have to do a DB lookup for each and every input record I receive, extract some value from the DB, apply the changes to the input record, and produce the output.

The input records will be around 200-250 bytes in length, and approximately 100 million records need to be processed per hour (roughly 28,000 lookups per second).

Any suggestions?

Regards,
Ravi
# 6  
Old 06-06-2013
Quote:
Originally Posted by panyam
While processing each and every record ( input dynamic data which is around 100 million / hour ) , I have to do a DB lookup for each and every record...!

Is not that expensive?
Certainly less expensive than looking up records in a flat file!

Or do you mean that you couldn't just "do x for all records" in a database? Because actually, you can.
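To make that set-based point concrete, here is a hedged sketch using SQLite from Python: instead of issuing one SELECT per input record, stage a batch of input records in a table and resolve all the lookups with a single indexed join. Table and column names (subscribers, batch, aaa, ccc, ...) are illustrative, not from the thread.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Static data: indexed on the lookup key (PRIMARY KEY implies an index).
con.execute("CREATE TABLE subscribers (aaa TEXT PRIMARY KEY, bbb TEXT, ccc TEXT)")
con.executemany("INSERT INTO subscribers VALUES (?,?,?)",
                [("A1", "b1", "c1"), ("A2", "b2", "c2")])

# Stage a batch of input records, then join once for the whole batch.
con.execute("CREATE TABLE batch (aaa TEXT, xxx TEXT, yyy TEXT, zzz TEXT)")
con.executemany("INSERT INTO batch VALUES (?,?,?,?)",
                [("A1", "x", "y", "z"), ("A3", "x", "y", "z")])

rows = con.execute("""
    SELECT b.aaa, b.xxx, b.yyy, b.zzz, s.ccc
    FROM batch b LEFT JOIN subscribers s ON s.aaa = b.aaa
    ORDER BY b.aaa
""").fetchall()
print(rows)   # unmatched AAA values come back with ccc = None
```

One join per batch amortises the per-query overhead that makes "one lookup per record" feel expensive; the database walks the index once per key, exactly as an in-memory hash would, but without needing the whole static file in RAM.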