Performance issue with 'grep' command for huge file size


 
# 1  
Old 11-17-2011


I have two files: details.txt contains the details of employees, and emp.txt has some selected employee names. I am extracting the matching employee details from details.txt using emp.txt, with this code:
Code:
while read line
do
emp_name=`echo $line`
grep -e $emp_name details.txt >> output.txt
done < emp.txt

The above code works and I get the expected result. But it takes too much time when the files are huge (I don't have an exact figure; it ran for more than 6 hours before I cancelled the script). As an example, my details.txt is around 2.5GB with about 7.5 lakh (750,000) records, and emp.txt has 55K employee names. Can you please suggest any other option or command that would handle such huge files better? Thanks.

Last edited by vbe; 11-18-2011 at 09:33 AM.. Reason: attempt to use code tags ( fighting with fonts...)
# 2  
Old 11-17-2011
could you show snippets of both files? (using code tags)
# 3  
Old 11-17-2011
What's your locale set to? GNU grep has to do a lot more work in a UTF-8 locale than in the C locale.

Code:
emp_name=`echo $line`

I'm trying to understand the purpose of this line... Flattening whitespace?
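For anyone wondering what that line actually does, here is a minimal sketch. The sample value is made up; the point is only that an unquoted $line gets split into words by the shell, and echo joins them back with single spaces.

```shell
# Hypothetical sample line with a run of spaces between the two words.
line='John    M'

# Unquoted expansion: the shell splits $line on whitespace and echo
# joins the words with single spaces, so the run of spaces collapses.
flattened=`echo $line`

# Quoted expansion keeps the value exactly as it was read.
preserved="$line"

printf '[%s]\n[%s]\n' "$flattened" "$preserved"
```

If the names in emp.txt contain no repeated spaces, the extra assignment changes nothing and the loop variable could be used directly (quoted).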
# 4  
Old 11-17-2011
Don't use a loop to get this done; you're processing the whole 2.5GB details.txt file once for each name in emp.txt. So if you had 2 names in emp.txt, you're processing 5GB of details.txt. 10 names = 25GB. With your 55K names that is 55,000 passes, or roughly 137,500GB read in total. It doesn't scale well that way.

Try this:

Code:
 
grep -F -f emp.txt details.txt

That way details.txt is read only once, no matter how many names are in emp.txt (grep still has to load all of emp.txt as its pattern list).

Using -F might also save some time. If you don't have the '-F' option look for 'fgrep'.
But being on HP-UX the standard 'grep' should have the -F option available.
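Here is a small sketch comparing the two approaches on stand-in files. The file contents, the demo directory and the Maria record are made up for illustration, and LC_ALL=C is an optional extra that assumes the names are plain single-byte text.

```shell
# Small stand-in files in a scratch directory (hypothetical data).
dir="${TMPDIR:-/tmp}/grep_demo.$$"
mkdir -p "$dir" && cd "$dir" || exit 1

printf '%s\n' John Kevin Prakash Susan Ken > emp.txt
printf '%s\n' \
    'HDR|Prakash D' \
    'DTL|Prakash|EMP0000010|Sr Associate|FL' \
    'HDR|Kevin T' \
    'DTL|Kevin|EMP0000004|Analyst|IL' \
    'HDR|John M' \
    'DTL|John|EMP0000184|Manager|CA' \
    'HDR|Maria S' \
    'DTL|Maria|EMP0000200|Analyst|NY' > details.txt

# Loop approach: one full scan of details.txt per name in emp.txt.
: > loop_output.txt
while read -r name
do
    grep -e "$name" details.txt >> loop_output.txt
done < emp.txt

# Single-pass approach: every name is a fixed string, one scan in total.
# LC_ALL=C avoids multibyte locale handling for this one command.
LC_ALL=C grep -F -f emp.txt details.txt > output.txt

wc -l loop_output.txt output.txt
```

Both outputs hold the same lines here, but not in the same order: the loop groups lines by name, while the single pass keeps the order of details.txt. A line that matches two different names would also appear twice in the loop output and only once in the single-pass output.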

Last edited by rwuerth; 11-17-2011 at 01:59 PM.. Reason: I'm scatter brained today. Keep thinking of things to add, after the fact.
# 5  
Old 11-17-2011
Thank you all for your quick responses! Thanks a lot, rwuerth; the '-F' option is working and I am able to extract the required data in much less time.

However, the files are like:
Code:
emp.txt
------------
John
Kevin
Prakash
Susan
Ken

details.txt
-------------
HDR|Prakash D
DTL|Prakash|EMP0000010|Sr Associate|FL
HDR|Kevin T
DTL|Kevin|EMP0000004|Analyst|IL
HDR|John M
DTL|John|EMP0000184|Manager|CA

Thanks again :)
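One thing worth checking with files shaped like these: grep -F matches a name anywhere in the line, so a short name such as Ken would also match a longer one. The sketch below is only an illustration under assumptions: the Kenneth record is made up, and it assumes that only DTL records whose second field equals a name exactly are wanted, which may not be the real requirement. It also assumes an awk that supports the `in` operator (any POSIX awk does).

```shell
# Stand-in files in a scratch directory; the Kenneth lines are hypothetical.
dir="${TMPDIR:-/tmp}/awk_demo.$$"
mkdir -p "$dir" && cd "$dir" || exit 1

printf '%s\n' John Kevin Prakash Susan Ken > emp.txt
printf '%s\n' \
    'HDR|Prakash D' \
    'DTL|Prakash|EMP0000010|Sr Associate|FL' \
    'HDR|Kevin T' \
    'DTL|Kevin|EMP0000004|Analyst|IL' \
    'HDR|John M' \
    'DTL|John|EMP0000184|Manager|CA' \
    'HDR|Kenneth R' \
    'DTL|Kenneth|EMP0000300|Analyst|TX' > details.txt

# Substring match: "Ken" also pulls in both Kenneth lines.
grep -F -f emp.txt details.txt > grep_output.txt

# Exact match on the name field, still in a single pass over details.txt.
awk -F'|' '
    NR == FNR { names[$1]; next }   # first file: remember every name
    $1 == "DTL" && ($2 in names)    # second file: print exact DTL matches
' emp.txt details.txt > awk_output.txt

wc -l grep_output.txt awk_output.txt
```

If the HDR lines are needed as well, the plain grep -F -f command is the simpler choice, as long as no name is a substring of another.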

Last edited by Scott; 11-17-2011 at 05:52 PM.. Reason: Code tags...
# 6  
Old 11-18-2011
What was the time savings?

Also, you said,
Quote:
However, the files are like:
And proceeded to show your input files.

Was there a question there that you wanted to ask?

Is it working as you'd expect it to?
# 7  
Old 11-21-2011
Nope, I do not have any further questions right now. I posted the file details because someone else asked to see the file structure.

Thanks rwuerth for your suggestion. It is working fine. I will let you know about the time savings in a couple of days, as full-volume testing is still pending.