Best way to search for patterns in huge text files


 
# 1  
Old 01-08-2010

I have the following situation:

a text file with 50000 string patterns:

Code:
abc2344536
gvk6575556
klo6575556
....

and 3 text files each with more than 1 million lines:

Code:
...
000000 abc2344536      46575 0000
000000 abc2344536      46575 4444
000000 abc2344555      46575 1234
...

I have to extract all lines from the 3 files that match all patterns from the pattern file :-)

Any ideas, please!!!

Andy

# 2  
Old 01-08-2010
If I understand correctly ("all patterns" is a bit confusing) ...
Try this:

Code:
awk 'NR == FNR { p[$1]; next } $2 in p' pattern_file input_file1 input_file2 ...

Use gawk, nawk or /usr/xpg4/bin/awk on Solaris.
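
How the one-liner works, as a minimal sketch: while awk reads the first file (NR == FNR), each pattern is stored as an array key; for every later file, a line is printed only if its second field is one of those keys. The directory name awk_demo and the sample rows are hypothetical, modelled on the data in post #1.

```shell
# Hypothetical sample data in a scratch directory (names are assumptions).
mkdir -p awk_demo
printf '%s\n' abc2344536 gvk6575556 klo6575556 > awk_demo/pattern_file
printf '%s\n' \
  '000000 abc2344536      46575 0000' \
  '000000 abc2344536      46575 4444' \
  '000000 abc2344555      46575 1234' > awk_demo/input_file1

# First file: remember each pattern as a key of p.
# Later files: print the line when field 2 is a known key.
awk 'NR == FNR { p[$1]; next } $2 in p' \
  awk_demo/pattern_file awk_demo/input_file1 > awk_demo/matches

cat awk_demo/matches
```

With this sample data only the two abc2344536 lines are printed, because abc2344555 is not in the pattern file. The lookup is an exact match on the whole field, so it only helps when the pattern always sits in column 2.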
# 3  
Old 01-08-2010
Thank you for your quick reply, but it does not work.

I need something faster than the following solution:
Code:
for i in `cat pattern_file`
do
  grep "$i" input_file1 >> output_file1
  grep "$i" input_file2 >> output_file2
  grep "$i" input_file3 >> output_file3
done

Any ideas please!!!

# 4  
Old 01-08-2010
Hi.

Is:
Code:
grep -f pattern_file input_file[123] > output_fileX

quicker?

Also, please say WHAT doesn't work with the awk solution.

Like radoulov, I don't know what you mean by "all patterns".

# 5  
Old 01-08-2010
scottn,
Code:
grep -f pattern_file input_file[123] > output_fileX

With this I get an out-of-memory error :-)

"All patterns" means: return every line that contains any string pattern from the pattern file. The results for each input file should also be kept separate. Please excuse my English :-(
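
A possible variation, as a hedged sketch: the patterns shown in post #1 look like literal strings rather than regular expressions, so grep's fixed-string mode (-F) may need far less memory than compiling 50000 regexes. That is an assumption and has not been confirmed in this thread. Running grep once per input file keeps the results separate. The directory grep_demo and the sample rows are hypothetical.

```shell
# Hypothetical sample data in a scratch directory (names are assumptions).
mkdir -p grep_demo
printf '%s\n' abc2344536 gvk6575556 klo6575556 > grep_demo/pattern_file
printf '%s\n' \
  '000000 abc2344536      46575 0000' \
  '000000 abc2344555      46575 1234' > grep_demo/input_file1
printf '%s\n' \
  '000000 89798798798    abc2344536' \
  '000000 7777777777     zzz0000000' > grep_demo/input_file2
printf '%s\n' \
  '000000 klo6575556      46575 4444' > grep_demo/input_file3

# -F treats every pattern as a fixed string, -f reads them from a file.
# One grep run per input file, one output file per input file.
for n in 1 2 3
do
  grep -F -f grep_demo/pattern_file "grep_demo/input_file$n" \
    > "grep_demo/output_file$n" || [ "$?" -eq 1 ]   # exit status 1 only means "no match"
done
```

Note that -F matches substrings, so a pattern would also be found inside a longer token; adding -w restricts matches to whole words if that matters.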

# 6  
Old 01-08-2010
Hi Andy.

Your English is not an issue - it's rather good.

You didn't say what was wrong with the original awk (why it didn't work).

You can try:
Code:
cat input_file1 input_file2 etc | grep -f pattern_file

A slight change to the awk:
Code:
awk 'NR == FNR { p[$1]; next } $0 ~ p[$2]' pattern_file input_file*

(changing the filenames to suit your requirements)
# 7  
Old 01-09-2010
scottn,

Thank you for your patience.

The first solution:

Code:
cat input_file1 input_file2 etc | grep -f pattern_file

returns "virtual memory exhausted" :-(

The second one:

Code:
awk 'NR == FNR { p[$1]; next } $0 ~ p[$2]' pattern_file input_file*

It does not work! It returns all the lines of "input_file". A second problem is that the searched pattern is NOT always in the second position, e.g.
Code:
...
000000 abc2344536      46575 0000
000000 89798798798    abc2344536
000000 abc2344555      46575 1234
000000 7777777777     abc2344536
...

:-(
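
One way to cope with a pattern that can appear in any column, as a minimal sketch rather than a tested answer for the real data: keep the patterns as array keys and compare every whitespace-separated field of each line against them, writing matches to one output file per input file. It assumes each pattern always appears as a complete field. The directory field_demo, the ".out" suffix, and the contents of input_file2 are hypothetical.

```shell
# Hypothetical sample data in a scratch directory (names are assumptions).
mkdir -p field_demo
printf '%s\n' abc2344536 gvk6575556 klo6575556 > field_demo/pattern_file
printf '%s\n' \
  '000000 abc2344536      46575 0000' \
  '000000 89798798798    abc2344536' \
  '000000 abc2344555      46575 1234' \
  '000000 7777777777     abc2344536' > field_demo/input_file1
printf '%s\n' \
  '000000 1111111111     gvk6575556' \
  '000000 zzz0000000      46575 9999' > field_demo/input_file2

awk '
  NR == FNR { p[$1]; next }        # pattern file: store each pattern as a key
  {
    for (i = 1; i <= NF; i++)      # look at every column, not only the second
      if ($i in p) {
        print > (FILENAME ".out")  # one result file per input file
        next                       # line already printed, go to the next line
      }
  }
' field_demo/pattern_file field_demo/input_file1 field_demo/input_file2
```

Each line costs at most NF hash lookups, so the input files are read once regardless of how many patterns there are, and only the pattern list is held in memory.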
