Best way to search for patterns in huge text files


 
# 8  
Old 01-09-2010
Quote:
Originally Posted by andy2000
[...]
[...]
A second problem is that the searched pattern is NOT always in the second position, e.g.
Code:
...
000000 abc2344536      46575 0000
000000 89798798798    abc2344536
000000 abc2344555      46575 1234
000000 7777777777     abc2344536
...

Well,
you should have been more specific in your first post ...

I'd use Perl, given the file size:

Code:
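perl -e'
  # Load the patterns, one per line, without the trailing newline.
  my %p;
  open my $PH, "<", shift or die "$!\n";
  while (<$PH>) { chomp; $p{$_} = 1 if length }
  close $PH or warn "$!\n";

  # Print every input line that contains at least one pattern.
  # index() compares fixed strings (like fgrep), not regexes.
  while (my $line = <>) {
    for (keys %p) {
      print $line and last if index($line, $_) >= 0;
    }
  }' pattern_file input_file1 input_file2 ...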


Or Python:

Code:
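python -c'
import sys

# Load the patterns, one per line, without the trailing newline.
with open(sys.argv.pop(1)) as pf:
    patterns = set(l.rstrip("\n") for l in pf if l.strip())

# Print every input line that contains at least one pattern.
for fn in sys.argv[1:]:
    with open(fn) as f:
        for line in f:
            for p in patterns:
                if p in line:
                    sys.stdout.write(line)
                    break
' pattern_file input_file1 input_file2 ...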
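
Both one-liners above test every pattern against every line, which gets slow when the pattern file is large. If the patterns only ever occur as whole whitespace-separated fields (an assumption based on the sample lines, not something confirmed for the real data), a set lookup per field should be much cheaper. A minimal sketch of that idea follows; the names load_patterns and matching_lines are made up for illustration:

```python
import sys


def load_patterns(path):
    """Read one pattern per line into a set, skipping blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def matching_lines(lines, patterns):
    """Yield each line that has at least one whole field in `patterns`."""
    for line in lines:
        # split() with no argument splits on any run of whitespace.
        # Assumption: a pattern must equal a complete field to count.
        if not patterns.isdisjoint(line.split()):
            yield line


if __name__ == "__main__" and len(sys.argv) > 2:
    wanted = load_patterns(sys.argv[1])
    for name in sys.argv[2:]:
        with open(name) as handle:
            sys.stdout.writelines(matching_lines(handle, wanted))
```

Saved under a name of your choice, it takes the same arguments as the one-liners: pattern_file input_file1 input_file2 ... Note that, unlike fgrep, it will not find a pattern that is only part of a longer field.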


Last edited by radoulov; 01-09-2010 at 02:29 PM..
# 9  
Old 01-09-2010
scottn,

it works fine with fgrep:

Code:
cat input_file1 input_file2 etc | fgrep -f pattern_file

The ugly "virtual memory exhausted" message no longer appears.

How could you solve the problem with awk?

Code:
awk 'NR == FNR { p[$1]; next } $0 ~ p[$2]' pattern_file input_file*

It does not work! It returns all the lines of "input_file". The searched pattern is NOT always in the second position, e.g.

Code:
...
000000 abc2344536      46575 0000
000000 89798798798    abc2344536
000000 abc2344555      46575 1234
000000 7777777777     abc2344536
...

Thank you

Last edited by Scott; 01-09-2010 at 07:27 PM.. Reason: Added code tags - again