Sponsored Content
Top Forums UNIX for Advanced & Expert Users Best way to search for patterns in huge text files Post 302385614 by andy2000 on Friday 8th of January 2010 04:27:50 PM
Old 01-08-2010
Bug Best way to search for patterns in huge text files

I have the following situation:

a text file with 50000 string patterns:

Code:
abc2344536
gvk6575556
klo6575556
....

and 3 text files each with more than 1 million lines:

Code:
...
000000 abc2344536      46575 0000
000000 abc2344536      46575 4444
000000 abc2344555      46575 1234
...

I have to extract all lines from the 3 files that match all patterns from the pattern file :-)

Any ideas, please!!!

Andy

Last edited by radoulov; 01-08-2010 at 05:32 PM.. Reason: Please use code tags!
 

10 More Discussions You Might Find Interesting

1. Solaris

Huge (repeated Entry) text files

Somebody HELP! I have a huge log file (TEXT) 76298035 bytes. It's a logfile of IMEIs and IMSIS that I get from my EIR node. Here is how the contents of the file look like: 000000, 1 33016382000913 652020100423994 1 33016382002353 652020100430743 1 33017035101003 652020100441736... (4 Replies)
Discussion started by: axl
4 Replies

2. UNIX Desktop Questions & Answers

how to search files efficiently using patterns

hi friens, :) if i need to find files with extension .c++,.C++,.cpp,.Cpp,.CPp,.cPP,.CpP,.cpP,.c,.C wat is the pattern for finding them :confused: (2 Replies)
Discussion started by: arunsubbhian
2 Replies

3. Shell Programming and Scripting

Perl - How to search a text file with multiple patterns?

Good day, great gurus, I'm new to Perl, and programming in general. I'm trying to retrieve a column of data from my text file which spans a non-specific number of lines. So I did a regexp that will pick out the columns. However,my pattern would vary. I tried using a foreach loop unsuccessfully.... (2 Replies)
Discussion started by: Sp3ck
2 Replies

4. UNIX for Dummies Questions & Answers

script to search patterns inside list of files

>testfile while read x do if then echo $x >> testfile else fi if then echo $x >> testfile else fi done < list_of_files is there any efficient way to search abc.dml and xyz.dml ? (2 Replies)
Discussion started by: dr46014
2 Replies

5. Shell Programming and Scripting

to read two files, search for patterns and store the output in third file

hello i have two files temp.txt and temp_unique.text the second file consists the unique fields from the temp.txt file the strings stored are in the following form 4,4 17,12 15,65 4,4 14,41 15,65 65,89 1254,1298i'm able to run the following script to get the total count of a... (3 Replies)
Discussion started by: vaibhavkorde
3 Replies

6. SuSE

Search all files based on first and in all listed files search the second patterns

Hello Linux Masters, I am not a linux expert therefore i need help from linux gurus. Well i have a requirement where i need to search all files based on first patterns and after seraching all files then serach second pattern in all files which i have extracted based on first pattern.... (1 Reply)
Discussion started by: Black-Linux
1 Replies

7. Shell Programming and Scripting

Comparing 2 huge text files

I have this 2 files: k5login sanwar@systems.nyfix.com jjamnik@systems.nyfix.com nisha@SYSTEMS.NYFIX.COM rdpena@SYSTEMS.NYFIX.COM service/backups-ora@SYSTEMS.NYFIX.COM ivanr@SYSTEMS.NYFIX.COM nasapova@SYSTEMS.NYFIX.COM tpulay@SYSTEMS.NYFIX.COM rsueno@SYSTEMS.NYFIX.COM... (11 Replies)
Discussion started by: linuxgeek
11 Replies

8. Shell Programming and Scripting

How to fix line breaks format text for huge files?

Hi, I need to correct line breaks for huge files (more than 1MM records in a file) and then format it properly. Except the header and trailer, each record starts with 'D'. Requirement:Scan the whole file except the header and trailer records and see if any of the records start with... (19 Replies)
Discussion started by: kikionline
19 Replies

9. Shell Programming and Scripting

Search for patterns in thousands of files

Hi All, I want to search for a certain string in thousands of files and these files are distributed over different directories created daily. For that I created a small script in bash but while running it I am getting the below error: /ms.sh: xrealloc: subst.c:5173: cannot allocate... (17 Replies)
Discussion started by: danish0909
17 Replies

10. Shell Programming and Scripting

Search and replace ---A huge number of files

Hello Friends, I have the below scenario in my current project. Suggest me which tool ( perl,python etc) is best to this scenario. Or should I go for Programming language ( C/Java ).. (1) I will be having a very big file ( information about 200million subscribers will be stored in it ). This... (5 Replies)
Discussion started by: panyam
5 Replies
bup-margin(1)						      General Commands Manual						     bup-margin(1)

NAME
bup-margin - figure out your deduplication safety margin SYNOPSIS
bup margin [options...] DESCRIPTION
bup margin iterates through all objects in your bup repository, calculating the largest number of prefix bits shared between any two entries. This number, n, identifies the longest subset of SHA-1 you could use and still encounter a collision between your object ids. For example, one system that was tested had a collection of 11 million objects (70 GB), and bup margin returned 45. That means a 46-bit hash would be sufficient to avoid all collisions among that set of objects; each object in that repository could be uniquely identified by its first 46 bits. The number of bits needed seems to increase by about 1 or 2 for every doubling of the number of objects. Since SHA-1 hashes have 160 bits, that leaves 115 bits of margin. Of course, because SHA-1 hashes are essentially random, it's theoretically possible to use many more bits with far fewer objects. If you're paranoid about the possibility of SHA-1 collisions, you can monitor your repository by running bup margin occasionally to see if you're getting dangerously close to 160 bits. OPTIONS
--predict Guess the offset into each index file where a particular object will appear, and report the maximum deviation of the correct answer from the guess. This is potentially useful for tuning an interpolation search algorithm. --ignore-midx don't use .midx files, use only .idx files. This is only really useful when used with --predict. EXAMPLE
$ bup margin Reading indexes: 100.00% (1612581/1612581), done. 40 40 matching prefix bits 1.94 bits per doubling 120 bits (61.86 doublings) remaining 4.19338e+18 times larger is possible Everyone on earth could have 625878182 data sets like yours, all in one repository, and we would expect 1 object collision. $ bup margin --predict PackIdxList: using 1 index. Reading indexes: 100.00% (1612581/1612581), done. 915 of 1612581 (0.057%) SEE ALSO
bup-midx(1), bup-save(1) BUP
Part of the bup(1) suite. AUTHORS
Avery Pennarun <apenwarr@gmail.com>. Bup unknown- bup-margin(1)
All times are GMT -4. The time now is 01:50 PM.
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.
Privacy Policy