Using Perl to grep a list of patterns from an input file
I have been struggling to grep a file of n-grams (basically clusters of consonants, or of a consonant and a vowel), used as a pattern file, against an input file which contains a long list of words, one word per line. The script would do two things:
First, read each text pattern from a large file of such patterns (they are all consonant clusters) and grep it from the input file, which has one word per line. It would be great if the script could also identify whether a cluster occurs at the beginning, middle or end of the word. But that would be the icing on the cake.
Second, the output should be sorted by the clusters found. If a given cluster is not found, it should be marked as such.
An example would help:
The pattern file is
The input file would be
The desired output would look like:
Is it possible to write Perl code to do something of the sort?
I have used grep and egrep with the tag for reading patterns from a pattern file, but the data is so huge that the utilities do not give satisfactory results.
Many thanks in advance
The pattern file will have around 300 patterns. It is the input file, i.e. the file to be grepped, that is the worrisome part. It will have around 300,000 unique records. Grepping from the pattern file gave really goofed up results, which is why I posted the request.
Many thanks for your interest
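The task as described (match every cluster against every word, note whether it sits at the beginning, middle or end of the word, sort by cluster, and flag clusters that never match) can be sketched roughly as follows. This is a Python illustration only, not the script that was actually used in this thread. The function names and the sample clusters and words are made up for the example, and only the first occurrence of a cluster in a word is classified.

```python
from collections import defaultdict


def position(word, cluster):
    """Classify the first occurrence of cluster in word, or return None if absent."""
    idx = word.find(cluster)
    if idx < 0:
        return None
    if idx == 0:
        return "beginning"
    if idx + len(cluster) == len(word):
        return "end"
    return "middle"


def grep_clusters(patterns, words):
    """Map each pattern (sorted) to its (word, position) hits; no hits gives an empty list."""
    found = defaultdict(list)
    for word in words:
        for pat in patterns:
            pos = position(word, pat)
            if pos is not None:
                found[pat].append((word, pos))
    return {pat: found.get(pat, []) for pat in sorted(patterns)}


def report(results):
    """Turn the result mapping into tab-separated lines, marking unmatched patterns as NONE."""
    lines = []
    for pat, hits in results.items():
        if not hits:
            lines.append(f"{pat}\tNONE")
        for word, pos in hits:
            lines.append(f"{pat}\t{word}\t{pos}")
    return lines


if __name__ == "__main__":
    # Hypothetical sample data; the real files would be read with
    # open(path, encoding="utf-8") and one entry per line.
    patterns = ["str", "nd", "zq"]
    words = ["string", "hand", "astray", "window"]
    print("\n".join(report(grep_clusters(patterns, words))))
```

With about 300 patterns and 300,000 words this is roughly 90 million substring checks, which is slow but workable. Python strings are Unicode, so UTF-8 input needs no special handling once the files are opened with the right encoding.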
---------- Post updated at 11:22 PM ---------- Previous update was at 08:29 PM ----------
I tested the data on the large file and it worked perfectly, all the more so since the data is in UTF-8 format.
How can I modify the code if I want it to give me only two to three samples from the input file?
I am new to AWK and I tried to change the operator but could not make it work for a numeric value.
Many thanks again for your kind help
300,000 records is not that large unless they are really, really huge records. And 'really goofed up results' is not the same as 'the data is so huge that the utilities do not give satisfactory results' -- if the data were really too large, it would have said 'out of memory' or some such.
T[N] is how many times it's been printed. I've been using it to tell whether I should put NONE in a file. Now I also use it to check if a pattern has been printed enough times to stop printing it any more.
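The counter idea described above can be illustrated like this. It is a Python analogue, not the awk script itself: the dictionary `printed` plays the role that `T[N]` is described as playing, and the function name, the `limit` parameter and the sample words are assumptions made for the example.

```python
def sample_matches(patterns, words, limit=3):
    """Return at most `limit` matching words per pattern, plus NONE for unmatched patterns."""
    # printed[pat] counts how many times pat has been output so far.
    printed = {pat: 0 for pat in patterns}
    out = []
    for word in words:
        for pat in patterns:
            # Stop emitting a pattern once it has been output `limit` times.
            if printed[pat] < limit and pat in word:
                out.append(f"{pat}\t{word}")
                printed[pat] += 1
    # A count of zero means the pattern never matched, so mark it as NONE.
    for pat in patterns:
        if printed[pat] == 0:
            out.append(f"{pat}\tNONE")
    return out


if __name__ == "__main__":
    # Hypothetical sample data.
    print("\n".join(sample_matches(["an", "zq"], ["hand", "land", "sand", "band"], limit=2)))
```

The same counter does both jobs: comparing it with the limit caps the number of samples, and checking it for zero at the end decides whether NONE is needed.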
Many thanks for the explanation.
The size of the pattern file has increased but awk seems to respond pretty well. To speed up the process, I converted all Unicode to 8-bit and gained a fair amount of speed in computation.
Thanks a lot.
---------- Post updated at 11:45 PM ---------- Previous update was at 11:13 AM ----------
Sorry to hassle again. I got the logic, but I have two queries:
A simple one: suppose I wanted to print all the data found on one line instead of on separate lines, what would I have to tweak? I tried tweaking the print command but could not get the script to print all words on one line.
A more complex one: my pattern file has grown huge. I am testing 4-letter combos, which means around 64,000 strings to match. When I run the script against an input file of a similar size, it executes but does not give the expected result.
Any solutions? I am quite desperate to solve this issue. Many thanks in advance. With my limited knowledge of AWK and the reading I am doing from O'Reilly's book on sed and awk, things do not seem to work out.
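One possible way to approach both queries, sketched in Python rather than awk: instead of testing 64,000 patterns against every word, cut each word into substrings of the pattern lengths and look each one up in a set, then join the words found for a pattern with spaces so they land on one line. Whether this cures the unexpected results is not something the sketch can settle, since the failing script is not shown here. It only addresses the cost of the pattern loop and the output layout. All names and sample data below are made up for the example.

```python
from collections import defaultdict


def index_by_ngram(patterns, words):
    """Map each pattern to the words containing it, using set lookups on substrings."""
    wanted = {p for p in patterns if p}
    lengths = sorted({len(p) for p in wanted})
    hits = defaultdict(list)
    for word in words:
        seen = set()  # avoid listing a word twice for the same pattern
        for n in lengths:
            for i in range(len(word) - n + 1):
                gram = word[i:i + n]
                if gram in wanted and gram not in seen:
                    seen.add(gram)
                    hits[gram].append(word)
    return hits


def one_line_report(patterns, hits):
    """One output line per pattern: the pattern, a tab, then all its words or NONE."""
    lines = []
    for pat in sorted(set(patterns)):
        words = hits.get(pat)
        lines.append(f"{pat}\t{' '.join(words) if words else 'NONE'}")
    return lines


if __name__ == "__main__":
    # Hypothetical sample data.
    patterns = ["andl", "strs", "zzzz"]
    words = ["handle", "candle", "substrs"]
    print("\n".join(one_line_report(patterns, index_by_ngram(patterns, words))))
```

The work per word then depends on the word's length and the number of distinct pattern lengths, not on how many patterns there are, which matters once the pattern file reaches tens of thousands of entries.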