Using perl to grep a list of patterns from an input file | Unix Linux Forums | Shell Programming and Scripting


Using perl to grep a list of patterns from an input file

    #1  
Old 01-25-2013
gimley gimley is offline
Registered User
 
Join Date: Feb 2011
Last Activity: 10 August 2014, 9:00 AM EDT
Posts: 186
Thanks: 80
Thanked 1 Time in 1 Post
Using perl to grep a list of patterns from an input file

I have been struggling to grep a file of n-grams (basically clusters of consonants, or of a consonant and a vowel), used as a pattern file, against an input file that contains a long list of words, one word per line. The script would do two things:
First, read each text pattern from a large file of such patterns (they are all consonant clusters) and grep it from the input file, which has one word per line. It would be great if the script could also identify whether a cluster occurs at the beginning, middle or end of a word. But that would be the icing on the cake.
Second, the output should be sorted by the clusters found. If a given cluster is not found, it should be marked as such.
An example would help:
The pattern file is

Code:
cr
pl
sl
st
pn

The input file would be

Code:
please
crawl
creep
slip
slide
apnea
pneumatic

The desired output would look like:

Code:
#cr
crawl
creep
#pl
please
#sl
slip
slide
#st NONE
#pn
apnea
pneumatic

Is it possible to write Perl code to do something of the sort?
I have used grep and egrep with the tag to grep from a pattern file, but the data is so huge that the utilities do not give satisfactory results.
Many thanks in advance
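
The "icing on the cake" part of the question (where in a word a cluster occurs) can be sketched with awk's `index()`. This is only a minimal illustration on the sample data from the post, not code from the thread. The file names `patterns.txt` and `words.txt` and the labels `begin`, `middle` and `end` are assumptions, and only the first occurrence of a cluster in each word is classified.

```shell
# Hypothetical file names; the sample data is taken from the post above.
dir=$(mktemp -d)
printf '%s\n' cr pl sl st pn > "$dir/patterns.txt"
printf '%s\n' please crawl creep slip slide apnea pneumatic > "$dir/words.txt"

positions=$(awk '
    NR == FNR { pat[++n] = $1; next }      # first file: remember each cluster
    {
        for (i = 1; i <= n; i++) {
            p = index($0, pat[i])          # start of the first occurrence, 0 if absent
            if (!p) continue
            if (p == 1)
                where = "begin"
            else if (p + length(pat[i]) - 1 == length($0))
                where = "end"
            else
                where = "middle"
            print pat[i], where, $0        # cluster, position label, word
        }
    }' "$dir/patterns.txt" "$dir/words.txt")

printf '%s\n' "$positions"
rm -r "$dir"
```

With the sample data, `apnea` is reported as `pn middle apnea`, while every other match is at the beginning of its word. A word that consists of nothing but the cluster would be labelled `begin` by this sketch.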
    #2  
Old 01-25-2013
elixir_sinari elixir_sinari is offline Forum Advisor  
Gotham Knight
 
Join Date: Mar 2012
Last Activity: 25 August 2014, 2:00 AM EDT
Location: India
Posts: 1,412
Thanks: 101
Thanked 495 Times in 472 Posts
How big are the files?
    #3  
Old 01-25-2013
Corona688 Corona688 is offline Forum Staff  
Mead Rotor
 
Join Date: Aug 2005
Last Activity: 28 August 2014, 4:23 PM EDT
Location: Saskatchewan
Posts: 19,269
Thanks: 774
Thanked 3,236 Times in 3,034 Posts
If grep is running out of memory, you may need to rethink your strategy. Just how large is the pattern file?

Are duplicate results allowed? Are things allowed to appear in more than one list?

If the pattern file's gigantic, you might be in trouble. Otherwise I might try something like this:


Code:
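awk 'NR==FNR { A[++L]=$1; print "#" $1 > ($1); next }   # listfile: store pattern, start its file with a "#pattern" header
        # inputdata: append the word to the file of every pattern it contains; T[N] counts matches
        { for(N=1; N<=L; N++) if(index($0, A[N])) { print > (A[N]) ; T[N]++} }
        END {
                # no matches: reopen the file so the header is replaced by "#pattern NONE"
                for(N=1; N<=L; N++)
                if(!T[N]) { close(A[N]); print "#"A[N]" NONE" > (A[N]) }
        }' listfile inputdata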

This will create output files named cr, pl, etc which you can cat together later if you need to.
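
A minimal sketch of the "cat together later" step, assuming the file names `listfile` and `inputdata` from the command above and a hypothetical result file `combined.txt`. It runs the same awk logic on the sample data from post #1 in a scratch directory, then joins the per-pattern files in the order the patterns appear in `listfile`, which is the order used in the desired output.

```shell
# Work in a scratch directory, because awk writes one file per pattern here.
work=$(mktemp -d) && cd "$work" || exit 1
printf '%s\n' cr pl sl st pn > listfile
printf '%s\n' please crawl creep slip slide apnea pneumatic > inputdata

# Same logic as the command in the post above.
awk 'NR==FNR { A[++L]=$1; print "#" $1 > ($1); next }
     { for(N=1; N<=L; N++) if(index($0, A[N])) { print > (A[N]); T[N]++ } }
     END { for(N=1; N<=L; N++)
             if(!T[N]) { close(A[N]); print "#" A[N] " NONE" > (A[N]) } }' listfile inputdata

# Join the per-pattern files in pattern-file order (combined.txt is a made-up name).
while IFS= read -r pat; do
    cat -- "$pat"
done < listfile > combined.txt

cat combined.txt
```

Reading the names back from `listfile` rather than using a shell glob keeps the original pattern order and skips any unrelated files in the directory.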
    #4  
Old 01-25-2013
gimley gimley is offline
The pattern file will have around 300 patterns. The worrisome part is the input file, i.e. the file that has to be grepped: it will have around 300,000 unique records. Grepping from the pattern file gave really goofed up results, which is why I posted the request.
Many thanks for your interest.

---------- Post updated at 11:22 PM ---------- Previous update was at 08:29 PM ----------

I tested the data on the large file and it worked perfectly, all the more so since the data is in UTF-8 format.
How can I modify the code so that it gives me only two or three samples from the input file for each pattern?
I am new to AWK and I tried to change the

Code:
N++

operator but could not make it work for a numeric value.
Many thanks again for your kind help
    #5  
Old 01-26-2013
Corona688 Corona688 is offline Forum Staff  
300,000 records is not that large unless they are really, really huge records. And 'really goofed up results' is not the same as 'the data is so huge that the utilities do not give satisfactory results' -- if the data were really too large, it would have said 'out of memory' or some such.

T[N] is how many times pattern N has been matched. I had been using it to tell whether I should put NONE in a file. Now I also use it to check whether a pattern has been printed enough times to stop printing it.


Code:
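awk 'NR==FNR { A[++L]=$1; print "#" $1 > ($1); next }   # listfile: one output file per pattern
        # inputdata: count every match in T[N], but print only the first 3 words per pattern
        { for(N=1; N<=L; N++) if(index($0, A[N])) { T[N]++ ; if(T[N] <= 3) print > (A[N]) ; } }
        END {
                # patterns that never matched get "#pattern NONE" instead of the bare header
                for(N=1; N<=L; N++)
                if(!T[N]) { close(A[N]); print "#"A[N]" NONE" > (A[N]) }
        }' listfile inputdata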

    #6  
Old 01-26-2013
gimley gimley is offline
Many thanks for the explanation.
The size of the pattern file has increased, but awk still seems to respond pretty well. To speed up the process, I converted all the Unicode data to an 8-bit encoding, which made the computation noticeably faster.
Thanks a lot.

---------- Post updated at 11:45 PM ---------- Previous update was at 11:13 AM ----------

Sorry to hassle you again. I got the logic, but I have two queries:
A simple one: suppose I wanted to print all the words found for a pattern on one line instead of on separate lines. What would I have to tweak? I tried tweaking the print command but could not get the script to print all the words on one line.
A more complex one: my pattern file has become huge. I am testing 4-letter combinations, which means around 64,000 strings to match. When I run the script against an input file of similar size, it executes but does not give the expected result.
Are there any solutions? I am quite desperate to solve this issue. With my limited knowledge of AWK and what I am reading in O'Reilly's book on sed and awk, I cannot get things to work. Many thanks in advance.
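
Both queries were left unanswered in the thread, so the following is only a sketch under stated assumptions, not the posters' method. Instead of calling `index()` once per pattern for every word, it stores the patterns in an awk array and looks up each substring of the word that has a pattern's length, so the work per word no longer grows with the number of patterns. Matches are collected in memory and printed on one line per pattern, which also avoids keeping one output file open per pattern (some awk implementations limit the number of open files, which may or may not be what went wrong with 64,000 patterns). The file names are the ones used earlier in the thread; the variable names are made up.

```shell
dir=$(mktemp -d)
printf '%s\n' cr pl sl st pn > "$dir/listfile"
printf '%s\n' please crawl creep slip slide apnea pneumatic > "$dir/inputdata"

oneline=$(awk '
    NR == FNR {                          # first file: the patterns
        order[++n] = $1                  # keep pattern-file order for the output
        want[$1] = 1                     # set of patterns for direct lookup
        lens[length($1)] = 1             # distinct pattern lengths
        next
    }
    {                                    # second file: one word per line
        split("", seen)                  # portable way to empty the array
        for (len in lens) {
            w = len + 0                  # array keys are strings; force a number
            for (i = 1; i + w - 1 <= length($0); i++) {
                s = substr($0, i, w)
                if ((s in want) && !(s in seen)) {
                    seen[s] = 1          # list each word once per pattern
                    hits[s] = hits[s] " " $0
                }
            }
        }
    }
    END {
        for (k = 1; k <= n; k++) {
            p = order[k]
            rest = (p in hits) ? hits[p] : " NONE"
            print "#" p rest             # pattern and all its words on one line
        }
    }' "$dir/listfile" "$dir/inputdata")

printf '%s\n' "$oneline"
rm -r "$dir"
```

On the sample data this prints `#cr crawl creep`, `#pl please`, `#sl slip slide`, `#st NONE` and `#pn apnea pneumatic`. Whether `length()` and `substr()` count bytes or characters depends on the awk implementation and locale, so multi-byte UTF-8 input would need checking before relying on this.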