Using perl or awk to create ngrams

10-13-2013

Hello,
I am interested in writing a context-driven NGram analysis, i.e. counting how often a given character occurs in its immediate context: the character that precedes it and the character that follows it. For word-initial and word-final characters, the context is just the immediately following or preceding character, respectively.
An example illustrates what is meant. Given the following words as input:

Code:

input
inside

The NGram output with frequency would be as under:

Code:

in	2
inp	1
npu	1
ut	1
ins	1
nsi	1
sid	1
de	1

Although this is feasible in a program in C or Java, I wonder if a Perl or AWK script would do the job.
I am sure this tool will help quite a few people working in Natural language processing.

Two remarks, please:
My working environment is Windows, so piping is not an option.
The data on which the script would be run for training is very large.

Many thanks.

gimley

10-13-2013

Quote:

Originally Posted by gimley

Code:

input
inside

The NGram output with frequency would be as under:

Code:

in	2
inp	1
npu	1
ut	1
ins	1
nsi	1
sid	1
de	1

I don't understand what you're trying to do.

Why isn't the following your NGram output list:

Code:

in      2
np      1
pu      1
ut      1
inp     1
npu     1
put     1
inpu    1
nput    1
input   1
ns      1
si      1
id      1
de      1
ins     1
nsi     1
sid     1
ide     1
insi    1
nsid    1
side    1
insid   1
nside   1
inside  1
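
For comparison, a traditional list like the one above (every substring of two or more characters) could be produced along these lines. This is only a sketch: it assumes one word per line, and words.txt and all_ngrams.txt are made-up file names.

```shell
# words.txt and all_ngrams.txt are hypothetical file names used for illustration
printf 'input\ninside\n' > words.txt

awk '
{       l = length($1)
        for(n = 2; n <= l; n++)                 # every ngram length from 2 up to the whole word
                for(j = 1; j <= l - n + 1; j++) # every start position that fits
                        c[substr($1, j, n)]++
}
END {   for(s in c) printf("%s\t%d\n", s, c[s])
}' words.txt > all_ngrams.txt
```

The order in which for(s in c) visits the entries is unspecified, so sort the file if a stable listing is needed.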

Even if you define "entity" to be a single character, you still seem to be missing:

Code:

put     1
   and
ide     1

from your output list.

Don Cragun

10-13-2013

I guess I did the analysis manually and hence slipped up.
I agree traditional ngrams work the way you have defined, but I am interested in contextual ngrams in which the frequency of occurrence of a given string is determined by its immediate context.
Since the analysis is at a micro level rather than a macro level, such NGrams can be used for predicting whether a given string complies with the training data and, with a few additional tweaks, even for suggesting a valid structure.
I hope I have made the idea clear and why the analysis in terms of context driven Ngrams is slightly different.
Many thanks for your response
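
To make the compliance idea concrete, here is a rough sketch of one way such a check might look. It assumes the training counts are already in a tab-separated file and that candidate words come one per line; the file names (ngrams.txt, candidates.txt, report.txt) and the helper function check are invented for the illustration, and "complies" is taken to mean simply that every contextual NGram of the word was seen in training.

```shell
# Hypothetical files: ngrams.txt (trained NGram<TAB>count), candidates.txt (one word per line)
printf 'in\t2\nut\t1\ninp\t1\nnpu\t1\nput\t1\nde\t1\nins\t1\nnsi\t1\nsid\t1\nide\t1\n' > ngrams.txt
printf 'input\ninsut\n' > candidates.txt

awk -F'\t' '
# check() is an illustrative helper: it records an NGram that was never seen in training
function check(s) { if(!(s in seen)) bad = bad " " s }

NR == FNR { seen[$1] = 1; next }        # first file: remember every trained NGram
{       w = $1; l = length(w); bad = ""
        if(l < 2) next                  # too short to have any context
        if(l == 2) check(w)
        else {
                check(substr(w, 1, 2))  # initial context
                check(substr(w, l - 1)) # final context
                for(j = 1; j <= l - 2; j++)
                        check(substr(w, j, 3))  # medial contexts
        }
        printf("%s\t%s\n", w, (bad == "" ? "ok" : "unseen:" bad))
}' ngrams.txt candidates.txt > report.txt
```

With the two candidates above, input is reported as ok, while insut is flagged because nsu and sut never occurred in the training words.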

gimley

10-13-2013

You could try something like:

Code:

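awk '
{       for(i = 1; i <= NF; i++) {
                l = length($i)
                if(l < 2) continue      # too short to have any context
                if(l == 2) {            # two characters: count the word once
                        c[$i]++
                        continue
                }
                c[substr($i, 1, 2)]++   # initial: first character and its follower
                c[substr($i, l - 1)]++  # final: last character and its predecessor
                for(j = 1; j <= l - 2; j++)
                        c[substr($i, j, 3)]++   # medial: character with both neighbours
        }
}
END {   for(i in c) printf("%s\t%d\n", i, c[i])
}' Input

If your input is always one word per line, you can make this run a little bit faster by removing the loop over the fields (the for(i = 1; i <= NF; i++) line and its closing brace), changing every occurrence of $i to $1, and changing each continue to next.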

If you want to try this on a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of just awk.
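
For reference, the one-word-per-line variant described above might look roughly like this. It is a sketch rather than a tested production script; words.txt and ngrams.txt are hypothetical file names, and the sample words are the two from the original question.

```shell
# words.txt and ngrams.txt are hypothetical file names used for illustration
printf 'input\ninside\n' > words.txt

awk '
{       l = length($1)
        if(l < 2) next                  # nothing to count for empty or 1-character lines
        if(l == 2) {                    # two characters: count the word once
                c[$1]++
                next
        }
        c[substr($1, 1, 2)]++           # initial: first character and its follower
        c[substr($1, l - 1)]++          # final: last character and its predecessor
        for(j = 1; j <= l - 2; j++)     # medial: each character with both neighbours
                c[substr($1, j, 3)]++
}
END {   for(i in c) printf("%s\t%d\n", i, c[i])
}' words.txt > ngrams.txt
```

The script reads its input from a file named on the command line and the result is redirected to a file, so no pipe is involved. The output order is unspecified; for the two sample words it contains ten entries, with in counted twice and the rest (including put and ide) once each.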

Last edited by Don Cragun; 10-13-2013 at 06:50 AM.. Reason: Fix typo (code; not lines).

Don Cragun

10-13-2013

Many thanks. I made the change you suggested and it worked very fast.
All the data is tokenised, one word per line, so the change sped up the process.

gimley
