Using perl or awk to create ngrams


 
# 1  
Old 10-13-2013

Hello,
I am interested in writing a context-driven n-gram analysis, i.e. detecting the frequency of a given character based on its immediate context: the characters that precede and follow it. For word-initial and word-final characters, the context is the single character immediately following or preceding it, respectively.
An example will illustrate what is meant. Given the following words as input:
Code:
input
inside

The n-gram output with frequencies would be as follows:
Code:
in	2
inp	1
npu	1
ut	1
ins	1
nsi	1
sid	1
de	1

Although this is feasible in C or Java, I wonder whether a Perl or AWK script could do the job.
I am sure this tool will help quite a few people working in natural language processing.

Two remarks, please:
My working environment is Windows, so piping is not available.
The training data on which the script will be run is very large.

Many thanks.
# 2  
Old 10-13-2013
I don't understand what you're trying to do.

Why isn't the following your NGram output list:
Code:
in      2
np      1
pu      1
ut      1
inp     1
npu     1
put     1
inpu    1
nput    1
input   1
ns      1
si      1
id      1
de      1
ins     1
nsi     1
sid     1
ide     1
insi    1
nsid    1
side    1
insid   1
nside   1
inside  1

Even if you define "entity" to be a single character, you still seem to be missing:
Code:
put     1
   and
ide     1

from your output list.
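
For comparison, the exhaustive list above (every substring of length 2 up to the full word length) could be generated with something like the sketch below. It assumes one word per line; the file names words.txt and all_ngrams.txt are placeholders chosen for illustration, not names from the original post.

```shell
# Sample input: the two words from post #1, one per line.
printf 'input\ninside\n' > words.txt

# Count every substring of length n = 2 .. length(word).
awk '
{       l = length($1)
        for(n = 2; n <= l; n++)                 # n-gram size
                for(j = 1; j <= l - n + 1; j++) # start position
                        c[substr($1, j, n)]++
}
END {   for(k in c) printf("%s\t%d\n", k, c[k])
}' words.txt > all_ngrams.txt
```

The order of the output lines depends on how awk walks the array, so sort the file afterwards if a fixed order matters.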
# 3  
Old 10-13-2013
I did the analysis manually and slipped up.
I agree that traditional n-grams work the way you have described, but I am interested in contextual n-grams, in which the frequency of occurrence of a given string is determined by its immediate context.
Since the analysis is at a micro level rather than a macro level, such n-grams can be used to predict whether a given string conforms to the training data and, with a few additional tweaks, even to suggest a valid structure.
I hope this makes the idea clear, and shows why an analysis in terms of context-driven n-grams is slightly different.
Many thanks for your response.
# 4  
Old 10-13-2013
You could try something like:
Code:
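awk '
{       for(i = 1; i <= NF; i++) {
                l = length($i)
                # Words shorter than 2 characters have no context.
                if(l < 2) continue
                # A 2-character word is its own (and only) n-gram.
                if(l == 2) {
                        c[$i]++
                        continue
                }
                c[substr($i, 1, 2)]++   # initial: first character + next
                c[substr($i, l - 1)]++  # final: last character + previous
                # medial: each inner character with both neighbours
                for(j = 1; j <= l - 2; j++)
                        c[substr($i, j, 3)]++
        }
}
END {   for(i in c) printf("%s\t%d\n", i, c[i])
}' Input

If your input is always one word per line, you can make this run a little bit faster by removing the loop over fields (the for(i = 1; i <= NF; i++) { line and its closing brace), changing every occurrence of $i to $1, and changing each continue to next.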

If you want to try this on a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of just awk.
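
The one-word-per-line variant described above might look like the following sketch. The file names words.txt and ngrams.txt are assumptions made for illustration. The counts are written to a file from inside awk, so no pipe is involved.

```shell
# Sample input: the two words from post #1, one per line.
printf 'input\ninside\n' > words.txt

awk '
{       l = length($1)
        if(l < 2) next                  # no context available
        if(l == 2) {                    # the word is its only n-gram
                c[$1]++
                next
        }
        c[substr($1, 1, 2)]++           # initial: first character + next
        c[substr($1, l - 1)]++          # final: last character + previous
        for(j = 1; j <= l - 2; j++)     # medial: character with both neighbours
                c[substr($1, j, 3)]++
}
END {   # Write the counts to a file directly from awk.
        for(k in c) printf("%s\t%d\n", k, c[k]) > "ngrams.txt"
}' words.txt
```

For the two sample words this should give in with a count of 2 and each of ut, de, inp, npu, put, ins, nsi, sid and ide with a count of 1, in no particular order. On Windows, cmd.exe does not treat single quotes as quoting characters, so it is usually easier to save the program text in a file (say ngram.awk, a hypothetical name) and run awk -f ngram.awk words.txt.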

Last edited by Don Cragun; 10-13-2013 at 06:50 AM.. Reason: Fix typo (code; not lines).
# 5  
Old 10-13-2013
Many thanks. I made the change you suggested and it ran very fast.
All the data is tokenised, one word per line, so the change sped up the process.