Using awk to find sentences.


 
# 8  
Old 11-19-2012
Quote:
Originally Posted by jim mcnamara
Please post relevant text on this site, not external sites. The external site will age out the text post, then some future searcher will not have a clue as what this thread is really about.

In fact it went to lala land (404) just now..... Nobody can effectively help you now.

Thank you.
I'm going to start a new thread with sample text and a simpler request. Thank you for the guidance.
# 9  
Old 11-20-2012
A perl approach

Hi.

This is a perl approach to this problem. One of the modules at CPAN is Lingua::EN::Sentence, which splits English text into sentences. I won't post the perl code, p1 (fewer than 40 lines), unless necessary. Here is a sample use on a small data file:
Code:
#!/usr/bin/env bash

# @(#) s1	Demonstrate identifying English sentences, perl modules.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
# Redefine db as a no-op to silence debug output; comment out the next line to re-enable it.
db() { : ; }
C=$HOME/bin/context && [ -f "$C" ] && "$C" perl divepm
pl " perl modules:"
divepm Sentence Slurp

FILE=${1-data1}

pl " Input data file $FILE:"
cat "$FILE"

pl " Results:"
./p1 "$FILE"

exit 0
producing:
Code:
% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
perl 5.10.0
divepm (local) 1.2

-----
 perl modules:
 Note - /usr/lib/perl/5.10 points to 5.10.0
 Note - /usr/share/perl/5.10 points to 5.10.0
 0.25	Lingua::EN::Sentence
 0.03	Perl6::Slurp

-----
 Input data file data1:
Now is the time
for all good men
to come to the aid
of their country.
Gobble, gobble.
Mr. Erickson said to Dr.
Olson, "Three, e.g.".
The AAA came out to change my tire!  Isn't that great?

-----
 Results:
1) Now is the time
for all good men
to come to the aid
of their country.
1 [ \n to space ]) Now is the time for all good men to come to the aid of their country.
2) Gobble, gobble.
3) Mr. Erickson said to Dr.
Olson, "Three, e.g.".
3 [ \n to space ]) Mr. Erickson said to Dr. Olson, "Three, e.g.".
4) The AAA came out to change my tire!
5) Isn't that great?
 Found 5 sentences in data1

For the 60957 lines in the posted link, it found 31017 sentences in 260 seconds (about 120 sentences per second), so it's not the fastest code, but it seems to get the job done.

Obviously this is of little value if the OP wants awk, although the module's regular expression could probably be reused, along with its algorithm: mark the possible sentence boundaries, then check each one for exceptions such as a list of known abbreviations.
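For anyone who wants to try that idea in awk, here is a rough sketch of the mark-then-check approach. It is not the module's code or its regular expression: the function name split_sentences and the short abbreviation list are my own illustrative assumptions, and a real list would need to be much longer.

```shell
#!/usr/bin/env bash
# Sketch only: split_sentences and the abbreviation list below are
# hypothetical, not taken from Lingua::EN::Sentence.
split_sentences() {
  awk '
  BEGIN {
    # Assumed, deliberately short list of abbreviations that end in a
    # period but do not end a sentence.
    n = split("Mr. Mrs. Ms. Dr. Prof. e.g. i.e. etc. vs.", a, " ")
    for (i = 1; i <= n; i++) abbrev[a[i]] = 1
  }
  {
    # Sentences may span input lines, so "sent" carries over between records.
    for (i = 1; i <= NF; i++) {
      word = $i
      sent = (sent == "" ? word : sent " " word)
      # Step 1: mark a candidate boundary, a word ending in . ! or ?
      # optionally followed by a closing double quote or parenthesis.
      # Step 2: reject the candidate if the word is a known abbreviation.
      if (word ~ /[.!?][")]*$/ && !(word in abbrev)) {
        print ++count ") " sent
        sent = ""
      }
    }
  }
  END {
    # Flush trailing text that never reached a terminator.
    if (sent != "") print ++count ") " sent
  }
  ' "$@"
}
```

Calling it as split_sentences data1 (or piping text into it) prints one numbered sentence per line with newlines folded to spaces. The obvious weakness is that an abbreviation that really does end a sentence, such as a final "etc.", is not treated as a boundary, which is the kind of case a fuller exception list and extra rules would have to handle.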

Best wishes ... cheers, drl