Normalizing files for sentence count


 
# 1  
Old 11-23-2012

I have files in many different formats, with line breaks in odd places. I want to normalize them so that I can count the sentences in each file.

1: I want to count a sentence if it finishes with ! . or ?
2: but I don't want to count it if there is no space after the full stop, e.g. S.O.L

I have the following code but don't know how to make it work with the second condition:

Code:
FILES="basic/*"
for X in $FILES
do
	name=$(basename $X) 
	sed -n -e ":a" -e "$ s/\n/ /gp;N;b a" $X| tr '\. ' '\n '| tr '\? ' '\n '|tr '\! ' '\n '| grep -v "^[[:blank:]]*$" | wc -l > count/${name}
done

Can someone please help me with this?
# 2  
Old 11-23-2012
I am interpreting your requirement as: count all of the . ! and ? characters in a file.
This is only part of what it takes to find sentences. It will have problems, for example, with text containing decimal numbers, and with sentences that end in an ellipsis.... < that is one! Neat. I made a self-referential sentence.

Code:
awk '{ total += gsub(/[.?!]/, "", $0) }   # gsub returns the number of characters it replaced
END { print "total sentences=", total+0 }' somefile.txt

You have to decide on the correctness of your approach, based on your data.
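
If the second condition from the first post matters (no count when the full stop is not followed by a space), one possible variation is to count only runs of . ! ? that are followed by a blank or by the end of the line. This is a rough sketch, not a tested solution for your data; the function name count_sentences is made up for illustration, and a run such as "..." is counted once.

```shell
# Hypothetical helper: count runs of . ! ? that are followed by a space,
# a tab, or the end of the line. Reads the named files or standard input.
count_sentences() {
    awk '{ total += gsub(/[.!?]+([ \t]|$)/, "&") }   # "&" puts the match back unchanged
         END { print total + 0 }' "$@"
}

# "S.O.L" and the decimal point in "3.14" are not counted here.
printf 'Wait... what? S.O.L is 3.14. Done!\n' | count_sentences
```

Abbreviations that are followed by a space, such as "Mr. " or "Dr. ", are still counted, so the result remains an estimate.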
This User Gave Thanks to jim mcnamara For This Post:
# 3  
Old 11-23-2012
Thank you very much for the code.
I also have to break the files into one sentence per line, and I don't want the lines divided where a word or number follows the "." directly, so I need to know how to identify that case.
Can you explain this bit, please?
Quote:
,"", $0)
# 4  
Old 11-23-2012
$0 represents the whole record. Here is the syntax of the gsub function:
Code:
gsub(regexp, replacement, target)

The gsub function returns the number of substitutions made.
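
A small illustration of that return value, as a sketch only (the function name gsub_demo is made up for this example): the empty replacement string deletes each match from $0, and the count comes back from the call.

```shell
# Hypothetical demo: delete every . ? ! from each line and show how many
# substitutions gsub reported, followed by the modified record.
gsub_demo() {
    awk '{ n = gsub(/[.?!]/, "", $0); print n ": " $0 }'
}

printf 'Wait... what? No!\n' | gsub_demo   # prints "5: Wait what No"
```

So `,"", $0)` means "replace each match with nothing, in the whole record"; only the count is of interest in the earlier script.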
This User Gave Thanks to Yoda For This Post:
# 5  
Old 11-23-2012
The normal way of doing this is to change spaces and tabs to newlines and then count the number of lines that end in ., !, and ?.
Code:
tr ' \t' '\n\n' < file | grep -c '[.!?]$'
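
To run this once per file, as in the loop from the first post, something like the following might work. It is only a sketch: the function name count_all is invented here, and the directory names basic and count are taken from the first post.

```shell
# Hypothetical helper: for every regular file in directory $1, count the
# words ending in . ! or ? and write the count to a same-named file in $2.
count_all() {
    src=$1
    dst=$2
    mkdir -p "$dst"
    for f in "$src"/*
    do
        [ -f "$f" ] || continue
        # grep -c still prints 0 when nothing matches, but exits non-zero
        tr ' \t' '\n\n' < "$f" | grep -c '[.!?]$' > "$dst/$(basename "$f")" || :
    done
}

# Example call, assuming the layout from the first post:
# count_all basic count
```

Note that this counts every word ending in a full stop, so abbreviations such as "Mr." are included in the total.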

This User Gave Thanks to Don Cragun For This Post:
# 6  
Old 11-24-2012
Hi.
Quote:
Originally Posted by A-V
... I have to break the files into sentence per line as well and dont want it to divide the lines if there is a word or number of the "." ...
I have been looking at the topic of processing English sentences lately. Here is a demonstration of a perl script that places sentences on separate lines (minimal version):
Code:
#!/usr/bin/env bash

# @(#) s1	Demonstrate minimal English sentence separation.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C perl divepm 

pl " Perl modules:"
divepm -q --input=minimal-sese

FILE=${1-data5}

pl " Input data file $FILE:"
cat "$FILE"

pl " Results:"
./minimal-sese -d "$FILE"

exit 0

producing:
Code:
% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
perl 5.10.0
divepm (local) 1.2

-----
 Perl modules:
 1.04	strict
 1.06	warnings
 0.03	Perl6::Slurp
 0.25	Lingua::EN::Sentence

-----
 Input data file data5:
Now is the time
for all good men
to come to the aid
of their country.
Gobble, gobble.
Mr. Erickson said to Dr.
Olson, "Pi is approximated by 3.1415, that's S.O.P.". The AAA
came out to change my tire!  Isn't that great?

-----
 Results:
1) Now is the time
for all good men
to come to the aid
of their country.
Now is the time for all good men to come to the aid of their country.
2) Gobble, gobble.
Gobble, gobble.
3) Mr. Erickson said to Dr.
Olson, "Pi is approximated by 3.1415, that's S.O.P.".
Mr. Erickson said to Dr. Olson, "Pi is approximated by 3.1415, that's S.O.P.".
4) The AAA
came out to change my tire!
The AAA came out to change my tire!
5) Isn't that great?
Isn't that great?

The file uploaded needs to be copied to file minimal-sese and then made executable. The perl module Lingua/EN/Sentence.pm may be available in your repository. Otherwise it needs to be copied from the URL noted in the script comments.
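
If the perl module cannot be installed, a much cruder approximation in plain awk could look like the sketch below. This is not what Lingua::EN::Sentence does internally; the function name split_sentences and the short abbreviation list are assumptions made for illustration, and the list would need to be extended for real data.

```shell
# Hypothetical helper: join broken lines and print one sentence per line.
# A word ends a sentence when it finishes with . ! or ? (optionally followed
# by a closing double quote) and is not in the small abbreviation list.
split_sentences() {
    awk -v abbrevs='Mr. Mrs. Ms. Dr. Prof. e.g. i.e.' '
    BEGIN { n = split(abbrevs, a, " "); for (i = 1; i <= n; i++) skip[a[i]] = 1 }
    {
        for (i = 1; i <= NF; i++) {
            line = (line == "" ? $i : line " " $i)
            if ($i ~ /[.!?]"?$/ && !($i in skip)) { print line; line = "" }
        }
    }
    END { if (line != "") print line }   # trailing text with no terminator
    ' "$@"
}

printf 'Mr. Smith met Dr.\nJones at 3.15 today. It went\nwell!  Really?\n' | split_sentences
```

Piping the result through wc -l then gives a sentence count. An abbreviation from the list that really does end a sentence will be missed, which is the kind of case the perl module handles better.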

Posting samples of your input and desired output will help invite on-point solutions.

Best wishes ... cheers, drl

Last edited by drl; 11-24-2012 at 08:32 AM..
These 2 Users Gave Thanks to drl For This Post:
# 7  
Old 11-26-2012
Thank you very much... I will give this a go and ask if I have any questions.