Problems with deleting punctuation and apostrophes

06-15-2012


Dear All,

I have a file from which I want to get the frequency of each word, ignoring a list of stop words, and now I have problems with punctuation and " 's ".

What I am doing is:

Code:

sed 's/[^a-zA-Z ]//g' file01.txt > file01-clear.txt
cat file01-clear.txt | tr "[:upper:]" "[:lower:]"| tr ' ' '\012' |sort |uniq -c |sort -n -r -k 1 > file01-FQ.txt
grep -v -F -f rejectfile.txt file01-FQ.txt > file01-results.txt

I have realized the sed command is deleting some of my words and I don't know why.

For example, my file contains the word "general" 26 times, but in file01-clear I get only one.
Because file01-clear is wrong, I can't tell whether the final command, which deletes the stop words, is right or wrong either.

Moreover, no matter what I did, the 's was not deleted, so I have to do it manually.

I don't really know what I am doing wrong.

Can you please help me?

A-V


06-15-2012


useless use of cat

Code:

awk 'NR==FNR { STOP[$1]++; next } # First file: remember each stop word

{
        gsub(/[^a-zA-Z \t]/, ""); # Delete everything except letters and whitespace
        $0=tolower($0); # Lowercase; assigning $0 re-splits the fields
        for(N=1; N<=NF; N++) if(!($N in STOP)) W[$N]++ # Count all non-stopwords
}

END { for(X in W) print W[X], X; } # Print each count and its word
' stopwords.txt words.txt

Use nawk on Solaris.
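For the 's problem from the first post, one possible variation is to clean each word separately and drop a trailing 's before the remaining apostrophes are deleted. This is only a sketch: the function name count_words, the variable q (used to pass an apostrophe into the single-quoted awk script) and the sample text and stop word list are all made up for illustration.

```shell
#!/bin/sh
# Sketch only: sample text and stop words are invented for this example.
tmp=$(mktemp -d)
printf '%s\n' "The general's plan: the General said the plan is general." > "$tmp/words.txt"
printf '%s\n' the is > "$tmp/stopwords.txt"

count_words() {
    # $1 = stop word file (one word per line), $2 = text file
    awk -v q="'" 'NR==FNR { STOP[tolower($1)]; next }
    {
        $0 = tolower($0)
        for (N = 1; N <= NF; N++) {
            w = $N
            gsub("[^a-z" q "]", "", w)   # keep only letters and apostrophes
            sub(q "s$", "", w)           # drop a trailing possessive
            gsub(q, "", w)               # drop any apostrophes that remain
            if (w != "" && !(w in STOP)) W[w]++
        }
    }
    END { for (X in W) print W[X], X }' "$1" "$2" | sort -k1,1nr -k2,2
}

result=$(count_words "$tmp/stopwords.txt" "$tmp/words.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```

With the sample line, "general's", "General" and "general." all end up counted as the same word.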

Corona688


06-15-2012


First off, it might help to post a sample of your text (especially the one file with 26 words which is boiled down to one) because for most of us it is easier to understand what the problem is when we see it.

Second: you have hit the first real problem of all regexp constructors, which is: "what is a word"? Let's start.

The first attempt is what I found as a definition in IBM's "Handbook of Data Processing" some 30 years ago, which said:

A word is a sequence of nonblank characters separated by blanks.

To put this into a regexp (I write "<spc>" and "<tab>" to denote space and tab characters, because whitespace is hard to read in postings):

Code:

/[<spc><tab>]\([a-zA-Z][a-zA-Z]*\)[<spc><tab>]/

This is fine, but it does not take into account line starts and line endings, where the preceding (or trailing) <spc> would be missing. So we have to refine this:

Code:

/[<spc><tab>]*\([a-zA-Z][a-zA-Z]*\)[<spc><tab>]*/

Now this works better, but we have to take into account punctuation, which might come after a word. So we allow ",;.:!" after the characters instead of whitespace too:

Code:

/[<spc><tab>]*\([a-zA-Z][a-zA-Z]*\)[<spc><tab>,;:.!]*/

Finally, there might be some quotation we have to take into account. Consider: "this" should be recognized as a word, yes. So we have to allow for single and double quotes before and after our characters:

Code:

/[<spc><tab>"']*\([a-zA-Z][a-zA-Z]*\)[<spc><tab>,;:.!"']*/
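As a rough sketch of how this last pattern could be applied, the lines below split a line into blank-separated tokens and then let sed strip the quotes and punctuation around each one. The function name strip_token and the sample line are made up, and [[:blank:]] stands in for <spc><tab>.

```shell
#!/bin/sh
# Sketch only: strip_token and the sample line are invented for this example.
strip_token() {
    # Leading blanks/quotes, then the word itself, then trailing punctuation/quotes.
    sed "s/^[[:blank:]\"']*\([a-zA-Z][a-zA-Z]*\)[[:blank:],;:.!\"']*\$/\1/"
}

words=$(printf '%s\n' 'He said: "this", twice!' | tr -s ' \t' '\n' | strip_token)
printf '%s\n' "$words"
```

A token with something else inside the word, such as the apostrophe in "doesn't", does not match the pattern and is passed through unchanged.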

At last we have two decisions to make, at least if we parse English text:

The first is how we should treat hyphenated words: should "this-that" be treated as a single word or as two words, "this" and "that"? Under the blank-separated definition the first possibility is already covered (in the regexp the hyphen would have to be allowed inside the word, as in [a-zA-Z][a-zA-Z-]*), but if we decide on the second we could simply replace the hyphens with whitespace beforehand:

Code:

s/-/<spc>/g
/[<spc><tab>"']*\([a-zA-Z][a-zA-Z]*\)[<spc><tab>,;:.!"']*/
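To see the two choices side by side, here is a small sketch that splits the same line once with the hyphens left alone and once after the s/-/<spc>/g step. The sample line and the variable names one and two are made up.

```shell
#!/bin/sh
# Sketch only: the sample line is invented for this example.
line='A well-known this-that case'

# Choice 1: split on blanks only, so hyphenated compounds stay one word.
one=$(printf '%s\n' "$line" | tr -s ' ' '\n')

# Choice 2: turn hyphens into spaces first, so each part is its own word.
two=$(printf '%s\n' "$line" | sed 's/-/ /g' | tr -s ' ' '\n')

printf 'one word:\n%s\n\ntwo words:\n%s\n' "$one" "$two"
```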

The second decision is tougher, though: what should we do with contractions like "i'll" or "doesn't"? Should they be counted as words in their own right? Should they be decomposed? ("i'll" -> "i"+"will", "doesn't" -> "does"+"not") Should only the first word in such a concatenation count? ("i'll" -> "i", "doesn't" -> "does")
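Two of these options can be sketched quickly; decomposing would need a lookup table of contractions and is left out here. The sample line and the variable names whole and first are made up, and the n't rule is a crude assumption (it would turn "can't" into "ca").

```shell
#!/bin/sh
# Sketch only: the sample line is invented for this example.
line="I'll say it doesn't matter"

# Option A: a contraction counts as a word in its own right,
# so the apostrophe is treated as part of the word.
whole=$(printf '%s\n' "$line" | tr '[:upper:]' '[:lower:]' | tr -cs "a-z'" '\n')

# Option B: only the first word of the contraction counts.
# First cut a trailing n't, then anything from the apostrophe on.
first=$(printf '%s\n' "$line" | tr '[:upper:]' '[:lower:]' | tr -s ' ' '\n' |
    sed -e "s/n't\$//" -e "s/'.*//")

printf 'whole:\n%s\n\nfirst part only:\n%s\n' "$whole" "$first"
```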

And what about more outlandish language quirks, like the introductory abbreviation? Look at the first word:

`Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.

What is it - or, better, what should it be? "Twas"? "`Twas"? "It"+"was"?

Seems like you have to answer this first before a solution can be provided. Problems of pattern matching are usually easily solved once you lay out exactly what the pattern you are searching for consists of.

I hope this helps.

bakunin


06-16-2012


Thanks for your help.

Problem solved.

A-V
