    #1  
Old 06-15-2012
A-V
Problems with deleting punctuation and apostrophes

Dear All,

I have a file from which I want to get the frequency of each word, ignoring a list of stop words, and now I have problems with punctuation and " 's ".

What I am doing is:


Code:
sed 's/[^a-zA-Z ]//g' file01.txt > file01-clear.txt
cat file01-clear.txt | tr "[:upper:]" "[:lower:]"| tr ' ' '\012' |sort |uniq -c |sort -n -r -k 1 > file01-FQ.txt
grep -v -F -f rejectfile.txt file01-FQ.txt > file01-results.txt

I have realized that the sed command is deleting some of my words, and I don't know why.
For example, my file contains the word "general" 26 times, but in file01-clear I get only one.
Because my file01-clear is wrong, I can't tell whether the final command that deletes the stop words is right or wrong either.
Moreover, no matter what I did, the 's was not deleted, so I have to remove it manually.

I don't really know what I am doing wrong.

Can you please help me?

A-V
    #2  
Old 06-15-2012
Corona688
useless use of cat


Code:
awk 'NR==FNR { STOP[$1]++; next } # First file: remember each stop word

{
        gsub(/[^a-zA-Z \t]/, ""); # Delete everything except letters, spaces and tabs
        $0=tolower($0); # Lowercase the line; assigning $0 re-splits it into fields
        for(N=1; N<=NF; N++) if(!($N in STOP)) W[$N]++ # Count all non-stopwords
}

END { for(X in W) print W[X], X; } # Print each count and its word
' stopwords.txt words.txt

Use nawk on Solaris.
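
If you would rather stay with a pipeline, here is a minimal sketch of the same idea without cat. The function name wordfreq is made up for this example, and it assumes the stop word file holds one lowercase word per line. Note the -x on grep: without it, -F -f rejects every line that merely contains a stop word as a substring.

```shell
# wordfreq is a hypothetical helper, not something from this thread.
# It reads text on stdin and prints "count word" lines, most frequent first.
# $1 names the stop word file (assumed: one lowercase word per line).
wordfreq() {
    sed "s/'s\([^a-zA-Z]\)/\1/g; s/'s\$//" |  # drop a trailing 's before touching punctuation
    tr -cs 'a-zA-Z' '\n' |                    # every run of non-letters becomes one newline
    tr 'A-Z' 'a-z' |                          # fold to lowercase
    grep -v -x -F -f "$1" |                   # -x: reject only whole-line matches, not substrings
    grep . |                                  # drop the empty line a leading non-letter leaves behind
    sort | uniq -c | sort -k1,1nr -k2
}
```

It could be called as: wordfreq rejectfile.txt < file01.txt > file01-results.txt. Be aware that the sed step also turns "it's" into "it", which may or may not be what you want.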
The Following User Says Thank You to Corona688 For This Useful Post:
A-V (06-16-2012)
    #3  
Old 06-15-2012
bakunin
First off, it would help if you posted a sample of your text (especially the file in which 26 words are boiled down to one), because for most of us it is easier to understand a problem when we can see it.

Second: you have hit the first real problem of all regexp constructors, which is: "what is a word"? Let's start.

The first attempt is a definition I found in IBM's "Handbook of Data Processing" some 30 years ago:

A word is a sequence of nonblank characters separated by blanks.

To put this into a regexp (I write "<spc>" and "<tab>" to denote space and tab characters, since whitespace is hard to read in postings):


Code:
/[<spc><tab>]\([a-zA-Z][a-zA-Z]*\)[<spc><tab>]/

This is fine, but it does not take into account line starts and line endings, where the preceding (or trailing) <spc> would be missing. So we have to refine this:


Code:
/[<spc><tab>]*\([a-zA-Z][a-zA-Z]*\)[<spc><tab>]*/

Now this works better, but we have to take into account punctuation, which might come after a word. So we allow ",;.:!" after the characters instead of whitespace too:


Code:
/[<spc><tab>]*\([a-zA-Z][a-zA-Z]*\)[<spc><tab>,;:.!]*/

Finally, there might be quotation marks to take into account: "this" should still be recognized as a word. So we have to allow for single and double quotes before and after our characters:


Code:
/[<spc><tab>"']*\([a-zA-Z][a-zA-Z]*\)[<spc><tab>,;:.!"']*/
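
One way to try the pattern out, as a rough sketch only: treat the bracketed characters as delimiters and keep every token that consists purely of letters. The function name words is made up for this example, and tr plus grep stand in for the regexp above rather than reproducing it exactly.

```shell
# words is a hypothetical helper: text on stdin, one word per line on stdout.
words() {
    # Delimiter set: space, tab , ; : . ! " and ' (the ' is quoted separately).
    tr -s ' \t,;:.!"'"'" '\n' |      # each run of delimiters becomes a single newline
    grep -x '[a-zA-Z][a-zA-Z]*'     # keep only tokens made entirely of letters
}
```

Anything containing a character outside the delimiter set, such as "?" or "-", is dropped by the grep, so the set has to match the text being parsed.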

Lastly, we have two decisions to make, at least if we parse English text:

The first is how we should treat hyphenated words: should "this-that" be treated as a single word or as the two words "this" and "that"? For the first possibility the hyphen would have to be allowed inside the word (for example [a-zA-Z][a-zA-Z-]* instead of [a-zA-Z][a-zA-Z]*), because the pattern so far stops at the hyphen. If we decide on the second, we can simply replace the hyphens with white space beforehand:


Code:
s/-/<spc>/g
/[<spc><tab>"']*\([a-zA-Z][a-zA-Z]*\)[<spc><tab>,;:.!"']*/
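
Both choices for hyphenated words can be compared side by side. This is a sketch under the same assumptions as before: the function names are hypothetical, and tr and grep only approximate the pattern.

```shell
# Hypothetical helpers: text on stdin, one word per line on stdout.

# Choice 1: keep "this-that" as one word by allowing "-" inside a word.
words_keep_hyphen() {
    tr -s ' \t,;:.!"'"'" '\n' |
    grep -x '[a-zA-Z][a-zA-Z-]*'
}

# Choice 2: split "this-that" into "this" and "that" by replacing hyphens first.
words_split_hyphen() {
    sed 's/-/ /g' |
    tr -s ' \t,;:.!"'"'" '\n' |
    grep -x '[a-zA-Z][a-zA-Z]*'
}
```

Feeding both the same line, for example printf 'this-that, other\n', shows the difference in the resulting word list.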

The second decision is tougher, though: what should we do with contractions like "i'll" or "doesn't"? Should they be counted as words in their own right? Should they be decomposed? ("i'll" -> "i"+"will", "doesn't" -> "does"+"not") Should only the first word in such a concatenation count? ("i'll" -> "i", "doesn't" -> "does")
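
Of these options, "only the first word counts" is the easiest to sketch. The helper below is hypothetical and assumes its input is already one lowercase word per line. It is deliberately naive: irregular forms such as "can't" or "won't" come out wrong ("ca", "wo") and would need a lookup table.

```shell
# first_part is a hypothetical helper: one lowercase word per line in,
# the part before the contraction out ("i'll" -> "i", "doesn't" -> "does").
first_part() {
    sed -e "s/n't\$//" -e "s/'.*\$//"   # strip a trailing n't, then anything from an apostrophe on
}
```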

And what about more outlandish language quirks, like the introductory abbreviation? Look at the first word:

`Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.


What is it - or, better, what should it be? "Twas"? "`Twas"? "It"+"was"?

It seems you have to answer this first before a solution can be provided. As usual, pattern-matching problems are most easily solved by first laying out exactly what the pattern being searched for consists of.

I hope this helps.

bakunin
    #4  
Old 06-16-2012
A-V
Thanks for your help.

Problem solved.