UNIX for Dummies Questions & Answers: If you're not sure where to post a UNIX or Linux question, post it here. All UNIX and Linux newbies welcome!
#1
Dear all,

I have a file for which I want a frequency list of each word, ignoring a list of stop words. I am having problems with punctuation and with "'s". This is what I am doing:

Code:
sed 's/[^a-zA-Z ]//g' file01.txt > file01-clear.txt
cat file01-clear.txt | tr "[:upper:]" "[:lower:]" | tr ' ' '\012' | sort | uniq -c | sort -n -r -k 1 > file01-FQ.txt
grep -v -F -f rejectfile.txt file01-FQ.txt > file01-results.txt

I have realized that the sed command is deleting some of my words and I don't know why. For example, my file contains the word "general" 26 times, but in file01-clear.txt I get only one. Because file01-clear.txt is wrong, I can't tell whether the final command that deletes the stop words is right or wrong either.

Moreover, no matter what I did, the 's was not deleted, so I have to remove it manually.

I don't really know what I am doing wrong. Can you please help me?

A-V
#2
Useless use of cat. Try this instead:

Code:
awk 'NR==FNR { STOP[$1]++; next }   # First file: remember every stop word.
{
        gsub(/[^a-zA-Z \t]/, "");   # Remove everything except letters, spaces and tabs.
        $0=tolower($0);             # Lowercase the line; awk re-splits it into fields.
        for(N=1; N<=NF; N++) if(!($N in STOP)) W[$N]++ # Count all non-stopwords
}
END { for(X in W) print W[X], X; }  # Print count of words
' stopwords.txt words.txt

Use nawk on Solaris.
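To check the stop-word step in isolation, the awk approach above can be run on a tiny sample. This is a minimal sketch under stated assumptions: the sample text, the stop-word list, and the function name count_words are all made up for illustration, and the final sort is added only to make the output order predictable.

```shell
#!/bin/sh
# Hypothetical sample data, invented for this sketch.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf 'the\nof\na\n' > "$tmp/stopwords.txt"
printf "The general's plan: a GENERAL plan of the army.\n" > "$tmp/words.txt"

# count_words STOPFILE TEXTFILE
# Prints "count word" for every word of TEXTFILE that is not in STOPFILE.
count_words() {
    awk 'NR==FNR { STOP[tolower($1)]++; next }    # first file: load stop words
    {
        gsub(/[^a-zA-Z \t]/, "")                   # keep only letters, spaces, tabs
        $0 = tolower($0)                           # lowercase; awk re-splits fields
        for (N = 1; N <= NF; N++)
            if (!($N in STOP)) W[$N]++             # count every non-stop-word
    }
    END { for (X in W) print W[X], X }' "$1" "$2" |
    sort -k1,1nr -k2,2                             # highest count first, then by word
}

count_words "$tmp/stopwords.txt" "$tmp/words.txt"
```

Two things show up in the result. The `in STOP` test compares whole words, whereas `grep -v -F -f rejectfile.txt` drops every line that contains a stop word anywhere as a substring, so a stop word such as "a" would also remove "general". And because the apostrophe is simply deleted, "general's" is counted as a separate word "generals" rather than as "general"; handling 's has to happen before the punctuation is stripped.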
The Following User Says Thank You to Corona688 For This Useful Post: A-V (06-16-2012)
#3
First off, it might help to post a sample of your text (especially the one file with 26 words that is boiled down to one), because for most of us it is easier to understand a problem when we can see it.

Second: you have hit the first real problem of all regexp constructors, which is: "what is a word"?

Let's start. The first attempt is the definition I found in IBM's "Handbook of Data Processing" some 30 years ago, which said:

A word is a sequence of nonblank characters separated by blanks.

To put this into a regexp (I write "<spc>" and "<tab>" to denote space and tab characters, because whitespace is hard to read in postings):

Code:
/[<spc><tab>]\([a-zA-Z][a-zA-Z]*\)[<spc><tab>]/

This is fine, but it does not take into account line starts and line endings, where the preceding (or trailing) <spc> would be missing. So we have to refine this:

Code:
/[<spc><tab>]*\([a-zA-Z][a-zA-Z]*\)[<spc><tab>]*/

Now this works better, but we have to take into account punctuation, which might come after a word. So we allow ",;.:!" after the characters as well as whitespace:

Code:
/[<spc><tab>]*\([a-zA-Z][a-zA-Z]*\)[<spc><tab>,;:.!]*/

Finally, there might be quotation marks we have to take into account. Consider: "this" should be recognized as a word, yes. So we have to allow for single and double quotes before and after our characters:

Code:
/[<spc><tab>"']*\([a-zA-Z][a-zA-Z]*\)[<spc><tab>,;:.!"']*/

At last we have two decisions to make, at least if we parse English text. The first is how we should treat hyphenated words: should "this-that" be treated as a single word or as two words, "this" and "that"? The first possibility would already be treated correctly by now, but if we decide on the second we could simply replace the hyphens with white space beforehand:

Code:
s/-/<spc>/g
/[<spc><tab>"']*\([a-zA-Z][a-zA-Z]*\)[<spc><tab>,;:.!"']*/

The second decision is tougher, though: what should we do with contractions like "i'll" or "doesn't"? Should they be counted as words in their own right? Should they be decomposed? ("i'll" -> "i"+"will", "doesn't" -> "does"+"not") Should only the first word in such a concatenation count? ("i'll" -> "i", "doesn't" -> "does")

And what about more outlandish language quirks, like the introductory abbreviation? Look at the first word:

`Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.

What is it - or, better, what should it be? "Twas"? "`Twas"? "It"+"was"?

Seems like you have to answer this first before a solution can be provided. As usual, pattern-matching problems are most easily solved by first laying out exactly what the pattern being searched for should consist of.

I hope this helps.

bakunin
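One way to pin down the decisions discussed above before counting anything is to encode them in a small filter. The sketch below is only one possible set of answers, not a definitive one: it assumes that hyphens separate words, that a trailing 's is dropped, that other contractions such as "doesn't" stay whole, and that surrounding quotes (including the backtick) are discarded. The function name tokenize is made up for this example.

```shell
#!/bin/sh
# tokenize: read text on stdin, write one lowercase word per line.
# Assumed rules: "-" splits words, a trailing "'s" is dropped,
# other apostrophes inside a word are kept, everything else is a separator.
tokenize() {
    tr '[:upper:]' '[:lower:]' |                 # fold case first
    sed -e "s/-/ /g" \
        -e "s/[^a-z']/ /g" |                     # anything but letters and ' becomes a space
    tr -s ' ' '\n' |                             # one token per line
    sed -e "s/^'*//" \
        -e "s/'*\$//" \
        -e "s/'s\$//" \
        -e '/^$/d'                               # strip quotes, then 's, then drop empty lines
}

# Example: prints twas, the, general, well, known, plan, doesn't, it (one per line).
printf '%s\n' "\`Twas the general's \"well-known\" plan; doesn't it?" | tokenize
```

A frequency list then follows from `tokenize < file01.txt | sort | uniq -c | sort -k1,1nr`. Note that this rule also turns "it's" into "it", which may or may not be what is wanted; changing an answer means changing a single sed expression.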
The Following User Says Thank You to bakunin For This Useful Post: A-V (06-16-2012)
#4
Thanks for your help.
Problem solved.