Problems with deleting punctuation and apostrophes
Dear All,
I have a file for which I want to get the frequency of each word, ignoring a list of stop words, and I am having problems with punctuation and with " 's ".
What I am doing is:
I have realized that the sed command is deleting some of my words and I don't know why.
For example, my file contains the word "general" 26 times, but in file01-clear I get only one.
Because file01-clear is wrong, I can't tell whether the final command to delete the stop words is right or wrong either.
Moreover, no matter what I did, the 's was not deleted from the file, so I have to do it manually.
First off, it might help to post a sample of your text (especially the one file with 26 words which is boiled down to one) because for most of us it is easier to understand what the problem is when we see it.
Second: you have hit the first real problem of all regexp constructors, which is: "what is a word"? Let's start.
The first attempt is what I found as a definition in IBM's "Handbook of Data Processing" some 30 years ago, which said:
A word is a sequence of nonblank characters separated by blanks.
To put this into a regexp (I write "<spc>" and "<tab>" to denote space and tab characters, since whitespace is hard to read in postings):
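A minimal sketch of that definition as an extended regular expression, assuming a grep that supports -o (GNU and BSD grep both do) and using the POSIX class [[:blank:]] in place of <spc> and <tab>. The function name is made up for illustration:

```shell
# Sketch only: [[:blank:]] stands in for the <spc>/<tab> pair.
# Assumes a grep with the -o option (print only the matched text).
blank_delimited_words() {
    # Print every run of non-blank characters that has a blank on both
    # sides, then drop the two delimiting blanks from each match.
    grep -Eo '[[:blank:]][^[:blank:]]+[[:blank:]]' | tr -d '[:blank:]'
}

printf 'the quick brown fox\n' | blank_delimited_words
# prints: quick
```

Only "quick" is printed. "the" and "fox" have no blank on one side, and "brown" is lost because the blank in front of it was already consumed as the trailing delimiter of "quick".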
This is fine, but it does not take into account line starts and line endings, where the preceding (or trailing) <spc> would be missing. So we have to refine this:
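One way to write the refinement, again only as a sketch with [[:blank:]] standing in for <spc> and <tab>: the leading blank may be replaced by the start of the line (^) and the trailing blank by its end ($). The variable and helper names are hypothetical; the helper just counts the lines on which an expression finds a word:

```shell
# Expression from the first attempt: a blank is required on both sides.
first_re='[[:blank:]][^[:blank:]]+[[:blank:]]'
# Refined expression: line start or line end may stand in for a blank.
refined_re='(^|[[:blank:]])[^[:blank:]]+([[:blank:]]|$)'

# Hypothetical helper: count the input lines on which expression $1 matches.
lines_with_word() {
    grep -Ec "$1" || true    # grep exits non-zero when the count is 0
}

printf 'general\n' | lines_with_word "$first_re"     # prints 0
printf 'general\n' | lines_with_word "$refined_re"   # prints 1
```

A word that sits alone on its line is invisible to the first expression and found by the second.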
Now this works better, but we have to take into account punctuation, which might come after a word. So we allow ",;.:!" after the characters instead of whitespace too:
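A possible sed rendering of this step, as a sketch: the word characters now exclude ",;.:!" and a run of those characters is dropped when it is followed by a blank or by the end of the line. It assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed do); the function name is invented for the example:

```shell
# Hypothetical helper: remove ",;.:!" where it trails a word.
strip_trailing_punct() {
    # \1 = the word characters, \2 = the blank (or nothing at end of line)
    sed -E 's/([^[:blank:],;.:!]+)[,;.:!]+([[:blank:]]|$)/\1\2/g'
}

printf 'Hello, world! This is it.\n' | strip_trailing_punct
# prints: Hello world This is it
```

Punctuation inside a token, as in "3.14", is left alone because it is not followed by a blank or the line end. The question mark is not in the list above, so it would have to be added to both bracket expressions if it should be stripped too.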
Finally there might be some quotation we have to take into account. Consider: "this" should be recognized as a word, yes. So we have to allow for single and double quotes before and after our characters:
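Extending the idea to quotes could look like the following sketch. It strips single and double quotes that open a word, and quotes or ",;.:!" that close one, while leaving an apostrophe inside a word untouched. It assumes sed -E is available; the function name and the variable q are made up for the example:

```shell
# Hypothetical helper: strip quotes around words plus trailing punctuation.
strip_word_wrappers() {
    q=\'\"    # the two quote characters, ' and ", kept in a variable
    # 1st expression: quotes right after line start or a blank
    # 2nd expression: quotes and ,;.:! right before a blank or line end
    sed -E -e 's/(^|[[:blank:]])['"$q"']+/\1/g' \
           -e 's/['"$q"',;.:!]+([[:blank:]]|$)/\1/g'
}

strip_word_wrappers <<'EOF'
He said: "this" is 'fine', really!
EOF
# prints: He said this is fine really
```

Because only quotes at the edge of a token are removed, "doesn't" passes through unchanged, which is where the decision about contractions comes in.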
Lastly, we have two decisions to make, at least if we parse English text:
The first is how we should treat hyphenated words: should "this-that" be treated as a single word or as two words, "this" and "that"? The first possibility is already handled correctly by now, but if we decide on the second we could simply replace the hyphens with white space beforehand:
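Replacing the hyphens is a one-liner in sed. This sketch turns every hyphen into a space, so it also affects free-standing dashes and negative numbers; the function name is only illustrative:

```shell
# Hypothetical helper: treat hyphenated compounds as separate words.
split_hyphenated() {
    sed 's/-/ /g'
}

printf 'this-that is a well-known case\n' | split_hyphenated
# prints: this that is a well known case
```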
The second decision is tougher, though: what should we do with contractions like "i'll" or "doesn't"? Should they be counted as words in their own right? Should they be decomposed? ("i'll" -> "i"+"will", "doesn't" -> "does"+"not") Should only the first word in such a contraction count? ("i'll" -> "i", "doesn't" -> "does")
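As an illustration of the last option only (keep the first word), here is a rough sketch. It assumes sed -E and plain ASCII apostrophes; the function name is hypothetical. If the text uses the typographic apostrophe ’ instead of ', none of these patterns will match, which is worth checking when a 's refuses to go away:

```shell
# Hypothetical helper: reduce a contraction or possessive to its first part.
first_part_of_contractions() {
    # 1st expression: "doesn't" -> "does" (drop n't at the end of a word)
    # 2nd expression: "i'll" -> "i", "file's" -> "file"
    sed -E -e "s/([[:alpha:]])n't([^[:alpha:]]|\$)/\1\2/g" \
           -e "s/([[:alpha:]])'[[:alpha:]]+/\1/g"
}

printf "i'll say it doesn't work and the file's gone\n" | first_part_of_contractions
# prints: i say it does work and the file gone
```

The n't rule is crude: "can't" becomes "ca" and "won't" becomes "wo", so irregular forms would need a small lookup list of their own.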
And what about more outlandish language quirks, like the introductory abbreviation? Look at the first word:
`Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.
What is it - or, better, what should it be? "Twas"? "`Twas"? "It"+"was"?
Seems like you have to answer this first before a solution can be provided. Pattern-matching problems are usually easy to solve once you have laid out exactly what the pattern you are searching for consists of.
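Once those decisions are made, the counting itself is short. A sketch under these assumptions: words are runs of letters and apostrophes, case is ignored, contractions and possessives stay as words of their own, and the stop words sit one per line (in lower case) in a file. The function name and the file names in the usage comment are placeholders, not taken from the original post:

```shell
# Hypothetical helper: word_frequencies TEXT_FILE STOPWORD_FILE
word_frequencies() {
    tr '[:upper:]' '[:lower:]' < "$1" |      # fold case
        tr -cs "[:alpha:]'" '\n' |           # one token per line
        sed -e "s/^'*//" -e "s/'*\$//" |     # trim apostrophes used as quotes
        grep -v '^$' |                       # drop empty lines
        grep -vxFf "$2" |                    # drop exact stop-word matches
        sort | uniq -c | sort -rn            # count, most frequent first
}

# Usage (placeholder file names):
#   word_frequencies mytext.txt stopwords.txt
```

With this definition "general" and "general's" are counted separately. Whether that is right depends on the answers to the questions above.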