Problems with deleting punctuation and apostrophes


 
# 1  
Old 06-15-2012

Dear All,

I have a file for which I want a frequency list of every word, ignoring a list of stop words, and I am having problems with punctuation and with " 's ".

What I am doing is:

Code:
sed 's/[^a-zA-Z ]//g' file01.txt > file01-clear.txt
cat file01-clear.txt | tr "[:upper:]" "[:lower:]"| tr ' ' '\012' |sort |uniq -c |sort -n -r -k 1 > file01-FQ.txt
grep -v -F -f rejectfile.txt file01-FQ.txt > file01-results.txt

I have realized that the sed command is deleting some of my words and I don't know why.
For example, the word "general" occurs 26 times in my file, but in file01-clear I get it only once.
Because file01-clear is wrong, I can't tell whether the final command, which deletes the stop words, is right or wrong either.
Moreover, no matter what I did, the 's was not deleted, so I have to remove it manually.

I don't really know what I am doing wrong.

Can you please help me?

A-V
# 2  
Old 06-15-2012
useless use of cat

Code:
awk 'NR==FNR { STOP[$1]++; next } # First file (stopwords.txt): remember each stop word

{
        gsub(/[^a-zA-Z \t]/, ""); # Delete everything except letters, spaces and tabs
        $0=tolower($0); # Lowercase the line; assigning $0 re-splits it into fields
        for(N=1; N<=NF; N++) if(!($N in STOP)) W[$N]++ # Count all non-stopwords
}

END { for(X in W) print W[X], X; } # Print each word with its count
' stopwords.txt words.txt

Use nawk on Solaris.
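
The awk program above deletes punctuation outright, so "general's" would be counted as "generals" and "general.The" would fuse into one word. As a minimal sketch of one way around that (the function name `word_freq` and the sorting step are assumptions, not part of the original answer), the variant below strips a trailing 's first and turns the remaining non-letters into spaces instead of deleting them:

```shell
# Hypothetical wrapper (not from the thread): word_freq STOPFILE TEXTFILE
word_freq() {
    awk 'NR==FNR { STOP[tolower($1)]; next }   # first file: remember the stop words

    {
        $0 = tolower($0)
        gsub(/\047s$/, "")          # \047 is the apostrophe; drop the suffix at end of line
        gsub(/\047s[^a-z]/, " ")    # ... and before any other non-letter
        gsub(/[^a-z \t]/, " ")      # every remaining non-letter becomes a space
        for (N = 1; N <= NF; N++) if (!($N in STOP)) W[$N]++
    }

    END { for (X in W) print W[X], X }' "$1" "$2" |
        sort -k1,1nr -k2,2          # highest count first, ties alphabetically
}
```

Called as `word_freq stopwords.txt words.txt`, it prints one "count word" pair per line. Note that this also turns "it's" into "it", which may or may not be what you want, and that the NR==FNR test assumes the stop word file is not empty.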
This User Gave Thanks to Corona688 For This Post:
# 3  
Old 06-15-2012
First off, it might help to post a sample of your text (especially the one file with 26 words which is boiled down to one) because for most of us it is easier to understand what the problem is when we see it.

Second: you have hit the first real problem of all regexp constructors, which is: "what is a word"? Let's start.

The first attempt is what I found as a definition in IBM's "Handbook of Data Processing" some 30 years ago, which said:

A word is a sequence of nonblank characters separated by blanks.

To put this into a regexp (I write "<spc>" and "<tab>" to denote space and tab characters, because whitespace is hard to read in postings):

Code:
/[<spc><tab>]\([a-zA-Z][a-zA-Z]*\)[<spc><tab>]/

This is fine, but it does not take into account line starts and line endings, where the preceding (or trailing) <spc> would be missing. So we have to refine this:

Code:
/[<spc><tab>]*\([a-zA-Z][a-zA-Z]*\)[<spc><tab>]*/

Now this works better, but we have to take into account punctuation, which might come directly after a word. So we also allow ",;.:!" after the letters, in addition to whitespace:

Code:
/[<spc><tab>]*\([a-zA-Z][a-zA-Z]*\)[<spc><tab>,;:.!]*/

Finally, there might be quotation marks we have to take into account. Consider: "this" should still be recognized as a word. So we have to allow for single and double quotes before and after our letters:

Code:
/[<spc><tab>"']*\([a-zA-Z][a-zA-Z]*\)[<spc><tab>,;:.!"']*/
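
To try the pattern above with a real sed, the placeholders have to be turned into actual characters. The following is only a sketch under some assumptions: the function name `split_words` is made up, and the character classes are built from shell variables so that the quotes and the "!" do not get in the way of the shell. Each match is replaced by the captured word followed by a newline:

```shell
# Hypothetical helper: print the words found on stdin, one per line.
split_words() {
    tab=$(printf '\t')     # a literal tab for the bracket expressions
    punct=',;:.!'          # punctuation allowed after a word
    quotes=\"\'            # double and single quote
    # leading blanks/quotes, \( letters \), trailing blanks/punctuation/quotes
    sed "s/[ $tab$quotes]*\([a-zA-Z][a-zA-Z]*\)[ $tab$punct$quotes]*/\1\\
/g" | grep -v '^$'         # drop the empty line left at the end of each input line
}
```

Characters that are in none of the classes (digits, "?", hyphens and so on) are passed through untouched, which makes it easy to see where the definition of a "word" still needs work.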

Lastly, we have two decisions to make, at least if we parse English text:

The first is how we should treat hyphenated words: should "this-that" be treated as a single word or as two words, "this" and "that"? For the first possibility the hyphen would have to be added to the letter class, because as written the pattern stops at the hyphen. If we decide on the second, we can simply replace the hyphens with white space beforehand:

Code:
s/-/<spc>/g
/[<spc><tab>"']*\([a-zA-Z][a-zA-Z]*\)[<spc><tab>,;:.!"']*/

The second decision is tougher, though: what should we do with contractions like "I'll" or "doesn't"? Should they be counted as words in their own right? Should they be decomposed? ("I'll" -> "I"+"will", "doesn't" -> "does"+"not") Should only the first word in such a contraction count? ("I'll" -> "I", "doesn't" -> "does")
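
Two of these policies can be sketched with plain sed. This is an illustration only: the function name `contraction_words` and the policy names "keep" and "first" are invented here, and decomposing contractions into full words is left out because it needs a lookup table.

```shell
# Hypothetical helper: contraction_words POLICY   (reads text on stdin)
#   keep  - the apostrophe counts as a letter, so "doesn't" stays one word
#   first - only the part before the contraction counts ("doesn't" -> "does")
contraction_words() {
    case $1 in
        keep)  script="s/[^a-zA-Z']/ /g" ;;
        first) script="s/n't//g; s/'[a-zA-Z]*//g; s/[^a-zA-Z]/ /g" ;;
        *)     echo "usage: contraction_words keep|first" >&2; return 1 ;;
    esac
    # Apply the policy, lowercase, put one word per line, drop empty lines.
    sed "$script" | tr 'A-Z' 'a-z' | tr -s ' ' '\n' | grep -v '^$'
}
```

For the input "I'll say it doesn't", the keep policy yields "i'll", "say", "it", "doesn't", and the first policy yields "i", "say", "it", "does". Both are crude: "first" turns "can't" into "ca", and "keep" leaves the quotes on a word written as 'this'.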

And what about more outlandish language quirks, like the introductory abbreviation? Look at the first word:

`Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.


What is it - or, better, what should it be? "Twas"? "`Twas"? "It"+"was"?

It seems you have to answer these questions first, before a solution can be provided. Pattern-matching problems are usually easy to solve once you have laid out exactly what the pattern you are searching for consists of.

I hope this helps.

bakunin
This User Gave Thanks to bakunin For This Post:
# 4  
Old 06-16-2012
Thanks for your help.

Problem solved.
 