UNIX for Dummies Questions & Answers: If you're not sure where to post a UNIX or Linux question, post it here. All UNIX and Linux newbies welcome!
#1
Unexpected Behaviour from grepping Text File
Hi! I recently downloaded a wordlist file called 2of12.txt, a list of common words that is part of the 12dicts package. I've been getting unexpected results from grepping it: either no matches when there clearly ought to be some, or results that are simply wrong. For example:

Code:
egrep ^...a.....n.$ /usr/share/dict/2of12.txt | head -5
apparition
bipartisan
cavalryman
defamation
dilatation

Clearly I'm asking for an eleven-letter word and getting ten-letter words (but at least the letters I'm asking for are in the right places). If I grep any other wordlist, I get the expected results.

Code:
egrep ^...a.....n.$ /usr/share/dict/sowpods.txt | head -5
advancement
advantaging
alkalescent
alkalifying
antalkaline

But if I add an extra dot at the end, I get the correct results. Well, not the correct results, but you know what I mean:

Code:
egrep ^...a.....n..$ /usr/share/dict/2of12.txt | head -5
advancement
arraignment
arrangement
derangement
devastating

I opened 2of12.txt in TextWrangler, showing invisibles, to see if there were some kind of extra whitespace characters in there, but I could see nothing wrong. It looks like they're all single words, each followed by a newline. Something must be wrong with this file, but I have no idea what it might be. I had read on the "12dicts - Helpful" page that the file contained annotations after certain words, but I can find none of these.

Does anyone have any idea what might cause this behaviour in a text file? If so, how can I find and fix this problem? Thanks!
#2
Maybe this can help:

Code:
egrep ^...a.....n. /usr/share/dict/2of12.txt | head -5
advancement
apparition
arraignment
arrangement
audaciousness

I think that a period matches any character, including the record separator, in a regular expression.

---------- Post updated at 10:44 PM ---------- Previous update was at 10:28 PM ----------

Maybe THIS can help better:

Code:
egrep -w ...a.....n. /usr/share/dict/2of12.txt | head -5
advancement
apparition
arraignment
arrangement
bipartisan

The -w option stands for --word-regexp in egrep's man page. Still, a period may match a record separator, and thus you get ten-character words OR eleven-character words...

Now THIS may be quite what you wanted in the first place:

Code:
egrep -w '...a.....n[^^M]' /usr/share/dict/2of12.txt | head -5
advancement
arraignment
arrangement
derangement
devastating

The regular expression ends with a character range that means 'any character EXCEPT a record separator'; you may get it as a caret followed by a newline, both enclosed in brackets (open bracket, caret, control-V, newline, close bracket). Just a single tiny side-effect: I get 'self-advancement' somewhere down the list of results...

Last edited by hexram; 06-02-2012 at 10:58 PM. Reason: Further clarification of my answer
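The behaviour discussed in this post can be reproduced on a tiny file. The following is only a sketch under stated assumptions: the two sample words and the temporary file are made up for illustration, the invisible character is assumed to be a carriage return (the `^M` typed above), and `grep -E` is used as the POSIX spelling of `egrep`.

```shell
# Hypothetical sample: two words with DOS (CRLF) line endings in a temp file.
sample=$(mktemp)
printf 'apparition\r\nadvancement\r\n' > "$sample"

# Eleven-position pattern: ten letters plus the invisible carriage return fit.
ten_letter=$(grep -E '^...a.....n.$' "$sample" | tr -d '\r')

# Twelve-position pattern: an eleven-letter word plus the carriage return fits.
eleven_letter=$(grep -E '^...a.....n..$' "$sample" | tr -d '\r')

echo "eleven-position pattern matched: $ten_letter"
echo "twelve-position pattern matched: $eleven_letter"

rm -f "$sample"
```

The `tr -d '\r'` at the end of each pipeline only cleans up the captured output; the matching itself happens against the unmodified lines.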
The following user says thank you to hexram for this useful post: sudon't (06-02-2012)
#3
Quote:
Code:
^...........$

should return exactly eleven characters. It certainly does in every other wordlist (or text file) that I have. You need the 'end of line' stop, or it'll look at the first eleven characters, then keep on going. Yet in 2of12, it returns ten-letter words. What can it be matching that I can't see when I have my text editor showing invisibles? Why does this 2of12 file behave differently? There must be something there I can't see.
#4
Quote:
The problem is that there are probably Ctrl-M characters in the file just before each newline (DOS-style line endings, which are 0x0d 0x0a pairs rather than just a single 0x0a). Run the file through dos2unix, or some other conversion tool, which will delete the Ctrl-M characters from the file.

---------- Post updated at 23:36 ---------- Previous update was at 23:33 ----------

If you want an easy way to verify that Ctrl-M characters are present, use cat:

Code:
cat -v 2of12.txt | head

This will likely generate the output:

Code:
a^M
aardvark^M
abaci^M
aback^M
abacus^M
abaft^M
abalone^M
abandon^M
abandoned^M
abandonment^M
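Besides eyeballing `cat -v` output, the carriage returns can also be counted. This is a minimal sketch, assuming a small made-up sample file in place of 2of12.txt; the variable names are hypothetical, and the carriage return is built with `printf` so the snippet does not rely on shell-specific quoting.

```shell
# Hypothetical sample: two CRLF-terminated lines and one plain LF line.
sample=$(mktemp)
printf 'a\r\naardvark\r\nabaci\n' > "$sample"

# Hold a literal carriage return in a variable.
cr=$(printf '\r')

# Count lines whose last character before the newline is a carriage return.
crlf_lines=$(grep -c "${cr}\$" "$sample")
total_lines=$(wc -l < "$sample" | tr -d ' ')
echo "$crlf_lines of $total_lines lines end in a carriage return"

# od -c dumps the bytes of the first line, with the carriage return shown as \r.
head -n 1 "$sample" | od -c

rm -f "$sample"
```

If the count equals the total number of lines, the whole file uses DOS-style line endings; a count of zero means the cause lies elsewhere.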
The following user says thank you to agama for this useful post: sudon't (06-02-2012)
#5
Quote:
Now THIS may be quite what you wanted in the first place:

Code:
egrep -w '...a.....n[^^M]' /usr/share/dict/2of12.txt | head -5
advancement
arraignment
arrangement
derangement
devastating

The regular expression ends with a character range that means 'any character EXCEPT a record separator'; you may get it as a caret followed by a newline, both enclosed in brackets (open bracket, caret, control-V, newline, close bracket). Just a single tiny side-effect: I get 'self-advancement' somewhere down the list of results...

OK, that looks like it works. I need to figure out how to use that to try and strip that junk out of the file. I think we get 'self-advancement' because we didn't start with a caret. Again, why don't we need the -w flag in any other file? If you don't use it, even with the starting caret, you get longer words.

---------- Post updated at 11:43 PM ---------- Previous update was at 11:40 PM ----------

Quote:

Code:
whom^M
whomever^M
whomsoever^M
whoop^M
whoopee^M
whooper^M
whoops^M
whoosh^M
whopper^M
whopping^M
whore^M

---------- Post updated at 11:56 PM ---------- Previous update was at 11:43 PM ----------

dos2unix did the trick.
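Where dos2unix is not installed, `tr` can strip the carriage returns instead. The sketch below is an alternative to what the thread used, not a replacement for it; the directory and file names are hypothetical, and the two sample words stand in for the real wordlist. Note that `tr -d '\r'` removes every carriage return in the file, which is fine for a one-word-per-line list.

```shell
# Hypothetical working directory and sample file with CRLF line endings.
workdir=$(mktemp -d)
printf 'apparition\r\nadvancement\r\n' > "$workdir/words_crlf.txt"

# Write a cleaned copy with all carriage returns deleted.
tr -d '\r' < "$workdir/words_crlf.txt" > "$workdir/words_lf.txt"

# The original eleven-position pattern now returns only the eleven-letter word.
matches=$(grep -E '^...a.....n.$' "$workdir/words_lf.txt")
echo "$matches"

# Confirm that no carriage returns are left in the cleaned copy.
remaining=$(tr -cd '\r' < "$workdir/words_lf.txt" | wc -c | tr -d ' ')
echo "carriage returns remaining: $remaining"

rm -rf "$workdir"
```

Writing to a separate output file keeps the original intact until the cleaned copy has been checked.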