Get group of consecutive uppercase words using gawk


 
# 1  
Old 05-22-2012

Hi

I'd like to extract, from a text file, the strings starting with "The Thing" and only composed of words with a capital first letter and apostrophes, like for example:
"The Thing I Only" from "those are the The Thing I Only go for whatever."
or
"The Thing That Are Like Men's Eyewear" from "those are The Thing That Are Like Men's Eyewear even if bla bla. "

I am trying this but without success!
Code:
gawk '{match($0,/The Thing [A-Z][^[:space:]]*[[:space:]]/,arr); print arr[0]}' test

I get the first uppercase word after "The Thing" but I don't know how to get the others.

Thanks
# 2  
Old 05-22-2012
Code:
$ awk '/The Thing/ {for (i=1;i<=NF;i++){if ($i~/^[A-Z]/){printf "%s ",$i}}}{printf RS}' x
The Thing That Are Like Men's Eyewear
The Thing I Only
The Thing I
The Thing I Also Want.

$ cat x
those are The Thing That Are Like Men's Eyewear even if bla bla.
those are the The Thing I Only go for whatever.
these are the The Thing I dont want.
those are the The Thing I Also Want.
those are the.
$

---------- Post updated at 16:21 ---------- Previous update was at 16:16 ----------

Note: As shown in the sample, I assumed there are no words with an uppercase first letter before "The Thing". If that is not the case, the condition needs to change a bit.
# 3  
Old 05-22-2012
How about perl

Code:
perl -ne 'if(/The Thing/){foreach (split(/\s+/)) { print "$_ " if (substr($_,0,1)=~/[A-Z]/)} printf "\n";}' x

---------- Post updated at 06:31 AM ---------- Previous update was at 06:18 AM ----------

Modified version
Code:
perl -lne '$,=" ";print /[A-Z]\s+|\s+[A-Z]$|[A-Z].+?\s+|[A-Z].+?$/g'

# 4  
Old 05-22-2012
Quote:
Originally Posted by louisJ
I get the first uppercase word after "The Thing" but I don't know how to get the others.

Thanks
You need to group your "word" definition and allow it to repeat: ( regex )*. I changed it to match spaces at the beginning of each word instead of the end, which simplifies handling phrases that sit at the end of a line.

Code:
awk 'match($0,/The Thing([[:space:]]*[[:upper:]][^[:space:]]*)*/) { print substr($0,RSTART,RLENGTH) }' input
The Thing I Only
The Thing That Are Like Men's Eyewear

# 5  
Old 05-23-2012
Thanks everybody for these answers. I am using your solution, neutronscott, because it is the closest to my original command.
But anchal_khare, you are right: I have an uppercase word at the beginning of sentences. Is there a way to exclude it in neutronscott's command?

Another thing: how can I get all the matching expressions in the line, not only the first one?
And how can I include special characters like ® or ™? (anchal_khare, I did think these were two questions and posted another thread, but you are right that it would be a duplicate. Sorry for that.)

Edit: I guess ® or ™ are included de facto in
[^[:space:]]
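
A quick check of that guess, as a sketch: `first_cap_word` is a hypothetical helper, and the sample lines are made up for illustration. It assumes an awk with POSIX character classes (gawk does). Because `[^[:space:]]` accepts anything that is not whitespace, the sign attached to the word is carried along:

```shell
# Hypothetical helper: print the first capitalised word found on each input line.
first_cap_word() {
  awk 'match($0, /[[:upper:]][^[:space:]]*/) { print substr($0, RSTART, RLENGTH) }'
}

printf '%s\n' 'made with Roubilflex® fabric' | first_cap_word   # Roubilflex®
printf '%s\n' 'a trusted Series™ coat' | first_cap_word         # Series™
```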

Last edited by louisJ; 05-23-2012 at 04:59 AM..
# 6  
Old 05-23-2012
Quote:
Originally Posted by louisJ
But anchal_khare, you are right: I have an uppercase word at the beginning of sentences. Is there a way to exclude it in neutronscott's command?
What do you mean? Mine only matches what comes after 'The Thing'. His checks the line for 'The Thing' but then looks for capitalized words from the beginning of the line, rather than from the start of the occurrence of 'The Thing'.


Quote:
Originally Posted by louisJ
And another thing, how to get all the matching expressions in the line, not only the first one?
This would require a loop. Continuing with the match method:

Code:
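#!/usr/bin/awk -f

{
  offset = 0   # number of characters of $0 already consumed
  # match() runs on the unscanned tail, so RSTART is relative to that tail
  while (match(substr($0, offset + 1), /The Thing([[:space:]]*[[:upper:]][^[:space:]]*)*/))
  {
    print substr($0, RSTART + offset, RLENGTH)
    # stop on the last matched character; RSTART + RLENGTH alone would skip one extra character
    offset += RSTART + RLENGTH - 1
  }
}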

Tested with this input I made up:

Code:
[mute@geek ~/temp/louisJ]$ cat input
some would say The Thing He Wants and The Thing She Gives Him are not The Thing That Matters Most. :(
This Thing and The Thing like Another Thing
I HAVE NOT The Thing TO DO WITH IT! The Thing Is Not Here it is there
 
[mute@geek ~/temp/louisJ]$ ./script input
The Thing He Wants
The Thing She Gives Him
The Thing That Matters Most.
The Thing
The Thing TO DO WITH IT! The Thing Is Not Here

If you want 'The Thing' to also allow matches like 'THE THING', you can write a bracket expression for each letter after the initial capital, like so: T[Hh][Ee] T[Hh][Ii][Nn][Gg]

Another thing (heh): in the last example the longest match is taken, so the second 'The Thing' stays part of one match. Do you want it split at the second occurrence of 'The Thing' within another 'The Thing'?

Last edited by neutronscott; 05-23-2012 at 10:01 AM..
# 7  
Old 05-24-2012
Thanks neutronscott,
in order to be more clear,
From this text:

Code:
 Combining greatness with lightweight origination, The Thing Men’s Are To Get offers way-compatible alpine comfort. Originated with effective Roubilflex® One and wrapped in a tight, waterfall HaavVeont® 
Alpha coat, versatility is a hallmark of this thing. The hat is removable, fully rullable and has a laminated tissu. Zips aid ventilation during sports. Styled in a good, athletic fit. The Thing Men’s Are To Get is a 
fishing jacket that provides environmental coolenss and origination with the range of technical things you would expect from a trusted Every Series™ offing coat.

I need to obtain
Code:
 The Thing Men’s Are To Get Roubilflex® One HaavVeont® Alpha The Thing Men’s Are To Get Every Series™

This example covers every case:
- the group starting with "The Thing" occurs twice,
- uppercase words that merely start a sentence are excluded,
- groups containing special signs (® and ™) are included.

Last edited by louisJ; 05-25-2012 at 03:06 PM..