Extract all proper names from string with awk


 
# 1  
Old 05-13-2012

I want to extract the proper names with awk from a very long string, like:
Code:
ő(k): &lt;/span&gt;<br /><a something="pls/pe/person.person?i_pers_id=3694&amp;i_topic_id=2&amp;i_city_id=3372&amp;i_county_id=-1" target="_blank"><b>Gary  Oldman</b></a> (George Smiley)<br /><a something="/pls/pe/person.person?i_pers_id=9384&amp;i_topic_id=2&amp;i_city_id=3372&amp;i_county_id=-1" target="_blank"><b>Colin  Firth</b></a> (Bill Haydon)<br /><a something="pls/pe/person.person?i_pers_id=209372&amp;i_topic_id=2&amp;i_city_id=3372&amp;i_county_id=-1" target="_blank"><b>Tom  Hardy</b></a> (Ricki Tarr)<br /><a something="/pls/pe/person.person?i_pers_id=10808&amp;i_topic_id=2&amp;i_city_id=3372&amp;i_county_id=-1" target="_blank">John  Hurt</a> (Control)<br /><a something="/pls/pe/person.person?i_pers_id=167105&amp;i_topic_id=2&amp;i_city_id=3372&amp;i_county_id=-1" target="_blank">Toby  Jones</a> (Percy Alleline)<br /><a something="/pls/pe/person.person?i_pers_id=24870&amp;i_topic_id=2&amp;i_city_id=3372&amp;i_county_id=-1" target="_blank">Mark  Strong</a> (Jim Prideaux)<br /><a something="/pls/pe/person.person?i_pers_id=219080&amp;i_topic_id=2&amp;i_city_id=3372&amp;i_county_id=-1" target="_blank">Benedict  Cumberbatch</a> (Peter Guillam)<br /><a something="/pls/pe/person.person?i_pers_id=108042&amp;i_topic_id=2&amp;i_city_id=3372&amp;i_county_id=-1" target="_blank">Ciarán  Hinds</a> (Roy Bland)<br /><a something="/pls/pe/person.person?i_pers_id=222906&amp;i_topic_id=2&amp;i_city_id=3372&amp;i_county_id=-1" target="_blank">David  Dencik</a> (Toby Esterhase)<br /><br />szinkronhang: <br /><a something="/pls/pe/person.person?i_pers_id=3880&amp;i_topic_id=2&amp;i_city_id=3372&amp;i_county_id=-1" target="_blank">Hegedűs D. 
Géza</a> (George Smiley magyar hangja)<br /><a something="/pls/pe/person.person?i_pers_id=22939&amp;i_topic_id=2&amp;i_city_id=3372&amp;i_county_id=-1" target="_blank">Csankó Zoltán</a> (Bill Haydon magyar hangja)<br /><a something="/pls/pe/person.person?i_pers_id=25860&amp;i_topic_id=2&amp;i_city_id=3372&amp;i_county_id=-1" target="_blank">Viczián Ottó</a> (Ricki Tarr magyar hangja)<br /><a something="/pls/pe/person.person?i_pers_id=13098&amp;i_topic_id=2&amp;i_city_id=3372&amp;i_county_id=-1" target="_blank">Tordy Géza</a> (Control magyar hangja)<br />
<a something="/pls/pe/person.person?i_pers_id=7028&amp;i_topic_id=2&amp;i_city_id=3372&amp;i_county_id=-1" target="_blank">Gyabronka József</a> (Percy Alleline magyar hangja)<br /><a something="/pls/pe/person.person?i_pers_id=6444&amp;i_topic_id=2&amp;i_city_id=3372&amp;i_county_id=-1" target="_blank">Széles Tamás</a> (Jim Prideaux magyar hangja)&lt;/span&gt;<br />

The output I want:
Gary Oldman
(George Smiley)
Colin Firth
(Bill Haydon)
Tom Hardy
(Ricki Tarr)
John Hurt
(Control)
Toby Jones
(Percy Alleline)
Mark Strong
(Jim Prideaux)
etc.
Thanks

---------- Post updated at 04:04 PM ---------- Previous update was at 03:55 PM ----------

Correction: the input is a single-line string. I also changed "href" to "something" in the sample, but I don't think that matters.
# 2  
Old 05-13-2012
Try:
Code:
perl -ln0e 'while (/>([^<]+)/g) {print "$1"}' file

# 3  
Old 05-13-2012
Unfortunately it doesn't work:
syntax error at -e line 1, near ") ("
Execution of -e aborted due to compilation errors.
# 4  
Old 05-13-2012
Did you copy&paste it to the terminal? What operating system are you using?
# 5  
Old 05-13-2012
No, I typed it by hand, but without any mistakes; I checked.
The system is Debian GNU/Linux 6.0.
# 6  
Old 05-13-2012
This is interesting, as my code contains no ") (" string, which is what the error reports. Can you try copying and pasting the code into the terminal window?
# 7  
Old 05-13-2012
Oh, sorry!
Now I see my typo: I typed ( instead of {.
So it works almost perfectly, but I need a single space between the first and last names, like
Code:
Gary Oldman

not
Code:
Gary  Oldman

.
Thanks
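
Since the original question asked for awk, here is a minimal sketch of the same idea (take every run of text between a ">" and the next "<") that also collapses repeated blanks into one space. It assumes the input really is a single line, as stated above, and that the names never contain "<". The function name extract_names is made up for illustration, and `[ \t]+` is used instead of `[[:space:]]` because older mawk versions may not support character classes.
Code:
```shell
# extract_names: hypothetical helper name, not from the thread.
# Reads HTML on stdin and prints each run of text found between a ">" and the next "<".
extract_names() {
  awk '{
    line = $0
    # match() sets RSTART/RLENGTH for the leftmost ">" followed by non-"<" characters
    while (match(line, />[^<]+/)) {
      text = substr(line, RSTART + 1, RLENGTH - 1)   # drop the leading ">"
      line = substr(line, RSTART + RLENGTH)          # continue after this match
      gsub(/[ \t]+/, " ", text)                      # collapse runs of blanks to one space
      sub(/^ /, "", text)                            # trim a leading blank
      sub(/ $/, "", text)                            # trim a trailing blank
      if (text != "") print text                     # skip pieces that were only blanks
    }
  }'
}

# Small sample in the same shape as the posted string
printf '%s\n' '<a something="x" target="_blank"><b>Gary  Oldman</b></a> (George Smiley)<br />' | extract_names
```

On that sample it should print "Gary Oldman" and "(George Smiley)" on separate lines. For the real data, run it as extract_names < file. Note that, like the perl one-liner, it will also print any other text that sits between tags, such as the "szinkronhang:" label.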