The UNIX and Linux Forums  

  #1  
07-21-2008, Registered User (joined Jul 2008, 16 posts)
Question: extracting a string from an HTML file with sed or awk

Hi,

I am downloading an HTML file with wget and now need to extract certain information from it.

The string starts after "Attributes:" and ends before "Stats:". It may span multiple lines.
Within that string there are arbitrary additional strings that I need.
They have the pattern "String1: String2"; these are the attributes and their values, separated by a ":".

So based on the example below my result should be:
Category: Movies
Video Source: DVD
Video Format: WMV
Video Genre: Horror, Mystery, Thriller
Language: English
Audio Format: WMA

I tried sed -n "/Attributes:/,/Stats:/p", but that prints the whole lines containing the two strings and not only the text between them.

I have not found an option to change that behavior.

Thanks in advance for any help
Rag

Quote:
d></tr><tr><th>Size:</th><td>89 files, 76 data, 13 recovery<br />Encoded: <span class="fileSize">3,705.4MB</span> data + <span class="fileSize">379.5MB</span> recovery = <span class="fileSize">4,084.8MB</span> total<br />Decoded: <span class="fileSize">3,557.1MB</span> data + <span class="fileSize">364.3MB</span> recovery = <span class="fileSize">3,921.4MB</span> total (estimated)</td></tr><tr><th>Nerw</th><td><a href="/asdasd/">fsdfsdf</a>, <a href="/adasdasd/">dfgdfg</a></td></tr><tr><th>Attributes:</th><td>Category: Movies<br />Video Source: DVD<br />Video Format: WMV<br />Video Genre: Horror, Mystery, Thriller<br />Language: English<br />Audio Format: WMA<br /></td></tr><tr><th>Stats:</th><td>25 views, <a href="#CommentsPH">0 comments</a></td></tr><tr><th>Editor Notes:</th><td><strong><span class="highlight">WMV is natively played back on xbox360.</span></strong></td></tr></table></div><div class="clear"></div><h3
second example on two lines:
Quote:
Reports from this Poster</a><br /><a href="/browse/all/f/?a_id=14402873">Browse Files from this Poster</a></td></tr><tr><th>Reported:</th><td>Monday 21 Jul 2008, 09:28AM CET (<span class="ageVeryNew">76 minutes ago</span>) (<span class="ageVeryNew">4 hours </span>after upload)</td></tr><tr><th>Posted:</th><td>Started: <span class="ageVeryNew">12 hours </span>ago, finished: <span class="ageVeryNew">6 hours </span>ago, duration: 6h 21m 59s<br />Approx upload speed: 219.9KB/sec</td></tr><tr><th>Size:</th><td>104 files, 88 data, 16 recovery<br />Encoded: <span class="fileSize">4,343.0MB</span> data + <span class="fileSize">579.0MB</span> recovery = <span class="fileSize">4,922.0MB</span> total<br />Decoded: <span class="fileSize">3,040.1MB</span> data + <span class="fileSize">405.3MB</span> recovery = <span class="fileSize">3,445.4MB</span> total (estimated)</td></tr><tr><th>Nwer:</th><td><a href="/werwerwer/">werwer</a></td></tr><tr><th>Attributes:</th><td>Category: Movies<br />Region System: PAL<br />Video Source: DVD<br />Video Format: DVD<br />Video Genre: Drama, War<br />Language: English<br />Subtitled Language: Dutch<br />Audio Format: AC3/DD<br /></td></tr><tr><th>Stats:</th><td>7 views, <a href="#CommentsPH">0 comments</a></td></tr><tr><th>Editor Notes:</th><td>None</td></tr></table></div><div class="clear"></div><h3 id="SimilarPH"><span style="float: right">[<a href="werwer">View All Similar Reports</a>] </span>Similar Reports (up to 5)</h3><div id="HideRelatedPosts" style="border: 1px solid black"><p style="text-align: center"><a href="#SimilarPH" onclick="
$('HideRelatedPosts').style.display = 'none';
$('ShowRelatedPosts').style.display = 'block';
">Currently hiding; click to view</a></p></div><div id="ShowRelatedPosts" style="display: none"><table summary="Post query results" class="dataTabular fixed"><thead><tr><th style="width: 20px">&nbsp;</th><th style="width: 110px">Size <a title="Sort by the size of the posts (Ascend
It seems it is always "Attributes:</th><td>" at the start and "<br /></td></tr><tr><th>Stats:" at the end.
  #2  
07-21-2008, rubin, Registered User (joined Nov 2007, 215 posts)
What is your OS?

If you're on Linux try:

Code:
awk '/Stats/{f=0} f && NF; /Attributes/{f=1}'  RS="<[^>]+>" html_file
Output of the first example:

Code:
Category: Movies
Video Source: DVD
Video Format: WMV
Video Genre: Horror, Mystery, Thriller
Language: English
Audio Format: WMA
Output of the second example:

Code:
Category: Movies
Region System: PAL
Video Source: DVD
Video Format: DVD
Video Genre: Drama, War
Language: English
Subtitled Language: Dutch
Audio Format: AC3/DD

If there are unclosed (partial) tags somewhere in the records, replace RS with:

Code:
RS="<[^>]+>|<[^>]+|[^>]+>"
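
For anyone wondering how the one-liner works, here is a commented sketch of the same logic wrapped in a shell function. The function name extract_attributes is only an illustration and not part of the original command. It assumes an awk that accepts a regular expression as RS (gawk and mawk do; POSIX leaves a multi-character RS unspecified).

```shell
# Sketch only: same three rules as the one-liner, one per line.
# Assumption: "awk" here treats RS as a regular expression (gawk, mawk).
extract_attributes() {
    awk '
        /Stats/      { f = 0 }   # reached the "Stats:" cell: stop printing
        f && NF                  # inside the block and not blank: print the record
        /Attributes/ { f = 1 }   # saw "Attributes:": print from the next record on
    ' RS='<[^>]+>' "$1"
}
```

Because RS matches every HTML tag, each record is the text between two tags. The order of the rules matters: the Stats check comes first so that "Stats:" itself is not printed, and the Attributes check comes last so that "Attributes:" itself is not printed either. A call would look like: extract_attributes html_file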
  #3  
07-21-2008, Registered User (joined Jul 2008, 16 posts)
Wow, I have no idea how exactly that awk command works, but fortunately I don't have to. Big thanks, rubin!

I'm on Windows with the unixutils. I'll additionally strip the spaces:

Code:
gawk "/Stats/{f=0} f && NF; /Attributes/{f=1}" RS="<[^>]+>" index-3123947.html|sed -e "s/ //g">x.txt
Then I set them as environment variables in batch:

Code:
for /F "delims=: tokens=1,2" %f in (x.txt) do set nzb%f=%g
  #4  
07-21-2008, Registered User (joined Jul 2008, 16 posts)
Now I wanted to use that on my XP download machine, but unfortunately the wget output is different there.

The result has newlines and tabs. For my purposes all whitespace could be stripped: space, tab, CR and LF.
I added: |sed -e "s/\t*//g"|tr -d "\n" but then I don't have any CR/LF left at all.

Please help again.
Thanks

Quote:
<th>Attributes:</th>
<td>
Category: Movies<br />
Video Source:
DVD <br />
Video Format:
XviD <br />
Video Genre:
Animation, Family, Fantasy <br />
Language:
English <br />
Audio Format:
MP3 <br />
</td>
</tr>

<tr>
<th>Stats:</th>
After awk:
Quote:
Category: Movies


Video Source:

DVD



Video Format:

XviD



Video Genre:

Animation,
Family,

Fantasy



Language:

English



Audio Format:

MP3
  #5  
07-21-2008, rubin, Registered User (joined Nov 2007, 215 posts)
If I understood your requirement correctly, the following code will remove extra spaces, newlines and tabs.

Code:
awk  '/Stats/{f=0}{$1=$1} !/^$/ && f; /Attributes/{f=1}' RS="<[^>]+>|\n+"  html > out_html.txt
The command writes its result to a text file. Open that file with WordPad or MS Word (not Notepad), so that the newlines in the output are shown.

For your other sample, this is the output of running the code under Cygwin (on Windows):

Code:
Category: Movies
Video Source:
DVD
Video Format:
XviD
Video Genre:
Animation, Family, Fantasy
Language:
English
Audio Format:
MP3
Is this what you required?
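
If the goal is still one "Attribute: Value" pair per line, as in the first post, one possible variation is to leave the newlines out of RS and let awk rebuild each record instead. The sketch below does that. The function name attributes_one_per_line is made up for illustration, the carriage-return cleanup is an assumption about the Windows wget output, and like the commands above it assumes an awk that treats RS as a regular expression (gawk, mawk).

```shell
# Sketch only: split on tags, then join each record onto a single line.
# Assumption: "awk" here treats RS as a regular expression (gawk, mawk).
attributes_one_per_line() {
    awk '
        { gsub(/\r/, "") }            # drop carriage returns (Windows line endings)
        /Stats:/      { f = 0 }       # reached the "Stats:" cell: stop printing
        f && NF { $1 = $1; print }    # rebuild the record: fields joined by single spaces
        /Attributes:/ { f = 1 }       # print from the next record on
    ' RS='<[^>]+>' "$1"
}
```

With the default field separator, spaces, tabs and newlines inside a record all count as field separators, so the assignment $1 = $1 turns "Video Source:", a line break and "DVD" into the single line "Video Source: DVD". The NF test is done before the assignment so that blank records are skipped rather than printed as empty lines.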
  #6  
Old 07-21-2008
Moderator
 

Join Date: Dec 2003
Location: /dev/fl
Posts: 1,059
Not to detract from this solution, but it appears to only work with gawk.
  #7  
07-21-2008, rubin, Registered User (joined Nov 2007, 215 posts)
That's correct, I've noticed that too. To get it to work with an older awk (say on HP-UX or Solaris), RS has to be changed to FS, possibly with some further code modifications. That's because older awks have a problem parsing RS correctly: the whole regex gets reduced to a mere "<". Luckily FS is not affected and is parsed as a regex, the way it's expected.

That's why I first asked which OS the code will run on.
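
A minimal sketch of what such an FS-based variant could look like, assuming a POSIX awk with sub() and a regex FS (on Solaris that probably means nawk or /usr/xpg4/bin/awk rather than the old /usr/bin/awk). This is one possible shape, not necessarily the modification meant above, and the function name extract_attributes_portable is hypothetical. Each input line stays a record, the tags become field separators, and a loop walks over the fields.

```shell
# Sketch only: tags are field separators (FS) instead of record separators (RS).
extract_attributes_portable() {
    awk -F'<[^>]+>' '
        {
            # Each field is the text between two tags on this line.
            for (i = 1; i <= NF; i++) {
                s = $i
                sub(/^[ \t\r]+/, "", s)          # trim leading whitespace
                sub(/[ \t\r]+$/, "", s)          # trim trailing whitespace
                if (s ~ /Stats/)      f = 0      # stop at the "Stats:" cell
                if (f && s != "")     print s    # print non-empty text inside the block
                if (s ~ /Attributes/) f = 1      # start after the "Attributes:" cell
            }
        }
    ' "$1"
}
```

The flag f lives across input lines, so this handles both the single-line pages and the layout with one tag per line. In the latter case a label and its value still come out on separate lines, as with the RS version that also splits on newlines.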

Tags
linux, solaris
