Awk/sed HTML extract


 
# 1  
Old 07-31-2016

I'm extracting the text between <th> tags in an HTML file:

Code:
<th><a href="/wiki/Buick_LeSabre" title="Buick LeSabre">Buick LeSabre</a></th>


using this:

Code:
awk -F "</*th>" '/<\/*th>/ {print $2}' auto2 > auto3

then this (to get the text inside the a href link):
Code:
sed -e 's/\(<[^<][^<]*>\)//g' auto3 > auto4

How can I shorten this into one command, preferably just awk or just sed? I've tried the following, where $0 prints the entire a href line, with tags, but $1, $2, $3, etc. just give a blank file.
Code:
awk -F "</?a href.*>" '{print $0}' auto3 > auto5

Thanks in advance for help.
# 2  
Old 07-31-2016
Code:
awk -F"[<>]" '/<\/th>/ {print $5}' auto2


Last edited by rdrtx1; 08-01-2016 at 10:40 AM..
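
To see why field 5 holds the link text, it can help to print every field that the `[<>]` separator produces. This is only a sketch using the sample line from post #1; the variable names `line`, `fields` and `name` are made up for the illustration.

```shell
# Sample line from post #1; variable names are illustrative only.
line='<th><a href="/wiki/Buick_LeSabre" title="Buick LeSabre">Buick LeSabre</a></th>'

# Split on every < or > and list the numbered fields.
fields=$(printf '%s\n' "$line" |
  awk -F'[<>]' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')
printf '%s\n' "$fields"

# The same split, keeping only field 5 as in the command above.
name=$(printf '%s\n' "$line" | awk -F'[<>]' '/<\/th>/ { print $5 }')
printf '%s\n' "$name"
```

A header cell such as `<th style="width:15em">Automobile</th>` has no nested `<a>` tag, so its text lands in field 3 and field 5 is empty. That would explain blank lines in the output for such rows.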
# 3  
Old 08-01-2016
Given those <th> tags are on a line by themselves (which would be required for your awk sample to work anyway),

Code:
sed -n '/^<th/s/<[^>]*>//gp' file
Buick LeSabre


EDIT: Should that NOT be the case, remove other tags upfront...
Code:
sed -n '/<th/{s/^.*<th>//;s/<\/th>.*$//;s/<[^>]*>//gp}' file


Last edited by RudiC; 08-01-2016 at 03:15 AM..
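
The three substitutions in the second command can be followed step by step on a line where the `<th>` cell shares the line with other cells. The input below is a hypothetical one-line row made up for this sketch, and a `;` has been added before the closing `}` because some sed implementations require it.

```shell
# Hypothetical input: a <td> cell and the <th> cell on the same line.
row='<td>1908-27</td><th><a href="/wiki/Ford_Model_T" title="Ford Model T">Ford Model T</a></th><td>x</td>'

# 1. drop everything up to and including <th>
# 2. drop </th> and everything after it
# 3. strip the remaining tags; p prints only if step 3 replaced something
cell=$(printf '%s\n' "$row" |
  sed -n '/<th/{s/^.*<th>//;s/<\/th>.*$//;s/<[^>]*>//gp;}')
printf '%s\n' "$cell"
```

Note that the address `/<th/` also selects cells written as `<th style="...">`. Step 1 does not match there, but step 3 still strips the tags, so the text of those cells is printed as well.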
# 4  
Old 08-01-2016
Thanks RudiC, those are both very close. I probably should have posted the table structure, because the sed commands also return fields from other table elements. I just need the link text inside the <th><a href=...> cells under the "Automobile" heading:
Code:
<table class="wikitable sortable" style="font-size:90%">
<tr>
<th style="width:5em">Image</th>
<th style="width:15em">Automobile</th>
<th style="width:10em">Production</th>
<th style="width:15em">Units Sold</th>
<th style="width:10em">Years sold</th>
<th style="width:25em">Notes</th>
</tr>
<tr>
<td>
<div class="center">
<div class="floatnone"><a href="/wiki/File:Late_model_Ford_Model_T.jpg" class="image" title="1927 Ford Model-T."><img alt="1927 Ford Model-T." src="//upload.wikimedia.org/wikipedia/commons/thumb/1/15/Late_model_Ford_Model_T.jpg/100px-Late_model_Ford_Model_T.jpg" width="100" height="91" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/1/15/Late_model_Ford_Model_T.jpg/150px-Late_model_Ford_Model_T.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/1/15/Late_model_Ford_Model_T.jpg/200px-Late_model_Ford_Model_T.jpg 2x" data-file-width="400" data-file-height="365" /></a></div>
</div>
</td>
<th><a href="/wiki/Ford_Model_T" title="Ford Model T">Ford Model T</a></th>
<td>1908-27</td>
<td><b>16,500,000</b><sup id="cite_ref-ford_7-0" class="reference"><a href="#cite_note-ford-7">[7]</a></sup></td>
<td>1908-27</td>
<td>The first car to achieve one million, five million, ten million and fifteen million units sold. By 1914, it was estimated that nine out of every ten cars in the world were <a href="/wiki/Ford_Motor_Company" title="Ford Motor Company">Fords</a>.<sup id="cite_ref-ford_7-1" class="reference"><a href="#cite_note-ford-7">[7]</a></sup></td>
</tr>

Thanks for your time.

Re: rdrtx1's awk command, thanks. It prints a blank file for anything beyond $1 (which prints the full doc). I tried up to $6.
# 5  
Old 08-01-2016
Hello p1ne,

Could you please try the following and let me know if it helps?
Code:
awk '($1 ~ /<th><a/){sub(/.*\">/,X,$0);sub(/<.*/,X,$0);print $0}'   Input_file

Output will be as follows.
Code:
Ford Model T

EDIT: Adding one more solution for the same problem.
Code:
 awk '{if($0 ~ /^<th><a href=\"/){match($0,/\">.*/);print substr($0,RSTART+2,RLENGTH-11)}}'  Input_file

Thanks,
R. Singh

Last edited by RavinderSingh13; 08-01-2016 at 10:07 AM.. Reason: Adding one more solution now.
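
The offsets in the second awk command come from the fixed text around the name: `match()` finds the first `">` and runs to the end of the line, so 2 characters are skipped at the start and 9 more (`</a></th>`) are cut from the end, 11 in total. A small sketch of that arithmetic on the sample line, with illustrative variable names:

```shell
# Sample line from post #4; variable names are illustrative only.
line='<th><a href="/wiki/Ford_Model_T" title="Ford Model T">Ford Model T</a></th>'

# Show where the match starts and how long it is.
info=$(printf '%s\n' "$line" | awk '{ match($0, /">.*/); print RSTART, RLENGTH }')

# Skip the 2 characters of '">' and drop the 9 characters of '</a></th>'.
text=$(printf '%s\n' "$line" |
  awk 'match($0, /">.*/) { print substr($0, RSTART + 2, RLENGTH - 11) }')

printf '%s\n%s\n' "$info" "$text"
```

Because 11 is hard-coded, this only holds while the line ends in exactly `</a></th>`; trailing text or a different closing tag would need a different count.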
# 6  
Old 08-01-2016
Thanks so much, R. Singh, indeed, that does it!

RudiC, following your example, I'd also like to solve this with sed. I'm trying this and variations of it, all of which give a blank file:
Code:
sed -n '/^<th.^<a href.*/s/<[^>]*>//gp' auto2 > auto3

Thanks again.
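
A likely reason for the blank file: in a basic regular expression `^` only anchors at the start of the pattern, so the second `^` is taken as a literal caret, which the line does not contain. The sketch below compares the attempted address with one where the stray `^` is removed; the variable names are illustrative only.

```shell
# Sample line from post #4; variable names are illustrative only.
line='<th><a href="/wiki/Ford_Model_T" title="Ford Model T">Ford Model T</a></th>'

# Address from the attempt: the second ^ is literal, so no line is selected.
tried=$(printf '%s\n' "$line" | sed -n '/^<th.^<a href.*/s/<[^>]*>//gp')

# Stray ^ removed: the "." stands for the ">" that closes <th>.
fixed=$(printf '%s\n' "$line" | sed -n '/^<th.<a href/s/<[^>]*>//gp')

printf 'tried=[%s]\nfixed=[%s]\n' "$tried" "$fixed"
```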
# 7  
Old 08-01-2016
Glad to help, p1ne. Could you please try the following code and let us know if it helps?
Code:
sed -n '/^<th><a href="/s/\(.*">\)\(.*\)\(<\/a.*\)/\2/p'   Input_file

Output will be as follows.
Code:
Ford Model T

Thanks,
R. Singh
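
As a quick check that the address `/^<th><a href="/` skips header cells and other rows, the command can be run on a few lines taken from posts #1 and #4. This is only a sketch; the input is a shortened excerpt and `names` is an illustrative variable.

```shell
# Group 1: everything up to '">', group 2: the link text,
# group 3: '</a' and the rest of the line. Only group 2 is kept.
names=$(printf '%s\n' \
  '<th style="width:15em">Automobile</th>' \
  '<th><a href="/wiki/Ford_Model_T" title="Ford Model T">Ford Model T</a></th>' \
  '<td>1908-27</td>' \
  '<th><a href="/wiki/Buick_LeSabre" title="Buick LeSabre">Buick LeSabre</a></th>' |
  sed -n '/^<th><a href="/s/\(.*">\)\(.*\)\(<\/a.*\)/\2/p')

printf '%s\n' "$names"
```

Only the two lines that begin with `<th><a href="` are selected, so the header cell and the `<td>` row produce no output.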