Awk/sed HTML extract

07-31-2016


I'm extracting the text between table tags in HTML such as:

Code:

<th><a href="/wiki/Buick_LeSabre" title="Buick LeSabre">Buick LeSabre</a></th>

using this:

Code:

awk -F "</*th>" '/<\/*th>/ {print $2}' auto2 > auto3

then this (text between a href):

Code:

sed -e 's/\(<[^<][^<]*>\)//g' auto3 > auto4

How can I shorten this into one command, preferably just awk or just sed? I've tried the following, where $0 prints the entire a href line, tags included, but $1, $2, $3, etc. just give a blank file.

Code:

awk -F "</?a href.*>" '{print $0}' auto3 > auto5

Thanks in advance for help.

p1ne
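
A likely reason the `$1`, `$2`, ... attempt prints only blank lines: the `.*>` in the field separator is greedy, so the separator matches from `<a href` all the way to the final `>` on the line and nothing is left over for the fields. The sketch below shows this on a single sample line. The bounded separator `</?a[^>]*>` is an illustrative alternative, not something taken from the posts in this thread.

```shell
# Sample line, as produced by the first awk step (one line of auto3).
line='<a href="/wiki/Buick_LeSabre" title="Buick LeSabre">Buick LeSabre</a>'

# Greedy ".*>" runs to the LAST ">" on the line, so the separator covers
# the whole line and both resulting fields are empty.
greedy=$(printf '%s\n' "$line" | awk -F '</?a href.*>' '{print "[" $1 "][" $2 "]"}')

# "[^>]*>" stops at the first ">", which leaves the link text in $2.
bounded=$(printf '%s\n' "$line" | awk -F '</?a[^>]*>' '{print $2}')

printf '%s\n%s\n' "$greedy" "$bounded"
```

The first printed line shows two empty fields, the second shows the link text.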


07-31-2016


Code:

awk -F"[<>]" '/<\/th>/ {print $5}' auto2

Last edited by rdrtx1; 08-01-2016 at 10:40 AM..

rdrtx1
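
To see why `$5` is the field to print with `-F"[<>]"`, it can help to dump every field with its index. This is only a sketch on the single sample line from the first post; lines with a different tag layout (for example a plain `<th ...>text</th>` header cell) split differently, so `$5` may be empty there.

```shell
line='<th><a href="/wiki/Buick_LeSabre" title="Buick LeSabre">Buick LeSabre</a></th>'

# Print every field with its index so the position of the link text is visible.
fields=$(printf '%s\n' "$line" |
  awk -F'[<>]' '{for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i}')
printf '%s\n' "$fields"

# Field 5 holds the text between the opening and closing <a> tags.
name=$(printf '%s\n' "$line" | awk -F'[<>]' '/<\/th>/ {print $5}')
printf '%s\n' "$name"
```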


08-01-2016


Provided those <th> tags are on a line by themselves (which your awk sample requires anyway), try:

Code:

sed -n '/^<th/s/<[^>]*>//gp' file
Buick LeSabre

EDIT: Should that NOT be the case, remove everything outside the <th> element first:

Code:

sed -n '/<th/{s/^.*<th>//;s/<\/th>.*$//;s/<[^>]*>//gp}' file

Last edited by RudiC; 08-01-2016 at 03:15 AM..


RudiC
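
A small sketch of how the two sed forms above differ. The `embedded` line is a hypothetical layout invented for illustration (a `<th>` cell sharing a line with other tags); it does not come from the original data.

```shell
# One <th> on its own line, and one embedded in a longer line (hypothetical).
alone='<th><a href="/wiki/Buick_LeSabre" title="Buick LeSabre">Buick LeSabre</a></th>'
embedded='<tr><th><a href="/wiki/Buick_LeSabre" title="Buick LeSabre">Buick LeSabre</a></th><td>x</td></tr>'

# First form: only lines that START with "<th" are selected, then all tags are stripped.
first=$(printf '%s\n%s\n' "$alone" "$embedded" | sed -n '/^<th/s/<[^>]*>//gp')

# Second form: cut everything up to <th> and from </th> on, then strip remaining tags.
# (The ";" before "}" is added here for portability across sed implementations.)
second=$(printf '%s\n%s\n' "$alone" "$embedded" |
  sed -n '/<th/{s/^.*<th>//;s/<\/th>.*$//;s/<[^>]*>//gp;}')

printf 'first:\n%s\nsecond:\n%s\n' "$first" "$second"
```

The first form prints the name once (the embedded line does not start with `<th`), while the second form handles both lines.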


08-01-2016


Thanks RudiC, those are both very close. I probably should have posted the table structure, because the sed commands also return fields from other table elements. I just need the link text inside the <th> a href cells in the "Automobile" column:

Code:

<table class="wikitable sortable" style="font-size:90%">
<tr>
<th style="width:5em">Image</th>
<th style="width:15em">Automobile</th>
<th style="width:10em">Production</th>
<th style="width:15em">Units Sold</th>
<th style="width:10em">Years sold</th>
<th style="width:25em">Notes</th>
</tr>
<tr>
<td>
<div class="center">
<div class="floatnone"><a href="/wiki/File:Late_model_Ford_Model_T.jpg" class="image" title="1927 Ford Model-T."><img alt="1927 Ford Model-T." src="//upload.wikimedia.org/wikipedia/commons/thumb/1/15/Late_model_Ford_Model_T.jpg/100px-Late_model_Ford_Model_T.jpg" width="100" height="91" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/1/15/Late_model_Ford_Model_T.jpg/150px-Late_model_Ford_Model_T.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/1/15/Late_model_Ford_Model_T.jpg/200px-Late_model_Ford_Model_T.jpg 2x" data-file-width="400" data-file-height="365" /></a></div>
</div>
</td>
<th><a href="/wiki/Ford_Model_T" title="Ford Model T">Ford Model T</a></th>
<td>1908-27</td>
<td><b>16,500,000</b><sup id="cite_ref-ford_7-0" class="reference"><a href="#cite_note-ford-7">[7]</a></sup></td>
<td>1908-27</td>
<td>The first car to achieve one million, five million, ten million and fifteen million units sold. By 1914, it was estimated that nine out of every ten cars in the world were <a href="/wiki/Ford_Motor_Company" title="Ford Motor Company">Fords</a>.<sup id="cite_ref-ford_7-1" class="reference"><a href="#cite_note-ford-7">[7]</a></sup></td>
</tr>

Thanks for your time.

Re: rdrtx1's awk command: thanks, but it prints a blank file for anything beyond $1 (which prints the full doc). I tried up to $6.

p1ne
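
The extra output described above can be reproduced with a few lines of the posted table: the header cells also start with `<th`, so an address of `/^<th/` selects them too. The narrower address `/^<th><a /` in this sketch is an assumption about what distinguishes the wanted cells (a `<th>` immediately followed by a link), based only on the sample shown.

```shell
# A few lines from the posted table: two header cells, one name cell, one data cell.
sample='<th style="width:5em">Image</th>
<th style="width:15em">Automobile</th>
<th><a href="/wiki/Ford_Model_T" title="Ford Model T">Ford Model T</a></th>
<td>1908-27</td>'

# Selecting on "^<th" also matches the header cells.
broad=$(printf '%s\n' "$sample" | sed -n '/^<th/s/<[^>]*>//gp')

# Requiring "<th>" to be followed directly by "<a " keeps only the linked name cells.
narrow=$(printf '%s\n' "$sample" | sed -n '/^<th><a /s/<[^>]*>//gp')

printf '%s\n---\n%s\n' "$broad" "$narrow"
```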


08-01-2016


Hello p1ne,

Could you please try the following and let me know if it helps?

Code:

awk '($1 ~ /<th><a/){sub(/.*\">/,X,$0);sub(/<.*/,X,$0);print $0}'   Input_file

Output will be as follows.

Code:

Ford Model T
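
For anyone following along, the two `sub()` calls can be run one at a time to see what each removes. This is a sketch on the single sample line only. Note that `X` in the command above is an unset awk variable, which evaluates to an empty string; the sketch writes `""` explicitly.

```shell
line='<th><a href="/wiki/Ford_Model_T" title="Ford Model T">Ford Model T</a></th>'

# Step 1: the greedy .*"> removes everything through the last "> on the line.
step1=$(printf '%s\n' "$line" | awk '{sub(/.*">/, ""); print}')

# Step 2: <.* removes everything from the first remaining "<" to the end of the line.
step2=$(printf '%s\n' "$step1" | awk '{sub(/<.*/, ""); print}')

printf '%s\n%s\n' "$step1" "$step2"
```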

EDIT: Adding one more solution for the same problem.

Code:

awk '{if($0 ~ /^<th><a href=\"/){match($0,/\">.*/);print substr($0,RSTART+2,RLENGTH-11)}}' Input_file

Thanks,
R. Singh

Last edited by RavinderSingh13; 08-01-2016 at 10:07 AM.. Reason: Adding one more solution now.


RavinderSingh13
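
A note on the arithmetic in the `match()` solution above: `RSTART` points at the `">` that closes the opening `<a>` tag and `RLENGTH` runs to the end of the line, so `RSTART+2` skips the two characters of `">` and `RLENGTH-11` drops those two plus the nine characters of `</a></th>`. That hard-coded 11 only holds when the line ends in exactly `</a></th>`. The `flexible` variant below, with its hypothetical `tail` variable, is an assumption added for illustration and is not from the thread.

```shell
line='<th><a href="/wiki/Ford_Model_T" title="Ford Model T">Ford Model T</a></th>'

# As posted: skip 2 characters for ">, drop 2 + 9 = 11 characters in total.
fixed=$(printf '%s\n' "$line" |
  awk '/^<th><a href="/ { match($0, /">.*/); print substr($0, RSTART + 2, RLENGTH - 11) }')

# Variant: derive the trimmed length from the expected closing tags.
flexible=$(printf '%s\n' "$line" |
  awk -v tail='</a></th>' '/^<th><a href="/ {
    match($0, /">.*/)
    print substr($0, RSTART + 2, RLENGTH - 2 - length(tail))
  }')

printf '%s\n%s\n' "$fixed" "$flexible"
```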


08-01-2016


Thanks so much, R. Singh, indeed, that does it!

RudiC, following your example, I'd also like to solve this with sed. I'm trying this and variations of it, all of which give a blank file:

Code:

sed -n '/^<th.^<a href.*/s/<[^>]*>//gp' auto2 > auto3

Thanks again.

p1ne


08-01-2016



Glad to help, p1ne. Could you please try the following code and let us know if it helps?

Code:

sed -n '/^<th><a href="/s/\(.*">\)\(.*\)\(<\/a.*\)/\2/p'   Input_file

Output will be as follows.

Code:

Ford Model T

Thanks,
R. Singh


RavinderSingh13
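
As a side note on why the earlier attempt `/^<th.^<a href.*/` printed nothing: in a basic regular expression, `^` is an anchor only at the start of the pattern (or of a group), so the second `^` is a literal caret, and the `.` before it stands for one arbitrary character. No line in the sample contains a caret, so the address never matches. The sketch below compares that attempt with the command suggested above on one sample line.

```shell
line='<th><a href="/wiki/Ford_Model_T" title="Ford Model T">Ford Model T</a></th>'

# Attempted address: "<th", one arbitrary character, a literal "^", then "<a href".
# The sample line contains no "^", so nothing is printed.
attempt=$(printf '%s\n' "$line" | sed -n '/^<th.^<a href.*/s/<[^>]*>//gp')

# Suggested command: group 2 is whatever sits between the last "> and "</a".
working=$(printf '%s\n' "$line" |
  sed -n '/^<th><a href="/s/\(.*">\)\(.*\)\(<\/a.*\)/\2/p')

printf 'attempt=[%s]\nworking=[%s]\n' "$attempt" "$working"
```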
