extract data from html tables

03-19-2008


Hi,

I need to use Unix tools to extract data from several rows of a table coded in HTML. I know that rows within a table have the tags <tr> </tr>, so I thought that my first step should be to delete all of the other HTML code which is not contained within these tags. I could then use the same method again to remove everything not in <td> </td> tags. But the big question is: how can I do this? I think I need sed, but at the moment it is just confusing me too much.

Any help?

Streetrcr


03-19-2008


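In principle you are right. The following script will keep only what lies between a "<tr>" and a "</tr>" tag on each line that contains them. It assumes that each pair opens and closes on the same line, that there are no multiple "<tr>-</tr>" pairs on a single line, and that the tags themselves are all lowercase (no "<TR>"). Lines that contain no such tag are passed through unchanged.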

The result might not be what you need, though, so you might consider giving us a sample of what you have and what you will need to get from it. This would help us to help you better.

Code:

sed 's/.*<tr>//;s/<\/tr>.*//' /path/to/your/file

I hope this helps.

bakunin
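
A small run of the one-liner above may make its behaviour easier to see. This is only a sketch: the two input lines are made up for illustration and are not taken from the poster's page.

```shell
# Made-up input: one line holding a complete row, one line with no row at all.
input='junk before <tr><td>a</td></tr> junk after
no tags on this line'

# Strip everything up to and including <tr>, then everything from </tr> onwards.
result=$(printf '%s\n' "$input" | sed 's/.*<tr>//;s/<\/tr>.*//')
printf '%s\n' "$result"
```

The first line is trimmed down to the row content, while the second line comes out untouched, which is why this version only suits pages where every row sits on one line and nothing else needs to be filtered out.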


03-19-2008


Hi.

See https://www.unix.com/shell-programmin...table-csv.html for another approach using lynx -dump.

In general, links to threads similar to yours are posted at the bottom of the thread ... cheers, drl


03-19-2008


Thanks bakunin, that is really helpful. I can't post a sample of the HTML page for various reasons. The only problem with your solution is that most of the <tr> tags span multiple lines in my HTML page, i.e. the tag may be opened on line 7 and then closed on line 20. Hence, is it possible with sed to delete everything on a line (including the line itself) BUT stop when it gets to a <tr> tag and start again when it gets to a </tr>? Alternatively, is there a way to make sed believe that the whole HTML page is on a single line?

As I am not familiar with the capabilities of sed, it is hard for me to know what the best way of completing this task is.

Streetrcr


03-19-2008



There's no reason you can't mock up an HTML page which looks like the one you're working with but which does not contain any sensitive information. Nobody is interested in throwing darts into a dark room.

If you post something, someone will post code. Otherwise, you're going to have to do it yourself. Try something like replacing all newlines in the file with spaces, splitting the file before each < and after each >, and going from there. If you may have a < or > within the data, then you're going to have to do a little extra work. That's the best I can do for you at the moment.

ShawnMilo
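
The approach described above (join the lines, then split around < and >) can be sketched roughly as follows. The sample page, the temporary directory and the variable names are all hypothetical, since the real page was never posted, and the sketch assumes lowercase tags without attributes and no literal < or > inside the data.

```shell
# Hypothetical sample page standing in for the real one.
workdir=$(mktemp -d)
cat > "$workdir/sample.html" <<'EOF'
<html><body>
<table>
<tr>
  <td>alpha</td><td>1</td>
</tr>
<tr><td>beta</td>
  <td>2</td></tr>
</table>
</body></html>
EOF

# 1. Replace every newline with a space so the page is one long line.
# 2. Start a new line before each "<" and after each ">".
# 3. Drop the empty or whitespace-only lines this leaves behind.
tokens=$(tr '\n' ' ' < "$workdir/sample.html" |
  awk '{ gsub(/</, "\n<"); gsub(/>/, ">\n"); print }' |
  sed '/^[[:space:]]*$/d')

# With one tag or one piece of text per line, print only the lines
# that fall between a <td> line and the following </td> line.
cells=$(printf '%s\n' "$tokens" |
  awk '/^<td>$/   { incell = 1; next }
       /^<\/td>$/ { incell = 0; next }
       incell')
printf '%s\n' "$cells"

rm -rf "$workdir"
```

Once every tag is on a line of its own, it no longer matters how the rows were wrapped in the original file, which is the point of the suggestion.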


03-20-2008


Trying to answer my own question here, but I'm still struggling.

If this doesn't work then I will mock up an example. I just thought that my description might have been good enough without having to waste time making an example table.

I found on the site "Sed - An Introduction and Tutorial" that you can create ranges by patterns. The example code is:

Code:

sed '/start/,/stop/ s/#.*//'

I tried making <tr> my start and </tr> my stop, but I just kept getting errors. Furthermore, I would have to NOT (!) this so that, instead of deleting everything in the tags, it deletes everything outside the tags.
Could someone please help me get this sed command working?

thanks

Last edited by Streetrcr; 03-20-2008 at 05:05 AM.. Reason: code tags

Streetrcr
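
For reference, a negated pattern range of the kind asked about in this post might look like the sketch below. The sample page is invented. One likely cause of the errors is the "/" inside "</tr>": it has to be escaped, or a different delimiter chosen, because "/" normally ends the pattern.

```shell
# Invented sample page with one table row spread over several lines.
page='<html><body>
<p>intro text</p>
<table>
<tr>
<td>alpha</td>
</tr>
</table>
</body></html>'

# Delete (d) every line that is NOT (!) inside the range <tr> ... </tr>.
# The slash in </tr> is escaped so that it does not end the pattern early.
rows=$(printf '%s\n' "$page" | sed '/<tr>/,/<\/tr>/!d')

# Same idea with "#" as the delimiter of the second pattern, so the
# slash needs no escaping.
rows_alt=$(printf '%s\n' "$page" | sed '/<tr>/,\#</tr>#!d')

printf '%s\n' "$rows"
```

This keeps the tag lines themselves; removing the tags is a separate substitution step.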


03-20-2008


Quote:

The only problem with your solution is that most of the <tr> tags span multiple lines in my HTML page, i.e. the tag may be opened on line 7 and then closed on line 20.

Well, I told you that, in the absence of any example, I had to make some assumptions. Here is a new version which works on tags ranging over several lines. It still will not handle the case of several "<tr>...</tr>" pairs on one line, though.

Code:

sed -n '/<tr>/,/<\/tr>/ {
           s/.*<tr>//
           s/<\/tr>.*//
           p
           }' /path/to/your/file

How this works: the "-n" option stops sed from printing every line it has read, so with an empty script it would print nothing at all. This (implicitly) throws out all the lines which are NOT in the specified range.

Everything between the curly braces is executed only when inside the range specified on line 1 of the script. As you can see, the last command inside the curly braces is a "p", which prints everything inside this range. If you deleted the two "s/..." commands, it would print something like this:

Code:

something....<tr> content of the tr-tag
some more content
even more content</tr> something else....

As you can see, the parts outside the tags ("something...." and "something else....") should be deleted as they are not part of what you want. The two "s/..." commands (s=substitute) take care of that, along with the tags themselves. Finally, the p(rint) command outputs the result of all the trimming.
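
The explanation above can be checked with a short run on the three sample lines. This is a sketch only; the variable names are arbitrary, and the closing "/" of the second range pattern is written out in full, since sed needs it to parse the range.

```shell
# The three sample lines from the explanation above.
sample='something....<tr> content of the tr-tag
some more content
even more content</tr> something else....'

# Print only lines inside the <tr> ... </tr> range, after trimming
# whatever precedes <tr> and whatever follows </tr>.
trimmed=$(printf '%s\n' "$sample" | sed -n '/<tr>/,/<\/tr>/ {
  s/.*<tr>//
  s/<\/tr>.*//
  p
}')
printf '%s\n' "$trimmed"
```

Only the row content remains, including the space that followed "<tr>" on the first line. One thing to keep in mind: sed looks for the end of a range only on lines after the one that opened it, so a row that opens and closes on the same line keeps the range open until the next "</tr>" further down.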

One more word, though: you got a second answer from me because I appreciated that you were doing genuine research on your own. You almost forfeited this answer because of this:

Quote:

[...] without having to waste time making an example table

You might notice that I have "wasted time" not only writing a script but also explaining how it works, in the hope of not only solving the problem at hand but also enhancing your understanding. On top of that, I "wasted some more time" writing a script in my first post which nobody is going to need because it was based on faulty assumptions. Those assumptions might not have been faulty at all had I been able to work from an example created by "wasting time".

I am even now "wasting some more time" to explain to you why you might sometimes get no answer at all, or an answer you can't use. Go figure.

I hope this helps.

bakunin

