UNIX for Dummies Questions & Answers
extract data from html tables
hi
I need to use unix to extract data from several rows of a table coded in html. I know that rows within a table have the tags <tr> </tr>, so I thought my first step should be to delete all of the other html code that is not contained within these tags. I could then use the same method again to remove everything not in <td> </td> tags. But the big question is: how can I do this? I think I need sed, but at the moment it is just confusing me too much. Any help?

In principle you are right. The following script extracts everything between a "<tr>" and a "</tr>" tag. It assumes that no line contains more than one "<tr>...</tr>" pair and that the tags themselves are all lowercase (no "<TR>").
The result might not be what you need, though, so consider giving us a sample of what you have and what you need to get from it. That would help us to help you better.
Code:
sed 's/.*<tr>//;s/<\/tr>.*//' /path/to/your/file
bakunin

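For anyone following along, here is a minimal sketch of what the one-liner above does to a small made-up page in which every row sits on one line. The sample markup and the function names `strip_outside_tr` and `rows_only` are assumptions for illustration, not part of the thread. The plain command also passes through lines that contain no <tr> at all; `rows_only` is one possible way to print just the rows.

```shell
# Hypothetical sample page: every <tr>...</tr> pair sits on a single line.
sample='<table>
<tr><td>alpha</td><td>1</td></tr>
<tr><td>beta</td><td>2</td></tr>
</table>'

# The command from the post, reading standard input instead of a file.
strip_outside_tr() {
    sed 's/.*<tr>//;s/<\/tr>.*//'
}

# Assumed variant: -n plus an address prints only lines holding a complete row.
rows_only() {
    sed -n '/<tr>.*<\/tr>/ {
        s/.*<tr>//
        s/<\/tr>.*//
        p
    }'
}

printf '%s\n' "$sample" | strip_outside_tr   # also echoes <table> and </table>
printf '%s\n' "$sample" | rows_only          # only the two row bodies
```

Neither function copes with a row that opens on one line and closes on another.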
Hi.
See "HTML table to CSV" for another approach using lynx -dump. In general, links to threads similar to yours are posted at the bottom of the thread ...
cheers, drl

thanks bakunin, that is really helpful. I can't post a sample of the html page for various reasons. The only problem with your solution is that most of the <tr> tags span multiple lines in my html page, i.e. a tag may be opened on line 7 and then closed on line 20. So is it possible with sed to delete every line BUT stop when it gets to a <tr> tag and start again when it gets to a </tr>? Alternatively, is there a way to make sed believe that the whole html page is on a single line?
As I am not familiar with the capabilities of sed, it is hard for me to know the best way of completing this task.

Quote:
If you post something, someone will post code. Otherwise, you're going to have to do it yourself. Try something like replacing all newlines in the file with spaces, splitting the file before each < or after each >, and going from there. If you may have a < or > within the data, then you're going to have to do a little extra work. That's the best I can do for you at the moment.
ShawnMilo

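The "flatten, then split at every tag" idea could be sketched roughly as follows. This is only one reading of the suggestion: the function names `split_tags` and `cell_text` are made up here, and the sketch assumes there is no literal < or > inside the cell data.

```shell
# Put every tag and every run of text on its own line.
split_tags() {
    # Join the page into one line (echo restores the final newline),
    # then insert a newline before each < and after each >.
    { tr '\n' ' '; echo; } |
    sed 's/</\
</g
s/>/>\
/g' |
    # Trim surrounding whitespace and drop the empty lines this creates.
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Print only the text found between an opening <td ...> and its </td>.
cell_text() {
    split_tags | sed -n '/^<td[ >]/,/^<\/td>/ {
        /^</!p
    }'
}

# Hypothetical page with a row spread over several lines.
printf '<table>\n<tr>\n<td>alpha</td>\n<td>1</td>\n</tr>\n</table>\n' | cell_text
```

Because the page is split at tags rather than at the original line breaks, it no longer matters on which line a <tr> or <td> opens or closes.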
Trying to answer my own question here, but I'm still struggling.
If this doesn't work then I will mock up an example; I just thought that my description might have been good enough without having to waste time making an example table. I found on this site, in "Sed - An Introduction and Tutorial", that you can create ranges by patterns. The example code is:
Code:
sed '/start/,/stop/ s/#.*//'
Could someone please help me get this sed command working? thanks
Last edited by Streetrcr; 03-20-2008 at 04:05 AM. Reason: code tags

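To see what the tutorial's range example does on its own, here is a small sketch with made-up input lines (the function name `range_demo` is likewise only illustrative). The substitution runs only on lines from the first one matching "start" through the next one matching "stop", both included.

```shell
# The tutorial command, reading standard input.
range_demo() {
    sed '/start/,/stop/ s/#.*//'
}

# Only the comment on the line between "start" and "stop" is removed.
printf '%s\n' 'one#kept' 'start' 'two#removed' 'stop' 'three#kept' | range_demo
```

The two addresses are ordinary regular expressions, so any pair of patterns can mark the beginning and end of a range.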
Quote:
Code:
sed -n '/<tr>/,/<\/tr>/ {
s/.*<tr>//
s/<\/tr>.*//
p
}' /path/to/your/file
Everything between the curly braces is executed only for lines inside the range specified on line 1. As you can see, the last command inside the curly braces is a "p", which prints every line inside this range. If you delete the two "s/...." commands, it would print something like this:
Code:
something....<tr> content of the tr-tag
some more content
even more content</tr> something else....
One more word, though: you got a second answer from me because I appreciated that you were doing genuine research on your own. You almost forfeited this answer because of this:
Quote:
I am even now "wasting some more time" to explain to you why you might sometimes get no answer at all or some answer you can't use. Go figure.
I hope this helps.
bakunin
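As a rough check of the range script above, the sketch below runs it on a made-up page whose rows span several lines, then applies a second pass for the <td> cells, which was the original goal. The sample markup and the names `rows` and `cells` are assumptions; `cells` further assumes one complete <td>...</td> pair per line.

```shell
# Hypothetical page: each row opens and closes on different lines.
sample='<html><body>
<table>
<tr>
<td>alpha</td>
<td>1</td>
</tr>
<tr>
<td>beta</td>
<td>2</td>
</tr>
</table>
</body></html>'

# The range script from the post, reading standard input.
rows() {
    sed -n '/<tr>/,/<\/tr>/ {
        s/.*<tr>//
        s/<\/tr>.*//
        p
    }'
}

# Assumed second pass: keep only the text inside <td>...</td> on each line.
cells() {
    sed -n 's/.*<td>\(.*\)<\/td>.*/\1/p'
}

printf '%s\n' "$sample" | rows | cells
```

The lines that held only <tr> or </tr> come out of `rows` as empty lines; `cells` discards them because its substitution does not match there.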