UNIX for Advanced & Expert Users
sed to extract HTML content
Hiya,
I am trying to extract a news article from a web page. The sed command I have written brings back a lot of JavaScript code and sometimes advertisements too. Can anyone please help with this one? I need to fix this sed so it picks up the article ONLY (don't worry about the title or date; I got those using a separate sed). The sed I am running is:

tr -d '\n' <03climate.html | sed -e 's/»//g' -e 's/.*nyt_text[^;]*;//' -e 's/<\/p>.*//g' -e 's/<[^>]*>//g' -e 's/[&][#]//g' -e 's/<[^>]*>//g' >> articletest

The file I am trying to extract from (03climate.html) and the result (articletest.txt) are both attached to this post.

Thanks.
SG
IMO, it's really a bad idea to parse HTML files using sed or awk.
Your best bet is to use Perl with HTML::TreeBuilder, a parser that builds an HTML syntax tree (available on search.cpan.org).
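
To illustrate the parser-based approach, here is a minimal sketch. It uses Python's standard-library html.parser instead of the Perl module recommended above, so treat it as an analogous technique, not the HTML::TreeBuilder API. The names ParagraphExtractor and extract_paragraphs are made up for this example, and it assumes the article text sits in <p> elements; the real page may well need extra filtering to pick out the article container.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of <p> elements, ignoring <script> and <style>.

    Hypothetical helper for illustration; it assumes the article body
    is marked up as ordinary <p> paragraphs.
    """

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self.paragraphs = []   # finished paragraphs, in document order
        self._buf = None       # pieces of the <p> being read, or None
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "p":
            self._flush()      # an unclosed <p> ends where the next begins
            self._buf = []

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        # Keep text only when inside a <p> and outside script/style.
        if self._buf is not None and not self._skip_depth:
            self._buf.append(data)

    def _flush(self):
        if self._buf is not None:
            # Collapse runs of whitespace, including newlines.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._buf = None

    def close(self):
        super().close()
        self._flush()


def extract_paragraphs(html):
    """Return the paragraph texts found in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

For the file in the question, something like extract_paragraphs(open("03climate.html", encoding="utf-8", errors="replace").read()) would return a list of paragraph strings. Because the parser tracks which element it is in, JavaScript and inline tags drop out without any regular expressions, though advertisements that happen to use <p> tags would still need to be filtered by their container.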