The UNIX and Linux Forums  
#1
11-28-2007
dunryc
Registered User
Join Date: Nov 2007
Posts: 4
html tags

Hi, I'm new to the forum, so hello everyone; I hope you are all well.

I am writing a bash script at the moment. It is a scraper/grabber that uses wget to download web pages related to the user's query. That part is no problem. Once I have the page, I need to strip all the data that is useless to me out of the HTML source, for example:

Quote:

<html>
test test test
<tag>test ttest </tag>
<new>
this is the data i want to grab between the new tags
</new>
</html>

As you can see from the above, the data I need to grab is between the new tags. These tags are always present in the source, whatever the user's query. Can anyone help or point me in the right direction? Any help would be greatly appreciated. Thanks for listening, dunryc
#2
11-28-2007
porter
Forum Advisor, Registered User
Join Date: Jan 2007
Posts: 2,965
Have you considered XMLStarlet, the command line XML toolkit?
#3
11-28-2007
bakunin
Forum Staff, Bughunter Extraordinaire
Join Date: May 2005
Location: In the leftmost byte of /dev/kmem
Posts: 1,628
Quote:
Originally Posted by dunryc
the data I need to grab is between the new tags. These tags are always present in the source, whatever the user's query.
There are two cases to consider: the starting and ending tags are either on the same line or on different lines:

Code:
Example

<new>This is the text to catch</new>

<new>
This is some text
to catch</new>
Both can be matched by simple regular expressions. For each regexp I give a sample input below the command; the matched portion is the text between the tags (highlighted in blue in the original post):

Code:
sed -n 's/.*<new>\(.*\)<\/new>.*/\1/p'

blabla <new>text to match</new> blabla

sed -n '/<new>/,/<\/new>/ {
               s/.*<new>//
               s/<\/new>.*//
               /^$/d
               p
               }'

blabla <new>text
to
match</new> blabla
bakunin
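A note on combining the two commands above, offered as a sketch and not as part of bakunin's post. In sed, when the end of an address range is a regexp, it is only tested from the line after the one that opened the range. So if the range form is fed a line that holds both tags, the range stays open until a later </new> or the end of the input. The function below tries the one-line case first and only falls through to the range for the multi-line case. The name extract_new is made up for this example, and it assumes at most one <new> element per line (the greedy .* would otherwise keep only the last one).

```shell
#!/bin/sh
# extract_new (hypothetical name): read HTML on stdin and print the text
# found between <new> and </new>.
extract_new() {
    sed -n '
        # Case 1: both tags on one line: print the captured text, then
        # skip to the next input line so the range below never opens.
        /<new>.*<\/new>/ {
            s/.*<new>\(.*\)<\/new>.*/\1/p
            d
        }
        # Case 2: tags on different lines: strip the tags and whatever
        # surrounds them, drop lines left empty, print the rest.
        /<new>/,/<\/new>/ {
            s/.*<new>//
            s/<\/new>.*//
            /^$/d
            p
        }
    '
}
```

Usage would be something like `extract_new < page.html`, where page.html is a placeholder for the downloaded file.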
#4
11-29-2007
dunryc
Registered User
Join Date: Nov 2007
Posts: 4
Thanks for the pointers, guys. I did have a look at XMLStarlet to grab the data and it works great, but I wanted to use tools that would be present in most distros. The commands that bakunin gave work great. Once again, thanks for the help.
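For anyone finding this thread later, the download and the extraction can be chained so that no temporary file is needed. This is only a sketch of how that might look, not the script dunryc ended up with: the function names fetch_page and grab_new and the URL in the usage comment are made up, and it uses the range form from post #3, so it assumes the tags sit on their own lines as in the sample from post #1.

```shell
#!/bin/sh
# fetch_page URL (hypothetical helper): write the page body to stdout.
# wget flags: -q suppresses progress output, -O - sends the document to stdout.
fetch_page() {
    wget -q -O - "$1"
}

# grab_new URL (hypothetical helper): print the lines between <new> and
# </new> of the fetched page, using the range command from post #3.
grab_new() {
    fetch_page "$1" | sed -n '/<new>/,/<\/new>/ {
        s/.*<new>//
        s/<\/new>.*//
        /^$/d
        p
    }'
}

# Usage (placeholder URL, not taken from the thread):
#   grab_new "http://example.com/search?q=test"
```

Keeping the download in its own function means the wget call can be swapped for a local file or another fetcher without touching the sed part.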