The UNIX and Linux Forums  
#1
12-01-2008
i007, Registered User
Join Date: Dec 2008
Posts: 2
Problem when extracting the title of HTML doc

Dear all.

I need to extract the title (text between <title> and </title>) of a set of HTML documents.
I've found a command that extracts the text, but it does not always work.

It works with the following example:
Code:
cat a.txt 
htmltext<title>This is a HTML title</title>blablalbla
Code:
grep title a.txt | sed -n 's/.*<title>\(.*\)<\/title>.*/\1/ip;T;q'
This is a HTML title
However, it does not work with a real example:

Code:
cat b.txt 
<head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></meta> <title>This my new page
</title> <link href...></link>
Code:
grep title b.txt | sed -n 's/.*<title>\(.*\)<\/title>.*/\1/ip;T;q'
The last command does not return anything.

I would appreciate any comments or suggestions.
#2
12-01-2008
bakunin, Forum Staff, Bughunter Extraordinaire
Join Date: May 2005
Location: In the leftmost byte of /dev/kmem
Posts: 1,628
First off: drop the grep. One regex program per command line is enough, and grep can do nothing here that sed couldn't do as well, or better.

The reason your command fails is that sed works line by line: once a new line is read, sed forgets (usually; we can overcome that) what it did on the previous line.

The following title would be extracted with your regex:

Code:
<title>blah</title>
but the following would fail:

Code:
<title>blah
</title>
The reason is that sed would read the first line, notice that the search pattern (which requires both the opening AND the closing tag to be present) is not found, and move on to the next line. On the next line the same is true, so nothing is printed.

Fortunately there is a device to make sed less forgetful: the line range.
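
To see the line-by-line behaviour for yourself, here is a small sketch. The file names and the temporary directory are made up for the test, and the pattern is a trimmed-down version of the one from the first post (without the GNU-only "i" flag and "T" command):

```shell
# Hypothetical scratch directory and sample files mirroring the two cases above.
tmp=$(mktemp -d)
printf '<title>blah</title>\n'   > "$tmp/one_line.html"
printf '<title>blah\n</title>\n' > "$tmp/two_lines.html"

# The same single-line pattern applied to both files.
one=$(sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p' "$tmp/one_line.html")
two=$(sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p' "$tmp/two_lines.html")

echo "one line:  [$one]"    # the title is found
echo "two lines: [$two]"    # nothing matches, so the brackets are empty

rm -rf "$tmp"
```

In the second file neither line contains both tags, so the substitution never fires.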

When we write "s/x/y/" we imply that this rule is used on every line. Still, this is only the abbreviated form of a command, which would include a starting and an end line: "1,5 s/x/y/" would apply the rule only to lines 1-5. Try these with a test file to see the effect.
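
A quick way to try this, assuming a throwaway six-line test file (the file name and its content are invented for the demonstration):

```shell
# Hypothetical test file: six lines, each containing a single "x".
tmp=$(mktemp -d)
printf 'x\nx\nx\nx\nx\nx\n' > "$tmp/test.txt"

# Without an address the rule is applied to every line.
everywhere=$(sed 's/x/y/' "$tmp/test.txt")

# With the range 1,5 the rule is applied to lines 1-5 only; line 6 keeps its "x".
first_five=$(sed '1,5 s/x/y/' "$tmp/test.txt")

printf 'no address:\n%s\n\n' "$everywhere"
printf 'range 1,5:\n%s\n' "$first_five"

rm -rf "$tmp"
```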

OK, using line numbers is a bit static, because usually we will not know on which line a certain rule has to be applied - at least not beforehand. But it is also possible to use additional regexes to define the first and the last line of the block where the rule will be applied:

Code:
<regex1>,<regex2> <command>
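Applying this to your problem, we could use "<title>" as the beginning and "</title>" as the end of the block in question. One caveat: sed starts looking for the closing regex on the line after the one that matched the opening regex, so if both tags sit on the same line the block runs on to the next "</title>" or to the end of the file. For a title split across lines, as in your b.txt, we can apply the rule to the whole block instead of only one line: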

Code:
sed -n '/<title>/,/<\/title>/ p'
This will print only the lines from the opening to the closing tag. Now we have to "trim" this to get a nice output.
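
As a sketch of what that looks like on the b.txt from the first post: the file is recreated in a temporary directory here, and the trailing "<body>" line is an invented addition to show that lines outside the block are not printed.

```shell
# Recreate b.txt from the first post, plus one hypothetical extra line.
tmp=$(mktemp -d)
cat > "$tmp/b.txt" <<'EOF'
<head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></meta> <title>This my new page
</title> <link href...></link>
<body>ignored</body>
EOF

# Print only the lines from the opening tag to the closing tag.
block=$(sed -n '/<title>/,/<\/title>/ p' "$tmp/b.txt")

printf '%s\n' "$block"

rm -rf "$tmp"
```

The two lines carrying the tags come out untouched, surrounding markup included, which is why the trimming step is still needed.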

There are three possible types of lines:

1. lines with a "<title>" in them. We want to delete everything up to "<title>" and display the rest

2. lines with a "</title>" in them. We want to keep everything up to "</title>" and dispose of the rest.

3. Lines in between. We want to keep them entirely.

OK, let's do it. One more thing: sed lets you group commands, as in any programming language. The curly braces "{}" group several commands into a single one, so that one address or range applies to all of them:

Code:
sed -n '/<title>/,/<\/title>/ {
            s/^.*<title>//
            s/<\/title>.*$//
            p
            }'
You might notice that there is no action for the type 3 lines, but in fact there is: it's the "p", which prints all the resulting lines (or the parts that survived our trimming). The "-n" makes sure nothing is output except what is explicitly requested.
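
Here is a sketch of the grouped command at work, together with one caveat worth testing. When the closing address of a range is a regex, sed only starts looking for it on the line after the one that opened the range, so a title that opens and closes on the same line keeps the range open. The helper name extract_title, the file c.txt, and the extra "<body>" lines are all invented for this demonstration; the first brace group is an assumed workaround, not part of the command above.

```shell
tmp=$(mktemp -d)

# b.txt as in the first post, plus one hypothetical extra line.
cat > "$tmp/b.txt" <<'EOF'
<head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></meta> <title>This my new page
</title> <link href...></link>
<body>not part of the title</body>
EOF

# Hypothetical file with the whole title on a single line.
cat > "$tmp/c.txt" <<'EOF'
<head><title>One line title</title></head>
<body>not part of the title</body>
EOF

# The grouped command exactly as given above.
plain() {
    sed -n '/<title>/,/<\/title>/ {
            s/^.*<title>//
            s/<\/title>.*$//
            p
            }' "$1"
}

# Variant: the first group handles a one-line title and then drops the line (d),
# so the range in the second group never starts for it.
extract_title() {
    sed -n '
    /<title>.*<\/title>/ {
        s/^.*<title>//
        s/<\/title>.*$//
        p
        d
    }
    /<title>/,/<\/title>/ {
        s/^.*<title>//
        s/<\/title>.*$//
        p
    }' "$1"
}

multi=$(plain "$tmp/b.txt")            # title split over two lines
runaway=$(plain "$tmp/c.txt")          # one-line title: the range runs on
single=$(extract_title "$tmp/c.txt")   # one-line title handled separately

printf 'plain, b.txt:         [%s]\n' "$multi"
printf 'plain, c.txt:         [%s]\n' "$runaway"
printf 'extract_title, c.txt: [%s]\n' "$single"

rm -rf "$tmp"
```

On c.txt the plain command also prints the "<body>" line, because the range never sees a closing tag after its first line.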

I leave the task of concatenating the resulting lines to you as an exercise. If you still have trouble, feel free to ask again.
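
For readers who want to compare notes on that exercise, one possible approach (not the only one) is to swap the range for a small loop: keep appending lines with "N" until the closing tag shows up, then turn the embedded newlines into spaces and trim. The helper name title_joined and the sample files are invented for this sketch.

```shell
tmp=$(mktemp -d)

# b.txt as in the first post.
cat > "$tmp/b.txt" <<'EOF'
<head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></meta> <title>This my new page
</title> <link href...></link>
EOF

# Hypothetical file with a title spread over three lines.
printf '<title>A\nlong\ntitle</title>\n' > "$tmp/d.txt"

# Collect lines from <title> until </title> is present, join them with
# spaces, cut away everything outside the tags, and stop after the first title.
title_joined() {
    sed -n '
    /<title>/ {
        :a
        /<\/title>/!{
            N
            ba
        }
        s/\n/ /g
        s/^.*<title>//
        s/<\/title>.*$//
        s/^ *//
        s/ *$//
        p
        q
    }' "$1"
}

two=$(title_joined "$tmp/b.txt")
three=$(title_joined "$tmp/d.txt")

printf '[%s]\n[%s]\n' "$two" "$three"

rm -rf "$tmp"
```

Because the loop checks for the closing tag before reading another line, this form also copes with a title that sits on a single line.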

I hope this helps.

bakunin
#3
12-01-2008
i007, Registered User
Join Date: Dec 2008
Posts: 2
It definitely works.
Thank you very much bakunin for your excellent explanation, and for the fast reply.
I really appreciate your help