Problem when extracting the title of HTML doc
Dear all.
I need to extract the title (the text between <title> and </title>) from a set of HTML documents. I've found a command that does the job, but it does not always work. It works with this example:
Code:
cat a.txt
htmltext<title>This is a HTML title</title>blablalbla
Code:
grep title a.txt | sed -n 's/.*<title>\(.*\)<\/title>.*/\1/ip;T;q'
This is a HTML title
But it fails with this one:
Code:
cat b.txt
<head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></meta>
<title>This my new page
</title>
<link href...></link>
Code:
grep title b.txt | sed -n 's/.*<title>\(.*\)<\/title>.*/\1/ip;T;q'
I appreciate any comment or suggestion.
First off - dispose of the grep. One regex program is enough on any command line, and grep can do nothing that sed couldn't do too - and better.
The reason your command fails is that sed works linewise - once a new line is read, sed forgets (usually - we can overcome that) what it has done on the last line. The following title would be extracted with your regex:
Code:
<title>blah</title>
But this one would not, because no single line contains both "<title>" and "</title>":
Code:
<title>blah
</title>
Fortunately there is a device to make sed less forgetful: the line range. When we write "s/x/y/" we imply that this rule is used on every line. Still, this is only the abbreviated form of a command, which would include a starting and an end line: "1,5 s/x/y/" would apply the rule only to lines 1-5. Try these with a test file to see the effect.
OK, using line numbers is a bit static, because usually we will not know on which line a certain rule has to be applied - at least not beforehand. But it is also possible to use regexes to define the first and the last line of the block where the rule will be applied:
Code:
<regex1>,<regex2> <command>
For example, the following prints only the lines from the one containing "<title>" through the one containing "</title>":
Code:
sed -n '/<title>/,/<\/title>/ p'
There are three possible types of lines:
1. Lines with a "<title>" in them. We want to delete everything up to "<title>" and display the rest.
2. Lines with a "</title>" in them. We want to keep everything up to "</title>" and dispose of the rest.
3. Lines in between. We want to keep them entirely.
OK, let's do it - one more thing: it is possible to group commands in sed like in any programming language. The curly braces "{}" are used to group several commands into a single one:
Code:
sed -n '/<title>/,/<\/title>/ {
s/^.*<title>//
s/<\/title>.*$//
p
}'
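As a quick check, the grouped commands can be run against a small test file. This is only a sketch: the file sample.html, its contents, and the function name extract_title are made up for illustration, and the title is assumed to be split over two lines.

```shell
# Hypothetical test file; the title is deliberately split over two lines.
sample="${TMPDIR:-/tmp}/sample.html"
printf '%s\n' \
  '<head>' \
  '<title>This my new page' \
  '</title>' \
  '</head>' > "$sample"

# extract_title is an illustrative wrapper around the grouped commands:
# inside the <title>...</title> range, strip everything up to <title>,
# strip everything from </title> onward, and print what is left.
extract_title() {
  sed -n '/<title>/,/<\/title>/ {
    s/^.*<title>//
    s/<\/title>.*$//
    p
  }' "$1"
}

extract_title "$sample"
```

This prints "This my new page" followed by an empty line, because the line holding only "</title>" is reduced to nothing but still printed.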
I leave the task of concatenating the resulting lines to you as an exercise. If you still have trouble, feel free to ask again. I hope this helps.
bakunin
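One possible way to do the concatenation, as a sketch rather than the intended solution: collect the lines of the range in sed's hold space (the H and x commands, which the explanation above does not cover), then clean up and print once the closing tag is seen. The function name title_one_line and the test file are hypothetical.

```shell
# title_one_line is an illustrative name; it prints the first title on one line.
title_one_line() {
  sed -n '/<title>/,/<\/title>/ {
    # Append every line of the range to the hold space.
    H
    /<\/title>/ {
      # Closing tag seen: fetch the collected lines and join them with spaces.
      x
      s/\n/ /g
      # Strip everything outside the tags, then surrounding blanks.
      s/^.*<title>//
      s/<\/title>.*$//
      s/^ *//
      s/ *$//
      p
      # Stop after the first title.
      q
    }
  }' "$1"
}

# Hypothetical test file with the title split over two lines.
multi="${TMPDIR:-/tmp}/multi.html"
printf '%s\n' '<head>' '<title>This my new page' '</title>' '</head>' > "$multi"
title_one_line "$multi"
```

The q also guards against a corner case: when "<title>" and "</title>" sit on the same line, sed only starts looking for the end of the range on the following line, so without it the range would keep running through the rest of the file.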
It definitely works.
Thank you very much, bakunin, for your excellent explanation and for the fast reply. I really appreciate your help.