Shell Programming and Scripting: Post questions about KSH, CSH, SH, BASH, PERL, PHP, SED, AWK and OTHER shell scripts here.
Hi,
I am downloading an HTML file with wget and now need to extract certain information from it. The text I want starts after "Attributes:" and ends before "Stats:". It may span multiple lines. Within that text there are additional strings that I need. They have the pattern "String1:String2"; these are the attributes and their values, separated by a ":".
Based on my sample, the result should be:
Category: Movies
Video Source: DVD
Video Format: WMV
Video Genre: Horror, Mystery, Thriller
Language: English
Audio Format: WMA
I tried sed with sed -n "/Attributes:/,/Stats:/p" but that prints the whole lines containing the two strings, not only the text between them. I have not found an option to change that behavior.
Thanks in advance for any help
Rag
Wow, I have no idea how exactly that awk command works, but fortunately I don't have to.
Now I wanted to use that on my XP download machine, but unfortunately the wget output is different there.
The result has newlines and tabs. For my purposes all extra whitespace could be cleaned up: space, tab, CR and LF. I added: |sed -e "s/\t*//g"|tr -d "\n" but then I don't have any line breaks (CR/LF) left at all. Please help again.
Thanks
If I understood your requirement correctly, the following code will remove extra spaces, newlines and tabs.
Code:
awk '/Stats/{f=0}{$1=$1} !/^$/ && f; /Attributes/{f=1}' RS="<[^>]+>|\n+" html > out_html.txt
Using your other sample, this is the output from running the code on Cygwin (on Windows):
Code:
Category: Movies
Video Source: DVD
Video Format: XviD
Video Genre: Animation, Family, Fantasy
Language: English
Audio Format: MP3
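For readers wondering how the one-liner above works, here is the same program spread over several lines with comments. The file sample.html and its contents are made up for illustration (the real page layout is an assumption), and a regex RS needs an awk that supports it, such as gawk or mawk.

```shell
# Hypothetical sample input; the real page will look different.
cat > sample.html <<'EOF'
<html><body>
<b>Attributes:</b><br>
<td>   Category:   Movies</td>
<td>Video Format: WMV</td>
<b>Stats:</b> 42 downloads
</body></html>
EOF

# RS treats every HTML tag and every run of newlines as a record separator,
# so each record is one piece of plain text between tags.
awk '
/Stats/      { f = 0 }     # end marker seen: stop printing
             { $1 = $1 }   # rebuild the record, squeezing blanks and tabs
!/^$/ && f                 # print non-empty records while the flag is set
/Attributes/ { f = 1 }     # start marker seen: print from the next record on
' RS='<[^>]+>|\n+' sample.html > out_html.txt

cat out_html.txt
```

The order of the rules matters: the flag is cleared before the print rule and set after it, so neither the "Attributes:" record nor the "Stats:" record is printed.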
That's correct, I've noticed that. To get it to work with an older awk (say, on HP-UX or Solaris), RS has to be changed to FS, and possibly some other code modifications are needed too. That's because older awk has a problem parsing RS correctly: the whole regex gets reduced to a mere "<". Luckily FS is not affected and is parsed as expected.
That's why the first question should be which OS the code will run on.
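A rough sketch of what the RS to FS change could look like: the tag regex moves into FS, the default newline RS stays, and each line is walked field by field. This is only one possible rewrite, not the code from the earlier post. The file sample2.html is made up, and whether a particular old awk accepts this exactly as written is an assumption that has to be checked on the target system.

```shell
# Hypothetical sample input with two attributes on one line.
cat > sample2.html <<'EOF'
<html><body>
<b>Attributes:</b><br>
<td>   Language:   English</td><td>Audio Format: MP3</td>
<b>Stats:</b> 42 downloads
</body></html>
EOF

# The tags are now field separators, so every field is a piece of
# plain text between two tags on the same line.
awk -F'<[^>]+>' '
{
    for (i = 1; i <= NF; i++) {
        # split on whitespace and rejoin with single spaces
        n = split($i, w, " ")
        s = w[1]
        for (j = 2; j <= n; j++) s = s " " w[j]

        if (s ~ /Stats/) f = 0          # end marker: stop printing
        if (s != "" && f) print s       # print non-empty text while flag is set
        if (s ~ /Attributes/) f = 1     # start marker: print what follows
    }
}' sample2.html > out_portable.txt

cat out_portable.txt
```

The whitespace cleanup uses split and a loop instead of the $1=$1 trick because the fields here are pieces of a line, not whole records.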