| Shell Programming and Scripting Post questions about KSH, CSH, SH, BASH, PERL, PHP, SED, AWK and OTHER shell scripts here. |
#1
Hi,

I am downloading an HTML file with wget and now need to extract certain information from it. The string starts after "Attributes:" and ends before "Stats:". It may span multiple lines. Within that string there are additional strings that I need. They have the pattern "String1:String2"; these are the attributes and their values, separated by a ":".

So, based on my example, the result should be:

Category: Movies
Video Source: DVD
Video Format: WMV
Video Genre: Horror, Mystery, Thriller
Language: English
Audio Format: WMA

I tried sed with sed -n "/Attributes:/,/Stats:/p", but that prints the whole lines and not only the text between the two strings. I have not found an option to change that behavior.

Thanks in advance for any help.
Rag
#2
What is your OS?
If you're on Linux, try:
Code:
awk '/Stats/{f=0} f && NF; /Attributes/{f=1}' RS="<[^>]+>" html_file
Code:
Category: Movies
Video Source: DVD
Video Format: WMV
Video Genre: Horror, Mystery, Thriller
Language: English
Audio Format: WMA
Code:
Category: Movies
Region System: PAL
Video Source: DVD
Video Format: DVD
Video Genre: Drama, War
Language: English
Subtitled Language: Dutch
Audio Format: AC3/DD
If there are open (unclosed) tags somewhere in the records, replace RS with:
Code:
RS="<[^>]+>|<[^>]+|[^>]+>"
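For anyone wondering how that one-liner works, here is the same program spread out with comments. This is only a sketch: the sample page and the names sample_html and extract_attributes are made up for illustration, and it assumes an awk that accepts a regular expression as RS (gawk and mawk do).

```shell
# Hypothetical sample page; the real HTML from the thread is not reproduced here.
sample_html() {
cat <<'EOF'
<html><body>
<b>Title:</b> Some Movie<br>
<b>Attributes:</b><br>
<span>Category: Movies</span><br>
<span>Video Source: DVD</span><br>
<span>Video Genre: Horror, Mystery, Thriller</span><br>
<b>Stats:</b> 12 downloads
</body></html>
EOF
}

# Same logic as the one-liner, reading the page from standard input.
extract_attributes() {
    awk '
        BEGIN        { RS = "<[^>]+>" }  # each HTML tag ends a record
        /Stats/      { f = 0 }           # end marker: stop printing
        f && NF      { print }           # inside the block, skip blank records
        /Attributes/ { f = 1 }           # start marker: print from the next record on
    '
}

sample_html | extract_attributes
```

Because every tag acts as a record separator, awk only ever sees the text between tags. The order of the rules matters: the flag f is cleared before the print rule and set after it, so neither the "Attributes" record nor the "Stats" record is printed.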
#3
|
|||
|
|||
|
Wow, I have no idea how exactly that awk command works, but fortunately I don't have to.
#4
|
|||
|
|||
|
Now I wanted to use that on my XP download machine, but unfortunately the wget output is different there.
The result has newlines and tabs. For me, all whitespace could be cleaned: space, tab, CR and LF. I added:

|sed -e "s/\t*//g"|tr -d "\n"

but then I don't have any CR/LF left. Please help again.
Thanks
#5
|
||||
|
||||
|
If I understood your requirement correctly, the following code will remove extra spaces, newlines and tabs.
Code:
awk '/Stats/{f=0}{$1=$1} !/^$/ && f; /Attributes/{f=1}' RS="<[^>]+>|\n+" html > out_html.txt
From your other sample, this is the output from running the code on Cygwin (on Windows):
Code:
Category: Movies
Video Source: DVD
Video Format: XviD
Video Genre: Animation, Family, Fantasy
Language: English
Audio Format: MP3
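A note on why the attempt in post #4 lost every line break: tr -d "\n" deletes all newline characters, not only the surplus ones. The {$1=$1} in the code above takes a different route: assigning to any field makes awk rebuild the record from its fields joined by single spaces, which drops leading and trailing whitespace and collapses runs of blanks and tabs. A small sketch of just that step follows; the function name squeeze_whitespace and the input are made up, and the tr -d '\r' for Windows line endings is my addition, not part of the code above.

```shell
# Hypothetical helper: normalise whitespace line by line, drop blank lines.
squeeze_whitespace() {
    tr -d '\r' |                      # strip carriage returns (CR) first
    awk 'NF { $1 = $1; print }'       # non-blank lines only; $1=$1 rebuilds $0
}

# Made-up input with leading blanks, tabs, a CRLF ending and an empty line.
printf '  Category:\t Movies  \r\n\n\tLanguage:   English\n' | squeeze_whitespace
```

NF is tested before the assignment on purpose: assigning $1 on an empty record would create a field and make NF equal to 1, so the empty line would slip through a later NF test.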
#6
|
|||
|
|||
|
Not to detract from this solution, but it appears to only work with gawk.
#7
|
||||
|
||||
|
That's correct, I've noticed that too. To get it to work on an older awk (say, on HP-UX or Solaris), RS has to be changed to FS, and possibly some other code modifications are needed as well. That's because older awk does not parse RS correctly: the whole regex gets reduced to a mere "<". Luckily FS is not affected and is parsed the way it's expected.
That's why the first question is which OS the code will run on.
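As a rough idea of what such an FS-based variant could look like, here is a sketch that splits each line on tags and walks over the fields. It assumes a POSIX awk (regex FS and gsub), has not been tried on HP-UX or Solaris, and the names sample_html and extract_attributes_fs as well as the sample page are made up for illustration.

```shell
# Hypothetical sample page; the real HTML from the thread is not reproduced here.
sample_html() {
cat <<'EOF'
<html><body>
<b>Title:</b> Some Movie<br>
<b>Attributes:</b><br>
<span>Category: Movies</span><br>
<span>Video Source: DVD</span><br>
<span>Video Genre: Horror, Mystery, Thriller</span><br>
<b>Stats:</b> 12 downloads
</body></html>
EOF
}

# Tags are field separators here, so RS stays at its default (newline).
extract_attributes_fs() {
    awk -F'<[^>]+>' '
        {
            for (i = 1; i <= NF; i++) {
                s = $i
                gsub(/^[ \t\r]+|[ \t\r]+$/, "", s)  # trim blanks, tabs and CR
                if (s ~ /Stats:/)      f = 0        # end marker: stop printing
                if (f && s != "")      print s      # text chunk inside the block
                if (s ~ /Attributes:/) f = 1        # start marker: print what follows
            }
        }
    '
}

sample_html | extract_attributes_fs
```

Unlike the RS version, this works line by line, so a tag that is split across two lines would not be recognised and removed.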