Hi,
I am downloading an HTML file with wget and now need to extract certain information from it. The text I need starts after "Attributes:" and ends before "Stats:", and it may span multiple lines. Within that text there are various additional strings that I need. They follow the pattern "String1:String2": these are the attributes and their values, separated by a ":".
Based on my example, the result should be:
Category: Movies
Video Source: DVD
Video Format: WMV
Video Genre: Horror, Mystery, Thriller
Language: English
Audio Format: WMA
I tried sed with
sed -n "/Attributes:/,/Stats:/p"
but that prints the whole lines and not only the text between the two strings. I have not found an option to change that behavior.
Thanks in advance for any help,
Rag
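Staying with the sed approach from the question, one possible way to keep only the text between the two markers is to trim the marker lines inside the range. This is a minimal sketch on made-up sample text (the `sample` and `between` variables and the sample contents are assumptions, not the real page). It does not strip HTML tags, and it assumes "Attributes:" and "Stats:" sit on different lines.

```shell
# Made-up sample text standing in for the downloaded page.
sample='header Attributes: Category: Movies
Video Source: DVD
Language: English Stats: 42 views
footer'

between=$(printf '%s\n' "$sample" | sed -n '/Attributes:/,/Stats:/{
s/.*Attributes://
s/Stats:.*//
s/^ *//
s/ *$//
p
}')

printf '%s\n' "$between"
```

The first two substitutions cut away everything up to the opening marker and everything from the closing marker onward; the last two only trim leftover blanks at the line edges.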
Wow, I have no idea how exactly that awk command works, but fortunately I don't have to.
Big thanks, rubin! I'm on Windows with the unixutils. I'll additionally strip the spaces:
gawk "/Stats/{f=0} f && NF; /Attributes/{f=1}" RS="<[^>]+>" index-3123947.html|sed -e "s/ //g">x.txt
then set them as environment variables in batch:
for /F "delims=: tokens=1,2" %f in (x.txt) do set nzb%f=%g
Now I wanted to use that on my XP download machine, but unfortunately the wget output is different there.
The result contains newlines and tabs. As far as I'm concerned, all of that whitespace can be removed: space, tab, CR and LF. I added:
|sed -e "s/\t*//g"|tr -d "\n"
but then I don't have any CR/LF left, so everything ends up on one line. Please help again.
Thanks
If I understood your requirement correctly, the following code will remove extra spaces, newlines and tabs.
Code:
awk '/Stats/{f=0}{$1=$1} !/^$/ && f; /Attributes/{f=1}' RS="<[^>]+>|\n+" html > out_html.txt
Using your other sample, this is the output from running the code on Cygwin (on Windows):
Code:
Category: Movies
Video Source: DVD
Video Format: XviD
Video Genre: Animation, Family, Fantasy
Language: English
Audio Format: MP3
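For readers wondering how the one-liner works, here is the same program spread over several lines with comments. The sample input in `html` is made up for illustration and is not the real page. A regular expression as record separator `RS` is an extension supported by gawk and mawk, among others; it is not guaranteed by POSIX awk.

```shell
# Made-up sample input; the real page layout is not shown in the thread.
html='<p>Intro</p><b>Attributes:</b><br>
Category:   Movies<br>
   Video Source: DVD<br>
<b>Stats:</b> 42 views'

# RS makes every HTML tag and every run of newlines a record separator,
# so each piece of text between tags arrives as its own record.
result=$(printf '%s\n' "$html" | awk '
    /Stats/      { f = 0 }   # closing marker seen: stop printing
    { $1 = $1 }              # rebuild the record: fields are rejoined with single spaces
    !/^$/ && f               # print non-empty records while the flag is set
    /Attributes/ { f = 1 }   # opening marker seen: print from the next record on
' RS='<[^>]+>|\n+')

printf '%s\n' "$result"
```

Because the `/Attributes/` rule comes last, the opening marker itself is never printed, and because the `/Stats/` rule comes first, the flag is already cleared when the print rule reaches the closing marker.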
Works great! Thanks a lot!
But now I need something else, which should be easy. I need to split a string at the last \. I managed to get the last part:
echo D:\DATA\XY\what ever (nr2)|sed -e "s/.*\\//g"
= what ever (nr2)
But I don't know how to extract everything from the start up to the last \, so that I get D:\DATA\XY
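One possible way to get the leading part is to delete the last backslash together with everything after it. This is a minimal sketch for a POSIX shell, where single quotes pass the backslashes to sed unchanged; under cmd.exe the double-quoted style used in the post would presumably be needed instead (an assumption, not tested here). The variable names `path`, `dir` and `base` are illustrative.

```shell
path='D:\DATA\XY\what ever (nr2)'

# Delete the last backslash and everything after it:
# a backslash followed only by non-backslash characters up to the end of the line.
dir=$(printf '%s\n' "$path" | sed 's/\\[^\\]*$//')

# Same idea as in the post: delete everything up to and including the last backslash.
base=$(printf '%s\n' "$path" | sed 's/.*\\//')

printf '%s\n%s\n' "$dir" "$base"
```

The `$` anchor is what pins the match to the last backslash: any earlier backslash is followed by another backslash before the end of the line, so the pattern cannot match there.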