09-20-2005
How do I extract only the text from an HTML file, without the HTML tags?
I have an HTML file called myfile.html. If I simply run "cat myfile.html" in UNIX, it shows all the HTML tags, like <a href=r/26><img src="http://www>. But I want to extract only the text part.
The same problem happens with the "type" command in MS-DOS.
I know you can do it by opening the file in Internet Explorer, choosing "save as text", and then opening the saved file. But I need to do this from UNIX, as I have thousands of HTML files and no time to convert them to text files one by one. I have gone through many books but can't find a way. I would really appreciate your help.
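One possible approach, offered as a minimal sketch rather than a definitive answer: a small Python script that uses the standard library's html.parser module to keep only the text nodes and writes a .txt file next to each input file. The names TextExtractor, html_to_text and convert_files are made up for this illustration, and the script assumes the files are UTF-8 encoded (undecodable bytes are replaced), which may not hold for every file.

```python
import sys
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace on each line and drop empty lines
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


def convert_files(paths):
    """Write name.txt next to each name.html given in paths."""
    for name in paths:
        src = Path(name)
        # Assumption: input is UTF-8; bad bytes are replaced, not fatal
        text = html_to_text(src.read_text(encoding="utf-8", errors="replace"))
        src.with_suffix(".txt").write_text(text + "\n", encoding="utf-8")


if __name__ == "__main__":
    convert_files(sys.argv[1:])
```

If this were saved as strip_tags.py (a hypothetical filename), running "python3 strip_tags.py *.html" would convert every HTML file in the current directory. With thousands of files spread over several directories, something like "find . -name '*.html' -exec python3 strip_tags.py {} +" avoids overly long command lines.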