Sponsored Content
Top Forums UNIX for Dummies Questions & Answers How do I extract text only from html file without HTML tag Post 83908 by LanceBoyles on Tuesday 20th of September 2005 07:29:28 PM
Old 09-20-2005
Use Lynx with the --dump option, like this:
Code:
lynx --dump myfile.html > myfile.txt

OR
Code:
lynx --dump http://some.where.com/whatever.html > myfile.txt

You can write a shell script that will do this for many files without you having to touch it.

Last edited by Yogesh Sawant; 12-20-2010 at 08:35 AM.. Reason: added code tags
 

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Parse HTML tag parameters and text

Hi! I have a bunch of HTML files, which I want to parse to CSV files. Every page has a table in it, and I need to parse each row into a csv record. With awk and sed, I managed to put every table row in separate lines. So my file looks like this: <TR> .... </TR> <TR> .... </TR> ...One... (1 Reply)
Discussion started by: senszey
1 Replies

2. Shell Programming and Scripting

SED to extract HTML text data, not quite right!

I am attempting to extract weather data from the following website, but for the Victoria area only: Text Forecasts - Environment Canada I use this: sed -n "/Greater Victoria./,/Fraser Valley./p" But that phrasing does not sometimes get it all and think perhaps the website has more... (2 Replies)
Discussion started by: lagagnon
2 Replies

3. Shell Programming and Scripting

Parsing HTML, get text between 2 HTML tags

Hi there, I'm quite new to the forum and shell scripting. I want to filter out the "166.0 points". The results, that i found in google / the forum search didn't helped me :( <a href="/user/test" class="headitem menu" style="color:rgb(83,186,224);">test</a><a href="/points" class="headitem... (1 Reply)
Discussion started by: Mysthik
1 Replies

4. Shell Programming and Scripting

Removing all except couple of html tags from html file

I tried to find elegant (or at least simple) way to remove all but couple of html tags from html file, but all examples I found dealt with removing all the tags. The logic of the script would be: - if there is <li> or <ul> on the line, do nothing (=write same line to output) - if there is:... (0 Replies)
Discussion started by: juubuntu
0 Replies

5. Shell Programming and Scripting

Add the html tag first and last line the file

Hi, i have 30 html files and i want to add the html tag first (<html>) and end of the line </html> tag..How to do it in script. Thanks, (7 Replies)
Discussion started by: bmk
7 Replies

6. Shell Programming and Scripting

Search for a html tag and print the entire tag

I want to print from <fruits> to </fruits> tag which have <fruit> as mango. Also i want both <fruits> and </fruits> in output. Please help eg. <fruits> <fruit id="111">mango<fruit> . another 20 lines . </fruits> (3 Replies)
Discussion started by: Ashik409
3 Replies

7. UNIX for Dummies Questions & Answers

Extract table from an HTML file

I want to extract a table from an HTML file. the table starts with <table class="tableinfo" and ends with next closing table tag </table> how can I do this with awk/sed... ---------- Post updated at 04:34 PM ---------- Previous update was at 04:28 PM ---------- also I want to... (4 Replies)
Discussion started by: koutroul
4 Replies

8. Shell Programming and Scripting

Extract specific line in an html file starting and ending with specific pattern to a text file

Hi This is my first post and I'm just a beginner. So please be nice to me. I have a couple of html files where a pattern beginning with "http://www.site.com" and ending with "/resource.dat" is present on every 241st line. How do I extract this to a new text file? I have tried sed -n 241,241p... (13 Replies)
Discussion started by: dejavo
13 Replies

9. Shell Programming and Scripting

Extract both contents from a html file and do printing

Hi there, Print IP Address: grep 'HostID :' 10.244.9.124\ nessus.html | awk -F '<br>' '{print $12}' | tr -s ' ' | awk -F ':' '{print "<tr><td>" $2 "</td><td>"}' Print Respective Ports: grep 'classsubsection\|./tcp\|./udp' 10.244.9.124\ nessus.html | grep -v 'h2.classsubsection... (3 Replies)
Discussion started by: alvinoo
3 Replies

10. Shell Programming and Scripting

Extract text from html using perl or awk

I am trying to extract text after keywords fron an html file. The keywords are reportLink":, "barcodedSamples": {", "barcodedSamples": {". Both the perl and awk run but the output is just the entire index.html not the desired output. Also for the reportLink": only the text after the second / until... (5 Replies)
Discussion started by: cmccabe
5 Replies
Locale::Script(3)					User Contributed Perl Documentation					 Locale::Script(3)

NAME
Locale::Script - standard codes for script identification SYNOPSIS
use Locale::Script; $script = code2script('phnx'); # 'Phoenician' $code = script2code('Phoenician'); # 'Phnx' $code = script2code('Phoenician', LOCALE_CODE_NUMERIC); # 115 @codes = all_script_codes(); @scripts = all_script_names(); DESCRIPTION
The "Locale::Script" module provides access to standards codes used for identifying scripts, such as those defined in ISO 15924. Most of the routines take an optional additional argument which specifies the code set to use. If not specified, the default ISO 15924 four-letter codes will be used. SUPPORTED CODE SETS
There are several different code sets you can use for identifying scripts. A code set may be specified using either a name, or a constant that is automatically exported by this module. For example, the two are equivalent: $script = code2script('phnx','alpha'); $script = code2script('phnx',LOCALE_SCRIPT_ALPHA); The codesets currently supported are: alpha, LOCALE_SCRIPT_ALPHA This is a set of four-letter (capitalized) codes from ISO 15924 such as 'Phnx' for Phoenician. It also includes additions to this set included in the IANA language registry. The Zxxx, Zyyy, and Zzzz codes are not used. This is the default code set. num, LOCALE_SCRIPT_NUMERIC This is a set of three-digit numeric codes from ISO 15924 such as 115 for Phoenician. ROUTINES
code2script ( CODE [,CODESET] ) script2code ( NAME [,CODESET] ) script_code2code ( CODE ,CODESET ,CODESET2 ) all_script_codes ( [CODESET] ) all_script_names ( [CODESET] ) Locale::Script::rename_script ( CODE ,NEW_NAME [,CODESET] ) Locale::Script::add_script ( CODE ,NAME [,CODESET] ) Locale::Script::delete_script ( CODE [,CODESET] ) Locale::Script::add_script_alias ( NAME ,NEW_NAME ) Locale::Script::delete_script_alias ( NAME ) Locale::Script::rename_script_code ( CODE ,NEW_CODE [,CODESET] ) Locale::Script::add_script_code_alias ( CODE ,NEW_CODE [,CODESET] ) Locale::Script::delete_script_code_alias ( CODE [,CODESET] ) These routines are all documented in the Locale::Codes::API man page. SEE ALSO
Locale::Codes The Locale-Codes distribution. Locale::Codes::API The list of functions supported by this module. http://www.unicode.org/iso15924/ Home page for ISO 15924. http://www.iana.org/assignments/language-subtag-registry The IANA language subtag registry. AUTHOR
See Locale::Codes for full author history. Currently maintained by Sullivan Beck (sbeck@cpan.org). COPYRIGHT
Copyright (c) 1997-2001 Canon Research Centre Europe (CRE). Copyright (c) 2001-2010 Neil Bowers Copyright (c) 2010-2013 Sullivan Beck This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. perl v5.16.3 2013-06-03 Locale::Script(3)
All times are GMT -4. The time now is 04:00 AM.
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.
Privacy Policy