03-20-2008
In my experience, sed is not an appropriate tool for extracting data from HTML files. The HTML is often poorly formed even though it may appear to render correctly in a web browser.
It is better to use tools specifically designed to parse HTML/XHTML/XML/SGML files. Perl, Python, and Ruby all have modules that support parsing such files using DOM or SAX.
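As a rough illustration of the parser-based approach, here is a minimal sketch using Python's standard-library html.parser module, which is event driven in the same spirit as SAX. The class name TextExtractor and the function extract_text are my own names for this example, not part of any library, and the choice to skip script and style content is an assumption about what "text" should mean here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, ignoring tags and <script>/<style> bodies."""

    SKIP = {"script", "style"}  # assumption: these never hold wanted text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes are simply ignored.
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text that is not inside a skipped element.
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.parts.append(text)


def extract_text(html):
    """Return the text of an HTML string as one space-separated line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.parts)


if __name__ == "__main__":
    # Unquoted attributes and a missing closing tag, as often seen in the wild.
    print(extract_text("<td>28 Apr</td><td><a href=r/26>link"))
```

The parser does not require well-formed input, which is the main reason to prefer it over line-oriented pattern matching with sed.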
Even better are XSLT processors. If you want a simple command-line XSLT processor that works on UNIX, I suggest you try xsltproc, which comes with libxslt. If the GNOME desktop is installed on your system, libxslt is already installed.