Extract/Parse information from html (website)
Post 302627661 by Scrutinizer on Saturday 21st of April 2012 05:54:48 AM
Genuine HTML parsing is preferable, I think, but FWIW here is an approach with a bit of awk, using http://pastebin.com/DL1KERT4 as the input file:
Code:
awk 'gsub(/<h6[^>]*>/,ORS ORS)' infile | awk -F'</?tr>' 'NR>1{gsub(/(<[^>]*>)+/,ORS,$1); print $1}' RS=


Code:
Siegeltr&auml;ger
Badische Kraftwerk GmbH & Co. KG
76532 Baden-Baden


Contractor
Bayerische Elektrizitätswerke GmbH
86150 Augsburg
Tel.: +49 (0821) 328 - 0
Fax: +49 (0821) 328 - 4160


Siegeltr&auml;ger
BayWa Energie Dienstleistungs GmbH
81925 München


Siegeltr&auml;ger
BEG Energiegesellschaft mbH
12681 Berlin


Partnerunternehmen
Beratungs- und Planungsbüro für MULTIVALENTE Beheizungssysteme
Dipl.-Ing. Günter Schlagowski
28213 Bremen
Tel.: +49 (0421) 211210
Fax: +49 (0421) 212772


Interessent
Bernd Wiggenhauser
78234 Engen


Interessent
Berndorff Contracting GmbH
50674 Köln


Contractor
beta GmbH Betrieb energietechnischer Anlagen
30451 Hannover
Tel.: +49 (0511) 45001109
Fax: +49 (0511) 497574


Siegeltr&auml;ger
BEVR Biomasse Energie Versorgung Ratekau GmbH & Co. KG
23684 Schulendorf


Siegeltr&auml;ger
BHK-Systeme GmbH
10243 Berlin

or, with "|" as the separator instead of a newline:


Code:
awk 'gsub(/<h6[^>]*>/,ORS ORS)' infile | awk -F'</?tr>' 'NR>1{gsub(/(<[^>]*>)+/,"|",$1); print $1}' RS=

Code:
|Siegeltr&auml;ger|Badische Kraftwerk GmbH & Co. KG|76532 Baden-Baden|
|Contractor|Bayerische Elektrizitätswerke GmbH|86150 Augsburg|Tel.: +49 (0821) 328 - 0|Fax: +49 (0821) 328 - 4160|
|Siegeltr&auml;ger|BayWa Energie Dienstleistungs GmbH|81925 München|
|Siegeltr&auml;ger|BEG Energiegesellschaft mbH|12681 Berlin|
|Partnerunternehmen|Beratungs- und Planungsbüro für MULTIVALENTE Beheizungssysteme|Dipl.-Ing. Günter Schlagowski|28213 Bremen|Tel.: +49 (0421) 211210|Fax: +49 (0421) 212772|
|Interessent|Bernd Wiggenhauser|78234 Engen|
|Interessent|Berndorff Contracting GmbH|50674 Köln|
|Contractor|beta GmbH Betrieb energietechnischer Anlagen|30451 Hannover|Tel.: +49 (0511) 45001109|Fax: +49 (0511) 497574|
|Siegeltr&auml;ger|BEVR Biomasse Energie Versorgung Ratekau GmbH & Co. KG|23684 Schulendorf|
|Siegeltr&auml;ger|BHK-Systeme GmbH|10243 Berlin|
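
For anyone wondering how the two awk stages fit together, here is a minimal sketch on a made-up two-entry sample. The markup, the company names, and the temporary file are assumptions for illustration only; the real pastebin input may be structured differently. The first awk turns every opening <h6 ...> tag into a blank line and prints only the lines where that happened, so each entry starts a new paragraph. The second awk reads in paragraph mode (RS=), splits fields on <tr> and </tr>, skips the first record (whatever preceded the first <h6>), and collapses each run of tags in $1 into a single separator.

```shell
# Hypothetical sample input; the real pastebin markup may differ.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<table>
<tr><td><h6 class="t"><span>Contractor</span></h6><p>Example GmbH<br>12345 Musterstadt</p></td></tr>
<tr><td><h6 class="t"><span>Interessent</span></h6><p>Jane Doe<br>54321 Beispielhausen</p></td></tr>
</table>
EOF

# Stage 1: replace each <h6 ...> opening tag with two newlines (ORS ORS).
#          gsub() returns the number of replacements, so used as a pattern
#          it prints only the lines that contained an <h6> tag.
# Stage 2: RS= selects paragraph mode, so blank lines separate records.
#          NR>1 drops the text before the first <h6>.
#          $1 is everything up to the first <tr> or </tr>; every run of
#          consecutive tags in it becomes a single "|".
out=$(awk 'gsub(/<h6[^>]*>/,ORS ORS)' "$tmp" |
  awk -F'</?tr>' 'NR>1{gsub(/(<[^>]*>)+/,"|",$1); print $1}' RS=)

printf '%s\n' "$out"
rm -f "$tmp"
```

On this sample the sketch should print one |-delimited line per entry, each starting with the <h6> label. Replacing "|" with ORS in the second awk gives the multi-line form instead. Note that HTML entities such as &auml; pass through untouched, which is why they still show up in the output above.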


Last edited by Scrutinizer; 04-21-2012 at 07:04 AM..
 
