Extract/Parse information from html (website)

# 1  
Old 04-20-2012
Hello,

I want to extract some information from an HTML file (website, http://www.energiecontracting.de/7-m...?a_z=B&seite=2 ) and save it in a predefined format (.csv). However, the code on that website seems kinda messy and I can't find a way to handle it properly.

All the information is displayed on one line. Here is an example (copy/paste the raw data into your favorite text editor):

http://pastebin.com/DL1KERT4

So I've reformatted it by hand, just to give you a better understanding of what information I need and where the problem lies:

http://pastebin.com/q5mve8H9

or

http://pastebin.com/DvrGRh7y

I need all of the following information:

status (Partnerunternehmen, Contractor etc. )
company name (BRANDES GmbH, BRASST Energiedienstleistungen GmbH etc.)
company address (13088 Berlin etc.)
company contact person (Karin Brandes etc.)
telephone, email, web URL

Now, like I already mentioned, I can't find a way to extract the info properly because of how the code is formatted. I can't see any usable start/end points because the information differs from entry to entry: sometimes there's no email, no website, no contact person, etc.


I'd be grateful for any help. I'm pretty sure that one of the experts here has the required knowledge to beat it :)

---------- Post updated at 12:33 PM ---------- Previous update was at 12:31 AM ----------

Hmm, so nobody good enough to give it a try?
# 2  
Old 04-20-2012
Hi TehOne,

I want to give it a try, of course. It's an interesting problem, but parsing HTML is not an easy task. I will post a solution if I can solve it, but try it yourself too.
# 3  
Old 04-20-2012
Bumping up posts or double posting is not permitted in these forums.

Please read the rules, which you agreed to when you registered, if you have not already done so.

You may receive an infraction for this. If so, don't worry, just try to follow the rules more carefully. The infraction will expire in the near future.

Thank You.

The UNIX and Linux Forums.
# 4  
Old 04-21-2012
One way:
Code:
$ cat script.pl 
use warnings;
use strict;
use WWW::Mechanize;
use HTML::TokeParser;
use HTML::Entities;

my $uri = q[http://www.energiecontracting.de/7-mitglieder/von-A-Z.php?a_z=B&seite=1];

## Get the agent to explore the web page.
my $mech = WWW::Mechanize->new();
$mech->agent_alias( q[Linux Mozilla] );
$mech->get( $uri );

## Get last page.
my @c = $mech->find_all_links(
                q[url_regex] => qr/(?i:seite)=/,
);
my %d = map { $_->[0] => do { $_->[0] =~ m/(\d+)\Z/; $1 } } @c;
my $last_page = (sort { $b <=> $a } values %d)[0];

my @text;
for my $page ( 1 .. $last_page ) {

        my $tp = HTML::TokeParser->new( \$mech->content() ) or die qq[ERROR in HTML::TokeParser\n];
        $tp->get_tag( q[table] );

        while ( 1 ) {
                my $t = $tp->get_text();
                if ( $t ) {
                        last if $t =~ m/\A(?i)seite/;
                        push @text, $t;
                }
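                my $token = $tp->get_token;
                ## get_token returns undef at the end of the document; stop instead of looping forever.
                last unless defined $token;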
                if ( $token->[0] eq q[E] && $token->[1] eq q[p] ) {
                        printf qq[%s\n], join q[,], @text;
                        @text = ();
                        next;
                }
                if ( $token->[0] eq q[E] && $token->[1] eq q[div] ) {
                        last;
                }
        }

        $uri =~ s/(\d+)\Z/$1 + 1/e;
        $mech->get( $uri );
}

exit 0;
$ perl script.pl
Siegelträger,Badische Kraftwerk GmbH & Co. KG,76532 Baden-Baden
Contractor,Bayerische Elektrizitätswerke GmbH,86150 Augsburg,Tel.: +49 (0821) 328 - 0,Fax: +49 (0821) 328 - 4160,undine.maidl@lew.de,www.bew-augsburg.de
Siegelträger,BayWa Energie Dienstleistungs GmbH,81925 München,Projekte dieser Firma ansehen
Siegelträger,BEG Energiegesellschaft mbH,12681 Berlin
Partnerunternehmen,Beratungs- und Planungsbüro für MULTIVALENTE Beheizungssysteme,Dipl.-Ing. Günter Schlagowski,28213 Bremen,Tel.: +49 (0421) 211210,Fax: +49 (0421) 212772,g.s.nestwaerme@t-online.de,www.schlagowski.de,Weitere Informationen
Interessent,Bernd Wiggenhauser,78234 Engen
Interessent,Berndorff Contracting GmbH,50674 Köln
Contractor,beta GmbH Betrieb energietechnischer Anlagen,30451 Hannover,Tel.: +49 (0511) 45001109,Fax: +49 (0511) 497574,brosziewski@beta-energie.de,www.beta-energie.de
Siegelträger,BEVR Biomasse Energie Versorgung Ratekau GmbH & Co. KG,23684 Schulendorf
Siegelträger,BHK-Systeme GmbH,10243 Berlin
Interessent,Bi. En GmbH & Co. KG,24109 Kiel
Siegelträger,BIBER Biomasse GmbH,94333 Geiselhöring,Projekte dieser Firma ansehen
Siegelträger,Bio Wärme Rhön GmbH & Co. KG,36145 Hofbieber-Obernüst,Projekte dieser Firma ansehen
Siegelträger,Bio-Wärme-Innovation GmbH,06449 Aschersleben
Interessent,Bioenergie-Regional GmbH,74199 Untergruppenbach
Siegelträger,Bioenergiehof Böhme GmbH,01762 Obercarsdorf
Siegelträger,Bisser-Putz re-Solution Energietechnik GbR,78606 Seitingen-Oberflacht
Siegelträger,Blume Wärmelieferungs GmbH,14728 Rhinow
Siegelträger,Bosch Energy and Building Solutions GmbH,70499 Stuttgart
Partnerunternehmen,Bosch Thermotechnik GmbH Buderus Deutschland,Dipl.-Ing. Jens Gierok,21035 Hamburg,Tel.: +49 (040) 73417 - 0,Fax: +49 (040) 73417 - 267,jens.gierok@buderus.de,www.buderus.de,Weitere Informationen
Partnerunternehmen,BRANDES GmbH,Karin Brandes,23701 Eutin,Tel.: +49 (04521) 807 - 0,Fax: +49 (04521) 807 - 77,karin.brandes@brandes.de,www.brandes.de,Weitere Informationen
Contractor,BRASST Energiedienstleistungen GmbH,13088 Berlin,Tel.: +49 (030) 556885 - 0,Fax: +49 (030) 556885 - 99,brasst@bln.de,www.brasst.de
Contractor,BTB  Blockheizkraftwerks- Träger- und Betreiberges. mbH Berlin,10589 Berlin,Tel.: +49 (030) 349907 - 61,Fax: +49 (030) 349907 - 88,karl.meyer@btb-berlin.de,www.btb-berlin.de,Projekte dieser Firma ansehen

# 5  
Old 04-21-2012
Genuine HTML parsing is preferable, I think, but FWIW this is with a bit of awk, using http://pastebin.com/DL1KERT4 as the input file:
Code:
awk 'gsub(/<h6[^>]*>/,ORS ORS)' infile | awk -F'</?tr>' 'NR>1{gsub(/(<[^>]*>)+/,ORS,$1); print $1}' RS=


Code:
Siegeltr&auml;ger
Badische Kraftwerk GmbH & Co. KG
76532 Baden-Baden


Contractor
Bayerische Elektrizitätswerke GmbH
86150 Augsburg
Tel.: +49 (0821) 328 - 0
Fax: +49 (0821) 328 - 4160


Siegeltr&auml;ger
BayWa Energie Dienstleistungs GmbH
81925 München


Siegeltr&auml;ger
BEG Energiegesellschaft mbH
12681 Berlin


Partnerunternehmen
Beratungs- und Planungsbüro für MULTIVALENTE Beheizungssysteme
Dipl.-Ing. Günter Schlagowski
28213 Bremen
Tel.: +49 (0421) 211210
Fax: +49 (0421) 212772


Interessent
Bernd Wiggenhauser
78234 Engen


Interessent
Berndorff Contracting GmbH
50674 Köln


Contractor
beta GmbH Betrieb energietechnischer Anlagen
30451 Hannover
Tel.: +49 (0511) 45001109
Fax: +49 (0511) 497574


Siegeltr&auml;ger
BEVR Biomasse Energie Versorgung Ratekau GmbH & Co. KG
23684 Schulendorf


Siegeltr&auml;ger
BHK-Systeme GmbH
10243 Berlin

or


Code:
awk 'gsub(/<h6[^>]*>/,ORS ORS)' infile | awk -F'</?tr>' 'NR>1{gsub(/(<[^>]*>)+/,"|",$1); print $1}' RS=

Code:
|Siegeltr&auml;ger|Badische Kraftwerk GmbH & Co. KG|76532 Baden-Baden|
|Contractor|Bayerische Elektrizitätswerke GmbH|86150 Augsburg|Tel.: +49 (0821) 328 - 0|Fax: +49 (0821) 328 - 4160|
|Siegeltr&auml;ger|BayWa Energie Dienstleistungs GmbH|81925 München|
|Siegeltr&auml;ger|BEG Energiegesellschaft mbH|12681 Berlin|
|Partnerunternehmen|Beratungs- und Planungsbüro für MULTIVALENTE Beheizungssysteme|Dipl.-Ing. Günter Schlagowski|28213 Bremen|Tel.: +49 (0421) 211210|Fax: +49 (0421) 212772|
|Interessent|Bernd Wiggenhauser|78234 Engen|
|Interessent|Berndorff Contracting GmbH|50674 Köln|
|Contractor|beta GmbH Betrieb energietechnischer Anlagen|30451 Hannover|Tel.: +49 (0511) 45001109|Fax: +49 (0511) 497574|
|Siegeltr&auml;ger|BEVR Biomasse Energie Versorgung Ratekau GmbH & Co. KG|23684 Schulendorf|
|Siegeltr&auml;ger|BHK-Systeme GmbH|10243 Berlin|
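
Note that the awk output above still contains the undecoded entity &auml; (as in "Siegeltr&auml;ger"), because stripping tags does not decode HTML entities. A minimal sketch of one way to clean that up with sed follows. The function name decode_umlauts is made up for illustration, only a handful of German entities are covered, and it assumes the terminal and files use UTF-8. &amp; is decoded last so that a literal "&amp;auml;" is not decoded twice.

Code:
```shell
# decode_umlauts (hypothetical helper): replace a few named HTML entities
# on stdin with the characters they stand for. Assumes UTF-8 output.
decode_umlauts() {
    sed -e 's/&auml;/ä/g' -e 's/&ouml;/ö/g' -e 's/&uuml;/ü/g' \
        -e 's/&Auml;/Ä/g' -e 's/&Ouml;/Ö/g' -e 's/&Uuml;/Ü/g' \
        -e 's/&szlig;/ß/g' \
        -e 's/&amp;/\&/g'   # "\&" is a literal ampersand in a sed replacement
}
```

It can simply be appended to either pipeline above, e.g. "... RS= | decode_umlauts". A real HTML parser handles every entity, which is one more reason to prefer it.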


Last edited by Scrutinizer; 04-21-2012 at 07:04 AM..
# 6  
Old 05-02-2012
Damn, that looks great, thanks a lot!

Hmm, but it also gets me thinking: how would I parse the output properly to arrange all that information into the appropriate columns, e.g. to match the .csv format? If something is not available, it would have to be represented by an empty field, like:

Code:
Status|Company Name|Company Address|Contact|Telephone|Fax|Email|Weburl
----------------------------------------------------------------------------
Interessent|Berndorff Contracting GmbH|50674 Köln|||||
Contractor|beta GmbH Betrieb energietechnischer Anlagen|30451 Hannover||Tel.: +49 (0511) 45001109|Fax: +49 (0511) 497574||

etc.

While getting the status, address, phone, fax and email is easy, the contact and company name are not: both are just [a-zA-Z], so they are hard to separate, especially since not every company has a "GmbH" etc. in its name.
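
One possible way around the company/contact ambiguity, judging only from the sample output earlier in this thread: the fields always seem to come in the same order (status, company name, optional contact, address, then phone, fax, email, web), and the address always starts with a five-digit postal code. If that holds for every entry, the second field is the company name and any unrecognized field that appears before the address is the contact person. The following is a rough awk sketch under exactly those assumptions, not a tested solution for the whole site. The function name to_columns is made up for illustration, and it expects the pipe-separated lines produced by the second awk command in post #5.

Code:
```shell
# to_columns (hypothetical helper): read records like "|Status|Name|...|"
# on stdin and print exactly eight pipe-separated columns per record.
to_columns() {
    awk -F'|' '
    BEGIN {
        OFS = "|"
        print "Status", "Company Name", "Company Address", "Contact", "Telephone", "Fax", "Email", "Weburl"
    }
    {
        # Drop the leading and trailing "|" left over from tag stripping.
        sub(/^\|/, ""); sub(/\|$/, "")
        if (NF < 2) next                      # skip blank or malformed lines
        status = $1; company = $2             # assumed to always be fields 1 and 2
        contact = address = tel = fax = email = web = ""
        for (i = 3; i <= NF; i++) {
            if      ($i ~ /^[0-9][0-9][0-9][0-9][0-9] /) address = $i
            else if ($i ~ /^Tel\.:/)                     tel     = $i
            else if ($i ~ /^Fax:/)                       fax     = $i
            else if ($i ~ /@/)                           email   = $i
            else if ($i ~ /^www\./)                      web     = $i
            else if (address == "" && contact == "")     contact = $i
            # Anything else (such as "Weitere Informationen") is link text and is dropped.
        }
        print status, company, address, contact, tel, fax, email, web
    }'
}
```

Appended to the end of that pipeline ("... RS= | to_columns"), a line such as "|Interessent|Berndorff Contracting GmbH|50674 Köln|" should come out as "Interessent|Berndorff Contracting GmbH|50674 Köln|||||". Entries whose address does not start with a five-digit postal code, or whose web address does not start with "www.", would need extra rules.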