Extract/Parse information from html (website)
Post 302627647 by birei on Saturday 21st of April 2012 05:00:16 AM
One way:
Code:
$ cat script.pl 
use warnings;
use strict;
use WWW::Mechanize;
use HTML::TokeParser;
use HTML::Entities;

my $uri = q[http://www.energiecontracting.de/7-mitglieder/von-A-Z.php?a_z=B&seite=1];

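## Print as UTF-8 so that umlauts in the names are not garbled
## (assumes a UTF-8 terminal).
binmode STDOUT, q[:encoding(UTF-8)];

## Get the agent to explore the web page.
my $mech = WWW::Mechanize->new();
$mech->agent_alias( q[Linux Mozilla] );
$mech->get( $uri );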

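## Get last page: take the number at the end of every "seite=" link
## and keep the highest one (falls back to 1 if no pager link is found).
my @links = $mech->find_all_links(
                q[url_regex] => qr/(?i:seite)=/,
);
my @pages = map { $_->url() =~ m/(\d+)\Z/ ? $1 : () } @links;
my $last_page = (sort { $b <=> $a } @pages, 1)[0];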

my @text;
for my $page ( 1 .. $last_page ) {

        my $tp = HTML::TokeParser->new( \$mech->content() ) or die qq[ERROR in HTML::TokeParser\n];
        $tp->get_tag( q[table] );

        while ( 1 ) {
                my $t = $tp->get_text();
                if ( $t ) {
                        last if $t =~ m/\A(?i)seite/;
                        push @text, $t;
                }
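                ## get_token returns undef at end of input; leave the loop
                ## there, otherwise it would never terminate.
                my $token = $tp->get_token or last;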
                if ( $token->[0] eq q[E] && $token->[1] eq q[p] ) {
                        printf qq[%s\n], join q[,], @text;
                        @text = ();
                        next;
                }
                if ( $token->[0] eq q[E] && $token->[1] eq q[div] ) {
                        last;
                }
        }

        $uri =~ s/(\d+)\Z/$1 + 1/e;
        $mech->get( $uri );
}

exit 0;
$ perl script.pl
Siegelträger,Badische Kraftwerk GmbH & Co. KG,76532 Baden-Baden
Contractor,Bayerische Elektrizitätswerke GmbH,86150 Augsburg,Tel.: +49 (0821) 328 - 0,Fax: +49 (0821) 328 - 4160,undine.maidl@lew.de,www.bew-augsburg.de
Siegelträger,BayWa Energie Dienstleistungs GmbH,81925 München,Projekte dieser Firma ansehen
Siegelträger,BEG Energiegesellschaft mbH,12681 Berlin
Partnerunternehmen,Beratungs- und Planungsbüro für MULTIVALENTE Beheizungssysteme,Dipl.-Ing. Günter Schlagowski,28213 Bremen,Tel.: +49 (0421) 211210,Fax: +49 (0421) 212772,g.s.nestwaerme@t-online.de,www.schlagowski.de,Weitere Informationen
Interessent,Bernd Wiggenhauser,78234 Engen
Interessent,Berndorff Contracting GmbH,50674 Köln
Contractor,beta GmbH Betrieb energietechnischer Anlagen,30451 Hannover,Tel.: +49 (0511) 45001109,Fax: +49 (0511) 497574,brosziewski@beta-energie.de,www.beta-energie.de
Siegelträger,BEVR Biomasse Energie Versorgung Ratekau GmbH & Co. KG,23684 Schulendorf
Siegelträger,BHK-Systeme GmbH,10243 Berlin
Interessent,Bi. En GmbH & Co. KG,24109 Kiel
Siegelträger,BIBER Biomasse GmbH,94333 Geiselhöring,Projekte dieser Firma ansehen
Siegelträger,Bio Wärme Rhön GmbH & Co. KG,36145 Hofbieber-Obernüst,Projekte dieser Firma ansehen
Siegelträger,Bio-Wärme-Innovation GmbH,06449 Aschersleben
Interessent,Bioenergie-Regional GmbH,74199 Untergruppenbach
Siegelträger,Bioenergiehof Böhme GmbH,01762 Obercarsdorf
Siegelträger,Bisser-Putz re-Solution Energietechnik GbR,78606 Seitingen-Oberflacht
Siegelträger,Blume Wärmelieferungs GmbH,14728 Rhinow
Siegelträger,Bosch Energy and Building Solutions GmbH,70499 Stuttgart
Partnerunternehmen,Bosch Thermotechnik GmbH Buderus Deutschland,Dipl.-Ing. Jens Gierok,21035 Hamburg,Tel.: +49 (040) 73417 - 0,Fax: +49 (040) 73417 - 267,jens.gierok@buderus.de,www.buderus.de,Weitere Informationen
Partnerunternehmen,BRANDES GmbH,Karin Brandes,23701 Eutin,Tel.: +49 (04521) 807 - 0,Fax: +49 (04521) 807 - 77,karin.brandes@brandes.de,www.brandes.de,Weitere Informationen
Contractor,BRASST Energiedienstleistungen GmbH,13088 Berlin,Tel.: +49 (030) 556885 - 0,Fax: +49 (030) 556885 - 99,brasst@bln.de,www.brasst.de
Contractor,BTB  Blockheizkraftwerks- Träger- und Betreiberges. mbH Berlin,10589 Berlin,Tel.: +49 (030) 349907 - 61,Fax: +49 (030) 349907 - 88,karl.meyer@btb-berlin.de,www.btb-berlin.de,Projekte dieser Firma ansehen
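
To try the parsing loop without hitting the website, here is a minimal offline sketch of the same idea. The HTML snippet, the URL and the names extract_records and next_page_uri are made up for illustration; the real page markup is assumed to look roughly like this (one <p> per entry inside a table, fields separated by tags, a "Seite" pager after the table), which may not match the site exactly.

```perl
use strict;
use warnings;
use HTML::TokeParser;

## Hypothetical markup, invented for this sketch. It is kept on one line so
## that no whitespace-only text nodes end up between the tags.
my $html = <<'HTML';
<div><table><tr><td><p><b>Contractor</b><br>Example GmbH<br>12345 Berlin</p><p><b>Interessent</b><br>Sample AG<br>54321 Hamburg</p></td></tr></table><p>Seite 1 2</p></div>
HTML

## Return one comma-joined string per <p>...</p> entry found after the
## first <table> tag.
sub extract_records {
        my ( $content ) = @_;
        my $tp = HTML::TokeParser->new( \$content ) or die qq[ERROR in HTML::TokeParser\n];
        $tp->get_tag( q[table] );

        my ( @records, @fields );
        while ( 1 ) {
                ## get_text returns an empty string when the next token is a tag.
                my $t = $tp->get_text();
                if ( $t ) {
                        last if $t =~ m/\A(?i)seite/;   ## pager reached, stop
                        push @fields, $t;
                }
                my $token = $tp->get_token or last;     ## undef at end of input
                if ( $token->[0] eq q[E] && $token->[1] eq q[p] ) {
                        push @records, join q[,], @fields;
                        @fields = ();
                        next;
                }
                last if $token->[0] eq q[E] && $token->[1] eq q[div];
        }
        return @records;
}

## Increment the page number at the end of the URI, as the script does.
sub next_page_uri {
        my ( $uri ) = @_;
        $uri =~ s/(\d+)\Z/$1 + 1/e;
        return $uri;
}

print qq[$_\n] for extract_records( $html );
print next_page_uri( q[http://example.invalid/list.php?a_z=B&seite=1] ), qq[\n];
```

Each closing </p> flushes the collected text pieces as one line, so the sample prints one line per entry, followed by the URI with seite=2. If the real markup has whitespace between tags, the text pieces would probably need trimming and filtering before the push.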

 
