Extract/Parse information from html (website)
Post 302627661 by Scrutinizer on Saturday 21st of April 2012 05:54:48 AM
Genuine HTML parsing is preferable, I think, but FWIW here is an approach with a bit of awk, using http://pastebin.com/DL1KERT4 as the input file:
Code:
awk 'gsub(/<h6[^>]*>/,ORS ORS)' infile | awk -F'</?tr>' 'NR>1{gsub(/(<[^>]*>)+/,ORS,$1); print $1}' RS=


Code:
Siegeltr&auml;ger
Badische Kraftwerk GmbH & Co. KG
76532 Baden-Baden


Contractor
Bayerische Elektrizitätswerke GmbH
86150 Augsburg
Tel.: +49 (0821) 328 - 0
Fax: +49 (0821) 328 - 4160


Siegeltr&auml;ger
BayWa Energie Dienstleistungs GmbH
81925 München


Siegeltr&auml;ger
BEG Energiegesellschaft mbH
12681 Berlin


Partnerunternehmen
Beratungs- und Planungsbüro für MULTIVALENTE Beheizungssysteme
Dipl.-Ing. Günter Schlagowski
28213 Bremen
Tel.: +49 (0421) 211210
Fax: +49 (0421) 212772


Interessent
Bernd Wiggenhauser
78234 Engen


Interessent
Berndorff Contracting GmbH
50674 Köln


Contractor
beta GmbH Betrieb energietechnischer Anlagen
30451 Hannover
Tel.: +49 (0511) 45001109
Fax: +49 (0511) 497574


Siegeltr&auml;ger
BEVR Biomasse Energie Versorgung Ratekau GmbH & Co. KG
23684 Schulendorf


Siegeltr&auml;ger
BHK-Systeme GmbH
10243 Berlin

Or, to get each entry on a single line with its fields separated by |:


Code:
awk 'gsub(/<h6[^>]*>/,ORS ORS)' infile | awk -F'</?tr>' 'NR>1{gsub(/(<[^>]*>)+/,"|",$1); print $1}' RS=

Code:
|Siegeltr&auml;ger|Badische Kraftwerk GmbH & Co. KG|76532 Baden-Baden|
|Contractor|Bayerische Elektrizitätswerke GmbH|86150 Augsburg|Tel.: +49 (0821) 328 - 0|Fax: +49 (0821) 328 - 4160|
|Siegeltr&auml;ger|BayWa Energie Dienstleistungs GmbH|81925 München|
|Siegeltr&auml;ger|BEG Energiegesellschaft mbH|12681 Berlin|
|Partnerunternehmen|Beratungs- und Planungsbüro für MULTIVALENTE Beheizungssysteme|Dipl.-Ing. Günter Schlagowski|28213 Bremen|Tel.: +49 (0421) 211210|Fax: +49 (0421) 212772|
|Interessent|Bernd Wiggenhauser|78234 Engen|
|Interessent|Berndorff Contracting GmbH|50674 Köln|
|Contractor|beta GmbH Betrieb energietechnischer Anlagen|30451 Hannover|Tel.: +49 (0511) 45001109|Fax: +49 (0511) 497574|
|Siegeltr&auml;ger|BEVR Biomasse Energie Versorgung Ratekau GmbH & Co. KG|23684 Schulendorf|
|Siegeltr&auml;ger|BHK-Systeme GmbH|10243 Berlin|
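To see what each half of the pipeline contributes, here is a minimal sketch that runs the same two awk commands on a tiny made-up table. The sample markup and the variable names (sample, stage1, result) are assumptions for illustration only; the real pastebin input is probably laid out differently, so treat this as a way to experiment rather than as a description of that file.

```shell
# Hypothetical sample input; the real pastebin markup may be laid out differently.
sample='<table>
<tr><td><h6 class="x">Contractor</h6>Foo GmbH<br />12345 Berlin</td></tr>
<tr><td><h6 class="x">Interessent</h6>Bar AG<br />54321 Bonn</td></tr>
</table>'

# Stage 1: replace every opening <h6 ...> tag with two newlines, so that each
# entry starts right after a blank line. Lines without an <h6> tag are dropped,
# because gsub() returns 0 for them and the pattern is then false.
stage1=$(printf '%s\n' "$sample" | awk 'gsub(/<h6[^>]*>/,ORS ORS)')

# Stage 2: RS= selects paragraph mode, so records are separated by blank lines.
# $1 is the text up to the first <tr> or </tr>; NR>1 skips the chunk that
# precedes the first <h6>. Each run of adjacent tags in $1 becomes one "|".
result=$(printf '%s\n' "$stage1" |
    awk -F'</?tr>' 'NR>1{gsub(/(<[^>]*>)+/,"|",$1); print $1}' RS=)

printf '%s\n' "$result"
```

Printing "$stage1" as well shows the intermediate blank-line-separated form, which is handy when the real input does not split the way you expect.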


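The output above still contains the raw entity &auml;, since awk only strips the tags and does not decode anything. If that matters, a small sed filter can be appended to either pipeline. This is only a sketch: decode_entities is a hypothetical helper name, it covers just the handful of entities listed, and it assumes the output is meant to be UTF-8. Anything beyond that is better left to a real HTML parser.

```shell
# decode_entities: hypothetical helper, not part of the original post.
# It handles only the entities listed here. &amp; is decoded last so that
# text such as "&amp;auml;" is not decoded twice.
decode_entities() {
    sed -e 's/&auml;/ä/g' -e 's/&ouml;/ö/g' -e 's/&uuml;/ü/g' \
        -e 's/&Auml;/Ä/g' -e 's/&Ouml;/Ö/g' -e 's/&Uuml;/Ü/g' \
        -e 's/&szlig;/ß/g' -e 's/&amp;/\&/g'
}

# Example on one line taken from the pipe-delimited output.
decoded=$(printf '%s\n' '|Siegeltr&auml;ger|Badische Kraftwerk GmbH & Co. KG|76532 Baden-Baden|' |
    decode_entities)
printf '%s\n' "$decoded"
```

In practice it would simply go at the end of the pipeline, e.g. ... RS= | decode_entities.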
Last edited by Scrutinizer; 04-21-2012 at 07:04 AM..
 
