Shell Programming and Scripting: Post 302626981 by TehOne on Friday 20th of April 2012 04:33:36 PM
Extract/Parse information from html (website)

Hello,

I want to extract some information from an HTML file (a website, http://www.energiecontracting.de/7-m...?a_z=B&seite=2 ) and save it in a predefined format (.csv). However, the code on that website seems kind of messy, and I can't find a way to handle it properly.

All the information is displayed on one line. Here is an example (copy/paste the raw data into your favorite text editor):

http://pastebin.com/DL1KERT4

So I've reformatted it by hand, just to give you a better understanding of what information I need and where the problem lies:

http://pastebin.com/q5mve8H9

or

http://pastebin.com/DvrGRh7y

I need all of the following information:

status (Partnerunternehmen, Contractor, etc.)
company name (BRANDES GmbH, BRASST Energiedienstleistungen GmbH, etc.)
company address (13088 Berlin, etc.)
company contact person (Karin Brandes, etc.)
telephone, email, web URL

As I mentioned before, I can't find a way to extract the info properly because of how the code is formatted. I can't see any usable start/end points because the information differs from entry to entry: sometimes there's no email, no website, no contact person, etc.


I'd be grateful for any help. I'm pretty sure that one of the experts here has the required knowledge to beat it.

---------- Post updated at 12:33 PM ---------- Previous update was at 12:31 AM ----------

Hmm, so nobody good enough to give it a try?
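
One possible way around the missing start/end points is to stop relying on position and classify each piece of text by what it looks like instead. The following is only a minimal sketch in Python, not a tested solution for the real page: the markup in SAMPLE_HTML, the phone number, the email, the URL, and the second address are invented placeholders, and the helper names (TextCollector, classify, parse_records, to_csv) are hypothetical. It assumes that every entry starts with one of the known status words and that the company name is the next piece of text after it.

```python
import csv
import io
import re
from html.parser import HTMLParser

# Hypothetical sample: the real page markup is not shown in the post, so the
# tags, phone number, email, URL, and second address are invented placeholders.
SAMPLE_HTML = (
    '<td><b>Partnerunternehmen</b><br><b>BRANDES GmbH</b><br>13088 Berlin<br>'
    'Karin Brandes<br>Tel.: 030 000000<br>'
    '<a href="mailto:info@example.com">info@example.com</a><br>'
    '<a href="http://www.example.com">www.example.com</a></td>'
    '<td><b>Contractor</b><br><b>BRASST Energiedienstleistungen GmbH</b><br>'
    '12345 Musterstadt</td>'
)

FIELDS = ["status", "name", "address", "contact", "telephone", "email", "weburl"]
# Assumption: a status word marks the start of each entry. Extend as needed.
STATUSES = {"Partnerunternehmen", "Contractor"}


class TextCollector(HTMLParser):
    """Collect every non-empty text node and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def classify(text):
    """Guess which field a text chunk belongs to from its content alone."""
    if "@" in text:
        return "email"
    if re.match(r"(https?://|www\.)", text, re.I):
        return "weburl"
    if re.match(r"(tel|fon|telefon)\b", text, re.I):
        return "telephone"
    if re.match(r"\d{5}\s+\S", text):  # German postal code followed by a city
        return "address"
    return "contact"  # anything unrecognised is treated as the contact person


def parse_records(html):
    """Split the text chunks into one dict per company entry."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    records = []
    current = None
    for chunk in collector.chunks:
        if chunk in STATUSES:
            # A status word opens a new record with every field empty.
            current = dict.fromkeys(FIELDS, "")
            current["status"] = chunk
            records.append(current)
        elif current is None:
            continue  # text before the first entry is skipped
        elif not current["name"]:
            current["name"] = chunk
        else:
            field = classify(chunk)
            if not current[field]:  # keep the first value seen per field
                current[field] = chunk
    return records


def to_csv(records):
    """Render the records as CSV text; missing fields stay empty."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(parse_records(SAMPLE_HTML)), end="")
```

Because each record starts with every field empty, an entry without email, website, or contact person still produces a complete CSV row with blank columns. The patterns in classify are guesses and would need adjusting against the real data, for example if a contact name is preceded by a label or an address spans several lines.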
 
