04-20-2012
Extract/Parse information from html (website)
Hello,
I want to extract some information from an HTML file (website:
http://www.energiecontracting.de/7-m...?a_z=B&seite=2 ) and save it in a predefined format (.csv). However, the code on that website seems rather messy, and I can't find a way to handle it properly.
All the information is on a single line; here is an example (copy/paste the raw data into your favorite text editor):
http://pastebin.com/DL1KERT4
So I've reformatted it by hand to give you a better understanding of what information I need and where the problem lies:
http://pastebin.com/q5mve8H9
or
http://pastebin.com/DvrGRh7y
I need all of the following information:
status (Partnerunternehmen, Contractor etc. )
company name (BRANDES GmbH, BRASST Energiedienstleistungen GmbH etc.)
company address (13088 Berlin etc.)
company contact person (Karin Brandes etc.)
telephone, email, web URL
As I mentioned above, I can't find a way to extract the info properly because of how the code is formatted. I can't see any usable start/end points because the information differs from entry to entry: sometimes there's no email, no website, no contact person, etc.
I'd be grateful for any help; I'm pretty sure one of the experts here has the required knowledge to beat it.
---------- Post updated at 12:33 PM ---------- Previous update was at 12:31 AM ----------
Hmm, so nobody good enough to give it a try?
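One possible way to deal with the missing start/end points is to stop relying on field positions and instead classify each piece of text by what it looks like (an "@" means email, a five-digit postal code means address, and so on). Below is a minimal Python sketch of that idea, not a tested solution for the actual site. The markup in SAMPLE, the class name "entry", the "Tel:" prefix, the phone numbers, the example.org addresses and the second entry's city are all made-up placeholders, since the real tags are only visible in the pastebin links; EntryParser, classify and to_csv are hypothetical helper names.

```python
import csv
import io
import re
from html.parser import HTMLParser

# Hypothetical markup: the real page's tags are not shown in the post.
# Phone numbers, example.org addresses and "12345 Musterstadt" are placeholders.
SAMPLE = (
    '<div class="entry"><span>Partnerunternehmen</span><b>BRANDES GmbH</b><br>'
    '13088 Berlin<br>Karin Brandes<br>Tel: 030 1234567<br>'
    '<a href="mailto:info@example.org">info@example.org</a><br>'
    '<a href="http://www.example.org">www.example.org</a></div>'
    '<div class="entry"><span>Contractor</span>'
    '<b>BRASST Energiedienstleistungen GmbH</b><br>12345 Musterstadt<br>'
    'Tel: 0123 456789</div>'
)

FIELDS = ["status", "name", "address", "contact", "phone", "email", "url"]


class EntryParser(HTMLParser):
    """Collect the text chunks of every <div class="entry"> (assumed wrapper)."""

    def __init__(self):
        super().__init__()
        self.entries = []    # one list of text chunks per company
        self.current = None  # chunk list of the entry being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "entry") in attrs:
            self.current = []
            self.entries.append(self.current)

    def handle_endtag(self, tag):
        if tag == "div":
            self.current = None

    def handle_data(self, data):
        text = data.strip()
        if text and self.current is not None:
            self.current.append(text)


def classify(chunks):
    """Map text chunks to fields by their shape, not by their position."""
    row = dict.fromkeys(FIELDS, "")
    # Assumption: status and company name always come first, in that order.
    row["status"], row["name"] = chunks[0], chunks[1]
    for text in chunks[2:]:
        if "@" in text:
            row["email"] = text
        elif re.match(r"(https?://|www\.)", text):
            row["url"] = text
        elif text.lower().startswith("tel"):
            row["phone"] = text.split(":", 1)[-1].strip()
        elif re.match(r"\d{5}\s", text):  # German postal code, then city
            row["address"] = text
        else:
            row["contact"] = text  # whatever is left is taken as the contact
    return row


def to_csv(html):
    """Return CSV text with one row per entry; missing fields stay empty."""
    parser = EntryParser()
    parser.feed(html)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for chunks in parser.entries:
        writer.writerow(classify(chunks))
    return out.getvalue()


print(to_csv(SAMPLE))
```

With this layout an entry that lacks an email, website or contact person simply produces empty CSV columns instead of shifting the other fields. The classification rules would need adjusting once the real markup is known, for example if an entry has two unclassified lines or a street line before the postal code.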