The UNIX and Linux Forums  

  #1  
Old 07-07-2004
Registered User
 

Join Date: Jul 2004
Posts: 2
Reading web page content

Hi, guys.

I really need to solve this problem and I don't know how. So, can someone be so kind as to help me? Please.

And the problem is:
I have to write a C program which will open a web page, filter its contents and save the needed data to a file. Everything is easy except reading the web page.

How can I open a page from within a program? Just for example, let's say I need to find the name of the newest member of unix.com from the notification on the home page. How can I even tell it to go to the web?!

Reading pure HTML (or PHP generated or whatever) is also ok.

I'm on Sun Solaris OS 5.8.
  #2  
Old 07-07-2004
Registered User
 

Join Date: May 2004
Location: Hawaii
Posts: 37
My personal favorite is to use perl to invoke wget or curl, and also to filter the webpage using perl's regular expressions. I'm sure C is the wrong tool for this sort of job, but if you're against learning perl, by all means use C.
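
For anyone who does stay in C, POSIX <regex.h> gives a rough equivalent of the Perl filtering step. The following is only a minimal sketch: the helper name extract_first_group and the sample pattern in the note below are made up for illustration, and it assumes the page text is already in memory as a string.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the first parenthesised group of `pattern`
 * (a POSIX extended regular expression) matched in `text` into `out`.
 * Returns 0 on success, -1 if there is no match or `out` is too small. */
int extract_first_group(const char *text, const char *pattern,
                        char *out, size_t outsz)
{
    regex_t re;
    regmatch_t m[2];   /* m[0] = whole match, m[1] = first group */
    int rc = -1;

    if (outsz == 0 || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, text, 2, m, 0) == 0 && m[1].rm_so != -1) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len < outsz) {
            memcpy(out, text + m[1].rm_so, len);
            out[len] = '\0';
            rc = 0;
        }
    }
    regfree(&re);
    return rc;
}
```

Called with an invented pattern such as "newest member, <b>([^<]*)</b>", it would copy whatever sits between the bold tags into out. Real pages rarely match such a simple pattern, so treat the pattern as a placeholder to adapt.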
  #3  
Old 07-08-2004
photon
Registered User
 
Join Date: Jul 2002
Posts: 148
This is an interesting debate. It is easy to say that a higher
level language is easier, but when you rely on an API for
more difficult requests you will run into problems.

To truly understand the request and response, you have to
understand how sockets and ports work.

For instance try to request

http://www.google.com/search?hl=en&i...&q=unix+forums

with a high level language API.

I tried with Java for instance:

Code:
// This example is from the book _Java in a Nutshell_ by David Flanagan.
// Written by David Flanagan.  Copyright (c) 1996 O'Reilly & Associates.
// You may study, use, modify, and distribute this example for any purpose.
// This example is provided WITHOUT WARRANTY either expressed or implied.

import java.net.*;
import java.io.*;
import java.util.*;

public class GetURLInfo {
    public static void printinfo(URLConnection u) throws IOException {
        // Display the URL address, and information about it.
        System.out.println(u.getURL().toExternalForm() + ":");
        System.out.println("  Content Type: " + u.getContentType());
        System.out.println("  Content Length: " + u.getContentLength());
        System.out.println("  Last Modified: " + new Date(u.getLastModified()));
        System.out.println("  Expiration: " + u.getExpiration());
        System.out.println("  Content Encoding: " + u.getContentEncoding());
        
        // Read and print out the first five lines of the URL.
        System.out.println("First five lines:");
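        // DataInputStream.readLine() is deprecated; BufferedReader is the usual replacement.
        BufferedReader in = new BufferedReader(new InputStreamReader(u.getInputStream()));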
        for(int i = 0; i < 5; i++) {
            String line = in.readLine();
            if (line == null) break;
            System.out.println("  " + line);
        }
    }
    
    // Create a URL from the specified address, open a connection to it,
    // and then display information about the URL.
    public static void main(String[] args) 
        throws MalformedURLException, IOException
    {
        URL url = new URL(args[0]);
        URLConnection connection = url.openConnection();
        printinfo(connection);
    }
}
Run against the Google search URL above, this code gets a 403 (Forbidden) response from the server.

One way around this problem is to go down to the socket level and do
the low-level sends yourself, such as:

Code:
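URL u = new URL(args[0]);
// getPort() returns -1 when the URL has no explicit port, so fall back to 80
int port = (u.getPort() == -1) ? 80 : u.getPort();
Socket socket = new Socket(u.getHost(), port);
OutputStream out = socket.getOutputStream();
InputStream in = socket.getInputStream();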
To read data you would do something like this.

Code:
byte[] buffer = new byte[1024];
int l;
// read() returns -1 at end of stream, so keep reading until then
while ((l = in.read(buffer)) != -1) {
    body.append(new String(buffer, 0, l, "ISO-8859-1"));
}
Of course, you also have to send a proper GET request and handle the rest of the HTTP protocol yourself.
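
Since the original question was about C, here is a rough sketch of what "the proper GET request" might look like there. The helper names, the User-Agent value and the choice of HTTP/1.0 are all assumptions made for illustration. Some servers refuse requests whose User-Agent looks like a default library client, which may be what produced the 403 above, though nothing here confirms it.

```c
#include <stdio.h>

/* Hypothetical helper: format a minimal HTTP/1.0 GET request into buf.
 * Returns the request length, or -1 if buf is too small. */
int build_get_request(char *buf, size_t bufsz,
                      const char *host, const char *path)
{
    int n = snprintf(buf, bufsz,
                     "GET %s HTTP/1.0\r\n"
                     "Host: %s\r\n"
                     "User-Agent: Mozilla/5.0 (sketch)\r\n"  /* placeholder value */
                     "Connection: close\r\n"
                     "\r\n",                                 /* blank line ends the headers */
                     path, host);
    return (n < 0 || (size_t)n >= bufsz) ? -1 : n;
}

/* Hypothetical helper: pull the status code out of a status line such as
 * "HTTP/1.1 403 Forbidden". Returns -1 if the line does not parse. */
int parse_status_code(const char *status_line)
{
    int major, minor, code;

    if (sscanf(status_line, "HTTP/%d.%d %d", &major, &minor, &code) != 3)
        return -1;
    return code;
}
```

HTTP/1.0 with "Connection: close" is used here so the server closes the connection after the response, which keeps the reading side simple: you just read until end of stream.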

It is not as easy as you would think, and C would give you a
better understanding of socket programming. When I get
$60.00 I will pick up Stevens' book to see how he explains it and
really understand network programming.

So, whatever language you use: open a connection to the port, send
the HTTP request, and read the HTTP response back from it.
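
In C, the open / send / read sequence could be sketched roughly as below. This is an illustration under assumptions, not a complete HTTP client: the three helper names are invented, the request string is expected to be built elsewhere, and interrupted system calls, timeouts and redirects are ignored.

```c
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Open a TCP connection to host:port (port as a string, e.g. "80").
 * Returns a socket descriptor, or -1 on failure. */
int connect_to(const char *host, const char *port)
{
    struct addrinfo hints, *res, *p;
    int fd = -1;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;      /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;  /* TCP */
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;

    /* Try each resolved address until one connects. */
    for (p = res; p != NULL; p = p->ai_next) {
        fd = socket(p->ai_family, p->ai_socktype, p->ai_protocol);
        if (fd == -1)
            continue;
        if (connect(fd, p->ai_addr, p->ai_addrlen) == 0)
            break;
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}

/* Write the whole buffer, retrying on short writes. Returns 0 or -1. */
int send_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n <= 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read until the peer closes the connection, copying everything
 * (headers and body alike) into `out`. Returns bytes read, or -1. */
long read_all(int fd, FILE *out)
{
    char buf[1024];
    long total = 0;
    ssize_t n;

    while ((n = read(fd, buf, sizeof buf)) > 0) {
        fwrite(buf, 1, (size_t)n, out);
        total += n;
    }
    return n < 0 ? -1 : total;
}
```

A caller would do something like connect_to("www.example.com", "80"), then send_all() with the request text, then read_all(fd, stdout), and finally close(fd); the host name is only a placeholder. On Solaris the socket functions live in separate libraries, so linking usually needs -lsocket -lnsl.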
  #4  
Old 07-08-2004
Registered User
 

Join Date: Jul 2004
Posts: 2
Yeah!

Thanks, wget does the trick. Actually, what I really needed is summed up in one line (Google as an example):

Code:
system ("wget -O file.dat http://www.google.com");
And then I can do whatever I need with the file (I'm on my own ground here). Filtering and everything. I know it's not the best way nor the most elegant one, but I don't have time to learn more than this. The program has to be finished in some twenty hours now.
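
For completeness, a slightly more defensive version of the same idea might look like the sketch below: check that wget actually succeeded, then scan the saved file for the line you care about. The helper names are invented for illustration, and the command line assumes the URL and file name are trusted and contain no single quotes.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: download `url` into `path` by running wget.
 * Returns 0 only if the command reported success. */
int fetch_with_wget(const char *url, const char *path)
{
    char cmd[1024];
    int n = snprintf(cmd, sizeof cmd, "wget -q -O '%s' '%s'", path, url);

    if (n < 0 || (size_t)n >= sizeof cmd)
        return -1;                  /* command line would not fit */
    return system(cmd) == 0 ? 0 : -1;
}

/* Hypothetical helper: find the first line of `path` that contains
 * `needle` and copy it, without the line ending, into `out`.
 * Lines longer than the 4096-byte buffer are examined in pieces.
 * Returns 0 if found, -1 otherwise. */
int find_line(const char *path, const char *needle, char *out, size_t outsz)
{
    char line[4096];
    int rc = -1;
    FILE *fp;

    if (outsz == 0)
        return -1;
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    while (fgets(line, sizeof line, fp) != NULL) {
        if (strstr(line, needle) != NULL) {
            line[strcspn(line, "\r\n")] = '\0';  /* strip the line ending */
            strncpy(out, line, outsz - 1);
            out[outsz - 1] = '\0';
            rc = 0;
            break;
        }
    }
    fclose(fp);
    return rc;
}
```

Checking the result of system() matters here because a failed download can leave an empty or missing file.dat, and the filtering step would otherwise run on nothing without any warning.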


Tnx so much!