Newbie Python Url Scraper


 
# 1  
Old 06-04-2013

I set up ZoneMinder and have been trying to configure a couple of Wanscam PTZ IP cameras, but I keep running into roadblocks with streaming and so on. I can't find much information about the camera or the web server that runs on it, so I wanted to get the full directory structure of the camera's web server. I tried using:
Code:
wget --spider -r 192.168.3.3:80
Spider mode enabled. Check if remote file exists.
--2013-06-04 13:00:49--  (try: 5)  http://192.168.3.3/
Connecting to 192.168.3.3:80... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.

Spider mode enabled. Check if remote file exists.
--2013-06-04 13:00:54--  (try: 6)  http://192.168.3.3/
Connecting to 192.168.3.3:80... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.

Spider mode enabled. Check if remote file exists.
--2013-06-04 13:01:00--  (try: 7)  http://192.168.3.3/
Connecting to 192.168.3.3:80... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.

Spider mode enabled. Check if remote file exists.
--2013-06-04 13:01:07--  (try: 8)  http://192.168.3.3/
Connecting to 192.168.3.3:80... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.

but it doesn't find a thing. I know the camera has a web server listening on TCP port 80, because I can view the camera through it. I have been attempting to use Python's "scrapy", but I can't work out how to tell it to crawl and discover the directory structure, as opposed to just where to start looking. This is what I have so far:
Code:
#!/usr/bin/env python
# encoding=utf-8

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.http import FormRequest
from scrapy.selector import HtmlXPathSelector
from scrapy import log
import sys
### Kludge to set default encoding to utf-8
reload(sys)
sys.setdefaultencoding('utf-8')

class PTZcamera(BaseSpider):
      name = "camera"
      allowed_domains = ["http://192.168.3.3:80"]
      #start_urls = [""]

      def parse(self, response):
          pass

but it doesn't produce much. I would like output that lists the absolute path of each item on the web server, like:
Code:
http://192.168.3.3/cgi-bin/blah
http://192.168.3.3/cgi-bin/blah2
http://192.168.3.3/video/blah1
http://192.168.3.3/video/blah2
...
...
...

Can someone point me in the correct direction?
# 2  
Old 06-04-2013
If it won't talk to wget, I doubt it'll talk to Python. Solve that problem first, I think.

It may be refusing to talk to wget because it doesn't like its user-agent, which you can set with something like -U netscape

Also, give it --server-response so you can see exactly where the communication dies.
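The same check can be made from Python. The following is a minimal sketch using only the standard library; the helper names build_request and show_response are made up for illustration, and the camera may still reset the connection exactly as it does with wget.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent="netscape"):
    """Return a GET request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def show_response(url, user_agent="netscape", timeout=10):
    """Print the status and headers the server sends back, if any."""
    request = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            print(response.status, response.reason)
            print(response.headers)
    except urllib.error.HTTPError as err:
        # 4xx/5xx replies still carry headers worth inspecting.
        print(err.code, err.reason)
        print(err.headers)
    except OSError as err:
        # Covers connection resets, refused connections and timeouts.
        print("request failed:", err)
```

Calling show_response("http://192.168.3.3/") would print whatever the camera answers, or the error if the connection dies before any headers arrive.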

Last edited by Corona688; 06-04-2013 at 02:28 PM..
# 3  
Old 06-04-2013
Thanks for the reply. It didn't make a difference.
Code:
wget --spider r --server-response -U netscape http://192.168.3.3

Spider mode enabled. Check if remote file exists.
--2013-06-04 14:34:51--  (try:19)  http://192.168.3.3/
Connecting to 192.168.3.3:80... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.

Spider mode enabled. Check if remote file exists.
--2013-06-04 14:35:01--  (try:20)  http://192.168.3.3/
Connecting to 192.168.3.3:80... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Giving up.

I tried just a basic HTTP GET request and got this:
Code:
nc -v 192.168.3.3 80
Connection to 192.168.3.3 80 port [tcp/http] succeeded!
GET / HTTP/1.0

HTTP/1.1 400 Bad Request
Server: Netwave IP Camera
Date: Tue, 04 Jun 2013 18:37:28 GMT
Content-Type: text/html
Content-Length: 135
Connection: close

<HTML><HEAD><TITLE>400 Bad Request</TITLE></HEAD>
<BODY BGCOLOR="#cc9999"><H4>400 Bad Request</H4>
Can't parse request.
</BODY></HTML>

I will dig around. Thanks
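One thing that may be worth ruling out: HTTP expects each request line to end in CRLF, while a request typed into nc from a terminal is normally sent with bare LF line endings, and a strict or minimal server might reject that. Whether that explains the 400 here is only a guess. A minimal sketch that sends the same request with CRLF endings plus a Host header (the function names are made up for illustration):

```python
import socket


def build_raw_get(host, path="/"):
    """Build a minimal HTTP/1.0 GET request with CRLF line endings."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "",  # the empty line that ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def raw_get(host, port=80, path="/", timeout=10):
    """Send the request over a plain socket and return the raw reply bytes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_raw_get(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

Printing raw_get("192.168.3.3").decode("latin-1") would show the status line and headers exactly as the camera sends them.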
# 4  
Old 06-04-2013
Try removing --spider and see what you get. An embedded HTTP server may do odd things when you do things it wasn't expecting, like checking for the existence of a file instead of actually downloading one.

There's no generic way to discover every file on a web server if no page links to it.

What page would you be accessing it from if you used an ordinary web browser?
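What a crawler can do is follow the links that pages do expose. Here is a rough standard-library sketch of that idea: it collects href and src attributes, resolves them to absolute URLs and walks them breadth-first, staying on the starting host. The names LinkCollector, extract_links and crawl are invented for this example, and fetch is any function you supply that returns a page's HTML text (or None on failure).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href/src attribute values from any tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def extract_links(base_url, html):
    """Return the absolute URLs (fragments removed) referenced by a page."""
    parser = LinkCollector()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, link))[0] for link in parser.links]


def crawl(start_url, fetch, limit=200):
    """Breadth-first crawl of same-host URLs; stops at roughly `limit` URLs."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(seen) < limit:
        url = queue.popleft()
        html = fetch(url)
        if not html:
            continue  # fetch failed or returned nothing to parse
        for link in extract_links(url, html):
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(seen)
```

Printing each entry of the returned list gives one absolute URL per line. Anything that no page references will still not show up.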

Last edited by Corona688; 06-04-2013 at 03:56 PM..
# 5  
Old 06-05-2013
When I log in to the camera from Firefox, it redirects me to an index1.htm page, which in turn redirects me to the actual camera view and its config options. When I look at the URL at the top of the page, it says:
Code:
http://192.168.3.3/index1.htm

and whenever I click on any link, it stays the same:
Code:
http://192.168.3.3/index1.htm

never changing. I will look at the source code of the page (too tired last night) and see what is going on. I still don't understand why "wget" is having trouble spidering and spitting out the links, but I should have some feedback today. Thanks for all your input.
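An address bar that never changes often means the page is built from frames or driven by JavaScript, though that is an assumption until the page source confirms it. When reading the source, the things to look for are frame/iframe sources and meta-refresh redirects. A small sketch that pulls both out of a page's HTML (PageStructure and inspect_page are names invented for this example):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageStructure(HTMLParser):
    """Record frame sources and meta-refresh targets found in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.frames = []
        self.refreshes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("frame", "iframe") and attrs.get("src"):
            self.frames.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "meta" and (attrs.get("http-equiv") or "").lower() == "refresh":
            # The content attribute looks like "0; url=index1.htm".
            _, _, target = (attrs.get("content") or "").partition("=")
            target = target.strip(" '\"")
            if target:
                self.refreshes.append(urljoin(self.base_url, target))


def inspect_page(base_url, html):
    """Return (frame URLs, meta-refresh URLs) as absolute URLs."""
    parser = PageStructure(base_url)
    parser.feed(html)
    return parser.frames, parser.refreshes
```

Feeding it the saved source of index1.htm would list the pages that actually load inside the frames, which are the URLs worth crawling next. Redirects done in JavaScript will not be detected by this.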
# 6  
Old 06-05-2013
As I said, I suspect it's not a problem with wget, but --spider. Your camera's got a very tiny computer brain that's probably not running a full complete 100% standards-compliant web server, just a tiny stub which answers full GET requests and very little else.