Very weird wget/curl output - what should I do?


 
# 1  
Old 09-13-2011
Very weird wget/curl output - what should I do?

Hi,
I'm trying to write a script to download Red Hat's errata digest.
It comes in .txt.gz format, and I can get it easily with Firefox.

HOWEVER: the output is VERY strange when downloading it in a script. I seem to get a file of the same size as the real one - but partly text and partly binary! It contains the first message in the digest, and then garbled data of what I can only assume is the rest of the .gz file.
Here is the basic request (I removed the http prefix because I'm not allowed to post links in the forum):
Moderator's Comments:
Mod Comment: When posting a command line, use [CODE] tags, which allow you to post URLs since they aren't parsed.

Code:
wget http://www.redhat.com/archives/enterprise-watch-list/2011-July.txt.gz
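
For what it's worth, this is roughly how I inspected the result (hypothetical local filename; exact output will differ):

Code:
file 2011-July.txt.gz          # reports ASCII text instead of gzip compressed data
head -c 200 2011-July.txt.gz   # the first digest message shows up as plain text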

I think this is an attempt by Red Hat to block people who try to retrieve the errata by script... so I tried messing with the user-agent ID string. No luck; the output is the same. Here is an example of what I tried:

Code:
wget -U "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3" http://www.redhat.com/archives/enterprise-watch-list/2011-July.txt.gz

curl also gives incorrect output - only the text of the first message. It probably tosses out the garbled binary data.

Code:
curl --silent http://www.redhat.com/archives/enterprise-watch-list/2011-July.txt.gz

Code:
curl -A "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" http://www.redhat.com/archives/enterprise-watch-list/2011-July.txt.gz


This is really annoying. Again, Firefox gets it fine, as a .gz file. What should I do?

Thanks in advance....

# 2  
Old 09-13-2011
I can confirm that this is happening at least, but remain as mystified as you. I also tried --referer in wget, to no avail.

I don't believe this is intentional. If they wanted to deny you the file, they'd just deny you the file, not find creative ways to botch its contents.

# 3  
Old 09-13-2011
I too was able to replicate the observed behavior. Corona688 is correct that it's not an attempt to deny access to the file. It's either Apache or wget being stupid; I cannot confirm which at the moment, since the wget header dump only included the server side of the conversation (@#$@@#?).

In any case, this is what's happening.

When Firefox requests the file, it indicates that it accepts gzip encoding. When wget or curl ask for it, they do not indicate this. In a bizarre attempt to be helpful, instead of sending you the compressed text file, or redirecting, or refusing to comply, the webserver sends you plain text.

That in itself seems foolish: depending on the client's headers, downloading the same .gz URL may or may not yield an actual gzip file, while the Content-Type header always claims "application/x-gzip".
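
One way to watch the negotiation (a diagnostic sketch; the exact header lines vary by server and client version) is curl's verbose mode, which prints both sides of the exchange - note the absence of an Accept-Encoding request header:

Code:
curl -v -o /dev/null http://www.redhat.com/archives/enterprise-watch-list/2011-July.txt.gz 2>&1 \
    | grep -iE '^(>|<).*(accept-encoding|content-(type|length|encoding))'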

(We're just getting warmed up.)

The server response, in the Content-Length header, indicates that the data (you know, the gzip'd text which is actually gunzip'd text) it's sending you is 13258 bytes long. In its infinite wisdom, their Apache decides to close the connection one byte short of the advertised size.

(Just when you think things couldn't get more messed up ...)

When wget reconnects to finish the transfer, their webserver begins sending at the requested byte offset, but within the original, gzip-compressed data file ... and continues until the end of that compressed data. This is why you end up with a file of the same size as the real one, beginning with text and followed by "garbled data".
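
You can reproduce the second half of that exchange by hand (a sketch; it assumes their server still answers range requests out of the compressed file as described):

Code:
# ask for everything from byte offset 13257 onward
curl -r 13257- -o tail.bin http://www.redhat.com/archives/enterprise-watch-list/2011-July.txt.gz
file tail.bin    # reports raw "data" rather than text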

After using dd to skip the first 13257 bytes of the mangled file, I used cmp to compare the remaining bytes with their counterparts in the file downloaded with Firefox. They were identical.
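
For anyone who wants to repeat that check, something along these lines works (the filenames are hypothetical; 13257 is where the text portion ends):

Code:
# mangled.gz = the wget download, firefox.gz = the browser download
dd if=mangled.gz of=tail-wget.bin bs=1 skip=13257
dd if=firefox.gz of=tail-ff.bin bs=1 skip=13257
cmp tail-wget.bin tail-ff.bin && echo identical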

So, in the end, what you receive is not the 13258 bytes advertised by the first server response, but the 86777 bytes of the gzip'd file: the first 13257 bytes as uncompressed text and the remainder as gzip'd data.

Long story short: Tell Apache that you can handle gzip'd data. Using curl, the following option works around the problem:
Code:
-H 'Accept-Encoding: gzip'
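
Put together, a complete invocation might look like this (a sketch; gunzip -t merely verifies that what arrived really is a gzip archive):

Code:
curl --silent -H 'Accept-Encoding: gzip' \
    -o 2011-July.txt.gz \
    http://www.redhat.com/archives/enterprise-watch-list/2011-July.txt.gz
gunzip -t 2011-July.txt.gz && echo 'valid gzip archive'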

Regards,
Alister

---------- Post updated at 12:13 PM ---------- Previous update was at 11:52 AM ----------

Quote:
Originally Posted by jstilby
curl also gives incorrect output - only the text of the first message. It probably tosses out the garbled binary data.
Nah. curl is simply not retrying after the webserver closes the connection. Both curl and wget are sent plain text before the connection closes. Only wget reconnects and begins receiving gzip'd data.

Regards and welcome to the forum,
Alister

# 4  
Old 09-13-2011
Quote:
Originally Posted by alister
When Firefox requests the file, it indicates that it accepts gzip encoding. When wget or curl ask for it, they do not indicate this. In a bizarre attempt to be helpful, instead of sending you the compressed text file, or redirecting, or refusing to comply, the webserver sends you plain text.
Could this be server-side compression gone wrong? Many webservers support sending text as zipped data, but doing the reverse is just weird. It would make sense for character encodings, but not for a file on disk. You don't have to say you accept binary/unknown to download binary/unknown...

---------- Post updated at 10:22 AM ---------- Previous update was at 10:19 AM ----------

--header 'Accept-Encoding: gzip' works for wget too.
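
For completeness, the full command would be something like this (mirroring the curl workaround above):

Code:
wget --header='Accept-Encoding: gzip' http://www.redhat.com/archives/enterprise-watch-list/2011-July.txt.gz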

# 5  
Old 09-13-2011
Quote:
Originally Posted by Corona688
Many webservers support sending text as zipped data
And even that seemingly straightforward behavior can be a pain in the derrière: see "HTTP transfer compression" on daniel.haxx.se.

Regards,
Alister
# 6  
Old 09-14-2011
Thanks!

Hi,

This does indeed work. Your help is much appreciated... I was really stuck on this.
Keep up the good work!