Hi,
I'm trying to write a script to download RedHat's errata digest.
It comes in a txt.gz format, and I can get it easily with Firefox.
HOWEVER: the output is VERY strange when downloading it in a script. I seem to get a file of the same size - but partly text and partly binary! It contains the first message in the digest, and then garbled data of what I can only assume is the rest of the .gz file.
Here is the basic request (I removed the http prefix because I'm not allowed to post links in the forum):
Moderator's Comments:
When posting a command line, use [CODE] tags, which allow you to post URLs as they aren't parsed
I think this is an attempt by Red Hat to block people who try to retrieve the errata by script... so I tried messing with the User-Agent string. No luck; the output is the same. Here is an example of what I tried:
curl also gives incorrect output - only the text of the first message. It probably tosses out the garbled binary data.
This is really annoying. Again, Firefox gets it OK as a .gz file. What should I do?
I, too, was able to replicate the observed behavior. Corona688 is correct in that it's not an attempt to deny access to the file. It's either Apache or wget being stupid. I cannot confirm which at the moment, since the wget header dump only included the server side of the conversation (@#$@@#?).
In any case, this is what's happening.
When Firefox requests the file, it indicates that it accepts gzip encoding. When wget or curl ask for it, they do not indicate this. In a bizarre attempt to be helpful, instead of sending you the compressed text file, or redirecting, or refusing to comply, the webserver sends you plain text.
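You can see this for yourself by dumping the request headers each client sends. A quick sketch (the URL is a stand-in, not the real digest location, and exact defaults vary by curl/wget version):

```shell
URL='https://example.com/errata-digest.txt.gz'   # hypothetical URL

# curl -v echoes the request headers it sends, prefixed with '>'.
# By default there is no Accept-Encoding line at all:
curl -sv -o /dev/null "$URL" 2>&1 | grep '^>' || true

# wget's debug mode (-d) dumps its request headers the same way:
wget -d -O /dev/null "$URL" 2>&1 | sed -n '/request begin/,/request end/p' || true
```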
That in itself seems foolish: depending on the client's headers, you may download a .gz-named file that may or may not actually be gzip'd. Meanwhile, the Content-Type header always claims "application/x-gzip".
(We're just getting warmed up.)
The server response, in the Content-Length header, indicates that the data (you know, the gzip'd text which is actually gunzip'd text) it's sending you is 13258 bytes long. In its infinite wisdom, their Apache decides to close the connection one byte short of the advertised size.
(Just when you think things couldn't get more messed up ...)
When wget reconnects to finish the transfer, their webserver begins sending at the requested byte offset, but in the original, gzip-compressed data file ... and continues to send until the end of that compressed data. This is why you see a file of identical size that begins with text followed by "garbled data".
I used dd to skip the first 13257 bytes of the mangled file, then cmp to compare the remaining bytes with their counterparts in the file downloaded with Firefox. They were identical.
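For anyone who wants to repeat the check, here's a minimal sketch of it using tiny synthetic files in place of the two real downloads (the real offset was 13257; all filenames here are made up):

```shell
offset=16   # stands in for the real 13257-byte text prefix

# "good.gz": what Firefox saved (random bytes standing in for gzip data)
head -c 64 /dev/urandom > good.gz

# "mangled.gz": same size, but the first $offset bytes replaced by text,
# mimicking the wget download described above
{ printf '%-16s' 'PLAIN TEXT'; tail -c +$((offset + 1)) good.gz; } > mangled.gz

# skip the text prefix in both files, then compare the remainders
dd if=mangled.gz bs=1 skip="$offset" 2>/dev/null > mangled.tail
dd if=good.gz    bs=1 skip="$offset" 2>/dev/null > good.tail
cmp -s mangled.tail good.tail && echo 'tails identical'
```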
So, in the end, the transfer received is not the 13258 bytes advertised by the first server response, but the 86777-byte size of the gzip'd file, with the first 13257 bytes as uncompressed text and the remainder as gzip'd data.
Long story short: Tell Apache that you can handle gzip'd data. Using curl, the following option works around the problem:
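The exact command from the original post didn't survive this copy, but sending the header explicitly does the trick; with curl it would look something like this (URL is a placeholder):

```shell
URL='https://example.com/errata-digest.txt.gz'   # stand-in URL

# Ask for gzip explicitly and save the raw compressed stream as-is:
curl -fsS -H 'Accept-Encoding: gzip' -o digest.txt.gz "$URL" || echo 'download failed' >&2

# Or let curl negotiate compression and decompress the body for you,
# handy if you want the plain text anyway:
curl -fsS --compressed -o digest.txt "$URL" || echo 'download failed' >&2
```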
Regards,
Alister
---------- Post updated at 12:13 PM ---------- Previous update was at 11:52 AM ----------
Quote:
Originally Posted by jstilby
curl also gives incorrect output - only the text of the first message. It probably tosses out the garbled binary data.
Nah. curl is simply not retrying after the webserver closes the connection. Both curl and wget are sent plain text before the connection closes. Only wget reconnects and begins receiving gzip'd data.
Quote:
Originally Posted by alister
When Firefox requests the file, it indicates that it accepts gzip encoding. When wget or curl ask for it, they do not indicate this. In a bizarre attempt to be helpful, instead of sending you the compressed text file, or redirecting, or refusing to comply, the webserver sends you plain text.
Could this be server-side compression gone wrong? Many webservers support sending text as zipped data, but to do the reverse operation is just weird. It'd make sense for character encodings but not for a file on disk. You don't have to say you accept binary/unknown to download binary/unknown...
---------- Post updated at 10:22 AM ---------- Previous update was at 10:19 AM ----------
--header 'Accept-Encoding: gzip' works for wget too.
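A full wget command using that flag might look like this (URL is a placeholder, not the real digest location):

```shell
# Request gzip explicitly so the server sends the file as stored on disk:
wget --header='Accept-Encoding: gzip' -O errata-digest.txt.gz \
    'https://example.com/errata-digest.txt.gz' || echo 'download failed' >&2
```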
Last edited by Corona688; 09-13-2011 at 01:32 PM..