I downloaded a webpage in my Python script.
In most cases, this works fine.
However, this one came back with a Content-Encoding: gzip response header, and when I tried to print the source of the page, my PuTTY terminal showed nothing but garbage symbols.
How do I decode this to regular text?
I use zlib to decompress gzipped content from the web.
import zlib
import urllib.request

url = 'http://example.com'  # the URL that returned gzipped content
f = urllib.request.urlopen(url)
decompressed_data = zlib.decompress(f.read(), 16 + zlib.MAX_WBITS)
Decompress your byte stream using the built-in gzip module.
If you have any problems, show the exact minimal code you used, the exact error message and traceback, together with the result of print(repr(your_byte_stream[:100])).
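That repr check is useful because gzip data is easy to recognize: every gzip stream starts with the magic bytes 0x1f 0x8b. A minimal sketch (your_byte_stream stands in for whatever bytes you actually received):

# Peek at the payload; gzip streams always begin with 0x1f 0x8b.
print(repr(your_byte_stream[:100]))
if your_byte_stream[:2] == b'\x1f\x8b':
    print("looks like gzip data")
else:
    print("not gzip; maybe zlib/deflate or plain text")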
Further information
1. For an explanation of the gzip/zlib/deflate confusion, read the "Other uses" section of this Wikipedia article.
2. It can be easier to use the zlib module than the gzip module if you have a string rather than a file. Unfortunately the Python docs are incomplete/wrong:
zlib.decompress(string[, wbits[, bufsize]])
...The absolute value of wbits is the base two logarithm of the size of the history buffer (the “window size”) used when compressing data. Its absolute value should be between 8 and 15 for the most recent versions of the zlib library, larger values resulting in better compression at the expense of greater memory usage. The default value is 15. When wbits is negative, the standard gzip header is suppressed; this is an undocumented feature of the zlib library, used for compatibility with unzip's compression file format.
Firstly, 8 <= log2_window_size <= 15, with the meaning given above. Then what should be a separate arg is kludged on top:
arg == log2_window_size means assume string is in zlib format (RFC 1950; what the HTTP 1.1 RFC 2616 confusingly calls "deflate").
arg == -log2_window_size means assume string is in deflate format (RFC 1951; what people who didn't read the HTTP 1.1 RFC carefully actually implemented).
arg == 16 + log2_window_size means assume string is in gzip format (RFC 1952). So you can use 31.
The above information is documented in the zlib C library manual ... Ctrl-F search for windowBits.
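Putting that together, here is a minimal sketch of all three wbits modes (assuming Python 3, and fabricating the test data with zlib.compress/gzip.compress):

import gzip
import zlib

data = b"example payload " * 100

# RFC 1950 zlib format: positive wbits
zlib_blob = zlib.compress(data)
assert zlib.decompress(zlib_blob, 15) == data

# RFC 1951 raw deflate format: negative wbits
# (strip the 2-byte zlib header and 4-byte checksum to get raw deflate)
raw_blob = zlib_blob[2:-4]
assert zlib.decompress(raw_blob, -15) == data

# RFC 1952 gzip format: 16 + wbits, i.e. 31
gzip_blob = gzip.compress(data)
assert zlib.decompress(gzip_blob, 31) == data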
For Python 3
Try this:
import gzip
import urllib.request

opener = urllib.request.build_opener()
request = urllib.request.Request('http://example.com')  # example URL
fetch = opener.open(request)  # basically get a response object
data = gzip.decompress(fetch.read())
data = str(data, 'utf-8')
I use something like this:

import urllib2

def fetch(request):  # wrapped in a function so the return statement is valid
    f = urllib2.urlopen(request)
    data = f.read()
    try:
        from cStringIO import StringIO
        from gzip import GzipFile
        # transparently un-gzip the body if it was compressed
        data = GzipFile('', 'r', 0, StringIO(data)).read()
    except Exception:
        # decompress error: the body was not gzipped, keep the raw bytes
        pass
    return data
If you use the Requests module, then you don't need to use any other modules because the gzip and deflate transfer-encodings are automatically decoded for you.
Example:
>>> import requests
>>> custom_header = {'Accept-Encoding': 'gzip'}
>>> response = requests.get('https://api.github.com/events', headers=custom_header)
>>> response.headers
{'Content-Encoding': 'gzip',...}
>>> response.text
'[{"id":"9134429130","type":"IssuesEvent","actor":{"id":3287933,...
The .text property of the response gives you the content as decoded text.
The .content property of the response gives you the content as raw bytes.
See the Binary Response Content section on docs.python-requests.org
Similar to Shatu's answer for Python 3, but arranged a little differently:

import gzip
from json import loads as json_load
from urllib.request import Request, urlopen

headers = {'Accept-Encoding': 'gzip'}  # example request headers
s = Request("https://someplace.com", None, headers)
r = urlopen(s, None, 180).read()
try: r = gzip.decompress(r)
except OSError: pass
result = json_load(r.decode())
This method allows wrapping the gzip.decompress() call in a try/except to catch and skip the OSError that results when you receive mixed compressed and uncompressed data. Some small payloads actually get bigger when they are compressed, so the plain data is sent instead.
This version is simple and avoids reading the whole file first, since it never calls the read() method. Instead it provides an object that behaves just like a normal file stream.
import gzip
from urllib.request import urlopen
my_gzip_url = 'http://my_url.gz'
my_gzip_stream = urlopen(my_gzip_url)
my_stream = gzip.open(my_gzip_stream, 'r')
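You can then read from it incrementally; for example (a sketch, where process() stands in for whatever you do with each line):

# Lines are decompressed on demand, one chunk at a time.
for line in my_stream:
    process(line)  # process() is a placeholder for your own handling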
None of these answers worked out of the box using Python 3. Here is what worked for me to fetch a page and decode the gzipped response:
import requests
import gzip
response = requests.get('your-url-here')
data = str(gzip.decompress(response.content), 'utf-8')
print(data) # decoded contents of page
You can use urllib3 to easily decode gzip.
urllib3.response.decode_gzip(response.data)
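For a fuller sketch with urllib3's high-level API (assuming a reasonably recent version, which decompresses gzip and deflate bodies automatically when you read .data):

import urllib3

# urllib3 transparently decodes Content-Encoding: gzip/deflate,
# so response.data is already the decompressed payload.
http = urllib3.PoolManager()
response = http.request('GET', 'https://api.github.com/events')
print(response.data[:100])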
Related
I want to read specific bytes from a remote file using a Python module. I am using urllib2. By specific bytes I mean a range given as an offset and a size. I know we can read X bytes from a remote file using urlopen(link).read(X). Is there any way to read data starting at Offset with length Size?
def readSpecificBytes(link, Offset, size):
    # code to be written
This will work with many servers (Apache, etc.), but doesn't always work, especially not with dynamic content like CGI (*.php, *.cgi, etc.):
import urllib2

def get_part_of_url(link, start_byte, end_byte):
    req = urllib2.Request(link)
    req.add_header('Range', 'bytes=' + str(start_byte) + '-' + str(end_byte))
    resp = urllib2.urlopen(req)
    return resp.read()
Note that with this approach the server never has to send, and you never have to download, the data you don't need, which can save a lot of bandwidth if you only want a small amount of data from a large file.
When it doesn't work, just read the leading bytes and discard them before reading the part you want.
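One way to handle that fallback (a sketch; this is a hypothetical variant of the function above, where status 206 Partial Content means the server honored the Range header and a plain 200 means it sent the whole file):

import urllib2

def get_part_of_url_safe(link, start_byte, end_byte):
    req = urllib2.Request(link)
    req.add_header('Range', 'bytes=%d-%d' % (start_byte, end_byte))
    resp = urllib2.urlopen(req)
    if resp.getcode() == 206:  # Partial Content: range was honored
        return resp.read()
    resp.read(start_byte)  # plain 200: skip the leading bytes ourselves
    return resp.read(end_byte - start_byte + 1)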
See Wikipedia Article on HTTP headers for more details.
Unfortunately the file-like object returned by urllib2.urlopen() doesn't actually have a seek() method. You will need to work around this by doing something like this:
def readSpecificBytes(link, Offset, size):
    f = urllib2.urlopen(link)
    if Offset > 0:
        f.read(Offset)  # read and discard the first Offset bytes
    return f.read(size)
I want to get the source of a page, but from the local filesystem rather than over the internet.
Example: url = urllib.request.urlopen('c:/1.html')
>>> import urllib.request
>>> url = urllib.request.urlopen('http://google.com')
>>> page = url.read()
>>> page = page.decode()
>>> page
What am I doing wrong?
from os.path import abspath

with open(abspath('c:/1.html')) as fh:
    print(fh.read())
Since url.read() just gives you the data as-is, and .decode() doesn't really do anything except convert the bytes from the socket into an ordinary string, just print the file contents.
urllib is mainly (if not only) a transport for receiving HTML data; it doesn't actually parse the content. All it does is connect to the source, separate the headers, and give you the content. If you've already stored the page locally in a file, then urllib has no more use to you. Consider an HTML parsing library such as BeautifulSoup instead, as sketched below.
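For instance, a minimal sketch with BeautifulSoup (assuming the bs4 package is installed and the local file from the question):

from bs4 import BeautifulSoup

with open('c:/1.html') as fh:
    soup = BeautifulSoup(fh.read(), 'html.parser')

print(soup.title)             # the parsed <title> element
print(soup.get_text()[:200])  # first 200 characters of visible text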
I snagged a Lorem Ipsum generator last week, and I admit, it's pretty cool.
My question: can someone show me a tutorial on how the author of the above script was able to embed the contents of a gzipped file in their code as a string? I keep finding examples of gzipping a regular file, and I'm feeling kind of lost here.
For what it's worth, I have another module that is quite similar (it generates random names, companies, etc), and right now it reads from a couple different text files. I like this approach better; it requires one less sub-directory in my project to place data into, and it also presents a new way of doing things for me.
I'm quite new to streams, IO types, and the like. Feel free to dump links in my lap. Snippets are always appreciated too.
Assuming you are in a *nix environment, you just need gzip and a Base64 encoder to generate the string. Let's assume your content is in file.txt; for the purpose of this example, I created that file with random bytes.
So you need to compress it first:
$ gzip file.txt
That will generate a file.txt.gz file that you now need to embed into your code. To do that, you need to encode it. A common way to do so is to use Base64 encoding, which can be done with the base64 program:
$ base64 file.txt.gz
H4sICGmHsE8AA2ZpbGUudHh0AAGoAFf/jIMKME+MgnEhgS4vd6SN0zIuVRhsj5fac3Q1EV1EvFJK
fBsw+Ln3ZSX7d5zjBXJR1BUn+b2/S3jHXO9h6KEDx37U7iOvmSf6BMo1gOJEgIsf57yHwUKl7f9+
Beh4kwF+VljN4xjBfdCiXKk0Oc9g/5U/AKR02fRwI+zYlp1ELBVDzFHNsxpjhIT43sBPklXW8L5P
d8Ao3i2tQQPf2JAHRQZYYn3vt0tKg7drVKgAAAA=
Now you have everything you need to use the contents of that file in your Python script:
from cStringIO import StringIO
from base64 import b64decode
from gzip import GzipFile
# this is the variable with your file's contents
gzipped_data = """
H4sICGmHsE8AA2ZpbGUudHh0AAGoAFf/jIMKME+MgnEhgS4vd6SN0zIuVRhsj5fac3Q1EV1EvFJK
fBsw+Ln3ZSX7d5zjBXJR1BUn+b2/S3jHXO9h6KEDx37U7iOvmSf6BMo1gOJEgIsf57yHwUKl7f9+
Beh4kwF+VljN4xjBfdCiXKk0Oc9g/5U/AKR02fRwI+zYlp1ELBVDzFHNsxpjhIT43sBPklXW8L5P
d8Ao3i2tQQPf2JAHRQZYYn3vt0tKg7drVKgAAAA=
"""
# we now decode the file's content from the string and unzip it
orig_file_desc = GzipFile(mode='r',
                          fileobj=StringIO(b64decode(gzipped_data)))
# get the original's file content to a variable
orig_file_cont = orig_file_desc.read()
# and close the file descriptor
orig_file_desc.close()
Obviously, your program will depend on the base64, gzip and cStringIO Python modules.
I'm not sure exactly what you're asking, but here's a stab...
The author of lipsum.py has included the compressed data inline in their code as chunks of Base64 encoded text. Base64 is an encoding mechanism for representing binary data using printable ASCII characters. It can be used for including binary data in your Python code. It is more commonly used to include binary data in email attachments...the next time someone sends you a picture or PDF document, take a look at the raw message and you'll see very much the same thing.
Python's base64 module provides routines for converting between base64 and binary representations of data...and once you have the binary representation of the data, it doesn't really matter how you got it, whether by reading it from a file or by decoding a string embedded in your code.
Python's gzip module can be used to decompress data. It expects a file-like object...and Python provides the StringIO module to wrap strings in the right set of methods to make them act like files. You can see that in lipsum.py in the following code:
sample_text_file = gzip.GzipFile(mode='rb',
                                 fileobj=StringIO(base64.b64decode(DEFAULT_SAMPLE_COMPRESSED)))
This is creating a StringIO object containing the binary representation of the base64 encoded value stored in DEFAULT_SAMPLE_COMPRESSED.
All the modules mentioned here are described in the documentation for the Python standard library.
I wouldn't recommend including data inline in your code like this in general, unless the data is small and relatively static. Otherwise, package it up as a data file inside your Python package, which makes it easier to edit and track changes.
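For example, a sketch using the standard library's pkgutil (the package and resource names here are hypothetical; adjust them to your own layout):

import gzip
import pkgutil
from io import BytesIO

# 'mypackage' and 'data/sample.txt.gz' are made-up names for illustration
raw = pkgutil.get_data('mypackage', 'data/sample.txt.gz')
text = gzip.GzipFile(fileobj=BytesIO(raw)).read()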
Have I answered the right question?
How about this: it zips and encodes a string, prints the encoded form, then decodes and unzips it again.
from StringIO import StringIO
import base64
import gzip
contents = 'The quick brown fox jumps over the lazy dog'
zip_text_file = StringIO()
zipper = gzip.GzipFile(mode='wb', fileobj=zip_text_file)
zipper.write(contents)
zipper.close()
enc_text = base64.b64encode(zip_text_file.getvalue())
print enc_text
sample_text_file = gzip.GzipFile(mode='rb',
                                 fileobj=StringIO(base64.b64decode(enc_text)))
DEFAULT_SAMPLE = sample_text_file.read()
sample_text_file.close()
print DEFAULT_SAMPLE
Old question, but I had to do this recently for AWS logs. In Python 3, use BytesIO instead of StringIO:
import base64
import gzip
from io import BytesIO

DEFAULT_SAMPLE_COMPRESSED = "Some base 64 encoded and gzip compressed string"
sample_text_file = gzip.GzipFile(
    mode='rb',
    fileobj=BytesIO(base64.b64decode(DEFAULT_SAMPLE_COMPRESSED))
)
binary_text = sample_text_file.read()  # the original contents, as bytes
text = binary_text.decode()  # turn the bytes into a str
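To produce such a string in the first place, the Python 3 counterpart is just as short (a sketch):

import base64
import gzip

payload = b'The quick brown fox jumps over the lazy dog'
encoded = base64.b64encode(gzip.compress(payload)).decode('ascii')
print(encoded)  # paste this into your source as the embedded string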
I've a memory- and disk-limited environment where I need to decompress the contents of a gzip file sent to me in string-based chunks (over xmlrpc binary transfer). However, using the zlib.decompress() or zlib.decompressobj()/decompress() both barf over the gzip header. I've tried offsetting past the gzip header (documented here), but still haven't managed to avoid the barf. The gzip library itself only seems to support decompressing from files.
The following snippet gives a simplified illustration of what I would like to do (except in real life the buffer will be filled from xmlrpc, rather than reading from a local file):
#! /usr/bin/env python
import zlib

CHUNKSIZE = 1000
d = zlib.decompressobj()

f = open('23046-8.txt.gz', 'rb')
buffer = f.read(CHUNKSIZE)
while buffer:
    outstr = d.decompress(buffer)
    print(outstr)
    buffer = f.read(CHUNKSIZE)
outstr = d.flush()
print(outstr)
f.close()
Unfortunately, as I said, this barfs with:
Traceback (most recent call last):
File "./test.py", line 13, in <module>
outstr = d.decompress(buffer)
zlib.error: Error -3 while decompressing: incorrect header check
Theoretically, I could feed my xmlrpc-sourced data into a StringIO and then use that as a fileobj for gzip.GzipFile(), however, in real life, I don't have memory available to hold the entire file contents in memory as well as the decompressed data. I really do need to process it chunk-by-chunk.
The fall-back would be to change the compression of my xmlrpc-sourced data from gzip to plain zlib, but since that impacts other sub-systems I'd prefer to avoid it if possible.
Any ideas?
gzip and zlib use slightly different headers.
See How can I decompress a gzip stream with zlib?
Try d = zlib.decompressobj(16+zlib.MAX_WBITS).
And you might try changing your chunk size to a power of 2 (say CHUNKSIZE=1024) for possible performance reasons.
I've got a more detailed answer here: https://stackoverflow.com/a/22310760/1733117
d = zlib.decompressobj(zlib.MAX_WBITS|32)
Per the documentation, this automatically detects the header (zlib or gzip).
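Applied to the snippet from the question, the fix is a one-line change (a sketch using the auto-detecting wbits value):

import zlib

CHUNKSIZE = 1024
d = zlib.decompressobj(zlib.MAX_WBITS | 32)  # auto-detect zlib or gzip header

with open('23046-8.txt.gz', 'rb') as f:
    buffer = f.read(CHUNKSIZE)
    while buffer:
        print(d.decompress(buffer))
        buffer = f.read(CHUNKSIZE)
    print(d.flush())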
I am looking to download a file from an HTTP URL to a local file. The file is large enough that I want to download it and save it in chunks rather than read() and write() the whole thing as a single giant string.
The interface of urllib.urlretrieve is essentially what I want. However, I cannot see a way to set request headers when downloading via urllib.urlretrieve, which is something I need to do.
If I use urllib2, I can set request headers via its Request object. However, I don't see an API in urllib2 to download a file directly to a path on disk like urlretrieve. It seems that instead I will have to use a loop to iterate over the returned data in chunks, writing them to a file myself and checking when we are done.
What would be the best way to build a function that works like urllib.urlretrieve but allows request headers to be passed in?
What is the harm in writing your own function using urllib2?
import urllib2

def urlretrieve(urlfile, fpath):
    chunk = 4096
    f = open(fpath, "wb")  # binary mode, so the bytes are written as-is
    while 1:
        data = urlfile.read(chunk)
        if not data:
            print "done."
            break
        f.write(data)
        print "Read %s bytes" % len(data)
    f.close()
and using a Request object to set headers:
request = urllib2.Request("http://www.google.com")
request.add_header('User-agent', 'Chrome XXX')
urlretrieve(urllib2.urlopen(request), "/tmp/del.html")
If you want to use urllib and urlretrieve, subclass urllib.URLopener and use its addheader() method to adjust the headers (i.e. addheader('Accept', 'sound/basic'), which I'm pulling from the docstring for urllib.addheader).
To install your URLopener for use by urllib, see the example in the urllib._urlopener section of the docs (note the underscore):
import urllib
class MyURLopener(urllib.URLopener):
pass # your override here, perhaps to __init__
urllib._urlopener = MyURLopener
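A sketch of how that ties together with urlretrieve (Python 2; the header value is just the placeholder from the docstring):

import urllib

class MyURLopener(urllib.URLopener):
    pass

opener = MyURLopener()
opener.addheader('Accept', 'sound/basic')  # placeholder header
urllib._urlopener = opener                 # urlretrieve now uses our opener
urllib.urlretrieve('http://www.google.com', '/tmp/del.html')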
However, you'll be pleased to hear, with regard to your comment under the question, that reading an empty string from read() is indeed the signal to stop. This is how urlretrieve decides when to stop, for example. TCP/IP and sockets abstract the reading process: read() blocks waiting for additional data until the other end of the connection reaches EOF and closes, at which point read() returns an empty string. An empty string means no more data is trickling in... you don't have to worry about ordered packet reassembly, as that has all been handled for you. If that's your concern with urllib2, I think you can safely use it.