Replacing binary mode file object for urllib2 - python

I have a piece of code that looks like this, and it works as-is locally (without App Engine):
bodyParams = { "file" : open( filePath, "rb" ) }
request = urllib2.Request( requestUrl, None, self.buildAuthInfo() )
response = self.getOpener().open(request, bodyParams).read()
I would like to get rid of open, i.e. open(filePath, "rb"), because working with files on Google App Engine is prohibited (or really inconvenient).
To work around this, my handler receives the HTTP POST data of a base64-encoded image file and decodes it. For testing, I send the encoded data via cURL, using the following command:
curl -X POST -F image=@encoded http://localhost:8080/image-process
encoded is a base64 encoded jpg.
At this point, I tried two different things: putting the decoded data into a StringIO object, and passing the raw decoded value directly as bodyParams = { "file" : DECODEDVALUE }. I would expect either of these to work, but I get an HTTP 500 error from the target server (where I'm making an external API request). This is how I decode the value:
img = self.request.get('image')
DECODEDVALUE = MyStringIO(base64.b64decode(img))
I believe I have a problem related to encoding and/or binary mode.
How can I get rid of open and the use of file objects, in favor of StringIO, BytesIO, or str objects?
NOTE: just for clarity, and not related to the original problem: MyStringIO is a subclass of StringIO.StringIO with __len__ added:
import StringIO

class MyStringIO(StringIO.StringIO):
    def __len__(self):
        return self.len
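One approach that may work without a real file, as a minimal sketch: it assumes the opener is a poster-style multipart opener that inspects the file object's .name attribute to choose a filename and content type (the upload.jpg name is hypothetical):
import base64
from StringIO import StringIO

img = self.request.get('image')
body = StringIO(base64.b64decode(img))
body.name = "upload.jpg"  # hypothetical; some multipart encoders read .name
bodyParams = { "file" : body }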

Related

zapier - python - pass bytes to output to be used for next action

I am trying to build an automation on Zapier with a flow like this:
1. Trigger: a webhook that receives a POST request. The body has a file key whose value is the base64 string of a PDF, so its type is str.
2. Action: a Zapier Python Code step that retrieves the file from the webhook and decodes the base64 string to bytes, to get the real, valid content of the PDF into a variable named file_bytes.
3. Action: a Dropbox step that retrieves file_bytes from the step before and uploads it to Dropbox.
I coded the decoder myself (point 2) and tested that it worked well on my local system.
The problem is that Dropbox (point 3) only accepts binary, while the Python step (point 2) cannot pass on a value that isn't JSON serializable. This is a clear limitation of Zapier:
output A dictionary or list of dictionaries that will be the "return value" of this code. You can explicitly return early if you like. This must be JSON serializable!
...
The closest I could get from other questions on this site are these two, but they didn't give me any luck:
Why am I getting a Runtime.MarshalError when using this code in Zapier?
Use Python to get image in Zapier
...
The code to decode the base64 string to bytes looks like this:
file_bytes = base64.b64decode(input_data['file'])
What I already did:
pass the file_bytes to output like so:
output = [{'file': input_data['file_bytes']}]
but it gave me "This must be JSON serializable!"
pass the file_bytes as string like so:
output = [{'file': str(input_data['file_bytes'])}]
it did upload to Dropbox, but the file content is corrupt (of course it is, duh)
pass the file_bytes as decoded string with latin-1 encoding:
output = [{'file': input_data['file_bytes'].decode('latin-1')}]
it did upload to Dropbox, and the PDF can even be opened and has the same page count as the original, but it is all blank (white, no content)
...
So, is this kind of feature really feasible on the Zapier platform? Or was I at a dead end from the very beginning?
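For what it's worth: since output must be JSON serializable, one lossless workaround is to pass the base64 string through unchanged and decode it only at the receiving end. A minimal sketch, assuming a later step (or the receiving service) can do the decoding:
import base64

# decoding here works, but raw bytes cannot cross the output boundary:
file_bytes = base64.b64decode(input_data['file'])  # valid PDF bytes
# output = [{'file': file_bytes}]  # -> "This must be JSON serializable!"

# passing the original base64 string through is lossless:
output = [{'file_b64': input_data['file']}]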

How to save incoming file in bottle api to hdfs

I am defining a Bottle API where I need to accept a file from the client and then save that file to HDFS on the local system.
The code looks something like this.
import os
import hadoopy
from bottle import route, request

@route('/upload', method='POST')
def do_upload():
    import pdb; pdb.set_trace()
    upload = request.files.upload
    name, ext = os.path.splitext(upload.filename)
    save_path = "/data/{user}/{filename}".format(user=USER, filename=name)
    hadoopy.writetb(save_path, upload.file.read())
    return "File successfully saved to '{0}'.".format(save_path)
The issue is, request.files.upload.file is an object of type cStringIO.StringO, which can be converted to a str with its .read() method. But hadoopy.writetb(path, content) expects the content to be in some other format, and the server gets stuck at that point. It doesn't raise an exception, return an error, or produce any result; it just sits there as if in an infinite loop.
Does anyone know how to write incoming file in bottle api to HDFS?
From the hadoopy documentation, it looks like the second parameter to writetb is supposed to be an iterable of pairs; but you're passing in bytes.
...the hadoopy.writetb command which takes an iterator of key/value pairs...
Have you tried passing in an iterable containing a single pair? Instead of what you're doing,
hadoopy.writetb(save_path, upload.file.read()) # 2nd param is wrong
try this:
hadoopy.writetb(save_path, [(path, upload.file.read())])
(I'm not familiar with Hadoop so it's not clear to me what the semantics of path are, but presumably it'll make sense to someone who knows HDFS.)
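If that works, the same idea extends to several files at once. A minimal sketch, assuming writetb accepts any iterable of (key, value) pairs (using the filename as the key is an assumption here):
def kv_pairs(uploads):
    # yield one (key, value) pair per uploaded file
    for upload in uploads:
        yield upload.filename, upload.file.read()

hadoopy.writetb(save_path, kv_pairs([upload]))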

GAE Python Blobstore doesn't save filename containing unicode literals in Firefox only

I am developing an app which prompts the user to upload a file which is then available for download.
Here is the download handler:
class ViewPrezentacje(blobstore_handlers.BlobstoreDownloadHandler, BaseHandler):
    def get(self, blob_key):
        blob_key = str(urllib.unquote(blob_key))
        blob_info = blobstore.BlobInfo.get(blob_key)
        self.send_blob(blob_info,
                       save_as=urllib.quote(blob_info.filename.encode('utf-8')))
The file is downloaded with the correct file name (i.e. unicode literals are properly displayed) while using Chrome or IE, but in Firefox it is saved as a string of the form "%83%86%E3..."
Is there any way to make it work properly in Firefox?
Sending filenames with non-ASCII characters in attachments is fraught with difficulty, as the original specification was broken and browser behaviours have varied.
You shouldn't be %-encoding (urllib.quote) the filename; Firefox is right to offer it as literal % sequences as a result. IE's behaviour of %-decoding sequences in the filename is incorrect, even though Chrome eventually went on to copy it.
Ultimately the right way to send non-ASCII filenames is to use the mechanism specified in RFC6266, which ends up with a header that looks like this:
Content-Disposition: attachment; filename*=UTF-8''foo-%c3%a4-%e2%82%ac.html
However:
older browsers such as IE8 don't support it so if you care you should pass something as an ASCII-only filename= as well;
BlobstoreDownloadHandler doesn't know about this mechanism.
The bit of BlobstoreDownloadHandler that needs fixing is this inner function in send_blob:
def send_attachment(filename):
    if isinstance(filename, unicode):
        filename = filename.encode('utf-8')
    self.response.headers['Content-Disposition'] = (
        _CONTENT_DISPOSITION_FORMAT % filename)
which really wants to do something like this (note that filename= carries the ASCII fallback and filename*= carries the RFC 6266 value):
rfc6266_filename = "UTF-8''" + urllib.quote(filename.encode('utf-8'))
fallback_filename = filename.encode('us-ascii', 'ignore')
self.response.headers['Content-Disposition'] = (
    'attachment; filename="%s"; filename*=%s'
    % (fallback_filename, rfc6266_filename))
but unfortunately, being an inner function, it is annoying to fix in a subclass. You could:
override the whole of send_blob to replace the send_attachment inner function;
or maybe write self.response.headers['Content-Disposition'] yourself after calling send_blob, as sketched below (I'm not sure how GAE handles this);
or, probably most practical of all, give up on having Unicode filenames for now until GAE fixes it.
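A minimal sketch of the second option, assuming GAE does not overwrite a Content-Disposition header that is set after calling send_blob (an untested assumption):
class ViewPrezentacje(blobstore_handlers.BlobstoreDownloadHandler, BaseHandler):
    def get(self, blob_key):
        blob_info = blobstore.BlobInfo.get(str(urllib.unquote(blob_key)))
        self.send_blob(blob_info)  # no save_as; we set the header ourselves
        filename = blob_info.filename
        self.response.headers['Content-Disposition'] = (
            'attachment; filename="%s"; filename*=UTF-8\'\'%s'
            % (filename.encode('us-ascii', 'ignore'),
               urllib.quote(filename.encode('utf-8'))))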

Read specific bytes using urlopen()

I want to read specific bytes from a remote file using a Python module. I am using urllib2. By specific bytes I mean a range given as (Offset, Size). I know we can read the first X bytes of a remote file using urlopen(link).read(X). Is there any way to read data that starts at Offset and has length Size?
def readSpecificBytes(link, Offset, size):
    # code to be written
This will work with many servers (Apache, etc.), but doesn't always work, esp. not with dynamic content like CGI (*.php, *.cgi, etc.):
import urllib2

def get_part_of_url(link, start_byte, end_byte):
    req = urllib2.Request(link)
    req.add_header('Range', 'bytes=' + str(start_byte) + '-' + str(end_byte))
    resp = urllib2.urlopen(req)
    return resp.read()
Note that this approach means the server never has to send, and you never have to download, the data you don't need, which can save a lot of bandwidth if you only want a small amount of data from a large file.
When it doesn't work, just read and discard the bytes that come before the range you want.
See Wikipedia Article on HTTP headers for more details.
Unfortunately the file-like object returned by urllib2.urlopen() doesn't actually have a seek() method. You will need to work around this by doing something like this:
def readSpecificBytes(link, Offset, size):
    f = urllib2.urlopen(link)
    if Offset > 0:
        f.read(Offset)  # read and discard the first Offset bytes
    return f.read(size)
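Combining the two answers, a minimal sketch that tries a Range request first and falls back to read-and-discard, assuming the server signals an honoured range with 206 Partial Content:
import urllib2

def read_specific_bytes(link, offset, size):
    req = urllib2.Request(link)
    req.add_header('Range', 'bytes=%d-%d' % (offset, offset + size - 1))
    resp = urllib2.urlopen(req)
    if resp.getcode() == 206:   # 206 Partial Content: the range was honoured
        return resp.read()
    resp.read(offset)           # range ignored: read and discard the prefix
    return resp.read(size)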

In Python, how do I decode GZIP encoding?

I downloaded a webpage in my python script.
In most cases, this works fine.
However, this one came back with a Content-Encoding: gzip response header, and when I tried to print the source code of this web page, it was all garbage symbols in my PuTTY terminal.
How do I decode this to regular text?
I use zlib to decompress gzipped content from the web.
import zlib
import urllib.request

f = urllib.request.urlopen(url)
decompressed_data = zlib.decompress(f.read(), 16 + zlib.MAX_WBITS)
Decompress your byte stream using the built-in gzip module.
If you have any problems, do show the exact minimal code that you used, the exact error message and traceback, together with the result of print repr(your_byte_stream[:100])
Further information
1. For an explanation of the gzip/zlib/deflate confusion, read the "Other uses" section of this Wikipedia article.
2. It can be easier to use the zlib module than the gzip module if you have a string rather than a file. Unfortunately the Python docs are incomplete/wrong:
zlib.decompress(string[, wbits[, bufsize]])
...The absolute value of wbits is the base two logarithm of the size of the history buffer (the “window size”) used when compressing data. Its absolute value should be between 8 and 15 for the most recent versions of the zlib library, larger values resulting in better compression at the expense of greater memory usage. The default value is 15. When wbits is negative, the standard gzip header is suppressed; this is an undocumented feature of the zlib library, used for compatibility with unzip's compression file format.
Firstly, 8 <= log2_window_size <= 15, with the meaning given above. Then what should be a separate arg is kludged on top:
arg == log2_window_size means assume string is in zlib format (RFC 1950; what the HTTP 1.1 RFC 2616 confusingly calls "deflate").
arg == -log2_window_size means assume string is in deflate format (RFC 1951; what people who didn't read the HTTP 1.1 RFC carefully actually implemented)
arg == 16 + log2_window_size means assume string is in gzip format (RFC 1952). So you can use 31.
The above information is documented in the zlib C library manual ... Ctrl-F search for windowBits.
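To make the three conventions concrete, here is a small Python 3 sketch (the [2:-4] slice strips the 2-byte zlib header and the 4-byte Adler-32 trailer to produce raw deflate data; gzip.compress requires Python 3.2+):
import gzip
import zlib

payload = b'payload' * 100
zlib_data = zlib.compress(payload)  # RFC 1950 zlib wrapper

assert zlib.decompress(zlib_data, 15) == payload               # zlib format
assert zlib.decompress(zlib_data[2:-4], -15) == payload        # RFC 1951 raw deflate
assert zlib.decompress(gzip.compress(payload), 31) == payload  # RFC 1952 gzip: 16 + 15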
For Python 3
Try this:
import gzip
fetch = opener.open(request) # basically get a response object
data = gzip.decompress(fetch.read())
data = str(data,'utf-8')
I use something like this:
def fetch(request):
    f = urllib2.urlopen(request)
    data = f.read()
    try:
        from cStringIO import StringIO
        from gzip import GzipFile
        data = GzipFile('', 'r', 0, StringIO(data)).read()
    except Exception:
        pass  # not gzipped; leave the data as-is
    return data
If you use the Requests module, then you don't need to use any other modules because the gzip and deflate transfer-encodings are automatically decoded for you.
Example:
>>> import requests
>>> custom_header = {'Accept-Encoding': 'gzip'}
>>> response = requests.get('https://api.github.com/events', headers=custom_header)
>>> response.headers
{'Content-Encoding': 'gzip',...}
>>> response.text
'[{"id":"9134429130","type":"IssuesEvent","actor":{"id":3287933,...
The .text property of the response is for reading the content in the text context.
The .content property of the response is for reading the content in the binary context.
See the Binary Response Content section on docs.python-requests.org
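To make the distinction concrete, a short usage sketch:
import requests

response = requests.get('https://api.github.com/events')
text = response.text      # str: the body decoded to text (gzip already undone)
raw = response.content    # bytes: the raw body, also already decompressed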
Similar to Shatu's answer for Python 3, but arranged a little differently:
import gzip
from json import loads as json_load
from urllib.request import Request, urlopen

s = Request("https://someplace.com", None, headers)
r = urlopen(s, None, 180).read()
try:
    r = gzip.decompress(r)
except OSError:
    pass  # response was not gzipped
result = json_load(r.decode())
This method allows wrapping gzip.decompress() in a try/except to catch and ignore the OSError that results when you get mixed compressed and uncompressed data. Some small strings actually get bigger when compressed, so the plain data is sent instead.
This version is simple and avoids reading the whole file first by not calling the read() method. Instead it provides a file-like object that behaves just like a normal file stream.
import gzip
from urllib.request import urlopen
my_gzip_url = 'http://my_url.gz'
my_gzip_stream = urlopen(my_gzip_url)
my_stream = gzip.open(my_gzip_stream, 'r')
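Because gzip.open() returns an ordinary file-like object, the result can then be consumed incrementally, for example line by line:
# read the decompressed stream a line at a time instead of all at once
for line in my_stream:
    print(line[:80])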
None of these answers worked out of the box using Python 3. Here is what worked for me to fetch a page and decode the gzipped response:
import requests
import gzip
response = requests.get('your-url-here')
data = str(gzip.decompress(response.content), 'utf-8')
print(data) # decoded contents of page
You can use urllib3 to easily decode gzip.
urllib3.response.decode_gzip(response.data)
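Alternatively, a minimal sketch relying on urllib3's default behaviour of transparently decompressing gzip-encoded bodies (so no explicit decode call should be needed):
import urllib3

http = urllib3.PoolManager()
resp = http.request('GET', 'https://api.github.com/events')
print(resp.data[:100])  # body bytes, already decompressed by urllib3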
