HTTP POST binary files using Python: concise non-pycurl examples?

I'm interested in writing a short python script which uploads a short binary file (.wav/.raw audio) via a POST request to a remote server.
I've done this with pycurl, which makes it very simple and results in a concise script; unfortunately it also requires that the end user have pycurl installed, which I can't rely on.
I've also seen some examples in other posts which rely only on basic libraries (urllib, urllib2, etc.); however, these generally seem to be quite verbose, which is also something I'd like to avoid.
I'm wondering if there are any concise examples which do not require the use of external libraries, and which will be quick and easy for 3rd parties to understand - even if they aren't particularly familiar with python.
What I'm using at present looks like,
def upload_wav( wavfile, url=None, **kwargs ):
    """Upload a wav file to the server, return the response."""
    class responseCallback:
        """Store the server response."""
        def __init__(self):
            self.contents=''
        def body_callback(self, buf):
            self.contents = self.contents + buf
        def decode( self ):
            self.contents = urllib.unquote(self.contents)
            try:
                self.contents = simplejson.loads(self.contents)
            except:
                return self.contents
    t = responseCallback()
    c = pycurl.Curl()
    c.setopt(c.POST,1)
    c.setopt(c.WRITEFUNCTION, t.body_callback)
    c.setopt(c.URL,url)
    postdict = [
        ('userfile',(c.FORM_FILE,wavfile)), #wav file to post
    ]
    #If there are extra keyword args add them to the postdict
    for key in kwargs:
        postdict.append( (key,kwargs[key]) )
    c.setopt(c.HTTPPOST,postdict)
    c.setopt(c.VERBOSE,verbose)
    c.perform()
    c.close()
    t.decode()
    return t.contents
This isn't exact, but it gives you the general idea. It works great and is simple for 3rd parties to understand, but it requires pycurl.

POSTing a file requires multipart/form-data encoding and, as far as I know, there's no easy way (i.e. one-liner or something) to do this with the stdlib. But as you mentioned, there are plenty of recipes out there.
Although they seem verbose, your use case suggests that you can probably just encapsulate it once into a function or class and not worry too much, right? Take a look at the recipe on ActiveState and read the comments for suggestions:
Recipe 146306: Http client to POST using multipart/form-data
or see the MultiPartForm class in this PyMOTW, which seems pretty reusable:
PyMOTW: urllib2 - Library for opening URLs.
I believe both handle binary files.
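To give a feel for what those recipes wrap up, here is a minimal sketch of multipart/form-data encoding using only the standard library. It is written for Python 3 (the thread itself uses Python 2), and the function name encode_multipart and its argument layout are assumptions made for illustration, not taken from either recipe.

```python
import io
import uuid


def encode_multipart(fields, files):
    """Return (content_type, body) for a multipart/form-data POST.

    fields: dict mapping form field name -> str value
    files:  dict mapping form field name -> (filename, bytes, mime_type)
    (hypothetical helper; the layout of both arguments is an assumption)
    """
    boundary = uuid.uuid4().hex  # random separator unlikely to occur in the data
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(('--%s\r\n' % boundary).encode('ascii'))
        buf.write(('Content-Disposition: form-data; name="%s"\r\n\r\n'
                   % name).encode('utf-8'))
        buf.write(value.encode('utf-8') + b'\r\n')
    for name, (filename, data, mime_type) in files.items():
        buf.write(('--%s\r\n' % boundary).encode('ascii'))
        buf.write(('Content-Disposition: form-data; name="%s"; filename="%s"\r\n'
                   % (name, filename)).encode('utf-8'))
        buf.write(('Content-Type: %s\r\n\r\n' % mime_type).encode('ascii'))
        buf.write(data + b'\r\n')  # binary payload is written untouched
    buf.write(('--%s--\r\n' % boundary).encode('ascii'))  # closing boundary
    return 'multipart/form-data; boundary=%s' % boundary, buf.getvalue()
```

The returned pair can then be sent with urllib.request.Request(url, data=body, headers={'Content-Type': content_type}) followed by urllib.request.urlopen.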

I ran into a similar issue today. After trying both pycurl and multipart/form-data, I decided to read the Python httplib/urllib2 source code, and I found a comparatively good solution:
set the Content-Length header (the size of the file) before doing the POST
pass an opened file as the data when doing the POST
Here is the code:
import urllib2, os
image_path = "png\\01.png"
url = 'http://xx.oo.com/webserviceapi/postfile/'
length = os.path.getsize(image_path)
png_data = open(image_path, "rb")
request = urllib2.Request(url, data=png_data)
request.add_header('Cache-Control', 'no-cache')
request.add_header('Content-Length', '%d' % length)
request.add_header('Content-Type', 'image/png')
res = urllib2.urlopen(request).read().strip()
print res
see my blog post: http://www.2maomao.com/blog/python-http-post-a-binary-file-using-urllib2/
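For readers on Python 3, the same idea (send the raw file bytes as the request body, with explicit Content-Type and Content-Length headers) might look like the sketch below. The helper name build_upload_request is an assumption for illustration; it only builds the request so that it can be inspected before anything is sent.

```python
import urllib.request


def build_upload_request(path, url, content_type='application/octet-stream'):
    """Build (but do not send) a POST request whose body is the raw file bytes.

    Hypothetical helper: the name and default content type are assumptions.
    """
    with open(path, 'rb') as f:
        data = f.read()
    request = urllib.request.Request(url, data=data, method='POST')
    request.add_header('Content-Type', content_type)
    request.add_header('Content-Length', str(len(data)))
    return request
```

Passing the result to urllib.request.urlopen(request).read() would perform the upload. Note that this is a plain binary body, not a multipart form, so it only suits servers that accept the file as the whole request body.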

I know this is an old old stack, but I have a different solution.
If you went thru the trouble of building all the magic headers and everything, and are just UPSET that suddenly a binary file can't pass because python library is mean.. you can monkey patch a solution..
import httplib
class HTTPSConnection(httplib.HTTPSConnection):
    def _send_output(self, message_body=None):
        self._buffer.extend(("",""))
        msg = "\r\n".join(self._buffer)
        del self._buffer[:]
        self.send(msg)
        if message_body is not None:
            self.send(message_body)
httplib.HTTPSConnection = HTTPSConnection
If you are using HTTP:// instead of HTTPS:// then replace all instances of HTTPSConnection above with HTTPConnection.
Before people get upset with me, YES, this is a BAD SOLUTION, but it is a way to fix existing code you really don't want to re-engineer to do it some other way.
Why does this fix it? Go look at the original Python source, httplib.py file.

How's urllib substantially more verbose? You build postdict basically the same way, except you start with
postdict = [ ('userfile', open(wavfile, 'rb').read()) ]
Once you have postdict,
resp = urllib.urlopen(url, urllib.urlencode(postdict))
and then you get and save resp.read() and maybe unquote and try JSON-loading if needed. Seems like it would be actually shorter! So what am I missing...?

urllib.urlencode doesn't like some kinds of binary data.
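One concrete cost of the urlencode route can be shown with a small Python 3 sketch (the thread uses Python 2's urllib, so treat this as an illustration rather than a reproduction of the original failure). Every byte outside the URL-safe set is percent-escaped, and the result is an ordinary form field rather than a file upload.

```python
from urllib.parse import urlencode

raw = b'\x00\xff\x10'  # three bytes standing in for audio data
encoded = urlencode({'userfile': raw})

print(encoded)                                    # userfile=%00%FF%10
print(len(raw), len(encoded) - len('userfile='))  # 3 9
```

A server that expects a multipart file part named userfile, as the pycurl FORM_FILE version sends, would not necessarily treat this field as an uploaded file.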

Related

pytest-django: Is this the right way to test view with parameters?

Say I'm testing an RSS feed view in a Django app, is this how I should go about it?
def test_some_view(...):
    ...
    requested_url = reverse("personal_feed", args=[some_profile.auth_token])
    resp = client.get(requested_url, follow=True)
    ...
    assert dummy_object.title in str(resp.content)
Is reverse-ing and then passing that into the client.get() the right way to test? I thought it's DRYer and more future-proof than simply .get()ing the URL.
Should I assert that dummy_object is in the response this way?
I'm testing here using the str representation of the response object. When is it a good practice to do this vs. using selenium? I know it makes it easier to verify that said obj or property (like dummy_object.title) is encapsulated within an H1 tag for example. On the other hand, if I don't care about how the obj is represented, it's faster to do it like the above.
Reevaluating my comment (didn't carefully read the question and overlooked the RSS feed stuff):
Is reverse-ing and then passing that into the client.get() the right way to test? I thought it's DRYer and more future-proof than simply .get()ing the URL.
I would agree on that - from Django's point of view, you are testing your views and don't care about the exact endpoints they are mapped to. Using reverse is thus IMO the clear and correct approach.
Should I assert that dummy_object is in the response this way?
You have to pay attention here. response.content is a bytestring, so asserting dummy_object.title in str(resp.content) is dangerous. Consider the following example:
from django.contrib.syndication.views import Feed
class MyFeed(Feed):
    title = 'äöüß'
    ...
Registered the feed in urls:
urlpatterns = [
    path('my-feed/', MyFeed(), name='my-feed'),
]
Tests:
import pytest
from django.urls import reverse

@pytest.mark.django_db
def test_feed_failing(client):
    uri = reverse('my-feed')
    resp = client.get(uri)
    # str() of a bytestring yields its repr, with non-ASCII bytes escaped
    assert 'äöüß' in str(resp.content)

@pytest.mark.django_db
def test_feed_passing(client):
    uri = reverse('my-feed')
    resp = client.get(uri)
    content = resp.content.decode(resp.charset)
    assert 'äöüß' in content
One will fail, the other won't because of the correct encoding handling.
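The difference can be reproduced without Django. The sketch below assumes a UTF-8 encoded payload and uses a made-up title element as a stand-in for resp.content:

```python
title = 'äöüß'
# Stand-in for resp.content: a UTF-8 encoded bytestring.
content = ('<title>%s</title>' % title).encode('utf-8')

as_repr = str(content)             # repr of the bytes, e.g. "b'<title>\\xc3\\xa4..."
as_text = content.decode('utf-8')  # real text, '<title>äöüß</title>'

print(title in as_repr)  # False
print(title in as_text)  # True
```

With a pure-ASCII title both checks would pass, which is why the str() variant tends to look fine until non-ASCII data shows up.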
As for the check itself, personally I always prefer parsing the content to some meaningful data structure instead of working with raw string even for simple tests. For example, if you are checking for data in a text/html response, it's not much more overhead in writing
soup = bs4.BeautifulSoup(content, 'html.parser')
assert soup.select_one('h1#title-headliner').text == 'title'
or
root = lxml.etree.parse(io.StringIO(content), lxml.etree.HTMLParser())
assert root.xpath('//h1[@id="title-headliner"]')[0].text == 'title'
than just
assert 'title' in content
However, invoking a parser is more explicit (you won't accidentally test for e.g. the title in page metadata in head) and also makes an implicit check for data integrity (e.g. you know that the payload is indeed valid HTML because it parsed successfully).
To your example: in case of RSS feed, I'd simply use the XML parser:
import io
from lxml import etree
def test_feed_title(client):
    uri = reverse('my-feed')
    resp = client.get(uri)
    root = etree.parse(io.BytesIO(resp.content))
    title = root.xpath('//channel/title')[0].text
    assert title == 'my title'
Here, I'm using lxml which is a faster impl of stdlib's xml. The advantage of parsing the content to an XML tree is also that the parser reads from bytestrings, taking care about the encoding handling - so you don't have to decode anything yourself.
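If lxml isn't available, roughly the same check can be sketched with the standard library's xml.etree.ElementTree, which also accepts bytes and honours the encoding declaration. The sample payload and the helper name feed_title below are made up for illustration:

```python
import xml.etree.ElementTree as ET


def feed_title(payload):
    """Return the channel title of an RSS document given as bytes.

    Hypothetical helper; in a test, payload would be resp.content.
    """
    root = ET.fromstring(payload)           # the parser decodes the bytes itself
    return root.findtext('channel/title')   # path relative to the <rss> root


# Made-up stand-in for resp.content
sample = ('<?xml version="1.0" encoding="utf-8"?>'
          '<rss version="2.0"><channel><title>äöüß</title></channel></rss>'
          ).encode('utf-8')

print(feed_title(sample))  # äöüß
```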
Or use something high-level like atoma that has a nice API specifically for RSS entities, so you don't have to fight with XPath selectors:
import atoma
@pytest.mark.django_db
def test_feed_title(client):
    uri = reverse('my-feed')
    resp = client.get(uri)
    # the feed above is RSS, so use the RSS parser rather than the Atom one
    feed = atoma.parse_rss_bytes(resp.content)
    assert feed.title == 'my title'
...When is it a good practice to do this vs. using selenium?
Short answer - you don't need it. I hadn't paid much attention when reading your question and had HTML pages in mind when writing the comment. Regarding this selenium remark - this library handles all the low-level stuff, so when the tests start to accumulate in count (and usually, they do pretty fast), writing
uri = reverse('news-feed')
resp = client.get(uri)
root = parser.parse(resp.content)
assert root.query('some-query')
and dragging the imports along becomes too much hassle, so selenium can replace it with
driver = WebDriver()
driver.get(uri)
assert driver.find_element_by_id('my-element').text == 'my value'
Sure, testing with an automated browser instance has other advantages, like seeing exactly what the user would see in a real browser, allowing the pages to execute client-side javascript etc. But of course, all of this applies mainly to testing HTML pages; for testing an RSS feed, selenium is overkill and Django's testing tools are more than enough.

IncompleteRead using httplib

I have been having a persistent problem getting an rss feed from a particular website. I wound up writing a rather ugly procedure to perform this function, but I am curious why this happens and whether any higher level interfaces handle this problem properly. This problem isn't really a show stopper, since I don't need to retrieve the feed very often.
I have read a solution that traps the exception and returns the partial content, yet since the incomplete reads differ in the amount of bytes that are actually retrieved, I have no certainty that such solution will actually work.
#!/usr/bin/env python
import os
import sys
import feedparser
from mechanize import Browser
import requests
import urllib2
from httplib import IncompleteRead
url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
content = feedparser.parse(url)
if 'bozo_exception' in content:
    print content['bozo_exception']
else:
    print "Success!!"
    sys.exit(0)
print "If you see this, please tell me what happened."
# try using mechanize
b = Browser()
r = b.open(url)
try:
    r.read()
except IncompleteRead, e:
    print "IncompleteRead using mechanize", e
# try using urllib2
r = urllib2.urlopen(url)
try:
    r.read()
except IncompleteRead, e:
    print "IncompleteRead using urllib2", e
# try using requests
try:
    r = requests.request('GET', url)
except IncompleteRead, e:
    print "IncompleteRead using requests", e
# this function is old and I categorized it as ...
# "at least it works darnnit!", but I would really like to
# learn what's happening. Please help me put this function into
# eternal rest.
def get_rss_feed(url):
    response = urllib2.urlopen(url)
    read_it = True
    content = ''
    while read_it:
        try:
            content += response.read(1)
        except IncompleteRead:
            read_it = False
    return content, response.info()
content, info = get_rss_feed(url)
feed = feedparser.parse(content)
As already stated, this isn't a mission critical problem, yet a curiosity, as even though I can expect urllib2 to have this problem, I am surprised that this error is encountered in mechanize and requests as well. The feedparser module doesn't even throw an error, so checking for errors depends on the presence of a 'bozo_exception' key.
Edit: I just wanted to mention that both wget and curl perform the function flawlessly, retrieving the full payload correctly every time. I have yet to find a pure python method to work, excepting my ugly hack, and I am very curious to know what is happening on the backend of httplib. On a lark, I decided to also try this with twill the other day and got the same httplib error.
P.S. There is one thing that also strikes me as very odd. The IncompleteRead happens consistently at one of two breakpoints in the payload. It seems that feedparser and requests fail after reading 926 bytes, yet mechanize and urllib2 fail after reading 1854 bytes. This behavior is consistent, and I am left without explanation or understanding.
At the end of the day, all of the other modules (feedparser, mechanize, and urllib2) call httplib which is where the exception is being thrown.
Now, first things first, I also downloaded this with wget and the resulting file was 1854 bytes. Next, I tried with urllib2:
>>> import urllib2
>>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
>>> f = urllib2.urlopen(url)
>>> f.headers.headers
['Cache-Control: private\r\n',
'Content-Type: text/xml; charset=utf-8\r\n',
'Server: Microsoft-IIS/7.5\r\n',
'X-AspNet-Version: 4.0.30319\r\n',
'X-Powered-By: ASP.NET\r\n',
'Date: Mon, 07 Jan 2013 23:21:51 GMT\r\n',
'Via: 1.1 BC1-ACLD\r\n',
'Transfer-Encoding: chunked\r\n',
'Connection: close\r\n']
>>> f.read()
< Full traceback cut >
IncompleteRead: IncompleteRead(1854 bytes read)
So it is reading all 1854 bytes but then thinks there is more to come. If we explicitly tell it to read only 1854 bytes it works:
>>> f = urllib2.urlopen(url)
>>> f.read(1854)
'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'
Obviously, this is only useful if we always know the exact length ahead of time. We can use the fact the partial read is returned as an attribute on the exception to capture the entire contents:
>>> try:
...     contents = f.read()
... except httplib.IncompleteRead as e:
...     contents = e.partial
...
>>> print contents
'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'
This blog post suggests this is a fault of the server, and describes how to monkey-patch the httplib.HTTPResponse.read() method with the try..except block above to handle things behind the scenes:
import httplib
def patch_http_response_read(func):
    def inner(*args):
        try:
            return func(*args)
        except httplib.IncompleteRead, e:
            return e.partial
    return inner
httplib.HTTPResponse.read = patch_http_response_read(httplib.HTTPResponse.read)
I applied the patch and then feedparser worked:
>>> import feedparser
>>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
>>> feedparser.parse(url)
{'bozo': 0,
'encoding': 'utf-8',
'entries': ...
'status': 200,
'version': 'rss20'}
This isn't the nicest way of doing things, but it seems to work. I'm not expert enough in the HTTP protocols to say for sure whether the server is doing things wrong, or whether httplib is mis-handling an edge case.
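If patching httplib globally feels too heavy, the same try/except can be kept local to the call site. The sketch below is a Python 3 variant (http.client is the Python 3 name for httplib), and the function name read_tolerant is an assumption for illustration. It accepts any object with a read() method, such as the result of urlopen.

```python
import http.client


def read_tolerant(response):
    """Read a response body, falling back to the partial data if the
    server ends the stream early (hypothetical helper)."""
    try:
        return response.read()
    except http.client.IncompleteRead as exc:
        # exc.partial holds the bytes received before the read was cut short
        return exc.partial
```

As with the patch, this silently accepts truncated bodies, so it is only appropriate when a short read from this particular server is known to contain the whole document.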
I found that, in my case, sending an HTTP/1.0 request fixes the problem. Just add this to the code:
import httplib
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'
Then I make the request:
req = urllib2.Request(url, post, headers)
filedescriptor = urllib2.urlopen(req)
img = filedescriptor.read()
Afterwards I switch back to HTTP/1.1 (for connections that support 1.1):
httplib.HTTPConnection._http_vsn = 11
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.1'
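The switch-and-restore steps above can be wrapped so that the original version is restored even if the request raises. This is a sketch for Python 3's http.client, and it relies on the same private class attributes (_http_vsn and _http_vsn_str) as the answer, which are implementation details and may change. The name http_1_0 is an assumption. A plausible reason the trick helps is that chunked transfer encoding is an HTTP/1.1 feature, so a server should not send a chunked body to an HTTP/1.0 client.

```python
import contextlib
import http.client


@contextlib.contextmanager
def http_1_0():
    """Temporarily make http.client speak HTTP/1.0 (hypothetical helper).

    Touches private attributes of HTTPConnection; treat as a workaround.
    """
    conn_cls = http.client.HTTPConnection
    saved = (conn_cls._http_vsn, conn_cls._http_vsn_str)
    conn_cls._http_vsn, conn_cls._http_vsn_str = 10, 'HTTP/1.0'
    try:
        yield
    finally:
        # restore whatever was configured before, even after an exception
        conn_cls._http_vsn, conn_cls._http_vsn_str = saved
```

Usage would be along the lines of: with http_1_0(): data = urllib.request.urlopen(url).read(). The change is process-wide while the block runs, so it is not safe alongside other threads making requests.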
I fixed the issue by using HTTPS instead of HTTP, and it's working fine. No code change was required.

Locally execute python file that is located on a web server

I am working on an open-source project called RubberBand that allows you to do what the title says: locally execute a Python file that is located on a web server. However, I have run into a problem. If a comma is located in a string (e.g. "http:"), it returns an error.
'''
RubberBand Version 1.0.1 'Indigo-Charlie'
http://www.lukeshiels.com/rubberband
CHANGE-LOG:
Changed Error Messages.
Changed Whole Code Into one function, rather than three.
Changed Importing required libraries into one line instead of two
'''
#Edit Below this line
import httplib, urlparse
def executeFromURL(url):
    if (url == None):
        print "!# RUBBERBAND_ERROR: No URL Specified #!"
    else:
        CORE = None
        good_codes = [httplib.OK, httplib.FOUND, httplib.MOVED_PERMANENTLY]
        host, path = urlparse.urlparse(url)[1:3]
        try:
            conn = httplib.HTTPConnection(host)
            conn.request('HEAD', path)
            CORE = conn.getresponse().status
        except StandardError:
            CORE = None
        if(CORE in good_codes):
            exec(url)
        else:
            print "!# RUBBERBAND_ERROR: File Does Not Exist On WEBSERVER #!"
RubberBand in three lines without error checking:
import requests
def execute_from_url(url):
    exec(requests.get(url).content)
You should use a return statement in your if (url == None): block as there is no point in carrying on with your function.
Whereabouts in your code is the error? Is there a full traceback? URIs with commas parse fine with the urlparse module.
Is it perhaps httplib.ResponseNotReady when calling CORE = conn.getresponse().status?
Nevermind that error message, that was me quickly testing your code and re-using the same connection object. I can't see what would be erroneous in your code.
I would suggest to check this question.
Avoid commas in the URL; that's my suggestion.
Can I use commas in a URL?
This seems to work well for me:
import urllib
(fn,hd) = urllib.urlretrieve('http://host.com/file.py')
execfile(fn)
I prefer to use standard libraries, because I'm using python bundled with third party software (abaqus) which makes it a real headache to add packages.
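In the same standard-library spirit, a Python 3 sketch of the fetch-then-execute idea could look like the following. The function name execute_from_url and its namespace argument are assumptions for illustration. Note that it executes the downloaded source, whereas exec(url) in the question hands the URL text itself to the interpreter, which would explain a syntax error around "http:".

```python
import urllib.request


def execute_from_url(url, namespace=None):
    """Fetch Python source from url and execute it.

    Hypothetical helper; returns the namespace the code ran in.
    Only use with URLs whose content you trust.
    """
    if namespace is None:
        namespace = {}
    with urllib.request.urlopen(url) as response:
        source = response.read()  # the file's contents, not the URL string
    # compile() with the URL as filename gives readable tracebacks
    exec(compile(source, url, 'exec'), namespace)
    return namespace
```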

Script to connect to a web page

Looking for a python script that would simply connect to a web page (maybe some querystring parameters).
I am going to run this script as a batch job in unix.
urllib2 will do what you want and it's pretty simple to use.
import urllib
import urllib2
params = {'param1': 'value1'}
req = urllib2.Request("http://someurl", urllib.urlencode(params))
res = urllib2.urlopen(req)
data = res.read()
It's also nice because it's easy to modify the above code to do all sorts of other things like POST requests, Basic Authentication, etc.
Try this:
import urllib2
aResp = urllib2.urlopen("http://google.com/")
print aResp.read()
If you need your script to actually function as a user of the site (clicking links, etc.) then you're probably looking for the python mechanize library.
Python Mechanize
A simple wget called from a shell script might suffice.
in python 2.7:
import urllib2
params = "key=val&key2=val2" #make sure that it's in GET request format
url = "http://www.example.com"
html = urllib2.urlopen(url+"?"+params).read()
print html
more info at https://docs.python.org/2.7/library/urllib2.html
in python 3.6:
from urllib.request import urlopen
params = "key=val&key2=val2" #make sure that it's in GET request format
url = "http://www.example.com"
html = urlopen(url+"?"+params).read()
print(html)
more info at https://docs.python.org/3.6/library/urllib.request.html
to encode params into GET format:
def myEncode(dictionary):
    result = ""
    for k in dictionary: #k is the key
        result += k+"="+dictionary[k]+"&"
    return result[:-1] #all but that last `&`
I'm pretty sure this should work in either python2 or python3...
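The hand-rolled encoder above does not escape characters such as spaces or "&" inside values. A minimal Python 3 sketch using the standard library's urlencode, which does, is shown below (in Python 2 the same function lives at urllib.urlencode). The helper name build_url is an assumption for illustration.

```python
from urllib.parse import urlencode


def build_url(base, params):
    """Append params (a dict) to base as an escaped query string.

    Hypothetical helper; assumes base has no query string yet.
    """
    return base + '?' + urlencode(params)


print(build_url('http://www.example.com', {'key': 'val', 'q': 'a b&c'}))
# http://www.example.com?key=val&q=a+b%26c
```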
What are you trying to do? If you're just trying to fetch a web page, cURL is a pre-existing (and very common) tool that does exactly that.
Basic usage is very simple:
curl www.example.com
You might want to simply use httplib from the standard library.
import httplib
myConnection = httplib.HTTPConnection('www.example.com')  # host name only, no 'http://' prefix
you can find the official reference here: http://docs.python.org/library/httplib.html
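Since the connection class takes a host rather than a full URL, a small helper can split a URL into the two pieces the API wants. This is a Python 3 sketch (http.client is the Python 3 name for httplib), and the name connection_for is an assumption for illustration. Creating the connection object does not open a socket; that happens on the first request.

```python
import http.client
from urllib.parse import urlsplit


def connection_for(url):
    """Return (connection, path) for an http:// URL (hypothetical helper)."""
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.netloc)  # host[:port], no scheme
    path = parts.path or '/'
    if parts.query:
        path += '?' + parts.query
    return conn, path
```

A request would then be conn.request('GET', path) followed by conn.getresponse().read().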

Putting a pyCurl XML server response into a variable (Python)

I'm a Python novice, trying to use pyCurl. The project I am working on is creating a Python wrapper for the twitpic.com API (http://twitpic.com/api.do). For reference purposes, check out the code (http://pastebin.com/f4c498b6e) and the error I'm getting (http://pastebin.com/mff11d31).
Pay special attention to line 27 of the code, which contains "xml = server.perform()". After researching my problem, I discovered that unlike I had previously thought, .perform() does not return the xml response from twitpic.com, but None, when the upload succeeds (duh!).
After looking at the error output further, it seems to me like the xml input that I want stuffed into the "xml" variable is instead being printed to either standard output or standard error (not sure which). I'm sure there is an easy way to do this, but I cannot seem to think of it at the moment. If you have any tips that could point me in the right direction, I'd be very appreciative. Thanks in advance.
Using a StringIO would be much cleaner, no point in using a dummy class like that if all you want is the response data...
Something like this would suffice:
import pycurl
import cStringIO
response = cStringIO.StringIO()
c = pycurl.Curl()
c.setopt(c.URL, 'http://www.turnkeylinux.org')
c.setopt(c.WRITEFUNCTION, response.write)
c.perform()
c.close()
print response.getvalue()
The pycurl doc explicitly says:
perform() -> None
So the expected result is what you observe.
looking at an example from the pycurl site:
import sys
import pycurl
class Test:
    def __init__(self):
        self.contents = ''
    def body_callback(self, buf):
        self.contents = self.contents + buf
print >>sys.stderr, 'Testing', pycurl.version
t = Test()
c = pycurl.Curl()
c.setopt(c.URL, 'http://curl.haxx.se/dev/')
c.setopt(c.WRITEFUNCTION, t.body_callback)
c.perform()
c.close()
print t.contents
The example uses a class instance - Test() - with a callback method to save the content. Note the call c.setopt(c.WRITEFUNCTION, t.body_callback) - something like this is missing in your code, so you do not receive any data (buf in the example). The example shows how to access the content:
print t.contents
