Putting a pyCurl XML server response into a variable (Python)

I'm a Python novice, trying to use pyCurl. The project I am working on is creating a Python wrapper for the twitpic.com API (http://twitpic.com/api.do). For reference purposes, check out the code (http://pastebin.com/f4c498b6e) and the error I'm getting (http://pastebin.com/mff11d31).
Pay special attention to line 27 of the code, which contains "xml = server.perform()". After researching my problem, I discovered that, contrary to what I had thought, .perform() does not return the XML response from twitpic.com; it returns None when the upload succeeds (duh!).
Looking further at the error output, it seems the XML I want stuffed into the "xml" variable is instead being printed to either standard output or standard error (I'm not sure which). I'm sure there is an easy way to do this, but I cannot seem to think of it at the moment. If you have any tips that could point me in the right direction, I'd be very appreciative. Thanks in advance.

Using a StringIO would be much cleaner; there's no point in using a dummy class like that if all you want is the response data.
Something like this would suffice:
import pycurl
import cStringIO

# Buffer that will collect the response body
response = cStringIO.StringIO()

c = pycurl.Curl()
c.setopt(c.URL, 'http://www.turnkeylinux.org')
# Route the response body into the buffer instead of stdout
c.setopt(c.WRITEFUNCTION, response.write)
c.perform()
c.close()

print response.getvalue()
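If you also need the HTTP status code, pycurl exposes it via getinfo(), which has to be called after perform() but before close(). A small extension of the sketch above:

c.perform()
status = c.getinfo(pycurl.RESPONSE_CODE)  # must be queried before c.close()
c.close()
print status, response.getvalue()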

The pycurl doc explicitly says:
perform() -> None
So the expected result is what you observe.
looking at an example from the pycurl site:
import sys
import pycurl

class Test:
    def __init__(self):
        self.contents = ''

    def body_callback(self, buf):
        self.contents = self.contents + buf

print >>sys.stderr, 'Testing', pycurl.version

t = Test()
c = pycurl.Curl()
c.setopt(c.URL, 'http://curl.haxx.se/dev/')
c.setopt(c.WRITEFUNCTION, t.body_callback)
c.perform()
c.close()
print t.contents
The example uses a class instance - Test() - whose callback accumulates the content. Note the call c.setopt(c.WRITEFUNCTION, t.body_callback) - something like this is missing in your code, which is why you never receive any data (buf in the example). The example then shows how to access the content:
print t.contents

Related

How to download books automatically from Gutenberg

I am trying to download books from "http://www.gutenberg.org/". I want to know why my code gets nothing.
import requests
import re
import os
import urllib

def get_response(url):
    response = requests.get(url).text
    return response

def get_content(html):
    reg = re.compile(r'(<span class="mw-headline".*?</span></h2><ul><li>.*</a></li></ul>)', re.S)
    return re.findall(reg, html)

def get_book_url(response):
    reg = r'a href="(.*?)"'
    return re.findall(reg, response)

def get_book_name(response):
    reg = re.compile('>.*</a>')
    return re.findall(reg, response)

def download_book(book_url, path):
    path = ''.join(path.split())
    path = 'F:\\books\\{}.html'.format(path)  # my local file path
    if not os.path.exists(path):
        urllib.request.urlretrieve(book_url, path)
        print('ok!!!')
    else:
        print('no!!!')

def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        book_url = get_book_url(i)
        if book_url:
            book_name = get_book_name(i)
            try:
                download_book(book_url[0], book_name[0])
            except:
                continue

def main():
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main()
I have run the code and get nothing, no tracebacks. How can I download the books automatically from the website?
"I have run the code and get nothing, no tracebacks."
Well, there's no chance you get a traceback in the case of an exception in download_book(), since you explicitly silence them:
try:
    download_book(book_url[0], book_name[0])
except:
    continue
So the very first thing you want to do is to at least print out errors:
try:
    download_book(book_url[0], book_name[0])
except Exception as e:
    print("while downloading book {}: got error {}".format(book_url[0], e))
    continue
or just don't catch the exception at all (at least until you know what to expect and how to handle it).
I don't even know how to fix it
Learning how to debug is actually even more important than learning how to write code. For a general introduction, you want to read this first.
For something more python-specific, here are a couple ways to trace your program execution:
1/ add print() calls at the important places to inspect what you really get
2/ import your module in the interactive python shell and test your functions in isolation (this is easier when none of them depend on global variables)
3/ use the builtin step debugger
Now there are a few obvious issues with your code:
1/ you don't test the result of requests.get() - an HTTP request can fail for quite a few reasons, and the fact that you get a response doesn't mean you got the expected response (you could have a 400+ or 500+ response as well).
2/ you use regexps to parse HTML. DON'T - regexps cannot reliably work on HTML; you want a proper HTML parser instead (BeautifulSoup is the canonical solution for web scraping, as it's very tolerant). Also, some of your regexps look quite wrong (greedy match-alls etc.). A sketch addressing both points follows.
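As a minimal illustration of both points - this is a sketch, not a drop-in fix, and the link-harvesting step is an assumption about what you need from the page:

import requests
from bs4 import BeautifulSoup

def get_book_links(start_url):
    response = requests.get(start_url)
    response.raise_for_status()  # fail loudly on a 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(response.text, 'html.parser')
    # Collect (text, href) pairs for every link on the page; filtering down
    # to actual book links is left to the caller.
    return [(a.get_text(), a['href']) for a in soup.find_all('a', href=True)]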
start_url is not defined in main()
You need to use a global variable here; a better (cleaner) approach is to pass in the variable that you are using. In any case, I would expect an error, since start_url is not defined:
def main(start_url):
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main(start_url)
EDIT:
Nevermind, the problem is in this line: content = get_content(get_response(start_url))
The regex in get_content() does not seem to match anything. My suggestion would be to use BeautifulSoup: from bs4 import BeautifulSoup. For more on why you shouldn't parse HTML with regex, see this answer: RegEx match open tags except XHTML self-contained tags
Asking regexes to parse arbitrary HTML is like asking a beginner to write an operating system
As others have said, you get no output because your regex doesn't match anything. The text returned by the initial url has a newline between </h2> and <ul>, so try this instead:
r'(<span class="mw-headline".*?</span></h2>\n<ul><li>.*</a></li></ul>)'
When you fix that one, you will face another error; I suggest some debug printouts like this:
def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        print('[DEBUG] Handling:', i)
        book_url = get_book_url(i)
        print('[DEBUG] book_url:', book_url)
        if book_url:
            book_name = get_book_name(i)
            try:
                print('[DEBUG] book_url[0]:', book_url[0])
                print('[DEBUG] book_name[0]:', book_name[0])
                download_book(book_url[0], book_name[0])
            except:
                continue

IncompleteRead using httplib

I have been having a persistent problem getting an rss feed from a particular website. I wound up writing a rather ugly procedure to perform this function, but I am curious why this happens and whether any higher level interfaces handle this problem properly. This problem isn't really a show stopper, since I don't need to retrieve the feed very often.
I have read a solution that traps the exception and returns the partial content, yet since the incomplete reads differ in the amount of bytes that are actually retrieved, I have no certainty that such solution will actually work.
#!/usr/bin/env python
import os
import sys
import feedparser
from mechanize import Browser
import requests
import urllib2
from httplib import IncompleteRead

url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'

content = feedparser.parse(url)
if 'bozo_exception' in content:
    print content['bozo_exception']
else:
    print "Success!!"
    sys.exit(0)

print "If you see this, please tell me what happened."

# try using mechanize
b = Browser()
r = b.open(url)
try:
    r.read()
except IncompleteRead, e:
    print "IncompleteRead using mechanize", e

# try using urllib2
r = urllib2.urlopen(url)
try:
    r.read()
except IncompleteRead, e:
    print "IncompleteRead using urllib2", e

# try using requests
try:
    r = requests.request('GET', url)
except IncompleteRead, e:
    print "IncompleteRead using requests", e

# this function is old and I categorized it as ...
# "at least it works darnnit!", but I would really like to
# learn what's happening. Please help me put this function into
# eternal rest.
def get_rss_feed(url):
    response = urllib2.urlopen(url)
    read_it = True
    content = ''
    while read_it:
        try:
            content += response.read(1)
        except IncompleteRead:
            read_it = False
    return content, response.info()

content, info = get_rss_feed(url)
feed = feedparser.parse(content)
As already stated, this isn't a mission-critical problem, but a curiosity: even though I can expect urllib2 to have this problem, I am surprised that this error is encountered in mechanize and requests as well. The feedparser module doesn't even throw an error, so checking for errors there depends on the presence of a 'bozo_exception' key.
Edit: I just wanted to mention that both wget and curl perform the function flawlessly, retrieving the full payload correctly every time. I have yet to find a pure Python method that works, excepting my ugly hack, and I am very curious to know what is happening on the backend of httplib. On a lark, I decided to also try this with twill the other day and got the same httplib error.
P.S. There is one thing that also strikes me as very odd. The IncompleteRead happens consistently at one of two breakpoints in the payload. It seems that feedparser and requests fail after reading 926 bytes, yet mechanize and urllib2 fail after reading 1854 bytes. This behavior is consistent, and I am left without explanation or understanding.
At the end of the day, all of the other modules (feedparser, mechanize, and urllib2) call httplib, which is where the exception is being thrown.
Now, first things first, I also downloaded this with wget and the resulting file was 1854 bytes. Next, I tried with urllib2:
>>> import urllib2
>>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
>>> f = urllib2.urlopen(url)
>>> f.headers.headers
['Cache-Control: private\r\n',
'Content-Type: text/xml; charset=utf-8\r\n',
'Server: Microsoft-IIS/7.5\r\n',
'X-AspNet-Version: 4.0.30319\r\n',
'X-Powered-By: ASP.NET\r\n',
'Date: Mon, 07 Jan 2013 23:21:51 GMT\r\n',
'Via: 1.1 BC1-ACLD\r\n',
'Transfer-Encoding: chunked\r\n',
'Connection: close\r\n']
>>> f.read()
< Full traceback cut >
IncompleteRead: IncompleteRead(1854 bytes read)
So it is reading all 1854 bytes but then thinks there is more to come. If we explicitly tell it to read only 1854 bytes it works:
>>> f = urllib2.urlopen(url)
>>> f.read(1854)
'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'
Obviously, this is only useful if we always know the exact length ahead of time. Instead, we can use the fact that the partial read is returned as an attribute on the exception to capture the entire contents:
>>> import httplib
>>> f = urllib2.urlopen(url)
>>> try:
...     contents = f.read()
... except httplib.IncompleteRead as e:
...     contents = e.partial
...
>>> contents
'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'
This blog post suggests this is a fault of the server, and describes how to monkey-patch the httplib.HTTPResponse.read() method with the try..except block above to handle things behind the scenes:
import httplib

def patch_http_response_read(func):
    def inner(*args):
        try:
            return func(*args)
        except httplib.IncompleteRead, e:
            return e.partial
    return inner

httplib.HTTPResponse.read = patch_http_response_read(httplib.HTTPResponse.read)
I applied the patch and then feedparser worked:
>>> import feedparser
>>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
>>> feedparser.parse(url)
{'bozo': 0,
'encoding': 'utf-8',
'entries': ...
'status': 200,
'version': 'rss20'}
This isn't the nicest way of doing things, but it seems to work. I'm not expert enough in the HTTP protocol to say for sure whether the server is doing something wrong or whether httplib is mishandling an edge case.
In my case, sending an HTTP/1.0 request fixed the problem. Just add this to the code:
import httplib
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'
Then do the request:
req = urllib2.Request(url, post, headers)
filedescriptor = urllib2.urlopen(req)
img = filedescriptor.read()
Afterwards, switch back to HTTP/1.1 (for connections that support 1.1):
httplib.HTTPConnection._http_vsn = 11
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.1'
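If you go down this route, it may be safer to do the switch in a try/finally so an exception can't leave httplib stuck on HTTP/1.0. A minimal sketch of that idea (the helper function is mine, not part of the answer above):

import httplib
import urllib2

def fetch_with_http10(url):
    # Temporarily force HTTP/1.0, restoring the previous version afterwards.
    old_vsn = httplib.HTTPConnection._http_vsn
    old_vsn_str = httplib.HTTPConnection._http_vsn_str
    httplib.HTTPConnection._http_vsn = 10
    httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'
    try:
        return urllib2.urlopen(url).read()
    finally:
        httplib.HTTPConnection._http_vsn = old_vsn
        httplib.HTTPConnection._http_vsn_str = old_vsn_str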
I fixed the issue by using HTTPS instead of HTTP and it's working fine. No code change was required.

Python decode AMF response

I'm making a request in python to a web service which returns AMF. I don't know if it's AMF0 or AMF3 yet.
r = requests.post(url, data=data)
>>> r.text
u'\x00\x03...'
(Full data here)
How can I take r.text and convert it to a Python object or similar? I found amfast, but its Decoder class returns 3.131513074181806e-294 assuming AMF0, and None for AMF3 (both incorrect).
import StringIO
from amfast.decoder import Decoder

decoder = Decoder(amf3=False)
obj = decoder.decode(StringIO.StringIO(r.text))
I see two problems with your code. The first is using r.text to return binary data; use r.content instead. The second is using the decoder.decode method, which decodes one object and not a packet; use decoder.decode_packet instead.
from amfast.decoder import Decoder
decoder = Decoder(amf3=True)
obj = decoder.decode_packet(r.content)
Using PyAMF works as well, using r.content.
Have you tried PyAMF?
from pyamf import remoting
remoting.decode(r.content)
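For a sense of what to do with the decoded result: remoting.decode() returns a PyAMF Envelope, whose response messages live in its bodies attribute as (target, message) pairs. A hedged sketch - verify the attribute names against the PyAMF docs:

import requests
from pyamf import remoting

r = requests.post(url, data=data)      # url/data as in the question
envelope = remoting.decode(r.content)  # decode the raw bytes, not r.text
for target, message in envelope.bodies:
    print target, message.body         # message.body holds the decoded Python data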

urllib2 not retrieving entire HTTP response

I'm perplexed as to why I'm not able to download the entire contents of some JSON responses from FriendFeed using urllib2.
>>> import urllib2
>>> stream = urllib2.urlopen('http://friendfeed.com/api/room/the-life-scientists/profile?format=json')
>>> stream.headers['content-length']
'168928'
>>> data = stream.read()
>>> len(data)
61058
>>> # We can see here that I did not retrieve the full JSON
... # given that the stream doesn't end with a closing }
...
>>> data[-40:]
'ce2-003048343a40","name":"Vincent Racani'
How can I retrieve the full response with urllib2?
Best way to get all of the data:
fp = urllib2.urlopen("http://www.example.com/index.cfm")
response = ""
while 1:
    data = fp.read()
    if not data:  # This might need to be `if data == ""` -- can't remember
        break
    response += data
print response
The reason is that .read() isn't guaranteed to return the entire response, given the nature of sockets. I thought this was discussed in the documentation (maybe urllib) but I cannot find it.
Use tcpdump (or something like it) to monitor the actual network interactions - then you can analyze why the site is broken for some client libraries. Ensure that you repeat multiple times by scripting the test, so you can see if the problem is consistent:
import urllib2
url = 'http://friendfeed.com/api/room/friendfeed-feedback/profile?format=json'
stream = urllib2.urlopen(url)
expected = int(stream.headers['content-length'])
data = stream.read()
datalen = len(data)
print expected, datalen, expected == datalen
The site's working consistently for me so I can't give examples of finding failures :)
Keep calling stream.read() until it's done. Note that Python has no assignment-in-condition, so the loop has to look something like this:
while True:
    data = stream.read()
    if not data:
        break
    # ... do stuff with data
readlines() also works.

HTTP POST binary files using Python: concise non-pycurl examples?

I'm interested in writing a short python script which uploads a short binary file (.wav/.raw audio) via a POST request to a remote server.
I've done this with pycurl, which makes it very simple and results in a concise script; unfortunately it also requires that the end user have pycurl installed, which I can't rely on.
I've also seen some examples in other posts which rely only on basic libraries, urllib, urllib2, etc., however these generally seem to be quite verbose, which is also something I'd like to avoid.
I'm wondering if there are any concise examples which do not require the use of external libraries, and which will be quick and easy for 3rd parties to understand - even if they aren't particularly familiar with python.
What I'm using at present looks like,
import pycurl
import urllib
import simplejson

verbose = 0  # set to 1 for pycurl's verbose output

def upload_wav(wavfile, url=None, **kwargs):
    """Upload a wav file to the server, return the response."""

    class responseCallback:
        """Store the server response."""
        def __init__(self):
            self.contents = ''

        def body_callback(self, buf):
            self.contents = self.contents + buf

        def decode(self):
            self.contents = urllib.unquote(self.contents)
            try:
                self.contents = simplejson.loads(self.contents)
            except:
                return self.contents

    t = responseCallback()
    c = pycurl.Curl()
    c.setopt(c.POST, 1)
    c.setopt(c.WRITEFUNCTION, t.body_callback)
    c.setopt(c.URL, url)
    postdict = [
        ('userfile', (c.FORM_FILE, wavfile)),  # wav file to post
    ]
    # If there are extra keyword args, add them to the postdict
    for key in kwargs:
        postdict.append((key, kwargs[key]))
    c.setopt(c.HTTPPOST, postdict)
    c.setopt(c.VERBOSE, verbose)
    c.perform()
    c.close()
    t.decode()
    return t.contents
This isn't exact, but it gives you the general idea. It works great and it's simple for 3rd parties to understand, but it requires pycurl.
POSTing a file requires multipart/form-data encoding and, as far as I know, there's no easy way (i.e. one-liner or something) to do this with the stdlib. But as you mentioned, there are plenty of recipes out there.
Although they seem verbose, your use case suggests that you can probably just encapsulate it once into a function or class and not worry too much, right? Take a look at the recipe on ActiveState and read the comments for suggestions:
Recipe 146306: Http client to POST using multipart/form-data
or see the MultiPartForm class in this PyMOTW, which seems pretty reusable:
PyMOTW: urllib2 - Library for opening URLs.
I believe both handle binary files.
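Since the stdlib has no one-liner for this, here is a minimal sketch of what such a multipart/form-data recipe boils down to (the function name and the final urllib2 usage are illustrative, not taken from either recipe):

import uuid
import urllib2

def encode_multipart(fields, files):
    """fields: dict of name -> str; files: dict of name -> (filename, bytes)."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append('--' + boundary)
        lines.append('Content-Disposition: form-data; name="%s"' % name)
        lines.append('')
        lines.append(value)
    for name, (filename, data) in files.items():
        lines.append('--' + boundary)
        lines.append('Content-Disposition: form-data; name="%s"; filename="%s"' % (name, filename))
        lines.append('Content-Type: application/octet-stream')
        lines.append('')
        lines.append(data)
    lines.append('--' + boundary + '--')
    lines.append('')
    return '\r\n'.join(lines), 'multipart/form-data; boundary=%s' % boundary

body, content_type = encode_multipart({}, {'userfile': ('audio.wav', open('audio.wav', 'rb').read())})
request = urllib2.Request(url, data=body)  # url: your upload endpoint
request.add_header('Content-Type', content_type)
response = urllib2.urlopen(request).read()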
I met a similar issue today. After trying both pycurl and multipart/form-data, I decided to read the Python httplib/urllib2 source code to find out what was going on, and arrived at a comparably good solution:
set the Content-Length header (to the file's size) before doing the POST
pass an opened file when doing the POST
Here is the code:
import urllib2, os

image_path = "png\\01.png"
url = 'http://xx.oo.com/webserviceapi/postfile/'
length = os.path.getsize(image_path)
png_data = open(image_path, "rb")

request = urllib2.Request(url, data=png_data)
request.add_header('Cache-Control', 'no-cache')
request.add_header('Content-Length', '%d' % length)
request.add_header('Content-Type', 'image/png')

res = urllib2.urlopen(request).read().strip()
print res
see my blog post: http://www.2maomao.com/blog/python-http-post-a-binary-file-using-urllib2/
I know this is an old, old thread, but I have a different solution.
If you went through the trouble of building all the magic headers and everything, and are just upset that suddenly a binary file can't pass because the Python library is getting in the way, you can monkey-patch a solution:
import httplib

class HTTPSConnection(httplib.HTTPSConnection):
    def _send_output(self, message_body=None):
        self._buffer.extend(("", ""))
        msg = "\r\n".join(self._buffer)
        del self._buffer[:]
        self.send(msg)
        if message_body is not None:
            self.send(message_body)

httplib.HTTPSConnection = HTTPSConnection
If you are using HTTP:// instead of HTTPS:// then replace all instances of HTTPSConnection above with HTTPConnection.
Before people get upset with me, YES, this is a BAD SOLUTION, but it is a way to fix existing code you really don't want to re-engineer to do it some other way.
Why does this fix it? Go look at the original Python source, httplib.py file.
How's urllib substantially more verbose? You build postdict basically the same way, except you start with:
postdict = [('userfile', open(wavfile, 'rb').read())]
Once you have postdict:
resp = urllib.urlopen(url, urllib.urlencode(postdict))
and then you get and save resp.read(), and maybe unquote and try JSON-loading if needed. Seems like it would actually be shorter! So what am I missing...?
urllib.urlencode doesn't like some kinds of binary data.
