I've been trying to consume the Twitter Streaming API using Python Requests.
There's a simple example in the documentation:
import requests
import json
r = requests.post('https://stream.twitter.com/1/statuses/filter.json',
                  data={'track': 'requests'}, auth=('username', 'password'))
for line in r.iter_lines():
    if line:  # filter out keep-alive new lines
        print json.loads(line)
When I execute this, the call to requests.post() never returns. I've experimented and proved that it is definitely connecting to Twitter and receiving data from the API. However, instead of returning a response object, it just sits there consuming as much data as Twitter sends. Judging by the code above, I would expect requests.post() to return a response object with an open connection to Twitter down which I could continue to receive realtime results.
(To prove it was receiving data, I connected to Twitter using the same credentials in another shell, whereupon Twitter closed the first connection, and the call returned the response object. The r.content attribute contained all the backed up data received while the connection was open.)
The documentation makes no mention of any other steps required to cause requests.post to return before consuming all the supplied data. Other people seem to be using similar code without encountering this problem, e.g. here.
I'm using:
Python 2.7
Ubuntu 11.04
Requests 0.14.0
You need to switch off prefetching; I think that parameter's default changed at some point:
r = requests.post('https://stream.twitter.com/1/statuses/filter.json',
                  data={'track': 'requests'}, auth=('username', 'password'),
                  prefetch=False)
for line in r.iter_lines():
    if line:  # filter out keep-alive new lines
        print json.loads(line)
Note that as of requests 1.x the parameter has been renamed, and now you use stream=True:
r = requests.post('https://stream.twitter.com/1/statuses/filter.json',
                  data={'track': 'requests'}, auth=('username', 'password'),
                  stream=True)
for line in r.iter_lines():
    if line:  # filter out keep-alive new lines
        print json.loads(line)
Ah, I found the answer by reading the code. At some point, a prefetch parameter was added to the post method (and other methods, I assume).
I just needed to add a prefetch=False kwarg to requests.post().
Related
I'm trying to upload a PDF as an attachment to a Trello card using python-requests. I've been unable to get the request in the function below to return anything other than 400: Error parsing body despite significant tweaks (detailed below).
I should note that I'm able to create cards and add URL attachments to them (neither of which require a file upload) without any problems.
Here's the code that handles the POST of the file:
def post_pdf(session, design, card_id):
    attachment = {
        "name": design["campaign_title"] + " - Combined PDF",
        "mimeType": "application/pdf"
    }
    pdf_post = session.post(
        url="https://api.trello.com/1/cards/" + card_id + "/attachments",
        files={"file": open("combined_pdf.pdf", "rb")},
        data=attachment
    )
The authentication key and token are set as Session params when the session is created, so they're not added here.
Also, in the actual code, the POST is handled by a wrapper function that adds some boilerplate error-checking and rate limiting to the request, as well as more-verbose error dumps when a request fails, but I've confirmed (in the above example) that the same error persists without the wrapper.
Adjustments I've tried
Substituting data = attachment with json = attachment
Substituting data = attachment with params = attachment
Omitting attachment completely and POSTing the file with no associated data
Adding stream = True to the request parameters (this doesn't seem to matter for uploads, but I figured it couldn't hurt to try)
Encoding the file as base64 (this encoding has been required elsewhere; I was grasping at straws)
Encoding the file as base64, combined with the above tweaks to data / json / params
Note: The PDF file is potentially a source of the problem - it's generated by converting several images to PDF format and then concatenating them with pdfunite, so I could well have made mistakes in its creation that are causing Trello to reject the file. What seems to confirm this is that Googling for Trello "Error parsing body" returns two hits, only one of which deals with Trello, and neither of which is useful. This leads me to think that this is a particularly odd / rare error message, which suggests I've made some kind of serious error encoding the file.
However, the PDF file opens properly on my (and my coworkers') systems without any error messages, artifacts, or other strange behavior. More importantly, trying this with other "known good" PDFs also fails, with the same error code. Because the file's contents fall within the bounds of "company property / information", I'd like to avoid posting it (and / or the raw request body), but I'll do so if there's agreement that it's causing the problem.
I found the solution: the Content-Type header was set incorrectly because a session-wide setting (Session.headers.update({"Content-Type": "application/json"})) was overriding the multipart/form-data header (and its boundary) when the upload request was sent, which caused Trello to reject the body. I solved the problem by removing the session-level header, which lets requests set the correct content type for each request.
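For illustration, a rough sketch of the fix (the session setup, auth params, and card_id are placeholders carried over from the question, not verified Trello code):
import requests
card_id = "CARD_ID"  # placeholder
session = requests.Session()
session.params = {"key": "YOUR_KEY", "token": "YOUR_TOKEN"}  # hypothetical auth params
# The culprit: a session-wide JSON content type overrides the
# multipart/form-data header (and boundary) that requests generates
# for file uploads, so don't set it on the session:
# session.headers.update({"Content-Type": "application/json"})
# With no session-level Content-Type, requests builds the multipart
# body and its Content-Type header itself for the upload:
with open("combined_pdf.pdf", "rb") as pdf:
    upload = session.post(
        "https://api.trello.com/1/cards/" + card_id + "/attachments",
        files={"file": pdf},
        data={"name": "Combined PDF", "mimeType": "application/pdf"},
    )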
I want to read specific bytes from a remote file using a Python module. I am using urllib2. By specific bytes I mean a range given as Offset, Size. I know we can read the first X bytes of a remote file using urlopen(link).read(X). Is there any way to read data of length Size starting at Offset?
def readSpecificBytes(link, Offset, size):
    # code to be written
This will work with many servers (Apache, etc.), but doesn't always work, esp. not with dynamic content like CGI (*.php, *.cgi, etc.):
import urllib2
def get_part_of_url(link, start_byte, end_byte):
    req = urllib2.Request(link)
    req.add_header('Range', 'bytes=' + str(start_byte) + '-' + str(end_byte))
    resp = urllib2.urlopen(req)
    content = resp.read()
    return content
Note that with this approach the server never has to send, and you never have to download, the data you don't need, which can save a lot of bandwidth if you only want a small piece of a large file.
When it doesn't work, just read (and discard) the first Offset bytes before the part you actually want.
See Wikipedia Article on HTTP headers for more details.
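To match the readSpecificBytes(link, Offset, size) signature from the question, a minimal sketch built on the Range approach might look like this (untested against any particular server; the 206 check is there because servers are free to ignore the header):
import urllib2
def readSpecificBytes(link, offset, size):
    # HTTP ranges are inclusive on both ends, so the last byte is offset + size - 1.
    req = urllib2.Request(link)
    req.add_header('Range', 'bytes=%d-%d' % (offset, offset + size - 1))
    resp = urllib2.urlopen(req)
    if resp.getcode() == 206:  # 206 Partial Content: the server honoured the range
        return resp.read()
    # Otherwise the whole resource is coming back: skip the first `offset` bytes.
    resp.read(offset)
    return resp.read(size)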
Unfortunately the file-like object returned by urllib2.urlopen() doesn't actually have a seek() method. You will need to work around this by doing something like this:
def readSpecificBytes(link, Offset, size):
    f = urllib2.urlopen(link)
    if Offset > 0:
        f.read(Offset)
    return f.read(size)
I'm attempting to send an HTTP request to a website and read the data it returns. The first website I tried worked successfully: it returned about 4 packets of data and then a zero-length packet, which the script caught before terminating.
However, attempting to load http://www.google.com/ does not work this way. Instead, it returns about 10 packets of the same length, a final smaller packet, and then proceeds to time out. Is it normal for this to happen? Does it all just depend on what server the host is using?
If anyone could recommend an alternative way of reading with socket.recv() that takes into account that a final null packet is not always sent, it would be greatly appreciated. Thanks.
try:
    data = s.recv(4096)
    while True:
        more = s.recv(4096)
        print len(more)
        if not more:
            break
        else:
            data += more
except socket.timeout:
    errMsg = "Connection timed-out while connecting to %s. Request headers were as follows: %s" % (parsedUrl.netloc, rHeader.headerContent)
    self.logger.exception(errMsg)
    raise Exception
For HTTP, use requests rather than writing your own.
> ipython
In [1]: import requests
In [2]: r = requests.get('http://www.google.com')
In [3]: r.status_code
Out[3]: 200
In [4]: r.text[:80]
Out[4]: u'<!doctype html><html itemscope="itemscope" itemtype="http://schema.org/WebPage">'
In [5]: len(r.text)
Out[5]: 10969
TCP does not give you "packets", but sequential bytes sent from the other side. It is a stream. recv() gives you chunks of that stream that are currently available. You stitch them back together and parse the stream content.
HTTP is a rather involved protocol to work out by hand, so you probably want to start with an existing library like httplib instead.
It could be that Google uses Keep-Alive to keep the socket open in order to serve a further request. This would require parsing of the header and reading the exact number of bytes.
Depending on which version of HTTP you use, keep-alive is either the default (HTTP/1.1) or has to be requested explicitly with a Connection: Keep-Alive header (HTTP/1.0). (This might be the simplest solution: just use HTTP/1.0 instead of 1.1.)
If you use that feature nevertheless, you would have to receive your first chunk of data and check whether it contains '\r\nContent-Length: '; if so, take the bytes between that and the next '\r\n' and convert them to a number. That is your size.
Then look for '\r\n\r\n' in your data. If it is there, that marks the end of your header. From here, you must read exactly the number of bytes mentioned above.
Example:
import socket
s = socket.create_connection(('www.google.com', 80))
s.send("GET / HTTP/1.1\r\nHost: www.google.com\r\n\r\n")
x = s.recv(10000)
poscl = x.lower().find('\r\ncontent-length: ')
poseoh = x.find('\r\n\r\n')
if poscl < poseoh and poscl >= 0 and poseoh >= 0:
    # found the Content-Length header; its value starts 18 chars in
    # (len of '\r\ncontent-length: ')
    poseocl = x.find('\r\n', poscl + 18)
    cl = int(x[poscl + 18:poseocl])
    realdata = x[poseoh + 4:]
Now you have the content length in cl and the (start of the) payload data in realdata. The number of bytes still missing from this response is missing = cl - len(realdata). If it is 0, you've got everything; if not, call s.recv(missing), append the result to realdata, and recalculate missing until it reaches 0.
The code above is a simple start on the job to be done; there are some places where you might need to recv() further before you can proceed.
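For what it's worth, a minimal sketch of that receive loop, continuing from the variables above (it assumes the complete header already arrived in the first recv(), which is not guaranteed):
missing = cl - len(realdata)
while missing > 0:
    chunk = s.recv(missing)
    if not chunk:  # connection closed early
        break
    realdata += chunk
    missing = cl - len(realdata)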
This is quite complicated. By far easier ways (one of which is sketched after this list) would be:
to use HTTP 1.1's Connection: close header in the request,
to use HTTP 1.0,
to use one of the libraries crafted for this task and not to reinvent the wheel.
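As an illustration of the first two options, a rough sketch using HTTP/1.0 with an explicit Connection: close, so the end of the response is simply the end of the stream and no Content-Length parsing is needed:
import socket
s = socket.create_connection(('www.google.com', 80))
# With Connection: close the server shuts the socket down when the
# response is complete, so recv() eventually returns ''.
s.sendall("GET / HTTP/1.0\r\nHost: www.google.com\r\nConnection: close\r\n\r\n")
chunks = []
while True:
    chunk = s.recv(4096)
    if not chunk:
        break
    chunks.append(chunk)
s.close()
response = ''.join(chunks)
header, _, body = response.partition('\r\n\r\n')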
I have a web service that returns JSON responses when successful. Unfortunately, when I try to test this service via multi-mechanize, I get an error - "not viewing HTML". Obviously it's not viewing HTML; it's getting content clearly marked as JSON. How do I get mechanize to ignore this error and accept the JSON it's getting back?
It turns out mechanize isn't set up to accept JSON responses out of the box. For a quick and dirty solution to this, update mechanize's _headersutil.py file (check /usr/local/lib/python2.7/dist-packages/mechanize).
In the is_html() method, change the line:
html_types = ["text/html"]
to read:
html_types = ["text/html", "application/json"]
I am looking to download a file from a http url to a local file. The file is large enough that I want to download it and save it chunks rather than read() and write() the whole file as a single giant string.
The interface of urllib.urlretrieve is essentially what I want. However, I cannot see a way to set request headers when downloading via urllib.urlretrieve, which is something I need to do.
If I use urllib2, I can set request headers via its Request object. However, I don't see an API in urllib2 to download a file directly to a path on disk like urlretrieve. It seems that instead I will have to use a loop to iterate over the returned data in chunks, writing them to a file myself and checking when we are done.
What would be the best way to build a function that works like urllib.urlretrieve but allows request headers to be passed in?
What is the harm in writing your own function using urllib2?
import os
import sys
import urllib2
def urlretrieve(urlfile, fpath):
    chunk = 4096
    f = open(fpath, "wb")
    while 1:
        data = urlfile.read(chunk)
        if not data:
            print "done."
            break
        f.write(data)
        print "Read %s bytes" % len(data)
    f.close()
and use a Request object to set the headers:
request = urllib2.Request("http://www.google.com")
request.add_header('User-agent', 'Chrome XXX')
urlretrieve(urllib2.urlopen(request), "/tmp/del.html")
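Equivalently, the chunked copy loop can be delegated to shutil.copyfileobj, which does the same read/write in blocks. A sketch (the function name is just a placeholder):
import shutil
import urllib2
def urlretrieve_with_headers(url, fpath, headers=None, chunk=4096):
    # headers, e.g. {'User-agent': 'Chrome XXX'}, are attached to the request
    req = urllib2.Request(url, headers=headers or {})
    resp = urllib2.urlopen(req)
    with open(fpath, 'wb') as f:
        shutil.copyfileobj(resp, f, chunk)
urlretrieve_with_headers("http://www.google.com", "/tmp/del.html", {'User-agent': 'Chrome XXX'})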
If you want to use urllib and urlretrieve, subclass urllib.URLopener and use its addheader() method to adjust the headers (e.g. addheader('Accept', 'sound/basic'), which I'm pulling from the docstring for urllib.addheader).
To install your URLopener for use by urllib, see the example in the urllib._urlopener section of the docs (note the underscore):
import urllib
class MyURLopener(urllib.URLopener):
    pass  # your override here, perhaps to __init__
urllib._urlopener = MyURLopener()
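Putting that together, a sketch of how it might be used (the URL and file path are placeholders):
import urllib
class MyURLopener(urllib.URLopener):
    pass  # override __init__ or version here if needed
opener = MyURLopener()
opener.addheader('Accept', 'sound/basic')  # example header from the docstring
urllib._urlopener = opener
# urllib.urlretrieve now goes through this opener, extra header included.
urllib.urlretrieve('http://www.example.com/file.bin', '/tmp/file.bin')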
However, regarding your comment on the question: you'll be pleased to hear that reading an empty string from read() is indeed the signal to stop. This is how urlretrieve detects the end of a download, for example. TCP/IP and sockets abstract the reading process for you: read() blocks while waiting for additional data unless the connection on the other end has hit EOF and been closed, in which case read()ing from the connection returns an empty string. An empty string means there is no more data trickling in; you don't have to worry about ordered packet re-assembly, as that has all been handled for you. If that's your concern about urllib2, I think you can safely use it.