Is it possible to loop over an httplib.HTTPResponse's data? - python

I'm trying to develop a very simple proof-of-concept to retrieve and process data in a streaming manner. The server I'm requesting from will send data in chunks, which is good, but I'm having issues using httplib to iterate through the chunks.
Here's what I'm trying:
import httplib

def getData(src):
    d = src.read(1024)
    while d and len(d) > 0:
        yield d
        d = src.read(1024)

if __name__ == "__main__":
    con = httplib.HTTPSConnection('example.com', port='8443', cert_file='...', key_file='...')
    con.putrequest('GET', '/path/to/resource')
    response = con.getresponse()
    for s in getData(response):
        print s
        raw_input() # Just to give me a moment to examine each packet
Pretty simple. Just open an HTTPS connection to the server, request a resource, and grab the result, 1024 bytes at a time. I'm definitely making the HTTPS connection successfully, so that's not a problem at all.
However, what I'm finding is that the call to src.read(1024) returns the same thing every time. It only ever returns the first 1024 bytes of the response, apparently never keeping track of a cursor within the file.
So how am I supposed to receive 1024 bytes at a time? The documentation on read() is pretty sparse. I've thought about using urllib or urllib2, but neither seems to be able to make an HTTPS connection.
HTTPS is required, and I am working in a rather restricted corporate environment where packages like Requests are a bit tough to get my hands on. If possible, I'd like to find a solution within Python's standard lib.
// Big Old Fat Edit
Turns out in my original code I had simply forgotten to update the d variable. I initialized it with a read outside the yield loop and never changed it inside the loop. Once I added it back in there, it worked perfectly.
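For the record, the broken version looked roughly like this (reconstructed from memory; the missing line is the whole bug):
def getData(src):
    d = src.read(1024)
    while d and len(d) > 0:
        yield d
        # bug: no second `d = src.read(1024)` here, so the same
        # first chunk gets yielded forever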
So, in short, I'm just a big idiot.

Is your con.putrequest() actually working? Doing a request with that method requires you to also call a bunch of other methods as you can see in the official httplib documentation:
http://docs.python.org/2/library/httplib.html
As an alternative to using the request() method described above, you
can also send your request step by step, by using the four functions
below.
putrequest()
putheader()
endheaders()
send()
Is there any reason why you're not using the default HTTPConnection.request() function?
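If you do want to stick with the step-by-step calls, the minimal sequence would be something like this (a sketch only; the header is a placeholder):
con = httplib.HTTPSConnection('example.com', port='8443', cert_file='...', key_file='...')
con.putrequest('GET', '/path/to/resource')
con.putheader('Accept', '*/*')  # any headers you need go here
con.endheaders()                # must be called before getresponse()
response = con.getresponse()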
Here's a working version for me, using request() instead:
import httplib

def getData(src, chunk_size=1024):
    d = src.read(chunk_size)
    while d:
        yield d
        d = src.read(chunk_size)

if __name__ == "__main__":
    con = httplib.HTTPSConnection('google.com')
    con.request('GET', '/')
    response = con.getresponse()
    for s in getData(response, 8):
        print s
        raw_input() # Just to give me a moment to examine each packet

You can use the seek command to move the cursor along with your read.
This is my attempt at the problem. I apologize if I made it less pythonic in process.
if __name__ == "__main__":
    con = httplib.HTTPSConnection('example.com', port='8443', cert_file='...', key_file='...')
    con.putrequest('GET', '/path/to/resource')
    response = con.getresponse()
    c = 0
    while True:
        response.seek(c*1024, 0)
        data = response.read(1024)
        c += 1
        if len(data) == 0:
            break
        print data
        raw_input()
I hope it is at least helpful.

Related

How to Convert to dictionary in Python

I am working on a large-scale embedded system built using Python, and we are using ZeroMQ to make everything modular. I have sensor data being sent across a ZeroMQ serial port in the form of a Python dictionary, as shown here:
accel_com.publish_message({"ACL_X": ACL_1_X_val})
Where accel_com is a Communicator class we built that wraps the ZeroMQ logic that publishes messages across a port. Here you can see we are sending Dictionaries across.
However, on the other side of the communication port, I have another module that grabs this data using this code:
accel_msg = san.get_last_message("sensor/accelerometer")
accel.ax = accel_msg.get('ACL_X')
accel.ay = accel_msg.get('ACL_Y')
accel.az = accel_msg.get('ACL_Z')
The problem is when I try to treat accel_msg as a Python Dictionary, I get an Error:
'NoneType' object does not have a method 'get()'.
So my guess is the dictionary is not going across the wire correctly. I am not very familiar with Python so I am not sure how to solve this problem.
Expanding on @JoranBeasley's comment:
accel_msg is sometimes None, such as while it's waiting for a message. The solution is to skip over None messages
while True: # waiting indefinitely for messages
    accel_msg = san.get_last_message("sensor/accelerometer")
    if accel_msg: # or more explicitly, if accel_msg is not None:
        accel.ax = accel_msg.get('ACL_X')
        accel.ay = accel_msg.get('ACL_Y')
        accel.az = accel_msg.get('ACL_Z')
        break # if you only want one message. otherwise remove this
    else:
        print accel_msg # which is almost certainly None
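If you want to rule out the Communicator wrapper entirely, a bare-bones pyzmq publish/subscribe round trip that ships a dict as JSON looks roughly like this (the port is made up, and your wrapper may use a different transport or serializer):
import time
import zmq

context = zmq.Context()

# subscriber side (normally a separate process)
sub = context.socket(zmq.SUB)
sub.connect("tcp://localhost:5556")
sub.setsockopt(zmq.SUBSCRIBE, "")   # no topic filtering

# publisher side
pub = context.socket(zmq.PUB)
pub.bind("tcp://*:5556")
time.sleep(0.5)                     # PUB/SUB drops messages sent before the subscriber connects

pub.send_json({"ACL_X": 0.42})      # the dict goes over the wire as JSON
print sub.recv_json().get("ACL_X")  # and comes back as a dict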

Is get_result() a required call for put_async() in Google App Engine

With the new release of GAE 1.5.0, we now have an easy way to do async datastore calls. Are we required to call get_result() after calling put_async()?
For example, if I have a model called MyLogData, can I just call:
put_async(MyLogData(text="My Text"))
right before my handler returns without calling the matching get_result()?
Does GAE automatically block on any pending calls before sending the result to the client?
Note that I don't really care to handle error conditions. i.e. I don't mind if some of these puts fail.
I don't think there is any sure way to know if get_result() is required unless someone on the GAE team verifies this, but I think it's not needed. Here is how I tested it.
I wrote a simple handler:
class DB_TempTestModel(db.Model):
    data = db.BlobProperty()

class MyHandler(webapp.RequestHandler):
    def get(self):
        starttime = datetime.datetime.now()
        lots_of_data = ' '*500000
        if self.request.get('a') == '1':
            db.put(DB_TempTestModel(data=lots_of_data))
            db.put(DB_TempTestModel(data=lots_of_data))
            db.put(DB_TempTestModel(data=lots_of_data))
            db.put(DB_TempTestModel(data=lots_of_data))
        if self.request.get('a') == '2':
            db.put_async(DB_TempTestModel(data=lots_of_data))
            db.put_async(DB_TempTestModel(data=lots_of_data))
            db.put_async(DB_TempTestModel(data=lots_of_data))
            db.put_async(DB_TempTestModel(data=lots_of_data))
        self.response.out.write(str(datetime.datetime.now()-starttime))
I ran it a bunch of times on a High Replication Application.
The data was always there, making me believe that unless there is a failure in the datastore side of things (unlikely), it's gonna be written.
Here's the interesting part. When the data is written with put_async() (?a=2), the time to process the request was on average about 2 to 3 times shorter than with put() (?a=1) (not a very scientific test - just eyeballing it).
But the cpu_ms and api_cpu_ms were the same for both ?a=1 and ?a=2.
From the logs:
ms=440 cpu_ms=627 api_cpu_ms=580 cpm_usd=0.036244
vs
ms=149 cpu_ms=627 api_cpu_ms=580 cpm_usd=0.036244
On the client side, looking at the network latency of the requests, I saw the same results, i.e. ?a=2 requests were at least 2 times faster. Definitely a win on the client side... but there seems to be no gain on the server side.
Anyone on the GAE team care to comment?
db.put_async() works fine without get_result() when deployed (in fire-and-forget style), but locally it won't take action until get_result() is called (more context).
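If you want the dev server and production to behave the same way, one option is to hold on to the async handles and drain them before the handler returns, roughly like this (a sketch based on the test handler above):
rpcs = [db.put_async(DB_TempTestModel(data=lots_of_data)) for _ in range(4)]
# ... do other work while the puts are in flight ...
for rpc in rpcs:
    rpc.get_result()  # blocks until that put has finished (or raises on failure)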
I dunno, but this works:
import datetime
from google.appengine.api import urlfetch

def main():
    rpc = urlfetch.create_rpc()
    urlfetch.make_fetch_call(rpc, "some://artificially/slow.url")
    print "Content-type: text/plain"
    print
    print str(datetime.datetime.now())

if __name__ == '__main__':
    main()
The remote URL sleeps 3 seconds and then sends me an email. The App Engine handler returns immediately, and the remote URL completes as expected. Since both services abstract the same underlying RPC framework, I would guess the datastore behaves similarly.
Good question, though. Perhaps Nick or another Googler can answer definitively.

Record streaming and saving internet radio in python

I am looking for a Python snippet to read an internet radio stream (.asx, .pls, etc.) and save it to a file.
The final project is a cron'ed script that will record an hour or two of internet radio and then transfer it to my phone for playback during my commute. (3G is kind of spotty along my commute.)
Any snippets or pointers are welcome.
The following has worked for me using the requests library to handle the http request.
import requests

stream_url = 'http://your-stream-source.com/stream'

r = requests.get(stream_url, stream=True)
with open('stream.mp3', 'wb') as f:
    try:
        for block in r.iter_content(1024):
            f.write(block)
    except KeyboardInterrupt:
        pass
That will save a stream to the stream.mp3 file until you interrupt it with ctrl+C.
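Since you want a cron'ed recording of an hour or two rather than a manual Ctrl+C, you can swap the interrupt for a simple time limit (same idea; the duration here is just an example):
import time
import requests

stream_url = 'http://your-stream-source.com/stream'
duration = 2 * 60 * 60  # seconds to record, e.g. two hours

r = requests.get(stream_url, stream=True)
end_time = time.time() + duration

with open('stream.mp3', 'wb') as f:
    for block in r.iter_content(1024):
        f.write(block)
        if time.time() >= end_time:
            break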
So after tinkering and playing with it, I've found Streamripper to work best. This is the command I use:
streamripper http://yp.shoutcast.com/sbin/tunein-station.pls?id=1377200 -d ./streams -l 10800 -a tb$FNAME
If you find that your requests or urllib.request call in Python 3 fails to save a stream because you receive "ICY 200 OK" in return instead of an "HTTP/1.0 200 OK" header, you need to tell the underlying functions ICY 200 OK is OK!
What you can effectively do is intercept the routine that handles reading the status after opening the stream, just before processing the headers.
Simply put a routine like this above your stream opening code.
def NiceToICY(self):
    class InterceptedHTTPResponse():
        pass
    import io
    line = self.fp.readline().replace(b"ICY 200 OK\r\n", b"HTTP/1.0 200 OK\r\n")
    InterceptedSelf = InterceptedHTTPResponse()
    InterceptedSelf.fp = io.BufferedReader(io.BytesIO(line))
    InterceptedSelf.debuglevel = self.debuglevel
    InterceptedSelf._close_conn = self._close_conn
    return ORIGINAL_HTTP_CLIENT_READ_STATUS(InterceptedSelf)
Then put these lines at the start of your main routine, before you open the URL.
ORIGINAL_HTTP_CLIENT_READ_STATUS = urllib.request.http.client.HTTPResponse._read_status
urllib.request.http.client.HTTPResponse._read_status = NiceToICY
They will override the standard routine (this one time only) and run the NiceToICY function in place of the normal status check when it has opened the stream. NiceToICY replaces the unrecognised status response, then copies across the relevant bits of the original response which are needed by the 'real' _read_status function. Finally the original is called and the values from that are passed back to the caller and everything else continues as normal.
I have found this to be the simplest way to get round the problem of the status message causing an error. Hope it's useful for you, too.
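For completeness, once the patch is installed the download itself is just an ordinary urlopen loop, something like this (the stream URL and chunk size are placeholders):
import urllib.request

# assumes NiceToICY has been installed as shown above
with urllib.request.urlopen('http://example.com/stream') as resp:
    with open('stream.mp3', 'wb') as out:
        while True:
            chunk = resp.read(1024)
            if not chunk:
                break
            out.write(chunk)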
I am aware this is a year old, but this is still a viable question, which I have recently been fiddling with.
Most internet radio stations will give you a choice of download type; I choose the MP3 version, then read the info from a raw socket and write it to a file. The trick is figuring out how fast your download is compared to playing the song, so you can strike a balance on the read/write size. This would be in your buffer def.
Now that you have the file, it is fine to simply leave it on your drive (record), but most players will delete the already-played chunk from the file and clear the file off the drive and out of RAM when streaming is stopped.
I have used some code snippets from a file-archiving (without compression) app to handle a lot of the file handling, playing, and buffering magic. It's very similar in how the process flows. If you write up some pseudo-code (which I highly recommend) you can see the similarities.
I'm only familiar with how shoutcast streaming works (which would be the .pls file you mention):
You download the pls file, which is just a playlist. Its format is fairly simple: it's just a text file that points to where the real stream is.
You can connect to that stream since it's just HTTP, and it streams either MP3 or AAC. For your use, just save every byte you get to a file and you'll end up with an MP3 or AAC file you can transfer to your mp3 player.
Shoutcast has one optional addition: metadata. You can find how that works here, but it's not really needed.
If you want a sample application that does this, let me know and I'll make up something later.
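In the meantime, the bare skeleton is short enough to sketch. This assumes the usual File1= entries in the .pls file and a plain HTTP stream behind it (the URLs are placeholders, and note that some servers reply with the ICY status line discussed in another answer here, in which case urllib2 will choke):
import urllib2

# grab the playlist and pull the first stream URL out of it
pls = urllib2.urlopen('http://example.com/station.pls').read()
stream_url = next(line.split('=', 1)[1] for line in pls.splitlines()
                  if line.startswith('File1='))

# connect to the stream and dump the raw MP3/AAC bytes to a file
stream = urllib2.urlopen(stream_url)
with open('capture.mp3', 'wb') as out:
    while True:
        chunk = stream.read(1024)
        if not chunk:
            break
        out.write(chunk)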
In line with the answer from https://stackoverflow.com/users/1543257/dingles (https://stackoverflow.com/a/41338150), here's how you can achieve the same result with the asynchronous HTTP client library - aiohttp:
import functools

import aiohttp
from aiohttp.client_proto import ResponseHandler
from aiohttp.http_parser import HttpResponseParserPy


class ICYHttpResponseParser(HttpResponseParserPy):
    def parse_message(self, lines):
        if lines[0].startswith(b"ICY "):
            lines[0] = b"HTTP/1.0 " + lines[0][4:]
        return super().parse_message(lines)


class ICYResponseHandler(ResponseHandler):
    def set_response_params(
        self,
        *,
        timer = None,
        skip_payload = False,
        read_until_eof = False,
        auto_decompress = True,
        read_timeout = None,
        read_bufsize = 2 ** 16,
        timeout_ceil_threshold = 5,
    ) -> None:
        # this is a copy of the implementation from here:
        # https://github.com/aio-libs/aiohttp/blob/v3.8.1/aiohttp/client_proto.py#L137-L165
        self._skip_payload = skip_payload
        self._read_timeout = read_timeout
        self._reschedule_timeout()
        self._timeout_ceil_threshold = timeout_ceil_threshold
        self._parser = ICYHttpResponseParser(
            self,
            self._loop,
            read_bufsize,
            timer=timer,
            payload_exception=aiohttp.ClientPayloadError,
            response_with_body=not skip_payload,
            read_until_eof=read_until_eof,
            auto_decompress=auto_decompress,
        )
        if self._tail:
            data, self._tail = self._tail, b""
            self.data_received(data)


class ICYConnector(aiohttp.TCPConnector):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._factory = functools.partial(ICYResponseHandler, loop=self._loop)
This can then be used as follows:
session = aiohttp.ClientSession(connector=ICYConnector())
async with session.get("url") as resp:
    print(resp.status)
Yes, it's using a few private classes and attributes, but this is the only way to change the handling of something that's part of the HTTP spec and (theoretically) should never need to be changed by the library's user...
All things considered, I would say this is still rather clean in comparison to monkey patching which would cause the behavior to be changed for all requests (especially true for asyncio where setting before and resetting after a request does not guarantee that something else won't make a request while request to ICY is being made). This way, you can dedicate a ClientSession object specifically for requests to servers that respond with the ICY status line.
Note that this comes with a performance penalty for requests made with ICYConnector - in order for this to work, I am using the pure-Python implementation of HttpResponseParser, which is going to be slower than the one aiohttp uses by default, which is written in C. This cannot really be done differently without vendoring the whole library, as the behavior for parsing the status line is deeply hidden in the C code.

How to solve Python memory leak when using urllib2?

I'm trying to write a simple Python script for my mobile phone that periodically loads a web page using urllib2. In fact I don't really care about the server response; I'd only like to pass some values in the URL to the PHP. The problem is that Python for S60 uses the old 2.5.4 Python core, which seems to have a memory leak in the urllib2 module. From what I've read, there seem to be such problems in every type of network communication as well. This bug was reported here a couple of years ago, and some workarounds were posted as well. I've tried everything I could find on that page, and with the help of Google, but my phone still runs out of memory after ~70 page loads. Strangely, the garbage collector does not seem to make any difference either, except for making my script much slower. It is said that the newer (3.1) core solves this issue, but unfortunately I can't wait a year (or more) for the S60 port to come.
here's how my script looks after adding every little trick I've found:
import urllib2, httplib, gc

while True:
    url = "http://something.com/foo.php?parameter=" + value
    f = urllib2.urlopen(url)
    f.read(1)
    f.fp._sock.recv = None # hacky avoidance
    f.close()
    del f
    gc.collect()
Any suggestions, how to make it work forever without getting the "cannot allocate memory" error?
Thanks in advance,
cheers, b_m
update:
I've managed to connect 92 times before it ran out of memory, but it's still not good enough.
update2:
I tried the socket method as suggested earlier; this is the second-best (still wrong) solution so far:
class UpdateSocketThread(threading.Thread):
    def run(self):
        global data
        while 1:
            url = "/foo.php?parameter=%d" % data
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.connect(('something.com', 80))
            s.send('GET ' + url + ' HTTP/1.0\r\n\r\n')
            s.close()
            sleep(1)
I tried the little tricks from above, too. The thread closes after ~50 uploads (the phone has 50 MB of memory free; obviously the Python shell does not).
UPDATE:
I think I'm getting closer to the solution! I tried sending multiple data without closing and reopening the socket. This may be the key since this method will only leave one open file descriptor. The problem is:
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("something.com", 80))
s.send("test") # returns 4 (sent bytes, which is cool)
s.send("test") # 4
s.send("test") # 4
s.send("GET /foo.php?parameter=bar HTTP/1.0\r\n\r\n") # returns the number of sent bytes, ok
s.send("GET /foo.php?parameter=bar HTTP/1.0\r\n\r\n") # returns 0 on the phone, error on Windows 7*
s.send("GET /foo.php?parameter=bar HTTP/1.0\r\n\r\n") # returns 0 on the phone, error on Windows 7*
s.send("test") # returns 0, strange...
*: error message: 10053, software caused connection abort
Why can't I send multiple messages??
Using the test code from your link, I tested my Python installation and confirmed that it indeed leaks. But if, as @Russell suggested, I put each urlopen in its own process, the OS should clean up the memory leaks. In my tests, memory, unreachable objects and open files all remain more or less constant. I split the code into two files:
connection.py
import cPickle, sys, urllib2

def connectFunction(queryString):
    conn = urllib2.urlopen('http://something.com/foo.php?parameter=' + str(queryString))
    data = conn.read()
    outfile = open('sometempfile', 'wb')
    cPickle.dump(data, outfile)
    outfile.close()

if __name__ == '__main__':
    connectFunction(sys.argv[1])
launcher.py
import subprocess, cPickle

# code from your link to check the number of unreachable objects
def print_unreachable_len():
    # check memory on memory leaks
    import gc
    gc.set_debug(gc.DEBUG_SAVEALL)
    gc.collect()
    unreachableL = []
    for it in gc.garbage:
        unreachableL.append(it)
    return len(str(unreachableL))

# my code
if __name__ == '__main__':
    print 'Before running a single process:', print_unreachable_len()
    return_value_list = []
    for i, value in enumerate(values): # where values is a list or a generator containing (or yielding) the parameters to pass to the URL
        subprocess.call(['python', 'connection.py', str(value)])
        print 'after running', i, 'processes:', print_unreachable_len()
        infile = open('sometempfile', 'rb')
        return_value_list.append(cPickle.load(infile))
        infile.close()
Obviously, this is sequential, so you will only execute a single connection at a time, which may or may not be an issue for you. If it is, you will have to find a non-blocking way of communicating with the processes you're launching, but I'll leave that as an exercise for you.
EDIT: On re-reading your question, it seems you don't care about the server response. In that case, you can get rid of all the pickling related code. And obviously, you won't have the print_unreachable_len() related bits in your final code either.
There is a reference cycle in urllib2, created at urllib2.py:1216. The issue is ongoing and has existed since 2009.
https://bugs.python.org/issue1208304
I think this is probably your problem. To summarize that thread, there's a memory leak in Pys60's DNS lookup, and you can work around it by moving DNS lookup outside the inner loop.
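In practice that workaround amounts to resolving the host once and connecting by IP while keeping the Host header correct, something like this (a sketch, untested on S60; the host and parameter come from your snippet):
import socket
import urllib2

host = 'something.com'
ip = socket.gethostbyname(host)   # resolve once, outside the loop

while True:
    url = "http://%s/foo.php?parameter=%s" % (ip, value)
    req = urllib2.Request(url, headers={'Host': host})  # keep the Host header pointing at the real name
    f = urllib2.urlopen(req)
    f.read(1)
    f.close()
    del f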
This seems like a (very!) hacky workaround, but a bit of googling found this comment on the problem:
Apparently adding f.read(1) will stop the leaking!
import urllib2
f = urllib2.urlopen('http://www.google.com')
f.read(1)
f.close()
EDIT: oh, I see you already have f.read(1)... I'm all out of ideas then :/
Consider using the low-level socket API (related howto) instead of urllib2.
import socket

HOST = 'daring.cwi.nl' # The remote host
PORT = 50007 # The same port as used by the server
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((HOST, PORT))
s.send('GET /path/to/file/index.html HTTP/1.0\n\n')

# you'll need to figure out how much data to read and read that exactly
# or wait for read() to return data of zero length (I think!)
DATA_SZ = 1024
data = s.recv(DATA_SZ)
s.close()
print 'Received', repr(data)
How to execute and read an HTTP request via low-level sockets is a bit beyond the scope of this question (and might make a good question on its own on Stack Overflow; I searched but didn't see it), but I hope this points you toward a solution that may resolve your problem!
Edit: An answer here about using makefile may be helpful: HTTP basic authentication using sockets in python
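For what it's worth, the "wait for read() to return data of zero length" approach mentioned in the comment of the snippet above would look roughly like this, in place of the single recv() call:
chunks = []
while True:
    chunk = s.recv(1024)
    if not chunk:          # server closed the connection, nothing left to read
        break
    chunks.append(chunk)
data = ''.join(chunks)
s.close()
print 'Received', len(data), 'bytes'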
This does not leak for me with Python 2.6.1 on a Mac. Which version are you using?
BTW, your program doesn't work due to a few typos. Here is one that does work:
import urllib2, httplib, gc

value = "foo"
count = 0
while True:
    url = "http://192.168.1.1/?parameter=" + value
    f = urllib2.urlopen(url)
    f.read(1)
    f.fp._sock.recv = None # hacky avoidance
    f.close()
    del f
    print "count=", count
    count += 1
Depending on the platform and Python version, Python might not release memory back to the OS. See this stackoverflow thread. That said, Python should not endlessly consume memory. Judging from the code you use, it appears to be a bug in the Python runtime, unless urllib/sockets use globals, which I don't believe they do - blame it on Python on S60!
Have you considered other sources of memory leakage? An endlessly open log file, an ever-growing array, or something like that? If it truly is a bug in the sockets interface, then your only option is to use the subprocess approach.

Python telnetlib: surprising problem

I am using the Python module telnetlib to create a telnet session (with a chess server), and I'm having an issue I really can't wrap my brain around. The following code works perfectly:
>>> f = login("my_server") #code for login(host) below.
>>> f.read_very_eager()
This spits out everything the server usually prints upon login. However, when I put it inside a function and then call it thus:
>>> def foo():
...     f = login("my_server")
...     return f.read_very_eager()
...
>>> foo()
I get nothing (the empty string). I can check that the login is performed properly, but for some reason I can't see the text. So where does it get swallowed?
Many thanks.
For completeness, here is login(host):
def login(host, handle="guest", password=""):
    try:
        f = telnetlib.Telnet(host) # connect to host
    except:
        raise Error("Could not connect to host")
    f.read_until("login: ")
    try:
        f.write(handle + "\n\r")
    except:
        raise Error("Could not write username to host")
    if handle == "guest":
        f.read_until(":\n\r")
    else:
        f.read_until("password: ")
    try:
        f.write(password + "\n\r")
    except:
        raise Error("Could not write password to host")
    return f
The reason this works when you try it out manually but not in a function is that, when you type it interactively, the server has enough time to react to the login and send data back. When it's all in one function, you send the password to the server and never wait long enough for the server to reply.
If you prefer a (probably more correct) technical answer:
In telnetlib.py (c:\python26\Lib\telnetlib.py on my Windows computer), the function read_very_eager(self) calls self.sock_avail(). Now, sock_avail(self) does the following:
def sock_avail(self):
    """Test whether data is available on the socket."""
    return select.select([self], [], [], 0) == ([self], [], [])
What this does is really simple: if there is -anything- to read from our socket (the server has answered), it'll return True, otherwise it'll return False.
So, what read_very_eager(self) does is: check if there is anything available to read. If there is, then read from the socket, otherwise just return an empty string.
If you look at the code of read_some(self) you'll see that it doesn't check if there is any data available to read. It'll try reading till there is something available, which means that if the server takes for instance 100ms before answering you, it'll wait 100ms before returning the answer.
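One way around the race, without hard-coding a sleep, is to wait for the server's prompt with a timeout. The prompt string below is a guess, so substitute whatever your chess server actually prints after login:
def foo():
    f = login("my_server")
    # block for up to 5 seconds until the prompt appears,
    # then return everything read so far
    return f.read_until("fics% ", 5)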
I'm having the same trouble as you. Unfortunately, the combination of select.select (which I have in a while loop until I am able to read) followed by read_some() does not work for me; it still only reads 1% of the actual output. If I put a time.sleep(10) in before the read and do a read_very_eager(), it seems to work... This is a very crude way of doing things, but it does work. I wish there were a better answer, and I wish I had more reputation points so I could respond to user387821 and see if he has any additional tips.
