Yet another producer/consumer problem in Twisted Python

I am building a server which stores key/value data on top of Redis using Twisted Python.
The server receives a JSON dictionary via HTTP, which is converted into a Python dictionary and put in a buffer. Every time new data is stored, the server schedules a task which pops one dictionary from the buffer and writes every tuple into a Redis instance, using a txredis client.
class Datastore(Resource):
    isLeaf = True

    def __init__(self):
        self.clientCreator = protocol.ClientCreator(reactor, Redis)
        d = self.clientCreator.connectTCP(...)
        d.addCallback(self.setRedis)
        self.redis = None
        self.buffer = deque()

    def render_POST(self, request):
        try:
            task_id = request.requestHeaders.getRawHeaders('x-task-id')[0]
        except IndexError:
            request.setResponseCode(503)
            return '<html><body>Error reading task_id</body></html>'
        data = json.loads(request.content.read())
        self.buffer.append((task_id, data))
        reactor.callLater(0, self.write_on_redis)
        return ' '

    @defer.inlineCallbacks
    def write_on_redis(self):
        try:
            task_id, dic = self.buffer.pop()
            log.msg('Buffer: %s' % len(self.buffer))
        except IndexError:
            log.msg('buffer empty')
            defer.returnValue(1)

        m = yield self.redis.sismember('DONE', task_id)
        # Simple check
        if m == '1':
            log.msg('%s already stored' % task_id)
        else:
            log.msg('%s unpacking' % task_id)
            s = yield self.redis.sadd('DONE', task_id)
            d = defer.Deferred()
            for k, v in dic.iteritems():
                k = k.encode()
                d.addCallback(self.redis.push, k, v)
            d.callback(None)
Basically, I am facing a producer/consumer problem between two different connections, but I am not sure that the current implementation works well in the Twisted paradigm.
I have read the small documentation about producer/consumer interfaces in Twisted, but I am not sure if I can use them in my case.
Any criticism is welcome: I am trying to get a grasp of event-driven programming, after too many years of thread concurrency.

The producer and consumer APIs in Twisted, IProducer and IConsumer, are about flow control. You don't seem to have any flow control here, you're just relaying messages from one protocol to another.
Since there's no flow control, the buffer is just extra complexity. You could get rid of it by just passing the data directly to the write_on_redis method. This way write_on_redis doesn't need to handle the empty buffer case, you don't need the extra attribute on the resource, and you can even get rid of the callLater (although you can also do this even if you keep the buffer).
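For illustration, a minimal sketch of that, assuming write_on_redis is changed to accept the task directly:
def render_POST(self, request):
    task_id = request.requestHeaders.getRawHeaders('x-task-id')[0]
    data = json.loads(request.content.read())
    # hand the work straight to write_on_redis; no buffer, no callLater
    self.write_on_redis(task_id, data)
    return ' '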
I don't know if any of this answers your question, though. As far as whether this approach "works well", here are the things I notice just by reading the code:
If data arrives faster than redis accepts it, your list of outstanding jobs may become arbitrarily large, causing you to run out of memory. This is what flow control would help with.
With no error handling around the sismember call or the sadd call, you may lose tasks if either of these fail, since you've already popped them from the work buffer.
Doing a push as a callback on that Deferred d also means that any failed push will prevent the rest of the data from being pushed. It also passes the result of the Deferred returned by push (I'm assuming it returns a Deferred) as the first argument to the next call, so unless push more or less ignores its first argument, you won't be pushing the right data to redis.
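A minimal sketch of the alternative (an assumption on my part: that txredis's push returns a Deferred, and that the data is passed in as arguments instead of popped from a buffer) is to yield each push inside the inlineCallbacks generator:
@defer.inlineCallbacks
def write_on_redis(self, task_id, dic):
    yield self.redis.sadd('DONE', task_id)
    for k, v in dic.iteritems():
        try:
            # yield each push so they run one after another, each with the
            # right arguments; a failure only affects this key
            yield self.redis.push(k.encode(), v)
        except Exception:
            log.err(None, 'push failed for key %r of task %s' % (k, task_id))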
If you want to implement flow control, then you need to have your HTTP server check the length of self.buffer and possibly reject the new task - not adding it to self.buffer and returning some error code to the client. You still won't be using IConsumer and IProducer, but it's sort of similar.
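A rough sketch of that check, where MAX_PENDING is a made-up limit:
MAX_PENDING = 1000  # hypothetical bound on queued tasks

def render_POST(self, request):
    if len(self.buffer) >= MAX_PENDING:
        # crude back-pressure: refuse the task and let the client retry later
        request.setResponseCode(503)
        return '<html><body>Too busy, retry later</body></html>'
    # ... otherwise parse, buffer and schedule write_on_redis as before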

Related

Python script with multiple threads works normally only in debug mode

I am currently working with one Python 2.7 script with multiple threads. One of the threads listens for JSON data in long-polling mode and parses it after receiving, or times out after some period. I noticed that it works as expected only in debug mode (I use Wing IDE). In a normal run, this particular thread seems to hang after the first GET request, before entering the "for" loop. The loop condition doesn't affect the result. At the same time the other threads continue to work normally.
I believe this is related to multi-threading. How do I properly troubleshoot and fix this issue?
Below is the code of the class responsible for the long-polling job.
class Listener(threading.Thread):
    def __init__(self, router, *args, **kwargs):
        self.stop = False
        self._cid = kwargs.pop("cid", None)
        self._auth = kwargs.pop("auth", None)
        self._router = router
        self._c = webclient.AAHWebClient()
        threading.Thread.__init__(self, *args, **kwargs)

    def run(self):
        while True:
            try:
                # Data items that should be routed to the device is retrieved by doing a
                # long polling GET request on the "/tunnel" resource. This will block until
                # there are data items available, or the request times out
                log.info("LISTENER: Waiting for data...")
                response = self._c.send_request("GET", self._cid, auth=self._auth)
                # A timed out request will not contain any data
                if len(response) == 0:
                    log.info("LISTENER: No data this time")
                else:
                    items = response["resources"]["tunnel"]
                    undeliverable = []
                    #print items # - reaching this point, able to return output
                    for item in items:
                        # The data items contains the data as a base64 encoded string and the
                        # external reference ID for the device that should receive it
                        extId = item["extId"]
                        data = base64.b64decode(item["data"])
                        # Try to deliver the data to the device identified by "extId"
                        if not self._router.route(extId, data):
                            item["message"] = "Could not be routed"
                            undeliverable.append(item)
                    # Data items that for some reason could not be delivered to the device should
                    # be POST:ed back to the "/tunnel" resource as "undeliverable"
                    if len(undeliverable) > 0:
                        log.warning("LISTENER: Sending error report...")
                        response = self._c.send_request("POST", "/tunnel", body={"undeliverable": undeliverable}, auth=self._auth)
            except webclient.RequestError as e:
                log.error("LISTENER: ERROR %d - %s", e.status, e.response)
UPD:
class Router:
    def route(self, extId, data):
        log.info("ROUTER: Received data for %s: %s", extId, repr(data))
        # nothing special
        return True
If you're using the CPython interpreter, you're not actually getting true system-level parallel threading:
CPython implementation detail: In CPython, due to the Global
Interpreter Lock, only one thread can execute Python code at once
(even though certain performance-oriented libraries might overcome
this limitation). If you want your application to make better use of
the computational resources of multi-core machines, you are advised to
use multiprocessing. However, threading is still an appropriate model
if you want to run multiple I/O-bound tasks simultaneously.
So your process is probably locking up while listening on the first request, because you are long polling.
Multi-processing might be a better choice. I haven't tried it with long polling but the Twisted framework might also work in your situation.
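If you want to try multiprocessing, a minimal sketch could look like this (run_listener and the constructor arguments are assumptions based on your Listener class, not tested against your code):
from multiprocessing import Process

def run_listener(router, cid, auth):
    # the long-polling loop runs in a child process with its own interpreter
    # and its own GIL, so it cannot stall the main process
    listener = Listener(router, cid=cid, auth=auth)
    listener.run()

listener_proc = Process(target=run_listener, args=(router, cid, auth))
listener_proc.daemon = True
listener_proc.start()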

gevent / requests hangs while making lots of head requests

I need to make 100k head requests, and I'm using gevent on top of requests. My code runs for a while, but then eventually hangs. I'm not sure why it's hanging, or whether it's hanging inside requests or gevent. I'm using the timeout argument inside both requests and gevent.
Please take a look at my code snippet below, and let me know what I should change.
import datetime
import gevent
from gevent import monkey, pool
monkey.patch_all()
import requests

def get_head(url, timeout=3):
    try:
        return requests.head(url, allow_redirects=True, timeout=timeout)
    except:
        return None

def expand_short_urls(short_urls, chunk_size=100, timeout=60*5):
    chunk_list = lambda l, n: (l[i:i+n] for i in range(0, len(l), n))
    p = pool.Pool(chunk_size)
    print 'Expanding %d short_urls' % len(short_urls)
    results = {}
    for i, _short_urls_chunked in enumerate(chunk_list(short_urls, chunk_size)):
        print '\t%d. processing %d urls # %s' % (i, chunk_size, str(datetime.datetime.now()))
        jobs = [p.spawn(get_head, _short_url) for _short_url in _short_urls_chunked]
        gevent.joinall(jobs, timeout=timeout)
        results.update({_short_url: job.get().url for _short_url, job in zip(_short_urls_chunked, jobs)
                        if job.get() is not None and job.get().status_code == 200})
    return results
I've tried grequests, but it's been abandoned, and I've gone through the github pull requests, but they all have issues too.
The RAM usage you are observing mainly stems from all the data that piles up while storing 100,000 response objects, and all the underlying overhead. I have reproduced your application case and fired off HEAD requests against 15,000 URLs from the top Alexa ranking. It did not really matter
- whether I used a gevent Pool (i.e. one greenlet per connection) or a fixed set of greenlets, all requesting multiple URLs
- how large I set the pool size
In the end, the RAM usage grew over time, to considerable amounts. However, I noticed that changing from requests to urllib2 already led to a reduction in RAM usage, by about a factor of two. That is, I replaced
result = requests.head(url)
with
request = urllib2.Request(url)
request.get_method = lambda : 'HEAD'
result = urllib2.urlopen(request)
Some other advice: do not use two timeout mechanisms. Gevent's timeout approach is very solid, and you can easily use it like this:
from gevent import Timeout
import requests

def gethead(url):
    result = None
    try:
        with Timeout(5, False):
            result = requests.head(url)
    except Exception as e:
        result = e
    return result
It might look tricky, but it either returns None (after quite precisely 5 seconds, indicating a timeout), an exception object representing a communication error, or the response. Works great!
Although this likely is not part of the issue, in such cases I recommend keeping workers alive and letting them work on multiple items each! The overhead of spawning greenlets is small, indeed. Still, here is a very simple solution with a set of long-lived greenlets:
from gevent import spawn, joinall
from gevent.queue import Queue, Empty

def qworker(qin, qout):
    # keep pulling URLs until the input queue is drained
    while True:
        try:
            qout.put(gethead(qin.get(block=False)))
        except Empty:
            break

qin = Queue()
qout = Queue()

for url in urls:
    qin.put(url)

workers = [spawn(qworker, qin, qout) for i in xrange(POOLSIZE)]
joinall(workers)
returnvalues = [qout.get() for _ in xrange(len(urls))]
Also, you really need to appreciate that this is a large-scale problem you are tackling here, yielding non-standard issues. When I reproduced your scenario with a timeout of 20 s, 100 workers and 15,000 URLs to be requested, I easily got a large number of sockets:
# netstat -tpn | wc -l
10074
That is, the OS had more than 10,000 sockets to manage, most of them in TIME_WAIT state. I also observed "Too many open files" errors, and tuned the limits up via sysctl. When you request 100,000 URLs you will probably hit such limits too, and you need to come up with measures to prevent the system from starving.
Also note that the way you are using requests, it automatically follows redirects from HTTP to HTTPS and automatically verifies the certificate, all of which surely costs RAM.
In my measurements, when I divided the number of requested URLs by the runtime of the program, I almost never passed 100 responses/s, which is the result of the high-latency connections to foreign servers all over the world. I guess you are also affected by such a limit. Adjust the rest of the architecture to this limit, and you will probably be able to generate a data stream from the Internet to disk (or database) without so much RAM usage in between.
I should address your two main questions, specifically:
I think gevent/the way you are using it is not your problem. I think you are just underestimating the complexity of your task. It comes along with nasty problems, and drives your system to its limits.
your RAM usage issue: Start off by using urllib2, if you can. Then, if things still accumulate too much, you need to work against accumulation. Try to produce a steady state: you might want to start writing data off to disk and generally work toward a situation where objects can become garbage collected.
your code "eventually hangs": probably this is because of your RAM issue. If it is not, then do not spawn so many greenlets, but reuse them as indicated. Also, further reduce concurrency, monitor the number of open sockets, increase system limits if necessary, and try to find out exactly where your software hangs.
I'm not sure if this will resolve your issue, but you are not using pool.Pool() correctly.
Try this:
def expand_short_urls(short_urls, chunk_size=100):
    # Pool() automatically limits your process to chunk_size greenlets running concurrently
    # thus you don't need to do all that chunking business you were doing in your for loop
    p = pool.Pool(chunk_size)
    print 'Expanding %d short_urls' % len(short_urls)

    # spawn() (both gevent.spawn() and Pool.spawn()) returns a gevent.Greenlet object
    # NOT the value your function, get_head, will return
    threads = [p.spawn(get_head, short_url) for short_url in short_urls]
    p.join()

    # to access the returned value of your function, access the Greenlet.value property
    results = {short_url: thread.value.url for short_url, thread in zip(short_urls, threads)
               if thread.value is not None and thread.value.status_code == 200}
    return results
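Usage would then be as simple as this (the URLs here are just placeholders):
if __name__ == '__main__':
    short_urls = ['http://example.com/a', 'http://example.com/b']  # placeholder URLs
    expanded = expand_short_urls(short_urls)
    print '%d of %d URLs resolved' % (len(expanded), len(short_urls))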

Sequentially queuing multiple deferreds in Twisted

Currently I am still a beginner in Twisted and this has vexed me.
I am sending out a sequence of commands via TCP and awaiting a response from the lineReceived reader, which can take a number of seconds to process and arrive, so I wrapped it in a deferred. The first deferred works fine, but the second one fires while the first one is still processing, resulting in garbage, as the endpoint can only process one command at a time. Expected behavior in an async system, but not what I need to happen. If I had one or two commands I could use a deferred chain, but as I have potentially dozens of commands to run sequentially I fear this will turn into unmaintainable spaghetti fast.
What is the clean way to do this?
Many thanks
Example code
def connectionMade(self):
    self.fire_def('command1')
    print 'fire command 2'
    self.fire_def('command2')  # Fires when command one is running

def fire_def(self, request):
    d = self.getInfo(request)
    d.addCallback(self.print_result)
    return d

def print_result(result):
    print result

def getInfo(self, request):
    print 'sending', request
    self.d = defer.Deferred()
    self.sendLine(request)
    return self.d

def lineReceived(self, line):
    line = line.strip()
    self.buffer.append(line)
    if self.d is None:
        return
    if 'result_I_want' in self.buffer:
        print 'Firing Callback'
        self.d.callback(self.buffer)
The code in your question only knows how to keep track of one Deferred. If application code calls getInfo twice without enough intervening time for the first action to complete with a result, then it will corrupt its own internal tracking state:
def getInfo(self, request):
    print 'sending', request
    self.d = defer.Deferred()
    self.sendLine(request)
    return self.d
d_foo = getInfo(foo)
d_bar = getInfo(bar)
In this sequence, d_foo and d_bar are different Deferred instances. However, on the second call to getInfo, the value of the attribute self.d is changed from d_foo to d_bar. The d_foo Deferred is lost. Later, when lineReceived runs:
def lineReceived(self, line):
    line = line.strip()
    self.buffer.append(line)
    if self.d is None:
        return
    if 'result_I_want' in self.buffer:
        print 'Firing Callback'
        self.d.callback(self.buffer)
self.d is d_bar even though the line is probably a response to the foo request. This means d_bar will get the response for foo and d_foo will never get any response at all.
To fix this problem, it may help to keep a list (or queue) of Deferred instances on the protocol. Append to it when a new info request is made, pop from the front of it when a response is received. (I'm not sure what protocol you're implementing, so I don't know how you'll decide how many lines is sufficient to constitute a response. If the protocol doesn't define this then it is broken and you may want to switch to a better protocol.)
If you fix this, then responses will at least get delivered to different Deferred instances.
You also described a problem relating to forcing sequential operation. There are a few ways I could interpret this. One way is to interpret it as meaning you only want one request to be "outstanding" on the network at a time. In other words, you don't want getInfo to send new request lines until after lineReceived has delivered response data to the Deferred returned by the previous call to getInfo.
In this case, Deferred chaining is just the thing. Despite the fact that you have N Deferreds, when you're imposing this sequential restriction, you actually have a series of 2 Deferreds. You have the Deferred that runs earlier and the Deferred that should only run later after the earlier one has its result. You extend this to N by then considering the later Deferred to be the earlier Deferred in a new pair, and a third Deferred becomes the new later Deferred.
Or put another way, if you have D1, D2, D3, and D4, then you chain them like:
- D2 is chained to D1 and only runs when D1 is complete
- D3 is chained to D2 and only runs when D2 is complete
- D4 is chained to D3 and only runs when D3 is complete
However, while this can work, it's actually not the easiest way to implement serialization. Instead, I suggest explicitly queueing up work in getInfo and explicitly unqueueing it in lineReceived:
def _sendRequest(self, d, request):
    print 'sending', request
    self.d = d
    self.sendLine(request)

def getInfo(self, request):
    if self.d is None:
        d = defer.Deferred()
        self._sendRequest(d, request)
        return d
    else:
        queued_d = defer.Deferred()
        self._requests.append((request, queued_d))
        return queued_d

def lineReceived(self, line):
    line = line.strip()
    self.buffer.append(line)
    if self.d is None:
        return
    if 'result_I_want' in self.buffer:
        print 'Firing Callback'
        now_d = self.d
        self.d = None
        buffer = self.buffer
        self.buffer = []
        if self._requests:
            request, queued_d = self._requests.pop(0)
            self._sendRequest(queued_d, request)
        now_d.callback(buffer)
Notice how in lineReceived the code takes care to put everything into a consistent state before the now_d.callback(buffer) line. This is a subtle but important point. There may be callbacks on now_d which impact the protocol - for example, by calling getInfo again. It is important for the protocol to be in a consistent state before causing that code to run, otherwise it will get confused - perhaps by sending requests out of order, or queueing up requests when they should actually be sent. This is an example of making code safe against re-entrancy. This isn't an idea that's unique to Twisted-using programs, but since people most often associate the idea of re-entrancy with threaded programs, people often overlook the possibility when writing Twisted-based code.
Basically, you return deferreds from one another if you want them executed one after another.
So you want d2 to execute only after d1 is done, fine then, return d2 from d1's callback.
In other words, per your example, you'd need to call command2 somewhere near the end of command1's callback.
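For example, a minimal sketch of that chaining, reusing getInfo and print_result from the question:
def connectionMade(self):
    d = self.getInfo('command1')
    d.addCallback(self.print_result)
    # the chain pauses on the Deferred returned by getInfo, so command2
    # is only sent once command1's response has been printed
    d.addCallback(lambda _: self.getInfo('command2'))
    d.addCallback(self.print_result)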

Process a FIFO Queue but drop items if they are too old

I've written a basic utility that listens for messages in one thread, adds them to a FIFO queue and processes them in another thread. Each message takes a fixed time to process (it's waiting for a blinking light to stop blinking), but messages can arrive randomly (patterns in the code is a dictionary of regexes to match the incoming message against; if a match is found, the corresponding color pattern is added to the queue).
blink_queue = Queue()

def receive(data):
    message = data['text']
    for pattern in patterns:
        if re.match(pattern, message):
            blink_queue.put(patterns[pattern])
            break
    return True

def blinker(q):
    while True:
        args = q.get().split()
        subprocess.Popen(
            [blink_app] + args,
            startupinfo=startupinfo,
            stderr=subprocess.PIPE,
            stdout=subprocess.PIPE)
        time.sleep(blink_wait)
        q.task_done()

def subscribe():
    print("Listening for messages on '%s' channel..." % channel)
    pubnub.subscribe({
        'channel': channel,
        'callback': receive
    })

blink_worker = Thread(target=blinker, args=(blink_queue,))
blink_worker.daemon = True
blink_worker.start()

sub_thread = Thread(target=subscribe)
sub_thread.daemon = True
sub_thread.start()
sub_thread.join()
How do I implement a FIFO queue in Python that automatically trims the oldest (first) items if it grows too big? Do I create another watching thread, or do I keep the size in check on the subscribe thread? I'm really new at Python, so if there is a totally logical data type please feel free to call me a noob and send me in the right direction.
Turns out there is a logical type collections.deque. From the documentation:
If maxlen is not specified or is None, deques may grow to an arbitrary
length. Otherwise, the deque is bounded to the specified maximum
length. Once a bounded length deque is full, when new items are added,
a corresponding number of items are discarded from the opposite end.
(and here is the commit that implements this datatype)
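For example, a bounded deque that silently drops the oldest entries (the maxlen of 100 is an arbitrary cap):
from collections import deque

blink_queue = deque(maxlen=100)  # arbitrary cap
for i in range(150):
    blink_queue.append('pattern %d' % i)

print len(blink_queue)  # 100
print blink_queue[0]    # 'pattern 50' - the oldest 50 entries were discarded
Note that deque does not provide the blocking get()/task_done() semantics of Queue, so the blinker loop would need to poll (e.g. popleft() in a try/except IndexError) or use a threading.Condition.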
For this I would subclass Queue and overload the put method to remove items in the fashion you desire if the Queue gets too large.
e.g.
import Queue

class NukeOldDataQueue(Queue.Queue):
    def put(self, *args, **kwargs):
        if self.full():
            try:
                oldest_data = self.get(block=False)
                print('[WARNING]: throwing away old data:' + repr(oldest_data))
            # a True value from `full()` does not guarantee
            # that anything remains in the queue when `get()` is called
            except Queue.Empty:
                pass
        Queue.Queue.put(self, *args, **kwargs)
You may also want to pass the block=False parameter or manipulate the timeout parameter depending on how bad it is to accidentally throw away new data or whether blocking on the put() call is acceptable.
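Usage is then a drop-in replacement for the existing queue (the maxsize of 100 is just an example bound):
blink_queue = NukeOldDataQueue(maxsize=100)  # example bound
blink_queue.put('red blue red')              # evicts the oldest entry once full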

Is this a good use case for ndb async urlfetch tasklets?

I want to move to ndb, and have been wondering whether to use async urlfetch tasklets. I'm not sure I fully understand how it works, as the documentation is somewhat poor, but it seems quite promising for this particular use case.
Currently I use async urlfetch like this. It is far from actual threading or parallel code, but it has still improved performance quite significantly, compared to just sequential requests.
def http_get(url):
    rpc = urlfetch.create_rpc(deadline=3)
    urlfetch.make_fetch_call(rpc, url)
    return rpc

rpcs = []
urls = [...] # hundreds of urls

while len(rpcs) < 10:
    rpcs.append(http_get(urls.pop()))

while rpcs:
    rpc = rpcs.pop(0)
    result = rpc.get_result()
    if result.status_code == 200:
        # append another item to rpcs
        # process result
        pass
    else:
        # re-append same item to rpcs
        pass
Please note that this code is simplified. The actual code catches exceptions, has some additional checks, and only tries to re-append the same item a few times. It makes no difference for this case.
I should add that processing the result does not involve any db operations.
Actually yes, it's a good idea to use async urlfetch here. How it works (rough explanation):
- your code reaches the point of the async call. It triggers a long background task and doesn't wait for its result, but continues to execute.
- the task works in the background, and when the result is ready it is stored somewhere, until you ask for it.
Simple example:
def get_fetch_all():
    urls = ["http://www.example.com/", "http://mirror.example.com/"]
    ctx = ndb.get_context()
    futures = [ctx.urlfetch(url) for url in urls]
    ndb.Future.wait_all(futures)
    results = [f.get_result() for f in futures]
    # do something with results here
If you want to store the result in ndb and make it more optimal, it's a good idea to write a custom tasklet for this.
@ndb.tasklet
def get_data_and_store(url):
    ctx = ndb.get_context()
    # until the result arrives here, this function is "paused", allowing other
    # parallel tasks to work. when the data has been fetched, control returns
    result = yield ctx.urlfetch(url)
    if result.status_code == 200:
        store = Storage(data=result.content)
        # async job to put data
        yield store.put_async()
        raise ndb.Return(True)
    else:
        raise ndb.Return(False)
And you can use this tasklet combined with the loop in the first sample. You should get a list of true/false values, indicating the success of each fetch.
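For instance, a rough sketch of driving the tasklet over many URLs in parallel (fetch_many is a made-up name):
@ndb.tasklet
def fetch_many(urls):
    # yielding a list of futures runs the per-URL tasklets in parallel and
    # resumes this tasklet once all of them have finished
    flags = yield [get_data_and_store(url) for url in urls]
    raise ndb.Return(flags)

ok_flags = fetch_many(urls).get_result()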
I'm not sure how much this will boost overall throughput (it depends on the Google side), but it should.
