Sequentially queuing multiple deferreds in Twisted - Python

I am still a beginner with Twisted and this has me stuck.
I am sending out a sequence of commands over TCP and awaiting a response from the lineReceived handler, which can take several seconds to process and arrive, so I wrapped it in a Deferred. The first Deferred works fine, but the second one fires while the first is still processing, producing garbage, because the endpoint can only handle one command at a time. That is expected behaviour in an async system, but it is not what I need to happen. With one or two commands I could chain the Deferreds by hand, but since I potentially have dozens of commands to run sequentially, I fear this will turn into unmaintainable spaghetti fast.
What is the clean way to do this?
Many thanks.
Example code:
def connectionMade(self):
    self.fire_def('command1')
    print 'fire command 2'
    self.fire_def('command2')  # Fires while command1 is still running

def fire_def(self, request):
    d = self.getInfo(request)
    d.addCallback(self.print_result)
    return d

def print_result(self, result):
    print result

def getInfo(self, request):
    print 'sending', request
    self.d = defer.Deferred()
    self.sendLine(request)
    return self.d

def lineReceived(self, line):
    line = line.strip()
    self.buffer.append(line)
    if self.d is None:
        return
    if 'result_I_want' in self.buffer:
        print 'Firing Callback'
        self.d.callback(self.buffer)

The code in your question only knows how to keep track of one Deferred. If application code calls getInfo twice without enough intervening time for the first action to complete with a result, then it will corrupt its own internal tracking state:
def getInfo(self, request):
    print 'sending', request
    self.d = defer.Deferred()
    self.sendLine(request)
    return self.d
d_foo = getInfo(foo)
d_bar = getInfo(bar)
In this sequence, d_foo and d_bar are different Deferred instances. However, on the second call to getInfo, the value of the attribute self.d is changed from d_foo to d_bar. The d_foo Deferred is lost. Later, when lineReceived runs:
def lineReceived(self, line):
    line = line.strip()
    self.buffer.append(line)
    if self.d is None:
        return
    if 'result_I_want' in self.buffer:
        print 'Firing Callback'
        self.d.callback(self.buffer)
self.d is d_bar even though the line is probably a response to the foo request. This means d_bar will get the response for foo and d_foo will never get any response at all.
To fix this problem, it may help to keep a list (or queue) of Deferred instances on the protocol. Append to it when a new info request is made, pop from the front of it when a response is received. (I'm not sure what protocol you're implementing, so I don't know how you'll decide how many lines is sufficient to constitute a response. If the protocol doesn't define this then it is broken and you may want to switch to a better protocol.)
If you fix this, then responses will at least get delivered to different Deferred instances.
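A rough sketch of that per-request bookkeeping (illustrative only, not from the original answer; the class name InfoProtocol and the _pending attribute are made up, and the 'result_I_want' check stands in for whatever actually delimits a response in your protocol):

from collections import deque
from twisted.internet import defer
from twisted.protocols.basic import LineReceiver

class InfoProtocol(LineReceiver):
    def connectionMade(self):
        self._pending = deque()  # Deferreds awaiting a response, oldest first
        self.buffer = []

    def getInfo(self, request):
        d = defer.Deferred()
        self._pending.append(d)  # remember who asked, in order
        self.sendLine(request)
        return d

    def lineReceived(self, line):
        self.buffer.append(line.strip())
        if 'result_I_want' in self.buffer:
            # A full response has arrived: hand it to the oldest waiter.
            buf, self.buffer = self.buffer, []
            self._pending.popleft().callback(buf)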
You also described a problem relating to forcing sequential operation. There are a few ways I could interpret this. One way is to interpret it as meaning you only want one request to be "outstanding" on the network at a time. In other words, you don't want getInfo to send new request lines until after lineReceived has delivered response data to the Deferred returned by the previous call to getInfo.
In this case, Deferred chaining is just the thing. Despite the fact that you have N Deferreds, when you're imposing this sequential restriction, you actually have a series of 2 Deferreds. You have the Deferred that runs earlier and the Deferred that should only run later after the earlier one has its result. You extend this to N by then considering the later Deferred to be the earlier Deferred in a new pair, and a third Deferred becomes the new later Deferred.
Or put another way, if you have D1, D2, D3, and D4, then you chain them like this (see the sketch after the list):
D2 is chained to D1 and only runs when D1 is complete
D3 is chained to D2 and only runs when D2 is complete
D4 is chained to D3 and only runs when D3 is complete
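As a rough sketch of that chaining (assuming a getInfo-style method that returns a Deferred, as in the question; the run_all helper name is made up):

from twisted.internet import defer

def run_all(proto, requests):
    # Each callback returns the next request's Deferred, so the chain
    # pauses until the previous response has arrived before sending more.
    d = defer.succeed(None)
    for request in requests:
        d.addCallback(lambda _ignored, r=request: proto.getInfo(r))
    return d  # fires with the final response once every request has completed

# run_all(protocol_instance, ['command1', 'command2', 'command3'])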
However, while this can work, it's actually not the easiest way to implement serialization. Instead, I suggest explicitly queueing up work in getInfo and explicitly unqueueing it in lineReceived:
def _sendRequest(self, d, request):
    print 'sending', request
    self.d = d
    self.sendLine(request)

def getInfo(self, request):
    if self.d is None:
        d = defer.Deferred()
        self._sendRequest(d, request)
        return d
    else:
        queued_d = defer.Deferred()
        self._requests.append((request, queued_d))
        return queued_d

def lineReceived(self, line):
    line = line.strip()
    self.buffer.append(line)
    if self.d is None:
        return
    if 'result_I_want' in self.buffer:
        print 'Firing Callback'
        now_d = self.d
        self.d = None
        buffer = self.buffer
        self.buffer = []
        if self._requests:
            request, queued_d = self._requests.pop(0)
            self._sendRequest(queued_d, request)
        now_d.callback(buffer)
Notice how in lineReceived the code takes care to put everything into a consistent state before the now_d.callback(buffer) line. This is a subtle but important point. There may be callbacks on now_d which impact the protocol - for example, by calling getInfo again. It is important for the protocol to be in a consistent state before causing that code to run, otherwise it will get confused - perhaps by sending requests out of order, or by queueing up requests when they should actually be sent. This is an example of making code safe against re-entrancy. This isn't an idea that's unique to Twisted-using programs, but since people most often associate re-entrancy with threaded programs, they often overlook the possibility when writing Twisted-based code.

Basically, you return one Deferred from another's callback if you want them executed one after another.
So if you want d2 to execute only after d1 is done, return d2 from d1's callback.
In other words, per your example, you'd need to call command2 somewhere near the end of command1's callback.
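In terms of the question's code, that looks roughly like this (a sketch; getInfo and print_result are the methods from the question):

def connectionMade(self):
    d = self.getInfo('command1')
    d.addCallback(self.print_result)
    # Returning the next request's Deferred from the callback chain means
    # command2 is only sent after command1's result has been handled.
    d.addCallback(lambda _ignored: self.getInfo('command2'))
    d.addCallback(self.print_result)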

Related

How to send out two requests at the same time with Python

So I was following a guide at http://tavendo.com/blog/post/going-asynchronous-from-flask-to-twisted-klein/ to create an asynchronous web service.
In my code, I had a function that sends out the request, like this:
import treq
from twisted.internet.defer import inlineCallbacks, returnValue

@inlineCallbacks
def query(text):
    resp = yield treq.get("http://api.Iwanttoquery")
    content = yield treq.content(resp)
    returnValue(content)

@inlineCallbacks
def caller():
    output1 = yield query("one")
    output2 = yield query("two")
Since each query to the API usually takes about 3 seconds, with my current code the result comes back after 6 seconds. I wonder whether there is a way to send out the two queries at the same time, so that after 3 seconds I can get the content of both output1 and output2? Thanks.
What you need to do is use a DeferredList instead of inlineCallbacks. Basically you provide a list of deferreds and after each one completes, a final callback with the results of all the deferreds is executed.
import treq
from twisted.internet import defer, reactor

def query(text):
    get = treq.get('http://google.com')
    get.addCallback(treq.content)
    return get

output1 = query('one')
output2 = query('two')
final = defer.DeferredList([output1, output2])  # wait for both queries to finish
final.addCallback(print)  # print the results from all the queries in the list
reactor.run()
Each call to query() starts its request and returns a Deferred almost immediately, so output1 and output2 are effectively executing at the same time. Then you append the Deferreds (i.e. output1 and output2) to a list and pass it to DeferredList, which itself returns a Deferred. Finally, you add a callback to the DeferredList to do something with the results (in this case I just print them). This is all done without the use of threads, which is the best part in my opinion! Hope this makes sense, and please comment if it doesn't.
PS
If you need further help with Klein, I'm working on revamping the documentation here https://github.com/notoriousno/klein-basics (hopefully I'll make a blog post one of these days). Please take a look at some of the docs (the files with .rst). My shameless plug is now concluded :D

Python script with multiple threads works normally only in debug mode

I am currently working on a Python 2.7 script with multiple threads. One of the threads listens for JSON data in long-polling mode and parses it after receiving it, or times out after some period. I noticed that it works as expected only in debug mode (I use Wing IDE). In a normal run, this particular thread seems to hang after the first GET request, before entering the for loop. The loop condition doesn't affect the result. At the same time, the other threads continue to work normally.
I believe this is related to multi-threading. How do I properly troubleshoot and fix this issue?
Below is the code of the class responsible for the long-polling job.
class Listener(threading.Thread):
    def __init__(self, router, *args, **kwargs):
        self.stop = False
        self._cid = kwargs.pop("cid", None)
        self._auth = kwargs.pop("auth", None)
        self._router = router
        self._c = webclient.AAHWebClient()
        threading.Thread.__init__(self, *args, **kwargs)

    def run(self):
        while True:
            try:
                # Data items that should be routed to the device is retrieved by doing a
                # long polling GET request on the "/tunnel" resource. This will block until
                # there are data items available, or the request times out
                log.info("LISTENER: Waiting for data...")
                response = self._c.send_request("GET", self._cid, auth=self._auth)

                # A timed out request will not contain any data
                if len(response) == 0:
                    log.info("LISTENER: No data this time")
                else:
                    items = response["resources"]["tunnel"]
                    undeliverable = []
                    #print items  # - reaching this point, able to return output
                    for item in items:
                        # The data items contains the data as a base64 encoded string and the
                        # external reference ID for the device that should receive it
                        extId = item["extId"]
                        data = base64.b64decode(item["data"])
                        # Try to deliver the data to the device identified by "extId"
                        if not self._router.route(extId, data):
                            item["message"] = "Could not be routed"
                            undeliverable.append(item)
                    # Data items that for some reason could not be delivered to the device should
                    # be POST:ed back to the "/tunnel" resource as "undeliverable"
                    if len(undeliverable) > 0:
                        log.warning("LISTENER: Sending error report...")
                        response = self._c.send_request("POST", "/tunnel", body={"undeliverable": undeliverable}, auth=self._auth)
            except webclient.RequestError as e:
                log.error("LISTENER: ERROR %d - %s", e.status, e.response)
UPD:
class Router:
    def route(self, extId, data):
        log.info("ROUTER: Received data for %s: %s", extId, repr(data))
        # nothing special
        return True
If you're using the CPython interpreter, you're not actually running Python code in parallel threads:
CPython implementation detail: In CPython, due to the Global
Interpreter Lock, only one thread can execute Python code at once
(even though certain performance-oriented libraries might overcome
this limitation). If you want your application to make better use of
the computational resources of multi-core machines, you are advised to
use multiprocessing. However, threading is still an appropriate model
if you want to run multiple I/O-bound tasks simultaneously.
So your process is probably blocking while listening on the first request, because you are long polling.
Multiprocessing might be a better choice. I haven't tried it with long polling, but the Twisted framework might also work in your situation.
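A rough sketch of what the multiprocessing route might look like (illustrative only; it keeps the question's Listener logic but runs it in a child process, and it assumes the router can safely be used from a separate process, which may not hold for your application):

import multiprocessing

class ListenerProcess(multiprocessing.Process):
    def __init__(self, router, cid=None, auth=None):
        multiprocessing.Process.__init__(self)
        self.daemon = True
        self._router = router
        self._cid = cid
        self._auth = auth

    def run(self):
        # Same long-polling loop as Listener.run(), but in its own process,
        # so it is not interleaved with the other threads via the GIL.
        pass

# listener = ListenerProcess(Router(), cid="/tunnel", auth=my_auth)
# listener.start()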

Is this an insane implementation of producer consumer type thing?

# file1.py
import subprocess
import threading

class _Producer(object):
    def __init__(self):
        self.chunksize = 6220800
        with open('/dev/zero') as f:
            self.thing = f.read(self.chunksize)
        self.n = 0
        self.start()

    def start(self):
        def produce():
            self._proc = subprocess.Popen(['producer_proc'], stdout=subprocess.PIPE)
            while True:
                self.thing = self._proc.stdout.read(self.chunksize)
                if len(self.thing) != self.chunksize:
                    msg = 'Expected {0} bytes. Read {1} bytes'.format(self.chunksize, len(self.thing))
                    raise Exception(msg)
                self.n += 1
        t = threading.Thread(target=produce)
        t.daemon = True
        t.start()
        self._thread = t

    def stop(self):
        if self._thread.is_alive():
            self._proc.terminate()
            self._thread.join(1)

producer = _Producer()
producer.start()
I have written some code more or less like the above design, and now I want to be able to consume the output of producer_proc in other files by going:
# some_other_file.py
import file1
my_thing = file1.producer.thing
Multiple other consumers might grab a reference to file1.producer.thing; they all need to consume from the same producer_proc, and the producer_proc should never be blocked. Is this a sane implementation? Does the Python GIL make it thread safe, or do I need to reimplement it using a Queue to get data out of the worker thread? Do consumers need to explicitly make a copy of the thing?
I guess I am trying to implement something like the Producer/Consumer pattern or the Observer pattern, but I'm not really clear on all the technical details of design patterns.
A single producer is constantly making things
Multiple consumers using things at arbitrary times
producer.thing should be replaced by a fresh thing as soon as the new one is available; most things will go unused, but that's OK
It's OK for multiple consumers to read the same thing, or to read the same thing twice in succession. They only want to be sure they have got the most recent thing when asked for it, not some stale old thing.
A consumer should be able to keep using a thing as long as they have it in scope, even though the producer may have already overwritten its self.thing with a fresh new thing.
Given your (unusual!) requirements, your implementation seems correct. In particular,
If you're only updating one attribute, the Python GIL should be sufficient. Single bytecode instructions are atomic.
If you do anything more complex, add locking! It's basically harmless anyway - if you cared about performance or multicore scalability, you probably wouldn't be using Python!
In particular, be aware that self.thing and self.n in this code are updated in separate bytecode instructions. The GIL could be released/acquired between them, so you can't get a consistent view of the two unless you add locking. If you're not going to do that, I'd suggest removing self.n as it's an "attractive nuisance" (easily misused), or at least adding a comment/docstring with this caveat.
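For instance, if a consumer ever needs thing and n to correspond, something along these lines would be needed (a sketch only; the _publish and get_latest helpers are made up for illustration and show just the relevant pieces of a _Producer-like class):

import threading

class _Producer(object):
    def __init__(self):
        self._lock = threading.Lock()
        self.thing = None
        self.n = 0

    def _publish(self, new_thing):
        # Called from the producer thread for each new chunk.
        with self._lock:
            self.thing = new_thing
            self.n += 1

    def get_latest(self):
        # Called from consumer threads; returns a consistent (n, thing) pair.
        with self._lock:
            return self.n, self.thing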
Consumers don't need to make a copy. You're not ever mutating a particular object pointed to by self.thing (and couldn't with string objects; they're immutable) and Python is garbage-collected, so as long as a consumer grabbed a reference to it, it can keep accessing it without worrying too much about what other threads are doing. The worst that could happen is your program using a lot of memory from several generations of self.thing being kept alive.
I'm a bit curious where your requirements came from. In particular, that you don't care if a thing is never used or used many times.

Dynamically allocating and destroying mutexes?

I have an application that's built on top of Eventlet.
I'm trying to write a decent decorator for synchronizing calls to certain methods across threads.
The decorator currently looks something like this:
_semaphores_semaphore = semaphore.Semaphore()
_semaphores = {}

def synchronized(name):
    def wrap(f):
        def inner(*args, **kwargs):
            # Grab the lock protecting _semaphores.
            with _semaphores_semaphore:
                # If the named semaphore does not yet exist, create it.
                if name not in _semaphores:
                    _semaphores[name] = semaphore.Semaphore()
                sem = _semaphores[name]
            with sem:
                return f(*args, **kwargs)
        return inner
    return wrap
This works fine and looks nice and thread safe to me, although this whole thread-safety and locking business might be a bit rusty for me.
The problem is that a specific, existing use of semaphores elsewhere in the application, which I want to convert to using this decorator, creates these semaphores on the fly: based on user input, it has to create a file. It checks in a dict whether it already has a semaphore for this file; if not, it creates one and locks it. Once it's done and has released the lock, it checks whether it has been locked again (by another process in the meantime), and if not, it deletes the semaphore. This code is written with the assumption of green threads and is safe in that context, but I can't work out how to do the same if I convert it to use my decorator.
If I don't care about cleaning up the possibly-never-to-be-used-again semaphores (there could be hundreds of thousands of these), I'm fine. If I do want to clean them up, I'm not sure what to do.
To delete the semaphore, it seems obvious that I need to be holding _semaphores_semaphore, since I'm manipulating the _semaphores dict, but I also have to do something with the specific semaphore, and everything I can think of seems racy:
* While inside the "with sem:" block, I could grab _semaphores_semaphore and remove sem from _semaphores. However, other threads might be blocked waiting for it (at "with sem:"), and if a new thread comes along wanting to touch the same resource, it will not find the same semaphore in _semaphores but will instead create a new one => fail.
I could improve this slightly by checking the balance of sem to see if another thread is already waiting for me to release it. If so, leave it alone, if not, delete it. This way, the last thread waiting to act on the resource will delete it. However, if a thread has just left the "with _semaphores_semaphore:" block, but hasn't yet made it to "with sem:", I have the same problem as before => fail.
It feels like I'm missing something obvious, but I can't work out what it is.
I think you might be able to solve it with a reader-writer lock (a.k.a. shared-exclusive lock) on the _semaphores dict.
This is untested code, to show the principle. An RWLock implementation can be found in e.g. http://code.activestate.com/recipes/413393-multiple-reader-one-writer-mrow-resource-locking/
_semaphores_rwlock = RWLock()
_semaphores = {}

def synchronized(name):
    def wrap(f):
        def inner(*args, **kwargs):
            lock = _semaphores_rwlock.reader()
            # If the named semaphore does not yet exist, create it.
            if name not in _semaphores:
                lock = _semaphores_rwlock.writer()
                _semaphores[name] = semaphore.Semaphore()
            sem = _semaphores[name]
            with sem:
                retval = f(*args, **kwargs)
            lock.release()
            return retval
        return inner
    return wrap
When you want to clean up you do:
wlock = _semaphores_rwlock.writer() #this might take a while; it waits for all readers to release
cleanup(_semaphores)
wlock.release()
mchro's answer didn't work for me since it blocks all threads on a single semaphore whenever one thread needs to create a new semaphore.
The answer that I came up with is to keep counters of occupants between the two transactions with _semaphores (which are both done behind the same mutex):
A: get semaphore
A1: dangerzone
B: with sem: block etc
C: cleanup semaphore
The problem is knowing how many people are between A and C. The counter of the semaphore doesn't tell you that, since someone may be in A1. The answer is to keep a counter of entrants along with each semaphore in _semaphores, increment it at A, decrement it at C, and if it's at 0 then you know that there's no-one else in A-C with the same key and you can safely delete it.
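A sketch of what that bookkeeping looks like in code (illustrative only; the _guard and _entries names are made up, and the eventlet import is assumed from the question's use of semaphore.Semaphore):

from eventlet import semaphore

_guard = semaphore.Semaphore()  # protects _entries
_entries = {}                   # name -> [per-name semaphore, threads between A and C]

def synchronized(name):
    def wrap(f):
        def inner(*args, **kwargs):
            with _guard:  # A: get (or create) the named semaphore and count ourselves in
                entry = _entries.setdefault(name, [semaphore.Semaphore(), 0])
                entry[1] += 1
                sem = entry[0]
            try:
                with sem:  # B: run the protected call
                    return f(*args, **kwargs)
            finally:
                with _guard:  # C: the last thread between A and C deletes the entry
                    entry[1] -= 1
                    if entry[1] == 0:
                        del _entries[name]
        return inner
    return wrap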

Yet another producer/consumer problem in Twisted Python

I am building a server which stores key/value data on top of Redis using Twisted Python.
The server receives a JSON dictionary via HTTP, which is converted into a Python dictionary and put in a buffer. Every time new data is stored, the server schedules a task which pops one dictionary from the buffer and writes every tuple into a Redis instance, using a txredis client.
class Datastore(Resource):
    isLeaf = True

    def __init__(self):
        self.clientCreator = protocol.ClientCreator(reactor, Redis)
        d = self.clientCreator.connectTCP(...)
        d.addCallback(self.setRedis)
        self.redis = None
        self.buffer = deque()

    def render_POST(self, request):
        try:
            task_id = request.requestHeaders.getRawHeaders('x-task-id')[0]
        except IndexError:
            request.setResponseCode(503)
            return '<html><body>Error reading task_id</body></html>'
        data = json.loads(request.content.read())
        self.buffer.append((task_id, data))
        reactor.callLater(0, self.write_on_redis)
        return ' '

    @defer.inlineCallbacks
    def write_on_redis(self):
        try:
            task_id, dic = self.buffer.pop()
            log.msg('Buffer: %s' % len(self.buffer))
        except IndexError:
            log.msg('buffer empty')
            defer.returnValue(1)
        m = yield self.redis.sismember('DONE', task_id)
        # Simple check
        if m == '1':
            log.msg('%s already stored' % task_id)
        else:
            log.msg('%s unpacking' % task_id)
            s = yield self.redis.sadd('DONE', task_id)
            d = defer.Deferred()
            for k, v in dic.iteritems():
                k = k.encode()
                d.addCallback(self.redis.push, k, v)
            d.callback(None)
Basically, I am facing a Producer/Consumer problem between two different connections, but I am not sure that the current implementation works well in the Twisted paradigm.
I have read the short documentation about producer/consumer interfaces in Twisted, but I am not sure whether I can use them in my case.
Any criticism is welcome: I am trying to get a grasp of event-driven programming, after too many years of thread concurrency.
The producer and consumer APIs in Twisted, IProducer and IConsumer, are about flow control. You don't seem to have any flow control here; you're just relaying messages from one protocol to another.
Since there's no flow control, the buffer is just extra complexity. You could get rid of it by just passing the data directly to the write_on_redis method. This way write_on_redis doesn't need to handle the empty buffer case, you don't need the extra attribute on the resource, and you can even get rid of the callLater (although you can also do this even if you keep the buffer).
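Roughly like this, as a sketch against the question's Datastore (it also yields each Redis call in turn instead of chaining them on a hand-built Deferred):

def render_POST(self, request):
    try:
        task_id = request.requestHeaders.getRawHeaders('x-task-id')[0]
    except IndexError:
        request.setResponseCode(503)
        return '<html><body>Error reading task_id</body></html>'
    data = json.loads(request.content.read())
    # No buffer: hand the work straight to write_on_redis.
    self.write_on_redis(task_id, data)
    return ' '

@defer.inlineCallbacks
def write_on_redis(self, task_id, dic):
    # Same sismember / sadd / push logic as before, now taking its
    # arguments directly instead of popping them from self.buffer.
    m = yield self.redis.sismember('DONE', task_id)
    if m != '1':
        yield self.redis.sadd('DONE', task_id)
        for k, v in dic.iteritems():
            yield self.redis.push(k.encode(), v)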
I don't know if any of this answers your question, though. As far as whether this approach "works well", here are the things I notice just by reading the code:
If data arrives faster than redis accepts it, your list of outstanding jobs may become arbitrarily large, causing you to run out of memory. This is what flow control would help with.
With no error handling around the sismember call or the sadd call, you may lose tasks if either of these fail, since you've already popped them from the work buffer.
Doing a push as a callback on that Deferred d also means that any failed push will prevent the rest of the data from being pushed. It also passes the result of the Deferred returned by push (I'm assuming it returns a Deferred) as the first argument to the next call, so unless push more or less ignores its first argument, you won't be pushing the right data to redis.
If you want to implement flow control, then you need to have your HTTP server check the length of self.buffer and possibly reject the new task - not adding it to self.buffer and returning some error code to the client. You still won't be using IConsumer and IProducer, but it's sort of similar.
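A minimal sketch of that kind of check (the MAX_PENDING limit and the 503 "busy" response are arbitrary illustrative choices, not from the answer):

MAX_PENDING = 1000  # arbitrary cap on queued tasks

def render_POST(self, request):
    if len(self.buffer) >= MAX_PENDING:
        # Too far behind Redis: refuse new work instead of buffering forever.
        request.setResponseCode(503)
        return '<html><body>Server busy, retry later</body></html>'
    task_id = request.requestHeaders.getRawHeaders('x-task-id')[0]
    data = json.loads(request.content.read())
    self.buffer.append((task_id, data))
    reactor.callLater(0, self.write_on_redis)
    return ' '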
