Python threading in web programming

I face a potential race condition in a web application:
# get the submissions so far from the cache
submissions = cache.get('user_data')
# add the data from this user to the local dict
submissions[user_id] = submission
# update the cached dict on server
submissions = cache.update('user_data', submissions)
if len(submissions) == some_number:
    ...
The logic is simple: we first fetch a shared dictionary stored in the web server's cache, add the submission (delivered by each request to the server) to a local copy, and then update the cached copy by replacing it with this updated local copy. Finally, we do something else once we have received a certain number of submissions. Notice that
submissions = cache.update('user_data', submissions)
will return the latest copy of the dictionary from the cache, i.e. the newly updated one.
Because the server may serve multiple requests at the same time (each in its own thread), and all these threads access the shared dictionary in the cache as described above, there is a potential race condition.
I wonder, in the context of web programming, how I should efficiently handle threading to prevent race conditions in this particular case, without sacrificing too much performance. Some code examples would be much appreciated.

My preferred solution would be to have a single thread that modifies the submissions dict and a queue that feeds that thread. If you are paranoid, you can even expose a read-only view of the submissions dict. Using a queue and consumer pattern, you will not have a problem with locking.
Of course, this assumes that you have a web framework that will let you create that thread.
EDIT: multiprocess was not a good suggestion; removed.
EDIT: This sort of stuff is really simple in Python:
import threading, Queue

Stop = object()

def consumer(real_dict, queue):
    while True:
        try:
            item = queue.get(timeout=100)
            if item == Stop:
                break
            user, submission = item
            real_dict[user] = submission
        except Queue.Empty:
            continue

q = Queue.Queue()
thedict = {}
t = threading.Thread(target=consumer, args=(thedict, q))
t.start()
Then, you can try:
>>> thedict
{}
>>> q.put(('foo', 'bar'))
>>> thedict
{'foo': 'bar'}
>>> q.put(Stop)
>>> q.put(('baz', 'bar'))
>>> thedict
{'foo': 'bar'}

You appear to be transferring lots of data back and forth between your web application and your cache. That's already a problem. You're also right to be suspicious, since it would be possible for the pattern to be like this (remembering that sub is local to each thread):
Thread A           Thread B           Cache
--------------------------------------------------------
                                      [A]=P, [B]=Q
sub = get()                           [A]=P, [B]=Q
>>>> suspend
                   sub = get()        [A]=P, [B]=Q
                   sub[B] = Y         [A]=P, [B]=Y
                   update(sub)        [A]=P, [B]=Y
                   >>>> suspend
sub[A] = X                            [A]=X, [B]=Q
update(sub)                           [A]=X, [B]=Q    !!!!!!!!
This sort of pattern can happen for real, and it results in state getting wiped out. It's also inefficient because thread A should usually only need to know about its current user, not everything.
While you could fix this by great big gobs of locking, that would be horribly inefficient. So, you need to redesign so that you transfer much less data around, which will give a performance boost and reduce the amount of locking you need.

This is one of the more difficult questions to answer because it seems to be a bigger design problem.
One potential solution to this problem would be to have one well-defined place where this is updated. For instance, you might want to set up another service that's dedicated to updating the cache and nothing else. Alternatively, if these updates aren't time-sensitive, you may also want to consider using a task queue.
Another solution: you could give each item a separate key and store a list of the keys under a separate key. This doesn't necessarily solve the problem, but it does make it more manageable. Instead of worrying about separate threads overwriting the entire submissions cache, you just have to worry about them overwriting individual elements within it.
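A rough sketch of that per-key idea (hedged: it assumes a memcached-style cache client whose incr() is atomic on the server and whose add() only succeeds when the key does not exist yet; the key names are made up for illustration):
def record_submission(cache, user_id, submission, some_number):
    # each submission gets its own key, so no thread rewrites anyone else's data
    cache.set('submission:%s' % user_id, submission)
    # initialise the counter once; add() is a no-op if the key already exists
    cache.add('submission_count', 0)
    # incr() is atomic server-side, unlike a read-modify-write of a whole dict
    count = cache.incr('submission_count')
    if count == some_number:
        pass  # all expected submissions are in; do the follow-up work here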
If you have the time to add a new piece to your infrastructure, I'd highly recommend looking at Redis, more specifically Redis hashes[1]. The reason being that Redis handles this problem out of the box, with about the same speed as you'd get with memcache (although I definitely encourage you to benchmark it for yourself).
[1] Note: I just found this link through a quick Google search, and haven't verified it. I don't vouch for its correctness.
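For illustration, a sketch using the redis-py client (hedged: it assumes a reachable Redis instance, the key names are made up, and the submission must be a string or bytes, so serialize it first if it is a dict):
import redis

r = redis.Redis(host='localhost', port=6379)

def record_submission(user_id, submission, some_number):
    # HSET writes one field of the hash atomically; nothing ever reads, modifies,
    # and writes back the whole dict, so concurrent requests cannot clobber
    # each other's entries.
    r.hset('submissions', user_id, submission)
    # HLEN is atomic too, so the size check never sees a half-written state.
    if r.hlen('submissions') == some_number:
        pass  # all expected submissions received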

Related

Django (REST Framework) Returns Empty List Every Other Call

I have a simple function that generates a random sample list of items from the outer dictionary.
def get_random_product_feed_for(mid, n=DEFAULT_AMOUNT_ITEMS_RETURNED):
    assert mid is not None, 'Merchant ID cannot be null while retrieving products feed'
    n = min(n, MAX_FEED_ITEMS_RETURNED)
    if mid in ADVERTISERS_PRODUCT_FEEDS:  # check if merchant is present in the outer dict
        feeds = ADVERTISERS_PRODUCT_FEEDS[mid]  # retrieve merchant's items
        # Sample list
        if len(feeds) >= n:
            random_feeds = random.sample(feeds, n)
        else:
            random_feeds = feeds
        return random_feeds
    return []
Where ADVERTISERS_PRODUCT_FEEDS = defaultdict(list).
However, when I use this function in a REST Framework API call, it returns an empty list every other call. I don't think the problem is in the views or serializers, however.
Nevertheless, the setup is a little bit more complex than this. ADVERTISERS_PRODUCT_FEEDS is fetched asynchronously, because it is processed from large files that need to be downloaded.
threads = []  # to keep only one active thread for the process

def fetch_products_feed():
    for thr in threads:
        if not thr.is_alive():
            threads.remove(thr)
    if len(threads) > 0:
        logging.warning(
            'Attempted to create multiple threads for product feeds fetching process. '
            'Wait until it is done!'
        )
        return
    thread = threading.Thread(target=fetch_products_feed_sync, name='Fetch-Products-Thread')
    threads.append(thread)
    thread.start()
So far I can only assume that fetch_products_feed_sync is doing its job properly (for the sake of not overcomplicating the question). It just reads the items and adds them to ADVERTISERS_PRODUCT_FEEDS.
This whole setup actually works locally, but I host the server on AWS. So the problem appears only there. Locally I get the result every call.
I can only suspect that threading messes everything up. Am I right? I hoped that since the main thread only reads ADVERTISERS_PRODUCT_FEEDS and does not change it, it should be fine.
Maybe on every call the application switches threads or something. So at first ADVERTISERS_PRODUCT_FEEDS has values in it, but on the next call it has no values, because the application is in another thread?
How would you advise me to debug it, considering everyone has a tough time debugging things on AWS?
I see that, for example, if mid is passed incorrectly, then an empty list is returned, but that seems not to be the problem (I will try to debug it and give an update on that later).
Updates:
It was discovered that the problem is that ADVERTISERS_PRODUCT_FEEDS is empty in the scope of get_random_product_feed_for every other call. I logged the name of the thread that is active inside get_random_product_feed_for. It seems that both successful calls (ADVERTISERS_PRODUCT_FEEDS has items) and unsuccessful calls (ADVERTISERS_PRODUCT_FEEDS has no items) come from MainThread. So why does MainThread have access to the data at one moment, not have it at the next call, have it again the call after that, and so on?
The problem appears to be the multiple processes that AWS creates for WSGI.
I discovered that if you tweak NumProcesses in AWS's configuration, the rate of failed results changes accordingly. So if I have NumProcesses=3, the application returns an empty list two calls out of three, which means that only one of the three processes has fetched the wanted data.
As far as I understand, there is no straightforward way to share this in-memory data among the processes. The only workable solution is to persist the data somewhere all the processes can reach it (e.g., in a database or an external cache).
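As a sketch of that direction (hedged: it assumes a Redis instance reachable from every WSGI process and the redis-py client; the key names are illustrative), the background fetch could publish the feeds to a shared store, and get_random_product_feed_for could read from it instead of from a module-level dict:
import json
import redis

store = redis.Redis(host='localhost', port=6379)

def publish_feeds(feeds_by_merchant):
    # called from the background fetch thread, in whichever process ran it
    for mid, items in feeds_by_merchant.items():
        store.set('feed:%s' % mid, json.dumps(items))

def load_feeds_for(mid):
    # called from the request path; behaves identically in every WSGI process
    raw = store.get('feed:%s' % mid)
    return json.loads(raw) if raw else []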

Using python dictionary as a temporary in-memory key-value database?

I need something like a temporary in-memory key-value store. I know there are solutions like Redis, but I wonder if using a Python dictionary could work, and potentially be even faster?
So think of a Tornado (or similar) server running, holding a Python dictionary in memory, and just returning the appropriate value based on the HTTP request.
Why do I need this?
As part of a service, key-value pairs are being stored, and they have this property: the more recent they are, the more likely they are to be accessed. So I want to keep, say, the last 100 key-value pairs in memory (as well as writing them to disk) for faster retrieval.
If the server dies the dictionary can be restored again from disk.
Has anyone done something like this? Am I totally missing something here?
PS: I think it's not possible with a WSGI server, right? Because as far as I know you can't keep something in memory between individual requests.
I'd definitely work with memcached. Once it has been set up, you can easily decorate your functions/methods as is done in my example:
#!/usr/bin/env python
import time
import memcache
import hashlib

def memoize(f):
    def newfn(*args, **kwargs):
        mc = memcache.Client(['127.0.0.1:11211'], debug=0)
        # generate md5 out of args and function
        m = hashlib.md5()
        margs = [x.__repr__() for x in args]
        mkwargs = [x.__repr__() for x in kwargs.values()]
        map(m.update, margs + mkwargs)
        m.update(f.__name__)
        m.update(f.__class__.__name__)
        key = m.hexdigest()
        value = mc.get(key)
        if value:
            return value
        else:
            value = f(*args, **kwargs)
            mc.set(key, value, 60)
            return value
    return newfn

@memoize
def expensive_function(x):
    time.sleep(5)
    return x

if __name__ == '__main__':
    print expensive_function('abc')
    print expensive_function('abc')
Don't worry about network latency; that kind of optimization will be a waste of your time.
An in process Python dictionary is way faster than a memcached server. According to a non-rigorous benchmark that I performed some days ago, a single get takes around 2us using an in process python dictionary and around 50us using a memcached server listening on localhost. In my benchmark, I was using libmemcached as C client and python-libmemcached as python wrapper over this C-client.
If you are bundling the dictionary into the same server as is running your actual service, then yes, that would work fine.
If you're creating separate things, well, this is basically what memcached is for. Don't reinvent the wheel.
I am experimenting with something similar, and the cachecore library is a great way to test a few caching systems.
https://pypi.python.org/pypi/cachecore
In particular, their SimpleCache implementation relies on a vanilla python dict, and in my preliminary tests it's extremely fast, 10x faster than calling memcached locally (assuming I'm already in the python application that needs caching, probably the tornado service in your case).
It's possible, and it is much faster than Redis/memcache because there is no network latency. You can use cPickle to dump the dictionary to disk every once in a while. It's tricky, though, if your program spawns subprocesses: updating the values in one process doesn't affect the others.
You could just cache the most recent data in a dict; nothing prohibits it, and it works in a one-server environment.
When new data is added, also store it in something like Redis (or memcachedb).
When the server restarts, just load the newest N records into the dictionary.
It all depends on data volume. I believe it takes more memory to keep complex structures in a Python dictionary, though access will be fast, yes.
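A minimal sketch of that "recent items in memory, everything recoverable from disk" idea (Python 3; hedged: the capacity of 100 and the pickle file name are illustrative, and this is only safe if a single process owns the cache):
import pickle
from collections import OrderedDict

CAPACITY = 100
PICKLE_PATH = 'recent_cache.pkl'   # illustrative file name

class RecentCache:
    def __init__(self):
        self._data = OrderedDict()

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)           # most recently written goes last
        if len(self._data) > CAPACITY:
            self._data.popitem(last=False)    # evict the oldest entry

    def get(self, key, default=None):
        return self._data.get(key, default)

    def dump(self):
        with open(PICKLE_PATH, 'wb') as f:    # persist so a restart can reload
            pickle.dump(self._data, f)

    def load(self):
        with open(PICKLE_PATH, 'rb') as f:
            self._data = pickle.load(f)
In a single-process Tornado server this works because all requests share the one dict; under a multi-process WSGI server each process would have its own copy, as discussed above.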

Python threading passing statuses

Basically what I'm trying to do is fetch a couple of websites using proxies and process the data. The problem is that the requests rarely fail in a convincing way; setting socket timeouts wasn't very helpful either, because they often didn't work.
So what I did is:
q = Queue()
s = ['google.com', 'ebay.com']  # And so on
for item in s:
    q.put(item)

def worker():
    item = q.get()
    data = fetch(item)  # This is the buggy part
    # Process the data, yadayada

for i in range(workers):
    t = InterruptableThread(target=worker)
    t.start()

# Somewhere else
if WorkerHasLivedLongerThanTimeout:
    worker.terminate()
(InterruptableThread class)
The problem is that I only want to kill threads which are still stuck on the fetching part. Also, I want the item to be returned to the queue. I.e.:
def worker():
    self.status = 0
    item = q.get()
    data = fetch(item)  # This is the buggy part
    self.status = 1  # Don't kill me now, bro!
    # Process the data, yadayada

# Somewhere else
if WorkerHasLivedLongerThanTimeout and worker.status != 1:
    q.put(worker.item)
    worker.terminate()
How can this be done?
edit: breaking news; see below · · · ······
I decided recently that I wanted to do something pretty similar, and what came out of it was the pqueue_fetcher module. It ended up being mainly a learning endeavour: I learned, among other things, that it's almost certainly better to use something like twisted than to try to kill Python threads with any sort of reliability.
That being said, there's code in that module that more or less answers your question. It basically consists of a class whose objects can be set up to get locations from a priority queue and feed them into a fetch function that's supplied at object instantiation. If the location's resources get successfully received before their thread is killed, they get forwarded on to the results queue; otherwise they're returned to the locations queue with a downgraded priority. Success is determined by a passed-in function that defaults to bool.
Along the way I ended up creating the terminable_thread module, which just packages the most mature variation I could find of the code you linked to as InterruptableThread. It also adds a fix for 64-bit machines, which I needed in order to use that code on my ubuntu box. terminable_thread is a dependency of pqueue_fetcher.
Probably the biggest stumbling block I hit is that raising an asynchronous exception, as terminable_thread and the InterruptableThread you mentioned do, can have some weird results. In the test suite for pqueue_fetcher, the fetch function blocks by calling time.sleep. I found that if a thread is terminate()d while blocking like that, and the sleep call is the last (or not even the last) statement in a nested try block, execution will actually bounce to the except clause of the outer try block, even if the inner one has an except matching the raised exception. I'm still sort of shaking my head in disbelief, but there's a test case in pqueue_fetcher that reenacts this. I believe "leaky abstraction" is the correct term here.
I wrote a hacky workaround that just does some random thing (in this case getting a value from a generator) to break up the "atomicity" (not sure if that's actually what it is) of that part of the code. This workaround can be overridden via the fission parameter to pqueue_fetcher.Fetcher. It (i.e. the default one) seems to work, but certainly not in any way that I would consider particularly reliable or portable.
So my call, after discovering this interesting piece of data, was to henceforth avoid using this technique (i.e. calling ctypes.pythonapi.PyThreadState_SetAsyncExc) altogether.
In any case, this still won't work if you need to guarantee that any request whose entire data set has been received (and, e.g., acknowledged to the server) gets forwarded on to results. In order to be sure of that, you have to guarantee that the bit that does the last network transaction and the forwarding is guarded from being interrupted, without guarding the entire retrieval operation from being interrupted (since this would prevent timeouts from working). And in order to do that you need to basically rewrite the retrieval operation (i.e. the socket code) to be aware of whichever exception you're going to raise with terminable_thread.Thread.raise_exc.
I've yet to learn twisted, but being the Premier Python Asynchronous Networking Framework©™®, I expect it must have some elegant or at least workable way of dealing with such details. I'm hoping it provides a parallel way to implement fetching from non-network sources (e.g. a local filestore, or a DB, or an etc.), since I'd like to build an app that can glean data from a variety of sources in a medium-agnostic way.
Anyhow, if you're still intent on trying to work out a way to manage the threads yourself, you can perhaps learn from my efforts. Hope this helps.
· · · · ······ this just in:
I've realized that the tests that I thought had stabilized have actually not, and are giving inconsistent results. This appears to be related to the issues mentioned above with exception handling and the use of the fission function. I'm not really sure what's going on with it, and don't plan to investigate in the immediate future unless I end up having a need to actually do things this way.
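If you do stay with plain threads, one hedged alternative (a sketch only; fetch and process are placeholders for your own code) is to run each fetch in its own daemon thread with a join timeout, requeue the item if it doesn't finish in time, and simply abandon the stuck thread rather than killing it:
import Queue
import threading

FETCH_TIMEOUT = 30  # seconds; illustrative

q = Queue.Queue()

def fetch_with_timeout(item):
    # Run fetch(item) in its own daemon thread and give up after FETCH_TIMEOUT.
    result = {}

    def run():
        result['data'] = fetch(item)   # placeholder fetch function

    t = threading.Thread(target=run)
    t.daemon = True                    # an abandoned fetch won't block shutdown
    t.start()
    t.join(FETCH_TIMEOUT)
    if 'data' not in result:
        q.put(item)                    # too slow (or failed): give it another try
        return None
    return result['data']

def worker():
    while True:
        item = q.get()
        data = fetch_with_timeout(item)
        if data is not None:
            process(data)              # placeholder processing
The obvious cost is that an abandoned fetch keeps its socket and thread alive until it eventually returns or the process exits, which is part of why the twisted route mentioned above is attractive.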

How to synchronize a python dict with multiprocessing

I am using Python 2.6 and the multiprocessing module for multi-threading. Now I would like to have a synchronized dict (where the only atomic operation I really need is the += operator on a value).
Should I wrap the dict with a multiprocessing.sharedctypes.synchronized() call? Or is another way the way to go?
Intro
There seem to be a lot of armchair suggestions and no working examples. None of the answers listed here even suggests using multiprocessing, which is quite disappointing and disturbing. As Python lovers we should support our built-in libraries, and while parallel processing and synchronization are never a trivial matter, I believe they can be made trivial with proper design. This is becoming extremely important with modern multi-core architectures and cannot be stressed enough!

That said, I am far from satisfied with the multiprocessing library, as it is still in its infancy, with quite a few pitfalls and bugs, and it is geared towards functional programming (which I detest). Currently I still prefer the Pyro module (which is way ahead of its time) over multiprocessing, due to multiprocessing's severe limitation of being unable to share newly created objects while the server is running. The "register" class method of the manager objects will only actually register an object BEFORE the manager (or its server) is started. Enough chatter, more code:
Server.py
from multiprocessing.managers import SyncManager

class MyManager(SyncManager):
    pass

syncdict = {}
def get_dict():
    return syncdict

if __name__ == "__main__":
    MyManager.register("syncdict", get_dict)
    manager = MyManager(("127.0.0.1", 5000), authkey="password")
    manager.start()
    raw_input("Press any key to kill server".center(50, "-"))
    manager.shutdown()
In the above code example, Server.py makes use of multiprocessing's SyncManager, which can supply synchronized shared objects. This code will not work if run in the interactive interpreter, because the multiprocessing library is quite touchy about how to find the "callable" for each registered object. Running Server.py will start a customized SyncManager that shares the syncdict dictionary for use by multiple processes, and clients can connect to it either on the same machine or, if it is run on an IP address other than loopback, from other machines. In this case the server runs on loopback (127.0.0.1) on port 5000. The authkey parameter provides authenticated connections when manipulating syncdict. When any key is pressed, the manager is shut down.
Client.py
from multiprocessing.managers import SyncManager
import sys, time

class MyManager(SyncManager):
    pass

MyManager.register("syncdict")

if __name__ == "__main__":
    manager = MyManager(("127.0.0.1", 5000), authkey="password")
    manager.connect()
    syncdict = manager.syncdict()

    print "dict = %s" % (dir(syncdict))
    key = raw_input("Enter key to update: ")
    inc = float(raw_input("Enter increment: "))
    sleep = float(raw_input("Enter sleep time (sec): "))

    try:
        # if the key doesn't exist create it
        if not syncdict.has_key(key):
            syncdict.update([(key, 0)])
        # increment key value every sleep seconds
        # then print syncdict
        while True:
            syncdict.update([(key, syncdict.get(key) + inc)])
            time.sleep(sleep)
            print "%s" % (syncdict)
    except KeyboardInterrupt:
        print "Killed client"
The client must also create a customized SyncManager, registering "syncdict", this time without passing in a callable to retrieve the shared dict. It then uses the customized SyncManager to connect using the loopback IP address (127.0.0.1) on port 5000 and an authkey, establishing an authenticated connection to the manager started in Server.py. It retrieves the shared dict syncdict by calling the registered callable on the manager. It prompts the user for the following:
The key in syncdict to operate on
The amount to increment the value accessed by the key every cycle
The amount of time to sleep per cycle in seconds
The client then checks to see if the key exists. If it doesn't it creates the key on the syncdict. The client then enters an "endless" loop where it updates the key's value by the increment, sleeps the amount specified, and prints the syncdict only to repeat this process until a KeyboardInterrupt occurs (Ctrl+C).
Annoying problems
The Manager's register methods MUST be called before the manager is started otherwise you will get exceptions even though a dir call on the Manager will reveal that it indeed does have the method that was registered.
All manipulations of the dict must be done with methods and not dict assignments (syncdict["blast"] = 2 will fail miserably because of the way multiprocessing shares custom objects)
Using SyncManager's dict method would alleviate annoying problem #2 except that annoying problem #1 prevents the proxy returned by SyncManager.dict() being registered and shared. (SyncManager.dict() can only be called AFTER the manager is started, and register will only work BEFORE the manager is started so SyncManager.dict() is only useful when doing functional programming and passing the proxy to Processes as an argument like the doc examples do)
The server AND the client both have to register even though intuitively it would seem like the client would just be able to figure it out after connecting to the manager (Please add this to your wish-list multiprocessing developers)
Closing
I hope you enjoyed this quite thorough and slightly time-consuming answer as much as I have. I was having a great deal of trouble getting straight in my mind why I was struggling so much with the multiprocessing module where Pyro makes it a breeze, and now, thanks to this answer, I have hit the nail on the head. I hope this is useful to the Python community as input on how to improve the multiprocessing module, as I do believe it has a great deal of promise but in its infancy falls short of what is possible. Despite the annoying problems described, I think this is still quite a viable alternative, and it is pretty simple. You could also use SyncManager.dict() and pass it to Processes as an argument the way the docs show, and it would probably be an even simpler solution depending on your requirements; it just feels unnatural to me.
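For completeness, a rough sketch of that simpler SyncManager.dict() route (Python 2, matching the question; the key name and counts are made up, and note that the read-modify-write still needs a manager Lock to make += safe):
from multiprocessing import Process, Manager

def worker(shared, lock, key, inc, n):
    for _ in range(n):
        lock.acquire()
        try:
            # the read-modify-write is only safe while we hold the lock
            shared[key] = shared.get(key, 0) + inc
        finally:
            lock.release()

if __name__ == "__main__":
    manager = Manager()
    shared = manager.dict()
    lock = manager.Lock()
    procs = [Process(target=worker, args=(shared, lock, "count", 1, 1000))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print shared["count"]  # expect 4000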
I would dedicate a separate process to maintaining the "shared dict": just use e.g. xmlrpclib to make that tiny amount of code available to the other processes, exposing via xmlrpclib e.g. a function taking key, increment to perform the increment and one taking just the key and returning the value, with semantic details (is there a default value for missing keys, etc, etc) depending on your app's needs.
Then you can use any approach you like to implement the shared-dict dedicated process: all the way from a single-threaded server with a simple dict in memory, to a simple sqlite DB, etc, etc. I suggest you start with code "as simple as you can get away with" (depending on whether you need a persistent shared dict, or persistence is not necessary to you), then measure and optimize as and if needed.
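A minimal sketch of that dedicated-process approach (hedged: the port, key names, and exposed functions are illustrative; it uses the standard library's SimpleXMLRPCServer, which by default handles one request at a time, so each exposed call is effectively atomic):
# Server process: owns the dict; other processes talk to it over XML-RPC.
from SimpleXMLRPCServer import SimpleXMLRPCServer

shared = {}

def increment(key, amount):
    shared[key] = shared.get(key, 0) + amount
    return shared[key]

def get_value(key):
    return shared.get(key, 0)

server = SimpleXMLRPCServer(("127.0.0.1", 9000), logRequests=False)
server.register_function(increment)
server.register_function(get_value)
server.serve_forever()
From any other process, something like xmlrpclib.ServerProxy("http://127.0.0.1:9000").increment("hits", 1) would then perform the increment.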
In response to the concurrent-write issue: I did some very quick research and found that this article suggests a lock/semaphore solution (http://effbot.org/zone/thread-synchronization.htm).
While the example isn't specific to a dictionary, I'm pretty sure you could code a class-based wrapper object to help you work with dictionaries based on this idea; a rough sketch follows the snippet below.
If I had a requirement to implement something like this in a thread-safe manner, I'd probably use the Python semaphore solution (assuming my earlier merge technique wouldn't work). I believe that semaphores generally slow down thread efficiency due to their blocking nature.
From the site:
A semaphore is a more advanced lock mechanism. A semaphore has an internal counter rather than a lock flag, and it only blocks if more than a given number of threads have attempted to hold the semaphore. Depending on how the semaphore is initialized, this allows multiple threads to access the same code section simultaneously.
semaphore = threading.BoundedSemaphore()
semaphore.acquire() # decrements the counter
... access the shared resource; work with dictionary, add item or whatever.
semaphore.release() # increments the counter
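Building on that, a rough sketch of the class-based wrapper mentioned above (hedged: it only protects against concurrent access by threads within one process, which is what the effbot snippet covers; it does nothing for separate processes):
import threading

class LockedCounterDict(object):
    """Dict-like wrapper where increments and reads are guarded by a lock."""
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def increment(self, key, amount=1):
        with self._lock:                      # one thread at a time
            self._data[key] = self._data.get(key, 0) + amount
            return self._data[key]

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)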
Is there a reason that the dictionary needs to be shared in the first place? Could you have each thread maintain its own instance of a dictionary and either merge at the end of the thread's processing, or periodically use a callback to merge the copies of the individual thread dictionaries together?
I don't know exactly what you are doing, so keep in mind that my written plan may not work verbatim. What I'm suggesting is more of a high-level design idea.

Python: deferToThread XMLRPC Server - Twisted - Cherrypy?

This question is related to others I have asked on here, mainly regarding sorting huge sets of data in memory.
Basically this is what I want / have:
A Twisted XMLRPC server is running. This server keeps several (32) instances of the Foo class in memory. Each Foo instance contains a list bar (which will contain several million records). There is a service that retrieves data from a database and passes it to the XMLRPC server. The data is basically a dictionary, with keys corresponding to each Foo instance, and values that are a list of dictionaries, like so:
data = {'foo1':[{'k1':'v1', 'k2':'v2'}, {'k1':'v1', 'k2':'v2'}], 'foo2':...}
Each Foo instance is then passed the value corresponding to its key, and the Foo.bar lists are updated and sorted.
class XMLRPCController(xmlrpc.XMLRPC):
    def __init__(self):
        ...
        self.foos = {'foo1':Foo(), 'foo2':Foo(), 'foo3':Foo()}
        ...
    def update(self, data):
        for k, v in data:
            threads.deferToThread(self.foos[k].processData, v)
    def getData(self, fookey):
        # return first 10 records of specified Foo.bar
        return self.foos[fookey].bar[0:10]

class Foo():
    def __init__(self):
        bar = []
    def processData(self, new_bar_data):
        for record in new_bar_data:
            # do processing, and add record, then sort
            # BUNCH OF PROCESSING CODE
            self.bar.sort(reverse=True)
The problem is that when the update function is called on the XMLRPCController with a lot of records (say 100K+), it stops responding to my getData calls until all 32 Foo instances have completed the processData method. I thought deferToThread would work, but I think I am misunderstanding where the problem is.
Any suggestions? I am open to using something else, like CherryPy, if it supports the required behavior.
EDIT
@Troy: This is how the reactor is set up
reactor.listenTCP(port_no, server.Site(XMLRPCController))
reactor.run()
As far as the GIL goes, would it be a viable option to change the
sys.setcheckinterval()
value to something smaller, so that the lock on the data is released more often and it can be read?
The easiest way to keep the app responsive is to break up the CPU-intensive processing into smaller chunks, while letting the twisted reactor run in between, for example by calling reactor.callLater(0, process_next_chunk) to advance to the next chunk. Effectively you implement cooperative multitasking yourself.
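A rough sketch of that chunking idea (hedged: handle() and the chunk size are placeholders, and this assumes the records are already in memory as a list):
from twisted.internet import reactor

def process_in_chunks(records, chunk_size=1000, on_done=None):
    # Process a big list without blocking the reactor for its whole duration.
    def next_chunk(index):
        for record in records[index:index + chunk_size]:
            handle(record)                      # placeholder per-record work
        if index + chunk_size < len(records):
            # yield back to the reactor, then continue with the next slice
            reactor.callLater(0, next_chunk, index + chunk_size)
        elif on_done is not None:
            on_done()
    reactor.callLater(0, next_chunk, 0)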
Another way would be to use separate processes to do the work; then you will benefit from multiple cores. Take a look at Ampoule (https://launchpad.net/ampoule); it provides an API similar to deferToThread.
I don't know how long your processData method runs nor how you're setting up your twisted reactor. By default, the twisted reactor has a thread pool of between 0 and 10 threads. You may be trying to defer as many as 32 long-running calculations to as many as 10 threads. This is sub-optimal.
You also need to ask what role the GIL is playing in updating all these collections.
Edit:
Before you make any serious changes to your program (like calling sys.setcheckinterval()) you should probably run it using the profiler or the python trace module. These should tell you what methods are using all your time. Without the right information, you can't make the right changes.
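For example (a sketch; the entry point and the output file name are made up), cProfile from the standard library will show where the time actually goes:
import cProfile
import pstats

# Profile one call to the suspected hot path and print the top offenders.
cProfile.run('controller.update(sample_data)', 'update.prof')  # names are illustrative
stats = pstats.Stats('update.prof')
stats.sort_stats('cumulative').print_stats(20)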
