Memcache client with connection pool for Python?

The python-memcached memcache client is written so that each thread gets its own connection. This keeps the python-memcached code simple, which is nice, but it presents a problem if your application has hundreds or thousands of threads (or if you run lots of applications), because you will quickly run out of available connections on the memcached server.
Typically this kind of problem is solved with a connection pool, and indeed the Java memcache libraries I have seen implement connection pooling. After reading the documentation for various Python memcache libraries, it seems the only one offering a connection pool is pylibmc, but it has two problems for me: it is not pure Python, and it does not seem to have a timeout for reserving a client from the pool. While not being pure Python is perhaps not a deal breaker, not having a timeout certainly is. It is also not clear how those pools would work with, for example, dogpile.cache.
Preferably I would like to find a pure Python memcache client with connection pooling that would work with dogpile.cache, but I am open to other suggestions as well. I'd rather avoid changing the application logic, though (like pushing all memcache operations into fewer background threads).

A coworker came up with an idea that seems to work well enough for our use case, so I am sharing it here. The basic idea is that you create the number of memcache clients you want to use up front, put them in a queue, and whenever you need a memcache client you pull one from the queue. Because the Queue.Queue get() method has an optional timeout parameter, you can also handle the case where you can't get a client in time. You also need to deal with the memcache client's use of threading.local.
Here is how it could work in code (note that I haven't actually run this exact version so there might be some issues, but this should give you an idea if the textual description did not make sense to you):
import Queue

import memcache

# See http://stackoverflow.com/questions/9539052/python-dynamically-changing-base-classes-at-runtime-how-to
# Don't inherit client from threading.local so that we can reuse clients in
# different threads
memcache.Client = type('Client', (object,), dict(memcache.Client.__dict__))

# Client.__init__ references local, so need to replace that, too
class Local(object): pass
memcache.local = Local


class PoolClient(object):
    '''Pool of memcache clients that has the same API as memcache.Client'''

    def __init__(self, pool_size, pool_timeout, *args, **kwargs):
        self.pool_timeout = pool_timeout
        self.queue = Queue.Queue()
        for _i in range(pool_size):
            self.queue.put(memcache.Client(*args, **kwargs))

    def __getattr__(self, name):
        return lambda *args, **kw: self._call_client_method(name, *args, **kw)

    def _call_client_method(self, name, *args, **kwargs):
        try:
            client = self.queue.get(timeout=self.pool_timeout)
        except Queue.Empty:
            return
        try:
            return getattr(client, name)(*args, **kwargs)
        finally:
            self.queue.put(client)
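For example, usage could look like the following sketch (server address, pool size, timeout and keys are illustrative values, not from the original post):

# Illustrative usage sketch; all values below are made up.
pool = PoolClient(5, 1.0, ['127.0.0.1:11211'])

pool.set('greeting', 'hello')   # dispatched via __getattr__/_call_client_method
print pool.get('greeting')      # 'hello', or None if no client was free within the timeout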

Many thanks to @Heikki Toivonen for providing ideas for the problem! However, I was not sure how exactly to call the get() method in order to use a memcache client from the PoolClient; calling get() directly with an arbitrary name gave AttributeError or MemcachedKeyNoneError.
By combining @Heikki Toivonen's approach with pylibmc's pooling API, I came up with the following code and am posting it here for the convenience of future users (I have debugged this code and it should be ready to run):
import Queue, memcache
from contextlib import contextmanager

memcache.Client = type('Client', (object,), dict(memcache.Client.__dict__))

# Client.__init__ references local, so need to replace that, too
class Local(object): pass
memcache.local = Local


class PoolClient(object):
    '''Pool of memcache clients that has the same API as memcache.Client'''

    def __init__(self, pool_size, pool_timeout, *args, **kwargs):
        self.pool_timeout = pool_timeout
        self.queue = Queue.Queue()
        for _i in range(pool_size):
            self.queue.put(memcache.Client(*args, **kwargs))
        print "pool_size:", pool_size, ", Queue_size:", self.queue.qsize()

    @contextmanager
    def reserve(self):
        '''Reference: http://sendapatch.se/projects/pylibmc/pooling.html#pylibmc.ClientPool'''
        client = self.queue.get(timeout=self.pool_timeout)
        try:
            yield client
        finally:
            self.queue.put(client)
            print "Queue_size:", self.queue.qsize()


# Initialise an instance of PoolClient
mc_client_pool = PoolClient(5, 0, ['127.0.0.1:11211'])

# Use a memcache client from the pool in your app
with mc_client_pool.reserve() as mc_client:
    mc_client.set('some_key', 'some_value')  # do your work here

Related

When using airflow I get "UserWarning: MongoClient opened before fork" + best practice tips to instantiate mongo client in python

I have created a Python flow that executes various steps. I also use MongoDB, which for now is used for configuration purposes, but I will use it for persistence further on as well.
When I execute my code from PyCharm all is good, but when I execute it through Airflow I get the fork warning "UserWarning: MongoClient opened before fork. Create MongoClient only after forking."
At first I thought the problem was that I open many instances of the Mongo client, so I used the singleton pattern so that the connection is instantiated only once (please see the code below).
The only way to get rid of the warning is by adding the connect=False parameter, as you can see commented out in my code, but that seems to be a workaround and not a solid solution, according to the documentation.
So what is the problem here? Does it maybe have something to do with Airflow, since I am not using the mongo_hook?
In addition, please let me know whether using the singleton pattern to instantiate the Mongo client once is good practice. The client is called from various modules.
Note that Mongo and Airflow each run in separate Docker containers.
def singleton(class_):
    instances = {}

    def get_instance(*args, **kwargs):
        if class_ not in instances:
            instances[class_] = class_(*args, **kwargs)
        return instances[class_]

    return get_instance


@singleton
class MongoPersistence(Persistence):
    def __init__(self, driver, host, user, password, port, db):
        self._uri = '{driver}://{host}:{port}'.format(driver=driver, host=host, port=port)
        self._client = MongoClient(self._uri,
                                   serverSelectionTimeoutMS=3000,  # 3 second timeout
                                   username=user,
                                   password=password,
                                   # connect=False
                                   )

    def find_one(self, **kwargs):
        return self._client[kwargs.get('db_name')][kwargs.get('collection_name')].find_one(kwargs.get('query'))
Thank you,
Dina
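For what it's worth, the approach usually recommended for this warning is to construct the MongoClient only after any fork, for example lazily and once per process. Below is a hedged sketch of that idea; the helper name get_mongo_client and the module-level _clients dict are illustrative, not from the question:

import os
from pymongo import MongoClient

_clients = {}

def get_mongo_client(uri):
    # Create the client lazily in the current (possibly forked) process,
    # so it is never opened before a fork.
    pid = os.getpid()
    if pid not in _clients:
        _clients[pid] = MongoClient(uri, serverSelectionTimeoutMS=3000)
    return _clients[pid]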

Why could using Gunicorn with gevent increase the query time to Redis/the database?

Under actual production load (a web app) with a Redis server (v4.x), when using gunicorn with worker_class gevent the Redis query time roughly triples. Database access also got worse (but not as much, only about 50%). I'm trying to figure out why this would happen. Any ideas? The app is very IO-bound, with lots of database queries and Redis accesses for every single request, which should be the perfect scenario for gevent.
(Chart: moving from SYNC to GEVENT workers at ~11 A.M.)
Would the monkey patching of the socket module decrease performance somehow? I tried fine-tuning worker_connections without success; even an extremely low value of just 2 (almost sync again) gives me the same bad results. Am I missing some gotcha about how gevent and its pseudo-threads work?
Disclaimer: I'm using NewRelic to monitor performance, and the stack is redis-py/Django/MySQL. I tried some tweaks like using BlockingConnectionPool for Redis, but database access performance also decreased, so Redis is not the only problem. The worker count is 5 (CPUs * 2 + 1). I also had tons of GreenletExit/ConnectionError[redis] errors at random times, which were minimized by moving worker_connections from the 2k default to 10.
Here is an example of a Redis connection after monkey patching:
import json
import traceback

import redis
from gevent import sleep as gsleep  # assuming gsleep is gevent.sleep

# redisconfig is assumed to be defined elsewhere in the app

_rconn = None

def redisconn():
    global _rconn
    if _rconn is None:
        try:
            _rconn = redis.StrictRedis(host=redisconfig['host'], port=redisconfig['port'], db=redisconfig['db'])
        except:
            traceback.print_exc()
            _rconn = None
    return _rconn


class RedisCache(object):
    def __init__(self):
        # In case of a lost connection, sit and wait until it is online again
        global _rconn
        if not _rconn:
            while not _rconn:
                try:
                    redisconn()
                except:
                    print('Attempting Connection To Redis...')
                    gsleep(1)
        self.r = _rconn
        self.rpool = self.r.connection_pool

    def get(self, key):
        return self.r.get(key)

    def set(self, key, meta, expire=86400, nx=True):
        return self.r.set(key, json.loads(meta), ex=expire, nx=nx)

    def connections(self):
        return self.rpool._created_connections

    def inuse(self):
        return self.rpool._in_use_connections
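Not an answer to the question above, but one knob that may be worth a closer look (a sketch; max_connections and timeout are made-up values, not taken from the post): give redis-py an explicit BlockingConnectionPool, so that greenlets wait for a bounded set of sockets instead of each opening its own connection under load:

import redis

# Illustrative values only; tune per gunicorn worker.
pool = redis.BlockingConnectionPool(
    host='127.0.0.1',
    port=6379,
    db=0,
    max_connections=50,  # upper bound on sockets per worker process
    timeout=5,           # seconds a greenlet waits for a free connection
)
r = redis.StrictRedis(connection_pool=pool)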

Pyro4 RPC blocking

I am currently doing development on a high-performance numeric calculation system that posts its results to a web server. We use Flask in a separate process to serve the web page, to keep everything interactive, and use WebSockets to send the data to a JS plotter. The calculations are also split across multiple processes using multiprocessing for performance reasons.
We are currently using Pyro4 to get parameters updated in the calculations from the server. However, if the number of updates per second to our Pyro4 settings object gets high, it starts blocking and makes it impossible to update any parameters until we restart the server. The proxy calls are made inside an async WebSocket callback.
We are not getting any tracebacks or exceptions, which makes debugging this tricky. Does anyone have a lot of experience with Pyro in this context?
Pyro daemon init code:
Pyro4.config.SERVERTYPE = "multiplex"
daemon = Pyro4.Daemon()
ns = Pyro4.locateNS()
settings = cg.Settings(web_opt, src_opt, rec_opt, det_opt)
uri = daemon.register(settings)
ns.register("cg.settings", uri)
daemon.requestLoop()
Settings class:
class Settings(object):
    def __init__(self, web_opt, src_opt, rec_opt, det_opt):
        self.pipes = [web_opt, src_opt, rec_opt, det_opt]

    def update(self, update):
        for pipe in self.pipes:
            pipe.send(update)
Proxy call:
def onMessage(self, payload, isBinary):
    data = json.loads(payload)
    self.factory.content.set_by_uuid(data['id'], data['value'], self.client_address)
    settings = Pyro4.Proxy("PYRONAME:cg.settings")
    values = self.factory.content.values
    settings.update(values)
    settings._pyroRelease()
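One thing that may be worth trying (a sketch, not a confirmed diagnosis): every onMessage call above creates and releases a new PYRONAME proxy, which means a name-server lookup and a fresh connection per message. Creating the proxy once and reusing it avoids that overhead; the names settings_proxy and values below are illustrative:

import Pyro4

# Hypothetical sketch: resolve and connect once, reuse for every message.
settings_proxy = Pyro4.Proxy("PYRONAME:cg.settings")

def on_message(values):
    # values stands in for self.factory.content.values
    settings_proxy.update(values)

On the daemon side it could also be worth comparing Pyro4.config.SERVERTYPE = "thread" against "multiplex", since the multiplex server handles incoming calls one at a time.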

Celery task schedule (Ensuring a task is only executed one at a time)

I have a task, somewhat like this:
@task()
def async_work(info):
    ...
At any moment, I may call async_work with some info. For some reason, I need to make sure that only one async_work is running at a time; other calls must wait.
So I came up with the following code:
is_locked = False

@task()
def async_work(info):
    while is_locked:
        pass
    is_locked = True
    ...
    is_locked = False
But it complains that it's invalid to access local variables...
How can I solve this?
It is invalid to access local variables since you can have several Celery workers running tasks, and those workers might even be on different hosts. So, basically, there are as many is_locked variable instances as there are Celery workers running your async_work task. Thus, even though your code won't raise any errors, you won't get the desired effect with it.
To achieve your goal you need to configure Celery to run only one worker. Since that worker can then process only a single task at any given time, you get what you need.
EDIT:
According to Workers Guide > Concurrency:
By default multiprocessing is used to perform concurrent execution of
tasks, but you can also use Eventlet. The number of worker
processes/threads can be changed using the --concurrency argument
and defaults to the number of CPUs available on the machine.
Thus you need to run the worker like this:
$ celery worker --concurrency=1
EDIT 2:
Surprisingly there's another solution; moreover, it is even in the official docs: see the Ensuring a task is only executed one at a time article.
You probably don't want to use concurrency=1 for your Celery workers; you want your tasks to be processed concurrently. Instead you can use some kind of locking mechanism. Just ensure the cache timeout is bigger than the time it takes to finish your task.
Redis
import redis
from contextlib import contextmanager

redis_client = redis.Redis(host='localhost', port=6378)

@contextmanager
def redis_lock(lock_name):
    """Yield True if lock_name was not already set in Redis (i.e. the lock
    was acquired), otherwise yield a falsy value. Enables a sort of lock
    functionality.
    """
    status = redis_client.set(lock_name, 'lock', nx=True)
    try:
        yield status
    finally:
        if status:
            # Only delete the key if we were the ones who acquired the lock
            redis_client.delete(lock_name)

@task()
def async_work(info):
    with redis_lock('my_lock_name') as acquired:
        if acquired:
            do_some_work()
Memcache
Example inspired by celery documentation
from contextlib import contextmanager
from django.core.cache import cache

@contextmanager
def memcache_lock(lock_name):
    status = cache.add(lock_name, 'lock')
    try:
        yield status
    finally:
        if status:
            # Only delete the key if we were the ones who acquired the lock
            cache.delete(lock_name)

@task()
def async_work(info):
    with memcache_lock('my_lock_name') as acquired:
        if acquired:
            do_some_work()
I have implemented a decorator to handle this. It's based on Ensuring a task is only executed one at a time from the official Celery docs.
It uses the function's name and its args and kwargs to create a lock_id, which is set/get in Django's cache layer (I have only tested this with Memcached but it should work with Redis as well). If the lock_id is already set in the cache it will put the task back on the queue and exit.
import functools
from time import monotonic

from django.core.cache import cache

CACHE_LOCK_EXPIRE = 30


def no_simultaneous_execution(f):
    """
    Decorator that prevents a task from being executed with the
    same *args and **kwargs more than once at a time.
    """
    @functools.wraps(f)
    def wrapper(self, *args, **kwargs):
        # Create the lock_id used as the cache key
        lock_id = '{}-{}-{}'.format(self.name, args, kwargs)
        # Timeout with a small margin, so we leave the lock deletion
        # to the cache if it's close to being auto-removed/expired
        timeout_at = monotonic() + CACHE_LOCK_EXPIRE - 3
        # Try to acquire the lock, or put the task back on the queue
        lock_acquired = cache.add(lock_id, True, CACHE_LOCK_EXPIRE)
        if not lock_acquired:
            self.apply_async(args=args, kwargs=kwargs, countdown=3)
            return
        try:
            f(self, *args, **kwargs)
        finally:
            # Release the lock
            if monotonic() < timeout_at:
                cache.delete(lock_id)
    return wrapper
You would then apply it on any task as the first decorator:
@shared_task(bind=True, base=MyTask)
@no_simultaneous_execution
def sometask(self, some_arg):
    ...

Python SimpleXMLRPCServer: get user IP and simple authentication

I am trying to make a very simple XML-RPC server with Python that provides basic authentication plus the ability to obtain the connected user's IP. Let's take the example provided at http://docs.python.org/library/xmlrpclib.html:
import xmlrpclib
from SimpleXMLRPCServer import SimpleXMLRPCServer

def is_even(n):
    return n % 2 == 0

server = SimpleXMLRPCServer(("localhost", 8000))
server.register_function(is_even, "is_even")
server.serve_forever()
So now, the first idea is to make the user supply credentials and process them before allowing them to use the functions. I need very simple authentication, for example just a code. Right now what I'm doing is forcing the user to supply this code in the function call and testing it with an if-statement.
The second idea is to be able to get the user's IP when they call a function, or to store it when they connect to the server.
Moreover, I already have an Apache server running, and it might be simpler to integrate this into it.
What do you think?
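For reference, the "code supplied in every call" approach described above might look like the following sketch (SECRET_CODE and the extra parameter are illustrative, not from the original post):

from SimpleXMLRPCServer import SimpleXMLRPCServer

SECRET_CODE = "letmein"  # illustrative shared secret

def is_even(code, n):
    # Reject the call unless the client supplied the shared code
    if code != SECRET_CODE:
        raise ValueError("invalid authentication code")
    return n % 2 == 0

server = SimpleXMLRPCServer(("localhost", 8000))
server.register_function(is_even, "is_even")
server.serve_forever()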
This is a related question that I found helpful:
IP address of client in Python SimpleXMLRPCServer?
What worked for me was to grab the client_address in an overridden finish_request method of the server, stash it in the server itself, and then access this in an overridden server _dispatch routine. You might be able to access the server itself from within the method, too, but I was just trying to add the IP address as an automatic first argument to all my method calls. The reason I used a dict was because I'm also going to add a session token and perhaps other metadata as well.
from xmlrpc.server import DocXMLRPCServer
from socketserver import BaseServer

class NewXMLRPCServer(DocXMLRPCServer):
    def finish_request(self, request, client_address):
        self.client_address = client_address
        BaseServer.finish_request(self, request, client_address)

    def _dispatch(self, method, params):
        metadata = {'client_address': self.client_address[0]}
        newParams = (metadata,) + params
        return DocXMLRPCServer._dispatch(self, method, newParams)
Note this will BREAK introspection functions like system.listMethods(), because those aren't expecting the extra argument. One idea would be to check whether the method name starts with "system." and just pass the regular params in that case.
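A minimal sketch of that idea (reusing the class above; only the "system." prefix check is new, and it is an illustration rather than tested code):

from xmlrpc.server import DocXMLRPCServer
from socketserver import BaseServer

class NewXMLRPCServer(DocXMLRPCServer):
    def finish_request(self, request, client_address):
        self.client_address = client_address
        BaseServer.finish_request(self, request, client_address)

    def _dispatch(self, method, params):
        # Leave introspection calls (system.listMethods, system.methodHelp, ...)
        # untouched, since they don't expect the injected metadata argument.
        if method.startswith('system.'):
            return DocXMLRPCServer._dispatch(self, method, params)
        metadata = {'client_address': self.client_address[0]}
        return DocXMLRPCServer._dispatch(self, method, (metadata,) + params)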
