Does anyone have a minimal working example of how to use uWSGI to share memory across requests in say Django?
I have a large file in proprietary format (not database-compatible) that I need to load for each request.
A post from Instagram got me thinking; it states:
For the application server, we use uWSGI with pre-fork mode to leverage memory sharing between master and worker processes.
How would you set something like that up?
There are multiple ways to handle this:
Share by "abusing" copy-on-write for read-only data
If your data is read-only, you can leverage the fact that uWSGI executes your Python code to build the application object before forking into multiple worker processes. This means all the data that is already loaded before the fork happens will be shared with all your processes.
This can be a great tool because you don't have to do anything multiprocessing-related to benefit from it. But be careful: as soon as any process writes to this data, the affected pages are copied first so that the process gets its own local version.
Django doesn't make this easy because all the views are lazy: Django won't run any code related to your views when the application object is created. To benefit from pre-fork sharing you therefore need to load the data in code outside your views, for instance right before or after you build your application object (like in the gist linked by #john-strood), as in the sketch below.
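For illustration, a minimal sketch of that idea; the module name myapp/loader.py, the file path and load_big_data() are hypothetical stand-ins for your proprietary-format parser. As long as lazy-apps is not enabled, uWSGI imports this in the master process before forking, so the loaded data is shared copy-on-write with every worker:

# myapp/loader.py (hypothetical module)
def load_big_data(path):
    # stand-in for your proprietary-format parser
    with open(path, "rb") as f:
        return f.read()

BIG_DATA = load_big_data("/path/to/proprietary.bin")  # runs once, at import time

# myproject/wsgi.py
import os
from django.core.wsgi import get_wsgi_application

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")

import myapp.loader  # force the load here, in the master, before the fork
application = get_wsgi_application()

# any view can then simply do:
#   from myapp.loader import BIG_DATA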
Use uWSGI cache framework
If you need to write to this data, a first solution is to use uWSGI cache framework. It's fairly easy to use. You need to configure in advance how much memory you need, and then all your processes can read and write to it. You don't have to deal with locking or other multi-processing related issues.
The drawback is that you still incur IO latency between your processes and uWSGI cache's process. This is insignificant for tiny chunks of data, but would be prohibitive for gigabytes.
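For illustration, a minimal sketch, assuming a cache named "mycache" declared in your uWSGI configuration (the cache name, item count and block size are placeholders):

# uwsgi.ini
[uwsgi]
cache2 = name=mycache,items=1000,blocksize=65536

# in your application code, when running under uWSGI:
import uwsgi

uwsgi.cache_update("some-key", b"some bytes", 0, "mycache")  # expires=0 means never
value = uwsgi.cache_get("some-key", "mycache")               # returns None if missing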
Use shared memory manually
As a last resort, if your data is not read-only and you need to load large chunks at every request, so large that even sending them through a Unix socket would take too long, then you need to load your data directly into shared memory. Here uWSGI won't help, and you will have to deal with locking and other multiprocessing issues yourself.
You can refer to multiprocessing's shared memory documentation.
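A minimal sketch of the standard-library route (Python 3.8+), where the block name "my_block" is an arbitrary placeholder and locking is still up to you:

from multiprocessing import shared_memory

# In the process that creates the block:
shm = shared_memory.SharedMemory(create=True, size=1024, name="my_block")
shm.buf[:5] = b"hello"

# In any other process, attach by name:
other = shared_memory.SharedMemory(name="my_block")
print(bytes(other.buf[:5]))  # b'hello'
other.close()

# The creator unlinks the block once everyone is done:
shm.close()
shm.unlink()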
Related
By locking, I don't mean the Object Lock S3 makes available. I'm talking about the following situation:
I have multiple (Python) processes that read and write to a single file hosted on S3; maybe the file is an index of sorts that needs to be updated periodically.
The processes run in parallel, so I want to make sure only a single process can ever write to the file at a given time (to avoid concurrent writes clobbering the data).
If I were writing this to a shared filesystem, I could just use flock as a way to synchronize access to the file, but I can't do that on S3 AFAICT.
What is the easiest way to lock files on AWS S3?
Unfortunately, AWS S3 does not offer a native way of locking objects - there's no flock analogue, as you pointed out. Instead you have a few options:
Use a database
For example, Postgres offers advisory locks. When setting this up, you will need to do the following:
Make sure all processes can access the database.
Make sure the database can handle the incoming connections (if you're running some type of large processing grid, then you may want to put your Postgres instance behind PGBouncer)
Be careful that you do not close the session from the client before you're done with the lock.
There are a few other caveats you need to consider when using advisory locks - from the Postgres documentation:
Both advisory locks and regular locks are stored in a shared memory pool whose size is defined by the configuration variables max_locks_per_transaction and max_connections. Care must be taken not to exhaust this memory or the server will be unable to grant any locks at all. This imposes an upper limit on the number of advisory locks grantable by the server, typically in the tens to hundreds of thousands depending on how the server is configured.
In certain cases using advisory locking methods, especially in queries involving explicit ordering and LIMIT clauses, care must be taken to control the locks acquired because of the order in which SQL expressions are evaluated
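Given those caveats, a rough sketch of serializing the S3 write behind a session-level advisory lock, using psycopg2 (the connection string and the lock key 42 are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=locks user=app")  # placeholder connection string
with conn.cursor() as cur:
    cur.execute("SELECT pg_advisory_lock(%s)", (42,))  # blocks until the lock is granted
    try:
        # read, modify and re-upload the S3 object here
        pass
    finally:
        cur.execute("SELECT pg_advisory_unlock(%s)", (42,))
conn.close()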
Use an external service
I've seen people use something like lockable to solve this issue. From their docs, they seem to have a Python library:
$ pip install lockable-dev
from lockable import Lock
with Lock('my-lock-name'):
    # do stuff
    pass
If you're not using Python, you can still use their service by hitting some HTTP endpoints:
curl https://api.lockable.dev/v1/acquire/my-lock-name
curl https://api.lockable.dev/v1/release/my-lock-name
Edit to clarify my question:
I want to attach a Python service to uWSGI using this feature (I can't understand the examples), and I also want to be able to communicate results between them. Below I give some context and my first thought on the communication matter, hoping for advice or perhaps another approach to take.
I have an already developed python application that uses multiprocessing.Pool to run on demand tasks. The main reason for using the pool of workers is that I need to share several objects between them.
On top of that, I want to have a flask application that triggers tasks from its endpoints.
I've read several questions here on SO looking for possible drawbacks of using flask with python's multiprocessing module. I'm still a bit confused but this answer summarizes well both the downsides of starting a multiprocessing.Pool directly from flask and what my options are.
This answer shows an uWSGI feature to manage daemon/services. I want to follow this approach so I can use my already developed python application as a service of the flask app.
One of my main problems is that I look at the examples and do not know what I need to do next. In other words, how would I start the python app from there?
Another problem is about the communication between the flask app and the daemon process/service. My first thought is to use flask-socketIO to communicate, but then, if my server stops I need to deal with the connection... Is this a good way to communicate between server and service? What are other possible solutions?
Note:
I'm well aware of Celery, and I intend to use it in the near future. In fact, I have an already developed node.js app, in which users perform actions that should trigger specific tasks from the (also) already developed Python application. The thing is, I need a production-ready version as soon as possible, and instead of modifying the Python application, which uses multiprocessing, I thought it would be faster to create a simple Flask server to communicate with node.js over HTTP. This way I would only need to implement a Flask app that instantiates the Python app.
Edit:
Why do I need to share objects?
Simply because creating the objects in question takes too long. Creation takes an acceptable amount of time if done once, but since I'm expecting (maybe) hundreds to thousands of simultaneous requests, having to load every object again is something I want to avoid.
One of the objects is a scikit classifier model, persisted in a pickle file, which takes 3 seconds to load. Each user can create several "job spots", each of which will take over 2k documents to be classified; each document will be uploaded at an unknown point in time, so I need to keep this model loaded in memory (loading it again for every task is not acceptable).
This is one example of a single task.
Edit 2:
I've asked some questions related to this project before:
Bidirectional python-node communication
Python multiprocessing within node.js - Prints on sub process not working
Adding a shared object to a manager.Namespace
As stated, but to clarify: I think the best solution would be to use Celery, but in order to quickly have a production-ready solution, I'm trying to use this uWSGI attach-daemon approach.
I can see the temptation to hang on to multiprocessing.Pool. I'm using it in production as part of a pipeline. But Celery (which I'm also using in production) is much better suited to what you're trying to do, which is distribute work across cores to a resource that's expensive to set up. Have N cores? Start N Celery workers, each of which can load (or maybe lazy-load) the expensive model as a global. When a request comes in to the app, launch a task (e.g., task = predict.delay(args)), wait for it to complete (e.g., result = task.get()) and return a response. You're trading a little time learning Celery for not having to write a bunch of coordination code.
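A rough sketch of that pattern; the broker URL, file names and the predict task here are placeholder assumptions, not your actual code:

# tasks.py
import pickle
from celery import Celery

app = Celery("tasks", broker="amqp://localhost")  # placeholder broker URL

# Loaded once per worker process, at import time, not once per task.
with open("classifier.pkl", "rb") as f:
    MODEL = pickle.load(f)

@app.task
def predict(document):
    # assumes the model accepts the raw document; adapt to your feature pipeline
    return MODEL.predict([document]).tolist()

# In the Flask view:
#   task = predict.delay(doc)
#   result = task.get(timeout=30)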
I am having a hard time figuring out the big picture of how the uWSGI server handles multiple requests with a Django or Pyramid application.
My understanding at the moment is this:
When multiple HTTP requests are sent to the uWSGI server concurrently, the server creates separate processes or threads (copies of itself) for every request (or assigns the requests to them), and every process/thread loads the web application's code (say Django or Pyramid) into memory, executes it and returns the response. Along the way, every copy of the code can access the session, cache or database. There is usually a separate database server, and it can also handle concurrent requests to the database.
So here some questions I am fighting with.
Is my above understanding correct or not?
Do the copies of the code interact with each other somehow, or are they wholly separate from each other?
What about the session or cache? Are they shared between them or are they local to each copy?
How are they created: by the webserver or by copies of python code?
How are responses returned to the requesters: by each process concurrently, or are they put into some kind of queue and sent synchronously?
I have googled these questions and found very interesting answers on Stack Overflow, but I still can't get the whole picture, and the whole process remains a mystery to me. It would be fantastic if someone could explain the whole picture in terms of Django or Pyramid with uWSGI or whatever webserver.
Sorry for asking kind of dumb questions, but they really torment me every night and I am looking forward to your help:)
There's no magic in Pyramid or Django that gets you past process boundaries. The answers depend entirely on the particular server you've selected and the settings you've selected. For example, uWSGI has the ability to run multiple threads and multiple processes. If uWSGI spins up multiple processes, then they will each have their own copies of data, which are not shared unless you take the time to create some IPC (this is why you should keep state in a third party like a database instead of in in-memory objects, which are not shared across processes). Each process initializes a WSGI object (let's call it app) which the server calls via body_iter = app(environ, start_response). This app object is shared across all of the threads in the process and is invoked concurrently, thus it needs to be threadsafe (usually the structures the app uses are either threadlocal or read-only to deal with this, for example a connection pool to the database).
In general the answers to your questions are that things happen concurrently, and objects may or may not be shared based on your server model but in general you should take anything that you want to be shared and store it somewhere that can handle concurrency properly (a database).
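To make the threading part concrete, here is a minimal, hypothetical WSGI callable: one app object per process, invoked concurrently by several threads, so any mutable shared state needs a lock (and is still only shared within that one process):

import threading

_lock = threading.Lock()
_counter = 0  # shared by all threads in this process only

def app(environ, start_response):
    global _counter
    with _lock:  # protect mutable state that all threads touch
        _counter += 1
        n = _counter
    body = ("request %d handled in this process\n" % n).encode()
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]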
The power and weakness of webservers is that they are in principle stateless. This enables them to be massively parallel. So indeed, for each page request a different thread may be spawned. Whether or not this actually happens depends on the configuration settings of your webserver. There's also a cost to spawning many threads, so where possible threads are reused from a thread pool.
Almost all serious webservers have a page cache. So if the same page is requested multiple times, it can be retrieved from the cache. In addition, browsers do their own caching. A webserver has to be clever about what to cache and what not. Static pages aren't a big problem, although they may be replaced, in which case it is quite confusing to still get the old page served from the cache.
One way to defeat the cache is by adding (dummy) parameters to the page request.
The statelessness of the web was initially welcomed as a necessity to achieve scalability, where webpages of busy sites could even be served concurrently from different servers at nearby or remote locations.
However the trend is to have stateful apps. State can be maintained on the browser, easing the burden on the server. If it's maintained on the server it requires the server to know 'who's talking'. One way to do this is saving and recognizing cookies (small identifiable bits of data) on the client.
For databases the story is a bit different. As soon as anything gets stored that relates to a particular user, the application is in principle stateful. While there's no conceptual difference between retaining state on disk and in RAM, traditionally statefulness was left to the database, which in turn used thread pools and load balancing to do its job efficiently.
With the advent of very large internet shops like Amazon and Google, mandatory disk access to achieve statefulness created a performance problem. The answer was in-memory databases. While they may be accessed traditionally using e.g. SQL, they offer much more flexibility in the way data is stored conceptually.
A type of database that enjoys growing popularity is the persistent object store. With this kind of database, while the distinction can still be made formally, the boundary between webserver and database is blurred. Both have their data in RAM (but can swap to disk if needed), and both work with objects rather than flat records as in SQL tables. These objects can be interconnected in complex ways.
In short, there's an explosion of innovative storage / thread pooling / caching / persistence / redundancy / synchronisation technology, driving what has become popularly known as 'the cloud'.
I don't immediately care about FIFO or FILO options, but they might be nice in the future.
What I'm looking for is a nice, fast, simple way to store data (at most a gig, or tens of millions of entries) on disk such that multiple processes can get from and put to it. The entries are just simple 40-byte strings, not Python objects. I don't really need all the functionality of shelve.
I've seen this http://code.activestate.com/lists/python-list/310105/
It looks simple. It needs to be upgraded to the new Queue version.
Wondering if there's something better? I'm concerned that in the event of a power interruption, the entire pickled file becomes corrupt instead of just one record.
Try using Celery. It's not pure Python, as it uses RabbitMQ as a backend, but it's reliable, persistent and distributed, and, all in all, far better than using files or a database in the long run.
I think that PyBSDDB is what you want. You can choose a queue as the access type. PyBSDDB is a Python module based on Oracle Berkeley DB.
It has synchronous access and can be accessed from different processes, although I don't know if that is possible from the Python bindings. Regarding multiple processes writing to the db, I found this thread.
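If it helps, a rough sketch of a fixed-length-record queue with the bsddb3 bindings; the method names (set_re_len, append, consume) are written from memory of the pybsddb API, so treat this as a starting point rather than verified code:

from bsddb3 import db

q = db.DB()
q.set_re_len(40)  # queue databases use fixed-length records (40-byte strings here)
q.open("work.queue", dbtype=db.DB_QUEUE, flags=db.DB_CREATE)

q.append(b"some-entry".ljust(40))  # producer side: appends a record
record = q.consume()               # consumer side: pops the head record, or None if empty
q.close()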
This is a very old question, but persist-queue seems to be a nice tool for this kind of task.
persist-queue implements a file-based queue and a series of sqlite3-based queues. The goal is to achieve the following requirements:
Disk-based: each queued item should be stored on disk in case of any crash.
Thread-safe: can be used by multi-threaded producers and multi-threaded consumers.
Recoverable: Items can be read after process restart.
Green-compatible: can be used in greenlet or eventlet environment.
By default, persist-queue uses the pickle object serialization module to support object instances. Most built-in types, like int, dict and list, can be persisted by persist-queue directly; to support customized objects, please refer to Pickling and unpickling extension types (Python 2) and Pickling Class Instances (Python 3).
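A small usage sketch with the SQLite-backed queue (the path is arbitrary):

import persistqueue

q = persistqueue.SQLiteQueue("./queue-data", auto_commit=True)
q.put("0123456789" * 4)  # a 40-byte string entry
item = q.get()           # items survive a process restart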
Is using files not working?...
Use a journaling file system to recover from power interruptions. That's their purpose.
I have been looking into different systems for creating a fast cache in a web-farm running Python/mod_wsgi. Memcache and others are options ... But I was wondering:
Because I don't need to share data across machines, wanting each machine to maintain a local cache ...
Does Python or WSGI provide a mechanism for Python native shared data in Apache such that the data persists and is available to all threads/processes until the server is restarted? This way I could just keep a cache of objects with concurrency control in the memory space of all running application instances?
If not, it sure would be useful
Thanks!
This is thoroughly covered by the Sharing and Global Data section of the mod_wsgi documentation. The short answer is: No, not unless you run everything in one process, but that's not an ideal solution.
It should be noted that caching is ridiculously easy to do with Beaker middleware, which supports multiple backends including memcache.
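For example, a hedged sketch of using Beaker's cache manager directly (rather than as WSGI middleware), with an in-process memory backend; the namespace and option values are placeholders:

from beaker.cache import CacheManager
from beaker.util import parse_cache_config_options

cache = CacheManager(**parse_cache_config_options({
    "cache.type": "memory",   # or "ext:memcached" plus a cache.url
    "cache.expire": 300,
}))

@cache.cache("expensive_things", expire=300)
def expensive_lookup(key):
    # pretend this is slow; repeat calls with the same key hit the cache
    return key.upper()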
There's Django's thread-safe in-memory cache back-end, see here. It's cPickle-based, and although it's designed for use with Django, it has minimal dependencies on the rest of Django and you could easily refactor it to remove these. Obviously each process would get its own cache, shared between its threads; if you want a cache shared by all processes on the same machine, you could just use this cache in its own process with an IPC interface of your choice (domain sockets, say), or use memcached locally, or, if you might ever want persistence across restarts, something like Tokyo Cabinet with a Python interface like this.
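If you go the local-memcached route, the client side is tiny; a hedged sketch with the python-memcached package (the key and TTL are arbitrary):

import memcache

mc = memcache.Client(["127.0.0.1:11211"])
mc.set("expensive-key", {"answer": 42}, time=300)  # cached for 5 minutes
value = mc.get("expensive-key")                    # None on a miss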
I realize this is an old thread, but here's another option for a "server-wide dict": http://poshmodule.sourceforge.net/posh/html/posh.html (POSH, Python Shared Objects). Disclaimer: haven't used it myself yet.