I don't immediately care about FIFO or FILO options, but it might be nice in the future.
What I'm looking for is a nice, fast, simple way to store data on disk (at most a gig of data, or tens of millions of entries) that multiple processes can get and put from. The entries are just simple 40-byte strings, not Python objects. I don't really need all the functionality of shelve.
I've seen this http://code.activestate.com/lists/python-list/310105/
It looks simple. It needs to be upgraded to the new Queue version.
Wondering if there's something better? I'm concerned that in the event of a power interruption, the entire pickled file becomes corrupt instead of just one record.
Try using Celery. It's not pure Python, as it uses RabbitMQ as a backend, but it's reliable, persistent and distributed, and, all in all, far better than using files or a database in the long run.
I think that PyBSDDB is what you want. You can choose a queue as the access type. PyBSDDB is a Python module based on Oracle Berkeley DB.
It has synchronous access and can be accessed from different processes, although I don't know whether that is exposed through the Python bindings. Regarding multiple processes writing to the db, I found this thread.
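For illustration, here's a rough sketch of what a queue-type Berkeley DB could look like through the bsddb3 bindings. The option names are from memory, so treat it as a sketch rather than a recipe, and real multi-process access would also need a shared DBEnv with locking enabled:

    # Sketch only: a bsddb3 queue-type database with fixed-length records.
    # A real multi-process setup would open a shared DBEnv with locking/transactions.
    from bsddb3 import db

    q = db.DB()
    q.set_re_len(40)          # DB_QUEUE stores fixed-length records; 40-byte strings fit
    q.open('work.queue', dbtype=db.DB_QUEUE, flags=db.DB_CREATE)

    q.append(b'some-40-byte-payload'.ljust(40))   # enqueue; returns the record number
    item = q.consume()                            # dequeue the head, or None when empty
    if item is not None:
        recno, data = item
        print(data.rstrip())
    q.close()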
This is a very old question, but persist-queue seems to be a nice tool for this kind of task.
persist-queue implements a file-based queue and a series of
sqlite3-based queues. The goal is to achieve the following requirements:
Disk-based: each queued item should be stored in disk in case of any crash.
Thread-safe: can be used by multi-threaded producers and multi-threaded consumers.
Recoverable: Items can be read after process restart.
Green-compatible: can be used in greenlet or eventlet environment.
By default, persist-queue uses the pickle object serialization module to
support object instances. Most built-in types, like int, dict and list, can
be persisted by persist-queue directly; to support customized
objects, please refer to Pickling and unpickling extension
types (Python 2) and Pickling Class Instances (Python 3).
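A minimal sketch of the SQLite-backed queue (the path is just an example); because each item is committed to the SQLite file, a crash or power loss should at worst lose the in-flight item rather than corrupting the whole queue:

    # Sketch using persist-queue's SQLite-backed queue; the path is arbitrary.
    import persistqueue

    q = persistqueue.SQLiteQueue('/tmp/myqueue', auto_commit=True)
    q.put('a' * 40)      # plain 40-byte strings pickle cheaply

    item = q.get()       # blocks until an item is available
    print(item)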
Would just using files not work?...
Use a journaling file system to recover from power interruptions. That's their purpose.
Does anyone have a minimal working example of how to use uWSGI to share memory across requests in say Django?
I have a large file in proprietary format (not database-compatible) that I need to load for each request.
An Instagram engineering post got me thinking; it states:
For the application server, we use uWSGI with pre-fork mode to leverage memory sharing between master and worker processes.
How would you set something like that up?
There are multiple ways to handle this:
Share by "abusing" copy-on-write for read-only data
If your data is read-only, you can leverage the fact that uWSGI executes your Python code to build the application before forking into multiple processes. This means any data loaded before the fork happens will be shared with all your processes.
This can be a great tool because you don't have to do anything multiprocessing-specific to benefit from it. But be careful: as soon as any process writes to this data, the operating system copies the affected pages first, giving that process its own local version.
Django doesn't make this easy because all the views are lazy: django won't run code related to your view when the application is created. Therefore, to benefit from pre-fork sharing, you need to load the data in code outside your views. For instance, you can load the data right before or after you build your application object (like in the gist linked by #john-strood); a minimal sketch follows.
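As a sketch (the settings module and the data file path are placeholders), loading at module level in your WSGI entry point means the data exists before uWSGI forks its workers, and views can simply import it:

    # wsgi.py -- sketch only; the path and loader below are placeholders.
    import os
    from django.core.wsgi import get_wsgi_application

    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'mysite.settings')

    def _load_blob(path):
        # Placeholder loader: read the proprietary file once into memory.
        with open(path, 'rb') as f:
            return f.read()

    # Loaded once, before uWSGI forks: the pages are shared copy-on-write,
    # so every worker sees the same data without paying for it again,
    # as long as nothing writes to it.
    BIG_DATA = _load_blob('/srv/data/proprietary.bin')

    application = get_wsgi_application()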
Use uWSGI cache framework
If you need to write to this data, a first solution is to use uWSGI cache framework. It's fairly easy to use. You need to configure in advance how much memory you need, and then all your processes can read and write to it. You don't have to deal with locking or other multi-processing related issues.
The drawback is that you still incur IO latency between your processes and uWSGI cache's process. This is insignificant for tiny chunks of data, but would be prohibitive for gigabytes.
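For illustration, it looks something like this (the cache name is an example, the cache must be declared in your uWSGI config, e.g. cache2 = name=mycache,items=10000, and the exact signatures may differ between uWSGI versions):

    # Sketch only: the 'uwsgi' module is importable only when running under uWSGI.
    import uwsgi

    def store(key, value):
        # values are bytes; the cache name must match the one declared in the config
        uwsgi.cache_update(key, value, 0, 'mycache')

    def load(key):
        return uwsgi.cache_get(key, 'mycache')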
Use shared memory manually
As a last resort, if your data is not read-only and you need to load large chunks on every request, so large that even sending them through a unix socket would take too long, then you need to load your data directly into the shared memory space. Here uWSGI won't help, and you will have to deal with locking and multi-processing issues yourself.
You can refer to multiprocessing's shared memory documentation.
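For reference, a minimal sketch with Python 3.8+'s multiprocessing.shared_memory; locking around concurrent writers is still your responsibility:

    # Minimal sketch using multiprocessing.shared_memory (Python 3.8+).
    from multiprocessing import shared_memory

    payload = b'x' * (1024 * 1024)          # stand-in for the big blob

    # Creating process: allocate a named segment and copy the data in.
    shm = shared_memory.SharedMemory(create=True, size=len(payload), name='big_blob')
    shm.buf[:len(payload)] = payload

    # Any other process attaches by name and reads the same bytes.
    other = shared_memory.SharedMemory(name='big_blob')
    data = bytes(other.buf[:len(payload)])

    other.close()
    shm.close()
    shm.unlink()                            # remove the segment when finished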
For my python application I am thinking of using shelve, part of the standard library. There will be hundreds of processes, each writing something to the same shelve object. The writing will always be to add a new key,value pair to the shelve. The keys are unique, so no two processes will update the same entry.
What could go wrong in such a scenario?
The shelve documentation is explicit about this.
The shelve module does not support concurrent read/write access to
shelved objects. (Multiple simultaneous read accesses are safe.) When
a program has a shelf open for writing, no other program should have
it open for reading or writing. Unix file locking can be used to solve
this, but this differs across Unix versions and requires knowledge
about the database implementation used.
So, without process synchronisation, I wouldn't do it.
How are the processes started? If they are created by a master process then you can look at the multiprocessing module. Use a Queue to which the child processes write back their results, and have the master remove items from the queue and write them to the shelf. An example of this sort of thing is at https://stackoverflow.com/a/24501437/21945.
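The shape of that pattern, as a sketch (details differ from the linked answer): only the parent touches the shelf, so shelve's single-writer rule is respected:

    # Sketch: workers compute, the master is the only process writing the shelf.
    import shelve
    from multiprocessing import Process, Queue

    def worker(q, start, stop):
        for i in range(start, stop):
            q.put(('key-%d' % i, 'value-%d' % i))
        q.put(None)                           # sentinel: this worker is done

    if __name__ == '__main__':
        q = Queue()
        workers = [Process(target=worker, args=(q, n * 100, (n + 1) * 100))
                   for n in range(4)]
        for p in workers:
            p.start()

        finished = 0
        with shelve.open('results.db') as shelf:
            while finished < len(workers):
                item = q.get()
                if item is None:
                    finished += 1
                else:
                    key, value = item
                    shelf[key] = value
        for p in workers:
            p.join()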
If you have no process hierarchy then you'll need to use locking to control read and write access to the shelf file. If you are using Linux or similar you might use a posix_ipc named semaphore.
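If you go the locking route, a sketch with posix_ipc might look like this (the semaphore name is arbitrary but must match across processes):

    # Sketch: every process wraps its shelf access in the same named semaphore.
    import shelve
    import posix_ipc

    sem = posix_ipc.Semaphore('/myshelf_lock', posix_ipc.O_CREAT, initial_value=1)

    def add_entry(key, value):
        sem.acquire()
        try:
            with shelve.open('results.db') as shelf:
                shelf[key] = value
        finally:
            sem.release()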
The other obvious option is to use a database server - Postgresql or similar.
In your case you'd probably have better luck using a more robust key-value store, such as Redis. It's pretty easy to set up a local Redis service or a remote one (such as on AWS's ElastiCache service).
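With redis-py that's only a couple of lines (the host and port shown are the defaults):

    # Sketch using redis-py against a local Redis; all processes can share it.
    import redis

    r = redis.Redis(host='localhost', port=6379, db=0)
    r.set('some-unique-key', 'a' * 40)    # each process writes its own keys
    value = r.get('some-unique-key')      # returns bytes, or None if missing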
I'm starting with Stackless Python so it's a whole new amazing world for me.
I usually use regular threads, and they normally have Thread-local storage (TLS), which
is a very useful feature when you need NOT share memory with other threads.
So, I'm wondering if Stackless Python has something similar: A way to store local memory
(a python object) for a given tasklet. Is that possible?
Thanks in advance.
-f
Solution 1: TLS can be simulated in Stackless/greenlet by using the current tasklet object, retrieved by calling stackless.getcurrent(), to store additional data.
Solution 2: If the tasklet doesn't support adding extra fields, you can keep a global WeakKeyDictionary instance keyed (weakly) by the tasklet, with the value holding your TLS data.
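A sketch of Solution 2 (whether tasklets can be weakly referenced may depend on your Stackless version; if they can't, a plain dict keyed by the tasklet, cleared when it finishes, works too):

    # Sketch: tasklet-local storage via a weak-keyed mapping.
    import weakref
    import stackless

    _tls = weakref.WeakKeyDictionary()

    def tasklet_locals():
        """Return a dict private to the current tasklet."""
        return _tls.setdefault(stackless.getcurrent(), {})

    # inside any tasklet:
    #   tasklet_locals()['user'] = 'alice'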
What's the fastest way to get a large number of files (relatively small, 10-50 kB each) from Amazon S3 using Python? (On the order of 200,000 to a million files.)
At the moment I am using boto to generate Signed URLs, and using PyCURL to get the files one by one.
Would some type of concurrency help? PyCurl.CurlMulti object?
I am open to all suggestions. Thanks!
I don't know anything about python, but in general you would want to break the task down into smaller chunks so that they can be run concurrently. You could break it down by file type, or alphabetical or something, and then run a separate script for each portion of the break down.
In the case of Python, since this is IO-bound, multiple threads will make use of the CPU, but they will probably only use one core. If you have multiple cores, you might want to consider the new multiprocessing module. Even then, you may want each process to use multiple threads. You would have to do some tweaking of the number of processes and threads.
If you do use multiple threads, this is a good candidate for the Queue class.
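A rough sketch of the threads-plus-Queue shape, using urllib for brevity instead of PyCURL (the signed URL list and file naming are placeholders):

    # Sketch: N worker threads pull signed URLs off a queue and save each body to disk.
    import queue
    import threading
    import urllib.request

    NUM_WORKERS = 32

    def worker(jobs):
        while True:
            item = jobs.get()
            if item is None:                  # sentinel: no more work
                break
            url, dest = item
            try:
                urllib.request.urlretrieve(url, dest)
            except Exception as exc:
                print('failed %s: %s' % (dest, exc))

    jobs = queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs,)) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()

    signed_urls = []                          # fill with your boto-generated signed URLs
    for i, url in enumerate(signed_urls):
        jobs.put((url, 'file_%06d' % i))

    for _ in threads:
        jobs.put(None)
    for t in threads:
        t.join()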
You might consider using s3fs, and just running concurrent file system commands from Python.
I've been using txaws with Twisted for S3 work, though what you'd probably want is just to get the authenticated URL and use twisted.web.client.downloadPage (by default it will happily stream straight to a file without much interaction).
Twisted makes it easy to run at whatever concurrency you want. For something on the order of 200,000, I'd probably make a generator and use a cooperator to set my concurrency and just let the generator generate every required download request.
If you're not familiar with twisted, you'll find the model takes a bit of time to get used to, but it's oh so worth it. In this case, I'd expect it to take minimal CPU and memory overhead, but you'd have to worry about file descriptors. It's quite easy to mix in perspective broker and farm the work out to multiple machines should you find yourself needing more file descriptors or if you have multiple connections over which you'd like it to pull down.
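Roughly the shape of that pattern (downloadPage is the old-style helper and the URL list here is a stand-in; newer Twisted code would use treq or Agent instead):

    # Sketch: a shared generator plus a Cooperator keeps CONCURRENCY downloads in flight.
    from twisted.internet import defer, reactor, task
    from twisted.web.client import downloadPage

    CONCURRENCY = 50

    def download_all(urls):
        work = (downloadPage(url.encode('ascii'), 'file_%06d' % i)
                for i, url in enumerate(urls))
        coop = task.Cooperator()
        return defer.gatherResults([coop.coiterate(work) for _ in range(CONCURRENCY)])

    if __name__ == '__main__':
        urls = ['https://example-bucket.s3.amazonaws.com/key-%d' % i for i in range(1000)]
        d = download_all(urls)
        d.addBoth(lambda _: reactor.stop())
        reactor.run()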
What about threads + a queue? I love this article: Practical threaded programming with Python.
Each job can be done with appropriate tools :)
You want to use Python for stress-testing S3 :), so I suggest finding a large-volume downloader program and passing the links to it.
On Windows I have experience installing the ReGet program (shareware, from http://reget.com) and creating download tasks via its COM interface.
Of course, other programs with a usable interface may exist.
Regards!
I have been looking into different systems for creating a fast cache in a web-farm running Python/mod_wsgi. Memcache and others are options ... But I was wondering:
Because I don't need to share data across machines, wanting each machine to maintain a local cache ...
Does Python or WSGI provide a mechanism for Python native shared data in Apache such that the data persists and is available to all threads/processes until the server is restarted? This way I could just keep a cache of objects with concurrency control in the memory space of all running application instances?
If not, it sure would be useful
Thanks!
This is thoroughly covered by the Sharing and Global Data section of the mod_wsgi documentation. The short answer is: No, not unless you run everything in one process, but that's not an ideal solution.
It should be noted that caching is ridiculously easy to do with Beaker middleware, which supports multiple backends including memcache.
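For example, a per-process Beaker cache is only a few lines (the cache type and expiry are just example values; switch the type to a memcached backend if you later want sharing across processes):

    # Sketch: an in-process Beaker cache; each worker process gets its own copy.
    from beaker.cache import CacheManager
    from beaker.util import parse_cache_config_options

    cache = CacheManager(**parse_cache_config_options({
        'cache.type': 'memory',
        'cache.expire': 300,            # seconds; just an example
    }))

    @cache.cache('expensive_lookup', expire=300)
    def expensive_lookup(key):
        # stand-in for the real expensive work
        return {'key': key, 'value': len(key)}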
There's Django's thread-safe in-memory cache back-end, see here. It's cPickle-based, and although it's designed for use with Django, it has minimal dependencies on the rest of Django and you could easily refactor it to remove these. Obviously each process would get its own cache, shared between its threads; If you want a cache shared by all processes on the same machine, you could just use this cache in its own process with an IPC interface of your choice (domain sockets, say) or use memcached locally, or, if you might ever want persistence across restarts, something like Tokyo Cabinet with a Python interface like this.
I realize this is an old thread, but here's another option for a "server-wide dict": http://poshmodule.sourceforge.net/posh/html/posh.html (POSH, Python Shared Objects). Disclaimer: haven't used it myself yet.