I'm starting with Stackless Python so it's a whole new amazing world for me.
I usually use regular threads, which provide thread-local storage (TLS), a very useful feature when you do NOT want to share memory with other threads.
So, I'm wondering if Stackless Python has something similar: a way to store local memory (a Python object) for a given tasklet. Is that possible?
Thanks in advance.
-f
Solution 1: TLS can be simulated in Stackless/greenlet by using the current tasklet object, retrieved with stackless.getcurrent(), to store the additional data on.
Solution 2: If the tasklet doesn't support adding extra fields, then you can keep a global WeakKeyDictionary instance that uses the tasklet as a weakly referenced key and your TLS object as the value.
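A minimal sketch combining both ideas, assuming Stackless Python is installed and that tasklets can be used as weak-reference keys (treat it as illustrative, not tested):

# Tasklet-local storage keyed on the current tasklet; the WeakKeyDictionary
# avoids keeping finished tasklets alive just to hold their local data.
import weakref
import stackless

_tasklet_locals = weakref.WeakKeyDictionary()

def get_locals():
    # Return the dict of local values for the running tasklet.
    current = stackless.getcurrent()
    return _tasklet_locals.setdefault(current, {})

def worker(name):
    get_locals()['name'] = name      # private to this tasklet
    stackless.schedule()             # let other tasklets run in between
    print(get_locals()['name'])      # still this tasklet's own value

for n in ('a', 'b', 'c'):
    stackless.tasklet(worker)(n)
stackless.run()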
What is the proper way of using connection pools in a streaming PySpark application?
I read through https://forums.databricks.com/questions/3057/how-to-reuse-database-session-object-created-in-fo.html and understand that the proper way is to use a singleton for Scala/Java. Is this possible in Python? A small code example would be greatly appreciated. I believe creating a connection per partition will be very inefficient for a streaming application.
Long story short, connection pools will be less useful in Python than on the JVM due to the PySpark architecture. Unlike its Scala counterpart, Python executors use separate processes. This means there is no shared state between executors and, since by default each partition is processed sequentially, you can have only one active connection per interpreter.
Of course it can be still useful to maintain connections between batches. To achieve that you'll need two things:
spark.python.worker.reuse has to be set to true.
A way to reference an object between different calls.
The first one is pretty obvious, and the second one is not really Spark specific. You can, for example, use a module-level singleton (you'll find a Spark example in my answer to How to run a function on all Spark workers before processing data in PySpark?) or a Borg pattern.
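For illustration, a hedged sketch of the module-singleton approach; create_connection, conn.send and the stream variable are placeholders for your own client library and DStream, not part of any real API:

# Per-worker singleton connection; with spark.python.worker.reuse=true the
# interpreter (and this module-level state) survives between batches.
_connection = None

def get_connection():
    global _connection
    if _connection is None:
        _connection = create_connection()   # hypothetical connection factory
    return _connection

def send_partition(records):
    conn = get_connection()                 # reused across partitions and batches
    for record in records:
        conn.send(record)                   # hypothetical client call

# On the driver:
stream.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))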
I don't immediately care about FIFO or FILO options, but they might be nice in the future.
What I'm looking for is a nice, fast, simple way to store (at most a gig of data, or tens of millions of entries) on disk that can be read and written by multiple processes. The entries are just simple 40-byte strings, not Python objects. I don't really need all the functionality of shelve.
I've seen this http://code.activestate.com/lists/python-list/310105/
It looks simple. It needs to be upgraded to the new Queue version.
Wondering if there's something better? I'm concerned that in the event of a power interruption, the entire pickled file becomes corrupt instead of just one record.
Try using Celery. It's not pure Python, as it uses RabbitMQ as a backend, but it's reliable, persistent and distributed, and, all in all, far better than using files or a database in the long run.
I think that PyBSDDB is what you want. You can choose a queue as the access type. PyBSDDB is a Python module based on Oracle Berkeley DB.
It has synchronous access and can be used from different processes, although I don't know whether that is possible through the Python bindings. About multiple processes writing to the DB, I found this thread.
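For what it's worth, a minimal single-process sketch of a Berkeley DB queue through the bsddb3 bindings; multi-process access would additionally need a shared DBEnv, which is exactly the part I'm unsure about:

# DB_QUEUE requires fixed-length records, which suits 40-byte strings.
from bsddb3 import db

q = db.DB()
q.set_re_len(40)                                   # fixed record length
q.open('work.queue', dbtype=db.DB_QUEUE, flags=db.DB_CREATE)

q.append(b'x' * 40)                                # enqueue one record
item = q.consume()                                 # FIFO dequeue, None if empty
if item is not None:
    recno, data = item
    print(recno, data)
q.close()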
This is a very old question, but persist-queue seems to be a nice tool for this kind of task.
persist-queue implements a file-based queue and a series of sqlite3-based queues. The goal is to achieve the following requirements:
Disk-based: each queued item should be stored on disk in case of any crash.
Thread-safe: can be used by multi-threaded producers and multi-threaded consumers.
Recoverable: Items can be read after process restart.
Green-compatible: can be used in greenlet or eventlet environment.
By default, persist-queue uses the pickle object serialization module to support object instances. Most built-in types, like int, dict and list, can be persisted by persist-queue directly; to support customized objects, please refer to Pickling and unpickling extension types (Python 2) and Pickling Class Instances (Python 3).
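A minimal usage sketch based on the persist-queue README (install with pip install persist-queue; the path below is arbitrary):

# SQLite-backed persistent queue: items are written to disk, survive a
# process restart, and the same path can be opened by another process.
import persistqueue

q = persistqueue.SQLiteQueue('/tmp/queue_store', auto_commit=True)
q.put(b'0123456789' * 4)        # a 40-byte record, pickled to disk
item = q.get()                  # read it back (here, or after a restart)
print(item)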
Would using files not work for you?...
Use a journaling file system to recover from power interruptions. That's their purpose.
I am basically familiar with the RPC solutions available in Python: XML-RPC and Pyro. I can make a remote object by binding it on the server side and then get a proxy object on the client side that I can operate on. When I call some method on the remote object, e.g. proxy.get_file(), the RPC mechanism tries to serialize the resulting object (a file in this case). This is usually the expected behavior, but what I need is to get the file object as another remote proxy object instead of having it transferred to the client side:
afile_proxy = proxy.get_file()
Instead of:
afile = proxy.get_file()
I could rebind this object on the server side and handle such a case on the client side, but this would require some boilerplate code. Is there a mechanism/library that would do this for me? It could, for example, keep objects remote until they are primitive ones.
I have found a library that does exactly what I need: RPyC. From intro:
simple, immutable python objects (like strings, integers, tuples, etc.) are passed by value, meaning the value itself is passed to the other side.
all other objects are passed by reference, meaning a "reference" to the object is passed to the other side. This allows changes applied on the referenced object to be reflected on the actual object.
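For illustration, a hedged sketch using RPyC's classic mode; it assumes an rpyc_classic.py server is reachable at "server_host" and the file path is arbitrary:

# Open a file on the *server* and keep it there: the call returns a netref
# proxy, so only the bytes we explicitly read cross the wire.
import rpyc

conn = rpyc.classic.connect("server_host")
remote_file = conn.modules.builtins.open("/var/log/syslog", "rb")
first_line = remote_file.readline()     # only this line's data is transferred
remote_file.close()
conn.close()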
Anyway, thanks for pointing out a 'reference' term. :)
I am involved in developing a new remote-object interaction framework, Versile Python (VPy), which performs functions similar to the ones you have listed. VPy is in development, with the current releases primarily intended for testing, but feel free to take a look.
There are two ways you could perform the type of remote I/O you are describing with VPy. One is to use remote native object references to e.g. file objects, similar to Pyro/RPyC, and access those objects as if they were local.
Another option is to use the VPy remote stream framework, which is quite flexible and can be configured to perform bi-directional streaming and operations such as remotely repositioning or truncating the stream. The second approach has the advantage that it enables asynchronous I/O, and the stream framework splits up data transmission in order to reduce the effects of round-trip latency.
afile_proxy = proxy.get_file_proxy()
And define in the API what a FileProxy object is. It all depends on what the client needs to do with the proxy. Get the name? Delete it? Get its contents?
You could even get away with a reference (a URL, maybe) if all you want is to keep track of something you want to process later. It's what's done on the web with all embedded content, like images.
Edit: I'm asking whether global variables are safe in a single-threaded web framework like Tornado.
I'm using the MongoEngine ORM, which gets a database connection from a global variable:
_get_db() # gets the db connection
I'm also using Tornado, a single-threaded Python web framework. In one particular view, I need to grab a database connection and dereference a DBRef object (similar to a foreign key):
# dereference a DBRef
_get_db().dereference(some_db_ref)
Since the connection returned by _get_db is a global variable, is there a possibility of a collision and the wrong value being returned to the wrong thread?
Threads are always required to hold the GIL when interacting with Python objects. The namespace holding the variables is a Python object (either a frameobject or a dict, depending on what kind of variable it is.) It's always safe to get or set variables in multiple threads. You will never get garbage data.
However, the usual race conditions do apply as to which object you get, or which object you replace when you assign. A statement like x += 1 is not thread-safe, because a different thread can run between the get and the store, changing the value of x, which you would then overwrite.
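A small illustration of that race; the exact result varies with the CPython version and switch interval, so treat the final count as indicative:

# Unlocked `x += 1` from several threads: the read and the write-back are
# separate steps, so increments from other threads can be lost.
import threading

x = 0

def bump(times):
    global x
    for _ in range(times):
        x += 1                  # not atomic: load, add, store

threads = [threading.Thread(target=bump, args=(100000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(x)                        # may well be less than 400000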
Assuming MongoEngine is wrapping PyMongo (and I believe it is), then you should be fine. PyMongo is completely thread-safe.
No, but locks are pretty straightforward to use in Python. Use the try/finally pattern to ensure that a lock is released after you modify your global variable.
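A minimal sketch of that pattern (a with lock: block is equivalent and usually tidier):

import threading

db_connection = None                 # the global being protected
db_lock = threading.Lock()

def set_connection(conn):
    global db_connection
    db_lock.acquire()
    try:
        db_connection = conn         # modify the global while holding the lock
    finally:
        db_lock.release()            # always released, even on error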
There is nothing about globals that makes them any more or less thread safe than any other variables. Whether or not it's possible for an operation to fail or return incorrect results when run in different threads, the best practice is that you should protect data shared between threads.
If I'm reading you right, you're asking if a variable is safe in a single-threaded environment. In this case, where data is not shared between concurrent processes, the variable is safe (after all, there's nothing else running that may interrupt it).
I have been looking into different systems for creating a fast cache in a web-farm running Python/mod_wsgi. Memcache and others are options ... But I was wondering:
Because I don't need to share data across machines, I want each machine to maintain its own local cache ...
Does Python or WSGI provide a mechanism for Python native shared data in Apache such that the data persists and is available to all threads/processes until the server is restarted? This way I could just keep a cache of objects with concurrency control in the memory space of all running application instances?
If not, it sure would be useful
Thanks!
This is thoroughly covered by the Sharing and Global Data section of the mod_wsgi documentation. The short answer is: No, not unless you run everything in one process, but that's not an ideal solution.
It should be noted that caching is ridiculously easy to do with Beaker middleware, which supports multiple backends including memcache.
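A hedged sketch of Beaker's cache decorator (memory backend shown; as far as I recall, switching 'cache.type' to 'ext:memcached' plus a 'cache.url' gives you the memcache backend; compute_result is a hypothetical expensive function):

from beaker.cache import CacheManager
from beaker.util import parse_cache_config_options

cache_opts = {
    'cache.type': 'memory',          # per-process; use 'ext:memcached' to share
    'cache.expire': 300,             # seconds
}
cache = CacheManager(**parse_cache_config_options(cache_opts))

@cache.cache('expensive_lookup', expire=300)
def expensive_lookup(key):
    return compute_result(key)       # hypothetical expensive call worth caching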
There's Django's thread-safe in-memory cache backend, see here. It's cPickle-based, and although it's designed for use with Django, it has minimal dependencies on the rest of Django and you could easily refactor it to remove these. Obviously each process would get its own cache, shared between its threads. If you want a cache shared by all processes on the same machine, you could run this cache in its own process with an IPC interface of your choice (domain sockets, say), use memcached locally, or, if you might ever want persistence across restarts, use something like Tokyo Cabinet with a Python interface like this.
I realize this is an old thread, but here's another option for a "server-wide dict": http://poshmodule.sourceforge.net/posh/html/posh.html (POSH, Python Shared Objects). Disclaimer: haven't used it myself yet.