Communicating between one Read and one Write process in Python

I am looking for a way to pass information safely (!) between two Python processes.
One process only reads, and the other only writes.
The data I need to pass is a dictionary.
The best solution I have found so far is a local SQL server (since the data itself is essentially a table), and I assume SQLite will handle the transactions safely.
Is there a better way, maybe a module that locks a file against being read while it is being written to, and vice versa?
I am using Linux (Ubuntu 11.10), but a cross-platform solution would be welcome.
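A minimal sketch of the SQLite route mentioned above, assuming the dictionary is kept as a key/value table so SQLite can serialise concurrent access; the file name, table name and helper functions are illustrative assumptions, not part of the question.

import sqlite3

def write_entry(key, value):
    # writer process: the connection context manager wraps this in a transaction
    with sqlite3.connect("shared.db", timeout=10) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
        conn.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value))

def read_all():
    # reader process: returns the whole table as a dictionary
    with sqlite3.connect("shared.db", timeout=10) as conn:
        return dict(conn.execute("SELECT k, v FROM kv"))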

To transfer data between processes, you can use pipes and sockets. Python uses pickling to convert between objects and byte streams.
To transfer the data safely, you need to make sure that the data is transferred exactly once. That means the destination process needs to be able to say "I didn't get everything, send it again", while the sender needs some form of receipt.
To achieve this, you should add a "header" to the data which gives it a unique key (maybe a timestamp or a hash). Both receiver and sender can then keep a list of the keys they have sent/seen, to avoid processing data twice.
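A minimal sketch of that idea, using a multiprocessing.Pipe (which pickles the dictionary for you) plus a sequence-number "header" and an acknowledgement; the message layout and helper names are illustrative assumptions.

import multiprocessing as mp

def writer(conn):
    data = {"a": 1, "b": 2}
    conn.send({"id": 0, "payload": data})    # "header" (unique id) plus the dictionary
    ack = conn.recv()                        # wait for the receipt from the reader
    assert ack == 0
    conn.close()

def reader(conn):
    seen = set()
    msg = conn.recv()
    if msg["id"] not in seen:                # skip anything already processed
        seen.add(msg["id"])
        print("received:", msg["payload"])
    conn.send(msg["id"])                     # send the receipt back
    conn.close()

if __name__ == "__main__":
    write_end, read_end = mp.Pipe()          # duplex pipe, so receipts can flow back
    w = mp.Process(target=writer, args=(write_end,))
    r = mp.Process(target=reader, args=(read_end,))
    w.start(); r.start()
    w.join(); r.join()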

For one-way communication you could use e.g. multiprocessing.Queue or multiprocessing.queues.SimpleQueue.
Shared memory is also an option, using multiprocessing.Array, but you'd have to split the dictionary into at least two arrays (keys and values), and this only works well if all the values are of the same basic type.
The good thing about multiprocessing.Queue and multiprocessing.Array is that they are both protected by locks internally, so you don't have to worry about locking yourself.
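A minimal sketch of the multiprocessing.Queue option, assuming the whole dictionary is sent as a single message; the process and key names are illustrative.

import multiprocessing as mp

def writer(q):
    q.put({"temperature": 21.5, "humidity": 40})   # the dict is pickled automatically

def reader(q):
    d = q.get()                                    # blocks until the writer has put something
    print("received:", d)

if __name__ == "__main__":
    q = mp.Queue()
    w = mp.Process(target=writer, args=(q,))
    r = mp.Process(target=reader, args=(q,))
    w.start(); r.start()
    w.join(); r.join()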

Related

Python parallelism preserving data

I need to repeatedly calculate very large Python arrays based on a small input and a very large constant bulk of data stored on the drive. I can successfully parallelize it by splitting that input bulk and joining the responses. Here comes the problem: sending the identical data bulk to the pool is too slow. Moreover, it doubles the required memory. Ideally I would read the data from the file once in each worker and keep it there for multiple re-use.
How do I do that? The only thing I can think of is creating multiple servers that will listen to requests from the pool. Somehow that looks like an unnatural solution to quite a common problem. Am I missing a better solution?
best regards,
Vladimir
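A minimal sketch of the "read the data once in each worker and keep it there" idea described in the question, using a multiprocessing.Pool initializer; the file name and loader below are illustrative assumptions, not part of the question.

import multiprocessing as mp

_bulk = None  # module-level slot, filled once per worker process

def init_worker(path):
    global _bulk
    with open(path, "rb") as f:
        _bulk = f.read()          # stand-in for loading the large constant data

def compute(small_input):
    # uses the worker-local copy of the constant data; nothing large is resent
    return len(_bulk) + small_input

if __name__ == "__main__":
    with mp.Pool(processes=4, initializer=init_worker,
                 initargs=("constant_data.bin",)) as pool:
        print(pool.map(compute, range(10)))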

Python: What to worry about when multiple processes are concurrently writing to "shelve"?

For my Python application I am thinking of using shelve, part of the standard library. There will be hundreds of processes, each writing something to the same shelve object. The writing will always be to add a new key/value pair to the shelf. The keys are unique, so no two processes will update the same entry.
What could go wrong in such a scenario?
The shelve documentation is explicit about this:
The shelve module does not support concurrent read/write access to shelved objects. (Multiple simultaneous read accesses are safe.) When a program has a shelf open for writing, no other program should have it open for reading or writing. Unix file locking can be used to solve this, but this differs across Unix versions and requires knowledge about the database implementation used.
So, without process synchronisation, I wouldn't do it.
How are the processes started? If they are created by a master process then you can look at the multiprocessing module. Use a Queue to which the child processes write back their results, and have the master remove items from the queue and write them to the shelf (a sketch of this follows below). An example of this sort of thing is at https://stackoverflow.com/a/24501437/21945.
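A minimal sketch of that pattern, assuming the master is the only process that ever opens the shelf; the file name and keys are illustrative.

import multiprocessing as mp
import shelve

def worker(q, n):
    q.put((f"key-{n}", n * n))   # each worker contributes one unique key/value pair

if __name__ == "__main__":
    q = mp.Queue()
    procs = [mp.Process(target=worker, args=(q, n)) for n in range(8)]
    for p in procs:
        p.start()

    with shelve.open("results.shelf") as db:   # only the master writes to the shelf
        for _ in procs:
            key, value = q.get()
            db[key] = value

    for p in procs:
        p.join()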
If you have no process hierarchy then you'll need to use locking to control read and write access to the shelf file. If you are using Linux or similar you might use a posix_ipc named semaphore.
The other obvious option is to use a database server - Postgresql or similar.
In your case you'd probably have better luck using a more robust key/value store, such as Redis. It's pretty easy to set up a local Redis service or a remote one (such as on AWS's ElastiCache service).
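A minimal sketch of the Redis option, assuming a local Redis server on the default port and a recent redis-py client (pip install redis); the hash name "shared_dict" is an illustrative assumption.

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# writer process: store the dictionary as a Redis hash
r.hset("shared_dict", mapping={"temperature": "21.5", "humidity": "40"})

# reader process: fetch the whole dictionary back
print(r.hgetall("shared_dict"))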

Python/C API concurrency issue

We are developing a small C server application. The server application does some data processing and responds back to the client. To keep the data processing part configurable and flexible we decided to go for scripting, and based on the availability of various ready-made modules we decided to go for Python. We are using the Python/C API to send/receive the data between C and Python.
The Algorithm works something like this:-
The server receives some data from the client; this data is stored in a dictionary created in C. The dictionary is created from C using the API function PyDict_New(). The input is stored as key/value pairs in the dictionary using the API function PyDict_SetItemString().
Next, we execute the Python script with PyRun_SimpleString(), passing the script as a parameter. This script makes use of the dictionary created in C. Please note, we make the dictionary created in C accessible to the script using the functions PyImport_AddModule() and PyModule_AddObject().
We store the result of the data processing in the script as a key/value pair in the same dictionary created above. The C code can then simply access the result variable (key/value pair) after the script has executed.
The problem
The problem we are facing is with concurrent requests coming in from different clients. When multiple requests come in from different clients, we start to get object reference count errors. Please note that for each request which comes in from a user, we create an independent dictionary for that user alone. To overcome this problem we wrapped the call to PyRun_SimpleString() in PyEval_AcquireLock() and PyEval_ReleaseLock(), but doing this has resulted in the script execution being a blocking call. So if a script takes a long time to execute, all the other users are left waiting for a response.
Could you please suggest the best possible approach, or give pointers to where we are going wrong? Please ping me for more information.
Any help/guidance will be appreciated.
Perhaps you are missing one of the calls mentioned in this answer.
I suggest you investigate the multiprocessing module.
You should probably read http://docs.python.org/c-api/init.html#thread-state-and-the-global-interpreter-lock. Your problem is explained in the first paragraph.
When you acquire the GIL, do so around your direct manipulation of Python objects. The call to PyRun_SimpleString will handle the GIL internally and will give it up on long-running operations or just every X instructions. It WILL NOT be truly multi-threaded, however.
Edit:
You need to acquire the lock and you need to ensure that Python knows it's in a different thread state:
// acquire the lock and switch thread state
PyEval_AcquireLock();
PyThreadState_Swap(perThreadState);
// execute some python code
PyRun_SimpleString("print 123");
// clear the thread state and release the lock
PyThreadState_Swap(NULL);
PyEval_ReleaseLock();

A good persistent synchronous queue in Python

I don't immediately care about FIFO or FILO options, but it might be nice in the future.
What I'm looking for is a nice, fast, simple way to store data on disk (at most a gig of data, or tens of millions of entries) that can be get and put by multiple processes. The entries are just simple 40-byte strings, not Python objects. I don't really need all the functionality of shelve.
I've seen this: http://code.activestate.com/lists/python-list/310105/
It looks simple. It needs to be upgraded to the new Queue version.
Wondering if there's something better? I'm concerned that in the event of a power interruption, the entire pickled file becomes corrupt instead of just one record.
Try using Celery. It's not pure Python, as it uses RabbitMQ as a backend, but it's reliable, persistent and distributed, and, all in all, far better than using files or a database in the long run.
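A minimal sketch of what the Celery suggestion might look like, assuming a local RabbitMQ broker at its default URL and a worker started separately (e.g. celery -A tasks worker); the module and task names are illustrative assumptions.

# tasks.py
from celery import Celery

app = Celery("tasks", broker="amqp://guest@localhost//")

@app.task
def store_entry(entry):
    # persist or process one 40-byte string record here
    return entry

# producer side: enqueue a record; it stays queued in the broker until a worker consumes it
# store_entry.delay("a-simple-40-byte-string-record-0000000001")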
I think that PyBSDDB is what you want. You can choose a queue as the access type. PyBSDDB is a Python module based on Oracle Berkeley DB.
It has synchronous access and can be accessed from different processes, although I don't know if that is possible from the Python bindings. Regarding multiple processes writing to the database, I found this thread.
This is a very old question, but persist-queue seems to be a nice tool for this kind of task. Quoting its documentation:
persist-queue implements a file-based queue and a series of sqlite3-based queues. The goal is to achieve the following requirements:
Disk-based: each queued item should be stored on disk in case of any crash.
Thread-safe: can be used by multi-threaded producers and multi-threaded consumers.
Recoverable: items can be read after process restart.
Green-compatible: can be used in greenlet or eventlet environments.
By default, persist-queue uses the pickle module for object serialization. Most built-in types, like int, dict and list, can be persisted by persist-queue directly; to support customized objects, please refer to "Pickling and unpickling extension types" (Python 2) and "Pickling Class Instances" (Python 3).
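A minimal sketch of persist-queue for the 40-byte-string use case, using its SQLite-backed queue so records survive a crash; the path is an illustrative assumption (pip install persist-queue).

import persistqueue

q = persistqueue.SQLiteQueue("/tmp/myqueue", auto_commit=True)

# producer process
q.put("a-simple-40-byte-string-record-0000000001")

# consumer process (possibly after a restart)
item = q.get()
print(item)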
Is using files not working?
Use a journaling file system to recover from power interruptions. That's what they are for.

Python Twisted asynchronous write using Deferred

With regard to the Python Twisted framework, can someone explain to me how to asynchronously write a very large data string to a consumer, say the protocol.transport object?
I think what I am missing is a write(data_chunk) function that returns a Deferred. This is what I would like to do:
data_block = get_lots_and_lots_data()
CHUNK_SIZE = 1024  # write 1 KiB at a time

def write_chunk(data, i):
    d = transport.deferredWrite(data[i:i + CHUNK_SIZE])
    d.addCallback(lambda _: write_chunk(data, i + CHUNK_SIZE))

write_chunk(data_block, 0)
But, after a day of wandering around in the Twisted API and documentation, I can't seem to locate anything like a deferredWrite equivalent. What am I missing?
As Jean-Paul says, you should use IProducer and IConsumer, but you should also note that the lack of deferredWrite is a somewhat intentional omission.
For one thing, creating a Deferred for potentially every byte of data that gets written is a performance problem: we tried it in the web2 project and found that it was the most significant performance issue with the whole system, and we are trying to avoid that mistake as we backport web2 code to twisted.web.
More importantly, however, having a Deferred which gets returned when the write "completes" would provide a misleading impression: that the other end of the wire has received the data that you've sent. There's no reasonable way to discern this. Proxies, smart routers, application bugs and all manner of network contrivances can conspire to fool you into thinking that your data has actually arrived on the other end of the connection, even if it never gets processed. If you need to know that the other end has processed your data, make sure that your application protocol has an acknowledgement message that is only transmitted after the data has been received and processed.
The main reason to use producers and consumers in this kind of code is to avoid allocating memory in the first place. If your code really does read all of the data that it's going to write to its peer into a giant string in memory first (data_block = get_lots_and_lots_data() pretty directly implies that) then you won't lose much by doing transport.write(data_block). The transport will wake up and send a chunk of data as often as it can. Plus, you can simply do transport.write(hugeString) and then transport.loseConnection(), and the transport won't actually disconnect until either all of the data has been sent or the connection is otherwise interrupted. (Again: if you don't wait for an acknowledgement, you won't know if the data got there. But if you just want to dump some bytes into the socket and forget about it, this works okay.)
If get_lots_and_lots_data() is actually reading a file, you can use the included FileSender class. If it's something which is sort of like a file but not exactly, the implementation of FileSender might be a useful example.
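A minimal sketch of the FileSender suggestion above, assuming the data really does live in a file and that the protocol is served over TCP; the file name and port are illustrative assumptions.

from twisted.internet import reactor
from twisted.internet.protocol import Factory, Protocol
from twisted.protocols.basic import FileSender

class SendBigFile(Protocol):
    def connectionMade(self):
        # stream the file to the peer without loading it all into memory
        sender = FileSender()
        d = sender.beginFileTransfer(open("big_data.bin", "rb"), self.transport)
        d.addCallback(lambda _: self.transport.loseConnection())

factory = Factory.forProtocol(SendBigFile)
reactor.listenTCP(8000, factory)
reactor.run()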
The way large amounts of data are generally handled in Twisted is by using the producer/consumer APIs. This doesn't give you a write method that returns a Deferred, but it does give you notification about when it's time to write more data.
