I'm trying to write a small WSGI application that puts some objects onto an external queue after each request. I want to do this in batches: the web server puts each object into a buffer-like structure in memory, and another thread and/or process sends the buffered objects to the queue in one batch once the buffer is big enough or a timeout expires, then clears the buffer. I'd rather not fall into NIH syndrome or deal with threading details myself, but I could not find suitable code for this job. Any suggestions?
Examine https://docs.python.org/library/queue.html to see if it meets your needs.
Since you write "thread and/or process", see also multiprocessing.Queue and multiprocessing.JoinableQueue from 2.6. Those are interprocess variants of Queue.
Use a buffered stream if you are using python 3.0.
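For what it's worth, the buffer-plus-background-sender idea from the question can be sketched on top of queue.Queue in a few lines. This is only a minimal sketch: the BatchBuffer class, the send_batch callback and the max_items / max_wait parameters are names made up for illustration, and send_batch stands in for whatever actually talks to the external queue.

```python
import queue
import threading
import time


class BatchBuffer:
    """Collect items in memory and pass them to `send_batch` in groups.

    `send_batch`, `max_items` and `max_wait` are hypothetical names used
    only for this sketch.
    """

    _STOP = object()  # sentinel that tells the worker to flush and exit

    def __init__(self, send_batch, max_items=100, max_wait=5.0):
        self._send_batch = send_batch
        self._max_items = max_items
        self._max_wait = max_wait
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def put(self, item):
        # Safe to call from any request thread: queue.Queue is thread-safe.
        self._queue.put(item)

    def close(self):
        # Flush whatever is left and wait for the worker to finish.
        self._queue.put(self._STOP)
        self._worker.join()

    def _run(self):
        batch = []
        deadline = None  # set when the first item of a batch arrives
        while True:
            if deadline is None:
                timeout = None  # nothing buffered, wait indefinitely
            else:
                timeout = max(0.0, deadline - time.monotonic())
            stop = False
            try:
                item = self._queue.get(timeout=timeout)
            except queue.Empty:
                pass  # max_wait elapsed, the partial batch is flushed below
            else:
                if item is self._STOP:
                    stop = True
                else:
                    batch.append(item)
                    if deadline is None:
                        deadline = time.monotonic() + self._max_wait
            timed_out = deadline is not None and time.monotonic() >= deadline
            if batch and (stop or timed_out or len(batch) >= self._max_items):
                self._send_batch(batch)
                batch = []
                deadline = None
            if stop:
                return
```

The WSGI handler would only call put(); the worker thread decides when a batch is full or old enough. Note that this sketch does not handle exceptions raised by send_batch, which a real sender would need.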
Related
I have many doubts about how to design a simple Python program.
I have opened a socket to a server that streams data through a simple telnet server.
There are three types of lines, beginning with RED, BLUE, or YELLOW, each followed by the data. For example:
RED 21763;22;321321
BLUE 1;32132;3432
BLUE 1222;332;3
YELLOW 1;32132;3432
I would like to split the data into three objects, such as queues, and then fork three processes that work on the data in parallel as it arrives on the socket, as a sort of very basic real-time computation.
So, to achieve my goal, should I use threads or forked processes, with objects like queues for interprocess communication?
Or is there a different approach I could use? I don't know anything about multithreaded programming :)
Thanks for helping.
This should give you a brief idea of threads vs fork.
Creating threads requires a lot less overhead than forking processes, so I would go with the thread architecture. Each of the three thread functions is given the queue on which it needs to do the real-time computation. Synchronization and mutual exclusion mechanisms will prevent unexpected behavior. You can also use valgrind with drd for debugging your multithreaded program.
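As a rough sketch of that thread architecture in Python, assuming the lines have already been read from the socket: one queue.Queue per color, one worker thread per queue, and a dispatcher that routes each line by its prefix. The function names and the sum() computation are placeholders invented for this example.

```python
import queue
import threading

COLORS = ("RED", "BLUE", "YELLOW")
_DONE = object()  # sentinel telling a worker that no more data will arrive


def parse_line(line):
    """Split 'RED 21763;22;321321' into ('RED', [21763, 22, 321321])."""
    color, _, payload = line.strip().partition(" ")
    return color, [int(field) for field in payload.split(";")]


def worker(color, inbox, results):
    # Each worker only ever touches its own inbox queue.
    while True:
        values = inbox.get()
        if values is _DONE:
            break
        results.put((color, sum(values)))  # placeholder computation


def process_lines(lines):
    """Route each line to the worker for its color and collect the results."""
    inboxes = {color: queue.Queue() for color in COLORS}
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(color, inboxes[color], results))
        for color in COLORS
    ]
    for thread in threads:
        thread.start()
    for line in lines:
        color, values = parse_line(line)
        inboxes[color].put(values)  # raises KeyError for an unknown prefix
    for inbox in inboxes.values():
        inbox.put(_DONE)
    for thread in threads:
        thread.join()
    collected = []
    while not results.empty():
        collected.append(results.get())
    return collected
```

In a real program the lines would come from the socket instead of a list. Keep in mind that in standard CPython threads do not execute Python bytecode in parallel, so if the per-color computation is CPU-heavy the same layout with processes and multiprocessing.Queue may fit better.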
Context
I have been looking at the source code of multiprocessing.Queue in Python 2.7 and have some questions.
A deque is used as a buffer: any items put on the Queue are appended to the deque, but get() reads from a pipe.
We can see that during put, if the feeder thread has not been started yet, it is started.
The thread pops objects off the deque and sends them through the write end of the above pipe, so that get() can receive them on the read end.
Questions
So, why use a deque and a pipe?
Couldn't one just use a deque (or any other data structure with FIFO behavior) and synchronize push and pop?
Likewise couldn't one also just use a Pipe, wrapping send and recv?
Maybe there is something here that I am missing but the feeder thread popping items and putting them on the Pipe seems like overkill.
The multiprocessing.Queue is a port of the standard Queue capable of running on multiple processes. Therefore it tries to reproduce the same behaviour.
A deque is a list with fast insertion/extraction on both sides with, theoretically, infinite size. It's very well suited for representing a stack or a queue. It does not work across different processes though.
A Pipe works more like a socket and allows data to be transferred across processes. Pipes are operating system objects and their implementation differs from OS to OS. Moreover, pipes have a limited size: if you fill a pipe, your next call to send will block until the other side drains it.
If you want to expose a Queue that works across multiple processes in a similar fashion to the standard one, you need the following features.
A buffer capable of storing messages in arrival order which have not been consumed yet.
A channel capable of transferring such messages across different processes.
Atomic put and get methods that let the user decide when to block the program flow.
Using a deque, a Thread and a Pipe is one of the simplest ways to deliver these features, but it's not the only one.
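To make the division of labour concrete, here is a toy version of the deque + feeder thread + Pipe combination. It is a simplified illustration, not the actual multiprocessing.Queue code: the ToyQueue name is made up, and the cross-process locks, size limits and shutdown handling of the real class are left out.

```python
import collections
import threading
from multiprocessing import Pipe


class ToyQueue:
    """Unbounded buffer (deque) in front of a size-limited channel (pipe)."""

    def __init__(self):
        self._buffer = collections.deque()
        self._notempty = threading.Condition()
        # duplex=False gives a receive-only end and a send-only end.
        self._reader, self._writer = Pipe(duplex=False)
        self._feeder = None

    def put(self, obj):
        # Never blocks on the pipe: the object only goes into the deque.
        with self._notempty:
            if self._feeder is None:
                self._feeder = threading.Thread(target=self._feed, daemon=True)
                self._feeder.start()
            self._buffer.append(obj)
            self._notempty.notify()

    def get(self):
        # Blocks until the feeder thread has pushed something into the pipe.
        return self._reader.recv()

    def _feed(self):
        while True:
            with self._notempty:
                while not self._buffer:
                    self._notempty.wait()
                obj = self._buffer.popleft()
            # This send may block when the pipe is full, but only the
            # feeder thread waits, never the caller of put().
            self._writer.send(obj)
```

This is the point of the extra thread: a bare pipe would make put() block as soon as the pipe fills up, while the deque lets put() return immediately and leaves the waiting to the feeder.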
I personally prefer the use of bare pipes to let processes communicate as it gives me more control on my application.
A deque can only live in one process's memory, so using it to pass data between processes is impossible(...*)
You could use just a Pipe, but then you would need to protect it with locks, and I guess this is why a deque was introduced.
I'm working with a piece of hardware that must be stopped and started at different intervals. Unfortunately, it doesn't teardown gracefully, so restarting within the same process results in libusb errors. One workaround would be to move the configuration of the hardware to a different process, and stop/start the process when required.
What would be the best way to do this in Python?
The pickle module allows you to serialize objects to a string, so you can transfer it via the disk or a socket.
You could also use multiprocessing, which is intended for parallelism, but could be used here too. (Actually, multiprocessing relies on pickle.)
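For the stop/start part of the question, one possible shape is a small wrapper that launches the hardware code in a fresh Python process on every start. Note that this sketch uses the standard subprocess module rather than multiprocessing, and that HardwareWorker and WORKER_CODE are hypothetical stand-ins: the child here only prints a line and sleeps where the real libusb setup would go.

```python
import subprocess
import sys

# Hypothetical stand-in for the script that configures and drives the hardware.
WORKER_CODE = """
import time
print("configured", flush=True)  # tell the parent that setup finished
time.sleep(3600)                 # pretend to service the device
"""


class HardwareWorker:
    """Run the hardware code in a child process that can be restarted."""

    def __init__(self):
        self._proc = None

    def start(self):
        if self._proc is not None:
            raise RuntimeError("worker already running")
        self._proc = subprocess.Popen(
            [sys.executable, "-c", WORKER_CODE],
            stdout=subprocess.PIPE,
            text=True,
        )
        # Block until the child reports that it finished its setup.
        return self._proc.stdout.readline().strip()

    def stop(self, timeout=5.0):
        if self._proc is None:
            return None
        self._proc.terminate()
        try:
            self._proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            self._proc.kill()  # the child ignored terminate()
            self._proc.wait()
        self._proc.stdout.close()
        returncode = self._proc.returncode
        self._proc = None
        return returncode
```

Because every start() creates a new process, whatever state the library leaves behind goes away with the old process. Data that has to travel between parent and child could be serialized with pickle, as mentioned above.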
I have a pipeline which at some point splits work into various sub-processes that do the same thing in parallel. Thus their output should go into the same file.
Is it too risky to say all of those processes should write into the same file? Or does python try and retry if it sees that this resource is occupied?
This is system dependent. On Windows the file is locked and you get an exception. On Linux two processes can write to the same file, but the written data may end up interleaved.
Ideally in such cases you should use semaphores to synchronize access to shared resources.
If using semaphores is too heavy for your needs, then the only alternative is to write in separate files...
Edit: As pointed out by eye in a later post, a resource manager is another alternative to handle concurrent writers
In general, this is not a good idea and will take a lot of care to get right. Since the writes will have to be serialized, it might also adversely affect scalability.
I'd recommend writing to separate files and merging (or just leaving them as separate files).
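A minimal sketch of the separate-files-then-merge approach, assuming each sub-process has some integer worker id. The ".partN" naming scheme and the helper names are invented for this example.

```python
import os
import shutil


def part_path(output_path, worker_id):
    # Hypothetical naming scheme: one private file per worker.
    return f"{output_path}.part{worker_id}"


def write_part(output_path, worker_id, lines):
    """Called inside each sub-process; no other process touches this file."""
    with open(part_path(output_path, worker_id), "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")


def merge_parts(output_path, worker_ids):
    """Called once, after all workers have finished."""
    with open(output_path, "w", encoding="utf-8") as out:
        for worker_id in worker_ids:
            path = part_path(output_path, worker_id)
            with open(path, "r", encoding="utf-8") as part:
                shutil.copyfileobj(part, out)
            os.remove(path)
```

Since no two processes ever open the same file, no locking is needed, and the merge order is fully under your control.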
A better solution is to implement a resource manager (writer) to avoid opening the same file twice. This manager could use threading synchronization mechanisms (threading.Lock) to avoid simultaneous access on some platforms.
How about having all of the different processes write their output into a queue, and have a single process that reads that queue, and writes to the file?
Use multiprocessing.Lock() instead of threading.Lock(). A word of caution: this may reduce the concurrency of your processing, because a process has to wait until the lock is released.
I have two threads, one which writes to a file, and another which periodically moves the file to a different location. The writer always calls open before writing a message, and calls close after writing the message. The mover uses shutil.move to do the move.
I see that after the first move is done, the writer cannot write to the file anymore, i.e. the size of the file is always 0 after the first move. Am I doing something wrong?
Locking is a possible solution, but I prefer the general architecture of having each external resource (including a file) dealt with by a single, separate thread. Other threads send work requests to the dedicated thread on a Queue.Queue instance (and provide a separate queue of their own as part of the work request's parameters if they need a result back). The dedicated thread spends most of its time waiting on a .get on that queue; whenever it gets a request, it executes it (and returns results on the passed-in queue if needed).
I've provided detailed examples of this approach e.g. in "Python in a Nutshell". Python's Queue is intrinsically thread-safe and simplifies your life enormously.
Among the advantages of this architecture is that it translates smoothly to multiprocessing if and when you decide to switch some work to a separate process instead of a separate thread (e.g. to take advantage of multiple cores) -- multiprocessing provides its own workalike Queue type to make such a transition smooth as silk;-).
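Applied to the writer/mover question, that architecture might look roughly like the following. This is an illustrative sketch, not the code from the book: the FileOwner class and its ("write", "move", "stop") request format are invented here, and it uses the Python 3 module name queue.

```python
import os
import queue
import shutil
import threading


class FileOwner(threading.Thread):
    """The only thread that ever opens or moves the file."""

    def __init__(self, path):
        super().__init__(daemon=True)
        self._path = path
        self.requests = queue.Queue()

    def run(self):
        while True:
            action, arg, reply = self.requests.get()
            if action == "stop":
                reply.put(None)
                return
            try:
                if action == "write":
                    with open(self._path, "a", encoding="utf-8") as f:
                        f.write(arg + "\n")
                    result = None
                elif action == "move":
                    result = os.path.exists(self._path)
                    if result:
                        shutil.move(self._path, arg)
                else:
                    raise ValueError(f"unknown action: {action!r}")
            except Exception as exc:
                result = exc  # hand the error back to the requesting thread
            reply.put(result)

    def call(self, action, arg=None):
        """Send one request and wait for its result on a private queue."""
        reply = queue.Queue()
        self.requests.put((action, arg, reply))
        result = reply.get()
        if isinstance(result, Exception):
            raise result
        return result
```

The writer thread would call owner.call("write", message) and the mover thread owner.call("move", destination). Since both kinds of request are executed one at a time by the same thread, a move can never happen in the middle of a write.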
When two threads access the same resources, weird things happen. To avoid that, always lock the resource. Python has the convenient threading.Lock for that, as well as some other tools (see documentation of the threading module).
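For the two-thread case in the question, a lock-based version could look something like this. It is a minimal sketch that assumes both threads live in the same process; the function names are made up for the example.

```python
import os
import shutil
import threading

# One lock shared by the writer thread and the mover thread.
file_lock = threading.Lock()


def append_message(path, message):
    """Writer side: open, append and close while holding the lock."""
    with file_lock:
        with open(path, "a", encoding="utf-8") as f:
            f.write(message + "\n")


def move_file(path, destination):
    """Mover side: move the file only while no write is in progress."""
    with file_lock:
        if not os.path.exists(path):
            return False  # nothing has been written since the last move
        shutil.move(path, destination)
        return True
```

After a move, the next append_message call simply creates the file again, because it opens the path in append mode.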
Check out http://www.evanfosmark.com/2009/01/cross-platform-file-locking-support-in-python/
You can use a simple lock with his code, as written by Evan Fosmark in an older StackOverflow question:
from filelock import FileLock
with FileLock("myfile.txt"):
    # work with the file as it is now locked
    print("Lock acquired.")
One of the more elegant libraries I've ever seen.