Python, concurrency, critical sections - python

Here I have a question about possible critical sections.
In my code I have a function dealing with a queue. This function is the one and only producer that puts elements into the queue, but a number of threads operating concurrently get elements from it. Since there is a chance (I am not sure such a chance even exists, to be honest) that multiple threads will attempt to get one element each from the queue at the same time, is it possible that they will get exactly the same element?
One of the things my workers do is open a file (different workers open different files in exclusive directories). I am using the context manager with open(<some file>, 'w') as file: .... So is it possible that multiple threads opening different files at the same time, but using exactly the same variable name 'file', will mess things up? It looks like I have a critical section here, doesn't it?

Your first question is easy to answer with the documentation of the Queue class. If you implemented a custom queue, the locking is on you, but the documentation for Python's queue module states:
Internally, those three types of queues use locks to temporarily block competing threads; however, they are not designed to handle reentrancy within a thread.
I am uncertain whether your second question follows from the first. It would help to clarify it with an example.
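For what it's worth, here is a minimal sketch of the setup described in the question (the worker count, item count, and file paths are all invented for illustration). queue.Queue serializes get() calls internally, so no two workers ever receive the same item; and file is a plain local name, so each thread's stack frame holds its own binding.

import queue
import threading

task_queue = queue.Queue()

def worker(worker_id):
    while True:
        item = task_queue.get()        # thread-safe: no two workers receive the same item
        if item is None:               # sentinel value tells this worker to stop
            break
        # 'file' is a local variable; each thread's frame has its own binding,
        # so workers writing to different files cannot interfere.
        with open(f"out_{worker_id}_{item}.txt", "w") as file:
            file.write(f"processed {item}\n")

threads = [threading.Thread(target=worker, args=(n,)) for n in range(4)]
for t in threads:
    t.start()
for i in range(10):                    # the single producer
    task_queue.put(i)
for _ in threads:
    task_queue.put(None)               # one sentinel per worker
for t in threads:
    t.join()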

Related

Optimizing number of threads in Python

I am writing a program that analyses csv files in a directory, initially one file at a time. This could be several hundred files, but all of them are relatively small. My main runtime limitation was I/O, so I turned to multithreading using the threading library, which is a first for me.
I created a thread for each function call, following this guide, where each function call opens a csv in the desired directory. As a result, I have a list of threads, one for each file (i.e. hundreds of threads). However, my program still ran slowly, with the bulk of its time spent in the 'acquire' method of '_thread.lock' objects, according to cProfile. I believe that this is because the large number of threads results in lots of threads waiting for others to finish their tasks - is this correct?
How would you recommend I resolve this? My current idea is to split my list of files into equally sized chunks and to assign a thread to each chunk, rather than a thread to each file, and for each thread to iterate through the files in each chunk.
Python has something called the Global Interpreter Lock, which seriously hurts your performance with that many threads, as each one is waiting to hold the interpreter lock. I would recommend using processes, which, if I remember correctly, are similar to Python thread objects in their use but do not suffer the same penalty of waiting for a lock. A thread and a process are different, but for your application it sounds like it should not matter.
It is worth noting that the GIL can be released when performing I/O such as reading from a file, and therefore using threads might be fine - you just need to use fewer of them. In fact, with the number of threads/processes you are looking to create it might be a better idea to use a fixed pool of workers.
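As a hedged sketch of that fixed-pool idea using concurrent.futures (the analyse() body and the "data" directory are placeholders for whatever your program actually does with each CSV):

import csv
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def analyse(path):
    # Reading the file is I/O-bound and releases the GIL, so a small
    # pool of threads can overlap reads without hundreds of threads.
    with open(path, newline="") as f:
        return path.name, sum(1 for _ in csv.reader(f))

files = list(Path("data").glob("*.csv"))   # hypothetical input directory
with ThreadPoolExecutor(max_workers=8) as pool:
    for name, rows in pool.map(analyse, files):
        print(name, rows)

If the per-file work turns out to be CPU-bound rather than I/O-bound, swapping ThreadPoolExecutor for ProcessPoolExecutor is a one-line change.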

When to use event/condition/lock/semaphore in python's threading module?

Python provides four different synchronization mechanisms in the threading module: Event, Condition, Lock/RLock, and Semaphore.
I understand they can be used to synchronize access to shared resources/critical sections between threads, but I am not quite sure when to use which.
Can they be used interchangeably? Or are some of them 'higher level', using others as building blocks? If so, which ones are built on which?
It would be great if someone can illustrate with some examples.
This article probably contains all the information you need. The question is indeed very broad, but let me try to give an example of how I use each:
Event - Use it when you need threads to communicate that a certain state was met, so they can work together in sync. I use it mostly for the initiation process of two threads where one depends on the other.
Example: A client has a threaded manager, and its __init__() needs to know the manager is done instantiating some attributes before it can move on.
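A minimal sketch of that initiation pattern (the manager/client names are invented to illustrate the handshake):

import threading

ready = threading.Event()

def manager():
    # ... instantiate attributes here ...
    ready.set()        # signal that initialisation is finished

def client():
    ready.wait()       # block until the manager signals readiness
    print("manager is ready, client can continue")

threading.Thread(target=manager).start()
threading.Thread(target=client).start()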
Lock/RLock - Use it when you are working with a shared resource and you want to make sure no other thread is reading or writing to it. I'd argue that while locking before writing is mandatory, locking before reading can be optional; either way, it is good to make sure that while you are reading or writing, no other thread is modifying the resource at the same time. RLock can be acquired multiple times by its owner, and release() must be called the same number of times acquire() was before another thread trying to acquire it can succeed.
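A small sketch of both primitives; 'counter' stands in for any shared resource, and the nested functions exist only to show reentrancy:

import threading

lock = threading.Lock()
counter = 0

def increment():
    global counter
    with lock:         # only one thread at a time in the critical section
        counter += 1

rlock = threading.RLock()

def outer():
    with rlock:        # first acquisition by this thread
        inner()

def inner():
    with rlock:        # same thread re-acquires: fine with RLock,
        pass           # a plain Lock would deadlock here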
I haven't used Condition that much, and frankly never had to use Semaphore, so this answer has room for editing and improvement.

Multiprocessing or Multithreading for plugin architecture in Python

I'm trying to implement a plugin architecture in Python.
I've started writing it using the threading module, where each plugin is a thread which I invoke using the Thread.start() method (since all plugins subclass BasePlugin, which subclasses Thread). However, I've just come across the multiprocessing module.
I'm currently wondering if I should switch to the multiprocessing module and share data using shared memory / Pipes etc...
I'd like to get others' opinions on this.
The plugin architecture I've been working on works as follows:
An event is received by the Plugin Manager. The Plugin Manager checks for all the plugins that have subscribed to that type of event. It activates them and sends them the event object (since it holds additional information). If one of the plugins is already active, there is no need to spawn it (just send the event object to it).
In addition there are a few resources which belong only to one plugin at any point in time. Each plugin can request the resource (I'm not worrying about any race condition here since there won't be that many plugins active at once).
Threads share memory with the primary process and each other. For example you can have a list that is available to all threads. An item appended to a list can be seen by other threads. But you have to be careful. You have to understand which operations on data structures are thread safe and which are not. What happens to the behaviour of your program when two threads are checking for the existence of a key in a dictionary and then writing to it?
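To make that hazard concrete, here is a hedged sketch of the check-then-write race (the cache dictionary is an invented example): the membership test and the assignment are separate operations, so another thread can run in between, and the lock closes that window.

import threading

cache = {}
cache_lock = threading.Lock()

def set_default_unsafe(key, value):
    if key not in cache:     # thread A checks...
        cache[key] = value   # ...but thread B may have written in between

def set_default_safe(key, value):
    with cache_lock:         # check and write now happen atomically
        if key not in cache:
            cache[key] = value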
Multiple processes do not share memory. The new process that you start gets a copy of the memory at the point where it was spawned.
Threads use fewer resources but can be hard to reason about. Communication between processes, on the other hand, is tricky, and you can't just access an arbitrary Python data structure, which it sounds like you want to be able to do.
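As a hedged sketch of what explicit inter-process communication looks like with a Pipe (the event payload is invented): anything sent must be picklable, which is exactly why arbitrary shared structures don't carry over.

from multiprocessing import Pipe, Process

def plugin(conn):
    event = conn.recv()                  # a pickled copy arrives, not shared memory
    conn.send({"handled": event["type"]})
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    p = Process(target=plugin, args=(child_conn,))
    p.start()
    parent_conn.send({"type": "file_created"})   # hypothetical event object
    print(parent_conn.recv())                    # {'handled': 'file_created'}
    p.join()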
A badly written plugin, if it was in a thread, could crash your whole program. Whereas if it was in a separate process this wouldn't happen. Maybe that's a consideration?

Concurrently searching a graph in Python 3

I'd like to create a small p2p application that concurrently processes incoming data from other known / trusted nodes (it mostly stores it in an SQLite database). In order to recognize these nodes, upon connecting, each node introduces itself and my application then needs to check whether it knows this node directly or maybe indirectly through another node. Hence, I need to do a graph search which obviously needs processing time and which I'd like to outsource to a separate process (or even multiple worker processes? See my 2nd question below). Also, in some cases it is necessary to adjust the graph, add new edges or vertices.
Let's say I have 4 worker processes accepting and handling incoming connections via asynchronous I/O. What's the best way for them to access (read / modify) the graph? A single queue obviously doesn't do the trick for read access because I need to pass the search results back somehow.
Hence, one way to do it would be another queue which would be filled by the graph searching process and which I could add to the event loop. The event loop could then pass the results to a handler. However, this event/callback-based approach would make it necessary to also always pass the corresponding sockets to the callbacks and thus to the Queue – which is nasty because sockets are not picklable. (Let alone the fact that callbacks lead to spaghetti code.)
Another idea that's just crossed my mind might be to create a pipe to the graph process for each incoming connection and then, on the graph's side, do asynchronous I/O as well. However, in order to avoid callbacks, if I understand correctly, I would need an async I/O library making use of yield from (i.e. tulip / PEP 3156). Are there other options?
Regarding async I/O on the graph's side: This is certainly the best way to handle many incoming requests at once but doing graph lookups is a CPU intensive task, thus could profit from using multiple worker threads or processes. The problem is: Multiple threads allow shared data but Python's GIL somewhat negates the performance benefit. Multiple processes on the other hand don't have this problem but how can I share and synchronize data between them? (For me it seems quite impossible to split up a graph.) Is there any way to solve this problem in a nice way? Also, does it make sense in terms of performance to mix asynchronous I/O with multithreading / multiprocessing?
Answering your last question: it does! But, IMHO, the real question is: does it make sense to mix events and threads? You can check this article about hybrid concurrency models: http://bibliotecadigital.sbc.org.br/download.php?paper=3027
My tip: start with just one process and an event loop, like in the tulip model. I'll try to explain how you can use tulip to have events + async I/O (and threads or other processes) without callbacks at all.
You could have something like accept = yield from check_incoming(), where check_incoming() is a tulip coroutine; inside it you can use loop.run_in_executor() to run your graph search in a thread/process pool (more on this below). run_in_executor() returns a Future, which you can wait on with yield from tasks.wait([future_returned_by_run_in_executor], loop=self). The next step would be result = future_returned_by_run_in_executor.result(), and finally you return True or False.
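The same pattern in today's asyncio spelling, where async/await replaced tulip's yield from (graph_search() and the node id are invented placeholders):

import asyncio
from concurrent.futures import ProcessPoolExecutor

def graph_search(node_id):
    # CPU-bound lookup; running it in the pool keeps the event loop responsive
    return node_id in {"alice", "bob"}

async def check_incoming(loop, pool, node_id):
    # run_in_executor returns a future the coroutine can await directly
    return await loop.run_in_executor(pool, graph_search, node_id)

async def main():
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        print(await check_incoming(loop, pool, "alice"))

if __name__ == "__main__":
    asyncio.run(main())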
The process pool requires that only picklable objects be submitted and returned. That requirement is not a problem in itself, but it implies that the graph operation must be self-contained in a function and must obtain the graph instance somehow. The thread pool has the GIL problem, since you mentioned CPU-bound tasks, which can lead to 'acquiring-GIL' conflicts, although this was improved with the new GIL in Python 3.x. Both solutions have limitations.
So, instead of a pool, you can have another single process with its own event loop just to manage all the graph work, and connect the two processes with a Unix domain socket, for instance.
This second process, just like the first one, must also accept incoming connections (but now from a known source) and can use a thread pool as described above; that pool won't "conflict" with the first event loop process (the one that handles external clients), only with the second event loop. Threads sharing the same graph instance require some locking/unlocking.
Hope it helped!

Python multiple threads accessing same file

I have two threads: one which writes to a file, and another which periodically moves the file to a different location. The writer always calls open before writing a message and calls close after writing it. The mover uses shutil.move to do the move.
I see that after the first move is done, the writer cannot write to the file anymore, i.e. the size of the file is always 0 after the first move. Am I doing something wrong?
Locking is a possible solution, but I prefer the general architecture of having each external resource (including a file) dealt with by a single, separate thread. Other threads send work requests to the dedicated thread on a Queue.Queue instance (providing a separate queue of their own as part of the work request's parameters if they need results back); the dedicated thread spends most of its time waiting on a .get on that queue, and whenever it gets a request it goes and executes it (returning results on the passed-in queue if needed).
I've provided detailed examples of this approach e.g. in "Python in a Nutshell". Python's Queue is intrinsically thread-safe and simplifies your life enormously.
Among the advantages of this architecture is that it translates smoothly to multiprocessing if and when you decide to switch some work to a separate process instead of a separate thread (e.g. to take advantage of multiple cores) -- multiprocessing provides its own workalike Queue type to make such a transition smooth as silk;-).
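A minimal sketch of that dedicated-thread architecture (the file name and message are placeholders): one thread owns the file, and every other thread submits write requests through the queue.

import queue
import threading

write_queue = queue.Queue()

def file_owner(path):
    while True:
        message = write_queue.get()
        if message is None:            # sentinel: shut the owner down
            break
        with open(path, "a") as f:     # only this thread ever touches the file
            f.write(message + "\n")

owner = threading.Thread(target=file_owner, args=("log.txt",))
owner.start()

write_queue.put("hello from another thread")
write_queue.put(None)
owner.join()

The mover thread from the question would likewise route its shutil.move through this owner, so the file is never moved mid-write.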
When two threads access the same resources, weird things happen. To avoid that, always lock the resource. Python has the convenient threading.Lock for that, as well as some other tools (see documentation of the threading module).
Check out http://www.evanfosmark.com/2009/01/cross-platform-file-locking-support-in-python/
You can use a simple lock with his code, as written by Evan Fosmark in an older StackOverflow question:
from filelock import FileLock

with FileLock("myfile.txt"):
    # work with the file as it is now locked
    print("Lock acquired.")
One of the more elegant libraries I've ever seen.
