Loading Threadable classes dynamically in Python

I have a question posted here, and I got it resolved.
My new question has to do with the code at the end that iterates through the modules in the directory and loads them dynamically:
modules = pkgutil.iter_modules(path=[os.path.join(path, 'scrapers')])
for loader, mod_name, ispkg in modules:
    # Ensure that module isn't already loaded, and that it isn't the parent class
    if (mod_name not in sys.modules) and (mod_name != "Scrape_BASE"):
        # Import module
        loaded_mod = __import__('scrapers.' + mod_name, fromlist=[mod_name])
        # Load class from imported module. Make sure the module and the class are named the same
        class_name = mod_name
        loaded_class = getattr(loaded_mod, class_name)
        # Only instantiate subclasses of Scrape_BASE
        if issubclass(loaded_class, Scrape_BASE.Scrape_BASE):
            # Create an instance of the class and run it
            instance = loaded_class()
            instance.start()
            instance.join()
            text = instance.GetText()
In most of the classes I am reading a PDF from a website, scraping the content and setting the text that is subsequently returned by GetText().
In some cases the PDF is too big and I end up with a segmentation fault. Is there a way to monitor the threads so that they time out after 3 minutes or so? Does anyone have a suggestion as to how to implement this?

The right way to do this is to change the code in those classes that you haven't shown us, so that they don't run forever. If that's possible, you should definitely do that. And if what you're trying to time out is "reading the PDF from a website", it's almost certainly possible.
But sometimes, it isn't possible; sometimes you're just, e.g., calling some C function that has no timeout. So, what do you do about that?
Well, threads can't be interrupted. So you need to use processes instead. multiprocessing.Process is very similar to threading.Thread, except that it runs the code in a child process instead of a thread in the same process.
This does mean that you can't share any global data with your workers without making it explicit, but that's generally a good thing. However, it does mean that the input data (which in this case doesn't seem to be anything) and the output (which seems to be a big string) have to be picklable, and explicitly passed over queues. This is pretty easy to do; read the Exchanging objects between processes section for details.
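For example, here is a rough sketch of how each scraper class could be handed to a child process and abandoned after three minutes. The names _run_scraper and scrape_with_timeout are made up for illustration, and this assumes each dynamically loaded class is importable in the child process (true under fork, and under spawn as long as the scrapers package is on sys.path):

import multiprocessing
import queue

def _run_scraper(cls, result_queue):
    # Runs in the child process: mirror the original thread-based flow
    # and send the scraped text back over the queue.
    instance = cls()
    instance.start()
    instance.join()
    result_queue.put(instance.GetText())

def scrape_with_timeout(cls, timeout=180):
    result_queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_run_scraper, args=(cls, result_queue))
    proc.start()
    proc.join(timeout)            # wait at most `timeout` seconds
    if proc.is_alive():           # still running? give up and kill it
        proc.terminate()          # something you could never do to a thread
        proc.join()
        return None
    try:
        return result_queue.get(timeout=1)
    except queue.Empty:           # child died (e.g. segfault) before producing a result
        return None

As a side benefit, a segmentation fault on an oversized PDF now only kills that one child process instead of taking down the whole program.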
While we're at it, you may want to consider rethinking your design to think in terms of tasks instead of threads. If you have, say, 200 PDFs to download, you don't really want 200 threads; you want maybe 8 or 12 threads, all servicing a queue of 200 jobs. The multiprocessing module has support for process pools, but you may find concurrent.futures a better fit for this. Both multiprocessing.Pool and concurrent.futures.ProcessPoolExecutor let you just pass a function and some arguments, and then wait for the results, without having to worry about scheduling or queues or anything else.
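As a sketch of that task-based design, something along these lines could work; scrape_one is a hypothetical top-level helper, and scraper_classes stands in for the list of Scrape_BASE subclasses collected by the pkgutil loop above:

import concurrent.futures

def scrape_one(cls):
    # Hypothetical top-level helper; it must be importable so it can be pickled.
    instance = cls()
    instance.start()
    instance.join()
    return instance.GetText()

if __name__ == '__main__':
    scraper_classes = []          # fill this in from the pkgutil loop above
    with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
        futures = {executor.submit(scrape_one, cls): cls.__name__ for cls in scraper_classes}
        for future in concurrent.futures.as_completed(futures):
            name = futures[future]
            try:
                text = future.result()
            except Exception as exc:
                print('%s failed: %s' % (name, exc))
            else:
                print('%s returned %d characters' % (name, len(text)))

Note that a pool does not give you a hard per-task kill switch: future.result() accepts a timeout, but a stuck worker process is not terminated for you, so if you need the three-minute cutoff you would combine this with the one-process-per-task sketch above.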

Related

What gets re-loaded in new instances of a program using multiprocessing in Python

I gather that using multiprocessing causes new instances of the program to be initialized. For example, the following code would cause 7 instances to be initialized:
pool = mp.Pool(processes=7)
[pool.apply_async(process, args=(properties,)) for properties in properties_list]
My question, though, is what actually gets loaded in the new instance? For example, if I have file_a.py, which calls a function from file_b.py, and the function in file_b.py uses multiprocessing, is it just file_b.py that gets re-loaded, or both file_a.py and file_b.py?
I believe all of your imports will be imported.
The details vary depending on your start method.
fork method (the default on Unix) - this actually forks your program. In this case, all memory and resources of the parent are cloned to the child. Whatever was loaded in the parent will be loaded in the child. (And all resources, like file descriptors, will be shared between the two processes.)
spawn or forkserver method (spawn is the default on Windows) - both of these start at least one new instance of your Python interpreter and pickle whatever arguments and other resources are needed for the run method. As far as I'm aware, all of file_a.py is parsed, including the import of file_b.py.
In both of the latter cases, the documentation says that "no unnecessary resources are inherited", but this isn't referring to loading imported code; it's talking about operating system resources like shared memory access or file descriptors.
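A small experiment makes the difference visible. This is only a sketch, with process and properties_list as placeholder stand-ins for the names in the question:

import multiprocessing as mp

def process(properties):
    # Stand-in for the worker function from the question.
    return properties * 2

if __name__ == '__main__':        # required under spawn: the child re-imports this file
    # 'fork' clones the parent's memory, so everything already imported is present;
    # 'spawn' (and 'forkserver') start a fresh interpreter and re-import the main module.
    mp.set_start_method('spawn')  # try 'fork' on Unix to compare behaviour
    properties_list = [1, 2, 3]   # placeholder data
    with mp.Pool(processes=7) as pool:
        results = [pool.apply_async(process, args=(p,)) for p in properties_list]
        print([r.get() for r in results])

Under spawn, the child re-executes the main module's top-level code when it re-imports it, which is exactly why the if __name__ == '__main__' guard is required there.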

Why do we need gevent.queue?

My understanding of Gevent is that it's merely concurrency and not parallelism. My understanding of concurrency mechanisms like Gevent and AsyncIO is that nothing in the Python application ever executes at the same time.
The closest you get is calling a non-blocking IO method, and while waiting for that call to return, other methods within the Python application can run. Again, no two methods within the Python application ever actually execute Python code at the same time.
With that said, why is there a need for gevent.queue? It sounds to me like the Python application doesn't really need to worry about more than one Python method accessing a queue instance at a time.
I'm sure there's a scenario that I'm not seeing that gevent.queue fixes, I'm just curious what that is.
Although you are right that no two statements execute at the same time within a single Python process, you might want to ensure that a series of statements executes atomically, or you might want to impose an order on the execution of certain statements. A greenlet can yield control at any blocking operation, so a compound update that spans one is not atomic; gevent.queue also gives you put() and get() calls that block cooperatively, letting one greenlet wait for another to hand it work, and in those cases things like gevent.queue become useful. A tutorial is here.
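As an illustrative sketch (not from the original answer), a bounded gevent queue lets one greenlet hand work to others and block cooperatively when the buffer is full, which is the kind of ordering a bare list can't give you:

import gevent
from gevent.queue import Queue, Empty

tasks = Queue(maxsize=3)          # bounded queue: put() yields when full

def producer():
    for n in range(10):
        print('producing %d' % n)
        tasks.put(n)              # blocks cooperatively if the consumers fall behind

def consumer(name):
    while True:
        try:
            n = tasks.get(timeout=1)   # yields to other greenlets while waiting
        except Empty:
            return
        print('%s got %d' % (name, n))
        gevent.sleep(0)           # pretend to do some work

gevent.joinall([
    gevent.spawn(producer),
    gevent.spawn(consumer, 'worker-1'),
    gevent.spawn(consumer, 'worker-2'),
])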

When does a module scope variable reference get released by the interpreter?

I'm trying to implement a clean-up routine in a utility module I have. In looking around for solutions to my problem, I finally settled on using a weakref callback to do my cleanup. However, I'm concerned that it won't work as expected because of a strong reference to the object from within the same module. To illustrate:
foo_lib.py
import weakref

class Foo(object):
    _refs = {}

    def __init__(self, x):
        self.x = x
        self._weak_self = weakref.ref(self, Foo._clean)
        Foo._refs[self._weak_self] = x

    @classmethod
    def _clean(cls, ref):
        print('cleaned %s' % cls._refs[ref])

foo = Foo('bar')
Other classes then reference foo_lib.foo. I did find an old document from 1.5.1 that sort of references my concerns (http://www.python.org/doc/essays/cleanup/) but nothing that makes me fully comfortable that foo will be released in such a way that the callback will be triggered reliably. Can anyone point me towards some docs that would clear this question up for me?
The right thing to do here is to explicitly release your strong reference at some point, rather than counting on shutdown to do it.
In particular, if the module is released, its globals will be released… but it doesn't seem to be documented anywhere that the module will get released. So, there may still be a reference to your object at shutdown. And, as Martijn Pieters pointed out:
It is not guaranteed that __del__() methods are called for objects that still exist when the interpreter exits.
However, if you can ensure that there are no (non-weak) references to your object some time before the interpreter exits, you can guarantee that your cleanup runs.
You can use atexit handlers to explicitly clean up after yourself, or you can just do it explicitly before falling off the end of your main module (or calling sys.exit, or finishing your last non-daemon thread, or whatever). The simplest thing to do is often to take your entire main function and wrap it in a with or try/finally.
Or, even more simply, don't try to put cleanup code into __del__ methods or weakref callbacks; just put the cleanup code itself into your with or finally or atexit.
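A minimal sketch of that idea, with Resource as a made-up stand-in for whatever the module-level object actually manages:

import atexit

class Resource(object):
    # Placeholder standing in for whatever foo_lib.foo actually manages.
    def __init__(self):
        self.closed = False
    def close(self):
        if not self.closed:          # make cleanup idempotent
            self.closed = True
            print('cleaned up')

resource = Resource()

# Either clean up explicitly before main() returns...
def main():
    try:
        pass                         # real work goes here
    finally:
        resource.close()

# ...or register the cleanup so the interpreter calls it at exit.
# atexit handlers run before module globals are torn down, so `resource`
# is still an ordinary live object when this fires.
atexit.register(resource.close)

if __name__ == '__main__':
    main()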
In a comment on another answer:
what I'm actually trying to do is close out a subprocess that is normally kept open by a timer, but needs to be nuked when the program exits. Is the only really "reliable" way to do this to start a daemonic subprocess to monitor and kill the other process separately?
The usual way to do this kind of thing is to replace the timer with something signalable from outside. Without knowing your app architecture and what kind of timer you're using (e.g., a single-threaded async server where the reactor kicks the timer vs. a single-threaded async GUI app where an OS timer message kicks the timer vs. a multi-threaded app where the timer is just a thread that sleeps between intervals vs. …), it's hard to explain more specifically.
Meanwhile, you may also want to look at whether there's a simpler way to handle your subprocesses. For example, maybe using an explicit process group, and killing your process group instead of your process (which will kill all of the children, on both Windows and Unix… although the details are very different)? Or maybe give the subprocess a pipe and have it quit when the other end of the pipe goes down?
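On Unix, the process-group approach might look roughly like this (a sketch under those assumptions, not a drop-in solution; Windows needs a different mechanism, such as job objects):

import atexit
import os
import signal
import subprocess

# Start the child in its own session/process group (Unix only).
proc = subprocess.Popen(['sleep', '1000'], start_new_session=True)

def nuke_subprocess():
    # Signal the whole group, which also catches any grandchildren it spawned.
    try:
        os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
    except ProcessLookupError:
        pass                         # already gone

atexit.register(nuke_subprocess)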
Note that the documentation also gives you no guarantees about the order in which left-over references are deleted, if they are. In fact, if you're using CPython, Py_Finalize specifically says that it's "done in random order".
The source is interesting. It's obviously not explicitly randomized, and it's not even entirely arbitrary. First it does GC collect until nothing is left, then it finalizes the GC itself, then it does a PyImport_Cleanup (which is basically just sys.modules.clear()), then there's another collect commented out (with some discussion as to why), and finally a _PyImport_Fini (which is defined only as "For internal use only").
But this means that, assuming your module really is holding the only (non-weak) reference(s) to your object, and there are no unbreakable cycles involving the module itself, your module will get cleaned up at shutdown, which will drop the last reference to your object, causing it to get cleaned up as well. (Of course you cannot count on anything other than builtins, extension modules, and things you have a direct reference to still existing at this point… but your code above should be fine, because foo can't be cleaned up before Foo, and it doesn't rely on any other non-builtins.)
Keep in mind that this is CPython-specific—and in fact CPython 3.3-specific; you will want to read the relevant equivalent source for your version to be sure. Again, the documentation explicitly says things get deleted "in random order", so that's what you have to expect if you don't want to rely on implementation-specific behavior.
Of course your cleanup code still isn't guaranteed to be called. For example, an unhandled signal (on Unix) or structured exception (on Windows) will kill the interpreter without giving it a chance to clean up anything. And even if you write handlers for that, someone could always pull the power cord. So, if you need a completely robust design, you need to be interruptable without cleanup at any point (by journaling, using atomic file operations, protocols with explicit acknowledgement, etc.).
Python modules are cleaned up when exiting, and any __del__ methods probably are called:
It is not guaranteed that __del__() methods are called for objects that still exist when the interpreter exits.
Names starting with an underscore are cleared first:
Starting with version 1.5, Python guarantees that globals whose name begins with a single underscore are deleted from their module before other globals are deleted; if no other references to such globals exist, this may help in assuring that imported modules are still available at the time when the __del__() method is called.
Weak reference callbacks rely on the same mechanisms as __del__ methods do: the C deallocation functions (type->tp_dealloc).
The foo instance will retain a reference to the Foo._clean class method, but the global name Foo could be cleared already (it is assigned None in CPython); your method should be safe as it never refers to Foo once the callback has been registered.

How to convert Python threading code to multiprocessing code?

I need to convert a threading application to a multiprocessing application for multiple reasons (GIL, memory leaks). Fortunately the threads are quite isolated and only communicate via Queue.Queues. This primitive is also available in multiprocessing so everything looks fine. Now before I enter this minefield I'd like to get some advice on the upcoming problems:
How to ensure that my objects can be transferred via the Queue? Do I need to provide some __setstate__?
Can I rely on put returning instantly (like with threading Queues)?
General hints/tips?
Anything worthwhile to read apart from the Python documentation?
Answer to part 1:
Everything that has to pass through a multiprocessing.Queue (or Pipe or whatever) has to be picklable. This includes basic types such as tuples, lists and dicts. Classes are also supported if they are defined at top level and not too complicated (check the pickle documentation for details). Trying to pass lambdas around will fail, however.
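A quick way to find out ahead of time whether something will survive the trip is simply to try pickling it; for example:

import pickle

class Job(object):
    # A simple top-level class: picklable, so it can cross a multiprocessing.Queue.
    def __init__(self, url):
        self.url = url

pickle.dumps(Job('http://example.com'))       # fine
print('Job pickles OK')

try:
    pickle.dumps(lambda: None)                # lambdas (and nested functions) fail
except (pickle.PicklingError, AttributeError, TypeError) as exc:
    print('cannot pickle a lambda: %s' % exc)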
Answer to part 2:
A put consists of two parts: It takes a semaphore to modify the queue and it optionally starts a feeder thread. So if no other Process tries to put to the same Queue at the same time (for instance because there is only one Process writing to it), it should be fast. For me it turned out to be fast enough for all practical purposes.
Partial answer to part 3:
The plain multiprocessing.Queue lacks a task_done() method, so it cannot be used as a drop-in replacement for Queue.Queue directly; the multiprocessing.JoinableQueue subclass provides it (see the sketch below).
The old processing package's Queue lacked a qsize() method, and the newer multiprocessing version's qsize() is only approximate (just keep this in mind).
Since file descriptors are normally inherited on fork, care needs to be taken about closing them in the right processes.
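As a sketch of the task_done point above, multiprocessing.JoinableQueue can stand in where the threading code relied on Queue.join():

import multiprocessing

def worker(q):
    while True:
        item = q.get()
        if item is None:        # sentinel: stop
            q.task_done()
            break
        # ... process item here ...
        q.task_done()

if __name__ == '__main__':
    q = multiprocessing.JoinableQueue()
    p = multiprocessing.Process(target=worker, args=(q,))
    p.start()
    for item in range(5):
        q.put(item)
    q.put(None)
    q.join()                    # blocks until every put item has been task_done()'d
    p.join()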

Static methods and thread safety

In Python, with the whole idea that "everything is an object", where does thread safety come in?
I am developing a Django website served via WSGI. It will run on Linux, which as far as I know uses efficient process management, so maybe I don't need to think about thread safety much. I am also unsure how modules are loaded, and whether their functions behave like static methods or not. Any information would be helpful.
Functions in a module are equivalent to static methods in a class. The issue of thread safety arises when multiple threads may be modifying shared data, or even one thread may be modifying such data while others are reading it; it's best avoided by making data be owned by ONE module (accessed via Queue.Queue from others), but when that's not feasible you have to resort to locking and other, more complex, synchronization primitives.
This applies whether the access to shared data happens in module functions, static methods, or instance methods -- and the data is shared whether it lives in instance variables, class variables, or globals (scoping and thread safety are essentially disjoint, except that function-local data is, to a point, intrinsically thread-safe -- no other thread will ever see the data inside a function invocation, until and unless the function deliberately "shares" it through shared containers).
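As a sketch of that "data owned by ONE module" pattern (using the Python 3 spelling queue.Queue for what the answer calls Queue.Queue), other threads send requests over a queue instead of touching the data directly:

import queue
import threading

commands = queue.Queue()

def owner():
    # The ONE thread that owns the shared data; everyone else goes through the queue.
    counter = 0
    while True:
        cmd = commands.get()
        if cmd is None:
            break
        counter += cmd
        print('counter is now %d' % counter)

t = threading.Thread(target=owner)
t.start()
for n in range(5):
    commands.put(n)       # other threads never touch `counter` directly
commands.put(None)
t.join()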
If you use the multiprocessing module in Python's standard library, instead of the threading module, you may in fact not have to care about "thread safety" -- essentially because NO data is shared among processes... well, unless you go out of your way to change that, e.g. via mmapped files;-).
See the Python documentation to better understand the general thread-safety implications of Python.
Django itself seems to be thread safe as of 1.0.3, but your code may not be, and you will have to verify that...
My advice would be to simply not worry about that and serve your application with multiple processes instead of multiple threads (for example by using Apache's 'prefork' MPM instead of the 'worker' MPM).
