Perform Heavy Computations on Shared Data in Parallel in Python

A quick question about parallel processing in Python. Let's say I have a big shared data structure and want to apply many functions to it in parallel. These functions are read-only on the data structure but perform mutations on a result object:
def compute_heavy_task(self):
    big_shared_object = self.big_shared_object
    result_refs = self.result_refs
    for ref in result_refs:
        some_expensive_task(ref, big_shared_object)
How do I run these in parallel, say 5 at a time, or 10 at a time? And how about one per processor?

You cannot usefully do this with threads in Python (at least not in the CPython implementation you're probably using). The Global Interpreter Lock means that, instead of the near-800% efficiency you'd like out of 8 cores, you only get 90%.
But you can do this with separate processes. There are two options for this built into the standard library: concurrent.futures and multiprocessing. In general, futures is simpler in simple cases and often easier to compose; multiprocessing is more flexible and powerful overall. futures also only ships with Python 3.2 or later, but there's a backport for 2.5-3.1 on PyPI.
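For the simple case where nothing needs to be shared in place, a minimal concurrent.futures sketch might look like the following (some_expensive_task and result_refs come from the question; everything else is illustrative). Each submitted task receives a pickled copy of its arguments, which is exactly why a big shared structure pushes you toward multiprocessing, and leaving max_workers unset defaults to the number of processors.

import concurrent.futures

def run_in_parallel(result_refs, big_shared_object, max_workers=5):
    # Each task gets its own pickled copy of big_shared_object, so this
    # only pays off if the object is cheap enough to copy to every worker.
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(some_expensive_task, ref, big_shared_object)
                   for ref in result_refs]
        return [f.result() for f in futures]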
One of the cases where you want the flexibility of multiprocessing is when you have a big shared data structure. See Sharing state between processes and the sections directly above, below, and linked from it for details.
If your data structure is really simple, like a giant array of ints, this is pretty simple:
import multiprocessing

def _init_worker(shared_array, shared_lock):
    # shared ctypes objects have to be inherited by the workers, not pickled as task arguments
    global big_shared_object, lock
    big_shared_object = shared_array
    lock = shared_lock

def subtask(my_range):
    return some_expensive_task(big_shared_object, lock, my_range)

class MyClass(object):
    def __init__(self, giant_iterator_of_ints):
        self.big_shared_object = multiprocessing.Array('i', giant_iterator_of_ints)

    def compute_heavy_task(self):
        lock = multiprocessing.Lock()
        pool = multiprocessing.Pool(5, initializer=_init_worker,
                                    initargs=(self.big_shared_object, lock))
        my_ranges = split_into_chunks_appropriately(len(self.big_shared_object))
        results = pool.map(subtask, my_ranges)
        pool.close()
        pool.join()
        return results
Note that the some_expensive_task function now takes a lock object—it has to make sure to acquire the lock around every access to the shared object (or, more often, every "transaction" made up of one or more accesses). Lock discipline can be tricky, but there's really no way around it if you want to use direct data sharing.
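As a rough illustration of that discipline (the body here is hypothetical, assuming each my_range is a (start, stop) pair and squaring stands in for whatever the real work is), some_expensive_task might look like:

def some_expensive_task(shared_array, lock, my_range):
    start, stop = my_range
    total = 0
    for i in range(start, stop):
        with lock:                   # hold the lock only for the shared access itself
            value = shared_array[i]
        total += value * value       # do the expensive part outside the lock
    return total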
Also note that it takes a my_range. If you just call the same function 5 times on the same object, it'll do the same thing 5 times, which probably isn't very useful. One common way to parallelize things is to give each task a sub-range of the overall data set. (Besides being usually simple to describe, if you're careful with this, with the right kinds of algorithms, you can even avoid a lot of locking this way.)
If you instead want to map a bunch of different functions to the same dataset, you obviously need some collection of functions to work on, rather than just using some_expensive_task repeatedly. You can then, e.g., iterate over these functions calling apply_async on each one. But you can also just turn it around: write a single applier function, as a closure around the data, that takes a function and applies it to the data. Then, just map that function over the collection of functions.
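A sketch of that turned-around version: because a closure over the data won't pickle across processes, this reuses a pool set up with the same initializer as in the sketch above, so big_shared_object is available as a worker global, and the analysis functions (compute_mean and friends) are hypothetical module-level callables.

def apply_to_shared(func):
    # runs in a worker; big_shared_object was installed by the pool initializer
    return func(big_shared_object)

analysis_functions = [compute_mean, compute_histogram, find_outliers]
results = pool.map(apply_to_shared, analysis_functions)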
I've also assumed that your data structure is something you can define with multiprocessing.Array. If not, you're going to have to design the data structure in C style, implement it as a ctypes Array of Structures or vice-versa, and then use the multiprocessing.sharedctypes stuff.
I've also moved the result object into results that just get passed back. If they're also huge and need to be shared, use the same trick to make them sharable.
Before going further with this, you should ask yourself whether you really do need to share the data. Doing things this way, you're going to spend 80% of your debugging, performance-tuning, etc. time adding and removing locks, making them more or less granular, etc. If you can get away with passing immutable data structures around, or work on files, or a database, or almost any other alternative, that 80% can go toward the rest of your code.

Related

Python: Use of threading for fitness evaluation in Genetic Algorithms

I'm curious if threading is the right approach for my use case. I'm working on a genetic algorithm, which needs to evaluate the fitness of genes 1...n. The evaluation of each gene is independent of the others for the most part. Yet, each gene will be passed through the same function, eval(gene).
My intention is that once all genes have been evaluated, I will sort by fitness, and only retain top x.
From this tutorial, it seems that I should be able to do the following, where the specifics of eval are out of scope for this question, but suppose each function call updates a common dictionary of form, {gene : fitness}:
for gene in gene_pool:
    thread_i = threading.Thread(target=eval(gene), name=f"fitness gene{i}")
    thread_i.start()
for i in range(len(genes)):
    thread_i.join()
In the tutorial, I don't see the code actually invoking the function eval(), but rather just referencing its name, eval. I'm not sure if this will be problematic for my use case.
My first question is: Is this the right approach? Should I consider a different approach to threading?
I don't believe that I will need to account for race conditions or locks because, while every thread will update the same dictionary, the keys and values will be independent.
And my last question: Is multiprocessing generally a better bet? It seems that it's a bit higher level, which might be ideal for someone new to parallelization.
In Python, threading is constrained by the GIL, so it is very limited performance-wise. For IO-bound code (reading/writing files, network requests, ...), async is the way to go.
But from what you explain, your code is rather CPU-bound (computing many things). If you want your code to go fast, you need to circumvent the Python GIL. There are two main ways:
multiprocessing (having multiple different Python processes in parallel)
or calling code written in lower-level languages (Cython, C, ...), typically wrapped in a nice library
If you want something simple, stick to multiprocessing: at the start, create a pool whose size is the number of competing genes (N), then at each iteration submit N new tasks to the pool and wait for their results (the pool.map function), repeating as many times as you want.
I think it is the simplest way to get full parallelization, and it should give you decent speed.
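A minimal sketch of that idea, with a toy eval_fitness standing in for the real evaluation and the top-x selection done after the pool returns:

import multiprocessing

def eval_fitness(gene):
    # hypothetical CPU-bound fitness evaluation
    return sum(x * x for x in gene)

if __name__ == "__main__":
    gene_pool = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    with multiprocessing.Pool(len(gene_pool)) as pool:
        fitnesses = pool.map(eval_fitness, gene_pool)   # one result per gene, in order
    ranked = sorted(zip(gene_pool, fitnesses), key=lambda pair: pair[1], reverse=True)
    top_x = ranked[:2]   # keep the best x (here 2) for the next generation

Because pool.map returns results in the same order as its input, there is no shared dictionary to update and therefore nothing to lock.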

How to run generator code in parallel?

I have code like this:
def generator():
    while True:
        # do slow calculation
        yield x
I would like to move the slow calculation to separate process(es).
I'm working in python 3.6 so I have concurrent.futures.ProcessPoolExecutor. It's just not obvious how to concurrent-ize a generator using that.
The difference from a regular concurrent scenario using map is that there is nothing to map here (the generator runs forever), and we don't want all the results at once; we want to queue them up and wait until the queue is not full before calculating more results.
I don't have to use concurrent, multiprocessing is fine also. It's a similar problem, it's not obvious how to use that inside a generator.
Slight twist: each value returned by the generator is a large numpy array (10 megabytes or so). How do I transfer that without pickling and unpickling? I've seen the docs for multiprocessing.Array but it's not totally obvious how to transfer a numpy array using that.
In this type of situation I usually use the joblib library. It is a parallel computation framework based on multiprocessing. It supports memmapping precisely for the cases where you have to handle large numpy arrays. I believe it is worth checking out.
Maybe joblib's documentation is not explicit enough on this point (it only shows examples with for loops), but since you want to use a generator I should point out that it does work with generators. An example that would achieve what you want is the following:
from joblib import Parallel, delayed

def my_long_running_job(x):
    # do something with x
    pass

# you can customize the number of jobs
Parallel(n_jobs=4)(delayed(my_long_running_job)(x) for x in generator())
Edit: I don't know what kind of processing you want to do, but if it releases the GIL you could also consider using threads. This way you won't have the problem of having to transfer large numpy arrays between processes, and you still benefit from true parallelism.
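For example, joblib's threading backend keeps everything in one process, so the large arrays are shared by reference rather than pickled (this reuses my_long_running_job and generator from above):

from joblib import Parallel, delayed

# backend="threading" uses threads instead of worker processes, which only
# helps if the work inside my_long_running_job releases the GIL
results = Parallel(n_jobs=4, backend="threading")(
    delayed(my_long_running_job)(x) for x in generator()
)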

Can I separate Python set update into several threads?

I'm doing some brute-force computing and putting the results into a set all_data. Computing chunks of data gives a list of numbers new_data, which I want to add to the big set: all_data.update(new_data). Now while the computational part is easily made parallel by means of multiprocessing.Pool.map, the update part is slow.
Obviously, there is a problem if there are two identical elements in new_data that are absent from all_data and one tries to add them at the same moment. But if we assume new_data to be a set as well, is there still a problem? The only problem I can see is the way sets are organized in memory, so the question is:
Is there a way to organize a set structure that allows simultaneous addition of elements? If yes, is it realized in Python?
In pure Python, no. Due to the GIL, only one thread at a time can execute Python code (including manipulation of built-in structures). The very existence of the GIL is justified by the fact that it eliminates the need to lock access to those structures.
Even the result fetching in multiprocessing.Pool.map is already sequential.
The ParallelProgramming article in the SciPy wiki outlines the options for parallel code, but I didn't find anything there directly regarding off-the-shelf concurrent data structures.
It does, however, mention a few C extensions under "Sophisticated parallelization" which do support parallel I/O.
Note that a set is actually a hash table which by its very nature cannot be updated in parallel (even with fine-grained locking, sub-operations of each type (look-up, insertion incl. collision resolution) have to be sequenced). So you need to either
replace it with some other data organization that can be parallelized better, and/or
speed up the update operations, e.g. by using shared memory
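One hedged way to combine both points without a concurrent set: have each worker deduplicate its own chunk into an ordinary set and only merge in the parent, so the sequential part is a cheap union of already-deduplicated pieces. compute_chunk here is a stand-in for the real brute-force step.

import multiprocessing

def compute_chunk(chunk):
    # hypothetical brute-force step; deduplicate locally before sending back
    return {n * n % 1000 for n in chunk}

if __name__ == "__main__":
    chunks = [range(i, i + 10000) for i in range(0, 100000, 10000)]
    all_data = set()
    with multiprocessing.Pool() as pool:
        for partial in pool.imap_unordered(compute_chunk, chunks):
            all_data.update(partial)   # sequential, but each update is small and pre-deduplicated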

Iterating over a large data set in long running Python process - memory issues?

I am working on a long-running Python program (part of it is a Flask API, and the other part a realtime data fetcher).
Both my long-running processes iterate, quite often (the API one might even do so hundreds of times a second), over large data sets (second-by-second observations of certain economic series, for example 1-5 MB worth of data or even more). They also interpolate, compare and do calculations between series, etc.
What techniques, for the sake of keeping my processes alive, can I practice when iterating / passing as parameters / processing these large data sets? For instance, should I use the gc module and collect manually?
UPDATE
I am originally a C/C++ developer and would have NO problem (and would even enjoy) writing parts in C++. I simply have 0 experience doing so. How do I get started?
Any advice would be appreciated.
Thanks!
Working with large datasets isn't necessarily going to cause memory complications. As long as you use sound approaches when you view and manipulate your data, you can typically make frugal use of memory.
There are two concepts you need to consider as you're building the models that process your data.
What is the smallest element of your data that you need access to in order to perform a given calculation? For example, you might have a 300GB text file filled with numbers. If you're looking to calculate the average of the numbers, read one number at a time to calculate a running average. In this example, the smallest element is a single number in the file, since that is the only element of our data set that we need to consider at any point in time.
How can you model your application such that you access these elements iteratively, one at a time, during that calculation? In our example, instead of reading the entire file at once, we'll read one number from the file at a time. With this approach, we use a tiny amount of memory but can process an arbitrarily large data set. Instead of passing a reference to your dataset around in memory, pass a view of your dataset, which knows how to load specific elements from it on demand (and which can be freed once worked with). This is similar in principle to buffering and is the approach many iterators take (e.g., xrange, open's file object, etc.).
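A tiny sketch of that running-average idea, reading one number per line so memory use stays constant no matter how large the file is:

def running_average(path):
    total, count = 0.0, 0
    with open(path) as f:        # the file object is itself a lazy iterator
        for line in f:           # one line in memory at a time
            total += float(line)
            count += 1
    return total / count if count else 0.0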
In general, the trick is understanding how to break your problem down into tiny, constant-sized pieces, and then stitching those pieces together one by one to calculate a result. You'll find these tenets of data processing go hand-in-hand with building applications that support massive parallelism, as well.
Looking towards gc is jumping the gun. You've provided only a high-level description of what you are working on, but from what you've said, there is no reason you need to complicate things by poking around in memory management yet. Depending on the type of analytics you are doing, consider investigating numpy which aims to lighten the burden of heavy statistical analysis.
It's hard to say without a real look at your data and algorithms, but the following approaches seem to be universal:
Make sure you have no memory leaks, otherwise they will kill your program sooner or later. Use objgraph for this - a great tool! Read its docs - they contain good examples of the types of memory leaks you can face in a Python program.
Avoid copying data whenever possible. For example, if you need to work with part of a string or do string transformations, don't create temporary substrings - use indexes and stay read-only as long as possible (see the sketch after this list). It can make your code more complex and less "pythonic", but that is the cost of optimization.
Use gc carefully - it can make your process unresponsive for a while and at the same time add no value. Read the docs. Briefly: you should use gc directly only when there is a real reason to do so, like the Python interpreter being unable to free memory after allocating a big temporary list of integers.
Seriously consider rewriting critical parts in C++. Start thinking about this unpleasant idea now so you are ready to do it when your data gets bigger. Seriously, it usually ends this way. You can also give Cython a try; it could speed up the iteration itself.
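A small sketch of the indexes-instead-of-substrings idea mentioned above (the function name is illustrative): instead of slicing a huge string into temporary line copies, yield (start, end) index pairs and let the caller slice out only the few pieces it actually needs.

def iter_lines_by_index(big_string):
    # yields (start, end) index pairs; no substring copies are created here
    start = 0
    while True:
        end = big_string.find("\n", start)
        if end == -1:
            yield start, len(big_string)
            return
        yield start, end
        start = end + 1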

Delayed execution in python for big data

I'm trying to think about how a Python API might look for large datastores like Cassandra. R, Matlab, and NumPy tend to use the "everything is a matrix" formulation and execute each operation separately. This model has proven itself effective for data that can fit in memory. However, one of the benefits of SAS for big data is that it executes row by row, doing all the row calculations before moving to the next. For a datastore like Cassandra, this model seems like a huge win -- we only loop through data once.
In Python, SAS's approach might look something like:
with load('datastore') as data:
    for row in rows(data):
        row.logincome = log(row.income)
        row.rich = "Rich" if row.income > 100000 else "Poor"
This is (too?) explicit but has the advantage of only looping once. For smaller datasets, performance will be very poor compared to NumPy because the functions aren't vectorized using compiled code. In R/Numpy we would have the much more concise and compiled:
data.logincome = log(data.income)
data.rich = ifelse(data.income > 100000, "Rich", "Poor")
This will execute extremely quickly because log and ifelse are both compiled functions that operate on vectors. A downside, however, is that we will loop twice. For small datasets this doesn't matter, but for a Cassandra-backed datastore, I don't see how this approach works.
Question: Is there a way to keep the second API (like R/Numpy/Matlab) but delay computation? Perhaps by calling a sync(data) function at the end?
Alternative ideas? It would be nice to maintain the NumPy type syntax since users will be using NumPy for smaller operations and will have an intuitive grasp of how that works.
I don't know anything about Cassandra/NumPy, but if you adapt your second approach (using NumPy) to process data in chunks of a reasonable size, you might benefit from the CPU and/or filesystem cache and therefore prevent any slowdown caused by looping over the data twice, without giving up the benefit of using optimized processing functions.
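A hedged sketch of that chunked idea in plain NumPy; the chunk size is arbitrary, and read_income_in_chunks / write_back are hypothetical helpers for whatever the datastore actually is.

import numpy as np

CHUNK_ROWS = 100000   # tuned so a chunk fits comfortably in memory/cache

def process_chunk(income_chunk):
    logincome = np.log(income_chunk)
    rich = np.where(income_chunk > 100000, "Rich", "Poor")
    return logincome, rich

# a single pass over the datastore, but still vectorized within each chunk
for income_chunk in read_income_in_chunks('datastore', CHUNK_ROWS):
    logincome, rich = process_chunk(income_chunk)
    write_back('datastore', logincome, rich)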
I don't have a perfect answer, just a rough idea, but maybe it is worthwhile. It centers around Python generators, in sort of a producer-consumer style combination.
For one, as you don't want to loop twice, I think there is no way around an explicit loop for the rows, like this:
for row in rows(data):
    # do stuff with row
Now, feed the row to (an arbitrary number of) consumers that are - don't choke - generators again. But you would be using the send method of the generator. As an example of such a consumer, here is a sketch of riches:
def riches():
    rich_data = []
    while True:
        row = (yield)
        if row is None:
            break
        rich_data.append("Rich" if row.income > 100000 else "Poor")
    yield rich_data
The first yield (expression) is just there to feed the individual rows into riches. It does its thing, here building up a result array. After the while loop, the second yield (statement) is used to actually provide the result data to the caller.
Going back to the caller loop, it could look something like this:
richConsumer = riches()
next(richConsumer)                   # advance to the first yield
for row in rows(data):
    richConsumer.send(row)
    # other consumers.send(row) here
data.rich = richConsumer.send(None)  # make the consumer exit its loop and yield the result data
I haven't tested that code, but that's how I think about it. It doesn't have the nice compact syntax of the vector-based functions. But it makes the main loop very simple and encapsulates all processing in separate consumers. Additional consumers can be nicely stacked after each other. And the API could be further polished by pushing generator managing code behind e.g. object boundaries. HTH
