Delayed execution in Python for big data

I'm trying to think about how a Python API might look for large datastores like Cassandra. R, Matlab, and NumPy tend to use the "everything is a matrix" formulation and execute each operation separately. This model has proven itself effective for data that can fit in memory. However, one of the benefits of SAS for big data is that it executes row by row, doing all the row calculations before moving to the next. For a datastore like Cassandra, this model seems like a huge win -- we only loop through data once.
In Python, SAS's approach might look something like:
with load('datastore') as data:
    for row in rows(data):
        row.logincome = log(row.income)
        row.rich = "Rich" if row.income > 100000 else "Poor"
This is (too?) explicit but has the advantage of only looping once. For smaller datasets, performance will be very poor compared to NumPy because the functions aren't vectorized using compiled code. In R/Numpy we would have the much more concise and compiled:
data.logincome = log(data.income)
data.rich = ifelse(data.income > 100000, "Rich", "Poor")
This will execute extremely quickly because log and ifelse are both compiled functions that operate on vectors. A downside, however, is that we will loop twice. For small datasets this doesn't matter, but for a Cassandra-backed datastore, I don't see how this approach works.
Question: Is there a way to keep the second API (like R/NumPy/Matlab) but delay computation? Perhaps by calling a sync(data) function at the end?
Alternative ideas? It would be nice to maintain the NumPy-style syntax, since users will be using NumPy for smaller operations and will have an intuitive grasp of how it works.
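Something like the following is what I have in mind (LazyFrame, assign, and sync are invented names, just to sketch the idea, and the rows are plain dicts instead of the attribute syntax above): operations are recorded as they are assigned and only evaluated, in a single pass over the rows, when sync() is called.
import math
class LazyFrame:
    def __init__(self, rows):
        self.rows = rows          # any iterable of dict-like rows
        self.pending = {}         # column name -> function(row) -> value
    def assign(self, name, func):
        # record the expression; nothing is computed yet
        self.pending[name] = func
    def sync(self):
        # one loop over the data, applying every pending expression per row
        for row in self.rows:
            for name, func in self.pending.items():
                row[name] = func(row)
        self.pending.clear()
data = LazyFrame([{'income': 50000}, {'income': 250000}])
data.assign('logincome', lambda r: math.log(r['income']))
data.assign('rich', lambda r: 'Rich' if r['income'] > 100000 else 'Poor')
data.sync()   # both new columns computed in a single pass
The attribute-style data.rich = ... syntax could presumably be layered on top with __setattr__ tricks; the dictionary form just keeps the sketch short.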

I don't know anything about Cassandra/NumPy, but if you adapt your second approach (using NumPy) to process the data in chunks of a reasonable size, you might benefit from the CPU and/or filesystem cache and therefore avoid any slowdown caused by looping over the data twice, without giving up the benefit of using optimized, vectorized functions.
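For illustration, a chunked version might look roughly like this (iter_chunks and write_back are hypothetical datastore methods, and the chunk size is just a guess to be tuned):
import numpy as np
CHUNK_SIZE = 100000   # tune so a chunk fits comfortably in memory/cache
def process(datastore):
    for chunk in datastore.iter_chunks(CHUNK_SIZE):        # hypothetical API
        income = np.asarray(chunk['income'], dtype=float)
        chunk['logincome'] = np.log(income)
        chunk['rich'] = np.where(income > 100000, 'Rich', 'Poor')
        datastore.write_back(chunk)                        # hypothetical API
Each chunk is still processed with vectorized NumPy calls, but the data is only streamed through once.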

I don't have a perfect answer, just a rough idea, but maybe it is worthwhile. It centers around Python generators, in sort of a producer-consumer style combination.
For one, as you don't want to loop twice, I think there is no way around an explicit loop for the rows, like this:
for row in rows(data):
    # do stuff with row
Now, feed the row to (an arbitrary number of) consumers that are (don't choke) generators again, this time driven via the generator's send method. As an example of such a consumer, here is a sketch of riches:
def riches():
    rich_data = []
    while True:
        row = (yield)
        if row is None:
            break
        rich_data.append("Rich" if row.income > 100000 else "Poor")
    yield rich_data
The first yield (the expression form) is just there to feed the individual rows into riches. It does its thing, here building up a result list. After the while loop, the second yield is used to hand the result data back to the caller.
Going back to the caller loop, it could look something like this:
richConsumer = riches()
next(richConsumer)                   # advance to the first yield
for row in rows(data):
    richConsumer.send(row)
    # other consumers' send(row) calls go here
data.rich = richConsumer.send(None)  # make the consumer exit its loop and hand back the result
I haven't tested that code, but that's how I think about it. It doesn't have the nice compact syntax of the vector-based functions. But it makes the main loop very simple and encapsulates all processing in separate consumers. Additional consumers can be stacked nicely next to each other. And the API could be further polished by hiding the generator-management code behind, e.g., object boundaries (a rough sketch of that idea follows). HTH
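For example, a tiny wrapper object (the ConsumerPipeline name and its methods are invented for this sketch) could hide the send()/next() bookkeeping:
class ConsumerPipeline:
    def __init__(self, **consumers):
        self.consumers = consumers
        for gen in self.consumers.values():
            next(gen)                 # advance each consumer to its first yield
    def feed(self, row):
        for gen in self.consumers.values():
            gen.send(row)
    def finish(self):
        # sending None makes each consumer exit its loop and yield its result
        return {name: gen.send(None) for name, gen in self.consumers.items()}
pipeline = ConsumerPipeline(rich=riches())
for row in rows(data):
    pipeline.feed(row)
data.rich = pipeline.finish()['rich']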

Related

How can I speed up a loop that queries a kd-tree?

The following section of my code is taking ages to run (it's the only loop in the function, so it's the most likely culprit):
tree = KDTree(x_rest)
for i in range(len(x_lost)):
    _, idx = tree.query([x_lost[i]], k=int(np.sqrt(len(x_rest))), p=1)
    y_lost[i] = mode(y_rest[idx][0])[0][0]
Is there a way to speed this up? I have a few suggestions from Stack Overflow:
This answer suggests using Cython. I'm not particularly familiar with it, but I'm not opposed to it either.
This answer uses a multiprocessing Pool. I'm not sure how useful this will be: my current execution takes over 12h to run, and I'm hoping for at least a 5-10x speedup (though that may not be possible).
Here are a few notes about how you could speed this up:
This code loops over x_lost and calls tree.query() with one point at a time. However, query() supports querying multiple points at once, and the loop inside query() is implemented in Cython, so I would expect it to be much faster than a loop written in Python. If you pass it the whole x_lost array, it will return an array of matches.
The query() function supports a parameter called workers which, if set to a value larger than one, runs your query in parallel. Since workers is implemented using threads, it will likely be faster than a solution using multiprocessing.Pool, since it avoids pickling. See the documentation.
The code above doesn't define the mode() function, but I'm assuming it's scipy.stats.mode(). If that's the case, rather than calling mode() repeatedly, you can use the axis argument, which lets you take the mode of the nearby points for all queries at once. A sketch combining these three notes follows.
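Putting the notes together, it could look roughly like the following (the arrays are random stand-ins for your data, and workers=-1, available in recent SciPy versions, asks query() to use all cores):
import numpy as np
from scipy.spatial import KDTree
from scipy.stats import mode
x_rest = np.random.rand(10000, 3)              # stand-in data
y_rest = np.random.randint(0, 5, size=10000)
x_lost = np.random.rand(500, 3)
k = int(np.sqrt(len(x_rest)))
tree = KDTree(x_rest)
# one vectorized, parallel query instead of a Python loop
_, idx = tree.query(x_lost, k=k, p=1, workers=-1)
# idx has shape (len(x_lost), k); take the mode along the neighbour axis
y_lost = mode(y_rest[idx], axis=1).mode.ravel()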

How to notify a parent thread of job completion in Python

I would like to use the following code to find a specific number of permutations in a list.
def permutations(ls, prefix, result):
    if ls:
        for i in range(len(ls)):
            permutations([*ls[0:i], *ls[i + 1:]], [*prefix, ls[i]], result)
    else:
        result.append(prefix)
    return result
My issue is that I cannot simply include another parameter to count the number of permutations found. I can't do that because each recursive call to permutations() "splits" into a new "version": it's like a fork in the road, where each branch keeps its own copy of the counter and counts the permutations it finds without communicating with the others. In other words, this won't work:
def permutations(ls, prefix, result, count, limit):
    if count > limit:
        return
    if ls:
        for i in range(len(ls)):
            permutations([*ls[0:i], *ls[i + 1:]], [*prefix, ls[i]], result, count, limit)
    else:
        count += 1
        result.append(prefix)
    return result
So what I would like to do instead of including a count parameter in the function signature, is to notify some other part of my program every time a new permutation is found, and keep track of the count that way. This may be a good use of threading, but I would like to do it without parallelization if possible (or at least the simplest parallelized solution possible).
I realize that I would have to spawn a new thread at each call to permutations([*ls[0:i], *ls[i + 1:]], [*prefix, ls[i]], result) in the for loop.
I'm hoping that someone would be so kind as to point me in the right direction or let me know if there is a better way to do this in Python.
If you are not already using threading, then I recommend not using threading here, and also not thinking about the problem in terms of threading.
The reason is that the more simply and directly you are able to tackle a problem, the easier it is to think about.
As a second tip, any time you find yourself iterating through permutations, you probably should find a better approach. The reason is that the number of permutations of length n grows as n!, and depending on what you are doing and your patience, computers top out somewhere between n=10 and n=15. So finding ways to count without actually iterating becomes essential. How to do that, of course, depends on your problem.
But back to the problem as stated. I would personally solve this type of problem in Python using generators. That is, you have code that can produce the next element of the list in a generator, and then elsewhere you can have code that processes it. This allows you to start processing your list right away, and not keep it all in memory.
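A minimal sketch of that idea, turning your function into a generator and letting the caller own the counter, could look like this:
def permutations(ls, prefix=()):
    if ls:
        for i in range(len(ls)):
            yield from permutations(ls[:i] + ls[i + 1:], (*prefix, ls[i]))
    else:
        yield prefix
limit = 10
found = []
for perm in permutations([1, 2, 3, 4]):
    found.append(perm)
    if len(found) >= limit:
        break   # stop early; no counter has to travel down the recursion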
In a language without generators, I would tackle this with closures. That is, you pass in a function (or object) that is called for each value and does whatever it wants with it. That again allows you to separate the iteration logic from the logic of what you want to do with each iteration.
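Python does have generators, but purely for illustration the same callback style can be sketched in Python like this:
def permutations(ls, callback, prefix=()):
    if ls:
        for i in range(len(ls)):
            permutations(ls[:i] + ls[i + 1:], callback, (*prefix, ls[i]))
    else:
        callback(prefix)   # the caller decides what to do with each permutation
found = []
permutations([1, 2, 3], found.append)   # count with len(found), print, etc.
Early stopping would need the callback to raise an exception or check a flag, which is why the generator version above is simpler in Python.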
If you're working with some other form of cooperative multi-tasking, use that instead. So, for example, in JavaScript you would have to figure out how to coordinate using Promises. (Luckily the async/await syntax lets you do that and make it look almost like a generator approach. Note that you may wind up with large parts of the data set in memory at once. How to avoid that is a topic in and of itself.) For another example, in Go you should use channels and goroutines.
I would only go to global variables as a last resort. And if you do, remember that you need enough memory to keep the entire data set that you iterated over in memory at once. This may be a lot of memory!
I prefer all of these over the usual multi-threading approach.

How to run generator code in parallel?

I have code like this:
def generator():
    while True:
        # do slow calculation
        yield x
I would like to move the slow calculation to separate process(es).
I'm working in Python 3.6, so I have concurrent.futures.ProcessPoolExecutor. It's just not obvious how to parallelize a generator using that.
The difference from a regular concurrent scenario using map is that there is nothing to map over here (the generator runs forever), and we don't want all the results at once; we want to queue them up and wait until the queue is not full before calculating more results.
I don't have to use concurrent.futures; multiprocessing is fine too. It's a similar problem: it's not obvious how to use it inside a generator.
Slight twist: each value returned by the generator is a large numpy array (10 megabytes or so). How do I transfer that without pickling and unpickling? I've seen the docs for multiprocessing.Array but it's not totally obvious how to transfer a numpy array using that.
In this type of situation I usually use the joblib library. It is a parallel computation framework based on multiprocessing, and it supports memmapping precisely for the cases where you have to handle large numpy arrays. I believe it is worth checking out for your case.
Joblib's documentation is perhaps not explicit enough on this point (it only shows examples with for loops), but since you want to use a generator I should point out that it does indeed work with generators. An example that would achieve what you want is the following:
from joblib import Parallel, delayed
def my_long_running_job(x):
    ...  # do something with x
# you can customize the number of jobs
Parallel(n_jobs=4)(delayed(my_long_running_job)(x) for x in generator())
Edit: I don't know what kind of processing you want to do, but if it releases the GIL you could also consider using threads. This way you won't have the problem of having to transfer large numpy arrays between processes, and you still benefit from true parallelism.
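For example, assuming my_long_running_job above spends its time in GIL-releasing code (e.g. NumPy operations), joblib can be asked to use its threading backend:
from joblib import Parallel, delayed
results = Parallel(n_jobs=4, prefer="threads")(
    delayed(my_long_running_job)(x) for x in generator()
)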

Perform Heavy Computations on Shared Data in Parallel in Python

A quick question about parallel processing in Python. Let's say I have a big shared data structure and want to apply many functions to it in parallel. These functions only read the data structure but mutate a result object:
def compute_heavy_task(self):
    big_shared_object = self.big_shared_object
    result_refs = self.result_refs
    for ref in result_refs:
        some_expensive_task(ref, big_shared_object)
How do I run these in parallel, say 5 or 10 at a time? Or how about one per processor?
You cannot usefully do this with threads in Python (at least not the CPython implementation you're probably using). The Global Interpreter Lock means that, instead of the near-800% efficiency you'd like out of 8 cores, you only get 90%.
But you can do this with separate processes. There are two options for this built into the standard library: concurrent.futures and multiprocessing. In general, futures is simpler in simple cases and often easier to compose; multiprocessing is more flexible and powerful in general. futures also only comes with Python 3.2 or later, but there's a backport for 2.5-3.1 at PyPI.
One of the cases where you want the flexibility of multiprocessing is when you have a big shared data structure. See Sharing state between processes and the sections directly above, below, and linked from it for details.
If your data structure is really simple, like a giant array of ints, this is pretty simple:
class MyClass(object):
    def __init__(self, giant_iterator_of_ints):
        self.big_shared_object = multiprocessing.Array('i', giant_iterator_of_ints)
    def compute_heavy_task(self):
        lock = multiprocessing.Lock()
        def subtask(my_range):
            return some_expensive_task(self.big_shared_object, lock, my_range)
        pool = multiprocessing.pool.Pool(5)
        my_ranges = split_into_chunks_appropriately(len(self.big_shared_object))
        results = pool.map_async(subtask, my_ranges)
        pool.close()
        pool.join()
        return results.get()
Note that the some_expensive_task function now takes a lock object—it has to make sure to acquire the lock around every access to the shared object (or, more often, every "transaction" made up of one or more accesses). Lock discipline can be tricky, but there's really no way around it if you want to use direct data sharing.
Also note that it takes a my_range. If you just call the same function 5 times on the same object, it'll do the same thing 5 times, which probably isn't very useful. One common way to parallelize things is to give each task a sub-range of the overall data set. (Besides being usually simple to describe, if you're careful with this, with the right kinds of algorithms, you can even avoid a lot of locking this way.)
If you instead want to map a bunch of different functions to the same dataset, you obviously need some collection of functions to work on, rather than just using some_expensive_task repeatedly. You can then, e.g., iterate over these functions calling apply_async on each one. But you can also just turn it around: write a single applier function that takes a function and applies it to the data, then map that applier over the collection of functions (a small sketch of this pattern follows).
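Here is a minimal sketch of that turned-around pattern (the analysis functions are made-up examples, and functools.partial stands in for the closure so that the applier can be pickled):
import multiprocessing
from functools import partial
def total(data):        # example analysis functions
    return sum(data)
def maximum(data):
    return max(data)
def apply_to_data(data, func):
    return func(data)
if __name__ == '__main__':
    data = list(range(1000000))
    analyses = [total, maximum]
    with multiprocessing.Pool(5) as pool:
        # map over the collection of functions, not over the data
        results = pool.map(partial(apply_to_data, data), analyses)
    print(results)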
I've also assumed that your data structure is something you can define with multiprocessing.Array. If not, you're going to have to design the data structure in C style, implement it as a ctypes Array of Structures or vice-versa, and then use the multiprocessing.sharedctypes stuff.
I've also moved the result object into results that just get passed back. If they're also huge and need to be shared, use the same trick to make them sharable.
Before going further with this, you should ask yourself whether you really do need to share the data. Doing things this way, you're going to spend 80% of your debugging, performance-tuning, etc. time adding and removing locks, making them more or less granular, etc. If you can get away with passing immutable data structures around, or work on files, or a database, or almost any other alternative, that 80% can go toward the rest of your code.

Will multiprocessing be a good solution for this operation?

while True:
    Number = len(SomeList)
    OtherList = array([None]*Number)
    for i in xrange(Number):
        OtherList[i] = (Numpy Array Calculation only using i_th element of arrays, Array_1, Array_2, and Array_3.)
The 'Number' elements of OtherList and of the other arrays can each be calculated separately.
However, since the program is time-dependent, we cannot proceed with the next job until all 'Number' elements have been processed.
Will multiprocessing be a good solution for this operation?
I need to speed this process up as much as possible.
If so, please suggest how the code should look.
It is possible to use numpy arrays with multiprocessing but you shouldn't do it yet.
Read A beginner's guide to using Python for performance computing and its Cython version, Speeding up Python (NumPy, Cython, and Weave).
Without knowing the specific calculations or the sizes of the arrays, here are some generic guidelines, in no particular order:
measure the performance of your code and find the hot spots; your code might spend longer loading the input data than on all the calculations. Set your goal and define what trade-offs are acceptable
check with automated tests that you get the expected results
check whether you could use optimized libraries to solve your problem
make sure the algorithm has adequate time complexity: an O(n) algorithm in pure Python can be faster than an O(n**2) algorithm in C for large n
use slicing and vectorized (automatic looping) calculations to replace the explicit loops of the Python-only solution
rewrite the places that need optimization using weave, f2py, cython or similar. Provide type information. Explore compiler options. Decide whether the speedup is worth keeping the C extensions
minimize allocation and data copying. Make it cache-friendly
explore whether multiple threads might be useful in your case, e.g., cython.parallel.prange(). Release the GIL
compare with a multiprocessing approach. The link above contains an example of how to compute different slices of an array in parallel
iterate
Since you have a while True clause there, I will assume you will run a lot of iterations, so the potential gains will eventually outweigh the slowdown from spawning the multiprocessing pool. I will also assume you have more than one logical core on your machine, for obvious reasons. Then the question becomes whether the cost of serializing the inputs and de-serializing the results is offset by the gains.
The best way to know whether there is anything to be gained, in my experience, is to try it out. I would suggest that:
You pass any constant inputs in at start time. Thus, if any of Array_1, Array_2, and Array_3 never changes, pass it as one of the args when calling Process(). This way you reduce the amount of data that needs to be pickled and passed on via IPC (which is what multiprocessing does)
You use a work queue and add tasks to it as soon as they are available. This way, you can make sure there is always more work waiting when a process finishes a task. A rough sketch of this pattern follows.
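A rough sketch of both suggestions combined might look like this (the per-element calculation is just a placeholder, and the constant arrays are passed once as args to Process()):
import multiprocessing as mp
import numpy as np
def worker(tasks, results, Array_1, Array_2, Array_3):
    # constant arrays arrive once, at start time, not with every task
    for i in iter(tasks.get, None):                   # None is the stop sentinel
        value = Array_1[i] * Array_2[i] + Array_3[i]  # placeholder calculation
        results.put((i, value))
if __name__ == '__main__':
    Number = 1000
    Array_1, Array_2, Array_3 = (np.random.rand(Number) for _ in range(3))
    tasks, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(tasks, results, Array_1, Array_2, Array_3))
             for _ in range(mp.cpu_count())]
    for p in procs:
        p.start()
    for i in range(Number):
        tasks.put(i)
    for _ in procs:
        tasks.put(None)                               # one sentinel per worker
    OtherList = np.empty(Number)
    for _ in range(Number):
        i, value = results.get()
        OtherList[i] = value
    for p in procs:
        p.join()
Whether this beats the single-process NumPy loop depends on how heavy the per-element calculation is relative to the queue and pickling overhead, which is why trying it out is the best test.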
