The following section of my code is taking ages to run (it's the only loop in the function, so it's the most likely culprit):
tree = KDTree(x_rest)
for i in range(len(x_lost)):
    _, idx = tree.query([x_lost[i]], k=int(np.sqrt(len(x_rest))), p=1)
    y_lost[i] = mode(y_rest[idx][0])[0][0]
Is there a way to speed this up? I have a few suggestions from Stack Overflow:
This answer suggests using Cython. I'm not particularly familiar with it, but I'm not very against it either.
This answer uses a multiprocessing Pool. I'm not sure how useful this will be: my current execution takes over 12h to run, and I'm hoping for at least a 5-10x speedup (though that may not be possible).
Here are a few notes about how you could speed this up:
This code loops over x_lost and calls tree.query() with one point from x_lost at a time. However, query() supports querying multiple points at once. The loop inside query() is implemented in Cython, so I would expect it to be much faster than a loop written in Python. If you pass in the whole array, it will return an array of matches.
The query() function supports a parameter called workers, which, if set to a value larger than one, runs your query in parallel. Since workers is implemented using threads, it will likely be faster than a solution using multiprocessing.Pool, since it avoids pickling. See the documentation.
The code above doesn't define the mode() function, but I'm assuming
it's scipy.stats.mode(). If that's the case, rather than calling mode() repeatedly, you can use the axis argument, which would let you take the mode of nearby points for multiple queries at once.
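Putting the three suggestions together, a hedged sketch of the batched version might look like this. The arrays here are small synthetic stand-ins for your x_rest / y_rest / x_lost; the workers parameter needs SciPy >= 1.6, and keepdims needs SciPy >= 1.9:

```python
import numpy as np
from scipy.spatial import KDTree
from scipy.stats import mode

# Small synthetic stand-ins for x_rest / y_rest / x_lost.
rng = np.random.default_rng(0)
x_rest = rng.normal(size=(200, 3))
y_rest = rng.integers(0, 4, size=200)
x_lost = rng.normal(size=(50, 3))

tree = KDTree(x_rest)
k = int(np.sqrt(len(x_rest)))

# One batched query for all the lost points; workers=-1 uses every core.
# idx has shape (len(x_lost), k).
_, idx = tree.query(x_lost, k=k, p=1, workers=-1)

# One vectorized mode over the neighbour labels, row by row.
y_lost = mode(y_rest[idx], axis=1, keepdims=False).mode
```

This replaces both the Python-level loop over queries and the repeated mode() calls with single vectorized calls.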
Related
I'm curious whether threading is the right approach for my use case. I'm working on a genetic algorithm, which needs to evaluate the fitness of genes 1...n. The evaluation of each gene is independent of the others for the most part, yet each gene is passed through the same function, eval(gene).
My intention is that once all genes have been evaluated, I will sort by fitness, and only retain top x.
From this tutorial, it seems that I should be able to do the following, where the specifics of eval are out of scope for this question, but suppose each function call updates a common dictionary of form, {gene : fitness}:
threads = []
for i, gene in enumerate(gene_pool):
    thread_i = threading.Thread(target=eval(gene), name=f"fitness gene{i}")
    thread_i.start()
    threads.append(thread_i)
for thread_i in threads:
    thread_i.join()
In the tutorial, I don't see the code actually invoking the function eval(), but rather just referencing its name, eval. I'm not sure if this will be problematic for my use case.
My first question is: Is this the right approach? Should I consider a different approach to threading?
I don't believe that I will need to account for race conditions or locks because, while every thread will update the same dictionary, the keys and values will be independent.
And my last question: Is multiprocessing generally a better bet? It seems that it's a bit higher-level, which might be ideal for someone new to parallelization.
In Python, threading is constrained by the GIL, so it is very limited performance-wise. For IO-bound code (reading/writing files, network requests, ...), async is the way to go.
But from what you describe, your code is rather CPU-bound (computing many things). If you want your code to go fast, you need to circumvent the Python GIL. There are two main ways:
multiprocessing (having multiple different Python processes in parallel)
or calling code written in lower-level languages (Cython, C, ...), typically wrapped in a nice library
If you want something simple, stick with multiprocessing: at the start, create a pool whose size is the number of competing genes (N); then at each iteration, submit N new tasks to the pool and wait for their results (the pool.map function); repeat as many times as you want.
I think it is the simplest way to get a full parallelization, which should give you decent speed.
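As a rough sketch of that approach (eval_fitness here is a made-up stand-in for your real evaluation function, and the fork context is an assumption to keep the snippet runnable without an import guard; it is Unix-only, and on Windows you'd use the default context under `if __name__ == "__main__":`):

```python
from multiprocessing import get_context

def eval_fitness(gene):
    # Stand-in for your real eval(): must live at module level so the
    # pool can pickle a reference to it.
    return sum(gene)

def top_x(gene_pool, x, workers=4):
    with get_context("fork").Pool(workers) as pool:
        # pool.map returns the fitnesses in the same order as gene_pool,
        # so no shared dictionary is needed (processes don't share memory).
        fitnesses = pool.map(eval_fitness, gene_pool)
    ranked = sorted(zip(gene_pool, fitnesses), key=lambda gf: gf[1], reverse=True)
    return [gene for gene, _ in ranked[:x]]
```

Note that a shared {gene: fitness} dictionary does not carry over between processes the way it does between threads, which is why the results come back as return values here.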
Newbie programmer here. I just started learning some functional programming, and I was wondering what's going on behind the scenes in the various scenarios of reduce(), a for loop, and built-in functions. When I timed each of these, I noticed that using reduce() took the longest, the for loop inside a function took the second longest, and the built-in function max() took the shortest. Can somebody explain what's going on behind the scenes that causes these speed differences?
I defined the for loop as:
def f(iterable):
    it = iter(iterable)  # works for both iterators and containers
    j = next(it)
    for i in it:
        if i > j:
            j = i
    return j
and then compared it with
max(iterable)
and
reduce(lambda x, y: x if x>y else y, iterable)
and noticed, as stated previously, that using reduce() took the longest, the for loop inside the function took the second longest, and using a built in function max() took the shortest.
Python is an interpreted language. (At least, it's partly interpreted. Technically source code is compiled into byte code which is then interpreted.) Code running in an interpreter is almost always going to be a lot slower than native code running on the raw hardware of your machine.
But a lot of the builtin functions and objects of Python are not written in the Python language itself. A function like max is implemented in C, so it can be pretty fast. It can be a lot faster than pure Python code that the interpreter needs to work through.
Furthermore, some parts of pure Python code are faster than other parts. Function calls are notoriously slower than most other bits of code, so doing a lot of function calls is generally to be avoided if possible in performance-sensitive sections of your code.
So lets examine your three examples again with these performance thoughts in mind. The max function is implemented in C, so it's fastest. The pure-Python function is slower because its loop and comparisons all need to be interpreted, and while it contains several function calls, most of them are to builtin functions (like next which in turn calls __next__ method of your iterator, both of which are likely builtins). The slowest example is the one using reduce, which, though it is builtin itself, keeps calling back out to the lambda function you gave it as an argument. The repeated function calls to the relatively slow lambda function are what make it the slowest of your three examples.
Note that none of these speed differences change the asymptotic performance of your code. All three of your examples are O(N), where N is the number of items in the iterable. And often asymptotic performance is a lot more important than raw per-item speed if you need your code to scale up to a larger problem. If you were instead comparing an exponentially scaling algorithm with an alternative that was linear (or even polynomial), you'd see vastly different performance numbers once the input size got large enough. Of course, it's also possible that you won't care about scalability, if you only need the code to work once for a relatively modest data set. But in that case, the performance differences between builtin functions and lambdas probably don't matter all that much either.
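You can see both the equivalence and the speed gap with a quick timeit comparison (the exact numbers will vary by machine and Python version, but the ordering should be stable):

```python
from functools import reduce
import timeit

def f(iterable):
    it = iter(iterable)
    j = next(it)
    for i in it:
        if i > j:
            j = i
    return j

data = list(range(100_000))

# All three compute the same maximum...
assert f(data) == max(data) == reduce(lambda x, y: x if x > y else y, data)

# ...but at very different speeds: max() loops in C, f() is interpreted
# bytecode, and reduce() additionally pays a Python-level call to the
# lambda for every element.
for label, fn in [("max   ", lambda: max(data)),
                  ("loop  ", lambda: f(data)),
                  ("reduce", lambda: reduce(lambda x, y: x if x > y else y, data))]:
    print(label, timeit.timeit(fn, number=20))
```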
while True:
    Number = len(SomeList)
    OtherList = array([None] * Number)
    for i in xrange(Number):
        OtherList[i] = (NumPy array calculation using only the i-th elements of Array_1, Array_2, and Array_3)
The 'Number' elements of OtherList and the other arrays can be calculated separately. However, as the program is time-dependent, we cannot proceed to the next job until all 'Number' elements have been processed.
Would multiprocessing be a good solution for this operation? I need to speed this process up as much as possible. If so, please suggest example code.
It is possible to use numpy arrays with multiprocessing but you shouldn't do it yet.
Read A beginners guide to using Python for performance computing and its Cython version: Speeding up Python (NumPy, Cython, and Weave).
Without knowing the specific calculations or the sizes of the arrays, here are generic guidelines, in no particular order:
measure the performance of your code. Find the hot-spots. Your code might spend longer loading the input data than on all the calculations. Set your goal, and define what trade-offs are acceptable
check with automated tests that you get the expected results
check whether you could use optimized libraries to solve your problem
make sure the algorithm has adequate time complexity. An O(n) algorithm in pure Python can be faster than an O(n**2) algorithm in C for large n
use slicing and vectorized (automatic looping) calculations to replace the explicit loops of the Python-only solution
rewrite the places that need optimization using weave, f2py, cython, or similar. Provide type information. Explore compiler options. Decide whether the speedup is worth keeping C extensions
minimize allocation and data copying. Make it cache-friendly
explore whether multiple threads might be useful in your case, e.g., cython.parallel.prange(). Release the GIL
compare with the multiprocessing approach. The link above contains an example of how to compute different slices of an array in parallel
Iterate
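To illustrate the vectorization guideline with the question's loop: the actual per-element calculation isn't shown, so this sketch uses a made-up one (Array_1[i] * Array_2[i] + Array_3[i]) purely for demonstration:

```python
import numpy as np

# Made-up input arrays standing in for the question's Array_1..Array_3.
Array_1 = np.arange(5.0)
Array_2 = 2 * np.arange(5.0)
Array_3 = np.ones(5)

# Explicit Python loop, as in the question:
Number = len(Array_1)
OtherList = np.empty(Number)
for i in range(Number):
    OtherList[i] = Array_1[i] * Array_2[i] + Array_3[i]

# Vectorized version: one expression, and the looping happens in C.
vectorized = Array_1 * Array_2 + Array_3

assert np.array_equal(OtherList, vectorized)
```

If the real calculation is elementwise like this, the vectorized form is often fast enough that multiprocessing becomes unnecessary.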
Since you have a while True clause there, I will assume you will run a lot of iterations, so the potential gains will eventually outweigh the slowdown from spawning the multiprocessing pool. I will also assume you have more than one logical core on your machine, for obvious reasons. Then the question becomes whether the cost of serializing the inputs and de-serializing the results is offset by the gains.
Best way to know if there is anything to be gained, in my experience, is to try it out. I would suggest that:
You pass any constant inputs at start time. Thus, if any of Array_1, Array_2, and Array_3 never changes, pass it as args when calling Process(). This way you reduce the amount of data that needs to be pickled and passed on via IPC (which is what multiprocessing does)
You use a work queue and add to it tasks as soon as they are available. This way, you can make sure there is always more work waiting when a process is done with a task.
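A rough sketch of both ideas using a Pool: the per-element calculation is made up, initializer/initargs plays the role of passing the constant arrays once per worker, and imap_unordered acts as the work queue. The fork context is an assumption to keep the sketch runnable without an import guard (Unix-only):

```python
import numpy as np
from multiprocessing import get_context

# Constant arrays are shipped once per worker via the initializer instead
# of being pickled with every task (names mirror the question's Array_1..3).
_consts = {}

def _init(a1, a2, a3):
    _consts["a1"], _consts["a2"], _consts["a3"] = a1, a2, a3

def _task(i):
    # Made-up per-element calculation standing in for the real one.
    return i, _consts["a1"][i] * _consts["a2"][i] + _consts["a3"][i]

def compute(a1, a2, a3, workers=4):
    out = np.empty(len(a1))
    with get_context("fork").Pool(workers, initializer=_init,
                                  initargs=(a1, a2, a3)) as pool:
        # imap_unordered behaves like a work queue: an idle worker pulls
        # the next index as soon as it finishes its current task.
        for i, value in pool.imap_unordered(_task, range(len(a1))):
            out[i] = value
    return out
```

Whether this beats the plain loop depends on how heavy each task is relative to the pickling of its result, which is exactly what you'd measure when trying it out.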
I'm trying to think about how a Python API might look for large datastores like Cassandra. R, Matlab, and NumPy tend to use the "everything is a matrix" formulation and execute each operation separately. This model has proven itself effective for data that can fit in memory. However, one of the benefits of SAS for big data is that it executes row by row, doing all the row calculations before moving to the next. For a datastore like Cassandra, this model seems like a huge win -- we only loop through data once.
In Python, SAS's approach might look something like:
with load('datastore') as data:
    for row in rows(data):
        row.logincome = log(row.income)
        row.rich = "Rich" if row.income > 100000 else "Poor"
This is (too?) explicit but has the advantage of only looping once. For smaller datasets, performance will be very poor compared to NumPy because the functions aren't vectorized using compiled code. In R/Numpy we would have the much more concise and compiled:
data.logincome = log(data.income)
data.rich = ifelse(data.income > 100000, "Rich", "Poor")
This will execute extremely quickly because log and ifelse are both compiled functions that operate on vectors. A downside, however, is that we will loop twice. For small datasets this doesn't matter, but for a Cassandra-backed datastore, I don't see how this approach works.
Question: Is there a way to keep the second API (like R/Numpy/Matlab) but delay computation. Perhaps by calling a sync(data) function at the end?
Alternative ideas? It would be nice to maintain the NumPy type syntax since users will be using NumPy for smaller operations and will have an intuitive grasp of how that works.
I don't know anything about Cassandra/NumPy, but if you adapt your second approach (using NumPy) to process data in chunks of a reasonable size, you might benefit from the CPU and/or filesystem cache and therefore prevent any slowdown caused by looping over the data twice, without giving up the benefit of using optimized processing functions.
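A minimal sketch of that chunked middle ground, reusing the income example from the question (process_in_chunks and chunk_size are hypothetical names, and chunk_size is a knob you'd have to measure):

```python
import numpy as np

def process_in_chunks(income, chunk_size=10_000):
    # Hypothetical chunked version of the question's example: each chunk is
    # small enough to stay cache-resident, yet every operation still uses
    # NumPy's compiled loops.
    logincome = np.empty_like(income, dtype=float)
    rich = np.empty(len(income), dtype="<U4")
    for start in range(0, len(income), chunk_size):
        stop = start + chunk_size
        chunk = income[start:stop]
        logincome[start:stop] = np.log(chunk)
        rich[start:stop] = np.where(chunk > 100_000, "Rich", "Poor")
    return logincome, rich
```

The data is still read only once per chunk, but each derived column is computed by a vectorized call rather than a Python-level row loop.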
I don't have a perfect answer, just a rough idea, but maybe it is worthwhile. It centers around Python generators, in sort of a producer-consumer style combination.
For one, as you don't want to loop twice, I think there is no way around an explicit loop for the rows, like this:
for row in rows(data):
    # do stuff with row
Now, feed the row to (an arbitrary number of) consumers that are - don't choke - generators again. But you would be using the send method of the generator. As an example for such a consumer, here is a sketch of riches:
def riches():
    rich_data = []
    while True:
        row = (yield)
        if row is None:
            break
        rich_data.append("Rich" if row.income > 100000 else "Poor")
    yield rich_data
The first yield (expression) is just to fuel the individual rows into riches. It does its thing, here building up a result array. After the while loop, the second yield (statement) is used to actually provide the result data to the caller.
Going back to the caller loop, it could look something like this:
richConsumer = riches()
next(richConsumer)                    # advance to the first yield
for row in rows(data):
    richConsumer.send(row)
    # other consumers' send(row) calls here
data.rich = richConsumer.send(None)   # exit the inner loop and collect the result
I haven't tested that code, but that's how I think about it. It doesn't have the nice compact syntax of the vector-based functions. But it makes the main loop very simple and encapsulates all processing in separate consumers. Additional consumers can be nicely stacked after each other. And the API could be further polished by pushing generator managing code behind e.g. object boundaries. HTH
I'm working in the Google App Engine environment and programming in Python. I am creating a function that essentially generates a random number/letter string and then stores to the memcache.
def generate_random_string():
    # return a random 6-character string of numbers and letters
    ...

def check_and_store_to_memcache():
    randomstring = generate_random_string()
    # check against memcache
    # if ok, store the key with another value
    # if not ok, run generate_random_string() again and check again
    ...
Does creating two functions instead of just one big one affect performance? I prefer two, as it better matches how I think, but don't mind combining them if that's "best practice".
Focus on being able to read and easily understand your code.
Once you've done this, if you have a performance problem, then look into what might be causing it.
Most languages, python included, tend to have fairly low overhead for making method calls. Putting this code into a single function is not going to (dramatically) change the performance metrics - I'd guess that your random number generation will probably be the bulk of the time, not having 2 functions.
That being said, splitting functions does have a (very, very minor) impact on performance. However, I'd think of it this way: it may take you from going 80 mph on the highway to 79.99 mph (which you'll never really notice). The important things to watch for are avoiding stoplights and traffic jams, since they're going to make you stop altogether...
In almost all cases, "inlining" functions to increase speed is like getting a hair cut to lose weight.
Reed is right. For the change you're considering, the cost of a function call is a small number of cycles, and you'd have to be doing it 10^8 or so times per second before you'd notice.
However, I would caution that often people go to the other extreme, and then it is as if function calls were costly. I've seen this in over-designed systems where there were many layers of abstraction.
What happens is there is some human psychology that says if something is easy to call, then it is fast. This leads to writing more function calls than strictly necessary, and when this occurs over multiple layers of abstraction, the wastage can be exponential.
Following Reed's driving example, a function call can be like a detour, and if the detour contains detours, and if those also contain detours, soon there is tremendous time being wasted, for no obvious reason, because each function call looks innocent.
Like others have said, I wouldn't worry about it in this particular scenario. The very small overhead involved in function calls would pale in comparison to what is done inside each function. And as long as these functions don't get called in rapid succession, it probably wouldn't matter much anyway.
It is a good question though. In some cases it's best not to break code into multiple functions. For example, when working with math intensive tasks with nested loops it's best to make as few function calls as possible in the inner loop. That's because the simple math operations themselves are very cheap, and next to that the function-call-overhead can cause a noticeable performance penalty.
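A quick way to see this effect is to time the same inner-loop arithmetic with and without a function call (sq, with_calls, and inlined are made-up names for the demonstration; the exact ratio varies by machine and interpreter):

```python
import timeit

def sq(x):
    return x * x

def with_calls(n):
    total = 0
    for i in range(n):
        total += sq(i)      # function call inside the inner loop
    return total

def inlined(n):
    total = 0
    for i in range(n):
        total += i * i      # same arithmetic, no call overhead
    return total

assert with_calls(1_000) == inlined(1_000)
print("calls :", timeit.timeit(lambda: with_calls(100_000), number=10))
print("inline:", timeit.timeit(lambda: inlined(100_000), number=10))
```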
Years ago I discovered the hypot (hypotenuse) function in the math library I was using in a VC++ app was very slow. It seemed ridiculous to me because it's such a simple set of functionality -- return sqrt(a * a + b * b) -- how hard is that? So I wrote my own and managed to improve performance 16X over. Then I added the "inline" keyword to the function and made it 3X faster than that (about 50X faster at this point). Then I took the code out of the function and put it in my loop itself and saw yet another small performance increase. So... yeah, those are the types of scenarios where you can see a difference.