Python inefficient loop despite its simplicity

I've tried to run this little piece of code: it takes random points (here 50k, close to what I have in reality) and, for each randomly selected point, returns its 10 nearest points.
But unfortunately this is really slow, surely because of the loop.
As I'm pretty new to 'code optimization', is there a trick to make this much faster? (Faster at Python scale, I know I'm not coding in C++.)
Here is a reproducible example with a data size close to what I have:
import time
import numpy as np
from numpy import random
from scipy.spatial import distance

# USEFUL FUNCTION
start_time = time.time()

def closest_node(node, nodes):
    nodes = np.asarray(nodes)
    deltas = nodes - node
    dist_2 = np.einsum("ij,ij->i", deltas, deltas)
    ndx = dist_2.argsort()
    return data[ndx[:10]]

# REPRODUCIBLE DATA
mean = np.array([0.0, 0.0, 0.0])
cov = np.array([[1.0, -0.5, 0.8], [-0.5, 1.1, 0.0], [0.8, 0.0, 1.0]])
data = np.random.multivariate_normal(mean, cov, 500000)

# START RUNNING
points = data[np.random.choice(data.shape[0], int(np.round(0.1 * len(data), 0)))]
print(len(points))

for w in points:
    closest_node(w, data)

print("--- %s seconds ---" % (time.time() - start_time))

The time it takes to run argsort on your 500000-element array in every loop iteration is huge. The only improvement I can think of is to use something that can return the smallest 10 elements without fully sorting the whole array.
A fast way to find the largest N elements in an numpy array
So instead of
ndx = dist_2.argsort()
return data[ndx[:10]]
It would be
ndx = np.argpartition(dist_2, 10)[:10]
return data[ndx[:10]]
I only benchmarked on 500 points because it already took quite some time to run on my PC.
N=500
Using argsort: 25.625439167022705 seconds
Using argpartition: 6.637120485305786 seconds

You would probably be best off analyzing the slowest parts with a profiler: How do I find out what parts of my code are inefficient in Python
One thing that stands out at first glance is that you should move as much work as possible outside the loop. If you are going to convert the points via np.asarray(), it might be better to do it once for all points before the loop and use the result in the function, rather than calling np.asarray() on every loop run.
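A minimal sketch of that restructuring, also folding in the argpartition suggestion from the other answer (the helper name and the k parameter are just illustrative):

import numpy as np

def closest_nodes(node, nodes_arr, k=10):
    # nodes_arr is already a NumPy array, converted once outside the loop
    deltas = nodes_arr - node
    dist_2 = np.einsum("ij,ij->i", deltas, deltas)
    ndx = np.argpartition(dist_2, k)[:k]  # indices of the k smallest distances, unsorted
    return nodes_arr[ndx]

# illustrative data; in practice use the mean/covariance from the question
data = np.random.multivariate_normal([0.0, 0.0, 0.0], np.eye(3), 500000)
points = data[np.random.choice(data.shape[0], 50000)]

data_arr = np.asarray(data)  # conversion done once, before the loop
for w in points:
    closest_nodes(w, data_arr)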

Related

Is there a way to make my 1D random walk code more time efficient here?

So my code plots the average distance from equilibrium of a 1D random walk over 1000 steps. My code works, but takes an inordinate amount of time, probably due to the loop inside a loop. Is there a way to make this more efficient, or am I stuck with it? Thanks :)
nsteps = 1000
ndim = 1
numpy.seterr(invalid="ignore")
for i in range(100):
    w = walker(numpy.zeros(1))
    ys = w.doSteps(nsteps)
    avgpos = []
    for i in range(0, len(ys)):
        avgpos.append(sum(ys[:i+1])/i+1)
    plt.plot(range(nsteps+1), avgpos)
The ys are the results from doing n steps. I'm sure the inefficiency is from something within the loop rather than a problem in the earlier code
I'd suggest using the built-in method for doing cumulative sums. I'd also suggest fixing the warnings from NumPy; I think you need some brackets around sum(...)/i+1. Python, like most languages, evaluates this as (sum(...)/i)+1 because division binds more tightly than addition.
A minimal working example would thus be:
import numpy as np
import matplotlib.pyplot as plt

nsteps = 1000
for i in range(100):
    ys = np.cumsum(np.random.standard_normal(nsteps))
    avgpos = []
    for i in range(0, len(ys)):
        avgpos.append(sum(ys[:i+1])/(i+1))  # note brackets
    plt.plot(np.array(avgpos))
which takes my laptop ~8 seconds.
I could instead use the Numpy cumsum method like this:
for i in range(100):
    ys = np.cumsum(np.random.standard_normal(nsteps))
    avgpos = np.cumsum(ys) / (np.arange(nsteps)+1)
    plt.plot(avgpos)
which only takes ~0.1 seconds.
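Going one step further than the version above, the outer loop over the 100 walks can also be removed by generating all walks as one 2D array and taking the cumulative sums along the step axis; a sketch of that idea:

import numpy as np
import matplotlib.pyplot as plt

nsteps = 1000
nwalks = 100

# all walks at once: cumulative sum of the steps along axis 1
ys = np.cumsum(np.random.standard_normal((nwalks, nsteps)), axis=1)
# running average position for every walk, again via cumsum
avgpos = np.cumsum(ys, axis=1) / (np.arange(nsteps) + 1)

plt.plot(avgpos.T)  # one line per walk
plt.show()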

Distance matrix between two point layers

I have two arrays containing point coordinates as shapely.geometry.Point with different sizes.
Eg:
[Point(X Y), Point(X Y)...]
[Point(X Y), Point(X Y)...]
I would like to create a "cross product" of these two arrays with a distance function. The distance function is from shapely.geometry and is a simple geometric vector distance calculation. I am trying to create a distance matrix between M:N points:
Right now I have this function:
source = gpd.read_file(source)
near = gpd.read_file(near)
source_list = source.geometry.values.tolist()
near_list = near.geometry.values.tolist()
array = np.empty((len(source.ID_SOURCE), len(near.ID_NEAR)))
for index_source, item_source in enumerate(source_list):
    for index_near, item_near in enumerate(near_list):
        array[index_source, index_near] = item_source.distance(item_near)
df_matrix = pd.DataFrame(array, index=source.ID_SOURCE, columns=near.ID_NEAR)
Which does the job fine, but is slow. 4000 x 4000 points takes around 100 seconds (I have datasets which are way bigger, so speed is the main issue). I would like to avoid this double loop if possible. I tried to do it in a pandas DataFrame (which has terrible speed):
for index_source, item_source in source.iterrows():
    for index_near, item_near in near.iterrows():
        df_matrix.at[index_source, index_near] = item_source.geometry.distance(item_near.geometry)
A bit faster is (but still 4x slower than numpy):
for index_source, item_source in enumerate(source_list):
    for index_near, item_near in enumerate(near_list):
        df_matrix.at[index_source, index_near] = item_source.distance(item_near)
Is there a faster way to do this? I guess there is, but I have no idea how to proceed. I might be able to chunk the dataframe into smaller pieces, send each chunk to a different core, and concatenate the results - but that is a last resort. If we can do it with numpy only, with some indexing magic, I can send it to the GPU and be done with it in no time. But the double for loop is a no-go right now. I would also like to avoid any library other than Pandas/Numpy. I could use SAGA processing and its Point distances module (http://www.saga-gis.org/saga_tool_doc/2.2.2/shapes_points_3.html), which is pretty damn fast, but I am looking for a Python-only solution.
If you can get the coordinates in separate vectors, I would try this:
import numpy as np
x = np.asarray([5.6, 2.1, 6.9, 3.1]) # Replace with data
y = np.asarray([7.2, 8.3, 0.5, 4.5]) # Replace with data
x_i = x[:, np.newaxis]
x_j = x[np.newaxis, :]
y_i = y[:, np.newaxis]
y_j = y[np.newaxis, :]
d = (x_i-x_j)**2+(y_i-y_j)**2
np.sqrt(d, out=d)
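For two different layers, the same broadcasting idea applies once the coordinates are pulled out of the shapely points; a sketch reusing the question's source_list, near_list, source and near (shapely points expose .x and .y):

import numpy as np
import pandas as pd

# extract coordinates once, outside any loop
sx = np.array([p.x for p in source_list])
sy = np.array([p.y for p in source_list])
nx = np.array([p.x for p in near_list])
ny = np.array([p.y for p in near_list])

# broadcast to an M x N matrix of squared distances, then take the root in place
d = (sx[:, np.newaxis] - nx[np.newaxis, :])**2 + (sy[:, np.newaxis] - ny[np.newaxis, :])**2
np.sqrt(d, out=d)

df_matrix = pd.DataFrame(d, index=source.ID_SOURCE, columns=near.ID_NEAR)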

speed up finite difference model

I have a complex finite difference model which is written in Python using the same general structure as the example code below. It has two for loops: one over the iterations, and then, within each iteration, a loop over each position along the x array. Currently the code takes too long to run (probably due to the for loops). Is there a simple technique to use numpy to remove the second for loop?
Below is a simple example of the general structure I have used.
import numpy as np

def f(x, dt, i):
    xn = (x[i-1]-x[i+1])/dt  # a simple finite difference function
    return xn

x = np.linspace(1,10,10)  # create initial conditions with x[0] and x[-1] boundaries
dt = 10  # time step
iterations = 100  # number of iterations

for j in range(iterations):
    for i in range(1,9):  # length of x minus the boundaries
        x[i] = f(x, dt, i)  # return new value for x[i]
Does anyone have any ideas or comments on how I could make this more efficient?
Thanks,
Robin
For starters, this little change to the structure improves efficiency by roughly 15%. I would not be surprised if this code can be further optimized, but that will most likely be algorithmic, inside the function, i.e. some way to simplify the array element operation. Using a generator might help, too.
import numpy as np
import time

time0 = time.time()

def fd(x, dt, n):  # x is an array, n is the order of central diff
    for i in range(len(x)-(n+1)):
        x[i+1] = (x[i]-x[i+2])/dt  # a simple finite difference function
    return x

x = np.linspace(1, 10, 10)  # create initial conditions with x[0] and x[-1] boundaries
dt = 10  # time step
iterations = 1000000  # number of iterations

for __ in range(iterations):
    x = fd(x, dt, 1)

print(x)
print('time elapsed: ', time.time() - time0)
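To address the question's actual ask of removing the inner loop with numpy: if each iteration is allowed to update all interior points from the previous iteration's values (rather than from already-updated neighbours, as the in-place loop does), the inner loop disappears entirely with slicing. A sketch of that idea, with the caveat that it changes the update order:

import numpy as np

x = np.linspace(1, 10, 10)  # initial conditions, x[0] and x[-1] are boundaries
dt = 10
iterations = 100

for j in range(iterations):
    # the right-hand side is evaluated before assignment, so every interior
    # point is updated from the previous iteration's values in one step
    x[1:-1] = (x[:-2] - x[2:]) / dt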

How can I speed up closest point comparison using cdist or tensorflow?

I have two sets of points, one is a map consisting of x,y coordinates, and the second is a path of x,y coordinates. I'm trying to find the closest map points to my path points, pretty simple. Except my map is 380000 points and my paths (of which I have several) each consist of ~ 350000 points themselves.
Other than sampling my data to get smaller datasets, I'm trying to find a faster way to accomplish this task.
base algorithm:
import pandas as pd
from scipy.spatial.distance import cdist
...
def closest_point(point, points):
    return points[cdist([point], points).argmin()]

# log['point'].shape; 333000
# map_data['point'].shape; 380000
closest = [closest_point(log_p, list(map_data['point'])) for log_p in log['point']]
as per this example: Find closest point in Pandas DataFrames
After converting this to a tqdm progress bar to see how long it would take (as it was taking a while, obviously), I noticed it would take about 10hrs to complete.
tqdm loop:
for i in trange(len(log), desc='finding closest points'):
    closest.append(closest_point(log['point'].loc[i], list(map_data['point'])))
>> finding closest points:   5%|          | 16432/333456 [32:11<10:13:52, 8.60it/s]
While 10 hours is not impossible, I wonder if there is a way to speed this up? I have a solid gpu/cpu/ram at my disposal so I feel this should be doable. I'm also learning tensorflow (but honestly my math is atrocious so I'm very in the dark with it)
Any ideas on how to speed this up with either multi-threading, gpu computation, tensorflow or some other sort of wizardry?
inb4 python is slow ;)
*edit: image shows what i'm trying to do. green is path, blue is map, orange is what I'm trying to find.
The following is a mini example of what you're trying to do. Consider the variable coords1 as your log['point'] and coords2 as your map_data['point']. The end result is the index of the coords2 point closest to each coords1 point.
from scipy.spatial import distance
import numpy as np
coords1 = [(35.0456, -85.2672),
           (35.1174, -89.9711),
           (35.9728, -83.9422),
           (36.1667, -86.7833)]

coords2 = [(35.0456, -85.2672),
           (35.1174, -89.9711),
           (35.9728, -83.9422),
           (34.9728, -83.9422),
           (36.1667, -86.7833)]
tmp = distance.cdist(coords1, coords2, "sqeuclidean") # sqeuclidean based on Mark Setchell comment to improve speed further
result = np.argmin(tmp,1)
# result: array([0, 1, 2, 4])
This should be way faster, because it does everything in a single vectorized call instead of looping.
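Applied to the full datasets in the question, the same argmin idea works, but the full 333k x 380k distance matrix would not fit in memory, so one would process the path in blocks; a sketch with stand-in arrays (path_xy and map_xy are placeholders for the real coordinates):

from scipy.spatial.distance import cdist
import numpy as np

path_xy = np.random.rand(33000, 2)  # stand-in for the coordinates in log['point']
map_xy = np.random.rand(38000, 2)   # stand-in for the coordinates in map_data['point']

chunk = 2000                        # size of each block of path points
closest_idx = np.empty(len(path_xy), dtype=np.int64)
for start in range(0, len(path_xy), chunk):
    block = cdist(path_xy[start:start + chunk], map_xy, "sqeuclidean")
    closest_idx[start:start + chunk] = block.argmin(axis=1)

closest_points = map_xy[closest_idx]  # nearest map point for every path point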
After 3 years, but if anyone is looking at this issue... You may want to try Numba. I get almost a 9x reduction in runtime compared to scipy's distance.cdist on a set of 1.5 million points against a set of 1.5 K path points. Also, as Mark Setchell said, removing the np.sqrt on a big enough set of points could save considerable time.
Results
size: (1459383, 2)
numba: 0.06402060508728027
cdist: 0.5371212959289551
Code
import numba
import numpy as np

# EUCLIDEAN DISTANCE
@numba.njit('(float64[:,::1], float64[::1], float64[::1])', parallel=True, fastmath=True)
def pz_dist(p_array, x_flat, y_flat):
    m = p_array.shape[0]
    n = x_flat.shape[0]
    d = np.empty(shape=(m, n), dtype=np.float64)
    for i in numba.prange(m):
        p1 = p_array[i, :]
        for j in range(n):
            _x = x_flat[j] - p1[0]
            _y = y_flat[j] - p1[1]
            _d = np.sqrt(_x**2 + _y**2)
            d[i, j] = _d
    return d
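A small usage sketch for the function above (the array names and sizes are illustrative): with the path points as an (m, 2) float64 array and the map coordinates as flat float64 arrays, the nearest map point per path point is the row-wise argmin of the returned matrix.

import numpy as np

path_points = np.random.rand(1500, 2)   # illustrative (m, 2) array of path points
map_x = np.random.rand(150000)          # illustrative flat x coordinates of the map
map_y = np.random.rand(150000)          # illustrative flat y coordinates of the map

d = pz_dist(path_points, map_x, map_y)  # (m, n) distance matrix
closest_idx = d.argmin(axis=1)          # index of the nearest map point for each path point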

Python optimization of 3D grid search with nested loops

Imagine we have a 3D grid with discrete points. The ranges for the 3 dimensions (interval endpoints included) are:
in x: [4000, 7000], stepsize 1000
in y: [0.0, 2.0], stepsize 1.0
in z: [-0.75, 0.75], stepsize 0.25
The task below should now be done for all points in the range [4000, 7000, 100], [0.0, 2.0, 0.1], [-0.75, 0.75, 0.05] (roughly 20000 = 31 * 21 * 31 points):
Find the smallest cuboid that contains the point. However, there are holes in the grid (each point should have a "physical" counterpart as a file, but some do not). I tried the following really simple code (where I called the cuboid 'cube'):
def findcubesnew(startvalues, endvalues, resols, \
                 loopvalues, refvalues, modelpath):
    cubearray = []
    startvalue1 = startvalues[0]
    endvalue1 = endvalues[0]
    resol1 = resols[0]
    refvalue1 = refvalues[0]
    loopstop1 = loopvalues[0][0]
    loopstart1 = loopvalues[1][0]
    startvalue2 = startvalues[1]
    endvalue2 = endvalues[1]
    resol2 = resols[1]
    refvalue2 = refvalues[1]
    loopstop2 = loopvalues[0][1]
    loopstart2 = loopvalues[1][1]
    startvalue3 = startvalues[2]
    endvalue3 = endvalues[2]
    resol3 = resols[2]
    refvalue3 = refvalues[2]
    loopstop3 = loopvalues[0][2]
    loopstart3 = loopvalues[1][2]
    refmass = refvalues[3]
    refveloc = refvalues[4]
    for start1 in numpy.arange(startvalue1, loopstop1 + resol1, resol1):
        for end1 in numpy.arange(loopstart1, endvalue1 + resol1, resol1):
            for start2 in numpy.arange(startvalue2, loopstop2 + resol2, resol2):
                for end2 in numpy.arange(loopstart2, endvalue2 + resol2, resol2):
                    for start3 in numpy.arange(startvalue3, loopstop3 + resol3, resol3):
                        for end3 in numpy.arange(loopstart3, endvalue3 + resol3, resol3):
                            # pseudocode: the real code builds the file name
                            # patterns from the values via format strings
                            if glob.glob(modelpath/*start1*start2*start3) and \
                               glob.glob(modelpath/*start1*start2*end3) and \
                               glob.glob(modelpath/*start1*end2*start3) and \
                               glob.glob(modelpath/*start1*end2*end3) and \
                               glob.glob(modelpath/*end1*start2*start3) and \
                               glob.glob(modelpath/*end1*start2*end3) and \
                               glob.glob(modelpath/*end1*end2*start3) and \
                               glob.glob(modelpath/*end1*end2*end3):
                                cubearray.append((start1, end1, start2, end2, start3, end3))
                            else:
                                pass
    return cubearray
foundcubearray = findcubesnew([metalstart, tempstart, loggstart], \
                              [metalend, tempend, loggend], [metalresol, tempresol, loggresol], \
                              looplimitarray, [refmetal, reftemp, reflogg, refmass, refveloc], \
                              modelpath)
if foundcubearray:
    bestcube = findsmallestcubenew(foundcubearray, \
                                   [metalresol, tempresol, loggresol])
    ....
Hence I loop in the x direction from the lower grid border to the largest value below the desired point for which we want the cuboid, and in another loop from the smallest value above the point to the upper grid border. Similarly for the y and z directions, and the loops are nested inside each other. The if part is pseudocode, without the format strings etc., and checks that all files with these values (other quantities may appear in the filename as well) exist, i.e. that all corners of the cuboid are present.
This code also finds points or lines or rectangles if one or multiple coordinates of the point coincide with values in our grid, but it's not a problem (and actually desired).
The bottleneck here is that the search for the cuboids takes quite some time (the smallest cuboid can then be found easily and quickly; if there are multiple with the same (smallest) size, I do not care which one is chosen). I also need to read in the start and end values of the grid, the stepsizes, the reference values (coordinates) of my point and some other variables. Any way to optimize the code? It takes roughly 1.4 sec per point, so ~ 8 hours for the ~ 20000 points, and this is too long.
I know that once I have found the smallest cuboid, e.g. for the point (4500, 0.5, 0.1), I can immediately tell that all other points inside the cube with limits [4000, 5000; 0.0, 1.0; 0, 0.25] have the same smallest cuboid. Still, I'm interested in a solution that optimizes the computation time for all 20000 runs. The application of this is an interpolation routine for stellar models, where 8 grid points forming a cuboid around the interpolation point are required.
P.S.: I hope that the continuation line breaks and indents are correct. In my code they do not exist, although it's not a good style to go beyond 80 chars per line :).
My suggestion is not to use glob. If I'm reading the numbers right, the modelpath directory could contain up to 20,000 files, and there might be 8 globs on these in the inner loop body. I'm surprised it only takes 1.4 seconds per point!
The file names are just being used as booleans, right? All that matters is whether the file exists or not.
I would create a 3D array of booleans with the same dimensions as your 3D grid, initialised to False. Then read through the contents of the directory, converting each filename into a 3D index and set that to True.
Then do your search over the points using the array of booleans, not the file system.
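A minimal sketch of that idea, assuming a hypothetical parse_filename() helper that extracts the (x, y, z) values from a model file name (the actual filename format isn't shown in the question):

import os
import numpy as np

# grid axes as given in the question
xs = np.linspace(4000, 7000, 4)   # stepsize 1000
ys = np.linspace(0.0, 2.0, 3)     # stepsize 1.0
zs = np.linspace(-0.75, 0.75, 7)  # stepsize 0.25

exists = np.zeros((len(xs), len(ys), len(zs)), dtype=bool)

for fname in os.listdir(modelpath):
    x, y, z = parse_filename(fname)      # hypothetical helper, not shown in the question
    ix = int(round((x - xs[0]) / 1000))  # convert coordinates to grid indices
    iy = int(round((y - ys[0]) / 1.0))
    iz = int(round((z - zs[0]) / 0.25))
    exists[ix, iy, iz] = True

# later, each cuboid-corner check is just an array lookup instead of a glob:
# if exists[i1, j1, k1] and exists[i1, j1, k2] and ...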
Hope this helps.
