I have two sets of data: one containing the coordinates of fixed locations, called locations,
and a secondary table of vehicle movements, called movements.
What would be the fastest way to iterate through both tables to find whether any of the movements are within a certain distance of a location, e.g. the Euclidean distance between a point in movements and a point in any of the locations?
Currently I am using a nested loop, which is incredibly slow. Both pandas DataFrames have been converted using
locations_dict=locations.to_dict('records')
movements_dict=movements.to_dict('records')
then iterated via:
for movement in movements_dict:
    visit = 'no visit'
    for location in locations_dict:
        distance = np.sqrt((location['Latitude'] - movement['Lat'])**2 + (location['Longitude'] - movement['Lng'])**2)
        if distance < 0.05:
            visit = location['Location']
            break
        else:
            continue
    movement['distance'] = distance
    movement['visit'] = visit
Is there any way to make this faster? The main issue is that this operation is a Cartesian product, and any inserts will increase the complexity of the operation significantly.
You can export the pandas data directly to numpy, for example like this:
loc_lat = locations['Latitude'].to_numpy()
loc_lon = locations['Longitude'].to_numpy()
mov_lat = movements['Lat'].to_numpy()
mov_lon = movements['Lng'].to_numpy()
From here on there is no need to use loops to obtain results, as you can rely on numpy operating on entire arrays at once. This should give a great speedup over the approach of looping over dictionary values in Python.
Check out the following code example, which shows how to get an array with all pairs from two arrays:
import numpy as np
a = np.array([1,2,3])
b = np.array([4,5])
print( np.transpose([np.tile(a, len(b)), np.repeat(b,len(a))]) )
which prints:
[[1 4]
 [2 4]
 [3 4]
 [1 5]
 [2 5]
 [3 5]]
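For this particular problem you don't even need the explicit pair list: broadcasting builds the full distance matrix directly. A minimal sketch, assuming the column names from the question; note it keeps the distance to the nearest location for every movement, whereas the original loop stopped at the first location within 0.05:
# (n_locations, n_movements) matrix of Euclidean distances via broadcasting
dist = np.sqrt((loc_lat[:, None] - mov_lat[None, :])**2 +
               (loc_lon[:, None] - mov_lon[None, :])**2)
nearest = dist.argmin(axis=0)                       # index of the closest location per movement
min_dist = dist[nearest, np.arange(len(mov_lat))]   # distance to that location
movements['distance'] = min_dist
movements['visit'] = np.where(min_dist < 0.05,
                              locations['Location'].to_numpy()[nearest],
                              'no visit')
Keep in mind this materialises a locations × movements matrix in memory, so very large inputs may need to be processed in chunks.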
I have a 300,000-row pd.DataFrame comprised of multiple columns, one of which is a 50-dimensional numpy array of shape (1, 50), like so:
ID Array1
1 [2.4252 ... 5.6363]
2 [3.1242 ... 9.0091]
3 [6.6775 ... 12.958]
...
300000 [0.1260 ... 5.3323]
I then generate a new numpy array (let's call it array2) with the same shape and calculate the cosine similarity between each row of the dataframe and the generated array. For this, I am currently using sklearn.metrics.pairwise.cosine_similarity and save the results in a new column:
from sklearn.metrics.pairwise import cosine_similarity
df['Cosine'] = cosine_similarity(df['Array1'].tolist(), array2)
This works as intended and takes, on average, 2.5 seconds to execute. I am currently trying to lower this time to under 1 second, simply to have less waiting time in the system I am building.
I am beginning to learn about Vaex and Dask as alternatives to pandas but am failing to convert the code I provided to a working equivalent that is also faster.
Preferably with one of the technologies I mentioned, how can I go about making pairwise cosine calculations even faster for large datasets?
You could use Faiss here and apply a knn operation. To do this, you would put the dataframe into a Faiss index and then search it using the array with k=300000 (or whatever the total number of rows of your dataframe is).
import numpy as np
import faiss

dimension = 50  # each Array1 vector is 50-dimensional
# stack the dataframe column into one float32 matrix and normalise it,
# so that inner product == cosine similarity
vectors = np.vstack(df['Array1'].values).astype('float32')
faiss.normalize_L2(vectors)

faiss_index = faiss.IndexFlatIP(dimension)
faiss_index.add(vectors)  # add all rows of the dataframe at once

# search with the generated array (array2), also normalised
query = np.array(array2, dtype='float32').reshape(1, -1)
faiss.normalize_L2(query)

k = len(df)  # return a similarity for every row
D, I = faiss_index.search(query, k)
Note that you'll need to normalise the vectors to make this work (as the above solution is based on inner product).
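To get one similarity per row of the dataframe in its original order (the df['Cosine'] column from the question), the search results can be scattered back using the returned indices; a small follow-up to the sketch above:
# I[0] holds the row indices in decreasing-similarity order, D[0] the similarities;
# invert that ordering to write one cosine value per original row
cosine = np.empty(len(df), dtype='float32')
cosine[I[0]] = D[0]
df['Cosine'] = cosine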
I wonder if there's a way to transpose PyArrow tables without e.g. converting them to pandas dataframes or python objects in between.
Right now I'm using something similar to the following example, which I don't think is very efficient (I left out the schema for conciseness):
import numpy as np
import pyarrow as pa
np.random.seed(1234) # For reproducibility
N, M = 3, 4
arrays = [pa.array(np.random.randint(0, 4, N)) for _ in range(M)]
names = [str(x) for x in range(M)]
table = pa.Table.from_arrays(arrays, names)
print("Original:\n", table.to_pandas().values)
transposed = table.from_pandas(table.to_pandas().T)
print("\nTransposed:\n", transposed.to_pandas().values)
Resulting nicely in:
Original:
[[3 1 0 1]
[3 0 1 3]
[2 0 3 1]]
Transposed:
[[3 3 2]
[1 0 0]
[0 1 3]
[1 3 1]]
In the program I'm currently working on, I'm using PyArrow to avoid what seems to be a memory-leak issue I encountered with pandas dataframes, whose exact source/cause I couldn't pin down beyond the use of dataframes being the origin.
Hence, besides efficiency reasons, not wanting to use pandas objects here was the reason to use PyArrow data structures in the first place.
Is there a more direct way to do this?
If so, would the transposed result have contiguous memory blocks if the original table is also contiguous?
Also, would calling transposed.combine_chunks() reorder memory for this table to be contiguous along the columnar axis?
Is there a more direct way to do this?
No. It's not possible today. You're welcome to file a JIRA ticket. I couldn't find one.
The C++ API has array builders which would make this pretty straightforward but there is no python support for these at the moment (there is a JIRA for that https://issues.apache.org/jira/browse/ARROW-3917 but the marshaling overhead would probably become a bottleneck even if that was available).
If so, would the transposed result have contiguous memory blocks if the original table is also contiguous?
Also, would calling transposed.combine_chunks() reorder memory for this table to be contiguous along the columnar axis?
Arrow arrays are always contiguous along the columnar axis. Are you asking if the entire table would be represented as one contiguous memory region? In that case the answer is no. Arrow does not try to represent entire tables as a single contiguous range.
Arrow being a columnar format, it doesn't lend itself well to this kind of workload (tables with uniform types that are more like tensors/matrices).
The same could be said of pandas (to a lesser extent), and numpy is better suited for this type of workload. So instead of converting to pandas and transposing, you could convert to numpy and transpose.
It requires a bit more code, because the conversion from Arrow to numpy only works at the array/column level (not at the table level). See the docs:
# stack the columns into a numpy matrix; the .T restores the original row layout,
# so iterating over its rows below yields the columns of the transposed table
transposed_matrix = np.array([col.to_numpy() for col in table.itercolumns()]).T
transposed_arrays = [pa.array(col) for col in transposed_matrix]
transposed_names = [str(x) for x in range(len(transposed_arrays))]
transposed_table = pa.Table.from_arrays(transposed_arrays, names=transposed_names)
I am brute-force calculating the shortest distance from one point to many others on a 2D plane, with data coming from pandas dataframes using df['column'].to_numpy().
Currently, I am doing this using nested for loops on numpy arrays to fill up a list, taking the minimum value of that list, and storing that value in another list.
Checking 1000 points (from df_point) against 25,000 (from df_compare) takes about one minute, as this is understandably an inefficient process. My code is below.
point_x = df_point['x'].to_numpy()
compare_x = df_compare['x'].to_numpy()
point_y = df_point['y'].to_numpy()
compare_y = df_compare['y'].to_numpy()
dumarr = []
minvals = []
# Brute force: calculate the closest point by using the Pythagorean theorem,
# comparing each point to every other point
for k in range(len(point_x)):
    for i, j in np.nditer([compare_x, compare_y]):
        dumarr.append((point_x[k] - i)**2 + (point_y[k] - j)**2)
    minvals.append(df_compare['point_name'][dumarr.index(min(dumarr))])
    # Clear the dummy array (otherwise it will continuously append to it)
    dumarr = []
This isn't particularly pythonic. Is there a way to do this with vectorization, or at least without using nested for loops?
The approach is to create a 1000 x 25000 matrix, and then find the indices of the row minimums.
# distances for all combinations (1000x25000 matrix)
dum_arr = (point_x[:, None] - compare_x)**2 + (point_y[:, None] - compare_y)**2
# indices of minimums along rows
idx = np.argmin(dum_arr, axis=1)
# Not sure what is needed from the indices; this gets the values
# from the `point_name` column using the found indices
min_vals = df_compare['point_name'].iloc[idx]
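If the actual minimum distances are also needed, they can be read off the same matrix (dum_arr holds squared distances, so take the square root):
# minimum squared distance per point, converted back to a distance
min_dists = np.sqrt(dum_arr[np.arange(len(idx)), idx])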
I'm gonna give you the approach:
1. Create a DataFrame with the columns pointID, CoordX, CoordY.
2. Create a secondary DataFrame with an offset value of 1 (oldDF.iloc[pointIDx] = newDF.iloc[pointIDx - 1]).
3. Loop this offset value from 1 up to the number of coordinates - 1.
4. tempDF["Euclid Dist"] = sqrt(square(oldDf["CoordX"] - newDF["CoordX"]) + square(oldDf["CoordY"] - newDF["CoordY"]))
5. Append this tempDF to a list.
Reasons why this will be faster:
Only one loop, iterating the offset from 1 to the number of coordinates - 1.
Vectorization is taken care of by step 4.
numpy's square root and square functions are used to ensure the best results.
A rough sketch of the idea is shown below.
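A rough sketch of that idea, using a hypothetical single coordinate DataFrame old_df with the columns named above (the sample data is made up just so the snippet runs):
import numpy as np
import pandas as pd

# hypothetical coordinate frame: pointID, CoordX, CoordY
old_df = pd.DataFrame({'pointID': range(5),
                       'CoordX': np.random.rand(5),
                       'CoordY': np.random.rand(5)})
results = []
for offset in range(1, len(old_df)):       # single loop over the offsets
    new_df = old_df.shift(-offset)         # rows shifted by `offset`
    temp_df = pd.DataFrame({
        'pointID': old_df['pointID'],
        'Euclid Dist': np.sqrt(np.square(old_df['CoordX'] - new_df['CoordX']) +
                               np.square(old_df['CoordY'] - new_df['CoordY']))})
    results.append(temp_df)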
Instead of finding the closest point directly, you could try finding the closest in the x and y directions separately and then comparing those two to see which is closer, using the built-in min function like the top answer from this question:
min(myList, key=lambda x:abs(x-myNumber))
from list of integers, get number closest to a given value
EDIT:
Your loop would end up something like this if you do it all in one function call. Also, I'm not sure if the min function will end up looping through the compare arrays in a way that would take the same amount of time as your current code:
for k, m in np.nditer([point_x, point_y]):
    closest = min(zip(compare_x, compare_y),
                  key=lambda p: (p[0] - k)**2 + (p[1] - m)**2)
Another alternative could be to pre-compute the distance from (0,0) or another point like (-1000,1000) for all the points in the compare array, sort the compare array based on that, then only check points with a similar distance from the reference.
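A sketch of that pre-sorting idea, using the arrays from the question; the window size is an assumption and has to be at least as large as the true nearest distance for the pruning to be exact:
import numpy as np

# pre-compute each compare point's distance from a reference point, e.g. (0, 0)
ref_dist = np.sqrt(compare_x**2 + compare_y**2)
order = np.argsort(ref_dist)
ref_dist_sorted = ref_dist[order]
# for one query point, only examine compare points whose reference distance
# falls inside a window around the query's own reference distance
qx, qy = point_x[0], point_y[0]
q_ref = np.sqrt(qx**2 + qy**2)
window = 5.0                                # tuning parameter (assumption)
lo = np.searchsorted(ref_dist_sorted, q_ref - window, side='left')
hi = np.searchsorted(ref_dist_sorted, q_ref + window, side='right')
candidates = order[lo:hi]
d2 = (compare_x[candidates] - qx)**2 + (compare_y[candidates] - qy)**2
closest_idx = candidates[np.argmin(d2)]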
Here’s an example using scipy cdist, which is ideal for this type of problem:
import numpy as np
from scipy.spatial.distance import cdist
point = np.array([[1, 2], [3, 5], [4, 7]])
compare = np.array([[3, 2], [8, 5], [4, 1], [2, 2], [8, 9]])
# create 3x5 distance matrix
dm = cdist(point, compare)
# get row-wise mins
mins = dm.min(axis=1)
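If, as in the question, the point_name of the nearest compare point is needed as well, argmin on the same matrix gives the indices to look it up (df_compare being the dataframe from the question):
# index of the nearest compare point for each point
idx = dm.argmin(axis=1)
# look up the corresponding names in the original dataframe
names = df_compare['point_name'].iloc[idx]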
I'm using an HC-SR04 sensor on a Raspberry Pi and I want to compare the data I have read. When I try to store the data in an array, it only stores one value, which constantly gets overwritten. How can I store all of the readings, or compare one reading to another?
distance = (pulse_duration * 34320)*0.5
distance = round(distance,2)
array = []
array.append(distance)
output of this code is:
distance: 10.7cm
array: [10.7]
distance: 10.63cm
array: [10.63]
Your first issue of array-recreation is fixed by initializing your array outside of the loop that code resides in. However, for data analysis, I would suggest placing these values into a numpy array or a pandas Series/DataFrame. This will allow you to perform quick (and low time complexity) analysis on frequent sensor data. For example, instead of:
array = []
sum = 0
for i in array:
    sum += i
mean = sum / len(array)
You can use the C optimized numpy function:
np.mean(sensor_matrix)
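For the first issue, a minimal sketch of the fix: create the list once, outside the measurement loop (measure_pulse() is a hypothetical stand-in for the HC-SR04 pulse timing):
import numpy as np

readings = []                                # created once, before the loop
for _ in range(10):                          # e.g. take ten measurements
    pulse_duration = measure_pulse()         # hypothetical sensor read
    distance = round(pulse_duration * 34320 * 0.5, 2)
    readings.append(distance)                # every reading is kept
sensor_matrix = np.array(readings)
print(np.mean(sensor_matrix))                # C-optimized mean over all readings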
I am trying to do the following in NumPy without using a loop:
I have a matrix X of dimensions N*d and a vector y of dimension N.
y contains integers ranging from 1 to K.
I am trying to get a matrix M of size K*d, where M[i,:]=np.mean(X[y==i,:],0)
Can I achieve this without using a loop?
With a loop, it would go something like this.
import numpy as np
N=3
d=3
K=2
X=np.eye(N)
y=np.random.randint(1,K+1,N)
M=np.zeros((K,d))
for i in np.arange(0, K):
    line = X[y == i+1, :]
    if line.size == 0:
        M[i, :] = np.zeros(d)
    else:
        M[i, :] = np.mean(line, 0)
Thank you in advance.
The code is basically collecting specific rows of X and adding them up, for which we have a NumPy builtin in np.add.reduceat. So, with that in focus, the steps to solve it in a vectorized way could be as listed next -
# Get sort indices of y
sidx = y.argsort()
# Collect rows of X based on their IDs so that they come in consecutive order
Xr = X[sidx]
# Get unique row IDs, start positions of each unique ID
# and their counts to be used for average calculations
unq,startidx,counts = np.unique((y-1)[sidx],return_index=True,return_counts=True)
# Add rows off Xr based on the slices signified by the start positions
vals = np.true_divide(np.add.reduceat(Xr,startidx,axis=0),counts[:,None])
# Setup output array and set row summed values into it at unique IDs row positions
out = np.zeros((K,d))
out[unq] = vals
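As a quick sanity check (reusing the names from the question's loop version), the vectorized result can be compared against the loop output:
# loop version from the question, for comparison
M_loop = np.zeros((K, d))
for i in range(K):
    rows = X[y == i+1, :]
    if rows.size:
        M_loop[i, :] = np.mean(rows, 0)
assert np.allclose(out, M_loop)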
This solves the question, but creates an intermediate K×N boolean matrix, and doesn't use the built-in mean function. This may lead to worse performance or worse numerical stability in some cases. I'm letting the class labels range from 0 to K-1 rather than 1 to K.
import numpy as np

# Define constants
K, N, d = 10, 1000, 3
# Sample data
Y = np.random.randint(0, K-1, N)  # K-1 to omit one class, to test the no-examples case
X = np.random.randn(N, d)
# Calculate means for each class, vectorized
# Map samples to labels by taking a logical "outer product"
mark = Y[None, :] == np.arange(0, K)[:, None]
# Count number of examples in each class
count = mark.sum(1)
# Avoid divide by zero if no examples
count += count == 0
# Sum within each class and normalize
M = (np.dot(mark, X).T / count).T
print(M, M.shape, mark.shape)