In a dataframe df containing points (rows) and coordinates (columns), I want to compute, for each point, the n closest neighboring points and the corresponding distances.
I did something like this:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 6))

def dist(p, q):
    return ((p - q)**2).sum(axis=1)

def f(s):
    closest = dist(s, df).nsmallest(3)
    return list(closest.index) + list(closest)

df.apply(f, axis=1, result_type="expand")
which gives:
0 1 2 3 4 5
0 0.0 3.0 2.0 0.0 0.743722 1.140251
1 1.0 2.0 0.0 0.0 1.548676 1.695104
2 2.0 3.0 0.0 0.0 0.702797 1.140251
3 3.0 2.0 0.0 0.0 0.702797 0.743722
(first 3 columns are the indices of the closest points, the next 3 columns are the corresponding distances)
However, I would prefer to get a dataframe with 3 columns: point, closest point to it, distance between them.
Put another way: I want one column per distance, and not one column per point.
I tried pd.melt, pd.pivot but without finding any good way to do it...
Option 1: Scikit-learn NearestNeighbors class
To find the k nearest neighbors (kNN), sklearn.neighbors.NearestNeighbors serves the purpose.
Data
import numpy as np
import pandas as pd
np.random.seed(52) # reproducibility
df = pd.DataFrame(np.random.rand(4, 6))
print(df)
0 1 2 3 4 5
0 0.823110 0.026118 0.210771 0.618422 0.098284 0.620131
1 0.053890 0.960654 0.980429 0.521128 0.636553 0.764757
2 0.764955 0.417686 0.768805 0.423202 0.926104 0.681926
3 0.368456 0.858910 0.380496 0.094954 0.324891 0.415112
Code
from sklearn.neighbors import NearestNeighbors
k = 3
dist, indices = NearestNeighbors(n_neighbors=k).fit(df).kneighbors(df)
Result
print(dist)
array([[0.00000000e+00, 1.09330867e+00, 1.13862254e+00],
[0.00000000e+00, 9.32862532e-01, 9.72369661e-01],
[0.00000000e+00, 9.72369661e-01, 1.02130721e+00],
[2.10734243e-08, 9.32862532e-01, 1.02130721e+00]])
print(indices)
array([[0, 2, 3],
[1, 3, 2],
[2, 1, 3],
[3, 1, 2]])
The obtained distances and indices can be easily rearranged.
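For example, one minimal way to rearrange them into the requested three-column layout (column 0 of both arrays is each point paired with itself, so column 1 holds the nearest distinct neighbor):
# Skip column 0 (each point paired with itself) and keep the nearest other point.
df_want = pd.DataFrame({
    "point": range(len(df)),
    "closest_point": indices[:, 1],
    "distance": dist[:, 1],
})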
Option 2: compute manually (nearest except self)
sklearn.metrics has a built-in Euclidean distance function, which outputs an array of shape [#rows x #rows]. You can exclude the diagonal elements (each point's distance to itself, namely 0) from min() and argmin() by filling the diagonal with infinity.
Code
from sklearn.metrics import euclidean_distances
dist = euclidean_distances(df.values, df.values)
np.fill_diagonal(dist, np.inf) # exclude self from min()
df_want = pd.DataFrame({
    "point": range(df.shape[0]),
    "closest_point": dist.argmin(axis=1),
    "distance": dist.min(axis=1),
})
Result
print(df_want)
point closest_point distance
0 0 2 1.093309
1 1 3 0.932863
2 2 1 0.972370
3 3 1 0.932863
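If you need the k nearest neighbors per point rather than only the closest one, the same distance matrix can be reshaped into long format (one row per point/neighbor pair). A sketch, assuming the diagonal has already been set to infinity as above:
k = 3
order = dist.argsort(axis=1)[:, :k]  # column indices of the k smallest distances per row
df_long = pd.DataFrame({
    "point": np.repeat(np.arange(df.shape[0]), k),
    "closest_point": order.ravel(),
    "distance": np.take_along_axis(dist, order, axis=1).ravel(),
})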
Related
I'm trying to remove rows from a DataFrame that are within a Euclidean distance threshold of other points listed in the DataFrame. So for example, in the small DataFrame provided below, two rows would be removed if a threshold value was set equal to 0.001 (1 mm: thresh = 0.001), where X and Y are spatial coordinates:
import pandas as pd
data = {'X': [0.075, 0.0791667, 0.0749543, 0.0791184, 0.075, 0.0833333, 0.0749543],
        'Y': [1e-15, 0, -0.00261746, -0.00276288, -1e-15, 0, -0.00261756],
        'T': [12.57, 12.302, 12.56, 12.292, 12.57, 12.052, 12.56]}
df = pd.DataFrame(data)
df
# X Y T
# 0 0.075000 1.000000e-15 12.570
# 1 0.079167 0.000000e+00 12.302
# 2 0.074954 -2.617460e-03 12.560
# 3 0.079118 -2.762880e-03 12.292
# 4 0.075000 -1.000000e-15 12.570
# 5 0.083333 0.000000e+00 12.052
# 6 0.074954 -2.617560e-03 12.560
The rows with indices 4 and 6 need to be removed because they are spatial duplicates of rows 0 and 2, respectively, since they are within the specified threshold distance of previously listed points. Also, I always want to remove the 2nd occurrence of a point that is within the threshold distance of a previous point. What's the best way to approach this?
Let's try it with this one: calculate the Euclidean distance for each pair of (X, Y) points, which creates a symmetric matrix. Mask the upper half (including the diagonal); then, in the lower half, filter out the rows where any value is less than thresh:
import numpy as np

thresh = 0.001
m = np.tril(np.sqrt(np.power(df[['X']].to_numpy() - df['X'].to_numpy(), 2) +
                    np.power(df[['Y']].to_numpy() - df['Y'].to_numpy(), 2)))
m[np.triu_indices(m.shape[0])] = np.nan
out = df[~np.any(m < thresh, axis=1)]
We could also write it a bit more concisely and legibly (taking a leaf out of #BENY's elegant solution) by using the k parameter of numpy.tril to directly mask the upper half of the symmetric matrix:
distances = np.sqrt(np.sum([(df[[c]].to_numpy() - df[c].to_numpy())**2
                            for c in ('X', 'Y')], axis=0))
msk = np.tril(distances < thresh, k=-1).any(axis=1)
out = df[~msk]
Output:
X Y T
0 0.075000 1.000000e-15 12.570
1 0.079167 0.000000e+00 12.302
2 0.074954 -2.617460e-03 12.560
3 0.079118 -2.762880e-03 12.292
5 0.083333 0.000000e+00 12.052
You mentioned the key word distance, so we can use cdist from scipy:
from scipy.spatial.distance import cdist

v = df[['X', 'Y']]
ary = cdist(v, v, metric='euclidean')
df[~np.tril(ary < 0.001, k=-1).any(axis=1)]
Out[100]:
X Y T
0 0.075000 1.000000e-15 12.570
1 0.079167 0.000000e+00 12.302
2 0.074954 -2.617460e-03 12.560
3 0.079118 -2.762880e-03 12.292
5 0.083333 0.000000e+00 12.052
I have a relatively large dataframe (~24000 rows and 15 columns) which has 2D coordinate data of rat movements, outputted by a neural network (DeepLabCut).
As part of this output data, there is a p-value score that measures how certain the neural network was when applying each label. I'm trying to filter out low-quality predictions by copying the previous row into place each time a low p-value is encountered, which assumes that the rat remained still for that frame.
Here's my function thus far:
def checkPVals(DataFrame, CutOff):
    for Cols in DataFrame.columns.values:
        if Cols % 3 == 0:
            for Vals in DataFrame.index.values:
                if float(DataFrame[Cols][Vals]) < CutOff:
                    if Vals != 0:
                        PreviousRow = DataFrame.loc[Vals - 1, Cols - 3:Cols]
                        DataFrame.loc[Vals, Cols - 3:Cols] = PreviousRow
    return DataFrame
Here is a sample of the input data frame:
pd.DataFrame(data={
    "x": [1, 2, 3, 4],
    "y": [5, 4, 3, 2],
    "likelihood": [1, 1, 0.3, 1]
})
Here is a sample of the desired output:
x y Pval
0 1 5 1.0
1 2 4 1.0
2 2 4 1.0
3 4 2 1.0
With the idea being that row index 2 is replaced with values from row index 1, such that when the inter-frame Euclidean distance between these coordinates is calculated, the distance is 0, implying the label (rat) has not moved.
Clearly, my current implementation is very inefficient. I was looking at iterrows(), but that converts my data into a Series and messes with it. My other thought was to convert the p-value columns into np.arrays, iterate through those, take the indices of the p-values below the threshold and then swap those rows for the previous ones in an iterative manner. However, I feel like that would take just as long.
Any help is very much appreciated. Thank you!
I'm pretty sure I understood what you are attempting to do. If you could update your question to have a sample output that's paired with your sample input, that would be greatly beneficial.
If I understood correctly, you should be using a vectorized approach instead of explicit looping (this will massively speed up your data wrangling). Essentially you can mask the rows of the dataframe depending on whether or not the "likelihood" column is above a certain value. Once you mask the low likelihoods away (i.e. replace those values with NaN), you can simply forward fill the entire dataframe to fill in the "bad" rows with the previous row's values.
df = pd.DataFrame(data={
    "x": [1, 2, 3, 4],
    "y": [5, 4, 3, 2],
    "likelihood": [1, 1, 0.3, 1]
})
cutoff = 0.5
new_df = df.mask(df["likelihood"] < cutoff).ffill()
print(new_df)
x y likelihood
0 1.0 5.0 1.0
1 2.0 4.0 1.0
2 2.0 4.0 1.0
3 4.0 2.0 1.0
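For the full frame with several x/y/likelihood triplets (as in the ~15-column DeepLabCut output), here is a hedged sketch of the same mask-and-ffill idea applied per triplet; the repeating column layout (x, y, likelihood, x, y, likelihood, ...) and the helper name are assumptions, not part of DeepLabCut itself:
import pandas as pd

def fill_low_likelihood(df, cutoff=0.5, group_size=3):
    # Assumes columns repeat as x, y, likelihood, x, y, likelihood, ...
    # Illustrative helper (hypothetical name), not a DeepLabCut API.
    pieces = []
    for i in range(0, df.shape[1], group_size):
        block = df.iloc[:, i:i + group_size]
        low = block.iloc[:, group_size - 1] < cutoff  # likelihood column of this triplet
        pieces.append(block.mask(low).ffill())       # blank out bad rows, fill from previous frame
    return pd.concat(pieces, axis=1)

new_df = fill_low_likelihood(df, cutoff=0.5)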
I am trying to find a solution to do the following operation using either numpy or pandas: multiply each column of a dataframe by the matching value of a series (aligned by label).
For instance, the result matrix has [0, 0, 0] as its first column, which is the result of multiplying the series value a by column a elementwise, more specifically it is equal to: [0 x 0.5, 0 x 0.4, 0 x 0.1].
If there is no method for such a problem, I might just expand the series into a dataframe by duplicating its values, so that I can multiply two dataframes.
input data:
series = pd.Series([0, 10, 0, 100, 1], index=list('abcde'))
df = pd.DataFrame([[0.5, 0.4, 0.2, 0.7, 0.8],
                   [0.4, 0.5, 0.1, 0.1, 0.5],
                   [0.1, 0.9, 0.8, 0.3, 0.8]],
                  columns=list('abcde'))
This is actually very simple. Because the Series' index aligns with the DataFrame's columns, you only need to do:
series*df
output:
a b c d e
0 0.0 4.0 0.0 70.0 0.8
1 0.0 5.0 0.0 10.0 0.5
2 0.0 9.0 0.0 30.0 0.8
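If you prefer to be explicit about the alignment axis, DataFrame.mul does the same thing:
df.mul(series, axis='columns')  # equivalent to series * df; aligns the series with the column labels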
I have an array A, say :
import numpy as np
A = np.array([1,2,3,4,5,6,7,8])
And I wish to create a new array B by replacing each element of A with the median of its four nearest neighbors, without taking into account the value at the given position... for example:
B[2] = np.median([A[0], A[1], A[3], A[4]]) (=3)
The thing is that I need to perform this on a gigantic A and I want to optimize the run time, so I want to avoid for loops or anything similar. And... I don't care about the result at the edges.
I already tried scipy.ndimage.filters.median_filter, but it is not producing the desired output:
import scipy.ndimage
B = scipy.ndimage.filters.median_filter(A,footprint=[1,1,0,1,1],mode='wrap')
which produces B=[7,4,4,5,6,7,6,6], which is clearly not the correct answer.
Any idea is welcome.
One way could be using np.roll to shift the numbers in your array, such as:
A_1 = np.roll(A,1)
# output: array([8, 1, 2, 3, 4, 5, 6, 7])
And then the same thing, rolling by 2, -1 and -2:
A_2 = np.roll(A,2)
A_m1 = np.roll(A,-1)
A_m2 = np.roll(A,-2)
Now you just need to sum your 4 arrays, as for each index you have the 4 neighbors in one of them:
B = (A_1 + A_2 + A_m1 + A_m2)/4.
And as you said you don't care about the edges, I think it works for you!
EDIT: I guess I was so focused on the rolling idea that I mixed up mean and median; the median can be calculated with B = np.median([A_1, A_2, A_m1, A_m2], axis=0)
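Putting it together, a minimal runnable sketch of the roll-based median (edge values wrap around, so ignore them as stated):
import numpy as np

A = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# Stack the four shifted copies and take the elementwise median across them.
shifted = np.stack([np.roll(A, s) for s in (2, 1, -1, -2)])
B = np.median(shifted, axis=0)
print(B)  # [5.  3.5 3.  4.  5.  6.  5.5 4. ] -- B[2] == 3 as desired; the edges wrap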
I'd make a rolling, centered window of length 5 in pandas and apply the median function to the values of the window, with the middle one masked away:
import numpy as np
A = np.array([1,2,3,4,5,6,7,8])
mask = np.array(np.ones(5), bool)
mask[5//2] = False
import pandas as pd
df = pd.DataFrame(A)
r5 = df.rolling(5, center=True)
result = r5.apply(lambda x: np.median(x[mask]))
result
0
0 NaN
1 NaN
2 3.0
3 4.0
4 5.0
5 6.0
6 NaN
7 NaN
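If you want to stay in pure numpy (assuming numpy >= 1.20 for sliding_window_view), an equivalent sketch for the interior points:
import numpy as np

A = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# Length-5 windows over A; drop the centre column, then take the row-wise median.
windows = np.lib.stride_tricks.sliding_window_view(A, 5)
neighbours = windows[:, [0, 1, 3, 4]]
B_inner = np.median(neighbours, axis=1)  # values for indices 2 .. len(A) - 3
print(B_inner)  # [3. 4. 5. 6.]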
I have a pandas Series:
0 1
1 5
2 20
3 -1
Let's say I want to apply mean() on every two elements, so I get something like this:
0 3.0
1 9.5
Is there an elegant way to do this?
You can use groupby with the index floor-divided by k=2:
k = 2
print (s.index // k)
Int64Index([0, 0, 1, 1], dtype='int64')
print (s.groupby([s.index // k]).mean())
0    3.0
1    9.5
dtype: float64
You can do this:
(s.iloc[::2].values + s.iloc[1::2])/2
If you want, you can also reset the index afterwards, so you have 0, 1 as the index, using:
((s.iloc[::2].values + s.iloc[1::2])/2).reset_index(drop=True)
If you are doing this over a large series, and many times, you'll want to consider a fast approach. This solution uses numpy functions throughout and will be fast.
Use reshape and construct a new pd.Series
consider the pd.Series s
s = pd.Series([1, 5, 20, -1])
generalized function
def mean_k(s, k):
    pad = (k - s.shape[0] % k) % k
    nan = np.repeat(np.nan, pad)
    val = np.concatenate([s.values, nan])
    return pd.Series(np.nanmean(val.reshape(-1, k), axis=1))
demonstration
mean_k(s, 2)
0 3.0
1 9.5
dtype: float64
mean_k(s, 3)
0 8.666667
1 -1.000000
dtype: float64