I have a long list of H-points with known coordinates. I also have a list of TP-points. I'd like to know whether the H-points fall within a certain radius (e.g. r = 5) of any(!) TP-point.
dfPoints = pd.DataFrame({'H-points' : ['a','b','c','d','e'],
'Xh' :[10, 35, 52, 78, 9],
'Yh' : [15,5,11,20,10]})
dfTrafaPostaje = pd.DataFrame({'TP-points' : ['a','b','c'],
'Xt' : [15,25,35],
'Yt' : [15,25,35],
'M' : [5,2,3]})
def inside_circle(x, y, a, b, r):
    return (x - a)*(x - a) + (y - b)*(y - b) < r*r
I've started, but it would be much easier to check this against only one TP-point. With e.g. 1500 of them and 30,000 H-points, I need a more general solution (a broadcast sketch of the idea is below).
Can anyone help?
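For scale, the scalar inside_circle test can be broadcast over every (H-point, TP-point) pair at once with plain numpy. A minimal sketch (the hx/hy/tx/ty names are illustrative, not from the post):

import numpy as np

hx = dfPoints['Xh'].to_numpy()[:, None]        # shape (n_H, 1)
hy = dfPoints['Yh'].to_numpy()[:, None]
tx = dfTrafaPostaje['Xt'].to_numpy()[None, :]  # shape (1, n_TP)
ty = dfTrafaPostaje['Yt'].to_numpy()[None, :]
r = 5

# same strict '<' test as inside_circle, evaluated for all pairs at once
inside_any = ((hx - tx)**2 + (hy - ty)**2 < r*r).any(axis=1)
dfPoints[inside_any]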
Another option is to use distance_matrix from scipy.spatial:

import numpy as np
from scipy.spatial import distance_matrix

dist_mat = distance_matrix(dfPoints[['Xh','Yh']], dfTrafaPostaje[['Xt','Yt']])
dfPoints[np.min(dist_mat, axis=1) < 5]
Took about 2 s for 1500 dfPoints and 30000 dfTrafaPostaje.
Update: to get the index of the reference point with the highest score:
dist_mat = distance_matrix(dfPoints [['Xh','Yh']], dfTrafaPostaje [['Xt','Yt']])
# get the M scores of those within range
M_mat = pd.DataFrame(np.where(dist_mat <= 5, dfTrafaPostaje['M'].values[None, :], np.nan),
                     index=dfPoints['H-points'],
                     columns=dfTrafaPostaje['TP-points'])
# get the points with largest M values
# mask with np.nan for those outside range
dfPoints['TP'] = np.where(M_mat.notnull().any(axis=1), M_mat.idxmax(axis=1), np.nan)
For the included sample data:
H-points Xh Yh TP
0 a 10 15 a
1 b 35 5 NaN
2 c 52 11 NaN
3 d 78 20 NaN
4 e 9 10 NaN
You could use cdist from scipy.spatial.distance to compute the pairwise distances, then create a mask that is True where the distance is less than the radius, and finally filter:
import pandas as pd
from scipy.spatial.distance import cdist
dfPoints = pd.DataFrame({'H-points': ['a', 'b', 'c', 'd', 'e'],
'Xh': [10, 35, 52, 78, 9],
'Yh': [15, 5, 11, 20, 10]})
dfTrafaPostaje = pd.DataFrame({'TP-points': ['a', 'b', 'c'],
'Xt': [15, 25, 35],
'Yt': [15, 25, 35]})
radius = 5
distances = cdist(dfPoints[['Xh', 'Yh']].values, dfTrafaPostaje[['Xt', 'Yt']].values, 'sqeuclidean')
mask = (distances <= radius*radius).any(axis=1)  # True where some TP-point is in range
print(dfPoints[mask])
Output
H-points Xh Yh
0 a 10 15
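For much larger inputs, a spatial index avoids building the full distance matrix at all. A sketch using scipy's cKDTree (not part of the answer above; query_ball_point returns, for each H-point, the list of TP-indices within the radius):

from scipy.spatial import cKDTree

tree = cKDTree(dfTrafaPostaje[['Xt', 'Yt']].to_numpy())
hits = tree.query_ball_point(dfPoints[['Xh', 'Yh']].to_numpy(), r=5)
print(dfPoints[[len(h) > 0 for h in hits]])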
I have a pandas dataframe like below
import pandas as pd
data = [[5, 10], [4, 20], [15, 30], [20, 15], [12, 14], [5, 5]]
df = pd.DataFrame(data, columns=['x', 'y'])
I am trying to compute the value of this expression:
0.5 * sum(x[i]*y[i+1] - x[i+1]*y[i])
I haven't got an idea how to multiply the first value in one column with the second value in another column, as in the expression.
Try pd.DataFrame.shift(), but I think you need to pass -1 to shift, judging by the summation notation you posted. i + 1 implies using the next x or y, so shift needs a negative integer to shift one position ahead. Positive integers in shift go backwards.
Can you confirm -202.5 is the right answer?
0.5 * ((df.x * df.y.shift(-1)) - (df.x.shift(-1) * df.y)).sum()
>>>-202.5
I think the code below has the correct value in expression_end:
import pandas as pd
data = [[5, 10], [4, 20], [15, 30], [20, 15], [12, 14], [5, 5]]
df = pd.DataFrame(data, columns=['x', 'y'])
df["x+1"]=df["x"].shift(periods=-1)
df["y+1"]=df["y"].shift(periods=-1)
df["exp"]=df["x"]*df["y+1"]-df["x+1"]*df["y"]
expresion_end=0.5*df["exp"].sum()
You can use pandas.DataFrame.shift(). Compute shift(-1) once and reuse it for both 'x' and 'y'.
>>> df_tmp = df.shift(-1)
>>> (df['x']*df_tmp['y'] - df_tmp['x']*df['y']).sum() * 0.5
-202.5
# Explanation
>>> df[['x+1', 'y+1']] = df.shift(-1)
>>> df
x y x+1 y+1
0 5 10 4.0 20.0 # x*(y+1) - y*(x+1) = 5*20 - 10*4
1 4 20 15.0 30.0
2 15 30 20.0 15.0
3 20 15 12.0 14.0
4 12 14 5.0 5.0
5 5 5 NaN NaN
Consider the following list
import numpy as np
import pandas as pd
l = [1, 4, 6, np.nan, 20, np.nan, 24]
I know I can replace the nan values with simple linear interpolation using pandas' interpolate, as follows:
pd.Series([1, 4, 6, np.nan, 20, np.nan, 24]).interpolate()
Out[38]:
0 1.0
1 4.0
2 6.0
3 13.0
4 20.0
5 22.0
6 24.0
dtype: float64
My question is: how can I get the same result using only list comprehensions and standard numpy functions, but no built-in interpolation function (pd.Series.interpolate() or np.interp())? That is, using directly the formula for linear interpolation between two points, y = y0 + (y1 - y0) * (x - x0) / (x1 - x0).
l = [1, 4, 6, np.nan, 20, np.nan, 24]
res = [l[i] if not np.isnan(l[i]) else (l[i-1] + l[i+1]) / 2 for i in range(len(l))]
print(res)
Note that this averages the immediate neighbours, so it only works for isolated nans; it fails for consecutive nans or a nan at either end of the list.
Not sure if it is really a fit for this question, since it is not just a list comprehension, but I've figured out a solution that also works for gaps of more than one consecutive nan:
import numpy as np
l = [1,4,6,np.nan,20,np.nan,24, 30, 31, np.nan, np.nan, 70, 75]
# 1 -> entry is nan
nans = np.isnan(l)
# 1 -> from number to nan, -1 -> from nan to number
diffs = np.diff(list(map(int, nans)))
# get "gap of nans" begin and end indices
gap_starts = np.where(diffs == 1)[0]
gap_ends = np.where(diffs == -1)[0]
for begin, end in zip(gap_starts, gap_ends):
    # number of nans in the gap
    nans_n = end - begin
    # signed difference of the gap extrema
    nan_diff = l[end+1] - l[begin]
    # step to add at each nan
    step = round(nan_diff / (nans_n + 1))
    # interpolate section from begin to end
    filling = [l[begin] + (step * n) for n in range(1, nans_n + 1)]
    # fix l with interpolated values
    l[begin+1:end+1] = filling
print(l)
produces
[1, 4, 6, 13, 20, 22, 24, 30, 31, 44, 57, 70, 75]
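For reference, the same multi-nan case can also be handled with the two-point interpolation formula the question asks about, using one comprehension plus a helper. A sketch that assumes all nans are interior (the list starts and ends with numbers):

import numpy as np

l = [1, 4, 6, np.nan, 20, np.nan, 24, 30, 31, np.nan, np.nan, 70, 75]

known = [i for i, v in enumerate(l) if not np.isnan(v)]

def interp_at(i):
    # nearest known neighbours on either side of position i
    x0 = max(k for k in known if k < i)
    x1 = min(k for k in known if k > i)
    # straight line between (x0, l[x0]) and (x1, l[x1])
    return l[x0] + (l[x1] - l[x0]) * (i - x0) / (x1 - x0)

print([l[i] if i in known else interp_at(i) for i in range(len(l))])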
I have a data frame that contains two columns with numbers and a third column with repeating letters. Let's say something like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(100, 2)), columns=list('xy'))
letters = ['A', 'B', 'C', 'D'] * int(len(df.index) / 4)
df['letters'] = letters
I want to create two new columns that compare the numbers in columns 'x' and 'y' to the average of their corresponding letters. For example, one new column will contain 10 (if the value is at least 20% above the letter's mean), -10 (if it is at least 20% below), or 0 otherwise.
I wrote the function below:
def scoreFunHigh(dataField, mean, diff, multip):
    upper = mean * (1 + diff)
    lower = mean * (1 - diff)
    if dataField > upper:
        return multip * 1
    elif dataField < lower:
        return multip * (-1)
    else:
        return 0
And then created the column as follows:
letterMeanX = df.groupby('letters')['x'].transform(np.nanmean)
df['letter x score'] = np.vectorize(scoreFunHigh)(df['x'], letterMeanX, 0.2, 10)
letterMeanY = df.groupby('letters')['y'].transform(np.nanmean)
df['letter y score'] = np.vectorize(scoreFunHigh)(df['y'], letterMeanY, 0.3, 5)
This works. However, I am getting the runtime warning below:
C:\Users\ ... \Python\Python38\lib\site-packages\numpy\lib\function_base.py:2167: RuntimeWarning: invalid value encountered in ? (vectorized)
outputs = ufunc(*inputs)
(Please note that if I run the exact same code as above, I do not get the warning. My real dataframe is much larger and I am using several functions for different data.)
What is the problem here? Is there a better way to set this up?
Thank you very much
The sample you give does not produce the RuntimeWarning, so we can't do anything to help you diagnose it. I don't know if a fuller traceback provides any useful information.
But let's look at the calculations:
In [70]: np.vectorize(scoreFunHigh)(df['x'], letterMeanX, 0.2, 10)
Out[70]:
array([-10, 0, 10, -10, 0, 0, -10, -10, 10, 0, 0, 10, -10,
-10, 0, 10, 10, -10, 0, 10, -10, -10, -10, 10, 10, -10,
...
-10, 10, -10, 0, 0, 10, 10, 0, 10])
and with the df assignment:
In [74]: df['letter x score'] = np.vectorize(scoreFunHigh)(df['x'], letterMeanX,
...: 0.2, 10)
...:
In [75]: df
Out[75]:
x y letters letter x score
0 33 98 A -10
1 38 49 B 0
2 78 46 C 10
3 31 46 D -10
4 41 74 A 0
.. .. .. ... ...
95 51 4 D 0
96 70 4 A 10
97 74 74 B 10
98 54 70 C 0
99 87 44 D 10
Often np.vectorize gives problems because of the otypes issue (read the docs): if the trial calculation produces an integer, then the return dtype is set to that, giving problems if other values are floats. However, in this case the result can only be one of three values, [-10, 0, 10] (set by the last parameter).
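A minimal illustration of that otypes pitfall (toy example, not the code from the question):

import numpy as np

f = np.vectorize(lambda x: x / 2 if x > 1 else x)
# the trial call f(1) returns the int 1, so the output dtype becomes int
# and the later 1.5 is silently truncated:
print(f([1, 3]))  # [1 1]
# fixing the output dtype explicitly avoids the truncation:
print(np.vectorize(lambda x: x / 2 if x > 1 else x, otypes=[float])([1, 3]))  # [1.  1.5]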
The warning, as you've provided it, suggests that some value(s) in the larger dataframe are wrong for the calculations in your scoreFunHigh function, but it doesn't give enough detail to say which.
It is relatively easy to apply real numpy vectorization to this problem, since it depends on two Series, df['x'] and letterMeanX, and two scalars.
In [111]: letterMeanX = df.groupby('letters')['x'].transform(np.nanmean)
In [112]: letterMeanX.shape
Out[112]: (100,)
In [113]: df['x'].shape
Out[113]: (100,)
In [114]: upper = letterMeanX *(1+0.2)
In [115]: lower = letterMeanX *(1-0.2)
In [116]: res = np.zeros(letterMeanX.shape,int)
In [117]: res[df['x']>upper] = 10
In [118]: res[df['x']<lower] = -10
In [119]: np.allclose(res, Out[70])
Out[119]: True
In other words, rather than applying the upper/lower comparison row by row, this applies it to the whole Series. It still iterates, but in compiled numpy methods, which are much faster. np.vectorize is just a wrapper around an iteration: it still calls your Python function once for each row. Hopefully the performance disclaimer in its docs is clear enough.
Consider calling your function directly, with a slight adjustment to handle the conditional logic via numpy.select (or numpy.where). With this approach no Python-level loops are run, only vectorized operations on the Series and scalar parameters:
def scoreFunHigh(dataField, mean, diff, multip):
    conds = [dataField > mean * (1 + diff),
             dataField < mean * (1 - diff)]
    vals = [multip * 1, multip * (-1)]
    return np.select(conds, vals, default=0)
letterMeanX = df.groupby('letters')['x'].transform(np.nanmean)
df['letter x score'] = scoreFunHigh(df['x'], letterMeanX, 0.2, 10)
letterMeanY = df.groupby('letters')['y'].transform(np.nanmean)
df['letter y score'] = scoreFunHigh(df['y'], letterMeanY, 0.3, 5)
Here is a version that doesn't use np.vectorize:
def scoreFunHigh(val, mean, diff, multip):
    upper = mean * (1 + diff)
    lower = mean * (1 - diff)
    if val > upper:
        return multip * 1
    elif val < lower:
        return multip * (-1)
    else:
        return 0
letterMeanX = df.groupby('letters')['x'].apply(lambda x: np.nanmean(x))
df['letter x score'] = df.apply(
lambda row: scoreFunHigh(row['x'], letterMeanX[row['letters']], 0.2, 10), axis=1)
Output
x y letters letter x score
0 52 76 A 0
1 90 99 B 10
2 87 43 C 10
3 44 73 D 0
4 49 3 A 0
.. .. .. ... ...
95 16 51 D -10
96 38 3 A 0
97 43 47 B 0
98 58 39 C 0
99 41 26 D 0
I have a basic conditional data extraction issue. I have already written the code in Python, and I am learning R, so I would like to replicate it in R.
I tried to put conditional arguments using which, but that doesn't seem to work. I am not yet fully versed in R syntax.
I have a dataframe with 2 columns: x and y
The idea is to extract at most 5 x-values, multiplied by 2, corresponding to the largest y-values, with the condition that we only keep y-values that are at least 0.45 times the peak y-value.
So, the algorithm will have the following steps:
We find the peak value of y: max_y
We define the threshold = 0.45 * max_y
We apply a filter, to get the list of all y-values that are greater than the threshold value: y_filt
We get a list of x-values corresponding to the y-values in step 3: x_filt
If the number of values in x_filt is less than or equal to 5, then our result would be the values in x_filt multiplied by 2
If x_filt has more than 5 values, we only select the 5 values corresponding to the 5 maximum y-values in the list. Then we multiply by 2 to get our result
Python Code
max_y = max(y)
max_x = x[y.argmax()]
print (max_x, max_y)
threshold = 0.45 * max_y
y_filt = y[y > threshold]
x_filt = x[y > threshold]
if len(y_filt) > 4:
    n_highest = 5
else:
    n_highest = len(y_filt)
y_filt_highest = y_filt.argsort()[-n_highest:][::-1]
result = [x_filt[i]*2 for i in range(len(x_filt)) if i in y_filt_highest]
For the example data set
x y
1 20
2 7
3 5
4 11
5 0
6 8
7 3
8 10
9 2
10 6
11 15
12 18
13 0
14 1
15 12
The above code will give the following results
max_y = 20
max_x = 1
threshold = 9
y_filt = [20, 11, 10, 15, 18, 12]
x_filt = [1, 4, 8, 11, 12, 15]
n_highest = 5
y_filt_highest = [0, 4, 3, 5, 1] (indices of the 5 largest y-values: 20, 18, 15, 12, 11)
result = [2, 8, 22, 24, 30]
I wish to do the same in R.
One of the reasons that R is so powerful and easy to use for statistical work is that the built-in data.frame is foundational. Using one here simplifies things:
# Create a dataframe with the toy data
df <- data.frame(x = 1:10, y = c(20, 7, 5, 11, 0, 8, 3, 10, 2, 6))
# Refer to columns with the $ notation
max_y <- max(df$y)
max_x <- df$x[which(df$y == max_y)]
# If you want to print both values, you need to create a list with c()
print(c(max_x, max_y))
# But you could also just call the values directly, as in python
max_x
max_y
# Calculate a threshold and then create a filtered data.frame
threshold <- 0.45 * max_y
df_filt <- df[which(df$y > threshold), ]
df_filt <- df_filt[order(-df_filt$y), ]
if (nrow(df_filt) > 5) {
  df_filt <- df_filt[1:5, ]
}
# Calculate the result
result <- df_filt$x * 2
# Alternatively, you may want the result to be part of your data.frame
df_filt$result <- df_filt$x*2
# Should show identical results
max_y
max_x
threshold
df_filt # Probably don't want to print a df if it is large
result
Of course if you really need separate vectors for y_filt and x_filt, you could create them easily after the fact:
y_filt <- df_filt$y
x_filt <- df_filt$x
Note that, unlike numpy.argmax (which returns only the first position of the maximum), which(df$y == max_y) will return multiple values if the maximum is not unique.
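A quick Python check of that difference (toy data):

import numpy as np

y = np.array([3, 7, 7, 1])
print(np.argmax(y))            # 1 -> only the first position of the maximum
print(np.where(y == y.max()))  # (array([1, 2]),) -> all positions, like R's which()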
I have a set of objects and their positions over time. I would like to get the average distance between objects for each time point. An example dataframe is as follows:
import pandas as pd

time = [0, 0, 0, 1, 1, 2, 2]
x = [216, 218, 217, 280, 290, 130, 132]
y = [13, 12, 12, 110, 109, 3, 56]
car = [1, 2, 3, 1, 3, 4, 5]
df = pd.DataFrame({'time': time, 'x': x, 'y': y, 'car': car})
df
   time    x    y  car
0     0  216   13    1
1     0  218   12    2
2     0  217   12    3
3     1  280  110    1
4     1  290  109    3
5     2  130    3    4
6     2  132   56    5
The end result I would like to have is:
df2
average distance
between cars
time
0 1.55
1 10.05
2 53.04
Any idea on how to proceed? I've been trying to apply the scipy.spatial.distance functions to the dataframe, but I'm not sure how to apply them to df.groupby('time') and then get the mean of all those distances.
Any help appreciated!
You could pass an array of the points to scipy.spatial.distance.pdist and it will calculate all pairwise distances between Xi and Xj for i < j. Then take the mean.
import numpy as np
from scipy import spatial

df.groupby('time').apply(lambda g: spatial.distance.pdist(g[['x', 'y']].to_numpy()).mean())
Outputs:
time
0 1.550094
1 10.049876
2 53.037722
dtype: float64
For me, using apply or a for loop does not make much of a difference:
l1 = []
l2 = []
for t, g in df.groupby('time'):
    v = np.triu(spatial.distance.cdist(g[['x', 'y']].values, g[['x', 'y']].values), k=0)
    v = np.ma.masked_equal(v, 0)
    l2.append(np.mean(v))
    l1.append(t)
pd.DataFrame({'ave': l2}, index=l1)
Out[250]:
ave
0 1.550094
1 10.049876
2 53.037722
Building this up from first principles:
For each point at index n, it is necessary to compute the distance to all the points with index > n.
If the distance between two points is given by the formula
np.sqrt((x0 - x1)**2 + (y0 - y1)**2)
then for an array of points in a dataframe, we can get all the distances and then calculate their mean:
distances = []
for i in range(len(df)-1):
    distances += np.sqrt((df.x[i+1:] - df.x[i])**2 + (df.y[i+1:] - df.y[i])**2).tolist()
np.mean(distances)
Expressing the same logic using pd.concat and a couple of helper functions:
def diff_sq(x, i):
    return (x.iloc[i+1:] - x.iloc[i])**2

def dist_df(x, y, i):
    d_sq = diff_sq(x, i) + diff_sq(y, i)
    return np.sqrt(d_sq)

def avg_dist(df):
    return pd.concat([dist_df(df.x, df.y, i) for i in range(len(df)-1)]).mean()
Then it is possible to use the avg_dist function with groupby:
df.groupby('time').apply(avg_dist)
# outputs:
time
0 1.550094
1 10.049876
2 53.037722
dtype: float64
You could also use the itertools package to define your own function, as follows:
import itertools
import numpy as np
def combinations(series):
    l = list()
    for item in itertools.combinations(series, 2):
        l.append((item[0] - item[1])**2)
    return l
df2 = df.groupby('time').agg(combinations)  # applies combinations to every column; only the x and y lists are used below
df2['avg_distance'] = [np.mean(np.sqrt(pd.Series(df2.iloc[k,0]) +
pd.Series(df2.iloc[k,1]))) for k in range(len(df2))]
df2.avg_distance.to_frame()
Then, the output is:
avg_distance
time
0 1.550094
1 10.049876
2 53.037722