I have the following columns:
Id PlayId X Y
0 0 2.3 3.4
1 0 5.4 3.2
2 1 3.2 5.1
3 1 4.2 1.7
If I have two rows grouped under one PlayId, I want to add two Distance columns and two Angle columns:
Id PlayId X Y Distance_0 Distance_1 Angle_0 Angle_1
0 0 2.3 3.4 0.0 ? 0.0 ?
1 0 5.4 3.2 ? 0.0 ? 0.0
2 1 3.2 5.1
3 1 4.2 1.7
Each Distance column holds the Euclidean distance between the i-th and j-th elements of a group:
dist(x0, x1, y0, y1) = sqrt((x0 - x1) ** 2 + (y0 - y1) ** 2)
The angle between the i-th and j-th elements is calculated in a similar way.
So how can I do this efficiently, without processing the elements one by one?
You can compute the pairwise distances by using the pdist function from SciPy:
import pandas as pd
df = pd.DataFrame({'X': [5, 6, 7], 'Y': [3, 4, 5]})
# df
# X Y
# 0 5 3
# 1 6 4
# 2 7 5
from scipy.spatial.distance import pdist, squareform
cols = [f'Distance_{i}' for i in range(len(df))]
pd.DataFrame(squareform(pdist(df.values)), columns=cols)
which produces the following DataFrame:
Distance_0 Distance_1 Distance_2
0 0.000000 1.414214 2.828427
1 1.414214 0.000000 1.414214
2 2.828427 1.414214 0.000000
This works because pdist takes an array of shape m × n, where m is the number of observations (rows) and n is the dimension of each observation (here two: X and Y).
You could subsequently concat the original DataFrame with the newly created one if needed (using pd.concat).
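Since the question asks for distances within each PlayId group, the same idea can be applied group by group. The following is only a minimal sketch under that assumption; the helper name pairwise_distances is hypothetical, and the data is the sample from the question.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

df = pd.DataFrame({
    "Id": [0, 1, 2, 3],
    "PlayId": [0, 0, 1, 1],
    "X": [2.3, 5.4, 3.2, 4.2],
    "Y": [3.4, 3.2, 5.1, 1.7],
})

def pairwise_distances(group):
    # Hypothetical helper: full symmetric distance matrix for one group,
    # keeping the group's original row labels so it can be joined back.
    dist = squareform(pdist(group[["X", "Y"]].to_numpy()))
    cols = [f"Distance_{i}" for i in range(len(group))]
    return pd.DataFrame(dist, index=group.index, columns=cols)

# One distance block per PlayId, stacked and joined on the row index
parts = [pairwise_distances(g) for _, g in df.groupby("PlayId")]
result = df.join(pd.concat(parts))
print(result)
```

Groups of different sizes would produce different numbers of Distance columns; pd.concat fills the missing ones with NaN.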
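For the angle, you could use pdist as well, with metric='cosine' to compute the cosine distance. Note that the cosine distance is 1 - cos(θ), where θ is the angle between the two position vectors measured from the origin, so the angle itself is arccos(1 - distance). See this post for more information.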
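As a rough illustration of the angle computation, the sketch below converts the cosine distance back to an angle, and also shows a different quantity that may be what is wanted instead: the direction of the line from point i to point j, computed with np.arctan2. The sample points and the variable names are made up for the example.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

pts = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Angle between position vectors (measured from the origin), in degrees.
# pdist's cosine distance is 1 - cos(theta), so invert it with arccos.
cos_dist = squareform(pdist(pts, metric="cosine"))
theta = np.degrees(np.arccos(np.clip(1.0 - cos_dist, -1.0, 1.0)))

# Direction of the vector pointing from point i to point j, in degrees.
dx = pts[None, :, 0] - pts[:, None, 0]
dy = pts[None, :, 1] - pts[:, None, 1]
direction = np.degrees(np.arctan2(dy, dx))

print(theta)
print(direction)
```

The two matrices answer different questions, so which one fits depends on how "angle between two elements" is defined for the data.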
I am trying to implement the ALS algorithm in Dask, but I am having trouble figuring out how to compute the latent features in one step. I followed the formulas in this Stack Overflow thread and came up with this code:
Items = da.linalg.lstsq(da.add(da.dot(Users, Users.T), lambda_ * da.eye(n_factors)),
da.dot(Users, X))[0].T.compute()
Items = np.where(Items < 0, 0, Items)
Users = da.linalg.lstsq(da.add(da.dot(Items.T, Items), lambda_ * da.eye(n_factors)),
da.dot(Items.T, X.T))[0].compute()
Users = np.where(Users < 0, 0, Users)
But I don't think this works correctly, because MSE is not decreasing.
Example input:
n_factors = 2
lambda_ = 0.1
# We have 6 users and 4 items
The matrices X_train (6x4), R (4x6), Users (2x6) and Items (4x2) look like:
X_train (6x4):
1 0 0 0
0 0 0 0
0 3 0 0
0 3 0 0
1 1 1 0
1 0 0 0
R (4x6):
5 2 1 0 0 0
4 0 0 0 1 1
4 0 0 0 0 0
0 0 0 0 0 0
Users (2x6):
0.8 1.3 1.1 0.2 4.1 1.6
3.9 4.3 3.5 2.7 4.3 0.5
Items (4x2):
2.9 1.5
0.2 4.7
0.9 1.1
4.8 3.0
EDIT: I found the problem, but I don't know how to get around it. Before the iteration starts, I set all values in the X_train matrix where there is no rating to 0.
X_train = da.nan_to_num(X_train)
The reason is that the dot product works only on numeric values. But because the matrix is very sparse, 90% of it now consists of zeros, and instead of fitting the real ratings in the matrix, the model fits these zeros.
Any help would be highly appreciated. <3
One way to handle gaps or missing values in data sets is to use masked arrays. As of May 2017, Dask also supports them.
Defining a masked array in Dask is fairly simple and similar to NumPy's. All supported functions are listed in the docs; here are some of the most commonly used approaches:
import dask.array as da
data_set = da.array([[1, 2], [3, 4]])
masked_data_set_1 = da.ma.masked_array(data_set, mask=[[False, True],[True, False]])
# returns [[1, --],[--, 4]]
masked_data_set_2 = da.ma.masked_equal(data_set, 4)
# returns [[1, 2],[3, --]]
masked_data_set_3 = da.ma.masked_where(data_set < 3, data_set)
# returns [[--, --],[3, 4]]
In your case, you are trying to compute the dot product da.dot(Users, X). Instead of setting all NaN values to 0, you can use a masked array:
masked_X = da.ma.masked_where(X != X, X)
Now you can perform the dot product like this:
da.ma.getdata(da.dot(Users,masked_X))
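To make the difference between zero-filling and masking concrete, here is a small sketch. It uses NumPy's numpy.ma rather than Dask purely to stay self-contained, and the 2x2 array is invented for the example; it illustrates the masking idea only and is not an ALS implementation.

```python
import numpy as np

# Toy ratings matrix: NaN marks "no rating"
X = np.array([[1.0, np.nan],
              [np.nan, 4.0]])

# Zero-filling treats every missing rating as a real rating of 0
zero_filled = np.nan_to_num(X)

# Masking (NaN != NaN is True) excludes missing ratings from reductions
masked_X = np.ma.masked_where(X != X, X)

print(zero_filled.mean())  # averages over all four cells
print(masked_X.mean())     # averages over the two observed ratings only
```

The zero-filled mean is pulled towards 0 by the missing cells, which is the same effect the question describes for the fitted factors.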
This is a simplified version of my data. I have a dataframe of coordinates, and an empty dataframe which should be filled with the distance of each pair using the function provided.
What is the quickest method to fill this dataframe? As much as possible, I want to stay away from nested for loops (slow!). Can I use apply or applymap?
You may modify the function or other parts accordingly. Thanks.
import pandas as pd
def get_distance(point1, point2):
    """Gets the coordinates of two points as two lists, and outputs their distance"""
    return (((point1[0] - point2[0]) ** 2 + (point1[1] - point2[1]) ** 2 + (point1[2] - point2[2]) ** 2) ** 0.5)
#Dataframe of coordinates.
df = pd.DataFrame({"No.": [25, 36, 70, 95, 112, 101, 121, 201], "x": [1,2,3,4,2,3,4,5], "y": [2,3,4,5,3,4,5,6], "z": [3,4,5,6,4,5,6,7]})
df.set_index("No.", inplace = True)
#Dataframe to be filled with each pair distance.
df_dist = pd.DataFrame({'target': [112, 101, 121, 201]}, columns=["target", 25, 36, 70, 95])
df_dist.set_index("target", inplace = True)
AFAIK there is no clear speed benefit of a lambda over a for loop, and it is very hard to write a double lambda; lambdas are usually reserved for straightforward row operations.
However, with some engineering we can reduce the code to a few simple, self-explanatory lines:
import numpy as np
get = lambda i: df.loc[i,:].values
dist = lambda i, j: np.sqrt(sum((get(i) - get(j))**2))
# Fills your df_dist
for i in df_dist.columns:
    for j in df_dist.index:
        df_dist.loc[j,i] = dist(i, j)
The resulting df_dist:
25 36 70 95
target
112 1.732051 0.000000 1.732051 3.464102
101 3.464102 1.732051 0.000000 1.732051
121 5.196152 3.464102 1.732051 0.000000
201 6.928203 5.196152 3.464102 1.732051
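If SciPy is available, the same table can probably be built without any Python-level loop using scipy.spatial.distance.cdist, which computes the distance between every row of one array and every row of another. This is a sketch under that assumption; the names targets and sources are illustrative, not from the original post.

```python
import pandas as pd
from scipy.spatial.distance import cdist

df = pd.DataFrame({
    "No.": [25, 36, 70, 95, 112, 101, 121, 201],
    "x": [1, 2, 3, 4, 2, 3, 4, 5],
    "y": [2, 3, 4, 5, 3, 4, 5, 6],
    "z": [3, 4, 5, 6, 4, 5, 6, 7],
}).set_index("No.")

targets = [112, 101, 121, 201]   # rows of the distance table
sources = [25, 36, 70, 95]       # columns of the distance table

# cdist returns a (len(targets), len(sources)) array of Euclidean distances
df_dist = pd.DataFrame(
    cdist(df.loc[targets].to_numpy(), df.loc[sources].to_numpy()),
    index=pd.Index(targets, name="target"),
    columns=sources,
)
print(df_dist)
```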
If you don't want to use for loops, you can compute the distances between all possible pairs in the following way.
First, take the Cartesian product of df with itself to get all possible pairs of points (the output below assumes "No." is still a regular column, i.e. set_index has not been applied).
i, j = np.where(1 - np.eye(len(df)))
df=df.iloc[i].reset_index(drop=True).join(
df.iloc[j].reset_index(drop=True), rsuffix='_2')
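Here i and j are the integer row and column positions of the off-diagonal entries (the upper and lower triangles) of a square matrix of size len(df), so every ordered pair of distinct points appears once. Once you have done this, you just need to apply your distance function: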
df['distance'] = get_distance([df['x'],df['y'],df['z']], [df['x_2'],df['y_2'],df['z_2']])
df.head()
No. x y z No._2 x_2 y_2 z_2 distance
0 25 1 2 3 36 2 3 4 1.732051
1 25 1 2 3 70 3 4 5 3.464102
2 25 1 2 3 95 4 5 6 5.196152
3 25 1 2 3 112 2 3 4 1.732051
4 25 1 2 3 101 3 4 5 3.464102
If you wanted to compute only the points from df_dist you can modify accordingly the matrix 1 - np.eye(len(df)).
I am struggling to find the coefficients b1, b2 and b3. My model has 3 independent variables x1, x2 and x3 and one dependent variable y.
x1,x2,x3,y
89,4,3.84,7
66,1,3.19,5.4
78,3,3.78,6.6
111,6,3.89,7.4
44,1,3.57,4.8
77,3,3.57,6.4
80,3,3.03,7
66,2,3.51,5.6
109,5,3.54,7.3
76,3,3.25,6.4
I want to use the matrix method to find the coefficients b1, b2 and b3. According to the tutorial I am following, b1 is 0.0141, b2 is 0.383 and b3 is -0.607.
I am not sure how to obtain these values. When I tried to invert the matrix containing the x1, x2 and x3 values, I got the error below.
raise LinAlgError('Last 2 dimensions of the array must be square')
numpy.linalg.linalg.LinAlgError: Last 2 dimensions of the array must be square
Could someone please help me solve this matrix so that I can get the desired values?
In matrix form, the regression coefficients are given by
$$b = (x^\top x)^{-1} x^\top y$$
where x is your data matrix of predictors and y is a vector of outcome values.
In python (numpy), that looks something like this:
import numpy as np
b = np.dot(x.T, x)
b = np.linalg.inv(b)
b = np.dot(b, x.T)
b = np.dot(b, y)
Using that on your data you get the following coefficients:
0.0589514 , -0.25211869, 0.70097577
Those values don't match your expected output, and it's because the tutorial you're following must also be modelling an intercept. To do that we add a column of ones to the data matrix so it looks like this:
x.insert(loc=0, column='x0', value=np.ones(10))
x0 x1 x2 x3
0 1.0 89 4 3.84
1 1.0 66 1 3.19
2 1.0 78 3 3.78
3 1.0 111 6 3.89
4 1.0 44 1 3.57
5 1.0 77 3 3.57
6 1.0 80 3 3.03
7 1.0 66 2 3.51
8 1.0 109 5 3.54
9 1.0 76 3 3.25
Now we get the expected regression coefficients (plus an additional value for the intercept):
6.21137766, 0.01412189, 0.38315024, -0.60655271
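For reference, the whole calculation with the intercept column can be sketched end to end as below, using the data from the question. The cross-check with np.linalg.lstsq is an addition not present in the original answer; it solves the same least-squares problem without forming an explicit inverse, which is generally the numerically safer route.

```python
import numpy as np

# Columns: x1, x2, x3, y (data from the question)
data = np.array([
    [89, 4, 3.84, 7.0],
    [66, 1, 3.19, 5.4],
    [78, 3, 3.78, 6.6],
    [111, 6, 3.89, 7.4],
    [44, 1, 3.57, 4.8],
    [77, 3, 3.57, 6.4],
    [80, 3, 3.03, 7.0],
    [66, 2, 3.51, 5.6],
    [109, 5, 3.54, 7.3],
    [76, 3, 3.25, 6.4],
])

# Prepend a column of ones so the first coefficient is the intercept
X = np.column_stack([np.ones(len(data)), data[:, :3]])
y = data[:, 3]

# Normal equations: b = (X^T X)^{-1} X^T y
b_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Same problem solved by a least-squares routine
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b_normal)
print(b_lstsq)
```

Both vectors should agree with the coefficients quoted above, with the intercept first.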
I have a dataframe:
id lat long
1 12.654 15.50
2 14.364 25.51
3 17.636 32.53
5 12.334 25.84
9 32.224 15.74
I want to find the Euclidean distance of these coordinates from a particular location saved in a list L1:
L1 = [11.344,7.234]
I want to create a new column in df that holds the distances:
id lat long distance
1 12.654 15.50
2 14.364 25.51
3 17.636 32.53
5 12.334 25.84
9 32.224 15.74
I know how to find the Euclidean distance between two points using math.hypot():
dist = math.hypot(x2 - x1, y2 - y1)
How do I write a function, using apply or by iterating over rows, that gives me the distances?
Use a vectorized approach:
In [5463]: (df[['lat', 'long']] - np.array(L1)).pow(2).sum(1).pow(0.5)
Out[5463]:
0 8.369161
1 18.523838
2 26.066777
3 18.632320
4 22.546096
dtype: float64
This can also be assigned directly to a new column:
In [5468]: df['distance'] = df[['lat', 'long']].sub(np.array(L1)).pow(2).sum(1).pow(0.5)
In [5469]: df
Out[5469]:
id lat long distance
0 1 12.654 15.50 8.369161
1 2 14.364 25.51 18.523838
2 3 17.636 32.53 26.066777
3 5 12.334 25.84 18.632320
4 9 32.224 15.74 22.546096
Option 2: Use NumPy's built-in vector norm, np.linalg.norm.
In [5473]: np.linalg.norm(df[['lat', 'long']].sub(np.array(L1)), axis=1)
Out[5473]: array([ 8.36916101, 18.52383805, 26.06677732, 18.63231966, 22.5460958 ])
In [5485]: df['distance'] = np.linalg.norm(df[['lat', 'long']].sub(np.array(L1)), axis=1)
Translating $[(x_2 - x_1)^2 + (y_2 - y_1)^2]^{1/2}$ into pandas vectorised operations, you have:
df['distance'] = (df.lat.sub(11.344).pow(2).add(df.long.sub(7.234).pow(2))).pow(.5)
df
lat long distance
id
1 12.654 15.50 8.369161
2 14.364 25.51 18.523838
3 17.636 32.53 26.066777
5 12.334 25.84 18.632320
9 32.224 15.74 22.546096
Alternatively, using arithmetic operators:
(((df.lat - 11.344) ** 2) + (df.long - 7.234) ** 2) ** .5
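Since the question mentions math.hypot, it may be worth noting that NumPy offers an element-wise counterpart, np.hypot, which works directly on whole columns. A minimal sketch using the data from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 5, 9],
    "lat": [12.654, 14.364, 17.636, 12.334, 32.224],
    "long": [15.50, 25.51, 32.53, 25.84, 15.74],
})
L1 = [11.344, 7.234]

# np.hypot(a, b) computes sqrt(a**2 + b**2) element-wise
df["distance"] = np.hypot(df["lat"] - L1[0], df["long"] - L1[1])
print(df)
```

This only covers the two-dimensional case; for more coordinates, np.linalg.norm with axis=1 generalises more naturally.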
I have a dataframe net that contains the distance d between two locations A and B.
net =
A B d
0 5 3 3.5
1 2 0 2.3
2 3 2 1.2
3 4 5 2.2
4 0 1 3.2
5 0 3 4.5
Then I have a symmetric matrix M that contains all the possible distances between two pairs, so:
M =
0 1 2 3 4 5
0 0 3.2 2.3 4.5 1.7 5.2
1 3.2 0 2.1 0.7 3.9 3.8
2 2.3 2.1 0 1.2 1.5 4.7
3 4.5 0.7 1.2 0 3.2 3.5
4 1.7 3.9 1.5 3.2 0 2.2
5 5.2 3.8 4.7 3.5 2.2 0
I want to generate a new dataframe df1 that contains, for each row of net, two random distinct locations A and B whose distance ds lies in the same unit interval as d, i.e. ds > np.floor(d) & ds < np.floor(d)+1.
This is what I am doing
H = []
W = []
for i in net.index:
    tmp = net['d'][i]
    ds = np.where( (M > np.floor(tmp)) & (M < np.floor(tmp)+1) )
    size = len(ds[0])
    ind = randint(size) ## find two random locations with distance ds
    h = ds[0][ind]
    w = ds[1][ind]
    H.append(h)
    W.append(w)
df1 = pd.DataFrame()
df1['A'] = H
df1['B'] = W
Group the entries of M by floor division by 1, then use those groups to look up and sample a pair for each d:
g = M.stack().index.to_series().groupby(M.stack() // 1)
net.d.apply(lambda x: pd.Series(g.get_group(x // 1).sample(1).iloc[0], list('AB')))
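A self-contained variant of this idea, using the data from the question, might look like the sketch below. It differs from the two lines above in one respect, which is an assumption about intent: the zero diagonal is dropped first so that A and B are always different locations. The helper name sample_pair is hypothetical.

```python
import numpy as np
import pandas as pd

net = pd.DataFrame({
    "A": [5, 2, 3, 4, 0, 0],
    "B": [3, 0, 2, 5, 1, 3],
    "d": [3.5, 2.3, 1.2, 2.2, 3.2, 4.5],
})
M = pd.DataFrame([
    [0.0, 3.2, 2.3, 4.5, 1.7, 5.2],
    [3.2, 0.0, 2.1, 0.7, 3.9, 3.8],
    [2.3, 2.1, 0.0, 1.2, 1.5, 4.7],
    [4.5, 0.7, 1.2, 0.0, 3.2, 3.5],
    [1.7, 3.9, 1.5, 3.2, 0.0, 2.2],
    [5.2, 3.8, 4.7, 3.5, 2.2, 0.0],
])

# Long format: one distance per (row, column) pair, diagonal removed
pairs = M.stack()
pairs = pairs[pairs > 0]

# Bucket the (row, column) labels by the integer part of their distance
groups = pairs.index.to_series().groupby(np.floor(pairs))

def sample_pair(d):
    # Hypothetical helper: draw one random pair from the bucket matching d
    a, b = groups.get_group(np.floor(d)).sample(1).iloc[0]
    return pd.Series({"A": a, "B": b})

df1 = net["d"].apply(sample_pair)
print(df1)
```

Note that bucketing by floor also admits distances that are exactly an integer, which the strict inequalities in the question would exclude; with this data no such values occur.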