I have a dataframe:
id lat long
1 12.654 15.50
2 14.364 25.51
3 17.636 32.53
5 12.334 25.84
9 32.224 15.74
I want to find the Euclidean distance of these coordinates from a particular location saved in a list L1:
L1 = [11.344, 7.234]
I want to create a new column in df that holds the distances:
id lat long distance
1 12.654 15.50
2 14.364 25.51
3 17.636 32.53
5 12.334 25.84
9 32.224 15.74
I know how to find the Euclidean distance between two points using math.hypot():
dist = math.hypot(x2 - x1, y2 - y1)
How do I write a function using apply, or iterate over rows, to give me the distances?
Use a vectorized approach:
In [5463]: (df[['lat', 'long']] - np.array(L1)).pow(2).sum(1).pow(0.5)
Out[5463]:
0 8.369161
1 18.523838
2 26.066777
3 18.632320
4 22.546096
dtype: float64
Which can also be written as:
In [5468]: df['distance'] = df[['lat', 'long']].sub(np.array(L1)).pow(2).sum(1).pow(0.5)
In [5469]: df
Out[5469]:
id lat long distance
0 1 12.654 15.50 8.369161
1 2 14.364 25.51 18.523838
2 3 17.636 32.53 26.066777
3 5 12.334 25.84 18.632320
4 9 32.224 15.74 22.546096
Option 2: Use NumPy's built-in vector norm, np.linalg.norm.
In [5473]: np.linalg.norm(df[['lat', 'long']].sub(np.array(L1)), axis=1)
Out[5473]: array([ 8.36916101, 18.52383805, 26.06677732, 18.63231966, 22.5460958 ])
In [5485]: df['distance'] = np.linalg.norm(df[['lat', 'long']].sub(np.array(L1)), axis=1)
Translating ((x2 - x1)**2 + (y2 - y1)**2) ** 0.5 into pandas vectorised operations, you have:
df['distance'] = (df.lat.sub(11.344).pow(2).add(df.long.sub(7.234).pow(2))).pow(.5)
df
lat long distance
id
1 12.654 15.50 8.369161
2 14.364 25.51 18.523838
3 17.636 32.53 26.066777
5 12.334 25.84 18.632320
9 32.224 15.74 22.546096
Alternatively, using arithmetic operators:
(((df.lat - 11.344) ** 2) + (df.long - 7.234) ** 2) ** .5
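For completeness, since the question explicitly asked about apply: a row-wise version with math.hypot is sketched below. It produces the same distances but is noticeably slower than the vectorized options above on large frames.
import math
df['distance'] = df.apply(
    lambda row: math.hypot(row['lat'] - L1[0], row['long'] - L1[1]),
    axis=1,
)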
My dataset df looks like this:
main_id time day lat long
1 2019-05-31 1 53.5501667 9.9716466
1 2019-05-31 1 53.6101545 9.9568781
1 2019-05-30 1 53.5501309 9.9716300
1 2019-05-30 2 53.5501309 9.9716300
1 2019-05-30 2 53.4561309 9.1246300
2 2019-06-31 4 53.5501667 9.9716466
2 2019-06-31 4 53.6101545 9.9568781
I want to find the total distance covered by each main_id item for each day. To calculate the distance between two sets of coordinates, I can use this function:
import geopy.distance

def find_kms(coords_1, coords_2):
    return geopy.distance.geodesic(coords_1, coords_2).km
but I am not sure how I can sum it when grouping on main_id and day. The end result could be a new df like this:
main_id day total_dist time
1 1 ... 2019-05-31
1 2 .... 2019-05-31
2 4 .... 2019-05-31
Where the derived time is any value (for example the first) from the time column of the respective main_id and day group.
total_dist calculation:
For example, for the first result row (main_id == 1, day 1), total_dist would be calculated like this:
find_kms((53.5501667, 9.9716466), (53.6101545, 9.9568781)) + find_kms((53.6101545, 9.9568781), (53.5501309, 9.9716300))
which comes to roughly 13.5 km, matching the 13.499 result below.
Note that your function is not vectorized, which makes the work harder.
(df.assign(dist=df.join(df.groupby(['main_id', 'day'])[['lat', 'long']]
                          .shift(), rsuffix='1')
                  .bfill()
                  .reset_index()
                  .groupby('index')
                  .apply(lambda x: find_kms(x[['lat', 'long']].values,
                                            x[['lat1', 'long1']].values)))
   .groupby(['main_id', 'day'])['dist'].sum().reset_index())
main_id day dist
0 1 1 13.499279
1 1 2 57.167034
2 2 4 6.747748
Another option is to use reduce:
from functools import reduce

def total_dist(x):
    coords = x[['lat', 'long']].values
    lam = lambda acc, y: (find_kms(acc[1], y) + acc[0], y)
    dist = reduce(lam, coords, (0, coords[0]))[0]
    return pd.Series({'dist': dist})
df.groupby(['main_id', 'day']).apply(total_dist).reset_index()
main_id day dist
0 1 1 13.499351
1 1 2 57.167033
2 2 4 6.747775
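The question also asked to carry a time value per group. One way to attach it (a sketch, taking the first time in each group) is:
result = df.groupby(['main_id', 'day']).apply(total_dist).reset_index()
result['time'] = df.groupby(['main_id', 'day'])['time'].first().values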
EDIT:
If a count is needed:
(pd.DataFrame({'count': df.groupby(['main_id', 'day']).main_id.count()})
   .join(df.groupby(['main_id', 'day']).apply(total_dist)))
Out[157]:
count dist
main_id day
1 1 3 13.499351
2 2 57.167033
2 4 2 6.747775
Just another approach: convert the lat/long to UTM coordinates so that no find_kms function is needed:
import pandas as pd
import numpy as np
import utm

u = utm.from_latlon(df.lat.values, df.long.values)
df['x'], df['y'] = u[0], u[1]  # easting (x) and northing (y) in metres
a = df.groupby(['main_id', 'day'])[['x', 'y']].apply(lambda g: g.diff().replace(np.nan, 0))
df['dist'] = np.sqrt(a.x**2 + a.y**2) / 1000  # distance in km
df1 = df.groupby(['main_id', 'day'])[['dist', 'day']].agg({'day': 'count', 'dist': 'sum'}).rename(columns={'day': 'day_count'})
df1 output (the small differences from the geodesic results above come from the distortion of the UTM projection):
day_count dist
main_id day
1 1 3 13.494554
2 2 57.145276
2 4 2 6.745386
I have a data frame like below
STORE_ID LATITUDE LONGITUDE GROUP
1 18.2738 28.38833 2
2 18.3849 28.29374 1
3 18.3948 28.29303 1
4 18.1949 28.28248 1
5 18.2947 28.47392 1
6 18.7493 28.29475 2
7 18.4729 28.38392 3
8 18.1927 28.29485 2
9 18.2948 28.29384 1
10 18.1038 28.29489 3
11 18.7482 28.29374 1
12 18.9283 28.28484 2
And a second data frame like below
Tele_Booth_ID LATITUDE LONGITUDE
1 18.5638 28.19374
2 18.2947 28.03727
3 18.3849 28.26395
4 18.9482 28.91847
The first data frame has longitudes and latitudes of stores in a certain area. The stores are grouped into clusters represented by the GROUP field.
The second dataframe has longitudes and latitudes for telephone booths in that same area.
Using both these data frames I want to find the optimal locations to place more telephone booths.
If a store group has no telephone booths in the cluster or near the cluster, I would want to put a booth there. If a store group has a booth within the cluster, I would not want another booth there.
Using Python, how can I calculate the center point for each store group and then the distance from each store group to the nearest booth?
While I am unsure how accurate a simple centroid of the points will be, you can group on the store groups and create a new dataframe containing the mean Lat and Lon of each group as follows.
Given a df1 as shown:
Store Lat Lon Group
0 1 18.2738 28.38833 2
1 3 18.3948 28.29303 1
2 4 18.1949 28.28248 1
3 5 18.2947 28.47392 1
4 6 18.7493 28.29475 2
5 7 18.4729 28.38392 3
6 8 18.1927 28.29485 2
7 9 18.2948 28.29384 1
8 10 18.1038 28.29489 3
9 11 18.7482 28.29374 1
10 12 18.9283 28.28484 2
Create a dataframe of the mean (centre) Lat and Lon as follows:
dfc = df1.groupby('Group').Lat.mean().to_frame().join(df1.groupby('Group').Lon.mean().to_frame())
This will yield the dfc dataframe shown below:
Lat Lon
Group
1 18.385480 28.327402
2 18.536025 28.315693
3 18.288350 28.339405
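Equivalently, the same frame can be produced with a single groupby call:
dfc = df1.groupby('Group')[['Lat', 'Lon']].mean()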
You can now utilize these mean Lat/Lon points to compute the distance to the telephone booths as follows.
The booth dataframe is df2 shown below:
Booth Lat Lon
0 1 18.5638 28.19374
1 2 18.2947 28.03727
2 3 18.3849 28.26395
3 4 18.9482 28.91847
# compute the straight-line distance between two coordinate pairs
def dist(loc1: tuple[float, float], loc2: tuple[float, float]) -> float:
    dx = loc1[0] - loc2[0]
    dy = loc1[1] - loc2[1]
    return (dx**2 + dy**2)**0.5
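For example, group 1's centroid against booth 3 reproduces the B-03 value in the table below:
dist((18.385480, 28.327402), (18.3849, 28.26395))   # ≈ 0.063455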
Using the above, compute the distance from each group centroid to each booth as follows:
for i in range(df2.shape[0]):
    dfc[f'B-{i+1:02}'] = dfc.apply(
        lambda row: dist((row.Lat, row.Lon), tuple(df2.iloc[i].to_list()[1:])),
        axis=1)
This yields the following:
Lat Lon B-01 B-02 B-03 B-04
Group
1 18.385480 28.327402 0.222853 0.304003 0.063455 0.816098
2 18.536025 28.315693 0.125075 0.368452 0.159737 0.730225
3 18.288350 28.339405 0.311594 0.302202 0.122537 0.877906
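From here, the "distance to the nearest booth" the question asks for is just a row-wise minimum over the booth columns created above (a short sketch):
booth_cols = [c for c in dfc.columns if c.startswith('B-')]
dfc['nearest_dist'] = dfc[booth_cols].min(axis=1)
dfc['nearest_booth'] = dfc[booth_cols].idxmin(axis=1)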
I have a df with many columns. I am currently using the following command:
output = df.join(df.expanding().std().fillna(0).add_prefix("SD"))
to generate a standard-deviation column for column A based on cumulative values, like this:
A SDA
1 x1
2 x2
3 x3
4 x4
5 x5
Where x1 is the SD of 1; x2 is the SD of 1, 2; x5 is the SD of 1, 2, 3, 4, 5; and so on.
I want to move the window in such a way that after it moves to 11, the SD will be calculated on the values 2 to 11.
A SDA
1 x1
2 x2
3 x3
.. ..
9 x9
10 x10
11 x11
12 x12
13 x13
.. ..
20 x20
21 x21
22 x22
So here x11 will be the standard deviation of the values 2, 3, 4, ..., 11, and x12 of the values 2 to 12; thus x20 will be based on 2 to 20. After 20 values the window moves one step again, so x21 will be the SD of 3, 4, 5, ..., 21, x22 will be based on the values 3 to 22, and so on. I want to do such an operation on multiple columns and generate multiple SD columns at a time.
I am not sure how to use the expanding function for this kind of moving window.
For calculating the mean in the same way, shall I just use mean() in place of std()?
You only need to determine the lower and upper bound for each row; then it's easy:
from datar.all import (
    f, across, c, tibble, mutate, row_number, std, rnorm
)

# Create an example df
df = tibble(A=rnorm(22), B=rnorm(22))

def stdev(col, lower, upper):
    """Calculate the stdev with the lower- and upper-bounds of the current column"""
    return [std(col[low:up]) for low, up in zip(lower, upper)]

(
    df
    >> mutate(
        # create the lower- and upper-bound
        # note it's 0-based
        upper=row_number(),
        lower=(f.upper - 1) // 10,
    )
    >> mutate(
        across(
            # Apply the stdev func to each column except the lower and upper columns
            ~c(f.lower, f.upper),
            stdev,
            lower=f.lower,
            upper=f.upper,
            _names="SD{_col}",
        )
    )
)
A B upper lower SDA SDB
<float64> <float64> <int64> <int64> <float64> <float64>
0 -0.399324 0.740135 1 0 NaN NaN
1 -0.023364 0.468155 2 0 0.265844 0.192318
2 0.839819 -0.899893 3 0 0.635335 0.878940
3 -0.788705 0.497236 4 0 0.695902 0.744258
4 1.838374 -0.153098 5 0 1.053171 0.663758
5 0.174278 -0.938773 6 0 0.943238 0.736899
6 0.265525 1.906103 7 0 0.861060 0.998927
7 0.484971 1.687661 8 0 0.800723 1.058484
8 0.238861 1.378369 9 0 0.749275 1.041054
9 1.068637 -0.075925 10 0 0.747869 0.999481
10 -1.742042 -0.192375 11 1 0.984440 1.013941
11 -0.779599 -1.944827 12 1 0.982807 1.188045
12 -0.478696 0.211798 13 1 0.954120 1.132865
13 -2.598185 -0.747964 14 1 1.179397 1.113613
14 -0.308082 0.171333 15 1 1.134297 1.070135
15 0.700852 -2.719584 16 1 1.113848 1.261954
16 0.917145 0.375815 17 1 1.104229 1.224715
17 1.343796 -0.796525 18 1 1.118582 1.199169
18 1.024335 -0.943663 19 1 1.108354 1.180068
19 -0.877742 -0.431288 20 1 1.101227 1.148623
20 -0.584439 -0.555945 21 2 1.111302 1.141233
21 -0.946391 -1.550432 22 2 1.103871 1.149968
You can finally remove the lower and upper columns using select():
df >> select(~c(f.lower, f.upper))
Disclaimer: I am the author of datar, which is a wrapper around pandas to implement some features from dplyr/tidyr in R.
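If you prefer to stay in plain pandas, the same lower/upper-bound idea can be written directly. Below is a minimal sketch (my own, using hypothetical columns A and B); substitute mean() for std() to get the running mean asked about at the end of the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.randn(22), 'B': np.random.randn(22)})
upper = np.arange(1, len(df) + 1)   # 1-based row number (exclusive slice end)
lower = (upper - 1) // 10           # lower bound advances every 10 rows
for col in ['A', 'B']:
    df[f'SD{col}'] = [df[col].iloc[lo:up].std() for lo, up in zip(lower, upper)]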
Well, I have the following columns:
Id PlayId X Y
0 0 2.3 3.4
1 0 5.4 3.2
2 1 3.2 5.1
3 1 4.2 1.7
If I have two rows grouped by one PlayId, I want to add Distance and Angle columns:
Id PlayId X Y Distance_0 Distance_1 Angle_0 Angle_1
0 0 2.3 3.4 0.0 ? 0.0 ?
1 0 5.4 3.2 ? 0.0 ? 0.0
2 1 3.2 5.1
3 1 4.2 1.7
Every Distance column describes the Euclidean distance between the i-th and j-th elements in a group:
dist(x0, x1, y0, y1) = sqrt((x0 - x1) ** 2 + (y0 - y1) ** 2)
In a similar way, the angle between the i-th and j-th elements is calculated.
So, how can I perform this efficiently, without processing the elements one by one?
You can compute the pairwise distances by using the pdist function from SciPy:
df = pd.DataFrame({'X': [5, 6, 7], 'Y': [3, 4, 5]})
# df
# X Y
# 0 5 3
# 1 6 4
# 2 7 5
from scipy.spatial.distance import pdist, squareform
cols = [f'Distance_{i}' for i in range(len(df))]
pd.DataFrame(squareform(pdist(df.values)), columns=cols)
which produces the following DataFrame:
   Distance_0  Distance_1  Distance_2
0    0.000000    1.414214    2.828427
1    1.414214    0.000000    1.414214
2    2.828427    1.414214    0.000000
This works since pdist takes an array of size m * n, where m is the number of observations (rows) and n is the dimension of those observations (in this case two: X and Y).
You could subsequently concat the original DataFrame with the newly created one if needed (using pd.concat).
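For instance, reusing cols from above (a sketch):
dist_df = pd.DataFrame(squareform(pdist(df.values)), columns=cols)
df = pd.concat([df, dist_df], axis=1)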
For the angle, you could use pdist as well, using metric='cosine' to compute the cosine distance. See this post for more information.
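For the bearing-style angles shown in the question (Angle_0, Angle_1, ...), one possible sketch uses np.arctan2 on the same df (my addition, not part of the pdist answer):
import numpy as np

xy = df[['X', 'Y']].to_numpy()
dx = xy[:, 0][None, :] - xy[:, 0][:, None]   # x_j - x_i for every pair (i, j)
dy = xy[:, 1][None, :] - xy[:, 1][:, None]   # y_j - y_i for every pair (i, j)
angles = np.degrees(np.arctan2(dy, dx))      # angle from point i to point j
angle_cols = [f'Angle_{i}' for i in range(len(df))]
angle_df = pd.DataFrame(angles, columns=angle_cols)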
I've got to print percentages, but the trick is that I have to round the values to 4 decimals.
It is in a DataFrame where each column represents the percentages for one allocation.
Sometimes the sum of the percentages does not give 1 but 0.9999 or 1.0001 (which makes sense). But how do you make sure it does?
You have to arbitrarily pick a row and put the delta in it.
I've come up with this solution, but I've got to iterate through each column and do the modification on the Series.
Code
import numpy as np
import pandas as pd

df = abs(pd.DataFrame(np.random.randn(4, 4), columns=range(0, 4)))
# Making sure the sum of each allocation is 1.
df = df / df.sum()
# Rounding the allocation
df = df.round(4)
print("-- before --")
print(df)
print(df.sum())
# After rounding, the sum may no longer equal 1 (imagine rounding 1/3
# three times...), so check the sum of each column and put the delta
# in the fund with the lowest value.
for p in df:
    if df[p].sum() != 1:
        # get the id of the fund with the lowest percentage (but not 0)
        low_id = df[p][df[p] != 0].idxmin()
        df.loc[low_id, p] += 1 - df[p].sum()
print("-- after --")
print(df)
print(df.sum())
Output
-- before --
0 1 2 3
0 0.0116 0.1256 0.4980 0.3738
1 0.2562 0.5458 0.3086 0.1221
2 0.4853 0.0009 0.0588 0.0078
3 0.2470 0.3277 0.1346 0.4962
0 1.0001
1 1.0000
2 1.0000
3 0.9999
dtype: float64
-- after --
0 1 2 3
0 0.0115 0.1256 0.4980 0.3738
1 0.2562 0.5458 0.3086 0.1221
2 0.4853 0.0009 0.0588 0.0079
3 0.2470 0.3277 0.1346 0.4962
0 1.0
1 1.0
2 1.0
3 1.0
dtype: float64
Is there any faster solution?
Thanks a lot,
Regards,
Julien
It is always better to avoid loops.
import numpy as np
import pandas as pd

df = abs(pd.DataFrame(np.random.randn(4, 4)))
df = df / df.sum()
df = df.round(4)

dftemp = pd.DataFrame(columns=['Sum', 'Min', 'submin'])
dftemp['Sum'] = df.sum(axis=0)                           # sum of each column
dftemp['Min'] = df[df != 0].min(axis=0)                  # non-zero minimum of each column
dftemp['submin'] = dftemp['Min'] + (1 - dftemp['Sum'])   # minimum value plus the delta (1 - sum)
dftemp['FinalValue'] = np.where(dftemp['Sum'] != 1, dftemp.submin, dftemp.Min)  # decide whether to keep the existing minimum or use the adjusted one

print('\n\nBefore \n\n ', df, '\n\n ', df.sum())
# replace each column's minimum value with its FinalValue
df = df.mask(df.eq(df.min(0), 1), df.eq(df.min(0), 1).mul(dftemp['FinalValue'].tolist()))
print('After \n\n ', df, '\n\n ', df.sum())
Output
Before
0 1 2 3
0 0.1686 0.0029 0.1055 0.1739
1 0.5721 0.5576 0.2904 0.2205
2 0.0715 0.2749 0.4404 0.5014
3 0.1878 0.1647 0.1637 0.1042
0 1.0000
1 1.0001
2 1.0000
3 1.0000
dtype: float64
After
0 1 2 3
0 0.1686 0.0028 0.1055 0.1739
1 0.5721 0.5576 0.2904 0.2205
2 0.0715 0.2749 0.4404 0.5014
3 0.1878 0.1647 0.1637 0.1042
0 1.0
1 1.0
2 1.0
3 1.0
dtype: float64
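A compact variant of the same idea (my sketch, not the answer above): compute each column's shortfall once and add it to the row holding that column's smallest non-zero value. The remaining loop runs over a handful of columns, not over rows:
delta = 1 - df.sum()                    # per-column shortfall after rounding
min_rows = df.where(df != 0).idxmin()   # row label of each column's smallest non-zero value
for col in df.columns:
    df.loc[min_rows[col], col] += delta[col]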