Calculating distance from centroid to another point in Python

I have a data frame like below
STORE_ID LATITUDE LONGITUDE GROUP
1 18.2738 28.38833 2
2 18.3849 28.29374 1
3 18.3948 28.29303 1
4 18.1949 28.28248 1
5 18.2947 28.47392 1
6 18.7493 28.29475 2
7 18.4729 28.38392 3
8 18.1927 28.29485 2
9 18.2948 28.29384 1
10 18.1038 28.29489 3
11 18.7482 28.29374 1
12 18.9283 28.28484 2
And a second data frame like below
Tele_Booth_ID LATITUDE LONGITUDE
1 18.5638 28.19374
2 18.2947 28.03727
3 18.3849 28.26395
4 18.9482 28.91847
The first data frame has the longitudes and latitudes of stores in a certain area. The stores are grouped into clusters, represented by the GROUP field.
The second dataframe has longitudes and latitudes for telephone booths in that same area.
Using both these data frames I want to find the optimal locations to place more telephone booths.
If a store group has no telephone booths in the cluster or near the cluster, I would want to put a booth there. If a store group has a booth within the cluster, I would not want another booth there.
Using python how can I calculate the center point for each store group and then calculate the distance of each store group to the nearest booth?

While I am unsure how accurate this computation of the centroid might be, you can use the ability to group on store groups to create a new dataframe containing the mean Lat and Lon of each group as follows:
Given a df1 as shown:
Store Lat Lon Group
0 1 18.2738 28.38833 2
1 3 18.3948 28.29303 1
2 4 18.1949 28.28248 1
3 5 18.2947 28.47392 1
4 6 18.7493 28.29475 2
5 7 18.4729 28.38392 3
6 8 18.1927 28.29485 2
7 9 18.2948 28.29384 1
8 10 18.1038 28.29489 3
9 11 18.7482 28.29374 1
10 12 18.9283 28.28484 2
Create a dataframe of the centroid Lat and Lon as follows:
# take the mean Lat/Lon of each store group in a single groupby
dfc = df1.groupby('Group')[['Lat', 'Lon']].mean()
This will yield the dfc dataframe shown below:
Lat Lon
Group
1 18.385480 28.327402
2 18.536025 28.315693
3 18.288350 28.339405
You can now utilize these mean Lat/Lon points to compute the distance to the telephone booths as follows:
The booth dataframe is df2 shown below:
Booth Lat Lon
0 1 18.5638 28.19374
1 2 18.2947 28.03727
2 3 18.3849 28.26395
3 4 18.9482 28.91847
# compute the straight-line distance between two sets of coordinates
def dist(loc1: tuple[float, float], loc2: tuple[float, float]) -> float:
    dx = loc1[0] - loc2[0]
    dy = loc1[1] - loc2[1]
    return (dx**2 + dy**2)**0.5
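Note that a straight-line distance on raw degrees is only a relative measure, hence the accuracy caveat above. If true geographic distances matter, a haversine function could be dropped in instead; this is a sketch of the standard formula, not part of the original answer:
from math import radians, sin, cos, asin, sqrt

def haversine_km(loc1: tuple[float, float], loc2: tuple[float, float]) -> float:
    # great-circle distance between two (lat, lon) points, in kilometres
    lat1, lon1, lat2, lon2 = map(radians, (*loc1, *loc2))
    a = sin((lat2 - lat1) / 2)**2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2)**2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km is the mean Earth radius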
Using dist above, compute the distance from each group centroid to each booth as follows:
for i in range(df2.shape[0]):
    dfc[f'B-{i+1:02}'] = dfc.apply(
        lambda row: dist((row.Lat, row.Lon), tuple(df2.iloc[i].to_list()[1:])),
        axis=1)
This yields the following:
Lat Lon B-01 B-02 B-03 B-04
Group
1 18.385480 28.327402 0.222853 0.304003 0.063455 0.816098
2 18.536025 28.315693 0.125075 0.368452 0.159737 0.730225
3 18.288350 28.339405 0.311594 0.302202 0.122537 0.877906
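To get at the "nearest booth" part of the question, one possible follow-up is to take the row-wise minimum over the booth columns; the nearest_booth and nearest_dist names below are my own, for illustration:
# pick the closest booth per group from the B-01..B-04 distance columns
booth_cols = [c for c in dfc.columns if c.startswith('B-')]
dfc['nearest_booth'] = dfc[booth_cols].idxmin(axis=1)
dfc['nearest_dist'] = dfc[booth_cols].min(axis=1)
Groups whose nearest_dist exceeds whatever threshold you consider "near the cluster" are then candidates for a new booth.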

Related

run a function and do sum with next row

My dataset df looks like this:
main_id time day lat long
1 2019-05-31 1 53.5501667 9.9716466
1 2019-05-31 1 53.6101545 9.9568781
1 2019-05-30 1 53.5501309 9.9716300
1 2019-05-30 2 53.5501309 9.9716300
1 2019-05-30 2 53.4561309 9.1246300
2 2019-06-31 4 53.5501667 9.9716466
2 2019-06-31 4 53.6101545 9.9568781
I want to find the total distance covered by each main_id item for each day. To calculate the distance between two sets of coordinates, I can use this function:
import geopy.distance

def find_kms(coords_1, coords_2):
    # geodesic distance between two (lat, long) tuples, in kilometres
    return geopy.distance.geodesic(coords_1, coords_2).km
but I am not sure how I can sum it by grouping for main_id and day. The end result could be a new df like this:
main_id day total_dist time
1 1 ... 2019-05-31
1 2 .... 2019-05-31
2 4 .... 2019-05-31
where the derived time is any value (e.g. the first) from the time column of the respective main_id and day.
total_dist calculation:
For example, for main_id == 1 and day 1, the total_dist would be calculated like this:
find_kms((53.5501667, 9.9716466), (53.6101545, 9.9568781)) + find_kms((53.6101545, 9.9568781), (53.5501309, 9.9716300))
Note that your function is not vectorized, which makes this harder than it needs to be.
(df.assign(dist=df.join(df.groupby(['main_id', 'day'])[['lat', 'long']].shift(),
                        rsuffix='1')
             .bfill()
             .reset_index()
             .groupby('index')
             .apply(lambda x: find_kms(x[['lat', 'long']].values,
                                       x[['lat1', 'long1']].values)))
   .groupby(['main_id', 'day'])['dist'].sum().reset_index())
main_id day dist
0 1 1 13.499279
1 1 2 57.167034
2 2 4 6.747748
Another option is to use reduce:
from functools import reduce

def total_dist(x):
    # fold over the coordinate pairs, accumulating (running_distance, previous_point)
    coords = x[['lat', 'long']].values
    lam = lambda acc, y: (find_kms(acc[1], y) + acc[0], y)
    dist = reduce(lam, coords, (0, coords[0]))[0]
    return pd.Series({'dist': dist})

df.groupby(['main_id', 'day']).apply(total_dist).reset_index()
main_id day dist
0 1 1 13.499351
1 1 2 57.167033
2 2 4 6.747775
EDIT:
If count is needed:
(pd.DataFrame({'count': df.groupby(['main_id', 'day']).main_id.count()})
   .join(df.groupby(['main_id', 'day']).apply(total_dist)))
Out[157]:
count dist
main_id day
1 1 3 13.499351
2 2 57.167033
2 4 2 6.747775
Just another approach: convert the lat/long to UTM coordinates, which avoids the find_kms function entirely:
import pandas as pd
import numpy as np
import utm

# convert lat/long in degrees to UTM easting/northing in metres
u = utm.from_latlon(df.lat.values, df.long.values)
df['x'], df['y'] = u[0], u[1]  # u[0] is the easting, u[1] the northing

# consecutive point-to-point differences within each (main_id, day) group
a = df.groupby(['main_id', 'day'])[['x', 'y']].apply(lambda g: g.diff().replace(np.nan, 0))
df['dist'] = np.sqrt(a.x**2 + a.y**2) / 1000  # distance in km

df1 = (df.groupby(['main_id', 'day'])[['dist', 'day']]
         .agg({'day': 'count', 'dist': 'sum'})
         .rename(columns={'day': 'day_count'}))
df1 output:
day_count dist
main_id day
1 1 3 13.494554
2 2 57.145276
2 4 2 6.745386

Errorbar plot for Likert scale confidence values

I have the following dataset, for 36 fragments in total (36 rows × 3 columns):
Fragment lower upper
0 1 1 5
1 2 2 5
2 3 3 5
3 4 2 5
4 5 1 5
5 6 1 5
I've calculated these lower and upper bounds from this dataset (966 rows × 2 columns):
Fragment Confidence Value
0 33 4
1 26 4
2 23 3
3 16 2
4 36 3
which contains multiple instances of a fragment and an associated confidence value.
The confidence values are data from a Likert scale, i.e. 1-5. I want to create an error bar plot, for example like this:
So on the y-axis I want each fragment 1-36, and on the x-axis the range/std/mean (?) of the confidence values for each fragment.
I've tried the following, but it's not exactly what I want; I think using the lower and upper bounds isn't the best idea, and maybe I need the std/range instead...
# confpd is the second dataset from above
meanconfs = confpd.groupby('Fragment', as_index=False)['Confidence Value'].mean()
minconfs = confpd.groupby('Fragment', as_index=False)['Confidence Value'].min()
maxconfs = confpd.groupby('Fragment', as_index=False)['Confidence Value'].max()

data_dict = {}
data_dict['Fragment'] = [str(i) for i in range(1, 37)]  # '1' .. '36'
data_dict['lower'] = minconfs['Confidence Value']
data_dict['upper'] = maxconfs['Confidence Value']
dataset = pd.DataFrame(data_dict)
# dataset is the first dataset I show above

for lower, upper, y in zip(dataset['lower'], dataset['upper'], range(len(dataset))):
    plt.plot((lower, upper), (y, y), 'ro-', color='orange')
plt.yticks(range(len(dataset)), list(dataset['Fragment']))
The result of this code is this, which is not what I want.
Any help is greatly appreciated!!
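In case it is useful, here is a minimal sketch of an errorbar plot showing mean ± standard deviation per fragment, assuming confpd as defined above (the stats name is illustrative, not from the question):
import matplotlib.pyplot as plt

# mean and std of the confidence values per fragment
stats = confpd.groupby('Fragment')['Confidence Value'].agg(['mean', 'std'])

# horizontal error bars: fragments on the y-axis, confidence on the x-axis
plt.errorbar(stats['mean'], stats.index, xerr=stats['std'],
             fmt='o', color='orange', capsize=3)
plt.xlabel('Confidence Value (1-5)')
plt.ylabel('Fragment')
plt.show()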

Using df.apply on a function with multiple inputs to generate multiple outputs

I have a dataframe that looks like this
initial year0 year1
0 0 12
1 1 13
2 2 14
3 3 15
Note that the number of year columns (year0, year1, ..., i.e. year_count) is completely variable, but will be constant throughout this code.
I first wanted to apply a function to each of the 'year' columns to generate 'mod' columns, like so:
def mod(year, scalar):
    return year * scalar
s = 5
year_count = 2
# Generate new columns
df[[f"mod{y}" for y in range (year_count)]] = df[[f"year{y}" for y in range(year_count)]].apply(mod, scalar=s)
initial year0 year1 mod0 mod1
0 0 12 0 60
1 1 13 5 65
2 2 14 10 70
3 3 15 15 75
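As an aside, since mod is just a scalar multiply, the same mod columns could be produced without apply at all; a small sketch under the same naming (using .to_numpy() to sidestep column-name alignment during assignment):
year_cols = [f"year{y}" for y in range(year_count)]
mod_cols = [f"mod{y}" for y in range(year_count)]
df[mod_cols] = df[year_cols].to_numpy() * s  # broadcast the scalar across all year columns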
All good so far. The problem is that I now want to apply another function to both the year column and its corresponding mod column to generate another set of val columns, so something like
def sum_and_scale(year_col, mod_col, scale):
    return (year_col + mod_col) * scale
Then I apply this to each of the column pairs (year0, mod0), (year1, mod1), etc. to generate the next tranche of columns.
With scale = 10 I should end up with
initial year0 year1 mod0 mod1 val0 val1
0 0 12 0 60 0 720
1 1 13 5 65 60 780
2 2 14 10 70 120 840
3 3 15 15 75 180 900
This is where I'm stuck - I don't know how to put two existing df columns together in a function with the same structure as in the first example, and if I do something like
df[['val0', 'val1']] = df['col1', 'col2'].apply(lambda x: sum_and_scale('mod0', 'mod1', scale=10))
I don't know how to generalise this to have arbitrary inputs and outputs, and also apply the constant scale parameter. (I know the last piece of code won't work, but it's the other avenue to a solution I've seen.)
The reason I'm asking is that I believe the loop I currently have working is creating performance issues, given the number of columns and the length of each column.
Thanks
IMHO, it's better with a simple for loop:
for i in range(2):  # i.e. range(year_count) in general
    df[f'val{i}'] = sum_and_scale(df[f'year{i}'], df[f'mod{i}'], scale=10)
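If the loop itself ever becomes the bottleneck, a hedged vectorised alternative (assuming the year/mod naming above) is to compute every val column in one numpy operation:
# build the column name lists once, then compute all vals in one shot
year_cols = [f"year{y}" for y in range(year_count)]
mod_cols = [f"mod{y}" for y in range(year_count)]
val_cols = [f"val{y}" for y in range(year_count)]
df[val_cols] = (df[year_cols].to_numpy() + df[mod_cols].to_numpy()) * 10
That said, column-wise pandas arithmetic is already vectorised down the rows, so the simple loop is rarely the problem.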

Cluster Rows in Data Subgroups

I have a dataset df of object components in 3-d space - each ID represents an object which has various components:
ID Comp x y z
A 1 2 2 1
A 2 2 1 -1
A 3 -10 1 -10
A 4 -1 3 -5
B 1 3 0 0
B 2 3 0 -5
...
I would like to loop through each ID, using a clustering technique in sklearn to create clusters of components (Comp) based on each component's (x,y,z) co-ordinates - to achieve something like this:
ID Comp x y z cluster
A 1 2 2 1 1
A 2 2 1 -1 1
A 3 -10 1 -10 2
A 4 -1 3 -5 3
B 1 3 0 0 1
B 2 3 0 -5 1
...
As an example, ID:A, Comp:1 is in cluster 1, whereas ID:A, Comp:4 is in cluster 3. (I plan to then concatenate ID and cluster later.)
I'm having no luck with the following groupby + apply:
from sklearn.cluster import AffinityPropagation
ap = AffinityPropagation()
df['cluster']=df.groupby(['ID','Comp']).apply(lambda x: ap.fit_predict(np.array([x.x,x.y,x.z]).T))
I could brute-force it by using a for loop over the IDs, but my dataset is large (~150k IDs) and I'm worried about resource and time constraints. Any help would be great!
IIUC, I think you could try something like this:
def ap_fit_pred(x):
    ap = AffinityPropagation()
    return pd.Series(ap.fit_predict(x.loc[:, ['x', 'y', 'z']]))

df['cluster'] = df.groupby('ID').apply(ap_fit_pred).reset_index(drop=True)
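One caveat worth hedging: the final reset_index(drop=True) aligns the cluster labels with df purely by position, which assumes df is already ordered by ID (groupby sorts the groups by key). If that might not hold, sorting first keeps the alignment honest:
# assumption: positional alignment requires the rows to be ordered by ID
df = df.sort_values('ID').reset_index(drop=True)
df['cluster'] = df.groupby('ID').apply(ap_fit_pred).reset_index(drop=True)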

Find euclidean distance from a point to rows in pandas dataframe

I have a dataframe
id lat long
1 12.654 15.50
2 14.364 25.51
3 17.636 32.53
5 12.334 25.84
9 32.224 15.74
I want to find the Euclidean distance of these coordinates from a particular location saved in a list L1
L1 = [11.344, 7.234]
I want to create a new column in df where I have the distances
id lat long distance
1 12.654 15.50
2 14.364 25.51
3 17.636 32.53
5 12.334 25.84
9 32.224 15.74
I know how to find the Euclidean distance between two points using math.hypot():
dist = math.hypot(x2 - x1, y2 - y1)
How do I write a function using apply, or iterate over the rows, to give me the distances?
Use a vectorized approach:
In [5463]: (df[['lat', 'long']] - np.array(L1)).pow(2).sum(1).pow(0.5)
Out[5463]:
0 8.369161
1 18.523838
2 26.066777
3 18.632320
4 22.546096
dtype: float64
Which can also be
In [5468]: df['distance'] = df[['lat', 'long']].sub(np.array(L1)).pow(2).sum(1).pow(0.5)
In [5469]: df
Out[5469]:
id lat long distance
0 1 12.654 15.50 8.369161
1 2 14.364 25.51 18.523838
2 3 17.636 32.53 26.066777
3 5 12.334 25.84 18.632320
4 9 32.224 15.74 22.546096
Option 2: use NumPy's built-in vector norm, np.linalg.norm.
In [5473]: np.linalg.norm(df[['lat', 'long']].sub(np.array(L1)), axis=1)
Out[5473]: array([ 8.36916101, 18.52383805, 26.06677732, 18.63231966, 22.5460958 ])
In [5485]: df['distance'] = np.linalg.norm(df[['lat', 'long']].sub(np.array(L1)), axis=1)
Translating [(x2 - x1)^2 + (y2 - y1)^2]^(1/2) into pandas vectorised operations, you have:
df['distance'] = (df.lat.sub(11.344).pow(2).add(df.long.sub(7.234).pow(2))).pow(.5)
df
lat long distance
id
1 12.654 15.50 8.369161
2 14.364 25.51 18.523838
3 17.636 32.53 26.066777
5 12.334 25.84 18.632320
9 32.224 15.74 22.546096
Alternatively, using arithmetic operators:
(((df.lat - 11.344) ** 2) + (df.long - 7.234) ** 2) ** .5
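For completeness, NumPy also ships a vectorised hypot that mirrors the math.hypot one-liner from the question; assuming L1 = [lat, long] as above:
import numpy as np

# element-wise hypot over the whole column, no apply or row iteration needed
df['distance'] = np.hypot(df['lat'] - L1[0], df['long'] - L1[1])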
