Vectorization to calculate many distances

Vectorization to calculate many distances - python

I am new to numpy/pandas and vectorized computation. I am doing a data task where I have two datasets. Dataset 1 contains a list of places with their longitude and latitude and a variable A. Dataset 2 also contains a list of places with their longitude and latitude. For each place in dataset 1, I would like to calculate its distances to all the places in dataset 2 but I would only like to get a count of places in dataset 2 that are less than the value of variable A. Note also both of the datasets are very large, so that I need to use vectorized operations to expedite the computation.
For example, my dataset1 may look like below:
id lon lat varA
1 20.11 19.88 100
2 20.87 18.65 90
3 18.99 20.75 120
and my dataset2 may look like below:
placeid lon lat
a 18.75 20.77
b 19.77 22.56
c 20.86 23.76
d 17.55 20.74
Then for id == 1 in dataset1, I would like to calculate its distances to all four points (a,c,c,d) in dataset2 and I would like to have a count of how many of the distances are less than the corresponding value of varA. For example, the four distances calculated are 90, 70, 120, 110 and varA is 100. Then the value should be 2.
I already have a vectorized function to calculate distance between the two pair of coordinates. Suppose the function (haversine(x,y)) is properly implemented, I have the following code.
dataset2['count'] = dataset1.apply(lambda x:
haversine(x['lon'],x['lat'],dataset2['lon'], dataset2['lat']).shape[0], axis
= 1)
However, this gives the total number of rows, but not the ones that satisfy my requirements.
Would anyone be able to point me how to make the code work?

If you can project the coordinates to a local projection (e.g. UTM), which is pretty straight forward with pyproj and generally more favorable than lon/lat for measurement, then there is a much much MUCH faster way using scipy.spatial. Neither of df['something'] = df.apply(...) and np.vectorize() are not truly vectorized, under the hood, they use looping.
ds1
id lon lat varA
0 1 20.11 19.88 100
1 2 20.87 18.65 90
2 3 18.99 20.75 120
ds2
placeid lon lat
0 a 18.75 20.77
1 b 19.77 22.56
2 c 20.86 23.76
3 d 17.55 20.74
from scipy.spatial import distance
# gey coordinates of each set of points as numpy array
coords_a = ds1.values[:,(1,2)]
coords_b = ds2.values[:, (1,2)]
coords_a
#out: array([[ 20.11, 19.88],
# [ 20.87, 18.65],
# [ 18.99, 20.75]])
distances = distance.cdist(coords_a, coords_b)
#out: array([[ 1.62533074, 2.70148108, 3.95182236, 2.70059253],
# [ 2.99813275, 4.06178532, 5.11000978, 3.92307278],
# [ 0.24083189, 1.97091349, 3.54358575, 1.44003472]])
distances is in fact distance between every pair of points. coords_a.shape is (3, 2) and coords_b.shape is (4, 2), so the result is (3,4). The default metric for np.distance is eculidean, but there are other metrics as well.
For the sake of this example, let's assume vara is:
vara = np.array([2,4.5,2])
(instead of 100 90 120). We need to identify which value in distances in row one is smaller than 2, in row two smaller that 4.5,..., one way to solve this problem is subtracting each value in vara from corresponding row (note that we must resize vara):
vara.resize(3,1)
res = res - vara
#out: array([[-0.37466926, 0.70148108, 1.95182236, 0.70059253],
# [-1.50186725, -0.43821468, 0.61000978, -0.57692722],
# [-1.75916811, -0.02908651, 1.54358575, -0.55996528]])
then setting positive values to zero and making negative values positive will give us the final array:
res[res>0] = 0
res = np.absolute(res)
#out: array([[ 0.37466926, 0. , 0. , 0. ],
# [ 1.50186725, 0.43821468, 0. , 0.57692722],
# [ 1.75916811, 0.02908651, 0. , 0.55996528]])
Now, to sum over each row:
sum_ = res.sum(axis=1)
#out: array([ 0.37466926, 2.51700915, 2.34821989])
and to count the items in each row:
count = np.count_nonzero(res, axis=1)
#out: array([1, 3, 3])
This is a fully vectorized (custom) solution which you can tweak to your liking and should accommodate any level of complexity. yet another solution is cKDTree. the code is from documentation. it should be fairly easy to adopt it to your problem, but in case you need assistance don't hesitate to ask.
x, y = np.mgrid[0:4, 0:4]
points = zip(x.ravel(), y.ravel())
tree = spatial.cKDTree(points)
tree.query_ball_point([2, 0], 1)
[4, 8, 9, 12]
query_ball_point() finds all points within distance r of point(s) x, and it is amazingly fast.
one final note: don't use these algorithms with lon/lat input, particularly if your area of interest is far from equator, because the error can get huge.
UPDATE:
To project your coordinates, you need to convert from WGS84 (lon/lat) to appropriate UTM. To find out which utm zone you should project to use epsg.io.
lon = -122.67598
lat = 45.52168
WGS84 = "+init=EPSG:4326"
EPSG3740 = "+init=EPSG:3740"
Proj_to_EPSG3740 = pyproj.Proj(EPSG3740)
Proj_to_EPSG3740(lon,lat)
# out: (525304.9265963673, 5040956.147893889)
You can do df.apply() and use Proj_to_... to project df.

IIUC:
Source DFs:
In [160]: d1
Out[160]:
id lon lat varA
0 1 20.11 19.88 100
1 2 20.87 18.65 90
2 3 18.99 20.75 120
In [161]: d2
Out[161]:
placeid lon lat
0 a 18.75 20.77
1 b 19.77 22.56
2 c 20.86 23.76
3 d 17.55 20.74
Vectorized haversine function:
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
if to_radians:
lat1, lon1, lat2, lon2 = pd.np.radians([lat1, lon1, lat2, lon2])
a = pd.np.sin((lat2-lat1)/2.0)**2 + \
pd.np.cos(lat1) * pd.np.cos(lat2) * pd.np.sin((lon2-lon1)/2.0)**2
return earth_radius * 2 * pd.np.arcsin(np.sqrt(a))
Solution:
x = d2.assign(x=1) \
.merge(d1.loc[d1['id']==1, ['lat','lon']].assign(x=1),
on='x', suffixes=['','2']) \
.drop(['x'], 1)
x['dist'] = haversine(x.lat, x.lon, x.lat2, x.lon2)
yields:
In [163]: x
Out[163]:
placeid lon lat lat2 lon2 dist
0 a 18.75 20.77 19.88 20.11 172.924852
1 b 19.77 22.56 19.88 20.11 300.078600
2 c 20.86 23.76 19.88 20.11 438.324033
3 d 17.55 20.74 19.88 20.11 283.565975
filtering:
In [164]: x.loc[x.dist < d1.loc[d1['id']==1, 'varA'].iat[0]]
Out[164]:
Empty DataFrame
Columns: [placeid, lon, lat, lat2, lon2, dist]
Index: []
let's change d1, so a few rows would satisfy the criteria:
In [171]: d1.loc[0, 'varA'] = 350
In [172]: d1
Out[172]:
id lon lat varA
0 1 20.11 19.88 350 # changed: 100 --> 350
1 2 20.87 18.65 90
2 3 18.99 20.75 120
In [173]: x.loc[x.dist < d1.loc[d1['id']==1, 'varA'].iat[0]]
Out[173]:
placeid lon lat lat2 lon2 dist
0 a 18.75 20.77 19.88 20.11 172.924852
1 b 19.77 22.56 19.88 20.11 300.078600
3 d 17.55 20.74 19.88 20.11 283.565975

Use scipy.spatial.distance.cdist with your user-defined distance algorithm as the metric
h = lambda u, v: haversine(u['lon'], u['lat'], v['lon'], v['lat'])
dist_mtx = scipy.spatial.distance.cdist(dataset1, dataset2, metric = h)
Then to check the number in the area, just broadcast it
dataset2['count'] = np.sum(dataset1['A'][:, None] > dist_mtx, axis = 0)

Related

Performing calculations on DataFrames of different lengths

I have two different DataFrames that look something like this:
Lat
Lon
28.13
-87.62
28.12
-87.65
......
......
Calculated_Dist_m
34.5
101.7
..............
The first DataFrame (name=df) (consisting of the Lat and Lon columns) has just over 1000 rows (values) in it. The second DataFrame (name=new_calc_dist) (consisting of the Calculated_Dist_m column) has over 30000 rows (values) in it. I want to determine the new longitude and latitude coordinates using the Lat, Lon, and Calculated_Dist_m columns. Here is the code I've tried:
r_earth = 6371000
new_lat = df['Lat'] + (new_calc_dist['Calculated_Dist_m'] / r_earth) * (180/np.pi)
new_lon = df['Lon'] + (new_calc_dist['Calculated_Dist_m'] / r_earth) * (180/np.pi) / np.cos(df['Lat'] * np.pi/180)
When I run the code, however, it only gives me new calculations for certain index values, and gives me NaNs for the rest. I'm not entirely sure how I should go about writing the code so that new longitude and latitude points are calculated for each of over 30000 row values based on the initial 1000 longitude and latitude points. Any suggestions?
EDIT
Here would be some sample outputs. Note that these are not exact figures, but give the idea.
Lat
Lon
28.13
-87.62
28.12
-87.65
28.12
-87.63
.....
......
Calculated_Dist_m
34.5
101.7
28.6
30.8
76.5
.................
And so the sample out put would be:
Lat
Lon
28.125
-87.625
28.15
-87.61
28.127
-87.623
28.128
-87.623
28.14
-87.615
28.115
-87.655
28.14
-87.64
28.117
-87.653
28.118
-87.653
28.15
-87.645
28.115
-87.635
28.14
-87.62
28.115
-87.613
28.117
-87.633
28.118
-87.633
......
.......
Again, these are just random outputs (I tried getting the exact calculations, but could not get it to work). But overall, this gives an idea of what would be wanted: taking the coordinates from the first dataframe and calculating new coordinates based on each of the calculated distances from the second dataframe.

If I understood correctly and assuming df1 and df2 as input, you can perform a cross merge to get all combinations of df1 and df2 rows, then apply your computation (here as new columns Lat2/Lon2):
df = df1.merge(df2, how='cross')
r_earth = 6371000
df['Lat2'] = df['Lat'] + (df['Calculated_Dist_m'] / r_earth) * (180/np.pi)
df['Lon2'] = df['Lon'] + (df['Calculated_Dist_m'] / r_earth) * (180/np.pi) / np.cos(df['Lat'] * np.pi/180)
output:
Lat Lon Calculated_Dist_m Lat2 Lon2
0 28.13 -87.62 34.5 28.130310 -87.619648
1 28.13 -87.62 101.7 28.130915 -87.618963
2 28.13 -87.62 28.6 28.130257 -87.619708
3 28.13 -87.62 30.8 28.130277 -87.619686
4 28.13 -87.62 76.5 28.130688 -87.619220
5 28.12 -87.65 34.5 28.120310 -87.649648
6 28.12 -87.65 101.7 28.120915 -87.648963
7 28.12 -87.65 28.6 28.120257 -87.649708
8 28.12 -87.65 30.8 28.120277 -87.649686
9 28.12 -87.65 76.5 28.120688 -87.649220
10 28.12 -87.63 34.5 28.120310 -87.629648
11 28.12 -87.63 101.7 28.120915 -87.628963
12 28.12 -87.63 28.6 28.120257 -87.629708
13 28.12 -87.63 30.8 28.120277 -87.629686
14 28.12 -87.63 76.5 28.120688 -87.629220

In case you just want the result as two 2D arrays (without repeats of the input, so also O[m*n] in memory but 2/5 of the requirement from the result of cross-join):
r_earth = 6371000
z = 180 / np.pi * new_calc_dist['Calculated_Dist_m'].values / r_earth
lat = df['Lat'].values
lon = df['Lon'].values
new_lat = lat[:, None] + z
new_lon = lon[:, None] + z / lat[:, None]
Example:
df = pd.DataFrame([[28.13, -87.62], [28.12, -87.65]], columns=['Lat', 'Lon'])
new_calc_dist = pd.DataFrame([[34.5], [101.7], [60.0]], columns=['Calculated_Dist_m'])
# result of above
>>> new_lat
array([[28.13031027, 28.13091461, 28.13053959],
[28.12031027, 28.12091461, 28.12053959]])
>>> new_lon
array([[-87.61998897, -87.61996749, -87.61998082],
[-87.64998897, -87.64996747, -87.64998081]])
If you do want those results as DataFrames:
kwargs = dict(index=df.index, columns=new_calc_dist.index)
new_lat = pd.DataFrame(new_lat, **kwargs)
new_lon = pd.DataFrame(new_lon, **kwargs)

Python: Computing the distance between two point coordinates using two columns

I would like to compute the distance between two coordinates. I know I can compute the haversine distance between two points. However, I was wondering if there is an easier way of doing it instead of creating a loop using the formula iterating over the entire columns (also getting errors in the loop).
Here's some data for the example
# Random values for the duration from one point to another
random_values = random.sample(range(2,20), 8)
random_values
# Creating arrays for the coordinates
lat_coor = [11.923855, 11.923862, 11.923851, 11.923847, 11.923865, 11.923841, 11.923860, 11.923846]
lon_coor = [57.723843, 57.723831, 57.723839, 57.723831, 57.723827, 57.723831, 57.723835, 57.723827]
df = pd.DataFrame(
{'duration': random_values,
'latitude': lat_coor,
'longitude': lon_coor
})
df
duration latitude longitude
0 5 11.923855 57.723843
1 2 11.923862 57.723831
2 10 11.923851 57.723839
3 19 11.923847 57.723831
4 16 11.923865 57.723827
5 4 11.923841 57.723831
6 13 11.923860 57.723835
7 3 11.923846 57.723827
To compute the distance this is what I've attempted:
# Looping over each row to compute the Haversine distance between two points
# Earth's radius (in m)
R = 6373.0 * 1000
lat = df["latitude"]
lon = df["longitude"]
for i in lat:
lat1 = lat[i]
lat2 = lat[i+1]
for j in lon:
lon1 = lon[i]
lon2 = lon[i+1]
dlon = lon2 - lon1
dlat = lat2 - lat1
# Haversine formula
a = math.sin(dlat / 2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2)**2
c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
distance = R * c
print(distance) # in m
However, this is the error I get:
The two points to compute the distance should be taken from the same column.
first distance value:
11.923855 57.723843 (point1/observation1)
11.923862 57.723831 (point2/observation2)
second distance value:
11.923862 57.723831 (point1/observation2)
11.923851 57.723839(point2/observation3)
third distance value:
11.923851 57.723839(point1/observation3)
11.923847 57.723831 (point1/observation4)
... (and so on)

OK, first you can create a dataframe that combine each measurement with the previous one:
df2 = pd.concat([df.add_suffix('_pre').shift(), df], axis=1)
df2
This outputs:
duration_pre latitude_pre longitude_pre duration latitude longitude
0 NaN NaN NaN 5 11.923855 57.723843
1 5.0 11.923855 57.723843 2 11.923862 57.723831
2 2.0 11.923862 57.723831 10 11.923851 57.723839
…
Then create a haversine function and apply it to the rows:
def haversine(lat1, lon1, lat2, lon2):
import math
R = 6373.0 * 1000
dlon = lon2 - lon1
dlat = lat2 - lat1
a = math.sin(dlat / 2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2)**2
return R *2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
df2.apply(lambda x: haversine(x['latitude_pre'], x['longitude_pre'], x['latitude'], x['longitude']), axis=1)
which computes for each row the distance with the previous row (first one is thus NaN).
0 NaN
1 75.754755
2 81.120210
3 48.123604
…
And, if you want to include a new column in the original dataframe in one line:
df['distance'] = pd.concat([df.add_suffix('_pre').shift(), df], axis=1).apply(lambda x: haversine(x['latitude_pre'], x['longitude_pre'], x['latitude'], x['longitude']), axis=1)
Output:
duration latitude longitude distance
0 5 11.923855 57.723843 NaN
1 2 11.923862 57.723831 75.754755
2 10 11.923851 57.723839 81.120210
3 19 11.923847 57.723831 48.123604
4 16 11.923865 57.723827 116.515304
5 4 11.923841 57.723831 154.307571
6 13 11.923860 57.723835 122.794838
7 3 11.923846 57.723827 98.115312

I understood that you want to get the pairwise haversine distance between all points in your df. Here's how this could be done:
Be careful when using this approach with a lot of points as it generates a lot of columns quickly
Setup
import random
random_values = random.sample(range(2,20), 8)
random_values
# Creating arrays for the coordinates
lat_coor = [11.923855, 11.923862, 11.923851, 11.923847, 11.923865, 11.923841, 11.923860, 11.923846]
lon_coor = [57.723843, 57.723831, 57.723839, 57.723831, 57.723827, 57.723831, 57.723835, 57.723827]
df = pd.DataFrame(
{'duration': random_values,
'latitude': lat_coor,
'longitude': lon_coor
})
Get radians
import math
df['lat_rad'] = df.latitude.apply(math.radians)
df['long_rad'] = df.latitude.apply(math.radians)
Calculate pairwise distances
from sklearn.metrics.pairwise import haversine_distances
for idx_from, from_point in df.iterrows():
for idx_to, to_point in df.iterrows():
column_name = f"Distance_to_point_{idx_from}"
haversine_matrix = haversine_distances([[from_point.lat_rad, from_point.long_rad], [to_point.lat_rad, to_point.long_rad]])
point_distance = haversine_matrix[0][1] * 6371000/1000
df.loc[idx_to, column_name] = point_distance
df
duration latitude longitude lat_rad long_rad Distance_to_point_0 Distance_to_point_1 Distance_to_point_2 Distance_to_point_3 Distance_to_point_4 Distance_to_point_5 Distance_to_point_6 Distance_to_point_7
0 3 11.923855 57.723843 0.20811052928038845 0.20811052928038845 0.0 0.0010889626934743966 0.0006222644021223135 0.001244528808978787 0.0015556609862946524 0.002177925427923575 0.000777830496776312 0.0014000949117650525
1 13 11.923862 57.723831 0.2081106514534361 0.2081106514534361 0.0010889626934743966 0.0 0.0017112270955967099 0.002333491502453183 0.0004666982928202561 0.00326688812139797 0.00031113219669808446 0.0024890576052394482
2 14 11.923851 57.723839 0.2081104594672184 0.2081104594672184 0.0006222644021223135 0.0017112270955967099 0.0 0.0006222644068564735 0.002177925388416966 0.0015556610258012616 0.0014000948988986254 0.0007778305096427389
3 4 11.923847 57.723831 0.20811038965404832 0.20811038965404832 0.001244528808978787 0.002333491502453183 0.0006222644068564735 0.0 0.0028001897952734385 0.0009333966189447881 0.002022359305755099 0.0001555661027862654
4 5 11.923865 57.723827 0.20811070381331365 0.20811070381331365 0.0015556609862946524 0.0004666982928202561 0.002177925388416966 0.0028001897952734385 0.0 0.003733586414218225 0.0007778304895183407 0.002955755898059704
5 7 11.923841 57.723831 0.20811028493429318 0.20811028493429318 0.002177925427923575 0.00326688812139797 0.0015556610258012616 0.0009333966189447881 0.003733586414218225 0.0 0.002955755924699886 0.0007778305161585227
6 9 11.92386 57.723835 0.20811061654685106 0.20811061654685106 0.000777830496776312 0.00031113219669808446 0.0014000948988986254 0.002022359305755099 0.0007778304895183407 0.002955755924699886 0.0 0.002177925408541364
7 8 11.923846 57.723827 0.20811037220075576 0.20811037220075576 0.0014000949117650525 0.0024890576052394482 0.0007778305096427389 0.0001555661027862654 0.002955755898059704 0.0007778305161585227 0.002177925408541364 0.0

You are confusing the index versus the values themselves, so you are getting a key error because there is no lat[i] (e.g., lat[11.923855]) in your example. After fixing i to be the index, your code would go beyond the last row of lat and lon with your [i+1]. Since you want to compare each row to the previous row, how about starting at index 1 and looking back by 1, then you won't go out of range. This edited version of your code does not crash:
for i in range(1, len(lat)):
lat1 = lat[i - 1]
lat2 = lat[i]
for j in range(1, len(lon)):
lon1 = lon[i - 1]
lon2 = lon[i]
dlon = lon2 - lon1
dlat = lat2 - lat1
# Haversine formula
a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
distance = R * c
print(distance) # in m

Replace grouped columns' outliers with mean of the group based on defined zscore

I have a very huge dataFrame with many datapoints on a map with outliers which are very close to each other on the dataset(Latitudes and longitudes). I would like to group all the rows as shown below for column A, calculate their zscores and replace every value within a group whose zscore is > 1.5 with the mean value for the group.
df =
[data][1]
I have tried the zscore values table without success
<**zscore = lambda x : (x - x.mean()) / x.std()
grouped_df = df.groupby("A")
transformed_df = grouped_df.transform(zscore)
transformed_df which gives me a table with zscores**>

You can use haversine_distances from scikit-learn to compute the distances between a point and the centroid of the point in the same group. Given that you should have very close points, you can approximate the latitude and longitude of the centroid with the mean of latitude and longitude of points in the group.
Here an example, based on data from UK towns (it is the free sample that you can download from here). In particular, the data contains for each city its coordinates and county (that you can think of as a group in your setting):
name county latitude longitude
0 Aaron's Hill Surrey 51.18291 -0.63098
1 Abbas Combe Somerset 51.00283 -2.41825
2 Abberley Worcestershire 52.30522 -2.37574
3 Abberton Essex 51.83440 0.91066
4 Abberton Worcestershire 52.17955 -2.00817
5 Abberwick Northumberland 55.41325 -1.79720
6 Abbess End Essex 51.78000 0.28172
7 Abbess Roding Essex 51.77815 0.27685
8 Abbey Devon 50.88896 -3.22276
9 Abbeycwmhir / Abaty Cwm-hir Powys 52.33104 -3.38988
And here the code to change to solve your problem:
from math import radians
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import haversine_distances
df = pd.read_csv('uk-towns-sample.csv', usecols=['name', 'county', 'latitude', 'longitude'])
# Compute coordinates of the centroid for each county (group)
dist_county = pd.DataFrame(df.groupby('county').agg({'latitude': np.mean, 'longitude': np.mean}))
# Convert latitude and longitude to radians (it is needed by the function to compute haversine distance)
df[['latitude_radians', 'longitude_radians']] = df[['latitude', 'longitude']].applymap(radians)
dist_county[['latitude_radians', 'longitude_radians']] = dist_county[['latitude', 'longitude']].applymap(radians)
# Compute the distance of each town w.r.t. the centroid of its conunty
df['dist'] = df[['county', 'latitude_radians', 'longitude_radians']].apply(
lambda x: haversine_distances(
[x[['latitude_radians', 'longitude_radians']].values],
[dist_county.loc[x['county']][['latitude_radians', 'longitude_radians']].values]
)[0][0] * 6371000/1000, # multiply by Earth radius to get kilometers,
axis=1
)
# Compute mean and std of distances by county
county_stats = df.groupby('county').agg({'dist': [np.mean, np.std]})
# Compute the z-score using the distance of each town w.r.t. the centroid of its county, and the mean and std of distances for that county
df['zscore'] = df.apply(
lambda x: (x['dist'] - county_stats.loc[x['county']][('dist', 'mean')] ) / county_stats.loc[x['county']][('dist', 'std')],
axis=1
)
# Change latitude and longitude of the outliers with those of the centroid of their counties
df.loc[df.zscore > 1.5, ['latitude', 'longitude']] = df[df.zscore > 1.5].merge(
dist_county, left_on='county', right_on=dist_county.index, how='left'
)[['latitude_y', 'longitude_y']].values
The resulting DataFrame df looks like:
name county latitude longitude latitude_radians longitude_radians dist zscore
0 Aaron's Hill Surrey 51.18291 -0.63098 0.893310 -0.011013 12.479147 -0.293419
1 Abbas Combe Somerset 51.00283 -2.41825 0.890167 -0.042206 35.205157 1.088695
2 Abberley Worcestershire 52.30522 -2.37574 0.912898 -0.041464 17.014249 0.266168
3 Abberton Essex 51.83440 0.91066 0.904681 0.015894 24.504285 -0.254400
4 Abberton Worcestershire 52.17955 -2.00817 0.910705 -0.035049 11.906150 -0.663460
... ... ... ... ... ... ... ... ...
1795 Ayton Berwickshire 55.84232 -2.12285 0.974632 -0.037051 5.899085 0.007876
1796 Ayton Tyne and Wear 54.89416 -1.55643 0.958084 -0.027165 3.192591 -0.935937
If you look at outliers for Essex county, the new coordinates correspond to those of the centroid, i.e. (51.846594, 0.554532):
name county latitude longitude
414 Aimes Green Essex 51.846594 0.554532
1721 Aveley Essex 51.846594 0.554532

Efficiently mask and calculate means for multiple groups in `xr.Dataset` xarray

I have two xr.Dataset objects. One is a continuous map of some variable (here precipitation). The other is a categorical map of a set of regions
['region_1', 'region_2', 'region_3', 'region_4'].
I want to calculate the mean precip in each region at each timestep by masking by region/time and then outputting a dataframe looking like the below.
In [6]: df.head()
Out[6]:
datetime region_name mean_value
0 2008-01-31 region_1 51.77333333333333
1 2008-02-29 region_1 44.87555555555556
2 2008-03-31 region_1 50.88444444444445
3 2008-04-30 region_1 48.50666666666667
4 2008-05-31 region_1 47.653333333333336
I have some code but it runs very slowly for the real datasets. Can anyone help me optimize?
A minimum reproducible example
Initalising our objects, two variables of the same shape. The region object will have been read from a shapefile and will have more than two regions.
import xarray as xr
import pandas as pd
import numpy as np
def make_dataset(
variable_name='precip',
size=(30, 30),
start_date='2008-01-01',
end_date='2010-01-01',
lonmin=-180.0,
lonmax=180.0,
latmin=-55.152,
latmax=75.024,
):
# create 2D lat/lon dimension
lat_len, lon_len = size
longitudes = np.linspace(lonmin, lonmax, lon_len)
latitudes = np.linspace(latmin, latmax, lat_len)
dims = ["lat", "lon"]
coords = {"lat": latitudes, "lon": longitudes}
# add time dimension
times = pd.date_range(start_date, end_date, name="time", freq="M")
size = (len(times), size[0], size[1])
dims.insert(0, "time")
coords["time"] = times
# create values
var = np.random.randint(100, size=size)
return xr.Dataset({variable_name: (dims, var)}, coords=coords), size
ds, size = make_dataset()
# create dummy regions (not contiguous but doesn't matter for this example)
region_ds = xr.ones_like(ds).rename({'precip': 'region'})
array = np.random.choice([0, 1, 2, 3], size=size)
region_ds = region_ds * array
# create a dictionary explaining what the regions area
region_lookup = {
0: 'region_1',
1: 'region_2',
2: 'region_3',
3: 'region_4',
}
 What do these objects look like?
In[]: ds
Out[]:
<xarray.Dataset>
Dimensions: (lat: 30, lon: 30, time: 24)
Coordinates:
* lat (lat) float64 -55.15 -50.66 -46.17 -41.69 ... 66.05 70.54 75.02
* lon (lon) float64 -180.0 -167.6 -155.2 -142.8 ... 155.2 167.6 180.0
* time (time) datetime64[ns] 2008-01-31 2008-02-29 ... 2009-12-31
Data variables:
precip (time, lat, lon) int64 51 92 14 71 60 20 82 ... 16 33 34 98 23 53
In[]: region_ds
Out[]:
<xarray.Dataset>
Dimensions: (lat: 30, lon: 30, time: 24)
Coordinates:
* lat (lat) float64 -55.15 -50.66 -46.17 -41.69 ... 66.05 70.54 75.02
* time (time) datetime64[ns] 2008-01-31 2008-02-29 ... 2009-12-31
* lon (lon) float64 -180.0 -167.6 -155.2 -142.8 ... 155.2 167.6 180.0
Data variables:
region (time, lat, lon) float64 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0
Current Implementation
In order to calculate the mean of the variable in ds in each of the regions ['region_1', 'region_2', ...] in region_ds at each time, I need to loop over the TIME and the REGION.
I loop over each REGION, and then each TIMESTEP in the da object. This operation is pretty slow as the dataset gets larger (more pixels and more timesteps). Is there a more efficient / vectorized implementation anyone can think of.
My current implementation is super slow for all the regions and times that I need. Is there a more efficient use of numpy / xarray that will get me my desired result faster?
def drop_nans_and_flatten(dataArray: xr.DataArray) -> np.ndarray:
"""flatten the array and drop nans from that array. Useful for plotting histograms.
Arguments:
---------
: dataArray (xr.DataArray)
the DataArray of your value you want to flatten
"""
# drop NaNs and flatten
return dataArray.values[~np.isnan(dataArray.values)]
#
da = ds.precip
region_da = region_ds.region
valid_region_ids = [k for k in region_lookup.keys()]
# initialise empty lists
region_names = []
datetimes = []
mean_values = []
for valid_region_id in valid_region_ids:
for time in da.time.values:
region_names.append(region_lookup[valid_region_id])
datetimes.append(time)
# extract all non-nan values for that time-region
mean_values.append(
da.sel(time=time).where(region_da == valid_region_id).mean().values
)
df = pd.DataFrame(
{
"datetime": datetimes,
"region_name": region_names,
"mean_value": mean_values,
}
)
The output:
In [6]: df.head()
Out[6]:
datetime region_name mean_value
0 2008-01-31 region_1 51.77333333333333
1 2008-02-29 region_1 44.87555555555556
2 2008-03-31 region_1 50.88444444444445
3 2008-04-30 region_1 48.50666666666667
4 2008-05-31 region_1 47.653333333333336
In [7]: df.tail()
Out[7]:
datetime region_name mean_value
43 2009-08-31 region_4 50.83111111111111
44 2009-09-30 region_4 48.40888888888889
45 2009-10-31 region_4 51.56148148148148
46 2009-11-30 region_4 48.961481481481485
47 2009-12-31 region_4 48.36296296296296
In [20]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 3 columns):
datetime 96 non-null datetime64[ns]
region_name 96 non-null object
mean_value 96 non-null object
dtypes: datetime64[ns](1), object(2)
memory usage: 2.4+ KB
In [21]: df.describe()
Out[21]:
datetime region_name mean_value
count 96 96 96
unique 24 4 96
top 2008-10-31 00:00:00 region_1 48.88984800150122
freq 4 24 1
first 2008-01-31 00:00:00 NaN NaN
last 2009-12-31 00:00:00 NaN NaN
Any help would be very much appreciated ! Thankyou

It's hard to avoid iterating to generate the masks for the regions given how they are defined, but once you have those constructed (e.g. with the code below), I think the following would be pretty efficient:
regions = xr.concat(
[(region_ds.region == region_id).expand_dims(region=[region])
for region_id, region in region_lookup.items()],
dim='region'
)
result = ds.precip.where(regions).mean(['lat', 'lon'])
This generates a DataArray with 'time' and 'region' dimensions, where the value at each point is the mean at a given time over a given region. It would be straightforward to extend this to an area-weighted average if that were desired too.
An alternative option that generates the same result would be:
regions = xr.DataArray(
list(region_lookup.keys()),
coords=[list(region_lookup.values())],
dims=['region']
)
result = ds.precip.where(regions == region_ds.region).mean(['lat', 'lon'])
Here regions is basically just a DataArray representation of the region_lookup dictionary.

Gmaps Distance Matrix : How to iterate over sequence of rows in data frame and calculate distance

Please see attached snapshot as I ask my question.
I have a data frame with a pair of Latitude / Longitudes that have been geocoded from different program and I am trying to generate a distance matrix between (Latitude1/Longitude1) and (Latitude2/Longitude2) and so on to find distance between locations.
My program below doesn't seem to read all the rows.
import pandas as pd
import googlemaps
import requests, json
gmaps = googlemaps.Client(key='123)
source = pd.DataFrame({'Latitude': df['Latitude1'] ,'Longitude': df['Longitude1']})
destination = pd.DataFrame({'Latitude': df['Latitude2'] ,'Longitude': df['Longitude2']})
source = source.reset_index(drop=True)
destination = destination.reset_index(drop=True)
for i in range(len(source)):
result = gmaps.distance_matrix(source, destination)
print(result)
Expected Output
Distance
12 Miles
10 Miles
5 Miles
1 Mile
DataFrame
Key Latitude1 Longitude1 Latitude2 Longitude#2
1 42 -91 40 -92
2 39 -94.35 38 -94
3 37 -120 36 -120
4 28.7 -90 35 -90
5 40 -94 38 -90
6 30 -90 25 -90

I haven't used gmaps, but this is a simple formula for calculating distance.
This is just maths, so I won't explain it here.
Just know you need 2 locations in the format (lat, lon) as the arguments and need to import math
def distance(origin, destination):
lat1, lon1 = origin
lat2, lon2 = destination
radius = 3959 # mi
dlat = math.radians(lat2-lat1)
dlon = math.radians(lon2-lon1)
a = math.sin(dlat/2) * math.sin(dlat/2) + math.cos(math.radians(lat1)) \
* math.cos(math.radians(lat2)) * math.sin(dlon/2) * math.sin(dlon/2)
c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
d = radius * c
return d
Now we need to merge the 2 dataframes More detail here
maindf = pd.merge(source, destination , left_index=True, right_index=True)
Next you need to apply that to each row
maindf['Distance'] = maindf.apply(lambda row: distance((row.Latitude1,row.Longditude1),(row.Latitude2,row.Longditude2)), axis=1)
Apply loops over the dataframe and applies the function.
In this case it is applies 'distance' to every row based on the 2 lat/long pairs in each row.
This adds a new column 'Distance' with the distance in miles between the 2 locations.
I would also add, if that is your full code, you don't actually add any data to the dataframes.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Vectorization to calculate many distances - python

Related

Performing calculations on DataFrames of different lengths

Python: Computing the distance between two point coordinates using two columns

Replace grouped columns' outliers with mean of the group based on defined zscore

Efficiently mask and calculate means for multiple groups in `xr.Dataset` xarray

Gmaps Distance Matrix : How to iterate over sequence of rows in data frame and calculate distance

Categories

Resources