Python: Computing the distance between two point coordinates using two columns

I would like to compute the distance between consecutive coordinates in a DataFrame. I know I can compute the haversine distance between two points, but I was wondering whether there is an easier way than writing a loop that applies the formula while iterating over the entire columns (my loop also raises errors).
Here's some data for the example:
import random
import pandas as pd

# Random values for the duration from one point to another
random_values = random.sample(range(2, 20), 8)
random_values
# Creating arrays for the coordinates
lat_coor = [11.923855, 11.923862, 11.923851, 11.923847, 11.923865, 11.923841, 11.923860, 11.923846]
lon_coor = [57.723843, 57.723831, 57.723839, 57.723831, 57.723827, 57.723831, 57.723835, 57.723827]
df = pd.DataFrame({
    'duration': random_values,
    'latitude': lat_coor,
    'longitude': lon_coor,
})
df
duration latitude longitude
0 5 11.923855 57.723843
1 2 11.923862 57.723831
2 10 11.923851 57.723839
3 19 11.923847 57.723831
4 16 11.923865 57.723827
5 4 11.923841 57.723831
6 13 11.923860 57.723835
7 3 11.923846 57.723827
To compute the distance this is what I've attempted:
# Looping over each row to compute the Haversine distance between two points
# Earth's radius (in m)
R = 6373.0 * 1000
lat = df["latitude"]
lon = df["longitude"]
for i in lat:
lat1 = lat[i]
lat2 = lat[i+1]
for j in lon:
lon1 = lon[i]
lon2 = lon[i+1]
dlon = lon2 - lon1
dlat = lat2 - lat1
# Haversine formula
a = math.sin(dlat / 2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2)**2
c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
distance = R * c
print(distance) # in m
However, this is the error I get:
The two points to compute the distance should be taken from the same column.
first distance value:
11.923855 57.723843 (point1/observation1)
11.923862 57.723831 (point2/observation2)
second distance value:
11.923862 57.723831 (point1/observation2)
11.923851 57.723839 (point2/observation3)
third distance value:
11.923851 57.723839 (point1/observation3)
11.923847 57.723831 (point2/observation4)
... (and so on)

OK, first you can create a dataframe that combines each measurement with the previous one:
df2 = pd.concat([df.add_suffix('_pre').shift(), df], axis=1)
df2
This outputs:
duration_pre latitude_pre longitude_pre duration latitude longitude
0 NaN NaN NaN 5 11.923855 57.723843
1 5.0 11.923855 57.723843 2 11.923862 57.723831
2 2.0 11.923862 57.723831 10 11.923851 57.723839
…
Then create a haversine function and apply it to the rows:
import math

def haversine(lat1, lon1, lat2, lon2):
    R = 6373.0 * 1000  # Earth's radius (in m)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat / 2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2)**2
    return R * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

df2.apply(lambda x: haversine(x['latitude_pre'], x['longitude_pre'], x['latitude'], x['longitude']), axis=1)
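which computes, for each row, the distance to the previous row (the first one is thus NaN). Note that haversine passes its arguments straight to math.sin and math.cos, which expect radians; if the columns hold degrees, convert them with math.radians first, otherwise the values (including the ones shown here) are not real distances in metres.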
0 NaN
1 75.754755
2 81.120210
3 48.123604
…
And, if you want to include a new column in the original dataframe in one line:
df['distance'] = pd.concat([df.add_suffix('_pre').shift(), df], axis=1).apply(lambda x: haversine(x['latitude_pre'], x['longitude_pre'], x['latitude'], x['longitude']), axis=1)
Output:
duration latitude longitude distance
0 5 11.923855 57.723843 NaN
1 2 11.923862 57.723831 75.754755
2 10 11.923851 57.723839 81.120210
3 19 11.923847 57.723831 48.123604
4 16 11.923865 57.723827 116.515304
5 4 11.923841 57.723831 154.307571
6 13 11.923860 57.723835 122.794838
7 3 11.923846 57.723827 98.115312
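As a hedged alternative to the row-wise apply above, the same previous-row pairing can be computed on whole columns with NumPy. This is a minimal sketch rather than the answer's own code: the helper name haversine_m, the 6371 km mean Earth radius, and the small sample frame are assumptions for illustration, and the inputs are assumed to be in degrees, which the function converts to radians.

```python
import numpy as np
import pandas as pd


def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000.0):
    """Great-circle distance in metres; inputs are degrees (scalars or arrays)."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_m * np.arcsin(np.sqrt(a))


# Hypothetical sample: three points one degree apart near the equator.
df = pd.DataFrame({
    "latitude": [0.0, 0.0, 1.0],
    "longitude": [0.0, 1.0, 1.0],
})

# shift() moves every row down by one, so each row is paired with the one before it.
df["distance_m"] = haversine_m(
    df["latitude"].shift(), df["longitude"].shift(),
    df["latitude"], df["longitude"],
)
```

The first row has no predecessor, so its distance is NaN, just as with the apply-based version.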

I understood that you want the pairwise haversine distance between all points in your df. Here's how this could be done.
Be careful when using this approach with a lot of points, as it adds one column per point.
Setup
import random
import pandas as pd

random_values = random.sample(range(2, 20), 8)
random_values
# Creating arrays for the coordinates
lat_coor = [11.923855, 11.923862, 11.923851, 11.923847, 11.923865, 11.923841, 11.923860, 11.923846]
lon_coor = [57.723843, 57.723831, 57.723839, 57.723831, 57.723827, 57.723831, 57.723835, 57.723827]
df = pd.DataFrame({
    'duration': random_values,
    'latitude': lat_coor,
    'longitude': lon_coor,
})
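Get radians
import math
df['lat_rad'] = df.latitude.apply(math.radians)
# Use the longitude column here. The original line read df.latitude, which is why
# lat_rad and long_rad are identical in the output shown further down.
df['long_rad'] = df.longitude.apply(math.radians)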
Calculate pairwise distances
from sklearn.metrics.pairwise import haversine_distances

for idx_from, from_point in df.iterrows():
    for idx_to, to_point in df.iterrows():
        column_name = f"Distance_to_point_{idx_from}"
        haversine_matrix = haversine_distances([[from_point.lat_rad, from_point.long_rad], [to_point.lat_rad, to_point.long_rad]])
        # unit-sphere distance -> metres -> kilometres
        point_distance = haversine_matrix[0][1] * 6371000/1000
        df.loc[idx_to, column_name] = point_distance
df
duration latitude longitude lat_rad long_rad Distance_to_point_0 Distance_to_point_1 Distance_to_point_2 Distance_to_point_3 Distance_to_point_4 Distance_to_point_5 Distance_to_point_6 Distance_to_point_7
0 3 11.923855 57.723843 0.20811052928038845 0.20811052928038845 0.0 0.0010889626934743966 0.0006222644021223135 0.001244528808978787 0.0015556609862946524 0.002177925427923575 0.000777830496776312 0.0014000949117650525
1 13 11.923862 57.723831 0.2081106514534361 0.2081106514534361 0.0010889626934743966 0.0 0.0017112270955967099 0.002333491502453183 0.0004666982928202561 0.00326688812139797 0.00031113219669808446 0.0024890576052394482
2 14 11.923851 57.723839 0.2081104594672184 0.2081104594672184 0.0006222644021223135 0.0017112270955967099 0.0 0.0006222644068564735 0.002177925388416966 0.0015556610258012616 0.0014000948988986254 0.0007778305096427389
3 4 11.923847 57.723831 0.20811038965404832 0.20811038965404832 0.001244528808978787 0.002333491502453183 0.0006222644068564735 0.0 0.0028001897952734385 0.0009333966189447881 0.002022359305755099 0.0001555661027862654
4 5 11.923865 57.723827 0.20811070381331365 0.20811070381331365 0.0015556609862946524 0.0004666982928202561 0.002177925388416966 0.0028001897952734385 0.0 0.003733586414218225 0.0007778304895183407 0.002955755898059704
5 7 11.923841 57.723831 0.20811028493429318 0.20811028493429318 0.002177925427923575 0.00326688812139797 0.0015556610258012616 0.0009333966189447881 0.003733586414218225 0.0 0.002955755924699886 0.0007778305161585227
6 9 11.92386 57.723835 0.20811061654685106 0.20811061654685106 0.000777830496776312 0.00031113219669808446 0.0014000948988986254 0.002022359305755099 0.0007778304895183407 0.002955755924699886 0.0 0.002177925408541364
7 8 11.923846 57.723827 0.20811037220075576 0.20811037220075576 0.0014000949117650525 0.0024890576052394482 0.0007778305096427389 0.0001555661027862654 0.002955755898059704 0.0007778305161585227 0.002177925408541364 0.0
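For larger inputs, the nested iterrows loops can be replaced by a single broadcasted computation that returns the full distance matrix at once. The sketch below is an illustration under assumptions, not part of the answer: the function name pairwise_haversine_km, the 6371 km radius, and the three sample points are made up, and coordinates are assumed to be in degrees.

```python
import numpy as np


def pairwise_haversine_km(lat_deg, lon_deg, radius_km=6371.0):
    """Return an (n, n) matrix of great-circle distances in kilometres."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    # A column vector minus a row vector broadcasts to every pair (i, j).
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))


# Hypothetical sample points (degrees).
dist_km = pairwise_haversine_km([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])
```

The result is symmetric with zeros on the diagonal, so it can be kept as a separate matrix instead of widening the dataframe with one column per point.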

You are confusing the index with the values themselves: you get a KeyError because there is no lat[i] for a value such as lat[11.923855] in your example. After changing i to be the index, your code would still run past the last row of lat and lon because of [i+1]. Since you want to compare each row to the previous row, start at index 1 and look back by 1; then you won't go out of range. This edited version of your code does not crash:
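import math

R = 6373.0 * 1000  # Earth's radius (in m)
lat = df["latitude"]
lon = df["longitude"]
# One loop is enough: the inner loop over lon never used its own variable j
for i in range(1, len(lat)):
    # math.sin/cos expect radians (assuming the columns hold degrees)
    lat1, lat2 = math.radians(lat[i - 1]), math.radians(lat[i])
    lon1, lon2 = math.radians(lon[i - 1]), math.radians(lon[i])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    # Haversine formula
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    distance = R * c
    print(distance)  # in m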
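The look-back pairing can also be written without any index arithmetic by zipping a sequence with itself shifted by one, which rules out both the key error and the off-by-one at the end. This is only a sketch: the points list and the helper name haversine_m are hypothetical, and the coordinates are assumed to be (lat, lon) pairs in degrees.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000.0):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


# Hypothetical (lat, lon) observations in degrees.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# zip(points, points[1:]) yields (previous, current) pairs and stops at the last row.
distances = [haversine_m(*prev, *cur) for prev, cur in zip(points, points[1:])]
```

With n observations this produces n - 1 distances, one per consecutive pair.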

Related

Error from calculating the distance between points with latitude and longitude in python

I am trying to calculate the distance (in km) between different geolocations given by latitude and longitude. I tried to use the code from this thread: Pandas Latitude-Longitude to distance between successive rows. However, I run into the error below.
Does anyone know how to fix this issue?
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5464 return self[name]
-> 5465 return object.__getattribute__(self, name)
5466
AttributeError: 'Series' object has no attribute 'radians'
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
<ipython-input-56-3c590360590e> in <module>
11
12 df['dist'] = haversine(df.latitude.shift(), df.longitude.shift(),
---> 13 df.loc[1:, 'latitude'], df.loc[1:, 'longitude'])
14
15
<ipython-input-56-3c590360590e> in haversine(lat1, lon1, lat2, lon2, to_radians, earth_radius)
2 def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
3 if to_radians:
----> 4 lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
5
6 a = np.sin((lat2-lat1)/2.0)**2 + \
TypeError: loop of ufunc does not support argument 0 of type Series which has no callable radians method
Here is the data frame:
>>> df_latlon
latitude longitude
0 37.405548 -122.078481
1 34.080610 -84.200785
2 37.770830 -122.395463
3 37.773792 -122.409865
4 41.441269 -96.494304
5 41.441269 -96.494304
6 41.441269 -96.494304
7 41.883784 -87.637668
8 26.140780 -80.124434
9 39.960000 -85.983660
Here is the code:
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    return earth_radius * 2 * np.arcsin(np.sqrt(a))

df_latlon['dist'] = haversine(df_latlon.latitude.shift(), df_latlon.longitude.shift(),
                              df_latlon.loc[1:, 'latitude'], df_latlon.loc[1:, 'longitude'])

You're passing Series to the haversine function rather than plain numbers for the lat and lon arguments.
You can use apply to run haversine on each row of the dataframe; however, apply only sees one row at a time, so it has no direct way to reach the next or previous row.
So I'd just add a couple of extra columns, 'from lat' and 'from lon'. Then each row holds all the data it needs.
# add the from lat and lon as extra columns
df_latlon['from lat'] = df_latlon['latitude'].shift(1)
df_latlon['from lon'] = df_latlon['longitude'].shift(1)

def calculate_distance(df_row):
    return haversine(df_row['from lat'], df_row['from lon'], df_row['latitude'], df_row['longitude'])

# pass each row through the haversine function via calculate_distance
df_latlon['dist'] = df_latlon.apply(calculate_distance, axis=1)

I think the issue is that you want to calculate row by row, but passing whole Series into the function like that doesn't work.
Try:
import io
import numpy as np
import pandas as pd

data = '''
latitude longitude
0 37.405548 -122.078481
1 34.080610 -84.200785
2 37.770830 -122.395463
3 37.773792 -122.409865
4 41.441269 -96.494304
5 41.441269 -96.494304
6 41.441269 -96.494304
7 41.883784 -87.637668
8 26.140780 -80.124434
9 39.960000 -85.983660'''
# whitespace-separated; the unnamed first column becomes the index
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
df[['lat2', 'lon2']] = df[['latitude', 'longitude']].shift()

def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    return earth_radius * 2 * np.arcsin(np.sqrt(a))

# assign to df, the frame that actually holds lat2/lon2
df['dist'] = df.apply(lambda x: haversine(x['lat2'], x['lon2'], x['latitude'], x['longitude']), axis=1)
latitude longitude lat2 lon2 dist
0 37.405548 -122.078481 NaN NaN NaN
1 34.080610 -84.200785 37.405548 -122.078481 3415.495909
2 37.770830 -122.395463 34.080610 -84.200785 3439.656694
3 37.773792 -122.409865 37.770830 -122.395463 1.307998
4 41.441269 -96.494304 37.773792 -122.409865 2248.480322
5 41.441269 -96.494304 41.441269 -96.494304 0.000000
6 41.441269 -96.494304 41.441269 -96.494304 0.000000
7 41.883784 -87.637668 41.441269 -96.494304 737.041395
8 26.140780 -80.124434 41.883784 -87.637668 1880.578726
9 39.960000 -85.983660 26.140780 -80.124434 1629.746292

Finding each 3 subset of closest coordinates (lat and long), Python

I have a bike-sharing data set with the lat and long of each station. A sample of the data is shown below. I want to find each group of 3 stations that are close to each other in terms of coordinates and sum up the count for each group (the 3 closest points).
I know how to calculate the distance between two points, but I don't know how to program finding each subset of the 3 closest coordinates.
The code for calculating the distance between 2 points:
from math import cos, asin, sqrt, pi

def distance(lat1, lon1, lat2, lon2):
    p = pi/180
    a = 0.5 - cos((lat2-lat1)*p)/2 + cos(lat1*p) * cos(lat2*p) * (1-cos((lon2-lon1)*p))/2
    return 12742 * asin(sqrt(a))
The data :
start_station_name start_station_latitude start_station_longitude count
0 Schous plass 59.920259 10.760629 2
1 Pilestredet 59.926224 10.729625 4
2 Kirkeveien 59.933558 10.726426 8
3 Hans Nielsen Hauges plass 59.939244 10.774319 0
4 Fredensborg 59.920995 10.750358 8
5 Marienlyst 59.932454 10.721769 9
6 Sofienbergparken nord 59.923229 10.766171 3
7 Stensparken 59.927140 10.730981 4
8 Vålerenga 59.908576 10.786856 6
9 Schous plass trikkestopp 59.920728 10.759486 5
10 Griffenfeldts gate 59.933703 10.751930 4
11 Hallénparken 59.931530 10.762169 8
12 Alexander Kiellands Plass 59.928058 10.751397 3
13 Uranienborgparken 59.922485 10.720896 2
14 Sommerfrydhagen 59.911453 10.776072 1
15 Vestkanttorvet 59.924403 10.713069 8
16 Bislettgata 59.923834 10.734638 9
17 Biskop Gunnerus' gate 59.912334 10.752292 1
18 Botanisk Hage sør 59.915282 10.769620 1
19 Hydroparken 59.914145 10.715505 1
20 Bøkkerveien 59.927375 10.796015 1
what I want is :
closest count_sum
Schous plass, Pilestredet, Kirkeveien. 14
.
.
.
The Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-49-1a4d3a72c23d> in <module>
7 for idx_1, idx_2 in [(0, 1), (1, 2), (0, 2)]:
8 total_distance += distance(
----> 9 combination[idx_1]['start_station_latitude'],
10 combination[idx_1]['start_station_longitude'],
11 combination[idx_2]['start_station_latitude'],
TypeError: 'int' object is not subscriptable

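You could try all possible combinations with itertools.combinations() and keep the triple of stations with the shortest total distance. This assumes data is a sequence of per-station records that can be indexed by column name, for example a list of dicts such as df.to_dict('records'); iterating over a DataFrame directly yields its column labels instead.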
from itertools import combinations

best = (float('inf'), None)
for combination in combinations(data, 3):
    total_distance = 0
    for idx_1, idx_2 in [(0, 1), (1, 2), (0, 2)]:
        total_distance += distance(
            combination[idx_1]['start_station_latitude'],
            combination[idx_1]['start_station_longitude'],
            combination[idx_2]['start_station_latitude'],
            combination[idx_2]['start_station_longitude'],
        )
    if total_distance < best[0]:
        best = (total_distance, combination)
print(f'Best combination is {best[1]}, total distance: {best[0]}')
Keep in mind that there's still room for optimization, for example caching the distance between two stations with functools.lru_cache:
from functools import lru_cache

@lru_cache(maxsize=None)
def distance(lat1, lon1, lat2, lon2):
    p = pi/180
    ...
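Putting the two ideas together, a brute-force search over triples with a cached distance function might look like the sketch below. It is an illustration under assumptions: the stations list is a five-row subset of the question's table stored as hypothetical (name, lat, lon, count) tuples, perimeter is a made-up helper name, and "closest" is taken to mean the smallest sum of the three pairwise distances, as in the answer above.

```python
from functools import lru_cache
from itertools import combinations
from math import asin, cos, pi, sqrt


@lru_cache(maxsize=None)
def distance(lat1, lon1, lat2, lon2):
    """Haversine distance in km (inputs in degrees); results are cached per argument tuple."""
    p = pi / 180
    a = (0.5 - cos((lat2 - lat1) * p) / 2
         + cos(lat1 * p) * cos(lat2 * p) * (1 - cos((lon2 - lon1) * p)) / 2)
    return 12742 * asin(sqrt(a))


# Subset of the question's stations as (name, lat, lon, count) tuples (layout is an assumption).
stations = [
    ("Schous plass", 59.920259, 10.760629, 2),
    ("Pilestredet", 59.926224, 10.729625, 4),
    ("Stensparken", 59.927140, 10.730981, 4),
    ("Schous plass trikkestopp", 59.920728, 10.759486, 5),
    ("Bislettgata", 59.923834, 10.734638, 9),
]


def perimeter(triple):
    """Sum of the three pairwise distances within a triple of stations."""
    return sum(distance(a[1], a[2], b[1], b[2]) for a, b in combinations(triple, 2))


best = min(combinations(stations, 3), key=perimeter)
names = ", ".join(s[0] for s in best)
count_sum = sum(s[3] for s in best)
```

The search visits every triple, so its cost grows with the cube of the number of stations; that is fine for a few dozen stations but would need a spatial index for thousands.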

Gmaps Distance Matrix : How to iterate over sequence of rows in data frame and calculate distance

Please see attached snapshot as I ask my question.
I have a data frame with a pair of Latitude / Longitudes that have been geocoded from different program and I am trying to generate a distance matrix between (Latitude1/Longitude1) and (Latitude2/Longitude2) and so on to find distance between locations.
My program below doesn't seem to read all the rows.
import pandas as pd
import googlemaps
import requests, json
gmaps = googlemaps.Client(key='123')
source = pd.DataFrame({'Latitude': df['Latitude1'] ,'Longitude': df['Longitude1']})
destination = pd.DataFrame({'Latitude': df['Latitude2'] ,'Longitude': df['Longitude2']})
source = source.reset_index(drop=True)
destination = destination.reset_index(drop=True)
for i in range(len(source)):
result = gmaps.distance_matrix(source, destination)
print(result)
Expected Output
Distance
12 Miles
10 Miles
5 Miles
1 Mile
DataFrame
Key Latitude1 Longitude1 Latitude2 Longitude2
1 42 -91 40 -92
2 39 -94.35 38 -94
3 37 -120 36 -120
4 28.7 -90 35 -90
5 40 -94 38 -90
6 30 -90 25 -90

I haven't used gmaps, but here is a simple formula for calculating distance (the haversine formula).
This is just maths, so I won't explain it here.
You need two locations in the format (lat, lon) as the arguments, and you need to import math.
import math

def distance(origin, destination):
    lat1, lon1 = origin
    lat2, lon2 = destination
    radius = 3959  # mi
    dlat = math.radians(lat2-lat1)
    dlon = math.radians(lon2-lon1)
    a = math.sin(dlat/2) * math.sin(dlat/2) + math.cos(math.radians(lat1)) \
        * math.cos(math.radians(lat2)) * math.sin(dlon/2) * math.sin(dlon/2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    d = radius * c
    return d
Now we need to merge the two dataframes on their index. Both frames use the column names Latitude and Longitude, so pass suffixes to get Latitude1/Longitude1 and Latitude2/Longitude2:
maindf = pd.merge(source, destination, left_index=True, right_index=True, suffixes=('1', '2'))
Next, apply the function to each row:
maindf['Distance'] = maindf.apply(lambda row: distance((row.Latitude1, row.Longitude1), (row.Latitude2, row.Longitude2)), axis=1)
apply loops over the dataframe and calls the function once per row.
In this case it applies distance to every row, using the two lat/lon pairs in that row.
This adds a new column 'Distance' with the distance in miles between the two locations.
I would also add, if that is your full code, you don't actually add any data to the dataframes.

Vectorization to calculate many distances

I am new to numpy/pandas and vectorized computation. I am doing a data task where I have two datasets. Dataset 1 contains a list of places with their longitude and latitude and a variable A. Dataset 2 also contains a list of places with their longitude and latitude. For each place in dataset 1, I would like to calculate its distances to all the places in dataset 2, but I only want a count of the places in dataset 2 whose distance is less than the value of variable A. Note also that both datasets are very large, so I need to use vectorized operations to speed up the computation.
For example, my dataset1 may look like below:
id lon lat varA
1 20.11 19.88 100
2 20.87 18.65 90
3 18.99 20.75 120
and my dataset2 may look like below:
placeid lon lat
a 18.75 20.77
b 19.77 22.56
c 20.86 23.76
d 17.55 20.74
Then for id == 1 in dataset1, I would like to calculate its distances to all four points (a, b, c, d) in dataset2 and get a count of how many of those distances are less than the corresponding value of varA. For example, if the four distances calculated are 90, 70, 120, 110 and varA is 100, then the value should be 2.
I already have a vectorized function to calculate distance between the two pair of coordinates. Suppose the function (haversine(x,y)) is properly implemented, I have the following code.
dataset2['count'] = dataset1.apply(lambda x:
haversine(x['lon'],x['lat'],dataset2['lon'], dataset2['lat']).shape[0], axis
= 1)
However, this gives the total number of rows, but not the ones that satisfy my requirements.
Would anyone be able to point me how to make the code work?

If you can project the coordinates to a local projection (e.g. UTM), which is pretty straightforward with pyproj and generally preferable to lon/lat for measurement, then there is a much, much faster way using scipy.spatial. Neither df['something'] = df.apply(...) nor np.vectorize() is truly vectorized; under the hood, they loop.
ds1
id lon lat varA
0 1 20.11 19.88 100
1 2 20.87 18.65 90
2 3 18.99 20.75 120
ds2
placeid lon lat
0 a 18.75 20.77
1 b 19.77 22.56
2 c 20.86 23.76
3 d 17.55 20.74
from scipy.spatial import distance
# get the coordinates of each set of points as a NumPy array
coords_a = ds1.values[:,(1,2)]
coords_b = ds2.values[:, (1,2)]
coords_a
#out: array([[ 20.11, 19.88],
# [ 20.87, 18.65],
# [ 18.99, 20.75]])
distances = distance.cdist(coords_a, coords_b)
#out: array([[ 1.62533074, 2.70148108, 3.95182236, 2.70059253],
# [ 2.99813275, 4.06178532, 5.11000978, 3.92307278],
# [ 0.24083189, 1.97091349, 3.54358575, 1.44003472]])
distances holds the distance between every pair of points: coords_a.shape is (3, 2) and coords_b.shape is (4, 2), so the result has shape (3, 4). The default metric for distance.cdist is Euclidean, but other metrics are available as well.
For the sake of this example, let's assume vara is:
vara = np.array([2,4.5,2])
(instead of 100, 90, 120). We need to identify which values in row one of distances are smaller than 2, which in row two are smaller than 4.5, and so on. One way to do this is to subtract each value in vara from the corresponding row (note that vara must first be resized to a column):
vara.resize(3,1)
res = distances - vara
#out: array([[-0.37466926, 0.70148108, 1.95182236, 0.70059253],
# [-1.50186725, -0.43821468, 0.61000978, -0.57692722],
# [-1.75916811, -0.02908651, 1.54358575, -0.55996528]])
then setting positive values to zero and making negative values positive will give us the final array:
res[res>0] = 0
res = np.absolute(res)
#out: array([[ 0.37466926, 0. , 0. , 0. ],
# [ 1.50186725, 0.43821468, 0. , 0.57692722],
# [ 1.75916811, 0.02908651, 0. , 0.55996528]])
Now, to sum over each row:
sum_ = res.sum(axis=1)
#out: array([ 0.37466926, 2.51700915, 2.34821989])
and to count the items in each row:
count = np.count_nonzero(res, axis=1)
#out: array([1, 3, 3])
This is a fully vectorized (custom) solution that you can tweak to your liking and that should accommodate any level of complexity. Yet another solution is cKDTree. The code below is from the documentation; it should be fairly easy to adapt to your problem, but don't hesitate to ask if you need assistance.
import numpy as np
from scipy import spatial
x, y = np.mgrid[0:4, 0:4]
points = np.c_[x.ravel(), y.ravel()]  # (16, 2) array; a bare zip() is a one-shot iterator in Python 3
tree = spatial.cKDTree(points)
tree.query_ball_point([2, 0], 1)
[4, 8, 9, 12]
query_ball_point() finds all points within distance r of point(s) x, and it is amazingly fast.
One final note: don't use these algorithms with lon/lat input, particularly if your area of interest is far from the equator, because the error can get huge.
UPDATE:
To project your coordinates, you need to convert from WGS84 (lon/lat) to appropriate UTM. To find out which utm zone you should project to use epsg.io.
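import pyproj
lon = -122.67598
lat = 45.52168
# "+init=EPSG:..." is the legacy PROJ string form; recent pyproj releases
# deprecate it in favour of plain "EPSG:3740"
WGS84 = "+init=EPSG:4326"
EPSG3740 = "+init=EPSG:3740"
Proj_to_EPSG3740 = pyproj.Proj(EPSG3740)
Proj_to_EPSG3740(lon,lat)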
# out: (525304.9265963673, 5040956.147893889)
You can do df.apply() and use Proj_to_... to project df.

IIUC:
Source DFs:
In [160]: d1
Out[160]:
id lon lat varA
0 1 20.11 19.88 100
1 2 20.87 18.65 90
2 3 18.99 20.75 120
In [161]: d2
Out[161]:
placeid lon lat
0 a 18.75 20.77
1 b 19.77 22.56
2 c 20.86 23.76
3 d 17.55 20.74
Vectorized haversine function:
import numpy as np

# pd.np has been removed from pandas, so NumPy is used directly
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    return earth_radius * 2 * np.arcsin(np.sqrt(a))
Solution:
# cross join via a constant helper column x, which is dropped afterwards
x = d2.assign(x=1) \
      .merge(d1.loc[d1['id']==1, ['lat','lon']].assign(x=1),
             on='x', suffixes=('', '2')) \
      .drop(columns='x')
x['dist'] = haversine(x.lat, x.lon, x.lat2, x.lon2)
yields:
In [163]: x
Out[163]:
placeid lon lat lat2 lon2 dist
0 a 18.75 20.77 19.88 20.11 172.924852
1 b 19.77 22.56 19.88 20.11 300.078600
2 c 20.86 23.76 19.88 20.11 438.324033
3 d 17.55 20.74 19.88 20.11 283.565975
filtering:
In [164]: x.loc[x.dist < d1.loc[d1['id']==1, 'varA'].iat[0]]
Out[164]:
Empty DataFrame
Columns: [placeid, lon, lat, lat2, lon2, dist]
Index: []
let's change d1, so a few rows would satisfy the criteria:
In [171]: d1.loc[0, 'varA'] = 350
In [172]: d1
Out[172]:
id lon lat varA
0 1 20.11 19.88 350 # changed: 100 --> 350
1 2 20.87 18.65 90
2 3 18.99 20.75 120
In [173]: x.loc[x.dist < d1.loc[d1['id']==1, 'varA'].iat[0]]
Out[173]:
placeid lon lat lat2 lon2 dist
0 a 18.75 20.77 19.88 20.11 172.924852
1 b 19.77 22.56 19.88 20.11 300.078600
3 d 17.55 20.74 19.88 20.11 283.565975

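Use scipy.spatial.distance.cdist with your user-defined distance function as the metric. cdist passes each pair of rows to the metric as plain 1-D arrays, so select the coordinate columns and index them by position:
h = lambda u, v: haversine(u[0], u[1], v[0], v[1])
dist_mtx = scipy.spatial.distance.cdist(dataset1[['lon', 'lat']], dataset2[['lon', 'lat']], metric=h)
Then, to count the places within range of each row of dataset1, broadcast varA against the rows of the matrix:
dataset1['count'] = np.sum(dataset1['varA'].to_numpy()[:, None] > dist_mtx, axis=1)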
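The broadcast-and-count step can be checked end to end on the sample data from the question. The sketch below builds the full dataset1-by-dataset2 haversine matrix with NumPy broadcasting instead of a Python-level metric callback; the function name haversine_matrix_km and the 6371 km radius are assumptions, and varA is assumed to be in kilometres.

```python
import numpy as np


def haversine_matrix_km(lat_a, lon_a, lat_b, lon_b, radius_km=6371.0):
    """Return a (len(a), len(b)) matrix of great-circle distances in km (inputs in degrees)."""
    lat_a, lon_a, lat_b, lon_b = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat_a, lon_a, lat_b, lon_b)
    )
    dlat = lat_b[None, :] - lat_a[:, None]
    dlon = lon_b[None, :] - lon_a[:, None]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat_a)[:, None] * np.cos(lat_b)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))


# Sample values from the question (dataset1 rows 1-3, dataset2 places a-d).
lat1, lon1 = [19.88, 18.65, 20.75], [20.11, 20.87, 18.99]
var_a = np.array([100, 90, 120])
lat2, lon2 = [20.77, 22.56, 23.76, 20.74], [18.75, 19.77, 20.86, 17.55]

dist = haversine_matrix_km(lat1, lon1, lat2, lon2)
# Compare each row against its own threshold, then count the hits per row.
counts = (dist < var_a[:, None]).sum(axis=1)
```

The matrix has one row per place in dataset1, so for very large inputs it may need to be computed in chunks of rows to keep memory use bounded.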

python pandas datatype error float is required

I am trying to read data from a csv file and calculate the bearing from coordinates, but I get the error 'a float is required'. The function works if I run it by itself (not in a loop) with only two coordinates, so I think the issue is related to the last 'for' statement. Does anybody have an idea where and how I should set the float datatype? Thank you.
from math import *
import pandas as p
import numpy as np
bearingdata = 'xxxxxx.csv'
data = p.read_csv(bearingdata)
lat = [float(i) for i in data.Lat]
lon = [float(j) for j in data.Lon]
lat1 = lat[0: (len(lat) -2)]
lon1 = lon[0: (len(lon) -2)]
lat2 = lat[1: (len(lat) -1)]
lon2 = lon[1: (len(lon) -1)]
def bearing(lon_1, lat_1, lon_2, lat_2):
# convert decimal degrees to radians
lon_1, lat_1, lon_2, lat_2 = map(radians, [lon_1, lat_1, lon_2, lat_2])
#now calcuate bearing
Bearing = atan2(cos(lat1)*sin(lat2)-sin(lat1)*cos(lat2)*cos(lon2-lon1),sin(lon2-lon1)*cos(lat2))
Bearing = degrees(Bearing)
Bearing = (Bearing + 360) % 360
return Bearing
count = 0
for x in lat1:
print str(count) + "\n"
angle = bearing(lon1[count], lat1[count], lon2[count], lat2[count])
print "the bearing between " + str(lat1[count]) + "," + str(lon1[count]) + " and " + str(lat2[count]) + "," + str(lon2[count]) + " is: " + str(angle) + "degrees \n"
count = count + 1
* trace back *
Traceback (most recent call last):
  File "bearing.py", line 34, in <module>
    angle = bearing(lon1[count], lat1[count], lon2[count], lat2[count])
  File "bearing.py", line 23, in bearing
    Bearing = atan2(cos(lat1)*sin(lat2)-sin(lat1)*cos(lat2)*cos(lon2-lon1),sin(lon2-lon1)*cos(lat2))
TypeError: a float is required
* original data looks like this *
Lat (column name)
42.xxxxxx
... many rows
Lon (column name)
78.xxxx
... many rows

You almost never need to loop in pandas. Try something like this:
Generate some fake coordinates
import pandas
import numpy as np
N = 10
np.random.seed(0)
degree_data = pandas.DataFrame({
'lat': np.random.uniform(low=-90, high=90, size=N),
'lon': np.random.uniform(low=-180, high=180, size=N),
})
degree_data looks like this:
lat lon
0 8.786431 105.021014
1 38.734086 10.402171
2 18.497408 24.496042
3 8.078973 153.214790
4 -13.742136 -154.427019
5 26.260940 -148.633452
6 -11.234302 -172.721377
7 70.519140 119.743144
8 83.459297 100.136430
9 -20.980527 133.204373
convert to radians and shift everything up 1 row so that we can look ahead
radian_data = np.round(np.radians(degree_data), 2)
radian_data = radian_data.join(radian_data.shift(-1), lsuffix='1', rsuffix='2')
print(radian_data)
so now radian_data is:
lat1 lon1 lat2 lon2
0 0.15 1.83 0.68 0.18
1 0.68 0.18 0.32 0.43
2 0.32 0.43 0.14 2.67
3 0.14 2.67 -0.24 -2.70
4 -0.24 -2.70 0.46 -2.59
5 0.46 -2.59 -0.20 -3.01
6 -0.20 -3.01 1.23 2.09
7 1.23 2.09 1.46 1.75
8 1.46 1.75 -0.37 2.32
9 -0.37 2.32 NaN NaN
Define a bearing function that takes row from our dataframe
def bearing(row):
    x = np.cos(row.lat1)*np.sin(row.lat2) - \
        np.sin(row.lat1)*np.cos(row.lat2)*np.cos(row.lon2-row.lon1)
    y = np.sin(row.lon2-row.lon1)*np.cos(row.lat2)
    Bearing = np.degrees(np.arctan2(x, y))
    return (Bearing + 360) % 360
apply that function to each row, save in the original dataframe
degree_data['bearing'] = radian_data.apply(bearing, axis=1)
# so now we have
lat lon bearing
0 8.786431 105.021014 140.855914
1 38.734086 10.402171 305.134809
2 18.497408 24.496042 22.751374
3 8.078973 153.214790 337.513363
4 -13.742136 -154.427019 81.301311
5 26.260940 -148.633452 235.214299
6 -11.234302 -172.721377 108.063240
7 70.519140 119.743144 98.957144
8 83.459297 100.136430 301.528278
9 -20.980527 133.204373 NaN

Add a decimal to your math in the function:
Bearing = (Bearing + 360) % 360.0
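In the question's code, the body of bearing refers to lat1, lat2, lon1 and lon2, which are the module-level lists, rather than to the converted parameters lon_1, lat_1, lon_2 and lat_2, so cos() receives a list; that would explain 'a float is required'. Below is a hedged scalar sketch whose body uses only its own arguments. The function name initial_bearing and the sample lists are assumptions, and it follows the usual forward-azimuth convention (degrees clockwise from north), so the atan2 arguments appear in the opposite order from the snippets above.

```python
import math


def initial_bearing(lat1, lon1, lat2, lon2):
    """Initial bearing in degrees clockwise from north; inputs are degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb)
    return (math.degrees(math.atan2(y, x)) + 360.0) % 360.0


# Hypothetical track: due east along the equator, then due north.
lats = [0.0, 0.0, 1.0]
lons = [0.0, 1.0, 1.0]

# Pair each point with the next one; every call gets plain floats, never lists.
bearings = [
    initial_bearing(la1, lo1, la2, lo2)
    for la1, lo1, la2, lo2 in zip(lats, lons, lats[1:], lons[1:])
]
```

Because each call receives floats taken from the zipped lists, no separate float conversion of the columns is needed.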
