How to calculate Distance between two ZIPs? - python

I have a list of US ZIP codes and I have to calculate the distance between all of the ZIP code points. It's a list of about 6,000 ZIPs; each entry has ZIP, City, State, Lat, Long, Area and Population.
So I have to calculate the distance between all pairs of points, i.e., 6000C2 combinations.
Here is a sample of my data.
I've tried this in SAS, but it's too slow and inefficient, so I'm looking for a way to do it in Python or R.
Any leads would be appreciated.

Python Solution
If you have the corresponding latitude and longitude for each ZIP code, you can calculate the distance between two points directly with the haversine formula, for example via the 'mpu' library, which computes the great-circle distance between two points on a sphere.
Example code:
import mpu

zip_00501 = (40.817923, -73.045317)
zip_00544 = (40.788827, -73.039405)

dist = round(mpu.haversine_distance(zip_00501, zip_00544), 2)
print(dist)
You will get the resulting distance in kilometres.
Output:
3.27
PS: If you don't have the coordinates for the ZIP codes, you can look them up with the SearchEngine module of the 'uszipcode' library (US ZIP codes only).
from uszipcode import SearchEngine

# for an extensive list of ZIP codes, set simple_zipcode=False
search = SearchEngine(simple_zipcode=True)

zip1 = search.by_zipcode('92708')
lat1 = zip1.lat
long1 = zip1.lng

zip2 = search.by_zipcode('53404')
lat2 = zip2.lat
long2 = zip2.lng

mpu.haversine_distance((lat1, long1), (lat2, long2))
Hope this helps!!
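Since the original question asks for all pairwise distances over roughly 6,000 ZIPs (about 18 million pairs), calling a per-pair helper in a Python loop can be slow. As a minimal sketch, assuming you already have the latitudes and longitudes as NumPy arrays (the function name here is just illustrative), a vectorized haversine builds the whole matrix at once:
import numpy as np

def haversine_matrix(lat, lon, radius_km=6371.0):
    # lat, lon: one-dimensional sequences of coordinates in decimal degrees
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = np.sin(dlat / 2) ** 2 + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# the two ZIP coordinates from the example above
dm = haversine_matrix([40.817923, 40.788827], [-73.045317, -73.039405])
print(dm.round(2))   # off-diagonal entries are the pairwise distances in km (~3.27)
For 6,000 points the full 6,000 x 6,000 float64 matrix is roughly 280 MB, which usually fits in memory; if it doesn't, compute it in row blocks.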

In SAS, use the GEODIST function.
GEODIST Function
Returns the geodetic distance between two latitude and longitude coordinates.
…
Syntax
GEODIST(latitude-1, longitude-1, latitude-2, longitude-2 <, options>)

R solution
# sample data: first three rows of the data provided
df <- data.frame(
  zip = c("00501", "00544", "00601"),
  longitude = c(-73.045075, -73.045147, -66.750909),
  latitude = c(40.816799, 40.817225, 18.181189),
  stringsAsFactors = FALSE
)

library(sf)

# create a spatial data.frame
spdf <- st_as_sf(
  x = df,
  coords = c("longitude", "latitude"),
  crs = "+proj=longlat +datum=WGS84"
)

# create the distance matrix (in metres), rounded to 0 decimals
m <- round(st_distance(spdf), digits = 0)

# set row and column names of the matrix
colnames(m) <- df$zip
rownames(m) <- df$zip

# show the distance matrix in metres
m
# Units: m
# 00501 00544 00601
# 00501 0 48 2580481
# 00544 48 0 2580528
# 00601 2580481 2580528 0

Related

For loop for minimum distance between points in dataframe and polygon in another dataframe

I want to calculate the distance from each point of dataframe geosearch_crs to the polygons in the gelb_crs dataframe, returning only the minimum distance.
I have tried this code:
for i in range(len(geosearch_crs)):
    point = geosearch_crs['geometry'].iloc[i]
    for j in range(len(gelb_crs)):
        poly = gelb_crs['geometry'].iloc[j]
        print(point.distance(poly).min())
it returns this error:
AttributeError: 'float' object has no attribute 'min'
I somehow don't get how to return what I want; I thought point.distance(poly).min() should work, though.
This is part of the data frames (around 180,000 entries):

geosearch_crs:
count  geometry
12     POINT (6.92334 50.91695)
524    POINT (6.91970 50.93167)
5      POINT (6.96946 50.91469)

gelb_crs (35 entries):
name       geometry
Polygon 1  POLYGON Z ((6.95712 50.92851 0.00000, 6.95772 ...
Polygon 2  POLYGON Z ((6.91896 50.92094 0.00000, 6.92211 ...
I'm not sure about the 'distance' method, but maybe you could try collecting the distances in a list. Note that shapely's distance() already returns a plain float, and you need to iterate over the geometry columns, not the data frames themselves:
distances = list()
for point in geosearch_crs.geometry:
    for poly in gelb_crs.geometry:
        distances.append(point.distance(poly))
print(min(distances))
Your sample polygon data is unusable as it's truncated with ellipses, so I have used two other polygons to demonstrate as an MWE.
You need to ensure that the CRS in both data frames are compatible. Your sample data is clearly in two different CRSs: the points look like EPSG:4326, while the polygons are either a UTM CRS or EPSG:3857, judging from the range of values.
geopandas sjoin_nearest() is a simple way to find the nearest polygon and get the distance. I have used a UTM CRS so that the distance is in metres rather than degrees.
import geopandas as gpd
import pandas as pd
import shapely
import io

df = pd.read_csv(
    io.StringIO(
        """count,geometry
12,POINT (6.92334 50.91695)
524,POINT (6.91970 50.93167)
5,POINT (6.96946 50.91469)"""
    )
)
geosearch_crs = gpd.GeoDataFrame(
    df, geometry=df["geometry"].apply(shapely.wkt.loads), crs="epsg:4326"
)

# generated as a sample because the polygons in the question are unusable
df = pd.read_csv(
    io.StringIO(
        '''name,geometry
Polygon 1,"POLYGON ((6.9176561 50.8949742, 6.9171649 50.8951417, 6.9156967 50.8957149, 6.9111788 50.897751, 6.9100077 50.8989409, 6.9101989 50.8991319, 6.9120049 50.9009167, 6.9190374 50.9078591, 6.9258157 50.9143227, 6.9258714 50.9143691, 6.9259546 50.9144355, 6.9273598 50.915413, 6.9325715 50.9136438, 6.9331018 50.9134553, 6.9331452 50.9134397, 6.9255391 50.9018725, 6.922309 50.8988869, 6.9176561 50.8949742))"
Polygon 2,"POLYGON ((6.9044955 50.9340428, 6.8894236 50.9344297, 6.8829359 50.9375553, 6.8862995 50.9409307, 6.889446 50.9423764, 6.9038401 50.9436598, 6.909518 50.9383374, 6.908634 50.9369064, 6.9046363 50.9340648, 6.9045721 50.9340431, 6.9044955 50.9340428))"'''
    )
)
gelb_crs = gpd.GeoDataFrame(
    df, geometry=df["geometry"].apply(shapely.wkt.loads), crs="epsg:4326"
)

geosearch_crs.to_crs(geosearch_crs.estimate_utm_crs()).sjoin_nearest(
    gelb_crs.to_crs(geosearch_crs.estimate_utm_crs()), distance_col="distance"
)
   count  geometry                                      index_right  name       distance
0  12     POINT (354028.1446652143 5642643.287732874)  0            Polygon 1  324.158
2  5      POINT (357262.7994182631 5642301.777981625)  0            Polygon 1  2557.33
1  524    POINT (353818.4585403281 5644287.172541857)  1            Polygon 2  971.712
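If you want the joined result back in geographic coordinates rather than UTM metres, a minimal follow-up sketch (assuming the GeoDataFrames built above) is to reproject the result with to_crs; the distance column keeps its metre values, only the geometry changes:
nearest = geosearch_crs.to_crs(geosearch_crs.estimate_utm_crs()).sjoin_nearest(
    gelb_crs.to_crs(geosearch_crs.estimate_utm_crs()), distance_col="distance"
)
nearest = nearest.to_crs("epsg:4326")   # points back in lon/lat, distances still in metres
print(nearest[["count", "name", "distance"]])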

Calculate distances from geo coordinates in a 'pythonic' way [duplicate]

I am struggling to calculate the distance between multiple sets of latitude and longitude coordinates. In short, I have found numerous tutorials that use either math or geopy. These tutorials work great when I just want to find the distance between ONE pair of coordinates (two unique locations). However, my objective is to scan a data set that has 400k combinations of origin and destination coordinates. One example of the code I have used is listed below, but it seems I get errors when my arrays are longer than one record. Any helpful tips would be much appreciated. Thank you.
# starting dataframe is df
lat1 = df.lat1.as_matrix()
long1 = df.long1.as_matrix()
lat2 = df.lat2.as_matrix()
long2 = df.df_long2.as_matrix()
from geopy.distance import vincenty
point1 = (lat1, long1)
point2 = (lat2, long2)
print(vincenty(point1, point2).miles)
Edit: here's a simple notebook example
A general approach, assuming that you have a DataFrame column containing points and you want to calculate the distances between all of them. (If you have separate columns, first combine them into (lat, lon) tuples, for instance.) Name the new column coords.
import pandas as pd
import numpy as np
from geopy.distance import vincenty

# assumes your DataFrame is named df, and its lon and lat columns are named lon and lat. Adjust as needed.
df['coords'] = zip(df.lat, df.lon)

# first, let's create a square DataFrame (think of it as a matrix if you like)
square = pd.DataFrame(
    np.zeros(len(df) ** 2).reshape(len(df), len(df)),
    index=df.index, columns=df.index)
This function looks up our 'end' coordinates in the df DataFrame using the name of the row it receives, then applies the geopy vincenty() function to every entry of df['coords'], passing end as the second argument. This works because square.apply(get_distance, axis=1) hands the function one row of the square DataFrame at a time, and each row's name is an index label of df.
def get_distance(col):
    end = df.ix[col.name]['coords']
    return df['coords'].apply(vincenty, args=(end,), ellipsoid='WGS-84')
Now we're ready to calculate all the distances.
We're transposing the DataFrame (.T) because the loc[] method we'll be using to retrieve distances refers to row label, then column label; however, our inner apply function (see above) populates a column with the retrieved values.
distances = square.apply(get_distance, axis=1).T
Your geopy values are (IIRC) returned in kilometres, so you may need to convert them to whatever unit you want to use, using .meters, .miles etc.
Something like the following should work:
def units(input_instance):
    return input_instance.meters

distances_meters = distances.applymap(units)
You can now index into your distance matrix using e.g. loc[row_index, column_index].
You should be able to adapt the above fairly easily. You might have to adjust the apply call in the get_distance function to ensure you're passing the correct values to vincenty (or great_circle, if you use that instead). The pandas apply docs might be useful, in particular with regard to passing positional arguments using args (you'll need a recent pandas version for this to work).
This code hasn't been profiled, and there are probably much faster ways to do it, but it should be fairly quick for 400k distance calculations.
Oh and also
I can't remember whether geopy expects coordinates as (lon, lat) or (lat, lon). I bet it's the latter (sigh).
Update
Here's a working script as of May 2021.
import geopy.distance

# geopy DOES use (lat, lon) ordering
df['latlon'] = list(zip(df['lat'], df['lon']))

square = pd.DataFrame(
    np.zeros((df.shape[0], df.shape[0])),
    index=df.index, columns=df.index
)

# replacing distance.vincenty with distance.distance
def get_distance(col):
    end = df.loc[col.name, 'latlon']
    return df['latlon'].apply(geopy.distance.distance,
                              args=(end,),
                              ellipsoid='WGS-84')

distances = square.apply(get_distance, axis=1).T
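A short hedged usage example of the updated script, with three placeholder points and the lat/lon column names assumed above; geopy returns Distance objects, so a unit is picked explicitly at the end:
import pandas as pd
import numpy as np
import geopy.distance

# three illustrative points (values are placeholders)
df = pd.DataFrame({'lat': [40.817923, 40.788827, 40.816799],
                   'lon': [-73.045317, -73.039405, -73.045075]})
df['latlon'] = list(zip(df['lat'], df['lon']))

square = pd.DataFrame(np.zeros((df.shape[0], df.shape[0])),
                      index=df.index, columns=df.index)

# get_distance as defined in the update above
def get_distance(col):
    end = df.loc[col.name, 'latlon']
    return df['latlon'].apply(geopy.distance.distance, args=(end,), ellipsoid='WGS-84')

distances = square.apply(get_distance, axis=1).T
print(distances.applymap(lambda d: round(d.km, 3)))   # symmetric matrix of distances in km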
I recently had to do a similar job; I ended up writing a solution I consider very easy to understand and tweak to your needs, but possibly not the best/fastest:
Solution
It is very similar to what urschrei posted: assuming you want the distance between every two consecutive coordinates in a pandas DataFrame, we can write a function that treats each pair of points as the start and finish of a path, computes the distance, and then builds a new DataFrame to return:
import pandas as pd
from geopy import Point, distance

def get_distances(coords: pd.DataFrame,
                  col_lat='lat',
                  col_lon='lon',
                  point_obj=Point) -> pd.DataFrame:
    traces = len(coords) - 1
    distances = [None] * traces
    for i in range(traces):
        start = point_obj((coords.iloc[i][col_lat], coords.iloc[i][col_lon]))
        finish = point_obj((coords.iloc[i + 1][col_lat], coords.iloc[i + 1][col_lon]))
        distances[i] = {
            'start': start,
            'finish': finish,
            'path distance': distance.geodesic(start, finish),
        }
    return pd.DataFrame(distances)
Usage example
coords = pd.DataFrame({
    'lat': [-26.244333, -26.238000, -26.233880, -26.260000, -26.263730],
    'lon': [-48.640946, -48.644670, -48.648480, -48.669770, -48.660700],
})
print('-> coords DataFrame:\n', coords)
print('-' * 79, end='\n\n')

distances = get_distances(coords)
distances['total distance'] = distances['path distance'].cumsum()
print('-> distances DataFrame:\n', distances)
print('-' * 79, end='\n\n')

# Or if you want to use tuples for the start/finish coordinates:
print('-> distances DataFrame using tuples:\n', get_distances(coords, point_obj=tuple))
print('-' * 79, end='\n\n')
Output example
-> coords DataFrame:
lat lon
0 -26.244333 -48.640946
1 -26.238000 -48.644670
2 -26.233880 -48.648480
3 -26.260000 -48.669770
4 -26.263730 -48.660700
-------------------------------------------------------------------------------
-> distances DataFrame:
start finish \
0 26 14m 39.5988s S, 48 38m 27.4056s W 26 14m 16.8s S, 48 38m 40.812s W
1 26 14m 16.8s S, 48 38m 40.812s W 26 14m 1.968s S, 48 38m 54.528s W
2 26 14m 1.968s S, 48 38m 54.528s W 26 15m 36s S, 48 40m 11.172s W
3 26 15m 36s S, 48 40m 11.172s W 26 15m 49.428s S, 48 39m 38.52s W
path distance total distance
0 0.7941932910049856 km 0.7941932910049856 km
1 0.5943709651000332 km 1.3885642561050187 km
2 3.5914909016938505 km 4.980055157798869 km
3 0.9958396130609087 km 5.975894770859778 km
-------------------------------------------------------------------------------
-> distances DataFrame using tuples:
start finish path distance
0 (-26.244333, -48.640946) (-26.238, -48.64467) 0.7941932910049856 km
1 (-26.238, -48.64467) (-26.23388, -48.64848) 0.5943709651000332 km
2 (-26.23388, -48.64848) (-26.26, -48.66977) 3.5914909016938505 km
3 (-26.26, -48.66977) (-26.26373, -48.6607) 0.9958396130609087 km
-------------------------------------------------------------------------------
As of 19th May
For anyone working with multiple geolocation data sets, you can adapt the above code with a small change to read a CSV file from your data drive; the code below writes the output distances to a CSV file in the working folder.
import pandas as pd
from geopy import Point, distance

def get_distances(coords: pd.DataFrame,
                  col_lat='lat',
                  col_lon='lon',
                  point_obj=Point) -> pd.DataFrame:
    traces = len(coords) - 1
    distances = [None] * traces
    for i in range(traces):
        start = point_obj((coords.iloc[i][col_lat], coords.iloc[i][col_lon]))
        finish = point_obj((coords.iloc[i + 1][col_lat], coords.iloc[i + 1][col_lon]))
        distances[i] = {
            'start': start,
            'finish': finish,
            'path distance': distance.geodesic(start, finish),
        }
    output = pd.DataFrame(distances)
    output.to_csv('geopy_output.csv')
    return output
I used the same code and generated distance data for over 50,000 coordinates.
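A hedged sketch of that CSV round trip (the input file name 'coordinates.csv' and its lat/lon column names are assumptions):
# assumes get_distances() from the block above is already defined
coords = pd.read_csv('coordinates.csv')   # expected to contain 'lat' and 'lon' columns
distances = get_distances(coords)         # also writes geopy_output.csv as a side effect
print(distances.head())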

Python: generate distance matrix for large number of locations

I want to generate a 500x500 distance matrix based on the latitude and longitude of 500 locations, using the haversine formula.
Here is the sample data "coordinate.csv" for 10 locations:
Name,Latitude,Longitude
depot1,35.492807,139.6681689
depot2,33.6625572,130.4096027
depot3,35.6159881,139.7805445
customer1,35.622632,139.732631
customer2,35.857287,139.821461
customer3,35.955313,139.615387
customer4,35.16073,136.926239
customer5,36.118163,139.509548
customer6,35.937351,139.909783
customer7,35.949508,139.676462
After getting the distance matrix, I want to find the closest depot to each customer based on the distance matrix, and then save the output (distance from each customer to the closest depot & name of the closest depot) to a pandas DataFrame.
Expected outputs:
// Distance matrix
[ [..],[..],[..],[..],[..],[..],[..],[..],[..],[..] ]
// Closest depot to each customer (just an example)
Name,Latitude,Longitude,Distance_to_closest_depot,Closest_depot
depot1,35.492807,139.6681689,,
depot2,33.6625572,130.4096027,,
depot3,35.6159881,139.7805445,,
customer1,35.622632,139.732631,10,depot1
customer2,35.857287,139.821461,20,depot3
customer3,35.955313,139.615387,15,depot2
customer4,35.16073,136.926239,12,depot3
customer5,36.118163,139.509548,25,depot1
customer6,35.937351,139.909783,22,depot2
customer7,35.949508,139.676462,15,depot1
There are a couple of library functions that can help you with this:
cdist from scipy can be used to generate a distance matrix using whichever distance metric you like.
There is also a haversine function (from the haversine package) which you can pass to cdist as the metric.
After that it's just a case of finding the row-wise minimums from the distance matrix and adding them to your DataFrame. Full code below:
import pandas as pd
from scipy.spatial.distance import cdist
from haversine import haversine
df = pd.read_clipboard(sep=',')
df.set_index('Name', inplace=True)
customers = df[df.index.str.startswith('customer')]
depots = df[df.index.str.startswith('depot')]
dm = cdist(customers, depots, metric=haversine)
closest = dm.argmin(axis=1)
distances = dm.min(axis=1)
customers['Closest Depot'] = depots.index[closest]
customers['Distance'] = distances
Results:
Latitude Longitude Closest Depot Distance
Name
customer1 35.622632 139.732631 depot3 4.393506
customer2 35.857287 139.821461 depot3 27.084212
customer3 35.955313 139.615387 depot3 40.565820
customer4 35.160730 136.926239 depot1 251.466152
customer5 36.118163 139.509548 depot3 60.945377
customer6 35.937351 139.909783 depot3 37.587862
customer7 35.949508 139.676462 depot3 38.255776
As per comment, I have created an alternative solution which instead uses a square distance matrix. The original solution is better in my opinion, as the question stated that we only want to find the closest depot for each customer, so calculating distances between customers and between depots isn't necessary. However, if you need the square distance matrix for some other purpose, here is how you would create it:
import pandas as pd
import numpy as np
from scipy.spatial.distance import squareform, pdist
from haversine import haversine
df = pd.read_clipboard(sep=',')
df.set_index('Name', inplace=True)
dm = pd.DataFrame(squareform(pdist(df, metric=haversine)), index=df.index, columns=df.index)
np.fill_diagonal(dm.values, np.inf) # Makes it easier to find minimums
customers = df[df.index.str.startswith('customer')]
depots = df[df.index.str.startswith('depot')]
customers['Closest Depot'] = dm.loc[depots.index, customers.index].idxmin()
customers['Distance'] = dm.loc[depots.index, customers.index].min()
The final results are the same as before, except you now have a square distance matrix. You can put the 0s back on the diagonal after you have extracted the minimum values if you like:
np.fill_diagonal(dm.values, 0)
If you need a very big matrix and have access to an NVIDIA GPU with CUDA, you can use this numba function:
from numba import cuda
import math

@cuda.jit
def haversine_gpu_distance_matrix(p, G):
    i, j = cuda.grid(2)
    if i < p.shape[0] == G.shape[0] and j < p.shape[0] == G.shape[1]:
        if i == j:
            G[i][j] = 0
        else:
            longit_a = math.radians(p[i][0])
            latit_a = math.radians(p[i][1])
            longit_b = math.radians(p[j][0])
            latit_b = math.radians(p[j][1])
            dist_longit_add = longit_b - longit_a
            dist_latit_sub = latit_b - latit_a
            dist_latit_add = latit_b + latit_a
            pre_comp = math.sin(dist_latit_sub / 2) ** 2
            area = pre_comp + ((1 - pre_comp - math.sin(dist_latit_add / 2) ** 2) * math.sin(dist_longit_add / 2) ** 2)
            central_angle = 2 * math.asin(math.sqrt(area))
            radius = 3958
            G[i][j] = math.fabs(central_angle * radius)
You can call this function using the following commands:
# 10k [lon, lat] elements; replace this with your own [lon, lat] array
# if your data is in a pandas DataFrame, convert it to a NumPy array first
import numpy as np
geo_array = np.ones((10000, 2))

# allocate an empty distance matrix to fill when the kernel is called
dm_global_mem = cuda.device_array((geo_array.shape[0], geo_array.shape[0]))

# move the data in geo_array onto the GPU
geo_array_global_mem = cuda.to_device(geo_array)

# specify kernel dimensions; this can/should be further optimized for your hardware
threadsperblock = (16, 16)
blockspergrid_x = math.ceil(geo_array.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(geo_array.shape[0] / threadsperblock[1])  # the grid must cover all rows in both dimensions
blockspergrid = (blockspergrid_x, blockspergrid_y)

# run the kernel, which fills dm_global_mem in place
haversine_gpu_distance_matrix[blockspergrid, threadsperblock](geo_array_global_mem, dm_global_mem)
Note that this can be further optimized for your hardware. The runtime on a g4dn.xlarge instance for 10k geographical coordinate pairs (i.e. 100M distance measurements) is less than 0.01 seconds after compiling. The radius value is set so that the distance matrix is in miles; change it to 6371 if you want kilometres.
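To bring the finished matrix back into host memory after the kernel completes, a short follow-up using numba's standard device-array API (under the setup above):
# copy the filled distance matrix from GPU memory back to a NumPy array on the host
dm = dm_global_mem.copy_to_host()
print(dm.shape)   # (10000, 10000); values are in miles with radius = 3958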

How to iterate between two geo locations with a certain speed in Python(3)

I want to simulate movement on a real-world (spherical) map and show the current position on (google|openStreet) maps.
I have an initial lat/long pair, e.g. (51.506314, -0.088455), and want to move to e.g. (51.509359, -0.087221) at a certain speed by getting interpolated coordinates periodically.
Pseudocode for clarification:
loc_init = (51.506314, -0.088455)
loc_target = (51.509359, -0.087221)

move_path = Something.path(loc_init, loc_target, speed=50)

for loc in move_path.get_current_loc():
    map.move_to(loc)
    device.notify_new_loc(loc)
    ...
    time.sleep(1)
Retrieving the current interpolated position could happen in different ways, e.g. calculating it on a fixed refresh interval (1 s), or running a thread that continuously computes new positions.
Unfortunately I have never worked with geo data before and can't find anything useful on the internet. Maybe there is already a module or an implementation that does this?
Solved my problem:
I found geographiclib, a C++ library that has been ported to Python, which does exactly what I was looking for.
Example code to compute an inverse geodesic line and get positions for a specific distance:
from geographiclib.geodesic import Geodesic
import math

# define the WGS84 ellipsoid
geod = Geodesic.WGS84

loc_init = (51.501218, -0.093773)
loc_target = (51.511020, -0.086563)

g = geod.Inverse(loc_init[0], loc_init[1], loc_target[0], loc_target[1])
l = geod.InverseLine(loc_init[0], loc_init[1], loc_target[0], loc_target[1])

print("The distance is {:.3f} m.".format(g['s12']))

# interval in m for the interpolated line between the locations
interval = 500
step = int(math.ceil(l.s13 / interval))

for i in range(step + 1):
    if i == 0:
        print("distance latitude longitude azimuth")
    s = min(interval * i, l.s13)
    loc = l.Position(s, Geodesic.STANDARD | Geodesic.LONG_UNROLL)
    print("{:.0f} {:.5f} {:.5f} {:.5f}".format(
        loc['s12'], loc['lat2'], loc['lon2'], loc['azi2']))
Gives:
The distance is 1199.958 m.
distance latitude longitude azimuth
0 51.50122 -0.09377 24.65388
500 51.50530 -0.09077 24.65623
1000 51.50939 -0.08776 24.65858
1200 51.51102 -0.08656 24.65953
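Since the goal is movement at a certain speed, a hedged variation of the same approach steps along the geodesic line by speed × elapsed time instead of a fixed distance interval (the 50 m/s speed and the 1-second refresh are illustrative, not from the original):
import time
from geographiclib.geodesic import Geodesic

geod = Geodesic.WGS84
line = geod.InverseLine(51.501218, -0.093773, 51.511020, -0.086563)

speed = 50.0   # metres per second (assumed)
t = 0.0
while True:
    s = min(t * speed, line.s13)   # distance travelled so far, capped at the target
    pos = line.Position(s, Geodesic.STANDARD | Geodesic.LONG_UNROLL)
    print("{:.0f}s  {:.5f} {:.5f}".format(t, pos['lat2'], pos['lon2']))
    if s >= line.s13:
        break
    time.sleep(1)   # refresh period; update your map/device here instead
    t += 1.0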
