As seen in the picture, I have an outlier that I would like to remove (not the red point, but the green one above it, which is not aligned with the other points). My idea is to find each point's minimum distance to its neighbours and then eliminate the stray point, but given the huge dataset the code below takes an eternity to execute. I appreciate any solution that helps, thanks!
import math
# list of 11600 points
dataset = [[2478, 3534], [4217, 953], ......]   # 11600 points in total

copy_dataset = dataset
Indices = []
Min_Dists = []
Distance = []
Copy_Dist = []
for p1 in range(len(dataset)):
    p1_x = dataset[p1][0]
    p1_y = dataset[p1][1]
    for p2 in range(len(copy_dataset)):
        p2_x = copy_dataset[p2][0]
        p2_y = copy_dataset[p2][1]
        dist = math.sqrt((p1_x - p2_x) ** 2 + (p1_y - p2_y) ** 2)
        Distance.append(dist)
        Copy_Dist.append(dist)
    min_dist_1 = min(Distance)
    Distance.remove(min_dist_1)
    if min_dist_1 != 0:
        Min_Dists.append(min_dist_1)
        ind_1 = Copy_Dist.index(min_dist_1)
        Indices.append(ind_1)
    min_dist_2 = min(Distance)
    Distance.remove(min_dist_2)
    if min_dist_2 != 0:
        Min_Dists.append(min_dist_2)
        ind_2 = Copy_Dist.index(min_dist_2)
        Indices.append(ind_2)
    To_Remove = copy_dataset.index([p1_x, p1_y])
    copy_dataset.remove(copy_dataset[To_Remove])
Not sure how to solve this problem in general, but it's probably a lot faster to compute the distances in a vectorized fashion.
import numpy as np

dataset = np.asarray(dataset, dtype=float)    # convert the list of points to an (n, 2) array
dataset_copy = dataset[:, np.newaxis]         # shape (n, 1, 2), so the subtraction broadcasts
distance = np.sqrt(np.sum(np.square(dataset - dataset_copy), axis=-1))  # full (n, n) distance matrix
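A possible follow-up sketch (assuming the full n × n matrix fits in memory; for 11,600 points it is roughly 1 GB of float64): mask the zero self-distances, take the row-wise minimum to get each point's nearest-neighbour distance, and take the point whose nearest neighbour is farthest away as the outlier candidate.
np.fill_diagonal(distance, np.inf)       # ignore each point's zero distance to itself
nearest = distance.min(axis=1)           # nearest-neighbour distance for every point
suspect = dataset[np.argmax(nearest)]    # the point farthest from its nearest neighbour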
Thank you for the answers, mates! I tried the approach below to solve the issue and it worked pretty quickly.
import numpy as np
from statistics import mean
from scipy.spatial import distance

D = distance.squareform(distance.pdist(dataset))   # full pairwise distance matrix
closest = np.argsort(D, axis=1)                    # column 0 is the point itself (distance 0)

d1 = []
for i in range(len(dataset)):
    d1.append(D[i][closest[i][1]])                 # distance to the nearest neighbour
avg_dist = int(mean(d1))

outliers = []
for i in range(len(dataset)):
    d1 = D[i][closest[i][1]]                       # nearest neighbour
    d2 = D[i][closest[i][2]]                       # second-nearest neighbour
    if abs(avg_dist - d1) > 2 and abs(avg_dist - d2) > 2:
        print(dataset[i])
        outliers.append(dataset[i])

# remove after the loop, so the row indices of D and dataset stay in sync while checking
for point in outliers:
    dataset.remove(point)
If you need all distances at once:
import numpy as np
import scipy.spatial

distances = scipy.spatial.distance_matrix(dataset, dataset)
If you need distances of one point to all others:
for pt in dataset:
    distances = scipy.spatial.distance_matrix([pt], dataset)[0]
    # distances.min() will be 0 because the point has 0 distance to itself
    # the nearest neighbor will be the second element in sorted order
    indices = np.argpartition(distances, 1)  # or use argsort for a complete sort
    nearest_neighbor = indices[1]
Documentation: distance_matrix, argpartition
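A sketch of how this per-point loop might be used to flag the most isolated point without building the full n × n matrix in memory (assumes dataset is a list of [x, y] pairs, as in the question):
import numpy as np
import scipy.spatial

points = np.asarray(dataset, dtype=float)
nearest = np.empty(len(points))
for i, pt in enumerate(points):
    d = scipy.spatial.distance_matrix([pt], points)[0]   # distances from this point to all points
    nearest[i] = np.partition(d, 1)[1]                   # second-smallest value: the nearest neighbour
outlier_index = np.argmax(nearest)                       # point farthest from its nearest neighbour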
I am struggling to calculate the distance between multiple sets of latitude and longitude coordinates. In short, I have found numerous tutorials that use either math or geopy. These tutorials work great when I just want to find the distance between ONE pair of coordinates (or two unique locations). However, my objective is to scan a data set that has 400k combinations of origin and destination coordinates. One example of the code I have used is listed below, but it seems I am getting errors when my arrays have more than one record. Any helpful tips would be much appreciated. Thank you.
# starting dataframe is df
lat1 = df.lat1.as_matrix()
long1 = df.long1.as_matrix()
lat2 = df.lat2.as_matrix()
long2 = df.df_long2.as_matrix()
from geopy.distance import vincenty
point1 = (lat1, long1)
point2 = (lat2, long2)
print(vincenty(point1, point2).miles)
Edit: here's a simple notebook example
Here's a general approach, assuming that you have a DataFrame column containing points and you want to calculate distances between all of them. If you have separate columns, first combine them into (lat, lon) tuples, for instance, and name the new column coords.
import pandas as pd
import numpy as np
from geopy.distance import vincenty

# assumes your DataFrame is named df, and its lon and lat columns are named lon and lat. Adjust as needed.
df['coords'] = list(zip(df.lat, df.lon))   # list() so this also works on Python 3

# first, let's create a square DataFrame (think of it as a matrix if you like)
square = pd.DataFrame(
    np.zeros(len(df) ** 2).reshape(len(df), len(df)),
    index=df.index, columns=df.index)
This function looks up our 'end' coordinates from the df DataFrame using the input column's name, then applies the geopy vincenty() function to every entry of the df.coords column (as the first argument), passing end as the second argument. This works because the function is applied to the square DataFrame column-wise, from right to left.
def get_distance(col):
    end = df.ix[col.name]['coords']
    return df['coords'].apply(vincenty, args=(end,), ellipsoid='WGS-84')
Now we're ready to calculate all the distances.
We're transposing the DataFrame (.T) because the loc[] method we'll be using to retrieve distances refers to row label, then column label. However, our inner apply function (see above) populates a column with the retrieved values.
distances = square.apply(get_distance, axis=1).T
Your geopy values are (IIRC) returned in kilometres, so you may need to convert these to whatever unit you want via .meters, .miles, etc.
Something like the following should work:
def units(input_instance):
    return input_instance.meters

distances_meters = distances.applymap(units)
You can now index into your distance matrix using e.g. loc[row_index, column_index].
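A quick usage sketch (the row and column labels here are hypothetical; use whatever labels your DataFrame's index actually has):
# distance in metres between the rows labelled 0 and 3
d = distances_meters.loc[0, 3]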
You should be able to adapt the above fairly easily. You might have to adjust the apply call in the get_distance function to ensure you're passing the correct values to vincenty(). The pandas apply docs might be useful, in particular with regard to passing positional arguments using args (you'll need a recent pandas version for this to work).
This code hasn't been profiled, and there are probably much faster ways to do it, but it should be fairly quick for 400k distance calculations.
Oh and also
I can't remember whether geopy expects coordinates as (lon, lat) or (lat, lon). I bet it's the latter (sigh).
Update
Here's a working script as of May 2021.
import geopy.distance

# geopy DOES use (lat, lon) ordering
df['latlon'] = list(zip(df['lat'], df['lon']))

square = pd.DataFrame(
    np.zeros((df.shape[0], df.shape[0])),
    index=df.index, columns=df.index
)

# replacing distance.vincenty with distance.distance
def get_distance(col):
    end = df.loc[col.name, 'latlon']
    return df['latlon'].apply(geopy.distance.distance,
                              args=(end,),
                              ellipsoid='WGS-84')
distances = square.apply(get_distance, axis=1).T
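Each cell of distances now holds a geopy Distance object rather than a number; a minimal follow-up sketch, assuming you want plain kilometre values:
# convert every Distance object in the matrix to a float in kilometres
distances_km = distances.applymap(lambda d: d.km)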
I recently had to do a similar job; I ended up writing a solution I consider very easy to understand and tweak to your needs, but possibly not the best/fastest:
Solution
It is very similar to what urschrei posted: assuming you want the distance between every two consecutive coordinates in a Pandas DataFrame, we can write a function that processes each pair of points as the start and finish of a path, computes the distance, and then builds a new DataFrame to return:
import pandas as pd
from geopy import Point, distance

def get_distances(coords: pd.DataFrame,
                  col_lat='lat',
                  col_lon='lon',
                  point_obj=Point) -> pd.DataFrame:
    traces = len(coords) - 1
    distances = [None] * traces
    for i in range(traces):
        start = point_obj((coords.iloc[i][col_lat], coords.iloc[i][col_lon]))
        finish = point_obj((coords.iloc[i + 1][col_lat], coords.iloc[i + 1][col_lon]))
        distances[i] = {
            'start': start,
            'finish': finish,
            'path distance': distance.geodesic(start, finish),
        }
    return pd.DataFrame(distances)
Usage example
coords = pd.DataFrame({
    'lat': [-26.244333, -26.238000, -26.233880, -26.260000, -26.263730],
    'lon': [-48.640946, -48.644670, -48.648480, -48.669770, -48.660700],
})
print('-> coords DataFrame:\n', coords)
print('-'*79, end='\n\n')
distances = get_distances(coords)
distances['total distance'] = distances['path distance'].cumsum()
print('-> distances DataFrame:\n', distances)
print('-'*79, end='\n\n')
# Or if you want to use tuple for start/finish coordinates:
print('-> distances DataFrame using tuples:\n', get_distances(coords, point_obj=tuple))
print('-'*79, end='\n\n')
Output example
-> coords DataFrame:
lat lon
0 -26.244333 -48.640946
1 -26.238000 -48.644670
2 -26.233880 -48.648480
3 -26.260000 -48.669770
4 -26.263730 -48.660700
-------------------------------------------------------------------------------
-> distances DataFrame:
start finish \
0 26 14m 39.5988s S, 48 38m 27.4056s W 26 14m 16.8s S, 48 38m 40.812s W
1 26 14m 16.8s S, 48 38m 40.812s W 26 14m 1.968s S, 48 38m 54.528s W
2 26 14m 1.968s S, 48 38m 54.528s W 26 15m 36s S, 48 40m 11.172s W
3 26 15m 36s S, 48 40m 11.172s W 26 15m 49.428s S, 48 39m 38.52s W
path distance total distance
0 0.7941932910049856 km 0.7941932910049856 km
1 0.5943709651000332 km 1.3885642561050187 km
2 3.5914909016938505 km 4.980055157798869 km
3 0.9958396130609087 km 5.975894770859778 km
-------------------------------------------------------------------------------
-> distances DataFrame using tuples:
start finish path distance
0 (-26.244333, -48.640946) (-26.238, -48.64467) 0.7941932910049856 km
1 (-26.238, -48.64467) (-26.23388, -48.64848) 0.5943709651000332 km
2 (-26.23388, -48.64848) (-26.26, -48.66977) 3.5914909016938505 km
3 (-26.26, -48.66977) (-26.26373, -48.6607) 0.9958396130609087 km
-------------------------------------------------------------------------------
As of 19th May
For anyone working with multiple geolocation datasets, you can adapt the code above with a small modification to read a CSV file from your data drive; the code will then write the output distances to a CSV file (geopy_output.csv).
import pandas as pd
from geopy import Point, distance

def get_distances(coords: pd.DataFrame,
                  col_lat='lat',
                  col_lon='lon',
                  point_obj=Point) -> pd.DataFrame:
    traces = len(coords) - 1
    distances = [None] * traces
    for i in range(traces):
        start = point_obj((coords.iloc[i][col_lat], coords.iloc[i][col_lon]))
        finish = point_obj((coords.iloc[i + 1][col_lat], coords.iloc[i + 1][col_lon]))
        distances[i] = {
            'start': start,
            'finish': finish,
            'path distance': distance.geodesic(start, finish),
        }
    output = pd.DataFrame(distances)
    output.to_csv('geopy_output.csv')
    return output
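A usage sketch for the CSV workflow described above (the input file name and the lat/lon column names are assumptions; adjust them to your data):
coords = pd.read_csv('my_coordinates.csv')   # hypothetical file with 'lat' and 'lon' columns
distances = get_distances(coords)            # also writes geopy_output.csv as a side effect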
I used the same code and generated distance data for over 50,000 coordinates.
I have the following dataframe, where lat and lon are latitudes and longitudes in the geographic coordinate system. I am trying to convert these coordinates into a native (x, y) projection.
I have tried pyproj for single points, but how do I proceed for a whole dataframe with thousands of rows?
time lat lon
0 2011-01-31 02:41:00 18.504273 -66.009332
1 2011-01-31 02:42:00 18.504673 -66.006225
I am trying to get something like this:
time lat lon x_Projn y_Projn
0 2011-01-31 02:41:00 18.504273 -66.009332 resp_x_val resp_y_val
1 2011-01-31 02:42:00 18.504673 -66.006225 resp_x_val resp_y_val
and so on...
Following is the code I tried for converting lat/lon to an (x, y) system:
from pyproj import Proj, transform
inProj = Proj(init='epsg:4326')
outProj = Proj(init='epsg:3857')
x1,y1 = -105.150271116, 39.7278572773
x2,y2 = transform(inProj,outProj,x1,y1)
print (x2,y2)
Output:
-11705274.637407782 4826473.692203013
Thanks for any kind of help.
Unfortunately, pyproj only converts point by point. I guess something like this should work:
import pandas as pd
from pyproj import Proj, transform
inProj = Proj(init='epsg:4326')
outProj = Proj(init='epsg:3857')
def towgs84(row):
    # note: with Proj(init='epsg:4326'), transform expects (lon, lat) order, as in the example above
    return pd.Series(transform(inProj, outProj, row["lon"], row["lat"]))

wsg84_df = df.apply(towgs84, axis=1)  # new coord dataframe with two columns
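To get the output shape the question asks for (x_Projn and y_Projn columns alongside the originals), a small follow-up sketch, assuming the apply above succeeded:
wsg84_df.columns = ['x_Projn', 'y_Projn']    # name the two result columns
df = pd.concat([df, wsg84_df], axis=1)       # attach them to the original dataframe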
You can iterate through the rows in a pandas DataFrame, transform the longitude and latitude values for each row, collect the first and second coordinate values into two lists, and then turn those lists into new columns in your original DataFrame. Maybe not the prettiest, but this got the job done for me.
from pyproj import Proj, transform

M1s = []  # initiate empty list for 1st coordinate value
M2s = []  # initiate empty list for 2nd coordinate value
for index, row in df.iterrows():  # iterate over rows in the dataframe
    long = row["Longitude (decimal degrees)"]  # get the longitude for one row
    lat = row["Latitude (decimal degrees)"]    # get the latitude for one row
    # transform once per row; the result is the (1st coordinate, 2nd coordinate) pair
    M1, M2 = transform(Proj(init='epsg:4326'), Proj(init='epsg:3857'), long, lat)
    M1s.append(M1)  # append 1st coordinate to list
    M2s.append(M2)  # append 2nd coordinate to list
df['M1'] = M1s  # new dataframe column with 1st coordinate
df['M2'] = M2s  # new dataframe column with 2nd coordinate
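As a side note: if your installed pyproj is version 2 or newer, its Transformer API accepts whole arrays, which avoids the per-row loop entirely. A hedged sketch (the lat/lon column names follow the question's dataframe):
from pyproj import Transformer

# build the transformer once; always_xy=True means arguments are given as (lon, lat) -> (x, y)
transformer = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)
df['x_Projn'], df['y_Projn'] = transformer.transform(df['lon'].values, df['lat'].values)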
I have datasets that look like the following: data0, data1, data2 (analogous to time versus voltage data)
If I load and plot the datasets using code like:
import pandas as pd
import numpy as np
from scipy import signal
from matplotlib import pylab as plt
data0 = pd.read_csv('data0.csv')
data1 = pd.read_csv('data1.csv')
data2 = pd.read_csv('data2.csv')
plt.plot(data0.x, data0.y, data1.x, data1.y, data2.x, data2.y)
I get something like:
Now I try to correlate data0 with data1:
shft01 = np.argmax(signal.correlate(data0.y, data1.y)) - len(data1.y)
print(shft01)
plt.figure()
plt.plot(data0.x, data0.y,
         data1.x.shift(-shft01), data1.y)
fig = plt.gcf()
with output:
-99
and a plot where the curves line up just as expected. But if I try the same thing with data2, the curves do not overlap, and the computed shift is a positive 410. I think I am just not understanding how pd.shift() works, but I was hoping that I could use pd.shift() to align my data sets. As far as I understand, the return value from correlate() tells me how far off my data sets are, so I should be able to use shift() to overlap them.
pandas.shift() is not the correct method to shift a curve along the x-axis. You should adjust the x values of the points instead:
plt.plot(data0.x, data0.y)
for target in [data1, data2]:
    dx = np.mean(np.diff(data0.x.values))
    shift = (np.argmax(signal.correlate(data0.y, target.y)) - len(target.y)) * dx
    plt.plot(target.x + shift, target.y)
here is the output:
#HYRY one correction to your answer: there is an indexing mismatch between len(), which is one-based, and np.argmax(), which is zero-based. The line should read:
shift = (np.argmax(signal.correlate(data0.y, target.y)) - (len(target.y)-1)) * dx
For example, in the case where your signals are already aligned:
len(target.y) = N (one-based)
The cross-correlation function has length 2N-1, so the center value, for aligned data, is:
np.argmax(signal.correlate(data0.y, target.y)) = N - 1 (zero-based)
shift = ((N-1) - N) * dx = (-1) * dx, when we really want 0 * dx
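A quick numerical check of this off-by-one, using a tiny made-up signal correlated with itself (so the true shift is zero):
import numpy as np
from scipy import signal

y = np.array([0., 1., 3., 1., 0.])     # hypothetical signal, aligned with itself
N = len(y)                             # 5
corr = signal.correlate(y, y)          # full cross-correlation, length 2N - 1 = 9
peak = np.argmax(corr)                 # 4, i.e. N - 1 for aligned signals
print(peak - (N - 1))                  # 0, the corrected formula
print(peak - N)                        # -1, the off-by-one in the original formula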