calculate distance between latitude longitude columns for pandas data frame - python

I am trying to apply this code:
import h3
coords_1 = (52.2296756, 21.0122287)
coords_2 = (52.406374, 16.9251681)
distance = h3.point_dist(coords_1, coords_2, unit='km')
distance
to a pandas dataframe. This is my non-working attempt:
data = {'lat1':[52.2296756],
'long1':[21.0122287],
'lat2':[52.406374],
'long2':[16.9251681],
}
df = pd.DataFrame(data)
df
df['distance'] = = h3.point_dist((df['lat1'], df['long1']), (df['lat2'], df['long2']), unit='km')
Any help would be very much appreciated. Thanks!

Assuming you have more than a single row for which you would like to compute the distance, you can use apply as follows:
df['Dist'] = df.apply(lambda row: h3.point_dist((row['lat1'], row['long1']), (row['lat2'], row['long2'])), axis=1)
which will add a column to your dataframe similar to the following:
lat1 long1 lat2 long2 Dist
0 52.229676 21.012229 52.406374 16.925168 2.796556
1 57.229676 30.001176 48.421365 17.256314 6.565542
Please note that my distance values may not agree with yours, since I used a dummy function in place of h3.point_dist for the computation.
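If the dataframe is large, a row-wise apply can be slow. A minimal vectorized sketch, bypassing h3 in favour of a plain NumPy haversine formula (the 6371 km Earth radius is an assumption):
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    # convert degrees to radians and apply the haversine formula
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

df['Dist'] = haversine_km(df['lat1'], df['long1'], df['lat2'], df['long2'])
This works on whole columns at once, so no per-row Python call is needed.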

It works; you just need to delete the second "=":
data = {'lat1':[52.2296756],
'long1':[21.0122287],
'lat2':[52.406374],
'long2':[16.9251681],
}
df = pd.DataFrame(data)
df
df['distance'] = h3.point_dist((df['lat1'], df['long1']), (df['lat2'], df['long2']), unit='km')

Related

Matching geographic coordinates between two data frames

I have two data frames that have Longitude and Latitude columns.
DF1 and DF2:
DF1 = pd.DataFrame([[19.827658,-20.372238,8614], [19.825407,-20.362608,7412], [19.081514,-17.134456,8121]], columns=['Longitude1', 'Latitude1','Echo_top_height'])
DF2 = pd.DataFrame([[19.083727, -17.151207, 285.319994], [19.169403, -17.154144, 284.349994], [19.081514,-17.154456, 285.349994]], columns=['Longitude2', 'Latitude2','BT'])
I need to match each long/lat pair in DF1 with a long/lat pair in DF2, and where they match, add the corresponding value from the BT column of DF2 to DF1.
I used the code from here and managed to check if there is a match:
from sklearn.metrics.pairwise import haversine_distances
import numpy as np

threshold = 5000        # meters
earth_radius = 6371000  # meters

DF1['nearby'] = (
    # get the distance between all points of each DF
    haversine_distances(
        # note that you need to convert to radians with *np.pi/180
        X=DF1[['Latitude1', 'Longitude1']].to_numpy() * np.pi / 180,
        Y=DF2[['Latitude2', 'Longitude2']].to_numpy() * np.pi / 180)
    * earth_radius < threshold).any(axis=1).astype(int)
So the result I need would look like this:
Longitude1 Latitude1 Echo_top_height BT
19.82 -20.37 8614 290.345
19.82 -20.36 7412 289.235
and so on...
You can use BallTree:
# Update: for newer versions of sklearn
from sklearn.neighbors import BallTree
from sklearn.metrics import DistanceMetric
# older versions: from sklearn.neighbors import BallTree, DistanceMetric

# build the tree on the DF2 points
coords = np.radians(DF2[['Latitude2', 'Longitude2']])
dist = DistanceMetric.get_metric('haversine')
tree = BallTree(coords, metric=dist)

# query it with the DF1 points
coords = np.radians(DF1[['Latitude1', 'Longitude1']])
distances, indices = tree.query(coords, k=1)

DF1['BT'] = DF2['BT'].iloc[indices.flatten()].values
DF1['Distance'] = distances.flatten()
Output:
   Longitude1  Latitude1  Echo_top_height      BT     Distance
0     19.8277   -20.3722             8614  284.35    0.0572097
1     19.8254   -20.3626             7412  284.35    0.0570377
2     19.0815   -17.1345             8121  285.32  0.000294681
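Note that BallTree with the haversine metric works in radians, so the Distance column above is in radians too; to get meters, multiply by the Earth radius, e.g. (reusing the earth_radius value from the question):
DF1['Distance_m'] = DF1['Distance'] * 6371000  # radians * Earth radius (m) -> meters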
It looks like you are comparing the dataframes by index, so you can use join and drop the unnecessary rows and columns:
DF3 = DF1.join(DF2[['BT']])
DF3 = DF3[DF3['nearby'].eq(1)].drop('nearby', axis=1)
DF3
full reproducible code:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import haversine_distances
DF1 = pd.DataFrame([[19.827658,-20.372238,8614], [19.825407,-20.362608,7412], [19.081514,-17.134456,8121]], columns=['Longitude1', 'Latitude1','Echo_top_height'])
DF2 = pd.DataFrame([[19.083727, -17.151207, 285.319994], [19.169403, -17.154144, 284.349994], [19.081514,-17.154456, 285.349994]], columns=['Longitude2', 'Latitude2','BT'])
DF1, DF2
threshold = 5000 # meters
earth_radius = 6371000 # meters
DF1['nearby'] = (
    # get the distance between all points of each DF
    haversine_distances(
        # note that you need to convert to radians with *np.pi/180
        X=DF1[['Latitude1', 'Longitude1']].to_numpy() * np.pi / 180,
        Y=DF2[['Latitude2', 'Longitude2']].to_numpy() * np.pi / 180)
    * earth_radius < threshold).any(axis=1).astype(int)
DF3 = DF1.join(DF2[['BT']])
DF3 = DF3[DF3['nearby'].eq(1)].drop('nearby', axis=1)
DF3
Output:
Out[1]:
Longitude1 Latitude1 Echo_top_height BT
2 19.081514 -17.134456 8121 285.349994
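If you prefer the BallTree route but want the 5 km threshold match from the question rather than just the nearest neighbour, a sketch using tree.query_radius on the tree built in the BallTree answer above (the threshold must be converted to radians by dividing by the Earth radius):
threshold_rad = 5000 / 6371000  # meters -> radians on the unit sphere
counts = tree.query_radius(np.radians(DF1[['Latitude1', 'Longitude1']]),
                           r=threshold_rad, count_only=True)
DF1['nearby'] = (counts > 0).astype(int)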

Ordinary kriging questions

Really a newbie here. I would like to do ordinary kriging on missing rainfall values. Here is my code:
from pykrige.ok import OrdinaryKriging
import numpy as np
import pandas as pd

fname = "C:/Users/Tan/Desktop/sample1.csv"
df = pd.read_csv(fname)
fname1 = "C:/Users/Tan/Desktop/sample2.csv"
df1 = pd.read_csv(fname1)

z = []
ss = []
for column in df1:
    data = df1[column]
    complete = []
    lon1 = []
    lat1 = []
    lon2 = []
    lat2 = []
    for i in range(0, len(df)):
        if data[i] != "":
            complete.append(data[i])
            lon1.append(df['longitude'][i])
            lat1.append(df['latitude'][i])
        else:
            lon2.append(df['longitude'][i])
            lat2.append(df['latitude'][i])
    OK = OrdinaryKriging(lon1, lat1, complete, variogram_model='linear', verbose=False,
                         enable_plotting=False, coordinates_type='geographic')
    z, ss = OK.execute('grid', lon2, lat2)
    z.append(z)
But I keep receiving [ValueError: zero-size array to reduction operation maximum which has no identity].
Please advise if there is another better way to solve this question. Thanks!
Remove Null values from your data and then try again. I hope this will work for you. Thanks!
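To make that concrete: the error means that for at least one column every value fell into the same branch, so OrdinaryKriging received empty arrays. pandas reads blank CSV cells as NaN, not as "", so the data[i] != "" test never detects them. A hedged sketch of the loop using a NaN-aware mask (this assumes df and df1 share the same row order, as in the question, and uses pykrige's 'points' style to evaluate only at the missing locations):
results = []  # collect the kriged values instead of appending z to itself
for column in df1:
    data = df1[column]
    known = data.notna()  # NaN-aware test for available rainfall values
    if known.sum() == 0 or known.all():
        continue  # nothing to fit, or nothing to fill
    OK = OrdinaryKriging(df['longitude'][known], df['latitude'][known], data[known],
                         variogram_model='linear', verbose=False,
                         enable_plotting=False, coordinates_type='geographic')
    z, ss = OK.execute('points', df['longitude'][~known], df['latitude'][~known])
    results.append(z)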

Statsmodels OLS with rolling window problem

I would like to do a regression with a rolling window, but I got only one parameter back after the regression:
rolling_beta = sm.OLS(X2, X1, window_type='rolling', window=30).fit()
rolling_beta.params
The result:
X1 5.715089
dtype: float64
What could be the problem?
Thanks in advance, Roland
I think the problem is that the parameters window_type='rolling' and window=30 simply do not do anything. First I'll show you why, and at the end I'll provide a setup I've got lying around for linear regressions on rolling windows.
1. The problem with your function:
Since you haven't provided some sample data, here's a function that returns a dataframe of a desired size with some random numbers:
# Function to build synthetic data
import numpy as np
import pandas as pd
import statsmodels.api as sm
from collections import OrderedDict

def sample(rSeed, periodLength, colNames):
    np.random.seed(rSeed)
    date = pd.to_datetime("1st of Dec, 1999")
    cols = OrderedDict()
    for col in colNames:
        cols[col] = np.random.normal(loc=0.0, scale=1.0, size=periodLength)
    dates = date + pd.to_timedelta(np.arange(periodLength), 'D')
    df = pd.DataFrame(cols, index=dates)
    return df
Output:
X1 X2
2018-12-01 -1.085631 -1.294085
2018-12-02 0.997345 -1.038788
2018-12-03 0.282978 1.743712
2018-12-04 -1.506295 -0.798063
2018-12-05 -0.578600 0.029683
.
.
.
2019-01-17 0.412912 -1.363472
2019-01-18 0.978736 0.379401
2019-01-19 2.238143 -0.379176
Now, try:
rolling_beta = sm.OLS(df['X2'], df['X1'], window_type='rolling', window=30).fit()
rolling_beta.params
Output:
X1 -0.075784
dtype: float64
And this at least represents the structure of your output too, meaning that you're expecting an estimate for each of your sample windows, but instead you get a single estimate. So I looked around for some other examples using the same function online and in the statsmodels docs, but I was unable to find specific examples that actually worked. What I did find were a few discussions talking about how this functionality was deprecated a while ago. So then I tested the same function with some bogus input for the parameters:
rolling_beta = sm.OLS(df['X2'], df['X1'], window_type='amazing', window=3000000).fit()
rolling_beta.params
Output:
X1 -0.075784
dtype: float64
And as you can see, the estimates are the same, and no error messages are returned for the bogus input. So I suggest that you take a look at the function below. This is something I've put together to perform rolling regression estimates.
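Incidentally, newer statsmodels releases ship a maintained rolling estimator, statsmodels.regression.rolling.RollingOLS; a minimal sketch, assuming statsmodels 0.11 or later:
from statsmodels.regression.rolling import RollingOLS

rolling_res = RollingOLS(df['X2'], sm.add_constant(df['X1']), window=30).fit()
rolling_res.params.tail()  # one row of coefficient estimates per window end
The first window-1 rows of params are NaN, since no full window is available there yet.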
2. A function for regressions on rolling windows of a pandas dataframe
df = sample(rSeed = 123, colNames = ['X1', 'X2', 'X3'], periodLength = 50)
def RegressionRoll(df, subset, dependent, independent, const, win, parameters):
    """
    RegressionRoll takes a dataframe, makes a subset of the data if you like,
    and runs a series of regressions with a specified window length, and
    returns a dataframe with BETA or R^2 for each window split of the data.

    Parameters:
    ===========
    df: pandas dataframe
    subset: integer - has to be smaller than the size of the df
    dependent: string that specifies the name of the dependent variable
    independent: LIST of strings that specifies the names of the independent variables
    const: boolean - whether or not to include a constant term
    win: integer - window length of each model
    parameters: string that specifies which model parameters to return:
                BETA or R^2

    Example:
    ========
    RegressionRoll(df=df, subset = 50, dependent = 'X1', independent = ['X2'],
                   const = True, parameters = 'beta', win = 30)
    """
    # Data subset
    if subset != 0:
        df = df.tail(subset)

    # Loop info
    end = df.shape[0]
    rng = np.arange(start=win, stop=end, step=1)

    # Subset and store dataframes
    frames = {}
    n = 1
    for i in rng:
        df_temp = df.iloc[:i].tail(win)
        newname = 'df' + str(n)
        frames.update({newname: df_temp})
        n += 1

    # Analysis on subsets
    df_results = pd.DataFrame()
    for frame in frames:
        # Rolling data frames
        dfr = frames[frame]
        y = dependent
        x = independent
        if const == True:
            x = sm.add_constant(dfr[x])
            model = sm.OLS(dfr[y], x).fit()
        else:
            model = sm.OLS(dfr[y], dfr[x]).fit()
        if parameters == 'beta':
            theParams = model.params[0:]
            coefs = theParams.to_frame()
            df_temp = pd.DataFrame(coefs.T)
            indx = dfr.tail(1).index[-1]
            df_temp['Date'] = indx
            df_temp = df_temp.set_index(['Date'])
        if parameters == 'R2':
            theParams = model.rsquared
            df_temp = pd.DataFrame([theParams])
            indx = dfr.tail(1).index[-1]
            df_temp['Date'] = indx
            df_temp = df_temp.set_index(['Date'])
            df_temp.columns = [', '.join(independent)]
        df_results = pd.concat([df_results, df_temp], axis=0)

    return df_results
df_rolling = RegressionRoll(df=df, subset=50, dependent='X1', independent=['X2'],
                            const=True, parameters='beta', win=30)
Output: a dataframe with beta estimates for the OLS of X1 on X2 for each 30-period window of the data.
const X2
Date
2018-12-30 0.044042 0.032680
2018-12-31 0.074839 -0.023294
2019-01-01 -0.063200 0.077215
.
.
.
2019-01-16 -0.075938 -0.215108
2019-01-17 -0.143226 -0.215524
2019-01-18 -0.129202 -0.170304

How to check that a point is inside the given radius?

I have the following code that takes a very long time to execute. The pandas DataFrames df and df_plants are very small (less than 1 MB). I wonder if there is any way to optimise this code:
import pandas as pd
import geopy.distance
import re

def is_inside_radius(latitude, longitude, df_plants, radius):
    if latitude != None and longitude != None:
        lat = float(re.sub("[a-zA-Z]", "", str(latitude)))
        lon = float(re.sub("[a-zA-Z]", "", str(longitude)))
        for index, row in df_plants.iterrows():
            coords_1 = (lat, lon)
            coords_2 = (row["latitude"], row["longitude"])
            dist = geopy.distance.distance(coords_1, coords_2).km
            if dist <= radius:
                return 1
    return 0

df["inside"] = df.apply(lambda row: is_inside_radius(row["latitude"], row["longitude"], df_plants, 10), axis=1)
I use regex to process latitude and longitude in df because the values contain some errors (characters) which should be deleted.
The function is_inside_radius verifies whether row["latitude"] and row["longitude"] are within a radius of 10 km of any of the points in df_plants.
Can you try this?
import pandas as pd
from geopy import distance
import re

def is_inside_radius(latitude, longitude, df_plants, radius):
    if latitude != None and longitude != None:
        lat = float(re.sub("[a-zA-Z]", "", str(latitude)))
        lon = float(re.sub("[a-zA-Z]", "", str(longitude)))
        coords_1 = (lat, lon)
        # itertuples yields namedtuples, so fields are accessed as attributes
        for row in df_plants.itertuples():
            coords_2 = (row.latitude, row.longitude)
            if distance.distance(coords_1, coords_2).km <= radius:
                return 1
    return 0

df["inside"] = df.apply(      # note: apply with axis=1, not map
    lambda row: is_inside_radius(
        row["latitude"],
        row["longitude"],
        df_plants,
        10),
    axis=1)
From https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html#pandas-dataframe-iterrows, pandas.DataFrame.itertuples() returns namedtuples of the values, which is generally faster than pandas.DataFrame.iterrows() and preserves dtypes across the returned rows.
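If the loop is still too slow, you can also vectorize the whole check and drop geopy: compute all pairwise great-circle distances at once, much like the haversine_distances approach shown earlier on this page. A sketch, assuming the latitude/longitude columns have already been cleaned to floats (the regex step from the question would have to run first):
import numpy as np
from sklearn.metrics.pairwise import haversine_distances

earth_radius_km = 6371.0
X = np.radians(df[["latitude", "longitude"]].to_numpy())
Y = np.radians(df_plants[["latitude", "longitude"]].to_numpy())
dist_km = haversine_distances(X, Y) * earth_radius_km  # shape (len(df), len(df_plants))
df["inside"] = (dist_km <= 10).any(axis=1).astype(int)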
I've encountered such a problem before, and I see one simple optimisation: try to avoid the floating point calculation as much as possible, which you can do as follows:
Imagine:
You have a circle, defined by Mx and My (center coordinates) and R (radius).
You have a point, defined by its coordinates X and Y.
If your point (X,Y) is not even within the square, defined by (Mx, My) and size 2*R, then it will also not be within the circle, defined by (Mx, My) and radius R.
In Python (a sketch; X, Y, Mx, My and R must all be in the same planar units):
def is_inside(X, Y, Mx, My, R):
    # cheap bounding-box rejection: outside the square means outside the circle
    if abs(Mx - X) >= R or abs(My - Y) >= R:
        return False
    # only here do you perform the exact floating point calculation
    return (Mx - X) ** 2 + (My - Y) ** 2 <= R ** 2

Appending function created column to an existing data frame

I currently have a dataframe as below:
and wish to add a column, E, that is calculated based on the following function.
def geometric_brownian_motion(T=1, N=100, mu=0.1, sigma=0.01, S0=20):
    dt = float(T) / N
    t = np.linspace(0, T, N)
    W = np.random.standard_normal(size=N)
    W = np.cumsum(W) * np.sqrt(dt)  ### standard brownian motion ###
    X = (mu - 0.5 * sigma**2) * t + sigma * W
    S = S0 * np.exp(X)  ### geometric brownian motion ###
    return S
(originating from here)
How do I create a time series for all of the dates contained within the dataframe and append it?
The function input parameters are as follows:
T = (#days between df row 1 and df last)/365
N = # rows in data frame
S0 = 100
As I understand it, the essence of the question is how to apply some method to every row, taking into account that to calculate the new value you need the index from the dataframe.
I suggest you extract the index as a separate column and use apply as usual.
from functools import partial

df['index'] = df.index
T = ...  # precalculate T here
N = df.shape[0]
applying_method = partial(geometric_brownian_motion, T=T, N=N, S0=100)
df['E'] = df.apply(lambda row: applying_method(*row), axis=1)
Or, if you rename the columns of the dataframe according to your function's arguments:
df['E'] = df.apply(lambda row: applying_method(**row), axis=1)
Hope that helps.
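Note that geometric_brownian_motion already returns a full path of N values, so if the goal is one simulated value per date, a simpler sketch is to call it once and assign the whole array (this assumes the frame has a DatetimeIndex, matching the T and N definitions given in the question):
N = len(df)
T = (df.index[-1] - df.index[0]).days / 365  # days between first and last row / 365
df['E'] = geometric_brownian_motion(T=T, N=N, S0=100)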
