Matching geographic coordinates between two data frames - python

I have two data frames that have Longitude and Latitude columns.
DF1 and DF2:
DF1 = pd.DataFrame([[19.827658,-20.372238,8614], [19.825407,-20.362608,7412], [19.081514,-17.134456,8121]], columns=['Longitude1', 'Latitude1','Echo_top_height'])
DF2 = pd.DataFrame([[19.083727, -17.151207, 285.319994], [19.169403, -17.154144, 284.349994], [19.081514,-17.154456, 285.349994]], columns=['Longitude2', 'Latitude2','BT'])
I need to match each longitude/latitude pair in DF1 with a longitude/latitude pair in DF2, and where they match, add the corresponding value from the BT column of DF2 to DF1.
I used the code from here and managed to check if there is a match:
import numpy as np
from sklearn.metrics.pairwise import haversine_distances

threshold = 5000        # meters
earth_radius = 6371000  # meters

DF1['nearby'] = (
    # get the distance between all points of each DF;
    # note that you need to convert degrees to radians with *np.pi/180
    haversine_distances(
        X=DF1[['Latitude1', 'Longitude1']].to_numpy() * np.pi / 180,
        Y=DF2[['Latitude2', 'Longitude2']].to_numpy() * np.pi / 180)
    * earth_radius < threshold).any(axis=1).astype(int)
So the result I need would look like this:
Longitude1 Latitude1 Echo_top_height BT
19.82 -20.37 8614 290.345
19.82 -20.36 7412 289.235
and so on...

You can use BallTree:
# Update: for newer versions of sklearn
from sklearn.neighbors import BallTree
from sklearn.metrics import DistanceMetric
# for older versions of sklearn:
# from sklearn.neighbors import BallTree, DistanceMetric

# build the tree on DF2's coordinates
coords = np.radians(DF2[['Latitude2', 'Longitude2']])
dist = DistanceMetric.get_metric('haversine')
tree = BallTree(coords, metric=dist)

# query the tree with DF1's coordinates
coords = np.radians(DF1[['Latitude1', 'Longitude1']])
distances, indices = tree.query(coords, k=1)

DF1['BT'] = DF2['BT'].iloc[indices.flatten()].values
DF1['Distance'] = distances.flatten()
Output:

   Longitude1  Latitude1  Echo_top_height      BT     Distance
0     19.8277   -20.3722             8614  284.35  0.0572097
1     19.8254   -20.3626             7412  284.35  0.0570377
2     19.0815   -17.1345             8121  285.32  0.000294681
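Note that with the haversine metric, the distances returned by tree.query are great-circle angles in radians, not meters. If you need meters, scale by the Earth's radius, e.g.:

earth_radius = 6371000  # meters
DF1['Distance_m'] = distances.flatten() * earth_radius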

It looks like you are comparing the dataframes by index, so you can use join and drop the unnecessary rows and columns:
DF3 = DF1.join(DF2[['BT']])
DF3 = DF3[DF3['nearby'].eq(1)].drop('nearby', axis=1)
DF3
Full reproducible code:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import haversine_distances
DF1 = pd.DataFrame([[19.827658,-20.372238,8614], [19.825407,-20.362608,7412], [19.081514,-17.134456,8121]], columns=['Longitude1', 'Latitude1','Echo_top_height'])
DF2 = pd.DataFrame([[19.083727, -17.151207, 285.319994], [19.169403, -17.154144, 284.349994], [19.081514,-17.154456, 285.349994]], columns=['Longitude2', 'Latitude2','BT'])
DF1, DF2
threshold = 5000        # meters
earth_radius = 6371000  # meters

DF1['nearby'] = (
    # get the distance between all points of each DF;
    # note that you need to convert degrees to radians with *np.pi/180
    haversine_distances(
        X=DF1[['Latitude1', 'Longitude1']].to_numpy() * np.pi / 180,
        Y=DF2[['Latitude2', 'Longitude2']].to_numpy() * np.pi / 180)
    * earth_radius < threshold).any(axis=1).astype(int)
DF3 = DF1.join(DF2[['BT']])
DF3 = DF3[DF3['nearby'].eq(1)].drop('nearby', axis=1)
DF3
Output:
Longitude1 Latitude1 Echo_top_height BT
2 19.081514 -17.134456 8121 285.349994
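A sketch combining the two answers (my combination, not part of either answer as written): use the BallTree nearest-neighbour query, but only keep BT where the nearest DF2 point lies within the 5 km threshold:

import numpy as np
from sklearn.neighbors import BallTree

earth_radius = 6371000  # meters
threshold = 5000        # meters

tree = BallTree(np.radians(DF2[['Latitude2', 'Longitude2']]), metric='haversine')
distances, indices = tree.query(np.radians(DF1[['Latitude1', 'Longitude1']]), k=1)

# haversine distances come back in radians; convert to meters before thresholding
dist_m = distances.flatten() * earth_radius
DF1['BT'] = np.where(dist_m < threshold,
                     DF2['BT'].iloc[indices.flatten()].values,
                     np.nan)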

Related

calculate distance between latitude longitude columns for pandas data frame

I am trying to apply this code:
import h3
coords_1 = (52.2296756, 21.0122287)
coords_2 = (52.406374, 16.9251681)
distance = h3.point_dist(coords_1, coords_2, unit='km')
distance
to a pandas dataframe. This is my attempt, which does not work:
data = {'lat1': [52.2296756],
        'long1': [21.0122287],
        'lat2': [52.406374],
        'long2': [16.9251681],
        }
df = pd.DataFrame(data)
df
df['distance'] = = h3.point_dist((df['lat1'], df['long1']), (df['lat2'], df['long2']), unit='km')
Any help would be very much appreciated. Thanks!
Assuming you have more than a single row for which you would like to compute the distance, you can use apply as follows:
df['Dist'] = df.apply(lambda row: h3.point_dist((row['lat1'], row['long1']), (row['lat2'], row['long2'])), axis=1)
This will add a column to your dataframe similar to the following:
lat1 long1 lat2 long2 Dist
0 52.229676 21.012229 52.406374 16.925168 2.796556
1 57.229676 30.001176 48.421365 17.256314 6.565542
Please note, my distance values may not agree with yours, since I used a dummy function for the h3.point_dist computation.
It works; you just need to delete the second "=":
data = {'lat1': [52.2296756],
        'long1': [21.0122287],
        'lat2': [52.406374],
        'long2': [16.9251681],
        }
df = pd.DataFrame(data)
df
df['distance'] = h3.point_dist((df['lat1'], df['long1']), (df['lat2'], df['long2']), unit='km')
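If h3 is not a hard requirement, a vectorized haversine in plain numpy avoids the per-row apply entirely (a sketch, not from the answers above; the 6371 km Earth radius is the usual approximation):

import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    # inputs in degrees; returns the great-circle distance in km
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

df['distance'] = haversine_km(df['lat1'], df['long1'], df['lat2'], df['long2'])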

Plot distribution of differences between two pandas dataframe columns

I have a pandas dataframe which has columns A and B.
I just want to plot a distribution graph of the percentage difference between columns A and B.
A B
1 1.051990e+10 1.051990e+04
2 1.051990e+10 1.051990e+04
5 4.841800e+10 1.200000e+10
8 2.327700e+10 2.716000e+10
9 1.204900e+10 2.100000e+08
The distribution graph would show, for example, how many records have a 10% difference, how many have a 20% difference, and so on.
I tried as follows
def percCal(x, y):
    return (x - y) * 100 / x

df['perc'] = df.apply(lambda x: percCal(df['A'], df['B']), axis=1)
This is not working; as I'm a newbie, please help.
You don't need the lambda operation.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df1 = pd.DataFrame(np.random.randint(1, 10, (20, 2)), columns=['A', 'B'])
def percCal(x, y):
    return (x - y) * 100 / x

Apply the function and plot:
df1['diff'] = percCal(df1['A'], df1['B'])
df1['diff'].plot(kind='density')
Alternatively, just manipulate the columns directly:
df1['diff'] = (df1['A'] - df1['B']) * 100 / df1['A']
df['perc'] = (df['A'] - df['B']) * 100 / df['A']

def percCal(x, y):
    return (x - y) * 100 / x

df['perc'] = df.apply(lambda x: percCal(x['A'], x['B']), axis=1)
Change df in the lambda to x: you are supposed to give the function the row data x, i.e. pass percCal what is in each row of the data frame. When you use df you are actually passing the whole data frame, so the function returns a data frame rather than a single value. But please check your code: if x in the function can be 0, you have a division-by-zero problem.
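As a minimal sketch of that zero guard (my addition, not part of the answer above), compute the column vectorized and turn the infinities produced by zero denominators into NaN:

import numpy as np

df['perc'] = ((df['A'] - df['B']) * 100 / df['A']).replace([np.inf, -np.inf], np.nan)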
Think this is what you are looking for:
import pandas as pd
import seaborn as sns

# Dummy df
data = [
    [1.051990e+10, 1.051990e+04],
    [1.051990e+10, 1.051990e+04],
    [4.841800e+10, 1.200000e+10],
    [2.327700e+10, 2.716000e+10],
    [1.204900e+10, 2.100000e+08],
]
cols = ['A', 'B']
df2 = pd.DataFrame(data, columns=cols)

# Solution
df2['pct_diff'] = (df2['A'] - df2['B']) / df2['A']
sns.distplot(df2['pct_diff']);
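Note that sns.distplot is deprecated as of seaborn 0.11; the nearest current equivalent of the call above is:

sns.histplot(df2['pct_diff'], kde=True)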

Statsmodels OLS with rolling window problem

I would like to do a regression with a rolling window, but I got only one parameter back after the regression:
rolling_beta = sm.OLS(X2, X1, window_type='rolling', window=30).fit()
rolling_beta.params
The result:
X1 5.715089
dtype: float64
What could be the problem?
Thanks in advance, Roland
I think the problem is that the parameters window_type='rolling' and window=30 simply do not do anything. First I'll show you why, and at the end I'll provide a setup I've got lying around for linear regressions on rolling windows.
1. The problem with your function:
Since you haven't provided some sample data, here's a function that returns a dataframe of a desired size with some random numbers:
# Function to build synthetic data
import numpy as np
import pandas as pd
import statsmodels.api as sm
from collections import OrderedDict
def sample(rSeed, periodLength, colNames):
    np.random.seed(rSeed)
    date = pd.to_datetime("1st of Dec, 2018")
    cols = OrderedDict()
    for col in colNames:
        cols[col] = np.random.normal(loc=0.0, scale=1.0, size=periodLength)
    dates = date + pd.to_timedelta(np.arange(periodLength), 'D')
    df = pd.DataFrame(cols, index=dates)
    return df
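The output below corresponds to a 50-day, two-column frame; inferring the arguments from the dates and values shown, the call would be:

df = sample(rSeed=123, periodLength=50, colNames=['X1', 'X2'])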
Output:
X1 X2
2018-12-01 -1.085631 -1.294085
2018-12-02 0.997345 -1.038788
2018-12-03 0.282978 1.743712
2018-12-04 -1.506295 -0.798063
2018-12-05 -0.578600 0.029683
.
.
.
2019-01-17 0.412912 -1.363472
2019-01-18 0.978736 0.379401
2019-01-19 2.238143 -0.379176
Now, try:
rolling_beta = sm.OLS(df['X2'], df['X1'], window_type='rolling', window=30).fit()
rolling_beta.params
Output:
X1 -0.075784
dtype: float64
And this at least represents the structure of your output too, meaning that you're expecting an estimate for each of your sample windows, but instead you get a single estimate. So I looked around for some other examples using the same function online and in the statsmodels docs, but I was unable to find specific examples that actually worked. What I did find were a few discussions talking about how this functionality was deprecated a while ago. So then I tested the same function with some bogus input for the parameters:
rolling_beta = sm.OLS(df['X2'], df['X1'], window_type='amazing', window=3000000).fit()
rolling_beta.params
Output:
X1 -0.075784
dtype: float64
And as you can see, the estimates are the same, and no error messages are returned for the bogus input. So I suggest that you take a look at the function below. This is something I've put together to perform rolling regression estimates.
2. A function for regressions on rolling windows of a pandas dataframe
df = sample(rSeed = 123, colNames = ['X1', 'X2', 'X3'], periodLength = 50)
def RegressionRoll(df, subset, dependent, independent, const, win, parameters):
    """
    RegressionRoll takes a dataframe, makes a subset of the data if you like,
    runs a series of regressions with a specified window length, and
    returns a dataframe with BETA or R^2 for each window split of the data.

    Parameters:
    ===========
    df: pandas dataframe
    subset: integer - has to be smaller than the size of the df
    dependent: string that specifies name of dependent variable
    independent: LIST of strings that specifies names of independent variables
    const: boolean - whether or not to include a constant term
    win: integer - window length of each model
    parameters: string that specifies which model parameters to return:
                BETA or R^2

    Example:
    ========
    RegressionRoll(df=df, subset = 50, dependent = 'X1', independent = ['X2'],
                   const = True, parameters = 'beta', win = 30)
    """
    # Data subset
    if subset != 0:
        df = df.tail(subset)

    # Loop info
    end = df.shape[0]
    rng = np.arange(start=win, stop=end, step=1)

    # Subset and store dataframes
    frames = {}
    n = 1
    for i in rng:
        df_temp = df.iloc[:i].tail(win)
        newname = 'df' + str(n)
        frames.update({newname: df_temp})
        n += 1

    # Analysis on subsets
    df_results = pd.DataFrame()
    for frame in frames:
        # Rolling data frames
        dfr = frames[frame]
        y = dependent
        x = independent
        if const == True:
            x = sm.add_constant(dfr[x])
            model = sm.OLS(dfr[y], x).fit()
        else:
            model = sm.OLS(dfr[y], dfr[x]).fit()
        if parameters == 'beta':
            theParams = model.params[0:]
            coefs = theParams.to_frame()
            df_temp = pd.DataFrame(coefs.T)
            indx = dfr.tail(1).index[-1]
            df_temp['Date'] = indx
            df_temp = df_temp.set_index(['Date'])
        if parameters == 'R2':
            theParams = model.rsquared
            df_temp = pd.DataFrame([theParams])
            indx = dfr.tail(1).index[-1]
            df_temp['Date'] = indx
            df_temp = df_temp.set_index(['Date'])
            df_temp.columns = [', '.join(independent)]
        df_results = pd.concat([df_results, df_temp], axis=0)

    return df_results
df_rolling = RegressionRoll(df=df, subset=50, dependent='X1', independent=['X2'],
                            const=True, parameters='beta', win=30)
Output: A dataframe with beta estimates for OLS of X2 on X1 for each 30 period window of the data.
const X2
Date
2018-12-30 0.044042 0.032680
2018-12-31 0.074839 -0.023294
2019-01-01 -0.063200 0.077215
.
.
.
2019-01-16 -0.075938 -0.215108
2019-01-17 -0.143226 -0.215524
2019-01-18 -0.129202 -0.170304
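For what it's worth, statsmodels 0.11+ ships rolling regression directly, so the hand-rolled loop above can often be replaced; a minimal sketch on the same synthetic data:

from statsmodels.regression.rolling import RollingOLS
import statsmodels.api as sm

X = sm.add_constant(df[['X2']])
rolling = RollingOLS(df['X1'], X, window=30).fit()
print(rolling.params.tail())  # one (const, X2) row per window end date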

python scipy spearman correlations

I am trying to obtain the column names from the dataframe (df) and associate them to the resulting array produced by the spearmanr correlation function. I need to associate both the column names (a-j) back to the correlation value (spearman) and the p-values (spearman_pvalue). Is there an intuitive way to perform this task?
from scipy.stats import pearsonr, spearmanr
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 10)), columns=list('abcdefghij'))

def binary(row):
    if row >= 50:
        return 1
    else:
        return 0

df['target'] = df.a.apply(binary)
spearman, spearman_pvalue = spearmanr(df.drop(['target'], axis=1), df.target)
print(spearman)
print(spearman_pvalue)
It seems you need:
from scipy.stats import spearmanr

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 10)), columns=list('abcdefghij'))

# faster for a binary column
df['target'] = (df['a'] >= 50).astype(int)

spearman, spearman_pvalue = spearmanr(df.drop(['target'], axis=1), df.target)

# spearmanr returns an 11x11 matrix here (10 features + target)
df1 = pd.DataFrame(spearman.reshape(-1, 11), columns=df.columns)
df2 = pd.DataFrame(spearman_pvalue.reshape(-1, 11), columns=df.columns)

# assign the index back to the column names for the full matrix
df1 = df1.set_index(df.columns)
df2 = df2.set_index(df.columns)
Or:
df1 = pd.DataFrame(spearman.reshape(-1, 11),
                   columns=df.columns,
                   index=df.columns)
df2 = pd.DataFrame(spearman_pvalue.reshape(-1, 11),
                   columns=df.columns,
                   index=df.columns)
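If you only need the labeled correlation matrix and not the p-values, pandas can do this directly:

corr = df.corr(method='spearman')  # labeled 11x11 DataFrame, including 'target'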

Visually separating bar chart clusters in pandas

This is more of a hack that almost works.
#!/usr/bin/env python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Create original dataframe
df = pd.DataFrame(np.random.rand(5, 4), index=['art', 'mcf', 'mesa', 'perl', 'gcc'],
                  columns=['pol1', 'pol2', 'pol3', 'pol4'])

# Estimate average
average = df.mean()
average.name = 'average'

# Append dummy row with zeros and then the average
row = pd.DataFrame([dict({p: 0.0 for p in df.columns}), ])
df = df.append(row)
df = df.append(average)

print(df)
df.plot(kind='bar')
plt.show()
and gives:
pol1 pol2 pol3 pol4
art 0.247309 0.139797 0.673009 0.265708
mcf 0.951582 0.319486 0.447658 0.259821
mesa 0.888686 0.177007 0.845190 0.946728
perl 0.902977 0.863369 0.194451 0.698102
gcc 0.836407 0.700306 0.739659 0.265613
0 0.000000 0.000000 0.000000 0.000000
average 0.765392 0.439993 0.579993 0.487194
The resulting plot gives the visual separation between the benchmarks and the average. Is there a way to get rid of the 0 on the x-axis?
It turns out that DataFrame does not allow me to have multiple dummy rows this way.
My solution was to change
row = pd.DataFrame([dict({p:0.0 for p in df.columns}), ])
into
row = pd.Series([dict({p:0.0 for p in df.columns}), ])
row.name = ""
A Series can be named with an empty string.
Still pretty hacky, but it works:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Create original dataframe
df = pd.DataFrame(np.random.rand(5, 4), index=['art', 'mcf', 'mesa', 'perl', 'gcc'],
                  columns=['pol1', 'pol2', 'pol3', 'pol4'])

# Estimate average
average = df.mean()
average.name = 'average'

# Append dummy row with zeros, blank out its index label, then append the average
row = pd.DataFrame([dict({p: 0.0 for p in df.columns}), ])
df = df.append(row)
df = df.reindex(np.where(df.index, df.index, ''))
df = df.append(average)

print(df)
df.plot(kind='bar')
plt.show()
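Note that DataFrame.append was removed in pandas 2.0; on current pandas the same trick can be written with pd.concat, and the blank label can be set directly on the dummy row (a sketch under that assumption):

row = pd.DataFrame([{p: 0.0 for p in df.columns}], index=[''])
df = pd.concat([df, row, average.to_frame().T])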
