Linear interpolation between points of a DataFrame using nearest points - Python

I need a fast way to interpolate between the nearest points of a DataFrame without adding new points to it, since there is a lot of data - millions of points (no NaNs). The DataFrame is sorted by x values.
E.g. I have a DataFrame with the following columns:
x | y
-----
0 | 1
1 | 2
2 | 3
...
I need a function that, for a given x input value, returns the y value linearly interpolated between the nearest points, something like this:
calc_linear(df, input_col='x', input_val=1.5, output_col='y') will output 2.5 - the interpolated y value for the given x
Maybe there are some pandas functions for that?

Use numpy.interp:
import numpy as np

def calc_linear(df, input_val, input_col='x', output_col='y'):
    # np.interp expects the x values to be increasing,
    # which holds here because the DataFrame is sorted by x
    return np.interp(input_val, df[input_col], df[output_col])

y = calc_linear(df, 1.5)
print(y)
# Output
2.5
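If you want to see what np.interp does under the hood (or tweak the boundary behaviour), here is a minimal hand-rolled sketch of the same lookup, assuming df is sorted by the input column; like np.interp, it clamps to the endpoints outside the data range:
import numpy as np

def calc_linear_manual(df, input_val, input_col='x', output_col='y'):
    x = df[input_col].to_numpy()
    y = df[output_col].to_numpy()
    i = np.searchsorted(x, input_val)  # first index with x[i] >= input_val
    if i == 0:
        return y[0]       # below the range: clamp, like np.interp
    if i == len(x):
        return y[-1]      # above the range: clamp, like np.interp
    # weight between the two nearest points
    t = (input_val - x[i - 1]) / (x[i] - x[i - 1])
    return y[i - 1] + t * (y[i] - y[i - 1])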

Related

Calculate distances from geo coordinates in a 'pythonic' way [duplicate]

I am struggling to calculate the distance between multiple sets of latitude and longitude coordinates. In short, I have found numerous tutorials that either use math or geopy. These tutorials work great when I just want to find the distance between ONE set of coordinates (or two unique locations). However, my objective is to scan a data set that has 400k combinations of origin and destination coordinates. One example of the code I have used is listed below, but it seems I am getting errors when my arrays are > 1 record. Any helpful tips would be much appreciated. Thank you.
# starting dataframe is df
lat1 = df.lat1.as_matrix()
long1 = df.long1.as_matrix()
lat2 = df.lat2.as_matrix()
long2 = df.long2.as_matrix()

from geopy.distance import vincenty
point1 = (lat1, long1)
point2 = (lat2, long2)
print(vincenty(point1, point2).miles)
Edit: here's a simple notebook example
A general approach, assuming that you have a DataFrame column containing points, and you want to calculate distances between all of them (if you have separate columns, first combine them into (lat, lon) tuples, for instance). Name the new column coords.
import pandas as pd
import numpy as np
from geopy.distance import vincenty
# assumes your DataFrame is named df, and its lon and lat columns are named lon and lat. Adjust as needed.
df['coords'] = list(zip(df.lat, df.lon))  # list() is needed on Python 3, where zip() is lazy
# first, let's create a square DataFrame (think of it as a matrix if you like)
square = pd.DataFrame(
    np.zeros((len(df), len(df))),
    index=df.index, columns=df.index)
This function looks up our 'end' coordinates from the df DataFrame using the input column name, then applies the geopy vincenty() function to each entry of df['coords'], passing the 'end' point as the second argument. This works because the function is applied column-wise from right to left.
def get_distance(col):
    # df.ix has been removed from pandas; .loc does the same lookup (see the update below)
    end = df.loc[col.name, 'coords']
    return df['coords'].apply(vincenty, args=(end,), ellipsoid='WGS-84')
Now we're ready to calculate all the distances.
We're transposing the DataFrame (.T) because the loc[] method we'll be using to retrieve distances refers to row label, column label, while our inner apply function (see above) populates each column with the retrieved values.
distances = square.apply(get_distance, axis=1).T
Your geopy values are (IIRC) returned in kilometres, so you may need to convert them to whatever unit you want via .meters, .miles etc.
Something like the following should work:
def units(input_instance):
    return input_instance.meters

distances_meters = distances.applymap(units)
You can now index into your distance matrix using e.g. loc[row_index, column_index].
You should be able to adapt the above fairly easily. You might have to adjust the apply call in the get_distance function to ensure you're passing the correct values to vincenty(). The pandas apply docs might be useful, in particular with regard to passing positional arguments using args (you'll need a recent pandas version for this to work).
This code hasn't been profiled, and there are probably much faster ways to do it, but it should be fairly quick for 400k distance calculations.
Oh and also
I can't remember whether geopy expects coordinates as (lon, lat) or (lat, lon). I bet it's the latter (sigh).
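As noted above, there are probably faster approaches. In particular, if you only need the row-wise distance for each of the 400k origin/destination pairs (rather than the full all-pairs matrix), a vectorized haversine in plain NumPy is a common shortcut - a sketch, assuming a spherical Earth, so the results differ slightly from geodesic/WGS-84 distances:
import numpy as np

def haversine_miles(lat1, lon1, lat2, lon2, radius_miles=3958.8):
    # element-wise great-circle distance; all four inputs are array-like, in degrees
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_miles * np.arcsin(np.sqrt(a))

# e.g. for the question's DataFrame:
# df['miles'] = haversine_miles(df.lat1, df.long1, df.lat2, df.long2)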
Update
Here's a working script as of May 2021.
import geopy.distance
# geopy DOES use the (lat, lon) ordering
df['latlon'] = list(zip(df['lat'], df['lon']))

square = pd.DataFrame(
    np.zeros((df.shape[0], df.shape[0])),
    index=df.index, columns=df.index
)

# replacing the removed distance.vincenty with distance.distance (geodesic)
def get_distance(col):
    end = df.loc[col.name, 'latlon']
    return df['latlon'].apply(geopy.distance.distance,
                              args=(end,),
                              ellipsoid='WGS-84')
distances = square.apply(get_distance, axis=1).T
I recently had to do a similar job; I ended up writing a solution I consider very easy to understand and tweak to your needs, though possibly not the best/fastest:
Solution
It is very similar to what urschrei posted: assuming you want the distance between every two consecutive coordinates in a pandas DataFrame, we can write a function that processes each pair of points as the start and finish of a path, computes the distance, and then builds a new DataFrame as the return value:
import pandas as pd
from geopy import Point, distance

def get_distances(coords: pd.DataFrame,
                  col_lat='lat',
                  col_lon='lon',
                  point_obj=Point) -> pd.DataFrame:
    traces = len(coords) - 1
    distances = [None] * traces
    for i in range(traces):
        start = point_obj((coords.iloc[i][col_lat], coords.iloc[i][col_lon]))
        finish = point_obj((coords.iloc[i + 1][col_lat], coords.iloc[i + 1][col_lon]))
        distances[i] = {
            'start': start,
            'finish': finish,
            'path distance': distance.geodesic(start, finish),
        }
    return pd.DataFrame(distances)
Usage example
coords = pd.DataFrame({
    'lat': [-26.244333, -26.238000, -26.233880, -26.260000, -26.263730],
    'lon': [-48.640946, -48.644670, -48.648480, -48.669770, -48.660700],
})

print('-> coords DataFrame:\n', coords)
print('-' * 79, end='\n\n')

distances = get_distances(coords)
distances['total distance'] = distances['path distance'].cumsum()
print('-> distances DataFrame:\n', distances)
print('-' * 79, end='\n\n')

# Or if you want to use tuples for the start/finish coordinates:
print('-> distances DataFrame using tuples:\n', get_distances(coords, point_obj=tuple))
print('-' * 79, end='\n\n')
Output example
-> coords DataFrame:
lat lon
0 -26.244333 -48.640946
1 -26.238000 -48.644670
2 -26.233880 -48.648480
3 -26.260000 -48.669770
4 -26.263730 -48.660700
-------------------------------------------------------------------------------
-> distances DataFrame:
start finish \
0 26 14m 39.5988s S, 48 38m 27.4056s W 26 14m 16.8s S, 48 38m 40.812s W
1 26 14m 16.8s S, 48 38m 40.812s W 26 14m 1.968s S, 48 38m 54.528s W
2 26 14m 1.968s S, 48 38m 54.528s W 26 15m 36s S, 48 40m 11.172s W
3 26 15m 36s S, 48 40m 11.172s W 26 15m 49.428s S, 48 39m 38.52s W
path distance total distance
0 0.7941932910049856 km 0.7941932910049856 km
1 0.5943709651000332 km 1.3885642561050187 km
2 3.5914909016938505 km 4.980055157798869 km
3 0.9958396130609087 km 5.975894770859778 km
-------------------------------------------------------------------------------
-> distances DataFrame using tuples:
start finish path distance
0 (-26.244333, -48.640946) (-26.238, -48.64467) 0.7941932910049856 km
1 (-26.238, -48.64467) (-26.23388, -48.64848) 0.5943709651000332 km
2 (-26.23388, -48.64848) (-26.26, -48.66977) 3.5914909016938505 km
3 (-26.26, -48.66977) (-26.26373, -48.6607) 0.9958396130609087 km
-------------------------------------------------------------------------------
As of 19th May
For anyone working with multiple geolocation datasets, you can adapt the above code with a small modification: the version below writes the output distances to a CSV file (and you can likewise read your coordinates from a CSV with pd.read_csv first).
import pandas as pd
from geopy import Point, distance

def get_distances(coords: pd.DataFrame,
                  col_lat='lat',
                  col_lon='lon',
                  point_obj=Point) -> pd.DataFrame:
    traces = len(coords) - 1
    distances = [None] * traces
    for i in range(traces):
        start = point_obj((coords.iloc[i][col_lat], coords.iloc[i][col_lon]))
        finish = point_obj((coords.iloc[i + 1][col_lat], coords.iloc[i + 1][col_lon]))
        distances[i] = {
            'start': start,
            'finish': finish,
            'path distance': distance.geodesic(start, finish),
        }
    output = pd.DataFrame(distances)
    output.to_csv('geopy_output.csv')
    return output
I used the same code and generated distance data for over 50,000 coordinates.

Find high correlations in a large coefficient matrix

I have a dataset with 56 numerical features. Loading it to pandas, I can easily generate a correlation coefficients matrix.
However, due to its size, I'd like to find coefficients higher (or lower) than a certain threshold, e.g. >0.8 or <-0.8, and list the corresponding pairs of variables. Is there a way to do it? I figure it would require selecting by value across all columns, then returning, not the row, but the column name and row index of the value, but I have no idea how to do either!
Thanks!
I think you can use where() and stack() for this:
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame(np.random.rand(10, 3))
coeff = df.corr()

# 0.3 is used for illustration
# replace with your actual value
thresh = 0.3

mask = coeff.abs().lt(thresh)
# or mask = coeff < thresh, for a one-sided filter
coeff.where(mask).stack()
Output:
0  2   -0.089326
2  0   -0.089326
dtype: float64
For reference, masking out only the self-correlations (e.g. mask = coeff.abs().lt(1), since each variable correlates exactly 1.0 with itself) and stacking gives every off-diagonal pair:
0  1    0.319612
   2   -0.089326
1  0    0.319612
   2   -0.687399
2  0   -0.089326
   1   -0.687399
dtype: float64
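For the question's actual goal (pairs with an absolute correlation above a threshold), flip the comparison and exclude the 1.0 diagonal in the same step:
mask = coeff.abs().gt(thresh) & coeff.abs().lt(1)
coeff.where(mask).stack()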
This approach will work if you're looking to also deduplicate the correlation results.
thresh = 0.8

# get the correlation matrix as one long Series of (var1, var2) -> |corr|
df_corr = df.corr().abs().unstack()

# filter; since abs() was already applied, one comparison is enough
df_corr_filt = df_corr[df_corr > thresh].reset_index()

# deduplicate: build an order-independent key for each variable pair and
# keep the first occurrence (note this retains the 1.0 self-correlations;
# drop them by filtering level_0 != level_1 if unwanted)
pair_key = df_corr_filt[['level_0', 'level_1']].apply(
    lambda r: ''.join(map(str, sorted(r))), axis=1)
df_corr_filt = df_corr_filt.iloc[pair_key.drop_duplicates().index]
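A variant of the same idea (a minimal sketch, assuming a numeric DataFrame df) keeps only the strict upper triangle of the matrix, which drops the diagonal and one copy of each symmetric pair in a single step:
import numpy as np
import pandas as pd

def high_correlations(df, thresh=0.8):
    coeff = df.corr()
    # boolean mask that is True strictly above the diagonal
    upper = coeff.where(np.triu(np.ones(coeff.shape, dtype=bool), k=1))
    pairs = upper.stack()  # stack() drops the NaNs left behind by where()
    return pairs[pairs.abs() > thresh]

# with the seeded example above:
np.random.seed(1)
df = pd.DataFrame(np.random.rand(10, 3))
print(high_correlations(df, thresh=0.3))
# 0  1    0.319612
# 1  2   -0.687399
# dtype: float64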

Julia: Multinomial Regression with time series lagged values

I need to do some multinomial regression in Julia. In R I get the following result:
library(nnet)
data <- read.table("Dropbox/scripts/timeseries.txt",header=TRUE)
multinom(y~X1+X2,data)
# weights: 12 (6 variable)
initial value 10985.024274
iter 10 value 10438.503738
final value 10438.503529
converged
Call:
multinom(formula = y ~ X1 + X2, data = data)
Coefficients:
(Intercept) X1 X2
2 0.4877087 0.2588725 0.2762119
3 0.4421524 0.5305649 0.3895339
Residual Deviance: 20877.01
AIC: 20889.01
Here is my data
My first attempt was using Regression.jl. The documentation is quite sparse for this package so I am not sure which category is used as baseline, which parameters the resulting output corresponds to, etc. I filed an issue to ask about these things here.
using DataFrames
using Regression
import Regression: solve, Options, predict
dat = readtable("timeseries.txt", separator='\t')
X = convert(Matrix{Float64},dat[:,2:3])
y = convert(Vector{Int64},dat[:,1])
ret = solve(mlogisticreg(X',y,3), reg=ZeroReg(), options=Options(verbosity=:iter))
the result is
julia> ret.sol
3x2 Array{Float64,2}:
-0.573027 -0.531819
0.173453 0.232029
0.399575 0.29979
but again, I am not sure what this corresponds to.
Next I tried the Julia wrapper to Python's SciKitLearn:
using ScikitLearn
@sk_import linear_model: LogisticRegression
model = ScikitLearn.fit!(LogisticRegression(multi_class="multinomial", solver = "lbfgs"), X, y)
model[:coef_]
3x2 Array{Float64,2}:
-0.261902 -0.220771
-0.00453731 0.0540354
0.266439 0.166735
Originally I had not figured out how to extract the coefficients from this model; the model[:coef_] call above shows them (updated). These also don't look like the R results.
Any help trying to replicate R's results would be appreciated (using whatever package!).
Note the response variables are just the discretized time-lagged response i.e.
julia> dat[1:3,:]
3x3 DataFrames.DataFrame
| Row | y | X1 | X2 |
|-----|---|----|----|
| 1 | 3 | 1 | 0 |
| 2 | 3 | 0 | 1 |
| 3 | 1 | 0 | 1 |
for row 2 you can see that the response (0, 1) means the previous observation was a 3. Similarly (1,0) means previous observation was a 2 and (0,0) means previous observation was a 1.
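For illustration, here is a small sketch (hypothetical, in pandas; the actual timeseries.txt was produced elsewhere) of how such lagged dummy columns can be built from a response series:
import pandas as pd

y = pd.Series([1, 3, 3, 1, 2, 3])      # toy response series
prev = y.shift(1)                      # the time-lagged response
df = pd.DataFrame({
    'y': y,
    'X1': (prev == 2).astype(int),     # X1=1, X2=0 -> previous was a 2
    'X2': (prev == 3).astype(int),     # X1=0, X2=1 -> previous was a 3
}).iloc[1:]                            # first row has no lag available
# X1=0, X2=0 -> previous was a 1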
Update:
For Regression.jl it seems it does not fit an intercept by default (and it calls it a "bias" instead of an intercept). By adding this term we get results very similar to Python's (I am not sure what the third column is, though...).
julia> ret = solve(mlogisticreg(X',y,3, bias=1.0), reg=ZeroReg(), options=Options(verbosity=:iter))
julia> ret.sol
3x3 Array{Float64,2}:
-0.263149 -0.221923 -0.309949
-0.00427033 0.0543008 0.177753
0.267419 0.167622 0.132196
UPDATE:
Since the model coefficients are not identifiable (they are only determined relative to a baseline category, i.e. up to a common shift), I should not expect them to be the same across these different implementations. However, the predicted probabilities should be the same, and in fact they are (using R, Regression.jl, or ScikitLearn).
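To see concretely why different coefficients can give identical predictions, here is a minimal NumPy sketch (independent of any of the packages above): softmax probabilities do not change when every class's coefficients and intercept are shifted by the same amount:
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))    # 5 observations, 2 features
W = rng.normal(size=(2, 3))    # one coefficient column per class
b = rng.normal(size=3)         # one intercept per class

v = rng.normal(size=(2, 1))    # a common shift, broadcast across all 3 classes
p1 = softmax(X @ W + b)
p2 = softmax(X @ (W + v) + (b + 0.7))
print(np.allclose(p1, p2))     # True: identical predicted probabilities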

Accelerate UDF in xlwings

I have been experimenting with some features of xlwings. I would like to use a common numpy function that does fast interpolation (numpy.interp).
import numpy as np
from xlwings import xlfunc

@xlfunc
def interp(x, xp, yp):
    """Interpolate vector x on vector (xp, yp)"""
    y = np.interp(x, xp, yp)
    return y

@xlfunc
def random():
    """Returns a standard-normal random number"""
    x = np.random.randn()
    return x
For instance, I create two vectors (xp, yp) like this in Excel (800 rows):
First Column Second Column
0 =random()
1 =random()
2 =random()
3 =random()
4 =random()
5 =random()
[...]
In the third column I create another vector (60 rows) with random numbers between 0 and 800, ranked in ascending order, which gives me something like this:
Third Column
17.2
52.6
75.8
[...]
I would like to interpolate the values of the third column on the vectors of the first two. So:
Fourth Column
=interp(C1,A1:A800,B1:B800)
=interp(C2,A1:A800,B1:B800)
=interp(C3,A1:A800,B1:B800)
[...]
It's easy to do this, but if I have 10 or more columns to interpolate it could take too much time. I am sure there is a better way to do this. Any ideas?
Thanks for your help!
Edit:
I tried this, but it fails at the xw.Range(...).value = y line:
import numpy as np
import xlwings as xw

@xw.xlfunc
def interpbis(x, xp, yp):
    """Interpolate scalar x on vector (xp, yp)"""
    thisWB = xw.Workbook.active()
    thisSlctn = thisWB.get_selection(asarray=True)
    sheet = thisSlctn.xl_sheet.name
    r = thisSlctn.row
    c = thisSlctn.column
    y = np.interp(x, xp, yp)
    xw.Range(sheet, (r, c)).value = y
    return None
The short answer is: Use Excel's array functions.
The long answer is:
First, update to xlwings v0.6.4 (otherwise what I am going to show for random() will not work). Then change your functions as follows:
from xlwings import Application, Workbook, xlfunc
import numpy as np

@xlfunc
def interp(x, xp, yp):
    """Interpolate vector x on vector (xp, yp)"""
    y = np.interp(x, xp, yp)
    return y[:, np.newaxis]  # column orientation

@xlfunc
def random():
    """Returns an array of standard-normal random numbers sized to the calling range"""
    app = Application(Workbook.caller())
    # We shall make this easier in a future release (getting the array dimensions)
    r = app.xl_app.Caller.Rows.Count
    c = app.xl_app.Caller.Columns.Count
    x = np.random.randn(r, c)
    return x
Now use array formulas in Excel as described here (Ctrl+Shift+Enter).
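For the sheet layout above, that means selecting the whole output range once instead of entering 60 separate formulas (a sketch, assuming the question's column layout): select D1:D60, type
=interp(C1:C60, $A$1:$A$800, $B$1:$B$800)
and confirm with Ctrl+Shift+Enter, so Excel evaluates it as a single array formula returning a 60x1 column.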
