I need to grid scattered data in a GeoPandas dataframe onto a regular grid (e.g. 1 degree) and compute the mean value of each grid box. I then want to plot the gridded data with various projections.
I managed the first step using gpd_lite_toolbox.
I can plot the result on a simple lat/lon map, but converting it to any other projection fails.
Here is a small example with some artificial data showing my issue:
import gpd_lite_toolbox as glt
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
from shapely import wkt
# creating the artificial df
df = pd.DataFrame(
    {'data': [20, 15, 17.5, 11.25, 16],
     'Coordinates': ['POINT(-58.66 -34.58)', 'POINT(-47.91 -15.78)',
                     'POINT(-70.66 -33.45)', 'POINT(-74.08 4.60)',
                     'POINT(-66.86 10.48)']})
# converting the df to a gdf with projection
df['Coordinates'] = df['Coordinates'].apply(wkt.loads)
crs = {'init': 'epsg:4326'}
gdf = gpd.GeoDataFrame(df, crs=crs, geometry='Coordinates')
# gridding the data using the gridify_data function from the toolbox and setting grids without data to nan
g1 = glt.gridify_data(gdf, 1, 'data', cut=False)
g1 = g1.where(g1['data'] > 1)
# simple plot of the gridded data
fig, ax = plt.subplots(ncols=1, figsize=(20, 10))
g1.plot(ax=ax, column='data', cmap='jet')
# trying to convert to (any) other projection
g2 = g1.to_crs({'init': 'epsg:3395'})
# I get the following error
---------------------------------------------------------------------------
AttributeError: 'float' object has no attribute 'is_empty'
I would also be happy to use a different gridding function if that solves the problem.
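Your g1 contains too many NaN values. DataFrame.where keeps every row and fills the rows that fail the condition with NaN, including the geometry column, so to_crs ends up being called on float NaN values instead of geometries.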
g1 = g1.where(g1['data'] > 1)
print(g1)
geometry data
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 POLYGON ((-74.08 5.48, -73.08 5.48, -73.08 4.4... 11.25
...
You should use g1[g1['data'] > 1] instead of g1.where(g1['data'] > 1).
g1 = g1[g1['data'] > 1]
print(g1)
geometry data
5 POLYGON ((-74.08 5.48, -73.08 5.48, -73.08 4.4... 11.25
181 POLYGON ((-71.08 -32.52, -70.08 -32.52, -70.08... 17.50
322 POLYGON ((-67.08 10.48, -66.08 10.48, -66.08 9... 16.00
735 POLYGON ((-59.08 -34.52, -58.08 -34.52, -58.08... 20.00
1222 POLYGON ((-48.08 -15.52, -47.08 -15.52, -47.08... 15.00
g2 = g1.to_crs({'init': 'epsg:3395'})
print(g2)
geometry data
5 POLYGON ((-8246547.877965705 606885.3761893312... 11.25
181 POLYGON ((-7912589.405585884 -3808795.10464339... 17.50
322 POLYGON ((-7467311.442412791 1165421.424891677... 16.00
735 POLYGON ((-6576755.516066602 -4074627.00861716... 20.00
1222 POLYGON ((-5352241.117340593 -1737775.44359649... 15.00
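The difference between the two filters can be reproduced without GeoPandas. The following is a minimal sketch on a plain pandas DataFrame with made-up values; the string column `geometry` is only a stand-in for the real geometry column, and the name `grid` is an assumption for illustration.

```python
import pandas as pd

# Made-up stand-in for the gridded result; 'geometry' holds plain strings here.
grid = pd.DataFrame({
    "geometry": ["cell-0", "cell-1", "cell-2", "cell-3"],
    "data": [0.0, 11.25, 0.0, 17.5],
})

# DataFrame.where keeps the original shape: rows that fail the condition
# stay in the frame, but every column in them (including 'geometry') becomes NaN.
masked = grid.where(grid["data"] > 1)

# Boolean indexing drops the failing rows instead, so no NaN geometry remains.
filtered = grid[grid["data"] > 1]

print(masked)
print(filtered)
```

With real geometries, the NaN entries left behind by `where` are what a reprojection step would trip over, while the boolean-indexed frame only contains valid rows.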
multidimensional interpolation with dataframe not working
import pandas as pd
import numpy as np
raw_data = {'CCY_CODE': ['SGD','USD','USD','USD','USD','USD','USD','EUR','EUR','EUR','EUR','EUR','EUR','USD'],
'END_DATE': ['16/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018',
'17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018'],
'STRIKE':[0.005,0.01,0.015,0.02,0.025,0.03,0.035,0.04,0.045,0.05,0.55,0.06,0.065,0.07],
'VOLATILITY':[np.nan,np.nan,0.3424,np.nan,0.2617,0.2414,np.nan,np.nan,0.215,0.212,0.2103,np.nan,0.2092,np.nan]
}
df_volsurface = pd.DataFrame(raw_data,columns = ['CCY_CODE','END_DATE','STRIKE','VOLATILITY'])
df_volsurface['END_DATE'] = pd.to_datetime(df_volsurface['END_DATE'])
df_volsurface.interpolate(method='akima',limit_direction='both')
Output:
<table><tbody><tr><th> </th><th>CCY_CODE</th><th>END_DATE</th><th>STRIKE</th><th>VOLATILITY</th></tr><tr><td>0</td><td>SGD</td><td>3/16/2018</td><td>0.005</td><td>NaN</td></tr><tr><td>1</td><td>USD</td><td>3/17/2018</td><td>0.01</td><td>NaN</td></tr><tr><td>2</td><td>USD</td><td>3/17/2018</td><td>0.015</td><td>0.3424</td></tr><tr><td>3</td><td>USD</td><td>3/17/2018</td><td>0.02</td><td>0.296358</td></tr><tr><td>4</td><td>USD</td><td>3/17/2018</td><td>0.025</td><td>0.2617</td></tr><tr><td>5</td><td>USD</td><td>3/17/2018</td><td>0.03</td><td>0.2414</td></tr><tr><td>6</td><td>USD</td><td>3/17/2018</td><td>0.035</td><td>0.230295</td></tr><tr><td>7</td><td>EUR</td><td>3/17/2018</td><td>0.04</td><td>0.220911</td></tr><tr><td>8</td><td>EUR</td><td>3/17/2018</td><td>0.045</td><td>0.215</td></tr><tr><td>9</td><td>EUR</td><td>3/17/2018</td><td>0.05</td><td>0.212</td></tr><tr><td>10</td><td>EUR</td><td>3/17/2018</td><td>0.55</td><td>0.2103</td></tr><tr><td>11</td><td>EUR</td><td>3/17/2018</td><td>0.06</td><td>0.209471</td></tr><tr><td>12</td><td>EUR</td><td>3/17/2018</td><td>0.065</td><td>0.2092</td></tr><tr><td>13</td><td>USD</td><td>3/17/2018</td><td>0.07</td><td>NaN</td></tr></tbody></table>
Expected Result:
<table><tbody><tr><th> </th><th>CCY_CODE</th><th>END_DATE</th><th>STRIKE</th><th>VOLATILITY</th></tr><tr><td>0</td><td>SGD</td><td>3/16/2018</td><td>0.005</td><td>NaN</td></tr><tr><td>1</td><td>USD</td><td>3/17/2018</td><td>0.01</td><td>Expected some logical value</td></tr><tr><td>2</td><td>USD</td><td>3/17/2018</td><td>0.015</td><td>0.3424</td></tr><tr><td>3</td><td>USD</td><td>3/17/2018</td><td>0.02</td><td>0.296358</td></tr><tr><td>4</td><td>USD</td><td>3/17/2018</td><td>0.025</td><td>0.2617</td></tr><tr><td>5</td><td>USD</td><td>3/17/2018</td><td>0.03</td><td>0.2414</td></tr><tr><td>6</td><td>USD</td><td>3/17/2018</td><td>0.035</td><td>0.230295</td></tr><tr><td>7</td><td>EUR</td><td>3/17/2018</td><td>0.04</td><td>0.220911</td></tr><tr><td>8</td><td>EUR</td><td>3/17/2018</td><td>0.045</td><td>0.215</td></tr><tr><td>9</td><td>EUR</td><td>3/17/2018</td><td>0.05</td><td>0.212</td></tr><tr><td>10</td><td>EUR</td><td>3/17/2018</td><td>0.55</td><td>0.2103</td></tr><tr><td>11</td><td>EUR</td><td>3/17/2018</td><td>0.06</td><td>0.209471</td></tr><tr><td>12</td><td>EUR</td><td>3/17/2018</td><td>0.065</td><td>0.2092</td></tr><tr><td>13</td><td>USD</td><td>3/17/2018</td><td>0.07</td><td>Expected some logical value</td></tr></tbody></table>
Linear interpolation copies the last available value into all leading and trailing missing values, without taking CCY_CODE into account:
df_volsurface.interpolate(method='linear',limit_direction='both')
Output:
<table><tbody><tr><th> </th><th>CCY_CODE</th><th>END_DATE</th><th>STRIKE</th><th>VOLATILITY</th></tr><tr><td>0</td><td>SGD</td><td>3/16/2018</td><td>0.005</td><td>0.3424</td></tr><tr><td>1</td><td>USD</td><td>3/17/2018</td><td>0.01</td><td>0.3424</td></tr><tr><td>2</td><td>USD</td><td>3/17/2018</td><td>0.015</td><td>0.3424</td></tr><tr><td>3</td><td>USD</td><td>3/17/2018</td><td>0.02</td><td>0.30205</td></tr><tr><td>4</td><td>USD</td><td>3/17/2018</td><td>0.025</td><td>0.2617</td></tr><tr><td>5</td><td>USD</td><td>3/17/2018</td><td>0.03</td><td>0.2414</td></tr><tr><td>6</td><td>USD</td><td>3/17/2018</td><td>0.035</td><td>0.2326</td></tr><tr><td>7</td><td>EUR</td><td>3/17/2018</td><td>0.04</td><td>0.2238</td></tr><tr><td>8</td><td>EUR</td><td>3/17/2018</td><td>0.045</td><td>0.215</td></tr><tr><td>9</td><td>EUR</td><td>3/17/2018</td><td>0.05</td><td>0.212</td></tr><tr><td>10</td><td>EUR</td><td>3/17/2018</td><td>0.55</td><td>0.2103</td></tr><tr><td>11</td><td>EUR</td><td>3/17/2018</td><td>0.06</td><td>0.20975</td></tr><tr><td>12</td><td>EUR</td><td>3/17/2018</td><td>0.065</td><td>0.2092</td></tr><tr><td>13</td><td>USD</td><td>3/17/2018</td><td>0.07</td><td>0.2092</td></tr></tbody></table>
Any help is appreciated! Thanks!
I'd like to point out that this is still one-dimensional interpolation: there is one independent variable ('STRIKE') and one dependent variable ('VOLATILITY'). The interpolation is simply repeated for different conditions, e.g. for each day, each currency, each scenario, etc. The following is an example of how the interpolation can be done for each combination of 'END_DATE' and 'CCY_CODE'.
import scipy.interpolate

# set all the conditions as index
df_volsurface.set_index(['END_DATE', 'CCY_CODE', 'STRIKE'], inplace=True)
df_volsurface.sort_index(level=['END_DATE', 'CCY_CODE', 'STRIKE'], inplace=True)
# create separate columns for all criteria except the independent variable
df_volsurface = df_volsurface.unstack(level=['END_DATE', 'CCY_CODE'])
for ccy in df_volsurface:
    indices = df_volsurface[ccy].notna()
    if not any(indices):
        continue  # we are not interested in a column with only NaN
    x = df_volsurface.index.get_level_values(level='STRIKE')  # independent var
    y = df_volsurface[ccy]  # dependent var
    # create interpolation function
    f_interp = scipy.interpolate.interp1d(x[indices], y[indices], kind='linear',
                                          bounds_error=False, fill_value='extrapolate')
    df_volsurface['VOL_INTERP', ccy[1], ccy[2]] = f_interp(x)
print(df_volsurface)
The interpolation for the other conditions should work analogously. This is the resulting DataFrame:
         VOLATILITY                    VOL_INTERP
END_DATE 2018-03-16 2018-03-17         2018-03-17
CCY_CODE        SGD        EUR     USD        EUR      USD
STRIKE
0.005           NaN        NaN     NaN    0.23900  0.42310
0.010           NaN        NaN     NaN    0.23600  0.38275
0.015           NaN        NaN  0.3424    0.23300  0.34240
0.020           NaN        NaN     NaN    0.23000  0.30205
0.025           NaN        NaN  0.2617    0.22700  0.26170
0.030           NaN        NaN  0.2414    0.22400  0.24140
0.035           NaN        NaN     NaN    0.22100  0.22110
0.040           NaN        NaN     NaN    0.21800  0.20080
0.045           NaN     0.2150     NaN    0.21500  0.18050
0.050           NaN     0.2120     NaN    0.21200  0.16020
0.055           NaN     0.2103     NaN    0.21030  0.13990
0.060           NaN        NaN     NaN    0.20975  0.11960
0.065           NaN     0.2092     NaN    0.20920  0.09930
0.070           NaN        NaN     NaN    0.20865  0.07900
Use df_volsurface.stack() to return to a multiindex of your choice. There are also several pandas interpolation methods to choose from. However, I have not found a satisfactory solution for your problem using method='akima', because it only interpolates between the given data points and does not seem to extrapolate beyond them.
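If you would rather keep the data in long format, the same per-condition idea can be sketched with a groupby instead of unstack. This is only a minimal sketch under assumptions: the data below is made up, only 'CCY_CODE' is used as the grouping condition, and `fill_by_strike` is a hypothetical helper name, not part of pandas or SciPy.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d


def fill_by_strike(group):
    """Hypothetical helper: linear inter-/extrapolation of VOLATILITY over STRIKE."""
    known = group["VOLATILITY"].notna()
    if known.sum() < 2:
        # Fewer than two known points cannot define a line; leave the group as it is.
        return group["VOLATILITY"]
    f = interp1d(group.loc[known, "STRIKE"].to_numpy(),
                 group.loc[known, "VOLATILITY"].to_numpy(),
                 kind="linear", bounds_error=False, fill_value="extrapolate")
    return pd.Series(f(group["STRIKE"].to_numpy()), index=group.index)


# Made-up data: one block of strikes per currency.
df = pd.DataFrame({
    "CCY_CODE": ["USD", "USD", "USD", "USD", "EUR", "EUR", "EUR", "SGD"],
    "STRIKE": [0.01, 0.02, 0.03, 0.04, 0.01, 0.02, 0.03, 0.01],
    "VOLATILITY": [np.nan, 0.30, 0.20, np.nan, 0.25, np.nan, 0.21, np.nan],
})

# Interpolate each currency separately, then align the pieces back on the original index.
parts = [fill_by_strike(group) for _, group in df.groupby("CCY_CODE")]
df["VOL_INTERP"] = pd.concat(parts)
print(df)
```

Each currency only ever sees its own points, so values are never carried over from one currency to the next. Groups with fewer than two known points (SGD here) stay NaN.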
I have a numpy.ndarray like the one below, which is the output of talib.RSI. I want to get the rolling maximum and rolling minimum over a window of 3, i.e. the equivalent of rolling(window=3).max() and rolling(window=3).min().
How can I do that?
[ nan nan nan nan nan
nan nan nan nan nan
nan nan nan nan 56.50118203
60.05461743 56.99068148 55.70899949 59.2299361 64.19044898
60.62186599 53.96346826 44.06538636 52.04519976 51.32884016
58.65240379 60.44789401 58.94743634 59.75308787 53.56534397
54.22091468 47.22502341 51.5425848 50.0923126 49.80608264
45.69087847 50.07778871 54.21701441 58.79268406 63.59307774
66.08195696 65.49255218 65.11035657 68.47403716 70.70530564
73.21955929 76.57474822 65.89852612 66.51497688 72.42658468
73.80944844 69.56561001]
If you can afford adding a new dependency, I would rather do that with Pandas.
import numpy
import pandas
x = numpy.array([0, 1, 2, 3, 4])
s = pandas.Series(x)
print(s.rolling(3).min())
print(s.rolling(3).max())
print(s.rolling(3).mean())
print(s.rolling(3).std())
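Note that converting your NumPy array to a Pandas series does not, by default, create a copy of the array in most pandas versions, as Pandas uses NumPy arrays internally for its series. Newer versions with Copy-on-Write enabled may copy the data on construction.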
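Applied to an array with leading NaNs like the one in the question, this could look as follows. It is a minimal sketch with a short made-up array rather than the full RSI output; `min_periods=1` is shown only as an optional variation.

```python
import numpy as np
import pandas as pd

# Short made-up array with leading NaNs, similar in shape to the talib.RSI output.
rsi = np.array([np.nan, np.nan, 56.5, 60.1, 57.0, 55.7, 59.2])
s = pd.Series(rsi)

# By default a window needs 3 valid values, so any window touching a NaN yields NaN.
roll_max = s.rolling(window=3).max().tolist()
roll_min = s.rolling(window=3).min().tolist()

# With min_periods=1, a window produces a value as soon as it holds one valid number.
roll_max_partial = s.rolling(window=3, min_periods=1).max().tolist()

print(roll_max)
print(roll_min)
print(roll_max_partial)
```

`.tolist()` returns plain Python lists; use `.to_numpy()` instead if you want to stay with NumPy arrays.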
You can use np.lib.stride_tricks.as_strided:
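# a smaller example
import numpy as np
import numpy.random as npr
npr.seed(123)
arr = npr.randn(10)
arr[:4] = np.nan
# 8 overlapping windows of length 3; both strides equal the item size (8 bytes for float64)
step = arr.strides[0]
windows = np.lib.stride_tricks.as_strided(arr, shape=(8, 3), strides=(step, step))
print(windows.max(axis=1))
print(windows.sum(axis=1))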
[ nan nan nan nan 1.65143654 1.65143654
1.26593626 1.26593626]
[ nan nan nan nan -1.35384296 -1.20415534
-1.58965561 -0.02971677]
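If your NumPy is recent enough (1.20 or later, as far as I know), `np.lib.stride_tricks.sliding_window_view` builds the same kind of windowed view without computing shapes and strides by hand. A minimal sketch with a made-up array, padding the front with NaN so the result has the same length as the input:

```python
import numpy as np

arr = np.array([np.nan, np.nan, 3.0, 1.0, 4.0, 1.0, 5.0])
window = 3

# Read-only view of shape (len(arr) - window + 1, window).
windows = np.lib.stride_tricks.sliding_window_view(arr, window)

# Pad the first window - 1 positions with NaN so the output lines up with the input.
pad = np.full(window - 1, np.nan)
rolling_max = np.concatenate([pad, windows.max(axis=1)])
rolling_min = np.concatenate([pad, windows.min(axis=1)])

print(rolling_max.tolist())
print(rolling_min.tolist())
```

Any window that contains a NaN gives NaN here, because `max` and `min` propagate NaN; `np.nanmax` and `np.nanmin` would ignore them instead.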
I am working on the following Python code using interp1d. When I evaluate the interpolant at my query points, the first values of the output array are NaN. Why?
import numpy as np
from scipy import interpolate

Freq_Vector = np.arange(0,22051,1)
Freq_ref = np.array([20,25,31.5,40,50,63,80,100,125,160,200,250,315,400,500,630,750,800,1000,1250,1500,1600,2000,2500,3000,3150,4000,5000,6000,6300,8000,9000,10000,11200,12500,14000,15000,16000,18000,20000])
W_ref=-1*np.array([39.6,32,25.85,21.4,18.5,15.9,14.1,12.4,11,9.6,8.3,7.4,6.2,4.8,3.8,3.3,2.9,2.6,2.6,4.5,5.4,6.1,8.5,10.4,7.3,7,6.6,7,9.2,10.2,12.2,10.8,10.1,12.7,15,18.2,23.8,32.3,45.5,50])
if Freq_Vector[-1] > Freq_ref[-1]:
    Freq_ref[-1] = Freq_Vector[-1]
WdB = interpolate.interp1d(Freq_ref,W_ref,kind='cubic',axis=-1, copy=True, bounds_error=False, fill_value=np.nan)(Freq_Vector)
The first values in WdB are:
00000 = {float64} nan
00001 = {float64} nan
00002 = {float64} nan
00003 = {float64} nan
00004 = {float64} nan
00005 = {float64} nan
00006 = {float64} nan
00007 = {float64} nan
00008 = {float64} nan
00009 = {float64} nan
00010 = {float64} nan
00011 = {float64} nan
00012 = {float64} nan
00013 = {float64} nan
00014 = {float64} nan
00015 = {float64} nan
00016 = {float64} nan
00017 = {float64} nan
00018 = {float64} nan
00019 = {float64} nan
00020 = {float64} -39.6
00021 = {float64} -37.826313148
The following is the same output in MATLAB for the first values:
-58.0424562952059
-59.2576965087483
-60.1150845850336
-60.6367649499501
-60.8448820293863
-60.7615802492306
-60.4090040353715
-59.8092978136973
-58.9846060100965
-57.9570730504576
-56.7488433606689
-55.3820613666188
-53.8788714941959
-52.2614181692886
-50.5518458177851
-48.7722988655741
-46.9449217385440
-45.0918588625830
-43.2352546635798
-41.3972535674226
-39.6000000000000
-37.8656383872004
How can I avoid this and get actual values, like MATLAB does with interp1d?
interp1d "outputs the beginning values of array as NaN. Why?"
Because the set of sample points you give it (Freq_ref) has a lower bound of 20, and when bounds_error is False, interp1d assigns fill_value to any query point outside the sample range (docs).
And since you requested an interpolation for frequency values from 0 to 19, the method assigned them NaN.
This is different from Matlab's default which is to extrapolate using the requested interpolation method (docs).
That being said, I would be wary of calling Matlab's (or any program's) default extrapolation values "real values", as extrapolation can be quite difficult and can easily generate anomalous behavior. For the values you quoted, Matlab's 'cubic'/'pchip' extrapolation produces the graph:
The extrapolation indicates that the y-value turns over. This may be correct but should be considered carefully before taking as gospel.
That being said, if you would like to add extrapolation abilities to the interp1d method, see this answer (since I'm a Matlab guy and not a Python guy (yet)).
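For what it is worth, SciPy's interp1d accepts fill_value="extrapolate" (available since SciPy 0.17, as far as I know). The following is a minimal sketch using only the first six reference points from the question; the names `x_ref`, `y_ref` and `x_query` are made up for the example.

```python
import numpy as np
from scipy.interpolate import interp1d

# First six reference points from the question; the lowest sample sits at x = 20.
x_ref = np.array([20.0, 25.0, 31.5, 40.0, 50.0, 63.0])
y_ref = -np.array([39.6, 32.0, 25.85, 21.4, 18.5, 15.9])
x_query = np.arange(0.0, 64.0)

# bounds_error=False with the default fill_value: NaN outside [20, 63].
nan_outside = interp1d(x_ref, y_ref, kind="cubic", bounds_error=False)(x_query)

# fill_value="extrapolate" continues the cubic spline beyond the sample range.
extrapolated = interp1d(x_ref, y_ref, kind="cubic",
                        fill_value="extrapolate")(x_query)

print(nan_outside[:22])
print(extrapolated[:22])
```

Do not expect the extrapolated numbers to match Matlab's: kind='cubic' in SciPy is a cubic spline, which is not necessarily the same scheme as Matlab's 'cubic'/'pchip', and the caveat about trusting extrapolated values applies either way.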
I do not know exactly the reason, but the fit actually works when looking at the plotted data.
from scipy import interpolate
import numpy as np
from matplotlib import pyplot as plt
Freq_Vector = np.arange(0,22051.0,1)
Freq_ref = np.array([20,25,31.5,40,50,63,80,100,125,160,200,250,315,
    400,500,630,750,800,1000,1250,1500,1600,2000,2500,3000,3150,
    4000,5000,6000,6300,8000,9000,10000,11200,12500,14000,15000,
    16000,18000,20000])
W_ref=-1*np.array([39.6,32,25.85,21.4,18.5,15.9,14.1,12.4,11,
    9.6,8.3,7.4,6.2,4.8,3.8,3.3,2.9,2.6,2.6,4.5,5.4,6.1,8.5,10.4,7.3,7,
    6.6,7,9.2,10.2,12.2,10.8,10.1,12.7,15,18.2,23.8,32.3,45.5,50])
if Freq_Vector[-1] > Freq_ref[-1]:
    Freq_ref[-1] = Freq_Vector[-1]
WdB = interpolate.interp1d(Freq_ref.tolist(),W_ref.tolist(),
    kind='cubic', bounds_error=False)(Freq_Vector)
# reference points as dots, interpolated curve as a dash-dot line
plt.plot(Freq_ref,W_ref,'.',color='black',label='Reference')
plt.plot(Freq_Vector,WdB,'-.',color='blue',label='Interpolated')
plt.legend()
The plot looks as follows:
The interpolation is actually happening, but the fit is not as good as one would like. If your intention is to fit your data, why not use a spline interpolator? It is still cubic, but less prone to overshooting.
WdB = interpolate.InterpolatedUnivariateSpline(Freq_ref.tolist(),W_ref.tolist())(Freq_Vector)
And the data and plots come out very smoothly.
WdB
Out[34]:
array([-114.42984432, -108.43602531, -102.72381906, ..., -50.00471866,
-50.00236016, -50. ])
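Note that the first values of WdB above (around -114 at 0 Hz) come from extrapolation below the first reference point, since InterpolatedUnivariateSpline extrapolates by default. If that is not wanted, its ext argument controls what happens outside the sample range. A minimal sketch using the first six reference points from the question (variable names are made up):

```python
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline

# First six reference points from the question.
x_ref = np.array([20.0, 25.0, 31.5, 40.0, 50.0, 63.0])
y_ref = -np.array([39.6, 32.0, 25.85, 21.4, 18.5, 15.9])
x_query = np.arange(0.0, 64.0)

# ext="extrapolate" (the default) continues the cubic spline outside [20, 63].
extrap = InterpolatedUnivariateSpline(x_ref, y_ref, k=3, ext="extrapolate")(x_query)

# ext="const" holds the boundary value instead of extrapolating.
clamped = InterpolatedUnivariateSpline(x_ref, y_ref, k=3, ext="const")(x_query)

print(extrap[:5])
print(clamped[:5])
```

Inside the sample range both variants agree; they only differ for query points below 20 or above 63.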