nan in interp1d scipy - python

I have the following code that I am working on in Python with interp1d, and it seems that evaluating the interp1d result at the query points outputs the beginning values of the array as NaN. Why?
import numpy as np
from scipy import interpolate

Freq_Vector = np.arange(0, 22051, 1)
Freq_ref = np.array([20,25,31.5,40,50,63,80,100,125,160,200,250,315,400,500,630,750,800,1000,1250,1500,1600,2000,2500,3000,3150,4000,5000,6000,6300,8000,9000,10000,11200,12500,14000,15000,16000,18000,20000])
W_ref = -1*np.array([39.6,32,25.85,21.4,18.5,15.9,14.1,12.4,11,9.6,8.3,7.4,6.2,4.8,3.8,3.3,2.9,2.6,2.6,4.5,5.4,6.1,8.5,10.4,7.3,7,6.6,7,9.2,10.2,12.2,10.8,10.1,12.7,15,18.2,23.8,32.3,45.5,50])
if Freq_Vector[-1] > Freq_ref[-1]:
    Freq_ref[-1] = Freq_Vector[-1]
WdB = interpolate.interp1d(Freq_ref, W_ref, kind='cubic', axis=-1, copy=True,
                           bounds_error=False, fill_value=np.nan)(Freq_Vector)
The first values in WdB are:
00000 = {float64} nan
00001 = {float64} nan
00002 = {float64} nan
00003 = {float64} nan
00004 = {float64} nan
00005 = {float64} nan
00006 = {float64} nan
00007 = {float64} nan
00008 = {float64} nan
00009 = {float64} nan
00010 = {float64} nan
00011 = {float64} nan
00012 = {float64} nan
00013 = {float64} nan
00014 = {float64} nan
00015 = {float64} nan
00016 = {float64} nan
00017 = {float64} nan
00018 = {float64} nan
00019 = {float64} nan
00020 = {float64} -39.6
00021 = {float64} -37.826313148
The following is the same output from MATLAB for the first values:
-58.0424562952059
-59.2576965087483
-60.1150845850336
-60.6367649499501
-60.8448820293863
-60.7615802492306
-60.4090040353715
-59.8092978136973
-58.9846060100965
-57.9570730504576
-56.7488433606689
-55.3820613666188
-53.8788714941959
-52.2614181692886
-50.5518458177851
-48.7722988655741
-46.9449217385440
-45.0918588625830
-43.2352546635798
-41.3972535674226
-39.6000000000000
-37.8656383872004
How can I avoid this and actually get real values at the low end, like MATLAB's interp1 does?

interp1d "outputs the beginning values of array as NaN. Why?"
Because the set of sample points you give it (Freq_ref) has a lower bound of 20, and interp1d assigns fill_value to any query point outside the sample range when bounds_error is False (docs).
Since you requested an interpolation for frequency values from 0 to 19, those points were assigned NaN.
This is different from MATLAB's default, which is to extrapolate using the requested interpolation method (docs).
That being said, I would be wary of calling MATLAB's (or any program's) default extrapolation values "real values", as extrapolation can be quite difficult and can easily produce anomalous behavior. For the values you quote, MATLAB's 'cubic'/'pchip' extrapolation produces the graph:
The extrapolation indicates that the y-value turns over. This may be correct, but it should be considered carefully before being taken as gospel.
If you would like to add extrapolation abilities to the interp1d method, see this answer (I'm a MATLAB person, not (yet) a Python person).
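For completeness, here is a minimal sketch of the built-in route, assuming a SciPy version of 0.17 or newer: interp1d accepts fill_value='extrapolate', which removes the leading NaNs, although whether the extrapolated values are trustworthy is a separate question.
from scipy import interpolate
import numpy as np

# variable names follow the question above
WdB = interpolate.interp1d(Freq_ref, W_ref, kind='cubic',
                           bounds_error=False,
                           fill_value='extrapolate')(Freq_Vector)
# WdB[:20] now holds cubic extrapolations instead of NaN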

I do not know the exact reason, but the fit actually works when you look at the plotted data.
from scipy import interpolate
import numpy as np
from matplotlib import pyplot as plt

Freq_Vector = np.arange(0, 22051.0, 1)
Freq_ref = np.array([20,25,31.5,40,50,63,80,100,125,160,200,250,315,
                     400,500,630,750,800,1000,1250,1500,1600,2000,2500,3000,3150,
                     4000,5000,6000,6300,8000,9000,10000,11200,12500,14000,15000,
                     16000,18000,20000])
W_ref = -1*np.array([39.6,32,25.85,21.4,18.5,15.9,14.1,12.4,11,
                     9.6,8.3,7.4,6.2,4.8,3.8,3.3,2.9,2.6,2.6,4.5,5.4,6.1,8.5,10.4,7.3,7,
                     6.6,7,9.2,10.2,12.2,10.8,10.1,12.7,15,18.2,23.8,32.3,45.5,50])
if Freq_Vector[-1] > Freq_ref[-1]:
    Freq_ref[-1] = Freq_Vector[-1]
WdB = interpolate.interp1d(Freq_ref, W_ref,
                           kind='cubic', bounds_error=False)(Freq_Vector)
plt.plot(Freq_ref, W_ref, '.', color='black', label='Reference')
plt.plot(Freq_Vector, WdB, '-.', color='blue', label='Interpolated')
plt.legend()
The plot looks as follows:
The interpolation is actually happening, but the fit is not as good as one would like. But if your intention is to fit your data, why don't you use a spline interpolator? It is still cubic but less prone to overshoot.
WdB = interpolate.InterpolatedUnivariateSpline(Freq_ref, W_ref)(Freq_Vector)
And the data and plots come out very smoothly.
WdB
Out[34]:
array([-114.42984432, -108.43602531, -102.72381906, ..., -50.00471866,
-50.00236016, -50. ])
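One thing to note: below 20 Hz the spline is still extrapolating. If you would rather clamp those values to the first reference point, InterpolatedUnivariateSpline takes an ext argument; this is just a sketch of one option (assuming a SciPy version that accepts the string form), not necessarily the right choice for your data.
from scipy import interpolate

# ext='const' holds the boundary values outside [Freq_ref[0], Freq_ref[-1]]
# instead of extrapolating the cubic
spline = interpolate.InterpolatedUnivariateSpline(Freq_ref, W_ref, k=3, ext='const')
WdB_clamped = spline(Freq_Vector)
# WdB_clamped[:20] is then W_ref[0] (-39.6) rather than an extrapolated curve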

Related

Keep inner values of numpy array

I have a TIFF with lakes, which I converted into a 2D array. I would like to keep only the outline of the lakes in the 2D array.
import rasterio
import numpy as np

with rasterio.open('myfile.tif') as dtm:
    array = dtm.read(1)
array[array > 0] = 1
array = array.astype(float)
array[array == 0] = np.nan
My array looks like this now, a lake can be seen in the upper right corner:
[[ nan nan nan ... 2888.001 **2877.458 2867.5798**]
[ nan nan nan ... 2890.188 **2879.2876 2869.0415**]
[ nan nan nan ... 2892.2622 2880.9907 2870.4985]
...
[ nan nan nan ... nan nan nan]
[ nan nan nan ... nan nan nan]
[ nan nan nan ... nan nan nan]]
To keep the outline of the lakes, I need to set all values to NaN that are NOT located next to a NaN (the values next to a NaN are marked in bold).
I have tried:
array[1:-1, 1:-1] = np.nan
However, this converts ALL inner values of the entire array to nan, not just the inner values of the lakes.
If you know of a completely different way how to keep the outline of the lakes (maybe with rasterio), I would also be thankful.
I hope I made clear what I mean by the inner values of the lakes.
Tim
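A possible approach (a sketch, not tested on your data; the small array below is a hypothetical stand-in for the one read from the TIFF): treat the NaN cells as a mask, dilate it by one cell with scipy.ndimage, and keep only the valid cells that the dilated mask touches.
import numpy as np
from scipy import ndimage

# hypothetical small array: a 3x3 "lake" of values surrounded by NaN
array = np.full((6, 6), np.nan)
array[1:4, 2:5] = 1.0

nan_mask = np.isnan(array)
# grow the NaN mask by one cell in every direction (8-connectivity)
near_nan = ndimage.binary_dilation(nan_mask, structure=np.ones((3, 3), dtype=bool))
# keep a cell only if it is valid itself but has at least one NaN neighbour
outline = np.where(~nan_mask & near_nan, array, np.nan)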

How to make a smooth heatmap?

I have a pandas dataframe called 'result' containing Longitude, Latitude and Production values. The dataframe looks like the following. For each pair of latitude and longitude there is one production value, so there are many NaN values.
Latitude    0.00000  32.00057  32.00078  ...  32.92114  32.98220  33.11217
Longitude                                ...
-104.5213       NaN       NaN       NaN  ...       NaN       NaN       NaN
-104.4745       NaN       NaN       NaN  ...       NaN       NaN       NaN
-104.4679       NaN       NaN       NaN  ...       NaN       NaN       NaN
-104.4678       NaN       NaN       NaN  ...       NaN       NaN       NaN
-104.4660       NaN       NaN       NaN  ...       NaN       NaN       NaN
This is my code:
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize'] = (12.0, 10.0)
plt.rcParams['font.family'] = "serif"
plt.figure(figsize=(14, 7))
plt.title('Heatmap based on ANN results')
sns.heatmap(result)
The heatmap plot looks like this
but I want it to look more like this
How to adjust my code so it looks like the one on the second image?
I made a quick and dirty example of how you can smooth data in a numpy array. It should be directly applicable to pandas dataframes as well.
First I present the code, then go through it:
# Some needed packages
import numpy as np
import matplotlib.pyplot as plt
from scipy import sparse
from scipy.ndimage import gaussian_filter

np.random.seed(42)

# init an array with a lot of nans to imitate OP data
non_zero_entries = sparse.random(50, 60)
sparse_matrix = non_zero_entries.toarray()
sparse_matrix[sparse_matrix == 0] = np.nan

# set nans to 0
sparse_matrix[np.isnan(sparse_matrix)] = 0

# smooth the matrix
smoothed_matrix = gaussian_filter(sparse_matrix, sigma=5)

# Set 0s back to NaN as they will be ignored when plotting
# smoothed_matrix[smoothed_matrix == 0] = np.nan
sparse_matrix[sparse_matrix == 0] = np.nan

# Plot the data
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2,
                               sharex=False, sharey=True,
                               figsize=(9, 4))
ax1.matshow(sparse_matrix)
ax1.set_title("Original matrix")
ax2.matshow(smoothed_matrix)
ax2.set_title("Smoothed matrix")
plt.tight_layout()
plt.show()
The code is fairly simple. You can't smooth NaNs, so we have to get rid of them. I set them to zero, but depending on your field you might want to interpolate them instead.
Using gaussian_filter we smooth the image; sigma controls the width of the kernel.
The plot code yields the following images
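If zeroing the NaNs biases your field too much, one alternative (a sketch using scipy.interpolate.griddata; the helper name fill_nans is just illustrative) is to fill them by interpolating from the valid cells before smoothing.
import numpy as np
from scipy.interpolate import griddata

def fill_nans(data):
    # fill NaN cells from the valid cells by nearest-neighbour interpolation
    rows, cols = np.indices(data.shape)
    valid = ~np.isnan(data)
    return griddata(
        (rows[valid], cols[valid]),  # coordinates of the known cells
        data[valid],                 # known values
        (rows, cols),                # evaluate on the full grid
        method="nearest",
    )

# hypothetical usage on the array built above:
# smoothed_matrix = gaussian_filter(fill_nans(sparse_matrix), sigma=5)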

One-way Anova loop through pandas dataframe - results in a single table

I have a pandas dataframe containing 16 columns, of which 14 represent variables on which I perform a looped ANOVA test using statsmodels. My dataframe looks something like this (simplified):
ID Cycle_duration Average_support_phase Average_swing_phase Label
1 23.1 34.3 47.2 1
2 27.3 38.4 49.5 1
3 25.8 31.1 45.7 1
4 24.5 35.6 41.9 1
...
So far this is what I'm doing:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv('features_total.csv')

for variable in df.columns:
    model = ols('{} ~ Label'.format(variable), data=df).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    print(anova_table)
Which yields:
sum_sq df F PR(>F)
Label 0.124927 2.0 2.561424 0.084312
Residual 1.731424 71.0 NaN NaN
sum_sq df F PR(>F)
Label 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
sum_sq df F PR(>F)
Label 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
I'm getting an individual table printed for each variable the ANOVA is performed on. Basically, what I want is to print one single table with the summarized results, something like this:
sum_sq df F PR(>F)
Cycle_duration 0.1249270 2.0 2.561424 0.084312
Residual 1.7314240 71.0 NaN NaN
Average_support_phase 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
Average_swing_phase 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
I can already see a problem, because this method always outputs the 'Label' label before the actual values rather than the name of the variable in question (as shown above, I would like to have the variable name above each 'Residual' row). Is this even possible with the statsmodels approach?
I'm fairly new to Python, and excuse me if this has nothing to do with statsmodels - in that case, please point me towards what I should be trying.
You can collect the tables and concatenate them at the end of your loop. This creates a hierarchical index, but I think that makes it a bit clearer. Something like this:
keys = []
tables = []
for variable in df.columns:
    model = ols('{} ~ Label'.format(variable), data=df).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    keys.append(variable)
    tables.append(anova_table)
df_anova = pd.concat(tables, keys=keys, axis=0)
Somewhat related: I would also suggest correcting for multiple comparisons. This is more a statistical suggestion than a coding one, but since you are performing numerous statistical tests, it makes sense to account for the probability that at least one of them produces a false positive.
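A sketch of what that correction could look like, assuming the df_anova built above (the 'fdr_bh' method choice is just an example):
from statsmodels.stats.multitest import multipletests

# pull the per-variable p-values (the 'Label' rows) out of the concatenated table
pvals = df_anova.xs('Label', level=1)['PR(>F)'].dropna()
reject, pvals_corrected, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh')
print(dict(zip(pvals.index, pvals_corrected)))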

DataFrame interpolate not working in pandas + multidimensional interpolation

Multidimensional interpolation with a dataframe is not working:
import pandas as pd
import numpy as np

raw_data = {'CCY_CODE': ['SGD','USD','USD','USD','USD','USD','USD','EUR','EUR','EUR','EUR','EUR','EUR','USD'],
            'END_DATE': ['16/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018',
                         '17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018'],
            'STRIKE': [0.005,0.01,0.015,0.02,0.025,0.03,0.035,0.04,0.045,0.05,0.55,0.06,0.065,0.07],
            'VOLATILITY': [np.nan,np.nan,0.3424,np.nan,0.2617,0.2414,np.nan,np.nan,0.215,0.212,0.2103,np.nan,0.2092,np.nan]
            }
df_volsurface = pd.DataFrame(raw_data, columns=['CCY_CODE','END_DATE','STRIKE','VOLATILITY'])
df_volsurface['END_DATE'] = pd.to_datetime(df_volsurface['END_DATE'])
df_volsurface.interpolate(method='akima', limit_direction='both')
Output:
    CCY_CODE   END_DATE  STRIKE  VOLATILITY
0        SGD  3/16/2018   0.005         NaN
1        USD  3/17/2018   0.01          NaN
2        USD  3/17/2018   0.015      0.3424
3        USD  3/17/2018   0.02     0.296358
4        USD  3/17/2018   0.025      0.2617
5        USD  3/17/2018   0.03       0.2414
6        USD  3/17/2018   0.035    0.230295
7        EUR  3/17/2018   0.04     0.220911
8        EUR  3/17/2018   0.045       0.215
9        EUR  3/17/2018   0.05        0.212
10       EUR  3/17/2018   0.55       0.2103
11       EUR  3/17/2018   0.06     0.209471
12       EUR  3/17/2018   0.065      0.2092
13       USD  3/17/2018   0.07          NaN
Expected Result:
    CCY_CODE   END_DATE  STRIKE  VOLATILITY
0        SGD  3/16/2018   0.005  NaN
1        USD  3/17/2018   0.01   Expected some logical value
2        USD  3/17/2018   0.015  0.3424
3        USD  3/17/2018   0.02   0.296358
4        USD  3/17/2018   0.025  0.2617
5        USD  3/17/2018   0.03   0.2414
6        USD  3/17/2018   0.035  0.230295
7        EUR  3/17/2018   0.04   0.220911
8        EUR  3/17/2018   0.045  0.215
9        EUR  3/17/2018   0.05   0.212
10       EUR  3/17/2018   0.55   0.2103
11       EUR  3/17/2018   0.06   0.209471
12       EUR  3/17/2018   0.065  0.2092
13       USD  3/17/2018   0.07   Expected some logical value
The linear interpolation method just copies the last available value to all backward and forward missing values, without considering CCY_CODE:
df_volsurface.interpolate(method='linear',limit_direction='both')
Output:
    CCY_CODE   END_DATE  STRIKE  VOLATILITY
0        SGD  3/16/2018   0.005     0.3424
1        USD  3/17/2018   0.01      0.3424
2        USD  3/17/2018   0.015     0.3424
3        USD  3/17/2018   0.02      0.30205
4        USD  3/17/2018   0.025     0.2617
5        USD  3/17/2018   0.03      0.2414
6        USD  3/17/2018   0.035     0.2326
7        EUR  3/17/2018   0.04      0.2238
8        EUR  3/17/2018   0.045     0.215
9        EUR  3/17/2018   0.05      0.212
10       EUR  3/17/2018   0.55      0.2103
11       EUR  3/17/2018   0.06      0.20975
12       EUR  3/17/2018   0.065     0.2092
13       USD  3/17/2018   0.07      0.2092
Any help is appreciated! Thanks!
I'd like to point out that this is still one-dimensional interpolation: we have one independent variable ('STRIKE') and one dependent variable ('VOLATILITY'). The interpolation is simply done under different conditions, e.g. for each day, each currency, each scenario, etc. The following is an example of how the interpolation can be done grouped by 'END_DATE' and 'CCY_CODE'.
import scipy.interpolate

# set all the conditions as index
df_volsurface.set_index(['END_DATE', 'CCY_CODE', 'STRIKE'], inplace=True)
df_volsurface.sort_index(level=['END_DATE', 'CCY_CODE', 'STRIKE'], inplace=True)

# create separate columns for all criteria except the independent variable
df_volsurface = df_volsurface.unstack(level=['END_DATE', 'CCY_CODE'])

for ccy in df_volsurface:
    indices = df_volsurface[ccy].notna()
    if not any(indices):
        continue  # we are not interested in a column with only NaN
    x = df_volsurface.index.get_level_values(level='STRIKE')  # independent var
    y = df_volsurface[ccy]  # dependent var
    # create interpolation function
    f_interp = scipy.interpolate.interp1d(x[indices], y[indices], kind='linear',
                                          bounds_error=False, fill_value='extrapolate')
    df_volsurface['VOL_INTERP', ccy[1], ccy[2]] = f_interp(x)
print(df_volsurface)
The interpolation for the other conditions should work analogously. This is the resulting DataFrame:
VOLATILITY VOL_INTERP
END_DATE 2018-03-16 2018-03-17 2018-03-17
CCY_CODE SGD EUR USD EUR USD
STRIKE
0.005 NaN NaN NaN 0.23900 0.42310
0.010 NaN NaN NaN 0.23600 0.38275
0.015 NaN NaN 0.3424 0.23300 0.34240
0.020 NaN NaN NaN 0.23000 0.30205
0.025 NaN NaN 0.2617 0.22700 0.26170
0.030 NaN NaN 0.2414 0.22400 0.24140
0.035 NaN NaN NaN 0.22100 0.22110
0.040 NaN NaN NaN 0.21800 0.20080
0.045 NaN 0.2150 NaN 0.21500 0.18050
0.050 NaN 0.2120 NaN 0.21200 0.16020
0.055 NaN 0.2103 NaN 0.21030 0.13990
0.060 NaN NaN NaN 0.20975 0.11960
0.065 NaN 0.2092 NaN 0.20920 0.09930
0.070 NaN NaN NaN 0.20865 0.07900
Use df_volsurface.stack() to return to a multiindex of your choice. There are also several pandas interpolation methods to choose from; however, I have not found a satisfactory solution to your problem with method='akima', because it only interpolates between the given data points and does not seem to extrapolate beyond them.
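As an alternative to unstacking, here is a sketch that does the same one-dimensional interpolation per (END_DATE, CCY_CODE) group with a plain groupby. It assumes the original flat df_volsurface, before any index is set, and the helper name interp_group is just illustrative:
import pandas as pd
from scipy import interpolate

def interp_group(g):
    # interpolate/extrapolate VOLATILITY against STRIKE within one group
    g = g.sort_values('STRIKE')
    valid = g['VOLATILITY'].notna()
    if valid.sum() < 2:
        return g  # not enough points to fit anything
    f = interpolate.interp1d(g.loc[valid, 'STRIKE'], g.loc[valid, 'VOLATILITY'],
                             kind='linear', bounds_error=False, fill_value='extrapolate')
    g['VOL_INTERP'] = f(g['STRIKE'])
    return g

df_interp = (df_volsurface.groupby(['END_DATE', 'CCY_CODE'], group_keys=False)
                          .apply(interp_group))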

How to get the rolling(window = 3).max() on a numpy.ndarray?

I have a numpy.ndarray as follows. It's the output from talib.RSI and its type is numpy.ndarray. I want to get the rolling(window=3).max() and the rolling(window=3).min() of it.
How to do that?
[ nan nan nan nan nan
nan nan nan nan nan
nan nan nan nan 56.50118203
60.05461743 56.99068148 55.70899949 59.2299361 64.19044898
60.62186599 53.96346826 44.06538636 52.04519976 51.32884016
58.65240379 60.44789401 58.94743634 59.75308787 53.56534397
54.22091468 47.22502341 51.5425848 50.0923126 49.80608264
45.69087847 50.07778871 54.21701441 58.79268406 63.59307774
66.08195696 65.49255218 65.11035657 68.47403716 70.70530564
73.21955929 76.57474822 65.89852612 66.51497688 72.42658468
73.80944844 69.56561001]
If you can afford adding a new dependency, I would rather do that with Pandas.
import numpy
import pandas
x = numpy.array([0, 1, 2, 3, 4])
s = pandas.Series(x)
print(s.rolling(3).min())
print(s.rolling(3).max())
print(s.rolling(3).mean())
print(s.rolling(3).std())
Note that converting your NumPy array to a Pandas series does not create a copy of the array, as Pandas uses NumPy arrays internally for its series.
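If you want the result back as a numpy.ndarray, here is a small usage sketch (the sample values just stand in for the talib.RSI output, and .to_numpy() assumes pandas 0.24 or newer):
import numpy as np
import pandas as pd

rsi = np.array([np.nan, np.nan, 56.50, 60.05, 56.99, 55.71])  # stand-in for talib.RSI output
s = pd.Series(rsi)
rolling_max = s.rolling(3).max().to_numpy()
rolling_min = s.rolling(3).min().to_numpy()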
You can use np.lib.stride_tricks.as_strided:
# a smaller example
import numpy as np
import numpy.random as npr

npr.seed(123)
arr = npr.randn(10)
arr[:4] = np.nan
# shape=(len(arr) - 2, 3) gives every length-3 window;
# the stride of 8 bytes is arr.itemsize for float64
windows = np.lib.stride_tricks.as_strided(arr, shape=(8, 3), strides=(8, 8))
print(windows.max(axis=1))
print(windows.sum(axis=1))
[ nan nan nan nan 1.65143654 1.65143654
1.26593626 1.26593626]
[ nan nan nan nan -1.35384296 -1.20415534
-1.58965561 -0.02971677]
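If your NumPy is 1.20 or newer (an assumption about your environment), a safer alternative to hand-computed strides is sliding_window_view, which builds the same windows without hard-coding the item size:
import numpy as np

# same arr as above; each row of windows is one length-3 window
windows = np.lib.stride_tricks.sliding_window_view(arr, window_shape=3)
rolling_max = windows.max(axis=1)
rolling_min = windows.min(axis=1)
# note: this yields len(arr) - 2 rows; pandas' rolling(3) would prepend two NaNs instead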
