How to more accurately approximate a set of points? - python

I would like to approximate bond yields in Python, but the question is which curve describes them best.
import numpy as np
import matplotlib.pyplot as plt
x = [0.02, 0.22, 0.29, 0.38, 0.52, 0.55, 0.67, 0.68, 0.74, 0.83, 1.05, 1.06, 1.19, 1.26, 1.32, 1.37, 1.38, 1.46, 1.51, 1.61, 1.62, 1.66, 1.87, 1.93, 2.01, 2.09, 2.24, 2.26, 2.3, 2.33, 2.41, 2.44, 2.51, 2.53, 2.58, 2.64, 2.65, 2.76, 3.01, 3.17, 3.21, 3.24, 3.3, 3.42, 3.51, 3.67, 3.72, 3.74, 3.83, 3.84, 3.86, 3.95, 4.01, 4.02, 4.13, 4.28, 4.36, 4.4]
y = [3, 3.96, 4.21, 2.48, 4.77, 4.13, 4.74, 5.06, 4.73, 4.59, 4.79, 5.53, 6.14, 5.71, 5.96, 5.31, 5.38, 5.41, 4.79, 5.33, 5.86, 5.03, 5.35, 5.29, 7.41, 5.56, 5.48, 5.77, 5.52, 5.68, 5.76, 5.99, 5.61, 5.78, 5.79, 5.65, 5.57, 6.1, 5.87, 5.89, 5.75, 5.89, 6.1, 5.81, 6.05, 8.31, 5.84, 6.36, 5.21, 5.81, 7.88, 6.63, 6.39, 5.99, 5.86, 5.93, 6.29, 6.07]
# fit y = a0*sqrt(x) + a1
a = np.polyfit(np.power(x, 0.5), y, 1)
y1 = a[0]*np.power(x, 0.5) + a[1]
# fit y = b0*log(x) + b1
b = np.polyfit(np.log(x), y, 1)
y2 = b[0]*np.log(x) + b[1]
# fit y = c0*x^2 + c1*x + c2
c = np.polyfit(x, y, 2)
y3 = c[0]*np.power(x, 2) + c[1]*np.array(x) + c[2]
plt.rcParams['figure.figsize'] = [10, 5]
plt.plot(x, y, 'o', color='black')    # data
plt.plot(x, y1, lw=3, color='red')    # square-root fit
plt.plot(x, y2, lw=3, color='green')  # logarithmic fit
plt.plot(x, y3, lw=3, color='blue')   # quadratic fit
plt.axis([0, 4.5, 2, 8])
plt.show()
The parabola (blue) also turns down at the end, the logarithmic curve (green) drops off too quickly near the beginning, and the square-root fit (red) has a strange hump. Are there other ways to get a more accurate approximation, or is what I already have pretty good?

Your fits look really good! If you want more information to compare which of your fits is better, you can look at the sum of squared residuals and the covariance of the coefficients. Note that numpy's polyfit returns either the residuals (full=True) or the covariance matrix (cov=True), not both in one call:
a, residuals, rank, sv, rcond = np.polyfit(np.power(x, 0.5), y, 1, full=True)
a, cov = np.polyfit(np.power(x, 0.5), y, 1, cov=True)
Residuals is the sum of squared residuals of the least-squares fit.
The cov matrix is the covariance of the polynomial coefficient estimates; the diagonal of this matrix holds the variance estimate for each coefficient.
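For instance, a short sketch (reusing the variables from the calls above) that turns that diagonal into per-coefficient standard errors:
# standard error of each coefficient = square root of the covariance diagonal
coeff_std_err = np.sqrt(np.diag(cov))
print("coefficients:", a)
print("sum of squared residuals:", residuals)
print("coefficient standard errors:", coeff_std_err)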

It is worth reading up on the different types of error measures; these will help you determine the best fit. Most commonly, Root Mean Squared Error (RMSE) or Mean Absolute Percentage Error (MAPE) are used; Relative Root Mean Squared Error (rRMSE) is also worth a look. The choice of error measure depends on the problem at hand.
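As a quick illustration (a sketch reusing x, y and the fitted y1, y2, y3 from the question), RMSE and MAPE for each candidate curve can be computed like this:
import numpy as np

y_arr = np.asarray(y)
for name, y_fit in [('sqrt', y1), ('log', y2), ('quadratic', y3)]:
    rmse = np.sqrt(np.mean((y_arr - y_fit) ** 2))          # root mean squared error
    mape = np.mean(np.abs((y_arr - y_fit) / y_arr)) * 100  # mean absolute percentage error
    print(name, 'RMSE:', rmse, 'MAPE (%):', mape)
The curve with the lowest error on the measure you care about is the better fit for your purposes.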


Scatter color map based on a list different from x and y (Python)

Imagine you have the following:
import matplotlib.pyplot as plt
x = [0.8, 0.4, 0.6, 1, 1.5, 1.8, 2.0, 0.5, 1.3, 0.1]
y = [0.5, 0.12, 0.45, 0.98, 1.31, 1.87, 1.0, 0.11, 1.45, 0.67]
r = [x[i]/y[i] for i in range(len(x))]
fig, ax = plt.subplots(1,1, tight_layout=True, figsize=(10,10))
ax.subplot(x,y,cmap=?)
Now I would like to plot this and have a color map. However, the colors of the points are given by the values of r. How do I do this?
Thank you in advance.
Here's how you do it:
import matplotlib.pyplot as plt
x = [0.8, 0.4, 0.6, 1, 1.5, 1.8, 2.0, 0.5, 1.3, 0.1]
y = [0.5, 0.12, 0.45, 0.98, 1.31, 1.87, 1.0, 0.11, 1.45, 0.67]
r = [x[i]/y[i] for i in range(len(x))]
fig, ax = plt.subplots(1,1, tight_layout=True, figsize=(10,10))
ax.scatter(x, y, c=r)
You can also change the default colormap:
ax.scatter(x, y, c=r, cmap='viridis')
The complete colormap reference is in the Matplotlib documentation.
You can use Matplotlib's pyplot.scatter, which takes two arrays (x values, y values) as required arguments. You can also supply a third array c, of the same length as x and y, that sets the color value of each point.
In your case:
ax.scatter(x=x, y=y, c=r)
easy as that!
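If you also want a key for the r values, a colorbar can be attached to the scatter (a small sketch reusing the fig and ax from the code above):
sc = ax.scatter(x, y, c=r, cmap='viridis')
fig.colorbar(sc, ax=ax, label='r = x / y')   # maps the colors back to r values
plt.show()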

How can I get one array to return only the masked values defined by another array with Numpy / PyTorch?

I have a mask, which has a shape of: [64, 2895] and an array pred which has a shape of [64, 2895, 161].
mask is binary, containing only 0s and 1s. What I want to do is reduce pred so that it keeps the 64 batches and, along the 2895 dimension, returns the pred entries wherever the mask for that batch has a 1.
So as a simplified example, if:
mask = [[1, 0, 0],
[1, 1, 0],
[0, 0, 1]]
pred = [[[0.12, 0.23, 0.45, 0.56, 0.57],
[0.91, 0.98, 0.97, 0.96, 0.95],
[0.24, 0.46, 0.68, 0.80, 0.15]],
[[1.12, 1.23, 1.45, 1.56, 1.57],
[1.91, 1.98, 1.97, 1.96, 1.95],
[1.24, 1.46, 1.68, 1.80, 1.15]],
[[2.12, 2.23, 2.45, 2.56, 2.57],
[2.91, 2.98, 2.97, 2.96, 2.95],
[2.24, 2.46, 2.68, 2.80, 2.15]]]
What I want is:
[[[0.12, 0.23, 0.45, 0.56, 0.57]],
[[1.12, 1.23, 1.45, 1.56, 1.57],
[1.91, 1.98, 1.97, 1.96, 1.95]],
[[2.24, 2.46, 2.68, 2.80, 2.15]]]
I realize the resulting rows have different lengths; I hope that's possible. If not, then fill the missing entries with 0. Either NumPy or PyTorch would be helpful. Thank you.
With a fully vectorized computation a ragged result like that doesn't seem possible, but this gives you the version with the masked-out entries filled with 0:
# pred: torch.Size([64, 2895, 161])
# mask: torch.Size([64, 2895])
# extend mask with another dimension so it broadcasts for entry-wise multiplication
result = pred * mask[:, :, None]
and result is exactly what you want
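If you really do want the ragged per-batch result rather than the zero-filled tensor, a non-vectorized sketch using boolean indexing (assuming mask holds 0/1 values) would be:
# list of 64 tensors, each of shape [num_ones_in_that_batch, 161]
ragged = [p[m.bool()] for p, m in zip(pred, mask)]
This requires a plain Python loop over the batch dimension, which is the price of the ragged output.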

Updated title: Scipy.stats pdf bug?

I have a simple plot of a 2D Gaussian distribution.
import numpy as np
from scipy.stats import multivariate_normal
from matplotlib import pyplot as plt
means = [ 1.03872615e+00, -2.66927843e-05]
cov_matrix = [[3.88809050e-03, 3.90737359e-06], [3.90737359e-06, 4.28819569e-09]]
# This works
a_lims = [0.7, 1.3]
b_lims = [-5, 5]
# This does not work
a_lims = [0.700006488869478, 1.2849292618191401]
b_lims =[-5.000288311285968, 5.000099437047633]
dist = multivariate_normal(mean=means, cov=cov_matrix)
a_plot, b_plot = np.mgrid[a_lims[0]:a_lims[1]:1e-2, b_lims[0]:b_lims[1]:0.1]
pos = np.empty(a_plot.shape + (2,))
pos[:, :, 0] = a_plot
pos[:, :, 1] = b_plot
z = dist.pdf(pos)
plt.figure()
plt.contourf(a_plot, b_plot, z, cmap='coolwarm', levels=100)
If I use the limits marked "this works", I get the correct plot of the distribution.
However, if I use the same limits, only slightly adjusted, the plot is completely wrong: the distribution appears localized at different values.
I guess it is a bug in mgrid. Does anyone have any ideas? More specifically, why does the maximum of the distribution move?
Focusing just on the x axis:
In [443]: a_lims = [0.7, 1.3]
In [444]: np.mgrid[a_lims[0]:a_lims[1]:1e-2]
Out[444]:
array([0.7 , 0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.8 ,
0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.9 , 0.91,
0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99, 1. , 1.01, 1.02,
1.03, 1.04, 1.05, 1.06, 1.07, 1.08, 1.09, 1.1 , 1.11, 1.12, 1.13,
1.14, 1.15, 1.16, 1.17, 1.18, 1.19, 1.2 , 1.21, 1.22, 1.23, 1.24,
1.25, 1.26, 1.27, 1.28, 1.29, 1.3 ])
In [445]: a_lims = [0.700006488869478, 1.2849292618191401]
In [446]: np.mgrid[a_lims[0]:a_lims[1]:1e-2]
Out[446]:
array([0.70000649, 0.71000649, 0.72000649, 0.73000649, 0.74000649,
0.75000649, 0.76000649, 0.77000649, 0.78000649, 0.79000649,
0.80000649, 0.81000649, 0.82000649, 0.83000649, 0.84000649,
0.85000649, 0.86000649, 0.87000649, 0.88000649, 0.89000649,
0.90000649, 0.91000649, 0.92000649, 0.93000649, 0.94000649,
0.95000649, 0.96000649, 0.97000649, 0.98000649, 0.99000649,
1.00000649, 1.01000649, 1.02000649, 1.03000649, 1.04000649,
1.05000649, 1.06000649, 1.07000649, 1.08000649, 1.09000649,
1.10000649, 1.11000649, 1.12000649, 1.13000649, 1.14000649,
1.15000649, 1.16000649, 1.17000649, 1.18000649, 1.19000649,
1.20000649, 1.21000649, 1.22000649, 1.23000649, 1.24000649,
1.25000649, 1.26000649, 1.27000649, 1.28000649])
In [447]: _444.shape
Out[447]: (61,)
In [449]: _446.shape
Out[449]: (59,)
mgrid, when given ranges like a:b:c, uses np.arange(a, b, c), and arange with a float step is not reliable with regard to the end point.
mgrid also lets you use np.linspace semantics by giving the number of points as an imaginary "step", which is better for floating-point steps. For example, with the first set of limits:
In [453]: a_lims = [0.7, 1.3]
In [454]: np.mgrid[a_lims[0]:a_lims[1]:61j]
Out[454]:
array([0.7 , 0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.8 ,
0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.9 , 0.91,
0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99, 1. , 1.01, 1.02,
1.03, 1.04, 1.05, 1.06, 1.07, 1.08, 1.09, 1.1 , 1.11, 1.12, 1.13,
1.14, 1.15, 1.16, 1.17, 1.18, 1.19, 1.2 , 1.21, 1.22, 1.23, 1.24,
1.25, 1.26, 1.27, 1.28, 1.29, 1.3 ])
===
By narrowing the b_lims considerably, and generating a finer mesh, I get a nice tilted ellipse.
means = [ 1, 0]
a_lims = [0.7, 1.3]
b_lims = [-.0002,.0002]
dist = multivariate_normal(mean=means, cov=cov_matrix)
a_plot, b_plot = np.mgrid[ a_lims[0]:a_lims[1]:1001j, b_lims[0]:b_lims[1]:1001j]
So I think the difference in your plots is an artifact of an excessively coarse mesh in the vertical direction. That potentially affects both the pdf generation and the contouring.
In a high-resolution plot overlaid with the original grid points, only one b level intersects the high-probability region. Since the ellipse is tilted, the two coarse grids sample different parts of it, hence the seemingly different pdfs.
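For completeness, here is a minimal, self-contained sketch of that last experiment, using complex steps (point counts) so both axes are generated with linspace semantics:
import numpy as np
from scipy.stats import multivariate_normal
from matplotlib import pyplot as plt

means = [1, 0]
cov_matrix = [[3.88809050e-03, 3.90737359e-06],
              [3.90737359e-06, 4.28819569e-09]]
a_lims = [0.7, 1.3]
b_lims = [-.0002, .0002]

dist = multivariate_normal(mean=means, cov=cov_matrix)
# 1001j means "1001 points between the limits", so mgrid behaves like linspace on each axis
a_plot, b_plot = np.mgrid[a_lims[0]:a_lims[1]:1001j, b_lims[0]:b_lims[1]:1001j]
pos = np.dstack((a_plot, b_plot))   # shape (1001, 1001, 2)
z = dist.pdf(pos)

plt.contourf(a_plot, b_plot, z, cmap='coolwarm', levels=100)
plt.show()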

matplotlib: grouping error bars for each x-axes tick

I am trying to use matplotlib to plot error bars, but I have slightly different requirements. The setup is as follows:
I have 3 different methods that I am comparing across 10 different parameter settings. So, on the y-axis I have the model fitting errors given by the 3 methods, and on the x-axis I have the different parameter settings.
So, for each parameter setting, I would like to get 3 error bar plots corresponding to the three methods. Ideally, I would like to plot the 95% confidence interval and also the minimum and maximum for each method at each parameter setting.
Some example data can be simulated as:
parameters = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
mean_1 = [10.1, 12.1, 13.6, 14.5, 18.8, 11.8, 28.5]
std_1 = [2.6, 5.7, 4.3, 8.5, 11.8, 5.3, 2.5]
mean_2 = [10.1, 12.1, 13.6, 14.5, 18.8, 11.8, 28.5]
std_2 = [2.6, 5.7, 4.3, 8.5, 11.8, 5.3, 2.5]
mean_3 = [10.1, 12.1, 13.6, 14.5, 18.8, 11.8, 28.5]
std_3 = [2.6, 5.7, 4.3, 8.5, 11.8, 5.3, 2.5]
I have kept the values the same, as it does not change anything from the plotting point of view. I have seen the matplotlib errorbar method, but I do not know how to extend it to multiple methods over a single x-axis value as in my case. Additionally, I am not sure how to add the [min, max] markers for each of the methods.
Taking your parameters list as the x axis, mean_1 as the y values and std_1 as the errors, you can plot an errorbar chart with
pylab.errorbar(parameters, mean_1, yerr=std_1, fmt='bo')
In case the error bars are not symmetric, i.e. you have lower_err and upper_err, the statement reads
pylab.errorbar(parameters, mean_1, yerr=[lower_err, upper_err], fmt='bo')
The same works with keyword xerr for errors in x direction, which is now hopefully self-explanatory.
To show several (in your case 3) different datasets, you can go the following way:
# import pylab and numpy
import numpy as np
import pylab as pl
# define datasets
parameters = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
mean_1 = [10.1, 12.1, 13.6, 14.5, 18.8, 11.8, 28.5]
std_1 = [2.6, 5.7, 4.3, 8.5, 11.8, 5.3, 2.5]
mean_2 = [10.1, 12.1, 13.6, 14.5, 18.8, 11.8, 28.5]
std_2 = [2.6, 5.7, 4.3, 8.5, 11.8, 5.3, 2.5]
mean_3 = [10.1, 12.1, 13.6, 14.5, 18.8, 11.8, 28.5]
std_3 = [2.6, 5.7, 4.3, 8.5, 11.8, 5.3, 2.5]
# here comes the plotting;
# to achieve a grouping, two things are extra here:
# 1. Don't use line plot but circular markers and different marker color
# 2. slightly displace the datasets in x direction to avoid overlap
# and create visual grouping
pl.errorbar(np.array(parameters)-0.01, mean_1, yerr=std_1, fmt='bo')
pl.errorbar(parameters, mean_2, yerr=std_2, fmt='go')
pl.errorbar(np.array(parameters)+0.01, mean_3, yerr=std_3, fmt='ro')
pl.show()
This is all about pylab.errorbar, where you have to give the errors explicitly. An alternative approach is pylab.boxplot, producing a boxplot for each model, but for that you would need the full distribution per model per parameter instead of just the mean and std.
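To also mark the minimum and maximum per method (the other part of the question), one option is a second errorbar call with asymmetric errors measured from the mean and drawn without markers; min_1 and max_1 below are hypothetical per-parameter extremes you would compute from your own runs:
import numpy as np
import pylab as pl

# hypothetical per-parameter minima and maxima for method 1
min_1 = np.array([ 6.0,  7.1,  8.2,  7.3,  9.4,  6.1, 24.8])
max_1 = np.array([14.2, 18.3, 19.0, 24.1, 31.2, 18.0, 31.5])

x1 = np.array(parameters) - 0.01
pl.errorbar(x1, mean_1, yerr=std_1, fmt='bo')          # mean +/- std, as above
pl.errorbar(x1, mean_1,
            yerr=[np.array(mean_1) - min_1,            # distance down to the minimum
                  max_1 - np.array(mean_1)],           # distance up to the maximum
            fmt='none', ecolor='b', capsize=3, elinewidth=0.5)
pl.show()
The same pattern, shifted in x, applies to the other two methods.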

Why is Scipy's ndimage.map_coordinates returning no values or wrong results for some arrays?

Code returning the correct value, but not always returning a value
In the following code, python is returning the correct interpolated value for arr_b but not for arr_a.
Even though I've been looking at this problem for about a day now, I really am not sure what's going on.
For some reason, for arr_a, twoD_interpolate keeps returning [0] no matter how I play around with the data and input.
How can I fix my code so it's actually interpolating over arr_a and returning the correct results?
import numpy as np
from scipy.ndimage import map_coordinates
def twoD_interpolate(arr, xmin, xmax, ymin, ymax, x1, y1):
    """
    interpolate in two dimensions with "hard edges"
    """
    ny, nx = arr.shape  # Note the order of ny and nx
    x1 = np.atleast_1d(x1)
    y1 = np.atleast_1d(y1)
    # Mask upper and lower boundaries using @Jamie's suggestion
    np.clip(x1, xmin, xmax, out=x1)
    np.clip(y1, ymin, ymax, out=y1)
    # Change coordinates to match your array.
    x1 = (x1 - xmin) * (xmax - xmin) / float(nx - 1)
    y1 = (y1 - ymin) * (ymax - ymin) / float(ny - 1)
    # order=1 is required to return your examples.
    return map_coordinates(arr, np.vstack((y1, x1)), order=1)
# test data
arr_a = np.array([[0.7, 1.7, 2.5, 2.8, 2.9],
[1.9, 2.9, 3.7, 4.0, 4.2],
[1.4, 2.0, 2.5, 2.7, 3.9],
[1.1, 1.3, 1.6, 1.9, 2.0],
[0.6, 0.9, 1.1, 1.3, 1.4],
[0.6, 0.7, 0.9, 1.1, 1.2],
[0.5, 0.7, 0.9, 0.9, 1.1],
[0.5, 0.6, 0.7, 0.7, 0.9],
[0.5, 0.6, 0.6, 0.6, 0.7]])
arr_b = np.array([[6.4, 5.60, 4.8, 4.15, 3.5, 2.85, 2.2],
[5.3, 4.50, 3.7, 3.05, 2.4, 1.75, 1.1],
[4.7, 3.85, 3.0, 2.35, 1.7, 1.05, 0.4],
[4.2, 3.40, 2.6, 1.95, 1.3, 0.65, 0.0]])
# Test the second array
print twoD_interpolate(arr_b, 0, 6, 9, 12, 4, 11)
# Test first area
print twoD_interpolate(
arr_a, 0, 500, 0, 2000, 0, 2000)
print arr_a[0]
print twoD_interpolate(
arr_a_60, 0, 500, 0, 2000, 0, 2000)[0]
print twoD_interpolate(
arr_a, 20, 100, 100, 1600, 902, 50)
print twoD_interpolate(
arr_a, 100, 1600, 20, 100, 902, 50)
print twoD_interpolate(
arr_a, 100, 1600, 20, 100, 50, 902)
## Output
[ 1.7]
[ 0.]
[ 0.7 1.7 2.5 2.8 2.9]
0.0
[ 0.]
[ 0.]
[ 0.]
Code returning incorrect value:
arr = np.array([[12.8, 20.0, 23.8, 26.2, 27.4, 28.6],
[10.0, 13.6, 15.8, 17.4, 18.2, 18.8],
[5.5, 7.7, 8.7, 9.5, 10.1, 10.3],
[3.3, 4.7, 5.1, 5.5, 5.7, 6.1]])
twoD_interpolate(arr, 0, 1, 1400, 3200, 0.5, 1684)
# above should return 21 but is returning 3.44
This is actually my fault in the original question.
If we examine the position it is trying to interpolate, twoD_interpolate(arr, 0, 1, 1400, 3200, 0.5, 1684), we get arr[170400, 0.1] as the position to look up, which gets clipped by mode='nearest' to arr[-1, 0.1]. (Note I switched x and y to give the positions as they would appear in the array.)
This corresponds to an interpolation between the values arr[-1, 0] = 3.3 and arr[-1, 1] = 4.7, so the result is 3.3 * 0.9 + 4.7 * 0.1 = 3.44.
The issue comes from the stride. If we take an array that goes from 50 to 250:
>>> a=np.arange(50,300,50)
>>> a
array([ 50, 100, 150, 200, 250])
>>> stride=float(a.max()-a.min())/(a.shape[0]-1)
>>> stride
50.0
>>> (75-a.min()) * stride
1250.0 #Not what we want!
>>> (75-a.min()) / stride
0.5 #There we go
>>> (175-a.min()) / stride
2.5 #Looks good
We can view this using map_coordinates:
#Input array `a` from the above.
print map_coordinates(a, np.array([[.5, 2.5, 1250]]), order=1, mode='nearest')
[ 75 175 250] #First two are correct, last is incorrect.
So what we really need is (x - xmin) / stride; in the previous examples the stride was 1, so it did not matter.
Here is what the code should be:
def twoD_interpolate(arr, xmin, xmax, ymin, ymax, x1, y1):
    """
    interpolate in two dimensions with "hard edges"
    """
    arr = np.atleast_2d(arr)
    ny, nx = arr.shape  # Note the order of ny and nx
    x1 = np.atleast_1d(x1)
    y1 = np.atleast_1d(y1)
    # Change coordinates to match your array.
    if nx == 1:
        x1 = np.zeros_like(x1)
    else:
        x_stride = (xmax - xmin) / float(nx - 1)
        x1 = (x1 - xmin) / x_stride
    if ny == 1:
        y1 = np.zeros_like(y1)
    else:
        y_stride = (ymax - ymin) / float(ny - 1)
        y1 = (y1 - ymin) / y_stride
    # order=1 is required to return your examples;
    # mode='nearest' removes the need for clipping.
    return map_coordinates(arr, np.vstack((y1, x1)), order=1, mode='nearest')
Note that clip is not required with mode='nearest'.
print twoD_interpolate(arr, 0, 1, 1400, 3200, 0.5, 1684)
[ 21.024]
print twoD_interpolate(arr, 0, 1, 1400, 3200, 0, 50000)
[ 3.3]
print twoD_interpolate(arr, 0, 1, 1400, 3200, .5, 50000)
[ 5.3]
Checks for arrays that are either 1D or pseudo-1D; it will interpolate along the x dimension only, unless the input array has the proper shape:
arr = np.arange(50,300,50)
print twoD_interpolate(arr, 50, 250, 0, 5, 75, 0)
[75]
arr = np.arange(50,300,50)[None,:]
print twoD_interpolate(arr, 50, 250, 0, 5, 75, 0)
[75]
arr = np.arange(50,300,50)
print twoD_interpolate(arr, 0, 5, 50, 250, 0, 75)
[50] #Still interpolates the `x` dimension.
arr = np.arange(50,300,50)[:,None]
print twoD_interpolate(arr, 0, 5, 50, 250, 0, 75)
[75]
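As an aside (not part of the original answer), the same clamped bilinear interpolation can also be written with scipy.interpolate.RegularGridInterpolator, which takes the real axis coordinates directly and avoids the manual stride bookkeeping. A rough sketch:
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def twoD_interpolate_rgi(arr, xmin, xmax, ymin, ymax, x1, y1):
    """Bilinear interpolation with "hard edges" via RegularGridInterpolator."""
    arr = np.atleast_2d(arr)
    ny, nx = arr.shape
    ys = np.linspace(ymin, ymax, ny)   # real y coordinates of the rows
    xs = np.linspace(xmin, xmax, nx)   # real x coordinates of the columns
    # Clip the query points so values outside the grid stick to the edges.
    x1 = np.clip(np.atleast_1d(x1), xmin, xmax)
    y1 = np.clip(np.atleast_1d(y1), ymin, ymax)
    interp = RegularGridInterpolator((ys, xs), arr, method='linear')
    return interp(np.column_stack((y1, x1)))

print(twoD_interpolate_rgi(arr, 0, 1, 1400, 3200, 0.5, 1684))   # -> [ 21.024]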
