Given x and y data, I'd like to fit a spline to the data and numerically integrate the resulting fit. Using UnivariateSpline, I get a nice linear fit for log10(y) vs x. I then integrate the resulting spline using UnivariateSpline.integral(a, b). My problem is that I'm not sure how to interpret the output, given that I am working in semi-log space.
y = np.array([1,10,100,1000])
x = np.array([15,16,17,18])
x_vals = np.linspace(0,50,1000)
plt.scatter(x,np.log10(y))
s = interpolate.UnivariateSpline(x,np.log10(y))
plt.plot(x_vals,s(x_vals))
print(s.integral(15,17))
Should I take 10^(s.integral(15,17)) to obtain the "true" value of the integral?
No: 10 raised to the integral of log10(y) is not the integral of y. Instead, you can numerically integrate a function of the interpolation, namely 10**s(x):
from scipy import interpolate, integrate
def antilog_s(x):
    return 10.0**s(x)
integrate.quad(antilog_s, 15, 17)
Out[16]: (42.99515370842196, 4.773420959438774e-13)
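As a sanity check on why exponentiating the integral does not work, here is a minimal sketch using the data from the question. It assumes an interpolating linear spline (k=1, s=0, which is my choice and not the question's default settings), so that the fit is exactly log10(y) = x - 15 and the result can be compared against the closed form 99/ln(10).

```python
import numpy as np
from scipy import integrate, interpolate

x = np.array([15, 16, 17, 18], dtype=float)
y = np.array([1, 10, 100, 1000], dtype=float)

# Assumption: k=1, s=0 gives an interpolating linear spline of log10(y).
s = interpolate.UnivariateSpline(x, np.log10(y), k=1, s=0)

def antilog_s(t):
    # Undo the log10 transform before integrating.
    return float(10.0 ** s(t))

true_integral, abs_err = integrate.quad(antilog_s, 15, 17)
naive = 10.0 ** s.integral(15, 17)   # exponentiates the area under log10(y)
closed_form = 99.0 / np.log(10.0)    # integral of 10**(x - 15) over [15, 17]

print(true_integral, closed_form, naive)
```

The first two numbers agree (about 42.995), while the naive value is 100, because the area under log10(y) on [15, 17] is 2.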
I have a cloud of data points (x,y) that I would like to interpolate and smooth.
Currently, I am using scipy :
from scipy.interpolate import interp1d
from scipy.signal import savgol_filter
spl = interp1d(Cloud[:,1], Cloud[:,0]) # interpolation
x = np.linspace(Cloud[:,1].min(), Cloud[:,1].max(), 1000)
smoothed = savgol_filter(spl(x), 21, 1) #smoothing
This works pretty well, except that I would like to assign weights to the data points passed to interp1d. Any suggestions for another function that handles this?
I thought I could simply repeat each point of the cloud according to its weight, but that is inefficient: it greatly increases the number of points to interpolate and slows the algorithm down.
The default interp1d uses linear interpolation, i.e., it simply computes a line between two points. A weighted interpolation does not make much sense mathematically in such a scenario: there is only one way in Euclidean space to draw a straight line between two points.
Depending on your goal, you can look into other methods of interpolation, e.g., B-splines. Then you can use scipy's scipy.interpolate.splrep and set the w argument:
w - Strictly positive rank-1 array of weights the same length as x and y. The weights are used in computing the weighted least-squares spline fit. If the errors in the y values have standard-deviation given by the vector d, then w should be 1/d. Default is ones(len(x)).
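A minimal sketch of a weighted smoothing spline with splrep follows. The data, the per-point standard deviations in `sigma`, and the choice s=len(x) are illustrative assumptions, not values from the question.

```python
import numpy as np
from scipy.interpolate import splrep, splev

rng = np.random.default_rng(0)

# Hypothetical data: a noisy sine with one unreliable measurement.
x = np.linspace(0.0, 10.0, 50)          # splrep expects x in increasing order
sigma = np.full_like(x, 0.1)            # assumed per-point standard deviations
y = np.sin(x) + rng.normal(0.0, sigma)
y[25] += 3.0                            # an outlier ...
sigma[25] = 5.0                         # ... that we declare to be very uncertain

w = 1.0 / sigma                         # weights of 1/d, as the docs suggest
tck = splrep(x, y, w=w, s=len(x))       # assumed smoothing target, about one unit per point

x_new = np.linspace(x.min(), x.max(), 200)
smoothed = splev(x_new, tck)            # evaluate the weighted spline on a fine grid
```

Because the fit is a least-squares spline rather than an interpolant, it smooths and weights in one step, so a separate savgol_filter pass may not be needed.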
Suppose I have a curve, and I estimate its gradient via finite differences using np.gradient. Given an initial point x[0] and the gradient vector, how can I reconstruct the original curve? Mathematically I see it's possible given this system of equations, but I'm not certain how to do it programmatically.
Here is a simple example of my problem, where I have sin(x) and I compute the numerical difference, which matches cos(x).
x = np.linspace(1, 30, 100)
test = np.sin(x)
numerical_grad = np.gradient(test, x)  # pass the sample points so the spacing matches x
analytical_grad = np.cos(x)
## Plot data.
ax.plot(test, label='data', marker='o')
ax.plot(numerical_grad, label='gradient')
ax.plot(analytical_grad, label='proof', alpha=0.5)
ax.legend();
I found out how to do it, using numpy's trapz function (trapezoidal rule integration; NumPy 2.0 renamed it to np.trapezoid).
Following up on the code I presented on the question, to reproduce the input array test, we do:
x = np.linspace(1, 30, 100)
integral = list()
for t in range(len(x)):
    integral.append(test[0] + np.trapz(numerical_grad[:t+1], x[:t+1]))
The integral array then contains the results of the numerical integration.
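The loop recomputes the whole integral for every prefix, which is quadratic in the number of samples. A minimal vectorized sketch of the same trapezoidal reconstruction is shown here, assuming the gradient was computed on the same x grid. The names `areas`, `reconstructed`, and `max_err` are mine.

```python
import numpy as np

x = np.linspace(1, 30, 100)
test = np.sin(x)
numerical_grad = np.gradient(test, x)

# Area of each trapezoid between consecutive samples of the gradient.
areas = 0.5 * (numerical_grad[1:] + numerical_grad[:-1]) * np.diff(x)

# Running sum of the areas, shifted by the known starting value test[0].
reconstructed = test[0] + np.concatenate(([0.0], np.cumsum(areas)))

max_err = np.max(np.abs(reconstructed - test))
print(max_err)
```

The reconstruction is approximate rather than exact, since both the finite-difference gradient and the trapezoidal rule carry discretization error that shrinks as the grid gets finer.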
You can restore the initial curve using integration.
As a real-life example: if you have a function for position in 1D motion, you get the velocity function as its derivative (gradient)
v(t) = s'(t) = ds/dt
Given the velocity, you can recover the position by integration, up to an unknown additive constant (shift). Not all functions can be integrated analytically, in which case numerical integration is used. Knowing the initial position fixes the constant and restores the exact values
s(T) = Integral[from 0 to T](v(t)dt) + s(0)
I have data as a 2D array, and I used gaussian_kde to estimate the data distribution. Now I want to get the first derivative of the resulting density estimate in order to find its zero crossings. Is it possible to get it from the estimated density? If so, is there any built-in function in Python that can help?
Following the example in the documentation of gaussian_kde, once you have Z (or, more generally, the estimated density evaluated on a grid along the x axis), you can calculate its derivatives using standard numpy functions:
diff = np.gradient(Z)
Note that np.gradient computes central differences in the interior and one-sided differences at the two ends. If you would like forward differences, you could do something like:
diff = np.r_[Z[1:] - Z[:-1], 0]
To find the zero-crossings you can do:
sdiff = np.sign(diff)
zc = np.where(sdiff[:-1] != sdiff[1:])
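You can extend the above to 2D: for a 2D array Z, np.gradient(Z) returns one array per axis, in axis order (the derivative along axis 0 first, then along axis 1). Which of these corresponds to x and which to y depends on how the grid was built, so check before naming them dy, dx. Then look for sign changes along each direction.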
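For a self-contained picture of the 1D case, here is a minimal sketch that estimates a density, differentiates it on a grid, and locates the sign changes of the derivative. The bimodal sample and the grid size are hypothetical stand-ins for the real data.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Hypothetical bimodal sample standing in for the real data.
sample = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 0.5, 300)])

kde = gaussian_kde(sample)
grid = np.linspace(sample.min(), sample.max(), 500)
density = kde(grid)

# First derivative of the estimated density with respect to the grid coordinate.
d_density = np.gradient(density, grid)

# Indices where the derivative changes sign between neighbouring grid points.
sign = np.sign(d_density)
zc = np.where(sign[:-1] != sign[1:])[0]
extrema_x = grid[zc]   # approximate locations of the density's maxima and minima
print(extrema_x)
```

Passing `grid` to np.gradient gives the derivative in data units instead of per grid index. The sign changes are the same either way, but the magnitudes differ.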
I have a plot that is logarithmic on both axes. I use pyplot's loglog function to do this, which gives me a logarithmic scale on both axes.
Now, using numpy I fit a straight line to the set of points that I have. However, when I plot this line on the plot, I cannot get a straight line. I get a curved line.
The blue line is the supposedly "straight line". It is not getting plotted straight. I want to fit this straight line to the curve plotted by red dots
Here is the code I am using to plot the points:
import numpy
from matplotlib import pyplot as plt
import math
fp=open("word-rank.txt","r")
a=[]
b=[]
for line in fp:
    string=line.strip().split()
    a.append(float(string[0]))
    b.append(float(string[1]))
coefficients=numpy.polyfit(b,a,1)
polynomial=numpy.poly1d(coefficients)
ys=polynomial(b)
print(polynomial)
plt.loglog(b,a,'ro')
plt.plot(b,ys)
plt.xlabel("Log (Rank of frequency)")
plt.ylabel("Log (Frequency)")
plt.title("Frequency vs frequency rank for words")
plt.show()
To better understand this problem, let's first talk about plain ol' linear regression (the polyfit function, in this case, is your linear regression algorithm).
Suppose you have a set of data points (x,y), shown below:
You want to create a model that predicts y as a function of x, so you use linear regression. That uses the model:
y = mx + b
and computes the values of m and b that best predict your data, using some linear algebra.
Next, you use your model to predict values of y as a function of x. You do this by picking a set of values for x (think linspace) and computing the corresponding values of y. Plotting these (x,y) pairs gives you your regression line.
Now, let's talk about logarithmic regression. In this case, we still have two variables, y versus x, and we are still interested in their relationship, i.e., being able to predict y given x. The only difference is, now y and x happen to be logarithms of two other variables, which I'll call log(F) and log(R). Thus far, this is nothing more than a simple change of name.
The linear regression also works the same way. You're still regressing y versus x. The linear regression algorithm doesn't care that y and x are actually log(F) and log(R) - it makes no difference to the algorithm.
The last step is a little bit different - and this is where you're getting tripped up in your plot above. What you're doing is computing
F = m R + b
but this is incorrect, because the relationship between F and R is not linear. (That's why you're using a log-log plot.)
Instead, you should compute
log(F) = m log(R) + b
If you transform this (raise 10 to the power of both sides and rearrange), you get
F = c R^m
where c = 10^b. This is the relationship between F and R: it is a power law relationship. (Power law relationships are what log-log plots are best at.)
In your code, you're passing b and a to polyfit, but you should be passing log10(b) and log10(a), and then plotting 10 raised to the fitted values against b.
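The steps in this answer can be sketched as follows. The rank/frequency arrays `R` and `F` are hypothetical and follow an exact power law, so the recovered parameters are easy to check. Real word counts would only follow it approximately.

```python
import numpy as np

# Hypothetical rank/frequency data following F = 1000 * R**(-1.2) exactly.
R = np.arange(1, 51, dtype=float)
F = 1000.0 * R ** -1.2

# Fit a straight line in log-log space: log10(F) = m * log10(R) + b.
m, b = np.polyfit(np.log10(R), np.log10(F), 1)
c = 10.0 ** b

# Transform the prediction back to the original scale: F = c * R**m.
F_fit = c * R ** m
print(m, c)
```

Plotting `R` against `F_fit` on the same loglog axes as the data then gives a straight line, because the curve is a power law in the original variables.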
Your linear fit is not performed on the same data as shown in the loglog-plot.
Make a and b numpy arrays like this
a = numpy.asarray(a, dtype=float)
b = numpy.asarray(b, dtype=float)
Now you can perform operations on them. What the loglog-plot does, is to take the logarithm to base 10 of both a and b. You can do the same by
logA = numpy.log10(a)
logB = numpy.log10(b)
This is what the loglog plot visualizes. Check this by plotting logA against logB as a regular plot. Then repeat the linear fit on the log data and plot your line in the same plot as the logA, logB data.
coefficients = numpy.polyfit(logB, logA, 1)
polynomial = numpy.poly1d(coefficients)
ys = polynomial(logB)  # evaluate the fit on the log data it was fitted to
plt.plot(logB, logA)
plt.plot(logB, ys)
The other answers offer great explanations and a solution. However, I would like to propose a solution that helped me a lot and may help you as well.
Another simple way of writing a line fit for log-log scale is the function powerfit in the code below. It takes in the original x and y data and by using a number of new x-points you can get a straight line on log-log scale. In the current case the values xnew are the same as x (which are both b).
The advantage of defining new x-coordinates is that you can get as few or as many points of the powerfitted line for whatever purpose you might need them.
import numpy as np
from matplotlib import pyplot as plt
import math
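def powerfit(x, y, xnew):
    """line fitting on log-log scale"""
    k, m = np.polyfit(np.log(x), np.log(y), 1)
    # convert xnew to an array so that plain Python lists can be raised to a power
    return np.exp(m) * np.asarray(xnew, dtype=float)**k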
fp=open("word-rank.txt","r")
a=[]
b=[]
for line in fp:
    string=line.strip().split()
    a.append(float(string[0]))
    b.append(float(string[1]))
ys = powerfit(b, a, b)
plt.loglog(b,a,'ro')
plt.plot(b,ys)
plt.xlabel("Log (Rank of frequency)")
plt.ylabel("Log (Frequency)")
plt.title("Frequency vs frequency rank for words")
plt.show()
I have done some work in Python, but I'm new to scipy. I'm trying to use the methods from the interpolate library to come up with a function that will approximate a set of data.
I've looked up some examples to get started, and could get the sample code below working in Python(x,y):
import numpy as np
from scipy.interpolate import interp1d, Rbf
import pylab as P
# show the plot (empty for now)
P.clf()
P.show()
# generate random input data
original_data = np.linspace(0, 1, 10)
# random noise to be added to the data
noise = (np.random.random(10)*2 - 1) * 1e-1
# calculate f(x)=sin(2*PI*x)+noise
f_original_data = np.sin(2 * np.pi * original_data) + noise
# create interpolator
rbf_interp = Rbf(original_data, f_original_data, function='gaussian')
# Create new sample data (for input), calculate f(x)
#using different interpolation methods
new_sample_data = np.linspace(0, 1, 50)
rbf_new_sample_data = rbf_interp(new_sample_data)
# draw all results to compare
P.plot(original_data, f_original_data, 'o', ms=6, label='f_original_data')
P.plot(new_sample_data, rbf_new_sample_data, label='Rbf interp')
P.legend()
The plot is displayed as follows:
Now, is there any way to get a polynomial expression representing the interpolated function created by Rbf (i.e. the method created as rbf_interp)?
Or, if this is not possible with Rbf, any suggestions using a different interpolation method, another library, or even a different tool are also welcome.
Rbf uses whatever basis function you ask for. It is a global model, so yes, there is a closed-form result, but you will probably not like it, since it is a sum over many Gaussians. You have:
rbf.nodes # the factors for each of the RBF (probably gaussians)
rbf.xi # the centers.
rbf.epsilon # the width of the gaussian, but remember that the Norm plays a role too
So with these you can calculate the distances from a new point x to the centers in rbf.xi, then plug those distances, together with the factors in rbf.nodes and rbf.epsilon, into the Gaussian (or whatever function you asked it to use). You can check the Python code of __call__ and _call_norm.
So for 1D input you get something like sum(rbf.nodes[i] * gaussian(rbf.epsilon, sqrt((x - center)**2)) for i, center in enumerate(rbf.xi[0])), to give some funny half code/formula. The RBF functions are written out in the documentation, but you can also check the Python code.
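To make the half-formula concrete, here is a minimal sketch that rebuilds the interpolant by hand for 1D input and the 'gaussian' basis. It assumes the Gaussian form given in the Rbf documentation, exp(-(r/epsilon)**2), and the helper name `rbf_by_hand` is mine.

```python
import numpy as np
from scipy.interpolate import Rbf

x = np.linspace(0, 1, 10)
f = np.sin(2 * np.pi * x)

rbf = Rbf(x, f, function='gaussian')

def rbf_by_hand(x_new):
    # Distances between every evaluation point and every centre (1-D case).
    centres = rbf.xi[0]
    r = np.abs(np.asarray(x_new, dtype=float)[:, None] - centres[None, :])
    # Gaussian basis as documented for Rbf: exp(-(r / epsilon)**2).
    basis = np.exp(-(r / rbf.epsilon) ** 2)
    # Weighted sum of the basis functions, one weight per centre.
    return basis @ rbf.nodes

x_new = np.linspace(0, 1, 50)
by_hand = rbf_by_hand(x_new)
print(np.max(np.abs(by_hand - rbf(x_new))))
```

The explicit formula therefore has one term per data point, which is why it does not collapse into a short polynomial.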
The answer is no, there is no "nice" way to write down the formula, or at least not in a short way. Some types of interpolations, like RBF and Loess, do not directly search for a parametric mathematical function to fit to the data and instead they calculate the value of each new data point separately as a function of the other points.
These interpolations are guaranteed to always give a good fit for your data (such as in your case), and the reason for this is that to describe them you need a very large number of parameters (basically all your data points). Think of it this way: you could interpolate linearly by connecting consecutive data points with straight lines. You could fit any data this way and then describe the function in a mathematical form, but it would take a large number of parameters (at least as many as the number of points). Actually what you are doing right now is pretty much a smoothed version of that.
If you want the formula to be short, this means you want to describe the data with a mathematical function that does not have many parameters (specifically the number of parameters should be much lower than the number of data points). Such examples are logistic functions, polynomial functions and even the sine function (that you used to generate the data). Obviously, if you know which function generated the data that will be the function you want to fit.
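As an illustration of fitting a short parametric formula, here is a minimal sketch using scipy.optimize.curve_fit on data like that in the question. The model form (a sine with free amplitude and phase) and the starting guess `p0` are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.uniform(-0.1, 0.1, x.size)

def model(x, amplitude, phase):
    # Assumed functional form; only two parameters describe all ten points.
    return amplitude * np.sin(2 * np.pi * x + phase)

# Least-squares estimate of the two parameters, starting from a rough guess.
(amplitude, phase), _ = curve_fit(model, x, y, p0=[1.0, 0.0])
print(amplitude, phase)
```

The fitted expression is then simply amplitude * sin(2*pi*x + phase), with the two numbers printed above.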
RBF likely stands for Radial Basis Function. I wouldn't be surprised if scipy.interpolate.Rbf was the function you're looking for.
However, I doubt you'll be able to find a polynomial expression to represent your result.
If you want to try different interpolation methods, check the corresponding SciPy documentation, which links to RBF, splines, and others.
I don't think SciPy's RBF will give you the actual function. But one thing you could do is sample the function that SciPy's RBF gave you (e.g., at 100 points), then use Lagrange interpolation with those points. This will generate a polynomial function for you. Here is an example of how this would look. If you do not want to use Lagrange interpolation, you can also use Newton's divided differences method to generate a polynomial function.
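A minimal sketch of the Lagrange idea using scipy.interpolate.lagrange is shown below. It samples sin(2*pi*x) directly as a hypothetical stand-in for the Rbf output, and it uses only 8 points because the SciPy documentation warns that this routine is numerically unstable beyond roughly 20 points.

```python
import numpy as np
from scipy.interpolate import lagrange

# Hypothetical stand-in for "sample the interpolant": a few points of sin(2*pi*x).
xs = np.linspace(0, 1, 8)
ys = np.sin(2 * np.pi * xs)

poly = lagrange(xs, ys)             # numpy.poly1d of degree at most len(xs) - 1
coefficients = poly.coefficients    # polynomial coefficients, highest power first
print(poly)
```

The resulting polynomial passes through the sampled points but may oscillate between them, so it is worth comparing it against the original interpolant on a finer grid.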
My answer is based on numpy only :
import matplotlib.pyplot as plt
import numpy as np
x_data = [324, 531, 806, 1152, 1576, 2081, 2672, 3285, 3979, 4736]
y_data = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
x = np.array(x_data)
y = np.array(y_data)
model = np.poly1d(np.polyfit(x, y, 2))
ynew = model(x)
plt.plot(x, y, 'o', x, ynew, '-')
plt.ylabel(str(model).strip())
plt.show()