Fitting a straight line to a log-log curve in matplotlib - python

I have a plot that is logarithmic on both axes. I use pyplot's loglog function to draw it, which also gives me the logarithmic scale on both axes.
Using numpy, I fit a straight line to my set of points. However, when I draw this line on the plot, I get a curved line rather than a straight one.
The blue line is the supposedly "straight line", but it is not plotted straight. I want to fit a straight line to the curve traced by the red dots.
Here is the code I am using to plot the points:
import numpy
from matplotlib import pyplot as plt
import math
fp=open("word-rank.txt","r")
a=[]
b=[]
for line in fp:
    string=line.strip().split()
    a.append(float(string[0]))
    b.append(float(string[1]))
coefficients=numpy.polyfit(b,a,1)
polynomial=numpy.poly1d(coefficients)
ys=polynomial(b)
print(polynomial)
plt.loglog(b,a,'ro')
plt.plot(b,ys)
plt.xlabel("Log (Rank of frequency)")
plt.ylabel("Log (Frequency)")
plt.title("Frequency vs frequency rank for words")
plt.show()

To better understand this problem, let's first talk about plain ol' linear regression (the polyfit function, in this case, is your linear regression algorithm).
Suppose you have a set of data points (x,y), shown below:
You want to create a model that predicts y as a function of x, so you use linear regression. That uses the model:
y = mx + b
and computes the values of m and b that best predict your data, using some linear algebra.
Next, you use your model to predict values of y as a function of x. You do this by picking a set of values for x (think linspace) and computing the corresponding values of y. Plotting these (x,y) pairs gives you your regression line.
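Here is a minimal sketch of those two steps. The arrays x and y are made-up data lying roughly on y = 2x + 1, not the asker's data:

```python
import numpy as np

# Hypothetical data lying roughly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0 + np.array([0.1, -0.1, 0.05, -0.05, 0.1, -0.1])

# Step 1: least-squares fit of the model y = m*x + b
m, b = np.polyfit(x, y, 1)

# Step 2: predict y on a grid of x values to get the regression line
x_line = np.linspace(x.min(), x.max(), 50)
y_line = m * x_line + b
```

Plotting (x_line, y_line) over the data points then shows the regression line.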
Now, let's talk about logarithmic regression. In this case, we still have two variables, y versus x, and we are still interested in their relationship, i.e., being able to predict y given x. The only difference is, now y and x happen to be logarithms of two other variables, which I'll call log(F) and log(R). Thus far, this is nothing more than a simple change of name.
The linear regression also works the same way. You're still regressing y versus x. The linear regression algorithm doesn't care that y and x are actually log(F) and log(R) - it makes no difference to the algorithm.
The last step is a little bit different - and this is where you're getting tripped up in your plot above. What you're doing is computing
F = m R + b
but this is incorrect, because the relationship between F and R is not linear. (That's why you're using a log-log plot.)
Instead, you should compute
log(F) = m log(R) + b
If you transform this (raise 10 to the power of both sides and rearrange), you get
F = c R^m
where c = 10^b. This is the relationship between F and R: it is a power law relationship. (Power law relationships are what log-log plots are best at.)
In your code, you're passing a and b to polyfit, but you should be passing log(a) and log(b).
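A short sketch of the whole procedure, assuming synthetic rank and frequency arrays (R and F here are invented so that the true power law is known; they stand in for the columns read from the asker's file):

```python
import numpy as np

# Hypothetical rank/frequency data following F = 1000 * R**-1 exactly
R = np.arange(1.0, 101.0)
F = 1000.0 * R ** -1.0

# Fit a straight line to the logarithms, not to the raw values
m, b = np.polyfit(np.log10(R), np.log10(F), 1)

# Transform back: F = c * R**m with c = 10**b
c = 10.0 ** b
F_fit = c * R ** m

# On log-log axes the pairs (R, F_fit) fall on a straight line, e.g.
# plt.loglog(R, F, 'ro'); plt.loglog(R, F_fit)
```

The fitted values are computed from the power law, so the line looks straight only on logarithmic axes.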

Your linear fit is not performed on the same data as shown in the loglog-plot.
Make a and b numpy arrays like this
a = numpy.asarray(a, dtype=float)
b = numpy.asarray(b, dtype=float)
Now you can perform operations on them. What the loglog-plot does, is to take the logarithm to base 10 of both a and b. You can do the same by
logA = numpy.log10(a)
logB = numpy.log10(b)
This is what the loglog plot visualizes. Check this by plotting logA against logB as a regular plot. Then repeat the linear fit on the log data and draw the line in the same plot as the logA, logB data.
coefficients = numpy.polyfit(logB, logA, 1)
polynomial = numpy.poly1d(coefficients)
ys = polynomial(logB)
plt.plot(logB, logA)
plt.plot(logB, ys)

The other answers offer great explanations and a solution. However, I would like to propose a solution that helped me a lot and may help you as well.
Another simple way of writing a line fit for log-log scale is the function powerfit in the code below. It takes in the original x and y data and by using a number of new x-points you can get a straight line on log-log scale. In the current case the values xnew are the same as x (which are both b).
The advantage of defining new x-coordinates is that you can get as few or as many points of the powerfitted line for whatever purpose you might need them.
import numpy as np
from matplotlib import pyplot as plt
import math
def powerfit(x, y, xnew):
    """Line fitting on log-log scale."""
    k, m = np.polyfit(np.log(x), np.log(y), 1)
    # asarray lets xnew be a plain list as well as an array
    return np.exp(m) * np.asarray(xnew, dtype=float)**k
fp=open("word-rank.txt","r")
a=[]
b=[]
for line in fp:
    string=line.strip().split()
    a.append(float(string[0]))
    b.append(float(string[1]))
ys = powerfit(b, a, b)
plt.loglog(b,a,'ro')
plt.plot(b,ys)
plt.xlabel("Log (Rank of frequency)")
plt.ylabel("Log (Frequency)")
plt.title("Frequency vs frequency rank for words")
plt.show()

Related

Least-square spline interpolation forcing interpolant to pass through specific points

I am having trouble implementing a somewhat unusual interpolation problem. I have some (x,y) data points scattered along a curve that I don't know a priori, and I want to reconstruct this curve as well as I can, fitting my points with minimum square error. I thought of using scipy.interpolate.splrep for this purpose (but maybe there are better options you would advise). The additional difficulty in my case is that I want to constrain the spline curve to pass through some specific points of my original data. I assume that playing with knots and weights could do the trick, but I don't know how (I have been putting off learning spline interpolation theory beyond basic fitting procedures). Also, for reasons I don't understand, when I try to set up knots in my splrep call I get the same error as in this post, which complicates things further. The following is my sample code:
from __future__ import division
import numpy as np
import scipy.interpolate as spi
import matplotlib.pylab as plt
# Some surrogate sample data
f = lambda x : x**2 - x/2.
x = np.arange(0.,20.,0.1)
y = f(4*(x + np.random.normal(size=np.size(x))))
# I want to use spline interpolation with least-square fitting criterion, making sure though that the spline starts
# from the origin (or in general passes through a precise point of my dataset).
# In my case for example I would like the spline to originate from the point in x=0. So I attempted to include as first knot x=0...
# but it won't work, nor I am sure this is the right procedure...
fy = spi.splrep(x,y)
fy = spi.splrep(x,y,t=fy[0])
yy = spi.splev(x,fy)
plt.plot(x,y,'-',x,yy,'--')
plt.show()
Even though I am passing the knots computed by a first call to splrep, this gives me:
File "/usr/lib64/python2.7/site-packages/scipy/interpolate/fitpack.py", line 289, in splrep
res = _impl.splrep(x, y, w, xb, xe, k, task, s, t, full_output, per, quiet)
File "/usr/lib64/python2.7/site-packages/scipy/interpolate/_fitpack_impl.py", line 515, in splrep
raise _iermess[ier][1](_iermess[ier][0])
ValueError: Error on input data
Use the weights argument of splrep: you can give the points you need fixed very large weights. This is certainly a workaround, so keep an eye on the fit quality and stability.
Setting high weights for specific points is indeed a working solution, as suggested by @ev-br. In addition, because there is no direct way to match derivatives at the ends of the curve, the same rationale can be applied there as well. Say you want the derivative at y[0] and y[-1] to match the derivative of your data points; then you also add large weights for y[1] and y[-2], i.e.
weights = np.ones(len(x))
weights[[0,-1]] = 100 # Promote spline interpolant through first and last point
weights[[1,-2]] = 50 # Make spline interpolant derivative tend to derivatives at first/last point
fy = spi.splrep(x,y,w=weights,s=0.1)
yy = spi.splev(x,fy)
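To see what the weights actually buy you, here is a small self-contained check. The data, the smoothing factor s and the weight of 1000 are all arbitrary choices for illustration, not values from the question:

```python
import numpy as np
from scipy import interpolate as spi

# Hypothetical data: a smooth curve plus a deterministic wiggle standing in for noise
x = np.linspace(0.0, 10.0, 50)
y = np.sin(x) + 0.2 * np.cos(15.0 * x)

s = 0.5  # smoothing factor, chosen arbitrarily for this sketch

# Ordinary smoothing spline
plain = spi.splrep(x, y, s=s)

# Same fit, but with a very large weight on the first point
w = np.ones_like(x)
w[0] = 1000.0
pinned = spi.splrep(x, y, w=w, s=s)

# Distance between each spline and the first data point
res_plain = abs(spi.splev(x[0], plain) - y[0])
res_pinned = abs(spi.splev(x[0], pinned) - y[0])
```

The weighted spline is only pulled towards the point; it is not an exact constraint, so it is worth printing res_pinned and checking that it is small enough for your purpose. As for the ValueError in the question: as far as I can tell from the splrep documentation, t is expected to hold only the interior knots, whereas fy[0] also contains the repeated boundary knots, which may be what triggers the error.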

How to improve the performance when 2d interpolating/smoothing lines using scipy?

I have a moderate-size data set, namely 20000 x 2 floats in a two-column matrix. The first column is the x column, which represents the distance to the original point along a trajectory; the other is the y column, which represents the work done on the object. This data set is obtained from lab operations, so it's fairly arbitrary. I've already turned this structure into a numpy array. I want to plot y vs x in a figure with a smooth curve, so I hoped the following code would help me:
x_smooth = np.linspace(x.min(),x.max(), 20000)
y_smooth = spline(x, y, x_smooth)
plt.plot(x_smooth, y_smooth)
plt.show()
However, when my program executes the line y_smooth = spline(x,y,x_smooth), it takes a very long time, say 10 minutes, and sometimes it even exhausts my memory so that I have to restart my machine. I tried reducing the chunk number to 200 and 2000, and neither works. Then I checked the official scipy reference for scipy.interpolate.spline. It says that spline is deprecated in v0.19, but I'm not using the new version. If spline has been deprecated for a while, how do I use the equivalent BSpline now? If spline still works, what causes the slow performance?
One portion of my data could look like this:
13.202 0.0
13.234738 -0.051354643759
12.999116 0.144464320836
12.86252 0.07396528119
13.1157 0.10019738758
13.357109 -0.30288563381
13.234004 -0.045792536285
12.836279 0.0362257166275
12.851597 0.0542649286915
13.110691 0.105297378401
13.220619 -0.0182963209185
13.092143 0.116647353635
12.545676 -0.641112204849
12.728248 -0.147460703493
12.874176 0.0755861585235
12.746764 -0.111583725833
13.024995 0.148079528382
13.106033 0.119481137144
13.327233 -0.197666132456
13.142423 0.0901867159545
Several issues here. First and foremost, spline fitting you're trying to use is global. This means that you're solving a system of linear equations of the size 20000 at the construction time (evaluations are weakly sensitive to the dataset size though). This explains why the spline construction is slow.
scipy.interpolate.spline, furthermore, does linear algebra with full matrices --- hence memory consumption. This is precisely why it's deprecated from scipy 0.19.0 on.
The recommended replacement, available in scipy 0.19.0, is the BSpline/ make_interp_spline combo:
>>> from scipy.interpolate import make_interp_spline
>>> spl = make_interp_spline(x, y, k=3) # returns a BSpline object
>>> y_new = spl(x_new) # evaluate
Notice it is not BSpline(x, y, k): BSpline objects do not know anything about the data or fitting or interpolation.
If you are using older scipy versions, your options are:
CubicSpline(x, y) for cubic splines
splrep(x, y, s=0) / splev combo.
However, you may want to think if you really need twice continuously differentiable functions. If only once differentiable functions are smooth enough for your purposes, then you can use local spline interpolations, e.g. Akima1DInterpolator or PchipInterpolator:
In [1]: import numpy as np
In [2]: from scipy.interpolate import pchip, splmake
In [3]: x = np.arange(1000)
In [4]: y = x**2
In [5]: %timeit pchip(x, y)
10 loops, best of 3: 58.9 ms per loop
In [6]: %timeit splmake(x, y)
1 loop, best of 3: 5.01 s per loop
Here splmake is what spline uses under the hood, and it's also deprecated.
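For reference, a minimal sketch of the local alternative, assuming a handful of made-up, unsorted measurements (the arrays below are hypothetical, not the asker's data):

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

# Hypothetical unsorted measurements
x = np.array([3.0, 0.0, 2.0, 1.0, 4.0])
y = np.array([9.0, 0.0, 4.0, 1.0, 16.0])

# PchipInterpolator needs strictly increasing x, so sort first
order = np.argsort(x)
x_sorted, y_sorted = x[order], y[order]

# Local, once continuously differentiable interpolant
interp = PchipInterpolator(x_sorted, y_sorted)

x_smooth = np.linspace(x_sorted[0], x_sorted[-1], 200)
y_smooth = interp(x_smooth)
```

Because each piece depends only on nearby points, construction does not require solving one large linear system over the whole data set.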
Most interpolation methods in SciPy are function-generating, i.e. they return a function which you can then evaluate on your x data. For example, using CubicSpline, which connects all points with a piecewise cubic spline, would be
from scipy.interpolate import CubicSpline
spline = CubicSpline(x, y)
y_smooth = spline(x_smooth)
Based on your description I think that you correctly want to use BSpline. To do so, follow the pattern above, i.e.
from scipy.interpolate import make_interp_spline
order = 2 # spline degree
spline = make_interp_spline(x, y, k=order) # returns a BSpline object
y_smooth = spline(x_smooth)
With this much data, it is probably quite noisy. Keep in mind that the order only sets the polynomial degree of each piece; an interpolating spline still passes through every point, so a higher order will not smooth the noise away.
In both cases, the data points x and y must be sorted by x, and x must not contain repeated values. This is 1D interpolation, since x_smooth is the only input. You can sort with np.argsort. In short:
from scipy.interpolate import make_interp_spline
sort_idx = np.argsort(x)
x_sorted = x[sort_idx]
y_sorted = y[sort_idx]
order = 3 # spline degree
spline = make_interp_spline(x_sorted, y_sorted, k=order) # returns a BSpline object
y_smooth = spline(x_smooth)
plt.plot(x_sorted, y_sorted, '.')
plt.plot(x_smooth, y_smooth, '-')
plt.show()
My problem can be generalized to how to smoothly plot 2D graphs when the data points are in random order. Since you are only dealing with two columns of data, if you sort your data by the independent variable, your data points will at least be connected in order, which is how matplotlib connects your data points.
@Dawid Laszuk has provided one way to sort data by the independent variable; here is mine:
plotting_columns = []
for i in range(len(x)):
    plotting_columns.append(np.array([x[i],y[i]]))
plotting_columns.sort(key=lambda pair : pair[0])
plotting_columns = np.array(plotting_columns)
A traditional sort() with a key function does the sorting job efficiently here.
But that is just the first step. The following steps are not hard either. To smooth your graph, you also want to keep your independent variable in ascending order with a constant step, so
x_smooth = np.linspace(x.min(), x.max(), num_steps)
is enough to do the job. Usually, if you have plenty of data points, for example more than 10000 (where correctness and accuracy cannot be verified by eye), you just want to plot the significant points to display the trend, so smoothing only x is enough and you can simply call plt.plot(x_smooth,y).
You will notice that x_smooth will generate many x values that have no corresponding y value. When you want to maintain correctness, you need to use line-fitting functions. As @ev-br demonstrated in his answer, spline functions are inherently expensive. Therefore you might want a simpler trick. I smoothed my graph without using those functions, in a few simple steps.
First, round your values so that your data will not vary too much in small intervals. (You can skip this step.)
You can change one line when constructing plotting_columns:
plotting_columns.append(np.around(np.array([x[i], y[i]]), decimals=4))
After doing this, you can filter out the points that you don't want to plot by keeping only the points close to the x_smooth values:
new_plots = []
for i in range(len(x_smooth)):
    # error is a tolerance that you choose
    if plotting_columns[:,0][i] >= x_smooth[i] - error and plotting_columns[:,0][i] < x_smooth[i] + error:
        new_plots.append(plotting_columns[i])
    else:
        continue # drop all points outside the interval
This is how I solved my problems.

Numerical integration of spline function in semi-log space

Given x and y data, I'd like to fit a spline to the data and numerically integrate the resulting fit. Using UnivariateSpline, I get a nice linear fit for log10(y) vs x. I then integrate the resulting spline using UnivariateSpline.integral(bounds). My problem is that I'm not sure how to interpret the output, given that I am working in semi-log space.
y = np.array([1,10,100,1000])
x = np.array([15,16,17,18])
x_vals = np.linspace(0,50,1000)
plt.scatter(x,np.log10(y))
s = interpolate.UnivariateSpline(x,np.log10(y))
plt.plot(x_vals,s(x_vals))
print(s.integral(15,17))
Should I take 10^(s.integral(15,17)) to obtain the "true" value of the integral?
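No: s.integral(15,17) is the integral of log10(y), and raising 10 to that power does not give the integral of y. Instead, you can numerically integrate the antilog of the interpolant: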
from scipy import interpolate, integrate
def antilog_s(x):
    return 10.0**s(x)
integrate.quad(antilog_s, 15, 17)
Out[16]: (42.99515370842196, 4.773420959438774e-13)
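As a sanity check, the data in the question satisfy log10(y) = x - 15 exactly, so the integral of y from 15 to 17 has the closed form (10^2 - 1)/ln(10). A self-contained sketch comparing the three quantities (the variable names log_area, area and exact are mine):

```python
import numpy as np
from scipy import integrate, interpolate

x = np.array([15.0, 16.0, 17.0, 18.0])
y = np.array([1.0, 10.0, 100.0, 1000.0])

# Spline through log10(y); for this data it is the line log10(y) = x - 15
s = interpolate.UnivariateSpline(x, np.log10(y))

# Integral of log10(y), i.e. of (x - 15): this is NOT the integral of y
log_area = s.integral(15, 17)

# Integral of y itself: integrate 10**s(x) numerically
area, abs_err = integrate.quad(lambda t: float(10.0 ** s(t)), 15, 17)

# Closed form for this particular data set
exact = 99.0 / np.log(10.0)
```

Here log_area comes out as 2, and 10**2 = 100 is far from the roughly 42.995 that quad and the closed form agree on.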

Extrapolating data from a curve using Python

I am trying to extrapolate future data points from a data set that contains one continuous value per day for almost 600 days. I am currently fitting a 1st order function to the data using numpy.polyfit and numpy.poly1d. In the graph below you can see the curve (blue) and the 1st order function (green). The x-axis is days since the beginning. I am looking for an effective way to model this curve in Python in order to extrapolate future data points as accurately as possible. A linear regression isn't accurate enough, and I'm unaware of any methods of nonlinear regression that can work in this instance.
This solution isnt accurate enough as if I feed
x = dfnew["days_since"]
y = dfnew["nonbrand"]
z = numpy.polyfit(x,y,1)
f = numpy.poly1d(z)
x_new = future_days
y_new = f(x_new)
plt.plot(x,y, '.', x_new, y_new, '-')
EDIT:
I have now tried curve_fit with a logarithmic function, as the curve and data behaviour seem to conform to one:
from scipy.optimize import curve_fit
def func(x, a, b):
    return a*numpy.log(x)+b
x = dfnew["days_since"]
y = dfnew["nonbrand"]
popt, pcov = curve_fit(func, x, y)
plt.plot( future_days, func(future_days, *popt), '-')
However when I plot it, my Y-values are way off:
The very general rule of thumb is that if your fitting function does not fit your actual data well enough, then either:
You are using the function wrong, e.g. you are using 1st order polynomials. If you are convinced that it is a polynomial, then try higher order polynomials.
You are using the wrong function. It is always worth taking a look at:
your data curve, and
what you know about the process that is generating the data
to come up with some hypotheses or guesses about what sort of model might fit better.
Might your process be a logarithmic one, a saturating one, etc.? Try them!
Finally, if you are not getting a consistent long-term trend, then you might be able to justify using cubic splines.
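If you want to try the logarithmic candidate, a minimal sketch could look like this. The series below is synthetic (generated from a known a and b so the fit can be checked), and the names log_model, days and values are placeholders rather than the asker's columns:

```python
import numpy as np
from scipy.optimize import curve_fit

def log_model(x, a, b):
    """Logarithmic growth: y = a * ln(x) + b."""
    return a * np.log(x) + b

# Hypothetical daily series; days start at 1 because ln(0) is undefined
days = np.arange(1.0, 601.0)
values = 50.0 * np.log(days) + 200.0

popt, pcov = curve_fit(log_model, days, values)

# Extrapolate 30 days past the end of the data
future_days = np.arange(601.0, 631.0)
forecast = log_model(future_days, *popt)
```

I can't tell from the question why the plotted values were way off, but one thing worth checking is whether days_since contains 0, since numpy.log(0) evaluates to -inf and would upset the fit.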

How to force polyfit with second degree to a y-intercept of 0

I've been using the numpy.polyfit function to do some forecasting. If I put in a degree of 1, it works, but I need to do a second degree polynomial fit. In some cases it works, in other cases the plot of the prediction goes down and then goes up forever. For example:
import matplotlib.pyplot as plt
from numpy import *
x=[1,2,3,4,5,6,7,8,9,10]
y=[100,85,72,66,52,48,39,33,29,32]
degree = 2
fit = polyfit(x, y, degree)
fitfunction = poly1d(fit)
to_predict=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
plt.plot(to_predict,fitfunction(to_predict))
plt.show()
After I run that, this shows up (I tried putting a picture up but stackoverflow won't let me).
I want to force it to go through zero.
How would I do that?
If you don't need the fit's error to be computed using the original least squares formula (i.e. minimizing ∑ |y_i - (a x_i^2 + b x_i)|^2), you could try performing a linear fit of y/x instead, because (a x^2 + b x)/x = a x + b.
If you must use the same error metric, construct the coefficient matrices directly and use numpy.linalg.lstsq:
x = numpy.asarray(x, dtype=float) # so that x*x is elementwise
coeff = numpy.transpose([x*x, x])
((a, b), _, _, _) = numpy.linalg.lstsq(coeff, y, rcond=None)
polynomial = numpy.poly1d([a, b, 0])
(Note that your provided data sequence does not look like a parabola having a y-intercept of 0.)
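Put together with the data from the question, a self-contained version of the lstsq approach might look like this (the names A, residuals, rank and sv are just labels I chose for the lstsq return values):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([100, 85, 72, 66, 52, 48, 39, 33, 29, 32], dtype=float)

# Design matrix with columns x**2 and x only: there is no constant column,
# so the fitted curve y = a*x**2 + b*x has a y-intercept of 0 by construction
A = np.column_stack([x ** 2, x])
(a, b), residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)

poly = np.poly1d([a, b, 0.0])
```

Evaluating poly(0) gives exactly 0, and poly(to_predict) can be plotted in the same way as fitfunction in the question.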
If anyone has to do this under a deadline, a quick solution is to add a bunch of extra points at 0 to skew the weighting. I did this:
for i in range(0,100):
    x_vent.insert(i,0)
    y_vent.insert(i,0)
slope_vent,intercept_vent=np.polyfit(x_vent,y_vent,1)
