Trying to plot a quadratic regression, getting multiple lines - python

I'm putting together a demonstration of different types of regression in numpy with ipython, and so far I've been able to plot a simple linear regression without difficulty. But when I go on to make a quadratic fit to my data and plot it, I don't get a quadratic curve; instead I get many lines. Here's the code I'm running that generates the problem:
import numpy
from numpy import random
from matplotlib import pyplot as plt

# Generate random data
X = random.random((100, 1))
epsilon = random.randn(100, 1)
f = 3 + 5*X + epsilon

# Least-squares system with columns [1, X, X**2]
A = numpy.array([numpy.ones((100, 1)), X, X**2])
A = numpy.squeeze(A)
A = A.T

# Solve the normal equations (A^T A) beta = A^T f
quadfit = numpy.linalg.solve(numpy.dot(A.T, A), numpy.dot(A.T, f))

# Plot the data and the fitted parabola
qdbeta0, qdbeta1, qdbeta2 = quadfit[0][0], quadfit[1][0], quadfit[2][0]
plt.scatter(X, f)
plt.plot(X, qdbeta0 + qdbeta1*X + qdbeta2*X**2)
plt.show()
What I get is this picture (zoomed in to show the problem):
You can see that rather than having a single parabola that fits the data, I have a huge number of individual lines doing something that I'm not sure of. Any help would be greatly appreciated.

Your X is ordered randomly, so it's not a good set of x values to use to draw one continuous line, because it has to double back on itself. You could sort it, I guess, but TBH I'd just make a new array of x coordinates and use those:
plt.scatter(X, f)
x = numpy.linspace(0, 1, 1000)   # ordered x values, so the curve is drawn once, left to right
plt.plot(x, qdbeta0 + qdbeta1*x + qdbeta2*x**2)
gives me
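For completeness, numpy can also do the quadratic fit in one call. A minimal sketch equivalent to the normal-equations approach above (the data is regenerated here as 1-D arrays for simplicity):

import numpy as np
import matplotlib.pyplot as plt

X = np.random.random(100)
f = 3 + 5*X + np.random.randn(100)
coeffs = np.polyfit(X, f, 2)        # quadratic fit, highest-degree coefficient first
x = np.linspace(0, 1, 1000)         # ordered x values for a clean curve
plt.scatter(X, f)
plt.plot(x, np.polyval(coeffs, x))
plt.show()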

Related

2d interpolators of scipy for scattered data going crazy

I want to find the derivatives of some scattered data. I have tried two different methods:
1. projecting the scattered data onto a regular grid using scipy.interpolate.griddata, computing the gradients with numpy.gradient, and then projecting the values back to the scattered locations;
2. creating a CloughTocher2DInterpolator (but I have the same issue with other interpolators) and getting the gradients out of it.
The second one is an order of magnitude faster than the first, but unfortunately it also goes crazy quite quickly when the data are a bit complex. For instance, starting with this signal (called F, a simple sum of tanh step functions along x and y):
When I process F using the two methods, I get:
Method 1 gives a good approximation. Method 2 is also good, but I need to force the colormap limits because of some extreme values.
Now, if I add a small noise (i.e. of amplitude 0.1 while the signal has amplitudes between -3 and 3), the interpolator just goes crazy giving very large extreme values:
I don't know how to deal with this. I understand the interpolator won't like irregular functions or noise, but I was not expecting such a discrepancy. My first idea was to smooth the data first, but strangely I can't find any method that would help me with this. Another idea would be to make a 2d fit of F to try to remove the noise, but I'm stuck there too... any ideas?
Here is the corresponding python example (working on python3.6.9):
import numpy as np
from scipy import interpolate
import matplotlib.pyplot as plt
plt.interactive(True)
# scattered data
N = 200
coordu = np.random.rand(N**2,2)
Xu=coordu[:,0]
Yu=coordu[:,1]
noise = 0.                                    # noise-free case
noise = np.random.rand(Xu.shape[0])*0.1       # small noise of amplitude 0.1 (comment out to disable)
Zu = (np.tanh((Xu-0.25)/0.01 + (Yu-0.25)/0.001)
      + np.tanh((Xu-0.5)/0.01 + (Yu-0.5)/0.001)
      + np.tanh((Xu-0.75)/0.001 + (Yu-0.75)/0.001)
      + noise)
plt.figure();plt.scatter(Xu,Yu,1,Zu)
plt.title('Data signal F')
#plt.savefig('signalF_noisy.png')
### get the gradient
# method 1: griddata onto a regular grid, then np.gradient
Xs,Ys=np.meshgrid(np.linspace(0,1,N),np.linspace(0,1,N))
coords = np.array([Xs,Ys]).T
Zs = interpolate.griddata(coordu,Zu,coords)
nearest = interpolate.griddata(coordu,Zu,coords,method='nearest')
znan = np.isnan(Zs)
Zs[znan] = nearest[znan]
dZs = np.gradient(Zs,np.min(np.diff(Xs[0,:])))
dZus = interpolate.griddata(coords.reshape(N*N,2),dZs[0].reshape(N*N),coordu)
hist_dzus = np.histogram(dZus,100)
plt.figure();plt.scatter(Xu,Yu,1,dZus)
plt.colorbar()
plt.clim([0 ,10])
plt.title('dF/dx using griddata and np.gradients')
#plt.savefig('dxF_griddata_noisy.png')
# method 2: CloughTocher2DInterpolator and its gradients
interp = interpolate.CloughTocher2DInterpolator(coordu,Zu)
dZuCT = interp.grad
hist_dzct = np.histogram(dZuCT[:,0,0],100)
plt.figure();plt.scatter(Xu,Yu,1,dZuCT[:,0,0])
plt.colorbar()
plt.clim([0 ,10])
plt.title('dF/dx using CloughTocher2DInterpolator')
#plt.savefig('dxF_CT2D_noisy.png')
# histograms
plt.figure()
plt.semilogy(hist_dzus[1][:-1],hist_dzus[0],'.-')
plt.semilogy(hist_dzct[1][:-1],hist_dzct[0],'.-')
plt.title('histogram of dF/dx')
plt.legend(('griddata','CloughTocher'))
#plt.savefig('dxF_hist_noisy.png')
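One way to act on the "smooth the data first" idea from the question (a sketch, not from the original post; it assumes scipy >= 1.7 and made-up stand-in data): scipy's RBFInterpolator accepts a smoothing parameter, so it fits the scattered data approximately rather than exactly and damps the noise before any gradients are taken.

import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(0)
coordu = rng.random((500, 2))                                       # scattered sample points
Zu = np.tanh((coordu[:, 0] - 0.5) / 0.01) + 0.1 * rng.random(500)   # noisy step signal

# nonzero smoothing => approximate (smoothing) fit instead of exact interpolation
smoother = RBFInterpolator(coordu, Zu, smoothing=1e-2)
Zu_smooth = smoother(coordu)   # smoothed values back at the scattered points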

How can I find a well-fitting trendline in Matplotlib for this data?

My problem is seemingly simple - I have scatter data in X and Y, and want to get a nice, well-fitting trendline with a known equation so that I can go on to convert LDR voltages into power readings. However, I'm having trouble generating a trendline in Matplotlib or SciPy that fits well, which I believe is because there's a logarithmic relationship.
I'm using Spyder and Matplotlib, and first tried plotting the X (Thorlabs) and Y (LDR) data as a log-log scatter plot. Because the data didn't seem to show a linear relationship after doing this, I then used numpy's Polynomial.fit with degree 5 to 6. This looked good, but then when inverting the axes, so I could get something of the form [LDR] = f[Thorlabs], I noticed the fit was suddenly not very good at all at the extremes of my data.
Going by this question, curve_fit seems to be the way forward, but when I tried using curve_fit as described there and increased the maximum number of curve-fit iterations, I stumbled on the error "TypeError: can't multiply sequence by non-int of type 'numpy.float64'", which I suspect is because my data is held in plain Python lists rather than NumPy arrays. I'm not sure how to account for this.
I have several mini-questions, then -
am I misunderstanding the above examples?
is there a better way I could go about trying to find the ideal trendline for this data? Is it possible that it's some sort of logarithmic relationship on top of a log-log plot?
once I get a trendline, how can I make sure it fits well and can be displayed?
# import libraries
import matplotlib.pyplot as plt
import csv
import numpy as np
from numpy.polynomial import Polynomial
import scipy.optimize as opt

# initialise arrays - I create log arrays too so I can plot directly
deg = 6  # degree of polynomial fitting for Polynomial.fit()
thorlabs = []
logthorlabs = []
ldr = []
logldr = []

# read in LDR/Thorlabs datasets from file
with open('16ldr561nm.txt', 'r') as csvfile:
    plots = csv.reader(csvfile, delimiter='\t')
    for row in plots:
        thorlabs.append(float(row[0]))
        ldr.append(float(row[1]))
        logthorlabs.append(np.log(float(row[0])))
        logldr.append(np.log(float(row[1])))
# This seems to work just fine, I now have arrays containing data in float

# fit and plot log polynomials
p = Polynomial.fit(logthorlabs, logldr, deg)
plt.plot(*p.linspace())  # plot lines

# plot scatter graphs on log-log axis - either using log arrays or on loglog plot
# plt.loglog()
plt.scatter(logthorlabs, logldr, label='16bit ADC LDR1')
plt.xlabel('log Thorlabs laser power (microW)')
plt.ylabel('log LDR voltage (mV)')
plt.title('LDR voltage against laser power at 561nm')
plt.legend()
plt.show()
# attempt at using curve_fit - when using, comment out the above block
"""
# This is the function we are trying to fit to the data.
def func(x, a, b, c):
    return a * np.exp(-b * x) + c

# freaks out here as I get a type error which I am not sure how to account for
# Plot the actual data
plt.plot(thorlabs, ldr, ".", label="Data")

# Adjusted maxfev to 5000. I know you can make "guesses" here but I am not sure how to do so
# The actual curve fitting happens here
optimizedParameters, pcov = opt.curve_fit(func, thorlabs, ldr, maxfev=5000)

# Use the optimized parameters to plot the best fit
plt.plot(thorlabs, func(ldr, *optimizedParameters), label="fit")

# Show the graph
plt.legend()
plt.show()
"""
When using curve_fit, I get a "TypeError: can't multiply sequence by non-int of type 'numpy.float64'".
As I don't have enough reputation to post images, my raw dataset can be found here. (Otherwise I'd include the graphs!)
(Note that I actually have two datasets, but as I only want to know the principle for calculating a trendline for one, I've left out the other dataset above.)
Refactoring your code a bit, most importantly to use native Numpy arrays once things have been parsed out from the file, makes things not crash, but the CurveFit line doesn't look good at all.
The code prints out the parameters fit by curve_fit, which don't look very good either, and a warning too: "Covariance of the parameters could not be estimated". I'm no mathematician/statistician, so I don't know what to do there.
from numpy.polynomial import Polynomial
import csv
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize as opt

def read_dataset(filename):
    x = []
    y = []
    with open(filename, "r") as csvfile:
        plots = csv.reader(csvfile, delimiter="\t")
        for row in plots:
            x.append(float(row[0]))
            y.append(float(row[1]))
    # cast to native numpy arrays
    x = np.array(x)
    y = np.array(y)
    return (x, y)

# column 0 is the Thorlabs power, column 1 the LDR voltage
thorlabs, ldr = read_dataset("16ldr561nm.txt")

plt.scatter(thorlabs, ldr, label="Data")
plt.xlabel("Thorlabs laser power (microW)")
plt.ylabel("LDR voltage (mV)")
plt.title("LDR voltage against laser power at 561nm")

# Generate and plot polynomial
p = Polynomial.fit(thorlabs, ldr, 6)
plt.plot(*p.linspace(), label="Polynomial")

# Generate and plot curvefit
def func(x, a, b, c):
    return a * np.exp(-b * x) + c

optimizedParameters, pcov = opt.curve_fit(func, thorlabs, ldr)
print(optimizedParameters, pcov)
# evaluate the fitted function on the x data (thorlabs), not on ldr
plt.plot(thorlabs, func(thorlabs, *optimizedParameters), label="CurveFit")

# Show everything
plt.legend()
plt.show()
If you really need to log() the data, it's easily done with
x = np.log(x)
y = np.log(y)
which will keep the arrays as NumPy arrays and be plenty faster than doing it "by hand".
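If the log-log scatter really is close to a straight line, a power law y = A * x**k may describe the data better than a degree-6 polynomial or an exponential. A sketch with made-up data standing in for the thorlabs/ldr arrays:

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 100, 50)
y = 3.0 * x**0.7 * rng.normal(1.0, 0.05, x.size)   # hypothetical power-law data

# a straight-line fit in log-log space recovers the exponent and prefactor
k, logA = np.polyfit(np.log(x), np.log(y), 1)
print(f"A = {np.exp(logA):.3g}, k = {k:.3g}")      # roughly A = 3, k = 0.7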

How to normalise plotted points and get a circle?

Given 2000 random points drawn with numpy.random.normal(0,1), I want to normalize them so that they end up on the unit circle. How do I do that?
I was requested to show my efforts. This is part of a larger question: Write a program that samples 2000 points uniformly from the circumference of a unit circle. Plot and show it is indeed picked from the circumference. To generate a point (x,y) from the circumference, sample (x,y) from std normal distribution and normalise them.
I'm almost certain my code isn't correct, but this is where I am up to. Any advice would be helpful.
This is the new updated code, but it still doesn't seem to be working.
import numpy as np
import matplotlib.pyplot as plot

def plot():
    xy = np.random.normal(0, 1, (2000, 2))
    for i in range(2000):
        s = np.linalg.norm(xy[i, ])
        xy[i, ] = xy[i, ] / s
    plot.plot(xy)
    plot.show()
I think the problem is in
plot.plot(xy)
even if I use
plot.plot(xy[:,0],xy[:,1])
it doesn't work.
Connected lines are not a good visualization here: you are essentially connecting random points on the circle, and since there are 2000 of them, the lines fill the disc. Try drawing points instead.
Also avoid namespace clashes. You import matplotlib.pyplot as plot and also name your function plot, so the function shadows the module and leads to name conflicts.
import numpy as np
import matplotlib.pyplot as plt

def plot():
    xy = np.random.normal(0, 1, (2000, 2))
    for i in range(2000):
        s = np.linalg.norm(xy[i, ])
        xy[i, ] = xy[i, ] / s
    fig, ax = plt.subplots(figsize=(5, 5))
    # scatter draws dots instead of lines
    ax.scatter(xy[:, 0], xy[:, 1])
If you use dots instead, you will see that your points indeed lie on the unit circle.
Your code has many problems:
Why use np.random.normal (a Gaussian distribution) when the problem text is about uniform (flat) sampling?
To pick points on a circle you need to correlate x and y; i.e. randomly sampling x and y will not give a point on the circle as x**2+y**2 must be 1 (for example for the unit circle centered in (x=0, y=0)).
A couple of ways to get the second point are either to "project" a random point from [-1...1]x[-1...1] onto the unit circle, or to pick the angle uniformly and compute the point on the circle at that angle (a quick sketch of the latter follows).
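A minimal sketch of the uniform-angle approach (the sample size and figure size are made up for illustration):

import numpy as np
import matplotlib.pyplot as plt

# pick the angle uniformly in [0, 2*pi) and map it onto the unit circle
theta = np.random.uniform(0, 2 * np.pi, 2000)
x, y = np.cos(theta), np.sin(theta)

fig, ax = plt.subplots(figsize=(5, 5))
ax.set_aspect('equal')   # keep the circle round
ax.scatter(x, y, s=1)
plt.show()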
First of all, if you look at the documentation for numpy.random.normal (and, by the way, you could just use numpy.random.randn), it takes an optional size parameter, which lets you create as large of an array as you'd like. You can use this to get a large number of values at once. For example: xy = numpy.random.normal(0,1,(2000,2)) will give you all the values that you need.
At that point, you need to normalize them such that xy[:,0]**2 + xy[:,1]**2 == 1. This should be relatively trivial after computing what xy[:,0]**2 + xy[:,1]**2 is. Simply using norm on each dimension separately isn't going to work.
Usual boilerplate:
import numpy as np
import matplotlib.pyplot as plt
Generate the random sample with two rows, so that it's more convenient to refer to the x's and y's:
xy = np.random.normal(0, 1, (2, 2000))
Normalize the random sample using a library function to compute the norm. axis=0 means the norm is taken over the subarrays obtained by varying the first array index, so the result is a (2000,)-shaped array that broadcasts against xy in the in-place division, leaving every point with unit norm, i.e. lying on the unit circle:
xy /= np.linalg.norm(xy, axis=0)
Eventually, the plot... here the key is the add_subplot() method, and in particular the keyword argument aspect='equal', which requires that the scale from user units to output units is the same for both axes:
plt.figure().add_subplot(111, aspect='equal').scatter(xy[0], xy[1])
plt.show()
to have

How do I limit the interpolation region in the InterpolatedUnivariateSpline in Python when given non-uniform samples?

I'm trying to get a nice upsampler using Python when I have non-uniform spaced inputs. Any suggestions would be helpful. I've tried a number of interp functions. Here's an example:
from scipy.interpolate import InterpolatedUnivariateSpline
from numpy import linspace, arange, append
from matplotlib.pyplot import plot, show

F = [0, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 22050]
M = [0., 2.85, 2.49, 1.65, 1.55, 1.81, 1.35, 1.00, 1.13, 1.58, 1.21, 0.]
ff = linspace(F[0], F[1], 10)
for i in arange(2, len(F)):
    ff = append(ff, linspace(F[i-1], F[i], 10))
aa = InterpolatedUnivariateSpline(x=F, y=M, k=2)
mm = aa(ff)
plot(F, M, 'r-o'); plot(ff, mm, 'bo'); show()
This is the plot I get:
I need to get interpolated values that don't go below 0. Note that the blue dots go below zero. The red line represents the original F vs. M data. If I use k=1 (piece-wise linear interp) then I get good values as shown here:
aa=InterpolatedUnivariateSpline(x=F,y=M,k=1)
mm=aa(ff); plot(F,M,'r-o');plot(ff,mm,'bo'); show()
The problem is that I need a "smooth" interpolation, not the piecewise-linear one. Does anyone know if the bbox argument in InterpolatedUnivariateSpline helps to fix this? I can't find any documentation on what bbox does. Is there another, easier way to accomplish this?
Thanks in advance for any help.
Positivity-preserving interpolation is hard (if it wasn't, there wouldn't be a bunch of papers written about it). The splines of low degree (2, 3) usually do pretty well in this regard, but your data has that large gap in it, and it happens to be at the end of data range, making things worse.
One solution is to do interpolation in two steps: first upsample the data by piecewise linear interpolation, then interpolate new data with a smooth spline (I'll use cubic spline below, though quadratic also works).
The gap_size array records how large each gap is relative to the smallest one. In the subsequent loop, uniformly spaced points are inserted into the larger gaps (those at least twice the size of the smallest one). The result is F_new, a denser, nearly uniform grid that still includes the original points. The corresponding M values are generated by a piecewise linear spline.
Subsequent cubic interpolation produces a smooth curve that stays positive.
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import InterpolatedUnivariateSpline

F = [0, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 22050]
M = [0., 2.85, 2.49, 1.65, 1.55, 1.81, 1.35, 1.00, 1.13, 1.58, 1.21, 0.]

gap_size = np.diff(F) // np.diff(F).min()
F_new = []
for i in range(len(F)-1):
    F_new.extend(np.linspace(F[i], F[i+1], gap_size[i], endpoint=False))
F_new.append(F[-1])

# upsample with a piecewise linear spline, then interpolate with a cubic one
pl_spline = InterpolatedUnivariateSpline(F, M, k=1)
M_new = pl_spline(F_new)
smooth_spline = InterpolatedUnivariateSpline(F_new, M_new, k=3)

ff = np.linspace(F[0], F[-1], 100)
plt.plot(F, M, 'ro')
plt.plot(ff, smooth_spline(ff), 'b')
plt.show()
Of course, no tricks can hide the truth that we don't know what happens between 5500 and 22050 (Hz, I presume), the nearly-linear part is just a placeholder.
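An alternative worth sketching (not from the original answer, but a standard tool for exactly this problem): scipy's PchipInterpolator is a shape-preserving monotone cubic interpolant, so between data points it stays within the range of its neighbours and cannot dip below zero here.

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import PchipInterpolator

F = [0, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 22050]
M = [0., 2.85, 2.49, 1.65, 1.55, 1.81, 1.35, 1.00, 1.13, 1.58, 1.21, 0.]

# PCHIP preserves the shape of the data, so no negative undershoots
pchip = PchipInterpolator(F, M)
ff = np.linspace(F[0], F[-1], 500)
plt.plot(F, M, 'ro')
plt.plot(ff, pchip(ff), 'b')
plt.show()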

Manipulating the numpy.random.exponential distribution in Python

I am trying to create an array of random numbers using Numpy's random exponential distribution. I've got this working fine, however I have one extra requirement for my project and that is the ability to specify precisely how many array elements have a certain value.
Let me explain (code is below, but I'll have a go at explaining it here): I generate my random exponential distribution and plot a histogram of the data, producing a nice exponential curve. What I really want to be able to do is use a variable to specify the y-intercept of this curve (point where curve meets the y-axis). I can achieve this in a basic way by changing the number of bins in my histogram, but this only changes the plot and not the original data.
I have inserted the bones of my code here. To give some context, I am trying to create the exponential disc of a galaxy, hence the random array I want to generate is an array of radii and the variable I want to be able to specify is the number density in the centre of the galaxy:
import numpy as N
import matplotlib.pyplot as P
n = 1000
scale_radius = 2
central_surface_density = 100  # I would like this to be the controlling variable, even if its specification had knock-on effects on n.
radius_array = N.random.exponential(scale_radius,(n,1))
P.figure()
nbins = 100
number_density, radii = N.histogram(radius_array, bins=nbins, density=False)
P.plot(radii[0:-1], number_density)
P.xlabel('$R$')
P.ylabel(r'$\Sigma$')
P.ylim(0, central_surface_density)
P.legend()
P.show()
This code creates the following histogram:
So, to summarise, I would like to be able to specify where this plot intercepts the y-axis by controlling how I've generated the data, not by changing how the histogram has been plotted.
Any help or requests for further clarification would be very much appreciated.
According to the docs for numpy.random.exponential, the input parameter beta is 1/lambda in the definition of the exponential distribution described on Wikipedia.
What you want is this density evaluated at x=0, which is f(0) = lambda = 1/beta. Therefore, for a density-normalised histogram, your y-intercept is just the inverse of the scale parameter you pass to the numpy function:
import numpy as np
import matplotlib.pyplot as plt

target = 250
beta = 1.0 / target
Y = np.random.exponential(beta, 5000)
plt.hist(Y, density=True, bins=200, lw=0, alpha=.8)
plt.plot([0, max(Y)], [target, target], 'r--')
plt.ylim(0, target * 1.1)
plt.show()
Yes, the y-intercept of the histogram will change with different bin sizes, but that number doesn't mean anything by itself. The only thing you can reasonably talk about here is the underlying probability density (hence density=True).
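A quick numerical sanity check of the f(0) = 1/beta claim (a sketch; the sample size is arbitrary):

import numpy as np

beta = 1.0 / 250
Y = np.random.exponential(beta, 200000)
counts, edges = np.histogram(Y, bins=200, density=True)
# the density in the first bin should be close to lambda = 1/beta = 250
print(counts[0])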
