An error with Python polynomial regression data fitting

I'm trying to generate a fit for the data I have: The Data
When plotted directly, the sample looks like this: Sample Data
I've been trying to generate a polynomial fit for this data, where T is the time in days and IC/IC100 is the corresponding measurement.
I've used two methods to generate the polynomial fit.
1. Using polyfit and poly1d
Here is my code for this approach:
import math
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
sns.set(style="darkgrid")
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 100
from scipy.stats import sem
from scipy import optimize
from scipy.optimize import curve_fit
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
IC_M = pd.read_csv("TvsC100_MES.csv")
IC_M.set_index('Group/#/', inplace=True)
IICM_1 = IC_M[0:5]
IICM_1
# DEGREE = 2
mymodel = np.poly1d(np.polyfit(IICM_1["IC/IC100"],IICM_1["T"], 2))
figure(figsize=(12, 8), dpi=100)
plt.plot(IICM_1["T"], IICM_1["IC/IC100"], marker = 'o', label = 'Original Plot', c = 'blue')
plt.plot(mymodel(IICM_1["IC/IC100"]),IICM_1["IC/IC100"], marker = 'x', label = 'New Y', color = 'red')
#plt.plot(mymodel(new_y),new_y, marker = 'x', label = 'New Y', color = 'red')
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.show()
When I plot the graph, I get the error shown below: one point is off, and it's not supposed to be like that. I haven't been able to fix this, and the behavior of this correlation doesn't match what was observed experimentally for the recorded values.
The Error
The second method I used was PolynomialFeatures with fit_transform and LinearRegression's fit/predict.
With this method, coefficients seem to be generated per point, so the fit follows each point rather than the curve as a whole. For the equation Y(X) = A*X^2 + B*X + C, the coefficients A, B, and C should stay constant across all points; that is the fit I am looking for. If I extend the values of Y, I should get the next predicted value according to the sample data, but unfortunately that isn't the case.
Here is my code (only the main part differs after loading the data as before):
# Poly Creation
# The degree is given as (min_degree, max_degree); with (2,2) only the squared term is kept (lower-degree terms are dropped)
poly = PolynomialFeatures(degree=(2,2), include_bias= False)
# Transform the data by applying the polynomial feature mapping
poly_features = poly.fit_transform(np.array(IICM_1["IC/IC100"]).reshape(-1,1)) # Y
#Creating an Instance of the Linear Regression Model
poly_reg_model = LinearRegression(fit_intercept = False, positive = True)
# Fit (train) the model on the input features and the response T to solve for the coefficients
# With degree=(2,2), include_bias=False and fit_intercept=False, the model is y = A*x^2
poly_reg_model.fit(poly_features, np.array(IICM_1["T"]).reshape(-1,1)) #X
y_predicted = poly_reg_model.predict(poly_features)
figure(figsize=(12, 8), dpi=100)
# points + Curve
plt.plot(IICM_1["T"], IICM_1["IC/IC100"] ,marker = 'o', label = "Samp: C/3005-1", color = "blue")
plt.plot(y_predicted, IICM_1["IC/IC100"] ,marker = 'x', label = "Samp: Prediction", color = "red")
plt.legend()
plt.xlabel("T")
plt.ylabel("IC/IC100")
plt.show()
This is the output I get, which looks incorrect to me.
I tried changing the input order to the functions, thinking it would then treat the points as a single curve rather than as individual segments, but the results were bad.
I need to fix this: either my understanding of polynomial fitting is incorrect, or I'm using these functions incorrectly, or something else is wrong. How can I approach and fix this issue?
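For clarity, this is roughly the single global fit I'm expecting (a sketch only, with placeholder numbers rather than my real sample): one polyfit call gives one set of coefficients A, B, C, and the fitted curve is then evaluated on a sorted grid rather than point by point.
import numpy as np
import matplotlib.pyplot as plt
# Placeholder values standing in for IC/IC100 (x) and T (y)
x = np.array([0.05, 0.20, 0.45, 0.70, 0.95])
y = np.array([1.0, 3.0, 7.0, 14.0, 28.0])
# One global fit: A, B, C are the same for every point
coeffs = np.polyfit(x, y, 2)
poly = np.poly1d(coeffs)
# Evaluate the fitted polynomial on a sorted, dense grid so the curve is drawn as one line
x_grid = np.linspace(x.min(), x.max(), 100)
plt.plot(y, x, 'o', label='Original data')
plt.plot(poly(x_grid), x_grid, '-', label='Single global fit')
plt.xlabel('T')
plt.ylabel('IC/IC100')
plt.legend()
plt.show()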

Related

How can I plot x/y datapoints from three different time periods on the same axes for analysis?

I want to statistically compare the results of linear regression analyses of air temperature (x) versus gas use in kWh (y) across three different years.
I'm unsure how to plot multiple conditions on the same x-y axes and then carry out the statistical analysis.
I've been using scikit-learn, following an excellent tutorial, to plot the regression analysis for each time period (test data included below). However, I'm unsure how to include multiple conditions in the same plot.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# test data
np.random.seed(0)
xa = np.random.rand(100, 1)
ya = 2 + 3 * xa + np.random.rand(100, 1)
# scikit-learn implementation
# Model initialization
regression_model = LinearRegression()
# Fit the data(train the model)
regression_model.fit(xa, ya)
# Predict
ya_predicted = regression_model.predict(xa)
# model evaluation
rmse = np.sqrt(mean_squared_error(ya, ya_predicted))
r2 = r2_score(ya, ya_predicted)
# printing values
print('Slope:' ,regression_model.coef_)
print('Intercept:', regression_model.intercept_)
print('Root mean squared error: ', rmse)
print('R2 score: ', r2)
# plotting values
# data points
plt.scatter(xa, ya, s=10)
plt.xlabel('Average air temperature (\xb0C)')
plt.ylabel('Total daily gas use (kWh)')
# predicted values
plt.plot(xa, ya_predicted, color='r')
plt.show()
Thanks!
To plot multiple conditions, just don't call plt.show() until you've plotted them all.
For example:
import matplotlib.pyplot as plt
import numpy as np
values = np.linspace(1, 10, 5)
new = [1, 2, 3, 4, 5]
new2 = [4, 5, 6, 7, 8]
plt.scatter(values, new)
plt.plot(values, new2)
plt.show()
You get both series on the same axes.
If you're plotting across three different years, plot as a function of time of year so that all three series share the same x-axis. You can use a different color for each year and a legend to label them, as sketched below.
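A rough sketch of that idea with made-up data (the year labels and numbers here are only for illustration):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

np.random.seed(0)
for year, color in [('year 1', 'tab:blue'), ('year 2', 'tab:orange'), ('year 3', 'tab:green')]:
    # made-up temperature (x) and gas use (y) for each year
    x = np.random.rand(50, 1) * 15
    y = 40 - 2 * x + np.random.randn(50, 1) * 3
    model = LinearRegression().fit(x, y)
    x_line = np.linspace(x.min(), x.max(), 100).reshape(-1, 1)
    plt.scatter(x, y, s=10, color=color, label=f'{year} data')
    plt.plot(x_line, model.predict(x_line), color=color, label=f'{year} fit')
plt.xlabel('Average air temperature (\xb0C)')
plt.ylabel('Total daily gas use (kWh)')
plt.legend()
plt.show()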

How to make standard deviation and percentile bands in a python scatter plot

I have data for a scatter plot (for reference, the x values are labelled sm and the y values bhm). My three goals are to find the medians of the binned data, create standard-deviation bands, and create bands at the 90th and 10th percentiles. I've managed the first, and while I can draw vertical bars indicating the standard deviation, I can't figure out how to make filled-in bands: every time I set parameters with the fill_between function, it says the operations with sm/bhm are incompatible because they are whole datasets and I'm comparing them to single values (the mean line). I copied all of my code below, and there's a comment pointing out the relevant part; I kept the rest because the variable names matter and some parts of the plot don't show up properly without the seemingly extraneous code.
To create the bands at the 90th/10th percentiles, I tried the bit of code below, binning the mean as I did for the median and then filling above and below the line at ±90% of the data, but I keep getting
patsy.PatsyError: model is missing required outcome variables
# stuff that really doesn't work
model = smf.quantreg(bhm, sm)
quantiles = [0.1, 0.9]
fits = [model.fit(q=q) for q in quantiles]
figure, axes = plt.subplots()
_sm = np.linspace(min(sm), max(sm))
for index, quantile in enumerate(quantiles):
    _bhm = (fits[index].params['world'] * _sm
            + fits[index].params['Intercept'])
    axes.plot(_sm, _bhm, label=quantile)
axes.plot(_sm, _sm, 'g--', label='i guess this line is the mean')
#stuff that also doesn't really work
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.patches as mpatches
import h5py
import statistics as stat
import pandas as pd
import statsmodels.formula.api as smf
#my files and labels for things
f=h5py.File(r'C:\Users\hanna\Downloads\CatalogueGalsz0p0.hdf5', 'r')
sm = f['StellarMass']
bhm = f['BHMass']
bt = f['BtoT']
dt = f['DtoT']
nbins = 125
#titles and scaling for the plot
plt.title('Relationships Between Stellar Mass, Black Hole Mass, and Bulge to Total Ratios')
plt.xlabel('Stellar Mass')
plt.ylabel('Black Hole Mass')
plt.xscale('log')
plt.yscale('log')
axes = plt.gca()
axes.set_ylim([500000,max(bhm)])
axes.set_xlim([min(sm),max(sm)])
#labels for the legend and how I colored the points in the plot
DtoT = np.copy(f['DtoT'].value)
colour = np.zeros(len(DtoT),dtype=str)
for i in np.arange(0, len(bt)):
    if bt[i] >= 0.5:
        colour[i] = 'green'
    else:
        colour[i] = 'red'
redbt = mpatches.Patch(color = 'red', label = 'Bulge to Total Ratios Below 0.5')
greenbt = mpatches.Patch(color = 'green', label = 'Bulge to Total Ratios Above 0.5')
plt.legend(handles = [(redbt), (greenbt)])
#the important part - this is how I binned my data to make the median line, and this part works but not the standard deviation bands
bins = np.linspace(0, max(sm), nbins)
delta = bins[1]-bins[0]
idx = np.digitize(sm, bins)
runningmedian = [np.median(bhm[idx==k]) for k in range(nbins)]
runningstd = [bhm[idx==k].std() for k in range(nbins)]
plt.plot(bins-delta/2, runningmedian, c = 'b', lw=1)
plt.scatter(sm, bhm, c=colour, s=.2)
plt.show()
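For reference, the kind of filled band I'm aiming for would look roughly like this, reusing the binned arrays above (a sketch only; the 90th/10th percentile bands would be built the same way, with a per-bin percentile in place of the standard deviation):
centers = bins - delta/2
med = np.array(runningmedian)
std = np.array(runningstd)
# shaded band: median +/- one standard deviation in each bin
plt.fill_between(centers, med - std, med + std, color='b', alpha=0.2)
# percentile bands built the same way, e.g.
p10 = np.array([np.nanpercentile(bhm[idx == k], 10) for k in range(nbins)])
p90 = np.array([np.nanpercentile(bhm[idx == k], 90) for k in range(nbins)])
plt.fill_between(centers, p10, p90, color='g', alpha=0.1)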

Limiting exponential regression in Python

I have managed to create an exponential regression based on some data from an experiment. However, I would like the regression to stop when the y-values start plateauing (around x = 42000 seconds). See attached image of plot.
This is the code so far:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.optimize as opt
# This is the function we are trying to fit to the data.
def func(x, a, b, c):
    return a * b**x
dataC = pd.read_csv("yeastdata1cropped.txt")
data = pd.read_csv("yeastdata1.txt")
xdata = np.array(data.iloc[:, 1])
ydata = np.array(data.iloc[:, 0])
xdatac = np.array(dataC.iloc[:, 1])
ydatac = np.array(dataC.iloc[:, 0])
# Plot the actual data
plt.plot(xdata, ydata, ".", label="Data")
# The actual curve fitting happens here
optimizedParameters, pcov = opt.curve_fit(func, xdatac, ydatac)
# Use the optimized parameters to plot the best fit
plt.plot(xdata, func(xdata, *optimizedParameters), label="fit")
# Show the graph
plt.legend()
plt.show()
You just need to pass the relevant values to the fit, as follows. You can use NumPy boolean indexing to pass only those values of x that are below 42000: indexing with [xdatac < 42000] keeps exactly the positions where the condition holds True. The rest of the code remains the same.
optimizedParameters, pcov = opt.curve_fit(func, xdatac[xdatac < 42000],
                                          ydatac[xdatac < 42000])
This way, the fit will only be performed up to 42000 and you can still plot the fitted line later by passing the complete x data.
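As a small self-contained illustration of that boolean masking (synthetic numbers, not the yeast data):
import numpy as np
import scipy.optimize as opt

def func(x, a, b):
    return a * b**x

# synthetic data that grows exponentially and then plateaus after x = 42000
x = np.linspace(0, 60000, 200)
y = np.where(x < 42000, 2.0 * 1.00005**x, 2.0 * 1.00005**42000)
mask = x < 42000                      # True only where x is below the cut-off
params, pcov = opt.curve_fit(func, x[mask], y[mask], p0=(1.0, 1.00001))
print(params)                         # recovers roughly a = 2.0, b = 1.00005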

How to smooth the curve?

I am using the following code to draw a curve from my two-column raw data (x = time, y = float data). The graph it plots has rough edges. Is it possible to get a smooth curve through these data? I am attaching the code, the data, and the resulting curve.
from datetime import datetime
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates
from matplotlib import style
# change the default matplotlib style
matplotlib.style.use('ggplot')
#one of {'b', 'g', 'r', 'c', 'm', 'y', 'k', 'w'}
plt.rcParams['lines.linewidth']=1
plt.rcParams['axes.facecolor']='.3'
plt.rcParams['xtick.color']='b'
plt.rcParams['ytick.color']='r'
x,y= np.loadtxt('MaxMin.txt', dtype=str, unpack=True)
x = np.array([datetime.strptime(i, "%H:%M:%S.%f") for i in x])
y = y.astype(float)
# naming the x axis
plt.xlabel('<------Clock-Time(HH:MM:SS)------>')
# naming the y axis
plt.ylabel('Acceleration (m/sq.sec)')
# giving a title to my graph
plt.title('Sample graph!')
# plotting the points
plt.plot(x, y)
# beautify the x-labels
plt.gcf().autofmt_xdate()
#Custom Format
loc = matplotlib.dates.MicrosecondLocator(1000000)
plt.gca().xaxis.set_major_locator(loc)
plt.gca().xaxis.set_major_formatter(matplotlib.dates.DateFormatter('%H:%M:%S'))
# function to show the plot
plt.show()
I have searched similar threads, but the mathematical concepts they use went over my head, so I cannot work out exactly what has to be done for my data.
Generated Graph from RAW data
I am also giving the sample data file so that you can re-construct it at your end.
Get Data File
PS: I am also unable to change the line color in the graph from the default red, even after using
plt.rcParams['lines.color']='g'
Although that is a minor issue in this case.
The input data has wrong timestamps; the original author should have used zero-padding when formatting the milliseconds (%03d).
[...]
10:27:19.3 9.50560385141
10:27:19.32 9.48882194058
10:27:19.61 9.75936468731
10:27:19.91 9.96021690527
10:27:19.122 9.48972151383
10:27:19.151 9.49265161533
[...]
We need to fix that first:
x, y = np.loadtxt('MaxMin.txt', dtype=str, unpack=True)
# fix the zero-padding issue
x_fixed = []
for xx in x:
    xs = xx.split(".")
    xs[1] = "0" * (3 - len(xs[1])) + xs[1]
    x_fixed.append(xs[0] + '.' + xs[1])
x = np.array([datetime.strptime(i, "%H:%M:%S.%f") for i in x_fixed])
y = y.astype(float)
You can then use a smoothing kernel (e.g. moving average) to smooth the data:
window_len = 3
kernel = np.ones(window_len, dtype=float)/window_len
y_smooth = np.convolve(y, kernel, 'same')
The scipy module has some ways of getting smooth curves through your points. Try adding this to the top:
from scipy import interpolate
Then add these lines just before your plt.show():
xnew = np.linspace(x.min(), x.max(), 100)
bspline = interpolate.make_interp_spline(x, y)
y_smoothed = bspline(xnew)
plt.plot(xnew, y_smoothed)
If you do a little search for scipy.interpolate.make_interp_spline, you can find more info on what that does. But essentially, the combination of that and np.linspace generates a bunch of fake data points to make up a smooth curve.

Python - Plotting confidence error bars with Maxwell Distribution

I've never tried implementing error bars based on confidence intervals. Since this is what I want to do, I'm unsure how to proceed.
I have a large data array of roughly 1000 elements. Plotting a histogram of this data, it looks reasonably like a Maxwell-Boltzmann distribution.
Let's say my data is called x, and I apply the fit as follows:
import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt
maxwell = stats.maxwell
## Scale parameter
params = maxwell.fit(x, floc=0)
print(params)
## Mean
mean = 2*params[1]*np.sqrt(2/np.pi)
print(mean)
## Variance
sig = (params[1])**(3*np.pi-8)/np.pi
print(sig)
>>> (0, 178.17597215151301)
>>> 284.327714571
>>> 512.637498406
Then, when plotting it:
fig = plt.figure(figsize=(7,7))
ax = fig.add_subplot(111)
xd = np.argsort(x)
ax.plot(x[xd], maxwell.pdf(x, *params)[xd])
ax.hist(x[xd], bins=75, histtype="stepfilled", linewidth=1.5, facecolor='none',
        alpha=0.55, edgecolor='black', normed=True)
How on earth do you go about implementing confidence intervals with the curve fit?
I can use
conf = maxwell.interval(0.90,loc=mean,scale=sig)
>>> (588.40702793225228, 1717.3973740895271)
But I have no clue what to do with this.
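For what it's worth, here is a minimal self-contained sketch (with synthetic Maxwell-distributed data, not mine) of one way such an interval could be drawn on top of the histogram and fitted PDF; note that it passes the fitted loc/scale to maxwell.interval rather than the mean and variance above, so it may not be the interpretation I actually need.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# synthetic Maxwell-distributed sample standing in for the real data
x = stats.maxwell.rvs(scale=178.0, size=1000, random_state=0)
params = stats.maxwell.fit(x, floc=0)
# central 90% interval of the fitted distribution
conf = stats.maxwell.interval(0.90, loc=params[0], scale=params[1])

fig, ax = plt.subplots(figsize=(7, 7))
ax.hist(x, bins=75, density=True, histtype='stepfilled', alpha=0.55, edgecolor='black')
xs = np.linspace(x.min(), x.max(), 500)
ax.plot(xs, stats.maxwell.pdf(xs, *params))
# mark the interval as dashed lines and a shaded band
ax.axvspan(conf[0], conf[1], color='gray', alpha=0.2)
ax.axvline(conf[0], color='k', linestyle='--')
ax.axvline(conf[1], color='k', linestyle='--')
plt.show()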
