With the following code I just want to fit a regression curve to sample data, but it is not working as expected.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = 10 * np.random.rand(100)
y = 2 * X**2 + 3 * X - 5 + 3 * np.random.rand(100)
xfit = np.linspace(0, 10, 100)
poly_model = make_pipeline(PolynomialFeatures(2), LinearRegression())
poly_model.fit(X[:, np.newaxis], y)
y_pred = poly_model.predict(X[:, np.newaxis])
plt.scatter(X, y)
plt.plot(X[:, np.newaxis], y_pred, color="red")
plt.show()
Shouldn't there be a curve that fits the data points perfectly? After all, the training data (X[:, np.newaxis]) and the data used to predict y_pred are the same (also X[:, np.newaxis]).
If I instead use the xfit data for the prediction, the result is as desired...
...
y_pred = poly_model.predict(xfit[:, np.newaxis])
plt.scatter(X, y)
plt.plot(xfit[:, np.newaxis], y_pred, color="red")
plt.show()
So what's the issue, and what explains this behaviour?
The difference between the two plots is that in the line
plt.plot(X[:, np.newaxis], y_pred, color="red")
the values in X[:, np.newaxis] are not sorted, while in
plt.plot(xfit[:, np.newaxis], y_pred, color="red")
the values of xfit[:, np.newaxis] are sorted.
Now, plt.plot connects any two consecutive values in the array with a line segment, and since the values are not sorted, you get that bunch of criss-crossing lines in your first figure.
Replace
plt.plot(X[:, np.newaxis], y_pred, color="red")
with
plt.scatter(X[:, np.newaxis], y_pred, color="red")
and you'll get this nice-looking figure:
Based on Miriam Farber's answer, I have figured out another way. Since the X values are not sorted, I can fix the issue by simply sorting them first:
X = np.sort(X)
The rest of the code can then stay unchanged and will deliver the desired result.
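If X and y already exist and cannot simply be regenerated from a sorted X, an alternative sketch is to sort only for plotting, using np.argsort so the x-values and predictions stay aligned:
order = np.argsort(X)  # indices that put X in ascending order
plt.scatter(X, y)
plt.plot(X[order], y_pred[order], color="red")  # plotting in x-order removes the zigzag
plt.show()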
I'm attempting to do a regression to fit a function to some data points I have; these are simply (x, y) pairs, where x is a date and y is a data point. Seems simple enough.
I'm following along with a how-to, and it comes to the part where you split your data into training/testing; that much I understand, but the input for model.fit is a 2D array and then labels.
I think I'm being incredibly dense, but this is what I have for that:
model.fit(input, date_time_training)
My input is an array like [[5, 3], [7, 5], etc.], and my "labels" are dates, because that's how I'd want to label my data, but that's not right: they need to be numbers. There are two things they could be, though: my data points, which are y on my graph, or my x-axis values, which are dates. I converted my dates into numbers (0, 1, 2, 3, etc.) corresponding to each date.
Is that also what my labels would be?
Also, my input is just [[date_converted_to_int, score], etc.], whereas looking at the documentation, it seems that should be [[points, features], etc.]. I'm pretty confused; obviously I'm not super experienced with regression either (otherwise I'm guessing this would be clearer).
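To make the shapes concrete: if I understand the (n_samples, n_features) convention correctly, it would look something like this sketch (made-up numbers, dates already converted to ints, and model standing in for any sklearn regressor):
# Each row of X is one sample; each column is one feature.
X = [[0], [1], [2], [3]]  # the only feature: the date converted to an int
y = [5, 7, 6, 9]          # one score per date; these are the labels/targets
model.fit(X, y)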
You are trying to predict (the proper term in this case is forecast) your y over time.
So it is more suitable to use a time series model here, because by definition this is a time series use case.
[time series: you try to understand the evolution of the values of an attribute over time]
Try some models like:
AR
ARIMA
statsmodels would be a nice place to visit for documentation.
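As an illustration, a minimal sketch of an ARIMA forecast with statsmodels, assuming scores is a hypothetical list of daily values in date order (the order=(1, 1, 1) setting is only a placeholder and should be tuned to the data):
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
y = np.asarray(scores, dtype=float)  # values in date order
model = ARIMA(y, order=(1, 1, 1))    # (p, d, q): AR lags, differencing, MA lags
result = model.fit()
print(result.forecast(steps=7))      # forecast the next 7 days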
I am:
(A) running the example from the Seaborn documentation, Discovering structure in heatmap data, but using the Distance Correlation from the dcor library instead of pandas.DataFrame.corr, which is limited to linear or rank coefficients.
Then I want to:
(B) do the same using a couple of DataFrames with my own data.
I supply the distance correlation to sns.clustermap directly, as done in the documentation example, because I am interested in the structure in the heatmap, as opposed to using the Distance Correlation matrix to calculate the linkage, as done in this SO answer, for example. I create the distance correlation matrix with a modification of code from this excellent SO answer.
(A) No issues here
As I execute:
# Distance correlation between two columns, via the dcor library
distcorr = lambda column1, column2: dcor.distance_correlation(column1, column2)
# Pairwise distance correlation matrix over all columns of df
dcor_df = df.apply(lambda col1: df.apply(lambda col2: distcorr(col1, col2)))
sns.clustermap(dcor_df, cmap="mako",
               row_colors=network_colors, col_colors=network_colors,
               linewidths=.75, figsize=(13, 13))
I get the result I expected:
(B) I do encounter issues here, as I move to my own data
For some background: I have two DataFrames with variables labeled A, B, ..., P in both. The variables are identical (same measurement, same units), but the measurements were collected in two locations that are spatially separated, hence my goal was to run the analysis separately, to see if the variables correlate in a similar way (i.e. with similar structure in the heatmap) in different locations.
Data from the first location is stored here.
I execute the following code:
df_1 = pd.read_csv('df_1.csv')
pd.options.display.float_format = '{:,.3f}'.format
distcorr = lambda column1, column2: dcor.distance_correlation(column1, column2)
rslt_1 = df_1.apply(lambda col1: df_1.apply(lambda col2: distcorr(col1, col2)))
rslt_1
and I get the expected (square, symmetric) Distance Correlation matrix:
which I can plot with sns.heatmap as:
h=sns.heatmap(rslt_1, cmap="mako", vmin=0, vmax=1,
xticklabels=True, yticklabels=True, square=True)
fig = plt.gcf()
fig.set_size_inches(14, 10)
However, when I try to pass the Distance Correlation matrix to sns.clustermap with:
s=sns.clustermap(rslt_1, cmap="mako", standard_scale=1, linewidths=0)
fig = plt.gcf()
fig.set_size_inches(10, 10);
I get this:
which is very weird to me, because I'm expecting the same ordering on both rows and columns as in the modified documentation example above. Unless I'm totally out to lunch and am missing or misunderstanding something important.
If I pass metric='correlation' like this:
s=sns.clustermap(rslt_1, cmap="mako", metric='correlation',
standard_scale=1, linewidths=0)
fig = plt.gcf()
fig.set_size_inches(10, 10);
I get a result that is symmetric about the diagonal as I expected, and if I 'eyeball' those clusters they make more sense to me when I compare to the matrix in tabular form:
With the data from the second location, which is stored here, I get reasonable results (and fairly similar, although not identical) whether I pass metric='correlation' or not:
I cannot explain the behavior in the first case. Am I missing something?
Thank you.
PS I am on a Windows 10 PC.
Remove the standard_scale parameter from your clustermap.
According to the seaborn documentation (seaborn.pydata.org/generated/seaborn.clustermap.html), the standard_scale=1 parameter standardizes the column dimension, which means subtracting the minimum and dividing each value by its maximum. The data matrix you pass to the clustermap function already looks to be within [0, 1].
s=sns.clustermap(rslt_1, cmap="mako", metric='correlation', linewidths=0)
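For intuition, this is roughly what standard_scale=1 does to each column before clustering (a sketch of the documented behaviour, not seaborn's actual source):
# Min-max scale each column: subtract the column minimum,
# then divide by the resulting column maximum, mapping values into [0, 1].
scaled = rslt_1 - rslt_1.min(axis=0)
scaled = scaled / scaled.max(axis=0)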
Clustering is basically grouping data based on relationships among the variables in the data. Clustering algorithms help in finding structure in data in unsupervised learning. The most common types of clustering are shown in the picture below.
The clustermap() function of seaborn plots a hierarchically-clustered heat map of the given matrix dataset. It returns a ClusterGrid instance.
In agglomerative clustering, we start by considering each data point as its own cluster and then repeatedly merge the two nearest clusters into larger ones until we are left with a single cluster. The tree diagram plotted after performing agglomerative clustering on data is called a "dendrogram".
Clustering obviously re-orders your data.
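If the re-ordering itself is of interest, it can be recovered from the ClusterGrid object that clustermap returns (a small sketch using the seaborn ClusterGrid attributes):
g = sns.clustermap(rslt_1, cmap="mako", metric='correlation', linewidths=0)
row_order = g.dendrogram_row.reordered_ind  # integer positions of the rows after clustering
col_order = g.dendrogram_col.reordered_ind  # same for the columns
print(rslt_1.index[row_order])              # row labels in clustermap order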
I'm using sklearn.mixture.GMM to fit two Gaussian curves to an array of data and subsequently overlay it with the data histogram (the data distribution is a mixture of 2 Gaussian curves).
My data is a list of floats, and here are the lines of code I am using:
clf = mixture.GMM(n_components=1, covariance_type='diag')
clf.fit(listOffValues)
If I set n_components to 1, I get the following error:
"(or increasing n_init) or check for degenerate data.")
RuntimeError: EM algorithm was never able to compute a valid likelihood given initial parameters. Try different init parameters (or increasing n_init) or check for degenerate data.
and if I set n_components to 2, the error is:
(self.n_components, X.shape[0]))
ValueError: GMM estimation with 2 components, but got only 1 samples.
For the first error, I tried changing all the init parameters of GMM, but it didn't make any difference.
I tried an array of random numbers instead, and the code works perfectly fine.
I can't figure out what the issue could possibly be.
Is there an implementation issue I'm overlooking?
Thank you for your help.
If I understood you correctly, you would like to fit your data distribution with Gaussians, and you have only one feature per element. Then you should reshape your vector to be a column vector:
listOffValues = np.reshape(listOffValues, (-1, 1))
Otherwise, if your listOffValues corresponds to some curve that you want to fit with several Gaussians, then you should use curve_fit. See Gaussian fit for Python.
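A minimal end-to-end sketch of the first case, using the GaussianMixture class (sklearn.mixture.GMM was deprecated and later removed in favour of it; listOffValues stands in for your data):
import numpy as np
from sklearn.mixture import GaussianMixture
X = np.reshape(listOffValues, (-1, 1))  # one sample per row, one feature per column
clf = GaussianMixture(n_components=2, covariance_type='diag')
clf.fit(X)
print(clf.means_.ravel())        # estimated means of the two Gaussians
print(clf.covariances_.ravel())  # estimated (diagonal) variances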
I am trying to extrapolate future data points from a data set that contains one continuous value per day for almost 600 days. I am currently fitting a 1st-order function to the data using numpy.polyfit and numpy.poly1d. In the graph below you can see the curve (blue) and the 1st-order function (green). The x-axis is days since the beginning. I am looking for an effective way to model this curve in Python in order to extrapolate future data points as accurately as possible. A linear regression isn't accurate enough, and I'm unaware of any methods of nonlinear regression that could work in this instance.
The current solution isn't accurate enough; this is what I feed it:
import numpy
import matplotlib.pyplot as plt
x = dfnew["days_since"]
y = dfnew["nonbrand"]
z = numpy.polyfit(x, y, 1)  # 1st-order (linear) fit
f = numpy.poly1d(z)
x_new = future_days
y_new = f(x_new)
plt.plot(x, y, '.', x_new, y_new, '-')
EDIT:
I have now tried curve_fit using a logarithmic function, since the curve and data behaviour seem to conform to one:
from scipy.optimize import curve_fit
def func(x, a, b):
    return a * numpy.log(x) + b
x = dfnew["days_since"]
y = dfnew["nonbrand"]
popt, pcov = curve_fit(func, x, y)
plt.plot(future_days, func(future_days, *popt), '-')
However, when I plot it, my Y-values are way off:
The very general rule of thumb is that if your fitting function is not fitting your actual data well enough, then either:
You are using the function wrongly, e.g. you are using 1st-order polynomials - so if you are convinced that it is a polynomial, then try higher-order polynomials.
You are using the wrong function; it is always worth taking a look at:
your data curve, and
what you know about the process that is generating the data
to come up with some speculation/theory/guesses about what sort of model might fit better.
Might your process be a logarithmic one, a saturating one, etc.? Try them!
Finally, if you are not getting a consistent long-term trend, then you might be able to justify using cubic splines.
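For instance, a minimal sketch of trying two candidate models side by side, reusing the column names from the question (if days_since contains 0, shift x, e.g. to x + 1, before taking logs; the 100-day extension is only illustrative):
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
x = dfnew["days_since"].to_numpy(dtype=float)
y = dfnew["nonbrand"].to_numpy(dtype=float)
def log_model(t, a, b):
    return a * np.log(t) + b
x_new = np.linspace(x.min(), x.max() + 100, 200)  # extend past the data to extrapolate
p3 = np.poly1d(np.polyfit(x, y, 3))    # candidate 1: cubic polynomial
popt, _ = curve_fit(log_model, x, y)   # candidate 2: logarithmic model
plt.plot(x, y, '.', label="data")
plt.plot(x_new, p3(x_new), '-', label="cubic fit")
plt.plot(x_new, log_model(x_new, *popt), '-', label="log fit")
plt.legend()
plt.show()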
Is it feasible to plot a line of predicted values in Python? In R, you can plot the line of predicted values using lines(x$age, predict(result, data.frame(x$height, x$weight))). What I want to do is plot such predicted values determined from the regression result, but also for something more than a simple one-variable regression, which can already be done with NumPy and matplotlib.
So for example, in the following code:
import pandas as pd
import statsmodels.formula.api as smf
df = pd.read_csv("sample.csv")
# Inside I(...) the expression is evaluated as plain Python, where ** is the power
# operator (^ would be bitwise XOR), hence ** below.
result = smf.ols("res ~ height + weight + I((180 - height**2)) + I((60 - weight)**3)", data=df).fit()
And I want to use the regression result from smf.ols(), which is a bit more complicated than the usual multiple regression (say, res ~ height + weight). So how can I determine the predicted values and plot them?
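For concreteness, this is roughly the kind of plot I have in mind, sketched with the fitted result's predict method (varying height over its range while holding weight at its mean is just an illustrative guess on my part):
import numpy as np
import matplotlib.pyplot as plt
grid = pd.DataFrame({
    "height": np.linspace(df["height"].min(), df["height"].max(), 100),
    "weight": df["weight"].mean(),
})
predicted = result.predict(grid)  # statsmodels re-evaluates the formula on the new data
plt.plot(df["height"], df["res"], ".")    # observed data
plt.plot(grid["height"], predicted, "-")  # predicted line
plt.show()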
Thanks.