I have a polynomial regression script that correctly predicts values from an X and Y axis; in my example I use CPU consumption. Below is an example of the data set:
Complete data set
Here time represents the collection time, for example:
1 = 1 minute
2 = 2 minutes
And so on...
And consume is the CPU usage value for that minute. In short, this data set describes the behavior of a host over a 30-minute period, each value corresponding to one minute in ascending order (1 min, 2 min, 3 min, ...).
The result for this is:
With this algorithm:
# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('data.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X)
pol_reg = LinearRegression()
pol_reg.fit(X_poly, y)
# Visualizing the Polynomial Regression results
def viz_polymonial():
    plt.scatter(X, y, color='red')
    plt.plot(X, pol_reg.predict(poly_reg.fit_transform(X)), color='blue')
    plt.title('Polynomial Regression for CPU')
    plt.xlabel('Time range')
    plt.ylabel('Consume')
    plt.show()
    return
viz_polymonial()
# 20 = time
print(pol_reg.predict(poly_reg.fit_transform([[20]])))
What's the problem?
If we duplicate this data set so that the 30-minute range appears twice, the algorithm does not handle the data set well and its result is no longer useful. Example of the data set:
--> Up to time = 30
--> Up to time = 30
Complete data set
Note: in this case there are 60 values, where each block of 30 values represents a 30-minute range, as if they were different collection days.
The result it shows is this:
Objective: I would like the blue line that represents the polynomial regression to look like the one in the first result image. The one we see above shows a loop where the points are connected back on themselves, as if the algorithm had failed.
Research source
The problem is that in the second case you plot using X = 1, 2, ..., 30, 1, 2, ..., 30. The plot function connects successive points, so the curve doubles back on itself. If you just plotted a scatter with pyplot, you would see your nice regression curve. Alternatively, you can sort the points with argsort before plotting. Here is the code with the scatter in green and the argsort line in black.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
# Importing the dataset
# dataset = pd.read_csv('data.csv')
dataset = pd.read_csv('data.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X)
pol_reg = LinearRegression()
pol_reg.fit(X_poly, y)
# Visualizing the Polynomial Regression results
def viz_polymonial():
    plt.scatter(X, y, color='red')
    indices = np.argsort(X[:, 0])
    plt.scatter(X, pol_reg.predict(poly_reg.fit_transform(X)), color='green')
    plt.plot(X[indices], pol_reg.predict(poly_reg.fit_transform(X))[indices], color='black')
    plt.title('Polynomial Regression for CPU')
    plt.xlabel('Time range')
    plt.ylabel('Consume')
    plt.show()
    return
viz_polymonial()
# 20 = time
print(pol_reg.predict(poly_reg.fit_transform([[20]])))
Here is the output image for the larger dataset.
Related
I have this dataframe and I want to calculate the polynomial regression for ozone. I pass o3 as the y value and the dates as the x value. Why does my polynomial regression look the same for degree 2 through 15? I have compared degree 4 to degree 15 and there is no difference... I have also compared the obtained regressions to the CurveExpert software, and they are entirely different... How can I solve these problems and see the differences between degree 4 and 15?
import matplotlib.pyplot as plt
import datetime as dt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('https://raw.githubusercontent.com/iulianastroia/csv_data/master/final_dataframe.csv')
dataset['day'] = pd.to_datetime(dataset['day'], dayfirst=True)
dataset = dataset.sort_values(by=['readable time'])
print(dataset.head())
group_by_df = pd.DataFrame([name, group.mean()["o3"]] for name, group in dataset.groupby('day'))
group_by_df.columns = ['day', "o3"]
group_by_df['day'] = pd.to_datetime(group_by_df['day'])
group_by_df['day'] = group_by_df['day'].map(dt.datetime.toordinal)
X = group_by_df[['day']].values
y = group_by_df[['o3']].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Visualizing the Linear Regression results
def viz_linear():
    plt.scatter(X, y, color='red')
    plt.plot(X, lin_reg.predict(X), color='blue')
    plt.title('Linear Regression')
    plt.xlabel('Date')
    plt.ylabel('O3 levels')
    plt.show()
    return
viz_linear()
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=15)
X_poly = poly_reg.fit_transform(X)
pol_reg = LinearRegression()
pol_reg.fit(X_poly, y)
# Visualizing the Polynomial Regression results
def viz_polymonial():
    plt.scatter(X, y, color='red')
    plt.plot(X, pol_reg.predict(poly_reg.fit_transform(X)), color='blue')
    plt.title('poly Regression grade 15')
    plt.xlabel('Date')
    plt.ylabel('O3 levels')
    plt.show()
    return
viz_polymonial()
You are so close. Nice job; you have a lot going on here.
I think you want to fit the test sets like this for linear:
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_test, y_test)
and like this for Polynomial:
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=15)
X_poly = poly_reg.fit_transform(X_test)
pol_reg = LinearRegression()
pol_reg.fit(X_poly, y_test)
Now the curves look visually quite different.
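A minimal sketch of how the two curves can be overlaid to see that difference, following the suggestion of fitting on the test split; it assumes X_test and y_test from the question's train_test_split are in scope, and the argsort step and colors are only illustrative:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
plt.scatter(X_test, y_test, color='red', label='test data')
# Sort the x values so plt.plot draws a single left-to-right curve
order = np.argsort(X_test[:, 0])
for deg, color in [(4, 'green'), (15, 'blue')]:
    poly = PolynomialFeatures(degree=deg)
    model = LinearRegression().fit(poly.fit_transform(X_test), y_test)
    curve = model.predict(poly.fit_transform(X_test))
    plt.plot(X_test[order], curve[order], color=color, label='degree %d' % deg)
plt.title('Polynomial Regression on O3: degree 4 vs 15')
plt.xlabel('Date (ordinal)')
plt.ylabel('O3 levels')
plt.legend()
plt.show()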
I was using matplotlib.pyplot to plot a continuous curve in a Jupyter notebook. I used the following code:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
# X and y are defined earlier in the notebook
X_train, X_test, y_train, y_test = train_test_split(X.reshape(-1,1), y, random_state = 0)
poly = PolynomialFeatures(degree=9)
X_train_p = poly.fit_transform(X_train)
X_test_p = poly.fit_transform(X_test)
plt.figure(figsize=(5,5))
plt.title("deg={}".format(9))
plt.plot(X_train, y_train.reshape(-1,1), 'r')
plt.show()
I expected the data points to be successively connected by straight lines; however, the outcome turns out like this:
I tried multiple variations of reshaping X_train and y_train using .reshape(), but didn't get the expected outcome.
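The symptom matches the ordering issue from the first answer above: plt.plot joins the points in the order they are given, and train_test_split shuffles X_train. A minimal sketch of sorting the training points before plotting, assuming X and y are 1-D NumPy arrays defined earlier in the notebook:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X.reshape(-1,1), y, random_state=0)
# Put the shuffled training points back in ascending x order before drawing lines
order = np.argsort(X_train[:, 0])
plt.figure(figsize=(5,5))
plt.title("deg={}".format(9))
plt.plot(X_train[order], y_train[order], 'r')
plt.show()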
I have been playing around with lasso regression on polynomial functions using the code below. My question is: should I be doing feature scaling as part of the lasso regression when attempting to fit a polynomial function? The R^2 results and plot from the code below suggest not. I'd appreciate any advice on why that is the case, or whether I have fundamentally stuffed something up. Thanks in advance.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
def answer_regression():
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import Lasso, LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.preprocessing import MinMaxScaler
    import matplotlib.pyplot as plt
    scaler = MinMaxScaler()
    global X_train, X_test, y_train, y_test
    degrees = 12
    poly = PolynomialFeatures(degree=degrees)
    X_train_poly = poly.fit_transform(X_train.reshape(-1,1))
    X_test_poly = poly.fit_transform(X_test.reshape(-1,1))
    # Lasso Regression Model
    X_train_scaled = scaler.fit_transform(X_train_poly)
    X_test_scaled = scaler.transform(X_test_poly)
    # No feature scaling
    linlasso = Lasso(alpha=0.01, max_iter = 10000).fit(X_train_poly, y_train)
    y_test_lassopredict = linlasso.predict(X_test_poly)
    Lasso_R2_test_score = r2_score(y_test, y_test_lassopredict)
    # With feature scaling
    linlasso = Lasso(alpha=0.01, max_iter = 10000).fit(X_train_scaled, y_train)
    y_test_lassopredict_scaled = linlasso.predict(X_test_scaled)
    Lasso_R2_test_score_scaled = r2_score(y_test, y_test_lassopredict_scaled)
    %matplotlib notebook
    plt.figure()
    plt.scatter(X_test, y_test, label='Test data')
    plt.scatter(X_test, y_test_lassopredict, label='Predict data - No Scaling')
    plt.scatter(X_test, y_test_lassopredict_scaled, label='Predict data - With Scaling')
    return (Lasso_R2_test_score, Lasso_R2_test_score_scaled)
answer_regression()
Your X range is around [0,10], so the polynomial features will have a much wider range. Without scaling, their weights are already small (because of their larger values), so Lasso will not need to set them to zero. If you scale them, their weights will be much larger, and Lasso will set most of them to zero. That's why it has a poor prediction for the scaled case (those features are needed to capture the true trend of y).
You can confirm this by getting the weights (linlasso.coef_) for both cases, where you will see that most of the weights for the second case (scaled one) are set to zero.
It seems your alpha is larger than an optimal value and should be tuned. If you decrease alpha, you will get similar results for both cases.
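A short sketch of that check, assuming the X_train_poly, X_train_scaled and y_train variables from the question are in scope; the alpha values are only illustrative:
import numpy as np
from sklearn.linear_model import Lasso
for name, features in [('no scaling', X_train_poly), ('with scaling', X_train_scaled)]:
    linlasso = Lasso(alpha=0.01, max_iter=10000).fit(features, y_train)
    n_zero = np.sum(np.isclose(linlasso.coef_, 0.0))
    print('%s: %d of %d weights set to zero' % (name, n_zero, linlasso.coef_.size))
    print(linlasso.coef_)
# Decreasing alpha (e.g. Lasso(alpha=1e-4, ...)) zeroes out fewer of the scaled
# weights and brings the two R^2 scores closer together.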
I have the following variables:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
def part1_scatter():
    %matplotlib notebook
    plt.figure()
    plt.scatter(X_train, y_train, label='training data')
    plt.scatter(X_test, y_test, label='test data')
    plt.legend(loc=4);
And the following question:
Write a function that fits a polynomial LinearRegression model on the training data X_train for degrees 1, 3, 6, and 9. (Use PolynomialFeatures in sklearn.preprocessing to create the polynomial features and then fit a linear regression model) For each model, find 100 predicted values over the interval x = 0 to 10 (e.g. np.linspace(0,10,100)) and store this in a numpy array. The first row of this array should correspond to the output from the model trained on degree 1, the second row degree 3, the third row degree 6, and the fourth row degree 9.
This is my code, but it doesn't work:
def answer_one():
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
results = []
pred_data = np.linspace(0,10,100)
degree = [1,3,6,9]
y_train1 = y_train.reshape(-1,1)
    for i in degree:
        poly = PolynomialFeatures(degree=i)
        pred_poly1 = poly.fit_transform(pred_data[:,np.newaxis])
        X_F1_poly = poly.fit_transform(X_train[:,np.newaxis])
        linreg = LinearRegression().fit(X_F1_poly, y_train1)
        pred = linreg.predict(pred_poly1)
        results.append(pred)
        dataArray = np.array(results).reshape(4, 100)
        return dataArray
I receive this error:
line 58
    for i in degree:
    ^
IndentationError: unexpected indent
Could you tell me where the problem is?
The return statement should run only after the for loop has finished, so it should be indented at the same level as the for, not further in (i.e., not inside the loop body).
Starting at the line n = 15 you stopped indenting, so that part isn't recognized as being inside the function. This can be solved by adding 4 spaces to the start of every line from n = 15 onwards.
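Putting the two answers together, a sketch of answer_one() with the indentation corrected: every line belongs to the function body, the loop body sits one level deeper, and the return is aligned with the for rather than nested inside it.
import numpy as np
from sklearn.model_selection import train_test_split
def answer_one():
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    np.random.seed(0)
    n = 15
    x = np.linspace(0,10,n) + np.random.randn(n)/5
    y = np.sin(x)+x/6 + np.random.randn(n)/10
    X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
    results = []
    pred_data = np.linspace(0,10,100)
    degree = [1,3,6,9]
    y_train1 = y_train.reshape(-1,1)
    for i in degree:
        poly = PolynomialFeatures(degree=i)
        pred_poly1 = poly.fit_transform(pred_data[:,np.newaxis])
        X_F1_poly = poly.fit_transform(X_train[:,np.newaxis])
        linreg = LinearRegression().fit(X_F1_poly, y_train1)
        results.append(linreg.predict(pred_poly1))
    # Runs once the loop has finished: one row per degree, 100 predictions each
    dataArray = np.array(results).reshape(4, 100)
    return dataArray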
We are trying to plot the predicted values and the truth values on the same graph, after fitting a model with a RandomForestRegressor in Python on a three-column dataset (click the link to download the full CSV dataset), formatted as follows:
t_stamp,X,Y
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10
Here is how we do the prediction.
import pandas as pd
import numpy as np
import glob, os
from io import StringIO
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
import math
from math import sqrt
# sklearn.cross_validation has been replaced by sklearn.model_selection
from sklearn.model_selection import train_test_split
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "data*.csv"))))
for i in range(1,10):
    df['X_t'+str(i)] = df['X'].shift(i)
print(df)
df.dropna(inplace=True)
X = pd.DataFrame({ 'X_%d'%i : df['X'].shift(i) for i in range(10)}).apply(np.nan_to_num, axis=0).values
y = df['Y'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)
reg = RandomForestRegressor(criterion='mse')
reg.fit(X_train,y_train)
modelPred_test = reg.predict(X_test)
print(modelPred_test)
For comparison, we wish to plot the values before and after prediction on the same graph. For the truth values, we tried this:
fig, ax = plt.subplots()
ax.plot(df['time'].values, df['Y'].values)
We wish to plot (in the same graph) the ground truth, with time as the x-axis and the value of Y as the y-axis. When we do
ax.plot(df['time'].values, modelPred_test)
We are getting the following error.
raise ValueError("x and y must have same first dimension")
ValueError: x and y must have same first dimension
This means that we have fewer prediction values than time stamps in our dataset. To verify this, I ran print(df['time'].values.shape) and print(modelPred_test.shape), which output (258523,) and (103410,) respectively. How can we work out which of the time values correspond to the prediction values, so that I can use that subset of time values in my plot command?
You have to set up your data like the following.
X = df.drop('Y', axis=1)
y = df['Y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)
X_train = X_train.drop('time', axis=1)
X_test = X_test.drop('time', axis=1)
and then sort the datasets
index_values=range(0,len(y_test))
y_test.sort_index(inplace=True)
X_test.sort_index(inplace=True)
modelPred_test = reg.predict(X_test)
ax.plot(pd.Series(index_values), y_test.values)
Finally, do the same plot for the predicted values of y. Hope this helps.
You need to keep track of the indices for training and testing datasets. For example, you could define
train_index, test_index = train_test_split(df.index, test_size=0.40)
and then X_train = X[train_index], etc.
Then, you could plot the results via ax.plot(df['time'][test_index].values, modelPred_test), since modelPred_test is already aligned with test_index.
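A minimal sketch of that index-tracking approach applied to the question's variables; it assumes X, y and df from the question are in scope, uses the 'time' column name from the question's plotting snippet, and resets df's index so labels and positions agree:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
df = df.reset_index(drop=True)               # make index labels match positions in X and y
train_index, test_index = train_test_split(df.index.values, test_size=0.40)
test_index = np.sort(test_index)             # plot the test points in time order
reg = RandomForestRegressor()
reg.fit(X[train_index], y[train_index])
modelPred_test = reg.predict(X[test_index])
# modelPred_test now lines up with test_index, so x and y have the same length
fig, ax = plt.subplots()
ax.plot(df['time'].values[test_index], y[test_index], label='truth')
ax.plot(df['time'].values[test_index], modelPred_test, label='predicted')
ax.legend()
plt.show()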