I was using matplotlib.pyplot to plot a continuous curve in a Jupyter notebook. I used the following code:
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
X_train, X_test, y_train, y_test = train_test_split(X.reshape(-1,1), y, random_state = 0)
poly = PolynomialFeatures(degree=9)
X_train_p = poly.fit_transform(X_train)
X_test_p = poly.transform(X_test)  # transform (not fit_transform), so the test set reuses the training fit
plt.figure(figsize=(5,5))
plt.title("deg={}".format(9))
plt.plot(X_train, y_train.reshape(-1,1), 'r')
plt.show()
I expected the data points to be connected successively by straight lines; however, the outcome turns out like this:
I tried multiple variations of reshaping X_train and y_train using .reshape(), but didn't get the expected outcome.
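For what it's worth, the same behaviour is explained in an answer further down: train_test_split shuffles the rows, so X_train is no longer in ascending order and plt.plot connects the points in whatever order they appear. A minimal sketch of that fix (an assumption, since the original post never shows a resolution):
import numpy as np
order = np.argsort(X_train.ravel())            # sort indices by the x values
plt.plot(X_train.ravel()[order], y_train[order], 'r')
plt.show()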
I am having trouble solving an array dimension problem in the code below. When I try to compute y_predict, a ValueError is raised. Here is the code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#importing dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:,1:2].values
y = dataset.iloc[:,2].values
y=np.reshape(y,(10,1))
#Splitting dataset into training set and test set
'''from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 0)'''
#Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y)
######## SVR regression
from sklearn.svm import SVR
svr_regressor = SVR(kernel='rbf') #rbf = gaussian kernel
svr_regressor.fit(X, y)
#Prediction of given value using SVR regression
X = np.reshape(X,(-1, 1))
y_predict = sc_y.inverse_transform(svr_regressor.predict(sc_X.transform([[6.5]])))
########### Visualization of svr model
plt.scatter(X, y, color = 'blue')
plt.plot(X, svr_regressor.predict(X), color = 'red')
plt.show()
I am getting this error:
ValueError: Expected 2D array, got 1D array instead:
array=[-0.27861589].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
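The reshape hint in the error message points at the likely cause: predict() returns a 1D array, and StandardScaler.inverse_transform expects a 2D column. A minimal sketch of that change, assuming the 1D array is indeed the prediction being passed to inverse_transform:
# Reshape the 1D prediction into a 2D column before inverse-scaling it
scaled_pred = svr_regressor.predict(sc_X.transform([[6.5]]))    # shape (1,)
y_predict = sc_y.inverse_transform(scaled_pred.reshape(-1, 1))  # shape (1, 1)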
I have a polynomial regression script that works correctly to predict values from X and Y axes; in my example I use CPU consumption. Below we see an example of the data set:
Complete data set
Where time represents the collection time, example:
1 = 1 minute
2 = 2 minutes
And so on...
And consume is the CPU usage value for that minute. In short, this data set shows the behavior of a host over a period of 30 minutes, each value corresponding to one minute in ascending order (1min, 2min, 3min, ...).
The result for this is:
With this algorithm:
# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('data.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Fitting Polynomial Regression to the dataset
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X)
pol_reg = LinearRegression()
pol_reg.fit(X_poly, y)
# Visualizing the Polynomial Regression results
def viz_polymonial():
    plt.scatter(X, y, color='red')
    plt.plot(X, pol_reg.predict(poly_reg.fit_transform(X)), color='blue')
    plt.title('Polynomial Regression for CPU')
    plt.xlabel('Time range')
    plt.ylabel('Consume')
    plt.show()
    return
viz_polymonial()
# 20 = time
print(pol_reg.predict(poly_reg.fit_transform([[20]])))
What's the problem?
If we duplicate this data set so that the 30-minute range appears twice, the algorithm does not handle the data set properly and its result is no longer useful. Example of the data set:
--> Up to time = 30
--> Up to time = 30
Complete data set
Note: in this case there are 60 values, where each block of 30 values represents a 30-minute range, as if they were different collection days.
The result it shows is this:
Objective: I would like the blue line representing the polynomial regression to be similar to the first result image. The one we see above shows a loop where the points are connected back and forth, as if the algorithm had failed.
Research source
The problem is that in the second case you plot using X = 1, 2, ..., 30, 1, 2, ..., 30. The plot function connects successive points, so the line doubles back on itself. If you just plotted a scatter using pyplot, you would see your nice regression curve. Or you could sort X with argsort before plotting. Here is the code with the scatter in green and the argsort line in black.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
# Importing the dataset
dataset = pd.read_csv('data.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X)
pol_reg = LinearRegression()
pol_reg.fit(X_poly, y)
# Visualizing the Polynomial Regression results
def viz_polymonial():
    plt.scatter(X, y, color='red')
    indices = np.argsort(X[:, 0])
    plt.scatter(X, pol_reg.predict(poly_reg.fit_transform(X)), color='green')
    plt.plot(X[indices], pol_reg.predict(poly_reg.fit_transform(X))[indices], color='black')
    plt.title('Polynomial Regression for CPU')
    plt.xlabel('Time range')
    plt.ylabel('Consume')
    plt.show()
    return
viz_polymonial()
# 20 = time
print(pol_reg.predict(poly_reg.fit_transform([[20]])))
Here is the output image for the larger dataset.
Hi there!
I'm studying the IBM Data Science course on Coursera and I'm trying to create some snippets to practice. I've created the following code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
# Import and format the dataframes
ibov = pd.read_csv('https://raw.githubusercontent.com/thiagobodruk/datasets/master/ibov.csv')
ifix = pd.read_csv('https://raw.githubusercontent.com/thiagobodruk/datasets/master/ifix.csv')
ibov['DATA'] = pd.to_datetime(ibov['DATA'], format='%d/%m/%Y')
ifix['DATA'] = pd.to_datetime(ifix['DATA'], format='%d/%m/%Y')
ifix = ifix.sort_values(by='DATA', ascending=False)
ibov = ibov.sort_values(by='DATA', ascending=False)
ibov = ibov[['DATA','FECHAMENTO']]
ibov.rename(columns={'FECHAMENTO':'IBOV'}, inplace=True)
ifix = ifix[['DATA','FECHAMENTO']]
ifix.rename(columns={'FECHAMENTO':'IFIX'}, inplace=True)
# Merge datasets
df_idx = ibov.merge( ifix, how='left', on='DATA')
df_idx.set_index('DATA', inplace=True)
df_idx.head()
# Split training and testing samples
x_train, x_test, y_train, y_test = train_test_split(df_idx['IBOV'], df_idx['IFIX'], test_size=0.2)
# Convert the samples to Numpy arrays
regr = linear_model.LinearRegression()
x_train = np.array([x_train])
y_train = np.array([y_train])
x_test = np.array([x_test])
y_test = np.array([y_test])
# Plot the result
regr.fit(x_train, y_train)
y_pred = regr.predict(y_train)
plt.scatter(x_train, y_train)
plt.plot(x_test, y_pred, color='blue', linewidth=3) # This line produces no result
I experienced some issues with the output values returned by the train_test_split() method, so I converted them to NumPy arrays, and then my code worked. I can plot my scatter plot normally, but I can't plot my prediction line.
Running this code on my IBM Data Cloud Notebook produces the following warning:
/opt/conda/envs/Python36/lib/python3.6/site-packages/matplotlib/axes/_base.py:380: MatplotlibDeprecationWarning:
cycling among columns of inputs with non-matching shapes is deprecated.
cbook.warn_deprecated("2.2", "cycling among columns of inputs "
I searched on Google and here on Stack Overflow, but I can't figure out what is wrong.
I'll appreciate some assistance. Thanks in advance!
There are several issues in your code, such as y_pred = regr.predict(y_train) (the prediction should be made from the features, not the target) and the way you draw the line.
The following code snippet should set you in the right direction:
# Split training and testing samples
x_train, x_test, y_train, y_test = train_test_split(df_idx['IBOV'], df_idx['IFIX'], test_size=0.2)
# Convert the samples to Numpy arrays
regr = linear_model.LinearRegression()
x_train = x_train.values
y_train = y_train.values
x_test = x_test.values
y_test = y_test.values
# Plot the result
plt.scatter(x_train, y_train)
regr.fit(x_train.reshape(-1,1), y_train)
idx = np.argsort(x_train)
y_pred = regr.predict(x_train[idx].reshape(-1,1))
plt.plot(x_train[idx], y_pred, color='blue', linewidth=3);
To do the same for the test subset with the already fitted model:
# Plot the result
plt.scatter(x_test, y_test)
idx = np.argsort(x_test)
y_pred = regr.predict(x_test[idx].reshape(-1,1))
plt.plot(x_test[idx], y_pred, color='blue', linewidth=3);
Feel free to ask questions if you have any.
I am trying to imitate code that I found on Kaggle for plotting SVM decision boundaries. I am using my own dataset with 608 samples, 10 features, and 2 classes. Those 2 classes, for instance, represent whether you're diabetic or not. I copied the SVM part of the code from that link (you can find it if you scroll down to the bottom), where decision boundary visualisation is discussed. Here's the link to my reference.
However, I get an error saying "X must be a NumPy array". Can someone explain to me what this means?
The code below is what I've done. Note that my dataset has been normalised beforehand. Also, I'm splitting the data into a 70:30 ratio.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.pyplot as show
import matplotlib as cm
import matplotlib.colors as colors
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn import svm
from mlxtend.plotting import plot_decision_regions
autism = pd.read_csv('diabetec.csv')
x = autism.drop(['TARGET'], axis = 1)
y = autism['TARGET']
x_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.30, random_state=1)
t = np.array(y_train)
t = t.astype(np.integer)
clf_svm = SVC(C=1.3, gamma=0.8, kernel='rbf')
clf_svm.fit(x_train, t)
plt.figure(figsize=[15,10])
plot_decision_regions(x_train, t, clf = clf_svm, hide_spines = False, colors = 'purple,limegreen', markers = ['x','o'])
plt.title('Support Vector Machine')
plot_decision_regions expects a NumPy array, but x_train is a pandas DataFrame. Try with x_train.values, i.e.
plot_decision_regions(x_train.values, t, clf = clf_svm, ...
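For reference, here is a self-contained sketch with synthetic two-feature data (hypothetical, not the asker's diabetec.csv) showing the same point. Note that with the asker's 10 features, mlxtend would additionally need to be told which two features to draw and what values to fill in for the rest (its feature_index / filler_feature_values options), which is a separate issue from the NumPy-array error:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from mlxtend.plotting import plot_decision_regions

# Hypothetical two-feature data, only to illustrate the array-vs-DataFrame point
rng = np.random.RandomState(1)
df = pd.DataFrame(rng.randn(200, 2), columns=['f1', 'f2'])
target = (df['f1'] + df['f2'] > 0).astype(int)

clf = SVC(C=1.3, gamma=0.8, kernel='rbf')
clf.fit(df.values, target.values)

# Passing the DataFrame/Series directly raises "X must be a NumPy array";
# .values (or .to_numpy()) avoids it.
plot_decision_regions(df.values, target.values, clf=clf)
plt.title('Support Vector Machine')
plt.show()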
I'm a little new to modeling techniques and I'm trying to compare SVR and Linear Regression. I used the linear function f(x) = 5x + 10 to generate the training and test data sets. I've written the following code snippet so far:
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.linear_model import LinearRegression
with open('test.csv', 'r') as f1:
    train_dataframe = pd.read_csv(f1)
X_train = train_dataframe.iloc[:30,(0)]
y_train = train_dataframe.iloc[:30,(1)]
with open('test.csv','r') as f2:
    test_dataframe = pd.read_csv(f2)
X_test = test_dataframe.iloc[30:,(0)]
y_test = test_dataframe.iloc[30:,(1)]
svr = svm.SVR(kernel="rbf", gamma=0.1)
log = LinearRegression()
svr.fit(X_train.reshape(-1,1),y_train)
log.fit(X_train.reshape(-1,1), y_train)
predSVR = svr.predict(X_test.reshape(-1,1))
predLog = log.predict(X_test.reshape(-1,1))
plt.plot(X_test, y_test, label='true data')
plt.plot(X_test, predSVR, 'co', label='SVR')
plt.plot(X_test, predLog, 'mo', label='LogReg')
plt.legend()
plt.show()
As you can see in the picture, Linear Regression works fine but SVR has poor prediction accuracy.
Please let me know if you have any suggestions to tackle this issue.
Thanks
The reason is that SVR with the rbf kernel does not apply feature scaling on its own. You need to apply feature scaling before fitting the data to the model.
Sample Code For Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X = sc_X.fit_transform(X)
sc_y = StandardScaler()
y = sc_y.fit_transform(y.reshape(-1, 1))  # StandardScaler expects a 2D column, so reshape y first
Please see the code below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
X = np.linspace(0,100,101)
y = np.array([(100*np.random.rand(1)+num) for num in (5*X+10)])
X_train, X_test, y_train, y_test = train_test_split(X, y)
svr = SVR(kernel='linear')
lm = LinearRegression()
svr.fit(X_train.reshape(-1,1),y_train.flatten())
lm.fit(X_train.reshape(-1,1), y_train.flatten())
pred_SVR = svr.predict(X_test.reshape(-1,1))
pred_lm = lm.predict(X_test.reshape(-1,1))
plt.plot(X,y, label='True data')
plt.plot(X_test[::2], pred_SVR[::2], 'co', label='SVR')
plt.plot(X_test[1::2], pred_lm[1::2], 'mo', label='Linear Reg')
plt.legend(loc='upper left');
The reason you were going nowhere was the rbf kernel.
If we adjust the SVR rbf model with constraints like this:
svr_rbf=SVR(kernel='rbf', C=1e3, gamma=0.1)
we will see a different result, as in the chart below. The green stars are the new SVR rbf model's predictions. Hope it helps.
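A minimal sketch of how that tuned model could be fitted and overlaid on the same data (variable names continue from the snippet above; the exact plotting style is an assumption):
# Fit the tuned rbf model on the same training data and overlay its predictions
svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.1)
svr_rbf.fit(X_train.reshape(-1, 1), y_train.flatten())
pred_rbf = svr_rbf.predict(X_test.reshape(-1, 1))
plt.plot(X, y, label='True data')
plt.plot(X_test, pred_rbf, 'g*', label='SVR rbf (C=1e3, gamma=0.1)')
plt.legend(loc='upper left')
plt.show()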