ConvergenceWarning when running cross validation with SVM model - python

I tried to train a LinearSVC model and evaluate it with cross_val_score on a linearly separable dataset that I created, but I'm getting an error.
Here is a reproducible example:
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# creating the dataset
x1 = 2 * np.random.rand(100, 1)
y1 = 5 + 3 * x1 + np.random.randn(100, 1)
lable1 = np.zeros((100, 1))
x2 = 2 * np.random.rand(100, 1)
y2 = 15 + 3 * x2 + np.random.randn(100, 1)
lable2 = np.ones((100, 1))
x = np.concatenate((x1, x2))
y = np.concatenate((y1, y2))
lable = np.concatenate((lable1, lable2))
x = np.reshape(x, (len(x),))
y = np.reshape(y, (len(y),))
lable = np.reshape(lable, (len(lable),))
d = {'x':x, 'y':y, 'lable':lable}
df = pd.DataFrame(data=d)
df.plot(kind="scatter", x="x", y="y")
# preparing data and model
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)
X = train_set.drop("lable", axis=1)
y = train_set["lable"].copy()
scaler = StandardScaler()
scaler.fit_transform(X)
linear_svc = LinearSVC(C=5, loss="hinge", random_state=42)
linear_svc.fit(X, y)
# evaluation
scores = cross_val_score(linear_svc, X, y, scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
print("Mean:", rmse_scores.mean())
Output:
Mean: 0.0
/usr/local/lib/python3.7/dist-packages/sklearn/svm/_base.py:947: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
"the number of iterations.", ConvergenceWarning)

This is not an error, but a warning, and it already contains some advice:
increase the number of iterations
which by default is 1000 (docs).
Moreover, LinearSVC is a classifier, so using scoring="neg_mean_squared_error" (i.e. a regression metric) in cross_val_score makes no sense; see the documentation for a rough list of relevant metrics per kind of problem.
So, with the following changes:
linear_svc = LinearSVC(C=5, loss="hinge", random_state=42, max_iter=100000)
scores = cross_val_score(linear_svc, X, y, scoring="accuracy", cv=10)
your code runs OK without any error or warning.
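One further detail in the question's code: the result of scaler.fit_transform(X) is discarded, so the model is trained on unscaled features, which may also slow convergence. A minimal sketch, assuming the same kind of two-band data as in the question, that puts the scaler and the classifier in a pipeline so that scaling is applied inside each cross-validation fold. The variable names, the seed, and the explicit dual=True are choices made for this illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(42)  # seed chosen for this sketch

# Two noisy, roughly parallel bands of points, similar to the question's data
x1 = 2 * rng.random(100)
y1 = 5 + 3 * x1 + rng.standard_normal(100)
x2 = 2 * rng.random(100)
y2 = 15 + 3 * x2 + rng.standard_normal(100)

X = np.column_stack([np.concatenate([x1, x2]), np.concatenate([y1, y2])])
labels = np.concatenate([np.zeros(100), np.ones(100)])

# The pipeline refits the scaler on the training part of every fold
model = make_pipeline(
    StandardScaler(),
    LinearSVC(C=5, loss="hinge", dual=True, max_iter=100000, random_state=42),
)

# Accuracy is a classification metric, as recommended in the answer
scores = cross_val_score(model, X, labels, scoring="accuracy", cv=10)
print("Mean accuracy:", scores.mean())
```

Passing the pipeline rather than the bare classifier also avoids fitting the scaler on data that later ends up in a validation fold.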

Related

Linear Regression Neural Network Tensorflow Keras Python program

I wrote a small linear regression neural network in Python with TensorFlow Keras.
The input dataset is straight-line data, y = mx + c.
The predicted y values are not correct: they lie on a nearly horizontal line
instead of a line with some slope.
I ran this program on a Windows laptop with TensorFlow, Keras and
Jupyter Notebook.
What should I do to fix this program?
Thanks and best regards,
SSJ
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
n2 = 50
count = 20
n4 = n2 + count
p = 100
m = 10
c = 5
x = np.linspace(n2, n4, p)
y = m * x + c
x
y
plt.scatter(x,y)
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing
x_normalizer = preprocessing.Normalization(input_shape=[1,])
x_normalizer.adapt(x)
x_normalized = x_normalizer(x)
y_normalizer = preprocessing.Normalization(input_shape=[1,])
y_normalizer.adapt(y)
y_normalized = x_normalizer(y)
y_model = tf.keras.Sequential([
y_normalizer,
layers.Dense(1)
])
y_model.compile(optimizer='rmsprop', loss='mse', metrics = ['mae'])
y_hist = y_model.fit(x, y, epochs=100, verbose=0, validation_split = 0.2)
hist = pd.DataFrame(y_hist.history)
hist['epoch'] = y_hist.epoch
hist.head()
hist.tail()
xin = [51,53,59,64]
ypred = y_model.predict(xin)
ypred
plt.scatter(x, y)
plt.scatter(xin, ypred, color = 'r')
plt.grid(linestyle = '--')
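Use StandardScaler instead of Normalization.
In scikit-learn, Normalizer acts row-wise and StandardScaler column-wise.
Normalizer does not remove the mean and scale by the standard deviation; it scales
each whole row to unit norm.
Found here: Difference between StandardScaler and Normalizer
Note also that in the question's model, y_normalizer (adapted to y) is the first layer, so it is applied to the input x, while the targets passed to fit are the unscaled y.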
This is how you can process the data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from sklearn.preprocessing import StandardScaler
x = np.linspace(50, 70, 100).reshape(-1, 1)
y = 10 * x + 5
x_standard_scaler = StandardScaler().fit(x)
y_standard_scaler = StandardScaler().fit(y)
x_scaled = x_standard_scaler.transform(x)
y_scaled = y_standard_scaler.transform(y)
Remember that you need two separate scalers for x and y so don't use the same object for that. Also if you want to use that scaler to process new data for testing, save the scaler in some variable. A good practice is to not refit the scaler again on test data.
model = Sequential([
Dense(1, input_dim=1, activation='linear'),
])
model.compile(optimizer='rmsprop', loss='mse')
history = model.fit(x_scaled, y_scaled, epochs=1000, verbose=0, validation_split = 0.2).history
pd.DataFrame(history).plot()
plt.show()
As you can see, the model is converging. It's worth plotting the loss history, which helps to tell whether your model is learning or not.
x_test = np.linspace(20, 100, 10).reshape(-1, 1)
y_test = 10 * x_test + 5
x_test_scaled = x_standard_scaler.transform(x_test)
y_test_scaled = y_standard_scaler.transform(y_test)
If you have test data that you want to use for validation or prediction, remember to apply the standard scaler again, but without refitting it. In most cases it should be fitted on the training data only.
y_test_pred_scaled = model.predict(x_test_scaled)
y_test_pred = y_standard_scaler.inverse_transform(y_test_pred_scaled)
plt.scatter(x_test, y_test, s=30, label='true')
plt.scatter(x_test, y_test_pred, s=15, label='pred')
plt.legend()
plt.show()
If you want to get your prediction rescaled back to its original range use inverse_transform. Notice that prediction on x_test after rescaling is very close to y_test.
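The scaling round trip described above can be sketched without TensorFlow. In this sketch scikit-learn's LinearRegression is only a stand-in for the Keras model (an assumption made to keep the example light); the point is that both scalers are fitted once on the training data and then reused for the test data and for inverse_transform.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Training data: the same straight line as in the answer
x_train = np.linspace(50, 70, 100).reshape(-1, 1)
y_train = 10 * x_train + 5

# Fit one scaler per variable, on the training data only
x_scaler = StandardScaler().fit(x_train)
y_scaler = StandardScaler().fit(y_train)

# Stand-in model (not the Keras network), trained in the scaled space
model = LinearRegression().fit(
    x_scaler.transform(x_train), y_scaler.transform(y_train)
)

# Test data outside the training range; scalers are reused, not refitted
x_test = np.linspace(20, 100, 10).reshape(-1, 1)
y_test = 10 * x_test + 5

y_test_pred_scaled = model.predict(x_scaler.transform(x_test))
# Map predictions back to the original units of y
y_test_pred = y_scaler.inverse_transform(y_test_pred_scaled)

print(np.allclose(y_test_pred, y_test))
```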

Why am I getting this error : Found input variables with inconsistent numbers of samples: [1, 15]

I am trying to solve the following problem but I am getting an error.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics.regression import r2_score
import numpy as np
degrees = np.arange(0, 9)
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
for i in degrees:
    poly = PolynomialFeatures(i)
    x_poly = poly.fit_transform(x)
    X_train, X_test, y_train, y_test = train_test_split(x_poly, y, random_state = 0)
    linreg = LinearRegression().fit(X_train, y_train)
    r2_train = linreg.r2_score(X_train, y_train)
    r2_test = linreg.r2_train(X_test, y_test)
Found input variables with inconsistent numbers of samples: [1, 15]
Why am I getting this error?
Three errors in the code:
You need to reshape x into a 2D numpy array by using x.reshape(-1,1).
linreg.r2_score is invalid. Also, no need to use r2_score. Just use linreg.score. This will return the coefficient of determination R^2 of the prediction (reference).
The first value in degrees is 0, so use PolynomialFeatures(i+1) inside the loop unless you really intend to use a degree-0 polynomial expansion. Keep in mind that if an input sample is two-dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].
Full working example:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
import numpy as np
from sklearn.model_selection import train_test_split
degrees = np.arange(0, 9)
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
for i in degrees:
    poly = PolynomialFeatures(i+1)
    x_poly = poly.fit_transform(x.reshape(-1,1))
    X_train, X_test, y_train, y_test = train_test_split(x_poly, y, random_state = 0)
    linreg = LinearRegression().fit(X_train, y_train)
    r2_train = linreg.score(X_train, y_train)
    r2_test = linreg.score(X_test, y_test)
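To make the feature ordering and the reshape requirement concrete, here is a small sketch. The sample values 2 and 3 stand in for a and b and are chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One two-dimensional sample [a, b] with a = 2, b = 3
sample = np.array([[2.0, 3.0]])
features = PolynomialFeatures(2).fit_transform(sample)
# Columns are ordered as 1, a, b, a^2, ab, b^2
print(features)

# A 1D array of n values must become an (n, 1) column before transforming
x = np.linspace(0, 10, 5)
x_poly = PolynomialFeatures(3).fit_transform(x.reshape(-1, 1))
print(x.shape, x_poly.shape)
```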
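You have not reshaped x; it should have shape (n_samples, n_features). Also, linreg.r2_score does not exist. I modified the code as follows:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
degrees = np.arange(0, 9)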
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
x = x.reshape(-1, 1)
for i in degrees:
    poly = PolynomialFeatures(i)
    x_poly = poly.fit_transform(x)
    X_train, X_test, y_train, y_test = train_test_split(x_poly, y, random_state = 0)
    linreg = LinearRegression().fit(X_train, y_train)
    r2_train = linreg.score(X_train, y_train)
    r2_test = linreg.score(X_test, y_test)
Your code has lots of mistakes and typos. It would be useful to first practice on some well-known solved problems, such as the iris dataset or a house price regression problem.
Correct code:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import numpy as np
degrees = np.arange(0, 9)
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
# convert x into a 2D matrix
x = x.reshape(-1, 1)
for i in degrees:
    poly = PolynomialFeatures(i)
    x_poly = poly.fit_transform(x)
    X_train, X_test, y_train, y_test = train_test_split(x_poly, y, random_state = 0)
    linreg = LinearRegression().fit(X_train, y_train)
    r2_train = r2_score(y_train, linreg.predict(X_train))
    r2_test = r2_score(y_test, linreg.predict(X_test))
    # linreg.score(X_train, y_train) can also be used to calculate the R^2 score
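As a quick check of the closing comment, the following sketch compares the two ways of computing R^2 on the question's data. The polynomial degree 3 is an arbitrary choice for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(0)
n = 15
x = np.linspace(0, 10, n) + np.random.randn(n) / 5
y = np.sin(x) + x / 6 + np.random.randn(n) / 10

# Degree 3 is picked only for the comparison
x_poly = PolynomialFeatures(3).fit_transform(x.reshape(-1, 1))
X_train, X_test, y_train, y_test = train_test_split(x_poly, y, random_state=0)
linreg = LinearRegression().fit(X_train, y_train)

via_method = linreg.score(X_test, y_test)                 # estimator method
via_function = r2_score(y_test, linreg.predict(X_test))   # metric function
print(np.isclose(via_method, via_function))
```

Note the argument order: r2_score takes the true values first and the predictions second.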

Plot scatter for classification algorithm

Please help me create a scatter plot for this classification algorithm. In y I have a column of labels (0, 1), and I want the predicted labels plotted in two different colors, one per label.
X = np.array(df.iloc[: , [0, 1,2,3,4,5,6,7,8,9,10,]].values)
y = df.iloc[: , 17].values
dtc = DecisionTreeClassifier()
train_x, test_x, train_y, test_y = train_test_split(X, y, train_size = 0.8, shuffle = True)
kf = KFold(n_splits = 5)
dtc=dtc.fit(train_x, train_y)
dtc_labels = dtc.predict(test_x)
I don't have access to your dataframes, but here is a minimum working example, assuming I understood right.
The point is that you have to use logical indexing for your numpy arrays during plotting. This is exemplified by the last two lines.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, KFold
import matplotlib.pyplot as plt
X = np.zeros((100,2))
X[:,0] = np.array(list(range(100)))
X[:,1] = np.array(list(range(100)))
y = list([0] * 50 + [1] * 50)
dtc = DecisionTreeClassifier()
train_x, test_x, train_y, test_y = train_test_split(X, y, train_size = 0.8, shuffle = True)
kf = KFold(n_splits = 5)
dtc=dtc.fit(train_x, train_y)
dtc_labels = dtc.predict(test_x)
plt.scatter(test_x[dtc_labels == 0,0],test_x[dtc_labels == 0,1])
plt.scatter(test_x[dtc_labels == 1,0],test_x[dtc_labels == 1,1])
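A slightly extended sketch of the same boolean-mask idea, with an explicit color and legend entry per predicted label. The synthetic data, the colors, and the Agg backend are assumptions made so the example runs without the original dataframe or a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature data standing in for the real dataframe
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

train_x, test_x, train_y, test_y = train_test_split(
    X, y, train_size=0.8, random_state=0
)
dtc = DecisionTreeClassifier(random_state=0).fit(train_x, train_y)
dtc_labels = dtc.predict(test_x)

fig, ax = plt.subplots()
for label, color in [(0, "tab:blue"), (1, "tab:orange")]:
    mask = dtc_labels == label  # boolean mask selecting one predicted class
    ax.scatter(test_x[mask, 0], test_x[mask, 1], c=color,
               label=f"predicted {label}")
ax.legend()
```

In an interactive session, calling plt.show() afterwards displays the figure. With more than two features, two columns have to be picked for the axes.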

mismatch in sizes with test and train data during function print

The data file I would like to process has 71 records, each with two columns: one for the x value and one for the y value. The main task is to select a training part and a testing part and plot the chosen functions (in my example I've taken a linear one and an exponential (^4) one).
However I've stumbled upon error I can't solve.
Full description of the error:
File="zad1.py", line 25, in module
v = np.linalg.pinv(c) @ y
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0,
with gufunc signature (n?, k),(k, m?)->(n?,m?) (size 71 is different from 53)
code
from sklearn.model_selection import train_test_split
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt
a = np.loadtxt('dane10.txt')
x = a[:,[1]]
y = a[:,[0]]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
c = np.hstack([X_train, np.ones(X_train.shape)])
v = np.linalg.pinv(c) @ y
plt.plot(X_train, y_train, 'ro')
plt.plot(X_test, y_test, 'go')
plt.plot(X_train,v[0]*X_train + v[1])
c = np.hstack([
X_train * X_train * X_train * X_train,
X_train * X_train * X_train,
X_train * X_train,
X_train,
np.ones(X_train.shape)])
v = np.linalg.pinv(c) @ y
plt.plot(v[0]*X_train^4 + v[1]*X_train^3 + v[2]*X_train^2 + v[3]*X_train +v[4])
plt.show()
Would appreciate any help :).
I've redone it a little and both functions are being plotted now, but the exponential one looks weird: it does not follow the points on the diagram and is drawn far away from them.
from sklearn.model_selection import train_test_split
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt
a = np.loadtxt('dane10.txt')
x = a[:,[1]]
y = a[:,[0]]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
c = np.hstack([x, np.ones(x.shape)])
v = np.linalg.pinv(c) @ y
plt.plot(X_train, y_train, 'ro')
plt.plot(X_test, y_test, 'go')
plt.plot(X_train,v[0]*X_train + v[1])
c = np.hstack([
x * x * x * x,
x * x * x,
x * x,
x,
np.ones(x.shape)])
v = np.linalg.pinv(c) @ y
plt.plot(v[0]*X_train*X_train*X_train*X_train + v[1]*X_train*X_train*X_train
+
v[2]*X_train*X_train + v[3]*X_train +v[4])
plt.show()
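The problem is not X_train * X_train itself: for NumPy arrays, * is element-wise (the same as numpy.multiply), so it works on non-square arrays.
The matmul error comes from multiplying np.linalg.pinv(c) by y: c is built from X_train only, while y still holds all 71 rows, so use y_train there. To raise each number in X_train to the 2nd-4th power, use X_train ** 2 and so on (or numpy.power); in Python, ^ is bitwise XOR and does not work on float arrays.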
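A minimal sketch of the pseudo-inverse fit with consistent shapes, assuming synthetic data in place of dane10.txt (the file is not available here, so the 71 points and the true coefficients below are made up). Both the design matrix and the target come from the training split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for dane10.txt: 71 noisy samples of a quartic
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 71).reshape(-1, 1)
y = 0.5 * x**4 - x**2 + 0.3 * x + 1 + rng.normal(scale=0.05, size=x.shape)

X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=0
)

# Linear model: columns [x, 1]; pinv(c1) and y_train share the row count
c1 = np.hstack([X_train, np.ones(X_train.shape)])
v1 = np.linalg.pinv(c1) @ y_train

# Quartic model: ** is the element-wise power operator
c4 = np.hstack([X_train**4, X_train**3, X_train**2, X_train,
                np.ones(X_train.shape)])
v4 = np.linalg.pinv(c4) @ y_train

# Evaluate on a sorted grid so a line plot does not zig-zag
grid = np.linspace(x.min(), x.max(), 200).reshape(-1, 1)
fit4 = np.hstack([grid**4, grid**3, grid**2, grid, np.ones(grid.shape)]) @ v4

print(v1.ravel())
print(v4.ravel())
```

Plotting grid against fit4 (rather than the shuffled X_train) gives a smooth curve, which may also explain a curve that appears far from the points when plt.plot is called with y values only.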

Regression using Python

I have the following variables:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
def part1_scatter():
    %matplotlib notebook
    plt.figure()
    plt.scatter(X_train, y_train, label='training data')
    plt.scatter(X_test, y_test, label='test data')
    plt.legend(loc=4);
And the following question:
Write a function that fits a polynomial LinearRegression model on the training data X_train for degrees 1, 3, 6, and 9. (Use PolynomialFeatures in sklearn.preprocessing to create the polynomial features and then fit a linear regression model) For each model, find 100 predicted values over the interval x = 0 to 10 (e.g. np.linspace(0,10,100)) and store this in a numpy array. The first row of this array should correspond to the output from the model trained on degree 1, the second row degree 3, the third row degree 6, and the fourth row degree 9.
This is my code, but it doesn't work:
def answer_one():
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
results = []
pred_data = np.linspace(0,10,100)
degree = [1,3,6,9]
y_train1 = y_train.reshape(-1,1)
for i in degree:
poly = PolynomialFeatures(degree=i)
pred_poly1 = poly.fit_transform(pred_data[:,np.newaxis])
X_F1_poly = poly.fit_transform(X_train[:,np.newaxis])
linreg = LinearRegression().fit(X_F1_poly, y_train1)
pred = linreg.predict(pred_poly1)
results.append(pred)
dataArray = np.array(results).reshape(4, 100)
return dataArray
I receive this error:
line 58 for i
in degree: ^ IndentationError: unexpected
indent
Could you tell me where the problem is?
The return statement should run after the for loop is done, so it should be aligned with the for statement, not indented further in.
Starting at the line
n = 15
you stopped indenting, so that part isn't recognized as belonging to the function. This can be solved by indenting all lines from n = 15 onwards by 4 spaces.
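Putting both answers together, a consistently indented version might look like the sketch below. It assumes the data setup from the question and is one possible layout, not the only correct one: every line of the function body is indented by four spaces, the loop body by eight, and return is aligned with for.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Data setup copied from the question
np.random.seed(0)
n = 15
x = np.linspace(0, 10, n) + np.random.randn(n) / 5
y = np.sin(x) + x / 6 + np.random.randn(n) / 10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)


def answer_one():
    pred_data = np.linspace(0, 10, 100).reshape(-1, 1)
    results = []
    for degree in [1, 3, 6, 9]:
        poly = PolynomialFeatures(degree=degree)
        # Fit the feature expansion on the training data ...
        X_train_poly = poly.fit_transform(X_train.reshape(-1, 1))
        linreg = LinearRegression().fit(X_train_poly, y_train)
        # ... and reuse it for the 100 prediction points
        results.append(linreg.predict(poly.transform(pred_data)))
    # Aligned with "for": runs once, after all four degrees are done
    return np.array(results)


predictions = answer_one()
print(predictions.shape)
```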
