Regression using Python

I have the following variables:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
def part1_scatter():
    %matplotlib notebook
    plt.figure()
    plt.scatter(X_train, y_train, label='training data')
    plt.scatter(X_test, y_test, label='test data')
    plt.legend(loc=4);
And the following question:
Write a function that fits a polynomial LinearRegression model on the training data X_train for degrees 1, 3, 6, and 9. (Use PolynomialFeatures in sklearn.preprocessing to create the polynomial features and then fit a linear regression model) For each model, find 100 predicted values over the interval x = 0 to 10 (e.g. np.linspace(0,10,100)) and store this in a numpy array. The first row of this array should correspond to the output from the model trained on degree 1, the second row degree 3, the third row degree 6, and the fourth row degree 9.
This is my code, but it doesn't work:
def answer_one():
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
results = []
pred_data = np.linspace(0,10,100)
degree = [1,3,6,9]
y_train1 = y_train.reshape(-1,1)
    for i in degree:
        poly = PolynomialFeatures(degree=i)
        pred_poly1 = poly.fit_transform(pred_data[:,np.newaxis])
        X_F1_poly = poly.fit_transform(X_train[:,np.newaxis])
        linreg = LinearRegression().fit(X_F1_poly, y_train1)
        pred = linreg.predict(pred_poly1)
        results.append(pred)
        dataArray = np.array(results).reshape(4, 100)
        return dataArray
I receive this error:
line 58
    for i in degree:
    ^
IndentationError: unexpected indent
Could you tell me where the problem is?

The return statement should run only after the for loop has finished, so it should be aligned with the for statement itself, not indented further in (inside the loop body).

At the start of your line
n = 15
you stopped indenting, so from that point on the code is no longer recognized as part of the function. This can be fixed by putting four spaces at the start of every line from n = 15 onwards.
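For reference, a minimal sketch of answer_one with the indentation fixed as both answers suggest; it reuses the X_train and y_train already created at the top instead of rebuilding them inside the function (that simplification is mine):

def answer_one():
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    results = []
    pred_data = np.linspace(0, 10, 100)

    for i in [1, 3, 6, 9]:
        poly = PolynomialFeatures(degree=i)
        # build polynomial features for the training x and for the prediction grid
        X_poly = poly.fit_transform(X_train[:, np.newaxis])
        pred_poly = poly.transform(pred_data[:, np.newaxis])
        linreg = LinearRegression().fit(X_poly, y_train)
        results.append(linreg.predict(pred_poly))

    # aligned with the for loop, so it only runs once the loop is done
    return np.array(results).reshape(4, 100)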

Related

How to apply KNN from a large dataset to a small dataset or to just one test sample

I have trained and tested a KNN model on a supervised dataset of about 180 samples (6 classes of 30 samples each) in Python. I would like to apply these results to a small unsupervised dataset of 21 samples (3 classes of 7 samples).
The problem is that the datasets have different numbers of rows, so I either get an error about inconsistent numbers of samples, or I match the target to the new dataset and get a result that is not representative.
I want to see which classes of the large dataset the data from the new small dataset correspond to. Is there a way to do that?
Here is my code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import utils

data, y = utils.load_data()  # utils contains the large dataset
Y = pd.get_dummies(y).values
n_classes = Y.shape[1]

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

clf = KNeighborsClassifier()
for key in data:
    scores = cross_val_score(clf, data[key], y, cv=5)
    print("Accuracy for {:5s} : {:0.2f} (+/- {:0.2f})".format(
        key, scores.mean(), scores.std() * 2))

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

df = pd.read_csv('small dataset')
X = df.drop(columns=['subject', 'sessionIndex', 'rep'])
y = df['subject']
Y = pd.get_dummies(y).values
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=1, stratify=y)

n_neighbors = [2, 3, 4, 5, 6]
parameters = dict(n_neighbors=n_neighbors)
clf = KNeighborsClassifier()
grid = GridSearchCV(clf, parameters, cv=5)
grid.fit(X_train, Y_train)

results = grid.cv_results_
for i in range(1, 4):
    candidates = np.flatnonzero(results['rank_test_score'] == i)
    for candidate in candidates:
        print("Model with rank: {}".format(i))
        print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
            results['mean_test_score'][candidate],
            results['std_test_score'][candidate]))
        print("Parameters: {}".format(results['params'][candidate]))
        print()

from sklearn.metrics import accuracy_score, roc_curve, auc
Y_pred = grid.predict(X[1:2])
print(Y_pred)
So I'm getting an array [[0 0 1]], which is correct, but it isn't scored against the 6 classes of the large dataset, the way it is if I take X and Y from the large dataset instead of the small one:
data, y = utils.load_data()  # utils contains the large dataset
Y = pd.get_dummies(y).values
n_classes = Y.shape[1]
X = data['large dataset']
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=1, stratify=y)
Y_pred = grid.predict(X[1:2])
print(Y_pred)
This way the result is an array of 6 numbers, like [[0 0 0 0 0 1]]. I want to see the same kind of output when testing the new small dataset.
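As a side note, predict only requires that the new samples have the same feature columns as the data the classifier was fitted on, not the same number of rows. A hypothetical sketch of that pattern (X_large, y_large and X_small are placeholder names for the large dataset's features/labels and the small dataset's features, assuming both share the same feature columns):

from sklearn.neighbors import KNeighborsClassifier

# fit on the large labelled dataset (6 classes)
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_large, y_large)            # X_large: (180, n_features), y_large: (180,)

# classify the 21 new samples in the label space of the large dataset;
# X_small only has to have the same feature columns as X_large
pred = clf.predict(X_small)          # shape (21,), values drawn from the 6 classes
proba = clf.predict_proba(X_small)   # shape (21, 6), one column per class
print(pred)
print(proba)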

Arranging an array of `ConfusionMatrixDisplay` objects in a single plot using matplotlib plt.subplots()

I have an array of confusion matrices for different models, each of type ConfusionMatrixDisplay. I want to display them nicely in a single figure using plt.subplots. How can I achieve that? Sample code that I tried is attached below.
%matplotlib notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0)
clf = SVC(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
disp = ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred)

arr = [disp,disp,disp]*3
rows = 0
l = len(arr)
if l%4==0:
    rows = l//4
else:
    rows = l//4 + 1

fig,ax = plt.subplots(rows, 4, sharex='col', sharey='row', figsize=(6, 6))
print(len(arr))
for i in range(rows):
    for j in range(4):
        if(4*i + j < len(arr)):
            ax[i,j] = arr[4*i + j].ax_
plt.show()
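One common approach (an assumption on my part, not something stated in the question) is to let each display draw onto an existing subplot axis through its plot method, rather than reassigning ax[i, j]; a rough sketch reusing arr and rows from above:

fig, ax = plt.subplots(rows, 4, figsize=(12, 9))
for i in range(rows):
    for j in range(4):
        k = 4 * i + j
        if k < len(arr):
            # draw the k-th confusion matrix onto the existing subplot axis
            arr[k].plot(ax=ax[i, j], colorbar=False)
            ax[i, j].set_title("model {}".format(k))
        else:
            ax[i, j].axis("off")  # hide any unused axes
fig.tight_layout()
plt.show()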

ConvergenceWarning when running cross validation with SVM model

I tried to train a LinearSVC model and evaluate it with cross_val_score on a linearly separable dataset that I created, but I'm getting an error.
Here is a reproducible example:
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# creating the dataset
x1 = 2 * np.random.rand(100, 1)
y1 = 5 + 3 * x1 + np.random.randn(100, 1)
lable1 = np.zeros((100, 1))
x2 = 2 * np.random.rand(100, 1)
y2 = 15 + 3 * x2 + np.random.randn(100, 1)
lable2 = np.ones((100, 1))
x = np.concatenate((x1, x2))
y = np.concatenate((y1, y2))
lable = np.concatenate((lable1, lable2))
x = np.reshape(x, (len(x),))
y = np.reshape(y, (len(y),))
lable = np.reshape(lable, (len(lable),))
d = {'x':x, 'y':y, 'lable':lable}
df = pd.DataFrame(data=d)
df.plot(kind="scatter", x="x", y="y")
# preparing data and model
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)
X = train_set.drop("lable", axis=1)
y = train_set["lable"].copy()
scaler = StandardScaler()
scaler.fit_transform(X)
linear_svc = LinearSVC(C=5, loss="hinge", random_state=42)
linear_svc.fit(X, y)
# evaluation
scores = cross_val_score(linear_svc, X, y, scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
print("Mean:", rmse_scores.mean())
Output:
Mean: 0.0
/usr/local/lib/python3.7/dist-packages/sklearn/svm/_base.py:947: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
"the number of iterations.", ConvergenceWarning)
This is not an error, but a warning, and it already contains some advice:
increase the number of iterations
which by default is 1000 (docs).
Moreover, LinearSVC is a classifier, so using scoring="neg_mean_squared_error" (i.e. a regression metric) in cross_val_score makes no sense; see the documentation for a rough list of relevant metrics per kind of problem.
So, with the following changes:
linear_svc = LinearSVC(C=5, loss="hinge", random_state=42, max_iter=100000)
scores = cross_val_score(linear_svc, X, y, scoring="accuracy", cv=10)
your code runs OK without any error or warning.
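Putting the two changes into the original script, a minimal sketch of the revised model and evaluation step (same X and y as above) looks like this:

linear_svc = LinearSVC(C=5, loss="hinge", random_state=42, max_iter=100000)
linear_svc.fit(X, y)

# accuracy is an appropriate metric for a classifier
scores = cross_val_score(linear_svc, X, y, scoring="accuracy", cv=10)
print("Mean accuracy:", scores.mean())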

Why am I getting this error : Found input variables with inconsistent numbers of samples: [1, 15]

I am trying to solve the following problem but I am getting an error.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics.regression import r2_score
import numpy as np

degrees = np.arange(0, 9)
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10

for i in degrees:
    poly = PolynomialFeatures(i)
    x_poly = poly.fit_transform(x)
    X_train, X_test, y_train, y_test = train_test_split(x_poly, y, random_state = 0)
    linreg = LinearRegression().fit(X_train, y_train)
    r2_train = linreg.r2_score(X_train, y_train)
    r2_test = linreg.r2_train(X_test, y_test)
Found input variables with inconsistent numbers of samples: [1, 15]
Any idea why I am getting this error?
Three errors in the code:
You need to reshape x into a 2D numpy array by using x.reshape(-1,1).
linreg.r2_score is invalid. Also, no need to use r2_score. Just use linreg.score. This will return the coefficient of determination R^2 of the prediction (reference).
Since degrees starts at 0, the first iteration uses a degree-0 polynomial expansion, so use PolynomialFeatures(i+1) inside the loop unless you really intend a 0-degree expansion. Keep in mind that if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].
Full working example:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
import numpy as np
from sklearn.model_selection import train_test_split

degrees = np.arange(0, 9)
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10

for i in degrees:
    poly = PolynomialFeatures(i+1)
    x_poly = poly.fit_transform(x.reshape(-1,1))
    X_train, X_test, y_train, y_test = train_test_split(x_poly, y, random_state = 0)
    linreg = LinearRegression().fit(X_train, y_train)
    r2_train = linreg.score(X_train, y_train)
    r2_test = linreg.score(X_test, y_test)
You have not reshaped x; x should be of shape (n_samples, n_features). Also, linreg.r2_score does not exist. I modified the code as follows:
degrees = np.arange(0, 9)
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
x = x.reshape(-1, 1)

for i in degrees:
    poly = PolynomialFeatures(i)
    x_poly = poly.fit_transform(x)
    X_train, X_test, y_train, y_test = train_test_split(x_poly, y, random_state = 0)
    linreg = LinearRegression().fit(X_train, y_train)
    r2_train = linreg.score(X_train, y_train)
    r2_test = linreg.score(X_test, y_test)
Your code has a lot of mistakes and typos. It would be useful to first practice on some well-known solved problems, such as iris or a house-price regression problem.
Correct code:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import numpy as np

degrees = np.arange(0, 9)
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10

#### convert x into a 2D matrix ####
x = x.reshape(-1,1)

for i in degrees:
    poly = PolynomialFeatures(i)
    x_poly = poly.fit_transform(x)
    X_train, X_test, y_train, y_test = train_test_split(x_poly, y, random_state = 0)
    linreg = LinearRegression().fit(X_train, y_train)
    r2_train = r2_score(y_train, linreg.predict(X_train))
    r2_test = r2_score(y_test, linreg.predict(X_test))
    #### linreg.score(X_train, y_train) can also be used to calculate the r2 score

mismatch in sizes with test and train data during function print

The data file I would like to process has 71 records with two columns: one for the x value and a second for the y value. The main task is to split the data into a training part and a testing part, then plot the chosen fitted functions (in my example I've taken a linear one and a fourth-power (^4) one).
However, I've stumbled upon an error I can't solve.
Full description of the error:
File "zad1.py", line 25, in <module>
    v = np.linalg.pinv(c) @ y
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0,
with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 71 is different from 53)
Code:
from sklearn.model_selection import train_test_split
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt

a = np.loadtxt('dane10.txt')
x = a[:,[1]]
y = a[:,[0]]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33)

c = np.hstack([X_train, np.ones(X_train.shape)])
v = np.linalg.pinv(c) @ y

plt.plot(X_train, y_train, 'ro')
plt.plot(X_test, y_test, 'go')
plt.plot(X_train, v[0]*X_train + v[1])

c = np.hstack([
    X_train * X_train * X_train * X_train,
    X_train * X_train * X_train,
    X_train * X_train,
    X_train,
    np.ones(X_train.shape)])
v = np.linalg.pinv(c) @ y
plt.plot(v[0]*X_train^4 + v[1]*X_train^3 + v[2]*X_train^2 + v[3]*X_train + v[4])
plt.show()
Would appreciate any help :).
I've redone it a little and both functions are plotted now, but the ^4 one looks kinda weird... I mean something is not right here, because it doesn't follow the points of the diagram but is drawn much further along.
from sklearn.model_selection import train_test_split
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt

a = np.loadtxt('dane10.txt')
x = a[:,[1]]
y = a[:,[0]]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33)

c = np.hstack([x, np.ones(x.shape)])
v = np.linalg.pinv(c) @ y

plt.plot(X_train, y_train, 'ro')
plt.plot(X_test, y_test, 'go')
plt.plot(X_train, v[0]*X_train + v[1])

c = np.hstack([
    x * x * x * x,
    x * x * x,
    x * x,
    x,
    np.ones(x.shape)])
v = np.linalg.pinv(c) @ y
plt.plot(v[0]*X_train*X_train*X_train*X_train + v[1]*X_train*X_train*X_train +
         v[2]*X_train*X_train + v[3]*X_train + v[4])
plt.show()
The problem apparently happens when you multiply X_train * X_train.
Since it is not a square matrix, it cannot be matrix-multiplied by itself. Do you just need to raise each number in X_train to the 2nd through 4th power? In that case use numpy.multiply.
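A rough sketch along those lines, building the design matrix from elementwise powers (X_train ** k) and solving against the matching y_train so the row counts agree; the sorting step is my addition so the fitted curve is drawn left to right:

# design matrix from elementwise powers of the training x values
c = np.hstack([X_train**4, X_train**3, X_train**2, X_train, np.ones(X_train.shape)])
v = np.linalg.pinv(c) @ y_train   # pinv(c): (5, n_train); y_train: (n_train, 1) -> v: (5, 1)

xs = np.sort(X_train, axis=0)     # sort so the curve is plotted in x order
plt.plot(X_train, y_train, 'ro')
plt.plot(X_test, y_test, 'go')
plt.plot(xs, v[0]*xs**4 + v[1]*xs**3 + v[2]*xs**2 + v[3]*xs + v[4])
plt.show()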
