Please help me create a scatter plot for this classification algorithm. In y I have a column of labels (0, 1), and I want the predicted labels plotted in two different colors, one per label.
X = np.array(df.iloc[:, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]].values)
y = df.iloc[:, 17].values
dtc = DecisionTreeClassifier()
train_x, test_x, train_y, test_y = train_test_split(X, y, train_size = 0.8, shuffle = True)
kf = KFold(n_splits = 5)
dtc=dtc.fit(train_x, train_y)
dtc_labels = dtc.predict(test_x)
I don't have access to your dataframe, but here is a minimal working example, assuming I understood you correctly.
The point is that you have to use logical (boolean) indexing on your numpy arrays during plotting, as the last two lines exemplify.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, KFold
import matplotlib.pyplot as plt
# toy data: two identical features, first half labeled 0, second half labeled 1
X = np.zeros((100, 2))
X[:, 0] = np.arange(100)
X[:, 1] = np.arange(100)
y = [0] * 50 + [1] * 50
dtc = DecisionTreeClassifier()
train_x, test_x, train_y, test_y = train_test_split(X, y, train_size = 0.8, shuffle = True)
kf = KFold(n_splits = 5)
dtc=dtc.fit(train_x, train_y)
dtc_labels = dtc.predict(test_x)
plt.scatter(test_x[dtc_labels == 0, 0], test_x[dtc_labels == 0, 1])
plt.scatter(test_x[dtc_labels == 1, 0], test_x[dtc_labels == 1, 1])
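If you also want explicit colors and a legend, a variant of those two lines (same dtc_labels as above) could be:
plt.scatter(test_x[dtc_labels == 0, 0], test_x[dtc_labels == 0, 1], c='tab:blue', label='predicted 0')
plt.scatter(test_x[dtc_labels == 1, 0], test_x[dtc_labels == 1, 1], c='tab:orange', label='predicted 1')
plt.legend()
plt.show()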
I tried to train a LinearSVC model and evaluate it with cross_val_score on a linearly separable dataset that I created, but I'm getting an error.
Here is a reproducible example:
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# creating the dataset
x1 = 2 * np.random.rand(100, 1)
y1 = 5 + 3 * x1 + np.random.randn(100, 1)
lable1 = np.zeros((100, 1))
x2 = 2 * np.random.rand(100, 1)
y2 = 15 + 3 * x2 + np.random.randn(100, 1)
lable2 = np.ones((100, 1))
x = np.concatenate((x1, x2))
y = np.concatenate((y1, y2))
lable = np.concatenate((lable1, lable2))
x = np.reshape(x, (len(x),))
y = np.reshape(y, (len(y),))
lable = np.reshape(lable, (len(lable),))
d = {'x':x, 'y':y, 'lable':lable}
df = pd.DataFrame(data=d)
df.plot(kind="scatter", x="x", y="y")
# preparing data and model
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)
X = train_set.drop("lable", axis=1)
y = train_set["lable"].copy()
scaler = StandardScaler()
scaler.fit_transform(X)
linear_svc = LinearSVC(C=5, loss="hinge", random_state=42)
linear_svc.fit(X, y)
# evaluation
scores = cross_val_score(linear_svc, X, y, scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
print("Mean:", rmse_scores.mean())
Output:
Mean: 0.0
/usr/local/lib/python3.7/dist-packages/sklearn/svm/_base.py:947: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
"the number of iterations.", ConvergenceWarning)
This is not an error, but a warning, and it already contains some advice:
increase the number of iterations
which by default is 1000 (docs).
Moreover, LinearSVC is a classifier, so using scoring="neg_mean_squared_error" (i.e. a regression metric) in cross_val_score makes no sense; see the documentation for a rough list of relevant metrics per kind of problem.
So, with the following changes:
linear_svc = LinearSVC(C=5, loss="hinge", random_state=42, max_iter=100000)
scores = cross_val_score(linear_svc, X, y, scoring="accuracy", cv=10)
your code runs OK without any error or warning.
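Incidentally, scaler.fit_transform(X) returns the scaled array rather than modifying X in place, so the scaling in the question is never actually applied and the unscaled features are fed to LinearSVC. Assigning the result back (a small fix, assuming scaling was indeed intended) typically helps liblinear converge with far fewer iterations:
X = scaler.fit_transform(X)  # fit_transform returns the scaled copy; X itself is otherwise unchanged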
I have a train set with 10192 samples of '0' and 2512 samples of '1'.
I've applied PCA to the set to reduce the dimensionality.
I want to undersample this numpy array.
Here's my code:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv("train.csv")
X = df.drop(['label'], axis = 1)
y = df['label']
from sklearn.model_selection import train_test_split
X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size = 0.2, random_state = 42)
model = PCA(n_components = 19)
model.fit(X_train)
X_train_pca = model.transform(X_train)
X_validation_pca = model.transform(X_validation)
X_train = np.array(X_train_pca)
X_validation = np.array(X_validation_pca)
y_train = np.array(y_train)
y_validation = np.array(y_validation)
How can I undersample '0' class from X_train numpy array?
Try this after reading the CSV into df:
# class counts (value_counts sorts by frequency, so the majority class 0 comes first)
count_class_0, count_class_1 = df.label.value_counts()
# separate according to `label`
df_class_0 = df[df['label'] == 0]
df_class_1 = df[df['label'] == 1]
# sample from class 0 only as many rows as class 1 has
df_class_0_under = df_class_0.sample(count_class_1)
df_test_under = pd.concat([df_class_0_under, df_class_1], axis=0)
Then perform all calculations on the df_test_under data frame.
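For example, to rebuild the feature matrix and labels from it (a sketch reusing the column name from the question):
X = df_test_under.drop(['label'], axis=1)
y = df_test_under['label']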
Alternatively, use RandomUnderSampler from imbalanced-learn, applied directly to the (already PCA-transformed) training arrays:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
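If you prefer to avoid the imblearn dependency and undersample the numpy arrays directly, a minimal sketch along these lines should work, assuming y_train holds 0/1 labels as in the question:
rng = np.random.default_rng(0)
idx_0 = np.where(y_train == 0)[0]  # indices of the majority class
idx_1 = np.where(y_train == 1)[0]  # indices of the minority class
keep_0 = rng.choice(idx_0, size=len(idx_1), replace=False)  # draw as many 0s as there are 1s
keep = np.concatenate([keep_0, idx_1])
rng.shuffle(keep)  # shuffle so the classes are interleaved
X_train_under, y_train_under = X_train[keep], y_train[keep]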
I am trying to solve the following problem but I am getting an error.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics.regression import r2_score
import numpy as np
degrees = np.arange(0, 9)
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
for i in degrees:
    poly = PolynomialFeatures(i)
    x_poly = poly.fit_transform(x)
    X_train, X_test, y_train, y_test = train_test_split(x_poly, y, random_state = 0)
    linreg = LinearRegression().fit(X_train, y_train)
    r2_train = linreg.r2_score(X_train, y_train)
    r2_test = linreg.r2_train(X_test, y_test)
Found input variables with inconsistent numbers of samples: [1, 15]
Any idea why I am getting this error?
Three errors in the code:
You need to reshape x into a 2D numpy array by using x.reshape(-1,1).
linreg.r2_score is invalid; there is also no need to use r2_score at all. Just use linreg.score, which returns the coefficient of determination R^2 of the prediction (reference).
degrees starts at 0, and with a degree-0 polynomial expansion the train r2_score will be 0, so use PolynomialFeatures(i+1) inside the loop unless you really intend a 0-degree expansion. Keep in mind that if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].
Full working example:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
import numpy as np
from sklearn.model_selection import train_test_split
degrees = np.arange(0, 9)
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
for i in degrees:
    poly = PolynomialFeatures(i+1)
    x_poly = poly.fit_transform(x.reshape(-1,1))
    X_train, X_test, y_train, y_test = train_test_split(x_poly, y, random_state = 0)
    linreg = LinearRegression().fit(X_train, y_train)
    r2_train = linreg.score(X_train, y_train)
    r2_test = linreg.score(X_test, y_test)
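Note that r2_train and r2_test are overwritten on each pass, so if you want to compare degrees, print or collect them inside the loop, e.g.:
    print(f"degree {i+1}: train R^2 = {r2_train:.3f}, test R^2 = {r2_test:.3f}")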
You have not reshaped x; it should be of shape (n_samples, n_features). And linreg.r2_score does not exist; use linreg.score instead. I modified the code as follows:
degrees = np.arange(0, 9)
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
x = x.reshape(-1, 1)
for i in degrees:
    poly = PolynomialFeatures(i)
    x_poly = poly.fit_transform(x)
    X_train, X_test, y_train, y_test = train_test_split(x_poly, y, random_state = 0)
    linreg = LinearRegression().fit(X_train, y_train)
    r2_train = linreg.score(X_train, y_train)
    r2_test = linreg.score(X_test, y_test)
Your code has several mistakes and typos. It may help to first practice on a well-known solved problem such as iris or a house-price regression.
Corrected code:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import numpy as np
degrees = np.arange(0, 9)
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
# convert x into a 2D matrix
x = x.reshape(-1,1)
for i in degrees:
    poly = PolynomialFeatures(i)
    x_poly = poly.fit_transform(x)
    X_train, X_test, y_train, y_test = train_test_split(x_poly, y, random_state = 0)
    linreg = LinearRegression().fit(X_train, y_train)
    r2_train = r2_score(y_train, linreg.predict(X_train))
    r2_test = r2_score(y_test, linreg.predict(X_test))
    # linreg.score(X_train, y_train) can also be used to compute the r2 score
While applying LDA on my Churn_Modelling.csv file, everything goes well until the point where my X_train comes back with shape (8000, 1) instead of (8000, 2) as expected:
lda = LDA(n_components = 2)
X_train = lda.fit_transform(X_train, y_train)
X_train is beforehand one-hot encoded and feature scaled as follows:
# LDA
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
onehotencoder = OneHotEncoder(categorical_features = [1])
X = onehotencoder.fit_transform(X).toarray()
X = X[:, 1:]
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Applying LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components = 2)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)
While doing the same on another .csv file I have no trouble... do you have any idea why?
Thank you very much for your help!
I think I have the answer, but I would prefer confirmation if possible :-)
The maximal number of columns I can hope to obtain from transform is n - 1, where n is the number of classes; so in my case 2 classes (True, False) yield at most 1 column.
Am I right? Thank you again.
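That is correct: LDA yields at most min(n_classes - 1, n_features) components, so a binary target gives a single discriminant axis, regardless of how many input features survive the encoding. A quick sketch with hypothetical toy data to illustrate the limit (note that recent scikit-learn versions raise an error instead of silently clipping when n_components is too large):
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

X_toy = np.random.randn(100, 5)        # 5 features
y_toy = np.array([0] * 50 + [1] * 50)  # 2 classes
lda = LDA(n_components=1)              # at most n_classes - 1 = 1 component
print(lda.fit_transform(X_toy, y_toy).shape)  # (100, 1)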
I have the following variables:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
def part1_scatter():
    %matplotlib notebook
    plt.figure()
    plt.scatter(X_train, y_train, label='training data')
    plt.scatter(X_test, y_test, label='test data')
    plt.legend(loc=4);
And the following question:
Write a function that fits a polynomial LinearRegression model on the training data X_train for degrees 1, 3, 6, and 9. (Use PolynomialFeatures in sklearn.preprocessing to create the polynomial features and then fit a linear regression model) For each model, find 100 predicted values over the interval x = 0 to 10 (e.g. np.linspace(0,10,100)) and store this in a numpy array. The first row of this array should correspond to the output from the model trained on degree 1, the second row degree 3, the third row degree 6, and the fourth row degree 9.
This is my code, but it doesn't work:
def answer_one():
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
results = []
pred_data = np.linspace(0,10,100)
degree = [1,3,6,9]
y_train1 = y_train.reshape(-1,1)
    for i in degree:
        poly = PolynomialFeatures(degree=i)
        pred_poly1 = poly.fit_transform(pred_data[:,np.newaxis])
        X_F1_poly = poly.fit_transform(X_train[:,np.newaxis])
        linreg = LinearRegression().fit(X_F1_poly, y_train1)
        pred = linreg.predict(pred_poly1)
        results.append(pred)
        dataArray = np.array(results).reshape(4, 100)
        return dataArray
I receive this error:
line 58
    for i in degree:
    ^
IndentationError: unexpected indent
Could you tell me where the problem is?
The return statement should execute after the for loop is done, so it should be aligned with the for, not indented further in.
Starting at the line n = 15, you stopped indenting, so from that point on the code isn't recognized as part of the function body. This can be solved by putting 4 spaces in front of all lines from n = 15 onwards.
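Putting both fixes together, a corrected sketch of answer_one (same logic as the question, with consistent 4-space indentation and the return moved after the loop, assuming the module-level imports of numpy and train_test_split from the question) could look like this:
def answer_one():
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    np.random.seed(0)
    n = 15
    x = np.linspace(0, 10, n) + np.random.randn(n)/5
    y = np.sin(x) + x/6 + np.random.randn(n)/10
    X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)

    results = []
    pred_data = np.linspace(0, 10, 100)  # 100 evaluation points over [0, 10]
    degree = [1, 3, 6, 9]
    y_train1 = y_train.reshape(-1, 1)

    for i in degree:
        poly = PolynomialFeatures(degree=i)
        pred_poly1 = poly.fit_transform(pred_data[:, np.newaxis])
        X_F1_poly = poly.fit_transform(X_train[:, np.newaxis])
        linreg = LinearRegression().fit(X_F1_poly, y_train1)
        results.append(linreg.predict(pred_poly1))

    # the loop is done; reshape the four prediction rows and return once
    return np.array(results).reshape(4, 100)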