Scikit learn wrong predictions with SVC - python

I am trying to predict the MNIST (http://pjreddie.com/projects/mnist-in-csv/) dataset with an SVM using the radial kernel. I want to train with few examples (e.g. 1000) and predict many more. The problem is that whenever I predict, the predictions are constant unless the indices of the test set coincide with those of the training set. That is, suppose I train with examples 1:1000 from my training examples. Then, the predictions are correct (i.e. the SVM does its best) for 1:1000 of my test set, but then I get the same output for the rest. If however I train with examples 2001:3000, then only the test examples corresponding to those rows in the test set are labeled correctly (i.e. not with the same constant). I am completely at a loss, and I think that there is some sort of bug, because the exact same code works just fine with LinearSVC, although evidently the accuracy of the method is lower.
First, I train with examples 501:1000 of training data:
# dat_train/test are pandas DFs corresponding to both MNIST datasets
dat_train = pd.read_csv('data/mnist_train.csv', header=None)
dat_test = pd.read_csv('data/mnist_train.csv', header=None)
svm = SVC(C=10.0)
idx = range(1000)
#idx = np.random.choice(range(len(dat_train)), size=1000, replace=False)
X_train = dat_train.iloc[idx,1:].reset_index(drop=True).as_matrix()
y_train = dat_train.iloc[idx,0].reset_index(drop=True).as_matrix()
X_test = dat_test.reset_index(drop=True).as_matrix()[:,1:]
y_test = dat_test.reset_index(drop=True).as_matrix()[:,0]
svm.fit(X=X_train[501:1000,:], y=y_train[501:1000])
Here you can see that about half the predictions are wrong
y_pred = svm.predict(X_test[:1000,:])
confusion_matrix(y_test[:1000], y_pred)
All wrong (i.e. constant)
y_pred = svm.predict(X_test[:500,:])
confusion_matrix(y_test[:500], y_pred)
This is what I would expect to see for all test data
y_pred = svm.predict(X_test[501:1000,:])
confusion_matrix(y_test[501:1000], y_pred)
You can check that all of the above are correct using LinearSVC!

The default kernel is RBF, in which case gamma matters. If gamma is not provided, it is auto by default, which is 1/n_features. You'd better run grid search to find the optimal parameters. Here I just illustrate the result is normal given suitable parameters.
In [120]: svm = SVC(C=1, gamma=0.0000001)
In [121]: svm.fit(X=X_train[501:1000,:], y=y_train[501:1000])
Out[121]:
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma=1e-07, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
In [122]: y_pred = svm.predict(X_test[:1000,:])
In [123]: confusion_matrix(y_test[:1000], y_pred)
Out[123]:
array([[ 71, 0, 2, 0, 2, 9, 1, 0, 0, 0],
[ 0, 123, 0, 0, 0, 1, 1, 0, 1, 0],
[ 2, 5, 91, 1, 1, 1, 3, 7, 5, 0],
[ 0, 1, 4, 48, 0, 40, 1, 5, 7, 1],
[ 0, 0, 0, 0, 88, 2, 3, 2, 0, 15],
[ 1, 1, 1, 0, 2, 77, 0, 3, 1, 1],
[ 3, 0, 3, 0, 5, 4, 72, 0, 0, 0],
[ 0, 2, 3, 0, 3, 0, 1, 88, 1, 1],
[ 2, 0, 1, 2, 3, 9, 1, 4, 63, 4],
[ 0, 1, 0, 0, 16, 3, 0, 11, 1, 62]])

Finding good parameters for an SVC is an art in itself. Grid Search might help, better works some population based training like in this article - i recently tried it. If you let it run the same time, it has better results than GridSearch. If you let it run until the accuracy is the same, it is faster.
It also helps to make a graphic: let the x and y axis be C and gamma, and plot the prediction scores as color. Usually you will find kind of a V-Shape with the best training results at the point where the two lines meet. At the same time this point has low C-Values, too, which is desirable because C determines the runtime of the SVC: High C makes a long runtime.

Related

Python Linear Regression - Error with data dimensions

I am having a 1-dimensional error within my code. I'm attempting to create a linear regression on stock prices to predict a few months in the future and, frankly, I'm very confused. I've been tweaking this program for the last few hours and I can't seem to get it right.
# -*- coding: utf-8 -*-
"""
Created on Mon Dec 5 20:45:06 2022
#author: samwa
"""
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np
from pandas_datareader import data
from sklearn.preprocessing import MinMaxScaler
HUM = data.DataReader('HUM', 'yahoo', '1970-01-01')
HUM.to_csv('HUM Stock Data.csv')
df = pd.read_csv('HUM Stock Data.csv')
df.shape
df = df['Open'].values
df = df.reshape(-1, 1)
df.shape
dataset_train = np.array(df[:int(df.shape[0]*0.8)])
dataset_test = np.array(df[int(df.shape[0]*0.8):])
print(dataset_train.shape)
print(dataset_test.shape)
scaler = MinMaxScaler(feature_range=(0, 1))
dataset_train = scaler.fit_transform(dataset_train)
dataset_train[:5]
dataset_test = scaler.transform(dataset_test)
dataset_test[:5]
def create_dataset(df):
x = []
y = []
for i in range(20, df.shape[0]):
x.append(df[i-50:i, 0])
y.append(df[i, 0])
x = np.array(x)
y = np.array(y)
return x, y
x = dataset_train
y = dataset_train
# Build dummy variables for categorical variables
x = pd.get_dummies(x)
dataset_train = pd.get_dummies(dataset_train)
dataset_test = pd.get_dummies(dataset_test)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
model = LinearRegression()
x_train = np.reshape(x_train, (-1, 1))
y_train = np.reshape(y_train.values, (-1, 1))
x_test = np.reshape(x_test.values, (-1, 1))
y_test = np.reshape(y_test.values, (-1, 1))
model.fit(x_train, y_train)
predictions = model.predict(x_test)
fig = plt.figure(dpi=128, figsize=(10, 6))
plt.title("Humana Reality v Prediction", fontsize=16)
plt.xlabel('Date', fontsize=10)
fig.autofmt_xdate()
plt.ylabel("Price", fontsize=10)
plt.plot(y_test, color='green', label='Original price')
plt.plot(predictions, color='red', label='Predicted price')
plt.legend(loc="center left")
I have updated the passage below with np.reshape
model = LinearRegression()
x_train = np.reshape(x_train, (-1, 1))
y_train = np.reshape(y_train.values, (-1, 1))
x_test = np.reshape(x_test.values, (-1, 1))
y_test = np.reshape(y_test.values, (-1, 1))
However I am still receiving the 1 dimensional data error. Furthermore, I don't believe my testing data is any good because when I run it the predicted value and the actual historical value nearly completely overlap. I could really use some help with this one, I'm definitely at a lose.
Error log below:
File "C:\Users\s\anaconda3\lib\site-packages\pandas\core\construction.py", line 627, in _sanitize_ndim
raise ValueError("Data must be 1-dimensional")
ValueError: Data must be 1-dimensional
runfile('C:/Users/s/Desktop/HUM to CSV.py', wdir='C:/Users/s/Desktop')
(8256, 1)
(2064, 1)
Traceback (most recent call last):
File "C:\Users\s\Desktop\HUM to CSV.py", line 81, in <module>
x = pd.get_dummies(x)
File "C:\Users\s\anaconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 948, in get_dummies
result = _get_dummies_1d(
File "C:\Users\s\anaconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 972, in _get_dummies_1d
codes, levels = factorize_from_iterable(Series(data))
File "C:\Users\s\anaconda3\lib\site-packages\pandas\core\series.py", line 439, in __init__
data = sanitize_array(data, index, dtype, copy)
File "C:\Users\s\anaconda3\lib\site-packages\pandas\core\construction.py", line 576, in sanitize_array
subarr = _sanitize_ndim(subarr, data, dtype, index, allow_2d=allow_2d)
File "C:\Users\s\anaconda3\lib\site-packages\pandas\core\construction.py", line 627, in _sanitize_ndim
raise ValueError("Data must be 1-dimensional")
ValueError: Data must be 1-dimensional`
Not easy without a minimal reproducible example, with some mock up data included.
But see
df=pd.DataFrame({'col':[1,2,3,1,2]})
df.values
#array([[1],
# [2],
# [3],
# [1],
# [2]])
pd.get_dummies(df.values)
# ValueError: Data must be 1-dimensional
df.col.values
# array([1, 2, 3, 1, 2])
pd.get_dummies(df.col.values)
# 1 2 3
#0 1 0 0
#1 0 1 0
#2 0 0 1
#3 1 0 0
#4 0 1 0
From what I see in your code, without running it, you are in the first case (single column, but yet, still a 2d-matrix of a single column). You want to be in the second case (a 1d-matrix of that column).
In your case, if depending on your data (seems to have no column names, and have only 1 column), you can get_dummies on df[0].values, or starting from your array, on x[:,0]
Unrelated side-note, but I can't avoid it: don't ever iterate over rows. There is always a better way. For example, here, I feel that what you are looking for is np.lib.stride_tricks.sliding_window_view, to get a matrix of n rows and 50 columns, showing 50 subsequent values of x (and since that is just a view, you don't actually build a matrix).
So, I would
df=pd.DataFrame({0:[1,2,3,1,2,1,2,1,1,1,2,2,1]})
def create_dataset(df, nhist=5):
xdum = pd.get_dummies(df[0]).values
nr,nc=xdum.shape
xview = np.lib.stride_tricks.as_strided(xdum.ravel(), shape=(nr-nhist, nc*nhist), strides=(nc,1))
return xview, df[0].values[nhist-1:]
Plus, I think it solve your dilemma: you need 1 column to get dummies, but after that you need to but them in 50 (5 in my example) columns of dummies
See result, assuming previous df[0] are x and df[1] are y.
(array([[1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0],
[0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0],
[0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0],
[1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0],
[0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0],
[1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
[0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
[1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0]], dtype=uint8),
array([2, 1, 2, 1, 1, 1, 2, 2, 1]))

How to get the predicted classes from predicted images

So I have a fine-tuned model that returns among other features, the predicted images, I want to get the predicted classes from those images but I'm not able to get with it, this is intended to compute a Confusion matrix either manually or using scikit learn, I'm able to get with each class of my original dataset but I'm struggling to get with the predicted classes of each image, so far this is a patch of my code:
predlist = torch.zeros(0, dtype=torch.long, device='cpu')
lbllist = torch.zeros(0, dtype=torch.long, device='cpu')
with torch.no_grad():
for i, (inputs, classes) in enumerate(val_loader):
inputs = inputs.to(device) # [1, 13, 224, 224]
classes = classes.to(device)
outputs = model_ft(inputs)[1] # 0 is LOSS, 1 is [1, 196, 3328] is PRED, 2 is [1, 196] is MASK,
# 3 is [1, 13, 224, 224] is TARGET
# !! return loss, pred, mask, target_2d
#outputs = outputs[1]
model = models_mae_mod.__dict__['mae_vit_small_patch16'](in_chans=13, feature='raw')
#loss, pred, mask, target = outputs
#print(loss, pred.shape, mask.shape)
#outputs = torch.Tensor(np.stack((loss, pred, mask, target), -1))
#model = models_mae_mod.__dict__['mae_vit_small_patch16'](in_chans=13)
outputs = model.unpatchify(outputs) #[1, 13, 224, 224]
#lab = torch.argmax(outputs, 1)
_, preds = torch.max(outputs, 1)
#pred_c = torch.argmax(preds)
predlist = torch.cat([predlist, preds.view(-1).cpu()])
lbllist = torch.cat([lbllist, classes.view(-1).cpu()])
if i > 10:
break
After debugging I got with this values:
a = torch.max(outputs, 1)
a
torch.return_types.max(
values=tensor([[[0.5419, 0.3766, 1.0952, ..., 0.9223, 0.7693, 1.0980],
[1.9111, 1.4176, 0.9902, ..., 1.3873, 0.9266, 0.6857],
[0.8174, 0.5505, 0.8097, ..., 0.8501, 0.1761, 1.0284],
...,
[0.5996, 0.4945, 0.8258, ..., 0.8206, 1.3554, 1.1564],
[0.3814, 0.7084, 0.8026, ..., 0.6130, 1.1291, 1.3241],
[1.4426, 1.3198, 0.9262, ..., 0.9011, 0.7266, 0.8977]]],
device='cuda:0'),
indices=tensor([[[ 2, 3, 4, ..., 0, 1, 10],
[ 7, 9, 3, ..., 6, 9, 4],
[ 2, 4, 3, ..., 9, 7, 1],
...,
[10, 0, 1, ..., 11, 9, 3],
[ 6, 7, 10, ..., 7, 11, 8],
[ 6, 7, 8, ..., 7, 4, 4]]], device='cuda:0'))
_.shape
torch.Size([1, 224, 224])
preds.shape
torch.Size([1, 224, 224])
This is just for one image, I know that the indices follows each value of the values tensor, however I cannot understand how to get with the probability or the prediction for this image and not the entire tensor representing the image. Do you have any idea how to get with that information? How could I get with the predicted classes for each image?
PD: The indices tensor looks like the prediction but I'm not sure, since it has values from 0 to 12 when the dataset just have 10 classes thus, seems to be other thing.

Clustering an array of values without using thresholds

I want to segment A 1D dataset where each value represents an error into 2 segments:
A cluster with the smallest values
All the others
Example:
X = np.array([1, 1.5, 0.4, 1.1, 23, 24, 22.5, 21, 20, 25, 40, 50, 50, 51, 52, 53]).reshape(-1, 1)
In this small example, I would like to regroup the 4 first values in a cluster and forget about the others. I do not want a solution based on a threshold. The point is that the cluster of interest centroid will not always have the same value. It might be 1e-6, or it might be 1e-3, or it might be 1.
My idea was to use a k-means clustering algorithm, which would work fine if I did know how many clusters existed in my data. In the example above, the number is 3, one around 1 (the cluster of interest), one around 22, and one around 51. But sadly, I do not know the number of clusters... Simply searching for 2 clusters will not lead to a segmentation of the dataset as intended.
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans.labels_
Returns a cluster 1 way too large, which also includes the data from the cluster centered around 22.
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
I did find some interesting answers on methods to select the k, but it complexifies the algorithm and I feel like there must be a far better way to solve this problem.
I'm open to any suggestions and example which could work on the X array provided.
You might find AffinityPropagation useful here, as it does not require to specify the amount of clusters to generate. You might have to tune however the damping factor and preference, so that it produces the expected results.
On the provided example, the default parameters seem to do the job:
from sklearn.cluster import AffinityPropagation
X = np.array([1, 1.5, 0.4, 1.1, 23, 24, 22.5,
21, 20, 25, 40, 50, 50, 51, 52, 53]).reshape(-1, 1)
ap = AffinityPropagation(random_state=12).fit(X)
y = ap.predict(X)
print(y)
# array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2], dtype=int64)
To obtain individual clusters from X, you can index using y:
first_cluster = X[y==0].ravel()
first_cluster
# array([1. , 1.5, 0.4, 1.1])
second_cluster = X[y==1].ravel()
second_cluster
# array([23. , 24. , 22.5, 21. , 20. , 25. ])

SKLearn - Unusually high performance with Random Forest using a single feature

I am using Random Forest as a binary classifier for a dataset and the results just don't seem believable, but I can't find where the problem is.
The problem lies in the fact that the examples are clearly not separable by setting a threshold, as the values for the feature of interest for the positive/negative examples are highly homogeneous. When only a single feature is used for binary classification, RF should only be able to discriminate between examples by setting an absolute threshold for positive/negative identification, right? If that's the case, how can the code below result in perfect performance on the test set?
P.S. In practice I have many more than the ~30 examples shown below, but only included these as an example. Same performance when evaluating >100.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
X_train = np.array([0.427948, 0.165065, 0.31179, 0.645415, 0.125764,
0.448908, 0.417467, 0.524891, 0.038428, 0.441921,
0.927511, 0.556332, 0.243668, 0.565939, 0.265502,
0.122271, 0.275983, 0.60786, 0.670742, 0.565939,
0.117031, 0.117031, 0.001747, 0.148472, 0.038428,
0.50393, 0.49607, 0.148472, 0.275983, 0.191266,
0.254148, 0.430568, 0.198253, 0.323144, 0.29869,
0.344978, 0.524891, 0.323144, 0.344978, 0.28821,
0.441921, 0.127511, 0.31179, 0.254148, 0, 0.001747,
0.243668, 0.281223, 0.281223, 0.427948, 0.548472,
0.927511, 0.417467, 0.282969, 0.367686, 0.198253,
0.572926, 0.29869, 0.570306, 0.183406, 0.310044,
1, 1, 0.60786, 0, 0.282969, 0.349345, 0.521106,
0.430568, 0.127511, 0.50393, 0.367686, 0.310044,
0.556332, 0.670742, 0.30393, 0.548472, 0.193886,
0.349345, 0.122271, 0.193886, 0.265502, 0.537991,
0.165065, 0.191266])
y_train = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0,
0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,
1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0,
1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1,
0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1,
0, 0, 1, 0, 0, 0, 0])
X_test = np.array((0.572926, 0.521106, 0.49607, 0.570306, 0.645415,
0.125764, 0.448908, 0.30393, 0.183406, 0.537991))
y_test = np.array((1, 1, 1, 0, 0, 0, 1, 1, 0, 0))
# Instantiate model and set parameters
clf = RandomForestClassifier()
clf.set_params(n_estimators=500, criterion='gini', max_features='sqrt')
# Note: reshape is because RF requires column vector format, # but
default NumPy is row
clf.fit(X_train.reshape(-1, 1), y_train)
pred = clf.predict(X_test.reshape(-1, 1))
# sort by feature value for comparison
o = np.argsort(X_test)
print('Example#\tX\t\t\tY_test\tY_true')
for i in o:
print('%d\t\t\t%f\t%d\t%d' % (i, X_test[i], y_test[i], pred[i]))
Which then returns:
Example# X Y_test Y_true
5 0.125764 0 0
8 0.183406 0 0
7 0.303930 1 1
6 0.448908 1 1
2 0.496070 1 1
1 0.521106 1 1
9 0.537991 0 0
3 0.570306 0 0
0 0.572926 1 1
4 0.645415 0 0
How can an RF model with a single feature possibly discriminate these examples? Isn't there something wrong? I've looked into the configuration of the classifier and whatnot and can't find any problems. I was thinking that maybe it was a problem of overfitting (however I'm doing 10-fold cross validation, so that seems less likely), but then I came across this quote on the official webpage for Random Forest classification - ”Random forests does not overfit. You can run as many trees as you want.” (https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#remarks)
When only a single feature is used for binary classification, RF should only be able to discriminate between examples by setting an absolute threshold for positive/negative identification, right?
Each branch can discriminate only by one threshold, but each tree is built up by several branches. If the X-space can be split into several intervals such that each interval has the same y-value, then as long as the classifier has enough data to get the boundaries of those intervals, it will be able to predict the test set. However, I noticed that your "test" set seems to be a subset of your train set, which defeats the purpose of having a test set. Of course if you test it on data than you trained on, the accuracy will be high. Try sorting your data by X-value, then taking X-values that aren't in your training set, but are between two adjacent X_train values that have different y-values. For instance, x=.001. You should see accuracy plummet.

Machine learning for finding even/odd number getting incorrect/correct output for two different classifiers

I tried a Machine Learning algorithm on a hypothetical problem :-
I made a fake feature vector and a fake result data set by the following python code :-
x=[]
y=[]
for i in range(0,100000):
mylist=[]
mylist.append(i)
mylist.append(i)
x.append(mylist)
if(i%2)==0:
y.append(0)
else:
y.append(1)
The above code gives me 2 python lists, namely,
x = [[0,0],[1,1],[2,2]....and so on] #this list contains the fake feature vector, with 2 same numbers
y = [0,1,0..... and so on] #this has the fake test labels, 0 for even, 1 for odd
I think the test data is good enough for a ML algorithm to learn. I use the following python code to train a couple of different machine learning models.
Approach 1 : Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(x,y)
x_pred = [[1,1],[2,2],[3,3],[4,4],[5,5],[6,6],[7,7],[8,8],[9,9],[10,10],[11,11],[12,12],[13,13],[14,14],[15,15],[16,16]]
y_pred=gnb.predict(x_pred)
print y_pred
I get the following incorrect output, the classifier fails to predict :-
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Approach 2 : Support Vector Machines
from sklearn import svm
clf = svm.SVC()
clf.fit(x, y)
x_pred = [[1,1],[2,2],[3,3],[4,4],[5,5],[6,6],[7,7],[8,8],[9,9],[10,10],[11,11],[12,12],[13,13],[14,14],[15,15],[16,16]]
y_pred=clf.predict(x_pred)
print y_pred
I get the following correct output, the classifier fails to predict :-
[1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0]
Can someone put light on this and explain why one approach had 50% accuracy and the other one had 100% accuracy.
Let me know if this question is tagged with a wrong category.
Naive Bayes is a parametric model: it tries to summarize your training set in nine parameters, the class prior (50% for either class) and the per-class, per-feature means and variances. However, your target value y is not a function of the means and variances of the inputs x in any way,(*) so the parameters are irrelevant and the model resorts to what is effectively random guessing.
By contrast, the support vector machine remembers its training set and compares new inputs to its training inputs using a kernel function. It's supposed to pick a subset of its training samples, but for this problem it's forced to just remember all of them:
>>> x = np.vstack([np.arange(100), np.arange(100)]).T
>>> y = x[:, 0] % 2
>>> from sklearn import svm
>>> clf = svm.SVC()
>>> clf.fit(x, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
>>> clf.support_vectors_.shape
(100, 2)
Since you're using test samples that occurred in the training set, all it has to do is look up the label that the samples you presented had in the training set and return those, so you get 100% accuracy. If you feed the SVM samples outside of the training set, you'll see that it too starts guessing randomly:
>>> clf.predict(x * 2)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1])
Since multiplying by two makes all the features even, the true labeling would have been all zero and the accuracy is 50%: the accuracy of a random guess.
(*) Actually there is some dependence in the training set, but that drops off with more data.

Categories

Resources