I am getting a "Data must be 1-dimensional" error in my code. I'm attempting to fit a linear regression on stock prices to predict a few months into the future and, frankly, I'm very confused. I've been tweaking this program for the last few hours and I can't seem to get it right.
# -*- coding: utf-8 -*-
"""
Created on Mon Dec 5 20:45:06 2022
#author: samwa
"""
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np
from pandas_datareader import data
from sklearn.preprocessing import MinMaxScaler
HUM = data.DataReader('HUM', 'yahoo', '1970-01-01')
HUM.to_csv('HUM Stock Data.csv')
df = pd.read_csv('HUM Stock Data.csv')
df.shape
df = df['Open'].values
df = df.reshape(-1, 1)
df.shape
dataset_train = np.array(df[:int(df.shape[0]*0.8)])
dataset_test = np.array(df[int(df.shape[0]*0.8):])
print(dataset_train.shape)
print(dataset_test.shape)
scaler = MinMaxScaler(feature_range=(0, 1))
dataset_train = scaler.fit_transform(dataset_train)
dataset_train[:5]
dataset_test = scaler.transform(dataset_test)
dataset_test[:5]
def create_dataset(df):
    x = []
    y = []
    for i in range(20, df.shape[0]):
        x.append(df[i-50:i, 0])
        y.append(df[i, 0])
    x = np.array(x)
    y = np.array(y)
    return x, y
x = dataset_train
y = dataset_train
# Build dummy variables for categorical variables
x = pd.get_dummies(x)
dataset_train = pd.get_dummies(dataset_train)
dataset_test = pd.get_dummies(dataset_test)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
model = LinearRegression()
x_train = np.reshape(x_train, (-1, 1))
y_train = np.reshape(y_train.values, (-1, 1))
x_test = np.reshape(x_test.values, (-1, 1))
y_test = np.reshape(y_test.values, (-1, 1))
model.fit(x_train, y_train)
predictions = model.predict(x_test)
fig = plt.figure(dpi=128, figsize=(10, 6))
plt.title("Humana Reality v Prediction", fontsize=16)
plt.xlabel('Date', fontsize=10)
fig.autofmt_xdate()
plt.ylabel("Price", fontsize=10)
plt.plot(y_test, color='green', label='Original price')
plt.plot(predictions, color='red', label='Predicted price')
plt.legend(loc="center left")
I have updated the section below to use np.reshape:
model = LinearRegression()
x_train = np.reshape(x_train, (-1, 1))
y_train = np.reshape(y_train.values, (-1, 1))
x_test = np.reshape(x_test.values, (-1, 1))
y_test = np.reshape(y_test.values, (-1, 1))
However, I am still receiving the 1-dimensional data error. Furthermore, I don't believe my testing data is any good, because when I run it the predicted values and the actual historical values almost completely overlap. I could really use some help with this one; I'm definitely at a loss.
Error log below:
File "C:\Users\s\anaconda3\lib\site-packages\pandas\core\construction.py", line 627, in _sanitize_ndim
raise ValueError("Data must be 1-dimensional")
ValueError: Data must be 1-dimensional
runfile('C:/Users/s/Desktop/HUM to CSV.py', wdir='C:/Users/s/Desktop')
(8256, 1)
(2064, 1)
Traceback (most recent call last):
File "C:\Users\s\Desktop\HUM to CSV.py", line 81, in <module>
x = pd.get_dummies(x)
File "C:\Users\s\anaconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 948, in get_dummies
result = _get_dummies_1d(
File "C:\Users\s\anaconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 972, in _get_dummies_1d
codes, levels = factorize_from_iterable(Series(data))
File "C:\Users\s\anaconda3\lib\site-packages\pandas\core\series.py", line 439, in __init__
data = sanitize_array(data, index, dtype, copy)
File "C:\Users\s\anaconda3\lib\site-packages\pandas\core\construction.py", line 576, in sanitize_array
subarr = _sanitize_ndim(subarr, data, dtype, index, allow_2d=allow_2d)
File "C:\Users\s\anaconda3\lib\site-packages\pandas\core\construction.py", line 627, in _sanitize_ndim
raise ValueError("Data must be 1-dimensional")
ValueError: Data must be 1-dimensional
Not easy without a minimal reproducible example with some mock-up data included.
But see
df=pd.DataFrame({'col':[1,2,3,1,2]})
df.values
#array([[1],
# [2],
# [3],
# [1],
# [2]])
pd.get_dummies(df.values)
# ValueError: Data must be 1-dimensional
df.col.values
# array([1, 2, 3, 1, 2])
pd.get_dummies(df.col.values)
# 1 2 3
#0 1 0 0
#1 0 1 0
#2 0 0 1
#3 1 0 0
#4 0 1 0
From what I see in your code, without running it, you are in the first case (a single column, but still a 2-D matrix of one column). You want to be in the second case (a 1-D array of that column).
In your case, depending on your data (it seems to have no column names and only one column), you can call get_dummies on df[0].values or, starting from your array, on x[:, 0].
Unrelated side note, but I can't avoid it: don't ever iterate over rows. There is always a better way. For example, here I feel that what you are looking for is np.lib.stride_tricks.sliding_window_view, to get a matrix of n rows and 50 columns showing 50 subsequent values of x (and since that is just a view, you don't actually build a matrix).
So, I would
df = pd.DataFrame({0: [1, 2, 3, 1, 2, 1, 2, 1, 1, 1, 2, 2, 1]})
def create_dataset(df, nhist=5):
    xdum = pd.get_dummies(df[0]).values
    nr, nc = xdum.shape
    xview = np.lib.stride_tricks.as_strided(xdum.ravel(), shape=(nr-nhist, nc*nhist), strides=(nc, 1))
    return xview, df[0].values[nhist-1:]
Plus, I think it solves your dilemma: you need one column to call get_dummies, but after that you need to put the dummies into 50 (5 in my example) columns.
See the result, assuming the previous df[0] plays the role of x; the second returned array is y.
(array([[1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0],
[0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0],
[0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0],
[1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0],
[0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0],
[1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
[0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
[1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0]], dtype=uint8),
array([2, 1, 2, 1, 1, 1, 2, 2, 1]))
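For reference, here is a minimal sketch of the same windowing using the documented sliding_window_view helper instead of as_strided (assuming NumPy >= 1.20; the byte strides above only happen to work because get_dummies yields 1-byte uint8 values):
import numpy as np
import pandas as pd

df = pd.DataFrame({0: [1, 2, 3, 1, 2, 1, 2, 1, 1, 1, 2, 2, 1]})
nhist = 5
xdum = pd.get_dummies(df[0]).values
nr, nc = xdum.shape
# windows of nhist consecutive dummy rows; stepping by nc keeps the windows
# aligned on row boundaries, and the result is still only a view
xview = np.lib.stride_tricks.sliding_window_view(xdum.ravel(), nc * nhist)[::nc][:nr - nhist]
y = df[0].values[nhist - 1:]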
I have two questions. First, I want to chart the predicted survival function. The code is as follows:
from sksurv.preprocessing import OneHotEncoder
from sksurv.datasets import load_veterans_lung_cancer
from sksurv.linear_model import CoxPHSurvivalAnalysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data_x, data_y = load_veterans_lung_cancer()
data_y
data_x_numeric = OneHotEncoder().fit_transform(data_x)
estimator = CoxPHSurvivalAnalysis()
estimator.fit(data_x_numeric, data_y)
x_new = pd.DataFrame.from_dict({
    1: [65, 0, 0, 1, 60, 1, 0, 1],
    2: [65, 0, 0, 1, 60, 1, 0, 0],
    3: [65, 0, 1, 0, 60, 1, 0, 0],
    4: [65, 0, 1, 0, 60, 1, 0, 1]},
    columns=data_x_numeric.columns, orient='index')
pred_surv = estimator.predict_survival_function(x_new)
When I want to plot the result:
time_points = np.arange(1, 1000)
for i, surv_func in enumerate(pred_surv):
    plt.step(time_points, surv_func(time_points), where="post",
             label="Sample %d" % (i + 1))
plt.ylabel("est. probability of survival $\hat{S}(t)$")
plt.xlabel("time $t$")
plt.legend(loc="best")
I get the following error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
How can I solve this problem?
The second question is: how can the results from the pred_surv object be transferred to a dataframe?
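Not an official recipe, just a minimal sketch of one way to tabulate pred_surv, assuming each of its elements is a callable step function as used in the plotting loop above (evaluating one time point at a time also sidesteps array-truth-value issues in some sksurv versions):
import numpy as np
import pandas as pd

time_points = np.arange(1, 1000)
# one column per sample, one row per evaluated time point
surv_df = pd.DataFrame(
    {"Sample %d" % (i + 1): [fn(t) for t in time_points]
     for i, fn in enumerate(pred_surv)},
    index=time_points)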
I am using Random Forest as a binary classifier for a dataset and the results just don't seem believable, but I can't find where the problem is.
The problem lies in the fact that the examples are clearly not separable by setting a threshold, as the values for the feature of interest for the positive/negative examples are highly homogeneous. When only a single feature is used for binary classification, RF should only be able to discriminate between examples by setting an absolute threshold for positive/negative identification, right? If that's the case, how can the code below result in perfect performance on the test set?
P.S. In practice I have many more than the ~30 examples shown below, but only included these as an example. Same performance when evaluating >100.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
X_train = np.array([0.427948, 0.165065, 0.31179, 0.645415, 0.125764,
0.448908, 0.417467, 0.524891, 0.038428, 0.441921,
0.927511, 0.556332, 0.243668, 0.565939, 0.265502,
0.122271, 0.275983, 0.60786, 0.670742, 0.565939,
0.117031, 0.117031, 0.001747, 0.148472, 0.038428,
0.50393, 0.49607, 0.148472, 0.275983, 0.191266,
0.254148, 0.430568, 0.198253, 0.323144, 0.29869,
0.344978, 0.524891, 0.323144, 0.344978, 0.28821,
0.441921, 0.127511, 0.31179, 0.254148, 0, 0.001747,
0.243668, 0.281223, 0.281223, 0.427948, 0.548472,
0.927511, 0.417467, 0.282969, 0.367686, 0.198253,
0.572926, 0.29869, 0.570306, 0.183406, 0.310044,
1, 1, 0.60786, 0, 0.282969, 0.349345, 0.521106,
0.430568, 0.127511, 0.50393, 0.367686, 0.310044,
0.556332, 0.670742, 0.30393, 0.548472, 0.193886,
0.349345, 0.122271, 0.193886, 0.265502, 0.537991,
0.165065, 0.191266])
y_train = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0,
0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,
1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0,
1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1,
0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1,
0, 0, 1, 0, 0, 0, 0])
X_test = np.array((0.572926, 0.521106, 0.49607, 0.570306, 0.645415,
0.125764, 0.448908, 0.30393, 0.183406, 0.537991))
y_test = np.array((1, 1, 1, 0, 0, 0, 1, 1, 0, 0))
# Instantiate model and set parameters
clf = RandomForestClassifier()
clf.set_params(n_estimators=500, criterion='gini', max_features='sqrt')
# Note: reshape is because RF requires column-vector format,
# but default NumPy is row
clf.fit(X_train.reshape(-1, 1), y_train)
pred = clf.predict(X_test.reshape(-1, 1))
# sort by feature value for comparison
o = np.argsort(X_test)
print('Example#\tX\t\t\tY_test\tY_true')
for i in o:
    print('%d\t\t\t%f\t%d\t%d' % (i, X_test[i], y_test[i], pred[i]))
Which then returns:
Example# X Y_test Y_true
5 0.125764 0 0
8 0.183406 0 0
7 0.303930 1 1
6 0.448908 1 1
2 0.496070 1 1
1 0.521106 1 1
9 0.537991 0 0
3 0.570306 0 0
0 0.572926 1 1
4 0.645415 0 0
How can an RF model with a single feature possibly discriminate these examples? Isn't there something wrong? I've looked into the configuration of the classifier and can't find any problems. I was thinking that maybe it was a problem of overfitting (however, I'm doing 10-fold cross-validation, so that seems less likely), but then I came across this quote on the official webpage for Random Forest classification: "Random forests does not overfit. You can run as many trees as you want." (https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#remarks)
When only a single feature is used for binary classification, RF should only be able to discriminate between examples by setting an absolute threshold for positive/negative identification, right?
Each branch can discriminate only by one threshold, but each tree is built up from several branches. If the X-space can be split into several intervals such that each interval has the same y-value, then, as long as the classifier has enough data to find the boundaries of those intervals, it will be able to predict the test set. However, I noticed that your "test" set seems to be a subset of your training set, which defeats the purpose of having a test set. Of course, if you test on data that you trained on, the accuracy will be high. Try sorting your data by X-value, then taking X-values that aren't in your training set but are between two adjacent X_train values that have different y-values, for instance x = .001; you should see accuracy plummet. A sketch of that probe follows.
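A minimal sketch of that check, assuming clf, X_train and y_train from the question are still in scope (the midpoint construction is just one illustrative way to pick unseen values):
import numpy as np

order = np.argsort(X_train)
xs, ys = X_train[order], y_train[order]
# midpoints between adjacent distinct training values whose labels disagree:
# values the model has never seen, sitting exactly on contested boundaries
probes = np.array([(a + b) / 2
                   for a, b, la, lb in zip(xs, xs[1:], ys, ys[1:])
                   if la != lb and a != b])
print(clf.predict(probes.reshape(-1, 1)))  # expect roughly chance-level accuracy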
The Scenario:
I'm performing Clustering over Movie Lens Dataset, where I have this Dataset in 2 formats:
OLD FORMAT:
uid iid rat
941 1 5
941 7 4
941 15 4
941 117 5
941 124 5
941 147 4
941 181 5
941 222 2
941 257 4
941 258 4
941 273 3
941 294 4
NEW FORMAT:
uid 1 2 3 4
1 5 3 4 3
2 4 3.6185548023 3.646073985 3.9238342172
3 2.8978348799 2.6692556753 2.7693015618 2.8973463681
4 4.3320762062 4.3407749532 4.3111995162 4.3411425423
940 3.7996234581 3.4979386925 3.5707888503 2
941 5 NaN NaN NaN
942 4.5762594612 4.2752554573 4.2522440019 4.3761477591
943 3.8252406362 5 3.3748860659 3.8487417604
over which I need to perform Clustering using KMeans, DBSCAN and HDBSCAN.
With KMeans I'm able to set and get clusters.
The Problem
The problem occurs only with DBSCAN and HDBSCAN: I'm unable to get a sufficient number of clusters (I do know we cannot set the number of clusters manually)
Techniques Tried:
Tried this with the Iris dataset, where I found Species wasn't included. Clearly that column is a string and, besides, is the value to be predicted, and everything just works fine with that dataset (Snippet 1)
Tried with the MovieLens 100K dataset in OLD FORMAT (with and without uid), since I tried the analogy UID == SPECIES and hence also tried without it (Snippet 2)
Tried the same with NEW FORMAT (with and without uid), yet the results ended up the same.
Snippet 1:
print "\n\n FOR IRIS DATA-SET:"
from sklearn.datasets import load_iris
iris = load_iris()
dbscan = DBSCAN()
d = pd.DataFrame(iris.data)
dbscan.fit(d)
print "Clusters", set(dbscan.labels_)
Snippet 1 (Output):
FOR IRIS DATA-SET:
Clusters set([0, 1, -1])
Out[30]:
array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, 1, -1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1,
-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, -1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, -1, 1, 1, 1,
1, 1, 1, -1, -1, 1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1, -1,
1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, -1, -1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
Snippet 2:
import pandas as pd
from sklearn.cluster import DBSCAN
data_set = pd.DataFrame
ch = int(input("Extended Cluster Methods for:\n1. Main Matrix IBCF \n2. Main Matrix UBCF\nCh:"))
if ch == 1:
    data_set = pd.read_csv("MainMatrix_IBCF.csv")
    data_set = data_set.iloc[:, 1:]
    data_set = data_set.dropna()
elif ch == 2:
    data_set = pd.read_csv("MainMatrix_UBCF.csv")
    data_set = data_set.iloc[:, 1:]
    data_set = data_set.dropna()
else:
    print "Enter Proper choice!"
print "Starting with DBSCAN for Clustering on\n", data_set.info()
db_cluster = DBSCAN()
db_cluster.fit(data_set)
print "Clusters assigned are:", set(db_cluster.labels_)
Snippet 2 (Output):
Extended Cluster Methods for:
1. Main Matrix IBCF
2. Main Matrix UBCF
Ch:>? 1
Starting with DBSCAN for Clustering on
<class 'pandas.core.frame.DataFrame'>
Int64Index: 942 entries, 0 to 942
Columns: 1682 entries, 1 to 1682
dtypes: float64(1682)
memory usage: 12.1 MB
None
Clusters assigned are: set([-1])
As seen, it returns only one cluster. I'd like to hear what I'm doing wrong.
As pointed out by @faraway and @Anony-Mousse, the solution is more mathematical (about the dataset) than about programming.
Could finally figure out the clusters. Here's how:
db_cluster = DBSCAN(eps=9.7, min_samples=2, algorithm='ball_tree', metric='minkowski', leaf_size=90, p=2)
arr = db_cluster.fit_predict(data_set)
print "Clusters assigned are:", set(db_cluster.labels_)
uni, counts = np.unique(arr, return_counts=True)
d = dict(zip(uni, counts))
print d
The epsilon and outlier concepts became much clearer from this SO question: How can I choose eps and minPts (two parameters for the DBSCAN algorithm) for efficient results?
You need to choose appropriate parameters. With a too-small epsilon, everything becomes noise. sklearn shouldn't have a default value for this parameter; it needs to be chosen differently for each data set.
You also need to preprocess your data.
It's trivial to get "clusters" with kmeans that are meaningless...
Don't just call random functions. You need to understand what you are doing, or you are just wasting your time.
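A minimal sketch of the usual k-distance heuristic for picking eps, assuming data_set from the question: sort every point's distance to its k-th nearest neighbour and look for the elbow in the curve.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 4  # a common convention, matching minPts = 4
nn = NearestNeighbors(n_neighbors=k + 1).fit(data_set)  # +1 because each point returns itself
dists, _ = nn.kneighbors(data_set)
plt.plot(np.sort(dists[:, -1]))  # sorted k-th neighbour distances
plt.ylabel("distance to %d-th nearest neighbour" % k)
plt.show()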
Firstly you need to preprocess your data removing any useless attribute such as ids, and incomplete instances (in case your chosen distance measure can't handle it).
It's good to understand that these algorithms come from two different paradigms: centroid-based (KMeans) and density-based (DBSCAN & HDBSCAN*). While centroid-based algorithms usually take the number of clusters as an input parameter, density-based algorithms need the number of neighbors (minPts) and the radius of the neighborhood (eps).
Normally in the literature the number of neighbors (minPts) is set to 4 and the radius (eps) is found by analyzing different values. You may find HDBSCAN* easier to use, as you only need to inform the number of neighbors (minPts); see the sketch below.
If, after trying different configurations, you are still getting useless clusterings, maybe your data has no clusters at all and the KMeans output is meaningless.
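A minimal sketch of that, assuming the third-party hdbscan package and the data_set from the question (min_cluster_size is its closest knob to the neighbour count):
import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=4)
labels = clusterer.fit_predict(data_set)
print(set(labels))  # -1 marks noise points, as in DBSCAN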
Have you tried seeing how the clusters look in 2-D space, e.g. using PCA? If the whole dataset is dense and actually forms a single group, then you might well get a single cluster; a sketch of that check follows.
Also try changing other parameters, like min_samples=5, algorithm and metric. Possible values for algorithm and metric can be checked in sklearn.neighbors.VALID_METRICS.
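A minimal sketch of the PCA check mentioned above, assuming data_set from the question:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

coords = PCA(n_components=2).fit_transform(data_set)  # project onto the two main axes of variance
plt.scatter(coords[:, 0], coords[:, 1], s=5)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()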
I'm plotting precipitation data from weather model output. I'm contouring the data I have, using contourf. However, I don't want it to fill in the "0" level with color (only the values >0). Is there a good way to do this? I've tried messing around with the levels.
Here's the code I'm using to plot:
m = Basemap(projection='stere', lon_0=centlon, lat_0=centlat,
lat_ts=centlat, width=width, height=height)
m.drawcoastlines()
m.drawstates()
m.drawcountries()
parallels = np.arange(0., 90, 10.)
m.drawparallels(parallels, labels=[1, 0, 0, 0], fontsize=10)
meridians = np.arange(180., 360, 10.)
m.drawmeridians(meridians, labels=[0, 0, 0, 1], fontsize=10)
lons, lats = m.makegrid(nx, ny)
x, y = m(lons, lats)
cs = m.contourf(x, y, snowfall)
cbar = plt.colorbar(cs)
cbar.ax.set_ylabel("Accumulated Snow (km/m^2)")
plt.show()
And here's the image I'm getting.
An example snowfall dataset would look something like:
0 0 0 0 0 0
0 0 1 1 1 0
0 1 2 2 1 0
0 2 3 2 1 0
0 1 0 1 2 0
0 0 0 0 0 0
This can also be achieved using the locator argument with MaxNLocator(prune='lower') from matplotlib.ticker. See the docs.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
a = np.array([
[0, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 0],
[0, 1, 2, 2, 1, 0],
[0, 2, 3, 2, 1, 0],
[0, 1, 0, 1, 2, 0],
[0, 0, 0, 0, 0, 0]
])
fig, ax = plt.subplots(1)
p = ax.contourf(a, locator=ticker.MaxNLocator(prune='lower'))
fig.colorbar(p)
plt.show()
Image of output
The nbins parameter of MaxNLocator can be used to control the number of intervals (levels):
p = ax.contourf(a, locator=ticker.MaxNLocator(nbins=5, prune='lower'))
If you don't include 0 in your levels, you won't plot a contour at the 0 level.
For example:
import numpy as np
import matplotlib.pyplot as plt
a = np.array([
[0, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 0],
[0, 1, 2, 2, 1, 0],
[0, 2, 3, 2, 1, 0],
[0, 1, 0, 1, 2, 0],
[0, 0, 0, 0, 0, 0]
])
fig, ax = plt.subplots(1)
p = ax.contourf(a, levels=np.linspace(0.5, 3.0, 11))
fig.colorbar(p)
plt.show()
yields:
An alternative is to mask any datapoints which are 0:
p = ax.contourf(np.ma.masked_array(a, mask=(a == 0)),
                levels=np.linspace(0.0, 3.0, 13))
fig.colorbar(p)
Which looks like:
I suppose it's up to you which of those matches your desired plot the most.
I was able to figure things out myself; there are two ways I found of solving this problem.
Mask out all data <0.01 from the data set using
np.ma.masked_less(snowfall, 0.01)
or
Set the levels of the plot to run from a small positive value (0.1 here) up to whatever the maximum value is:
levels = np.linspace(0.1, 10, 100)
then
cs = m.contourf(x, y, snowfall, levels)
I found that option 1 worked best for me.
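For reference, a minimal sketch of option 1 wired into the contourf call from the question (m, x, y and snowfall as defined there):
import numpy as np
import matplotlib.pyplot as plt

snow_masked = np.ma.masked_less(snowfall, 0.01)  # hide everything below 0.01
cs = m.contourf(x, y, snow_masked)
cbar = plt.colorbar(cs)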
I tried a machine learning algorithm on a hypothetical problem.
I made a fake feature vector and a fake result data set with the following Python code:
x = []
y = []
for i in range(0, 100000):
    mylist = []
    mylist.append(i)
    mylist.append(i)
    x.append(mylist)
    if (i % 2) == 0:
        y.append(0)
    else:
        y.append(1)
The above code gives me 2 python lists, namely,
x = [[0,0],[1,1],[2,2]....and so on] #this list contains the fake feature vector, with 2 same numbers
y = [0,1,0..... and so on] #this has the fake test labels, 0 for even, 1 for odd
I think this test data is good enough for an ML algorithm to learn from. I use the following Python code to train a couple of different machine learning models.
Approach 1 : Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(x,y)
x_pred = [[1,1],[2,2],[3,3],[4,4],[5,5],[6,6],[7,7],[8,8],[9,9],[10,10],[11,11],[12,12],[13,13],[14,14],[15,15],[16,16]]
y_pred=gnb.predict(x_pred)
print y_pred
I get the following incorrect output; the classifier fails to predict:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Approach 2 : Support Vector Machines
from sklearn import svm
clf = svm.SVC()
clf.fit(x, y)
x_pred = [[1,1],[2,2],[3,3],[4,4],[5,5],[6,6],[7,7],[8,8],[9,9],[10,10],[11,11],[12,12],[13,13],[14,14],[15,15],[16,16]]
y_pred=clf.predict(x_pred)
print y_pred
I get the following correct output; the classifier predicts correctly:
[1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0]
Can someone shed light on this and explain why one approach had 50% accuracy and the other 100%?
Let me know if this question is tagged in the wrong category.
Naive Bayes is a parametric model: it tries to summarize your training set in nine parameters, the class prior (50% for either class) and the per-class, per-feature means and variances. However, your target value y is not a function of the means and variances of the inputs x in any way(*), so the parameters are irrelevant and the model resorts to what is effectively random guessing.
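A quick way to see this for yourself, assuming the gnb fitted above (class_prior_ and theta_ are scikit-learn's attribute names for the fitted prior and per-class means):
>>> gnb.class_prior_   # -> roughly [0.5, 0.5]: both classes equally likely a priori
>>> gnb.theta_         # -> per-class feature means, roughly 49999 vs 50000:
...                    #    nearly identical, so they carry no parity information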
By contrast, the support vector machine remembers its training set and compares new inputs to its training inputs using a kernel function. It's supposed to pick a subset of its training samples, but for this problem it's forced to just remember all of them:
>>> x = np.vstack([np.arange(100), np.arange(100)]).T
>>> y = x[:, 0] % 2
>>> from sklearn import svm
>>> clf = svm.SVC()
>>> clf.fit(x, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
>>> clf.support_vectors_.shape
(100, 2)
Since you're using test samples that occurred in the training set, all it has to do is look up the label that the samples you presented had in the training set and return those, so you get 100% accuracy. If you feed the SVM samples outside of the training set, you'll see that it too starts guessing randomly:
>>> clf.predict(x * 2)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1])
Since multiplying by two makes all the features even, the true labeling would have been all zero and the accuracy is 50%: the accuracy of a random guess.
(*) Actually there is some dependence in the training set, but that drops off with more data.