Accuracy of KNN algorithm very low - python

I'm following sentdex's YouTube channel ML tutorial.
As I was coding along on how to build your own KNN algorithm, I noticed that my accuracy was very low, in the 60s almost every time. I had made a few changes, but then I used his code line by line, and the same dataset, yet somehow he gets accuracies in the range of 95-98%, while mine is 60-70%. I'm really not able to figure out the reason behind such a huge difference.
I also have a second question, which has to do with the confidence of the predictions. The value of the confidence is supposed to be within 0-1, right? But for me, they're all identical, and in the 70s.
My code:
# Importing libraries
import numpy as np
import pandas as pd
from collections import Counter
import warnings
import random

# Algorithm
def k_nearest(data, predict, k=5):
    if len(data) >= k:
        warnings.warn("stupid, your data has more dimensions than prescribed")
    distances = []
    for group in data:  # the groups of 2s and 4s
        for features in data[group]:  # values in 2 and 4 respectively
            #euclidean_distance = np.sqrt(np.sum((np.array(features) - np.sum(predict)) **2))
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)]  # adding the sorted (ascending) group names
    votes_result = Counter(votes).most_common(1)[0][0]  # the most common element
    confidence = float(Counter(votes).most_common(1)[0][1]) / float(k)  # occurrences of the most common element
    return votes_result, confidence

# Reading the data
df = pd.read_csv("breast_cancer.txt")
df.replace("?", -99999, inplace=True)
#df.replace("?", np.nan, inplace=True)
#df.dropna(inplace=True)
df.drop("id", axis=1, inplace=True)
full_data = df.astype(float).values.tolist()  # converting to a list because our function is written that way
random.shuffle(full_data)
#print(full_data[:10])

test_size = 0.2
train_set = {2: [], 4: []}
test_set = {2: [], 4: []}
train_data = full_data[:-int(test_size*len(full_data))]  # up to the last 20% of the original dataset
test_data = full_data[-int(test_size*len(full_data)):]   # the last 20% of the dataset

# Populating the dictionaries
for i in train_data:
    train_set[i[-1]].append(i[:-1])  # appending the features and leaving out the label
for i in test_data:
    test_set[i[-1]].append(i[:-1])   # appending the features and leaving out the label

# Testing
correct, total = 0, 0
for group in test_set:
    for data in test_set[group]:
        vote, confidence = k_nearest(train_set, data, k=5)
        if vote == group:
            correct += 1
        else:
            print(confidence)
        total += 1
print("Accuracy is", correct/total)
Link to the dataset breast-cancer-wisconsin.data

There's a mistake in your k_nearest function: you need to vote over only the top k of the sorted distances, not the whole list. So it should be:
votes = [i[1] for i in sorted(distances)[:k]]
instead of what is in your code:
votes = [i[1] for i in sorted(distances)]
This also answers your second question. Voting over every training point always elects the majority class, so your accuracy collapses to roughly the majority-class share of the dataset (about 65% benign here, which matches your 60-70%), and the "confidence" becomes the majority count over the entire training set divided by k, which is why it is identical for every prediction and far above 1.
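A minimal sketch, with hypothetical approximate class counts, shows both effects:

from collections import Counter

# Hypothetical counts mimicking the ~80% training split: about 65% of
# samples are class 2 (benign) and 35% are class 4 (malignant).
votes_all = [2] * 366 + [4] * 193   # voting over ALL training points, not just k

print(Counter(votes_all).most_common(1))            # [(2, 366)] -> class 2 always wins
print(Counter(votes_all).most_common(1)[0][1] / 5)  # 73.2 -> "confidence" far above 1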
We can rewrite your function:
def k_nearest(data, predict, k=5):
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    votes_result = Counter(votes).most_common(1)[0][0]
    confidence = float(Counter(votes).most_common(1)[0][1]) / float(k)
    return votes_result, confidence
And run your code. I am not so sure about replacing "?" with -99999, so I read those values in as NaN and dropped them:
import pandas as pd
from collections import Counter
import random
import numpy as np

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
df = pd.read_csv(url, header=None, na_values="?")
df = df.dropna()
full_data = df.iloc[:, 1:].astype(float).values.tolist()

random.seed(999)
random.shuffle(full_data)

test_size = 0.2
train_set = {2: [], 4: []}
test_set = {2: [], 4: []}
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]

for i in train_data:
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])

correct, total = 0, 0
for group in test_set:
    for data in test_set[group]:
        vote, confidence = k_nearest(train_set, data, k=5)
        if vote == group:
            correct += 1
        else:
            print(confidence)
        total += 1
print("Accuracy is", correct/total)
Gives:
1.0
0.8
1.0
0.6
0.6
0.6
0.6
Accuracy is 0.9485294117647058

Related

How to build hybrid model of RF(Random Forest) and PSO(Particle Swarm Optimizer) to find optimal discount of products?

I need to find the optimal discount for each product (e.g. A, B, C) so that I can maximize total sales. I have existing Random Forest models for each product that map discount and season to sales. How do I combine these models and feed them to an optimizer to find the optimal discount per product?
Reason for model selection:
RF: it gives a better relation (compared to linear models) between the predictors and the response (sales_uplift_norm).
PSO: suggested in many white papers (available at researchgate/IEEE); a Python package is also available, here and here.
Input data: sample data used to build the model at product level.
Idea/steps followed by me:
Build an RF model per product:
# pre-processed data
products_pre_processed_data = {key: pre_process_data(df, key) for key, df in df_basepack_dict.items()}
# rf models
products_rf_model = {key: rf_fit(df) for key, df in products_pre_processed_data.items()}
Pass the models to the optimizer:
Objective function: maximize sales_uplift_norm (the response variable of the RF model)
Constraint:
total spend (spends of A + B + C) <= 20, where spend = total_units_sold_of_product * discount_percentage * mrp_of_product
lower bound of products (A, B, C): [0.0, 0.0, 0.0] # discount percentage lower bounds
upper bound of products (A, B, C): [0.3, 0.4, 0.4] # discount percentage upper bounds
Pseudo/sample code, as I am unable to find a way to pass the product_models into the optimizer:
from pyswarm import pso

def obj(x):
    model1 = products_rf_model.get('A')
    model2 = products_rf_model.get('B')
    model3 = products_rf_model.get('C')
    return -(model1 + model2 + model3)  # -ve sign as to maximize

def con(x):
    x1 = x[0]
    x2 = x[1]
    x3 = x[2]
    return np.sum(units_A*x*mrp_A + units_B*x*mrp_B + units_C*x*spend_C) - 20  # spend budget

lb = [0.0, 0.0, 0.0]
ub = [0.3, 0.4, 0.4]
xopt, fopt = pso(obj, lb, ub, f_ieqcons=con)
Dear SO experts, I request your guidance (I have been struggling to find any for a couple of weeks) on how to use the PSO optimizer (or any other optimizer, if I am not following the right one) with RF.
Adding the functions used for the model:
import pandas as pd
from sklearn.preprocessing import scale
from sklearn.ensemble import RandomForestRegressor

def pre_process_data(df, product):
    data = df.copy().reset_index()
    # print(data)
    bp = product
    print("----------product: {}----------".format(bp))
    # Pre-processing steps
    print("pre process df.shape {}".format(df.shape))
    #1. Response var transformation
    response = data.sales_uplift_norm  # already transformed
    #2. Predictor numeric var transformation
    numeric_vars = ['discount_percentage']  # may include mrp, depth
    df_numeric = data[numeric_vars]
    df_norm = df_numeric.apply(lambda x: scale(x), axis=0)  # center and scale
    #3. Char fields dummification
    # select category fields
    cat_cols = data.select_dtypes('category').columns
    # select string fields
    str_to_cat_cols = data.drop(['product'], axis=1).select_dtypes('object').astype('category').columns
    # combine all categorical fields
    all_cat_cols = [*cat_cols, *str_to_cat_cols]
    # print(all_cat_cols)
    # convert cat to dummies
    df_dummies = pd.get_dummies(data[all_cat_cols])
    #4. Combine num and char df together
    df_combined = pd.concat([df_dummies.reset_index(drop=True), df_norm.reset_index(drop=True)], axis=1)
    df_combined['sales_uplift_norm'] = response
    df_processed = df_combined.copy()
    print("post process df.shape {}".format(df_processed.shape))
    # print("model fields: {}".format(df_processed.columns))
    return df_processed

def rf_fit(df, random_state=12):
    train_features = df.drop('sales_uplift_norm', axis=1)
    train_labels = df['sales_uplift_norm']
    # Random Forest Regressor
    rf = RandomForestRegressor(n_estimators=500,
                               random_state=random_state,
                               bootstrap=True,
                               oob_score=True)
    # RF model
    rf_fit = rf.fit(train_features, train_labels)
    return rf_fit
EDIT: updated dataset to simplified version.
You can find a complete solution below!
The fundamental differences with your approach are the following:
Since the Random Forest model takes the season feature as input, optimal discounts must be computed for every season.
Inspecting the documentation of pyswarm, the con function yields an output that must comply with con(x) >= 0.0. The correct constraint is therefore 20 - sum(...), and not the other way around. In addition, the units and mrp variables were not given; I just assumed a value of 1, and you might want to change those values.
Additional modifications to your original code include:
Preprocessing and pipeline wrappers from sklearn, in order to simplify the preprocessing steps.
Optimal parameters are stored in an output .xlsx file.
The maxiter parameter of the PSO has been set to 5 to speed up debugging; you might want to set it to another value (default = 100).
The code is therefore:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.base import clone

# ====================== RF TRAINING ======================
# Preprocessing
def build_sample(season, discount_percentage):
    return pd.DataFrame({
        'season': [season],
        'discount_percentage': [discount_percentage]
    })

columns_to_encode = ["season"]
columns_to_scale = ["discount_percentage"]
encoder = OneHotEncoder()
scaler = StandardScaler()
preproc = ColumnTransformer(
    transformers=[
        ("encoder", Pipeline([("OneHotEncoder", encoder)]), columns_to_encode),
        ("scaler", Pipeline([("StandardScaler", scaler)]), columns_to_scale)
    ]
)

# Model
myRFClassifier = RandomForestRegressor(
    n_estimators=500,
    random_state=12,
    bootstrap=True,
    oob_score=True)

pipeline_list = [
    ('preproc', preproc),
    ('clf', myRFClassifier)
]
pipe = Pipeline(pipeline_list)

# Dataset
df_tot = pd.read_excel("so_data.xlsx")
df_dict = {
    product: df_tot[df_tot['product'] == product].drop(columns=['product']) for product in pd.unique(df_tot['product'])
}

# Fit
print("Training ...")
pipe_dict = {
    product: clone(pipe) for product in df_dict.keys()
}
for product, df in df_dict.items():
    X = df.drop(columns=["sales_uplift_norm"])
    y = df["sales_uplift_norm"]
    pipe_dict[product].fit(X, y)

# ====================== OPTIMIZATION ======================
from pyswarm import pso

# Parameter of PSO
maxiter = 5
n_product = len(pipe_dict.keys())

# Constraints
budget = 20
units = [1, 1, 1]
mrp = [1, 1, 1]
lb = [0.0, 0.0, 0.0]
ub = [0.3, 0.4, 0.4]

# Must always remain >= 0
def con(x):
    s = 0
    for i in range(n_product):
        s += units[i] * mrp[i] * x[i]
    return budget - s

print("Optimization ...")

# Save optimal discounts for every product and every season
df_opti = pd.DataFrame(data=None, columns=df_tot.columns)
for season in pd.unique(df_tot['season']):
    # Objective function to minimize
    def obj(x):
        s = 0
        for i, product in enumerate(pipe_dict.keys()):
            s += pipe_dict[product].predict(build_sample(season, x[i]))
        return -s

    # PSO
    xopt, fopt = pso(obj, lb, ub, f_ieqcons=con, maxiter=maxiter)
    print("Season: {}\t xopt: {}".format(season, xopt))

    # Store result
    df_opti = pd.concat([
        df_opti,
        pd.DataFrame({
            'product': list(pipe_dict.keys()),
            'season': [season] * n_product,
            'discount_percentage': xopt,
            'sales_uplift_norm': [
                pipe_dict[product].predict(build_sample(season, xopt[i]))[0] for i, product in enumerate(pipe_dict.keys())
            ]
        })
    ])

# Save result
df_opti = df_opti.reset_index().drop(columns=['index'])
df_opti.to_excel("so_result.xlsx")
print("Summary")
print(df_opti)
It gives:
Training ...
Optimization ...
Stopping search: maximum iterations reached --> 5
Season: summer xopt: [0.1941521 0.11233673 0.36548761]
Stopping search: maximum iterations reached --> 5
Season: winter xopt: [0.18670604 0.37829516 0.21857777]
Stopping search: maximum iterations reached --> 5
Season: monsoon xopt: [0.14898102 0.39847885 0.18889792]
Summary
product season discount_percentage sales_uplift_norm
0 A summer 0.194152 0.175973
1 B summer 0.112337 0.229735
2 C summer 0.365488 0.374510
3 A winter 0.186706 -0.028205
4 B winter 0.378295 0.266675
5 C winter 0.218578 0.146012
6 A monsoon 0.148981 0.199073
7 B monsoon 0.398479 0.307632
8 C monsoon 0.188898 0.210134

KMeans not returning reproducible results in sklearn, even fixing random_state

The following code tests KMeans for several n_clusters and tries to find the "best" n_clusters by the inertia criterion. However, it is not reproducible: even with random_state fixed, every time I call kmeans(df) on the same dataset it generates a different clustering, and even a different n_clusters. Am I missing something here?
import numpy as np
from sklearn.cluster import KMeans
from tqdm import tqdm_notebook

def kmeans(df):
    inertia = []
    models = {}
    start = 3
    end = 40
    for i in tqdm_notebook(range(start, end)):
        k = KMeans(n_clusters=i, init='k-means++', n_init=50, random_state=10, n_jobs=-1).fit(df.values)
        inertia.append(k.inertia_)
        models[i] = k
    # pick the elbow: the largest second derivative of the inertia curve
    ep = np.argmax(np.gradient(np.gradient(np.array(inertia)))) + start
    return models[ep]
I am having this same issue. I think a closer solution is to freeze the model into a file, import the model, and then cluster a new predict phrase. If the vectorizer and the k-means clustering are initialized every single time the program runs, they seem to order the clusters differently on every run, so the hashmap will not activate correctly and will give you a different number every time the function is called.
import sqlite3
import json
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.utils import shuffle

# Sample array of string sentences
df = pd.read_csv('/workspaces/codespaces-flask//data/shuffled.csv')
df = shuffle(df)
sentences = df['text'].values

# Convert the sentences into TF-IDF features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['text'])

# Perform K-Means clustering
kmeans = KMeans(n_clusters=8, random_state=42)
clusters = kmeans.fit_predict(X)
output = zip(sentences, clusters)

# Print the cluster assignments for each sentence
for sentence, cluster in zip(sentences, clusters):
    print("Sentence:", sentence, "Cluster:", cluster)
df = pd.DataFrame(output)

db_file_name = '/workspaces/codespaces-flask/ThrAive/data/database1.db'
conn = sqlite3.connect(db_file_name)
cursor = conn.cursor()
cursor.execute("SELECT journal_text FROM Journal JOIN User ON Journal.id = user.id")
rows = cursor.fetchall()
conn.commit()
conn.close()

df1 = pd.DataFrame(rows)
df1 = df1.applymap(lambda x: " ".join(x.split()) if isinstance(x, str) else x)
entry = df1
print(entry)
entry = entry[0].iloc[-1].lower()
entry = [entry]
new_X = vectorizer.transform(entry)

# Predict the cluster assignments for the new sentences
new_clusters = kmeans.predict(new_X)
for entry, new_cluster in zip(entry, new_clusters):
    print("Sentence:", entry, "Cluster:", new_cluster)
zipper = zip(entry, new_clusters)
df = pd.DataFrame(zipper)
df = df.applymap(lambda x: " ".join(x.split()) if isinstance(x, str) else x)
df = df.to_string(header=False, index=False)
entry = df
output = entry

numbers = ['0', '1', '2', '3', '4', '5', '6', '7', '8']
names =  # (list of cluster names omitted in the original post)

# Create a dictionary that maps numbers to names
number_to_name = {number: name for number, name in zip(numbers, names)}
print(output[-1])
output = number_to_name[output[-1]]
json_string = json.dumps(str(output))
I think that the solution is saving the model to disk:

import pickle

# Train a scikit-learn model (placeholder: your fitted vectorizer/KMeans)
model = ...

# Save the model to disk
with open('model.pkl', 'wb') as file:
    pickle.dump(model, file)
and then load the pickle file and test it on the k-means model without re-initializing the clusters.
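A minimal sketch of the loading side, assuming the same model.pkl written above:

import pickle

# Load the previously fitted model; its cluster centers are frozen,
# so labels for the same input stay stable across program runs.
with open('model.pkl', 'rb') as file:
    model = pickle.load(file)

# e.g. new_clusters = model.predict(new_X)  # new_X from the vectorizer above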

sample X examples from each class label

I have a dataset (numpy vectors) with 50 classes and 9000 training examples.
x_train=(9000,2048)
y_train=(9000,) # Classes are strings
classes=list(set(y_train))
I would like to build a sub-dataset such that each class has 5 examples,
which means I get 5*50=250 training examples. Hence my sub-dataset will take this form:
sub_train_data=(250,2048)
sub_train_labels=(250,)
Remark: we take 5 examples at random from each class (total number of classes = 50).
Thank you
Here is a solution for that problem:
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

def balanced_sample_maker(X, y, sample_size, random_seed=42):
    uniq_levels = np.unique(y)
    uniq_counts = {level: sum(y == level) for level in uniq_levels}
    if random_seed is not None:
        np.random.seed(random_seed)
    # find observation indices of each class level
    groupby_levels = {}
    for ii, level in enumerate(uniq_levels):
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx
    # oversampling on observations of each label
    balanced_copy_idx = []
    for gb_level, gb_idx in groupby_levels.items():
        over_sample_idx = np.random.choice(gb_idx, size=sample_size, replace=True).tolist()
        balanced_copy_idx += over_sample_idx
    np.random.shuffle(balanced_copy_idx)

    data_train = X[balanced_copy_idx]
    labels_train = y[balanced_copy_idx]
    if len(data_train) == sample_size * len(uniq_levels):
        print('number of sampled examples ', sample_size * len(uniq_levels),
              'number of samples per class ', sample_size,
              ' #classes: ', len(list(set(uniq_levels))))
    else:
        print('number of samples is wrong ')

    labels, values = zip(*Counter(labels_train).items())
    print('number of classes ', len(list(set(labels_train))))
    check = all(x == values[0] for x in values)
    print(check)
    if check:
        print('Good, all classes have the same number of examples')
    else:
        print('Repeat your sampling, your classes are not balanced')
    indexes = np.arange(len(labels))
    width = 0.5
    plt.bar(indexes, values, width)
    plt.xticks(indexes + width * 0.5, labels)
    plt.show()
    return data_train, labels_train

X_train, y_train = balanced_sample_maker(X, y, 10)
inspired by Scikit-learn balanced subsampling
Pure numpy solution:
def sample(X, y, samples):
    unique_ys = np.unique(y, axis=0)
    result = []
    for unique_y in unique_ys:
        val_indices = np.argwhere(y == unique_y).flatten()
        random_samples = np.random.choice(val_indices, samples, replace=False)
        result.append(X[random_samples])
    return np.concatenate(result)
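A usage sketch with synthetic data (shapes mirror the question; the class names here are made up). Note the function returns only the sampled features, so you would index y with the same random_samples if you also need the labels:

import numpy as np

X = np.random.rand(9000, 2048)
y = np.random.choice(list('ABCDE'), size=9000)  # 5 classes here; 50 in the question

sub_train_data = sample(X, y, samples=5)
print(sub_train_data.shape)  # (25, 2048): 5 classes x 5 samples each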
I usually use a trick from scikit-learn for this: the StratifiedShuffleSplit function. So if I have to select a 1/n fraction of my train set, I set the proportion of test data (test_size) to 1 - 1/n and keep the train side of a single split. Here is an example where I use only 1/10 of my data.
from sklearn.model_selection import StratifiedShuffleSplit

sp = StratifiedShuffleSplit(n_splits=1, test_size=0.9, random_state=seed)  # seed defined elsewhere
for train_index, _ in sp.split(x_train, y_train):
    x_train, y_train = x_train[train_index], y_train[train_index]
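A self-contained sketch of the same idea on toy data (shapes and seed are made up for illustration):

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

x_train = np.random.rand(1000, 20)
y_train = np.random.choice([0, 1, 2], size=1000)

# Keep ~1/10 of the data; stratification preserves the class proportions.
sp = StratifiedShuffleSplit(n_splits=1, test_size=0.9, random_state=0)
for train_index, _ in sp.split(x_train, y_train):
    x_sub, y_sub = x_train[train_index], y_train[train_index]

print(x_sub.shape)  # (100, 20)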
You can use a dataframe as input (as in my case), and use the simple code below:

col = target
nsamples = min(t4m[col].value_counts().values)
res = pd.DataFrame()
for val in t4m[col].unique():
    t = t4m.loc[t4m[col] == val].sample(nsamples)
    res = pd.concat([res, t], ignore_index=True).sample(frac=1)

col is the name of your column with classes. The code finds the minority class, takes a sample of the minority-class size from each class, and shuffles the resulting dataframe.
Then you can convert the result back to np.array.
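For example, a sketch using the names defined above:

# Split the balanced dataframe back into feature and label arrays.
sub_train_data = res.drop(columns=[col]).to_numpy()
sub_train_labels = res[col].to_numpy()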

Python Error: Found array with 0 sample(s) (shape=(0, 262)) while a minimum of 1 is required

So I am trying to build a natural language processor in Python, and I was using some code I found online, then adapting my own stuff to it. But now it just doesn't want to work. It keeps giving me
ValueError: Found array with 0 sample(s) (shape=(0, 262)) while a minimum of 1 is required.
Here is my code. I apologize if it is messy; I just copied it straight off the internet:
from collections import Counter
import pandas
from nltk.corpus import stopwords
import pandas as pd
import numpy

headlines = []
apps = pd.read_csv('DataUse.csv')
for e in apps['title_lower']:
    headlines.append(e)
testdata = pd.read_csv('testdata.csv')

# Find all the unique words in the headlines.
unique_words = list(set(" ".join(headlines).split(" ")))

def make_matrix(headlines, vocab):
    matrix = []
    for headline in headlines:
        # Count each word in the headline, and make a dictionary.
        counter = Counter(headline)
        # Turn the dictionary into a matrix row using the vocab.
        row = [counter.get(w, 0) for w in vocab]
        matrix.append(row)
    df = pandas.DataFrame(matrix)
    df.columns = unique_words
    return df

print(make_matrix(headlines, unique_words))

import re
# Lowercase, then replace any non-letter, space, or digit character in the headlines.
new_headlines = [re.sub(r'[^\w\s\d]', '', h.lower()) for h in headlines]
# Replace sequences of whitespace with a space character.
new_headlines = [re.sub("\s+", " ", h) for h in new_headlines]
unique_words = list(set(" ".join(new_headlines).split(" ")))
# We've reduced the number of columns in the matrix a bit.
print(make_matrix(new_headlines, unique_words))

stopwords = set(stopwords.words('english'))
stopwords = [re.sub(r'[^\w\s\d]', '', s.lower()) for s in stopwords]
unique_words = list(set(" ".join(new_headlines).split(" ")))
# Remove stopwords from the vocabulary.
unique_words = [w for w in unique_words if w not in stopwords]
# We're down to 34 columns, which is way better!
print(make_matrix(new_headlines, unique_words))

from sklearn.feature_extraction.text import CountVectorizer
# Construct a bag of words matrix.
# This will lowercase everything, and ignore all punctuation by default.
# It will also remove stop words.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
matrix = vectorizer.fit_transform(headlines)
# We created our bag of words matrix with far fewer commands.
print(matrix.todense())

# Let's apply the same method to all the headlines in all 100000 submissions.
# We'll also add the url of the submission to the end of the headline so we can take it into account.
full_matrix = vectorizer.fit_transform(apps['title_lower'])
print(full_matrix.shape)

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Convert the upvotes variable to binary so it works with a chi-squared test.
col = apps["total_shares"].copy(deep=True)
col_mean = col.mean()
col[col < col_mean] = 0
col[(col > 0) & (col > col_mean)] = 1
print col
# Find the 1000 most informative columns
selector = SelectKBest(chi2, k='all')
selector.fit(full_matrix, col)
top_words = selector.get_support().nonzero()
# Pick only the most informative columns in the data.
chi_matrix = full_matrix[:, top_words[0]]

import numpy as numpy
transform_functions = [
    lambda x: len(x),
    lambda x: x.count(" "),
    lambda x: x.count("."),
    lambda x: x.count("!"),
    lambda x: x.count("?"),
    lambda x: len(x) / (x.count(" ") + 1),
    lambda x: x.count(" ") / (x.count(".") + 1),
    lambda x: len(re.findall("\d", x)),
    lambda x: len(re.findall("[A-Z]", x)),
]
# Apply each function and put the results into a list.
columns = []
for func in transform_functions:
    columns.append(apps["title_lower"].apply(func))
# Convert the meta features to a numpy array.
meta = numpy.asarray(columns).T

features = numpy.hstack([chi_matrix.todense()])
from sklearn.linear_model import Ridge
import random

train_rows = 262
# Set a seed to get the same "random" shuffle every time.
random.seed(1)
# Shuffle the indices for the matrix.
indices = list(range(features.shape[0]))
random.shuffle(indices)
# Create train and test sets.
train = features[indices[:train_rows], :]
test = features[indices[train_rows:], :]
print test
train_upvotes = apps['total_shares'].iloc[indices[:train_rows]]
test_upvotes = apps['total_shares'].iloc[indices[train_rows:]]
train = numpy.nan_to_num(train)
print (test)
# Run the regression and generate predictions for the test set.
reg = Ridge(alpha=.1)
reg.fit(train, train_upvotes)
predictions = reg.predict(test)

### We're going to use mean absolute error as an error metric.
### Our error is about 13.6 upvotes, which means that, on average,
### our prediction is 13.6 upvotes away from the actual number of upvotes.
##print(sum(abs(predictions - test_upvotes)) / len(predictions))

### As a baseline, we'll use the average number of upvotes
### across all submissions.
### The error here is 17.2 -- our estimate is better, but not hugely so.
### There either isn't a ton of predictive value encoded in the
### data we have, or we aren't extracting it well.
##average_upvotes = sum(test_upvotes)/len(test_upvotes)
##print(sum(abs(average_upvotes - test_upvotes)) / len(predictions))
EDIT: Here is the error:
Traceback (most recent call last):
  File "C:/Users/Tucker Siegel/Desktop/Machines/Test.py", line 156, in <module>
    predictions = reg.predict(test)
  File "C:\Python27\lib\site-packages\sklearn\linear_model\base.py", line 200, in predict
    return self._decision_function(X)
  File "C:\Python27\lib\site-packages\sklearn\linear_model\base.py", line 183, in _decision_function
    X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 407, in check_array
    context))
ValueError: Found array with 0 sample(s) (shape=(0, 262)) while a minimum of 1 is required.

KNN distance and class vote

Can you please tell me how to properly calculate the distance between every point in my testData?
For now I am getting only one single value, whereas I should get the distance from each point in the data set and be able to assign it a class. I have to use numpy for this.
EDIT: Now the problem is that I am getting this error and don't know how to fix it:
KeyError: 0
I am trying to obtain the accuracy of the classified labels.
Any ideas, please?
import matplotlib.pyplot as plt
import random
import numpy as np
import operator
from sklearn.cross_validation import train_test_split

# In[1]
def readFile():
    f = open('iris.data', 'r')
    d = np.dtype([('features', np.float, (4,)), ('class', np.str_, 20)])
    data = np.genfromtxt(f, dtype=d, delimiter=",")
    dataPoints = data['features']
    labels = data['class']
    return dataPoints, labels

# In[2]
def normalizeData(dataPoints):
    # normalize the data so the values will be between 0 and 1
    dataPointsNorm = (dataPoints - dataPoints.min()) / (dataPoints.max() - dataPoints.min())
    return dataPointsNorm

def crossVal(dataPointsNorm):
    # splitting into train and test sets for crossvalidation
    trainData, testData = train_test_split(dataPointsNorm, test_size=0.20, random_state=25)
    return trainData, testData

def calculateDistance(trainData, testData):
    # Euclidean distance calculation on numpy arrays
    distance = np.sqrt(np.sum((trainData - testData)**2, axis=-1))
    # argsort sorts indices from closest to furthest neighbor, in ascending order
    sortDistance = distance.argsort()
    return distance, sortDistance

# In[4]
def classifyKnn(testData, trainData, labels, k):
    # Calculating nearest neighbours and assigning the class based on a majority vote
    classCount = {}
    for i in range(k):
        distance, sortedDistIndices = calculateDistance(trainData, testData[i])
        voteLabel = labels[sortedDistIndices][i]
        #print voteLabel
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    print 'Class Count: ', classCount
    # Sorting dictionary to return voted class
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0], classCount

def testAccuracy(testData, classCount):
    correct = 0
    for x in range(len(testData)):
        print 'HERE !!!!!!!!!!!!!!'
        if testData[x][-1] is classCount[x]:
            correct += 1
    return (correct / float(len(testData))) * 100.0

def main():
    dataPoints, labels = readFile()
    dataPointsNorm = normalizeData(dataPoints)
    trainData, testData = crossVal(dataPointsNorm)
    result, classCount = classifyKnn(testData, trainData, labels, 5)
    print result
    accuracy = testAccuracy(testData, classCount)
    print accuracy

main()
I have the data normalized and split into train and test sets; the distance calculation is what's wrong.
Thanks for any tips.
