I'm working on a project using Python (3.6) and scikit-learn. I have set up the classification, but when I try to reshape the data in order to use it with sklearn's fit method, it returns an error.
Here's what I have tried:
# Get all the columns from dataframe
columns = data.columns.tolist()
# Filter the columns to remove data we don't want
columns = [c for c in columns if c not in ["Class"] ]
# store the variable we want to predict on
target = "Class"
X = data.drop(target, 1)
Y = data[target]
# Print the shapes of X & Y
print(X.shape)
print(Y.shape)
# define a random state
state = 1
# define the outlier detection method
classifiers = {
    "Isolation Forest": IsolationForest(max_samples=len(X),
                                        contamination=outlier_fraction,
                                        random_state=state),
    "Local Outlier Factor": LocalOutlierFactor(
        n_neighbors=20,
        contamination=outlier_fraction)
}
# fit the model
n_outliers = len(Fraud)
for i, (clf_name, clf) in enumerate(classifiers.items()):
    # fit the data and tag outliers
    if clf_name == "Local Outlier Factor":
        y_pred = clf.fit_predict(X)
        scores_pred = clf.negative_outlier_factor_
    else:
        clf.fit(X)
        scores_pred = clf.decision_function(X)
        y_pred = clf.predict(X)
    # Reshape the prediction values to 0 for valid and 1 for fraudulent
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1
    n_errors = (y_pred != Y).sum()
    # run classification metrics
    print('{}:{}'.format(clf_name, n_errors))
    print(accuracy_score(Y, y_pred))
    print(classification_report(Y, y_pred))
Then it returns the following error:
ValueError: could not convert string to float: '301.48 Change: $0.00'
and it points to the `clf.fit(X)` line.
What have I configured wrong?
You can convert the dataset to numeric values based on the uniqueness of the entries in each column, and you can also drop unnecessary columns from the dataset.
Here's how you can try that:
import numpy as np
import pandas as pd

df_full = pd.read_excel('input/samp.xlsx', sheet_name=0)
df_full = df_full[df_full.filter(regex='^(?!Unnamed)').columns]
df_full.drop(['paymentdetails'], 1, inplace=True)
df_full.drop(['timestamp'], 1, inplace=True)

# Handle non-numeric data by mapping each unique string to an integer id
def handle_non_numaric_data(df_full):
    columns = df_full.columns.values
    for column in columns:
        text_digit_vals = {}
        def convert_to_int(val):
            return text_digit_vals[val]
        if df_full[column].dtype != np.int64 and df_full[column].dtype != np.float64:
            column_contents = df_full[column].values.tolist()
            unique_elements = set(column_contents)
            x = 0
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    text_digit_vals[unique] = x
                    x += 1
            df_full[column] = list(map(convert_to_int, df_full[column]))
    return df_full

df_full = handle_non_numaric_data(df_full)
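If you prefer to let pandas assign the integer codes for you, an alternative (a sketch, not part of the original answer; it assumes the string columns have dtype object) is:
# integer-encode every non-numeric column using pandas' categorical codes
for column in df_full.columns:
    if df_full[column].dtype == object:
        df_full[column] = df_full[column].astype('category').cat.codes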
Related
I have a function which takes 2 arrays: an array of model predictions and an array of true values. It works fine for 1-D arrays, but I need to adjust it for multidimensional arrays. I would like to compute my threshold_acc again, but this time for each column. How do I go about this?
threshold_acc represents the proportion of the set with error below the specified threshold.
Also, do I need to change my threshold to +/- since I've started seeing negative values in the multidimensional array, or is there a better error measure I could use?
import numpy as np
import pandas as pd
# sample data
np.random.seed(20)
dd = np.random.uniform(low=-20., high=20, size=(25, 4))
dp = np.random.uniform(low=5, high=25, size=(25, 4))
data = [dd, dp]
def inference(dummy_data, error_threshold=10):
    rel_err_list = []
    AE_error_list = []
    mse_list = []
    input_var = []
    true_var = []
    pred_var = []
    n_correct = 0; n_wrong = 0; n_inf = 0
    # Iterate through data loader and inference and evaluate data
    targets, outputs = data[0], data[1]
    for idx, (outputs, targets) in enumerate(zip(outputs, targets)):
        rel_error = np.abs(outputs - targets) / targets
        rel_error = rel_error * 100
        AE_error = np.abs(outputs - targets)
        if np.isfinite(rel_error).all():
            rel_err_list.append(rel_error)
            AE_error_list.append(AE_error)
            # Negative errors
            print(f"error: {rel_error} output: {outputs} target: {targets}")
        else: n_inf += 1
        if rel_error.all() < error_threshold:
            n_correct += 1
        else: n_wrong += 1
        true_var.append(targets)
        pred_var.append(outputs)
    median_err, max_err, min_err = np.median(rel_err_list), np.max(rel_err_list), np.min(rel_err_list)
    threshold_acc = ((n_correct * 1.0) / 25) * 100
    true_var = np.array(true_var)
    pred_var = np.array(pred_var)
    err_var = np.array(rel_err_list)
    AE_var = np.array(AE_error_list)
    true_var = np.reshape(true_var, dummy_data[0].shape)
    pred_var = np.reshape(pred_var, dummy_data[0].shape)
    err_var = np.reshape(err_var, dummy_data[0].shape)
    AE_var = np.reshape(AE_var, dummy_data[0].shape)
    results = np.concatenate([true_var, pred_var, err_var, AE_var], axis=1)
    results_df = pd.DataFrame(results)
    return median_err, max_err, min_err, threshold_acc, n_inf, n_wrong, results_df, pred_var
dd = np.random.uniform(low=1., high=20, size=(25, 1))
dp = np.random.uniform(low=5, high=25, size=(25, 1))
median_err, max_err, min_err, threshold_acc, n_inf, n_wrong, results_df, pred_var = inference(data, 10)
print(f"\nAverage relative error over valid predictions : {median_err:.3f} \nMax error over valid predictions : {max_err:.3f} \nMin error over valid predictions : {min_err:.3f}\nProportion of test set with accuracy over 90%: {threshold_acc:.3f}\n\n\
{n_inf} null predictions \n{n_wrong} incorrect (<90%) predictions \n{n_inf+n_wrong} null or incorrect predictions out of 25")
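To get threshold_acc per column, one option (a sketch, not from the original code; names such as threshold_acc_per_col are made up here) is to keep the arrays 2-D and reduce along axis 0. Using the absolute value of the targets in the denominator also keeps the relative error non-negative, which sidesteps the +/- threshold question raised above:
import numpy as np

np.random.seed(20)
targets = np.random.uniform(low=-20., high=20., size=(25, 4))
outputs = np.random.uniform(low=5., high=25., size=(25, 4))

# element-wise percentage error; abs() in the denominator keeps it non-negative
rel_error = np.abs(outputs - targets) / np.abs(targets) * 100

error_threshold = 10
# fraction of rows, per column, whose error is below the threshold
threshold_acc_per_col = (rel_error < error_threshold).mean(axis=0) * 100
print(threshold_acc_per_col)   # one value per column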
I found this code snippet to oversample my negative reviews so that my model trains better. When I ran it, I got the error below; it looks to be around idx. Does anyone have a good solution for this?
Passing list-likes to .loc or [] with any missing labels is no longer supported
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
from sklearn.utils import shuffle
import numpy as np

labels, num = np.unique(y_train, return_counts=True)
#print(labels)
u = min(labels)
initial = 1
# set the desired size of the oversampled cells
maxcnt = np.int(max(num) / 2)

for labl, n in zip(labels, num):
    x0 = X_train[y_train == labl]
    y0 = y_train[y_train == labl]
    # print(x0)
    remain = maxcnt
    print(remain)
    while remain >= n:
        if labl == u and initial == 1:
            X_Train = x0
            y_Train = y0
            remain -= n
            initial = 0
        else:
            X_Train = np.concatenate((X_Train, x0), axis=0)
            y_Train = np.concatenate((y_Train, y0), axis=0)
            remain -= n
    if remain > 0 and remain < n:
        idx = np.random.choice(np.arange(len(y0)), remain, replace=False)
        #print(idx)
        X_Train = np.concatenate((X_Train, x0[idx]), axis=0)
        y_Train = np.concatenate((y_Train, y0[idx]), axis=0)
        remain -= n

X_Train, y_Train = shuffle(X_Train, y_Train)
np.unique(X_Train, return_counts=True)
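If X_train and y_train are pandas objects, the "missing labels" error most likely comes from x0[idx] / y0[idx]: after the boolean filter, the index labels no longer run 0..len(y0)-1, so the positional idx values are treated as missing labels. A hedged fix (a sketch, assuming pandas inputs) is to convert to NumPy up front so the rest of the loop indexes positionally:
# sketch: assumes X_train is a DataFrame and y_train a Series
x0 = X_train[y_train == labl].to_numpy()
y0 = y_train[y_train == labl].to_numpy()
# with plain arrays, x0[idx] and y0[idx] in the loop above work as intended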
Below is YouTuber Sentdex's machine learning code, and I couldn't understand some parts of it.
import numpy as np
from sklearn.cluster import MeanShift, KMeans
from sklearn import preprocessing, model_selection
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel('titanic.xls')
original_df = pd.DataFrame.copy(df)
df.drop(['body', 'name'], 1, inplace=True)
df.fillna(0, inplace=True)
def handle_non_numerical_data(df):
    columns = df.columns.values
    for column in columns:
        text_digit_vals = {}
        def convert_to_int(val):
            return text_digit_vals[val]
        if df[column].dtype != np.int64 and df[column].dtype != np.float64:
            column_contents = df[column].values.tolist()
            unique_elements = set(column_contents)
            x = 0
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    # creating dict that contains new
                    # id per unique string
                    text_digit_vals[unique] = x
                    x += 1
            df[column] = list(map(convert_to_int, df[column]))
    return df
df = handle_non_numerical_data(df)
df.drop(['ticket', 'home.dest'], 1, inplace=True)
X = np.array(df.drop(['survived'], 1).astype(float))
X = preprocessing.scale(X)
y = np.array(df['survived'])
clf = MeanShift()
clf.fit(X)
labels= clf.labels_ ###Can't understand###
cluster_centers = clf.cluster_centers_
original_df['cluster_group'] = np.nan
for i in range(len(X)):
    original_df['cluster_group'].iloc[i] = labels[i]
n_clusters_ = len(np.unique(labels))
survival_rates = {}
for i in range(n_clusters_):
    temp_df = original_df[(original_df['cluster_group'] == float(i))]
    # print(temp_df.head())
    survival_cluster = temp_df[(temp_df['survived'] == 1)]
    survival_rate = len(survival_cluster) / len(temp_df)
    # print(i, survival_rate)
    survival_rates[i] = survival_rate
print(survival_rates)
Supposedly, in "labels = clf.labels_", the labels are in [0 : 5] (when I ran the program I got those numbers). But here's the question: where do those numbers come from? And why 0, 1, 2? Why not bigger numbers?
scikit-learn's documentation on MeanShift explains the labels_ attribute that you seem confused about. Taken directly from the documentation:
labels_ :
Labels of each point.
If you're still confused about what these labels represent, the short explanation is that each number refers to the bin that the specific point was clustered into. So all the points with a value of 0 belong to one cluster, all the points with a value of 1 belong to another, and so on. What the values themselves are doesn't really matter; they're just there to identify which cluster each data point belongs to.
I'd recommend reading more about clustering if you're still confused about why you would want to label the data.
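A tiny, self-contained illustration of what labels_ holds (this example is not from the original post):
import numpy as np
from sklearn.cluster import MeanShift

# two well-separated groups of 2-D points
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])
ms = MeanShift(bandwidth=2).fit(X)
print(ms.labels_)           # e.g. [0 0 0 1 1 1]: one integer cluster id per point
print(ms.cluster_centers_)  # one centre per discovered cluster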
So, I have code that works for knn.predict() if I have data that has 1 feature to predict the next outcome. To put this into context, I have stock data (Open, High, Low, Close) where I use "Open" as "X" data and "Close" as "Y" data and knn.predict will predict the next value of Y.
When I try to use "Open, High, Low" columns (3 features) for my X data, I get the following error:
File "sklearn\neighbors\binary_tree.pxi", line 1294, in sklearn.neighbors.kd_tree.BinaryTree.query
ValueError: query data dimension must match training data dimension
I believe it's because of my X.shape and Y.shape where X is not the same size as Y but I don't understand how to fix it. How do you use KNN for multifeature analysis if X and Y must be the same size?
Some of the Code:
df = df[['Date','Time', 'Open', 'High', 'Low', 'Close']]
df.head()
# Predictor Variables
df['Open'] = df.Open
df['High'] = df.High
df['Low'] = df.Low
df['Close'] = df.Close
df = df.dropna()
#Data = np.delete(arr = df, obj=0, axis = 0)
X = np.array(df.ix[:, 2:6])
#X.head()
print X.shape
# Target Variable
Y = np.where(df['Close'].shift(-1)>df['Close'],1,-1)
#print (Y)
#Predict
u = df['Close'].iloc[-1]
#print u
new_prediction = knn.predict(u)
print new_prediction
For training, you're using
X = np.array(df.ix[:, 2:6])
i.e., a matrix with 6 - 2 = 4 columns, meaning that the neighbors are 4-tuples.
For predicting, you're using
u = df['Close'].iloc[-1]
which is a scalar.
The nearest neighbor is undefined, and sklearn is very unhappy.
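One way to make the query match the training dimension is to pass predict a row with the same four columns used to build X, rather than a scalar. A sketch (it assumes knn has already been fit on X above and keeps the question's column layout):
u = df.iloc[[-1], 2:6].astype(float).values   # shape (1, 4): same features as the training data
new_prediction = knn.predict(u)
print(new_prediction)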
I have a dataset (NumPy arrays) with 50 classes and 9000 training examples.
x_train=(9000,2048)
y_train=(9000,) # Classes are strings
classes=list(set(y_train))
I would like to build a sub-dataset such that each class has 5 examples,
which means I get 5*50 = 250 training examples. Hence my sub-dataset will take this form:
sub_train_data=(250,2048)
sub_train_labels=(250,)
Remark: we randomly take 5 examples from each class (total number of classes = 50).
Thank you
Here is a solution for that problem:
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

def balanced_sample_maker(X, y, sample_size, random_seed=42):
    uniq_levels = np.unique(y)
    uniq_counts = {level: sum(y == level) for level in uniq_levels}
    if random_seed is not None:
        np.random.seed(random_seed)
    # find observation index of each class levels
    groupby_levels = {}
    for ii, level in enumerate(uniq_levels):
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx
    # oversampling on observations of each label
    balanced_copy_idx = []
    for gb_level, gb_idx in groupby_levels.items():
        over_sample_idx = np.random.choice(gb_idx, size=sample_size, replace=True).tolist()
        balanced_copy_idx += over_sample_idx
    np.random.shuffle(balanced_copy_idx)
    data_train = X[balanced_copy_idx]
    labels_train = y[balanced_copy_idx]
    if len(data_train) == sample_size * len(uniq_levels):
        print('number of sampled example ', sample_size * len(uniq_levels), 'number of sample per class ', sample_size, ' #classes: ', len(list(set(uniq_levels))))
    else:
        print('number of samples is wrong ')
    labels, values = zip(*Counter(labels_train).items())
    print('number of classes ', len(list(set(labels_train))))
    check = all(x == values[0] for x in values)
    print(check)
    if check:
        print('Good all classes have the same number of examples')
    else:
        print('Repeat again your sampling your classes are not balanced')
    indexes = np.arange(len(labels))
    width = 0.5
    plt.bar(indexes, values, width)
    plt.xticks(indexes + width * 0.5, labels)
    plt.show()
    return data_train, labels_train

X_train, y_train = balanced_sample_maker(X, y, 10)
inspired by Scikit-learn balanced subsampling
Pure numpy solution:
def sample(X, y, samples):
    unique_ys = np.unique(y, axis=0)
    result = []
    for unique_y in unique_ys:
        val_indices = np.argwhere(y == unique_y).flatten()
        random_samples = np.random.choice(val_indices, samples, replace=False)
        result.append(X[random_samples])
    return np.concatenate(result)
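The helper above returns only the sampled features. If you also need sub_train_labels, a sketch of a variant (the names here are made up) that gathers both, under the question's shapes of (9000, 2048) features and (9000,) string labels, could look like this:
import numpy as np

def sample_with_labels(X, y, samples_per_class):
    X_parts, y_parts = [], []
    for unique_y in np.unique(y):
        val_indices = np.argwhere(y == unique_y).flatten()
        chosen = np.random.choice(val_indices, samples_per_class, replace=False)
        X_parts.append(X[chosen])
        y_parts.append(y[chosen])
    return np.concatenate(X_parts), np.concatenate(y_parts)

# sub_train_data, sub_train_labels = sample_with_labels(x_train, y_train, 5)  # (250, 2048), (250,)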
I usually use a trick from scikit-learn for this. I use the StratifiedShuffleSplit function. So if I have to select 1/n fraction of my train set, I divide the data into n folds and set the proportion of test data (test_size) as 1-1/n. Here is an example where I use only 1/10 of my data.
from sklearn.model_selection import StratifiedShuffleSplit

sp = StratifiedShuffleSplit(n_splits=1, test_size=0.9, random_state=seed)
for train_index, _ in sp.split(x_train, y_train):
    x_train, y_train = x_train[train_index], y_train[train_index]
You can use a dataframe as input (as in my case) and use the simple code below:
col = target
nsamples = min(t4m[col].value_counts().values)
res = pd.DataFrame()
for val in t4m[col].unique():
    t = t4m.loc[t4m[col] == val].sample(nsamples)
    res = pd.concat([res, t], ignore_index=True).sample(frac=1)
col is the name of your column with the classes. The code finds the minority class size, takes a sample of that size from each class, and shuffles the resulting dataframe.
Then you can convert the result back to an np.array.
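For example (a sketch; col and res are the names from the snippet above):
X_bal = res.drop(columns=[col]).to_numpy()   # feature matrix without the class column
y_bal = res[col].to_numpy()                  # balanced class labels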