Error: "Found input variables with inconsistent numbers of samples: [5114, 3409]" - python

I want to follow the steps below:
1. Load data
2. Divide into label & feature sets
3. Normalize data
4. Divide into test & training sets
5. Implement oversampling (SMOTE)
Is this the correct order of steps, or am I doing something wrong? I keep getting an error saying "Found input variables with inconsistent numbers of samples: [5114, 3409]".
This error occurs on line: X_train,Y_train = smote.fit_sample(X_train,Y_train)
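import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
# Data loading
dataset = pd.read_csv('data.csv')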
#view data and check for null values
print(dataset.isnull().values.any())
print(dataset.shape)
# Dividing dataset into label and feature sets
X = dataset.drop('Bankrupt?', axis = 1) # Features
Y = dataset['Bankrupt?'] # Labels
print(type(X))
print(type(Y))
print(X.shape)
print(Y.shape)
# Normalizing numerical features so that each feature has mean 0 and variance 1
feature_scaler = StandardScaler()
X_scaled = feature_scaler.fit_transform(X)
# Dividing dataset into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split( X_scaled, Y, test_size = 0.5, random_state = 100)
print(X_train.shape)
print(X_test.shape)
X = dataset.iloc[:,1:].values
y = dataset.iloc[:,0].values.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Implementing Oversampling to balance the dataset;
print("Number of observations in each class before oversampling (training data): \n", pd.Series(Y_train).value_counts())
smote = SMOTE(random_state = 101)
X_train,Y_train = smote.fit_sample(X_train,Y_train)
print("Number of observations in each class after oversampling (training data): \n", pd.Series(Y_train).value_counts())
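Judging from the code as posted, a likely cause is the second train_test_split call (the one built from dataset.iloc): it overwrites X_train but stores its labels in y_train (lowercase), so SMOTE receives X_train from the second split and Y_train from the first. The sketch below reproduces the two numbers in the error with plain arithmetic. It assumes the dataset has 6819 rows (not stated in the post, but consistent with both numbers) and that a fractional test_size gives the test set ceil(test_size * n_samples) rows, with 0.25 as the default. The train_rows helper is hypothetical, not a scikit-learn function.

```python
import math

def train_rows(n_samples, test_size=0.25):
    """Rows left for training after a fractional split (hypothetical helper).

    Assumes the test set gets ceil(test_size * n_samples) rows and the
    training set gets the remainder.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test

n_samples = 6819  # assumed dataset size, not stated in the post

# First split: test_size=0.5, this is where Y_train comes from
rows_first_split = train_rows(n_samples, test_size=0.5)

# Second split: default test_size, this is where X_train ends up coming from
rows_second_split = train_rows(n_samples)

print([rows_second_split, rows_first_split])  # [5114, 3409]
```

If that is the cause, removing the second split (or using one consistent pair of variable names) should make the shapes agree. Separately, recent imbalanced-learn releases name the method fit_resample rather than fit_sample, and the scaler is commonly fitted on the training portion only, after the split.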

Related

How to correctly preprocess data from a dask dataframe to feed into an ML model

I'm working on a project with a very big dataset, NF-UQ-NIDS. I couldn't even fit it into a pandas DataFrame, so I decided to use dask, but I'm having problems.
I might be doing something else wrong, but when I try to train_test_split X and y, I can't do it without converting them to dask arrays. The train_test_split then gives y the wrong shape: it should have 7 columns, since I use 7 classification labels, but it comes out with shape (x, 42), the same as X.
Here is a reproducible sample (the dataset is in the link above):
import dask.dataframe as dd
df = dd.read_hdf(root_folder+"hdf/"+hdf_name,hdf_name.split(".")[0])
def encode_numeric_zscore(df, name, mean=None, standard_deviation=None):
    if mean is None:
        mean = df[name].mean()
    if standard_deviation is None:
        standard_deviation = df[name].std()
    df[name] = (df[name] - mean) / standard_deviation
for column in df.columns:
    if column != 'attack_map':
        encode_numeric_zscore(df, column)
X_columns = df.columns.drop('attack_map')
X = df[X_columns].values
y = dd.get_dummies(df['attack_map'].to_frame().categorize()).values
print(type(X))
print(type(y))
X = df.to_dask_array(lengths=True)
y = df.to_dask_array(lengths=True)
print(type(X))
print(type(y))
X.compute()
y.compute()
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=2)
print(X_train.shape, y_train.shape)
print(X_val.shape, y_val.shape)
If the problem is in the train/test split, use train_test_split from dask-ml rather than the scikit-learn one when working with a dask dataframe, series, or array.
Link : https://ml.dask.org/modules/generated/dask_ml.model_selection.train_test_split.html

Relate the predicted value to its index/identification number

I am training a model to predict true or false based on some data. I drop the product number from the list of features when training and testing the model.
X = df.drop(columns = 'Product Number', axis = 1)
y = df['result']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
SVC = LinearSVC(max_iter = 1200)
SVC.fit(X_train, y_train)
y_pred = SVC.predict(X_test)
Is there any way for me to recover the product number and its features for the item that has passed or failed? How do I get/relate the results of y_pred to which product number it corresponds to?
I also plan on using cross validation so the data gets shuffled, would there still be a way for me to recover the product number for each test item?
I realised I'm using cross validation only to evaluate my model's performance so I decided to just run my code without shuffling the data to see the results for each datapoint.
Edit: For evaluation without cross validation, I drop the irrelevant columns only when I pass it to the classifier as shown below:
cols = ['id', 'label']
X = train_data.copy()
y = train_data['label']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=2)
knn = make_pipeline(StandardScaler(),KNeighborsClassifier(n_neighbors=10))
y_val_pred = knn.fit(X_train.drop(columns=cols), y_train).predict(X_val.drop(columns=cols))
X_val['y_val_pred'] = y_val_pred
I join the y_val_pred after prediction to check which datapoints have been misclassified.
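The idea in the edit above (keep the identifier through the split and leave it out only when calling the classifier) can be sketched without any library. This is a minimal illustration, not the poster's code: the rows, the split_rows helper, and the threshold rule standing in for a trained model are all hypothetical.

```python
import random

# Hypothetical rows: each keeps its identifier next to its feature and label.
rows = [
    {"id": f"P{i:03d}", "feature": i / 10, "label": i >= 5}
    for i in range(10)
]

def split_rows(rows, test_fraction=0.2, seed=2):
    """Shuffle a copy of the rows and split it; identifiers travel with each row."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]

def predict(feature):
    """Stand-in for a trained classifier (hypothetical threshold rule)."""
    return feature >= 0.5

train, val = split_rows(rows)

# Only the feature is passed to the model; the id is used afterwards to label the result.
results = [
    {"id": r["id"], "label": r["label"], "pred": predict(r["feature"])}
    for r in val
]
misclassified = [r["id"] for r in results if r["pred"] != r["label"]]

for r in results:
    print(r["id"], r["label"], r["pred"])
```

With pandas the same alignment comes from the index: train_test_split keeps the original DataFrame index on the test rows, so predictions can be attached as a Series built with that index, even when the data is shuffled.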

Fill missing values (NaN) by regression on other columns

I've got a dataset containing a lot of missing values (NaN). I want to use linear or multiple linear regression in Python to fill all the missing values. You can find the dataset here: Dataset
I used f_regression(X_train, Y_train) to select which features to use.
First I converted df['country'] to dummy variables, then selected the important features and ran the regression, but the results are not good.
I have defined the following functions to select features and predict missing values:
def select_features(target,df):
    '''Get dataset and target and print which features are important.'''
    df_dummies = pd.get_dummies(df,prefix='',prefix_sep='',drop_first=True)
    df_nonan = df_dummies.dropna()
    X = df_nonan.drop([target],axis=1)
    Y = df_nonan[target]
    X = pd.get_dummies(X)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=40)
    f,pval = f_regression(X_train, Y_train)
    inds = np.argsort(pval)[::1]
    results = pd.DataFrame(np.vstack((f[inds],pval[inds])), columns=X_train.columns[inds], index=['f_values','p_values']).iloc[:,:15]
    print(results)
And I have defined the following function to predict missing values:
def train(target,features,df,deg=1):
    '''Get dataset, target and features and predict nan in target column'''
    df_dummies = pd.get_dummies(df,prefix='',prefix_sep='',drop_first=True)
    df_nonan = df_dummies[[*features,target]].dropna()
    X = df_nonan.drop([target],axis=1)
    Y = df_nonan[target]
    pol = PolynomialFeatures(degree=deg)
    X=X[features]
    X = pd.get_dummies(X)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.40, random_state=40)
    X_test, X_val, Y_test, Y_val = train_test_split(X_test, Y_test, test_size=0.50, random_state=40)
    # X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
    X_train_n = pol.fit_transform(X_train)
    reg = linear_model.Lasso()
    reg.fit(X_train_n,Y_train);
    X_test_n = pol.fit_transform(X_test)
    Y_predtrain = reg.predict(X_train_n)
    print('train',r2_score(Y_train, Y_predtrain))
    Y_pred = reg.predict(X_test_n)
    print('test',r2_score(Y_test, Y_pred))
    # val
    X_val_n = pol.fit_transform(X_val)
    X_val_n.shape,X_train_n.shape,X_test_n.shape
    Y_valpred = reg.predict(X_val_n)
    print('val',r2_score(Y_val, Y_valpred))
    X_names = X.columns.values
    X_new = df_dummies[X_names].dropna()
    X_new = X_new[df_dummies[target].isna()]
    X_new_n = pol.fit_transform(X_new)
    Y_new = df_dummies.loc[X_new.index,target]
    Y_new = reg.predict(X_new_n)
    Y_new = pd.Series(Y_new, index=X_new.index)
    Y_new.head()
    return Y_new, X_names, X_new.index
Then I use these functions to fill NaN values for features with p-values < 0.05.
But I am not sure whether this is a good approach.
With this approach, many missing values remain unpredicted.
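One reason many values can remain unfilled with this approach is that a regression can only predict the target for rows where every predictor is present; the dropna() on the feature columns removes the rest. The sketch below shows this with a single predictor and ordinary least squares in plain Python. The data and the fit_line and impute helpers are made up for illustration and are not the poster's functions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def impute(rows):
    """Fill missing 'y' from 'x' where possible; return filled rows and the count left missing."""
    complete = [r for r in rows if r["x"] is not None and r["y"] is not None]
    slope, intercept = fit_line([r["x"] for r in complete], [r["y"] for r in complete])
    filled, still_missing = [], 0
    for r in rows:
        r = dict(r)  # copy so the input rows are left untouched
        if r["y"] is None:
            if r["x"] is not None:
                r["y"] = slope * r["x"] + intercept
            else:
                still_missing += 1  # predictor is missing too, so no prediction is possible
        filled.append(r)
    return filled, still_missing

# Made-up data following y = 2x + 1, with None marking missing values
rows = [
    {"x": 1.0, "y": 3.0},
    {"x": 2.0, "y": 5.0},
    {"x": 3.0, "y": 7.0},
    {"x": 4.0, "y": None},   # can be filled from x
    {"x": None, "y": None},  # cannot be filled
]
filled, still_missing = impute(rows)
print(filled[3]["y"], still_missing)
```

Filling the predictors first with a simpler rule, or repeating the regression column by column until the values settle, are common ways to reduce what is left over; whether either suits this dataset is a judgment call.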

What do the results on a Sci-Kit machine learning program represent?

I am working through Google's Machine Learning videos and completed a program that uses a dataset storing information about flowers. The program runs successfully, but I'm having trouble understanding the results:
from scipy.spatial import distance
def euc(a,b):
    return distance.euclidean(a, b)
class ScrappyKNN():
    def fit(self, x_train, y_train):
        self.x_train = x_train
        self.y_train = y_train
    def predict(self, x_test):
        predictions = []
        for row in x_test:
            label = self.closest(row)
            predictions.append(label)
        return predictions
    def closest(self, row):
        best_dist = euc(row, self.x_train[0])
        best_index = 0
        for i in range(1, len(self.x_train)):
            dist = euc(row, self.x_train[i])
            if dist < best_dist:
                best_dist = dist
                best_index = i
        return self.y_train[best_index]
from sklearn import datasets
iris = datasets.load_iris()
x = iris.data
y = iris.target
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size =.5)
print(x_train.shape, x_test.shape)
my_classifier = ScrappyKNN()
my_classifier.fit(x_train, y_train)
prediction = my_classifier.predict(x_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, prediction))
Results are as follows:
(75, 4) (75, 4)
0.96
The 96% is the accuracy, but what exactly do the 75 and 4 represent?
You are printing the shapes of the datasets on this line:
print(x_train.shape, x_test.shape)
Both x_train and x_test seem to have 75 rows (i.e. data points) and 4 columns (i.e. features) each. Unless you had an odd number of data points, these dimensions should be the same since you are performing a 50/50 training/testing data split on this line:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size =.5)
It looks like you are coding the K Nearest Neighbours classifier from scratch using the Euclidean distance.
With x_train, x_test, y_train, y_test = train_test_split(x,y, test_size =.5) you split the data into 50% training and 50% test data. scikit-learn's train_test_split splits the data by rows, so the number of features (columns) is the same in both parts. Hence each (75, 4) is the number of rows followed by the number of features, for the training set and the test set respectively.
The accuracy score of 0.96 means that, of the 75 rows in your test set, 96% are predicted correctly.
It compares the true labels of your test set (y_test) with the predicted labels (the output of prediction = my_classifier.predict(x_test)).
In confusion-matrix terms, accuracy is (TP + TN) / (TP + TN + FP + FN): TP and TN are the numbers of correct predictions, while TP + TN + FP + FN sums to 75 (the total number of rows you are testing).
Note: When performing a train/test split it's usually a good idea to split the data 80/20 instead of 50/50, so that the model has more data to train on.
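To make the 0.96 concrete: accuracy is the fraction of test rows whose predicted label equals the true label. The sketch below uses made-up labels (not the actual iris split) in which 72 of 75 predictions match, and computes the score by hand. The accuracy helper is hypothetical and only mirrors what accuracy_score reports for plain label lists.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up labels for 75 test rows across three classes (0, 1, 2)
y_test = [i % 3 for i in range(75)]
prediction = list(y_test)
for i in (0, 1, 2):  # flip three predictions to a wrong class
    prediction[i] = (y_test[i] + 1) % 3

print(accuracy(y_test, prediction))  # 72 / 75 = 0.96
```

Three wrong predictions out of 75 is what a score of 0.96 corresponds to at this test-set size.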

How to get a specific row for testing and other for training?

I want to test a specific row from my dataset and to see the result, but I don't know how to do it. For example I want to test row number 100 and then to see the accuracy.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
feature_cols = [0,1,2,3,4,5]
X = df[feature_cols] # Features
y = df[6] # Target variable
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1,
                                                    random_state=1)
# Create Decision Tree classifier object
clf = DecisionTreeClassifier(max_depth=5)
# Train Decision Tree classifier
clf = clf.fit(X_train,y_train)
# Predict the response for test dataset
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
I recommend excluding the row you want to test from the dataset.
test_row = 100
train_idx = np.arange(X.shape[0]) != test_row
test_idx = np.arange(X.shape[0]) == test_row
X_train = X[train_idx]
y_train = y[train_idx]
X_test = X[test_idx]
y_test = y[test_idx]
Now X_test will contain a single row. However, the accuracy will now be either 0 or 1 since you are only testing one sample.
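The same hold-out idea can be sketched in plain Python with made-up data. The hold_out_row helper is hypothetical, and a nearest-neighbour lookup stands in for the decision tree, so this shows the mechanics of the split rather than the poster's model.

```python
def hold_out_row(X, y, test_row):
    """Split X and y so that only `test_row` is in the test set."""
    X_train = [x for i, x in enumerate(X) if i != test_row]
    y_train = [label for i, label in enumerate(y) if i != test_row]
    return X_train, [X[test_row]], y_train, [y[test_row]]

def predict_nearest(X_train, y_train, x):
    """Stand-in model: label of the closest training row (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in X_train]
    return y_train[distances.index(min(distances))]

# Made-up data: two features per row, binary target
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [1.0, 0.9], [0.9, 1.1], [1.2, 1.0]]
y = [0, 0, 0, 1, 1, 1]

X_train, X_test, y_train, y_test = hold_out_row(X, y, test_row=4)
y_pred = [predict_nearest(X_train, y_train, x) for x in X_test]

# With one test row, accuracy can only be 0.0 or 1.0
accuracy = sum(p == t for p, t in zip(y_pred, y_test)) / len(y_test)
print(len(X_train), len(X_test), accuracy)
```

Because a single row gives only a pass/fail answer, repeating the hold-out for several rows and averaging the results gives a more informative picture of how the model behaves.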
