Function for cross validation and oversampling (SMOTE) - python

I wrote the below code. X is a dataframe with the shape (1000,5) and y is a dataframe with shape (1000,1). y is the target data to predict, and it is imbalanced. I want to apply cross validation and SMOTE.
def Learning(n, est, X, y):
    s_k_fold = StratifiedKFold(n_splits = n)
    acc_scores = []
    rec_scores = []
    f1_scores = []
    for train_index, test_index in s_k_fold.split(X, y):
        X_train = X[train_index]
        y_train = y[train_index]
        sm = SMOTE(random_state=42)
        X_resampled, y_resampled = sm.fit_resample(X_train, y_train)
        X_test = X[test_index]
        y_test = y[test_index]
        est.fit(X_resampled, y_resampled)
        y_pred = est.predict(X_test)
        acc_scores.append(accuracy_score(y_test, y_pred))
        rec_scores.append(recall_score(y_test, y_pred))
        f1_scores.append(f1_score(y_test, y_pred))
    print('Accuracy:',np.mean(acc_scores))
    print('Recall:',np.mean(rec_scores))
    print('F1:',np.mean(f1_scores))
Learning(3, SGDClassifier(), X_train_s_pca, y_train)
When I run the code, I get the below error:
None of [Int64Index([ 4231, 4235, 4246, 4250, 4255, 4295, 4317,
4344, 4381,\n 4387,\n ...\n 13122,
13123, 13124, 13125, 13126, 13127, 13128, 13129, 13130,\n
13131],\n dtype='int64', length=8754)] are in the [columns]"
Any help getting it to run is appreciated.

If you look carefully at the error stack trace (which is important, but which you did not include), you should see that the error comes from this line (and will also come from the other similar lines):
X_train = X[train_index]
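Selecting rows this way only works for NumPy arrays. On a pandas DataFrame, X[...] selects columns, which is why the error says the indices are not in the [columns]. Since the split returns row positions, select them with iloc (loc looks rows up by label, so it gives the same rows only when the index is the default 0..n-1 range):
X_train = X.iloc[train_index]
Alternatively, to minimize code changes, you can convert the DataFrames to NumPy arrays by using values: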
Learning(3, SGDClassifier(), X_train_s_pca.values, y_train.values)
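The difference between positional and label-based selection can be seen in a small sketch. The toy DataFrame below is hypothetical (its index deliberately starts at 100 rather than 0), and the SMOTE and estimator steps from the question are left out so that only the indexing is shown:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 12 rows whose index labels start at 100, not 0.
X = pd.DataFrame({"a": np.arange(12), "b": np.arange(12) * 2},
                 index=np.arange(100, 112))
y = pd.Series([0] * 8 + [1] * 4, index=X.index)

skf = StratifiedKFold(n_splits=2)
fold_sizes = []
for train_index, test_index in skf.split(X, y):
    # train_index / test_index hold row positions (0..11), not index labels,
    # so positional selection with iloc is the matching accessor.
    X_train, y_train = X.iloc[train_index], y.iloc[train_index]
    X_test, y_test = X.iloc[test_index], y.iloc[test_index]
    fold_sizes.append((len(X_train), len(X_test)))

# Label-based lookup fails here because the labels 0..11 are not in the index.
try:
    X.loc[train_index]
    loc_failed = False
except KeyError:
    loc_failed = True

print(fold_sizes, loc_failed)
```

With a default RangeIndex, loc and iloc would select the same rows. The two diverge as soon as the frame has been shuffled, filtered, or split beforehand.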

Related

How to correctly pre-process data from a dask dataframe to feed into an ML model

I'm working on a project with a very big dataset, NF-UQ-NIDS. I couldn't even fit it in a pandas DataFrame, so I decided to use dask, but I'm having problems.
I might be doing something else wrong, but when I try to train_test_split X and y, I can't do it without converting them to dask arrays. The train_test_split then produces the wrong shape for y: it should have 7 columns, since I use 7 classification labels, but it comes out with shape (x, 42), which is the same shape as X.
Here is a reproducible sample; the dataset is in the link above:
df = dd.read_hdf(root_folder+"hdf/"+hdf_name,hdf_name.split(".")[0])
def encode_numeric_zscore(df, name, mean=None, standard_deviation=None):
    if mean is None:
        mean = df[name].mean()
    if standard_deviation is None:
        standard_deviation = df[name].std()
    df[name] = (df[name] - mean) / standard_deviation
for column in df.columns:
    if(column != 'attack_map'): encode_numeric_zscore(df,column)
X_columns = df.columns.drop('attack_map')
X = df[X_columns].values
y = dd.get_dummies(df['attack_map'].to_frame().categorize()).values
print(type(X))
print(type(y))
X = df.to_dask_array(lengths=True)
y = df.to_dask_array(lengths=True)
print(type(X))
print(type(y))
X.compute()
y.compute()
X_train, X_val, y_train, y_val = train_test_split(
X, y, test_size=0.2, shuffle=True, random_state=2)
print(X_train.shape, y_train.shape)
print(X_val.shape, y_val.shape)
If the problem is in the train/test split, use train_test_split from dask-ml rather than the one from scikit-learn when working with a dask dataframe, series, or array.
Link : https://ml.dask.org/modules/generated/dask_ml.model_selection.train_test_split.html
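Separately from which train_test_split is used, note that the sample in the question assigns both X and y from df.to_dask_array(lengths=True), so both hold the same full table and therefore have the same shape. The sketch below only illustrates the intended shapes. It uses plain pandas as a stand-in rather than dask, and the column names and sizes are hypothetical. With dask, the same idea would presumably mean converting the feature frame and the dummy-encoded label frame separately.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: 3 feature columns, 1 label column.
df = pd.DataFrame({
    "f1": np.arange(10.0),
    "f2": np.arange(10.0) * 2,
    "f3": np.arange(10.0) ** 2,
    "attack_map": ["a", "b"] * 5,
})

# Build X from the feature columns only, and y from the one-hot labels only.
X = df.drop(columns="attack_map").to_numpy()
y = pd.get_dummies(df["attack_map"]).to_numpy()

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=2)

# X has one column per feature, y has one column per class.
print(X_train.shape, y_train.shape)
print(X_val.shape, y_val.shape)
```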

How to output predicted values as a string in excel?

I was able to output my predicted numerical values to an Excel file, but I was wondering whether it is possible to export the actual string instead of the numerical value.
Currently it looks like this:
Column 1             Answer Key   Predicted
Something Something  Cars         3
Instead of returning 3, I would like 3 to be replaced by the actual string it is associated with (for example, Truck).
Here is my code so far. I know I have to change the exporting part of the code, but I cannot figure it out.
texts = df['without_Tags'].astype('str')
vector = TfidfVectorizer(ngram_range=(1, 2), min_df = 2, max_df = .95)
X = vector.fit_transform(texts) #features
LE = LabelEncoder()
df['tower_values'] = LE.fit_transform(df['Tower'])
y = df['tower_values'].values
print(X.shape)
print(y.shape)
lsa = TruncatedSVD (n_components=100, n_iter=10, random_state=3)
X = lsa.fit_transform(X)
print(X.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, shuffle = True, stratify = y, random_state = 3)
SG = SGDClassifier(random_state=3, loss='log')
SG.fit(X_train, y_train)
y_pred = SG.predict(X_test)
print("SG model accuracy:", accuracy_score(y_test, y_pred))
print("SG model Recall:", recall_score(y_test, y_pred, average="macro"))
print("SG model Precision:", precision_score(y_test, y_pred, average="macro"))
print("SG model F1 Score:", f1_score(y_test, y_pred, average="macro"))
y_pred = pd.DataFrame(y_pred, columns=['predictions']).to_csv('prediction.csv')
final = pd.read_csv('prediction.csv')
final['pre'] = y_pred
df.to_csv('prediction.csv')
Try the inverse_transform method of your fitted LabelEncoder, which maps the integer codes back to the original strings:
y_pred = LE.inverse_transform(y_pred)
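As a minimal sketch of how this fits into the export step: the labels and predicted codes below are hypothetical stand-ins for df['Tower'] and SG.predict(X_test), and the output column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for df['Tower'].
towers = pd.Series(["Cars", "Truck", "Cars", "Bikes"])

LE = LabelEncoder()
codes = LE.fit_transform(towers)  # classes are sorted alphabetically before coding

# Stand-in for the integer codes returned by SG.predict(X_test).
y_pred = np.array([2, 1, 0])

# Keep both the code and the decoded string side by side.
out = pd.DataFrame({
    "predicted_code": y_pred,
    "predicted_label": LE.inverse_transform(y_pred),
})

# to_csv() without a path returns the CSV text; pass a filename to write a file.
csv_text = out.to_csv(index=False)
print(csv_text)
```

The same encoder instance that was fitted on the training labels has to be used for decoding, since the mapping lives in its classes_ attribute.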

Fill missing values (NaN) by regression on other columns

I've got a dataset containing a lot of missing values (NaN). I want to use linear or multilinear regression in Python to fill all the missing values. You can find the dataset here: Dataset
I have used f_regression(X_train, Y_train) to select which features to use.
First I converted df['country'] to dummy variables, then I used the important features in a regression, but the results are not good.
I have defined the following function to select features:
def select_features(target,df):
    '''Get dataset and target and print which features are important.'''
    df_dummies = pd.get_dummies(df,prefix='',prefix_sep='',drop_first=True)
    df_nonan = df_dummies.dropna()
    X = df_nonan.drop([target],axis=1)
    Y = df_nonan[target]
    X = pd.get_dummies(X)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=40)
    f,pval = f_regression(X_train, Y_train)
    inds = np.argsort(pval)[::1]
    results = pd.DataFrame(np.vstack((f[inds],pval[inds])), columns=X_train.columns[inds], index=['f_values','p_values']).iloc[:,:15]
    print(results)
And I have defined the following function to predict missing values:
def train(target,features,df,deg=1):
    '''Get dataset, target and features and predict nan in target column'''
    df_dummies = pd.get_dummies(df,prefix='',prefix_sep='',drop_first=True)
    df_nonan = df_dummies[[*features,target]].dropna()
    X = df_nonan.drop([target],axis=1)
    Y = df_nonan[target]
    pol = PolynomialFeatures(degree=deg)
    X=X[features]
    X = pd.get_dummies(X)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.40, random_state=40)
    X_test, X_val, Y_test, Y_val = train_test_split(X_test, Y_test, test_size=0.50, random_state=40)
    # X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
    X_train_n = pol.fit_transform(X_train)
    reg = linear_model.Lasso()
    reg.fit(X_train_n,Y_train);
    X_test_n = pol.fit_transform(X_test)
    Y_predtrain = reg.predict(X_train_n)
    print('train',r2_score(Y_train, Y_predtrain))
    Y_pred = reg.predict(X_test_n)
    print('test',r2_score(Y_test, Y_pred))
    # val
    X_val_n = pol.fit_transform(X_val)
    X_val_n.shape,X_train_n.shape,X_test_n.shape
    Y_valpred = reg.predict(X_val_n)
    print('val',r2_score(Y_val, Y_valpred))
    X_names = X.columns.values
    X_new = df_dummies[X_names].dropna()
    X_new = X_new[df_dummies[target].isna()]
    X_new_n = pol.fit_transform(X_new)
    Y_new = df_dummies.loc[X_new.index,target]
    Y_new = reg.predict(X_new_n)
    Y_new = pd.Series(Y_new, index=X_new.index)
    Y_new.head()
    return Y_new, X_names, X_new.index
Then I use these functions to fill the NaN values, with the features that have p_values < 0.05.
But I am not sure whether this is a good approach.
This way, many missing values remain unpredicted.

Over-Sampling Class Imbalance Train/Test Split "Found input variables with inconsistent numbers of samples" Solution?

I am trying to follow this article to perform over-sampling for imbalanced classification. My class ratio is about 8:1.
https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets/notebook
I am confused about the pipeline and coding structure.
Should you over-sample after the train/test split?
If so, how do you deal with the fact that the target label is dropped from X? I tried keeping the label in X, performing the over-sampling, then dropping the label from X_train and X_test and using the new training set in my pipeline.
However, I get the error "Found input variables with inconsistent numbers of samples", because the over-sampled training frame (which now has a 50/50 label distribution) has more rows than y_train.
I understand the issue, but how does one solve it when over-sampling to reduce class imbalance?
X = df
#X = df.drop("label", axis=1)
y = df["label"]
X_train,\
X_test,\
y_train,\
y_test = train_test_split(X,\
y,\
test_size=0.2,\
random_state=11,\
shuffle=True,\
stratify=target)
target_count = df.label.value_counts()
print('Class 1:', target_count[0])
print('Class 0:', target_count[1])
print('Proportion:', round(target_count[0] / target_count[1], 2), ': 1')
target_count.plot(kind='bar', title='Count (target)');
# Class count
count_class_index_0, count_class_index_1 = X_train.label.value_counts()
# Divide by class
count_class_index_0 = X_train[X_train['label'] == '1']
count_class_index_1 = X_train[X_train['label'] == '0']
df_class_1_over = df_class_1.sample(count_class_index_0, replace=True)
df_test_over = pd.concat([count_class_index_0, df_class_1_over], axis=0)
print('Random over-sampling:')
print(df_test_over.label.value_counts())
Random over-sampling:
1 12682
0 12682
df_test_over.label.value_counts().plot(kind='bar', title='Count (target)')
# drop label for new X_train and X_test
X_train_OS = df_test_over.drop("label", axis=1)
X_test = X_test.drop("label", axis=1)
print(X_train_OS.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(25364, 9)
(3552, 9)
(14207,)
(3552,)
cat_transformer = Pipeline(steps=[
('cat_imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('cat_ohe', OneHotEncoder(handle_unknown='ignore'))])
num_transformer = Pipeline(steps=[
('num_imputer', SimpleImputer(strategy='constant', fill_value=0)),
('num_scaler', StandardScaler())])
text_transformer_0 = Pipeline(steps=[
('text_bow', CountVectorizer(lowercase=True,\
token_pattern=SPLIT_PATTERN,\
stop_words=stopwords))])
# SelectKBest()
# TruncatedSVD()
text_transformer_1 = Pipeline(steps=[
('text_bow', CountVectorizer(lowercase=True,\
token_pattern=SPLIT_PATTERN,\
stop_words=stopwords))])
# SelectKBest()
# TruncatedSVD()
FE = ColumnTransformer(
transformers=[
('cat', cat_transformer, CAT_FEATURES),
('num', num_transformer, NUM_FEATURES),
('text0', text_transformer_0, TEXT_FEATURES[0]),
('text1', text_transformer_1, TEXT_FEATURES[1])])
pipe = Pipeline(steps=[('feature_engineer', FE),
("scales", MaxAbsScaler()),
('rand_forest', RandomForestClassifier(n_jobs=-1, class_weight='balanced'))])
random_grid = {"rand_forest__max_depth": [3, 10, 100, None],\
"rand_forest__n_estimators": sp_randint(10, 100),\
"rand_forest__max_features": ["auto", "sqrt", "log2", None],\
"rand_forest__bootstrap": [True, False],\
"rand_forest__criterion": ["gini", "entropy"]}
strat_shuffle_fold = StratifiedKFold(n_splits=5,\
random_state=123,\
shuffle=True)
cv_train = RandomizedSearchCV(pipe, param_distributions=random_grid, cv=strat_shuffle_fold)
cv_train.fit(X_train_OS, y_train)
from sklearn.metrics import classification_report, confusion_matrix
preds = cv_train.predict(X_test)
print(confusion_matrix(y_test, preds))
print(classification_report(y_test, preds))
The problem you are having here gets very easily (and arguably more elegantly) solved by SMOTE. It's easy to use and allows you to keep the X_train, X_test, y_train, y_test syntax from train_test_split because it will perform the oversampling both on X and y at the same time.
from imblearn.over_sampling import SMOTE
X_train, X_test, y_train, y_test = train_test_split(X,y)
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)
So I believe I solved my own question. The problem was how I was splitting the data. I normally follow the standard X_train, X_test, y_train, y_test pattern with train_test_split, but that caused the row-count mismatch between X_train and y_train after over-sampling. So I split the full frame (label included) instead, and everything appears to be working. Please let me know if anyone has any recommendations! Thanks!
features = df_
target = df_l["label"]
train_set, test_set = train_test_split(features, test_size=0.2,\
random_state=11,\
shuffle=True)
print(train_set.shape)
print(test_set.shape)
(11561, 10)
(2891, 10)
count_class_1, count_class_0 = train_set.label.value_counts()
# Divide by class
df_class_1 = train_set[train_set['label'] == 1]
df_class_0 = train_set[train_set['label'] == 0]
df_class_0_over = df_class_0.sample(count_class_1, replace=True)
df_train_OS = pd.concat([df_class_1, df_class_0_over], axis=0)
print('Random over-sampling:')
print(df_train_OS.label.value_counts())
1 10146
0 10146
df_train_OS.label.value_counts().plot(kind='bar', title='Count (target)');
X_train_OS = df_train_OS.drop("label", axis=1)
y_train_OS = df_train_OS["label"]
X_test = test_set.drop("label", axis=1)
y_test = test_set["label"]
print(X_train_OS.shape)
print(y_train_OS.shape)
print(X_test.shape)
print(y_test.shape)
(20295, 9)
(20295,)
(2891, 9)
(2891,)
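The approach above (split first, over-sample only the training rows while the label column is still attached, then separate X and y) can be sketched on toy data. Everything here is an illustrative assumption: the frame, the column names other than label, and the use of stratify in the split, which the code above does not use.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced frame: 40 rows of class 0, 8 rows of class 1.
df = pd.DataFrame({"feature": range(48), "label": [0] * 40 + [1] * 8})

# Split the whole frame so features and label stay together in each part.
train_set, test_set = train_test_split(
    df, test_size=0.25, random_state=11, stratify=df["label"])

# Over-sample every minority class in the training set up to the majority size.
counts = train_set["label"].value_counts()
majority_label = counts.idxmax()
n_majority = counts.max()

parts = []
for label, group in train_set.groupby("label"):
    if label == majority_label:
        parts.append(group)
    else:
        parts.append(group.sample(n_majority, replace=True, random_state=11))
train_os = pd.concat(parts)

# Only now separate features and target, so their row counts always match.
X_train_OS = train_os.drop("label", axis=1)
y_train_OS = train_os["label"]
X_test = test_set.drop("label", axis=1)
y_test = test_set["label"]

print(X_train_OS.shape, y_train_OS.shape, X_test.shape, y_test.shape)
```

The test set is left untouched, so the evaluation still reflects the original class ratio.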

Error when attempting cross validation in python

I am currently trying to implement cross validation with linear regression. The linear regression works, but when I try cross validation I get this error:
TypeError: only integer scalar arrays can be converted to a scalar index
I get this error on line 5 of my code.
Here is my code:
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    linreg.fit(X_train, Y_train)
    # p = np.array([linreg.predict(xi) for xi in x[test]])
    p = linreg.predict(X_test)
    e = p-Y_test
    xval_err += np.dot(e,e)
rmse_10cv = np.sqrt(xval_err/len(X_train))
Can someone please help me with this problem?
Thanks in advance!
There are a few problems with your code.
In line 5 Y_train is not defined. I think you want the lowercase y_train.
Similarly you want e = p-y_test on line 8.
In rmse_10cv = np.sqrt(xval_err/len(X_train)), X_train is defined inside your loop, so it holds the value from the last iteration. Check the training indices you print for each fold to make sure the length of X_train is always the same; otherwise your calculation of rmse_10cv will not be valid.
I ran your code with the fixes I described and with the following before the loop:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
linreg = LinearRegression()
xval_err = 0
and I did not receive any errors.
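One way to avoid depending on the last fold's X_train is to count the held-out samples while accumulating the squared errors. The sketch below is only one possible variant, not the original poster's code: the synthetic data and the names sq_err_sum, n_test, and rmse_cv are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Hypothetical synthetic data: a noiseless linear target in two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = X @ np.array([1.5, -2.0]) + 0.5

kf = KFold(n_splits=5)
linreg = LinearRegression()

sq_err_sum = 0.0  # sum of squared errors over all held-out samples
n_test = 0        # number of held-out samples seen so far

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    linreg.fit(X_train, y_train)
    e = linreg.predict(X_test) - y_test
    sq_err_sum += np.dot(e, e)
    n_test += len(test_index)

# Divide by the number of predictions actually made, not by len(X_train).
rmse_cv = np.sqrt(sq_err_sum / n_test)
print(rmse_cv)
```

Because each sample lands in exactly one test fold, n_test equals the size of the dataset, and the result stays valid even when the folds have unequal sizes.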
