Undersampling numpy array - python

I have a train set with 10192 samples of '0' and 2512 samples of '1'.
I've applied a PCA on the set to reduce the dimensionality.
I want to undersample this numpy array.
Here's my code:
from pandas import read_csv
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

df = read_csv("train.csv")
X = df.drop(['label'], axis=1)
y = df['label']
X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.2, random_state=42)
model = PCA(n_components=19)
model.fit(X_train)
X_train_pca = model.transform(X_train)
X_validation_pca = model.transform(X_validation)
X_train = np.array(X_train_pca)
X_validation = np.array(X_validation_pca)
y_train = np.array(y_train)
y_validation = np.array(y_validation)
How can I undersample the '0' class from the X_train numpy array?

Try this after reading the csv into df:
import pandas as pd

# class count (value_counts sorts by count, so the majority class 0 comes first)
count_class_0, count_class_1 = df.label.value_counts()
# separate according to `label`
df_class_0 = df[df['label'] == 0]
df_class_1 = df[df['label'] == 1]
# sample from class 0 only as many rows as class 1 has
df_class_0_under = df_class_0.sample(count_class_1)
df_test_under = pd.concat([df_class_0_under, df_class_1], axis=0)
Then perform all calculations on df_test_under data frame.
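Note that pd.concat stacks the two classes in contiguous blocks, so you may also want to shuffle the result before training; an optional one-line sketch:
df_test_under = df_test_under.sample(frac=1, random_state=0).reset_index(drop=True)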
Alternatively use RandomUnderSampler:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = rus.fit_resample(X, y)
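If you want to undersample the already-transformed numpy arrays directly, as the question asks, plain index arithmetic also works; a minimal sketch using the variable names from the question:
import numpy as np
# locate each class in the training labels
idx_0 = np.where(y_train == 0)[0]
idx_1 = np.where(y_train == 1)[0]
# randomly keep only as many class-0 rows as there are class-1 rows
rng = np.random.default_rng(42)
idx_0_under = rng.choice(idx_0, size=len(idx_1), replace=False)
keep = np.concatenate([idx_0_under, idx_1])
rng.shuffle(keep)
X_train_under, y_train_under = X_train[keep], y_train[keep]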


Input data cannot be a list XGBoost

Here is my code.
import pandas as pd
import numpy as np
import json
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler

training_data = pd.read_csv('/Users/aus10/Desktop/MLB_Data/Test_Training_Data/MLB_Training_Data.csv')
df_model = training_data.copy()
scaler = StandardScaler()
features = [['OBS', 'Runs']]
for feature in features:
    df_model[feature] = scaler.fit_transform(df_model[feature])
test_data = pd.read_csv('/Users/aus10/Desktop/MLB_Data/Test_Training_Data/Test_Data.csv')
X = training_data.iloc[:, 1]  # independent columns
y = training_data.iloc[:, -1]  # target column
X = X.values.reshape(-1, 1)
results = []
# fit final model
model = XGBRegressor(objective="reg:squarederror", random_state=42)
model.fit(X, y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=4)
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('MSE train: %.3f, test: %.3f' % (
    round(mean_squared_error(y_train, y_train_pred), 2),
    round(mean_squared_error(y_test, y_test_pred), 2)
))
print('R^2 train: %.3f, test: %.3f' % (r2_score(y_train, y_train_pred), r2_score(y_test, y_test_pred)))
# define one new data instance
index = 0
count = 0
while count < len(test_data):
    team = test_data.loc[index].at['Team']
    OBS = test_data.loc[index].at['OBS']
    Xnew = [[ OBS ]]
    # make a prediction
    ynew = model.predict(Xnew)
    # show the inputs and predicted outputs
    results.append({
        'Team': team,
        'Runs': round(ynew[0], 2)
    })
    index += 1
    count += 1
sorted_results = sorted(results, key=lambda k: k['Runs'], reverse=True)
df = pd.DataFrame(sorted_results, columns=['Team', 'Runs'])
writer = pd.ExcelWriter('/Users/aus10/Desktop/MLB_Data/ML/Results/Projected_Runs_XGBoost.xlsx', engine='xlsxwriter')  # pylint: disable=abstract-class-instantiated
df.to_excel(writer, sheet_name='Sheet1', index=False)
df.style.set_properties(**{'text-align': 'center'})
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.width', 1000)
writer.save()
and the error I'm getting is TypeError: Input data can not be a list.
The data coming from test_data is a csv with a team name and an OBS value, which is a float, like this: NYY 0.324.
Every solution I've seen is to wrap the value in a 2D list, as I did with Xnew = [[ OBS ]],
but I'm still getting the error.
Is there something else I need to do to the incoming test_data? I tried using values.reshape, but that didn't fix it either.
You need to transform your Xnew:
Xnew = np.array(Xnew).reshape((1,-1))
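In context, a minimal sketch of the fix (the OBS value here is made up):
OBS = 0.324  # hypothetical value read from test_data
Xnew = [[ OBS ]]
Xnew = np.array(Xnew).reshape((1, -1))  # shape (1, 1): one sample, one feature
ynew = model.predict(Xnew)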

Plot scatter for classification algorithm

Please help me create a scatter plot for this classification algorithm. Here y is a column of labels (0, 1); I want the predicted labels plotted in two different colors, one per label.
X = np.array(df.iloc[:, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]].values)
y = df.iloc[:, 17].values
dtc = DecisionTreeClassifier()
train_x, test_x, train_y, test_y = train_test_split(X, y, train_size=0.8, shuffle=True)
kf = KFold(n_splits=5)
dtc = dtc.fit(train_x, train_y)
dtc_labels = dtc.predict(test_x)
I don't have access to your dataframes, but here is a minimal working example, assuming I understood you correctly.
The point is that you have to use logical indexing on your numpy arrays when plotting; the two plt.scatter calls at the end demonstrate it.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, KFold
import matplotlib.pyplot as plt

X = np.zeros((100, 2))
X[:, 0] = np.arange(100)
X[:, 1] = np.arange(100)
y = [0] * 50 + [1] * 50
dtc = DecisionTreeClassifier()
train_x, test_x, train_y, test_y = train_test_split(X, y, train_size=0.8, shuffle=True)
kf = KFold(n_splits=5)
dtc = dtc.fit(train_x, train_y)
dtc_labels = dtc.predict(test_x)
plt.scatter(test_x[dtc_labels == 0, 0], test_x[dtc_labels == 0, 1], label='predicted 0')
plt.scatter(test_x[dtc_labels == 1, 0], test_x[dtc_labels == 1, 1], label='predicted 1')
plt.legend()
plt.show()
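As an alternative to logical indexing, matplotlib can also map point colors directly from the predicted labels via the c argument; a short sketch:
plt.scatter(test_x[:, 0], test_x[:, 1], c=dtc_labels)
plt.show()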

Over-Sampling Class Imbalance Train/Test Split "Found input variables with inconsistent numbers of samples" Solution?

Trying to follow this article to perform over-sampling for imbalanced classification. My class ratio is about 8:1.
https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets/notebook
I am confused about the pipeline and coding structure.
Should you over-sample after train/test splitting?
If so, how do you deal with the fact that the target label is dropped from X? I tried keeping the label, performed the over-sampling, then dropped the label from X_train/X_test and replaced the new training set in my pipeline.
However, I get the error "Found input variables with inconsistent numbers of samples" because the shapes are inconsistent: the over-sampled data frame is doubled in size with a 50/50 label distribution.
I understand the issue, but how does one solve this problem when over-sampling to reduce class imbalance?
X = df
#X = df.drop("label", axis=1)
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=11,
                                                    shuffle=True,
                                                    stratify=target)
target_count = df.label.value_counts()
print('Class 1:', target_count[0])
print('Class 0:', target_count[1])
print('Proportion:', round(target_count[0] / target_count[1], 2), ': 1')
target_count.plot(kind='bar', title='Count (target)');
# Class count
count_class_index_0, count_class_index_1 = X_train.label.value_counts()
# Divide by class
count_class_index_0 = X_train[X_train['label'] == '1']
count_class_index_1 = X_train[X_train['label'] == '0']
df_class_1_over = df_class_1.sample(count_class_index_0, replace=True)
df_test_over = pd.concat([count_class_index_0, df_class_1_over], axis=0)
print('Random over-sampling:')
print(df_test_over.label.value_counts())
Random over-sampling:
1 12682
0 12682
df_test_over.label.value_counts().plot(kind='bar', title='Count (target)')
# drop label for new X_train and X_test
X_train_OS = df_test_over.drop("label", axis=1)
X_test = X_test.drop("label", axis=1)
print(X_train_OS.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(25364, 9)
(3552, 9)
(14207,)
(3552,)
cat_transformer = Pipeline(steps=[
    ('cat_imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('cat_ohe', OneHotEncoder(handle_unknown='ignore'))])

num_transformer = Pipeline(steps=[
    ('num_imputer', SimpleImputer(strategy='constant', fill_value=0)),
    ('num_scaler', StandardScaler())])

text_transformer_0 = Pipeline(steps=[
    ('text_bow', CountVectorizer(lowercase=True,
                                 token_pattern=SPLIT_PATTERN,
                                 stop_words=stopwords))])
# SelectKBest()
# TruncatedSVD()

text_transformer_1 = Pipeline(steps=[
    ('text_bow', CountVectorizer(lowercase=True,
                                 token_pattern=SPLIT_PATTERN,
                                 stop_words=stopwords))])
# SelectKBest()
# TruncatedSVD()

FE = ColumnTransformer(
    transformers=[
        ('cat', cat_transformer, CAT_FEATURES),
        ('num', num_transformer, NUM_FEATURES),
        ('text0', text_transformer_0, TEXT_FEATURES[0]),
        ('text1', text_transformer_1, TEXT_FEATURES[1])])

pipe = Pipeline(steps=[('feature_engineer', FE),
                       ("scales", MaxAbsScaler()),
                       ('rand_forest', RandomForestClassifier(n_jobs=-1, class_weight='balanced'))])

random_grid = {"rand_forest__max_depth": [3, 10, 100, None],
               "rand_forest__n_estimators": sp_randint(10, 100),
               "rand_forest__max_features": ["auto", "sqrt", "log2", None],
               "rand_forest__bootstrap": [True, False],
               "rand_forest__criterion": ["gini", "entropy"]}

strat_shuffle_fold = StratifiedKFold(n_splits=5,
                                     random_state=123,
                                     shuffle=True)

cv_train = RandomizedSearchCV(pipe, param_distributions=random_grid, cv=strat_shuffle_fold)
cv_train.fit(X_train_OS, y_train)

from sklearn.metrics import classification_report, confusion_matrix
preds = cv_train.predict(X_test)
print(confusion_matrix(y_test, preds))
print(classification_report(y_test, preds))
The problem you are having here is solved very easily (and arguably more elegantly) by SMOTE. It's simple to use and lets you keep the X_train, X_test, y_train, y_test syntax from train_test_split, because it performs the over-sampling on X and y at the same time.
from imblearn.over_sampling import SMOTE
X_train, X_test, y_train, y_test = train_test_split(X,y)
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)
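If you later want the over-sampling inside a cross-validated pipeline, so it only ever sees the training folds, imblearn also ships its own Pipeline that accepts samplers; a minimal sketch, assuming the features are already numeric:
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

pipe = ImbPipeline(steps=[
    ('smote', SMOTE(random_state=42)),  # resampling happens only during fit
    ('rand_forest', RandomForestClassifier(n_jobs=-1))])
pipe.fit(X_train, y_train)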
So I believe I solved my own question: the problem was how I was splitting the data. I normally follow the standard X_train, X_test, y_train, y_test train_test_split, but it was causing the row-count mismatch between X_train and y_train when over-sampling, so I did the following instead and everything appears to be working. Please let me know if anyone has any recommendations. Thanks!
features = df_
target = df_l["label"]
train_set, test_set = train_test_split(features,
                                       test_size=0.2,
                                       random_state=11,
                                       shuffle=True)
print(train_set.shape)
print(test_set.shape)
(11561, 10)
(2891, 10)
count_class_1, count_class_0 = train_set.label.value_counts()
# Divide by class
df_class_1 = train_set[train_set['label'] == 1]
df_class_0 = train_set[train_set['label'] == 0]
df_class_0_over = df_class_0.sample(count_class_1, replace=True)
df_train_OS = pd.concat([df_class_1, df_class_0_over], axis=0)
print('Random over-sampling:')
print(df_train_OS.label.value_counts())
1 10146
0 10146
df_train_OS.label.value_counts().plot(kind='bar', title='Count (target)');
X_train_OS = df_train_OS.drop("label", axis=1)
y_train_OS = df_train_OS["label"]
X_test = test_set.drop("label", axis=1)
y_test = test_set["label"]
print(X_train_OS.shape)
print(y_train_OS.shape)
print(X_test.shape)
print(y_test.shape)
(20295, 9)
(20295,)
(2891, 9)
(2891,)

Make prediction from Pandas DataFrame

I am very new to DataScience/Pandas in general. I mainly followed this and could get it to work using different classifiers.
import pandas as pd
import src.helper as helper
import time
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
# Headings
headings = ['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor', 'gill-attachment', 'gill-spacing',
            'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
            'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type',
            'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']
# Load the data
shrooms = pd.read_csv('data/shrooms_no_header.csv', names=headings, converters={"header": float})
# Replace the ? in 'stalk-root' with 0
shrooms.loc[shrooms['stalk-root'] == '?', 'stalk-root'] = np.nan
shrooms.fillna(0, inplace=True)
# Remove columns with only one unique value
for col in shrooms.columns.values:
    if len(shrooms[col].unique()) <= 1:
        print("Removing column {}, which only contains the value: {}".format(col, shrooms[col].unique()[0]))
        shrooms.drop(col, axis=1, inplace=True)
# Col to predict later
col_predict = 'class'
# Binary Encoding
all_cols = list(shrooms.columns.values)
all_cols.remove(col_predict)
helper.encode(shrooms, [col_predict])
# Expand Shrooms DataFrame to Binary Values
helper.expand(shrooms, all_cols)
# Remove the class we want to predict
x_all = list(shrooms.columns.values)
x_all.remove(col_predict)
# Set Train/Test ratio
ratio = 0.7
# Split the DF
df_train, df_test, X_train, Y_train, X_test, Y_test = helper.split_df(shrooms, col_predict, x_all, ratio)
# Try different classifier
# TODO: Batch Use to compare
classifier = GradientBoostingClassifier(n_estimators=1000)
# TODO: Optimize Hyperparamter (where applicable)
# Time the training
timer_start = time.process_time()
classifier.fit(X_train, Y_train)
timer_stop = time.process_time()
time_diff = timer_stop - timer_start
# Get the score
score_train = classifier.score(X_train, Y_train)
score_test = classifier.score(X_test, Y_test)
print('Train Score {}, Test Score {}, Time {}'.format(score_train, score_test, time_diff))
# TODO: Test a manual DataFrame
The "helpers" are functions I don't quite understand fully, but they work:
import numpy as np
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
def split_df(df, y_col, x_cols, ratio):
    """
    This method transforms a dataframe into a train and test set, for this you need to specify:
    1. the ratio train : test (usually 0.7)
    2. the column with the Y_values
    """
    mask = np.random.rand(len(df)) < ratio
    train = df[mask]
    test = df[~mask]
    y_train = train[y_col].values
    y_test = test[y_col].values
    x_train = train[x_cols].values
    x_test = test[x_cols].values
    return train, test, x_train, y_train, x_test, y_test

def encode(df, columns):
    for col in columns:
        le = LabelEncoder()
        col_values_unique = list(df[col].unique())
        le_fitted = le.fit(col_values_unique)
        col_values = list(df[col].values)
        col_values_transformed = le.transform(col_values)
        df[col] = col_values_transformed

def expand(df, list_columns):
    for col in list_columns:
        colvalues = df[col].unique()
        for colvalue in colvalues:
            newcol_name = "{}_is_{}".format(col, colvalue)
            df.loc[df[col] == colvalue, newcol_name] = 1
            df.loc[df[col] != colvalue, newcol_name] = 0
    df.drop(list_columns, inplace=True, axis=1)

def correlation_to(df, col):
    correlation_matrix = df.corr()
    correlation_type = correlation_matrix[col].copy()
    abs_correlation_type = correlation_type.apply(lambda x: abs(x))
    desc_corr_values = abs_correlation_type.sort_values(ascending=False)
    y_values = list(desc_corr_values.values)[1:]
    x_values = range(0, len(y_values))
    xlabels = list(desc_corr_values.keys())[1:]
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.bar(x_values, y_values)
    ax.set_title('The correlation of all features with {}'.format(col), fontsize=20)
    ax.set_ylabel('Pearson correlation coefficient [absolute value]', fontsize=16)
    plt.xticks(x_values, xlabels, rotation='vertical')
    plt.show()
I would like to have a "manual" test, such as entering x attributes and getting a prediction based on that.
So for example, I hardcode a DataFrame like the following:
manual = pd.DataFrame({
    "cap-shape": ["x"],
    "cap-surface": ["s"],
    "cap-color": ["n"],
    "bruises": ["f"],
    "odor": ["n"],
    "gill-attachment": ["a"],
    "gill-spacing": ["c"],
    "gill-size": ["b"],
    "gill-color": ["y"],
    "stalk-shape": ["e"],
    "stalk-root": ["?"],
    "stalk-surface-above-ring": ["s"],
    "stalk-surface-below-ring": ["s"],
    "stalk-color-above-ring": ["o"],
    "stalk-color-below-ring": ["o"],
    "veil-type": ["p"],
    "veil-color": ["o"],
    "ring-number": ["o"],
    "ring-type": ["p"],
    "spore-print-color": ["o"],
    "population": ["c"],
    "habitat": ["l"]
})
How would I apply the same encoding? My code says helper.encode(manual, [col_predict]), but manual of course does not have a col_predict column.
Please bear in mind I am a complete beginner; I have searched the web a lot, but I cannot find a proper source/tutorial that lets me test a single set.
The full code can be found here.
Try this:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
data = pd.read_csv('agaricus-lepiota.data.txt', header=None) #read data
data.rename(columns={0: 'y'}, inplace = True) #rename predict column (edible or not)
le = LabelEncoder() # encoder to do label encoder
data = data.apply(lambda x: le.fit_transform(x)) #apply LE to all columns
X = data.drop('y', axis=1) # X without predict column
y = data['y'] #predict column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = GradientBoostingClassifier()#you can pass arguments
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test) #it is predict for objects in test
print(accuracy_score(y_test, y_pred)) #check accuracy
I think you can read more about this on the sklearn site.
Is this example what you want?
To check your manual data:
manual = manual.apply(lambda x: le.fit_transform(x))
clf.predict(manual)
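One caveat worth flagging (my addition, not part of the original answer): calling le.fit_transform on manual re-fits the encoder on a single row, so the resulting codes will not match the training encoding. A safer sketch, assuming manual uses the same column names as the training frame and only contains values seen during training, is to keep one fitted encoder per column and call transform only:
# fit one LabelEncoder per column on the training data, reuse them later
encoders = {col: LabelEncoder().fit(data[col]) for col in data.columns}
data_encoded = data.apply(lambda s: encoders[s.name].transform(s))
# apply the *same* encoders to the manual row, then predict
manual_encoded = manual.apply(lambda s: encoders[s.name].transform(s))
clf.predict(manual_encoded)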

LDA(n_components = 2) + fit_transform return 1-dim matrix instead of 2-dim

While applying LDA to my Churn_Modelling.csv file, everything goes well until the point where my X_train comes back with shape (8000, 1) instead of (8000, 2) as expected:
lda = LDA(n_components = 2)
X_train = lda.fit_transform(X_train, y_train)
X_train is beforehand "hot-encoded" and "feature-scaled" as follows:
# LDA
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
onehotencoder = OneHotEncoder(categorical_features = [1])
X = onehotencoder.fit_transform(X).toarray()
X = X[:, 1:]
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Applying LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components = 2)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)
While doing the same on another .csv file I have no trouble... do you have any idea why?
Thank you very much for your help!
I think I have the answer, but I would prefer confirmation if possible :-)
The maximum number of components I can hope to obtain from transform is n_classes - 1, so in my case 2 classes (True, False) yield at most 1 component.
Am I right? Thank you again.
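That is correct: LDA can produce at most min(n_features, n_classes - 1) components, so a binary target always collapses to a single discriminant axis. A minimal sketch with made-up data to illustrate the bound:
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

X = np.random.rand(100, 5)         # 5 features
y = np.array([0] * 50 + [1] * 50)  # 2 classes

lda = LDA(n_components=1)          # at most min(5, 2 - 1) = 1 component
print(lda.fit_transform(X, y).shape)  # (100, 1)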
