SKLearn train_test_split jumbles indexes - unreproducible with dummy data - python

I have a setup where I read a CSV into a dataframe, add a calculated column, then do a train_test_split. A dummy version would be:
import random
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

asd = random.sample(range(1, 1456165166), 500)
index = list(range(500))
data = {
    "calories": asd,
    "lol": asd,
    "ix": index
}
df = pd.DataFrame(data)
df = df.set_index("ix")
df['target_cat'] = np.where(df['lol'] > 154683526, 0, 1)
X = df.loc[:, df.columns != "lol"]
y = df['lol']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
This works perfectly, and it keeps the indices in X_train and y_train consistent.
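For instance, a quick sanity check like the following passes, since train_test_split keeps the pandas row labels attached to the shuffled rows (a minimal check added here for illustration):
# the row labels of the X and y splits should match one-to-one
assert (X_train.index == y_train.index).all()
assert (X_test.index == y_test.index).all()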
This would be the expected behavior. Now if I do the same with my own data read from a csv, all the indices get jumbled:
train = pd.read_csv(all_files[2], delimiter=",", index_col=None, header=0, encoding="UTF-8")
train = train.set_index("id")
train['target_cat'] = np.where(train['target_reg'] == 0, 0, 1)
X = train.loc[:, train.columns != "target_cat"]
y = train['target_cat']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
When I do this, the indices don't line up; they appear to be shifted by one.
How on Earth can this happen? I've been sitting here baffled and have no idea what I'm missing...

Related

print customer number next to the predicted result in python

my code:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
df = pd.read_csv('orderlist.csv', skiprows=1, delimiter=';', encoding="utf8")
df.columns = ["date", "customer_number", "item_code", "quantity"]
df['customer_item'] = df.customer_number + ', ' + df.item_code
df['date'] = pd.to_datetime(df['date'])
df["quantity"] = df["quantity"].astype(int, errors='ignore')
df["week"]=df.date.dt.week
df_grup = df.groupby(by=['week',"customer_item"]).quantity.sum().reset_index()
df_dum = pd.get_dummies(df_grup)
X, y = df_dum, df_dum["quantity"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
dtree = DecisionTreeClassifier().fit(X_train, y_train)
predict = dtree.fit(X_train, y_train)
y_pred = dtree.predict(X_test)
pred_quantity = dtree.predict(df_dum)
print("predict quantity:")
print(pred_quantity)
result:
predict quantity:
[100 5 450 ... 295 22 639]
I need to print the customer number next to each predicted result.
The nth item of pred_quantity corresponds to the nth row of df_grup (the frame that df_dum was built from and whose dummies you passed to predict), not to df itself,
so you can either add pred_quantity as a column to df_grup
df_grup['pred_quantity'] = pred_quantity
print(df_grup[['customer_item', 'pred_quantity']])
or use zip (docs) to print them side by side
for customer_item, quantity in zip(df_grup['customer_item'], pred_quantity):
    print(customer_item, quantity)
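Since customer_item was built as customer_number + ', ' + item_code, you can also recover the customer number on its own; a small sketch, assuming that format holds for every row:
# take the part before the comma as the customer number
df_grup['customer_number'] = df_grup['customer_item'].str.split(', ').str[0]
print(df_grup[['customer_number', 'pred_quantity']])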

Scikit-Learn ColumnTransformer gives "TypeError: zip argument #1 must support iteration"

I am attempting to transform some columns of my data frame with the MinMaxScaler() from Scikit-Learn. These are the columns I wish to transform:
ct_columns = ['Number_of_Cigarettes', 'Nicotine_Content', 'Tar_Content', 'Price', 'Units_Sold_Per_Week', 'Profits_Per_Week']
Pass them to a column transformer:
ct = ColumnTransformer((MinMaxScaler(), ct_columns))
Assign the input features and label, then pass them to train_test_split:
X = one_hot_cigarette_df.drop('Units_Sold_Per_Week', axis=1)
y = one_hot_cigarette_df['Units_Sold_Per_Week']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)
When I attempt to then pass the column transformer to the fit method I get the issue:
ct.fit(X_train)
271 return
272
--> 273 names, transformers, _ = zip(*self.transformers)
274
275 # validate names
TypeError: zip argument #1 must support iteration
You should replace ct = ColumnTransformer((MinMaxScaler(), ct_columns)) with
ct = ColumnTransformer([('scaler', MinMaxScaler(), ct_columns)])
You should also drop the label name 'Units_Sold_Per_Week' from ct_columns if you are planning to apply the transformer only to the feature matrix.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
np.random.seed(0)
# generate the data
df = pd.DataFrame(
    columns=['Units_Sold_Per_Week', 'Number_of_Cigarettes', 'Nicotine_Content', 'Tar_Content', 'Price', 'Profits_Per_Week'],
    data=np.random.lognormal(1, 0.5, (100, 6))
)
# extract the features and target
X = df.drop('Units_Sold_Per_Week', axis=1)
y = df['Units_Sold_Per_Week']
# split the data
X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=100)
# scale the features
ct_columns = ['Number_of_Cigarettes', 'Nicotine_Content', 'Tar_Content', 'Price', 'Profits_Per_Week']
ct = ColumnTransformer([('scaler', MinMaxScaler(), ct_columns)])
ct.fit(X_train)
X_train_scaled = ct.transform(X_train)
X_test_scaled = ct.transform(X_test)
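Note that by default ColumnTransformer drops any columns that are not listed in ct_columns. If you ever keep feature columns that should not be scaled, you can pass them through unchanged; a small variation on the call above:
# remainder='passthrough' keeps the unlisted columns instead of dropping them
ct = ColumnTransformer([('scaler', MinMaxScaler(), ct_columns)], remainder='passthrough')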

Undersampling numpy array

I have a train set with 10192 samples of '0' and 2512 samples of '1'.
I've applied a PCA on the set to reduce the dimensionality.
I want to undersample this numpy array.
Here's my code:
import numpy as np
from pandas import read_csv
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

df = read_csv("train.csv")
X = df.drop(['label'], axis=1)
y = df['label']
X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.2, random_state=42)
model = PCA(n_components=19)
model.fit(X_train)
X_train_pca = model.transform(X_train)
X_validation_pca = model.transform(X_validation)
X_train = np.array(X_train_pca)
X_validation = np.array(X_validation_pca)
y_train = np.array(y_train)
y_validation = np.array(y_validation)
How can I undersample the '0' class from the X_train numpy array?
Try this after importing the csv into df:
# class count
count_class_0, count_class_1 = df.label.value_counts()
# separate according to `label`
df_class_0 = df[df['label'] == 0]
df_class_1 = df[df['label'] == 1]
# sample from class 0 only as many rows as class 1 has
df_class_0_under = df_class_0.sample(count_class_1)
df_test_under = pd.concat([df_class_0_under, df_class_1], axis=0)
Then perform all calculations on the df_test_under data frame.
Alternatively use RandomUnderSampler:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = rus.fit_resample(X, y)
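Since the question asks specifically about the already-split numpy arrays, the same sampler also works directly on X_train after the PCA step; a short sketch reusing the variable names from the question:
from imblearn.under_sampling import RandomUnderSampler

# fit_resample accepts plain numpy arrays, so it can run on the PCA output
rus = RandomUnderSampler(random_state=0)
X_train_res, y_train_res = rus.fit_resample(X_train, y_train)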

fill missing values (nan) by regression of other columns

I've got a dataset containing a lot of missing values (NaN). I want to use linear or multilinear regression in Python to fill all the missing values. You can find the dataset here: Dataset
I have used f_regression(X_train, Y_train) to select which features I should use. First of all I convert df['country'] to dummies, then use the important features in a regression, but the results are not good.
I have defined the following function to select features:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import f_regression

def select_features(target, df):
    '''Take a dataset and a target column and print which features are important.'''
    df_dummies = pd.get_dummies(df, prefix='', prefix_sep='', drop_first=True)
    df_nonan = df_dummies.dropna()
    X = df_nonan.drop([target], axis=1)
    Y = df_nonan[target]
    X = pd.get_dummies(X)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=40)
    f, pval = f_regression(X_train, Y_train)
    inds = np.argsort(pval)
    results = pd.DataFrame(np.vstack((f[inds], pval[inds])),
                           columns=X_train.columns[inds],
                           index=['f_values', 'p_values']).iloc[:, :15]
    print(results)
And I have defined the following function to predict the missing values:
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

def train(target, features, df, deg=1):
    '''Take a dataset, target and features and predict the NaNs in the target column.'''
    df_dummies = pd.get_dummies(df, prefix='', prefix_sep='', drop_first=True)
    df_nonan = df_dummies[[*features, target]].dropna()
    X = df_nonan.drop([target], axis=1)
    Y = df_nonan[target]
    pol = PolynomialFeatures(degree=deg)
    X = X[features]
    X = pd.get_dummies(X)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.40, random_state=40)
    X_test, X_val, Y_test, Y_val = train_test_split(X_test, Y_test, test_size=0.50, random_state=40)
    # fit the polynomial expansion on the training data only; reuse transform elsewhere
    X_train_n = pol.fit_transform(X_train)
    reg = linear_model.Lasso()
    reg.fit(X_train_n, Y_train)
    Y_predtrain = reg.predict(X_train_n)
    print('train', r2_score(Y_train, Y_predtrain))
    X_test_n = pol.transform(X_test)
    Y_pred = reg.predict(X_test_n)
    print('test', r2_score(Y_test, Y_pred))
    X_val_n = pol.transform(X_val)
    Y_valpred = reg.predict(X_val_n)
    print('val', r2_score(Y_val, Y_valpred))
    # predict the rows where the target is missing but the features are present
    X_names = X.columns.values
    X_new = df_dummies[X_names].dropna()
    X_new = X_new[df_dummies[target].isna()]
    X_new_n = pol.transform(X_new)
    Y_new = pd.Series(reg.predict(X_new_n), index=X_new.index)
    return Y_new, X_names, X_new.index
Then I use these functions to fill the NaNs of the features with p_values < 0.05, but I am not sure whether this is a good approach. This way, many missing values remain unpredicted.
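For comparison, scikit-learn also ships a ready-made version of this idea: IterativeImputer models each column with missing values as a regression on the other columns. A minimal sketch on the numeric columns only, assuming df is the full dataframe (this is not from the original post, and the estimator is still flagged experimental):
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, activates the import below
from sklearn.impute import IterativeImputer

num_cols = df.select_dtypes('number').columns
imp = IterativeImputer(random_state=0)
# fills every NaN in the numeric columns using regressions on the other columns
df[num_cols] = imp.fit_transform(df[num_cols])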

Make prediction from Pandas DataFrame

I am very new to DataScience/Pandas in general. I mainly followed this and could get it to work using different classifiers.
import pandas as pd
import src.helper as helper
import time
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
# Headings
headings = ['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor', 'gill-attachment', 'gill-spacing',
            'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
            'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type',
            'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']
# Load the data
shrooms = pd.read_csv('data/shrooms_no_header.csv', names=headings, converters={"header": float})
# Replace the ? in 'stalk-root' with 0
shrooms.loc[shrooms['stalk-root'] == '?', 'stalk-root'] = np.nan
shrooms.fillna(0, inplace=True)
# Remove columns with only one unique value
for col in shrooms.columns.values:
    if len(shrooms[col].unique()) <= 1:
        print("Removing column {}, which only contains the value: {}".format(col, shrooms[col].unique()[0]))
        shrooms.drop(col, axis=1, inplace=True)
# Col to predict later
col_predict = 'class'
# Binary Encoding
all_cols = list(shrooms.columns.values)
all_cols.remove(col_predict)
helper.encode(shrooms, [col_predict])
# Expand Shrooms DataFrame to Binary Values
helper.expand(shrooms, all_cols)
# Remove the class we want to predict
x_all = list(shrooms.columns.values)
x_all.remove(col_predict)
# Set Train/Test ratio
ratio = 0.7
# Split the DF
df_train, df_test, X_train, Y_train, X_test, Y_test = helper.split_df(shrooms, col_predict, x_all, ratio)
# Try different classifier
# TODO: Batch Use to compare
classifier = GradientBoostingClassifier(n_estimators=1000)
# TODO: Optimize Hyperparamter (where applicable)
# Time the training
timer_start = time.process_time()
classifier.fit(X_train, Y_train)
timer_stop = time.process_time()
time_diff = timer_stop - timer_start
# Get the score
score_train = classifier.score(X_train, Y_train)
score_test = classifier.score(X_test, Y_test)
print('Train Score {}, Test Score {}, Time {}'.format(score_train, score_test, time_diff))
# TODO: Test a manual DataFrame
The "helpers" are functions I don't quite understand fully, but they work:
import numpy as np
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
def split_df(df, y_col, x_cols, ratio):
    """
    This method splits a dataframe into a train and a test set; for this you need to specify:
    1. the ratio train : test (usually 0.7)
    2. the column with the Y values
    """
    mask = np.random.rand(len(df)) < ratio
    train = df[mask]
    test = df[~mask]
    y_train = train[y_col].values
    y_test = test[y_col].values
    x_train = train[x_cols].values
    x_test = test[x_cols].values
    return train, test, x_train, y_train, x_test, y_test

def encode(df, columns):
    for col in columns:
        le = LabelEncoder()
        col_values_unique = list(df[col].unique())
        le_fitted = le.fit(col_values_unique)
        col_values = list(df[col].values)
        col_values_transformed = le.transform(col_values)
        df[col] = col_values_transformed

def expand(df, list_columns):
    for col in list_columns:
        colvalues = df[col].unique()
        for colvalue in colvalues:
            newcol_name = "{}_is_{}".format(col, colvalue)
            df.loc[df[col] == colvalue, newcol_name] = 1
            df.loc[df[col] != colvalue, newcol_name] = 0
    df.drop(list_columns, inplace=True, axis=1)

def correlation_to(df, col):
    correlation_matrix = df.corr()
    correlation_type = correlation_matrix[col].copy()
    abs_correlation_type = correlation_type.apply(lambda x: abs(x))
    desc_corr_values = abs_correlation_type.sort_values(ascending=False)
    y_values = list(desc_corr_values.values)[1:]
    x_values = range(0, len(y_values))
    xlabels = list(desc_corr_values.keys())[1:]
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.bar(x_values, y_values)
    ax.set_title('The correlation of all features with {}'.format(col), fontsize=20)
    ax.set_ylabel('Pearson correlation coefficient [abs value]', fontsize=16)
    plt.xticks(x_values, xlabels, rotation='vertical')
    plt.show()
I would like to have a "manual" test, such as entering x attributes and getting a prediction based on that.
So for example, I hardcode a DataFrame like the following:
manual = pd.DataFrame({
    "cap-shape": ["x"],
    "cap-surface": ["s"],
    "cap-color": ["n"],
    "bruises": ["f"],
    "odor": ["n"],
    "gill-attachment": ["a"],
    "gill-spacing": ["c"],
    "gill-size": ["b"],
    "gill-color": ["y"],
    "stalk-shape": ["e"],
    "stalk-root": ["?"],
    "stalk-surface-above-ring": ["s"],
    "stalk-surface-below-ring": ["s"],
    "stalk-color-above-ring": ["o"],
    "stalk-color-below-ring": ["o"],
    "veil-type": ["p"],
    "veil-color": ["o"],
    "ring-number": ["o"],
    "ring-type": ["p"],
    "spore-print-color": ["o"],
    "population": ["c"],
    "habitat": ["l"]
})
How would I apply the same encoding? My code says helper.encode(manual, [col_predict]), but manual of course does not have a col_predict column.
Please bear in mind I am a complete beginner; I searched the web a lot, but I cannot come up with a proper source/tutorial that lets me test a single set.
The full code can be found here.
Try this:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
data = pd.read_csv('agaricus-lepiota.data.txt', header=None)  # read data
data.rename(columns={0: 'y'}, inplace=True)  # rename the predict column (edible or not)
# fit one LabelEncoder per column and keep them all,
# so the same mappings can be reused on new data later
encoders = {col: LabelEncoder().fit(data[col]) for col in data.columns}
data = data.apply(lambda x: encoders[x.name].transform(x))
X = data.drop('y', axis=1)  # X without the predict column
y = data['y']  # predict column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = GradientBoostingClassifier()  # you can pass arguments
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)  # predictions for the objects in the test set
print(accuracy_score(y_test, y_pred))  # check accuracy
I think you can read more about this on the scikit-learn site.
Is this example what you want?
To check your manual data, reuse the fitted encoders instead of refitting them (le.fit_transform on a single row would remap every value to 0). The integer column names from training differ from the string names in manual, so align them first, assuming the columns come in the same order:
manual.columns = X.columns  # align with the training feature columns
manual = manual.apply(lambda x: encoders[x.name].transform(x))
clf.predict(manual)
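If you also want the prediction back as the original class letter instead of the encoded integer, the fitted encoder for y can invert it; a small follow-up sketch building on the code above:
pred = clf.predict(manual)
print(encoders['y'].inverse_transform(pred))  # e.g. ['p'] or ['e']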
