Why is sklearn.metrics support value changing every time? - python

I'm working on training a supervised learning keras model to categorize data into one of 3 categories. After training, I run this:
dataset = pandas.read_csv(filename, header=[0], encoding='utf-8-sig', sep=',')
# split X and Y (last column)
array = dataset.values
columns = array.shape[1] - 1
np.random.shuffle(array)
x_orig = array[:, 1:columns]
testy = array[:, columns]
columns -= 1
# normalize data
scaler = StandardScaler()
testx= scaler.fit_transform(x_orig)
#onehot
testy = to_categorical(testy)
# load weights
save_path = "[filepath]"
model = tf.keras.models.load_model(save_path)
# gets class breakdown
y_pred = model.predict(testx, verbose=1)
y_pred_bool = np.argmax(y_pred, axis=1)
y_true = np.argmax(testy, axis=1)
print(sklearn.metrics.precision_recall_fscore_support(y_true, y_pred_bool))
sklearn.metrics.precision_recall_fscore_support prints, among other metrics, the support for each class. Per this link, support is the number of occurrences of each class in y_true, which is the true labels.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
My problem: support is different on every run. I'm using the same data, and the support values always add up to the same total (though that total differs from the number of rows in the file, which I also don't understand), but the count per class changes.
As an example, one run might say [16870, 16299, 7807] and the next might say [17169, 15923, 7884]. They add up to the same total, but each class differs.
Since my data isn't changing between runs, I'd expect support to be identical every time. Am I wrong? If not, what's going on? I've tried googling, but didn't get any useful results.
Potentially useful information: when I run sklearn.metrics.classification_report, I have the same issue, and the numbers from that match the numbers from precision_recall_fscore_support.
Side note: unrelated to the question above, but I couldn't find an answer to this one either, so I hope it's OK to include here. When I run model.evaluate, part of the printout is e.g. 74us/sample. What does us/sample mean?

Add:
np.random.seed(42)
before you shuffle the array at
np.random.shuffle(array)
The reason is that, without seeding, np.random.shuffle produces a different ordering each time, so when you feed the array into the model it returns a different result. Seeding makes the shuffle identical on every run, which gives reproducible results.
Alternatively, you can skip the shuffle entirely and feed the same array into the model each time. Either method ensures reproducibility.
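As a minimal sketch of this, using the array variable from the question:
import numpy as np

np.random.seed(42)        # fix the RNG state so every run shuffles identically
np.random.shuffle(array)  # in-place row shuffle, now reproducible from run to run
# x_orig and testy are then sliced from the shuffled array exactly as in the question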

Related

How to automatically choose the num_features_to_select value with the best result in select_features from CatBoostClassifier?

I'm writing a class in Python where I'm trying to automatically pick the value of num_features_to_select in CatBoostClassifier().select_features(). Right now, the function simply enumerates candidate num_features_to_select values.
Code:
def CatBoost(X_var=df.drop(columns=['status']), y_var=df[['creation_date','status']]):
    import numpy as np
    from catboost import CatBoostClassifier, Pool, EShapCalcType, EFeaturesSelectionAlgorithm
    from sklearn.model_selection import train_test_split
    from datetime import datetime, timedelta  # datetime for working with dates
    import os
    os.environ['OPENBLAS_NUM_THREADS'] = '10'
    valid_time_border = X_var['creation_date'].max() - timedelta(days=7)
    X_train, X_test, y_train, y_test = train_test_split(
        X_var[X_var['creation_date'] <= valid_time_border].drop(columns=['creation_date']),
        y_var[y_var['creation_date'] <= valid_time_border]['status'],
        test_size=0.3)
    X_valid = X_var[X_var['creation_date'] > valid_time_border].drop(columns=['creation_date'])
    y_valid = y_var[y_var['creation_date'] > valid_time_border]['status']
    best_accuracy = 0
    mas_num_features_to_select = [10, 20, 30, 40, 50, 60]
    for i in mas_num_features_to_select:
        # define all predictor columns
        predict_columns = X_train.columns.to_list()
        # identify the categorical features
        cat_features_num = np.where(np.isin(X_train[X_train.columns].dtypes, ['bool', 'object']))[0]
        train_pool = Pool(X_train, y_train, cat_features=cat_features_num, feature_names=list(predict_columns))
        test_pool = Pool(X_test, y_test, cat_features=cat_features_num, feature_names=list(predict_columns))
        model = CatBoostClassifier(iterations=200, eval_metric='AUC', thread_count=10)
        summary = model.select_features(
            train_pool,
            eval_set=test_pool,
            features_for_select=predict_columns,
            num_features_to_select=i,
            steps=15,
            algorithm=EFeaturesSelectionAlgorithm.RecursiveByShapValues,
            shap_calc_type=EShapCalcType.Regular,
            train_final_model=False,
            logging_level='Silent',
            plot=False
        )
        predict_columns = summary['selected_features_names']
        model.fit(X_train, y_train)
        y_pred = model.predict(X_valid)          # predict on the hold-out data
        mislabel = np.sum((y_valid != y_pred))   # count the misclassified values
        accuracy = 1 - mislabel / len(y_pred)
        print(accuracy)
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_predict_columns = predict_columns
    print('Best prediction accuracy: ' + str(best_accuracy))
    print('Best features:')
    print(best_predict_columns)
    return best_predict_columns
I can't find any information about a built-in way to perform this selection automatically. Is it even possible using CatBoost?
Use the summary dictionary output to find your best point. If you want an interactive plot to define it, you can use the following:
import matplotlib.pyplot as plt
line = plt.plot(summary["loss_graph"]["removed_features_count"], summary["loss_graph"]["loss_values"], picker=True)
x = plt.ginput(n=1, timeout=30, show_clicks=True)
print(x)
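If you would rather pick the point programmatically than click on a plot, here is a minimal sketch along the same lines (assuming the summary dictionary from the select_features call in the question; note this optimizes the eval-set loss rather than the hold-out accuracy used in the question):
import numpy as np

removed = summary["loss_graph"]["removed_features_count"]
losses = summary["loss_graph"]["loss_values"]

best_idx = int(np.argmin(losses))                    # elimination step with the lowest eval-set loss
best_num_features = X_train.shape[1] - removed[best_idx]
print(best_num_features, losses[best_idx])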
If I understand your question correctly, you're looking for a way of using select_features to determine how many and which features to include in the model such that performance is maintained or improved while eliminating the maximum number of features. Sadly, your approach seems to be about as good as it gets for an automated function. CatBoost does not return the features from the iteration with the best performance, only the features remaining after pruning down to the number specified in num_features_to_select over steps iterations.
If you can compromise and add a manual step, you can set plot=True and see at which number of features the loss value is minimized, as illustrated in CatBoost's documentation.
If you set steps to the number of features, features will be removed one by one, and you can see the loss for the removal of each feature. You could then manually select the number of features to match that iteration. It would be nice if CatBoost had a "train_best_model" parameter instead of just a "train_final_model" parameter! I don't know if there's a way to capture what this function logs to stdout or outputs in the plot, but that contains the loss values and would let you set the value.
Edit: I thought of one more approach. It is still a form of iterating over the num_features_to_select parameter, but it may be interesting:
Set train_final_model=True, steps=1, and num_features_to_select to the width of your dataset
Iteratively subtract 1 from num_features_to_select
At the end of each loop, test the performance of the model
Stop if negative performance change exceeds a threshold (e.g., -5% or -2%)
This may take a while, depending on how long the training takes, but it would automatically pick num_features_to_select as you desire. A rough sketch of this loop follows.
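A minimal sketch of that idea, reusing train_pool, test_pool, X_train, X_valid and y_valid from the question's code (the threshold value and loop structure are my own assumptions, not a definitive implementation):
import numpy as np
from catboost import CatBoostClassifier, EFeaturesSelectionAlgorithm, EShapCalcType

threshold = 0.02  # tolerated accuracy drop versus the all-features baseline

# baseline: accuracy with all features kept
baseline_model = CatBoostClassifier(iterations=200, eval_metric='AUC', thread_count=10)
baseline_model.fit(train_pool, eval_set=test_pool, verbose=False)
y_pred = baseline_model.predict(X_valid)
baseline_acc = 1 - np.sum(y_valid != y_pred) / len(y_pred)

best_features = X_train.columns.to_list()
for n in range(X_train.shape[1] - 1, 0, -1):          # ask for one fewer feature each iteration
    model = CatBoostClassifier(iterations=200, eval_metric='AUC', thread_count=10)
    summary = model.select_features(
        train_pool,
        eval_set=test_pool,
        features_for_select=X_train.columns.to_list(),
        num_features_to_select=n,
        steps=1,
        algorithm=EFeaturesSelectionAlgorithm.RecursiveByShapValues,
        shap_calc_type=EShapCalcType.Regular,
        train_final_model=True,                       # final model is trained on the surviving features
        logging_level='Silent',
    )
    y_pred = model.predict(X_valid)
    acc = 1 - np.sum(y_valid != y_pred) / len(y_pred)
    if baseline_acc - acc > threshold:
        break                                         # performance degraded too much; stop removing
    best_features = summary['selected_features_names']

print(best_features)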

sklearn inverse_transform return huge values and drops MAE to zero

I'm experimenting with an LSTM via Tensorflow/Keras and am having some trouble rescaling the target/predicted variables. (Well, all of them, really)
I'm using MinMaxScaler() to scale most, but not all, of my input columns:
columns_to_scale = ['myColumn1', 'myColumn2', ..., 'myTarget']
scaler = MinMaxScaler()
train[columns_to_scale] = scaler.fit_transform(train[columns_to_scale])
test = test_df
test[columns_to_scale] = scaler.fit_transform(test[columns_to_scale])
When I train my LSTM, I get MAE of ~0.1 on the scaled target. Then, I try to rescale things by rescaling test:
test_y = test
test_y[columns_to_scale] = scaler.inverse_transform(test_y[columns_to_scale])
and by replacing the target column from test with my predicted values and rescaling:
test_pred = test
test_pred['myTarget'] = model_pred_output
test_pred[columns_to_scale] = scaler.inverse_transform(test_pred[columns_to_scale])
When I do this, a couple of things go wrong:
All rescaled columns are wildly different from their original values. (For instance, myColumn1 had an original value of 71850; after the above inverse_transform, it has a value of 3.261187e+10.) This happens with every single column.
My target and predicted variables are also rescaled wrong (the same issue as the other columns, though less extreme: values in the 10K range where the originals were in the hundreds), and on top of that I get exactly the same numbers for the rescaled myTarget and predicted values, which makes the rescaled MAE 0. Which, of course, is not right.
The columns are in the same order as in the scaling operation, and the same number of columns is used in the rescaling.
What am I missing?
I figured it out. For anyone else that runs into something similar, I was creating views of the test data, not copies.
When I set
test_pred = test
test_pred['myTarget'] = model_pred_output
test_y = test
that created a view into test, so further changes to test_pred also changed test. So, when I assigned the predicted values to test_pred['myTarget'], they also got applied to test['myTarget'], which then affected test_y.
To fix, I just had to call copy() on the original assignment:
test_pred = test.copy()
Will leave this up in case anyone wants to add or give a better explanation.
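A small self-contained sketch of the difference, with a hypothetical one-column DataFrame:
import pandas as pd

test = pd.DataFrame({'myTarget': [1.0, 2.0, 3.0]})

alias = test                        # plain assignment: both names point at the same object
alias['myTarget'] = 0.0
print(test['myTarget'].tolist())    # [0.0, 0.0, 0.0] -- test changed too

test = pd.DataFrame({'myTarget': [1.0, 2.0, 3.0]})
independent = test.copy()           # copy(): a separate DataFrame
independent['myTarget'] = 0.0
print(test['myTarget'].tolist())    # [1.0, 2.0, 3.0] -- test is untouched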

How to handle Shift in Forecasted value

I implemented a forecasting model using an LSTM in Keras. The data is sampled at 15-minute intervals and I am forecasting 12 future steps.
The model performs well for the problem, but there is a small issue with the forecasts: they show a small shift effect. To get a clearer picture, see the attached figure.
How can I handle this problem? How must the data be transformed to handle this kind of issue?
The model I used is given below
init_lstm = RandomUniform(minval=-.05, maxval=.05)
init_dense_1 = RandomUniform(minval=-.03, maxval=.06)
model = Sequential()
model.add(LSTM(15, input_shape=(X.shape[1], X.shape[2]), kernel_initializer=init_lstm, recurrent_dropout=0.33))
model.add(Dense(1, kernel_initializer=init_dense_1, activation='linear'))
model.compile(loss='mae', optimizer=Adam(lr=1e-4))
history = model.fit(X, y, epochs=1000, batch_size=16, validation_data=(X_valid, y_valid), verbose=1, shuffle=False)
I made the forecasts like this
my_forecasts = model.predict(X_valid, batch_size=16)
The time series data is transformed into a supervised learning problem to feed the LSTM using this function:
from pandas import DataFrame, concat

# convert time series into supervised learning problem
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
super_data = series_to_supervised(data, 12, 1)
My time series is multivariate. var2 is the one I need to forecast. I dropped the future var1 like this:
del super_data['var1(t)']
Separated train and validation sets like this:
features = super_data[feat_names]
values = super_data[val_name]
n_test = 3444
train_feats, test_feats = features[0:-n_test], features[-n_test:]
train_vals, test_vals = values[0:-n_test], values[-n_test:]
X, y = train_feats.values, train_vals.values
X = X.reshape(X.shape[0], 1, X.shape[1])
X_valid, y_valid = test_feats.values, test_vals.values
X_valid = X_valid.reshape(X_valid.shape[0], 1, X_valid.shape[1])
I haven't made the data stationary for this forecast. I also tried taking differences and making the series as stationary as I could, but the issue remains the same.
I have also tried different scaling ranges for the min-max scaler, hoping it might help the model, but the forecasts only got worse.
Other Things I have tried
=> Tried other optimizers
=> Tried mse loss and custom log-mae loss functions
=> Tried varying batch_size
=> Tried adding more past timesteps
=> Tried training with sliding window and TimeSeriesSplit
I understand that the model is replicating the last value known to it, thereby minimizing the loss as well as it can.
The validation and training loss remain low throughout the training process. This makes me wonder whether I need to come up with a new loss function for this purpose.
Is that necessary? If so, what loss function should I go for?
I have tried all the methods that I stumbled upon. I can't find any resource at all that addresses this kind of issue. Is this a problem with the data? Is it because the problem is very hard for an LSTM to learn?
you asked for my help at:
stock prediction : GRU model predicting same given values instead of future stock price
Hope I'm not too late. What you can try is to reduce the numerical explicitness of your features. Let me explain:
Similar to my answer in the previous topic: the regression algorithm will use the values from the time window you give as a sample to minimize the error. Let's assume you are trying to predict the closing price of BTC at time t. One of your features consists of previous closing prices, and you are giving a time-series window of the last 20 inputs, from t-20 to t-1. A regressor will probably learn to choose the closing value at time step t-1 or t-2, or something close to it, which in this case is cheating. Think of it like this: if the closing price was $6340 at t-1, predicting $6340 or something close at t+1 minimizes the error most strongly. But the algorithm did not actually learn any patterns; it just replicates, so it does nothing beyond fulfilling its optimization duty.
Think analogously from my example: by reducing the explicitness, what I mean is this: do not give the closing prices directly, but scale them, or do not use explicit prices at all. Do not use any features that explicitly show the closing prices to the algorithm, and do not use open, high, low, etc. for every time step. You will need to be creative here and engineer the features to get rid of the explicit ones; for example, you can give squared close differences (in my experience, a regressor can still steal from the past with linear differences), or their ratio to volume. Or you can make the features categorical by digitizing them in a way that makes sense. The point is: do not give direct intuition about what it should predict, only provide patterns for the algorithm to work on.
A faster approach may be possible depending on your task. You can do multi-class classification if predicting the percentage change as a label is enough for you; just be careful about class imbalance. If just the up/down direction is enough, you can go directly for binary classification. Replication or shifting problems are only seen in regression tasks, provided you are not leaking data from training to the test set. If possible, move away from regression for windowed time-series applications. A rough sketch of the classification reframing follows below.
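As a rough illustration of that reframing (hypothetical: close is assumed to be a 1-D array of closing prices and the horizon is one step):
import numpy as np

close = np.asarray(close, dtype=float)

# percentage change from one step to the next
pct_change = (close[1:] - close[:-1]) / close[:-1]

# binary labels: 1 if the series goes up at the next step, 0 otherwise
y_binary = (pct_change > 0).astype(int)

# multi-class labels: bucket the percentage change (thresholds here are arbitrary)
y_multiclass = np.digitize(pct_change, [-0.01, 0.01])   # 0 = down, 1 = flat, 2 = up

# feed y_binary / y_multiclass to a classifier instead of regressing on the raw price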
If anything is misunderstood or missing, I will be around. Hope I could help. Good luck.
Most likely your LSTM is learning to guess roughly what its previous input value was (modulated a bit). That's why you see a "shift".
So let's say your data looks like:
x = [1, 1, 1, 4, 5, 4, 1, 1]
And your LSTM learned to just output the previous input for the current timestep. Then your output would look like:
y = [?, 1, 1, 1, 4, 5, 4, 1]
Because your network has some complicated machinery, it is not quite this straightforward, but in principle the "shift" you see is caused by this phenomenon.
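One quick way to check whether the network has learned anything beyond this is to compare it against a naive persistence baseline. A sketch, assuming y_valid and my_forecasts from the question are 1-D and aligned in time:
import numpy as np

y_true = np.ravel(y_valid)
y_model = np.ravel(my_forecasts)

# persistence baseline: predict that the next value equals the current one
persistence_mae = np.mean(np.abs(y_true[1:] - y_true[:-1]))
model_mae = np.mean(np.abs(y_true - y_model))

print("persistence MAE:", persistence_mae)
print("model MAE:      ", model_mae)
# if the two are nearly equal, the LSTM is mostly replicating the last known value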

Predicting a single value in classification network

I am currently trying to get into machine learning and neural networks, but my lack of programming skills is kind of hindering me at the moment. I am following an online tutorial in which these lines of code are used to evaluate the created model:
pred_fn = tf.estimator.inputs.pandas_input_fn(x=X_test,batch_size=len(X_test),shuffle=False)
predictions = list(model.predict(input_fn=pred_fn))
predictions[0]
final_preds = []
for pred in predictions:
    final_preds.append(pred['class_ids'][0])
final_preds[:10]
from sklearn.metrics import classification_report
print(classification_report(y_test,final_preds))
This works very well for me and tells me the precision I achieved on the 10 inputs I chose from X_test. Unfortunately, I can't really figure out how to predict a particular single value from X_test, or maybe even a manually entered value that has the same dimensions as an element of X_test.
X_test is a pandas.core.frame.DataFrame with 15 columns and thousands of rows. Therefore, I would find it helpful to be able to predict or evaluate a single row.
If I missed any essential information that I should have included, let me know. Thanks in advance!
Why don't you just take sections of the X_test DataFrame, or pass in single values as a DataFrame with a single row?
Sectioning a dataframe:
temp = X_test[i:i+1]
To test with the i-th row, use temp instead of X_test.
Or create a new dataframe with required data:
temp = pandas.DataFrame(data, columns = X_test.columns)
where data is input as a list (iterable) [[a1,a2,a3...a15]].
Again, use temp instead of X_test in your code.
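Putting it together, a minimal sketch that predicts only the i-th row (assuming model, X_test and the tf.estimator setup from the question):
i = 0
single_row = X_test[i:i+1]   # slicing keeps it a one-row DataFrame, not a Series

single_fn = tf.estimator.inputs.pandas_input_fn(x=single_row, batch_size=1, shuffle=False)
single_pred = list(model.predict(input_fn=single_fn))[0]
print(single_pred['class_ids'][0])   # predicted class for that single row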

XGBoost difference in train and test features after converting to DMatrix

Just wondering how the following case is possible:
def fit(self, train, target):
    xgtrain = xgb.DMatrix(train, label=target, missing=np.nan)
    self.model = xgb.train(self.params, xgtrain, self.num_rounds)
I passed the train dataset as a csr_matrix with 5233 columns, and after converting to DMatrix I got 5322 features.
Later, on the predict step, I got an error caused by the issue above :(
def predict(self, test):
    if not self.model:
        return -1
    xgtest = xgb.DMatrix(test)
    return self.model.predict(xgtest)
Error: ... training data did not have the following fields: f5232
How can I guarantee that my train/test datasets are converted to DMatrix correctly?
Is there any way to do something in Python similar to this R code?
# get same columns for test/train sparse matrixes
col_order <- intersect(colnames(X_train_sparse), colnames(X_test_sparse))
X_train_sparse <- X_train_sparse[,col_order]
X_test_sparse <- X_test_sparse[,col_order]
My approach doesn't work, unfortunately:
def _normalize_columns(self):
    columns = (set(self.xgtest.feature_names) - set(self.xgtrain.feature_names)) | \
              (set(self.xgtrain.feature_names) - set(self.xgtest.feature_names))
    for item in columns:
        if item in self.xgtest.feature_names:
            self.xgtest.feature_names.remove(item)
        else:
            # seems it's an immutable structure and cannot take any new item!!!
            self.xgtest.feature_names.append(item)
Another possibility is that a feature level exists exclusively in the training data or exclusively in the test data. This situation happens mostly after one-hot encoding, which produces a big matrix with a column for each level of the categorical features. In your case, it looks like "f5232" is exclusive to either the training or the test data. In either case, model scoring is likely to throw an error (in most implementations of ML packages) because:
If exclusive to training: the model object will have a reference to this feature in the model equation. While scoring, it will throw an error saying it is not able to find this column.
If exclusive to test (less likely, as test data is usually smaller than training data): the model object will NOT have a reference to this feature in the model equation. While scoring, it will throw an error saying it got this column but the model equation doesn't have it. This is also less likely because most implementations are cognizant of this case.
Solutions:
The best "automated" solution is to keep only those columns which are common to both the training and test data after one-hot encoding (see the sketch below).
For ad hoc analysis, if you cannot afford to drop a feature level because of its importance, do stratified sampling to ensure that all levels of the feature get distributed to both the training and the test data.
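A minimal Python sketch of that first solution, mirroring the R snippet in the question (this assumes X_train_sparse and X_test_sparse are pandas DataFrames of one-hot encoded features, e.g. from pd.get_dummies, and target is the training label):
import numpy as np
import xgboost as xgb

# keep only the columns present in both frames, in the same order
common_cols = X_train_sparse.columns.intersection(X_test_sparse.columns)
X_train_aligned = X_train_sparse[common_cols]
X_test_aligned = X_test_sparse[common_cols]

xgtrain = xgb.DMatrix(X_train_aligned, label=target, missing=np.nan)
xgtest = xgb.DMatrix(X_test_aligned)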
This situation can happen after one-hot encoding. For example,
import numpy as np
import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder

ar = np.array([
    [1, 2],
    [1, 0]
])
enc = OneHotEncoder().fit(ar)
ar2 = enc.transform(ar)

b = np.array([[1, 0]])
b2 = enc.transform(b)

xgb_ar = xgb.DMatrix(ar2)
xgb_b = xgb.DMatrix(b2)
print(b2.shape)         # (1, 3)
print(xgb_b.num_col())  # 2
So, when you have an all-zero column in a sparse matrix, DMatrix drops this column (I think because such a column is useless for XGBoost).
Usually, I add a fake row to the matrix that contains 1 in every column.
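A minimal sketch of that fake-row workaround, applied to b2 (the 1x3 sparse matrix from the example above):
import numpy as np
import scipy.sparse as sp
import xgboost as xgb

ones_row = sp.csr_matrix(np.ones((1, b2.shape[1])))  # a row of 1s keeps every column non-empty
b2_padded = sp.vstack([b2, ones_row])

xgb_b = xgb.DMatrix(b2_padded)
print(xgb_b.num_col())  # now 3, matching b2.shape[1]
# remember to ignore the prediction produced for the padded last row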
Such an issue occurred for me when the RandomUnderSampler (RUS) method returned a np.array rather than a pandas DataFrame with column names.
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(return_indices=True)
X_rus, y_rus, id_rus = rus.fit_sample(X_train, y_train)
I resolved the issue with this:
X_rus = pd.DataFrame(X_rus, columns = X_train.columns)
Basically, take the output of the RUS method and create a pandas DataFrame out of it, with column names from the original X_train data that was the input to the RUS method.
This can be generalized to any similar problem where XGBoost expects to read column names but cannot. Just create a pandas DataFrame and assign the column names accordingly, as sketched below.
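A tiny sketch of that general pattern (some_array here is a hypothetical ndarray whose columns are assumed to line up with X_train's):
import pandas as pd
import xgboost as xgb

# wrap the bare ndarray in a DataFrame so the DMatrix carries feature names
some_df = pd.DataFrame(some_array, columns=X_train.columns)
dmat = xgb.DMatrix(some_df)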
