Predicting a single value in classification network - python

I am currently trying to get into machine learning and neural networks, but my lack of programming skills is holding me back at the moment. I am following an online tutorial in which these lines of code are used to evaluate the trained model:
pred_fn = tf.estimator.inputs.pandas_input_fn(x=X_test, batch_size=len(X_test), shuffle=False)
predictions = list(model.predict(input_fn=pred_fn))
predictions[0]
final_preds = []
for pred in predictions:
    final_preds.append(pred['class_ids'][0])
final_preds[:10]
from sklearn.metrics import classification_report
print(classification_report(y_test, final_preds))
This works very well for me and tells me the precision I achieved on the inputs I chose from X_test. Unfortunately, I can't figure out how to predict a particular single value from X_test, or even a manually entered value that has the same dimensions as an element of X_test.
X_test is a pandas.core.frame.DataFrame with 15 columns and thousands of rows, so it would be helpful to be able to predict or evaluate a single value.
If I missed any essential information that I should have included, let me know. Thanks in advance!

Why don't you just take sections of the X_test DataFrame, or pass in single values as a DataFrame with a single row?
Sectioning a DataFrame:
temp = X_test[i:i+1]
To test with the i-th row, use temp instead of X_test.
Or create a new DataFrame with the required data:
temp = pandas.DataFrame(data, columns=X_test.columns)
where data is a list (iterable) of the form [[a1, a2, a3, ..., a15]]. Again, use temp in place of X_test in your code.
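For instance, here is a minimal sketch of predicting on a single row with the estimator setup from the question (the row index i is arbitrary):
# take the i-th row as a one-row DataFrame (double brackets keep it 2-D)
single_row = X_test.iloc[[i]]
# build an input function for just this row
single_fn = tf.estimator.inputs.pandas_input_fn(x=single_row, batch_size=1, shuffle=False)
# predict returns a generator of dicts; take the first (and only) element
single_pred = list(model.predict(input_fn=single_fn))[0]
print(single_pred['class_ids'][0])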

Related

CatBoost Post-Training Feature Information

I would like to understand how I can access information about numerical and categorical features after training a CatBoost model. For the sake of example, here's some toy code:
import pandas as pd
from catboost import CatBoostClassifier, Pool
train_pool = Pool(pd.DataFrame({'size': [1, 1, 2, 1],
                                'shape': ['square', 'square', 'square', 'circle']}),
                  [1, 1, 0, 1],
                  feature_names=['size', 'shape'],
                  cat_features=['shape'])
model = CatBoostClassifier(iterations=2,
                           cat_features=['shape'],
                           ctr_leaf_count_limit=1)
model.fit(train_pool, plot=False)
I would now like to run a function on the model object to obtain the following:
Numerical feature size has minimum value 1 and maximum value 2 (this should be part of CatBoost's split logic for numerical features)
Categorical Feature shape has the following training values:
values=['square', None].
Notice that circle is not in values because ctr_leaf_count_limit=1 would have selected the most frequently occurring value, which in this case is 'square'. I've put None here because I'm fairly sure CatBoost will assign None to any unseen categories.
Next, I've chosen the above data example to make sure that CatBoost decides to split on shape=='square'. Ideally, I'd like to see an array used_values=['square'] which confirms that there was at least one split on this 'square' value.
It's important to emphasize that I want to operate on the model object only. Obviously, one can get some of these details by running functions on top of the training data. My motivation is to make doubly sure that I completely understand the training range of inputs into the model, and what it may do to them in preprocessing.

Why is sklearn.metrics support value changing every time?

I'm working on training a supervised-learning Keras model to categorize data into one of 3 categories. After training, I run this:
import numpy as np
import pandas
import sklearn.metrics
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.utils import to_categorical

dataset = pandas.read_csv(filename, header=[0], encoding='utf-8-sig', sep=',')
# split X and Y (last column)
array = dataset.values
columns = array.shape[1] - 1
np.random.shuffle(array)
x_orig = array[:, 1:columns]
testy = array[:, columns]
columns -= 1
# normalize data
scaler = StandardScaler()
testx = scaler.fit_transform(x_orig)
# one-hot encode the labels
testy = to_categorical(testy)
# load weights
save_path = "[filepath]"
model = tf.keras.models.load_model(save_path)
# gets class breakdown
y_pred = model.predict(testx, verbose=1)
y_pred_bool = np.argmax(y_pred, axis=1)
y_true = np.argmax(testy, axis=1)
print(sklearn.metrics.precision_recall_fscore_support(y_true, y_pred_bool))
sklearn.metrics.precision_recall_fscore_support prints, among other metrics, the support for each class. Per the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html), support is the number of occurrences of each class in y_true, i.e. the true labels.
My problem: on each run, the support is different. I'm using the same data, and the support for each class always adds up to the same total (though that total differs from the one in the file, which I also don't understand), but the number per class differs.
As an example, one run might say [16870, 16299, 7807] and the next might say [17169, 15923, 7884]. They add up the same, but each class differs.
Since my data isn't changing between runs, I'd expect the support to be identical every time. Am I wrong? If not, what's going on? I've tried googling but didn't get any useful results.
Potentially useful information: when I run sklearn.metrics.classification_report, I have the same issue, and the numbers from it match the numbers from precision_recall_fscore_support.
Sidenote: unrelated to the above question, but I couldn't google-fu an answer to this one either; I hope that's OK to include here. When I run model.evaluate, part of the printout is e.g. 74us/sample. What does us/sample mean?
Add:
np.random.seed(42)
before you shuffle the array at
np.random.shuffle(array)
The reason for this is that without seeding, np.random.shuffle creates a different ordering each time. Thus, when you feed the array into the model, it will return a different result. Seeding allows you to shuffle it the same way each time, producing reproducible results.
Alternatively, you can skip shuffling and feed the same array into the model every time. Either method will ensure reproducibility within the model.
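As a minimal sketch, the reproducible version of the shuffle step would look like this (the seed value 42 is arbitrary; any fixed integer works):
import numpy as np

np.random.seed(42)        # fix the RNG state so every run uses the same permutation
np.random.shuffle(array)  # now shuffles identically on every run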

How can I predict the following value of time series using Tensorflow predict method?

I am wondering how to predict and get future time series data after model training. I would like to get the values N steps ahead, and I wonder whether the time series data has been properly learned and predicted. How do I do this correctly? I want to get the next values using model.predict or something similar.
I have x_test, and x_test[-1] == t, so by the "next values" I mean t+1, t+2, ..., t+n; that is what I want to obtain in this example.
First
I tried using stock index data
inputs = total_data[len(total_data) - forecast - look_back:]
inputs = scaler.transform(inputs)
X_test = []
for i in range(look_back, inputs.shape[0]):
    X_test.append(inputs[i - look_back:i])
X_test = np.array(X_test)
predicted = model.predict(X_test)
but the result looks like the one below.
The results from X_test[-20:] and the following 20 predictions look the same. I'm wondering whether this is the correct method to train and predict values, and whether the result is correct.
full source
The method I tried first did not work correctly.
Second
I realized something was wrong, so I tried other official data: I used the time series from the TensorFlow tutorial to practice training the model.
a = y_val[-look_back:]
for i in range(n_steps):  # predict a new value n times
    tmp = model.predict(a.reshape(-1, look_back, num_feature))  # predicted value
    a = a[1:]              # remove the first value
    a = np.append(a, tmp)  # append the predicted value
The predictions came out in a linear-regression-like shape, very different from the real data.
The output is an abnormal, nearly straight line that is independent of the real data:
full source (my code starts after line 25)
I'm really curious how I can predict the following values of a time series using the TensorFlow predict method.
I'm not asking whether this works theoretically; I'm just wondering how to get the following n steps using the predict method.
Thank you for reading this long question. I'd appreciate your advice.
"lstm" is normaly used to predict 3D data => target
which input has same time frame number(n , t ,f)
"n" for data number "t" for frame number , "f" for feature number
what you want to predict is 1D data => target
which is (t ,f )
when you have f = 0 then you can only use F(t) => y
if you can estimate function F then u can get y . NN can't help here

Loading the EMNIST-letters dataset

I have been trying to find a way to load the EMNIST-letters dataset, but without much success. I have found interesting stuff in its structure and can't wrap my head around what is happening. Here is what I mean:
I downloaded the .mat format from here
I can load the data using
import scipy.io
mat = scipy.io.loadmat('letter_data.mat') # renamed for convenience
It is a dictionary with the following keys:
dict_keys(['__header__', '__version__', '__globals__', 'dataset'])
The only key of interest is dataset, which I haven't been able to gather data from. Printing its shape gives this:
>>>print(mat['dataset'].shape)
(1, 1)
I dug deeper and deeper to find a shape that looks somewhat like a real dataset and came across this:
>>>print(mat['dataset'][0][0][0][0][0][0].shape)
(124800, 784)
which is exactly what I wanted, but I can't find the labels or the test data; I've tried many things but can't seem to understand the structure of this dataset.
If someone could tell me what is going on with this, I would appreciate it.
Because of the way the dataset is structured, the array of image arrays can be accessed with mat['dataset'][0][0][0][0][0][0] and the array of label arrays with mat['dataset'][0][0][0][0][0][1]. For instance, print(mat['dataset'][0][0][0][0][0][0][0]) will print out the pixel values of the first image, and print(mat['dataset'][0][0][0][0][0][1][0]) will print the first image's label.
For a less...convoluted dataset, I'd actually recommend using the CSV version of the EMNIST dataset on Kaggle: https://www.kaggle.com/crawford/emnist, where each row is a separate image, there are 785 columns where the first column = class_label and each column after represents one pixel value (784 total for a 28 x 28 image).
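For instance, a minimal sketch of loading that CSV version with pandas (the filename emnist-letters-train.csv is an assumption; check the exact name in your download):
import pandas as pd

# each row: first column is the class label, the remaining 784 columns are pixels
train = pd.read_csv('emnist-letters-train.csv', header=None)  # filename assumed
y_train = train.iloc[:, 0].values       # class labels
X_train = train.iloc[:, 1:].values      # 784 pixel values per image
X_train = X_train.reshape(-1, 28, 28)   # optional: back to 28 x 28 images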
Josh Payne's answer is correct, but I'll expand on it for those who want to use the .mat file, with an emphasis on typical data splits.
The data has already been split into a training and a test set. Here's how I accessed the data:
from scipy import io as sio
mat = sio.loadmat('emnist-letters.mat')
data = mat['dataset']
X_train = data['train'][0,0]['images'][0,0]
y_train = data['train'][0,0]['labels'][0,0]
X_test = data['test'][0,0]['images'][0,0]
y_test = data['test'][0,0]['labels'][0,0]
There is an additional field 'writers' (e.g. data['train'][0,0]['writers'][0,0]) that distinguishes the original sample writer. Finally, there is another field data['mapping'], but I'm not sure what it is mapping the digits to.
In addition, in Section II.D, the EMNIST paper states that "the last portion of the training set, equal in size to the testing set, is set aside as a validation set". Strangely, the .mat file's training/testing size does not match the number listed in Table II, but it does match the size in Fig. 2.
val_start = X_train.shape[0] - X_test.shape[0]
X_val = X_train[val_start:X_train.shape[0],:]
y_val = y_train[val_start:X_train.shape[0]]
X_train = X_train[0:val_start,:]
y_train = y_train[0:val_start]
If you don't want a validation set, it is fine to leave these samples in the training set.
Also, if you would like to reshape the data into 2D, 28x28-sized images instead of a 1D array of 784 values, then to get the correct image orientation you'll need to do a numpy reshape using Fortran ordering (Matlab uses column-major ordering, just like Fortran; reference). e.g. -
X_train = X_train.reshape( (X_train.shape[0], 28, 28), order='F')
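A quick way to verify the orientation is to plot one sample (a sketch that assumes matplotlib is installed):
import matplotlib.pyplot as plt

plt.imshow(X_train[0], cmap='gray')  # should display an upright letter
plt.title(str(y_train[0]))           # its numeric label
plt.show()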
An alternative solution is to use the EMNIST Python package. (Full details at https://pypi.org/project/emnist/)
This lets you pip install emnist in your environment and then import the datasets (they will download the first time you run the program).
Example from the site:
>>> from emnist import extract_training_samples
>>> images, labels = extract_training_samples('digits')
>>> images.shape
(240000, 28, 28)
>>> labels.shape
(240000,)
You can also list the datasets:
>>> from emnist import list_datasets
>>> list_datasets()
['balanced', 'byclass', 'bymerge', 'digits', 'letters', 'mnist']
And replace 'digits' in the first example with your choice.
This gives you all the data in numpy arrays, which I have found makes things easy to work with.
I suggest downloading the 'Binary format as the original MNIST dataset' from the Yann LeCun website.
Unzip the downloaded file and then, with Python:
import idx2numpy
X_train = idx2numpy.convert_from_file('./emnist-letters-train-images-idx3-ubyte')
y_train = idx2numpy.convert_from_file('./emnist-letters-train-labels-idx1-ubyte')
X_test = idx2numpy.convert_from_file('./emnist-letters-test-images-idx3-ubyte')
y_test = idx2numpy.convert_from_file('./emnist-letters-test-labels-idx1-ubyte')

XGBoost difference in train and test features after converting to DMatrix

Just wondering how the following case is possible:
def fit(self, train, target):
    xgtrain = xgb.DMatrix(train, label=target, missing=np.nan)
    self.model = xgb.train(self.params, xgtrain, self.num_rounds)
I passed the train dataset as a csr_matrix with 5233 columns, and after converting to DMatrix I got 5322 features.
Later, on the predict step, I got an error caused by the bug above :(
def predict(self, test):
    if not self.model:
        return -1
    xgtest = xgb.DMatrix(test)
    return self.model.predict(xgtest)
Error: ... training data did not have the following fields: f5232
How can I guarantee correct conversion of my train/test datasets to DMatrix?
Is there any chance to use something similar to this R code in Python?
# get same columns for test/train sparse matrixes
col_order <- intersect(colnames(X_train_sparse), colnames(X_test_sparse))
X_train_sparse <- X_train_sparse[,col_order]
X_test_sparse <- X_test_sparse[,col_order]
My approach doesn't work, unfortunately:
def _normalize_columns(self):
    columns = (set(self.xgtest.feature_names) - set(self.xgtrain.feature_names)) | \
              (set(self.xgtrain.feature_names) - set(self.xgtest.feature_names))
    for item in columns:
        if item in self.xgtest.feature_names:
            self.xgtest.feature_names.remove(item)
        else:
            # it seems to be an immutable structure and no new item can be added!
            self.xgtest.feature_names.append(item)
Another possibility is that a feature level appears exclusively in the training data and not in the test data. This happens mostly after one-hot encoding, which produces a big matrix with a column for each level of the categorical features. In your case it looks like "f5232" is exclusive to either the training or the test data. In either case, model scoring is likely to throw an error (in most implementations of ML packages) because:
If it is exclusive to training: the model object will have a reference to this feature in the model equation. While scoring, it will throw an error saying it is not able to find this column.
If it is exclusive to test (less likely, as test data is usually smaller than training data): the model object will NOT have a reference to this feature in the model equation. While scoring, it will throw an error saying it got this column but the model equation doesn't have it. This is also less likely because most implementations are cognizant of this case.
Solutions:
The best "automated" solution is to keep only those columns, which are common to both training and test post one hot encoding.
For adhoc analysis if you can not afford to drop the level of feature because of its importance then do stratified sampling to ensure that all level of feature gets distributed to training and test data.
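A minimal sketch of the "common columns" approach with pandas (train_df, test_df, and y_train are illustrative names, not from the question):
import pandas as pd
import xgboost as xgb

# one-hot encode train and test separately (this is where the columns can diverge)
X_train = pd.get_dummies(train_df)
X_test = pd.get_dummies(test_df)

# keep only the columns present in both frames
X_train, X_test = X_train.align(X_test, join='inner', axis=1)

xgtrain = xgb.DMatrix(X_train, label=y_train)
xgtest = xgb.DMatrix(X_test)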
This situation can happen after one-hot encoding. For example:
import numpy as np
import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder

ar = np.array([
    [1, 2],
    [1, 0]
])
enc = OneHotEncoder().fit(ar)
ar2 = enc.transform(ar)
b = np.array([[1, 0]])
b2 = enc.transform(b)
xgb_ar = xgb.DMatrix(ar2)
xgb_b = xgb.DMatrix(b2)
print(b2.shape)         # (1, 3)
print(xgb_b.num_col())  # 2
So, when you have an all-zero column in a sparse matrix, DMatrix drops that column (I think because it is useless for XGBoost).
Usually, I add a fake row to the matrix which contains 1 in all columns, as sketched below.
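A sketch of that workaround with scipy (b2 is the sparse matrix from the example above; remember to discard the fake row's prediction afterwards):
import numpy as np
from scipy import sparse

# append one fake row of all ones so that no column is entirely zero
ones_row = sparse.csr_matrix(np.ones((1, b2.shape[1])))
b2_padded = sparse.vstack([b2, ones_row])

xgb_b = xgb.DMatrix(b2_padded)
print(xgb_b.num_col())  # now all 3 columns are kept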
Such an issue occurred for me when the RandomUnderSampler (RUS) method returned an np.array rather than a Pandas DataFrame with column names.
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(return_indices=True)
X_rus, y_rus, id_rus = rus.fit_sample(X_train, y_train)
I resolved the issue with this:
X_rus = pd.DataFrame(X_rus, columns = X_train.columns)
Basically, I took the output of the RUS method and created a Pandas DataFrame out of it with the column names from the original X_train data, which was the input to the RUS method.
This can be generalized to any similar problem where XGBoost is expected to read column names but cannot. Just create a Pandas DataFrame and assign the column names accordingly.
