I'm attempting to do a toy linear regression in Python with TensorFlow, using the pre-built estimator tf.contrib.learn.LinearRegressor instead of building my own estimator.
The inputs I'm using are real-valued numbers between 0 and 1, and the outputs are just 3*inputs. TensorFlow seems to fit the data (no errors raised), but the outputs have no correlation to what they should be.
I'm not sure I'm getting the predictions done correctly; the documentation for the predict() function is pretty sparse.
Any ideas for how to improve the fitting?
import numpy as np
import pandas as pd
import tensorflow as tf
import itertools
import matplotlib.pyplot as plt
#Defining data set
x = np.random.rand(200)
y = 3.0*x
data = pd.DataFrame({'X':x, 'Y':y})
training_data = data[50:]
test_data= data[:50]
COLUMNS = ['Y','X']
FEATURES = ['X']
LABELS = 'Y'
#Wrapper function for the inputs of LinearRegressor
def get_input_fn(data_set, num_epochs=None, shuffle=True):
    return tf.estimator.inputs.pandas_input_fn(
        x=pd.DataFrame(data_set[FEATURES]),
        y=pd.Series(data_set[LABELS]),
        num_epochs=num_epochs,
        shuffle=shuffle)
feature_cols = [tf.feature_column.numeric_column(k) for k in FEATURES]
regressor = tf.contrib.learn.LinearRegressor(feature_columns=feature_cols)
regressor.fit(input_fn=get_input_fn(test_data), steps=100)
results = regressor.predict(
    input_fn=get_input_fn(test_data, num_epochs=1))
predictions = list(itertools.islice(results, 50))
#Visualizing the results
fig = plt.figure(figsize=[8,8])
ax = fig.add_subplot(111)
ax.scatter(test_data[LABELS], predictions)
ax.set_xlabel('Actual')
ax.set_ylabel('Predicted')
plt.show()
[Scatter plot of actual vs. predicted results]
Figured out the answer; answering here for posterity:
my input function to LinearRegressor had shuffle=True set as an argument, and my predict() call did not set shuffle=False. So the outputs were shuffled around, making them look as if they hadn't converged!
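A minimal sketch of the fix, reusing the get_input_fn wrapper from the question: keep shuffling for training, but turn it off (and run a single epoch) for the input_fn used at prediction time.

# shuffle=False so predictions come back in the same order as test_data
results = regressor.predict(
    input_fn=get_input_fn(test_data, num_epochs=1, shuffle=False))
predictions = list(itertools.islice(results, 50))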
I am using scikit-learn's Gaussian process regression for a problem, together with the built-in TransformedTargetRegressor.
The issue I face is that GPR allows predictions to be returned with their standard deviations; in that case the estimator actually returns a tuple of two numpy arrays (one for the mean, the other for the std). However, TransformedTargetRegressor only expects a single numpy array and therefore breaks when the predict method is called with return_std=True.
I have dropped in a really simple example to demonstrate this. It's meant to be representative of an actual problem, hence the inclusion of a pipeline, albeit with no pre-processing steps. There are also some commented-out lines that show how the predict method works without the transformed target regressor.
I would like to hear if there is any way around this, short of applying the transformer to the predictions manually myself.
#%% Imports
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (RBF, DotProduct, RationalQuadratic, Matern)
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import PowerTransformer
#%% Generate Data
X = np.linspace(start=0, stop=10, num=1_000).reshape(-1, 1)
y = np.squeeze(X * np.sin(X))
rng = np.random.RandomState(1)
training_indices = rng.choice(np.arange(y.size), size=6, replace=False)
X_train, y_train = X[training_indices], y[training_indices]
#%% Fit Model
kernel = 1 * RBF(length_scale=1.0, length_scale_bounds=(1e-2, 1e2))
# Standard Estimator
# estimator = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
# Transformed Estimator
estimator = TransformedTargetRegressor(
    regressor=GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9),
    transformer=PowerTransformer(method='yeo-johnson')
)
pipe = Pipeline(
    steps=[
        ("estimator", estimator)
    ]
)
pipe.fit(X_train, y_train)
#%% Predict
# No parameters - Prediction returns numpy array
# pipe.predict(X)
# Std Parameter - Prediction returns tuple of numpy arrays
mean_prediction, std_prediction = pipe.predict(X, return_std=True)
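Since that last call is what breaks, here is a sketch of the manual workaround the question mentions, assuming the fitted pipe from above: call the fitted inner GaussianProcessRegressor directly to get both arrays, then undo the target transform on the mean yourself. Note that the returned std lives in the transformed space; mapping it back exactly through a non-linear PowerTransformer is not straightforward.

ttr = pipe.named_steps["estimator"]              # fitted TransformedTargetRegressor
gpr = ttr.regressor_                             # fitted inner GaussianProcessRegressor
mean_t, std_t = gpr.predict(X, return_std=True)  # predictions in transformed target space
# invert the Yeo-Johnson transform on the mean only
mean_prediction = ttr.transformer_.inverse_transform(mean_t.reshape(-1, 1)).ravel()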
I am sorry if this is a long post, but I have some questions related to the confusion-matrix metric and cross-validation that I really need help with.
The picture from the sklearn CV documentation shows that the whole dataset should first be split into train and test sets. Then the train set is split again into folds: we train our model on k-1 folds and validate on the remaining one (repeated k times). Lastly, we test the model with the test set held out at the beginning.
In my problem, I have a dataset for an imbalanced binary classification problem with 42372 samples; 3615 belong to class 1 and the rest are class 0.
Since my dataset is imbalanced, I was using StratifiedShuffleSplit with 5 splits. Using an MLPClassifier, I got a confusion matrix in which half my dataset is used for testing (19361 + 19 + 1782 + 28 = 21190).
After this, I changed the CV strategy and tried StratifiedKFold instead. In that second confusion matrix, my whole dataset is used for testing (38644 + 113 + 3329 + 286 = 42372).
So, here are my questions:
1 - Do I need to split my whole dataset into train/test first (e.g., using train_test_split) and then feed only the train part to the CV iterators (KFold, StratifiedKFold, StratifiedShuffleSplit, etc.)? Or should I feed my whole dataset to the iterators and let them do the job of splitting it into train/test and then splitting that train part again into train and validation?
2 - About the CV strategies I tried: why does StratifiedShuffleSplit use half the data, and why does StratifiedKFold use all of it? Is either of these CV setups wrong? Are both wrong, or both correct? What am I missing here?
EDIT: I found the original code to generate the confusion matrix here; I have just modified it a little to fit my needs:
import itertools
import time as time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
# from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

n_splits = 5  # Number of folds
stratshufkfold = StratifiedShuffleSplit(n_splits=n_splits, random_state=0)
# stratshufkfold = StratifiedKFold(n_splits=n_splits)

def generate_confusion_matrix(cnf_matrix, classes, normalize=False, title='Confusion matrix'):
    if normalize:
        cnf_matrix = cnf_matrix.astype('float') / cnf_matrix.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    plt.imshow(cnf_matrix, interpolation='nearest', cmap=plt.get_cmap('Blues'))
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cnf_matrix.max() / 2.
    for i, j in itertools.product(range(cnf_matrix.shape[0]), range(cnf_matrix.shape[1])):
        plt.text(j, i, format(cnf_matrix[i, j], fmt), horizontalalignment="center",
                 color="white" if cnf_matrix[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    return cnf_matrix

def plot_confusion_matrix(predicted_labels_list, y_test_list):
    cnf_matrix = confusion_matrix(y_test_list, predicted_labels_list)
    np.set_printoptions(precision=2)
    # Plot non-normalized confusion matrix
    plt.figure()
    generate_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix, without normalization')
    plt.show()
    # Plot normalized confusion matrix
    plt.figure()
    generate_confusion_matrix(cnf_matrix, classes=class_names, normalize=True, title='Normalized confusion matrix')
    plt.show()

def evaluate_model_MLP(x, y):
    predicted_targets = np.array([])
    actual_targets = np.array([])
    global t_inicial_MLP
    global t_final_MLP
    t_inicial_MLP = time.time()
    for train_ix, test_ix in stratshufkfold.split(x, y):
        train_x, train_y, test_x, test_y = x[train_ix], y[train_ix], x[test_ix], y[test_ix]
        # Fit
        classifier = MLPClassifier(activation='relu', batch_size=56, solver='sgd').fit(train_x, train_y)
        predicted_labels = classifier.predict(test_x)
        predicted_targets = np.append(predicted_targets, predicted_labels)
        actual_targets = np.append(actual_targets, test_y)
    t_final_MLP = time.time()
    return predicted_targets, actual_targets

# x, y and class_names come from my own dataset (not shown here)
predicted_target_MLP, actual_target_MLP = evaluate_model_MLP(x, y)
plot_confusion_matrix(predicted_target_MLP, actual_target_MLP)
acuracia_MLP = accuracy_score(actual_target_MLP, predicted_target_MLP)
As specified in the comment, for the first question the first option is the way to go: split the whole dataset via train_test_split, then call the .split() method of the chosen cross-validator on the training set only.
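A minimal sketch of that workflow, assuming a generic (X, y) and an arbitrary classifier (the names and data below are illustrative, not from the question):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, StratifiedKFold

X = np.random.rand(1000, 4)
y = np.random.choice([0, 1], size=1000, p=[0.9, 0.1])

# Hold out a final test set first...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# ...then cross-validate only on the training portion.
cv = StratifiedKFold(n_splits=5)
for train_ix, val_ix in cv.split(X_train, y_train):
    clf = LogisticRegression(max_iter=1000).fit(X_train[train_ix], y_train[train_ix])
    print("fold accuracy:", clf.score(X_train[val_ix], y_train[val_ix]))

# X_test / y_test are touched only once, for the final evaluation.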
For the second point, the issue hides behind some default parameters of StratifiedKFold and StratifiedShuffleSplit and behind the slightly different meaning of the n_splits parameter.
For StratifiedKFold, n_splits identifies the number of folds, as per the documentation. Therefore, setting n_splits=5 means the model is trained on 4 folds (80% of the training set) and tested on the remaining fold (20% of the training set), for each possible combination, so every sample ends up in a test fold exactly once.
For StratifiedShuffleSplit, n_splits instead specifies the number of reshuffling and splitting iterations. It is the train_size parameter (together with test_size) that defines how big the splits are relative to the size of the training set. In particular, according to the docs, if neither is specified the defaults are train_size=0.9 (90% of the training set) and test_size=0.1 (10% of the training set).
Therefore, specifying test_size within the StratifiedShuffleSplit constructor, e.g. as follows, should solve your problem:
stratshufkfold = StratifiedShuffleSplit(n_splits=n_splits, random_state=0, test_size=0.2)
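To see the difference in coverage directly, here is a quick sketch with made-up data sizes (not the question's dataset): concatenating the test folds of StratifiedKFold visits every sample exactly once, while StratifiedShuffleSplit with the default test_size=0.1 only draws 10% per split.

import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

X = np.zeros((1000, 1))
y = np.array([0] * 900 + [1] * 100)

for cv in (StratifiedKFold(n_splits=5),
           StratifiedShuffleSplit(n_splits=5, random_state=0)):
    total_test = sum(len(test_ix) for _, test_ix in cv.split(X, y))
    print(type(cv).__name__, "samples seen in test folds:", total_test)
# StratifiedKFold: 1000 (each sample exactly once)
# StratifiedShuffleSplit: 500 (100 per split, possibly overlapping across splits)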
I noticed that there are two possible implementations of XGBoost in Python, as discussed here and here.
When I ran the same dataset through the two implementations, I noticed that the results were different.
Code
import xgboost as xgb
from xgboost.sklearn import XGBRegressor
import xgboost
import pandas as pd
import numpy as np
from sklearn import datasets
boston_data = datasets.load_boston()
df = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)
df['target'] = pd.Series(boston_data.target)
Y = df["target"]
X = df.drop('target', axis=1)
#### Code using Native Impl for XGBoost
dtrain = xgboost.DMatrix(X, label=Y, missing=0.0)
params = {'max_depth': 3, 'learning_rate': 0.05, 'min_child_weight': 4, 'subsample': 0.8}
evallist = [(dtrain, 'eval'), (dtrain, 'train')]
model = xgboost.train(dtrain=dtrain, params=params, num_boost_round=200)
predictions = model.predict(dtrain)
#### Code using Sklearn Wrapper for XGBoost
model = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.05, min_child_weight=4, subsample=0.8)
# model = model.fit(X, Y, eval_set=[(X, Y), (X, Y)], eval_metric='rmse', verbose=True)
model = model.fit(X, Y)
predictions2 = model.predict(X)
print(np.absolute(predictions-predictions2).sum())
Absolute difference sum using sklearn boston dataset
62.687134
When I ran the same for other datasets like the sklearn diabetes dataset I observed that the difference was much smaller.
Absolute difference sum using sklearn diabetes dataset
0.0011711121
Make sure the random seeds are the same. For both approaches set the same seed:
param['seed'] = 123
EDIT: beyond that, there are a couple of other things to check. First, is n_estimators also 200? Are you imputing missing values with 0 in the second implementation as well? Are the other default values also the same? (For that last one I think yes, because it's a wrapper, but check the other two things.)
I had not set the "missing" parameter for the sklearn implementation. Once that was set, the values matched.
Also, as Noah pointed out, the sklearn wrapper has a few different default values which need to be matched in order to reproduce the results exactly.
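A hedged sketch of what that fix looks like: mirror the native API's missing=0.0 (and a fixed seed, with params['seed'] = 123 on the native side) in the sklearn wrapper, assuming the X and Y from the question above.

from xgboost.sklearn import XGBRegressor

# missing=0.0 and random_state=123 align the wrapper with the native call
model = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.05,
                     min_child_weight=4, subsample=0.8,
                     missing=0.0, random_state=123)
model.fit(X, Y)
predictions2 = model.predict(X)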
I want to check my loss values using MSE during the training process. How can I fetch the MSE loss value at each iteration? Thank you.
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error

# open_dataset, normalize_vector, split_dataset, get_features and get_fact
# are my own helper functions (not shown here)
dataset = open_dataset("forex.csv")
dataset_vector = [float(i[-1]) for i in dataset]
normalized_dataset_vector = normalize_vector(dataset_vector)
training_vector, validation_vector, testing_vector = split_dataset(training_size, validation_size, testing_size, normalized_dataset_vector)
training_features = get_features(training_vector)
training_fact = get_fact(training_vector)
validation_features = get_features(validation_vector)
validation_fact = get_fact(validation_vector)

model = MLPRegressor(activation=activation, alpha=alpha, hidden_layer_sizes=(neural_net_structure[1],), max_iter=number_of_iteration, random_state=seed)
model.fit(training_features, training_fact)
pred = model.predict(training_features)
err = mean_absolute_error(pred, validation_fact)
print(err)
There's no callbacks object like there is in Keras, so you'll have to loop over the fitting process yourself to get the error at each iteration. Something like the code below should work for you:
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import mean_absolute_error

# create some toy data
X = np.random.random((100, 5))
y = np.random.choice([0, 1], 100)

max_iter = 500
mlp = MLPClassifier(hidden_layer_sizes=(10, 10, 10), max_iter=max_iter)
errors = []
for i in range(max_iter):
    mlp.partial_fit(X, y, classes=[0, 1])
    pred = mlp.predict(X)
    errors.append(mean_absolute_error(y, pred))
This throws an annoying DeprecationWarning at the moment, but that can be ignored. The only problem with this method is that you have to manually keep track of whether or not your model has converged. Personally I would suggest using Keras instead of sklearn if you want to work with neural networks.
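Since the question is about MSE with MLPRegressor, here is a hedged adaptation of the same idea to the regression case (the toy data is illustrative; substitute your own training_features / training_fact):

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

X = np.random.random((100, 5))
y = np.random.random(100)

max_iter = 500
mlp = MLPRegressor(hidden_layer_sizes=(10,))
mse_per_iteration = []
for i in range(max_iter):
    mlp.partial_fit(X, y)                                   # one gradient pass per call
    mse_per_iteration.append(mean_squared_error(y, mlp.predict(X)))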
I'm a big fan of mlxtend's plot_decision_regions function (http://rasbt.github.io/mlxtend/#examples, https://stackoverflow.com/a/43298736/1870832).
It accepts an X (just two columns at a time), a y, and a (fitted) classifier clf object, and then provides a pretty awesome visualization of the relationship between model predictions, true y-values, and a pair of independent variables.
A couple of restrictions:
X and y have to be numpy arrays, and clf needs to have a predict() method. Fair enough. My problem is that, in my case, the classifier clf object I would like to visualize has already been fitted on a Pandas DataFrame...
import numpy as np
import pandas as pd
import xgboost as xgb
import matplotlib
matplotlib.use('Agg')
from mlxtend.plotting import plot_decision_regions
import matplotlib.pyplot as plt
# Create arbitrary dataset for example
df = pd.DataFrame({'Planned_End': np.random.uniform(low=-5, high=5, size=50),
                   'Actual_End': np.random.uniform(low=-1, high=1, size=50),
                   'Late': np.random.random_integers(low=0, high=2, size=50)})
# Fit a Classifier to the data
# This classifier is fit on the data as a Pandas DataFrame
X = df[['Planned_End', 'Actual_End']]
y = df['Late']
clf = xgb.XGBClassifier()
clf.fit(X, y)
So now when I try to use plot_decision_regions passing X/y as numpy arrays...
# Plot Decision Region using mlxtend's awesome plotting function
plot_decision_regions(X=X.values,
                      y=y.values,
                      clf=clf,
                      legend=2)
I (understandably) get an error that the model can't find the column names of the dataset it was trained on
ValueError: feature_names mismatch: ['Planned_End', 'Actual_End'] ['f0', 'f1']
expected Planned_End, Actual_End in input data
training data did not have the following fields: f1, f0
In my actual case, it would be a big deal to avoid training our model on Pandas DataFrames. Is there a way to still produce decision_regions plots for a classifier trained on a Pandas DataFrame?
Try to change:
X = df[['Planned_End', 'Actual_End']].values
y = df['Late'].values
and proceed to:
clf = xgb.XGBClassifier()
clf.fit(X, y)
plot_decision_regions(X=X,
                      y=y,
                      clf=clf,
                      legend=2)
OR fit & plot using X.values and y.values