MultiOutputClassifier only returns learned data - python

As the title says, I am testing Python's MultiOutputClassifier to solve a problem that requires predicting coordinates (x, y) as output, given 3 inputs, and it only returns the closest learned value as its prediction, not an 'extrapolated' one.
My sample code is as follows:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

train_data = np.array([
    [-30, -60, -90,  0,  0],
    [-50, -50, -50, 10,  0],
    [-90, -60, -30, 20,  0],
    [-50, -50, -95,  0, 10],
    [-60, -30, -60, 10, 10],
    [-95, -50, -50, 20, 10],
])

# These I just made up
test_data_x = np.array([
    [-35, -50, -90],
])

x = train_data[:, :3]
y = train_data[:, 3:]

forest = RandomForestClassifier(n_estimators=100, random_state=1)
classifier = MultiOutputClassifier(forest, n_jobs=-1)
classifier.fit(x, y)

print(classifier.predict(test_data_x))
This returns (0, 10), but I would expect that for the given inputs the output should be something like (5, 5), somewhere between two of the learned values.
Clearly I am doing something wrong, or I have misunderstood something. Any help with this issue? Is MultiOutputClassifier simply not the right tool?

The problem here is that a (random forest) classifier won't extrapolate: it can only output values it has already seen during training. You probably want to use a regressor.
Replacing "Classifier" with "Regressor" in your code yields (0.8, 5.8) as output, which is much closer to what you expected.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

train_data = np.array([
    [-30, -60, -90,  0,  0],
    [-50, -50, -50, 10,  0],
    [-90, -60, -30, 20,  0],
    [-50, -50, -95,  0, 10],
    [-60, -30, -60, 10, 10],
    [-95, -50, -50, 20, 10],
])

test_data_x = np.array([
    [-35, -50, -90],
])

x = train_data[:, :3]
y = train_data[:, 3:]

forest = RandomForestRegressor(n_estimators=100, random_state=1)
classifier = MultiOutputRegressor(forest, n_jobs=-1)
classifier.fit(x, y)

print(classifier.predict(test_data_x))
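As a side note, RandomForestRegressor supports multi-output targets natively, so the MultiOutputRegressor wrapper is optional here. A minimal sketch reusing the arrays above:

forest = RandomForestRegressor(n_estimators=100, random_state=1)
forest.fit(x, y)  # y may be 2-D: one column per output
print(forest.predict(test_data_x))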

Related

Train, test score discrepancy with data size

I'm trying to apply ML to atomic structures using descriptors. My problem is that I get very different score values depending on the data size I use. I suspect that something is wrong with my model; any suggestions would be appreciated. I used the dataset from this paper (Dataset MoS2(single)).
Here is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import ase
from ase.io import read
from dscribe.descriptors import SOAP
from dscribe.descriptors import CoulombMatrix
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler

materials = read('structures.xyz', index=':')
materials = materials[:5000]

energies = pd.read_csv('Energy.csv')
energies = np.array(energies['b'])
energies = energies[:5000]

species = ["H", 'Mo', 'S']
rcut = 8.0
nmax = 1
lmax = 1

# Setting up the SOAP descriptor
soap = SOAP(
    species=species,
    periodic=False,
    rcut=rcut,
    nmax=nmax,
    lmax=lmax,
)

coulomb_matrices = soap.create(materials, positions=[[51]] * len(materials))
nsamples, nx, ny = coulomb_matrices.shape
d2_train_dataset = coulomb_matrices.reshape((nsamples, nx * ny))

df = pd.DataFrame(d2_train_dataset)
df['target'] = energies

X = df.iloc[:, 0:12].values
y = df.iloc[:, 12:].values
st_x = StandardScaler()
st_y = StandardScaler()
X = st_x.fit_transform(X)
y = st_y.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y)

# krr = GridSearchCV(
#     KernelRidge(kernel="rbf", gamma=0.1),
#     param_grid={"alpha": [1e0, 0.1, 1e-2, 1e-3], "gamma": np.logspace(-2, 2, 5)},
# )

svr = GridSearchCV(
    SVR(kernel="rbf", gamma=0.1),
    param_grid={"C": [1e0, 1e1, 1e2, 1e3], "gamma": np.logspace(-2, 2, 5)},
)
svr = svr.fit(X_train, y_train.ravel())

print("Training set score: {:.4f}".format(svr.score(X_train, y_train)))
print("Test set score: {:.4f}".format(svr.score(X_test, y_test)))
and the scores:
Training set score: 0.0414
Test set score: 0.9126
I don't have a full answer to your problem, as recreating it would be very cumbersome, but here are some things to check:
a) You are training on 5 cross-validation folds (the default). First, check the results of all parameter combinations right after the fitting process with svr.best_score_ (or in more detail with the svr.cv_results_ dict) and see what mean score your folds actually produced. If the score really is as low as 0.04 (I assume higher is better, as is usual for these scores), taking the reciprocal prediction would actually be really accurate: if you know you're always wrong, it's really easy to be right. ;D
b) You could go ahead and just use svr.best_params_ to train again on the whole X_train set instead of the folds (this can also be achieved with the refit option of GridSearchCV and RandomizedSearchCV), and then check against the test set again. The actual error could also lie here: the documentation for the score method of GridSearchCV reads: "Return the score on the given data, if the estimator has been refit." Make sure that refit is enabled in your grid search; maybe that works? (A sketch of both checks follows below.) Sorry, your code was too cumbersome to replicate quickly, so I didn't check it myself.
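A minimal sketch of both checks, assuming the fitted svr grid search from the question:

import pandas as pd
from sklearn.svm import SVR

# a) Inspect the mean cross-validated score of every parameter combination
print(svr.best_score_)  # mean CV score of the best combination
cv_results = pd.DataFrame(svr.cv_results_)
print(cv_results[['params', 'mean_test_score', 'std_test_score']])

# b) Refit the best parameter combination on the whole training set
# (with refit=True, the default, svr.best_estimator_ already holds this model)
best_svr = SVR(kernel='rbf', **svr.best_params_)
best_svr.fit(X_train, y_train.ravel())
print('Test set score: {:.4f}'.format(best_svr.score(X_test, y_test)))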

How to interpret base_value of multi-class classification problem when using SHAP?

I am using the shap library for ML interpretability, to better understand the k-means segmentation algorithm clusters. In a nutshell I make some blobs, use k-means to cluster them, and then take the clusters as labels and use xgboost to try to predict them. I have 5 clusters, so it is a single-label multi-class classification problem.
import numpy as np
from sklearn.datasets import make_blobs
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import xgboost as xgb
import shap

X, y = make_blobs(n_samples=500, centers=5, n_features=5, random_state=0)
data = pd.DataFrame(np.concatenate((X, y.reshape(500, 1)), axis=1),
                    columns=['var_1', 'var_2', 'var_3', 'var_4', 'var_5', 'cluster_id'])
data['cluster_id'] = data['cluster_id'].astype(int).astype(str)

scaler = StandardScaler()
scaled_features = scaler.fit_transform(data.iloc[:, :-1])

kmeans = KMeans(n_clusters=5, random_state=0)
kmeans.fit(scaled_features)
data['predicted_cluster_id'] = kmeans.labels_.astype(int).astype(str)

clf = xgb.XGBClassifier()
clf.fit(scaled_features, data['predicted_cluster_id'])

shap.initjs()
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(scaled_features[0].reshape(1, -1))
shap.force_plot(explainer.expected_value[0], shap_values[0], link='logit')  # repeat changing 0 for i in range(0, 5)
The resulting force plots (not shown here) make sense, as the class is '3'. But why this base_value; shouldn't it be 1/5? I asked a similar question a while ago, but this time I have already set link='logit'.
link="logit" does not seem right for multiclass, as it's only suitable for binary output. This is why you do not see probabilities summing up to 1.
Let's streamline your code:
import numpy as np
from sklearn.datasets import make_blobs
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import xgboost as xgb
import shap
from scipy.special import softmax, logit, expit

np.random.seed(42)
X, y_true = make_blobs(n_samples=500, centers=5, n_features=3, random_state=0)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=5)
y_predicted = kmeans.fit_predict(X_scaled)

clf = xgb.XGBClassifier()
clf.fit(X_scaled, y_predicted)

shap.initjs()
Then, what you see as expected values in:
explainer = shap.TreeExplainer(clf)
explainer.expected_value
array([0.67111245, 0.60223354, 0.53357694, 0.50821152, 0.50145331])
are base scores in raw space.
The multi-class raw scores can be converted to probabilities with softmax:
softmax(explainer.expected_value)
array([0.22229282, 0.20749694, 0.19372895, 0.18887673, 0.18760457])
shap.force_plot(..., link="logit") doesn't make sense for multiclass, and it seems impossible to switch from raw to probability space and still maintain additivity (because softmax(x+y) ≠ softmax(x) + softmax(y)).
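A quick numeric check of that non-additivity (illustrative values only):

import numpy as np
from scipy.special import softmax

a = np.array([0.5, 1.0])
b = np.array([1.5, -0.5])
print(softmax(a + b))           # ~[0.8176 0.1824] -- a valid distribution
print(softmax(a) + softmax(b))  # ~[1.2583 0.7417] -- not even a distribution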
Should you wish to analyze your data in probability space, try KernelExplainer:
from shap import KernelExplainer
masker = shap.maskers.Independent(X_scaled, 100)
ke = KernelExplainer(clf.predict_proba, data=masker.data)
ke.expected_value
# array([0.18976762, 0.1900516 , 0.20042894, 0.19995041, 0.21980143])
shap_values = ke.shap_values(masker.data)
shap.force_plot(ke.expected_value[0], shap_values[0][0])
or a waterfall plot:
from shap import Explanation
shap.waterfall_plot(Explanation(shap_values[0][0],ke.expected_value[0]))
which are now additive for SHAP values in probability space and align well with both the base probabilities (see above) and the predicted probabilities for the 0th data point:
clf.predict_proba(masker.data[0].reshape(1,-1))
array([[2.2844513e-04, 8.1287889e-04, 6.5225776e-04, 9.9737883e-01,
9.2762709e-04]], dtype=float32)

Why is my MSE so high when the differences between test and prediction values are so small?

In Python, I have built a small multiple linear regression model to explain house prices in areas based on other variables (all of which are percentages multiplied by 100), such as the percentage of people with bachelor degrees in an area and the percentage of people who work from home. I have done this in R and it works fine, but I am new to Python ML. I have shown the output of y_pred = regressor.predict(X_test) and the MSE I get. I have included a sample of my data, where avgincome, PctSingleDetached and PctDrivetoWork are X, and AvgHousingPrice is Y.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.impute import SimpleImputer
sample data:
avgincome PctSingleDetached PctDrivetoWork AvgHousingPrice
0 44388.0 61.528497 81.151832 448954
1 40650.0 54.372197 77.882798 349758
2 43350.0 68.393782 79.553265 428740
X = hamiltondata.iloc[:, :-1].values
Y = hamiltondata.iloc[:, -1].values

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  # The imputer finds missing values and replaces them with the column mean
imputer.fit(X[:, :])  # fit connects the imputer to our matrix of features
X[:, :] = imputer.transform(X[:, :])

# from sklearn.compose import ColumnTransformer
# from sklearn.preprocessing import OneHotEncoder
# ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
# X = np.array(ct.fit_transform(X))

print(X)
print(Y)

## Splitting into training and testing ##
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

### Feature Scaling ###
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()  # this performs standardization; see the data standardization formula
X_train[:, 0:] = sc.fit_transform(X_train[:, 0:])  # fit computes the scaling parameters, transform applies them; this method does both
X_test[:, 0:] = sc.transform(X_test[:, 0:])
print(X_train)
print(X_test)

## Training ##
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)

### Predicting Test Set results ###
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)  # Display numerical values with only 2 digits after the decimal point
print(np.concatenate((y_pred.reshape(len(y_pred), 1), Y_test.reshape(len(Y_test), 1)), axis=1))  # stack predictions next to actual values

from sklearn.metrics import mean_squared_error
mse = mean_squared_error(Y_test, y_pred)
print(mse)
OUTPUT:
[[489066.76 300334. ]
[227458.2 200352. ]
[928249.59 946729. ]
[339032.27 350116. ]
[689668.21 600322. ]
[489179.58 577936. ]]
...
...
MSE = 2375985640.8102403
You can calculate the MSE yourself to check whether something is wrong. In my opinion the obtained result is coherent. Anyway, I built a simple my_mse function to check the result output by sklearn, with your example data:
from sklearn.metrics import mean_squared_error

# Each row is [predicted value, actual value], taken from the question's output
list_ = [[489066.76, 300334.],
         [227458.2,  200352.],
         [928249.59, 946729.],
         [339032.27, 350116.],
         [689668.21, 600322.],
         [489179.58, 577936.]]

y_pred = [row[0] for row in list_]
y_true = [row[1] for row in list_]

mse = mean_squared_error(y_true, y_pred)
print(mse)
# 8779930962.14985

def my_mse(y_true, y_pred):
    diff = 0
    for couple in zip(y_true, y_pred):
        diff += pow(couple[0] - couple[1], 2)
    return diff / len(y_true)

print(my_mse(y_true, y_pred))
# 8779930962.14985
Remember that the MSE is the mean squared error: each error is squared before the sum, which is why the value looks so large even when individual predictions seem close. Taking the square root gives a number on the scale of the target, as in the sketch below.
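A minimal sketch, reusing the mse computed above:

import numpy as np

rmse = np.sqrt(mse)  # root mean squared error, in the same units as the target
print(rmse)          # ~93701, i.e. a typical prediction is off by roughly 94k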
If you are asking whether your model is bad or good, that depends on the main objective. Anyway, I think your model is performing poorly because it is a linear model; a model with more complexity could handle the problem and output better results.

Using sklearn voting ensemble with partial fit

Can someone please tell me how to use ensembles in sklearn with partial fit?
I don't want to retrain my model.
Alternatively, can we pass pre-trained models for ensembling?
I have seen that the voting classifier, for example, does not support training using partial fit.
The Mlxtend library has an implementation of EnsembleVoteClassifier which allows you to pass in pre-fitted models. For example, if you have three pre-trained models clf1, clf2, clf3, the following code would work:
from mlxtend.classifier import EnsembleVoteClassifier
import copy
eclf = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3], weights=[1, 1, 1], fit_base_estimators=False)
When set to False, the fit_base_estimators argument of EnsembleVoteClassifier ensures that the classifiers are not refit.
In general, when looking for more advanced features that scikit-learn does not provide, look to mlxtend as a first point of reference.
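A quick usage sketch (assuming clf1, clf2, clf3 have already been fitted on X and y; note that fit must still be called once so the ensemble can record the class labels, as the last answer below also points out):

eclf = eclf.fit(X, y)  # with fit_base_estimators=False, the base models are not retrained
print(eclf.predict(X))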
Workaround:
VotingClassifier checks that estimators_ is set in order to understand whether it is fitted, and it uses the estimators in the estimators_ list for prediction.
If you have pre-trained classifiers, you can put them in estimators_ directly, as in the code below.
However, it also uses a LabelEncoder, so it assumes labels are like 0, 1, 2, ..., and you also need to set le_ and classes_ (see below).
from sklearn.ensemble import VotingClassifier
from sklearn.preprocessing import LabelEncoder

clf_list = [clf1, clf2, clf3]

eclf = VotingClassifier(estimators=[('1', clf1), ('2', clf2), ('3', clf3)], voting='soft')
eclf.estimators_ = clf_list
eclf.le_ = LabelEncoder().fit(y)
eclf.classes_ = eclf.le_.classes_

# Now it will work without calling fit
eclf.predict(X)
Unfortunately, this is currently not possible in scikit-learn's VotingClassifier.
But you can use http://sebastianraschka.com/Articles/2014_ensemble_classifier.html (from which VotingClassifier was implemented) to try and implement your own voting classifier that can take pre-fitted models.
We can also look at the source code there and modify it for our use:
from sklearn.preprocessing import LabelEncoder
import numpy as np

le_ = LabelEncoder()
# When you do partial_fit, the first fit of any classifier requires
# all available labels (output classes); you should supply all of the
# same labels here in y.
le_.fit(y)

# Fill the list below with fitted or partially fitted estimators
clf_list = [clf1, clf2, clf3, ... ]

# Fill weights -> array-like, shape = [n_classifiers], or None
weights = [clf1_wgt, clf2_wgt, ... ]
weights = None

# For hard voting:
pred = np.asarray([clf.predict(X) for clf in clf_list]).T
pred = np.apply_along_axis(lambda x:
                           np.argmax(np.bincount(x, weights=weights)),
                           axis=1,
                           arr=pred.astype('int'))

# For soft voting:
pred = np.asarray([clf.predict_proba(X) for clf in clf_list])
pred = np.average(pred, axis=0, weights=weights)
pred = np.argmax(pred, axis=1)

# Finally, reverse-transform the labels for correct output
# (pred already holds class indices at this point):
pred = le_.inverse_transform(pred)
It's not too hard to implement the voting. Here's my implementation:
import numpy as np

class VotingClassifier(object):
    """Implements a voting classifier for pre-trained classifiers"""

    def __init__(self, estimators):
        self.estimators = estimators

    def predict(self, X):
        # get values
        Y = np.zeros([X.shape[0], len(self.estimators)], dtype=int)
        for i, clf in enumerate(self.estimators):
            Y[:, i] = clf.predict(X)
        # apply voting
        y = np.zeros(X.shape[0])
        for i in range(X.shape[0]):
            y[i] = np.argmax(np.bincount(Y[i, :]))
        return y
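A hypothetical usage sketch, assuming clf1, clf2 and clf3 were trained elsewhere:

ensemble = VotingClassifier([clf1, clf2, clf3])
y_pred = ensemble.predict(X_test)  # hard majority vote across the three models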
Although the Mlxtend implementation works, you still need to call the fit function of the EnsembleVoteClassifier. The fit function doesn't seem to really modify any parameters; rather, it checks the possible label values. In the example below, you have to give eclf2.fit an array containing all the possible values that appear in the original y (in this case 1, 2); the X argument doesn't matter.
import numpy as np
from mlxtend.classifier import EnsembleVoteClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
import copy

clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

for clf in (clf1, clf2, clf3):
    clf.fit(X, y)

eclf2 = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3], voting="soft", refit=False)
eclf2.fit(None, np.array([1, 2]))
print(eclf2.predict(X))

Scikit Learn: How can I set the SVM Output range in regression?

I have some input data in the range [-1, 1] and output data in the range [0, 1]. When I use SVM regression to predict the output, the predicted values are between -1 and 1. What am I missing? The code is:
from sklearn import svm

svr = svm.SVR(C=0.1, gamma=0.01, kernel='rbf')
y_rbf = svr.fit(TrainingIn, TrainingOut)
y_hat = svr.predict(TestIn)
Thank you!
Given the information here, it's impossible to reconstruct your problem. I'm pretty sure, though, that it has to do with the preprocessing/scaling of your data. An example snippet to get SVR running might look like this (feel free to adapt it to your needs):
from sklearn.svm import SVR
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# replace this part with your data, e.g. TrainingIn/TrainingOut
boston = load_boston()
X, y = boston.data, boston.target
X1, X2, y1, y2 = train_test_split(X, y)

svr = SVR(C=80)
scaler = StandardScaler()

svr.fit(scaler.fit_transform(X1), y1)
y_pred = svr.predict(scaler.transform(X2))
print(mean_squared_error(y2, y_pred))
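The same idea can be extended to the target. A sketch (my own addition, not part of the snippet above) using sklearn's TransformedTargetRegressor, so the target is standardized during fitting and predictions are automatically mapped back to the original scale:

from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# X is scaled inside the pipeline; y is standardized by the transformer
# and predict() inverse-transforms back to the original target range.
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(C=80)),
    transformer=StandardScaler(),
)
model.fit(X1, y1)
y_pred = model.predict(X2)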
I am keeping this answer only for future reference (it does not directly answer PSan's question).
It's important to note that (perhaps contrary to its name) sklearn.svm.SVR can be used as both a predictor and a classifier. If fed labeled data, predict will output values in {-1, +1}.
