Why does AdaBoost not work with DecisionTree?

I'm using sklearn 0.19.1 with DecisionTree and AdaBoost.
I have a DecisionTree classifier that works fine:
from sklearn import tree

clf = tree.DecisionTreeClassifier()
train_split_perc = 10000  # number of rows used for training
test_split_perc = pdf.shape[0] - train_split_perc
train_pdf_x = pdf[:train_split_perc]
train_pdf_y = YY[:train_split_perc]
test_pdf_x = pdf[-test_split_perc:]
test_pdf_y = YY[-test_split_perc:]
clf.fit(train_pdf_x, train_pdf_y)
pred2 = clf.predict(test_pdf_x)
But when I try to add AdaBoost, it throws an error on the predict function:
from sklearn import tree
from sklearn.ensemble import AdaBoostClassifier

treeclf = tree.DecisionTreeClassifier(max_depth=3)
adaclf = AdaBoostClassifier(base_estimator=treeclf, n_estimators=500, learning_rate=0.5)
train_split_perc = 10000
test_split_perc = pdf.shape[0] - train_split_perc
train_pdf_x = pdf[:train_split_perc]
train_pdf_y = YY[:train_split_perc]
test_pdf_x = pdf[-test_split_perc:]
test_pdf_y = YY[-test_split_perc:]
adaclf.fit(train_pdf_x, train_pdf_y)
pred2 = adaclf.predict(test_pdf_x)
Specifically the error says:
ValueError: bad input shape (236821, 6)
The array it seems to be pointing to is train_pdf_y, because it has a shape of (236821, 6), and I don't understand why.
Even from the description of AdaBoostClassifier in the docs, I can see that the classifier that actually uses the data is the DecisionTree:
An AdaBoost [1] classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.
But I'm still getting this error.
I've followed the code examples I found, including those on sklearn's website showing how to use AdaBoost, and I can't see what I'm doing wrong.
Any help is appreciated.

It looks like you are trying to perform multi-output classification, given the shape of y; otherwise it does not make sense to feed an n-dimensional y to adaclf.fit(train_pdf_x, train_pdf_y).
Assuming that is the case, the problem is that Scikit-Learn's DecisionTreeClassifier does support multi-output problems, that is, y inputs with shape [n_samples, n_outputs]. However, that is not the case for AdaBoostClassifier, given that, per the documentation, the labels must be:
y : array-like of shape = [n_samples]
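If multi-output really is the goal, one workaround (a sketch, not part of the original answer; the data below is random and purely hypothetical) is to fit one boosted ensemble per output column via sklearn.multioutput.MultiOutputClassifier:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.multioutput import MultiOutputClassifier

# Hypothetical stand-ins for pdf / YY: 4 features, 6 output columns as in the question.
X = np.random.rand(1000, 4)
Y = np.random.randint(0, 2, size=(1000, 6))

# base_estimator matches the sklearn 0.19 API used in the question
# (newer versions call this parameter estimator).
adaclf = AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=3),
    n_estimators=500,
    learning_rate=0.5,
)
# MultiOutputClassifier clones adaclf and fits one copy per column of Y.
multi_clf = MultiOutputClassifier(adaclf).fit(X, Y)
pred = multi_clf.predict(X)  # shape (1000, 6)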

Related

sklearn FeatureUnion & HalvingGridSearchCV & PCA ValueError: n_components=20 must be between 0 and min(n_samples, n_features)=15 with svd_solver='full'

I am trying to put a FeatureUnion of a PCA, IncrementalPCA and FastICA into a pipeline with a RandomForestClassifier, and to search for the optimal forest parameters with a HalvingGridSearchCV.
Excerpts from the code look like this:
from sklearn.decomposition import PCA, FastICA, IncrementalPCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa: exposes HalvingGridSearchCV
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline

for n_components in range(20, 80, 10):
    # all decomposers use the same parameters
    decomposer_pars = {
        'n_components': n_components,
        'whiten': True,
    }
    # define the list of decomposers
    pipe_preprocessing = [
        ('pca', PCA(**decomposer_pars)),
        ('fastica', FastICA(**decomposer_pars)),
        ('incpca', IncrementalPCA(**decomposer_pars)),
    ]
    # define clf
    clf = RandomForestClassifier(n_estimators=50, ...)
    # model
    pipe_model = Pipeline(steps=[
        ('rf', clf),
    ])
    # join to parallel feature union
    pipe_preprocessing = FeatureUnion(pipe_preprocessing)
    # full pipeline: preprocessing + model
    pipe = Pipeline(steps=[('preprocessing', pipe_preprocessing), *pipe_model.steps])
    # halving grid search with cross-validation
    sh = HalvingGridSearchCV(
        estimator=pipe,
        param_grid={
            'rf__min_weight_fraction_leaf': [0, 0.001, 0.01, 0.1],
            'rf__min_samples_split': [0.001, 0.01, 0.1],
            'rf__max_features': [3, 5],
            'rf__min_impurity_decrease': [0, 0.001, 0.01],
        },
        cv=cv,  # NOTE: see description below
        factor=2,
        scoring=make_scorer(accuracy_score),
        resource='n_samples',
        min_resources=375,
        max_resources=3000,
        aggressive_elimination=False,
        refit=False,
        return_train_score=False,
        n_jobs=n_jobs,
        verbose=0,
        error_score='raise',
    )
    res = sh.fit(X_train.values, y_train.reindex(X_train.index).values)
Notes:
The generator cv is custom written and generates training/validation folds of size 2794/279, respectively. It should yield n_splits=24 folds.
The overall training matrix X_train has a shape (69844, 80).
The classifier clf is simply an instance of RandomForestClassifier with n_estimators=50.
Execution of this code throws this error:
ValueError: n_components=20 must be between 0 and min(n_samples, n_features)=15 with svd_solver='full'
It's clear that the number of PCA components cannot be larger than either the number of features or the number of samples. What I don't understand is why I get this error. The training folds that I feed in are of shape (2794, 80), so the error above should only occur for n_components > min(n_samples, n_features) = 80. I do not understand why the data is interpreted as having min(n_samples, n_features)=15. When I set n_components<15, the code works.
I don't understand what I am doing wrong here. In my understanding, FeatureUnion applies the three decomposers independently to the input training data, each (internally) returning a part of the feature matrix with shape (2794, n_components). Thus, the transformed feature matrix would be (2794, 3*n_components), and the subsequent fitting of clf should work fine.
I tried increasing the size of the validation folds (although in theory this should not matter). It did not change anything.
Also, I increased the size of the train folds to 9978. Still the same error.
HOWEVER, increasing min_resources in HalvingGridSearchCV to 1000 does resolve the issue, and the code runs up to n_components=40. Then, again, the same error.
Obviously, min_resources is limiting n_samples. But the smallest value possible in my code above is 375, which should still result in folds of shape (375, 80), such that the error should not occur for any value of n_components that I scan over.
Thus, min_resources seems to work differently than I understand it to. How exactly does min_resources affect the size of the internal training folds?
Thank you!
EDIT
I manually performed the transformation with the FeatureUnion for all values of n_components, and it works fine. This suggests that the problem is caused by min_resources in HalvingGridSearchCV. I still have not found a solution for it.
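For reference, my understanding of sklearn's internals (an assumption worth verifying against the source) is that with resource='n_samples', HalvingGridSearchCV subsamples each cv training fold by the fraction n_resources / n_samples_total, rather than drawing an absolute number of rows. That would give 2794 * 375 / 69844 ≈ 15 rows per training fold at min_resources=375, and ≈ 40 rows at min_resources=1000, which matches both errors above. A minimal sketch (toy data and a plain KFold, both assumptions) for inspecting the per-iteration sample budget:
import numpy as np
from sklearn.experimental import enable_halving_search_cv  # noqa: exposes HalvingGridSearchCV
from sklearn.model_selection import HalvingGridSearchCV, KFold
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(3000, 80)
y = np.random.randint(0, 2, size=3000)

sh = HalvingGridSearchCV(
    estimator=RandomForestClassifier(n_estimators=10),
    param_grid={'min_samples_split': [2, 4]},
    factor=2,
    resource='n_samples',
    min_resources=375,
    cv=KFold(n_splits=5),
)
sh.fit(X, y)
# n_resources_ lists the sample budget of each halving iteration; the first
# entry equals min_resources. With 5-fold cv, each training fold then sees
# roughly 4/5 of that budget, not the full amount.
print(sh.n_resources_)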

fit and transform error on cross-validation and test data

I need help with the code here. I am trying to fit and transform the train data and then transform the cross-validation and test data, but when I do that I get this error: ValueError: X has 24155 features, but Normalizer is expecting 49041 features as input.
Can someone please help me solve this issue?
My code snippet:
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
X_train_price_norm = normalizer.fit_transform(X_train['price'].values.reshape(1,-1))
X_cv_price_norm = normalizer.transform(X_cv['price'].values.reshape(1,-1))
X_test_price_norm = normalizer.transform(X_test['price'].values.reshape(1,-1))
print("After vectorizations")
print(X_train_price_norm.shape, y_train.shape)
print(X_cv_price_norm.shape, y_cv.shape)
print(X_test_price_norm.shape, y_test.shape)
print("="*100)
The transform function expects a 2D array shaped (samples, features).
The error indicates that the second dimension of X_train['price'] and of X_cv['price'] / X_test['price'] are not the same.
As the code reflects, you have 1 feature (price) and many samples, so following the (samples, features) convention your input shape should be (n_samples, 1). Change the reshape to (-1, 1) instead of (1, -1):
X_train_price_norm = normalizer.fit_transform(X_train['price'].values.reshape(-1,1))
X_cv_price_norm = normalizer.transform(X_cv['price'].values.reshape(-1,1))
X_test_price_norm = normalizer.transform(X_test['price'].values.reshape(-1,1))
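To see where the original error comes from, here is a small shape demonstration using the sizes taken from the error message: with reshape(1, -1), each dataset becomes a single sample whose "feature count" is its length, so the train and cv arrays disagree.
import numpy as np

train_prices = np.arange(49041, dtype=float)  # sizes taken from the error message
cv_prices = np.arange(24155, dtype=float)

print(train_prices.reshape(1, -1).shape)  # (1, 49041): one sample, 49041 "features"
print(cv_prices.reshape(1, -1).shape)     # (1, 24155): mismatch at transform time
print(train_prices.reshape(-1, 1).shape)  # (49041, 1): many samples, one feature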

How can I solve this inverse_transform shape problem?

Here is my code:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()  # default feature_range is (0, 1)
dataset = scaler.fit_transform(dataset)
# ... build and train the model ...
predicted = model.predict(X_test)  # shape: (5, 1)
When I run predict = scaler.inverse_transform(predicted), a ValueError occurs:
ValueError: non-broadcastable output operand with shape (5,1) doesn't match the broadcast shape (5,2)
My model has 2 features as input.
I tried scaler.inverse_transform(predicted)[:, [0]] and reshaping in several directions, but I get the same ValueError.
How can I solve this problem? Any advice would be very much appreciated.
You are using inverse_transform in the wrong way: you applied fit_transform to your features, but you are applying inverse_transform to your predictions, which have a different shape, hence the error.
This is not the intended usage of inverse_transform; have a look at the docs for more:
inverse_transform(self, X)
Undo the scaling of X according to feature_range.
Parameters: X : array-like, shape [n_samples, n_features]
Input data that will be transformed. It cannot be sparse.
It is not clear from your post why you attempt to "transform back" your predictions; this only makes sense if you have already transformed your labels (your post does not say whether you have), and you want, say, to scale measures like MSE back to the original scale of the labels. In such a case, you should use a separate scaler for your labels - see my own answer in How to interpret MSE in Keras Regressor for details (the example there uses StandardScaler, but the rationale is the same).
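A minimal sketch of that idea (random stand-in data; the shapes mirror the question): fit one scaler on the 2-feature inputs and a separate scaler on the 1-column labels, then invert only the label scaler on the predictions.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.random.rand(100, 2)  # 2 input features, as in the question
y = np.random.rand(100, 1)  # 1 target column

x_scaler = MinMaxScaler()
y_scaler = MinMaxScaler()  # fitted only on the labels

X_scaled = x_scaler.fit_transform(X)
y_scaled = y_scaler.fit_transform(y)

# ... train a model on (X_scaled, y_scaled) and predict ...
predicted_scaled = y_scaled[:5]  # stand-in for model.predict(X_test), shape (5, 1)

# y_scaler was fitted on a single column, so inverting a (5, 1) array
# matches the expected shape and no broadcast error occurs.
predicted = y_scaler.inverse_transform(predicted_scaled)
print(predicted.shape)  # (5, 1)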

Any classifier in `sklearn` that can handle each `M[i,j]` as an array/tuple/or distribution?

I've been learning how to use machine-learning classifiers lately, and started wondering whether there is anything in sklearn that can take either an array or a distribution in each i,j cell as training data. Does such a classification algorithm exist in scikit-learn? If so, how is it used? If not, can someone provide some insight into algorithms that are known to handle this type of data?
Somebody asked a somewhat similar question: https://stats.stackexchange.com/questions/178109/linear-regression-problem-with-multi-dimensional-vectors-instead-of-scalar-value#comment443880_178109 but it was about regression and was never answered.
I tried using just a RandomForestClassifier, but it didn't like the arrays instead of scalars. If it's more of a Bayesian problem, I would be keen to use PyMC3, but I don't even know which algorithms to look at to start the process.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Create 2 distinguishable classes, 20 samples each,
# where each column has a fixed size for the array (e.g. `attr_0`=3, `attr_1`=5, `attr_2`=2)
class_A = np.vstack([[np.random.normal(loc=0, scale=1, size=3),
                      np.random.normal(loc=5, scale=1, size=5),
                      np.random.normal(loc=10, scale=1, size=2)] for k in range(20)])
class_B = np.vstack([[np.random.normal(loc=15, scale=1, size=3),
                      np.random.normal(loc=20, scale=1, size=5),
                      np.random.normal(loc=30, scale=1, size=2)] for k in range(20)])
# Merge them
Ar_data = np.concatenate([class_A, class_B], axis=0)
X = pd.DataFrame(Ar_data, columns=["attr_0", "attr_1", "attr_2"])
# Create target vector
y = np.array(20 * [0] + 20 * [1])
# Test data
X_test = [np.random.normal(loc=0, scale=1, size=3),
          np.random.normal(loc=5, scale=1, size=5),
          np.random.normal(loc=10, scale=1, size=2)]
X_test
# [array([-0.15510844, 0.04567395, -0.66192602]),
#  array([ 4.5412568 , 4.32526163, 4.56558114, 5.48178697, 5.2559264 ]),
#  array([ 9.17293292, 10.19746434])]
I tried fitting a RandomForestClassifier but it didn't work :(
Mod_rf = RandomForestClassifier()
Mod_rf.fit(X,y)
# ValueError: setting an array element with a sequence.
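One common workaround (a sketch, not a built-in sklearn feature): flatten each per-cell array into individual scalar columns, since tree ensembles only accept a 2-D numeric matrix. The attribute sizes (3, 5, 2) are taken from the question.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def make_row(locs, sizes):
    # Concatenate the per-attribute arrays into one flat feature vector.
    return np.concatenate([rng.normal(loc=m, scale=1, size=s)
                           for m, s in zip(locs, sizes)])

sizes = (3, 5, 2)  # lengths of attr_0, attr_1, attr_2
class_A = np.vstack([make_row((0, 5, 10), sizes) for _ in range(20)])
class_B = np.vstack([make_row((15, 20, 30), sizes) for _ in range(20)])

X = np.concatenate([class_A, class_B], axis=0)  # shape (40, 10): plain scalars
y = np.array(20 * [0] + 20 * [1])

Mod_rf = RandomForestClassifier().fit(X, y)
X_test = make_row((0, 5, 10), sizes).reshape(1, -1)
print(Mod_rf.predict(X_test))  # expected: [0]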

Error in model prediction using hmmlearn

Hi, I have a dataframe test, and I am trying to predict with a Gaussian HMM from hmmlearn.
When I do this:
y = model.predict(test)
y
I get the HMM working fine, producing an array of states.
However, if I do this:
for i in range(0, len(test)):
    y = model.predict(test[:i])
all I get is y being set to 1.
Can anyone help?
UPDATE
Here is the code that does work, iterating through. The training set was rows 0-249:
for i in range(251, len(X)):
    test = X[:i]
    y = model.predict(test)
    print(y[-1])
An HMM models sequences of observations. If you feed a single observation into predict (which does Viterbi decoding by default), you essentially reduce the prediction to
(model.startprob_ * model.predict_proba(test[i:i + 1])).argmax()
which can be dominated by startprob_, e.g. if startprob_ = [1e-8, 1 - 1e-8]. This could explain the all-ones behaviour you're seeing.
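A toy sketch of that effect (all parameters hypothetical), following hmmlearn's pattern of setting model parameters by hand:
import numpy as np
from hmmlearn import hmm

# 2-state Gaussian HMM with a heavily skewed start distribution.
model = hmm.GaussianHMM(n_components=2)
model.startprob_ = np.array([1e-8, 1 - 1e-8])
model.transmat_ = np.array([[0.9, 0.1],
                            [0.1, 0.9]])
model.means_ = np.array([[0.0], [5.0]])
model.covars_ = np.array([[1.0], [1.0]])

obs = np.array([[0.0]])  # a single observation near state 0's mean
print(model.predict(obs))  # likely [1]: startprob_ swamps the emission term

seq = np.array([[0.0], [0.2], [-0.1]])  # a longer sequence gives Viterbi context
print(model.predict(seq))  # likely [1, 0, 0]: the chain escapes the skewed start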
