I want to visualize a decision tree classifier that I have applied to my data, as a PDF or PNG file. I tried visualizing it with graphviz via the code below:
from sklearn import tree
from sklearn.model_selection import train_test_split
from io import StringIO  # sklearn.externals.six has been removed; io.StringIO works the same
import pydot

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.30, random_state=1)
clf = tree.DecisionTreeClassifier(max_depth=43)
clf = clf.fit(X_train, y_train)

dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
graph[0].write_pdf("tree.pdf")
But the procedure cannot be completed. Once I got an out-of-memory error, and the second time I got the error "dot stop working". Because of these issues, I wanted to at least get an idea of the tree by knowing which children are left and which are right. Thanks for any response and help.
If you are getting an error like the one below:
Program terminated with status: -11. stderr follows: dot: graph is too large for cairo-renderer bitmaps.
then, to understand the tree, you can print it in a text format on screen instead:
from sklearn.tree import export_text
r = export_text(clf, feature_names=list(df_X_train.columns))
print(r)
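Note that a tree grown with max_depth=43 can still produce an enormous text dump. export_text also accepts a max_depth argument (defaulting to 10, if I remember correctly) that truncates the printout; a minimal sketch:
from sklearn.tree import export_text
# Only print the first 5 levels; deeper branches are marked as truncated.
r = export_text(clf, feature_names=list(df_X_train.columns), max_depth=5)
print(r)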
I'm trying to save all models generated by autosklearn, but I can only get the best model.
import sklearn.datasets
import sklearn.metrics
import autosklearn.classification
# Load data
X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = \
    sklearn.model_selection.train_test_split(X, y, random_state=1)
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    tmp_folder='/tmp/autosklearn_classification_example_tmp',
    output_folder='/tmp/autosklearn_classification_example_out',
)
automl.fit(X_train, y_train, dataset_name='breast_cancer')
# Show all models
print(automl.show_models())
# Here it uses the best model
predictions = automl.predict(X_test)
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, predictions))
Is show_models() printing the models used in the best ensemble?
Is there any way to get other models?
Auto-sklearn pipelines can be inspected in several ways; one of them is PipelineProfiler, which lets you examine all of the best-performing pipelines.
To begin, install PipelineProfiler on your machine:
pip install pipelineprofiler
Then write the following code:
import PipelineProfiler
# automl is the AutoSklearnClassifier object that has already been fitted above.
profiler_data = PipelineProfiler.import_autosklearn(automl)
PipelineProfiler.plot_pipeline_matrix(profiler_data)
Your output should be a plot that shows all the pipelines.
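If you just want programmatic access to the ensemble members rather than a plot, recent auto-sklearn versions also expose get_models_with_weights(); a minimal sketch, assuming a fitted automl object as above (check your installed version):
# Each entry should be a (weight, pipeline) pair for one member of the final ensemble.
for weight, model in automl.get_models_with_weights():
    print(weight, model)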
I am new to Sklearn, and I am trying to combine KNN, Decision Tree, SVM, and Gaussian NB for BaggingClassifier.
Part of my code looks like this:
best_KNN = KNeighborsClassifier(n_neighbors=5, p=1)
best_KNN.fit(X_train, y_train)
majority_voting = VotingClassifier(estimators=[('KNN', best_KNN), ('DT', best_DT), ('SVM', best_SVM), ('gaussian', gaussian_NB)], voting='hard')
majority_voting.fit(X_train, y_train)
bagging = BaggingClassifier(base_estimator=majority_voting)
bagging.fit(X_train, y_train)
But this causes an error saying:
TypeError: Underlying estimator KNeighborsClassifier does not support sample weights.
The "bagging" part worked fine if I remove KNN.
Does anyone have any idea to solve this issue? Thank you for your time.
In BaggingClassifier, when the base estimator's fit accepts a sample_weight parameter, the bootstrap weights are passed through it. VotingClassifier.fit does accept sample_weight, but it then forwards the weights to every estimator in the ensemble, and KNeighborsClassifier does not support them, hence the error.
You can list all the available classifiers whose fit does support sample_weight like this:
import inspect
from sklearn.utils import all_estimators  # sklearn.utils.testing has been removed

for name, clf in all_estimators(type_filter='classifier'):
    if 'sample_weight' in inspect.signature(clf.fit).parameters:
        print(name)
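If you really want KNN inside the bagged ensemble, one hypothetical workaround is a thin subclass that accepts sample_weight in fit but ignores it, so the fit-parameter check passes. Note that this changes the semantics: the bootstrap weights then have no effect on the KNN member.
from sklearn.neighbors import KNeighborsClassifier

class SampleWeightTolerantKNN(KNeighborsClassifier):
    # Hypothetical wrapper: fit() accepts sample_weight so VotingClassifier /
    # BaggingClassifier can pass weights through, but ignores them, since KNN
    # has no notion of weighted training samples.
    def fit(self, X, y, sample_weight=None):
        return super().fit(X, y)

best_KNN = SampleWeightTolerantKNN(n_neighbors=5, p=1)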
I have a problem creating and displaying a decision tree in a Jupyter Notebook using Python.
My code is as below:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz

X = data.drop(["Risk"], axis=1)
y = data["Risk"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

klasyfikator = DecisionTreeClassifier(criterion="gini", random_state=0, max_depth=4, min_samples_leaf=1)
klasyfikator.fit(X=X, y=y)

# Use a new name so the original `data` DataFrame is not overwritten
dot_data = export_graphviz(klasyfikator, out_file=None, feature_names=X.columns, class_names=["0", "1"],
                           filled=True, rounded=True, special_characters=True)
graph = graphviz.Source(dot_data)
graph
Generally, this decision tree concerns credit risk research: 0 - will not pay, 1 - will pay.
When I use code above, I have error like this:
ExecutableNotFound: failed to execute ['dot', '-Tsvg'], make sure the Graphviz executables are on your systems' PATH
I have already tried many solutions from StackOverflow, for example:
pip install graphviz
conda install graphviz
I downloaded Graphviz from http://www.graphviz.org/download/
I added to the PATH environment variable:
C:\Program Files (x86)\Graphviz2.38\bin
And the error described above still appears. What can I do? What should I do? Please help me, because I'm losing hope of being able to draw this tree. Thank you!
Moreover, when I added it by using this code:
import os
os.environ["PATH"] += os.pathsep + 'C:\Program Files (x86)\Graphviz2.38\bin'
I now have in PATH something like this: C:\\Program Files (x86)\\Graphviz2.38\x08in, which is not the same. What can I do?
With the latest version of sklearn, you can plot the decision tree directly, without graphviz.
Use:
from sklearn.tree import plot_tree
plot_tree(klasyfikator)
Read more in the sklearn documentation for plot_tree.
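As for the mangled PATH entry: in a normal Python string literal, \b is the backspace escape \x08, which is why \bin became \x08in. A raw string (or forward slashes) keeps the backslashes literal:
import os
# r'...' prevents '\b' from being interpreted as a backspace escape
os.environ["PATH"] += os.pathsep + r'C:\Program Files (x86)\Graphviz2.38\bin'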
I've read through a few pages, but I need someone to help explain how to make this work.
I'm using TPOTRegressor() to get an optimal pipeline, but from there I would love to be able to plot the .feature_importances_ of the pipeline it returns:
best_model = TPOTRegressor(cv=folds, generations=2, population_size=10, verbosity=2, random_state=seed) #memory='./PipelineCache', memory='auto',
best_model.fit(X_train, Y_train)
feature_importance = best_model.fitted_pipeline_.steps[-1][1].feature_importances_
I saw this kind of set up from a now closed issue on Github, but currently I get the error:
Best pipeline: LassoLarsCV(input_matrix, normalize=True)
Traceback (most recent call last):
File "main2.py", line 313, in <module>
feature_importance = best_model.fitted_pipeline_.steps[-1][1].feature_importances_
AttributeError: 'LassoLarsCV' object has no attribute 'feature_importances_'
So, how would I get these feature importances from the optimal pipeline, regardless of which one it lands on? Or is this even possible? Or does someone have a better way of going about trying to plot feature importances from a TPOT run?
Thanks!
UPDATE
For clarification, what is meant by feature importance is a measure of how important each feature (the X's) of your dataset is in determining the predicted (Y) label, using a bar chart to plot each feature's level of importance. TPOT doesn't do this directly (I don't think), so I was thinking I'd grab the pipeline it came up with, re-run it on the training data, and then somehow use .feature_importances_ to graph the feature importances, since these are all sklearn regressors I'm using?
Very nice question.
You just need to fit the best model again in order to get the feature importances.
best_model.fit(X_train, Y_train)
extracted_best_model = best_model.fitted_pipeline_.steps[-1][1]
The last line returns the best model based on the CV.
You can then use:
extracted_best_model.fit(X_train, Y_train)
to train it. If the best model has the desired attribute, you will be able to access it after calling extracted_best_model.fit(X_train, Y_train).
More details (in the comments) and a toy example:
from tpot import TPOTRegressor
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

digits = load_digits()
X, y = digits.data, digits.target  # keep X, y around for the final fit below
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.75, test_size=0.25)
# Reduce the training set for time's sake
X_train = X_train[:100, :]
y_train = y_train[:100]
# Fit the TPOT pipeline
tpot = TPOTRegressor(cv=2, generations=5, population_size=50, verbosity=2)
tpot.fit(X_train, y_train)
# Get the best model
extracted_best_model = tpot.fitted_pipeline_.steps[-1][1]
print(extracted_best_model)
AdaBoostRegressor(base_estimator=None, learning_rate=0.5, loss='square',
n_estimators=100, random_state=None)
# Train the extracted_best_model using THE WHOLE DATASET.
# You need to use the whole dataset in order to get feature importances
# covering all of your data.
extracted_best_model.fit(X, y)  # X, y: IMPORTANT
# Access its feature importances
extracted_best_model.feature_importances_
# Plot them using a bar plot
# (Fitting on X_train, y_train instead, for time's sake, also works, but then
# the importances only reflect what the model saw in X_train.)
positions = range(extracted_best_model.feature_importances_.shape[0])
plt.bar(positions, extracted_best_model.feature_importances_)
plt.show()
IMPORTANT NOTE: In the above example, the best model found by the pipeline was AdaBoostRegressor(base_estimator=None, learning_rate=0.5, loss='square'). This model indeed has the feature_importances_ attribute.
In the case where the best model does not have a feature_importances_ attribute, the exact same code will not work. You will need to read the docs and see the attributes of each returned best model. E.g., if the best model were LassoCV, you would use the coef_ attribute instead.
Output: a bar plot of the feature importances.
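For the linear case mentioned above, a hedged sketch, assuming the extracted model is something like LassoCV with a coef_ attribute rather than feature_importances_:
import numpy as np
# Use absolute coefficients as a rough importance proxy for linear models.
importances = np.abs(extracted_best_model.coef_)
plt.bar(range(importances.shape[0]), importances)
plt.show()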
I am trying to dump the classifier and its parameters into a table as such:
from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf = DecisionTreeClassifier().fit(X, y)
When I print clf I get the following:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
How can I dump this into a .txt, or better, into a table that contains this info under columns? For example, under the Algorithm Name column it would say C4.5, etc.
I tried using from sklearn.externals import joblib and did joblib.dump(clf, "outputfile.txt"), but I got garbled text and non-ASCII characters (joblib writes a binary format, not plain text).
Ideal output: a table with an Algorithm Name column plus one column per parameter.
I understand this may be a long shot, but my question is just how to output the classifier properly and capture all of the required info.
If you want to load the object/model as it was, then joblib is the way to go (or pickle, but scikit-learn suggests joblib).
If you want to keep the parameters and use them:
from sklearn.tree import DecisionTreeClassifier
import json

dt = DecisionTreeClassifier()
# do your stuff
# ...
# You can dump the parameters to JSON (or any other type of storage),
# then load and re-use them.
with open("somefile.json", "w") as f:  # "w", not "wb": json.dump writes text
    json.dump(dt.get_params(), f)
# ...
# ... and load them, with some proper error handling...
with open("somefile.json") as f:
    dt.set_params(**json.load(f))
In general, for what you are asking for, you'll have to do something custom. (I too am in the process of implementing something to hold this information in a database so it can be re-used, but I have not found a workaround for joblib yet.)
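For the tabular "Algorithm Name plus parameters" output specifically, a minimal sketch of the custom part, assuming pandas is acceptable (the column layout is just an illustration):
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
# One row per model: the class name plus one column per hyperparameter.
row = {"Algorithm Name": type(clf).__name__, **clf.get_params()}
table = pd.DataFrame([row])
table.to_csv("models.csv", index=False)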