Python - Graphviz - Remove legend on nodes of DecisionTreeClassifier - python

I have a decision tree classifier from sklearn and I use pydotplus to show it.
However I don't really like when there is a lot of informations on each nodes for my presentation (entropy, samples and value).
To explain it easier to people I would like to only keep the decision and the class on it.
Where can I modify the code to do it ?
Thank you.

Accoring to the documentation, it is not possible to abstain from setting the additional information inside boxes. The only thing that you may implicitly omit is the impurity parameter.
However, I have done it the other explicit way which is somewhat crooked. First, I save the .dot file setting the impurity to False. Then, I open it up and convert it to a string format. I use regex to subtract the redundant labels and resave it.
The code goes like this:
import pydotplus # pydot library: install it via pip install pydot
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.datasets import load_iris
data = load_iris()
clf = DecisionTreeClassifier()
clf.fit(data.data, data.target)
export_graphviz(clf, out_file='tree.dot', impurity=False, class_names=True)
PATH = '/path/to/dotfile/tree.dot'
f = pydot.graph_from_dot_file(PATH).to_string()
f = re.sub('(\\\\nsamples = [0-9]+)(\\\\nvalue = \[[0-9]+, [0-9]+, [0-9]+\])', '', f)
f = re.sub('(samples = [0-9]+)(\\\\nvalue = \[[0-9]+, [0-9]+, [0-9]+\])\\\\n', '', f)
with open('tree_modified.dot', 'w') as file:
file.write(f)
Here are the images before and after modification:
In your case, there seems to be more parameters in boxes, so you may want to tweak the code a little bit.
I hope that helps!

Related

Python Databricks cannot visualise dtreeviz decision tree

I need to visualize a decision tree in dtreeviz in Databricks.
The code seems to be working fine.
However, instead of showing the decision tree it throws the following:
Out[23]: <dtreeviz.trees.DTreeViz at 0x7f5b27a91160>
Running the following code:
import pandas as pd
from sklearn import preprocessing, tree
from dtreeviz.trees import dtreeviz
Things = {'Feature01': [3,4,5,0],
'Feature02': [4,5,6,0],
'Feature03': [1,2,3,8],
'Target01': ['Red','Blue','Teal','Red']}
df = pd.DataFrame(Things,
columns= ['Feature01', 'Feature02',
'Feature02', 'Target01'])
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(df.Target01)
df['target'] = label_encoder.transform(df.Target01)
classifier = tree.DecisionTreeClassifier()
classifier.fit(df.iloc[:,:3], df.target)
dtreeviz(classifier,
df.iloc[:,:3],
df.target,
target_name='toy',
feature_names=df.columns[0:3],
class_names=list(label_encoder.classes_)
)
if you look into dtreeviz documentation you'll see that dtreeviz method just creates an object, and then you need to use function like .view() to show it. On Databricks, view won't work, but you can use .svg() method to generate output as SVG, and then use displayHTML function to show it. Following code:
viz = dtreeviz(classifier,
...)
displayHTML(viz.svg())
will give you desired output:
P.S. You need to have the dot command-line tool to generate output. It could be installed by executing in a cell of the notebook:
%sh apt-get install -y graphviz

How do I train two sets of data given both files separately?

I am doing a project in which I need to estimate the age of an individual, given an X-Ray of their hand. I am given a testing set, which contains a large collection of images (in a folder on my computer), all NUMBERED, and I am also given a CSV file that corresponds each image number with 2 pieces of information: the age(in months), as well as whether the individual is male (this is given as "true" or "false." Also, I believe I have successfully imported both of these files into python(the image folder, as well as the CSV file)
I have looked at many TensorFlow tutorials, but I am struggling to figure out how I can associate the image numbers together, as well as train the data set. Any help would be greatly appreciated!!
I have attached blocks of my code, as well as how the data is presented to me, up until this point.
import pandas as pd
import numpy as np
import os
import tensorflow as tf
import cv2
from tensorflow import keras
from tensorflow.keras.layers import Dense, Input, InputLayer, Flatten
from tensorflow.keras.models import Sequential, Model
from matplotlib import pyplot as plt
import matplotlib.image as mpimg
import random
%matplotlib inline
import matplotlib.pyplot as plt
--This simply imports libraries that I use, or anticipate using later on.
plt.figure(figsize=(20,20))
train_images=r'/Users/FOLDER/downloads/Boneage_competition/training_dataset/boneage-training-dataset'
for i in range(5):
file = random.choice(os.listdir(train_images))
image_path= os.path.join(train_images, file)
img=mpimg.imread(image_path)
ax=plt.subplot(1,5,i+1)
ax.title.set_text(file)
plt.imshow(img)
-- This successfully imports the image folder, as well as prints 5 random images to test if the importing worked.
This screenshot provides an example of how the pictures are depicted
IMG_WIDTH=200
IMG_HEIGHT=200
img_folder=r'/Users/FOLDER/downloads/Boneage_competition/training_dataset/'
-- I believe this resizes all the images to the specified dimensions
label_file = '/Users/FOLDER/downloads/train.csv'
train_labels = pd.read_csv (r'/Users/FOLDER/downloads/train.csv')
print (train_labels)
-- This successfully imports the data from the CSV file, and prints it, to make sure it worked.
If you have any ideas on how to connect these two datasets and train the data, I would greatly appreciate it.
Thank you!
The approach is simple create a map between the image_data and the label. After that you can create two lists/np.array and use the same to pass the train and label info to you model. Following code should help in getting the same.
import os
import glob
dic = {}
# assuming you have .png format files else change the same into the glob statement
train_images='/Users/FOLDER/downloads/Boneage_competition/training_dataset/boneage-training-dataset'
for file in glob.glob(train_images+'/*.png'):
b_name = os.path.basename(file).split('.')[0]
dic[b_name] = mpimg.imread(file)
dic_label_match = {}
label_file = '/Users/FOLDER/downloads/train.csv'
train_labels = pd.read_csv (r'/Users/rkrishna/downloads/train.csv')
for i in range(len(train_labels)):
# given your first column is age and image no starts from 1
dic_label_match[i+1] = str(train_labels.iloc[i][0])
# you can use the below line too
# dic_label_match[i+1] = str(train_labels.iloc[i][age])
# now you have dict with keys and values
# create two lists / arrays and you can pass the same to the keram model
train_x = []
label_ = []
for val in dic:
if val in dic and val in dic_label_match:
train_x.append(dic[val])
label_.append(dic_label_match[val])

TensorFlow, what does the tensorflow_datasets.load() exactly return?

I am following a tutorial of TensorFlow ML and I am new to Python. I come from a background of languages like Java. Here is the link to the tutorial.
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import tensorflow_hub as hub
import tensorflow_datasets as tfds
from tensorflow.keras import layers
# Download the Flowers Dataset using TensorFlow Datasets
(training_set, validation_set), dataset_info = tfds.load(
'tf_flowers',
split=['train[:70%]', 'train[70%:]'],
with_info=True,
as_supervised=True,
)
for example in training_set:
num_training_examples += 1
# Reformat Images and Create Batches
IMAGE_RES = 224
def format_image(image, label):
image = tf.image.resize(image, (IMAGE_RES, IMAGE_RES))/255.0
return image, label
BATCH_SIZE = 32
train_batches = training_set.shuffle(num_training_examples//4).map(format_image).batch(BATCH_SIZE).prefetch(1)
validation_batches = validation_set.map(format_image).batch(BATCH_SIZE).prefetch(1)
I don't understand how this code operates: (training_set, validation_set), dataset_info = tfds.load. The function tfds.load downloads images of flowers. How come that training_set is iterable like some sort of array, when it should be a folder perhaps?
for example in training_set:
num_training_examples += 1
Also how come each element in it is used in the following line as two arguments to the function format_image(image, label) in this line:
train_batches = training_set.shuffle(num_training_examples//4).map(format_image).batch(BATCH_SIZE).prefetch(1)
What is training_set exactly? Why is it not a folder that contains the following structure:
flowers_a
file1, file2, file3 ... etc
flowers_b
file1, file2, file3 ... etc
flowers_c
file1, file2, file3 ... etc
etc ...
instead its some sort of an array with each element containing an image and its label? It is not clear in the documentation what is happening for a beginner in Python such as I.
Like the name suggests, Tensorflow exists to "make the tensors flow". It's an entire ecosystem with data loading, preprocessing, and machine learning capabilities. So it's not built as an intuitive library that deals with numpy arrays. Tensorflow doesn't keep everything in memory so what TFDS returns is literally a "Tensorflow Dataset". You need to manipulate it as such. This means that you can't get basic information, like the count, intuitively. You need to iterate through the whole thing. For instance this line you gave:
for example in training_set:
num_training_examples += 1
It's passing all the samples and counting them. For this part:
(training_set, validation_set), dataset_info = tfds.load...
It loads the "Tensorflow Dataset" as supervised, meaning that it's 2 tuples for data and label. If you remove the as_supervised=True, it will be a dictionary, and you can iterate through them with dataset['image'] and dataset['label'].
Let me know if you want me to explain anything else.

Using graphviz to plot decision tree in python

I am following the answer presented to a previous post: Is it possible to print the decision tree in scikit-learn?
from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.externals.six import StringIO
import pydot
clf = tree.DecisionTreeClassifier()
iris = load_iris()
clf = clf.fit(iris.data, iris.target)
tree.export_graphviz(clf, out_file='tree.dot')
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("iris.pdf")
Unfortunately, I cannot figure out the following error:
'list' object has no attribute 'write_pdf'
Does anyone know a way around this as the structure of the generated tree.dot file is a list?
Update
I have attempted using the web application http://webgraphviz.com/. This works, however, the decision tree conditions, together with the classes are not displayed. Is there any way to include these in the tree.dot file?
Looks like data that you collect in graph is of type list.
graph = pydot.graph_from_dot_data(dot_data.getvalue())
type(graph)
<type 'list'>
We are only interested in first element of the list.
So you can do this one of following of two ways,
1) Change line where you collect dot_data value in graph to
(graph, ) = pydot.graph_from_dot_data(dot_data.getvalue())
2) Or collect entire list in graph but just use first element to be sent to pdf
graph[0].write_pdf("iris.pdf")
Here is what I get as output of iris.pdf
Update
To get around path error,
Exception: "dot.exe" not found in path.
Install graphviz from here
Then use either following in your code.
import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'
Or simply add following to your windows path in control panel.
C:\Program Files (x86)\Graphviz2.38\bin
As per graphviz documentation, it does not get added to windows path during installation.

OneClassSVM scikit learn

I have two data sets, trainig and test. They have labels "1" and "0". I need to evaluate these data sets using "oneClassSVM" Algorithm with "rbf" kernel in scikit learn. I loaded training data set, but I have no idea how to evaluate that with test data set. Below is my code,
from sklearn import svm
import numpy as np
input_file_data = "/home/anuradha/TrainData.csv"
dataset = np.loadtxt(input_file_iris, delimiter=",")
X = dataset[:,0:4]
y = dataset[:,4]
estimator= svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
Please some one can help me to solve this problem ?
It's as simple as adding the following two lines of code at the end of your script:
estimator.fit(X_train)
y_pred_test = estimator.predict(X_test)
The first line tells svn which training data to use and the second one makes prediction on the test set (be sure to load both datasets and to change variable names accordingly).
Here there is a complete example on how to use OneClassSVM and here the class reference.

Categories

Resources