Python Databricks cannot visualise dtreeviz decision tree

I need to visualize a decision tree with dtreeviz in Databricks.
The code runs without errors.
However, instead of showing the decision tree, the cell only prints the following:
Out[23]: <dtreeviz.trees.DTreeViz at 0x7f5b27a91160>
This is the code I am running:
import pandas as pd
from sklearn import preprocessing, tree
from dtreeviz.trees import dtreeviz
Things = {'Feature01': [3,4,5,0],
          'Feature02': [4,5,6,0],
          'Feature03': [1,2,3,8],
          'Target01': ['Red','Blue','Teal','Red']}
df = pd.DataFrame(Things,
                  columns=['Feature01', 'Feature02',
                           'Feature03', 'Target01'])
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(df.Target01)
df['target'] = label_encoder.transform(df.Target01)
classifier = tree.DecisionTreeClassifier()
classifier.fit(df.iloc[:,:3], df.target)
dtreeviz(classifier,
         df.iloc[:,:3],
         df.target,
         target_name='toy',
         feature_names=df.columns[0:3],
         class_names=list(label_encoder.classes_)
         )

If you look at the dtreeviz documentation, you'll see that the dtreeviz function only creates an object; you then need to call a method such as .view() to show it. On Databricks, .view() won't work, but you can call the .svg() method to generate the output as SVG and pass the result to the displayHTML function. The following code:
viz = dtreeviz(classifier,
               ...)
displayHTML(viz.svg())
will give you the desired output.
P.S. You need the dot command-line tool to generate the output. It can be installed by running the following in a notebook cell:
%sh apt-get install -y graphviz
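
The pattern above can be wrapped in a small helper. This is only a minimal sketch, not part of dtreeviz or Databricks: render_tree is a hypothetical name, and it assumes nothing beyond what the answer states, namely that the object has an .svg() method and that a display function such as displayHTML accepts the markup as a string. FakeViz is a made-up stand-in so the sketch runs outside a notebook.

```python
def render_tree(viz, display):
    """Generate SVG markup from a dtreeviz-style object and hand it to `display`.

    `viz` is assumed to expose .svg(); `display` is any callable that accepts
    an HTML/SVG string (displayHTML on Databricks).
    """
    svg_markup = viz.svg()
    display(svg_markup)
    return svg_markup


class FakeViz:
    """Hypothetical stand-in for the object returned by dtreeviz(...)."""

    def svg(self):
        return "<svg xmlns='http://www.w3.org/2000/svg'></svg>"


shown = []
# On Databricks this would be: render_tree(viz, displayHTML)
markup = render_tree(FakeViz(), shown.append)
print(markup)
```

Passing the display function as an argument keeps the helper usable in environments where displayHTML does not exist.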

Related

Using a R function in python notebook to visualize missing data

naniar is a common R package for visualizing missing data. I am trying to use rpy2 to call the R function vis_miss() from naniar to plot the missing data.
Python gives me a data frame as output instead of a plot in my notebook, and I would like to fix this. The idea is to use the vis_miss function in a Python notebook.
Below is a working example using iris dataset:
# install rpy2 to run R in python
!pip3 install rpy2
%load_ext rpy2.ipython
from sklearn.datasets import load_iris
%R install.packages("naniar")
%R library(naniar)
%R library(ggplot2)
# Load Iris data
iris = load_iris()
# Run vis_miss function, expecting to see a graph showing missing data
%R naniar::vis_miss(iris)
My output should now be an image of missing data but instead I get:
ListVector with 10 elements.
data R/rpy2 DataFrame (750 x 4)
rows variable valueType value
... ... ... ...
layers ListVector with 1 elements.
[no name] [RTYPES.ENVSXP]
scales add: function clone: function find: function get_scales: function has_scale: function input: function n: function non_position_scales: function scales: list super:
... ...
plot_env
labels ListVector with 4 elements.
x [RTYPES.STRSXP]
y [RTYPES.STRSXP]
text [RTYPES.STRSXP]
fill [RTYPES.STRSXP]
guides ListVector with 1 elements.
fill [RTYPES.VECSXP]
How can I get the required output that would occur in R, within a cell in this python notebook?
Would I perhaps use matplotlib or ggplot2 here?

Use cell magic (%%R) to get the output as an image:
%%R
naniar::vis_miss(iris)
The cell magic also lets you customize the width, height, dpi and format of the image; see the rpy2 documentation on IPython magic integration.

social network analysis using python

I have two CSV files. names.csv contains each person's name and the corresponding node, and nodelinks.csv contains the link weights between nodes (persons), that is, how many times one person called another (stored in the weight column).
I want to create a network that is divided into sub-networks according to the leaders, followers, marginals, outliers and bridges in the network.
I searched the internet and found the networkx library for Python. I tried it and got a drawing of the whole network, but it is very cluttered: the nodes are drawn on top of each other. I'd like a drawing of the network that is easy to understand, and I also want to find the sub-networks, leaders, followers, marginals, outliers and bridges in that network.
What I've tried so far
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
df = pd.read_csv('Nodelinks.csv')
df.columns = ['Source', 'Destination', 'Link']
df.head()
graph = nx.from_pandas_edgelist(df, source='Source', target='Destination',
                                edge_attr='Link', create_using=nx.DiGraph())
plt.figure(figsize=(10, 9))
nx.draw(graph, node_size=1200, node_color='lightblue',
        linewidths=0.25, font_size=10, font_weight='bold', with_labels=True,
        dpi=1000)
plt.show()
Install the networkx library using pip or conda.
I tried pip first, but it gave me an error; installing with conda worked.
The dataset and Jupyter notebook are uploaded to Mega.
I don't know how to proceed from here to get the output I want. Also, is there another way to approach this (preferably an easier one)?

Using graphviz to plot decision tree in python

I am following the answer presented to a previous post: Is it possible to print the decision tree in scikit-learn?
from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.externals.six import StringIO
import pydot
clf = tree.DecisionTreeClassifier()
iris = load_iris()
clf = clf.fit(iris.data, iris.target)
tree.export_graphviz(clf, out_file='tree.dot')
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("iris.pdf")
Unfortunately, I cannot figure out the following error:
'list' object has no attribute 'write_pdf'
Does anyone know a way around this as the structure of the generated tree.dot file is a list?
Update
I have attempted using the web application http://webgraphviz.com/. This works, however, the decision tree conditions, together with the classes are not displayed. Is there any way to include these in the tree.dot file?

It looks like the value you collect in graph is a list.
graph = pydot.graph_from_dot_data(dot_data.getvalue())
type(graph)
<type 'list'>
We are only interested in the first element of the list.
You can handle this in one of two ways:
1) Change the line where you assign to graph so that it unpacks the single element:
(graph, ) = pydot.graph_from_dot_data(dot_data.getvalue())
2) Or keep the entire list in graph and write only its first element to the PDF:
graph[0].write_pdf("iris.pdf")
With either change, iris.pdf is written with the rendered tree (the original answer showed it as an image).
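
The two options differ in one respect that a stand-in list makes visible. The sketch below uses plain strings instead of pydot graphs (load_graphs is a hypothetical placeholder for pydot.graph_from_dot_data, not a pydot function): tuple unpacking insists on exactly one element, while indexing takes the first element whatever the length.

```python
def load_graphs(count):
    """Hypothetical stand-in for a function that returns a list of graphs."""
    return [f"graph{i}" for i in range(count)]


# Option 1: unpacking succeeds only when the list holds exactly one element.
(graph,) = load_graphs(1)

# Option 2: indexing returns the first element regardless of the list length.
first = load_graphs(3)[0]

# Unpacking a longer list raises ValueError instead of silently dropping graphs.
try:
    (graph_from_many,) = load_graphs(3)
    unpack_error = None
except ValueError as exc:
    unpack_error = str(exc)

print(graph, first, unpack_error)
```

Unpacking is the stricter choice: it fails loudly if the DOT data ever contains more than one graph.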
Update
To get around the path error
Exception: "dot.exe" not found in path.
install Graphviz, then do one of the following.
Either extend the path from your code:
import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'
Or add the following to your Windows path in the Control Panel:
C:\Program Files (x86)\Graphviz2.38\bin
According to the Graphviz documentation, the installer does not add it to the Windows path.
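
A slightly more defensive variant of the in-code approach checks whether dot can already be found before touching PATH. This is a sketch using only the standard library; add_to_path is a hypothetical helper name, and the directory is the install location quoted in the answer, which may differ on your machine.

```python
import os
import shutil


def add_to_path(directory):
    """Append `directory` to PATH for the current process if it is not there yet."""
    entries = [e for e in os.environ.get("PATH", "").split(os.pathsep) if e]
    if directory not in entries:
        os.environ["PATH"] = os.pathsep.join(entries + [directory])


# Install location taken from the answer; adjust it to wherever Graphviz lives.
graphviz_bin = "C:/Program Files (x86)/Graphviz2.38/bin/"

# shutil.which returns None when no executable named "dot" is found on PATH.
if shutil.which("dot") is None:
    add_to_path(graphviz_bin)

print("dot found at:", shutil.which("dot"))
```

The change only affects the running Python process; the Control Panel route is still needed for a permanent setting.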

Python - Graphviz - Remove legend on nodes of DecisionTreeClassifier

I have a decision tree classifier from sklearn and I use pydotplus to display it.
However, for my presentation I don't like having so much information on each node (entropy, samples and value).
To make it easier to explain to people, I would like to keep only the decision and the class on each node.
Where can I modify the code to do this?
Thank you.

According to the documentation, it is not possible to suppress all of the additional information inside the boxes. The only item you can omit directly is the impurity, via the impurity parameter.
However, I have done it explicitly in another, somewhat roundabout way. First, I save the .dot file with impurity set to False. Then I read it back in and convert it to a string. Finally, I use regular expressions to remove the redundant labels and save the result.
The code goes like this:
import re
import pydotplus  # install it via: pip install pydotplus
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.datasets import load_iris
data = load_iris()
clf = DecisionTreeClassifier()
clf.fit(data.data, data.target)
# Export without the impurity line; class_names=True adds a symbolic class label
export_graphviz(clf, out_file='tree.dot', impurity=False, class_names=True)
# Path to the exported tree.dot file
PATH = '/path/to/dotfile/tree.dot'
# Read the .dot file back in as a string
f = pydotplus.graph_from_dot_file(PATH).to_string()
# Remove "samples" and "value" from internal nodes, then from leaf nodes
f = re.sub(r'(\\nsamples = [0-9]+)(\\nvalue = \[[0-9]+, [0-9]+, [0-9]+\])', '', f)
f = re.sub(r'(samples = [0-9]+)(\\nvalue = \[[0-9]+, [0-9]+, [0-9]+\])\\n', '', f)
with open('tree_modified.dot', 'w') as file:
    file.write(f)
The original answer showed images of the tree before and after the modification.
In your case, there seem to be more parameters in the boxes, so you may want to tweak the code a little.
I hope that helps!
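
To see what the two substitutions do without installing Graphviz, the same idea can be tried on a hand-written DOT fragment. This is a sketch under assumptions: the two sample lines only imitate the label layout that export_graphviz writes with impurity=False and class_names=True, strip_node_stats is a hypothetical helper, and the value pattern is loosened to accept any number of classes rather than exactly three.

```python
import re


def strip_node_stats(dot_source):
    """Remove the 'samples' and 'value' lines from node labels in DOT text."""
    # Internal nodes: the stats follow the split rule, so they start with a literal "\n".
    dot_source = re.sub(r'\\nsamples = \d+\\nvalue = \[[^\]]*\]', '', dot_source)
    # Leaf nodes: the label starts with the stats, so remove them with their trailing "\n".
    dot_source = re.sub(r'samples = \d+\\nvalue = \[[^\]]*\]\\n', '', dot_source)
    return dot_source


# Illustrative sample only; "\\n" is the two-character sequence DOT uses inside labels.
sample = (
    '0 [label="X[3] <= 0.8\\nsamples = 150\\nvalue = [50, 50, 50]\\nclass = y[0]"] ;\n'
    '1 [label="samples = 50\\nvalue = [50, 0, 0]\\nclass = y[0]"] ;\n'
)

cleaned = strip_node_stats(sample)
print(cleaned)
```

After the substitutions, the internal node keeps its split rule and class, and the leaf keeps only its class.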

Python, PyDot and DecisionTree

I'm trying to visualize my DecisionTree, but I'm getting an error.
The code is:
X = [i[1:] for i in dataset]#attribute
y = [i[0] for i in dataset]
clf = tree.DecisionTreeClassifier()
dot_data = StringIO()
tree.export_graphviz(clf.fit(train_X, train_y), out_file=dot_data)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("tree.pdf")
And the error is
Traceback (most recent call last):
if data.startswith(codecs.BOM_UTF8):
TypeError: startswith first arg must be str or a tuple of str, not bytes
Can anyone explain what the problem is? Thanks a lot!

If you are using Python 3, use pydotplus instead of pydot. It also installs smoothly with pip.
import pydotplus
<your code>
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("iris.pdf")

I had the same exact problem and just spent a couple hours trying to figure this out. I can't guarantee what I share here will work for others but it may be worth a shot.
1) I tried installing the official pydot packages, but I have Python 3 and they simply did not work. After finding a note in a thread on one of the many websites I scoured, I ended up installing this forked repository of pydot.
2) I went to graphviz.org and installed their software on my Windows 7 machine. If you don't have Windows, look under their Download section for your system.
3) After a successful install, I opened Environment Variables (Control Panel\All Control Panel Items\System\Advanced system settings > click the Environment Variables button), found the variable path under System variables, clicked Edit... and added ;C:\Program Files (x86)\Graphviz2.38\bin to the end of the Variable value: field.
4) To confirm that I could now use dot commands on the command line (Windows Command Processor), I typed dot -V, which returned dot - graphviz version 2.38.0 (20140413.2041).
In the below code, keep in mind that I'm reading a dataframe from my clipboard. You might be reading it from file or whathaveyou.
In IPython Notebook:
import pandas as pd
import numpy as np
from sklearn import tree
import pydot
from IPython.display import Image
from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn releases
df = pd.read_clipboard()
X = df[df.columns[:-1]]
y = df[df.columns[-1]]
dtr = tree.DecisionTreeRegressor(max_depth=3)
dtr.fit(X, y)
dot_data = StringIO()
tree.export_graphviz(dtr, out_file=dot_data, feature_names=X.columns)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
Alternatively, if you're not using IPython, you can generate your own image from the command line as long as you have graphviz installed (step 2 above). Using my same example code above, you use this line after fitting the model:
tree.export_graphviz(dtr.tree_, out_file='treepic.dot', feature_names=X.columns)
then open up command prompt where the treepic.dot file is and enter this command line:
dot -T png treepic.dot -o treepic.png
A .png file should be created with your decision tree.
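
The same conversion can be driven from Python instead of a command prompt. This is a minimal sketch using only the standard library; dot_to_png_command and render_png are hypothetical helper names, and the file names are the ones used above.

```python
import shutil
import subprocess


def dot_to_png_command(dot_file, png_file):
    """Build the argument list for: dot -Tpng <dot_file> -o <png_file>."""
    return ["dot", "-Tpng", dot_file, "-o", png_file]


def render_png(dot_file, png_file):
    """Run dot if it is installed; return False when dot is not on PATH."""
    if shutil.which("dot") is None:
        return False
    # check=True raises CalledProcessError if dot exits with an error.
    subprocess.run(dot_to_png_command(dot_file, png_file), check=True)
    return True


print(dot_to_png_command("treepic.dot", "treepic.png"))
```

Calling render_png("treepic.dot", "treepic.png") after export_graphviz would then write the image, provided Graphviz is installed.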

The line in question is checking to see if the stream/file is encoded as UTF-8
Instead of:
if data.startswith(codecs.BOM_UTF8):
use:
if codecs.BOM_UTF8 in data:
You will likely have more success...
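
The underlying issue is a type mismatch: on Python 3, codecs.BOM_UTF8 is a bytes object, while the DOT data here is a str. Note that the in test has the same limitation, because a bytes value cannot be searched for inside a str either. The sketch below shows one way to check for the marker with either type; has_utf8_bom is a hypothetical helper, not part of pydot.

```python
import codecs


def has_utf8_bom(data):
    """Return True if `data` (str or bytes) starts with a UTF-8 byte order mark."""
    if isinstance(data, bytes):
        return data.startswith(codecs.BOM_UTF8)
    # For str, compare against the decoded BOM, which is the character U+FEFF.
    return data.startswith(codecs.BOM_UTF8.decode("utf-8"))


# Mixing str and bytes reproduces the error from the traceback.
try:
    "digraph Tree {}".startswith(codecs.BOM_UTF8)
except TypeError as exc:
    print("TypeError:", exc)

print(has_utf8_bom("digraph Tree {}"))
print(has_utf8_bom(codecs.BOM_UTF8 + b"digraph Tree {}"))
```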
