I need to prune a sklearn decision tree classifier so that the indicated probability (the value on the right in the image) is monotonically increasing from leaf to leaf. For example, if you fit a basic tree in Python, you have:
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.tree._tree import TREE_LEAF
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data[:, 0].reshape(-1,1), np.where(iris.target==0,0,1)
tree = DecisionTreeClassifier(max_depth=3, random_state=123)
tree.fit(X,y)
percentages = tree.tree_.value[:,0,1]/np.sum(tree.tree_.value.reshape(-1,2), axis=1)
Now the leaves that break the monotonicity, as indicated, must be eliminated, leaving the tree as follows:
Although the indicated example does not show it, one rule to consider is that if the leaves have different parents, then the leaf with the largest amount of data is kept. To deal with this I have been trying a brute-force algorithm, but it only performs the first iteration, and I need to apply the algorithm to bigger trees. The answer probably involves recursion, but with the sklearn tree structure I don't really know how to do it.
The following approach meets the pruning requirements you described: traverse the tree, identify the non-monotonic leaves, remove the leaves of the parent node with the fewest samples, and repeat until the leaves are monotonic. Removing one node per pass adds time complexity, but the trees usually have limited depth. The conference paper "Pruning for Monotone Classification Trees" helped me understand monotonicity in trees, and I derived this approach from it for your scenario.
Since the non-monotonic leaves must be identified from left to right, the first step is a post-order traversal of the tree. If you are not familiar with tree traversals, it is worth reading up on how they work before studying the function. You can also run the traversal function and inspect its output, which helps in practice.
#We will define a traversal algorithm which will scan the nodes and leaves from left to right
#The traversal is recursive, we declare global lists to collect values from each recursion
traversal=[] #List to collect traversal steps
parents=[]#List to collect the parents of the collected nodes or leaves
is_leaves=[] #List to collect if the collected traversal item are leaves or not
# A function to do postorder tree traversal
def postOrderTraversal(tree,root,parent):
    if root!=-1:
        #Recursion on left child
        postOrderTraversal(tree,tree.tree_.children_left[root],root)
        #Recursion on right child
        postOrderTraversal(tree,tree.tree_.children_right[root],root)
        traversal.append(root) #Collect the name of node or leaf
        parents.append(parent) #Collect the parent of the collected node or leaf
        is_leaves.append(is_leaf(tree,root)) #Collect if the collected object is leaf
Above, the recursion follows the left and right children of each node through the children_left and children_right arrays of the tree structure. The is_leaf() helper used there is defined below.
def is_leaf(tree,node):
    if tree.tree_.children_left[node]==-1:
        return True
    else:
        return False
Internal decision tree nodes always have two children. Therefore checking only for the left child tells whether the object in question is an internal node or a leaf. The tree stores -1 when the requested child does not exist.
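To make the left-child check and the left-to-right leaf order concrete, here is a minimal sketch that mimics the layout of the children_left and children_right arrays with plain lists. The five-node tree and the function name leaves_left_to_right are made up for illustration; they are not output from a fitted model.

```python
# Toy tree in the same array layout sklearn uses: index = node id,
# value = child id, and -1 marks "no child".
# This 5-node tree is hand-made for illustration only.
TREE_LEAF = -1
children_left = [1, -1, 3, -1, -1]
children_right = [2, -1, 4, -1, -1]


def is_leaf(node):
    # A node has either two children or none, so one check suffices.
    return children_left[node] == TREE_LEAF


def leaves_left_to_right(node=0):
    # Depth-first walk that visits the left subtree before the right one,
    # which is the order in which a post-order traversal reports leaves.
    if is_leaf(node):
        return [node]
    return (leaves_left_to_right(children_left[node])
            + leaves_left_to_right(children_right[node]))


print(leaves_left_to_right())  # [1, 3, 4]
```

Node 0 is the root, node 2 is the only other internal node, and the leaves come out as 1, 3, 4 regardless of how the node ids were assigned.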
Your non-monotonicity condition requires the ratio of class-1 samples within each leaf. I have called this positive_ratio() (this is what you called "percentages").
def positive_ratio(tree): #The frequency of 1 values of leaves in binary classification tree:
    #Number of samples with value 1 in leaves/total number of samples in nodes/leaves
    return tree.tree_.value[:,0,1]/np.sum(tree.tree_.value.reshape(-1,2), axis=1)
The final helper function below returns the tree index of the node (1, 2, 3, etc.) with the minimum number of samples. It takes the list of nodes whose leaves exhibit non-monotonic behavior and reads the n_node_samples property of the tree structure. The node it finds is the one whose leaves will be removed.
def min_samples_node(tree, nodes): #Finds the node with the minimum number of samples among the provided list
    #Make a dictionary of number of samples of given nodes, and their index in the nodes list
    samples_dict={tree.tree_.n_node_samples[node]:i for i,node in enumerate(nodes)}
    min_samples=min(samples_dict.keys()) #The minimum number of samples among the samples of nodes
    i_min=samples_dict[min_samples] #Index of the node with minimum number of samples
    return nodes[i_min] #The number of node with the minimum number of samples
With the helper functions defined, the wrapper function below prunes repeatedly until the leaves of the tree are monotonic, then returns the pruned tree.
def prune_nonmonotonic(tree): #Prune non-monotonic nodes of a binary classification tree
    while True: #Repeat until monotonicity is sustained
        #Clear the traversal lists for a new scan
        traversal.clear()
        parents.clear()
        is_leaves.clear()
        #Do a post-order traversal of tree so that the leaves will be returned in order from left to right
        postOrderTraversal(tree,0,None)
        #Filter the traversal outputs by keeping only leaves and leaving out the nodes
        leaves=[traversal[i] for i,leaf in enumerate(is_leaves) if leaf == True]
        leaves_parents=[parents[i] for i,leaf in enumerate(is_leaves) if leaf == True]
        pos_ratio=positive_ratio(tree) #List of positive samples ratio of the nodes of binary classification tree
        leaves_pos_ratio=[pos_ratio[i] for i in leaves] #List of positive samples ratio of the traversed leaves
        #Detect the non-monotonic pairs by comparing the leaves side-by-side
        nonmonotone_pairs=[[leaves[i],leaves[i+1]] for i,ratio in enumerate(leaves_pos_ratio[:-1]) if (ratio>=leaves_pos_ratio[i+1])]
        #Make a flattened and unique list of leaves out of pairs
        nonmonotone_leaves=[]
        for pair in nonmonotone_pairs:
            for leaf in pair:
                if leaf not in nonmonotone_leaves:
                    nonmonotone_leaves.append(leaf)
        if len(nonmonotone_leaves)==0: #If all leaves show monotonic properties, then break
            break
        #List the parent nodes of the non-monotonic leaves
        nonmonotone_leaves_parents=[leaves_parents[i] for i in [leaves.index(leave) for leave in nonmonotone_leaves]]
        node_min=min_samples_node(tree, nonmonotone_leaves_parents) #The node with minimum number of samples
        #Prune the tree by removing the children of the detected non-monotonic and lowest number of samples node
        tree.tree_.children_left[node_min]=-1
        tree.tree_.children_right[node_min]=-1
    return tree
The enclosing "while" loop continues until an iteration in which the traversed leaves no longer exhibit non-monotonicity. min_samples_node() identifies, among the nodes that have non-monotonic leaves, the one with the fewest samples. When its left and right children are replaced with the value "-1", the tree is pruned at that node, and the next "while" iteration produces a different traversal in which the remaining non-monotonicity is identified and removed.
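The stopping test of the loop can be looked at in isolation. The sketch below applies the same side-by-side comparison (including the >= that treats ties as violations) to a plain list of ratios. The function name and the ratio values are hypothetical, chosen only to show one violating pair and one clean list.

```python
def nonmonotone_pairs(ratios):
    """Return index pairs (i, i+1) where the ratio fails to strictly increase."""
    return [(i, i + 1)
            for i in range(len(ratios) - 1)
            if ratios[i] >= ratios[i + 1]]


# Hypothetical leaf ratios, listed left to right.
before = [0.0, 0.4, 0.2, 0.9, 1.0]
after = [0.0, 0.3, 0.9, 1.0]

print(nonmonotone_pairs(before))  # [(1, 2)]
print(nonmonotone_pairs(after))   # []
```

An empty result corresponds to the "break" in the loop above; anything else means at least one more parent still has to be collapsed.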
The below images show the unpruned and pruned trees, respectively.
Related
I was going through the Maximum Binary Tree leetcode problem. The TL;DR is that you have an array, such as this one:
[3,2,1,6,0,5]
You're supposed to take the maximum element and make that the root of your tree. Then split the array into the part to the left of that element and the part to its right, and these are used to recursively create the left and right subtrees in the same way, respectively.
LeetCode claims that the optimal solution (shown in the "Solution" tab) uses a linear search for the maximum value of the sub-array in each recursive step. This is O(n^2) in the worst case. This is the solution I came up with, and it's simple enough.
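For reference, the straightforward recursive approach described above can be sketched as follows. This is my own minimal version, not LeetCode's published code; TreeNode is a stand-in for the class the site provides, and build_max_tree is a hypothetical name.

```python
class TreeNode:
    # Minimal stand-in for the node class LeetCode provides.
    def __init__(self, val):
        self.val = val
        self.left = None
        self.right = None


def build_max_tree(nums):
    # An empty slice produces an empty subtree.
    if not nums:
        return None
    # Linear scan for the position of the maximum; repeating this at every
    # level of the recursion is what makes the worst case quadratic.
    i = max(range(len(nums)), key=nums.__getitem__)
    root = TreeNode(nums[i])
    root.left = build_max_tree(nums[:i])
    root.right = build_max_tree(nums[i + 1:])
    return root


root = build_max_tree([3, 2, 1, 6, 0, 5])
print(root.val, root.left.val, root.right.val)  # 6 3 5
```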
However, I was looking through other submissions and found a linear time solution, but I've struggled to understand how it works! It looks something like this:
def constructMaximumBinaryTree(nums):
    nodes=[]
    for num in nums:
        node = TreeNode(num)
        while nodes and num>nodes[-1].val:
            node.left = nodes.pop()
        if nodes:
            nodes[-1].right = node
        nodes.append(node)
    return nodes[0]
I've analysed this function and in aggregate, this appears to be linear time (O(n)), since each unique node is added to and popped from the nodes array at most once. I've tried running it with different example inputs, but I'm struggling to connect the dots and wrap my head around how this works. Can someone please explain it to me?
One way to understand the algorithm is to consider the loop invariants. In this case, the array of nodes always satisfies the condition that before and after each execution of the for-loop, either:
nodes is empty and a max binary tree does not exist (for example, if the input nums was empty)
the first item in nodes is the max binary tree based on the data processed so far from the input nums
The while-loop ensures that the current max binary tree is the first item in the nodes array, since otherwise, it would have been popped and added as a left subtree.
During each iteration of the for-loop, the check:
if nodes:
    nodes[-1].right = node
adds the current node as a right subtree to the last item in the nodes array. And when this happens, the current node is less than the last node in the nodes array (since each input integer is defined to be unique). And since the current node is less than the last node in the array, the last node acts as a partition point whose value is greater than the current item, which is why the current node is added as a right subtree.
When there are multiple items in the nodes array, each item is a subtree of the item to its left.
Running Time
For the running time, let n be the length of the input nums. There are n executions of the for-loop. If the input data were sorted in descending order, but with the max input value at the end of the input (such as: 4, 3, 2, 1, 5), then the inner while-loop would be skipped during each iteration until the last for-loop iteration. During the last for-loop iteration, the while loop would run n - 1 times, for a total running time of n + (n - 1) => 2n - 1 => O(n).
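The count can be checked by instrumenting the function. The sketch below is the same algorithm with two counters added; TreeNode is a minimal stand-in for LeetCode's class, and construct_with_counts is a hypothetical name.

```python
class TreeNode:
    # Minimal stand-in for the node class LeetCode provides.
    def __init__(self, val):
        self.val = val
        self.left = None
        self.right = None


def construct_with_counts(nums):
    nodes = []
    pushes = pops = 0
    for num in nums:
        node = TreeNode(num)
        # Every smaller node on top of the stack becomes part of the
        # left subtree of the new node.
        while nodes and num > nodes[-1].val:
            node.left = nodes.pop()
            pops += 1
        # Whatever remains on top is larger, so the new node hangs to its right.
        if nodes:
            nodes[-1].right = node
        nodes.append(node)
        pushes += 1
    return (nodes[0] if nodes else None), pushes, pops


root, pushes, pops = construct_with_counts([4, 3, 2, 1, 5])
print(root.val, pushes, pops)  # 5 5 4
```

Each node is pushed exactly once and popped at most once, so pushes + pops never exceeds 2n, whatever the input order.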
I would like to build a decision tree classifier using Python, but I want to force the tree, regardless of what it thinks is best, to only split one node each time into two leaves. That is, each time, a node splits into a terminal leaf and another interior node that will continue to split, rather than into two interior nodes that can themselves split. I want one of the splits to end in a termination each time, until you end up having two leaves with below the minimum number.
For instance, the following tree satisfies this requirement
but the second one does not:
The reason I want to do this is to obtain a nested set of splits of observations. I saw on another post (Finding a corresponding leaf node for each data point in a decision tree (scikit-learn)) that the node IDs of observations can be found, which is crucial. I realize I can do this by building a tree without such a restriction, and going up one of the leaf nodes to the top, but this may not give enough observations, and I would essentially like to get this nested structure over all the observations in the dataset.
In my application I actually don't care about the classification task, I just want to obtain this nested set of observations formed by splits on features. I had planned on making the target variable randomly generated so the split on features would not be meaningful (which is counter-intuitive that this is what I want, but I'm using it for a different purpose). Alternatively, if someone knows of a similar binary split method in Python which achieves the same purpose, please let me know.
I realized one way to do this is to construct a decision tree with max_depth=1. This will perform a split into two leaves. Then pick the leaf with the highest impurity to continue splitting, fitting the decision tree again to only this subset, and repeat. To make sure that the hierarchy is obvious, I relabel the leaf_ids so it is clear that as you go up the tree, the ID values go down. Here is an example:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
def decision_tree_one_path(X, y=None, min_leaf_size=3):
    nobs = X.shape[0]
    # boolean vector to include observations in the newest split
    include = np.ones((nobs,), dtype=bool)
    # try to get leaves around min_leaf_size
    min_leaf_size = max(min_leaf_size, 1)
    # one-level DT splitter
    dtmodel = DecisionTreeClassifier(splitter="best", criterion="gini", max_depth=1, min_samples_split=int(np.round(2.05*min_leaf_size)))
    leaf_id = np.ones((nobs,), dtype='int64')
    iter = 0
    if y is None:
        y = np.random.binomial(n=1, p=0.5, size=nobs)
    while nobs >= 2*min_leaf_size:
        dtmodel.fit(X=X.loc[include], y=y[include])
        # give unique node id
        new_leaf_names = dtmodel.apply(X=X.loc[include])
        impurities = dtmodel.tree_.impurity[1:]
        if len(impurities) == 0:
            # was not able to split while maintaining constraint
            break
        # make sure node that is not split gets the lower node_label 1
        most_impure_node = np.argmax(impurities)
        if most_impure_node == 0: # i.e., label 1
            # switch 1 and 2 labels above
            is_label_2 = new_leaf_names == 2
            new_leaf_names[is_label_2] = 1
            new_leaf_names[np.logical_not(is_label_2)] = 2
        # rename leaves
        leaf_id[include] = iter + new_leaf_names
        will_be_split = new_leaf_names == 2
        # ignore the other one
        tmp = np.ones((nobs,), dtype=bool)
        tmp[np.logical_not(will_be_split)] = False
        include[include] = tmp
        # now create new labels
        nobs = np.sum(will_be_split)
        iter = iter + 1
    return leaf_id
leaf_id thus holds the leaf IDs of the observations, in order. For instance, leaf_id==1 marks the first observations that were split off into a terminal node, and leaf_id==2 marks the next ones that were split off into a terminal node, from the other side of the split that generated leaf_id==1, as shown below. With k splits there are thus k+1 leaves.
#0
#|\
#1 .
# |\
# 2 .
#.......
#
# |\
# k (k+1)
I was wondering if there is a way to do this automatically in Python, though.
I am guessing that it is conditional probability given that the above (tree branch) condition exists. However, I am not clear on it.
If you want to read more about the data used or how this diagram was produced, go to: http://machinelearningmastery.com/visualize-gradient-boosting-decision-trees-xgboost-python/
For a classification tree with 2 classes {0,1}, the value of a leaf node represents the raw score (log-odds) for class 1. It can be converted to a probability by applying the logistic function. The calculation below uses the left-most leaf as an example.
1/(1+np.exp(-1*0.167528))=0.5417843204057448
This means that if a data point ends up in this leaf, and this tree were the only one in the model, the probability of that data point being class 1 would be 0.5417843204057448. With several trees, the leaf values are summed before the logistic function is applied.
If it is a regression model (the objective can be reg:squarederror), then the leaf value is the prediction of that tree for the given data point. The leaf value can be negative, depending on your target variable. The final prediction for that data point is the sum of the leaf values in all the trees for that point.
If it is a classification model (the objective can be binary:logistic), then the leaf value is a raw score that stands for the probability of the data point belonging to the positive class. The final probability prediction is obtained by summing the leaf values (raw scores) in all the trees and then mapping the sum to a value between 0 and 1 with a sigmoid function. The leaf value (raw score) can be negative; a value of 0 corresponds to a probability of 1/2.
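As a rough sketch of that last step, the snippet below sums a few leaf values and passes the sum through a sigmoid. Only 0.167528 comes from the diagram discussed here; the other two leaf values are hypothetical, and the sketch ignores any initial offset the model may add to the raw score.

```python
import math


def sigmoid(raw_score):
    # Logistic function: maps a raw score (log-odds) to a probability.
    return 1.0 / (1.0 + math.exp(-raw_score))


# Leaf values that one data point reaches in three trees.
# The first is from the diagram; the other two are made up.
leaf_values = [0.167528, -0.05, 0.02]

raw = sum(leaf_values)
print(sigmoid(0.0))  # 0.5
print(sigmoid(raw))
```

A raw score of 0 maps to exactly 0.5, positive sums push the probability above one half, and negative sums push it below.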
Please find more details about the parameters and outputs at - https://xgboost.readthedocs.io/en/latest/parameter.html
Attribute leaf is the predicted value. In other words, if the evaluation of a tree model ends at that terminal node (aka leaf node), then this is the value that is returned.
In pseudocode (the left-most branch of your tree model):
if(f1 < 127.5){
    if(f7 < 28.5){
        if(f5 < 45.4){
            return 0.167528f;
        } else {
            return 0.05f;
        }
    }
}
You are correct. The probability values associated with leaf nodes represent the conditional probability of reaching those leaf nodes given a specific branch of the tree. Branches of trees can be presented as a set of rules. For example, @user1808924 showed one such rule in his answer, representing the left-most branch of your tree model.
So, in short: The tree can be linearized into decision rules, where the outcome is the contents of the leaf node, and the conditions along the path form a conjunction in the if clause. In general, the rules have the form:
if condition1 and condition2 and condition3 then outcome.
Decision rules can be generated by constructing association rules with the target variable on the right. They can also denote temporal or causal relations.
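A small sketch of that linearization, assuming the tree is stored as nested dictionaries. That layout, the function name tree_to_rules and the leaf values are all made up for illustration; only the thresholds f1 < 127.5 and f7 < 28.5 echo the pseudocode above, and missing-value handling is left out.

```python
def tree_to_rules(node, conditions=()):
    # A leaf is stored as a plain number; an internal node as a dict.
    if not isinstance(node, dict):
        clause = " and ".join(conditions) or "True"
        return [f"if {clause} then {node}"]
    test = f"{node['feature']} < {node['threshold']}"
    negated = f"{node['feature']} >= {node['threshold']}"
    # Each root-to-leaf path contributes one conjunction of conditions.
    return (tree_to_rules(node["yes"], conditions + (test,))
            + tree_to_rules(node["no"], conditions + (negated,)))


# Hypothetical two-level tree with made-up leaf values.
tree = {
    "feature": "f1", "threshold": 127.5,
    "yes": {"feature": "f7", "threshold": 28.5, "yes": 0.17, "no": -0.1},
    "no": 0.3,
}

for rule in tree_to_rules(tree):
    print(rule)
```

This produces one rule per leaf, each with the conditions along its path joined by "and".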
Straightforward question: I would like to retrieve all the nodes connected to a given node within a NetworkX graph in order to create a subgraph. In the example shown below, I just want to extract all the nodes inside the circle, given the name of any one of them.
I've tried the following recursive function, but hit Python's recursion limit, even though there are only 91 nodes in this network.
Regardless of whether or not the below code is buggy, what is the best way to do what I'm trying to achieve? I will be running this code on graphs of various sizes, and will not know beforehand what the maximum recursion depth will be.
def fetch_connected_nodes(node, neighbors_list):
    for neighbor in assembly.neighbors(node):
        print(neighbor)
        if len(assembly.neighbors(neighbor)) == 1:
            neighbors_list.append(neighbor)
            return neighbors_list
        else:
            neighbors_list.append(neighbor)
            fetch_connected_nodes(neighbor, neighbors_list)

neighbors = []
starting_node = 'NODE_1_length_6578_cov_450.665_ID_16281'
connected_nodes = fetch_connected_nodes(starting_node, neighbors)
Assuming the graph is undirected, there is a built-in networkx command for this:
node_connected_component(G, n)
The documentation is here. It returns all nodes in the connected component of G containing n.
It's not recursive, but I don't think you actually need or even want that.
Comments on your code: you've got a bug that will often result in infinite recursion. If u and v are neighbors, both with degree at least 2, then it will start with u, put v in the list, and when processing v put u in the list, and keep repeating. It needs to change to only process neighbors that are not in neighbors_list. Checking that on a list is expensive, so use a set instead. There's also a small problem if the starting node has degree 1: your test for degree 1 doesn't do what you're after. If the initial node has degree 1 but its neighbor has higher degree, it won't find the neighbor's neighbors.
Here's a modification of your code:
def fetch_connected_nodes(G, node, seen = None):
    if seen is None:
        seen = set([node])
    for neighbor in G.neighbors(node):
        print(neighbor)
        if neighbor not in seen:
            seen.add(neighbor)
            fetch_connected_nodes(G, neighbor, seen)
    return seen
You call this like fetch_connected_nodes(assembly, starting_node).
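Since the original concern was the recursion limit, here is a sketch of the same search written without recursion. It swaps the recursive calls for an explicit stack and works on a plain adjacency dictionary rather than a NetworkX graph, which is an assumption made to keep the example self-contained. The function name and the toy graph are hypothetical.

```python
def connected_nodes(adjacency, start):
    # Iterative depth-first search with an explicit stack, so the depth
    # of the graph never touches Python's recursion limit.
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


# Hypothetical undirected graph with two components.
adjacency = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["a"],
    "x": ["y"],
    "y": ["x"],
}
print(sorted(connected_nodes(adjacency, "b")))  # ['a', 'b', 'c']
```

Each node enters the stack at most once, so the work is proportional to the size of the component, and a long chain of nodes is no problem.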
You can simply use a Breadth-first search starting from your given node or any node.
In Networkx you can have the tree-graph from your starting node using the function:
bfs_tree(G, source, reverse=False)
Here is a link to the doc: Network bfs_tree.
Here is a recursive algorithm to get all nodes connected to an input node.
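def create_subgraph(G,sub_G,start_node):
    sub_G.add_node(start_node)
    #G.neighbors_iter() and Graph.add_path() no longer exist in NetworkX 2.x,
    #so G.neighbors() and add_edge() are used here instead
    for n in G.neighbors(start_node):
        #Only follow edges that are not yet present in the subgraph
        if not sub_G.has_edge(start_node,n):
            sub_G.add_edge(start_node,n)
            create_subgraph(G,sub_G,n)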
I believe the key to preventing infinite recursive calls is the condition that checks whether a node that is a neighbor in the original graph is already connected in the sub_G being created. Otherwise, you would keep going back and forth along edges between nodes that are already connected.
I tested it as follows:
import networkx as nx
import matplotlib.pyplot as plt

G = nx.erdos_renyi_graph(20,0.08)
nx.draw(G,with_labels = True)
plt.show()
sub_G = nx.Graph()
create_subgraph(G,sub_G,17)
nx.draw(sub_G,with_labels = True)
plt.show()
You will find in the attached image, the full graph and the sub_graph that contains node 17.
Given a tree whose nodes hold the integers 1 ~ 10, with a branching factor of 3 for all nodes, how can I write a function that traverses the tree and counts the length from root to leaf for EVERY path?
So for this example, let's say it needs to return this:
{1: 1, 2: 5}
I've tried this helper function:
def tree_lengths(t):
    temp = []
    for i in t.children:
        temp.append(1)
        temp += [e + 1 for e in tree_lengths(i)]
    return temp
There are too many errors in this code. For one, it leaves an imprint of every visited node in the returned list, so it is difficult to tell which values in that list are the ones I need. For another, if the tree is large, it does not carry forward the counts for the root and earlier nodes on the path before reaching the line "for i in t.children". It needs to, first, duplicate all paths from root to leaves; second, return a list containing only the final count for each path.
Please help! This is so difficult.
I'm not sure exactly what you are trying to do, but you'll likely need to define a recursive function that takes a node (the head of a tree or subtree) and an integer (the number of children you've traversed so far), and maybe a list of each visited node so far. If the node has no children, you've reached a leaf and you can print out whatever info you need. Otherwise, for each child, call this recursive function again with new parameters (+1 to count, the child node as head node, etc).
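One possible reading of the expected output {1: 1, 2: 5} is "one root-to-leaf path of length 1 and five of length 2". Under that assumption, the recursion described above could be sketched like this. The Node class and the example tree are hypothetical, since the question only shows a children attribute.

```python
from collections import Counter


class Node:
    # Hypothetical node type; the question only shows a `children` attribute.
    def __init__(self, value, children=None):
        self.value = value
        self.children = children or []


def path_lengths(node, depth=0):
    # Leaf: one root-to-leaf path ends here, and its length is `depth` edges.
    if not node.children:
        return Counter({depth: 1})
    # Internal node: add up the results of all children, one level deeper.
    counts = Counter()
    for child in node.children:
        counts += path_lengths(child, depth + 1)
    return counts


tree = Node(1, [
    Node(2),
    Node(3, [Node(5), Node(6), Node(7)]),
    Node(4, [Node(8), Node(9)]),
])
print(dict(path_lengths(tree)))  # {1: 1, 2: 5}
```

Only leaves contribute to the result, so intermediate nodes leave no imprint in the returned counts.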