interpret xgboost model dumped by dump_model [duplicate] - python

I am guessing that it is conditional probability given that the above (tree branch) condition exists. However, I am not clear on it.
If you want to read more about the data used or how this diagram was generated, see: http://machinelearningmastery.com/visualize-gradient-boosting-decision-trees-xgboost-python/

For a classification tree with 2 classes {0,1}, the value of a leaf node represents the raw score for class 1. It can be converted to a probability score by using the logistic function. The calculation below uses the left-most leaf as an example.
1/(1+np.exp(-1*0.167528))=0.5417843204057448
What this means is that if a data point ends up in this leaf, the probability of that data point being class 1 is 0.5417843204057448.

If it is a regression model (objective can be reg:squarederror), then the leaf value is the prediction of that tree for the given data point. The leaf value can be negative depending on your target variable. The final prediction for that data point is the sum of the leaf values across all the trees for that point.
If it is a classification model (objective can be binary:logistic), then the leaf value is a raw score representing the probability of the data point belonging to the positive class. The final probability prediction is obtained by summing the leaf values (raw scores) across all the trees and then transforming the sum to the range 0 to 1 with a sigmoid function. The leaf value (raw score) can be negative; a value of 0 corresponds to a probability of 1/2.
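As a minimal sketch of how the binary:logistic prediction described above is assembled (the leaf scores below are illustrative, not taken from the model above):
import numpy as np

# Hypothetical raw leaf scores reached by one data point, one per tree
leaf_scores = [0.167528, -0.05, 0.12]

raw_total = np.sum(leaf_scores)                 # sum of leaf values over all trees
probability = 1.0 / (1.0 + np.exp(-raw_total))  # sigmoid maps the sum to (0, 1)
print(probability)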
Please find more details about the parameters and outputs at - https://xgboost.readthedocs.io/en/latest/parameter.html

The leaf attribute is the predicted value. In other words, if the evaluation of a tree model ends at that terminal node (aka leaf node), then this is the value that is returned.
In pseudocode (the left-most branch of your tree model):
if(f1 < 127.5){
  if(f7 < 28.5){
    if(f5 < 45.4){
      return 0.167528f;
    } else {
      return 0.05f;
    }
  }
}

You are correct. Those probability values associated with leaf nodes represent the conditional probability of reaching that leaf given the conditions along the corresponding branch of the tree. Branches of trees can be presented as a set of rules. For example, the rule that #user1808924 mentioned in his answer represents the left-most branch of your tree model.
So, in short: The tree can be linearized into decision rules, where the outcome is the contents of the leaf node, and the conditions along the path form a conjunction in the if clause. In general, the rules have the form:
if condition1 and condition2 and condition3 then outcome.
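For instance, the left-most branch of the tree above linearizes to a single rule of this form:
if f1 < 127.5 and f7 < 28.5 and f5 < 45.4 then predict 0.167528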
Decision rules can be generated by constructing association rules with the target variable on the right. They can also denote temporal or causal relations.

Related

How to navigate through a non-recombinant binomial tree

I'm looking to see how I can navigate through a non-recombinant binomial tree.
I have set up my tree as follows:
import numpy as np

def arbolno(t, S0, u, d):
    # Build the non-recombinant tree level by level in a flat list:
    # node i has children at 2*i+1 (down move, factor d) and 2*i+2 (up move, factor u)
    S = [S0]
    for i in range(1, 2**(t+1) - 1):
        if i % 2 == 0:
            S.append(np.round(S[int(i/2 - 1)] * u, 3))
        else:
            S.append(np.round(S[int((i-1)/2)] * d, 3))
    return S
Now, I want to calculate a return over a holding period of, let's say, n moments.
How can I index correctly, given this way of listing my nodes? For instance, if I define a return period of 2 moments in time, then node number 14 would be compared with node 2.
I want to know whether there is a general formula for indexing through my list of nodes correctly.
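For reference, the layout in arbolno is the usual flat binary-tree indexing (root at index 0, children of node i at 2*i+1 and 2*i+2), so the node n moments earlier can be found by applying the parent formula repeatedly; a minimal sketch, assuming that layout (the helper name is illustrative):
def ancestor(i, n):
    # Parent of node i in a flat binary-tree layout rooted at index 0
    for _ in range(n):
        i = (i - 1) // 2
    return i

ancestor(14, 2)  # -> 2, so node 14 is compared with node 2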
Thanks!

Prune sklearn decision tree to ensure monotonicity

I need to prune a sklearn decision tree classifier in such a way that the indicated probability (the value on the right in the image) is monotonically increasing. For example, if you program a basic tree in python, you have:
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.tree._tree import TREE_LEAF
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data[:, 0].reshape(-1,1), np.where(iris.target==0,0,1)
tree = DecisionTreeClassifier(max_depth=3, random_state=123)
tree.fit(X,y)
percentages = tree.tree_.value[:,0,1]/np.sum(tree.tree_.value.reshape(-1,2), axis=1)
Now the leaves that do not follow the monotonicity, as indicated, must be eliminated,
so that the tree remains as follows:
Although the indicated example does not show it, a rule to consider is that if the leaves have different parents, then the leaf with the largest amount of data is kept. To deal with this I have been trying a brute-force algorithm, but it only performs the first iteration, and I need to apply the algorithm to bigger trees. The answer probably involves recursion, but with the sklearn tree structure I don't really know how to do it.
The following procedure sustains the pruning requirements you suggested: traverse the tree, identify non-monotonic leaves, and each time remove the non-monotonic leaves of the parent node with the fewest members, repeating until the monotonicity between leaves is sustained. Even though this remove-one-node-at-a-time approach adds time complexity, trees usually have limited depth. The conference paper "Pruning for Monotone Classification Trees" helped me understand monotonicity in trees; I then derived this approach to fit your scenario.
Since the need is to identify non-monotonic leaves from left to right, the first step is to post-order traverse the tree. If you are not familiar with tree traversals, that is completely normal; I suggest studying the mechanics from Internet sources before reading the function. You could also run the traversal function to see its findings; practical output will help you understand.
#We will define a traversal algorithm which will scan the nodes and leaves from left to right
#The traversal is recursive, we declare global lists to collect values from each recursion
traversal = []  #List to collect traversal steps
parents = []    #List to collect the parents of the collected nodes or leaves
is_leaves = []  #List to collect whether each collected traversal item is a leaf or not

# A function to do postorder tree traversal
def postOrderTraversal(tree, root, parent):
    if root != -1:
        #Recursion on left child
        postOrderTraversal(tree, tree.tree_.children_left[root], root)
        #Recursion on right child
        postOrderTraversal(tree, tree.tree_.children_right[root], root)
        traversal.append(root)  #Collect the name of node or leaf
        parents.append(parent)  #Collect the parent of the collected node or leaf
        is_leaves.append(is_leaf(tree, root))  #Collect whether the collected object is a leaf
Above, we recurse on the left and right children of each node via the attributes provided by the decision tree structure. The is_leaf() used above is a helper function, shown below.
def is_leaf(tree, node):
    if tree.tree_.children_left[node] == -1:
        return True
    else:
        return False
Decision tree nodes always have either two children or none. Therefore, checking only for the existence of a left child tells us whether the object in question is an internal node or a leaf. The tree returns -1 if the requested child does not exist.
As you have defined the non-monotonicity condition, the ratio of class-1 samples within each leaf is required. I have called this positive_ratio() (this is what you called "percentages").
def positive_ratio(tree): #The frequency of 1 values of leaves in binary classification tree:
    #Number of samples with value 1 in leaves / total number of samples in nodes/leaves
    return tree.tree_.value[:,0,1]/np.sum(tree.tree_.value.reshape(-1,2), axis=1)
The final helper function below returns the tree index of the node (1, 2, 3, etc.) with the minimum number of samples. This function requires the list of nodes whose leaves exhibit non-monotonic behavior. We call the n_node_samples property of the tree structure within this helper function. The node found is the one whose leaves will be removed.
def min_samples_node(tree, nodes): #Finds the node with the minimum number of samples among the provided list
    #Make a dictionary of number of samples of given nodes, and their index in the nodes list
    samples_dict = {tree.tree_.n_node_samples[node]: i for i, node in enumerate(nodes)}
    min_samples = min(samples_dict.keys())  #The minimum number of samples among the samples of nodes
    i_min = samples_dict[min_samples]  #Index of the node with minimum number of samples
    return nodes[i_min]  #The node with the minimum number of samples
After defining the helper functions, the wrapper function that performs the pruning iterates until the monotonicity of the tree is sustained. Desired monotonic tree is returned.
def prune_nonmonotonic(tree): #Prune non-monotonic nodes of a binary classification tree
    while True: #Repeat until monotonicity is sustained
        #Clear the traversal lists for a new scan
        traversal.clear()
        parents.clear()
        is_leaves.clear()
        #Do a post-order traversal of tree so that the leaves will be returned in order from left to right
        postOrderTraversal(tree,0,None)
        #Filter the traversal outputs by keeping only leaves and leaving out the nodes
        leaves=[traversal[i] for i,leaf in enumerate(is_leaves) if leaf == True]
        leaves_parents=[parents[i] for i,leaf in enumerate(is_leaves) if leaf == True]
        pos_ratio=positive_ratio(tree) #List of positive samples ratio of the nodes of binary classification tree
        leaves_pos_ratio=[pos_ratio[i] for i in leaves] #List of positive samples ratio of the traversed leaves
        #Detect the non-monotonic pairs by comparing the leaves side-by-side
        nonmonotone_pairs=[[leaves[i],leaves[i+1]] for i,ratio in enumerate(leaves_pos_ratio[:-1]) if (ratio>=leaves_pos_ratio[i+1])]
        #Make a flattened and unique list of leaves out of pairs
        nonmonotone_leaves=[]
        for pair in nonmonotone_pairs:
            for leaf in pair:
                if leaf not in nonmonotone_leaves:
                    nonmonotone_leaves.append(leaf)
        if len(nonmonotone_leaves)==0: #If all leaves show monotonic properties, then break
            break
        #List the parent nodes of the non-monotonic leaves
        nonmonotone_leaves_parents=[leaves_parents[i] for i in [leaves.index(leave) for leave in nonmonotone_leaves]]
        node_min=min_samples_node(tree, nonmonotone_leaves_parents) #The node with minimum number of samples
        #Prune the tree by removing the children of the detected non-monotonic and lowest number of samples node
        tree.tree_.children_left[node_min]=-1
        tree.tree_.children_right[node_min]=-1
    return tree
The enclosing "while" loop continues until the traversed leaves no longer exhibit non-monotonicity. min_samples_node() identifies the node that contains non-monotonic leaves and has the fewest samples among such nodes. Once its left and right children are replaced with the value -1, the tree is pruned, and the next "while" iteration yields a completely different tree traversal to identify and remove the remaining non-monotonicity.
The below images show the unpruned and pruned trees, respectively.
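A minimal usage sketch on the tree fitted in the question (assuming the snippets above have been run in the same session; plotting is optional):
pruned_tree = prune_nonmonotonic(tree)  #Prunes the fitted DecisionTreeClassifier in place and returns it
plot_tree(pruned_tree)                  #Inspect the pruned tree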

Construct a bipartite representation of a directed network for use with hopcroft-karp matching function in networkx

I am using the hopcroft-karp algorithm in networkx on a directed network which I have transformed into a bipartite representation. However, the bipartite network is not allowed to have the same vertex in both left and right node sets. There is a self loop included in my directed network, so as a workaround, I rename the vertices in the left node set as (X1_+, X2_+,X3_+) and in the right set as (X1_-,X2_-,X3_-). The directed network and the corresponding bipartite representation are as follows:
The correct result for the maximum matching, in terms of the dictionary output given by the hopcroft-karp algorithm in networkx, should be
{X2_-: X1_+, X3_-: X3_+}, so that X1_- would be the only unmatched node, since it does not appear as a key in the output dictionary.
I followed the networkx documentation to get the corresponding bipartite network and then used the hopcroft_karp_matching function to obtain the maximum matching. The code I implemented is outlined below, along with the result:
import networkx as nx

G_eg = nx.Graph()  # bipartite representation as an undirected graph
# Add nodes w/ the node attribute bipartite
G_eg.add_nodes_from(["X1_+", "X3_+"], bipartite=1)
G_eg.add_nodes_from(["X2_-", "X3_-"], bipartite=0)
# Add edges only between nodes of opposite node sets
G_eg.add_edges_from([("X1_+", "X2_-"), ("X1_+", "X3_-"), ("X3_+", "X3_-")])
#left, right nodes based on node attribute
left_nodes = {n for n, d in G_eg.nodes(data=True) if d["bipartite"] == 1}
right_nodes = set(G_eg) - left_nodes
#apply hopcroft-karp algorithm
nx.bipartite.hopcroft_karp_matching(G_eg, left_nodes)
output:
{'X3_+': 'X3_-', 'X1_+': 'X2_-', 'X3_-': 'X3_+', 'X2_-': 'X1_+'}
From my output, I can see that X1_- is not listed as a key, and therefore it is deemed unmatched. However, why does the output give the result in this way; that is, why does it ignore the fact that the network is directed? By including X1_+: X2_- in the output dictionary, it implies that X1_+ is matched. Is my interpretation incorrect?
Indeed, any directions you impose on the edges of the graph are ignored by hopcroft_karp_matching (and the other implementations of maximum_matching). Regarding the format of the output, the documentation specifies that the output is
matches – The matching is returned as a dictionary, matches, such that matches[v] == w if node v is matched to node w. Unmatched nodes do not occur as a key in matches.
In particular, the matched nodes are exactly the keys of the dictionary, in your case {'X3_+', 'X1_+', 'X3_-', 'X2_-'}. This also means that the output could always be cut in half, with the matched nodes being available as the union of keys and values; I assume the redundant information is included simply because it might be convenient in some cases.
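For instance, a minimal sketch of keeping only one direction of the matching, reusing left_nodes from the question:
matching = nx.bipartite.hopcroft_karp_matching(G_eg, left_nodes)
one_sided = {u: v for u, v in matching.items() if u in left_nodes}
# e.g. {'X3_+': 'X3_-', 'X1_+': 'X2_-'} with the output above; the '-' side entries are the redundant half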

Find clique with maximum node value in graph

I have a graph of nodes, each assigned a certain value. I wish to find the clique in the graph with the maximum sum of node values (note: not necessarily the maximum clique).
One approach I thought of would be a greedy algorithm (a rough sketch follows the list) that:
1. Selects the node with the largest value from the graph
2. Selects the next-largest node that is connected to all the previously selected nodes, provided the sum of the node values increases
3. Repeats 2 until the sum does not increase any more
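A rough sketch of that greedy idea, assuming the graph is given as an adjacency dict {node: set of neighbours} and the values as a dict {node: value} (both names are illustrative):
def greedy_clique_value(adj, values):
    clique = [max(values, key=values.get)]  # 1. start from the largest-value node
    total = values[clique[0]]
    while True:
        # 2. candidates connected to every node already in the clique
        candidates = [n for n in adj if n not in clique
                      and all(n in adj[c] for c in clique)]
        if not candidates:
            break
        best = max(candidates, key=values.get)
        if values[best] <= 0:  # adding it would not increase the sum
            break
        clique.append(best)
        total += values[best]
    return clique, total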
However, this approach is not correct: imagine a graph with a clique of 8 nodes, all of value 1, and a single node of value 7. The correct answer here is 8, not 7. My actual problem has a complex graph, but here are some examples of the desired result with the actual graph and the biggest sum, which I found manually:
Here is a simpler example with the solution:
What would be the best graph representation, and an efficient and correct way to solve this problem in Python on an arbitrary graph, without libraries?

Propagate ratings in a graph structure

I have the following problem:
Consider a weighted directed graph.
Each node has a rating, and the weighted edges represent
the "influence" of a node on its neighbors.
When a node's rating changes, the neighbors see their own ratings modified (positively or negatively).
How do I propagate a new rating on one node?
I think this should be a standard algorithm but which one?
This is a general question but in practice I am using Python ;)
Thanks
[EDIT]
The rating is a simple float value between 0 and 1: [0.0, 1.0]
There is certainly a convergence issue: I just want to limit the propagation to a few iterations...
There is an easy standard way to do it as follows:
let G = (V, E) be the graph
let w: E -> R be a weight function such that w(e) = weight of edge e
let A be an array such that A[v] = rating(v)
let n be the required number of iterations

for i from 1 to n (inclusive) do:
    for each vertex v in V:
        A'[v] = calculateNewRating(v, A, w)  #use the array A for the old values and w
    A <- A'  #assign A with the new values which are stored in A'
return A
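A minimal Python sketch of this iteration, assuming the graph is stored as a dict of weighted incoming edges and the rating update rule is passed in as a function (all names here are illustrative):
def propagate_ratings(in_edges, ratings, n, calculate_new_rating):
    # in_edges: {v: {u: w(u, v)}} -- weighted incoming edges of each vertex
    # ratings:  {v: float}        -- current rating A[v] of each vertex
    A = dict(ratings)
    for _ in range(n):
        # compute all new ratings from the old values before overwriting A
        A_prime = {v: calculate_new_rating(v, A, in_edges) for v in A}
        A = A_prime
    return A

# Example update rule in the spirit of the first variation below
def weighted_sum(v, A, in_edges):
    return sum(A[u] * w for u, w in in_edges[v].items())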
However, for some cases you might have better algorithms, based on the features of the graph and on how the rating of each node is recalculated. For example:
Assume rating'(v) = sum(rating(u) * w(u,v)) over each (u,v) in E, and you get a variation of PageRank, which is guaranteed to converge to the principal eigenvector if the graph is strongly connected (Perron-Frobenius theorem), so calculating the final value is simple.
Assume rating'(v) = max{ rating(u) | for each (u,v) in E }, then it is also guaranteed to converge and can be solved linearly using strongly connected components. This thread discusses this case.
