ID3 in Python & Pandas

I'm new to programming and machine learning, but I'm trying to create an ID3 decision tree from scratch. Currently, entropy, information gain, and finding the first node are good to go, but I'm really struggling with building the remainder of the tree. I'm just including the code for building the tree below, but am happy to include everything if that helps...
def _tree(self, df):
    features = df.loc[:, df.columns != self.label_name].columns.tolist()
    # get root node
    att, childs = self._node(df)
    pred_list = []
    # if split is pure, then create leaf node with prediction label
    if len(df[att].unique()) == 1:
        self.attributes.append(att)
        self.children.append(childs)
        for branch in childs:
            df_subset = df[df[att] == branch]
            df_subset = df_subset.drop(columns=att, axis=1)
            pred = df_subset[self.label_name].mode().tolist()[0]
            pred_list.append(pred)
        self.predictions.append(pred_list)
    # if split is not pure, then create new node with blank prediction;
    # loop through all branches and partition dataset;
    # call _tree recursively to repeat the process on each new subset of data
    else:
        # keep making nodes if there are multiple features left and we haven't exceeded a depth of 3
        if len(features) > 1 and len(self.attributes) < 3:
            self.attributes.append(att)
            self.children.append(childs)
            for branch in childs:
                pred = ''
                pred_list.append(pred)
            self.predictions.append(pred_list)
            # recursion for each branch/child of the new node created above
            for branch in childs:
                df_subset = df[df[att] == branch]
                df_subset = df_subset.drop(columns=att, axis=1)
                self._tree(df_subset)
        # make final prediction if no features are left to split on, or we already have 2 nodes
        # (node 3 will be the final leaf and maximum depth for this project)
        else:
            self.attributes.append(att)
            self.children.append(childs)
            for branch in childs:
                df_subset = df[df[att] == branch]
                df_subset = df_subset.drop(columns=att, axis=1)
                pred = df[self.label_name].mode().tolist()[0]
                pred_list.append(pred)
            self.predictions.append(pred_list)
I'm trying to collect three lists: the attribute used for each node, the children of that node, and the predictions for each child (empty strings where there is no prediction yet). I'm really struggling with the recursion call right now, but any other advice or help would be much appreciated!
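For what it's worth, a common way to sidestep the parallel-list bookkeeping is to have each recursive call return its own subtree and nest the results in a dict. The sketch below is not your class: best_attribute is a hypothetical stand-in for whatever your _node method uses to pick the best split, and the tree is a nested {attribute: {branch value: subtree or leaf label}} dict rather than the three lists.

def build_tree(df, label_name, depth=0, max_depth=3):
    """Sketch: df is a pandas DataFrame; returns a nested dict or a leaf label."""
    labels = df[label_name]
    # stop and return a leaf (majority label) if the node is pure,
    # there are no features left to split on, or the depth limit is reached
    if labels.nunique() == 1 or df.shape[1] == 1 or depth >= max_depth:
        return labels.mode().iloc[0]
    att = best_attribute(df, label_name)  # hypothetical: your information-gain picker
    tree = {att: {}}
    for value, df_subset in df.groupby(att):
        subtree = build_tree(df_subset.drop(columns=att), label_name, depth + 1, max_depth)
        tree[att][value] = subtree
    return tree

Because the recursion returns each subtree instead of appending to shared lists, a branch's predictions stay attached to the branch that produced them, which is usually easier to debug than index-aligned lists.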

Related

Untrackable object attribute

I am trying to adapt this code here: https://github.com/nachonavarro/gabes/blob/master/gabes/circuit.py (line 136)
but am coming across an issue: the attribute .chosen_label is used several times, yet I can find no mention of it anywhere in the code. The objects left_gate, right_gate and gate are Gate objects (https://github.com/nachonavarro/gabes/blob/master/gabes/gate.py)
def reconstruct(self, labels):
    levels = [[node for node in children]
              for children in anytree.LevelOrderGroupIter(self.tree)][::-1]
    for level in levels:
        for node in level:
            gate = node.name
            if node.is_leaf:
                garblers_label = labels.pop(0)
                evaluators_label = labels.pop(0)
            else:
                left_gate = node.children[0].name
                right_gate = node.children[1].name
                garblers_label = left_gate.chosen_label
                evaluators_label = right_gate.chosen_label
            output_label = gate.ungarble(garblers_label, evaluators_label)
            gate.chosen_label = output_label
    return self.tree.name.chosen_label
The code runs without error and the .chosen_label is a Label object (https://github.com/nachonavarro/gabes/blob/master/gabes/label.py)
Any help would be much appreciated
The attribute is set in the same method:
for level in levels:
    for node in level:
        gate = node.name
        if node.is_leaf:
            # set `garblers_label` and `evaluators_label` from
            # the next two elements of the `labels` argument
            garblers_label = labels.pop(0)
            evaluators_label = labels.pop(0)
        else:
            # use the child nodes of this node to reach their gates, and
            # set `garblers_label` and `evaluators_label` to the left and
            # right `chosen_label` values, respectively
            garblers_label = node.children[0].name.chosen_label
            evaluators_label = node.children[1].name.chosen_label
        # generate the `Label()` instance based on `garblers_label` and `evaluators_label`
        output_label = gate.ungarble(garblers_label, evaluators_label)
        gate.chosen_label = output_label
I'm not familiar with the anytree library, so I had to look up the documentation: the anytree.LevelOrderGroupIter(...) function orders the nodes in a tree from root to leaves, grouped by level. The tree here appears to be a balanced binary tree (each node has either 0 or 2 child nodes), so you get a list with [(rootnode,), (level1_left, level1_right), (level2_left_left, level2_left_right, level2_right_left, level2_right_right), ...]. The function loops over these levels in reverse order. This means that leaves are processed first.
Once all node.is_leaf nodes have their chosen_label set, the other non-leaf nodes can reference the chosen_label value on the leaf nodes on the level already processed before them.
So, assuming that labels is a list with at least twice the number of leaf nodes in the tree, you end up with those label values aggregated at every level via the gate.ungarble() function, and the final value is found at the root node via self.tree.name.chosen_label.
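If the anytree part is the unfamiliar bit, here is a tiny self-contained sketch (made-up node names, not the gabes circuit) showing how LevelOrderGroupIter groups nodes and why reversing it lets parents rely on values already computed for their children:

import anytree

root = anytree.Node("root")
left = anytree.Node("left", parent=root)
right = anytree.Node("right", parent=root)
anytree.Node("leaf_a", parent=left)
anytree.Node("leaf_b", parent=left)

levels = [list(children) for children in anytree.LevelOrderGroupIter(root)]
print([[n.name for n in level] for level in levels])
# [['root'], ['left', 'right'], ['leaf_a', 'leaf_b']]

# iterating over levels[::-1] visits the deepest nodes first, so by the time
# a non-leaf node is processed, all of its children have already been handled
for level in levels[::-1]:
    for node in level:
        pass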

Feature importance 'gain' in XGBoost

I want to understand how the feature importance in xgboost is calculated by 'gain'. From https://towardsdatascience.com/be-careful-when-interpreting-your-features-importance-in-xgboost-6e16132588e7:
‘Gain’ is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch there was some wrongly classified elements, after adding the split on this feature, there are two new branches, and each of these branch is more accurate (one branch saying if your observation is on this branch then it should be classified as 1, and the other branch saying the exact opposite).
In scikit-learn the feature importance is calculated by the gini impurity/information gain reduction of each node after splitting using a variable, i.e. weighted impurity average of node - weighted impurity average of left child node - weighted impurity average of right child node (see also: https://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting)
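In equation form (my paraphrase of the scikit-learn importance described above, not a quote from its docs), the contribution of a single node $t$ that splits on a feature is roughly

\Delta I(t) = \frac{N_t}{N}\, i(t) \;-\; \frac{N_{t_L}}{N}\, i(t_L) \;-\; \frac{N_{t_R}}{N}\, i(t_R)

where $N$ is the total number of samples, $N_t$, $N_{t_L}$, $N_{t_R}$ are the samples reaching the node and its left/right children, and $i(\cdot)$ is the impurity (e.g. Gini); a feature's importance is then the sum of $\Delta I$ over all nodes that split on it, usually normalized to sum to one.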
I wonder if xgboost also uses this approach, using information gain or accuracy as stated in the citation above. I've tried to dig into the code of xgboost and found this method (irrelevant parts already cut off):
def get_score(self, fmap='', importance_type='gain'):
    trees = self.get_dump(fmap, with_stats=True)
    importance_type += '='
    fmap = {}
    gmap = {}
    for tree in trees:
        for line in tree.split('\n'):
            # look for the opening square bracket
            arr = line.split('[')
            # if no opening bracket (leaf node), ignore this line
            if len(arr) == 1:
                continue
            # look for the closing bracket, extract only info within that bracket
            fid = arr[1].split(']')
            # extract gain or cover from string after closing bracket
            g = float(fid[1].split(importance_type)[1].split(',')[0])
            # extract feature name from string before closing bracket
            fid = fid[0].split('<')[0]
            if fid not in fmap:
                # if the feature hasn't been seen yet
                fmap[fid] = 1
                gmap[fid] = g
            else:
                fmap[fid] += 1
                gmap[fid] += g
    return gmap
So 'gain' is extracted from the dump file of each booster, but how is it actually measured?
Nice question. The gain is calculated using this equation:
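\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma

Here $G_L$, $G_R$ are the sums of first-order gradients and $H_L$, $H_R$ the sums of second-order gradients (Hessians) of the loss over the instances falling into the left and right children, $\lambda$ is the L2 regularization term, and $\gamma$ is the complexity cost of adding a leaf (this is the form given in the tutorial linked below).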
For a deep explanation read this: https://xgboost.readthedocs.io/en/latest/tutorials/model.html

Object doesn't change after reassignment

I am trying to prune decision trees, using some procedure.
After I do all sorts of manipulations on the current tree, I get different values from the function validate() (inside prune_helper, via print(self.validate(validation_data))) before and after manipulating the tree, which is great (it means that something does happen to the tree for a given node).
def prune_helper(self, curr_node):
    all_err = []
    # The current tree.
    tree_1 = self
    all_err.append((tree_1._generalization_error(), tree_1))
    # Replace node with leaf 1.
    tree_2 = self._replace_leaf(curr_node, DEMOCRAT)
    all_err.append((tree_2._generalization_error(), tree_2))
    # Replace node with leaf 0.
    tree_3 = self._replace_leaf(curr_node, REPUBLICAN)
    all_err.append((tree_3._generalization_error(), tree_3))
    # Replace node with left subtree.
    test_4 = self._replace_subtree(curr_node, LEFT)
    all_err.append((test_4._generalization_error(), test_4))
    # Replace node with middle subtree.
    test_5 = self._replace_subtree(curr_node, MIDDLE)
    all_err.append((test_5._generalization_error(), test_5))
    # Replace node with right subtree.
    test_6 = self._replace_subtree(curr_node, RIGHT)
    all_err.append((test_6._generalization_error(), test_6))
    all_err.sort(key=lambda tup: tup[0])
    min_tree = all_err[0][1]
    # print(self.validate(validation_data)) <-- This
    self = copy.deepcopy(min_tree)
    # print(self.validate(validation_data)) <-- Mostly different than this
    curr_node.pruned = True

def prune(self, curr_node=None):
    if curr_node is None:
        curr_node = self._root
    # Node is a leaf.
    if curr_node.leaf:
        self.prune_helper(curr_node=curr_node)
        return
    # Node is not a leaf; we may assume it has all three children.
    if curr_node.left.pruned is False:
        self.prune(curr_node=curr_node.left)
    if curr_node.middle.pruned is False:
        self.prune(curr_node=curr_node.middle)
    if curr_node.right.pruned is False:
        self.prune(curr_node=curr_node.right)
    # We'll prune the current node only after we've checked all of its children.
    self.prune_helper(curr_node=curr_node)
But the problem is, when I want to calculate some value for the tree after pruning it "completely", I get the same value returned by validate(), which means that maybe the tree wasn't changed after all, and the effect on the tree only took place inside prune_helper, not in prune_tree:
def prune_tree(tree):
    # print(tree.validate(validation_data)) <- This
    tree.prune()
    # print(tree.validate(validation_data)) <- Same as this
    tree.print_tree('after')
I think that maybe the problem is with the way I try to change the self object or something. Is there anything obvious that I've done wrong implementing the whole thing that may lead to these results?
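One thing that stands out (an observation about Python semantics, not a full diagnosis of this code): self = copy.deepcopy(min_tree) inside prune_helper only rebinds the local name self; the object the caller still holds is untouched, which would explain why prune_tree sees no change. A minimal illustration:

import copy

class Tree:
    def __init__(self, value):
        self.value = value

    def rebind_self(self, other):
        # only rebinds the local name `self`; the caller's object is unchanged
        self = copy.deepcopy(other)

    def copy_state_from(self, other):
        # mutating the existing object (here by replacing its __dict__) is visible to the caller
        self.__dict__ = copy.deepcopy(other.__dict__)

t, best = Tree(1), Tree(2)
t.rebind_self(best)
print(t.value)   # still 1
t.copy_state_from(best)
print(t.value)   # now 2

So instead of assigning to self, prune_helper would need to copy min_tree's state into the existing tree object, or return min_tree and have the caller keep the returned tree.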

Is it possible to retrieve the train rows id within each leaf in a DecisionTreeRegressor of scikit-learn?

Currently, I can retrieve the ID of the node of my tree (grown on my training sample) to which each row of my test sample is most likely to belong:
tree.tree_.apply(np.array(X_test).astype(np.float32)) where X_test represents the inputs of the decision tree.
But, for each leaf of my grown tree, I would like to get the IDs of the training rows it contains, so that I would know which training samples are most similar to a given test input.
I ended up using the "apply" function on my training sample to get the leaf_id each row belongs to.
def get_nearest_points(self, tr, input_train):
    inside_leaves = {}
    tmp = tr.tree_.apply(np.array(input_train).astype(np.float32))
    leaves_list = set(tmp)
    for leaf in leaves_list:
        inside_leaves[leaf] = [idx for idx, elt in enumerate(tmp) if elt == leaf]
    return inside_leaves
inside_leaves is now a dictionary containing, for each leaf_id, the list of training rows that fall into that leaf.
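A standalone usage sketch (names like X_train, X_test and the max_depth value are illustrative, not from the original code), showing the same grouping plus the lookup for a test row:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X_train, y_train = make_regression(n_samples=100, n_features=4, random_state=0)
X_test, _ = make_regression(n_samples=5, n_features=4, random_state=1)

tr = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)

# leaf id of every training row, grouped into {leaf_id: [training row indices]}
train_leaves = tr.tree_.apply(np.asarray(X_train, dtype=np.float32))
inside_leaves = {}
for idx, leaf in enumerate(train_leaves):
    inside_leaves.setdefault(leaf, []).append(idx)

# for each test row, the training rows sharing its leaf are the "most similar" ones
test_leaves = tr.tree_.apply(np.asarray(X_test, dtype=np.float32))
for i, leaf in enumerate(test_leaves):
    print("test row", i, "shares leaf", leaf, "with training rows", inside_leaves[leaf])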

How to cleanly avoid loops in recursive function (breadth-first traversal)

I'm writing a recursive breadth-first traversal of a network. The problem I ran into is that the network often looks like this:
  1
 / \
2   3
 \ /
  4
  |
  5
So my traversal starts at 1, then traverses to 2, then 3. The next step is to proceed to 4, so 2 traverses to 4. After this, 3 traverses to 4, and suddenly I'm duplicating work as both lines try to traverse to 5.
The solution I've found is to create a list called self.already_traversed, and every time a node is traversed, I append it to the list. Then, when I'm traversing from node 4, I check to make sure it hasn't already been traversed.
The problem here is that I'm using an instance variable for this, so I need a way to set up the list before the first recursion and a way to clean it up afterwards. The way I'm currently doing this is:
self.already_traversed = []
self._traverse_all_nodes(start_id)
self.already_traversed = []
Of course, it sucks to be twiddling variables outside of the function that's using them. Is there a better way to do this so it can be put into my traversal function?
Here's the actual code, though I recognize it's a bit dense:
def _traverse_all_nodes(self, root_authority, max_depth=6):
    """Recursively build a networkx graph

    Process is:
     - Work backwards through the authorities for self.cluster_end and all
       of its children.
     - For each authority, add it to a networkx graph, if:
        - it happened after self.cluster_start
        - it's in the Supreme Court
        - we haven't exceeded a max_depth of six cases.
        - we haven't already followed this path
    """
    g = networkx.Graph()
    if hasattr(self, 'already_traversed'):
        is_already_traversed = (root_authority.pk in self.visited_nodes)
    else:
        # First run. Create an empty list.
        self.already_traversed = []
        is_already_traversed = False
    is_past_max_depth = (max_depth <= 0)
    is_cluster_start_obj = (root_authority == self.cluster_start)
    blocking_conditions = [
        is_past_max_depth,
        is_cluster_start_obj,
        is_already_traversed,
    ]
    if not any(blocking_conditions):
        print " No blocking conditions. Pressing on."
        self.visited_nodes.append(root_authority.pk)
        for authority in root_authority.authorities.filter(
                docket__court='scotus',
                date_filed__gte=self.cluster_start.date_filed):
            g.add_edge(root_authority.pk, authority.pk)
            # Combine our present graph with the result of the next
            # recursion
            g = networkx.compose(g, self._build_graph(
                authority,
                max_depth - 1,
            ))
    return g

def add_clusters(self):
    """Do the network analysis to add clusters to the model.

    Process is to:
     - Build a networkx graph
     - For all nodes in the graph, add them to self.clusters
    """
    self.already_traversed = []
    g = self._traverse_all_nodes(
        self.cluster_end,
        max_depth=6,
    )
    self.already_traversed = []
Check out:
How do I pass a variable by reference?
which contains an example of how to pass a list by reference. If you pass the list by reference, every call to your function will refer to the same list.
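A minimal sketch of that idea applied to a traversal like this one (a simplified dict-of-neighbours graph, not the original networkx/Django code): the visited collection is created on the first call and passed down, so there is no instance variable to set up or clean up afterwards.

def traverse(graph, node, max_depth=6, visited=None):
    """Depth-limited traversal that skips nodes already seen.

    `visited` is created on the first call and then shared by every
    recursive call, so no instance variable is needed.
    """
    if visited is None:
        visited = set()
    if max_depth <= 0 or node in visited:
        return []
    visited.add(node)
    reached = [node]
    for neighbour in graph.get(node, []):
        reached += traverse(graph, neighbour, max_depth - 1, visited)
    return reached

# the diamond-shaped network from the question: 1 -> 2, 3 -> 4 -> 5
graph = {1: [2, 3], 2: [4], 3: [4], 4: [5]}
print(traverse(graph, 1))  # [1, 2, 4, 5, 3] -- nodes 4 and 5 are visited only once

Using visited=None and creating the set inside the function avoids the mutable-default-argument pitfall while still letting every recursive call share one collection.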
