I am currently viewing the decision tree using the following code. Is there a way that we can export some calculated fields as output too?
For example, is it possible to display the sum of an input attribute at each node, i.e. sum of feature 1 from 'X' data array in the leafs of the tree.
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:]
y = iris.target
#%%
from sklearn.tree import DecisionTreeClassifier
alg=DecisionTreeClassifier( max_depth=5,min_samples_leaf=2, max_leaf_nodes = 10)
alg.fit(X,y)
#%%
## View tree
import graphviz
from sklearn import tree
dot_data = tree.export_graphviz(alg,out_file=None, node_ids = True, proportion = True, class_names = True, filled = True, rounded = True)
graph = graphviz.Source(dot_data)
graph
There is plenty of discussion about decision trees in scikit-learn on the github page. There are answers on this SO question and this scikit-learn documentation page that provide the framework to get you started. With all the links out of the way, here are some functions that allow a user to address the question in a generalizable manner. The functions could be easily modified since I don't know if you mean all the leaves or each leaf individually. My approach is the latter.
The first function uses apply as a cheap way to find the indices of the leaf nodes. It's not necessary to achieve what you're asking, but I included it as a convenience since you mentioned you want to investigate leaf nodes and leaf node indices may be unknown a priori.
def find_leaves(X, clf):
"""A cheap function to find leaves of a DecisionTreeClassifier
clf must be a fitted DecisionTreeClassifier
"""
return set(clf.apply(X))
Result on the example:
find_leaves(X, alg)
{1, 7, 8, 9, 10, 11, 12}
The following function will return an array of values that satisfy the conditions of node and feature, where node is the index of the node from the tree that you want values for and feature is the column (or feature) that you want from X.
def node_feature_values(X, clf, node=0, feature=0, require_leaf=False):
"""this function will return an array of values
from the input array X. Array values will be limited to
1. samples that passed through <node>
2. and from the feature <feature>.
clf must be a fitted DecisionTreeClassifier
"""
leaf_ids = find_leaves(X, clf)
if (require_leaf and
node not in leaf_ids):
print("<require_leaf> is set, "
"select one of these nodes:\n{}".format(leaf_ids))
return
# a sparse array that contains node assignment by sample
node_indicator = clf.decision_path(X)
node_array = node_indicator.toarray()
# which samples at least passed through the node
samples_in_node_mask = node_array[:,node]==1
return X[samples_in_node_mask, feature]
Applied to the example:
values_arr = node_feature_values(X, alg, node=12, feature=0, require_leaf=True)
array([6.3, 5.8, 7.1, 6.3, 6.5, 7.6, 7.3, 6.7, 7.2, 6.5, 6.4, 6.8, 5.7,
5.8, 6.4, 6.5, 7.7, 7.7, 6.9, 5.6, 7.7, 6.3, 6.7, 7.2, 6.1, 6.4,
7.4, 7.9, 6.4, 7.7, 6.3, 6.4, 6.9, 6.7, 6.9, 5.8, 6.8, 6.7, 6.7,
6.3, 6.5, 6.2, 5.9])
Now the user can perform whatever mathematical operation is desired on the subset of samples for a given feature.
i.e. sum of feature 1 from 'X' data array in the leafs of the tree.
print("There are {} total samples in this node, "
"{}% of the total".format(len(values_arr), len(values_arr) / float(len(X))*100))
print("Feature Sum: {}".format(values_arr.sum()))
There are 43 total samples in this node,28.666666666666668% of the total
Feature Sum: 286.69999999999993
Update
After re-reading the question, this is the only solution I can put together quickly that doesn't involve modifying scikit source code for export.py. Code below still relies on previously defined functions. This code modifies the dotstring via pydot and networkx.
# Load the data from `dot_data` variable, which you defined.
import pydot
dot_graph = pydot.graph_from_dot_data(dot_data)[0]
import networkx as nx
MG = nx.nx_pydot.from_pydot(dot_graph)
# Select a `feature` and edit the `dot` string in `networkx`.
feature = 0
for n in find_leaves(X, alg):
nfv = node_feature_values(X, alg, node=n, feature=feature)
MG.node[str(n)]['label'] = MG.node[str(n)]['label'] + "\nfeature_{} sum: {}".format(feature, nfv.sum())
# Export the `networkx` graph then plot using `graphviz.Source()`
new_dot_data = nx.nx_pydot.to_pydot(MG)
graph = graphviz.Source(new_dot_data.create_dot())
graph
Notice all the leaves have the sum of values from X for feature 0.
I think the best way to accomplish what you're asking would be to modify tree.py and/or export.py to natively support this feature.
Related
Does computeSVD() use map , reduce
since it is a predefined function?
i couldn't know the code of the function.
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([
Vectors.sparse(5, {1: 1.0, 3: 7.0}),
Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
])
mat = RowMatrix(rows)
# Compute the top 5 singular values and corresponding singular vectors.
svd = mat.computeSVD(5, computeU=True) <------------- this function
U = svd.U # The U factor is a RowMatrix.
s = svd.s # The singular values are stored in a local dense vector.
V = svd.V # The V factor is a local dense matrix.
It does, from Spark documentation
This page documents sections of the MLlib guide for the RDD-based API (the spark.mllib package). Please see the MLlib Main Guide for the DataFrame-based API (the spark.ml package), which is now the primary API for MLlib.
If you want to look at code base, here it is https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L328
How do you use the new Polynomials sub-package in numpy to give it new x values and get an output of y values?
https://numpy.org/doc/stable/reference/routines.polynomials.package.html
In prior versions of numpy it went something like this:
poly = np.poly1d(np.polyfit(x, y, 3)
new_x = np.linspace(0, 100)
new_y = poly(new_x)
The new version I am struggling to give it x values that give me the y values of each?
from numpy.polynomial import Polynomial
poly = Polynomial(Polynomial.fit(x, y, 3))
When I give it an array of x it just returns the coefficients.
You can directly call the resulting series to evaluate it:
from numpy.polynomial import Polynomial
poly = Polynomial.fit(x, y, 3)
new_y = poly(new_x)
Check this page of the documentation it has several examples.
Unfortunately, the answer by #Joan Charmant and the supportive comment #rh109019 do not work.
The intuitive way suggested by #Joan Charmant is, basically, what the question's about: it doesn't work.
Evidently, there is a new method introduced in numpy.polynomial.polynomial devoted specifically to evaluating polynomials. See here.
Here's my code where I'm comparing the two approaches.
import numpy as np
Pgauge = np.asarray([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
NIST = np.asarray([1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1, 8.1])
calibrationCurve = np.polynomial.polynomial.Polynomial.fit(Pgauge,
NIST,
deg=1
)
print("The polynomial: {}".format(calibrationCurve))
x = np.asarray([0, 1]) # values of x to evaluate the polynomial at
c = calibrationCurve.coef # coefficients of the polynomial
print("The intuitive (wrong) way: {}".format(calibrationCurve(x)))
print("The correct way: {}".format(np.polynomial.polynomial.polyval(x, c)))
The first print command prints out the polynomial:4.6+3.5x.
If we want to evaluate it at the points 0 and 1 (x = np.asarray([0, 1])), we expect to get 4.6 and 8.1 respectively.
The second print command (that reads "The intuitive (wrong) way"), uses the method suggested by #Joan Charmant. It gives [0.1, 1.1] as the result. Which is wrong. Though seemingly, it looks ok: it gives two numbers as expected. But the numbers themselves are wrong. I don't know how these numbers were calculated. But if I had a bigger series of data, I wouldn't go with a calculator through it and assume I've got a correct result.
The last print command makes use of the polyval method suggested in the user manual that I cited above. And it works perfectly well. It gives [4.6, 8.1] as the result.
It so happens that my answer is wrong as well (see all the comments below by #user2357112 supports Monica).
But still, I'll leave it here for the folks who, like me, fell the victim of the confusing new numpy.polynomial library.
FIRST: why my code is wrong?
Everything's ok with it. But the line print("The polynomial: {}".format(calibrationCurve)) doesn't give me what, I think, it must give me. It takes the correct polynomial, changes its coefficients somehow and prints out a new polynomial with the changed coefficients. Still, it does store the correct polynomial in its memory and when you do the thing suggested by #Joan Charmant it may give you the correct answer if you ask it properly.
SECOND: how to use the new numpy.polynomial library in order to get a correct result?
Due to that peculiarity, you have to introduce a new line of code. Namely, do the Polynomial.fit() and immediately afterwards use the .convert() method. Then work with the converted polynomial only.
Here's my code that works correctly now.
import numpy as np
Pgauge = np.asarray([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
NIST = np.asarray([1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1, 8.1])
calibrationCurveMessedUp = np.polynomial.polynomial.Polynomial.fit(Pgauge,
NIST,
deg=1
)
calibrationCurve = calibrationCurveMessedUp.convert()
print("The polynomial: {}".format(calibrationCurve))
print("The rounded polynomial coefficients: {}".format(calibrationCurve.coef))
x = np.asarray([0, 1]) # values of x to evaluate the polynomial at
print(calibrationCurve(x))
THIRD: a little note.
Apparently, there is a possibility to get the correct polynomial without the additional line of code. Probably, you have to give the correct window and domain parameters to the Polynomial.fit() function. Or may be there is another way.
If anybody knows such a way, you're welcome to edit my current answer and add your code.
I have two Python lists of float numbers and I want to find a pivot point which minimizes the "overlap" of these two lists.
The problem is illustrated in the figure below, where I would like to get the cross point of the two curves (each curve can be imagined as the histogram plot of a list), and the "overlap" is defined as the green area.
For example, I have two lists [2.1, 3.5, 3.8, 3.8, 3.8, 4.2] and [3.7, 4.1, 4.1, 4.1, 5.0]. A good pivot point could be 4.0 (or any number between 3.8 and 4.1), where the "overlap" corresponds to only one number (4.2) from the 1st list and one number (3.7) from the 2nd list.
Apparently the set() & set() method doesn't apply here as the numbers wouldn't be the same in both lists. The only method I came up is a brute force search, starting from 4.2 and ending at 3.7, which is not ideal.
By the comments, I need to separate it into two questions:
1) What's the Python solution to find such a pivot point of the two lists?
2) Much better, maybe too much to ask it here, but how to get a statistically rigor solution to minimize the separation of the two set of values? I am not sure if I can assume a Gaussian distribution of the values, but let's assume we can if that helps to formulate a solution.
We have two lists a and b. We are looking for such a value x for which the cumulative probability of higher values in a is equal to cumulative probability of lower values in b.
Formally:
1 − CDF(a, x) == CDF(b, x)
Alternatively:
1 − CDF(a, x) − CDF(b, x) == 0
Let's implement it in Python.
import itertools
import random
def boundary(a, b):
"""Return interval of boundary values."""
# Calculate probability density function for both list
# Merge lists and sort them by their values
cc = sorted(itertools.chain(
((x, 1/len(a)) for i, x in enumerate(a)),
((x, 1/len(b)) for i, x in enumerate(b))))
# Mark all values with 1 − CDF(a, x) − CDF(b, x)
pp = [(x[0], 1-sum(z[1] for z in cc[:i+1])) for i, x in enumerate(cc)]
# Find index of a value closest to zero
m = min(enumerate(pp), key=lambda x: abs(x[1][1]))
# Return range of values
index = m[0]
return pp[index][0], pp[index+1][0]
Test simple cases:
print(boundary([1, 2], [3, 4])) # -> (2, 3)
print(boundary([1], [3])) # -> (1, 3)
print(boundary([1, 3], [2, 4])) # -> (2, 3)
And test a more complicated case:
a = sorted(random.gauss(0, 1) for _ in range(300))
b = sorted(random.gauss(1, 1) for _ in range(200))
print(boundary(a, b)) # -> approx (0.5, 0.5 + Δ)
Please note that the algorithm correctly processes lists of different lengths.
And with slight performance optimizations it can successfully handle lists with millions of items.
One idea is to use a decision classifier to determine the best separation point.
Code
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
# Setup Data
df = pd.DataFrame({'Feature': [1.0, 2.1, 3.5, 4.2,3.7, 4.1, 5.0],'Label':[0,0,0,0,1,1,1]})
feature_cols = ['Feature']
X = df[feature_cols] # Features
y = df.Label # Target variable
# Create Decision Tree classifer object (use max_depth of 1 to have one boundary)
clf = DecisionTreeClassifier(max_depth = 1)
# Train Decision Tree Classifer
clf = clf.fit(X, y)
# Found decision boundary by creating test data in 0.1 steps from min to max
# (i.e. 1 to 5)
arr = np.arange(1, 5.1, 0.1)
test_set = pd.DataFrame({'Feature': arr})
# Create predictor so we can see where boundary is created
y_pred = clf.predict(test_set)
indexes = np.where(y_pred > 0) # all points with label 1
pivot_index = indexes[0][0] # first point with label 1
pivot_value = arr[pivot_index] # value is pivot value
print(f'Pivot value: {pivot_value}')
Output
Pivot value: 3.7000000000000024
If I understood correctly the values you are looking for do not necessarily belong to the lists.
If that is the case you can "artificially" resample your lists with decimal spacing between min and max of the original lists, transform them to "sets" and compute the intersection between them.
Hi I have two numpy arrays (in this case representing depth and percentage depth dose data) as follows:
depth = np.array([ 0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ,
1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2. , 2.2,
2.4, 2.6, 2.8, 3. , 3.5, 4. , 4.5, 5. , 5.5])
pdd = np.array([ 80.40649399, 80.35692155, 81.94323956, 83.78981286,
85.58681373, 87.47056637, 89.39149833, 91.33721651,
93.35729334, 95.25343909, 97.06283306, 98.53761309,
99.56624117, 100. , 99.62820672, 98.47564754,
96.33163961, 93.12182427, 89.0940637 , 83.82699219,
77.75436857, 63.15528566, 46.62287768, 29.9665386 ,
16.11104226, 6.92774817, 0.69401413, 0.58247614,
0.55768992, 0.53290371, 0.5205106 ])
which when plotted give the following curve:
I need to find the depth at which the pdd falls to a given value (initially 50%). I have tried slicing the arrays at the point where the pdd reaches 100% as I'm only interested in the points after this.
Unfortunately np.interp only appears to work where both x and y values are incresing.
Could anyone suggest where I should go next?
If I understand you correctly, you want to interpolate the function depth = f(pdd) at pdd = 50.0. For the purposes of the interpolation, it might help for you to think of pdd as corresponding to your "x" values, and depth as corresponding to your "y" values.
You can use np.argsort to sort your "x" and "y" by ascending order of "x" (i.e. ascending pdd), then use np.interp as usual:
# `idx` is an an array of integer indices that sorts `pdd` in ascending order
idx = np.argsort(pdd)
depth_itp = np.interp([50.0], pdd[idx], depth[idx])
plt.plot(depth, pdd)
plt.plot(depth_itp, 50, 'xr', ms=20, mew=2)
This isn't really a programming solution, but it's how you can find the depth. I'm taking the liberty of renaming your variables, so x(i) = depth(i) and y(i) = pdd(i).
In a given interval [x(i),x(i+1)], your linear interpolant is
p_1(X) = y(i) + (X - x(i))*(y(i+1) - y(i))/(x(i+1) - x(i))
You want to find X such that p_1(X) = 50. First find i such that x(i)>50 and x(i+1), then the above equation can be rearranged to give
X = x(i) + (50 - y(i))*((x(i+1) - x(i))/(y(i+1) - y(i)))
For your data (with MATLAB; sorry, no python code) I make it approximately 2.359. This can then be verified with np.interp(X, depth, pdd)
There are several methods to carry out interpolation. For your case, you are basically looking for the depth at 50% which is not available in your data. The simplest interpolation is the linear case. I'm using numerical recipes library in C++ for acquiring the interpolated value via several techniques, therefore,
Linear Interpolation: see page 117
interpolated value depth(50%): 2.35915
Polynomial Interpolation: see page 117
interpolated value depth(50%): 2.36017
Cubic Spline Interpolation: see page 120
interpolated value depth(50%): 2.19401
Rational Function Interpolation: see page 124
interpolated value depth(50%): 2.35986
I want to develop some python code to align datasets obtained by different instruments recording the same event.
As an example, say I have two sets of measurements:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Define some data
data1 = pd.DataFrame({'TIME':[1.1, 2.4, 3.2, 4.1, 5.3],\
'VALUE':[10.3, 10.5, 11.0, 10.9, 10.7],\
'ERROR':[0.2, 0.1, 0.4, 0.3, 0.2]})
data2 = pd.DataFrame({'TIME':[0.9, 2.1, 2.9, 4.2],\
'VALUE':[18.4, 18.7, 18.9, 18.8],\
'ERROR':[0.3, 0.2, 0.5, 0.4]})
# Plot the data
plt.errorbar(data1.TIME, data1.VALUE, yerr=data1.ERROR, fmt='ro')
plt.errorbar(data2.TIME, data2.VALUE, yerr=data2.ERROR, fmt='bo')
plt.show()
The result is plotted here:
What I would like to do now is to align the second dataset (data2) to the first one (data1). i.e. to get this:
The second dataset must be shifted to match the first one by subtracting a constant (to be determined) from all its values. All I know is that the datasets are correlated since the two instruments are measuring the same event but with different sampling rates.
At this stage I do not want to make any assumptions about what function best describes the data (fitting will be done after alignment).
I am cautious about using means to perform shifts since it may produce bad results, depending on how the data is sampled. I was considering taking each data2[TIME_i] and working out the shortest distance to data1[~TIME_i]. Then minimizing the sum of those. But I am not sure that would work well either.
Does anyone have any suggestions on a good method to use? I looked at mlpy but it seems to only work on 1D arrays.
Thanks.
You can substract the mean of the difference: data2.VALUE-(data2.VALUE - data1.VALUE).mean()
import pandas as pd
import matplotlib.pyplot as plt
# Define some data
data1 = pd.DataFrame({
'TIME': [1.1, 2.4, 3.2, 4.1, 5.3],
'VALUE': [10.3, 10.5, 11.0, 10.9, 10.7],
'ERROR': [0.2, 0.1, 0.4, 0.3, 0.2],
})
data2 = pd.DataFrame({
'TIME': [0.9, 2.1, 2.9, 4.2],
'VALUE': [18.4, 18.7, 18.9, 18.8],
'ERROR': [0.3, 0.2, 0.5, 0.4],
})
# Plot the data
plt.errorbar(data1.TIME, data1.VALUE, yerr=data1.ERROR, fmt='ro')
plt.errorbar(data2.TIME, data2.VALUE-(data2.VALUE - data1.VALUE).mean(),
yerr=data2.ERROR, fmt='bo')
plt.show()
Another possibility is to subtract the mean of each series
You can calculate the offset of the average and subtract that from every value. If you do this for every value they should align relatively well. This would assume both dataset look relatively similar, so it might not work the best.
Although this question is not Matlab related, you might still be interested in this:
Remove unknown DC Offset from a non-periodic discrete time signal