random subsampling of the majority class - python

I have an unbalanced dataset and I want to perform random subsampling on the majority class, where each subsample is the same size as the minority class ... I think this is already implemented in Weka and Matlab; is there an equivalent in sklearn?

Say your data looks like something generated from this code:
import numpy as np
x = np.random.randn(100, 3)
y = np.array([int(i % 5 == 0) for i in range(100)])
(only a 1/5th of y is 1, which is the minority class).
To find the size of the minority class, do:
>>> np.sum(y == 1)
20
To find the subset that consists of the majority class, do:
majority_x, majority_y = x[y == 0, :], y[y == 0]
To find a random subset of size 20 (sampled without replacement, so no row is picked twice), do:
inds = np.random.choice(majority_x.shape[0], 20, replace=False)
followed by
majority_x[inds, :]
and
majority_y[inds]
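Putting it together, a minimal sketch (using the names from above, and assuming the classes are labelled 0 and 1) that builds the balanced dataset would be:
minority_x, minority_y = x[y == 1, :], y[y == 1]
# stack the full minority class with the majority-class subsample
balanced_x = np.concatenate([minority_x, majority_x[inds, :]])
balanced_y = np.concatenate([minority_y, majority_y[inds]])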


Constraining log-likelihood using pymc3.Potential?

Problem description
I am new to probabilistic programming and working through PyMC3's Gaussian Mixture Model sample notebook:
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import theano.tensor as tt
# simulate data from a known mixture distribution
np.random.seed(12345) # set random seed for reproducibility
k = 3
ndata = 500
spread = 5
centers = np.array([-spread, 0, spread])
# simulate data from mixture distribution
v = np.random.randint(0, k, ndata)
data = centers[v] + np.random.randn(ndata)
plt.hist(data);
# setup model
model = pm.Model()
with model:
    # cluster sizes
    p = pm.Dirichlet("p", a=np.array([1.0, 1.0, 1.0]), shape=k)
    # ensure all clusters have some points
    p_min_potential = pm.Potential("p_min_potential", tt.switch(tt.min(p) < 0.1, -np.inf, 0))
    # cluster centers
    means = pm.Normal("means", mu=[0, 0, 0], sigma=15, shape=k)
    # break symmetry
    order_means_potential = pm.Potential(
        "order_means_potential",
        tt.switch(means[1] - means[0] < 0, -np.inf, 0)
        + tt.switch(means[2] - means[1] < 0, -np.inf, 0),
    )
    # measurement error
    sd = pm.Uniform("sd", lower=0, upper=20)
    # latent cluster of each observation
    category = pm.Categorical("category", p=p, shape=ndata)
    # likelihood for each observed value
    points = pm.Normal("obs", mu=means[category], sigma=sd, observed=data)
# fit model
with model:
    step1 = pm.Metropolis(vars=[p, sd, means])
    step2 = pm.ElemwiseCategorical(vars=[category], values=[0, 1, 2])
    tr = pm.sample(10000, step=[step1, step2], tune=5000)
I am struggling with the following expressions:
# ensure all clusters have some points
p_min_potential = pm.Potential("p_min_potential", tt.switch(tt.min(p) < 0.1, -np.inf, 0))
and
# break symmetry
order_means_potential = pm.Potential(
    "order_means_potential",
    tt.switch(means[1] - means[0] < 0, -np.inf, 0)
    + tt.switch(means[2] - means[1] < 0, -np.inf, 0),
)
What I researched
From looking at related questions and PyMC3's and Theano's documentation, I think I understand that pm.Potential() is a way of setting the log-likelihood of an event in your model during sampling without providing observations for it, and that tt.switch() checks whether a certain condition is met and returns one of two values accordingly.
Thus p_min_potential ensures that all values in p are greater than 0.1 by setting the log-likelihood to negative infinity for any event where one value in p falls below that threshold; similarly, order_means_potential ensures that the values in means differ from one another and that their ordering stays the same during sampling.
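As a quick sanity check of the tt.switch() part, one could evaluate the expression on its own (a sketch, assuming Theano is installed; the variable names are only illustrative):
import numpy as np
import theano
import theano.tensor as tt
p = tt.dvector("p")
penalty = tt.switch(tt.min(p) < 0.1, -np.inf, 0)
f = theano.function([p], penalty)
print(f(np.array([0.5, 0.4, 0.1])))    # 0.0, the condition is not met
print(f(np.array([0.05, 0.55, 0.4])))  # -inf, one component is below 0.1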
Questions
Unfortunately, neither related questions nor the documentation could answer the following questions:
How are the results of these expressions fed back into the model, given that neither p_min_potential nor order_means_potential occurs as an input to any other expression?
Am I right about how tt.switch() works, i.e. if the condition tt.min(p) < 0.1 is met, -np.inf is returned as that event's log-likelihood, and 0 in any other case?
Any help would be greatly appreciated; I would like to understand how this example works well enough to alter and expand it. Specifically, I want to implement a mixture model for two or more beta distributions.

Python optimization of prediction of random forest regressor

I have built a random forest regressor to predict the elasticity of a certain object based on color, material, size and other features.
The model works fine and I can predict the expected elasticity given certain inputs.
Eventually, I want to be able to find the lowest elasticity with certain constraints. The inputs have limited possibilities, i.e., material can only be plastic or textile.
I would like to have a smart solution in which I don't have to brute-force all the possible combinations to find the one with the lowest elasticity. I have found that surrogate models can be used for this, but I don't understand how to apply the concept to my problem. For example, what is the objective function I should optimize in my case? I thought of passing the .predict() of the random forest, but I'm not sure this is the correct way.
To summarize, I'd like a solution that, given certain conditions, tells me what the best set of features is to obtain the lowest elasticity. For example, I'm looking for the lowest elasticity when the object is made of plastic --> I'd like to receive the set of other features that gives the lowest elasticity in that case. Or, simply, which features I should tune to improve the performance.
import numpy as np
from scipy.optimize import minimize
import random
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X_train, y_train)
material = [0, 1]
size = list(range(1, 45))
color = list(range(1, 500))
def objective(x):
    material = x[0]
    size = x[1]
    color = x[2]
    return model.predict([[material, size, color]])
# initial guesses
n = 3
x0 = np.zeros(n)
x0[0] = random.choice(material)
x0[1] = random.choice(size)
x0[2] = random.choice(color)
# optimize
b = (None, None)
bnds = (b, b, b, b, b)
solution = minimize(objective, x0, method='nelder-mead',
                    options={'xtol': 1e-8, 'disp': True})
x = solution.x
print('Final Objective: ' + str(objective(x)))
This is one solution, if I understood you correctly:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from scipy.optimize import differential_evolution
model = None
def objective(x):
    material = x[0]
    size = x[1]
    color = x[2]
    # return a scalar prediction for the optimizer
    return model.predict([[material, size, color]])[0]
# define input data
material = np.random.choice([0, 1], 10); material = np.expand_dims(material, 1)
size = np.arange(10); size = np.expand_dims(size, 1)
color = np.arange(20, 30); color = np.expand_dims(color, 1)
input = np.concatenate((material, size, color), 1)  # shape = (10, 3)
# define output = elasticity between [0, 1] i.e. 0-100%
elasticity = np.array([0.51135295, 0.54830051, 0.42198349, 0.72614775, 0.55087905,
                       0.99819945, 0.3175208 , 0.78232872, 0.11621277, 0.32219236])
# model and minimize
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(input, elasticity)
limits = ((0, 1), (0, 10), (20, 30))
res = differential_evolution(objective, limits, maxiter=10000, seed=11111)
min_y = model.predict([res.x])[0]
print("min_elasticity ==", round(min_y, 5))
The output is the minimal elasticity based on the given limits:
min_elasticity == 0.19029
These are random data, so the RandomForestRegressor perhaps doesn't do the best job.
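To handle the original constraint (for example, finding the lowest elasticity when the material is fixed to plastic), one option is to fix that feature inside the objective and round the remaining features to valid integers before predicting. This is only a sketch built on the toy model above; objective_plastic and the encoding material == 0 for plastic are assumptions, not part of the original answer.
def objective_plastic(x):
    # material fixed to 0 ("plastic"); size and color rounded to valid integers
    size, color = int(round(x[0])), int(round(x[1]))
    return model.predict([[0, size, color]])[0]
limits_plastic = ((0, 9), (20, 29))  # size and color ranges used in the toy data above
res = differential_evolution(objective_plastic, limits_plastic, seed=0)
print("best size/color:", np.round(res.x), "predicted elasticity:", res.fun)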

sample X examples from each class label

I have a dataset (numpy arrays) with 50 classes and 9000 training examples.
x_train=(9000,2048)
y_train=(9000,) # Classes are strings
classes=list(set(y_train))
I would like to build a sub-dataset such that each class will have 5 examples,
which means I get 5*50=250 training examples. Hence my sub-dataset will take this form:
sub_train_data=(250,2048)
sub_train_labels=(250,)
Remark: we randomly take 5 examples from each class (total number of classes = 50).
Thank you
Here is a solution for that problem:
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

def balanced_sample_maker(X, y, sample_size, random_seed=42):
    uniq_levels = np.unique(y)
    uniq_counts = {level: sum(y == level) for level in uniq_levels}
    if random_seed is not None:
        np.random.seed(random_seed)
    # find observation index of each class level
    groupby_levels = {}
    for ii, level in enumerate(uniq_levels):
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx
    # oversampling on observations of each label
    balanced_copy_idx = []
    for gb_level, gb_idx in groupby_levels.items():
        over_sample_idx = np.random.choice(gb_idx, size=sample_size, replace=True).tolist()
        balanced_copy_idx += over_sample_idx
    np.random.shuffle(balanced_copy_idx)
    data_train = X[balanced_copy_idx]
    labels_train = y[balanced_copy_idx]
    if len(data_train) == (sample_size * len(uniq_levels)):
        print('number of sampled example ', sample_size * len(uniq_levels), 'number of sample per class ', sample_size, ' #classes: ', len(list(set(uniq_levels))))
    else:
        print('number of samples is wrong ')
    labels, values = zip(*Counter(labels_train).items())
    print('number of classes ', len(list(set(labels_train))))
    check = all(x == values[0] for x in values)
    print(check)
    if check:
        print('Good all classes have the same number of examples')
    else:
        print('Repeat again your sampling your classes are not balanced')
    indexes = np.arange(len(labels))
    width = 0.5
    plt.bar(indexes, values, width)
    plt.xticks(indexes + width * 0.5, labels)
    plt.show()
    return data_train, labels_train

X_train, y_train = balanced_sample_maker(X, y, 10)
inspired by Scikit-learn balanced subsampling
Pure numpy solution:
def sample(X, y, samples):
    unique_ys = np.unique(y, axis=0)
    result = []
    for unique_y in unique_ys:
        val_indices = np.argwhere(y == unique_y).flatten()
        random_samples = np.random.choice(val_indices, samples, replace=False)
        result.append(X[random_samples])
    return np.concatenate(result)
I usually use a trick from scikit-learn for this: the StratifiedShuffleSplit function. If I have to select a 1/n fraction of my train set, I divide the data into n folds and set the proportion of test data (test_size) to 1-1/n. Here is an example where I use only 1/10 of my data.
from sklearn.model_selection import StratifiedShuffleSplit
sp = StratifiedShuffleSplit(n_splits=1, test_size=0.9, random_state=seed)
for train_index, _ in sp.split(x_train, y_train):
    x_train, y_train = x_train[train_index], y_train[train_index]
You can use a dataframe as input (as in my case) and use the simple code below:
col = target
nsamples = min(t4m[col].value_counts().values)
res = pd.DataFrame()
for val in t4m[col].unique():
    t = t4m.loc[t4m[col] == val].sample(nsamples)
    res = pd.concat([res, t], ignore_index=True).sample(frac=1)
col is the name of your column with classes. The code finds the size of the minority class, takes a sample of that size from each class, and shuffles the resulting dataframe.
Then you can convert the result back to a np.array.
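For example, a minimal conversion back to arrays (a sketch; res and col are the names from above, and it assumes every column other than col is a feature) could be:
import numpy as np
sub_train_data = res.drop(columns=[col]).to_numpy()
sub_train_labels = res[col].to_numpy()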

sklearn decision tree calculating misclassified data points for each "predicted region/rectagle"

Firstly, credit to kamezekase for the excellent implementation that returns the x,y vertices of the decision-tree rectangles for lightweight plotting. The implementation is as follows:
import numpy as np
from collections import deque
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import _tree as ctree
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

class AABB:
    """Axis-aligned bounding box"""
    def __init__(self, n_features):
        self.limits = np.array([[-np.inf, np.inf]] * n_features)

    def split(self, f, v):
        left = AABB(self.limits.shape[0])
        right = AABB(self.limits.shape[0])
        left.limits = self.limits.copy()
        right.limits = self.limits.copy()
        left.limits[f, 1] = v
        right.limits[f, 0] = v
        return left, right

def tree_bounds(tree, n_features=None):
    """Compute final decision rule for each node in tree"""
    if n_features is None:
        n_features = np.max(tree.feature) + 1
    aabbs = [AABB(n_features) for _ in range(tree.node_count)]
    queue = deque([0])
    while queue:
        i = queue.pop()
        l = tree.children_left[i]
        r = tree.children_right[i]
        if l != ctree.TREE_LEAF:
            aabbs[l], aabbs[r] = aabbs[i].split(tree.feature[i], tree.threshold[i])
            queue.extend([l, r])
    return aabbs

def decision_areas(tree_classifier, maxrange, x=0, y=1, n_features=None):
    """Extract decision areas.

    tree_classifier: Instance of a sklearn.tree.DecisionTreeClassifier
    maxrange: values to insert for [left, right, top, bottom] if the interval is open (+/-inf)
    x: index of the feature that goes on the x axis
    y: index of the feature that goes on the y axis
    n_features: override autodetection of number of features
    """
    tree = tree_classifier.tree_
    aabbs = tree_bounds(tree, n_features)
    rectangles = []
    for i in range(len(aabbs)):
        if tree.children_left[i] != ctree.TREE_LEAF:
            continue
        l = aabbs[i].limits
        r = [l[x, 0], l[x, 1], l[y, 0], l[y, 1], np.argmax(tree.value[i])]
        rectangles.append(r)
    rectangles = np.array(rectangles)
    rectangles[:, [0, 2]] = np.maximum(rectangles[:, [0, 2]], maxrange[0::2])
    rectangles[:, [1, 3]] = np.minimum(rectangles[:, [1, 3]], maxrange[1::2])
    return rectangles
My Goal:
I have the task of returning the count of misclassified data points, and I need that count for each possible class of the model (with the textbook iris data set, for example, that would be 3 classes). That is to say, for every rectangle we extract, I also want to include how many data points were incorrectly bounded by that rectangle. So for each rectangle there should be one class with 0 misclassified points, namely the predicted class of that region, i.e. np.argmax(tree.value[i]) as seen above.
I would like to store the counts as part of the r variable, perhaps as follows (using verbose variable names for clarity):
r = [l[x, 0], l[x, 1], l[y, 0], l[y, 1], np.argmax(tree.value[i]), class1_number_of_incorrect_in_this_rect, class2_number_of_incorrect_in_this_rect, class3_number_of_incorrect_in_this_rect]
Despite the code being very clean and intuitive, I can't quite find the right place to add my iterative calculation. I'm also confused by the naming conventions of the variables: here y is the y axis, not the target, so it's a little murky what I should compare tree.predict() against. And I'm not sure whether this implementation keeps any reference to the total number of classes the tree handles; I would imagine sklearn has built-in functionality for that at least.
That's it in a nutshell; am I even on the right conceptual approach to my goal?
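One possible direction (a sketch only, not part of the implementation above; misclassified_per_leaf is a hypothetical helper and it assumes integer-encoded class labels) is to map each training point to its leaf with tree_classifier.apply() and count, per leaf, the points whose true label differs from the leaf's prediction. The dictionary keys are the same leaf indices i that decision_areas() loops over, so the counts could be appended to each r:
def misclassified_per_leaf(tree_classifier, X, y, n_classes):
    leaf_ids = tree_classifier.apply(X)   # leaf node index for every sample
    preds = tree_classifier.predict(X)    # predicted class for every sample
    counts = {}
    for leaf in np.unique(leaf_ids):
        mask = leaf_ids == leaf
        wrong = y[mask] != preds[mask]    # points this leaf gets wrong
        counts[leaf] = np.bincount(y[mask][wrong], minlength=n_classes)
    return counts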

roc_auc_score - Only one class present in y_true

I am doing k-fold cross-validation on an existing dataframe, and I need to get the AUC score.
The problem is that sometimes the test data contains only 0s and no 1s!
I tried using this example, but with different numbers:
import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 0, 0])
y_scores = np.array([1, 0, 0, 0])
roc_auc_score(y_true, y_scores)
And I get this exception:
ValueError: Only one class present in y_true. ROC AUC score is not
defined in that case.
Is there any workaround that can make it work in such cases?
You could use try-except to prevent the error:
import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 0, 0])
y_scores = np.array([1, 0, 0, 0])
try:
    roc_auc_score(y_true, y_scores)
except ValueError:
    pass
You could also set the roc_auc_score to zero if there is only one class present. However, I wouldn't do this; I guess your test data is highly unbalanced, so I would suggest using stratified K-fold instead so that both classes are always present.
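A minimal sketch of that suggestion, assuming a feature matrix X and binary labels y as numpy arrays (LogisticRegression just stands in for whatever classifier is being evaluated):
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = LogisticRegression()
scores = []
for train_idx, test_idx in skf.split(X, y):
    clf.fit(X[train_idx], y[train_idx])
    # each test fold keeps the original class proportions, so both classes appear
    proba = clf.predict_proba(X[test_idx])[:, 1]
    scores.append(roc_auc_score(y[test_idx], proba))
print("mean AUC:", np.mean(scores))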
As the error notes, if a class is not present in the ground truth of a batch,
ROC AUC score is not defined in that case.
I'm against either throwing an exception (about what? This is the expected behaviour) or returning another metric (e.g. accuracy). The metric is not broken per se.
I don't feel like solving a data imbalance "issue" with a metric "fix". It would probably be better to use a different sampling scheme, if possible, or simply to join multiple batches that satisfy the class population requirement.
I am facing the same problem now, and using try-except does not solve my issue. I developed the code below in order to deal with it.
import pandas as pd
import numpy as np

class KFold(object):
    def __init__(self, folds, random_state=None):
        self.folds = folds
        self.random_state = random_state

    def split(self, x, y):
        assert len(x) == len(y), 'x and y should have the same length'
        x_, y_ = pd.DataFrame(x), pd.DataFrame(y)
        y_ = y_.sample(frac=1, random_state=self.random_state)
        x_ = x_.loc[y_.index]
        event_index, non_event_index = list(y_[y == 1].index), list(y_[y == 0].index)
        assert len(event_index) >= self.folds, 'number of folds should be less than the number of rows in x'
        assert len(non_event_index) >= self.folds, 'number of folds should be less than number of rows in y'
        indexes = []
        #
        # build folds from the non-event (y == 0) indices
        #
        step = int(np.ceil(len(non_event_index) / self.folds))
        start, end = 0, step
        while start < len(non_event_index):
            train_fold = set(non_event_index[start:end])
            valid_fold = set([k for k in non_event_index if k not in train_fold])
            indexes.append([train_fold, valid_fold])
            start, end = end, min(step + end, len(non_event_index))
        #
        # add the event (y == 1) indices to the same folds
        #
        step = int(np.ceil(len(event_index) / self.folds))
        start, end, i = 0, step, 0
        while start < len(event_index):
            train_fold = set(event_index[start:end])
            valid_fold = set([k for k in event_index if k not in train_fold])
            indexes[i][0] = list(indexes[i][0].union(train_fold))
            indexes[i][1] = list(indexes[i][1].union(valid_fold))
            indexes[i] = tuple(indexes[i])
            start, end, i = end, min(step + end, len(event_index)), i + 1
        return indexes
I just wrote this code and have not tested it exhaustively; it was only tested for binary categories. I hope it is useful nonetheless.
You can increase the batch size, e.g. from 32 to 64, or you can use StratifiedKFold or StratifiedShuffleSplit. If the error still occurs, try shuffling your data, e.g. in your DataLoader.
Simply changing one of the 0s in y_true to a 1 makes it work:
import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 1, 0, 0])
y_scores = np.array([1, 0, 0, 0])
roc_auc_score(y_true, y_scores)
I believe the error message is saying that there is only one class in y_true (all zeros); you need to have two classes present in y_true.
