import numpy as np
from sklearn.feature_selection import SelectFwe, f_regression
from sklearn import __version__  # 1.0.2
I have this example dataset:
X = np.array([
    [3, 0, 2, 1],
    [7, 3, 0, 5],
    [4, 2, 5, 1],
    [6, 2, 7, 3],
    [3, 2, 5, 2],
    [6, 1, 1, 4]
])
y = np.array([2,9,2,4,5,9])
The f_regression(X, y) function returns two arrays:
(array([ 4.68362124, 0.69456469, 2.59714175, 27.64721141]),
array([0.09643779, 0.45148854, 0.18234859, 0.00626275]))
The first array contains the F-statistics for the 4 features of my dataset, the second contains the p-values associated with those F-statistics.
Now suppose I want to extract the features with a p-value lower than 0.15; what I expect is that the first and last features are selected. I would like to use SelectFwe (see the documentation) to perform this step, so:
SelectFwe(f_regression, alpha=.15).fit(X, y).get_support()
Unfortunately it returns array([False, False, False, True]), meaning that only the last feature is selected.
Why does this happen? Did I misunderstand how SelectFwe works? A plot of the number of selected features as a function of alpha may be helpful; here is the code I used to produce it:
import matplotlib.pyplot as plt

plt.plot(
    np.linspace(0, 1, 101),
    [SelectFwe(f_regression, alpha=alpha).fit(X, y).get_support().sum()
     for alpha in np.linspace(0, 1, 101)],
)
plt.xlabel("alpha")
plt.ylabel("selected features")
plt.show()
In the source, the alpha is divided by the number of features:
def _get_support_mask(self):
    check_is_fitted(self)
    return self.pvalues_ < self.alpha / len(self.pvalues_)
This is because the class controls the family-wise error rate, which is
the probability of making one or more false discoveries
(Wikipedia, my emphasis). The division by the number of features is the Bonferroni correction: with alpha=0.15 and 4 features, the effective threshold is 0.15/4 = 0.0375, so only the last p-value (0.00626) passes. You can instead use SelectFpr, the "false positive rate" test, which works exactly the same way but does not divide by the number of features. See also Issue1007.
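For example, with the X and y above, SelectFpr with alpha=0.15 compares each p-value to 0.15 directly, so the first and last features should be selected:
from sklearn.feature_selection import SelectFpr, f_regression

SelectFpr(f_regression, alpha=0.15).fit(X, y).get_support()
# expected: array([ True, False, False,  True])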
I would like to explore solutions for performing expanding OLS in pandas (or other libraries that accept DataFrames/Series) efficiently.
Assuming the dataset is large, I am NOT interested in any solutions with a for-loop;
I am looking for solutions about expanding rather than rolling: rolling functions always require a fixed window, while expanding uses a variable window (starting from the beginning);
Please do not suggest pandas.stats.ols.MovingOLS, because it is deprecated;
Please do not suggest other deprecated methods such as expanding_mean.
For example, there is a DataFrame df with two columns X and y. To make it simpler, let's just calculate beta.
Currently, I am thinking about something like
import numpy as np
import pandas as pd
import statsmodels.api as sm
def my_OLS_func(df, y_name, X_name):
    y = df[y_name]
    X = df[X_name]
    X = sm.add_constant(X)
    # closed-form OLS via the pseudo-inverse of the normal equations
    b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)
    return b
df = pd.DataFrame({'X':[1,2.5,3], 'y':[4,5,6.3]})
df['beta'] = df.expanding().apply(my_OLS_func, args = ('y', 'X'))
Expected values of df['beta'] are 0 (or NaN), 0.66666667, and 1.038462.
However, this approach does not seem to work: expanding().apply() passes each column to the function one at a time as an array, so I am not sure how one could pass the two Series as arguments.
Any suggestions would be appreciated.
One option is to use the RecursiveLS (recursive least squares) model from Statsmodels:
import numpy as np
import statsmodels.api as sm

# Simulate some data
rs = np.random.RandomState(seed=12345)
nobs = 100000
beta = [10., -0.2]
sigma2 = 2.5
exog = sm.add_constant(rs.uniform(size=nobs))
eps = rs.normal(scale=sigma2**0.5, size=nobs)
endog = np.dot(exog, beta) + eps
# Construct and fit the recursive least squares model
mod = sm.RecursiveLS(endog, exog)
res = mod.fit()
# This is a 2 x 100,000 numpy array with the regression coefficients
# that would be estimated when using data from the beginning of the
# sample to each point. You should usually ignore the first k=2
# datapoints since they are controlled by a diffuse prior.
res.recursive_coefficients.filtered
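Applied to the small DataFrame from the question, a quick sketch (using the same statsmodels API) to check against the expected betas:
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({'X': [1, 2.5, 3], 'y': [4, 5, 6.3]})
res = sm.RecursiveLS(df['y'], sm.add_constant(df['X'])).fit()
# Row 1 holds the coefficient on X at each point in the expanding sample;
# the last entry should be ~1.038462 (the first k=2 entries are driven by
# the diffuse prior).
print(res.recursive_coefficients.filtered[1])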
I'd like to normalize my training set before passing it to my NN, so instead of doing it manually (subtracting the mean and dividing by the std), I tried keras.utils.normalize() and I am amazed by the results I got.
Running this:
import numpy as np
from keras.utils import normalize

r = np.random.rand(3000) * 1000
nr = normalize(r)
print(np.mean(r))
print(np.mean(nr))
print(np.std(r))
print(np.std(nr))
print(np.min(r))
print(np.min(nr))
print(np.max(r))
print(np.max(nr))
Results in:
495.60440066771866
0.015737914577213984
291.4440194021
0.009254802974329002
0.20755517410064872
6.590913227674956e-06
999.7631481267636
0.03174747238214018
Unfortunately, the docs don't explain what's happening under the hood. Can you please explain what it does and if I should use keras.utils.normalize instead of what I would have done manually?
It is not the kind of normalization you expect. Actually, it uses np.linalg.norm() under the hood to normalize the given data using Lp-norms:
def normalize(x, axis=-1, order=2):
    """Normalizes a Numpy array.

    # Arguments
        x: Numpy array to normalize.
        axis: axis along which to normalize.
        order: Normalization order (e.g. 2 for L2 norm).

    # Returns
        A normalized copy of the array.
    """
    l2 = np.atleast_1d(np.linalg.norm(x, order, axis))
    l2[l2 == 0] = 1
    return x / np.expand_dims(l2, axis)
For example, in the default case it normalizes the data using the L2 norm, i.e. the sum of squared elements along the given axis becomes equal to one.
You can either use this function, or, if you don't want to do mean-and-std normalization manually, you can use StandardScaler() from sklearn, or even MinMaxScaler() for min-max scaling.
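For instance, a minimal sketch contrasting the two on data like that in the question (the L2-norm property versus zero mean and unit std; assumes the keras.utils.normalize implementation shown above):
import numpy as np
from keras.utils import normalize
from sklearn.preprocessing import StandardScaler

r = np.random.rand(3000) * 1000

# keras.utils.normalize: unit L2 norm along the last axis
nr = normalize(r)
print(np.sum(nr ** 2))  # ~1.0

# StandardScaler: zero mean, unit standard deviation per feature
scaled = StandardScaler().fit_transform(r.reshape(-1, 1))
print(scaled.mean(), scaled.std())  # ~0.0 and ~1.0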
I am trying to learn how to sample truncated distributions. To begin with, I decided to try a simple example I found in a blog post.
I didn't really understand the division by the CDF, so I decided to tweak the algorithm a bit. The distribution being sampled has pdf(x) = x*exp(-x) for values x > 0. Here is an example Python code:
# Sample from pdf(x) = x*exp(-x) for x > 0 via Metropolis-Hastings
import numpy as np
import matplotlib.pyplot as plt

def pdf(x):
    return x*np.exp(-x)

xvec = np.zeros(1000000)
x = 1.
for i in range(1000000):
    a = x + np.random.normal()   # random-walk proposal
    xs = x
    if a > 0.:
        xs = a
    A = pdf(xs)/pdf(x)           # acceptance ratio
    if np.random.uniform() < A:
        x = xs
    xvec[i] = x
x = np.linspace(0, 15, 1000)
plt.plot(x, pdf(x))
plt.hist([x for x in xvec if x != 0], bins=150, density=True)
plt.show()
The output (a histogram plotted against the analytic pdf) looks correct, but only when using the condition if a > 0. :, i.e. positive x; choosing another condition (e.g. if a > 0.5 :) produces wrong results.
Since my final goal is to sample from a 2D Gaussian pdf on a truncated interval, I tried extending the simple example above (see the code below). Unfortunately, since the simple case didn't work, I assume that the code given below would yield wrong results.
I assume that all this can be done using advanced Python tools. However, since my primary goal is to understand the principle behind it, I would greatly appreciate your help in understanding my mistake.
Thank you for your help.
EDIT:
# code updated according to the answer of CrazyIvan
import numpy as np

RANGE = 100000
a = 2.06072E-02
b = 1.10011E+00
a_range = [0.001, 0.5]
b_range = [0.01, 2.5]
cov = [[3.1313994E-05, 1.8013737E-03], [1.8013737E-03, 1.0421529E-01]]
x = a
y = b
j = 0
for i in range(RANGE):
    a_t, b_t = np.random.multivariate_normal([a, b], cov)
    # accept if within bounds - all that is needed to truncate
    if a_range[0] < a_t < a_range[1] and b_range[0] < b_t < b_range[1]:
        print(a_t, b_t)
EDIT:
I changed the code by norming the analytic pdf according to this scheme, following the answers given by @CrazyIvan and @Leandro Caniglia, for the case where the bottom of the pdf is removed: that is, dividing by (1 - CDF(0.5)), since my accept condition is x > 0.5. This again shows some discrepancies. Again the mystery prevails...
import numpy as np
import matplotlib.pyplot as plt

def pdf(x):
    return x*np.exp(-x)

# included the corresponding cdf
def cdf(x):
    return 1. - np.exp(-x) - x*np.exp(-x)

xvec = np.zeros(1000000)
x = 1.
for i in range(1000000):
    a = x + np.random.normal()
    xs = x
    if a > 0.5:
        xs = a
    A = pdf(xs)/pdf(x)
    if np.random.uniform() < A:
        x = xs
    xvec[i] = x

x = np.linspace(0, 15, 1000)
# new part: norm the analytic pdf to fix the area
plt.plot(x, pdf(x)/(1. - cdf(0.5)))
plt.hist([x for x in xvec if x != 0], bins=200, density=True)
plt.savefig("test_exp.png")
plt.show()
It seems that this can be cured by choosing a larger step size for the proposal:
shift = 15.
a = x + np.random.normal()*shift
which is in general a tuning issue of the Metropolis–Hastings algorithm. I also checked shift = 150.
Bottom line: increasing the shift size definitely improves the convergence. The mystery is why, since the Gaussian proposal is unbounded.
You say you want to learn the basic idea of sampling a truncated distribution, but your source is a blog post about the Metropolis–Hastings algorithm? Do you actually need this "method for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult"? Taking this as your starting point is like learning English by reading Shakespeare.
Truncated normal
For a truncated normal, basic rejection sampling is all you need: generate samples from the original distribution and reject those outside of the bounds. As Leandro Caniglia noted, you should not expect the truncated distribution to have the same PDF except on a shorter interval: this is plain impossible, because the area under the graph of a PDF is always 1. If you cut off stuff from the sides, there has to be more in the middle; the PDF gets rescaled.
It's quite inefficient to gather samples one by one, when you need 100000. I would grab 100000 normal samples at once, accept only those that fit; then repeat until I have enough. Example of sampling truncated normal between amin and amax:
import numpy as np

n_samples = 100000
amin, amax = -1, 2
samples = np.zeros((0,))  # empty for now
while samples.shape[0] < n_samples:
    s = np.random.normal(0, 1, size=(n_samples,))
    accepted = s[(s >= amin) & (s <= amax)]
    samples = np.concatenate((samples, accepted), axis=0)
samples = samples[:n_samples]  # we probably got more than needed, so discard extra ones
And here is the comparison with the PDF curve, rescaled by division by cdf(amax) - cdf(amin) as explained above.
import matplotlib.pyplot as plt
from scipy.stats import norm

_ = plt.hist(samples, bins=50, density=True)
t = np.linspace(-2, 3, 500)
plt.plot(t, norm.pdf(t)/(norm.cdf(amax) - norm.cdf(amin)), 'r')
plt.show()
Truncated multivariate normal
Now we want to keep the first coordinate between amin and amax, and the second between bmin and bmax. Same story, except there will be a 2-column array and the comparison with bounds is done in a relatively sneaky way:
(np.min(s - [amin, bmin], axis=1) >= 0) & (np.max(s - [amax, bmax], axis=1) <= 0)
This means: subtract amin, bmin from each row and keep only the rows where both results are nonnegative (meaning we had a >= amin and b >= bmin). Also do a similar thing with amax, bmax. Accept only the rows that meet both criteria.
n_samples = 10
amin, amax = -1, 2
bmin, bmax = 0.2, 2.4
mean = [0.3, 0.5]
cov = [[2, 1.1], [1.1, 2]]
samples = np.zeros((0, 2))  # 2 columns now
while samples.shape[0] < n_samples:
    s = np.random.multivariate_normal(mean, cov, size=(n_samples,))
    accepted = s[(np.min(s - [amin, bmin], axis=1) >= 0) & (np.max(s - [amax, bmax], axis=1) <= 0)]
    samples = np.concatenate((samples, accepted), axis=0)
samples = samples[:n_samples, :]
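As an aside, the same acceptance mask can be written with np.all, which some may find more readable:
# equivalent row-wise bounds check
accepted = s[np.all(s >= [amin, bmin], axis=1) & np.all(s <= [amax, bmax], axis=1)]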
Not going to plot, but here are some values: naturally, within bounds.
array([[ 0.43150033, 1.55775629],
[ 0.62339265, 1.63506963],
[-0.6723598 , 1.58053835],
[-0.53347361, 0.53513105],
[ 1.70524439, 2.08226558],
[ 0.37474842, 0.2512812 ],
[-0.40986396, 0.58783193],
[ 0.65967087, 0.59755193],
[ 0.33383214, 2.37651975],
[ 1.7513789 , 1.24469918]])
To compute the truncated density function pdf_t from the entire density function pdf, do the following:
Let [a, b] be the truncation interval; (x axis)
Let A := cdf(a) and B := cdf(b); (cdf = non-truncated cumulative distribution function)
Then pdf_t(x) := pdf(x) / (B - A) if x in [a, b] and 0 elsewhere.
In cases where a = -infinity (resp. b = +infinity), take A := 0 (resp. B := 1).
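A minimal sketch of this recipe in code, using the pdf and cdf from the question and the truncation point a = 0.5 (here b = +infinity, so B = 1):
import numpy as np

def pdf(x):
    return x*np.exp(-x)

def cdf(x):
    return 1. - np.exp(-x) - x*np.exp(-x)

def pdf_t(x, a=0.5):
    # rescale by the remaining mass and zero out x < a
    A, B = cdf(a), 1.0
    return np.where(x >= a, pdf(x) / (B - A), 0.0)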
As for the "mystery" you see: please note that your blue curve is wrong. It is not the pdf of your truncated distribution; it is just the pdf of the non-truncated one, scaled by the correct amount (division by 1 - cdf(0.5)). The actual truncated pdf curve starts with a vertical line at x = 0.5, which goes up until it reaches your current blue curve. In other words, you only scaled the curve but forgot to truncate it, in this case to the left. Such a truncation corresponds to the "0 elsewhere" part of step 3 in the algorithm above.
I want to use sklearn.mixture.GaussianMixture to store a Gaussian mixture model so that I can later use it to generate samples, or to evaluate the density at a sample point using the score_samples method. Here is an example where the components have the following weights, means and covariances:
import numpy as np
weights = np.array([0.6322941277066596, 0.3677058722933399])
mu = np.array([[0.9148052872961359, 1.9792961751316835],
[-1.0917396392992502, -0.9304220945910037]])
sigma = np.array([[[2.267889129267119, 0.6553245618368836],
[0.6553245618368835, 0.6571014653342457]],
[[0.9516607767206848, -0.7445831474157608],
[-0.7445831474157608, 1.006599716443763]]])
Then I initialised the mixture as follows:
from sklearn import mixture
gmix = mixture.GaussianMixture(n_components=2, covariance_type='full')
gmix.weights_ = weights # mixture weights (n_components,)
gmix.means_ = mu # mixture means (n_components, 2)
gmix.covariances_ = sigma # mixture cov (n_components, 2, 2)
Finally I tried to generate a sample based on the parameters, which resulted in an error:
x = gmix.sample(1000)
NotFittedError: This GaussianMixture instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
As I understand GaussianMixture is intended to fit a sample using a mixture of Gaussian but is there a way to provide it with the final values and continue from there?
You rock, J.P.Petersen!
After seeing your answer, I compared the changes introduced by using the fit method. It seems the initial instantiation does not create all the attributes of gmix. Specifically, it is missing the following attributes:
covariances_
means_
weights_
converged_
lower_bound_
n_iter_
precisions_
precisions_cholesky_
The first three are introduced when the given inputs are assigned. Among the rest, for my application the only attribute that I need is precisions_cholesky_, which is the Cholesky decomposition of the inverse covariance matrices. As a minimum requirement I added it as follows:
gmix.precisions_cholesky_ = np.linalg.cholesky(np.linalg.inv(sigma)).transpose((0, 2, 1))
It seems that it has a check that makes sure that the model has been trained. You could trick it by training the GMM on a very small data set before setting the parameters. Like this:
gmix = mixture.GaussianMixture(n_components=2, covariance_type='full')
gmix.fit(np.random.rand(10, 2))  # Now it thinks it is trained
gmix.weights_ = weights # mixture weights (n_components,)
gmix.means_ = mu # mixture means (n_components, 2)
gmix.covariances_ = sigma # mixture cov (n_components, 2, 2)
x = gmix.sample(1000) # Should work now
To understand what is happening, note that GaussianMixture first checks that it has been fitted:
self._check_is_fitted()
Which triggers the following check:
def _check_is_fitted(self):
    check_is_fitted(self, ['weights_', 'means_', 'precisions_cholesky_'])
And finally the last function call:
def check_is_fitted(estimator, attributes, msg=None, all_or_any=all):
which only checks that the classifier already has the attributes.
So in short, the only thing missing to get it working (without having to fit it) is to set the precisions_cholesky_ attribute:
gmix.precisions_cholesky_ = 0
should do the trick (can't try it so not 100% sure :P)
However, if you want to play safe and have a consistent solution in case scikit-learn updates its constraints, the solution of @J.P.Petersen is probably the best way to go.
As a slight alternative to @hashmuke's answer, you can use the precision computation that is used inside GaussianMixture directly:
import numpy as np
from scipy.stats import invwishart as IW
from sklearn.mixture import GaussianMixture as GMM
from sklearn.mixture._gaussian_mixture import _compute_precision_cholesky
n_dims = 5
mu1 = np.random.randn(n_dims)
mu2 = np.random.randn(n_dims)
Sigma1 = IW.rvs(n_dims, 0.1 * np.eye(n_dims))
Sigma2 = IW.rvs(n_dims, 0.1 * np.eye(n_dims))
gmm = GMM(n_components=2)
gmm.weights_ = np.array([0.2, 0.8])
gmm.means_ = np.stack([mu1, mu2])
gmm.covariances_ = np.stack([Sigma1, Sigma2])
gmm.precisions_cholesky_ = _compute_precision_cholesky(gmm.covariances_, 'full')
X, y = gmm.sample(1000)
Depending on your covariance type, you should change 'full' accordingly as the input to _compute_precision_cholesky (one of 'full', 'diag', 'tied', 'spherical').
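Since the question also mentions score_samples, a quick sanity check (a sketch) that density evaluation works on the hand-constructed model:
log_density = gmm.score_samples(X)  # per-point log-likelihood under the mixture
print(log_density[:5])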
I am currently trying to do model checking with PyMC, where my model is a Bernoulli model with a Beta prior. I want to (i) produce a goodness-of-fit (gof) plot and (ii) calculate the posterior predictive p-value.
I have got my code running with a Binomial model, but I am struggling to find the right way to make a Bernoulli model work. Unfortunately, there is no example anywhere that I can work with. My code looks like the following:
import pymc as mc
import numpy as np

alpha = 2
beta = 2
n = 13
yes = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0])

p = mc.Beta('p', alpha, beta)
surv = mc.Bernoulli('surv', p=p, observed=True, value=yes)
surv_sim = mc.Bernoulli('surv_sim', p=p)

mc_est = mc.MCMC({'p': p, 'surv': surv, 'surv_sim': surv_sim})
mc_est.sample(10000, 5000, 2)

import matplotlib.pylab as plt
plt.hist(mc_est.surv_sim.trace(), bins=range(0, 3), normed=True)
plt.figure()
plt.hist(mc_est.p.trace(), bins=100, normed=True)

mc.Matplot.gof_plot(mc_est.surv_sim.trace(), 10/13., name='surv')
# here I have issues
D = mc.discrepancy(yes, surv_sim, p.trace())
mc.Matplot.discrepancy_plot(D)
The main problem I am having is determining the expected values for the discrepancy function. Just using p.trace() does not work here, as these are probabilities. Somehow I need to incorporate the sample size, but I am struggling to do that in a way similar to what I would do for a Binomial model. I am also not quite sure whether I am doing the gof_plot correctly.
Hope someone can help me out here! Thanks!
Per the discrepancy function doc string, the parameters are:
observed : Iterable of observed values (size=(n,))
simulated : Iterable of simulated values (size=(r,n))
expected : Iterable of expected values (size=(r,) or (r,n))
So you need to correct two things:
1) modify your simulated results to have size n (e.g., 13 in your example):
surv_sim = mc.Bernoulli('surv_sim', p=p, size=n)
2) encapsulate your p.trace() with the bernoulli_expval method:
D = mc.discrepancy(yes, surv_sim.trace(), mc.bernoulli_expval(p.trace()))
(bernoulli_expval just spits back p.)
With those two changes, the discrepancy computation runs and discrepancy_plot produces the posterior-predictive discrepancy plot (not reproduced here).
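For reference, a consolidated sketch of the corrected model-checking portion (assuming the PyMC2-era API used above):
# simulated node now matches the observed sample size n
surv_sim = mc.Bernoulli('surv_sim', p=p, size=n)
mc_est = mc.MCMC({'p': p, 'surv': surv, 'surv_sim': surv_sim})
mc_est.sample(10000, 5000, 2)

# expected values: E[Bernoulli(p)] = p, via bernoulli_expval
D = mc.discrepancy(yes, surv_sim.trace(), mc.bernoulli_expval(p.trace()))
mc.Matplot.discrepancy_plot(D, name='surv')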