Scikit-learn (Python): what does f_regression() compute?

Scikit-learn (Python): what does f_regression() compute? - python

I'm trying to understand what f_regression() in the feature selection package does.
(http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html#sklearn.feature_selection.f_regression)
According to the documentation, the first step in f_regression is as follows:
"1. the regressor of interest and the data are orthogonalized wrt constant regressors."
What does this line mean, exactly? What are these constant regressors?
Thanks!

It means that the mean is subtracted on both variables.
A constant regressor is a vector full of ones. What this vector can explain in your data is then subtracted out. This leads to a vector with zero sum, i.e. a centered variable.
What f1_regression essentially calculates is correlation, a scalar product between centered and appropriately rescaled variables.
The resulting score is a function of this value and the degrees of freedom, i.e. the dimensionality of the vectors. The higher the score, the more probably the variables are associated.

Related

How does GPFlow handle the length-scales of additive multidimensional Kernels?

I am working with GPFlow to implement a basic Gaussian Process Regressor. In particular, I have defined the following kernel for my GPR:
kernel = gpflow.kernels.RBF(lengthscales=1., active_dims=[0,1]) + \
gpflow.kernels.RBF(lengthscales=1., active_dims=[2,3])
self.gp = gpflow.models.GPR(data=(gpXVar, gpYVar),kernel=self.kernel)
The overall parameter space is 4-dimensional where I expect dimensions (0,1) to share the same lengthscale parameter and (2,3) to share another lengthscale parameter). I bound the input space to the unit hypercube, and I normalize my labels. My question is: when setting the prior distributions over my kernel lengthscales, how does the internal implementation of gpflow inform my choices?
In particular, I only add data points to the regression if they are a minimum Euclidean distance away from any other point in the dataset (1e-4). Since I bound the input domain to the unit cube, I know the maximum distance between samples on any given dimension is 1, and the total Euclidean distance within the input domain is 2=Sqrt(1+1+1+1). From the excellent case study here:
https://betanalpha.github.io/assets/case_studies/gaussian_processes.html#3_Inferring_A_Gaussian_Process
I know that the prior distribution should lower bound the lengthscale to be 1e-4, but how am I to upper bound the lengthscales? Is the upper bound the maximum distance for a single dimension (1), is it the maximum distance for a given component kernel (sqrt(2)), or is it the maximum distance over the entire input domain (2)?
I have been pouring over the source code for GPFlow, but I can't seem to figure out which is correct.
Thank you.

Logistic Regression vs predicting probability by splitting data into bin

So I am exploring using a logistic regression model to predict the probability of a shot resulting in a goal. I have two predictors but for simplicity lets assume I have one predictor: distance from the goal. When doing some data exploration I decided to investigate the relationship between distance and the result of a goal. I did this graphical by splitting the data into equal size bins and then taking the mean of all the results (0 for a miss and 1 for a goal) within each bin. Then I plotted the average distance from goal for each bin vs the probability of scoring. I did this in python
#use the seaborn library to inspect the distribution of the shots by result (goal or no goal)
fig, axes = plt.subplots(1, 2,figsize=(11, 5))
#first we want to create bins to calc our probability
#pandas has a function qcut that evenly distibutes the data
#into n bins based on a desired column value
df['Goal']=df['Goal'].astype(int)
df['Distance_Bins'] = pd.qcut(df['Distance'],q=50)
#now we want to find the mean of the Goal column(our prob density) for each bin
#and the mean of the distance for each bin
dist_prob = df.groupby('Distance_Bins',as_index=False)['Goal'].mean()['Goal']
dist_mean = df.groupby('Distance_Bins',as_index=False)['Distance'].mean()['Distance']
dist_trend = sns.scatterplot(x=dist_mean,y=dist_prob,ax=axes[0])
dist_trend.set(xlabel="Avg. Distance of Bin",
ylabel="Probabilty of Goal",
title="Probability of Scoring Based on Distance")
Probability of Scoring Based on Distance
So my question is why would we go through the process of creating a logistic regression model when I could fit a curve to the plot in the image? Would that not provide a function that would predict a probability for a shot with distance x.
I guess the problem would be that we are reducing say 40,000 data point into 50 but I'm not entirely sure why this would be a problem for predict future shot. Could we increase the number of bins or would that just add variability? Is this a case of bias-variance trade off? Im just a little confused about why this would not be as good as a logistic model.

The binning method is a bit more finicky than the logistic regression since you need to try different types of plots to fit the curve (e.g. inverse relationship, log, square, etc.), while for logistic regression you only need to adjust the learning rate to see results.
If you are using one feature (your "Distance" predictor), I wouldn't see much difference between the binning method and the logistic regression. However, when you are using two or more features (I see "Distance" and "Angle" in the image you provided), how would you plan to combine the probabilities for each to make a final 0/1 classification? It can be tricky. For one, perhaps "Distance" is more useful a predictor than "Angle". However, logistic regression does that for you because it can adjust the weights.
Regarding your binning method, if you use fewer bins you might see more bias since the data may be more complicated than you think, but this is not that likely because your data looks quite simple at first glance. However, if you use more bins that would not significantly increase variance, assuming that you fit the curve without varying the order of the curve. If you change the order of the curve you fit, then yes, it will increase variance. However, your data seems like it is amenable to a very simple fit if you go with this method.

Convert independent sklearn GaussianMixture log probability scores to probabilities summing to 1

I have labeled 2D data. There are 4 labels in the set, and I know the correspondence of every point to its label. I'd like to, given a new arbitrary data point, find the probability that it has each of the 4 labels. It must belong to one and only one of the labels, so the probabilities should sum to 1.
What I've done so far is to train 4 independent sklearn GMMs (sklearn.mixture.GaussianMixture) on the data points associated with each label. It should be noted that I do not wish to train a single GMM with 4 components because I already know the labels, and don't want to re-cluster in a way that is worse than my known labels. (It would appear that there is a way to provide Y= labels to the fit() function, but I can't seem to get it to work).
In the above plot, points are colored by their known labels, and the contours represent the four independent GMMs fitted to these 4 sets of points.
For a new point, I attempted to compute the probability of its label in a couple ways:
GaussianMixture.predict_proba(): Since each independent GMM has only one distribution, this simply returns a probability of 1 for all models.
GaussianMixture.score_samples(): According to documentation, this one returns the "weighted log probabilities for each sample". My procedure is, for a single new point, I make four calls to this function from each of the four independently trained GMMs represenenting each distribution above. I do get semi sensible results here--typically a positive number for the correct model and negative numbers for each of the three incorrect models, with more muddled results for points near intersecting distribution boundaries. Here's a typical clear-cut result:
2.904136, -60.881554, -20.824841, -30.658509
This point is actually associated with the first label and is least likely to be the second label (is farthest from the second distribution). My issue is how to convert the above scores into probabilities that sum to 1 and accurately represent the chance that the given point belongs to one and only one of the four distributions? Given that these are 4 independent models, is this possible? If not, is there another method I have overlooked that could allow me to train GMM(s) based on known labels and will provide probabilities that sum to 1?

In general, if you don't know how the scores are calculated but you know that there is a monotonic relationship between the scores and the probability, you can simply use the softmax function to approximate a probability, with an optional temperature variable that controls the spikiness of the distribution.
Let V be your list of scores and tau be the temperature. Then,
p = np.exp(V/tau) / np.sum(np.exp(V/tau))
is your answer.
PS: Luckily, we know how sklearn GMM scoring works and softmax with tau=1 is your exact answer.

variance vs coefficient of variation

I need to identify which statistic let me to find on digital image which line has the highest variation. I am using Variance (square units, calculated as numpy.var(x)) and Coefficient of Variation (unitless, calculated as numpy.sd(x)/numpy.mean(x)), but I got different values, as here:
v1 = line(VAR(x))
v2 = line(CV(x))
print(v1,v2)
The result:
(12,17)
Should not both find the same line?
Which one could be better to use in this case?

Coefficient of variation and variance are not supposed to choose the same array on a random data. Coefficient of variation will be sensitive to both variance and the scale of your data, whereas variance will be geared towards variation in your data.
Please see the example:
import numpy as np
x = np.random.randn(10)
x1= x+10
np.var(x), np.std(x)/np.mean(x)
(2.0571740850649021, -2.2697110381499224)
np.var(x1), np.std(x1)/np.mean(x1)
(2.0571740850649016, 0.1531035017615747)
Which one to choose depends on your application, but I'm leaning towards variance in your case.

Variance defines how much it varies from the mean (No Noise in the data) or Median(Noise in the data).
Coefficient of variation defines standarddeviation divided by mean. It always expressed in percentages.

Mathematical background of statsmodels wls_prediction_std

wls_prediction_std returns standard deviation and confidence interval of my fitted model data. I would need to know the way the confidence intervals are calculated from the covariance matrix. (I already tried to figure it out by looking at the source code but wasn't able to) I was hoping some of you guys could help me out by writing out the mathematical expression behind wls_prediction_std.

There should be a variation on this in any textbook, without the weights.
For OLS, Greene (5th edition, which I used) has
se = s^2 (1 + x (X'X)^{-1} x')
where s^2 is the estimate of the residual variance, x is vector or explanatory variables for which we want to predict and X are the explanatory variables used in the estimation.
This is the standard error for an observation, the second part alone is the standard error for the predicted mean y_predicted = x beta_estimated.
wls_prediction_std uses the variance of the parameter estimate directly.
Assuming x is fixed, then y_predicted is just a linear transformation of the random variable beta_estimated, so the variance of y_predicted is just
x Cov(beta_estimated) x'
To this we still need to add the estimate of the error variance.
As far as I remember, there are estimates that have better small sample properties.
I added the weights, but never managed to verify them, so the function has remained in the sandbox for years. (Stata doesn't return prediction errors with weights.)
Aside:
Using the covariance of the parameter estimate should also be correct if we use a sandwich robust covariance estimator, while Greene's formula above is only correct if we don't have any misspecified heteroscedasticity.
What wls_prediction_std doesn't take into account is that, if we have a model for the heteroscedasticity, then the error variance could also depend on the explanatory variables, i.e. on x.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.