I'm using scikit-optimize to run a BayesSearchCV over my RandomForestClassifier hyperparameter space. One hyperparameter should also be able to take the value 0 (zero) while following a log-uniform distribution:
ccp_alpha = Real(min(ccp_alpha), max(ccp_alpha), prior='log-uniform')
Since log(0) is impossible to calculate, it is apparently impossible to have the parameter take the value 0 at some point.
Consequently, the following error is thrown:
ValueError: Not all points are within the bounds of the space.
Is there any way to work around this?
Note that getting a 0 from a log-uniform distribution is not well defined. How would you normalize this distribution, or in other words what would the odds be of drawing a 0?
The simplest approach would be to yield a list of values to try with the specified distribution. Since the values in this list will be sampled uniformly, you can use any distribution you like. For example with the list
reals = [0,0,0,0,x1,x2,x3,x4]
where x1 to x4 are log-uniformly distributed, you get odds of 4 / 8 of drawing a 0 and odds of 4 / 8 of drawing a log-uniformly distributed value.
If you really wanted to, you could also implement a class called MyReal (probably subclassed from Real) that implements a rvs method that yields the distribution you want.
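As a minimal sketch of the list-based approach above, the snippet below builds such a list with NumPy. The function name zero_inflated_log_uniform, the bounds 1e-4 and 1e-1, and the 4/4 split are illustrative assumptions, not values from the question.

```python
import numpy as np

def zero_inflated_log_uniform(low, high, n_zero, n_log, seed=0):
    """Return n_zero zeros followed by n_log log-uniform draws from [low, high]."""
    rng = np.random.default_rng(seed)
    # Sampling uniformly in log space and exponentiating gives log-uniform values.
    log_values = np.exp(rng.uniform(np.log(low), np.log(high), size=n_log))
    return [0.0] * n_zero + log_values.tolist()

# Hypothetical bounds; picking uniformly from this list yields 0 half of the time.
reals = zero_inflated_log_uniform(1e-4, 1e-1, n_zero=4, n_log=4)
print(reals)
```

The resulting list could then be handed to the search as a set of discrete candidate values, with the share of zeros controlling how often 0 is tried.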
I am fairly new to Python and learning some basic python for data science, and I am trying to ascertain the min and max values for an array when you run randn.
To clarify the question: how low and how high could the numbers below potentially get?
Does it have anything to do with the row/column values entered?
I thought they were for values between -1 and 1 but this is not the case when I test.
import numpy as np
np.random.randn(3,3)
array([[ 1.61311526, -1.20028357, -0.41723647],
[-0.31983635, -3.05411198, -0.43453723],
[ 0.09385744, -0.28239577, -1.17262933]])
As mentioned by others, graphically, the probability distribution looks like this.
Probability of getting a value from -0.5 to 0.5: 19.1% + 19.1% = 38.2%
Probability of getting a value larger than 3 = 0.1%
np.random.randn returns a sample (or samples) from the “standard normal” distribution (see the NumPy documentation).
The standard normal distribution is not bounded. However, the probability for example to get a sample smaller than -3 is 0.0013.
The function numpy.random.randn returns values from the standard normal distribution, which can be anything between negative and positive infinity, so there's no max or min. These values are distributed along the "bell curve" centered at 0, and are exponentially less likely to occur the farther you get from 0.
The row/column parameters don't determine any (non-existent) max/min; they just determine the shape of the output array (see the documentation).
So in your example, passing (3,3) into np.random.randn(3,3) returns a 3x3 array of values from the standard normal distribution.
Basically, there's no max or min value, but since higher numbers are less likely to come up or in other words have lower probabilities, you're usually only looking at -3.5 to 3.5. But the larger the size of random data you are trying to generate, the higher the chances of generating a larger value.
The numpy.random.randn function is based on a standard normal distribution, meaning that there is not a maximum value or a minimum value. However, more positive and negative values are less likely to be produced than ones closer to zero.
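The points made in these answers can be checked numerically. The sketch below computes two of the quoted probabilities from the standard normal CDF (written with math.erf so that only NumPy and the standard library are needed) and shows how the observed extremes tend to widen as the sample grows. The helper name std_normal_cdf and the seed are assumptions for illustration.

```python
import math
import numpy as np

def std_normal_cdf(x):
    """CDF of the standard normal distribution, P(X <= x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

p_below_minus3 = std_normal_cdf(-3.0)                      # about 0.0013
p_central = std_normal_cdf(0.5) - std_normal_cdf(-0.5)     # about 0.383

print(p_below_minus3, p_central)

# There is no hard bound: larger samples tend to contain more extreme values.
rng = np.random.default_rng(0)
for n in (10, 1_000, 100_000):
    sample = rng.standard_normal(n)
    print(n, sample.min(), sample.max())
```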
I am using SciPy's norm object here and I have a normal distribution here with a mean value of 100. and a standard deviation of 20.:
from scipy.stats import norm
dist = norm(loc=100., scale=20.)
I want to get the probability of a new instance being in locations... let's say... 70, or 120, how can I retrieve this probability using the norm object?
The norm object has a few methods such as norm.pdf, norm.cdf, norm.ppf, etc. I am not sure which one I can use for this task.
Thank you
First of all, you are talking about a normal distribution, which is a continuous distribution, so you cannot get the probability that a new instance is at an exact location (that would be 0 by definition).
In your example you can get the probability that the observation is, for example, > 70 or < 70 (the strict inequality makes no difference for continuous distributions, so >= and > are the same).
You need to use dist.cdf(70) for this to get P(X<=70) and 1 - dist.cdf(70) to get P(X>70)
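A short sketch of these calls, using the distribution from the question. The interval probability between 70 and 120 is an added illustration of the same idea, not something the question asked for explicitly.

```python
from scipy.stats import norm

dist = norm(loc=100.0, scale=20.0)

p_le_70 = dist.cdf(70)                       # P(X <= 70)
p_gt_70 = dist.sf(70)                        # P(X > 70), same as 1 - dist.cdf(70)
p_between = dist.cdf(120) - dist.cdf(70)     # P(70 < X <= 120)
density_at_70 = dist.pdf(70)                 # a density value, not a probability

print(p_le_70, p_gt_70, p_between, density_at_70)
```

Note that dist.pdf returns the height of the density curve, which is useful for comparing how plausible two locations are relative to each other but is not itself a probability.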
I have labeled 2D data. There are 4 labels in the set, and I know the correspondence of every point to its label. I'd like to, given a new arbitrary data point, find the probability that it has each of the 4 labels. It must belong to one and only one of the labels, so the probabilities should sum to 1.
What I've done so far is to train 4 independent sklearn GMMs (sklearn.mixture.GaussianMixture) on the data points associated with each label. It should be noted that I do not wish to train a single GMM with 4 components because I already know the labels, and don't want to re-cluster in a way that is worse than my known labels. (It would appear that there is a way to provide Y= labels to the fit() function, but I can't seem to get it to work).
In the above plot, points are colored by their known labels, and the contours represent the four independent GMMs fitted to these 4 sets of points.
For a new point, I attempted to compute the probability of its label in a couple ways:
GaussianMixture.predict_proba(): Since each independent GMM has only one distribution, this simply returns a probability of 1 for all models.
GaussianMixture.score_samples(): According to the documentation, this one returns the "weighted log probabilities for each sample". My procedure is, for a single new point, to call this function once on each of the four independently trained GMMs representing the distributions above. I do get semi-sensible results here: typically a positive number for the correct model and negative numbers for each of the three incorrect models, with more muddled results for points near intersecting distribution boundaries. Here's a typical clear-cut result:
2.904136, -60.881554, -20.824841, -30.658509
This point is actually associated with the first label and is least likely to be the second label (is farthest from the second distribution). My issue is how to convert the above scores into probabilities that sum to 1 and accurately represent the chance that the given point belongs to one and only one of the four distributions? Given that these are 4 independent models, is this possible? If not, is there another method I have overlooked that could allow me to train GMM(s) based on known labels and will provide probabilities that sum to 1?
In general, if you don't know how the scores are calculated but you know that there is a monotonic relationship between the scores and the probability, you can simply use the softmax function to approximate a probability, with an optional temperature variable that controls the spikiness of the distribution.
Let V be a NumPy array of your scores and tau be the temperature. Then,
p = np.exp(V/tau) / np.sum(np.exp(V/tau))
is your answer.
PS: Luckily, we know how sklearn GMM scoring works: the scores are log densities, so softmax with tau=1 is your exact answer, assuming the four labels are equally likely a priori.
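A minimal sketch of the softmax step, applied to the four scores quoted in the question. Subtracting the maximum before exponentiating is a standard numerical-stability trick added here; it does not change the result. The function name softmax is just an illustrative choice.

```python
import numpy as np

def softmax(scores, tau=1.0):
    """Turn scores into probabilities that sum to 1; tau is the temperature."""
    z = np.asarray(scores, dtype=float) / tau
    z = z - z.max()          # shift for numerical stability; result is unchanged
    e = np.exp(z)
    return e / e.sum()

# Scores from the four independent GMMs for one new point (from the question).
log_densities = [2.904136, -60.881554, -20.824841, -30.658509]
probs = softmax(log_densities)
print(probs)
```

If the labels are not equally frequent, one option is to add the log of each label's prior probability to its score before applying the softmax.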
Scikit documentation states that:
Method for initialization:
‘k-means++’ : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. See section Notes in k_init for more details.
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
My data has 10 (predicted) clusters and 7 features. However, I would like to pass an array of shape 10 by 6, i.e. I want 6 dimensions of each centroid to be predefined by me, but the 7th dimension to be initialized freely by k-means++. (In other words, I do not want to specify the full initial centroid, but rather control 6 dimensions and leave only one dimension free to vary for the initial cluster.)
I tried to pass a 10x6 array in the hope that it would work, but it just throws an error.
Sklearn does not allow you to perform this kind of fine-grained operation.
The only possibility is to provide a 7th feature value that is random, or similar to what Kmeans++ would have achieved.
So basically you can estimate a good value for this as follows:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin
nb_clust = 10
# your data
X = np.random.randn(1000, 7)
# your 6-column centroids
cent_6cols = np.random.randn(nb_clust, 6)
# find the nearest partial centroid for each point, using the first 6 columns only
# (this avoids calling predict on an unfitted KMeans, which recent scikit-learn versions may reject)
initial_prediction = pairwise_distances_argmin(X[:, 0:6], cent_6cols)
# For the 7th column you'll provide the average value
# of the points lying in the cluster given by your partial centroids
cent_7cols = np.zeros((nb_clust, 7))
cent_7cols[:, 0:6] = cent_6cols
for i in range(nb_clust):
    members = X[initial_prediction == i, 6]
    # fall back to the overall mean if no point was assigned to this centroid
    cent_7cols[i, 6] = members.mean() if members.size else X[:, 6].mean()
# now the 7th column is initialized from the data, in the spirit of k-means++
# so you can use cent_7cols as your initial centroids
truekm = KMeans(n_clusters=nb_clust, init=cent_7cols, n_init=1)
That is a very nonstandard variation of k-means. So you cannot expect sklearn to be prepared for every exotic variation. That would make sklearn slower for everybody else.
In fact, your approach is more like certain regression approaches (predicting the last value of the cluster centers) rather than clustering. I also doubt the results will be much better than simply setting the last value to the average of all points assigned to the cluster center using the other 6 dimensions only. Try partitioning your data based on the nearest center (ignoring the last column) and then setting the last column to be the arithmetic mean of the assigned data.
However, sklearn is open source.
So get the source code, and modify k-means. Initialize the last component randomly, and while running k-means only update the last column. It's easy to modify it this way, but it's very hard to design an efficient API that allows such customizations through trivial parameters. Use the source code to customize at this level.
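Rather than patching scikit-learn itself, the variation described above can be sketched in a few lines of NumPy. This is a plain Lloyd-style loop written for illustration, not scikit-learn's implementation; the function name kmeans_free_last_column and its parameters are hypothetical.

```python
import numpy as np

def kmeans_free_last_column(X, fixed_centers, n_iter=20, seed=0):
    """k-means variant: all but the last centroid coordinate stay fixed."""
    rng = np.random.default_rng(seed)
    k = fixed_centers.shape[0]
    centers = np.empty((k, X.shape[1]))
    centers[:, :-1] = fixed_centers
    # Initialize the free coordinate randomly from observed values.
    centers[:, -1] = rng.choice(X[:, -1], size=k, replace=False)

    def assign(c):
        # Squared Euclidean distance of every point to every centroid.
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers)
        for j in range(k):
            members = X[labels == j, -1]
            if members.size:              # leave empty clusters unchanged
                centers[j, -1] = members.mean()
    return centers, assign(centers)

# Synthetic data, matching the shapes in the question (10 clusters, 7 features).
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 7))
fixed = rng.standard_normal((10, 6))
centers, labels = kmeans_free_last_column(X, fixed)
print(centers[:, -1])
```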
I have 4 different distributions which I've fitted to a sample of observations. Now I want to compare my results and find the best solution. I know there are a lot of different methods to do that, but I'd like to use a quantile-quantile (q-q) plot.
The formulas for my 4 distributions are:
where K0 is the modified Bessel function of the second kind and zeroth order, and Γ is the gamma function.
My sample style looks roughly like this: (0.2, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4, 0.6, 0.7 ...), so I have multiple identical values and also gaps in between them.
I've read the instructions on this site and tried to implement them in python. So, like in the link:
1) I sorted my data from the smallest to the largest value.
2) I computed "n" evenly spaced points on the interval (0,1), where "n" is my sample size.
3) And this is the point I can't manage.
As far as I understand, I should now use the values I calculated beforehand (those evenly spaced values), put them in the inverse functions of my above distributions and thus compute the theoretical quantiles of my distributions.
For reference, here are the inverse functions (partly calculated with wolframalpha, and as far it was possible):
where W is the Lambert W-function and everything in brackets afterwards is the argument.
The problem is that apparently no inverse function exists for the first distribution. The next one would probably produce complex values (negative under the root, because b = 0.55 according to the fit), and the last two involve a Lambert W-function (which I'm unsure how to implement in Python).
So my question is, is there a way to calculate the q-q plots without the analytical expressions of the inverse distribution functions?
I'd appreciate any help you could give me very much!
A simpler and more conventional way to go about this is to compute the log likelihood for each model and choose that one that has the greatest log likelihood. You don't need the cdf or quantile function for that, only the density function, which you have already.
The log likelihood is just the sum of log p(x|model) where p(x|model) is the probability density of datum x under a given model. Here "model" = model with parameters selected by maximizing the log likelihood over the possible values of the parameters.
You can be more careful about this by integrating the log likelihood over the parameter space, taking into account also any prior probability assigned to each model; that would be a Bayesian approach.
It sounds like you are essentially looking to choose a model by minimizing the Kolmogorov-Smirnov (KS) statistic, which despite its heavy name is pretty simple: it is the largest difference between the model's cumulative distribution function and the empirical one. That's defensible, but I think comparing log likelihoods is more conventional, and also simpler since you need only the pdf.
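A minimal sketch of the log-likelihood comparison, assuming each fitted model is available as a density function. The two densities below (an exponential and a Rayleigh with made-up parameters) are stand-ins, since the question's actual formulas are not reproduced here; the data are the first few values quoted in the question.

```python
import numpy as np

def log_likelihood(pdf, data):
    """Sum of log p(x | model) over all observations."""
    return float(np.sum(np.log(pdf(np.asarray(data, dtype=float)))))

# Hypothetical candidate densities with already-fitted parameters.
def exponential_pdf(x, rate=1.5):
    return rate * np.exp(-rate * x)

def rayleigh_pdf(x, sigma=0.5):
    return x / sigma**2 * np.exp(-x**2 / (2 * sigma**2))

sample = [0.2, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4, 0.6, 0.7]

scores = {
    "exponential": log_likelihood(exponential_pdf, sample),
    "rayleigh": log_likelihood(rayleigh_pdf, sample),
}
best = max(scores, key=scores.get)   # model with the greatest log likelihood
print(scores, best)
```

When the candidate models have different numbers of fitted parameters, a penalized criterion such as AIC is a common refinement of the raw comparison.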
It happens that there is an easier way. It's taken me a day or two to dig around until I was pointed toward the right method in scipy.stats. I was looking for the wrong sort of name!
First, build a subclass of rv_continuous to represent one of your distributions. We know the pdf for your distributions, so that's what we define. In this case there's just one parameter. If more are needed just add them to the def statement and use them in the return statement as required.
>>> import numpy as np
>>> from scipy import stats
>>> param = 3/2
>>> class NoName(stats.rv_continuous):
...     def _pdf(self, x, param):
...         return param*np.exp(-param*x)
...
Now create an instance of this object, declare the lower end of its support (ie, the lowest value that the r.v. can assume), and what the parameters are called.
>>> noname = NoName(a=0, shapes='param')
I don't have an actual sample of values to play with. I'll create a pseudo-random sample.
>>> sample = noname.rvs(size=100, param=param)
Sort it to get the empirical quantiles (the order statistics of the sample), which play the role of the so-called 'empirical cdf' here.
>>> empirical_cdf = sorted(sample)
The sample has 100 elements, therefore generate 100 points at which to sample the inverse cdf, or quantile function, as discussed in the paper you referenced.
>>> theoretical_points = [(_-0.5)/len(sample) for _ in range(1, 1+len(sample))]
Get the quantile function values at these points.
>>> theoretical_cdf = [noname.ppf(_, param=param) for _ in theoretical_points]
Plot it all.
>>> from matplotlib import pyplot as plt
>>> plt.plot([0,3.5], [0, 3.5], 'b-')
[<matplotlib.lines.Line2D object at 0x000000000921B400>]
>>> plt.scatter(empirical_cdf, theoretical_cdf)
<matplotlib.collections.PathCollection object at 0x000000000921BD30>
>>> plt.show()
Here's the Q-Q plot that results.
Darn it ... Sorry, I was fixated on a slick solution to somehow bypass the missing inverse CDF and calculate the quantiles directly (and avoid any numerical approaches). But it can also be done by simple brute force.
First, define a fine grid of candidate quantiles for your distribution yourself (for instance, ten times finer than the original/empirical quantiles). Then calculate the corresponding CDF values and compare them one by one with the evenly spaced values that were calculated in step 2 in the question. The grid points whose CDF values deviate least from those values are the quantiles you were looking for.
The precision of this solution is limited by the resolution of the quantiles you defined yourself.
But maybe I'm wrong and there is a more elegant way to solve this problem, then I would be happy to hear it!
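The brute-force idea can be sketched as follows, assuming the CDF of the fitted distribution can be evaluated on an array. An exponential CDF with rate 1.5 is used as a stand-in because its exact quantile function is known and allows a sanity check; the function name numeric_quantiles and the grid limits are illustrative assumptions.

```python
import numpy as np

def numeric_quantiles(cdf, probs, x_min, x_max, n_grid=10001):
    """Approximate the inverse CDF by a nearest-match search on a fine grid."""
    grid = np.linspace(x_min, x_max, n_grid)
    cdf_values = cdf(grid)
    probs = np.asarray(probs, dtype=float)
    # For each target probability, pick the grid point whose CDF value is closest.
    idx = np.abs(cdf_values[None, :] - probs[:, None]).argmin(axis=1)
    return grid[idx]

rate = 1.5

def exp_cdf(x):
    return 1.0 - np.exp(-rate * x)

n = 100
probs = (np.arange(1, n + 1) - 0.5) / n          # evenly spaced points in (0, 1)
approx = numeric_quantiles(exp_cdf, probs, 0.0, 10.0)
exact = -np.log(1.0 - probs) / rate              # closed form, for comparison only
print(np.max(np.abs(approx - exact)))
```

The error is bounded by the grid spacing, so a finer grid buys more precision at the cost of more CDF evaluations.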