In sklearn.cross_decomposition.PLSRegression, we can obtain the latent variable scores from the X array using x_scores_.
I would like to extract the loadings to calculate the latent variable scores for a new array W. Intuitively, what I would do is: scores = W*loadings (matrix multiplication). I tried this using each of x_loadings_, x_weights_, and x_rotations_ as the loadings, as I could not figure out which array was the right one (there is little info on the sklearn website). I also tried to standardize W (subtracting the mean and dividing by the standard deviation of X) before multiplying by the loadings. But none of these worked (I tried it on the X array itself and could not reproduce the scores in the x_scores_ array).
Any help with this?
Actually, I just had to better understand sklearn's fit() and transform() methods. I need to use transform(W) to obtain the latent variable scores for the W array:
1. fit(): generates the model parameters from the training data
2. transform(): uses the parameters learned by fit() to transform a particular dataset
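A minimal sketch of that workflow (the arrays X, Y, and W below are random placeholders just to make the snippet self-contained):

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.RandomState(0)
X = rng.rand(20, 5)    # training predictors
Y = rng.rand(20, 1)    # training response
W = rng.rand(10, 5)    # new samples with the same columns as X

pls = PLSRegression(n_components=2)
pls.fit(X, Y)

# transform() reproduces the training scores stored on the estimator ...
print(np.allclose(pls.transform(X), pls.x_scores_))    # True (up to float tolerance)

# ... and gives the scores for new data: it centers/scales W with the statistics
# learned from X and projects onto x_rotations_ (not x_loadings_ or x_weights_).
W_scores = pls.transform(W)

# Roughly the manual equivalent, assuming the default scale=True:
W_scores_manual = (W - X.mean(axis=0)) / X.std(axis=0, ddof=1) @ pls.x_rotations_
print(np.allclose(W_scores, W_scores_manual))           # True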
I am running a multinomial logistic regression in sklearn, using sklearn.linear_model.LogisticRegression(multi_class="multinomial"). The dependent categorical variable has 3 options: Agree, Disagree, Unsure. The independent variables are two categorical variables: Education and Gender (binary gender for simplicity in this example). I get different results when I hand-calculate the probabilities from the regression coefficients than when I use the built-in predict_proba().
import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

mnlr = LogisticRegression(multi_class="multinomial")
mnlr.fit(
    pd.get_dummies(df[["Education", "Gender"]]),
    preprocessing.LabelEncoder().fit_transform(df["statement"])
)
I concatenate the outputs of mnlr.intercept_ and mnlr.coef_ into a regression coefficients table that looks like this:
Using mnlr.predict_proba(), I get results that I cast into a dataframe to which I add the independent variables like this:
These sum to 1 across the 3 potential categories for each data point.
However, I cannot seem to reproduce these results when I try to calculate the predicted probabilities by hand from the logistic regression coefficients.
First, for each Gender x Education combination, I calculate the logit (aka log-odds, if I understand correctly) by simply adding the intercept and the relevant variable terms. For example, to get the logit for a Woman with a Bachelor's degree with the Agree regression: 0.88076 + 0.21827 + 0.21687 = 1.31590. The table of logits looks like this:
From this table, as I understand it, I should be able to convert these logits (log-odds) to predicted probabilities with p = e^logit/(1+e^logit) for a given model and respondent (e.g., the probability that Women with a Bachelor's Agree with the statement). When I try this, however, I get very different results from those returned by .predict_proba(), and the hand-calculated probabilities do not sum to 1, as shown in the table below:
For example, Women with a Bachelor's here have a 0.78850 probability of Agreeing with the statement, instead of the 0.7819 probability. Additionally, the hand-calculated probabilities across the 3 categories do not sum to 1, but rather to 1.47146.
I am almost certain this is a basic error on my part, but I cannot for the life of me figure it out. What am I doing incorrectly?
I figured this one out eventually. The answer is probably obvious to folks who really know multinomial logistic regression. The struggle I was having was that I needed to apply the softmax function (also known more descriptively as the normalized exponential function) to the logits. This function involves exponentiating the logit (log-odds) for each class and then dividing it by the sum of exponentiated logits for all classes. In this example, for Women with a Bachelor's degree, this would mean:
p(Agree) = e^(logit_Agree) / (e^(logit_Agree) + e^(logit_Disagree) + e^(logit_Unsure)) = 0.737007424626824
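For anyone who wants to check this numerically, here is a minimal sketch of the same calculation (the logit values are made-up placeholders; substitute the row from your own logit table):

import numpy as np

# Hypothetical logits (log-odds) for one respondent across the three classes.
logits = np.array([1.31590, 0.20, -0.45])   # Agree, Disagree, Unsure (illustrative)

# Softmax: exponentiate each logit and divide by the sum of the exponentials.
probs = np.exp(logits) / np.exp(logits).sum()

print(probs)         # one probability per class
print(probs.sum())   # 1.0

# This is what sklearn does internally for multi_class="multinomial", so the
# result should match mnlr.predict_proba() row by row.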
Hopefully this will be helpful to anyone else trying to understand how to do this by hand! (For me it is really useful when trying to apply model-based inference as an alternative to design-based inference in sample surveys.)
Sources that got me here:
How do I correctly manually recreate sklearn (python) logistic regression predict_proba outcome for multiple classification, https://en.wikipedia.org/wiki/Softmax_function
I'm trying to write a Kernel Density Estimation algorithm in Tensorflow.
When fitting the KDE model, I am iterating through all the data in the current batch and, for each, I am creating a kernel using the tensorflow.contrib.distributions.MultivariateNormalDiag object:
self.kernels = [MultivariateNormalDiag(loc=data, scale=bandwidth) for data in X]
Later, when trying to predict the likelihood of a data point with respect to the model fitted above, for each data point I am evaluating, I am summing together the probabilities given by each of the kernels above:
tf.reduce_sum([kernel._prob(X) for kernel in self.kernels], axis=0)
This approach only works when X is a numpy array, as TF doesn't let you iterate over a Tensor. My question is whether or not there is a way to make the algorithm above work with X as a tf.Tensor or tf.Variable?
One answer that I found tackles fitting the KDE and predicting the probabilities in one fell swoop. The implementation is a bit hacky, though.
def fit_predict(self, data):
    return tf.map_fn(
        lambda x: tf.div(
            tf.reduce_sum(
                tf.map_fn(lambda x_i: self.kernel_dist(x_i, self.bandwidth).prob(x),
                          self.fit_X)),
            tf.multiply(tf.cast(data.shape[0], dtype=tf.float64), self.bandwidth[0])),
        self.X)
The first tf.map_fn iterates through the data for which we are calculating the likelihood, summing together the probabilities from each of the individual kernels.
The second tf.map_fn iterates through all the data that we use to fit our model, and creates a tf.contrib.distributions.Distribution (here this is parameterized by kernel_dist).
self.X and self.fit_X are placeholders that are created when initializing the KernelDensity object.
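For completeness, here is a rough sketch of how such a KernelDensity object could be wired up (TF 1.x / tf.contrib style to match the snippet above; the class layout and parameter names are my assumptions, not necessarily the original implementation):

import numpy as np
import tensorflow as tf

class KernelDensity(object):
    def __init__(self, n_dims, bandwidth):
        # Diagonal bandwidth vector shared by every kernel (assumed constant here).
        self.bandwidth = tf.constant([bandwidth] * n_dims, dtype=tf.float64)
        self.kernel_dist = tf.contrib.distributions.MultivariateNormalDiag
        # Placeholders: fit_X holds the training data, X the evaluation points.
        self.fit_X = tf.placeholder(tf.float64, shape=[None, n_dims])
        self.X = tf.placeholder(tf.float64, shape=[None, n_dims])

    # fit_predict() from the answer above, reproduced so the sketch runs on its own.
    def fit_predict(self, data):
        return tf.map_fn(
            lambda x: tf.div(
                tf.reduce_sum(
                    tf.map_fn(lambda x_i: self.kernel_dist(x_i, self.bandwidth).prob(x),
                              self.fit_X)),
                tf.multiply(tf.cast(data.shape[0], dtype=tf.float64), self.bandwidth[0])),
            self.X)

train = np.random.randn(100, 2)
test = np.random.randn(10, 2)

kde = KernelDensity(n_dims=2, bandwidth=0.5)
densities = kde.fit_predict(train)   # graph node; `data` only supplies the sample count

with tf.Session() as sess:
    print(sess.run(densities, feed_dict={kde.fit_X: train, kde.X: test}))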
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

x = data.values
y = target.values
lda = LDA(solver='eigen', shrinkage='auto', n_components=2)
df_lda = lda.fit(x, y).transform(x)
df_lda.shape
This is a small part of the code. I am trying to reduce the dimensionality to the most discriminative directions. To my understanding, the transform() function projects the data to maximize class separation for my data set and should return an array of shape (n_samples, n_components).
But my df_lda is of shape (614, 1).
What am I missing here? Or is my data not linearly separable?
For K distinct classes in target.values, there are at most K-1 components in the transformed data (without further dimensionality reduction). Since you only have two classes in your data set, there is only one transformed component, so you cannot get more components than that.
I suppose it might be helpful for sklearn to issue a warning when you request more components than are available.
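A quick way to see the n_classes - 1 cap on synthetic data (the shapes and class counts below are made up for illustration; note that recent sklearn versions may raise an error instead of silently capping when you explicitly request too many components):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

rng = np.random.RandomState(0)
X = rng.randn(614, 10)

# Two classes: at most 2 - 1 = 1 discriminant direction.
y2 = rng.randint(0, 2, size=614)
print(LDA().fit(X, y2).transform(X).shape)   # (614, 1)

# Three classes: up to 3 - 1 = 2 directions, so a 2-D projection is possible.
y3 = rng.randint(0, 3, size=614)
print(LDA().fit(X, y3).transform(X).shape)   # (614, 2)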
I'm reconstructing a paper. They trained Gradient Boosting Regression Trees given the input X and soft targets y_s to get the final output y with minimum mean squared error. According to the paper, they implemented all decision-tree-based methods using the scikit-learn package without any modification. This is what I want to do.
If you know the solution already I would be happy to hear, otherwise here are my thoughts:
Just for simplification, assume we have a binary problem with
X = [[x1, x2, x3], [x1, x2, x3], ...] and
y_s = [[0.4, 0.6], [0.8, 0.2], ...].
Regarding the gradient boosting tree for classification (see the first link below), I can only feed in a 1-dim class array:
(y : array-like, shape = [n_samples]) Target values (integers in
classification, real numbers in regression) For classification, labels
must correspond to classes.
So even if I overrode the cost function (e.g. with a cross-entropy loss that can handle soft labels), I still could not feed in the 2-dim soft labels (at least not directly).
Another idea was to reduce it to 1-dim by taking only one soft label (this only works for a binary problem where both soft labels add up to 1) and using GradientBoostingRegressor instead. But then again only one class is possible, and I also cannot train independent models like
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import GradientBoostingRegressor

X = [[1, 2, 3], [1, 2, 3], [4, 5, 6]]
y = [[3.141, 2.718], [3.141, 2.718], [2.718, 3.141]]
rgr = MultiOutputRegressor(GradientBoostingRegressor(random_state=0))
rgr.fit(X, y)

X_test = [[1.5, 2.5, 3.5], [3.5, 4.5, 5.5]]
rgr.predict(X_test)
because of the correlation between the outputs.
Big picture:
1. Extraction of combined features
2. a) Training: extracted features (Xb), original labels (y) -> logistic regression
   b) Prediction: soft labels (yb)
3. a) Training: original features (X), soft labels (yb) -> GradientBoostingTree
   b) Evaluation: predicting normal labels (y_)
-> Importance of the original features
The entire procedure is worthless without the soft labels. I mean, it has to be possible somehow, but I cannot figure out how...
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
http://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regression.html
scikit-learn's docs on multi-output decision trees should point you in the right direction.
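For example, sklearn's tree-based regressors that natively support multi-output targets, such as DecisionTreeRegressor or RandomForestRegressor, can be fit directly on the two-column soft-label array (GradientBoostingRegressor itself only accepts a 1-D y). A minimal sketch with made-up numbers:

from sklearn.ensemble import RandomForestRegressor

# Toy data: three samples, soft labels over two classes (all values made up).
X = [[1, 2, 3], [1, 2, 3], [4, 5, 6]]
y_s = [[0.4, 0.6], [0.4, 0.6], [0.8, 0.2]]

# A multi-output tree ensemble fits both soft-label columns jointly; each split
# is chosen on the impurity averaged over the outputs, so the correlation
# between the two columns is taken into account.
rgr = RandomForestRegressor(n_estimators=100, random_state=0)
rgr.fit(X, y_s)

print(rgr.predict([[1.5, 2.5, 3.5]]))   # one row of approximate soft labels
print(rgr.feature_importances_)         # importance of the original features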
I was trying to train a logistic regression (LR) classifier on a text dataset. Unlike the common scenario where the raw text is fed directly to a tf-idf vectorizer, each original text line was first transformed into a dictionary like {a: 0.1, phrase: 0.5, in: 0.3, line: 0.8}, in which the weights were computed by some specific rules and some words were omitted. So, in order to feed these dictionaries to the LR classifier, I chose FeatureHasher to do the hashing trick. However, I found the LR classifier became extremely slow when the n_features parameter of FeatureHasher grew large, say 10^8.
But as far as I know, neither the memory cost nor the computation cost of a sparse matrix should grow with the dimensionality while the number of non-zero elements is fixed. For example, if we have a two-element sparse vector [coordinates: (1, 2), values: (3, 4)] whose original dimension is 10, and we change the hash range to 20 so that it becomes [coordinates: (3, 7), values: (3, 4)], there is no difference in storing these two vectors; and if we compute its distance to another sparse vector, we only need to traverse a list with a fixed number of elements, so the computation cost is also fixed.
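A quick way to check the storage claim empirically (the dictionaries below are made up to stand in for the weighted-token dicts described above):

from sklearn.feature_extraction import FeatureHasher

docs = [{"a": 0.1, "phrase": 0.5, "in": 0.3, "line": 0.8},
        {"another": 0.2, "line": 0.4}]

for n_features in (2 ** 10, 2 ** 20, 10 ** 8):
    Xh = FeatureHasher(n_features=n_features).transform(docs)
    # The CSR matrix stores only the non-zero entries, so nnz does not grow
    # with n_features; only the declared shape changes.
    print(Xh.shape, Xh.nnz)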
I think there must be something wrong with my understanding, or I must have missed something about sklearn's LR classifier. I hope someone can correct me, thanks!