I'm implementing a multinomial logistic regression model in Python using Scikit-learn. Here's my code:
X = pd.concat([each for each in feature_cols], axis=1)
y = train[["<5", "5-6", "6-7", "7-8", "8-9", "9-10"]]
lm = LogisticRegression(multi_class='multinomial', solver='lbfgs')
lm.fit(X, y)
However, I'm getting ValueError: bad input shape (50184, 6) when it tries to execute the last line of code.
X is a DataFrame with 50184 rows, 7 columns. y also has 50184 rows, but 6 columns.
I ultimately want to predict in what bin (<5, 5-6, etc.) the outcome falls. All the independent and dependent variables used in this case are dummy columns which have a binary value of either 0 or 1. What am I missing?
The Logistic Regression 3-class Classifier example illustrates that LogisticRegression expects the target to be a vector rather than a matrix; in that example the target is the iris dataset's class variable, encoded as the values [0, 1, 2].
To convert the dummy matrix to a series, you could multiply each column by a different integer and then, assuming it's a pandas.DataFrame, call .sum(axis=1) on the result. Something like:
# assign each dummy column a distinct integer code (1, 2, ..., 6)
for i, col in enumerate(y.columns.tolist(), 1):
    y.loc[:, col] *= i
# collapse the one-hot matrix into a single label vector
y = y.sum(axis=1)
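Alternatively, assuming every row of y contains exactly one 1 (i.e. the columns are one-hot), you can get the labels directly without modifying y in place:
y_labels = y.idxmax(axis=1)           # column name of the single 1 in each row, e.g. "5-6"
y_codes = y.values.argmax(axis=1)     # or integer codes 0..5 instead of the bin names
lm.fit(X, y_labels)                   # LogisticRegression accepts string labels directly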
I fitted a LogisticRegression on a training set and got accuracies of ~80% on both the training and test sets.
Then I wanted to make predictions on the test set, producing a score for each student_id depending on whether they answered_correctly or not (1 if yes, 0 if no).
I did this:
features_X = X.columns # getting columns names of X
# X_test is an array created from a previous train_test_split step.
test_df = pd.DataFrame(columns=features_X, data=X_test)
predictions = grid_logit.predict(test_df[features_X])
#Create a DataFrame with predictions
submission = pd.DataFrame({'Id':test_df['student_id'],'Answered_correctly':predictions})
#Visualize the first 5 rows
submission.head()
Id Answered_correctly
12992348 0
7268428 0
9497321 1
588792 1
5045118 1
As you can see, it classifies each user as either 0 or 1.
What I want is something like this:
Id Answered_correctly
12992348 0.32
7268428 0.52
9497321 0.65
with the Answered_correctly values corresponding to the probability of being in class 1.
NB: Using the predict_proba function returns an error:
Exception: Data must be 1-dimensional
EDIT:
I replaced predict with predict_proba(test_df[[features_X]]), but it returns an error: None of [[ features_X cols]] are in the [columns]
predict_proba returns the probability estimates for each class. Given that you have two classes (0 and 1), it will return an array of shape (n_samples, 2).
The error message comes from pandas, which requires each column passed to the DataFrame constructor to be 1-dimensional; as mentioned above, the predict_proba output is 2-dimensional.
Only pass the probability estimates for class 1 (predictions[:, 1]) to the dataframe constructor and it should work fine:
submission = pd.DataFrame({'Id': test_df['student_id'], 'Answered_correctly': predictions[:, 1]})
Additional note:
Since test_df was constructed with exactly the columns in features_X, you do not need to subset it with test_df[features_X]; passing test_df directly is sufficient:
predictions = grid_logit.predict_proba(test_df)
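Putting it together, a minimal sketch reusing grid_logit, test_df, and features_X from the question (note that test_df[[features_X]] with double brackets fails because features_X is already a list-like of column names, so the extra brackets create a nested list; use single brackets):
proba = grid_logit.predict_proba(test_df[features_X])   # shape (n_samples, 2)
submission = pd.DataFrame({'Id': test_df['student_id'],
                           'Answered_correctly': proba[:, 1]})   # probability of class 1
submission.head()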
I am very new to time series modeling and statsmodels, and I am trying to understand the AR model in statsmodels. Suppose I have a data record y of 1000 samples and I fit an AR(1) model on y. Then I generate the in-sample prediction from this model as y_pred. I do this as follows:
from statsmodels.tsa.ar_model import AutoReg
model = AutoReg(y,1).fit()
y_pred = model.predict()
I get the parameters of the model using model.params.
I would like to know: after estimating the model parameters, how does statsmodels calculate the in-sample predictions? For example, how is y_pred[10] calculated?
I am sorry if the question is too basic, thanks for the help.
Per Wikipedia:
The autoregressive model specifies that the output variable depends linearly on its own previous values and on a stochastic term (an imperfectly predictable term).
In your example, the model has one predictor: the lagged value of y. In this simple case, the .predict() method multiplies each lagged value by the estimated slope parameter for that predictor and adds the estimated intercept. So y_pred[10] equals the fitted slope parameter times y[9], plus the intercept estimate.
Here is an example:
from statsmodels.tsa.ar_model import AutoReg
y = [1, 2, 3, 6, 2, 9, 1]
model = AutoReg(y,1).fit()
model.params
# array([ 5.72953737, -0.49466192])
The first value in the params array is the estimated intercept parameter and the second value is the estimated linear (slope) parameter.
y_pred = model.predict()
y_pred
# array([5.23487544, 4.74021352, 4.2455516 , 2.76156584, 4.74021352, 1.27758007])
The first value in the y_pred array is the predicted value for the second value in the y array. It is calculated as:
-0.49466192 * 1 + 5.72953737 = 5.23487544
The second value in the y_pred array is computed as:
-0.49466192 * 2 + 5.72953737 = 4.74021353
and so on...
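You can verify this directly from the fitted parameters (reusing model, y, and y_pred from above):
import numpy as np
intercept, slope = model.params
manual = intercept + slope * np.asarray(y[:-1])   # predictions for y[1] ... y[6]
print(np.allclose(manual, y_pred))                # True: each prediction uses the previous observation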
With a supervised learning method, we have features (inputs) and targets (outputs). If we have multi-dimensional targets that sum to 1 row-wise (e.g [0.3, 0.4, 0.3]) why does sklearn's RandomForestRegressor seem to normalize all outputs/predictions to sum to 1 when the training data sums to 1?
It seems like somewhere in the source code of sklearn it is normalizing outputs if the training targets sum to 1, but I haven't been able to find it. I've gotten as far as the BaseDecisionTree class, which seems to be used by random forests, but haven't been able to see any normalization going on in there. I created a gist to show how it works: when the row-wise sums of the targets don't equal 1, the outputs of the regressor do not sum to 1, but when the row-wise sums of the targets DO equal 1, the predictions seem to be normalized. Here is the demonstration code from the gist:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
# simulate data
# 12 rows train, 6 rows test, 5 features, 3 columns for target
features = np.random.random((12, 5))
targets = np.random.random((12, 3))
test_features = np.random.random((6, 5))
rfr = RandomForestRegressor(random_state=42)
rfr.fit(features, targets)
preds = rfr.predict(features)
print('preds sum to 1?')
print(np.allclose(preds.sum(axis=1), np.ones(12)))
# normalize targets to sum to 1
norm_targets = targets / targets.sum(axis=1, keepdims=1)
rfr.fit(features, norm_targets)
preds = rfr.predict(features)
te_preds = rfr.predict(test_features)
print('predictions all sum to 1?')
print(np.allclose(preds.sum(axis=1), np.ones(12)))
print('test predictions all sum to 1?')
print(np.allclose(te_preds.sum(axis=1), np.ones(6)))
As one last note, I tried running a comparable fit in other random forest implementations (H2O in Python, in R: rpart, Rborist, RandomForest) but didn't find another implementation that allows multiple outputs.
My guess is that there is a bug in the sklearn code which is mixing up classification and regression somehow, and the outputs are being normalized to 1 like a classification problem.
What can be misleading here is that you are only looking at the resulting sum of the output values. The reason all predictions add up to 1 when the model is trained with the normalized labels is that it will only ever predict among the multi-output arrays it has seen. And this happens because, with so few samples, the model is overfitting and the decision tree is de facto acting like a classifier.
In other words, looking at an example where the outputs are not normalised (the same applies to a DecisionTreeRegressor):
import numpy as np
from sklearn.tree import DecisionTreeRegressor
features = np.random.random((6, 5))
targets = np.random.random((6, 3))
rfr = DecisionTreeRegressor(random_state=42)
rfr.fit(features, targets)
If we now predict on a new set of random features, we will be getting predictions among the set of outputs the model has been trained on:
features2 = np.random.random((6, 5))
preds = rfr.predict(features2)
print(preds)
array([[0.0017143 , 0.05348525, 0.60877828], #0
[0.05232433, 0.37249988, 0.27844562], #1
[0.08177551, 0.39454957, 0.28182183],
[0.05232433, 0.37249988, 0.27844562],
[0.08177551, 0.39454957, 0.28182183],
[0.80068346, 0.577799 , 0.66706668]])
print(targets)
array([[0.80068346, 0.577799 , 0.66706668],
[0.0017143 , 0.05348525, 0.60877828], #0
[0.08177551, 0.39454957, 0.28182183],
[0.75093787, 0.29467892, 0.11253746],
[0.87035059, 0.32162589, 0.57288903],
[0.05232433, 0.37249988, 0.27844562]]) #1
So logically, if all training outputs add up to 1, the same will apply to the predicted values.
If we take the row-wise sums of both the targets and the predicted values, we see that every predicted row's sum already appears among the target sums:
preds_sum = np.unique(preds.sum(1))
targets_sum = np.unique(targets.sum(1))
np.isin(preds_sum, targets_sum).all()
# True
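As an additional check (reusing np, preds, and targets from the snippet above), every predicted row is literally one of the training target rows, which is why any row-wise property of the targets, such as summing to 1, carries over to the predictions:
# each prediction here comes from a leaf containing a single training sample
all_rows_seen = all(any(np.allclose(p, t) for t in targets) for p in preds)
print(all_rows_seen)   # True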
I generate a simple linear model in which the X variables (dimension d = 100) come from a multivariate normal with zero covariance. Only the first 10 variables have true coefficients of 1; the rest have coefficients of 0. Hence, theoretically, the ridge regression estimates should be the true coefficients divided by (1 + C), where C is the penalty constant.
import numpy as np
from sklearn import linear_model
def generate_data(n):
    d = 100
    w = np.zeros(d)
    for i in range(0, 10):
        w[i] = 1.0
    trainx = np.random.normal(size=(n, d))
    e = np.random.normal(size=(n))
    trainy = np.dot(trainx, w) + e
    return trainx, trainy
Then I use:
n = 200
x,y = generate_data(n)
regr = linear_model.Ridge(alpha=4,normalize=True)
regr.fit(x, y)
print(regr.coef_[0:20])
Under normalize=True, I get the first 10 coefficients to be around 20% (i.e. 1/(1+4)) of the true value of 1. With normalize=False, I get the first 10 coefficients to be around 1, the same result as a simple linear regression model. Moreover, since I generate the data with mean 0 and std 1, normalize=True shouldn't do anything, as the data is already "normalized". Can someone explain to me what is going on here? Thanks!
It's important to understand that normalizing and standardizing are not the same thing, and you cannot do both at the same time. You can either normalize or standardize.
Standardizing usually refers to transforming the data so that it has zero mean and unit variance. This can be achieved, for example, by subtracting the mean and dividing by the standard deviation, feature (column) wise.
Normalizing commonly refers to transforming the data values to a range between 0 and 1, for example by dividing by the length (L2 norm) of the vector. That does not mean the mean becomes 0 and the variance 1.
After generating trainx and trainy, they are not normalized yet. Print them to see for yourself.
So, when normalize=True, trainx will be normalized by subtracting the mean and dividing by the L2 norm of each column (according to the sklearn documentation).
When normalize=False, trainx will remain as is.
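To make the distinction concrete, a small sketch of the two transforms on a single feature column (standardizing vs. dividing by the L2 norm):
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0])
standardized = (x - x.mean()) / x.std()   # mean 0, variance 1
l2_normalized = x / np.linalg.norm(x)     # unit length; here all values fall between 0 and 1
print(standardized)
print(l2_normalized)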
If you set normalize=True, every feature column is divided by its L2 norm; in other words, the magnitude of every feature column is reduced, which forces the estimated coefficients to be larger (Xβ should stay roughly constant: the smaller X, the larger β). With larger coefficients, a greater L2 penalty is imposed, so the objective puts more weight on the L2 penalty than on the data-fit term (Xβ). As a result, the coefficient estimates are shrunk more strongly than in pure linear regression.
By contrast, if normalize=False, X is larger and β is smaller. Given the same alpha, the L2 penalty is marginal and more weight is on the data-fit term, so the result is close to pure linear regression.
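To see the mechanics, here is a rough sketch (using generate_data from the question) that reproduces what normalize=True did in older scikit-learn versions: center each column, divide it by its L2 norm, fit a plain Ridge, and rescale the coefficients back:
import numpy as np
from sklearn import linear_model
x, y = generate_data(200)
x_centered = x - x.mean(axis=0)
col_norms = np.linalg.norm(x_centered, axis=0)   # roughly sqrt(200) for standard-normal columns
regr = linear_model.Ridge(alpha=4)
regr.fit(x_centered / col_norms, y)
coef_rescaled = regr.coef_ / col_norms           # put the coefficients back on the original scale
print(coef_rescaled[:10])                        # around 1 / (1 + 4) = 0.2, matching normalize=True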
I am trying to use scikit-learn to perform linear regression on labeled time series data.
My data format is data=(timestamp,value,label)
The labels that are assigned to my data are either 0 or 1.
I tried to follow this example from the scikit-learn website.
My questions:
1- Where are the labels of the training data in the example? Are they in diabetes_y_train?
2- What are the return values of the predict() method? In my code, it returns an array of n_samples predicted values in the range [0, 1]. However, I expected it to return binary values of either 0 or 1 (no intermediate values).
1 - Yes, diabetes_y_train contains the labels (targets) for the training data.
2 - You are using a regression function, so it is expected that the output is continuous. If you want binary output, you are not solving a regression problem but a classification one; you can either set a threshold to discretise the predictions or use one of the classifiers offered by sklearn (see the sketch below).
1 - Yes
2 - predict calculates a floating point number, because the example is trying to predict a floating point value and not a binary value. So there is no yes/no answer, but a predicted value; to estimate the error, the squared differences are averaged in np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2)
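Following up on the suggestion above to either threshold the regression output or switch to a classifier, here is a minimal sketch; the names X_train, y_train, and X_test are placeholders, since the actual variable names are not shown in the question:
from sklearn.linear_model import LinearRegression, LogisticRegression
# Option 1: keep the regression and threshold its continuous output
reg = LinearRegression().fit(X_train, y_train)           # X_train / y_train are placeholders
binary_pred = (reg.predict(X_test) >= 0.5).astype(int)   # 0.5 is an arbitrary cut-off; tune it
# Option 2: treat it as a classification problem directly
clf = LogisticRegression().fit(X_train, y_train)
class_pred = clf.predict(X_test)                         # hard 0/1 labels
class_proba = clf.predict_proba(X_test)                  # class probabilities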