I am trying to use scikit-learn to perform linear regression on labeled time series data.
My data format is data=(timestamp,value,label)
The labels that are assigned to my data are either 0 or 1.
I tried to follow this example from the scikit-learn website.
My questions:
1- Where are the labels of the training data in the example? Are they in diabetes_y_train?
2- What are the return values of the method predict()? In my code, it returns an array of n_samples predicted values in the range [0, 1]. However, I expected it to return binary values of either 0 or 1 (no intermediate values).
1 - diabetes_y_train contains the labels (targets) for the training data.
2 - You are using a regression function, so it is correct to get continuous values. If you want binary output you are not solving a regression problem but a classification one: you can either set a threshold to discretise the predictions, or use one of the classifiers offered by sklearn.
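For example, a rough sketch of both options on synthetic data (the dataset here is a made-up stand-in for (timestamp, value) features with 0/1 labels, not the asker's actual data):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for two features with 0/1 labels.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Option 1: keep the regressor and threshold its continuous output at 0.5.
regr = LinearRegression().fit(X_train, y_train)
pred_thresholded = (regr.predict(X_test) >= 0.5).astype(int)

# Option 2: use a classifier, whose predict() returns 0/1 labels directly.
clf = LogisticRegression().fit(X_train, y_train)
pred_classifier = clf.predict(X_test)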
1 - Yes
2 - predict() calculates a floating-point number, because the example is trying to predict a floating-point value, not a binary value. So there is no yes/no answer, only a predicted value; to estimate the error, the differences are squared and averaged in np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2)
I am running a multinomial logistic regression in sklearn, using sklearn.linear_model.LogisticRegression(multi_class="multinomial"). The dependent categorical variable has 3 options: Agree, Disagree, Unsure. The independent variables are two categorical variables: Education and Gender (binary gender for simplicity in this example). I get different results when I hand-calculate the probabilities from the regression coefficients versus using the built-in predict_proba().
import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

mnlr = LogisticRegression(multi_class="multinomial")
mnlr.fit(
    pd.get_dummies(df[["Education", "Gender"]]),                   # one-hot encode the two categorical predictors
    preprocessing.LabelEncoder().fit_transform(df["statement"]),   # encode Agree/Disagree/Unsure as integer classes
)
I concatenate the outputs of mnlr.intercept_ and mnlr.coef_ into a regression coefficients table that looks like this:
Using mnlr.predict_proba(), I get results that I cast into a dataframe to which I add the independent variables like this:
These sum to 1 across the 3 potential categories for each data point.
However, I cannot seem to reproduce these results when I try to calculate the predicted probabilities by hand from the logistic regression coefficients.
First, for each Gender x Education combination, I calculate the logit (aka log-odds, if I understand correctly) by simply adding the intercept and the relevant variable terms. For example, to get the logit for a Woman with a Bachelor's degree with the Agree regression: 0.88076 + 0.21827 + 0.21687 = 1.31590. The table of logits looks like this:
From this table, as I understand it, I should be able to convert these logits (log-odds) to predicted probabilities: p = e^logit/(1+e^logit) for a given model and respondent (e.g., probability that Women with Bachelor's Agree with the statement). When I try this, however, I get much different results than I receive from .predict_proba() and the hand-calculated probabilities do not sum to 1, as indicated in the table below:
For example, Women with Bachelor's here have a 0.78850 probability to Agree with the statement, in place of the 0.7819 probability. Additionally, the hand-calculated probabilities across the 3 categories do not sum to 1, but rather to 1.47146.
I am almost certain this is a basic error on my part, but I cannot for the life of me figure it out. What am I doing incorrectly?
I figured this one out eventually. The answer is probably obvious to folks who really know multinomial logistic regression. The struggle I was having was that I needed to apply the softmax function (also known more descriptively as the normalized exponential function) to the logits. This function involves exponentiating the logit (log-odds) for each class and then dividing it by the sum of exponentiated logits for all classes. In this example, for Women with a Bachelor's degree, this would mean:
P(Agree) = exp(logit_Agree) / (exp(logit_Agree) + exp(logit_Disagree) + exp(logit_Unsure)) = 0.737007424626824
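In code, the softmax step looks like this (the Disagree and Unsure logits below are hypothetical placeholders; only the Agree logit comes from the worked example above):

import numpy as np

# Logits for one respondent in the order (Agree, Disagree, Unsure);
# the Agree value is from the example, the other two are placeholders.
logits = np.array([1.31590, 0.40, -0.15])

# Softmax: exponentiate each logit and divide by the sum of exponentiated logits.
probs = np.exp(logits) / np.exp(logits).sum()
print(probs)        # one probability per class
print(probs.sum())  # sums to 1, unlike applying the sigmoid to each logit separately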
Hopefully this will be helpful to anyone else trying to understand how to do this by hand! (Which for me is really useful for trying to apply model-based inference as an alternative to design-based inference in sample surveys).
Sources that got me here:
How do I correctly manually recreate sklearn (python) logistic regression predict_proba outcome for multiple classification, https://en.wikipedia.org/wiki/Softmax_function
I have a dataset that consists of different features, like "gender". The task of the model is to determine if the annual income is above or below 50k.
Let's say I have a trained network that does the classification.
Now I want to see how often the classifier makes false positive and false negative predictions, grouped by the gender feature.
The basic idea is a confusion matrix of sorts, but not class versus class, rather class versus feature.
The image below illustrates the result I would like to have.
The basic idea is as follows:
1) Make a prediction with the network.
2) Add the predicted values as a new column to your dataset; you now have a new dataset data_new.
Your dataset now has two columns, one for the predicted and one for the true values. You can calculate the overall accuracy with a boolean comparison (predicted 1 with true 1 is a correct prediction; 0 vs 1 and 1 vs 0 are wrong predictions).
3) Now you can filter the new data on any column you want, in this case on the gender feature.
4) Now you can calculate the accuracy (and the false positive / false negative counts) with respect to the chosen gender, as sketched below.
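A minimal sketch of these steps with pandas, on a made-up toy dataset (the column names gender, y_true and y_pred are placeholders, not taken from the question):

import pandas as pd

# Toy stand-in: true labels, model predictions already attached as a column,
# and the gender feature (all values invented for illustration).
data_new = pd.DataFrame({
    "gender": ["male", "female", "male", "female", "male", "female"],
    "y_true": [1, 0, 0, 1, 1, 0],
    "y_pred": [1, 1, 0, 0, 1, 0],
})

# Steps 3 and 4: group by gender, then count false positives / false negatives
# and compute the per-group accuracy.
for gender, group in data_new.groupby("gender"):
    fp = ((group["y_pred"] == 1) & (group["y_true"] == 0)).sum()
    fn = ((group["y_pred"] == 0) & (group["y_true"] == 1)).sum()
    acc = (group["y_pred"] == group["y_true"]).mean()
    print(gender, "false positives:", fp, "false negatives:", fn, "accuracy:", round(acc, 2))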
I have an SGDClassifier model trained with scikit-learn. I extract the feature names with .get_feature_names() and the coefficients with .coef_
I combine the 2 columns in a dataframe like this:
feature value
hiroshima 3.918584
wildfire 3.287680
earthquake 3.256817
massacre 3.186762
storm 3.124809
... ...
job -1.696438
song -1.736640
as -1.956571
nowplaying -2.028240
write -2.263968
How can I interpret these feature importances?
What does a positive high value mean?
What does a low negative value mean?
SGDClassifier fits a linear model, meaning that the decision is essentially based on
SUM_i w_i * f_i + b
where w_i is the weight attached to feature f_i. Consequently, you can interpret these numbers literally as "votes" for the positive/negative class, at a scale proportional to their absolute value. All your classifier does is add up these weighted features, add the intercept_ value from your model, and classify based on the sign.
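A small sketch of that decision rule on synthetic data, reproducing predict() by hand from coef_ and intercept_:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Fit a small binary linear model on synthetic data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = SGDClassifier(loss="hinge", random_state=0).fit(X, y)

# Decision for one sample by hand: sum of weight * feature value, plus the intercept.
x = X[0]
score = np.dot(clf.coef_[0], x) + clf.intercept_[0]
predicted_class = int(score > 0)   # positive score votes for class 1, negative for class 0

print(score, predicted_class, clf.predict(X[:1])[0])  # the last two should agree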
I am working on a linear regression task and I only know the concept of simple linear regression where we give an 'x' value and it predicts the 'y' value.
I have generated semi-random numbers between 100 and 100000 using a specific algorithm and saved the result in a CSV column.
Now I want to use this column to train a linear regressor so that it learns the sequence between these numbers and can then predict the next number on the basis of the last number I give it.
Or I could treat this as a sequence generation problem using an LSTM. Would an LSTM be a good approach for this, where I feed in this 1-D dataset of numbers and the LSTM generates more numbers based on it?
I only have one column, which is the 'x' column; there is no 'y' column.
I searched "How to use linear regression on 1-D data" but found nothing.
Is there any way to train a Linear Regression on 1-D data to predict a number?
I am using Python language for this task.
My CSV file looks like this:
I think you can borrow ideas from time series analysis techniques, like moving averages and autoregressive models, and build a dataset that fits a regression problem.
You can plot the autocorrelation to find how many lags you need to consider for the next prediction.
You can use the pandas autocorr function to compute the autocorrelation up to some lag and plot the correlogram.
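For instance, a rough sketch of a correlogram with pandas and matplotlib (the file name numbers.csv and the column name value are placeholders for your CSV):

import pandas as pd
import matplotlib.pyplot as plt

# Placeholder file/column names: adjust them to your CSV.
series = pd.read_csv("numbers.csv")["value"]

# Autocorrelation of the series with itself shifted by 1..20 steps.
lags = range(1, 21)
acfs = [series.autocorr(lag) for lag in lags]

plt.bar(lags, acfs)
plt.xlabel("lag")
plt.ylabel("autocorrelation")
plt.show()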
Let's say your last 5 values are highly correlated with the latest value.
Then you can stack these numbers as one row like this; in your case the latest value is t:
| ------------- X_train ------------- | | -- y_train -- |
1st row -> 226, 200, 1169, 134, 117 (t-1, t-2, t-3, t-4, t-5)   target value -> 239 (t)
2nd row -> 200, 1169, 134, 117, 759 (t-2, t-3, t-4, t-5, t-6)   target value -> 226 (t-1)
3rd row -> 1169, 134, 117, 759, 102 (t-3, t-4, t-5, t-6, t-7)   target value -> 200 (t-2)
... and so on.
The pandas shift method can be used to shift the series lag by lag and build this dataset easily.
Now you have an X_train and y_train set. Split the dataset and train a linear model, for example as sketched below.
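A minimal sketch of the whole idea with 5 lags (the series here is random noise standing in for the CSV column, so the fit itself will be meaningless; only the mechanics matter):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in series; in practice this would be the single CSV column.
series = pd.Series(np.random.randint(100, 100000, size=500).astype(float))

# Build lagged features with shift(): column "lag_k" holds the value k steps back.
n_lags = 5
frame = pd.DataFrame({f"lag_{k}": series.shift(k) for k in range(1, n_lags + 1)})
frame["target"] = series
frame = frame.dropna()              # drop the first rows that have no full history

X_train, y_train = frame.drop(columns="target"), frame["target"]
model = LinearRegression().fit(X_train, y_train)

# Predict the next number from the latest 5 values (most recent value = lag_1).
last_window = pd.DataFrame([series.iloc[::-1].head(n_lags).to_numpy()],
                           columns=X_train.columns)
print(model.predict(last_window))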
In Neural Networks, the number of samples used for training data is 5000 and before the data is given for training it was normalized using the formula
y' = (y - mean(y)) / stdev(y)
Now I want to de-normalise the data after getting the predicted output. Generally, a test set is used for prediction, which here is 2000 samples. In order to de-normalize, the following formula is used:
y = y' * stdev(y) + mean(y)
This approach is taken from the following thread: How to denormalise (de-standardise) neural net predictions after normalising input data
Could anyone explain to me how the same mean and standard deviation used in normalizing the training data (5000 x 2100) can be used in de-normalizing the predicted data? As you know, the test data (2000 x 2100) is used for prediction, and the two sample counts are different.
The denormalization equation is simple algebra: it's the same equation as normalization, but solved for y instead of y'. The function is to reverse the normalization process, recovering the "shape" of the original data; that's why you have to use the original stdev and mean.
Normalization is a process of shifting the data to center on 0 (using the mean), and then squeezing the distribution to a standard normal curve (for a new stdev of 1.0). To return to the original shape, you have to un-shift and un-squeeze the same amounts as the original distribution.
Note that we expect the predicted data to have a mean of 0 and a stdev of around 1.0 (with some variation, owing to the central limit theorem). Your worry is not silly: we do have a different sample count for the stdev.
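A small sketch of the round trip with synthetic numbers, using the training set's statistics in both directions regardless of how many test samples there are:

import numpy as np

# Synthetic placeholder for 5000 training targets.
y_train = np.random.normal(loc=50.0, scale=10.0, size=5000)
train_mean, train_std = y_train.mean(), y_train.std()

# Normalize before training: y' = (y - mean(y)) / stdev(y).
y_train_norm = (y_train - train_mean) / train_std

# Suppose the network returns normalized predictions for 2000 test samples.
pred_norm = np.random.normal(loc=0.0, scale=1.0, size=2000)

# De-normalize with the SAME training mean and stdev: y = y' * stdev(y) + mean(y).
pred = pred_norm * train_std + train_mean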