In R, I have a data frame with two categorical predictors, one of which has multiple levels, and a categorical response. I am running a multinomial logistic regression of the response on the two categorical predictors plus their interaction.
library(VGAM)
x1 <- as.factor(rep(c(1,2,3,3,2,1,1,2,3,3,2,1),5))
x2 <- as.factor(rep(c(1,1,1,1,1,1,2,2,2,2,2,2),5))
y <- as.factor(rep(c(1,2,3,1,2,3,1,2,3,1,2,3),5))
VGAM's vglm function has the ability to handle the categorical variables and their interactions.
M <- vglm(y ~ x1*x2, family=multinomial)
However, I now have to do this work in Python, and I am having a hard time getting the categorical variables to function as cleanly in statsmodels as they do in R. R does the categorical encoding from a factor variable just fine and then does the interactions. statsmodels has not done that for me (yet).
I have the Python function that fits multinomial logistic regressions, smf.mnlogit (smf coming from `import statsmodels.formula.api as smf`). How can I use that with the factor variables to get the interactions that I get in R?
Here is the Python code I've tried:
# import packages
#
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Define data
#
x1 = np.array([1,2,3,3,2,1,1,2,3,3,2,1] * 5)
x2 = np.array([1,1,1,1,1,1,2,2,2,2,2,2] * 5)
y = np.array([1,2,3,1,2,3,1,2,3,1,2,3] * 5)
# Make data frame
#
df = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y})
# Make the columns categorical
#
df['x1'] = df['x1'].astype('category')
df['x2'] = df['x2'].astype('category')
df['y'] = df['y'].astype('category')
# fit the multinomial logistic regression
#
mlr = smf.mnlogit(formula='y ~ x1*x2', data=df).fit()
I get the following error:
ValueError: endog has evaluated to an array with multiple columns that has shape (60, 3). This occurs when the variable converted to endog is non-numeric (e.g., bool or str).
I think the categorical column gets one-hot encoded once it is used as the target variable, which is why you are getting this error. A possible solution would be to encode the categories as numbers and then normalize them before supplying them to the logit() function (although encoding string categories as integer values is not strictly correct).
Consider the following example:
# Assumes df_log is a DataFrame and `target` holds the name of the response column
from sklearn import preprocessing
import pandas as pd
import statsmodels.formula.api as smf
# Convert the categorical/string target to integer codes
df_log[target] = pd.Categorical(df_log[target])
df_log[target] = df_log[target].cat.codes
# Scale the integer codes to [0, 1] before passing them to logit()
min_max_scaler = preprocessing.MinMaxScaler()
df_log[[target]] = min_max_scaler.fit_transform(df_log[[target]])
formula = "target ~ x1 + x2"  # here the response column is literally named 'target'
model = smf.logit(formula=formula, data=df_log).fit()
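Applying the same "encode the response as numbers" idea back to the mnlogit call from the question might look like the sketch below. This is only a sketch and has not been run against the asker's data; whether the interaction model actually converges on such a small sample is a separate matter.
import statsmodels.formula.api as smf
# mnlogit needs endog to be a single numeric column, so use the integer
# category codes for y; x1 and x2 can stay categorical and the formula
# builds the dummy and interaction columns.
df['y_code'] = df['y'].cat.codes
mlr = smf.mnlogit('y_code ~ C(x1) * C(x2)', data=df).fit()
print(mlr.summary())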
Related
I have a book dataset, and I want to build a fixed-effects regression model.
I want fixed effects for year, month, day, and book_genre in my model, so that I take out the effects of the same books appearing in multiple observations. I want to use Python for my fixed-effects model. My variables are:
Variables I want to fix: year, month, day, and book_genre.
Other variables in the model: Read_or_not (categorical); ne_factor, x1, x2, x3, x4, x5 (numerical).
Response variable: Y
I used this code but I get the error "DataFrame input must have a MultiIndex with 2 levels".
I would highly appreciate help with fixing my code to make a fixed-effects regression model.
I also attach a PNG of the dataset to show the variables:
import pandas as pd
from linearmodels import PanelOLS
import numpy as np
df = pd.read_csv('all_a.csv')
df
# Set the index for fixed effects
data = df.set_index(['year', 'month', 'day','book_genre'])
data = df.dropna(subset=['book_id','year','month','day','Read_or_not ' ,'ne_factor,','Y','book_genre','X1', 'X2','X3',"X4" ,"X5"])
# Regression
FE = PanelOLS(data.attention_data_score, data['Y'],
              entity_effects=True,
              time_effects=True)
# Result
result = FE.fit(cov_type='clustered',
                cluster_entity=True,
                cluster_time=True)
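For reference, PanelOLS expects the data to be indexed by exactly two levels, entity and then time, which is what the error message is pointing at. A rough sketch under that constraint follows; the combined date column and the exact column set are assumptions for illustration, not the asker's code.
import pandas as pd
from linearmodels import PanelOLS
df = pd.read_csv('all_a.csv')
# Assemble a single time level from the year/month/day columns and use
# book_genre as the entity level, so the index has exactly two levels
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
data = df.set_index(['book_genre', 'date'])
# entity_effects absorbs book_genre; time_effects absorbs the date level
mod = PanelOLS(data['Y'], data[['X1', 'X2', 'X3', 'X4', 'X5']],
               entity_effects=True, time_effects=True)
result = mod.fit(cov_type='clustered', cluster_entity=True, cluster_time=True)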
I am trying to create some code that gives weight to the most impactful features.
My dataframe contains both nominal (categorical) and numerical data.
example data:
[Brand] [Model] [Car_price] [...] [Prime]
BMW X1 40,000 300
Y is Prime and X is all the other columns.
I tried using the following:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
df = pd.read_csv(data, delimiter=";")
#df = df.dropna(axis=1)
array = df.values
X = array[:,(6,7,9,12,13,14,15,16,17,18,19,20,21,22,23,24,25,27,34,35,37,44,45,47,48,54,61,62)]
Y = array[:,51]
forest = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
forest.fit(X, Y)
And get the following error: ValueError: could not convert string to float
I know there is a way to transform strings into numerical data, but I was wondering whether it is necessary. What fixes can I apply to get weighted features?
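A rough sketch of one common approach, assuming the column names from the example (Prime as the target, which is an assumption here): the string columns do need to become numeric for scikit-learn, for instance via one-hot encoding with pd.get_dummies, after which feature_importances_ gives the weights.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
df = pd.read_csv(data, delimiter=";")
y = df['Prime']                                 # assumed target column
X = pd.get_dummies(df.drop(columns=['Prime']))  # one-hot encode the string columns
# RandomForestRegressor may be more appropriate if Prime is continuous
forest = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
forest.fit(X, y)
# Importance (weight) of each encoded feature, largest first
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))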
I wanted to create my own Transformer using scikit-learn's FunctionTransformer and followed their example as a dry run. It worked, but then I wanted to take the inverse of that transformation just to see the end result. However, when I tried inverse_transform, it returned the same thing as the transformation. How do I get the original values? I ask because I plan on using this transformation to transform a target variable and then make predictions. Those predictions will need to be inverse-transformed after I predict.
As a sidebar, should I fit on y_train and transform y_test? Or can I transform y all at once?
My transformer:
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
import random
# Build a random target series and split it
randomlist = []
for i in range(0, 100):
    n = random.randint(1, 100)
    randomlist.append(n)
y = pd.Series(randomlist)
y_train = y[:80]
y_test = y[80:]
# Log-transform the target
target_trans = FunctionTransformer(np.log, validate=True, check_inverse=True)
logy_train = target_trans.fit_transform(y_train.values.reshape(-1, 1))
logy_test = target_trans.transform(y_test.values.reshape(-1, 1))
target_trans.inverse_transform(y_train.values.reshape(-1, 1))
Within FunctionTransformer() you not only need to set check_inverse=True but also supply the actual inverse function itself via inverse_func.
So for the above,
target_trans = FunctionTransformer(np.log, inverse_func=np.exp,
                                   validate=True, check_inverse=True)
which yields the desired result.
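For completeness, a minimal round-trip check (reusing y_train from the question). Note that the inverse should be applied to the transformed values, logy_train, not to the raw y_train:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
target_trans = FunctionTransformer(np.log, inverse_func=np.exp,
                                   validate=True, check_inverse=True)
logy_train = target_trans.fit_transform(y_train.values.reshape(-1, 1))
recovered = target_trans.inverse_transform(logy_train)
# exp(log(y)) recovers the original values up to floating point error
print(np.allclose(recovered, y_train.values.reshape(-1, 1)))  # True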
I am working with healthcare insurance claims data and would like to identify fraudulent claims. I have been reading online to try to find a better method, and I came across the following code on scikit-learn.org.
Does anyone know how to select the outliers? The code plots them in a graph, but I would like to select those outliers if possible.
I have tried appending the y_predictions to the x dataframe but that has not worked.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
np.random.seed(42)
# Generate train data
X = 0.3 * np.random.randn(100, 2)
# Generate some abnormal novel observations
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X + 2, X - 2, X_outliers]
# fit the model
clf = LocalOutlierFactor(n_neighbors=20)
y_pred = clf.fit_predict(X)
y_pred_outliers = y_pred[200:]
Below is the code I tried.
X['outliers'] = y_pred
The first 200 data points are inliers while the last 20 are outliers. When you call fit_predict on X, you get either outlier (-1) or inlier (1) for each row in y_pred. So to get the predicted outliers, you need to find where y_pred is -1 and take the corresponding rows of X. The script below will give you the outliers in X.
X_pred_outliers = [each[1] for each in list(zip(y_pred, X.tolist())) if each[0] == -1]
I combine y_pred and X into an array and check whether y is -1; if so, I collect the X values.
However, there are eight errors in the predictions (8 out of 220). These errors are -1 values in y_pred[:200] and 1 values in y_pred[200:]. Please be aware of these errors as well.
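Equivalently, since X is a NumPy array in this example, a boolean mask does the same thing without the zip:
import numpy as np
# y_pred is -1 for predicted outliers and 1 for inliers
outlier_mask = y_pred == -1
X_pred_outliers = X[outlier_mask]
# The corresponding LOF scores (more negative = more abnormal), if needed
outlier_scores = clf.negative_outlier_factor_[outlier_mask]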
I am currently using Scikit-Learn's LogisticRegression to build a model. I have used
from sklearn import preprocessing
scaler=preprocessing.StandardScaler().fit(build)
build_scaled = scaler.transform(build)
to scale all of my input variables prior to training the model. Everything works fine and produces a decent model, but my understanding is that the coefficients produced by LogisticRegression.coef_ are based on the scaled variables. Is there a transformation of those coefficients that can be used to adjust them so they can be applied to the non-scaled data?
I am thinking ahead to an implementation of the model in a productionized system, and attempting to determine whether all of the variables need to be pre-processed in some way in production before scoring with the model.
Note: the model will likely have to be re-coded within the production environment and the environment is not using python.
You have to divide by the scaling you applied to normalise the feature, but also multiply by the scaling that you applied to the target.
Suppose
each feature variable x_i was scaled (divided) by scale_x_i
the target variable was scaled (divided) by scale_y
then
orig_coef_i = coef_i_found_on_scaled_data / scale_x_i * scale_y
Here's an example using pandas and sklearn LinearRegression
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
boston = load_boston()
# Looking at the description of the data tells us the target variable name
# print boston.DESCR
data = pd.DataFrame(
    data=np.c_[boston.data, boston.target],
    columns=list(boston.feature_names) + ['MVAL'],
)
data.head()
X = boston.data
y = boston.target
lr = LinearRegression()
lr.fit(X,y)
orig_coefs = lr.coef_
coefs1 = pd.DataFrame(
    data={
        'feature': boston.feature_names,
        'orig_coef': orig_coefs,
    }
)
coefs1
This shows us our coefficients for a linear regression with no scaling applied.
# | feature| orig_coef
# 0| CRIM | -0.107171
# 1| ZN | 0.046395
# 2| INDUS | 0.020860
# etc
We now normalise all our variables
# Now we normalise the data
scalerX = StandardScaler().fit(X)
scalery = StandardScaler().fit(y.reshape(-1,1)) # Have to reshape to avoid warnings
normed_X = scalerX.transform(X)
normed_y = scalery.transform(y.reshape(-1,1)) # Have to reshape to avoid warnings
normed_y = normed_y.ravel() # Turn y back into a vector again
# Check it's worked
# print np.mean(X, axis=0), np.mean(y, axis=0) # Should be 0s
# print np.std(X, axis=0), np.std(y, axis=0) # Should be 1s
We can do the regression again on this normalised data...
# Now we redo our regression
lr = LinearRegression()
lr.fit(normed_X, normed_y)
coefs2 = pd.DataFrame(
    data={
        'feature': boston.feature_names,
        'orig_coef': orig_coefs,
        'norm_coef': lr.coef_,
        'scaleX': scalerX.scale_,
        'scaley': scalery.scale_[0],
    },
    columns=['feature', 'orig_coef', 'norm_coef', 'scaleX', 'scaley'],
)
coefs2
...and apply the scaling to get back our original coefficients
# We can recreate our original coefficients by dividing by the
# scale of the feature (scaleX) and multiplying by the scale
# of the target (scaleY)
coefs2['rescaled_coef'] = coefs2.norm_coef / coefs2.scaleX * coefs2.scaley
coefs2
When we do this we see that we have recreated our original coefficients.
# | feature| orig_coef| norm_coef| scaleX| scaley| rescaled_coef
# 0| CRIM | -0.107171| -0.100175| 8.588284| 9.188012| -0.107171
# 1| ZN | 0.046395| 0.117651| 23.299396| 9.188012| 0.046395
# 2| INDUS | 0.020860| 0.015560| 6.853571| 9.188012| 0.020860
# 3| CHAS | 2.688561| 0.074249| 0.253743| 9.188012| 2.688561
For some machine learning methods, the target variable y must be normalised as well as the feature variables x. If you've done that, you need to include this "multiply by the scale of y" step as well as "divide by the scale of X_i" to get back the original regression coefficients.
Hope that helps
Short answer, to get LogisticRegression coefficients and intercept for unscaled data (assuming binary classification, and lr is a trained LogisticRegression object):
you must divide your coefficient array element-wise by the scaler.scale_ array (available since v0.17): coefficients = np.true_divide(lr.coef_, scaler.scale_)
you must subtract from your intercept the inner product of the resulting coefficients (the division result) array with the scaler.mean_ array: intercept = lr.intercept_ - np.dot(coefficients, scaler.mean_)
you can see why the above needs to be done if you consider that every feature is normalized by subtracting its mean (stored in the scaler.mean_ array) from it and then dividing it by its standard deviation (stored in the scaler.scale_ array).
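A minimal sketch of those two steps, assuming scaler is the fitted StandardScaler, lr the LogisticRegression trained on the scaled data, and X_raw the unscaled feature matrix (the last name is only for this illustration):
import numpy as np
coefficients = np.true_divide(lr.coef_, scaler.scale_)
intercept = lr.intercept_ - np.dot(coefficients, scaler.mean_)
# Sanity check: the decision function computed on raw data should match
# the one the model computes on scaled data
manual = X_raw @ coefficients.ravel() + intercept
original = lr.decision_function(scaler.transform(X_raw))
print(np.allclose(manual, original))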
You can use a pipeline with two steps: scaling and regression. It takes raw data as input and produces the desired regression.
Or, if you explicitly want to get the coefficients, you can manually combine the LogisticRegression coefficients with the scaler parameters, which are scaler.mean_ and scaler.scale_ (called std_ in older versions).
To do so, note that StandardScaler normalizes data this way: v_norm = (v - M(v)) / sigma(v). Here M(v) is the mean of the raw variable v and sigma(v) is its standard deviation; they are stored in the scaler.mean_ and scaler.scale_ arrays respectively.
Then LogisticRegression takes these normalized variables, multiplies them by LogisticRegression.coef_, and adds intercept_.
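A short sketch of the pipeline variant (X and y here stand for any raw feature matrix and binary target):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Scaling happens inside the pipeline at both fit and predict time,
# so raw, unscaled data goes in at both ends
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)
# If coefficients for the raw features are still wanted, combine the steps
scaler = model.named_steps['standardscaler']
lr = model.named_steps['logisticregression']
raw_coef = lr.coef_ / scaler.scale_
raw_intercept = lr.intercept_ - raw_coef @ scaler.mean_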