I have a book dataset and I want to build a fixed effects regression model.
I want fixed effects for year, month, day, and book_genre in my model, so that I take out the effects of the same books appearing in multiple observations. I want to use Python code for my fixed effects model. My variables are:
Variables I want fixed effects for: year, month, day, and book_genre.
Other variables in the model: Read_or_not (categorical); ne_factor, x1, x2, x3, x4, x5 (numerical).
Response variable: Y
I used this code but I get the error "DataFrame input must have a MultiIndex with 2 levels".
I would highly appreciate help with fixing my code so that it builds a fixed effects regression model.
I also attach a PNG of the dataset to show the variables:
import pandas as pd
from linearmodels import PanelOLS
import numpy as np
df = pd.read_csv('all_a.csv')
df
# Set the index for fixed effects
data = df.set_index(['year', 'month', 'day', 'book_genre'])
data = df.dropna(subset=['book_id', 'year', 'month', 'day', 'Read_or_not', 'ne_factor',
                         'Y', 'book_genre', 'X1', 'X2', 'X3', 'X4', 'X5'])

# Regression
FE = PanelOLS(data.attention_data_score, data['Y'],
              entity_effects=True,
              time_effects=True)

# Result
result = FE.fit(cov_type='clustered',
                cluster_entity=True,
                cluster_time=True)
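For what it's worth, here is a minimal sketch of one way to satisfy the two-level MultiIndex that PanelOLS expects, assuming book_id identifies each book and using the column names from the snippet above (adjust X1..X5 versus x1..x5 to match the actual file). Since book_genre is presumably constant within a book, the book (entity) effects absorb it, and a single date built from year/month/day covers those three effects through the time dummies:
import pandas as pd
from linearmodels import PanelOLS

df = pd.read_csv('all_a.csv')

# Drop rows with missing values in any model variable
model_cols = ['book_id', 'year', 'month', 'day', 'book_genre',
              'Read_or_not', 'ne_factor', 'Y', 'X1', 'X2', 'X3', 'X4', 'X5']
df = df.dropna(subset=model_cols)

# PanelOLS wants exactly two index levels: entity and time
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
data = df.set_index(['book_id', 'date'])

# One-hot encode the categorical regressor; the numeric columns pass through
exog = pd.get_dummies(data[['Read_or_not', 'ne_factor', 'X1', 'X2', 'X3', 'X4', 'X5']],
                      drop_first=True).astype(float)

# Book (entity) and date (time) fixed effects, with two-way clustered errors
FE = PanelOLS(data['Y'], exog, entity_effects=True, time_effects=True)
result = FE.fit(cov_type='clustered', cluster_entity=True, cluster_time=True)
print(result)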
In R, I have a data frame with two categorical predictors, one of which has multiple levels, and a categorical response. I am running a multinomial logistic regression on each of the categorical predictors, plus the interaction of the two categorical predictors.
library(VGAM)
x1 <- as.factor(rep(c(1,2,3,3,2,1,1,2,3,3,2,1),5))
x2 <- as.factor(rep(c(1,1,1,1,1,1,2,2,2,2,2,2),5))
y <- as.factor(rep(c(1,2,3,1,2,3,1,2,3,1,2,3),5))
VGAM's vglm function has the ability to handle the categorical variables and their interactions.
M <- vglm(y ~ x1*x2, family=multinomial)
However, I now have to do this work in Python, and I am having a hard time getting the categorical variables to function as cleanly in statsmodels as they do in R. R does the categorical encoding from a factor variable just fine and then does the interactions. statsmodels has not done that for me (yet).
I have the Python function that fits multinomial logistic regressions, smf.mnlogit (smf coming from `import statsmodels.formula.api as smf`). How can I use it with the factor variables to get the interactions that I get in R?
Here is the Python code I've tried:
# import packages
#
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Define data
#
x1 = np.array([1,2,3,3,2,1,1,2,3,3,2,1] * 5)
x2 = np.array([1,1,1,1,1,1,2,2,2,2,2,2] * 5)
y = np.array([1,2,3,1,2,3,1,2,3,1,2,3] * 5)
# Make data frame
#
df = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y})
# Make the columns categorical
#
df['x1'] = df['x1'].astype('category')
df['x2'] = df['x2'].astype('category')
df['y'] = df['y'].astype('category')
# fit the multinomial logistic regression
#
mlr = smf.mnlogit(formula='y ~ x1*x2', data=df).fit()
I get the following error:
ValueError: endog has evaluated to an array with multiple columns that has shape (60, 3). This occurs when the variable converted to endog is non-numeric (e.g., bool or str).
I think the categorical target column gets one-hot encoded once it is used as the target variable, which is why you are getting this error. A possible solution would be to encode the categories as numbers and then normalize before supplying them to the logit() function (although it is not really right to encode string categories as integer values).
Consider the following example:
from sklearn import preprocessing  # provides MinMaxScaler

# df_log is your data frame; target is the name of the response column
df_log[target] = pd.Categorical(df_log[target])
df_log[target] = df_log[target].cat.codes
min_max_scaler = preprocessing.MinMaxScaler()
df_log[[target]] = min_max_scaler.fit_transform(df_log[[target]])
formula = target + " ~ x1 + x2"
model = smf.logit(formula=formula, data=df_log).fit()
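An alternative sketch that stays closer to the R call, under the assumption that it is acceptable to leave the class labels as plain integers: keep y numeric so that patsy does not expand it into several endog columns, and mark the predictors as categorical inside the formula with C(), where * gives main effects plus the interaction just as in R. With this small, nearly deterministic toy data the optimizer may warn about convergence; the point is only the formula handling:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Same toy data as in the question
x1 = np.array([1, 2, 3, 3, 2, 1, 1, 2, 3, 3, 2, 1] * 5)
x2 = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2] * 5)
y = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3] * 5)
df = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y})

# y stays integer-coded; C() treats x1 and x2 as factors, * adds the interaction
mlr = smf.mnlogit('y ~ C(x1) * C(x2)', data=df).fit(method='bfgs', maxiter=1000)
print(mlr.summary())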
I have a pandas data frame that contains several columns. I need to perform a multivariate linear regression. Before doing that I would like to analyze the R, R2, adjusted R2, and p-value of each independent variable with respect to the dependent variable.
For R and R2 I have no problem, since I can calculate the correlation matrix, select only the dependent variable, and then see the R coefficient between it and all the independent variables. Then I can square these values to obtain the R2.
My problem is how to do the same for the adjusted R2 and the p-value.
In the end, what I want to obtain is something like this:
Variable R R2 ADJUSTED_R2 p_value
A 0.4193 0.1758 ...
B 0.2620 0.0686 ...
C 0.2535 0.0643 ...
All the values are with respect to the dependent variable, let's say Y.
The following will not give you ALL the answers, but it WILL get you going using python, pandas and statsmodels for regression analyses.
Given a dataframe like this...
# Imports
import pandas as pd
import numpy as np
import statsmodels.api as sm

# A dataframe with random numbers
np.random.seed(123)
rows = 12
listVars = ['y', 'x1', 'x2', 'x3']
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_1 = pd.DataFrame(np.random.randint(100,150,size=(rows, len(listVars))), columns=listVars)
df_1 = df_1.set_index(rng)
print(df_1)
...you can get any regression results using the statsmodels library and altering the result = model.rsquared part in the snippet below:
x = df_1['x1']
x = sm.add_constant(x)
model = sm.OLS(df_1['y'], x).fit()
result = model.rsquared
print(result)
Now you have r-squared. Use model.pvalues for the p-value. And use dir(model) to take a closer look at other model results (there is more in the results object than shown here).
Now, this should get you going to obtain your desired results.
To get desired results for ALL combinations of variables / columns, the question and answer here should get you very far.
Edit: You can have a closer look at some common regression results using model.summary(). Using that together with dir(model) you can see that not ALL regression results are available the same way the p-values are via model.pvalues. To get Durbin-Watson, for example, you'll have to use durbinwatson = sm.stats.stattools.durbin_watson(model.resid, axis=0).
This post has got more information on the issue.
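To tie this back to the table in the question, a small sketch that reuses the df_1 example above (treating y as the dependent variable) can loop over the predictors one at a time and collect R, R2, adjusted R2 and the p-value of each slope. Note that np.sqrt(model.rsquared) gives only the magnitude of R, so take the sign from the slope or the correlation if you need it:
import numpy as np
import pandas as pd
import statsmodels.api as sm

rows_out = []
for col in ['x1', 'x2', 'x3']:
    # Single-predictor OLS of y on this column, with an intercept
    X = sm.add_constant(df_1[col])
    model = sm.OLS(df_1['y'], X).fit()
    rows_out.append({'Variable': col,
                     'R': np.sqrt(model.rsquared),
                     'R2': model.rsquared,
                     'ADJUSTED_R2': model.rsquared_adj,
                     'p_value': model.pvalues[col]})
print(pd.DataFrame(rows_out))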
I'm dealing with Azure ML, and my goal is to see what happens if I have a fixed quantity (as a percentage) of missing values in my dataset.
My idea could be:
Starting from a dataset (take the Adult dataset as an example), duplicate the original dataset and call the copy X by convention. Dataset X will contain randomly placed missing values amounting to 20% of the data. Once we have the original dataset and the duplicated dataset X, we can use a neural net algorithm, create training and test sets, and then train the neural net with dataset X as input. What would be interesting to see is the global error produced. Afterwards, we can imagine expanding the range of missing values in dataset X: starting from 20%, then 40%, and so on. I think the hardest part is duplicating the original dataset and so creating dataset X with these missing values.
In which way can I do it? Using modules in Azure ML, or maybe R/Python scripts?
Just sharing my idea; please see the sample code and comments below.
import numpy as np
import pandas as pd
# Origin DataFrame
df = pd.DataFrame(np.random.randn(6,4))
# Copy data via flatten data matrix as an array
array = df.values.flatten()
# insert missing data by percent
# Define the percent of missing data
percent = 0.2
size = len(array)
# Randomly pick distinct indices whose values will be set to NaN
# (replace=False avoids choosing the same position twice)
chosen = np.random.choice(size, int(size*percent), replace=False)
array[chosen] = np.nan
# Create a new DataFrame with missing data
df2 = pd.DataFrame(np.reshape(array, (6,4)))
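To check how close df2 ends up to the requested percentage, the overall share of NaN cells can be inspected:
# Fraction of cells in df2 that are NaN
print(df2.isna().mean().mean())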
Hope it helps.
I'm trying to use scikit-learn in Python to work on a couple of different classification problems (RF, GBM, etc.). In addition to building models and making predictions, I'd like to see variable importance. I know there is a way to get the importances:
importances = clf.feature_importances_
print(importances)
but how do I get something more refined that connects the importance to the variable name (i.e., summary(gbm) or varImp(randomForest) in R), especially if it's a categorical variable with multiple levels?
The variable importance (or feature importance) is calculated for all the features that you are fitting your model to. This pseudo code gives you an idea of how variable names and importance can be related:
import pandas as pd
train = pd.read_csv("train.csv")
cols = ['hour', 'season', 'holiday', 'workingday', 'weather', 'temp', 'windspeed']
clf = YourClassifiers()
clf.fit(train[cols], train.targets) # targets/labels
print(len(clf.feature_importances_))
print(len(cols))
You will see that the lengths of the two lists being printed are the same - you can essentially map the lists together or manipulate them how you wish. If you'd like to show variable importance nicely in a plot, you could use this:
import numpy as np
import matplotlib.pyplot as plt
plt.figure(figsize=(6 * 1.618, 6))
index = np.arange(len(cols))
bar_width = 0.35
plt.bar(index, clf.feature_importances_, color='black', alpha=0.5)
plt.xlabel('features')
plt.ylabel('importance')
plt.title('Feature importance')
plt.xticks(index + bar_width, cols)
plt.tight_layout()
plt.show()
If you don't want to use this method (meaning that you are fitting all columns, not just the selected few set in the cols variable), then you can get the column/feature/variable names of your data with train.columns.values (and then map this list together with the variable importance list, or manipulate it in some other way).
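As a self-contained sketch of that mapping (using a RandomForestClassifier on toy data with placeholder column names), a pandas Series indexed by the feature names gives output in the spirit of varImp(randomForest) in R:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
cols = ['hour', 'season', 'temp', 'windspeed']   # placeholder feature names
X = pd.DataFrame(rng.rand(200, len(cols)), columns=cols)
y = rng.randint(0, 2, size=200)                  # toy binary target

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pair each feature name with its importance and sort, largest first
feat_imp = pd.Series(clf.feature_importances_, index=cols).sort_values(ascending=False)
print(feat_imp)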
I generate a set of features for input, that I store as a table using pandas and the CSV format.
(Each column header represents a feature name, except for the first, blank column, which is where the class labels are stored for each row.)
My next step is reading the table from the CSV file into scikit-learn. (I'm currently doing this with pandas again.) However, after training and experimenting with my models using different feature selection methods (and different initially generated features), I want the NAMES of the selected features.
I assume this should be trivial, but I just haven't found how to do it.
(Note: I am NOT working on standard text documents, so "CountVectorizer" and "NaiveBayes"/nltk and the like do not help me).
I need a method to get the selected features, (and preferably something to drop the unselected ones, for when I apply the models and selected features on new "test" data).
Thank you very much!
My data is currently loaded like this:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, LabelBinarizer
from sklearn.linear_model import LogisticRegression

def load_data(filename="Feat_normalized.csv"):
    df = pd.read_csv(filename, index_col=0)
    lb = LabelEncoder()
    labels = lb.fit_transform(df.index.values)
    features = df.values
    feature_names = list(df.columns)
    feature_names.pop(0)  # Remove index.
    return (features, labels, lb)

features, labels, lb_encoder = load_data(filename)
X, y = features, labels
clf_logit = LogisticRegression(penalty="l1", dual=False, class_weight='auto')
X_reduced = clf_logit.fit_transform(X, y)
print('New sparse (filtered) features matrix size:')
print(X_reduced.shape)
#Then fit to various models, Random forests, SVM, etc'..
Truncated Example of the first 2 rows in the input data/csv:
AA_C AA__D AA__E AA_F AA__G AA_H AA_I AA_K AA_L AA_M
Mammal_sequence_1.0.fasta 3.838099345 0.456591162 3.764884604 3.620232638 3.460992571 3.858487012 2.69247235 3.18710619 3.671029774 4.625996297 1.542632799
(AA_* = feature names; Mammal_sequence_1.0.fasta = class name/label, one per row, stored in the blank-headered first column.)
Thank you very much!
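Without the original data, here is a hedged sketch of one way to recover the selected feature names with current scikit-learn, wrapping an L1-penalised logistic regression in SelectFromModel: get_support() returns the boolean mask that maps back to the names, and transform() drops the unselected columns from new test data as well (all data and names below are placeholders):
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
feature_names = ['AA_C', 'AA_D', 'AA_E', 'AA_F', 'AA_G']   # placeholder names
X = rng.rand(40, len(feature_names))
y = rng.randint(0, 2, size=40)

# L1-penalised logistic regression as the selector (liblinear supports l1)
selector = SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear'))
selector.fit(X, y)

mask = selector.get_support()                  # boolean mask of kept features
selected_names = np.array(feature_names)[mask]
print('Selected features:', list(selected_names))

# The same transform drops the unselected columns from new test data too
X_reduced = selector.transform(X)
print('Reduced shape:', X_reduced.shape)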