I am trying to create a multiple linear regression model to predict
the rating a guest gives to a hotel (Reviewer_Score) in Python using statsmodels.
Review_Total_Negative_Word_Counts is how long their negative comments about the hotel are
Total_Number_of_Reviews is how many reviews the hotel has
Review_Total_Positive_Word_Counts is how long their positive comments about the hotel are
Total_Number_of_Reviews_Reviewer_Has_Given is how many reviews the guest has given on the site
Attitude is a categorical variable: GOOD or BAD
Reason is reason for visit (Leisure or Business)
Continent is the continent which the guest came from (multiple levels)
Solo is whether the traveler is a solo traveler ('Yes' or 'No')
Season is during which season the guest stayed at the hotel ('Fall', 'Winter', 'Summer', 'Spring')
As you can see, I have some numeric and also categorical features.
My code so far is:
import statsmodels.formula.api as smf
lm = smf.ols(formula = 'Reviewer_Score ~ Review_Total_Negative_Word_Counts + Total_Number_of_Reviews + Review_Total_Positive_Word_Counts + Total_Number_of_Reviews_Reviewer_Has_Given + Attitude + Reason + Continent + Solo + Season', data = Hotel).fit()
lm.params
lm.summary()
My issue is that when I look at the parameter estimates (slopes and intercept) and their p-values, the output includes a separate slope and p-value for every level of each categorical feature. I just want an output that shows one slope and p-value per feature, for the numeric and the categorical features alike (NOT one per level of each categorical feature!).
Essentially, I want the slope output to look like:
Intercept
Total_Number_of_Reviews
Review_Total_Positive_Word_Counts
Total_Number_of_Reviews_Reviewer_Has_Given
Attitude
Reason
Continent
Solo
Season
How would I do something like this to collapse the levels and just show the significance and slope value for each of the variables?
Right now, each of the categorical inputs to your model is being converted into a set of dummy variables.*
The reason this clashes with your expectations, I suspect, is that you have three types of variables you call categorical in your model:
Temporal ("Season")
Binary ("Attitude", "Reason", "Solo")
Categorical ("Continent")
Only Continent is truly non-binary categorical, as there is no way to order the continents in a hierarchy without further information. For "Season" the model/program has no indication that there are only four seasons, or that they occur in a temporal order. With the binary variables, it similarly doesn't know that there are only two possible values.
I recommend converting the binary variables to 1, 0, or NaN (you could first use a lambda function, followed by pd.fillna()).
For "Season" specifically, it sounds like you want something more akin to "time of year, indicated by season/quarter." I'd map the seasons to 1, 2, 3, or 4; a sketch of both conversions follows.
For the "Continent" you could rank the continents by how many reviews you have from each, and convert each continent to its respective rank... but you'd be regressing on something more akin to a blend of "continent" + "population from originating continent." (This, of course, may be useful to do anyways). Or, you could keep the dummy variable encoding that was already utilized.
Alternatively, you could come up with a random mapping for the continent, but include some indicator of the relative population from each continent in addition.
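A sketch of the ranking idea above, again assuming the Hotel dataframe:

# rank continents by review count (1 = continent with the most reviews)
counts = Hotel['Continent'].value_counts()
continent_rank = counts.rank(ascending=False, method='dense').astype(int)
Hotel['Continent_rank'] = Hotel['Continent'].map(continent_rank)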
*To make this explicit, you can use pd.get_dummies()
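For example:

# one 0/1 indicator column per continent, e.g. Continent_Europe, Continent_Asia, ...
continent_dummies = pd.get_dummies(Hotel['Continent'], prefix='Continent')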
Related
I have a few questions about preparing the data for learning.
I'm very confused about how to convert columns to categorical and binary columns when I want to use them for correlations and a decision tree classifier.
For example, in NBA_df, to convert the position column to a categorical column for use with a decision tree, can I convert it with .astype('category').cat.codes? (I know in basketball you can denote the positions by the numbers 1-5.)
And in students_df, why is it more correct to convert the 'gender', 'race/ethnicity', 'lunch', and 'test preparation course' columns into new binary columns with .get_dummies, rather than doing the categorical conversion in the same column?
Is it the same for correlations and for trees?
I'm not sure I totally understand what you mean by converting to categorical "in the same column", but I assume you mean replacing the categorical position labels with the numbers 1 through 5 and keeping those numbers in the same column.
Assuming this is what you meant, you have to think about how the computer will interpret the input. Is a Small Forward (position 3 in basketball) three times a Point Guard (1 * 3)? Of course not, but a computer will see it that way and will find relationships with the target that are not realistic. For this reason, you need separate columns with a binary indicator, which is what .get_dummies is doing. That way the computer will not see the positions as numeric values that can be operated on; it will see them as separate entities.
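A minimal sketch of the two encodings side by side, assuming a hypothetical 'position' column in NBA_df (in practice you would pick one of the two):

import pandas as pd

# ordinal codes: the model sees 0..4, implying an order and magnitudes
NBA_df['position_code'] = NBA_df['position'].astype('category').cat.codes

# one-hot dummies: each position becomes an independent 0/1 indicator
NBA_df = pd.get_dummies(NBA_df, columns=['position'])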
I am trying to use the statsmodels package to do Maximum Likelihood Estimation.
My goal is to compute a set of soccer team ratings given a bunch of game outcomes.
I'm using a Logistic model:
1/(1 + e^(HFA + HT - AT))
That has the following parameters:
HFA - home field advantage
HT - home team rating
AT - away team rating
The goal here is to compute each team's rating (there are 18 teams) plus a constant home field advantage factor.
The data I'm given is simply a list of game outcomes - home team, away team, 1 if the home team won the game, 0 if not.
My current thinking is to enrich this data to have a column for each team. I would then make a particular column 1 if that team is playing in that particular game and 0 if not, so there should be two 1's per row. Then add another column called 'hfa', which is always 1, which represents the home field advantage.
Because of how the model works, for any given game I need to know which team was home and which was away so I can compute the prediction properly. And to do that, I believe I need the data labels so I can determine which of the two teams in the game was the home team. However, any time I include non-numeric data (e.g., the team name) in my X columns, I get an error from statsmodels.
Here is the gist of the code I'm trying to write. Any guidance on how to do what I'm trying to do would be much appreciated - thank you!
from statsmodels.base.model import GenericLikelihoodModel

class MyLogit(GenericLikelihoodModel):
    def nloglikeobs(self, params):
        """
        This function should return one evaluation of the negative log-likelihood
        function per observation in your dataset (i.e. rows of the endog/X matrix).
        """
        ???

X_cols = ['hfa'] + list(teams)
X = results[X_cols]
y = results['game_result']
# note: GenericLikelihoodModel takes endog (y) first, then exog (X)
model = MyLogit(y, X)
result = model.fit()
print(result.summary())
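Not a full solution, but a minimal sketch of what nloglikeobs could look like, under one assumption about the encoding: the home team's column is coded +1 and the away team's -1 (instead of two 1's), so that a row dotted with the parameter vector works out to HFA + HT - AT, and the standard logistic form is used:

import numpy as np
from statsmodels.base.model import GenericLikelihoodModel

class MyLogit(GenericLikelihoodModel):
    def nloglikeobs(self, params):
        # with home team coded +1, away team -1, and hfa always 1,
        # X @ params evaluates to HFA + HT - AT for each game
        xb = np.dot(self.exog, params)
        p = 1.0 / (1.0 + np.exp(-xb))  # P(home team wins)
        y = self.endog
        # Bernoulli negative log-likelihood, one value per game (row)
        return -(y * np.log(p) + (1 - y) * np.log(1 - p))

GenericLikelihoodModel usually needs explicit starting values, e.g. result = model.fit(start_params=np.zeros(X.shape[1])). Also note that with an all-ones hfa column plus all 18 team columns, the ratings are only identified up to an additive constant, so you may want to pin one team's rating (e.g. drop its column).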
I am finishing up a project and I am trying to check the correlation between some pieces of information.
Basically, I have data on the survivors of an incident, and I want to know the correlation between the other variables and their survival.
So, I have the main df with all the information, then:
# creating a df listing who did not survive (0) and another listing who survived (1)
df_s0 = df.query("Survived == 0")
df_s1 = df.query("Survived == 1")

df_s0.corr()
Based on correlation formula:
cor(a,b) = cov(a,b)/(stdev(a) * stdev(b))
If either a or b is constant (zero variance), then the correlation between them is not defined (the division by zero produces NaNs).
In your example, the Survived column of df_s0 is constant (all zeros), and hence its correlation with the other columns is undefined.
If you want to figure out the relationship between a discrete variable (Survived) and the rest of your features, you can look at box plots of your features across the two groups Survived == 0 and Survived == 1 (to compare statistics like the mean, IQR, ...). If you want to go a step further, you can use ANOVA to characterize the importance of your features based on their variance within and across the groups!
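For example, a sketch of the box-plot comparison, assuming the dataframe has a numeric feature such as a hypothetical 'Age' column:

import matplotlib.pyplot as plt

# compare the distribution of a numeric feature across the two Survived groups
df.boxplot(column='Age', by='Survived')
plt.show()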
I'm conducting some linear regressions using Python. I have a fairly large data file I'm working with, and one of the columns I'm looking at is titled "male", which encodes the gender of a subject: 1 = male, 0 = female. "rgroupx" is the treatment variable (0 = control, 6 = high-status treatment) and "log_mm" is the outcome variable.
One of the questions I need to answer is: How much does the high status treatment affect the number of traffic violations post intervention for male drivers? Is there a significant treatment effect for female drivers?
I have my current Python statement below. My problem is, for both questions, how would I restrict the regression to rows with a particular column value? If the question is asking about male drivers, how do I tell Python to include only the 1s? Thanks in advance!
model3 = smf.ols('log_mm ~ rgroupx + male', data=Traffic).fit()
If your data is in a dataframe, then a combination of indexing and dropping rows, assigned to a new variable, would work.
Example:
males_df = Traffic.drop(Traffic[Traffic.male != 1].index)
You can then fit the regression on this male-only subset instead of the full dataframe.
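For instance, a minimal sketch of the subset regressions, reusing the column names from the question:

import statsmodels.formula.api as smf

# treatment effect for male drivers only
model_male = smf.ols('log_mm ~ rgroupx', data=Traffic[Traffic.male == 1]).fit()
# and the analogous model for female drivers
model_female = smf.ols('log_mm ~ rgroupx', data=Traffic[Traffic.male == 0]).fit()
print(model_male.summary())
print(model_female.summary())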
I use a GradientBoosting classifier to predict the gender of users. The data has a lot of predictors, and one of them is the country. For each country I have a binary column, and exactly one of the country columns is set to 1 in each row. But this encoding is very slow from a computational point of view. Is there any way to represent the country columns with only one column, in a correct way?
You can collapse all of these binary columns into a single column holding the actual country name, then use LabelEncoder on that column to create a proper integer variable, and you should be all set.
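A minimal sketch of the collapse, assuming the one-hot columns share a hypothetical 'country_' prefix:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

country_cols = [c for c in df.columns if c.startswith('country_')]

# idxmax finds, per row, the column holding the 1; strip the prefix to recover the name
df['country'] = df[country_cols].idxmax(axis=1).str.replace('country_', '', regex=False)
df = df.drop(columns=country_cols)

# integer-encode the single country column for the tree-based model
df['country'] = LabelEncoder().fit_transform(df['country'])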