I'm conducting some linear regressions using Python. I have a fairly large data file I'm working with, and one of the columns I'm looking at is titled "male", which indicates the gender of a subject. Column values can be 1 = male, 0 = female. "rgroupx" is the treatment variable (0 = control, 6 = high status treatment) and "log_mm" is the outcome variable.
One of the questions I need to answer is: How much does the high status treatment affect the number of traffic violations post intervention for male drivers? Is there a significant treatment effect for female drivers?
My current Python statement is below. My problem, for both questions, is how to restrict the regression to rows with a particular column value. If the question is asking about male drivers, how do I tell Python to include only the 1s? Thanks in advance!
model3 = smf.ols('log_mm ~ rgroupx + male', data=Traffic).fit()
If your data is in a DataFrame, then a combination of indexing and dropping rows, assigned to a new variable, would work.
Example:
males_df = Traffic.drop(Traffic[Traffic.male != 1].index)
Then fit the regression on that male-only subset instead of the full DataFrame.
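For example, a minimal sketch of fitting the treatment-only model on each gender subset (assuming the Traffic DataFrame and column names from the question):

import statsmodels.formula.api as smf

# male drivers only (male == 1): treatment effect for men
males_df = Traffic[Traffic['male'] == 1]
model_male = smf.ols('log_mm ~ rgroupx', data=males_df).fit()
print(model_male.summary())

# female drivers only (male == 0): is the treatment effect significant?
females_df = Traffic[Traffic['male'] == 0]
model_female = smf.ols('log_mm ~ rgroupx', data=females_df).fit()
print(model_female.summary())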
I have a problem that I cannot seem to solve.
I have a dataset with two categorical variables: Gender (Male vs Female) and Smoking status (Smokers vs Non-smokers). The dataset contains 60% Males and 50% Smokers.
df = pd.DataFrame()
df['Gender'] = ['M','M','M','M','M','M','F','F','F','F']
df['Smoking_status'] = ['S','S','S','S','S','NS','NS','NS','NS','NS']
Is there a way to create a subset such that the new dataset will have 50% Males and 30% Smokers? (How the males and smokers overlap does not matter, since that is information I do not have for the final dataset.)
I am implementing this in Python, but I will be happy with just an idea of a solution.
Thank you all!
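One possible sketch, assuming the two target percentages can be treated as independent marginals (the question says the overlap does not matter): sample each Gender/Smoking cell up to a target size.

import pandas as pd

def subsample(df, n, p_male=0.5, p_smoker=0.3, seed=0):
    # target count per (Gender, Smoking_status) cell under the
    # independence assumption; cells that run short are exhausted
    targets = {
        ('M', 'S'): round(n * p_male * p_smoker),
        ('M', 'NS'): round(n * p_male * (1 - p_smoker)),
        ('F', 'S'): round(n * (1 - p_male) * p_smoker),
        ('F', 'NS'): round(n * (1 - p_male) * (1 - p_smoker)),
    }
    parts = []
    for (g, s), k in targets.items():
        cell = df[(df['Gender'] == g) & (df['Smoking_status'] == s)]
        parts.append(cell.sample(min(k, len(cell)), random_state=seed))
    return pd.concat(parts)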
I'm using some data from Kaggle about blue plaques in Europe. Many of these plaques describe famous people, but others describe places, events, or animals. The dataframe includes the years of both birth and death for those famous people, and I have added a new column that displays the age of the lead subject at their time of death with the following code:
plaques['subject_age'] = plaques['lead_subject_died_in'] - plaques['lead_subject_born_in']
This works for some of the dataset, but since some of the subjects don't have values for the columns 'lead_subject_died_in' and 'lead_subject_born_in', some of my results are funky.
I was trying to determine the most common age of death with this:
agecount = plaques['subject_age'].value_counts()
print(agecount)
and I got some crazy stuff: negative numbers, 600+, etc. How do I make it so that it only counts the values for people who actually have data in both of those columns?
By the way, I'm a beginner, so if the operations you suggest are very difficult, please explain what they're doing so that I can learn and use it in the future!
You can use the dropna function to remove the NaN values in those two columns:
# remove nan values from these 2 columns
plaques = plaques.dropna(subset = ['lead_subject_died_in', 'lead_subject_born_in'])
plaques['subject_age'] = plaques['lead_subject_died_in'] - plaques['lead_subject_born_in']
# get the most frequent age
plaques['subject_age'].value_counts().idxmax()
# get the five most common ages
plaques['subject_age'].value_counts().head()
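If you'd rather not drop rows from plaques permanently, a small variant drops the missing values only for this calculation:

# same subtraction, but NaNs are removed just for the count
ages = (plaques['lead_subject_died_in'] - plaques['lead_subject_born_in']).dropna()
print(ages.value_counts().head())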
I tried to calculate the median and counts for a specific column of my data frame:
large_depts = df[df['Department'].isin(Departments_top10)]\
[['Total', 'Department']]\
.groupby('Department')\
.agg([np.median, np.size])
print(large_depts)
It said:
ValueError: no results
But when I checked the dataframe, there were values in my dataframe:
large_depts = df[df['Department'].isin(Departments_top10)]\
[['Total', 'Department']]
print(large_depts)
        Total                Department
0  677,680.65  Boston Police Department
1  250,893.61  Boston Police Department
2  208,676.89  Boston Police Department
3  319,319.93  Boston Police Department
4  577,123.44  Boston Police Department
I found out that something goes wrong when I try to group by, but I don't know why:
large_depts = df[df['Department'].isin(Departments_top10)]\
[['Total', 'Department']]\
.groupby('Department')
print(large_depts)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000000000D1C0A08>
Here's the data: https://data.boston.gov/dataset/418983dc-7cae-42bb-88e4-d56f5adcf869/resource/31358fd1-849a-48e0-8285-e813f6efbdf1/download/employeeearningscy18full.csv
You donĀ“t need call Department variable again. You can chage np.size to 'count' too. Try this code:
df[df['Department'].isin(Departments_top10)].groupby('Department')['Total'].agg([np.median, 'count'])
You have a couple of errors going on in your code above.
Your Total column is not a numeric type (as you pointed out in the comments, it's a string). I'm assuming you can convert your Total column (though the change is permanent) and your code may then work. I don't have access to your data, so I can't fully check whether your groupby calls are working.
Here's code to change your string to a list (as asked in the comments). Not sure if this is what you really want:
str2lst = lambda s: s.split(",")
df['Total'] = [str2lst(i) for i in df['Total']]
EDIT: After looking at your DataFrame (and realizing that Total is a number, not a list), I uncovered several rows that contained the column names as values. Remove those, and convert your string values to float type:
# drop the rows that repeat the column headers as data
df.drop([12556, 22124, 22123, 22122, 22121, 22125], inplace=True)
# strip the thousands separators, then cast to float
str2float = lambda s: s.replace(',', '')
df['Total'] = [float(str2float(i)) for i in df['Total']]
Now running agg() exactly how you have it in the question will work. Here's my results:
                                    Total
                                   median    size
Department
BPS Facility Management         53183.315   668.0
BPS Special Education           49875.830   831.0
BPS Substitute Teachers/Nurs     6164.070  1196.0
BPS Transportation              20972.770   506.0
Boston Cntr - Youth & Families  44492.625   584.0
In your last code entry, groupby has to be followed by a method that tells pandas what to do with each group. Think about it intuitively: if I instructed you to group a set of cards together, you'd ask how. By color? Number? Suits? You told Python to group by Department, but you didn't tell it what to compute for each group, so Python returned a "...generic.DataFrameGroupBy object".
Try doing df...groupby('Department').count() and you'll see df grouped by Department.
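Putting both fixes together, a sketch of the full pipeline (assuming the linked CSV, and deriving Departments_top10 as the ten largest departments, since its definition isn't shown in the question):

import numpy as np
import pandas as pd

df = pd.read_csv('employeeearningscy18full.csv')

# strip thousands separators and coerce to float; stray header rows become NaN
df['Total'] = pd.to_numeric(df['Total'].astype(str).str.replace(',', ''), errors='coerce')
df = df.dropna(subset=['Total'])

# assumption: top 10 departments by number of employees
Departments_top10 = df['Department'].value_counts().head(10).index

large_depts = (df[df['Department'].isin(Departments_top10)]
               .groupby('Department')['Total']
               .agg([np.median, 'count']))
print(large_depts)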
I have been working on a fake news detection model.
I have been able to infer the relation between the news title and the news content.
I have an existing dataframe with the following columns:
AUTHOR NEWS_TITLE NEWS_CREDIBILITY
I want to use these existing columns to create new columns as follows:
AUTHOR, AUTHOR_NEWS_COUNT, TOTAL_NUM_CREDIBLE_NEWS, TOTAL_NUM_NONCREDIBLE_NEWS
NOTE: The columns TOTAL_NUM_CREDIBLE_NEWS and TOTAL_NUM_NONCREDIBLE_NEWS are based on the values in the NEWS_CREDIBILITY column.
news_authors = news1['AUTHOR'].value_counts()
print(news_authors)
df[news_...
AUTHOR       AUTHOR_NEWS_COUNT  TOTAL_NUM_CREDIBLE_NEWS  TOTAL_NUM_NONCREDIBLE_NEWS
Pam Key                    243                      240                           3
David Flynn                 30                       20                          10
I may be misunderstanding the question, but what you need may be a simple groupby. I'm going to assume a function, is_credible, which takes your NEWS_CREDIBILITY value and returns True or False depending on whether it's credible. Then you need something like this:
df['CREDIBLE'] = df['NEWS_CREDIBILITY'].apply(is_credible)
df['NOTCREDIBLE'] = df['NEWS_CREDIBILITY'].apply(lambda x: not is_credible(x))
This creates a boolean column of credibility and its opposite (there is probably a more elegant way to do this, sorry!).
Then you can do:
per_author_df = df.groupby('AUTHOR').agg({'NEWS_TITLE':'count','CREDIBLE':'sum','NOTCREDIBLE':'sum'})
This basically groups by author and performs the given operation on each of those three columns: NEWS_TITLE becomes a count of news articles, and since True sums as 1 and False as 0, the other two columns become counts of credible and non-credible news.
EDIT: As I said earlier, you need a function like is_credible that tells you what is credible based on your NEWS_CREDIBILITY column. For example, if NEWS_CREDIBILITY is a score and 80 or above means credible, it would be:
def is_credible(cred_score):
    if cred_score >= 80:
        return True
    else:
        return False
You need to adapt this to your NEWS_CREDIBILITY column - I don't even know what data type that carries.
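If you also want the column names from the desired output in the question, a small follow-up (renaming the aggregated columns from per_author_df above):

per_author_df = per_author_df.rename(columns={
    'NEWS_TITLE': 'AUTHOR_NEWS_COUNT',
    'CREDIBLE': 'TOTAL_NUM_CREDIBLE_NEWS',
    'NOTCREDIBLE': 'TOTAL_NUM_NONCREDIBLE_NEWS',
}).reset_index()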
I am trying to create a multiple linear regression model to predict
the rating a guest gives to a hotel (Reviewer_Score) in Python using statsmodels.
Review_Total_Negative_Word_Counts is how long their negative comments about the hotel are
Total_Number_of_Reviews is how many reviews the hotel has
Review_Total_Positive_Word_Counts is how long their positive comments about the hotel are
Total_Number_of_Reviews_Reviewer_Has_Given is how many reviews the guest has given on the site
Attitude is a categorical variable: GOOD or BAD
Reason is reason for visit (Leisure or Business)
Continent is the continent which the guest came from (multiple levels)
Solo is whether the traveler is a solo traveler ('Yes' or 'No')
Season is during which season the guest stayed at the hotel ('Fall', 'Winter', 'Summer', 'Spring')
As you can see, I have some numeric and also categorical features.
My code so far is:
import statsmodels.formula.api as smf
lm = smf.ols(formula = 'Reviewer_Score ~ Review_Total_Negative_Word_Counts + Total_Number_of_Reviews + Review_Total_Positive_Word_Counts + Total_Number_of_Reviews_Reviewer_Has_Given + Attitude + Reason + Continent + Solo + Season', data = Hotel).fit()
lm.params
lm.summary()
My issue is that when I look at the parameters (slope and intercept estimates) and p-values, the levels of each of the categorical features are included. I just want an output that shows the slopes and p-values for the numeric and categorical features (NOT the slopes and p-values for each level of the categorical features!).
Essentially, I want the slope output to look like:
Intercept
Total_Number_of_Reviews
Review_Total_Positive_Word_Counts
Total_Number_of_Reviews_Reviewer_Has_Given
Attitude
Reason
Continent
Solo
Season
How would I do something like this to collapse the levels and just show the significance and slope value for each of the variables?
Right now, each categorical input to your model is being converted into dummy variables.*
The reason this clashes with your expectations, I suspect, is that you have three types of variables you call categorical in your model:
Temporal ("Season")
Binary ("Attitude", "Reason", "Solo")
Categorical ("Continent")
Only Continent is truly non-binary categorical, as there is no way to order the continents in a hierarchy without further information. For "Season" the model has no indication that there are only four seasons, or that they occur in a temporal order. With the binary variables, it similarly doesn't know that there are only two possible values.
I recommend converting the binary variables to 1, 0, or NaN (you could first use a lambda function or a mapping, followed by pd.fillna()); see the sketch below.
For "Season" specifically, it sounds you want something more akin to "time of year, indicated by season/quarter." I'd map the seasons to 1,2,3 or 4.
For the "Continent" you could rank the continents by how many reviews you have from each, and convert each continent to its respective rank... but you'd be regressing on something more akin to a blend of "continent" + "population from originating continent." (This, of course, may be useful to do anyways). Or, you could keep the dummy variable encoding that was already utilized.
Alternatively, you could come up with a random mapping for the continent, but include some indicator of the relative population from each continent in addition.
*To make this explicit, you can use pd.get_dummies().
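For reference, a sketch of doing that encoding explicitly on the original (unconverted) columns; drop_first uses one level of each variable as the baseline:

import pandas as pd

encoded = pd.get_dummies(
    Hotel,
    columns=['Attitude', 'Reason', 'Continent', 'Solo', 'Season'],
    drop_first=True,
)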