I have a data frame that initially looked like this:
item_id title user_id gender .....
0 1 Toy Story (1995) 308 M
1 4 Get Shorty (1995) 308 M
2 5 Copycat (1995) 308 M
Then I ran a mixed-effects regression, which worked fine:
import statsmodels.api as sm
import statsmodels.formula.api as smf
md = smf.mixedlm("rating ~ C(gender) + C(genre) + C(gender)*C(genre)", data, groups=data["user_id"])
mdf=md.fit()
print(mdf.summary())
However, afterwards I did a one hot encoding on the gender variable and the dataframe became like this:
item_id title user_id gender_M gender_F .....
0 1 Toy Story (1995) 308 1 0
1 4 Get Shorty (1995) 308 1 0
2 5 Copycat (1995) 308 1 0
Would it be correct to run the model like this (replacing gender with gender_M and gender_F)? Is it the same? Or is there a better way?
md = smf.mixedlm("rating ~ gender_M + gender_F + C(genre) + gender_M*C(genre)", data, groups=data["user_id"])
mdf=md.fit()
print(mdf.summary())
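For what it's worth, a minimal sketch of how C(gender) relates to the dummy columns, using toy data and plain OLS instead of the mixed model for brevity: C() already creates treatment-coded 0/1 dummies internally, so a single gender_M column reproduces the same fit, while including both gender_M and gender_F alongside the intercept is perfectly collinear.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy data (hypothetical, just to illustrate the encoding question)
df = pd.DataFrame({
    "rating": [3, 4, 2, 5, 4, 3, 5, 2],
    "gender": ["M", "M", "F", "F", "M", "F", "M", "F"],
})
df["gender_M"] = (df["gender"] == "M").astype(int)

# C(gender) performs dummy (treatment) coding internally...
m1 = smf.ols("rating ~ C(gender)", df).fit()
# ...so a single 0/1 dummy column fits the identical model.
m2 = smf.ols("rating ~ gender_M", df).fit()

# The slope for C(gender)[T.M] matches the slope for gender_M exactly.
print(m1.params)
print(m2.params)
```

So for the formula interface there is no need to one-hot encode at all; and if you do, use only one of the two dummy columns, since gender_F is just 1 - gender_M.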
I have 3 datasets and I would like to know which IDs are unmatched in at least one pairwise comparison of Dataset A, Dataset B and Dataset C. How could I achieve this in Python?
Dataset A
ID Salary
12 12,000
14 13,004
16 1,400
17 500
19 900
20 12,000
Dataset B
ID Name
13 John
12 James
15 Jacob
19 Michael
20 Seth
Dataset C
ID State
16 WA
17 WA
15 VC
19 NSW
20 WA
Since you mentioned Python I assumed you are using Pandas for the DataFrames.
import pandas as pd
DatasetA = pd.DataFrame({"ID":[12,14,16,17,19,20],"Salary":[12000,13004,1400,500,900,12000]})
DatasetB = pd.DataFrame({"ID":[13,12,15,19,20],"Name":["John","James","Jacob","Michael","Seth"]})
DatasetC = pd.DataFrame({"ID":[16,17,15,19,20],"State":["WA","WA","VC","NSW","WA"]})
# Collect the IDs of each DataFrame into a set
IDs_A = set(DatasetA["ID"])
IDs_B = set(DatasetB["ID"])
IDs_C = set(DatasetC["ID"])
# Symmetric difference: IDs that appear in exactly one of the two sets
AB = IDs_A.symmetric_difference(IDs_B)
BC = IDs_B.symmetric_difference(IDs_C)
AC = IDs_A.symmetric_difference(IDs_C)
# Union of the pairwise mismatches: IDs unmatched in at least one comparison
result = AB.union(BC).union(AC)
print(result)
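Equivalently, a shorter sketch under the same assumption about what "unmatched" means: an ID is unmatched somewhere exactly when it appears in the union of all IDs but not in the three-way intersection.

```python
# Recreate the three ID sets from the question
IDs_A = {12, 14, 16, 17, 19, 20}
IDs_B = {13, 12, 15, 19, 20}
IDs_C = {16, 17, 15, 19, 20}

# An ID is "unmatched" in at least one dataset exactly when it is in the
# union of all IDs but not in the intersection of all three sets.
unmatched = (IDs_A | IDs_B | IDs_C) - (IDs_A & IDs_B & IDs_C)
print(sorted(unmatched))  # [12, 13, 14, 15, 16, 17]
```

Only IDs 19 and 20 appear in all three datasets, so every other ID is reported.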
I am trying to run a hypothesis test using an OLS model. I want to model tweet count based on four groups in my data frame: Athletes, CEOs, Politicians, and Celebrities. Each name is labeled with its group in a single group column.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

frames = [CEO_df, athletes_df, Celebrity_df, politicians_df]
final_df = pd.concat(frames)
final_df=final_df.reindex(columns=["name","group","tweet_count","retweet_count","favorite_count"])
final_df
model=ols("tweet_count ~ C(group)", data=final_df).fit()
table=sm.stats.anova_lm(model, typ=2)
print(table)
I want to do something along the lines of:
model=ols("tweet_count ~ C(Athlete) + C(Celebrity) + C(CEO) + C(Politicians)", data=final_df).fit()
table=sm.stats.anova_lm(model, typ=2)
print(table)
Is that even possible? How else will I be able to run a hypothesis test with those conditions?
Here is my printed final_df:
name group tweet_count retweet_count favorite_count
0 #aws_cloud # #ReInvent R “Ray” Wang 王瑞光 #1A CEO 6 6 0
1 Aaron Levie CEO 48 1140 18624
2 Andrew Mason CEO 24 0 0
3 Bill Gates CEO 114 78204 439020
4 Bill Gross CEO 36 486 1668
... ... ... ... ... ...
56 Tim Kaine Politician 48 8346 50898
57 Tim O'Reilly Politician 14 28 0
58 Trey Gowdy Politician 12 1314 6780
59 Vice President Mike Pence Politician 84 1146408 0
60 klay thompson Politician 48 41676 309924
I'll start off by saying that I'm not really talented in statistical analysis. I have a dataset stored in a .csv file that I'm looking to represent graphically. What I'm trying to represent is the frequency of survival (represented for each person as a 0 or 1 in the Survived column) for each unique entry in the other columns.
For example: one of the other columns, Class, holds one of three possible values (1, 2, or 3). I want to graph the probability that someone from Class 1 survives versus Class 2 versus Class 3, so that I can visually determine whether or not class is correlated to survival rate.
I've attached the snippet of code that I've developed so far, but I'd understand if everything I'm doing is wrong because I've never used pandas before.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('train.csv')

print(list(df)[2:])  # slicing first 2 values of "ID" and "Survived"

for column in list(df)[2:]:
    try:
        df.plot(x='Survived', y=column, kind='hist')
    except TypeError:
        print("Column {} not usable.".format(column))

plt.show()
EDIT: I've attached a small segment of the dataframe below
PassengerId Survived Pclass Name ... Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris ... A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... ... PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina ... STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) ... 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry ... 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James ... 330877 8.4583 NaN Q
I think you want this:
df.groupby('Pclass')['Survived'].mean()
This separates the dataframe into three groups based on the three unique values of Pclass, then takes the mean of Survived within each group, which equals the number of 1 values divided by the total number of values. This produces a Series looking something like this:
Pclass
1 0.558824
2 0.636364
3 0.696970
It is then trivial from there to plot a bar graph with .plot.bar() if you wish.
Adding to the answer, here is a simple bar graph.
result = df.groupby('Pclass')['Survived'].mean()
result.plot(kind='bar', rot=1, ylim=(0, 1))
My stud_alcoh data set is given below
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian legal_drinker
0 GP F 18 U GT3 A 4 4 AT_HOME TEACHER course mother True
1 GP F 17 U GT3 T 1 1 AT_HOME OTHER course father False
2 GP F 15 U LE3 T 1 1 AT_HOME OTHER other mother False
3 GP F 15 U GT3 T 4 2 HEALTH SERVICES home mother False
4 GP F 16 U GT3 T 3 3 OTHER OTHER home father False
number_of_drinkers = stud_alcoh.groupby('legal_drinker').size()
number_of_drinkers
legal_drinker
False 284
True 111
dtype: int64
I have to draw a pie chart of number_of_drinkers, with True as 111 and False as 284. I wrote number_of_drinkers.plot(kind='pie'), but it only shows the Series name as the y-axis label, and the counts (284 and 111) are not shown on the wedges.
This should work:
number_of_drinkers.plot(kind = 'pie', label = 'my label', autopct = '%.2f%%')
The autopct argument gives you a notation of percentage inside the plot, with the desired number of decimals indicated right before the letter "f". So you can change this, for example, to %.1f%% for only one decimal.
I personally don't know of a way to show the raw numbers inside rather than the percentage, but to the best of my understanding showing proportions is the purpose of a pie chart.
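That said, autopct also accepts a callable, so one way to recover the raw counts (a sketch using the counts from the question) is to convert each wedge's percentage back into its count:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt

# Counts from the question: 284 non-drinkers, 111 legal drinkers
number_of_drinkers = pd.Series({False: 284, True: 111}, name="legal_drinker")
total = number_of_drinkers.sum()

def count_label(pct):
    # autopct passes each wedge's percentage; convert it back to the count
    return str(int(round(pct / 100 * total)))

number_of_drinkers.plot(kind="pie", autopct=count_label)
plt.ylabel("")  # suppress the automatic Series-name y-label
plt.savefig("drinkers_pie.png")
```

This labels the wedges 284 and 111 instead of percentages.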
Your question already has a good answer. You could also try this. I'm using the data frame you shared.
import pandas as pd
df = pd.read_clipboard()
df
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian legal_drinker
0 GP F 18 U GT3 A 4 4 AT_HOME TEACHER course mother True
1 GP F 17 U GT3 T 1 1 AT_HOME OTHER course father False
2 GP F 15 U LE3 T 1 1 AT_HOME OTHER other mother False
3 GP F 15 U GT3 T 4 2 HEALTH SERVICES home mother False
4 GP F 16 U GT3 T 3 3 OTHER OTHER home father False
number_of_drinkers = df.groupby('legal_drinker').size() # Series
number_of_drinkers
legal_drinker
False 4
True 1
dtype: int64
number_of_drinkers.plot.pie(label='counts', autopct='%1.1f%%')  # Label the wedges with their percentage share
I have the below synopsis of a df:
movie id movie title release date IMDb URL genre user id rating
0 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 5 3
1 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 268 2
2 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 276 4
3 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 217 3
4 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 87 4
What I'm looking for is to count 'user id' and average 'rating', keeping all other columns intact. So the result will be something like this:
movie id movie title release date IMDb URL genre user id rating
0 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 50 3.75
1 3 Four Rooms (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 35 2.34
any idea how to do that?
Thanks
If all the values in the columns you are aggregating over are the same for each group, then you can avoid the join by putting those columns into the groupby.
Then pass a dictionary of functions to agg. Set as_index to False to keep the grouped-by columns as regular columns:
df.groupby(['movie id','movie title','release date','IMDb URL','genre'], as_index=False).agg({'user id':len,'rating':'mean'})
Note that len is used here to count the rows in each group.
When you have too many columns, you probably do not want to type all of the column names. So here is what I came up with:
column_map = {col: "first" for col in df.columns}
column_map["col_name1"] = "sum"
column_map["col_name2"] = lambda x: set(x) # it can also be a function or lambda
Now you can simply do:
df.groupby(["col_to_group"], as_index=False).aggregate(column_map)
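A runnable sketch of that pattern on a toy frame (the column names and values here are made up for illustration):

```python
import pandas as pd

# Hypothetical frame: many columns, only a few need real aggregation
df = pd.DataFrame({
    "movie_id": [2, 2, 3],
    "movie_title": ["GoldenEye (1995)", "GoldenEye (1995)", "Four Rooms (1995)"],
    "user_id": [5, 268, 276],
    "rating": [3, 2, 4],
})

# Default every non-key column to "first", then override the exceptions
column_map = {col: "first" for col in df.columns if col != "movie_id"}
column_map["user_id"] = "count"
column_map["rating"] = "mean"

out = df.groupby("movie_id", as_index=False).aggregate(column_map)
print(out)
```

Note the grouping key itself is excluded from column_map, since it is kept as a column by as_index=False.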