I am trying to run a hypothesis test using an OLS model. I want to model tweet count as a function of the four groups in my data frame: Athletes, CEOs, Politicians, and Celebrities. Each name is labeled with its group in a single group column.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

frames = [CEO_df, athletes_df, Celebrity_df, politicians_df]
final_df = pd.concat(frames)
final_df = final_df.reindex(columns=["name", "group", "tweet_count", "retweet_count", "favorite_count"])
final_df

# One-way ANOVA: does mean tweet_count differ across groups?
model = ols("tweet_count ~ C(group)", data=final_df).fit()
table = sm.stats.anova_lm(model, typ=2)
print(table)
I want to do something along the lines of:
model=ols("tweet_count ~ C(Athlete) + C(Celebrity) + C(CEO) + C(Politicians)", data=final_df).fit()
table=sm.stats.anova_lm(model, typ=2)
print(table)
Is that even possible? How else will I be able to run a hypothesis test with those conditions?
Here is my printed final_df:
name group tweet_count retweet_count favorite_count
0 #aws_cloud # #ReInvent R “Ray” Wang 王瑞光 #1A CEO 6 6 0
1 Aaron Levie CEO 48 1140 18624
2 Andrew Mason CEO 24 0 0
3 Bill Gates CEO 114 78204 439020
4 Bill Gross CEO 36 486 1668
... ... ... ... ... ...
56 Tim Kaine Politician 48 8346 50898
57 Tim O'Reilly Politician 14 28 0
58 Trey Gowdy Politician 12 1314 6780
59 Vice President Mike Pence Politician 84 1146408 0
60 klay thompson Politician 48 41676 309924
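For reference, C(group) already expands the single group column into one indicator per category, so the one-formula model above tests the joint hypothesis that mean tweet_count differs across the four groups. A minimal sketch of the explicit dummy-column version (hypothetical; it assumes the labels in group are Athlete, CEO, Celebrity, and Politician, with CEO arbitrarily dropped as the reference category):
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# pd.concat of the four frames can leave duplicate index labels, so reset first
final_df = final_df.reset_index(drop=True)

# One 0/1 indicator column per label in "group"
final_df = final_df.join(pd.get_dummies(final_df["group"], dtype=int))

# CEO is left out as the reference category to avoid perfect collinearity
model = ols("tweet_count ~ Athlete + Celebrity + Politician", data=final_df).fit()
print(sm.stats.anova_lm(model, typ=2))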
I have a sample dataset here. The real case has a train and a test dataset, each with around 300 columns and 800 rows. I want to select all rows with a certain value in one column and then set every value in those rows from, e.g., column 3 to column 50 to zero. How can I do that?
Sample dataset:
import pandas as pd
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Princi', 'Anuj', 'Nancy'],
        'Age': [27, 24, 22, 32, 66, 43],
        'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj', 'Katauj', 'vbinauj'],
        'Payment': [15, 20, 40, 50, 3, 23],
        'Qualification': ['Msc', 'MA', 'MCA', 'Phd', 'MA', 'MS']}
df = pd.DataFrame(data)
df
Here is the output of the sample dataset:
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 Kanpur 20 MA
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 Kannauj 50 Phd
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
As you can see, some rows have Name == "Princi". For those rows, I want to set the "Address" and "Payment" columns to zero.
Here is the expected output:
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 0 0 MA #this row
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 0 0 Phd #this row
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
In my real dataset, I tried:
train.loc[:, 'got':'tod']  # I could select all those columns
and
train.loc[train['column_wanted'] == "that value"]  # I got all those rows
But how can I combine them? Thanks for your help!
Use the loc accessor: df.loc[boolean selection, columns] = value
df.loc[df['Name'].eq('Princi'), 'Address':'Payment'] = 0
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 0 0 MA
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 0 0 Phd
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
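For the real dataset described in the question, the same pattern combines both selections in a single loc call (column names 'got', 'tod', and 'column_wanted' are taken from the question; "that value" is a placeholder):
# rows where column_wanted matches, columns 'got' through 'tod' set to zero
train.loc[train['column_wanted'] == "that value", 'got':'tod'] = 0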
My first data frame:
product = pd.DataFrame({
    'Product_ID': [101, 102, 103, 104, 105, 106, 107, 101],
    'Product_name': ['Watch', 'Bag', 'Shoes', 'Smartphone', 'Books', 'Oil', 'Laptop', 'New Watch'],
    'Category': ['Fashion', 'Fashion', 'Fashion', 'Electronics', 'Study', 'Grocery', 'Electronics', 'Electronics'],
    'Price': [299.0, 1350.50, 2999.0, 14999.0, 145.0, 110.0, 79999.0, 9898.0],
    'Seller_City': ['Delhi', 'Mumbai', 'Chennai', 'Kolkata', 'Delhi', 'Chennai', 'Bengalore', 'New York']
})
My second data frame has the transactions:
customer = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'name': ['Olivia', 'Aditya', 'Cory', 'Isabell', 'Dominic', 'Tyler', 'Samuel', 'Daniel', 'Jeremy'],
    'age': [20, 25, 15, 10, 30, 65, 35, 18, 23],
    'Product_ID': [101, 0, 106, 0, 103, 104, 0, 0, 107],
    'Purchased_Product': ['Watch', 'NA', 'Oil', 'NA', 'Shoes', 'Smartphone', 'NA', 'NA', 'Laptop'],
    'City': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai', 'Chennai', 'Delhi', 'Kolkata', 'Delhi', 'Mumbai']
})
I want Price from the first data frame to come into the merged data frame, the common column being 'Product_ID'. Note that Product_ID 101 has two prices, 299.00 and 9898.00; I want the latter one, 9898.0, in the merged data set, since it is the latest price.
Currently my code is not giving the right answer; it returns both prices:
customerpur = pd.merge(customer, product[['Price', 'Product_ID']], on="Product_ID", how="left")
customerpur
id name age Product_ID Purchased_Product City Price
0 1 Olivia 20 101 Watch Mumbai 299.0
1 1 Olivia 20 101 Watch Mumbai 9898.0
There is no explicit timestamp, so I assume row order in the data frame reflects recency. You can drop duplicates after the merge:
customerpur.drop_duplicates(subset=['id'], keep='last')
result:
id name age Product_ID Purchased_Product City Price
1 1 Olivia 20 101 Watch Mumbai 9898.0
2 2 Aditya 25 0 NA Delhi NaN
3 3 Cory 15 106 Oil Bangalore 110.0
4 4 Isabell 10 0 NA Chennai NaN
5 5 Dominic 30 103 Shoes Chennai 2999.0
6 6 Tyler 65 104 Smartphone Delhi 14999.0
7 7 Samuel 35 0 NA Kolkata NaN
8 8 Daniel 18 0 NA Delhi NaN
9 9 Jeremy 23 107 Laptop Mumbai 79999.0
Please note the keep='last' argument, since we keep only the last price registered.
Deduplication should be done before merging if you care about performance or the dataset is huge:
product = product.drop_duplicates(subset=['Product_ID'], keep='last')
In your data frame there is no indicator of the latest entry, so you might need to first remove the first entry for Product_ID 101 from the product dataframe, as follows:
result_product = product.drop_duplicates(subset=['Product_ID'], keep='last')
It will keep the last entry per Product_ID, and you can then do the merge:
pd.merge(result_product, customer, on='Product_ID')
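Putting the two steps together, a minimal sketch using the product and customer frames defined above (how='left' keeps customers without a matching product, as in the question's original merge):
# Keep only the latest row per Product_ID, then merge the price in
latest = product.drop_duplicates(subset=['Product_ID'], keep='last')
customerpur = pd.merge(customer, latest[['Product_ID', 'Price']],
                       on='Product_ID', how='left')
print(customerpur)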
I am new to MySQL and just getting started with some basic concepts. I have been trying to solve this for a while now; any help is appreciated.
I have a list of users, each with two phone numbers. I want to compare the two phone-number columns and generate a new row if the data is different in the two columns; otherwise retain the row and make no changes.
The processed data would look like the second table.
Is there any way to achieve this in MySQL?
I also don't mind doing the transformation in a dataframe and then loading it into a table.
id username primary_phone landline
1 John 222 222
2 Michael 123 121
3 lucy 456 456
4 Anderson 900 901
Thanks!!!
Use DataFrame.melt, then remove the variable column and apply DataFrame.drop_duplicates:
df = (df.melt(['id', 'username'], value_name='phone')
        .drop('variable', axis=1)
        .drop_duplicates()
        .sort_values('id'))
print(df)
id username phone
0 1 John 222
1 2 Michael 123
5 2 Michael 121
2 3 lucy 456
3 4 Anderson 900
7 4 Anderson 901
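Since the question also allows doing the transformation in pandas and loading the result back into a table, here is a self-contained sketch; the connection string and table name in the commented-out lines are hypothetical placeholders:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'username': ['John', 'Michael', 'lucy', 'Anderson'],
                   'primary_phone': [222, 123, 456, 900],
                   'landline': [222, 121, 456, 901]})

# Unpivot the two phone columns into one, drop the column label,
# and collapse rows where both numbers were identical
out = (df.melt(['id', 'username'], value_name='phone')
         .drop('variable', axis=1)
         .drop_duplicates()
         .sort_values('id'))

# Hypothetical load back into MySQL (requires SQLAlchemy and a MySQL driver)
# from sqlalchemy import create_engine
# engine = create_engine('mysql+pymysql://user:password@host/db')  # placeholder
# out.to_sql('user_phones', con=engine, if_exists='replace', index=False)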
I have student data with IDs and some values, and I need a pivot table with the count of IDs.
Here's an example of data:
id name maths science
0 B001 john 50 60
1 B021 Kenny 89 77
2 B041 Jessi 100 89
3 B121 Annie 91 73
4 B456 Mark 45 33
pivot table:
count of ID
5
There are lots of different ways to approach this; I would use either shape or nunique(), as Sandeep suggested.
import pandas as pd

data = {'id': ['0', '1', '2', '3', '4'],
        'name': ['john', 'kenny', 'jessi', 'Annie', 'Mark'],
        'math': [50, 89, 100, 91, 45],
        'science': [60, 77, 89, 73, 33]}
df = pd.DataFrame(data)
print(df)
id name math science
0 0 john 50 60
1 1 kenny 89 77
2 2 jessi 100 89
3 3 Annie 91 73
4 4 Mark 45 33
then pass either of the following:
df.shape[0], which gives you the number of rows (shape is an attribute, not a method, so it takes no parentheses)
or
In:  df['id'].nunique()
Out: 5
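For completeness, a minimal sketch of the options above; each line prints 5 for the sample frame:
print(len(df))             # number of rows
print(df.shape[0])         # same, via the shape attribute (rows, columns)
print(df['id'].nunique())  # number of distinct ids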