Percentage based on column value - python

I am trying to find the most efficient way to calculate, for each teacher in my df, the percentage of records where they are not tenured.
For example df below:
District  Teacher Name  Tenured?
55        Bo Carns      Yes
42        Bo Carns      No
55        Steven Ast    No
43        Fiona Tan     Yes
43        Steven Ast    Yes
43        Mike Po       No
31        Steve Chi     No
Each teacher can teach in multiple districts and can be tenured in some but not others. For each teacher, I want the percentage of their rows where the Tenured? column is No (the number of No rows for that teacher divided by all of that teacher's rows), so I can find the teachers who most often teach without tenure.
Expected output would be:
Teacher Name  pct
Bo Carns      .5
Steven Ast    .5
Fiona Tan     0
Mike Po       1
Steve Chi     1
where pct is the percentage of their records (districts) in which they were not tenured.
Thanks for taking the time to look at my question.

You can try
s = df['Tenured?'].eq('No').groupby(df['Teacher Name']).mean()
Out[57]:
Teacher Name
Bo Carns      0.5
Fiona Tan     0.0
Mike Po       1.0
Steve Chi     1.0
Steven Ast    0.5
Name: Tenured?, dtype: float64
eq('No') marks the not-tenured rows as True, and the per-teacher mean of those booleans is exactly the requested fraction.
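For reference, a minimal reproducible sketch (the frame below is rebuilt from the sample rows in the question):
import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame({
    'District': [55, 42, 55, 43, 43, 43, 31],
    'Teacher Name': ['Bo Carns', 'Bo Carns', 'Steven Ast', 'Fiona Tan',
                     'Steven Ast', 'Mike Po', 'Steve Chi'],
    'Tenured?': ['Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No'],
})

# eq('No') yields a boolean Series; its mean per teacher is the fraction
# of that teacher's rows where Tenured? == 'No'
pct = df['Tenured?'].eq('No').groupby(df['Teacher Name']).mean()
print(pct)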

Related

Why are there null values when I use group by?

I have data on Amazon's 50 best-selling books (from Kaggle).
There are no null values in the data.
I compute the mean of the reviews, then use a group-by, but it gives null values for User Rating and the mean.
In the next step, I filter the rows whose reviews are greater than the average reviews.
My question is: why did I get null values in the group-by step when there were no null values in the dataset?
ipynb file
This 'Answer' is an attempt at reproducibility.
The OP's question cannot be reproduced.
PS: The groupby returns the grouping as expected.
@TANNU, it appears your NaN values might have come from your data cleansing. Kindly show your relevant code.
NB: The 'Amazon Top 50 Bestselling Books 2009 - 2019' dataset has 550 rows (data.shape: (550, 7)).
[For noting]
Your book_review groupby has a whopping 269010 rows; my reproduction of your book_review yielded 351 rows × 5 columns.
PS: Updated based on @Siva Shanmugam's edit.
## import libraries
import math
import pandas as pd
## read dataset
data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/Amazon%20Top%2050%20Bestselling%20Books%202009%20-%202019.csv')
data.head(2)
''' [out]
Name Author User Rating Reviews Price Year Genre
0 10-Day Green Smoothie Cleanse JJ Smith 4.7 17350 8 2016 Non Fiction
1 11/22/63: A Novel Stephen King 4.6 2052 22 2011 Fiction
'''
## check shape
data.shape
''' [out]
(550, 7)
'''
## check dataset
data.describe()
''' [out]
User Rating Reviews Price Year
count 550.000000 550.000000 550.000000 550.000000
mean 4.618364 11953.281818 13.100000 2014.000000
std 0.226980 11731.132017 10.842262 3.165156
min 3.300000 37.000000 0.000000 2009.000000
25% 4.500000 4058.000000 7.000000 2011.000000
50% 4.700000 8580.000000 11.000000 2014.000000
75% 4.800000 17253.250000 16.000000 2017.000000
max 4.900000 87841.000000 105.000000 2019.000000
'''
## check NaN
data.Reviews.isnull().any()
''' [out]
False
'''
## mean of reviews
mean_reviews = math.ceil(data.Reviews.mean())
mean_reviews
''' [out]
11954
'''
## group by `Name`, `Author`, `Genre`; take the mean of `User Rating` and `Reviews`
book_review = data.groupby(['Name', 'Author', 'Genre'], as_index=False)[['User Rating', 'Reviews']].mean()
book_review.shape
''' [out]
(351, 5)
'''
## get the books whose reviews are greater than mean(reviews)
book_review[book_review.Reviews > mean_reviews]
''' [out]
Name Author Genre User Rating Reviews
0 10-Day Green Smoothie Cleanse JJ Smith Non Fiction 4.7 17350.0
2 12 Rules for Life: An Antidote to Chaos Jordan B. Peterson Non Fiction 4.7 18979.0
3 1984 (Signet Classics) George Orwell Fiction 4.7 21424.0
5 A Dance with Dragons (A Song of Ice and Fire) George R. R. Martin Fiction 4.4 12643.0
6 A Game of Thrones / A Clash of Kings / A Storm... George R. R. Martin Fiction 4.7 19735.0
... ... ... ... ... ...
341 When Breath Becomes Air Paul Kalanithi Non Fiction 4.8 13779.0
342 Where the Crawdads Sing Delia Owens Fiction 4.8 87841.0
345 Wild: From Lost to Found on the Pacific Crest ... Cheryl Strayed Non Fiction 4.4 17044.0
348 Wonder R. J. Palacio Fiction 4.8 21625.0
350 You Are a Badass: How to Stop Doubting Your Gr... Jen Sincero Non Fiction 4.7 14331.0
83 rows × 5 columns
'''
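As a purely hypothetical illustration of the data-cleansing hypothesis above (the OP's cleansing code is not shown, so the column and values here are made up): coercing a text column to numeric can silently introduce NaN, which then surfaces as a NaN group mean even though the raw file had no nulls.
import pandas as pd

# 'n/a' is a real string in the raw data, not a null
raw = pd.DataFrame({'Name': ['A', 'A', 'B'],
                    'Reviews': ['100', '200', 'n/a']})
raw.Reviews.isnull().any()                                       # False
raw['Reviews'] = pd.to_numeric(raw['Reviews'], errors='coerce')  # 'n/a' -> NaN
raw.groupby('Name').Reviews.mean()                               # B's mean is NaN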

Combining three datasets removing duplicates

I've three datasets:
dataset 1
Customer1       Customer2         Exposures + other columns
Nick McKenzie   Christopher Mill  23450
Nick McKenzie   Stephen Green     23450
Johnny Craston  Mary Shane        12
Johnny Craston  Stephen Green     12
Molly John      Casey Step        1000021
dataset2 (unique Customers: Customer 1 + Customer 2)
Customer          Age
Nick McKenzie     53
Johnny Craston    75
Molly John        34
Christopher Mill  63
Stephen Green     65
Mary Shane        54
Casey Step        34
Mick Sale
dataset 3
Customer1  Customer2       Exposures + other columns
Mick Sale  Johnny Craston
Mick Sale  Stephen Green
Exposures refers to Customer1 only.
Other columns are omitted for brevity. Dataset 2 is built from the unique values of Customer1 and Customer2, so it contains no duplicates. Dataset 3 has the same columns as dataset 1.
I'd like to add the information from dataset 1 into dataset 2 to have
Final dataset
Customer          Age  Exposures + other columns
Nick McKenzie     53   23450
Johnny Craston    75   12
Molly John        34   1000021
Christopher Mill  63
Stephen Green     65
Mary Shane        54
Casey Step        34
Mick Sale
The final dataset should have all Customer1 and Customer 2 from both dataset 1 and dataset 3, with no duplicates.
I have tried to combine them as follows
result = pd.concat([df2,df1,df3], axis=1)
but the result is not the one I'd expect.
Something is wrong in the way I'm concatenating the datasets, and I'd appreciate it if you could let me know what it is.
After concatenating df1 and df3 (they have the same columns), remove the duplicates using drop_duplicates(subset=['Customer1']), and then join with df2 like this:
concatenated.set_index('Customer1').join(df2.set_index('Customer'))
If df1 and df2 had different columns keyed on the same customer, you would join them the same way and then join the age table again.
That gives the desired result: concatenate dataset 1 and dataset 3 first, because they have the same columns, then run the join above, specifying the respective keys.
Note: for the concatenation you can use pd.concat([df1, df3], ignore_index=True) (ignore_index rebuilds the row index).
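Putting the steps together, a minimal sketch (assuming, per the samples, that df1 and df3 share the columns Customer1, Customer2 and Exposures, and df2 has Customer and Age):
import pandas as pd

# Stack dataset 1 and dataset 3; they have the same columns
stacked = pd.concat([df1, df3], ignore_index=True)

# Exposures refers to Customer1 only, so keep one row per Customer1
stacked = stacked.drop_duplicates(subset=['Customer1'])

# Attach Exposures (and any other columns) to the unique-customer table
result = (df2.set_index('Customer')
             .join(stacked.set_index('Customer1').drop(columns=['Customer2']))
             .reset_index())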

Conditional aggregation on dataframe columns with combining 'n' rows into 1 row

I have an input dataframe like the following:
NAME   TEXT                                             START  END
Tim    Tim Wagner is a teacher.                         10     20.5
Tim    He is from Cleveland, Ohio.                      20.5   40
Frank  Frank is a musician.                             40     50
Tim    He like to travel with his family                50     62
Frank  He is a performing artist who plays the cello.   62     70
Frank  He performed at the Carnegie Hall last year.     70     85
Frank  It was fantastic listening to him.               85     90
Frank  I really enjoyed                                 90     93
I want the output dataframe to be as follows:
NAME   TEXT                                                                                          START  END
Tim    Tim Wagner is a teacher. He is from Cleveland, Ohio.                                          10     40
Frank  Frank is a musician.                                                                          40     50
Tim    He like to travel with his family                                                             50     62
Frank  He is a performing artist who plays the cello. He performed at the Carnegie Hall last year.   62     85
Frank  It was fantastic listening to him. I really enjoyed                                           85     93
My current code:
grp = (df['NAME'] != df['NAME'].shift()).cumsum().rename('group')
df.groupby(['NAME', grp], sort=False)['TEXT','START','END']\
.agg({'TEXT':lambda x: ' '.join(x), 'START': 'min', 'END':'max'})\
.reset_index().drop('group', axis=1)
This combines the last four rows into one. Instead, I want to combine at most 2 rows at a time (more generally, n rows), even when 'NAME' repeats.
Appreciate your help on this.
Thanks
You can group by the blocks of consecutive names plus a within-block counter, so each run of equal names is split into chunks of 2:
blocks = df.NAME.ne(df.NAME.shift()).cumsum()
(df.groupby([blocks, df.groupby(blocks).cumcount()//2])
.agg({'NAME':'first', 'TEXT':' '.join,
'START':'min', 'END':'max'})
)
Output:
NAME TEXT START END
NAME
1 0 Tim Tim Wagner is a teacher. He is from Cleveland,... 10.0 40.0
2 0 Frank Frank is a musician. 40.0 50.0
3 0 Tim He like to travel with his family 50.0 62.0
4 0 Frank He is a performing artist who plays the cello.... 62.0 85.0
1 Frank It was fantastic listening to him. I really en... 85.0 93.0
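If you'd rather not keep the helper (block, chunk) MultiIndex in the result, flatten it with reset_index(drop=True) at the end:
out = (df.groupby([blocks, df.groupby(blocks).cumcount() // 2])
         .agg({'NAME': 'first', 'TEXT': ' '.join,
               'START': 'min', 'END': 'max'})
         .reset_index(drop=True))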

Pandas find values present in at least two groups

I have a multiindex dataframe like this:
Distance
Company Driver Document_id
Salt Fred 1 592.0
2 550.0
John 3 961.0
4 346.0
Bricks James 10 244.0
20 303.0
30 811.0
Fred 40 449.0
James 501 265.0
Sand Donald 15 378.0
800 359.0
How can I slice this df to see only the drivers who have worked for more than one company? The result should look like this:
Distance
Company Driver Document_id
Salt Fred 1 592.0
2 550.0
Bricks Fred 40 449.0
UPD: My original dataframe is 400k rows long, so I can't just slice it by index; I'm trying to find a general solution to problems like this.
To get the number of unique companies a person has worked for, use groupby and nunique:
v = (df.index.get_level_values(0)
.to_series()
.groupby(df.index.get_level_values(1))
.nunique())
# Alternative involving resetting the index, may not be as efficient.
# v = df.reset_index().groupby('Driver').Company.nunique()
v
Driver
Donald 1
Fred 2
James 1
John 1
Name: Company, dtype: int64
Now, you can run a query:
names = v[v.gt(1)].index.tolist()
df.query("Driver in #names")
Distance
Company Driver Document_id
Salt Fred 1 592.0
2 550.0
Bricks Fred 40 449.0
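An alternative that skips building names entirely is groupby(...).filter, which keeps every group whose driver appears under more than one company (a sketch, assuming the index levels are named 'Company' and 'Driver' as shown):
result = df.groupby(level='Driver').filter(
    lambda g: g.index.get_level_values('Company').nunique() > 1)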

How to perform groupby and mean on categorical columns in Pandas

I'm working on a dataset called gradedata.csv in Python Pandas, where I've created a new binned column called 'Status': 'Pass' if grade > 70 and 'Fail' if grade <= 70. Here are the first five rows of the dataset:
fname lname gender age exercise hours grade \
0 Marcia Pugh female 17 3 10 82.4
1 Kadeem Morrison male 18 4 4 78.2
2 Nash Powell male 18 5 9 79.3
3 Noelani Wagner female 14 2 7 83.2
4 Noelani Cherry female 18 4 15 87.4
address status
0 9253 Richardson Road, Matawan, NJ 07747 Pass
1 33 Spring Dr., Taunton, MA 02780 Pass
2 41 Hill Avenue, Mentor, OH 44060 Pass
3 8839 Marshall St., Miami, FL 33125 Pass
4 8304 Charles Rd., Lewis Center, OH 43035 Pass
Now, how do I compute the mean hours of exercise of female students with a 'status' of 'Pass'?
I've used the code below, but it isn't working.
print(df.groupby('gender', 'status')['exercise'].mean())
I'm new to Pandas. Can anyone please help me solve this?
You are very close. Note that your groupby key must be one of: mapping, function, label, or list of labels. Passed positionally as two separate arguments, 'status' lands in groupby's second parameter (axis) instead of being treated as a second key, which is why your call fails. In this case, you want a list of labels. For example:
res = df.groupby(['gender', 'status'])['exercise'].mean()
You can then extract your desired result via pd.Series.get:
query = res.get(('female', 'Pass'))
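End to end, a small sketch with made-up rows (column names follow the question):
import pandas as pd

df = pd.DataFrame({'gender': ['female', 'female', 'male'],
                   'status': ['Pass', 'Fail', 'Pass'],
                   'exercise': [3, 2, 4]})
res = df.groupby(['gender', 'status'])['exercise'].mean()
print(res.get(('female', 'Pass')))   # 3.0 -- mean exercise hours for female/Pass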
