Why are there null values when I use group by? - python

I have data on Amazon's 50 best-selling books (from Kaggle).
There are no null values in the data.
First, I find the mean of the reviews given by users. Then I use a group by, but it gives null values for User Rating and the mean.
In the next step, I filter all those rows where the reviews are greater than the average number of reviews.
My question is: why did I get null values in the group by step, since there were no null values in the dataset?
Why did I get null values when I used group by?
ipynb file

This 'Answer' is an attempt at reproducibility.
The OP's question cannot be reproduced.
PS: The groupby returns the grouping as 'expected'.
#TANNU, it appears your 'NaN' values might have come from your data cleansing; one illustrative way this can happen is sketched below. Kindly show your relevant code.
NB: The 'Amazon Top 50 Bestselling Books 2009 - 2019' dataset has 550 rows (data.shape: (550, 7)).
[For noting]
Your book_review groupby has a whopping 269010 rows; my reproduction of book_review yielded 351 rows × 5 columns.
PS: Updated based on #Siva Shanmugam's edit.
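For illustration only (the OP's cleansing code has not been shown), here is one common, purely hypothetical way NaN can creep in during cleaning: assigning a filtered or re-indexed Series back into the original frame. Pandas aligns on the index, so rows missing from the new index become NaN.
## hypothetical cleaning step that silently introduces NaN
import pandas as pd
df = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Reviews': [10, 20, 30]})
cleaned = df.loc[df.Reviews > 15, 'Reviews']  ## keeps only index 1 and 2
df['Reviews'] = cleaned                       ## aligns on index; row 0 becomes NaN
df
''' [out]
  Name  Reviews
0    A      NaN
1    B     20.0
2    C     30.0
'''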
## import libraries
import pandas as pd
import numpy as np
## read dataset
data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/Amazon%20Top%2050%20Bestselling%20Books%202009%20-%202019.csv')
data.head(2)
''' [out]
Name Author User Rating Reviews Price Year Genre
0 10-Day Green Smoothie Cleanse JJ Smith 4.7 17350 8 2016 Non Fiction
1 11/22/63: A Novel Stephen King 4.6 2052 22 2011 Fiction
'''
## check shape
data.shape
''' [out]
(550, 7)
'''
## check dataset
data.describe()
''' [out]
User Rating Reviews Price Year
count 550.000000 550.000000 550.000000 550.000000
mean 4.618364 11953.281818 13.100000 2014.000000
std 0.226980 11731.132017 10.842262 3.165156
min 3.300000 37.000000 0.000000 2009.000000
25% 4.500000 4058.000000 7.000000 2011.000000
50% 4.700000 8580.000000 11.000000 2014.000000
75% 4.800000 17253.250000 16.000000 2017.000000
max 4.900000 87841.000000 105.000000 2019.000000
'''
## check NaN
data.Reviews.isnull().any().any()
''' [out]
False
'''
## mean of reviews
mean_reviews = int(np.ceil(data.Reviews.mean()))  ## np.math.ceil was removed in NumPy 2.0
mean_reviews
''' [out]
11954
'''
## group by mean of `User Rating` and `Reviews`
book_review = data.groupby(['Name', 'Author', 'Genre'], as_index=False)[['User Rating', 'Reviews']].mean()
book_review
''' [out]
Name Author Genre User Rating Reviews
0 10-Day Green Smoothie Cleanse JJ Smith Non Fiction 4.7 17350.0
2 12 Rules for Life: An Antidote to Chaos Jordan B. Peterson Non Fiction 4.7 18979.0
3 1984 (Signet Classics) George Orwell Fiction 4.7 21424.0
5 A Dance with Dragons (A Song of Ice and Fire) George R. R. Martin Fiction 4.4 12643.0
6 A Game of Thrones / A Clash of Kings / A Storm... George R. R. Martin Fiction 4.7 19735.0
... ... ... ... ... ...
341 When Breath Becomes Air Paul Kalanithi Non Fiction 4.8 13779.0
342 Where the Crawdads Sing Delia Owens Fiction 4.8 87841.0
345 Wild: From Lost to Found on the Pacific Crest ... Cheryl Strayed Non Fiction 4.4 17044.0
348 Wonder R. J. Palacio Fiction 4.8 21625.0
350 You Are a Badass: How to Stop Doubting Your Gr... Jen Sincero Non Fiction 4.7 14331.0
83 rows × 5 columns
'''
## get book reviews that are less than the mean(reviews)
book_review[book_review.Reviews < mean_reviews]
''' [out]
Name Author Genre User Rating Reviews
'''

Related

How to separate numbers from a string and move them to the next column in Python?

I am working on share market data, and in some rows the market cap has shifted into the previous column. I am trying to move it into the next column, but the value it returns is completely different.
This is the code I am using -
data['Market Cap (Crores)']=data['Sub-Sector'].astype('str').str.extractall('(\d+)').unstack().fillna('').sum(axis=1).astype(int)
data['Market Cap (Crores)']
But the output I am getting is
968 NaN
969 NaN
970 -2.147484e+09
971 -2.147484e+09
972 -2.147484e+09
How do I get the correct values?
You can do it step by step. First, pick out the rows that need fixing (where the market cap is NaN). Then create two functions: one to pull the market cap out of the string and one to remove it. Use apply to fix up those rows, and substitute the values back into the original dataframe.
import pandas as pd
import numpy as np

data = [
    ['GNA Axles Ltd', 'Auto Parts', 1138.846797],
    ['Andhra Paper Ltd', 'Paper Products', 1135.434614],
    ['Tarc', 'Real Estate 1134.645409', np.NaN],
    ['Udaipur Cement Works', 'Cement 1133.531734', np.NaN],
    ['Pnb Gifts', 'Investment Banking 1130.463641', np.NaN],
]

def getprice(row):
    return float(row['Sub-Sector'].split()[-1])

def removeprice(row):
    return ' '.join(row['Sub-Sector'].split()[:-1])

df = pd.DataFrame(data, columns=['Company', 'Sub-Sector', 'Market Cap (Crores)'])
print(df)

picks = df['Market Cap (Crores)'].isna()
rows = df[picks]
print(rows)

df.loc[picks, 'Sub-Sector'] = rows.apply(removeprice, axis=1)
df.loc[picks, 'Market Cap (Crores)'] = rows.apply(getprice, axis=1)
print(df)
Output:
Company Sub-Sector Market Cap (Crores)
0 GNA Axles Ltd Auto Parts 1138.846797
1 Andhra Paper Ltd Paper Products 1135.434614
2 Tarc Real Estate 1134.645409 NaN
3 Udaipur Cement Works Cement 1133.531734 NaN
4 Pnb Gifts Investment Banking 1130.463641 NaN
Company Sub-Sector Market Cap (Crores)
2 Tarc Real Estate 1134.645409 NaN
3 Udaipur Cement Works Cement 1133.531734 NaN
4 Pnb Gifts Investment Banking 1130.463641 NaN
Company Sub-Sector Market Cap (Crores)
0 GNA Axles Ltd Auto Parts 1138.846797
1 Andhra Paper Ltd Paper Products 1135.434614
2 Tarc Real Estate 1134.645409
3 Udaipur Cement Works Cement 1133.531734
4 Pnb Gifts Investment Banking 1130.463641
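As a footnote to the apply-based fix above, a vectorized sketch of the same idea (using the same df as above, and assuming the trailing market cap always looks like digits.digits):
# vectorized sketch: pull a trailing number out of 'Sub-Sector', fill the missing
# market caps with it, then strip the number from the sector name
cap = df['Sub-Sector'].str.extract(r'(\d+\.\d+)$', expand=False).astype(float)
df['Market Cap (Crores)'] = df['Market Cap (Crores)'].fillna(cap)
df['Sub-Sector'] = df['Sub-Sector'].str.replace(r'\s*\d+\.\d+$', '', regex=True)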
Hi there,
Here is a method you can try: use your code to create a numeric field (Sub-Sector Number), then select the non-missing value from Sub-Sector Number and Sub-Sector to create your final field, Sub-Sector final:
df['Sub-Sector Number'] = df['Sub-Sector'].astype('str').str.extractall(r'(\d+)').unstack().fillna('').sum(axis=1).astype(int)
df['Sub-Sector final'] = df[['Sub-Sector Number', 'Sub-Sector']].ffill(axis=1).iloc[:, -1]
df
Please try it, and if it does not work please let me know.
Thanks, Leon

How do I get all values from one pandas column that correspond to a specific value in a second column?

I have a Dataframe that looks like this :
name age occ salary
0 Vinay 22.0 engineer 60000.0
1 Kushal NaN doctor 70000.0
2 Aman 24.0 engineer 80000.0
3 Rahul NaN doctor 65000.0
4 Ramesh 25.0 doctor 70000.0
and I'm trying to get all salary values that correspond to a specific occupation, to then compute the mean salary of that occupation.
Here is an answer in a few steps:
temp_df = df.loc[df['occ'] == 'engineer']
temp_df.salary.mean()
All averages at once:
df_averages = df[['occ', 'salary']].groupby('occ').mean()
salary
occ
doctor 68333.333333
engineer 70000.000000
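To read a single occupation's mean back out of that grouped result, for example:
df_averages.loc['engineer', 'salary']   # 70000.0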

Fuzzy match for 2 lists with very similar names

I know this question has been asked in some form, so apologies. I'm trying to fuzzy match list 1 (sample_name) to list 2 (actual_name). actual_name has significantly more names than list 1, and I keep running into fuzzy match not working well. I've tried multiple fuzzy match methods (partial_ratio, token_set_ratio) but keep running into issues since there are many names in list 2 that are very similar. Is there any way to improve the matching here? Ideally I want list 1, the matched name from list 2, and the match score in a third column of a new dataframe. Any help would be much appreciated. Thanks.
I have used this so far:
df1 = sample_df['sample_name'].to_list()
df2 = actual_df['actual_name'].to_list()
response = {}
for name_to_find in df1:
    for name_master in df2:
        if fuzz.partial_ratio(name_to_find, name_master) > 90:
            response[name_to_find] = name_master
            break

for key, value in response.items():
    print('sample name ' + key + ' actual_name ' + value)
sample_name          actual_name
jtsports             JT Sports LLC
tombaseball          Tom Baseball Inc.
context express      Context Express LLC
zb sicily            ZB Sicily LLC
lightening express   Lightening Express LLC
fire roads           Fire Road Express
N/A                  Earth Treks
N/A                  TS Sports LLC
N/A                  MM Baseball Inc.
N/A                  Contact Express LLC
N/A                  AB Sicily LLC
N/A                  Lightening Roads LLC
Not sure if this is your expected output (and you may need to adjust the threshold), but I think this is what you are looking for?
import pandas as pd
from fuzzywuzzy import process

threshold = 50
list1 = ['jtsports', 'tombaseball', 'context express', 'zb sicily',
         'lightening express', 'fire roads']
list2 = ['JT Sports LLC', 'Tom Baseball Inc.', 'Context Express LLC',
         'ZB Sicily LLC', 'Lightening Express LLC', 'Fire Road Express',
         'Earth Treks', 'TS Sports LLC', 'MM Baseball Inc.', 'Contact Express LLC',
         'AB Sicily LLC', 'Lightening Roads LLC']

response = []
for name_to_find in list1:
    resp_match = process.extractOne(name_to_find, list2)
    if resp_match[1] > threshold:
        row = {'sample_name': name_to_find, 'actual_name': resp_match[0], 'score': resp_match[1]}
        response.append(row)
        print(row)

results = pd.DataFrame(response)

# If you need all the 'actual_name' values to be in the dataframe, continue below;
# otherwise don't include these last two lines of code.
unmatched = pd.DataFrame([x for x in list2 if x not in list(results['actual_name'])], columns=['actual_name'])
results = pd.concat([results, unmatched], sort=False).reset_index(drop=True)  # DataFrame.append was removed in pandas 2.0
Output:
print(results)
sample_name actual_name score
0 jtsports JT Sports LLC 79.0
1 tombaseball Tom Baseball Inc. 81.0
2 context express Context Express LLC 95.0
3 zb sicily ZB Sicily LLC 95.0
4 lightening express Lightening Express LLC 95.0
5 fire roads Fire Road Express 86.0
6 NaN Earth Treks NaN
7 NaN TS Sports LLC NaN
8 NaN MM Baseball Inc. NaN
9 NaN Contact Express LLC NaN
10 NaN AB Sicily LLC NaN
11 NaN Lightening Roads LLC NaN
It won't be the most efficient way to do it, being of order O(n) in the number of correct names, but you could calculate the Levenshtein distance between each pair and then take the closest match.
That is how a lot of naive spell-check systems work.
I'm suggesting that you run this calculation for each of the correct names and return the match with the best score.
Adjusting the code you have posted, I would do something like the following. Bear in mind that with Levenshtein distance lower means closer, so that would need adjusting; the function you are already using scores higher for closer matches, so the code below uses that convention.
df1 = sample_df['sample_name'].to_list()
df2 = actual_df['actual_name'].to_list()
response = {}
for name_to_find in df1:
    highest_so_far = ("", 0)
    for name_master in df2:
        score = fuzz.partial_ratio(name_to_find, name_master)
        if score > highest_so_far[1]:
            highest_so_far = (name_master, score)
    response[name_to_find] = highest_so_far[0]

for key, value in response.items():
    print('sample name ' + key + ' actual_name ' + value)
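To also get the match score into a third column of a new dataframe, as the question asks, a small sketch extending this loop (same df1/df2 lists as above) could be:
# sketch: collect the best match and its score per sample name into a dataframe
import pandas as pd
rows = []
for name_to_find in df1:
    best_name, best_score = "", 0
    for name_master in df2:
        score = fuzz.partial_ratio(name_to_find, name_master)
        if score > best_score:
            best_name, best_score = name_master, score
    rows.append({'sample_name': name_to_find, 'actual_name': best_name, 'score': best_score})
matched_df = pd.DataFrame(rows)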

Percentage based on column value

I am trying to find the most efficient method to calculate, for each teacher, the percentage of their records in my df where they are not tenured.
For example, the df below:
District   Teacher Name   Tenured?
55         Bo Carns       Yes
42         Bo Carns       No
55         Steven Ast     No
43         Fiona Tan      Yes
43         Steven Ast     Yes
43         Mike Po        No
31         Steve Chi      No
Each teacher can teach in multiple districts and can be tenured or not tenured in each, so I want to calculate, for every teacher in my df, the % of their rows where they are not tenured, to find the teachers who are most often not tenured (i.e., for each teacher, the number of times the Tenured? column is No divided by that teacher's number of rows).
Expected output would be:
Teacher Name   pct
Bo Carns       0.5
Steven Ast     0.5
Fiona Tan      0
Mike Po        1
Steve Chi      1
where pct is the percentage of their records (across all districts) in which they were not tenured.
Thanks for taking the time to look at my question.
You can try
s = df['Tenured?'].eq('Yes').groupby(df.TeacherName).mean()
Out[57]:
TeacherName
BoCarns 0.5
FionaTan 1.0
MikePo 0.0
SteveChi 0.0
StevenAst 0.5
Name: Tenured?, dtype: float64
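Note that .eq('Yes') gives the tenured fraction; the question's expected output is the not-tenured fraction, so a minimal variation (same frame, same TeacherName column as above) would be:
s = df['Tenured?'].eq('No').groupby(df.TeacherName).mean()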

Add new column to dataframe based on an average

I have a dataframe that includes the category of a project, currency, number of investors, goal, etc., and I want to create a new column which will be "average success rate of their category":
state category main_category currency backers country \
0 0 Poetry Publishing GBP 0 GB
1 0 Narrative Film Film & Video USD 15 US
2 0 Narrative Film Film & Video USD 3 US
3 0 Music Music USD 1 US
4 1 Restaurants Food USD 224 US
usd_goal_real duration year hour
0 1533.95 59 2015 morning
1 30000.00 60 2017 morning
2 45000.00 45 2013 morning
3 5000.00 30 2012 morning
4 50000.00 35 2016 afternoon
I have the average success rates in series format:
Dance 65.435209
Theater 63.796134
Comics 59.141527
Music 52.660558
Art 44.889045
Games 43.890467
Film & Video 41.790649
Design 41.594386
Publishing 34.701650
Photography 34.110847
Fashion 28.283186
Technology 23.785582
And now I want to add a new column, where each row will have the success rate matching its category, i.e. wherever the row's main_category is Technology, the new column will contain 23.78 for that row.
df['category_success_rate'] = # I want this column to be the % success that matches the category in the "main_category" column.
I think you need GroupBy.transform with a Boolean mask, df['state'].eq(1) or (df['state'] == 1):
df['category_success_rate'] = (df['state'].eq(1)
                               .groupby(df['main_category'])
                               .transform('mean') * 100)
Alternative:
df['category_success_rate'] = ((df['state'] == 1)
                               .groupby(df['main_category'])
                               .transform('mean') * 100)
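Alternatively, if the per-category averages are already available as a Series indexed by category (as shown in the question), a short sketch, assuming that Series is bound to a hypothetical name success_rates, would be:
# success_rates: the Series of average success rates from the question, indexed by main_category
df['category_success_rate'] = df['main_category'].map(success_rates)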
