I'm working on an assignment for a course and I'm running into an issue with my dataframe. I made the changes as they requested, but when I go to display my new dataframe, it just shows the headers.
These are the requirements of the assignment:
Load the data file using pandas
Check for null values in the data.
Drop records with nulls in any of the columns
The Size column has sizes in KB as well as MB. To analyze it, you'll need to convert them to numeric.
The data set has M and k suffixes and "Varies with device" showing up in this column, so I removed them.
Price field is a string and has $ symbol. Remove the $ symbol and convert to numeric.
Average rating should be between 1 and 5, as only these values are allowed. Drop the rows that have a value outside this range.
For Free apps in the Type column, drop those rows.
Here is my code:
import pandas as pd
import numpy as np
ds = pd.read_csv('googleplaystore.csv')
headers = pd.DataFrame(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'])
ds['Size'] = ds['Size'].replace("Varies with Device", np.nan, inplace = True)
ds =ds.dropna()
ds['Size'] = ds['Size'].str.replace("M", "", regex = True)
ds['Size'] = ds['Size'].str.replace("k", "", regex = True)
ds['Size'] = ds['Size'].astype(float)
ds['Installs'] = ds['Installs'].str.replace("+", '', regex = True)
ds['Installs'] = ds['Installs'].astype(int)
ds['Reviews'] = ds['Reviews'].astype(float)
ds['Price'] = ds['Price'].str.replace("$", "", regex = True)
ds['Price'] = ds['Price'].astype(float)
indexrating = ds[(ds['Rating'] >= 1) & (ds['Rating'] <= 5)].index
ds.drop(indexrating, inplace = True)
ds['Type']= ds['Type'].replace("Free", np.nan, inplace = True)
ds =ds.dropna()
display(ds)
I was expecting a new dataframe to be displayed with those rows dropped.
If you remove everything that ends with 'M' or 'k' or contains "Varies with device", you remove every row:
>>> df['Size'].str[-1].value_counts()
M    7466    # ends with 'M'
e    1637    # "Varies with device"
k     257    # ends with 'k'
Name: Size, dtype: int64
Try with this version:
import pandas as pd

df = pd.read_csv('googleplaystore.csv')                    # 1: load the data file
df = df.dropna()                                           # 3: drop records with nulls
# 4: convert Size to numeric (KB): 'M' values are multiplied by 1024, 'k' values by 1
df['Size'] = (df['Size'].str.extract(r'(\d+\.?\d)', expand=False).astype(float)
              * df['Size'].str[-1].replace({'M': 1024, 'k': 1}))
df = df.dropna()                                           # remove the NaN produced by "Varies with device"
df['Price'] = df['Price'].str.strip('$').astype(float)     # 5: strip the $ symbol and convert to numeric
df = df.loc[df['Rating'].between(1, 5)]                    # 6: keep only ratings between 1 and 5
df = df.loc[df['Type'] != 'Free']                          # 7: drop Free apps
Output:
>>> df
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
234 TurboScan: scan documents and receipts in PDF BUSINESS 4.7 11442 6963.2 100,000+ Paid 4.99 Everyone Business March 25, 2018 1.5.2 4.0 and up
235 Tiny Scanner Pro: PDF Doc Scan BUSINESS 4.8 10295 39936.0 100,000+ Paid 4.99 Everyone Business April 11, 2017 3.4.6 3.0 and up
290 TurboScan: scan documents and receipts in PDF BUSINESS 4.7 11442 6963.2 100,000+ Paid 4.99 Everyone Business March 25, 2018 1.5.2 4.0 and up
291 Tiny Scanner Pro: PDF Doc Scan BUSINESS 4.8 10295 39936.0 100,000+ Paid 4.99 Everyone Business April 11, 2017 3.4.6 3.0 and up
477 Calculator DATING 2.6 57 6348.8 1,000+ Paid 6.99 Everyone Dating October 25, 2017 1.1.6 4.0 and up
... ... ... ... ... ... ... ... ... ... ... ... ... ...
10690 FO Bixby PERSONALIZATION 5.0 5 861.0 100+ Paid 0.99 Everyone Personalization April 25, 2018 0.2 7.0 and up
10697 Mu.F.O. GAME 5.0 2 16384.0 1+ Paid 0.99 Everyone Arcade March 3, 2017 1.0 2.3 and up
10760 Fast Tract Diet HEALTH_AND_FITNESS 4.4 35 2457.6 1,000+ Paid 7.99 Everyone Health & Fitness August 8, 2018 1.9.3 4.2 and up
10782 Trine 2: Complete Story GAME 3.8 252 11264.0 10,000+ Paid 16.99 Teen Action February 27, 2015 2.22 5.0 and up
10785 sugar, sugar FAMILY 4.2 1405 9728.0 10,000+ Paid 1.20 Everyone Puzzle June 5, 2018 2.7 2.3 and up
[577 rows x 13 columns]
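As a side note on the original code (my reading, not something stated in the answer above): one likely reason the dataframe ended up showing only headers is that Series.replace(..., inplace=True) returns None, so assigning that result back turns the whole Size column into None, and the following dropna() then removes every row. A minimal sketch of that effect:
import numpy as np
import pandas as pd
ds = pd.DataFrame({'Size': ['10M', 'Varies with device']})
ds['Size'] = ds['Size'].replace('Varies with device', np.nan, inplace=True)  # inplace=True returns None
print(ds['Size'])   # the whole column is now None
print(ds.dropna())  # empty dataframe, headers only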
I have data on Amazon's 50 best-selling books (from Kaggle).
There are no null values in the data.
I find the mean of the reviews given by users. Then I use a groupby, but it gives null values for User Rating and the mean.
In the next step, I filter all the books whose reviews are greater than the average number of reviews.
My question is: why did I get null values in the first place, since there were no null values in the dataset?
Why did I get null values when I used groupby?
This 'Answer' is an attempt at reproducibility.
The OP question cannot be reproduced.
PS: The groupby returns grouping as 'expected'.
@TANNU, it appears your NaN might have come from your data cleansing. Kindly show your relevant code.
NB: the 'Amazon Top 50 Bestselling Books 2009 - 2019' dataset has 550 rows (data.shape: (550, 7)).
For the record: your book_review groupby has a whopping 269010 rows, while my reproduction of book_review yielded 351 rows × 5 columns.
PS: Updated based on @Siva Shanmugam's edit.
## import libraries
import pandas as pd
import numpy as np
## read dataset
data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/Amazon%20Top%2050%20Bestselling%20Books%202009%20-%202019.csv')
data.head(2)
''' [out]
Name Author User Rating Reviews Price Year Genre
0 10-Day Green Smoothie Cleanse JJ Smith 4.7 17350 8 2016 Non Fiction
1 11/22/63: A Novel Stephen King 4.6 2052 22 2011 Fiction
'''
## check shape
data.shape
''' [out]
(550, 7)
'''
## check dataset
data.describe()
''' [out]
User Rating Reviews Price Year
count 550.000000 550.000000 550.000000 550.000000
mean 4.618364 11953.281818 13.100000 2014.000000
std 0.226980 11731.132017 10.842262 3.165156
min 3.300000 37.000000 0.000000 2009.000000
25% 4.500000 4058.000000 7.000000 2011.000000
50% 4.700000 8580.000000 11.000000 2014.000000
75% 4.800000 17253.250000 16.000000 2017.000000
max 4.900000 87841.000000 105.000000 2019.000000
'''
## check NaN
data.Reviews.isnull().any().any()
''' [out]
False
'''
## mean of reviews
mean_reviews = int(np.ceil(data.Reviews.mean()))
mean_reviews
''' [out]
11954
'''
## group by mean of `User Rating` and `Reviews`
book_review = data.groupby(['Name', 'Author', 'Genre'], as_index=False)[['User Rating', 'Reviews']].mean()
book_review
''' [out]
Name Author Genre User Rating Reviews
0 10-Day Green Smoothie Cleanse JJ Smith Non Fiction 4.7 17350.0
2 12 Rules for Life: An Antidote to Chaos Jordan B. Peterson Non Fiction 4.7 18979.0
3 1984 (Signet Classics) George Orwell Fiction 4.7 21424.0
5 A Dance with Dragons (A Song of Ice and Fire) George R. R. Martin Fiction 4.4 12643.0
6 A Game of Thrones / A Clash of Kings / A Storm... George R. R. Martin Fiction 4.7 19735.0
... ... ... ... ... ...
341 When Breath Becomes Air Paul Kalanithi Non Fiction 4.8 13779.0
342 Where the Crawdads Sing Delia Owens Fiction 4.8 87841.0
345 Wild: From Lost to Found on the Pacific Crest ... Cheryl Strayed Non Fiction 4.4 17044.0
348 Wonder R. J. Palacio Fiction 4.8 21625.0
350 You Are a Badass: How to Stop Doubting Your Gr... Jen Sincero Non Fiction 4.7 14331.0
83 rows × 5 columns
'''
## get book reviews that are less than the mean(reviews)
book_review[book_review.Reviews < mean_reviews]
''' [out]
Name Author Genre User Rating Reviews
'''
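For what it's worth, one common way NaN can appear in a groupby mean (a general illustration, not a claim about this particular dataset) is when a cleaning or merge step leaves a group holding only NaN values; the mean of an all-NaN group is NaN:
tmp = pd.DataFrame({'Name': ['A', 'A', 'B'], 'Reviews': [100, 200, np.nan]})
print(tmp.groupby('Name')['Reviews'].mean())
''' [out]
Name
A    150.0
B      NaN
Name: Reviews, dtype: float64
'''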
You’ll need to bring all your filtering skills together for this task. We’ve provided you a list of companies in the developers variable. Filter df however you choose so that you only get games that meet the following conditions:
Sold in all 3 regions (North America, Europe, and Japan)
The Japanese sales were greater than the combined sales from North America and Europe
The game developer is one of the companies in the developers list
There is no column that explicitly says whether a game was sold in each region, but you can infer that a game was not sold in a region if its sales are 0 for that region.
Use the cols variable to select only the 'name', 'developer', 'na_sales', 'eu_sales', and 'jp_sales' columns from the filtered DataFrame, and assign the result to a variable called df_filtered. Print the whole DataFrame.
You can use a filter mask or query string for this task. In either case, you need to check if the 'jp_sales' column is greater than the sum of 'na_sales' and 'eu_sales', check if each sales column is greater than 0, and use isin() to check if the 'developer' column contains one of the values in developers. Use [cols] to select only those columns and then print df_filtered.
   developer  na_sales  eu_sales  jp_sales  critic_score  user_score
0   Nintendo     41.36     28.96      3.77          76.0         8.0
1        NaN     29.08      3.58      6.81           NaN         NaN
2   Nintendo     15.68     12.76      3.79          82.0         8.3
3   Nintendo     15.61     10.93      3.28          80.0         8.0
4        NaN     11.27      8.89     10.22           NaN         NaN
This is my code. It's pretty difficult, and I'm having difficulty producing a df_filtered variable with running code.
import pandas as pd
df = pd.read_csv('/datasets/vg_sales.csv')
df['user_score'] = pd.to_numeric(df['user_score'], errors='coerce')
developers = ['SquareSoft', 'Enix Corporation', 'Square Enix']
cols = ['name', 'developer', 'na_sales', 'eu_sales', 'jp_sales']
df_filtered = df([cols ]> 0 | cols['jp_sales'] > sum(cols['eu_sales']+cols['na_sales']) | df['developer'].isin(developers))
print(df_filtered)
If I understand correctly, it looks like a multi-condition dataframe filtering:
df[
    df["developer"].isin(developers)
    & (df["jp_sales"] > df["na_sales"] + df["eu_sales"])
    & ~df["na_sales"].isnull()
    & ~df["eu_sales"].isnull()
    & ~df["jp_sales"].isnull()
]
It will not return any results for the sample dataset given in the question, because the conditions that JP sales should exceed the combined NA and EU sales and that the developer should be from the given list are not met. But it works for suitable data:
import numpy as np
import pandas as pd

data = [
    ("SquareSoft", 41.36, 28.96, 93.77, 76.0, 8.0),
    (np.nan, 29.08, 3.58, 6.81, np.nan, np.nan),
    ("SquareSoft", 15.68, 12.76, 3.79, 82.0, 8.3),
    ("Nintendo", 15.61, 10.93, 30.28, 80.0, 8.0),
    (np.nan, 11.27, 8.89, 10.22, np.nan, np.nan),
]
columns = ["developer", "na_sales", "eu_sales", "jp_sales", "critic_score", "user_score"]
developers = ['SquareSoft', 'Enix Corporation', 'Square Enix']
df = pd.DataFrame(data=data, columns=columns)
[Out]:
developer na_sales eu_sales jp_sales critic_score user_score
0 SquareSoft 41.36 28.96 93.77 76.0 8.0
Try this:
developers = ['SquareSoft', 'Enix Corporation', 'Square Enix']
cols = ['name', 'developer', 'na_sales', 'eu_sales', 'jp_sales']
cond = (
# Sold in all 3 regions
df[["na_sales", "eu_sales", "jp_sales"]].gt(0).all(axis=1)
# JP sales greater than NA and EU sales combined
& df["jp_sales"].gt(df["na_sales"] + df["eu_sales"])
# Developer is in a predefined list
& df["developer"].isin(developers)
)
if cond.any():
df_filtered = df.loc[cond, cols]
else:
print("No match found")
I have this text file, Masterlist.txt, which looks something like this:
S1234567A|Jan Lee|Ms|05/10/1990|Software Architect|IT Department|98785432|PartTime|3500
S1234567B|Feb Tan|Mr|10/12/1991|Corporate Recruiter|HR Corporate Admin|98766432|PartTime|1500
S1234567C|Mark Lim|Mr|15/07/1992|Benefit Specialist|HR Corporate Admin|98265432|PartTime|2900
S1234567D|Apr Tan|Ms|20/01/1996|Payroll Administrator|HR Corporate Admin|91765432|FullTime|1600
S1234567E|May Ng|Ms|25/05/1994|Training Coordinator|HR Corporate Admin|98767432|Hourly|1200
S1234567Y|Lea Law|Ms|11/07/1994|Corporate Recruiter|HR Corporate Admin|94445432|PartTime|1600
I want to reduce the Salary (the number at the end of each line) by 50%, but only for lines that contain "PartTime" and have a year after 1995, and then add the results up.
Currently I only know how to select only lines with "PartTime" in it, and my code looks like this:
f = open("Masterlist.txt", "r")
for x in f:
if "FullTime" in x:
print(x)
How do I extract the Salary and reduce by 50% + add it up only if the year is after 1995?
Try using pandas library.
From your question, I suppose you want to reduce the Salary by 50% if the year is before 1995, and otherwise increase it by 50%.
import pandas as pd
path = r'../Masterlist.txt' # path to your .txt file
df = pd.read_csv(path, sep='|', names = [0,1,2,'Date',4,5,6,'Type', 'Salary'], parse_dates=['Date'])
# Now column Date is treated as datetime object
print(df.head())
0 1 2 Date 4 \
0 S1234567A Jan Lee Ms 1990-05-10 Software Architect
1 S1234567B Feb Tan Mr 1991-10-12 Corporate Recruiter
2 S1234567C Mark Lim Mr 1992-07-15 Benefit Specialist
3 S1234567D Apr Tan Ms 1996-01-20 Payroll Administrator
4 S1234567E May Ng Ms 1994-05-25 Training Coordinator
5 6 Type Salary
0 IT Department 98785432 PartTime 3500
1 HR Corporate Admin 98766432 PartTime 1500
2 HR Corporate Admin 98265432 PartTime 2900
3 HR Corporate Admin 91765432 FullTime 1600
4 HR Corporate Admin 98767432 Hourly 1200
df.Salary = df.apply(lambda row: row.Salary*0.5 if row['Date'].year < 1995 and row['Type'] == 'PartTime' else row.Salary + (row.Salary*0.5 ), axis=1)
print(df.Salary.head())
0 1750.0
1 750.0
2 1450.0
3 2400.0
4 600.0
Name: Salary, dtype: float64
Add some modifications to the if/else statement inside the apply function if you want something different.
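For the literal question (halve the Salary only for PartTime rows dated after 1995, then add everything up), a minimal sketch reusing the df built above (the 'Date', 'Type' and 'Salary' column names come from the names= list passed to read_csv):
# halve Salary only where Type is PartTime and the year is after 1995, then sum
mask = (df['Type'] == 'PartTime') & (df['Date'].dt.year > 1995)
adjusted = df['Salary'].where(~mask, df['Salary'] * 0.5)
print(adjusted.sum())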
I am trying to use a data frame that includes historical game statistics, like df1 below, and build a second data frame that shows what the various column averages were going into each game (as I show in df2). How can I use groupby or something else to find the various averages for each team, but only for games with a date prior to the date in that specific row? Example of historical games data:
Df1 = Date Team Opponent Points Points Against 1st Downs Win?
4/16/20 Eagles Ravens 10 20 10 0
2/10/20 Eagles Falcons 30 40 8 0
12/15/19 Eagles Cardinals 40 10 7 1
11/15/19 Eagles Giants 20 15 5 1
10/12/19 Jets Giants 10 18 2 1
Below is the dataframe that I'm trying to create. As you can see, it shows the averages for each column, but only for the games that happened prior to each game. Note: this is a simplified example of a much larger data set that I'm working with. In case the context helps, I'm trying to create this dataframe so I can analyze the correlation between the averages and whether the team won.
Df2 = Date Team Opponent Avg Pts Avg Pts Against Avg 1st Downs Win %
4/16/20 Eagles Ravens 25.0 21.3 7.5 75%
2/10/20 Eagles Falcons 30.0 12.0 6.0 100%
12/15/19 Eagles Cardinals 20.0 15.0 5.0 100%
11/15/19 Eagles Giants NaN NaN NaN NaN
10/12/19 Jets Giants NaN NaN NaN NaN
Let me know if anything above isn't clear, appreciate the help.
The easiest way is to turn your dataframe into a Time Series.
Run this for a file:
data=pd.read_csv(r'C:\Users\...csv',index_col='Date',parse_dates=True)
This is an example with a CSV file.
You can run this after:
data[:'<cutoff date>']  # selects all rows with dates up to and including the cutoff
If you want to build a time-indexed Series:
index=pd.DatetimeIndex(['2014-07-04',...,'2015-08-04'])
data=pd.Series([0, 1, 2, 3], index=index)
Define your own function
def aggs_under_date(df, date):
    first_team = df.Team.iloc[0]
    first_opponent = df.Opponent.iloc[0]
    if df.Date.iloc[0] <= date:
        avg_points = df.Points.mean()
        avg_against = df['Points Against'].mean()
        avg_downs = df['1st Downs'].mean()
        win_perc = f"{df['Win?'].sum() / df['Win?'].count() * 100} %"
        return [first_team, first_opponent, avg_points, avg_against, avg_downs, win_perc]
    else:
        return [first_team, first_opponent, np.nan, np.nan, np.nan, np.nan]
And do the groupby, applying the function you just defined:
date_max = pd.to_datetime('11/15/19')
Df1.groupby(['Date']).apply(aggs_under_date, date_max)
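If the goal is strictly "the average over all of a team's earlier games", another possible sketch (my reading of the intent, assuming Df1 from the question with dates in month/day/year format) is an expanding mean per team, shifted by one row so each game only sees strictly earlier games:
import pandas as pd
# assumes Df1 as shown in the question
Df1['Date'] = pd.to_datetime(Df1['Date'], format='%m/%d/%y')
Df1 = Df1.sort_values(['Team', 'Date'])
stat_cols = ['Points', 'Points Against', '1st Downs', 'Win?']
priors = (Df1.groupby('Team', group_keys=False)[stat_cols]
              .apply(lambda g: g.expanding().mean().shift(1)))  # averages over earlier games only
result = Df1[['Date', 'Team', 'Opponent']].join(priors)
print(result.sort_values('Date', ascending=False))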
I have a dataframe and I want to check whether its movies are present in another df.
after_h.sample(10, random_state=1)
movie year ratings
108 Mechanic: Resurrection 2016 4.0
206 Warcraft 2016 4.0
106 Max Steel 2016 3.5
107 Me Before You 2016 4.5
I want to compare if the above movies are present in another df.
FILM Votes
0 Avengers: Age of Ultron (2015) 4170
1 Cinderella (2015) 950
2 Ant-Man (2015) 3000
3 Do You Believe? (2015) 350
4 Max Steel (2016) 560
I want something like this as my final output:
FILM votes
0 Max Steel 560
There are two ways:
Get the row indices for partial matches with FILM.str.startswith(title) or FILM.str.contains(title). Either of:
df1[df1.movie.apply(lambda title: df2.FILM.str.startswith(title)).any(axis=1)]
df1[df1['movie'].apply(lambda title: df2['FILM'].str.contains(title)).any(axis=1)]
movie year ratings
106 Max Steel 2016 3.5
Alternatively, you can use merge() if you convert the compound string column df2['FILM'] into its two component columns movie_title (year).
# see code at bottom to recreate your dataframes
df2[['movie','year']] = df2.FILM.str.extract(r'([^\(]*) \(([0-9]*)\)')
# reorder columns and drop 'FILM' now we have its subfields 'movie','year'
df2 = df2[['movie','year','Votes']]
df2['year'] = df2['year'].astype(int)
df2.merge(df1)
movie year Votes ratings
0 Max Steel 2016 560 3.5
(Acknowledging much help from @user3483203 here and in the Python chat room.)
Code to recreate dataframes:
import pandas as pd
from io import StringIO
dat1 = """movie year ratings
108 Mechanic: Resurrection 2016 4.0
206 Warcraft 2016 4.0
106 Max Steel 2016 3.5
107 Me Before You 2016 4.5"""
dat2 = """FILM Votes
0 Avengers: Age of Ultron (2015) 4170
1 Cinderella (2015) 950
2 Ant-Man (2015) 3000
3 Do You Believe? (2015) 350
4 Max Steel (2016) 560"""
df1 = pd.read_csv(StringIO(dat1), sep='\s{2,}', engine='python', index_col=0)
df2 = pd.read_csv(StringIO(dat2), sep='\s{2,}', engine='python')
Given input dataframes df1 and df2, you can use Boolean indexing via pd.Series.isin. To align the format of the movie strings you need to first concatenate movie and year from df1:
s = df1['movie'] + ' (' + df1['year'].astype(str) + ')'
res = df2[df2['FILM'].isin(s)]
print(res)
FILM VOTES
4 Max Steel (2016) 560
smci's option 1 is nearly there; the following worked for me:
df1['Votes'] = ''
df1['Votes']=df1['movie'].apply(lambda title: df2[df2['FILM'].str.startswith(title)]['Votes'].any(0))
Explanation:
Create a Votes column in df1
Apply a lambda to every movie string in df1
The lambda looks up df2, selecting all rows in df2 where Film starts with the movie title
Select the Votes column of the resulting subset of df2
Take the first value in this column with any(0)
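A variant of the same idea that pulls the actual first matching vote count rather than a boolean (my adjustment, assuming df1 and df2 as built above):
def first_matching_votes(title):
    # rows of df2 whose FILM starts with the movie title
    matches = df2.loc[df2['FILM'].str.startswith(title), 'Votes']
    return matches.iloc[0] if not matches.empty else None

df1['Votes'] = df1['movie'].apply(first_matching_votes)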