Comparing Data in Pandas - python

I am just trying to get some data and re-arrange it.
Here is my dataset showing foods and the scores they received in different years.
What I want to do is find the foods which had the lowest and highest scores on average and track their scores across the years.
The next part is where I am a little stuck:
I'd need to display the max and min foods from the original dataset that would show all the columns - Food, year, Score. This is what I have tried, but it doesn't work:
menu[menu.Food == Max & menu.Food == Min]
Basically I want it to display something like the below in a dataframe, so I can plot some graphs (i.e. a line plot with the years on the x-axis, scores on the y-axis, showing both the lowest-scoring and the top-scoring food).
If you guys know any other ways of doing this, please let me know!
Any help would be appreciated

You can select the first and last rows per year with Series.duplicated, inverting each mask with ~ and chaining them with | (bitwise OR), then filter with boolean indexing:
df1 = df[~df['year'].duplicated() | ~df['year'].duplicated(keep='last')]
Solution with groupby:
df1 = df.groupby('year').agg(['first','last']).stack(1).droplevel(1).reset_index()
If you need the minimal and maximal Score per year, first sort so that within each year the lowest score comes first and the highest comes last:
df = df.sort_values(['year','Score'])
df2 = df[~df['year'].duplicated() | ~df['year'].duplicated(keep='last')]
Solution with groupby:
df2 = df.loc[df.groupby('year')['Score'].agg(['idxmax','idxmin']).stack()]
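A runnable sketch of the min/max-per-year approach, using a small made-up menu dataset (the food names, years, and scores below are hypothetical, not from the question):

```python
import pandas as pd

# Hypothetical sample data with the columns from the question
menu = pd.DataFrame({
    'Food':  ['burger', 'salad', 'pizza', 'burger', 'salad', 'pizza'],
    'year':  [2019, 2019, 2019, 2020, 2020, 2020],
    'Score': [3.2, 4.8, 4.1, 2.9, 4.6, 4.3],
})

# For each year, grab the row index of the lowest and highest Score,
# then select those rows from the original frame
df2 = menu.loc[menu.groupby('year')['Score'].agg(['idxmin', 'idxmax']).stack()]
print(df2)
```

From df2 you could then pivot (index='year', columns='Food', values='Score') and call .plot() to get the line chart described in the question.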

Related

Filtering the Min Value of a Mean in a GroupBy Pandas

I have a Star Wars People df with the following Columns:
columns = [name, height, mass, birth_year, gender, homeworld]
name is the index
I need to compute the following:
Which is the planet with the lowest average mass index of its characters?
Which character/s are from that planet?
Which I tried:
df.groupby(["homeworld"]).filter(lambda row: row['mass'].mean() > 0).min()
However, I need to have the min() inside the filter because there can be more than one character from the homeworld that has this lowest average mass index. Right now the filter function is not doing anything; it is just there to show how I want the code to look.
How can I achieve that? Hopefully with the filter function.
Use:
# aggregate mean into a Series
s = df.groupby("homeworld")['mass'].mean()
# filter out non-positive values and get the homeworld with the minimum average
out = s[s.gt(0)].idxmin()
# filter the original DataFrame
df1 = df[df['homeworld'].eq(out)]
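A minimal runnable sketch of this approach, with a few made-up characters (the names, masses, and homeworlds below are hypothetical placeholders):

```python
import pandas as pd

# Hypothetical sample data mimicking the Star Wars frame, with name as index
df = pd.DataFrame({
    'name': ['Luke', 'Leia', 'Han', 'Chewbacca'],
    'mass': [77.0, 49.0, 80.0, 112.0],
    'homeworld': ['Tatooine', 'Alderaan', 'Corellia', 'Kashyyyk'],
}).set_index('name')

s = df.groupby('homeworld')['mass'].mean()  # average mass per planet
planet = s[s.gt(0)].idxmin()                # planet with the lowest average
characters = df[df['homeworld'].eq(planet)] # all characters from that planet
print(planet, list(characters.index))
```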
What do you mean with "more than 1 character in the homeworld that have this lowest average mass index"?
It should not matter how many characters are present per homeworld, the groupby aggregation with the mean method will calculate the averages for you.
Looking at the question, you can just do the groupby like so (note ascending=True, since you want the lowest average mass first):
df = df.groupby('homeworld')['mass'].mean().sort_values(ascending=True)
df.head(1)
And note the homeworld that is displayed.

How to decile a column in a dataframe by partitions in Pandas Python?

I am trying to decile a sales column in my dataframe but also partition by year. So each year should have different deciles.
df = ['year','name', 'sales']
I think I can use this function but want to partition by year
df['decile']=pd.qcut(df['sales'],10,labels=False)
I suppose I can use groupby but I am not able to figure out the syntax.
Would really appreciate any help!
You can try:
df['decile'] = df.groupby('year')['sales'].transform(lambda g: pd.qcut(g.rank(method='first'), 10, labels=False) + 1)
Explanation:
g.rank(method='first'): handles ties where several sales share the same value. I added this because, in my experience, duplicate values are common; if duplicates are unlikely, you can pass g directly instead.
10, labels=False) + 1: keep the + 1 if you want labels from 1 to 10; without it, labels run from 0 to 9.
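Here is a self-contained sketch of the transform approach with made-up data (the years and sales figures below are randomly generated, purely for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 20 sales records per year
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'year': [2020] * 20 + [2021] * 20,
    'sales': rng.integers(100, 1000, size=40),
})

# Rank within each year, then cut the ranks into 10 equal-sized bins;
# + 1 shifts the labels from 0..9 to 1..10
df['decile'] = df.groupby('year')['sales'].transform(
    lambda g: pd.qcut(g.rank(method='first'), 10, labels=False) + 1
)
```

Because transform returns a Series aligned to the original index, the result assigns cleanly back as a column, with each year binned independently.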

How do I create new pandas dataframe by grouping multiple variables?

I am having tremendous difficulty getting my data sorted. I'm at the point where I could have manually created a new .csv file in the time I have spent trying to figure this out, but I need to do this through code. I have a large dataset of baseball salaries by player going back 150 years.
This is what my dataset looks like.
I want to create a new dataframe that sums the individual player salaries for a given team in a given year, organized by team and by year. I have tried the following:
team_salaries_groupby_team = salaries.groupby(['teamID','yearID']).agg({'salary' : ['sum']})
On screen the output looks sort of like what I want, but the group keys end up in the index rather than as three ordinary columns (plus an index on the left), so I can't really do the sort of analysis I want with this output.
Lastly, I have also tried this method:
new_column = salaries['teamID'] + salaries['yearID'].astype(str)
salaries['teamyear'] = new_column
teamyear = salaries.groupby(['teamyear']).agg({'salary' : ['sum']})
print(teamyear)
This adds up the individual player salaries per team for a given year, but now I don't know how to separate the year back out into its own column. Help please?
You just need to reset_index()
Here is sample code :
import pandas as pd

# DataFrame.append was removed in pandas 2.0; build the frame from a list of dicts instead
salaries = pd.DataFrame([
    {'yearID': 1985, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'A', 'salary': 10000},
    {'yearID': 1985, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'B', 'salary': 20000},
    {'yearID': 1985, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'A', 'salary': 10000},
    {'yearID': 1985, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'C', 'salary': 5000},
    {'yearID': 1985, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'B', 'salary': 20000},
    {'yearID': 2016, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'A', 'salary': 100000},
    {'yearID': 2016, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'B', 'salary': 200000},
    {'yearID': 2016, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'C', 'salary': 50000},
    {'yearID': 2016, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'A', 'salary': 100000},
    {'yearID': 2016, 'teamID': 'ATL', 'igID': 'NL', 'playerID': 'B', 'salary': 200000},
])
After that , groupby and reset_index
sample_df = salaries.groupby(['teamID', 'yearID']).salary.sum().reset_index()
Is this what you are looking for?
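Alternatively, passing as_index=False to groupby keeps the group keys as regular columns, so no reset_index() call is needed afterwards. A small sketch with a few hypothetical rows:

```python
import pandas as pd

# A few made-up salary records
salaries = pd.DataFrame({
    'yearID': [1985, 1985, 2016, 2016],
    'teamID': ['ATL', 'ATL', 'ATL', 'NYA'],
    'salary': [10000, 20000, 100000, 200000],
})

# as_index=False leaves teamID and yearID as ordinary columns
sample_df = salaries.groupby(['teamID', 'yearID'], as_index=False)['salary'].sum()
print(sample_df)
```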

How to set parameters for new column in pandas dataframe OR for a value count on python?

I'm using a some data from Kaggle about blue plaques in Europe. Many of these plaques describe famous people, but others describe places or events or animals. The dataframe includes the years of both birth and death for those famous people, and I have added a new column that displays the age of the lead subject at their time of death with the following code:
plaques['subject_age'] = plaques['lead_subject_died_in'] - plaques['lead_subject_born_in']
This works for some of the dataset, but since some of the subjects don't have values for the columns 'lead_subject_died_in' and 'lead_subject_born_in', some of my results are funky.
I was trying to determine the most common age of death with this:
agecount = plaques['subject_age'].value_counts()
print(agecount)
and I got some crazy results: negative numbers, 600+, etc. How do I make it so that it only counts the values for people who actually have data in both of those columns?
By the way, I'm a beginner, so if the operations you suggest are very difficult, please explain what they're doing so that I can learn and use it in the future!
You can use the dropna function to remove the NaN values in those columns:
# remove nan values from these 2 columns
plaques = plaques.dropna(subset = ['lead_subject_died_in', 'lead_subject_born_in'])
plaques['subject_age'] = plaques['lead_subject_died_in'] - plaques['lead_subject_born_in']
# get the most frequent age
plaques['subject_age'].value_counts().idxmax()
# get the five most common ages
plaques['subject_age'].value_counts().head()
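A minimal sketch of the whole flow with made-up birth and death years (the values below are hypothetical, not from the Kaggle dataset):

```python
import pandas as pd

# Hypothetical sample with some missing birth/death years
plaques = pd.DataFrame({
    'lead_subject_born_in': [1850, 1900, None, 1880],
    'lead_subject_died_in': [1920, 1970, 1960, None],
})

# Drop rows missing either year, then compute the age at death
plaques = plaques.dropna(subset=['lead_subject_died_in', 'lead_subject_born_in'])
plaques['subject_age'] = plaques['lead_subject_died_in'] - plaques['lead_subject_born_in']
print(plaques['subject_age'].value_counts().idxmax())
```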

Not able to understand the below pandas code

Can anyone please explain how the below code works? My question is: if the variable y holds only Price, how is the last line able to group by Doors? I am not able to follow or debug the flow. Please help, as I am very new to this field.
import pandas as pd
df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')
y = df['Price']
y.groupby(df.Doors).mean()
import pandas as pd
df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')
y = df['Price']
print("The Doors")
print(df.Doors)
print("The Price")
print(y)
y.groupby(df.Doors).mean()
Try the above code and you will see that the positions (indexes) where df.Doors equals 4, together with the prices at those same indexes in y, are treated as one group whose mean is taken; the rows with 2 doors in df.Doors form the other group.
It works because y is a pandas Series whose values are prices but which keeps the index it had in df. df.Doors is a Series with different values but the same index (since an index applies to the whole row). By aligning the indexes, pandas can perform the groupby.
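The index alignment described above can be seen with a tiny hypothetical frame (the prices and door counts below are made up):

```python
import pandas as pd

# Made-up car data
df = pd.DataFrame({
    'Price': [10000, 20000, 15000, 25000],
    'Doors': [2, 4, 2, 4],
})

y = df['Price']                      # a Series that keeps df's row index
means = y.groupby(df.Doors).mean()   # rows are matched up via the shared index
print(means)
```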
It loads the popular cars dataset into the dataframe df and assigns the Price column of the dataset to the variable y.
I would recommend you to get a general understanding of the data you loaded with the following commands:
df.info()
#shows you the range of the index as
#well as the data type of the colums
df.describe()
#shows common stats like mean or median
df.head()
#shows you the first 5 rows
The groupby command groups the rows (also called observations) of the cars dataframe df by the number of doors, then shows you the average price for cars with 2 doors, 4 doors, and so on.
Check the output by adding a print() around the last line of code.
Edit: sorry, I answered too fast; I thought you asked for a general explanation of the code rather than why it works.
