I'm doing some Python exercises and I'm stuck with a question.
I'm using the following Titanic dataframe: https://drive.google.com/file/d/1NEHvlUMTNPusHZvHUFTqeUR_9yY1tHVz/view
Now I need to find the minimum value of the column 'Age' for each class of 'Pclass' for the passengers that paid a fare ('Fare') above the average.
Using this I can get the minimum age by group, but how can I add the 'above average Fare' condition to this?
df.groupby('Pclass')['Age'].min()
You can do this in three steps:
find the mean of 'Fare'
filter the rows whose 'Fare' is above that mean
use pivot_table to get the minimum value of the column 'Age' for each class of 'Pclass'
avrg_Fare = df['Fare'].mean()
df = df.loc[df['Fare'] > avrg_Fare]
PVT_min_age = df.pivot_table(index='Pclass', values='Age', aggfunc='min').reset_index()
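The three steps above can be sketched end to end on a small frame. This is only an illustration: the `titanic` frame and its values are invented stand-ins for the real dataset, and it uses a plain groupby instead of pivot_table.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Titanic data (values invented).
titanic = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [38.0, 54.0, 27.0, 14.0, 22.0, 4.0],
    "Fare": [71.0, 52.0, 30.0, 11.0, 7.0, 40.0],
})

# Step 1: mean fare over all passengers.
mean_fare = titanic["Fare"].mean()

# Step 2: keep only passengers who paid more than the mean.
above_mean = titanic[titanic["Fare"] > mean_fare]

# Step 3: minimum age per class among those passengers.
min_age_by_class = above_mean.groupby("Pclass")["Age"].min()
print(min_age_by_class)
```

In this made-up sample no second-class passenger paid above the mean, so class 2 does not appear in the result at all.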
Give this a shot
average_fare = df['Fare'].mean()
df.query("Fare > @average_fare").groupby('Pclass').agg({'Age': ['min']})
Grouping by with Where conditions in Pandas
I may have some syntax errors since it's been a while since I've done pandas. If anyone sees a problem, please correct it.
I have a Star Wars People df with the following Columns:
columns = [name, height, mass, birth_year, gender, homeworld]
name is the index
I need to compute the following:
Which is the planet with the lowest average mass index of its characters?
Which character/s are from that planet?
Which I tried:
df.groupby(["homeworld"]).filter(lambda row: row['mass'].mean() > 0).min()
However, I need to have the min() inside the filter because I can have more than 1 character in the homeworld that have this lowest average mass index. Right now the filter function does nothing useful; it is only there to show the shape of the code I want.
How can I achieve that? Hopefully with the filter function.
Use:
# aggregate the mean mass per homeworld into a Series
s = df.groupby("homeworld")['mass'].mean()
# keep only positive means and get the homeworld with the minimum value
out = s[s.gt(0)].idxmin()
# filter the original DataFrame
df1 = df[df['homeworld'].eq(out)]
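Note that idxmin returns only the first label when several homeworlds tie for the lowest mean. If ties matter, one possible variation (a sketch, not the answer's method) is to broadcast the group mean back to every row with transform and compare it against the overall minimum. The `people` frame below is invented for illustration.

```python
import pandas as pd

# Hypothetical sample; names and numbers are invented for illustration.
people = pd.DataFrame(
    {
        "mass": [80.0, 60.0, 70.0, 90.0, 100.0],
        "homeworld": ["A", "A", "B", "C", "C"],
    },
    index=pd.Index(["p1", "p2", "p3", "p4", "p5"], name="name"),
)

# Mean mass per homeworld, repeated on every row of that homeworld.
mean_mass = people.groupby("homeworld")["mass"].transform("mean")

# Keep the rows whose homeworld mean equals the lowest mean (ties included).
lightest = people[mean_mass == mean_mass.min()]
print(lightest)
```

Here homeworlds A and B both average 70, so the characters from both are kept.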
What do you mean with "more than 1 character in the homeworld that have this lowest average mass index"?
It should not matter how many characters are present per homeworld, the groupby aggregation with the mean method will calculate the averages for you.
When I look at the question you can just do the groupby like so:
df = df.groupby(['homeworld'])[['mass']].mean().sort_values(by=["mass"], ascending=True)
df.head(1)
And note the homeworld that is displayed
So I have this dataset of temperatures. Each line describes the temperature in Celsius, measured hourly over a day.
So, I need to compute a new variable called avg_temp_ar_mensal, which represents the average temperature of a city in a month. City is represented in this dataset as estacao and month as mes.
I'm trying to do this using pandas. The following line of code is the one I'm trying to use to solve this problem:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes', 'estacao']).mean()
The goal of this code is to store in a new column the average of the temperature of the city and month. But it doesn't work. If I try the following line of code:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes']).mean()
It works, but the result is wrong. It calculates the mean across every city in the dataset, which I don't want because it adds noise to my data. I need to separate the temperatures by month and city and then calculate the mean.
The result of the groupby has fewer rows than the initial dataframe, which is why your code runs into an error.
There are two ways to solve this problem. The first one is to use transform:
df.groupby(['mes', 'estacao'])['temp_ar'].transform(lambda g: g.mean())
The second is to create a new dataframe dfn from the groupby and then merge it back into df:
dfn = df.groupby(['mes', 'estacao'])['temp_ar'].mean().reset_index(name='average')
df = pd.merge(df, dfn, on=['mes', 'estacao'], how='left')
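To see that the two approaches give the same numbers, here is a minimal sketch on invented readings. The column names follow the question; the values and the `monthly` and `merged` names are assumptions made for the example.

```python
import pandas as pd

# Hypothetical hourly readings; column names follow the question, values are invented.
df = pd.DataFrame({
    "estacao": ["A", "A", "A", "B", "B"],
    "mes": [1, 1, 2, 1, 1],
    "temp_ar": [20.0, 22.0, 30.0, 10.0, 14.0],
})

# Approach 1: transform returns one value per original row, so it can be assigned directly.
df["avg_temp_ar_mensal"] = df.groupby(["mes", "estacao"])["temp_ar"].transform("mean")

# Approach 2: aggregate to one row per group, then merge back on the keys.
monthly = df.groupby(["mes", "estacao"])["temp_ar"].mean().reset_index(name="average")
merged = df.merge(monthly, on=["mes", "estacao"], how="left")

print(merged)
```

The transform version is usually the shorter of the two; the merge version is handy when the per-group table is also needed on its own.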
The first attempt fails because df2['mes', 'estacao'] is not a valid way to select two columns: pandas looks for a single column labelled ('mes', 'estacao') and raises a KeyError. Grouping a single column by an external key is allowed, but the aggregated result has one row per group, so assigning it back as a column does not line up with the original rows.
Instead, group the DataFrame by both key columns, select the temperature column, and make sure the final output is a Series with the same index as the original DataFrame, for example by using transform:
df['new_column'] = df.groupby(['city_column', 'month_column'])['temp_column'].transform('mean')
This should do the trick if I understand your dataset correctly. If not, please provide a reproducible version of your df
I am trying to decile a sales column in my dataframe but also partition by year. So each year should have different deciles.
df = ['year','name', 'sales']
I think I can use this function but want to partition by year
df['decile']=pd.qcut(df['sales'],10,labels=False)
I suppose I can use groupby but I am not able to figure out the syntax.
Would really appreciate any help?
You can try:
df['decile'] = df.groupby('year')['sales'].transform(lambda g: pd.qcut(g.rank(method='first'), 10, labels=False) + 1)
Explanation:
g.rank(method='first'): handles the case where many sales have the same value. I added it because in my experiments I ran into many duplicated values, which can make qcut fail on non-unique bin edges. If duplicates are unlikely, you can pass g directly instead.
10, labels=False)+1: keep the +1 if you want the labels to run from 1 to 10. Without it, they run from 0 to 9.
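A small runnable sketch of the per-year decile idea, assuming ten invented sales figures per year (with duplicates in 2020) and a hypothetical helper named `decile`:

```python
import pandas as pd

# Hypothetical data: ten sales figures per year (values invented, some duplicated).
df = pd.DataFrame({
    "year": [2020] * 10 + [2021] * 10,
    "sales": [5, 5, 7, 1, 9, 3, 3, 8, 2, 6]
             + [50, 10, 40, 20, 30, 60, 70, 80, 90, 100],
})

def decile(sales):
    # Ranking first breaks ties, so the bin edges passed to qcut are unique.
    ranks = sales.rank(method="first")
    # labels=False yields 0..9; adding 1 shifts the labels to 1..10.
    return pd.qcut(ranks, 10, labels=False) + 1

# transform keeps the original row index, so the result can be assigned as a column.
df["decile"] = df.groupby("year")["sales"].transform(decile)
print(df)
```

Each year is binned on its own, so a sale of 9 in 2020 and a sale of 100 in 2021 both land in decile 10.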
Can anyone please explain how the code below works? My question is: if the variable y holds only the price, how is the last line able to group by doors? I can't follow the flow or debug it. Please let me know, as I am very new to this field.
import pandas as pd
df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')
y = df['Price']
y.groupby(df.Doors).mean()
import pandas as pd
df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')
y = df['Price']
print("The Doors")
print(df.Doors)
print("The Price")
print(y)
y.groupby(df.Doors).mean()
Run the code above and you will see it: the rows where df.Doors is 4, together with the prices at those same index positions in y, form one group whose mean is taken; the rows where df.Doors is 2 form the other group.
It works because y is a pandas series, in which the values are prices but also has the index that it had in the df. When you do df.Doors you get a series with different values, but the same indexes (since an index is for the whole row). By comparing the indexes, pandas can perform the group by.
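The index alignment can be checked on a tiny frame. This sketch uses invented values in place of the cars spreadsheet and compares the Series-based groupby with the more common DataFrame spelling; `by_series` and `by_frame` are names chosen for the example.

```python
import pandas as pd

# Hypothetical stand-in for the cars spreadsheet (values invented for illustration).
df = pd.DataFrame({
    "Price": [10000.0, 20000.0, 30000.0, 40000.0],
    "Doors": [2, 4, 2, 4],
})

y = df["Price"]

# y and df.Doors share the same row index, so each price is paired
# with the door count found at the same index label.
by_series = y.groupby(df.Doors).mean()

# Equivalent spelling: group the DataFrame, then select the column.
by_frame = df.groupby("Doors")["Price"].mean()

print(by_series)
```

Both spellings produce the same averages per door count.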
It loads the popular cars dataset into the dataframe df and assigns the Price column of the dataset to the variable y.
I would recommend you to get a general understanding of the data you loaded with the following commands:
df.info()
#shows you the range of the index as
#well as the data types of the columns
df.describe()
#shows common stats like mean or median
df.head()
#shows you the first 5 rows
The groupby command groups the rows (also called observations) of the cars dataframe df by the number of doors, and mean() then shows you the average price for cars with 2 doors, 4 doors, and so on.
Check the output by adding a print() around the last line of code
Edit: sorry, I answered too fast. I thought you asked for a general explanation of the code and not why it works.
I'm new to Pandas.
I've got a dataframe where I want to group by user and then find their lowest score up until that date in their speed column.
So I can't just use df.groupby(['user'])['speed'].transform('min') as this would give the min of all values, not just from the first row to the current one.
What can I use to get what I need?
Without seeing your dataset it's hard to help you directly. The problem boils down to the following: you need to select the range of data you want to work with (rows for the date range, columns for the user and speed).
That would look something like x = df.loc["2-4-2018":"2-4-2019", ['user', 'speed']] (assuming the dates are the index)
From there you could do a simple x['speed'].min() for the value or x['speed'].idxmin() for the index of the value.
I haven't played around with DataFrames for a while, but what you're looking for is how to slice DataFrames.
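An alternative to slicing by hand is a running minimum per group, which pandas exposes as cummin on a groupby. The sketch below assumes a `date` column exists (the question does not show one) and uses invented values.

```python
import pandas as pd

# Hypothetical data; 'user' and 'speed' follow the question, 'date' is an assumed column.
df = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "date": pd.to_datetime(
        ["2018-04-02", "2018-04-03", "2018-04-04", "2018-04-02", "2018-04-03"]
    ),
    "speed": [5.0, 3.0, 4.0, 9.0, 7.0],
})

# Sort so that "up until that date" means "earlier rows of the same user".
df = df.sort_values(["user", "date"])

# Running minimum of speed per user, from that user's first row to the current row.
df["min_speed_so_far"] = df.groupby("user")["speed"].cummin()
print(df)
```

For user a the third row keeps 3.0 as its running minimum even though the speed on that date is 4.0.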