I am confused about why a pandas groupby call can be written in both of the ways below and yield the same result. The specific code is not really the question; both give the same result. I would like someone to break down the syntax of both.
df.groupby(['gender'])['age'].mean()
df.groupby(['gender']).mean()['age']
The first reads as if you are calling the .mean() function on the age column specifically. The second looks like you are calling .mean() on the whole groupby object and selecting the age column afterwards. Are there runtime considerations?
It reads as if you are calling the .mean() function on the age column specifically. The second appears like you are calling .mean() on the whole groupby object and selecting the age column after?
This is exactly what's happening. df.groupby() returns a DataFrameGroupBy object, not a dataframe. Calling .mean() on it computes the mean of each column within each group, independently of the other columns, and returns a DataFrame with one row per group; indexing that result with ['age'] then gives a Series.
Selecting ['age'] first restricts the groupby object to a single column (a SeriesGroupBy), and .mean() then returns the per-group mean directly as a Series. If you know you only want the mean for a single column, it will be faster to isolate that first, rather than calculate the mean for every column (especially if you have a very large dataframe).
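To make the intermediate objects concrete, here is a minimal sketch that names each step of both expressions. The data is made up for illustration (the `height` column and all values are assumptions, not from the question).

```python
import pandas as pd

# Hypothetical data, invented for illustration only
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "age": [30, 40, 50, 20],
    "height": [160, 180, 170, 175],
})

grouped = df.groupby(["gender"])   # DataFrameGroupBy: nothing is computed yet
age_groups = grouped["age"]        # SeriesGroupBy restricted to the `age` column
first = age_groups.mean()          # Series of per-gender means

all_means = grouped.mean()         # DataFrame: per-gender mean of every other column
second = all_means["age"]          # Series picked out of that DataFrame afterwards

print(type(grouped).__name__, type(age_groups).__name__)
print(type(all_means).__name__, type(second).__name__)
print(first.equals(second))
```

Both paths end in the same Series; the second one simply computes the `height` means as well and then discards them.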
Think of groupby as a row-separation function. It groups all rows that share the same values in the key columns (as specified in the by parameter) into separate data frames.
After the groupby, you need an aggregate function to summarize data in each subframe. You can do that in a number of ways:
# In each subframe, take the `age` column and summarize it
# using the `mean` function
df.groupby(['gender'])['age'].mean()
# In each subframe, apply the `mean` function to all numeric
# columns then extract the `age` column
df.groupby(['gender']).mean()['age']
The first method is more efficient since you are applying the aggregate function (mean) on a single column.
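The "separate data frames" picture can be checked by iterating over the groupby object, which yields one (key, subframe) pair per group. This is only a sketch on made-up data; pandas does not literally materialise each subframe when it computes the mean, but the result is the same.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "age": [30, 40, 50, 20],
})

# Iterating over a groupby yields (group key, sub-DataFrame) pairs
subframes = {key: sub for key, sub in df.groupby("gender")}

# Summarise the `age` column of each subframe by hand
manual = {key: sub["age"].mean() for key, sub in subframes.items()}

# The one-liner from the question gives the same numbers
builtin = df.groupby("gender")["age"].mean().to_dict()
print(manual == builtin)
```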
Related
I have used the "groupby" method on my dataframe to find the total number of people at each location.
To the right of the "sum" column, I need to add a column that lists all of the people's names at each location (ideally in separate rows, but a list would be fine too).
Is there a way to "ungroup" my dataframe again after having found the sum?
dataframe.groupby(by=['location'], as_index=False)['people'].agg('sum')
You can do two different things:
(1) Create an aggregate DataFrame using groupby.agg and calling appropriate methods. The code below lists all names corresponding to a location:
out = dataframe.groupby(by=['location'], as_index=False).agg({'people':'sum', 'name':list})
(2) Use groupby.transform to add a new column to dataframe that has the sum of people by location in each row:
dataframe['sum'] = dataframe.groupby(by=['location'])['people'].transform('sum')
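A small sketch of how the two options differ in shape, using invented data (the column values below are assumptions; only the column names come from the question). Option (1) collapses to one row per location, while option (2) keeps every original row.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
dataframe = pd.DataFrame({
    "location": ["north", "north", "south"],
    "name": ["Ann", "Bob", "Cy"],
    "people": [2, 3, 4],
})

# (1) One row per location, with the names collected into a list
out = dataframe.groupby(by=["location"], as_index=False).agg(
    {"people": "sum", "name": list}
)

# (2) Same number of rows as the input, with the group total repeated
dataframe["sum"] = dataframe.groupby(by=["location"])["people"].transform("sum")

print(out)
print(dataframe)
```

If the names should end up in separate rows, option (2) already gives that layout, since each name stays on its own row next to its location's total.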
I think you are looking for transform:
dataframe.groupby(by=['location'], as_index=False)['people'].transform('sum')
In cases where I have a large number of columns that I want to sum, average, etc., is there a way to NOT change the column names, without having to use .alias on each column? The default is to add the function to the column name (e.g. col1 becomes "avg(col1)" after taking the average), is there an efficient way to have it stay named "col1"?
df = df.groupby(seg).avg('col1')
I use the groupby method to group data by month. The output is exactly what I wanted.
What I want to understand is why x displays only 3 columns (Quantity Ordered, Price Each and Sales) and drops the other columns in the dataset after I use the groupby method. Is it because the other data isn't numeric? Is it because I used the sum method along with the groupby method?
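Yes to both. sum is being applied as a numeric aggregation here, so pandas keeps only the columns that are numeric. Older versions of the documentation describe this as Automatic exclusion of “nuisance” columns. In pandas 2.0 and later this automatic dropping was removed, so you may need to pass numeric_only=True to sum to get the same output.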
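A minimal sketch of the behaviour, written with an explicit `numeric_only=True` so it does not depend on the pandas version's default. The data is invented; only the column names `Quantity Ordered` and `Price Each` echo the question, and `Month` and `Product` are assumed names.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
sales = pd.DataFrame({
    "Month": [1, 1, 2],
    "Product": ["cable", "phone", "cable"],   # non-numeric column
    "Quantity Ordered": [2, 1, 3],
    "Price Each": [5.0, 300.0, 5.0],
})

# Ask for numeric columns only; `Product` is left out of the result
x = sales.groupby("Month").sum(numeric_only=True)
print(x.columns.tolist())
```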
I just wanted to know what is the difference in the function performed by these 2.
Data:
import pandas as pd
df = pd.DataFrame({"ID":["A","B","A","C","A","A","C","B"], "value":[1,2,4,3,6,7,3,4]})
reset_index():
df_group1 = df.groupby("ID").sum().reset_index()
as_index=False:
df_group2 = df.groupby("ID", as_index=False).sum()
Both of them give the exact same output.
  ID  value
0  A     18
1  B      6
2  C      6
Can anyone tell me what is the difference and any example illustrating the same?
When you use as_index=False, you indicate to groupby() that you don't want to set the column ID as the index (duh!). When both implementations yield the same results, use as_index=False because it will save you some typing and an unnecessary pandas operation ;)
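As a quick check of the equivalence on the data from the question, the sketch below compares the two results and also shows where ID ends up when as_index is left at its default.

```python
import pandas as pd

df = pd.DataFrame({"ID": ["A", "B", "A", "C", "A", "A", "C", "B"],
                   "value": [1, 2, 4, 3, 6, 7, 3, 4]})

df_group1 = df.groupby("ID").sum().reset_index()     # ID is the index first, then moved back
df_group2 = df.groupby("ID", as_index=False).sum()   # ID never leaves the columns

# With the default as_index=True, ID becomes the index of the result
default_index_name = df.groupby("ID").sum().index.name

print(df_group1.equals(df_group2))
print(default_index_name)
```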
However, sometimes, you want to apply more complicated operations on your groups. In those occasions, you might find out that one is more suited than the other.
Example 1: You want to sum the values of three variables (i.e. columns) in a group on both axes.
Using as_index=True moves the group key out of the columns, so you can sum over axis=0 within each group and then over axis=1 without naming the value columns. When the operation is finished, you can use reset_index(drop=True/False) to get the dataframe into the right form.
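A sketch of Example 1 on invented data (the column names `g`, `a`, `b`, `c` and `total` are assumptions made for illustration). Because the key lives in the index, the row-wise sum only sees the value columns.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
df3 = pd.DataFrame({
    "g": ["x", "y", "x"],
    "a": [1, 2, 3],
    "b": [10, 20, 30],
    "c": [100, 200, 300],
})

# Sum within each group (axis 0), then across the three columns (axis 1)
per_group = df3.groupby("g").sum()       # `g` is the index, not a column
totals = per_group.sum(axis=1)           # no need to list a, b, c by name

# Bring the key back as a regular column
result = totals.reset_index(name="total")
print(result)
```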
Example 2: You need to set a value for the group based on the columns in the groupby().
Setting as_index=False allows you to check the condition on an ordinary column rather than on an index, which is often much easier.
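A sketch of Example 2 using the data from the question. The rule applied here (zeroing the total for group "A") is an arbitrary assumption chosen only to show a condition on the key column.

```python
import pandas as pd

df = pd.DataFrame({"ID": ["A", "B", "A", "C", "A", "A", "C", "B"],
                   "value": [1, 2, 4, 3, 6, 7, 3, 4]})

out = df.groupby("ID", as_index=False).sum()

# ID is an ordinary column, so a boolean mask on it works as usual
out.loc[out["ID"] == "A", "value"] = 0
print(out)
```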
At some point, you might come across a KeyError when applying operations on groups. In that case, it is often because you are trying to use a column in your aggregate function that is currently an index of your GroupBy object.
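One simple way to reproduce this kind of error, assuming the data from the question: after a default groupby the key is in the index, so looking it up as a column fails, while the as_index=False version works.

```python
import pandas as pd

df = pd.DataFrame({"ID": ["A", "B", "A", "C", "A", "A", "C", "B"],
                   "value": [1, 2, 4, 3, 6, 7, 3, 4]})

indexed = df.groupby("ID").sum()   # ID is now the index of the result
try:
    indexed["ID"]                  # column lookup: ID is not a column here
    raised = False
except KeyError:
    raised = True

flat = df.groupby("ID", as_index=False).sum()
ids = flat["ID"].tolist()          # works: ID stayed a column

print(raised, ids)
```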
I have a function that takes as one of its inputs a dataframe, which is indexed by date. How can I run the function only on a subset of the dataframe (say, from 2005-2010)? I don't think I can just drop the rest of the rows from the dataframe because part of the function keeps track of a rolling average, and thus the first few rows would depend on dates I am not considering.
You can subset the dataframe with the rows that you need:
df.iloc[0:5] ## Gives only 5 rows
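Then, you can run the function on these rows. Note that .apply calls my_function once per column; if my_function expects the whole dataframe, call my_function(df.iloc[0:5]) instead: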
df.iloc[0:5].apply(my_function)
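For the rolling-average concern in the question, one possible approach is to run the function on the full dataframe and slice the result by date afterwards, so the first rows of the window still see earlier data. The sketch below assumes the function returns one row per input row; `my_function` here is a stand-in with an arbitrary 3-period window, and the monthly data is invented.

```python
import pandas as pd

# Hypothetical monthly data that extends beyond the period of interest
dates = pd.date_range("2004-01-01", "2011-12-01", freq="MS")
df = pd.DataFrame({"value": range(len(dates))}, index=dates)

def my_function(frame):
    # Stand-in for the real function: adds a 3-period rolling average
    out = frame.copy()
    out["rolling_avg"] = out["value"].rolling(3).mean()
    return out

# Run on everything, then keep 2005-2010 using date-based label slicing
result = my_function(df).loc["2005":"2010"]

# For contrast: slicing first leaves the first two averages undefined
naive = my_function(df.loc["2005":"2010"])

print(result["rolling_avg"].isna().sum(), naive["rolling_avg"].isna().sum())
```

Slicing with .loc and year strings relies on the index being a sorted DatetimeIndex, which is an assumption about the dataframe in the question.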