Pandas Groupby: Filling index names for all rows [duplicate] - python

I want to understand the difference between what these two approaches do.
Data:
import pandas as pd
df = pd.DataFrame({"ID":["A","B","A","C","A","A","C","B"], "value":[1,2,4,3,6,7,3,4]})
reset_index():
df_group1 = df.groupby("ID").sum().reset_index()
as_index=False:
df_group2 = df.groupby("ID", as_index=False).sum()
Both of them give the exact same output.
  ID  value
0  A     18
1  B      6
2  C      6
Can anyone explain the difference, ideally with an example that illustrates it?

When you use as_index=False, you tell groupby() not to set the ID column as the index of the result. When both implementations yield the same result, use as_index=False because it saves you some typing and an unnecessary pandas operation ;)
However, sometimes you want to apply more complicated operations to your groups. On those occasions, you might find that one is better suited than the other.
Example 1: You want to sum the values of three variables (i.e. columns) in a group on both axes.
Using as_index=True lets you apply a sum over axis=1 without naming the columns, because the grouping key sits in the index rather than among the columns, and then sum the values over axis 0. When the operation is finished, you can call reset_index() (with drop=True or drop=False as needed) to get the dataframe into the right shape.
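A minimal sketch of Example 1, assuming a hypothetical frame named scores with three value columns x, y and z (none of these names come from the question). It sums within each group first and then across the columns, which gives the same totals as the order described above:

```python
import pandas as pd

# Hypothetical data: one key column and three value columns.
scores = pd.DataFrame({
    "ID": ["A", "B", "A"],
    "x": [1, 2, 3],
    "y": [10, 20, 30],
    "z": [100, 200, 300],
})

# With the default as_index=True, ID becomes the index,
# so only x, y and z remain as columns.
per_group = scores.groupby("ID").sum()

# Sum across the value columns without naming them.
totals = per_group.sum(axis=1)

# Move ID back into a regular column once the arithmetic is done.
result = totals.reset_index(name="total")
print(result)
```

Because ID is in the index during the axis=1 sum, it is never mixed in with the numeric columns.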
Example 2: You need to set a value for the group based on the columns in the groupby().
Setting as_index=False allows you to check the condition on an ordinary column and not on an index, which is often much easier.
At some point, you might come across a KeyError when applying operations to groups. In that case, it is often because you are trying to use a column that has become the index of the grouped result and is therefore no longer available as a column.
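To make the difference concrete, here is a small sketch using the data from the question. The variable names indexed and flat are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({"ID": ["A", "B", "A", "C", "A", "A", "C", "B"],
                   "value": [1, 2, 4, 3, 6, 7, 3, 4]})

indexed = df.groupby("ID").sum()               # ID becomes the index
flat = df.groupby("ID", as_index=False).sum()  # ID stays a column

print(list(indexed.columns))  # ['value']
print(list(flat.columns))     # ['ID', 'value']

# ID is the index of `indexed`, so selecting it as a column fails.
try:
    indexed["ID"]
    key_error_raised = False
except KeyError:
    key_error_raised = True
print(key_error_raised)  # True

# After reset_index() the two results hold the same rows.
print(indexed.reset_index().values.tolist() == flat.values.tolist())  # True
```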

Related

Pandas Groupby Syntax explanation

I am confused about why a pandas groupby call can be written in both of the ways below and yield the same result. The specific code is not really the question, since both give the same result. I would like someone to break down the syntax of both.
df.groupby(['gender'])['age'].mean()
df.groupby(['gender']).mean()['age']
In the first instance, it reads as if you are calling the .mean() function on the age column specifically. The second appears as if you are calling .mean() on the whole groupby object and selecting the age column afterwards. Are there runtime considerations?
Quoting the question: "It reads as if you are calling the .mean() function on the age column specifically. The second appears like you are calling .mean() on the whole groupby object and selecting the age column after?"
This is exactly what's happening. df.groupby() returns a GroupBy object, not a dataframe. Calling .mean() on it computes the mean of each column independently within each group and returns a dataframe with one row per group, from which you can then select the age column.
Selecting the column first produces a grouped Series, and the mean is then calculated for that column only. If you know you only want the mean of a single column, it will be faster to isolate it first rather than calculate the mean for every column (especially if you have a very large dataframe).
Think of groupby as a row-separation function. It groups all rows that share the same values (in the columns given by the by parameter) into separate subframes.
After the groupby, you need an aggregate function to summarize data in each subframe. You can do that in a number of ways:
# In each subframe, take the `age` column and summarize it
# using the `mean` function
df.groupby(['gender'])['age'].mean()
# In each subframe, apply the `mean` function to all numeric
# columns then extract the `age` column
df.groupby(['gender']).mean()['age']
The first method is more efficient since you are applying the aggregate function (mean) on a single column.
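A short sketch of the two spellings side by side, assuming a hypothetical frame named people with gender, age and height columns. Passing numeric_only=True in the second form is a precaution for frames that also hold text columns:

```python
import pandas as pd

# Hypothetical data; only the column names gender and age come from the question.
people = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "age": [30, 40, 50, 20],
    "height": [160.0, 180.0, 170.0, 175.0],
})

grouped = people.groupby("gender")
print(type(grouped).__name__)         # DataFrameGroupBy
print(type(grouped["age"]).__name__)  # SeriesGroupBy

# Select the column first, then aggregate only that column.
first = grouped["age"].mean()

# Aggregate every numeric column, then pick one out of the result.
second = grouped.mean(numeric_only=True)["age"]

print(first.tolist())   # [40.0, 30.0]
print(second.tolist())  # [40.0, 30.0]
```

Both produce a Series indexed by gender; the second simply does extra work for the height column before discarding it.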

How does pandas know to only group and show these 3 columns?

I use the groupby method to group data by month. The output is exactly what I wanted.
What I want to understand is how x displays only 3 columns (Quantity Ordered, Price Each and Sales) and rejects the other columns in the dataset after I use the groupby method. Is it because the other data isn't numeric? Is it because I used the sum method along with the groupby method?
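Since sum is used as a numeric aggregation here, pandas applied it only to the numeric columns and dropped the rest. The documentation describes this as automatic exclusion of “nuisance” columns. Note that this automatic behavior was deprecated in later pandas releases; passing numeric_only=True to sum() requests it explicitly.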
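A minimal sketch of the explicit form, assuming a hypothetical orders frame. Only the column names Quantity Ordered, Price Each and Sales come from the question; Month and Product are made up for illustration:

```python
import pandas as pd

# Hypothetical data standing in for the dataset in the question.
orders = pd.DataFrame({
    "Month": [1, 1, 2],
    "Product": ["Cable", "Phone", "Cable"],
    "Quantity Ordered": [2, 1, 3],
    "Price Each": [10.0, 600.0, 10.0],
})
orders["Sales"] = orders["Quantity Ordered"] * orders["Price Each"]

# numeric_only=True keeps only the numeric columns in the result,
# so the text column Product is left out.
x = orders.groupby("Month").sum(numeric_only=True)
print(list(x.columns))  # ['Quantity Ordered', 'Price Each', 'Sales']
```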

Difference between `print(df['A'][2])` and `print(df.loc[2, 'A'])` [duplicate]

I've noticed three methods of selecting a column in a Pandas DataFrame:
First method of selecting a column using loc:
df_new = df.loc[:, 'col1']
Second method - seems simpler and faster:
df_new = df['col1']
Third method - most convenient:
df_new = df.col1
Is there a difference between these three methods? I don't think so, in which case I'd rather use the third method.
I'm mostly curious as to why there appear to be three methods for doing the same thing.
In the following situations, they behave the same:
Selecting a single column (df['A'] is the same as df.loc[:, 'A'] -> selects column A)
Selecting a list of columns (df[['A', 'B', 'C']] is the same as df.loc[:, ['A', 'B', 'C']] -> selects columns A, B and C)
Slicing by rows (df[1:3] is the same as df.iloc[1:3] -> selects rows 1 and 2. Note, however, that if you slice rows with loc instead of iloc, you'll get rows 1, 2 and 3, assuming you have a RangeIndex, because loc slices include the end label.)
However, [] does not work in the following situations:
You can select a single row with df.loc[row_label]
You can select a list of rows with df.loc[[row_label1, row_label2]]
You can slice columns with df.loc[:, 'A':'C']
These three cannot be done with [].
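The cases listed above can be checked on a small frame. This is a sketch with a hypothetical three-column DataFrame and a default RangeIndex:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]}
)

# Cases where [] and loc/iloc agree.
same_column = df["A"].equals(df.loc[:, "A"])
same_list = df[["A", "B"]].equals(df.loc[:, ["A", "B"]])
same_rows = df[1:3].equals(df.iloc[1:3])
print(same_column, same_list, same_rows)  # True True True

# Positional slicing excludes the end, label slicing includes it.
pos_rows = df[1:3].index.tolist()        # [1, 2]
label_rows = df.loc[1:3].index.tolist()  # [1, 2, 3]

# Selections that need loc.
one_row = df.loc[2]                               # a single row as a Series
two_rows = df.loc[[0, 2]]                         # a list of rows
col_slice = df.loc[:, "A":"B"].columns.tolist()   # ['A', 'B']
```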
More importantly, if your selection involves both rows and columns, then assignment becomes problematic.
df[1:3]['A'] = 5
This selects rows 1 and 2, then selects column 'A' of the returned object and assigns the value 5 to it. The problem is that the returned object might be a copy, so the assignment may not change the actual DataFrame. Pandas raises SettingWithCopyWarning in this case. The correct way to make this assignment is with a single .loc call (note that .loc slices include the end label, so rows 1 and 2 are written as 1:2):
df.loc[1:2, 'A'] = 5
With .loc, you are guaranteed to modify the original DataFrame. It also allows you to slice columns (df.loc[:, 'C':'F']), select a single row (df.loc[5]), and select a list of rows (df.loc[[1, 2, 5]]).
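As an illustration, here is a sketch of assigning to rows and a column in one indexing step, using a hypothetical frame with a default RangeIndex. The iloc variant is included as the positional counterpart and is not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 0, 0, 0, 0], "B": [1, 1, 1, 1, 1]})

# Label-based: rows 1 and 2 (the end label is included), column A.
df.loc[1:2, "A"] = 5
print(df["A"].tolist())  # [0, 5, 5, 0, 0]

# Position-based: rows 3 and 4 (the end position is excluded), column B.
df.iloc[3:5, df.columns.get_loc("B")] = 9
print(df["B"].tolist())  # [1, 1, 1, 9, 9]
```

Each assignment is a single indexing operation on df itself, so there is no intermediate object that might be a copy.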
Also note that these two were not added to the API at the same time. .loc was added much later as a more powerful and explicit indexer.
Note: Getting columns with [] vs . is a completely different topic. The . form is only there for convenience. It only allows accessing columns whose names are valid Python identifiers (i.e. they cannot contain spaces, they cannot start with a digit...). It cannot be used when the name conflicts with a Series/DataFrame method or attribute. It also cannot be used to create columns (i.e. the assignment df.a = 1 sets a plain attribute instead of adding a column if there is no column a). Other than that, . and [] are the same.
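A small sketch of these limits, using hypothetical column names chosen to trigger each case:

```python
import pandas as pd

df = pd.DataFrame({
    "price": [1.0, 2.0],
    "unit cost": [0.5, 0.7],  # contains a space
    "count": [3, 4],          # same name as the DataFrame.count method
})

# A valid identifier with no name clash: both forms return the column.
same = df.price.equals(df["price"])

# A name with a space is only reachable through brackets.
unit_cost = df["unit cost"].tolist()

# Attribute access finds the method, not the column.
count_is_method = callable(df.count)
count_column = df["count"].tolist()

# Assigning a scalar through attribute access does not create a column.
df.total = 1
total_is_column = "total" in df.columns

print(same, count_is_method, total_is_column)  # True True False
```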
loc is especially useful when the index is not numeric (e.g. a DatetimeIndex) because you can get rows with particular labels from the index:
df.loc['2010-05-04 07:00:00']
df.loc['2010-1-1 0:00:00':'2010-12-31 23:59:59','Price']
However, [] is intended to get columns with particular names:
df['Price']
With [] you can also filter rows, but it is more verbose:
df[df['Date'] < datetime.datetime(2010,1,1,7,0,0)]['Price']
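A self-contained sketch of the same ideas, assuming a hypothetical prices frame with a DatetimeIndex and a Price column. Here the dates live in the index rather than in a Date column, so the boolean filter compares against the index:

```python
import pandas as pd

# Hypothetical price data indexed by timestamp.
idx = pd.to_datetime([
    "2010-05-04 06:00:00",
    "2010-05-04 07:00:00",
    "2010-05-04 08:00:00",
    "2011-01-01 00:00:00",
])
prices = pd.DataFrame({"Price": [10.0, 11.0, 12.0, 13.0]}, index=idx)

# One row and one column, selected by label.
at_seven = prices.loc[pd.Timestamp("2010-05-04 07:00:00"), "Price"]

# A label slice over a date range; both ends are inclusive.
in_2010 = prices.loc["2010-01-01":"2010-12-31", "Price"]

# The same kind of row filter with [] and a boolean mask.
early = prices[prices.index < pd.Timestamp("2010-05-04 07:30:00")]["Price"]

print(at_seven)          # 11.0
print(in_2010.tolist())  # [10.0, 11.0, 12.0]
print(early.tolist())    # [10.0, 11.0]
```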
If you're unsure which of these approaches is the recommended one for your use case, take a look at these brief instructions from the pandas tutorial:
When selecting subsets of data, square brackets [] are used.
Inside these brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.
Select specific rows and/or columns using loc when using the row and column names.
Select specific rows and/or columns using iloc when using the positions in the table.
You can assign new values to a selection based on loc/iloc.
I highlighted some of the points to make their use-case differences even more clear.
There seems to be a difference between df.loc[] and df[] when you create multiple columns in a dataframe.
You can refer to this question:
Is there a nice way to generate multiple columns using .loc?
Here, you can't generate multiple columns using df.loc[:,['name1','name2']], but you can by using the double-bracket form df[['name1','name2']]. (I wonder why they behave differently.)

