I use the groupby method to group data by month. The output is exactly what I wanted.
What I want to understand is how it displays only 3 columns (Quantity Ordered, Price Each and Sales) and drops the other columns in the dataset after I use the groupby method. Is it because the other data isn't numeric? Is it because I used the sum method along with the groupby method?
Since sum is a numeric function, pandas applies it only to the columns that are numeric. This is described in the documentation as automatic exclusion of “nuisance” columns.
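A minimal sketch of this behaviour, using made-up data mirroring the question's column names. Note that recent pandas versions no longer silently drop non-numeric columns; you must pass numeric_only=True to get the same effect:

```python
import pandas as pd

# Hypothetical sales data resembling the question's dataset
df = pd.DataFrame({
    "Month": ["Jan", "Jan", "Feb"],
    "Product": ["USB Cable", "Monitor", "USB Cable"],  # non-numeric "nuisance" column
    "Quantity Ordered": [2, 1, 3],
    "Price Each": [11.95, 149.99, 11.95],
})
df["Sales"] = df["Quantity Ordered"] * df["Price Each"]

# Only the numeric columns survive the aggregation; "Product" is excluded.
result = df.groupby("Month").sum(numeric_only=True)
print(result.columns.tolist())  # ['Quantity Ordered', 'Price Each', 'Sales']
```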
Related
I am confused why a pandas groupby operation can be written in both of the ways below and yield the same result. The specific code is not really the question; both give the same result. I would like someone to break down the syntax of both.
df.groupby(['gender'])['age'].mean()
df.groupby(['gender']).mean()['age']
The first instance reads as if you are calling the .mean() function on the age column specifically. The second appears like you are calling .mean() on the whole groupby object and selecting the age column after. Are there runtime considerations?
It reads as if you are calling the .mean() function on the age column specifically. The second appears like you are calling .mean() on the whole groupby object and selecting the age column after?
This is exactly what's happening. df.groupby() returns a DataFrameGroupBy object. The .mean() method is applied column-wise by default, so the mean of each column is calculated independently of the other columns; when run on the full groupby object, the results come back as a DataFrame, which can then be indexed to pull out a single column as a Series.
Reversing the order isolates a single column first and then calculates the mean on it alone. If you know you only want the mean for a single column, it will be faster to isolate that column first rather than calculate the mean for every column (especially if you have a very large dataframe).
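A small sketch with made-up data showing that the two forms produce identical results, differing only in when the column is selected:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "age": [25, 30, 35, 40],
    "score": [1.0, 2.0, 3.0, 4.0],
})

# Select the column first, then aggregate only it...
a = df.groupby(["gender"])["age"].mean()
# ...or aggregate every column, then select one from the result.
b = df.groupby(["gender"]).mean()["age"]

print(a.equals(b))  # True
```

The first form never computes the mean of `score`, which is where the performance difference comes from on wide dataframes.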
Think of groupby as a row-separation operation. It groups all rows having the same attributes (as specified in the by parameter) into separate sub-frames.
After the groupby, you need an aggregate function to summarize data in each subframe. You can do that in a number of ways:
# In each subframe, take the `age` column and summarize it
# using the `mean` function
df.groupby(['gender'])['age'].mean()
# In each subframe, apply the `mean` function to all numeric
# columns then extract the `age` column
df.groupby(['gender']).mean()['age']
The first method is more efficient since you are applying the aggregate function (mean) on a single column.
I just wanted to know what the difference is between the operations performed by these two.
Data:
import pandas as pd
df = pd.DataFrame({"ID":["A","B","A","C","A","A","C","B"], "value":[1,2,4,3,6,7,3,4]})
reset_index() :
df_group1 = df.groupby("ID").sum().reset_index()
as_index=False :
df_group2 = df.groupby("ID", as_index=False).sum()
Both of them give the exact same output.
ID value
0 A 18
1 B 6
2 C 6
Can anyone tell me what is the difference and any example illustrating the same?
When you use as_index=False, you tell groupby() not to set the column ID as the index (duh!). When both implementations yield the same results, use as_index=False, because it will save you some typing and an unnecessary pandas operation ;)
However, sometimes, you want to apply more complicated operations on your groups. In those occasions, you might find out that one is more suited than the other.
Example 1: You want to sum the values of three variables (i.e. columns) in a group on both axes.
Using as_index=True allows you to apply a sum over axis=1 without specifying the names of the columns, then sum the values over axis 0. When the operation is finished, you can use reset_index(drop=True/False) to get the dataframe into the right form.
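A sketch of Example 1 with hypothetical columns a, b, c: because as_index=True leaves only the value columns in the aggregated frame, the row-wise sum needs no column names.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["x", "x", "y"],
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "c": [7, 8, 9],
})

# as_index=True (the default): "group" becomes the index, so the
# axis=1 sum runs over all remaining value columns automatically.
totals = df.groupby("group").sum().sum(axis=1)
print(totals.to_dict())  # {'x': 27, 'y': 18}
```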
Example 2: You need to set a value for the group based on the columns in the groupby().
Setting as_index=False allows you to check the condition on a regular column rather than on an index, which is often much easier.
At some point, you might come across a KeyError when applying operations on groups. In that case, it is often because you are trying to use, in your aggregate function, a column that is currently an index of your GroupBy object.
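A minimal illustration of that pitfall, reusing the small ID/value frame from above: with the default as_index=True the grouping key stops being a column, so later code that treats it as one fails.

```python
import pandas as pd

df = pd.DataFrame({"ID": ["A", "B", "A"], "value": [1, 2, 4]})

# Default as_index=True: "ID" becomes the index of the result,
# so it is no longer available as a column.
grouped = df.groupby("ID").sum()
print("ID" in grouped.columns)  # False

# With as_index=False, "ID" stays a regular column.
flat = df.groupby("ID", as_index=False).sum()
print("ID" in flat.columns)  # True
```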
I have a pandas DataFrame and I'm doing a groupby on two columns, with a couple of aggregate functions on a third column. Here is what my code looks like:
df2 = df[[X, Y, Z]].groupby([X, Y]).agg([np.mean, np.max, np.min]).reset_index()
It computes the aggregate functions on the column Z.
I need to sort by, let's say, the min column (i.e. sort_values('min')), but it keeps complaining that the 'min' column does not exist. How can I do that?
Since you are generating a pd.MultiIndex in the columns, you must pass a tuple to sort_values.
Try:
df2.sort_values(('Z','amin'))
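A runnable sketch with made-up data. Note that the 'amin' label in the answer comes from passing np.min to agg; using the string aggregator names instead (as below) yields the plainer labels 'mean', 'max' and 'min':

```python
import pandas as pd

df = pd.DataFrame({
    "X": ["a", "a", "b"],
    "Y": ["p", "q", "p"],
    "Z": [3.0, 1.0, 2.0],
})

df2 = df[["X", "Y", "Z"]].groupby(["X", "Y"]).agg(["mean", "max", "min"]).reset_index()
# The aggregated columns form a MultiIndex:
# ('Z', 'mean'), ('Z', 'max'), ('Z', 'min')
df2 = df2.sort_values(("Z", "min"))
print(df2[("Z", "min")].tolist())  # [1.0, 2.0, 3.0]
```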
Consider this case:
Python pandas equivalent to R groupby mutate
In dplyr:
df = df %>% group_by(a, b) %>%
means the dataframe is grouped first by column a and then by b.
In my case I am trying to group my data first by the group_name column, then by user_name, then by type_of_work. There are more than three columns (which is why I got confused), but I need the data grouped according to these three headers, in this order. I already have an algorithm to work with the columns after this stage; I only need a way to create a dataframe grouped according to these three columns.
It is important in my case that the sequence is preserved like the dplyr function.
Do we have anything similar for a pandas dataframe?
Grouped = df.groupby(['a', 'b'])
Read more on "split-apply-combine" strategy in the pandas docs to see how pandas deals with these issues compared to R.
From your comment it seems you want to work with the grouped frames. You can either use the GroupBy object through its API, e.g. grouped.mean(), or you can iterate through the GroupBy object; you will get a name and a group in each loop iteration.
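A sketch with hypothetical data using the question's column names. Passing sort=False preserves the order in which groups first appear, which is the closest analogue to dplyr's behaviour on an already-ordered frame:

```python
import pandas as pd

df = pd.DataFrame({
    "group_name": ["g2", "g1", "g1"],
    "user_name": ["u1", "u2", "u1"],
    "type_of_work": ["t1", "t1", "t2"],
    "hours": [5, 3, 4],
})

# sort=False keeps groups in first-appearance order instead of
# sorting the keys alphabetically.
grouped = df.groupby(["group_name", "user_name", "type_of_work"], sort=False)
for name, group in grouped:
    print(name, len(group))
```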
I want to apply a groupby on a pandas dataframe. I want to group by three columns and count the rows in each group. I used the following code:
data.groupby(['post_product_list','cust_visid','date_time']).count()
But it didn't seem to work.
data.groupby(['post_product_list','cust_visid','date_time']).size()
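A short sketch with invented data illustrating why size() is the right call here: count() tallies non-NaN values per remaining column, whereas size() returns one row count per group.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "post_product_list": ["p1", "p1", "p2"],
    "cust_visid": [1, 1, 2],
    "date_time": ["t1", "t1", "t2"],
    "revenue": [10.0, np.nan, 5.0],  # hypothetical extra column with a NaN
})

# .size(): one number per group, counting every row (NaN included).
sizes = data.groupby(["post_product_list", "cust_visid", "date_time"]).size()
print(sizes.tolist())  # [2, 1]

# .count(): counts non-NaN values per column, so the NaN row is skipped,
# which is why it "didn't seem to work" as a plain row count.
counts = data.groupby(["post_product_list", "cust_visid", "date_time"]).count()
print(counts["revenue"].tolist())  # [1, 1]
```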