When using pandas groupby functions and manipulating the output after the groupby, I've noticed that some functions behave differently in terms of what is returned as the index and how this can be manipulated.
Say we have a dataframe with the following information:
Name Type ID
0 Book1 ebook 1
1 Book2 paper 2
2 Book3 paper 3
3 Book1 ebook 1
4 Book2 paper 2
if we do
df.groupby(["Name", "Type"]).sum()
we get a DataFrame:
ID
Name Type
Book1 ebook 2
Book2 paper 4
Book3 paper 3
which contains a MultiIndex with the columns used in the groupby:
MultiIndex([('Book1', 'ebook'),
('Book2', 'paper'),
('Book3', 'paper')],
names=['Name', 'Type'])
and one column called ID.
but if I apply a size() function, the result is a Series:
Name Type
Book1 ebook 2
Book2 paper 2
Book3 paper 1
dtype: int64
And finally, if I do a pct_change(), we get only the resulting DataFrame column:
ID
0 NaN
1 NaN
2 NaN
3 0.0
4 0.0
TL;DR: I want to know why some functions return a Series while others return a DataFrame, as this confused me when dealing with different operations within the same DataFrame.
From the documentation:
Size:
Returns
Series
Number of rows in each group.
For sum, since you did not select a column before aggregating, it returns a DataFrame with the groupby keys as the index. Selecting the column first gives a Series instead:
df.groupby(["Name", "Type"])['ID'].sum() # return Series
Functions like diff and pct_change are not aggregations, so they return values with the same index as the original dataframe; count, mean and sum are aggregations, so they return values with the groupby keys as the index.
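For example (a quick sketch on the same df), pct_change on the selected column comes back indexed like the original rows, not by the group keys:
df.groupby(["Name", "Type"])['ID'].pct_change() # return Series with the original index
#0    NaN
#1    NaN
#2    NaN
#3    0.0
#4    0.0
#dtype: float64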
The outputs are different because the aggregations are different, and those are what mostly control what is returned. Think of the array equivalent. The data are the same, but one "aggregation" returns a single scalar value while the other returns an array the same size as the input:
import numpy as np
np.array([1,2,3]).sum()
#6
np.array([1,2,3]).cumsum()
#array([1, 3, 6], dtype=int32)
The same thing goes for aggregations of a DataFrameGroupBy object. All the first part of the groupby does is create a mapping from the DataFrame to the groups. Since this doesn't really do anything there's no reason why the same groupby with a different operation needs to return the same type of output (see above).
gp = df.groupby(["Name", "Type"])
# Haven't done any aggregations yet...
The other important part here is that we have a DataFrameGroupBy object. There are also SeriesGroupBy objects, and that difference can change the return.
gp
#<pandas.core.groupby.generic.DataFrameGroupBy object>
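Selecting a single column first gives the SeriesGroupBy counterpart:
df.groupby(["Name", "Type"])['ID']
#<pandas.core.groupby.generic.SeriesGroupBy object>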
So what happens when you aggregate?
With a DataFrameGroupBy when you choose an aggregation (like sum) that collapses to a single value per group the return will be a DataFrame where the indices are the unique grouping keys. The return is a DataFrame because we provided a DataFrameGroupBy object. DataFrames can have multiple columns and had there been another numeric column it would have aggregated that too, necessitating the DataFrame output.
gp.sum()
# ID
#Name Type
#Book1 ebook 2
#Book2 paper 4
#Book3 paper 3
On the other hand if you use a SeriesGroupBy object (select a single column with []) then you'll get a Series back, again with the index of unique group keys.
df.groupby(["Name", "Type"])['ID'].sum()
|------- SeriesGroupBy ----------|
#Name Type
#Book1 ebook 2
#Book2 paper 4
#Book3 paper 3
#Name: ID, dtype: int64
For aggregations that return arrays (like cumsum, pct_change) a DataFrameGroupBy will return a DataFrame and a SeriesGroupBy will return a Series. But the index is no longer the unique group keys. This is because that would make little sense; typically you'd want to do a calculation within the group and then assign the result back to the original DataFrame. As a result the return is indexed like the original DataFrame you provided for aggregation. This makes creating these columns very simple, as pandas handles all of the alignment:
df['ID_pct_change'] = gp.pct_change()
# Name Type ID ID_pct_change
#0 Book1 ebook 1 NaN
#1 Book2 paper 2 NaN
#2 Book3 paper 3 NaN
#3 Book1 ebook 1 0.0 # Calculated from row 0 and aligned.
#4 Book2 paper 2 0.0
But what about size? That one is a bit weird. The size of a group is a scalar. It doesn't matter how many columns the group has or whether values in those columns are missing, so sending it a DataFrameGroupBy or SeriesGroupBy object is irrelevant. As a result pandas will always return a Series. Again being a group level aggregation that returns a scalar it makes sense to have the return indexed by the unique group keys.
gp.size()
#Name Type
#Book1 ebook 2
#Book2 paper 2
#Book3 paper 1
#dtype: int64
Finally, for completeness: though aggregations like sum return a single scalar value per group, it can often be useful to bring those values back to every row for that group in the original DataFrame. However the return of a normal .sum has a different index, so it won't align. You could merge the values back on the unique keys, but pandas provides the ability to transform these aggregations. Since the intent here is to bring it back to the original DataFrame, the Series/DataFrame is indexed like the original input:
gp.transform('sum')
# ID
#0 2 # Row 0 is Book1 ebook which has a group sum of 2
#1 4
#2 3
#3 2 # Row 3 is also Book1 ebook which has a group sum of 2
#4 4
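As a small usage sketch (the ID_group_sum column name is just illustrative), the transform result can be assigned straight back because it already aligns with df's index:
df['ID_group_sum'] = gp['ID'].transform('sum')
# Name Type ID ID_pct_change ID_group_sum
#0 Book1 ebook 1 NaN 2
#1 Book2 paper 2 NaN 4
#2 Book3 paper 3 NaN 3
#3 Book1 ebook 1 0.0 2
#4 Book2 paper 2 0.0 4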
Related
I'm working with pandas on a dataset and I want to find, for each player, the attribute value with the most victories in the matches. I am able to create a dataframe with the groupby function.
For example, for the attribute "surface", which can have 3 or more alternatives, I have this dataframe:
Now I would like to have a output dataframe like:
fullname best_surface
Zuzana Zlochova Hard
Zuzanna Bednarz Clay
....
I managed to solve this with some merges for attributes that can have just two values, but it doesn't work with attributes that can have 3 or more values.
The dataset is big so I have to work with pandas operations, I cannot use iters.
Thank you
Use DataFrameGroupBy.idxmax to get the index of the first maximal hasWon value per group, select those rows, and convert the MultiIndex to a DataFrame with MultiIndex.to_frame:
df = df.loc[df.groupby(level='fullname')['hasWon'].idxmax()].index.to_frame(index=False)
print (df)
fullname surface
0 Zuzana Zlochova Hard
1 Zuzanna Bednarz Clay
2 Zuzanna Szczepanska Clay
3 Zvonimir Oreskovic Hard
Or convert the tuples to a DataFrame in the constructor:
df = pd.DataFrame(df.groupby('fullname')['hasWon'].idxmax().tolist(),
                  columns=['fullname','best_surface'])
print (df)
fullname best_surface
0 Zuzana Zlochova Hard
1 Zuzanna Bednarz Clay
2 Zuzanna Szczepanska Clay
3 Zvonimir Oreskovic Hard
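For reference, a minimal reproducible sketch; the original frame isn't shown above, so the win counts below are made up and only the shape (a fullname/surface MultiIndex with a hasWon column) is assumed:
import pandas as pd

df = pd.DataFrame(
    {'hasWon': [10, 3, 7, 2, 1, 5]},
    index=pd.MultiIndex.from_tuples(
        [('Zuzana Zlochova', 'Hard'), ('Zuzana Zlochova', 'Clay'),
         ('Zuzanna Bednarz', 'Clay'), ('Zuzanna Bednarz', 'Hard'),
         ('Zvonimir Oreskovic', 'Clay'), ('Zvonimir Oreskovic', 'Hard')],
        names=['fullname', 'surface']))

best = df.groupby(level='fullname')['hasWon'].idxmax()  # (fullname, surface) tuple per group
print (df.loc[best].index.to_frame(index=False))
#             fullname surface
#0     Zuzana Zlochova    Hard
#1     Zuzanna Bednarz    Clay
#2  Zvonimir Oreskovic    Hard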
I have a dataframe with dimensions [1,126], where each column corresponds to a specific economic variable and these economic variables fall into one of 8 groups like Output, Labor, Housing etc. I have a separate dataframe where this group allocation is described.
Is it possible to aggregate the values of initial dataframe into a new [1,8] array according to the groups? I have no prior knowledge on the number of variables belonging to each group.
Here is the code for replication on a smaller scale:
import pandas as pd

data = {'RPI':[1], 'IP':[1], 'Labor1':[2], 'Labor2':[2], 'Housing1':[3], 'Housing2':[3]}
df = pd.DataFrame(data)
groups = {'Description':['RPI','IP','Labor1','Labor2','Housing1','Housing2'],
          'Groups':['Real','Real','Labor','Labor','Housing','Housing']}
groups = pd.DataFrame(groups)
The final version should look something like this:
aggregate = {'Real':[2],'Labor':[4],'Housing':[6]}
aggregate = pd.DataFrame(aggregate)
You can merge the groups onto the descriptions, then groupby and sum.
(df.T
   .rename({0: 'value'}, axis=1)                            # the single row becomes a 'value' column
   .merge(groups, left_index=True, right_on='Description')  # attach each variable's group
   .groupby('Groups')['value'].sum())
returns
Groups
Housing 6
Labor 4
Real 2
Name: value, dtype: int64
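An alternative sketch (assuming the Description values match df's column names exactly) maps the columns to their groups directly, which avoids the merge:
mapping = groups.set_index('Description')['Groups']  # column name -> group label
aggregate = df.T.groupby(mapping).sum().T             # one column per group
print (aggregate)
#   Housing  Labor  Real
#0        6      4     2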
I am attempting to generate a dataframe (or series) based on another dataframe, selecting a different column from the first frame dependent on the row using another series. In the simplified example below, I want the frame1 values from 'a' for the first three rows, and 'b' for the final two (the picked_values series).
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.random.randn(10).reshape(5, 2), index=range(5), columns=['a', 'b'])
picked_values = pd.Series(['a', 'a', 'a', 'b', 'b'])
Frame1
a b
0 0.283519 1.462209
1 -0.352342 1.254098
2 0.731701 0.236017
3 0.022217 -1.469342
4 0.386000 -0.706614
Trying to get to the series:
0 0.283519
1 -0.352342
2 0.731701
3 -1.469342
4 -0.706614
I was hoping frame1[picked_values] would work, but this ends up with five columns.
In the real-life example, picked_values is a lot larger and calculated.
Thank you for your time.
Use df.lookup
pd.Series(frame1.lookup(picked_values.index,picked_values))
0 0.283519
1 -0.352342
2 0.731701
3 -1.469342
4 -0.706614
dtype: float64
Here's a NumPy based approach using integer indexing and Series.searchsorted:
frame1.values[frame1.index, frame1.columns.searchsorted(picked_values.values)]
# array([0.22095278, 0.86200616, 1.88047197, 0.49816937, 0.10962954])
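If you are on a pandas version where DataFrame.lookup is no longer available, the same row-wise pick can be wrapped back into a Series aligned with frame1; a sketch using get_indexer instead of searchsorted (so the columns don't need to be sorted):
import numpy as np
import pandas as pd

rows = np.arange(len(frame1))                       # one positional row index per row
cols = frame1.columns.get_indexer(picked_values)    # positional column index picked per row
result = pd.Series(frame1.to_numpy()[rows, cols], index=frame1.index)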
I am trying to find the rows, in a very large dataframe, with the highest mean.
Reason: I scanned something with laser trackers and used a "higher" point as a reference for where the scan starts. I am trying to find the placed object throughout my data.
I have calculated the mean of each row with:
base = df.mean(axis=1)
base.columns = ['index','Mean']
Here is an example of the mean for each row:
0 4.407498
1 4.463597
2 4.611886
3 4.710751
4 4.742491
5 4.580945
This seems to work fine, except that it adds an index column and the result comes out with dtype float64.
I then tried this to locate the rows with highest mean:
moy = base.loc[base.reset_index().groupby(['index'])['Mean'].idxmax()]
This gives this:
index Mean
0 0 4.407498
1 1 4.463597
2 2 4.611886
3 3 4.710751
4 4 4.742491
5 5 4.580945
But it only re-indexes (I now have 3 columns instead of two) and does nothing else. It still shows all rows.
Here is one way without using groupby:
moy = base.sort_values('Mean').tail(1)
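Note that if base is the plain Series returned by df.mean(axis=1) (as in the question), there is no 'Mean' column to sort on; in that case idxmax or nlargest get at the top rows directly (a sketch):
base = df.mean(axis=1)   # Series of row means
base.idxmax()            # index label of the row with the highest mean
base.nlargest(5)         # the five highest row means, if several are needed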
It looks as though your data is a string or single column with a space in between your two numbers. Suggest splitting the column into two and/or using something similar to below to set the index to your specific column of interest.
import pandas as pd
df = pd.read_csv('testdata.txt', names=["Index", "Mean"], delimiter=r"\s+")
df = df.set_index("Index")
print(df)
I am getting familiar with Pandas and I want to learn the logic with a few simple examples.
Let us say I have the following panda DataFrame object:
import pandas as pd
d = {'year': pd.Series([2014,2014,2014,2014], index=['a','b','c','d']),
     'dico': pd.Series(['A','A','A','B'], index=['a','b','c','d']),
     'mybool': pd.Series([True,False,True,True], index=['a','b','c','d']),
     'values': pd.Series([10.1,1.2,9.5,4.2], index=['a','b','c','d'])}
df = pd.DataFrame(d)
Basic Question.
How do I get a column as a list?
I.e., d['year']
would return
[2013,2014,2014,2014]
Question 0
How do I take rows 'a' and 'b' and columns 'year' and 'values' as a new DataFrame?
If I try:
d[['a','b'],['year','values']]
it doesn't work.
Question 1.
How would I aggregate (sum/average) the values column by the year and dico columns, for example? I.e., different year/dico combinations would not be added together, and mybool would basically be dropped from the result.
I.e., after aggregation (in this case averaging) I should get:
tipo values year
A 10.1 2013
A (9.5+1.2)/2 2014
B 4.2 2014
If I try the groupby function it seems to output some odd new DataFrame structure with bool in it, and all possible years/dico combinations - my objective is rather to have that simpler new sliced and smaller dataframe I showed above.
Question 2. How do I filter by a condition?
I.e., I want to filter out all rows where mybool is False.
It'd return:
tipo values year mybool
A 10.1 2013 True
A 9.5 2014 True
B 4.2 2014 True
I've tried the panda tutorial but I still get some odd behavior so asking directly seems to be a better idea.
Thanks!
Values from a Series:
df['year'].values #returns an array
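If an actual Python list is needed, as in the question, tolist does the conversion:
df['year'].tolist()
#[2014, 2014, 2014, 2014]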
loc lets you subset a dataframe by index labels:
df.loc[['a','b'],['year','values']]
groupby lets you aggregate over columns:
df.groupby(['year','dico'],as_index=False).mean() #don't have 2013 in your df
Filtering by a column value:
df[df['mybool']==True]
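Since mybool is already boolean, the comparison with True can be dropped; the mask itself is enough:
df[df['mybool']]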