Filtering Pandas DataFrame Aggregate

I have a pandas DataFrame that I group by, and then run an aggregate calculation on to get the mean:
grouped = df.groupby(['year_month', 'company'])
means = grouped.agg({'size':['mean']})
This gives me a DataFrame back, but I can't seem to filter it down to the specific company and year_month that I want:
means[(means['year_month']=='201412')]
raises a KeyError.

The issue is that you are grouping by 'year_month' and 'company'. Hence, in the means DataFrame, year_month and company are part of the index (a MultiIndex), and you cannot access them the way you access other columns.
One way to do this is to get the values of the 'year_month' level of the index. Example -
means.loc[means.index.get_level_values('year_month') == '201412']
Demo -
In [38]: df
Out[38]:
   A  B   C
0  1  2  10
1  3  4  11
2  5  6  12
3  1  7  13
4  2  8  14
5  1  9  15

In [39]: means = df.groupby(['A','B']).mean()

In [40]: means
Out[40]:
      C
A B
1 2  10
  7  13
  9  15
2 8  14
3 4  11
5 6  12

In [41]: means.loc[means.index.get_level_values('A') == 1]
Out[41]:
      C
A B
1 2  10
  7  13
  9  15

As already pointed out, you will end up with a two-level index. You could try unstacking the aggregated DataFrame:
means = df.groupby(['year_month', 'company']).agg({'size':['mean']}).unstack(level=1)
This should give you a single 'year_month' index, 'company' as the columns, and your aggregated size values. You can then slice by the index:
means.loc['201412']
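A rough sketch of that flow, using a small made-up frame with the question's column names (the data here is purely illustrative):
import pandas as pd

df = pd.DataFrame({
    'year_month': ['201411', '201411', '201412', '201412'],
    'company': ['A', 'B', 'A', 'B'],
    'size': [10, 20, 30, 40],
})

means = df.groupby(['year_month', 'company']).agg({'size': ['mean']}).unstack(level=1)

# 'year_month' is now the only index level and 'company' is spread
# across the columns, so a plain .loc lookup works:
print(means.loc['201412'])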

first/count applied to groupby returns empty dataframe

import pandas as pd
df = pd.DataFrame( {'A': [1,1,2,3,4,5,5,6,7,7,7,8]} )
dummy = df["A"]
print(dummy)
0     1
1     1
2     2
3     3
4     4
5     5
6     5
7     6
8     7
9     7
10    7
11    8
Name: A, dtype: int64
res = df.groupby(dummy)
print(res.first())
Empty DataFrame
Columns: []
Index: [1, 2, 3, 4, 5, 6, 7, 8]
Why does the last print result in an empty DataFrame? I expect each group to be a slice of the original df, where each slice contains as many rows as there are duplicates of the corresponding value in column "A". What am I missing?
My guess is that, by default, A is set as the index before the groupby aggregation (e.g. first) is applied. Since A is the only column, df is essentially empty by the time the actual first operation runs. If you have another column B:
df = pd.DataFrame( {'A': [1,1,2,3,4,5,5,6,7,7,7,8], 'B':range(12)} )
then you would see A as the index and the first values for B in each group with df.groupby(dummy).first():
    B
A
1   0
2   2
3   3
4   4
5   5
6   7
7   8
8  11
On another note, if you force as_index=False, groupby does not set A as the index and you get non-empty data:
df.groupby(dummy, as_index=False).first()
gives:
   A
0  1
1  2
2  3
3  4
4  5
5  6
6  7
7  8
Or, you can group by a copy of the column (pandas appears to keep A in the output here because the grouper is no longer the frame's own column object):
df.groupby(dummy.copy()).first()
and you get:
   A
A
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
By default, as_index is True, which means groupby takes the passed column, makes it the index, and then groups the other columns of the DataFrame accordingly. You need to pass as_index=False to get your desired results.
import pandas as pd
df = pd.DataFrame( {'A': [1,1,2,3,4,5,5,6,7,7,7,8]} )
dummy = df["A"]
print(dummy)
res = df.groupby(dummy,as_index=False)
print(res.first())
   A
0  1
1  2
2  3
3  4
4  5
5  6
6  7
7  8
as_index : bool, default True
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
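One more workaround, sketched here on the assumption that you only care about column A itself: group the Series by its own values, which sidesteps the DataFrame index question entirely.
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3, 4, 5, 5, 6, 7, 7, 7, 8]})
dummy = df['A']

# Grouping the Series by itself yields a non-empty result,
# indexed by the distinct values of A.
print(dummy.groupby(dummy).first())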

How to sum columns with a duplicate name with Pandas?

I have a dataframe with duplicate column names and I would like to sum those columns.
>df
    A  B  A  B
1  12  2  4  1
2  10  5  4  9
3   2  1  4  8
4   2  4  3  8
What I would like is something like this:
    A   B
1  16   3
2  14  14
3   6   9
4   5  12
I can select the duplicate columns in a loop, but I don't know how to remove the original columns and recreate a new column with the summed values. Is there a more elegant way?
col = list(df.columns)
dup = list(set([x for x in col if col.count(x) > 1]))
for d in dup:
    dup_sum = df[d].sum(axis=1)
Let us try
sum_df = df.sum(level=0, axis=1)
Try this
df.groupby(lambda x:x, axis=1).sum()
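Note that recent pandas releases have deprecated both of these spellings (summing with a level= argument, and grouping with axis=1), so on newer versions a rough equivalent, sketched here with the question's data, is to transpose, group the duplicated labels, and transpose back:
import pandas as pd

df = pd.DataFrame([[12, 2, 4, 1],
                   [10, 5, 4, 9],
                   [2, 1, 4, 8],
                   [2, 4, 3, 8]],
                  index=[1, 2, 3, 4],
                  columns=['A', 'B', 'A', 'B'])

# Duplicate column labels become the row index after transposing;
# grouping on level 0 of that index collapses the duplicates.
summed = df.T.groupby(level=0).sum().T
print(summed)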

Pandas - How to swap column contents leaving label sequence intact?

I am using pandas v0.25.3 and am inexperienced but learning.
I have a dataframe and would like to swap the contents of two columns leaving the columns labels and sequence intact.
df = pd.DataFrame({"A": [1, 2, 3, 4],
                   "B": [5, 6, 7, 8],
                   "C": [9, 10, 11, 12]})
This yields a dataframe,
   A  B   C
0  1  5   9
1  2  6  10
2  3  7  11
3  4  8  12
I want to swap the contents of columns B and C to get
   A   B  C
0  1   9  5
1  2  10  6
2  3  11  7
3  4  12  8
I have tried looking at pd.DataFrame.values, which sent me to numpy arrays and advanced slicing, and I got lost.
What's the simplest way to do this?
You can assign numpy array:
# pandas 0.24+
df[['B','C']] = df[['C','B']].to_numpy()
# older pandas versions
df[['B','C']] = df[['C','B']].values
Or use DataFrame.assign:
df = df.assign(B=df.C, C=df.B)
print(df)
   A   B  C
0  1   9  5
1  2  10  6
2  3  11  7
3  4  12  8
Or just use:
df['B'], df['C'] = df['C'], df['B'].copy()
print(df)
Output:
   A   B  C
0  1   9  5
1  2  10  6
2  3  11  7
3  4  12  8
You can also swap the labels:
df.columns = ['A','C','B']
If your DataFrame is very large, I believe this would require less from your computer than copying all the data.
If the order of the columns is important, you can then reorder them:
df = df.reindex(['A','B','C'], axis=1)
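Putting that together, a minimal sketch of the label-swap approach on the example frame:
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4],
                   "B": [5, 6, 7, 8],
                   "C": [9, 10, 11, 12]})

# Relabel the columns so B and C trade names (no data is moved) ...
df.columns = ['A', 'C', 'B']
# ... then restore the original label order.
df = df.reindex(['A', 'B', 'C'], axis=1)
print(df)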

Aggregating dataframe to give sum of elements and string of grouped indices

I'm trying to use groupby to give me the sum or mean of a number of elements, and a string of the original row indices for each group. So for instance, the dataframe:
>>> df = pd.DataFrame([[1,2,3],[1,3,4],[2,3,4],[2,5,6],[7,8,3],[11,12,13],[11,2,3]],index = ['p','q','r','s','t','u','v'],columns =['a','b','c'])
    a   b   c
p   1   2   3
q   1   3   4
r   2   3   4
s   2   5   6
t   7   8   3
u  11  12  13
v  11   2   3
I would then like df to be grouped by 'a', to give:
     b   c  indices
1    5   7  p,q
2    8  10  r,s
7    8   3  t
11  14  16  u,v
So far, I've tried:
df.groupby('a').agg({'score' : np.sum, 'indices' : lambda x: ",".join(list(x.index.values))})
But I am receiving an error because 'indices' does not exist as a column. Can anyone advise how to accomplish what I'm trying to do?
Thanks
The way aggregation works is that you give a key and a value, where the key is a pre-existing column name and the value is a function to apply to that column.
So to get the sums the way you want, you do the following:
>>> grouped = df.groupby('a')
>>> grouped.agg({'b' : np.sum, 'c' : np.sum}).head()
     c   b
a
1    7   5
2   10   8
7    3   8
11  16  14
But you want a third column recording which rows have been combined. So you actually need to add this column before you group! Here is the full code:
df['indices'] = range(len(df))
grouped = df.groupby('a')
final = grouped.agg({'b' : np.sum, 'c' : np.sum, 'indices': lambda x: ",".join(list(x.index.values))})
then you get the following result:
>>> final.head()
   indices   c   b
a
1      p,q   7   5
2      r,s  10   8
7        t   3   8
11     u,v  16  14
If you have any further questions, feel free to comment.
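As an aside, on pandas 0.25+ you could get the same result with named aggregation, which avoids pre-creating the indices column; a sketch, assuming the row labels are what you want joined:
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [1, 3, 4], [2, 3, 4], [2, 5, 6],
                   [7, 8, 3], [11, 12, 13], [11, 2, 3]],
                  index=list('pqrstuv'), columns=['a', 'b', 'c'])

# Expose the row labels as a real column, then aggregate them
# alongside the numeric sums in one pass.
out = (df.reset_index()
         .groupby('a')
         .agg(b=('b', 'sum'), c=('c', 'sum'),
              indices=('index', ','.join)))
print(out)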

Sum all columns with a wildcard name search using Python Pandas

I have a dataframe in python pandas with several columns taken from a CSV file.
For instance, data is:
Day  P1S1  P1S2  P1S3  P2S1  P2S2  P2S3
1    1     2     2     3     1     2
2    2     2     3     5     4     2
And what I need is the sum of all columns whose names start with P1... something like P1* with a wildcard.
Something like the following, which gives an error:
P1Sum = data["P1*"]
Is there any way to do this with pandas?
I found the answer.
Using the data DataFrame from the question:
P1Channels = data.filter(regex="P1")
P1Sum = P1Channels.sum(axis=1)
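One small caveat worth sketching: regex="P1" matches 'P1' anywhere in a column name, so it would also pick up a hypothetical column like 'XP1S1'. To match only at the start, anchor the pattern, or use filter(like='P1') for plain substring matching. A minimal sketch assuming the frame from the question:
import pandas as pd

data = pd.DataFrame({'Day': [1, 2],
                     'P1S1': [1, 2], 'P1S2': [2, 2], 'P1S3': [2, 3],
                     'P2S1': [3, 5], 'P2S2': [1, 4], 'P2S3': [2, 2]})

# Anchored pattern: only columns whose names start with 'P1'.
p1_sum = data.filter(regex='^P1').sum(axis=1)
print(p1_sum)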
List comprehensions on columns allow more filters in the if condition:
In [1]: df = pd.DataFrame(np.arange(15).reshape(5, 3), columns=['P1S1', 'P1S2', 'P2S1'])

In [2]: df
Out[2]:
   P1S1  P1S2  P2S1
0     0     1     2
1     3     4     5
2     6     7     8
3     9    10    11
4    12    13    14

In [3]: df.loc[:, [x for x in df.columns if x.startswith('P1')]].sum(axis=1)
Out[3]:
0     1
1     7
2    13
3    19
4    25
dtype: int64
Thanks for the tip jbssm. For anyone else looking for a grand total, I ended up adding .sum() at the end:
P1Sum = P1Channels.sum(axis=1).sum()
