Python groupby statement only leaving aggregate fields

frame = frame2.groupby(['name1', 'name2', 'date', 'agID', 'figi', 'exch',
                        'marketSector', 'name', 'fx_currency', 'id_type',
                        'id', 'currency']).agg({'call_agreed_amount': 'sum',
                                                'pledge_current_market_value': 'sum',
                                                'pledge_quantity': 'sum',
                                                'pledge_adjusted_collateral_value': 'sum',
                                                'count': 'count'})
print(frame.head())
for value in frame['call_currency']:
    doStuff()
In the code above, all of the columns exist before the groupby statement, and frame.head() appears to show the same columns after the groupby is executed. My code then fails at the for loop with a KeyError when accessing frame['call_currency'], a column that certainly appears to be in frame.

After troubleshooting, I realized that the groupby/agg call returns a DataFrame whose index is a hierarchical (MultiIndex) built from the grouping columns, so those columns are no longer regular columns; they only appear in frame.head() because they are printed as the index of the aggregated values. To fix this, I added .reset_index() to the end of my groupby statement, which turns the group keys back into ordinary columns.
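For illustration, a minimal sketch of the same pattern with made-up data and a reduced set of columns (the column names and values here are assumptions, not the real ones):

import pandas as pd

# After groupby().agg() the group keys become the index, so they are no
# longer addressable as columns until reset_index() (or as_index=False)
# moves them back.
frame2 = pd.DataFrame({
    'name1': ['a', 'a', 'b'],
    'call_currency': ['USD', 'USD', 'EUR'],
    'call_agreed_amount': [10, 20, 30],
})

frame = (
    frame2.groupby(['name1', 'call_currency'])
          .agg({'call_agreed_amount': 'sum'})
          .reset_index()          # group keys become ordinary columns again
)

for value in frame['call_currency']:   # no KeyError now
    print(value)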

Related

Groupby, sum, reset index & keep first all together

I am using the following code. My goal is to group by 2 columns (out of tens of them), keep the first value of all the other columns, and sum the values of two specific columns. It doesn't work no matter which combination I try.
Code used:
df1 = df.groupby(['col_1', 'Col_2'], as_index = False)[['Age', 'Income']].apply(sum).first()
The error that I am getting is the following which just leads me to believe that this can be done with a slightly different version of the code that I used.
TypeError: first() missing 1 required positional argument: 'offset'
Any suggestions would be more than appreciated!
You can use agg, configuring the corresponding function for each column.
group = ['col_1', 'col_2']
(df.groupby(group, as_index=False)
   .agg({
       **{x: 'first' for x in df.columns[~df.columns.isin(group)]},  # for all columns other than the grouping columns
       **{'Age': 'sum', 'Income': 'sum'}  # overwrite the aggregation for specific columns
   })
)
This part, { **{...}, **{...} }, will generate
{
    'Age': 'sum',
    'Income': 'sum',
    'othercol': 'first',
    'morecol': 'first'
}

Groupby giving keyerror

I have a dataframe, df, defined as:
Empty DataFrame
Columns: []
Index: [timestamp, device_type, os]
I am trying to group by timestamp and device_type and perform .agg on it, such as:
df.groupby(['timestamp', 'device_type']).agg({'sessions_sum': 'sum'})
This is giving me a KeyError:
KeyError: KeyError('timestamp',)
I have read over pandas documentation but I am unsure where I am going wrong. How can I successfully use groupby?
The error is because timestamp, device_type and os are part of the df's index, not actual columns.
So you can either reset the index first:
df.reset_index(inplace=True)
df.groupby(['timestamp', 'device_type']).agg({'sessions_sum': 'sum'})
OR group directly on the index levels:
df.groupby(level=['timestamp', 'device_type']).agg({'sessions_sum': 'sum'})
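A sketch of both options, assuming the three names are levels of a MultiIndex (the values are made up):

import pandas as pd

df = pd.DataFrame(
    {'sessions_sum': [3, 5, 2]},
    index=pd.MultiIndex.from_tuples(
        [('2020-01-01', 'mobile', 'ios'),
         ('2020-01-01', 'desktop', 'windows'),
         ('2020-01-02', 'mobile', 'android')],
        names=['timestamp', 'device_type', 'os'],
    ),
)

# Option 1: promote the index levels back to columns, then group as usual.
out1 = df.reset_index().groupby(['timestamp', 'device_type']).agg({'sessions_sum': 'sum'})

# Option 2: group directly on the index levels by name.
out2 = df.groupby(level=['timestamp', 'device_type']).agg({'sessions_sum': 'sum'})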

Pandas aggregation warning with lambdas (FutureWarning: using a dict with renaming is deprecated)

My question is similar to this one; however, I do need to rename columns because I aggregate my data using custom functions:
def series(x):
    return ','.join(str(item) for item in x)

agg = {
    'revenue': ['sum', series],
    'roi': ['sum', series],
}
df.groupby('name').agg(agg)
As a result I have groups of identically named columns (a sum and a series column under each of revenue and roi), which become completely indistinguishable after I drop the higher column level:
df.columns = df.columns.droplevel(0)
So, how do I go about keeping unique names for my columns?
Use map to flatten the column names:
df.columns = df.columns.map('_'.join)
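Putting it together with made-up data and the question's column names, the flattened names come out as revenue_sum, revenue_series, roi_sum and roi_series:

import pandas as pd

def series(x):
    return ','.join(str(item) for item in x)

df = pd.DataFrame({
    'name':    ['a', 'a', 'b'],
    'revenue': [1, 2, 3],
    'roi':     [0.1, 0.2, 0.3],
})

out = df.groupby('name').agg({'revenue': ['sum', series], 'roi': ['sum', series]})

# Join the two column levels into unique, flat names.
out.columns = out.columns.map('_'.join)
print(out.columns.tolist())   # ['revenue_sum', 'revenue_series', 'roi_sum', 'roi_series']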

Pandas .describe() only returning 4 statistics on int dataframe (count, unique, top, freq)... no min, max, etc

Why could this be? My data seems pretty simple and straightforward: it's a one-column dataframe of ints, but .describe() only returns count, unique, top, freq... not max, min, and the other expected outputs.
(Note .describe() functionality is as expected in other projects/datasets)
It seems pandas doesn't recognize your data as int.
Try to do this explicitly:
print(df.astype(int).describe())
Try:
df.agg(['count', 'nunique', 'min', 'max'])
You can add or remove the different aggregation functions to that list.
And when I have quite a few columns I personally like to transpose it:
df.agg(['count', 'nunique', 'min', 'max']).transpose()
To restrict the aggregations to a subset of columns, there are different ways to do it.
By column names containing a word, for example 'ID':
df.filter(like='ID').agg(['count', 'nunique'])
By type of data:
df.select_dtypes(include=['int']).agg(['count', 'nunique'])
df.select_dtypes(exclude=['float64']).agg(['count', 'nunique'])
Try to change your features into numerical values to get all the statistics you need:
df1['age'] = pd.to_numeric(df1['age'], errors='coerce')
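As a quick illustration of the dtype issue (made-up values): when the column is stored as strings, describe() only reports the object-dtype statistics; once it is numeric, the full summary appears.

import pandas as pd

df = pd.DataFrame({'age': ['1', '2', '2', '3']})   # looks numeric but is object dtype

print(df.describe())                # count, unique, top, freq
print(df.astype(int).describe())    # count, mean, std, min, 25%, 50%, 75%, max

# Or coerce, turning anything unparseable into NaN:
df['age'] = pd.to_numeric(df['age'], errors='coerce')
print(df.describe())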

Pandas Groupby - naming aggregate output column

I have a pandas groupby command which looks like this:
df.groupby(['year', 'month'], as_index=False).agg({'users':sum})
Is there a way I can name the agg output something other than 'users' during the groupby command? For example, what if I wanted the sum of users to be total_users? I could rename the column after the groupby is complete, but wonder if there is another way.
I like @Alexander's answer, but there is also add_prefix:
df.groupby(['year','month']).agg({'users':sum}).add_prefix('total_')
Per the docs:
If a dict is passed, the keys will be used to name the columns.
Otherwise the function’s name (stored in the function object) will be
used.
In [58]: grouped['D'].agg({'result1': np.sum, 'result2': np.mean})
In your case:
df.groupby(['year', 'month'], as_index=False).users.agg({'total_users': np.sum})
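Note that the dict-renaming form above has been deprecated and removed in later pandas versions; on those versions, named aggregation is the supported way to name the output column in the same call. A sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'year':  [2020, 2020, 2021],
                   'month': [1, 1, 2],
                   'users': [10, 20, 30]})

out = df.groupby(['year', 'month'], as_index=False).agg(total_users=('users', 'sum'))
print(out)   # columns: year, month, total_users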
