Groupby, sum, reset index & keep first all together - python

I am using the code below. My goal is to group by two columns (out of tens of them), sum the values of two other columns, and keep the first value of all the remaining columns. It doesn't work, no matter what combination I try.
Code used:
df1 = df.groupby(['col_1', 'Col_2'], as_index = False)[['Age', 'Income']].apply(sum).first()
The error I am getting is the following, which leads me to believe this can be done with a slightly different version of the code I used.
TypeError: first() missing 1 required positional argument: 'offset'
Any suggestions would be more than appreciated!

You can use agg, configuring the aggregation function for each column.
group = ['col_1', 'col_2']
(df.groupby(group, as_index=False)
   .agg({
       **{x: 'first' for x in df.columns[~df.columns.isin(group)]},  # 'first' for every non-grouping column
       **{'Age': 'sum', 'Income': 'sum'}  # overwrite the aggregation for specific columns
   })
)
The { **{...}, **{...} } part will generate
{
    'Age': 'sum',
    'Income': 'sum',
    'othercol': 'first',
    'morecol': 'first'
}

Related

Filter pandas frame by values that are already grouped on consecutive rows [duplicate]

DataFrame:
  c_os_family_ss c_os_major_is  l_customer_id_i
0      Windows 7                          90418
1      Windows 7                          90418
2      Windows 7                          90418
Code:
print df
for name, group in df.groupby('l_customer_id_i').agg(lambda x: ','.join(x)):
    print name
    print group
I'm trying to just loop over the aggregated data, but I get the error:
ValueError: too many values to unpack
@EdChum, here's the expected output:
                                                    c_os_family_ss  \
l_customer_id_i
131572           Windows 7,Windows 7,Windows 7,Windows 7,Window...
135467           Windows 7,Windows 7,Windows 7,Windows 7,Window...

                                                     c_os_major_is
l_customer_id_i
131572           ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...
135467           ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...
The output is not the problem, I wish to loop over every group.
df.groupby('l_customer_id_i').agg(lambda x: ','.join(x)) does already return a dataframe, so you cannot loop over the groups anymore.
In general:
df.groupby(...) returns a GroupBy object (a DataFrameGroupBy or SeriesGroupBy), and with this you can iterate through the groups (as explained in the docs). You can do something like:
grouped = df.groupby('A')
for name, group in grouped:
    ...
When you apply a function to the groupby, as with df.groupby(...).agg(...) in your example (but this can also be transform, apply, mean, ...), the results for the different groups are combined into one dataframe (the apply and combine steps of groupby's 'split-apply-combine' paradigm). So the result of this will always be a DataFrame (or a Series, depending on the applied function).
Here is an example of iterating over a pd.DataFrame grouped by the column atable. For this sample, CREATE TABLE statements for an SQL database are generated within the for loop:
import pandas as pd

df1 = pd.DataFrame({
    'atable': ['Users', 'Users', 'Domains', 'Domains', 'Locks'],
    'column': ['col_1', 'col_2', 'col_a', 'col_b', 'col'],
    'column_type': ['varchar', 'varchar', 'int', 'varchar', 'varchar'],
    'is_null': ['No', 'No', 'Yes', 'No', 'Yes'],
})

df1_grouped = df1.groupby('atable')

# iterate over each group
for group_name, df_group in df1_grouped:
    print('\nCREATE TABLE {}('.format(group_name))
    for row_index, row in df_group.iterrows():
        col = row['column']
        column_type = row['column_type']
        is_null = 'NOT NULL' if row['is_null'] == 'No' else ''
        print('\t{} {} {},'.format(col, column_type, is_null))
    print(");")
You can also iterate over the index values once the aggregated dataframe has been created.
df = df.groupby('l_customer_id_i').agg(lambda x: ','.join(x))
for name in df.index:
    print name
    print df.loc[name]

Flatten a Dataframe that is pivoted

I have the following code that takes a single column and pivots it into multiple columns. There are blanks in my result that I am trying to remove, but I am running into issues with the wrong values being applied to rows.
task_df = task_df.pivot(index=pivot_cols, columns='Field')['Value'].reset_index()
task_df[['Color','Class']] = task_df[['Color','Class']].bfill()
task_df[['Color','Class']] = task_df[['Color','Class']].ffill()
task_df = task_df.drop_duplicates()
[Screenshots in the original question show the Start, Current, and Desired frames.]
This basically merges all rows that share the same Name or ID. You can do it like this:
mergers = {'ID': 'first', 'Color': 'sum', 'Class': 'sum'}
task_df = (task_df.groupby('Name', as_index=False)
                  .aggregate(mergers)
                  .reindex(columns=task_df.columns)
                  .sort_values(by=['ID']))
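Since the screenshots are gone, here is a minimal sketch on a made-up task_df with the column names the answer assumes (Name, ID, Color, Class); 'sum' concatenates the object-dtype strings, so the blanks disappear:
import pandas as pd

# hypothetical stand-in for the pivoted frame with blanks
task_df = pd.DataFrame({
    'Name': ['a', 'a', 'b', 'b'],
    'ID': [1, 1, 2, 2],
    'Color': ['red', '', '', 'blue'],
    'Class': ['', 'x', 'y', ''],
})

mergers = {'ID': 'first', 'Color': 'sum', 'Class': 'sum'}
task_df = (task_df.groupby('Name', as_index=False)
                  .aggregate(mergers)
                  .reindex(columns=task_df.columns)
                  .sort_values(by=['ID']))
print(task_df)
#   Name  ID Color Class
# 0    a   1   red     x
# 1    b   2  blue     y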

Python Pandas - How to find rows where a column value differs between two data frames

I am trying to get rows where values in a column are different from two data frames.
For example, let's say we have the two dataframes below:
import pandas as pd
data1 = {'date': [20210701, 20210704, 20210703, 20210705, 20210705],
         'name': ['Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
         'a': [1, 0, 1, 1, 0]}
data2 = {'date': [20210701, 20210702, 20210704, 20210703, 20210705, 20210705],
         'name': ['Dave', 'Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
         'a': [1, 0, 1, 1, 0, 0]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
As you can see, Dave has different values in column 'a' on 20210704, and Sue has different values in column 'a' on 20210705. Therefore, my desired output should be something like:
import pandas as pd
output = {'date': [20210704, 20210705],
          'name': ['Dave', 'Sue'],
          'a_from_old': [0, 1]}
df_output = pd.DataFrame(output)
If I am not mistaken, what I am asking for is pretty much the same thing as a MINUS statement in SQL, unless I am missing some edge cases.
How can I find those rows with the same date and name but different values in a column?
Update
I found an edge case where some data is present in only one of the data frames; I want the rows that are in both data frames but whose value in column 'a' differs.
I edited the sample data set to take the edge case into account.
(Notice that Dave on 20210702 does not appear in the final output because that row is not in the first data frame.)
Another option, with an inner merge, which by itself handles the edge case since it only keeps (date, name) pairs present in both frames: keep only the rows where the a from df1 does not equal the a from df2:
df3 = (
    df1.merge(df2, on=['date', 'name'], suffixes=('_from_old', '_df2'))
       .query('a_from_old != a_df2')  # df1 `a` != df2 `a`
       .drop('a_df2', axis=1)         # remove column with df2 `a` values
)
df3:
date name a_from_old
2 20210704 Dave 0
4 20210705 Sue 1
Try a left merge() with indicator=True, then filter out the matched rows with query(), drop the extra column with drop(), and rename 'a' to 'a_from_old' with rename():
out = (df1.merge(df2, on=['date', 'name', 'a'], how='left', indicator=True)
          .query("_merge == 'left_only'")
          .drop('_merge', axis=1)
          .rename(columns={'a': 'a_from_old'}))
output of out:
date name a_from_old
2 20210704 Dave 0
4 20210705 Sue 1
Note: if there are many more columns that you want to rename, then pass suffixes=('_from_old', '') to merge() instead.
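For instance, if both frames also carried a hypothetical extra column b, a merge on just the keys would then rename every overlapping df1 column in one go:
out = df1.merge(df2, on=['date', 'name'], suffixes=('_from_old', ''))
# -> columns: date, name, a_from_old, b_from_old, a, b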

python groupby statement only leaving aggregate fields

frame = frame2.groupby(['name1', 'name2', 'date', 'agID', 'figi', 'exch',
                        'figi', 'marketSector', 'name', 'fx_currency',
                        'id_type', 'id', 'currency']).agg(
    {'call_agreed_amount': 'sum',
     'pledge_current_market_value': 'sum',
     'pledge_quantity': 'sum',
     'pledge_adjusted_collateral_value': 'sum',
     'count': 'count'})
print(frame.head())
for value in frame['call_currency']:
    doStuff()
In the code above, all columns exist before the groupby statement, and after it executes, frame.head() still shows all of the same columns. Yet my code fails at the for loop with a KeyError when accessing frame['call_currency'], which definitely exists in frame.
After troubleshooting myself, I realized that the aggregation returns a DataFrame with a hierarchical index: the grouped columns become the index of the result rather than regular columns, so they can no longer be accessed with frame[...]. To fix this, I added .reset_index() to the end of my groupby statement.
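A sketch of the fix, with the question's column list shortened for brevity:
frame = (frame2.groupby(['name1', 'name2', 'currency'])
               .agg({'call_agreed_amount': 'sum', 'count': 'count'})
               .reset_index())  # move the group keys back into ordinary columns

# the key columns are addressable again
for value in frame['currency']:
    ...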

Pandas aggregation warning with lambdas (FutureWarning: using a dict with renaming is deprecated)

My question is similar to this one; however, I do need to rename columns because I aggregate my data using custom functions:
def series(x):
    return ','.join(str(item) for item in x)

agg = {
    'revenue': ['sum', series],
    'roi': ['sum', series],
}
df.groupby('name').agg(agg)
As a result I get groups of identically named columns, which become completely indistinguishable after I drop the higher column level:
df.columns = df.columns.droplevel(0)
So, how do I go about keeping unique names for my columns?
Use map to flatten the column names:
df.columns = df.columns.map('_'.join)
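A minimal run-through with the agg from the question:
import pandas as pd

def series(x):
    return ','.join(str(item) for item in x)

df = pd.DataFrame({'name': ['a', 'a', 'b'],
                   'revenue': [1, 2, 3],
                   'roi': [4, 5, 6]})

out = df.groupby('name').agg({'revenue': ['sum', series],
                              'roi': ['sum', series]})
out.columns = out.columns.map('_'.join)
print(out.columns.tolist())
# ['revenue_sum', 'revenue_series', 'roi_sum', 'roi_series']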
