DataFrame:
c_os_family_ss c_os_major_is l_customer_id_i
0 Windows 7 90418
1 Windows 7 90418
2 Windows 7 90418
Code:
print(df)
for name, group in df.groupby('l_customer_id_i').agg(lambda x: ','.join(x)):
    print(name)
    print(group)
I'm trying to just loop over the aggregated data, but I get the error:
ValueError: too many values to unpack
@EdChum, here's the expected output:
c_os_family_ss \
l_customer_id_i
131572 Windows 7,Windows 7,Windows 7,Windows 7,Window...
135467 Windows 7,Windows 7,Windows 7,Windows 7,Window...
c_os_major_is
l_customer_id_i
131572 ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...
135467 ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...
The output is not the problem, I wish to loop over every group.
df.groupby('l_customer_id_i').agg(lambda x: ','.join(x)) already returns a DataFrame, so you cannot loop over the groups anymore.
In general:
df.groupby(...) returns a GroupBy object (a DataFrameGroupBy or SeriesGroupBy), and with this, you can iterate through the groups (as explained in the docs here). You can do something like:
grouped = df.groupby('A')
for name, group in grouped:
    ...
When you apply a function on the groupby, as in your example df.groupby(...).agg(...) (but this can also be transform, apply, mean, ...), you combine the results of applying the function to the different groups into one dataframe (the apply and combine steps of the 'split-apply-combine' paradigm of groupby). So the result of this will again be a DataFrame (or a Series, depending on the applied function).
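If you want both the per-group loop and the joined strings, iterate over the GroupBy object itself and join inside the loop. A minimal sketch, with made-up data shaped like the question's frame:
import pandas as pd

df = pd.DataFrame({
    'c_os_family_ss': ['Windows 7', 'Windows 7', 'Windows 7'],
    'c_os_major_is': ['', '', ''],
    'l_customer_id_i': [90418, 90418, 90418],
})

# Iterate before aggregating: each `group` is a sub-DataFrame
for name, group in df.groupby('l_customer_id_i'):
    print(name)
    print(','.join(group['c_os_family_ss']))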
Here is an example of iterating over a pd.DataFrame grouped by the column atable. For this sample, CREATE TABLE statements for an SQL database are generated within the for loop:
import pandas as pd

df1 = pd.DataFrame({
    'atable': ['Users', 'Users', 'Domains', 'Domains', 'Locks'],
    'column': ['col_1', 'col_2', 'col_a', 'col_b', 'col'],
    'column_type': ['varchar', 'varchar', 'int', 'varchar', 'varchar'],
    'is_null': ['No', 'No', 'Yes', 'No', 'Yes'],
})

df1_grouped = df1.groupby('atable')

# iterate over each group
for group_name, df_group in df1_grouped:
    print('\nCREATE TABLE {}('.format(group_name))
    for row_index, row in df_group.iterrows():
        col = row['column']
        column_type = row['column_type']
        is_null = 'NOT NULL' if row['is_null'] == 'No' else ''
        print('\t{} {} {},'.format(col, column_type, is_null))
    print(");")
You can iterate over the index values if your dataframe has already been created.
df = df.groupby('l_customer_id_i').agg(lambda x: ','.join(x))
for name in df.index:
    print(name)
    print(df.loc[name])
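Each df.loc[name] is then a single row of the aggregated frame (a Series) holding the comma-joined values for that customer.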
I have the following code that pivots a single column into multiple columns. There are blanks in my result that I am trying to remove, but I am running into issues with the wrong values being applied to rows.
task_df = task_df.pivot(index=pivot_cols, columns='Field')['Value'].reset_index()
task_df[['Color','Class']] = task_df[['Color','Class']].bfill()
task_df[['Color','Class']] = task_df[['Color','Class']].ffill()
task_df = task_df.drop_duplicates()
[Start / Current / Desired tables were shown as images in the original post]
This is basically merging together all rows that have the same name or ID. You can do it like this:
mergers = {'ID': 'first', 'Color': 'sum', 'Class': 'sum'}
task_df = (task_df.groupby('Name', as_index=False)
                  .aggregate(mergers)
                  .reindex(columns=task_df.columns)
                  .sort_values(by=['ID']))
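A minimal, self-contained sketch of the same idea (column names assumed from the question; note that 'sum' on string columns concatenates the values within each group):
import pandas as pd

task_df = pd.DataFrame({'Name': ['x', 'x'],
                        'ID': [1, 1],
                        'Color': ['Red', ''],
                        'Class': ['', 'A']})

mergers = {'ID': 'first', 'Color': 'sum', 'Class': 'sum'}
# 'sum' concatenates the string fragments spread across the duplicate rows
merged = (task_df.groupby('Name', as_index=False)
                 .aggregate(mergers)
                 .reindex(columns=task_df.columns)
                 .sort_values(by=['ID']))
print(merged)
#   Name  ID Color Class
# 0    x   1   Red     A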
I am trying to get rows where values in a column are different from two data frames.
For example, let's say we have the two dataframes below:
import pandas as pd
data1 = {'date': [20210701, 20210704, 20210703, 20210705, 20210705],
         'name': ['Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
         'a': [1, 0, 1, 1, 0]}
data2 = {'date': [20210701, 20210702, 20210704, 20210703, 20210705, 20210705],
         'name': ['Dave', 'Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
         'a': [1, 0, 1, 1, 0, 0]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
As you can see, Dave has different values in column 'a' on 20210704, and Sue has different values in column 'a' on 20210705. Therefore, my desired output should be something like:
import pandas as pd
output = {'date': [20210704, 20210705],
          'name': ['Dave', 'Sue'],
          'a_from_old': [0, 1]}
df_output = pd.DataFrame(output)
If I am not mistaken, what I am asking for is pretty much the same thing as a MINUS statement in SQL, unless I am missing some edge cases.
How can I find those rows with the same date and name but different values in a column?
Update
I found an edge case: some rows are not present in the other data frame at all. I want to find the rows that are in both data frames but whose value in column 'a' differs.
I edited the sample data set to take the edge case into account.
(Notice that Dave on 20210702 does not appear in the final output, because that row is not in the first data frame.)
Another option: use an inner merge and keep only the rows where the a from df1 does not equal the a from df2:
df3 = (
    df1.merge(df2, on=['date', 'name'], suffixes=('_from_old', '_df2'))
    .query('a_from_old != a_df2')  # df1 `a` != df2 `a`
    .drop('a_df2', axis=1)         # Remove column with df2 `a` values
)
df3:
date name a_from_old
2 20210704 Dave 0
4 20210705 Sue 1
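Because the merge is inner, rows that exist in only one frame (such as Dave on 20210702) never make it into the merged result, which covers the edge case from the update.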
Try a left merge() with indicator=True, then filter the result with query(), drop the extra column with drop(), and rename 'a' to 'a_from_old' using rename():
out = (df1.merge(df2, on=['date', 'name', 'a'], how='left', indicator=True)
          .query("_merge == 'left_only'")
          .drop(columns='_merge')
          .rename(columns={'a': 'a_from_old'}))
output of out:
date name a_from_old
2 20210704 Dave 0
4 20210705 Sue 1
Note: if there are many more columns that you want to rename, then pass suffixes=('_from_old', '') to the merge() method as a parameter.
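A small sketch of that variant (hypothetical: it assumes df1 and df2 share an extra value column 'b' besides 'a'):
# Merge only on the key columns, so every shared value column gets a suffix:
# df1's 'a' becomes 'a_from_old' while df2's 'a' keeps its name, and likewise
# for the hypothetical shared column 'b'.
out = df1.merge(df2, on=['date', 'name'], suffixes=('_from_old', ''))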
frame = frame2.groupby(['name1', 'name2', 'date', 'agID', 'figi', 'exch',
                        'figi', 'marketSector', 'name', 'fx_currency',
                        'id_type', 'id', 'currency']).agg({
    'call_agreed_amount': 'sum',
    'pledge_current_market_value': 'sum',
    'pledge_quantity': 'sum',
    'pledge_adjusted_collateral_value': 'sum',
    'count': 'count',
})
print(frame.head())
for value in frame['call_currency']:
    doStuff()
In the code above, all columns exist before the groupby statement. After the groupby statement is executed, frame.head() still shows all of the same columns. My code fails at the for loop with a KeyError when trying to access frame['call_currency'], which 100% exists in frame.
After troubleshooting myself, I realized that pandas' groupby/agg returns a DataFrame whose index is built from the grouped columns (a hierarchical MultiIndex here), so the group keys are no longer regular columns. In order to fix this, I added .reset_index() to the end of my groupby statement.
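A minimal sketch of the fix (made-up column names, not the asker's schema):
import pandas as pd

df = pd.DataFrame({'name': ['a', 'a', 'b'],
                   'currency': ['USD', 'USD', 'EUR'],
                   'count': [1, 1, 1]})

# Without reset_index(), 'name' and 'currency' live in the MultiIndex,
# so grouped['name'] would raise a KeyError.
grouped = df.groupby(['name', 'currency']).agg({'count': 'count'}).reset_index()
print(grouped['name'])  # now a regular column again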
My question is similar to this one; however, I do need to rename columns because I aggregate my data using custom functions:
def series(x):
    return ','.join(str(item) for item in x)

agg = {
    'revenue': ['sum', series],
    'roi': ['sum', series],
}
df.groupby('name').agg(agg)
As a result I have groups of identically named columns, which become completely indistinguishable after I drop the higher column level:
df.columns = df.columns.droplevel(0)
So, how do I go about keeping unique names for my columns?
Use map to flatten the column names:
df.columns = df.columns.map('_'.join)
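Put together with the agg spec above, a minimal sketch of the result:
import pandas as pd

def series(x):
    return ','.join(str(item) for item in x)

df = pd.DataFrame({'name': ['a', 'a', 'b'],
                   'revenue': [1, 2, 3],
                   'roi': [0.1, 0.2, 0.3]})

out = df.groupby('name').agg({'revenue': ['sum', series],
                              'roi': ['sum', series]})
# Join the two column levels into one unique name per column
out.columns = out.columns.map('_'.join)
print(out.columns.tolist())
# ['revenue_sum', 'revenue_series', 'roi_sum', 'roi_series']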