How to aggregate and combine dataframes with pandas groupby - python

I have a dataframe df and a column df['table'] such that each item in df['table'] is another dataframe with the same headers/number of columns. I was wondering if there's a way to do a groupby like this:
Original dataframe:
name table
Bob Pandas df1
Joe Pandas df2
Bob Pandas df3
Bob Pandas df4
Emily Pandas df5
After groupby:
name table
Bob Pandas df containing the appended df1, df3, and df4
Joe Pandas df2
Emily Pandas df5
I found this code snippet to do a groupby and lambda for strings in a dataframe, but haven't been able to figure out how to append entire dataframes in a groupby.
df['table'] = df.groupby(['name'])['table'].transform(lambda x : ' '.join(x))
I've also tried df['table'] = df.groupby(['name'])['table'].apply(list), but that gives me a df['table'] of all NaN.
Thanks for your help!!

Given 3 dataframes
import pandas as pd
dfa = pd.DataFrame({'a': [1, 2, 3]})
dfb = pd.DataFrame({'a': ['a', 'b', 'c']})
dfc = pd.DataFrame({'a': ['pie', 'steak', 'milk']})
Given another dataframe with dataframes in the 'table' column
df = pd.DataFrame({'name': ['Bob', 'Joe', 'Bob', 'Bob', 'Emily'], 'table': [dfa, dfa, dfb, dfc, dfb]})
# print the type for the first value in the table column, to confirm it's a dataframe
print(type(df.loc[0, 'table']))
[out]:
<class 'pandas.core.frame.DataFrame'>
Each group of dataframes can be combined into a single dataframe by using .groupby to aggregate a list for each group, then combining the dataframes in each list with pd.concat
# if there is only one column, or if there are multiple columns of dataframes to aggregate
dfg = df.groupby('name').agg(lambda x: pd.concat(list(x)).reset_index(drop=True))
# display(dfg.loc['Bob', 'table'])
a
0 1
1 2
2 3
3 a
4 b
5 c
6 pie
7 steak
8 milk
# to specify a single column, or specify multiple columns, from many columns
dfg = df.groupby('name')[['table']].agg(lambda x: pd.concat(list(x)).reset_index(drop=True))
Not a duplicate
Originally, I had marked this question as a duplicate of How to group dataframe rows into list in pandas groupby, thinking the dataframes could be aggregated into a list, and then combined with pd.concat.
df.groupby('name')['table'].apply(list)
df.groupby('name').agg(list)
df.groupby('name')['table'].agg(list)
df.groupby('name').agg({'table': list})
df.groupby('name').agg(lambda x: list(x))
However, these all result in a StopIteration error when the values being aggregated are dataframes.
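That said, a plain loop over the groups sidesteps the .agg/.apply machinery entirely. A minimal sketch, using the df defined above:
# build a dict of combined dataframes, one entry per name
# ignore_index=True renumbers the combined rows 0..n-1
combined = {name: pd.concat(group['table'].tolist(), ignore_index=True)
            for name, group in df.groupby('name')}
print(combined['Bob'])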

Here, let's create a dataframe with dataframes as column values.
First, I start with three dataframes:
import pandas as pd
# creating dataframes that we will assign to Bob and Joe; notice the b's and j's:
df1 = pd.DataFrame({'var1':[12, 34, -4, None], 'letter':['b1', 'b2', 'b3', 'b4']})
df2 = pd.DataFrame({'var1':[1, 23, 44, 0], 'letter':['j1', 'j2', 'j3', 'j4']})
df3 = pd.DataFrame({'var1':[22, -3, 7, 78], 'letter':['b5', 'b6', 'b7', 'b8']})
# let's make a list of dictionaries:
list_of_dfs = [
    {'name': 'Bob', 'table': df1},
    {'name': 'Joe', 'table': df2},
    {'name': 'Bob', 'table': df3}
]
# construct the main dataframe:
original_df = pd.DataFrame(list_of_dfs)
print(original_df)
original_df.shape #shows (3, 2)
Now that we have the original dataframe as input, we can produce the resulting new dataframe. In doing so, we use groupby(), agg(), and pd.concat(), and we also reset the index.
new_df = original_df.groupby('name')['table'].agg(lambda series: pd.concat(series.tolist())).reset_index()
print(new_df)
# check that Bob's table is now a concatenated table of df1 and df3:
new_df.loc[new_df['name'] == 'Bob', 'table'].iloc[0]
The output to the last line of code is:
var1 letter
0 12.0 b1
1 34.0 b2
2 -4.0 b3
3 NaN b4
0 22.0 b5
1 -3.0 b6
2 7.0 b7
3 78.0 b8
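Note the repeated 0-3 index in the combined table. If a clean running index is preferred, pd.concat can renumber the rows; a small variation on the aggregation above:
new_df = original_df.groupby('name')['table'].agg(
    lambda series: pd.concat(series.tolist(), ignore_index=True)
).reset_index()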

Related

Drop rows in a pandas dataframe by criteria from another dataframe

I have the following dataframe containing scores for a competition, as well as a column that counts the entry number for each person.
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Jim', 'John','Jim', 'John','Jim','John','Jim','John','Jim','Jack','Jack','Jack','Jack'],'Score': [10,8,9,3,5,0, 1, 2,3, 4,5,6,8,9]})
df['Entry_No'] = df.groupby(['Name']).cumcount() + 1
df
Then I have another table that stores data on the maximum number of entries that each person can have:
df2 = pd.DataFrame({'Name': ['John', 'Jim', 'Jack'],'Limit': [2,3,1]})
df2
I am trying to drop rows from df where the entry number is greater than that person's Limit in df2, so that my expected output is this:
Name Score Entry_No
0 John 10 1
1 Jim 8 1
2 John 9 2
3 Jim 3 2
4 Jim 0 3
5 Jack 5 1
If anyone has ideas on how to achieve this, that would be fantastic! Thanks
You can use pandas.merge to create another dataframe and drop the rows that violate your condition:
df3 = pd.merge(df, df2, on="Name", how="left")
df3[df3["Entry_No"] <= df3["Limit"]][df.columns].reset_index(drop=True)
Name Score Entry_No
0 John 10 1
1 Jim 8 1
2 John 9 2
3 Jim 3 2
4 Jim 0 3
5 Jack 5 1
I used how="left" to keep the order of df and reset_index(drop=True) to reset the index of the resulting dataframe.
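As an aside (a sketch, not part of the original answers), Series.map can pull each person's limit into df without a merge:
# look up each row's limit by Name, then keep only rows within it
limits = df2.set_index('Name')['Limit']
result = df[df['Entry_No'] <= df['Name'].map(limits)].reset_index(drop=True)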
You could join the 2 dataframes, and then drop with a condition:
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Jim', 'John','Jim', 'John','Jim','John','Jim','John','Jim','Jack','Jack','Jack','Jack'],'Score': [10,8,9,3,5,0, 1, 2,3, 4,5,6,8,9]})
df['Entry_No'] = df.groupby(['Name']).cumcount() + 1
df2 = pd.DataFrame({'Name': ['John', 'Jim', 'Jack'],'Limit': [2,3,1]})
df2 = df2.set_index('Name')
df = df.join(df2, on='Name')
df.drop(df[df.Entry_No > df.Limit].index, inplace=True)
gives the expected output

How to remove duplication of columns names using Pandas Merge function

When we merge two dataframes using the pandas merge function, is it possible to ensure that the key(s) on which the two dataframes are merged do not appear twice in the result? For e.g., I tried to merge two DFs with a column named 'isin_code' in the left DF and a column named 'isin' in the right DF. Even though the column/header names are different, the values of both columns are the same. In the eventual result, though, I get both the 'isin_code' column and the 'isin' column, which I am trying to avoid.
Code used:
result = pd.merge(df1,df2[['isin','issue_date']],how='left',left_on='isin_code',right_on = 'isin')
Either rename the columns to match before the merge, so the key names are uniform, and specify only on:
result = pd.merge(
    df1,
    df2[['isin', 'issue_date']].rename(columns={'isin': 'isin_code'}),
    on='isin_code',
    how='left'
)
OR drop the duplicate column after merge:
result = pd.merge(
    df1,
    df2[['isin', 'issue_date']],
    how='left',
    left_on='isin_code',
    right_on='isin'
).drop(columns='isin')
Sample DataFrames and output:
import pandas as pd
df1 = pd.DataFrame({'isin_code': [1, 2, 3], 'a': [4, 5, 6]})
df2 = pd.DataFrame({'isin': [1, 3], 'issue_date': ['2021-01-02', '2021-03-04']})
df1:
isin_code a
0 1 4
1 2 5
2 3 6
df2:
isin issue_date
0 1 2021-01-02
1 3 2021-03-04
result:
isin_code a issue_date
0 1 4 2021-01-02
1 2 5 NaN
2 3 6 2021-03-04
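A third option (a sketch, analogous to the index-join approach in a later answer below): set the right key as the index so it never appears as a column in the result:
result = pd.merge(
    df1,
    df2.set_index('isin'),
    how='left',
    left_on='isin_code',
    right_index=True
)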

How to concatenate multiple dataframes having different column names and length

I have multiple dataframes, all with different column names and lengths. For example, df1 has columns ['c1', 'c2', 'c3'], df2 has columns ['d1', 'd2', 'd3', 'd4'], and so on.
I want to concatenate all the dfs one under another. I don't care about the column name preservation. Resultant df will have all the values of df1 and df2 and so on.
Right now I'm doing pd.concat([df1, df2], axis=0), but because the column names differ, the result places the df1 and df2 columns side by side (padded with NaN). I want them one under another.
If column names aren't important, we can also take the numpy values of the dataframes with DataFrame.values and concatenate them using pd.concat(), like below:
pd.concat([pd.DataFrame(dfi.values) for dfi in [df1, df2]], ignore_index=True)
Since your DataFrames can have different numbers of columns, rename the labels to their integer positions so that they align underneath each other for the join. The result will have an Int64Index on the columns, up to the width of the widest DataFrame passed to concat.
Sample Data
import pandas as pd
import numpy as np
np.random.seed(123)
df1 = pd.DataFrame(np.random.choice(['foo', 'bar'], (2, 3)),
                   columns=['c1', 'c2', 'c3'])
df2 = pd.DataFrame(np.random.randint(11, 20, (3, 4)),
                   columns=['d1', 'd2', 'd3', 'd4'])
Code
df = pd.concat([x.rename(columns=dict(zip(x.columns, range(x.shape[1]))))
                for x in [df1, df2]],
               ignore_index=True)
# 0 1 2 3
#0 foo bar foo NaN
#1 foo foo foo NaN
#2 17 12 14 17.0
#3 12 11 12 11.0
#4 11 14 15 11.0
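An arguably simpler variant of the rename (a sketch; set_axis accepts the axis keyword in recent pandas versions): replace the column labels with their integer positions in one call:
df = pd.concat([x.set_axis(range(x.shape[1]), axis=1) for x in [df1, df2]],
               ignore_index=True)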

Can I avoid the join column of the right data frame appearing in the output of a pandas merge?

I am merging two data frames with pandas. I would like to avoid having the join column of the right table included in the output.
Example:
import pandas as pd
age = [['tom', 10], ['nick', 15], ['juli', 14]]
df1 = pd.DataFrame(age, columns = ['Name', 'Age'])
toy = [['tom', 'GIJoe'], ['nick', 'car']]
df2 = pd.DataFrame(toy, columns = ['Name_child', 'Toy'])
df = pd.merge(df1,df2,left_on='Name',right_on='Name_child',how='left')
df.columns will give the output Index(['Name', 'Age', 'Name_child', 'Toy'], dtype='object'). Is there an easy way to obtain Index(['Name', 'Age', 'Toy'], dtype='object') instead? I can drop the column afterwards of course like this del df['Name_child'], but I'd like my code to be as short as possible.
Based on @mgc's comments, you don't have to rename the columns of df2 itself. Just pass df2 to the merge function with the column renamed on the fly; df2's own column names will remain as they are.
df = pd.merge(df1,df2.rename(columns={'Name_child': 'Name'}),on='Name', how='left')
df
Name Age Toy
0 tom 10 GIJoe
1 nick 15 car
2 juli 14 NaN
df.columns
Index(['Name', 'Age', 'Toy'], dtype='object')
df2.columns
Index(['Name_child', 'Toy'], dtype='object')
Set the index of the second dataframe to "Name_child". If you do this in the merge statement, the columns in df2 remain unchanged.
df = pd.merge(df1,df2.set_index('Name_child'),left_on='Name',right_index=True,how='left')
This outputs the correct columns:
df
Name Age Toy
0 tom 10 GIJoe
1 nick 15 car
2 juli 14 NaN
df.columns
Index(['Name', 'Age', 'Toy'], dtype='object')
It seems even simpler to drop the column right after:
df = (pd.merge(df1, df2, left_on='Name', right_on='Name_child', how='left')
        .drop('Name_child', axis=1))
#----------------
import this
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.

Retrieve unique column names over multiple DataFrames and append all to a list

Description of task
I want to retrieve the column names across multiple DataFrames and append the unique names to a list. The following code appends only a single DataFrame's column names to the list, and I am not sure how to collect the distinct column names of all the remaining DataFrames into desiredlist. Any ideas would be awesome!
alldf = [df, df1, df2, df3, df4]
for index, dataframe in enumerate(alldf):
    desiredlist = []
    a = dataframe.columns.values.tolist()
    desiredlist.append(a)
Example of DataFrames
df
ID AA TA TL
Date
2001 a 1.0 44 50
df1
ID AA TM TP
Date
2001 a 1.0 44 50
df2
ID TP TZ TK
Date
2001 a 1.0 44 50
df3
ID AA TA TG
Date
2001 a 1.0 44 50
df4
ID AB TT TQ
Date
2001 a 1.0 44 50
List Output Desired
All column names across the multiple DataFrames, each appearing only once:
desiredlist = ['AA', 'TA', 'TL', 'TM', 'TP', 'TZ', 'TK','TG', 'AB', 'TT', 'TQ']
You could iterate through the list "a" and add values that haven't already been added to "desiredlist".
I think this is what you were going for.
alldf = [df, df1, df2, df3, df4]
desiredlist = []
for dataframe in alldf:
    a = dataframe.columns.values.tolist()
    for column_name in a:
        if column_name not in desiredlist:
            desiredlist.append(column_name)
You can use set.update() to populate a set and then get the unique column names.
For example:
import pandas as pd

df1 = pd.DataFrame({'A': [1], 'B': [2]})
df2 = pd.DataFrame({'A': [1], 'C': [2]})
df3 = pd.DataFrame({'D': [1], 'E': [2]})
df4 = pd.DataFrame({'D': [1], 'B': [2]})

unique = set()
for d in [df1, df2, df3, df4]:
    unique.update(d)
print(unique)
Prints:
{'A', 'D', 'C', 'E', 'B'}
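One caveat (my note, not from the original answer): a set does not preserve the first-seen order shown in the desired list. If order matters, dict.fromkeys can act as an ordered set, since dicts preserve insertion order in Python 3.7+. A sketch, using the same df1-df4:
ordered = dict.fromkeys(c for d in [df1, df2, df3, df4] for c in d.columns)
desiredlist = list(ordered)
print(desiredlist)  # ['A', 'B', 'C', 'D', 'E']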
