I have a dictionary (dict) of dataframes such as:
{
'table_1':             name   color  type
                       Banana Yellow Fruit,
'another_table_1':     city    state   country
                       Atlanta Georgia United States,
'and_another_table_1': firstname middlename lastname
                       John      Patrick    Snow,
'table_2':             name  color type
                       Apple Red   Fruit,
'another_table_2':     city      state    country
                       Arlington Virginia United States,
'and_another_table_2': firstname middlename lastname
                       Alex      Justin     Brown,
'table_3':             name    color type
                       Lettuce Green Vegetable,
'another_table_3':     city   state country
                       Dallas Texas United States,
'and_another_table_3': firstname middlename lastname
                       Michael   Alex       Smith
}
I would like to merge these dataframes together based on their names so that in the end I will have only 3 dataframes:
table
name color type
Banana Yellow Fruit
Apple Red Fruit
Lettuce Green Vegetable
another_table
city state country
Atlanta Georgia United States
Arlington Virginia United States
Dallas Texas United States
and_another_table
firstname middlename lastname
John Patrick Snow
Alex Justin Brown
Michael Alex Smith
Based on my initial research, it seems this should be possible in Python by:
using .split, a dictionary comprehension and itertools.groupby to group the dataframes inside the dictionary by key name
creating a dictionary of dictionaries from these grouped results
using the pandas.concat function to loop through these dictionaries and combine the dataframes
I don't have a lot of experience with Python and I am a bit lost on how to actually code this.
I have reviewed the "How to group similar items in a list?" and "Merge dataframes in a dictionary" posts, but they were not as helpful because in my case the dataframe names vary in length.
Also, I do not want to hardcode any dataframe names, because there are more than 1,000 of them.
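For reference, a minimal sketch of the itertools.groupby idea described above (assuming dd is the dictionary of dataframes; note that groupby only groups adjacent items, hence the sort by prefix):
import itertools
import pandas as pd

def prefix(key):
    # 'another_table_12' -> 'another_table'
    return key.rsplit('_', 1)[0]

# sort by prefix so that itertools.groupby sees equal prefixes as adjacent runs
merged = {
    p: pd.concat([dd[k] for k in keys], ignore_index=True)
    for p, keys in itertools.groupby(sorted(dd, key=prefix), key=prefix)
}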
Here is one way:
Given this dictionary of dataframes:
import pandas as pd

dd = {'table_1': pd.DataFrame({'Name': ['Banana'], 'color': ['Yellow'], 'type': ['Fruit']}),
      'table_2': pd.DataFrame({'Name': ['Apple'], 'color': ['Red'], 'type': ['Fruit']}),
      'another_table_1': pd.DataFrame({'city': ['Atlanta'], 'state': ['Georgia'], 'Country': ['United States']}),
      'another_table_2': pd.DataFrame({'city': ['Arlington'], 'state': ['Virginia'], 'Country': ['United States']}),
      'and_another_table_1': pd.DataFrame({'firstname': ['John'], 'middlename': ['Patrick'], 'lastname': ['Snow']}),
      'and_another_table_2': pd.DataFrame({'firstname': ['Alex'], 'middlename': ['Justin'], 'lastname': ['Brown']}),
      }

# unique key prefixes: 'table', 'another_table', 'and_another_table'
tables = {k.rsplit('_', 1)[0] for k in dd}

# compare the prefix exactly rather than using startswith, which would
# over-match whenever one prefix is a prefix of another
dict_of_dfs = {t: pd.concat([df for k, df in dd.items() if k.rsplit('_', 1)[0] == t])
               for t in tables}
This outputs a new dictionary of combined tables:
dict_of_dfs['table']
# Name color type
# 0 Banana Yellow Fruit
# 0 Apple Red Fruit
dict_of_dfs['another_table']
# city state Country
# 0 Atlanta Georgia United States
# 0 Arlington Virginia United States
dict_of_dfs['and_another_table']
# firstname middlename lastname
# 0 John Patrick Snow
# 0 Alex Justin Brown
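Note: the repeated 0 in the index comes from concatenating single-row frames; pass ignore_index=True to pd.concat if you want a fresh RangeIndex instead:
dict_of_dfs = {t: pd.concat([df for k, df in dd.items() if k.rsplit('_', 1)[0] == t],
                            ignore_index=True)
               for t in tables}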
Another way, using defaultdict from collections, is to build a list of combined dataframes:
from collections import defaultdict
import pandas as pd

dd = {'table_1': pd.DataFrame({'Name': ['Banana'], 'color': ['Yellow'], 'type': ['Fruit']}),
      'table_2': pd.DataFrame({'Name': ['Apple'], 'color': ['Red'], 'type': ['Fruit']}),
      'another_table_1': pd.DataFrame({'city': ['Atlanta'], 'state': ['Georgia'], 'Country': ['United States']}),
      'another_table_2': pd.DataFrame({'city': ['Arlington'], 'state': ['Virginia'], 'Country': ['United States']}),
      'and_another_table_1': pd.DataFrame({'firstname': ['John'], 'middlename': ['Patrick'], 'lastname': ['Snow']}),
      'and_another_table_2': pd.DataFrame({'firstname': ['Alex'], 'middlename': ['Justin'], 'lastname': ['Brown']}),
      }

# group the frames by key prefix with a plain loop (clearer than a
# side-effect list comprehension, and it avoids rescanning all keys per prefix)
d = defaultdict(list)
for k, df in dd.items():
    d[k.rsplit('_', 1)[0]].append(df)

l_of_dfs = [pd.concat(frames) for frames in d.values()]
print(l_of_dfs[0])
print('\n')
print(l_of_dfs[1])
print('\n')
print(l_of_dfs[2])
Output:
city state Country
0 Atlanta Georgia United States
0 Arlington Virginia United States
firstname middlename lastname
0 John Patrick Snow
0 Alex Justin Brown
Name color type
0 Banana Yellow Fruit
0 Apple Red Fruit
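If you would rather keep named access instead of relying on list positions, build a dict from the same grouping:
dict_of_dfs = {p: pd.concat(frames) for p, frames in d.items()}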
Related
I want to change the data in a row
database:
Name   City                         Country
John   Toronto,Canada               Canada
Smith  Seattle,United States        United States
Raj    Greater Toronto Area,Canada  Canada
For records where City contains a ",", only the city name should be kept and everything after the comma should be removed.
output required:
Name   City                  Country
John   Toronto               Canada
Smith  Seattle               United States
Raj    Greater Toronto Area  Canada
Use: df['City'] = df['City'].str.split(',').str[0]
Reproducible code:
# import pandas library
import pandas as pd

# initialize list of lists
data = [['tom', 'Toronto ,ON', 'Canada'], ['Raj', 'Greater Toronto Area, Canada', 'Canada']]

# create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'City', 'Country'])

# keep only the part before the first comma, stripping stray whitespace
df['City'] = df['City'].str.split(',').str[0].str.strip()
df
Output:
Name City Country
0 tom Toronto Canada
1 Raj Greater Toronto Area Canada
Ref: https://stackoverflow.com/questions/40705480/python-pandas-remove-everything-after-a-delimiter-in-a-string
I have two datasets that I want to merge based on the person's name. One dataset, player_nationalities, has their full name:
Player, Nationality
Kylian Mbappé, France
Wissam Ben Yedder, France
Gianluigi Donnarumma, Italy
The other dataset, player_ratings, shortens first names to an initial and a full stop, keeping the other name(s).
Player, Rating
K. Mbappé, 93
W. Ben Yedder, 89
G. Donnarumma, 91
How do I merge these tables based on the column Player and avoid merging people with the same last name? This is my attempt:
df = pd.merge(player_nationalities, player_ratings, on='Player', how='left')
Player, Nationality, Rating
Kylian Mbappé, France, NaN
Wissam Ben Yedder, France, NaN
Gianluigi Donnarumma, Italy, NaN
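For a reproducible setup, the two datasets as DataFrames (data copied from above):
import pandas as pd

player_nationalities = pd.DataFrame({
    'Player': ['Kylian Mbappé', 'Wissam Ben Yedder', 'Gianluigi Donnarumma'],
    'Nationality': ['France', 'France', 'Italy'],
})
player_ratings = pd.DataFrame({
    'Player': ['K. Mbappé', 'W. Ben Yedder', 'G. Donnarumma'],
    'Rating': [93, 89, 91],
})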
You would need to normalize the keys in both DataFrames in order to merge them.
One idea would be to create a function to process the full name in player_nationalities and merge on the processed value for the player name, e.g.:
def convert_player_name(name):
    try:
        first_name, last_name = name.split(' ', maxsplit=1)
        return f'{first_name[0]}. {last_name}'
    except ValueError:
        return name
player_nationalities['processed_name'] = [convert_player_name(name) for name in player_nationalities['Player']]
df_merged = player_nationalities.merge(player_ratings, left_on='processed_name', right_on='Player')
[out]
Player_x Nationality processed_name Player_y Rating
0 Kylian Mbappé France K. Mbappé K. Mbappé 93
1 Wissam Ben Yedder France W. Ben Yedder W. Ben Yedder 89
2 Gianluigi Donnarumma Italy G. Donnarumma G. Donnarumma 91
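If you want the original three columns back, a possible cleanup (column names taken from the merged output above):
df_final = (df_merged
            .drop(columns=['processed_name', 'Player_y'])
            .rename(columns={'Player_x': 'Player'}))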
I have two dataframes, Table1 and Table2 (both with columns Country, City and Initiative; shown as images in the original post).
How to find:
1. The country-city combinations that are present in Table2 but not in Table1. Here [India-Mumbai] is the expected output.
2. For each country-city combination present in both tables, the "Initiatives" that are present in Table2 but not in Table1. Here the expected output is {"India-Bangalore": [Textile, Irrigation], "USA-Texas": [Irrigation]}.
To answer the first question, we can use the merge method and keep only the NaN rows:
>>> df_merged = pd.merge(df_1, df_2, on=['Country', 'City'], how='left', suffixes = ['_1', '_2'])
>>> df_merged[df_merged['Initiative_2'].isnull()][['Country', 'City']]
Country City
13 India Mumbai
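As a side note, the same rows can be found with merge's indicator flag, which is arguably more explicit (a sketch, assuming the same df_1 and df_2):
merged = df_1.merge(df_2, on=['Country', 'City'], how='left', indicator=True)
merged.loc[merged['_merge'] == 'left_only', ['Country', 'City']].drop_duplicates()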
For the next question, we first need to remove the NaN rows from the previously merged DataFrame:
>>> df_both_table = df_merged[~df_merged['Initiative_2'].isnull()]
>>> df_both_table
Country City Initiative_1 Initiative_2
0 India Bangalore Plants Plants
1 India Bangalore Plants Textile
2 India Bangalore Plants Irrigation
3 India Bangalore Industries Plants
4 India Bangalore Industries Textile
5 India Bangalore Industries Irrigation
6 India Bangalore Roads Plants
7 India Bangalore Roads Textile
8 India Bangalore Roads Irrigation
9 USA Texas Plants Plants
10 USA Texas Plants Irrigation
11 USA Texas Roads Plants
12 USA Texas Roads Irrigation
Then, we can filter on the rows that are strictly different in columns Initiative_1 and Initiative_2 and use a groupby to get the list of Initiative_2 values:
>>> df_unique_initiative_2 = df_both_table[~(df_both_table['Initiative_1'] == df_both_table['Initiative_2'])]
>>> df_list_initiative_2 = df_unique_initiative_2.groupby(['Country', 'City'])['Initiative_2'].unique().reset_index()
>>> df_list_initiative_2
Country City Initiative_2
0 India Bangalore [Textile, Irrigation, Plants]
1 USA Texas [Irrigation, Plants]
We do the same, but this time on Initiative_1, to get that list as well:
>>> df_list_initiative_1 = df_unique_initiative_2.groupby(['Country', 'City'])['Initiative_1'].unique().reset_index()
>>> df_list_initiative_1
Country City Initiative_1
0 India Bangalore [Plants, Industries, Roads]
1 USA Texas [Plants, Roads]
To finish, we use sets to remove the remaining redundant Initiative_1 elements and get the expected result:
>>> df_list_initiative_2['Initiative'] = (df_list_initiative_2['Initiative_2'].map(set)-df_list_initiative_1['Initiative_1'].map(set)).map(list)
>>> df_list_initiative_2[['Country', 'City', 'Initiative']]
Country City Initiative
0 India Bangalore [Textile, Irrigation]
1 USA Texas [Irrigation]
Alternative approach (df1 your Table1, df2 your Table2):
combos_1, combos_2 = set(zip(df1.Country, df1.City)), set(zip(df2.Country, df2.City))
in_2_but_not_in_1 = [f"{country}-{city}" for country, city in combos_2 - combos_1]
initiatives = {
f"{country}-{city}": (
set(df2.Initiative[df2.Country.eq(country) & df2.City.eq(city)])
- set(df1.Initiative[df1.Country.eq(country) & df1.City.eq(city)])
)
for country, city in combos_1 & combos_2
}
Results:
['India-Delhi']
{'India-Bangalore': {'Irrigation', 'Textile'}, 'USA-Texas': {'Irrigation'}}
I think you got this part wrong: "The country-city combinations that are present only in Table2 but not Table1. Here [India-Mumbai] is the output". The combination India-Mumbai is not present in Table2, is it?
I have a dataframe which looks something like this:
dfA
name field country action
Sam elec USA POS
Sam elec USA POS
Sam elec USA NEG
Tommy mech Canada NEG
Tommy mech Canada NEG
Brian IT Spain NEG
Brian IT Spain NEG
Brian IT Spain POS
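As a reproducible frame (data copied from the table above):
import pandas as pd

dfA = pd.DataFrame({
    'name':    ['Sam', 'Sam', 'Sam', 'Tommy', 'Tommy', 'Brian', 'Brian', 'Brian'],
    'field':   ['elec', 'elec', 'elec', 'mech', 'mech', 'IT', 'IT', 'IT'],
    'country': ['USA', 'USA', 'USA', 'Canada', 'Canada', 'Spain', 'Spain', 'Spain'],
    'action':  ['POS', 'POS', 'NEG', 'NEG', 'NEG', 'NEG', 'NEG', 'POS'],
})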
I want to group the dataframe based on the first 3 columns, adding a new column "No_of_data". This is something I do using:
dfB = dfA.groupby(["name", "field", "country"], dropna=False).size().reset_index(name = "No_of_data")
This gives me a new dataframe which looks something like this:
dfB
name field country No_of_data
Sam elec USA 3
Tommy mech Canada 2
Brian IT Spain 3
But now I also want to add a new column to this dataframe that tells me the count of "POS" values for every combination of "name", "field" and "country". It should look something like this:
dfB
name field country No_of_data No_of_POS
Sam elec USA 3 2
Tommy mech Canada 2 0
Brian IT Spain 3 1
How do I add the new column (No_of_POS) to dfB, given that dfB does not contain the "action" (POS/NEG) information and it has to be taken from dfA?
You can pass named aggregations for the action column to the agg method (the older dict-of-functions renaming was removed in pandas 1.0):
dfA.groupby(["name", "field", "country"], as_index=False)['action']\
   .agg(No_of_data='size', No_of_POS=lambda x: x.eq('POS').sum())
You can precompute the boolean before aggregating; performance should be better as the data size increases:
(dfA.assign(action=dfA.action.eq('POS'))    # True where action == 'POS'
    .groupby(['name', 'field', 'country'], sort=False, as_index=False)
    .agg(no_of_data=('action', 'size'),
         no_of_pos=('action', 'sum')))
name field country no_of_data no_of_pos
0 Sam elec USA 3 2
1 Tommy mech Canada 2 0
2 Brian IT Spain 3 1
You can add an aggregation function when grouping your data. Check the agg() function; maybe this will help.
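For instance, a minimal sketch with named aggregation (the same idea as the answers above):
dfB = (dfA.groupby(['name', 'field', 'country'], as_index=False)
          .agg(No_of_data=('action', 'size'),
               No_of_POS=('action', lambda s: s.eq('POS').sum())))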
I am looking for an efficient way to select matching rows in two dataframes based on a shared value (the person's first name), and to combine them into a third dataframe that compares the intersection of the two and maps out their differences.
Example:
DataFrame1
FirstName, City
Mark, London
Mary, Dallas
Abi, Madrid
Eve, Paris
Robin, New York
DataFrame2
FirstName, City
Mark, Berlin
Abi, Delhi
Eve, Paris
Mary, Dallas
Francis, Rome
In the dataframes, I have potential matching/overlapping on 'FirstName', so the intersection on these is:
Mark, Mary, Abi, Eve
excluded from the join are:
Robin, Francis
I construct a dataframe that allows values from both to be compared:
DataFrameMatch
FirstName_1, FirstName_2, FirstName_Match, City_1, City_2, City_Match
And insert/update (upsert) so my output is:
DataFrameMatch
FirstName_1 FirstName_2 FirstName_Match City_1 City_2 City_Match
Mark Mark True London Berlin False
Abi Abi True Madrid Delhi False
Mary Mary True Dallas Dallas True
Eve Eve True Paris Paris True
I can then report on the difference between the two lists, and what particular fields are different.
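For reference, the two frames as code (named d1 and d2 to match the answer below; data copied from the question):
import pandas as pd

d1 = pd.DataFrame({'FirstName': ['Mark', 'Mary', 'Abi', 'Eve', 'Robin'],
                   'City': ['London', 'Dallas', 'Madrid', 'Paris', 'New York']})
d2 = pd.DataFrame({'FirstName': ['Mark', 'Abi', 'Eve', 'Mary', 'Francis'],
                   'City': ['Berlin', 'Delhi', 'Paris', 'Dallas', 'Rome']})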
merge
According to your output, you only want rows where 'FirstName' matches, plus another column that evaluates whether the cities match.
d1.merge(d2, on='FirstName', suffixes=['_1', '_2']).eval('City_Match = City_1 == City_2')
FirstName City_1 City_2 City_Match
0 Mark London Berlin False
1 Mary Dallas Dallas True
2 Abi Madrid Delhi False
3 Eve Paris Paris True
Details
A plain d1.merge(d2) joins on all common columns by default (here FirstName and City), which keeps only the fully matching rows:
FirstName City
0 Mary Dallas
1 Eve Paris
So I restricted the join key via the on argument, hence on='FirstName':
d1.merge(d2, on='FirstName')
FirstName City_x City_y
0 Mark London Berlin
1 Mary Dallas Dallas
2 Abi Madrid Delhi
3 Eve Paris Paris
That gets us closer, but now I want to adjust those suffixes:
d1.merge(d2, on='FirstName', suffixes=['_1', '_2'])
FirstName City_1 City_2
0 Mark London Berlin
1 Mary Dallas Dallas
2 Abi Madrid Delhi
3 Eve Paris Paris
Lastly, I'll add a new column that shows the evaluation of 'City_1' being equal to 'City_2'. I chose to use pandas.DataFrame.eval. You can see the results above.
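If you prefer not to use eval, plain column assignment gives the same result:
out = d1.merge(d2, on='FirstName', suffixes=['_1', '_2'])
out['City_Match'] = out['City_1'] == out['City_2']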