pandas three-way joining multiple dataframes on columns - python

I have 3 CSV files. Each has the first column as the (string) names of people, while all the other columns in each dataframe are attributes of that person.
How can I "join" together all three CSV documents to create a single CSV with each row having all the attributes for each unique value of the person's string name?
The documentation for pandas' join() function says that I need a multiindex, but I'm confused about what a hierarchical indexing scheme has to do with making a join based on a single index.

Zero's answer is basically a reduce operation. If I had more than a handful of dataframes, I'd put them in a list like this (generated via list comprehensions or loops or whatnot):
dfs = [df0, df1, df2, ..., dfN]
Assuming they have a common column, like name in your example, I'd do the following:
import functools as ft
import pandas as pd

df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)
That way, your code should work with whatever number of dataframes you want to merge.
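If the frames don't all share every name, a how='outer' merge keeps the union of keys instead of the intersection. A minimal, self-contained sketch (the frames and their contents here are invented for illustration):
import functools as ft
import pandas as pd

# hypothetical frames that only partially overlap on 'name'
df0 = pd.DataFrame({'name': ['a', 'b'], 'x': [1, 2]})
df1 = pd.DataFrame({'name': ['a', 'c'], 'y': [3, 4]})
df2 = pd.DataFrame({'name': ['a', 'b'], 'z': [5, 6]})

dfs = [df0, df1, df2]
# how='outer' keeps every name seen in any frame; missing cells become NaN
df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='name', how='outer'), dfs)
print(df_final)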

You could try this if you have 3 dataframes:
# Merge multiple dataframes
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

pd.merge(pd.merge(df1, df2, on='name'), df3, on='name')
Alternatively, as mentioned by cwharland:
df1.merge(df2, on='name').merge(df3, on='name')
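With this data, either call produces one row per name:
  name attr11 attr12 attr21 attr22 attr31 attr32
0    a      5      9      5     19     15     49
1    b      4     61     14     16      4     36
2    c     24      9      4      9     14      9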

This is an ideal situation for the join method
The join method is built exactly for these types of situations. You can join any number of DataFrames together with it. The calling DataFrame joins with the index of the collection of passed DataFrames. To work with multiple DataFrames, you must put the joining columns in the index.
The code would look something like this:
filenames = ['fn1', 'fn2', 'fn3', 'fn4', ....]
dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames]
dfs[0].join(dfs[1:])
With #zero's data, you could do this:
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:])
     attr11 attr12 attr21 attr22 attr31 attr32
name
a         5      9      5     19     15     49
b         4     61     14     16      4     36
c        24      9      4      9     14      9

In Python 3.6.3 with pandas 0.22.0 you can also use concat, as long as you set the columns you want to use for the joining as the index:
pd.concat(
    objs=(iDF.set_index('name') for iDF in (df1, df2, df3)),
    axis=1,
    join='inner'
).reset_index()
where df1, df2, and df3 are defined as in John Galt's answer:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12']
)
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22']
)
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32']
)
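Here join='inner' keeps only the names present in all three frames. If some names are missing from some frames, join='outer' keeps the union instead, filling the gaps with NaN:
pd.concat(
    objs=(iDF.set_index('name') for iDF in (df1, df2, df3)),
    axis=1,
    join='outer'
).reset_index()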

This can also be done as follows for a list of dataframes df_list:
df = df_list[0]
for df_ in df_list[1:]:
    df = df.merge(df_, on='join_col_name')
or if the dataframes are in a generator object (e.g. to reduce memory consumption):
df = next(df_list)
for df_ in df_list:
    df = df.merge(df_, on='join_col_name')
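For example, the generator can come from reading files lazily, so that only the running result and one new frame are in memory at a time (a sketch; the file names are hypothetical, and each CSV is assumed to share the join column):
import pandas as pd

paths = ['part1.csv', 'part2.csv', 'part3.csv']  # hypothetical files
df_list = (pd.read_csv(p) for p in paths)        # generator: nothing is read yet

df = next(df_list)
for df_ in df_list:
    df = df.merge(df_, on='join_col_name')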

Simple solution:
If the column names are the same:
df1.merge(df2, on='col_name').merge(df3, on='col_name')
If the column names are different:
df1.merge(df2, left_on='col_name1', right_on='col_name2') \
   .merge(df3, left_on='col_name1', right_on='col_name3') \
   .drop(columns=['col_name2', 'col_name3']) \
   .rename(columns={'col_name1': 'col_name'})
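A minimal runnable sketch of the different-names case (the frames and the key names col_name1/col_name2/col_name3 are invented for illustration):
import pandas as pd

df1 = pd.DataFrame({'col_name1': ['a', 'b'], 'v1': [1, 2]})
df2 = pd.DataFrame({'col_name2': ['a', 'b'], 'v2': [3, 4]})
df3 = pd.DataFrame({'col_name3': ['a', 'b'], 'v3': [5, 6]})

out = (df1.merge(df2, left_on='col_name1', right_on='col_name2')
          .merge(df3, left_on='col_name1', right_on='col_name3')
          .drop(columns=['col_name2', 'col_name3'])
          .rename(columns={'col_name1': 'col_name'}))
print(out)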

Here is a method to merge a dictionary of data frames while keeping the column names in sync with the dictionary. Also it fills in missing values if needed:
This is the function to merge a dict of data frames
def MergeDfDict(dfDict, onCols, how='outer', naFill=None):
    keys = list(dfDict.keys())
    for i in range(len(keys)):
        key = keys[i]
        df0 = dfDict[key]
        cols = list(df0.columns)
        valueCols = list(filter(lambda x: x not in onCols, cols))
        df0 = df0[onCols + valueCols]
        # suffix each value column with its dict key to keep names in sync
        df0.columns = onCols + [(s + '_' + key) for s in valueCols]

        if i == 0:
            outDf = df0
        else:
            outDf = pd.merge(outDf, df0, how=how, on=onCols)

    if naFill is not None:
        outDf = outDf.fillna(naFill)

    return outDf
OK, let's generate data and test this:
def GenDf(size):
    df = pd.DataFrame({
        'categ1': np.random.choice(a=['a', 'b', 'c', 'd', 'e'], size=size, replace=True),
        'categ2': np.random.choice(a=['A', 'B'], size=size, replace=True),
        'col1': np.random.uniform(low=0.0, high=100.0, size=size),
        'col2': np.random.uniform(low=0.0, high=100.0, size=size)
    })
    df = df.sort_values(['categ2', 'categ1', 'col1', 'col2'])
    return df

size = 5
dfDict = {'US': GenDf(size), 'IN': GenDf(size), 'GER': GenDf(size)}
MergeDfDict(dfDict=dfDict, onCols=['categ1', 'categ2'], how='outer', naFill=0)

One does not need a multiindex to perform join operations.
One just needs to set the index column on which to perform the join operations correctly (with the command df.set_index('Name'), for example).
The join operation is performed on the index by default.
In your case, you just have to specify that the Name column corresponds to your index.
Below is an example; a tutorial may also be useful.
# Simple example where the dataframes' index is the name on which to
# perform the join operations
import pandas as pd
import numpy as np

name = ['Sophia', 'Emma', 'Isabella', 'Olivia', 'Ava', 'Emily', 'Abigail', 'Mia']
df1 = pd.DataFrame(np.random.randn(8, 3), columns=['A', 'B', 'C'], index=name)
df2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=name)
df3 = pd.DataFrame(np.random.randn(8, 2), columns=['E', 'F'], index=name)

df = df1.join(df2)
df = df.join(df3)

# If you have a 'Name' column that is not the index of your dataframe,
# you can set this column to be the index
# 1) Create a column 'Name' based on the previous index
df1['Name'] = df1.index
# 2) Select the index from column 'Name'
df1 = df1.set_index('Name')

# If the indexes are different, one may have to play with the parameter how
gf1 = pd.DataFrame(np.random.randn(8, 3), columns=['A', 'B', 'C'], index=range(8))
gf2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=range(2, 10))
gf3 = pd.DataFrame(np.random.randn(8, 2), columns=['E', 'F'], index=range(4, 12))

gf = gf1.join(gf2, how='outer')
gf = gf.join(gf3, how='outer')

There is another solution from the pandas documentation (that I don't see here), using .append:
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
>>> df
   A  B
0  1  2
1  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
>>> df2
   A  B
0  5  6
1  7  8
>>> df.append(df2, ignore_index=True)
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
The ignore_index=True is used to ignore the index of the appended dataframe, replacing it with the next index available in the source one.
If the column names differ, NaN values will be introduced.
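Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; the equivalent today is pd.concat:
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

# same result as the old df.append(df2, ignore_index=True)
out = pd.concat([df, df2], ignore_index=True)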

I tweaked the accepted answer to perform the operation for multiple dataframes with different suffix parameters using reduce, and I guess it can be extended to different on parameters as well.
from functools import reduce

dfs_with_suffixes = [(df2, suffix2), (df3, suffix3), (df4, suffix4)]

merge_one = lambda x, y, sfx: pd.merge(x, y, on=['col1', 'col2', ...], suffixes=sfx)

merged = reduce(lambda left, right: merge_one(left, *right), dfs_with_suffixes, df1)
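For example, a self-contained sketch (the frames and suffixes are invented for illustration); each suffix entry is a 2-tuple because pd.merge expects suffixes for both sides:
from functools import reduce
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b'], 'score': [1, 2]})
df2 = pd.DataFrame({'name': ['a', 'b'], 'score': [3, 4]})
df3 = pd.DataFrame({'name': ['a', 'b'], 'score': [5, 6]})

# pair each frame with the suffixes its overlapping columns should get
dfs_with_suffixes = [(df2, ('', '_two')), (df3, ('', '_three'))]

merge_one = lambda x, y, sfx: pd.merge(x, y, on=['name'], suffixes=sfx)
merged = reduce(lambda left, right: merge_one(left, *right), dfs_with_suffixes, df1)
print(merged)  # columns: name, score, score_two, score_three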

df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12']
)
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['d', 14, 16]]),
    columns=['name', 'attr21', 'attr22']
)
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['c', 4, 36],
    ['d', 14, 9]]),
    columns=['name', 'attr31', 'attr32']
)
df4 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['c', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr41', 'attr42']
)
Three ways to join a list of dataframes:
pandas.concat
dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
# cannot run if the index is not unique
dfs = pd.concat(dfs, join='outer', axis=1)
functools.reduce
dfs = [df1, df2, df3, df4]
# still runs even if the index is not unique
import functools as ft
df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='name', how='outer'), dfs)
join
# cannot run if the index is not unique
dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:], how='outer')

Joining together all three can be done using the .join() function.
You have three DataFrames, let's say df1, df2, df3.
To join these into one DataFrame you can:
df = df1.join(df2).join(df3)
This is the simplest way I found to do this task.
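Keep in mind that .join() aligns on the index, so this one-liner assumes the shared key is already the index of all three frames. A minimal sketch, assuming a common 'name' column:
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b'], 'x': [1, 2]}).set_index('name')
df2 = pd.DataFrame({'name': ['a', 'b'], 'y': [3, 4]}).set_index('name')
df3 = pd.DataFrame({'name': ['a', 'b'], 'z': [5, 6]}).set_index('name')

df = df1.join(df2).join(df3)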

Related

Left Join tables in python where the number of input tables can change each time

I have a need where the number of tables to be merged can change each time depending on data availability. For example, below it can sometimes be all three (df1, df2, df3), any two (df1 and df2, df2 and df3, or df1 and df3), or a single one. The join key in all of them is 'name' and the type of join is left.
I don't want the end user to input the table names manually, but rather have a function which can do it dynamically depending on the data available from previous steps, skipping the dataframes which are not available.
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])
Can this be done by left join using try & except?
One way I found to do a dynamic merge is to create an empty dataframe with a schema for each table:
Example for df1
# Create an empty RDD
emptyRDD = spark.sparkContext.emptyRDD()

from pyspark.sql.types import StructType, StructField, StringType, DoubleType, LongType

schema = StructType([
    StructField('name', StringType(), True),
    StructField('attr11', DoubleType(), True),
    StructField('attr12', LongType(), True)
])

# Create the dataset even if it is not used
df1 = spark.createDataFrame(emptyRDD, schema)
Then pass it to the join function. This way, even when there is no data, an empty dataframe will get merged. We only have to delete the columns in which all values are null, and the final output is ready.
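A pure-pandas alternative (a sketch, not from the original answer): collect whichever frames actually exist into a list and left-merge what is there, so unavailable tables are simply skipped. The dict of candidate frames is a hypothetical stand-in for however the previous steps expose their outputs:
import functools as ft
import pandas as pd

# hypothetical registry of outputs from previous steps; absent ones are None
candidates = {'df1': df1, 'df2': None, 'df3': df3}
available = [df for df in candidates.values() if df is not None]

if available:
    result = ft.reduce(
        lambda left, right: pd.merge(left, right, on='name', how='left'),
        available)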

Mapping a multiindex dataframe to another using row ID

I have two dataframes of different shape
The 'ANTENNA1' and 'ANTENNA2' columns in the bigger dataframe correspond to the ID column in the smaller dataframe. I want to merge the smaller dataframe into the bigger one so that the bigger dataframe will have '(POSITION, col1)', '(POSITION, col2)', '(POSITION, col3)' according to ANTENNA1 == ID.
Edit: I tried pd.merge, but it changes the original dataframe's column values.
Original:
df = pd.merge(df_main, df_sub, left_on='ANTENNA1', right_on='id', how='left')
Result:
I want to keep the original dataframe columns as they are.
Assuming your first dataframe (with positions) is called df1 and the second is called df2, with your loaded data you could just use pandas.DataFrame.merge (i.e. pd.merge(...)):
df = pd.merge(df1, df2, left_on='id', right_on='ANTENNA1')
Then you might select the columns you need (col1, col2, ...) to get the desired result: df[["col1", "col2", ...]].
Simple example:
# import pandas as pd
import pandas as pd

# creating dataframes as df1 and df2
df1 = pd.DataFrame({'ID': [1, 2, 3, 5, 7, 8],
                    'Name': ['Sam', 'John', 'Bridge',
                             'Edge', 'Joe', 'Hope']})
df2 = pd.DataFrame({'id': [1, 2, 4, 5, 6, 8, 9],
                    'Marks': [67, 92, 75, 83, 69, 56, 81]})

# merging df1 and df2 by ID
# i.e. the rows with common IDs get
# merged, i.e. {1, 2, 5, 8}
df = pd.merge(df1, df2, left_on="ID", right_on="id")
print(df)
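With this data the printed result keeps only the IDs present in both frames, {1, 2, 5, 8}:
   ID  Name  id  Marks
0   1   Sam   1     67
1   2  John   2     92
2   5  Edge   5     83
3   8  Hope   8     56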

Dynamically Concatenate/Merge Columns of Same Name from n Data Frames Into New Data Frame

I have 17 data frames in a list dataframes. They all have the same column names and length, save for the first column, which describes the source of the data. There are 7 columns which describe the date for the data, which is again the same for each data frame across each row. So there are a total of 19 columns per data frame. What I would like to do is dynamically concatenate each of the columns which have the same column name, such that there is a total of 11 data frames with 24 columns: 7 which describe the date, and the other 17 the concatenated columns which shared the same column name across the list of 17 data frames.
Below is just an example of 3 data frames and the expected outcome.
df1 = pd.DataFrame(np.array([
    ['a', 1, 3, 9],
    ['a', 2, 4, 61],
    ['a', 3, 24, 9]]),
    columns=['name', 'date', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['b', 1, 5, 19],
    ['b', 2, 14, 16],
    ['b', 3, 4, 9]]),
    columns=['name', 'date', 'attr11', 'attr12'])
df3 = pd.DataFrame(np.array([
    ['c', 1, 3, 49],
    ['c', 2, 4, 36],
    ['c', 3, 14, 9]]),
    columns=['name', 'date', 'attr11', 'attr12'])
Result
dfattr11
[[1, 3, 5, 3],
 [2, 4, 14, 4],
 [3, 24, 4, 14]],
columns=['date', 'attr11', 'attr11', 'attr11']
dfattr12 ...
new_dataframes = [dfattr11, dfattr12, ...]
I tried using Pandas Python: Concatenate dataframes having same columns for guidance, but it seems like that solution stacked the columns vertically as opposed to placing them in parallel.
I know how I would use concat to create a new data frame, but the challenge arises when trying to do it iteratively or dynamically, as there are 17 data frames, each with 11 columns that need to be put into their separate df. Any help would be greatly appreciated.
IIUC, you could use pandas.concat to generate a big dataframe with all data and split it using groupby. You will get a dictionary of dataframes as output:
dfs = [df1, df2, df3]

out = {k: d.droplevel(0, axis=1) for k, d in
       pd.concat({d['name'].iloc[0]: d.set_index('date')
                                      .drop(columns='name')
                  for d in dfs}, axis=1)
         .groupby(level=1, axis=1)
       }
Output:
{'attr11':      attr11 attr11 attr11
 date
 1           3      5      3
 2           4     14      4
 3          24      4     14,
 'attr12':      attr12 attr12 attr12
 date
 1           9     19     49
 2          61     16     36
 3           9      9      9}

Manage the missing value in a dataframe with string and number

I have a dataframe with some string columns and number columns. I want to manage the missing values.
I want to replace the NaN values with the mean of each row.
I saw different questions on this website; however, they are different from my question. Like this one: Pandas Dataframe: Replacing NaN with row average
If all the values of a row are NaN, I want to delete that row. I have also provided a sample case as follows:
import pandas as pd
import numpy as np

# input dataframe
df = pd.DataFrame()
df['id'] = ['a', 'b', 'c', 'n']
df['md'] = ['d', 'e', 'f', 'l']
df['c1'] = [2, np.nan, np.nan, 5]
df['c2'] = [0, 5, np.nan, 3]
df['c3'] = [8, 7, np.nan, np.nan]

# desired output (row 'c' dropped, NaNs replaced by the row mean)
df = pd.DataFrame()
df['id'] = ['a', 1, 'n']
df['md'] = ['d', 6, 'l']
df['c1'] = [2, 6, 5]
df['c2'] = [0, 5, 3]
df['c3'] = [8, 7, 4]
df
Note:
I have used the following code; however, it is very slow, and for a big dataframe it takes a long time to run.
index_colum = df.columns.get_loc("c1")
df_withno_id = df.iloc[:, index_colum:]
rowsidx_with_all_NaN = df_withno_id[df_withno_id.isnull().all(axis=1)].index.values
df = df.drop(df.index[rowsidx_with_all_NaN])

for i, cols in df_withno_id.iterrows():
    if i not in rowsidx_with_all_NaN:
        endsidx = len(cols)
        extract_data = list(cols[0:endsidx])
        mean = np.nanmean(extract_data)
        fill_nan = [mean for x in extract_data if np.isnan(x)]
        df.loc[i] = df.loc[i].replace(np.nan, mean)
Can anybody help me with this? Thanks.
First, you can select only the float column types. Second, for these columns, drop the rows where all values are NaN. Finally, you can transpose the dataframe (only the float columns), fill each column's NaNs with its mean (which corresponds to the row mean of the original), and transpose again.
import pandas as pd
import numpy as np

df = pd.DataFrame()
df['id'] = ['a', 'b', 'c', 'n']
df['md'] = ['d', 'e', 'f', 'l']
df['c1'] = [2, np.nan, np.nan, 5]
df['c2'] = [0, 5, np.nan, 3]
df['c3'] = [8, 7, np.nan, np.nan]

numeric_cols = df.select_dtypes(include='float64').columns
df.dropna(how='all', subset=numeric_cols, inplace=True)
df[numeric_cols] = df[numeric_cols].T.fillna(df[numeric_cols].T.mean()).T
df
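On the sample data this drops row 'c' and fills each NaN with its row mean (b: mean(5, 7) = 6; n: mean(5, 3) = 4), which matches the desired output above:
  id md   c1   c2   c3
0  a  d  2.0  0.0  8.0
1  b  e  6.0  5.0  7.0
3  n  l  5.0  3.0  4.0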

How to match two dataframe columns and return matching values on a separate column in Python?

I am a newbie in Python and I need some help.
I have 2 data frames containing a list of users with a list of recommended friends from two tables.
I would like to achieve the following:
Sort the list of recommended friends in ascending order from the 2 data frames for each user.
Match the list of recommended friends from dataframe2 to dataframe1 for each user. Return only the matched values.
I have tried the code below, but it didn't achieve the desired results.
import pandas as pd
import numpy as np

# load data from csv
df1 = pd.read_csv('CommonFriend.csv')
df2 = pd.read_csv('InfluenceFriend.csv')
print(df1)
print(df2)

# convert values to list to sort by recommended friends ID
df1.values.tolist()
df1.sort_values(by=['User', 'RecommendedFriends'])
df2.values.tolist()
df2.sort_values(by=['User', 'RecommendedFriends'])

# obtain only matched values from list of recommended friends from df1 and df2
df3 = df1.merge(df2, how='inner', on='User')

# return dataframe with user, matched recommended friends IDs
print(df3)
Problems encountered:
The elements in each list are not sorted in ascending order.
While matching the elements through pandas merge with "inner join", it seems that it is not able to read certain elements.
Update: Below are the data frame headers which cause some errors in the code.
This should be the solution to your problem. You might have to change a few variables, but you get the idea: you merge the two dataframes on the users, so you get a dataframe with both lists for each user. You then take the intersection of both lists and store it in a new column.
df1 = pd.DataFrame(np.array([[1, [5, 7, 10, 11]], [2, [3, 8, 5, 12]]]),
                   columns=['User', 'Recommended friends'])
df2 = pd.DataFrame(np.array([[1, [5, 7, 9]], [2, [4, 7, 10]], [3, [15, 7, 9]]]),
                   columns=['User', 'Recommended friends'])

df3 = pd.merge(df1, df2, on='User')
df3['intersection'] = [list(set(a).intersection(set(b)))
                       for a, b in zip(df3['Recommended friends_x'],
                                       df3['Recommended friends_y'])]
Output df3:
   User Recommended friends_x Recommended friends_y intersection
0     1        [5, 7, 10, 11]             [5, 7, 9]       [5, 7]
1     2         [3, 8, 5, 12]            [4, 7, 10]           []
I do not quite understand what exactly your problem is, but in general you will have to assign the dataframe back to itself.
import pandas as pd
import numpy as np

df1 = pd.read_csv('CommonFriend.csv')
df2 = pd.read_csv('InfluenceFriend.csv')
print(df1)
print(df2)

# assign the sorted result back (sort_values does not modify in place by default)
df1 = df1.sort_values(by=['User', 'RecommendedFriends'])
df2 = df2.sort_values(by=['User', 'RecommendedFriends'])

df3 = df1.merge(df2, how='inner', on='User')
print(df3)
