Merging multiple pandas dataframes by common STRINGS in a column - python

I have 6 csv files in which one column is a sentence and the second column is an integer.
The sentences are the same across all csv files, but they are in a different order from file to file.
I want to merge all data frames by sentence, so that I have one column of the sentences and then each integer column associated with that sentence from each csv file.
I've tried various merging and reducing techniques on the common 'sentence' column, but I end up with orders of magnitude more rows than I should have.
For example:
from functools import reduce
import pandas as pd

data_frames = [df1, df2, df3, df4, df5, df6]
reduce(lambda x, y: pd.merge(x, y, on='sentence', how='inner'), data_frames)
results in a dataframe with 12,502,455 rows!! I only have 4,825 rows in each csv file.
Even using:
pd.merge(df1, df2, on='sentence', how='inner')
results in a dataframe with 5,295 rows.
I know all the sentences are identical across csv files because I uploaded the same csv file of sentences to mTurk to be labeled.

It looks like your code is running correctly. I'm guessing the problem is that your sentences are not distinct: if a sentence is duplicated, an inner join pairs every occurrence on one side with every occurrence on the other (a Cartesian product), which multiplies the rows.
Can you post how many duplicate sentences are in each file?
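For illustration (made-up data, not the asker's): a sentence duplicated twice in each of two files matches 2 x 2 = 4 times under an inner join, and Series.duplicated() counts the repeats per file.
import pandas as pd

left = pd.DataFrame({'sentence': ['a', 'a', 'b'], 'n': [1, 2, 3]})
right = pd.DataFrame({'sentence': ['a', 'a', 'b'], 'm': [4, 5, 6]})

# 'a' pairs 2 x 2 = 4 times and 'b' once: 5 rows from 3-row inputs
print(pd.merge(left, right, on='sentence', how='inner').shape[0])  # 5

# count the duplicate sentences in one file before merging
print(left['sentence'].duplicated().sum())  # 1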

Your strings might differ in ways that are hard to see, such as case or surrounding whitespace.
Make sure to preprocess them first by lower-casing and stripping each sentence.
Example:
new_dfs = []
for df in dfs:
    df['sentence'] = df['sentence'].apply(lambda x: x.lower().strip())
    new_dfs.append(df)
Then you can simply merge as you mentioned. Make sure the key column has the same name in every dataframe.
Here's a simple working example:
import pandas as pd
vals1 = [[1, 'doc'], [2, 'bac'], [3, 'mec']]
vals2 = [[22, 'doc'], [12, 'mec'], [67, 'bac']]
vals3 = [[15, 'mec'], [35, 'bac'], [122, 'doc']]
df1 = pd.DataFrame(data=vals1, columns=["x","y"])
df2 = pd.DataFrame(data=vals2, columns=["x","y"])
df3 = pd.DataFrame(data=vals3, columns=["x","y"])
df4 = pd.merge(df1, df2, on='y', how='inner', suffixes=("1","2"))
df4 = pd.merge(df4, df3, on='y', how='inner')
df4.head()
Result:
   x1    y  x2    x
0   1  doc  22  122
1   2  bac  67   35
2   3  mec  12   15
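To scale this up to the asker's six files, a minimal sketch (assuming each frame has the columns 'sentence' and 'score'; 'score' is a made-up name, adapt it to the real CSVs): normalize the key, drop duplicate sentences, rename the value column per file so nothing collides, then fold the merges with reduce.
from functools import reduce
import pandas as pd

dfs = [df1, df2, df3, df4, df5, df6]  # the six frames from the question
for i, df in enumerate(dfs, start=1):
    df['sentence'] = df['sentence'].str.lower().str.strip()   # normalize the key
    df.drop_duplicates(subset='sentence', inplace=True)       # avoid Cartesian blow-up
    df.rename(columns={'score': f'score_{i}'}, inplace=True)  # 'score' is an assumed name

merged = reduce(lambda l, r: pd.merge(l, r, on='sentence', how='inner'), dfs)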

Related

How to render distinct columns/rows by comparing two dataframes in pandas?

I have two dataframes that share most of their columns, plus a few distinct columns that appear in only one of them. I want to print out those distinct columns and the common columns so I can get a better idea of what changed from one dataframe to the other. I found some interesting posts on SO but don't know why I got an error. My two dataframes have the following shapes:
df19.shape
(39831, 1952)
df20.shape
(39821, 1962)
Here is dummy data:
df1 = pd.DataFrame([[1, 2], [1, 3], [4, 6],[11,13],[10,19],[21,23]], columns=['A', 'B'])
df2 = pd.DataFrame([[3, 4,0,7], [1, 3,9,2], [4, 6,3,8],[8,5,1,6]], columns=['A', 'B','C','D'])
Current attempt
I came across a post on SO and tried the following:
res=pd.concat([df19, df20]).loc[df19.index.symmetric_difference(df20.index)]
res.shape
(10, 1984)
This gave me the distinct rows but not the distinct columns. I also tried the following, but it gave me an error:
df19.compare(df20, keep_equal=True, keep_shape=True)
How should I render distinct rows and columns by comparing two dataframes in pandas? Does anyone know an easy way to do this in Python? Any quick thoughts? Thanks
Objective
I simply want to render the distinct rows or columns when comparing the two dataframes, by column name or by row content. For instance, compared to df1, which columns are newly added to df2; similarly, which rows are added to df2, and so on. Any idea?
I would recommend getting the columns by filtering on the column names.
common = [i for i in list(df1) if i in list(df2)]
temp = df2[common]
distinct = [i for i in list(df2) if i not in list(df1)]
temp = df2[distinct]
Thanks to @Shaido, this one worked for me:
import pandas as pd
df1=pd.read_csv(data1)
df2=pd.read_csv(data2)
df1_cols = df1.columns
df2_cols = df2.columns
common_cols = df1_cols.intersection(df2_cols)
df2_not_df1 = df2_cols.difference(df1_cols)
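The snippets above cover the columns. For the row side of the question, a minimal sketch (using the dummy df1/df2 from the question and comparing rows only on the shared columns): build a tuple key per row and keep the df2 rows whose key never occurs in df1. As a side note, df19.compare(df20, ...) raises an error because DataFrame.compare() only accepts two identically-labeled frames (same shape and same row/column labels).
common_cols = df1.columns.intersection(df2.columns)

# one hashable key per row, restricted to the shared columns
key_df1 = df1[common_cols].apply(tuple, axis=1)
key_df2 = df2[common_cols].apply(tuple, axis=1)

# rows added in df2: their (A, B) values never appear in df1
new_rows = df2[~key_df2.isin(key_df1)]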

Multiple data frames contain the same column

I am trying to merge 7 different data frames on the same column (accident_no), but the problem is that some data frames contain more rows and duplicated accident_no values. E.g. table 1 (Accident) contains 200 accident_no values (all unique) and table 3 contains 196 accident_no values (all unique), but table 4 (Person) contains 400 accident_no values (some duplicated), since multiple passengers may have been involved in the same crash; in that case the accident_no repeats and the information can still be used for analysis.
The problem I am facing is that whether I try concat, join, or merge, the row count balloons and I get far more than 400 rows.
So far I have tried the methods below:
from functools import reduce

dfs = [df1, df2, df3, df5, df6, df7]
df_final = reduce(lambda left, right: pd.merge(left, right, on='ACCIDENT_NO', how='left'), dfs)
and:
dfs = [df.set_index(['ACCIDENT_NO']) for df in [df1, df2, df3, df4, df5, df6, df7]]
print(pd.concat(dfs, axis=1).reset_index())
So, is it possible that I may get more rows than 400 or am I doing something wrong?
Thanks
Consider creating a person count column with groupby().cumcount() in each data frame, then concatenating on the person and accident identifiers:
dfs = [
    df.assign(
        PERSON_NO=lambda x: x.groupby(["ACCIDENT_NO"]).cumcount().add(1)
    ).set_index(["PERSON_NO", "ACCIDENT_NO"])
    for df in [df1, df2, df3, df4, df5, df6, df7]
]
final_df = pd.concat(dfs, axis=1).reset_index()
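A toy illustration of why this works (made-up data): cumcount() numbers the repeated keys, so the (PERSON_NO, ACCIDENT_NO) pair is unique within each frame and concat can align the frames without multiplying rows.
import pandas as pd

person = pd.DataFrame({'ACCIDENT_NO': [1, 1, 2], 'AGE': [30, 7, 52]})
person['PERSON_NO'] = person.groupby('ACCIDENT_NO').cumcount().add(1)
#    ACCIDENT_NO  AGE  PERSON_NO
# 0            1   30          1
# 1            1    7          2
# 2            2   52          1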
You can try:
table1 = table1.merge(table2, on=['accident_no'], how='left')
and the same for the other tables.
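One caution that applies to both answers (made-up data for illustration): a left merge against a table that repeats the key still multiplies rows, so ending up with more than 400 rows is expected whenever the right-hand table (like Person) holds duplicate ACCIDENT_NO values.
import pandas as pd

accident = pd.DataFrame({'accident_no': [1, 2]})
person = pd.DataFrame({'accident_no': [1, 1, 2], 'age': [30, 7, 52]})

# the left merge keeps both person rows for accident 1: 3 rows, not 2
print(accident.merge(person, on='accident_no', how='left').shape[0])  # 3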

Pandas join dataframes based on different columns

I have been trying to merge multiple dataframes using the reduce() function mentioned in this link: pandas three-way joining multiple dataframes on columns.
dfs = [df0, df1, df2, dfN]
df_final = reduce(lambda left,right: pd.merge(left,right,on='name'), dfs)
However, in my case the join columns differ between the related dataframes, so I would need different left_on and right_on values for every merge.
I have come up with a workaround, which is not efficient or elegant in any way, but for now it works. I would like to know whether the same can be achieved using reduce(), or perhaps with other, more efficient alternatives. I foresee that there will be many dataframes to join down the line.
import pandas as pd
...
...
# xml files - table1.xml, table2.xml and table3.xml are converted to <dataframe1>, <dataframe2>, <dataframe3> respectively.
_df = {
    'table1': '<dataframe1>',
    'table2': '<dataframe2>',
    'table3': '<dataframe3>'
}
# variable that tells column1 of table1 is related to column2 of table2,
# which can be used as left_on/right_on while merging dataframes
_relationship = {
    'table1': {
        'table2': ['NAME', 'DIFF_NAME']},
    'table2': {
        'table3': ['T2_ID', 'T3_ID']}
}
def _join_dataframes(_rel_pair):
    # copy
    df_temp = dict(_df)
    for ele in _rel_pair:
        first_table = ele[0]
        second_table = ele[1]
        lefton = _relationship[first_table][second_table][0]
        righton = _relationship[first_table][second_table][1]
        _merged_df = pd.merge(df_temp[first_table], df_temp[second_table],
                              left_on=lefton, right_on=righton, how="inner")
        df_temp[ele[1]] = _merged_df
    return _merged_df
# I have come up with this structure based on _df.keys()
_rel_pair = [['table1', 'table2'], ['table2', 'table3']]
_join_dataframes(_rel_pair)
Why don't you just rename the columns of all the dataframes first? Note that rename maps old names to new ones and needs the columns= keyword (otherwise it renames the index):
df0.rename(columns={'old_column_name0': 'name'}, inplace=True)
...
dfN.rename(columns={'old_column_nameN': 'name'}, inplace=True)
dfs = [df0, df1, df2, ..., dfN]
df_final = reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)
Try using the concat function instead of reduce.
A simple trick I like to use when merging DFs is setting the index to the columns I want to use as a guide when merging. Example:
# note different column names 'B' and 'C'
dfA = pd.read_csv('yourfile_A.csv', index_col=['A', 'B'])
dfB = pd.read_csv('yourfile_B.csv', index_col=['C', 'D'])
df = pd.concat([dfA, dfB], axis=1)
You will need unique indexes / multi-indexes for this to work, but that should be no problem in most cases. I've never tried a very large concat, but this approach should theoretically work for N concats.
Alternatively, you can use merge instead, as it provides the left_on and right_on parameters specifically for situations where column names differ between dataframes. An example:
dfA.merge(dfB, left_on='name', right_on='username')
A more complete explanation on how to merge dfs: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
concat: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
merge: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html
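To answer the reduce() part of the question directly, a minimal sketch (df1/df2/df3 stand in for <dataframe1>/<dataframe2>/<dataframe3>, and the key names mirror the question's _relationship mapping): iterate over (next_frame, left_on, right_on) steps and fold the merges one pair at a time.
steps = [
    (df2, 'NAME', 'DIFF_NAME'),  # join table1 -> table2
    (df3, 'T2_ID', 'T3_ID'),     # join the result -> table3
]
merged = df1
for right, left_on, right_on in steps:
    merged = merged.merge(right, left_on=left_on, right_on=right_on, how='inner')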

How to merge pandas dataframes from two separate lists of dataframes

I have two lists of dfs
List1 = [df1,df2,df3]
List2 = [df4,df5,df6]
I want to merge the first df from List1 with the corresponding df from List2, i.e. df1 with df4, df2 with df5, etc.
The dfs share a common column, 'Col1'. I have tried the following code
NewList = []
for i in len(List1), len(List2):
    NewList[i] = pd.merge(List1[i], List2[i], on='Col1')
I get the error 'list index out of range'.
I realise that this seems to be a common problem, however I cannot apply any of the solutions that I have found on Stack to my particular problem.
Thanks in advance for any help
Use:
pd.concat([df1, df2])
To loop over two lists in parallel and compare them or perform an operation on their elements pairwise, you can use the zip() function.
import pandas as pd
List1 = [df1,df2,df3]
List2 = [df4,df5,df6]
NewList = []
for dfa, dfb in zip(List1, List2):
    # merges df1 with df4, df2 with df5, df3 with df6
    mdf = dfa.merge(dfb, on='Col1')
    NewList.append(mdf)
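If you prefer it compact, the same zip-based pairing collapses to a list comprehension:
NewList = [dfa.merge(dfb, on='Col1') for dfa, dfb in zip(List1, List2)]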

Delete a part of a pd.DataFrame with Python

I'm iterating over rows in my DataFrame with DataFrame.iterrows(), and if a row meets certain criteria I store it in another DataFrame. Is there a way to delete the rows that appear in both of them, like set.difference(another_set)?
I was asked to provide code, so, since I don't know the answer to my question, I worked around the problem and created another DataFrame to which I save the good data, instead of keeping two DataFrames and taking the difference of them:
def test_right_chain(self, temp):
    temp__ = pd.DataFrame()
    temp_ = pd.DataFrame()
    key = temp["nr right"].iloc[0]
    temp_ = temp_.append(temp.iloc[0])
    temp = temp[1:]
    for index, row in temp.iterrows():
        print(row)
        key_ = row['nr right']
        if abs(key_ - key) == 1:
            pass
        elif len(temp_) > 2:
            print(row)
            # append returns a new frame, so assign the result back
            temp__ = temp__.append(temp_)
            temp_ = pd.DataFrame()
        else:
            temp_ = pd.DataFrame()
        temp_ = temp_.append(row)
        key = key_
    return temp__
You can do an intersection of both DataFrames with pd.merge(df1, df2, right_index=True, how='inner'), which keeps the row indexes from the left DataFrame (I don't know why, but this happens when I use right_index=True), and then retrieve the indexes of those rows. (I used the answer from this question: Compare Python Pandas DataFrames for matching rows)
df1 = pd.DataFrame(np.random.rand(10, 4), columns=list('ABCD'))
df2 = df1.loc[4:8]  # .loc replaces the long-deprecated .ix
df2.reset_index(drop=True, inplace=True)
df2.loc[-1] = [2, 3, 4, 5]
df2.loc[-2] = [14, 15, 16, 17]
df2.reset_index(drop=True, inplace=True)
df3 = pd.merge(df1, df2, on=['A', 'B', 'C', 'D'], right_index=True, how='inner')
Now you need the indexes of the rows that appear in both DataFrames:
indexes = df3.index.values
And then you just need to drop those rows from your DataFrame:
df1 = df1.drop(df1.index[indexes])
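A more direct route to the set.difference() behaviour in modern pandas (a sketch, assuming both frames share the same columns): merging with indicator=True labels each row as 'left_only', 'right_only', or 'both', and keeping the 'left_only' rows is the DataFrame analogue of a set difference.
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.rand(10, 4), columns=list('ABCD'))
df2 = df1.loc[4:8].reset_index(drop=True)

# merge on all shared columns; the _merge column says where each row came from
# (note: merge resets the index, so keep df1's index as a column first
# if you need to map back to the original labels)
merged = df1.merge(df2, how='left', indicator=True)
df1_only = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')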
