I am trying to merge 7 different data frames on the same column (accident_no), but some of the data frames have more rows and contain duplicate accident_no values. For example:
Table 1 (Accident) contains 200 accident_no values (all unique) and table 3 contains 196 (all unique), but table 4 (Person) contains 400 accident_no values, some of them duplicated, because several passengers may have been involved in the same crash; those rows share an accident_no and the information is still useful for the analysis.
The problem I am facing is that whatever I try (concat, join, merge), the result grows to the largest row count and I end up with more than 400 rows.
So far I have tried the methods below:
from functools import reduce

dfs = [df1, df2, df3, df5, df6, df7]
df_final = reduce(lambda left, right: pd.merge(left, right, on='ACCIDENT_NO', how='left'), dfs)
and
dfs = [df.set_index(['ACCIDENT_NO']) for df in [df1, df2, df3, df4, df5, df6, df7]]
print(pd.concat(dfs, axis=1).reset_index())
So, is it possible to end up with more than 400 rows this way, or am I doing something wrong?
Thanks
Consider creating a person-count column with groupby().cumcount() in each data frame, then concatenating on the person and accident identifiers:
dfs = [
    df.assign(
        PERSON_NO=lambda x: x.groupby(["ACCIDENT_NO"]).cumcount().add(1)
    ).set_index(["PERSON_NO", "ACCIDENT_NO"])
    for df in [df1, df2, df3, df4, df5, df6, df7]
]
final_df = pd.concat(dfs, axis=1).reset_index()
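For reference, the row count in the original attempts can exceed 400 whenever the join key is duplicated on both sides of a merge, because every left row pairs with every matching right row. A minimal sketch with made-up columns:

import pandas as pd

left = pd.DataFrame({'ACCIDENT_NO': [1, 1], 'SEX': ['M', 'F']})
right = pd.DataFrame({'ACCIDENT_NO': [1, 1], 'VEHICLE': ['car', 'bike']})

# two matching rows on each side -> 2 x 2 = 4 rows in the result
print(pd.merge(left, right, on='ACCIDENT_NO', how='left'))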
You can try:
table1 = table1.merge(table2, on=['accident_no'], how='left')
and repeat the same merge for the other tables, as sketched below.
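A minimal sketch of repeating that left merge over the remaining tables (the table names here are assumed; adjust them to your own variables):

merged = table1.copy()
for other in [table2, table3, table4, table5, table6, table7]:
    # each pass keeps all rows of the running result and adds matching columns
    merged = merged.merge(other, on='accident_no', how='left')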
For example, if I have multiple dataframes such as df1, df2 and df3, each with a 'phone_no' column, how do I search for a phone_no in every dataframe and return the rows where that number is present?
For example
df_all = [df1, df2, df3]
for i in df_all:
    print(i.loc[i['phone_no'] == 9999999999])
The above code returns empty output, but it should return the rows containing that particular phone number. How do I resolve this?
Check if this works by comparing phone_no to a string:
df_all = [df1, df2, df3]
for i in df_all:
    print(i.loc[i['phone_no'].astype(str) == '9999999999'])
You may not need to convert phone_no to str if it is already a string; check the dtype first:
>>> print(df1['phone_no'].dtype)
object
# OR
>>> print(df1['phone_no'].dtype)
int64
Update
df_all = [df1, df2, df3]
df_filtered = []
for i in df_all:
    df_filtered.append(i.loc[i['phone_no'].astype(str) == '9999999999'])
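If a single result is more convenient, the filtered pieces can be stacked afterwards; a small follow-up sketch:

# combine the per-dataframe matches into one frame
result = pd.concat(df_filtered, ignore_index=True)
print(result)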
I want to use the reduce() function to merge data.
final = reduce(lambda left,right: pd.merge(left,right,on='KEY',how="outer"), [df1, df2, df3, df4, df5, df6, df7, df8])
However, sometimes some of the dataframes df1 to df8 might be blank (but at least one of them is not blank),
and I do not want to have to detect which ones.
For example, this time df1 to df7 are blank and only df8 is non-blank; next time df1, df2 and df5 are non-blank.
How can I do this?
You can rewrite your function to check for blank dataframes using the property DataFrame.empty:
def my_merge(left, right):
    if left.empty:
        return right
    if right.empty:
        return left
    # keep the join arguments from the question
    return pd.merge(left, right, on='KEY', how='outer')

final = reduce(my_merge, list_of_dfs)
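A quick check with toy frames (made-up data, using the KEY column from the question) to show that the blank ones are simply skipped:

import pandas as pd
from functools import reduce

df_a = pd.DataFrame({'KEY': [1, 2], 'a': [10, 20]})
df_b = pd.DataFrame()                                  # blank, gets skipped
df_c = pd.DataFrame({'KEY': [2, 3], 'c': [200, 300]})

final = reduce(my_merge, [df_b, df_a, df_c])
print(final)   # outer merge of df_a and df_c on KEY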
I have 2 data frames from a basic web scrape using Pandas (below). The second table has fewer columns than the first, and I need to concat the dataframes. I have been manually inserting the missing columns for a while, but since they change frequently I would like a function that checks which columns of df1 are missing from df2 and, if any are missing, adds them to df2 so the data lines up before concatenating.
import pandas as pd
link = 'https://en.wikipedia.org/wiki/Opinion_polling_for_the_next_French_presidential_election'
df = pd.read_html(link, header=0)
df1 = df[1]
df1 = df1.drop([0])
df1 = df1.drop('Abs.', axis=1)
df2 = df[2]
df2 = df2.drop([0])
df2 = df2.drop('Abs.', axis=1)
Many thanks,
#divingTobi's answer:
pd.concat([df1, df2]) does the trick.
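For context, concat aligns on column names and fills anything missing in one frame with NaN, so the extra columns do not have to be added by hand. A small sketch with made-up polling columns:

import pandas as pd

a = pd.DataFrame({'Pollster': ['A'], 'Macron': [25], 'Le Pen': [24]})
b = pd.DataFrame({'Pollster': ['B'], 'Macron': [27]})      # fewer columns

# 'Le Pen' is simply NaN for the row coming from b
print(pd.concat([a, b], ignore_index=True))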
I have been trying to merge multiple dataframes using the reduce() function mentioned in this link: pandas three-way joining multiple dataframes on columns.
dfs = [df0, df1, df2, dfN]
df_final = reduce(lambda left,right: pd.merge(left,right,on='name'), dfs)
However, in my case the join columns are different for the related dataframes. Therefore I would need to use different left_on and right_on values on every merge.
I have come up with a workaround, which is not efficient or elegant in any way, but for now it works. I would like to know if the same can be achieved using reduce(), or maybe with other, more efficient alternatives. I foresee that there will be many more dataframes to join down the line.
import pandas as pd
...
...
# xml files - table1.xml, table2.xml and table3.xml are converted to <dataframe1>, <dataframe2>, <dataframe3> respectively.
_df = {
    'table1': '<dataframe1>',
    'table2': '<dataframe2>',
    'table3': '<dataframe3>'
}

# variable that tells column1 of table1 is related to column2 of table2, which can be used as left_on/right_on while merging dataframes
_relationship = {
    'table1': {
        'table2': ['NAME', 'DIFF_NAME']},
    'table2': {
        'table3': ['T2_ID', 'T3_ID']}
}
def _join_dataframes(_rel_pair):
    # work on a copy of the table mapping
    df_temp = dict(_df)
    for ele in _rel_pair:
        first_table = ele[0]
        second_table = ele[1]
        lefton = _relationship[first_table][second_table][0]
        righton = _relationship[first_table][second_table][1]
        _merged_df = pd.merge(df_temp[first_table], df_temp[second_table],
                              left_on=lefton, right_on=righton, how="inner")
        df_temp[ele[1]] = _merged_df
    return _merged_df
# I have come up with this structure based on _df.keys()
_rel_pair = [['table1', 'table2'], ['table2', 'table3']]
_join_dataframes(_rel_pair)
Why don't you just rename the columns of all the dataframes first?
df0.rename(columns={'old_column_name0': 'commonname'}, inplace=True)
...
dfN.rename(columns={'old_column_nameN': 'commonname'}, inplace=True)
dfs = [df0, df1, df2, ... , dfN]
df_final = reduce(lambda left, right: pd.merge(left, right, on='commonname'), dfs)
Try using the concat function instead of reduce.
A simple trick I like to use when merging DFs is to set the index on the columns I want to use as a guide for the merge. Example:
# note different column names 'B' and 'C'
dfA = pd.read_csv('yourfile_A.csv', index_col=['A', 'B'])
dfB = pd.read_csv('yourfile_B.csv', index_col=['C', 'D'])
df = pd.concat([dfA, dfB], axis=1)
You will need unique indexes / multiindexes for this to work, but I think this should be no problem in most cases. I have never tried a very large concat, but this approach should in principle work for N concats.
Alternatively, you can use merge instead, as it provides left_on and right_on parameters specifically for situations where the column names differ between dataframes. An example:
dfA.merge(dfB, left_on='name', right_on='username')
A more complete explanation on how to merge dfs: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
concat: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
merge: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html
I have 4 data frames:
df1 = pd.read_csv('values1.csv')
df2 = pd.read_csv('values2.csv')
df3 = pd.read_csv('values3.csv')
df4 = pd.read_csv('values4.csv')
Each of them has the same structure, with a category column and a values column.
I want to create a new data frame with aggregated values for each category across all the data frames. The new data frame should contain values calculated using the formula:
Total['values'][0] = df1['values'][0] / (df1['values'][0] + df2['values'][0] + df3['values'][0] + df4['values'][0] )
It should generate values like this for all the rows.
Can someone please help me out?
First concatenate all the DataFrames and aggregate the sum per category into a Series, then set category as the index of df1's values and divide with Series.div:
s = pd.concat([df1, df2, df3, df4]).groupby('category')['values'].sum()
out = df1.set_index('category')['values'].div(s).reset_index(name='total')
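A quick check with toy data (made-up categories and values), assuming each frame has category and values columns:

import pandas as pd

df1 = pd.DataFrame({'category': ['a', 'b'], 'values': [1, 2]})
df2 = pd.DataFrame({'category': ['a', 'b'], 'values': [3, 4]})
df3 = pd.DataFrame({'category': ['a', 'b'], 'values': [5, 6]})
df4 = pd.DataFrame({'category': ['a', 'b'], 'values': [7, 8]})

s = pd.concat([df1, df2, df3, df4]).groupby('category')['values'].sum()
out = df1.set_index('category')['values'].div(s).reset_index(name='total')
print(out)   # 'a' -> 1 / 16, 'b' -> 2 / 20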
EDIT:
s = pd.concat([df1, df2, df3, df4]).groupby('category')['values'].sum()
s1 = pd.concat([df1, df2]).groupby('category')['values'].sum()
out = s1.div(s).reset_index(name='new')