Using reduce() to merge when some DataFrames are blank - python

I want to use the reduce() function to merge data:
final = reduce(lambda left,right: pd.merge(left,right,on='KEY',how="outer"), [df1, df2, df3, df4, df5, df6, df7, df8])
However, some of the dataframes df1 to df8 may be blank (although at least one of them is always non-blank).
I do not want to have to detect which ones are blank.
For example, this time df1 to df7 are blank and only df8 is non-blank. Next time df1, df2 and df5 are non-blank.
How should I do this?

You can rewrite your function to check for blank dataframes using the property DataFrame.empty:
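def my_merge(left, right):
    # A blank frame contributes nothing, so pass the other one through
    if left.empty:
        return right
    if right.empty:
        return left
    # Same merge as in the question
    return pd.merge(left, right, on='KEY', how='outer')
final = reduce(my_merge, list_of_dfs)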
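As an alternative sketch, the blank frames can be dropped before reducing, so the merge function itself stays unchanged. This assumes every non-blank frame has the 'KEY' column from the question and that at least one frame is non-blank (reduce raises a TypeError on an empty list). The helper name merge_non_empty and the sample frames are hypothetical.

```python
from functools import reduce

import pandas as pd


def merge_non_empty(frames, key="KEY"):
    """Outer-merge all non-blank frames on `key` (hypothetical helper)."""
    non_empty = [df for df in frames if not df.empty]
    return reduce(
        lambda left, right: pd.merge(left, right, on=key, how="outer"),
        non_empty,
    )


# Sample data: df1 is blank, df2 and df3 are not
df1 = pd.DataFrame()
df2 = pd.DataFrame({"KEY": [1, 2], "a": [10, 20]})
df3 = pd.DataFrame({"KEY": [2, 3], "b": [5, 6]})

final = merge_non_empty([df1, df2, df3])
```

Filtering first also avoids the KeyError that pd.merge raises when a completely blank frame has no 'KEY' column.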

Related

How to apply the same filter to multiple dataframes

How can I filter every dataframe in a list on the same column (category) in one piece of code?
filter_list = [df1, df2, df3, df4]
for name in filter_list:
    name.columns[columns['category'] == 'category1']
Assigning to the loop variable only rebinds that name and leaves the original dataframes untouched, so inside a for loop you need a method that accepts inplace=True.
I will demonstrate by dropping every row where category does not equal category1:
filter_list = [df1, df2, df3, df4]
for x in filter_list:
    # Drop the rows that fail the filter, modifying x in place
    x.drop(x[x['category'] != 'category1'].index, inplace=True)
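If you would rather leave the originals untouched, a list comprehension that builds new filtered frames is a common alternative. The following is only a sketch; the sample data and the names filtered, df1_f and df2_f are hypothetical.

```python
import pandas as pd

# Hypothetical sample frames sharing a 'category' column
df1 = pd.DataFrame({"category": ["category1", "category2"], "value": [1, 2]})
df2 = pd.DataFrame({"category": ["category2", "category1"], "value": [3, 4]})

filter_list = [df1, df2]

# Build a new list of filtered frames; df1 and df2 themselves are not modified
filtered = [df[df["category"] == "category1"] for df in filter_list]

# Unpack the results if separate names are needed
df1_f, df2_f = filtered
```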

Multiple data frames contain the same column

I am trying to merge 7 different data frames on a shared column (accident_no), but some of the data frames contain more rows and duplicate accident_no values.
For example, table 1 (Accident) contains 200 accident_no values (all unique) and table 3 contains 196 (all unique), but table 4 (Person) contains 400 accident_no values with some duplicates, because several passengers may have been involved in the same crash. Those rows share an accident_no, and the information is useful for analysis.
The problem is that I have tried concat, join and merge, and I expected the result to have at most the highest row count (400), but I am getting more than 400 rows.
So far I have tried the methods below:
dfs = [df1,df2,df3,df5,df6,df7]
df_final = reduce(lambda left,right: pd.merge(left,right,on='ACCIDENT_NO', how = 'left'), dfs)
AND
dfs = [df.set_index(['ACCIDENT_NO']) for df in [df1, df2, df3, df4, df5, df6, df7]]
print(pd.concat(dfs, axis=1).reset_index())
So, is it possible that I may get more rows than 400 or am I doing something wrong?
Thanks
Consider creating a person count column with groupby().cumcount() in each data frame, then concatenate on person and accident identifiers:
dfs = [
    (df.assign(
        PERSON_NO = lambda x: x.groupby(["ACCIDENT_NO"]).cumcount().add(1)
     ).set_index(["PERSON_NO", "ACCIDENT_NO"])
    )
    for df in [df1, df2, df3, df4, df5, df6, df7]
]
final_df = pd.concat(dfs, axis=1).reset_index()
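To see what the cumcount() approach does, here is a small sketch on two made-up tables. The column names AGE and SEVERITY and the helper with_person_no are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical data: accident 1 involves two people, accident 2 involves one
accident = pd.DataFrame({"ACCIDENT_NO": [1, 2], "SEVERITY": ["high", "low"]})
person = pd.DataFrame({"ACCIDENT_NO": [1, 1, 2], "AGE": [30, 41, 25]})


def with_person_no(df):
    """Number the rows within each accident and index by (PERSON_NO, ACCIDENT_NO)."""
    return df.assign(
        PERSON_NO=df.groupby("ACCIDENT_NO").cumcount().add(1)
    ).set_index(["PERSON_NO", "ACCIDENT_NO"])


# Rows are aligned on the (PERSON_NO, ACCIDENT_NO) pair, so nothing is multiplied
combined = pd.concat(
    [with_person_no(accident), with_person_no(person)], axis=1
).reset_index()
```

The result has one row per person (3 here). Note that accident-level columns such as SEVERITY are filled only on the PERSON_NO == 1 row of each accident and are NaN for the other people, which may or may not suit the analysis.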
You can try:
table1 = table1.merge(table2, on=['accident_no'], how='left')
and repeat for the other tables.
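On the question of whether more than 400 rows is possible: yes, a merge emits one row per matching pair of keys, so when the key is duplicated on both sides the rows multiply. The sketch below uses made-up tables (the vehicle table and all values are hypothetical) and the validate argument of DataFrame.merge to detect the situation.

```python
import pandas as pd

# Hypothetical tables: accident keys are unique, person and vehicle keys repeat
accident = pd.DataFrame({"ACCIDENT_NO": [1, 2, 3], "SEVERITY": ["low", "high", "low"]})
person = pd.DataFrame({"ACCIDENT_NO": [1, 1, 2], "PERSON": ["a", "b", "c"]})
vehicle = pd.DataFrame({"ACCIDENT_NO": [1, 1, 2], "VEHICLE": ["car", "truck", "bike"]})

# One-to-many: each accident row is repeated once per matching person
one_to_many = accident.merge(person, on="ACCIDENT_NO", how="left")

# Many-to-many: accident 1 yields 2 persons x 2 vehicles = 4 rows
many_to_many = person.merge(vehicle, on="ACCIDENT_NO", how="left")

# validate raises MergeError when the right-hand keys are not unique
try:
    person.merge(vehicle, on="ACCIDENT_NO", how="left", validate="many_to_one")
    duplicates_detected = False
except pd.errors.MergeError:
    duplicates_detected = True
```

Here one_to_many has 4 rows and many_to_many has 5, even though no input table has more than 3 rows.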

Formatting multiple dataframes with a function returning the correct output but then recalling the old variable

I keep running into this issue and have not been able to find a solution. I have 10 separate dataframes and am trying to use one function to format all of them at once. When running the function in Jupyter Notebook, it shows me that the correct formatting takes place by showing the correctly formatted last dataframe (df10, odds_sb). However, when I call what should be one of the newly formatted dataframes again, what is returned is the old format.
#Create function to format odds dataframes
def format_odds(df1, df2, df3, df4, df5, df6, df7, df8, df9, df10):
    for idx, df in enumerate((df1, df2, df3, df4, df5, df6, df7, df8, df9, df10)):
        df = df.T
        df = df.add_suffix(idx)
    return df
# Run format odds function to transpose and add number to each column
# This shows that they were correctly formatted
format_odds(odds_opening, odds_bovada, odds_betonline, odds_intertops, odds_sbtng,
            odds_betnow, odds_gtbets, odds_skybook, odds_5dimes, odds_sb)
#Back to old formatting for some reason
odds_opening
Any help is greatly appreciated!
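Reassigning df inside the loop only rebinds a local name, so the dataframes you passed in are never changed and the function returns just the last one. You need to collect each formatted df in a temporary result and return that. (enumerate works on a tuple as well as on a list, so that part was fine.)
Please note that once you add a suffix the column names differ between frames, so combining them does not stack rows under shared columns; the differently named columns end up side by side, padded with NaN. Still, the following code demonstrates the idea of appending data frames that have the same format (same column count, names and types):
def format_odds(df1, df2, df3):
    frames = []
    for idx, df in enumerate([df1, df2, df3]):
        df = df.T
        df = df.add_suffix(str(idx))
        frames.append(df)
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    return pd.concat(frames)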
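If the goal is to keep ten separate, reformatted dataframes rather than one combined table, a possible sketch is to return a list of new frames and rebind the original names to it. The function name format_all, the underscore in the suffix and the sample odds values are assumptions made for the illustration.

```python
import pandas as pd


def format_all(frames):
    """Return new frames, each transposed and suffixed with its position."""
    return [df.T.add_suffix(f"_{idx}") for idx, df in enumerate(frames)]


# Hypothetical sample data standing in for the odds tables
odds_opening = pd.DataFrame({"home": [1.5, 2.0], "away": [2.5, 1.8]})
odds_bovada = pd.DataFrame({"home": [1.6, 2.1], "away": [2.4, 1.7]})

# Rebind the names to the formatted results; without this assignment
# the original variables keep their old format
odds_opening, odds_bovada = format_all([odds_opening, odds_bovada])
```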

Call a dataframe from a list with the names of dataframes

I have a list with the names of all my dataframes (e.g. list = ['df1', 'df2', 'df3', 'df4']). I would like to extract df4 specifically by using something like list[3], that is, to get the df4 dataframe itself instead of the string 'df4'. Any help?
It sounds like you have this in pseudocode:
df1 = DataFrame()
df2 = DataFrame()
df3 = DataFrame()
df4 = DataFrame()
your_list = ["df1", "df2", "df3", "df4"]
And your goal is to get the df4 dataframe from your_list[3], which currently gives you only the string 'df4'.
You could, instead, put all the dataframes in the list in the first place, rather than strings.
your_list = [df1, df2, df3, df4]
Or even better, a dictionary with names:
list_but_really_a_dict = {"df1": df1, "df2": df2, "df3": df3, "df4": df4}
And then you can do list_but_really_a_dict['df1'] and get df1
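A minimal runnable sketch of the dictionary approach, with made-up single-column frames (the names frames, names and selected are hypothetical):

```python
import pandas as pd

# Map each name to its dataframe instead of storing bare strings
frames = {
    "df1": pd.DataFrame({"x": [1]}),
    "df2": pd.DataFrame({"x": [2]}),
    "df3": pd.DataFrame({"x": [3]}),
    "df4": pd.DataFrame({"x": [4]}),
}

names = ["df1", "df2", "df3", "df4"]

# names[3] is the string 'df4'; the dictionary turns it into the dataframe
selected = frames[names[3]]
```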

Looping through list of data frames and performing operation

I have a list of dataframes (df1, df2, df3 and df4) and I am performing an operation on each of them using a for loop. After the loop, the original dataframes do not show the modifications. Please help me understand what I am missing and why this is not working.
What do I need to change so that the modifications are passed on to the source dataframes?
sheetnames = [df1, df2, df3, df4]
i=0
for sheet in sheetnames:
    ixNaNList = sheet[sheet.isnull().all(axis=1) == True].index.tolist()
    if len(ixNaNList) > 0:
        ixNaN = ixNaNList[0]
        sheetnames[i]=sheet[:ixNaN]
    i=i+1
Your assignment sheetnames[i] = ... replaces the i-th element of the list sheetnames with whatever sheet[:ixNaN] evaluates to.
It therefore has no effect on the content of df1, df2, df3 or df4.
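Try this, which builds a new list of trimmed dataframes (df1 to df4 themselves stay unchanged):
sheetnames = [df1, df2, df3, df4]
def drop_after_na(df):
    # Keep only the rows that come before the first all-NaN row
    return df[df.isnull().all(axis=1).astype(int).cumsum() <= 0]
# list() is needed in Python 3, where map returns a lazy iterator
sheetnames = list(map(drop_after_na, sheetnames))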
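Or try this, which rebinds the original variables by name (exec on generated code only works reliably at module level and is generally best avoided):
sheetnames = ['df1', 'df2', 'df3', 'df4']
for sheet in sheetnames:
    exec('{sheet} = {sheet}[{sheet}.isnull().all(axis=1).astype(int).cumsum() <= 0]'.format(sheet=sheet))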
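A way to keep name-based access without exec is to hold the frames in a dictionary and rebuild it with the trimmed versions. This is only a sketch; the dictionary name sheets and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_after_na(df):
    """Keep only the rows that come before the first all-NaN row."""
    all_nan = df.isnull().all(axis=1)
    # The cumulative sum stays 0 until the first all-NaN row is reached
    return df[all_nan.cumsum() == 0]


# Hypothetical sheets: df1 has an all-NaN row in the middle, df2 has none
sheets = {
    "df1": pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [2.0, np.nan, 4.0]}),
    "df2": pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}),
}

# Replace every entry with its trimmed version
sheets = {name: drop_after_na(df) for name, df in sheets.items()}
```

After this, sheets["df1"] holds only the first row and sheets["df2"] is unchanged.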
