Looping through dataframe list with IF Statement - python

I'm running a simple for-loop over a list of dataframes, but when I try to add an if clause it keeps erroring out.
df_list = [df1, df2, df3]
for df in df_list:
    if df in [df1, df2]:
        x = 1
    else:
        x = 2

...
ValueError: Can only compare identically-labeled DataFrame objects
Above is a simplified version of what I'm attempting. Can anyone tell me why this isn't working and a fix?

You could use DataFrame.equals with any instead:
df_list = [df1, df2, df3]
for df in df_list:
    if any(df.equals(y) for y in [df1, df2]):
        x = 1
    else:
        x = 2

Do NOT use .equals() here!
It's unnecessary and slows your program down; use id() instead:
df_list = [df1, df2, df3]
for df in df_list:
    if id(df) in [id(df1), id(df2)]:
        x = 1
    else:
        x = 2
Here you only need to compare identities, not values.
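Equivalently, the `is` operator checks identity directly, without going through id(). A minimal sketch with toy frames (the small example DataFrames are made up for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"b": [3, 4]})
df3 = pd.DataFrame({"c": [5, 6]})

df_list = [df1, df2, df3]
results = []
for df in df_list:
    # `is` compares object identity, so label mismatches never matter
    if any(df is y for y in [df1, df2]):
        results.append(1)
    else:
        results.append(2)
# results is [1, 1, 2]
```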

You could use a better container and reference them by labels.
Equality checks on large DataFrames with object dtypes can take seconds, while checking whether a label is in a list takes nanoseconds.
dfs = {'df1': df1, 'df2': df2, 'df3': df3}
for label, df in dfs.items():
    if label in ['df1', 'df2']:
        x = 1
    else:
        x = 2

You need to use df.equals()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.equals.html
df_list = [df1, df2, df3]
for df in df_list:
    if df.equals(df1) or df.equals(df2):
        # blah blah

The following link might help:
Pandas "Can only compare identically-labeled DataFrame objects" error
According to that answer, data frames compared with == must have the same columns and index; otherwise pandas raises this error.
Alternatively, you can compare the data frames using the DataFrame.equals method. See the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.equals.html
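To see the difference concretely, here is a small sketch (toy frames made up for illustration): == on differently-labeled frames raises, while .equals() simply answers the question:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"y": [1, 2]})  # same values, different column label

try:
    a == b  # element-wise comparison requires identical labels
    raised = False
except ValueError:
    raised = True

print(raised)       # True: == refused to compare
print(a.equals(b))  # False: labels differ, but no exception
```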

Related

Pyspark Improve Repetitive Function Calls When Returning Dataframes

I have multiple dataframes that I need to apply different functions to. Is there a better way to do this in PySpark?
I am doing the following right now:
df1 = function_one(df1)
df2 = function_one(df2)
df3 = function_one(df3)
df1 = function_two(df1, dfx, 0)
df2 = function_two(df2, dfx, 1)
df3 = function_two(df3, dfx, 2)
I have tried this:
list_dfs = [df1, df2, df3]
num_list = [0,1,2]
for dataframe, num in zip(list_dfs, num_list):
    dataframe = function_one(dataframe)
    dataframe = function_two(dataframe, dfx, num)
This does not apply the changes.
Is there a way I can maybe do a loop in pyspark and apply the function to the multiple dataframes?
This is just something I wrote quickly (not tested).
Just a suggestion; not sure it's more convenient than what you're doing already (it obviously is if you have more dfs).
def make_changes(df):
    df = func(df)
    df = func2(df)
    return df

new_df_list = []
df_list = [df1, df2, df3]
for dfs in df_list:
    new_df_list.append(make_changes(dfs))
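One gotcha with the loop above: appending to new_df_list does not rebind the names df1/df2/df3, which is exactly why the original loop "did not apply the changes". If you need the original names updated, rebind them explicitly. A self-contained sketch with toy stand-ins (plain ints instead of DataFrames; func/func2 are placeholders for the real function_one/function_two):

```python
# Toy stand-ins so this sketch runs on its own; in the real code these
# would be the PySpark functions and DataFrames from the question.
def func(df):
    return df + 1

def func2(df):
    return df * 2

def make_changes(df):
    df = func(df)
    df = func2(df)
    return df

df1, df2, df3 = 1, 2, 3  # pretend these are DataFrames
# Rebind the original names in one step instead of mutating a loop variable
df1, df2, df3 = (make_changes(df) for df in (df1, df2, df3))
# df1, df2, df3 are now 4, 6, 8
```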

How to search a value and return the row from multiple dataframes in pandas?

For example, say I have multiple dataframes df1, df2 and df3, each with a column 'phone_no'. How do I search for a phone number in every dataframe and return the rows where that number is present?
For example
df_all = [df1, df2, df3]
for i in df_all:
    print(i.loc[i['phone_no'] == 9999999999])
The above code returns empty output. The output should be the rows where phone_no contains that particular phone number. How do I resolve this?
Check if this works by comparing phone_no to a string:
df_all = [df1, df2, df3]
for i in df_all:
    print(i.loc[i['phone_no'].astype(str) == '9999999999'])
You may not need to convert phone_no to str if it is already a string; check the dtype first:
>>> print(df1['phone_no'].dtype)
object
# OR
>>> print(df1['phone_no'].dtype)
int64
Update
df_all = [df1, df2, df3]
df_filtered = []
for i in df_all:
    df_filtered.append(i.loc[i['phone_no'].astype(str) == '9999999999'])
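If you then want one combined result instead of a list of per-frame results, the filtered pieces can be concatenated. A sketch with made-up toy frames, assuming the frames share the same columns:

```python
import pandas as pd

# Toy frames standing in for the real df1/df2/df3
df1 = pd.DataFrame({"phone_no": ["9999999999", "123"], "name": ["a", "b"]})
df2 = pd.DataFrame({"phone_no": ["456", "9999999999"], "name": ["c", "d"]})
df3 = pd.DataFrame({"phone_no": ["789"], "name": ["e"]})

df_all = [df1, df2, df3]
df_filtered = [
    df.loc[df["phone_no"].astype(str) == "9999999999"] for df in df_all
]
# Stack every matching row into one DataFrame
matches = pd.concat(df_filtered, ignore_index=True)
# matches has two rows: names "a" and "d"
```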

Call a dataframe from a list with the names of dataframes

I have a list with the names of my dataframes (e.g. list = ['df1', 'df2', 'df3', 'df4']). I would like to extract df4 specifically, using something like list[3], but get the df4 dataframe itself instead of the string 'df4'. Help?
It sounds like you have this in pseudocode:
df1 = DataFrame()
df2 = DataFrame()
df3 = DataFrame()
df4 = DataFrame()
your_list = ["df1", "df2", "df3", "df4"]
And your goal is to get df4 from your_list['df4']
You could, instead, put all the dataframes in the list in the first place, rather than strings.
your_list = [df1, df2, df3, df4]
Or even better, a dictionary with names:
list_but_really_a_dict = {"df1": df1, "df2": df2, "df3": df3, "df4": df4}
And then you can do list_but_really_a_dict['df1'] and get df1
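A tiny runnable sketch of the dictionary approach (toy frames made up for illustration):

```python
import pandas as pd

# Toy frames standing in for the real ones
df1 = pd.DataFrame({"a": [1]})
df4 = pd.DataFrame({"d": [4]})

dfs = {"df1": df1, "df4": df4}
picked = dfs["df4"]  # the DataFrame itself, not the string "df4"
# picked is the very same object as df4
```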

How to merge pandas dataframes from two separate lists of dataframes

I have two lists of dfs
List1 = [df1,df2,df3]
List2 = [df4,df5,df6]
I want to merge the first df from List1 with the corresponding df from List 2. ie df1 with df4 and df2 with df5, etc.
The dfs share a common column, 'Col1'. I have tried the following code
NewList = []
for i in len(List1), len(List2):
    NewList[i] = pd.merge(List1[i], List2[i], on='Col1')
I get the error 'list index out of range'.
I realise that this seems to be a common problem, however I cannot apply any of the solutions that I have found on Stack to my particular problem.
Thanks in advance for any help
Use
pd.concat([df1, df2])
To loop over two lists and compare them or perform an operation on them element-to-element, you can use the zip() function.
import pandas as pd

List1 = [df1, df2, df3]
List2 = [df4, df5, df6]

NewList = []
for dfa, dfb in zip(List1, List2):
    # Merges df1 with df4, df2 with df5, df3 with df6
    mdf = dfa.merge(dfb, on='Col1')
    NewList.append(mdf)
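A quick self-contained sketch of the pairing with toy frames (the small DataFrames are made up for illustration):

```python
import pandas as pd

# Toy frames sharing 'Col1', standing in for the two real lists
df1 = pd.DataFrame({"Col1": [1, 2], "a": [10, 20]})
df2 = pd.DataFrame({"Col1": [1, 2], "b": [30, 40]})
df4 = pd.DataFrame({"Col1": [1, 2], "x": [5, 6]})
df5 = pd.DataFrame({"Col1": [1, 2], "y": [7, 8]})

List1 = [df1, df2]
List2 = [df4, df5]
# zip pairs element-wise: (df1, df4), then (df2, df5)
NewList = [dfa.merge(dfb, on="Col1") for dfa, dfb in zip(List1, List2)]
# NewList[0] has columns Col1, a, x; NewList[1] has Col1, b, y
```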

Looping through list of data frames and performing operation

I have a list of dataframes (df1, df2, df3 and df4) and I am performing an operation on the list using a for loop. After the operations, I don't see the modifications on the dataframes. Please help me understand what I am missing and why this is not working.
What modifications do I need to make so the changes propagate to the source dataframes?
sheetnames = [df1, df2, df3, df4]
i = 0
for sheet in sheetnames:
    ixNaNList = sheet[sheet.isnull().all(axis=1) == True].index.tolist()
    if len(ixNaNList) > 0:
        ixNaN = ixNaNList[0]
        sheetnames[i] = sheet[:ixNaN]
    i = i + 1
Your assignment sheetnames[i] = ... replaces the i-th element of the list sheetnames with whatever sheet[:ixNaN] evaluates to.
It thus has no effect on the content of df1, df2, df3 or df4.
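The same distinction shows up with plain lists (a minimal sketch): assigning to a list slot rebinds that slot only, while mutating the object itself is visible through every name that points at it.

```python
a = [1, 2]
b = [3, 4]
lst = [a, b]

lst[0] = [99]     # rebinds the list slot; `a` still points at [1, 2]
lst[1].append(5)  # mutates the object itself; visible through `b` too

# a is still [1, 2], while b is now [3, 4, 5]
```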
try this:
sheetnames = [df1, df2, df3, df4]

def drop_after_na(df):
    return df[df.isnull().all(axis=1).astype(int).cumsum() <= 0]

# list(...) materializes the result (in Python 3, map is lazy)
sheetnames = list(map(drop_after_na, sheetnames))
and try this:
sheetnames = ['df1', 'df2', 'df3', 'df4']
for sheet in sheetnames:
    exec('{sheet} = {sheet}[{sheet}.isnull().all(axis=1).astype(int).cumsum() <= 0]'.format(sheet=sheet))
