Looping through list of data frames and performing operation

I have a list of dataframes (df1, df2, df3 and df4) and I modify each of them inside a for loop. After the loop, the original dataframes are unchanged. What am I missing, and why does this not work?
What do I need to change so that the modifications reach the source dataframes?
sheetnames = [df1, df2, df3, df4]
i = 0
for sheet in sheetnames:
    ixNaNList = sheet[sheet.isnull().all(axis=1) == True].index.tolist()
    if len(ixNaNList) > 0:
        ixNaN = ixNaNList[0]
        sheetnames[i] = sheet[:ixNaN]
    i = i + 1

Your assignment sheetnames[i] = ... replaces the i-th element of the list sheetnames with whatever sheet[:ixNaN] evaluates to.
It thus has no effect on df1, df2, df3 or df4: those names still refer to the original, unsliced dataframes.
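A minimal sketch of that behaviour, assuming a small made-up frame with one all-NaN row (the data is illustrative, not from the question). It shows that the list element and the name df1 end up pointing at two different objects until the name is rebound explicitly:

```python
import numpy as np
import pandas as pd

# Illustrative frame: two data rows, one all-NaN row, one trailing row.
df1 = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0], "b": [5.0, 6.0, np.nan, 8.0]})
sheetnames = [df1]

# Same logic as the question: slice up to the first all-NaN row.
first_nan = df1[df1.isnull().all(axis=1)].index.tolist()[0]
sheetnames[0] = df1[:first_nan]

truncated_len = len(sheetnames[0])   # length of the new, sliced frame in the list
original_len = len(df1)              # the name df1 still refers to the full frame
same_object = sheetnames[0] is df1   # False: the slice is a separate object
print(truncated_len, original_len, same_object)

# Rebinding the name explicitly is what carries the change back.
df1 = sheetnames[0]
```

The last line is the step the loop in the question never performs.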

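Try this, which keeps only the rows before the first all-NaN row and collects the results in a new list:
sheetnames = [df1, df2, df3, df4]
def drop_after_na(df):
    # cumsum() becomes >= 1 from the first all-NaN row onwards
    return df[df.isnull().all(axis=1).astype(int).cumsum() <= 0]
# In Python 3, map() returns a lazy iterator, so wrap it in list()
sheetnames = list(map(drop_after_na, sheetnames))
Alternatively, if you really need to rebind the names df1 ... df4 themselves, you can use exec with the names as strings (this only works reliably at module level, and exec is generally best avoided):
sheetnames = ['df1', 'df2', 'df3', 'df4']
for sheet in sheetnames:
    exec('{sheet} = {sheet}[{sheet}.isnull().all(axis=1).astype(int).cumsum() <= 0]'.format(sheet=sheet))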
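As a possible alternative to exec, the names can be rebound with ordinary tuple unpacking. The sketch below uses two small made-up frames; drop_after_na follows the answer's approach, with the intermediate mask spelled out:

```python
import numpy as np
import pandas as pd

def drop_after_na(df):
    # True for rows in which every column is NaN
    all_nan = df.isnull().all(axis=1)
    # Running count of all-NaN rows seen so far; 0 means "before the first one"
    return df[all_nan.astype(int).cumsum() <= 0]

# Illustrative data: df1 has an all-NaN row in the middle, df2 has none.
df1 = pd.DataFrame({"a": [1.0, np.nan, 3.0]})
df2 = pd.DataFrame({"a": [1.0, 2.0, 3.0]})

# Unpacking rebinds the names df1 and df2 to the filtered frames.
df1, df2 = [drop_after_na(df) for df in (df1, df2)]
print(len(df1), len(df2))
```

This only scales to a handful of names; for many frames a list or dict is usually easier to work with than separate variables.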

Related

Pyspark Improve Repetitive Function Calls When Returning Dataframes

I have multiple dataframes that I need to apply different functions to, and I want to know whether there is a better way to do this in PySpark.
I am doing the following right now:
df1 = function_one(df1)
df2 = function_one(df2)
df3 = function_one(df3)
df1 = function_two(df1, dfx, 0)
df2 = function_two(df2, dfx, 1)
df3 = function_two(df3, dfx, 2)
I have tried this:
list_dfs = [df1, df2, df3]
num_list = [0,1,2]
for dataframe, num in zip(list_dfs, num_list):
    dataframe = function_one(dataframe)
    dataframe = function_two(dataframe, dfx, num)
This does not apply the changes.
Is there a way to loop in PySpark and apply the functions to multiple dataframes?

This is just something I wrote quickly (not tested).
It is only a suggestion; I am not sure it is more convenient than what you are doing already, although it should be once you have more dataframes.
def make_changes(df):
    df = func(df)
    df = func2(df)
    return df
new_df_list = []
df_list = [df1, df2, df3]
for dfs in df_list:
    new_df_list.append(make_changes(dfs))
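The same idea can be sketched so that it also passes the per-dataframe number and rebinds the original names. To keep the sketch runnable without a Spark session, the "dataframes" here are plain Python lists and function_one / function_two are stand-ins for the question's functions, not real PySpark calls:

```python
# Stand-ins for the question's functions; each returns a new object,
# which mirrors how Spark transformations return new DataFrames.
def function_one(df):
    return df + ["one"]

def function_two(df, dfx, num):
    return df + [f"two({dfx},{num})"]

def make_changes(df, dfx, num):
    df = function_one(df)
    return function_two(df, dfx, num)

# Placeholder inputs standing in for real DataFrames.
df1, df2, df3, dfx = ["df1"], ["df2"], ["df3"], "dfx"

# enumerate supplies the 0, 1, 2 argument; unpacking rebinds df1, df2, df3.
df1, df2, df3 = [
    make_changes(df, dfx, num) for num, df in enumerate([df1, df2, df3])
]
print(df1, df2, df3)
```

The loop in the question failed because assigning to the loop variable only rebinds that variable; capturing the return values, as here, is what keeps the results.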

Looping through dataframe list with IF Statement

I am running a simple for loop over a list of dataframes, but as soon as I add an if clause it keeps raising an error.
df_list = [df1, df2, df3]
for df in df_list:
    if df in [df1, df2]:
        x = 1
    else:
        x = 2
.
.
.
ValueError: Can only compare identically-labeled DataFrame objects
Above is a simplified version of what I'm attempting. Can anyone tell me why this isn't working and suggest a fix?

You could use DataFrame.equals with any instead:
df_list = [df1, df2, df3]
for df in df_list:
    if any(df.equals(y) for y in [df1, df2]):
        x = 1
    else:
        x = 2

Do NOT use .equals() here!
It is unnecessary and slows down your program; use id() instead:
df_list = [df1, df2, df3]
for df in df_list:
    if id(df) in [id(df1), id(df2)]:
        x = 1
    else:
        x = 2
This works because here you only need to compare identities, not values.
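An equivalent way to write the identity check is Python's `is` operator, which compares object identity directly. The sketch below uses small made-up frames; df3 deliberately holds the same values as df1 to show that identity and value equality differ:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"a": [3, 4]})
df3 = pd.DataFrame({"a": [1, 2]})  # same values as df1, but a separate object

labels = []
for df in [df1, df2, df3]:
    # `is` compares object identity and never calls DataFrame.__eq__
    if any(df is other for other in (df1, df2)):
        labels.append(1)
    else:
        labels.append(2)

print(labels)
```

Note that df3 lands in the else branch even though df3.equals(df1) is true, which is the behaviour you want when the question is "is this one of those specific objects?".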

You could use a better container and reference the dataframes by label.
Equality checks on large DataFrames with object dtypes can become slow (seconds or more), whereas checking whether a label is in a list takes on the order of nanoseconds.
dfs = {'df1': df1, 'df2': df2, 'df3': df3}
for label, df in dfs.items():
    if label in ['df1', 'df2']:
        x = 1
    else:
        x = 2

You need to use df.equals():
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.equals.html
df_list = [df1, df2, df3]
for df in df_list:
    if df.equals(df1) or df.equals(df2):
        pass  # your logic here

The following link might help:
Pandas "Can only compare identically-labeled DataFrame objects" error
According to this, dataframes compared with == must have the same columns and index; otherwise pandas raises this error.
Alternatively, you can compare the dataframes with the DataFrame.equals method. Please refer to the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.equals.html
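A small sketch of the difference, using two made-up frames whose indexes do not match. The exact wording of the error message varies between pandas versions, so only the exception type is checked here:

```python
import pandas as pd

df_a = pd.DataFrame({"x": [1, 2]})
df_b = pd.DataFrame({"x": [1, 2, 3]})  # different index, so the labels do not match

try:
    df_a == df_b  # element-wise comparison requires identical labels
    raised = False
except ValueError as err:
    raised = True
    print(err)

# equals() returns a single bool and tolerates mismatched labels
same = df_a.equals(df_b)
print(raised, same)
```

`df in [df1, df2]` triggers the same == comparison internally, which is why the if clause in the question fails.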

How to search a value and return the row from multiple dataframes in pandas?

For example, suppose I have multiple dataframes such as df1, df2 and df3, each with a column 'phone_no'. How do I search for a phone number in every dataframe and return the rows where that number is present?
For example:
df_all = [df1, df2, df3]
for i in df_all:
    print(i.loc[i['phone_no'] == 9999999999])
The above code returns empty output. The output should be the rows whose phone_no matches that particular number. How do I resolve this?

Check whether comparing phone_no as a string works:
df_all = [df1, df2, df3]
for i in df_all:
    print(i.loc[i['phone_no'].astype(str) == '9999999999'])
If phone_no is already stored as strings, the astype(str) conversion is unnecessary. Check the column's dtype to find out:
>>> print(df1['phone_no'].dtype)
object
# OR
>>> print(df1['phone_no'].dtype)
int64
Update: to collect the matching rows in a list instead of printing them:
df_all = [df1, df2, df3]
df_filtered = []
for i in df_all:
    df_filtered.append(i.loc[i['phone_no'].astype(str) == '9999999999'])
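To illustrate why the original comparison can come back empty, here is a sketch with made-up data in which one frame stores phone numbers as strings and the other as integers. The frames and numbers are hypothetical; the point is only the dtype mismatch:

```python
import pandas as pd

# Hypothetical data: one column holds strings, the other integers.
df1 = pd.DataFrame({"phone_no": ["9999999999", "1234567890"]})
df2 = pd.DataFrame({"phone_no": [9999999999, 5550001111]})

matches = []
for df in (df1, df2):
    # Casting to str makes the comparison work for both dtypes
    hit = df.loc[df["phone_no"].astype(str) == "9999999999"]
    matches.append(hit)

# Comparing the string column with an int finds nothing
no_hit = df1.loc[df1["phone_no"] == 9999999999]
print(len(no_hit), [len(m) for m in matches])
```

If the columns also contain stray whitespace or formatting characters, a cast alone may not be enough and the values would need cleaning first.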

Formatting multiple dataframes with a function returning the correct output but then recalling the old variable

I keep running into this issue and have not been able to find a solution. I have 10 separate dataframes and am trying to use one function to format all of them at once. When running the function in Jupyter Notebook, it shows me that the correct formatting takes place by showing the correctly formatted last dataframe (df10, odds_sb). However, when I call what should be one of the newly formatted dataframes again, what is returned is the old format.
# Create function to format odds dataframes
def format_odds(df1, df2, df3, df4, df5, df6, df7, df8, df9, df10):
    for idx, df in enumerate((df1, df2, df3, df4, df5, df6, df7, df8, df9, df10)):
        df = df.T
        df = df.add_suffix(idx)
    return df
# Run format odds function to transpose and add number to each column
# This shows that they were correctly formatted
format_odds(odds_opening, odds_bovada, odds_betonline, odds_intertops, odds_sbtng,
            odds_betnow, odds_gtbets, odds_skybook, odds_5dimes, odds_sb)
#Back to old formatting for some reason
odds_opening
Any help is greatly appreciated!

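You need to collect each formatted dataframe inside the function and return the combined result; enumerate accepts either a list or a tuple of the dataframes.
Note that add_suffix gives every dataframe different column names, so combining them does not simply stack rows: the result holds the union of all columns, with NaN wherever a dataframe has no such column. The following code nevertheless demonstrates the idea of combining dataframes, which works best when they share the same column count, names and types:
import pandas as pd

def format_odds(df1, df2, df3):
    formatted = []
    for idx, df in enumerate([df1, df2, df3]):
        df = df.T
        df = df.add_suffix(str(idx))
        formatted.append(df)
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    return pd.concat(formatted)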
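If the goal is to keep the dataframes separate rather than combine them, one option is to return all formatted frames and assign them back to the original names. This is a sketch under that assumption, with two small made-up frames standing in for the ten odds dataframes:

```python
import pandas as pd

def format_odds(*dfs):
    """Transpose each frame and suffix its columns with the frame's position."""
    return [df.T.add_suffix(str(idx)) for idx, df in enumerate(dfs)]

# Illustrative stand-ins for two of the odds frames
odds_opening = pd.DataFrame({"home": [1.9], "away": [2.1]})
odds_bovada = pd.DataFrame({"home": [1.8], "away": [2.2]})

# Assigning the returned frames back is what makes the change stick
odds_opening, odds_bovada = format_odds(odds_opening, odds_bovada)
print(list(odds_opening.columns), list(odds_bovada.columns))
```

Inside the original function, `df = df.T` only rebinds the local loop variable, so without returning and reassigning, the caller's variables keep the old format.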

Call a dataframe from a list with the names of dataframes

I have a list with the names of all my dataframes (e.g. list = ['df1', 'df2', 'df3', 'df4']). I would like to extract df4 specifically, using something like list[3], so that I get the df4 dataframe itself instead of the string 'df4'. Any help?

It sounds like you have this, in pseudocode:
df1 = DataFrame()
df2 = DataFrame()
df3 = DataFrame()
df4 = DataFrame()
your_list = ["df1", "df2", "df3", "df4"]
And your goal is to get the dataframe df4 from your_list[3], which currently gives you only the string "df4".
You could, instead, put all the dataframes in the list in the first place, rather than strings.
your_list = [df1, df2, df3, df4]
Or even better, a dictionary with names:
list_but_really_a_dict = {"df1": df1, "df2": df2, "df3": df3, "df4": df4}
And then you can do list_but_really_a_dict['df1'] and get df1
