I have quite a difficult issue to explain; I'll try my best.
I have a function a() that calls a function b() and passes a dataframe (called "df_a") to b().
I learned that this is done by reference, meaning that if inside function b() I add a new column to the input dataframe, this will also modify the original one. For example:
import pandas as pd

def b(df_b):
    df_b['Country'] = "not sure"

def a():
    df_a = pd.DataFrame({"Name": ['Mark', 'Annie'], 'Age': [30, 28]})
    b(df_a)
    print(df_a)  # this dataframe will now have the column "Country"
So far so good. The problem is that today I realized that if inside b() we merge the dataframe with another dataframe, this creates a new local dataframe.
def b(df_b):
    df_c = pd.DataFrame({"Name": ['Mark', 'Annie'], 'Country': ['Brazil', 'Japan']})
    df_b = pd.merge(df_b, df_c, left_on='Name', right_on='Name', how='left')

def a():
    df_a = pd.DataFrame({"Name": ['Mark', 'Annie'], 'Age': [30, 28]})
    b(df_a)
    print(df_a)  # this dataframe will *not* have the column "Country"
So my question is: how do I make sure that in this second example the column "Country" is also assigned to the original df_a dataframe, without returning it back?
(I would prefer not to use "return df_b" inside function b(), since I would have to change the logic in many, many parts of the code.)
Thank you
I have modified the functions b() and a() so that the changes made in b() are returned back to a():
def b(df_b):
    df_c = pd.DataFrame({"Name": ['Mark', 'Annie'], 'Country': ['Brazil', 'Japan']})
    df_b = pd.merge(df_b, df_c, left_on='Name', right_on='Name', how='left')
    return df_b

def a():
    df_a = pd.DataFrame({"Name": ['Mark', 'Annie'], 'Age': [30, 28]})
    df_a = b(df_a)
    print(df_a)
**Output:**

a()
    Name  Age Country
0   Mark   30  Brazil
1  Annie   28   Japan
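If rewriting the call sites to use the returned value really isn't an option, a possible alternative is to merge into a temporary frame and copy the new column back onto the caller's dataframe in place. This is only a minimal sketch, and it assumes the left merge keeps df_b's row order and that df_c has at most one row per Name:

def b(df_b):
    df_c = pd.DataFrame({"Name": ['Mark', 'Annie'], 'Country': ['Brazil', 'Japan']})
    merged = pd.merge(df_b, df_c, on='Name', how='left')
    # item assignment mutates the object the caller passed in,
    # so df_a inside a() sees the new column without a return
    df_b['Country'] = merged['Country'].values

Only item assignment like this mutates the original frame; any operation that rebinds the local name df_b (merge, concat, etc.) still produces a new object that the caller never sees.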
I tripped up passing a mutable object (a dataframe) to a function and forgot that any changes made to the 'df' in the function would also change the original. But I had passed two frames to the function, and only one of the originals had changed! Example of the behaviour below. (Btw, I know that making a copy of the frame inside the function before processing would have created a new instance, and that this example should not be a class, but I was interested to see whether it made a difference; it does not.)
import pandas as pd

class Frames:
    def frame_player(self, df1, df2):
        df2 = pd.concat([df2, df1], ignore_index=True)
        df1.drop(df1.loc[df1['idx'] == 1].index, inplace=True)
        print(df2.head())

def main():
    data = [[1, 'abc', 'def'], [2, 'ghi', 'jkl']]
    df_a = pd.DataFrame(data=data, columns=['idx', 'value1', 'value2'])
    df_b = pd.DataFrame(columns=df_a.columns.tolist())
    Framer = Frames()
    Framer.frame_player(df_a, df_b)
    print(f'{len(df_a)=}')
    print(f'{len(df_b)=}')

if __name__ == "__main__":
    main()
Shell:

   idx value1 value2
0    1    abc    def
1    2    ghi    jkl
len(df_a)=1
len(df_b)=0
df_a has been changed, but df_b has not - it is still empty!!! This seems totally inconsistent, or am I missing something?
Please be patient; I am new to Python and Pandas.
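The behaviour looks consistent once you separate rebinding a name from mutating an object; here is a sketch of that reading of frame_player above (the same code, with only comments added):

def frame_player(self, df1, df2):
    # rebinding: pd.concat builds a brand-new dataframe, and the *local* name df2
    # now points at it; the caller's df_b still points at the original empty frame
    df2 = pd.concat([df2, df1], ignore_index=True)
    # mutation: drop(..., inplace=True) changes the one object that both the local
    # name df1 and the caller's df_a refer to, so the caller sees the dropped row
    df1.drop(df1.loc[df1['idx'] == 1].index, inplace=True)
    print(df2.head())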
I have a lot of pandas dataframes, but some are duplicates. So I wrote a function that checks whether 2 dataframes are equal; if they are, 1 will be deleted:
def check_eq(df1, df2):
    if df1.equals(df2):
        del[df2]
        print("Deleted %s" % (df_name))
The function works, but I wish to know how to have the variable "df_name" as a string with the name of the dataframe.
I don't understand: the parameters df1 and df2 are dataframe objects, so how can I get their names at run time if I wish to print them?
Thanks in advance.
What you are trying to use is an f-string.
def check_eq(df1, df2):
    if df1.equals(df2):
        del[df2]
        print(f"Deleted {df2.name}")
I'm not certain whether you can call this print method, though, since you deleted the dataframe right before accessing its name attribute, so df2 is unbound.
Instead try this:
def check_eq(df1, df2):
    if df1.equals(df2):
        print(f"Deleted {df2.name}")
        del df2
Now, do note that your usage of 'del' is also not correct. I assume you want to delete the second dataframe in your code. However, you only delete it inside the scope of the check_eq method. You should familiarize yourself with the scope concept first. https://www.w3schools.com/python/python_scope.asp
The code I used:
d = {'col1': [1, 2], 'col2': [3, 4]}
df1 = pd.DataFrame(data=d)
df2 = pd.DataFrame(data=d)
df1.name='dataframe1'
df2.name='dataframe2'
def check_eq(df1, df2):
    if df1.equals(df2):
        print(f"Deleted {df2.name}")
I am working in a Jupyter notebook using Python.
I have created two dataframes as shown below.
The two dataframes below are declared outside the functions, meaning they are just defined/declared/initialized in a Jupyter notebook cell [and I wish to use them inside the functions shown below].
subcols = ["subjid","marks"] #written in jupyter cell 1
subjdf= pd.DataFrame(columns=subcols)
testcolumns = ["testid","testmarks"] #written in jupyter cell 2
testdf= pd.DataFrame(columns=testcolumns)
def fun1(): #written in jupyter cell 3
....
....
return df1,df2
def fun2(df1,df2):
...
...
return df1,df2,df3
def fun3(df1,df2,df3):
...
subjdf['subid'] = df1['indid']
...
return df1,df2,df3,subjdf
def fun4(df1,df2,df3,subjdf):
...
testdf['testid'] = df2['examid']
...
return df1,df2,df3,subjdf,testdf
The above way of writing throws an error in fun3, as below:
UnboundLocalError: local variable 'subjdf' referenced before assignment
but I have already created subjdf outside the function blocks [Refer 1st Jupyter cell]
Two things to note here:
a] I don't get an error if I use global subjdf in fun3.
b] If I use global subjdf, I don't get any error for testdf in fun4. I was expecting testdf to throw a similar error, because I have used it the same way in fun4.
So my question is: why not for testdf, but only for subjdf?
Additionally, I have followed a similar approach earlier [without using a global variable, just declaring the df outside the function blocks] and it was working fine. Not sure why it is throwing an error only now.
Can you help me avoid this error, please?
You have created subjdf, but your function fun3 needs it as an argument:
def fun3(subjdf, df1, df2, df3):
    ...
    subjdf['subid'] = df1['indid']
You're not using Python functions properly here. You don't need to use global in your case. Either pass the correct argument and return it, or think about creating an instance method using self. You have many solutions, but instance methods are a good option when you have to handle a pandas.DataFrame within classes and functions.
It's not possible to run your snippet as you posted it; too many lines of code are missing.
If you don't want to use a class and you want to keep using this chained style, then rebuild your code this way:
subcols = ["subjid","marks"]
subjdf= pd.DataFrame(columns=subcols)
testcolumns = ["testid","testmarks"]
testdf= pd.DataFrame(columns=testcolumns)
def fun1():
# DO SOMETHING to generate df1 and df2
return df1, df2
def fun2():
df1, df2 = fun1()
# DO SOMETHING to generate df3
return df1, df2, df3
def fun3(subjdf):
df1, df2, df3 = fun2()
subjdf['subid'] = df1['indid']
return df1, df2, df3, subjdf
def fun4(subjdf, testdf):
df1, df2, df3, subjdf = fun3()
testdf['testid'] = df2['examid']
return df1, df2, df3, subjdf, testdf
fun4(subjdf, testdf)
But I repeat: consider building this with instance methods and self.
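As for the UnboundLocalError itself: Python decides at compile time that a name is local to a function if it is assigned anywhere in that function's body, and any read of that name before the assignment then fails. Since the elided lines of fun3 aren't shown, the trigger below is only an assumption, but this is the usual cause:

import pandas as pd

subjdf = pd.DataFrame(columns=["subjid", "marks"])

def fun3():
    subjdf["subid"] = []      # reads the name subjdf, then calls __setitem__ on it
    subjdf = subjdf.copy()    # this plain assignment makes subjdf local to the whole
                              # body of fun3, so the read above fails

fun3()                        # raises UnboundLocalError

Item assignment on its own (subjdf['subid'] = ...) does not make the name local; only a plain rebinding does, which may be why testdf in fun4 never triggers the error if nothing there rebinds the name testdf.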
I have a dataframe:
df = pd.DataFrame([1,2,3,4,5,6])
I have created a function which takes a dataframe and a % split as input and creates two new dataframes based on the inputs:
def splitdf(df, split=0.5):
    a = df.iloc[:int(len(df)*split)]
    b = df.iloc[int((1-split)*len(df)):]
Now, when I run this function and call "a"
splitdf(df)
display(a)
I get the error: name 'a' is not defined
a and b are local to splitdf, so you have to return them.
def splitdf(df, split=0.5):
    a = df.iloc[:int(len(df)*split)]
    b = df.iloc[int((1-split)*len(df)):]
    return a, b
Then when you call splitdf, assign the return values to some variable(s):
df_a, df_b = splitdf(df)
You are currently just defining a and b within splitdf, and then they stop existing when the function exits because they've gone out of scope.
Based on a recent question I came to wonder what exactly goes wrong when sorting a group using inplace=True inside a function applied to groupby.
Example and problem
df = pd.DataFrame({'A': ['a', 'a', 'b'],
                   'B': [3, 2, 1]})

def func(x):
    x.sort_values('B', inplace=True)
    return x.B.max()

dfg = df.groupby('A')
dfg.apply(func)
This gives
A
a    3
b    3
while one would expect
A
a    3
b    1
Printing x inside the function shows that the function func is applied to the group 'a' during each call (the group 'b' is "replaced" entirely):
def func(x):
    print(x)
    x.sort_values('B', inplace=True)
    return x.B.max()

# Output (including the usual pandas apply zero-call)
   A  B
0  a  3
1  a  2
   A  B
0  a  3
1  a  2
   A  B
1  a  2
0  a  3
Solution to the problem
This issue can be fixed by performing the sort inside func like x = x.sort_values('B'). In this case, everything works as expected.
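For reference, the non-inplace version of func looks like this (same logic; the local name x is simply rebound to a sorted copy instead of sorting the shared object):

def func(x):
    x = x.sort_values('B')   # new sorted copy; the group handed in by apply is left untouched
    return x.B.max()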
Question
Now to my conceptual problem: as a first thought I would expect
- that inplace modifies the DataFrame/DataFrameGroupBy itself, while the assignment x = x.sort_values('B') creates a copy
- that this is the groupby equivalent of modifying a list while looping over it
However, inspection of both the DataFrame df and the DataFrameGroupBy instance dfg reveals that they are unchanged after the apply, which suggests that the problem is not the modification of the original instances. So what is going on here?
When I did
def func(x):
    x = x.copy()
    x.sort_values('B', inplace=True)
    return x.B.max()
It returns
A
a    3
b    1
so it verifies your first thought, i.e. that inplace modifies the DataFrame/DataFrameGroupBy itself, while the assignment x = x.sort_values('B') creates a copy.
I iterated over the dfg groupby object as well.
def func(x):
    x.sort_values('B', inplace=True)
    return x.B.max()

dfg = df.groupby('A')
for x in dfg:
    print(func(x[1]))
It returns
3
1
Hence, from my understanding, this issue has something to do with how DataFrame.groupby().apply() iterates over its elements.
It seems to hand the same memory block to all of its elements, and once you overwrite that block by using inplace=True, it stays overwritten for the following groups.
Hence your dfg and df variables still hold the original values, but you still get the wrong output.
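One way to probe this on your own pandas version is to print the identity of the object handed to func. This is only a diagnostic sketch; how apply materialises and reuses group objects has changed across pandas releases, so the ids you see may or may not repeat:

def func(x):
    print(id(x))                  # identical ids across calls would suggest the same object is being reused
    x.sort_values('B', inplace=True)
    return x.B.max()

df.groupby('A').apply(func)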