I am working in a Jupyter notebook using Python.
I have created two dataframes, as shown below.
Both dataframes are declared outside any function: they are simply initialized in notebook cells, and I wish to use them inside functions as shown below.
subcols = ["subjid","marks"] #written in jupyter cell 1
subjdf= pd.DataFrame(columns=subcols)
testcolumns = ["testid","testmarks"] #written in jupyter cell 2
testdf= pd.DataFrame(columns=testcolumns)
def fun1(): #written in jupyter cell 3
    ....
    ....
    return df1,df2
def fun2(df1,df2):
    ...
    ...
    return df1,df2,df3
def fun3(df1,df2,df3):
    ...
    subjdf['subid'] = df1['indid']
    ...
    return df1,df2,df3,subjdf
def fun4(df1,df2,df3,subjdf):
    ...
    testdf['testid'] = df2['examid']
    ...
    return df1,df2,df3,subjdf,testdf
Written this way, fun3 throws the following error:
UnboundLocalError: local variable 'subjdf' referenced before assignment
But I have already created subjdf outside the function blocks (see the first Jupyter cell).
Two things to note here:
a] I don't get an error if I use global subjdf in fun3.
b] If I use global subjdf, I don't get any error for testdf in fun4. I was expecting testdf to raise a similar error, because I use it the same way in fun4.
So my question is: why does the error occur only for subjdf and not for testdf?
Additionally, I have followed a similar approach before (without global, just declaring the dataframe outside the function blocks) and it worked fine. I'm not sure why it throws an error only now.
Can you help me avoid this error, please?
You have created subjdf, but your function fun3 needs to receive it as an argument:
def fun3(subjdf, df1, df2, df3):
    ...
    subjdf['subid'] = df1['indid']
You're not using Python functions properly. You don't need global in your case. Either pass the dataframe in as an argument and return it, or put the functions in a class as instance methods that use self. There are many solutions, but instance methods are a good one when you have to handle pandas DataFrames within classes and functions.
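The elided lines of fun3 are not shown, so the following is only a likely explanation, not a diagnosis. Python treats a name as local to the whole function if the function assigns to that name anywhere in its body. Item assignment such as subjdf['subid'] = ... only mutates the existing object and does not count as assigning to the name. Assuming fun3 contains a plain assignment like subjdf = ... somewhere in the omitted lines (and fun4 has no such line for testdf), that would produce exactly the reported behaviour. The function names and the dropna() call below are hypothetical, chosen only to reproduce the error:

```python
import pandas as pd

subjdf = pd.DataFrame(columns=["subjid", "marks"])


def only_mutates(values):
    # Item assignment mutates the global object. The name subjdf is never
    # rebound here, so Python looks it up in the module scope.
    subjdf["subjid"] = values


def rebinds_later(values):
    # The plain assignment two lines down makes subjdf local to the whole
    # function, so this first statement already raises UnboundLocalError.
    subjdf["subjid"] = values
    subjdf = subjdf.dropna()  # hypothetical stand-in for an elided line


only_mutates([1, 2])  # works without global

try:
    rebinds_later([3, 4])
except UnboundLocalError as exc:
    error_message = str(exc)
    print("UnboundLocalError:", error_message)
```

Declaring global subjdf removes the error because the name is then no longer treated as local, but passing the dataframe in as a parameter avoids the problem without relying on global state.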
It's not possible to run your snippet as posted, so I can only guess: too many lines of code are missing.
If you don't want to use a class and you want to keep this nested approach, where each function calls the previous one, then rebuild your code this way:
subcols = ["subjid","marks"]
subjdf= pd.DataFrame(columns=subcols)
testcolumns = ["testid","testmarks"]
testdf= pd.DataFrame(columns=testcolumns)

def fun1():
    # DO SOMETHING to generate df1 and df2
    return df1, df2

def fun2():
    df1, df2 = fun1()
    # DO SOMETHING to generate df3
    return df1, df2, df3

def fun3(subjdf):
    df1, df2, df3 = fun2()
    subjdf['subid'] = df1['indid']
    return df1, df2, df3, subjdf

def fun4(subjdf, testdf):
    # fun3 requires subjdf as its argument
    df1, df2, df3, subjdf = fun3(subjdf)
    testdf['testid'] = df2['examid']
    return df1, df2, df3, subjdf, testdf

fun4(subjdf, testdf)
But I repeat: consider building this as a class whose instance methods access the dataframes through self.
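A minimal sketch of the class-based approach, under assumptions: the class name MarksTracker, the method names, and the sample values are all hypothetical, and the column name 'subjid' is used on the assumption that 'subid' in the question was meant to match the declared column.

```python
import pandas as pd


class MarksTracker:
    """Hypothetical container that owns both result dataframes."""

    def __init__(self):
        self.subjdf = pd.DataFrame(columns=["subjid", "marks"])
        self.testdf = pd.DataFrame(columns=["testid", "testmarks"])

    def fill_subjects(self, df1):
        # self.subjdf is reached through the instance, so neither global
        # nor an extra parameter is needed.
        self.subjdf["subjid"] = df1["indid"]

    def fill_tests(self, df2):
        self.testdf["testid"] = df2["examid"]


# Made-up input frames standing in for the df1 and df2 of the question.
df1 = pd.DataFrame({"indid": [10, 20]})
df2 = pd.DataFrame({"examid": [7, 8]})

tracker = MarksTracker()
tracker.fill_subjects(df1)
tracker.fill_tests(df2)
print(tracker.subjdf)
print(tracker.testdf)
```

Each method states which input it needs, and the shared dataframes live in one place instead of in notebook-level variables.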
Related
Please be patient I am new to Python and Pandas.
I have a lot of pandas dataframes, but some are duplicates. So I wrote a function that checks whether two dataframes are equal; if they are, one of them is deleted:
def check_eq(df1, df2):
    if df1.equals(df2):
        del[df2]
        print( "Deleted %s" % (df_name) )
The function works, but I would like to know how to get the variable "df_name" as a string containing the name of the dataframe.
I don't understand: the parameters df1 and df2 are dataframe objects, so how can I get their names at run time if I want to print them?
Thanks in advance.
What you are trying to use is an f-string.
def check_eq(df1, df2):
    if df1.equals(df2):
        del[df2]
        print(f"Deleted {df2.name}")
This print call fails, though: df2 is deleted right before its name attribute is read, so df2 is unbound at that point.
Instead try this:
def check_eq(df1, df2):
    if df1.equals(df2):
        print(f"Deleted {df2.name}")
        del df2
Note that your usage of 'del' is also not correct. I assume you want to delete the second dataframe in your code. However, del only removes the name df2 inside the scope of check_eq; the dataframe held by the caller is untouched. You should familiarize yourself with the concept of scope first: https://www.w3schools.com/python/python_scope.asp
The code I used:
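import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4]}
df1 = pd.DataFrame(data=d)
df2 = pd.DataFrame(data=d)
# name is a plain attribute set by hand; pandas does not assign it
df1.name='dataframe1'
df2.name='dataframe2'
def check_eq(df1, df2):
    if df1.equals(df2):
        print(f"Deleted {df2.name}")

check_eq(df1, df2)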
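Because del inside a function only removes the local name, one alternative is to keep the dataframes in a dict keyed by name and build a new dict without the duplicates. This is a sketch rather than the approach from the answer above; the function name drop_duplicate_frames and the sample frames are made up for illustration.

```python
import pandas as pd


def drop_duplicate_frames(frames):
    """Return a new dict that keeps only the first of each group of equal frames."""
    kept = {}
    for name, frame in frames.items():
        # Find an already-kept frame that is equal to this one, if any.
        duplicate_of = next(
            (kept_name for kept_name, kept_frame in kept.items()
             if kept_frame.equals(frame)),
            None,
        )
        if duplicate_of is None:
            kept[name] = frame
        else:
            print(f"Dropped {name} (equal to {duplicate_of})")
    return kept


data = {"col1": [1, 2], "col2": [3, 4]}
frames = {
    "dataframe1": pd.DataFrame(data),
    "dataframe2": pd.DataFrame(data),
    "dataframe3": pd.DataFrame({"col1": [9, 9], "col2": [3, 4]}),
}
unique_frames = drop_duplicate_frames(frames)
print(list(unique_frames))
```

The dict key doubles as the printable name, so no name attribute has to be attached to the DataFrame objects.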
I'm working on a dataframe that I have been able to clean by running the following code in separate cells in a Jupyter notebook. However, I need to run these same tasks on several dataframes that are organized exactly the same way. How can I write a function that executes tasks 2 through 4 below?
For reference, the data I'm working with is located here.
[1]: df1 = pd.read_csv('202110-divvy-tripdata.csv')
[2]: df1.drop(columns=['start_station_name','start_station_id','end_station_name','end_station_id','start_lat','start_lng','end_lat','end_lng'],inplace=True)
[3]: df1['ride_length'] = pd.to_datetime(df1.ended_at) - pd.to_datetime(df1.started_at)
[4]: df1['day_of_week'] = pd.to_datetime(df1.started_at).dt.day_name()
You can define a function in a cell in Jupyter, run this cell and then call the function:
def process_df(df):
    df['ride_length'] = pd.to_datetime(df.ended_at) - pd.to_datetime(df.started_at)
    df['day_of_week'] = pd.to_datetime(df.started_at).dt.day_name()
Call the function with each DataFrame:
df1 = pd.read_csv('data1.csv')
df2 = pd.read_csv('data2.csv')
process_df(df1)
process_df(df2)
According to this answer, both DataFrames will be altered in place and there's no need to return a new object from the function.
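The function above covers tasks 3 and 4 only. A possible variant that also performs the column drop from task 2, and returns a new DataFrame rather than altering its input, is sketched below. The names clean_trips and UNUSED_COLUMNS and the one-row sample frame are assumptions made for illustration; real use would pass the frames loaded with pd.read_csv.

```python
import pandas as pd

# Columns removed in task 2 of the question.
UNUSED_COLUMNS = [
    "start_station_name", "start_station_id",
    "end_station_name", "end_station_id",
    "start_lat", "start_lng", "end_lat", "end_lng",
]


def clean_trips(df):
    """Return a cleaned copy of df; the input frame is left unchanged."""
    cleaned = df.drop(columns=UNUSED_COLUMNS)  # drop returns a new frame
    started = pd.to_datetime(cleaned["started_at"])
    ended = pd.to_datetime(cleaned["ended_at"])
    cleaned["ride_length"] = ended - started
    cleaned["day_of_week"] = started.dt.day_name()
    return cleaned


# Tiny made-up frame with the same column layout, standing in for a CSV file.
sample = pd.DataFrame({
    "started_at": ["2021-10-01 08:00:00"],
    "ended_at": ["2021-10-01 08:30:00"],
    **{column: [None] for column in UNUSED_COLUMNS},
})
cleaned = clean_trips(sample)
print(cleaned)
```

With several files, the same function can be applied to each loaded frame, for example in a list comprehension over the frames.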
I have two dfs and want to manipulate them in some way with a for loop.
I have found that creating a new column within the loop updates the df, but other commands such as set_index or dropping columns do not.
import pandas as pd
import numpy as np
gen1 = pd.DataFrame(np.random.rand(12,3))
gen2 = pd.DataFrame(np.random.rand(12,3))
df1 = pd.DataFrame(gen1)
df2 = pd.DataFrame(gen2)
all_df = [df1, df2]
for x in all_df:
    x['test'] = x[1]+1
    x = x.set_index(0).drop(2, axis=1)
    print(x)
Note that when each df is printed inside the loop, all the commands appear to have been applied. But when I inspect either df afterwards, only the new column 'test' is there; the set_index and drop steps are undone.
Am I missing something as to why only one of the commands has been made permanent? Thank you.
Here's what's going on:
x is a variable that at the start of each iteration of your for loop initially refers to an element of the list all_df. When you assign to x['test'], you are using x to update that element, so it does what you want.
However, when you assign something new to x, you are simply causing x to refer to that new thing without touching the contents of what x previously referred to (namely, the element of all_df that you are hoping to change).
You could try something like this instead:
for x in all_df:
    x['test'] = x[1]+1
    x.set_index(0, inplace=True)
    x.drop(2, axis=1, inplace=True)
print(df1)
print(df2)
Please note that using inplace is often discouraged (see here for example), so you may want to consider whether there's a way to achieve your objective using new DataFrame objects created based on df1 and df2.
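One way to avoid inplace, sketched here with a hypothetical helper named reshape, is to build new DataFrame objects and bind the results to new names, leaving df1 and df2 as they were:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.rand(12, 3))
df2 = pd.DataFrame(np.random.rand(12, 3))


def reshape(frame):
    # assign, set_index and drop each return a new DataFrame,
    # so the frame passed in is never modified.
    return frame.assign(test=frame[1] + 1).set_index(0).drop(columns=2)


df1_new, df2_new = [reshape(frame) for frame in (df1, df2)]

print(df1_new.head())
print(df1.columns.tolist())  # the original still has columns 0, 1, 2
```

The results are kept because they are assigned to names outside the loop, which is the step the original loop variable x could not do.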
Let's assume we have the following Pandas dataframe df:
df = pd.DataFrame({'food' : ['spam', 'ham', 'eggs', 'ham', 'ham', 'eggs', 'milk'],
'sales' : [10, 15, 12, 5, 14, 3, 8]})
Let's further assume that we have the following function that squares the value of the sales column in df:
def square_sales(df):
    df['sales'] = df['sales']**2
    return df
Now, let's assume we have a requirement to: "return df to the caller"
Does this mean that we pass a df to the square_sales function, then return the processed df (i.e. the df with the squared sales column)?
Or, does this mean that we pass df to square_sales, then assign that function call to a variable named df? For example:
df = square_sales(df)
Thanks!
The function changes the df itself (inplace operation). Even if you don't return the df, it will change in the calling scope as well.
The way it is written will work the same for both cases:
df = square_sales(df)
and
square_sales(df)
If you need to return a new df without altering the original, you'll have to make a copy first and only then assign the new column. In this case you will also have to assign the returned df to a new variable:
def square_sales(df):
    df2 = df.copy(deep=True)
    df2['sales'] = df2['sales']**2
    return df2
new_df = square_sales(df)
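The difference between the two versions can be checked directly. In this sketch the two functions are renamed square_sales_in_place and square_sales_copy (hypothetical names) so that both can live in the same cell:

```python
import pandas as pd


def square_sales_in_place(df):
    df["sales"] = df["sales"] ** 2  # mutates the caller's object
    return df


def square_sales_copy(df):
    result = df.copy(deep=True)     # work on an independent copy
    result["sales"] = result["sales"] ** 2
    return result


original = pd.DataFrame({"food": ["spam", "ham"], "sales": [10, 15]})

squared = square_sales_copy(original)
sales_after_copy_version = original["sales"].tolist()  # still the input values

returned = square_sales_in_place(original)
same_object = returned is original                     # same DataFrame object
sales_after_in_place_version = original["sales"].tolist()

print(sales_after_copy_version, same_object, sales_after_in_place_version)
```

The in-place version hands back the very object it was given, which is why calling it with or without assigning the result has the same effect on df.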
I think there's some aspect of functions and variable scope that you're confused about, but I'm not sure precisely what. If the function returns a DataFrame, then outside of the function you can assign that returned DataFrame to whatever variable you want. Whether or not the variable name outside the function is the same as the variable name inside the function doesn't matter, as far as the function is concerned.
SiP's answer already points out that your function modifies the original input DataFrame in place and returns the updated version. I would caution that this is a misleading antipattern. Functions that operate on a mutable value (like a DataFrame) are usually expected to do only one or the other. And Pandas' own methods, by default, return the new value without modifying in place, as it appears you've been asked to do.
So I would advise that you use the modified function suggested by SiP, that copies the supplied DataFrame before making changes. As for using it, all of these do basically the same thing:
# Define df
df = square_sales(df)
# Define df
new_df = square_sales(df)
# Define df
some_other_variable_name = square_sales(df)
The only real difference is that in the first case, you no longer have access to the previous, unmodified DataFrame. But if you don't need that, and from then on you only plan to use the squared version, then it can make perfect sense.
(Also, if you wanted to, you could alter the function definition to use a different parameter name, say my_internal_df. This would not in any way affect how any of those three examples work.)
Why would the following code not affect the Output DataFrame? (This example is not interesting in itself - it is a convoluted way of 'copying' a DataFrame.)
def getRow(row):
    Output.append(row)
Output = pd.DataFrame()
Input = pd.read_csv('Input.csv')
Input.apply(getRow)
Is there a way to get this kind of functionality with the apply function, so that it affects other variables?
What happens
DataFrame.append() returns a new dataframe. It does not modify Output but rather creates a new one every time.
DataFrame.append(self, other, ignore_index=False, verify_integrity=False)
Append rows of other to the end of this frame, returning a new
object. Columns not in this frame are added as new columns.
Here:
Output.append(row)
you create a new dataframe but throw it away immediately.
You have access - But you shouldn't use it in this way
While this works, I strongly recommend against using global:
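from pandas import DataFrame

df = DataFrame([1, 2, 3])
df2 = DataFrame()
def get_row(row):
    global df2
    # DataFrame.append was removed in pandas 2.0; this line needs an older version
    df2 = df2.append(row)
df.apply(get_row)
print(df2)
Output:
   0  1  2
0  1  2  3
Take it as a demonstration of what happens. Don't use it in your code.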
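DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, and pd.concat is its replacement. The same rule applies to it: it returns a new object, so the result has to be assigned to be kept. A small sketch with made-up frames (the variable names are illustrative):

```python
import pandas as pd

output = pd.DataFrame({"a": [1]})
extra = pd.DataFrame({"a": [2]})

# The result is discarded here, so output is unchanged.
pd.concat([output, extra])
untouched = output.copy()

# Assigning the result back is what makes the change visible.
output = pd.concat([output, extra], ignore_index=True)
print(output)
```

No global statement is needed here because the reassignment happens at module level rather than inside a function.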