dataframe overwritten when using list comprehension - python

I am attempting to create four new pandas dataframes via a list comprehension. Each new dataframe should be the original 'constituents_list' dataframe with two new columns, which add a defined number of years to an existing date column. The example code is below:
def add_maturity(df, tenor):
    df['tenor'] = str(tenor) + 'Y'
    df['maturity'] = df['effectivedate'] + pd.DateOffset(years=tenor)
    return df

year_list = [3, 5, 7, 10]
new_dfs = [add_maturity(constituents_file, tenor) for tenor in year_list]
My expected output in the new_dfs list is four dataframes, each with a different value for 'tenor' and 'maturity'. In my results, all four dataframes contain the same data, with a 'tenor' of '10Y' and a 'maturity' 10 years greater than the 'effectivedate' column.
I suspect that each time I iterate through the list comprehension, the existing dataframe is overwritten by the latest call to the function. I just can't work out how to stop this happening.
Many thanks

When you assign to columns of the DataFrame, you're modifying it in place. And when you pass it as an argument to a function, what you're passing is a reference to the DataFrame object (in this case, a reference to the same DataFrame object every time), so each call overwrites the previous results.
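As a minimal illustration of this (using a made-up one-column DataFrame, not the question's actual data):

```python
import pandas as pd

# A small throwaway DataFrame for illustration
df = pd.DataFrame({'effectivedate': pd.to_datetime(['2020-01-01', '2021-06-30'])})

def tag(d, label):
    d['tenor'] = label  # mutates the object the caller passed in
    return d            # returns a reference to that same object

out = tag(df, '3Y')
print(out is df)             # True: both names point at one DataFrame
print(df['tenor'].tolist())  # the "original" now has the column too
```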
To solve this issue, you can either create a copy of the DataFrame at the start of the function:
def add_maturity(df, tenor):
    df = df.copy()
    df['tenor'] = str(tenor) + 'Y'
    df['maturity'] = df['effectivedate'] + pd.DateOffset(years=tenor)
    return df
(Or you could keep the function as is, and have the caller copy the DataFrame first when passing it as an argument...)
Or you can use the assign() method, which returns a new DataFrame with the modified columns:
def add_maturity(df, tenor):
    return df.assign(
        tenor=str(tenor) + 'Y',
        maturity=df['effectivedate'] + pd.DateOffset(years=tenor),
    )
(Personally, I'd go with the latter. It's similar to how most DataFrame methods work, in that they typically return a new DataFrame rather than modifying it in place.)
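As a quick check of the assign() version (with a made-up constituents_file; the real one comes from the question):

```python
import pandas as pd

# Stand-in for the question's constituents_file
constituents_file = pd.DataFrame(
    {'effectivedate': pd.to_datetime(['2020-01-15', '2020-03-01'])})

def add_maturity(df, tenor):
    return df.assign(
        tenor=str(tenor) + 'Y',
        maturity=df['effectivedate'] + pd.DateOffset(years=tenor),
    )

year_list = [3, 5, 7, 10]
new_dfs = [add_maturity(constituents_file, tenor) for tenor in year_list]

print([d['tenor'].iloc[0] for d in new_dfs])  # ['3Y', '5Y', '7Y', '10Y']
print('tenor' in constituents_file.columns)   # False: original untouched
```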

Related

Modifying a dataframe inside of multiprocessing

I am using multiprocessing to run some pretty time-consuming tasks concurrently, creating a number of separate dataframes which I will merge into one later, like so:
manager = Manager()
ns = manager.Namespace()
ns.df_one = data_format.init() # this just creates a new dataframe with predefined columns
ns.df_two = data_format.init() # this just creates a new dataframe with predefined columns
p_search_one = Process(target=search_function_one, args=(ns,))
p_search_one.start()
p_search_two = Process(target=search_function_one, args=(ns,))
p_search_two.start()
p_search_one.join()
p_search_two.join()
pprint(f'search_one: {ns.df_one }') # prints an empty dataframe
Then, in the function I am modifying the dataframe:
def search_function_one(ns):
    df = ns.df_one
    do_some_magic(df)  # just adds rows to the dataframe in place, returns nothing
    pprint(f'df from ns: {ns.df_one}')  # prints an empty dataframe
    pprint(f'df from ns: {df}')  # prints an empty dataframe
I have also tried not binding df to ns.df_one (which may be a copy?) and instead passing it directly, like so:
def search_function_one(ns):
    do_some_magic(ns.df_one)  # just adds rows to the dataframe in place, returns nothing
    pprint(f'df from ns: {ns.df_one}')  # prints an empty dataframe
But that just prints an empty dataframe. Without concurrency this works as expected, modifying the dataframe in place, but with concurrency it doesn't.
I'm also wondering whether do_some_magic is the issue, as it's in another file, but it doesn't make a new reference to the input parameter; it doesn't do df = input_var, it just accesses the input variable directly.
Am I doing something fundamentally wrong in how I'm managing my datatypes?
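For what it's worth, a common gotcha with Manager().Namespace() is that reading an attribute like ns.df_one hands the process a pickled copy of the DataFrame, so in-place edits to that copy never travel back to the manager; changes only propagate when the attribute is assigned again. A minimal sketch of that reassignment pattern (do_some_magic is stood in for by a one-row append; under the spawn start method, wrap the process setup in an if __name__ == '__main__': guard):

```python
import pandas as pd
from multiprocessing import Manager, Process

def search_function_one(ns):
    df = ns.df_one               # this is a copy, not the shared frame
    df.loc[len(df)] = ['found']  # stands in for do_some_magic
    ns.df_one = df               # reassign so the manager sees the change

manager = Manager()
ns = manager.Namespace()
ns.df_one = pd.DataFrame(columns=['result'])

p = Process(target=search_function_one, args=(ns,))
p.start()
p.join()
print(ns.df_one)  # no longer empty: one row containing 'found'
```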

Apply the same block of formatting code to multiple dataframes at once

My raw data is in multiple datafiles that have the same format. After importing the various (10) csv files using pd.read_csv(filename.csv), I have a series of dataframes df1, df2, df3, etc.
I want to perform all of the below code to each of the dataframes.
I therefore created a function to do it:
def my_func(df):
    df = df.rename(columns=lambda x: x.strip())
    df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
    df.date = pd.to_datetime(df.date)
    df = df.join(df['long_margin'].str.split(' ', 1, expand=True).rename(columns={0: 'A', 1: 'B'}))
    df = df.drop(columns=['long_margin'])
    df = df.drop(columns=['cash_interest'])
    mapping = {df.columns[6]: 'daily_turnover', df.columns[7]: 'cash_interest',
               df.columns[8]: 'long_margin', df.columns[9]: 'short_margin'}
    df = df.rename(columns=mapping)
    return df
and then tried to call the function as follows:
list_of_datasets = [df1, df2, df3]
for dataframe in list_of_datasets:
    dataframe = my_func(dataframe)
If I manually ran this code changing df to df1, df2 etc it works, but it doesn't seem to work in my function (or the way I am calling it).
What am I missing?
As I understand, in
for dataframe in list_of_datasets:
    dataframe = my_func(dataframe)
dataframe is a reference to an object in the list; it is not the DataFrame itself. When for x in something: is executed, Python creates a new variable x that points to an element of the list, and this new variable is usually just forgotten once the loop ends (though it is not actually deleted).
If inside the function you only modify this object in place, that's fine: the changes will propagate to the object in the list.
But as soon as the function creates a new object named "df" (not modifying the previous one, but building a new object with a new id) and returns it, the assignment binds dataframe to this new object instead of to the element of the list. The element in the list won't be affected, or rather will only reflect whatever happened before the function created the new DataFrame.
In order to see exactly when this happens, I would suggest adding print(id(df)) before and after each line of code in the function and in the loop. When the id changes, you are dealing with a new object, not with the element of the list.
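For instance (with a hypothetical two-column frame), the id changes at the very first line of the function, because rename returns a new object:

```python
import pandas as pd

df = pd.DataFrame({' date ': ['2020-01-01'], 'x': [1]})

original_id = id(df)
df = df.rename(columns=lambda c: c.strip())  # rename returns a NEW frame
print(id(df) == original_id)  # False: df has been rebound to a new object
print(list(df.columns))       # ['date', 'x']
```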
Alex is correct.
To make this work you could use list comprehension:
list_of_datasets = [my_func(df) for df in list_of_datasets]
or create a new list for the outputs
formatted_dfs = []
for dataframe in list_of_datasets:
    formatted_dfs.append(my_func(dataframe))

Iterate through different dataframes and apply a function to each one

I have 4 different dataframes containing time series data that all have the same structure.
My goal is to take each individual dataframe and pass it through a function I have defined that will group them by datestamp, sum the columns and return a new dataframe with the columns I want. So in total I want 4 new dataframes that have only the data I want.
I just looked through this post:
Loop through different dataframes and perform actions using a function
but applying this did not change my results.
Here is my code:
I am putting the dataframes in a list so I can iterate through them
dfs = [vds, vds2, vds3, vds4]
This is my function I want to pass each dataframe through:
def VDS_pre(df):
    df = df.groupby(['datestamp', 'timestamp']).sum().reset_index()
    df = df.rename(columns={'datestamp': 'Date', 'timestamp': 'Time', 'det_vol': 'VolumeVDS'})
    df = df[['Date', 'Time', 'VolumeVDS']]
    return df
This is the loop I made to iterate through my dataframe list and pass each one through my function:
for df in dfs:
    df = VDS_pre(df)
However once I go through my loop and go to print out the dataframes, they have not been modified and look like they initially did. Thanks for the help!
However once I go through my loop and go to print out the dataframes, they have not been modified and look like they initially did.
Yes, this is actually the case. The reason they have not been modified is:
Assignment to the loop variable in a for item in lst: loop has no effect on either lst or the variables from which the lst items got their values, as the following code demonstrates:
v1 = 1; v2 = 2; v3 = 3
lst = [v1, v2, v3]
for item in lst:
    item = 0
print(lst, v1, v2, v3)  # gives: [1, 2, 3] 1 2 3
To achieve the result you expect to obtain you can use a list comprehension and the list unpacking feature of Python:
vds,vds2,vds3,vds4=[VDS_pre(df) for df in [vds,vds2,vds3,vds4]]
or following code which is using a list of strings with the identifier/variable names of the dataframes:
sdfs = ['vds', 'vds2', 'vds3', 'vds4']
for sdf in sdfs:
    exec(f'{sdf} = VDS_pre(eval(sdf))')
Now printing vds, vds2, vds3 and vds4 will output the modified dataframes.
Pandas frame operations return a new copy of the data. Your snippet stores the result in the df variable, which is never written back into your initial list. This is why you have no stored result after execution.
If you don't need to keep original frames, you may simply overwrite them:
for i, df in enumerate(dfs):
    dfs[i] = VDS_pre(df)
Otherwise, use a second list and append the results to it:
l = []
for df in dfs:
    df2 = VDS_pre(df)
    l.append(df2)
Or even better use list comprehension to rewrite this snippet into a single line of code.
Now you are able to store the result of your processing.
Additionally, if your frames have the same structure and can be merged into a single frame, you may consider concatenating them first and then applying your function to the combined frame. That would be the more idiomatic pandas approach.
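A sketch of that concat-first idea, using stand-in frames with the column names from the question (if you also need a per-source breakdown, add a key column before concatenating):

```python
import pandas as pd

# Stand-ins for vds, vds2, ... with the structure described in the question
vds = pd.DataFrame({'datestamp': ['2020-01-01'], 'timestamp': ['08:00'], 'det_vol': [5]})
vds2 = pd.DataFrame({'datestamp': ['2020-01-01'], 'timestamp': ['08:00'], 'det_vol': [7]})

def VDS_pre(df):
    df = df.groupby(['datestamp', 'timestamp']).sum().reset_index()
    df = df.rename(columns={'datestamp': 'Date', 'timestamp': 'Time',
                            'det_vol': 'VolumeVDS'})
    return df[['Date', 'Time', 'VolumeVDS']]

# Concatenate once, then run the aggregation a single time
combined = VDS_pre(pd.concat([vds, vds2], ignore_index=True))
print(combined)  # one row: 2020-01-01 / 08:00 / 12
```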

Returning a Pandas dataframe to the caller of a function (return vs. assign variable to function call)

Let's assume we have the following Pandas dataframe df:
df = pd.DataFrame({'food': ['spam', 'ham', 'eggs', 'ham', 'ham', 'eggs', 'milk'],
                   'sales': [10, 15, 12, 5, 14, 3, 8]})
Let's further assume that we have the following function that squares the value of the sales column in df:
def square_sales(df):
    df['sales'] = df['sales']**2
    return df
Now, let's assume we have a requirement to: "return df to the caller"
Does this mean that we pass a df to the square_sales function, then return the processed df (i.e. the df with the squared sales column)?
Or, does this mean that we pass df to square_sales, then assign that function call to a variable named df? For example:
df = square_sales(df)
Thanks!
The function changes the df itself (inplace operation). Even if you don't return the df, it will change in the calling scope as well.
The way it is written will work the same for both cases:
df = square_sales(df)
and
square_sales(df)
If you need to return a new df without altering the original, you'll have to make a copy first and only then assign the new column. In this case you will also want to assign the returned df to a new variable:
def square_sales(df):
    df2 = df.copy(deep=True)
    df2['sales'] = df2['sales']**2
    return df2

new_df = square_sales(df)
I think there's some aspect of functions and variable scope that you're confused about, but I'm not sure precisely what. If the function returns a DataFrame, then outside of the function you can assign that returned DataFrame to whatever variable you want. Whether or not the variable name outside the function is the same as the variable name inside the function doesn't matter, as far as the function is concerned.
SiP's answer already points out that your function modifies the original input DataFrame in place and returns the updated version. I would caution that this is a misleading antipattern. Functions that operate on a mutable value (like a DataFrame) are usually expected to do only one or the other. And Pandas' own methods, by default, return the new value without modifying in place, which is apparently what you've been asked to do.
So I would advise that you use the modified function suggested by SiP, that copies the supplied DataFrame before making changes. As for using it, all of these do basically the same thing:
df = square_sales(df)
new_df = square_sales(df)
some_other_variable_name = square_sales(df)
The only real difference is that in the first case, you no longer have access to the previous, unmodified DataFrame. But if you don't need that, and from henceforth you only plan to need the squared version, then it can make perfect sense.
(Also, if you wanted to, you could alter the function definition to use a different parameter name, say my_internal_df. This would not in any way affect how any of those three examples work.)
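A tiny illustration of that last point (the parameter name my_internal_df is purely illustrative):

```python
import pandas as pd

def square_sales(my_internal_df):       # parameter name is local to the function
    out = my_internal_df.copy(deep=True)
    out['sales'] = out['sales'] ** 2
    return out

df = pd.DataFrame({'sales': [2, 3]})
new_df = square_sales(df)               # the caller's names are unrelated
print(new_df['sales'].tolist())  # [4, 9]
print(df['sales'].tolist())      # [2, 3]: the original is untouched
```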

DataFrame modified inside a function

I face a problem of modification of a dataframe inside a function that I have never observed previously. Is there a method to deal with this so that the initial dataframe is not modified.
In [30]: def test(df):
    ...:     df['tt'] = np.nan
    ...:     return df

In [31]: dff = pd.DataFrame(data=[])

In [32]: dff
Out[32]:
Empty DataFrame
Columns: []
Index: []

In [33]: df = test(dff)

In [34]: dff
Out[34]:
Empty DataFrame
Columns: [tt]
Index: []
def test(df):
    df = df.copy(deep=True)
    df['tt'] = np.nan
    return df
If you pass a dataframe into a function, manipulate it, and return the same dataframe, you are going to get the same dataframe back in modified form. If you want to keep your old dataframe and create a new dataframe with your modifications, then by definition you need two dataframes: the one you pass in, which you don't want modified, and the new, modified one. Therefore, if you don't want to change the original dataframe, your best bet is to make a copy of it. In my example I rebound the variable "df" inside the function to the new, copied dataframe. I used the copy method, and the argument deep=True makes a copy of both the dataframe and its contents. You can read more here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html
