Changes to pandas dataframe in for loop are only partially saved

I have two dfs, and want to manipulate them in some way with a for loop.
I have found that creating a new column within the loop updates the df. But not with other commands like set_index, or dropping columns.
import pandas as pd
import numpy as np
gen1 = pd.DataFrame(np.random.rand(12,3))
gen2 = pd.DataFrame(np.random.rand(12,3))
df1 = pd.DataFrame(gen1)
df2 = pd.DataFrame(gen2)
all_df = [df1, df2]
for x in all_df:
    x['test'] = x[1]+1
    x = x.set_index(0).drop(2, axis=1)
    print(x)
Note that when each df is printed inside the loop, all of the commands appear to have been applied to both dfs. But when I inspect either df afterwards, only the new column 'test' is there; the set_index and drop steps appear to have been undone.
Am I missing something as to why only one of the commands has been made permanent? Thank you.

Here's what's going on:
x is a variable that at the start of each iteration of your for loop initially refers to an element of the list all_df. When you assign to x['test'], you are using x to update that element, so it does what you want.
However, when you assign something new to x, you are simply causing x to refer to that new thing without touching the contents of what x previously referred to (namely, the element of all_df that you are hoping to change).
You could try something like this instead:
for x in all_df:
    x['test'] = x[1]+1
    x.set_index(0, inplace=True)
    x.drop(2, axis=1, inplace=True)
print(df1)
print(df2)
Please note that using inplace is often discouraged (see here for example), so you may want to consider whether there's a way to achieve your objective using new DataFrame objects created based on df1 and df2.
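One way to avoid inplace is to build new DataFrames and rebind the names afterwards. The following is only a sketch under that assumption; the helper name transform is made up for illustration and is not part of pandas or of the original question.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.rand(12, 3))
df2 = pd.DataFrame(np.random.rand(12, 3))
all_df = [df1, df2]


def transform(frame):
    """Hypothetical helper: return a new DataFrame, leaving `frame` untouched."""
    return (
        frame.assign(test=frame[1] + 1)  # new column, on a new object
        .set_index(0)                    # column 0 becomes the index
        .drop(columns=2)                 # remove column 2
    )


# Rebuild the list from the returned objects, then rebind the original names.
all_df = [transform(frame) for frame in all_df]
df1, df2 = all_df

print(df1.columns.tolist())  # [1, 'test']
```

The important step is the last rebinding: without `df1, df2 = all_df`, the names df1 and df2 would still refer to the original, untransformed objects.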

Related

Apply the same block of formatting code to multiple dataframes at once

My raw data is in multiple datafiles that have the same format. After importing the various (10) csv files using pd.read_csv(filename.csv) I have a series of dataframes df1, df2, df3 etc etc
I want to run all of the code below on each of the dataframes.
I therefore created a function to do it:
def my_func(df):
    df = df.rename(columns=lambda x: x.strip())
    df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
    df.date = pd.to_datetime(df.date)
    df = df.join(df['long_margin'].str.split(' ', n=1, expand=True).rename(columns={0:'A', 1:'B'}))
    df = df.drop(columns=['long_margin'])
    df = df.drop(columns=['cash_interest'])
    mapping = {df.columns[6]: 'daily_turnover', df.columns[7]: 'cash_interest', df.columns[8]: 'long_margin', df.columns[9]: 'short_margin'}
    df = df.rename(columns=mapping)
    return df
and then tried to call the function as follows:
list_of_datasets = [df1, df2, df3]
for dataframe in list_of_datasets:
    dataframe = my_func(dataframe)
If I manually ran this code changing df to df1, df2 etc it works, but it doesn't seem to work in my function (or the way I am calling it).
What am I missing?

As I understand, in
for dataframe in list_of_datasets:
    dataframe = my_func(dataframe)
dataframe is a reference to an object in the list, not a copy of the DataFrame. When for x in something: runs, Python binds a new variable x to each element of the iterable in turn. That variable is not deleted when the loop ends; it simply keeps referring to the last object it was bound to.
If the function only modifies the object it receives in place, that's fine. The changes show up in the element of the list, because both names refer to the same object.
But as soon as the function builds a new object and binds it to df (a new object with a new id, rather than a modification of the old one) and returns it, the assignment dataframe = my_func(dataframe) only makes dataframe refer to that new object. The element of the list is not replaced. It keeps whatever in-place changes were made before the function first created a new DataFrame, and nothing after that.
In order to see when exactly it happens, I would suggest that you add print(id(df)) after (and before) each line of code in the function and in the loop. When the id changes, you deal with the new object (not with the element of the list).
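A small sketch of that check, using `is` instead of printing ids (it compares identity in the same way). The frame and column names here are invented for the demonstration.

```python
import pandas as pd

frames = [pd.DataFrame({"a": [1, 2]})]

for frame in frames:
    # At the start of the iteration the loop variable IS the list element.
    same_object = frame is frames[0]

    # Column assignment mutates that object, so the list element changes too.
    frame["b"] = frame["a"] * 2

    # rename() returns a new DataFrame; this line only rebinds `frame`.
    frame = frame.rename(columns={"a": "A"})
    rebound = frame is not frames[0]

print(same_object, rebound)        # True True
print(frames[0].columns.tolist())  # ['a', 'b']
print(frame.columns.tolist())      # ['A', 'b']
```

The list element keeps the in-place change (column b) but never sees the rename, which is the "partially saved" behaviour described above.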

Alex is correct.
To make this work you could use list comprehension:
list_of_datasets = [my_func(df) for df in list_of_datasets]
or create a new list for the outputs
formatted_dfs = []
for dataframe in list_of_datasets:
    formatted_dfs.append(my_func(dataframe))

Action on one pandas dataframe does the same to the one it was copied from

I was using this bit of code (re-worked for my application) when I found that the df_temp.drop(index=sample.index, inplace=True) performed the same action on df_input i.e. it emptied it!!! I was not expecting that at all.
I solved it by changing df_temp = df_input to df_temp = df_input.copy() but can someone illuminate me on what is going on here?
import seaborn as sns
import pandas as pd
df_input = sns.load_dataset('diamonds')
df = df_input.loc[[]]
df_temp = df_input # this is where we're sampling from
n_samples = 1000
for _ in range(n_samples):
    sample = df_temp.sample(1)
    df_temp.drop(index=sample.index, inplace=True)
    df = df.append(sample)
assert((df.index.value_counts() > 1).sum() == 0)
df

Assigning a DataFrame to a new variable does not copy it. After executing df_temp = df_input, both variables refer to the exact same object: not two identical DataFrames, but one DataFrame with two names. Whichever name you use to alter it, the change is visible through the other name too. If you use .copy() you get what you intended, namely two variables referring to two distinct DataFrames.
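A minimal illustration of the difference, using a tiny made-up frame instead of the diamonds dataset (the names alias and independent are only for this example):

```python
import pandas as pd

df_input = pd.DataFrame({"carat": [0.2, 0.3, 0.4]})

alias = df_input               # a second name for the same object
independent = df_input.copy()  # a separate object with its own data

# Dropping a row through one name affects the shared object.
alias.drop(index=0, inplace=True)

print(alias is df_input)  # True
print(len(df_input))      # 2
print(len(independent))   # 3
```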

Updating a list of df variables after modifying a df

I have a list of predictor (X) and an outcome (y) variables from my df. There are 100s of variables in my df so I only care about a few of them below.
X = df[['a', 'b', 'c']]
y = df['d']
I then want to delete all of the rows with missing data for any of my "X" variables, so I ran this:
for i in X:
    df = df[df[i].notna()]
This then leaves me with a modified df with no missing values in the columns of interest. However, my list X and y are still populated with the old df, thus I can not use these as inputs to my model. While I know I could just copy and paste the code I used to create those lists in the first place to "refresh" the code, that seems inefficient. Though I can not seem to think of a better way. Thoughts appreciated!

You can use df.dropna:
X = X.dropna()
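Note that calling dropna on X alone leaves y and df unfiltered, so their rows may no longer line up with X. One possible alternative, sketched here with invented data and a hypothetical predictors list, is to filter df once with the subset argument and then derive X and y from the filtered frame:

```python
import numpy as np
import pandas as pd

# Made-up data standing in for the real df.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
    "d": [0, 1, 0, 1],
    "other": [np.nan, 1.0, 2.0, 3.0],  # NaN here is ignored by the subset
})

predictors = ["a", "b", "c"]

# Drop rows with a missing value in any predictor column only.
df = df.dropna(subset=predictors)

# Derive X and y afterwards so both come from the same filtered rows.
X = df[predictors]
y = df["d"]

print(X.index.tolist(), y.index.tolist())  # [0, 3] [0, 3]
```

Keeping the column names in one list also means nothing has to be copied and pasted when the selection changes.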

pandas appending df1 to df2 get 0s/NaNs in result

I have 2 dataframes. df1 comprises a Series of values.
df1 = pd.DataFrame({'winnings': cumsums_winnings_s, 'returns':cumsums_returns_s, 'spent': cumsums_spent_s, 'runs': cumsums_runs_s, 'wins': cumsums_wins_s, 'expected': cumsums_expected_s}, columns=["winnings", "returns", "runs", "wins", "expected"])
df2 runs each row through a function which takes 3 columns and produces a result for each row - specialSauce
df2= pd.DataFrame(list(map(lambda w,r,e: doStuff(w,r,e), df1['wins'], df1['runs'], df1['expected'])), columns=["specialSauce"])
print(df2.append(df1))
produces all the df1 columns but NaN for the df1 (and vice versa if df1/df2 switched in append)
So the problem I have is how to append these 2 dataframes correctly.

As I understand things, your issue seems to be related to the fact that you get NaN's in the result DataFrame.
The reason for this is that you are trying to .append() one dataframe to the other while they don't have the same columns.
df2 has one extra column, the one created with apply() and doStuff, while df1 does not have that column. When you append one pd.DataFrame to the other, the result has all the columns of both pd.DataFrame objects. Naturally, you will have some NaN's for ['specialSauce'] since this column does not exist in df1.
This would be the same if you were to use pd.concat(), both methods do the same thing in this case. The one thing that you could do to bring the result closer to your desired result is use the ignore_index flag like this:
>> df2.append(df1, ignore_index=True)
This would at least give you a 'fresh' index for the result pd.DataFrame.
EDIT
If what you're looking for is to "append" the result of doStuff to the end of your existing df, in the form of a new column (['specialSauce']), then what you'll have to do is use pd.concat() like this:
>> pd.concat([df1, df2], axis=1)
This will return the result pd.DataFrame as you want it.
If you had a pd.Series to add to the columns of df1 then you'd need to add it like this:
>> df1['specialSauce'] = <'specialSauce values'>
I hope that helps, if not please rephrase the description of what you're after.
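To make the difference concrete, here is a small sketch with made-up numbers in place of the cumulative sums from the question. It only uses pd.concat, with the default axis=0 for stacking rows and axis=1 for placing the frames side by side:

```python
import pandas as pd

# Invented stand-ins for df1 and df2 from the question.
df1 = pd.DataFrame({"wins": [1, 2], "runs": [3, 4], "expected": [0.5, 1.5]})
df2 = pd.DataFrame({"specialSauce": [10, 20]})

# Rows stacked: the columns are the union of both frames, gaps become NaN.
stacked = pd.concat([df1, df2], ignore_index=True)

# Columns placed side by side: rows are aligned on the index.
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked.shape)                         # (4, 4)
print(int(stacked.isna().sum().sum()))       # 8
print(side_by_side.shape)                    # (2, 4)
print(int(side_by_side.isna().sum().sum()))  # 0
```

The axis=1 form only lines up correctly when both frames share the same index, as they do here.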

Ok, there are a couple of things going on here. You've left code out and I had to fill in the gaps. For example you did not define doStuff, so I had to.
doStuff = lambda w, r, e: w + r + e
With that defined, your code does not run. I had to guess what you were trying to do. I'm guessing that you want to have an additional column called 'specialSauce' adjacent to your other columns.
So, this is how I set it up and solved the problem.
Setup and Solution
import pandas as pd
import numpy as np
np.random.seed(314)
df = pd.DataFrame(np.random.randn(100, 6),
                  columns=["winnings", "returns",
                           "spent", "runs",
                           "wins", "expected"]).cumsum()
doStuff = lambda w, r, e: w + r + e
df['specialSauce'] = df[['wins', 'runs', 'expected']].apply(lambda x: doStuff(*x), axis=1)
print(df.head())
   winnings   returns     spent      runs      wins  expected  specialSauce
0  0.166085  0.781964  0.852285 -0.707071 -0.931657  0.886661     -0.752067
1 -0.055704  1.163688  0.079710  0.155916 -1.212917 -0.045265     -1.102266
2 -0.554241  1.928014  0.271214 -0.462848  0.452802  1.692924      1.682878
3  0.627985  3.047389 -1.594841 -1.099262 -0.308115  4.356977      2.949601
4  0.796156  3.228755 -0.273482 -0.661442 -0.111355  2.827409      2.054611
Also
You tried to use pd.DataFrame.append(). Per the linked documentation, it attaches the rows of the DataFrame passed as the argument to the end of the DataFrame it is called on. To place the frames side by side you would have wanted pd.concat() with axis=1 (concat is a top-level pandas function, not a DataFrame method).

Why is it not possible to access other variables from inside the apply function in Python?

Why would the following code not affect the Output DataFrame? (This example is not interesting in itself - it is a convoluted way of 'copying' a DataFrame.)
def getRow(row):
    Output.append(row)
Output = pd.DataFrame()
Input = pd.read_csv('Input.csv')
Input.apply(getRow)
Is there a way of obtaining such a functionality that is using the apply function so that it affects other variables?

What happens
DataFrame.append() returns a new dataframe. It does not modify Output but rather creates a new one every time.
DataFrame.append(self, other, ignore_index=False, verify_integrity=False)
Append rows of other to the end of this frame, returning a new
object. Columns not in this frame are added as new columns.
Here:
Output.append(row)
you create a new dataframe but throw it away immediately.
You have access - But you shouldn't use it in this way
While this works, I strongly recommend against using global:
from pandas import DataFrame
df = DataFrame([1, 2, 3])
df2 = DataFrame()
def get_row(row):
    global df2
    df2 = df2.append(row)
df.apply(get_row)
print(df2)
Output:
   0  1  2
0  1  2  3
Take it as a demonstration of what happens. Don't use it in your code.
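DataFrame.append was removed in pandas 2.0, so on current versions the snippet above no longer runs. If side effects from apply are really what you want, one possible sketch (assuming a recent pandas; the names collected and result are made up here) is to mutate a plain list, which needs no global statement, and combine the pieces once at the end with pd.concat:

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3])
collected = []  # a plain list; list.append mutates it in place


def get_row(row):
    # No `global` needed: the name is not rebound, the list itself is changed.
    collected.append(row)


# With the default axis=0, apply calls get_row once per column.
df.apply(get_row)

# Combine the collected Series once, then transpose so each becomes a row.
result = pd.concat(collected, axis=1).T

print(result.shape)  # (1, 3)
```

This works because list.append changes the existing list, whereas the old DataFrame.append returned a new object each time.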
