Pandas: Knowing when an operation affects the original dataframe

Pandas: Knowing when an operation affects the original dataframe - python

I love pandas and have been using it for years and feel pretty confident I have a good handle on how to subset dataframes and deal with views vs copies appropriately (though I use a lot of assertions to be sure). I also know that there have been tons of questions about SettingWithCopyWarning, e.g. How to deal with SettingWithCopyWarning in Pandas?
and some great recent guides on wrapping your head around when it happens, e.g. Understanding SettingWithCopyWarning in pandas.
But I also know specific things like the quote from this answer are no longer in the most recent docs (0.22.0) and that many things have been deprecated over the years (leading to some inappropriate old SO answers), and that things are continuing to change.
Recently after teaching pandas to complete newcomers with very basic general Python knowledge about things like avoiding chained-indexing (and using .iloc/.loc), I've still struggled to provide general rules of thumb to know when it's important to pay attention to the SettingWithCopyWarning (e.g. when it's safe to ignore it).
I've personally found that the specific pattern of subsetting a dataframe according so some rule (e.g. slicing or boolean operation) and then modifying that subset, independent of the original dataframe, is a much more common operation than the docs suggest. In this situation we want to modify the copy not the original and the warning is confusing/scary to newcomers.
I know it's not trivial to know ahead of time when a view vs a copy is returned, e.g.
What rules does Pandas use to generate a view vs a copy?
Checking whether data frame is copy or view in Pandas
So instead I'm looking for the answer to a more general (beginner friendly) question: when does performing an operation on a subsetted dataframe affect the original dataframe from which it was created, and when are they independent?.
I've created some cases below that I think seem reasonable, but I'm not sure if there's a "gotcha" I'm missing or if there's any easier way to think/check this. I was hoping someone could confirm that my intuitions about the following use cases are correct as the pertain to my question above.
import pandas as pd
df1 = pd.DataFrame({'A':[2,4,6,8,10],'B':[1,3,5,7,9],'C':[10,20,30,40,50]})
1) Warning: No
Original changed: No
# df1 will be unaffected because we use .copy() method explicitly
df2 = df1.copy()
#
# Reference: docs
df2.iloc[0,1] = 100
2) Warning: Yes (I don't really understood why)
Original changed: No
# df1 will be unaffected because .query() always returns a copy
#
# Reference:
# https://stackoverflow.com/a/23296545/8022335
df2 = df1.query('A < 10')
df2.iloc[0,1] = 100
3) Warning: Yes
Original changed: No
# df1 will be unaffected because boolean indexing with .loc
# always returns a copy
#
# Reference:
# https://stackoverflow.com/a/17961468/8022335
df2 = df1.loc[df1['A'] < 10,:]
df2.iloc[0,1] = 100
4) Warning: No
Original changed: No
# df1 will be unaffected because list indexing with .loc (or .iloc)
# always returns a copy
#
# Reference:
# Same as 4)
df2 = df1.loc[[0,3,4],:]
df2.iloc[0,1] = 100
5) Warning: No
Original changed: Yes (confusing to newcomers but makes sense)
# df1 will be affected because scalar/slice indexing with .iloc/.loc
# always references the original dataframe, but may sometimes
# provide a view and sometimes provide a copy
#
# Reference: docs
df2 = df1.loc[:10,:]
df2.iloc[0,1] = 100
tl;dr
When creating a new dataframe from the original, changing the new dataframe:
Will change the original when scalar/slice indexing with .loc/.iloc is used to create the new dataframe.
Will not change the original when boolean indexing with .loc, .query(), or .copy() is used to create the new dataframe

This is a somewhat confusing and even frustrating part of pandas, but for the most part you shouldn't really have to worry about this if you follow some simple workflow rules. In particular, note that there are only two general cases here when you have two dataframes, with one being a subset of the other.
This is a case where the Zen of Python rule "explicit is better than implicit" is a great guideline to follow.
Case A: Changes to df2 should NOT affect df1
This is trivial, of course. You want two completely independent dataframes so you just explicitly make a copy:
df2 = df1.copy()
After this anything you do to df2 affects only df2 and not df1 and vice versa.
Case B: Changes to df2 should ALSO affect df1
In this case I don't think there is one general way to solve the problem because it depends on exactly what you're trying to do. However, there are a couple of standard approaches that are pretty straightforward and should not have any ambiguity about how they are working.
Method 1: Copy df1 to df2, then use df2 to update df1
In this case, you can basically do a one to one conversion of the examples above. Here's example #2:
df2 = df1.copy()
df2 = df1.query('A < 10')
df2.iloc[0,1] = 100
df1 = df2.append(df1).reset_index().drop_duplicates(subset='index').drop(columns='index')
Unfortunately the re-merging via append is a bit verbose there. You can do it more cleanly with the following, although it has the side effect of converting integers to floats.
df1.update(df2) # note that this is an inplace operation
Method 2: Use a mask (don't create df2 at all)
I think the best general approach here is not to create df2 at all, but rather have it be a masked version of df1. Somewhat unfortunately, you can't do a direct translation of the above code due to its mixing of loc and iloc which is fine for this example though probably unrealistic for actual use.
The advantage is that you can write very simple and readable code. Here's an alternative version of example #2 above where df2 is actually just a masked version of df1. But instead of changing via iloc, I'll change if column "C" == 10.
df2_mask = df1['A'] < 10
df1.loc[ df2_mask & (df1['C'] == 10), 'B'] = 100
Now if you print df1 or df1[df2_mask] you will see that column "B" = 100 for the first row of each dataframe. Obviously this is not very surprising here, but that's the inherent advantage of following "explicit is better than implicit".

I have the same doubt, I searched for this response in the past without success. So now, I just certify that original is not changing and use this peace of code to the program at begining to remove warnings:
import pandas as pd
pd.options.mode.chained_assignment = None # default='warn'

Here is and example of the simple effects of why you need .copy()
When the second block of code below is executed the first time it will cut off the "12", the second time it will cut of "34", etc.
df1=pd.DataFrame({'colA':['123456789'],'colB':['123456789']})
df1
| colA| colB
0| 123456789 | 123456789
df2=df1
df2['col1'] = df2['col1'].[2:]
| colA| colB
0| 3456789 | 123456789
df2=df1
df2['col1'] = df2['col1'].[2:]
| colA| colB
0| 56789 | 123456789
The fix, as mentioned is to change the second block:
df2=df1.copy()
df2['col1'] = df2['col1'].[2:]

You only need to replace .iloc[0,1] with .iat[0,1].
More in general if you want to modify only one element you should use .iat or .at method. Instead when you are modifying more elements at one time you should use .loc or .iloc methods.
Doing in this way pandas shuldn't throw any warning.

Related

Why does a DF sometimes automatically update, other times does NOT?

So, I'm unclear why doing certain operations on a DF updates it right away, but other times it does not update it unless you re-use the old name or use a new df variable name.
Doesn't this make it really confusing where the last 'real' change is?

First of all, a df behaves like a list in python, meaning that if you make a soft copy of it and change it, the original df changes too. To answer your question you must know that some methods of updating a df, write on a hard copy version of that df (which are indicated by a warning given by pandas letting you know that you are writing on a copy) so the data might not change. the best and most reliable way to change data using pandas is either:
df.at[a_cell] = some_data
or:
df.loc[some_rows, some_columns] = some_data

Why does Pandas .loc achieve more hits than MultiIndex.intersection?

I have two dataframes, each with 49 layered multiindexes (made up of floats, strings, np.nan, etc) and I'm trying to find the intersection of those multindexes. My initial approach was:
df3 = df1.loc[df2.index]
Which gave me nearly 100% match rate, which was about what I was expecting. Using this method though, pandas was throwing a warning
FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
So, following the suggestion in the documentation that best suited my purpose, I re-implemented my solution to:
df3 = df1.loc[df1.index.intersection(df2.index)]
However, this achieved less than 10% match rate.
I know that the intersection method is missing expected index matches. I validated this with the following
df1.index[0] in df2.index[0:1] # returns true
while
df1.index[0:1].intersection(df2.index[0:1]) # returns empty
How does .loc achieve the appropriate number of matches in nearly the equivalent time while intersection cannot? How can I replicate .loc's performance while still being future proof?
For context, I started with two datetime indexed dataframes with 49 common columns. The data in one of the dataframes is almost a subset of the data in the other (it may have some additional data). Also, the ordering of their indexes cannot be guaranteed to match. I am trying to use the datetime index of the subset dataframe as a reference time for the equivalent data row in the larger dataframe. The solution for this needs to be efficient too. I would appreciate any input on alternative approaches to this problem as well.
EDIT: I avoided using reindex because my index is duplicated, however I realised I could use it to find the index intersection as follows:
temp_df = df1[~df1.index.duplicated()].reindex(df2.index.drop_duplicates())
index_intersection = temp_df[temp_df.SomeColumn.notnull()].index
df3 = df1.loc[index_intersection]

Loop Interpolate and then update values

I have a dataset containing both timeseries and cross-sectional data. There are some missing columns that I want to handle through linear interpolation.
I tried this code but there was a caveat error that appeared. The code still worked but I'm just worried that it might not work after some time. Is there a better way to do this process?
for i in merged_df.country_code.unique():
merged_df[merged_df.country_code == i].interpolate(inplace=True)
Error code below:
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

The problem, as indicated in the doc, that merged_df[merged_df.country_code == i] is a part of your merged_df. Once you try to chain with with some inplace operations, pandas cannot guarantee the operations work on the original dataframe or a copy of it. It's safer to do a copy and reassign with loc:
for i in merged_df.country_code.unique():
mask = merged_df.country_code == i
merged_df.loc[mask] = merged_df.loc[mask].interpolate()
This is, IMHO, one of the reasons why inplace=True is not a good practice.
That said, in this case, you can bypass for loop with a groupby:
merge_df = merge_df.groupby('country_code').interpolate()
or:
merge_df = merge_df.groupby('country_code').apply(lambda x: x.interpolate())

The problem is that pandas can't guarantee that the object that you're assigning the new data to is a temporary object or the correct object. Whilst it will probably work, it's better you use
merged_df.loc[merged_df["country_code"]==i,0].interpolate(inplace=True)
As this guarantees that you use the correct object.

Dataframe .where() method returns None [duplicate]

In the pandas library many times there is an option to change the object inplace such as with the following statement...
df.dropna(axis='index', how='all', inplace=True)
I am curious what is being returned as well as how the object is handled when inplace=True is passed vs. when inplace=False.
Are all operations modifying self when inplace=True? And when inplace=False is a new object created immediately such as new_df = self and then new_df is returned?
If you are trying to close a question where someone should use inplace=True and hasn't, consider replace() method not working on Pandas DataFrame instead.

When inplace=True is passed, the data is renamed in place (it returns nothing), so you'd use:
df.an_operation(inplace=True)
When inplace=False is passed (this is the default value, so isn't necessary), performs the operation and returns a copy of the object, so you'd use:
df = df.an_operation(inplace=False)

In pandas, is inplace = True considered harmful, or not?
TLDR; Yes, yes it is.
inplace, contrary to what the name implies, often does not prevent copies from being created, and (almost) never offers any performance benefits
inplace does not work with method chaining
inplace can lead to SettingWithCopyWarning if used on a DataFrame column, and may prevent the operation from going though, leading to hard-to-debug errors in code
The pain points above are common pitfalls for beginners, so removing this option will simplify the API.
I don't advise setting this parameter as it serves little purpose. See this GitHub issue which proposes the inplace argument be deprecated api-wide.
It is a common misconception that using inplace=True will lead to more efficient or optimized code. In reality, there are absolutely no performance benefits to using inplace=True. Both the in-place and out-of-place versions create a copy of the data anyway, with the in-place version automatically assigning the copy back.
inplace=True is a common pitfall for beginners. For example, it can trigger the SettingWithCopyWarning:
df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})
df2 = df[df['a'] > 1]
df2['b'].replace({'x': 'abc'}, inplace=True)
# SettingWithCopyWarning:
# A value is trying to be set on a copy of a slice from a DataFrame
Calling a function on a DataFrame column with inplace=True may or may not work. This is especially true when chained indexing is involved.
As if the problems described above aren't enough, inplace=True also hinders method chaining. Contrast the working of
result = df.some_function1().reset_index().some_function2()
As opposed to
temp = df.some_function1()
temp.reset_index(inplace=True)
result = temp.some_function2()
The former lends itself to better code organization and readability.
Another supporting claim is that the API for set_axis was recently changed such that inplace default value was switched from True to False. See GH27600. Great job devs!

The way I use it is
# Have to assign back to dataframe (because it is a new copy)
df = df.some_operation(inplace=False)
Or
# No need to assign back to dataframe (because it is on the same copy)
df.some_operation(inplace=True)
CONCLUSION:
if inplace is False
Assign to a new variable;
else
No need to assign

The inplace parameter:
df.dropna(axis='index', how='all', inplace=True)
in Pandas and in general means:
1. Pandas creates a copy of the original data
2. ... does some computation on it
3. ... assigns the results to the original data.
4. ... deletes the copy.
As you can read in the rest of my answer's further below, we still can have good reason to use this parameter i.e. the inplace operations, but we should avoid it if we can, as it generate more issues, as:
1. Your code will be harder to debug (Actually SettingwithCopyWarning stands for warning you to this possible problem)
2. Conflict with method chaining
So there is even case when we should use it yet?
Definitely yes. If we use pandas or any tool for handeling huge dataset, we can easily face the situation, where some big data can consume our entire memory.
To avoid this unwanted effect we can use some technics like method chaining:
(
wine.rename(columns={"color_intensity": "ci"})
.assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
.query("alcohol > 14 and color_filter == 1")
.sort_values("alcohol", ascending=False)
.reset_index(drop=True)
.loc[:, ["alcohol", "ci", "hue"]]
)
which make our code more compact (though harder to interpret and debug too) and consumes less memory as the chained methods works with the other method's returned values, thus resulting in only one copy of the input data. We can see clearly, that we will have 2 x original data memory consumption after this operations.
Or we can use inplace parameter (though harder to interpret and debug too) our memory consumption will be 2 x original data, but our memory consumption after this operation remains 1 x original data, which if somebody whenever worked with huge datasets exactly knows can be a big benefit.
Final conclusion:
Avoid using inplace parameter unless you don't work with huge data and be aware of its possible issues in case of still using of it.

Save it to the same variable
data["column01"].where(data["column01"]< 5, inplace=True)
Save it to a separate variable
data["column02"] = data["column01"].where(data["column1"]< 5)
But, you can always overwrite the variable
data["column01"] = data["column01"].where(data["column1"]< 5)
FYI: In default inplace = False

When trying to make changes to a Pandas dataframe using a function, we use 'inplace=True' if we want to commit the changes to the dataframe.
Therefore, the first line in the following code changes the name of the first column in 'df' to 'Grades'. We need to call the database if we want to see the resulting database.
df.rename(columns={0: 'Grades'}, inplace=True)
df
We use 'inplace=False' (this is also the default value) when we don't want to commit the changes but just print the resulting database. So, in effect a copy of the original database with the committed changes is printed without altering the original database.
Just to be more clear, the following codes do the same thing:
#Code 1
df.rename(columns={0: 'Grades'}, inplace=True)
#Code 2
df=df.rename(columns={0: 'Grades'}, inplace=False}

Yes, in Pandas we have many functions has the parameter inplace but by default it is assigned to False.
So, when you do df.dropna(axis='index', how='all', inplace=False) it thinks that you do not want to change the orignial DataFrame, therefore it instead creates a new copy for you with the required changes.
But, when you change the inplace parameter to True
Then it is equivalent to explicitly say that I do not want a new copy
of the DataFrame instead do the changes on the given DataFrame
This forces the Python interpreter to not to create a new DataFrame
But you can also avoid using the inplace parameter by reassigning the result to the orignal DataFrame
df = df.dropna(axis='index', how='all')

inplace=True is used depending if you want to make changes to the original df or not.
df.drop_duplicates()
will only make a view of dropped values but not make any changes to df
df.drop_duplicates(inplace = True)
will drop values and make changes to df.
Hope this helps.:)

inplace=True makes the function impure. It changes the original dataframe and returns None. In that case, You breaks the DSL chain.
Because most of dataframe functions return a new dataframe, you can use the DSL conveniently. Like
df.sort_values().rename().to_csv()
Function call with inplace=True returns None and DSL chain is broken. For example
df.sort_values(inplace=True).rename().to_csv()
will throw NoneType object has no attribute 'rename'
Something similar with python’s build-in sort and sorted. lst.sort() returns None and sorted(lst) returns a new list.
Generally, do not use inplace=True unless you have specific reason of doing so. When you have to write reassignment code like df = df.sort_values(), try attaching the function call in the DSL chain, e.g.
df = pd.read_csv().sort_values()...

As Far my experience in pandas I would like to answer.
The 'inplace=True' argument stands for the data frame has to make changes permanent
eg.
df.dropna(axis='index', how='all', inplace=True)
changes the same dataframe (as this pandas find NaN entries in index and drops them).
If we try
df.dropna(axis='index', how='all')
pandas shows the dataframe with changes we make but will not modify the original dataframe 'df'.

If you don't use inplace=True or you use inplace=False you basically get back a copy.
So for instance:
testdf.sort_values(inplace=True, by='volume', ascending=False)
will alter the structure with the data sorted in descending order.
then:
testdf2 = testdf.sort_values( by='volume', ascending=True)
will make testdf2 a copy. the values will all be the same but the sort will be reversed and you will have an independent object.
then given another column, say LongMA and you do:
testdf2.LongMA = testdf2.LongMA -1
the LongMA column in testdf will have the original values and testdf2 will have the decrimented values.
It is important to keep track of the difference as the chain of calculations grows and the copies of dataframes have their own lifecycle.

Python dataframe; trouble changing value of column with multiple filters

I have a large dataframe I took off an ODBC database. The Dataframe has multiple columns; I'm trying to change the values of one column by filtering two other.
First, I filter my dataframe data_prem with both conditions which gives me the correct rows:
data_prem[(data_prem['PRODUCT_NAME']=='ŽZ08') & (data_prem['BENEFIT'].str.contains('19.08.16'))]
Then I use the replace function on the selection to change 'M' value to 'H' value:
data_prem[(data_prem['PRODUCT_NAME']=='ŽZ08') & (data_prem['BENEFIT'].str.contains('19.08.16'))]['Reinsurer'].replace(to_replace='M',value='H',inplace=True,regex=True)
Python warns me I'm trying to modify a copy of the dataframe, even though I'm clearly refering to the original dataframe (I'm posting image so you can see my results).
dataframe filtering
I also tried using .loc function in the following manner:
data_prem.loc[((data_prem['PRODUCT_NAME']=='ŽZ08') & (data_prem['BENEFIT'].str.contains('19.08.16'))),'Reinsurer'] = 'H'
which changed all rows that fit the second condition (str.contains...), but it didn't apply the first condition. I got replacements in the 'Reinsurer' column for other 'PRODUCT_NAME' values as well.
I've been scouring the web for an answer to this for some time. I've seen some mentions of a bug in the pandas library, not sure if this is what they were talking about.
I would value any opinions you might have, would also be interesting in alternative ways to solving this problem. I filled the 'Reinsurer' column with the map function with 'PRODUCT_NAME' as the input (had a dictionary that connected all 'PRODUCT_NAME' values with 'Reinsurer' values).

Given your Boolean mask, you've demonstrated two ways of applying chained indexing. This is the cause of the warning and the reason why you aren't seeing your logic being applied as you anticipate.
mask = (data_prem['PRODUCT_NAME']=='ŽZ08') & df['BENEFIT'].str.contains('19.08.16')
Chained indexing: Example #1
df[mask]['Reinsurer'].replace(to_replace='M', value='H', inplace=True, regex=True)
Chained indexing: Example #2
df[mask].loc[mask, 'Reinsurer'] = 'H'
Avoid chained indexing
You can keep things simple by applying your mask once and using a single loc call:
df.loc[mask, 'Reinsurer'] = 'H'

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas: Knowing when an operation affects the original dataframe - python

I have the same doubt, I searched for this response in the past without success. So now, I just certify that original is not changing and use this peace of code to the program at begining to remove warnings: import pandas as pd pd.options.mode.chained_assignment = None # default='warn'

You only need to replace .iloc[0,1] with .iat[0,1]. More in general if you want to modify only one element you should use .iat or .at method. Instead when you are modifying more elements at one time you should use .loc or .iloc methods. Doing in this way pandas shuldn't throw any warning.

Related

Why does a DF sometimes automatically update, other times does NOT?

Why does Pandas .loc achieve more hits than MultiIndex.intersection?

Loop Interpolate and then update values

Dataframe .where() method returns None [duplicate]

Python dataframe; trouble changing value of column with multiple filters

Categories

Resources