I'm trying to create a column in a dataframe using the following code:
df['engagement_clicks_event_subscribers'] = df['brands_publishers_lifetime_clicks'] / df['brands_publishers_lifetime_events'] / df['content_subscriber_count']
For some reason, this simply does NOT run. It throws a SettingWithCopyWarning, and no new column is created.
The most confusing part to me is that it does run on a simpler mock dataframe that I just took from the pandas documentation, with no warnings: https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html
The only thing I can think of here is that there is a dtype issue causing a copy to be created instead of a view.
These columns are int64:
df['brands_publishers_lifetime_clicks']
df['brands_publishers_lifetime_events']
This one is float64:
df['content_subscriber_count']
There are no NaN or null values.
Any tips here would be appreciated, it's driving me fairly nuts.
Edit to include the full set of warnings I get from running this line:
/Users/sangbinlee/PycharmProjects/pr-seismic/data_shaping.py:41: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df['engagement_clicks_event_subscribers'] = df['brands_publishers_lifetime_clicks']/ df['brands_publishers_lifetime_events'] / df['content_subscriber_count']
I've also tried, per the documentation in the warning:
df.loc[:,'engagement_clicks_event_subscribers'] = df['brands_publishers_lifetime_clicks']/ df['brands_publishers_lifetime_events'] / df['content_subscriber_count']
This generated three times as many SettingWithCopyWarnings and also doesn't work.
I figured it out.
Purely a PyCharm issue.
I'm using the PyCharm debugger and, for some reason, the dataframe doesn't update in the "Variables" window of the debugger.
When I use Data View to view it as a dataframe, I can see the columns; I imagine it would work the same way if I were to do .head().
Seems like a known issue: https://youtrack.jetbrains.com/issue/PY-22369
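For anyone hitting the same thing: a quick way to confirm the assignment really happened, independent of what the debugger's Variables window shows, is to print directly. A minimal sketch with made-up values standing in for the question's data:

import pandas as pd

# Toy frame using the question's column names (the values are invented).
df = pd.DataFrame({
    'brands_publishers_lifetime_clicks': [10, 20],
    'brands_publishers_lifetime_events': [2, 4],
    'content_subscriber_count': [100.0, 200.0],
})

df['engagement_clicks_event_subscribers'] = (
    df['brands_publishers_lifetime_clicks']
    / df['brands_publishers_lifetime_events']
    / df['content_subscriber_count']
)

# Both confirm the column exists regardless of the debugger display.
print('engagement_clicks_event_subscribers' in df.columns)  # True
print(df.head())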
I have a dataset containing both timeseries and cross-sectional data. There are some missing columns that I want to handle through linear interpolation.
I tried the code below, but a SettingWithCopyWarning appeared. The code still worked, but I'm worried that it might stop working at some point. Is there a better way to do this?
for i in merged_df.country_code.unique():
    merged_df[merged_df.country_code == i].interpolate(inplace=True)
Warning below:
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
The problem, as indicated in the doc, is that merged_df[merged_df.country_code == i] is a part of your merged_df. Once you chain it with some inplace operations, pandas cannot guarantee whether the operations work on the original dataframe or on a copy of it. It's safer to make a copy and reassign with loc:
for i in merged_df.country_code.unique():
    mask = merged_df.country_code == i
    merged_df.loc[mask] = merged_df.loc[mask].interpolate()
This is, IMHO, one of the reasons why inplace=True is not a good practice.
That said, in this case, you can bypass for loop with a groupby:
merged_df = merged_df.groupby('country_code').interpolate()
or:
merged_df = merged_df.groupby('country_code').apply(lambda x: x.interpolate())
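To illustrate, here is a minimal sketch on an invented toy frame (the gdp column and country values are assumptions, not from the question); group_keys=False keeps the original index so the result aligns row-for-row with the input:

import pandas as pd
import numpy as np

# Invented stand-in for merged_df: two countries, each with a gap.
toy = pd.DataFrame({
    'country_code': ['US', 'US', 'US', 'FR', 'FR', 'FR'],
    'gdp': [1.0, np.nan, 3.0, 10.0, np.nan, 30.0],
})

# Interpolate within each country only.
toy['gdp'] = toy.groupby('country_code', group_keys=False)['gdp'].apply(
    lambda s: s.interpolate()
)
print(toy)  # US gap filled with 2.0, FR gap filled with 20.0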
The problem is that pandas can't guarantee whether the object you're assigning the new data to is a temporary object or the original one. While it will probably work, it's better to use
merged_df.loc[merged_df["country_code"] == i] = merged_df.loc[merged_df["country_code"] == i].interpolate()
as this guarantees that you operate on the correct object.
In the pandas library, there is often an option to change an object in place, such as with the following statement...
df.dropna(axis='index', how='all', inplace=True)
I am curious what is being returned as well as how the object is handled when inplace=True is passed vs. when inplace=False.
Are all operations modifying self when inplace=True? And when inplace=False, is a new object created immediately, such as new_df = self, and then new_df returned?
If you are trying to close a question where someone should use inplace=True and hasn't, consider replace() method not working on Pandas DataFrame instead.
When inplace=True is passed, the data is modified in place (nothing is returned), so you'd use:
df.an_operation(inplace=True)
When inplace=False is passed (this is the default value, so passing it isn't necessary), the operation is performed and a modified copy of the object is returned, so you'd use:
df = df.an_operation(inplace=False)
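A quick illustration of the difference, using dropna from the question (a minimal, self-contained sketch):

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [2.0, np.nan]})

# Default (inplace=False): a modified copy is returned, df is untouched.
dropped = df.dropna(axis='index', how='all')
print(len(df), len(dropped))  # 2 1

# inplace=True: df itself is modified and the call returns None.
result = df.dropna(axis='index', how='all', inplace=True)
print(result)   # None
print(len(df))  # 1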
In pandas, is inplace = True considered harmful, or not?
TLDR; Yes, yes it is.
inplace, contrary to what the name implies, often does not prevent copies from being created, and (almost) never offers any performance benefits
inplace does not work with method chaining
inplace can lead to SettingWithCopyWarning if used on a DataFrame column, and may prevent the operation from going through, leading to hard-to-debug errors in code
The pain points above are common pitfalls for beginners, so removing this option will simplify the API.
I don't advise setting this parameter as it serves little purpose. See this GitHub issue which proposes the inplace argument be deprecated api-wide.
It is a common misconception that using inplace=True will lead to more efficient or optimized code. In reality, there are absolutely no performance benefits to using inplace=True. Both the in-place and out-of-place versions create a copy of the data anyway, with the in-place version automatically assigning the copy back.
inplace=True is a common pitfall for beginners. For example, it can trigger the SettingWithCopyWarning:
df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})
df2 = df[df['a'] > 1]
df2['b'].replace({'x': 'abc'}, inplace=True)
# SettingWithCopyWarning:
# A value is trying to be set on a copy of a slice from a DataFrame
Calling a function on a DataFrame column with inplace=True may or may not work. This is especially true when chained indexing is involved.
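One way to sidestep the warning in the example above is to take an explicit copy of the slice and assign the result back, instead of mutating through inplace (a sketch using the same toy frame):

import pandas as pd

df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})

# Take an explicit copy so pandas knows df2 is its own object...
df2 = df[df['a'] > 1].copy()

# ...then assign the result back instead of using inplace=True.
df2['b'] = df2['b'].replace({'x': 'abc'})
print(df2)  # no SettingWithCopyWarning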
As if the problems described above aren't enough, inplace=True also hinders method chaining. Contrast the working of
result = df.some_function1().reset_index().some_function2()
with
temp = df.some_function1()
temp.reset_index(inplace=True)
result = temp.some_function2()
The former lends itself to better code organization and readability.
Another supporting point is that the API for set_axis was recently changed such that the default value of inplace was switched from True to False. See GH27600. Great job, devs!
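As a quick illustration of the current behaviour (a sketch; in recent pandas versions set_axis returns a new object by default, so the result must be assigned back):

import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})

# set_axis returns a new frame by default, so assign it back.
df = df.set_axis(['col1', 'col2'], axis=1)
print(df.columns.tolist())  # ['col1', 'col2']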
The way I use it is
# Have to assign back to dataframe (because it is a new copy)
df = df.some_operation(inplace=False)
Or
# No need to assign back to dataframe (because it modifies the same object)
df.some_operation(inplace=True)
CONCLUSION:
if inplace is False:
    assign to a new variable
else:
    no need to assign
The inplace parameter:
df.dropna(axis='index', how='all', inplace=True)
in Pandas and in general means:
1. Pandas creates a copy of the original data
2. ... does some computation on it
3. ... assigns the results to the original data.
4. ... deletes the copy.
As you can read further below in my answer, we can still have good reason to use this parameter, i.e. for in-place operations, but we should avoid it if we can, as it generates more issues:
1. Your code will be harder to debug (the SettingWithCopyWarning actually exists to warn you about this possible problem)
2. It conflicts with method chaining
So is there any case when we should still use it?
Definitely yes. If we use pandas or any tool for handling huge datasets, we can easily face the situation where some big data consumes our entire memory.
To avoid this unwanted effect we can use some techniques like method chaining:
# assumes: import numpy as np, and a `wine` DataFrame loaded beforehand
(
    wine.rename(columns={"color_intensity": "ci"})
    .assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
    .query("alcohol > 14 and color_filter == 1")
    .sort_values("alcohol", ascending=False)
    .reset_index(drop=True)
    .loc[:, ["alcohol", "ci", "hue"]]
)
which makes our code more compact (though harder to interpret and debug too) and consumes less memory, as each chained method works on the previous method's return value, resulting in only one copy of the input data. After these operations we clearly have 2 x the original data's memory consumption (the input plus the result).
Or we can use the inplace parameter (though it is harder to interpret and debug too): the peak memory consumption will still be 2 x the original data, but the memory consumption after the operation remains 1 x the original data, which, as anybody who has ever worked with huge datasets knows, can be a big benefit.
Final conclusion:
Avoid using the inplace parameter unless you work with huge data, and be aware of its possible issues in case you do still use it.
Save it to the same variable
data["column01"].where(data["column01"]< 5, inplace=True)
Save it to a separate variable
data["column02"] = data["column01"].where(data["column1"]< 5)
But, you can always overwrite the variable
data["column01"] = data["column01"].where(data["column1"]< 5)
FYI: In default inplace = False
When trying to make changes to a Pandas dataframe using a function, we use 'inplace=True' if we want to commit the changes to the dataframe.
Therefore, the first line in the following code changes the name of the first column in 'df' to 'Grades'. We need to call the dataframe if we want to see the resulting dataframe.
df.rename(columns={0: 'Grades'}, inplace=True)
df
We use 'inplace=False' (this is also the default value) when we don't want to commit the changes but just print the resulting dataframe. So, in effect, a copy of the original dataframe with the committed changes is printed without altering the original dataframe.
Just to be clearer, the following two pieces of code do the same thing:
#Code 1
df.rename(columns={0: 'Grades'}, inplace=True)
#Code 2
df = df.rename(columns={0: 'Grades'}, inplace=False)
Yes, in pandas many functions have the parameter inplace, but by default it is set to False.
So, when you do df.dropna(axis='index', how='all', inplace=False) it assumes that you do not want to change the original DataFrame, and therefore it creates a new copy for you with the required changes.
But when you change the inplace parameter to True,
then it is equivalent to explicitly saying: I do not want a new copy
of the DataFrame; instead, make the changes on the given DataFrame.
This forces pandas not to create a new DataFrame.
But you can also avoid using the inplace parameter by reassigning the result to the original DataFrame:
df = df.dropna(axis='index', how='all')
inplace=True is used depending on whether you want to make changes to the original df or not.
df.drop_duplicates()
will only return a new dataframe with the duplicates dropped, but will not make any changes to df
df.drop_duplicates(inplace = True)
will drop the duplicate values and make the changes to df.
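A minimal sketch of both behaviours on a throwaway frame:

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2]})

# Without inplace: a new deduplicated frame is returned, df is unchanged.
deduped = df.drop_duplicates()
print(len(df), len(deduped))  # 3 2

# With inplace=True: df itself is modified and None is returned.
df.drop_duplicates(inplace=True)
print(len(df))  # 2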
Hope this helps. :)
inplace=True makes the function impure: it changes the original dataframe and returns None. In that case, you break the DSL chain.
Because most dataframe functions return a new dataframe, you can use the DSL conveniently, like
df.sort_values().rename().to_csv()
A function call with inplace=True returns None, and the DSL chain is broken. For example,
df.sort_values(inplace=True).rename().to_csv()
will throw AttributeError: 'NoneType' object has no attribute 'rename'.
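A minimal, self-contained sketch of the broken chain (the column name b is just an invented example):

import pandas as pd

df = pd.DataFrame({'b': [2, 1]})

# Chaining works because sort_values returns a new DataFrame...
out = df.sort_values('b').rename(columns={'b': 'sorted_b'})

# ...but with inplace=True the call returns None, so chaining fails.
try:
    df.sort_values('b', inplace=True).rename(columns={'b': 'sorted_b'})
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'rename'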
Something similar happens with Python's built-in sort and sorted: lst.sort() returns None, while sorted(lst) returns a new list.
Generally, do not use inplace=True unless you have a specific reason for doing so. When you have to write reassignment code like df = df.sort_values(), try attaching the function call to the DSL chain, e.g.
df = pd.read_csv().sort_values()...
As far as my experience with pandas goes, I would like to answer.
The inplace=True argument means that the dataframe has to make the changes permanent.
e.g.
df.dropna(axis='index', how='all', inplace=True)
changes the same dataframe (here pandas finds the rows where all entries are NaN and drops them).
If we try
df.dropna(axis='index', how='all')
pandas returns the dataframe with the changes we made, but does not modify the original dataframe 'df'.
If you don't use inplace=True, or you use inplace=False, you basically get back a copy.
So for instance:
testdf.sort_values(inplace=True, by='volume', ascending=False)
will alter the structure with the data sorted in descending order.
then:
testdf2 = testdf.sort_values( by='volume', ascending=True)
will make testdf2 a copy. The values will all be the same, but the sort will be reversed, and you will have an independent object.
then, given another column, say LongMA, if you do:
testdf2.LongMA = testdf2.LongMA - 1
the LongMA column in testdf will have the original values and testdf2 will have the decremented values.
It is important to keep track of the difference as the chain of calculations grows and the copies of dataframes have their own lifecycle.
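Putting the above together in a runnable sketch (toy values; volume and LongMA as in the answer):

import pandas as pd

testdf = pd.DataFrame({'volume': [1, 3, 2], 'LongMA': [10.0, 30.0, 20.0]})

# Sort testdf itself, descending.
testdf.sort_values(inplace=True, by='volume', ascending=False)

# testdf2 is an independent copy, sorted the other way.
testdf2 = testdf.sort_values(by='volume', ascending=True)

# Modifying the copy leaves the original untouched.
testdf2['LongMA'] = testdf2['LongMA'] - 1
print(testdf['LongMA'].tolist())   # original values
print(testdf2['LongMA'].tolist())  # decremented values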
Background
I've got two DataFrames of timestamped-ids (the index is the id). I want to get all of the ids where the timestamps differ by, say, 5 minutes.
Code
time_delta = abs(df2.time - df1.time).dt.total_seconds()
ids_out_of_range = df1[time_delta > 300].index
This gives me the ids I want, so it is working code.
Problem
Like many, I face this warning:
file.py:33: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
ids_out_of_range = df1[time_delta > 300].index
Most explanations center on the "length" of the index differing from the "length" of the dataframe. But:
(Pdb) time_delta.shape
(176,)
(Pdb) df1.shape
(176, 1)
(Pdb) sorted(time_delta.index.values.tolist()) == sorted(df1.index.values.tolist())
True
The shapes are the same, except that one is a Series and the other is a DataFrame. The indices appear to be the same; perhaps the ordering is the issue? (They did not compare equal without sorted.)
(I've tried wrapping time_delta in a DataFrame, to no avail.)
Long-term, I would like this warning to go away (and not with 2>/dev/null, thank you). It's visual clutter in the output of my script, and, well, it is a warning—so theoretically I should pay attention to it.
Question
What am I doing "wrong" that I get this warning, since the sizes seem to be right?
How do I fix (1) so I can avoid this warning?
The warning is saying that the index of your time_delta differs from the index of df1.
But when I tried to reproduce the warning, it didn't show up. I'm using pandas 0.25.1, so if you are using a different version, that might explain the warning.
Please refer to this page for suppressing warnings
The following fixed my issue:
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
time_delta.sort_index(inplace=True)
This allowed the indices to align perfectly, so they must not have been in the same order with respect to each other.
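For reference, here is a minimal sketch (with invented labels) showing that same-labelled indices in a different order are enough to trigger the warning, and that sorting both aligns them:

import pandas as pd

df1 = pd.DataFrame({'time': [1, 2, 3]}, index=['b', 'a', 'c'])
mask = pd.Series([True, False, True], index=['a', 'b', 'c'])

# Same labels, different order: pandas reindexes the mask and warns
# "Boolean Series key will be reindexed to match DataFrame index."
print(df1[mask].index.tolist())

# Sorting both indices first makes them identical, so no warning.
df1 = df1.sort_index()
mask = mask.sort_index()
print(df1[mask].index.tolist())  # ['a', 'c']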
I have a dataframe which looks like this
I tried to delete the matchId column for preprocessing, but no matter what I use to delete it, it outputs this error:
KeyError: "['matchId'] not found in axis"
What you attempted to do (which you should have mentioned in the question) is probably failing because you assume that the matchId column is a normal column. It is actually the index column, and so cannot be accessed the same way other columns can be accessed.
As suggested by anky_91, because of that, you should do
df = df.reset_index(drop=True)
if you want to completely remove the indexes in your table. This will replace them with the default indexes. To just make them into another column, you can just remove the drop=True from the above statement.
Your table will always have an index, however, so you cannot completely get rid of it.
You can, however, output it with
df.values
and this will ignore the indexes and show just the values as arrays.
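A small sketch of the situation (a hypothetical frame where matchId is the index rather than a column):

import pandas as pd

# Hypothetical frame: matchId is the index, not a regular column.
df = pd.DataFrame({'kills': [3, 5]},
                  index=pd.Index([101, 102], name='matchId'))

# Dropping it as a column fails because it is not on the columns axis:
# df.drop(columns=['matchId'])  # KeyError: "['matchId'] not found in axis"

# reset_index(drop=True) discards the index entirely and replaces it
# with the default RangeIndex; without drop=True it becomes a column.
df = df.reset_index(drop=True)
print(df)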