I've been reading over this and still find the subject a little confusing:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Say I have a Pandas DataFrame and I wish to simultaneously set the first and last row elements of a single column to some value. I can do this:
df.iloc[[0, -1]].mycol = [1, 2]
which warns that "A value is trying to be set on a copy of a slice from a DataFrame" and that this is potentially dangerous.
I could use .loc instead, but then I need to know the index labels of the first and last rows (in contrast, .iloc allows me to access by location).
What's the safest Pandasy way to do this?
To get to this point :
# Django queryset
query = market.stats_set.annotate(distance=F("end_date") - query_date)
# Generate a dataframe from this queryset, and order by distance
df = pd.DataFrame.from_records(query.values("distance", *fields), coerce_float=True)
df = df.sort_values("distance").reset_index(drop=True)
Then, I try calling df.distance.iloc[[0, -1]] = [1, 2]. This raises the warning.
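The issue isn't with iloc itself but with the chaining: df.iloc[[0, -1]] returns a new object, and accessing .mycol then assigns into that temporary copy rather than into df. You can do this all within a single iloc call: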
df.iloc[[0, -1], df.columns.get_loc('mycol')] = [1, 2]
Usually ix was used if you wanted mixed integer- and label-based access, but it doesn't work in this case since -1 isn't actually in the index, and apparently ix isn't smart enough to know it should be the last index. (Note that ix has since been deprecated and removed from pandas.)
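As a minimal sketch of the single-call iloc approach above: the frame and its values below are made up for illustration, and only the column name mycol comes from the question.

```python
import pandas as pd

# Hypothetical sample frame; only the column name 'mycol' comes from the question.
df = pd.DataFrame({"mycol": [10, 20, 30, 40], "other": [0.1, 0.2, 0.3, 0.4]})

# Translate the column label into its integer position once...
col_pos = df.columns.get_loc("mycol")

# ...then set the first and last rows in one positional indexing call.
df.iloc[[0, -1], col_pos] = [1, 2]

print(df)
```

Because rows and column are addressed in one indexing operation, there is no intermediate object for pandas to warn about.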
What you're doing is called chained indexing. You can use iloc on just that column to avoid the warning:
In [24]:
df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
df
Out[24]:
a b c
0 1.589940 0.735713 -1.158907
1 0.485653 0.044611 0.070907
2 1.123221 -0.862393 -0.807051
3 0.338653 -0.734169 -0.070471
4 0.344794 1.095861 -1.300339
In [25]:
df['a'].iloc[[0,-1]] ='foo'
df
Out[25]:
a b c
0 foo 0.735713 -1.158907
1 0.485653 0.044611 0.070907
2 1.12322 -0.862393 -0.807051
3 0.338653 -0.734169 -0.070471
4 foo 1.095861 -1.300339
If you do it the other way then it raises the warning:
In [27]:
df.iloc[[0,-1]]['a'] ='foo'
C:\WinPython-64bit-3.4.3.1\python-3.4.3.amd64\lib\site-packages\IPython\kernel\__main__.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
if __name__ == '__main__':
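To see why the second form is risky, here is a small sketch (my own example, not from the answer above) that checks whether the original frame actually changes. It uses a numeric value and an all-zeros frame to keep the output predictable, and it silences the warning only for the sake of the demonstration.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((5, 3)), columns=list("abc"))

with warnings.catch_warnings():
    # Silence the chained-assignment warning for this demonstration only.
    warnings.simplefilter("ignore")
    # iloc with a list of positions returns a new object,
    # so this assignment lands on a temporary, not on df.
    df.iloc[[0, -1]]["a"] = 1.0

unchanged = df["a"].tolist()  # still all zeros

# Addressing rows and column in one call writes to df itself.
df.iloc[[0, -1], df.columns.get_loc("a")] = 1.0
changed = df["a"].tolist()

print(unchanged)
print(changed)
```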
df = pd.DataFrame(np.zeros((6, 2)), columns=['A', 'B'])
I want to assign a value to the last value of column B. In addition to the following two methods, is there an easier way?
df.at[df.index[-1], 'B'] = 1
df.loc[df.index[-1], 'B'] = 1
Using df.iloc[-1]['B'] raises a warning.
df.iloc[-1, df.columns.get_loc('B')] is the answer I want.
df.iloc[-1, df.columns.get_loc('B')] = 1
df.iloc[-1]['B'] = 1
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
This is a mixture of label-based indexing and positional indexing, so your solutions are nice already, but you could do:
df.iloc[-1, df.columns.get_loc('B')] = 1
This is pretty neat. It looks long, but it's just the function name that's long.
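Or why not just the following?
df.iloc[-1]['B'] = 1
If you know the index of column B, then:
df.iloc[-1,1] = 1
Otherwise:
df.iloc[-1].B = 1
Note that both df.iloc[-1]['B'] and df.iloc[-1].B are chained indexing, so they trigger the warning quoted in the question and may not update df at all.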
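For comparison, here is a short sketch of the non-chained options mentioned in this thread. The second frame, dup, is a made-up example added to show one caveat of the label-based forms: they assume the last index label is unique.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((6, 2)), columns=["A", "B"])

# Positional: last row, with the column position looked up from its label.
df.iloc[-1, df.columns.get_loc("B")] = 1

# Label-based: fine here because the index labels are unique.
df.at[df.index[-1], "A"] = 2

# Made-up frame with a duplicated index label: .loc updates every matching row.
dup = pd.DataFrame({"B": [0.0, 0.0, 0.0]}, index=[0, 1, 1])
dup.loc[dup.index[-1], "B"] = 1

print(df)
print(dup)
```

If the index may contain duplicates, the positional iloc form is the one that reliably touches only the last row.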
I'm struggling to understand the concept behind column naming conventions, given that one of the following attempts to create a new column appears to fail:
from numpy.random import randn
import pandas as pd
df = pd.DataFrame({'a':range(0,10,2), 'c':range(0,1000,200)},
                  columns=list('ac'))
df['b'] = 10*df.a
df
gives the following result:
   a    c   b
0  0    0   0
1  2  200  20
2  4  400  40
3  6  600  60
4  8  800  80
Yet, if I were to try to create column b by substituting with the following line, there is no error message, yet the dataframe df remains with only the columns a and c.
df.b = 10*df.a ### rather than the previous df['b'] = 10*df.a ###
What has pandas done and why is my command incorrect?
What you did was add an attribute b to your df:
In [70]:
df.b = 10*df.a
df.b
Out[70]:
0 0
1 20
2 40
3 60
4 80
Name: a, dtype: int32
but we see that no new column has been added:
In [73]:
df.columns
Out[73]:
Index(['a', 'c'], dtype='object')
which means we would get a KeyError if we tried df['b']. To avoid this ambiguity, you should always use square brackets when assigning.
For instance, if you had a column named index, sum or max, then df.index would return the index and not the index column. Similarly, df.sum and df.max would return the DataFrame methods rather than your columns, and assigning to them would shadow those methods.
I strongly advise always using square brackets: it avoids any ambiguity, and the latest IPython is able to resolve column names using square brackets. It's also useful to think of a DataFrame as a dict of Series, in which case it makes sense to use square brackets for assigning and returning a column.
Always use square brackets for assigning columns
Dot notation is a convenience for accessing columns in a dataframe. If they conflict with existing properties (e.g. if you had a column named 'max'), then you need to use square brackets to access that column, e.g. df['max']. You also need to use square brackets when the column name contains spaces, e.g. df['max value'].
A DataFrame is just an object which has the usual attributes and methods. If you use dot notation for assignment with a name that is not an existing column, you are creating an attribute on the DataFrame object. So df.val = 2 will give df an attribute val that has a value of two. This is very different from df['val'] = 2, which creates a new column in the DataFrame and assigns each element in that column the value of two.
To be safe, using square bracket notation will always provide the correct result.
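A minimal sketch of the difference, reusing the frame from the question. One assumption on my part: recent pandas versions emit a UserWarning on the dot-notation line (older versions were silent), so the example suppresses warnings around it.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"a": range(0, 10, 2), "c": range(0, 1000, 200)})

with warnings.catch_warnings():
    # Newer pandas versions warn on this line; silenced for the demonstration.
    warnings.simplefilter("ignore")
    df.b = 10 * df.a  # sets a plain Python attribute, not a column

cols_after_attr = list(df.columns)

df["b"] = 10 * df["a"]  # creates a real column
cols_after_item = list(df.columns)

print(cols_after_attr)
print(cols_after_item)
```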
As an aside, the columns=list('ac') argument in your constructor call is redundant here: the dict keys already name the columns, and the argument only selects and orders them. It mattered more on older Python versions (before 3.7), where dictionaries were unordered, so that pd.DataFrame({'a': [...], 'b': [...]}) could potentially return a dataframe with columns ['b', 'a']. If this were the case, then assigning column names afterwards with df.columns = [...] could potentially mix up the column headers.
The issue has to do with how attributes are handled in Python. Python places no restriction on setting new attributes on an object, so for example you could do something like
df.myspecialstuff = ["dog", "cat", 5]
So when you do assignment like
df.b = 10*df.a
it is ambiguous whether you want to add an attribute or a new column, and an attribute is set. The easiest way to actually see what is going on is to use pdb and step through the code:
import pdb
x = df.a
pdb.run("df.a1 = x")
This will step into __setattr__(), whereas pdb.run("df['a2'] = x") will step into __setitem__().
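If stepping through pdb is inconvenient, a non-interactive sketch can show where each value ends up. This assumes, as appears to be the case in current pandas, that a new attribute set through dot notation is stored in the object's instance dictionary like any other Python attribute.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"a": [0, 2, 4]})
x = df["a"]

with warnings.catch_warnings():
    # pandas may warn about attribute-style column creation; silenced here.
    warnings.simplefilter("ignore")
    df.a1 = x  # handled by __setattr__

df["a2"] = x  # handled by __setitem__

attr_in_instance_dict = "a1" in vars(df)
attr_in_columns = "a1" in df.columns
item_in_columns = "a2" in df.columns

print(attr_in_instance_dict, attr_in_columns, item_in_columns)
```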
I ran into a strange situation that I can't explain at all. Let's start with a simple pandas dataframe:
import pandas as pd
a = pd.DataFrame({'A': [1, 1, -1]})
Say I want to select only the positive rows into a new dataframe and then modify the filtered dataframe:
b = a[a['A'] > 0]
Then, when I modify b, the pandas SettingWithCopyWarning is raised, which is expected since b was derived from a:
b['B'] = -999
The warning raised is:
__main__:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
But it is fine when I overwrite the variable a instead:
a = a[a['A'] > 0]
a['B'] = -999
This will NOT raise the warning, and a simple id() check also shows that a is now a completely different object. However, in an interactive session (notebook, IPython or Python shell), this still raises the warning if you VIEWED the variable first. That is,
in one cell you do:
a = pd.DataFrame({'A': [1, 1, -1]})
a
which will display nicely:
Out[4]:
A
0 1
1 1
2 -1
Then, in the next cell (or line, in IPython or the Python shell), you do the same thing:
a = a[a['A'] > 0]
a['B'] = -999
the warning is raised:
ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Why would this simple viewing action make a difference? From what I understand, it should not raise this warning (especially since an id() check shows that a became a new object with a different id value). My second question: is this the only way for this kind of behavior to happen?
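This does not explain the session-dependent behaviour, but as a hedged sketch, taking an explicit copy of the filtered frame should avoid the warning regardless of whether the variable was displayed earlier. The check below records any warnings and looks for one whose class name contains "SettingWithCopy"; that name match is my own choice for the demonstration.

```python
import warnings

import pandas as pd

a = pd.DataFrame({"A": [1, 1, -1]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    a = a[a["A"] > 0].copy()  # explicit, independent copy of the filtered rows
    a["B"] = -999

# Collect any warning whose class name mentions SettingWithCopy.
copy_warnings = [w for w in caught if "SettingWithCopy" in w.category.__name__]

print(a)
print(len(copy_warnings))
```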
I just began to learn Python and Pandas, and I saw the iloc function used in many tutorials. It is always stated that you can use this function to refer to columns and rows in a dataframe. However, you can also do this directly without the iloc function. Here is an example where both yield the same output:
# features is just a dataframe with several rows and columns
features = pd.DataFrame(features_standardized)
y_train = features.iloc[start:end] [[1]]
y_train_noIloc = features [start:end] [[1]]
What is the difference between the two statements, and what advantage do I have when using iloc? I'd appreciate every comment.
Per the pandas docs, iloc provides:
Purely integer-location based indexing for selection by position.
Therefore, as shown in the simple examples below, [row, col] indexing is not possible without using loc or iloc, as a KeyError will be thrown.
Example:
# Build a simple, sample DataFrame.
df = pd.DataFrame({'a': [1, 2, 3, 4]})
# No iloc
>>> df[0, 0]
KeyError: (0, 0)
# With iloc:
>>> df.iloc[0, 0]
1
The same logic holds true when using loc and a column name.
What is the difference and when does the indexing work without iloc?
The short answer:
Use loc and/or iloc when indexing on both rows and columns. If you are indexing on rows only or columns only, you can get away without them; for rows, this is referred to as 'slicing'.
However, I see that your example uses [start:end][[1]]. It is generally considered bad practice to have back-to-back square brackets in pandas (e.g. [][]), and it is generally an indication that a different (more efficient) approach should be taken: in this case, using iloc.
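As a sketch of what that single iloc call could look like: the array below is a made-up stand-in for the question's features_standardized, which I am assuming is a 2-D array. Note that [[1]] selects the column labelled 1 while iloc selects the column at position 1; the two coincide here only because the columns have the default 0, 1, 2 labels.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the question's `features_standardized`.
features = pd.DataFrame(np.arange(12).reshape(4, 3))
start, end = 1, 3

chained = features.iloc[start:end][[1]]  # two indexing steps
single = features.iloc[start:end, [1]]   # rows and columns in one call

print(single)
```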
The longer answer:
Adapting your [start:end] slicing example (shown below), indexing works without iloc when indexing (slicing) on rows only. The following example does not use iloc and will return rows 0 through 2 (the end of the slice is excluded).
df[0:3]
Output:
a
0 1
1 2
2 3
Note the difference between [0:3] and [0, 3]. The former (slicing) uses a colon and returns the rows at positions 0 through 2. The latter uses a comma and is a [row, col] indexer, which requires the use of iloc.
Aside:
The two methods can be combined as shown here, returning rows 0 through 2 for column index 0. This is not possible without the use of iloc.
df.iloc[0:3, 0]
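To make the slice boundaries concrete, here is a small sketch on the same sample frame. The loc line is my own addition for contrast: label-based slices include the end label, unlike positional ones.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4]})

rows_plain = df[0:3]        # positional row slice, end excluded
col_iloc = df.iloc[0:3, 0]  # same rows, first column, returned as a Series
col_loc = df.loc[0:3, "a"]  # label slice: the end label 3 is included

print(len(rows_plain))
print(col_iloc.tolist())
print(col_loc.tolist())
```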
As per my other question:
Python Anaconda: how to test if updated libraries are compatible with my existing code?
I curse the day I was forced to upgrade to pandas 0.16.
One of the things I don't understand is why I get a chained assignment warning when I do something as banal as adding a new field to an existing dataframe and initialising it with 1:
mydataframe['x']=1
causes the following warning:
SettingWithCopyWarning: A value is trying to be set on a copy of a
slice from a DataFrame. Try using .loc[row_indexer,col_indexer] =
value instead
See the the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
mydataframe['x']=1
I understand there can be problems when assigning values to a copy of a dataframe, but here I am just adding a new field to a dataframe! How am I supposed to change my code (which worked perfectly in previous versions of pandas)?
Here's an attempt at an answer, or at least an attempt to reproduce the message. (Note that you may only get this message once and might need to start a new shell or do %reset in ipython to get this message.)
In [1]: %reset
Once deleted, variables cannot be recovered. Proceed (y/[n])? y
In [2]: import pandas as pd
In [3]: pd.__version__
Out[3]: '0.16.0'
Here are 3 variations of setting a new column to '1'. The first two do not generate the warning, but the third one does. (The second one is thanks to @Jeff's suggestion.)
In [4]: df = pd.DataFrame({ 'x':[1,2,3], 'y':[77,88,99] })
...: df['z'] = 1
In [5]: df = pd.DataFrame({ 'x':[1,2,3], 'y':[77,88,99] })
...: df = df[1:]
...: df['z'] = 1
In [6]: df = pd.DataFrame({ 'x':[1,2,3], 'y':[77,88,99] })
...: df2 = df[1:]
...: df2['z'] = 1
-c:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Perhaps others can correct me if I'm wrong, but I believe the warning here relates to df2 being a copy of a slice of df. However, that's not really an issue, as the resulting df and df2 are what I would have expected:
In [7]: df
Out[7]:
x y
0 1 77
1 2 88
2 3 99
In [8]: df2
Out[8]:
x y z
1 2 88 1
2 3 99 1
I know this is going to be terrible to say, but when I get that message I just check to see whether the command did what I wanted or not and don't overly think about the warning. But whether you get a warning message or not, checking that a command did what you expected is really something you need to do all the time in pandas (or matlab, or R, or SAS, or Stata, ... )
This will not generate the warning:
df = pd.DataFrame({ 'x':[1,2,3], 'y':[77,88,99] })
df2 = df[1:].copy()
df2['z'] = 1
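Another option, offered as a sketch rather than as the recommended fix, is to build the new column with assign(), which returns a new DataFrame instead of writing into the slice.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [77, 88, 99]})

# assign() returns a new DataFrame, so nothing is written into a slice of df.
df2 = df[1:].assign(z=1)

print(df)
print(df2)
```

As with the .copy() version, df keeps its original two columns and df2 gets the extra column z.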