Unpredictable pandas slice assignment behavior with no SettingWithCopyWarning - python

It's well known (and understandable) that pandas behavior is essentially unpredictable when assigning to a slice. But I'm used to being warned about it by the SettingWithCopyWarning.
Why is the warning not generated in either of the following two code snippets, and what techniques could reduce the chance of writing such code unintentionally?
# pandas 0.18.1, python 3.5.1
import pandas as pd
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
new_data = data[['a', 'b']]
data = data['a']
new_data.loc[0, 'a'] = 100 # no warning, doesn't propagate to data
data[0] == 1
True
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
new_data = data['a']
data = data['a']
new_data.loc[0] = 100 # no warning, propagates to data
data[0] == 100
True
I thought the explanation was that pandas only produces the warning when the parent DataFrame is still reachable from the current context. (This would be a weakness of the detection algorithm, as my previous examples show.)
In the next snippet, AFAIK the original two-column DataFrame is no longer reachable, and yet pandas warning mechanism manages to trigger (luckily):
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
new_data = data['a']
data = data[['a']]
new_data.loc[0] = 100 # warning, so we're safe
Edit:
While investigating this, I found another case of a missing warning:
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
data = data.groupby('a')
new_data = data.filter(lambda g: len(g)==1)
new_data.loc[0, 'a'] = 100 # no warning, does not propagate to data
assert data.filter(lambda g: True).loc[0, 'a'] == 1
Even though an almost identical example does trigger a warning:
data = pd.DataFrame({'a': [1, 2, 2], 'b': ['a', 'b', 'c']})
data = data.groupby('a')
new_data = data.filter(lambda g: len(g)==1)
new_data.loc[0, 'a'] = 100 # warning, does not propagate to data
assert data.filter(lambda g: True).loc[0, 'a'] == 1
Update: I'm responding to the answer by @firelynx here because it's hard to fit it into a comment.
In that answer, @firelynx says the first code snippet produces no warning because I'm taking the entire DataFrame. But even if I take only part of it, I still don't get a warning:
# pandas 0.18.1, python 3.5.1
import pandas as pd
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c'], 'c': range(3)})
new_data = data[['a', 'b']]
data = data['a']
new_data.loc[0, 'a'] = 100 # no warning, doesn't propagate to data
data[0] == 1
True

Explaining what you're doing, step by step
The DataFrame you create is not a view
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
data._is_view
False
new_data is also not a view, because you are taking all columns
new_data = data[['a', 'b']]
new_data._is_view
False
Now you are assigning data to be the Series 'a'
data = data['a']
type(data)
pandas.core.series.Series
Which is a view
data._is_view
True
Now you update a value in the non-copy new_data
new_data.loc[0, 'a'] = 100 # no warning, doesn't propagate to data
This should not give a warning; new_data is a whole DataFrame of its own, not a view into data.
The Series you've created flags itself as a view, but it's not a DataFrame and does not behave as a DataFrame view.
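(Not part of the original answer, just a minimal sketch of how to sidestep the ambiguity entirely: take an explicit copy whenever the new object is meant to be independent, so neither the warning nor the propagation question arises.)
import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})

# an explicit copy is never a view, so later assignments neither warn nor write back
new_data = data[['a', 'b']].copy()
new_data.loc[0, 'a'] = 100

print(data.loc[0, 'a'])    # 1 -- the original frame is untouched
print(new_data._is_view)   # False (internal attribute, shown only for illustration)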
Avoiding writing code like this
The Series vs. DataFrame problem is a very common one in pandas [citation not needed if you've worked with pandas for a while].
The real problem is that you should always be writing
data[['a']], not data['a']
The left-hand form gives you a single-column DataFrame; the right-hand form gives you a Series.
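A quick check makes the difference visible (a minimal sketch; the printed types are what a pandas 0.18-era interpreter reports):
import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})

print(type(data[['a']]))   # <class 'pandas.core.frame.DataFrame'> -- a list of labels keeps you in DataFrame land
print(type(data['a']))     # <class 'pandas.core.series.Series'>   -- a single label drops you to a Series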
Some people may argue that you should never write data['a'] and use data.a instead, so that you can add warnings to your environment for data['a'] code.
This does not work. First of all, the data.a syntax causes cognitive dissonance.
A DataFrame is a collection of columns. In Python we access members of collections with the [] operator and attributes with the . operator. Switching these around causes cognitive dissonance for any Python programmer, especially when you start doing things like del data.a and notice that it does not work. See this answer for a more extensive explanation.
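For illustration (a small sketch, not from the linked answer), attribute access can read a column but cannot delete one, and assigning through it does not create one:
import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})

data.a            # reading an existing column via attribute access works
# del data.a      # AttributeError -- columns are not real attributes
del data['a']     # item access is what actually removes the column

data.c = [7, 8, 9]            # sets a plain Python attribute, NOT a new column (newer pandas warn about this)
print('c' in data.columns)    # False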
Clean code to the rescue
It is hard to see the difference between data[['a']] and data['a']
This is a smell. We should be doing neither.
The proper way, following clean code principles and the Zen of Python ("Explicit is better than implicit"), is this:
columns = ['a']
data[columns]
This may not be so mind boggling, but take a look at the following example:
data[['ad', 'cpc', 'roi']]
What does this mean? What are these columns? What data are you getting here?
These are the first questions to arrive in anyone's head when reading this line of code.
How do we solve it? Not with a comment.
ad_performance_columns = ['ad', 'cpc', 'roi']
data[ad_performance_columns]
More explicit is always better.
For more, please consider buying a book on clean code. Maybe this one

Related

Pandas astype() extremely slow with large number of nulls?

I have a DataFrame of decent size (95,000 rows, 68 columns). When I load the Excel file, a certain column that should be text gets interpreted as an int. I either need a way to specify on loading that a certain column should hold strings, or I need to figure out why astype(str) is performing so slowly.
Example code below
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': ['a', 'b', 'c', 'd', 'e'],
    'C': ['1.1', '2.1', '3.0', '4.1', '5.1']})
df['A'] = df['A'].astype(str) + 'addedtext'
print(df)
This code works fine and does exactly what I want: it changes column A from ints to strings, and I can verify they are strings because concatenating them with another string via + gives the result I expect.
The problem is that running this on a single column (95,000 rows) of my other DataFrame takes 7-8 minutes, which feels very slow for such a simple change. Maybe I'm crazy. Is there a faster method? Is there a way to load a csv or excel file but specify beforehand that one column has a certain data type? Could it have to do with the fact that there is a very large number of nulls in that column?
EDIT: I'm just dumb; my timing loop contained the saving of the file.
I would maybe try a list comprehension:
import time
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': ['a', 'b', 'c', 'd', 'e'],
    'C': ['1.1', '2.1', '3.0', '4.1', '5.1']})

start = time.time()
df['A'] = df['A'].astype(str) + 'addedtext'
end = time.time()
print('original:', end - start)

start = time.time()
df['A'] = [str(x) + 'addedtext' for x in df['A']]
end = time.time()
print('slightly faster alternative', end - start)
my results were:
original: 0.002001523971557617
slightly faster alternative 0.000997781753540039
Hope this helps
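As for the loading half of the question, read_csv accepts a dtype mapping so the column never gets interpreted as numeric in the first place (a hedged sketch; the file name and column name below are made up, and newer pandas versions accept the same argument in read_excel):
import pandas as pd

# hypothetical file and column name, purely for illustration
df = pd.read_csv('data.csv', dtype={'my_text_column': str})
print(df['my_text_column'].dtype)   # object -- the values stay as strings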

Why doesn't pandas reindex() operate in-place?

From the reindex docs:
Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False.
Therefore, I thought that by setting copy=False I would get the DataFrame reordered in place (!). It appears, however, that I do get a copy and need to assign it back to the original object. I want to avoid assigning it back if I can (the reason comes from this other question).
This is what I am doing:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5, 5))
df.columns = [ 'a', 'b', 'c', 'd', 'e' ]
df.head()
Outs:
a b c d e
0 0.234296 0.011235 0.664617 0.983243 0.177639
1 0.378308 0.659315 0.949093 0.872945 0.383024
2 0.976728 0.419274 0.993282 0.668539 0.970228
3 0.322936 0.555642 0.862659 0.134570 0.675897
4 0.167638 0.578831 0.141339 0.232592 0.976057
Reindex gives me the correct output, but I'd need to assign it back to the original object, which is what I wanted to avoid by using copy=False:
df.reindex( columns=['e', 'd', 'c', 'b', 'a'], copy=False )
The desired output after that line is:
e d c b a
0 0.177639 0.983243 0.664617 0.011235 0.234296
1 0.383024 0.872945 0.949093 0.659315 0.378308
2 0.970228 0.668539 0.993282 0.419274 0.976728
3 0.675897 0.134570 0.862659 0.555642 0.322936
4 0.976057 0.232592 0.141339 0.578831 0.167638
Why is copy=False not working in place?
Is it possible to do that at all?
Working with python 3.5.3, pandas 0.23.3
reindex is a structural change, not a cosmetic or transformative one. As such, a copy is always returned because the operation cannot be done in-place (it would require allocating new memory for underlying arrays, etc). This means you have to assign the result back, there's no other choice.
df = df.reindex(['e', 'd', 'c', 'b', 'a'], axis=1)
Also see the discussion on GH21598.
The one corner case where copy=False is actually of any use is when the indices used to reindex df are identical to the ones it already has. You can check by comparing the ids:
id(df)
# 4839372504
id(df.reindex(df.index, copy=False)) # same object returned
# 4839372504
id(df.reindex(df.index, copy=True)) # new object created - ids are different
# 4839371608
A bit off topic, but I believe this would rearrange the columns in place
for i, colname in enumerate(list_of_columns_in_desired_order):
col = dataset.pop(colname)
dataset.insert(i, colname, col)
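Applying that pop/insert idea to the df from the question (a sketch; the target order ['e', 'd', 'c', 'b', 'a'] is taken from the desired output above):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 5), columns=['a', 'b', 'c', 'd', 'e'])

# rearrange the columns in place: pop each column and re-insert it at its target position
for i, colname in enumerate(['e', 'd', 'c', 'b', 'a']):
    col = df.pop(colname)
    df.insert(i, colname, col)

print(df.columns.tolist())   # ['e', 'd', 'c', 'b', 'a'] -- same object, no reassignment needed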

How to sort index in Dask following pivot_table

Trying to use pivot_table in dask while maintaining a sorted index. I have a simple pandas dataframe that looks something like this:
# make the DataFrame, first in pandas and then in dask
import pandas as pd
import dask.dataframe

df = pd.DataFrame({'A': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
                   'B': ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c'],
                   'dist': [0, .1, .2, .1, 0, .3, .4, .1, 0]})
df.sort_values(by='A', inplace=True)
dd = dask.dataframe.from_pandas(df, chunksize=3)  # just for demo's sake; you obviously don't ever want a chunksize of 3
print(dd.known_divisions) # Here I get True, which means my data is sorted
# now pivot and see if the index remains sorted
dd = dd.categorize('B')
pivot_dd = dd.pivot_table(index='A', columns='B', values='dist')
print(pivot_dd.known_divisions) # Here I get False, which makes me sad
I would love to find a way to get pivot_dd to have a sorted index, but I don't see a sort_index method in dask and cannot set 'A' as the index without getting a key error (it already is the index!).
In this toy example, I could pivot the pandas table first and then sort. The real application I have in mind won't allow me to do that.
Thanks in advance for any help/suggestions.
This may not be what you were wishing for, and perhaps not even the best answer, but it does seem to work. The first wrinkle is that pivot operations create a categorical index for the columns, which is annoying. You could do the following.
>>> pivot_dd = dd.pivot_table(index='A', columns='B', values='dist')
>>> pivot_dd.columns = list(pivot_dd.columns)
>>> pivot_dd = pivot_dd.reset_index().set_index('A', sorted=True)
>>> pivot_dd.known_divisions
True

Pandas - Category variable and group by - is this a bug?

I came across a strange result while playing around with Pandas and I am not sure why this would work like this. Wondering if it is a bug.
import pandas as pd

cf = pd.DataFrame({'sc': ['b', 'b', 'c', 'd'], 'nn': [1, 2, 3, 4], 'mvl': [10, 20, 30, 40]})
df = cf.groupby('sc').mean()
df.loc['b', 'mvl']
This gives "15.0" as result.
cf1 = cf
cf1['sc'] = cf1['sc'].astype('category', categories=['b', 'c', 'd'], ordered = True)
df1 = cf1.groupby('sc').mean()
df1.loc['b','mvl']
This gives as result a Series:
sc
b 15.0
Name: mvl, dtype: float64
type(df1.loc['b','mvl']) -> pandas.core.series.Series
type(df.loc['b','mvl']) -> numpy.float64
Why would declaring the variable as categorical change the output of the loc from a scalar to a Series?
I hope it is not a stupid question. Thanks!
This may be a pandas bug. The difference is due to the fact that when you group on a categorical variable, you get a categorical index. You can see it more simply without any groupby:
import pandas

nocat = pandas.Series(['a', 'b', 'c'])
cat = nocat.astype('category', categories=['a', 'b', 'c'], ordered=True)
xno = pandas.Series([8, 88, 888], index=nocat)
xcat = pandas.Series([8, 88, 888], index=cat)
>>> xno.loc['a']
8
>>> xcat.loc['a']
a 8
dtype: int64
The docs note that indexing operations on a CategoricalIndex preserve the categorical index. It appears they even do this if you get only one result, which doesn't exactly contradict the docs but seems like undesirable behavior.
There is a related pull request that seems to fix this behavior, but it was only recently merged. It looks like the fix should be in pandas 0.18.1.
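If upgrading is not an option, one possible workaround (a hedged sketch, not taken from that pull request) is to replace the CategoricalIndex with a plain object index before doing scalar lookups; df1 below is the grouped result from the question:
df1_plain = df1.copy()
df1_plain.index = list(df1_plain.index)   # plain object Index built from the category labels
df1_plain.loc['b', 'mvl']                 # 15.0 -- a scalar again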

Why can't I assign to part of my Pandas DataFrame?

I'm confused about why the following pandas code does not successfully assign the last two values of column A to the first two entries of column B2:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7], 'B': [10, 20, 30, 40, 50, 60, 70]})
df = df.join(pd.DataFrame({'C': ['a', 'b', 'c', 'd', 'e', 'f', 'g']}))
df['B2'] = df.B.shift(2)
df[:2].B2 = list(df[-2:].A)
What's perplexing to me is that in an (apparently) equivalent "real" application, it does appear to work (and to generate some strange behavior).
Why does the final assignment fail to change the values of the two entries in the dataframe?
It can work, and that's why it's insidious; see here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Generally, with multi-dtyped frames, whether it works depends on how the frame was constructed (e.g. if you create it all at once, I think it will always work). Since you are adding columns afterwards (via join), it depends on the underlying numpy view-creation mechanics.
Don't ever, ever, ever assign like that; use loc:
df.loc[:2, 'B2'] = ....
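A sketch of that advice applied to the frame from the question (note that .loc slices by label and is end-inclusive, so on this default integer index the first two rows are :1, not :2):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7], 'B': [10, 20, 30, 40, 50, 60, 70]})
df = df.join(pd.DataFrame({'C': ['a', 'b', 'c', 'd', 'e', 'f', 'g']}))
df['B2'] = df.B.shift(2)

# assign through .loc in a single step instead of chaining df[:2].B2 = ...
df.loc[:1, 'B2'] = list(df['A'].iloc[-2:])
print(df.head())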
