What's the difference between:
pandas df.loc[:, ('col_a','col_b')]
and
df.loc[:, ['col_a','col_b']]
The link below doesn't mention the latter, though it works. Do both return a view? Or does the first return a view and the second a copy? Love learning Pandas.
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Thanks
If your DataFrame has a simple column index, then there is no difference.
For example,
In [8]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=list('ABC'))
In [9]: df.loc[:, ['A','B']]
Out[9]:
A B
0 0 1
1 3 4
2 6 7
3 9 10
In [10]: df.loc[:, ('A','B')]
Out[10]:
A B
0 0 1
1 3 4
2 6 7
3 9 10
But if the DataFrame has a MultiIndex, there can be a big difference:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=(5,4)),
                  columns=pd.MultiIndex.from_arrays([['foo']*2 + ['bar']*2,
                                                     list('ABAB')]),
                  index=pd.MultiIndex.from_arrays([['baz']*2 + ['qux']*3,
                                                   list('CDCDC')]))
# foo bar
# A B A B
# baz C 7 9 9 9
# D 7 5 5 4
# qux C 5 0 5 1
# D 1 7 7 4
# C 6 4 3 5
In [27]: df.loc[:, ('foo','B')]
Out[27]:
baz C 9
D 5
qux C 0
D 7
C 4
Name: (foo, B), dtype: int64
In [28]: df.loc[:, ['foo','B']]
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (1), lexsort depth (0)'
The KeyError is saying that the MultiIndex has to be lexsorted. If we sort the columns with sort_index(axis=1), we still get a different result:
In [29]: df.sort_index(axis=1).loc[:, ('foo','B')]
Out[29]:
baz C 9
D 5
qux C 0
D 7
C 4
Name: (foo, B), dtype: int64
In [30]: df.sort_index(axis=1).loc[:, ['foo','B']]
Out[30]:
foo
A B
baz C 7 9
D 7 5
qux C 5 0
D 1 7
C 6 4
Why is that? df.sort_index(axis=1).loc[:, ('foo','B')] selects the single column whose first column level equals foo and whose second column level equals B.
In contrast, df.sort_index(axis=1).loc[:, ['foo','B']] selects the columns whose first column level is either foo or B. With respect to the first column level there are no B columns, but there are two foo columns, so only those are returned.
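If the goal is instead to pick several specific MultiIndex columns, a list of tuples does that: each tuple names one exact (level_0, level_1) column. A minimal sketch against the example frame above:

# a list of tuples means "these exact columns":
df.sort_index(axis=1).loc[:, [('foo','A'), ('foo','B')]]  # just foo/A and foo/B
# whereas a list of strings means "these labels in the first level".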
I think the operating principle with Pandas is that if you use df.loc[...] as an expression, you should assume df.loc may be returning either a copy or a view. The Pandas docs do not specify any rules about which you should expect.
However, if you make an assignment of the form
df.loc[...] = value
then you can trust Pandas to alter df itself.
The reason why the documentation warns about the distinction between views and copies is so that you are aware of the pitfall of using chain assignments of the form
df.loc[...][...] = value
Here, Pandas evaluates df.loc[...] first, which may be a view or a copy. Now if it is a copy, then
df.loc[...][...] = value
is altering a copy of some portion of df, and thus has no effect on df itself. To add insult to injury, the effect on the copy is lost as well since there are no references to the copy and thus there is no way to access the copy after the assignment statement completes, and (at least in CPython) it is therefore soon-to-be garbage collected.
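A minimal sketch of the pitfall (the frame and column names are just for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list('ABC'))

# Chained assignment: df.loc[df['A'] > 3] may be a copy, so this
# assignment can silently have no effect on df (pandas usually emits
# a SettingWithCopyWarning here):
df.loc[df['A'] > 3]['B'] = 0

# Single-step assignment: pandas alters df itself:
df.loc[df['A'] > 3, 'B'] = 0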
I do not know of a practical fool-proof a priori way to determine if df.loc[...] is going to return a view or a copy.
However, there are some rules of thumb which may help guide your intuition (but note that we are talking about implementation details here, so there is no guarantee that Pandas needs to behave this way in the future):
If the resultant NDFrame cannot be expressed as a basic slice of the underlying NumPy array, then it probably will be a copy. Thus, a selection of arbitrary rows or columns will lead to a copy, while a selection of sequential rows and/or sequential columns (which may be expressed as a slice) may return a view.
If the resultant NDFrame has columns of different dtypes, then df.loc will again probably return a copy.
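One way to watch the first rule of thumb in action is to check for shared memory (a sketch; whether memory is actually shared is an implementation detail and varies across pandas versions):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list('ABC'))

# Sequential rows can be expressed as a basic NumPy slice:
print(np.shares_memory(df.iloc[1:3].to_numpy(), df.to_numpy()))    # typically True
# Arbitrary rows require fancy indexing, hence a copy:
print(np.shares_memory(df.iloc[[0, 2]].to_numpy(), df.to_numpy())) # False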
However, there is an easy way to determine a posteriori whether x = df.loc[...] is a view: simply see if changing a value in x affects df. If it does, x is a view; if not, x is a copy.
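For example (a sketch; note that under the copy-on-write behaviour of recent pandas versions, x will act like a copy regardless):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list('ABC'))
x = df.loc[:, 'A':'B']   # slice-based selection, so possibly a view
x.iloc[0, 0] = 999
print(df.iloc[0, 0])     # 999 -> x was a view; 0 -> x was a copy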
Related
I have a dataframe (df) with the following columns:
print(df.columns)
['A','B','C','D','E']
And let's assume all the columns have numbers as data.
Then I select some of the columns to become indexes
Index = ['A','B','C']
df = df.set_index(Index).sort_index()
and I use it this way for some analysis. At some point I need to change the rows of column 'E' where index 'C' has certain values, for instance something like:
df.loc[df[(slice(None,None),slice(None,None),slice(5,10))], 'E' ] = 6
Which, obviously, doesn't work. I have tried a bunch of different approaches: using tuples and slices for the index as shown in the line above, rearranging the indexes so I can use a single slice (moving 'C' to the first level), trying .xs (cross section), etc., and I cannot do it. (I have been looking into the documentation of .loc, .xs, etc.) I don't find an example that does exactly this, nor a conclusive answer that it is not possible. Right now I was able to do the following:
df.reset_index(inplace=True)  # turn it back into a normal DataFrame
df.loc[(df['C'] >= 5) & (df['C'] <= 10), 'E'] = 6  # modify normally based on column data
df = df.set_index(Index).sort_index()  # bring back the MultiIndex
But this doesn't seem right. It would seem to me that indexes should be able to be sliced somehow; I just can't find how. Perhaps I'm not searching for the correct terms on Google. If anyone could give me a hand or point me in the right direction I'd greatly appreciate it.
You can use df.index.get_level_values('C')--which returns an index array of the values--like below.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, size=(100, 5)), columns=list('ABCDE'))
df = df.set_index(['A','B','C']).sort_index()
df.loc[(df.index.get_level_values('C') <= 10) & (df.index.get_level_values('C') >= 5), 'E'] = 6
print(df)
Results:
D E
A B C
0 0 6 3 6
2 0 6 1
7 2 6
3 6 5 6
9 1 6
... .. ..
9 3 3 5 0
6 6 6
4 3 5 7
7 6 6
6 8 6 6
Note: the parentheses around both .get_level_values() comparisons are required; without them the & expression is ambiguous and will throw an error.
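For completeness, the index itself can also be sliced directly using pd.IndexSlice once it is sorted (a sketch on the same frame; sorting is required for label slicing on a MultiIndex):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, size=(100, 5)), columns=list('ABCDE'))
df = df.set_index(['A', 'B', 'C']).sort_index()

# slice level 'C' between 5 and 10 (inclusive) in one step:
df.loc[pd.IndexSlice[:, :, 5:10], 'E'] = 6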
Sorry for the long read; the question is actually much shorter than it seems.
Can anyone explain how the function-typed argument of pandas.core.groupby.groupby.DataFrameGroupBy.transform is used?
I wrote this snippet to find out what arguments are fed into function:
import pandas as pd

def printer(x):
    print(''); print(type(x)); print(x)
    return x
df = pd.DataFrame({'A': [1,1,2], 'B':[3,4,5], 'C':[6,7,8]})
print('initial dataframe:', df, '\n===TRANSFORM LOG BEGIN===', sep='\n')
df2 = df.groupby('A').transform(printer)
print('\n===TRANSFORM LOG END===', 'final dataframe:', df2, sep='\n')
The output is (split into chunks)
initial dataframe:
A B C
0 1 3 6
1 1 4 7
2 2 5 8
OK, move on
===TRANSFORM LOG BEGIN===
<class 'pandas.core.series.Series'>
0 3
1 4
Name: B, dtype: int64
Apparently we got the group of values for column B whose key (column A value) is 1. Carrying on:
3.
<class 'pandas.core.series.Series'>
0 3
1 4
Name: B, dtype: int64
The same Series object is passed twice. The only justification I could imagine is that there are two rows with column A equal to 1, so the transforming function is recomputed for each occurrence of such a row. That seems strange and inefficient, and is hardly likely to be true.
4.
<class 'pandas.core.series.Series'>
0 6
1 7
Name: C, dtype: int64
That's analogous to p. 2, for the other column.
5.
<class 'pandas.core.frame.DataFrame'>
B C
0 3 6
1 4 7
Why is there no counterpart of p. 3?
6.
<class 'pandas.core.frame.DataFrame'>
B C
2 5 8
===TRANSFORM LOG END===
This is the counterpart to p. 5, but why is there no counterpart to p. 2 for the other grouping key?
7.
final dataframe:
B C
0 3 6
1 4 7
2 5 8
TLDR
Apart from the strange behaviour, the main point is that the passed function receives both Series and DataFrame objects as arguments. Does that mean the function must handle both types? Are there any restrictions on the transformation, given that the function is essentially called several times on the same values (a Series, then a DataFrame consisting of those Series), in a sort of reduce-like fashion?
pandas is experimenting with the input (Series by Series, or the whole DataFrame) to see whether the function can be applied more efficiently. From the notes in the docstring:
The current implementation imposes three requirements on f:
1. f must return a value that either has the same shape as the input subframe or can be broadcast to the shape of the input subframe. For example, if f returns a scalar it will be broadcast to have the same shape as the input subframe.
2. If this is a DataFrame, f must support application column-by-column in the subframe. If f also supports application to the entire subframe, then a fast path is used starting from the second chunk.
3. f must not mutate groups. Mutation is not supported and may produce unexpected results.
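Requirement 1 is easy to see with a scalar-returning function, which pandas broadcasts back over each group. A quick sketch using the question's frame:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [3, 4, 5], 'C': [6, 7, 8]})
print(df.groupby('A').transform(lambda s: s.mean()))
#      B    C
# 0  3.5  6.5
# 1  3.5  6.5
# 2  5.0  8.0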
The second call to the same function is also about finding a faster path. You see the same behavior with apply:
In the current implementation apply calls func twice on the first
column/row to decide whether it can take a fast or slow code path.
This can lead to unexpected behavior if func has side-effects, as they
will take effect twice for the first column/row.
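You can observe the extra probing call directly by recording invocations (a sketch; the exact call pattern is an implementation detail and varies between pandas versions):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [3, 4, 5], 'C': [6, 7, 8]})

calls = []
def tracer(x):
    calls.append(type(x).__name__)  # record what pandas passes in
    return x

df.groupby('A').transform(tracer)
print(calls)  # the first group shows up more than once: the fast-path probe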
I have a dataframe in pandas with four columns. The data consists of strings. Sample:
A B C D
0 2 asicdsada v:cVccv u
1 4 ascccaiiidncll v:cVccv:ccvc u
2 9 sca V:c u
3 11 lkss v:cv u
4 13 lcoao v:ccv u
5 14 wuduakkk V:ccvcv: u
I want to replace the string 'u' in Col D with the string 'a' if Col C in that row contains the substring 'V' (case sensitive).
Desired outcome:
A B C D
0 2 asicdsada v:cVccv a
1 4 ascccaiiidncll v:cVccv:ccvc a
2 9 sca V:c a
3 11 lkss v:cv u
4 13 lcoao v:ccv u
5 14 wuduakkk V:ccvcv: a
I prefer to overwrite the value already in Column D, rather than assign two different values, because I'd like to selectively overwrite some of these values again later, under different conditions.
It seems like this should have a simple solution, but I cannot figure it out, and haven't been able to find a fully applicable solution in other answered questions.
df.ix[1]["D"] = "a"
changes an individual value.
df.ix[:]["C"].str.contains("V")
returns a series of booleans, but I am not sure what to do with it. I have tried many many combinations of .loc, apply, contains, re.search, and for loops, and I get either errors or replace every value in column D. I'm a novice with pandas/python so it's hard to know whether my syntax, methods, or conceptualization of what I even need to do are off (probably all of the above).
As you've already tried, use str.contains to get a boolean Series, and then use .loc to say "change these rows and the D column". For example:
In [5]: df.loc[df["C"].str.contains("V"), "D"] = "a"
In [6]: df
Out[6]:
A B C D
0 2 asicdsada v:cVccv a
1 4 ascccaiiidncll v:cVccv:ccvc a
2 9 sca V:c a
3 11 lkss v:cv u
4 13 lcoao v:ccv u
5 14 wuduakkk V:ccvcv: a
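An equivalent alternative, if you prefer building the whole column in one expression, is numpy.where (a sketch; same result as above):

import numpy as np
df["D"] = np.where(df["C"].str.contains("V"), "a", df["D"])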
(Avoid using .ix: it was deprecated and has been removed from modern pandas.)
Imagine I have the following DataFrames in Pandas:
In [7]: A = pd.DataFrame([['foo'],['bar'],['quz'],['baz']], columns=['key'])
In [8]: A['value'] = 'None'
In [9]: A
Out[9]:
key value
0 foo None
1 bar None
2 quz None
3 baz None
In [10]: B = pd.DataFrame([['foo',5],['bar',6],['quz',7]], columns=['key','value'])
In [11]: B
Out[11]:
key value
0 foo 5
1 bar 6
2 quz 7
In [12]: pd.merge(A,B, on='key', how='outer')
Out[12]:
key value_x value_y
0 foo None 5
1 bar None 6
2 quz None 7
3 baz None NaN
But what I want is (basically avoiding the repeated column):
key value
0 foo 5
1 bar 6
2 quz 7
3 baz NaN
I suppose I could take the output, drop the _x column, and rename _y, but that seems like overkill. In SQL this would be trivial.
EDIT:
John has recommended using:
In [1]: A.set_index('key', inplace=True)
A.update(B.set_index('key'), join='left', overwrite=True)
A.reset_index(inplace=True)
This works and does what I asked for.
In your example you are merging two dataframes that share a column name: one contains strings ('None'), the other integers. pandas doesn't know which column's values you want to keep and which should be replaced, so it creates a column for both.
You can use update instead
In [10]: A.update(B, join='left', overwrite=True)
In [11]: A
Out[11]:
key value
0 foo 5
1 bar 6
2 quz 7
3 baz NaN
Another solution would be to just state the values that you want for the given column:
In [15]: A.loc[B.index, 'value'] = B.value
In [16]: A
Out[16]:
key value
0 foo 5
1 bar 6
2 quz 7
3 baz NaN
Personally I prefer the second solution because I know exactly what is happening, but the first is probably closer to what you are looking for in your question.
EDIT:
If the indices don't match, I'm not quite sure how to make this happen. Hence I would suggest making them match:
In [1]: A.set_index('key', inplace=True)
A.update(B.set_index('key'), join='left', overwrite=True)
A.reset_index(inplace=True)
It may be that there is a better way to do this, but I don't believe pandas has a way to perform this operation outright.
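One more hedged sketch: since the goal is just to look up each key's value in B, Series.map does the lookup and the NaN-for-missing behaviour in a single step:

A['value'] = A['key'].map(B.set_index('key')['value'])
# foo/bar/quz get 5/6/7, and baz gets NaN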
The second solution can also be used with the updated index (.tolist() strips B's integer index so the values are assigned positionally, rather than being aligned against A's new 'key' index):
In [24]: A.set_index('key', inplace=True)
A.loc[B.key, 'value'] = B.value.tolist()
When selecting data from a Pandas dataframe, sometimes a view is returned and sometimes a copy is returned.
While there is a logic behind this, is there a way to force Pandas to explicitly return a view or a copy?
There are two parts to your question: (1) how to make a view (see bottom of this answer), and (2) how to make a copy.
I'll demonstrate with some example data:
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6],[None,10,20],[7,8,9]], columns=['x','y','z'])
# which looks like this:
x y z
0 1 2 3
1 4 5 6
2 NaN 10 20
3 7 8 9
How to make a copy: One option is to explicitly copy your DataFrame after whatever operations you perform. For instance, let's say we are selecting rows that do not have NaN:
df2 = df[~df['x'].isnull()]
df2 = df2.copy()
Then, if you modify values in df2, you will find that the modifications do not propagate back to the original data (df), and that pandas does not warn that "A value is trying to be set on a copy of a slice from a DataFrame":
df2['x'] *= 100
# original data unchanged
print(df)
x y z
0 1 2 3
1 4 5 6
2 NaN 10 20
3 7 8 9
# modified data
print(df2)
x y z
0 100 2 3
1 400 5 6
3 700 8 9
Note: you may take a performance hit by explicitly making a copy.
How to ignore warnings: Alternatively, in some cases you might not care whether a view or copy is returned, because your intention is to permanently modify the data and never go back to the original data. In this case, you can suppress the warning and go merrily on your way (just don't forget that you've turned it off, and that the original data may or may not be modified by your code, because df2 may or may not be a copy):
pd.options.mode.chained_assignment = None # default='warn'
For more information, see the answers at How to deal with SettingWithCopyWarning in Pandas?
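If you only want the warning silenced for a specific block of code rather than globally, an option context scopes it (a minimal sketch):

import pandas as pd

with pd.option_context('mode.chained_assignment', None):
    df2['x'] *= 100  # no warning inside this block only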
How to make a view: Pandas will implicitly use views wherever and whenever possible. The key is to perform the selection and the assignment in a single df.loc[row_indexer, col_indexer] expression. For example, to multiply the values of column y by 100 for only the rows where column x is not null, we would write:
mask = ~df['x'].isnull()
df.loc[mask, 'y'] *= 100
# original data has changed
print(df)
x y z
0 1.0 200 3
1 4.0 500 6
2 NaN 10 20
3 7.0 800 9