Boolean Series key will be reindexed to match DataFrame index - python

Here is how I encountered the warning:
df.loc[a_list][df.a_col.isnull()]
The type of a_list is Int64Index; it contains a list of row indexes, all of which belong to df.
The df.a_col.isnull() part is a condition I need for filtering.
If I execute the following commands individually, I do not get any warnings:
df.loc[a_list]
df[df.a_col.isnull()]
But if I put them together df.loc[a_list][df.a_col.isnull()], I get the warning message (but I can see the result):
Boolean Series key will be reindexed to match DataFrame index
What is the meaning of this warning message? Does it affect the result that it returned?

Your approach will work despite the warning, but it's best not to rely on implicit, unclear behavior.
Solution 1, make the selection of indices in a_list a boolean mask:
df[df.index.isin(a_list) & df.a_col.isnull()]
Solution 2, do it in two steps:
df2 = df.loc[a_list]
df2[df2.a_col.isnull()]
Solution 3, if you want a one-liner, use the NaN trick: NaN is the only value not equal to itself, so a_col != a_col is True exactly on the rows where a_col is null:
df.loc[a_list].query('a_col != a_col')
The warning comes from the fact that the boolean vector df.a_col.isnull() has the length of df, while df.loc[a_list] has the length of a_list, i.e. it is shorter. Therefore, some indices covered by df.a_col.isnull() are not in df.loc[a_list].
What pandas does is reindex the boolean series on the index of the calling dataframe. In effect, it gets from df.a_col.isnull() the values corresponding to the indices in a_list. This works, but the behavior is implicit, and could easily change in the future, so that's what the warning is about.
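To make the behavior concrete, here is a minimal sketch (the frame and a_list are made-up stand-ins for the OP's objects) showing that the chained form triggers the warning while the single-mask form selects the same rows without it:
import numpy as np
import pandas as pd

# Made-up stand-ins for the OP's df and a_list.
df = pd.DataFrame({'a_col': [1.0, np.nan, 3.0, np.nan]})
a_list = [0, 1, 2]

# Chained selection: the mask has length 4 but df.loc[a_list] has length 3,
# so pandas reindexes the mask onto the shorter frame and warns.
with_warning = df.loc[a_list][df.a_col.isnull()]

# Single boolean mask over the full frame: same rows, no warning.
without_warning = df[df.index.isin(a_list) & df.a_col.isnull()]

print(with_warning.equals(without_warning))  # True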

If you get this warning, using .loc[] instead of [] suppresses it.[1]
df.loc[boolean_mask] # <--------- OK
df[boolean_mask] # <--------- warning
For the particular case in the OP, you can chain .loc[] indexers:
df.loc[a_list].loc[df['a_col'].isna()]
or chain all conditions using and inside query():
# if a_list is a list of indices of df
df.query("index in @a_list and a_col != a_col")
# if a_list is a list of values in some other column such as b_col
df.query("b_col in @a_list and a_col != a_col")
or chain all conditions using & inside [] (as in @IanS's post above).
This warning occurs if:
1. the index of the boolean mask is not in the same order as the index of the dataframe it is filtering:
df = pd.DataFrame({'a_col':[1, 2, np.nan]}, index=[0, 1, 2])
m1 = pd.Series([True, False, True], index=[2, 1, 0])
df.loc[m1] # <--------- OK
df[m1] # <--------- warning
2. the index of the boolean mask is a superset of the index of the dataframe it is filtering. For example:
m2 = pd.Series([True, False, True, True], index=np.r_[df.index, 10])
df.loc[m2] # <--------- OK
df[m2] # <--------- warning
[1]: If we look at the source code of [] and loc[], literally the only difference when the index of the boolean mask is a (weak) superset of the index of the dataframe is that [] shows this warning (via the _getitem_bool_array method) and loc[] does not.

Related

What advantages does the iloc function have in pandas and Python

I just began to learn Python and Pandas and I saw in many tutorials the use of the iloc function. It is always stated that you can use this function to refer to columns and rows in a dataframe. However, you can also do this directly without the iloc function. So here is an example that yields the same output:
# features is just a dataframe with several rows and columns
features = pd.DataFrame(features_standardized)
y_train = features.iloc[start:end][[1]]
y_train_noIloc = features[start:end][[1]]
What is the difference between the two statements and what advantage do I have when using iloc? I'd appreciate every comment.
Per the pandas docs, iloc provides:
Purely integer-location based indexing for selection by position.
Therefore, as shown in the simple examples below, [row, col] indexing is not possible without using loc or iloc; a KeyError will be thrown.
Example:
# Build a simple, sample DataFrame.
df = pd.DataFrame({'a': [1, 2, 3, 4]})
# No iloc
>>> df[0, 0]
KeyError: (0, 0)
# With iloc:
>>> df.iloc[0, 0]
1
The same logic holds true when using loc and a column name.
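For instance, on the same sample frame, loc accepts a [row label, column name] pair directly (a quick check of my own, not from the original answer):
# With loc, using the row label and the column name:
>>> df.loc[0, 'a']
1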
What is the difference and when does the indexing work without iloc?
The short answer:
Use loc and/or iloc when indexing by both rows and columns. If you are indexing only on rows or only on columns, you can get away without them; this is referred to as 'slicing'.
However, I see that [start:end][[1]] has been used in your example. It is generally considered bad practice to have back-to-back square brackets in pandas (e.g. [][]), and they are usually an indication that a different (more efficient) approach should be taken - in this case, using iloc.
The longer answer:
Adapting your [start:end] slicing example (shown below), indexing works without iloc when slicing on rows only. The following example does not use iloc and returns the rows at positions 0, 1 and 2 (the slice end is exclusive):
df[0:3]
Output:
a
0 1
1 2
2 3
Note the difference between [0:3] and [0, 3]. The former (slicing) uses a colon and returns the rows at positions 0 up to, but not including, 3. The latter uses a comma and is a [row, col] indexer, which requires the use of iloc.
Aside:
The two methods can be combined as shown here, returning the rows at positions 0 through 2 for the column at index 0. This is not possible without the use of iloc.
df.iloc[0:3, 0]
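A small addendum (my own sketch, not in the original answer): the label-based counterpart uses loc, whose slices are end-inclusive, so the stop label 2 selects the same three rows as iloc's 0:3 on this frame:
# Label-based equivalent; loc slices include the stop label.
df.loc[0:2, 'a']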

What is wrong with this Numpy/Pandas code to construct new boolean column based on the values in two other boolean columns?

I have the following data set:
Beginning Data Set:
ObjectID,Date,Price,Vol,Mx
101,2017-01-01,,145,203
101,2017-01-02,,155,163
101,2017-01-03,67.0,140,234
101,2017-01-04,78.0,130,182
101,2017-01-05,58.0,178,202
101,2017-01-06,53.0,134,204
101,2017-01-07,52.0,134,183
101,2017-01-08,62.0,148,176
101,2017-01-09,42.0,152,193
101,2017-01-10,80.0,137,150
I first create two new columns of boolean values called VolPrice and Check based on the values in my starting data set. I then want to create a third column called DoubleCheck, whose value should be True if either VolPrice or Check is True, and False otherwise. Initially I got the following error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
but then I added .any() after each column within my statement constructing the DoubleCheck column. However, this isn't working either, because it produces True throughout the DoubleCheck column even where there should be False values, as shown below.
Code:
import pandas as pd
import numpy as np
Observations = pd.read_csv("C:\\Users\\Observations.csv", parse_dates=['Date'], index_col=['ObjectID', 'Date'])
Observations['VolPrice'] = np.where((Observations['Price']<Observations['Vol']) & (Observations['Vol']<Observations['Mx']), True, False)
Observations['Check'] = np.where(Observations['Vol']<Observations['Price'], True, False)
Observations['DoubleCheck'] = np.where((Observations['Check'].any()==True) or (Observations['VolPrice'].any()==True), True, False)
print(Observations)
Current Result:
ObjectID,Date,Price,Vol,Mx,VolPrice,Check,DoubleCheck
101,2017-01-01,,145,203,False,False,True
101,2017-01-02,,155,163,False,False,True
101,2017-01-03,67.0,140,234,True,False,True
101,2017-01-04,78.0,130,182,True,False,True
101,2017-01-05,58.0,178,202,True,False,True
101,2017-01-06,53.0,134,204,True,False,True
101,2017-01-07,52.0,134,183,True,False,True
101,2017-01-08,62.0,148,176,True,False,True
101,2017-01-09,42.0,152,193,True,False,True
101,2017-01-10,80.0,137,150,True,False,True
Desired Result:
ObjectID,Date,Price,Vol,Mx,VolPrice,Check,DoubleCheck
101,2017-01-01,,145,203,False,False,False
101,2017-01-02,,155,163,False,False,False
101,2017-01-03,67.0,140,234,True,False,True
101,2017-01-04,78.0,130,182,True,False,True
101,2017-01-05,58.0,178,202,True,False,True
101,2017-01-06,53.0,134,204,True,False,True
101,2017-01-07,52.0,134,183,True,False,True
101,2017-01-08,62.0,148,176,True,False,True
101,2017-01-09,42.0,152,193,True,False,True
101,2017-01-10,80.0,137,150,True,False,True
Use | for bitwise OR; it works the same way as & does for bitwise AND:
Observations['DoubleCheck'] = Observations['Check'] | Observations['VolPrice']
Or use DataFrame.any across both columns:
Observations['DoubleCheck'] = Observations[['Check','VolPrice']].any(axis=1)
All of it is possible without np.where, because the comparisons already return boolean Series:
Observations['VolPrice'] = (Observations['Price']<Observations['Vol']) & (Observations['Vol']<Observations['Mx'])
Observations['Check'] = Observations['Vol']<Observations['Price']
Observations['DoubleCheck'] = Observations['Check'] | Observations['VolPrice']
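For reference, the original code failed because Series.any() collapses a whole column to a single scalar, and the Python-level or then broadcasts that one scalar down the entire DoubleCheck column. A minimal sketch rebuilding the first three rows of the question's data inline (instead of read_csv) to check the corrected logic:
import numpy as np
import pandas as pd

# First three rows of the question's data, rebuilt inline for brevity.
Observations = pd.DataFrame({'Price': [np.nan, np.nan, 67.0],
                             'Vol': [145, 155, 140],
                             'Mx': [203, 163, 234]})
Observations['VolPrice'] = (Observations['Price'] < Observations['Vol']) & (Observations['Vol'] < Observations['Mx'])
Observations['Check'] = Observations['Vol'] < Observations['Price']
Observations['DoubleCheck'] = Observations['Check'] | Observations['VolPrice']
print(Observations[['VolPrice', 'Check', 'DoubleCheck']])
#    VolPrice  Check  DoubleCheck
# 0     False  False        False
# 1     False  False        False
# 2      True  False         True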

Select slice of dataframe according to value of multiIndex

I have a multiIndex dataframe.
I am able to create a logical mask using the following:
df.index.get_level_values(0).to_series().str.find('1000')!=-1
This returns a boolean True for all the rows where the first index level contains the characters '1000', and False otherwise.
But I am not able to slice the dataframe using that mask.
I tried with:
df[df.index.get_level_values(0).to_series().str.find('1000')!=-1]
and it returned the following error:
ValueError: cannot reindex from a duplicate axis
I also tried with:
df[df.index.get_level_values(0).to_series().str.find('1000')!=-1,:]
which just echoes the logical mask in the output, along with the following (truncated) error:
Length: 1755, dtype: bool, slice(None, None, None))' is an invalid key
Can someone point me to the right solution and to a good reference on how to slice properly a multiIndex dataframe?
One idea is to remove to_series() and use str.contains to test for the substring:
df[df.index.get_level_values(0).str.contains('1000')]
Another is to convert the mask to a NumPy array:
df[df.index.get_level_values(0).str.contains('1000').values]
Your solution also works after converting the mask's values to an array:
df[(df.index.get_level_values(0).to_series().str.find('1000')!=-1).values]
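As for the original error: [] tries to align the boolean Series with the frame's index, and that alignment fails because the mask built via to_series() is indexed by the level-0 values themselves, which contain duplicates. A sketch with made-up data that reproduces the situation (the labels here are my own invention):
import pandas as pd

# Made-up frame with duplicated level-0 labels, as is typical of a MultiIndex.
idx = pd.MultiIndex.from_tuples([('1000A', 1), ('1000A', 2), ('2000B', 1)])
df = pd.DataFrame({'x': [1, 2, 3]}, index=idx)

mask = df.index.get_level_values(0).to_series().str.find('1000') != -1
# df[mask]  # ValueError: cannot reindex from a duplicate axis
print(df[mask.values])                                         # positional, works
print(df[df.index.get_level_values(0).str.contains('1000')])  # ndarray mask, works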

Pandas selecting with unaligned indexes

I have 2 series.
The first one contains a list of numbers with an index counting 0..8.
A = pd.Series([2,3,4,6,5,4,7,6,5], name=['A'], index=[0,1,2,3,4,5,6,7,8])
The second one contains only True values, but its index is a subset of the first one's.
B = pd.Series([1, 1, 1, 1, 1], name=['B'], index=[0,2,4,7,8], dtype=bool)
I'd like to use B as boolean vector to get the A-values for the corresponding indexes, like:
A[B]
[...]
IndexingError: Unalignable boolean Series key provided
Unfortunately this raises an error.
Do I need to align them first?
Does
A[B.index.values]
work for your version of pandas? (I see we have different versions, because for me the Series name now has to be hashable, so your code gave me an error.)
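Alternatively, if you want to keep B as a boolean mask, a sketch of one option (assuming the Series were built with hashable names, e.g. name='A', and that labels missing from B should count as False) is to align B onto A's index first:
# Align B to A's index, filling missing labels with False, then mask.
A[B.reindex(A.index, fill_value=False)]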

Set first and last row of a column in a dataframe

I've been reading over this and still find the subject a little confusing:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Say I have a Pandas DataFrame and I wish to simultaneously set the first and last row elements of a single column to whatever value. I can do this :
df.iloc[[0, -1]].mycol = [1, 2]
which tells me 'A value is trying to be set on a copy of a slice from a DataFrame' and that this is potentially dangerous.
I could use .loc instead, but then I need to know the index labels of the first and last rows (in contrast, .iloc lets me access them by position).
What's the safest Pandasy way to do this ?
To get to this point:
# Django queryset
query = market.stats_set.annotate(distance=F("end_date") - query_date)
# Generate a dataframe from this queryset, and order by distance
df = pd.DataFrame.from_records(query.values("distance", *fields), coerce_float=True)
df = df.sort_values("distance").reset_index(drop=True)
Then, I try calling df.distance.iloc[[0, -1]] = [1, 2]. This raises the warning.
The issue isn't with iloc; it's when you access .mycol that a copy is created. You can do this all within iloc:
df.iloc[[0, -1], df.columns.get_loc('mycol')] = [1, 2]
Usually ix is used if you want mixed integer- and label-based access, but it doesn't work in this case, since -1 isn't actually in the index, and apparently ix isn't smart enough to know that it should mean the last index.
What you're doing is called chained indexing, you can use iloc just on that column to avoid the warning:
In [24]:
df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
Out[24]:
a b c
0 1.589940 0.735713 -1.158907
1 0.485653 0.044611 0.070907
2 1.123221 -0.862393 -0.807051
3 0.338653 -0.734169 -0.070471
4 0.344794 1.095861 -1.300339
In [25]:
df['a'].iloc[[0, -1]] = 'foo'
df
Out[25]:
a b c
0 foo 0.735713 -1.158907
1 0.485653 0.044611 0.070907
2 1.12322 -0.862393 -0.807051
3 0.338653 -0.734169 -0.070471
4 foo 1.095861 -1.300339
If you do it the other way then it raises the warning:
In [27]:
df.iloc[[0, -1]]['a'] = 'foo'
C:\WinPython-64bit-3.4.3.1\python-3.4.3.amd64\lib\site-packages\IPython\kernel\__main__.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
if __name__ == '__main__':
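If you prefer loc, note that you don't actually need to know the first and last labels up front; a common pattern (a sketch of my own, not from the answer above) is to fetch them positionally from the index first:
# Grab the first and last labels positionally, then assign through loc.
df.loc[df.index[[0, -1]], 'a'] = 'foo'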
