Pandas selecting with unaligned indexes - python

I have 2 series.
The first one contains a list of numbers with an index counting 0..8.
A = pd.Series([2,3,4,6,5,4,7,6,5], name=['A'], index=[0,1,2,3,4,5,6,7,8])
The second one only contains True values, but the index of the series is a subset of the first one.
B = pd.Series([1, 1, 1, 1, 1], name=['B'], index=[0,2,4,7,8], dtype=bool)
I'd like to use B as boolean vector to get the A-values for the corresponding indexes, like:
A[B]
[...]
IndexingError: Unalignable boolean Series key provided
Unfortunately this raises an error.
Do I need to align them first?

Does
A[B.index.values]
work for your version of pandas? (I see we have different versions because now the Series name has to be hashable, so your code gave me an error)
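A minimal sketch of two ways to do this, assuming a hashable name such as "A" instead of the list used in the question. The first selects by B's index labels; the second reindexes B onto A's index so the boolean mask aligns. Using fill_value=False is an assumption about intent: labels missing from B are treated as "not selected".

```python
import pandas as pd

A = pd.Series([2, 3, 4, 6, 5, 4, 7, 6, 5], name="A", index=range(9))
B = pd.Series([True] * 5, name="B", index=[0, 2, 4, 7, 8], dtype=bool)

# Option 1: B only holds True values, so its index labels are the selection.
by_label = A.loc[B.index]

# Option 2: expand B to A's index, filling missing labels with False,
# then use it as an ordinary aligned boolean mask.
mask = B.reindex(A.index, fill_value=False)
by_mask = A[mask]

print(by_mask.tolist())
```

Option 2 is the more general one, since it still works if B also contains False values.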

Related

Remove row based on sum of numpy array within each entry in df column

I feel I'm making this harder than it should be. I have a dataframe in which some columns hold numpy arrays as entries (the names of these columns are in an array called names_of_cols_that_contain_arrays). I want to filter out rows for which these numpy arrays sum to zero. My code is based on a similar question, but it doesn't seem to work with the iterator over rows in each column.
What I have currently in my code is
for col_name in names_of_cols_that_contain_arrays:
    for i in range(len(df[col_name])):
        df = df[df[col_name][i].sum() > 0.0]
which doesn't seem that efficient but is a first attempt that explicitly goes through what I thought would be the correct method. But this appears to return a boolean, i.e.
Traceback
...
KeyError: True
In fact, most variations of the code above give me some error associated with a boolean being returned. Any pointers would be appreciated, thanks in advance!
IIUC:
You can try:
df=df.loc[df['names_of_cols_that_contain_arrays'].map(sum)>0]
#OR
df=df.loc[df['names_of_cols_that_contain_arrays'].map(np.sum).gt(0)]
Sample dataframe used:
import pandas as pd
from numpy import array
d={'names_of_cols_that_contain_arrays': {0: array([-1, 0, -8]),
                                         1: array([-1, -2, 5])}}
df=pd.DataFrame(d)
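The answer above treats a single column. A possible extension to several array-valued columns is sketched below; the column names "x" and "y" and the rule "keep a row only if every array column has a non-zero sum" are assumptions made for illustration, not something stated in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [np.array([0, 0]), np.array([1, 2]), np.array([3, 0])],
    "y": [np.array([1, 1]), np.array([0, 0]), np.array([2, 2])],
})
names_of_cols_that_contain_arrays = ["x", "y"]  # hypothetical column names

# One sum per cell: map np.sum over every array in each listed column.
sums = df[names_of_cols_that_contain_arrays].apply(lambda col: col.map(np.sum))

# Build a single row-wise boolean mask and filter once, rather than
# re-filtering df inside a loop.
mask = (sums != 0).all(axis=1)
filtered = df[mask]

print(filtered.index.tolist())
```

Building the mask first and indexing once avoids the KeyError from the question, which came from indexing df with a single scalar boolean.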

How to pass a series to call a user defined function?

I am trying to pass a series to a user defined function and getting this error:
Function:
def scale(series):
    sc=StandardScaler()
    sc.fit_transform(series)
    print(series)
Code for calling:
df['Value'].apply(scale) # df['Value'] is a Series having float dtype.
Error:
ValueError: Expected 2D array, got scalar array instead:
array=28.69.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Can anyone help address this issue?
The method apply will apply a function to each element in the Series (or in case of a DataFrame either each row or each column depending on the chosen axis). Here you expect your function to process the entire Series and to output a new Series in its stead.
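To make the element-wise behaviour concrete, here is a small sketch that records what Series.apply actually hands to the function. The helper name record and the sample values are made up for illustration.

```python
import pandas as pd

values = pd.Series([28.69, 30.0, 31.5], name="Value")

received = []

def record(x):
    # Called once per element: x is a scalar, not the whole Series.
    received.append(x)
    return x

values.apply(record)

print([type(x).__name__ for x in received])
```

Each call receives a single float, which is why a scaler that needs a 2D array complains about a "scalar array".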
You can therefore simply run:
StandardScaler().fit_transform(df['Value'].values.reshape(-1, 1))
StandardScaler expects a 2D array as input, where each row is a sample that consists of one or more features. Even if it is just a single feature (as seems to be the case in your example), it has to have the right dimensions. Therefore, before handing your Series over to sklearn, I access the values (the numpy representation) and reshape them accordingly.
For more details on reshape(-1, ...) check this out: What does -1 mean in numpy reshape?
Now, the best bit. If your entire DataFrame consists of a single column you could simply do:
StandardScaler().fit_transform(df)
And even if it doesn't, you could still avoid the reshape:
StandardScaler().fit_transform(df[['Value']])
Note how in this case 'Value' is surrounded by two sets of square brackets, so this time it is not a Series but rather a DataFrame with a subset of the original columns (in case you do not want to scale all of them). Since a DataFrame is already 2-dimensional, you don't need to worry about reshaping.
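The difference between the three inputs discussed so far can be checked by looking at their shapes. This is only a sketch with made-up sample values; it uses pandas and numpy alone and does not call scikit-learn.

```python
import pandas as pd

df = pd.DataFrame({"Value": [28.69, 30.0, 31.5]})

as_series = df["Value"]                      # single brackets: 1-D Series
as_frame = df[["Value"]]                     # double brackets: 2-D DataFrame
reshaped = df["Value"].values.reshape(-1, 1) # 1-D array turned into one column

print(as_series.shape, as_frame.shape, reshaped.shape)
```

Only the last two have the (n_samples, n_features) layout that a 2D-only estimator accepts.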
Finally, if you want to scale just some of the columns and update your original DataFrame all you have to do is:
>>> df = pd.DataFrame({'A': [1,2,3], 'B': [0,5,6], 'C': [7, 8, 9]})
>>> columns_to_scale = ['A', 'B']
>>> df[columns_to_scale] = StandardScaler().fit_transform(df[columns_to_scale])
>>> df
          A         B  C
0 -1.224745 -1.397001  7
1  0.000000  0.508001  8
2  1.224745  0.889001  9
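As a cross-check of the numbers above without scikit-learn, the same scaling can be written by hand in pandas. This is a swapped-in equivalent rather than the answer's method, and it relies on StandardScaler dividing by the population standard deviation (ddof=0), whereas pandas defaults to ddof=1.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [0, 5, 6], "C": [7, 8, 9]})
columns_to_scale = ["A", "B"]

subset = df[columns_to_scale]
# Subtract each column's mean and divide by its population std (ddof=0).
scaled = (subset - subset.mean()) / subset.std(ddof=0)

print(scaled.round(6))
```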

Can't use loc with DatetimeIndex

I can't select using loc when there is DatetimeIndex.
test = pd.DataFrame(data=np.array([[0, 0], [0, 2], [1, 3]]), columns=pd.date_range(start='2019-01-01', end='2019-01-02', freq='D'))
test.loc[test>1, '2019-01-02']
I expect it to return pandas.Series([2, 3]), but it returns the error "Cannot index with multidimensional key"
In this case, your index is not a DatetimeIndex, only your columns are. The issue is that when you use test>1 as a comparison, it will return a DataFrame with the same size as test with Booleans for each cell showing whether the value is > 1. When you pass an array of booleans, it expects it to be a 1 dimensional array, but since you're passing it a DataFrame (2 dimensional), you get the "multidimensional key" error. I believe what you want here is:
test.loc[test['2019-01-02']>1, '2019-01-02']
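A runnable sketch of the difference between the two masks follows. It uses an explicit pd.Timestamp as the column key instead of the string; that is an assumption made here to avoid depending on how a given pandas version resolves date strings against the columns.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame(
    data=np.array([[0, 0], [0, 2], [1, 3]]),
    columns=pd.date_range(start="2019-01-01", end="2019-01-02", freq="D"),
)
col = pd.Timestamp("2019-01-02")  # explicit column key (assumption, see above)

frame_mask = test > 1      # 2-D boolean DataFrame: not usable as a row indexer
row_mask = test[col] > 1   # 1-D boolean Series: one entry per row

selected = test.loc[row_mask, col]
print(selected.tolist())
```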

pandas: slicing along first level of multiindex

I've set up a DataFrame with two indices. But slicing doesn't behave as expected.
I realize that this is a very basic problem, so I searched for similar questions:
pandas: slice a MultiIndex by range of secondary index
Python Pandas slice multiindex by second level index (or any other level)
I also looked at the corresponding documentation
Strangely none of the proposed solutions work for me.
I've set up a simple example to showcase the problem:
# this is my DataFrame
frame = pd.DataFrame([
{"a":1, "b":1, "c":"11"},
{"a":1, "b":2, "c":"12"},
{"a":2, "b":1, "c":"21"},
{"a":2, "b":2, "c":"22"},
{"a":3, "b":1, "c":"31"},
{"a":3, "b":2, "c":"32"}])
# now set a and b as multiindex
frame = frame.set_index(["a","b"])
Now I'm trying different ways of slicing the frame.
The first two lines work, the third throws an exception:
# selecting a specific cell works
frame.loc[1,2]
# slicing along the second index works
frame.loc[1,:]
# slicing along the first doesn't work
frame.loc[:,1]
It's a TypeError:
TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [1] of <class 'int'>
Solution 1: Using tuples of slices
This is proposed in this question: pandas: slice a MultiIndex by range of secondary index
Indeed, you can pass a slice for each level
But that doesn't work for me, the same type error as above is produced.
frame.loc[(slice(1,2), 1)]
Solution 2: Using IndexSlice
Python Pandas slice multiindex by second level index (or any other level)
Use an indexer to slice arbitrary values in arbitrary dimensions
Again, that doesn't work for me, it produces the same type error.
frame.loc[pd.IndexSlice[:,2]]
I don't understand how this typeerror can be produced. After all I can use integers to select specific cells, and ranges along the second dimension work fine.
Googling for my specific error message doesn't really help.
For example, here someone tries to use integers to slice along an index of type float: https://github.com/pandas-dev/pandas/issues/12333
I tried explicitly converting my indices to int; maybe the numpy backend stores everything as float by default?
But that didn't change anything, afterwards the same errors as above appear:
frame["a"]=frame["a"].apply(lambda x : int(x))
frame["b"]=frame["b"].apply(lambda x : int(x))
type(frame["b"][0]) # it's numpy.int64
IIUC you just have to specify : for columns when indexing a multi-index DF:
In [40]: frame.loc[pd.IndexSlice[:,2], :]
Out[40]:
      c
a b
1 2  12
2 2  22
3 2  32
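For reference, the whole example can be run end to end as below. The xs call at the end is an additional alternative not mentioned in the answer; it selects on one level by name and drops that level from the result.

```python
import pandas as pd

frame = pd.DataFrame([
    {"a": 1, "b": 1, "c": "11"},
    {"a": 1, "b": 2, "c": "12"},
    {"a": 2, "b": 1, "c": "21"},
    {"a": 2, "b": 2, "c": "22"},
    {"a": 3, "b": 1, "c": "31"},
    {"a": 3, "b": 2, "c": "32"},
]).set_index(["a", "b"])

# Row indexer for both levels, plus an explicit ":" for the columns.
sliced = frame.loc[pd.IndexSlice[:, 2], :]

# Alternative: cross-section on level "b" (the "b" level is dropped).
cross = frame.xs(2, level="b")

print(sliced)
print(cross)
```

Without the trailing ":", pandas reads the second element of the tuple as a column label, which is what produced the TypeError in the question.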

Boolean Series key will be reindexed to match DataFrame index

Here is how I encountered the warning:
df.loc[a_list][df.a_col.isnull()]
The type of a_list is Int64Index, it contains a list of row indexes. All of these row indexes belong to df.
The df.a_col.isnull() part is a condition I need for filtering.
If I execute the following commands individually, I do not get any warnings:
df.loc[a_list]
df[df.a_col.isnull()]
But if I put them together df.loc[a_list][df.a_col.isnull()], I get the warning message (but I can see the result):
Boolean Series key will be reindexed to match DataFrame index
What is the meaning of this warning message? Does it affect the result that it returned?
Your approach will work despite the warning, but it's best not to rely on implicit, unclear behavior.
Solution 1, make the selection of indices in a_list a boolean mask:
df[df.index.isin(a_list) & df.a_col.isnull()]
Solution 2, do it in two steps:
df2 = df.loc[a_list]
df2[df2.a_col.isnull()]
Solution 3, if you want a one-liner, use a trick found here:
df.loc[a_list].query('a_col != a_col')
The warning comes from the fact that the boolean vector df.a_col.isnull() is the length of df, while df.loc[a_list] is of the length of a_list, i.e. shorter. Therefore, some indices in df.a_col.isnull() are not in df.loc[a_list].
What pandas does is reindex the boolean series on the index of the calling dataframe. In effect, it gets from df.a_col.isnull() the values corresponding to the indices in a_list. This works, but the behavior is implicit, and could easily change in the future, so that's what the warning is about.
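The implicit step described above can be written out explicitly. The sketch below uses made-up data and performs the reindexing by hand, which gives the same rows as the boolean-mask version in Solution 1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a_col": [1.0, np.nan, 3.0, np.nan]}, index=[10, 11, 12, 13])
a_list = pd.Index([11, 12])

subset = df.loc[a_list]          # shorter than df
mask = df["a_col"].isnull()      # as long as df

# Do by hand what pandas does implicitly: keep only the mask entries
# whose labels exist in the subset.
aligned_mask = mask.reindex(subset.index)
result = subset[aligned_mask]

# Same rows as combining both conditions on the full frame.
combined = df[df.index.isin(a_list) & df["a_col"].isnull()]

print(result.index.tolist(), combined.index.tolist())
```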
If you get this warning, using .loc[] instead of [] suppresses it (see note 1).
df.loc[boolean_mask] # <--------- OK
df[boolean_mask] # <--------- warning
For the particular case in the OP, you can chain .loc[] indexers:
df.loc[a_list].loc[df['a_col'].isna()]
or chain all conditions using and inside query():
# if a_list is a list of indices of df
df.query("index in @a_list and a_col != a_col")
# if a_list is a list of values in some other column such as b_col
df.query("b_col in @a_list and a_col != a_col")
or chain all conditions using & inside [] (as in @IanS's post).
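A small runnable sketch of the two query() forms, with made-up data. The names b_values and the sample labels are hypothetical; the a_col != a_col comparison is the usual NaN test, since NaN is the only value not equal to itself.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a_col": [1.0, np.nan, 3.0, np.nan], "b_col": ["w", "x", "y", "z"]},
    index=[10, 11, 12, 13],
)

# Case 1: a_list holds index labels of df.
a_list = [11, 12]
by_index = df.query("index in @a_list and a_col != a_col")

# Case 2: the list holds values of another column (b_col here).
b_values = ["x", "z"]
by_column = df.query("b_col in @b_values and a_col != a_col")

print(by_index.index.tolist(), by_column.index.tolist())
```

The @ prefix tells query() to look the name up among local Python variables rather than among the column names.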
This warning occurs if
the index of the boolean mask is not in the same order as the index of the dataframe it is filtering.
df = pd.DataFrame({'a_col':[1, 2, np.nan]}, index=[0, 1, 2])
m1 = pd.Series([True, False, True], index=[2, 1, 0])
df.loc[m1] # <--------- OK
df[m1] # <--------- warning
the index of the boolean mask is a superset of the index of the dataframe it is filtering. For example:
m2 = pd.Series([True, False, True, True], np.r_[df.index, 10])
df.loc[m2] # <--------- OK
df[m2] # <--------- warning
1: If we look at the source code of [] and loc[], literally the only difference when the index of the boolean mask is a (weak) superset of the index of the dataframe is that [] shows this warning (via the _getitem_bool_array method) and loc[] does not.
