This question was previously asked (and then deleted) by a user. I was looking for a solution so I could give an answer when the question disappeared, and moreover I can't seem to make sense of pandas' behaviour, so I would appreciate some clarity. The original question stated something along the lines of:
How can I replace every negative value except those in a given list with NaN in a Pandas dataframe?
My setup to reproduce the scenario is the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': [x for x in range(4)],
    'B': [x for x in range(-2, 2)]
})
This should technically only be a matter of passing the right boolean mask (as df.where / boolean indexing expects); my attempted solution looks like:
df[df >= 0 | df.isin([-2])]
which produces:

   A    B
0  0  NaN
1  1  NaN
2  2  0.0
3  3  1.0
which also masks out -2, the value in the exception list!
Moreover, if I mask the dataframe with each of the two conditions separately, I get the expected behaviour:
With df[df >= 0] (identical to the compound result):
   A    B
0  0  NaN
1  1  NaN
2  2  0.0
3  3  1.0
With df[df.isin([-2])] (which correctly keeps -2):
    A    B
0 NaN -2.0
1 NaN  NaN
2 NaN  NaN
3 NaN  NaN
So it seems like either:

- I am running into some undefined behaviour as a result of performing logic on NaN values, or
- I have got something wrong.

Can anyone clarify this situation for me?
Solution
df[(df >= 0) | (df.isin([-2]))]
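For the example frame above, this keeps -2 while still masking -1:

>>> df[(df >= 0) | (df.isin([-2]))]
   A    B
0  0 -2.0
1  1  NaN
2  2  0.0
3  3  1.0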
Explanation
In Python, bitwise OR (|) has a higher operator precedence than comparison operators like >=: https://docs.python.org/3/reference/expressions.html#operator-precedence
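Concretely, here is how the failing expression is parsed (a minimal sketch using the df defined above):

# | is applied first, so df >= 0 | df.isin([-2]) is evaluated as:
implicit = df >= (0 | df.isin([-2]))
# (0 | boolean_frame) is just the boolean frame, so this compares df
# against True/False, i.e. against 1/0 -- which is why -2 got masked
intended = (df >= 0) | df.isin([-2])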
When filtering a pandas DataFrame on multiple boolean conditions, you need to enclose each condition in parentheses. More from the boolean indexing section of the pandas user guide:
Another common operation is the use of boolean vectors to filter the
data. The operators are: | for or, & for and, and ~ for not. These
must be grouped by using parentheses, since by default Python will
evaluate an expression such as df['A'] > 2 & df['B'] < 3 as df['A'] > (2 & df['B']) < 3, while the desired evaluation order is (df['A'] > 2) & (df['B'] < 3).
Related
I have a Pandas DataFrame where column B contains mixed types
A B C
0 1 1 False
1 2 abc False
2 3 2 False
3 4 3 False
4 5 b False
I want to modify column C to be True when the value in column B is of type int and also has a value greater than or equal to 3. So in this example df['B'][3] should match this condition.
I tried to do this:
df.loc[(df['B'].astype(str).str.isdigit()) & (df['B'] >= 3)] = True
However I get the following error because of the str values inside column B:
TypeError: '>' not supported between instances of 'str' and 'int'
If I'm able to only test the second condition on the subset provided after the first condition this would solve my problem I think. What can I do to achieve this?
A good way without the use of apply would be to use pd.to_numeric with errors='coerce', which will coerce the str values to NaN without modifying column B itself:
df['C'] = pd.to_numeric(df.B, errors='coerce') >= 3
>>> print(df)
A B C
0 1 1 False
1 2 abc False
2 3 2 False
3 4 3 True
4 5 b False
One solution could be:
df["B"].apply(lambda x: str(x).isdigit() and int(x) >= 3)
If x is not a digit, the and short-circuits and won't try to parse x as an int, which would throw a ValueError for an argument that is not parseable as an int.
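To actually update the column, the result can be assigned back; a minimal usage sketch:

df['C'] = df["B"].apply(lambda x: str(x).isdigit() and int(x) >= 3)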
There are many ways around this (e.g. use a custom (lambda) function with df.apply, use df.replace() first), but I think the easiest way might be just to use an intermediate column.
First, create a new column that does the first check, then do the second check on this new column.
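A minimal sketch of that idea (the intermediate column name B_is_int is my own choice):

df['B_is_int'] = df['B'].apply(lambda x: str(x).isdigit())   # first check
df['C'] = False
# second check, evaluated only on the rows that passed the first one
df.loc[df['B_is_int'], 'C'] = df.loc[df['B_is_int'], 'B'].astype(int) >= 3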
The apply-based version below also works (although nikeros' answer is more elegant).
def check_maybe_int(n):
    # str() guard: plain ints have no .isdigit method
    return int(n) >= 3 if str(n).isdigit() else False

df.B.apply(check_maybe_int)
But the real answer is, don't do this! Mixed columns prevent a lot of Pandas' optimisations. apply is not vectorised, so it's a lot slower than vector int comparison should be.
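If you do need this comparison repeatedly, a one-time dtype conversion restores vectorisation; a sketch (the column name B_numeric is my own):

df['B_numeric'] = pd.to_numeric(df['B'], errors='coerce')   # strs become NaN
df['C'] = df['B_numeric'] >= 3                              # plain vectorised comparison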
You can use apply(type) to inspect the type of each value, as the example below illustrates:
d = {'col1': [1, 2, 1, 2], 'col2': [3, 4, 1, 2], 'col3': [1, 2, 1, 2], 'col4': [1, 'e', True, 2.345]}
df = pd.DataFrame(data=d)
a = df.col4.apply(type)      # Series of type objects
b = [i == str for i in a]    # True where the value is a str
df['col5'] = b
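For the data above, col4 holds an int, a str, a bool and a float, so col5 comes out as:

>>> df[['col4', 'col5']]
    col4   col5
0      1  False
1      e   True
2   True  False
3  2.345  False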
I wish to set the values of a dataframe that lie between an index range and a value range to be NaN values. For example, say I have n columns, I want for every numeric data point in these columns to be set to NaN if they meet the following conditions:
The value is between -1 and 1
The index of this value is between 1 and 3
Below I have some code that tries to do what I'm describing above, and it almost works; the problem is that it sets these values on a copy of the original dataframe, and trying to use .loc instead throws the following error:
KeyError: "None of [Index([('a',), ('b',), ('c',)], dtype='object')] are in the [columns]"
import numpy as np
import pandas as pd
np.random.seed(398)
df = pd.DataFrame(np.random.randn(5, 3), columns=['a', 'b', 'c'])
row_indexer = (df.index > 0) & (df.index < 4)
col_indexer = (df > -1) & (df < 1)
df[row_indexer][col_indexer] = np.nan
I'm sure there's a really simple solution; I just can't figure out the correct syntax.
(Additionally, I want to "extract" these filtered values (the ones I'm setting to NaN) into a second dataframe, but I'm fairly sure any solution that solves the primary question will solve this additional issue)
Any help would be appreciated
Try broadcasting with numpy:
df[row_indexer[:, None] & col_indexer] = np.nan
Output:
a b c
0 -1.810802 -0.776590 -0.495147
1 1.381038 NaN 2.334671
2 NaN -1.571401 1.011139
3 -1.200217 -1.013983 NaN
4 1.261759 0.863896 0.228914
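This works because row_indexer (built from comparisons on df.index) is a NumPy boolean array of shape (5,); adding the extra axis makes it (5, 1), which broadcasts against the (5, 3) col_indexer:

row_indexer.shape           # (5,)
row_indexer[:, None].shape  # (5, 1) -- broadcasts across all three columns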
I would use mul, since True * True = True:
out = df.mask(col_indexer.mul(row_indexer ,axis=0))
Out[81]:
a b c
0 -1.810802 -0.776590 -0.495147
1 1.381038 NaN 2.334671
2 NaN -1.571401 1.011139
3 -1.200217 -1.013983 NaN
4 1.261759 0.863896 0.228914
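For the follow-up in the question (extracting the filtered values into a second dataframe), the same combined mask can be reused; a minimal sketch:

mask = row_indexer[:, None] & col_indexer
extracted = df[mask]    # the targeted values, NaN everywhere else
df[mask] = np.nan       # then blank them out in the original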
I have a dataframe generated by:
df = pd.DataFrame([[100, ' tes t ', 3], [100, np.nan, 2], [101, ' test1', 3 ], [101,' ', 4]])
It looks like
     0       1  2
0  100   tes t  3
1  100     NaN  2
2  101   test1  3
3  101          4
I would like to fill column 1 "forward" with test and test1. I believe one approach would be to replace the whitespace-only entries with np.nan, but that is tricky since the real words contain whitespace as well. I could also group by column 0 and then use the first element of each group to fill forward. Could you provide me with some code for both alternatives? I can't get it coded myself.
Additionally, I would like to add a column that contains the group means; that is, the final dataframe should look like this:
     0       1  2    3
0  100   tes t  3  2.5
1  100   tes t  2  2.5
2  101   test1  3  3.5
3  101   test1  4  3.5
Could you also please advise how to accomplish something like this? Many thanks. Please let me know in case you need further information.
IIUC, you could use str.strip and then check whether the stripped string is empty.
Then perform the groupby operations, forward-filling the NaNs with ffill and calculating the means with groupby.transform, as shown:
df[1] = df[1].str.strip().dropna().apply(lambda x: np.nan if len(x) == 0 else x)
df[1] = df.groupby(0)[1].fillna(method='ffill')
df[3] = df.groupby(0)[2].transform(lambda x: x.mean())
df
Note: If you must forward fill NaN values with first element of that group, then you must do this:
df.groupby(0)[1].apply(lambda x: x.fillna(x.iloc[0]))
Breakdown of the steps:
Since we want to apply the function only to strings, we drop all the NaN values first; otherwise we would get a TypeError, because the column contains both floats and strings and float has no len method.
df[1].str.strip().dropna()
0    tes t    # operates only on indices where strings are present (empty strings included)
2    test1
3
Name: 1, dtype: object
The reindexing part isn't a necessary step as it only computes on the indices where strings are present.
Also, the reset_index(drop=True) part was indeed unwanted as the groupby object returns a series after fillna which could be assigned back to column 1.
I am using pandas and want to select subsets of data and apply it to other columns.
e.g.
- if there is data in column A, and
- if there is NO data in column B,
- then apply the data in column A to column D.
I have this working fine for now using .isnull() and .notnull().
e.g.
df = pd.DataFrame({'A' : pd.Series(np.random.randn(4)),
                   'B' : pd.Series(np.nan),
                   'C' : pd.Series(['yes','yes','no','maybe'])})
df['D']=''
df
Out[44]:
A B C D
0 0.516752 NaN yes
1 -0.513194 NaN yes
2 0.861617 NaN no
3 -0.026287 NaN maybe
# Now try the first conditional expression
df['D'][df['A'].notnull() & df['B'].isnull()] \
= df['A'][df['A'].notnull() & df['B'].isnull()]
df
Out[46]:
A B C D
0 0.516752 NaN yes 0.516752
1 -0.513194 NaN yes -0.513194
2 0.861617 NaN no 0.861617
3 -0.026287 NaN maybe -0.0262874
When I add a third condition, to also check whether the data in column C matches a particular string, I get the following error:
df['D'][df['A'].notnull() & df['B'].isnull() & df['C']=='yes'] \
= df['A'][df['A'].notnull() & df['B'].isnull() & df['C']=='yes']
File "C:\Anaconda2\Lib\site-packages\pandas\core\ops.py", line 763, in wrapper
res = na_op(values, other)
File "C:\Anaconda2\Lib\site-packages\pandas\core\ops.py", line 718, in na_op
raise TypeError("invalid type comparison")
TypeError: invalid type comparison
I have read that this occurs due to the different datatypes, and I can get it working if I change all the strings in column C to integers or booleans. We also know that the string comparison on its own would work, e.g. df['A'][df['B']=='yes'] runs without error.
So any ideas how/why this is not working when combining these datatypes in this conditional expression? What are the more pythonic ways to do what appears to be quite long-winded?
Thanks
In case this solution doesn't work for anyone, another situation that happened to me: even though I was reading all data in as dtype=str (so any string comparison such as df[col] == "some string" should be OK), I had a column of all nulls, which becomes type float and raises an error when compared to a string.
To get around that, you can use .astype(str) to ensure a string to string comparison will be performed.
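For example, a minimal sketch:

mask = df[col].astype(str) == "some string"   # NaN becomes the string 'nan', so this never raises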
I think you need to add parentheses () around the conditions. It is also better to use loc for selecting the column with a boolean mask, which can be assigned to a variable mask:
mask = (df['A'].notnull()) & (df['B'].isnull()) & (df['C']=='yes')
print (mask)
0 True
1 True
2 False
3 False
dtype: bool
df.loc[mask, 'D'] = df.loc[mask, 'A']
print (df)
A B C D
0 -0.681771 NaN yes -0.681771
1 -0.871787 NaN yes -0.871787
2 -0.805301 NaN no
3 1.264103 NaN maybe
I'm trying to subtract two DataFrames. I would like to treat missing values as 0, but fillna() won't work here because I don't know the common indexes before doing the subtraction:
import pandas as pd
A = pd.DataFrame([1,2], index=['a','b'])
B = pd.DataFrame([3,4], index=['a','c'])
A - B

     0
a -2.0
b  NaN
c  NaN
Ideally, I would like to have:
A - B

    0
a  -2
b   2
c  -4
Is it possible to get that while keeping the code simple?
You can use the subtract method and specify a fill_value of zero:
A.subtract(B, fill_value=0)
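This gives exactly the desired result (as floats, since alignment introduces NaN before fill_value applies):

>>> A.subtract(B, fill_value=0)
     0
a -2.0
b  2.0
c -4.0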
Note: the method below, combineAdd, is deprecated from 0.17.0 onwards.
One way is to use the combineAdd method to add -B to A:
>>> A.combineAdd(-B)
0
a -2
b 2
c -4
With this method, the two DataFrames are added, and at non-matching indices the result defaults to the value present in A or -B.