pandas groupby column and check if group meets multiple conditions - python

I have a DataFrame that looks like the following:
      X     Y        Date  are_equal
0  50.0  10.0  2018-08-19      False
1   NaN  10.0  2018-08-19      False
2   NaN  50.0  2018-08-19       True
3  10.0   NaN  2018-08-21      False
4   1.0   NaN  2018-08-19      False
5   NaN  10.0  2018-08-22      False
6  10.0   NaN  2018-08-21      False
The are_equal column indicates that a value in Y is in X for the same date (in this case 50.0).
I am trying to group by date and find whether X contains a specific value (say 1.0) for a date that contains are_equal True.
My approach was to use df.iterrows() and grab the row at the next index after one meets the condition df['are_equal'] == True. However, the rows aren't necessarily ordered.
How can I group by Date and check if a date contains True in are_equal and 1.0 in column X for that same date?
The output I'm trying to achieve is a new Boolean column that looks like this:
contains_specific_value
0 False
1 False
2 False
3 False
4 True
5 False
6 False

Let us use apply; it can take extra conditions easily but is slow. For a faster alternative, see the transform solution further down.
df['New'] = df.groupby('Date').apply(lambda x: (x['X'] == 1) & x['are_equal'].any()).reset_index(level=0, drop=True)
df
Out[101]:
X Y Date are_equal New
0 50.0 10.0 2018-08-19 False False
1 NaN 10.0 2018-08-19 False False
2 NaN 50.0 2018-08-19 True False
3 10.0 NaN 2018-08-21 False False
4 1.0 NaN 2018-08-19 False True
5 NaN 10.0 2018-08-22 False False
6 10.0 NaN 2018-08-21 False False
Or transform
df['X'].eq(1) & df.groupby('Date')['are_equal'].transform('any')
Out[102]:
0 False
1 False
2 False
3 False
4 True
5 False
6 False
dtype: bool
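
For completeness, here is a minimal, self-contained sketch of the transform approach, with the example data transcribed from the question (the new column name contains_specific_value is taken from the desired output):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'X': [50.0, np.nan, np.nan, 10.0, 1.0, np.nan, 10.0],
    'Y': [10.0, 10.0, 50.0, np.nan, np.nan, 10.0, np.nan],
    'Date': ['2018-08-19', '2018-08-19', '2018-08-19', '2018-08-21',
             '2018-08-19', '2018-08-22', '2018-08-21'],
    'are_equal': [False, False, True, False, False, False, False],
})

# True where X == 1 on a date whose group contains at least one are_equal == True
df['contains_specific_value'] = (
    df['X'].eq(1) & df.groupby('Date')['are_equal'].transform('any')
)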

Related

Python - Pandas compare columns with NaN returns False

I have 2 DataFrames, one is called old and another is called new.
The 2 DataFrames have multiple columns, but I am interested in the column called ADDTEXT. When you open the 2 files in Excel and compare the ADDTEXT columns, they are completely identical.
When I do old == new in Python, it returns False everywhere. When I do new['ADDTEXT'].equals(old['ADDTEXT']), it returns True.
Why don't both return True, since both columns contain only NaN values?
Example output:
>>> new = pd.read_excel('3.8_self_input_data.xlsx')
>>>
>>>
>>> old = pd.read_excel('3.7_self_input_data.xlsx')
>>>
>>> old['ADDTEXT']
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
13630 NaN
13631 NaN
13632 NaN
13633 NaN
13634 NaN
Name: ADDTEXT, Length: 13635, dtype: object
>>>
>>> new['ADDTEXT']
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
13630 NaN
13631 NaN
13632 NaN
13633 NaN
13634 NaN
Name: ADDTEXT, Length: 13635, dtype: object
>>>
>>> new['ADDTEXT'] == old['ADDTEXT']
0 False
1 False
2 False
3 False
4 False
...
13630 False
13631 False
13632 False
13633 False
13634 False
Name: ADDTEXT, Length: 13635, dtype: bool
>>>
>>> new['ADDTEXT'].equals(old['ADDTEXT'])
True
NaN != NaN. Under IEEE 754 rules, NaN compares unequal to everything, including itself, so an elementwise == yields False wherever both sides are NaN; Series.equals(), by contrast, treats NaNs at the same positions as equal.
Instead of relying on .equals(), you can combine an elementwise comparison with isna() on the two columns:
(new['ADDTEXT'].eq(old['ADDTEXT']) | (new['ADDTEXT'].isna() & old['ADDTEXT'].isna()))
Basically that reads: return True for each item if both items are equal or both are NaN.
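
A minimal illustration of the difference, using two made-up two-element Series:

import numpy as np
import pandas as pd

s1 = pd.Series([np.nan, 1.0])
s2 = pd.Series([np.nan, 1.0])

print(np.nan == np.nan)     # False: NaN never compares equal, not even to itself
print((s1 == s2).tolist())  # [False, True]: elementwise == follows the same rule
print(s1.equals(s2))        # True: equals() treats NaNs at matching positions as equal
print((s1.eq(s2) | (s1.isna() & s2.isna())).tolist())  # [True, True]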

Pandas and Numpy consecutive non Nan values

I'm trying to use np.where to flag runs of consecutive non-NaN values that reach a certain length, as shown below:
e.g. if there are 3 or more consecutive non-NaN values, return True for those rows.
Would appreciate any help!
value  consecutive
nan    False
nan    False
1      False
1      False
nan    False
4      True
2      True
3      True
nan    False
nan    False
1      True
3      True
3      True
5      True
The idea is to label each run of values by taking the cumulative sum of the missing-value mask, then use Series.map with Series.value_counts to attach each run's length, counting only the non-NaN rows selected by the inverted mask ~m:
# convert values to numeric
df['value'] = df['value'].astype(float)

m = df['value'].isna()   # True at each NaN
s = m.cumsum()           # run label: increments at every NaN, constant within a run

N = 3
# map each row's label to the count of non-NaN rows sharing it,
# then require at least N and exclude the NaN rows themselves
df['new'] = s.map(s[~m].value_counts()).ge(N) & ~m
print(df)
value consecutive new
0 NaN False False
1 NaN False False
2 1.0 False False
3 1.0 False False
4 NaN False False
5 4.0 True True
6 2.0 True True
7 3.0 True True
8 NaN False False
9 NaN False False
10 1.0 True True
11 3.0 True True
12 3.0 True True
13 5.0 True True
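
To make the mechanics visible, a short sketch printing the intermediate values for the same df:

m = df['value'].isna()
s = m.cumsum()      # run label: bumps at every NaN, so each non-NaN run shares one label
print(s.tolist())   # [1, 2, 2, 2, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5]

# non-NaN rows per label: label 2 -> 2 rows, label 3 -> 3, label 5 -> 4;
# mapping these counts back and testing .ge(3) flags exactly the long runs
print(s[~m].value_counts().sort_index())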

Sum within column based on values from another column

I have a pandas dataframe like below for the columns value_to_sum and indicator. I'd like to sum all values within value_to_sum up to and including the most recent value within that column where indicator == True. If indicator == False, I do not want to sum.
row  value_to_sum  indicator  desired_outcome
1    1             True       NaN
2    3             True       1
3    1             False      NaN
4    2             False      NaN
5    4             False      NaN
6    6             True       10
7    2             True       6
8    3             False      NaN
How can I achieve the values under desired_outcome?
You can set a group based on the .cumsum() of True values of column indicator and then use .groupby() together with .transform() to get the sum of value_to_sum of each group.
Then, for indicator == True, since the desired outcome covers values up to and including the previous row, we take the group total from the previous row with .shift(). At the same time, for indicator == False, we set desired_outcome to NaN. Both of these steps happen in a single call to np.where().
df['desired_outcome'] = df.assign(group=df['indicator'].cumsum()).groupby('group')['value_to_sum'].transform('sum')
df['desired_outcome'] = np.where(df['indicator'], df['desired_outcome'].shift(), np.nan)
Result:
print(df)
row value_to_sum indicator desired_outcome
0 1 1 True NaN
1 2 3 True 1.0
2 3 1 False NaN
3 4 2 False NaN
4 5 4 False NaN
5 6 6 True 10.0
6 7 2 True 6.0
7 8 3 False NaN
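
For reference, a self-contained version of the same two steps, with the data transcribed from the table above:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'row': [1, 2, 3, 4, 5, 6, 7, 8],
    'value_to_sum': [1, 3, 1, 2, 4, 6, 2, 3],
    'indicator': [True, True, False, False, False, True, True, False],
})

# each True starts a new block; transform('sum') totals value_to_sum per block
block_total = df.groupby(df['indicator'].cumsum())['value_to_sum'].transform('sum')

# on True rows report the previous block's total, NaN elsewhere
df['desired_outcome'] = np.where(df['indicator'], block_total.shift(), np.nan)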

python split dataframe subset with from to condition

I have a dataframe and I have to split it into subsets with these conditions:
start split: c = 1
end split: c = -1
example:
a      b      c
False  False
False  False  -1
False  False
True   False   1    <- start of first subset
False  False
False  False   1
False  False   1
False  False
False  False   1
False  False
False  False   1
False  False
False  True   -1    <- end of first subset
False  False
False  False
False  True   -1
False  False
False  False
True   False   1    <- start of second subset
False  False  -1    <- end of second subset
This could be a solution, although I'm not sure it's the most efficient approach; it basically uses cumsum plus some and/or logic.
import pandas as pd
import numpy as np
df = pd.DataFrame({'c': [np.nan, 1, np.nan, 1, np.nan, np.nan,
                         -1, np.nan, np.nan, 1, np.nan, np.nan,
                         1, 1, np.nan, -1, -1, 1, -1]})
c
0 NaN
1 1.0
2 NaN
3 1.0
4 NaN
5 NaN
6 -1.0
7 NaN
8 NaN
9 1.0
10 NaN
11 NaN
12 1.0
13 1.0
14 NaN
15 -1.0
16 -1.0
17 1.0
18 -1.0
(
    df
    .assign(
        # mark 1/-1 rows that differ from the previous 1/-1 row (ignores repeated signals)
        start_end=lambda df: df.index.isin(
            df
            .loc[lambda df: df.c.isin([1, -1])]
            .loc[lambda df: df.c.shift(1, fill_value=0) != df.c]
            .index),
        start=lambda df: np.where(np.logical_and(df.start_end, df.c == 1), 1, 0),
        end=lambda df: np.where(np.logical_and(df.start_end, df.c == -1), 1, 0),
        # label rows between a start and its matching end; 0 outside any subset
        subset=lambda df: np.where(df.start.cumsum() != df.end.shift(1, fill_value=0).cumsum(),
                                   df.start.cumsum(),
                                   0)
    )
    .drop(columns=['start_end', 'start', 'end'])
)
c subset
0 NaN 0
1 1.0 1
2 NaN 1
3 1.0 1
4 NaN 1
5 NaN 1
6 -1.0 1
7 NaN 0
8 NaN 0
9 1.0 2
10 NaN 2
11 NaN 2
12 1.0 2
13 1.0 2
14 NaN 2
15 -1.0 2
16 -1.0 0
17 1.0 3
18 -1.0 3
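
With the labels in place, the actual split into subsets can be done with groupby; a sketch, assuming the labeled result of the chain above was captured in a variable out (the name is mine):

# out holds the labeled frame produced by the chain above (hypothetical name)
subsets = {label: grp.drop(columns='subset')
           for label, grp in out[out['subset'] > 0].groupby('subset')}

for label, grp in subsets.items():
    print(f'subset {label}: rows {grp.index.min()}..{grp.index.max()}')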

Replacing values in a 2nd level column on MultiIndex df in Pandas

I was looking into this post, which almost solved my problem. However, in my case I want to work based on the 2nd level of the df while trying not to specify my 1st-level column names explicitly.
Borrowing the original dataframe:
df = pd.DataFrame({('A','a'): [-1,-1,0,10,12],
('A','b'): [0,1,2,3,-1],
('B','a'): [-20,-10,0,10,20],
('B','b'): [-200,-100,0,100,200]})
##df
A B
a b a b
0 -1 0 -20 -200
1 -1 1 -10 -100
2 0 2 0 0
3 10 3 10 100
4 12 -1 20 200
I want to assign NA to columns a and b wherever b < 0. I was selecting the rows with df.xs('b', axis=1, level=1) < 0, but then I cannot actually perform the replacement. Moreover, my 1st-level names vary, so the indexing cannot rely on A and B explicitly; possibly it could go through df.columns.values?
The desired output would be
##df
A B
a b a b
0 -1 0 NA NA
1 -1 1 NA NA
2 0 2 0 0
3 10 3 10 100
4 NA NA 20 200
I appreciate all tips, thank you in advance.
You can use DataFrame.mask with the boolean mask reindexed (with level=0) so that it has the same index and column structure as the original DataFrame:
mask = df.xs('b',axis=1,level=1) < 0
print (mask)
A B
0 False True
1 False True
2 False False
3 False False
4 True False
print (mask.reindex(columns = df.columns, level=0))
A B
a b a b
0 False False True True
1 False False True True
2 False False False False
3 False False False False
4 True True False False
df = df.mask(mask.reindex(columns = df.columns, level=0))
print (df)
A B
a b a b
0 -1.0 0.0 NaN NaN
1 -1.0 1.0 NaN NaN
2 0.0 2.0 0.0 0.0
3 10.0 3.0 10.0 100.0
4 NaN NaN 20.0 200.0
Edit by OP: I had asked in the comments how to combine multiple conditions (e.g. df.xs('b', axis=1, level=1) < 0 OR df.xs('b', axis=1, level=1) being NA). @jezrael kindly indicated that for this I should use
mask = (df.xs('b', axis=1, level=1) < 0) | df.xs('b', axis=1, level=1).isnull()
Note the parentheses around the comparison: | binds more tightly than < in Python, so the unparenthesized form would not evaluate as intended.
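
A quick sketch applying the combined mask; the NaN in ('A','b') is injected by me to exercise the isna branch:

import numpy as np
import pandas as pd

df = pd.DataFrame({('A', 'a'): [-1, -1, 0, 10, 12],
                   ('A', 'b'): [0, 1, 2, 3, np.nan],   # NaN added for this example
                   ('B', 'a'): [-20, -10, 0, 10, 20],
                   ('B', 'b'): [-200, -100, 0, 100, 200]})

b_vals = df.xs('b', axis=1, level=1)
mask = (b_vals < 0) | b_vals.isna()   # parenthesize the comparison before combining with |
df = df.mask(mask.reindex(columns=df.columns, level=0))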
