I have a DataFrame that looks like the following:
      X     Y        Date  are_equal
0  50.0  10.0  2018-08-19      False
1   NaN  10.0  2018-08-19      False
2   NaN  50.0  2018-08-19       True
3  10.0   NaN  2018-08-21      False
4   1.0   NaN  2018-08-19      False
5   NaN  10.0  2018-08-22      False
6  10.0   NaN  2018-08-21      False
The are_equal column indicates that a value in Y is in X for the same date (in this case 50.0).
I am trying to group by date and find whether X contains a specific value (say 1.0) for any date whose group has are_equal == True.
My approach was to use df.iterrows() and grab the row at the next index once the condition df['are_equal'] == True is met. However, the rows aren't necessarily ordered.
How can I group by Date and check if a date contains True in are_equal and 1.0 in column X for that same date?
The output I'm trying to achieve is a new Boolean column that looks like this:
contains_specific_value
0 False
1 False
2 False
3 False
4 True
5 False
6 False
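For reference, a minimal sketch that rebuilds the sample frame used in both solutions below (keeping Date as plain strings is an assumption; the real data may hold datetimes):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'X': [50.0, np.nan, np.nan, 10.0, 1.0, np.nan, 10.0],
    'Y': [10.0, 10.0, 50.0, np.nan, np.nan, 10.0, np.nan],
    'Date': ['2018-08-19', '2018-08-19', '2018-08-19', '2018-08-21',
             '2018-08-19', '2018-08-22', '2018-08-21'],
    'are_equal': [False, False, True, False, False, False, False],
})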
Let's use apply; it makes it easy to add more conditions, but it is slower. You can also check the other solution using transform below.
df['New'] = df.groupby('Date').apply(lambda x: (x['X'] == 1) & x['are_equal'].any()).reset_index(level=0, drop=True)
df
Out[101]:
      X     Y        Date  are_equal    New
0  50.0  10.0  2018-08-19      False  False
1   NaN  10.0  2018-08-19      False  False
2   NaN  50.0  2018-08-19       True  False
3  10.0   NaN  2018-08-21      False  False
4   1.0   NaN  2018-08-19      False   True
5   NaN  10.0  2018-08-22      False  False
6  10.0   NaN  2018-08-21      False  False
Or transform
df['X'].eq(1) & df.groupby('Date')['are_equal'].transform('any')
Out[102]:
0 False
1 False
2 False
3 False
4 True
5 False
6 False
dtype: bool
I have 2 DataFrames, one is called old and another is called new.
The 2 DataFrames have multiple columns, but I am only interested in the column called ADDTEXT. When you open the 2 files in Excel and compare the ADDTEXT columns, they are completely identical.
When I do old == new in Python, it returns False. When I do new['ADDTEXT'].equals(old['ADDTEXT']) it returns True.
Why don't they both return True, since both columns contain only NaN values?
Example output:
>>> new = pd.read_excel('3.8_self_input_data.xlsx')
>>>
>>>
>>> old = pd.read_excel('3.7_self_input_data.xlsx')
>>>
>>> old['ADDTEXT']
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
13630 NaN
13631 NaN
13632 NaN
13633 NaN
13634 NaN
Name: ADDTEXT, Length: 13635, dtype: object
>>>
>>> new['ADDTEXT']
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
13630 NaN
13631 NaN
13632 NaN
13633 NaN
13634 NaN
Name: ADDTEXT, Length: 13635, dtype: object
>>>
>>> new['ADDTEXT'] == old['ADDTEXT']
0 False
1 False
2 False
3 False
4 False
...
13630 False
13631 False
13632 False
13633 False
13634 False
Name: ADDTEXT, Length: 13635, dtype: bool
>>>
>>> new['ADDTEXT'].equals(old['ADDTEXT'])
True
NaN != NaN. Element-wise comparison with == returns False wherever both values are NaN, while .equals() treats NaNs in the same positions as equal, which is why the two checks disagree.
Instead of just using .equals(), you can use isna() on the two columns:
(new['ADDTEXT'].eq(old['ADDTEXT']) | (new['ADDTEXT'].isna() & old['ADDTEXT'].isna()))
Basically that reads: return True for each item if both items are equal or both are NaN.
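A quick sketch illustrating the behaviour on a tiny made-up pair of Series (not the actual files):
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 'x'])
b = pd.Series([np.nan, 'x'])

print(a == b)                              # [False, True]: NaN == NaN is False element-wise
print(a.equals(b))                         # True: .equals() treats aligned NaNs as equal
print(a.eq(b) | (a.isna() & b.isna()))     # [True, True]: the combined check from above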
I'm trying to use np.where to flag runs of consecutive non-NaN values longer than a certain length, as shown below:
e.g. if there are 3 or more consecutive non-NaN values, then return True.
Would appreciate any help!
value  consecutive
nan    False
nan    False
1      False
1      False
nan    False
4      True
2      True
3      True
nan    False
nan    False
1      True
3      True
3      True
5      True
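For completeness, a minimal sketch building the example column from the table above (treating the missing entries as np.nan is an assumption; the astype(float) in the answer below then becomes a no-op):
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [np.nan, np.nan, 1, 1, np.nan, 4, 2, 3,
                             np.nan, np.nan, 1, 3, 3, 5]})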
The idea is to create group identifiers by taking the cumulative sum of the missing-value mask, then map each group id to the length of its run using Series.map with Series.value_counts computed only on the non-NaN rows (filtered by the inverted mask ~m):
#convert values to numeric
df['value'] = df['value'].astype(float)
m = df['value'].isna()
s = m.cumsum()
N = 3
df['new'] = s.map(s[~m].value_counts()).ge(N) & ~m
print (df)
value consecutive new
0 NaN False False
1 NaN False False
2 1.0 False False
3 1.0 False False
4 NaN False False
5 4.0 True True
6 2.0 True True
7 3.0 True True
8 NaN False False
9 NaN False False
10 1.0 True True
11 3.0 True True
12 3.0 True True
13 5.0 True True
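If it helps to see what the helper Series hold, a quick sketch printing the intermediates used above:
print(m)                             # True where 'value' is NaN
print(s)                             # cumulative NaN count; constant within each non-NaN run
print(s[~m].value_counts())          # length of every non-NaN run, keyed by its group id
print(s.map(s[~m].value_counts()))   # that run length broadcast back onto every row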
I have a pandas dataframe like below for the columns value_to_sum and indicator. I'd like to sum all values within value_to_sum up to and including the most recent value within that column where indicator == True. If indicator == False, I do not want to sum.
row  value_to_sum  indicator  desired_outcome
1    1             True       NaN
2    3             True       1
3    1             False      NaN
4    2             False      NaN
5    4             False      NaN
6    6             True       10
7    2             True       6
8    3             False      NaN
How can I achieve the values under desired_outcome?
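For reproducibility, a minimal sketch building the frame above (column names taken from the table; treating row as an ordinary integer column is an assumption):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'row': [1, 2, 3, 4, 5, 6, 7, 8],
    'value_to_sum': [1, 3, 1, 2, 4, 6, 2, 3],
    'indicator': [True, True, False, False, False, True, True, False],
})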
You can set a group based on the .cumsum() of True values of column indicator and then use .groupby() together with .transform() to get the sum of value_to_sum of each group.
Then, for indicator == True, since the desired outcome is the sum up to the previous row, we take the value of desired_outcome from the previous row using .shift(). At the same time, for indicator == False, we set desired_outcome to NaN. These last two steps are done together with a single call to np.where().
df['desired_outcome'] = df.assign(group=df['indicator'].cumsum()).groupby('group')['value_to_sum'].transform('sum')
df['desired_outcome'] = np.where(df['indicator'], df['desired_outcome'].shift(), np.nan)
Result:
print(df)
row value_to_sum indicator desired_outcome
0 1 1 True NaN
1 2 3 True 1.0
2 3 1 False NaN
3 4 2 False NaN
4 5 4 False NaN
5 6 6 True 10.0
6 7 2 True 6.0
7 8 3 False NaN
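To see what happens before the shift, a quick sketch of the intermediate per-group sums (group_sum is a hypothetical helper name):
group_sum = (df.assign(group=df['indicator'].cumsum())
               .groupby('group')['value_to_sum'].transform('sum'))
print(group_sum)   # 1, 10, 10, 10, 10, 6, 5, 5 -- each row gets the total of its group
# shifting down one row and keeping only indicator == True rows gives desired_outcome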
I have a dataframe that I have to split into subsets with these conditions:
start split: c = 1
end split: c = -1
example:
a      b      c
False  False
False  False  -1
False  False
True   False   1    start of first subset
False  False
False  False   1
False  False   1
False  False
False  False   1
False  False
False  False   1
False  False
False  True   -1    end of first subset
False  False
False  False
False  True   -1
False  False
False  False
True   False   1    start of second subset
False  False  -1    end of second subset
This could be a solution, although I'm not sure if it's the most efficient approach. This basically uses cumsum and some and/or logic.
import pandas as pd
import numpy as np
df = pd.DataFrame({'c': [np.nan, 1, np.nan, 1, np.nan, np.nan,
-1, np.nan, np.nan, 1, np.nan, np.nan,
1, 1, np.nan, -1, -1, 1, -1]})
c
0 NaN
1 1.0
2 NaN
3 1.0
4 NaN
5 NaN
6 -1.0
7 NaN
8 NaN
9 1.0
10 NaN
11 NaN
12 1.0
13 1.0
14 NaN
15 -1.0
16 -1.0
17 1.0
18 -1.0
(
    df
    .assign(
        # mark only the first 1/-1 in every run of identical consecutive signals
        start_end=lambda df: df.index.isin(
            df
            .loc[lambda df: df.c.isin([1, -1])]
            .loc[lambda df: df.c.shift(1, fill_value=0) != df.c]
            .index),
        # flag subset starts (c == 1) and subset ends (c == -1) among those rows
        start=lambda df: np.where(np.logical_and(df.start_end == True, df.c == 1), 1, 0),
        end=lambda df: np.where(np.logical_and(df.start_end == True, df.c == -1), 1, 0),
        # number the subsets; rows outside an open start/end pair get 0,
        # while the closing -1 row still belongs to its subset
        subset=lambda df: np.where(df.start.cumsum() != df.end.shift(1, fill_value=0).cumsum(),
                                   df.start.cumsum(),
                                   0)
    )
    .drop(columns=['start_end', 'start', 'end'])
)
c subset
0 NaN 0
1 1.0 1
2 NaN 1
3 1.0 1
4 NaN 1
5 NaN 1
6 -1.0 1
7 NaN 0
8 NaN 0
9 1.0 2
10 NaN 2
11 NaN 2
12 1.0 2
13 1.0 2
14 NaN 2
15 -1.0 2
16 -1.0 0
17 1.0 3
18 -1.0 3
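If the goal is to actually split into separate sub-DataFrames, one possible follow-up is to group on the new column and drop group 0 (rows outside any subset); a sketch, assuming the labelled result above was assigned to a variable named out:
subsets = [g for key, g in out.groupby('subset') if key != 0]
for sub in subsets:
    print(sub, end='\n\n')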
I was looking into this post, which almost solved my problem. However, in my case I want to work based on the 2nd level of the df, while trying not to specify my 1st-level column names explicitly.
Borrowing the original dataframe:
df = pd.DataFrame({('A','a'): [-1,-1,0,10,12],
('A','b'): [0,1,2,3,-1],
('B','a'): [-20,-10,0,10,20],
('B','b'): [-200,-100,0,100,200]})
##df
A B
a b a b
0 -1 0 -20 -200
1 -1 1 -10 -100
2 0 2 0 0
3 10 3 10 100
4 12 -1 20 200
I want to assign NA to columns a and b wherever b < 0. I was selecting them with df.xs('b',axis=1,level=1) < 0, but then I cannot actually perform the replacement. Moreover, my 1st-level names vary, so the indexing cannot refer to A and B explicitly; perhaps it can go through df.columns.values?
The desired output would be
##df
A B
a b a b
0 -1 0 NA NA
1 -1 1 NA NA
2 0 2 0 0
3 10 3 10 100
4 NA NA 20 200
I appreciate all tips, thank you in advance.
You can use DataFrame.mask with the mask reindexed (via reindex) to the same index and column names as the original DataFrame:
mask = df.xs('b',axis=1,level=1) < 0
print (mask)
A B
0 False True
1 False True
2 False False
3 False False
4 True False
print (mask.reindex(columns = df.columns, level=0))
A B
a b a b
0 False False True True
1 False False True True
2 False False False False
3 False False False False
4 True True False False
df = df.mask(mask.reindex(columns = df.columns, level=0))
print (df)
A B
a b a b
0 -1.0 0.0 NaN NaN
1 -1.0 1.0 NaN NaN
2 0.0 2.0 0.0 0.0
3 10.0 3.0 10.0 100.0
4 NaN NaN 20.0 200.0
Edit by OP: I had asked in the comments how to handle multiple conditions (e.g. df.xs('b',axis=1,level=1) < 0 OR df.xs('b',axis=1,level=1) being NA). @jezrael kindly indicated that for this I should use (note the parentheses around the comparison, since | binds tighter than <):
mask = (df.xs('b',axis=1,level=1) < 0) | df.xs('b',axis=1,level=1).isnull()
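Applying the combined mask then works exactly as in the single-condition case; a minimal sketch on the same df:
df = df.mask(mask.reindex(columns=df.columns, level=0))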