How would I be able to write a rolling condition apply to a column in pandas?
import pandas as pd
import numpy as np
lst = np.random.random_integers(low = -10, high = 10, size = 10)
lst2 = np.random.random_integers(low = -10, high = 10, size = 10)
#lst = [ -2 10 -10 -6 4 2 -5 4 9 3]
#lst2 = [-7 5 6 -4 7 1 -4 -6 -1 -4]
df = pandas.DataFrame({'a' : lst, 'b' : lst2})
Given a dataframe, namely 'df', I want to create a column 'C' such that it will display True if the element in a > 0 and b > 0 or False if a < 0 and b < 0.
For the rows that do not meet this condition, I want to roll the entry in the previous row to the current one (i.e. if the previous row has value 'True' but does not meet the specified conditions, it should have value 'True'.)
How can I do this?
Follow-up Question : How would I do this for conditions a > 1 and b > 1 returns True or a < -1 and b < -1 returns False?
I prefer to do this with a little mathemagic on the signs.
i = np.sign(df.a)
j = np.sign(df.b)
i = i.mask(i != j).ffill()
i >= 0
# for your `lst` and `lst2` input
0 False
1 True
2 True
3 False
4 True
5 True
6 False
7 False
8 False
9 False
Name: a, dtype: bool
As long as you don't have to worry about integer overflows, this works just fine.
i = np.sign(df.a)
j = np.sign(df.b)
i.mask(i != j).ffill().ge(0)
Related
I am trying to figure out how I can mark the rows where the price are part of 4 increase prices .
the "is_consecutive" is actually the mark .
I managed to do the diff between the rows :
df['diff1'] = df['Close'].diff()
But I didn't managed to find out which row is a part of 4 increase prices .
I had a thought to use df.rolling() .
The exmple df,
On rows 0-3 , we need to get an output of 'True' on the ["is_consecutive"] column , because the ['diff1'] on this consecutive rows is increase for 4 rows .
On rows 8-11 , we need to get an output of 'False' on the ["is_consecutive"] column , because the ['diff1'] on this consecutive rows is zero .
Date Price diff1 is_consecutive
0 1/22/20 0 0 True
1 1/23/20 130 130 True
2 1/24/20 144 14 True
3 1/25/20 150 6 True
4 1/27/20 60 -90 False
5 1/28/20 95 35 False
6 1/29/20 100 5 False
7 1/30/20 50 -50 False
8 2/01/20 100 0 False
9 1/02/20 100 0 False
10 1/03/20 100 0 False
11 1/04/20 100 0 False
12 1/05/20 50 -50 False
general example :
if
price = [30,55,60,65,25]
the different form the consecutive number on the list will be :
diff1 = [0,25,5,5,-40]
So when the diff1 is plus its actually means the consecutive prices are increase .
I need to mark(in the df) the rows that have 4 consecutive that go up.
Thank You for help (-:
Try: .rolling with window of size 4 and min periods 1:
df["is_consecutive"] = (
df["Price"]
.rolling(4, min_periods=1)
.apply(lambda x: (x.diff().fillna(0) >= 0).all())
.astype(bool)
)
print(df)
Prints:
Date Price is_consecutive
0 1/22/20 0 True
1 1/23/20 130 True
2 1/24/20 144 True
3 1/25/20 150 True
4 1/26/20 60 False
5 1/26/20 95 False
6 1/26/20 100 False
7 1/26/20 50 False
Assuming the dataframe is sorted. One way is based on the cumsum of the differences to identify the first time an upward Price move succeeding a 3 days upwards trend (i.e. 4 days of upward trend).
quant1 = (df['Price'].diff().apply(np.sign) == 1).cumsum()
quant2 = (df['Price'].diff().apply(np.sign) == 1).cumsum().where(~(df['Price'].diff().apply(np.sign) == 1)).ffill().fillna(0).astype(int)
df['is_consecutive'] = (quant1-quant2) >= 3
note that the above takes into account only strictly increasing Prices (not equal).
Then we override also the is_consecutive tag for the previous 3 Prices to be also TRUE using the win_view self defined function:
def win_view(x, size):
if isinstance(x, list):
x = np.array(x)
if isinstance(x, pd.core.series.Series):
x = x.values
if isinstance(x, np.ndarray):
pass
else:
raise Exception('wrong type')
return np.lib.stride_tricks.as_strided(
x,
shape=(x.size - size + 1, size),
strides=(x.strides[0], x.strides[0])
)
arr = win_view(df['is_consecutive'], 4)
arr[arr[:,3]] = True
Note that we inplace replace the values to be True.
EDIT 1
Inspired by the self defined win_view function, I realized that the solution it can be obtained simply by win_view (without the need of using cumsums) as below:
df['is_consecutive'] = False
arr = win_view(df['Price'].diff(), 4)
arr_ind = win_view(list(df['Price'].index), 4)
mask = arr_ind[np.all(arr[:, 1:] > 0, axis=1)].flatten()
df.loc[mask, 'is_consecutive'] = True
We maintain 2 arrays, 1 for the returns and 1 for the indices. We collect the indices where we have 3 consecutive positive return np.all(arr[:, 1:] > 0, axis=1 (i.e. 4 upmoving prices) and we replace those in our original df.
The function will return columns named "consecutive_up" which represents all rows that are part of the 5 increase series and "consecutive_down" which represents all rows that are part of the 4 decrees series.
def c_func(temp_df):
temp_df['increase'] = temp_df['Price'] > temp_df['Price'].shift()
temp_df['decrease'] = temp_df['Price'] < temp_df['Price'].shift()
temp_df['consecutive_up'] = False
temp_df['consecutive_down'] = False
for ind, row in temp_df.iterrows():
if row['increase'] == True:
count += 1
else:
count = 0
if count == 5:
temp_df.iloc[ind - 5:ind + 1, 4] = True
elif count > 5:
temp_df.iloc[ind, 4] = True
for ind, row in temp_df.iterrows():
if row['decrease'] == True:
count += 1
else:
count = 0
if count == 4:
temp_df.iloc[ind - 4:ind + 1, 5] = True
elif count > 4:
temp_df.iloc[ind, 5] = True
return temp_df
I have a dataframe that looks like this:
1 2 3 4 5 6 7 8 9
0 2 1 -1 -2 -3 -2 -1 0 1
I need to return the column where the value becomes negative and where it becomes positive again.
It is similar to this question but inverted columns and rows:
Pandas: select the first value which is not negative anymore, return the row
Any ideas on simple way to do this?
Expected output would return the column numbers where it changed like so:
neg = 3
pos = 8
pandas + numpy solution.
Find the sign changes np.sign(df1).diff().ne(0)
df = df.replace(0,np.inf)
df1 = df.T
t = df1[np.sign(df1).diff().ne(0)[1:]]
s = (t>0)
pos = [*filter(s[0].get, s.index)]
s = (t<0)
neg = [*filter(s[0].get, s.index)]
pos:
['8']
neg:
['3']
For each row:
def get_pos_neg(row):
t = row[np.sign(row).diff().ne(0)]#[1:]
print(t)
s = (t>0)
pos = [*filter(s.get, s.index)]
s = (t<0)
neg = [*filter(s.get, s.index)]
return pos,neg
df = df.replace(0,np.inf)
df1 = df.T
df1.apply(get_pos_neg,0)
I'm trying to ascertain how I can create a column that indicates in advance (X rows) when the next occurrence of a value in another column will occur with pandas that in essence performs the following functionality (In this instance X = 3):
df
rowid event indicator
1 True 1 # Event occurs
2 False 0
3 False 0
4 False 1 # Starts indicator
5 False 1
6 True 1 # Event occurs
7 False 0
Apart from doing a iterative/recursive loop through every row:
i = df.index[df['event']==True]
dfx = [df.index[z-X:z] for z in i]
df['indicator'][dfx]=1
df['indicator'].fillna(0)
However this seems inefficient, is there a more succinct method of achieving the aforementioned example? Thanks
Here's a NumPy based approach using flatnonzero:
X = 3
# ndarray of indices where indicator should be set to one
nd_ixs = np.flatnonzero(df.event)[:,None] - np.arange(X-1, -1, -1)
# flatten the indices
ixs = nd_ixs.ravel()
# filter out negative indices an set to 1
df['indicator'] = 0
df.loc[ixs[ixs>=0], 'indicator'] = 1
print(df)
rowid event indicator
0 1 True 1
1 2 False 0
2 3 False 0
3 4 False 1
4 5 False 1
5 6 True 1
6 7 False 0
Where nd_ixs is obtained through the broadcasted subtraction of the indices where event is True and an arange up to X:
print(nd_ixs)
array([[-2, -1, 0],
[ 3, 4, 5]], dtype=int64)
A pandas and numpy solution:
# Make a variable shift:
def var_shift(series, X):
return [series] + [series.shift(i) for i in range(-X + 1, 0, 1)]
X = 3
# Set indicator to default to 1
df["indicator"] = 1
# Use pd.Series.where and np.logical_or with the
# var_shift function to get a bool array, setting
# 0 when False
df["indicator"] = df["indicator"].where(
np.logical_or.reduce(var_shift(df["event"], X)),
0,
)
# rowid event indicator
# 0 1 True 1
# 1 2 False 0
# 2 3 False 0
# 3 4 False 1
# 4 5 False 1
# 5 6 True 1
# 6 7 False 0
In [77]: np.logical_or.reduce(var_shift(df["event"], 3))
Out[77]: array([True, False, False, True, True, True, nan], dtype=object)
I have created a new column by comparing two boolean columns. If both are positive, I assign a 1, otherwise a 0. This is my code below, but is there a way to be more pythonic? I tried list comprehension but failed.
lst = []
for i,k in zip(df['new_customer'],df['y']):
if i == 1 & k == 1:
lst.append(1)
else:
lst.append(0)
df['new_customer_subscription'] = lst
Use np.sign:
m = np.sign(df[['new_customer', 'y']]) >= 0
df['new_customer_subscription'] = m.all(axis=1).astype(int)
If you want to consider only positive non-zero values, change >= 0 to > 0 (since np.sign(0) is 0).
# Sample DataFrame.
df = pd.DataFrame(np.random.randn(5, 2), columns=['A', 'B'])
df
A B
0 0.511684 -0.512633
1 -1.254813 -1.721734
2 0.751830 0.285449
3 -0.934877 1.407998
4 -1.686066 -0.947015
# Get the sign of the numbers.
m = np.sign(df[['A', 'B']]) >= 0
m
A B
0 True False
1 False False
2 True True
3 False True
4 False False
# Find all rows where both columns are `True`.
m.all(axis=1).astype(int)
0 0
1 0
2 1
3 0
4 0
dtype: int64
Another solution if you have to deal with only two columns would be:
df['new_customer_subscription'] = (
df['new_customer'].gt(0) & df['y'].gt(0)).astype(int)
To generalise to multiple columns, use logical_and.reduce:
df['new_customer_subscription'] = np.logical_and.reduce(
df[['new_customer', 'y']] > 0, axis=1).astype(int)
Or,
df['new_customer_subscription'] = (df[['new_customer', 'y']] > 0).all(1).astype(int)
Another way to do this is using the np.where from the numpys module:
df['Indicator'] = np.where((df.A > 0) & (df.B > 0), 1, 0)
Output
A B Indicator
0 -0.464992 0.418243 0
1 -0.902320 0.496530 0
2 0.219111 1.052536 1
3 -1.377076 0.207964 0
4 1.051078 2.041550 1
The np.where method works like this:
np.where(condition, true value, false value)
Given this DataFrame:
df = pandas.DataFrame({"a": [1,10,20,3,10], "b": [50,60,55,0,0], "c": [1,30,1,0,0]})
What is the best way to make a new column, "filter" that has value "pass" if the values at columns a and b are both greater than x and value "fail" otherwise?
It can be done by iterating through rows but it's inefficient and inelegant:
c = []
for x, v in df.iterrows():
if v["a"] >= 20 and v["b"] >= 20:
c.append("pass")
else:
c.append("fail")
df["filter"] = c
One way would be to create a column of boolean values like this:
>>> df['filter'] = (df['a'] >= 20) & (df['b'] >= 20)
a b c filter
0 1 50 1 False
1 10 60 30 False
2 20 55 1 True
3 3 0 0 False
4 10 0 0 False
You can then change the boolean values to 'pass' or 'fail' using replace:
>>> df['filter'].astype(object).replace({False: 'fail', True: 'pass'})
0 fail
1 fail
2 pass
3 fail
4 fail
You can extend this to more columns using all. For example, to find rows across the columns with entries greater than 0:
>>> cols = ['a', 'b', 'c'] # a list of columns to test
>>> df[cols] > 0
a b c
0 True True True
1 True True True
2 True True True
3 True False False
4 True False False
Using all across axis 1 of this DataFrame creates the new column:
>>> (df[cols] > 0).all(axis=1)
0 True
1 True
2 True
3 False
4 False
dtype: bool