Calculating probability of consecutive events with python pandas - python

Given a dataframe, how do I calculate the probability of consecutive events using python pandas?
For example,
Time   A   B   C
   1   1   1   1
   2  -1  -1  -1
   3   1   1   1
   4  -1  -1  -1
   5   1   1   1
   6  -1  -1  -1
   7   1   1   1
   8  -1   1   1
   9   1  -1   1
  10  -1   1  -1
In this dataframe, B has two consecutive "1" at t=7 and t=8, and C has three consecutive "1" from t=7 to t=9.
The probability that two consecutive "1" appear is 3/27 (3 matching pairs out of 9 possible pairs in each of the 3 columns).
The probability that three consecutive "1" appear is 1/24 (1 matching triple out of 8 possible triples in each of the 3 columns).
How can I do this using python pandas?

Try this code (it also works for other dataframes, i.e. with more columns or rows):
import pandas as pd

df = pd.DataFrame({
    'Time': list(range(1, 11)),
    'A': [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],
    'B': [1, -1, 1, -1, 1, -1, 1, 1, -1, 1],
    'C': [1, -1, 1, -1, 1, -1, 1, 1, 1, -1]
})
print(df)

def consecutive(df, num):
    row_num = df.shape[0]
    col_num = df.shape[1]
    cnt = 0  # number of runs of `num` consecutive 1s found
    for col_index in range(1, col_num):  # skip column 0 ('Time')
        col_tmp = df.iloc[:, col_index]
        consec = 0
        for i in range(row_num):
            if col_tmp.iloc[i] == 1:
                consec += 1
            else:
                # a -1 breaks the run
                consec = 0
            if consec == num:
                cnt += 1
                # step back by one so the next 1 counts as a new, overlapping run
                consec -= 1
    all_cases = (row_num - num + 1) * (col_num - 1)  # col_num - 1 because of the 'Time' column
    prob = cnt / all_cases
    return prob
When you run it on the given dataframe with
print(f'two consecutive : {consecutive(df, 2)}')
print(f'three consecutive : {consecutive(df, 3)}')
the output is:
   Time   A   B   C
0     1   1   1   1
1     2  -1  -1  -1
2     3   1   1   1
3     4  -1  -1  -1
4     5   1   1   1
5     6  -1  -1  -1
6     7   1   1   1
7     8  -1   1   1
8     9   1  -1   1
9    10  -1   1  -1
two consecutive : 0.1111111111111111
three consecutive : 0.041666666666666664

You can compare rows with previous rows using shift. So, to find out how often two consecutive values are equal, you can do
>>> (df.C == df.C.shift()).sum()
2
To find three consecutive equal values, you'd have to compare the column with itself shifted by 1 (the default) and additionally, shifted by 2.
>>> ((df.C == df.C.shift()) & (df.C == df.C.shift(2))).sum()
1
Another variation of this using the pd.Series.eq function instead of the == is:
>>> m = df.C.eq(df.C.shift(1)) & df.C.eq(df.C.shift(2))
>>> m.sum()
1
In this case, since the target value is 1 and True == 1 evaluates to True, the pattern can be generalized with functools.reduce (as written, it won't work for other target values; see below):
from functools import reduce
def combos(column, n):
    return reduce(pd.Series.eq, [column.shift(i) for i in range(n)])
You can apply this function to df like so, which will give you the numerator:
>>> df[['A', 'B', 'C']].apply(combos, n = 2).values.sum()
3
>>> df[['A', 'B', 'C']].apply(combos, n = 3).values.sum()
1
To get the denominator, you can do, e.g.,
n = 2
rows, cols = df[['A', 'B', 'C']].shape
denominator = (rows - n + 1) * cols
An idea for a generalized version of the combos function that should work with other target values is
from operator import and_ # equivalent of &
def combos_generalized(col, n):
    return reduce(and_, [col == col.shift(i) for i in range(1, n)])
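As a hedged sketch, the numerator and denominator can also be combined in one helper that uses a rolling sum instead of chained shifts. The name run_probability and its target parameter are illustrative and not part of the answers above. Unlike combos_generalized, which counts runs of any repeated value, this version counts only runs of the target value.

```python
import pandas as pd

df = pd.DataFrame({
    "Time": list(range(1, 11)),
    "A": [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],
    "B": [1, -1, 1, -1, 1, -1, 1, 1, -1, 1],
    "C": [1, -1, 1, -1, 1, -1, 1, 1, 1, -1],
})

def run_probability(frame, n, target=1):
    """Share of length-n windows (per column) that consist only of `target`."""
    hits = frame.eq(target).astype(int)   # 1 where the value equals target
    full = hits.rolling(n).sum().eq(n)    # True where a whole window matches
    windows = (len(frame) - n + 1) * frame.shape[1]
    return full.to_numpy().sum() / windows

print(run_probability(df[["A", "B", "C"]], 2))  # 3/27
print(run_probability(df[["A", "B", "C"]], 3))  # 1/24
```

The first n - 1 rows of the rolling sum are NaN, so they never compare equal to n and are not counted.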

Related

Combining looping and conditional to make new columns on dataframe

I want to write a function with a loop and a conditional that counts only the rows where Actual Result = 1.
The count should increase by 1 each time Actual Result = 1.
This is my dataframe:
This is my code, but it doesn't produce the result that I want:
def func_count(x):
    for i in range(1,880):
        if x['Actual Result']==1:
            result = i
        else:
            result = '-'
        return result
X_machine_learning['Count'] = X_machine_learning.apply(lambda x:func_count(x),axis=1)
When I check and filter with count != '-', the result looks like this:
The number is always 1 and does not increase by 1 every time Actual Result = 1. Any solution?
Try something like this:
import pandas as pd
df = pd.DataFrame({
    'age': [30,25,40,12,16,17,14,50,22,10],
    'actual_result': [0,1,1,1,0,0,1,1,1,0]
})
count = 0
lst_count = []
for i in range(len(df)):
    if df['actual_result'][i] == 1:
        count+=1
        lst_count.append(count)
    else:
        lst_count.append('-')
df['count'] = lst_count
print(df)
Result
   age  actual_result count
0   30              0     -
1   25              1     1
2   40              1     2
3   12              1     3
4   16              0     -
5   17              0     -
6   14              1     4
7   50              1     5
8   22              1     6
9   10              0     -
Actually, you don't need to loop over the dataframe; explicit row loops are generally a pandas anti-pattern and best avoided. With df as your dataframe, you could try the following instead:
m = df["Actual Result"] == 1
df["Count"] = m.cumsum().where(m, "-")
Result for the following dataframe
df = pd.DataFrame({"Actual Result": [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]})
is
   Actual Result Count
0              1     1
1              1     2
2              0     -
3              1     3
4              1     4
5              1     5
6              0     -
7              0     -
8              1     6
9              0     -
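Filling with "-" turns Count into an object column that mixes integers and strings. If the counts need to stay numeric, one possible variation (a sketch, assuming missing values are acceptable in place of "-") is to leave the non-matching rows empty and use the nullable Int64 dtype:

```python
import pandas as pd

df = pd.DataFrame({"Actual Result": [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]})

m = df["Actual Result"] == 1
# Rows where m is False become missing; Int64 keeps the others as integers.
df["Count"] = m.cumsum().where(m).astype("Int64")
print(df)
```

The column can then still be summed, compared, or filtered with notna() without converting it back from strings.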

Find number of consecutively increasing/decreasing values in a pandas column (and fill another col with it) in an optimized way

I am trying to create a new column for a dataframe. The column I use for it is a price column. Basically what I am trying to achieve is getting the number of times that the price has increased/decreased consecutively. I need this to be rather quick because the dataframes can be quite big.
For example, the result should look like:
input = [1,2,3,2,1]
increase = [0,1,2,0,0]
decrease = [0,0,0,1,2]
You can compute the diff and apply a cumsum on the positive/negative values:
df = pd.DataFrame({'col': [1,2,3,2,1]})
s = df['col'].diff()
df['increase'] = s.gt(0).cumsum().where(s.gt(0), 0)
df['decrease'] = s.lt(0).cumsum().where(s.lt(0), 0)
Output:
   col  increase  decrease
0    1         0         0
1    2         1         0
2    3         2         0
3    2         0         1
4    1         0         2
Resetting the count
Since your example is ambiguous, here is an additional method, using groupby, in case you want to reset the counts for each increasing/decreasing group.
The resetting counts are labeled inc2/dec2:
df = pd.DataFrame({'col': [1,2,3,2,1,2,3,1]})
s = df['col'].diff()
s1 = s.gt(0)
s2 = s.lt(0)
df['inc'] = s1.cumsum().where(s1, 0)
df['dec'] = s2.cumsum().where(s2, 0)
si = df['inc'].eq(0)
sd = df['dec'].eq(0)
df['inc2'] = si.groupby(si.cumsum()).cumcount()
df['dec2'] = sd.groupby(sd.cumsum()).cumcount()
Output:
   col  inc  dec  inc2  dec2
0    1    0    0     0     0
1    2    1    0     1     0
2    3    2    0     2     0
3    2    0    1     0     1
4    1    0    2     0     2
5    2    3    0     1     0
6    3    4    0     2     0
7    1    0    3     0     1
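The resetting counts can also be computed without the intermediate inc/dec columns. The following is a minimal sketch; the helper name run_length is made up for illustration. It starts a new group at every False and takes a cumulative sum within each group:

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3, 2, 1, 2, 3, 1]})
s = df["col"].diff()

def run_length(mask):
    """Count consecutive True values, restarting at 0 after every False."""
    # Each False opens a new group, so the cumulative sum restarts there.
    return mask.astype(int).groupby((~mask).cumsum()).cumsum()

df["inc2"] = run_length(s.gt(0))
df["dec2"] = run_length(s.lt(0))
print(df)
```

For this input, the resulting columns should match inc2 and dec2 in the output above.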
Another option resets the count at every change of direction using cumsum, where, and ffill:
import pandas as pd
df = pd.DataFrame({'input': [1, 2, 3, 2, 1]})
up = df['input'] > df['input'].shift()    # True where the value increased
down = df['input'] < df['input'].shift()  # True where the value decreased
# running total of increases minus the total frozen at the last non-increase
df['a'] = up.cumsum() - up.cumsum().where(~up).ffill().fillna(0).astype(int)
df['b'] = down.cumsum() - down.cumsum().where(~down).ffill().fillna(0).astype(int)
print(df)
output
   input  a  b
0      1  0  0
1      2  1  0
2      3  2  0
3      2  0  1
4      1  0  2
Coding this manually using numpy might look like this:
import numpy as np
values = np.array([1, 2, 3, 2, 1])
increase = np.zeros(len(values), dtype=int)
decrease = np.zeros(len(values), dtype=int)
for i in range(1, len(values)):
    if values[i] > values[i-1]:
        increase[i] = increase[i-1] + 1
        decrease[i] = 0
    elif values[i] < values[i-1]:
        increase[i] = 0
        decrease[i] = decrease[i-1] + 1
    else:
        increase[i] = 0
        decrease[i] = 0
increase # array([0, 1, 2, 0, 0])
decrease # array([0, 0, 0, 1, 2])

How to create a new column through a specific condition?

I have a column like this:
1
0
0
0
0
1
0
0
0
1
0
0
and need as result:
1
1
1
1
1
2
2
2
2
3
3
3
I need a method that splits the column into groups, each starting at a 1, and numbers the groups consecutively.
Any idea?
You can loop through the list with a counter, increment the counter every time you find the number 1, and write it back as the column value.
def rank(lst):
    counter = 0
    for i, column in enumerate(lst):
        if column == 1:
            counter += 1
        lst[i] = counter
def fill_arr(arr):
    curr = 1
    for i in range(1, len(arr)):
        arr[i] = curr
        if i < len(arr)-1 and arr[i+1] == 1:
            curr += 1
    return arr
A quick test
arr = [1,0,0,0,1,0,0,0,1,0,0]
fill_arr(arr)
[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
The idea is as follows:
keep track of the number of 1s encountered so far in curr, by looking one element ahead and incrementing it whenever the next element is a 1.
set the element at the current index to curr.
start at index 1, since we know that there is a 1 at index zero. This reduces edge cases and makes the algorithm easier to manage.
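The look-ahead version relies on the list starting with a 1 and on 1s never being adjacent. A non-mutating variant that drops both assumptions might look like the sketch below; fill_groups is a hypothetical name, and the function simply returns the running count of 1s in a new list:

```python
def fill_groups(values):
    """Return a new list holding, for each position, the number of 1s seen so far."""
    result = []
    count = 0
    for v in values:
        if v == 1:
            count += 1  # a 1 starts the next group
        result.append(count)
    return result

print(fill_groups([1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]))
# [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
```

Elements that come before the first 1 get the value 0 here, which is one possible convention for that case.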
What you are looking for is usually called the cumulative sum; put as a verb, you want to accumulate the values in the list.
For a python list:
import itertools
l1 = [1,0,0,0,1,0,0,0,1,0,0]
l2 = list(itertools.accumulate(l1))
print(l2)
# [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
For a numpy array:
import numpy
a1 = numpy.array([1,0,0,0,1,0,0,0,1,0,0])
a2 = a1.cumsum()
print(a2)
# [1 1 1 1 2 2 2 2 3 3 3]
For a column in a pandas dataframe:
import pandas
df = pandas.DataFrame({'col1': [1,0,0,0,1,0,0,0,1,0,0]})
df['col2'] = df['col1'].cumsum()
print(df)
#     col1  col2
# 0      1     1
# 1      0     1
# 2      0     1
# 3      0     1
# 4      1     2
# 5      0     2
# 6      0     2
# 7      0     2
# 8      1     3
# 9      0     3
# 10     0     3
Documentation:
itertools.accumulate;
numpy.cumsum;
numpy.ndarray.cumsum;
pandas.DataFrame.cumsum;
pandas.Series.cumsum.

convert values in a series to either one of two values

I have a series y which has values between -3 and 3.
I want to convert numbers that are above 0 to 1 and numbers that are less than or equal to zero to 0.
What is the best way to do this?
I wrote the code below. However it doesn't give me the expected output. The first line works. However after running the second line the values that were 1 change to something random, which I don't understand
import numpy as np
y_final = np.where(y > 0, 1, y).tolist()
y_final = np.where(y <= 0, 0, y).tolist()
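A likely explanation for this behaviour is that the second line starts again from y rather than from the result of the first line, so the first replacement is discarded and positive values keep their original size. The sketch below assumes an integer Series; the sample data and the names overwritten, step1, and chained are illustrative only.

```python
import numpy as np
import pandas as pd

y = pd.Series([-3, -2, -1, 0, 1, 2, 3])  # hypothetical sample data

# Reading from y again: values above 0 are left at their original size.
overwritten = np.where(y <= 0, 0, y).tolist()
print(overwritten)  # [0, 0, 0, 0, 1, 2, 3]

# Feeding the first result into the second call applies both rules.
step1 = np.where(y > 0, 1, y)
chained = np.where(step1 <= 0, 0, step1).tolist()
print(chained)  # [0, 0, 0, 0, 1, 1, 1]
```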
I think you need Series.clip if values are integers:
import pandas as pd
y = pd.Series(range(-3, 4))
print (y)
0   -3
1   -2
2   -1
3    0
4    1
5    2
6    3
dtype: int64
print (y.clip(lower=0, upper=1))
0    0
1    0
2    0
3    0
4    1
5    1
6    1
dtype: int64
Your solution can be simplified by setting 1 and 0 directly:
y_final = np.where(y > 0, 1, 0)
print (y_final)
[0 0 0 0 1 1 1]
Or convert the boolean mask for values greater than 0 to integers:
y_final = y.gt(0).astype(int)
#alternative
#y_final = (y > 0).astype(int)
print (y_final)
0    0
1    0
2    0
3    0
4    1
5    1
6    1
dtype: int32
You can also use simple map:
numbers = range(-3,4)
print(list(map(lambda n: 1 if n > 0 else 0, numbers)))

populate row with opposite value of the xth previous row, if condition is true

Following is the Dataframe I am starting from:
import pandas as pd
import numpy as np
d= {'PX_LAST':[1,2,3,3,3,1,2,1,1,1,3,3],'ma':[2,2,2,2,2,2,2,2,2,2,2,2],'action':[0,0,1,0,0,-1,0,1,0,0,-1,0]}
df_zinc = pd.DataFrame(data=d)
df_zinc
Now, I need to add a column called 'buy_sell', which:
when 'action'==1, is populated with 1 if 'PX_LAST' > 'ma', and with -1 if 'PX_LAST' < 'ma'
when 'action'==-1, is populated with the opposite of the previous non-zero value that was populated
FYI: in my data, the row that needs to be filled with the opposite of the previous non-zero item is always at the same distance from the previous non-zero item (i.e., 2 in the current example). This should make the code easier to write.
The code that I have written so far is the following. It seems right to me. Do you have any fixes to propose?
while index < df_zinc.shape[0]:
    if df_zinc['action'][index] == 1:
        if df_zinc['PX_LAST'][index]<df_zinc['ma'][index]:
            df_zinc.loc[index,'buy_sell'] = -1
        else:
            df_zinc.loc[index,'buy_sell'] = 1
    elif df_zinc['action'][index] == -1:
        df_zinc['buy_sell'][index] = df_zinc['buy_sell'][index-3]*-1
    index=index+1
df_zinc
the resulting dataframe would look like this:
df_zinc['buy_sell'] = [0,0,1,0,0,-1,0,-1,0,0,1,0]
df_zinc
So, this would be my suggestion according to the example output (assuming I understood the question properly):
def buy_sell(row):
    if row['action'] == 0:
        return 0
    if row['PX_LAST'] > row['ma']:
        return 1 * (-1 if row['action'] == 0 else 1)
    elif row['PX_LAST'] < row['ma']:
        return -1 * (-1 if row['action'] == 0 else 1)
    return 0
df_zinc = df_zinc.assign(buy_sell=df_zinc.apply(buy_sell, axis=1))
df_zinc
This should behave as expected by the rules. It has no special rule for 'PX_LAST' being equal to 'ma' and returns 0 by default in that case, as it was not clear what rule to follow in that scenario.
EDIT
Ok, now that the new logic has been explained, I think this should do the trick:
def assign_buysell(df):
    last_nonzero = None
    def buy_sell(row):
        nonlocal last_nonzero
        if row['action'] == 0:
            return 0
        if row['action'] == 1:
            if row['PX_LAST'] < row['ma']:
                last_nonzero = -1
            elif row['PX_LAST'] > row['ma']:
                last_nonzero = 1
        elif row['action'] == -1:
            last_nonzero = last_nonzero * -1
        return last_nonzero
    return df.assign(buy_sell=df.apply(buy_sell, axis=1))
df_zinc = assign_buysell(df_zinc)
This solution is independent of how long ago the nonzero value was seen: it simply remembers the last nonzero value and returns the opposite when action is -1.
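The same "remember the last entry, flip it on exit" idea can be sketched without a row-wise apply by forward-filling the entry signal. The names is_entry, is_exit, entry, and exit_ are illustrative. The sketch assumes, as in the example data, that every -1 action is preceded by a 1 action, and it yields 0 when PX_LAST equals ma.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "PX_LAST": [1, 2, 3, 3, 3, 1, 2, 1, 1, 1, 3, 3],
    "ma": [2] * 12,
    "action": [0, 0, 1, 0, 0, -1, 0, 1, 0, 0, -1, 0],
})

is_entry = df["action"].eq(1)
is_exit = df["action"].eq(-1)

# Entry rows: sign of (PX_LAST - ma); every other row is NaN.
entry = np.sign(df["PX_LAST"] - df["ma"]).where(is_entry)
# Exit rows: opposite of the most recent entry signal, carried forward.
exit_ = -entry.ffill().where(is_exit)
# Combine both and fill the remaining rows with 0.
df["buy_sell"] = entry.fillna(exit_).fillna(0).astype(int)
print(df)
```

Because the entry signal is carried forward with ffill, the distance between an entry row and its exit row does not need to be constant.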
You can use np.select, and use np.nan as a label for the rows that satisfy the third condition:
c1 = df_zinc.action.eq(1) & df_zinc.PX_LAST.gt(df_zinc.ma)
c2 = df_zinc.action.eq(1) & df_zinc.PX_LAST.lt(df_zinc.ma)
c3 = df_zinc.action.eq(-1)
df_zinc['buy_sell'] = np.select([c1,c2, c3], [1, -1, np.nan])
Now in order to fill NaNs with the value from n rows above (in this case 3), you can fillna with a shifted version of the dataframe:
df_zinc['buy_sell'] = df_zinc.buy_sell.fillna(df_zinc.buy_sell.shift(3)*-1)
Output
    PX_LAST  ma  action  buy_sell
0         1   2       0       0.0
1         2   2       0       0.0
2         3   2       1       1.0
3         3   2       0       0.0
4         3   2       0       0.0
5         1   2      -1      -1.0
6         2   2       0       0.0
7         1   2       1      -1.0
8         1   2       0       0.0
9         1   2       0       0.0
10        3   2      -1       1.0
11        3   2       0       0.0
I would use np.select for this, since you have multiple conditions:
conditions = [
    (df_zinc['action'] == 1) & (df_zinc['PX_LAST'] > df_zinc['ma']),
    (df_zinc['action'] == 1) & (df_zinc['PX_LAST'] < df_zinc['ma']),
    (df_zinc['action'] == -1) & (df_zinc['PX_LAST'] > df_zinc['ma']),
    (df_zinc['action'] == -1) & (df_zinc['PX_LAST'] < df_zinc['ma'])
]
choices = [1, -1, 1, -1]
df_zinc['buy_sell'] = np.select(conditions, choices, default=0)
result
print(df_zinc)
    PX_LAST  ma  action  buy_sell
0         1   2       0         0
1         2   2       0         0
2         3   2       1         1
3         3   2       0         0
4         3   2       0         0
5         1   2      -1        -1
6         2   2       0         0
7         1   2       1        -1
8         1   2       0         0
9         1   2       0         0
10        3   2      -1         1
11        3   2       0         0
Here is my solution, using shift() to fetch the value from 3 rows up:
df_zinc['buy_sell'] = 0
df_zinc.loc[(df_zinc['action'] == 1) & (df_zinc['PX_LAST'] < df_zinc['ma']), 'buy_sell'] = -1
df_zinc.loc[(df_zinc['action'] == 1) & (df_zinc['PX_LAST'] > df_zinc['ma']), 'buy_sell'] = 1
df_zinc.loc[df_zinc['action'] == -1, 'buy_sell'] = -df_zinc['buy_sell'].shift(3)
df_zinc['buy_sell'] = df_zinc['buy_sell'].astype(int)
print(df_zinc)
output:
    PX_LAST  ma  action  buy_sell
0         1   2       0         0
1         2   2       0         0
2         3   2       1         1
3         3   2       0         0
4         3   2       0         0
5         1   2      -1        -1
6         2   2       0         0
7         1   2       1        -1
8         1   2       0         0
9         1   2       0         0
10        3   2      -1         1
11        3   2       0         0
