I have a column like this:
1
0
0
0
0
1
0
0
0
1
0
0
and need as result:
1
1
1
1
1
2
2
2
2
3
3
3
In other words, I need a method/algorithm that splits the column into ranks running from one 1 to the next and assigns them successive values.
Any idea?
You can loop through the list and use a counter to update the column values, incrementing it every time you find the number 1.
def rank(lst):
    counter = 0
    for i, column in enumerate(lst):
        if column == 1:
            counter += 1
        lst[i] = counter
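For example (the function mutates the list in place):

lst = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
rank(lst)
print(lst)  # [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]

Alternatively, you can look ahead at the next element: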
def fill_arr(arr):
    curr = 1
    for i in range(1, len(arr)):
        arr[i] = curr
        if i < len(arr)-1 and arr[i+1] == 1:
            curr += 1
    return arr
A quick test
arr = [1,0,0,0,1,0,0,0,1,0,0]
fill_arr(arr)
[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
The idea is as follows:
keep track of the number of 1s we have encountered in curr by looking ahead, and increment it as we see new 1s.
set the element at the current index to curr.
we start at index 1, since we know that there is a 1 at index zero. This reduces the edge cases and makes the algorithm easier to manage.
What you are looking for is usually called the cumulative sum; or, as a verb, you want to accumulate the values in the list.
For a python list:
import itertools
l1 = [1,0,0,0,1,0,0,0,1,0,0]
l2 = list(itertools.accumulate(l1))
print(l2)
# [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
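For reference, itertools.accumulate with its default operation (addition) is equivalent to this plain loop (a minimal sketch; accumulate_manually is just an illustrative name):

def accumulate_manually(values):
    # keep a running total and append it once per element
    total = 0
    out = []
    for v in values:
        total += v
        out.append(total)
    return out

accumulate_manually([1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0])
# [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]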
For a numpy array:
import numpy
a1 = numpy.array([1,0,0,0,1,0,0,0,1,0,0])
a2 = a1.cumsum()
print(a2)
# [1 1 1 1 2 2 2 2 3 3 3]
For a column in a pandas dataframe:
import pandas
df = pandas.DataFrame({'col1': [1,0,0,0,1,0,0,0,1,0,0]})
df['col2'] = df['col1'].cumsum()
print(df)
# col1 col2
# 0 1 1
# 1 0 1
# 2 0 1
# 3 0 1
# 4 1 2
# 5 0 2
# 6 0 2
# 7 0 2
# 8 1 3
# 9 0 3
# 10 0 3
Documentation:
itertools.accumulate;
numpy.cumsum;
numpy.ndarray.cumsum;
pandas.DataFrame.cumsum;
pandas.Series.cumsum.
Related
Given a dataframe, how do I calculate the probability of consecutive events using python pandas?
For example,
Time   A   B   C
   1   1   1   1
   2  -1  -1  -1
   3   1   1   1
   4  -1  -1  -1
   5   1   1   1
   6  -1  -1  -1
   7   1   1   1
   8  -1   1   1
   9   1  -1   1
  10  -1   1  -1
In this dataframe, B has two consecutive "1"s at t=7 and t=8, and C has three consecutive "1"s from t=7 to t=9.
The probability that two consecutive "1"s appear is 3/27, since there are 3 such windows out of (10 - 2 + 1) × 3 = 27 sliding windows.
The probability that three consecutive "1"s appear is 1/24, since there is 1 such window out of (10 - 3 + 1) × 3 = 24.
How can I do this using python pandas?
Try this code (it also works for other dataframes, i.e. with more columns or rows; it assumes a dataframe df like the one shown in the docstring):
def consecutive(num):
    '''
    Assumes a dataframe df like:
    df = pd.DataFrame({
        'Time': [i for i in range(1, 11)],
        'A': [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],
        'B': [1, -1, 1, -1, 1, -1, 1, 1, -1, 1],
        'C': [1, -1, 1, -1, 1, -1, 1, 1, 1, -1]
    })
    '''
    row_num = df.shape[0]
    col_num = df.shape[1]
    cnt = 0  # the number of consecutive runs found
    for col_index in range(1, col_num):  # count within each column, skipping 'Time'
        col_tmp = df.iloc[:, col_index]
        consec = 0
        for i in range(row_num):
            if col_tmp[i] == 1:
                consec += 1
            else:
                # a -1 after a 1 resets the streak
                consec = 0
            if consec == num:
                cnt += 1
                # decrement by 1 so overlapping windows are also counted
                consec -= 1
    all_cases = (row_num - num + 1) * (col_num - 1)  # col_num - 1 excludes the 'Time' column
    prob = cnt / all_cases
    return prob
When you run it on the given dataframe with

print(df)
print(f'two consecutive : {consecutive(2)}')
print(f'three consecutive : {consecutive(3)}')
Output:
   Time  A  B  C
0     1  1  1  1
1     2 -1 -1 -1
2     3  1  1  1
3     4 -1 -1 -1
4     5  1  1  1
5     6 -1 -1 -1
6     7  1  1  1
7     8 -1  1  1
8     9  1 -1  1
9    10 -1  1 -1
two consecutive : 0.1111111111111111
three consecutive : 0.041666666666666664
You can compare rows with previous rows using shift. So, to find out how often two consecutive values are equal, you can do
>>> (df.C == df.C.shift()).sum()
2
To find three consecutive equal values, you'd have to compare the column with itself shifted by 1 (the default) and additionally shifted by 2.
>>> ((df.C == df.C.shift()) & (df.C == df.C.shift(2))).sum()
1
Another variation of this, using the pd.Series.eq function instead of the == operator, is:
>>> m = df.C.eq(df.C.shift(1)) & df.C.eq(df.C.shift(2))
>>> m.sum()
1
In this case, since the target value is 1 (and True == 1 is True; it won't work for other target values as is, see below), the pattern can be generalized with functools.reduce to:
from functools import reduce
def combos(column, n):
    return reduce(pd.Series.eq, [column.shift(i) for i in range(n)])
You can apply this function to df like so, which will give you the numerator:
>>> df[['A', 'B', 'C']].apply(combos, n = 2).values.sum()
3
>>> df[['A', 'B', 'C']].apply(combos, n = 3).values.sum()
1
To get the denominator, you can do, e.g.,
n = 2
rows, cols = df[['A', 'B', 'C']].shape
denominator = (rows - n + 1) * cols
An idea for a generalized version of the combos function that should work with other target values is
from operator import and_  # equivalent of &

def combos_generalized(col, n):
    return reduce(and_, [col == col.shift(i) for i in range(1, n)])
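As a quick cross-check of both counts, here is a rolling-window sketch (my own addition, not from the answer above; count_runs is an illustrative name; a length-n window of 1s sums to exactly n):

import pandas as pd

df = pd.DataFrame({
    'Time': range(1, 11),
    'A': [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],
    'B': [1, -1, 1, -1, 1, -1, 1, 1, -1, 1],
    'C': [1, -1, 1, -1, 1, -1, 1, 1, 1, -1],
})

def count_runs(frame, n):
    # count all sliding windows of length n that contain only 1s
    hits = frame[['A', 'B', 'C']].eq(1).astype(int).rolling(n).sum().eq(n)
    return int(hits.values.sum())

print(count_runs(df, 2) / ((len(df) - 2 + 1) * 3))  # 3/27 = 0.111...
print(count_runs(df, 3) / ((len(df) - 3 + 1) * 3))  # 1/24 = 0.0416...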
There are three columns in df: mins, maxs, and col. I would like to generate a binary list according to the following rule: if col[i] is less than or equal to mins[i], append a 1 to the list, and keep appending 1 for each row until some later row i+n where col[i+n] is greater than or equal to maxs[i+n]. After reaching maxs[i+n], append a 0 for each row until the next row where col is less than or equal to mins. Repeat this entire process over all rows.
For example,
col mins maxs
2 1 6 (0)
4 2 6 (0)
2 3 7 (1)
5 5 6 (1)
4 3 8 (1)
4 2 5 (1)
5 3 5 (0)
4 0 5 (0)
3 3 8 (1)
......
So the list would be [0,0,1,1,1,1,0,0,1]. Does this make sense?
I gave it a shot and wrote the following, which unfortunately did not achieve what I wanted.
def get_list(col, mins, maxs):
    l = []
    i = 0
    while i <= len(col):
        if col[i] <= mins[i]:
            l.append(1)
            while col[i+1] <= maxs[i+1]:
                l.append(1)
                i += 1
            break
        break
    return l
Thank you so much folks!
My answer may not be elegant, but it should work as you expect.
Import the pandas library.
import pandas as pd
Create dataframe according to data provided.
input_data = {
    'col':  [2, 4, 2, 5, 4, 4, 5, 4, 3],
    'mins': [1, 2, 3, 5, 3, 2, 3, 0, 3],
    'maxs': [6, 6, 7, 6, 8, 5, 5, 5, 8]
}
dataframe_ = pd.DataFrame(data=input_data)
Using a for loop, iterate over the rows. The switch variable changes according to the conditions you provided, which populates the binary column.
binary_switch = False
for index, row in dataframe_.iterrows():
    if row['col'] <= row['mins']:
        binary_switch = True
    elif row['col'] >= row['maxs']:
        binary_switch = False
    binary_output = 1 if binary_switch else 0
    dataframe_.at[index, 'binary'] = binary_output
dataframe_['binary'] = dataframe_['binary'].astype('int')
print(dataframe_)
Output from the code:
col mins maxs binary
0 2 1 6 0
1 4 2 6 0
2 2 3 7 1
3 5 5 6 1
4 4 3 8 1
5 4 2 5 1
6 5 3 5 0
7 4 0 5 0
8 3 3 8 1
Your rules give the following decision tree:
1: is col <= mins?
   True:  l.append(1)
   False: next question
2: was col <= mins before?
   False: l.append(0)
   True:  next question
3: is col >= maxs?
   True:  l.append(0)
   False: l.append(1)
Making this into a function with an if/else tree, you get this:
def make_binary_list(df):
    l = []
    col_lte_mins = False
    for index, row in df.iterrows():
        col = row["col"]
        mins = row["mins"]
        maxs = row["maxs"]
        if col <= mins:
            col_lte_mins = True
            l.append(1)
        else:
            if col_lte_mins:
                if col >= maxs:
                    col_lte_mins = False
                    l.append(0)
                else:
                    l.append(1)
            else:
                l.append(0)
    return l
make_binary_list(df) gives [0, 0, 1, 1, 1, 1, 0, 0, 1]
I have a df like so:
Period Count
1 1
2 0
3 1
4 1
5 0
6 0
7 1
8 1
9 1
10 0
and I want to return an 'Event ID' in a new column where there are two or more consecutive occurrences of 1 in Count, and a 0 where there are not. Each row in the new column thus gets an ID based on this criterion being met in the Count column. My desired output would then be:
Period Count Event_ID
1 1 0
2 0 0
3 1 1
4 1 1
5 0 0
6 0 0
7 1 2
8 1 2
9 1 2
10 0 0
I have researched and found solutions that let me flag consecutive groups of similar numbers (e.g. 1), but I haven't come across what I need yet. I would also like to be able to use this method to count any number of consecutive occurrences, not just 2. For example, sometimes I need to count 10 consecutive occurrences; I just use 2 in the example here.
This will do the job:

import itertools
import operator

ones = df.groupby('Count').groups[1].tolist()
# creates a list of the indices with a '1': [0, 2, 3, 6, 7, 8]
event_id = [0] * len(df.index)
# creates a list of length 10 for Event_ID, filled with '0'
# find runs of consecutive indices in ones (yields [2, 3] and [6, 7, 8]):
event_counter = 0
for k, g in itertools.groupby(enumerate(ones), lambda ix: ix[0] - ix[1]):
    sublist = list(map(operator.itemgetter(1), g))
    if len(sublist) > 1:  # only runs of two or more count as an event
        event_counter += 1
        for i in sublist:
            event_id[i] = event_counter
# event_id is now [0, 0, 1, 1, 0, 0, 2, 2, 2, 0]
df['Event_ID'] = event_id

The for loop is adapted from this example (using itertools; other approaches are possible too).
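If the minimum run length needs to be configurable (e.g. 10 consecutive occurrences instead of 2), a shift-cumsum sketch along these lines should also work; label_events and min_run are illustrative names, not from the answer above:

import pandas as pd

def label_events(count, min_run=2):
    # identify runs of equal values, then keep only runs of 1s that are long enough
    groups = (count != count.shift()).cumsum()
    run_len = count.groupby(groups).transform('size')
    qualifies = count.eq(1) & run_len.ge(min_run)
    # number the qualifying runs 1, 2, 3, ... in order of appearance
    event_no = (qualifies & ~qualifies.shift(fill_value=False)).cumsum()
    return event_no.where(qualifies, 0)

df = pd.DataFrame({'Period': range(1, 11),
                   'Count': [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]})
df['Event_ID'] = label_events(df['Count'], min_run=2)
# df['Event_ID'] is now [0, 0, 1, 1, 0, 0, 2, 2, 2, 0]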
I have a pandas dataframe as below:
df = pd.DataFrame({'X':[1,1,1, 0, 0]})
df
X
0 1
1 1
2 1
3 0
4 0
Now I want to modify X based on the below condition:
If X == 0, set it to the previous row's value + 1.
So, my final output should look like below:
X
0 1
1 1
2 1
3 2
4 3
This can be achieved by iterating over the rows while tracking the current and previous row with iloc, and it works as expected:
for i in range(0, len(df)):
    current_row = df.iloc[i]
    if i > 0:
        previous_row = df.iloc[i-1]
    else:
        previous_row = current_row
    if current_row['X'] == 0:
        current_row['X'] = previous_row['X'] + 1
I want a more efficient way of doing this. I tried the code below, but the output is not what I expected (the value of X in the fifth row should be 3):
conditions = [df["X"] == 0]
values = [df["X"].shift() + 1]
df['X'] = np.select(conditions, values)
>>> df
X
0 1
1 1
2 1
3 2
4 1
You could try the following:
import numpy as np
import pandas as pd
df = pd.DataFrame({'X': [1, 1, 1, 0, 0]})
# values directly before a zero
pe_zero = df.X.shift(-1).eq(0) * df.X  # [0, 0, 1, 0, 0]
# 1 for each zero value, as you add one to the previous value
eq_zero = df.X.eq(0)
# find consecutive groups of zeros
groups = pe_zero + eq_zero
consecutive = (groups.gt(0) != groups.gt(0).shift()).cumsum()
# cumulative sum within each group
cumulative = groups.groupby(consecutive).cumsum()
# choose from cumulative where X equals zero, else keep the original value
result = np.where(eq_zero, cumulative, df.X)
print(result)
Output
[1 1 1 2 3]
UPDATE
For df = pd.DataFrame({'X': [1, 1, 1, 0, 0, 1, 1, 0, 0]})
returns:
[1 1 1 2 3 1 1 2 3]
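An equivalent, somewhat shorter groupby sketch (my own variant; it assumes the series starts with a nonzero value, as in the question):

import pandas as pd

df = pd.DataFrame({'X': [1, 1, 1, 0, 0, 1, 1, 0, 0]})
groups = df.X.ne(0).cumsum()            # a new group starts at every nonzero
filled = df.X.mask(df.X.eq(0)).ffill()  # carry the last nonzero value forward
df['X'] = (filled + df.X.groupby(groups).cumcount()).astype(int)
print(df.X.tolist())  # [1, 1, 1, 2, 3, 1, 1, 2, 3]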
You could try this:
arr = df.X.values  # extract the column as a numpy array for faster iteration
for i, val in enumerate(arr[1:], start=1):
    if val == 0:
        arr[i] = arr[i-1] + 1
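Depending on the pandas version and the column's dtype, arr may be a view or a copy of the underlying data, so it is safest to write the result back explicitly:

df['X'] = arr  # assign the modified values back to the column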
I would like to measure the length of a sub-array fulfilling some condition (like a stop clock), but as soon as the condition is no longer fulfilled, the value should reset to zero. So the resulting array should tell me how many values fulfilled some condition (e.g. value > 1):
[0, 0, 2, 2, 2, 2, 0, 3, 3, 0]
should result in the following array:
[0, 0, 1, 2, 3, 4, 0, 1, 2, 0]
One can easily define a function in Python that returns the corresponding numpy array:
import numpy as np

def StopClock(signal, threshold=1):
    clock = []
    current_time = 0
    for item in signal:
        if item > threshold:
            current_time += 1
        else:
            current_time = 0
        clock.append(current_time)
    return np.array(clock)
StopClock([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
However, I really do not like this for-loop, especially since this counter will run over a longer dataset. I thought of some np.cumsum solution in combination with np.diff, but I cannot get past the reset part. Is anyone aware of a more elegant numpy-style solution to the above problem?
This solution uses pandas to perform a groupby:
s = pd.Series([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
threshold = 0
>>> np.where(
...     s > threshold,
...     s.to_frame()  # Convert series to dataframe.
...      .assign(_dummy_=1)  # Add a column of ones.
...      .groupby((s.gt(threshold) != s.gt(threshold).shift()).cumsum())['_dummy_']  # shift-cumsum pattern
...      .transform(lambda x: x.cumsum()),  # Cumsum the ones per group.
...     0)  # Fill with zero where the threshold is not exceeded.
array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])
Yes, we can use diff-style differentiation along with cumsum to create such intervaled ramps in a vectorized manner, which should be pretty efficient, especially with large input arrays. The resetting part is taken care of by assigning appropriate values at the end of each interval, so that the cumulative sum resets the count at the end of each interval.
Here's one implementation to accomplish all that -
def intervaled_ramp(a, thresh=1):
    mask = a > thresh
    # get start, stop indices of each run above the threshold
    mask_ext = np.concatenate(([False], mask, [False]))
    idx = np.flatnonzero(mask_ext[1:] != mask_ext[:-1])
    s0, s1 = idx[::2], idx[1::2]
    out = mask.astype(int)
    # at each run's stop index, write a negative value that cancels the run
    valid_stop = s1[s1 < len(a)]
    out[valid_stop] = s0[:len(valid_stop)] - valid_stop
    return out.cumsum()
Sample runs -
Input (a) :
[5 3 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) :
[1 2 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) :
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=1)) :
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 1]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=0)) :
[1 2 3 4 5 0 0 1 2 3 4 0 1 2 0 1 2 3 0 1 2 3 4 0 1]
Runtime test
A fair way to benchmark is to take the sample posted in the question, tile it a large number of times, and use that as the input array. With that setup, here are the timings -
In [841]: a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
In [842]: a = np.tile(a, 10000)

# @Alexander's soln
In [843]: %timeit pandas_app(a, threshold=1)
1 loop, best of 3: 3.93 s per loop

# @Psidom's soln
In [844]: %timeit stop_clock(a, threshold=1)
10 loops, best of 3: 119 ms per loop

# Proposed in this post
In [845]: %timeit intervaled_ramp(a, thresh=1)
1000 loops, best of 3: 527 µs per loop
Another numpy solution:
import numpy as np
a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
def stop_clock(signal, threshold=1):
    mask = signal > threshold
    indices = np.flatnonzero(np.diff(mask)) + 1
    return np.concatenate(list(map(np.cumsum, np.array_split(mask, indices))))
stop_clock(a)
# array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])