Create sequential event id for groups of consecutive ones - python

I have a df like so:
Period Count
1 1
2 0
3 1
4 1
5 0
6 0
7 1
8 1
9 1
10 0
I want to add an 'Event_ID' column. Rows that belong to a run of two or more consecutive 1s in Count should get that run's sequential ID (1 for the first qualifying run, 2 for the second, and so on), and every other row should get 0. My desired output would then be:
Period Count Event_ID
1 1 0
2 0 0
3 1 1
4 1 1
5 0 0
6 0 0
7 1 2
8 1 2
9 1 2
10 0 0
I have found solutions that flag consecutive groups of the same number (e.g. 1), but nothing that does what I need. I would also like the method to work for any minimum run length, not just 2. For example, sometimes I need at least 10 consecutive occurrences; I only use 2 in the example here.

This will do the job:
import itertools
import operator

ones = df.groupby('Count').groups[1].tolist()
# indices of the rows where Count == 1: [0, 2, 3, 6, 7, 8]
event_id = [0] * len(df.index)
# Event_ID starts as a list of length 10 filled with 0
min_run = 2   # minimum number of consecutive ones that counts as an event
next_id = 0
# split the list of ones into runs of consecutive indices ([0], [2, 3] and [6, 7, 8]):
for k, g in itertools.groupby(enumerate(ones), lambda ix: ix[0] - ix[1]):
    sublist = list(map(operator.itemgetter(1), g))
    if len(sublist) >= min_run:
        next_id += 1  # number the events sequentially, not by run length
        for i in sublist:
            event_id[i] = next_id
# event_id is now [0, 0, 1, 1, 0, 0, 2, 2, 2, 0]
df['Event_ID'] = event_id
The for loop is adapted from this example (using itertools, other approaches are possible too).
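As one of those other approaches, here is a minimal vectorized sketch in pandas. The function name add_event_id and its min_run parameter are hypothetical names chosen for illustration; the idea is to label each run with a shift/cumsum, measure its size, and number only the runs of ones that are long enough.

```python
import pandas as pd

def add_event_id(df, min_run=2):
    """Return a copy of df with a sequential Event_ID for runs of ones."""
    out = df.copy()
    is_one = out["Count"].eq(1)
    # A new run starts wherever the value differs from the previous row.
    run_id = is_one.ne(is_one.shift()).cumsum()
    run_len = is_one.groupby(run_id).transform("size")
    # Rows that sit inside a run of ones of at least min_run rows.
    qualifies = is_one & run_len.ge(min_run)
    # First row of each qualifying run; the cumulative sum numbers the runs.
    starts = qualifies & ~qualifies.shift(fill_value=False)
    out["Event_ID"] = starts.cumsum().where(qualifies, 0)
    return out

df = pd.DataFrame({"Period": range(1, 11),
                   "Count": [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]})
print(add_event_id(df, min_run=2))
```

Calling add_event_id(df, min_run=10) would apply the same logic with a threshold of 10 consecutive ones.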

Related

Iterate through dataframe to find first row that satisfies condition for each group of id

I have panel data with id and time columns. For each id, starting from the row where start = 1, I want to identify the first time that satisfies this rule: its "number in" is greater than a previous "number out".
With the example data below, the expected result is time = 5 for id = 1 and time = 3 for id = 2. The explanation is as follows.
For id = 1, start = 1 occurs at time = 1. Tracking from there, time = 5 is the answer: it is the first row whose "number in" (4) is higher than an earlier "number out" (1, at time = 3, after start = 1).
Similarly, for id = 2, time = 3 is the first row that satisfies the rule.
df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2, 2],
                   'time': [1, 2, 3, 4, 5, 1, 2, 3, 4],
                   'start': [1, 0, 0, 0, 0, 1, 0, 0, 0],
                   'number out': [2, 99, 1, 13, 9, 10, 2, 8, 8],
                   'number in': [2, 9, 1, 0, 4, 1, 5, 7, 8]})
df
id time start number out number in
0 1 1 1 2 2
1 1 2 0 99 9
2 1 3 0 1 1
3 1 4 0 13 0
4 1 5 0 9 4
5 2 1 1 10 1
6 2 2 0 2 5
7 2 3 0 8 7
8 2 4 0 8 8
The data is grouped by id, and a custom function is applied to each group with apply. In that function, ind is the index of the first row after start = 1, and ind_in is the index at which the search through 'number in' begins.
A check guards against groups that have no rows to search, so that no error is raised. If you are sure of your data, you can remove this line:
if ind_in[0] > 0 and ind_in[0] <= x.index[-1]:
Next, in the 'aaa' list comprehension, each 'number in' value is compared to an array of 'number out' values. If it is greater than at least one of them, the row's index is kept. The first index in the list is then used to look up the time.
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2, 2],
                   'time': [1, 2, 3, 4, 5, 1, 2, 3, 4],
                   'start': [1, 0, 0, 0, 0, 1, 0, 0, 0],
                   'number out': [2, 99, 1, 13, 9, 10, 2, 8, 8],
                   'number in': [2, 9, 1, 0, 4, 1, 5, 7, 8]})
print(df)

def my_func(x):
    # ind: first row after the start row; ind_in: first row whose 'number in' is tested
    ind = x[x['start'] == 1].index + 1
    ind_in = ind + 1
    if ind_in[0] > 0 and ind_in[0] <= x.index[-1]:
        number_out = x.loc[ind[0]:x.index[-2], 'number out']
        # compare row i only with 'number out' values from earlier rows (up to i - 1)
        aaa = [i for i in range(ind_in[0], x.index[-1] + 1)
               if (x.loc[i, 'number in'] > number_out.loc[:i - 1].values).any()]
        if aaa:
            return x.loc[aaa[0], 'time']
    return None  # no row in this group satisfies the rule

print(df.groupby('id').apply(my_func))
Output
id
1 5
2 3
dtype: int64

How to create a new column through a specific condition?

I have a column like this:
1
0
0
0
0
1
0
0
0
1
0
0
and need as result:
1
1
1
1
1
2
2
2
2
3
3
3
I need a method/algorithm that splits the column into groups, each starting at a 1 and running up to the next 1, and numbers those groups consecutively.
Any idea?
You can loop through the list with a counter, incrementing it every time you find the number 1 and writing it back as the column value.
def rank(lst):
    counter = 0
    for i, column in enumerate(lst):
        if column == 1:
            counter += 1
        lst[i] = counter  # modifies lst in place
def fill_arr(arr):
    curr = 1
    for i in range(1, len(arr)):
        arr[i] = curr
        if i < len(arr)-1 and arr[i+1] == 1:
            curr += 1
    return arr
A quick test
arr = [1,0,0,0,1,0,0,0,1,0,0]
fill_arr(arr)
[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
The idea is as follows:
Keep track of the number of 1s encountered so far in curr, looking one element ahead and incrementing it whenever the next element is a 1.
Set the element at the current index to curr.
Start at index 1, since we know there is a 1 at index zero. This reduces the edge cases and makes the algorithm easier to manage.
What you are looking for is usually called the cumulative sum (or running total): each output element is the sum of all input elements up to and including that position.
For a python list:
import itertools
l1 = [1,0,0,0,1,0,0,0,1,0,0]
l2 = list(itertools.accumulate(l1))
print(l2)
# [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
For a numpy array:
import numpy
a1 = numpy.array([1,0,0,0,1,0,0,0,1,0,0])
a2 = a1.cumsum()
print(a2)
# [1 1 1 1 2 2 2 2 3 3 3]
For a column in a pandas dataframe:
import pandas
df = pandas.DataFrame({'col1': [1,0,0,0,1,0,0,0,1,0,0]})
df['col2'] = df['col1'].cumsum()
print(df)
# col1 col2
# 0 1 1
# 1 0 1
# 2 0 1
# 3 0 1
# 4 1 2
# 5 0 2
# 6 0 2
# 7 0 2
# 8 1 3
# 9 0 3
# 10 0 3
Documentation:
itertools.accumulate;
numpy.cumsum;
numpy.ndarray.cumsum;
pandas.DataFrame.cumsum;
pandas.Series.cumsum.

Count values in previous rows that are greater than current row value

I want to count, for each row, the number of previous rows that have a greater value than the current row in a column, and store that count in a new column. It would be like a rolling countif that goes back to the beginning of the column. The desired example output below shows the given Value column and the Count column I want to create.
Desired Output:
Value Count
5 0
7 0
4 2
12 0
3 4
4 3
1 6
I plan on using this code with a large dataframe so the fastest way possible is appreciated.
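We can use numpy's subtract.outer to build the matrix of pairwise differences, keep its lower triangle (each row compared with itself and the rows before it), test which entries are less than 0, and sum those per row. Note that the matrix holds n x n entries, so memory use grows quadratically with the number of rows.
import numpy as np
a = np.sum(np.tril(np.subtract.outer(df.Value.values, df.Value.values), k=0) < 0, axis=1)
# results in array([0, 0, 2, 0, 4, 3, 6])
df['Count'] = a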
IMPORTANT: without raw=True, this only works with pandas < 1.0.0; the error seems to be a pandas bug, and an issue has been created at https://github.com/pandas-dev/pandas/issues/35203
We can do this with expanding and applying a function which counts the values that are higher than the last element in the expanding array. Passing raw=True hands the function a NumPy array, so that x[-1] is the last element of the window.
import pandas as pd
import numpy as np
# setup
df = pd.DataFrame([5,7,4,12,3,4,1], columns=['Value'])
# calculate countif
df['Count'] = df.Value.expanding(1).apply(lambda x: np.sum(np.where(x > x[-1], 1, 0)), raw=True).astype('int')
Input
Value
0 5
1 7
2 4
3 12
4 3
5 4
6 1
Output
Value Count
0 5 0
1 7 0
2 4 2
3 12 0
4 3 4
5 4 3
6 1 6
# values is the column as a plain list, e.g. values = df['Value'].tolist()
counts = []
for i in range(len(values)):
    count = 0
    for j in values[:i]:
        if values[i] < j:
            count += 1
    counts.append(count)
The below generator will do what you need. You may be able to further optimize this if needed.
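def generator(data):
    i = 0
    count_dict = {}
    m = max(data)  # computed once; this approach assumes integer values
    while i < len(data):
        v = data[i]
        count_dict[v] = count_dict[v] + 1 if v in count_dict else 1
        # earlier values in v+1..m; m + 1 so that the maximum itself is counted
        t = sum([(count_dict[j] if j in count_dict else 0) for j in range(v + 1, m + 1)])
        i += 1
        yield t

d = [1, 5, 7, 3, 5, 8]
foo = generator(d)
result = [b for b in foo]
print(result)
# [0, 0, 0, 2, 1, 0]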
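Since the question asks for speed on a large dataframe, another option is to keep the values seen so far in a sorted list and use binary search to find how many of them are greater than the current value. This is only a sketch using the standard-library bisect module; the function name count_previous_greater is hypothetical. The search is O(log n) per row, although list.insert still shifts elements, so this is not a strict O(n log n) algorithm.

```python
import bisect

def count_previous_greater(values):
    """For each value, count the earlier values that are strictly greater."""
    seen = []    # earlier values, kept in ascending order
    counts = []
    for v in values:
        # Position just past every earlier value that is <= v.
        pos = bisect.bisect_right(seen, v)
        counts.append(len(seen) - pos)
        seen.insert(pos, v)  # insert at pos to keep the list sorted
    return counts

print(count_previous_greater([5, 7, 4, 12, 3, 4, 1]))
# [0, 0, 2, 0, 4, 3, 6]
```

With a dataframe, the result could be assigned with df['Count'] = count_previous_greater(df['Value'].tolist()).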

Reverse Values Based on Successive Count

I have a series containing only 1's and 0's used as a flag. I'm trying to figure out a good way to count the length of each run of repeated values and, if a run doesn't meet a threshold, reverse its values. For instance, if a run has fewer than 5 repeated values in succession, flip them from 0's to 1's or vice versa.
For example:
Flag
1
1
1
1
1
0
0
0
0
1
1
...
Would become:
Flag
1
1
1
1
1
1
1
1
1
1
1
...
Use diff().ne(0) to find the breaks
Use cumsum() to create the groups
Use groupby.transform('size') to count the size of groups
then flip value with sub(df.Flag).abs()
df.Flag.groupby(
    df.Flag.diff().ne(0).cumsum()
).transform('size').lt(5).sub(df.Flag).abs()
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 0
10 0
Name: Flag, dtype: int64
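Another way, using np.where (sort_index keeps the run sizes in run order, since value_counts sorts by count by default):
s = df.Flag.diff().ne(0).cumsum().value_counts().sort_index()
np.where((s >= 5).repeat(s).values, df.Flag, 1 - df.Flag)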
Out[1158]: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0], dtype=int64)
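The same diff/cumsum/size idea can be wrapped in a small self-contained function. This is a sketch under the assumption that the flags are a pandas Series of 0s and 1s; the name flip_short_runs and the min_len parameter are hypothetical. It makes a single pass, so runs that merge after flipping are not re-evaluated.

```python
import pandas as pd

def flip_short_runs(flags, min_len):
    """Flip every run of identical 0/1 values that is shorter than min_len."""
    # Label runs: the counter increases wherever the value changes.
    run_id = flags.diff().ne(0).cumsum()
    run_len = flags.groupby(run_id).transform("size")
    # Keep values in long-enough runs, replace the rest with 1 - value.
    return flags.where(run_len >= min_len, 1 - flags)

flags = pd.Series([1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1], name="Flag")
print(flip_short_runs(flags, 5).tolist())
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```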

Conditional length of a binary data series in Pandas

Having a DataFrame with the following column:
df['A'] = [1,1,1,0,1,1,1,1,0,1]
What would be the best vectorized way to control the length of "1"-series by some limiting value? Let's say the limit is 2, then the resulting column 'B' must look like:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
One fully-vectorized solution is to use the shift-groupby-cumsum-cumcount combination [1] to mark the first 2 rows (or whatever limiting value you like) of each consecutive run. Then, & this new boolean Series with the original column:
df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
    .astype(int)  # cast the boolean Series back to integers
This produces the new column in the DataFrame:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
[1] See the pandas cookbook; the section on grouping, "Grouping like Python’s itertools.groupby"
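To make the limit a parameter, the same combination can be wrapped in a function. This is a sketch only; limit_runs and its limit argument are hypothetical names, and it assumes the column holds just 0s and 1s.

```python
import pandas as pd

def limit_runs(a, limit):
    """Keep at most `limit` leading 1s in each run of consecutive 1s."""
    # Label runs of identical values with a shift/cumsum.
    run_id = (a != a.shift()).cumsum()
    # Zero-based position of each row inside its run.
    pos_in_run = a.groupby(run_id).cumcount()
    return ((pos_in_run < limit) & (a == 1)).astype(int)

df = pd.DataFrame({"A": [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
df["B"] = limit_runs(df["A"], 2)
print(df["B"].tolist())
# [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]
```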
Another way (checking if previous two are 1):
In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})
In [444]: limit = 2
In [445]: df['B'] = list(map(lambda x: df['A'][x] if x < limit else int(df['A'][x] == 1 and not all(y == 1 for y in df['A'][x - limit:x])), range(len(df))))
In [446]: df
Out[446]:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a Pandas object, it can just be a normal Numpy array)
a = df['A'].to_numpy(copy=True)  # as_matrix() was removed in pandas 1.0
and convolve it with a sequence of 1's that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do
long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]
The resulting array, in this case, gives the number of 1's that occur in the 3 elements prior to and including that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.
a[long_run_count > 2] = 0
You can now assign the resulting array to a new column in your DataFrame.
df['B'] = a
To turn this into a more general method:
import numpy

def trim_runs(array, cutoff):
    a = numpy.array(array)  # copy, so the caller's data is not modified
    a[numpy.convolve(a, numpy.ones(cutoff + 1))[:-cutoff] > cutoff] = 0
    return a
