Increment counter the first time a number is reached - python

This is probably a very silly question. But, I'll still go ahead and ask. How would you increment a counter only the first time a particular value is reached?
For example, suppose 'step' below is a column of the df, and I want to add a column called 'counter' that increments the first time the 'step' column reaches a value of 6.

You can use .shift() in pandas.
Notice that you only want to increment when the value of df['step'] is 6
and the value of df.shift(1)['step'] is not 6:
df['counter'] = ((df['step'] == 6) & (df.shift(1)['step'] != 6)).cumsum()
print(df)
Output
step counter
0 2 0
1 2 0
2 2 0
3 3 0
4 4 0
5 4 0
6 5 0
7 6 1
8 6 1
9 6 1
10 6 1
11 7 1
12 5 1
13 6 2
14 6 2
15 6 2
16 7 2
17 5 2
18 6 3
19 7 3
20 5 3
Explanation
a. df['step']==6 gives boolean values - True if the step is 6
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 True
9 True
10 True
11 False
12 False
13 True
14 True
15 True
16 False
17 False
18 True
19 False
20 False
Name: step, dtype: bool
b. df.shift(1)['step'] != 6 shifts the data down by 1 row and then checks whether the value is not equal to 6.
When both conditions are satisfied, you want to increment; .cumsum() takes care of that. Hope that helps!
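Putting the pieces together, a self-contained sketch using the sample data from the question (the DataFrame construction is reconstructed from the output above):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({'step': [2, 2, 2, 3, 4, 4, 5, 6, 6, 6, 6, 7,
                            5, 6, 6, 6, 7, 5, 6, 7, 5]})

# True exactly on the first row of each run of 6s; shift(1) is NaN on
# row 0, and NaN != 6 evaluates True, so a leading 6 would still count
first_six = (df['step'] == 6) & (df['step'].shift(1) != 6)
df['counter'] = first_six.cumsum()
print(df)
```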
P.S. - Although it's a good question, going forward please avoid pasting images. You can paste the data directly and format it as code, which helps the people answering to copy-paste it.

Use:
df = pd.DataFrame({'step':[2, 2, 2, 3, 4, 4, 5, 6, 6, 6, 6, 7, 5, 6, 6, 6, 7, 5, 6, 7, 5]})
a = df['step'] == 6
b = (~a).shift()
b[0] = a[0]
df['counter1'] = (a & b).cumsum()
print (df)
step counter1
0 2 0
1 2 0
2 2 0
3 3 0
4 4 0
5 4 0
6 5 0
7 6 1
8 6 1
9 6 1
10 6 1
11 7 1
12 5 1
13 6 2
14 6 2
15 6 2
16 7 2
17 5 2
18 6 3
19 7 3
20 5 3
Explanation:
Get boolean mask for comparing with 6:
a = df['step'] == 6
Invert Series and shift:
b = (~a).shift()
If the first value were 6, shift would produce NaN there and the first group would be lost, so set the first value of b from the first value of a:
b[0] = a[0]
Chain conditions by bitwise and - &:
c = a & b
Get cumulative sum:
d = c.cumsum()
print (pd.concat([df['step'], a, b, c, d], axis=1, keys=list('abcde')))
a b c d e
0 2 False False False 0
1 2 False True False 0
2 2 False True False 0
3 3 False True False 0
4 4 False True False 0
5 4 False True False 0
6 5 False True False 0
7 6 True True True 1
8 6 True False False 1
9 6 True False False 1
10 6 True False False 1
11 7 False False False 1
12 5 False True False 1
13 6 True True True 2
14 6 True False False 2
15 6 True False False 2
16 7 False False False 2
17 5 False True False 2
18 6 True True True 3
19 7 False False False 3
20 5 False True False 3
If performance is important, use numpy solution:
a = (df['step'] == 6).values
b = np.insert((~a)[:-1], 0, a[0])
df['counter1'] = np.cumsum(a & b)
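A quick, self-contained sketch of the NumPy variant (same data as the question; counter1 is the column name used in this answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'step': [2, 2, 2, 3, 4, 4, 5, 6, 6, 6, 6, 7,
                            5, 6, 6, 6, 7, 5, 6, 7, 5]})

a = (df['step'] == 6).values       # True where step == 6
b = np.insert((~a)[:-1], 0, a[0])  # previous row not 6; row 0 handled explicitly
df['counter1'] = np.cumsum(a & b)
```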

If your DataFrame is called df, one possible way without iteration is
df['counter'] = 0
df.loc[1:, 'counter'] = ((df['step'].values[1:] == 6) & (df['step'].values[:-1] != 6)).cumsum()
This creates two boolean arrays whose conjunction is True when the previous row did not contain a 6 and the current row does. Taking the cumulative sum of that array gives the counter.
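A runnable sketch of this approach, reconstructing df from the question's data (like the snippet above, it assumes the first row is not itself a 6):

```python
import pandas as pd

df = pd.DataFrame({'step': [2, 2, 2, 3, 4, 4, 5, 6, 6, 6, 6, 7,
                            5, 6, 6, 6, 7, 5, 6, 7, 5]})

values = df['step'].values
df['counter'] = 0
# rows 1..n-1: current value is 6 and the previous one is not
df.loc[1:, 'counter'] = ((values[1:] == 6) & (values[:-1] != 6)).cumsum()
```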

That's not a silly question. To get the desired output in your counter column, you can try (for example) this:
steps = [2, 2, 2, 3, 4, 4, 5, 6, 6, 6, 6, 7, 5, 6, 6, 6, 7, 5, 6, 7, 5]
counter = [idx for idx in range(len(steps)) if steps[idx] == 6 and (idx==0 or steps[idx-1] != 6)]
print(counter)
results in:
>> [7, 13, 18]
These are the indices in steps where a first 6 occurred. You can now get the total number of times that has happened with len(counter), or reproduce the second column exactly as you have given it with
counter_column = [0]
for idx in range(len(steps)):
    counter_column.append(counter_column[-1])
    if idx in counter:
        counter_column[-1] += 1
counter_column = counter_column[1:]  # drop the seed value so the list lines up with steps
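The two passes can also be fused into a single loop that tracks whether the previous step was a 6 (an alternative sketch, not part of the answer above):

```python
steps = [2, 2, 2, 3, 4, 4, 5, 6, 6, 6, 6, 7, 5, 6, 6, 6, 7, 5, 6, 7, 5]

counter_column = []
count = 0
prev_was_six = False
for s in steps:
    if s == 6 and not prev_was_six:   # first 6 of a run
        count += 1
    counter_column.append(count)
    prev_was_six = (s == 6)
```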

If your DataFrame is called df, it's
import pandas as pd
q_list = [2, 2, 2, 3, 4, 4, 5, 6, 6, 6, 6, 7, 5, 6, 6, 6, 7, 5, 6, 7, 5]
df = pd.DataFrame(q_list, columns=['step'])
counter = 0
flag = False
for index, row in df.iterrows():
    if row['step'] == 6 and not flag:
        counter += 1
        flag = True
    elif row['step'] != 6 and flag:
        flag = False
    df.at[index, 'counter'] = counter
(Note: df.set_value was removed in pandas 1.0; df.at is the replacement.)

Related

Pandas: find interval distance from N consecutive to M consecutive

TLDR version:
I have a column like below,
[2, 2, 0, 0, 0, 2, 2, 0, 3, 3, 3, 0, 0, 2, 2, 0, 0, 0, 0, 2, 2, 0, 0, 0, 3, 3, 3]
# there may be more run lengths, like 4, 5, 6, 7, 8...
I need a function with parameters n and m. If I use
n=2, m=3,
I should get the distance between each run of 2 and the following run of 3, so the final result would be:
[6, 9]
Detailed version
Here is the test case. I'm writing a function that, given n and m, generates a list of distances between consecutive runs. Currently, this function only works with a single parameter N (the distance from one N-run to another N-run). I want to change it so that it also accepts M.
dummy = [1,1,0,0,0,1,1,0,1,1,1,0,0,1,1,0,0,0,0,1,1,0,0,0,1,1,1]
df = pd.DataFrame({'a': dummy})
What I write currently,
def get_N_seq_stat(df, N=2, M=3):
    df["c1"] = (
        df.groupby(df.a.ne(df.a.shift()).cumsum())["a"]
        .transform("size")
        .where(df.a.eq(1), 0)
    )
    df["c2"] = np.where(df.c1.ne(N), 1, 0)
    df["c3"] = df["c2"].ne(df["c2"].shift()).cumsum()
    result = df.loc[df["c2"] == 1].groupby("c3")["c2"].count().tolist()
    # if the last N rows are not a run of N, don't count the trailing stretch
    if not (df["c1"].tail(N) == N).all():
        del result[-1]
    if not (df["c1"].head(N) == N).all():
        del result[0]
    return result
If I set N=2, M=3 (from 2 consecutive to 3 consecutive), then the ideal return value would be [6, 9], as marked below.
dummy = [1,1,**0,0,0,1,1,0,**1,1,1,0,0,1,1,**0,0,0,0,1,1,0,0,0,**1,1,1]
Currently, if I set N=2, the returned list is [3, 6, 4], because
dummy = [1,1,**0,0,0,**1,1,**0,1,1,1,0,0,**1,1,**0,0,0,0,**1,1,0,0,0,1,1,1]
I would modify your code this way:
def get_N_seq_stat(df, N=2, M=3, debug=False):
    # get number of consecutive 1s
    c1 = (
        df.groupby(df.a.ne(df.a.shift()).cumsum())["a"]
        .transform("size")
        .where(df.a.eq(1), 0)
    )
    # find stretches between N and M
    m1 = c1.eq(N)
    m2 = c1.eq(M)
    c2 = pd.Series(np.select([m1.shift() & ~m1, m2], [True, False], np.nan),
                   index=df.index).ffill().eq(1)
    # debug mode to understand how this works
    if debug:
        return df.assign(c1=c1, c2=c2,
                         length=c2[c2].groupby(c2.ne(c2.shift()).cumsum())
                                      .transform('size'))
    # get the length of the stretches
    return c2[c2].groupby(c2.ne(c2.shift()).cumsum()).size().to_list()
get_N_seq_stat(df, N=2, M=3)
Output: [6, 9]
Intermediate c1, c2, and length:
get_N_seq_stat(df, N=2, M=3, debug=True)
a c1 c2 length
0 1 2 False NaN
1 1 2 False NaN
2 0 0 True 6.0
3 0 0 True 6.0
4 0 0 True 6.0
5 1 2 True 6.0
6 1 2 True 6.0
7 0 0 True 6.0
8 1 3 False NaN
9 1 3 False NaN
10 1 3 False NaN
11 0 0 False NaN
12 0 0 False NaN
13 1 2 False NaN
14 1 2 False NaN
15 0 0 True 9.0
16 0 0 True 9.0
17 0 0 True 9.0
18 0 0 True 9.0
19 1 2 True 9.0
20 1 2 True 9.0
21 0 0 True 9.0
22 0 0 True 9.0
23 0 0 True 9.0
24 1 3 False NaN
25 1 3 False NaN
26 1 3 False NaN

How to remove consecutive pairs of opposite numbers from Pandas Dataframe?

How can I remove consecutive pairs of equal numbers with opposite signs from a Pandas dataframe?
Assuming i have this input dataframe
incremental_changes = [2, -2, 2, 1, 4, 5, -5, 7, -6, 6]
df = pd.DataFrame({
'idx': range(len(incremental_changes)),
'incremental_changes': incremental_changes
})
idx incremental_changes
0 0 2
1 1 -2
2 2 2
3 3 1
4 4 4
5 5 5
6 6 -5
7 7 7
8 8 -6
9 9 6
I would like to get the following
idx incremental_changes
0 0 2
3 3 1
4 4 4
7 7 7
Note that the first 2 could either be idx 0 or 2, it doesn't really matter.
Thanks
You can group by runs of consecutive equal absolute values and transform:
import itertools
def remove_duplicates(s):
    '''Generate booleans that indicate when a pair of ints with
    opposite signs is found.
    '''
    iter_ = iter(s)
    for (a, b) in itertools.zip_longest(iter_, iter_):
        if b is None:
            yield False
        else:
            yield a + b == 0
            yield a + b == 0

>>> mask = df.groupby(df['incremental_changes'].abs().diff().ne(0).cumsum()) \
             ['incremental_changes'] \
             .transform(remove_duplicates)
Then
>>> df[~mask]
idx incremental_changes
2 2 2
3 3 1
4 4 4
7 7 7
Just do rolling, then filter out cases where multiple pairs would combine:
s = df.incremental_changes.rolling(2).sum()
s = s.mask(s[s==0].groupby(s.ne(0).cumsum()).cumcount()==1)==0
df[~(s | s.shift(-1))]
Out[640]:
idx incremental_changes
2 2 2
3 3 1
4 4 4
7 7 7

Turn list clockwise for one time

How can I rotate a list clockwise one time? I have a temporary solution, but I'm sure there is a better way to do it.
I want to get from this
Index: 0 1 2 3 4 5 6 7 8 9
Count: 0 2 4 4 5 6 6 7 7 7
to this:
Index: 0 1 2 3 4 5 6 7 8 9
Count: 0 0 2 4 4 5 6 6 7 7
And my temporary "solution" is just:
temporary = [0, 2, 4, 4, 5, 6, 6, 7, 7, 7]
test = [None] * len(temporary)
test[0] = temporary[0]
for index in range(1, len(temporary)):
    test[index] = temporary[index - 1]
You might use temporary.pop() to discard the last item and temporary.insert(0, 0) to add 0 to the front.
Alternatively in one line:
temporary = [0] + temporary[:-1]
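Another common idiom, for what it's worth, is collections.deque: rotate(1) moves the last element to the front, which you then overwrite with the fill value (a sketch, not from the original answer):

```python
from collections import deque

temporary = [0, 2, 4, 4, 5, 6, 6, 7, 7, 7]
d = deque(temporary)
d.rotate(1)   # last element moves to the front
d[0] = 0      # replace it with the fill value
result = list(d)
```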

Checking if values of a pandas Dataframe are between two lists. Adding a boolean column

I am trying to add a new column (False/True) to a pandas DataFrame, which reflects whether the value is between two datapoints from another file.
I have a two files which give the following info:
File A: (x)        File B: (y)
   time               time_A  time_B
0     1            0       1       3
1     3            1       5       6
2     5            2       8      10
3     7
4     9
5    11
6    13
I tried to do it with the .map function; however, it gives True and False for each event, not one column.
x['Event'] = x['time'].map((lambda x: x < y['time_A']), (lambda x: x > y['time_B']))
This would be the expected result
->
File A:
'time' 'Event'
0 1 True
1 3 True
2 5 True
3 7 False
4 9 True
5 11 False
6 13 False
However what i get is something like this
->
File A:
'time'
0 1 "0 True
1 True
2 True"
Name:1, dtype:bool"
2 3 "0 True
1 True
2 True
Name:1, dtype:bool"
This should do it:
(x.assign(key=1)
 .merge(y.assign(key=1),
        on='key')
 .drop(columns='key')
 .assign(Event=lambda v: (v['time_A'] <= v['time']) &
                         (v['time'] <= v['time_B']))
 .groupby('time', as_index=False)['Event']
 .any())
time Event
0 1 True
1 3 True
2 5 True
3 7 False
4 9 True
5 11 False
6 13 False
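For reference, a self-contained version of this cross-join approach, reconstructing x and y from the question's data (newer pandas, 1.2+, can spell the cross join directly as merge(how='cross') instead of the key=1 trick):

```python
import pandas as pd

x = pd.DataFrame({'time': [1, 3, 5, 7, 9, 11, 13]})
y = pd.DataFrame({'time_A': [1, 5, 8], 'time_B': [3, 6, 10]})

result = (
    x.merge(y, how='cross')   # pair every time with every interval
     .assign(Event=lambda v: v['time'].between(v['time_A'], v['time_B']))
     .groupby('time', as_index=False)['Event']
     .any()                   # True if any interval matched
)
```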
Use pd.IntervalIndex here:
idx = pd.IntervalIndex.from_arrays(B['time_A'], B['time_B'], closed='both')
# output -> IntervalIndex([[1, 3], [5, 6], [8, 10]], closed='both', dtype='interval[int64]')
A['Event'] = B.set_index(idx).reindex(A['time']).notna().all(1).to_numpy()
print(A)
time Event
0 1 True
1 3 True
2 5 True
3 7 False
4 9 True
5 11 False
6 13 False
One liner:
A['Event'] = sum(A.time.between(b.time_A, b.time_B) for _, b in B.iterrows()) > 0
Explain:
For each row b of the B dataframe, A.time.between(b.time_A, b.time_B) returns a boolean Series indicating whether time falls between time_A and time_B.
sum(list_of_boolean_series) > 0 then acts as an elementwise OR.
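The same elementwise OR can be vectorized in one step with NumPy broadcasting (a sketch, reconstructing A and B from the question's data):

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({'time': [1, 3, 5, 7, 9, 11, 13]})
B = pd.DataFrame({'time_A': [1, 5, 8], 'time_B': [3, 6, 10]})

t = A['time'].to_numpy()[:, None]   # shape (7, 1), one row per time
lo = B['time_A'].to_numpy()         # shape (3,), one column per interval
hi = B['time_B'].to_numpy()
A['Event'] = ((t >= lo) & (t <= hi)).any(axis=1)
```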

Assign value to subset of rows in Pandas dataframe

I want to assign values based on a condition on index in Pandas DataFrame.
class test():
    def __init__(self):
        self.l = 1396633637830123000
        self.dfa = pd.DataFrame(np.arange(20).reshape(10, 2), columns=['A', 'B'],
                                index=np.arange(self.l, self.l + 10))
        self.dfb = pd.DataFrame([[self.l + 1, self.l + 3], [self.l + 6, self.l + 9]],
                                columns=['beg', 'end'])

    def update(self):
        self.dfa['true'] = False
        self.dfa['idx'] = np.nan
        for i, beg, end in zip(self.dfb.index, self.dfb['beg'], self.dfb['end']):
            self.dfa.ix[beg:end]['true'] = True
            self.dfa.ix[beg:end]['idx'] = i

    def do(self):
        self.update()
        print(self.dfa)

t = test()
t.do()
Result:
A B true idx
1396633637830123000 0 1 False NaN
1396633637830123001 2 3 True NaN
1396633637830123002 4 5 True NaN
1396633637830123003 6 7 True NaN
1396633637830123004 8 9 False NaN
1396633637830123005 10 11 False NaN
1396633637830123006 12 13 True NaN
1396633637830123007 14 15 True NaN
1396633637830123008 16 17 True NaN
1396633637830123009 18 19 True NaN
The true column is correctly assigned, while the idx column is not. Furthermore, this seems to depend on how the columns are initialized, because if I do:
def update(self):
    self.dfa['true'] = False
    self.dfa['idx'] = False
also the true column does not get properly assigned.
What am I doing wrong?
p.s. the expected result is:
A B true idx
1396633637830123000 0 1 False NaN
1396633637830123001 2 3 True 0
1396633637830123002 4 5 True 0
1396633637830123003 6 7 True 0
1396633637830123004 8 9 False NaN
1396633637830123005 10 11 False NaN
1396633637830123006 12 13 True 1
1396633637830123007 14 15 True 1
1396633637830123008 16 17 True 1
1396633637830123009 18 19 True 1
Edit: I tried assigning using both loc and iloc but it doesn't seem to work:
loc:
self.dfa.loc[beg:end]['true'] = True
self.dfa.loc[beg:end]['idx'] = i
iloc:
self.dfa.loc[self.dfa.index.get_loc(beg):self.dfa.index.get_loc(end)]['true'] = True
self.dfa.loc[self.dfa.index.get_loc(beg):self.dfa.index.get_loc(end)]['idx'] = i
You are using chained indexing, see here. The warning is not guaranteed to happen.
You should probably just do this. No real need to actually track the index in b, btw.
In [44]: dfa = pd.DataFrame(np.arange(20).reshape(10,2), columns = ['A', 'B'], index = np.arange(l,l+10))
In [45]: dfb = pd.DataFrame([[l+1,l+3], [l+6,l+9]], columns = ['beg', 'end'])
In [46]: dfa['in_b'] = False
In [47]: for i, s in dfb.iterrows():
....: dfa.loc[s['beg']:s['end'],'in_b'] = True
....:
or this if you have non-integer dtypes
In [36]: for i, s in dfb.iterrows():
dfa.loc[(dfa.index>=s['beg']) & (dfa.index<=s['end']),'in_b'] = True
In [48]: dfa
Out[48]:
A B in_b
1396633637830123000 0 1 False
1396633637830123001 2 3 True
1396633637830123002 4 5 True
1396633637830123003 6 7 True
1396633637830123004 8 9 False
1396633637830123005 10 11 False
1396633637830123006 12 13 True
1396633637830123007 14 15 True
1396633637830123008 16 17 True
1396633637830123009 18 19 True
[10 rows x 3 columns]
If b is HUGE this might not be THAT performant.
As an aside, these look like nanosecond times. Can be more friendly by converting them.
In [49]: pd.to_datetime(dfa.index)
Out[49]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-04-04 17:47:17.830123, ..., 2014-04-04 17:47:17.830123009]
Length: 10, Freq: None, Timezone: None
