Pandas/Numpy group value changes and derivative value changes above/below 0 - python

I have a series of values (Pandas DF or Numpy Arr):
vals = [0,1,3,4,5,5,4,2,1,0,-1,-2,-3,-2,3,5,8,4,2,0,-1,-3,-8,-20,-10,-5,-2,-1,0,1,2,3,5,6,8,4,3]
df = pd.DataFrame({'val': vals})
I want to classify/group the values into 4 categories:
Increasing above 0
Decreasing above 0
Increasing below 0
Decreasing below 0
Current approach with Pandas is to categorize into above/below 0 and then that into increasing/decreasing by seeing when diff values change above/below 0.
df['above_zero'] = np.where(df['val'] >= 0, 1, 0)
df['below_zero'] = np.where(df['val'] < 0, 1, 0)
df['diffs'] = df['val'].diff()
df['diff_above_zero'] = np.where(df['diffs'] >= 0, 1, 0)
df['diff_below_zero'] = np.where(df['diffs'] < 0, 1, 0)
This produces the desired flag columns, but I am now looking for a way to turn them into an ascending group number that increments as soon as one of the 4 conditions changes.
Desired output would look like this (*group col is manually typed, might have errors from calculated values):
id val above_zero below_zero diffs diff_above_zero diff_below_zero group
0 0 1 0 0.0 1 0 0
1 1 1 0 1.0 1 0 0
2 3 1 0 2.0 1 0 0
3 4 1 0 1.0 1 0 0
4 5 1 0 1.0 1 0 0
5 5 1 0 0.0 1 0 0
6 4 1 0 -1.0 0 1 1
7 2 1 0 -2.0 0 1 1
8 1 1 0 -1.0 0 1 1
9 0 1 0 -1.0 0 1 1
10 -1 0 1 -1.0 0 1 2
11 -2 0 1 -1.0 0 1 2
12 -3 0 1 -1.0 0 1 2
13 -2 0 1 1.0 1 0 3
14 3 1 0 5.0 1 0 4
15 5 1 0 2.0 1 0 4
16 8 1 0 3.0 1 0 4
17 4 1 0 -4.0 0 1 5
18 2 1 0 -2.0 0 1 5
19 0 1 0 -2.0 0 1 5
20 -1 0 1 -1.0 0 1 6
21 -3 0 1 -2.0 0 1 6
22 -8 0 1 -5.0 0 1 6
23 -20 0 1 -12.0 0 1 6
24 -10 0 1 10.0 1 0 7
25 -5 0 1 5.0 1 0 7
26 -2 0 1 3.0 1 0 7
27 -1 0 1 1.0 1 0 7
28 0 1 0 1.0 1 0 8
29 1 1 0 1.0 1 0 8
30 2 1 0 1.0 1 0 8
31 3 1 0 1.0 1 0 8
32 5 1 0 2.0 1 0 8
33 6 1 0 1.0 1 0 8
34 8 1 0 2.0 1 0 8
35 4 1 0 -4.0 0 1 9
36 3 1 0 -1.0 0 1 9
Would appreciate any help on how to solve this efficiently. Thanks!

Setup
g1 = ['above_zero', 'below_zero', 'diff_above_zero', 'diff_below_zero']
You can simply index all of your boolean columns, and use shift:
c = df.loc[:, g1]
(c != c.shift().fillna(c)).any(axis=1).cumsum()
0 0
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 1
10 2
11 2
12 2
13 3
14 4
15 4
16 4
17 5
18 5
19 5
20 6
21 6
22 6
23 6
24 7
25 7
26 7
27 7
28 8
29 8
30 8
31 8
32 8
33 8
34 8
35 9
36 9
dtype: int32
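The shift-and-cumsum idea above can be sketched end to end as follows. This is a minimal sketch, not the answerer's exact code: it assumes the first diff is filled with 0 (as in the desired output table), and the names flags, changed and group are only illustrative.

```python
import pandas as pd

vals = [0, 1, 3, 4, 5, 5, 4, 2, 1, 0, -1, -2, -3, -2, 3, 5, 8, 4, 2, 0,
        -1, -3, -8, -20, -10, -5, -2, -1, 0, 1, 2, 3, 5, 6, 8, 4, 3]
df = pd.DataFrame({'val': vals})

# Two flags are enough: the "below" flags are just their complements.
# Assumption: the first diff (NaN) is treated as 0, i.e. "not decreasing".
flags = pd.DataFrame({
    'above_zero': (df['val'] >= 0).astype(int),
    'diff_above_zero': (df['val'].diff().fillna(0) >= 0).astype(int),
})

# A row starts a new group when any flag differs from the previous row.
changed = flags.ne(flags.shift()).any(axis=1)
changed.iloc[0] = False  # the first row has no predecessor
group = changed.cumsum()
print(group.tolist())
```

Dropping the two redundant "below" columns does not change the group boundaries, since each of them flips exactly when its "above" counterpart flips.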

The following code will produce two columns: c1 and c2.
The values of c1 correspond to the following 4 categories:
0 means below zero and increasing
1 means above zero and increasing
2 means below zero and decreasing
3 means above zero and decreasing
And c2 is the ascending group number, which increments as soon as the condition (i.e. c1) changes, as you wanted. Credit to #user3483203 for combining shift with cumsum.
# calculate difference
df["diff"] = df['val'].diff()
# set first value in column 'diff' to 0 (as previous step sets it to NaN)
df.loc[0, 'diff'] = 0
df["c1"] = (df['val'] >= 0).astype(int) + (df["diff"] < 0).astype(int) * 2
df["c2"] = (df["c1"] != df["c1"].shift().fillna(df["c1"])).astype(int).cumsum()
Result:
val diff c1 c2
0 0 0.0 1 0
1 1 1.0 1 0
2 3 2.0 1 0
3 4 1.0 1 0
4 5 1.0 1 0
5 5 0.0 1 0
6 4 -1.0 3 1
7 2 -2.0 3 1
8 1 -1.0 3 1
9 0 -1.0 3 1
10 -1 -1.0 2 2
11 -2 -1.0 2 2
12 -3 -1.0 2 2
13 -2 1.0 0 3
14 3 5.0 1 4
15 5 2.0 1 4
16 8 3.0 1 4
17 4 -4.0 3 5
18 2 -2.0 3 5
19 0 -2.0 3 5
20 -1 -1.0 2 6
21 -3 -2.0 2 6
22 -8 -5.0 2 6
23 -20 -12.0 2 6
24 -10 10.0 0 7
25 -5 5.0 0 7
26 -2 3.0 0 7
27 -1 1.0 0 7
28 0 1.0 1 8
29 1 1.0 1 8
30 2 1.0 1 8
31 3 1.0 1 8
32 5 2.0 1 8
33 6 1.0 1 8
34 8 2.0 1 8
35 4 -4.0 3 9
36 3 -1.0 3 9
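Since the question also mentions NumPy arrays, the same c1/c2 encoding can be sketched without pandas. This is only an illustration under the same assumption as above (the first difference is taken to be 0); np.diff with prepend is a standard NumPy call, the variable names are arbitrary.

```python
import numpy as np

vals = np.array([0, 1, 3, 4, 5, 5, 4, 2, 1, 0, -1, -2, -3, -2, 3, 5, 8, 4,
                 2, 0, -1, -3, -8, -20, -10, -5, -2, -1, 0, 1, 2, 3, 5, 6,
                 8, 4, 3])

# Prepending the first value makes the first difference 0 instead of missing.
diff = np.diff(vals, prepend=vals[0])

# Bit 0: value is >= 0; bit 1: value is decreasing.
c1 = (vals >= 0).astype(int) + 2 * (diff < 0).astype(int)

# A new group starts wherever the category differs from the previous one.
c2 = np.concatenate(([0], np.cumsum(c1[1:] != c1[:-1])))
print(c1)
print(c2)
```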

Related

How to find out the cumulative count between numbers?

I want to find the cumulative count before there is a change in value, i.e. how many rows have passed since the last change. For illustration:
Value diff #row since last change (how do I create this column?)
6 na na
5 -1 0
5 0 1
5 0 2
4 -1 0
4 0 1
4 0 2
4 0 3
4 0 4
5 1 0
5 0 1
5 0 2
5 0 3
6 1 0
7 1 0
I tried to use cumsum, but it does not reset after each change.
IIUC, use a cumcount per group:
df['new'] = df.groupby(df['Value'].ne(df['Value'].shift()).cumsum()).cumcount()
output:
Value diff new
0 6 na 0
1 5 -1 0
2 5 0 1
3 5 0 2
4 4 -1 0
5 4 0 1
6 4 0 2
7 4 0 3
8 4 0 4
9 5 1 0
10 5 0 1
11 5 0 2
12 5 0 3
13 6 1 0
14 7 1 0
If you want NaN wherever diff is NaN, you can mask the output:
df['new'] = (df.groupby(df['Value'].ne(df['Value'].shift()).cumsum()).cumcount()
.mask(df['diff'].isna())
)
output:
Value diff new
0 6 NaN NaN
1 5 -1.0 0.0
2 5 0.0 1.0
3 5 0.0 2.0
4 4 -1.0 0.0
5 4 0.0 1.0
6 4 0.0 2.0
7 4 0.0 3.0
8 4 0.0 4.0
9 5 1.0 0.0
10 5 0.0 1.0
11 5 0.0 2.0
12 5 0.0 3.0
13 6 1.0 0.0
14 7 1.0 0.0
If performance is important, count the consecutive 0 values in the diff column instead:
m = df['diff'].eq(0)
b = m.cumsum()
df['out'] = b.sub(b.mask(m).ffill().fillna(0)).astype(int)
print (df)
Value diff need out
0 6 NaN na 0
1 5 -1.0 0 0
2 5 0.0 1 1
3 5 0.0 2 2
4 4 -1.0 0 0
5 4 0.0 1 1
6 4 0.0 2 2
7 4 0.0 3 3
8 4 0.0 4 4
9 5 1.0 0 0
10 5 0.0 1 1
11 5 0.0 2 2
12 5 0.0 3 3
13 6 1.0 0 0
14 7 1.0 0 0
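To see that the two approaches above agree on the sample data, here is a small self-contained sketch. The frame is rebuilt from the Value column in the question; the column names new and out follow the answers, and run_id is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'Value': [6, 5, 5, 5, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 7]})
df['diff'] = df['Value'].diff()

# Approach 1: label each run of equal values, then count within the run.
run_id = df['Value'].ne(df['Value'].shift()).cumsum()
df['new'] = df.groupby(run_id).cumcount()

# Approach 2: running count of zero diffs, minus the count at the last change.
m = df['diff'].eq(0)
b = m.cumsum()
df['out'] = b.sub(b.mask(m).ffill().fillna(0)).astype(int)

print(df)
```

Both columns restart at 0 on every change of Value, so they should match row for row on this input.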

Percentage of events before and after a sequence of zeros in pandas rows

I have a dataframe like the following:
ID 0 1 2 3 4 5 6 7 8 ... 81 82 83 84 85 86 87 88 89 90 total
-----------------------------------------------------------------------------------------------------
0 A 2 21 0 18 3 0 0 0 2 ... 0 0 0 0 0 0 0 0 0 0 156
1 B 0 20 12 2 0 8 14 23 0 ... 0 0 0 0 0 0 0 0 0 0 231
2 C 0 38 19 3 1 3 3 7 1 ... 0 0 0 0 0 0 0 0 0 0 78
3 D 3 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 5
and I want to know the % of events (the numbers in the cells) before and after the first sequence of zeros of length n appears in each row. This problem started as another question found here: Length of first sequence of zeros of given size after certain column in pandas dataframe, and I am trying to modify the code to do what I need, but I keep getting errors and can't seem to find the right way. This is what I have tried:
def func(row, n):
    """Returns the number of events before the
    first sequence of 0s of length n is found
    """
    idx = np.arange(0, 91)
    a = row[idx]
    b = (a != 0).cumsum()
    c = b[a == 0]
    d = c.groupby(c).count()
    #in case there is no sequence of 0s with length n
    try:
        e = c[c >= d.index[d >= n][0]]
        f = str(e.index[0])
    except IndexError:
        e = [90]
        f = str(e[0])
    idx_sliced = np.arange(0, int(f)+1)
    a = row[idx_sliced]
    if (int(f) + n > 90):
        perc_before = 100
    else:
        perc_before = a.cumsum().tail(1).values[0]/row['total']
    return perc_before
As is, the error I get is:
---> perc_before = a.cumsum().tail(1).values[0]/row['total']
TypeError: ('must be str, not int', 'occurred at index 0')
Finally, I would apply this function to a dataframe and return a new column with the % of events before the first sequence of n 0s in each row, like this:
ID 0 1 2 3 4 5 6 7 8 ... 81 82 83 84 85 86 87 88 89 90 total %_before
---------------------------------------------------------------------------------------------------------------
0 A 2 21 0 18 3 0 0 0 2 ... 0 0 0 0 0 0 0 0 0 0 156 43
1 B 0 20 12 2 0 8 14 23 0 ... 0 0 0 0 0 0 0 0 0 0 231 21
2 C 0 38 19 3 1 3 3 7 1 ... 0 0 0 0 0 0 0 0 0 0 78 90
3 D 3 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 5 100
If trying to solve this, you can test by using this sample input:
a = pd.Series([1,1,13,0,0,0,4,0,0,0,0,0,12,1,1])
b = pd.Series([1,1,13,0,0,0,4,12,1,12,3,0,0,5,1])
c = pd.Series([1,1,13,0,0,0,4,12,2,0,5,0,5,1,1])
d = pd.Series([1,1,13,0,0,0,4,12,1,12,4,50,0,0,1])
e = pd.Series([1,1,13,0,0,0,4,12,0,0,0,54,0,1,1])
df = pd.DataFrame({'0':a, '1':b, '2':c, '3':d, '4':e})
df = df.transpose()
Give this a try:
def percent_before(row, n, ncols):
    """Return the fraction of activities that happen before
    the first sequence of at least `n` consecutive 0s
    """
    start_index, i, size = 0, 0, 0
    for i in range(ncols):
        if row[i] == 0:
            # increase the size of the island
            size += 1
        elif size >= n:
            # found the island we want
            break
        else:
            # start a new island
            # row[start_index] is the last non-zero value seen so far
            start_index = i
            size = 0
    if size < n:
        # didn't find the island we want
        return 1
    else:
        # get the sum of activities that happen
        # before the island
        idx = np.arange(0, start_index + 1).astype(str)
        return row.loc[idx].sum() / row['total']
df['percent_before'] = df.apply(percent_before, n=3, ncols=15, axis=1)
Result:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 total percent_before
0 1 1 13 0 0 0 4 0 0 0 0 0 12 1 1 33 0.454545
1 1 1 13 0 0 0 4 12 1 12 3 0 0 5 1 53 0.283019
2 1 1 13 0 0 0 4 12 2 0 5 0 5 1 1 45 0.333333
3 1 1 13 0 0 0 4 12 1 12 4 50 0 0 1 99 0.151515
4 1 1 13 0 0 0 4 12 0 0 0 54 0 1 1 87 0.172414
For the full frame, call apply with ncols=91.
Another possible solution:
def get_vals(df, n):
    df, out = df.T, []
    for col in df.columns:
        diff_to_previous = df[col] != df[col].shift(1)
        g = df.groupby(diff_to_previous.cumsum())[col].agg(['idxmin', 'size'])
        vals = df.loc[g.loc[g['size'] >= n, 'idxmin'].values, col]
        if len(vals):
            out.append( df.loc[np.arange(0, vals[vals == 0].index[0]), col].sum() / df[col].sum() )
        else:
            out.append( 1.0 )
    return out
df['percent_before'] = get_vals(df, n=3)
print(df)
Prints:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 percent_before
0 1 1 13 0 0 0 4 0 0 0 0 0 12 1 1 0.454545
1 1 1 13 0 0 0 4 12 1 12 3 0 0 5 1 0.283019
2 1 1 13 0 0 0 4 12 2 0 5 0 5 1 1 0.333333
3 1 1 13 0 0 0 4 12 1 12 4 50 0 0 1 0.151515
4 1 1 13 0 0 0 4 12 0 0 0 54 0 1 1 0.172414
Since one of the comments on the previous question was about speed, you can try to vectorize the problem. I used this dataframe to test (slightly different from your original input):
ID 0 1 2 3 4 5 6 7 8 total
0 A 2 21 0 18 3 0 0 0 2 46
1 B 0 0 12 2 0 8 14 23 0 59
2 C 0 38 19 3 1 3 3 7 1 75
3 D 3 0 0 1 0 0 0 0 0 4
The idea is to chain commands to build a mask: find where the data is not equal to 0, take the cumsum along the column axis, and check where the n-step diff along the columns equals 0 (which marks the end of a run of n zeros). To keep only the first such run, use cummax so that all columns from that point on (row-wise) are True. Mask the original dataframe with the inverse of this mask, sum along the columns and divide by total. For example, with n=2:
n=2
df['%_before'] = df[~(df.ne(0).cumsum(axis=1).diff(n, axis=1)[range(9)]
.eq(0).cummax(axis=1))].sum(axis=1)/df.total
print (df)
ID 0 1 2 3 4 5 6 7 8 total %_before
0 A 2 21 0 18 3 0 0 0 2 46 0.956522
1 B 0 0 12 2 0 8 14 23 0 59 0.000000
2 C 0 38 19 3 1 3 3 7 1 75 1.000000
3 D 3 0 0 1 0 0 0 0 0 4 0.750000
In your case, replace range(9) with range(91) to cover all your columns.
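The same masking idea can be written step by step with NumPy, which may make the intermediate arrays easier to inspect. This is a sketch under stated assumptions, not the answerer's code: the helper name fraction_before_zero_run is hypothetical, the input frame holds only the numeric event columns (no ID or total column), n is at least 1, and every row has a non-zero total.

```python
import numpy as np
import pandas as pd


def fraction_before_zero_run(counts: pd.DataFrame, n: int) -> pd.Series:
    """Fraction of each row's total that falls before the first run of n zeros."""
    vals = counts.to_numpy()
    n_rows, n_cols = vals.shape

    # prefix[:, j] = number of non-zero cells in columns 0..j-1
    prefix = np.zeros((n_rows, n_cols + 1), dtype=int)
    prefix[:, 1:] = np.cumsum(vals != 0, axis=1)

    # run_end[:, j] is True when columns j-n+1..j contain no non-zero cell
    run_end = np.zeros((n_rows, n_cols), dtype=bool)
    run_end[:, n - 1:] = (prefix[:, n:] - prefix[:, :-n]) == 0

    # once the first run has ended, every later column counts as "after"
    after = np.logical_or.accumulate(run_end, axis=1)

    before_sum = np.where(after, 0, vals).sum(axis=1)
    return pd.Series(before_sum / vals.sum(axis=1), index=counts.index)


events = pd.DataFrame(
    [[2, 21, 0, 18, 3, 0, 0, 0, 2],
     [0, 0, 12, 2, 0, 8, 14, 23, 0],
     [0, 38, 19, 3, 1, 3, 3, 7, 1],
     [3, 0, 0, 1, 0, 0, 0, 0, 0]],
    index=list('ABCD'),
)
result = fraction_before_zero_run(events, n=2)
print(result)
```

The explicit prefix column of zeros plays the role that the leading ID column plays in the chained pandas expression: it lets a run that starts in the very first column be detected.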
You can do this using the rolling method.
For your example input, given the number of zeros is 5, we can use
df.rolling(window=5, axis=1).apply(lambda x : np.sum(x))
The output would look like
0 1 2 3 4 5 6 7 8 9 10 11 12 13 \
0 NaN NaN NaN NaN 15.0 14.0 17.0 4.0 4.0 4.0 4.0 0.0 12.0 13.0
1 NaN NaN NaN NaN 15.0 14.0 17.0 16.0 17.0 29.0 32.0 28.0 16.0 20.0
2 NaN NaN NaN NaN 15.0 14.0 17.0 16.0 18.0 18.0 23.0 19.0 12.0 11.0
3 NaN NaN NaN NaN 15.0 14.0 17.0 16.0 17.0 29.0 33.0 79.0 67.0 66.0
4 NaN NaN NaN NaN 15.0 14.0 17.0 16.0 16.0 16.0 16.0 66.0 54.0 55.0
14
0 14.0
1 9.0
2 12.0
3 55.0
4 56.0
Looking at the output, it's easy to see that in the first row the value in column 11 is 0, which means that starting at position 7 you have 5 zeros.
Since none of the other rows contain a 0, none of them has 5 contiguous zeros.
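Instead of reading the rolling sums by eye, the first qualifying window can be located programmatically. The sketch below is an illustration only: it rolls over the transposed frame rather than passing axis=1, and it assumes the cells are non-negative counts, so a window sum of 0 implies that every cell in the window is 0. The names window_sums, has_run, first_run_end and first_run_start are made up for this example.

```python
import pandas as pd

rows = [
    [1, 1, 13, 0, 0, 0, 4, 0, 0, 0, 0, 0, 12, 1, 1],
    [1, 1, 13, 0, 0, 0, 4, 12, 1, 12, 3, 0, 0, 5, 1],
    [1, 1, 13, 0, 0, 0, 4, 12, 2, 0, 5, 0, 5, 1, 1],
    [1, 1, 13, 0, 0, 0, 4, 12, 1, 12, 4, 50, 0, 0, 1],
    [1, 1, 13, 0, 0, 0, 4, 12, 0, 0, 0, 54, 0, 1, 1],
]
df = pd.DataFrame(rows)
window = 5

# Transpose so that rolling runs along what were originally the columns.
window_sums = df.T.rolling(window).sum().T

# A zero sum marks the last column of a window that is all zeros.
is_run_end = window_sums.eq(0)
has_run = is_run_end.any(axis=1)

# idxmax returns the first True per row; rows without any run become NaN.
first_run_end = is_run_end.idxmax(axis=1).where(has_run)
first_run_start = first_run_end - (window - 1)
print(first_run_start)
```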

How to create a increment var from a first value of a dataframe group?

I have a dataframe as follows:
data=[[0,1,5],
[0,1,6],
[0,0,8],
[0,0,10],
[0,1,12],
[0,0,14],
[0,1,16],
[0,1,18],
[1,0,2],
[1,1,0],
[1,0,1],
[1,0,2]]
df = pd.DataFrame(data,columns=['KEY','COND','VAL'])
For RES1, I want to create a counter variable that increments where COND == 1. Its value for the
first row of each KEY group stays the same as VAL. (Can I use cumcount() in some way?)
For RES2, I just want to fill the missing values with
the previous value; I am thinking of df.fillna(method='ffill').
KEY COND VAL RES1 RES2
0 0 1 5 5 5
1 0 1 6 6 6
2 0 0 8 6
3 0 0 10 6
4 0 1 12 7 7
5 0 0 14 7
6 0 1 16 8 8
7 0 1 18 9 9
8 1 0 2 2 2
9 1 1 0 3 3
10 1 0 1 3
11 1 0 2 3
The aim is to find a vectorized solution that performs well over millions of rows.
IIUC
con=(df.COND==1)|(df.index.isin(df.drop_duplicates('KEY').index))
df['res1']=(df.groupby('KEY').VAL.transform('first')+
            df.groupby('KEY').COND.cumsum()[con]-
            df.groupby('KEY').COND.transform('first'))
df['res2']=df.res1.ffill()
df
Out[148]:
KEY COND VAL res1 res2
0 0 1 5 5.0 5.0
1 0 1 6 6.0 6.0
2 0 0 8 NaN 6.0
3 0 0 10 NaN 6.0
4 0 1 12 7.0 7.0
5 0 0 14 NaN 7.0
6 0 1 16 8.0 8.0
7 0 1 18 9.0 9.0
8 1 0 2 2.0 2.0
9 1 1 0 3.0 3.0
10 1 0 1 NaN 3.0
11 1 0 2 NaN 3.0
You want:
s = (df[df.KEY.duplicated()] # Ignore first row in each KEY group
.groupby('KEY').COND.cumsum() # Counter within KEY
.add(df.groupby('KEY').VAL.transform('first')) # Add first value per KEY
.where(df.COND.eq(1)) # Set only where COND == 1
.add(df.loc[~df.KEY.duplicated(), 'VAL'], fill_value=0) # Set 1st row by KEY
)
df['RES1'] = s
df['RES2'] = df['RES1'].ffill()
KEY COND VAL RES1 RES2
0 0 1 5 5.0 5.0
1 0 1 6 6.0 6.0
2 0 0 8 NaN 6.0
3 0 0 10 NaN 6.0
4 0 1 12 7.0 7.0
5 0 0 14 NaN 7.0
6 0 1 16 8.0 8.0
7 0 1 18 9.0 9.0
8 1 0 2 2.0 2.0
9 1 1 0 3.0 3.0
10 1 0 1 NaN 3.0
11 1 0 2 NaN 3.0
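The two answers above can also be read as one recipe: start from the first VAL of each KEY, add a counter of COND == 1 rows that ignores the first row, and blank out rows that are neither first nor flagged. A minimal sketch of that reading follows; the intermediate names is_first, steps and base are illustrative only.

```python
import pandas as pd

data = [[0, 1, 5], [0, 1, 6], [0, 0, 8], [0, 0, 10], [0, 1, 12], [0, 0, 14],
        [0, 1, 16], [0, 1, 18], [1, 0, 2], [1, 1, 0], [1, 0, 1], [1, 0, 2]]
df = pd.DataFrame(data, columns=['KEY', 'COND', 'VAL'])

# First row of each KEY group keeps its VAL unchanged.
is_first = ~df['KEY'].duplicated()

# Count COND == 1 rows within each KEY, not counting the first row.
steps = df['COND'].where(~is_first, 0).groupby(df['KEY']).cumsum()

# Starting value: the first VAL of each KEY.
base = df.groupby('KEY')['VAL'].transform('first')

# Keep the counter only on first rows and on rows where COND == 1.
df['RES1'] = (base + steps).where(is_first | df['COND'].eq(1))
df['RES2'] = df['RES1'].ffill()
print(df)
```

Because the first row of every KEY always has a value, the forward fill never carries a number across a KEY boundary.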

Pandas dataframe cumulative sum of column except last zero values

I want to compute a cumulative sum on a pandas dataframe without carrying the sum over into the trailing zero values. For example, given a dataframe:
A B
1 1 2
2 5 0
3 10 0
4 10 1
5 0 1
6 5 2
7 0 0
8 0 0
9 0 0
cumulative sum of index 1 to 6 only:
A B
1 1 2
2 6 2
3 16 2
4 26 3
5 26 4
6 31 6
7 0 0
8 0 0
9 0 0
If you do not want the cumulative sum applied to the trailing rows that are 0 in all columns:
Test which rows contain any non-zero value, shift the mask and take its cumulative sum. Then compare with the maximum value to build the filter:
a = df.ne(0).any(axis=1).shift().cumsum()
m = a != a.max()
df[m] = df[m].cumsum()
print (df)
A B
1 1 2
2 6 2
3 16 2
4 26 3
5 26 4
6 31 6
7 0 0
8 0 0
9 0 0
A similar solution, if you want to process each column separately: just omit any:
print (df)
A B
1 1 2
2 5 0
3 10 0
4 10 1
5 0 1
6 5 0
7 0 0
8 0 0
9 0 0
a = df.ne(0).shift().cumsum()
m = a != a.max()
df[m] = df[m].cumsum()
print (df)
A B
1 1 2
2 6 2
3 16 2
4 26 3
5 26 4
6 31 0
7 0 0
8 0 0
9 0 0
Use
In [262]: s = df.ne(0).all(axis=1)
In [263]: l = s[s].index[-1]
In [264]: df[:l] = df.cumsum()
In [265]: df
Out[265]:
A B
1 1 2
2 6 2
3 16 2
4 26 3
5 26 4
6 31 6
7 0 0
8 0 0
9 0 0
I will use last_valid_index
v=df.replace(0,np.nan).apply(lambda x : x.last_valid_index())
df[pd.DataFrame(df.index.values<=v.values[:,None],columns=df.index,index=df.columns).T].cumsum().fillna(0)
Out[890]:
A B
1 1.0 2.0
2 6.0 2.0
3 16.0 2.0
4 26.0 3.0
5 26.0 4.0
6 31.0 6.0
7 0.0 0.0
8 0.0 0.0
9 0.0 0.0
To skip all rows from the first all-zero row onward, get the index of the first row where both df['A'] and df['B'] are 0 using idxmax(0):
>>> m = ((df["A"]==0) & (df["B"]==0)).idxmax(0)
>>> df[:m] = df[:m].cumsum()
>>> df
A B
0 1 2
1 6 2
2 16 2
3 26 3
4 26 4
5 31 6
6 0 0
7 0 0
8 0 0
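Another way to express "cumulative sum everywhere except the trailing all-zero rows" is to scan the rows from the bottom up. The sketch below is one possible variant, not any answerer's code; it assumes the trailing block is defined as rows where every column is 0, and the names has_nonzero, keep and out are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'A': [1, 5, 10, 10, 0, 5, 0, 0, 0], 'B': [2, 0, 0, 1, 1, 2, 0, 0, 0]},
    index=range(1, 10),
)

# True for rows that contain at least one non-zero value.
has_nonzero = df.ne(0).any(axis=1).to_numpy()

# Scanning from the bottom, a row is kept once it or any row below it is non-zero.
keep = pd.Series(np.logical_or.accumulate(has_nonzero[::-1])[::-1], index=df.index)

# Cumulate everything, then reset the trailing all-zero block.
out = df.cumsum()
out.loc[~keep] = 0
print(out)
```

Unlike the idxmax variant, this does not stop at an all-zero row in the middle of the data, only at the block of zeros at the end.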

Use fillna-method per specific segments in dataframe

Currently I have following dataframe, where F1-F4 are some segments
A B C D E F1 F2 F3 F4
06:00 2 4 6 8 1 1 0 0 0
06:15 3 5 7 9 NaN 1 0 0 0
06:30 4 6 8 7 3 1 0 0 0
06:45 1 3 5 7 NaN 1 0 0 0
07:00 2 4 6 8 6 0 1 0 0
07:15 4 4 8 8 NaN 0 1 0 0
---------------------------------------------
20:00 2 4 6 8 NaN 0 0 1 0
20:15 1 2 3 4 5 0 0 1 0
20:30 8 1 5 9 NaN 0 0 1 0
20:45 1 3 5 7 NaN 0 0 0 1
21:00 5 4 6 5 6 0 0 0 1
What is the best approach to obtain the following dataset, where each missing E is filled according to rules like these?
E(06:15) = MEAN( AVG[E(06:00-06:30)], AVG[06:15(A-E)] ) #F1==1
E(20:45) = MEAN( AVG[E(20:45-21:00)], AVG[20:45(A-E)] ) #F4==1
A B C D E F1 F2 F3 F4
06:00 2 4 6 8 1 1 0 0 0
06:15 3 5 7 9 [X0] 1 0 0 0
06:30 4 6 8 7 3 1 0 0 0
06:45 1 3 5 7 [X1] 1 0 0 0
07:00 2 4 6 8 6 0 1 0 0
07:15 4 4 8 8 [X2] 0 1 0 0
---------------------------------------------
20:00 2 4 6 8 [X3] 0 0 1 0
20:15 1 2 3 4 5 0 0 1 0
20:30 8 1 5 9 [X4] 0 0 1 0
20:45 1 3 5 7 [X5] 0 0 0 1
21:00 5 4 6 5 6 0 0 0 1
I was trying to use an idea like below, but without success so far
In[89]: df.groupby(['F1', 'F2', 'F3', 'F4'], as_index=False).median()
Out[89]:
F1 F2 F3 F4 A B C D E
0 0 0 0 1 2.0 3.0 2.0 2.0 0.0
1 0 0 1 0 1.5 2.0 3.0 3.5 1.0
2 0 1 0 0 6.0 7.0 6.0 7.0 9.0
3 1 0 0 0 3.0 4.0 3.0 4.0 4.0
and now I am struggling to access the values where E == 0.0 via the key F4 == 1.
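One possible reading of the rule above, sketched on the first six rows only: each missing E becomes the mean of (a) the average of the known E values in the same segment and (b) the average of the known values in that row. Both points are assumptions, since the question's time ranges do not say whether the whole segment or only the neighbouring rows should be averaged; the names flag_cols, value_cols, segment, segment_mean and row_mean are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'A': [2, 3, 4, 1, 2, 4],
        'B': [4, 5, 6, 3, 4, 4],
        'C': [6, 7, 8, 5, 6, 8],
        'D': [8, 9, 7, 7, 8, 8],
        'E': [1, np.nan, 3, np.nan, 6, np.nan],
        'F1': [1, 1, 1, 1, 0, 0],
        'F2': [0, 0, 0, 0, 1, 1],
    },
    index=['06:00', '06:15', '06:30', '06:45', '07:00', '07:15'],
)
flag_cols = ['F1', 'F2']
value_cols = ['A', 'B', 'C', 'D', 'E']

# Turn the one-hot segment flags into a single label per row.
segment = df[flag_cols].idxmax(axis=1)

# Assumption: average E over the whole segment (NaN values are skipped).
segment_mean = df.groupby(segment)['E'].transform('mean')

# Average of the known values in the row (the missing E is skipped).
row_mean = df[value_cols].mean(axis=1)

# Fill only the missing E values; existing ones are left untouched.
df['E'] = df['E'].fillna((segment_mean + row_mean) / 2)
print(df)
```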
