Pandas dataframe cumulative sum of column except last zero values - python

I want to do a cumulative sum on a pandas dataframe without carrying the sum over to the trailing zero values. For example, given a dataframe:
A B
1 1 2
2 5 0
3 10 0
4 10 1
5 0 1
6 5 2
7 0 0
8 0 0
9 0 0
I want the cumulative sum applied to indices 1 to 6 only:
A B
1 1 2
2 6 2
3 16 2
4 26 3
5 26 4
6 31 6
7 0 0
8 0 0
9 0 0

If you don't want to apply cumsum to the trailing rows that are 0 in all columns:
Test whether each row contains any non-zero value, shift the mask and take its cumulative sum. The running count stops increasing on the trailing all-zero rows, so finally compare against the maximum and filter:
a = df.ne(0).any(axis=1).shift().cumsum()
m = a != a.max()
df[m] = df[m].cumsum()
print (df)
A B
1 1 2
2 6 2
3 16 2
4 26 3
5 26 4
6 31 6
7 0 0
8 0 0
9 0 0
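To see the mechanics, here is a minimal sketch (my illustration, rebuilding the example frame from the table above) that prints the intermediate mask:
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 10, 10, 0, 5, 0, 0, 0],
                   'B': [2, 0, 0, 1, 1, 2, 0, 0, 0]},
                  index=range(1, 10))

# rows that contain at least one non-zero value
nonzero = df.ne(0).any(axis=1)
# shifting keeps the last non-zero row inside the mask; the running
# count only reaches its maximum on the trailing all-zero rows
a = nonzero.shift().cumsum()
m = a != a.max()
print(m.tolist())
# [True, True, True, True, True, True, False, False, False]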
A similar solution works if you want to process each column separately - simply omit any:
print (df)
A B
1 1 2
2 5 0
3 10 0
4 10 1
5 0 1
6 5 0
7 0 0
8 0 0
9 0 0
a = df.ne(0).shift().cumsum()
m = a != a.max()
df[m] = df[m].cumsum()
print (df)
A B
1 1 2
2 6 2
3 16 2
4 26 3
5 26 4
6 31 0
7 0 0
8 0 0
9 0 0

Use the last index where all values are non-zero, then slice up to it (label-based slicing with df[:l] includes l itself):
In [262]: s = df.ne(0).all(axis=1)
In [263]: l = s[s].index[-1]
In [264]: df[:l] = df.cumsum()
In [265]: df
Out[265]:
A B
1 1 2
2 6 2
3 16 2
4 26 3
5 26 4
6 31 6
7 0 0
8 0 0
9 0 0

I will use last_valid_index:
v = df.replace(0, np.nan).apply(lambda x: x.last_valid_index())
df[pd.DataFrame(df.index.values <= v.values[:, None], columns=df.index, index=df.columns).T].cumsum().fillna(0)
Out[890]:
A B
1 1.0 2.0
2 6.0 2.0
3 16.0 2.0
4 26.0 3.0
5 26.0 4.0
6 31.0 6.0
7 0.0 0.0
8 0.0 0.0
9 0.0 0.0
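Unpacked into named steps, the same logic reads as follows (a restatement of the one-liner above, assuming import numpy as np and the example frame):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 10, 10, 0, 5, 0, 0, 0],
                   'B': [2, 0, 0, 1, 1, 2, 0, 0, 0]},
                  index=range(1, 10))

# per column: the last index label holding a non-zero value
v = df.replace(0, np.nan).apply(lambda x: x.last_valid_index())

# True while the row label is <= that column's last non-zero index
keep = pd.DataFrame(df.index.values <= v.values[:, None],
                    index=df.columns, columns=df.index).T

result = df[keep].cumsum().fillna(0)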

To skip all rows after the first (0, 0) row, get the first index (by rows) where df['A'] and df['B'] are both 0 using idxmax(0):
>>> m = ((df["A"]==0) & (df["B"]==0)).idxmax(0)
>>> df[:m] = df[:m].cumsum()
>>> df
A B
0 1 2
1 6 2
2 16 2
3 26 3
4 26 4
5 31 6
6 0 0
7 0 0
8 0 0
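One caveat worth adding (my note, not part of the original answer): idxmax returns the first index when the mask is all False, so if no (0, 0) row exists the slice would wrongly stop at the very first row. A guarded sketch:
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 10, 10, 0, 5, 0, 0, 0],
                   'B': [2, 0, 0, 1, 1, 2, 0, 0, 0]})

zero_rows = (df["A"] == 0) & (df["B"] == 0)
if zero_rows.any():
    m = zero_rows.idxmax()      # first all-zero row
    df[:m] = df[:m].cumsum()
else:
    df = df.cumsum()            # no all-zero tail: cumsum everything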

Related

Pandas Dataframe aggregating function to count also nan values

I have the following dataframe
print(A)
Index 1or0
0 1 0
1 2 0
2 3 0
3 4 1
4 5 1
5 6 1
6 7 1
7 8 0
8 9 1
9 10 1
And I have the following code (from Pandas Dataframe count occurrences that only happen immediately), which counts runs of values that occur immediately one after another.
ser = A["1or0"].ne(A["1or0"].shift().bfill()).cumsum()
B = (
    A.groupby(ser, as_index=False)
     .agg({"Index": ["first", "last", "count"],
           "1or0": "unique"})
     .set_axis(["StartNum", "EndNum", "Size", "Value"], axis=1)
     .assign(Value=lambda d: d["Value"].astype(str).str.strip("[]"))
)
print(B)
StartNum EndNum Size Value
0 1 3 3 0
1 4 7 4 1
2 8 8 1 0
3 9 10 2 1
The issue is that when NaN values occur, the code doesn't put them together in one interval; it always counts them as one-sized intervals instead of e.g. 3:
print(A2)
Index 1or0
0 1 0
1 2 0
2 3 0
3 4 1
4 5 1
5 6 1
6 7 1
7 8 0
8 9 1
9 10 1
10 11 NaN
11 12 NaN
12 13 NaN
print(B2)
StartNum EndNum Size Value
0 1 3 3 0
1 4 7 4 1
2 8 8 1 0
3 9 10 2 1
4 11 11 1 NaN
5 12 12 1 NaN
6 13 13 1 NaN
But I want B2 to be the following
print(B2Wanted)
StartNum EndNum Size Value
0 1 3 3 0
1 4 7 4 1
2 8 8 1 0
3 9 10 2 1
4 11 13 3 NaN
What do I need to change so that it works also with NaN?
First fillna with a value that cannot occur in the data (here -1) before creating your grouper; this is needed because NaN != NaN, so consecutive NaNs would otherwise each start their own group:
group = A['1or0'].fillna(-1).diff().ne(0).cumsum()
# or
# s = A['1or0'].fillna(-1)
# group = s.ne(s.shift()).cumsum()
B = (A.groupby(group, as_index=False)
       .agg(**{'StartNum': ('Index', 'first'),
               'EndNum': ('Index', 'last'),
               'Size': ('1or0', 'size'),
               'Value': ('1or0', 'first')})
    )
Output:
StartNum EndNum Size Value
0 1 3 3 0.0
1 4 7 4 1.0
2 8 8 1 0.0
3 9 10 2 1.0
4 11 13 3 NaN
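For reference, a self-contained version (my reconstruction of A2 from the values shown above):
import numpy as np
import pandas as pd

A2 = pd.DataFrame({'Index': range(1, 14),
                   '1or0': [0, 0, 0, 1, 1, 1, 1, 0, 1, 1,
                            np.nan, np.nan, np.nan]})

group = A2['1or0'].fillna(-1).diff().ne(0).cumsum()
B2 = (A2.groupby(group, as_index=False)
        .agg(**{'StartNum': ('Index', 'first'),
                'EndNum': ('Index', 'last'),
                'Size': ('1or0', 'size'),
                'Value': ('1or0', 'first')}))
print(B2)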

Pandas add column on condition: If value of cell is True set value of largest number in Period to true

I have a pandas dataframe with lets say two columns, for example:
value boolean
0 1 0
1 5 1
2 0 0
3 3 0
4 9 1
5 12 0
6 4 0
7 7 1
8 8 1
9 2 0
10 17 0
11 15 1
12 6 0
Now I want to add a third column (new_boolean) with the following criteria:
I specify a period, for this example period = 4.
Now I take a look at all rows where boolean == 1.
new_boolean will be 1 for the maximum value in the last period rows.
For example, boolean == 1 in row 2 (counting from 1, i.e. the row with value 5). So I look at the last period rows: the values are [1, 5], 5 is the maximum, so the value of new_boolean in row 2 will be 1.
Second example: row 8 (value = 7): I get values [7, 4, 12, 9], 12 is the maximum, so the value for new_boolean in the row with value 12 will be 1
result:
value boolean new_boolean
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 1
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 1
11 15 1 0
12 6 0 0
How can I do this algorithmically?
Compute the rolling max of the 'value' column
>>> rolling_max_value = df.rolling(window=4, min_periods=1)['value'].max()
>>> rolling_max_value
0 1.0
1 5.0
2 5.0
3 5.0
4 9.0
5 12.0
6 12.0
7 12.0
8 12.0
9 8.0
10 17.0
11 17.0
12 17.0
Name: value, dtype: float64
Select only the relevant values, i.e. where 'boolean' = 1
>>> on_values = rolling_max_value[df.boolean == 1].unique()
>>> on_values
array([ 5., 9., 12., 17.])
The rows where 'new_boolean' = 1 are the ones where 'value' belongs to on_values
>>> df['new_boolean'] = df.value.isin(on_values).astype(int)
>>> df
value boolean new_boolean
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 1
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 1
11 15 1 0
12 6 0 0
EDIT:
OP raised a good point
Does this also work if I have multiple columns with the same value and they have different booleans?
The previous solution doesn't account for that. To solve this, instead of computing the rolling max, we gather the row labels associated with the rolling max values, i.e. the rolling argmax or idxmax. To my knowledge, Rolling objects don't have an idxmax method, but we can easily compute it via apply.
def idxmax(values):
    return values.idxmax()

rolling_idxmax_value = (
    df.rolling(min_periods=1, window=4)['value']
      .apply(idxmax)
      .astype(int)
)
on_idx = rolling_idxmax_value[df.boolean == 1].unique()
df['new_boolean'] = 0
df.loc[on_idx, 'new_boolean'] = 1
Results:
>>> rolling_idxmax_value
0 0
1 1
2 1
3 1
4 4
5 5
6 5
7 5
8 5
9 8
10 10
11 10
12 10
Name: value, dtype: int64
>>> on_idx
array([ 1,  4,  5, 10])
>>> df
value boolean new_boolean
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 1
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 1
11 15 1 0
12 6 0 0
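A small implementation note (my addition, not from the original answer): this relies on Rolling.apply passing each window to the function as a Series with its original index (raw=False, the default), since idxmax needs index labels; with raw=True the window would arrive as a bare NumPy array.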
I did this in 2 steps, but I think the solution is much clearer:
from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO('''
id value boolean
0 1 0
1 5 1
2 0 0
3 3 0
4 9 1
5 12 0
6 4 0
7 7 1
8 8 1
9 2 0
10 17 0
11 15 1
12 6 0'''),delim_whitespace=True,index_col=0)
df['new_bool'] = df['value'].rolling(min_periods=1, window=4).max()
df['new_bool'] = df.apply(lambda x: 1 if ((x['value'] == x['new_bool']) & (x['boolean'] == 1)) else 0, axis=1)
df
Result:
value boolean new_bool
id
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 0
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 0
11 15 1 0
12 6 0 0

All possible permutations within groups of column pandas

I have a df
a b c d
1 0 1 2 4
2 0 1 3 5
3 0 2 1 7
4 1 3 2 5
Within groups, grouped by 'a' and 'b' I want all possible permutations of 'c'
a b c d
1 0 1 2 4
0 1 3 5
0 2 1 7
2 0 1 3 5
0 1 2 4
0 2 1 7
3 1 3 2 5
...
...
I tried:
s = pd.Series({x: list(it.permutations(y)) for x, y in df.groupby(['a', 'b']).c})
0 1 [(3,2),(2,3)]
2 [(1,)]
1 3 [(2,)]
explode() alone does not do what I need, since I need all combinations of groups within subgroups.
For example in this case there are 2 different ways to combine rows 1 and 2. If row 2 would have been 2 different permutations, it would be 2*2=4 ways.
Does anybody have an idea?
Fix your code with groupby and explode
s = pd.Series({x: list(itertools.permutations(y)) for x, y in df.groupby('a').b}).explode().explode().reset_index()
index 0
0 0 1
1 0 2
2 0 3
3 0 1
4 0 3
5 0 2
6 0 2
7 0 1
8 0 3
9 0 2
10 0 3
11 0 1
12 0 3
13 0 1
14 0 2
15 0 3
16 0 2
17 0 1
18 1 1
19 1 2
20 1 2
21 1 1
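The double explode above flattens each group's permutations independently. If what you actually need is every combination of one permutation per (a, b) group (the 2*2 = 4 ways mentioned in the question), a sketch along these lines could work (my construction, not from the original answer; it assumes the rows are already sorted by the group keys):
import itertools
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 1],
                   'b': [1, 1, 2, 3],
                   'c': [2, 3, 1, 2],
                   'd': [4, 5, 7, 5]})

# permutations of the 'c' values within each (a, b) group
perms = {key: list(itertools.permutations(grp))
         for key, grp in df.groupby(['a', 'b'])['c']}

# cartesian product: pick one permutation per group and rebuild 'c'
for combo in itertools.product(*perms.values()):
    variant = df.copy()
    variant['c'] = [v for perm in combo for v in perm]
    print(variant)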

Pandas/Numpy group value changes and derivative value changes above/below 0

I have a series of values (Pandas DF or Numpy Arr):
import numpy as np
import pandas as pd

vals = [0,1,3,4,5,5,4,2,1,0,-1,-2,-3,-2,3,5,8,4,2,0,-1,-3,-8,-20,-10,-5,-2,-1,0,1,2,3,5,6,8,4,3]
df = pd.DataFrame({'val': vals})
I want to classify/group the values into 4 categories:
Increasing above 0
Decreasing above 0
Increasing below 0
Decreasing below 0
My current approach with Pandas is to categorize values into above/below 0, and then into increasing/decreasing by checking whether the diff values are above/below 0.
df['above_zero'] = np.where(df['val'] >= 0, 1, 0)
df['below_zero'] = np.where(df['val'] < 0, 1, 0)
df['diffs'] = df['val'].diff()
df['diff_above_zero'] = np.where(df['diffs'] >= 0, 1, 0)
df['diff_below_zero'] = np.where(df['diffs'] < 0, 1, 0)
This produces the desired output, but now I am trying to find a way to group these columns into an ascending group number that increments as soon as one of the 4 conditions changes.
Desired output would look like this (*group col is manually typed, might have errors from calculated values):
id val above_zero below_zero diffs diff_above_zero diff_below_zero group
0 0 1 0 0.0 1 0 0
1 1 1 0 1.0 1 0 0
2 3 1 0 2.0 1 0 0
3 4 1 0 1.0 1 0 0
4 5 1 0 1.0 1 0 0
5 5 1 0 0.0 1 0 0
6 4 1 0 -1.0 0 1 1
7 2 1 0 -2.0 0 1 1
8 1 1 0 -1.0 0 1 1
9 0 1 0 -1.0 0 1 1
10 -1 0 1 -1.0 0 1 2
11 -2 0 1 -1.0 0 1 2
12 -3 0 1 -1.0 0 1 2
13 -2 0 1 1.0 1 0 3
14 3 1 0 5.0 1 0 4
15 5 1 0 2.0 1 0 4
16 8 1 0 3.0 1 0 4
17 4 1 0 -4.0 0 1 5
18 2 1 0 -2.0 0 1 5
19 0 1 0 -2.0 0 1 5
20 -1 0 1 -1.0 0 1 6
21 -3 0 1 -2.0 0 1 6
22 -8 0 1 -5.0 0 1 6
23 -20 0 1 -12.0 0 1 6
24 -10 0 1 10.0 1 0 7
25 -5 0 1 5.0 1 0 7
26 -2 0 1 3.0 1 0 7
27 -1 0 1 1.0 1 0 7
28 0 1 0 1.0 1 0 8
29 1 1 0 1.0 1 0 8
30 2 1 0 1.0 1 0 8
31 3 1 0 1.0 1 0 8
32 5 1 0 2.0 1 0 8
33 6 1 0 1.0 1 0 8
34 8 1 0 2.0 1 0 8
35 4 1 0 -4.0 0 1 9
36 3 1 0 -1.0 0 1 9
Would appreciate any help on how to solve this efficiently. Thanks!
Setup
g1 = ['above_zero', 'below_zero', 'diff_above_zero', 'diff_below_zero']
You can simply index all of your boolean columns, and use shift:
c = df.loc[:, g1]
(c != c.shift().fillna(c)).any(axis=1).cumsum()
0 0
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 1
10 2
11 2
12 2
13 3
14 4
15 4
16 4
17 5
18 5
19 5
20 6
21 6
22 6
23 6
24 7
25 7
26 7
27 7
28 8
29 8
30 8
31 8
32 8
33 8
34 8
35 9
36 9
dtype: int32
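As a usage note, the result can be assigned straight back onto the frame (assuming df still carries the four boolean columns built in the question):
c = df.loc[:, g1]
df['group'] = (c != c.shift().fillna(c)).any(axis=1).cumsum()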
The following code will produce two columns: c1 and c2.
The values of c1 correspond to the following 4 categories:
0 means below zero and increasing
1 means above zero and increasing
2 means below zero and decreasing
3 means above zero and decreasing
And c2 corresponds to an ascending group number as soon as the condition (i.e. c1) changes, as you wanted. Credits to @user3483203 for the shift-with-cumsum idea.
# calculate difference
df["diff"] = df['val'].diff()
# set first value in column 'diff' to 0 (as previous step sets it to NaN)
df.loc[0, 'diff'] = 0
df["c1"] = (df['val'] >= 0).astype(int) + (df["diff"] < 0).astype(int) * 2
df["c2"] = (df["c1"] != df["c1"].shift().fillna(df["c1"])).astype(int).cumsum()
Result:
val diff c1 c2
0 0 0.0 1 0
1 1 1.0 1 0
2 3 2.0 1 0
3 4 1.0 1 0
4 5 1.0 1 0
5 5 0.0 1 0
6 4 -1.0 3 1
7 2 -2.0 3 1
8 1 -1.0 3 1
9 0 -1.0 3 1
10 -1 -1.0 2 2
11 -2 -1.0 2 2
12 -3 -1.0 2 2
13 -2 1.0 0 3
14 3 5.0 1 4
15 5 2.0 1 4
16 8 3.0 1 4
17 4 -4.0 3 5
18 2 -2.0 3 5
19 0 -2.0 3 5
20 -1 -1.0 2 6
21 -3 -2.0 2 6
22 -8 -5.0 2 6
23 -20 -12.0 2 6
24 -10 10.0 0 7
25 -5 5.0 0 7
26 -2 3.0 0 7
27 -1 1.0 0 7
28 0 1.0 1 8
29 1 1.0 1 8
30 2 1.0 1 8
31 3 1.0 1 8
32 5 2.0 1 8
33 6 1.0 1 8
34 8 2.0 1 8
35 4 -4.0 3 9
36 3 -1.0 3 9
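With c1 and c2 in place, each run can then be summarized with a plain groupby (my illustration, reusing the frame built above):
summary = df.groupby('c2').agg(category=('c1', 'first'),
                               start_val=('val', 'first'),
                               length=('val', 'size'))
print(summary)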

Count distinct strings in rolling window using pandas

How do I count the number of unique strings in a rolling window of a pandas dataframe?
import numpy as np
import pandas as pd

a = pd.DataFrame(['a','b','a','a','b','c','d','e','e','e','e'])
a.rolling(3).apply(lambda x: len(np.unique(x)))
The output is the same as the original dataframe:
0
0 a
1 b
2 a
3 a
4 b
5 c
6 d
7 e
8 e
9 e
10 e
Expected:
0
0 1
1 2
2 2
3 2
4 2
5 3
6 3
7 3
8 2
9 1
10 1
I think you first need to convert the values to numeric - by factorize or by rank. The min_periods parameter is also necessary to avoid NaN at the start of the column:
a[0] = pd.factorize(a[0])[0]
print (a)
0
0 0
1 1
2 0
3 0
4 1
5 2
6 3
7 4
8 4
9 4
10 4
b = a.rolling(3, min_periods=1).apply(lambda x: len(np.unique(x))).astype(int)
print (b)
0
0 1
1 2
2 2
3 2
4 2
5 3
6 3
7 3
8 2
9 1
10 1
Or:
a[0] = a[0].rank(method='dense')
0
0 1.0
1 2.0
2 1.0
3 1.0
4 2.0
5 3.0
6 4.0
7 5.0
8 5.0
9 5.0
10 5.0
b = a.rolling(3, min_periods=1).apply(lambda x: len(np.unique(x))).astype(int)
print (b)
0
0 1
1 2
2 2
3 2
4 2
5 3
6 3
7 3
8 2
9 1
10 1
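If you prefer to avoid the numeric conversion altogether, a plain window loop over the string column gives the same counts (a sketch, not from the original answers):
import pandas as pd

a = pd.DataFrame(['a','b','a','a','b','c','d','e','e','e','e'])
window = 3
b = pd.Series([a[0].iloc[max(0, i - window + 1):i + 1].nunique()
               for i in range(len(a))], index=a.index)
print(b.tolist())
# [1, 2, 2, 2, 2, 3, 3, 3, 2, 1, 1]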
