Finding Occurrences SUM using Dataframe - python

I have a data frame in which I need to find each run of timeatAcc values greater than 0 and write the sum of that run on its last occurrence. My code is below:
import pandas as pd

data = {
    'id': [7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
    'timeatAcc': [0,0,0,0,0,0,0,0,1,1,1,0,0,1,1,0,0,1,1,1,1,1,1,0,0,0,0,0,1,1,1,0,0,1,1,0,0,0,0,0],
}
df = pd.DataFrame(data, columns=['id', 'timeatAcc'])
# Label each run of equal consecutive values, then keep the run length only where timeatAcc is 1
run_id = (df['timeatAcc'] != df['timeatAcc'].shift()).cumsum()
df['consecutive'] = df['id'].groupby(run_id).transform('size') * df['timeatAcc']
print(df)
Current Output
Expected output
Need help and thanks in advance

Let's try groupby().diff():
df['Occurences'] = df.groupby('id')['timeatAcc'].diff(-1).eq(1).astype(int)
Output:
id timeatAcc Occurences
0 7 0 0
1 7 0 0
2 7 0 0
3 7 0 0
4 7 0 0
5 7 0 0
6 7 0 0
7 7 0 0
8 7 1 0
9 7 1 0
10 7 1 1
11 7 0 0
12 7 0 0
13 7 1 0
14 7 1 1
15 7 0 0
16 7 0 0
17 7 1 0
18 7 1 0
19 7 1 0
20 1 1 0
21 1 1 0
22 1 1 1
23 1 0 0
24 1 0 0
25 1 0 0
26 1 0 0
27 1 0 0
28 1 1 0
29 1 1 0
30 1 1 1
31 1 0 0
32 1 0 0
33 1 1 0
34 1 1 1
35 1 0 0
36 1 0 0
37 1 0 0
38 1 0 0
39 1 0 0
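To see why diff(-1).eq(1) flags the last row of each run, here is a minimal sketch on a small made-up Series (the data is illustrative, not from the question). diff(-1) subtracts the next row from the current one, so a result of 1 means the current value is 1 and the next is 0.

```python
import pandas as pd

# Illustrative data only
s = pd.Series([0, 1, 1, 0, 1])

# diff(-1) is equivalent to s - s.shift(-1): current value minus the NEXT value
step = s.diff(-1)

# 1 means "current is 1, next is 0", i.e. the last row of a run of 1s.
# The final row has no successor, so its diff is NaN and it is never flagged.
is_run_end = step.eq(1).astype(int)
print(is_run_end.tolist())
```

Note that a run still open at the end of the data (or at the end of an id group) is not flagged, which matches row 19 in the output above.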
Update: to get the sum instead of 1:
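import numpy as np

# Sum of each run of 1s (a run shares its group with the 0 that precedes it)
df['Occurences'] = df.groupby(['id', df['timeatAcc'].eq(0).cumsum()])['timeatAcc'].transform('sum')
# Keep the sum only on the last row of each run, 0 everywhere else
df['Occurences'] = np.where(df.groupby('id')['timeatAcc'].diff(-1).eq(1), df['Occurences'], 0)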
Output:
id timeatAcc Occurences
0 7 0 0
1 7 0 0
2 7 0 0
3 7 0 0
4 7 0 0
5 7 0 0
6 7 0 0
7 7 0 0
8 7 1 0
9 7 1 0
10 7 1 3
11 7 0 0
12 7 0 0
13 7 1 0
14 7 1 2
15 7 0 0
16 7 0 0
17 7 1 0
18 7 1 0
19 7 1 0
20 1 1 0
21 1 1 0
22 1 1 3
23 1 0 0
24 1 0 0
25 1 0 0
26 1 0 0
27 1 0 0
28 1 1 0
29 1 1 0
30 1 1 3
31 1 0 0
32 1 0 0
33 1 1 0
34 1 1 2
35 1 0 0
36 1 0 0
37 1 0 0
38 1 0 0
39 1 0 0
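The same idea can be wrapped in a small function. This is only a sketch under assumptions: run_sum_at_end is a hypothetical helper name, the sample data is shortened for readability, and runs are labelled with shift/ne/cumsum rather than the zero-counting trick used above.

```python
import numpy as np
import pandas as pd

def run_sum_at_end(df, key="id", col="timeatAcc"):
    """Hypothetical helper: put each run's total on the run's last row, 0 elsewhere."""
    # A new run starts whenever the value or the id changes.
    run_id = (df[col].ne(df[col].shift()) | df[key].ne(df[key].shift())).cumsum()
    run_total = df.groupby(run_id)[col].transform("sum")
    # Last row of a run of 1s: the next value within the same id is 0.
    is_end = df.groupby(key)[col].diff(-1).eq(1)
    return pd.Series(np.where(is_end, run_total, 0), index=df.index)

# Shortened illustrative data, not the question's full frame
df = pd.DataFrame({
    "id": [7, 7, 7, 7, 7, 1, 1, 1, 1],
    "timeatAcc": [0, 1, 1, 1, 0, 1, 1, 0, 1],
})
df["Occurences"] = run_sum_at_end(df)
print(df)
```

As in the answer's output, a run that is still open on the last row of an id gets 0, because there is no following 0 to mark its end.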

Related

How to calculate an accumulated value conditionally?

This question is based on this thread.
I have the following dataframe:
diff_hours stage sensor
0 0 20
0 0 21
0 0 21
1 0 22
5 0 21
0 0 22
0 1 20
7 1 23
0 1 24
0 3 25
0 3 28
6 0 21
0 0 22
I need to calculate an accumulated value of diff_hours while stage is growing. When stage drops to 0, the accumulated value acc_hours should restart at 0, even though diff_hours might not be equal to 0.
The proposed solution is this one:
blocks = df['stage'].diff().lt(0).cumsum()
df['acc_hours'] = df['diff_hours'].groupby(blocks).cumsum()
Output:
diff_hours stage sensor acc_hours
0 0 0 20 0
1 0 0 21 0
2 0 0 21 0
3 1 0 22 1
4 5 0 21 6
5 0 0 22 6
6 0 1 20 6
7 7 1 23 13
8 0 1 24 13
9 0 3 25 13
10 0 3 28 13
11 6 0 21 6
12 0 0 22 6
In row 11 the value of acc_hours is 6. I need it to restart at 0, because stage dropped from 3 back to 0 in that row.
The expected output:
diff_hours stage sensor acc_hours
0 0 0 20 0
1 0 0 21 0
2 0 0 21 0
3 1 0 22 1
4 5 0 21 6
5 0 0 22 6
6 0 1 20 6
7 7 1 23 13
8 0 1 24 13
9 0 3 25 13
10 0 3 28 13
11 6 0 21 0
12 0 0 22 0
How can I implement this logic?
The expected output is unclear. What about a simple mask?
Masking only the value during the change:
m = df['stage'].diff().lt(0)
df['acc_hours'] = (df.groupby(m.cumsum())['diff_hours']
                     .cumsum()
                     .mask(m, 0)
                  )
Output:
diff_hours stage sensor acc_hours
0 0 0 20 0
1 0 0 21 0
2 0 0 21 0
3 1 0 22 1
4 5 0 21 6
5 0 0 22 6
6 0 1 20 6
7 7 1 23 13
8 0 1 24 13
9 0 3 25 13
10 0 3 28 13
11 6 0 21 0
12 0 0 22 6
13 3 0 22 9
14 0 0 22 9
Or ignoring the value completely by masking before the groupby:
m = df['stage'].diff().lt(0)
df['acc_hours'] = (df['diff_hours'].mask(m, 0)
                     .groupby(m.cumsum())
                     .cumsum()
                  )
Output:
diff_hours stage sensor acc_hours
0 0 0 20 0
1 0 0 21 0
2 0 0 21 0
3 1 0 22 1
4 5 0 21 6
5 0 0 22 6
6 0 1 20 6
7 7 1 23 13
8 0 1 24 13
9 0 3 25 13
10 0 3 28 13
11 6 0 21 0
12 0 0 22 0
13 3 0 22 3
14 0 0 22 3
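As a quick check, here is a self-contained sketch that applies the mask-before-groupby variant to the 13 rows from the question (the sensor column is left out because it plays no part in the calculation). The variable name drop is just an illustrative choice.

```python
import pandas as pd

# The question's 13 rows, without the sensor column
df = pd.DataFrame({
    "diff_hours": [0, 0, 0, 1, 5, 0, 0, 7, 0, 0, 0, 6, 0],
    "stage":      [0, 0, 0, 0, 0, 0, 1, 1, 1, 3, 3, 0, 0],
})

# True on rows where stage is lower than on the previous row
drop = df["stage"].diff().lt(0)

# Zero out diff_hours on the drop row, then accumulate within each block
df["acc_hours"] = df["diff_hours"].mask(drop, 0).groupby(drop.cumsum()).cumsum()
print(df["acc_hours"].tolist())
```

On this data the result agrees with the expected output in the question: rows 11 and 12 both come out as 0.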

Trying to merge dictionaries together to create a new DataFrame, but the dictionary values aren't showing up

For my quarter columns, instead of values such as 1,0,0,0 showing up, I get NaN.
How do I fix the code below so that the values appear in my dataframe?
qrt_1 = {'q1':[1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0]}
qrt_2 = {'q2':[0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0]}
qrt_3 = {'q3':[0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0]}
qrt_4 = {'q4':[0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1]}
year = {'year': [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9]}
value = data_1['Sales']
data = [year, qrt_1, qrt_2, qrt_3, qrt_4]
dataframes = []
for x in data:
    dataframes.append(pd.DataFrame(x))
df = pd.concat(dataframes)
I am expecting a dataframe that contains qrt_1, qrt_2, etc. under their corresponding column names.
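By default pd.concat stacks the frames vertically (axis=0), so each frame's column is NaN in the rows that came from the other frames. Try using axis=1 in pd.concat: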
df = pd.concat(dataframes, axis=1)
print(df)
Prints:
year q1 q2 q3 q4
0 1 1 0 0 0
1 1 0 1 0 0
2 1 0 0 1 0
3 1 0 0 0 1
4 2 1 0 0 0
5 2 0 1 0 0
6 2 0 0 1 0
7 2 0 0 0 1
8 3 1 0 0 0
9 3 0 1 0 0
10 3 0 0 1 0
11 3 0 0 0 1
12 4 1 0 0 0
13 4 0 1 0 0
14 4 0 0 1 0
15 4 0 0 0 1
16 5 1 0 0 0
17 5 0 1 0 0
18 5 0 0 1 0
19 5 0 0 0 1
20 6 1 0 0 0
21 6 0 1 0 0
22 6 0 0 1 0
23 6 0 0 0 1
24 7 1 0 0 0
25 7 0 1 0 0
26 7 0 0 1 0
27 7 0 0 0 1
28 8 1 0 0 0
29 8 0 1 0 0
30 8 0 0 1 0
31 8 0 0 0 1
32 9 1 0 0 0
33 9 0 1 0 0
34 9 0 0 1 0
35 9 0 0 0 1
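The difference between the two axes can be seen on a shortened, made-up version of the data. This is only a sketch; the variable names stacked and side_by_side are illustrative.

```python
import pandas as pd

# Shortened illustrative data
year = pd.DataFrame({"year": [1, 1, 2, 2]})
q1 = pd.DataFrame({"q1": [1, 0, 1, 0]})

# Default axis=0: rows are appended and columns are unioned,
# so every cell with no source value becomes NaN.
stacked = pd.concat([year, q1])

# axis=1: frames are placed side by side, aligned on the row index.
side_by_side = pd.concat([year, q1], axis=1)

print(stacked.shape, int(stacked.isna().sum().sum()))
print(side_by_side.shape, int(side_by_side.isna().sum().sum()))
```

With axis=1 the frames are aligned on their index, so this works cleanly here because every frame uses the same default RangeIndex.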

How can I create unique id based on the value in the other column

I want to assign a unique id based on the values in a column. For example, I have a table like this:
df = pd.DataFrame({'A': [0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,0,0,0,0,1,1,1]})
Eventually I would like my output table to look like this:
A id
1 0 1
2 0 1
3 0 1
4 0 1
5 0 1
6 0 1
7 1 2
8 1 2
9 1 2
10 1 2
11 1 2
12 1 2
13 0 3
14 0 3
15 0 3
16 0 3
17 0 3
18 0 3
19 1 4
20 1 4
21 1 4
22 0 5
23 0 5
24 0 5
25 0 5
26 1 6
27 1 6
28 1 6
I tried data.groupby(['a'], sort=False).ngroup() + 1, but it does not give the result I want. Any help and guidance will be appreciated. Thanks!
diff + cumsum:
df['id'] = df.A.diff().ne(0).cumsum()
df
A id
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1
5 0 1
6 0 1
7 1 2
8 1 2
9 1 2
10 1 2
11 1 2
12 1 2
13 0 3
14 0 3
15 0 3
16 0 3
17 0 3
18 0 3
19 1 4
20 1 4
21 1 4
22 0 5
23 0 5
24 0 5
25 0 5
26 1 6
27 1 6
28 1 6
import pdrle
df["id"] = pdrle.get_id(df["A"]) + 1
df
# A id
# 0 0 1
# 1 0 1
# 2 0 1
# 3 0 1
# 4 0 1
# 5 0 1
# 6 0 1
# 7 1 2
# 8 1 2
# 9 1 2
# 10 1 2
# 11 1 2
# 12 1 2
# 13 0 3
# 14 0 3
# 15 0 3
# 16 0 3
# 17 0 3
# 18 0 3
# 19 1 4
# 20 1 4
# 21 1 4
# 22 0 5
# 23 0 5
# 24 0 5
# 25 0 5
# 26 1 6
# 27 1 6
# 28 1 6
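diff() needs numeric values. If the column holds strings or other non-numeric labels, comparing each value with its shifted neighbour gives the same run ids. A minimal sketch on made-up data:

```python
import pandas as pd

# Illustrative non-numeric data
s = pd.Series(["a", "a", "b", "b", "b", "a"])

# True wherever the value differs from the previous row (the first row is always True),
# so the cumulative sum increases by 1 at the start of every run.
run_id = s.ne(s.shift()).cumsum()
print(run_id.tolist())
```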

Fill missing rows with zeros from a data frame

Now I have a DataFrame as below:
video_id 0 1 2 3 4 5 6 7 8 9 ... 53 54 55 56
user_id ...
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0
1 2 0 4 13 16 2 0 10 6 45 ... 3 352 6 0
2 0 0 0 0 0 0 0 11 0 0 ... 0 0 0 0
3 4 13 0 8 0 0 5 9 12 11 ... 14 17 0 6
4 0 0 4 13 25 4 0 33 0 39 ... 5 7 4 3
6 2 0 0 0 12 0 0 0 2 0 ... 19 4 0 0
7 33 59 52 59 113 53 29 32 59 82 ... 60 119 57 39
9 0 0 0 0 5 0 0 1 0 4 ... 16 0 0 0
10 0 0 0 0 40 0 0 0 0 0 ... 26 0 0 0
11 2 2 32 3 12 3 3 11 19 10 ... 16 3 3 9
12 0 0 0 0 0 0 0 7 0 0 ... 7 0 0 0
We can see that some rows of the DataFrame are missing, such as user_id 5 and user_id 8. What I want to do is fill these rows with 0, like:
video_id 0 1 2 3 4 5 6 7 8 9 ... 53 54 55 56
user_id ...
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0
1 2 0 4 13 16 2 0 10 6 45 ... 3 352 6 0
2 0 0 0 0 0 0 0 11 0 0 ... 0 0 0 0
3 4 13 0 8 0 0 5 9 12 11 ... 14 17 0 6
4 0 0 4 13 25 4 0 33 0 39 ... 5 7 4 3
5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0
6 2 0 0 0 12 0 0 0 2 0 ... 19 4 0 0
7 33 59 52 59 113 53 29 32 59 82 ... 60 119 57 39
8 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0
9 0 0 0 0 5 0 0 1 0 4 ... 16 0 0 0
10 0 0 0 0 40 0 0 0 0 0 ... 26 0 0 0
11 2 2 32 3 12 3 3 11 19 10 ... 16 3 3 9
12 0 0 0 0 0 0 0 7 0 0 ... 7 0 0 0
Is there any solution to this issue?
You could use np.arange + reindex:
df = df.reindex(np.arange(df.index.min(), df.index.max() + 1), fill_value=0)
This assumes your index is meant to be a monotonically increasing integer index.
df
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 2 0 4 13 16 2 0 10 6 45
2 0 0 0 0 0 0 0 11 0 0
3 4 13 0 8 0 0 5 9 12 11
4 0 0 4 13 25 4 0 33 0 39
6 2 0 0 0 12 0 0 0 2 0
7 33 59 52 59 113 53 29 32 59 82
9 0 0 0 0 5 0 0 1 0 4
10 0 0 0 0 40 0 0 0 0 0
11 2 2 32 3 12 3 3 11 19 10
12 0 0 0 0 0 0 0 7 0 0
df.reindex(np.arange(df.index.min(), df.index.max() + 1), fill_value=0)
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 2 0 4 13 16 2 0 10 6 45
2 0 0 0 0 0 0 0 11 0 0
3 4 13 0 8 0 0 5 9 12 11
4 0 0 4 13 25 4 0 33 0 39
5 0 0 0 0 0 0 0 0 0 0 # <-----
6 2 0 0 0 12 0 0 0 2 0
7 33 59 52 59 113 53 29 32 59 82
8 0 0 0 0 0 0 0 0 0 0 # <-----
9 0 0 0 0 5 0 0 1 0 4
10 0 0 0 0 40 0 0 0 0 0
11 2 2 32 3 12 3 3 11 19 10
12 0 0 0 0 0 0 0 7 0 0
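A self-contained version of the same approach, on a small made-up frame (the column names v0 and v1 are placeholders, not taken from the question):

```python
import numpy as np
import pandas as pd

# Illustrative frame with gaps in the integer index (2, 4 and 5 are missing)
df = pd.DataFrame({"v0": [2, 4, 2], "v1": [0, 13, 0]}, index=[1, 3, 6])

# Every integer label from the smallest to the largest existing one
full_index = np.arange(df.index.min(), df.index.max() + 1)

# Rows whose label is new are created and filled with 0
filled = df.reindex(full_index, fill_value=0)
print(filled)
```

If the ids should start at 0 regardless of the smallest one present, np.arange(0, df.index.max() + 1) could be used instead.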

Leave blocks of 1 of size >= k in Pandas data frame

I need to keep only the blocks of consecutive 1s whose length is >= k. All other blocks of 1s should be set to zero. For example, with k=2:
df=
a b
0 1 1
1 1 1
2 0 0
3 1 0
4 0 0
5 1 0
6 0 0
7 1 0
8 0 0
9 1 1
10 1 1
11 1 1
12 0 0
13 0 0
14 1 0
15 0 0
16 1 1
17 1 1
18 0 0
19 1 0
where column a is the original sequence and column b is the desired output.
z = df.a.eq(0)
g = z.cumsum().mask(z, -1)
k = 2
df['b'] = df.a.groupby(g).transform('size').ge(k).mask(z, False).astype(int)
a b
0 1 1
1 1 1
2 0 0
3 1 0
4 0 0
5 1 0
6 0 0
7 1 0
8 0 0
9 1 1
10 1 1
11 1 1
12 0 0
13 0 0
14 1 0
15 0 0
16 1 1
17 1 1
18 0 0
19 1 0
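An alternative sketch of the same filter, assuming the column only contains 0s and 1s. keep_long_runs is a hypothetical helper name, and runs are labelled with shift/ne/cumsum instead of the zero-counting mask used above.

```python
import pandas as pd

def keep_long_runs(s, k):
    """Hypothetical helper: keep runs of 1s of length >= k, set everything else to 0."""
    # Label each run of equal consecutive values
    run_id = s.ne(s.shift()).cumsum()
    # Length of the run each row belongs to
    run_len = s.groupby(run_id).transform("size")
    # Keep a row only if it is a 1 and its run is long enough
    return (s.eq(1) & run_len.ge(k)).astype(int)

# Illustrative data
a = pd.Series([1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1])
print(keep_long_runs(a, 2).tolist())
```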
