This question is based on this thread.
I have the following dataframe:
diff_hours stage sensor
0 0 20
0 0 21
0 0 21
1 0 22
5 0 21
0 0 22
0 1 20
7 1 23
0 1 24
0 3 25
0 3 28
6 0 21
0 0 22
I need to calculate an accumulated value of diff_hours while stage is growing. When stage drops back to 0, the accumulated value acc_hours should restart from 0, even though diff_hours might not be equal to 0.
The proposed solution is this one:
blocks = df['stage'].diff().lt(0).cumsum()
df['acc_hours'] = df['diff_hours'].groupby(blocks).cumsum()
Output:
diff_hours stage sensor acc_hours
0 0 0 20 0
1 0 0 21 0
2 0 0 21 0
3 1 0 22 1
4 5 0 21 6
5 0 0 22 6
6 0 1 20 6
7 7 1 23 13
8 0 1 24 13
9 0 3 25 13
10 0 3 28 13
11 6 0 21 6
12 0 0 22 6
In row 11 the value of acc_hours is 6. I need it to restart at 0, because stage dropped from 3 back to 0 in that row.
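To illustrate why the proposed solution keeps the 6 in the drop row: blocks starts a new group exactly at the drop, but that row's own diff_hours still enters the new group's cumsum. A minimal sketch on a hypothetical 5-row frame:

```python
import pandas as pd

# Minimal sketch: `blocks` starts a new group exactly at the drop row,
# but that row's own diff_hours still enters the new group's cumsum.
df = pd.DataFrame({'diff_hours': [0, 1, 5, 6, 0],
                   'stage':      [0, 1, 3, 0, 0]})

blocks = df['stage'].diff().lt(0).cumsum()
print(blocks.tolist())  # [0, 0, 0, 1, 1]

acc = df['diff_hours'].groupby(blocks).cumsum()
print(acc.tolist())     # [0, 1, 6, 6, 6] -- the drop row keeps its 6
```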
The expected output:
diff_hours stage sensor acc_hours
0 0 0 20 0
1 0 0 21 0
2 0 0 21 0
3 1 0 22 1
4 5 0 21 6
5 0 0 22 6
6 0 1 20 6
7 7 1 23 13
8 0 1 24 13
9 0 3 25 13
10 0 3 28 13
11 6 0 21 0
12 0 0 22 0
How can I implement this logic?
The expected output is a bit unclear; what about a simple mask?
Masking only the value at the drop row:
m = df['stage'].diff().lt(0)
df['acc_hours'] = (df.groupby(m.cumsum())
                     ['diff_hours'].cumsum()
                     .mask(m, 0)
                   )
Output:
diff_hours stage sensor acc_hours
0 0 0 20 0
1 0 0 21 0
2 0 0 21 0
3 1 0 22 1
4 5 0 21 6
5 0 0 22 6
6 0 1 20 6
7 7 1 23 13
8 0 1 24 13
9 0 3 25 13
10 0 3 28 13
11 6 0 21 0
12 0 0 22 6
13 3 0 22 9
14 0 0 22 9
Or, ignoring the value completely by masking before the groupby:
m = df['stage'].diff().lt(0)
df['acc_hours'] = (df['diff_hours'].mask(m, 0)
                     .groupby(m.cumsum())
                     .cumsum()
                   )
Output:
diff_hours stage sensor acc_hours
0 0 0 20 0
1 0 0 21 0
2 0 0 21 0
3 1 0 22 1
4 5 0 21 6
5 0 0 22 6
6 0 1 20 6
7 7 1 23 13
8 0 1 24 13
9 0 3 25 13
10 0 3 28 13
11 6 0 21 0
12 0 0 22 0
13 3 0 22 3
14 0 0 22 3
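The difference between the two variants shows up only from the reset row onward: masking after the cumsum zeroes just the displayed value (the drop row's diff_hours still feeds later rows in the same block), while masking before the groupby discards it entirely. A small sketch on a hypothetical 5-row frame:

```python
import pandas as pd

df = pd.DataFrame({'diff_hours': [0, 1, 5, 6, 3],
                   'stage':      [0, 1, 3, 0, 0]})
m = df['stage'].diff().lt(0)  # True only at the drop row (index 3)

# Variant 1: cumsum first, then zero only the drop row's displayed value.
after = df.groupby(m.cumsum())['diff_hours'].cumsum().mask(m, 0)
print(after.tolist())   # [0, 1, 6, 0, 9] -- the 6 still feeds index 4

# Variant 2: zero diff_hours before the groupby, discarding the 6 entirely.
before = df['diff_hours'].mask(m, 0).groupby(m.cumsum()).cumsum()
print(before.tolist())  # [0, 1, 6, 0, 3]
```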
[image of Jupyter notebook showing the issue]
For my quarters, instead of values like 1,0,0,0 showing up, I get NaN. How do I fix the code below so that it returns the values in my dataframe?
qrt_1 = {'q1':[1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0]}
qrt_2 = {'q2':[0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0]}
qrt_3 = {'q3':[0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0]}
qrt_4 = {'q4':[0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1]}
year = {'year': [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9]}
value = data_1['Sales']
data = [year, qrt_1, qrt_2, qrt_3, qrt_4]
dataframes = []
for x in data:
    dataframes.append(pd.DataFrame(x))
df = pd.concat(dataframes)
I am expecting a dataframe that contains qrt_1, qrt_2, etc. under their corresponding column names.
Try to use axis=1 in pd.concat:
df = pd.concat(dataframes, axis=1)
print(df)
Prints:
year q1 q2 q3 q4
0 1 1 0 0 0
1 1 0 1 0 0
2 1 0 0 1 0
3 1 0 0 0 1
4 2 1 0 0 0
5 2 0 1 0 0
6 2 0 0 1 0
7 2 0 0 0 1
8 3 1 0 0 0
9 3 0 1 0 0
10 3 0 0 1 0
11 3 0 0 0 1
12 4 1 0 0 0
13 4 0 1 0 0
14 4 0 0 1 0
15 4 0 0 0 1
16 5 1 0 0 0
17 5 0 1 0 0
18 5 0 0 1 0
19 5 0 0 0 1
20 6 1 0 0 0
21 6 0 1 0 0
22 6 0 0 1 0
23 6 0 0 0 1
24 7 1 0 0 0
25 7 0 1 0 0
26 7 0 0 1 0
27 7 0 0 0 1
28 8 1 0 0 0
29 8 0 1 0 0
30 8 0 0 1 0
31 8 0 0 0 1
32 9 1 0 0 0
33 9 0 1 0 0
34 9 0 0 1 0
35 9 0 0 0 1
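For context on why the original code produced NaN: with the default axis=0, concat stacks the frames vertically, and each frame lacks the other frames' columns, so those cells are filled with NaN. With axis=1 the frames are placed side by side instead. A minimal sketch with two hypothetical one-column frames:

```python
import pandas as pd

# axis=0 (default): stacked vertically, each frame missing the other's
# column, so NaN appears everywhere off the diagonal of columns.
a = pd.DataFrame({'year': [1, 2]})
b = pd.DataFrame({'q1': [1, 0]})

stacked = pd.concat([a, b])               # 4 rows, NaN in both columns
side_by_side = pd.concat([a, b], axis=1)  # 2 rows, no NaN

print(stacked.isna().any().any())         # True
print(side_by_side.isna().any().any())    # False
```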
I want to assign a unique id based on runs of values in a column. For example, I have a table like this:
df = pd.DataFrame({'A': [0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,0,0,0,0,1,1,1]})
Eventually I would like my output table to look like this:
    A  id
0   0   1
1   0   1
2   0   1
3   0   1
4   0   1
5   0   1
6   0   1
7   1   2
8   1   2
9   1   2
10  1   2
11  1   2
12  1   2
13  0   3
14  0   3
15  0   3
16  0   3
17  0   3
18  0   3
19  1   4
20  1   4
21  1   4
22  0   5
23  0   5
24  0   5
25  0   5
26  1   6
27  1   6
28  1   6
I tried data.groupby(['a'], sort=False).ngroup() + 1, but it does not produce what I want. Any help and guidance will be appreciated, thanks!
diff + cumsum:
df['id'] = df.A.diff().ne(0).cumsum()
df
A id
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1
5 0 1
6 0 1
7 1 2
8 1 2
9 1 2
10 1 2
11 1 2
12 1 2
13 0 3
14 0 3
15 0 3
16 0 3
17 0 3
18 0 3
19 1 4
20 1 4
21 1 4
22 0 5
23 0 5
24 0 5
25 0 5
26 1 6
27 1 6
28 1 6
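Why this works: diff() is nonzero exactly at the rows where A changes (the leading NaN also compares unequal to 0), so ne(0).cumsum() numbers each run of identical values starting from 1. A tiny sketch:

```python
import pandas as pd

# diff() is nonzero (or NaN, at row 0) exactly where the value changes;
# ne(0) marks those rows True, and cumsum() turns them into run ids.
s = pd.Series([0, 0, 1, 1, 0])
print(s.diff().ne(0).tolist())           # [True, False, True, False, True]
print(s.diff().ne(0).cumsum().tolist())  # [1, 1, 2, 2, 3]
```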
Alternatively, with the pdrle package (run-length encoding helpers for pandas):
import pdrle
df["id"] = pdrle.get_id(df["A"]) + 1
df
# A id
# 0 0 1
# 1 0 1
# 2 0 1
# 3 0 1
# 4 0 1
# 5 0 1
# 6 0 1
# 7 1 2
# 8 1 2
# 9 1 2
# 10 1 2
# 11 1 2
# 12 1 2
# 13 0 3
# 14 0 3
# 15 0 3
# 16 0 3
# 17 0 3
# 18 0 3
# 19 1 4
# 20 1 4
# 21 1 4
# 22 0 5
# 23 0 5
# 24 0 5
# 25 0 5
# 26 1 6
# 27 1 6
# 28 1 6
Now I have a DataFrame as below:
video_id 0 1 2 3 4 5 6 7 8 9 ... 53 54 55 56
user_id ...
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0
1 2 0 4 13 16 2 0 10 6 45 ... 3 352 6 0
2 0 0 0 0 0 0 0 11 0 0 ... 0 0 0 0
3 4 13 0 8 0 0 5 9 12 11 ... 14 17 0 6
4 0 0 4 13 25 4 0 33 0 39 ... 5 7 4 3
6 2 0 0 0 12 0 0 0 2 0 ... 19 4 0 0
7 33 59 52 59 113 53 29 32 59 82 ... 60 119 57 39
9 0 0 0 0 5 0 0 1 0 4 ... 16 0 0 0
10 0 0 0 0 40 0 0 0 0 0 ... 26 0 0 0
11 2 2 32 3 12 3 3 11 19 10 ... 16 3 3 9
12 0 0 0 0 0 0 0 7 0 0 ... 7 0 0 0
We can see that some rows are missing, such as user_id 5 and user_id 8. What I want to do is fill these rows with 0, like:
video_id 0 1 2 3 4 5 6 7 8 9 ... 53 54 55 56
user_id ...
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0
1 2 0 4 13 16 2 0 10 6 45 ... 3 352 6 0
2 0 0 0 0 0 0 0 11 0 0 ... 0 0 0 0
3 4 13 0 8 0 0 5 9 12 11 ... 14 17 0 6
4 0 0 4 13 25 4 0 33 0 39 ... 5 7 4 3
5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0
6 2 0 0 0 12 0 0 0 2 0 ... 19 4 0 0
7 33 59 52 59 113 53 29 32 59 82 ... 60 119 57 39
8 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0
9 0 0 0 0 5 0 0 1 0 4 ... 16 0 0 0
10 0 0 0 0 40 0 0 0 0 0 ... 26 0 0 0
11 2 2 32 3 12 3 3 11 19 10 ... 16 3 3 9
12 0 0 0 0 0 0 0 7 0 0 ... 7 0 0 0
Is there any solution to this issue?
You could use arange + reindex:
import numpy as np
df = df.reindex(np.arange(df.index.min(), df.index.max() + 1), fill_value=0)
This assumes your index is meant to be a monotonically increasing integer index.
df
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 2 0 4 13 16 2 0 10 6 45
2 0 0 0 0 0 0 0 11 0 0
3 4 13 0 8 0 0 5 9 12 11
4 0 0 4 13 25 4 0 33 0 39
6 2 0 0 0 12 0 0 0 2 0
7 33 59 52 59 113 53 29 32 59 82
9 0 0 0 0 5 0 0 1 0 4
10 0 0 0 0 40 0 0 0 0 0
11 2 2 32 3 12 3 3 11 19 10
12 0 0 0 0 0 0 0 7 0 0
df.reindex(np.arange(df.index.min(), df.index.max() + 1), fill_value=0)
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 2 0 4 13 16 2 0 10 6 45
2 0 0 0 0 0 0 0 11 0 0
3 4 13 0 8 0 0 5 9 12 11
4 0 0 4 13 25 4 0 33 0 39
5 0 0 0 0 0 0 0 0 0 0 # <-----
6 2 0 0 0 12 0 0 0 2 0
7 33 59 52 59 113 53 29 32 59 82
8 0 0 0 0 0 0 0 0 0 0 # <-----
9 0 0 0 0 5 0 0 1 0 4
10 0 0 0 0 40 0 0 0 0 0
11 2 2 32 3 12 3 3 11 19 10
12 0 0 0 0 0 0 0 7 0 0
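A minimal, self-contained sketch of the same idea with a hypothetical 3-row frame: reindex inserts any missing integer labels, and fill_value=0 makes those new rows all zeros.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30]}, index=[0, 1, 3])  # label 2 missing

# reindex over the full integer range; the missing label 2 becomes a 0 row
full = df.reindex(np.arange(df.index.min(), df.index.max() + 1), fill_value=0)
print(full.index.tolist())  # [0, 1, 2, 3]
print(full['a'].tolist())   # [10, 20, 0, 30]
```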