This question is based on this thread.
I have the following dataframe:
diff_hours stage sensor
0 0 20
0 0 21
0 0 21
1 0 22
5 0 21
0 0 22
0 1 20
7 1 23
0 1 24
0 3 25
0 3 28
6 0 21
0 0 22
I need to calculate an accumulated value of diff_hours while stage is growing. When stage drops back to 0, the accumulated value acc_hours should restart at 0, even though diff_hours might not be equal to 0.
The proposed solution is this one:
blocks = df['stage'].diff().lt(0).cumsum()
df['acc_hours'] = df['diff_hours'].groupby(blocks).cumsum()
Output:
diff_hours stage sensor acc_hours
0 0 0 20 0
1 0 0 21 0
2 0 0 21 0
3 1 0 22 1
4 5 0 21 6
5 0 0 22 6
6 0 1 20 6
7 7 1 23 13
8 0 1 24 13
9 0 3 25 13
10 0 3 28 13
11 6 0 21 6
12 0 0 22 6
On row 11 the value of acc_hours is equal to 6. I need it restarted to 0, because stage dropped from 3 back to 0 in that row.
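To see why the proposed solution keeps 6 there, it helps to look at the `blocks` series it groups by (a hypothetical reconstruction of the frame): a new block starts AT the row where stage drops, so that row's own diff_hours is still included in the group's cumsum.

```python
import pandas as pd

df = pd.DataFrame({
    'diff_hours': [0, 0, 0, 1, 5, 0, 0, 7, 0, 0, 0, 6, 0],
    'stage':      [0, 0, 0, 0, 0, 0, 1, 1, 1, 3, 3, 0, 0],
})

# True only where stage decreases; cumsum turns that into group labels.
blocks = df['stage'].diff().lt(0).cumsum()
print(blocks.tolist())  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
```

Row 11 opens group 1, so its diff_hours of 6 becomes the group's first cumsum value instead of 0.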
The expected output:
diff_hours stage sensor acc_hours
0 0 0 20 0
1 0 0 21 0
2 0 0 21 0
3 1 0 22 1
4 5 0 21 6
5 0 0 22 6
6 0 1 20 6
7 7 1 23 13
8 0 1 24 13
9 0 3 25 13
10 0 3 28 13
11 6 0 21 0
12 0 0 22 0
How can I implement this logic?
The expected output is unclear; what about a simple mask?
Masking only the value during the change:
m = df['stage'].diff().lt(0)
df['acc_hours'] = (df.groupby(m.cumsum())
                     ['diff_hours'].cumsum()
                     .mask(m, 0)
                   )
Output:
diff_hours stage sensor acc_hours
0 0 0 20 0
1 0 0 21 0
2 0 0 21 0
3 1 0 22 1
4 5 0 21 6
5 0 0 22 6
6 0 1 20 6
7 7 1 23 13
8 0 1 24 13
9 0 3 25 13
10 0 3 28 13
11 6 0 21 0
12 0 0 22 6
13 3 0 22 9
14 0 0 22 9
Or, ignoring the value completely by masking before the groupby:
m = df['stage'].diff().lt(0)
df['acc_hours'] = (df['diff_hours'].mask(m, 0)
                     .groupby(m.cumsum())
                     .cumsum()
                   )
Output:
diff_hours stage sensor acc_hours
0 0 0 20 0
1 0 0 21 0
2 0 0 21 0
3 1 0 22 1
4 5 0 21 6
5 0 0 22 6
6 0 1 20 6
7 7 1 23 13
8 0 1 24 13
9 0 3 25 13
10 0 3 28 13
11 6 0 21 0
12 0 0 22 0
13 3 0 22 3
14 0 0 22 3
[image of Jupyter notebook issue]
For my quarter columns, instead of the values (for example 1, 0, 0, 0), I get NaN.
How do I fix the code below so that the values show up in my dataframe?
qrt_1 = {'q1':[1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0]}
qrt_2 = {'q2':[0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0]}
qrt_3 = {'q3':[0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0]}
qrt_4 = {'q4':[0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1]}
year = {'year': [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9]}
value = data_1['Sales']
data = [year, qrt_1, qrt_2, qrt_3, qrt_4]
dataframes = []
for x in data:
    dataframes.append(pd.DataFrame(x))
df = pd.concat(dataframes)
I am expecting a dataframe that contains qrt_1, qrt_2, etc. under their corresponding column names.
Try using axis=1 in pd.concat:
df = pd.concat(dataframes, axis=1)
print(df)
Prints:
year q1 q2 q3 q4
0 1 1 0 0 0
1 1 0 1 0 0
2 1 0 0 1 0
3 1 0 0 0 1
4 2 1 0 0 0
5 2 0 1 0 0
6 2 0 0 1 0
7 2 0 0 0 1
8 3 1 0 0 0
9 3 0 1 0 0
10 3 0 0 1 0
11 3 0 0 0 1
12 4 1 0 0 0
13 4 0 1 0 0
14 4 0 0 1 0
15 4 0 0 0 1
16 5 1 0 0 0
17 5 0 1 0 0
18 5 0 0 1 0
19 5 0 0 0 1
20 6 1 0 0 0
21 6 0 1 0 0
22 6 0 0 1 0
23 6 0 0 0 1
24 7 1 0 0 0
25 7 0 1 0 0
26 7 0 0 1 0
27 7 0 0 0 1
28 8 1 0 0 0
29 8 0 1 0 0
30 8 0 0 1 0
31 8 0 0 0 1
32 9 1 0 0 0
33 9 0 1 0 0
34 9 0 0 1 0
35 9 0 0 0 1
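For context, the NaNs come from the default axis=0: concatenating along rows stacks frames with disjoint columns, so each frame contributes NaN in every other frame's columns. A minimal illustration:

```python
import pandas as pd

a = pd.DataFrame({'q1': [1, 0]})
b = pd.DataFrame({'q2': [0, 1]})

# axis=0 stacks rows: 'q1' is NaN in b's rows and 'q2' is NaN in a's rows.
rows = pd.concat([a, b])
print(rows.isna().sum().tolist())   # [2, 2]

# axis=1 aligns on the index and places the frames side by side: no NaN.
cols = pd.concat([a, b], axis=1)
print(cols.isna().sum().tolist())   # [0, 0]
```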
I have a dataframe and, for each run of consecutive occurrences greater than 0, I need to place the sum of the run at its last occurrence. My code is below:
data = {
    'id': [7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'timeatAcc': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
                  1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0]
}
df = pd.DataFrame(data, columns=['id', 'timeatAcc'])
df['consecutive'] = df['id'].groupby(
    (df['timeatAcc'] != df['timeatAcc'].shift()).cumsum()
).transform('size') * df['timeatAcc']
print(df)
Current output: [image]
Expected output: [image]
Need help and thanks in advance
Let's try groupby().diff():
df['Occurences'] = df.groupby('id')['timeatAcc'].diff(-1).eq(1).astype(int)
Output:
id timeatAcc Occurences
0 7 0 0
1 7 0 0
2 7 0 0
3 7 0 0
4 7 0 0
5 7 0 0
6 7 0 0
7 7 0 0
8 7 1 0
9 7 1 0
10 7 1 1
11 7 0 0
12 7 0 0
13 7 1 0
14 7 1 1
15 7 0 0
16 7 0 0
17 7 1 0
18 7 1 0
19 7 1 0
20 1 1 0
21 1 1 0
22 1 1 1
23 1 0 0
24 1 0 0
25 1 0 0
26 1 0 0
27 1 0 0
28 1 1 0
29 1 1 0
30 1 1 1
31 1 0 0
32 1 0 0
33 1 1 0
34 1 1 1
35 1 0 0
36 1 0 0
37 1 0 0
38 1 0 0
39 1 0 0
Update: to get the sum instead of 1:
df['Occurences'] = df.groupby(['id', df['timeatAcc'].eq(0).cumsum()])['timeatAcc'].transform('sum')
df['Occurences'] = np.where(df.groupby('id')['timeatAcc'].diff(-1).eq(1).astype(int)
, df['Occurences'], 0)
Output:
id timeatAcc Occurences
0 7 0 0
1 7 0 0
2 7 0 0
3 7 0 0
4 7 0 0
5 7 0 0
6 7 0 0
7 7 0 0
8 7 1 0
9 7 1 0
10 7 1 3
11 7 0 0
12 7 0 0
13 7 1 0
14 7 1 2
15 7 0 0
16 7 0 0
17 7 1 0
18 7 1 0
19 7 1 0
20 1 1 0
21 1 1 0
22 1 1 3
23 1 0 0
24 1 0 0
25 1 0 0
26 1 0 0
27 1 0 0
28 1 1 0
29 1 1 0
30 1 1 3
31 1 0 0
32 1 0 0
33 1 1 0
34 1 1 2
35 1 0 0
36 1 0 0
37 1 0 0
38 1 0 0
39 1 0 0
I have two dictionaries, each holding ten 3000-by-3000 DataFrames at keys 0 through 9. All the values in the DataFrames are ints, and I just want to divide the values element-wise. The first loop below only replaces the diagonal (index == column) values with 0, and I personally do not think this loop is slowing the process down. The second loop is the runtime problem (I believe), since there is too much data to compute. Please see the code below.
for a in range(10):
    for aa in range(len(dict_cat4[a])):
        dict_cat4[a].iloc[aa, aa] = 0
        dict_amt4[a].iloc[aa, aa] = 0

for b in range(10):
    temp_df3 = dict_amt4[b] / dict_cat4[b]
    temp_df3.replace(np.nan, 0.0, inplace=True)
    dict_div4[b] = temp_df3
The problem is that this takes forever to compute since the dataset is very big. Is there an efficient way to rewrite my loops? It has now been running for 60+ minutes. Please let me know! Thanks
-----------------edit------------------
Below is sample input and output of first loop
Output:dict_amt4[0]
DNA Cat2 ... Func
Item A B C D E F F H I J ... ZDF11 ZDF12 ZDF13 ZDF14 ZDF15 ZDF16 ZDF17 ZDF18 ZDF19 ZDF20
DNA Item
Cat2 A 0 62 174 0 4 46 46 7 2 15 ... 4 17 45 1 2 0 0 0 0 0
B 62 0 27 0 0 12 61 2 4 11 ... 6 9 14 1 0 0 0 0 0 0
C 174 27 0 0 0 13 22 5 2 4 ... 0 2 8 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
E 4 0 0 0 0 10 57 33 4 5 ... 0 0 0 0 0 0 2 0 0 0
F 46 12 13 0 10 0 4 5 0 0 ... 0 33 2 0 0 0 2 3 0 0
.............
And second loop is below
Input:dict_amt4[0]
DNA Cat2 ... Func
Item A B C D E F F H I J ... ZDF11 ZDF12 ZDF13 ZDF14 ZDF15 ZDF16 ZDF17 ZDF18 ZDF19 ZDF20
DNA Item
Cat2 A 0 186 174 0 4 46 46 14 2 15 ... 4 17 45 1 2 0 0 0 0 0
B 186 0 27 0 0 12 61 2 4 11 ... 6 9 14 1 0 0 0 0 0 0
C 174 27 0 0 0 130 22 5 2 4 ... 0 2 8 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
E 4 0 0 0 0 10 57 33 4 5 ... 0 0 0 0 0 0 2 0 0 0
F 46 12 13 0 10 0 4 5 0 0 ... 0 33 2 0 0 0 2 3 0 0
.............
Input:dict_cat4[0]
DNA Cat2 ... Func
Item A B C D E F F H I J ... ZDF11 ZDF12 ZDF13 ZDF14 ZDF15 ZDF16 ZDF17 ZDF18 ZDF19 ZDF20
DNA Item
Cat2 A 0 62 174 0 4 46 46 7 2 15 ... 4 17 45 1 2 0 0 0 0 0
B 62 0 27 0 0 12 61 2 4 11 ... 6 9 14 1 0 0 0 0 0 0
C 174 27 0 0 0 13 22 5 2 4 ... 0 2 8 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
E 4 0 0 0 0 10 57 33 4 5 ... 0 0 0 0 0 0 2 0 0 0
F 46 12 13 0 10 0 4 5 0 0 ... 0 33 2 0 0 0 2 3 0 0
.............
Output:dict_div4[0]
DNA Cat2 ... Func
Item A B C D E F F H I J ... ZDF11 ZDF12 ZDF13 ZDF14 ZDF15 ZDF16 ZDF17 ZDF18 ZDF19 ZDF20
DNA Item
Cat2 A 0 3 1 0 1 1 1 2 1 1 ... 1 1 1 1 1 0 0 0 0 0
B 3 0 1 0 0 1 1 1 1 1 ... 1 1 1 1 0 0 0 0 0 0
C 1 1 0 0 0 10 1 1 1 1 ... 0 1 1 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
E 1 0 0 0 0 1 1 1 1 1 ... 0 0 0 0 0 0 1 0 0 0
F 1 1 1 0 1 0 1 1 0 0 ... 0 1 1 0 0 0 1 1 0 0
.............
I made the sample data by hand, so please disregard any typos. As you can see, the first loop just sets dict_cat4[0].iloc[i, i] = 0. The second loop divides every value in dict_amt4[0] by the corresponding value in dict_cat4[0]. Hope this makes more sense.
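Both loops can be vectorized; a sketch using small hypothetical stand-ins for the real frames (the names and sizes here are placeholders, not the actual data). Unlike the original, this does not modify the input frames in place:

```python
import numpy as np
import pandas as pd

# Small random stand-ins for dict_amt4 / dict_cat4 (the real frames are
# 3000x3000, ten per dict).
rng = np.random.default_rng(0)
dict_amt4 = {k: pd.DataFrame(rng.integers(0, 10, (6, 6))) for k in range(3)}
dict_cat4 = {k: pd.DataFrame(rng.integers(0, 5, (6, 6))) for k in range(3)}

dict_div4 = {}
for k in dict_amt4:
    # Zero the diagonal in one vectorized step instead of element-wise .iloc writes.
    eye = np.eye(len(dict_amt4[k]), dtype=bool)
    amt = dict_amt4[k].mask(eye, 0)
    cat = dict_cat4[k].mask(eye, 0)
    # Vectorized division: x/0 -> inf and 0/0 -> NaN, so clean both to 0.
    dict_div4[k] = amt.div(cat).replace([np.inf, -np.inf], np.nan).fillna(0.0)
```

The element-wise `.iloc` writes and the per-cell division are what dominate the runtime in the original; doing everything as whole-frame operations should cut it to seconds.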