I feel like this should be an easy solution, but it has eluded me a bit (long week).
Say I have the following pandas DataFrame (df):
day  x_count  x_max  y_count  y_max
1    8        230    18       127
1    6        174    12       121
1    5        218    21       184
1    11       91     32       162
2    11       128    17       151
2    13       156    16       148
2    18       191    22       120
Etc. How can I collapse it down so that I have one row per day, with each of the other columns summed across all rows for that day?
For example:
day  x_count  x_max  y_count  y_max
1    40       713    93       594
2    42       475    55       419
Is it best to reshape it or simply create a new one?
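A minimal sketch of the usual approach, assuming the frame is named df with the columns shown above: group on day and sum the remaining columns.

import pandas as pd

# sample data matching the question
df = pd.DataFrame({
    'day':     [1, 1, 1, 1, 2, 2, 2],
    'x_count': [8, 6, 5, 11, 11, 13, 18],
    'x_max':   [230, 174, 218, 91, 128, 156, 191],
    'y_count': [18, 12, 21, 32, 17, 16, 22],
    'y_max':   [127, 121, 184, 162, 151, 148, 120],
})

# one row per day; every other column is summed within that day
daily = df.groupby('day', as_index=False).sum()
print(daily)

No reshaping is needed: groupby builds a new aggregated frame and leaves df untouched.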
I am facing an issue with the following dataset:
item price
1 1706
2 210
3 1664
4 103
5 103
6 314
7 1664
8 57
9 140
10 1628
11 688
12 180
13 604
14 86
15 180
16 86
17 1616
18 832
19 1038
20 57
21 2343
22 151
23 328
24 328
25 57
26 86
27 1706
28 604
29 609
30 86
31 0
32 57
33 302
34 328
I want to have a cumulative sum column which "resets" each time it would exceed the threshold (the running total must never exceed the threshold; it is fine if there is a large gap between the last cumsum value and the threshold, as long as the threshold is not exceeded).
I have tried the following code:
threshold = (7.17*1728)*0.75 #this is equal to 9292.32
df['cumsum'] = df.groupby((df['price'].cumsum()) // threshold)['price'].cumsum()
This outputs the following:
item price cumsum
1 1706 1706
2 210 1916
3 1664 3580
4 103 3683
5 103 3786
6 314 4100
7 1664 5764
8 57 5821
9 140 5961
10 1628 7589
11 688 8277
12 180 8757
13 604 9061
14 86 9147
15 180 9327 #exceeds threshold
16 86 9413 #
17 1616 1616
18 832 2448
19 1038 3486
20 57 3543
21 2343 5886
22 151 6037
23 328 6365
24 328 6693
25 57 6750
26 86 6836
27 1706 8542
28 604 9146
29 609 9755 #exceeds threshold same below
30 86 9841 #
31 0 9841 #
32 57 9898 #
33 302 10200 #
34 328 328
My expected result would be the following instead (for the first part for example):
item price cumsum
1 1706 1706
2 210 1916
3 1664 3580
4 103 3683
5 103 3786
6 314 4100
7 1664 5764
8 57 5821
9 140 5961
10 1628 7589
11 688 8277
12 180 8757
13 604 9061
14 86 9147
15 180 180 #
16 86 266 #
What do I need to change in order to get this result? I would also appreciate an explanation of why the above code does not work.
Thank you in advance.
It may cost a lot, but it can work. First, why your code does not behave as expected: the group key df['price'].cumsum() // threshold is computed from the raw running total, which never resets, so the group boundaries fall wherever the raw total crosses a multiple of the threshold, not where a restarted total would next exceed it. Since each reset point depends on where the previous reset happened, the boundaries cannot be precomputed by a single vectorized grouper. One workaround is to repeatedly recompute the cumsum over the rows that still exceed the threshold:
threshold = (7.17*1728)*0.75 #this is equal to 9292.32
df['cumsum'] = df['price'].cumsum()
# handle the cumsums which are >= threshold by looping
n = 1
while True:
    print(n)
    cond = df['cumsum'].ge(threshold)
    if cond.sum():
        # restart the running total over the offending rows only
        df.loc[cond, 'cumsum'] = df.loc[cond, 'price'].cumsum()
    else:
        break
    n += 1
Each pass finalizes one more segment, so the loop runs roughly once per reset.
Thank you for all the replies and feedback.
I went ahead with the below code, which solves my issue:
ls = []
cumsum = 0
last_reset = 0
for _, row in df.iterrows():
    if cumsum + row.price <= threshold:
        cumsum += row.price
    else:
        last_reset = cumsum  # remember the total reached before resetting
        cumsum = row.price
    ls.append(cumsum)
df['cumsum'] = ls
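A hedged side note on the design: iterrows constructs a Series for every row, which is slow on large frames. A minimal equivalent sketch that iterates over the raw price column instead:

out = []
total = 0
for price in df['price']:
    if total + price > threshold:
        total = price  # reset: start the new running total at this price
    else:
        total += price
    out.append(total)
df['cumsum'] = out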
I have some data that looks like this:
Date X Y Z
0 Jan-18 247 58 163
1 Feb-18 399 52 182
2 Mar-18 269 209 186
3 Apr-18 124 397 353
4 May-18 113 387 35
5 Jun-18 6 23 3
6 Jul-18 335 284 34
7 Aug-18 154 364 72
8 Sep-18 159 291 349
9 Oct-18 199 253 201
10 Nov-18 106 334 117
11 Dec-18 38 274 23
12 Jan-19 6 326 102
13 Feb-19 124 237 339
14 Mar-19 263 68 75
15 Apr-19 121 116 21
Using Python I want to be able to set a maximum value for each entry. I want the maximum to be 300, so that any entry over 300 (e.g. 326) is changed to 300.
My desired result looks like this:
Date x y z
0 Jan-18 247 58 163
1 Feb-18 300 52 182
2 Mar-18 269 209 186
3 Apr-18 124 300 300
4 May-18 113 300 35
5 Jun-18 6 23 3
6 Jul-18 300 284 34
7 Aug-18 154 300 72
8 Sep-18 159 291 300
9 Oct-18 199 253 201
10 Nov-18 106 300 117
11 Dec-18 38 274 23
12 Jan-19 6 300 102
13 Feb-19 124 237 300
14 Mar-19 263 68 75
15 Apr-19 121 116 21
Is this achievable in Python?
Thanks.
Sure, you can, using clip_upper:
df.loc[:,'X':]=df.loc[:,'X':].clip_upper(300)
df
Out[118]:
Date X Y Z
0 Jan-18 247 58 163
1 Feb-18 300 52 182
2 Mar-18 269 209 186
3 Apr-18 124 300 300
4 May-18 113 300 35
5 Jun-18 6 23 3
6 Jul-18 300 284 34
7 Aug-18 154 300 72
8 Sep-18 159 291 300
9 Oct-18 199 253 201
10 Nov-18 106 300 117
11 Dec-18 38 274 23
12 Jan-19 6 300 102
13 Feb-19 124 237 300
14 Mar-19 263 68 75
15 Apr-19 121 116 21
Or, masking only the numeric columns (comparing the Date column with 300 would raise a TypeError):
df.loc[:,'X':] = df.loc[:,'X':].mask(df.loc[:,'X':] > 300, 300)
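As an aside, clip_upper was deprecated in pandas 0.24 and removed in 1.0; on current pandas the equivalent call is clip with the upper keyword:

# same result on modern pandas
df.loc[:, 'X':] = df.loc[:, 'X':].clip(upper=300)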
I have a Dataframe containing data that looks like below.
p,g,a,s,v
15,196,1399,16,5
15,196,948,5,1
15,196,1894,5,1
15,196,1616,5,1
15,196,1742,3,1
15,196,1742,4,4
15,196,1742,5,1
15,195,732,9,2
15,195,1765,11,7
15,196,1815,9,1
15,196,1399,11,8
15,196,1958,0,1
15,195,767,9,1
15,195,1765,11,8
15,195,886,9,1
15,195,1765,11,9
15,196,1958,5,1
15,196,1697,1,1
15,196,1697,4,1
Given multiple entries that have the same p, g, a, and s, I need to drop all but the one with the highest v. The reason is that the original source of this data is a kind of event log, and each line corresponds to a "new total". If it matters, the source data is ordered by time and includes a timestamp index, which I removed for brevity. The entry with the latest date would be the same as the entry with the highest v, as v only increases.
Pulling an example out of the above data, given this:
p,g,a,s,v
15,195,1765,11,7
15,195,1765,11,8
15,195,1765,11,9
I need to drop the first two rows and keep the last one.
If I understand correctly, you want the following: perform a groupby on the columns of interest, take the max of column 'v', and then call reset_index:
In [103]:
df.groupby(['p', 'g', 'a', 's'])['v'].max().reset_index()
Out[103]:
p g a s v
0 15 195 732 9 2
1 15 195 767 9 1
2 15 195 886 9 1
3 15 195 1765 11 9
4 15 196 948 5 1
5 15 196 1399 11 8
6 15 196 1399 16 5
7 15 196 1616 5 1
8 15 196 1697 1 1
9 15 196 1697 4 1
10 15 196 1742 3 1
11 15 196 1742 4 4
12 15 196 1742 5 1
13 15 196 1815 9 1
14 15 196 1894 5 1
15 15 196 1958 0 1
16 15 196 1958 5 1
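Since the source data is time-ordered and v only increases, a hedged alternative sketch is drop_duplicates with keep='last', which keeps the final (highest-v) row per group and, unlike the groupby, preserves the surviving rows' original order and index (e.g. the timestamp index the question mentions):

df = df.drop_duplicates(subset=['p', 'g', 'a', 's'], keep='last')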
So I have this DataFrame with 3 columns: 'Order ID', 'Order Qty' and 'Fill Qty'.
I want to sum the Fill Qty per order and then compare it to the Order Qty. Ideally I will return only a DataFrame that gives me the Order ID whenever the aggregate Fill Qty is greater than the Order Qty.
In SQL I think what I'm looking for is
SELECT * FROM DataFrame GROUP BY Order ID, Order Qty HAVING sum(Fill Qty)>Order Qty
So far I have this:
SumFills = DataFrame.groupby(['Order ID', 'Order Qty']).sum()
output:
                     Fill Qty
Order ID  Order Qty
1         300             300
2          80              40
3          20              20
4         110             220
5         100             200
6         100             200
The above is already aggregated; ideally I would like to get back a list/array of [4, 5, 6], since those orders have sum(Fill Qty) > Order Qty.
View original dataframe:
In [57]: print original_df
Order Id Fill Qty Order Qty
0 1 419 334
1 2 392 152
2 3 167 469
3 4 470 359
4 5 447 441
5 6 154 190
6 7 365 432
7 8 209 181
8 9 140 136
9 10 112 358
10 11 384 302
11 12 307 376
12 13 119 237
13 14 147 342
14 15 279 197
15 16 280 137
16 17 148 381
17 18 313 498
18 19 193 328
19 20 291 193
20 21 100 357
21 22 161 286
22 23 453 168
23 24 349 283
Create and view new dataframe summing the Fill Qty:
In [58]: new_df = original_df.groupby(['Order Id','Order Qty'], as_index=False).sum()
In [59]: print new_df
Order Id Order Qty Fill Qty
0 1 334 419
1 2 152 392
2 3 469 167
3 4 359 470
4 5 441 447
5 6 190 154
6 7 432 365
7 8 181 209
8 9 136 140
9 10 358 112
10 11 302 384
11 12 376 307
12 13 237 119
13 14 342 147
14 15 197 279
15 16 137 280
16 17 381 148
17 18 498 313
18 19 328 193
19 20 193 291
20 21 357 100
21 22 286 161
22 23 168 453
23 24 283 349
Slice new dataframe to only those rows where Fill Qty > Order Qty:
In [60]: new_df = new_df.loc[new_df['Fill Qty'] > new_df['Order Qty'],:]
In [61]: print new_df
Order Id Order Qty Fill Qty
0 1 334 419
1 2 152 392
3 4 359 470
4 5 441 447
7 8 181 209
8 9 136 140
10 11 302 384
14 15 197 279
15 16 137 280
19 20 193 291
22 23 168 453
23 24 283 349
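To get just the list of order ids the question asks for, slice the Order Id column and convert it:

# for the random sample above this yields
# [1, 2, 4, 5, 8, 9, 11, 15, 16, 20, 23, 24]
order_ids = new_df['Order Id'].tolist()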