How can we run SciPy Optimize after doing a group by? - python

I have a dataframe that looks like this.
SectorID MyDate PrevMainCost OutageCost
0 123 10/31/2022 332 193
1 123 9/30/2022 308 269
2 123 8/31/2022 33 440
3 123 7/31/2022 230 147
4 123 6/30/2022 264 184
5 123 5/31/2022 290 46
6 123 4/30/2022 51 150
7 123 3/31/2022 69 253
8 123 2/28/2022 257 308
9 123 1/31/2022 441 349
10 456 10/31/2022 280 188
11 456 9/30/2022 432 150
12 456 8/31/2022 357 307
13 456 7/31/2022 425 45
14 456 6/30/2022 101 278
15 456 5/31/2022 62 240
16 456 4/30/2022 407 46
17 456 3/31/2022 35 218
18 456 2/28/2022 403 113
19 456 1/31/2022 295 200
20 456 12/31/2021 20 235
21 456 11/30/2021 440 403
22 789 10/31/2022 145 181
23 789 9/30/2022 320 259
24 789 8/31/2022 485 472
25 789 7/31/2022 59 24
26 789 6/30/2022 345 64
27 789 5/31/2022 34 480
28 789 4/30/2022 260 162
29 789 3/31/2022 46 399
30 999 10/31/2022 491 346
31 999 9/30/2022 77 212
32 999 8/31/2022 316 112
33 999 7/31/2022 106 351
34 999 6/30/2022 481 356
35 999 5/31/2022 20 269
36 999 4/30/2022 246 268
37 999 3/31/2022 377 173
38 999 2/28/2022 426 413
39 999 1/31/2022 341 168
40 999 12/31/2021 144 471
41 999 11/30/2021 358 393
42 999 10/31/2021 340 197
43 999 9/30/2021 119 252
44 999 8/31/2021 470 203
45 999 7/31/2021 359 163
46 999 6/30/2021 410 383
47 999 5/31/2021 200 119
48 999 4/30/2021 230 291
I am trying to find the minimum of PrevMainCost and OutageCost, after grouping by SectorID. Here's my primitive code.
import numpy as np
import pandas
df = pandas.read_clipboard(sep=r'\s+')
df
df_sum = df.groupby('SectorID').sum()
df_sum
df_sum.loc[df_sum['PrevMainCost'] <= df_sum['OutageCost'], 'Result'] = 'Main'
df_sum.loc[df_sum['PrevMainCost'] > df_sum['OutageCost'], 'Result'] = 'Out'
Result (the Result column flags whether PrevMainCost or OutageCost is lower):
PrevMainCost OutageCost Result
SectorID
123 2275 2339 Main
456 3257 2423 Out
789 1694 2041 Main
999 5511 5140 Out
I am trying to figure out how to use SciPy optimization to solve this problem. I Googled and came up with this simple code sample.
from scipy.optimize import *
df_sum.groupby(by=['SectorID']).apply(lambda g: minimize(equation, g.Result, options={'xtol':0.001}).x)
When I run that, I get an error saying 'NameError: name 'equation' is not defined'.
How can I find the minimum of either the preventative maintenance cost or the outage cost, after grouping by SectorID? Also, how can I add some kind of constraint, such as no more than 30% of all resources can be used by any one SectorID?
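For what it's worth, one way to phrase both parts (pick the cheaper option per sector, subject to a 30% cap) is as a linear program. The sketch below is an assumption-laden illustration, not the poster's method: it reuses df_sum from above, treats x[i] as the fraction of sector i's work handled by preventative maintenance, and caps each sector's spend at 30% of a fixed budget B, which is a placeholder you would replace with the real resource pool.
import numpy as np
from scipy.optimize import linprog
m = df_sum['PrevMainCost'].to_numpy(dtype=float)  # summed maintenance cost per sector
o = df_sum['OutageCost'].to_numpy(dtype=float)    # summed outage cost per sector
# Total cost = sum(o) + sum((m - o) * x), so minimize c @ x with c = m - o.
c = m - o
B = (m + o).sum()  # placeholder budget (assumption); swap in the real resource pool
# Sector i's spend is o[i] + (m[i] - o[i]) * x[i]; keep it <= 0.3 * B.
A_ub = np.diag(m - o)
b_ub = 0.3 * B - o
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * len(m))
print(res.x)  # values near 1 favor maintenance, near 0 favor outage
Without the budget constraint the problem collapses to the row-wise minimum, which plain pandas already gives you: df_sum[['PrevMainCost', 'OutageCost']].min(axis=1).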

Related

Removing duplicate entries and extracting desired information

I have a matrix that looks like this:
DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1.2e+03 16 44 23 49
DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.3e-27 2 121 264 383
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 3.7 2 96 5 95
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 5e-20 3 115 133 260
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1.3e+03 3 21 277 295
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 4.1e+03 14 29 345 360
DNA_pol3_beta 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.9e-18 1 121 1 121
DNA_pol3_beta 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 4.1e+02 30 80 157 209
DNA_pol3_beta 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 0.94 2 101 273 369
SMC_N 220 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 1.2e-14 3 199 19 351
AAA_21 303 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00011 1 32 40 68
AAA_21 303 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.0015 231 300 279 352
AAA_15 369 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 4e-05 4 53 19 67
AAA_15 369 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 8.8e+02 347 363 332 348
AAA_23 200 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.0014 3 41 22 60
I want to filter the results so that, for example, for the item "DNA_pol3_beta_3" there are 2 entries; out of these two, I want to extract only the row whose value in the 5th column is the lowest. So out of the two entries:
DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.3e-27 2 121 264 383
the above one should be in the result. Similarly, for "DNA_pol3_beta_2" there are 4 entries and the program should extract only
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 5e-20 3 115 133 260
because it has the lowest value in the 5th column among the 4. Also, the program should ignore entries whose value in the 5th column is less than 1E-5.
I tried the following code:
for i in range(len(lines) - 1):
    if lines[i + 1] == lines[i]:
        if lines[i + 1][4] > lines[i][4]:
            evalue = lines[i][4]
        else:
            evalue = lines[i + 1][4]
You would be better off using pandas for this. See below:
import pandas as pd
df = pd.read_csv('yourfile.txt', sep=' ', skipinitialspace=True, names=range(9))
df = df[df[4] >= 0.00001]  # ignore entries whose 5th column is below 1e-5
result = df.loc[df.groupby(0)[4].idxmin()].sort_index().reset_index(drop=True)
Output:
>>> print(result)
0 1 2 3 4 5 6 7 8
0 DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1200.00000 16 44 23 49
1 DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 3.70000 2 96 5 95
2 DNA_pol3_beta 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 0.94000 2 101 273 369
3 AAA_21 303 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00011 1 32 40 68
4 AAA_15 369 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00004 4 53 19 67
5 AAA_23 200 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00140 3 41 22 60
If you want to write the result back to a csv file, you can save it with result.to_csv()

Set a maximum value for cells in a csv file [duplicate]

This question already has answers here:
Set maximum value (upper bound) in pandas DataFrame
(3 answers)
Closed 3 years ago.
I have some data that looks like this:
Date X Y Z
0 Jan-18 247 58 163
1 Feb-18 399 52 182
2 Mar-18 269 209 186
3 Apr-18 124 397 353
4 May-18 113 387 35
5 Jun-18 6 23 3
6 Jul-18 335 284 34
7 Aug-18 154 364 72
8 Sep-18 159 291 349
9 Oct-18 199 253 201
10 Nov-18 106 334 117
11 Dec-18 38 274 23
12 Jan-19 6 326 102
13 Feb-19 124 237 339
14 Mar-19 263 68 75
15 Apr-19 121 116 21
Using Python, I want to set a maximum value that each entry can take. I want the maximum to be 300, so that any entry over 300 (e.g. 326) is changed to 300.
My desired result looks like this:
Date x y z
0 Jan-18 247 58 163
1 Feb-18 300 52 182
2 Mar-18 269 209 186
3 Apr-18 124 300 300
4 May-18 113 300 35
5 Jun-18 6 23 3
6 Jul-18 300 284 34
7 Aug-18 154 300 72
8 Sep-18 159 291 300
9 Oct-18 199 253 201
10 Nov-18 106 300 117
11 Dec-18 38 274 23
12 Jan-19 6 300 102
13 Feb-19 124 237 300
14 Mar-19 263 68 75
15 Apr-19 121 116 21
Is this achievable in Python?
Thanks.
Sure, you can, using clip_upper:
df.loc[:, 'X':] = df.loc[:, 'X':].clip_upper(300)
df
Out[118]:
Date X Y Z
0 Jan-18 247 58 163
1 Feb-18 300 52 182
2 Mar-18 269 209 186
3 Apr-18 124 300 300
4 May-18 113 300 35
5 Jun-18 6 23 3
6 Jul-18 300 284 34
7 Aug-18 154 300 72
8 Sep-18 159 291 300
9 Oct-18 199 253 201
10 Nov-18 106 300 117
11 Dec-18 38 274 23
12 Jan-19 6 300 102
13 Feb-19 124 237 300
14 Mar-19 263 68 75
15 Apr-19 121 116 21
Or:
df = df.mask(df > 300, 300)
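Side note (an assumption about your pandas version): clip_upper was deprecated in pandas 0.24 and removed in 1.0, so on current versions the equivalent is clip(upper=...), restricted here to the numeric columns so the Date column is left alone:
num_cols = df.select_dtypes('number').columns  # numeric columns only
df[num_cols] = df[num_cols].clip(upper=300)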

Values in pandas dataframe not getting sorted

I have a dataframe as shown below:
Category 1 2 3 4 5 6 7 8 9 10 11 12 13
A 424 377 161 133 2 81 141 169 297 153 53 50 197
B 231 121 111 106 4 79 68 70 92 93 71 65 66
C 480 379 159 139 2 116 148 175 308 150 98 82 195
D 88 56 38 40 0 25 24 55 84 36 24 26 36
E 1084 1002 478 299 7 256 342 342 695 378 175 132 465
F 497 246 283 206 4 142 151 168 297 224 194 198 148
H 8 5 4 3 0 2 3 2 7 5 3 2 0
G 3191 2119 1656 856 50 826 955 739 1447 1342 975 628 1277
K 58 26 27 51 1 18 22 42 47 35 19 20 14
S 363 254 131 105 6 82 86 121 196 98 81 57 125
T 54 59 20 4 0 9 12 7 36 23 5 4 20
O 554 304 207 155 3 130 260 183 287 204 98 106 195
P 756 497 325 230 5 212 300 280 448 270 201 140 313
PP 64 43 26 17 1 15 35 17 32 28 18 9 27
R 265 157 109 89 1 68 68 104 154 96 63 55 90
S 377 204 201 114 5 112 267 136 209 172 147 90 157
St 770 443 405 234 5 172 464 232 367 270 290 136 294
Qs 47 33 11 14 0 18 14 19 26 17 5 6 13
Y 1806 626 1102 1177 14 625 619 1079 1273 981 845 891 455
W 123 177 27 28 0 18 62 34 64 27 14 4 51
Z 2770 1375 1579 1082 17 900 1630 1137 1465 1383 861 755 1201
I want to sort the dataframe by values in each row. Once done, I want to sort the index also.
For example the values in first row corresponding to category A, should appear as:
2 50 53 81 133 141 153 161 169 197 297 377 424
I have tried df.sort_values(by=df.index.tolist(), ascending=False, axis=1) but this doesn't work; the values don't appear in sorted order at all.
np.sort + sort_index
You can use np.sort along axis=1, then sort_index:
cols, idx = df.columns[1:], df.iloc[:, 0]
res = pd.DataFrame(np.sort(df.iloc[:, 1:].values, axis=1),
                   columns=cols, index=idx).sort_index()
print(res)
1 2 3 4 5 6 7 8 9 10 11 12 \
Category
A 2 50 53 81 133 141 153 161 169 197 297 377
B 4 65 66 68 70 71 79 92 93 106 111 121
C 2 82 98 116 139 148 150 159 175 195 308 379
D 0 24 24 25 26 36 36 38 40 55 56 84
E 7 132 175 256 299 342 342 378 465 478 695 1002
F 4 142 148 151 168 194 198 206 224 246 283 297
G 50 628 739 826 856 955 975 1277 1342 1447 1656 2119
H 0 0 2 2 2 3 3 3 4 5 5 7
K 1 14 18 19 20 22 26 27 35 42 47 51
O 3 98 106 130 155 183 195 204 207 260 287 304
P 5 140 201 212 230 270 280 300 313 325 448 497
PP 1 9 15 17 17 18 26 27 28 32 35 43
Qs 0 5 6 11 13 14 14 17 18 19 26 33
R 1 55 63 68 68 89 90 96 104 109 154 157
S 6 57 81 82 86 98 105 121 125 131 196 254
S 5 90 112 114 136 147 157 172 201 204 209 267
St 5 136 172 232 234 270 290 294 367 405 443 464
T 0 4 4 5 7 9 12 20 20 23 36 54
W 0 4 14 18 27 27 28 34 51 62 64 123
Y 14 455 619 625 626 845 891 981 1079 1102 1177 1273
Z 17 755 861 900 1082 1137 1201 1375 1383 1465 1579 1630
One way is to apply sorted with axis=1, then apply pd.Series to expand the resulting lists into a dataframe, and finally sort by Category:
(df.loc[:, '1':].apply(sorted, axis=1).apply(pd.Series)
   .set_index(df.Category).sort_index())
Category 0 1 2 3 4 5 6 7 8 9 10 ...
0 A 2 50 53 81 133 141 153 161 169 197 297 ...
1 B 4 65 66 68 70 71 79 92 93 106 111 ...

Numpy 1d array subtraction - not getting expected result.

I have 2 arrays of shape (128,). I want the elementwise difference between them.
for idx, x in enumerate(test):
    if idx == 0:
        print(test[idx])
        print()
        print(library[idx])
        print()
        print(np.abs(np.subtract(library[idx], test[idx])))
output:
[186 3 172 80 187 120 127 172 96 213 103 107 137 119 33 53 54 113
200 78 140 234 77 94 151 64 199 218 170 73 152 73 0 5 121 42
0 106 166 80 115 220 56 66 194 187 51 132 55 73 150 83 91 204
108 58 183 0 32 240 255 55 151 255 189 153 77 89 42 176 204 170
93 117 194 195 59 204 149 55 111 255 218 48 72 171 122 163 255 155
198 179 69 173 108 0 0 176 249 214 193 255 106 116 0 47 255 255
255 255 210 175 67 0 95 120 21 158 0 72 120 255 121 208 255 0
61 255]
[189 0 178 72 177 124 123 167 81 235 110 123 139 107 39 54 34 102
195 59 156 255 66 112 161 65 180 236 181 69 142 82 0 0 152 38
0 102 146 86 117 230 59 77 220 182 44 121 63 59 146 41 92 213
146 70 184 0 0 255 255 42 165 255 245 152 114 88 63 138 255 158
96 141 221 201 47 191 179 42 156 255 237 7 136 168 133 142 254 164
236 250 56 202 141 0 0 197 255 184 212 255 108 133 0 7 255 255
255 255 243 197 74 0 50 143 24 175 0 74 101 255 121 207 255 0
146 255]
[ 3 253 6 248 246 4 252 251 241 22 7 16 2 244 6 1 236 245
251 237 16 21 245 18 10 1 237 18 11 252 246 9 0 251 31 252
0 252 236 6 2 10 3 11 26 251 249 245 8 242 252 214 1 9
38 12 1 0 224 15 0 243 14 0 56 255 37 255 21 218 51 244
3 24 27 6 244 243 30 243 45 0 19 215 64 253 11 235 255 9
38 71 243 29 33 0 0 21 6 226 19 0 2 17 0 216 0 0
0 0 33 22 7 0 211 23 3 17 0 2 237 0 0 255 0 0
85 0]
So, reading the output: the last array printed should be the elementwise absolute difference of the first two arrays.
189 - 186 is 3, as expected.
0 - 3 is -3, so abs() should give 3 (not 253).
I must be missing something trivial.
I'd rather not zip and subtract the values as I have a ton of data.
Your arrays probably have dtype uint8; they cannot hold values outside the interval [0, 256), and subtracting 3 from 0 wraps around to 253. The absolute value of 253 is still 253.
Use a different dtype, or restructure your computation to avoid hitting the limits of the dtype you're using.
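A minimal illustration of the wraparound, and of one possible fix (widening to a signed dtype before subtracting):
import numpy as np
a = np.array([3], dtype=np.uint8)
b = np.array([0], dtype=np.uint8)
print(np.abs(b - a))  # [253] -- 0 - 3 wraps around in uint8
print(np.abs(b.astype(np.int16) - a.astype(np.int16)))  # [3] -- widen first, then subtract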
You can simply subtract two numpy arrays; subtraction is an element-wise operation:
>>> test = np.array([1, 2, 3])
>>> library = np.array([1, 1, 1])
>>> np.abs(library - test)
array([0, 1, 2])

How can I group by index and take the mean of each column

I have the following df:
Blades & Razors & Foam Diaper Empty Fem Care HairCare Irrelevant Laundry Oral Care Others Personal Cleaning Care Skin Care
retailer
RTM 158 486 193 2755 3490 1458 889 2921 69 1543 645
RTM 39 0 28 2305 80 27 0 0 0 1207 414
RTM 98 276 121 1090 2359 717 561 911 293 1286 528
RTM 107 484 54 2136 2777 151 80 2191 7 1096 673
RTM 156 465 254 2972 2802 763 867 1065 8 2777 728
RTM 126 326 142 2126 2035 581 575 753 45 1768 292
RTM 0 0 181 1816 1455 598 579 0 2 749 451
RTM 86 374 308 2197 2075 576 698 693 26 1398 212
RTM 132 61 153 2094 1508 180 590 785 66 1519 486
RTM 90 303 8 0 0 18 0 60 0 358 0
RTM 0 14 6 190 198 21 131 75 18 171 0
I want to do a groupby() on my index and then get the average of every column within each group. Any idea how to do so?
To group on index, use:
df.groupby(level=0).mean()
or
df.groupby(df.index).mean()
Sample:
df = pd.DataFrame(data=np.random.random((10, 5)), columns=list('CDEFG'), index=list('AB')*5)
df.head()
C D E F G
A 0.230504 0.830818 0.560533 0.266903 0.745196
B 0.996806 0.861006 0.257780 0.258976 0.738617
A 0.409191 0.688814 0.214247 0.309678 0.565571
B 0.805192 0.940919 0.707562 0.772370 0.122562
A 0.596964 0.935662 0.493612 0.108362 0.673538
Either of the above yields:
C D E F G
A 0.328301 0.560188 0.632549 0.491101 0.343343
B 0.405996 0.490331 0.540921 0.394136 0.466504
