Custom ranking selection with Pandas - python

I have a 100x5 pandas dataframe with values ranging from 0 to 499.
import numpy as np
import pandas as pd

# seed for reproducibility
np.random.seed(3)
sample = pd.DataFrame(np.random.randint(0, 500, size=(100, 5)))
sample.columns = "X Y Z F V".split()
I want to select 10 rows from this dataframe: for each column (separately) I select the rows holding its top 2 values, without selecting any row more than once.
If the same row is selected for two columns, say it holds both the 1st largest value of column X and the 2nd largest of column Y, then randomly keep it for one of them and, for the other column, replace it with the row holding the next largest value (the 3rd largest of X or the 3rd largest of Y, picked at random), repeating until no row is selected for more than one column.
What I have so far:
# convert it to long format and use groupby to get top values and their index - ID
stacked = (
    sample
    .stack()
    .reset_index()
    .rename(columns={"level_0": "ID", "level_1": "Feature", 0: "Value"})
    .set_index("ID")
)
stacked.groupby("Feature").Value.nlargest(2)
Which gives me this
Feature ID Value
F 37 489
F 32 481
V 19 497
V 22 497
X 25 495
X 32 491
Y 17 498
Y 22 496
Z 95 496
Z 45 489
This means I need to select rows from the original dataframe based on those ID values. However, as you can see, the same row is selected twice: ID 22 for columns V and Y, and ID 32 for columns F and X. I could not come up with an implementation of the duplicate-handling logic, so I would be grateful for any help.
One potential approach could be to select the top 5 rows per column with groupby.nlargest(5) and, from that dataframe, pick two rows per column that are ranked as high as possible while avoiding duplicates. Unfortunately, I do not know a pythonic way of doing this. The top-5 table looks as follows (a sketch of how it can be produced appears right after it):
Feature ID Value
0 F 37 489
1 F 32 481
2 F 65 474
3 F 82 470
4 F 66 467
5 V 19 497
6 V 22 497
7 V 11 489
8 V 98 486
9 V 15 484
10 X 25 495
11 X 32 491
12 X 99 490
13 X 76 487
14 X 93 486
15 Y 17 498
16 Y 22 496
17 Y 89 494
18 Y 68 493
19 Y 3 480
20 Z 95 496
21 Z 45 489
22 Z 62 488
23 Z 79 485
24 Z 22 484
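For reference, a minimal sketch of how the top-5 long-format table above can presumably be produced from the stacked dataframe defined earlier (this snippet is an assumption, not part of the original attempt):
top5 = (
    stacked
    .groupby("Feature").Value.nlargest(5)
    .reset_index()   # columns: Feature, ID, Value
)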
The desired result would be (note there is randomness in which duplicates are kept):
X Y Z F V
37 133 212 351 489 106
32 491 135 441 481 427
19 48 445 289 308 497
22 182 496 484 91 497
25 495 444 216 311 267
99 490 164 345 23 365
17 400 498 274 331 183
89 309 494 122 82 140
95 275 213 496 167 98
45 267 246 489 252 17
Maybe I explained the logic in a convoluted way, but comparing the dataset with the desired outcome should make it clearer.
EDIT:
fixed the desired result, it contained a duplicate

Here is one possible way to approach the problem, where we use a set to store the indices corresponding to the top two largest values of each column.
Pseudocode
For every column in the dataframe:
    Drop the indices that were already selected for the previous columns
    Drop the duplicate values
    Sort the values in descending order
    Using set union (|=), add the indices corresponding to the top 2 largest values in the current column
ix = set()
for c in df.columns:  # df refers to the question's `sample` dataframe
    s = df[c].drop(ix).drop_duplicates().sort_values(ascending=False)
    ix |= set(s.index[:2])
>>> df.loc[list(ix)]  # pass a list; newer pandas versions reject a set as an indexer
X Y Z F V
32 491 135 441 481 427
65 274 320 455 474 437
37 133 212 351 489 106
11 375 54 0 192 489
45 267 246 489 252 17
17 400 498 274 331 183
19 48 445 289 308 497
22 182 496 484 91 497
25 495 444 216 311 267
95 275 213 496 167 98
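The walk over the columns above is deterministic, so collisions are always resolved in favour of the earlier column. If the randomness asked for in the question matters, one possible variation (a sketch, not part of the original answer) is to visit the columns in a random order:
import numpy as np

ix = set()
for c in np.random.permutation(df.columns):  # random column order -> random collision resolution
    s = df[c].drop(ix).drop_duplicates().sort_values(ascending=False)
    ix |= set(s.index[:2])
df.loc[list(ix)]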

I came up with a solution, although it is not the most elegant:
indexes = []
while len(indexes) < 2 * len(sample.columns):
    # drop the rows already selected, then take the index of the current column's maximum
    new_df = sample.drop(index=indexes)
    if new_df.iloc[:, len(indexes) // 2].idxmax() not in indexes:
        indexes.append(new_df.iloc[:, len(indexes) // 2].idxmax())
sample.iloc[indexes]
Output:
>>> print(sample.iloc[indexes])
X Y Z F V
25 495 444 216 311 267
32 491 135 441 481 427
17 400 498 274 331 183
22 182 496 484 91 497
95 275 213 496 167 98
45 267 246 489 252 17
37 133 212 351 489 106
65 274 320 455 474 437
19 48 445 289 308 497
11 375 54 0 192 489

Related

How can we run Scipy Optimize Based After Doing Group By?

I have a dataframe that looks like this.
SectorID MyDate PrevMainCost OutageCost
0 123 10/31/2022 332 193
1 123 9/30/2022 308 269
2 123 8/31/2022 33 440
3 123 7/31/2022 230 147
4 123 6/30/2022 264 184
5 123 5/31/2022 290 46
6 123 4/30/2022 51 150
7 123 3/31/2022 69 253
8 123 2/28/2022 257 308
9 123 1/31/2022 441 349
10 456 10/31/2022 280 188
11 456 9/30/2022 432 150
12 456 8/31/2022 357 307
13 456 7/31/2022 425 45
14 456 6/30/2022 101 278
15 456 5/31/2022 62 240
16 456 4/30/2022 407 46
17 456 3/31/2022 35 218
18 456 2/28/2022 403 113
19 456 1/31/2022 295 200
20 456 12/31/2021 20 235
21 456 11/30/2021 440 403
22 789 10/31/2022 145 181
23 789 9/30/2022 320 259
24 789 8/31/2022 485 472
25 789 7/31/2022 59 24
26 789 6/30/2022 345 64
27 789 5/31/2022 34 480
28 789 4/30/2022 260 162
29 789 3/31/2022 46 399
30 999 10/31/2022 491 346
31 999 9/30/2022 77 212
32 999 8/31/2022 316 112
33 999 7/31/2022 106 351
34 999 6/30/2022 481 356
35 999 5/31/2022 20 269
36 999 4/30/2022 246 268
37 999 3/31/2022 377 173
38 999 2/28/2022 426 413
39 999 1/31/2022 341 168
40 999 12/31/2021 144 471
41 999 11/30/2021 358 393
42 999 10/31/2021 340 197
43 999 9/30/2021 119 252
44 999 8/31/2021 470 203
45 999 7/31/2021 359 163
46 999 6/30/2021 410 383
47 999 5/31/2021 200 119
48 999 4/30/2021 230 291
I am trying to find the minimum of PrevMainCost and OutageCost, after grouping by SectorID. Here's my primitive code.
import numpy as np
import pandas
df = pandas.read_clipboard(sep='\\s+')
df
df_sum = df.groupby('SectorID').sum()
df_sum
df_sum.loc[df_sum['PrevMainCost'] <= df_sum['OutageCost'], 'Result'] = 'Main'
df_sum.loc[df_sum['PrevMainCost'] > df_sum['OutageCost'], 'Result'] = 'Out'
Result (the Result column flags whether PrevMainCost or OutageCost is lower):
PrevMainCost OutageCost Result
SectorID
123 2275 2339 Main
456 3257 2423 Out
789 1694 2041 Main
999 5511 5140 Out
I am trying to figure out how to use Scipy Optimization to solve this problem. I Googled this problem and came up with this simple code sample.
from scipy.optimize import *
df_sum.groupby(by=['SectorID']).apply(lambda g: minimize(equation, g.Result, options={'xtol':0.001}).x)
When I run that, I get an error saying 'NameError: name 'equation' is not defined'.
How can I find the minimum of either the preventative maintenance cost or the outage cost, after grouping by SectorID? Also, how can I add some kind of constraint, such as no more than 30% of all resources being used by any one particular SectorID?
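For the per-sector minimum itself (the first part of the question, without scipy), a minimal sketch, assuming the df_sum built above:
import numpy as np

# per-sector minimum of the two summed costs
df_sum['MinCost'] = df_sum[['PrevMainCost', 'OutageCost']].min(axis=1)
# the same Main/Out flag as above, written as a single expression
df_sum['Result'] = np.where(df_sum['PrevMainCost'] <= df_sum['OutageCost'], 'Main', 'Out')
The scipy.optimize call itself fails because equation is never defined; an objective function and the 30% constraint would have to be written explicitly before minimize can do anything useful.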

How do I drop the 0, 1 and 2?

I have the code below, trying to check dob_years for suspicious values and count the percentage:
df['dob_years'].value_counts()
The result is below
35 614
41 603
40 603
34 597
38 595
42 592
33 577
39 572
31 556
36 553
44 543
29 543
30 536
48 536
37 531
43 510
50 509
32 506
49 505
28 501
45 494
27 490
52 483
56 482
47 480
54 476
46 469
58 461
57 457
53 457
51 446
55 441
59 441
26 406
60 376
25 356
61 353
62 351
63 268
24 263
64 263
23 252
65 194
66 183
22 183
67 167
21 110
0 100
68 99
69 83
2 76
70 65
71 58
20 51
1 47
72 33
19 14
73 8
74 6
75 1
How do I drop the ages showing as 0, 1, and 2?
I tried the code below but it didn't work
df.drop(df[(df['dob_years'] = 0) & (df['dob_years'] = 1)].index, inplace=True)
The statement df['dob_years'].value_counts() takes the df['dob_years'] series and returns another series. The result in your question is a series whose index holds the dob_years values and whose values are the counts.
To follow the suggestions from Jon Clements and others, you will have to convert it into a DataFrame using the to_frame function. Consider this code:
import pandas as pd
# create the data frame
df = pd.read_csv('dob.csv')
# create count series and convert it in to a DataFrame
df1 = df['dob_years'].value_counts().to_frame("counts")
# convert DataFrame index in to a column
df1.reset_index(inplace=True)
# rename the column index to dob_years
df1 = df1.rename(columns = {'index':'dob_years'})
# dropping the required rows from DataFrame
df1 = df1[~df1['dob_years'].isin([0, 1, 2])]
print(df1)
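If the goal is to drop those ages from the original dataframe itself rather than from the counts, a minimal sketch assuming the same df:
# keep only the rows whose dob_years is not 0, 1 or 2
df = df[~df['dob_years'].isin([0, 1, 2])]
df['dob_years'].value_counts()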

Removing duplicate entries and extracting desired information

I have a matrix that looks like this:
DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1.2e+03 16 44 23 49
DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.3e-27 2 121 264 383
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 3.7 2 96 5 95
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 5e-20 3 115 133 260
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1.3e+03 3 21 277 295
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 4.1e+03 14 29 345 360
DNA_pol3_beta 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.9e-18 1 121 1 121
DNA_pol3_beta 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 4.1e+02 30 80 157 209
DNA_pol3_beta 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 0.94 2 101 273 369
SMC_N 220 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 1.2e-14 3 199 19 351
AAA_21 303 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00011 1 32 40 68
AAA_21 303 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.0015 231 300 279 352
AAA_15 369 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 4e-05 4 53 19 67
AAA_15 369 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 8.8e+02 347 363 332 348
AAA_23 200 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.0014 3 41 22 60
I want to filter the results so that, for example, for the item "DNA_pol3_beta_3" there are 2 entries; out of these two entries, I want to extract only the row whose value in the 5th column is the lowest. So, out of the two entries:
DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.3e-27 2 121 264 383
the above one should be in the result. Similarly, for "DNA_pol3_beta_2" there are 4 entries and the program should extract only
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 5e-20 3 115 133 260
because it has the lowest 5th-column value among the 4. Also, the program should ignore entries whose value in the 5th column is less than 1E-5.
I tried the following code:
for i in lines:
    if lines[i+1] == lines[i]:
        if lines[i+1][4] > lines[i][4]:
            evalue = lines[i][4]
        else:
            evalue = lines[i+1][4]
You would be better off using pandas for this. See below:
import pandas as pd

df = pd.read_csv('yourfile.txt', sep=' ', skipinitialspace=True, names=range(9))
# ignore entries whose 5th column (index 4) is below 1e-5
df = df[df[4] >= 0.00001]
# for each name in column 0, keep the row with the lowest remaining value in column 4
result = df.loc[df.groupby(0)[4].idxmin()].sort_index().reset_index(drop=True)
Output:
>>> print(result)
0 1 2 3 4 5 6 7 8
0 DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1200.00000 16 44 23 49
1 DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 3.70000 2 96 5 95
2 DNA_pol3_beta 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 0.94000 2 101 273 369
3 AAA_21 303 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00011 1 32 40 68
4 AAA_15 369 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00004 4 53 19 67
5 AAA_23 200 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00140 3 41 22 60
If you want the data back in a csv file, you can save it with to_csv().
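For example, a minimal sketch (the output filename is only an assumption):
# write the filtered rows back out, space-separated like the input
result.to_csv('filtered.txt', sep=' ', index=False, header=False)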

Values in pandas dataframe not getting sorted

I have a dataframe as shown below:
Category 1 2 3 4 5 6 7 8 9 10 11 12 13
A 424 377 161 133 2 81 141 169 297 153 53 50 197
B 231 121 111 106 4 79 68 70 92 93 71 65 66
C 480 379 159 139 2 116 148 175 308 150 98 82 195
D 88 56 38 40 0 25 24 55 84 36 24 26 36
E 1084 1002 478 299 7 256 342 342 695 378 175 132 465
F 497 246 283 206 4 142 151 168 297 224 194 198 148
H 8 5 4 3 0 2 3 2 7 5 3 2 0
G 3191 2119 1656 856 50 826 955 739 1447 1342 975 628 1277
K 58 26 27 51 1 18 22 42 47 35 19 20 14
S 363 254 131 105 6 82 86 121 196 98 81 57 125
T 54 59 20 4 0 9 12 7 36 23 5 4 20
O 554 304 207 155 3 130 260 183 287 204 98 106 195
P 756 497 325 230 5 212 300 280 448 270 201 140 313
PP 64 43 26 17 1 15 35 17 32 28 18 9 27
R 265 157 109 89 1 68 68 104 154 96 63 55 90
S 377 204 201 114 5 112 267 136 209 172 147 90 157
St 770 443 405 234 5 172 464 232 367 270 290 136 294
Qs 47 33 11 14 0 18 14 19 26 17 5 6 13
Y 1806 626 1102 1177 14 625 619 1079 1273 981 845 891 455
W 123 177 27 28 0 18 62 34 64 27 14 4 51
Z 2770 1375 1579 1082 17 900 1630 1137 1465 1383 861 755 1201
I want to sort the values within each row of the dataframe. Once that is done, I want to sort the index as well.
For example, the values in the first row, corresponding to category A, should appear as:
2 50 53 81 133 141 153 161 169 197 297 377 424
I have tried df.sort_values(by=df.index.tolist(), ascending=False, axis=1), but this doesn't work; the values don't appear in sorted order at all.
np.sort + sort_index
You can use np.sort along axis=1, then sort_index:
import numpy as np
import pandas as pd

cols, idx = df.columns[1:], df.iloc[:, 0]
res = pd.DataFrame(np.sort(df.iloc[:, 1:].values, axis=1), columns=cols, index=idx)\
        .sort_index()
print(res)
1 2 3 4 5 6 7 8 9 10 11 12 \
Category
A 2 50 53 81 133 141 153 161 169 197 297 377
B 4 65 66 68 70 71 79 92 93 106 111 121
C 2 82 98 116 139 148 150 159 175 195 308 379
D 0 24 24 25 26 36 36 38 40 55 56 84
E 7 132 175 256 299 342 342 378 465 478 695 1002
F 4 142 148 151 168 194 198 206 224 246 283 297
G 50 628 739 826 856 955 975 1277 1342 1447 1656 2119
H 0 0 2 2 2 3 3 3 4 5 5 7
K 1 14 18 19 20 22 26 27 35 42 47 51
O 3 98 106 130 155 183 195 204 207 260 287 304
P 5 140 201 212 230 270 280 300 313 325 448 497
PP 1 9 15 17 17 18 26 27 28 32 35 43
Qs 0 5 6 11 13 14 14 17 18 19 26 33
R 1 55 63 68 68 89 90 96 104 109 154 157
S 6 57 81 82 86 98 105 121 125 131 196 254
S 5 90 112 114 136 147 157 172 201 204 209 267
St 5 136 172 232 234 270 290 294 367 405 443 464
T 0 4 4 5 7 9 12 20 20 23 36 54
W 0 4 14 18 27 27 28 34 51 62 64 123
Y 14 455 619 625 626 845 891 981 1079 1102 1177 1273
Z 1 17 755 861 900 1082 1137 1375 1383 1465 1579 1630
One way is to apply sorted along axis=1, apply pd.Series to get a dataframe back instead of lists, and finally set Category as the index and sort by it:
(df.loc[:, '1':].apply(sorted, axis=1).apply(pd.Series)
   .set_index(df.Category).sort_index())
Category 0 1 2 3 4 5 6 7 8 9 10 ...
0 A 2 50 53 81 133 141 153 161 169 197 297 ...
1 B 4 65 66 68 70 71 79 92 93 106 111 ...

Python Pandas GroupBy().Sum() Having Clause

So I have this DataFrame with 3 columns: 'Order ID', 'Order Qty' and 'Fill Qty'.
I want to sum the Fill Qty per order and then compare it to the Order Qty; ideally I will return only a dataframe that gives me the Order ID whenever the aggregate Fill Qty is greater than the Order Qty.
In SQL I think what I'm looking for is
SELECT * FROM DataFrame GROUP BY Order ID, Order Qty HAVING sum(Fill Qty)>Order Qty
So far I have this:
SumFills= DataFrame.groupby(['Order ID','Order Qty']).sum()
output:
                    Fill Qty
Order ID  Order Qty
1         300            300
2          80             40
3          20             20
4         110            220
5         100            200
6         100            200
The above is aggregated already; I would ideally like to return a list/array of [4, 5, 6], since those have sum(Fill Qty) > Order Qty.
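As an aside, a minimal sketch of a direct pandas analogue of the HAVING clause above, assuming DataFrame is the original (unaggregated) dataframe from the question; this is not part of the answer that follows:
# keep only the groups whose total Fill Qty exceeds that group's Order Qty
having = DataFrame.groupby(['Order ID', 'Order Qty']).filter(
    lambda g: g['Fill Qty'].sum() > g['Order Qty'].iloc[0]
)
having['Order ID'].unique()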
View original dataframe:
In [57]: print original_df
Order Id Fill Qty Order Qty
0 1 419 334
1 2 392 152
2 3 167 469
3 4 470 359
4 5 447 441
5 6 154 190
6 7 365 432
7 8 209 181
8 9 140 136
9 10 112 358
10 11 384 302
11 12 307 376
12 13 119 237
13 14 147 342
14 15 279 197
15 16 280 137
16 17 148 381
17 18 313 498
18 19 193 328
19 20 291 193
20 21 100 357
21 22 161 286
22 23 453 168
23 24 349 283
Create and view new dataframe summing the Fill Qty:
In [58]: new_df = original_df.groupby(['Order Id','Order Qty'], as_index=False).sum()
In [59]: print new_df
Order Id Order Qty Fill Qty
0 1 334 419
1 2 152 392
2 3 469 167
3 4 359 470
4 5 441 447
5 6 190 154
6 7 432 365
7 8 181 209
8 9 136 140
9 10 358 112
10 11 302 384
11 12 376 307
12 13 237 119
13 14 342 147
14 15 197 279
15 16 137 280
16 17 381 148
17 18 498 313
18 19 328 193
19 20 193 291
20 21 357 100
21 22 286 161
22 23 168 453
23 24 283 349
Slice new dataframe to only those rows where Fill Qty > Order Qty:
In [60]: new_df = new_df.loc[new_df['Fill Qty'] > new_df['Order Qty'],:]
In [61]: print new_df
Order Id Order Qty Fill Qty
0 1 334 419
1 2 152 392
3 4 359 470
4 5 441 447
7 8 181 209
8 9 136 140
10 11 302 384
14 15 197 279
15 16 137 280
19 20 193 291
22 23 168 453
23 24 283 349
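If only the list of order IDs is needed rather than a dataframe, a small follow-up sketch using the new_df built above:
ids = new_df['Order Id'].tolist()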
