Get top 5 matching rows from pandas dataframe for many criteria? - python

I have a dataframe containing many rows of the following form.
> all_rel = pandas.read_csv('../data/sv_abundances.csv')
> all_rel.head()
name day sample count tax_id rel
0 seq00000079;size=189384 204 37 1060 CYCL 0.122275
1 seq00000102;size=143633 204 37 639 SPLEN 0.073711
2 seq00000123;size=118889 204 37 813 723171 0.093782
3 seq00000326;size=50743 204 13 470 553239 0.097571
4 seq00000332;size=49099 204 13 468 TAS 0.097156
My goal is to get the top 5 rows, sorted by the rel column, for each unique combination of day, tax_id, and sample. I have the unique combinations in a dataframe:
#get combinations of days, tax_ids, and samples present in dataset
> t = all_rel.drop_duplicates(['day', 'tax_id', 'sample'])[['day', 'tax_id', 'sample']]
> t.head()
day tax_id sample
0 204 CYCL 37
1 204 SPLEN 37
2 204 723171 37
3 204 553239 13
4 204 TAS 13
The only way I know to accomplish the goal is to use a for loop to iterate over the unique combinations and build up a dataframe.
hacky_df = pandas.DataFrame()
for (day, tax_id, sample) in t.values:
    match = all_rel[(all_rel['tax_id'] == tax_id) & (all_rel['day'] == day) & (all_rel['sample'] == sample)]
    top_5 = match.sort_values('rel', ascending=False).head()
    hacky_df = hacky_df.append(top_5)  # append returns a new frame, so it must be reassigned
hacky_df.head()
But this takes a long time (still hasn't finished) and doesn't take advantage of the fact that these are numpy arrays under the hood. Is there a way to accomplish my goal with a pandas.df.apply call instead of using a for loop?

The following code gave the intended results:
top_5_df = all_rel.sort_values('rel', ascending=False).groupby(['day', 'tax_id', 'sample']).head(5).sort_values(['day', 'sample', 'tax_id'])
print(top_5_df.head(20))
name day sample count tax_id rel
136 seq00025622;size=605 204 13 28 188144 0.005813
2596 seq07169587;size=2 204 13 2 188144 0.000415
2438 seq05675680;size=2 204 13 2 188144 0.000415
2419 seq05517001;size=2 204 13 2 188144 0.000415
2123 seq03049127;size=3 204 13 1 188144 0.000208
4448 seq42562010;size=1 204 13 1 28173 0.000208
60 seq00008910;size=1787 204 13 15 335972 0.003114
1074 seq00182900;size=72 204 13 2 335972 0.000415
2151 seq03232487;size=3 204 13 1 335972 0.000208
3302 seq20519515;size=1 204 13 1 335972 0.000208
2451 seq05760125;size=2 204 13 1 335972 0.000208
750 seq00099976;size=139 204 13 23 428643 0.004775
2546 seq06674971;size=2 204 13 2 428643 0.000415
2207 seq03714229;size=3 204 13 1 428643 0.000208
3234 seq19173942;size=1 204 13 1 428643 0.000208
3201 seq18402810;size=1 204 13 1 428643 0.000208
3 seq00000326;size=50743 204 13 470 553239 0.097571
531 seq00066543;size=216 204 13 45 553239 0.009342
72 seq00010509;size=1528 204 13 17 553239 0.003529
117 seq00021191;size=745 204 13 11 553239 0.002284
df.groupby().head() will call head() on each group independently and return a dataframe of the resulting rows.
Here are the docs: http://pandas.pydata.org/pandas-docs/stable/groupby.html#filtration
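If you would rather not sort the whole frame first, a roughly equivalent alternative is groupby plus apply with nlargest. This is a minimal sketch assuming the same all_rel frame and column names as above:
import pandas

all_rel = pandas.read_csv('../data/sv_abundances.csv')

# For each (day, tax_id, sample) group, keep the 5 rows with the largest 'rel'.
# group_keys=False keeps the original row index instead of prepending the group keys.
top_5_df = (
    all_rel
    .groupby(['day', 'tax_id', 'sample'], group_keys=False)
    .apply(lambda g: g.nlargest(5, 'rel'))
)
print(top_5_df.head(20))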

Related

Pandas Dataframe Reshape/Alteration Question

I feel like this should be an easy solution, but it has eluded me a bit (long week).
Say I have the following Pandas Dataframe (df):
day  x_count  x_max  y_count  y_max
1    8        230    18       127
1    6        174    12       121
1    5        218    21       184
1    11       91     32       162
2    11       128    17       151
2    13       156    16       148
2    18       191    22       120
Etc. How can I collapse it down so that I have one row per day, with each of the other columns summed across all rows for that day?
For example:
day  x_count  x_max  y_count  y_max
1    40       713    93       594
2    42       475    55       419
Is it best to reshape it or simply create a new one?
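A minimal sketch of one way to collapse it, assuming the frame is called df and has the columns shown above: group on day and sum the remaining columns.
import pandas as pd

# Sample rows from the question (truncated; the real data has more rows per day).
df = pd.DataFrame({
    'day':     [1, 1, 1, 1, 2, 2, 2],
    'x_count': [8, 6, 5, 11, 11, 13, 18],
    'x_max':   [230, 174, 218, 91, 128, 156, 191],
    'y_count': [18, 12, 21, 32, 17, 16, 22],
    'y_max':   [127, 121, 184, 162, 151, 148, 120],
})

# One row per day; every other column is summed within that day.
collapsed = df.groupby('day', as_index=False).sum()
print(collapsed)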

Cumulative pandas column "reseting" once threshold is reached

I am facing an issue with the following dataset:
item price
1 1706
2 210
3 1664
4 103
5 103
6 314
7 1664
8 57
9 140
10 1628
11 688
12 180
13 604
14 86
15 180
16 86
17 1616
18 832
19 1038
20 57
21 2343
22 151
23 328
24 328
25 57
26 86
27 1706
28 604
29 609
30 86
31 0
32 57
33 302
34 328
I want to have a cumulative-sum column which "resets" each time it would reach the threshold (read: it must not exceed it; it is fine to have a big gap between the last cumsum value and the threshold, as long as the threshold is not exceeded).
I have tried the following code:
threshold = (7.17*1728)*0.75 #this is equal to 9292.32
df['cumsum'] = df.groupby((df['price'].cumsum()) // threshold)['price'].cumsum()
This outputs the following:
item price cumsum
1 1706 1706
2 210 1916
3 1664 3580
4 103 3683
5 103 3786
6 314 4100
7 1664 5764
8 57 5821
9 140 5961
10 1628 7589
11 688 8277
12 180 8757
13 604 9061
14 86 9147
15 180 9327 #exceeds threshold
16 86 9413 #
17 1616 1616
18 832 2448
19 1038 3486
20 57 3543
21 2343 5886
22 151 6037
23 328 6365
24 328 6693
25 57 6750
26 86 6836
27 1706 8542
28 604 9146
29 609 9755 #exceeds threshold same below
30 86 9841 #
31 0 9841 #
32 57 9898 #
33 302 10200 #
34 328 328
My expected result would be the following instead (for the first part for example):
item price cumsum
1 1706 1706
2 210 1916
3 1664 3580
4 103 3683
5 103 3786
6 314 4100
7 1664 5764
8 57 5821
9 140 5961
10 1628 7589
11 688 8277
12 180 8757
13 604 9061
14 86 9147
15 180 180 #
16 86 266 #
What do I need to change in order to get this result? Also, I would appreciate an explanation of why the above code does not work.
Thank you in advance.
As for why your code does not work: df['price'].cumsum() // threshold groups rows by the global running total, so a new group only starts when the overall total crosses a multiple of the threshold. After the first reset, the sum you actually want to track is the reset sum rather than the global one, so the group boundaries fall in the wrong places and the within-group sums can still exceed the threshold. Since each reset depends on the already-reset running sum, some kind of loop is hard to avoid.
Maybe it costs a lot, but it can work...
threshold = (7.17*1728)*0.75  # this is equal to 9292.32
df['cumsum'] = df['price'].cumsum()
# repeatedly restart the cumulative sum over the rows whose value is >= threshold
n = 1
while True:
    print(n)  # pass counter, just for visibility
    cond = df['cumsum'].ge(threshold)
    if cond.sum():
        df.loc[cond, 'cumsum'] = df.loc[cond, 'price'].cumsum()
    else:
        break
    n += 1
Thank you for all the replies and feedback.
I went ahead with the below code which solves my issue:
ls = []
cumsum = 0
last_reset = 0
for _, row in df.iterrows():
    if cumsum + row.price <= threshold:
        cumsum += row.price
    else:
        last_reset = cumsum
        cumsum = row.price
    ls.append(cumsum)
df['cumsum'] = ls
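If the frame is large, iterrows can be slow; this is a sketch of the same reset logic over a plain NumPy array, assuming df and threshold are defined as above:
import numpy as np

prices = df['price'].to_numpy()
out = np.empty(len(prices), dtype=prices.dtype)

running = 0
for i, p in enumerate(prices):
    # Reset the running total whenever adding the next price would exceed the threshold.
    running = running + p if running + p <= threshold else p
    out[i] = running

df['cumsum'] = out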

Set a maximum value for cells in a csv file [duplicate]

This question already has answers here:
Set maximum value (upper bound) in pandas DataFrame
(3 answers)
Closed 3 years ago.
I have some data that looks like this:
Date X Y Z
0 Jan-18 247 58 163
1 Feb-18 399 52 182
2 Mar-18 269 209 186
3 Apr-18 124 397 353
4 May-18 113 387 35
5 Jun-18 6 23 3
6 Jul-18 335 284 34
7 Aug-18 154 364 72
8 Sep-18 159 291 349
9 Oct-18 199 253 201
10 Nov-18 106 334 117
11 Dec-18 38 274 23
12 Jan-19 6 326 102
13 Feb-19 124 237 339
14 Mar-19 263 68 75
15 Apr-19 121 116 21
Using Python, I want to be able to set a maximum value for each entry. I want the maximum to be 300, so that any entry over 300 (e.g. 326) is changed to 300.
My desired result looks like this:
Date x y z
0 Jan-18 247 58 163
1 Feb-18 300 52 182
2 Mar-18 269 209 186
3 Apr-18 124 300 300
4 May-18 113 300 35
5 Jun-18 6 23 3
6 Jul-18 300 284 34
7 Aug-18 154 300 72
8 Sep-18 159 291 300
9 Oct-18 199 253 201
10 Nov-18 106 300 117
11 Dec-18 38 274 23
12 Jan-19 6 300 102
13 Feb-19 124 237 300
14 Mar-19 263 68 75
15 Apr-19 121 116 21
Is this achievable in Python?
Thanks.
Sure you can, using clip_upper:
df.loc[:,'X':]=df.loc[:,'X':].clip_upper(300)
df
Out[118]:
Date X Y Z
0 Jan-18 247 58 163
1 Feb-18 300 52 182
2 Mar-18 269 209 186
3 Apr-18 124 300 300
4 May-18 113 300 35
5 Jun-18 6 23 3
6 Jul-18 300 284 34
7 Aug-18 154 300 72
8 Sep-18 159 291 300
9 Oct-18 199 253 201
10 Nov-18 106 300 117
11 Dec-18 38 274 23
12 Jan-19 6 300 102
13 Feb-19 124 237 300
14 Mar-19 263 68 75
15 Apr-19 121 116 21
Or
df = df.mask(df > 300, 300)
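Note that clip_upper has been deprecated and removed in later pandas releases; on a current version the equivalent call, assuming the same frame, is clip with the upper keyword:
df.loc[:, 'X':] = df.loc[:, 'X':].clip(upper=300)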

Selectively remove deprecated rows in a pandas dataframe

I have a Dataframe containing data that looks like below.
p,g,a,s,v
15,196,1399,16,5
15,196,948,5,1
15,196,1894,5,1
15,196,1616,5,1
15,196,1742,3,1
15,196,1742,4,4
15,196,1742,5,1
15,195,732,9,2
15,195,1765,11,7
15,196,1815,9,1
15,196,1399,11,8
15,196,1958,0,1
15,195,767,9,1
15,195,1765,11,8
15,195,886,9,1
15,195,1765,11,9
15,196,1958,5,1
15,196,1697,1,1
15,196,1697,4,1
Given multiple entries that have the same p, g, a, and s, I need to drop all but the one with the highest v. The reason is that the original source of this data is a kind of event log, and each line corresponds to a "new total". If it matters, the source data is ordered by time and includes a timestamp index, which I removed for brevity. The entry with the latest date would be the same as the entry with the highest v, as v only increases.
Pulling an example out of the above data, given this:
p,g,a,s,v
15,195,1765,11,7
15,195,1765,11,8
15,195,1765,11,9
I need to drop the first two rows and keep the last one.
If I understand correctly, I think you want the following: it performs a groupby on your cols of interest, takes the max value of column 'v', and then calls reset_index:
In [103]:
df.groupby(['p', 'g', 'a', 's'])['v'].max().reset_index()
Out[103]:
p g a s v
0 15 195 732 9 2
1 15 195 767 9 1
2 15 195 886 9 1
3 15 195 1765 11 9
4 15 196 948 5 1
5 15 196 1399 11 8
6 15 196 1399 16 5
7 15 196 1616 5 1
8 15 196 1697 1 1
9 15 196 1697 4 1
10 15 196 1742 3 1
11 15 196 1742 4 4
12 15 196 1742 5 1
13 15 196 1815 9 1
14 15 196 1894 5 1
15 15 196 1958 0 1
16 15 196 1958 5 1
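If you also need to keep the original timestamp index or any other columns that the groupby/max result discards, an alternative sketch is to sort by v and keep the last row for each (p, g, a, s) key, which keeps the highest-v row together with its index:
# Sort ascending by v, then keep only the last (largest-v) row per (p, g, a, s) key.
deduped = df.sort_values('v').drop_duplicates(subset=['p', 'g', 'a', 's'], keep='last')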

Python Pandas GroupBy().Sum() Having Clause

So I have this DataFrame with 3 columns: 'Order ID', 'Order Qty' and 'Fill Qty'.
I want to sum the Fill Qty per order and then compare it to Order Qty; ideally I will return only a dataframe that gives me the Order IDs whose aggregate Fill Qty is greater than Order Qty.
In SQL I think what I'm looking for is
SELECT * FROM DataFrame GROUP BY Order ID, Order Qty HAVING sum(Fill Qty)>Order Qty
So far I have this:
SumFills= DataFrame.groupby(['Order ID','Order Qty']).sum()
output:
                     Fill Qty
Order ID  Order Qty
1         300            300
2         80              40
3         20              20
4         110            220
5         100            200
6         100            200
The above is already aggregated; I would ideally like to return a list/array of [4, 5, 6], since those are the orders where sum(Fill Qty) > Order Qty.
View original dataframe:
In [57]: print original_df
Order Id Fill Qty Order Qty
0 1 419 334
1 2 392 152
2 3 167 469
3 4 470 359
4 5 447 441
5 6 154 190
6 7 365 432
7 8 209 181
8 9 140 136
9 10 112 358
10 11 384 302
11 12 307 376
12 13 119 237
13 14 147 342
14 15 279 197
15 16 280 137
16 17 148 381
17 18 313 498
18 19 193 328
19 20 291 193
20 21 100 357
21 22 161 286
22 23 453 168
23 24 349 283
Create and view new dataframe summing the Fill Qty:
In [58]: new_df = original_df.groupby(['Order Id','Order Qty'], as_index=False).sum()
In [59]: print new_df
Order Id Order Qty Fill Qty
0 1 334 419
1 2 152 392
2 3 469 167
3 4 359 470
4 5 441 447
5 6 190 154
6 7 432 365
7 8 181 209
8 9 136 140
9 10 358 112
10 11 302 384
11 12 376 307
12 13 237 119
13 14 342 147
14 15 197 279
15 16 137 280
16 17 381 148
17 18 498 313
18 19 328 193
19 20 193 291
20 21 357 100
21 22 286 161
22 23 168 453
23 24 283 349
Slice new dataframe to only those rows where Fill Qty > Order Qty:
In [60]: new_df = new_df.loc[new_df['Fill Qty'] > new_df['Order Qty'],:]
In [61]: print new_df
Order Id Order Qty Fill Qty
0 1 334 419
1 2 152 392
3 4 359 470
4 5 441 447
7 8 181 209
8 9 136 140
10 11 302 384
14 15 197 279
15 16 137 280
19 20 193 291
22 23 168 453
23 24 283 349
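To get the list/array of Order Ids the question asked for, one more step (a sketch against the new_df frame built above) would be:
order_ids = new_df['Order Id'].tolist()
# For the sample session above this gives [1, 2, 4, 5, 8, 9, 11, 15, 16, 20, 23, 24]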
