Cumulative pandas column "reseting" once threshold is reached - python

I am facing an issue with the following dataset:
item price
1 1706
2 210
3 1664
4 103
5 103
6 314
7 1664
8 57
9 140
10 1628
11 688
12 180
13 604
14 86
15 180
16 86
17 1616
18 832
19 1038
20 57
21 2343
22 151
23 328
24 328
25 57
26 86
27 1706
28 604
29 609
30 86
31 0
32 57
33 302
34 328
I want to have a cumulative sum column which "resets" each time it reaches the threshold (read not exceed it, it is fine to have a big gap between the last cumsum number and the threshold as long as it does not exceed it).
I have tried the following code:
threshold = (7.17*1728)*0.75 #this is equal to 9292.32
df['cumsum'] = df.groupby((df['price'].cumsum()) // threshold)['price'].cumsum()
This output the following:
item price cumsum
1 1706 1706
2 210 1916
3 1664 3580
4 103 3683
5 103 3786
6 314 4100
7 1664 5764
8 57 5821
9 140 5961
10 1628 7589
11 688 8277
12 180 8757
13 604 9061
14 86 9147
15 180 9327 #exceeds threshold
16 86 9413 #
17 1616 1616
18 832 2448
19 1038 3486
20 57 3543
21 2343 5886
22 151 6037
23 328 6365
24 328 6693
25 57 6750
26 86 6836
27 1706 8542
28 604 9146
29 609 9755 #exceeds threshold same below
30 86 9841 #
31 0 9841 #
32 57 9898 #
33 302 10200 #
34 328 328
My expected result would be the following instead (for the first part for example):
item price cumsum
1 1706 1706
2 210 1916
3 1664 3580
4 103 3683
5 103 3786
6 314 4100
7 1664 5764
8 57 5821
9 140 5961
10 1628 7589
11 688 8277
12 180 8757
13 604 9061
14 86 9147
15 180 180 #
16 86 266 #
What do I need to change in order to get this result? also i would appreciate any explanation as to why the above code does not work.
Thank you in advance.

Maybe it costs a lot, but it can work...
threshold = (7.17*1728)*0.75 #this is equal to 9292.32
df['cumsum'] = df['price'].cumsum()
# handle the cumsum which is gt threshold by loops
n = 1
while True:
print(n)
cond = df['cumsum'].ge(threshold)
if cond.sum():
df.loc[cond, 'cumsum'] = df.loc[cond, 'price'].cumsum()
else:
break
n += 1

Thank you for all the replies and feedback.
I went ahead with the below code which solves my issue:
ls = []
cumsum = 0
lastreset = 0
for _, row in df.iterrows():
if cumsum + row.price <= threshold:
cumsum += row.price
else:
last_reset = cumsum
cumsum = row.price
ls.append(cumsum)
df['cumsum'] = ls

Related

How to plot multiple chart on one figure and combine with another?

# Create an axes object
axes = plt.gca()
# pass the axes object to plot function
df.plot(kind='line', x='鄉鎮別', y='男', ax=axes,figsize=(10,8));
df.plot(kind='line', x='鄉鎮別', y='女', ax=axes,figsize=(10,8));
df.plot(kind='line', x='鄉鎮別', y='合計(男+女)', ax=axes,figsize=(10,8),title='hihii',
xlabel='鄉鎮別',ylabel='人數')
It's my data.
鄉鎮別 鄰數 戶數 男 女 合計(男+女) 遷入 遷出 出生 死亡 結婚 離婚
0 苗栗市 715 32517 42956 43362 86318 212 458 33 65 28 13
1 苑裡鎮 362 15204 22979 21040 44019 118 154 17 24 9 7
2 通霄鎮 394 11557 17034 15178 32212 73 113 5 33 3 3
3 竹南鎮 518 32061 44069 43275 87344 410 392 31 59 35 11
4 頭份市 567 38231 52858 52089 104947 363 404 39 69 31 19
5 後龍鎮 367 12147 18244 16274 34518 93 144 12 41 2 7
6 卓蘭鎮 176 5861 8206 7504 15710 29 51 1 11 2 0
7 大湖鄉 180 5206 7142 6238 13380 31 59 5 21 3 2
8 公館鄉 281 10842 16486 15159 31645 89 169 12 32 5 3
9 銅鑼鄉 218 6106 8887 7890 16777 57 62 7 13 4 1
10 南庄鄉 184 3846 5066 4136 9202 22 48 1 10 0 2
11 頭屋鄉 120 3596 5289 4672 9961 59 53 2 11 4 4
12 三義鄉 161 5625 8097 7205 15302 47 63 3 12 3 5
13 西湖鄉 108 2617 3653 2866 6519 38 20 1 17 3 0
14 造橋鄉 115 4144 6276 5545 11821 44 64 3 11 3 2
15 三灣鄉 93 2331 3395 2832 6227 27 18 2 9 0 2
16 獅潭鄉 98 1723 2300 1851 4151 28 10 1 4 0 0
17 泰安鄉 64 1994 3085 2642 5727 36 26 2 8 4 1
18 總計 4721 195608 276022 259758 535780 1776 2308 177 450 139 82
This my output df.plot
First question is how to display Chinese?
Second is can I use without df.plot to plot line chart?
last question is : There are four graphs(use subplot): the line graphs of male and female population and total population(男、女、合計(男+女)) in each township; the line graphs of in-migration and out-migration(遷入和遷出); the long bar graphs of household number(戶數); and the line graphs of births and deaths(出生和死亡).

Values in pandas dataframe not getting sorted

I have a dataframe as shown below:
Category 1 2 3 4 5 6 7 8 9 10 11 12 13
A 424 377 161 133 2 81 141 169 297 153 53 50 197
B 231 121 111 106 4 79 68 70 92 93 71 65 66
C 480 379 159 139 2 116 148 175 308 150 98 82 195
D 88 56 38 40 0 25 24 55 84 36 24 26 36
E 1084 1002 478 299 7 256 342 342 695 378 175 132 465
F 497 246 283 206 4 142 151 168 297 224 194 198 148
H 8 5 4 3 0 2 3 2 7 5 3 2 0
G 3191 2119 1656 856 50 826 955 739 1447 1342 975 628 1277
K 58 26 27 51 1 18 22 42 47 35 19 20 14
S 363 254 131 105 6 82 86 121 196 98 81 57 125
T 54 59 20 4 0 9 12 7 36 23 5 4 20
O 554 304 207 155 3 130 260 183 287 204 98 106 195
P 756 497 325 230 5 212 300 280 448 270 201 140 313
PP 64 43 26 17 1 15 35 17 32 28 18 9 27
R 265 157 109 89 1 68 68 104 154 96 63 55 90
S 377 204 201 114 5 112 267 136 209 172 147 90 157
St 770 443 405 234 5 172 464 232 367 270 290 136 294
Qs 47 33 11 14 0 18 14 19 26 17 5 6 13
Y 1806 626 1102 1177 14 625 619 1079 1273 981 845 891 455
W 123 177 27 28 0 18 62 34 64 27 14 4 51
Z 2770 1375 1579 1082 17 900 1630 1137 1465 1383 861 755 1201
I want to sort the dataframe by values in each row. Once done, I want to sort the index also.
For example the values in first row corresponding to category A, should appear as:
2 50 53 81 133 141 153 161 169 197 297 377 424
I have tried df.sort_values(by=df.index.tolist(), ascending=False, axis=1) but this doesn't work. The values don't appear in sorted order at all
np.sort + sort_index
You can use np.sort along axis=1, then sort_index:
cols, idx = df.columns[1:], df.iloc[:, 0]
res = pd.DataFrame(np.sort(df.iloc[:, 1:].values, axis=1), columns=cols, index=idx)\
.sort_index()
print(res)
1 2 3 4 5 6 7 8 9 10 11 12 \
Category
A 2 50 53 81 133 141 153 161 169 197 297 377
B 4 65 66 68 70 71 79 92 93 106 111 121
C 2 82 98 116 139 148 150 159 175 195 308 379
D 0 24 24 25 26 36 36 38 40 55 56 84
E 7 132 175 256 299 342 342 378 465 478 695 1002
F 4 142 148 151 168 194 198 206 224 246 283 297
G 50 628 739 826 856 955 975 1277 1342 1447 1656 2119
H 0 0 2 2 2 3 3 3 4 5 5 7
K 1 14 18 19 20 22 26 27 35 42 47 51
O 3 98 106 130 155 183 195 204 207 260 287 304
P 5 140 201 212 230 270 280 300 313 325 448 497
PP 1 9 15 17 17 18 26 27 28 32 35 43
Qs 0 5 6 11 13 14 14 17 18 19 26 33
R 1 55 63 68 68 89 90 96 104 109 154 157
S 6 57 81 82 86 98 105 121 125 131 196 254
S 5 90 112 114 136 147 157 172 201 204 209 267
St 5 136 172 232 234 270 290 294 367 405 443 464
T 0 4 4 5 7 9 12 20 20 23 36 54
W 0 4 14 18 27 27 28 34 51 62 64 123
Y 14 455 619 625 626 845 891 981 1079 1102 1177 1273
Z 1 17 755 861 900 1082 1137 1375 1383 1465 1579 1630
One way is to apply sorted setting 1 as axis, applying pd.Series to return a dataframe instead of a list, and finally sorting by Category:
df.loc[:,'1':].apply(sorted, axis = 1).apply(pd.Series)
.set_index(df.Category).sort_index()
Category 0 1 2 3 4 5 6 7 8 9 10 ...
0 A 2 50 53 81 133 141 153 161 169 197 297 ...
1 B 4 65 66 68 70 71 79 92 93 106 111 ...

Pandas pivot table with mean

I have a pandas data frame, df, that looks like this;
index New Old MAP Limit count
1 93 35 54 > 18 1
2 163 93 116 > 18 1
3 134 78 96 > 18 1
4 117 81 93 > 18 1
5 194 108 136 > 18 1
6 125 57 79 <= 18 1
7 66 39 48 > 18 1
8 120 83 95 > 18 1
9 150 98 115 > 18 1
10 149 99 115 > 18 1
11 148 85 106 > 18 1
12 92 55 67 <= 18 1
13 64 24 37 > 18 1
14 84 53 63 > 18 1
15 99 70 79 > 18 1
I need to create a pivot table that looks like this
Limit <=18 >18
New xx1 xx2
Old xx3 xx4
MAP xx5 xx6
where values xx1, xx2, xx3, xx4, xx5, and xx6 are the mean of New, Old and Map for respective Limit.
How can I achieve this?
I tried the following without success.
table = df.pivot_table('count', index=['New', 'Old', 'MAP'], columns=['Limit'], aggfunc='mean')
Solution
df.groupby('Limit')['New', 'Old', 'MAP'].mean().T
Limit <= 18 > 18
New 108.5 121.615385
Old 56.0 72.769231
MAP 73.0 88.692308

Get top 5 matching rows from pandas dataframe for many criteria?

I have a dataframe containing many rows of the following form.
> all_rel = pandas.read_csv('../data/sv_abundances.csv')
> all_rel.head()
name day sample count tax_id rel
0 seq00000079;size=189384 204 37 1060 CYCL 0.122275
1 seq00000102;size=143633 204 37 639 SPLEN 0.073711
2 seq00000123;size=118889 204 37 813 723171 0.093782
3 seq00000326;size=50743 204 13 470 553239 0.097571
4 seq00000332;size=49099 204 13 468 TAS 0.097156
My goal is to get the top 5 rows sorted by the rel column for each unique combination of day, sample, and count. I have the unique combinations in a dataframe:
#get combinations of days, tax_ids, and samples present in dataset
> t = all_rel.drop_duplicates(['day', 'tax_id', 'sample'])[['day', 'tax_id', 'sample']]
> t.head()
day tax_id sample
0 204 CYCL 37
1 204 SPLEN 37
2 204 723171 37
3 204 553239 13
4 204 TAS 13
The only way I know to accomplish the goal is to use a for loop to iterate over the unique combinations and build up a dataframe.
hacky_df = pandas.DataFrame()
for (day, tax_id, sample) in t.values:
match = all_rel[(all_rel['tax_id']==tax_id) & (all_rel['day']==day) & (all_rel['sample']==sample)]
top_5 = match.sort('rel', ascending=False).head()
hacky_df.append(top_5)
hacky_df.head()
But this takes a long time (still hasn't finished) and doesn't take advantage of the fact that these are numpy arrays under the hood. Is there a way to accomplish my goal with a pandas.df.apply call instead of using a for loop?
The following code gave the intended results:
top_5_df = all_rel.sort('rel', ascending=False).groupby(['day', 'tax_id', 'sample']).head(5).sort(['day', 'sample', 'tax_id'])
print top_5_df.head(20)
name day sample count tax_id rel
136 seq00025622;size=605 204 13 28 188144 0.005813
2596 seq07169587;size=2 204 13 2 188144 0.000415
2438 seq05675680;size=2 204 13 2 188144 0.000415
2419 seq05517001;size=2 204 13 2 188144 0.000415
2123 seq03049127;size=3 204 13 1 188144 0.000208
4448 seq42562010;size=1 204 13 1 28173 0.000208
60 seq00008910;size=1787 204 13 15 335972 0.003114
1074 seq00182900;size=72 204 13 2 335972 0.000415
2151 seq03232487;size=3 204 13 1 335972 0.000208
3302 seq20519515;size=1 204 13 1 335972 0.000208
2451 seq05760125;size=2 204 13 1 335972 0.000208
750 seq00099976;size=139 204 13 23 428643 0.004775
2546 seq06674971;size=2 204 13 2 428643 0.000415
2207 seq03714229;size=3 204 13 1 428643 0.000208
3234 seq19173942;size=1 204 13 1 428643 0.000208
3201 seq18402810;size=1 204 13 1 428643 0.000208
3 seq00000326;size=50743 204 13 470 553239 0.097571
531 seq00066543;size=216 204 13 45 553239 0.009342
72 seq00010509;size=1528 204 13 17 553239 0.003529
117 seq00021191;size=745 204 13 11 553239 0.002284
df.groupby().head() will call head() on each group independently and return a dataframe of the resulting rows.
Here are the docs: http://pandas.pydata.org/pandas-docs/stable/groupby.html#filtration

Issue combining columns in Dataframe?

I have the following dataframe:
Obj BIT BIT BIT GAS GAS GAS OIL OIL OIL
Date
2007-01-03 18 7 0 184 35 2 52 14 0
2007-01-09 43 3 0 249 35 2 68 11 1
2007-01-16 60 6 0 254 35 5 72 13 1
2007-01-23 69 11 1 255 43 2 81 6 0
2007-01-30 74 8 0 263 29 4 69 9 0
2007-02-06 78 6 1 259 34 2 79 6 0
2007-02-14 76 9 1 263 24 2 70 10 1
2007-02-20 85 7 0 241 20 6 72 4 0
2007-02-27 79 6 0 242 35 3 68 7 0
2007-03-06 68 14 0 225 26 2 57 10 1
How can I sum each of the 9 columns into 3 columns. "BIT","GAS" and "OIL"
This is the code for the dataframe which basically just gets me a cross section from a larger df I want:
ABrigsA = ndfAB.xs(['BIT','GAS','OIL'],axis=1)
Any suggestions?
Assuming that you want to sum similarly-named columns, you can use groupby [tutorial docs]:
>>> df.groupby(level=0, axis='columns').sum()
Obj BIT GAS OIL
Date
2007-01-03 25 221 66
2007-01-09 46 286 80
2007-01-16 66 294 86
2007-01-23 81 300 87
2007-01-30 82 296 78
2007-02-06 85 295 85
2007-02-14 86 289 81
2007-02-20 92 267 76
2007-02-27 85 280 75
2007-03-06 82 253 68

Categories

Resources