How to plot a histogram - python

My program imports these:
import requests
import demjson
import pandas as pd
from pandas import DataFrame
import pylab
pylab.show()
I have a dataframe which if I print out looks like this:
Strike COI POI
0 50.00 927 1694
1 55.00 394 1898
2 60.00 2042 4438
3 65.00 642 3696
4 70.00 3169 3216
5 75.00 2529 3222
6 80.00 6268 14029
7 85.00 3988 6241
8 87.50 356 1516
9 90.00 15676 14345
10 92.50 1309 2498
11 95.00 3303 11391
12 97.50 1074 1472
13 100.00 64930 19513
14 105.00 10953 9286
15 110.00 19956 13008
16 115.00 13956 12932
17 120.00 23440 9240
18 125.00 12167 7467
19 130.00 23531 10168
20 135.00 9567 2637
21 140.00 18967 6854
22 145.00 7890 5176
23 150.00 21516 8079
24 155.00 3137 267
25 160.00 4115 432
26 165.00 1079 205
27 170.00 4341 785
28 175.00 6277 1631
29 180.00 1805 35
30 185.00 906 136
31 190.00 1984 377
32 195.00 3539 268
Sometimes there are zero values like this
Strike COI POI
0 95.00 53 663
1 100.00 16 595
2 105.00 6 377
3 110.00 56 1217
4 115.00 174 994
5 120.00 631 3227
6 125.00 701 1031
7 130.00 2678 833
8 135.00 1921 1049
9 140.00 1238 10
10 160.00 1486 0
11 165.00 1900 0
Unfortunately sometimes the Strike is a float like this:
Strike COI POI
0 34.29 476 12711
1 35.71 95 7782
2 37.14 0 7844
3 38.57 0 3640
4 40.00 93 6010
5 41.43 0 5621
6 42.86 1245 18146
7 44.29 116 6844
8 45.71 140 7099
9 47.14 500 483
10 48.57 445 3956
11 50.00 1540 22362
12 51.43 152 6366
13 52.86 131 8354
14 54.29 810 7542
15 55.71 132 9337
16 57.14 12455 15024
17 58.57 662 5245
18 60.00 1743 9116
19 61.43 1368 7236
20 62.86 1128 11890
21 64.29 4537 24204
22 65.71 766 5113
23 67.14 1859 10572
24 68.57 12407 11367
25 70.00 13263 11748
26 71.43 23400 31566
27 72.86 2784 12984
28 74.29 12679 20520
29 75.71 6932 14617
.. ... ... ...
63 115.00 39738 18033
64 115.71 5293 2877
65 116.43 1874 2748
66 117.14 4181 1965
67 117.86 3618 4214
68 118.57 11652 4043
69 120.00 81523 34752
70 121.43 14239 3527
71 122.86 9046 6160
72 125.00 187 88
73 125.71 22557 7381
74 128.57 11053 8163
75 130.00 74007 27825
76 131.43 6747 1951
77 132.86 7289 1383
78 134.29 5872 1380
79 135.71 4946 2047
80 137.14 5349 590
81 140.00 98310 57767
82 145.00 9857 403
83 150.00 64701 2063
84 155.00 17398 1434
85 160.00 12363 1133
86 165.00 5222 539
87 170.00 9050 918
88 175.00 9848 678
89 180.00 3408 85
90 185.00 3243 768
91 190.00 3646 419
92 195.00 4789 149
Since I want the Strikes to be the bins, I tried to plot a histogram with:
df.hist(by=df.Strike)
but I either get nothing or, when the figure does start to appear as a grid of small plots (I am using Spyder), I get the error below before anything is drawn. As far as I can see, every dataframe has at least one point. The y-axis also doesn't make sense, since the bar height always appears to be one:
Traceback (most recent call last):
File "<ipython-input-20-6f27fa6cf56c>", line 1, in <module>
runfile('/home/idf/goog.py', wdir='/home/idf')
File "/home/idf/anaconda/lib/python2.7/site-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 682, in runfile
execfile(filename, namespace)
File "/home/idf/anaconda/lib/python2.7/site-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 78, in execfile
builtins.execfile(filename, *where)
File "/home/idf/goog.py", line 153, in <module>
df.hist(by=df.Strike)
File "/home/idf/anaconda/lib/python2.7/site-packages/pandas/tools/plotting.py", line 2740, in hist_frame
**kwds)
File "/home/idf/anaconda/lib/python2.7/site-packages/pandas/tools/plotting.py", line 2873, in grouped_hist
figsize=figsize, layout=layout, rot=rot)
File "/home/idf/anaconda/lib/python2.7/site-packages/pandas/tools/plotting.py", line 2983, in _grouped_plot
plotf(group, ax, **kwargs)
File "/home/idf/anaconda/lib/python2.7/site-packages/pandas/tools/plotting.py", line 2867, in plot_group
ax.hist(group.dropna().values, bins=bins, **kwargs)
File "/home/idf/anaconda/lib/python2.7/site-packages/matplotlib/axes/_axes.py", line 5597, in hist
raise ValueError("x must have at least one data point")
ValueError: x must have at least one data point

When you call the DataFrame.hist method (i.e. pandas' own plotting function), you only need to pass a column name:
df.hist('Strike') # which is the same as df.hist(column='Strike')
This draws a single histogram of the Strike column.
If you use plt.hist instead (calling the matplotlib function directly), you need to pass the values themselves, i.e. df.Strike.values.
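As a minimal sketch of the two calls mentioned above, the snippet below uses a few rows copied from the question's first table. The `by` argument of `DataFrame.hist` groups the rows and draws one histogram per group, which would likely explain the grid of small plots in the question. The final bar chart is an assumption about the asker's goal (one bar per strike, with COI and POI as heights) and is not part of the original answer. The `Agg` backend is chosen only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# A few rows copied from the first table in the question
df = pd.DataFrame({
    "Strike": [50.0, 55.0, 60.0, 65.0, 70.0],
    "COI": [927, 394, 2042, 642, 3169],
    "POI": [1694, 1898, 4438, 3696, 3216],
})

# pandas wrapper: one histogram of the Strike column
hist_axes = df.hist(column="Strike", bins=5)

# matplotlib directly: pass the values themselves
fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(df["Strike"].values, bins=5)

# Possible alternative (assumption about the goal): one bar per strike,
# with COI and POI as the bar heights
bar_ax = df.plot(kind="bar", x="Strike", y=["COI", "POI"])
```

A histogram counts how many rows fall into each bin, so with one row per strike every bar has height one, which matches what the question describes.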

Related

How to plot multiple chart on one figure and combine with another?

# Create an axes object
axes = plt.gca()
# pass the axes object to plot function
df.plot(kind='line', x='鄉鎮別', y='男', ax=axes,figsize=(10,8));
df.plot(kind='line', x='鄉鎮別', y='女', ax=axes,figsize=(10,8));
df.plot(kind='line', x='鄉鎮別', y='合計(男+女)', ax=axes,figsize=(10,8),title='hihii',
xlabel='鄉鎮別',ylabel='人數')
This is my data:
鄉鎮別 鄰數 戶數 男 女 合計(男+女) 遷入 遷出 出生 死亡 結婚 離婚
0 苗栗市 715 32517 42956 43362 86318 212 458 33 65 28 13
1 苑裡鎮 362 15204 22979 21040 44019 118 154 17 24 9 7
2 通霄鎮 394 11557 17034 15178 32212 73 113 5 33 3 3
3 竹南鎮 518 32061 44069 43275 87344 410 392 31 59 35 11
4 頭份市 567 38231 52858 52089 104947 363 404 39 69 31 19
5 後龍鎮 367 12147 18244 16274 34518 93 144 12 41 2 7
6 卓蘭鎮 176 5861 8206 7504 15710 29 51 1 11 2 0
7 大湖鄉 180 5206 7142 6238 13380 31 59 5 21 3 2
8 公館鄉 281 10842 16486 15159 31645 89 169 12 32 5 3
9 銅鑼鄉 218 6106 8887 7890 16777 57 62 7 13 4 1
10 南庄鄉 184 3846 5066 4136 9202 22 48 1 10 0 2
11 頭屋鄉 120 3596 5289 4672 9961 59 53 2 11 4 4
12 三義鄉 161 5625 8097 7205 15302 47 63 3 12 3 5
13 西湖鄉 108 2617 3653 2866 6519 38 20 1 17 3 0
14 造橋鄉 115 4144 6276 5545 11821 44 64 3 11 3 2
15 三灣鄉 93 2331 3395 2832 6227 27 18 2 9 0 2
16 獅潭鄉 98 1723 2300 1851 4151 28 10 1 4 0 0
17 泰安鄉 64 1994 3085 2642 5727 36 26 2 8 4 1
18 總計 4721 195608 276022 259758 535780 1776 2308 177 450 139 82
This is the output of my df.plot calls.
My first question: how do I display the Chinese text?
Second: can I plot the line chart without using df.plot?
Last: I need four graphs (using subplots): line graphs of the male, female, and total population (男, 女, 合計(男+女)) in each township; line graphs of in-migration and out-migration (遷入 and 遷出); a bar graph of the number of households (戶數); and line graphs of births and deaths (出生 and 死亡). How do I draw them on one figure?
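One possible layout for the four graphs, sketched with plain matplotlib calls instead of df.plot. The data is three rows copied from the table above. The font name is only an example and an assumption: it has to be a CJK-capable font that is actually installed on your machine, otherwise the Chinese labels will not render. The English panel titles are also just illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Three rows copied from the question's table
df = pd.DataFrame({
    "鄉鎮別": ["苗栗市", "苑裡鎮", "通霄鎮"],
    "戶數": [32517, 15204, 11557],
    "男": [42956, 22979, 17034],
    "女": [43362, 21040, 15178],
    "合計(男+女)": [86318, 44019, 32212],
    "遷入": [212, 118, 73],
    "遷出": [458, 154, 113],
    "出生": [33, 17, 5],
    "死亡": [65, 24, 33],
})

# Assumption: replace the first name with a CJK font installed on your system
plt.rcParams["font.sans-serif"] = ["Microsoft JhengHei", "DejaVu Sans"]
plt.rcParams["axes.unicode_minus"] = False

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
x = df["鄉鎮別"]

# Top left: male, female, and total population
for col in ["男", "女", "合計(男+女)"]:
    axes[0, 0].plot(x, df[col], label=col)
axes[0, 0].set_title("Population")
axes[0, 0].legend()

# Top right: in-migration and out-migration
for col in ["遷入", "遷出"]:
    axes[0, 1].plot(x, df[col], label=col)
axes[0, 1].set_title("Migration")
axes[0, 1].legend()

# Bottom left: number of households as a bar chart
axes[1, 0].bar(x, df["戶數"])
axes[1, 0].set_title("Households")

# Bottom right: births and deaths
for col in ["出生", "死亡"]:
    axes[1, 1].plot(x, df[col], label=col)
axes[1, 1].set_title("Births and deaths")
axes[1, 1].legend()
```

Each panel is an ordinary Axes object, so the same `ax=` argument used in the question's code would also work if you prefer to keep df.plot for some of the panels.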

Cumulative pandas column "resetting" once threshold is reached

I am facing an issue with the following dataset:
item price
1 1706
2 210
3 1664
4 103
5 103
6 314
7 1664
8 57
9 140
10 1628
11 688
12 180
13 604
14 86
15 180
16 86
17 1616
18 832
19 1038
20 57
21 2343
22 151
23 328
24 328
25 57
26 86
27 1706
28 604
29 609
30 86
31 0
32 57
33 302
34 328
I want a cumulative sum column which "resets" each time it reaches the threshold. It must never exceed the threshold; a large gap between the last cumsum value and the threshold is fine, as long as the threshold is not exceeded.
I have tried the following code:
threshold = (7.17*1728)*0.75 #this is equal to 9292.32
df['cumsum'] = df.groupby((df['price'].cumsum()) // threshold)['price'].cumsum()
This outputs the following:
item price cumsum
1 1706 1706
2 210 1916
3 1664 3580
4 103 3683
5 103 3786
6 314 4100
7 1664 5764
8 57 5821
9 140 5961
10 1628 7589
11 688 8277
12 180 8457
13 604 9061
14 86 9147
15 180 9327 #exceeds threshold
16 86 9413 #
17 1616 1616
18 832 2448
19 1038 3486
20 57 3543
21 2343 5886
22 151 6037
23 328 6365
24 328 6693
25 57 6750
26 86 6836
27 1706 8542
28 604 9146
29 609 9755 #exceeds threshold same below
30 86 9841 #
31 0 9841 #
32 57 9898 #
33 302 10200 #
34 328 328
My expected result would be the following instead (for the first part for example):
item price cumsum
1 1706 1706
2 210 1916
3 1664 3580
4 103 3683
5 103 3786
6 314 4100
7 1664 5764
8 57 5821
9 140 5961
10 1628 7589
11 688 8277
12 180 8457
13 604 9061
14 86 9147
15 180 180 #
16 86 266 #
What do I need to change in order to get this result? I would also appreciate an explanation of why the above code does not work.
Thank you in advance.
This may be slow, but it works:
threshold = (7.17*1728)*0.75 #this is equal to 9292.32
df['cumsum'] = df['price'].cumsum()
# repeatedly restart the cumsum for the rows that reach the threshold
n = 1
while True:
    print(n)
    cond = df['cumsum'].ge(threshold)
    if cond.sum():
        df.loc[cond, 'cumsum'] = df.loc[cond, 'price'].cumsum()
    else:
        break
    n += 1
Thank you for all the replies and feedback.
I went ahead with the code below, which solves my issue:
ls = []
cumsum = 0
last_reset = 0
for _, row in df.iterrows():
    # keep accumulating while the running total stays within the threshold
    if cumsum + row.price <= threshold:
        cumsum += row.price
    else:
        # remember where the previous run ended, then restart from this row
        last_reset = cumsum
        cumsum = row.price
    ls.append(cumsum)
df['cumsum'] = ls
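On the "why" part of the question, which the answers above do not address: the groupby attempt puts each row into group `overall_cumsum // threshold`, so the group boundaries sit at fixed multiples of the threshold on the overall running total. Once a reset happens before the threshold is fully used up, that unused gap is not reflected in the fixed boundaries, which is probably why later groups overshoot. The sketch below wraps the loop from the self-answer in a small function; the name `cumsum_with_reset` is made up for illustration, and the prices are the first 16 rows of the question's data.

```python
def cumsum_with_reset(prices, threshold):
    """Running total that restarts from the current value whenever
    adding it would push the total above the threshold."""
    totals = []
    running = 0
    for price in prices:
        if running + price <= threshold:
            running += price
        else:
            running = price  # reset: start a new run with this row
        totals.append(running)
    return totals


threshold = (7.17 * 1728) * 0.75  # 9292.32, as in the question

# First 16 prices from the question's table
prices = [1706, 210, 1664, 103, 103, 314, 1664, 57,
          140, 1628, 688, 180, 604, 86, 180, 86]

result = cumsum_with_reset(prices, threshold)
print(result[-3:])  # last three values: the end of one run and the start of the next
```

With a DataFrame, the same function could be used as `df['cumsum'] = cumsum_with_reset(df['price'], threshold)`. The reset depends on the previous running total, so a plain loop is a reasonable choice here.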

pandas syntax error returning on multiple conditions

I cannot figure out what is wrong with this code. It gives me an "invalid syntax" error, but I am following the instructions exactly and it looks correct. I am trying to select just the players with more than 30 doubles ('2B') who are in the AL league from the merged data below (d820hw5p3). Any ideas what is going on?
d820hw5p6= d820hw5p3[(d820hw5p3.2B > 30) & (d820hw5p3.LEAGUE == 'AL')]
d820hw5p6
d820hw5p3 is this data:
First Last R H AB LEAGUE 2B 3B HR RBI
0 Leonys Martin 72 128 518 AL 17 3 15 47
1 Jay Bruce 74 135 540 NL 27 6 33 99
2 Jackie Bradley Jr. 94 149 558 AL 30 7 26 87
3 George Springer 116 168 644 AL 29 5 29 82
4 Corey Dickerson 57 125 510 AL 36 3 24 70
5 Dexter Fowler 84 126 457 NL 25 7 13 48
6 Angel Pagan 71 137 495 NL 24 5 12 55
7 Adam Eaton 91 176 620 AL 29 9 14 59
8 Yasmany Tomas 72 144 529 NL 30 1 31 83
9 Gregory Polanco 79 136 527 NL 34 4 22 86
10 Nomar Mazara 59 137 515 AL 13 3 20 64
11 Justin Upton 81 140 569 AL 28 2 31 87
12 Bryce Harper 84 123 506 NL 24 2 24 86
13 Kole Calhoun 91 161 594 AL 35 5 18 75
14 Ender Inciarte 85 152 522 NL 24 7 3 29
15 Jacoby Ellsbury 71 145 551 AL 24 5 9 56
16 Curtis Granderson 88 129 544 NL 24 5 30 59
17 Mookie Betts 122 214 673 AL 42 5 31 113
18 Denard Span 70 152 571 NL 23 5 11 53
19 Adam Duvall 85 133 552 NL 31 6 33 103
20 Brett Gardner 80 143 548 AL 22 6 7 41
21 Matt Kemp 89 167 623 NL 39 0 35 108
22 Khris Davis 85 137 555 AL 24 2 42 102
23 Mike Trout 123 173 549 AL 32 5 29 100
24 Melky Cabrera 70 175 591 AL 42 5 14 86
25 Jose Bautista 68 99 423 AL 24 1 22 69
26 Ian Desmond 107 178 625 AL 29 3 22 86
27 Alex Gordon 62 98 445 AL 16 2 17 40
28 Ryan Braun 80 156 511 NL 23 3 30 91
29 Nick Markakis 67 161 599 NL 38 0 13 89
30 Carlos Gonzalez 87 174 584 NL 42 2 25 100
31 Yoenis Cespedes 72 134 479 NL 25 1 31 86
32 Stephen Piscotty 86 159 582 NL 35 3 22 85
33 Michael Saunders 70 124 490 AL 32 3 24 57
34 Jayson Werth 84 128 525 NL 28 0 21 69
35 Howie Kendrick 65 124 486 NL 26 2 8 40
36 Adam Jones 86 164 619 AL 19 0 29 83
37 Marcell Ozuna 75 148 556 NL 23 6 23 76
38 Jason Heyward 61 122 530 NL 27 1 7 49
39 Marwin Gonzalez 55 123 484 AL 26 3 13 51
40 Starling Marte 71 152 489 NL 34 5 9 46
41 J.D. Martinez 69 141 459 AL 35 2 22 68
42 Kevin Pillar 59 146 549 AL 35 2 7 53
43 Charlie Blackmon 111 187 577 NL 35 5 29 82
44 Odubel Herrera 87 167 584 NL 21 6 15 49
45 Christian Yelich 78 172 577 NL 38 3 21 98
46 Andrew McCutchen 81 153 598 NL 26 3 24 79
I went off AMC's hunch that the column name starting with a 2 is the problem, and created this minimal reproducible example:
import pandas as pd
# define the DataFrame
df = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    '2b': [1, 2, 3],
    'b2': [4, 5, 6],
})
# Try to access column 2b as an attribute
df.2b
This raises SyntaxError: invalid syntax, while df['2b'] returns the expected Series.
I did a brief search for documentation about this and didn't find anything, but I expect it is related to this question: "Variable names in Python cannot start with a number or can they?" Attribute access with a dot only works when the name is a valid Python identifier, and identifiers cannot start with a digit.
So in the end, while 2b is a valid column name, you have to access its Series with the df['column'] syntax.
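Applied to the original filter, that means replacing the dotted access with bracket access. A small sketch, using four rows copied from the question's table (only the columns needed for the filter):

```python
import pandas as pd

# Four players copied from the question's data
df = pd.DataFrame({
    "First": ["Leonys", "Corey", "Gregory", "Mookie"],
    "Last": ["Martin", "Dickerson", "Polanco", "Betts"],
    "LEAGUE": ["AL", "AL", "NL", "AL"],
    "2B": [17, 36, 34, 42],
})

# Bracket access works for any column name, including ones starting with a digit
result = df[(df["2B"] > 30) & (df["LEAGUE"] == "AL")]
print(result)
```

Using brackets for both columns keeps the expression consistent, although `df.LEAGUE` would still work because LEAGUE is a valid identifier.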

Values in pandas dataframe not getting sorted

I have a dataframe as shown below:
Category 1 2 3 4 5 6 7 8 9 10 11 12 13
A 424 377 161 133 2 81 141 169 297 153 53 50 197
B 231 121 111 106 4 79 68 70 92 93 71 65 66
C 480 379 159 139 2 116 148 175 308 150 98 82 195
D 88 56 38 40 0 25 24 55 84 36 24 26 36
E 1084 1002 478 299 7 256 342 342 695 378 175 132 465
F 497 246 283 206 4 142 151 168 297 224 194 198 148
H 8 5 4 3 0 2 3 2 7 5 3 2 0
G 3191 2119 1656 856 50 826 955 739 1447 1342 975 628 1277
K 58 26 27 51 1 18 22 42 47 35 19 20 14
S 363 254 131 105 6 82 86 121 196 98 81 57 125
T 54 59 20 4 0 9 12 7 36 23 5 4 20
O 554 304 207 155 3 130 260 183 287 204 98 106 195
P 756 497 325 230 5 212 300 280 448 270 201 140 313
PP 64 43 26 17 1 15 35 17 32 28 18 9 27
R 265 157 109 89 1 68 68 104 154 96 63 55 90
S 377 204 201 114 5 112 267 136 209 172 147 90 157
St 770 443 405 234 5 172 464 232 367 270 290 136 294
Qs 47 33 11 14 0 18 14 19 26 17 5 6 13
Y 1806 626 1102 1177 14 625 619 1079 1273 981 845 891 455
W 123 177 27 28 0 18 62 34 64 27 14 4 51
Z 2770 1375 1579 1082 17 900 1630 1137 1465 1383 861 755 1201
I want to sort the values within each row. Once that is done, I also want to sort the index.
For example, the values in the first row, corresponding to category A, should appear as:
2 50 53 81 133 141 153 161 169 197 297 377 424
I have tried df.sort_values(by=df.index.tolist(), ascending=False, axis=1), but this doesn't work: the values don't appear in sorted order at all.
np.sort + sort_index
You can use np.sort along axis=1, then sort_index:
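import numpy as np
import pandas as pd
cols, idx = df.columns[1:], df.iloc[:, 0]
# sort each row's values independently, then order the rows by Category
res = pd.DataFrame(np.sort(df.iloc[:, 1:].values, axis=1), columns=cols, index=idx).sort_index()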
print(res)
1 2 3 4 5 6 7 8 9 10 11 12 \
Category
A 2 50 53 81 133 141 153 161 169 197 297 377
B 4 65 66 68 70 71 79 92 93 106 111 121
C 2 82 98 116 139 148 150 159 175 195 308 379
D 0 24 24 25 26 36 36 38 40 55 56 84
E 7 132 175 256 299 342 342 378 465 478 695 1002
F 4 142 148 151 168 194 198 206 224 246 283 297
G 50 628 739 826 856 955 975 1277 1342 1447 1656 2119
H 0 0 2 2 2 3 3 3 4 5 5 7
K 1 14 18 19 20 22 26 27 35 42 47 51
O 3 98 106 130 155 183 195 204 207 260 287 304
P 5 140 201 212 230 270 280 300 313 325 448 497
PP 1 9 15 17 17 18 26 27 28 32 35 43
Qs 0 5 6 11 13 14 14 17 18 19 26 33
R 1 55 63 68 68 89 90 96 104 109 154 157
S 6 57 81 82 86 98 105 121 125 131 196 254
S 5 90 112 114 136 147 157 172 201 204 209 267
St 5 136 172 232 234 270 290 294 367 405 443 464
T 0 4 4 5 7 9 12 20 20 23 36 54
W 0 4 14 18 27 27 28 34 51 62 64 123
Y 14 455 619 625 626 845 891 981 1079 1102 1177 1273
Z 17 755 861 900 1082 1137 1201 1375 1383 1465 1579 1630
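A self-contained version of the np.sort approach, as a sketch on a cut-down copy of the data (three categories and the first four value columns from the question). It also hints at why the sort_values attempt fails: with axis=1, sort_values reorders whole columns based on the listed rows, so it cannot sort each row independently, whereas np.sort with axis=1 does exactly that.

```python
import numpy as np
import pandas as pd

# Cut-down copy of the question's data
df = pd.DataFrame({
    "Category": ["B", "A", "D"],
    "1": [231, 424, 88],
    "2": [121, 377, 56],
    "3": [111, 161, 38],
    "4": [106, 133, 40],
})

# Sort the values of every row independently (ascending)
values = np.sort(df.iloc[:, 1:].to_numpy(), axis=1)

# Rebuild the frame with Category as the index, then sort the index
res = pd.DataFrame(values, columns=df.columns[1:], index=df["Category"]).sort_index()
print(res)
```

After the row-wise sort the column labels no longer refer to the original columns; they only indicate rank within the row.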
Another way is to apply sorted along axis=1, then apply pd.Series to expand each resulting list into columns (a DataFrame instead of a Series of lists), and finally set Category as the index and sort by it:
(df.loc[:, '1':].apply(sorted, axis=1).apply(pd.Series)
   .set_index(df.Category).sort_index())
Category 0 1 2 3 4 5 6 7 8 9 10 ...
0 A 2 50 53 81 133 141 153 161 169 197 297 ...
1 B 4 65 66 68 70 71 79 92 93 106 111 ...

Pandas - reindex so I can keep values

Long story short: I have a nested dictionary, which I turn into a dataframe:
import pandas
pdf = pandas.DataFrame(nested_dict)
95 96 97 98 99 100 101 102 103 104 105 \
A 70019 102 4243 3083 3540 6311 4851 5938 4140 4659 3100
C 0 185 427 433 1190 910 3898 3869 2861 2149 3065
D 8 9 23463 1237 2574 4174 3640 4747 3557 4582 5934
E 141 89 5034 1576 2303 3416 2377 1252 1204 1703 718
F 7 12 1937 2246 1687 1154 1317 3473 1881 2221 3060
G 343 1550 13497 10659 12343 8213 9251 7341 6354 9058 9022
H 1 1978 1829 1394 1945 1003 1382 1489 4182 932 556
I 5 772 1361 3914 3255 3242 2808 3765 3284 2127 3120
K 3 10353 540 2364 1196 882 3439 2107 803 743 621
L 6 14 1599 11759 4571 4821 3450 5071 4364 1891 3677
M 1 6 158 211 524 2738 686 443 612 509 1721
N 6 186 299 2971 791 1440 2028 1163 1689 4296 1535
P 54 31 726 6208 7160 5494 6184 4282 3587 3727 3821
Q 10 87 1228 2233 1016 1801 1768 1693 3414 515 563
R 7 53939 3030 8904 6712 6134 5127 3223 4764 3768 6429
S 76 5213 3676 7480 9831 7666 5410 8185 7508 11237 8298
T 4369 1253 3087 2487 6559 4572 6863 3184 7352 6068 4756
V 732 5 7595 4331 5216 5444 5187 6013 4245 4545 4761
W 0 6 103 1225 598 888 601 713 1298 1323 908
Y 12 9 1968 1085 2787 5489 5529 7840 8691 9745 10136
Eventually I want to melt down this data frame to look like the following.
residue residue_num count
A 95 70019
A 96 102
A 97 4243
....
The residue column is being marked as the index so I don't know how to make it an arbitrary index like 0,1,2,3 and call "A C D E F.." another name.
EDIT: I answered this myself, as suggested, drawing on two other answers:
import pandas
pdf = pandas.DataFrame(the_matrix)
pdf = pdf.reset_index()
pdf.rename(columns={'index':'aa'},inplace=True)
pandas.melt(pdf,id_vars='aa',var_name="position",value_name="counts")
aa position counts
0 A 95 70019
1 C 95 0
2 D 95 8
3 E 95 141
4 F 95 7
5 G 95 343
6 H 95 1
7 I 95 5
8 K 95 3
Your pdf looks like a pivot table. Let's assume we have a dataframe with three columns. We can pivot it with a single function like this:
pivoted = df.pivot(index='col1',columns='col2',values='col3')
Unpivoting it back without losing the index requires a reset_index dance:
pivoted.reset_index().melt(id_vars=pivoted.index.name)
To get the exact original df:
pivoted.reset_index().melt(id_vars=pivoted.index.name, var_name='col2', value_name='col3')
PS. To my surprise, melt does not accept a kwarg like keep_index=True. The enhancement suggestion is still open: https://github.com/pandas-dev/pandas/issues/17440
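Putting the reset_index and melt steps together for the question's case, here is a minimal sketch. The tiny `nested_dict` is made up from four values in the question's table, and the column names follow the layout the asker wanted (residue, residue_num, count).

```python
import pandas as pd

# Small stand-in for the question's nested dictionary:
# outer keys become columns, inner keys become the index
nested_dict = {
    95: {"A": 70019, "C": 0},
    96: {"A": 102, "C": 185},
}

pdf = pd.DataFrame(nested_dict)
pdf.index.name = "residue"  # name the index so reset_index yields a 'residue' column

# Move the index into a regular column, then unpivot the remaining columns
tidy = pdf.reset_index().melt(
    id_vars="residue", var_name="residue_num", value_name="count"
)
print(tidy)
```

Naming the index first avoids the separate rename step on the 'index' column. If the rows should be grouped by residue as in the question's example, a `sort_values(["residue", "residue_num"])` call on the result would do that.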
