Pandas - reindex so I can keep values - python

Long story short:
I have a nested dictionary, which I turn into a dataframe:
import pandas
pdf = pandas.DataFrame(nested_dict)
95 96 97 98 99 100 101 102 103 104 105 \
A 70019 102 4243 3083 3540 6311 4851 5938 4140 4659 3100
C 0 185 427 433 1190 910 3898 3869 2861 2149 3065
D 8 9 23463 1237 2574 4174 3640 4747 3557 4582 5934
E 141 89 5034 1576 2303 3416 2377 1252 1204 1703 718
F 7 12 1937 2246 1687 1154 1317 3473 1881 2221 3060
G 343 1550 13497 10659 12343 8213 9251 7341 6354 9058 9022
H 1 1978 1829 1394 1945 1003 1382 1489 4182 932 556
I 5 772 1361 3914 3255 3242 2808 3765 3284 2127 3120
K 3 10353 540 2364 1196 882 3439 2107 803 743 621
L 6 14 1599 11759 4571 4821 3450 5071 4364 1891 3677
M 1 6 158 211 524 2738 686 443 612 509 1721
N 6 186 299 2971 791 1440 2028 1163 1689 4296 1535
P 54 31 726 6208 7160 5494 6184 4282 3587 3727 3821
Q 10 87 1228 2233 1016 1801 1768 1693 3414 515 563
R 7 53939 3030 8904 6712 6134 5127 3223 4764 3768 6429
S 76 5213 3676 7480 9831 7666 5410 8185 7508 11237 8298
T 4369 1253 3087 2487 6559 4572 6863 3184 7352 6068 4756
V 732 5 7595 4331 5216 5444 5187 6013 4245 4545 4761
W 0 6 103 1225 598 888 601 713 1298 1323 908
Y 12 9 1968 1085 2787 5489 5529 7840 8691 9745 10136
Eventually I want to melt down this data frame to look like the following.
residue residue_num count
A 95 70019
A 96 102
A 97 4243
....
The residue labels end up as the index, so I don't know how to replace them with a default integer index like 0,1,2,3 and move "A C D E F.." into a column with another name.
EDIT
Answered myself as per suggestion

Answered from here and here
import pandas
pdf = pandas.DataFrame(the_matrix)
pdf = pdf.reset_index()
pdf.rename(columns={'index':'aa'},inplace=True)
pandas.melt(pdf,id_vars='aa',var_name="position",value_name="counts")
aa position counts
0 A 95 70019
1 C 95 0
2 D 95 8
3 E 95 141
4 F 95 7
5 G 95 343
6 H 95 1
7 I 95 5
8 K 95 3

Your pdf looks like a pivot table. Let's assume we have a dataframe with three columns. We can pivot it with a single function like this:
pivoted = df.pivot(index='col1',columns='col2',values='col3')
Unpivoting it back without losing the index requires a reset_index dance:
pivoted.reset_index().melt(id_vars=pivoted.index.name)
To get the exact original df:
pivoted.reset_index().melt(id_vars=pivoted.index.name, var_name='col2', value_name='col3')
PS. To my surprise, melt does not get a kwarg like keep_index=True. Enhancement suggestion is still open: https://github.com/pandas-dev/pandas/issues/17440
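As a minimal sketch of the reset_index + melt approach applied to the question's layout: the small nested_dict below is made up from the first few values shown in the question, and the column names residue, residue_num and count are taken from the desired output.

```python
import pandas as pd

# Toy nested dict: outer keys become columns, inner keys become the index.
nested_dict = {
    95: {"A": 70019, "C": 0, "D": 8},
    96: {"A": 102, "C": 185, "D": 9},
}

pdf = pd.DataFrame(nested_dict)
pdf.index.name = "residue"  # name the index so reset_index yields a 'residue' column

# Move the index into a regular column, then unpivot the remaining columns.
long_df = pdf.reset_index().melt(
    id_vars="residue", var_name="residue_num", value_name="count"
)
print(long_df)
```

Naming the index first avoids the separate rename step, because reset_index uses the index name for the new column.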

Related

for-loop generates 'cannot insert {}, already exists' error depending on the pandas dataframe definition

I have a series of astronomical observations that are placed into program-generated DataFrames, one per observation year (i.e., df2015, df2016, etc.). These DataFrames need to be modified in subsequent processing steps, so I put all of them in a list. The method used to define the list makes a difference. An explicitly defined list
dfs = [df2015, df2016, df2017, df2018, df2019]
allows further df modifications, but it does not agree with the purpose of the code - to automate the processing of standard data sets regardless of the number of years. A program-generated list
for yr in years:
    exec('dfs = [df' + yr + ' for yr in years]')
seems to be working most of the time, as in :
for df in dfs:
    dfX = df.dtypes
    for index, val2 in dfX.items():
        if val2 == 'float64':
            df.iloc[:,index] = df.iloc[:,index].fillna(0).astype('int64')
, but fails in some cases, as in:
for df in dfs:
    i=1
    for i in range(1, 13):
        ncol = i + (i-1) *2
        if i < 10:
            nmon = '0' + str(i)
        else:
            nmon = '' + str(i)
        df.insert(ncol, 'M' + nmon, nmon)
        i += 1
when a for loop with an insert statement results in an error:
ValueError: cannot insert M01, already exists
I've tried list comprehension instead of for loops, tried changing the loop nesting order (just in case), etc.
The intent of the step above is to convert this:
0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 19 20 21 22 23 24
0 1 713 1623 658.0 1659.0 619 1735 526.0 1810.0 439 ... 437 1903 510.0 1818.0 542 1725 618.0 1637.0 654 1613
1 2 714 1624 657.0 1700.0 618 1736 525.0 1812.0 438 ... 438 1902 511.0 1816.0 543 1724 619.0 1636.0 655 1613
2 3 714 1625 655.0 1702.0 616 1737 523.0 1813.0 437 ... 439 1901 512.0 1814.0 544 1722 620.0 1635.0 656 1612
3 4 714 1626 654.0 1703.0 614 1738 521.0 1814.0 435 ... 440 1900 513.0 1813.0 545 1720 622.0 1634.0 657 1612
4 5 713 1627 653.0 1704.0 613 1739 520.0 1815.0 434 ... 441 1859 514.0 1811.0 546 1719 623.0 1633.0 658 1612
into this
0 M01 D01 1 2 M02 D02 3 4 M03 ... 19 20 M11 D11 21 22 M12 D12 23 24
0 1 01 1 713 1623 02 1 658 1659 03 ... 542 1725 11 1 618 1637 12 1 654 1613
1 2 01 2 714 1624 02 2 657 1700 03 ... 543 1724 11 2 619 1636 12 2 655 1613
2 3 01 3 714 1625 02 3 655 1702 03 ... 544 1722 11 3 620 1635 12 3 656 1612
3 4 01 4 714 1626 02 4 654 1703 03 ... 545 1720 11 4 622 1634 12 4 657 1612
4 5 01 5 713 1627 02 5 653 1704 03 ... 546 1719 11 5 623 1633 12 5 658 1612
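Your exec loop rebuilds dfs on every pass, so after the last pass the list holds the last year's dataframe repeated once per year (the same object each time, not copies). If your years list is e.g. ['2015', '2016', '2017', '2018'], then you generate a dfs as [df2018, df2018, df2018, df2018]. The insert loop then runs on df2018 several times, and the second pass fails because M01 has already been inserted.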
This will get you the correct result:
dfs = [eval('df' + yr) for yr in years]
It forms the required dataframe names and evaluates them so you get a list of dataframes.
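If you control how the yearly frames are created, an alternative that avoids eval and exec altogether is to keep them in a dict keyed by year. The sketch below is only an illustration: the frames dict and the tiny two-column frames are hypothetical stand-ins for however your program actually builds df2015, df2016, and so on.

```python
import pandas as pd

years = ["2015", "2016", "2017"]

# Hypothetical stand-in for the program-generated yearly frames.
frames = {yr: pd.DataFrame({0: [1, 2], 1: [713, 714]}) for yr in years}

# One distinct dataframe per year, in the order given by `years`.
dfs = [frames[yr] for yr in years]

for df in dfs:
    # Each frame is visited exactly once, so the insert cannot collide.
    df.insert(1, "M01", "01")

print(dfs[0])
```

Because every list element is a different object, the insert runs once per frame and the "already exists" error does not occur.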

How to mask a data frame by the month?

I have a dataframe df1 with a column dates which contains dates. I want to plot the data for just a certain month. The dataframe looks like:
Unnamed: 0 Unnamed: 0.1 dates DPD weekday
0 0 1612 2007-06-01 23575.0 4
1 3 1615 2007-06-04 28484.0 0
2 4 1616 2007-06-05 29544.0 1
3 5 1617 2007-06-06 29129.0 2
4 6 1618 2007-06-07 27836.0 3
5 7 1619 2007-06-08 23434.0 4
6 10 1622 2007-06-11 28893.0 0
7 11 1623 2007-06-12 28698.0 1
8 12 1624 2007-06-13 27959.0 2
9 13 1625 2007-06-14 28534.0 3
10 14 1626 2007-06-15 23974.0 4
.. ... ... ... ... ...
513 721 2351 2009-06-09 54658.0 1
514 722 2352 2009-06-10 51406.0 2
515 723 2353 2009-06-11 48255.0 3
516 724 2354 2009-06-12 40874.0 4
517 727 2357 2009-06-15 77085.0 0
518 728 2358 2009-06-16 77989.0 1
519 729 2359 2009-06-17 75209.0 2
520 730 2360 2009-06-18 72298.0 3
521 731 2361 2009-06-19 60037.0 4
522 734 2364 2009-06-22 69348.0 0
523 735 2365 2009-06-23 74086.0 1
524 736 2366 2009-06-24 69187.0 2
525 737 2367 2009-06-25 68912.0 3
526 738 2368 2009-06-26 57848.0 4
527 741 2371 2009-06-29 72718.0 0
528 742 2372 2009-06-30 72306.0 1
And I just want to have June 2007 for example.
df1 = pd.read_csv('DPD.csv')
df1['dates'] = pd.to_datetime(df1['dates'])
df1['month'] = pd.PeriodIndex(df1.dates, freq='M')
nov_mask=df1['month'] == 2007-06
plot_data= df1[nov_mask].pivot(index='dates', values='DPD')
plot_data.plot()
plt.show()
I don't know what's wrong with my code. The error points at 2007-06 where I define nov_mask. I think the data type is wrong, but I have tried a lot of things and nothing works.
You don't need PeriodIndex if you just want to get June 2007 data. I have no access to IPython right now but this should point you in the right direction.
df1 = pd.read_csv('DPD.csv')
df1['dates'] = pd.to_datetime(df1['dates'])
df1['year'] = df1['dates'].dt.year
df1['month'] = df1['dates'].dt.month
june_mask = ((df1['year'] == 2007) & (df1['month'] == 6))
filtered = df1[june_mask]
# ... Do something with filtered.
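If you would rather keep a monthly period column, as in the question, the comparison value has to be a period too, not the bare expression 2007-06. A minimal sketch, assuming a toy frame (two DPD values are taken from the question, the other rows are made up so that the mask has something to exclude):

```python
import pandas as pd

df1 = pd.DataFrame({
    "dates": pd.to_datetime(
        ["2007-05-31", "2007-06-01", "2007-06-04", "2007-07-02", "2009-06-09"]
    ),
    "DPD": [100.0, 23575.0, 28484.0, 300.0, 54658.0],
})

# Convert each date to its calendar month as a Period.
df1["month"] = df1["dates"].dt.to_period("M")

# Compare against a Period object for June 2007.
june_mask = df1["month"] == pd.Period("2007-06", freq="M")
filtered = df1[june_mask]
print(filtered)
```

Only the two June 2007 rows remain. June 2009 is excluded because the period carries the year as well as the month.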

Values in pandas dataframe not getting sorted

I have a dataframe as shown below:
Category 1 2 3 4 5 6 7 8 9 10 11 12 13
A 424 377 161 133 2 81 141 169 297 153 53 50 197
B 231 121 111 106 4 79 68 70 92 93 71 65 66
C 480 379 159 139 2 116 148 175 308 150 98 82 195
D 88 56 38 40 0 25 24 55 84 36 24 26 36
E 1084 1002 478 299 7 256 342 342 695 378 175 132 465
F 497 246 283 206 4 142 151 168 297 224 194 198 148
H 8 5 4 3 0 2 3 2 7 5 3 2 0
G 3191 2119 1656 856 50 826 955 739 1447 1342 975 628 1277
K 58 26 27 51 1 18 22 42 47 35 19 20 14
S 363 254 131 105 6 82 86 121 196 98 81 57 125
T 54 59 20 4 0 9 12 7 36 23 5 4 20
O 554 304 207 155 3 130 260 183 287 204 98 106 195
P 756 497 325 230 5 212 300 280 448 270 201 140 313
PP 64 43 26 17 1 15 35 17 32 28 18 9 27
R 265 157 109 89 1 68 68 104 154 96 63 55 90
S 377 204 201 114 5 112 267 136 209 172 147 90 157
St 770 443 405 234 5 172 464 232 367 270 290 136 294
Qs 47 33 11 14 0 18 14 19 26 17 5 6 13
Y 1806 626 1102 1177 14 625 619 1079 1273 981 845 891 455
W 123 177 27 28 0 18 62 34 64 27 14 4 51
Z 2770 1375 1579 1082 17 900 1630 1137 1465 1383 861 755 1201
I want to sort the dataframe by values in each row. Once done, I want to sort the index also.
For example the values in first row corresponding to category A, should appear as:
2 50 53 81 133 141 153 161 169 197 297 377 424
I have tried df.sort_values(by=df.index.tolist(), ascending=False, axis=1) but this doesn't work. The values don't appear in sorted order at all
np.sort + sort_index
You can use np.sort along axis=1, then sort_index:
cols, idx = df.columns[1:], df.iloc[:, 0]
res = pd.DataFrame(np.sort(df.iloc[:, 1:].values, axis=1), columns=cols, index=idx)\
.sort_index()
print(res)
1 2 3 4 5 6 7 8 9 10 11 12 \
Category
A 2 50 53 81 133 141 153 161 169 197 297 377
B 4 65 66 68 70 71 79 92 93 106 111 121
C 2 82 98 116 139 148 150 159 175 195 308 379
D 0 24 24 25 26 36 36 38 40 55 56 84
E 7 132 175 256 299 342 342 378 465 478 695 1002
F 4 142 148 151 168 194 198 206 224 246 283 297
G 50 628 739 826 856 955 975 1277 1342 1447 1656 2119
H 0 0 2 2 2 3 3 3 4 5 5 7
K 1 14 18 19 20 22 26 27 35 42 47 51
O 3 98 106 130 155 183 195 204 207 260 287 304
P 5 140 201 212 230 270 280 300 313 325 448 497
PP 1 9 15 17 17 18 26 27 28 32 35 43
Qs 0 5 6 11 13 14 14 17 18 19 26 33
R 1 55 63 68 68 89 90 96 104 109 154 157
S 6 57 81 82 86 98 105 121 125 131 196 254
S 5 90 112 114 136 147 157 172 201 204 209 267
St 5 136 172 232 234 270 290 294 367 405 443 464
T 0 4 4 5 7 9 12 20 20 23 36 54
W 0 4 14 18 27 27 28 34 51 62 64 123
Y 14 455 619 625 626 845 891 981 1079 1102 1177 1273
Z 1 17 755 861 900 1082 1137 1375 1383 1465 1579 1630
Another way is to apply sorted along axis=1, expand the resulting lists back into columns with pd.Series, and finally set Category as the index and sort by it:
(df.loc[:,'1':].apply(sorted, axis = 1).apply(pd.Series)
 .set_index(df.Category).sort_index())
Category 0 1 2 3 4 5 6 7 8 9 10 ...
0 A 2 50 53 81 133 141 153 161 169 197 297 ...
1 B 4 65 66 68 70 71 79 92 93 106 111 ...
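For reference, here is a self-contained sketch of the np.sort + sort_index approach on a three-column excerpt of the question's data (rows A and B only, deliberately listed out of order so that the index sort is visible).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Category": ["B", "A"],
    "1": [231, 424],
    "2": [121, 377],
    "3": [111, 161],
})

# Sort the values within each row (axis=1), ignoring the Category column.
row_sorted = np.sort(df.iloc[:, 1:].to_numpy(), axis=1)

# Rebuild the frame with Category as the index, then sort the index.
res = pd.DataFrame(
    row_sorted, columns=df.columns[1:], index=df["Category"]
).sort_index()
print(res)
```

Note that after a row-wise sort the column labels no longer identify the original columns. They only give the rank of each value within its row.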

How I can group by index and make mean on each column

I have following df
Blades & Razors & Foam Diaper Empty Fem Care HairCare Irrelevant Laundry Oral Care Others Personal Cleaning Care Skin Care
retailer
RTM 158 486 193 2755 3490 1458 889 2921 69 1543 645
RTM 39 0 28 2305 80 27 0 0 0 1207 414
RTM 98 276 121 1090 2359 717 561 911 293 1286 528
RTM 107 484 54 2136 2777 151 80 2191 7 1096 673
RTM 156 465 254 2972 2802 763 867 1065 8 2777 728
RTM 126 326 142 2126 2035 581 575 753 45 1768 292
RTM 0 0 181 1816 1455 598 579 0 2 749 451
RTM 86 374 308 2197 2075 576 698 693 26 1398 212
RTM 132 61 153 2094 1508 180 590 785 66 1519 486
RTM 90 303 8 0 0 18 0 60 0 358 0
RTM 0 14 6 190 198 21 131 75 18 171 0
I want to group by my index with groupby() and then get the average of every column within each group. Any idea how to do that?
To group on index, use:
df.groupby(level=0).mean()
or
df.groupby(df.index).mean()
Sample:
df = pd.DataFrame(data=np.random.random((10, 5)), columns=list('CDEFG'), index=list('AB')*5)
df.head()
C D E F G
A 0.230504 0.830818 0.560533 0.266903 0.745196
B 0.996806 0.861006 0.257780 0.258976 0.738617
A 0.409191 0.688814 0.214247 0.309678 0.565571
B 0.805192 0.940919 0.707562 0.772370 0.122562
A 0.596964 0.935662 0.493612 0.108362 0.673538
for either of the above yields:
C D E F G
A 0.328301 0.560188 0.632549 0.491101 0.343343
B 0.405996 0.490331 0.540921 0.394136 0.466504
C D E F G
A 0.328301 0.560188 0.632549 0.491101 0.343343
B 0.405996 0.490331 0.540921 0.394136 0.466504
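Since the sample above uses random numbers, here is a small deterministic sketch of the same idea. The Diaper and Laundry values come from the question's table, while the second retailer label "OTH" is made up so that there is more than one group.

```python
import pandas as pd

df = pd.DataFrame(
    {"Diaper": [486, 0, 276, 14], "Laundry": [889, 0, 561, 131]},
    index=pd.Index(["RTM", "RTM", "OTH", "OTH"], name="retailer"),
)

# Group on the first (and only) index level and average every column.
by_level = df.groupby(level=0).mean()

# Equivalent: pass the index itself as the grouping key.
by_index = df.groupby(df.index).mean()

print(by_level)
```

Both calls return one row per distinct index value, with the column means computed within each group.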

How to plot a histogram

My program imports these:
import requests
import demjson
import pandas as pd
from pandas import DataFrame
import pylab
pylab.show()
I have a dataframe which if I print out looks like this:
Strike COI POI
0 50.00 927 1694
1 55.00 394 1898
2 60.00 2042 4438
3 65.00 642 3696
4 70.00 3169 3216
5 75.00 2529 3222
6 80.00 6268 14029
7 85.00 3988 6241
8 87.50 356 1516
9 90.00 15676 14345
10 92.50 1309 2498
11 95.00 3303 11391
12 97.50 1074 1472
13 100.00 64930 19513
14 105.00 10953 9286
15 110.00 19956 13008
16 115.00 13956 12932
17 120.00 23440 9240
18 125.00 12167 7467
19 130.00 23531 10168
20 135.00 9567 2637
21 140.00 18967 6854
22 145.00 7890 5176
23 150.00 21516 8079
24 155.00 3137 267
25 160.00 4115 432
26 165.00 1079 205
27 170.00 4341 785
28 175.00 6277 1631
29 180.00 1805 35
30 185.00 906 136
31 190.00 1984 377
32 195.00 3539 268
Sometimes there are zero values like this
Strike COI POI
0 95.00 53 663
1 100.00 16 595
2 105.00 6 377
3 110.00 56 1217
4 115.00 174 994
5 120.00 631 3227
6 125.00 701 1031
7 130.00 2678 833
8 135.00 1921 1049
9 140.00 1238 10
10 160.00 1486 0
11 165.00 1900 0
Unfortunately sometimes the Strike is a float like this:
Strike COI POI
0 34.29 476 12711
1 35.71 95 7782
2 37.14 0 7844
3 38.57 0 3640
4 40.00 93 6010
5 41.43 0 5621
6 42.86 1245 18146
7 44.29 116 6844
8 45.71 140 7099
9 47.14 500 483
10 48.57 445 3956
11 50.00 1540 22362
12 51.43 152 6366
13 52.86 131 8354
14 54.29 810 7542
15 55.71 132 9337
16 57.14 12455 15024
17 58.57 662 5245
18 60.00 1743 9116
19 61.43 1368 7236
20 62.86 1128 11890
21 64.29 4537 24204
22 65.71 766 5113
23 67.14 1859 10572
24 68.57 12407 11367
25 70.00 13263 11748
26 71.43 23400 31566
27 72.86 2784 12984
28 74.29 12679 20520
29 75.71 6932 14617
.. ... ... ...
63 115.00 39738 18033
64 115.71 5293 2877
65 116.43 1874 2748
66 117.14 4181 1965
67 117.86 3618 4214
68 118.57 11652 4043
69 120.00 81523 34752
70 121.43 14239 3527
71 122.86 9046 6160
72 125.00 187 88
73 125.71 22557 7381
74 128.57 11053 8163
75 130.00 74007 27825
76 131.43 6747 1951
77 132.86 7289 1383
78 134.29 5872 1380
79 135.71 4946 2047
80 137.14 5349 590
81 140.00 98310 57767
82 145.00 9857 403
83 150.00 64701 2063
84 155.00 17398 1434
85 160.00 12363 1133
86 165.00 5222 539
87 170.00 9050 918
88 175.00 9848 678
89 180.00 3408 85
90 185.00 3243 768
91 190.00 3646 419
92 195.00 4789 149
Since I want the Strikes to be the bin, I have tried to plot a histogram by saying:
df.hist(by=df.Strike)
but I either get nothing, or when I do see the system ready to plot with a bunch of little grids (I am using Spyder) I get this error before any plot. As far as I can see, all the dataframes have at least one point. The y-axis also doesn't make sense since its height appears to always be one:
Traceback (most recent call last):
File "<ipython-input-20-6f27fa6cf56c>", line 1, in <module>
runfile('/home/idf/goog.py', wdir='/home/idf')
File "/home/idf/anaconda/lib/python2.7/site-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 682, in runfile
execfile(filename, namespace)
File "/home/idf/anaconda/lib/python2.7/site-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 78, in execfile
builtins.execfile(filename, *where)
File "/home/idf/goog.py", line 153, in <module>
df.hist(by=df.Strike)
File "/home/idf/anaconda/lib/python2.7/site-packages/pandas/tools/plotting.py", line 2740, in hist_frame
**kwds)
File "/home/idf/anaconda/lib/python2.7/site-packages/pandas/tools/plotting.py", line 2873, in grouped_hist
figsize=figsize, layout=layout, rot=rot)
File "/home/idf/anaconda/lib/python2.7/site-packages/pandas/tools/plotting.py", line 2983, in _grouped_plot
plotf(group, ax, **kwargs)
File "/home/idf/anaconda/lib/python2.7/site-packages/pandas/tools/plotting.py", line 2867, in plot_group
ax.hist(group.dropna().values, bins=bins, **kwargs)
File "/home/idf/anaconda/lib/python2.7/site-packages/matplotlib/axes/_axes.py", line 5597, in hist
raise ValueError("x must have at least one data point")
ValueError: x must have at least one data point
When you call the DataFrame.hist method (i.e. the pandas internal plotting function), you only need to pass a column name:
df.hist('Strike') # which is the same as df.hist(column='Strike')
This draws a single histogram of the Strike values.
If you use plt.hist instead (calling the matplotlib function directly), you need to pass df.Strike.values.
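Both calls can be sketched as follows, using the first five rows of the question's first table. The Agg backend and bins=5 are assumptions made here so that the sketch runs without a display and has predictable bins.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Strike": [50.0, 55.0, 60.0, 65.0, 70.0],
    "COI": [927, 394, 2042, 642, 3169],
    "POI": [1694, 1898, 4438, 3696, 3216],
})

# pandas wrapper: pick the column by name.
df.hist(column="Strike", bins=5)

# matplotlib directly: pass the raw values.
fig, ax = plt.subplots()
counts, edges, _ = ax.hist(df["Strike"].to_numpy(), bins=5)
print(counts, edges)
```

The by= argument, in contrast, splits the rows into groups and draws one subplot per distinct value, which explains the grid of tiny plots described in the question.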
