How can I group by index and take the mean of each column - python

I have following df
Blades & Razors & Foam Diaper Empty Fem Care HairCare Irrelevant Laundry Oral Care Others Personal Cleaning Care Skin Care
retailer
RTM 158 486 193 2755 3490 1458 889 2921 69 1543 645
RTM 39 0 28 2305 80 27 0 0 0 1207 414
RTM 98 276 121 1090 2359 717 561 911 293 1286 528
RTM 107 484 54 2136 2777 151 80 2191 7 1096 673
RTM 156 465 254 2972 2802 763 867 1065 8 2777 728
RTM 126 326 142 2126 2035 581 575 753 45 1768 292
RTM 0 0 181 1816 1455 598 579 0 2 749 451
RTM 86 374 308 2197 2075 576 698 693 26 1398 212
RTM 132 61 153 2094 1508 180 590 785 66 1519 486
RTM 90 303 8 0 0 18 0 60 0 358 0
RTM 0 14 6 190 198 21 131 75 18 171 0
I want to use groupby() on my index and then get the average of every column within each group. Any idea how to do this?

To group on index, use:
df.groupby(level=0).mean()
or
df.groupby(df.index).mean()
Sample:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.random.random((10, 5)), columns=list('CDEFG'), index=list('AB')*5)
df.head()
C D E F G
A 0.230504 0.830818 0.560533 0.266903 0.745196
B 0.996806 0.861006 0.257780 0.258976 0.738617
A 0.409191 0.688814 0.214247 0.309678 0.565571
B 0.805192 0.940919 0.707562 0.772370 0.122562
A 0.596964 0.935662 0.493612 0.108362 0.673538
Either of the above yields:
C D E F G
A 0.328301 0.560188 0.632549 0.491101 0.343343
B 0.405996 0.490331 0.540921 0.394136 0.466504
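If you need more than the mean, the same level-based groupby also accepts agg. A small sketch against the sample df above (the named-aggregation form assumes pandas >= 0.25):

# Several summary statistics per index group in one pass
df.groupby(level=0).agg(['mean', 'median', 'std'])

# Or name the outputs explicitly, e.g. for column 'C'
df.groupby(level=0).agg(C_mean=('C', 'mean'), C_max=('C', 'max'))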

Related

How can we run SciPy optimize after doing a group by?

I have a dataframe that looks like this:
SectorID MyDate PrevMainCost OutageCost
0 123 10/31/2022 332 193
1 123 9/30/2022 308 269
2 123 8/31/2022 33 440
3 123 7/31/2022 230 147
4 123 6/30/2022 264 184
5 123 5/31/2022 290 46
6 123 4/30/2022 51 150
7 123 3/31/2022 69 253
8 123 2/28/2022 257 308
9 123 1/31/2022 441 349
10 456 10/31/2022 280 188
11 456 9/30/2022 432 150
12 456 8/31/2022 357 307
13 456 7/31/2022 425 45
14 456 6/30/2022 101 278
15 456 5/31/2022 62 240
16 456 4/30/2022 407 46
17 456 3/31/2022 35 218
18 456 2/28/2022 403 113
19 456 1/31/2022 295 200
20 456 12/31/2021 20 235
21 456 11/30/2021 440 403
22 789 10/31/2022 145 181
23 789 9/30/2022 320 259
24 789 8/31/2022 485 472
25 789 7/31/2022 59 24
26 789 6/30/2022 345 64
27 789 5/31/2022 34 480
28 789 4/30/2022 260 162
29 789 3/31/2022 46 399
30 999 10/31/2022 491 346
31 999 9/30/2022 77 212
32 999 8/31/2022 316 112
33 999 7/31/2022 106 351
34 999 6/30/2022 481 356
35 999 5/31/2022 20 269
36 999 4/30/2022 246 268
37 999 3/31/2022 377 173
38 999 2/28/2022 426 413
39 999 1/31/2022 341 168
40 999 12/31/2021 144 471
41 999 11/30/2021 358 393
42 999 10/31/2021 340 197
43 999 9/30/2021 119 252
44 999 8/31/2021 470 203
45 999 7/31/2021 359 163
46 999 6/30/2021 410 383
47 999 5/31/2021 200 119
48 999 4/30/2021 230 291
I am trying to find the minimum of PrevMainCost and OutageCost, after grouping by SectorID. Here's my primitive code.
import numpy as np
import pandas
df = pandas.read_clipboard(sep='\\s+')
df
df_sum = df.groupby('SectorID').sum()
df_sum
df_sum.loc[df_sum['PrevMainCost'] <= df_sum['OutageCost'], 'Result'] = 'Main'
df_sum.loc[df_sum['PrevMainCost'] > df_sum['OutageCost'], 'Result'] = 'Out'
Result (the Result column flags whether PrevMainCost or OutageCost is lower):
PrevMainCost OutageCost Result
SectorID
123 2275 2339 Main
456 3257 2423 Out
789 1694 2041 Main
999 5511 5140 Out
I am trying to figure out how to use SciPy optimization to solve this. I Googled and came up with this simple code sample.
from scipy.optimize import *
df_sum.groupby(by=['SectorID']).apply(lambda g: minimize(equation, g.Result, options={'xtol':0.001}).x)
When I run that, I get an error saying 'NameError: name 'equation' is not defined'.
How can I find the minimum of either the preventative maintenance cost or the outage cost, after grouping by SectorID? Also, how can I add some kind of constraint, such as no more than 30% of all resources being used by any one SectorID?
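The NameError simply means minimize was handed a function (equation) that was never defined; the objective has to be supplied by you. A minimal sketch of one way to wire it up, assuming a hypothetical objective that blends the two costs with a weight w in [0, 1] (w near 1 means maintenance is cheaper overall, w near 0 means outage is):

import numpy as np
from scipy.optimize import minimize

# Hypothetical objective: weighted blend of the two costs for one sector.
def equation(w, prev, out):
    return np.sum(w * prev + (1.0 - w) * out)

def solve_sector(g):
    res = minimize(equation, x0=np.array([0.5]),
                   args=(g['PrevMainCost'].to_numpy(), g['OutageCost'].to_numpy()),
                   bounds=[(0.0, 1.0)])
    return res.x[0]  # ~1.0 -> maintenance cheaper, ~0.0 -> outage cheaper

weights = df.groupby('SectorID').apply(solve_sector)

A cap like "no more than 30% of all resources for any one SectorID" couples the sectors together, so it cannot live inside a per-group apply; it needs a single optimization over all sectors at once (for linear costs, scipy.optimize.linprog with one inequality constraint per sector is the usual fit).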

Removing duplicate entries and extracting desired information

I have a matrix that looks like this:
DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1.2e+03 16 44 23 49
DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.3e-27 2 121 264 383
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 3.7 2 96 5 95
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 5e-20 3 115 133 260
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1.3e+03 3 21 277 295
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 4.1e+03 14 29 345 360
DNA_pol3_beta 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.9e-18 1 121 1 121
DNA_pol3_beta 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 4.1e+02 30 80 157 209
DNA_pol3_beta 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 0.94 2 101 273 369
SMC_N 220 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 1.2e-14 3 199 19 351
AAA_21 303 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00011 1 32 40 68
AAA_21 303 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.0015 231 300 279 352
AAA_15 369 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 4e-05 4 53 19 67
AAA_15 369 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 8.8e+02 347 363 332 348
AAA_23 200 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.0014 3 41 22 60
I want to filter the results so that, for example, for the item "DNA_pol3_beta_3" there are 2 entries; out of these two, I want to extract only the row whose value in the 5th column is the lowest. So out of the two entries:
DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.3e-27 2 121 264 383
the above one should be in the result. Similarly, for "DNA_pol3_beta_2" there are 4 entries and the program should extract only
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 5e-20 3 115 133 260
because it has the lowest value in the 5th column among the four. Also, the program should ignore entries whose value in the 5th column is less than 1E-5.
I tried the following code:
for i in lines:
    if lines[i+1] == lines[i]:
        if lines[i+1][4] > lines[i][4]:
            evalue = lines[i][4]
        else:
            evalue = lines[i+1][4]
You'd be better off using pandas for this. See below:
import pandas as pd
df = pd.read_csv('yourfile.txt', sep=' ', skipinitialspace=True, names=range(9))
df = df[df[4] >= 0.00001]  # drop rows whose e-value (5th column) is below 1e-5
result = df.loc[df.groupby(0)[4].idxmin()].sort_index().reset_index(drop=True)  # lowest e-value per name
Output:
>>> print(result)
0 1 2 3 4 5 6 7 8
0 DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1200.00000 16 44 23 49
1 DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 3.70000 2 96 5 95
2 DNA_pol3_beta 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 0.94000 2 101 273 369
3 AAA_21 303 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00011 1 32 40 68
4 AAA_15 369 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00004 4 53 19 67
5 AAA_23 200 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00140 3 41 22 60
If you want the result back as a CSV file, you can save it with result.to_csv().
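An equivalent sketch using sort_values plus drop_duplicates, in case idxmin reads less naturally (same assumed layout: column 0 is the name, column 4 the e-value):

result = (df[df[4] >= 1e-5]
          .sort_values(4)                           # lowest e-value first
          .drop_duplicates(subset=0, keep='first')  # keep the best row per name
          .sort_index()
          .reset_index(drop=True))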

Where did I go wrong with retrieving POS proportions through analysis with spaCy?

So, recently, I've been quite intent on getting some statistics regarding the vocabulary of some of the books I like. Nothing advanced---just things like, what is the average length of a sentence? How many adjectives are used in proportion to the total number of words?
The problem is---spaCy, which is the tool I chose for sheer simplicity of use, seems to indicate that: EVERY TEXT I analyze seems to have EXACTLY 5.88235% of adjectives compared to total number of words. EXACTLY 2.94118% of auxiliaries. EXACTLY 14.7059% of nouns. EXACTLY 34 words per sentence (!).
Now, I am pretty sure that I have made some mistake in the code I am using---it's just, I can't see it!
Here is a sample code. I've divided the book by chapter (34 total) and printed a table of some statistics. The numbers seem legit---but these proportions, there is NO WAY every chapter in every text I am analyzing has exactly the same structure.
headers = ['chapter', 'num_sents', 'num_words', 'num_verbs', 'prop_verbs', 'num_adjs',
           'prop_adjs', 'num_adv', 'prop_adv', 'num_aux', 'prop_aux', 'num_intj',
           'prop_intj', 'num_non', 'prop_non', 'num_pron', 'prop_pron', 'num_propn',
           'prop_propn', 'words/sentences']
stats = list()
count_chpt = 1
for chapter in kf_spc:
    num_sents = 0
    num_words = 0
    num_verbs = 0
    num_adjs = 0
    num_adv = 0
    num_aux = 0
    num_intj = 0
    num_non = 0
    num_pron = 0
    num_propn = 0
    for sentence in chapter.sents:
        num_sents += 1
        for token in sent:
            num_words += 1
            if token.pos_ == 'VERB':
                num_verbs += 1
            elif token.pos_ == 'ADJ':
                num_adjs += 1
            elif token.pos_ == 'ADV':
                num_adv += 1
            elif token.pos_ == 'AUX':
                num_aux += 1
            elif token.pos_ == 'INTJ':
                num_intj += 1
            elif token.pos_ == 'NOUN':
                num_non += 1
            elif token.pos_ == 'PRON':
                num_pron += 1
            elif token.pos_ == 'PROPN':
                num_propn += 1
    stats.append([count_chpt, num_sents, num_words, num_verbs, num_verbs/num_words,
                  num_adjs, num_adjs/num_words, num_adv, num_adv/num_words,
                  num_aux, num_aux/num_words, num_intj, num_intj/num_words, num_non,
                  num_non/num_words, num_pron, num_pron/num_words, num_propn,
                  num_propn/num_words, num_words/num_sents])
    count_chpt += 1

print(tabulate(stats, headers=headers))
OUTPUT:
chapter num_sents num_words num_verbs prop_verbs num_adjs prop_adjs num_adv prop_adv num_aux prop_aux num_intj prop_intj num_non prop_non num_pron prop_pron num_propn prop_propn words/sentences
--------- ----------- ----------- ----------- ------------ ---------- ----------- --------- ---------- --------- ---------- ---------- ----------- --------- ---------- ---------- ----------- ----------- ------------ -----------------
1 251 8534 1255 0.147059 251 0.0294118 502 0.0588235 251 0.0294118 0 0 1255 0.147059 502 0.0588235 753 0.0882353 34
2 155 5270 775 0.147059 155 0.0294118 310 0.0588235 155 0.0294118 0 0 775 0.147059 310 0.0588235 465 0.0882353 34
3 235 7990 1175 0.147059 235 0.0294118 470 0.0588235 235 0.0294118 0 0 1175 0.147059 470 0.0588235 705 0.0882353 34
4 226 7684 1130 0.147059 226 0.0294118 452 0.0588235 226 0.0294118 0 0 1130 0.147059 452 0.0588235 678 0.0882353 34
5 80 2720 400 0.147059 80 0.0294118 160 0.0588235 80 0.0294118 0 0 400 0.147059 160 0.0588235 240 0.0882353 34
6 276 9384 1380 0.147059 276 0.0294118 552 0.0588235 276 0.0294118 0 0 1380 0.147059 552 0.0588235 828 0.0882353 34
7 276 9384 1380 0.147059 276 0.0294118 552 0.0588235 276 0.0294118 0 0 1380 0.147059 552 0.0588235 828 0.0882353 34
8 412 14008 2060 0.147059 412 0.0294118 824 0.0588235 412 0.0294118 0 0 2060 0.147059 824 0.0588235 1236 0.0882353 34
9 123 4182 615 0.147059 123 0.0294118 246 0.0588235 123 0.0294118 0 0 615 0.147059 246 0.0588235 369 0.0882353 34
10 475 16150 2375 0.147059 475 0.0294118 950 0.0588235 475 0.0294118 0 0 2375 0.147059 950 0.0588235 1425 0.0882353 34
11 80 2720 400 0.147059 80 0.0294118 160 0.0588235 80 0.0294118 0 0 400 0.147059 160 0.0588235 240 0.0882353 34
12 169 5746 845 0.147059 169 0.0294118 338 0.0588235 169 0.0294118 0 0 845 0.147059 338 0.0588235 507 0.0882353 34
13 358 12172 1790 0.147059 358 0.0294118 716 0.0588235 358 0.0294118 0 0 1790 0.147059 716 0.0588235 1074 0.0882353 34
14 415 14110 2075 0.147059 415 0.0294118 830 0.0588235 415 0.0294118 0 0 2075 0.147059 830 0.0588235 1245 0.0882353 34
15 146 4964 730 0.147059 146 0.0294118 292 0.0588235 146 0.0294118 0 0 730 0.147059 292 0.0588235 438 0.0882353 34
16 275 9350 1375 0.147059 275 0.0294118 550 0.0588235 275 0.0294118 0 0 1375 0.147059 550 0.0588235 825 0.0882353 34
17 332 11288 1660 0.147059 332 0.0294118 664 0.0588235 332 0.0294118 0 0 1660 0.147059 664 0.0588235 996 0.0882353 34
18 172 5848 860 0.147059 172 0.0294118 344 0.0588235 172 0.0294118 0 0 860 0.147059 344 0.0588235 516 0.0882353 34
19 268 9112 1340 0.147059 268 0.0294118 536 0.0588235 268 0.0294118 0 0 1340 0.147059 536 0.0588235 804 0.0882353 34
20 740 25160 3700 0.147059 740 0.0294118 1480 0.0588235 740 0.0294118 0 0 3700 0.147059 1480 0.0588235 2220 0.0882353 34
21 301 10234 1505 0.147059 301 0.0294118 602 0.0588235 301 0.0294118 0 0 1505 0.147059 602 0.0588235 903 0.0882353 34
22 172 5848 860 0.147059 172 0.0294118 344 0.0588235 172 0.0294118 0 0 860 0.147059 344 0.0588235 516 0.0882353 34
23 223 7582 1115 0.147059 223 0.0294118 446 0.0588235 223 0.0294118 0 0 1115 0.147059 446 0.0588235 669 0.0882353 34
24 791 26894 3955 0.147059 791 0.0294118 1582 0.0588235 791 0.0294118 0 0 3955 0.147059 1582 0.0588235 2373 0.0882353 34
25 181 6154 905 0.147059 181 0.0294118 362 0.0588235 181 0.0294118 0 0 905 0.147059 362 0.0588235 543 0.0882353 34
26 357 12138 1785 0.147059 357 0.0294118 714 0.0588235 357 0.0294118 0 0 1785 0.147059 714 0.0588235 1071 0.0882353 34
27 253 8602 1265 0.147059 253 0.0294118 506 0.0588235 253 0.0294118 0 0 1265 0.147059 506 0.0588235 759 0.0882353 34
28 82 2788 410 0.147059 82 0.0294118 164 0.0588235 82 0.0294118 0 0 410 0.147059 164 0.0588235 246 0.0882353 34
29 213 7242 1065 0.147059 213 0.0294118 426 0.0588235 213 0.0294118 0 0 1065 0.147059 426 0.0588235 639 0.0882353 34
30 416 14144 2080 0.147059 416 0.0294118 832 0.0588235 416 0.0294118 0 0 2080 0.147059 832 0.0588235 1248 0.0882353 34
31 280 9520 1400 0.147059 280 0.0294118 560 0.0588235 280 0.0294118 0 0 1400 0.147059 560 0.0588235 840 0.0882353 34
32 306 10404 1530 0.147059 306 0.0294118 612 0.0588235 306 0.0294118 0 0 1530 0.147059 612 0.0588235 918 0.0882353 34
33 322 10948 1610 0.147059 322 0.0294118 644 0.0588235 322 0.0294118 0 0 1610 0.147059 644 0.0588235 966 0.0882353 34
34 312 10608 1560 0.147059 312 0.0294118 624 0.0588235 312 0.0294118 0 0 1560 0.147059 624 0.0588235 936 0.0882353 34
I am using the large English model. I can't see where I am going wrong...
You're iterating over the same sentence repeatedly. Notice sentence vs sent.
for sentence in chapter.sents:
    num_sents += 1
    for token in sent:
As an extra tip, rather than the extensive if/elif chain, use a Counter from collections like this:
from collections import Counter

poscount = Counter()
for word in sentence:
    poscount[word.pos_] += 1
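Putting both fixes together, a corrected sketch of the per-chapter loop (assuming, as in the question, kf_spc is an iterable of spaCy Doc objects, one per chapter):

from collections import Counter

stats = list()
for count_chpt, chapter in enumerate(kf_spc, start=1):
    num_sents = 0
    poscount = Counter()
    for sentence in chapter.sents:
        num_sents += 1
        for token in sentence:  # the current sentence, not a stale 'sent'
            poscount[token.pos_] += 1
    num_words = sum(poscount.values())
    stats.append([count_chpt, num_sents, num_words,
                  poscount['VERB'], poscount['VERB'] / num_words,
                  poscount['ADJ'], poscount['ADJ'] / num_words,
                  num_words / num_sents])

Note that num_words here counts every token (punctuation included), matching the original loop's behavior.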

Values in pandas dataframe not getting sorted

I have a dataframe as shown below:
Category 1 2 3 4 5 6 7 8 9 10 11 12 13
A 424 377 161 133 2 81 141 169 297 153 53 50 197
B 231 121 111 106 4 79 68 70 92 93 71 65 66
C 480 379 159 139 2 116 148 175 308 150 98 82 195
D 88 56 38 40 0 25 24 55 84 36 24 26 36
E 1084 1002 478 299 7 256 342 342 695 378 175 132 465
F 497 246 283 206 4 142 151 168 297 224 194 198 148
H 8 5 4 3 0 2 3 2 7 5 3 2 0
G 3191 2119 1656 856 50 826 955 739 1447 1342 975 628 1277
K 58 26 27 51 1 18 22 42 47 35 19 20 14
S 363 254 131 105 6 82 86 121 196 98 81 57 125
T 54 59 20 4 0 9 12 7 36 23 5 4 20
O 554 304 207 155 3 130 260 183 287 204 98 106 195
P 756 497 325 230 5 212 300 280 448 270 201 140 313
PP 64 43 26 17 1 15 35 17 32 28 18 9 27
R 265 157 109 89 1 68 68 104 154 96 63 55 90
S 377 204 201 114 5 112 267 136 209 172 147 90 157
St 770 443 405 234 5 172 464 232 367 270 290 136 294
Qs 47 33 11 14 0 18 14 19 26 17 5 6 13
Y 1806 626 1102 1177 14 625 619 1079 1273 981 845 891 455
W 123 177 27 28 0 18 62 34 64 27 14 4 51
Z 2770 1375 1579 1082 17 900 1630 1137 1465 1383 861 755 1201
I want to sort the dataframe by values in each row. Once done, I want to sort the index also.
For example the values in first row corresponding to category A, should appear as:
2 50 53 81 133 141 153 161 169 197 297 377 424
I have tried df.sort_values(by=df.index.tolist(), ascending=False, axis=1), but this doesn't work; the values don't appear in sorted order at all.
np.sort + sort_index
You can use np.sort along axis=1, then sort_index (sort_values won't work here because it applies a single column ordering to every row rather than sorting each row independently):
cols, idx = df.columns[1:], df.iloc[:, 0]
res = pd.DataFrame(np.sort(df.iloc[:, 1:].values, axis=1), columns=cols, index=idx)\
.sort_index()
print(res)
1 2 3 4 5 6 7 8 9 10 11 12 \
Category
A 2 50 53 81 133 141 153 161 169 197 297 377
B 4 65 66 68 70 71 79 92 93 106 111 121
C 2 82 98 116 139 148 150 159 175 195 308 379
D 0 24 24 25 26 36 36 38 40 55 56 84
E 7 132 175 256 299 342 342 378 465 478 695 1002
F 4 142 148 151 168 194 198 206 224 246 283 297
G 50 628 739 826 856 955 975 1277 1342 1447 1656 2119
H 0 0 2 2 2 3 3 3 4 5 5 7
K 1 14 18 19 20 22 26 27 35 42 47 51
O 3 98 106 130 155 183 195 204 207 260 287 304
P 5 140 201 212 230 270 280 300 313 325 448 497
PP 1 9 15 17 17 18 26 27 28 32 35 43
Qs 0 5 6 11 13 14 14 17 18 19 26 33
R 1 55 63 68 68 89 90 96 104 109 154 157
S 6 57 81 82 86 98 105 121 125 131 196 254
S 5 90 112 114 136 147 157 172 201 204 209 267
St 5 136 172 232 234 270 290 294 367 405 443 464
T 0 4 4 5 7 9 12 20 20 23 36 54
W 0 4 14 18 27 27 28 34 51 62 64 123
Y 14 455 619 625 626 845 891 981 1079 1102 1177 1273
Z 1 17 755 861 900 1082 1137 1375 1383 1465 1579 1630
One way is to apply sorted along axis=1, then apply pd.Series to get a dataframe back instead of a Series of lists, and finally set and sort the index by Category:
df.loc[:, '1':].apply(sorted, axis=1).apply(pd.Series)\
  .set_index(df.Category).sort_index()
Category 0 1 2 3 4 5 6 7 8 9 10 ...
0 A 2 50 53 81 133 141 153 161 169 197 297 ...
1 B 4 65 66 68 70 71 79 92 93 106 111 ...

Pandas - reindex so I can keep values

Long story short
I have a nested dictionary. When I turn it into a dataframe:
import pandas
pdf = pandas.DataFrame(nested_dict)
95 96 97 98 99 100 101 102 103 104 105 \
A 70019 102 4243 3083 3540 6311 4851 5938 4140 4659 3100
C 0 185 427 433 1190 910 3898 3869 2861 2149 3065
D 8 9 23463 1237 2574 4174 3640 4747 3557 4582 5934
E 141 89 5034 1576 2303 3416 2377 1252 1204 1703 718
F 7 12 1937 2246 1687 1154 1317 3473 1881 2221 3060
G 343 1550 13497 10659 12343 8213 9251 7341 6354 9058 9022
H 1 1978 1829 1394 1945 1003 1382 1489 4182 932 556
I 5 772 1361 3914 3255 3242 2808 3765 3284 2127 3120
K 3 10353 540 2364 1196 882 3439 2107 803 743 621
L 6 14 1599 11759 4571 4821 3450 5071 4364 1891 3677
M 1 6 158 211 524 2738 686 443 612 509 1721
N 6 186 299 2971 791 1440 2028 1163 1689 4296 1535
P 54 31 726 6208 7160 5494 6184 4282 3587 3727 3821
Q 10 87 1228 2233 1016 1801 1768 1693 3414 515 563
R 7 53939 3030 8904 6712 6134 5127 3223 4764 3768 6429
S 76 5213 3676 7480 9831 7666 5410 8185 7508 11237 8298
T 4369 1253 3087 2487 6559 4572 6863 3184 7352 6068 4756
V 732 5 7595 4331 5216 5444 5187 6013 4245 4545 4761
W 0 6 103 1225 598 888 601 713 1298 1323 908
Y 12 9 1968 1085 2787 5489 5529 7840 8691 9745 10136
Eventually I want to melt down this data frame to look like the following.
residue residue_num count
A 95 70019
A 96 102
A 97 4243
....
The residue column is being used as the index, so I don't know how to switch to an arbitrary 0, 1, 2, 3 index and give the "A C D E F..." column a name of its own.
EDIT
Answered myself as per suggestion
Answered from here and here
import pandas
pdf = pandas.DataFrame(the_matrix)
pdf = pdf.reset_index()
pdf.rename(columns={'index': 'aa'}, inplace=True)
pandas.melt(pdf, id_vars='aa', var_name="position", value_name="counts")
aa position counts
0 A 95 70019
1 C 95 0
2 D 95 8
3 E 95 141
4 F 95 7
5 G 95 343
6 H 95 1
7 I 95 5
8 K 95 3
Your pdf looks like a pivot table. Let's assume we have a dataframe with three columns. We can pivot it with a single function like this:
pivoted = df.pivot(index='col1',columns='col2',values='col3')
Unpivoting it back without losing the index requires a reset_index dance:
pivoted.reset_index().melt(id_vars=pivoted.index.name)
To get the exact original df:
pivoted.reset_index().melt(id_vars=pivoted.index.name, var_name='col2', value_name='col3')
PS. To my surprise, melt does not get a kwarg like keep_index=True. Enhancement suggestion is still open: https://github.com/pandas-dev/pandas/issues/17440
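For the question's specific case, rename_axis collapses the reset_index/rename dance into one step; a small sketch (the_matrix as in the EDIT above):

import pandas

pdf = pandas.DataFrame(the_matrix).rename_axis('aa').reset_index()
tidy = pdf.melt(id_vars='aa', var_name='position', value_name='counts')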
