Pandas: How to add column to multiindexed dataframe? - python

I was following a brief tutorial on LinkedIn regarding multiindexed pandas dataframes where I was unable to reproduce a seemingly very basic operation (at 3:00). You DO NOT have to watch the video to grasp the problem.
The following snippet that uses a dataset from seaborn will show that I'm unable to add a column to a multiindexed pandas dataframe using the technique shown in the video, and also described in an SO post here.
Here we go:
import pandas as pd
import seaborn as sns
flights = sns.load_dataset('flights')
flights.head()
flights_indexed = flights.set_index(['year', 'month'])
flights_unstack = flights_indexed.unstack()
print(flights_unstack)
Output:
passengers
month January February March April May June July August September October November December
year
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
And now I'd like to append a column that shows the yearly total (the sum across all months for each year) using
flights_unstack.sum(axis = 1)
Output:
year
1949 1520
1950 1676
1951 2042
1952 2364
1953 2700
1954 2867
1955 3408
1956 3939
1957 4421
1958 4572
1959 5140
1960 5714
The two sources mentioned above demonstrate this by using something as simple as:
flights_unstack['passengers', 'total'] = flights_unstack.sum(axis = 1)
Here, 'total' should appear as a new column under the existing top-level 'passengers' column.
But I'm getting this:
TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category
I'm using Python 3, and so is the author in the video from 2015.
What's going on here?
I've made a bunch of attempts using only the values from the series above, as well as reshaping, transposing, merging and joining the data both as pd.Series and pd.DataFrame. And resetting the indexes. I may have overlooked some important detail, and now I'm hoping for a suggestion from some of you.
EDIT 1 - Here's an attempt after the first suggestion from jezrael:
import pandas as pd
import seaborn as sns
flights = sns.load_dataset('flights')
flights.head()
flights_indexed = flights.set_index(['year', 'month'])
flights_unstack = flights_indexed['passengers'].unstack()
flights_unstack['total'] = flights_unstack.sum(axis = 1)
Output:
TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category

Change:
flights_unstack = flights_indexed.unstack()
to:
flights_unstack = flights_indexed['passengers'].unstack()
to remove the MultiIndex in the columns.
Finally, the new column name has to be added to the categories with add_categories:
flights_unstack.columns = flights_unstack.columns.add_categories(['total'])
flights_unstack['total'] = flights_unstack.sum(axis = 1)
print (df)
January February March April May June July August September \
month
1949 112 118 132 129 121 135 148 148 136
1950 115 126 141 135 125 149 170 170 158
1951 145 150 178 163 172 178 199 199 184
1952 171 180 193 181 183 218 230 242 209
1953 196 196 236 235 229 243 264 272 237
1954 204 188 235 227 234 264 302 293 259
1955 242 233 267 269 270 315 364 347 312
1956 284 277 317 313 318 374 413 405 355
1957 315 301 356 348 355 422 465 467 404
1958 340 318 362 348 363 435 491 505 404
1959 360 342 406 396 420 472 548 559 463
1960 417 391 419 461 472 535 622 606 508
October November December total
month
1949 119 104 118 1520
1950 133 114 140 1676
1951 162 146 166 2042
1952 191 172 194 2364
1953 211 180 201 2700
1954 229 203 229 2867
1955 274 237 278 3408
1956 306 305 336 4003
1957 347 310 337 4427
1958 359 362 405 4692
1959 407 362 405 5140
1960 461 390 432 5714
Setup:
import pandas as pd
temp=u"""month;January;February;March;April;May;June;July;August;September;October;November;December
1949;112;118;132;129;121;135;148;148;136;119;104;118
1950;115;126;141;135;125;149;170;170;158;133;114;140
1951;145;150;178;163;172;178;199;199;184;162;146;166
1952;171;180;193;181;183;218;230;242;209;191;172;194
1953;196;196;236;235;229;243;264;272;237;211;180;201
1954;204;188;235;227;234;264;302;293;259;229;203;229
1955;242;233;267;269;270;315;364;347;312;274;237;278
1956;284;277;317;313;318;374;413;405;355;306;305;336
1957;315;301;356;348;355;422;465;467;404;347;310;337
1958;340;318;362;348;363;435;491;505;404;359;362;405
1959;360;342;406;396;420;472;548;559;463;407;362;405
1960;417;391;419;461;472;535;622;606;508;461;390;432"""
# pd.compat.StringIO no longer exists in current pandas; io.StringIO does the same job.
# After testing, replace 'StringIO(temp)' with 'filename.csv'.
from io import StringIO
df = pd.read_csv(StringIO(temp), sep=";", index_col=[0])
print (df)
df.columns = pd.CategoricalIndex(df.columns)
df.columns = df.columns.add_categories(['total'])
df['total'] = df.sum(axis = 1)
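For a version that runs without seaborn or a CSV file, here is a minimal sketch of the same add_categories fix. The tiny `flights` frame and the name `wide` are made up for illustration; only the fact that `month` is categorical mirrors what `sns.load_dataset('flights')` returns.

```python
import pandas as pd

# Hypothetical four-row stand-in for the seaborn flights data.
# The month column is categorical, which is what triggers the TypeError.
flights = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": pd.Categorical(["January", "February"] * 2,
                            categories=["January", "February"]),
    "passengers": [112, 118, 115, 126],
})

# Selecting the 'passengers' Series first keeps the columns single-level.
wide = flights.set_index(["year", "month"])["passengers"].unstack()

# The columns are a CategoricalIndex, so register the new label as a category
# before assigning the new column.
wide.columns = wide.columns.add_categories(["total"])
wide["total"] = wide.sum(axis=1)
print(wide)
```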

I know this is kind of late but I found the answer to your problem in the FAQs section of the course. Here's what it says:
"Q. What are the issues with Pandas categorical data?
A. Since version 0.6, seaborn.load_dataset converts certain columns to Pandas categorical data (see http://pandas.pydata.org/pandas-docs/stable/categorical.html). This creates a problem in the handling of the "flights" DataFrame used in "Introduction to Pandas/Using multilevel indices". To avoid the problem, you may load the dataset directly with Pandas:
flights = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/flights.csv')"
I hope this helps.
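To illustrate why reading the CSV with plain pandas sidesteps the problem, here is a small sketch. The inline CSV text is a hypothetical four-row excerpt standing in for the remote flights.csv. `read_csv` leaves `month` as ordinary strings, so the MultiIndex assignment from the question should go through. Note that string months come out sorted alphabetically after `unstack()`.

```python
import io
import pandas as pd

# Hypothetical excerpt in the same layout as flights.csv (year, month, passengers).
csv_text = """year,month,passengers
1949,January,112
1949,February,118
1950,January,115
1950,February,126
"""

flights = pd.read_csv(io.StringIO(csv_text))
print(flights["month"].dtype)  # plain strings, not a categorical dtype

flights_unstack = flights.set_index(["year", "month"]).unstack()

# The columns are now an ordinary MultiIndex, so a new ('passengers', 'total')
# column can be added with a tuple key.
flights_unstack["passengers", "total"] = flights_unstack.sum(axis=1)
print(flights_unstack)
```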

You can use the following:
df = pd.concat([flights_unstack.sum(axis = 'columns').rename('Total'), flights_unstack], axis = 'columns')
Then you can reset to multi-index using:
df.columns = pd.MultiIndex.from_tuples(('passengers', column) for column in df.columns)

Related

Python list alignment

I have an assignment I am trying to complete.
I have 100 random ints. From those 100 I have to create a 10x10 table. DONE.
Within that table I have to right-align the values in each column. That is the part I'm missing.
Below is the code for that:
print(num, end=(" " if counter < 10 else "\n"))
Late answer, but you can also use:
import random
rl = random.sample(range(100, 999), 100)
max_n = 10
for n, x in enumerate(rl, 1):
    print(x, end=("\n" if n % max_n == 0 else " "))
440 688 758 837 279 736 510 706 392 631
588 511 610 792 535 526 335 842 247 124
552 329 245 689 832 407 919 302 592 385
542 890 406 898 189 116 495 764 664 471
851 728 292 314 839 503 691 355 350 213
661 489 800 649 521 958 123 205 983 219
321 633 120 388 632 187 158 576 294 835
673 470 699 908 456 270 220 878 376 884
816 525 147 104 602 637 249 763 494 127
981 524 262 915 267 873 886 397 922 932
You can just format the number before printing it.
print(f"{num:>5}", end=(" " if counter < 10 else "\n"))
Alternatively, if you wanna cast the numbers to string you can use the rjust method of string.
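Here is a rough sketch of the rjust variant. The seed, the value range and the names `numbers`, `width` and `rows` are my own assumptions, not from the question. The width is taken from the longest number so every column lines up on the right.

```python
import random

random.seed(0)  # assumed seed, only so that reruns print the same table
numbers = [random.randint(1, 999) for _ in range(100)]

# Pad every number to the width of the longest one.
width = len(str(max(numbers)))

rows = []
for start in range(0, len(numbers), 10):
    chunk = numbers[start:start + 10]
    # str.rjust pads on the left, which right-aligns the digits.
    rows.append(" ".join(str(n).rjust(width) for n in chunk))

table = "\n".join(rows)
print(table)
```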
There is a simple way to do it. I hope I have made it clear.
import random
# Generate 100 random numbers in range 1 to 999 (randrange excludes the upper bound).
random_numbers = [random.randrange(1, 1000) for _ in range(100)]
# Create an empty string to store the pretty string.
pretty_txt = ''
# Loop through random_numbers and use enumerate, starting at 1, to get the iteration count.
for count, i in enumerate(random_numbers, 1):
    # Add the current number into the string with a space.
    pretty_txt += str(i) + ' '
    # Add a newline after every tenth number.
    if count % 10 == 0:
        pretty_txt += '\n'
print(pretty_txt)
The output is:
796 477 578 458 284 287 43 535 514 504
91 411 288 980 85 233 394 313 263 135
770 793 394 362 433 370 725 472 981 932
398 275 626 631 817 82 551 775 211 755
202 81 750 695 402 809 477 925 347 31
313 514 363 115 144 341 684 662 522 236
219 142 114 621 940 241 110 851 997 699
685 434 813 983 710 124 443 569 613 456
232 80 927 445 179 49 871 821 428 750
792 527 799 878 731 221 780 16 779 333

Clustering inconsistencies by time difference in a time series df

I have a pandas dataframe that looks like this:
df
Out[94]:
nr aenvm aenhm ... naenhs naesvs naeshs
date ...
2019-11-16 08:44:24 1 388 776 ... 402 305 566
2019-11-16 08:44:25 2 383 767 ... 407 304 561
2019-11-16 08:44:26 3 378 762 ... 410 301 570
2019-11-16 08:44:27 4 376 766 ... 403 304 567
2019-11-16 08:44:28 5 374 773 ... 398 297 569
The data comes in separate events.
Sometimes there are around 6 minutes of data (let's call it an "event"), and then no data for some hours or days. See, e.g., the structural break in the timestamps:
df.iloc[1056:1065]
Out[95]:
nr aenvm aenhm aesvm ... naenvs naenhs naesvs naeshs
date ...
2019-11-17 05:18:49 1057 276 707 477 ... 244 136 247 525
2019-11-17 05:18:50 1058 268 703 470 ... 238 138 228 504
2019-11-17 05:56:45 1059 304 717 508 ... 295 157 282 519
2019-11-17 05:56:46 1060 304 715 507 ... 302 159 289 508
All I want to do is to "index" or "categorize" those events, so that the rows between two structural breaks share one number [1, 2, 3, ...] in a new column.
My goal is to create a new column like "nr" that separates the "events".
Out[95]:
nr aenvm aenhm aesvm ... naenvs naenhs naesvs naeshs
date ...
2019-11-17 05:18:49 1 276 707 477 ... 244 136 247 525
2019-11-17 05:18:50 1 268 703 470 ... 238 138 228 504
2019-11-17 05:56:45 2 304 717 508 ... 295 157 282 519
2019-11-17 05:56:46 2 304 715 507 ... 302 159 289 508
To be honest, I am a complete Python newbie. I tried some classification with the datetime64 timestamps and .asfreq, but with no success...
I would be very thankful for any good advice! :)
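One possible approach, sketched under assumptions rather than taken from the question: compute the time difference between consecutive rows, flag every gap above a threshold, and take the cumulative sum of those flags. The one-minute threshold `max_gap`, the column name `event` and the four-row sample frame are all hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical sample mirroring the question: two bursts separated by a long gap.
idx = pd.to_datetime([
    "2019-11-17 05:18:49",
    "2019-11-17 05:18:50",
    "2019-11-17 05:56:45",
    "2019-11-17 05:56:46",
])
df = pd.DataFrame({"aenvm": [276, 268, 304, 304]}, index=idx)
df.index.name = "date"

# Assumed threshold: a gap longer than this starts a new event.
max_gap = pd.Timedelta(minutes=1)

# Time difference to the previous row; the first row is NaT, which compares False.
gaps = df.index.to_series().diff()

# Each gap above the threshold increments the counter; +1 makes events start at 1.
df["event"] = (gaps > max_gap).cumsum().to_numpy() + 1
print(df)
```

The threshold has to sit between the sampling interval (about one second here) and the shortest pause between events, so it needs to be checked against the real data.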

How to split column data and create new DataFrame with multiple columns

I'd like to split the data in the following DataFrame
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'per': np.repeat([10,20,30], 32), 'r': 12*list(range(8)), 'cnt': np.random.randint(300, 400, 96)}); df
cnt per r
0 355 10 0
1 359 10 1
2 347 10 2
3 390 10 3
4 304 10 4
5 306 10 5
.. ... ... ..
87 357 30 7
88 371 30 0
89 396 30 1
90 357 30 2
91 353 30 3
92 306 30 4
93 301 30 5
94 329 30 6
95 312 30 7
[96 rows x 3 columns]
such that for each r value a new column cnt_r{r} exists in a DataFrame, while also keeping the corresponding per column.
The following piece of code almost does what I want, except that it loses the per column:
pd.DataFrame({'cnt_r{}'.format(i): df[df.r==i].reset_index()['cnt'] for i in range(8)})
cnt_r0 cnt_r1 cnt_r2 cnt_r3 cnt_r4 cnt_r5 cnt_r6 cnt_r7
0 355 359 347 390 304 306 366 310
1 394 331 384 312 380 350 318 396
2 340 336 360 389 352 370 353 319
...
9 341 300 386 334 386 314 358 326
10 357 386 311 382 356 339 375 357
11 371 396 357 353 306 301 329 312
I need a way to build the following DataFrame:
per cnt_r0 cnt_r1 cnt_r2 cnt_r3 cnt_r4 cnt_r5 cnt_r6 cnt_r7
0 10 355 359 347 390 304 306 366 310
1 10 394 331 384 312 380 350 318 396
2 10 340 336 360 389 352 370 353 319
...
7 20 384 385 376 323 345 339 339 347
9 30 341 300 386 334 386 314 358 326
10 30 357 386 311 382 356 339 375 357
11 30 371 396 357 353 306 301 329 312
Note that by construction my dataset has same number of values per per for each r. Obviously my dataset is much larger than the example one (about 800 million records).
Many thanks for your time.
If possible, reshape the values into a 2D array and then insert the new per column:
np.random.seed(1256)
df = pd.DataFrame(data={'per': np.repeat([10,20,30], 32),
                        'r': 12*list(range(8)),
                        'cnt': np.random.randint(300, 400, 96)})
df1 = pd.DataFrame(df['cnt'].values.reshape(-1, 8)).add_prefix('cnt_r')
df1.insert(0, 'per', np.repeat([10,20,30], 4))
print (df1)
per cnt_r0 cnt_r1 cnt_r2 cnt_r3 cnt_r4 cnt_r5 cnt_r6 cnt_r7
0 10 365 358 305 311 393 343 340 313
1 10 393 319 358 351 322 387 316 359
2 10 360 301 337 333 322 337 393 396
3 10 320 344 325 310 338 381 314 339
4 20 323 305 342 340 343 319 332 371
5 20 398 308 350 320 340 319 305 369
6 20 344 340 345 332 373 334 304 331
7 20 323 349 301 334 344 374 300 336
8 30 357 375 396 354 309 391 304 334
9 30 311 395 372 359 370 342 351 330
10 30 378 302 306 341 308 392 387 332
11 30 350 373 316 376 338 351 398 304
Or use cumcount to create new groups and reshape with set_index and unstack:
df = (df.set_index([df.groupby('r').cumcount(), 'per','r'])['cnt']
        .unstack()
        .add_prefix('cnt_r')
        .reset_index(level=1)
        .rename_axis(None, axis=1))
print (df)
per cnt_r0 cnt_r1 cnt_r2 cnt_r3 cnt_r4 cnt_r5 cnt_r6 cnt_r7
0 10 365 358 305 311 393 343 340 313
1 10 393 319 358 351 322 387 316 359
2 10 360 301 337 333 322 337 393 396
3 10 320 344 325 310 338 381 314 339
4 20 323 305 342 340 343 319 332 371
5 20 398 308 350 320 340 319 305 369
6 20 344 340 345 332 373 334 304 331
7 20 323 349 301 334 344 374 300 336
8 30 357 375 396 354 309 391 304 334
9 30 311 395 372 359 370 342 351 330
10 30 378 302 306 341 308 392 387 332
11 30 350 373 316 376 338 351 398 304

Issue with choosing columns when creating max/min columns in dataframe

I have the following code:
def minmaxdata():
    Totrigs,TotOrDV,TotOrH,TotGas,TotOil,TotBit,ABFl,ABOr,SKFl,SKOr,BCOr,MBFl,MBOr = dataforgraphs()
    tr = Totrigs
    tr['year'] = tr.index.year
    tr['week']= tr.groupby('year').cumcount()+1
    tr2 = tr.pivot_table(index='week',columns='year')
    tr2['max07_13']=tr2.max(axis=1)
    tr2['min07_13']=tr2.min(axis=1)
    print(tr2)
Which gives me this:
Total Rigs max07_13 min07_13
year 2007 2008 2009 2010 2011 2012 2013 2014
week
1 408 333 303 322 419 382 270 477 477 270
2 539 449 357 382 495 541 460 514 541 357
3 581 482 355 419 511 554 502 509 581 355
4 597 485 356 441 514 568 502 502 597 356
5 587 496 340 462 522 570 503 500 587 340
6 590 521 304 457 526 564 506 512 590 304
7 586 539 294 465 517 571 519 530 586 294
8 555 529 282 455 517 555 517 NaN 555 282
9 550 534 232 437 532 519 518 NaN 550 232
10 510 502 160 366 528 419 472 NaN 528 160
11 396 411 107 259 466 296 405 NaN 466 107
...But I would like the two max/min columns on the right to take the max/min for 2007-2013 only. I have tried several indexing methods, but they all seem to result in errors.
Any suggestions?
EDIT:
Tried the scalable solution but got the following error:
KeyError: "['2007' '2008' '2009' '2010' '2011' '2012' '2013'] not in index"
EDIT2:
tr2.columns output is the following:
Year
Total Rigs 2007
2008
2009
2010
2011
2012
2013
2014
max07_13
min07_13
EDIT3:
This was the solution:
gcols=[('Total Rigs',2007),('Total Rigs',2008),('Total Rigs',2009),('Total Rigs',2010),('Total Rigs',2011),('Total Rigs',2012),('Total Rigs',2013)]
tr2['Max 2007-2013']=tr2[gcols].max(axis=1)
tr2['Min 2007-2013']=tr2[gcols].min(axis=1)
A bit of a non-scalable solution would be to drop 2014 and then call max and min -
tr2['max07_13']=tr2.drop('2014', axis=1).max(axis=1)
If you know the columns of interest, you can also use that -
columns_of_interest = ['2007', '2008', '2009', '2010', '2011', '2012', '2013']
tr2['max07_13']=tr2[columns_of_interest].max(axis=1)
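Since the KeyError in the question suggests the columns are a MultiIndex with integer years, plain string labels like '2007' will not match. Here is a sketch of building the tuple keys programmatically instead. The two-row `tr2` below is a hypothetical stand-in that reuses weeks 1 and 8 from the printed table; the name `selected` is mine.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for tr2: columns are ('Total Rigs', year) with integer years.
cols = pd.MultiIndex.from_product([["Total Rigs"], range(2007, 2015)],
                                  names=[None, "year"])
tr2 = pd.DataFrame(
    [[408, 333, 303, 322, 419, 382, 270, 477],
     [555, 529, 282, 455, 517, 555, 517, np.nan]],
    index=pd.Index([1, 8], name="week"),
    columns=cols,
)

# Build the column keys in a loop instead of typing each tuple by hand.
gcols = [("Total Rigs", year) for year in range(2007, 2014)]

# Take the selection once, before adding any new column, so the new max column
# cannot leak into the min calculation.
selected = tr2[gcols]
tr2["Max 2007-2013"] = selected.max(axis=1)
tr2["Min 2007-2013"] = selected.min(axis=1)
print(tr2)
```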

Raise TypeError when adding new column to a pandas DataFrame

I've been watching an online course about data analysis using Python. I came across a problem when following exactly what the instructor did. Basically, I pulled a data frame called "flights" from seaborn, set "year" and "month" as the index, and unstacked it. The following code is used:
import seaborn
import pandas as pd
flights = seaborn.load_dataset("flights")
flights_indexed = flights.set_index(["year","month"])
flights_unstacked = flights_indexed.unstack()
flights_unstacked
the final data frame is like this
Then I am trying to add a new column called "Total" at the end for the sum of each year using the following code:
flights_unstacked["passengers"]["Total"] = flights_unstacked.sum(axis = 1)
But it raised a TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category.
I am new to data manipulation using pandas. Can anyone tell me how to fix this? Is this a version issue? The online instructor did exactly the same thing, but his works just fine. PS: I use Python 2.7 and pandas 0.20.3.
The seaborn.load_dataset line loads the month column with a categorical data type. To get around this error, cast the categorical column to str with this line right after flights = seaborn.load_dataset("flights"):
flights["month"] = flights["month"].astype(str)
To sort the month strings in chronological order, first drop the top level (level=0) of the columns of flights_unstacked (this level holds the single value passengers):
import seaborn
import pandas as pd
flights = seaborn.load_dataset("flights")
flights["month"] = flights["month"].astype(str)
flights_indexed = flights.set_index(["year", "month"])
flights_unstacked = flights_indexed.unstack()
flights_unstacked.columns = flights_unstacked.columns.droplevel(0)
Then reindex the month-string columns according to a list of month strings that you pre-built in chronological order:
import calendar
months = [calendar.month_name[i] for i in range(1, 13)]
flights_unstacked = flights_unstacked[months]
Finally, you can add a column of totals:
flights_unstacked["Total"] = flights_unstacked.sum(axis=1)
Result:
In [329]: flights_unstacked
Out[329]:
month January February March April May June July August September October November December Total
year
1949 112 118 132 129 121 135 148 148 136 119 104 118 1520
1950 115 126 141 135 125 149 170 170 158 133 114 140 1676
1951 145 150 178 163 172 178 199 199 184 162 146 166 2042
1952 171 180 193 181 183 218 230 242 209 191 172 194 2364
1953 196 196 236 235 229 243 264 272 237 211 180 201 2700
1954 204 188 235 227 234 264 302 293 259 229 203 229 2867
1955 242 233 267 269 270 315 364 347 312 274 237 278 3408
1956 284 277 317 313 318 374 413 405 355 306 271 306 3939
1957 315 301 356 348 355 422 465 467 404 347 305 336 4421
1958 340 318 362 348 363 435 491 505 404 359 310 337 4572
1959 360 342 406 396 420 472 548 559 463 407 362 405 5140
1960 417 391 419 461 472 535 622 606 508 461 390 432 5714
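The same steps can be tried without seaborn on a small hand-made frame. This is only a sketch: the six-row `flights` frame (first three months of 1949 and 1950) and the name `wide` are assumptions for illustration, and `months` is written out by hand instead of coming from `calendar`.

```python
import pandas as pd

months = ["January", "February", "March"]

# Hypothetical miniature of the flights data with a categorical month column.
flights = pd.DataFrame({
    "year": [1949] * 3 + [1950] * 3,
    "month": pd.Categorical(months * 2, categories=months),
    "passengers": [112, 118, 132, 115, 126, 141],
})

# Casting to str removes the categorical dtype that caused the TypeError.
flights["month"] = flights["month"].astype(str)

wide = flights.set_index(["year", "month"])["passengers"].unstack()
print(list(wide.columns))  # string labels are sorted alphabetically by unstack

# Restore chronological order, then add the totals column.
wide = wide[months]
wide["Total"] = wide.sum(axis=1)
print(wide)
```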
