Clustering inconsistencies by time difference in a time-series df - python

I have a pandas dataframe that looks like this:
df
Out[94]:
nr aenvm aenhm ... naenhs naesvs naeshs
date ...
2019-11-16 08:44:24 1 388 776 ... 402 305 566
2019-11-16 08:44:25 2 383 767 ... 407 304 561
2019-11-16 08:44:26 3 378 762 ... 410 301 570
2019-11-16 08:44:27 4 376 766 ... 403 304 567
2019-11-16 08:44:28 5 374 773 ... 398 297 569
The data arrives in irregular bursts.
Sometimes there are around 6 minutes of data (let's call it an "event"), followed by hours or even days with no data. See, for example, the structural break in the timestamps:
df.iloc[1056:1065]
Out[95]:
nr aenvm aenhm aesvm ... naenvs naenhs naesvs naeshs
date ...
2019-11-17 05:18:49 1057 276 707 477 ... 244 136 247 525
2019-11-17 05:18:50 1058 268 703 470 ... 238 138 228 504
2019-11-17 05:56:45 1059 304 717 508 ... 295 157 282 519
2019-11-17 05:56:46 1060 304 715 507 ... 302 159 289 508
All I want to do is to index or categorize those events, so that the rows between two structural breaks share one number [1, 2, 3, ...] in a new column.
My goal is to create a new column like "nr" that separates the "events".
Out[95]:
nr aenvm aenhm aesvm ... naenvs naenhs naesvs naeshs
date ...
2019-11-17 05:18:49 1 276 707 477 ... 244 136 247 525
2019-11-17 05:18:50 1 268 703 470 ... 238 138 228 504
2019-11-17 05:56:45 2 304 717 508 ... 295 157 282 519
2019-11-17 05:56:46 2 304 715 507 ... 302 159 289 508
To be honest, I am a complete Python newbie. I tried some classification using the datetime64 timestamp and .asfreq, but with no success...
I would be very thankful for any good advice! :)
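One common way to number such events is to look at the time difference between consecutive rows and start a new event whenever the gap exceeds a threshold. The sketch below assumes the index is a sorted DatetimeIndex; the 60-second threshold, the column name `event` and the helper `label_events` are illustrative choices, not something given in the question.

```python
import pandas as pd

def label_events(df, max_gap="60s"):
    """Return a copy of df with an 'event' column (hypothetical helper).

    Rows belong to the same event until the gap to the previous
    timestamp is larger than max_gap.
    """
    out = df.copy()
    # Time difference to the previous row (NaT for the first row).
    gaps = out.index.to_series().diff()
    # True where a new event starts; NaT compares as False.
    new_event = gaps > pd.Timedelta(max_gap)
    # The cumulative sum of the flags gives 0, 0, 1, 1, ...; shift to start at 1.
    out["event"] = new_event.cumsum().to_numpy() + 1
    return out

idx = pd.to_datetime([
    "2019-11-17 05:18:49", "2019-11-17 05:18:50",
    "2019-11-17 05:56:45", "2019-11-17 05:56:46",
])
sample = pd.DataFrame({"aenvm": [276, 268, 304, 304]}, index=idx)
labelled = label_events(sample)
print(labelled)
```

The threshold should be larger than the normal sampling interval (one second here) and smaller than the shortest pause between events.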

Related

How do I filter based on indices in Python?

I am having an issue with manipulating indices once I have used the groupby command. My problem is similar to this code:
import pandas as pd
import numpy as np
np.random.seed(0)
df=pd.DataFrame(np.random.randint(0,10,size=(1000000,5)),columns=list('ABCDE'))
M=df.groupby(['A','B','D','E'])['C'].sum().unstack()
M
E 0 1 2 3 4 5 6 7 8 9
A B D
0 0 0 464 414 553 420 499 394 528 423 415 443
1 407 479 392 441 433 472 520 421 484 384
2 545 546 523 356 386 434 531 534 486 417
3 408 511 422 424 477 351 452 395 341 492
4 502 462 403 434 428 444 506 414 418 328
... ... ... ... ... ... ... ... ... ... ...
9 9 5 419 416 485 386 581 330 408 489 394 454
6 416 475 469 490 357 523 418 514 555 499
7 528 419 462 486 565 388 438 445 469 521
8 390 454 566 341 459 463 478 463 426 499
9 414 436 441 462 403 415 362 472 433 430
[1000 rows x 10 columns]
I am wondering how to filter down to only situations where B is greater than A, when they are both in the index here. If they weren't in the index then I would be doing something like M=M[M['A']<M['B']].
You can temporarily convert the index to_frame:
out = M.loc[M.index.to_frame().query('B>A').index]
Or use Index.get_level_values:
A = M.index.get_level_values('A')
B = M.index.get_level_values('B')
out = M.loc[B>A]
Output:
E 0 1 2 3 4 5 6 7 8 9
A B D
0 1 0 489 452 421 455 442 377 440 476 477 451
1 468 448 473 443 557 492 471 460 476 469
2 576 472 465 355 503 448 491 437 546 425
3 404 438 474 516 410 446 411 459 467 450
4 500 418 441 445 420 605 467 580 479 377
... ... ... ... ... ... ... ... ... ... ...
8 9 5 390 466 436 493 446 508 375 390 485 393
6 457 478 476 417 458 460 361 397 432 403
7 516 587 379 406 396 449 430 433 357 432
8 390 460 489 427 346 490 498 454 395 345
9 474 510 466 336 484 577 443 428 459 406
[450 rows x 10 columns]
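The row count can be checked by hand: with A and B each running from 0 to 9 there are 45 pairs where B is greater than A, and each pair has 10 values of D, which gives 450 rows. The same masking idea on a much smaller frame, as a minimal sketch (the frame `small` and its `value` column are made up for illustration):

```python
import numpy as np
import pandas as pd

# Small stand-in for M: an index with the levels A, B and D.
index = pd.MultiIndex.from_product(
    [range(3), range(3), range(2)], names=["A", "B", "D"]
)
small = pd.DataFrame({"value": np.arange(len(index))}, index=index)

# Build a boolean mask from the index levels, then select with .loc.
a = small.index.get_level_values("A")
b = small.index.get_level_values("B")
filtered = small.loc[b > a]
print(filtered)
```

Here only the pairs (0, 1), (0, 2) and (1, 2) survive, each with two values of D.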

Python list alignment

I have an assignment I am trying to complete.
I have 100 random integers. From those 100 I have to create a 10x10 table. That part is done.
Within that table I have to align my values to the right side of each column. That is the part I'm missing.
Below is the code for that:
print(num, end=(" " if counter < 10 else "\n"))
Late answer, but you can also use:
import random
rl = random.sample(range(100, 999), 100)
max_n = 10
for n, x in enumerate(rl, 1):
    print(x, end=("\n" if n % max_n == 0 else " "))
440 688 758 837 279 736 510 706 392 631
588 511 610 792 535 526 335 842 247 124
552 329 245 689 832 407 919 302 592 385
542 890 406 898 189 116 495 764 664 471
851 728 292 314 839 503 691 355 350 213
661 489 800 649 521 958 123 205 983 219
321 633 120 388 632 187 158 576 294 835
673 470 699 908 456 270 220 878 376 884
816 525 147 104 602 637 249 763 494 127
981 524 262 915 267 873 886 397 922 932
You can just format the number before printing it.
print(f"{num:>5}", end=(" " if counter < 10 else "\n"))
Alternatively, if you want to cast the numbers to strings, you can use the rjust method of str.
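A minimal sketch of the rjust variant, assuming the numbers are already in a list. The names `numbers`, `width` and `rows` are illustrative, and the seed is fixed only so that repeated runs print the same table.

```python
import random

random.seed(0)  # fixed seed only so repeated runs print the same table
numbers = [random.randint(1, 999) for _ in range(100)]

# Pad every number to the width of the longest one so the
# columns line up on the right.
width = max(len(str(n)) for n in numbers)

rows = []
for start in range(0, len(numbers), 10):
    chunk = numbers[start:start + 10]
    # str.rjust pads on the left; f"{n:>{width}}" gives the same result.
    rows.append(" ".join(str(n).rjust(width) for n in chunk))

table = "\n".join(rows)
print(table)
```

Computing the width from the data avoids hard-coding a column size such as 5.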
There is a simple way to do it; I hope the comments make it clear.
import random
# Generate 100 random numbers in the range 1 to 999.
random_numbers = list(map(lambda x: random.randrange(1, 1000), range(100)))
# Create an empty string to store the pretty string.
pretty_txt = ''
# Loop through random_numbers and use enumerate to get the iteration count.
for index, i in enumerate(random_numbers):
    # Add the current number into the string with a space.
    pretty_txt += str(i) + ' '
    # Add a newline after every tenth number (index is zero-based, hence index + 1).
    if (index + 1) % 10 == 0:
        pretty_txt += '\n'
print(pretty_txt)
The output is:
796 477 578 458 284 287 43 535 514 504
91 411 288 980 85 233 394 313 263 135
770 793 394 362 433 370 725 472 981 932
398 275 626 631 817 82 551 775 211 755
202 81 750 695 402 809 477 925 347 31
313 514 363 115 144 341 684 662 522 236
219 142 114 621 940 241 110 851 997 699
685 434 813 983 710 124 443 569 613 456
232 80 927 445 179 49 871 821 428 750
792 527 799 878 731 221 780 16 779 333

Pandas: How to add column to multiindexed dataframe?

I was following a brief tutorial on LinkedIn regarding multiindexed pandas dataframes where I was unable to reproduce a seemingly very basic operation (at 3:00). You DO NOT have to watch the video to grasp the problem.
The following snippet that uses a dataset from seaborn will show that I'm unable to add a column to a multiindexed pandas dataframe using the technique shown in the video, and also described in an SO post here.
Here we go:
import pandas as pd
import seaborn as sns
flights = sns.load_dataset('flights')
flights.head()
flights_indexed = flights.set_index(['year', 'month'])
flights_unstack = flights_indexed.unstack()
print(flights_unstack)
Output:
passengers
month January February March April May June July August September October November December
year
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
And now I'd like to append a column that shows the sum per month for each year using
flights_unstack.sum(axis = 1)
Output:
year
1949 1520
1950 1676
1951 2042
1952 2364
1953 2700
1954 2867
1955 3408
1956 3939
1957 4421
1958 4572
1959 5140
1960 5714
The two sources mentioned above demonstrate this by using something as simple as:
flights_unstack['passengers', 'total'] = flights_unstack.sum(axis = 1)
Here, 'total' should appear as a new column under the existing indexes.
But I'm getting this:
TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category
I'm using Python 3, and so is the author in the video from 2015.
What's going on here?
I've made a bunch of attempts using only the values from the series above, as well as reshaping, transposing, merging and joining the data both as pd.Series and pd.DataFrame, and resetting the indexes. I may have overlooked some important detail, and now I'm hoping for a suggestion from some of you.
EDIT 1 - Here's an attempt after the first suggestion from jezrael:
import pandas as pd
import seaborn as sns
flights = sns.load_dataset('flights')
flights.head()
flights_indexed = flights.set_index(['year', 'month'])
flights_unstack = flights_indexed['passengers'].unstack()
flights_unstack['total'] = flights_unstack.sum(axis = 1)
Output:
TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category
Change:
flights_unstack = flights_indexed.unstack()
to:
flights_unstack = flights_indexed['passengers'].unstack()
to remove the MultiIndex in the columns.
Finally, the new column name has to be registered with add_categories:
flights_unstack.columns = flights_unstack.columns.add_categories(['total'])
flights_unstack['total'] = flights_unstack.sum(axis = 1)
print (df)
January February March April May June July August September \
month
1949 112 118 132 129 121 135 148 148 136
1950 115 126 141 135 125 149 170 170 158
1951 145 150 178 163 172 178 199 199 184
1952 171 180 193 181 183 218 230 242 209
1953 196 196 236 235 229 243 264 272 237
1954 204 188 235 227 234 264 302 293 259
1955 242 233 267 269 270 315 364 347 312
1956 284 277 317 313 318 374 413 405 355
1957 315 301 356 348 355 422 465 467 404
1958 340 318 362 348 363 435 491 505 404
1959 360 342 406 396 420 472 548 559 463
1960 417 391 419 461 472 535 622 606 508
October November December total
month
1949 119 104 118 1520
1950 133 114 140 1676
1951 162 146 166 2042
1952 191 172 194 2364
1953 211 180 201 2700
1954 229 203 229 2867
1955 274 237 278 3408
1956 306 305 336 4003
1957 347 310 337 4427
1958 359 362 405 4692
1959 407 362 405 5140
1960 461 390 432 5714
Setup:
import pandas as pd
temp=u"""month;January;February;March;April;May;June;July;August;September;October;November;December
1949;112;118;132;129;121;135;148;148;136;119;104;118
1950;115;126;141;135;125;149;170;170;158;133;114;140
1951;145;150;178;163;172;178;199;199;184;162;146;166
1952;171;180;193;181;183;218;230;242;209;191;172;194
1953;196;196;236;235;229;243;264;272;237;211;180;201
1954;204;188;235;227;234;264;302;293;259;229;203;229
1955;242;233;267;269;270;315;364;347;312;274;237;278
1956;284;277;317;313;318;374;413;405;355;306;305;336
1957;315;301;356;348;355;422;465;467;404;347;310;337
1958;340;318;362;348;363;435;491;505;404;359;362;405
1959;360;342;406;396;420;472;548;559;463;407;362;405
1960;417;391;419;461;472;535;622;606;508;461;390;432"""
from io import StringIO  # pd.compat.StringIO was removed from newer pandas versions
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=";", index_col=[0])
print (df)
df.columns = pd.CategoricalIndex(df.columns)
df.columns = df.columns.add_categories(['total'])
df['total'] = df.sum(axis = 1)
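The same steps can be wrapped in a small helper that only touches the categories when the columns really are a CategoricalIndex. This is a sketch: the helper name `add_total` and the tiny frame `wide` are made up for illustration, and whether the plain assignment raises the TypeError without this step depends on the pandas version.

```python
import pandas as pd

def add_total(frame, name="total"):
    """Return a copy of frame with a row-sum column (hypothetical helper)."""
    out = frame.copy()
    # Compute the sums before the new column exists.
    totals = out.sum(axis=1)
    if isinstance(out.columns, pd.CategoricalIndex):
        # Register the new label as a category first.
        out.columns = out.columns.add_categories([name])
    out[name] = totals
    return out

wide = pd.DataFrame(
    [[112, 118, 132], [115, 126, 141]],
    index=pd.Index([1949, 1950], name="year"),
    columns=pd.CategoricalIndex(["January", "February", "March"], name="month"),
)
result = add_total(wide)
print(result)
```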
I know this is kind of late but I found the answer to your problem in the FAQs section of the course. Here's what it says:
"Q. What are the issues with Pandas categorical data?
A. Since version 0.6, seaborn.load_dataset converts certain columns to Pandas categorical data (see http://pandas.pydata.org/pandas-docs/stable/categorical.html). This creates a problem in the handling of the "flights" DataFrame used in "Introduction to Pandas/Using multilevel indices". To avoid the problem, you may load the dataset directly with Pandas:
flights = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/flights.csv')"
I hope this helps.
You can use the following:
df = pd.concat([flights_unstack.sum(axis = 'columns').rename('Total'), flights_unstack], axis = 'columns')
Then you can reset to multi-index using:
df.columns = pd.MultiIndex.from_tuples(('passengers', column) for column in df.columns)

Issue with choosing columns when creating max/min columns in dataframe

I have the following code:
def minmaxdata():
    Totrigs,TotOrDV,TotOrH,TotGas,TotOil,TotBit,ABFl,ABOr,SKFl,SKOr,BCOr,MBFl,MBOr = dataforgraphs()
    tr = Totrigs
    tr['year'] = tr.index.year
    tr['week'] = tr.groupby('year').cumcount()+1
    tr2 = tr.pivot_table(index='week',columns='year')
    tr2['max07_13']=tr2.max(axis=1)
    tr2['min07_13']=tr2.min(axis=1)
    print(tr2)
Which gives me this:
Total Rigs max07_13 min07_13
year 2007 2008 2009 2010 2011 2012 2013 2014
week
1 408 333 303 322 419 382 270 477 477 270
2 539 449 357 382 495 541 460 514 541 357
3 581 482 355 419 511 554 502 509 581 355
4 597 485 356 441 514 568 502 502 597 356
5 587 496 340 462 522 570 503 500 587 340
6 590 521 304 457 526 564 506 512 590 304
7 586 539 294 465 517 571 519 530 586 294
8 555 529 282 455 517 555 517 NaN 555 282
9 550 534 232 437 532 519 518 NaN 550 232
10 510 502 160 366 528 419 472 NaN 528 160
11 396 411 107 259 466 296 405 NaN 466 107
...But I would like the two max/min columns on the right to take the max/min over 2007-2013 only. I have tried several indexing methods, but they all seem to result in errors.
Any suggestions?
EDIT:
Tried the scalable solution but got the following error:
KeyError: "['2007' '2008' '2009' '2010' '2011' '2012' '2013'] not in index"
EDIT2:
tr2.columns output is the following:
Year
Total Rigs 2007
2008
2009
2010
2011
2012
2013
2014
max07_13
min07_13
EDIT3:
This was the solution:
gcols=[('Total Rigs',2007),('Total Rigs',2008),('Total Rigs',2009),('Total Rigs',2010),('Total Rigs',2011),('Total Rigs',2012),('Total Rigs',2013)]
tr2['Max 2007-2013']=tr2[gcols].max(axis=1)
tr2['Min 2007-2013']=tr2[gcols].min(axis=1)
A bit of a non-scalable solution would be to drop 2014 and then call max and min -
tr2['max07_13']=tr2.drop('2014', axis=1).max(axis=1)
If you know the columns of interest, you can also use that -
columns_of_interest = ['2007', '2008', '2009', '2010', '2011', '2012', '2013']
tr2['max07_13']=tr2[columns_of_interest].max(axis=1)
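The KeyError reported in the question is consistent with the columns being two-level, with integer years on the second level, so plain string keys such as '2007' are not found. A scalable variant of the accepted fix builds the (top level, year) tuples with a comprehension. The sketch below uses a small stand-in for tr2 built from two rows of the printed table; the names `wanted` and `subset` are illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in for tr2: two-level columns with integer years on the second level.
columns = pd.MultiIndex.from_product([["Total Rigs"], range(2007, 2015)])
tr2 = pd.DataFrame(
    [[408, 333, 303, 322, 419, 382, 270, 477],
     [555, 529, 282, 455, 517, 555, 517, np.nan]],
    index=pd.Index([1, 8], name="week"),
    columns=columns,
)

# Column keys as (top level, year) tuples for 2007 through 2013.
wanted = [("Total Rigs", year) for year in range(2007, 2014)]
subset = tr2[wanted]

# Compute both statistics from the subset before adding any new column,
# so neither result can leak into the other.
tr2["max07_13"] = subset.max(axis=1)
tr2["min07_13"] = subset.min(axis=1)
print(tr2)
```

Changing the range bounds is enough to cover a different span of years.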

Raise TypeError when adding new column to a pandas DataFrame

I've been watching an online course about data analysis using Python. I came across a problem when following exactly what the instructor did. Basically, I pulled a data frame called "flights" from seaborn, set "year" and "month" as the index, and unstacked it. The following code is used:
import seaborn
import pandas as pd
flights = seaborn.load_dataset("flights")
flights_indexed = flights.set_index(["year","month"])
flights_unstacked = flights_indexed.unstack()
flights_unstacked
the final data frame is like this
Then I am trying to add a new column called "Total" at the end for the sum of each year using the following code:
flights_unstacked["passengers"]["Total"] = flights_unstacked.sum(axis = 1)
But it raised a TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category.
I am new to data manipulation using pandas. Can anyone tell me how to fix this? Is this a version issue? The online instructor did exactly the same thing, and his code works just fine. PS: I use Python 2.7 and pandas 0.20.3.
The seaborn.load_dataset line detects the month column as a category data type. To get around this error, cast categorical to str with this line right after flights = seaborn.load_dataset("flights"):
flights["month"] = flights["month"].astype(str)
To sort the month strings in chronological order, first drop the top level (level=0) of the columns of flights_unstacked (this level holds the single value passengers):
import seaborn
import pandas as pd
flights = seaborn.load_dataset("flights")
flights["month"] = flights["month"].astype(str)
flights_indexed = flights.set_index(["year", "month"])
flights_unstacked = flights_indexed.unstack()
flights_unstacked.columns = flights_unstacked.columns.droplevel(0)
Then reindex the month-string columns according to a list of month strings that you pre-built in chronological order:
import calendar
months = [calendar.month_name[i] for i in range(1, 13)]
flights_unstacked = flights_unstacked[months]
Finally, you can add a column of totals:
flights_unstacked["Total"] = flights_unstacked.sum(axis=1)
Result:
In [329]: flights_unstacked
Out[329]:
month January February March April May June July August September October November December Total
year
1949 112 118 132 129 121 135 148 148 136 119 104 118 1520
1950 115 126 141 135 125 149 170 170 158 133 114 140 1676
1951 145 150 178 163 172 178 199 199 184 162 146 166 2042
1952 171 180 193 181 183 218 230 242 209 191 172 194 2364
1953 196 196 236 235 229 243 264 272 237 211 180 201 2700
1954 204 188 235 227 234 264 302 293 259 229 203 229 2867
1955 242 233 267 269 270 315 364 347 312 274 237 278 3408
1956 284 277 317 313 318 374 413 405 355 306 271 306 3939
1957 315 301 356 348 355 422 465 467 404 347 305 336 4421
1958 340 318 362 348 363 435 491 505 404 359 310 337 4572
1959 360 342 406 396 420 472 548 559 463 407 362 405 5140
1960 417 391 419 461 472 535 622 606 508 461 390 432 5714
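The steps above can be tried without seaborn on a small hand-built frame. This is a sketch: the six-row `flights` stand-in covers only three months, and the names `wide` and `ordered` are illustrative. It also assigns the total directly on the frame instead of via flights_unstacked["passengers"]["Total"], since that chained form may write to a temporary object.

```python
import calendar
import pandas as pd

# Long-format stand-in for the seaborn "flights" data,
# with a categorical month column.
months = ["January", "February", "March"]
flights = pd.DataFrame({
    "year": [1949] * 3 + [1950] * 3,
    "month": pd.Categorical(months * 2, categories=months, ordered=True),
    "passengers": [112, 118, 132, 115, 126, 141],
})

# Plain strings sidestep the CategoricalIndex restriction on new labels.
flights["month"] = flights["month"].astype(str)

wide = flights.set_index(["year", "month"])["passengers"].unstack()

# Unstacking sorts string labels alphabetically, so restore calendar order.
all_months = [calendar.month_name[i] for i in range(1, 13)]
ordered = [m for m in all_months if m in wide.columns]
wide = wide[ordered]

wide["Total"] = wide.sum(axis=1)
print(wide)
```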
