Why is pandas df.diff(2) different than df.diff().diff()? - python

According to Enders' Applied Econometric Time Series, the second difference of a variable y is defined as:
Δ²y_t = Δ(Δy_t) = (y_t - y_{t-1}) - (y_{t-1} - y_{t-2}) = y_t - 2y_{t-1} + y_{t-2}
Pandas provides the diff function, which takes "periods" as an argument. Nevertheless, df.diff(2) gives a different result than df.diff().diff().
Code excerpt showing the above:
In [8]: df
Out[8]:
C.1 C.2 C.3 C.4 C.5 C.6
C.0
1990 16.0 6.0 256.0 216.0 65536 4352
1991 17.0 7.0 289.0 343.0 131072 5202
1992 6.0 -4.0 36.0 -64.0 64 252
1993 7.0 -3.0 49.0 -27.0 128 392
1994 8.0 -2.0 64.0 -8.0 256 576
1995 13.0 3.0 169.0 27.0 8192 2366
1996 10.0 0.5 100.0 0.5 1024 1100
1997 11.0 1.0 121.0 1.0 2048 1452
1998 4.0 -6.0 16.0 -216.0 16 80
1999 5.0 -5.0 25.0 -125.0 32 150
2000 18.0 8.0 324.0 512.0 262144 6156
2001 3.0 -7.0 9.0 -343.0 8 36
2002 0.5 -10.0 0.5 -1000.0 48 20
2003 1.0 -9.0 1.0 -729.0 2 2
2004 14.0 4.0 196.0 64.0 16384 2940
2005 15.0 5.0 225.0 125.0 32768 3600
2006 12.0 2.0 144.0 8.0 4096 1872
2007 9.0 -1.0 81.0 -1.0 512 810
2008 2.0 -8.0 4.0 -512.0 4 12
2009 19.0 9.0 361.0 729.0 524288 7220
In [9]: df.diff(2)
Out[9]:
C.1 C.2 C.3 C.4 C.5 C.6
C.0
1990 NaN NaN NaN NaN NaN NaN
1991 NaN NaN NaN NaN NaN NaN
1992 -10.0 -10.0 -220.0 -280.0 -65472.0 -4100.0
1993 -10.0 -10.0 -240.0 -370.0 -130944.0 -4810.0
1994 2.0 2.0 28.0 56.0 192.0 324.0
1995 6.0 6.0 120.0 54.0 8064.0 1974.0
1996 2.0 2.5 36.0 8.5 768.0 524.0
1997 -2.0 -2.0 -48.0 -26.0 -6144.0 -914.0
1998 -6.0 -6.5 -84.0 -216.5 -1008.0 -1020.0
1999 -6.0 -6.0 -96.0 -126.0 -2016.0 -1302.0
2000 14.0 14.0 308.0 728.0 262128.0 6076.0
2001 -2.0 -2.0 -16.0 -218.0 -24.0 -114.0
2002 -17.5 -18.0 -323.5 -1512.0 -262096.0 -6136.0
2003 -2.0 -2.0 -8.0 -386.0 -6.0 -34.0
2004 13.5 14.0 195.5 1064.0 16336.0 2920.0
2005 14.0 14.0 224.0 854.0 32766.0 3598.0
2006 -2.0 -2.0 -52.0 -56.0 -12288.0 -1068.0
2007 -6.0 -6.0 -144.0 -126.0 -32256.0 -2790.0
2008 -10.0 -10.0 -140.0 -520.0 -4092.0 -1860.0
2009 10.0 10.0 280.0 730.0 523776.0 6410.0
In [10]: df.diff().diff()
Out[10]:
C.1 C.2 C.3 C.4 C.5 C.6
C.0
1990 NaN NaN NaN NaN NaN NaN
1991 NaN NaN NaN NaN NaN NaN
1992 -12.0 -12.0 -286.0 -534.0 -196544.0 -5800.0
1993 12.0 12.0 266.0 444.0 131072.0 5090.0
1994 0.0 0.0 2.0 -18.0 64.0 44.0
1995 4.0 4.0 90.0 16.0 7808.0 1606.0
1996 -8.0 -7.5 -174.0 -61.5 -15104.0 -3056.0
1997 4.0 3.0 90.0 27.0 8192.0 1618.0
1998 -8.0 -7.5 -126.0 -217.5 -3056.0 -1724.0
1999 8.0 8.0 114.0 308.0 2048.0 1442.0
2000 12.0 12.0 290.0 546.0 262096.0 5936.0
2001 -28.0 -28.0 -614.0 -1492.0 -524248.0 -12126.0
2002 12.5 12.0 306.5 198.0 262176.0 6104.0
2003 3.0 4.0 9.0 928.0 -86.0 -2.0
2004 12.5 12.0 194.5 522.0 16428.0 2956.0
2005 -12.0 -12.0 -166.0 -732.0 2.0 -2278.0
2006 -4.0 -4.0 -110.0 -178.0 -45056.0 -2388.0
2007 0.0 0.0 18.0 108.0 25088.0 666.0
2008 -4.0 -4.0 -14.0 -502.0 3076.0 264.0
2009 24.0 24.0 434.0 1752.0 524792.0 8006.0
In [11]: df.diff(2) - df.diff().diff()
Out[11]:
C.1 C.2 C.3 C.4 C.5 C.6
C.0
1990 NaN NaN NaN NaN NaN NaN
1991 NaN NaN NaN NaN NaN NaN
1992 2.0 2.0 66.0 254.0 131072.0 1700.0
1993 -22.0 -22.0 -506.0 -814.0 -262016.0 -9900.0
1994 2.0 2.0 26.0 74.0 128.0 280.0
1995 2.0 2.0 30.0 38.0 256.0 368.0
1996 10.0 10.0 210.0 70.0 15872.0 3580.0
1997 -6.0 -5.0 -138.0 -53.0 -14336.0 -2532.0
1998 2.0 1.0 42.0 1.0 2048.0 704.0
1999 -14.0 -14.0 -210.0 -434.0 -4064.0 -2744.0
2000 2.0 2.0 18.0 182.0 32.0 140.0
2001 26.0 26.0 598.0 1274.0 524224.0 12012.0
2002 -30.0 -30.0 -630.0 -1710.0 -524272.0 -12240.0
2003 -5.0 -6.0 -17.0 -1314.0 80.0 -32.0
2004 1.0 2.0 1.0 542.0 -92.0 -36.0
2005 26.0 26.0 390.0 1586.0 32764.0 5876.0
2006 2.0 2.0 58.0 122.0 32768.0 1320.0
2007 -6.0 -6.0 -162.0 -234.0 -57344.0 -3456.0
2008 -6.0 -6.0 -126.0 -18.0 -7168.0 -2124.0
2009 -14.0 -14.0 -154.0 -1022.0 -1016.0 -1596.0
Why are they different? Which one corresponds to the definition in Enders' book?

This is precisely because
Δ²y_t = y_t - 2y_{t-1} + y_{t-2} ≠ y_t - y_{t-2}.
The left-hand side is df.diff().diff(), whereas the right-hand side is df.diff(2). For the second difference, you want the left-hand side.

Consider:
df
a
b
c
d
df.diff() is
NaN
b - a
c - b
d - c
df.diff(2) is
NaN
NaN
c - a
d - b
df.diff().diff() is
NaN
NaN
(c - b) - (b - a) = c - 2b + a
(d - c) - (c - b) = d - 2c + b
They're not the same, mathematically.
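In shift terms, df.diff(n) is df - df.shift(n), so the two operations expand differently. A minimal sketch verifying both identities on a toy Series (values arbitrary):
import pandas as pd

s = pd.Series([16.0, 17.0, 6.0, 7.0, 8.0])

# diff(2) is a single two-period difference: y_t - y_{t-2}
assert s.diff(2).equals(s - s.shift(2))

# diff().diff() is the second difference: y_t - 2*y_{t-1} + y_{t-2}
assert s.diff().diff().equals(s - 2 * s.shift(1) + s.shift(2))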

Related

What is the best way to create a new dataframe with existing ones of different shapes and criteria

I have a few dataframes that I have made through various sorting and processing of data from the main dataframe (df1).
df1 - large; it currently covers 6 days' worth of data at 30-minute intervals, but I wish to scale up to longer periods:
import pandas as pd
import numpy as np
bmu_units = pd.read_csv('bmu_units_technology.csv')
b1610 = pd.read_csv('b1610_df.csv')
b1610 = (b1610.merge(bmu_units, on=['BM Unit ID 1'], how='left'))
b1610['% of capacity running'] = b1610.quantity / b1610.Capacity
def func(tech):
    if tech in ["CCGT", "OCGT", "COAL"]:
        return "Fossil"
    else:
        return "ZE"
b1610["Type"] = b1610['Technology'].apply(func)
settlementDate time BM Unit ID 1 BM Unit ID 2_x settlementPeriod quantity BM Unit ID 2_y Capacity Technology % of capacity running Type
0 03/01/2022 00:00:00 RCBKO-1 T_RCBKO-1 1 278.658 T_RCBKO-1 279.0 WIND 0.998774 ZE
1 03/01/2022 00:00:00 LARYO-3 T_LARYW-3 1 162.940 T_LARYW-3 180.0 WIND 0.905222 ZE
2 03/01/2022 00:00:00 LAGA-1 T_LAGA-1 1 262.200 T_LAGA-1 905.0 CCGT 0.289724 Fossil
3 03/01/2022 00:00:00 CRMLW-1 T_CRMLW-1 1 3.002 T_CRMLW-1 47.0 WIND 0.063872 ZE
4 03/01/2022 00:00:00 GRIFW-1 T_GRIFW-1 1 9.972 T_GRIFW-1 102.0 WIND 0.097765 ZE
... ... ... ... ... ... ... ... ... ... ... ...
52533 08/01/2022 23:30:00 CRMLW-1 T_CRMLW-1 48 8.506 T_CRMLW-1 47.0 WIND 0.180979 ZE
52534 08/01/2022 23:30:00 LARYO-4 T_LARYW-4 48 159.740 T_LARYW-4 180.0 WIND 0.887444 ZE
52535 08/01/2022 23:30:00 HOWBO-3 T_HOWBO-3 48 32.554 T_HOWBO-3 440.0 Offshore Wind 0.073986 ZE
52536 08/01/2022 23:30:00 BETHW-1 E_BETHW-1 48 5.010 E_BETHW-1 30.0 WIND 0.167000 ZE
52537 08/01/2022 23:30:00 HMGTO-1 T_HMGTO-1 48 92.094 HMGTO-1 108.0 WIND 0.852722 ZE
df2:
rank = (
    b1610.pivot_table(
        index=['settlementDate', 'BM Unit ID 1', 'Technology'],
        columns='settlementPeriod',
        values='% of capacity running',
        aggfunc=sum,
        fill_value=0)
)
rank['rank of capacity'] = rank.sum(axis=1)
rank
settlementPeriod 1 2 3 4 5 6 7 8 9 10 ... 40 41 42 43 44 45 46 47 48 rank of capacity
settlementDate BM Unit ID 1 Technology
03/01/2022 ABRBO-1 WIND 0.936970 0.969293 0.970909 0.925051 0.885657 0.939394 0.963434 0.938586 0.863232 0.781212 ... 0.461818 0.394545 0.428889 0.537172 0.520606 0.545253 0.873333 0.697778 0.651111 29.566263
ABRTW-1 WIND 0.346389 0.343333 0.345389 0.341667 0.342222 0.346778 0.347611 0.347722 0.346833 0.340556 ... 0.018778 0.015889 0.032056 0.043056 0.032167 0.109611 0.132111 0.163278 0.223556 10.441333
ACHRW-1 WIND 0.602884 0.575628 0.602140 0.651070 0.667721 0.654791 0.539209 0.628698 0.784233 0.782140 ... 0.174419 0.148465 0.139860 0.091535 0.094698 0.272419 0.205023 0.184651 0.177628 18.517814
AKGLW-2 WIND 0.000603 0.000603 0.000603 0.000635 0.000603 0.000635 0.000635 0.000635 0.000635 0.000603 ... 0.191079 0.195079 0.250476 0.281048 0.290000 0.279524 0.358508 0.452698 0.572730 8.616032
ANSUW-1 WIND 0.889368 0.865053 0.915684 0.894000 0.888526 0.858211 0.875158 0.878421 0.809368 0.898737 ... 0.142632 0.212526 0.276421 0.225053 0.235789 0.228000 0.152211 0.226000 0.299158 19.662421
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
08/01/2022 WBURB-2 CCGT 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.636329 0.642447 0.961835 0.908706 0.650212 0.507012 0.513176 0.503576 0.518212 24.439765
HOWBO-3 Offshore Wind 0.030418 0.026355 0.026595 0.014373 0.012523 0.008418 0.010977 0.016918 0.019127 0.025641 ... 0.055509 0.063845 0.073850 0.073923 0.073895 0.073791 0.073886 0.074050 0.073986 2.332809
MRWD-1 CCGT 0.808043 0.894348 0.853043 0.650870 0.159783 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.701739 0.488913 0.488913 0.489348 0.489130 0.392826 0.079130 0.000000 0.000000 23.485217
WBURB-3 CCGT 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.771402 0.699986 0.648386 0.919242 0.759520 0.424513 0.430598 0.420089 0.436376 25.436282
DRAXX-4 BIOMASS 0.706074 0.791786 0.806713 0.806462 0.806270 0.806136 0.806509 0.806369 0.799749 0.825070 ... 0.777395 0.816093 0.707122 0.666639 0.680406 0.679216 0.501433 0.000000 0.000000 36.576512
df3 - this was made by sorting the above dataframe to list sums for each day for each BM Unit ID filtered for specific technology types.
BM Unit ID 1 Technology 03/01/2022 04/01/2022 05/01/2022 06/01/2022 07/01/2022 08/01/2022 ave rank rank
0 FAWN-1 CCGT 1.0 5.0 1.0 5.0 2.0 1.0 2.500000 1.0
1 GRAI-6 CCGT 4.0 18.0 2.0 4.0 3.0 3.0 5.666667 2.0
2 EECL-1 CCGT 5.0 29.0 4.0 1.0 1.0 2.0 7.000000 3.0
3 PEMB-21 CCGT 7.0 1.0 6.0 13.0 8.0 8.0 7.166667 4.0
4 PEMB-51 CCGT 3.0 3.0 3.0 11.0 16.0 NaN 7.200000 5.0
5 PEMB-41 CCGT 9.0 4.0 7.0 7.0 10.0 13.0 8.333333 6.0
6 WBURB-1 CCGT 6.0 9.0 22.0 2.0 7.0 5.0 8.500000 7.0
7 PEMB-31 CCGT 14.0 6.0 13.0 6.0 4.0 9.0 8.666667 8.0
8 GRMO-1 CCGT 2.0 7.0 10.0 24.0 11.0 6.0 10.000000 9.0
9 PEMB-11 CCGT 21.0 2.0 9.0 10.0 9.0 14.0 10.833333 10.0
10 STAY-1 CCGT 19.0 12.0 5.0 23.0 6.0 7.0 12.000000 11.0
11 GRAI-7 CCGT 10.0 27.0 15.0 9.0 15.0 11.0 14.500000 12.0
12 DIDCB6 CCGT 28.0 11.0 11.0 8.0 19.0 15.0 15.333333 13.0
13 STAY-4 CCGT 12.0 8.0 20.0 18.0 14.0 23.0 15.833333 14.0
14 SCCL-3 CCGT 17.0 16.0 31.0 3.0 18.0 10.0 15.833333 14.0
15 CDCL-1 CCGT 13.0 22.0 8.0 25.0 12.0 16.0 16.000000 15.0
16 STAY-3 CCGT 8.0 17.0 17.0 20.0 13.0 22.0 16.166667 16.0
17 MRWD-1 CCGT NaN NaN 19.0 26.0 5.0 19.0 17.250000 17.0
18 WBURB-3 CCGT NaN NaN 24.0 14.0 17.0 17.0 18.000000 18.0
19 WBURB-2 CCGT NaN 14.0 21.0 12.0 31.0 18.0 19.200000 19.0
20 GYAR-1 CCGT NaN 26.0 14.0 17.0 20.0 21.0 19.600000 20.0
21 STAY-2 CCGT 18.0 20.0 18.0 21.0 24.0 20.0 20.166667 21.0
22 SHOS-1 CCGT 16.0 15.0 28.0 15.0 29.0 27.0 21.666667 22.0
23 KLYN-A-1 CCGT NaN 24.0 12.0 19.0 27.0 29.0 22.200000 23.0
24 DIDCB5 CCGT NaN 10.0 35.0 22.0 NaN NaN 22.333333 24.0
25 CARR-1 CCGT NaN 33.0 26.0 27.0 22.0 4.0 22.400000 25.0
26 LAGA-1 CCGT 15.0 13.0 29.0 32.0 23.0 24.0 22.666667 26.0
27 CARR-2 CCGT 24.0 25.0 27.0 29.0 21.0 12.0 23.000000 27.0
28 GRAI-8 CCGT 11.0 28.0 36.0 16.0 26.0 25.0 23.666667 28.0
29 SCCL-2 CCGT 29.0 NaN 16.0 28.0 25.0 NaN 24.500000 29.0
30 LBAR-1 CCGT NaN 19.0 25.0 31.0 28.0 NaN 25.750000 30.0
31 CNQPS-2 CCGT 20.0 NaN 32.0 NaN 32.0 26.0 27.500000 31.0
32 SPLN-1 CCGT NaN NaN 23.0 30.0 30.0 NaN 27.666667 32.0
33 CNQPS-1 CCGT 25.0 NaN 33.0 NaN NaN NaN 29.000000 33.0
34 DAMC-1 CCGT 23.0 21.0 38.0 34.0 NaN NaN 29.000000 33.0
35 KEAD-2 CCGT 30.0 NaN NaN NaN NaN NaN 30.000000 34.0
36 HUMR-1 CCGT 22.0 30.0 37.0 37.0 33.0 28.0 31.166667 35.0
37 SHBA-1 CCGT 26.0 23.0 40.0 35.0 37.0 NaN 32.200000 36.0
38 SEAB-1 CCGT NaN 32.0 34.0 36.0 NaN 30.0 33.000000 37.0
39 CNQPS-4 CCGT 27.0 NaN 41.0 33.0 35.0 31.0 33.400000 38.0
40 PETEM1 CCGT NaN 35.0 NaN NaN NaN NaN 35.000000 39.0
41 SEAB-2 CCGT NaN 31.0 39.0 39.0 34.0 NaN 35.750000 40.0
42 COSO-1 CCGT NaN NaN 30.0 42.0 36.0 NaN 36.000000 41.0
43 ROCK-1 CCGT 31.0 34.0 42.0 38.0 38.0 NaN 36.600000 42.0
44 WBURB-43 COAL 32.0 37.0 45.0 40.0 39.0 32.0 37.500000 43.0
45 WBURB-41 COAL 33.0 38.0 46.0 41.0 40.0 33.0 38.500000 44.0
46 FELL-1 CCGT 34.0 39.0 47.0 43.0 41.0 34.0 39.666667 45.0
47 FDUNT-1 OCGT NaN 36.0 44.0 NaN NaN NaN 40.000000 46.0
48 KEAD-1 CCGT NaN NaN 43.0 NaN NaN NaN 43.000000 47.0
My issue is that I am trying to create a new dataframe from the existing dataframes listed above, in which all my BM Unit ID 1's are listed in order of rank from df2, with the values populated by means over all dates (not split by date) from df1. An example of what I am after is below, which I made in Excel using INDEX/MATCH. Here I have the results for each settlement period from df1 and df2, but instead of being split by date they are aggregated means over all dates in the df; crucially, they are still ranked according to the last column of df2. (One possible approach is sketched after the desired output.)
Desired Output:
BM Unit ID Technology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
Rank Capacity
1 150 FAWN-1 CCGT 130.43 130.93 130.78 130.58 130.57 130.54 130.71 130.87 130.89 130.98 130.83 130.80 130.88 131.02 130.81 130.65 130.86 130.84 131.19 130.60 130.69 130.70 130.40 130.03 130.13 130.03 129.75
2 455 GRAI-6 CCGT 339.45 342.33 322.53 312.40 303.78 307.60 316.35 277.18 293.48 325.75 326.75 271.34 299.74 328.06 317.12 342.66 364.50 390.90 403.32 411.52 400.18 405.94 394.04 400.08 389.08 382.74 374.76
3 408 EECL-1 CCGT 363.31 386.71 364.46 363.31 363.31 363.38 361.87 305.06 286.99 282.74 323.93 242.88 242.64 207.73 294.71 357.15 383.47 426.93 433.01 432.98 435.14 436.38 416.04 417.69 430.42 415.09 406.45
4 430 PEMB-21 CCGT 334.40 419.50 436.70 441.90 440.50 415.80 327.90 323.70 322.70 331.10 367.50 368.40 396.70 259.05 415.95 356.32 386.84 400.00 429.52 435.40 434.84 435.88 435.60 438.48 438.16 437.84 437.76
5 465 PEMB-51 CCGT 370.65 370.45 359.90 326.25 326.20 322.65 324.60 274.25 319.55 288.80 301.75 279.08 379.60 376.76 389.92 419.24 403.64 420.92 428.20 421.32 396.92 397.80 424.40 433.92 434.56 431.44 434.40
6 445 PEMB-41 CCGT 337.00 423.40 423.10 427.50 427.00 419.00 361.00 318.80 263.20 226.70 268.70 231.35 366.90 378.35 392.20 421.55 354.96 382.48 422.64 428.28 428.76 431.24 431.92 431.84 429.52 429.00 431.48
7 425 WBURB-1 CCGT 240.41 293.17 252.27 256.51 261.65 253.44 247.14 217.08 223.11 199.27 254.69 314.16 361.07 317.50 259.54 266.83 349.64 383.43 408.18 412.29 395.54 383.48 355.98 340.49 360.87 352.74 376.92
8 465 PEMB-31 CCGT 297.73 360.27 355.40 357.07 358.67 353.07 300.93 284.73 268.73 255.20 248.53 257.75 366.75 376.45 396.40 320.56 342.68 352.52 361.16 379.40 386.64 390.36 409.12 427.48 426.60 426.80 427.16
9 144 GRMO-1 CCGT 106.62 106.11 105.96 106.00 106.00 105.98 105.99 105.90 105.47 105.31 105.28 105.07 105.04 105.06 105.06 105.04 105.06 105.06 105.07 105.04 105.05 105.06 105.04 105.04 105.04 105.06 105.07
10 430 PEMB-11 CCGT 432.80 430.40 430.70 431.90 432.10 429.30 430.00 408.30 320.90 346.50 432.90 432.20 312.93 297.20 414.55 432.00 420.40 429.80 402.60 426.90 430.65 435.85 435.10 431.15 435.20 431.50 431.75
11 457 STAY-1 CCGT 216.07 223.27 232.67 243.47 234.67 221.73 227.00 219.00 237.00 218.33 250.73 228.27 219.67 142.68 243.00 300.64 312.28 331.00 360.84 379.28 398.92 410.04 410.56 409.24 411.96 408.84 411.88
12 455 GRAI-7 CCGT 425.20 425.40 377.90 339.40 342.00 329.80 408.00 402.40 329.00 257.30 130.43 211.37 262.60 318.45 299.98 324.72 350.40 386.26 394.20 402.10 390.48 401.22 388.94 394.10 395.14 379.70 377.26
13 710 DIDCB6 CCGT 465.80 459.50 411.60 411.70 413.70 410.80 351.50 333.40 333.70 390.40 234.60 265.56 348.16 430.28 524.32 554.04 536.28 589.28 594.04 597.72 592.76 557.86 687.70 687.25 687.35 687.25 679.80
14 400 SCCL-3 CCGT 311.50 337.40 378.80 311.50 381.30 338.60 302.70 300.70 300.60 300.70 338.20 321.50 363.80 260.35 228.18 308.70 334.73 324.60 354.63 362.38 347.30 306.22 346.86 365.04 365.40 370.68 370.52
400 SCCL-3 CCGT 311.50 337.40 378.80 311.50 381.30 338.60 302.70 300.70 300.60 300.70 338.20 321.50 363.80 260.35 228.18 308.70 334.73 324.60 354.63 362.38 347.30 306.22 346.86 365.04 365.40 370.68 370.52
16 440 CDCL-1 CCGT 270.63 255.24 210.87 197.10 195.12 198.72 197.64 198.99 233.19 221.31 176.94 317.52 280.68 213.12 297.68 342.25 397.26 372.28 371.74 379.87 347.51 348.48 352.15 384.88 395.14 381.02 360.40
17 457 STAY-3 CCGT 311.25 311.30 311.60 311.45 311.15 311.30 308.40 313.10 223.90 196.05 242.95 172.87 217.40 236.84 252.92 352.98 384.06 414.76 403.68 424.90 418.38 403.00 420.26 424.40 427.06 421.64 424.66
18 920 MRWD-1 CCGT 468.70 483.90 420.60 267.80 472.60 470.20 241.40 299.30 327.70 327.80 336.90 241.60 308.33 529.93 793.73 828.40 870.67 846.67 827.07 855.93 829.33 865.87 870.40 846.87 765.47 785.20 824.00
19 425 WBURB-3 CCGT 311.73 427.68 333.68 333.93 370.68 335.09 420.85 433.86 370.45 321.70 340.54 300.95 155.47 190.67 290.81 310.43 332.52 376.63 391.11 413.74 408.33 398.69 397.54 368.05 410.64 413.05 428.91
20 425 WBURB-2 CCGT 295.54 424.56 336.68 334.08 371.20 358.44 358.90 358.96 377.94 325.42 203.19 165.32 205.75 121.41 162.51 180.15 301.12 413.77 410.33 397.21 385.59 378.09 381.50 380.93 413.71 418.53 427.09
21 420 GYAR-1 CCGT 404.33 404.33 403.73 405.12 404.13 404.33 404.33 376.98 218.02 218.02 351.01 215.10 177.46 222.43 345.47 398.94 401.97 401.97 402.17 401.87 401.47 401.77 401.62 402.51 402.31 402.41 402.26
22 457 STAY-2 CCGT 434.20 435.40 435.40 435.20 434.20 434.20 434.20 434.60 249.80 196.20 291.20 234.80 196.80 88.73 167.10 239.52 324.52 372.80 412.40 423.32 424.04 423.96 423.92 424.08 423.88 420.96 422.44
23 400 KLYN-A-1 CCGT 382.58 382.50 384.94 385.81 385.83 385.79 385.02 384.94 259.16 141.03 195.65 205.75 278.81 256.95 296.85 337.82 369.26 376.38 376.84 376.56 376.30 376.09 375.62 375.45 375.11 375.17 375.09
24 420 SHOS-1 CCGT 290.63 326.33 229.60 265.70 269.05 259.40 299.45 310.20 301.65 266.00 307.90 319.30 253.06 246.85 263.04 220.46 277.68 297.84 290.62 297.86 302.83 295.13 293.73 289.04 306.14 314.24 321.76
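A sketch of one possible approach (not a tested solution; it assumes the column names shown in the question - b1610 with 'BM Unit ID 1', 'Technology', 'settlementPeriod' and 'quantity', and the final 'rank' column from the df3 table above): aggregate means over all dates first, then attach the rank and sort by it.
# mean per unit and settlement period, aggregated over all dates
agg = b1610.pivot_table(
    index=['BM Unit ID 1', 'Technology'],
    columns='settlementPeriod',
    values='quantity',
    aggfunc='mean')
# attach the overall rank and order the rows by it
out = (agg.reset_index()
          .merge(df3[['BM Unit ID 1', 'rank']], on='BM Unit ID 1', how='left')
          .sort_values('rank')
          .set_index(['rank', 'BM Unit ID 1', 'Technology']))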

Create a line/area chart as a gantt chart with plotly

I'm trying to create a line/area chart which looks like a Gantt chart with plotly in Python. That's because I do not have columns of start and end (required for px.timeline). Instead, I have several vectors which begin at a certain place in time and decrease over several months. To better illustrate, that's my dataframe:
periods 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
start
2018-12 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2019-01 252.0 240.0 228.0 208.0 199.0 182.0 168.0 152.0 141.0 132.0 120.0 108.0 91.0 77.0 66.0 52.0 37.0 19.0 7.0
2019-02 140.0 135.0 129.0 123.0 114.0 101.0 99.0 91.0 84.0 74.0 62.0 49.0 45.0 39.0 33.0 26.0 20.0 10.0 3.0
2019-03 97.0 93.0 85.0 79.0 73.0 68.0 62.0 60.0 54.0 50.0 45.0 41.0 37.0 31.0 23.0 18.0 11.0 4.0 NaN
2019-04 92.0 90.0 86.0 82.0 78.0 73.0 67.0 58.0 51.0 46.0 41.0 38.0 36.0 34.0 32.0 19.0 14.0 3.0 1.0
2019-05 110.0 106.0 98.0 94.0 88.0 84.0 81.0 74.0 66.0 64.0 61.0 53.0 42.0 37.0 32.0 20.0 15.0 11.0 1.0
2019-06 105.0 101.0 96.0 87.0 84.0 80.0 75.0 69.0 65.0 60.0 56.0 46.0 40.0 32.0 30.0 18.0 10.0 6.0 2.0
2019-07 123.0 121.0 113.0 105.0 97.0 90.0 82.0 77.0 74.0 69.0 68.0 66.0 55.0 47.0 36.0 32.0 24.0 11.0 2.0
2019-08 127.0 122.0 117.0 112.0 108.0 100.0 94.0 82.0 78.0 69.0 65.0 58.0 53.0 43.0 35.0 24.0 17.0 8.0 2.0
2019-09 122.0 114.0 106.0 100.0 90.0 83.0 76.0 69.0 58.0 50.0 45.0 39.0 32.0 28.0 24.0 17.0 8.0 5.0 1.0
2019-10 164.0 161.0 151.0 138.0 129.0 121.0 114.0 102.0 95.0 88.0 81.0 72.0 62.0 56.0 48.0 40.0 22.0 16.0 5.0
2019-11 216.0 214.0 202.0 193.0 181.0 165.0 150.0 139.0 126.0 116.0 107.0 95.0 82.0 65.0 54.0 44.0 31.0 14.0 7.0
2019-12 341.0 327.0 311.0 294.0 274.0 261.0 245.0 225.0 210.0 191.0 171.0 136.0 117.0 96.0 79.0 55.0 45.0 26.0 6.0
2020-01 1167.0 1139.0 1089.0 1009.0 948.0 881.0 826.0 745.0 682.0 608.0 539.0 473.0 401.0 346.0 292.0 244.0 171.0 90.0 31.0
2020-02 280.0 274.0 262.0 247.0 239.0 226.0 204.0 184.0 169.0 158.0 141.0 125.0 105.0 89.0 68.0 55.0 29.0 18.0 3.0
2020-03 723.0 713.0 668.0 629.0 581.0 537.0 499.0 462.0 419.0 384.0 340.0 293.0 268.0 215.0 172.0 136.0 103.0 67.0 19.0
2020-04 1544.0 1502.0 1420.0 1337.0 1256.0 1149.0 1065.0 973.0 892.0 795.0 715.0 637.0 538.0 463.0 371.0 283.0 199.0 111.0 29.0
2020-05 1355.0 1313.0 1241.0 1175.0 1102.0 1046.0 970.0 890.0 805.0 726.0 652.0 569.0 488.0 415.0 331.0 255.0 180.0 99.0 19.0
2020-06 1042.0 1009.0 949.0 886.0 834.0 784.0 740.0 670.0 611.0 558.0 493.0 438.0 380.0 312.0 257.0 195.0 125.0 78.0 NaN
2020-07 719.0 698.0 663.0 624.0 595.0 547.0 512.0 460.0 424.0 387.0 341.0 301.0 256.0 215.0 172.0 124.0 90.0 NaN NaN
2020-08 655.0 633.0 605.0 566.0 537.0 492.0 453.0 417.0 377.0 333.0 294.0 259.0 222.0 189.0 162.0 118.0 NaN NaN NaN
2020-09 715.0 687.0 647.0 617.0 562.0 521.0 479.0 445.0 408.0 371.0 331.0 297.0 257.0 208.0 165.0 NaN NaN NaN NaN
2020-10 345.0 333.0 313.0 297.0 284.0 267.0 252.0 225.0 201.0 183.0 159.0 141.0 123.0 108.0 NaN NaN NaN NaN NaN
2020-11 1254.0 1221.0 1162.0 1094.0 1027.0 965.0 892.0 816.0 743.0 682.0 607.0 549.0 464.0 NaN NaN NaN NaN NaN NaN
2020-12 387.0 379.0 352.0 338.0 319.0 292.0 275.0 257.0 230.0 207.0 185.0 157.0 NaN NaN NaN NaN NaN NaN NaN
2021-01 805.0 782.0 742.0 692.0 649.0 599.0 551.0 500.0 463.0 417.0 371.0 NaN NaN NaN NaN NaN NaN NaN NaN
2021-02 469.0 458.0 434.0 407.0 380.0 357.0 336.0 317.0 296.0 263.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-03 1540.0 1491.0 1390.0 1302.0 1221.0 1128.0 1049.0 967.0 864.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-04 1265.0 1221.0 1145.0 1086.0 1006.0 937.0 862.0 793.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-05 558.0 548.0 520.0 481.0 446.0 417.0 389.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-06 607.0 589.0 560.0 517.0 484.0 455.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-07 597.0 572.0 543.0 511.0 477.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-08 923.0 902.0 850.0 792.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-09 975.0 952.0 899.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-10 647.0 628.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-11 131.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
As you can see, each vector starts at period 0 and runs until its last available period. Right now, my code is this:
import plotly.express as px

vectors = []
for i in pivot_period.index:
    vectors.append(list(pivot_period.loc[i]))
fig = px.area(y=[i for i in vectors])
If you plot the graph, you will see that the x-axis is the number of periods. However, when I try to use the dates (which are the index), it fails with a length mismatch, since I have 18 periods vs 36 dates. My idea is to plot something like this (sorry for the terrible pic):
In a way that I could visualize the decay of each vector on its own timeline. Any ideas?
Generating an area figure from this data is simple: px.area(df, x=df.index, y=df.columns).
I do not see where the jobs/tasks come from in this dataset to match the image attached.
import io
import pandas as pd
import plotly.express as px

df = pd.read_csv(io.StringIO("""periods 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
start
2018-12 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2019-01 252.0 240.0 228.0 208.0 199.0 182.0 168.0 152.0 141.0 132.0 120.0 108.0 91.0 77.0 66.0 52.0 37.0 19.0 7.0
2019-02 140.0 135.0 129.0 123.0 114.0 101.0 99.0 91.0 84.0 74.0 62.0 49.0 45.0 39.0 33.0 26.0 20.0 10.0 3.0
2019-03 97.0 93.0 85.0 79.0 73.0 68.0 62.0 60.0 54.0 50.0 45.0 41.0 37.0 31.0 23.0 18.0 11.0 4.0 NaN
2019-04 92.0 90.0 86.0 82.0 78.0 73.0 67.0 58.0 51.0 46.0 41.0 38.0 36.0 34.0 32.0 19.0 14.0 3.0 1.0
2019-05 110.0 106.0 98.0 94.0 88.0 84.0 81.0 74.0 66.0 64.0 61.0 53.0 42.0 37.0 32.0 20.0 15.0 11.0 1.0
2019-06 105.0 101.0 96.0 87.0 84.0 80.0 75.0 69.0 65.0 60.0 56.0 46.0 40.0 32.0 30.0 18.0 10.0 6.0 2.0
2019-07 123.0 121.0 113.0 105.0 97.0 90.0 82.0 77.0 74.0 69.0 68.0 66.0 55.0 47.0 36.0 32.0 24.0 11.0 2.0
2019-08 127.0 122.0 117.0 112.0 108.0 100.0 94.0 82.0 78.0 69.0 65.0 58.0 53.0 43.0 35.0 24.0 17.0 8.0 2.0
2019-09 122.0 114.0 106.0 100.0 90.0 83.0 76.0 69.0 58.0 50.0 45.0 39.0 32.0 28.0 24.0 17.0 8.0 5.0 1.0
2019-10 164.0 161.0 151.0 138.0 129.0 121.0 114.0 102.0 95.0 88.0 81.0 72.0 62.0 56.0 48.0 40.0 22.0 16.0 5.0
2019-11 216.0 214.0 202.0 193.0 181.0 165.0 150.0 139.0 126.0 116.0 107.0 95.0 82.0 65.0 54.0 44.0 31.0 14.0 7.0
2019-12 341.0 327.0 311.0 294.0 274.0 261.0 245.0 225.0 210.0 191.0 171.0 136.0 117.0 96.0 79.0 55.0 45.0 26.0 6.0
2020-01 1167.0 1139.0 1089.0 1009.0 948.0 881.0 826.0 745.0 682.0 608.0 539.0 473.0 401.0 346.0 292.0 244.0 171.0 90.0 31.0
2020-02 280.0 274.0 262.0 247.0 239.0 226.0 204.0 184.0 169.0 158.0 141.0 125.0 105.0 89.0 68.0 55.0 29.0 18.0 3.0
2020-03 723.0 713.0 668.0 629.0 581.0 537.0 499.0 462.0 419.0 384.0 340.0 293.0 268.0 215.0 172.0 136.0 103.0 67.0 19.0
2020-04 1544.0 1502.0 1420.0 1337.0 1256.0 1149.0 1065.0 973.0 892.0 795.0 715.0 637.0 538.0 463.0 371.0 283.0 199.0 111.0 29.0
2020-05 1355.0 1313.0 1241.0 1175.0 1102.0 1046.0 970.0 890.0 805.0 726.0 652.0 569.0 488.0 415.0 331.0 255.0 180.0 99.0 19.0
2020-06 1042.0 1009.0 949.0 886.0 834.0 784.0 740.0 670.0 611.0 558.0 493.0 438.0 380.0 312.0 257.0 195.0 125.0 78.0 NaN
2020-07 719.0 698.0 663.0 624.0 595.0 547.0 512.0 460.0 424.0 387.0 341.0 301.0 256.0 215.0 172.0 124.0 90.0 NaN NaN
2020-08 655.0 633.0 605.0 566.0 537.0 492.0 453.0 417.0 377.0 333.0 294.0 259.0 222.0 189.0 162.0 118.0 NaN NaN NaN
2020-09 715.0 687.0 647.0 617.0 562.0 521.0 479.0 445.0 408.0 371.0 331.0 297.0 257.0 208.0 165.0 NaN NaN NaN NaN
2020-10 345.0 333.0 313.0 297.0 284.0 267.0 252.0 225.0 201.0 183.0 159.0 141.0 123.0 108.0 NaN NaN NaN NaN NaN
2020-11 1254.0 1221.0 1162.0 1094.0 1027.0 965.0 892.0 816.0 743.0 682.0 607.0 549.0 464.0 NaN NaN NaN NaN NaN NaN
2020-12 387.0 379.0 352.0 338.0 319.0 292.0 275.0 257.0 230.0 207.0 185.0 157.0 NaN NaN NaN NaN NaN NaN NaN
2021-01 805.0 782.0 742.0 692.0 649.0 599.0 551.0 500.0 463.0 417.0 371.0 NaN NaN NaN NaN NaN NaN NaN NaN
2021-02 469.0 458.0 434.0 407.0 380.0 357.0 336.0 317.0 296.0 263.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-03 1540.0 1491.0 1390.0 1302.0 1221.0 1128.0 1049.0 967.0 864.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-04 1265.0 1221.0 1145.0 1086.0 1006.0 937.0 862.0 793.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-05 558.0 548.0 520.0 481.0 446.0 417.0 389.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-06 607.0 589.0 560.0 517.0 484.0 455.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-07 597.0 572.0 543.0 511.0 477.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-08 923.0 902.0 850.0 792.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-09 975.0 952.0 899.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-10 647.0 628.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-11 131.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN""")
, sep=r"\s+").drop(0).set_index("periods")
px.area(df, x=df.index, y=df.columns)
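To put each vector on its own calendar timeline - which seems to be what the picture is after - one sketch (hedged, since the target chart is only described by the missing image) is to melt to long form and convert each row's start month plus the period offset into a real date:
# long form: one row per (start month, period offset, value)
long = (df.reset_index()
          .melt(id_vars="periods", var_name="offset", value_name="value")
          .dropna(subset=["value"]))
# actual calendar date = start month + offset in months
start = pd.PeriodIndex(long["periods"], freq="M")
long["date"] = (start + long["offset"].astype(int).to_numpy()).to_timestamp()
px.area(long, x="date", y="value", color="periods", line_group="periods")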

Randomly replace 10% of dataframe with NaNs?

I have a randomly generated 10×10 dataset and I need to randomly replace 10% of the data with NaN.
import pandas as pd
import numpy as np
Dataset = pd.DataFrame(np.random.randint(0, 100, size=(10, 10)))
Try the following method. I used it when I was setting up a hackathon and needed to inject missing data into the competition dataset.
You can use np.random.choice to create a boolean mask of the same shape as the dataframe. Just make sure to set the probabilities p for the True and False values, where True marks the values that will be replaced by NaNs.
Then simply apply the mask using df.mask.
import pandas as pd
import numpy as np
p = 0.1 # fraction of missing data required
df = pd.DataFrame(np.random.randint(0,100,size=(10,10)))
mask = np.random.choice([True, False], size=df.shape, p=[p,1-p])
new_df = df.mask(mask)
print(new_df)
0 1 2 3 4 5 6 7 8 9
0 50.0 87 NaN 14 78.0 44.0 19.0 94 28 28.0
1 NaN 58 3.0 75 90.0 NaN 29.0 11 47 NaN
2 91.0 30 98.0 77 3.0 72.0 74.0 42 69 75.0
3 68.0 92 90.0 90 NaN 60.0 74.0 72 58 NaN
4 39.0 51 NaN 81 67.0 43.0 33.0 37 13 40.0
5 73.0 0 59.0 77 NaN NaN 21.0 74 55 98.0
6 33.0 64 0.0 59 27.0 32.0 17.0 3 31 43.0
7 75.0 56 21.0 9 81.0 92.0 89.0 82 89 NaN
8 53.0 44 49.0 31 76.0 64.0 NaN 23 37 NaN
9 65.0 15 31.0 21 84.0 7.0 24.0 3 76 34.0
EDIT:
Updated my answer to produce exactly 10% NaNs, as you are looking for. It uses itertools.product and random.sample to draw a set of indexes to mask, then sets those positions to NaN. This should be exact, as you expected.
from itertools import product
from random import sample
p = 0.1
n = int(df.shape[0]*df.shape[1]*p) #Calculate count of nans
#Sample exactly n indexes
ids = sample(list(product(range(df.shape[0]), range(df.shape[1]))), n)
idx, idy = list(zip(*ids))
data = df.to_numpy().astype(float) #Get data as numpy
data[idx, idy]=np.nan #Update numpy view with np.nan
#Assign to new dataframe
new_df = pd.DataFrame(data, columns=df.columns, index=df.index)
print(new_df)
0 1 2 3 4 5 6 7 8 9
0 52.0 50.0 24.0 81.0 10.0 NaN NaN 75.0 14.0 81.0
1 45.0 3.0 61.0 67.0 93.0 NaN 90.0 34.0 39.0 4.0
2 1.0 NaN NaN 71.0 57.0 88.0 8.0 9.0 62.0 20.0
3 78.0 3.0 82.0 1.0 75.0 50.0 33.0 66.0 52.0 8.0
4 11.0 46.0 58.0 23.0 NaN 64.0 47.0 27.0 NaN 21.0
5 70.0 35.0 54.0 NaN 70.0 82.0 69.0 94.0 20.0 NaN
6 54.0 84.0 16.0 76.0 77.0 50.0 82.0 31.0 NaN 31.0
7 71.0 79.0 93.0 11.0 46.0 27.0 19.0 84.0 67.0 30.0
8 91.0 85.0 63.0 1.0 91.0 79.0 80.0 14.0 75.0 1.0
9 50.0 34.0 8.0 8.0 10.0 56.0 49.0 45.0 39.0 13.0
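A more compact alternative along the same "exactly n cells" idea (a sketch using flat NumPy indices instead of itertools):
n = int(df.size * 0.1)  # exact number of cells to blank
flat = np.random.choice(df.size, size=n, replace=False)  # unique flat positions
arr = df.to_numpy().astype(float)
arr.flat[flat] = np.nan  # set the chosen cells to NaN
new_df = pd.DataFrame(arr, index=df.index, columns=df.columns)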

Moving average by column / year - python, pandas

I need to build a moving average over column "total_medals" by country [noc] for all previous years - my data looks like:
medal Bronze Gold Medal Silver total_medals
noc year
ALG 1984 2.0 NaN NaN NaN 2.0
1992 4.0 2.0 NaN NaN 6.0
1996 2.0 1.0 NaN 4.0 7.0
ANZ 1984 2.0 15.0 NaN 2.0 19.0
1992 3.0 5.0 NaN 2.0 10.0
1996 1.0 2.0 NaN 2.0 5.0
ARG 1984 2.0 6.0 NaN 3.0 11.0
1992 5.0 3.0 NaN 24.0 32.0
1992 3.0 7.0 NaN 5.0 15.0
I want a moving average per country and year (i.e. for ALG: 1984 Avg(total_medals) = 2.0; 1992 Avg(total_medals) = (2.0+6.0)/2 = 4.0; 1996 Avg(total_medals) = (2.0+6.0+7.0)/3 = 5.0) - the moving average should appear in a new column (next to total_medals).
Additionally, for each country & year combination, a new column called "performance" should hold the fraction "total_medals" divided by the moving average.
Sample dataframe:
print(df)
medal Bronze Gold Medal Silver total_medals
noc year
ALG 1984 2.0 NaN NaN NaN 2.0
1992 4.0 2.0 NaN NaN 6.0
1996 2.0 1.0 NaN 4.0 7.0
ANZ 1984 2.0 15.0 NaN 2.0 19.0
1992 3.0 5.0 NaN 2.0 10.0
1996 1.0 2.0 NaN 2.0 5.0
ARG 1984 2.0 6.0 NaN 3.0 11.0
1992 5.0 3.0 NaN 24.0 32.0
1992 3.0 7.0 NaN 5.0 15.0
Use DataFrame.groupby + expanding:
df['total_mean'] = df.groupby(level=0, sort=False).total_medals.apply(lambda x: x.expanding(1).mean())
print(df)
medal Bronze Gold Medal Silver total_medals total_mean
noc year
ALG 1984 2.0 NaN NaN NaN 2.0 2.000000
1992 4.0 2.0 NaN NaN 6.0 4.000000
1996 2.0 1.0 NaN 4.0 7.0 5.000000
ANZ 1984 2.0 15.0 NaN 2.0 19.0 19.000000
1992 3.0 5.0 NaN 2.0 10.0 14.500000
1996 1.0 2.0 NaN 2.0 5.0 11.333333
ARG 1984 2.0 6.0 NaN 3.0 11.0 11.000000
1992 5.0 3.0 NaN 24.0 32.0 21.500000
1992 3.0 7.0 NaN 5.0 15.0 19.333333
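The "performance" column the question asks for is then just the elementwise ratio of the two columns (a one-line sketch, given total_mean above):
df['performance'] = df['total_medals'] / df['total_mean']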
Bronze lagged:
s=df.groupby('noc').apply(lambda x: x['Bronze']/x['total_medals'].shift())
s.index=s.index.droplevel()
df['bronze_lagged']=s
You could create a function for this...
def lagged_medals(type_of_medal):
    s = df.groupby('noc').apply(lambda x: x[type_of_medal] / x['total_medals'].shift())
    s.index = s.index.droplevel()
    df[f'{type_of_medal}_lagged'] = s
lagged_medals('Silver')
#print(df)

Easy pythonic way to classify columns in groups and store it in Dictionary?

Machine_number Machine_Running_Hours
0 1.0 424.0
1 2.0 458.0
2 3.0 465.0
3 4.0 446.0
4 5.0 466.0
5 6.0 466.0
6 7.0 445.0
7 8.0 466.0
8 9.0 447.0
9 10.0 469.0
10 11.0 467.0
11 12.0 449.0
12 13.0 436.0
13 14.0 465.0
14 15.0 463.0
15 16.0 372.0
16 17.0 460.0
17 18.0 450.0
18 19.0 467.0
19 20.0 463.0
20 21.0 205.0
I am trying to classify according to machine number: Machine_number 1 to 5 will be one group, then 6 to 10 another group, and so on.
I think you need to subtract 1 with sub and then use floordiv:
df['g'] = df.Machine_number.sub(1).floordiv(5)
#same as //
#df['g'] = df.Machine_number.sub(1) // 5
print (df)
Machine_number Machine_Running_Hours g
0 1.0 424.0 -0.0
1 2.0 458.0 0.0
2 3.0 465.0 0.0
3 4.0 446.0 0.0
4 5.0 466.0 0.0
5 6.0 466.0 1.0
6 7.0 445.0 1.0
7 8.0 466.0 1.0
8 9.0 447.0 1.0
9 10.0 469.0 1.0
10 11.0 467.0 2.0
11 12.0 449.0 2.0
12 13.0 436.0 2.0
13 14.0 465.0 2.0
14 15.0 463.0 2.0
15 16.0 372.0 3.0
16 17.0 460.0 3.0
17 18.0 450.0 3.0
18 19.0 467.0 3.0
19 20.0 463.0 3.0
20 21.0 205.0 4.0
If you need to store the groups in a dictionary, use groupby with a dict comprehension:
dfs = {i:g for i, g in df.groupby(df.Machine_number.astype(int).sub(1).floordiv(5))}
print (dfs)
{0: Machine_number Machine_Running_Hours
0 1.0 424.0
1 2.0 458.0
2 3.0 465.0
3 4.0 446.0
4 5.0 466.0, 1: Machine_number Machine_Running_Hours
5 6.0 466.0
6 7.0 445.0
7 8.0 466.0
8 9.0 447.0
9 10.0 469.0, 2: Machine_number Machine_Running_Hours
10 11.0 467.0
11 12.0 449.0
12 13.0 436.0
13 14.0 465.0
14 15.0 463.0, 3: Machine_number Machine_Running_Hours
15 16.0 372.0
16 17.0 460.0
17 18.0 450.0
18 19.0 467.0
19 20.0 463.0, 4: Machine_number Machine_Running_Hours
20 21.0 205.0}
print (dfs[0])
Machine_number Machine_Running_Hours
0 1.0 424.0
1 2.0 458.0
2 3.0 465.0
3 4.0 446.0
4 5.0 466.0
print (dfs[1])
Machine_number Machine_Running_Hours
5 6.0 466.0
6 7.0 445.0
7 8.0 466.0
8 9.0 447.0
9 10.0 469.0
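An alternative sketch for the same 1-5 / 6-10 / ... grouping, using pd.cut with explicit bin edges (assumes machine numbers up to 25; widen the range for more machines):
# right-closed bins (0, 5], (5, 10], ... -> integer labels 0, 1, ...
df['g'] = pd.cut(df.Machine_number, bins=range(0, 26, 5), labels=False)
dfs = {i: g for i, g in df.groupby(df['g'])}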
