Iterate through dates to calculate difference - python

I am working with the following df:
date/time wind (°) wind (kt) temp (C°) humidity(%) currents (°) currents (kt) stemp (C°) stemp_diff
0 2017-04-27 11:21:54 180.0 14.0 27.0 5000.0 200.0 0.4 25.4 2.6
1 2017-05-04 20:31:12 150.0 15.0 22.0 5000.0 30.0 0.2 26.5 -1.2
2 2017-05-08 05:00:52 110.0 6.0 25.0 5000.0 30.0 0.2 27.0 -1.7
3 2017-05-09 05:00:55 160.0 13.0 23.0 5000.0 30.0 0.6 27.0 -2.0
4 2017-05-10 16:39:16 160.0 20.0 22.0 5000.0 30.0 0.6 26.5 -1.8
5 ... ... ... ... ... ... ... ... ...
6 2020-10-25 00:00:00 5000.0 5000.0 21.0 81.0 5000.0 5000.0 23.0 -2.0
7 2020-10-26 00:00:00 5000.0 5000.0 21.0 77.0 5000.0 5000.0 23.0 -2.0
8 2020-10-27 00:00:00 5000.0 5000.0 21.0 80.0 5000.0 5000.0 23.0 -2.0
9 2020-10-31 00:00:00 5000.0 5000.0 22.0 79.0 5000.0 5000.0 23.0 -2.0
10 2020-11-01 00:00:00 5000.0 5000.0 19.0 82.0 5000.0 5000.0 23.0 -2.0
I would like to find a way to iterate through the years and months of my date/time column (i.e. April 2017 all the way through November 2020) and, for each unique time period, calculate the average difference for the stemp column as a starting point (so the average for April 2017, May 2017, and so on, or even just by month).
I'm trying to use this sort of code, but not quite sure how to translate it into something that works:
for '%MM' in australia_in['date/time']:
    australia_in['stemp_diff'] = australia_in['stemp (C°)'] - australia_out['stemp (C°)']
australia_in
Any thoughts?

Let's first make sure that your date/time column is of datetime type, and set it as the index:
df['date/time'] = pd.to_datetime(df['date/time'])
df = df.set_index('date/time')
Then you have several options.
Using pandas.Grouper:
>>> df.groupby(pd.Grouper(freq='M'))['stemp (C°)'].mean().dropna()
date/time
2017-04-30 25.40
2017-05-31 26.75
2020-10-31 23.00
2020-11-30 23.00
Name: stemp (C°), dtype: float64
Using the year and month:
>>> df.groupby([df.index.year, df.index.month])['stemp (C°)'].mean()
date/time date/time
2017 4 25.40
5 26.75
2020 10 23.00
11 23.00
Name: stemp (C°), dtype: float64
If you do not want to have the date/time as index, you can use the column instead:
df.groupby([df['date/time'].dt.year, df['date/time'].dt.month])['stemp (C°)'].mean()
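A closely related option, standard pandas though not shown above, is to group by a monthly period directly:
df.groupby(df['date/time'].dt.to_period('M'))['stemp (C°)'].mean()
which keeps the year and month together in a single PeriodIndex.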
And finally, to have nice names for the year and month:
(df.groupby([df['date/time'].dt.year.rename('year'),
             df['date/time'].dt.month.rename('month')])
   ['stemp (C°)'].mean()
)
output:
year month
2017 4 25.40
5 26.75
2020 10 23.00
11 23.00
Name: stemp (C°), dtype: float64
The same calculation works for stemp_diff.
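For instance, reusing the named year/month grouping from above:
df.groupby([df['date/time'].dt.year.rename('year'),
            df['date/time'].dt.month.rename('month')])['stemp_diff'].mean()
gives: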
year month
2017 4 2.600
5 -1.675
2020 10 -2.000
11 -2.000
Name: stemp_diff, dtype: float64

Related

What is the best way to create a new dataframe with existing ones of different shapes and criteria

I have a few dataframes that I have made through various sorting and processing of data from the main dataframe (df1).
df1 is large and currently covers 6 days' worth of data at 30-minute intervals, but I wish to scale up to longer periods:
import pandas as pd
import numpy as np

bmu_units = pd.read_csv('bmu_units_technology.csv')
b1610 = pd.read_csv('b1610_df.csv')
b1610 = b1610.merge(bmu_units, on=['BM Unit ID 1'], how='left')
b1610['% of capacity running'] = b1610.quantity / b1610.Capacity

def func(tech):
    if tech in ["CCGT", "OCGT", "COAL"]:
        return "Fossil"
    else:
        return "ZE"

b1610["Type"] = b1610['Technology'].apply(func)
settlementDate time BM Unit ID 1 BM Unit ID 2_x settlementPeriod quantity BM Unit ID 2_y Capacity Technology % of capacity running Type
0 03/01/2022 00:00:00 RCBKO-1 T_RCBKO-1 1 278.658 T_RCBKO-1 279.0 WIND 0.998774 ZE
1 03/01/2022 00:00:00 LARYO-3 T_LARYW-3 1 162.940 T_LARYW-3 180.0 WIND 0.905222 ZE
2 03/01/2022 00:00:00 LAGA-1 T_LAGA-1 1 262.200 T_LAGA-1 905.0 CCGT 0.289724 Fossil
3 03/01/2022 00:00:00 CRMLW-1 T_CRMLW-1 1 3.002 T_CRMLW-1 47.0 WIND 0.063872 ZE
4 03/01/2022 00:00:00 GRIFW-1 T_GRIFW-1 1 9.972 T_GRIFW-1 102.0 WIND 0.097765 ZE
... ... ... ... ... ... ... ... ... ... ... ...
52533 08/01/2022 23:30:00 CRMLW-1 T_CRMLW-1 48 8.506 T_CRMLW-1 47.0 WIND 0.180979 ZE
52534 08/01/2022 23:30:00 LARYO-4 T_LARYW-4 48 159.740 T_LARYW-4 180.0 WIND 0.887444 ZE
52535 08/01/2022 23:30:00 HOWBO-3 T_HOWBO-3 48 32.554 T_HOWBO-3 440.0 Offshore Wind 0.073986 ZE
52536 08/01/2022 23:30:00 BETHW-1 E_BETHW-1 48 5.010 E_BETHW-1 30.0 WIND 0.167000 ZE
52537 08/01/2022 23:30:00 HMGTO-1 T_HMGTO-1 48 92.094 HMGTO-1 108.0 WIND 0.852722 ZE
df2:
rank = (
    b1610.pivot_table(
        index=['settlementDate', 'BM Unit ID 1', 'Technology'],
        columns='settlementPeriod',
        values='% of capacity running',
        aggfunc=sum,
        fill_value=0)
)
rank['rank of capacity'] = rank.sum(axis=1)
rank
settlementPeriod 1 2 3 4 5 6 7 8 9 10 ... 40 41 42 43 44 45 46 47 48 rank of capacity
settlementDate BM Unit ID 1 Technology
03/01/2022 ABRBO-1 WIND 0.936970 0.969293 0.970909 0.925051 0.885657 0.939394 0.963434 0.938586 0.863232 0.781212 ... 0.461818 0.394545 0.428889 0.537172 0.520606 0.545253 0.873333 0.697778 0.651111 29.566263
ABRTW-1 WIND 0.346389 0.343333 0.345389 0.341667 0.342222 0.346778 0.347611 0.347722 0.346833 0.340556 ... 0.018778 0.015889 0.032056 0.043056 0.032167 0.109611 0.132111 0.163278 0.223556 10.441333
ACHRW-1 WIND 0.602884 0.575628 0.602140 0.651070 0.667721 0.654791 0.539209 0.628698 0.784233 0.782140 ... 0.174419 0.148465 0.139860 0.091535 0.094698 0.272419 0.205023 0.184651 0.177628 18.517814
AKGLW-2 WIND 0.000603 0.000603 0.000603 0.000635 0.000603 0.000635 0.000635 0.000635 0.000635 0.000603 ... 0.191079 0.195079 0.250476 0.281048 0.290000 0.279524 0.358508 0.452698 0.572730 8.616032
ANSUW-1 WIND 0.889368 0.865053 0.915684 0.894000 0.888526 0.858211 0.875158 0.878421 0.809368 0.898737 ... 0.142632 0.212526 0.276421 0.225053 0.235789 0.228000 0.152211 0.226000 0.299158 19.662421
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
08/01/2022 WBURB-2 CCGT 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.636329 0.642447 0.961835 0.908706 0.650212 0.507012 0.513176 0.503576 0.518212 24.439765
HOWBO-3 Offshore Wind 0.030418 0.026355 0.026595 0.014373 0.012523 0.008418 0.010977 0.016918 0.019127 0.025641 ... 0.055509 0.063845 0.073850 0.073923 0.073895 0.073791 0.073886 0.074050 0.073986 2.332809
MRWD-1 CCGT 0.808043 0.894348 0.853043 0.650870 0.159783 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.701739 0.488913 0.488913 0.489348 0.489130 0.392826 0.079130 0.000000 0.000000 23.485217
WBURB-3 CCGT 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.771402 0.699986 0.648386 0.919242 0.759520 0.424513 0.430598 0.420089 0.436376 25.436282
DRAXX-4 BIOMASS 0.706074 0.791786 0.806713 0.806462 0.806270 0.806136 0.806509 0.806369 0.799749 0.825070 ... 0.777395 0.816093 0.707122 0.666639 0.680406 0.679216 0.501433 0.000000 0.000000 36.576512
df3 - this was made by sorting the above dataframe to list daily sums for each BM Unit ID, filtered for specific technology types.
BM Unit ID 1 Technology 03/01/2022 04/01/2022 05/01/2022 06/01/2022 07/01/2022 08/01/2022 ave rank rank
0 FAWN-1 CCGT 1.0 5.0 1.0 5.0 2.0 1.0 2.500000 1.0
1 GRAI-6 CCGT 4.0 18.0 2.0 4.0 3.0 3.0 5.666667 2.0
2 EECL-1 CCGT 5.0 29.0 4.0 1.0 1.0 2.0 7.000000 3.0
3 PEMB-21 CCGT 7.0 1.0 6.0 13.0 8.0 8.0 7.166667 4.0
4 PEMB-51 CCGT 3.0 3.0 3.0 11.0 16.0 NaN 7.200000 5.0
5 PEMB-41 CCGT 9.0 4.0 7.0 7.0 10.0 13.0 8.333333 6.0
6 WBURB-1 CCGT 6.0 9.0 22.0 2.0 7.0 5.0 8.500000 7.0
7 PEMB-31 CCGT 14.0 6.0 13.0 6.0 4.0 9.0 8.666667 8.0
8 GRMO-1 CCGT 2.0 7.0 10.0 24.0 11.0 6.0 10.000000 9.0
9 PEMB-11 CCGT 21.0 2.0 9.0 10.0 9.0 14.0 10.833333 10.0
10 STAY-1 CCGT 19.0 12.0 5.0 23.0 6.0 7.0 12.000000 11.0
11 GRAI-7 CCGT 10.0 27.0 15.0 9.0 15.0 11.0 14.500000 12.0
12 DIDCB6 CCGT 28.0 11.0 11.0 8.0 19.0 15.0 15.333333 13.0
13 STAY-4 CCGT 12.0 8.0 20.0 18.0 14.0 23.0 15.833333 14.0
14 SCCL-3 CCGT 17.0 16.0 31.0 3.0 18.0 10.0 15.833333 14.0
15 CDCL-1 CCGT 13.0 22.0 8.0 25.0 12.0 16.0 16.000000 15.0
16 STAY-3 CCGT 8.0 17.0 17.0 20.0 13.0 22.0 16.166667 16.0
17 MRWD-1 CCGT NaN NaN 19.0 26.0 5.0 19.0 17.250000 17.0
18 WBURB-3 CCGT NaN NaN 24.0 14.0 17.0 17.0 18.000000 18.0
19 WBURB-2 CCGT NaN 14.0 21.0 12.0 31.0 18.0 19.200000 19.0
20 GYAR-1 CCGT NaN 26.0 14.0 17.0 20.0 21.0 19.600000 20.0
21 STAY-2 CCGT 18.0 20.0 18.0 21.0 24.0 20.0 20.166667 21.0
22 SHOS-1 CCGT 16.0 15.0 28.0 15.0 29.0 27.0 21.666667 22.0
23 KLYN-A-1 CCGT NaN 24.0 12.0 19.0 27.0 29.0 22.200000 23.0
24 DIDCB5 CCGT NaN 10.0 35.0 22.0 NaN NaN 22.333333 24.0
25 CARR-1 CCGT NaN 33.0 26.0 27.0 22.0 4.0 22.400000 25.0
26 LAGA-1 CCGT 15.0 13.0 29.0 32.0 23.0 24.0 22.666667 26.0
27 CARR-2 CCGT 24.0 25.0 27.0 29.0 21.0 12.0 23.000000 27.0
28 GRAI-8 CCGT 11.0 28.0 36.0 16.0 26.0 25.0 23.666667 28.0
29 SCCL-2 CCGT 29.0 NaN 16.0 28.0 25.0 NaN 24.500000 29.0
30 LBAR-1 CCGT NaN 19.0 25.0 31.0 28.0 NaN 25.750000 30.0
31 CNQPS-2 CCGT 20.0 NaN 32.0 NaN 32.0 26.0 27.500000 31.0
32 SPLN-1 CCGT NaN NaN 23.0 30.0 30.0 NaN 27.666667 32.0
33 CNQPS-1 CCGT 25.0 NaN 33.0 NaN NaN NaN 29.000000 33.0
34 DAMC-1 CCGT 23.0 21.0 38.0 34.0 NaN NaN 29.000000 33.0
35 KEAD-2 CCGT 30.0 NaN NaN NaN NaN NaN 30.000000 34.0
36 HUMR-1 CCGT 22.0 30.0 37.0 37.0 33.0 28.0 31.166667 35.0
37 SHBA-1 CCGT 26.0 23.0 40.0 35.0 37.0 NaN 32.200000 36.0
38 SEAB-1 CCGT NaN 32.0 34.0 36.0 NaN 30.0 33.000000 37.0
39 CNQPS-4 CCGT 27.0 NaN 41.0 33.0 35.0 31.0 33.400000 38.0
40 PETEM1 CCGT NaN 35.0 NaN NaN NaN NaN 35.000000 39.0
41 SEAB-2 CCGT NaN 31.0 39.0 39.0 34.0 NaN 35.750000 40.0
42 COSO-1 CCGT NaN NaN 30.0 42.0 36.0 NaN 36.000000 41.0
43 ROCK-1 CCGT 31.0 34.0 42.0 38.0 38.0 NaN 36.600000 42.0
44 WBURB-43 COAL 32.0 37.0 45.0 40.0 39.0 32.0 37.500000 43.0
45 WBURB-41 COAL 33.0 38.0 46.0 41.0 40.0 33.0 38.500000 44.0
46 FELL-1 CCGT 34.0 39.0 47.0 43.0 41.0 34.0 39.666667 45.0
47 FDUNT-1 OCGT NaN 36.0 44.0 NaN NaN NaN 40.000000 46.0
48 KEAD-1 CCGT NaN NaN 43.0 NaN NaN NaN 43.000000 47.0
My issue is that I am trying to create a new dataframe, using the existing dataframes listed above, in which all my BM Unit ID 1's are listed in order of rank from df2, with the values populated by the mean over all dates (not split by date) from df1. An example of what I am after is below, which I made in Excel using INDEX/MATCH. Here I have the results for each settlement period from df1 and df2, but instead of being split by date they are an aggregated mean over all dates, while still ranked according to the last column of df2, which is key.
Desired Output:
BM Unit ID Technology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
Rank Capacity
1 150 FAWN-1 CCGT 130.43 130.93 130.78 130.58 130.57 130.54 130.71 130.87 130.89 130.98 130.83 130.80 130.88 131.02 130.81 130.65 130.86 130.84 131.19 130.60 130.69 130.70 130.40 130.03 130.13 130.03 129.75
2 455 GRAI-6 CCGT 339.45 342.33 322.53 312.40 303.78 307.60 316.35 277.18 293.48 325.75 326.75 271.34 299.74 328.06 317.12 342.66 364.50 390.90 403.32 411.52 400.18 405.94 394.04 400.08 389.08 382.74 374.76
3 408 EECL-1 CCGT 363.31 386.71 364.46 363.31 363.31 363.38 361.87 305.06 286.99 282.74 323.93 242.88 242.64 207.73 294.71 357.15 383.47 426.93 433.01 432.98 435.14 436.38 416.04 417.69 430.42 415.09 406.45
4 430 PEMB-21 CCGT 334.40 419.50 436.70 441.90 440.50 415.80 327.90 323.70 322.70 331.10 367.50 368.40 396.70 259.05 415.95 356.32 386.84 400.00 429.52 435.40 434.84 435.88 435.60 438.48 438.16 437.84 437.76
5 465 PEMB-51 CCGT 370.65 370.45 359.90 326.25 326.20 322.65 324.60 274.25 319.55 288.80 301.75 279.08 379.60 376.76 389.92 419.24 403.64 420.92 428.20 421.32 396.92 397.80 424.40 433.92 434.56 431.44 434.40
6 445 PEMB-41 CCGT 337.00 423.40 423.10 427.50 427.00 419.00 361.00 318.80 263.20 226.70 268.70 231.35 366.90 378.35 392.20 421.55 354.96 382.48 422.64 428.28 428.76 431.24 431.92 431.84 429.52 429.00 431.48
7 425 WBURB-1 CCGT 240.41 293.17 252.27 256.51 261.65 253.44 247.14 217.08 223.11 199.27 254.69 314.16 361.07 317.50 259.54 266.83 349.64 383.43 408.18 412.29 395.54 383.48 355.98 340.49 360.87 352.74 376.92
8 465 PEMB-31 CCGT 297.73 360.27 355.40 357.07 358.67 353.07 300.93 284.73 268.73 255.20 248.53 257.75 366.75 376.45 396.40 320.56 342.68 352.52 361.16 379.40 386.64 390.36 409.12 427.48 426.60 426.80 427.16
9 144 GRMO-1 CCGT 106.62 106.11 105.96 106.00 106.00 105.98 105.99 105.90 105.47 105.31 105.28 105.07 105.04 105.06 105.06 105.04 105.06 105.06 105.07 105.04 105.05 105.06 105.04 105.04 105.04 105.06 105.07
10 430 PEMB-11 CCGT 432.80 430.40 430.70 431.90 432.10 429.30 430.00 408.30 320.90 346.50 432.90 432.20 312.93 297.20 414.55 432.00 420.40 429.80 402.60 426.90 430.65 435.85 435.10 431.15 435.20 431.50 431.75
11 457 STAY-1 CCGT 216.07 223.27 232.67 243.47 234.67 221.73 227.00 219.00 237.00 218.33 250.73 228.27 219.67 142.68 243.00 300.64 312.28 331.00 360.84 379.28 398.92 410.04 410.56 409.24 411.96 408.84 411.88
12 455 GRAI-7 CCGT 425.20 425.40 377.90 339.40 342.00 329.80 408.00 402.40 329.00 257.30 130.43 211.37 262.60 318.45 299.98 324.72 350.40 386.26 394.20 402.10 390.48 401.22 388.94 394.10 395.14 379.70 377.26
13 710 DIDCB6 CCGT 465.80 459.50 411.60 411.70 413.70 410.80 351.50 333.40 333.70 390.40 234.60 265.56 348.16 430.28 524.32 554.04 536.28 589.28 594.04 597.72 592.76 557.86 687.70 687.25 687.35 687.25 679.80
14 400 SCCL-3 CCGT 311.50 337.40 378.80 311.50 381.30 338.60 302.70 300.70 300.60 300.70 338.20 321.50 363.80 260.35 228.18 308.70 334.73 324.60 354.63 362.38 347.30 306.22 346.86 365.04 365.40 370.68 370.52
400 SCCL-3 CCGT 311.50 337.40 378.80 311.50 381.30 338.60 302.70 300.70 300.60 300.70 338.20 321.50 363.80 260.35 228.18 308.70 334.73 324.60 354.63 362.38 347.30 306.22 346.86 365.04 365.40 370.68 370.52
16 440 CDCL-1 CCGT 270.63 255.24 210.87 197.10 195.12 198.72 197.64 198.99 233.19 221.31 176.94 317.52 280.68 213.12 297.68 342.25 397.26 372.28 371.74 379.87 347.51 348.48 352.15 384.88 395.14 381.02 360.40
17 457 STAY-3 CCGT 311.25 311.30 311.60 311.45 311.15 311.30 308.40 313.10 223.90 196.05 242.95 172.87 217.40 236.84 252.92 352.98 384.06 414.76 403.68 424.90 418.38 403.00 420.26 424.40 427.06 421.64 424.66
18 920 MRWD-1 CCGT 468.70 483.90 420.60 267.80 472.60 470.20 241.40 299.30 327.70 327.80 336.90 241.60 308.33 529.93 793.73 828.40 870.67 846.67 827.07 855.93 829.33 865.87 870.40 846.87 765.47 785.20 824.00
19 425 WBURB-3 CCGT 311.73 427.68 333.68 333.93 370.68 335.09 420.85 433.86 370.45 321.70 340.54 300.95 155.47 190.67 290.81 310.43 332.52 376.63 391.11 413.74 408.33 398.69 397.54 368.05 410.64 413.05 428.91
20 425 WBURB-2 CCGT 295.54 424.56 336.68 334.08 371.20 358.44 358.90 358.96 377.94 325.42 203.19 165.32 205.75 121.41 162.51 180.15 301.12 413.77 410.33 397.21 385.59 378.09 381.50 380.93 413.71 418.53 427.09
21 420 GYAR-1 CCGT 404.33 404.33 403.73 405.12 404.13 404.33 404.33 376.98 218.02 218.02 351.01 215.10 177.46 222.43 345.47 398.94 401.97 401.97 402.17 401.87 401.47 401.77 401.62 402.51 402.31 402.41 402.26
22 457 STAY-2 CCGT 434.20 435.40 435.40 435.20 434.20 434.20 434.20 434.60 249.80 196.20 291.20 234.80 196.80 88.73 167.10 239.52 324.52 372.80 412.40 423.32 424.04 423.96 423.92 424.08 423.88 420.96 422.44
23 400 KLYN-A-1 CCGT 382.58 382.50 384.94 385.81 385.83 385.79 385.02 384.94 259.16 141.03 195.65 205.75 278.81 256.95 296.85 337.82 369.26 376.38 376.84 376.56 376.30 376.09 375.62 375.45 375.11 375.17 375.09
24 420 SHOS-1 CCGT 290.63 326.33 229.60 265.70 269.05 259.40 299.45 310.20 301.65 266.00 307.90 319.30 253.06 246.85 263.04 220.46 277.68 297.84 290.62 297.86 302.83 295.13 293.73 289.04 306.14 314.24 321.76
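No answer is shown for this one, but one way to get close (a sketch only, assuming the b1610 frame from above, that the desired values are the mean quantity per settlement period, and that the ranked summary df3 is held in a hypothetical variable named rank_table) is to aggregate first and merge the rank afterwards:
import pandas as pd

# Mean quantity per unit and settlement period, aggregated over all dates.
mean_by_period = b1610.pivot_table(
    index=['BM Unit ID 1', 'Technology'],
    columns='settlementPeriod',
    values='quantity',
    aggfunc='mean',
)

# Attach each unit's overall rank from the summary and order the rows by it.
result = (
    mean_by_period.reset_index()
    .merge(rank_table[['BM Unit ID 1', 'rank']], on='BM Unit ID 1')
    .sort_values('rank')
    .set_index(['rank', 'BM Unit ID 1', 'Technology'])
)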

How do I plot my cohorts by week and month without changing the underlying pivot table data?

I'm doing a cohort analysis. The cohorts are tied to the day that the user was created in the database.
Is there a way to plot this such that I can group cohorts together by week and month? I'd prefer not to change the pivot table data if possible.
periods_weeks 0.0 1.0 2.0 3.0 4.0 5.0 6.0
cohort
2021-03-01 2192.0 NaN 10.0 NaN NaN NaN NaN
2021-06-02 270.0 141.0 46.0 71.0 96.0 63.0 57.0
2021-06-03 338.0 97.0 62.0 30.0 10.0 21.0 15.0
2021-06-04 17801.0 7611.0 6342.0 6062.0 5064.0 4731.0 4518.0
2021-06-05 4105.0 1123.0 982.0 859.0 597.0 486.0 448.0
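There is no answer shown here either, but one option (a sketch, assuming the pivot above is a DataFrame named cohort_pivot, a hypothetical name) is to group the rows only at plotting time, so the underlying pivot table is never modified:
import pandas as pd

# Sketch: cohort_pivot is assumed to be the cohort pivot shown above.
idx = pd.to_datetime(cohort_pivot.index)

# Aggregate cohorts per calendar month (use to_period('W') for weeks);
# groupby returns a new frame, so cohort_pivot itself stays unchanged.
monthly = cohort_pivot.groupby(idx.to_period('M')).sum()
monthly.plot(kind='bar')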

How do I manipulate a Dataframe with Pivot_Table in Python

I have spent much time on this but I am no closer to a solution.
I have a dataframe which outputs as
RegionID AreaID Year Jan Feb Mar Apr May Jun
0 20.0 1.0 2020.0 1174.0 1056.0 1051.0 1107.0 1097.0 1118.0
1 19.0 2.0 2020.0 460.0 451.0 421.0 421.0 420.0 457.0
2 20.0 3.0 2020.0 2723.0 2594.0 2590.0 2399.0 2377.0 2331.0
3 21.0 4.0 2020.0 863.0 859.0 813.0 785.0 757.0 765.0
4 19.0 5.0 2020.0 4037.0 3942.0 4069.0 3844.0 3567.0 3721.0
5 19.0 6.0 2020.0 1695.0 1577.0 1531.0 1614.0 1671.0 1693.0
6 18.0 7.0 2020.0 1757.0 1505.0 1445.0 1514.0 1406.0 1444.0
7 18.0 8.0 2020.0 832.0 721.0 747.0 852.0 885.0 872.0
8 18.0 9.0 2020.0 2538.0 2000.0 2026.0 1981.0 1987.0 1949.0
9 21.0 10.0 2020.0 1145.0 1235.0 1114.0 1161.0 1150.0 1189.0
10 20.0 11.0 2020.0 551.0 497.0 503.0 472.0 505.0 532.0
11 19.0 12.0 2020.0 1664.0 1526.0 1389.0 1373.0 1384.0 1404.0
12 21.0 13.0 2020.0 381.0 351.0 299.0 286.0 297.0 319.0
13 21.0 14.0 2020.0 1733.0 1627.0 1567.0 1561.0 1498.0 1511.0
14 18.0 15.0 2020.0 1257.0 1257.0 1160.0 1172.0 1124.0 1113.0
I want to pivot this data so that I have a month combined field like below
RegionID AreaID Year Month Amount
20.0 1.0 2020 Jan 1174
20.0 1.0 2020 Feb 1056
20.0 1.0 2020 Mar 1051
Can this be done using pandas? I have been trying with pivot_table but I can't get it to work.
I hope I've understood your question well. You can .set_index() and then .stack():
print(
    df.set_index(["RegionID", "AreaID", "Year"])
    .stack()
    .reset_index()
    .rename(columns={"level_3": "Month", 0: "Amount"})
)
Prints:
RegionID AreaID Year Month Amount
0 20.0 1.0 2020.0 Jan 1174.0
1 20.0 1.0 2020.0 Feb 1056.0
2 20.0 1.0 2020.0 Mar 1051.0
3 20.0 1.0 2020.0 Apr 1107.0
4 20.0 1.0 2020.0 May 1097.0
5 20.0 1.0 2020.0 Jun 1118.0
6 19.0 2.0 2020.0 Jan 460.0
7 19.0 2.0 2020.0 Feb 451.0
8 19.0 2.0 2020.0 Mar 421.0
9 19.0 2.0 2020.0 Apr 421.0
10 19.0 2.0 2020.0 May 420.0
11 19.0 2.0 2020.0 Jun 457.0
...
Or:
print(
    df.melt(
        ["RegionID", "AreaID", "Year"], var_name="Month", value_name="Amount"
    )
)
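Note that melt emits the result column-by-column (all Jan rows first, then Feb, and so on), so its row order differs from the stacked version; apart from that the two are equivalent.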

Calculating age from dataframe (dob -y/m/d)

I'm trying to add a column "Age" to my data
number of purchased hours(mins) dob Y dob M dob D
0 7200 2010.0 10.0 12.0
1 7320 2010.0 6.0 2.0
2 5400 2011.0 6.0 18.0
3 9180 2009.0 10.0 18.0
4 3102 2007.0 7.0 30.0
5 5400 2011.0 4.0 6.0
6 9000 2009.0 8.0 5.0
7 6000 2004.0 2.0 7.0
8 6000 2007.0 8.0 17.0
9 6000 2013.0 5.0 5.0
10 12000 2012.0 9.0 27.0
11 12000 2004.0 11.0 25.0
12 6000 2009.0 11.0 20.0
I've tried this code, but I'm not sure what went wrong:
from datetime import datetime as dt
df['Age'] = datetime.datetime.now()-pd.to_datetime(df[['dob D','dob M','dob Y']])
Below is the error that popped up
ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing
If you want to use to_datetime with the 3 columns, it works only with renamed column names:
d = {'dob Y': 'year', 'dob M': 'month', 'dob D': 'day'}
df['Age'] = (pd.Timestamp.now().floor('d') -
             pd.to_datetime(df[['dob D', 'dob M', 'dob Y']].rename(columns=d)))
print(df)
number of purchased hours(mins) dob Y dob M dob D Age
0 7200 2010.0 10.0 12.0 3380 days
1 7320 2010.0 6.0 2.0 3512 days
2 5400 2011.0 6.0 18.0 3131 days
3 9180 2009.0 10.0 18.0 3739 days
4 3102 2007.0 7.0 30.0 4550 days
5 5400 2011.0 4.0 6.0 3204 days
6 9000 2009.0 8.0 5.0 3813 days
7 6000 2004.0 2.0 7.0 5819 days
8 6000 2007.0 8.0 17.0 4532 days
9 6000 2013.0 5.0 5.0 2444 days
10 12000 2012.0 9.0 27.0 2664 days
11 12000 2004.0 11.0 25.0 5527 days
12 6000 2009.0 11.0 20.0 3706 days
If you want to convert the timedeltas to days:
d = {'dob Y': 'year', 'dob M': 'month', 'dob D': 'day'}
df['Age'] = ((pd.Timestamp.now().floor('d') -
              pd.to_datetime(df[['dob D', 'dob M', 'dob Y']].rename(columns=d)))
             .dt.days)
print(df)
number of purchased hours(mins) dob Y dob M dob D Age
0 7200 2010.0 10.0 12.0 3380
1 7320 2010.0 6.0 2.0 3512
2 5400 2011.0 6.0 18.0 3131
3 9180 2009.0 10.0 18.0 3739
4 3102 2007.0 7.0 30.0 4550
5 5400 2011.0 4.0 6.0 3204
6 9000 2009.0 8.0 5.0 3813
7 6000 2004.0 2.0 7.0 5819
8 6000 2007.0 8.0 17.0 4532
9 6000 2013.0 5.0 5.0 2444
10 12000 2012.0 9.0 27.0 2664
11 12000 2004.0 11.0 25.0 5527
12 6000 2009.0 11.0 20.0 3706
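If an age in whole years is preferred over a day count, a rough follow-up (an approximation, with Age_years as an arbitrary column name, ignoring exact calendar arithmetic) is:
# Approximate whole years from the day count.
df['Age_years'] = (df['Age'] / 365.25).astype(int)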

Transposing dataframe column, creating different rows per day

I have a dataframe that has one column and a timestamp index including anywhere from 2 to 7 days:
kWh
Timestamp
2017-07-08 06:00:00 0.00
2017-07-08 07:00:00 752.75
2017-07-08 08:00:00 1390.20
2017-07-08 09:00:00 2027.65
2017-07-08 10:00:00 2447.27
.... ....
2017-07-12 20:00:00 167.64
2017-07-12 21:00:00 0.00
2017-07-12 22:00:00 0.00
2017-07-12 23:00:00 0.00
I would like to transpose the kWh column so that one day's worth of values (hourly granularity, so 24 values/day) fill up a row. And the next row is the next day of values and so on (so five days of forecasted data has five rows with 24 elements each).
Because my query of the data comes in the vertical format, and my regression and subsequent analysis already occurs in the vertical format, I don't want to change the process too much and am hoping there is a simpler way. I have tried giving a multi-index with df.index.hour and then using unstack(), but I get a huge dataframe with NaN values everywhere.
Is there an elegant way to do this?
If we start from a frame like
In [25]: df = pd.DataFrame({"kWh": 1}, index=pd.date_range("2017-07-08",
             "2017-07-12", freq="1H").rename("Timestamp")).cumsum()
In [26]: df.head()
Out[26]:
kWh
Timestamp
2017-07-08 00:00:00 1
2017-07-08 01:00:00 2
2017-07-08 02:00:00 3
2017-07-08 03:00:00 4
2017-07-08 04:00:00 5
we can make date and hour columns and then pivot:
In [27]: df["date"] = df.index.date
In [28]: df["hour"] = df.index.hour
In [29]: df.pivot(index="date", columns="hour", values="kWh")
Out[29]:
hour 0 1 2 3 4 5 6 7 8 9 ... \
date ...
2017-07-08 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ...
2017-07-09 25.0 26.0 27.0 28.0 29.0 30.0 31.0 32.0 33.0 34.0 ...
2017-07-10 49.0 50.0 51.0 52.0 53.0 54.0 55.0 56.0 57.0 58.0 ...
2017-07-11 73.0 74.0 75.0 76.0 77.0 78.0 79.0 80.0 81.0 82.0 ...
2017-07-12 97.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
hour 14 15 16 17 18 19 20 21 22 23
date
2017-07-08 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0
2017-07-09 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 47.0 48.0
2017-07-10 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 71.0 72.0
2017-07-11 87.0 88.0 89.0 90.0 91.0 92.0 93.0 94.0 95.0 96.0
2017-07-12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 24 columns]
Not sure why your MultiIndex code doesn't work.
I'm assuming your MultiIndex code is something along these lines, which gives the same output as the pivot:
In []:
df = pd.DataFrame({"kWh": 1}, index=pd.date_range("2017-07-08",
        "2017-07-12", freq="1H").rename("Timestamp")).cumsum()
df.index = pd.MultiIndex.from_arrays([df.index.date, df.index.hour], names=['Date', 'Hour'])
df.unstack()
Out[]:
kWh ... \
Hour 0 1 2 3 4 5 6 7 8 9 ...
Date ...
2017-07-08 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ...
2017-07-09 25.0 26.0 27.0 28.0 29.0 30.0 31.0 32.0 33.0 34.0 ...
2017-07-10 49.0 50.0 51.0 52.0 53.0 54.0 55.0 56.0 57.0 58.0 ...
2017-07-11 73.0 74.0 75.0 76.0 77.0 78.0 79.0 80.0 81.0 82.0 ...
2017-07-12 97.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
Hour 14 15 16 17 18 19 20 21 22 23
Date
2017-07-08 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0
2017-07-09 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 47.0 48.0
2017-07-10 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 71.0 72.0
2017-07-11 87.0 88.0 89.0 90.0 91.0 92.0 93.0 94.0 95.0 96.0
2017-07-12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 24 columns]
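Note that unstack keeps "kWh" as an outer column level (visible in the output above); selecting that level makes the columns match the pivot exactly:
df.unstack()['kWh']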
