Python Pandas: Denormalize data from one data frame into another

I have a Pandas data frame which you might describe as “normalized”. For display purposes, I want to “de-normalize” the data. That is, I want to take data that is spread across multiple rows with different key values and put it on the same row in the output records. Some records need to be summed as they are combined. (Aside: if anyone has a better term for this than “denormalization”, please make an edit to this question, or say so in the comments.)
I am working with a pandas data frame with many columns, so I will show you a simplified version below.
The following code sets up a (nearly) normalized source data frame. (Note that I am looking for advice on the second code block, and this code block is just to provide some context.) Similar to my actual data, there are some duplications in the identifying data, and some numbers to be summed:
import pandas as pd
import numpy as np

dates = pd.date_range('20170701', periods=21)
datesA1 = pd.date_range('20170701', periods=11)
datesB1 = pd.date_range('20170705', periods=9)
datesA2 = pd.date_range('20170708', periods=10)
datesB2 = pd.date_range('20170710', periods=11)
datesC1 = pd.date_range('20170701', periods=5)
datesC2 = pd.date_range('20170709', periods=9)
cols = ['Date', 'Type', 'Count']
df_A1 = pd.DataFrame({'Date': datesA1,
                      'Type': 'Apples',
                      'Count': np.random.randint(30, size=11)})
df_A2 = pd.DataFrame({'Date': datesA2,
                      'Type': 'Apples',
                      'Count': np.random.randint(30, size=10)})
df_B1 = pd.DataFrame({'Date': datesB1,
                      'Type': 'Berries',
                      'Count': np.random.randint(30, size=9)})
df_B2 = pd.DataFrame({'Date': datesB2,
                      'Type': 'Berries',
                      'Count': np.random.randint(30, size=11)})
df_C1 = pd.DataFrame({'Date': datesC1,
                      'Type': 'Canteloupes',
                      'Count': np.random.randint(30, size=5)})
df_C2 = pd.DataFrame({'Date': datesC2,
                      'Type': 'Canteloupes',
                      'Count': np.random.randint(30, size=9)})
frames = [df_A1, df_A2, df_B1, df_B2, df_C1, df_C2]
dat_fra_source = pd.concat(frames)
Further, the following code achieves my intention. The source data frame has multiple rows per date and type of fruit (A, B, and C). The destination data has a single row per day, with a sum of A, B, and C.
dat_fra_dest = pd.DataFrame(0, index=dates, columns=['Apples', 'Berries', 'Canteloupes'])
for index, row in dat_fra_source.iterrows():
    dat_fra_dest.at[row['Date'], row['Type']] += row['Count']
My question is whether there is a cleaner way to do this: a way that doesn’t require the zero-initialization and/or a way that operates on the entire data frame instead of row by row. I am also skeptical that my implementation is efficient. I’ll also note that while I am only dealing with “Count” in this simplified example, I have additional columns in my real-world data: think that for A, B, and C there is not only a count, but also a weight and a volume.

Option 1
dat_fra_source.groupby(['Date','Type']).sum().unstack().fillna(0)
Out[63]:
Count
Type Apples Berries Canteloupes
Date
2017-07-01 13.0 0.0 24.0
2017-07-02 18.0 0.0 16.0
2017-07-03 11.0 0.0 29.0
2017-07-04 13.0 0.0 7.0
2017-07-05 24.0 11.0 23.0
2017-07-06 6.0 4.0 0.0
2017-07-07 29.0 26.0 0.0
2017-07-08 31.0 19.0 0.0
2017-07-09 38.0 17.0 26.0
2017-07-10 57.0 54.0 1.0
2017-07-11 4.0 41.0 10.0
2017-07-12 16.0 28.0 23.0
2017-07-13 25.0 20.0 20.0
2017-07-14 19.0 6.0 15.0
2017-07-15 6.0 22.0 7.0
2017-07-16 16.0 0.0 5.0
2017-07-17 29.0 7.0 4.0
2017-07-18 0.0 21.0 0.0
2017-07-19 0.0 19.0 0.0
2017-07-20 0.0 8.0 0.0
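A minor variation on Option 1 (a sketch against the same sample data) selects the Count column before unstacking, which avoids the extra Count level in the columns and keeps the integer dtype by filling the missing combinations directly in unstack:
# Sum Count per (Date, Type), then pivot Type into columns;
# fill_value=0 covers the Date/Type combinations that never occur.
dat_fra_dest = (dat_fra_source
                .groupby(['Date', 'Type'])['Count']
                .sum()
                .unstack(fill_value=0))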
Option 2
pd.pivot_table(dat_fra_source,index=['Date'],columns=['Type'],values='Count',aggfunc=sum).fillna(0)
Out[75]:
Type Apples Berries Canteloupes
Date
2017-07-01 13.0 0.0 24.0
2017-07-02 18.0 0.0 16.0
2017-07-03 11.0 0.0 29.0
2017-07-04 13.0 0.0 7.0
2017-07-05 24.0 11.0 23.0
2017-07-06 6.0 4.0 0.0
2017-07-07 29.0 26.0 0.0
2017-07-08 31.0 19.0 0.0
2017-07-09 38.0 17.0 26.0
2017-07-10 57.0 54.0 1.0
2017-07-11 4.0 41.0 10.0
2017-07-12 16.0 28.0 23.0
2017-07-13 25.0 20.0 20.0
2017-07-14 19.0 6.0 15.0
2017-07-15 6.0 22.0 7.0
2017-07-16 16.0 0.0 5.0
2017-07-17 29.0 7.0 4.0
2017-07-18 0.0 21.0 0.0
2017-07-19 0.0 19.0 0.0
2017-07-20 0.0 8.0 0.0
And assuming you have columns vol and weight:
dat_fra_source['vol'] = 2
dat_fra_source['weight'] = 2
dat_fra_source.groupby(['Date', 'Type']).apply(lambda x: sum(x['vol'] * x['weight'] * x['Count'])).unstack().fillna(0)
Out[88]:
Type Apples Berries Canteloupes
Date
2017-07-01 52.0 0.0 96.0
2017-07-02 72.0 0.0 64.0
2017-07-03 44.0 0.0 116.0
2017-07-04 52.0 0.0 28.0
2017-07-05 96.0 44.0 92.0
2017-07-06 24.0 16.0 0.0
2017-07-07 116.0 104.0 0.0
2017-07-08 124.0 76.0 0.0
2017-07-09 152.0 68.0 104.0
2017-07-10 228.0 216.0 4.0
2017-07-11 16.0 164.0 40.0
2017-07-12 64.0 112.0 92.0
2017-07-13 100.0 80.0 80.0
2017-07-14 76.0 24.0 60.0
2017-07-15 24.0 88.0 28.0
2017-07-16 64.0 0.0 20.0
2017-07-17 116.0 28.0 16.0
2017-07-18 0.0 84.0 0.0
2017-07-19 0.0 76.0 0.0
2017-07-20 0.0 32.0 0.0
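Since the real data also has a weight and a volume per row, note that pivot_table accepts a list of value columns when each extra column should be summed on its own rather than multiplied together. A sketch, with hypothetical Weight and Volume columns standing in for the real ones:
# Hypothetical Weight/Volume columns, for illustration only.
dat_fra_source['Weight'] = 1.5
dat_fra_source['Volume'] = 0.3

# Each value column is summed per Date/Type and becomes its own top level
# in the resulting column MultiIndex.
wide = pd.pivot_table(dat_fra_source,
                      index='Date',
                      columns='Type',
                      values=['Count', 'Weight', 'Volume'],
                      aggfunc='sum',
                      fill_value=0)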

Use pd.crosstab:
pd.crosstab(dat_fra_source['Date'],
            dat_fra_source['Type'],
            dat_fra_source['Count'],
            aggfunc='sum',
            dropna=False).fillna(0)
Output:
Type Apples Berries Canteloupes
Date
2017-07-01 19.0 0.0 4.0
2017-07-02 25.0 0.0 4.0
2017-07-03 11.0 0.0 26.0
2017-07-04 27.0 0.0 8.0
2017-07-05 8.0 18.0 12.0
2017-07-06 10.0 11.0 0.0
2017-07-07 6.0 17.0 0.0
2017-07-08 10.0 5.0 0.0
2017-07-09 51.0 25.0 16.0
2017-07-10 31.0 23.0 21.0
2017-07-11 35.0 40.0 10.0
2017-07-12 16.0 30.0 9.0
2017-07-13 13.0 23.0 20.0
2017-07-14 21.0 26.0 27.0
2017-07-15 20.0 17.0 19.0
2017-07-16 12.0 4.0 2.0
2017-07-17 27.0 0.0 5.0
2017-07-18 0.0 5.0 0.0
2017-07-19 0.0 26.0 0.0
2017-07-20 0.0 6.0 0.0
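One small difference from the zero-initialized loop: the results above only contain dates that actually appear in the source, and the columns come back as float64 because the missing Date/Type combinations pass through NaN before fillna. If that matters, a reindex and cast on top of Option 2 (a small addition, not part of the answers themselves) restores the original shape and dtype:
# Same pivot as Option 2; reindexing on the full `dates` range restores the
# all-zero 2017-07-21 row from the zero-initialized frame, and the cast
# brings back the integer dtype produced by the original loop.
dat_fra_dest = (pd.pivot_table(dat_fra_source, index='Date', columns='Type',
                               values='Count', aggfunc='sum')
                .reindex(dates)
                .fillna(0)
                .astype(int))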

Related

What is the best way to create a new dataframe with existing ones of different shapes and criteria

I have a few dataframes that I have made through various sorting and processing of data from the main dataframe (df1).
df1 - large; it currently covers 6 days' worth of data at 30-minute intervals, but I wish to scale up to longer periods:
import pandas as pd
import numpy as np

bmu_units = pd.read_csv('bmu_units_technology.csv')
b1610 = pd.read_csv('b1610_df.csv')
b1610 = b1610.merge(bmu_units, on=['BM Unit ID 1'], how='left')
b1610['% of capacity running'] = b1610.quantity / b1610.Capacity

def func(tech):
    if tech in ["CCGT", "OCGT", "COAL"]:
        return "Fossil"
    else:
        return "ZE"

b1610["Type"] = b1610['Technology'].apply(func)
settlementDate time BM Unit ID 1 BM Unit ID 2_x settlementPeriod quantity BM Unit ID 2_y Capacity Technology % of capacity running Type
0 03/01/2022 00:00:00 RCBKO-1 T_RCBKO-1 1 278.658 T_RCBKO-1 279.0 WIND 0.998774 ZE
1 03/01/2022 00:00:00 LARYO-3 T_LARYW-3 1 162.940 T_LARYW-3 180.0 WIND 0.905222 ZE
2 03/01/2022 00:00:00 LAGA-1 T_LAGA-1 1 262.200 T_LAGA-1 905.0 CCGT 0.289724 Fossil
3 03/01/2022 00:00:00 CRMLW-1 T_CRMLW-1 1 3.002 T_CRMLW-1 47.0 WIND 0.063872 ZE
4 03/01/2022 00:00:00 GRIFW-1 T_GRIFW-1 1 9.972 T_GRIFW-1 102.0 WIND 0.097765 ZE
... ... ... ... ... ... ... ... ... ... ... ...
52533 08/01/2022 23:30:00 CRMLW-1 T_CRMLW-1 48 8.506 T_CRMLW-1 47.0 WIND 0.180979 ZE
52534 08/01/2022 23:30:00 LARYO-4 T_LARYW-4 48 159.740 T_LARYW-4 180.0 WIND 0.887444 ZE
52535 08/01/2022 23:30:00 HOWBO-3 T_HOWBO-3 48 32.554 T_HOWBO-3 440.0 Offshore Wind 0.073986 ZE
52536 08/01/2022 23:30:00 BETHW-1 E_BETHW-1 48 5.010 E_BETHW-1 30.0 WIND 0.167000 ZE
52537 08/01/2022 23:30:00 HMGTO-1 T_HMGTO-1 48 92.094 HMGTO-1 108.0 WIND 0.852722 ZE
df2:
rank = (
    b1610.pivot_table(
        index=['settlementDate', 'BM Unit ID 1', 'Technology'],
        columns='settlementPeriod',
        values='% of capacity running',
        aggfunc=sum,
        fill_value=0)
)
rank['rank of capacity'] = rank.sum(axis=1)
rank
settlementPeriod 1 2 3 4 5 6 7 8 9 10 ... 40 41 42 43 44 45 46 47 48 rank of capacity
settlementDate BM Unit ID 1 Technology
03/01/2022 ABRBO-1 WIND 0.936970 0.969293 0.970909 0.925051 0.885657 0.939394 0.963434 0.938586 0.863232 0.781212 ... 0.461818 0.394545 0.428889 0.537172 0.520606 0.545253 0.873333 0.697778 0.651111 29.566263
ABRTW-1 WIND 0.346389 0.343333 0.345389 0.341667 0.342222 0.346778 0.347611 0.347722 0.346833 0.340556 ... 0.018778 0.015889 0.032056 0.043056 0.032167 0.109611 0.132111 0.163278 0.223556 10.441333
ACHRW-1 WIND 0.602884 0.575628 0.602140 0.651070 0.667721 0.654791 0.539209 0.628698 0.784233 0.782140 ... 0.174419 0.148465 0.139860 0.091535 0.094698 0.272419 0.205023 0.184651 0.177628 18.517814
AKGLW-2 WIND 0.000603 0.000603 0.000603 0.000635 0.000603 0.000635 0.000635 0.000635 0.000635 0.000603 ... 0.191079 0.195079 0.250476 0.281048 0.290000 0.279524 0.358508 0.452698 0.572730 8.616032
ANSUW-1 WIND 0.889368 0.865053 0.915684 0.894000 0.888526 0.858211 0.875158 0.878421 0.809368 0.898737 ... 0.142632 0.212526 0.276421 0.225053 0.235789 0.228000 0.152211 0.226000 0.299158 19.662421
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
08/01/2022 WBURB-2 CCGT 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.636329 0.642447 0.961835 0.908706 0.650212 0.507012 0.513176 0.503576 0.518212 24.439765
HOWBO-3 Offshore Wind 0.030418 0.026355 0.026595 0.014373 0.012523 0.008418 0.010977 0.016918 0.019127 0.025641 ... 0.055509 0.063845 0.073850 0.073923 0.073895 0.073791 0.073886 0.074050 0.073986 2.332809
MRWD-1 CCGT 0.808043 0.894348 0.853043 0.650870 0.159783 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.701739 0.488913 0.488913 0.489348 0.489130 0.392826 0.079130 0.000000 0.000000 23.485217
WBURB-3 CCGT 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.771402 0.699986 0.648386 0.919242 0.759520 0.424513 0.430598 0.420089 0.436376 25.436282
DRAXX-4 BIOMASS 0.706074 0.791786 0.806713 0.806462 0.806270 0.806136 0.806509 0.806369 0.799749 0.825070 ... 0.777395 0.816093 0.707122 0.666639 0.680406 0.679216 0.501433 0.000000 0.000000 36.576512
df3 - this was made by sorting the above dataframe to list sums for each day for each BM Unit ID filtered for specific technology types.
BM Unit ID 1 Technology 03/01/2022 04/01/2022 05/01/2022 06/01/2022 07/01/2022 08/01/2022 ave rank rank
0 FAWN-1 CCGT 1.0 5.0 1.0 5.0 2.0 1.0 2.500000 1.0
1 GRAI-6 CCGT 4.0 18.0 2.0 4.0 3.0 3.0 5.666667 2.0
2 EECL-1 CCGT 5.0 29.0 4.0 1.0 1.0 2.0 7.000000 3.0
3 PEMB-21 CCGT 7.0 1.0 6.0 13.0 8.0 8.0 7.166667 4.0
4 PEMB-51 CCGT 3.0 3.0 3.0 11.0 16.0 NaN 7.200000 5.0
5 PEMB-41 CCGT 9.0 4.0 7.0 7.0 10.0 13.0 8.333333 6.0
6 WBURB-1 CCGT 6.0 9.0 22.0 2.0 7.0 5.0 8.500000 7.0
7 PEMB-31 CCGT 14.0 6.0 13.0 6.0 4.0 9.0 8.666667 8.0
8 GRMO-1 CCGT 2.0 7.0 10.0 24.0 11.0 6.0 10.000000 9.0
9 PEMB-11 CCGT 21.0 2.0 9.0 10.0 9.0 14.0 10.833333 10.0
10 STAY-1 CCGT 19.0 12.0 5.0 23.0 6.0 7.0 12.000000 11.0
11 GRAI-7 CCGT 10.0 27.0 15.0 9.0 15.0 11.0 14.500000 12.0
12 DIDCB6 CCGT 28.0 11.0 11.0 8.0 19.0 15.0 15.333333 13.0
13 STAY-4 CCGT 12.0 8.0 20.0 18.0 14.0 23.0 15.833333 14.0
14 SCCL-3 CCGT 17.0 16.0 31.0 3.0 18.0 10.0 15.833333 14.0
15 CDCL-1 CCGT 13.0 22.0 8.0 25.0 12.0 16.0 16.000000 15.0
16 STAY-3 CCGT 8.0 17.0 17.0 20.0 13.0 22.0 16.166667 16.0
17 MRWD-1 CCGT NaN NaN 19.0 26.0 5.0 19.0 17.250000 17.0
18 WBURB-3 CCGT NaN NaN 24.0 14.0 17.0 17.0 18.000000 18.0
19 WBURB-2 CCGT NaN 14.0 21.0 12.0 31.0 18.0 19.200000 19.0
20 GYAR-1 CCGT NaN 26.0 14.0 17.0 20.0 21.0 19.600000 20.0
21 STAY-2 CCGT 18.0 20.0 18.0 21.0 24.0 20.0 20.166667 21.0
22 SHOS-1 CCGT 16.0 15.0 28.0 15.0 29.0 27.0 21.666667 22.0
23 KLYN-A-1 CCGT NaN 24.0 12.0 19.0 27.0 29.0 22.200000 23.0
24 DIDCB5 CCGT NaN 10.0 35.0 22.0 NaN NaN 22.333333 24.0
25 CARR-1 CCGT NaN 33.0 26.0 27.0 22.0 4.0 22.400000 25.0
26 LAGA-1 CCGT 15.0 13.0 29.0 32.0 23.0 24.0 22.666667 26.0
27 CARR-2 CCGT 24.0 25.0 27.0 29.0 21.0 12.0 23.000000 27.0
28 GRAI-8 CCGT 11.0 28.0 36.0 16.0 26.0 25.0 23.666667 28.0
29 SCCL-2 CCGT 29.0 NaN 16.0 28.0 25.0 NaN 24.500000 29.0
30 LBAR-1 CCGT NaN 19.0 25.0 31.0 28.0 NaN 25.750000 30.0
31 CNQPS-2 CCGT 20.0 NaN 32.0 NaN 32.0 26.0 27.500000 31.0
32 SPLN-1 CCGT NaN NaN 23.0 30.0 30.0 NaN 27.666667 32.0
33 CNQPS-1 CCGT 25.0 NaN 33.0 NaN NaN NaN 29.000000 33.0
34 DAMC-1 CCGT 23.0 21.0 38.0 34.0 NaN NaN 29.000000 33.0
35 KEAD-2 CCGT 30.0 NaN NaN NaN NaN NaN 30.000000 34.0
36 HUMR-1 CCGT 22.0 30.0 37.0 37.0 33.0 28.0 31.166667 35.0
37 SHBA-1 CCGT 26.0 23.0 40.0 35.0 37.0 NaN 32.200000 36.0
38 SEAB-1 CCGT NaN 32.0 34.0 36.0 NaN 30.0 33.000000 37.0
39 CNQPS-4 CCGT 27.0 NaN 41.0 33.0 35.0 31.0 33.400000 38.0
40 PETEM1 CCGT NaN 35.0 NaN NaN NaN NaN 35.000000 39.0
41 SEAB-2 CCGT NaN 31.0 39.0 39.0 34.0 NaN 35.750000 40.0
42 COSO-1 CCGT NaN NaN 30.0 42.0 36.0 NaN 36.000000 41.0
43 ROCK-1 CCGT 31.0 34.0 42.0 38.0 38.0 NaN 36.600000 42.0
44 WBURB-43 COAL 32.0 37.0 45.0 40.0 39.0 32.0 37.500000 43.0
45 WBURB-41 COAL 33.0 38.0 46.0 41.0 40.0 33.0 38.500000 44.0
46 FELL-1 CCGT 34.0 39.0 47.0 43.0 41.0 34.0 39.666667 45.0
47 FDUNT-1 OCGT NaN 36.0 44.0 NaN NaN NaN 40.000000 46.0
48 KEAD-1 CCGT NaN NaN 43.0 NaN NaN NaN 43.000000 47.0
My issue is that I am trying to create a new dataframe from the existing dataframes listed above, in which all my BM Unit ID 1 values are listed in order of rank from df2, while the values are populated with means over all dates (not split by date) from df1. An example of what I am after is below, which I made in Excel using INDEX/MATCH. Here I have the results for each settlement period from df1 and df2, but instead of being split by date they are an aggregated mean over all dates in the frame, still ranked according to the last column of df2, which is the key requirement.
Desired Output:
BM Unit ID Technology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
Rank Capacity
1 150 FAWN-1 CCGT 130.43 130.93 130.78 130.58 130.57 130.54 130.71 130.87 130.89 130.98 130.83 130.80 130.88 131.02 130.81 130.65 130.86 130.84 131.19 130.60 130.69 130.70 130.40 130.03 130.13 130.03 129.75
2 455 GRAI-6 CCGT 339.45 342.33 322.53 312.40 303.78 307.60 316.35 277.18 293.48 325.75 326.75 271.34 299.74 328.06 317.12 342.66 364.50 390.90 403.32 411.52 400.18 405.94 394.04 400.08 389.08 382.74 374.76
3 408 EECL-1 CCGT 363.31 386.71 364.46 363.31 363.31 363.38 361.87 305.06 286.99 282.74 323.93 242.88 242.64 207.73 294.71 357.15 383.47 426.93 433.01 432.98 435.14 436.38 416.04 417.69 430.42 415.09 406.45
4 430 PEMB-21 CCGT 334.40 419.50 436.70 441.90 440.50 415.80 327.90 323.70 322.70 331.10 367.50 368.40 396.70 259.05 415.95 356.32 386.84 400.00 429.52 435.40 434.84 435.88 435.60 438.48 438.16 437.84 437.76
5 465 PEMB-51 CCGT 370.65 370.45 359.90 326.25 326.20 322.65 324.60 274.25 319.55 288.80 301.75 279.08 379.60 376.76 389.92 419.24 403.64 420.92 428.20 421.32 396.92 397.80 424.40 433.92 434.56 431.44 434.40
6 445 PEMB-41 CCGT 337.00 423.40 423.10 427.50 427.00 419.00 361.00 318.80 263.20 226.70 268.70 231.35 366.90 378.35 392.20 421.55 354.96 382.48 422.64 428.28 428.76 431.24 431.92 431.84 429.52 429.00 431.48
7 425 WBURB-1 CCGT 240.41 293.17 252.27 256.51 261.65 253.44 247.14 217.08 223.11 199.27 254.69 314.16 361.07 317.50 259.54 266.83 349.64 383.43 408.18 412.29 395.54 383.48 355.98 340.49 360.87 352.74 376.92
8 465 PEMB-31 CCGT 297.73 360.27 355.40 357.07 358.67 353.07 300.93 284.73 268.73 255.20 248.53 257.75 366.75 376.45 396.40 320.56 342.68 352.52 361.16 379.40 386.64 390.36 409.12 427.48 426.60 426.80 427.16
9 144 GRMO-1 CCGT 106.62 106.11 105.96 106.00 106.00 105.98 105.99 105.90 105.47 105.31 105.28 105.07 105.04 105.06 105.06 105.04 105.06 105.06 105.07 105.04 105.05 105.06 105.04 105.04 105.04 105.06 105.07
10 430 PEMB-11 CCGT 432.80 430.40 430.70 431.90 432.10 429.30 430.00 408.30 320.90 346.50 432.90 432.20 312.93 297.20 414.55 432.00 420.40 429.80 402.60 426.90 430.65 435.85 435.10 431.15 435.20 431.50 431.75
11 457 STAY-1 CCGT 216.07 223.27 232.67 243.47 234.67 221.73 227.00 219.00 237.00 218.33 250.73 228.27 219.67 142.68 243.00 300.64 312.28 331.00 360.84 379.28 398.92 410.04 410.56 409.24 411.96 408.84 411.88
12 455 GRAI-7 CCGT 425.20 425.40 377.90 339.40 342.00 329.80 408.00 402.40 329.00 257.30 130.43 211.37 262.60 318.45 299.98 324.72 350.40 386.26 394.20 402.10 390.48 401.22 388.94 394.10 395.14 379.70 377.26
13 710 DIDCB6 CCGT 465.80 459.50 411.60 411.70 413.70 410.80 351.50 333.40 333.70 390.40 234.60 265.56 348.16 430.28 524.32 554.04 536.28 589.28 594.04 597.72 592.76 557.86 687.70 687.25 687.35 687.25 679.80
14 400 SCCL-3 CCGT 311.50 337.40 378.80 311.50 381.30 338.60 302.70 300.70 300.60 300.70 338.20 321.50 363.80 260.35 228.18 308.70 334.73 324.60 354.63 362.38 347.30 306.22 346.86 365.04 365.40 370.68 370.52
400 SCCL-3 CCGT 311.50 337.40 378.80 311.50 381.30 338.60 302.70 300.70 300.60 300.70 338.20 321.50 363.80 260.35 228.18 308.70 334.73 324.60 354.63 362.38 347.30 306.22 346.86 365.04 365.40 370.68 370.52
16 440 CDCL-1 CCGT 270.63 255.24 210.87 197.10 195.12 198.72 197.64 198.99 233.19 221.31 176.94 317.52 280.68 213.12 297.68 342.25 397.26 372.28 371.74 379.87 347.51 348.48 352.15 384.88 395.14 381.02 360.40
17 457 STAY-3 CCGT 311.25 311.30 311.60 311.45 311.15 311.30 308.40 313.10 223.90 196.05 242.95 172.87 217.40 236.84 252.92 352.98 384.06 414.76 403.68 424.90 418.38 403.00 420.26 424.40 427.06 421.64 424.66
18 920 MRWD-1 CCGT 468.70 483.90 420.60 267.80 472.60 470.20 241.40 299.30 327.70 327.80 336.90 241.60 308.33 529.93 793.73 828.40 870.67 846.67 827.07 855.93 829.33 865.87 870.40 846.87 765.47 785.20 824.00
19 425 WBURB-3 CCGT 311.73 427.68 333.68 333.93 370.68 335.09 420.85 433.86 370.45 321.70 340.54 300.95 155.47 190.67 290.81 310.43 332.52 376.63 391.11 413.74 408.33 398.69 397.54 368.05 410.64 413.05 428.91
20 425 WBURB-2 CCGT 295.54 424.56 336.68 334.08 371.20 358.44 358.90 358.96 377.94 325.42 203.19 165.32 205.75 121.41 162.51 180.15 301.12 413.77 410.33 397.21 385.59 378.09 381.50 380.93 413.71 418.53 427.09
21 420 GYAR-1 CCGT 404.33 404.33 403.73 405.12 404.13 404.33 404.33 376.98 218.02 218.02 351.01 215.10 177.46 222.43 345.47 398.94 401.97 401.97 402.17 401.87 401.47 401.77 401.62 402.51 402.31 402.41 402.26
22 457 STAY-2 CCGT 434.20 435.40 435.40 435.20 434.20 434.20 434.20 434.60 249.80 196.20 291.20 234.80 196.80 88.73 167.10 239.52 324.52 372.80 412.40 423.32 424.04 423.96 423.92 424.08 423.88 420.96 422.44
23 400 KLYN-A-1 CCGT 382.58 382.50 384.94 385.81 385.83 385.79 385.02 384.94 259.16 141.03 195.65 205.75 278.81 256.95 296.85 337.82 369.26 376.38 376.84 376.56 376.30 376.09 375.62 375.45 375.11 375.17 375.09
24 420 SHOS-1 CCGT 290.63 326.33 229.60 265.70 269.05 259.40 299.45 310.20 301.65 266.00 307.90 319.30 253.06 246.85 263.04 220.46 277.68 297.84 290.62 297.86 302.83 295.13 293.73 289.04 306.14 314.24 321.76
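A possible approach (a sketch only, assuming b1610 is the merged frame built above and df3 is the ranked frame shown, held as a DataFrame with 'BM Unit ID 1' and 'rank' as ordinary columns): aggregate the mean per unit and settlement period across all dates, then attach the rank from df3 and sort by it.
# Mean per BM Unit across all dates, one column per settlement period.
mean_by_period = b1610.pivot_table(
    index=['BM Unit ID 1', 'Technology'],
    columns='settlementPeriod',
    values='quantity',               # or '% of capacity running'
    aggfunc='mean',
    fill_value=0)

# Attach the final rank from df3 and order the rows by it.
result = (mean_by_period
          .reset_index()
          .merge(df3[['BM Unit ID 1', 'rank']], on='BM Unit ID 1', how='left')
          .sort_values('rank')
          .set_index(['rank', 'BM Unit ID 1', 'Technology']))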

How to fill missing Rows in DataFrame with values in specific range?

I am missing some rows in a DataFrame; the date_number range should span from 0 to 50. I would like to create the missing rows, in this case the missing range from 0 to 13 and from 29 to 50. I would like every missing row to be filled with the following data:
date_number tag_id price weekday brand_id sales stock profit
XX.X 665237982.0 12. Y.Y 2123.0 0.0 0.0 0.00
date_number tag_id price weekday brand_id sales stock profit
14.0 665237982.0 12.95 0.0 2123.0 0.0 128.0 0.00
15.0 665237982.0 12.95 1.0 2123.0 9.0 106.0 116.55
16.0 665237982.0 12.95 2.0 2123.0 29.0 137.0 375.55
17.0 665237982.0 12.95 3.0 2123.0 24.0 88.0 310.80
18.0 665237982.0 12.95 4.0 2123.0 27.0 35.0 349.65
19.0 665237982.0 12.95 5.0 2123.0 2.0 2.0 25.90
21.0 665237982.0 12.95 0.0 2123.0 14.0 312.0 181.30
22.0 665237982.0 12.95 1.0 2123.0 12.0 455.0 155.40
23.0 665237982.0 12.95 2.0 2123.0 12.0 450.0 155.40
24.0 665237982.0 12.95 3.0 2123.0 8.0 450.0 103.60
25.0 665237982.0 12.95 4.0 2123.0 11.0 427.0 142.45
26.0 665237982.0 12.95 5.0 2123.0 9.0 401.0 116.55
27.0 665237982.0 12.95 6.0 2123.0 19.0 377.0 246.05
28.0 665237982.0 12.95 0.0 2123.0 12.0 343.0 155.40
29.0 665237982.0 12.95 1.0 2123.0 9.0 314.0 116.55
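One possible way to create the missing rows (a sketch, assuming the frame above is named df and date_number should run from 0 to 50): reindex on the full range, fill the identifying columns from the rows that already exist, and default the measures to 0. The weekday column is left untouched here because the mapping from date_number to weekday isn't shown.
import pandas as pd

# df is assumed to be the frame shown above.
full_range = pd.Index(range(51), name='date_number', dtype=float)   # 0.0 .. 50.0

filled = df.set_index('date_number').reindex(full_range)

# Identifying columns carry the same value on every row, so copy them
# from the existing rows.
for col in ['tag_id', 'price', 'brand_id']:
    filled[col] = filled[col].ffill().bfill()

# The measures default to 0 on the newly created rows.
filled[['sales', 'stock', 'profit']] = filled[['sales', 'stock', 'profit']].fillna(0.0)

filled = filled.reset_index()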

Unable to understand the difference between two statements

I recently tried to strip "\t" characters from the values in a dataframe.
def FillMissing(dataFrame):
    dataFrame.replace("\?", np.nan, inplace=True, regex=True)
    dataFrame[:] = dataFrame.apply(lambda x: x.str.strip("\t")
                                   if x.dtype == "object" else x)
    print(id(dataFrame))
    .......

(test_df, obj, int_, float_) = FillMissing(test_df)
print(id(test_df))
O/P-
ID(dataFrame) - 4647827832
htn-->['yes' 'no']
dm-->['yes' 'no' ' yes']
cad-->['no' 'yes']
appet-->['good' 'poor']
pe-->['no' 'yes']
ane-->['no' 'yes']
ID(test_df) - 4647827832
And by slightly modifying the method, the O/P changes:
def FillMissing(dataFrame):
    dataFrame.replace("\?", np.nan, inplace=True, regex=True)
    dataFrame = dataFrame.apply(lambda x: x.str.strip("\t")
                                if x.dtype == "object" else x)
    print(id(dataFrame))
    .......

(test_df, obj, int_, float_) = FillMissing(test_df)
print(id(test_df))
O/P -
ID(dataFrame) - 4647827832
htn-->['yes' 'no' nan]
dm-->['yes' 'no' ' yes' nan]
cad-->['no' 'yes' nan]
appet-->['good' 'poor' nan]
pe-->['no' 'yes' nan]
ID(test_df) - 4517467528
I am unable to understand the difference between the two statements.
Also, when executing the strip outside the function on test_df, it works completely fine.
Snapshot of data used:
id age bp sg al su rbc pc pcc \
0 0 48.0 80.0 1.020 1.0 0.0 normal normal notpresent
1 1 7.0 50.0 1.020 4.0 0.0 normal normal notpresent
2 2 62.0 80.0 1.010 2.0 3.0 normal normal notpresent
3 3 48.0 70.0 1.005 4.0 0.0 normal abnormal present
4 4 51.0 80.0 1.010 2.0 0.0 normal normal notpresent
5 5 60.0 90.0 1.015 3.0 0.0 normal normal notpresent
6 6 68.0 70.0 1.010 0.0 0.0 normal normal notpresent
7 7 24.0 80.0 1.015 2.0 4.0 normal abnormal notpresent
8 8 52.0 100.0 1.015 3.0 0.0 normal abnormal present
9 9 53.0 90.0 1.020 2.0 0.0 abnormal abnormal present
10 10 50.0 60.0 1.010 2.0 4.0 abnormal abnormal present
11 11 63.0 70.0 1.010 3.0 0.0 abnormal abnormal present
12 12 68.0 70.0 1.015 3.0 1.0 normal normal present
13 13 68.0 70.0 1.020 0.0 0.0 normal abnormal notpresent
14 14 68.0 80.0 1.010 3.0 2.0 normal abnormal present
15 15 40.0 80.0 1.015 3.0 0.0 abnormal normal notpresent
16 16 47.0 70.0 1.015 2.0 0.0 abnormal normal notpresent
17 17 47.0 80.0 1.020 0.0 0.0 abnormal normal notpresent
18 18 60.0 100.0 1.025 0.0 3.0 abnormal normal notpresent
19 19 62.0 60.0 1.015 1.0 0.0 abnormal abnormal present
20 20 61.0 80.0 1.015 2.0 0.0 abnormal abnormal notpresent
21 21 60.0 90.0 1.020 0.0 0.0 normal abnormal notpresent
22 22 48.0 80.0 1.025 4.0 0.0 normal abnormal notpresent
23 23 21.0 70.0 1.010 0.0 0.0 normal normal notpresent
24 24 42.0 100.0 1.015 4.0 0.0 normal abnormal notpresent
25 25 61.0 60.0 1.025 0.0 0.0 normal normal notpresent
26 26 75.0 80.0 1.015 0.0 0.0 normal normal notpresent
27 27 69.0 70.0 1.010 3.0 4.0 normal abnormal notpresent
28 28 75.0 70.0 1.020 1.0 3.0 abnormal abnormal notpresent
29 29 68.0 70.0 1.005 1.0 0.0 abnormal abnormal present
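The difference comes down to in-place assignment versus rebinding a local name: dataFrame[:] = ... writes the stripped values into the object the caller passed in (same id, and the caller sees the change), while dataFrame = ... rebinds the local name to a brand-new DataFrame (different id, and the caller's frame keeps its old values unless the new frame is returned and reassigned). A minimal sketch, not using the original data:
import pandas as pd

df = pd.DataFrame({'a': ['x\t', 'y\t'], 'b': [1, 2]})

def strip_in_place(frame):
    # frame[:] = ... assigns into the existing object: id is unchanged
    # and the caller's frame is modified.
    frame[:] = frame.apply(lambda x: x.str.strip('\t') if x.dtype == 'object' else x)
    print('inside:', id(frame))

def strip_rebound(frame):
    # frame = ... rebinds the local name to a new DataFrame: id changes
    # and the caller's frame is left untouched.
    frame = frame.apply(lambda x: x.str.strip('\t') if x.dtype == 'object' else x)
    print('inside:', id(frame))

print('outside:', id(df))
strip_in_place(df)    # prints the same id as outside
strip_rebound(df)     # prints a different id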

Merging two Pandas series with duplicate datetime indices

I have two Pandas series (d1 and d2) indexed by datetime and each containing one column of data with both float and NaN. Both indices are at one-day intervals, although the time entries are inconsistent with many periods of missing days. d1 ranges from 1974-12-16 to 2002-01-30. d2 ranges from 1997-12-19 to 2017-07-06. The period from 1997-12-19 to 2002-01-30 contains many duplicate indices between the two series. The data for duplicated indices is sometimes the same value, different values, or one value and NaN.
I would like to combine these two series into one, prioritizing the data from d2 anytime there are duplicate indices (that is, replace all d1 data with d2 data anytime there is a duplicated index). What is the most efficient way to do this among the many Pandas tools available (merge, join, concatenate etc.)?
Here is an example of my data:
In [7]: print d1
fldDate
1974-12-16 19.0
1974-12-17 28.0
1974-12-18 24.0
1974-12-19 18.0
1974-12-20 17.0
1974-12-21 28.0
1974-12-22 28.0
1974-12-23 10.0
1974-12-24 6.0
1974-12-25 5.0
1974-12-26 12.0
1974-12-27 19.0
1974-12-28 22.0
1974-12-29 20.0
1974-12-30 16.0
1974-12-31 12.0
1975-01-01 12.0
1975-01-02 15.0
1975-01-03 14.0
1975-01-04 15.0
1975-01-05 18.0
1975-01-06 21.0
1975-01-07 22.0
1975-01-08 18.0
1975-01-09 20.0
1975-01-10 12.0
1975-01-11 8.0
1975-01-12 -2.0
1975-01-13 13.0
1975-01-14 24.0
...
2002-01-01 18.0
2002-01-02 16.0
2002-01-03 NaN
2002-01-04 24.0
2002-01-05 23.0
2002-01-06 15.0
2002-01-07 22.0
2002-01-08 34.0
2002-01-09 35.0
2002-01-10 29.0
2002-01-11 21.0
2002-01-12 24.0
2002-01-13 NaN
2002-01-14 18.0
2002-01-15 14.0
2002-01-16 10.0
2002-01-17 5.0
2002-01-18 7.0
2002-01-19 7.0
2002-01-20 7.0
2002-01-21 11.0
2002-01-22 NaN
2002-01-23 9.0
2002-01-24 8.0
2002-01-25 15.0
2002-01-26 NaN
2002-01-27 NaN
2002-01-28 18.0
2002-01-29 13.0
2002-01-30 13.0
Name: MaxTempMid, dtype: float64
In [8]: print d2
fldDate
1997-12-19 22.0
1997-12-20 14.0
1997-12-21 18.0
1997-12-22 16.0
1997-12-23 16.0
1997-12-24 10.0
1997-12-25 12.0
1997-12-26 12.0
1997-12-27 9.0
1997-12-28 12.0
1997-12-29 18.0
1997-12-30 23.0
1997-12-31 28.0
1998-01-01 26.0
1998-01-02 29.0
1998-01-03 27.0
1998-01-04 22.0
1998-01-05 19.0
1998-01-06 17.0
1998-01-07 14.0
1998-01-08 14.0
1998-01-09 14.0
1998-01-10 16.0
1998-01-11 20.0
1998-01-12 21.0
1998-01-13 19.0
1998-01-14 20.0
1998-01-15 16.0
1998-01-16 17.0
1998-01-17 20.0
...
2017-06-07 68.0
2017-06-08 71.0
2017-06-09 71.0
2017-06-10 59.0
2017-06-11 41.0
2017-06-12 57.0
2017-06-13 58.0
2017-06-14 36.0
2017-06-15 50.0
2017-06-16 58.0
2017-06-17 54.0
2017-06-18 53.0
2017-06-19 58.0
2017-06-20 68.0
2017-06-21 71.0
2017-06-22 71.0
2017-06-23 59.0
2017-06-24 61.0
2017-06-25 65.0
2017-06-26 68.0
2017-06-27 71.0
2017-06-28 60.0
2017-06-29 54.0
2017-06-30 48.0
2017-07-01 60.0
2017-07-02 68.0
2017-07-03 65.0
2017-07-04 73.0
2017-07-05 74.0
2017-07-06 77.0
Name: MaxTempMid, dtype: float64
Let's use combine_first:
d2.combine_first(d1)
Output:
fldDate
1974-12-16 19.0
1974-12-17 28.0
1974-12-18 24.0
1974-12-19 18.0
1974-12-20 17.0
1974-12-21 28.0
1974-12-22 28.0
1974-12-23 10.0
1974-12-24 6.0
1974-12-25 5.0
1974-12-26 12.0
1974-12-27 19.0
1974-12-28 22.0
1974-12-29 20.0
1974-12-30 16.0
1974-12-31 12.0
1975-01-01 12.0
1975-01-02 15.0
1975-01-03 14.0
1975-01-04 15.0
1975-01-05 18.0
1975-01-06 21.0
1975-01-07 22.0
1975-01-08 18.0
1975-01-09 20.0
1975-01-10 12.0
1975-01-11 8.0
1975-01-12 -2.0
1975-01-13 13.0
1975-01-14 24.0
...
2017-06-07 68.0
2017-06-08 71.0
2017-06-09 71.0
2017-06-10 59.0
2017-06-11 41.0
2017-06-12 57.0
2017-06-13 58.0
2017-06-14 36.0
2017-06-15 50.0
2017-06-16 58.0
2017-06-17 54.0
2017-06-18 53.0
2017-06-19 58.0
2017-06-20 68.0
2017-06-21 71.0
2017-06-22 71.0
2017-06-23 59.0
2017-06-24 61.0
2017-06-25 65.0
2017-06-26 68.0
2017-06-27 71.0
2017-06-28 60.0
2017-06-29 54.0
2017-06-30 48.0
2017-07-01 60.0
2017-07-02 68.0
2017-07-03 65.0
2017-07-04 73.0
2017-07-05 74.0
2017-07-06 77.0
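One caveat, since the question asks for d2 to win on every duplicated index: combine_first only falls back to d1 where d2 is NaN, so a NaN in d2 is replaced by d1's value rather than kept. If d2 really should win even when it holds NaN, one alternative (a sketch, not part of the answer above) keeps only the d1 labels that d2 doesn't have and concatenates:
# Strict d2 priority: drop every d1 entry whose index also appears in d2,
# then stack what is left of d1 together with all of d2.
merged = pd.concat([d1[~d1.index.isin(d2.index)], d2]).sort_index()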

Transposing dataframe column, creating different rows per day

I have a dataframe that has one column and a timestamp index including anywhere from 2 to 7 days:
kWh
Timestamp
2017-07-08 06:00:00 0.00
2017-07-08 07:00:00 752.75
2017-07-08 08:00:00 1390.20
2017-07-08 09:00:00 2027.65
2017-07-08 10:00:00 2447.27
.... ....
2017-07-12 20:00:00 167.64
2017-07-12 21:00:00 0.00
2017-07-12 22:00:00 0.00
2017-07-12 23:00:00 0.00
I would like to transpose the kWh column so that one day's worth of values (hourly granularity, so 24 values/day) fill up a row. And the next row is the next day of values and so on (so five days of forecasted data has five rows with 24 elements each).
Because my query of the data comes in the vertical format, and my regression and subsequent analysis already occurs in the vertical format, I don't want to change the process too much and am hoping there is a simpler way. I have tried giving a multi-index with df.index.hour and then using unstack(), but I get a huge dataframe with NaN values everywhere.
Is there an elegant way to do this?
If we start from a frame like
In [25]: df = pd.DataFrame({"kWh": [1]}, index=pd.date_range("2017-07-08",
    ...:     "2017-07-12", freq="1H").rename("Timestamp")).cumsum()
In [26]: df.head()
Out[26]:
kWh
Timestamp
2017-07-08 00:00:00 1
2017-07-08 01:00:00 2
2017-07-08 02:00:00 3
2017-07-08 03:00:00 4
2017-07-08 04:00:00 5
we can make date and hour columns and then pivot:
In [27]: df["date"] = df.index.date
In [28]: df["hour"] = df.index.hour
In [29]: df.pivot(index="date", columns="hour", values="kWh")
Out[29]:
hour 0 1 2 3 4 5 6 7 8 9 ... \
date ...
2017-07-08 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ...
2017-07-09 25.0 26.0 27.0 28.0 29.0 30.0 31.0 32.0 33.0 34.0 ...
2017-07-10 49.0 50.0 51.0 52.0 53.0 54.0 55.0 56.0 57.0 58.0 ...
2017-07-11 73.0 74.0 75.0 76.0 77.0 78.0 79.0 80.0 81.0 82.0 ...
2017-07-12 97.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
hour 14 15 16 17 18 19 20 21 22 23
date
2017-07-08 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0
2017-07-09 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 47.0 48.0
2017-07-10 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 71.0 72.0
2017-07-11 87.0 88.0 89.0 90.0 91.0 92.0 93.0 94.0 95.0 96.0
2017-07-12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 24 columns]
I'm not sure why your MultiIndex code doesn't work.
I'm assuming your MultiIndex code is something along these lines, which gives the same output as the pivot:
In []
df = pd.DataFrame({"kWh": [1]}, index=pd.date_range("2017-07-08",
                  "2017-07-12", freq="1H").rename("Timestamp")).cumsum()
df.index = pd.MultiIndex.from_arrays([df.index.date, df.index.hour], names=['Date','Hour'])
df.unstack()
Out[]:
kWh ... \
Hour 0 1 2 3 4 5 6 7 8 9 ...
Date ...
2017-07-08 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ...
2017-07-09 25.0 26.0 27.0 28.0 29.0 30.0 31.0 32.0 33.0 34.0 ...
2017-07-10 49.0 50.0 51.0 52.0 53.0 54.0 55.0 56.0 57.0 58.0 ...
2017-07-11 73.0 74.0 75.0 76.0 77.0 78.0 79.0 80.0 81.0 82.0 ...
2017-07-12 97.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
Hour 14 15 16 17 18 19 20 21 22 23
Date
2017-07-08 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0
2017-07-09 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 47.0 48.0
2017-07-10 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 71.0 72.0
2017-07-11 87.0 88.0 89.0 90.0 91.0 92.0 93.0 94.0 95.0 96.0
2017-07-12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 24 columns]
