Calculating age from dataframe (dob Y/M/D) - python

I'm trying to add a column "Age" to my dataframe:
number of purchased hours(mins) dob Y dob M dob D
0 7200 2010.0 10.0 12.0
1 7320 2010.0 6.0 2.0
2 5400 2011.0 6.0 18.0
3 9180 2009.0 10.0 18.0
4 3102 2007.0 7.0 30.0
5 5400 2011.0 4.0 6.0
6 9000 2009.0 8.0 5.0
7 6000 2004.0 2.0 7.0
8 6000 2007.0 8.0 17.0
9 6000 2013.0 5.0 5.0
10 12000 2012.0 9.0 27.0
11 12000 2004.0 11.0 25.0
12 6000 2009.0 11.0 20.0
I've tried this code, but I'm not sure what went wrong:
import datetime
import pandas as pd

df['Age'] = datetime.datetime.now() - pd.to_datetime(df[['dob D','dob M','dob Y']])
Below is the error that popped up
ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing

If you want to use to_datetime with three columns, it only works when they are named year, month and day, so rename the columns first:
d = {'dob Y':'year', 'dob M':'month', 'dob D':'day'}
df['Age'] = (pd.Timestamp.now().floor('d') -
             pd.to_datetime(df[['dob D','dob M','dob Y']].rename(columns=d)))
print(df)
number of purchased hours(mins) dob Y dob M dob D Age
0 7200 2010.0 10.0 12.0 3380 days
1 7320 2010.0 6.0 2.0 3512 days
2 5400 2011.0 6.0 18.0 3131 days
3 9180 2009.0 10.0 18.0 3739 days
4 3102 2007.0 7.0 30.0 4550 days
5 5400 2011.0 4.0 6.0 3204 days
6 9000 2009.0 8.0 5.0 3813 days
7 6000 2004.0 2.0 7.0 5819 days
8 6000 2007.0 8.0 17.0 4532 days
9 6000 2013.0 5.0 5.0 2444 days
10 12000 2012.0 9.0 27.0 2664 days
11 12000 2004.0 11.0 25.0 5527 days
12 6000 2009.0 11.0 20.0 3706 days
If you want to convert the timedeltas to whole days, use Series.dt.days:
d = {'dob Y':'year', 'dob M':'month', 'dob D':'day'}
df['Age'] = ((pd.Timestamp.now().floor('d') -
              pd.to_datetime(df[['dob D','dob M','dob Y']].rename(columns=d)))
             .dt.days)
print(df)
number of purchased hours(mins) dob Y dob M dob D Age
0 7200 2010.0 10.0 12.0 3380
1 7320 2010.0 6.0 2.0 3512
2 5400 2011.0 6.0 18.0 3131
3 9180 2009.0 10.0 18.0 3739
4 3102 2007.0 7.0 30.0 4550
5 5400 2011.0 4.0 6.0 3204
6 9000 2009.0 8.0 5.0 3813
7 6000 2004.0 2.0 7.0 5819
8 6000 2007.0 8.0 17.0 4532
9 6000 2013.0 5.0 5.0 2444
10 12000 2012.0 9.0 27.0 2664
11 12000 2004.0 11.0 25.0 5527
12 6000 2009.0 11.0 20.0 3706
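If ages in whole years are needed instead of day counts, one common approximation (not part of the solution above; it divides the timedelta by an average year of 365.25 days, so it can be off by a day near birthdays) is:

```python
import pandas as pd

# Small sample in the shape of the question's data
df = pd.DataFrame({'dob Y': [2010.0, 2004.0],
                   'dob M': [10.0, 2.0],
                   'dob D': [12.0, 7.0]})

d = {'dob Y': 'year', 'dob M': 'month', 'dob D': 'day'}
dob = pd.to_datetime(df[['dob D', 'dob M', 'dob Y']].rename(columns=d))

# Approximate age in whole years: timedelta divided by an average year
df['Age'] = ((pd.Timestamp.now().floor('d') - dob)
             / pd.Timedelta(days=365.25)).astype(int)
```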

Related

How to do similar to conditional countifs on a dataframe

I am trying to replicate Excel's COUNTIFS to get a rank between two unique values listed in my dataframe. I have attached the expected output, calculated in Excel using COUNTIF and LET/RANK functions.
I am trying to generate an "average rank of gas and coal plants" column that takes the number from the "Average Rank" column and then ranks the two unique types from Technology (CCGT or COAL) into two new ranks (Gas or Coal), so that I can get the relevant quantiles. In case you are wondering why I need this when there are only two coal plants: when I run this model on a larger dataset it will be useful to know how to do it in code rather than manually.
Ideally the output will return two ranks: 1-47 for all units with Technology == CCGT and 1-2 for all units with Technology == COAL.
This is the column I am looking to make
Unit ID | Technology | Daily ranks 03/01/2022-08/01/2022 (days with no value omitted) | Average Rank | Unit Rank | Avg Rank of Gas & Coal plants | Gas Quintiles | Coal Quintiles | Quintiles
FAWN-1 | CCGT | 1.0 5.0 1.0 5.0 2.0 1.0 | 2.5 | 1 | 1 | 1 | 0 | Gas_1
GRAI-6 | CCGT | 4.0 18.0 2.0 4.0 3.0 3.0 | 5.7 | 2 | 2 | 1 | 0 | Gas_1
EECL-1 | CCGT | 5.0 29.0 4.0 1.0 1.0 2.0 | 7.0 | 3 | 3 | 1 | 0 | Gas_1
PEMB-21 | CCGT | 7.0 1.0 6.0 13.0 8.0 8.0 | 7.2 | 4 | 4 | 1 | 0 | Gas_1
PEMB-51 | CCGT | 3.0 3.0 3.0 11.0 16.0 | 7.2 | 5 | 5 | 1 | 0 | Gas_1
PEMB-41 | CCGT | 9.0 4.0 7.0 7.0 10.0 13.0 | 8.3 | 6 | 6 | 1 | 0 | Gas_1
WBURB-1 | CCGT | 6.0 9.0 22.0 2.0 7.0 5.0 | 8.5 | 7 | 7 | 1 | 0 | Gas_1
PEMB-31 | CCGT | 14.0 6.0 13.0 6.0 4.0 9.0 | 8.7 | 8 | 8 | 1 | 0 | Gas_1
GRMO-1 | CCGT | 2.0 7.0 10.0 24.0 11.0 6.0 | 10.0 | 9 | 9 | 1 | 0 | Gas_1
PEMB-11 | CCGT | 21.0 2.0 9.0 10.0 9.0 14.0 | 10.8 | 10 | 10 | 2 | 0 | Gas_2
STAY-1 | CCGT | 19.0 12.0 5.0 23.0 6.0 7.0 | 12.0 | 11 | 11 | 2 | 0 | Gas_2
GRAI-7 | CCGT | 10.0 27.0 15.0 9.0 15.0 11.0 | 14.5 | 12 | 12 | 2 | 0 | Gas_2
DIDCB6 | CCGT | 28.0 11.0 11.0 8.0 19.0 15.0 | 15.3 | 13 | 13 | 2 | 0 | Gas_2
SCCL-3 | CCGT | 17.0 16.0 31.0 3.0 18.0 10.0 | 15.8 | 14 | 14 | 2 | 0 | Gas_2
STAY-4 | CCGT | 12.0 8.0 20.0 18.0 14.0 23.0 | 15.8 | 14 | 14 | 2 | 0 | Gas_2
CDCL-1 | CCGT | 13.0 22.0 8.0 25.0 12.0 16.0 | 16.0 | 16 | 16 | 2 | 0 | Gas_2
STAY-3 | CCGT | 8.0 17.0 17.0 20.0 13.0 22.0 | 16.2 | 17 | 17 | 2 | 0 | Gas_2
MRWD-1 | CCGT | 19.0 26.0 5.0 19.0 | 17.3 | 18 | 18 | 2 | 0 | Gas_2
WBURB-3 | CCGT | 24.0 14.0 17.0 17.0 | 18.0 | 19 | 19 | 3 | 0 | Gas_3
WBURB-2 | CCGT | 14.0 21.0 12.0 31.0 18.0 | 19.2 | 20 | 20 | 3 | 0 | Gas_3
GYAR-1 | CCGT | 26.0 14.0 17.0 20.0 21.0 | 19.6 | 21 | 21 | 3 | 0 | Gas_3
STAY-2 | CCGT | 18.0 20.0 18.0 21.0 24.0 20.0 | 20.2 | 22 | 22 | 3 | 0 | Gas_3
KLYN-A-1 | CCGT | 24.0 12.0 19.0 27.0 | 20.5 | 23 | 23 | 3 | 0 | Gas_3
SHOS-1 | CCGT | 16.0 15.0 28.0 15.0 29.0 27.0 | 21.7 | 24 | 24 | 3 | 0 | Gas_3
DIDCB5 | CCGT | 10.0 35.0 22.0 | 22.3 | 25 | 25 | 3 | 0 | Gas_3
CARR-1 | CCGT | 33.0 26.0 27.0 22.0 4.0 | 22.4 | 26 | 26 | 3 | 0 | Gas_3
LAGA-1 | CCGT | 15.0 13.0 29.0 32.0 23.0 24.0 | 22.7 | 27 | 27 | 3 | 0 | Gas_3
CARR-2 | CCGT | 24.0 25.0 27.0 29.0 21.0 12.0 | 23.0 | 28 | 28 | 3 | 0 | Gas_3
GRAI-8 | CCGT | 11.0 28.0 36.0 16.0 26.0 25.0 | 23.7 | 29 | 29 | 4 | 0 | Gas_4
SCCL-2 | CCGT | 29.0 16.0 28.0 25.0 | 24.5 | 30 | 30 | 4 | 0 | Gas_4
LBAR-1 | CCGT | 19.0 25.0 31.0 28.0 | 25.8 | 31 | 31 | 4 | 0 | Gas_4
CNQPS-2 | CCGT | 20.0 32.0 32.0 26.0 | 27.5 | 32 | 32 | 4 | 0 | Gas_4
SPLN-1 | CCGT | 23.0 30.0 30.0 | 27.7 | 33 | 33 | 4 | 0 | Gas_4
DAMC-1 | CCGT | 23.0 21.0 38.0 34.0 | 29.0 | 34 | 34 | 4 | 0 | Gas_4
KEAD-2 | CCGT | 30.0 | 30.0 | 35 | 35 | 4 | 0 | Gas_4
SHBA-1 | CCGT | 26.0 23.0 35.0 37.0 | 30.3 | 36 | 36 | 4 | 0 | Gas_4
HUMR-1 | CCGT | 22.0 30.0 37.0 37.0 33.0 28.0 | 31.2 | 37 | 37 | 4 | 0 | Gas_4
CNQPS-4 | CCGT | 27.0 33.0 35.0 30.0 | 31.3 | 38 | 38 | 5 | 0 | Gas_5
CNQPS-1 | CCGT | 25.0 40.0 33.0 | 32.7 | 39 | 39 | 5 | 0 | Gas_5
SEAB-1 | CCGT | 32.0 34.0 36.0 29.0 | 32.8 | 40 | 40 | 5 | 0 | Gas_5
PETEM1 | CCGT | 35.0 | 35.0 | 41 | 41 | 5 | 0 | Gas_5
ROCK-1 | CCGT | 31.0 34.0 38.0 38.0 | 35.3 | 42 | 42 | 5 | 0 | Gas_5
SEAB-2 | CCGT | 31.0 39.0 39.0 34.0 | 35.8 | 43 | 43 | 5 | 0 | Gas_5
WBURB-43 | COAL | 32.0 37.0 40.0 39.0 31.0 | 35.8 | 44 | 1 | 0 | 1 | Coal_1
FDUNT-1 | CCGT | 36.0 | 36.0 | 45 | 44 | 5 | 0 | Gas_5
COSO-1 | CCGT | 30.0 42.0 36.0 | 36.0 | 45 | 44 | 5 | 0 | Gas_5
WBURB-41 | COAL | 33.0 38.0 41.0 40.0 32.0 | 36.8 | 47 | 2 | 0 | 1 | Coal_1
FELL-1 | CCGT | 34.0 39.0 43.0 41.0 33.0 | 38.0 | 48 | 46 | 5 | 0 | Gas_5
KEAD-1 | CCGT | 43.0 | 43.0 | 49 | 47 | 5 | 0 | Gas_5
I have tried to do it the same way I got average rank, which is a rank of the average of inputs in the dataframe but it doesn't seem to work with additional conditions.
Thank you!!
import pandas as pd
df = pd.read_csv("gas.csv")
display(df['Technology'].value_counts())
print('------')
display(df['Technology'].value_counts()[0]) # This is how you access count of CCGT
display(df['Technology'].value_counts()[1])
Output:
CCGT 47
COAL 2
Name: Technology, dtype: int64
------
47
2
By the way: pd.cut or pd.qcut can be used to calculate quantiles. You don't have to manually define what a quantile is.
Refer to the documentation and other websites:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html
https://www.geeksforgeeks.org/how-to-use-pandas-cut-and-qcut/
There are many methods you can pass to rank. Refer to documentation:
https://pandas.pydata.org/docs/reference/api/pandas.Series.rank.html
df['rank'] = df.groupby("Technology")["Average Rank"].rank(method = "dense", ascending = True)
df
method{‘average’, ‘min’, ‘max’, ‘first’, ‘dense’}, default ‘average’
How to rank the group of records that have the same value (i.e. ties):
average: average rank of the group
min: lowest rank in the group
max: highest rank in the group
first: ranks assigned in order they appear in the array
dense: like ‘min’, but rank always increases by 1 between groups.
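Putting the grouped rank together with pd.qcut, a minimal sketch might look like the following (the small frame and the `tech rank`, `quintile` and `label` column names are illustrative, not from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'Unit ID': ['FAWN-1', 'GRAI-6', 'EECL-1', 'WBURB-1', 'WBURB-43', 'WBURB-41'],
    'Technology': ['CCGT', 'CCGT', 'CCGT', 'CCGT', 'COAL', 'COAL'],
    'Average Rank': [2.5, 5.7, 7.0, 8.5, 35.8, 36.8],
})

# Separate 1..n rank within each technology
df['tech rank'] = (df.groupby('Technology')['Average Rank']
                     .rank(method='dense').astype(int))

# Per-technology quantile bin; tiny groups (two coal plants) get fewer bins
df['quintile'] = df.groupby('Technology')['tech rank'].transform(
    lambda s: pd.qcut(s, min(5, s.nunique()), labels=False) + 1)

# "Gas_1" / "Coal_1" style label
df['label'] = (df['Technology'].map({'CCGT': 'Gas', 'COAL': 'Coal'})
               + '_' + df['quintile'].astype(str))
```

On the full 47-unit CCGT group, `min(5, s.nunique())` evaluates to 5, giving genuine quintiles; it only shrinks the bin count for groups smaller than five.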

What is the best way to create a new dataframe with existing ones of different shapes and criteria

I have a few dataframes that I have made through various sorting and processing of data from the main dataframe (df1).
df1 - large; it currently covers 6 days' worth of data at 30-minute intervals, but I wish to scale up to longer periods:
import pandas as pd
import numpy as np
bmu_units = pd.read_csv('bmu_units_technology.csv')
b1610 = pd.read_csv('b1610_df.csv')
b1610 = (b1610.merge(bmu_units, on=['BM Unit ID 1'], how='left'))
b1610['% of capacity running'] = b1610.quantity / b1610.Capacity
def func(tech):
    if tech in ["CCGT", "OCGT", "COAL"]:
        return "Fossil"
    else:
        return "ZE"

b1610["Type"] = b1610['Technology'].apply(func)
settlementDate time BM Unit ID 1 BM Unit ID 2_x settlementPeriod quantity BM Unit ID 2_y Capacity Technology % of capacity running Type
0 03/01/2022 00:00:00 RCBKO-1 T_RCBKO-1 1 278.658 T_RCBKO-1 279.0 WIND 0.998774 ZE
1 03/01/2022 00:00:00 LARYO-3 T_LARYW-3 1 162.940 T_LARYW-3 180.0 WIND 0.905222 ZE
2 03/01/2022 00:00:00 LAGA-1 T_LAGA-1 1 262.200 T_LAGA-1 905.0 CCGT 0.289724 Fossil
3 03/01/2022 00:00:00 CRMLW-1 T_CRMLW-1 1 3.002 T_CRMLW-1 47.0 WIND 0.063872 ZE
4 03/01/2022 00:00:00 GRIFW-1 T_GRIFW-1 1 9.972 T_GRIFW-1 102.0 WIND 0.097765 ZE
... ... ... ... ... ... ... ... ... ... ... ...
52533 08/01/2022 23:30:00 CRMLW-1 T_CRMLW-1 48 8.506 T_CRMLW-1 47.0 WIND 0.180979 ZE
52534 08/01/2022 23:30:00 LARYO-4 T_LARYW-4 48 159.740 T_LARYW-4 180.0 WIND 0.887444 ZE
52535 08/01/2022 23:30:00 HOWBO-3 T_HOWBO-3 48 32.554 T_HOWBO-3 440.0 Offshore Wind 0.073986 ZE
52536 08/01/2022 23:30:00 BETHW-1 E_BETHW-1 48 5.010 E_BETHW-1 30.0 WIND 0.167000 ZE
52537 08/01/2022 23:30:00 HMGTO-1 T_HMGTO-1 48 92.094 HMGTO-1 108.0 WIND 0.852722 ZE
df2:
rank = (
    b1610.pivot_table(
        index=['settlementDate', 'BM Unit ID 1', 'Technology'],
        columns='settlementPeriod',
        values='% of capacity running',
        aggfunc=sum,
        fill_value=0)
)
rank['rank of capacity'] = rank.sum(axis=1)
rank
settlementPeriod 1 2 3 4 5 6 7 8 9 10 ... 40 41 42 43 44 45 46 47 48 rank of capacity
settlementDate BM Unit ID 1 Technology
03/01/2022 ABRBO-1 WIND 0.936970 0.969293 0.970909 0.925051 0.885657 0.939394 0.963434 0.938586 0.863232 0.781212 ... 0.461818 0.394545 0.428889 0.537172 0.520606 0.545253 0.873333 0.697778 0.651111 29.566263
ABRTW-1 WIND 0.346389 0.343333 0.345389 0.341667 0.342222 0.346778 0.347611 0.347722 0.346833 0.340556 ... 0.018778 0.015889 0.032056 0.043056 0.032167 0.109611 0.132111 0.163278 0.223556 10.441333
ACHRW-1 WIND 0.602884 0.575628 0.602140 0.651070 0.667721 0.654791 0.539209 0.628698 0.784233 0.782140 ... 0.174419 0.148465 0.139860 0.091535 0.094698 0.272419 0.205023 0.184651 0.177628 18.517814
AKGLW-2 WIND 0.000603 0.000603 0.000603 0.000635 0.000603 0.000635 0.000635 0.000635 0.000635 0.000603 ... 0.191079 0.195079 0.250476 0.281048 0.290000 0.279524 0.358508 0.452698 0.572730 8.616032
ANSUW-1 WIND 0.889368 0.865053 0.915684 0.894000 0.888526 0.858211 0.875158 0.878421 0.809368 0.898737 ... 0.142632 0.212526 0.276421 0.225053 0.235789 0.228000 0.152211 0.226000 0.299158 19.662421
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
08/01/2022 WBURB-2 CCGT 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.636329 0.642447 0.961835 0.908706 0.650212 0.507012 0.513176 0.503576 0.518212 24.439765
HOWBO-3 Offshore Wind 0.030418 0.026355 0.026595 0.014373 0.012523 0.008418 0.010977 0.016918 0.019127 0.025641 ... 0.055509 0.063845 0.073850 0.073923 0.073895 0.073791 0.073886 0.074050 0.073986 2.332809
MRWD-1 CCGT 0.808043 0.894348 0.853043 0.650870 0.159783 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.701739 0.488913 0.488913 0.489348 0.489130 0.392826 0.079130 0.000000 0.000000 23.485217
WBURB-3 CCGT 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.771402 0.699986 0.648386 0.919242 0.759520 0.424513 0.430598 0.420089 0.436376 25.436282
DRAXX-4 BIOMASS 0.706074 0.791786 0.806713 0.806462 0.806270 0.806136 0.806509 0.806369 0.799749 0.825070 ... 0.777395 0.816093 0.707122 0.666639 0.680406 0.679216 0.501433 0.000000 0.000000 36.576512
df3 - this was made by sorting the above dataframe to list sums for each day for each BM Unit ID filtered for specific technology types.
BM Unit ID 1 Technology 03/01/2022 04/01/2022 05/01/2022 06/01/2022 07/01/2022 08/01/2022 ave rank rank
0 FAWN-1 CCGT 1.0 5.0 1.0 5.0 2.0 1.0 2.500000 1.0
1 GRAI-6 CCGT 4.0 18.0 2.0 4.0 3.0 3.0 5.666667 2.0
2 EECL-1 CCGT 5.0 29.0 4.0 1.0 1.0 2.0 7.000000 3.0
3 PEMB-21 CCGT 7.0 1.0 6.0 13.0 8.0 8.0 7.166667 4.0
4 PEMB-51 CCGT 3.0 3.0 3.0 11.0 16.0 NaN 7.200000 5.0
5 PEMB-41 CCGT 9.0 4.0 7.0 7.0 10.0 13.0 8.333333 6.0
6 WBURB-1 CCGT 6.0 9.0 22.0 2.0 7.0 5.0 8.500000 7.0
7 PEMB-31 CCGT 14.0 6.0 13.0 6.0 4.0 9.0 8.666667 8.0
8 GRMO-1 CCGT 2.0 7.0 10.0 24.0 11.0 6.0 10.000000 9.0
9 PEMB-11 CCGT 21.0 2.0 9.0 10.0 9.0 14.0 10.833333 10.0
10 STAY-1 CCGT 19.0 12.0 5.0 23.0 6.0 7.0 12.000000 11.0
11 GRAI-7 CCGT 10.0 27.0 15.0 9.0 15.0 11.0 14.500000 12.0
12 DIDCB6 CCGT 28.0 11.0 11.0 8.0 19.0 15.0 15.333333 13.0
13 STAY-4 CCGT 12.0 8.0 20.0 18.0 14.0 23.0 15.833333 14.0
14 SCCL-3 CCGT 17.0 16.0 31.0 3.0 18.0 10.0 15.833333 14.0
15 CDCL-1 CCGT 13.0 22.0 8.0 25.0 12.0 16.0 16.000000 15.0
16 STAY-3 CCGT 8.0 17.0 17.0 20.0 13.0 22.0 16.166667 16.0
17 MRWD-1 CCGT NaN NaN 19.0 26.0 5.0 19.0 17.250000 17.0
18 WBURB-3 CCGT NaN NaN 24.0 14.0 17.0 17.0 18.000000 18.0
19 WBURB-2 CCGT NaN 14.0 21.0 12.0 31.0 18.0 19.200000 19.0
20 GYAR-1 CCGT NaN 26.0 14.0 17.0 20.0 21.0 19.600000 20.0
21 STAY-2 CCGT 18.0 20.0 18.0 21.0 24.0 20.0 20.166667 21.0
22 SHOS-1 CCGT 16.0 15.0 28.0 15.0 29.0 27.0 21.666667 22.0
23 KLYN-A-1 CCGT NaN 24.0 12.0 19.0 27.0 29.0 22.200000 23.0
24 DIDCB5 CCGT NaN 10.0 35.0 22.0 NaN NaN 22.333333 24.0
25 CARR-1 CCGT NaN 33.0 26.0 27.0 22.0 4.0 22.400000 25.0
26 LAGA-1 CCGT 15.0 13.0 29.0 32.0 23.0 24.0 22.666667 26.0
27 CARR-2 CCGT 24.0 25.0 27.0 29.0 21.0 12.0 23.000000 27.0
28 GRAI-8 CCGT 11.0 28.0 36.0 16.0 26.0 25.0 23.666667 28.0
29 SCCL-2 CCGT 29.0 NaN 16.0 28.0 25.0 NaN 24.500000 29.0
30 LBAR-1 CCGT NaN 19.0 25.0 31.0 28.0 NaN 25.750000 30.0
31 CNQPS-2 CCGT 20.0 NaN 32.0 NaN 32.0 26.0 27.500000 31.0
32 SPLN-1 CCGT NaN NaN 23.0 30.0 30.0 NaN 27.666667 32.0
33 CNQPS-1 CCGT 25.0 NaN 33.0 NaN NaN NaN 29.000000 33.0
34 DAMC-1 CCGT 23.0 21.0 38.0 34.0 NaN NaN 29.000000 33.0
35 KEAD-2 CCGT 30.0 NaN NaN NaN NaN NaN 30.000000 34.0
36 HUMR-1 CCGT 22.0 30.0 37.0 37.0 33.0 28.0 31.166667 35.0
37 SHBA-1 CCGT 26.0 23.0 40.0 35.0 37.0 NaN 32.200000 36.0
38 SEAB-1 CCGT NaN 32.0 34.0 36.0 NaN 30.0 33.000000 37.0
39 CNQPS-4 CCGT 27.0 NaN 41.0 33.0 35.0 31.0 33.400000 38.0
40 PETEM1 CCGT NaN 35.0 NaN NaN NaN NaN 35.000000 39.0
41 SEAB-2 CCGT NaN 31.0 39.0 39.0 34.0 NaN 35.750000 40.0
42 COSO-1 CCGT NaN NaN 30.0 42.0 36.0 NaN 36.000000 41.0
43 ROCK-1 CCGT 31.0 34.0 42.0 38.0 38.0 NaN 36.600000 42.0
44 WBURB-43 COAL 32.0 37.0 45.0 40.0 39.0 32.0 37.500000 43.0
45 WBURB-41 COAL 33.0 38.0 46.0 41.0 40.0 33.0 38.500000 44.0
46 FELL-1 CCGT 34.0 39.0 47.0 43.0 41.0 34.0 39.666667 45.0
47 FDUNT-1 OCGT NaN 36.0 44.0 NaN NaN NaN 40.000000 46.0
48 KEAD-1 CCGT NaN NaN 43.0 NaN NaN NaN 43.000000 47.0
My issue is that I am trying to create a new dataframe from the existing dataframes listed above, in which all my BM Unit ID 1 values are listed in order of rank from df2, while the values are populated with means over all dates (not split by date) from df1. An example of what I am after is below, which I made in Excel using INDEX/MATCH. Here I have the results for each settlement period from df1 and df2, but instead of being split by date they are an aggregated mean over all dates in the dataframe; crucially, they are still ranked according to the last column of df2.
Desired Output:
BM Unit ID Technology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
Rank Capacity
1 150 FAWN-1 CCGT 130.43 130.93 130.78 130.58 130.57 130.54 130.71 130.87 130.89 130.98 130.83 130.80 130.88 131.02 130.81 130.65 130.86 130.84 131.19 130.60 130.69 130.70 130.40 130.03 130.13 130.03 129.75
2 455 GRAI-6 CCGT 339.45 342.33 322.53 312.40 303.78 307.60 316.35 277.18 293.48 325.75 326.75 271.34 299.74 328.06 317.12 342.66 364.50 390.90 403.32 411.52 400.18 405.94 394.04 400.08 389.08 382.74 374.76
3 408 EECL-1 CCGT 363.31 386.71 364.46 363.31 363.31 363.38 361.87 305.06 286.99 282.74 323.93 242.88 242.64 207.73 294.71 357.15 383.47 426.93 433.01 432.98 435.14 436.38 416.04 417.69 430.42 415.09 406.45
4 430 PEMB-21 CCGT 334.40 419.50 436.70 441.90 440.50 415.80 327.90 323.70 322.70 331.10 367.50 368.40 396.70 259.05 415.95 356.32 386.84 400.00 429.52 435.40 434.84 435.88 435.60 438.48 438.16 437.84 437.76
5 465 PEMB-51 CCGT 370.65 370.45 359.90 326.25 326.20 322.65 324.60 274.25 319.55 288.80 301.75 279.08 379.60 376.76 389.92 419.24 403.64 420.92 428.20 421.32 396.92 397.80 424.40 433.92 434.56 431.44 434.40
6 445 PEMB-41 CCGT 337.00 423.40 423.10 427.50 427.00 419.00 361.00 318.80 263.20 226.70 268.70 231.35 366.90 378.35 392.20 421.55 354.96 382.48 422.64 428.28 428.76 431.24 431.92 431.84 429.52 429.00 431.48
7 425 WBURB-1 CCGT 240.41 293.17 252.27 256.51 261.65 253.44 247.14 217.08 223.11 199.27 254.69 314.16 361.07 317.50 259.54 266.83 349.64 383.43 408.18 412.29 395.54 383.48 355.98 340.49 360.87 352.74 376.92
8 465 PEMB-31 CCGT 297.73 360.27 355.40 357.07 358.67 353.07 300.93 284.73 268.73 255.20 248.53 257.75 366.75 376.45 396.40 320.56 342.68 352.52 361.16 379.40 386.64 390.36 409.12 427.48 426.60 426.80 427.16
9 144 GRMO-1 CCGT 106.62 106.11 105.96 106.00 106.00 105.98 105.99 105.90 105.47 105.31 105.28 105.07 105.04 105.06 105.06 105.04 105.06 105.06 105.07 105.04 105.05 105.06 105.04 105.04 105.04 105.06 105.07
10 430 PEMB-11 CCGT 432.80 430.40 430.70 431.90 432.10 429.30 430.00 408.30 320.90 346.50 432.90 432.20 312.93 297.20 414.55 432.00 420.40 429.80 402.60 426.90 430.65 435.85 435.10 431.15 435.20 431.50 431.75
11 457 STAY-1 CCGT 216.07 223.27 232.67 243.47 234.67 221.73 227.00 219.00 237.00 218.33 250.73 228.27 219.67 142.68 243.00 300.64 312.28 331.00 360.84 379.28 398.92 410.04 410.56 409.24 411.96 408.84 411.88
12 455 GRAI-7 CCGT 425.20 425.40 377.90 339.40 342.00 329.80 408.00 402.40 329.00 257.30 130.43 211.37 262.60 318.45 299.98 324.72 350.40 386.26 394.20 402.10 390.48 401.22 388.94 394.10 395.14 379.70 377.26
13 710 DIDCB6 CCGT 465.80 459.50 411.60 411.70 413.70 410.80 351.50 333.40 333.70 390.40 234.60 265.56 348.16 430.28 524.32 554.04 536.28 589.28 594.04 597.72 592.76 557.86 687.70 687.25 687.35 687.25 679.80
14 400 SCCL-3 CCGT 311.50 337.40 378.80 311.50 381.30 338.60 302.70 300.70 300.60 300.70 338.20 321.50 363.80 260.35 228.18 308.70 334.73 324.60 354.63 362.38 347.30 306.22 346.86 365.04 365.40 370.68 370.52
400 SCCL-3 CCGT 311.50 337.40 378.80 311.50 381.30 338.60 302.70 300.70 300.60 300.70 338.20 321.50 363.80 260.35 228.18 308.70 334.73 324.60 354.63 362.38 347.30 306.22 346.86 365.04 365.40 370.68 370.52
16 440 CDCL-1 CCGT 270.63 255.24 210.87 197.10 195.12 198.72 197.64 198.99 233.19 221.31 176.94 317.52 280.68 213.12 297.68 342.25 397.26 372.28 371.74 379.87 347.51 348.48 352.15 384.88 395.14 381.02 360.40
17 457 STAY-3 CCGT 311.25 311.30 311.60 311.45 311.15 311.30 308.40 313.10 223.90 196.05 242.95 172.87 217.40 236.84 252.92 352.98 384.06 414.76 403.68 424.90 418.38 403.00 420.26 424.40 427.06 421.64 424.66
18 920 MRWD-1 CCGT 468.70 483.90 420.60 267.80 472.60 470.20 241.40 299.30 327.70 327.80 336.90 241.60 308.33 529.93 793.73 828.40 870.67 846.67 827.07 855.93 829.33 865.87 870.40 846.87 765.47 785.20 824.00
19 425 WBURB-3 CCGT 311.73 427.68 333.68 333.93 370.68 335.09 420.85 433.86 370.45 321.70 340.54 300.95 155.47 190.67 290.81 310.43 332.52 376.63 391.11 413.74 408.33 398.69 397.54 368.05 410.64 413.05 428.91
20 425 WBURB-2 CCGT 295.54 424.56 336.68 334.08 371.20 358.44 358.90 358.96 377.94 325.42 203.19 165.32 205.75 121.41 162.51 180.15 301.12 413.77 410.33 397.21 385.59 378.09 381.50 380.93 413.71 418.53 427.09
21 420 GYAR-1 CCGT 404.33 404.33 403.73 405.12 404.13 404.33 404.33 376.98 218.02 218.02 351.01 215.10 177.46 222.43 345.47 398.94 401.97 401.97 402.17 401.87 401.47 401.77 401.62 402.51 402.31 402.41 402.26
22 457 STAY-2 CCGT 434.20 435.40 435.40 435.20 434.20 434.20 434.20 434.60 249.80 196.20 291.20 234.80 196.80 88.73 167.10 239.52 324.52 372.80 412.40 423.32 424.04 423.96 423.92 424.08 423.88 420.96 422.44
23 400 KLYN-A-1 CCGT 382.58 382.50 384.94 385.81 385.83 385.79 385.02 384.94 259.16 141.03 195.65 205.75 278.81 256.95 296.85 337.82 369.26 376.38 376.84 376.56 376.30 376.09 375.62 375.45 375.11 375.17 375.09
24 420 SHOS-1 CCGT 290.63 326.33 229.60 265.70 269.05 259.40 299.45 310.20 301.65 266.00 307.90 319.30 253.06 246.85 263.04 220.46 277.68 297.84 290.62 297.86 302.83 295.13 293.73 289.04 306.14 314.24 321.76
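One way to sketch this (with tiny made-up frames standing in for the real b1610 and the rank table, so the numbers here are illustrative): average over all dates with pivot_table, then merge in the rank and sort by it.

```python
import pandas as pd

# Miniature stand-in for b1610: two units, two dates, two settlement periods
b1610 = pd.DataFrame({
    'settlementDate':   ['03/01/2022'] * 4 + ['04/01/2022'] * 4,
    'BM Unit ID 1':     ['FAWN-1', 'FAWN-1', 'GRAI-6', 'GRAI-6'] * 2,
    'settlementPeriod': [1, 2, 1, 2] * 2,
    'quantity':         [130.0, 131.0, 340.0, 342.0, 131.0, 130.0, 338.0, 344.0],
})
# Miniature stand-in for the per-unit ranking produced in df2/df3
ranks = pd.DataFrame({'BM Unit ID 1': ['FAWN-1', 'GRAI-6'], 'rank': [1.0, 2.0]})

# Mean over all dates, one column per settlement period (no date level in the index)
agg = b1610.pivot_table(index='BM Unit ID 1',
                        columns='settlementPeriod',
                        values='quantity',
                        aggfunc='mean')

# Attach each unit's rank and order the rows by it
out = (agg.reset_index()
          .merge(ranks, on='BM Unit ID 1')
          .sort_values('rank')
          .set_index(['rank', 'BM Unit ID 1']))
```

Leaving settlementDate out of the pivot_table index is what collapses the dates into a single mean per (unit, period), which is the key difference from the df2 pivot above.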

Adding column to dataframe based on another dataframe using pandas

I need to create a new column in dataframe based on intervals from another dataframe.
For example, I have a dataframe with values in its time column, and I want to create a column in another dataframe based on the intervals defined by that time column.
I think a practical example is simpler to understand:
Dataframe with intervals
df1
time value var2
0 1.0 34.0 35.0
1 4.0 754.0 755.0
2 9.0 768.0 769.0
3 12.0 65.0 66.0
Dataframe that I need to filter
df2
time value var2
0 1.0 23.0 23.0
1 2.0 43.0 43.0
2 3.0 76.0 12.0
3 4.0 88.0 22.0
4 5.0 64.0 45.0
5 6.0 98.0 33.0
6 7.0 76.0 11.0
7 8.0 56.0 44.0
8 9.0 23.0 22.0
9 10.0 54.0 44.0
10 11.0 65.0 22.0
11 12.0 25.0 25.0
should result
df3
time value var2 interval
0 1.0 23.0 23.0 1
1 2.0 43.0 43.0 1
2 3.0 76.0 12.0 1
3 4.0 88.0 22.0 1
4 5.0 64.0 45.0 2
5 6.0 98.0 33.0 2
6 7.0 76.0 11.0 2
7 8.0 56.0 44.0 2
8 9.0 23.0 22.0 2
9 10.0 54.0 44.0 3
10 11.0 65.0 22.0 3
11 12.0 25.0 25.0 3
EDIT: As Shubham Sharma said, it's not a filter, I want to add a new column based on intervals in other dataframe.
You can use pd.cut to categorize the time in df2 into discrete intervals based on the time in df1 then use Series.factorize to obtain a numeric array identifying distinct ordered values.
df2['interval'] = pd.cut(df2['time'], df1['time'], include_lowest=True)\
.factorize(sort=True)[0] + 1
Result:
time value var2 interval
0 1.0 23.0 23.0 1
1 2.0 43.0 43.0 1
2 3.0 76.0 12.0 1
3 4.0 88.0 22.0 1
4 5.0 64.0 45.0 2
5 6.0 98.0 33.0 2
6 7.0 76.0 11.0 2
7 8.0 56.0 44.0 2
8 9.0 23.0 22.0 2
9 10.0 54.0 44.0 3
10 11.0 65.0 22.0 3
11 12.0 25.0 25.0 3
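If consecutive interval numbers are all you need, a variant of the same idea is to let pd.cut assign the numbers directly via labels, skipping factorize:

```python
import pandas as pd

df1 = pd.DataFrame({'time': [1.0, 4.0, 9.0, 12.0]})
df2 = pd.DataFrame({'time': [float(t) for t in range(1, 13)]})

# Bin edges come from df1['time']; labels 1..n-1 number the intervals directly
df2['interval'] = pd.cut(df2['time'], bins=df1['time'],
                         labels=range(1, len(df1)),
                         include_lowest=True).astype(int)
```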

groupby two columns, then reset_index fails due to having the same name

I made the following groupby with my pandas dataframe:
df.groupby([df.Hora.dt.hour, df.Hora.dt.minute]).describe()['Qtd']
after groupby the data is as follows:
count mean std min 25% 50% 75% max
Hora Hora
9 0 11.0 5.909091 2.022600 5.000 5.0 5.0 5.00 10.0
1 197.0 6.421320 4.010210 5.000 5.0 5.0 5.00 30.0
2 125.0 6.040000 4.679054 5.000 5.0 5.0 5.00 50.0
3 131.0 6.450382 5.700491 5.000 5.0 5.0 5.00 60.0
4 182.0 6.401099 5.212458 5.000 5.0 5.0 5.00 50.0
5 147.0 6.054422 5.402666 5.000 5.0 5.0 5.00 60.0
6 59.0 6.779661 6.416756 5.000 5.0 5.0 5.00 45.0
7 16.0 6.875000 5.123475 5.000 5.0 5.0 5.00 25.0
when trying to use reset_index() I get an error, because the index names are the same:
ValueError: cannot insert Hora, already exists
How do I reset_index and get the data as follows:
Hora Minute count
9 0 11.0
9 1 197.0
9 2 125.0
9 3 131.0
9 4 182.0
9 5 147.0
9 6 59.0
9 7 16.0
You can first rename the duplicated index levels with rename_axis and then reset_index. Note that rename_axis must be applied to the aggregated result (not the original df), and selecting ['count'] alone would drop the new index columns again, so select all three:
agg = df.groupby([df.Hora.dt.hour, df.Hora.dt.minute]).describe()['Qtd']
(
    agg.rename_axis(index=['Hora', 'Minute'])
       .reset_index()
       [['Hora', 'Minute', 'count']]
)
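A self-contained run of that fix, on a small made-up frame with the same shape (timestamps in Hora, quantities in Qtd):

```python
import pandas as pd

df = pd.DataFrame({
    'Hora': pd.to_datetime(['2022-01-03 09:00', '2022-01-03 09:00',
                            '2022-01-03 09:01', '2022-01-03 09:01',
                            '2022-01-03 09:01']),
    'Qtd': [5, 10, 5, 5, 30],
})

agg = df.groupby([df.Hora.dt.hour, df.Hora.dt.minute]).describe()['Qtd']

# Both index levels are called "Hora" until rename_axis gives them distinct names
out = (agg.rename_axis(index=['Hora', 'Minute'])
          .reset_index()[['Hora', 'Minute', 'count']])
```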

Easy pythonic way to classify columns in groups and store it in Dictionary?

Machine_number Machine_Running_Hours
0 1.0 424.0
1 2.0 458.0
2 3.0 465.0
3 4.0 446.0
4 5.0 466.0
5 6.0 466.0
6 7.0 445.0
7 8.0 466.0
8 9.0 447.0
9 10.0 469.0
10 11.0 467.0
11 12.0 449.0
12 13.0 436.0
13 14.0 465.0
14 15.0 463.0
15 16.0 372.0
16 17.0 460.0
17 18.0 450.0
18 19.0 467.0
19 20.0 463.0
20 21.0 205.0
I am trying to classify according to machine number. Like Machine_number 1 to 5 will be one group. Then 6 to 10 in one group and so on.
You need to subtract 1 with sub and then floor-divide by 5 with floordiv:
df['g'] = df.Machine_number.sub(1).floordiv(5)
# same as //
# df['g'] = df.Machine_number.sub(1) // 5
print(df)
Machine_number Machine_Running_Hours g
0 1.0 424.0 -0.0
1 2.0 458.0 0.0
2 3.0 465.0 0.0
3 4.0 446.0 0.0
4 5.0 466.0 0.0
5 6.0 466.0 1.0
6 7.0 445.0 1.0
7 8.0 466.0 1.0
8 9.0 447.0 1.0
9 10.0 469.0 1.0
10 11.0 467.0 2.0
11 12.0 449.0 2.0
12 13.0 436.0 2.0
13 14.0 465.0 2.0
14 15.0 463.0 2.0
15 16.0 372.0 3.0
16 17.0 460.0 3.0
17 18.0 450.0 3.0
18 19.0 467.0 3.0
19 20.0 463.0 3.0
20 21.0 205.0 4.0
If you need to store the groups in a dictionary, use groupby with a dict comprehension:
dfs = {i:g for i, g in df.groupby(df.Machine_number.astype(int).sub(1).floordiv(5))}
print (dfs)
{0: Machine_number Machine_Running_Hours
0 1.0 424.0
1 2.0 458.0
2 3.0 465.0
3 4.0 446.0
4 5.0 466.0, 1: Machine_number Machine_Running_Hours
5 6.0 466.0
6 7.0 445.0
7 8.0 466.0
8 9.0 447.0
9 10.0 469.0, 2: Machine_number Machine_Running_Hours
10 11.0 467.0
11 12.0 449.0
12 13.0 436.0
13 14.0 465.0
14 15.0 463.0, 3: Machine_number Machine_Running_Hours
15 16.0 372.0
16 17.0 460.0
17 18.0 450.0
18 19.0 467.0
19 20.0 463.0, 4: Machine_number Machine_Running_Hours
20 21.0 205.0}
print (dfs[0])
Machine_number Machine_Running_Hours
0 1.0 424.0
1 2.0 458.0
2 3.0 465.0
3 4.0 446.0
4 5.0 466.0
print (dfs[1])
Machine_number Machine_Running_Hours
5 6.0 466.0
6 7.0 445.0
7 8.0 466.0
8 9.0 447.0
9 10.0 469.0
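If you would rather key the dictionary by readable ranges such as "1-5" instead of 0, 1, 2, ..., the group number can be turned into a label first (a small extension of the answer above; the label format is just one choice):

```python
import pandas as pd

df = pd.DataFrame({'Machine_number': range(1, 12),
                   'Machine_Running_Hours': [424, 458, 465, 446, 466, 466,
                                             445, 466, 447, 469, 467]})

g = df.Machine_number.sub(1).floordiv(5).astype(int)
# Turn group number 0, 1, 2 into "1-5", "6-10", "11-15"
labels = (g * 5 + 1).astype(str) + '-' + (g * 5 + 5).astype(str)
dfs = {k: grp for k, grp in df.groupby(labels)}
```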
