How to extract hourly data from a df in Python?

I have the following df
dates Final
2020-01-01 00:15:00 94.7
2020-01-01 00:30:00 94.1
2020-01-01 00:45:00 94.1
2020-01-01 01:00:00 95.0
2020-01-01 01:15:00 96.6
2020-01-01 01:30:00 98.4
2020-01-01 01:45:00 99.8
2020-01-01 02:00:00 99.8
2020-01-01 02:15:00 98.0
2020-01-01 02:30:00 95.1
2020-01-01 02:45:00 91.9
2020-01-01 03:00:00 89.5
The entire dataset runs until 2021-01-01 00:00:00 (value 95.6), with a 15-minute gap between rows.
Since the frequency is 15 minutes, I would like to change it to 1 hour and drop the intermediate values.
Expected output
dates Final
2020-01-01 01:00:00 95.0
2020-01-01 02:00:00 99.8
2020-01-01 03:00:00 89.5
With the last row being 2021-01-01 00:00:00 95.6
How can this be done?
Thanks

Use Series.dt.minute to perform boolean indexing:
df_filtered = df.loc[df['dates'].dt.minute.eq(0)]
# if 'dates' is not a datetime column yet:
#df_filtered = df.loc[pd.to_datetime(df['dates']).dt.minute.eq(0)]
print(df_filtered)
dates Final
3 2020-01-01 01:00:00 95.0
7 2020-01-01 02:00:00 99.8
11 2020-01-01 03:00:00 89.5

If you're doing data analysis or data science, dropping the intermediate values may not be the best approach. Depending on your use case, aggregating them per hour (for example summing or averaging) usually preserves more information in time-series data.
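For example, resample can roll the 15-minute rows up into hourly bins. A minimal sketch (the toy frame below just mirrors the sample data; which aggregation you pick depends on what each hourly value should represent):
import pandas as pd

df = pd.DataFrame({
    'dates': pd.date_range('2020-01-01 00:15:00', periods=12, freq='15min'),
    'Final': [94.7, 94.1, 94.1, 95.0, 96.6, 98.4, 99.8, 99.8, 98.0, 95.1, 91.9, 89.5],
})

hourly = (
    df.set_index('dates')
      .resample('1H', closed='right', label='right')   # 00:15-01:00 falls into the bin labelled 01:00
      .mean()                                          # .last() reproduces the question's expected output; .sum() is another option
      .reset_index()
)
print(hourly)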

Related

Convert pandas DataFrame to nested JSON array

I have a df as follows:
dates values
0 2020-01-01 00:15:00 25.7
1 2020-01-01 00:30:00 25.0
2 2020-01-01 00:45:00 24.6
3 2020-01-01 01:00:00 24.6
4 2020-01-01 01:15:00 25.0
5 2020-01-01 01:30:00 25.6
6 2020-01-01 01:45:00 26.2
7 2020-01-01 02:00:00 26.5
8 2020-01-01 02:15:00 26.3
9 2020-01-01 02:30:00 25.7
and I do:
df = df.to_json(orient='records', date_unit='s')
print({'items' : df})
It gives me output as follows:
{
"items": "[{\"dates\":1577837700,\"values\":25.7},{\"dates\":1577838600,\"values\":25.0},{\"dates\":1577839500,\"values\":24.6},{\"dates\":1577840400,\"values\":24.6},{\"dates\":1577841300,\"values\":25.0},{\"dates\":1577842200,\"values\":25.6},{\"dates\":1577843100,\"values\":26.2},{\"dates\":1577844000,\"values\":26.5},{\"dates\":1577844900,\"values\":26.3},{\"dates\":1577845800,\"values\":25.7}]" }
I want the output to look like
{
"items": [[1577837700, 25.7],[1577838600,25.0],[1577839500,24.6],[1577840400,24.6],[1577841300,25.0],[1577842200,25.6],[1577843100,26.2],[1577844000,26.5],[1577844900,26.3],[1577845800,25.7]] }
That is, instead of the JSON string of record dicts my code produces, I want "items" to hold the records as a list of lists.
Is there a way I can do it?
Try orient="values":
df.to_json(orient='values', date_unit='s')
'[[1577837700,25.7],[1577838600,25.0],[1577839500,24.6],[1577840400,24.6],[1577841300,25.0],[1577842200,25.6],[1577843100,26.2],[1577844000,26.5],[1577844900,26.3],[1577845800,25.7]]'
This just dumps out the .values (or .to_numpy) output as json. See DataFrame.to_json for more.
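If the goal is for "items" to hold an actual list rather than a JSON string, one option is to decode the string with json.loads before building the outer dict. A small sketch, with the toy frame mirroring the df from the question:
import json
import pandas as pd

df = pd.DataFrame({
    'dates': pd.date_range('2020-01-01 00:15:00', periods=10, freq='15min'),
    'values': [25.7, 25.0, 24.6, 24.6, 25.0, 25.6, 26.2, 26.5, 26.3, 25.7],
})

payload = {'items': json.loads(df.to_json(orient='values', date_unit='s'))}
print(payload)   # {'items': [[1577837700, 25.7], [1577838600, 25.0], ...]}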

How to add values occurring between 2 consecutive hours?

I have a df as follows:
dates values
2020-01-01 00:15:00 87.321
2020-01-01 00:30:00 87.818
2020-01-01 00:45:00 88.514
2020-01-01 01:00:00 89.608
2020-01-01 01:15:00 90.802
2020-01-01 01:30:00 91.896
2020-01-01 01:45:00 92.393
2020-01-01 02:00:00 91.995
2020-01-01 02:15:00 90.504
2020-01-01 02:30:00 88.216
2020-01-01 02:45:00 85.929
2020-01-01 03:00:00 84.238
I want to keep only the hourly values (where the minute is 00), with each hourly value being the sum of the values in the preceding hour.
Example: For the value at 2020-01-01 01:00:00, the values from 2020-01-01 00:15:00 to 2020-01-01 01:00:00 should be added (87.321+87.818+88.514+89.608 = 353.261). Similarly, for the value at 2020-01-01 02:00:00, the values from 2020-01-01 01:15:00 to 2020-01-01 02:00:00 should be added (90.802+91.896+92.393+91.995 = 367.086).
Desired output
dates values
2020-01-01 01:00:00 353.261
2020-01-01 02:00:00 367.086
2020-01-01 03:00:00 348.887
I used df['dates'].dt.minute.eq(0) to obtain the boolean mask, but I cannot find a way to add the values.
Thanks in advance
hourly = (
    df.set_index('dates')                               # set the dates as the index
      .resample('1H', closed='right', label='right')    # one bin per hour, labelled by its right edge
      .sum()                                            # sum the values within each hour
)
hourly = hourly.reset_index()                           # if you want the dates as a column again
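If you prefer to stay with the column-based approach you started from, grouping on the hour each timestamp rolls up into gives the same result. A small sketch, assuming 'dates' is a datetime64 column (dt.ceil('H') sends 00:15-01:00 all to 01:00):
import pandas as pd

df = pd.DataFrame({
    'dates': pd.date_range('2020-01-01 00:15:00', periods=12, freq='15min'),
    'values': [87.321, 87.818, 88.514, 89.608, 90.802, 91.896,
               92.393, 91.995, 90.504, 88.216, 85.929, 84.238],
})

hourly = (
    df.assign(hour_end=df['dates'].dt.ceil('H'))        # 00:15, 00:30, 00:45, 01:00 -> 01:00
      .groupby('hour_end', as_index=False)['values'].sum()
      .rename(columns={'hour_end': 'dates'})
)
print(hourly)   # 353.261, 367.086, 348.887 for 01:00, 02:00 and 03:00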

How to fill the first date in the column?

I have a df:
dates values
2020-01-01 00:15:00 38.61487
2020-01-01 00:30:00 36.905204
2020-01-01 00:45:00 35.136584
2020-01-01 01:00:00 33.60378
2020-01-01 01:15:00 32.306791999999994
2020-01-01 01:30:00 31.304574
I am creating a new column named start as follows:
df = df.rename(columns={'dates': 'end'})
df['start']= df['end'].shift(1)
When I do this, I get the following:
end values start
2020-01-01 00:15:00 38.61487 NaT
2020-01-01 00:30:00 36.905204 2020-01-01 00:15:00
2020-01-01 00:45:00 35.136584 2020-01-01 00:30:00
2020-01-01 01:00:00 33.60378 2020-01-01 00:45:00
2020-01-01 01:15:00 32.306791999999994 2020-01-01 01:00:00
2020-01-01 01:30:00 31.304574 2020-01-01 01:15:00
I want to fill that NaT value with
2020-01-01 00:00:00
How can this be done?
Use Series.fillna with a datetime, e.g. a Timestamp:
df['start'] = df['end'].shift().fillna(pd.Timestamp('2020-01-01'))
Or, with pandas 0.24+, use the fill_value parameter:
df['start'] = df['end'].shift(fill_value=pd.Timestamp('2020-01-01'))
If the datetimes are regular (always 15 minutes apart), it is also possible to subtract an offsets.DateOffset:
df['start'] = df['end'] - pd.offsets.DateOffset(minutes=15)
print (df)
end values start
0 2020-01-01 00:15:00 38.614870 2020-01-01 00:00:00
1 2020-01-01 00:30:00 36.905204 2020-01-01 00:15:00
2 2020-01-01 00:45:00 35.136584 2020-01-01 00:30:00
3 2020-01-01 01:00:00 33.603780 2020-01-01 00:45:00
4 2020-01-01 01:15:00 32.306792 2020-01-01 01:00:00
5 2020-01-01 01:30:00 31.304574 2020-01-01 01:15:00
How about this? Infer the step from the data instead of hard-coding it:
df = pd.DataFrame({'end': pd.date_range(start=pd.Timestamp(2019, 1, 1, 0, 15),
                                        end=pd.Timestamp(2019, 1, 2), freq='15min')})
df.loc[:, 'start'] = df.loc[:, 'end'].shift(1)
delta = df.loc[df.index[3], 'end'] - df.loc[df.index[2], 'end']   # infer the 15-minute step from the data
df.loc[df.index[0], 'start'] = df.loc[df.index[1], 'start'] - delta
df
end start
0 2019-01-01 00:15:00 2019-01-01 00:00:00
1 2019-01-01 00:30:00 2019-01-01 00:15:00
2 2019-01-01 00:45:00 2019-01-01 00:30:00
3 2019-01-01 01:00:00 2019-01-01 00:45:00
4 2019-01-01 01:15:00 2019-01-01 01:00:00
... ... ...
91 2019-01-01 23:00:00 2019-01-01 22:45:00
92 2019-01-01 23:15:00 2019-01-01 23:00:00
93 2019-01-01 23:30:00 2019-01-01 23:15:00
94 2019-01-01 23:45:00 2019-01-01 23:30:00
95 2019-01-02 00:00:00 2019-01-01 23:45:00
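Combining the two ideas, the fill value can also be derived from the data itself instead of being hard-coded, as long as the spacing is regular. A small sketch (the frame below mirrors the sample):
import pandas as pd

df = pd.DataFrame({'end': pd.date_range('2020-01-01 00:15:00', periods=6, freq='15min'),
                   'values': [38.61487, 36.905204, 35.136584, 33.60378, 32.306792, 31.304574]})

step = df['end'].iloc[1] - df['end'].iloc[0]                        # infer the 15-minute spacing from the data
df['start'] = df['end'].shift(fill_value=df['end'].iloc[0] - step)  # first row gets 2020-01-01 00:00:00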

Using pandas.resample().agg() with 'interpolate'

I need to resample a df, with different columns aggregated by different functions.
import pandas as pd
import numpy as np

df = pd.DataFrame(index=pd.date_range(start='2020-01-01 00:00:00', end='2020-01-02 00:00:00', freq='3H'),
                  data=np.random.rand(9, 3),
                  columns=['A', 'B', 'C'])
df = df.resample('1H').agg({'A': 'ffill',
                            'B': 'interpolate',
                            'C': 'max'})
Functions like 'mean', 'max', 'sum' work.
But it seems that 'interpolate' can't be used like that.
Any workarounds?
One work-around would be to concat the different aggregated frames:
df_out = pd.concat([df[['A']].resample('1H').ffill(),
                    df[['B']].resample('1H').interpolate(),
                    df[['C']].resample('1H').max().ffill()],
                   axis=1)
[out]
A B C
2020-01-01 00:00:00 0.836547 0.436186 0.520913
2020-01-01 01:00:00 0.836547 0.315646 0.520913
2020-01-01 02:00:00 0.836547 0.195106 0.520913
2020-01-01 03:00:00 0.577291 0.074566 0.754697
2020-01-01 04:00:00 0.577291 0.346092 0.754697
2020-01-01 05:00:00 0.577291 0.617617 0.754697
2020-01-01 06:00:00 0.490666 0.889143 0.685191
2020-01-01 07:00:00 0.490666 0.677584 0.685191
2020-01-01 08:00:00 0.490666 0.466025 0.685191
2020-01-01 09:00:00 0.603678 0.254466 0.605424
2020-01-01 10:00:00 0.603678 0.358240 0.605424
2020-01-01 11:00:00 0.603678 0.462014 0.605424
2020-01-01 12:00:00 0.179458 0.565788 0.596706
2020-01-01 13:00:00 0.179458 0.477367 0.596706
2020-01-01 14:00:00 0.179458 0.388946 0.596706
2020-01-01 15:00:00 0.702992 0.300526 0.476644
2020-01-01 16:00:00 0.702992 0.516952 0.476644
2020-01-01 17:00:00 0.702992 0.733378 0.476644
2020-01-01 18:00:00 0.884276 0.949804 0.793237
2020-01-01 19:00:00 0.884276 0.907233 0.793237
2020-01-01 20:00:00 0.884276 0.864661 0.793237
2020-01-01 21:00:00 0.283859 0.822090 0.186542
2020-01-01 22:00:00 0.283859 0.834956 0.186542
2020-01-01 23:00:00 0.283859 0.847822 0.186542
2020-01-02 00:00:00 0.410897 0.860688 0.894249
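Since the index here is only being upsampled (3H to 1H), another possible workaround is a single asfreq followed by per-column fix-ups; a sketch, assuming the same frame as in the question:
import numpy as np
import pandas as pd

df = pd.DataFrame(index=pd.date_range(start='2020-01-01 00:00:00', end='2020-01-02 00:00:00', freq='3H'),
                  data=np.random.rand(9, 3),
                  columns=['A', 'B', 'C'])

up = df.resample('1H').asfreq()        # 1H grid with NaN between the original 3H points
df_out = pd.DataFrame({
    'A': up['A'].ffill(),              # same as resample('1H').ffill()
    'B': up['B'].interpolate(),        # linear interpolation between the 3H points
    'C': up['C'].ffill(),              # for pure upsampling, the per-hour 'max' is just the carried-forward value
})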

How can I get a conditional hourly mean in pandas?

I have the DataFrame below and I want to make an hourly mean DataFrame,
with the condition that for each hour only the values from 00:15:00 to 00:45:00 are averaged.
date/time form a MultiIndex.
aaa
date time
2017-01-01 00:00:00 146.88
00:15:00 143.28
00:30:00 143.28
00:45:00 141.12
01:00:00 134.64
01:15:00 132.48
01:30:00 136.80
01:45:00 138.24
02:00:00 131.76
02:15:00 131.04
02:30:00 134.64
02:45:00 139.68
03:00:00 136.08
03:15:00 132.48
03:30:00 132.48
03:45:00 139.68
04:00:00 134.64
04:15:00 131.04
04:30:00 160.56
04:45:00 177.12
...
The results should be as below. How can I do it?
aaa
date time
2017-01-01 00:00:00 146.88
01:00:00 134.64
02:00:00 131.76
03:00:00 136.08
04:00:00 134.64
...
It seems you only need to select rows whose times end with 00:00:
df2 = df1[df1.index.get_level_values(1).astype(str).str.endswith('00:00')]
print (df2)
aaa
date time
2017-01-01 00:00:00 146.88
01:00:00 134.64
02:00:00 131.76
03:00:00 136.08
04:00:00 134.64
But if you need the mean of only the 00:15-00:45 values, it is more complicated:
lvl1 = pd.Series(df1.index.get_level_values(1))
m = ~lvl1.astype(str).str.endswith('00:00')            # mask of the 15/30/45-minute rows
lvl1new = lvl1.mask(m).ffill()                         # label those rows with the previous full hour
df1.index = pd.MultiIndex.from_arrays([df1.index.get_level_values(0),
                                       lvl1new.where(m)],   # NaN for on-the-hour rows, so they drop out of the mean
                                      names=df1.index.names)
print (df1)
aaa
date time
2017-01-01 NaN 146.88
00:00:00 143.28
00:00:00 143.28
00:00:00 141.12
NaN 134.64
01:00:00 132.48
01:00:00 136.80
01:00:00 138.24
NaN 131.76
02:00:00 131.04
02:00:00 134.64
02:00:00 139.68
NaN 136.08
03:00:00 132.48
03:00:00 132.48
03:00:00 139.68
NaN 134.64
04:00:00 131.04
04:00:00 160.56
04:00:00 177.12
df = df1['aaa'].groupby(level=[0,1]).mean()
print (df)
date time
2017-01-01 00:00:00 142.56
01:00:00 135.84
02:00:00 135.12
03:00:00 134.88
04:00:00 156.24
Name: aaa, dtype: float64
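If the date and time levels can be combined into real timestamps, the same means can also be computed on a flat DatetimeIndex. A sketch under that assumption (the toy index below uses timedeltas for the time level; adapt the combination step if your level holds strings or time objects):
import pandas as pd

# A few rows mirroring the sample, with a date/time MultiIndex
idx = pd.MultiIndex.from_product(
    [[pd.Timestamp('2017-01-01')],
     pd.timedelta_range('00:00:00', '01:45:00', freq='15min')],
    names=['date', 'time'])
df1 = pd.DataFrame({'aaa': [146.88, 143.28, 143.28, 141.12,
                            134.64, 132.48, 136.80, 138.24]}, index=idx)

ts = df1.index.get_level_values('date') + df1.index.get_level_values('time')  # combine the levels
s = pd.Series(df1['aaa'].to_numpy(), index=ts)
quarters = s[s.index.minute != 0]                      # keep only the 15/30/45-minute rows
out = quarters.groupby(quarters.index.floor('H')).mean()
print(out)   # 2017-01-01 00:00:00 -> 142.56, 2017-01-01 01:00:00 -> 135.84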
