I have a pandas time series dataframe with a value for each hour of the day over an extended period, like this:
value
datetime
2018-01-01 00:00:00 38
2018-01-01 01:00:00 31
2018-01-01 02:00:00 78
2018-01-01 03:00:00 82
2018-01-01 04:00:00 83
2018-01-01 05:00:00 95
...
I want to create a new dataframe with the minimum value between hours 01:00 and 04:00 for each day, but I can't figure out how to do this. The closest I can think of is:
df2 = df.groupby([pd.Grouper(freq='d'), df.between_time('01:00', '04:00')]).min()
but that gives me:
ValueError: Grouper for '' not 1-dimensional
Use DataFrame.between_time with DataFrame.resample:
df = df.between_time('01:00', '04:00').resample('d').min()
print(df)
value
datetime
2018-01-01 31
Your solution is very close; just chain the functions differently:
df = df.between_time('01:00', '04:00').groupby(pd.Grouper(freq='d')).min()
print(df)
value
datetime
2018-01-01 31
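For reference, a minimal end-to-end sketch reproducing both answers on the sample data above (note that between_time is inclusive of both endpoints by default, so the 04:00 row itself is included):
import pandas as pd
# Rebuild the sample frame with an hourly DatetimeIndex
idx = pd.date_range('2018-01-01', periods=6, freq='H')
df = pd.DataFrame({'value': [38, 31, 78, 82, 83, 95]}, index=idx)
df.index.name = 'datetime'
# Both approaches give the daily minimum over 01:00-04:00
res_resample = df.between_time('01:00', '04:00').resample('d').min()
res_groupby = df.between_time('01:00', '04:00').groupby(pd.Grouper(freq='d')).min()
print(res_resample)  # 2018-01-01 -> 31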
I have two columns, startDate and endDate.
I need to count how many times each hour from 0 to 23 occurs between these dates.
For example, the start date is 2000-12-05 10:00:00 and the end date is 2001-01-15 15:00:00.
I need to calculate how many times each hour 0 to 23 occurred between these two dates in Python.
I took the difference between the dates and calculated the number of hours from the difference.
After that I plan to extract the start hour from startDate and step from it through the computed number of hours to get the endHour,
then iterate through a dictionary to increase the counts. Is there another approach I can use?
df['diff'] = df['endDate'] - df['startDate']
df['hours'] = df['diff'] / np.timedelta64(1, 'h')
from datetime import datetime
X = (datetime.strptime('2020-01-05 01:19:49', '%Y-%m-%d %H:%M:%S') -
     datetime.strptime('2020-01-02 06:12:44', '%Y-%m-%d %H:%M:%S'))
print(X)
You can do:
>>> df['diff'] = df['endDate'] - df['startDate']
>>> df['hours'] = df['diff'].dt.components.hours
This assumes the differences are pd.Timedelta objects. Note that .dt.components.hours returns just the hours component (0-23) left over after whole days; for the total number of hours, use the division by np.timedelta64(1, 'h') shown in the question.
>>> ts = pd.date_range('2018-01-01', periods=5, freq='H')
>>> df = pd.DataFrame({'ts': ts, 'ts_2': ts + pd.Timedelta(hours=1)})
>>> df
ts ts_2
0 2018-01-01 00:00:00 2018-01-01 01:00:00
1 2018-01-01 01:00:00 2018-01-01 02:00:00
2 2018-01-01 02:00:00 2018-01-01 03:00:00
3 2018-01-01 03:00:00 2018-01-01 04:00:00
4 2018-01-01 04:00:00 2018-01-01 05:00:00
>>> df['hour'] = (df['ts_2'] - df['ts']).dt.components.hours
>>> df
ts ts_2 hour
0 2018-01-01 00:00:00 2018-01-01 01:00:00 1
1 2018-01-01 01:00:00 2018-01-01 02:00:00 1
2 2018-01-01 02:00:00 2018-01-01 03:00:00 1
3 2018-01-01 03:00:00 2018-01-01 04:00:00 1
4 2018-01-01 04:00:00 2018-01-01 05:00:00 1
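The components accessor above gives the pieces of a single difference; if the goal is the count of each clock hour (0 to 23) between the two timestamps, one sketch is to enumerate the hourly marks with pd.date_range and tally them (assuming each whole hour mark between the dates counts once):
>>> import pandas as pd
>>> start = pd.Timestamp('2000-12-05 10:00:00')
>>> end = pd.Timestamp('2001-01-15 15:00:00')
>>> hours = pd.date_range(start, end, freq='H')  # one entry per whole hour
>>> counts = hours.hour.value_counts().sort_index()  # index 0..23, values are occurrence counts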
I have a time series and I want to group the rows by hour of day (regardless of date) and visualize these as boxplots. So I'd want 24 boxplots starting from hour 1, then hour 2, then hour 3 and so on.
The way I see this working is splitting the dataset up into 24 series (1 for each hour of the day), creating a boxplot for each series and then plotting this on the same axes.
The only way I can think of to do this is to manually select all the values between each hour. Is there a faster way?
Some sample data:
Date Actual Consumption
2018-01-01 00:00:00 47.05
2018-01-01 00:15:00 46
2018-01-01 00:30:00 44
2018-01-01 00:45:00 45
2018-01-01 01:00:00 43.5
2018-01-01 01:15:00 43.5
2018-01-01 01:30:00 43
2018-01-01 01:45:00 42.5
2018-01-01 02:00:00 43
2018-01-01 02:15:00 42.5
2018-01-01 02:30:00 41
2018-01-01 02:45:00 42.5
2018-01-01 03:00:00 42.04
2018-01-01 03:15:00 41.96
2018-01-01 03:30:00 44
2018-01-01 03:45:00 44
2018-01-01 04:00:00 43.54
2018-01-01 04:15:00 43.46
2018-01-01 04:30:00 43.5
2018-01-01 04:45:00 43
2018-01-01 05:00:00 42.04
This is what I've tried so far:
zero = df.between_time('00:00', '00:59')
one = df.between_time('01:00', '01:59')
two = df.between_time('02:00', '02:59')
and then I would plot a boxplot for each of these on the same axes. However, it's very tedious to do this for all 24 hours in a day.
This is the kind of output I want:
https://www.researchgate.net/figure/Boxplot-of-the-NOx-data-by-hour-of-the-day_fig1_24054015
There are two steps to achieve this:
Convert Actual to datetime:
df.Actual = pd.to_datetime(df.Actual)
Group by the hour:
df.groupby([df.Date, df.Actual.dt.hour+1]).Consumption.sum().reset_index()
I assumed you wanted to sum the Consumption (if you want the mean or something else, just change it). One note: hour + 1 makes the hours start from 1 instead of 0 (remove it if you want 0 to be midnight).
Desired result:
Date Actual Consumption
0 2018-01-01 1 182.05
1 2018-01-01 2 172.50
2 2018-01-01 3 169.00
3 2018-01-01 4 172.00
4 2018-01-01 5 173.50
5 2018-01-01 6 42.04
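For the boxplots the question actually asks for, pandas can also plot directly by hour; a minimal sketch, assuming the timestamps live in a Date column and the value column is named 'Actual Consumption':
import matplotlib.pyplot as plt
import pandas as pd
df['Date'] = pd.to_datetime(df['Date'])
df['hour'] = df['Date'].dt.hour + 1  # 1-24, matching the convention above
df.boxplot(column='Actual Consumption', by='hour')  # one box per hour of day
plt.show()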
I have a CSV file that has a column that has values like:
10/23/2018 11:00:00 PM
I want to convert these values strictly by time, creating a new column that takes the time of the entry (11:00:00, etc.) and changes it into an hour-ending time.
Example looks like:
11:00:00 PM to 12:00:00 AM = 24, 12:00:00 AM to 1:00:00 AM = 1, 1:00:00 AM to 2:00:00 AM = 2 .....etc
I'm looking for a simple way to calculate these by indexing them based on this conversion.
My first pseudocode idea is to do something like grabbing the column df['Date'] and finding out what the time is:
file = pd.read_csv()
def conv(n):
    date_time = n.iloc[1, 1]  # position of the date-time column in the file
    for i in date_time:
        time = date_time[11:]  # point in the line where the time begins
Unsure how to proceed.
You can also do this:
import pandas as pd
data = '''
10/23/2018 11:00:00 PM
10/23/2018 12:00:00 AM
'''.strip().split('\n')
df = pd.DataFrame(data, columns=['date'])
df['date'] = pd.to_datetime(df['date'])
#df['pad1hour'] = df['date'].dt.hour+1
#or
df['pad1hour'] = df['date'] + pd.Timedelta('1 hours')
# I prefer the second as you can add whatever interval e.g. '1 days 3 minutes'
print(df['pad1hour'].dt.time)
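As a follow-up, if the 1-24 hour-ending number is wanted rather than the shifted time itself, one option on top of the padded column (hour_ending is just an illustrative name):
df['hour_ending'] = df['pad1hour'].dt.hour.replace(0, 24)  # 23:00 pads to 00:00, i.e. hour-ending 24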
You should convert to a datetime with pd.to_datetime(df.your_col) (your format will be parsed automatically, though specifying it can improve speed), and then you can use the .dt.hour accessor.
import pandas as pd
# Sample Data
df = pd.DataFrame({'date': pd.date_range('2018-01-01', '2018-01-03', freq='30min')})
df['hour'] = df.date.dt.hour + 1
print(df.sample(20))
date hour
95 2018-01-02 23:30:00 24
66 2018-01-02 09:00:00 10
82 2018-01-02 17:00:00 18
80 2018-01-02 16:00:00 17
75 2018-01-02 13:30:00 14
83 2018-01-02 17:30:00 18
49 2018-01-02 00:30:00 1
47 2018-01-01 23:30:00 24
30 2018-01-01 15:00:00 16
52 2018-01-02 02:00:00 3
29 2018-01-01 14:30:00 15
86 2018-01-02 19:00:00 20
59 2018-01-02 05:30:00 6
65 2018-01-02 08:30:00 9
92 2018-01-02 22:00:00 23
8 2018-01-01 04:00:00 5
91 2018-01-02 21:30:00 22
10 2018-01-01 05:00:00 6
89 2018-01-02 20:30:00 21
51 2018-01-02 01:30:00 2
This is the best way to do it:
from datetime import timedelta
import pandas as pd
file = pd.read_csv()
Case One: If you want to keep the date
file['New datetime'] = file['Date_time'].apply(lambda x: pd.to_datetime(x) + timedelta(hours=1))
Case Two: If you just want the time
file['New time'] = file['Date_time'].apply(lambda x: (pd.to_datetime(x) + timedelta(hours=1)).time())
If you need the column's data type to be a string instead of a Timestamp, you can convert it to a readable string with:
file['New time'] = file['New time'].astype(str)
Hope it helps.
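A side note on the apply-based version above: the same result can usually be had with vectorized operations, which are faster on large frames. A sketch, assuming the same Date_time column:
parsed = pd.to_datetime(file['Date_time'])
file['New datetime'] = parsed + pd.Timedelta(hours=1)  # keeps the date
file['New time'] = (parsed + pd.Timedelta(hours=1)).dt.time  # time only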
Sorry, I am new to asking questions on Stack Overflow, so I don't understand how to format properly.
I'm given a pandas dataframe with a datetime column containing the date and the time, and an associated column that contains some value. The given dates and times are incremented by the hour. I would like to manipulate the dataframe so they increment every 15 minutes while retaining the same value. How would I do that? Thanks!
I have tried:
df = df.asfreq('15Min', method='ffill')
But I get a error:
"TypeError: Cannot compare type 'Timestamp' with type 'long'"
current dataframe:
datetime value
00:00:00 1
01:00:00 2
new dataframe:
datetime value
00:00:00 1
00:15:00 1
00:30:00 1
00:45:00 1
01:00:00 2
01:15:00 2
01:30:00 2
01:45:00 2
Update:
The accepted answer below works, but so does the initial code I tried above, df = df.asfreq('15Min', method='ffill'). I was messing around with other dataframes and seemed to be having trouble with some null values, so I took care of that with fillna statements and everything worked.
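For completeness, a minimal sketch of that asfreq route, assuming datetime has been parsed and set as the index:
import pandas as pd
df = pd.DataFrame({'datetime': ['2018-01-01 00:00:00', '2018-01-01 01:00:00'],
                   'value': [1, 2]})
df = df.set_index(pd.to_datetime(df.pop('datetime')))
# Upsample to 15 minutes with forward-fill; note the new index stops at the last
# original timestamp, so the trailing 01:15-01:45 rows still need the padding
# trick shown in the answer below
print(df.asfreq('15Min', method='ffill'))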
You can use a TimedeltaIndex, but it is necessary to manually add the last value so the reindex covers the final hour:
df['datetime'] = pd.to_timedelta(df['datetime'])
df = df.set_index('datetime')
tr = pd.timedelta_range(df.index.min(),
                        df.index.max() + pd.Timedelta(45*60, unit='s'),
                        freq='15Min')
df = df.reindex(tr, method='ffill')
print(df)
value
00:00:00 1
00:15:00 1
00:30:00 1
00:45:00 1
01:00:00 2
01:15:00 2
01:30:00 2
01:45:00 2
Another solution uses resample, with the same problem: a new value must be appended so the last values are filled in correctly:
df['datetime'] = pd.to_timedelta(df['datetime'])
df = df.set_index('datetime')
df.loc[df.index.max() + pd.Timedelta(1, unit='h')] = 1
df = df.resample('15Min').ffill().iloc[:-1]
print(df)
value
datetime
00:00:00 1
00:15:00 1
00:30:00 1
00:45:00 1
01:00:00 2
01:15:00 2
01:30:00 2
01:45:00 2
But if the values are datetimes:
print(df)
datetime value
0 2018-01-01 00:00:00 1
1 2018-01-01 01:00:00 2
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')
tr = pd.date_range(df.index.min(),
                   df.index.max() + pd.Timedelta(45*60, unit='s'),
                   freq='15Min')
df = df.reindex(tr, method='ffill')
Or the same with resample, again starting from the original df:
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')
df.loc[df.index.max() + pd.Timedelta(1, unit='h')] = 1
df = df.resample('15Min').ffill().iloc[:-1]
print(df)
value
datetime
2018-01-01 00:00:00 1
2018-01-01 00:15:00 1
2018-01-01 00:30:00 1
2018-01-01 00:45:00 1
2018-01-01 01:00:00 2
2018-01-01 01:15:00 2
2018-01-01 01:30:00 2
2018-01-01 01:45:00 2
You can use pandas.date_range:
pd.date_range('00:00:00', '01:00:00', freq='15T')
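A short usage sketch of how such a range can drive the upsampling (using full dates here, since time-only strings get today's date filled in by default):
marks = pd.date_range('2018-01-01 00:00:00', '2018-01-01 01:00:00', freq='15T')
df_15min = df.reindex(marks, method='ffill')  # assumes df has a matching DatetimeIndex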
If df is a DataFrame indexed by DateTime objects, the following code splits it into the list groups_list, where each element contains all the data in df that belongs to a given day:
groupby_clause = [df.index.year,df.index.month,df.index.day]
groups_list = [group[1] for group in df.groupby(groupby_clause)]
I am having trouble, though, understanding how the grouping is actually made, since I don't need to label the elements of groupby_clause as year, month, and day for the grouping to work on the DateTime index.
Maybe I'm missing something obvious, but I don't get it: how does pandas know that it should associate groupby_clause[0] with year, groupby_clause[1] with month, and groupby_clause[2] with day in order to group the dataframe indexes that have DateTime type?
Suppose you have a DataFrame like this:
0
2011-01-01 00:00:00 -0.324398
2011-01-01 01:00:00 -0.761585
2011-01-01 02:00:00 0.057204
2011-01-01 03:00:00 -1.162510
2011-01-01 04:00:00 -0.680896
2011-01-01 05:00:00 -0.701835
2011-01-01 06:00:00 -0.431338
2011-01-01 07:00:00 0.306935
2011-01-01 08:00:00 -0.503177
2011-01-01 09:00:00 -0.507444
2011-01-01 10:00:00 0.230590
2011-01-01 11:00:00 -2.326702
2011-01-01 12:00:00 -0.034664
2011-01-01 13:00:00 0.224373
2011-01-01 14:00:00 -0.242884
If you want the index to be year, month, and day, then just set_index it:
df.set_index([df.index.year, df.index.month, df.index.day])
Output
                0
2011 1 1 -0.324398
       1 -0.761585
       1  0.057204
       1 -1.162510
       1 -0.680896
       1 -0.701835
       1 -0.431338
       1  0.306935
       1 -0.503177
       1 -0.507444
       1  0.230590
       1 -2.326702
       1 -0.034664
       1  0.224373
       1 -0.242884
       1 -0.134757
       1 -1.177362
       1  0.931335
       1  0.904084
       1 -0.757860
       1  0.406597
       1 -0.664150
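As to how pandas knows which array is which: it doesn't. groupby with a list of equal-length arrays simply groups rows by the tuple of values at each position, and the year/month/day meaning comes entirely from how the arrays were built. A small sketch showing that arbitrary arrays with the same values behave identically:
import numpy as np
import pandas as pd
df = pd.DataFrame({'x': range(6)},
                  index=pd.date_range('2011-01-01', periods=6, freq='12H'))
# Equivalent groupings: pandas only sees the values, not their meaning
by_date_parts = df.groupby([df.index.year, df.index.month, df.index.day]).sum()
by_plain_arrays = df.groupby([np.array([2011] * 6), np.array([1] * 6),
                              np.array([1, 1, 2, 2, 3, 3])]).sum()
print(by_date_parts)
print(by_plain_arrays)  # identical output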