Minimal reproducible code:
import pandas as pd
from datetime import datetime
import numpy as np
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0,100,size=(len(date_rng)))
df.set_index("date", inplace=True)
Now I have a dataframe like this:
data
date
2018-01-01 00:00:00 47
2018-01-01 01:00:00 97
2018-01-01 02:00:00 98
2018-01-01 03:00:00 36
Since I've made its index a DatetimeIndex, I can do things like df.loc["2018-01-01"] to get only the rows within January 1st, 2018.
I cannot find any resource that explains how to select certain hours.
I want to get the hours from 6am to 12pm for all days, leading to this expected output:
data
date
2018-01-01 06:00:00 47
2018-01-01 07:00:00 97
2018-01-01 08:00:00 98
.
.
.
2018-01-02 06:00:00 36
2018-01-02 07:00:00 47
2018-01-02 08:00:00 97
.
.
.
2018-01-03 06:00:00 98
2018-01-03 07:00:00 36
2018-01-03 08:00:00 47
.
.
. and so on
You can simply use between_time:
print (df.between_time("06:00","12:00"))
#
data
date
2018-01-01 06:00:00 51
2018-01-01 07:00:00 61
2018-01-01 08:00:00 37
2018-01-01 09:00:00 77
2018-01-01 10:00:00 7
2018-01-01 11:00:00 59
2018-01-01 12:00:00 69
2018-01-02 06:00:00 85
2018-01-02 07:00:00 70
2018-01-02 08:00:00 72
2018-01-02 09:00:00 55
2018-01-02 10:00:00 27
2018-01-02 11:00:00 32
2018-01-02 12:00:00 8
...
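For what it's worth, between_time includes both endpoints by default, which is why the 12:00:00 rows show up in the output above. Here is a minimal sketch, assuming hourly data like in the question (the values are a simple counter rather than the random numbers above), showing that a boolean mask on the index's hour selects the same rows:

```python
import numpy as np
import pandas as pd

# Hourly index as in the question; "h" is the hourly frequency alias
date_rng = pd.date_range(start="2018-01-01", end="2018-01-08", freq="h")
df = pd.DataFrame({"data": np.arange(len(date_rng))}, index=date_rng)
df.index.name = "date"

# Time-of-day selection, both endpoints included by default
by_time = df.between_time("06:00", "12:00")

# Same rows via a boolean mask on the hour component of the index
hours = df.index.hour
by_mask = df[(hours >= 6) & (hours <= 12)]

print(by_time.head(8))
```

The two only agree because the data is exactly hourly. With sub-hourly timestamps the mask would also keep rows such as 12:30, while between_time("06:00", "12:00") would stop at 12:00.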
I have a time series dataset that can be created with the following code.
idx = pd.date_range("2018-01-01", periods=100, freq="H")
ts = pd.Series(idx)
dft = pd.DataFrame(ts,columns=["date"])
dft["data"] = ""
# .loc slices are label-based and inclusive, so 0:4 covers rows 0 to 4
dft.loc[0:4, "data"] = "a"
dft.loc[5:14, "data"] = "b"
dft.loc[15:19, "data"] = "c"
dft.loc[20:29, "data"] = "d"
dft.loc[30:39, "data"] = "a"
dft.loc[40:69, "data"] = "c"
dft.loc[70:84, "data"] = "b"
dft.loc[85:, "data"] = "c"
In the data column, the unique values are a, b, c and d. Each value repeats over consecutive rows, forming time windows. I want to capture the first and last timestamp of each such window. How can I do that?
Compute a grouper for your changing values using shift to compare consecutive rows, then use groupby+agg to get the min/max per group:
group = dft.data.ne(dft.data.shift()).cumsum()
dft.groupby(group)['date'].agg(['min', 'max'])
output:
min max
data
1 2018-01-01 00:00:00 2018-01-01 04:00:00
2 2018-01-01 05:00:00 2018-01-01 14:00:00
3 2018-01-01 15:00:00 2018-01-01 19:00:00
4 2018-01-01 20:00:00 2018-01-02 05:00:00
5 2018-01-02 06:00:00 2018-01-02 15:00:00
6 2018-01-02 16:00:00 2018-01-03 21:00:00
7 2018-01-03 22:00:00 2018-01-04 12:00:00
8 2018-01-04 13:00:00 2018-01-05 03:00:00
Edit: combining with the original data:
dft.groupby(group).agg({'data': 'first', 'date': ['min', 'max']})
output:
data date
first min max
data
1 a 2018-01-01 00:00:00 2018-01-01 04:00:00
2 b 2018-01-01 05:00:00 2018-01-01 14:00:00
3 c 2018-01-01 15:00:00 2018-01-01 19:00:00
4 d 2018-01-01 20:00:00 2018-01-02 05:00:00
5 a 2018-01-02 06:00:00 2018-01-02 15:00:00
6 c 2018-01-02 16:00:00 2018-01-03 21:00:00
7 b 2018-01-03 22:00:00 2018-01-04 12:00:00
8 c 2018-01-04 13:00:00 2018-01-05 03:00:00
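To see why the grouper works, here is a small sketch on a toy Series (the values are made up for illustration): ne(shift()) flags every row that differs from the previous one, and cumsum turns those flags into a run number that stays constant within each block of equal values.

```python
import pandas as pd

s = pd.Series(list("aabbbac"))

# True wherever the value differs from the row above (the first row is always True)
changed = s.ne(s.shift())

# Running count of changes: one id per block of consecutive equal values
run_id = changed.cumsum()

# First value and length of each consecutive run
runs = s.groupby(run_id).agg(["first", "size"])

print(run_id.tolist())  # [1, 1, 2, 2, 2, 3, 4]
print(runs)
```

Note that the second block of "a" gets its own id (3), which a plain groupby on the values would not give you.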
I have a time series and I want to group the rows by hour of day (regardless of date) and visualize these as boxplots. So I'd want 24 boxplots starting from hour 1, then hour 2, then hour 3 and so on.
The way I see this working is splitting the dataset up into 24 series (1 for each hour of the day), creating a boxplot for each series and then plotting this on the same axes.
The only way I can think of to do this is to manually select all the values within each hour. Is there a faster way?
Some sample data:
Date Actual Consumption
2018-01-01 00:00:00 47.05
2018-01-01 00:15:00 46
2018-01-01 00:30:00 44
2018-01-01 00:45:00 45
2018-01-01 01:00:00 43.5
2018-01-01 01:15:00 43.5
2018-01-01 01:30:00 43
2018-01-01 01:45:00 42.5
2018-01-01 02:00:00 43
2018-01-01 02:15:00 42.5
2018-01-01 02:30:00 41
2018-01-01 02:45:00 42.5
2018-01-01 03:00:00 42.04
2018-01-01 03:15:00 41.96
2018-01-01 03:30:00 44
2018-01-01 03:45:00 44
2018-01-01 04:00:00 43.54
2018-01-01 04:15:00 43.46
2018-01-01 04:30:00 43.5
2018-01-01 04:45:00 43
2018-01-01 05:00:00 42.04
This is what I've tried so far:
zero = df.between_time('00:00', '00:59')
one = df.between_time('01:00', '01:59')
two = df.between_time('02:00', '02:59')
and then I would plot a boxplot for each of these on the same axes. However it's very tedious to do this for all 24 hours in a day.
This is the kind of output I want:
https://www.researchgate.net/figure/Boxplot-of-the-NOx-data-by-hour-of-the-day_fig1_24054015
There are two steps to achieve this.
Convert Actual to datetime:
df.Actual = pd.to_datetime(df.Actual)
Group by the hour:
df.groupby([df.Date, df.Actual.dt.hour+1]).Consumption.sum().reset_index()
I assumed you wanted to sum the Consumption; if you want the mean or another aggregate, swap sum() for it. One note: I use hour+1 so that the hours start from 1 and not 0 (remove the +1 if you want 0 to be midnight).
Result:
Date Actual Consumption
0 2018-01-01 1 182.05
1 2018-01-01 2 172.50
2 2018-01-01 3 169.00
3 2018-01-01 4 172.00
4 2018-01-01 5 173.50
5 2018-01-01 6 42.04
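If the goal is one box per hour of day regardless of date, grouping on the hour alone may be closer to what was asked. A rough sketch, assuming the timestamps are in a DatetimeIndex and the column is called Consumption (the 15-minute data below is synthetic, not the sample from the question):

```python
import numpy as np
import pandas as pd

# Three days of synthetic 15-minute readings
idx = pd.date_range("2018-01-01", periods=3 * 96, freq="15min")
df = pd.DataFrame({"Consumption": np.linspace(40.0, 48.0, len(idx))}, index=idx)

# One group per hour of day (0..23), pooled across all dates
grouped = df["Consumption"].groupby(df.index.hour)

# Raw values per hour, e.g. to hand to a plotting library as 24 boxes
per_hour = {hour: values.to_numpy() for hour, values in grouped}

# The same statistics a boxplot draws (quartiles, min, max), as a table
summary = grouped.describe()
print(summary.head())
```

The list(per_hour.values()) sequence should be usable as the input to matplotlib's Axes.boxplot, which draws one box per array, though the plotting step is left out here.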
I have a dataframe with a datetime index and 100 columns.
I want to have a new dataframe with the same datetime index and columns, but the values would contain the sum of the first 10 hours of each day.
So if I had an original dataframe like this:
A B C
---------------------------------
2018-01-01 00:00:00 2 5 -10
2018-01-01 01:00:00 6 5 7
2018-01-01 02:00:00 7 5 9
2018-01-01 03:00:00 9 5 6
2018-01-01 04:00:00 10 5 2
2018-01-01 05:00:00 7 5 -1
2018-01-01 06:00:00 1 5 -1
2018-01-01 07:00:00 -4 5 10
2018-01-01 08:00:00 9 5 10
2018-01-01 09:00:00 21 5 -10
2018-01-01 10:00:00 2 5 -1
2018-01-01 11:00:00 8 5 -1
2018-01-01 12:00:00 8 5 10
2018-01-01 13:00:00 8 5 9
2018-01-01 14:00:00 7 5 -10
2018-01-01 15:00:00 7 5 5
2018-01-01 16:00:00 7 5 -10
2018-01-01 17:00:00 4 5 7
2018-01-01 18:00:00 5 5 8
2018-01-01 19:00:00 2 5 8
2018-01-01 20:00:00 2 5 4
2018-01-01 21:00:00 8 5 3
2018-01-01 22:00:00 1 5 3
2018-01-01 23:00:00 1 5 1
2018-01-02 00:00:00 2 5 2
2018-01-02 01:00:00 3 5 8
2018-01-02 02:00:00 4 5 6
2018-01-02 03:00:00 5 5 6
2018-01-02 04:00:00 1 5 7
2018-01-02 05:00:00 7 5 7
2018-01-02 06:00:00 5 5 1
2018-01-02 07:00:00 2 5 2
2018-01-02 08:00:00 4 5 3
2018-01-02 09:00:00 6 5 4
2018-01-02 10:00:00 9 5 4
2018-01-02 11:00:00 11 5 5
2018-01-02 12:00:00 2 5 8
2018-01-02 13:00:00 2 5 0
2018-01-02 14:00:00 4 5 5
2018-01-02 15:00:00 5 5 4
2018-01-02 16:00:00 7 5 4
2018-01-02 17:00:00 -1 5 7
2018-01-02 18:00:00 1 5 7
2018-01-02 19:00:00 1 5 7
2018-01-02 20:00:00 5 5 7
2018-01-02 21:00:00 2 5 7
2018-01-02 22:00:00 2 5 7
2018-01-02 23:00:00 8 5 7
So for all rows with date 2018-01-01:
The value for column A would be 68 (2+6+7+9+10+7+1-4+9+21)
The value for column B would be 50 (5+5+5+5+5+5+5+5+5+5)
The value for column C would be 22 (-10+7+9+6+2-1-1+10+10-10)
So for all rows with date 2018-01-02:
The value for column A would be 39 (2+3+4+5+1+7+5+2+4+6)
The value for column B would be 50 (5+5+5+5+5+5+5+5+5+5)
The value for column C would be 46 (2+8+6+6+7+7+1+2+3+4)
The outcome would be:
A B C
---------------------------------
2018-01-01 00:00:00 68 50 22
2018-01-01 01:00:00 68 50 22
2018-01-01 02:00:00 68 50 22
2018-01-01 03:00:00 68 50 22
2018-01-01 04:00:00 68 50 22
2018-01-01 05:00:00 68 50 22
2018-01-01 06:00:00 68 50 22
2018-01-01 07:00:00 68 50 22
2018-01-01 08:00:00 68 50 22
2018-01-01 09:00:00 68 50 22
2018-01-01 10:00:00 68 50 22
2018-01-01 11:00:00 68 50 22
2018-01-01 12:00:00 68 50 22
2018-01-01 13:00:00 68 50 22
2018-01-01 14:00:00 68 50 22
2018-01-01 15:00:00 68 50 22
2018-01-01 16:00:00 68 50 22
2018-01-01 17:00:00 68 50 22
2018-01-01 18:00:00 68 50 22
2018-01-01 19:00:00 68 50 22
2018-01-01 20:00:00 68 50 22
2018-01-01 21:00:00 68 50 22
2018-01-01 22:00:00 68 50 22
2018-01-01 23:00:00 68 50 22
2018-01-02 00:00:00 39 50 46
2018-01-02 01:00:00 39 50 46
2018-01-02 02:00:00 39 50 46
2018-01-02 03:00:00 39 50 46
2018-01-02 04:00:00 39 50 46
2018-01-02 05:00:00 39 50 46
2018-01-02 06:00:00 39 50 46
2018-01-02 07:00:00 39 50 46
2018-01-02 08:00:00 39 50 46
2018-01-02 09:00:00 39 50 46
2018-01-02 10:00:00 39 50 46
2018-01-02 11:00:00 39 50 46
2018-01-02 12:00:00 39 50 46
2018-01-02 13:00:00 39 50 46
2018-01-02 14:00:00 39 50 46
2018-01-02 15:00:00 39 50 46
2018-01-02 16:00:00 39 50 46
2018-01-02 17:00:00 39 50 46
2018-01-02 18:00:00 39 50 46
2018-01-02 19:00:00 39 50 46
2018-01-02 20:00:00 39 50 46
2018-01-02 21:00:00 39 50 46
2018-01-02 22:00:00 39 50 46
2018-01-02 23:00:00 39 50 46
I figured I'd group by date first and perform a sum and then merge the results based on the date. Is there a better/faster way to do this?
Thanks.
EDIT: I worked on this answer in the meantime:
df = df.between_time('0:00', '9:00').groupby(pd.Grouper(freq='D')).sum()
df = df.resample('1H').ffill()
You need to group by df.index.date and use transform with a lambda function to compute the sum of the first 10 values of each day:
df.loc[:,['A','B','C']] = df.groupby(df.index.date).transform(lambda x: x[:10].sum())
Or, if the column order is the same for the grouped values and the original columns:
df.loc[:,:] = df.groupby(df.index.date).transform(lambda x: x[:10].sum())
print(df)
A B C
2018-01-01 00:00:00 68 50 22
2018-01-01 01:00:00 68 50 22
2018-01-01 02:00:00 68 50 22
2018-01-01 03:00:00 68 50 22
2018-01-01 04:00:00 68 50 22
2018-01-01 05:00:00 68 50 22
2018-01-01 06:00:00 68 50 22
2018-01-01 07:00:00 68 50 22
2018-01-01 08:00:00 68 50 22
2018-01-01 09:00:00 68 50 22
2018-01-01 10:00:00 68 50 22
2018-01-01 11:00:00 68 50 22
2018-01-01 12:00:00 68 50 22
2018-01-01 13:00:00 68 50 22
2018-01-01 14:00:00 68 50 22
2018-01-01 15:00:00 68 50 22
2018-01-01 16:00:00 68 50 22
2018-01-01 17:00:00 68 50 22
2018-01-01 18:00:00 68 50 22
2018-01-01 19:00:00 68 50 22
2018-01-01 20:00:00 68 50 22
2018-01-01 21:00:00 68 50 22
2018-01-01 22:00:00 68 50 22
2018-01-01 23:00:00 68 50 22
2018-01-02 00:00:00 39 50 46
2018-01-02 01:00:00 39 50 46
2018-01-02 02:00:00 39 50 46
2018-01-02 03:00:00 39 50 46
2018-01-02 04:00:00 39 50 46
2018-01-02 05:00:00 39 50 46
2018-01-02 06:00:00 39 50 46
2018-01-02 07:00:00 39 50 46
2018-01-02 08:00:00 39 50 46
2018-01-02 09:00:00 39 50 46
2018-01-02 10:00:00 39 50 46
2018-01-02 11:00:00 39 50 46
2018-01-02 12:00:00 39 50 46
2018-01-02 13:00:00 39 50 46
2018-01-02 14:00:00 39 50 46
2018-01-02 15:00:00 39 50 46
2018-01-02 16:00:00 39 50 46
2018-01-02 17:00:00 39 50 46
2018-01-02 18:00:00 39 50 46
2018-01-02 19:00:00 39 50 46
2018-01-02 20:00:00 39 50 46
2018-01-02 21:00:00 39 50 46
2018-01-02 22:00:00 39 50 46
2018-01-02 23:00:00 39 50 46
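Keep in mind that x[:10] takes the first 10 rows of each day, which equals the first 10 hours only if every day starts at midnight and has no gaps. A sketch of a variant that selects by clock time instead, on made-up data with two columns (the names A and B are just for illustration):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2018-01-01", periods=48, freq="h")
df = pd.DataFrame({"A": np.arange(48), "B": 5}, index=idx)

# Keep only rows whose clock time is 00:00 to 09:00, then sum per calendar day
first10 = df[df.index.hour < 10]
daily = first10.groupby(first10.index.normalize()).sum()

# Broadcast each day's total back onto every original timestamp
result = daily.reindex(df.index.normalize())
result.index = df.index

print(result.iloc[[0, 23, 24, 47]])
```

This does the group, sum and merge-back steps from the question without a Python-level lambda, which may also be faster on wide frames.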
I have a CSV file that has a column that has values like:
10/23/2018 11:00:00 PM
I want to convert these values strictly by time and create a new column which takes the time of the entry (11:00:00 etc) and changes it into an hour ending time.
Example looks like:
11:00:00 PM to 12:00:00 AM = 24, 12:00:00 AM to 1:00:00 AM = 1, 1:00:00 AM to 2:00:00 AM = 2 .....etc
Looking for a simple way to calculate these by indexing them based off this conversion.
My first pseudo code idea is to do something like grabbing the column df['Date'] and finding out what the time is:
file = pd.read_csv()
def conv(n):
    date_time = n.iloc[1,1] #Position of the date-time column in file
    for i in date_time:
        time = date_time[11:] #Point of the line where time begins
Unsure how to proceed.
You can also do this:
import pandas as pd
data ='''
10/23/2018 11:00:00 PM
10/23/2018 12:00:00 AM
'''.strip().split('\n')
df = pd.DataFrame(data, columns=['date'])
df['date'] = pd.to_datetime(df['date'])
#df['pad1hour'] = df['date'].dt.hour+1
#or
df['pad1hour'] = df['date'] + pd.Timedelta('1 hours')
# I prefer the second as you can add whatever interval e.g. '1 days 3 minutes'
print(df['pad1hour'].dt.time)
You should convert to a datetime with pd.to_datetime(df.your_col) (your format will be automatically parsed correctly, though you can specify it to improve the speed) and then you can use the .dt.hour accessor.
import pandas as pd
# Sample Data
df = pd.DataFrame({'date': pd.date_range('2018-01-01', '2018-01-03', freq='30min')})
df['hour'] = df.date.dt.hour+1
print(df.sample(20))
date hour
95 2018-01-02 23:30:00 24
66 2018-01-02 09:00:00 10
82 2018-01-02 17:00:00 18
80 2018-01-02 16:00:00 17
75 2018-01-02 13:30:00 14
83 2018-01-02 17:30:00 18
49 2018-01-02 00:30:00 1
47 2018-01-01 23:30:00 24
30 2018-01-01 15:00:00 16
52 2018-01-02 02:00:00 3
29 2018-01-01 14:30:00 15
86 2018-01-02 19:00:00 20
59 2018-01-02 05:30:00 6
65 2018-01-02 08:30:00 9
92 2018-01-02 22:00:00 23
8 2018-01-01 04:00:00 5
91 2018-01-02 21:30:00 22
10 2018-01-01 05:00:00 6
89 2018-01-02 20:30:00 21
51 2018-01-02 01:30:00 2
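Applied to strings in the format from the question, a minimal sketch could look like this. The format string is my reading of the sample value 10/23/2018 11:00:00 PM, and the third value is made up to show a non-padded hour:

```python
import pandas as pd

raw = pd.Series([
    "10/23/2018 11:00:00 PM",
    "10/24/2018 12:00:00 AM",
    "10/24/2018 1:00:00 AM",
])

# Explicit 12-hour format with AM/PM marker
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")

# .dt.hour is 0..23, so adding 1 gives hour-ending values 1..24
hour_ending = parsed.dt.hour + 1

print(hour_ending.tolist())  # [24, 1, 2]
```

So 11 PM maps to 24 and 12 AM maps to 1, matching the conversion described in the question.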
Here is another way to do it:
from datetime import timedelta
import pandas as pd
file = pd.read_csv()
Case One: If you want to keep the date
file['New datetime'] = file['Date_time'].apply(lambda x: pd.to_datetime(x) + timedelta(hours = 1))
Case Two: If you just want the time
file['New time'] = file['Date_time'].apply(lambda x: (pd.to_datetime(x) + timedelta(hours = 1)).time())
If you need the column as readable strings instead of time objects, you can convert it with:
file['New time'] = file['New time'].astype(str)
Hope it helps.
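The same result can probably be had without apply, by converting the whole column at once. A sketch, assuming a Date_time column of strings in the format from the question (the two rows here are stand-ins for the CSV contents):

```python
import pandas as pd

file = pd.DataFrame({"Date_time": ["10/23/2018 11:00:00 PM", "10/24/2018 12:00:00 AM"]})

# Parse the whole column in one call, then shift every timestamp by one hour
shifted = pd.to_datetime(file["Date_time"], format="%m/%d/%Y %I:%M:%S %p") + pd.Timedelta(hours=1)

file["New datetime"] = shifted                     # keeps the date
file["New time"] = shifted.dt.strftime("%H:%M:%S")  # time only, as a string

print(file)
```

Note that 11 PM plus one hour rolls over to 00:00:00 on the next day, so this gives the end of the hour as a clock time rather than the 1 to 24 numbering.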
I have a pandas DataFrame that holds one value for every hour of a day, and I want to repeat these values for every day of a year. I have written the 'naive' way to do it. Is there a more efficient way?
Naive way (that works correctly, but takes a lot of time):
dfConsoFrigo = pd.read_csv("../assets/datas/refregirateur.csv", sep=';')
dataframe = pd.DataFrame(columns=['Puissance'])
iterator = 0
for day in pd.date_range("01 Jan 2017 00:00", "31 Dec 2017 23:00", freq='1H'):
    iterator = iterator % 24
    dataframe.loc[day] = dfConsoFrigo.iloc[iterator]['Puissance']
    iterator += 1
Input (time;value) 24 rows:
Heure;Puissance
00:00;48.0
01:00;47.0
02:00;46.0
03:00;46.0
04:00;45.0
05:00;46.0
...
19:00;55.0
20:00;53.0
21:00;51.0
22:00;50.0
23:00;49.0
Expected Output (8760 rows):
Puissance
2017-01-01 00:00:00 48
2017-01-01 01:00:00 47
2017-01-01 02:00:00 46
2017-01-01 03:00:00 46
2017-01-01 04:00:00 45
...
2017-12-31 20:00:00 53
2017-12-31 21:00:00 51
2017-12-31 22:00:00 50
2017-12-31 23:00:00 49
I think you need numpy.tile:
np.random.seed(10)
df = pd.DataFrame({'Puissance':np.random.randint(100, size=24)})
rng = pd.date_range("01 Jan 2017 00:00", "31 Dec 2017 23:00", freq='1H')
df = pd.DataFrame({'a':np.tile(df['Puissance'].values, 365)}, index=rng)
print (df.head(30))
a
2017-01-01 00:00:00 9
2017-01-01 01:00:00 15
2017-01-01 02:00:00 64
2017-01-01 03:00:00 28
2017-01-01 04:00:00 89
2017-01-01 05:00:00 93
2017-01-01 06:00:00 29
2017-01-01 07:00:00 8
2017-01-01 08:00:00 73
2017-01-01 09:00:00 0
2017-01-01 10:00:00 40
2017-01-01 11:00:00 36
2017-01-01 12:00:00 16
2017-01-01 13:00:00 11
2017-01-01 14:00:00 54
2017-01-01 15:00:00 88
2017-01-01 16:00:00 62
2017-01-01 17:00:00 33
2017-01-01 18:00:00 72
2017-01-01 19:00:00 78
2017-01-01 20:00:00 49
2017-01-01 21:00:00 51
2017-01-01 22:00:00 54
2017-01-01 23:00:00 77
2017-01-02 00:00:00 9
2017-01-02 01:00:00 15
2017-01-02 02:00:00 64
2017-01-02 03:00:00 28
2017-01-02 04:00:00 89
2017-01-02 05:00:00 93
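np.tile with 365 relies on the range covering exactly 365 whole days. A variant that looks each timestamp's hour up in the 24-value profile avoids hard-coding the day count. This is only a sketch, with placeholder values standing in for the CSV's Puissance column:

```python
import numpy as np
import pandas as pd

# Hypothetical 24-value daily profile (one value per hour, 00:00 to 23:00)
profile = pd.Series(np.arange(48.0, 72.0), name="Puissance")

rng = pd.date_range("2017-01-01 00:00", "2017-12-31 23:00", freq="h")

# Use each timestamp's hour (0..23) as a position into the profile
values = profile.to_numpy()[rng.hour.to_numpy()]
year = pd.DataFrame({"Puissance": values}, index=rng)

print(len(year))  # 8760
print(year.head(3))
```

Because the lookup is driven by the index itself, the same code should work unchanged for a leap year or for a range that starts or ends mid-day.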