I have a DataFrame of crypto data and am trying to see if there is a particular time of the day/week when prices move one way or the other. I have the timestamp, day of the week, and return since the previous timestamp's close, as in the example data below.
Date Day Return
2019-06-22 01:00:00 Saturday -0.046910
2019-06-22 07:00:00 Saturday -0.018756
2019-06-22 13:00:00 Saturday 0.036842
2019-06-22 19:00:00 Saturday 0.000998
2019-06-23 01:00:00 Sunday 0.017672
2019-06-23 07:00:00 Sunday 0.021102
2019-06-23 13:00:00 Sunday -0.014737
2019-06-23 19:00:00 Sunday -0.039085
2019-06-24 01:00:00 Monday 0.009690
2019-06-24 07:00:00 Monday -0.004367
2019-06-24 13:00:00 Monday -0.005342
2019-06-24 19:00:00 Monday 0.001060
2019-06-25 01:00:00 Tuesday -0.027738
2019-06-25 07:00:00 Tuesday -0.001599
2019-06-25 13:00:00 Tuesday 0.006247
2019-06-25 19:00:00 Tuesday -0.036937
2019-06-26 01:00:00 Wednesday -0.064866
2019-06-26 07:00:00 Wednesday 0.012319
My first issue is that the timestamps are confusing. Since I get data from different exchanges, the timestamps differ across many of them, so I have abandoned the idea of standardising the Date column and would now just like a new column that numbers the period within each day. So the first 6 hours of each Saturday would be Saturday_1, and so on. In the end I would have 28 different categories (4 time periods x 7 days of the week).
What I would then like is to group by this new column and get back the average return for each category.
Cheers
Assuming that your Day column is correct:
import pandas as pd

# ignore if already datetime
df['Date'] = pd.to_datetime(df['Date'])

# 6-hour block within the day: hours 0-5 -> 1, 6-11 -> 2, 12-17 -> 3, 18-23 -> 4
s = df['Date'].dt.hour // 6 + 1

# new column combining day name and period number
df['group'] = df['Day'] + '_' + s.astype(str)
Output:
0 Saturday_1
1 Saturday_2
2 Saturday_3
3 Saturday_4
4 Sunday_1
5 Sunday_2
6 Sunday_3
7 Sunday_4
8 Monday_1
9 Monday_2
10 Monday_3
11 Monday_4
12 Tuesday_1
13 Tuesday_2
14 Tuesday_3
15 Tuesday_4
16 Wednesday_1
17 Wednesday_2
Name: group, dtype: object
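The question also asked for the average return per category; a minimal sketch of the follow-up groupby, using a few rows of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2019-06-22 01:00:00', '2019-06-22 07:00:00',
             '2019-06-23 01:00:00', '2019-06-23 07:00:00'],
    'Day': ['Saturday', 'Saturday', 'Sunday', 'Sunday'],
    'Return': [-0.046910, -0.018756, 0.017672, 0.021102],
})
df['Date'] = pd.to_datetime(df['Date'])

# label each 6-hour block within the day, as in the answer above
df['group'] = df['Day'] + '_' + (df['Date'].dt.hour // 6 + 1).astype(str)

# average return per day/period category
avg = df.groupby('group')['Return'].mean()
```

With the full dataset this yields one mean per each of the 28 categories.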
Related
I have a DataFrame with a column of the time and a column in which I have stored a time lag. The data looks like this:
2020-04-18 14:00:00 0 days 03:00:00
2020-04-19 02:00:00 1 days 13:00:00
2020-04-28 14:00:00 1 days 17:00:00
2020-04-29 20:00:00 2 days 09:00:00
2020-04-30 19:00:00 2 days 11:00:00
The Time column has dtype datetime64[ns] (282 rows); the time-lag column is stored as object (116 rows).
I want to plot Time on the x-axis vs the time lag on the y-axis. However, I keep getting errors when plotting the second column. Any tips on how to handle this data for the plot?
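No answer was recorded for this one; a common approach (a sketch, assuming the lag column holds "0 days 03:00:00"-style strings or timedeltas) is to convert the lag to a plain numeric unit such as hours, which plotting libraries handle without trouble:

```python
import pandas as pd

df = pd.DataFrame({
    'Time': pd.to_datetime(['2020-04-18 14:00:00', '2020-04-19 02:00:00']),
    'lag': ['0 days 03:00:00', '1 days 13:00:00'],
})

# parse the lag column as timedeltas, then express it in hours
df['lag_hours'] = pd.to_timedelta(df['lag']).dt.total_seconds() / 3600

# df.plot(x='Time', y='lag_hours')  # now plots as an ordinary numeric series
```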
I am trying to extract hours, minutes and seconds from the DataFrame column below.
0 days 09:30:00
0 days 10:00:00
0 days 10:30:00
0 days 11:00:00
0 days 11:30:00
0 days 12:00:00
0 days 12:30:00
0 days 01:00:00
0 days 01:30:00
I want to remove the "0 days" prefix from the column, leaving the format below:
09:30:00
10:00:00
10:30:00
11:00:00
11:30:00
12:00:00
12:30:00
01:00:00
01:30:00
If you use the .split() method and pass in a space, it returns a list of the words in the string, e.g. ['0', 'days', '09:30:00']. Then you just return the third entry, assuming everything follows this format.
def foo(string):
    return string.split(' ')[2]
Using Series.apply() it will look like this:
df['newcolumn'] = df['originalcolumn'].apply(foo)
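If the column is a real timedelta64 column rather than strings, a vectorized sketch that avoids apply entirely (assuming every value is under 24 hours, so the "0 days" prefix is constant-width):

```python
import pandas as pd

df = pd.DataFrame({'originalcolumn': pd.to_timedelta(['0 days 09:30:00',
                                                      '0 days 10:00:00'])})

# render each timedelta as a string and keep the trailing HH:MM:SS part
df['newcolumn'] = df['originalcolumn'].astype(str).str[-8:]
```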
I'm trying to resample daily-frequency data to business days using the pandas resample function with an offset, so that the last day of the week becomes Thursday and the first becomes Sunday.
This is the code so far:
import pandas as pd
resampled_data = df.resample('B', base=-1)
But it keeps resampling so that Friday is included and Sunday is excluded. I tried many different values for base and loffset, but they don't affect the resampling.
Please note: The raw data is using UTC timestamps. Timezone is Eastern Daylight Time. Sunday UTC 21:00 - Thursday UTC 21:00.
Use a CustomBusinessDay(). I've resampled the whole of January, which includes Fridays and Saturdays, and also included day_name() and dayofweek to show it has worked.
import datetime as dt
import pandas as pd

# one row per calendar day for the whole of January
df = pd.DataFrame(index=pd.date_range(dt.datetime(2020, 1, 1), dt.datetime(2020, 2, 1)))

# custom business day running Sunday through Thursday
bd = pd.tseries.offsets.CustomBusinessDay(n=1, weekmask="Sun Mon Tue Wed Thu")

df = df.resample(rule=bd).first().assign(
    day=lambda dfa: dfa.index.day_name(),
    dn=lambda dfa: dfa.index.dayofweek,
)
Output:
day dn
2020-01-01 Wednesday 2
2020-01-02 Thursday 3
2020-01-05 Sunday 6
2020-01-06 Monday 0
2020-01-07 Tuesday 1
2020-01-08 Wednesday 2
2020-01-09 Thursday 3
2020-01-12 Sunday 6
2020-01-13 Monday 0
2020-01-14 Tuesday 1
2020-01-15 Wednesday 2
2020-01-16 Thursday 3
2020-01-19 Sunday 6
2020-01-20 Monday 0
2020-01-21 Tuesday 1
2020-01-22 Wednesday 2
2020-01-23 Thursday 3
2020-01-26 Sunday 6
2020-01-27 Monday 0
2020-01-28 Tuesday 1
2020-01-29 Wednesday 2
2020-01-30 Thursday 3
I am working on a data frame with DateTimeIndex of hourly temperature data spanning a couple of years. I want to add a column with the minimum temperature between 20:00 of a day and 8:00 of the following day. Daytime temperatures - from 8:00 to 20:00 - are not of interest. The result can either be at the same hourly resolution of the original data or be resampled to days.
I have researched a number of strategies to solve this, but am unsure about the most efficient (primarily in terms of coding efficiency, secondarily computing efficiency) and most pythonic way to do it. Some possibilities I have come up with:
Attach a column with labels 'day'/'night' depending on df.index.hour and use groupby or df.loc to find the minimum
Resample to 12h and drop every second value. Not sure how I can make the resampling period start at 20:00.
Add a multi-index - I guess this is similar to approach 1, but feels a bit over the top for what I'm trying to achieve.
Use df.between_time (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.between_time.html#pandas.DataFrame.between_time) though I'm not sure if the date change over midnight will make this a bit messy.
Lastly there is some discussion about combining rolling with a stepping parameter as new pandas feature: https://github.com/pandas-dev/pandas/issues/15354
Original df looks like this:
datetime temp
2009-07-01 01:00:00 17.16
2009-07-01 02:00:00 16.64
2009-07-01 03:00:00 16.21 #<-- minimum for the night of 2009-06-30 (previous date, since the period starts 2009-06-30 20:00)
... ...
2019-06-24 22:00:00 14.03 #<-- minimum for the night 2019-06-24
2019-06-24 23:00:00 18.87
2019-06-25 00:00:00 17.85
2019-06-25 01:00:00 17.25
I want to get something like this (min temp from day 20:00 to day+1 8:00):
datetime temp
2009-06-30 23:00:00 16.21
2009-07-01 00:00:00 16.21
2009-07-01 01:00:00 16.21
2009-07-01 02:00:00 16.21
2009-07-01 03:00:00 16.21
... ...
2019-06-24 22:00:00 14.03
2019-06-24 23:00:00 14.03
2019-06-25 00:00:00 14.03
2019-06-25 01:00:00 14.03
or a bit more succinct:
datetime temp
2009-06-30 16.21
... ...
2019-06-24 14.03
Use the base option to resample:
rs = df.resample('12h', base=8).min()
Then keep only the rows for 20:00:
rs[rs.index.hour == 20]
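Note that `base` was deprecated in pandas 1.1 in favour of `offset`; an equivalent runnable sketch on newer versions (sample temperatures are made up):

```python
import pandas as pd

# one night of hourly readings, 20:00 through 07:00 the next morning
idx = pd.date_range('2009-06-30 20:00', periods=12, freq='h')
df = pd.DataFrame({'temp': [17.0, 16.5, 16.3, 16.21, 16.4, 16.6,
                            16.8, 17.1, 17.5, 18.0, 18.4, 18.9]}, index=idx)

# 12-hour bins anchored at 08:00/20:00, then keep only the night bins
rs = df.resample('12h', offset='8h').min()
night_min = rs[rs.index.hour == 20]
```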
You can use pd.Grouper with freq='12h' and base=8 to chunk the DataFrame into 12-hour bins running from 20:00 to 08:00 the next day,
then you can just use .min()
try this:
import pandas as pd
from io import StringIO

s = """
datetime  temp
2009-07-01 01:00:00  17.16
2009-07-01 02:00:00  16.64
2009-07-01 03:00:00  16.21
2019-06-24 22:00:00  14.03
2019-06-24 23:00:00  18.87
2019-06-25 00:00:00  17.85
2019-06-25 01:00:00  17.25"""

# two-or-more spaces separate the columns, so the single space inside
# the datetime is not treated as a delimiter
df = pd.read_csv(StringIO(s), sep=r"\s\s+", engine="python")
df['datetime'] = pd.to_datetime(df['datetime'])

# on pandas >= 1.1 use offset='8h' instead of the deprecated base=8
result = (df.sort_values('datetime')
            .groupby(pd.Grouper(freq='12h', base=8, key='datetime'))['temp']
            .min()
            .dropna())
print(result)
Output:
datetime
2009-06-30 20:00:00 16.21
2019-06-24 20:00:00 14.03
Name: temp, dtype: float64
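The question also allowed for the result at the original hourly resolution; one sketch is to drop the daytime hours and broadcast each night's minimum back with transform (whether 08:00 itself counts as night is an assumption here, and the temperatures are made up):

```python
import pandas as pd

idx = pd.to_datetime(['2009-06-30 20:00', '2009-06-30 23:00',
                      '2009-07-01 03:00', '2009-07-01 07:00',
                      '2009-07-01 12:00'])
df = pd.DataFrame({'temp': [17.0, 16.5, 16.21, 16.8, 25.0]}, index=idx)

# keep only the night hours (20:00 up to, but excluding, 08:00)
night = df[(df.index.hour >= 20) | (df.index.hour < 8)].copy()

# broadcast each 20:00-08:00 bin's minimum to every row in that bin
night['night_min'] = (night.groupby(pd.Grouper(freq='12h', offset='8h'))['temp']
                           .transform('min'))
```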
I've worked on this problem for hours and have found no solution.
Below is intraday 5-minute forex data, df. It is recorded every 5 minutes every day.
I have excluded weekend data since the market is closed on weekends. Here a weekend is defined as Friday 5:00 pm - Sunday 5:00 pm.
Time OPEN CLOSE
216 2014-01-01 18:05:00 0.891975 0.892185
217 2014-01-01 18:10:00 0.892075 0.892090
...
210238 2015-12-31 23:55:00 1.000390 1.000390
210239 2016-01-01 00:00:00 1.000390 1.000390
A day is defined as from 11:00 am - 11:00 am. So when I say 2014-01-02, I am aggregating data from 2014-01-01 11:00 am to 2014-01-02 11:00 am.
2014-01-06(monday) will include data from two intervals:
1. 2014-01-03(friday) 11:00 - 17:00
2. 2014-01-05(sunday) 17:00 - 2014-01-06 11:00
I want to create a new column 'Date' to define the new "day", so that by reading df.Time the column records which day each row belongs to.
How would you approach this?
Time OPEN CLOSE Date
2014-01-03 14:05:00 0.891975 0.892185 2014-01-06
2014-01-05 17:00:00 0.892075 0.892090 2014-01-06
2014-01-06 11:00:00 0.892075 0.892090 2014-01-06
...
2015-12-31 23:55:00 1.000390 1.000390 2016-01-01
2016-01-01 00:00:00 1.000390 1.000390 2016-01-01
I use pandas offsets.
import pandas as pd
times = pd.date_range('2016-01-01 11:00:00', '2016-01-15 11:00:00', freq='H')
pd.to_datetime((times - pd.offsets.Hour(11) + pd.offsets.BDay()).date)
times gets every hour between '2016-01-01 11:00:00' and '2016-01-15 11:00:00'
Based on your description, if I subtract 11 hours and add a business day, I should land on the day you are looking for. Then I convert to the date, dropping the time component.
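Applied to a few of the timestamps from the question, the same expression reproduces the expected dates (Friday afternoon and Sunday evening both map to the following Monday), at least away from the exact 11:00 boundary:

```python
import pandas as pd

times = pd.to_datetime(['2014-01-03 14:05:00',   # Friday afternoon
                        '2014-01-05 17:00:00',   # Sunday 17:00
                        '2015-12-31 23:55:00'])  # Thursday night

# shift back 11 hours, roll forward to the next business day, drop the time
dates = pd.to_datetime((times - pd.offsets.Hour(11) + pd.offsets.BDay()).date)
```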