Time series prediction: build new X rows from past rows - python

For time series prediction I am using pandas.
This is a sample of my data frame:
Close Price
DateTime
2017-01-02 23:00:00 1.04630
2017-01-02 23:30:00 1.04575
2017-01-03 00:00:00 1.04672
2017-01-03 00:30:00 1.04662
2017-01-03 01:00:00 1.04766
......
In my X matrix for sklearn prediction I want to have something like this:
use the 3 past rows as input for making a new row
X:
ClosePrice ClosePrice-1 ClosePrice-2 ClosePrice-3
2017-01-03 00:30:00 1.04662 1.04672 1.04575 1.04630
2017-01-03 01:00:00 1.04766 1.04662 1.04672 1.04575
...
What is the best method?
Is there a pandas function to do this?

thanks a lot
If I want to use n instead of 3, what is the best method?
This worked:
for i in range(1, NumberOfLastData):
    ColunmNameHighLowBin = 'HighLowBin-' + str(i)
    ColunmNameOpenCloseBin = 'OpenCloseBin-' + str(i)
    X[ColunmNameHighLowBin] = X['HighLowBin'].shift(i)
    X[ColunmNameOpenCloseBin] = X['OpenCloseBin'].shift(i)
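For the Close Price frame at the top, a minimal sketch of the same idea for a general n (assuming the frame is named df and the column is 'Close Price' as shown; shift(i) pulls the value from i rows back):

import pandas as pd

n = 3  # number of past rows to use as features
lags = {'ClosePrice' if i == 0 else f'ClosePrice-{i}': df['Close Price'].shift(i)
        for i in range(n + 1)}
# join the shifted copies side by side; the first n rows have no full history
X = pd.concat(lags, axis=1).dropna()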

Related

Create Multiple DataFrames using Rolling Window from DataFrame Timestamps

I have one year's worth of data at four-minute intervals. I need to always load 24 hours of data and run a function on this dataframe at intervals of eight hours. I need to repeat this process for all the data between 2021's start and end dates.
For example:
Load year_df containing ranges between 2021-01-01 00:00:00 and 2021-01-01 23:56:00 and run a function on this.
Load year_df containing ranges between 2021-01-01 08:00:00 and 2021-01-02 07:56:00 and run a function on this.
Load year_df containing ranges between 2021-01-01 16:00:00 and 2021-01-02 15:56:00 and run a function on this.
# Proxy DataFrame
start = pd.to_datetime('2021-01-01 00:00:00')
end = pd.to_datetime('2021-12-31 23:56:00')
myIndex = pd.date_range(start, end, freq='4T')
year_df = pd.DataFrame({'Timestamp': myIndex})
year_df.head()
Timestamp
0 2021-01-01 00:00:00
1 2021-01-01 00:04:00
2 2021-01-01 00:08:00
3 2021-01-01 00:12:00
4 2021-01-01 00:16:00
This approach avoids explicit for loops, but the apply method is essentially a for loop under the hood, so it is not that efficient. Until more functionality based on rolling datetime windows is introduced to pandas, though, this might be the only option.
The example uses the mean of the timestamps. Knowing exactly what function you want to apply may help with a better answer.
s = pd.Series(myIndex, index=myIndex)

def myfunc(e):
    temp = s[s.between(e, e + pd.Timedelta("24h"))]
    return temp.mean()

s.apply(myfunc)
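As written, apply evaluates the window at every four-minute stamp. To run it only at the eight-hour offsets the question asks for, a hedged sketch reusing s and myfunc from above (8 h / 4 min = 120 rows per step):

starts = pd.Series(s.index[::120], index=s.index[::120])  # one start every 8 hours
results = starts.apply(myfunc)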

How to match time series in python?

I have two high-frequency time series covering 3 months of data.
The problem is that one goes from 15:30 to 23:00, the other from 01:00 to 00:00.
Is there any way to match the two time series, discarding the extra data, in order to run some regression analysis?
You can use the combine_first function of pandas Series. This function selects the element of the calling object when both series contain the same index.
The following code shows a minimal example:
idx1 = pd.date_range('2018-01-01', periods=5, freq='H')
idx2 = pd.date_range('2018-01-01 01:00', periods=5, freq='H')
ts1 = pd.Series(range(len(idx1)), index=idx1)
ts2 = pd.Series(range(len(idx2)), index=idx2)
ts1.combine_first(ts2)
This gives a Series with the content:
2018-01-01 00:00:00 0.0
2018-01-01 01:00:00 1.0
2018-01-01 02:00:00 2.0
2018-01-01 03:00:00 3.0
2018-01-01 04:00:00 4.0
2018-01-01 05:00:00 4.0
For more complex combinations you can use combine.
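Since the goal is a regression on the overlap only, a hedged alternative sketch that keeps just the timestamps present in both series (ts1 and ts2 as above):

common = ts1.index.intersection(ts2.index)  # timestamps shared by both series
ts1_matched = ts1[common]
ts2_matched = ts2[common]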

How to find the min value between two different datetime using Pandas?

(Not duplicate / my question is entirely different)
My dataframe looks like this:
# [df2] is day based
time time2
2017-01-01, 2017-01-01 00:12:00
2017-01-02, 2017-01-02 03:15:00
2017-01-03, 2017-01-03 01:25:00
2017-01-04, 2017-01-04 04:12:00
2017-01-05, 2017-01-05 00:45:00
....
# [df] is minute based
time value
2017-01-01 00:01:00, 0.1232
2017-01-01 00:02:00, 0.1232
2017-01-01 00:03:00, 0.1232
2017-01-01 00:04:00, 0.1232
2017-01-01 00:05:00, 0.1232
....
I want to create a new column called time_val_min in [df2] that holds the minimum from [df] within the range specified by df2['time'] and df2['time2'].
What did I do?
I did df2['time_val_min'] = df[df['time'].dt.hour.between(df2['time'], df2['time'])].min() but it does not work.
Could you please let me know how to fix it?
You can merge the two data frames on date, and filter the time:
# create the date from the time column
df['date'] = df['time'].dt.normalize()

# merge
new_df = (df.merge(df2, left_on='date',   # left on date
                   right_on='time',       # right on time, if time is purely beginning of days
                   how='right',
                   suffixes=['', '_y'])
            .query('time < time2')
            .groupby('date')['time'].min()
            .to_frame(name='time_val_min')
            .merge(df2, right_on='time', left_index=True)
         )
Output:
time_val_min time time2
0 2017-01-01 00:01:00 2017-01-01 2017-01-01 00:12:00
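A simpler, if slower, hedged row-wise sketch of the same lookup, one range of df2 at a time (Series.between is inclusive on both ends by default):

df2['time_val_min'] = [
    df.loc[df['time'].between(t, t2), 'time'].min()
    for t, t2 in zip(df2['time'], df2['time2'])
]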

Calculate the sum between the fixed time range using Pandas

My dataset looks like this:
time Open
2017-01-01 00:00:00 1.219690
2017-01-01 01:00:00 1.688490
2017-01-01 02:00:00 1.015285
2017-01-01 03:00:00 1.357672
2017-01-01 04:00:00 1.293786
2017-01-01 05:00:00 1.040048
2017-01-01 06:00:00 1.225080
2017-01-01 07:00:00 1.145402
...., ....
2017-12-31 23:00:00 1.145402
I want to find the sum over the specified time range and save it to a new dataframe.
Let's say I want to find the sum between 2017-01-01 22:00:00 and 2017-01-02 04:00:00. This is a sum over 6 hours spanning 2 days: the time range 10 PM to 4 AM of the next day. I want to put the result in a different data frame, for example df_timerange_sum. Please note that the sum covers two different dates.
What did I do?
I used sum() to calculate the time range like this: df[~df['time'].dt.hour.between(10, 4)].sum() but it gives me the sum of the whole df, not of the time range I specified.
I also tried resample but cannot find a way to make it time-specific.
df['time'].dt.hour.between(10, 4) is always False because no number is larger than 10 and smaller than 4 at the same time. What you want is to mark between(4,21) and then negate that to get the other hours.
Here's what I would do:
# mark the rows between 4AM and 10PM;
# the data we want is where s == False, i.e. ~s
s = df['time'].dt.hour.between(4, 21)

# s.cumsum() marks each consecutive False block,
# over which we will take the sum
blocks = s.cumsum()

# again we only care for ~s
(df[~s].groupby(blocks[~s], as_index=False)  # we don't need the blocks as index
       .agg({'time': 'min', 'Open': 'sum'})  # time: min -- the beginning of each block
)                                            # Open: sum -- the sum of Open in the block
Output for random data:
time Open
0 2017-01-01 00:00:00 1.282701
1 2017-01-01 22:00:00 2.766324
2 2017-01-02 22:00:00 2.838216
3 2017-01-03 22:00:00 4.151461
4 2017-01-04 22:00:00 2.151626
5 2017-01-05 22:00:00 2.525190
6 2017-01-06 22:00:00 0.798234
An alternative (in my opinion more straightforward) approach that accomplishes the same thing. There are definitely ways to reduce the code, but I am also relatively new to pandas:
df.set_index(['time'], inplace=True)  # make time the index col (not 100% necessary)

# new df that stores your desired output + start and end times if you need them
df2 = pd.DataFrame(columns=['start_time', 'end_time', 'sum_Open'])
df2['start_time'] = df[df.index.hour == 22].index  # gets/stores all start datetimes
df2['end_time'] = df[df.index.hour == 4].index     # gets/stores all end datetimes

for i, row in df2.iterrows():
    # set_value has been removed from pandas; .loc does the same assignment
    df2.loc[i, 'sum_Open'] = df[(df.index >= row['start_time']) &
                                (df.index <= row['end_time'])]['Open'].sum()
You'd have to shift the end times by one day so each 22:00 start pairs with the following day's 04:00, and add an if statement or something to handle the last day, which ends at 11pm.
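Another hedged sketch of the same computation using between_time (assuming df is indexed by its datetime column, as after the set_index above): keep only the 10 PM-4 AM rows, then label each row with the date its night started on by shifting every stamp back 5 hours before normalizing. The partial window at the very start of the data will show up under the prior date.

night = df.between_time('22:00', '04:00')                   # rows from 10PM through 4AM
night_key = (night.index - pd.Timedelta('5h')).normalize()  # the date each night began
df_timerange_sum = night['Open'].groupby(night_key).sum().to_frame()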

Add dataframe columns in python, each multiplied by a number from an array

How can I multiply an array with the columns of a dataframe and then sum these columns into a new column of the dataframe?
I tried the code below but somehow get wrong numbers:
AAPL Portfolio ACN
Date
2017-01-03 116.150002 1860.880008 116.459999
2017-01-04 116.019997 1862.079960 116.739998
2017-01-05 116.610001 1852.799992 114.989998
2017-01-06 117.910004 1873.680056 116.300003
...
How it should look like is the following:
AAPL Portfolio ACN
Date
2017-01-03 116.150002 1046.900003 116.459999
2017-01-04 116.019997 1047.779978 116.739998
2017-01-05 116.610001 1041.389994 114.989998
2017-01-06 117.910004 1053.140031 116.300003
...
The code looks like the following. I might be overcomplicating this:
import pandas_datareader.data as pdr
import pandas as pd
import datetime

start = datetime.datetime(2017, 1, 1)
end = datetime.datetime(2017, 3, 17)

ticker_list = ["AAPL", "ACN"]
position_size = [4, 5]

for i in range(0, len(ticker_list)):
    DataInitial = pdr.DataReader(ticker_list[i], 'yahoo', start, end)
    ClosingPrices[ticker_list[i]] = DataInitial[['Close']]
    ClosingPrices['Portfolio'] = ClosingPrices['Portfolio'] + ClosingPrices[ticker_list[i]] * position_size[i]

print(ClosingPrices)
What I want is actually:
2017-01-03: 116.150002*4 + 116.459999*5
2017-01-04: 116.019997*4 + 116.739998*5
etc...
If you need:
2017-01-03: 116.150002*4 + 116.459999*5
2017-01-04: 116.019997*4 + 116.739998*5
then multiply each column by its value from a dict via concat, and last sum all columns together:
ticker_list = ["AAPL","ACN"]
position_size = [4,5]
d = dict(zip(ticker_list,position_size))
print (pd.concat([ClosingPrices[col] * d[col] for col in ticker_list], axis=1))
AAPL ACN
Date
2017-01-03 400.000000 500.000000
2017-01-04 464.079988 583.699990
2017-01-05 466.440004 574.949990
2017-01-06 471.640016 581.500015
ClosingPrices['Portfolio'] = pd.concat([ClosingPrices[col] * d[col] for col in ticker_list],
                                       axis=1).sum(axis=1)
print (ClosingPrices)
AAPL Portfolio ACN
Date
2017-01-03 100.000000 900.000000 100.000000 <- for testing, values were changed to 100
2017-01-04 116.019997 1047.779978 116.739998
2017-01-05 116.610001 1041.389994 114.989998
2017-01-06 117.910004 1053.140031 116.300003
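A hedged one-line alternative to the concat, with the same d and ClosingPrices as above; DataFrame.mul aligns a Series of position sizes with the matching ticker columns:

ClosingPrices['Portfolio'] = ClosingPrices[ticker_list].mul(pd.Series(d)).sum(axis=1)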
