Simple Linear Regression Stock Price Prediction - python

This simple linear regression (LR) predicts the close price, but it doesn't go further than the end of the dataframe: I have the last closing price and, next to it, the prediction, but I want to know the next 10 closing prices, which of course I don't have yet because they are still coming. How do I see in the LR column the next 10 predictions without having the closing prices yet?
import pandas as pd
import talib as TAL

# get prices from the exchange
# (SESSION_DATA and TIMESTAMP come from the exchange client setup, not shown here)
prices = SESSION_DATA.query_kline(
    symbol='BTCUSDT',
    interval=60,                                # timeframe (1 hour)
    limit=200,                                  # number of candles
    from_time=(TIMESTAMP() - (200 * 60) * 60))  # from now go back 200 candles, 1 hour each

# pull the data into a dataframe
df = pd.DataFrame(prices['result'])
df = df[['open_time', 'open', 'high', 'low', 'close']].astype(float)
df['open_time'] = pd.to_datetime(df['open_time'], unit='s')
# df['open_time'] = pd.to_datetime(df['open_time']).strftime("%Y%m%d %I:%M:%S")
df.rename(columns={'open_time': 'Date'}, inplace=True)

# using TA-Lib: linear regression of the close over a 10-period window
prediction = TAL.LINEARREG(df['close'], 10)
df['LR'] = prediction
print(df)
Date open high low close LR
0 2022-10-06 14:00:00 20099.0 20116.5 19871.5 20099.0 NaN
1 2022-10-06 15:00:00 20099.0 20115.5 19987.0 20002.5 NaN
2 2022-10-06 16:00:00 20002.5 20092.0 19932.5 20050.0 NaN
3 2022-10-06 17:00:00 20050.0 20270.0 20002.5 20105.5 NaN
4 2022-10-06 18:00:00 20105.5 20106.0 19979.0 20010.5 NaN
5 2022-10-06 19:00:00 20010.5 20063.0 19985.0 20004.5 NaN
6 2022-10-06 20:00:00 20004.5 20064.5 19995.5 20042.5 NaN
7 2022-10-06 21:00:00 20042.5 20043.0 19878.5 19905.0 NaN
8 2022-10-06 22:00:00 19905.0 19944.0 19836.5 19894.0 NaN
9 2022-10-06 23:00:00 19894.0 19965.0 19851.0 19954.5 19925.527273
10 2022-10-07 00:00:00 19954.5 20039.5 19937.5 19984.5 19936.263636
11 2022-10-07 01:00:00 19984.5 20010.0 19957.0 19988.5 19935.327273
... I want the df to end this way:
188 2022-10-14 10:00:00 19639.0 19733.5 19621.0 19680.0 19623.827273
189 2022-10-14 11:00:00 19680.0 19729.0 19576.5 NaN 19592.990909
190 2022-10-14 12:00:00 19586.5 19835.0 19535.5 NaN 19638.054545
191 2022-10-14 13:00:00 19785.5 19799.0 19612.0 NaN 19637.463636
192 2022-10-14 14:00:00 19656.5 19656.5 19334.5 NaN 19574.572727
193 2022-10-14 15:00:00 19455.0 19507.5 19303.5 NaN 19493.990909
194 2022-10-14 16:00:00 19351.0 19390.0 19220.0 NaN 19416.154545
195 2022-10-14 17:00:00 19296.5 19369.5 19284.5 NaN 19356.072727
196 2022-10-14 18:00:00 19358.0 19358.0 19127.5 NaN 19253.918182
197 2022-10-14 19:00:00 19208.5 19264.5 19100.0 NaN 19164.745455
198 2022-10-14 20:00:00 19164.0 19211.0 19114.0 NaN 19112.445455
199 2022-10-14 21:00:00 19172.0 19201.0 19125.0 NaN 19067.772727

Since linear regression is just ax + b, the 10 further predictions would repeat themselves, because you don't have any more input to alter the predictions besides the close price. I think you are actually looking for a Monte Carlo simulation, which would try to predict stock market prices based on the random-walk hypothesis.
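If you still want to extend the LR column past the last candle anyway, here is a minimal sketch (assuming the df, 'Date', 'close' and 'LR' names from the question; this is not the accepted answer): fit the last window with numpy.polyfit and extrapolate the straight line 10 candles forward. As noted above, the result is just that line continued.

import numpy as np
import pandas as pd

N = 10                                       # lookback window, same as LINEARREG above
y = df['close'].iloc[-N:].to_numpy()
x = np.arange(N)
a, b = np.polyfit(x, y, 1)                   # slope and intercept of y = a*x + b

future_x = np.arange(N, N + 10)              # the next 10 (not yet existing) candles
future_lr = a * future_x + b

# append 10 rows with no close yet, only the extrapolated LR value
future_dates = pd.date_range(df['Date'].iloc[-1], periods=11, freq='H')[1:]
future = pd.DataFrame({'Date': future_dates, 'close': np.nan, 'LR': future_lr})
df = pd.concat([df, future], ignore_index=True)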

Related

How to apply a condition to Pandas dataframe rows, but only apply the condition to rows of the same day?

I have a dataframe that's indexed by datetime and has one column of integers and another column where I want to put a string if a condition on the integers is met. I need the condition to assess the integer in row X against the integer in row X-1, but only if both rows are on the same day.
I am currently using the condition:
df.loc[(df['IntCol'] > df['IntCol'].shift(periods=1)), 'StringCol'] = 'Success'
This successfully applies my condition; however, if the shifted row is on a different day the condition will still use it, and I want it to ignore any rows that are on a different day. I've tried various iterations of groupby(df.index.date) but can't seem to figure out whether that will work or not.
Not sure if this is the best way to do it, but it gets you the answer:
df['StringCol'] = np.where(df['IntCol'] > df.groupby(df.index.date)['IntCol'].shift(1), 'Success', 'Failure')
I think this is what you want. You were probably closer to the answer than you thought...
Two dataframes are used below to show that your logic works whether the data is random or the integers are a sorted range.
You will need to import random to reproduce the second example.
import random
import pandas as pd

dates = list(pd.date_range(start='2021/1/1', periods=16, freq='4H'))

def compare(x):
    x.loc[(x['IntCol'] > x['IntCol'].shift(periods=1)), 'StringCol'] = 'Success'
    return x

# Will show Success in every row except where the date changes,
# because the integers are a range in numerical order
df = pd.DataFrame({'IntCol': range(10, 26)}, index=dates)
df.groupby(df.index.date).apply(compare)
IntCol StringCol
2021-01-01 00:00:00 10 NaN
2021-01-01 04:00:00 11 Success
2021-01-01 08:00:00 12 Success
2021-01-01 12:00:00 13 Success
2021-01-01 16:00:00 14 Success
2021-01-01 20:00:00 15 Success
2021-01-02 00:00:00 16 NaN
2021-01-02 04:00:00 17 Success
2021-01-02 08:00:00 18 Success
2021-01-02 12:00:00 19 Success
2021-01-02 16:00:00 20 Success
2021-01-02 20:00:00 21 Success
2021-01-03 00:00:00 22 NaN
2021-01-03 04:00:00 23 Success
2021-01-03 08:00:00 24 Success
2021-01-03 12:00:00 25 Success
### random numbers to show that it works here too
df = pd.DataFrame({'IntCol': [random.randint(3, 500) for x in range(0,16)]}, index=dates)
df.groupby(df.index.date).apply(compare)
IntCol StringCol
2021-01-01 00:00:00 386 NaN
2021-01-01 04:00:00 276 NaN
2021-01-01 08:00:00 143 NaN
2021-01-01 12:00:00 144 Success
2021-01-01 16:00:00 10 NaN
2021-01-01 20:00:00 343 Success
2021-01-02 00:00:00 424 NaN
2021-01-02 04:00:00 362 NaN
2021-01-02 08:00:00 269 NaN
2021-01-02 12:00:00 35 NaN
2021-01-02 16:00:00 278 Success
2021-01-02 20:00:00 268 NaN
2021-01-03 00:00:00 58 NaN
2021-01-03 04:00:00 169 Success
2021-01-03 08:00:00 85 NaN
2021-01-03 12:00:00 491 Success

How to use pd.interpolate to fill gaps with only one missing data point

I have time series data for air pollution with several gaps of missing values, like this:
Date AMB_TEMP CO PM10 PM2.5
2010-01-01 0 8 10 ... 15
2010-01-01 1 10 15 ... 20
...
2010-01-02 0 5 ...
2010-01-02 1 ... 20
...
2010-01-03 1 4 13 ... 34
To specify, here's the data link: shorturl.at/blBN1
The gaps consist of several consecutive or non-consecutive NAs, and there are some helpful statistics computed with R, like:
Length of time series: 87648
Number of missing values: 746
Percentage of missing values: 0.85 %
Number of gaps: 136
Average gap size: 5.485294
Longest NA gap (series of consecutive NAs): 32
Most frequent gap size (series of consecutive NAs): 1 (occurring 50 times)
Generally, if I use df.interpolate(limit=1),
gaps with more than one missing value get (partially) interpolated as well.
So I guess a better way to interpolate only the gaps with a single missing value is to get the gap id.
To do so, I grouped the different gap sizes using the following code:
cum = df.notna().cumsum()
cum[cum.duplicated()]
and got the result:
PM2.5
2019-01-09 13:00:00 205
2019-01-10 15:00:00 230
2019-01-10 16:00:00 230
2019-01-16 11:00:00 368
2019-01-23 14:00:00 538
...
2019-12-02 10:00:00 7971
2019-12-10 09:00:00 8161
2019-12-16 15:00:00 8310
2019-12-24 12:00:00 8498
2019-12-31 10:00:00 8663
How can I get the index of the first missing value in each gap, together with the gap size, like this?
PM2.5 gap size
2019-01-09 13:00:00 1
2019-01-10 15:00:00 2
2019-01-16 11:00:00 1
2019-01-23 14:00:00 1
...
2019-12-02 10:00:00 1
2019-12-10 09:00:00 1
2019-12-16 15:00:00 1
2019-12-24 12:00:00 1
2019-12-31 10:00:00 1
but when I used cum[cum.duplicated()].groupby(cum[cum.duplicated()]).count(), the index was lost.
Are there better solutions for this?
Or how can I interpolate case by case?
Can anyone help me?
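Since no answer is shown for this one, here is a minimal sketch of one way to do both things (not from the page; the 'PM2.5' column name is taken from the example output, everything else is an assumption): label each run of consecutive NaNs, record its size, list the first missing value of each gap, and only fill the gaps of size 1.

import pandas as pd

s = df['PM2.5']                               # one column from the question (assumed name)
isna = s.isna()
gap_id = (~isna).cumsum()[isna]               # consecutive NaNs share one id
gap_size = gap_id.map(gap_id.value_counts())  # gap length at every NaN position

# first missing value of each gap, with the gap size (the frame asked for)
first_of_gap = gap_size[~gap_id.duplicated()].rename('gap size')

# interpolate only the gaps that consist of exactly one NaN
single = gap_size[gap_size == 1].index
s_filled = s.copy()
s_filled.loc[single] = s.interpolate().loc[single]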

How to find occurrence of consecutive events in python timeseries data frame?

I have got a time series of meteorological observations with date and value columns:
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['11/10/2017 0:00', '11/10/2017 03:00', '11/10/2017 06:00', '11/10/2017 09:00', '11/10/2017 12:00',
                            '11/11/2017 0:00', '11/11/2017 03:00', '11/11/2017 06:00', '11/11/2017 09:00', '11/11/2017 12:00',
                            '11/12/2017 00:00', '11/12/2017 03:00', '11/12/2017 06:00', '11/12/2017 09:00', '11/12/2017 12:00'],
                   'value': [850, np.nan, np.nan, np.nan, np.nan, 500, 650, 780, np.nan, 800, 350, 690, 780, np.nan, np.nan],
                   'consecutive_hour': [3, 0, 0, 0, 0, 3, 6, 9, 0, 3, 3, 6, 9, 0, 0]})
With this DataFrame, I want a third column, consecutive_hour, such that if the value at a particular timestamp is less than 1000, the corresponding consecutive_hour is 3 hours, and consecutive such occurrences accumulate to 6, 9, ... as above.
Lastly, I want to summarize the table by counting the occurrences of each consecutive-hours value and the number of days, so that the summary table looks like:
df_summary = pd.DataFrame({'consecutive_hours': [3, 6, 9, 12],
                           'number_of_day': [2, 0, 2, 0]})
I tried several online solutions and methods like shift(), diff(), etc., as mentioned in: How to groupby consecutive values in pandas DataFrame
and more, spent several days but no luck yet.
I would highly appreciate help on this issue.
Thanks!
Input data:
>>> df
date value
0 2017-11-10 00:00:00 850.0
1 2017-11-10 03:00:00 NaN
2 2017-11-10 06:00:00 NaN
3 2017-11-10 09:00:00 NaN
4 2017-11-10 12:00:00 NaN
5 2017-11-11 00:00:00 500.0
6 2017-11-11 03:00:00 650.0
7 2017-11-11 06:00:00 780.0
8 2017-11-11 09:00:00 NaN
9 2017-11-11 12:00:00 800.0
10 2017-11-12 00:00:00 350.0
11 2017-11-12 03:00:00 690.0
12 2017-11-12 06:00:00 780.0
13 2017-11-12 09:00:00 NaN
14 2017-11-12 12:00:00 NaN
The cumcount_reset function is adapted from this answer by @jezrael:
Python pandas cumsum with reset everytime there is a 0
# the dates must be real datetimes for pd.Grouper to work
df["date"] = pd.to_datetime(df["date"])

# running count of consecutive True values that resets at every False
cumcount_reset = \
    lambda b: b.cumsum().sub(b.cumsum().where(~b).ffill().fillna(0)).astype(int)

df["consecutive_hour"] = (df.set_index("date")["value"] < 1000) \
    .groupby(pd.Grouper(freq="D")) \
    .apply(lambda b: cumcount_reset(b)).mul(3) \
    .reset_index(drop=True)
Output result:
>>> df
date value consecutive_hour
0 2017-11-10 00:00:00 850.0 3
1 2017-11-10 03:00:00 NaN 0
2 2017-11-10 06:00:00 NaN 0
3 2017-11-10 09:00:00 NaN 0
4 2017-11-10 12:00:00 NaN 0
5 2017-11-11 00:00:00 500.0 3
6 2017-11-11 03:00:00 650.0 6
7 2017-11-11 06:00:00 780.0 9
8 2017-11-11 09:00:00 NaN 0
9 2017-11-11 12:00:00 800.0 3
10 2017-11-12 00:00:00 350.0 3
11 2017-11-12 03:00:00 690.0 6
12 2017-11-12 06:00:00 780.0 9
13 2017-11-12 09:00:00 NaN 0
14 2017-11-12 12:00:00 NaN 0
Summary table
df_summary = df.loc[df.groupby(pd.Grouper(key="date", freq="D"))["consecutive_hour"]
                      .apply(lambda h: (h - h.shift(-1).fillna(0)) > 0),
                    "consecutive_hour"] \
    .value_counts().reindex([3, 6, 9, 12], fill_value=0) \
    .rename("number_of_day") \
    .rename_axis("consecutive_hour") \
    .reset_index()
>>> df_summary
consecutive_hour number_of_day
0 3 2
1 6 0
2 9 2
3 12 0
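As an aside (not part of the original answer), a quick illustration of what cumcount_reset does on its own: it turns a boolean series into a running count of consecutive True values that restarts at every False.

import pandas as pd

cumcount_reset = \
    lambda b: b.cumsum().sub(b.cumsum().where(~b).ffill().fillna(0)).astype(int)

b = pd.Series([True, True, False, True, True, True])
print(cumcount_reset(b).tolist())   # [1, 2, 0, 1, 2, 3]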

find specific value that meets conditions - python

Trying to create a new column with values that meet specific conditions. Below I have set out code which goes some way toward explaining the logic but does not produce the correct output:
import pandas as pd
import numpy as np
df = pd.DataFrame({'date': ['2019-08-06 09:00:00', '2019-08-06 12:00:00', '2019-08-06 18:00:00', '2019-08-06 21:00:00',
                            '2019-08-07 09:00:00', '2019-08-07 16:00:00', '2019-08-08 17:00:00', '2019-08-09 16:00:00'],
                   'type': [0, 1, np.nan, 1, np.nan, np.nan, 0, 0],
                   'colour': ['blue', 'red', np.nan, 'blue', np.nan, np.nan, 'blue', 'red'],
                   'maxPixel': [255, 7346, 32, 5184, 600, 322, 72, 6000],
                   'minPixel': [86, 96, 14, 3540, 528, 300, 12, 4009],
                   'colourDate': ['2019-08-06 12:00:00', '2019-08-08 16:00:00', '2019-08-06 23:00:00', '2019-08-06 22:00:00',
                                  '2019-08-08 09:00:00', '2019-08-09 16:00:00', '2019-08-08 23:00:00', '2019-08-11 16:00:00']})

max_conditions = [(df['type'] == 1) & (df['colour'] == 'blue'),
                  (df['type'] == 1) & (df['colour'] == 'red')]

max_choices = [np.where(df['date'] <= df['colourDate'], max(df['maxPixel']), np.nan),
               np.where(df['date'] <= df['colourDate'], min(df['minPixel']), np.nan)]

df['pixelLimit'] = np.select(max_conditions, max_choices, default=np.nan)
Incorrect output:
date type colour maxPixel minPixel colourDate pixelLimit
0 2019-08-06 09:00:00 0.0 blue 255 86 2019-08-06 12:00:00 NaN
1 2019-08-06 12:00:00 1.0 red 7346 96 2019-08-08 16:00:00 12.0
2 2019-08-06 18:00:00 NaN NaN 32 14 2019-08-06 23:00:00 NaN
3 2019-08-06 21:00:00 1.0 blue 5184 3540 2019-08-06 22:00:00 6000.0
4 2019-08-07 09:00:00 NaN NaN 600 528 2019-08-08 09:00:00 NaN
5 2019-08-07 16:00:00 NaN NaN 322 300 2019-08-09 16:00:00 NaN
6 2019-08-08 17:00:00 0.0 blue 72 12 2019-08-08 23:00:00 NaN
7 2019-08-09 16:00:00 0.0 red 6000 4009 2019-08-11 16:00:00 NaN
Explanation why output is incorrect:
Value 12.0 in index row 1 for column df['pixelLimit'] is incorrect because this value is from df['minPixel'] index row 6, which has a df['date'] datetime of 2019-08-08 17:00:00, which is greater than the 2019-08-08 16:00:00 df['colourDate'] datetime contained in index row 1.
Value 6000.0 in index row 3 for column df['pixelLimit'] is incorrect because this value is from df['maxPixel'] index row 7, which has a df['date'] datetime of 2019-08-09 16:00:00, which is greater than the 2019-08-06 22:00:00 df['colourDate'] datetime contained in index row 3.
Correct output:
date type colour maxPixel minPixel colourDate pixelLimit
0 2019-08-06 09:00:00 0.0 blue 255 86 2019-08-06 12:00:00 NaN
1 2019-08-06 12:00:00 1.0 red 7346 96 2019-08-08 16:00:00 14.0
2 2019-08-06 18:00:00 NaN NaN 32 14 2019-08-06 23:00:00 NaN
3 2019-08-06 21:00:00 1.0 blue 5184 3540 2019-08-06 22:00:00 5184.0
4 2019-08-07 09:00:00 NaN NaN 600 528 2019-08-08 09:00:00 NaN
5 2019-08-07 16:00:00 NaN NaN 322 300 2019-08-09 16:00:00 NaN
6 2019-08-08 17:00:00 0.0 blue 72 12 2019-08-08 23:00:00 NaN
7 2019-08-09 16:00:00 0.0 red 6000 4009 2019-08-11 16:00:00 NaN
Explanation why output is correct:
Value 14.0 in index row 1 for column df['pixelLimit'] is correct because we are looking for the minimum value in column df['minPixel'] which has a datetime in column df['date'] less than the datetime in index row 1 for column df['colourDate'] and greater or equal to the datetime in index row 1 for column df['date']
Value 5184.0 in index row 3 for column df['pixelLimit'] is correct because we are looking for the maximum value in column df['maxPixel'] which has a datetime in column df['date'] less than the datetime in index row 3 for column df['colourDate'] and greater or equal to the datetime in index row 3 for column df['date']
Considerations:
Maybe np.select is not best suited for this task and some sort of function might serve the task better?
Also, maybe I need to create some sort of dynamic len to use as a starting point for each row?
Request
Please can anyone out there help me amend my code to achieve the correct output?
For matching problems like this, one possibility is to do the complete merge, then subset (using a Boolean Series) to all rows that satisfy your condition for that row, and find the max or min among all the possible matches. Since this requires slightly different columns and different functions, I split the operations into two very similar pieces of code: one to deal with 1/blue and the other with 1/red.
First some housekeeping, make things datetime
import pandas as pd
df['date'] = pd.to_datetime(df['date'])
df['colourDate'] = pd.to_datetime(df['colourDate'])
Calculate the min pixel for 1/red between the times for each row
# Subset of rows we need to do this for
dfmin = df[df.type.eq(1) & df.colour.eq('red')].reset_index()
# To each row merge all rows from the original DataFrame
dfmin = dfmin.merge(df[['date', 'minPixel']], how='cross')
# If pd.version < 1.2 instead use:
#dfmin = dfmin.assign(t=1).merge(df[['date', 'minPixel']].assign(t=1), on='t')
# Only keep rows between the dates, then among those find the min minPixel
smin = (dfmin[dfmin.date_y.between(dfmin.date_x, dfmin.colourDate)]
        .groupby('index')['minPixel_y'].min()
        .rename('pixel_limit'))
#index
#1 14
#Name: pixel_limit, dtype: int64
# Max is basically a mirror
dfmax = df[df.type.eq(1) & df.colour.eq('blue')].reset_index()
dfmax = dfmax.merge(df[['date', 'maxPixel']], how='cross')
#dfmax = dfmax.assign(t=1).merge(df[['date', 'maxPixel']].assign(t=1), on='t')
smax = (dfmax[dfmax.date_y.between(dfmax.date_x, dfmax.colourDate)]
        .groupby('index')['maxPixel_y'].max()
        .rename('pixel_limit'))
Finally, because the above groups over the original index (i.e. 'index'), we can simply assign back to align with the original DataFrame.
df['pixel_limit'] = pd.concat([smin, smax])
date type colour maxPixel minPixel colourDate pixel_limit
0 2019-08-06 09:00:00 0.0 blue 255 86 2019-08-06 12:00:00 NaN
1 2019-08-06 12:00:00 1.0 red 7346 96 2019-08-08 16:00:00 14.0
2 2019-08-06 18:00:00 NaN NaN 32 14 2019-08-06 23:00:00 NaN
3 2019-08-06 21:00:00 1.0 blue 5184 3540 2019-08-06 22:00:00 5184.0
4 2019-08-07 09:00:00 NaN NaN 600 528 2019-08-08 09:00:00 NaN
5 2019-08-07 16:00:00 NaN NaN 322 300 2019-08-09 16:00:00 NaN
6 2019-08-08 17:00:00 0.0 blue 72 12 2019-08-08 23:00:00 NaN
7 2019-08-09 16:00:00 0.0 red 6000 4009 2019-08-11 16:00:00 NaN
If you need to bring along a lot of different information for the row with the min/max pixel, then instead of a groupby min/max we will sort_values and then groupby + head or tail to get the min or max pixel row. For the min this would look like (slight renaming of suffixes):
# Subset of rows we need to do this for
dfmin = df[df.type.eq(1) & df.colour.eq('red')].reset_index()
# To each row merge all rows from the original DataFrame
dfmin = dfmin.merge(df[['date', 'minPixel']].reset_index(), how='cross',
                    suffixes=['', '_match'])
# For older pandas < 1.2
#dfmin = (dfmin.assign(t=1)
#              .merge(df[['date', 'minPixel']].reset_index().assign(t=1),
#                     on='t', suffixes=['', '_match']))
# Only keep rows between the dates, then among those find the min minPixel row.
# A bunch of renaming.
smin = (dfmin[dfmin.date_match.between(dfmin.date, dfmin.colourDate)]
        .sort_values('minPixel_match', ascending=True)
        .groupby('index').head(1)
        .set_index('index')
        .filter(like='_match')
        .rename(columns={'minPixel_match': 'pixel_limit'}))
The Max would then be similar using .tail
dfmax = df[df.type.eq(1) & df.colour.eq('blue')].reset_index()
dfmax = dfmax.merge(df[['date', 'maxPixel']].reset_index(), how='cross',
                    suffixes=['', '_match'])
smax = (dfmax[dfmax.date_match.between(dfmax.date, dfmax.colourDate)]
        .sort_values('maxPixel_match', ascending=True)
        .groupby('index').tail(1)
        .set_index('index')
        .filter(like='_match')
        .rename(columns={'maxPixel_match': 'pixel_limit'}))
And finally we concat along axis=1 now that we need to join multiple columns to the original:
result = pd.concat([df, pd.concat([smin, smax])], axis=1)
date type colour maxPixel minPixel colourDate index_match date_match pixel_limit
0 2019-08-06 09:00:00 0.0 blue 255 86 2019-08-06 12:00:00 NaN NaN NaN
1 2019-08-06 12:00:00 1.0 red 7346 96 2019-08-08 16:00:00 2.0 2019-08-06 18:00:00 14.0
2 2019-08-06 18:00:00 NaN NaN 32 14 2019-08-06 23:00:00 NaN NaN NaN
3 2019-08-06 21:00:00 1.0 blue 5184 3540 2019-08-06 22:00:00 3.0 2019-08-06 21:00:00 5184.0
4 2019-08-07 09:00:00 NaN NaN 600 528 2019-08-08 09:00:00 NaN NaN NaN
5 2019-08-07 16:00:00 NaN NaN 322 300 2019-08-09 16:00:00 NaN NaN NaN
6 2019-08-08 17:00:00 0.0 blue 72 12 2019-08-08 23:00:00 NaN NaN NaN
7 2019-08-09 16:00:00 0.0 red 6000 4009 2019-08-11 16:00:00 NaN NaN NaN
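Since the question also wonders whether "some sort of function might serve the task better", here is a hedged, readability-first sketch (not from the page, and slower than the merge approach above): compute the limit row by row with apply, mirroring the inclusive between() used above. It assumes date and colourDate have already been converted to datetimes.

import numpy as np

def pixel_limit(row, data):
    # only type 1 rows get a limit
    if row['type'] != 1:
        return np.nan
    # all rows whose date falls between this row's date and its colourDate
    window = data[(data['date'] >= row['date']) & (data['date'] <= row['colourDate'])]
    if row['colour'] == 'blue':
        return window['maxPixel'].max()
    if row['colour'] == 'red':
        return window['minPixel'].min()
    return np.nan

df['pixelLimit'] = df.apply(pixel_limit, axis=1, args=(df,))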

Optimize code to find the median of values of past 4 to 6 days for each row in a DataFrame

Given a dataframe of timestamp data, I would like to compute the median of certain variable of past 4-6 days.
The median of the past 1-3 days can be computed with pandas.DataFrame.rolling, but I couldn't find how to use rolling to compute the median of the past 4-6 days.
import pandas as pd
import numpy as np
import datetime
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='6H')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
np.random.seed(1)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
Data looks like this. In my real data, there are gaps in time and maybe more data points in one day.
timestamp var
0 2011-01-01 00:00:00 1.624345
1 2011-01-01 06:00:00 -0.611756
2 2011-01-01 12:00:00 -0.528172
3 2011-01-01 18:00:00 -1.072969
4 2011-01-02 00:00:00 0.865408
5 2011-01-02 06:00:00 -2.301539
6 2011-01-02 12:00:00 1.744812
7 2011-01-02 18:00:00 -0.761207
8 2011-01-03 00:00:00 0.319039
9 2011-01-03 06:00:00 -0.249370
10 2011-01-03 12:00:00 1.462108
Desired output:
timestamp var past4d-6d_var_median
0 2011-01-01 00:00:00 1.624345 NaN # no data in past 4-6 days
1 2011-01-01 06:00:00 -0.611756 NaN # no data in past 4-6 days
2 2011-01-01 12:00:00 -0.528172 NaN # no data in past 4-6 days
3 2011-01-01 18:00:00 -1.072969 NaN # no data in past 4-6 days
4 2011-01-02 00:00:00 0.865408 NaN # no data in past 4-6 days
5 2011-01-02 06:00:00 -2.301539 NaN # no data in past 4-6 days
6 2011-01-02 12:00:00 1.744812 NaN # no data in past 4-6 days
7 2011-01-02 18:00:00 -0.761207 NaN # no data in past 4-6 days
8 2011-01-03 00:00:00 0.319039 NaN # no data in past 4-6 days
9 2011-01-03 06:00:00 -0.249370 NaN # no data in past 4-6 days
10 2011-01-03 12:00:00 1.462108 NaN # no data in past 4-6 days
11 2011-01-03 18:00:00 -2.060141 NaN # no data in past 4-6 days
12 2011-01-04 00:00:00 -0.322417 NaN # no data in past 4-6 days
13 2011-01-04 06:00:00 -0.384054 NaN # no data in past 4-6 days
14 2011-01-04 12:00:00 1.133769 NaN # no data in past 4-6 days
15 2011-01-04 18:00:00 -1.099891 NaN # no data in past 4-6 days
16 2011-01-05 00:00:00 -0.172428 NaN # only 4 data in past 4-6 days
17 2011-01-05 06:00:00 -0.877858 -0.528172
18 2011-01-05 12:00:00 0.042214 -0.569964
19 2011-01-05 18:00:00 0.582815 -0.528172
20 2011-01-06 00:00:00 -1.100619 -0.569964
21 2011-01-06 06:00:00 1.144724 -0.528172
22 2011-01-06 12:00:00 0.901591 -0.388771
23 2011-01-06 18:00:00 0.502494 -0.249370
My current code:
def findPastVar2(df, var='var', window=3, method='median'):
    # window = number of past days
    for i in range(len(df)):  # the original used Python 2's xrange
        pastVar2 = df[var].loc[(df['timestamp'] - df['timestamp'].loc[i] < datetime.timedelta(days=-window)) &
                               (df['timestamp'] - df['timestamp'].loc[i] >= datetime.timedelta(days=-window*2))]
        if pastVar2.shape[0] >= 5:  # at least 5 data points
            if method == 'median':
                df.loc[i, 'past{}d-{}d_{}_median'.format(window+1, window*2, var)] = np.median(pastVar2.values)
    return df
Current speed:
In [35]: %timeit df2 = findPastVar2(df)
1 loop, best of 3: 821 ms per loop
I edited the post so that I can clearly show my expected output with at least 5 data points. I've set the random seed so that everyone should be able to get the same input and show the same output. As far as I know, simple rolling and shift do not work for the case of multiple data points in the same day.
here we go:
df.set_index('timestamp', inplace=True)
# 3-day time-based rolling median; shift the index forward by 4 days and then
# move the values back one row so they line up with the target timestamps
df['var'] = df['var'].rolling('3D', min_periods=3).median().shift(freq=pd.Timedelta('4d')).shift(-1)
df['var']
Out[55]:
timestamp
2011-01-01 00:00:00 NaN
2011-01-01 06:00:00 NaN
2011-01-01 12:00:00 NaN
2011-01-01 18:00:00 NaN
2011-01-02 00:00:00 NaN
2011-01-02 06:00:00 NaN
2011-01-02 12:00:00 NaN
2011-01-02 18:00:00 NaN
2011-01-03 00:00:00 NaN
2011-01-03 06:00:00 NaN
2011-01-03 12:00:00 NaN
2011-01-03 18:00:00 NaN
2011-01-04 00:00:00 NaN
2011-01-04 06:00:00 NaN
2011-01-04 12:00:00 NaN
2011-01-04 18:00:00 NaN
2011-01-05 00:00:00 NaN
2011-01-05 06:00:00 -0.528172
2011-01-05 12:00:00 -0.569964
2011-01-05 18:00:00 -0.528172
2011-01-06 00:00:00 -0.569964
2011-01-06 06:00:00 -0.528172
2011-01-06 12:00:00 -0.569964
2011-01-06 18:00:00 -0.528172
2011-01-07 00:00:00 -0.388771
2011-01-07 06:00:00 -0.249370
2011-01-07 12:00:00 -0.388771
The way this is set up, each row of an irregular timeseries will have a different window width, thus requiring an iterative approach like the one you have started. But if we make the timeseries the index:
# setup the df:
df = pd.DataFrame(index = pd.date_range('1/1/2011', periods=100, freq='12H'))
df['var'] = np.random.randn(len(df))
In this case I chose an interval of every 12 hours, but it could be whatever is available, even irregular. Using a modified function with a window for the median, along with an offset (here, a positive Delta looks backwards), gives you the flexibility you wanted:
def GetMedian(df, var='var', window='2D', Delta='3D'):
    for Ti in df.index:
        Vals = df[(df.index < Ti - pd.Timedelta(Delta)) &
                  (df.index > Ti - pd.Timedelta(Delta) - pd.Timedelta(window))]
        df.loc[Ti, 'Medians'] = Vals[var].median()
    return df
This runs substantially faster:
%timeit GetMedian(df)
84.8 ms ± 3.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The min_periods should be 2 instead of 5 because you should not count the window size in (5 - 3 = 2).
import pandas as pd
import numpy as np
import datetime
np.random.seed(1) # set random seed for easier comparison
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='D')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
def first():
    df['past4d-6d_var_median'] = [np.nan]*3 + df.rolling(window=3, min_periods=2).median()[:-3]['var'].tolist()
    return df
 
%timeit -n1000 first()
1000 loops, best of 3: 6.23 ms per loop
My first try didn't use shift(), but then I saw Noobie's answer.
I made the following one with shift(), which is much faster than the previous one.
def test():
    df['past4d-6d_var_median'] = df['var'].rolling(window=3, min_periods=2).median().shift(3)
    return df
 
%timeit -n1000 test()
1000 loops, best of 3: 1.66 ms per loop
The second one is around 4 times as fast as the first one.
These two functions create the same result, which looks like this:
df2 = test()
df2
timestamp var past4d-6d_var_median
0 2011-01-01 00:00:00 1.624345 NaN
1 2011-01-02 00:00:00 -0.611756 NaN
2 2011-01-03 00:00:00 -0.528172 NaN
3 2011-01-04 00:00:00 -1.072969 NaN
4 2011-01-05 00:00:00 0.865408 0.506294
5 2011-01-06 00:00:00 -2.301539 -0.528172
6 2011-01-07 00:00:00 1.744812 -0.611756
... ... ... ...
93 2011-04-04 00:00:00 -0.638730 1.129484
94 2011-04-05 00:00:00 0.423494 1.129484
95 2011-04-06 00:00:00 0.077340 0.185156
96 2011-04-07 00:00:00 -0.343854 -0.375285
97 2011-04-08 00:00:00 0.043597 -0.375285
98 2011-04-09 00:00:00 -0.620001 0.077340
99 2011-04-10 00:00:00 0.698032 0.077340
