probability of the price having n consecutive positive days - python

This table is a pandas DataFrame. Can someone help me write a function that shows the probability of the price being up for 5 consecutive days within the past 1000 days? That way I'll know the probability of the price being up tomorrow when the price has been increasing for the past 4 days.
I'd appreciate any help.
import ccxt
import pandas as pd

binance = ccxt.binance()

def get_price(pair):
    # Fetch up to 1000 daily OHLCV candles for the pair
    df = binance.fetch_ohlcv(pair, timeframe="1d", limit=1000)
    df = pd.DataFrame(df).rename(columns={0: "date", 1: "open", 2: "high", 3: "low", 4: "close", 5: "volume"})
    # Convert millisecond timestamps to datetimes (shifted to UTC+8)
    df['date'] = pd.to_datetime(df['date'], unit='ms') + pd.Timedelta(hours=8)
    df.set_index("date", inplace=True)
    return df

df = get_price("BTC/USDT")
df["daily_return"] = df.close.pct_change()

Random comment aside: based on context, I think you're after the empirical probability, in which case this is a simple one-liner using pandas rolling. If that's not the case, you probably need to explain what you mean by "probability", or describe in words what you're after.
df["probability_5"] = (df["daily_return"] > 0).rolling(5).mean()
df[["daily_return", "probability_5"]].head(15)
Output:
daily_return probability_5
date
2017-08-17 08:00:00 NaN NaN
2017-08-18 08:00:00 -0.041238 NaN
2017-08-19 08:00:00 0.007694 NaN
2017-08-20 08:00:00 -0.012969 NaN
2017-08-21 08:00:00 -0.017201 0.2
2017-08-22 08:00:00 0.005976 0.4
2017-08-23 08:00:00 0.018319 0.6
2017-08-24 08:00:00 0.049101 0.6
2017-08-25 08:00:00 -0.008186 0.6
2017-08-26 08:00:00 0.013260 0.8
2017-08-27 08:00:00 -0.006324 0.6
2017-08-28 08:00:00 0.017791 0.6
2017-08-29 08:00:00 0.045773 0.6
2017-08-30 08:00:00 -0.007050 0.6
2017-08-31 08:00:00 0.037266 0.6
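Note that the rolling mean above is the fraction of up days in each 5-day window, not quite the conditional probability described in the question (price up tomorrow, given 4 up days in a row). If that is what you want, here is a minimal sketch of the empirical estimate, assuming the df with the daily_return column built above:
# Boolean series: True on up days
up = df["daily_return"] > 0
# True where each of the previous 4 days was an up day
four_up = up.rolling(4).sum().shift(1) == 4
# Among those days, the fraction that were themselves up days
conditional_prob = up[four_up].mean()
print(f"P(up today | previous 4 days up) = {conditional_prob:.2%}")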

Just to frame the question properly, I believe you are trying to calculate the relative frequency of n consecutive positive (or negative) days in a price series/array.
Some research:
https://medium.com/@mikeharrisny/probability-in-trading-and-finance-96344108e1d9
Please see my implementation below, using a pandas DataFrame:
import pandas as pd

random_prices = [100, 90, 95, 98, 99, 98, 97, 100, 99, 98]
df = pd.DataFrame(random_prices, columns=['price'])

def consecutive_days_proba(pandas_series, n, positive=True):
    # Transform to daily returns
    daily_return = pandas_series.pct_change()
    # Drop NA values; this shortens the series by one (the first pct_change is NaN)
    daily_return.dropna(inplace=True)
    # Count the total number of days in the new series
    total_days = len(daily_return)
    if positive:
        # Count the number of n-day runs with positive returns
        consecutive_n = ((daily_return > 0).rolling(n).sum() == n).sum()
    else:
        # Count the number of n-day runs with negative returns
        consecutive_n = ((daily_return < 0).rolling(n).sum() == n).sum()
    return ((consecutive_n / total_days) * 100).round(2)

consecutive_days_proba(df['price'], n=3, positive=True)
So this returns 11.11%, which is 1/9. Although the original series has a length of 10, I don't think it makes sense to count the null day as part of the base.
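Applied to the question's own data, the call would look like this (a sketch, assuming the get_price helper defined in the question):
# Empirical frequency (%) of 5 consecutive up days over the past 1000 daily closes
df_btc = get_price("BTC/USDT")
print(consecutive_days_proba(df_btc['close'], n=5, positive=True))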

Related

How do I plot a scatter graph comparing two dataframes?

I have two separate DataFrames, which both contain rainfall amounts and dates corresponding to them.
df1:
time tp
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 0.0
3 2013-01-01 03:00:00 0.0
4 2013-01-01 04:00:00 0.0
... ...
8755 2013-12-31 19:00:00 0.0
8756 2013-12-31 20:00:00 0.0
8757 2013-12-31 21:00:00 0.0
8758 2013-12-31 22:00:00 0.0
8759 2013-12-31 23:00:00 0.0
[8760 rows x 2 columns]
df2:
time tp
0 2013-07-18T18:00:01 0.002794
1 2013-07-18T20:00:00 0.002794
2 2013-07-18T21:00:00 0.002794
3 2013-07-18T22:00:00 0.002794
4 2013-07-19T00:00:00 0.000000
... ...
9656 2013-12-30T13:30:00 0.000000
9657 2013-12-30T23:30:00 0.000000
9658 2013-12-31T00:00:00 0.000000
9659 2013-12-31T00:00:00 0.000000
9660 2014-01-01T00:00:00 0.000000
[9661 rows x 2 columns]
I'm trying to plot a scatter graph comparing the two data frames. The way I'm doing it is by choosing a specific date and time and plotting the df1 tp on one axis and df2 tp on the other axis.
For example,
If the date/time on both dataframes = 2013-12-31 19:00:00, then plot tp for df1 onto x-axis, and tp for df2 on the y-axis.
To solve this, I tried using the following:
df1['dates_match'] = np.where(df1['time'] == df2['time'], 'True', 'False')
which will tell me if the dates match, and if they do I can plot. The problem arises because I have a different number of rows in each dataframe, and most methods only allow comparison of dataframes with exactly the same number of rows.
Does anyone know of an alternative method I could use to plot the graph?
Thanks in advance!
The main goal is to plot two time series that apparently don't have the same frequency, to be able to compare them.
Since the main issue here is the mismatched timestamps, let's tackle that with pandas resample so we have more uniform timestamps for each observation. To take the sum over 30-minute intervals you can do the following (feel free to change the interval and the aggregation function):
# Make sure the time columns are datetimes before using them as the index
df1["time"] = pd.to_datetime(df1["time"])
df2["time"] = pd.to_datetime(df2["time"])
df1.set_index("time", inplace=True)
df2.set_index("time", inplace=True)
df1_resampled = df1.resample("30T").sum()  # sum over 30-minute intervals
df2_resampled = df2.resample("30T").sum()  # sum over 30-minute intervals
Now that the timestamps are more organized, you can merge the resampled dataframes and then plot them:
df_joined = df1_resampled.join(df2_resampled, lsuffix="_1", rsuffix="_2")
df_joined.plot(marker="o", figsize=(12,6))
# df_joined.plot(subplots=True) if you want to plot them separately
Since df1 starts on 2013-01-01 and df2 on 2013-07-18, there will be a first period where only df1 exists. If you want to plot only the overlapping period, pass how="inner" when joining the two dataframes.
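The question asked for a scatter plot specifically; here is a minimal sketch on top of the joined frame (the tp_1/tp_2 names come from the suffixes used above):
import matplotlib.pyplot as plt

# Inner join keeps only the timestamps present in both resampled frames
df_scatter = df1_resampled.join(df2_resampled, lsuffix="_1", rsuffix="_2", how="inner")
# df1's rainfall on the x-axis, df2's on the y-axis
df_scatter.plot.scatter(x="tp_1", y="tp_2", figsize=(8, 8))
plt.show()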

Resample df to smaller time steps and average the counts

I have a dataframe containing counts over time periods (rainfall in periods of 3 hours; the toy example below uses 3-minute steps, but the ratio is the same), something like this:
time_stamp, rain_fall_in_mm
2019-01-01 00:03:00, 0.0
2019-01-01 00:06:00, 3.9
2019-01-01 00:09:00, 0.0
2019-01-01 00:12:00, 1.2
I need to upsample the dataframe into time periods of 1 hour (1 minute in the toy example) and I would like to average out the counts for the rain, so that there are no NaNs and the total sum of rain remains the same. This is the desired result:
time_stamp, rain_fall_in_mm
2019-01-01 00:01:00, 0.0
2019-01-01 00:02:00, 0.0
2019-01-01 00:03:00, 0.0
2019-01-01 00:04:00, 1.3
2019-01-01 00:05:00, 1.3
2019-01-01 00:06:00, 1.3
2019-01-01 00:07:00, 0.0
2019-01-01 00:08:00, 0.0
2019-01-01 00:09:00, 0.0
2019-01-01 00:10:00, 0.4
2019-01-01 00:11:00, 0.4
2019-01-01 00:12:00, 0.4
I found that I can do something like series.resample('1H').bfill() or series.resample('1H').pad(). These solve the resampling issue, but don't fulfil the desired averaging. Do you have any suggestions what to do? Thanks!
Try this:
# Reindex to 1-minute steps; timestamps missing from the original index become NaN
df2 = df.reindex(pd.date_range(start = '1/1/2019',periods = 13,freq='1min'))
# Walking backwards, cumsum() of the non-NaN flags labels each NaN run together with
# the observation that follows it; the group mean then spreads each count evenly over
# its interval (e.g. the mean of [0, 0, 3.9] is 1.3)
df2.fillna(0).groupby((~df2['rain_fall_in_mm'].isna()).iloc[::-1].cumsum()).transform('mean')
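A self-contained run on the question's sample data (a sketch; the datetime index is rebuilt to match the timestamps shown):
import pandas as pd

idx = pd.date_range("2019-01-01 00:03:00", periods=4, freq="3min")
df = pd.DataFrame({"rain_fall_in_mm": [0.0, 3.9, 0.0, 1.2]}, index=idx)
df2 = df.reindex(pd.date_range(start="2019-01-01", periods=13, freq="1min"))
groups = (~df2["rain_fall_in_mm"].isna()).iloc[::-1].cumsum()
print(df2.fillna(0).groupby(groups).transform("mean"))
# 0.0 for minutes 00-03, 1.3 for 04-06, 0.0 for 07-09, 0.4 for 10-12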
First, make sure that your index is in datetime format. If it is not, you can convert it in the following way:
df.set_index(pd.date_range(start=df.time_stamp[0], periods=len(df), freq='3H'), inplace=True)
Then use this if you want to upsample only the one column:
df_rain_hourly_column = df.resample('H').bfill().rain_fall_in_mm / 3.
If your initial df contains only floats you can operate on the whole dataframe
df2 = df.resample('H').bfill() / 3.
The division by 3. (the length ratio old_time_period/new_time_period) is a bit hacky, but I really haven't found a more general and simple solution anywhere.
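For concreteness, a minimal sketch of this approach on a tiny hypothetical 3-hourly frame; note that resample starts at the first timestamp, so the hours before it are never generated and part of the very first count is lost at the edge (the reindex approach above avoids this):
import pandas as pd

idx = pd.date_range("2019-01-01 03:00", periods=3, freq="3H")
df = pd.DataFrame({"rain_fall_in_mm": [3.0, 6.0, 9.0]}, index=idx)
print(df.resample("H").bfill() / 3.)
# 03:00 -> 1.0, 04:00-06:00 -> 2.0, 07:00-09:00 -> 3.0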

How to match time series in python?

I have two high frequency time series of 3 months worth of data.
The problem is that one goes from 15:30 to 23:00, the other from 01:00 to 00:00.
Is there any way to match the two time series, by discarding the extra data, in order to run some regression analysis?
You can use the combine_first function of a pandas Series. This function prefers the values of the calling Series where the indices overlap and fills in values from the other Series elsewhere.
The following code shows a minimal example:
idx1 = pd.date_range('2018-01-01', periods=5, freq='H')
idx2 = pd.date_range('2018-01-01 01:00', periods=5, freq='H')
ts1 = pd.Series(range(len(idx1)), index=idx1)
ts2 = pd.Series(range(len(idx2)), index=idx2)
ts1.combine_first(ts2)
This gives a Series with the content:
2018-01-01 00:00:00 0.0
2018-01-01 01:00:00 1.0
2018-01-01 02:00:00 2.0
2018-01-01 03:00:00 3.0
2018-01-01 04:00:00 4.0
2018-01-01 05:00:00 4.0
For more complex combinations you can use combine.
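Since the question asks to discard the non-overlapping data before a regression, you may actually want an inner alignment rather than a union; a minimal sketch using the two series above:
# Keep only the timestamps present in both series
aligned1, aligned2 = ts1.align(ts2, join='inner')
# or equivalently: pd.concat([ts1, ts2], axis=1, join='inner')
print(aligned1.index.equals(aligned2.index))  # True, so ready for regression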

percentile for datetime column python

Is there a way to compute the percentile for a dataframe column in datetime format while retaining the datetime format (Y-m-d H:M:S), rather than having the percentile value converted to seconds?
example of the data with datetime format
df:
0 2016-07-31 08:00:00
1 2016-07-30 14:30:00
2 2006-06-24 14:15:00
3 2016-07-15 08:15:45
4 2016-08-01 23:50:00
There is a built-in function quantile that can be used for that (the column must have datetime dtype, hence the pd.to_datetime below). Let
df = pd.to_datetime(pd.Series(['2016-07-31 08:00:00', '2016-07-30 14:30:00', '2006-06-24 14:15:00', '2016-07-15 08:15:45', '2016-08-01 23:50:00']))
df
0 2016-07-31 08:00:00
1 2016-07-30 14:30:00
2 2006-06-24 14:15:00
3 2016-07-15 08:15:45
4 2016-08-01 23:50:00
then
>>> df.quantile(0.5)
Timestamp('2016-07-30 14:30:00')
See also the official documentation
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.quantile.html
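quantile also accepts a list, so several percentiles can be pulled at once, still as Timestamps:
>>> df.quantile([0.25, 0.5, 0.75])
0.25   2016-07-15 08:15:45
0.50   2016-07-30 14:30:00
0.75   2016-07-31 08:00:00
dtype: datetime64[ns]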
The describe() method doesn't work on a datetime column the same way it does on integer or float columns, so we can create a custom method to do the same:
import pandas as pd
from datetime import timedelta
from datetime import datetime
base = datetime.now()
date_list = [base - timedelta(days=x) for x in range(0, 20)]
df = pd.DataFrame.from_dict({'Date': date_list})
df
Date
0 2017-08-17 21:32:54.044948
1 2017-08-16 21:32:54.044948
2 2017-08-15 21:32:54.044948
3 2017-08-14 21:32:54.044948
def describe_datetime(dataframe, column, percentiles=[i/10 for i in range(1, 11)]):
    new_date = dataframe[column].dt.strftime('%Y-%m-%d').sort_values().values
    length = len(new_date)
    for percentile in percentiles:
        print(percentile, ':', new_date[int(percentile * length) - 1])

describe_datetime(df, 'Date')
output:
0.1 : 2017-07-30
0.2 : 2017-08-01
0.3 : 2017-08-03
0.4 : 2017-08-05
0.5 : 2017-08-07
0.6 : 2017-08-09
0.7 : 2017-08-11
0.8 : 2017-08-13
0.9 : 2017-08-15
1.0 : 2017-08-17
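Note that the custom method isn't strictly necessary here: as shown in the previous answer, quantile works on a datetime column and also accepts a list, e.g.
df['Date'].quantile([i/10 for i in range(1, 11)])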
After trying some code, I was able to compute the percentile using the code below: I sorted the column and used its index to compute the percentile.
dataframe is 'df', column with datetime format is 'dates'
import numpy as np

date_column = list(df.sort_values('dates')['dates'])
index = range(len(date_column))
date_column[int(np.percentile(index, 50))]

DataFrame in pandas with time series data

I just started learning pandas. I came across this:
rng = date_range('1/1/2011', periods=72, freq='H')
s = Series(randn(len(rng)), index=rng)
I have understood what the above code means, and I tried it out in IPython:
import numpy as np
from numpy.random import randn
from pandas import date_range, Series, DataFrame

r = date_range('1/1/2011', periods=72, freq='H')
r
len(r)
[r[i] for i in range(len(r))]
s = Series(randn(len(r)), index=r)
s
s.plot()
df_new = DataFrame(data=s, columns=['Random Number Generated'])
Is this the correct way of creating a DataFrame?
The next step given is to: return a series where the absolute difference between a number and the next number in the series is less than 0.5.
Do I need to find the difference between each pair of consecutive random numbers and keep only the rows where the absolute difference is < 0.5? Can someone explain how I can do that in pandas?
Also, I tried to plot the series as a histogram with
df_new.diff().hist()
The graph displays the random number on the x-axis with a y-axis from 0 to 18 (which I don't understand). Can someone explain this to me as well?
To give you some pointers in addition to @Dthal's comments:
r = pd.date_range('1/1/2011', periods=72, freq='H')
As commented by @Dthal, you can simplify the creation of your DataFrame randomly sampled from the normal distribution like so:
df = pd.DataFrame(index=r, data=randn(len(r)), columns=['Random Number Generated'])
To show only values that differ by less than 0.5 from the preceding value:
diff = df.diff()
diff[abs(diff['Random Number Generated']) < 0.5]
Random Number Generated
2011-01-01 02:00:00 0.061821
2011-01-01 05:00:00 0.463712
2011-01-01 09:00:00 -0.402802
2011-01-01 11:00:00 -0.000434
2011-01-01 22:00:00 0.295019
2011-01-02 03:00:00 0.215095
2011-01-02 05:00:00 0.424368
2011-01-02 08:00:00 -0.452416
2011-01-02 09:00:00 -0.474999
2011-01-02 11:00:00 0.385204
2011-01-02 12:00:00 -0.248396
2011-01-02 14:00:00 0.081890
2011-01-02 17:00:00 0.421897
2011-01-02 18:00:00 0.104898
2011-01-03 05:00:00 -0.071969
2011-01-03 15:00:00 0.101156
2011-01-03 18:00:00 -0.175296
2011-01-03 20:00:00 -0.371812
You can simplify by using .dropna() to get rid of the missing values.
The pandas.Series.hist() docs state that the default number of bins is 10, so 10 bars is the number you should expect; in this case the histogram turns out roughly symmetric around zero, ranging roughly over [-4, +4].
Series.hist(by=None, ax=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, figsize=None, bins=10, **kwds)
diff.hist()
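If the exercise wants the original series values (rather than the differences) wherever the step to the next number is small, a sketch along the same lines:
s = df['Random Number Generated']
# Absolute difference between each number and the next one in the series
small_step = (s - s.shift(-1)).abs() < 0.5
result = s[small_step]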
