I just started learning pandas and came across this:
rng = date_range('1/1/2011', periods=72, freq='H')
s = Series(randn(len(rng)), index=rng)
I understood what the above data means and tried it in IPython:
import numpy as np
from numpy.random import randn
from pandas import DataFrame, Series, date_range
import time
r = date_range('1/1/2011', periods=72, freq='H')
r
len(r)
[r[i] for i in range(len(r))]
s = Series(randn(len(r)), index=r)
s
s.plot()
df_new = DataFrame(data = s, columns=['Random Number Generated'])
Is this the correct way to create a DataFrame?
The next step given is: return a series where the absolute difference between a number and the next number in the series is less than 0.5.
Do I need to find the difference between each pair of consecutive random numbers and keep only those where the absolute difference is < 0.5? Can someone explain how I can do that in pandas?
I also tried to plot the series as a histogram with:
df_new.diff().hist()
The graph displays the random number on the x axis and a y axis running from 0 to 18, which I don't understand. Can someone explain this to me as well?
To give you some pointers in addition to @Dthal's comments:
import pandas as pd
r = pd.date_range('1/1/2011', periods=72, freq='H')
As commented by @Dthal, you can simplify the creation of your DataFrame randomly sampled from the normal distribution like so:
df = pd.DataFrame(index=r, data=randn(len(r)), columns=['Random Number Generated'])
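If the Series already exists, Series.to_frame is another way to get a one-column DataFrame. This is only a minimal sketch: the seed value and the lowercase "h" frequency alias are my own choices (recent pandas releases prefer "h" over "H").

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed so the example is reproducible
r = pd.date_range("1/1/2011", periods=72, freq="h")
s = pd.Series(np.random.randn(len(r)), index=r)

# Turn the Series into a one-column DataFrame with an explicit column name
df_new = s.to_frame(name="Random Number Generated")
print(df_new.shape)
```

The index of the Series is carried over unchanged, so the timestamps stay attached to the values.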
To show only the differences from the preceding value that are smaller than 0.5 in absolute terms:
diff = df.diff()
diff[abs(diff['Random Number Generated']) < 0.5]
Random Number Generated
2011-01-01 02:00:00 0.061821
2011-01-01 05:00:00 0.463712
2011-01-01 09:00:00 -0.402802
2011-01-01 11:00:00 -0.000434
2011-01-01 22:00:00 0.295019
2011-01-02 03:00:00 0.215095
2011-01-02 05:00:00 0.424368
2011-01-02 08:00:00 -0.452416
2011-01-02 09:00:00 -0.474999
2011-01-02 11:00:00 0.385204
2011-01-02 12:00:00 -0.248396
2011-01-02 14:00:00 0.081890
2011-01-02 17:00:00 0.421897
2011-01-02 18:00:00 0.104898
2011-01-03 05:00:00 -0.071969
2011-01-03 15:00:00 0.101156
2011-01-03 18:00:00 -0.175296
2011-01-03 20:00:00 -0.371812
You can also call .dropna() on the result of diff() to get rid of the missing first value.
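If the exercise is read literally ("a number and the next number"), you may want the original values rather than the differences. The following is a sketch under that reading; the seed, the "h" alias and the names next_diff and close_to_next are my own and not from the tutorial.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed for reproducibility
r = pd.date_range("1/1/2011", periods=72, freq="h")
s = pd.Series(np.random.randn(len(r)), index=r)

# Absolute difference between each value and the NEXT one.
# shift(-1) moves the following value up by one row; the last row becomes NaN.
next_diff = (s.shift(-1) - s).abs()

# Keep the original values whose distance to the next value is below 0.5.
# Comparisons with NaN are False, so the last row is dropped automatically.
close_to_next = s[next_diff < 0.5]
print(close_to_next.head())
```

Using s.diff() instead of the shift would compare each value with the previous one, which selects the rows one step later.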
The pandas.Series.hist() docs state that the default number of bins is 10, so that is the number of bars you should expect. In this case the histogram turns out roughly symmetric around zero, with values ranging over roughly [-4, +4]. The y axis shows how many observations fall into each bin.
Series.hist(by=None, ax=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, figsize=None, bins=10, **kwds)
diff.hist()
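To see where the bar heights come from, you can compute the bin counts yourself. This sketch uses np.histogram, which as far as I know is what matplotlib relies on for hist(); the seed and the "h" alias are assumptions of mine, so the counts will differ from the plot in the question.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed for reproducibility
r = pd.date_range("1/1/2011", periods=72, freq="h")
s = pd.Series(np.random.randn(len(r)), index=r)

# 71 differences: the first row has no predecessor and is dropped
diff = s.diff().dropna()

# 10 equal-width bins between the smallest and largest difference
counts, edges = np.histogram(diff, bins=10)
print(counts)        # number of differences in each bin (the bar heights)
print(counts.sum())  # 71, every difference lands in exactly one bin
```

The y axis of the histogram is simply these counts, so a top value of 18 means the fullest bin holds 18 of the 71 differences.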
Related
I have two high frequency time series of 3 months worth of data.
The problem is that one goes from 15:30 to 23:00, the other from 01:00 to 00:00.
Is there any way to match the two time series by discarding the extra data, so that I can run some regression analysis?
You can use the combine_first function of pandas Series. Where both series contain the same index label, this function selects the element of the calling object.
The following code shows a minimal example:
import pandas as pd
idx1 = pd.date_range('2018-01-01', periods=5, freq='H')
idx2 = pd.date_range('2018-01-01 01:00', periods=5, freq='H')
ts1 = pd.Series(range(len(idx1)), index=idx1)
ts2 = pd.Series(range(len(idx2)), index=idx2)
ts1.combine_first(ts2)
This gives a Series with the content:
2018-01-01 00:00:00 0.0
2018-01-01 01:00:00 1.0
2018-01-01 02:00:00 2.0
2018-01-01 03:00:00 3.0
2018-01-01 04:00:00 4.0
2018-01-01 05:00:00 4.0
For more complex combinations you can use combine.
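Note that combine_first keeps the union of both indexes. If the goal is to discard the timestamps that only one series has, an inner join is one option. This is a sketch on the same toy data; the column names "ts1" and "ts2" and the "h" alias are my own choices.

```python
import pandas as pd

idx1 = pd.date_range("2018-01-01", periods=5, freq="h")
idx2 = pd.date_range("2018-01-01 01:00", periods=5, freq="h")
ts1 = pd.Series(range(len(idx1)), index=idx1, name="ts1")
ts2 = pd.Series(range(len(idx2)), index=idx2, name="ts2")

# join="inner" keeps only the timestamps that occur in both series
aligned = pd.concat([ts1, ts2], axis=1, join="inner")
print(aligned)
```

The result has one row per shared timestamp and one column per series, which is a convenient shape for a regression.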
This table is a pandas DataFrame. Can someone help me write a function that shows the probability of the price going up 5 days in a row over the past 1000 days? That way I would know the probability of the price going up tomorrow if it has been increasing for the past 4 days.
I'd appreciate any help.
import ccxt
import pandas as pd
binance=ccxt.binance()
def get_price(pair):
    df = binance.fetch_ohlcv(pair, timeframe="1d", limit=1000)  # limit = 30
    df = pd.DataFrame(df).rename(columns={0: "date", 1: "open", 2: "high", 3: "low", 4: "close", 5: "volume"})
    df['date'] = pd.to_datetime(df['date'], unit='ms') + pd.Timedelta(hours=8)
    df.set_index("date", inplace=True)
    return df
df=get_price("BTC/USDT")
df["daily_return"]=df.close.pct_change()
Random comment, I think based on context you're after empirical probability, in which case this is a simple one-liner using pandas rolling. If this isn't the case, you probably need to explain what you mean by "probability" or explain in words what you're after.
df["probability_5"] = (df["daily_return"] > 0).rolling(5).mean()
df[["daily_return", "probability_5"]].head(15)
Output:
daily_return probability_5
date
2017-08-17 08:00:00 NaN NaN
2017-08-18 08:00:00 -0.041238 NaN
2017-08-19 08:00:00 0.007694 NaN
2017-08-20 08:00:00 -0.012969 NaN
2017-08-21 08:00:00 -0.017201 0.2
2017-08-22 08:00:00 0.005976 0.4
2017-08-23 08:00:00 0.018319 0.6
2017-08-24 08:00:00 0.049101 0.6
2017-08-25 08:00:00 -0.008186 0.6
2017-08-26 08:00:00 0.013260 0.8
2017-08-27 08:00:00 -0.006324 0.6
2017-08-28 08:00:00 0.017791 0.6
2017-08-29 08:00:00 0.045773 0.6
2017-08-30 08:00:00 -0.007050 0.6
2017-08-31 08:00:00 0.037266 0.6
Just to frame the question properly, I believe you are trying to calculate the relative frequency of n consecutive positive or negative days in a price series/array.
Some research:
https://medium.com/@mikeharrisny/probability-in-trading-and-finance-96344108e1d9
Please see my implementation below, using a pandas DataFrame:
import pandas as pd
random_prices = [100, 90, 95, 98, 99, 98, 97, 100, 99, 98]
df = pd.DataFrame(random_prices, columns=['price'])
def consecutive_days_proba(pandas_series, n, positive=True):
    # Transform to daily returns
    daily_return = pandas_series.pct_change()
    # Drop NA values; this makes the series one element shorter than the original
    daily_return.dropna(inplace=True)
    # Count the total number of days in the new series
    total_days = len(daily_return)
    if positive:
        # Count the number of n consecutive days with positive returns
        consecutive_n = ((daily_return > 0).rolling(n).sum() == n).sum()
    else:
        # Count the number of n consecutive days with negative returns
        consecutive_n = ((daily_return < 0).rolling(n).sum() == n).sum()
    return ((consecutive_n / total_days) * 100).round(2)
consecutive_days_proba(df['price'], n=3, positive=True)
So this returns 11.11%, which is 1/9. Although the original series has a length of 10, I don't think it makes sense to use the null days as part of the base.
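The question also asks about tomorrow's move given that the past 4 days were up, which is a conditional frequency rather than the share of 5-day streaks. Here is a rough sketch of that reading; prob_up_after_streak is a hypothetical helper of mine and the prices are made up.

```python
import pandas as pd

def prob_up_after_streak(prices, streak=4):
    """Share of up days among the days that follow `streak` consecutive up days."""
    up = prices.pct_change().dropna() > 0
    # Number of up days in the `streak` days BEFORE each day;
    # shift(1) excludes the day itself from its own window.
    prior_ups = up.astype(int).rolling(streak).sum().shift(1)
    after_streak = up[prior_ups == streak]
    return after_streak.mean()  # NaN if no such streak ever occurred

# Made-up prices, only to exercise the function
prices = pd.Series([1, 2, 3, 4, 5, 6, 5, 6, 7, 8, 9, 8], dtype=float)
print(prob_up_after_streak(prices, streak=4))
```

In this toy series three days follow a 4-day streak and only one of them is an up day, so the estimate is 1/3. With real data the number of qualifying days can be small, so treat the result as a rough empirical frequency.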
My dataset looks like this:
time Open
2017-01-01 00:00:00 1.219690
2017-01-01 01:00:00 1.688490
2017-01-01 02:00:00 1.015285
2017-01-01 03:00:00 1.357672
2017-01-01 04:00:00 1.293786
2017-01-01 05:00:00 1.040048
2017-01-01 06:00:00 1.225080
2017-01-01 07:00:00 1.145402
...., ....
2017-12-31 23:00:00 1.145402
I want to find the sum over a specified time range and save it to a new DataFrame.
For example, I want to find the sum between 2017-01-01 22:00:00 and 2017-01-02 04:00:00. This is a sum over 6 hours spanning 2 days. I want to sum the data in a time range such as 10 PM to 4 AM the next day and put it in a different DataFrame, for example df_timerange_sum. Please note that the time range spans 2 different dates.
What did I do?
I used sum() over the time range like this: df[~df['time'].dt.hour.between(10, 4)].sum(), but it gives me the sum over the whole df rather than over the time range I specified.
I also tried resample, but I cannot find a way to do it for a specific time range.
df['time'].dt.hour.between(10, 4) is always False because no hour is at least 10 and at most 4 at the same time. What you want is to mark between(4, 21) and then negate that to get the remaining hours (22:00 through 03:00).
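A quick check on the 24 possible hour values illustrates both points. This is just a toy Series of hours, not the data from the question.

```python
import pandas as pd

hours = pd.Series(range(24))

# between(left, right) tests left <= x <= right, so 10..4 matches nothing
print(hours.between(10, 4).any())

# Negating the daytime range leaves the overnight hours
overnight = hours[~hours.between(4, 21)]
print(overnight.tolist())
```

The first line prints False, and the second prints the hours 0 to 3 together with 22 and 23.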
Here's what I would do:
# Mark the hours from 4 AM up to 10 PM (between is inclusive, so 4..21).
# The data we want is where s is False, i.e. ~s.
s = df['time'].dt.hour.between(4, 21)
# s.cumsum() stays constant within each consecutive run of False,
# so it labels the blocks over which we will take the sum.
blocks = s.cumsum()
# Again, we only care about ~s.
(df[~s].groupby(blocks[~s], as_index=False)   # we don't need the blocks as index
    .agg({'time': 'min', 'Open': 'sum'})      # time: min selects the beginning of each block
)                                             # Open: sum computes the sum of Open
Output for random data:
time Open
0 2017-01-01 00:00:00 1.282701
1 2017-01-01 22:00:00 2.766324
2 2017-01-02 22:00:00 2.838216
3 2017-01-03 22:00:00 4.151461
4 2017-01-04 22:00:00 2.151626
5 2017-01-05 22:00:00 2.525190
6 2017-01-06 22:00:00 0.798234
An alternative (in my opinion more straightforward) approach that accomplishes the same thing. There are definitely ways to reduce the code, but I am also relatively new to pandas.
df.set_index(['time'],inplace=True) #make time the index col (not 100% necessary)
df2=pd.DataFrame(columns=['start_time','end_time','sum_Open']) #new df that stores your desired output + start and end times if you need them
df2['start_time']=df[df.index.hour == 22].index #gets/stores all start datetimes
df2['end_time']=df2['start_time'] + pd.Timedelta(hours=6) #each window ends at 4 AM on the following day
for i,row in df2.iterrows():
    #DataFrame.set_value was removed in pandas 1.0; .at is the replacement
    df2.at[i,'sum_Open']=df[(df.index >= row['start_time']) & (df.index <= row['end_time'])]['Open'].sum()
The last window is cut short because the data ends at 11 PM on the final day, so you'd have to add an if statement or something if you need to handle that case separately.
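A loop-free variant of the same idea is to shift the timestamps so that each night falls on a single calendar date and then group by that date. This is only a sketch: the constant Open values are made up so the sums are easy to verify, and night_start is a name I introduced.

```python
import pandas as pd

# Assumption: hourly data with columns 'time' and 'Open', as in the question.
# Open is 1.0 everywhere so each sum equals the number of hours in the window.
df = pd.DataFrame({
    "time": pd.date_range("2017-01-01", periods=72, freq="h"),
    "Open": 1.0,
})

hour = df["time"].dt.hour
overnight = df[(hour >= 22) | (hour < 4)]  # rows from 22:00 through 03:00

# Shifting back 4 hours maps 22:00-03:59 onto the date the night started
night_start = (overnight["time"] - pd.Timedelta(hours=4)).dt.normalize()
df_timerange_sum = overnight.groupby(night_start)["Open"].sum()
print(df_timerange_sum)
```

The first and last groups are partial nights (4 and 2 hours here) because the data starts at midnight and ends at 11 PM, so you may want to drop them.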
I have a time series with 1hr time interval, which I'm trying to decompose - with seasonality of a week.
Time Total_request
2018-04-09 22:00:00 1019656
2018-04-09 23:00:00 961867
2018-04-10 00:00:00 881291
2018-04-10 01:00:00 892974
import pandas as pd
import statsmodels.api as sm
d.reset_index(inplace=True)
d['env_time'] = pd.to_datetime(d['env_time'])
d = d.set_index('env_time')
s=sm.tsa.seasonal_decompose(d.total_request, freq = 24*7)
This gives me the resulting graphs of the seasonal, trend and residual components - https://imgur.com/a/CjhWphO
But on trying to extract the residual values using s.resid I get this -
env_time
2018-04-09 20:00:00 NaN
2018-04-09 21:00:00 NaN
2018-04-09 22:00:00 NaN
I get values when I change it to a lower frequency. What's strange is that I can't retrieve the values even though they are being plotted. I have found similar questions, but none of the answers were relevant to this case.
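One likely explanation, which I could not check against the data above: by default seasonal_decompose estimates the trend with a centered moving average, so roughly half a period at each end of the series has no trend value and therefore no residual. The plot still shows the middle of the series, and printing s.resid only shows the first rows. The sketch below imitates this with plain pandas on made-up data; it is not the statsmodels implementation.

```python
import numpy as np
import pandas as pd

period = 24 * 7  # one week of hourly observations
idx = pd.date_range("2018-04-09", periods=4 * period, freq="h")
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)  # made-up data

# A centered window needs period // 2 points on each side,
# so the first and last 84 positions cannot be computed.
trend = s.rolling(window=period + 1, center=True).mean()
print(trend.isna().sum())
print(trend.dropna().index[0])
```

If that is what is happening, s.resid.dropna() should show the values that were plotted, and statsmodels documents an extrapolate_trend argument for filling in the ends.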
I am trying to learn about rolling statistics. I created a DataFrame for:
rng = date_range('1/1/2011', periods=72, freq='H')
s = Series(randn(len(rng)), index=rng)
as:
import numpy as np
from numpy.random import randn
from pandas import DataFrame, Series, date_range, rolling_mean
import time
r = date_range('1/1/2011', periods=72, freq='H')
r
len(r)
[r[i] for i in range(len(r))]
s = Series(randn(len(r)), index=r)
s
s.plot()
df_new = DataFrame(data = s, columns=['Random Number Generated'])
df_new.diff().hist()
Now I am trying to compute the rolling mean of the series over the last 3 hours in a new column of the DataFrame. I tried to find the rolling mean first:
df_new['mean'] = rolling_mean(df_new, window=3)
Is this correct? The result doesn't look like a mean. Can someone explain this to me, please?
I have rerun your code and could not find any problems. It seems to work.
If you want to take the rolling mean over the last 3 hours, rolling_mean(df_new, window=5) should be rolling_mean(df_new, window=3)
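Note that the top-level rolling_mean function was removed in later pandas versions; the replacement is the .rolling(...).mean() method. A minimal sketch of the same calculation with the newer API follows, where the seed and the "h" alias are my own choices.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed for reproducibility
r = pd.date_range("1/1/2011", periods=72, freq="h")
df_new = pd.DataFrame({"Random Number Generated": np.random.randn(len(r))}, index=r)

col = df_new["Random Number Generated"]
# Mean of the current value and the two before it; the first two rows are NaN
df_new["mean"] = col.rolling(window=3).mean()
print(df_new.head())
```

The first two rows are NaN because a full window of 3 values is not available yet.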
Here is my code for the verification.
import numpy as np
window = 3
mean_list = []
val_list = []
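for i, val in enumerate(s):
    val_list.append(val)
    if i < window - 1:
        # Not enough values for a full window yet
        mean_list.append(np.nan)
    else:
        mean_list.append(np.mean(np.array(val_list)))
        # Drop the oldest value so the list always holds the last `window` values
        val_list.pop(0)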
df_new['mean2'] = mean_list
print(df_new)
Output:
Random Number Generated mean mean2
2011-01-01 00:00:00 1.457483 NaN NaN
2011-01-01 01:00:00 0.009979 NaN NaN
2011-01-01 02:00:00 0.581128 0.682864 0.682864
2011-01-01 03:00:00 1.905528 0.832212 0.832212
2011-01-01 04:00:00 2.221040 1.569232 1.569232
2011-01-01 05:00:00 0.696211 1.607593 1.607593
2011-01-01 06:00:00 -0.854759 0.687497 0.687497
2011-01-01 07:00:00 -0.033226 -0.063925 -0.063925
2011-01-01 08:00:00 0.097187 -0.263599 -0.263599
2011-01-01 09:00:00 -1.579210 -0.505083 -0.505083
...
The results from rolling_mean are consistent with the manually calculated rolling mean values.
Another way to check validity is to look at plots of the calculated rolling mean. pandas.DataFrame provides a plot method to draw graphs easily.
from matplotlib import pyplot
df_new.plot()
pyplot.show()
As long as your index is a timestamp (as it currently is), you can just use resample:
s.resample('3H').mean()  # recent pandas versions need an explicit aggregation such as .mean()
When you use random numbers, it is best to set a seed value so that others can replicate your results.
import numpy as np
import pandas as pd
np.random.seed(0)
s = pd.Series(np.random.randn(72), pd.date_range('1/1/2011', periods=72, freq='H'))
s.plot(); s.resample('3H').mean().plot()
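Keep in mind that resample and a rolling window are not the same thing: resample aggregates non-overlapping 3-hour bins, while rolling produces one value per hour from overlapping windows. A small sketch of the difference, using the lowercase "h" alias that recent pandas prefers:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
s = pd.Series(np.random.randn(72), pd.date_range("1/1/2011", periods=72, freq="h"))

binned = s.resample("3h").mean()  # one value per non-overlapping 3-hour bin
rolled = s.rolling(3).mean()      # one value per hour, windows overlap
print(len(binned), len(rolled))
```

This prints 24 and 72. The first bin covers the same three hours as the third rolling value, so those two numbers agree; after that the two results diverge.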