I'm trying to write a function that takes a continuous time series and returns a data structure describing any gaps of missing data (e.g. a DataFrame with columns 'start' and 'end'). It seems like a fairly common issue for time series, but despite messing around with groupby, diff, and the like -- and exploring SO -- I haven't been able to come up with much better than the below.
It's a priority for me that this use vectorized operations to remain efficient. There has got to be a more obvious vectorized solution -- hasn't there? Thanks for any help, folks.
import pandas as pd
def get_gaps(series):
    """
    :param series: a continuous time series of data with the index's freq set
    :return: a series where the index is the start of gaps, and the values are
             the ends
    """
    missing = series.isnull()
    different_from_last = missing.diff()
    # any row not missing while the last was is a gap end
    gap_ends = series[~missing & different_from_last].index
    # count the start as different from the last
    different_from_last[0] = True
    # any row missing while the last wasn't is a gap start
    gap_starts = series[missing & different_from_last].index
    # check and remedy if series ends with missing data
    if len(gap_starts) > len(gap_ends):
        gap_ends = gap_ends.append(series.index[-1:] + series.index.freq)
    return pd.Series(index=gap_starts, data=gap_ends)
For the record, Pandas==0.13.1, Numpy==1.8.1, Python 2.7
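For reference, here is a fully vectorized sketch of the same isnull/diff idea described above (get_gaps_vectorized is just an illustrative name; like the function above, it assumes a DatetimeIndex with freq set, and the boundary handling mirrors the end-of-series fix above, so treat it as a starting point rather than a drop-in answer):

import numpy as np
import pandas as pd

def get_gaps_vectorized(series):
    # 1 where data is missing, 0 where it is present, padded with 0 at both ends
    missing = series.isnull().values.astype(int)
    change = np.diff(np.concatenate(([0], missing, [0])))
    # +1 marks the first missing row of a gap, -1 the first present row after it
    start_pos = np.flatnonzero(change == 1)
    end_pos = np.flatnonzero(change == -1)
    # extend the index by one step so a gap running to the end still gets an 'end' label
    extended = series.index.append(series.index[-1:] + series.index.freq)
    return pd.DataFrame({'start': series.index[start_pos], 'end': extended[end_pos]})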
This problem can be transformed into finding runs of consecutive numbers in a list: find all the indices where the series is null, and if a run such as (3, 4, 5, 6) is all null, you only need to extract the start and end, (3, 6).
import numpy as np
import pandas as pd
from operator import itemgetter
from itertools import groupby

def find_gap(s):
    """ just treat it as a list """
    nullindex = np.where(s.isnull())[0]
    ranges = []
    # consecutive positions all share the same value of (enumeration index - value)
    for k, g in groupby(enumerate(nullindex), lambda (i, x): i - x):
        group = map(itemgetter(1), g)
        ranges.append((group[0], group[-1]))
    startgap, endgap = zip(*ranges)
    return pd.Series(endgap, index=startgap)

# create an example
data = [2, 3, 4, 5, 12, 13, 14, 15, 16, 17]
s = pd.Series(data, index=data)
s = s.reindex(xrange(18))
print find_gap(s)
Reference: Identify groups of continuous numbers in a list
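Note that find_gap returns positional indices. For a datetime-indexed series you would still need to map those positions back to timestamps, for example (a sketch, reusing the gaps Series returned above):

gaps = find_gap(s)
# translate positional start/end back into the series' own index labels
gap_labels = pd.Series(data=s.index[gaps.values], index=s.index[gaps.index])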
Related
I have this NumPy array result (final), and I want to reduce it: if a value is repeated, I want to delete its first occurrence and keep the second, third, and later occurrences.
import hmac
import hashlib
import time
from argparse import _MutuallyExclusiveGroup
from tkinter import *
import pandas as pd
import base64
import matplotlib.pyplot as plt
import numpy as np

key = "800070FF00FF08012"
key = bytes(key, 'utf-8')
collision = []
for x in range(1, 1000001):
    msg = bytes(f'{x}', 'utf-8')
    digest = hmac.new(key, msg, "sha256").digest()
    code = base64.b64encode(digest).decode('utf-8')
    code = code[:6]
    key = key.replace(key, digest)
    collision.append(code)

df = pd.DataFrame(collision)
df = df[df.duplicated(keep=False)]
df_index = df.index.to_numpy()
df = df.values.flatten()
final = np.stack((df_index, df), axis=1)
Results of the variable "final":
I HAVE:
[[14093 'JRp1kX']
[43985 'KGlW7X']
[59212 'pU97Tr']
[90668 'ecTjTB']
[140615 'JRp1kX']
[218480 '25gtjT']
[344174 'dtXg6E']
[380467 'DdHQ3M']
[395699 'vnFw/c']
[503504 'dtXg6E']
[531073 'KGlW7X']
[633091 'ecTjTB']
[671091 'vnFw/c']
[672111 '25gtjT']
[785568 'pU97Tr']
[991540 'DdHQ3M']
[991548 'JRp1kX']]
And I WANT TO HAVE:
[[140615 'JRp1kX']
[503504 'dtXg6E']
[531073 'KGlW7X']
[633091 'ecTjTB']
[671091 'vnFw/c']
[672111 '25gtjT']
[785568 'pU97Tr']
[991540 'DdHQ3M']
[991548 'JRp1kX']]
That is, eliminating the first occurrence of each value that is repeated in the array.
Does someone have some code that could work for my case?
In simpler terms: if you have this list [1, 2, 3, 4, 5, 1, 3, 5, 5]
I would like to get [2, 4, 1, 3, 5, 5]
df = pd.DataFrame([1, 2, 3, 4, 5, 1, 3, 5, 5])
# keep the unique rows
unique_mask = ~df.duplicated(keep=False)
# keep the repeated rows (skipping the first for each non-unique)
repeated_mask = df.duplicated()
df.loc[unique_mask | repeated_mask]
   0
1  2
3  4
5  1
6  3
7  5
8  5
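The same two masks carry over to the asker's two-column case by restricting duplicated() to the code column; a small illustrative sketch (the column names idx and code are made up here):

import pandas as pd

df = pd.DataFrame({'idx': [14093, 140615, 991548, 43985],
                   'code': ['JRp1kX', 'JRp1kX', 'JRp1kX', 'KGlW7X']})
keep = ~df.duplicated(subset='code', keep=False) | df.duplicated(subset='code')
print(df.loc[keep])   # drops only the first row of each repeated code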
final is a NumPy array, so you can use np.unique on the second column to get the index of each value's first occurrence together with its count; the count lets you avoid deleting values that appear only once.
_, idx, counts = np.unique(final[:, 1], return_index=True, return_counts=True)
idx = idx[counts > 1]
final = np.delete(final, idx, axis=0)
This works on the 2-D ndarray; for your second, 1-D list example use
_, idx, counts = np.unique(final, return_index=True, return_counts=True)
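A quick check of that 1-D variant on the list from the question (a self-contained sketch):

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 1, 3, 5, 5])
_, idx, counts = np.unique(arr, return_index=True, return_counts=True)
idx = idx[counts > 1]            # first occurrence of each value that repeats
print(np.delete(arr, idx))       # [2 4 1 3 5 5]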
Maybe you could do it with a plain for loop.

to_remove = list()
for i in range(len(your_list)):
    # mark the first occurrence of a value that also shows up again later
    if your_list[i] in your_list[i + 1:] and your_list[i] not in your_list[:i]:
        to_remove.append(i)

removed_count = 0
for i in to_remove:
    del your_list[i - removed_count]
    removed_count += 1

You cannot del inside the first loop, because deleting an element shifts everything after it, and i would then skip an item every time you delete one.
The [i - removed_count] is needed because every deletion at a lower index shifts all higher indexes down by one.
I think it could be written more efficiently, but this should work, perhaps with small changes.
After you generate df, add the following lines:
df=pd.DataFrame(collision)
# ... your code ends here
removed_already = []
for idx in df[df.duplicated(keep=False)].index:
    if df.loc[idx][0] not in removed_already:
        removed_already.append(df.loc[idx][0])
        df.drop(index=idx, inplace=True)
# your code continues
df_index = df.index.to_numpy()
df = df.values.flatten()
final = np.stack((df_index, df), axis=1)
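The same idea shown on the small list example from the question (a quick sketch using a Series):

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 1, 3, 5, 5])
removed_already = []
for idx in s[s.duplicated(keep=False)].index:
    if s.loc[idx] not in removed_already:
        removed_already.append(s.loc[idx])
        s.drop(index=idx, inplace=True)
print(s.tolist())   # [2, 4, 1, 3, 5, 5]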
My goal:
I have two time-series data frames, one with a time interval of 1m and the other with a time interval of 5m. The 5m data frame is a resampled version of the 1m data. What I'm doing is computing a set of RSI values that correspond to the 5m df using the vectorbt library, then aligning and broadcasting these values to the 1m df using df.align
The Problem:
When I run these steps line by line outside the function, everything works perfectly and produces the aligned result I expect.
However, when the same steps run inside the indicator function, it raises the following error, even though the index names do overlap:
ValueError: cannot join with no overlapping index names
Here's the complete code:
import vectorbt as vbt
import numpy as np
import pandas as pd
import datetime
end_date = datetime.datetime.now()
start_date = end_date - datetime.timedelta(days=3)
btc_price = vbt.YFData.download('BTC-USD',
                                interval='1m',
                                start=start_date,
                                end=end_date,
                                missing_index='drop').get('Close')

def custom_indicator(close, rsi_window=14, ma_window=50):
    close_5m = close.resample('5T').last()
    rsi = vbt.RSI.run(close_5m, window=rsi_window).rsi
    rsi, _ = rsi.align(close, broadcast_axis=0, method='ffill')
    print(rsi)    # to check
    print(close)  # to check
    return

# setting up the indicator factory
ind = vbt.IndicatorFactory(
    class_name='Combination',
    short_name='comb',
    input_names=['close'],
    param_names=['rsi_window', 'ma_window'],
    output_names=['value']).from_apply_func(custom_indicator,
                                            rsi_window=14,
                                            ma_window=50,
                                            keep_pd=True)

res = ind.run(btc_price, rsi_window=21, ma_window=50)
print(res)
Thank you for taking the time to read this. Any help would be appreciated!
If you check the columns of both rsi and close:
print('close is', close.columns)
print('rsi is', rsi.columns)
you will find
rsi is MultiIndex([(14, 'Close')],
names=['rsi_window', None])
close is Index(['Close'], dtype='object')
Since rsi has a two-level column index while close has only one, one level should be dropped, which can be done with the line below:
rsi.columns = rsi.columns.droplevel()
Dropping that level leaves a single column index, so the two objects can be aligned.
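For context, this is roughly where that line would sit inside the asker's custom_indicator (a sketch; the rest of the function is unchanged, and rsi is returned so the factory's 'value' output gets filled):

def custom_indicator(close, rsi_window=14, ma_window=50):
    close_5m = close.resample('5T').last()
    rsi = vbt.RSI.run(close_5m, window=rsi_window).rsi
    rsi.columns = rsi.columns.droplevel()   # drop the 'rsi_window' level from the columns
    rsi, _ = rsi.align(close, broadcast_axis=0, method='ffill')
    return rsi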
The problem is that, for align to join the two objects, the close data needs to be a Series (a plain time series) rather than a one-column DataFrame.
You need to fix the data type first:
# Time Series
close = close['Close']
close_5m = close.resample('15min').last()
rsi = vbt.RSI.run(close_5m, window=rsi_window).rsi
rsi, _ = rsi.align(close, broadcast_axis=0, method='ffill', join='right')
When you are aligning the data make sure to include join='right'
rsi, _ = rsi.align(close, broadcast_axis=0, method='ffill', join='right')
I have a pandas dataframe with daily data. On the last day of each month, I would like to compute a quantity that depends on the daily data of the previous n months (e.g., n=3).
My current solution is to use the pandas rolling function to compute this quantity for every day, and then keep only the quantities of the last day of each month (discarding all the others). This, however, means I perform a lot of unnecessary computations.
Does anybody know how I can improve that?
Thanks a lot in advance!
EDIT:
In the following, I add two examples. In both cases, I compute rolling regressions of stock returns. The first (short) example shows the problem described above and is a sub-problem of my actual problem. The second (long) example shows my actual problem. Therefore, I would either need a solution of the first example that can be embedded in my algorithm for solving the second example or a completely different solution of the second example. Note: The dataframe that I'm using is very large, which means that multiple copies of the entire dataframe are not feasible.
Example 1:
import pandas as pd
import numpy as np
import random
import statsmodels.api as sm
# Generate a time index
dates = pd.date_range("2018-01-01", periods=365, freq="D", name='date')
df = pd.DataFrame(index=dates,columns=['Y','X']).sort_index()
# Generate Data
df['X'] = np.array(range(0, 365))
df['Y'] = 3.1 * df['X'] - 2.5
df = df.iloc[random.sample(range(365),280)] # some days are missing
df.iloc[random.sample(range(280),20),0] = np.nan # some observations are missing
df = df.sort_index()
# Compute Beta
def estimate_beta(ser):
    return sm.OLS(df.loc[ser.index, 'Y'], sm.add_constant(df.loc[ser.index, 'X']), missing='drop').fit().params[-1]

df['beta'] = df['Y'].rolling('60D', min_periods=10).apply(estimate_beta)  # use last 60 days and require at least 10 observations
# Get last entries per month
df_monthly = df[['beta']].groupby([pd.Grouper(freq='M', level='date')]).agg('last')
df_monthly
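One way to skip the per-day work in Example 1 is to slice the 60-day window only at each month end and run the regression there; a sketch under the same data setup (estimate_beta_window, month_ends and betas_at_month_end are illustrative names, and the window bounds may need tweaking to match rolling('60D') exactly):

def estimate_beta_window(window):
    # OLS of Y on X over one explicit slice; NaN if fewer than 10 observations (mirrors min_periods=10)
    if window['Y'].notna().sum() < 10:
        return np.nan
    return sm.OLS(window['Y'], sm.add_constant(window['X']), missing='drop').fit().params[-1]

month_ends = df.groupby(pd.Grouper(freq='M', level='date')).tail(1).index
betas_at_month_end = pd.Series(
    {d: estimate_beta_window(df.loc[d - pd.Timedelta('60D'):d]) for d in month_ends},
    name='beta')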
Example 2:
import pandas as pd
import numpy as np
from pandas import IndexSlice as idx
import random
import statsmodels.api as sm
# Generate a time index
dates = pd.date_range("2018-01-01", periods=365, freq="D", name='date')
arrays = [dates.tolist()+dates.tolist(),["10000"]*365+["10001"]*365]
index = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=["Date", "Stock"])
df = pd.DataFrame(index=index,columns=['Y','X']).sort_index()
# Generate Data
df.loc[idx[:,"10000"],'X'] = X = np.array(range(0,365)).astype(float)
df.loc[idx[:,"10000"],'Y'] = 3*X-2
df.loc[idx[:,"10001"],'X'] = X
df.loc[idx[:,"10001"],'Y'] = -X+1
df = df.iloc[random.sample(range(365*2),360*2)] # some days are missing
df.iloc[random.sample(range(280*2),20*2),0] = np.nan # some observations are missing
# Estimate beta
def estimate_beta_grouped(df_in):
    def estimate_beta(ser):
        return sm.OLS(df.loc[ser.index, 'Y'].astype(float), sm.add_constant(df.loc[ser.index, 'X'].astype(float)), missing='drop').fit().params[-1]
    df = df_in.droplevel('Stock').reset_index().set_index(['Date']).sort_index()
    df['beta'] = df['Y'].rolling('60D', min_periods=10).apply(estimate_beta)
    return df[['beta']]
df_beta = df.groupby(level='Stock').apply(estimate_beta_grouped)
# Extract beta at last day per month
df_monthly = df.groupby([pd.Grouper(freq='M', level='Date'), df.index.get_level_values(1)]).agg('last') # get last observations
df_monthly = df_monthly.merge(df_beta, left_index=True, right_index=True, how='left') # merge beta on df_monthly
df_monthly
I have two arrays (data frames actually but the function below deals with arrays)
bh_arr = array of bank holiday dates in UK
sales_dates = Sales dates for a few years (millions of rows, I mean really millions)
I want to know for each date in sales_dates, how many days to the next bank holiday (from bh_arr).
I built a function like the below and it works but as is evident from the code it is very wasteful, as it calculates all differences first and then gets the non-negative min.
def get_days_to_bh_arr(sale_dates, bh_arr):
    """
    Subtract all elements of bh_arr from each element of sale_dates.
    Get the min ( > 0) for each element of sale_dates.
    Return that array.
    """
    bh_arr = pd.to_datetime(bh_arr)
    res = []
    for each in sale_dates:
        gg = int(np.min([stuff for stuff in (bh_arr - pd.to_datetime(each)) / np.timedelta64(1, 'D') if stuff >= 0]))
        res.append(gg)
    return np.array(res)
bh_arr = ['2018-03-30', '2018-08-27', '2019-05-27']
sales_dates = ['2018-03-15', '2019-05-22', '2018-02-01', '2018-08-05', '2018-06-21']
get_days_to_bh_arr(sales_dates, bh_arr)
15, 5, 57, 22, 67
In the actual code, I finally make the call like so:
sales['days_to_next_bh'] = get_days_to_bh_arr(sales['full_date'], bh['holiday']).astype(np.int32)
Is there a more efficient way of writing the function (of course there is)?
If not, should I try something else like finding the next date from a sorted 'bh_arr', for each date in 'sales_dates' and only at the end do the subtraction? How would I make that work?
Could I vectorise that instead of looping?
Any guidance would be much appreciated.
In pandas this can be done with pd.merge_asof to bring the closest bank holiday in the future. Then datetime subtraction gets the days in between.
Because merge_asof requires sorting, we need to reset the index so that we can maintain the original ordering after the merge.
import pandas as pd
df_b = pd.DataFrame({'bh': pd.to_datetime(bh_arr)})
df_s = pd.DataFrame({'sd': pd.to_datetime(sales_dates)})
df_s = (pd.merge_asof(df_s.reset_index().sort_values('sd'),
                      df_b.sort_values('bh'),
                      direction='forward',
                      left_on='sd',
                      right_on='bh')
          .sort_values('index'))
arr = (df_s['bh'] - df_s['sd']).dt.days.to_numpy()
#array([15, 5, 57, 22, 67], dtype=int64)
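The sorted-array idea from the question can also be done directly with np.searchsorted; a sketch on the same small example (it assumes every sale date has at least one later holiday in bh_arr, otherwise the lookup would run past the end of the array):

import numpy as np
import pandas as pd

bh = np.sort(pd.to_datetime(bh_arr).values)    # sorted bank holidays as datetime64
sd = pd.to_datetime(sales_dates).values
pos = np.searchsorted(bh, sd)                  # position of the next holiday on/after each sale date
days = ((bh[pos] - sd) / np.timedelta64(1, 'D')).astype(int)
# array([15,  5, 57, 22, 67])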
I'm trying to calculate tracking error for a number of different benchmarks versus a fund that I'm looking at (tracking error is defined as the standard deviation of the percent difference between the fund and the benchmark). The time series for the fund and all the benchmarks are in a single data frame that I'm reading from an Excel file. What I have so far is below (the idea being that arg1 represents each benchmark, applied via applymap), but it returns a KeyError. Any suggestions?
import pandas as pd
import numpy as np
data = pd.read_excel('File_Path.xlsx')
def index_analytics(arg1):
    tracking_err = np.std((data['Fund'] - data[arg1]) / data[arg1])
    return tracking_err

data.applymap(index_analytics)
There are a few things that need to be fixed. First, applymap passes each individual value in the dataframe to your function (index_analytics), so arg1 is a scalar value, not a column name. data[arg1] will therefore always raise a KeyError unless your values happen to also be column names.
You also shouldn't need applymap at all. Assuming your benchmarks are in the same dataframe, you can do something like this for each benchmark (next time, include a sample of your dataframe):
df['Benchmark1_result'] = (df['Fund'] - df['Benchmark1']) / df['Benchmark1']
And if you want to calculate all the standard deviations for all the benchmarks you can do this
# assume you have a list of all the benchmark column names
benchmark_columns = [list, of, benchmark, columns]
np.std((df['Fund'].values[:, None] - df[benchmark_columns].values) / df[benchmark_columns].values, axis=0)
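A small self-contained check of that vectorized version (the column names Fund, Benchmark1, Benchmark2 and the sample numbers are made up here):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Fund': [1.00, 2.00, 3.00],
                   'Benchmark1': [1.10, 1.90, 3.20],
                   'Benchmark2': [0.90, 2.10, 2.80]})
benchmark_columns = ['Benchmark1', 'Benchmark2']
rel_diff = (df['Fund'].values[:, None] - df[benchmark_columns].values) / df[benchmark_columns].values
print(np.std(rel_diff, axis=0))   # one tracking error per benchmark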
Assuming you're following the definition of tracking error as the square root of the sum of squared active returns (portfolio return minus benchmark return):
import pandas as pd
import numpy as np
# Example DataFrame
df = pd.DataFrame({'Portfolio_Returns': [5.00, 1.67], 'Bench_Returns': [2.89, .759]})
df['Active_Return'] = df['Portfolio_Returns'] - df['Bench_Returns']
print(df.head())
list_ = df['Active_Return']
temp_ = []
for val in list_:
    x = val ** 2
    temp_.append(x)
tracking_error = np.sqrt(sum(temp_))
print(f"Tracking Error is: {tracking_error}")
Or if you want it more compact (because apparently the cool kids do it):
df = pd.DataFrame({'Portfolio_Returns': [5.00, 1.67], 'Bench_Returns': [2.89, .759]})
tracking_error = np.sqrt(sum([val**2 for val in df['Portfolio_Returns'] - df['Bench_Returns']]))
print(f"Tracking Error is: {tracking_error}")