Is there a way to create a pipeline locally using zipline? - python

I have set up zipline locally in PyCharm. The simulations work; moreover, I have access to premium data from Quandl (which updated automatically when I entered my API key). However, my question now is: how do I make a pipeline locally using zipline?

Zipline's documentation is challenging. Zipline.io (as of 2021-04-05) is also down. Fortunately, Blueshift has documentation and sample code that show how to make a pipeline that can be run locally:
Blueshift sample pipeline code is here. (Pipelines library here.)
Zipline's documentation can be accessed from MLTrading (archived documentation here); although challenging, it is still useful.
The full Blueshift sample pipeline code, modified to run locally through PyCharm, is below the line. Please note, as I'm sure you're already aware, that the strategy is a bad one and you shouldn't trade on it. It does demonstrate local instantiation of pipelines, though.
"""
Title: Classic (Pedersen) time-series momentum (equal weights)
Description: This strategy uses past returns and goes long (short)
the positive (negative) n-percentile
Style tags: Momentum
Asset class: Equities, Futures, ETFs, Currencies
Dataset: All
"""
"""
Sources:
Overall Algorithm here:
https://github.com/QuantInsti/blueshift-demo-strategies/blob/master/factors/time_series_momentum.py
Custom (Ave Vol Filter, Period Returns) Functions Here:
https://github.com/QuantInsti/blueshift-demo-strategies/blob/master/library/pipelines/pipelines.py
"""
import numpy as np

from zipline.pipeline import CustomFilter, CustomFactor, Pipeline
from zipline.pipeline.data import EquityPricing
from zipline.api import (
    order_target_percent,
    schedule_function,
    date_rules,
    time_rules,
    attach_pipeline,
    pipeline_output,
)


def average_volume_filter(lookback, amount):
    """
    Returns a custom filter object for volume-based filtering.

    Args:
        lookback (int): lookback window size
        amount (int): amount to filter (high-pass)

    Returns:
        A custom filter object

    Examples::

        # from library.pipelines.pipelines import average_volume_filter

        pipe = Pipeline()
        volume_filter = average_volume_filter(200, 1000000)
        pipe.set_screen(volume_filter)
    """
    class AvgDailyDollarVolumeTraded(CustomFilter):
        inputs = [EquityPricing.close, EquityPricing.volume]

        def compute(self, today, assets, out, close_price, volume):
            dollar_volume = np.mean(close_price * volume, axis=0)
            high_volume = dollar_volume > amount
            out[:] = high_volume

    return AvgDailyDollarVolumeTraded(window_length=lookback)


def period_returns(lookback):
    """
    Returns a custom factor object for computing simple returns over
    period.

    Args:
        lookback (int): lookback window size

    Returns:
        A custom factor object.

    Examples::

        # from library.pipelines.pipelines import period_returns

        pipe = Pipeline()
        momentum = period_returns(200)
        pipe.add(momentum, 'momentum')
    """
    class SignalPeriodReturns(CustomFactor):
        inputs = [EquityPricing.close]

        def compute(self, today, assets, out, close_price):
            start_price = close_price[0]
            end_price = close_price[-1]
            returns = end_price / start_price - 1
            out[:] = returns

    return SignalPeriodReturns(window_length=lookback)


def initialize(context):
    """
    A function to define things to do at the start of the strategy.
    """
    # The context variables can be accessed by other methods
    context.params = {'lookback': 12,
                      'percentile': 0.1,
                      'min_volume': 1E7
                      }

    # Call rebalance function on the first trading day of each month
    schedule_function(strategy, date_rules.month_start(),
                      time_rules.market_close(minutes=1))

    # Set up the pipelines for strategies
    attach_pipeline(make_strategy_pipeline(context),
                    name='strategy_pipeline')


def strategy(context, data):
    generate_signals(context, data)
    rebalance(context, data)


def make_strategy_pipeline(context):
    pipe = Pipeline()

    # get the strategy parameters
    lookback = context.params['lookback'] * 21
    v = context.params['min_volume']

    # Set the volume filter
    volume_filter = average_volume_filter(lookback, v)

    # compute past returns
    momentum = period_returns(lookback)
    pipe.add(momentum, 'momentum')
    pipe.set_screen(volume_filter)

    return pipe
def generate_signals(context, data):
    try:
        pipeline_results = pipeline_output('strategy_pipeline')
    except Exception:
        context.long_securities = []
        context.short_securities = []
        return

    p = context.params['percentile']
    momentum = pipeline_results
    long_candidates = momentum[momentum > 0].dropna().sort_values('momentum')
    short_candidates = momentum[momentum < 0].dropna().sort_values('momentum')

    n_long = len(long_candidates)
    n_short = len(short_candidates)
    n = int(min(n_long, n_short) * p)

    if n == 0:
        print("{}, no signals".format(data.current_dt))
        context.long_securities = []
        context.short_securities = []
        return  # without this, index[-0:] below would select every long candidate

    context.long_securities = long_candidates.index[-n:]
    context.short_securities = short_candidates.index[:n]
def rebalance(context, data):
    # weighing function
    n = len(context.long_securities)
    if n < 1:
        return
    weight = 0.5 / n

    # square off old positions if any
    for security in context.portfolio.positions:
        if security not in context.long_securities and \
           security not in context.short_securities:
            order_target_percent(security, 0)

    # Place orders for the new portfolio
    for security in context.long_securities:
        order_target_percent(security, weight)

    for security in context.short_securities:
        order_target_percent(security, -weight)
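To actually execute the above locally (outside a hosted research environment), a run_algorithm call along the following lines can be used. This is only a minimal sketch: it assumes the quandl bundle has already been ingested (zipline ingest -b quandl), and the dates and capital base are placeholders.

import pandas as pd
from zipline import run_algorithm

if __name__ == '__main__':
    results = run_algorithm(
        start=pd.Timestamp('2015-01-01', tz='utc'),   # placeholder dates
        end=pd.Timestamp('2018-01-01', tz='utc'),
        initialize=initialize,                        # the initialize() defined above
        capital_base=100000,                          # placeholder capital
        bundle='quandl',                              # assumes this bundle was ingested
        data_frequency='daily',
    )
    print(results.portfolio_value.tail())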

Related

How to fix "missing 1 required positional argument" in PyTorch

I tried to check the length of my training data to train the model, but I got this error. I am implementing this in PyTorch. I have three main functions: dataset, extract beat and extract signal. Can someone help fix this issue, please?
This is my dataset class
class MyDataset(Dataset):
    def __init__(self, patient_ids, bih2aami=True):
        # This method runs once when we call this class, and we pass the data or its
        # references here with the label data.
        self.patient_ids = patient_ids  # list of patient IDs
        self.directory = "C:\\Users\\User\\Downloads\\list\mit-bih-arrhythmia-database-1.0.0\\"  # path
        self.nb_qrs = 99  # number of beats extracted for each patient; each recording had at least 99 normal beats
        self.idx_tuples = flatten([[(patient_idx, rpeak_idx) for rpeak_idx in range(self.nb_qrs)]
                                   for patient_idx in range(len(patient_ids))])
        self.bih2aami = bih2aami
        # if bih2aami == True:
        #     self.y = self.bih2aami(self.y)

    def __len__(self):  # returns the size of the dataset
        return len(self.idx_tuples)  # length of the dataset

    def __getitem__(self, idx):  # get one sample from the dataset
        patient_idx, rpeak_idx = self.idx_tuples[idx]
        patient_id = self.patient_ids[patient_idx]
        file = self.directory + patient_id
        signal, normal_qrs_pos = get_signal(file)
        qrs_pos = normal_qrs_pos[rpeak_idx]
        beat, label = extract_beat(signal, qrs_pos)
        # sample = {'signal': torch.tensor(beat).float(),
        #           'label': torch.tensor(label).float()}
        print(patient_id, patient_idx, beat.shape, label.shape)  # bug: what if label null??
        X, y = torch.tensor(beat).float(), torch.tensor(label).float()
        return X, y
Get signal function
def get_signal(file):
    record = wfdb.rdrecord(file, channels=[0])
    df = pd.DataFrame(record.p_signal, columns=record.sig_name)
    lead = df.columns[0]
    signal = df[lead]  # getting the 1D signal
    annotation = wfdb.rdann(file, 'atr')  # getting the annotation
    relabeled_ann = bih2lamedo(annotation.symbol)
    annotations = pd.DataFrame(relabeled_ann, annotation.sample)
    normal_qrs_pos = list(annotations[annotations[0] == 'N'].index)  # normal beats
    # normal_qrs_pos = list(annotations[annotations[0] != 'O'].index)  # beats
    # normal_qrs_pos = list(annotations.index)  # normal beats
    return signal, normal_qrs_pos
Get beat function
def extract_beat(signal, win_pos, qrs_positions, win_msec=40, fs=360, start_beat=36, end_beat=108):
    """
    win_pos        position at which you place the window of your beat
    qrs_positions  (list) the qrs indices from the annotations (read them from the
                   atr file) -> obtained from annotation.sample
    win_msec       in milliseconds
    """
    # extract signal
    signal = np.array(signal)
    # print(signal.shape)
    # beat_array = np.zeros(start_beat + end_beat)  # number of channels
    start = int(max(win_pos - start_beat, 0))
    stop = start + start_beat + end_beat
    # print(beat_array.shape, signal.shape)
    beat = signal[start:stop]

    # compute the nearest neighbour of win_pos among qrs_positions
    tolerance = fs * win_msec // 1000  # samples at a distance < tolerance are matched
    nbr = NearestNeighbors(n_neighbors=1).fit(qrs_positions)
    distances, indices = nbr.kneighbors(np.array([[win_pos]]).reshape(-1, 1))

    # label
    if distances[0][0] <= tolerance:
        label = 1
    else:
        label = 0
    print(distances[0], tolerance, label)
    return beat, label
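The traceback most likely comes from the call extract_beat(signal, qrs_pos) in __getitem__: extract_beat declares three required positional parameters (signal, win_pos, qrs_positions) but only receives two. A minimal sketch of a fix, assuming normal_qrs_pos is the list of QRS indices that the qrs_positions parameter expects:

# inside MyDataset.__getitem__ (sketch; assumes normal_qrs_pos is the intended
# qrs_positions argument)
signal, normal_qrs_pos = get_signal(file)
qrs_pos = normal_qrs_pos[rpeak_idx]
beat, label = extract_beat(signal, qrs_pos, normal_qrs_pos)  # pass all three positional arguments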

Which function can be used to get values from a Callback Class? (AttributeError: 'list' object has no attribute 'i_sd')

The callback is called when specific events occur in an environment (e.g. at the beginning/end of a reset and beginning/end of a step).
I have written a stub of a GEM-Callback and added it to the environment.
That can be utilized to log data at every step.
Now, which function can be called to get data from that class in list format / .CSV format?
Code of callback class is attached below:
class CurrentLoggingCallback(gem.core.Callback):
    def __init__(self):
        super().__init__()
        self._i_sd_idx = None
        self._i_sq_idx = None
        self._i_sd_max = None
        self._i_sq_max = None

    def set_env(self, env):
        super().set_env(env)
        assert 'i_sd' in env.state_names, 'the environment has no state "i_sd".'
        assert 'i_sq' in env.state_names, 'the environment has no state "i_sq".'
        self._i_sd_idx = env.state_names.index('i_sd')
        self._i_sq_idx = env.state_names.index('i_sq')
        # The environment's observations are normalized by their limits.
        # For current values in amperes, the observations have to be multiplied by their limits.
        self._i_sd_max = env.limits[self._i_sd_idx]
        self._i_sq_max = env.limits[self._i_sq_idx]

    def on_step_end(self, k, state, reference, reward, done):
        """Gets called at the end of each step."""
        i_sd = state[self._i_sd_idx] * self._i_sd_max
        i_sq = state[self._i_sq_idx] * self._i_sq_max
        i_sd_ref = reference[self._i_sd_idx] * self._i_sd_max
        i_sq_ref = reference[self._i_sq_idx] * self._i_sq_max
        # Append to list or store to file...

    def on_reset_end(self, state, reference):
        """Gets called at the end of each reset."""
        i_sd = state[self._i_sd_idx] * self._i_sd_max
        i_sq = state[self._i_sq_idx] * self._i_sq_max
        i_sd_ref = reference[self._i_sd_idx] * self._i_sd_max
        i_sq_ref = reference[self._i_sq_idx] * self._i_sq_max
        # Append to list or store to file...


current_logger = CurrentLoggingCallback()
Now, I want to log the values of i_sd and i_sq to a list / .CSV file.
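There is no single built-in getter for this as far as I know; one approach is to have the callback accumulate the values itself and write them out afterwards. A minimal sketch is below; the log lists and the to_csv helper are illustrative additions, not part of the gym-electric-motor API.

import csv

class CurrentLoggingCallback(gem.core.Callback):
    def __init__(self):
        super().__init__()
        self._i_sd_idx = None
        self._i_sq_idx = None
        self._i_sd_max = None
        self._i_sq_max = None
        self.i_sd_log = []  # filled on every step
        self.i_sq_log = []

    # set_env() as in the question ...

    def on_step_end(self, k, state, reference, reward, done):
        """Gets called at the end of each step."""
        self.i_sd_log.append(state[self._i_sd_idx] * self._i_sd_max)
        self.i_sq_log.append(state[self._i_sq_idx] * self._i_sq_max)

    def to_csv(self, path):
        """Hypothetical helper, not a gym-electric-motor function."""
        with open(path, 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['i_sd', 'i_sq'])
            writer.writerows(zip(self.i_sd_log, self.i_sq_log))

# after the simulation loop:
# current_logger.to_csv('currents.csv')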

Using simple averaging for reinforcement learning

I am currently reading Reinforcement Learning: An Introduction (RL:AI) and am trying to reproduce the first example with an n-armed bandit and simple reward averaging.
Averaging
new_estimate = current_estimate + 1.0 / step * (reward - current_estimate)
In order to reproduce the graph from the PDF, I generate 2000 bandit plays, let the different agents play those 2000 bandits for 1000 steps each (as described in the PDF), and then average the reward as well as the percentage of optimal actions.
In the PDF, the result looks like this:
However, I am not able to reproduce this. If I use simple averaging, all the agents with exploration (epsilon > 0) actually play worse than the agent without exploration. This is weird, because the possibility of exploration should allow agents to leave the local optimum more often and reach better actions.
As you can see below, this is not the case for my implementation. Also note that I have added agents which use weighted averaging. These work, but even in that case, raising epsilon results in a degradation of the agents' performance.
Any ideas what's wrong in my code?
The code (MVP)
from abc import ABC
from typing import List

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from multiprocessing.pool import Pool


class Strategy(ABC):
    def update_estimates(self, step: int, estimates: np.ndarray, action: int, reward: float):
        raise NotImplementedError()


class Averaging(Strategy):
    def __str__(self):
        return 'avg'

    def update_estimates(self, step: int, estimates: np.ndarray, action: int, reward: float):
        current = estimates[action]
        return current + 1.0 / step * (reward - current)


class WeightedAveraging(Strategy):
    def __init__(self, alpha):
        self.alpha = alpha

    def __str__(self):
        return 'weighted-avg_alpha=%.2f' % self.alpha

    def update_estimates(self, step: int, estimates: List[float], action: int, reward: float):
        current = estimates[action]
        return current + self.alpha * (reward - current)


class Agent:
    def __init__(self, nb_actions, epsilon, strategy: Strategy):
        self.nb_actions = nb_actions
        self.epsilon = epsilon
        self.estimates = np.zeros(self.nb_actions)
        self.strategy = strategy

    def __str__(self):
        return ','.join(['eps=%.2f' % self.epsilon, str(self.strategy)])

    def get_action(self):
        best_known = np.argmax(self.estimates)
        if np.random.rand() < self.epsilon and len(self.estimates) > 1:
            explore = best_known
            while explore == best_known:
                explore = np.random.randint(0, len(self.estimates))
            return explore
        return best_known

    def update_estimates(self, step, action, reward):
        self.estimates[action] = self.strategy.update_estimates(step, self.estimates, action, reward)

    def reset(self):
        self.estimates = np.zeros(self.nb_actions)


def play_bandit(agent, nb_arms, nb_steps):
    agent.reset()
    bandit_rewards = np.random.normal(0, 1, nb_arms)

    rewards = list()
    optimal_actions = list()

    for step in range(1, nb_steps + 1):
        action = agent.get_action()
        reward = bandit_rewards[action] + np.random.normal(0, 1)
        agent.update_estimates(step, action, reward)
        rewards.append(reward)
        optimal_actions.append(np.argmax(bandit_rewards) == action)

    return pd.DataFrame(dict(
        optimal_actions=optimal_actions,
        rewards=rewards
    ))


def main():
    nb_tasks = 2000
    nb_steps = 1000
    nb_arms = 10

    fig, (ax_rewards, ax_optimal) = plt.subplots(2, 1, sharex='col', figsize=(8, 9))
    pool = Pool()

    agents = [
        Agent(nb_actions=nb_arms, epsilon=0.00, strategy=Averaging()),
        Agent(nb_actions=nb_arms, epsilon=0.01, strategy=Averaging()),
        Agent(nb_actions=nb_arms, epsilon=0.10, strategy=Averaging()),
        Agent(nb_actions=nb_arms, epsilon=0.00, strategy=WeightedAveraging(0.5)),
        Agent(nb_actions=nb_arms, epsilon=0.01, strategy=WeightedAveraging(0.5)),
        Agent(nb_actions=nb_arms, epsilon=0.10, strategy=WeightedAveraging(0.5)),
    ]

    for agent in agents:
        print('Agent: %s' % str(agent))

        args = [(agent, nb_arms, nb_steps) for _ in range(nb_tasks)]
        results = pool.starmap(play_bandit, args)

        df_result = sum(results) / nb_tasks
        df_result.rewards.plot(ax=ax_rewards, label=str(agent))
        df_result.optimal_actions.plot(ax=ax_optimal)

    ax_rewards.set_title('Rewards')
    ax_rewards.set_ylabel('Average reward')
    ax_rewards.legend()
    ax_optimal.set_title('Optimal action')
    ax_optimal.set_ylabel('% optimal action')
    ax_optimal.set_xlabel('steps')
    plt.xlim([0, nb_steps])
    plt.show()


if __name__ == '__main__':
    main()
In the formula for the update rule
new_estimate = current_estimate + 1.0 / step * (reward - current_estimate)
the parameter step should be the number of times that the particular action has been taken, not the overall step number of the simulation. So you need to store that variable alongside the action values in order to use it for the update.
This can also be seen from the pseudo-code box at the end of chapter 2.4 Incremental Implementation:
(source: Richard S. Sutton and Andrew G. Barto: Reinforcement Learning - An Introduction, second edition, 2018, Chapter 2.4 Incremental Implementation)
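For reference, a minimal sketch of that change applied to the Agent class from the question (the per-action counter action_counts is an addition; get_action and __str__ stay as they are):

class Agent:
    def __init__(self, nb_actions, epsilon, strategy: Strategy):
        self.nb_actions = nb_actions
        self.epsilon = epsilon
        self.strategy = strategy
        self.estimates = np.zeros(self.nb_actions)
        self.action_counts = np.zeros(self.nb_actions, dtype=int)  # times each action was taken

    # get_action() and __str__() unchanged from the question

    def update_estimates(self, step, action, reward):
        self.action_counts[action] += 1
        # pass the per-action count, not the global step number
        self.estimates[action] = self.strategy.update_estimates(
            self.action_counts[action], self.estimates, action, reward)

    def reset(self):
        self.estimates = np.zeros(self.nb_actions)
        self.action_counts = np.zeros(self.nb_actions, dtype=int)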

Questions about real time audio signal processing with PyAudio and PyQtGraph

I need to do some real time audio signal processing with Python, i.e. analyze the signal in the frequency domain by framing, windowing and computing the FFT, and then apply some filters depending on the analysis results. I've been using PyAudio for audio acquisition and PyQtGraph for waveform and FFT visualization, as suggested in this and this code.
For now my code only detects the N power spectrum bins with the highest values and highlights them by drawing vertical lines on the FFT plot. Here is what it looks like:
import pyaudio
import numpy as np
from scipy.signal import argrelextrema
import pyqtgraph as pg
from pyqtgraph.Qt import QtCore

## Some settings
FORMAT = pyaudio.paFloat32
CHANNELS = 1
FS = 44100
CHUNK = 256
NFFT = 2048
OVERLAP = 0.5
PLOTSIZE = 32 * CHUNK
N = 4

freq_range = np.linspace(10, FS / 2, NFFT // 2 + 1)
df = FS / NFFT
HOP = NFFT * (1 - OVERLAP)

## Some preliminary functions
def db_spectrum(data):  # computes positive frequency power spectrum
    fft_input = data * np.hanning(NFFT)
    spectrum = abs(np.fft.rfft(fft_input)) / NFFT
    spectrum[1:-1] *= 2
    return 20 * np.log10(spectrum)

def highest_peaks(spectrum):  # finds peaks (local maxima) and picks the N highest ones
    peak_indices = argrelextrema(spectrum, np.greater)[0]
    peak_values = spectrum[peak_indices]
    highest_peak_indices = np.argpartition(peak_values, -N)[-N:]
    return peak_indices[(highest_peak_indices)]

def detection_plot(peaks):  # formats data for vertical line plotting
    x = []
    y = []
    for peak in peaks:
        x.append(peak * df)
        x.append(peak * df)
        y.append(-200)
        y.append(0)
    return x, y

## Main class containing loop and UI
class SpectrumAnalyzer(pg.GraphicsWindow):
    def __init__(self):
        super().__init__()
        self.initUI()
        self.initTimer()
        self.initData()
        self.pa = pyaudio.PyAudio()
        self.stream = self.pa.open(format=FORMAT,
                                   channels=CHANNELS,
                                   rate=FS,
                                   input=True,
                                   output=True,
                                   frames_per_buffer=CHUNK)

    def initUI(self):
        self.setWindowTitle("Microphone Audio Data")
        audio_plot = self.addPlot(title="Waveform")
        audio_plot.showGrid(True, True)
        audio_plot.addLegend()
        audio_plot.setYRange(-1, 1)
        self.time_curve = audio_plot.plot()
        self.nextRow()
        fft_plot = self.addPlot(title="FFT")
        fft_plot.showGrid(True, True)
        fft_plot.addLegend()
        fft_plot.setLogMode(True, False)
        fft_plot.setYRange(-140, 0)  # may be adjusted depending on your input
        self.fft_curve = fft_plot.plot(pen='y')
        self.detection = fft_plot.plot(pen='r')

    def initTimer(self):
        self.timer = QtCore.QTimer()
        self.timer.timeout.connect(self.update)
        self.timer.start(0)

    def initData(self):
        self.waveform_data = np.zeros(PLOTSIZE)
        self.fft_data = np.zeros(NFFT)
        self.fft_counter = 0

    def closeEvent(self, event):
        self.timer.stop()
        self.stream.stop_stream()
        self.stream.close()
        self.pa.terminate()

    def update(self):
        raw_data = self.stream.read(CHUNK)
        self.stream.write(raw_data, CHUNK)
        self.fft_counter += CHUNK

        sample_data = np.fromstring(raw_data, dtype=np.float32)
        self.waveform_data = np.concatenate([self.waveform_data, sample_data])  # update plot data
        self.waveform_data = self.waveform_data[CHUNK:]
        self.time_curve.setData(self.waveform_data)

        self.fft_data = np.concatenate([self.fft_data, sample_data])  # update fft input
        self.fft_data = self.fft_data[CHUNK:]
        if self.fft_counter == HOP:
            spectrum = db_spectrum(self.fft_data)
            peaks = highest_peaks(spectrum)
            x, y = detection_plot(peaks)
            self.detection.setData(x, y, connect='pairs')
            self.fft_curve.setData(freq_range, spectrum)
            self.fft_counter = 0

if __name__ == '__main__':
    spec = SpectrumAnalyzer()
The code works fine, but I still have some questions:
I understand that by calling timer.start() with 0 as an argument, the update method is called again as soon as possible. How does my script know that the update method needs to be called only when the next audio chunk is received and not before?
In the code samples I linked above, the closeEvent method is not modified to stop the timer and the stream when the PyQtGraph window is closed. What used to happen for me is that even after closing the window, the update method was still being called and my audio was still being recorded. Was that normal behavior?
I've read that when using PyQt GUIs, I should always start by creating a QtGui.QApplication instance and call its exec method. Why is that, and why does my code work even though I'm not doing it?
In the future I will need to implement analysis that is much more demanding than just detecting the N highest peaks. Given the current structure of my code, if I add such analysis in the update method, I understand that the CPU will have to compute everything before the next audio chunk is received, whereas it could wait until the next FFT input data is ready. Since the hop size is larger than the chunk size, this would give the CPU more time to compute everything. How can I achieve this? Multithreading?
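Regarding the third question: pg.GraphicsWindow creates a QApplication behind the scenes if none exists yet, which is likely why the script runs at all without one being created explicitly. In a standalone script the event loop should still be started explicitly; a minimal sketch, assuming the SpectrumAnalyzer class from above:

import pyqtgraph as pg

if __name__ == '__main__':
    app = pg.mkQApp()          # creates a QApplication if one does not exist yet
    spec = SpectrumAnalyzer()
    app.exec_()                # start the Qt event loop; returns when the window is closed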

Python milk library: object weights issue

I'm trying to use a one_vs_one composition of decision trees for multiclass classification. The problem is that when I pass different object weights to a classifier, the result stays the same.
Do I misunderstand something about weights, or do they just work incorrectly?
Thanks for your replies!
Here is my code:
from math import exp, log  # used by learn() and update_weights()


class AdaLearner(object):
    def __init__(self, in_base_type, in_multi_type):
        self.base_type = in_base_type
        self.multi_type = in_multi_type

    def train(self, in_features, in_labels):
        model = AdaBoost(self.base_type, self.multi_type)
        model.learn(in_features, in_labels)
        return model


class AdaBoost(object):
    CLASSIFIERS_NUM = 100

    def __init__(self, in_base_type, in_multi_type):
        self.base_type = in_base_type
        self.multi_type = in_multi_type
        self.classifiers = []
        self.weights = []

    def learn(self, in_features, in_labels):
        labels_number = len(set(in_labels))
        self.weights = self.get_initial_weights(in_labels)
        for iteration in xrange(AdaBoost.CLASSIFIERS_NUM):
            classifier = self.multi_type(self.base_type())
            self.classifiers.append(classifier.train(in_features,
                                                     in_labels,
                                                     weights=self.weights))
            answers = []
            for obj in in_features:
                answers.append(self.classifiers[-1].apply(obj))
            err = self.compute_weighted_error(in_labels, answers)
            print err
            if abs(err - 0.) < 1e-6:
                break
            alpha = 0.5 * log((1 - err) / err)
            self.update_weights(in_labels, answers, alpha)
            self.normalize_weights()

    def apply(self, in_features):
        answers = {}
        for classifier in self.classifiers:
            answer = classifier.apply(in_features)
            if answer in answers:
                answers[answer] += 1
            else:
                answers[answer] = 1
        ranked_answers = sorted(answers.iteritems(),
                                key=lambda (k, v): (v, k),
                                reverse=True)
        return ranked_answers[0][0]

    def compute_weighted_error(self, in_labels, in_answers):
        error = 0.
        w_sum = sum(self.weights)
        for ind in xrange(len(in_labels)):
            error += (in_answers[ind] != in_labels[ind]) * self.weights[ind] / w_sum
        return error

    def update_weights(self, in_labels, in_answers, in_alpha):
        for ind in xrange(len(in_labels)):
            self.weights[ind] *= exp(in_alpha * (in_answers[ind] != in_labels[ind]))

    def normalize_weights(self):
        w_sum = sum(self.weights)
        for ind in xrange(len(self.weights)):
            self.weights[ind] /= w_sum

    def get_initial_weights(self, in_labels):
        weight = 1 / float(len(in_labels))
        result = []
        for i in xrange(len(in_labels)):
            result.append(weight)
        return result
As you can see, it is just a simple AdaBoost (I instantiated it with in_base_type = tree_learner, in_multi_type = one_against_one), and it worked the same way no matter how many base classifiers were engaged. It just acted as one multiclass decision tree.
Then I made a hack: on each iteration I chose a random sample of objects with respect to their weights, and trained the classifiers on that random subset without any weights. That worked as it was supposed to.
The default tree criterion, namely information gain, does not take the weights into account. If you know of a formula which would do it, I'll implement it.
In the meantime, using neg_z1_loss will do it correctly. By the way, there was a slight bug in that implementation, so you will need to use the most current github master.
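For reference, a minimal sketch of how that might look. The module paths and the criterion keyword below are assumptions based on the answer above; verify them against the current milk github master before relying on them.

# Hedged sketch: tree_learner's criterion argument and the neg_z1_loss name are
# assumed from the answer above; check the milk source for the exact API.
from milk.supervised import tree
from milk.supervised.multi import one_against_one

base = tree.tree_learner(criterion=tree.neg_z1_loss)  # weight-aware split criterion
learner = one_against_one(base)
model = learner.train(features, labels, weights=weights)  # as in the question's AdaBoost.learn()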
