Error in Threading SARIMAX model - python

I am using the threading library for the first time in order to speed up the training of my SARIMAX model, but the code keeps failing with the following error:
Bad direction in the line search; refresh the lbfgs memory and restart the iteration.
This problem is unconstrained.
This problem is unconstrained.
This problem is unconstrained.
Following is my code:
import numpy as np
import pandas as pd
from statsmodels.tsa.arima_model import ARIMA
import statsmodels.tsa.api as smt
from threading import Thread
def process_id(ndata):
    train = ndata[0:-7]
    test = ndata[len(train):]
    try:
        model = smt.SARIMAX(train.asfreq(freq='1d'), exog=None, order=(0, 1, 1), seasonal_order=(0, 1, 1, 7)).fit()
        pred = model.get_forecast(len(test))
        fcst = pred.predicted_mean
        fcst.index = test.index
        mapelist = []
        for i in range(len(fcst)):
            mapelist.insert(i, (np.absolute(test[i] - fcst[i])) / test[i])
        mape = np.mean(mapelist) * 100
        print(mape)
    except:
        mape = 0
        pass
    return mape
def process_range(ndata, store=None):
    if store is None:
        store = {}
    for id in ndata:
        store[id] = process_id(ndata[id])
    return store
def threaded_process_range(nthreads,ndata):
    store = {}
    threads = []
    # create the threads
    k = 0
    tk = ndata.columns
    for i in range(nthreads):
        dk = tk[k:len(tk)/nthreads+k]
        k = k+len(tk)/nthreads
        t = Thread(target=process_range, args=(ndata[dk],store))
        threads.append(t)
    [ t.start() for t in threads ]
    [ t.join() for t in threads ]
    return store
outdata = threaded_process_range(4,ndata)
A few things I would like to mention:
Data is daily stock time series in a dataframe
Threading works for ARIMA model
SARIMAX model works when done in a for loop
Any insights would be highly appreciated thanks!
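For reference, here is a minimal sketch of splitting column labels into near-equal chunks and running one thread per chunk. The names `chunk_labels`, `run_in_threads` and the `work` callback are hypothetical and not part of the code above. The sketch uses `divmod` because, under Python 3, `len(tk)/nthreads` is a float and cannot be used as a slice bound.

```python
from threading import Lock, Thread


def chunk_labels(labels, nchunks):
    """Split labels into nchunks contiguous, near-equal slices."""
    labels = list(labels)
    size, extra = divmod(len(labels), nchunks)
    chunks, start = [], 0
    for i in range(nchunks):
        # The first `extra` chunks take one additional label each.
        stop = start + size + (1 if i < extra else 0)
        chunks.append(labels[start:stop])
        start = stop
    return chunks


def run_in_threads(labels, work, nthreads=4):
    """Apply work(label) to every label, using one thread per chunk."""
    store, lock = {}, Lock()

    def worker(chunk):
        for label in chunk:
            result = work(label)
            with lock:  # guard the shared dictionary explicitly
                store[label] = result

    threads = [Thread(target=worker, args=(chunk,))
               for chunk in chunk_labels(labels, nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store
```

With the question's data, `work` could be something like `lambda col: process_id(ndata[col])`, called as `run_in_threads(ndata.columns, work, nthreads=4)`.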

I got the same error with lbfgs. I'm not sure why lbfgs fails to do gradient evaluations, but I tried changing the optimizer. You can try this too; choose any of these optimizers:
'newton' for Newton-Raphson
'nm' for Nelder-Mead
'bfgs' for Broyden-Fletcher-Goldfarb-Shanno (BFGS)
'lbfgs' for limited-memory BFGS with optional box constraints
'powell' for modified Powell's method
'cg' for conjugate gradient
'ncg' for Newton-conjugate gradient
'basinhopping' for global basin-hopping solver
Change this in your code:
model = smt.SARIMAX(train.asfreq(freq='1d'), exog=None, order=(0, 1, 1), seasonal_order=(0, 1, 1, 7)).fit(method='cg')
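If it is not clear in advance which optimizer will behave, one option is to try several in turn. The sketch below is only an illustration: `fit_with_fallback` and its `accept` predicate are hypothetical helpers, not part of statsmodels. It also assumes that `fit` is any callable taking a `method` keyword, such as the bound `.fit` of a SARIMAX model.

```python
def fit_with_fallback(fit, methods=("lbfgs", "bfgs", "cg", "nm", "powell"),
                      accept=lambda result: True):
    """Call fit(method=...) for each optimizer until one result is accepted.

    Returns a (method_name, result) pair. `accept` is a hypothetical hook
    for rejecting fits that ran without raising but did not converge.
    """
    failures = {}
    for method in methods:
        try:
            result = fit(method=method)
        except Exception as exc:  # remember the failure, try the next one
            failures[method] = repr(exc)
            continue
        if accept(result):
            return method, result
        failures[method] = "rejected by accept()"
    raise RuntimeError(f"no optimizer produced an accepted fit: {failures}")
```

Usage would look like `method, model = fit_with_fallback(smt.SARIMAX(...).fit)`. The line-search message is usually printed by the optimizer rather than raised as an exception, so a convergence check passed as `accept` is likely needed for the fallback to trigger.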
It's an old question, but I'm answering it in case someone faces the same problem in the future.

Related

Running multiple Ray Tune jobs in parallel using a search algorithm

I want to queue 200+ tuning jobs on my Ray cluster. Each needs to be guided by a search algorithm, as my actual objective function has 40+ parameters.
I can do this for a single job like this:
import ray
from ray import tune
from ray.tune import Tuner, TuneConfig
from ray.tune.search.optuna import OptunaSearch
ray.init()
def objective(config):
    ground_truth = [1,2,3,4]
    yhat = [i*config['factor'] + config['constant'] for i in range(4)]
    abs_err = [abs(gt - yh) for gt, yh in zip(ground_truth, yhat)]
    mae = sum(abs_err)/len(abs_err)
    tune.report(mean_accuracy = mae)
config = {
    'factor': tune.quniform(0,3,1),
    'constant': tune.quniform(0,3,1)
}
algo = OptunaSearch()
tuner = tune.Tuner(
    objective,
    tune_config=TuneConfig(
        metric="mean_accuracy",
        mode="min",
        search_alg=algo,
        num_samples=100
    ),
    param_space=config
)
results = tuner.fit()
This works and gives the desired result for 1 of the 200 jobs.
Now I want to queue up to 200 jobs from a single run of a single script.
As far as I understand the documentation, this is how that should work:
import ray
from ray import tune
ray.init()
def objective(config):
    ground_truth = [1,2,3,4]
    yhat = [i*config['factor'] + config['constant'] for i in range(4)]
    abs_err = [abs(gt - yh) for gt, yh in zip(ground_truth, yhat)]
    mae = sum(abs_err)/len(abs_err)
    tune.report(mean_accuracy = mae)
config = {
    'factor': tune.quniform(0,3,1),
    'constant': tune.quniform(0,3,1)
}
experiments = []
for i in range(3):
    experiment_spec = tune.Experiment(
        name=f'{i}',
        run=objective,
        stop={"mean_accuracy": 0},
        config=config,
        num_samples=10
    )
    experiments.append(experiment_spec)
out = tune.run_experiments(experiments)
When I run this I get the message: Running with multiple concurrent experiments. All experiments will be using the same SearchAlgorithm.
I need to be able to specify the search algorithm, but I don't understand how. Additionally, these experiments appear to be part of one large optimization: out is a list of 30 objective objects. The parameter values chosen are drawn from a uniform distribution, without the quantization (the q in quniform). However, all 30 values fall in the specified range.
I must have misunderstood the purpose of run_experiments. Please help.

Reinforcement learning on pandas dataframe. ValueError: setting an array element with a sequence

I am trying to create a reinforcement learning algorithm that optimizes Pulltimes (timestamps) on a flight schedule. The agent does this by subtracting a number between 30 and 60 from the current STD (timestamp). Once it has iterated through the entire dataframe, a reward is calculated based on the bottlenecks created by the new Pulltimes. The goal is to minimize the bottlenecks. In essence, I am using the Pulltime column to free up bottlenecks that occur in the STD column due to many simultaneous flights.
The reward part of the code has been written and works; however, I keep running into errors regarding the observation space and observations.
I have a dataframe consisting of STDs and Pulltimes in the datetime format "2022-07-27 22:00:00", sorted from earliest to latest timestamp.
import gym
from gym import spaces
import numpy as np
from typing import Optional
import numpy as np
from datetime import date, timedelta, time
from reward_calculation import calc_total_reward
import os
import pandas as pd
from stable_baselines3 import DQN, A2C
from stable_baselines3.common.env_checker import check_env
class PTOPTEnv(gym.Env):
    def __init__(self, df):
        super(PTOPTEnv, self).__init__()
        self.render_mode = None # Define the attribute render_mode in your environment
        self.df = df
        self.df_length = len(df.index)-1
        self.curr_progress = 0
        self.action_space = spaces.Discrete(30)
        #self.observation_space = spaces.Box(low=np.array([-np.inf]), high=np.array([np.inf]), dtype=np.int)
        self.observation_space = spaces.Box(low=0, high=np.inf, shape = (5,))
        #Pulltimes = self.df.loc[:, "STD"].to_numpy()
    def step(self, action):
        STD = self.df.loc[self.curr_progress, "STD"]
        print(action, action+30)
        self.df.loc[self.curr_progress, "Pulltime"] = self.df.loc[self.curr_progress, "STD"]-timedelta(minutes=action+30)
        # An episode is done if the agent has reached the target
        done = True if self.curr_progress==self.df_length else False
        reward = 100000-calc_total_reward(self.df) if done else 0 # Binary sparse rewards
        observation = self._get_obs()
        info = {}
        self.curr_progress += 1
        return observation, reward, done, info
    def reset(self):
        self.curr_progress = 0
        observation = self._get_obs()
        info = self._get_info()
        return observation
    def _get_obs(self):
        # Get the data points for the previous entries
        frame = np.array([
            self.df.loc[0: self.curr_progress, 'Pulltime'].values,
            self.df.loc[:, 'Pulltime'].values,
            self.df.loc[self.curr_progress: , 'Pulltime'].values,
        ], dtype='datetime64')
        obs = np.append(frame, [[self.curr_progress, 0], [0]], axis=0)
        print(obs)
        print(obs.shape)
        print(type(obs))
        return obs
    def _get_info(self):
        return {"Test": 0}
dir_path = os.path.dirname(os.path.realpath(__file__))
df_use = pd.read_csv(dir_path + "\\Flight_schedule.csv", sep=";", decimal=",")
df_use["STD"] = pd.to_datetime(df_use["STD"], format='%Y-%m-%d %H:%M:%S')
df_use["Pulltime"] = 0
df_use = df_use.drop(['PAX'], axis=1)
env = PTOPTEnv(df=df_use)
check_env(env)
The issue arises when running check_env, which raises the following error:
"ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part."
I've tried replacing the np.array with one consisting of zeros just to see if that would get me any further, but that just throws "AssertionError: The observation returned by reset() method must be a numpy array".
So how do I go about this? I've tried everything I could find on Google, but it all revolves around CartPole and other RL environments that have nothing to do with a pandas dataframe.
Per request, I have uploaded a repo with all the corresponding files here: github.com/sword134/Pandas-flight-RL

How can I fix the "1 leaked semaphore objects to clean up at shutdown" error on Mac M1

I am using Apple Mac M1:
OS: MacOS Monterey
Python 3.9.13
I want to implement a semantic search using SentenceTransformer.
Here is my code:
from sentence_transformers import SentenceTransformer
import faiss
from pprint import pprint
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
def load_index():
    index = faiss.read_index("movie_plot.index")
    return index
def fetch_movie_info(dataframe_idx):
    info = df.iloc[dataframe_idx]
    meta_dict = dict()
    meta_dict['Title'] = info['Title']
    meta_dict['Plot'] = info['Plot'][:500]
    return meta_dict
def search(query, top_k, index, model):
    print("starting search!")
    t=time.time()
    query_vector = model.encode([query])
    top_k = index.search(query_vector, top_k)
    print('>>>> Results in Total Time: {}'.format(time.time()-t))
    top_k_ids = top_k[1].tolist()[0]
    top_k_ids = list(np.unique(top_k_ids))
    results = [fetch_movie_info(idx) for idx in top_k_ids]
    return results
def main():
    # GET MODEL
    model = SentenceTransformer('msmarco-distilbert-base-dot-prod-v3')
    print("model set!")
    #GET INDEX
    index = load_index()
    print("index loaded!")
    query="Artificial Intelligence based action movie"
    results=search(query, top_k=5, index=index, model=model)
    print("\n")
    for result in results:
        print('\t',result)
main()
When I run the code above, it gets stuck at this warning:
/opt/homebrew/Caskroom/miniforge/base/envs/searchapp/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
What is causing this and how can I fix it?
I had the same problem; upgrading to Python 3.9 solved it in my case.
It looks like this is a reported bug: https://bugs.python.org/issue45209
Check your memory utilisation. You might be out of memory.
Since you're using a pretrained model for inference/prediction, you can reduce the memory requirements of the model by avoiding computation of the gradients. Gradients are typically only needed when training the model.
You can wrap your model call with torch.no_grad like so:
import torch
with torch.no_grad():
    query_vector = model.encode([query])
You can also reduce the memory utilisation of a model by reducing the batch size (i.e. the number of samples passed to the model at one time), but that doesn't seem to apply here.

Pyomo time-dependent model?

I'm quite new to Pyomo and I'm having a hard time figuring out how to create a time-dependent model and plot it on a graph. By time-dependent I mean a variable that takes a different value at each time step (from 1 to T in this case).
I used this very simple model, but when I run the script I get only one solution in the output. How can I change that?
I also get an error related to the constraint function, but I'm not sure what's wrong:
(ValueError: Constraint 'constraint[1]' does not have a proper value. Found '<generator object somma.<locals>.<genexpr> at 0x7f202b540850>' Expecting a tuple or equation.)
I'd like to show how the value of x(t) varies in all timesteps.
Any help is appreciated.
from __future__ import division
from pyomo.environ import *
from pyomo.opt import SolverFactory
import sys
model = AbstractModel()
model.n = Param()
model.T = RangeSet(1, model.n)
model.a = Param(model.T)
model.b = Param(model.T)
model.x = Var(model.T, domain= NonNegativeReals)
data = DataPortal()
data.load(filename='N.csv', range='N', param=model.n)
data.load(filename='A.csv', range= 'A', param=model.a)
data.load(filename='B.csv', range= 'B', param=model.b)
def objective(model):
    return model.x
model.OBJ = Objective(rule=objective)
def somma(model):
    return model.a[t]*model.x[t] for t in model.T) >= model.b[t] for t in model.T
model.constraint = Constraint(model.T, rule=somma)
instance = model.create_instance(data)
opt = SolverFactory('glpk')
results = opt.solve(instance)
You can build up lists of the values you would like to plot like this:
T_plot = list(instance.T)
x_plot = [value(instance.x[t]) for t in T_plot]
and then use your favorite Python plotting package to make the plots. I usually use Matplotlib.
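Regarding the constraint error in the question: when a Constraint is indexed by model.T, Pyomo calls the rule once per index, passing that index as an extra argument. Each call is expected to return a single relational expression, not a generator over all of T. One reading of the model the question seems to intend is written out below. This is an assumption, since the question does not state the objective explicitly.

```latex
\begin{aligned}
\min_{x}\quad & \sum_{t \in T} x_t \\
\text{s.t.}\quad & a_t \, x_t \ge b_t, && \forall t \in T, \\
& x_t \ge 0, && \forall t \in T.
\end{aligned}
```

Under that reading, the rule for index t returns only the relation a_t x_t >= b_t, and the objective returns a scalar sum over T rather than the indexed variable model.x itself. After solving, each x_t has its own value, which is what the lists above collect for plotting.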

Convert numpy function to theano

I am using PyMC3 to calculate something which I won't get into here but you can get the idea from this link if interested.
The '2-lambdas' case is basically a switch function, which needs to be compiled to a Theano function to avoid dtype errors and looks like this:
import theano
from theano.tensor import lscalar, dscalar, lvector, dvector, argsort
#theano.compile.ops.as_op(itypes=[lscalar, dscalar, dscalar], otypes=[dvector])
def lambda_2_distributions(tau, lambda_1, lambda_2):
    """
    Return values of `lambda_` for each observation based on the
    transition value `tau`.
    """
    out = zeros(num_observations)
    out[: tau] = lambda_1 # lambda before tau is lambda1
    out[tau:] = lambda_2 # lambda after (and including) tau is lambda2
    return out
I am trying to generalize this to apply to 'n-lambdas', where taus.shape[0] = lambdas.shape[0] - 1, but I can only come up with this horribly slow numpy implementation.
#theano.compile.ops.as_op(itypes=[lvector, dvector], otypes=[dvector])
def lambda_n_distributions(taus, lambdas):
    out = zeros(num_observations)
    np_tau_indices = argsort(taus).eval()
    num_taus = taus.shape[0]
    for t in range(num_taus):
        if t == 0:
            out[: taus[np_tau_indices[t]]] = lambdas[t]
        elif t == num_taus - 1:
            out[taus[np_tau_indices[t]]:] = lambdas[t + 1]
        else:
            out[taus[np_tau_indices[t]]: taus[np_tau_indices[t + 1]]] = lambdas[t]
    return out
Any ideas on how to speed this up using pure Theano (avoiding the call to .eval())? It's been a few years since I've used it and so don't know the right approach.
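To pin down the intended semantics independently of Theano, here is a plain-Python sketch. The name `segment_lambdas` is hypothetical, and the sketch assumes the n-lambda case should generalize the 2-lambda rule above, so that observation i gets the lambda whose position equals the number of change points less than or equal to i. A loop-free array version would only need to reproduce this counting.

```python
from bisect import bisect_right


def segment_lambdas(taus, lambdas, num_observations):
    """Return the lambda in force at each observation index.

    Index i lies in segment j, where j is the number of change points
    that are <= i, so out[i] = lambdas[j]. With one tau this reduces to
    out[:tau] = lambdas[0] and out[tau:] = lambdas[1].
    """
    if len(lambdas) != len(taus) + 1:
        raise ValueError("need exactly one more lambda than tau")
    ordered = sorted(taus)  # change points may arrive unsorted
    return [lambdas[bisect_right(ordered, i)] for i in range(num_observations)]
```

For example, `segment_lambdas([4, 2], [0.5, 1.5, 2.5], 6)` assigns the first lambda to indices 0 and 1, the second to indices 2 and 3, and the third to indices 4 and 5.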
Using a switch function is not recommended, as it breaks the nice geometry of the parameter space and makes sampling with modern samplers such as NUTS difficult.
Instead, you can try modeling it with a continuous relaxation of a switch function. The main idea is to model the rate before the first switch point as a baseline, and to add the prediction from a logistic function after each switch point:
def logistic(L, x0, k=500, t=np.linspace(0., 1., 1000)):
    return L/(1+tt.exp(-k*(t_-x0)))
with pm.Model() as m2:
    lambda0 = pm.Normal('lambda0', mu, sd=sd)
    lambdad = pm.Normal('lambdad', 0, sd=sd, shape=nbreak-1)
    trafo = Composed(pm.distributions.transforms.LogOdds(), Ordered())
    b = pm.Beta('b', 1., 1., shape=nbreak-1, transform=trafo,
                testval=[0.3, 0.5])
    theta_ = pm.Deterministic('theta', tt.exp(lambda0 +
                                              logistic(lambdad[0], b[0]) +
                                              logistic(lambdad[1], b[1])))
    obs = pm.Poisson('obs', theta_, observed=y)
    trace = pm.sample(1000, tune=1000)
I used a few tricks here as well, for example the composite transformation, which is not in the PyMC3 code base yet. You can have a look at the full code here: https://gist.github.com/junpenglao/f7098c8e0d6eadc61b3e1bc8525dd90d
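To see why the logistic terms behave like a relaxed switch, here is a small standard-library sketch of the same rate expression evaluated at single time points. The name `relaxed_rate`, the argument order and the numbers in the usage note are illustrative assumptions, not taken from the gist.

```python
import math


def logistic(L, x0, t, k=500.0):
    """Smooth step of height L centred at x0; larger k gives a sharper step."""
    z = k * (t - x0)
    if z >= 0:
        return L / (1.0 + math.exp(-z))
    ez = math.exp(z)  # this form avoids overflow for large negative z
    return L * ez / (1.0 + ez)


def relaxed_rate(t, lambda0, deltas, breaks, k=500.0):
    """exp(baseline + sum of smooth steps), mirroring the theta expression."""
    log_rate = lambda0 + sum(logistic(d, b, t, k) for d, b in zip(deltas, breaks))
    return math.exp(log_rate)
```

With `lambda0 = math.log(2)`, `deltas = [math.log(3), -math.log(2)]` and `breaks = [0.3, 0.5]`, the rate is very close to 2 well before 0.3, close to 6 between the two break points, and close to 3 after 0.5. Unlike a hard switch, the transitions are smooth functions of the break locations.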
If you have more questions, please post to https://discourse.pymc.io with your model and (simulated) data. I check and answer on the PyMC3 discourse much more regularly.
