Dask delayed function call with non-passed parameters - python

I am seeking to better understand the following behavior when using dask.delayed to call a function that depends on parameters. The issue seems to arise when parameters are specified in a parameters file read by configparser. Here is a complete example:
parameter file:
#zpar.ini: parameter file for configparser
[my pars]
my_zpar = 2.
parser:
#zippy_parser
import configparser

def read(_rundir):
    global rundir
    rundir = _rundir
    cp = configparser.ConfigParser()
    cp.read(rundir + '/zpar.ini')
    #[my pars]
    global my_zpar
    my_zpar = cp['my pars'].getfloat('my_zpar')
and the main python file:
# dask test with configparser
import dask
from dask.distributed import Client
import zippy_parser as zpar

def my_func(x, y):
    # print stuff
    print("parameter from main is: {}".format(main_par))
    print("parameter from configparser is: {}".format(zpar.my_zpar))
    # do stuff
    return x + y

if __name__ == '__main__':
    client = Client(n_workers = 4)
    #read parameters from input file
    rundir = '/path/to/parameter/file'
    zpar.read(rundir)
    #test zpar
    print("zpar is {}".format(zpar.my_zpar))
    #define parameter and call my_func
    main_par = 5.
    z = dask.delayed(my_func)(1., 2.)
    z.compute()
    client.close()
The first print statement in my_func() executes just fine, but the second print statement raises an exception. The output is:
zpar is 2.0
parameter from main is: 5.0
distributed.worker - WARNING - Compute Failed
Function: my_func
args: (1.0, 2.0)
kwargs: {}
Exception: AttributeError("module 'zippy_parser' has no attribute 'my_zpar'",)
I am new to dask. I suppose this has something to do with the serialization, which I do not understand. Can someone enlighten me and/or point to relevant documentation? Thanks!

I will try to keep this brief.
When a function is serialised in order to be sent to workers, Python also sends the local variables and functions needed by the function (its "closure"). However, it stores the modules the function references by name; it does not try to serialise your whole runtime.
This means that zippy_parser is imported in the worker, not deserialised. Since the function read has never been called
in the worker, the module-level global is never initialised.
So, you could call read in the workers as part of your function or otherwise, but the pattern of setting module-global variables from within a function probably isn't great anyway. Dask's delayed mechanism prefers functional purity: the result you get should not depend on the current state of the runtime.
(note that if you had created the client after calling read in the main script, the workers might have got the in-memory version, depending on how subprocesses are configured to be created on your system)
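For illustration, a minimal sketch of the first option, using Client.run to execute zpar.read once on every worker (this assumes every worker can see the same rundir path, and only covers workers that exist at the time of the call):
if __name__ == '__main__':
    client = Client(n_workers = 4)
    rundir = '/path/to/parameter/file'
    zpar.read(rundir)              # initialise the module in the main process
    client.run(zpar.read, rundir)  # ...and once on every worker process
    main_par = 5.
    z = dask.delayed(my_func)(1., 2.)
    z.compute()
    client.close()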

I encourage you to pass in all parameters to your dask delayed functions explicitly, rather than relying on the global namespace.
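For example, a sketch of my_func with everything it needs passed in as arguments:
def my_func(x, y, main_par, my_zpar):
    print("parameter from main is: {}".format(main_par))
    print("parameter from configparser is: {}".format(my_zpar))
    return x + y

# in the main block, after zpar.read(rundir):
z = dask.delayed(my_func)(1., 2., main_par, zpar.my_zpar)
z.compute()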

Related

Multiple scripts access the same module with the same data in python?

Recently I have been trying to make a makeshift "disk space" reader. I made a library that stores values in a list (the "disk"), but when I subprocess a new script to write to the "disk", to see if the values change on the display, nothing happens. I realized that any time you import a module, the module sort of clones itself into just that script.
I want to be able to have scripts import the same module and so that if 1 script changes a value another script can see that value.
Here is my code for the "disk" system
import time

ram = []
space = 256000
lastspace = 0

for i in range(0, space + 1):
    ram.append('')

def read(location):
    try:
        if ram[int(location)] == '':
            return "ERR_NO_VALUE"
        else:
            return ram[int(location)]
    except:
        return "ERR_OUT_OF_RANGE"

def write(location, value):
    try:
        ram[int(location)] = value
    except:
        return "ERR_OUT_OF_RANGE"

def getcontents():
    contents = []
    for i in range(0, 256001):
        contents.append([str(i) + '- ', ram[i]])
    return contents

def getrawcontents():
    contents = []
    for i in range(0, 256001):
        contents.append(ram[i])
    return contents

def erasechunk(beg, end):
    try:
        for i in range(int(beg), int(end) + 1):
            ram[i] = ''
    except:
        return "ERR_OUT_OF_RANGE"

def erase(location):
    ram[int(location)] = ''

def reset():
    ram = []
    times = space / 51200
    tc = 0
    for i in range(0, round(times)):
        for x in range(0, 51201):
            ram.append('')
        tc += 1
        print("Byte " + str(tc) + " of " + " Bytes")
    for a in range(0, 100):
        print('\a', end='')
    return [len(ram), ' bytes']

def wipe():
    for i in range(0, 256001):
        ram[i] = ''
    return "WIPED"

def getspace():
    x = 0
    for i in range(0, len(ram)):
        if ram[i] != "":
            x += 1
    return [x, 256000]
The shortest answer to your question, which I'm understanding as "if I import the same function into two (or more) Python namespaces, can they interact with each other?", is no. What actually happens when you import a module is that Python uses the source script to 'build' those functions in the namespace you're importing them to; there is no sense of permanence in "where the module came from" since that original module isn't actually running in a Python process anywhere! When you import those functions into multiple scripts, it's just going to create those pseudo-global variables (in your case ram) with the function you're importing.
Python import docs: https://docs.python.org/3/reference/import.html
The whole page on Python's data model, including what __globals__ means for functions and modules: https://docs.python.org/3/reference/datamodel.html
Explanation:
To go into a bit more depth, when you import any of the functions from this script (let's assume it's called 'disk.py'), you'll get an object in that function's __globals__ dict called ram, which will indeed work as you expect for these functions in your current namespace:
from disk import read,write
write(13,'thing')
print(read(13)) #prints 'thing'
We might assume, since these functions are accurately accessing our ram object, that the ram object is being modified somehow in the namespace of the original script, which could then be accessed by a different script (a different Python process). Looking at the namespace of our current script using dir() might support that notion, since we only see read and write, and not ram. But the secret is that ram is hidden in those functions' __globals__ dict (mentioned above), which is how the functions are interacting with ram:
from disk import read,write
print(type(write.__globals__['ram'])) #<class 'list'>
print(write.__globals__['ram'] is read.__globals__['ram']) #True
write(13,'thing')
print(read(13)) #'thing'
print(read.__globals__['ram'][13]) #'thing'
As you can see, ram actually is a variable defined in the namespace of our current Python process, hidden in the functions' __globals__ dict, which is actually the exact same dictionary for any function imported from the same module; read.__globals__ is write.__globals__ evaluates to True (even if you don't import them at the same time!).
So, to wrap it all up, ram is contained in the __globals__ dict for the disk module, which is created separately in the namespace of each process you import into:
Python interpreter #1:
from disk import read,write
print(id(read.__globals__),id(write.__globals__)) #139775502955080 139775502955080
Python interpreter #2:
from disk import read,write
print(id(read.__globals__),id(write.__globals__)) #139797009773128 139797009773128
Solution hint:
There are many approaches on how to do this practically that are beyond the scope of this answer, but I will suggest that pickle is the standard way to send objects between Python interpreters using files, and has a really standard interface. You can just write, read, etc your ram object using a pickle file. To write:
import pickle
with open('./my_ram_file.pkl', 'wb') as ram_f:
    pickle.dump(ram, ram_f)
To read:
import pickle
with open('./my_ram_file.pkl', 'rb') as ram_f:
    ram = pickle.load(ram_f)

Python: Multiprocessing error on pandas data frame: Clients have non-trivial state that is local and unpickleable

I have a dataframe that I am dividing into multiple dataframes using groupby. Now I want to process each of these dataframes in parallel, for which I have written a function process_s2id. I have the whole code in a class, which I execute using a main function in another file. But I am getting the following error:
"Clients have non-trivial state that is local and unpickleable.",
_pickle.PicklingError: Pickling client objects is explicitly not supported.
Clients have non-trivial state that is local and unpickleable.
Following is the code (we execute the main() function in this class):
import logging
import pandas as pd
from functools import partial
from multiprocessing import Pool, cpu_count

class TestClass:

    def __init__(self):
        logging.basicConfig(level=logging.INFO)
        self.logger = logging.getLogger()

    def process_s2id(self, df, col, new_col):
        dim2 = ['s2id', 'date', 'hours']
        df_hour = df.groupby(dim2)[[col, 'orders']].sum().reset_index()
        df_hour[new_col] = df_hour[col] / df_hour['orders']
        df_hour = df_hour[dim2 + [new_col]]
        return df_hour

    def run_parallel(self, df):
        series = [frame for keys, frame in df.groupby('s2id')]
        p = Pool(cpu_count())
        prod_x = partial(
            self.process_s2id,
            col="total_supply",
            new_col="supply"
        )
        s2id_supply_list = p.map(prod_x, series)
        p.close()
        p.join()
        s2id_supply = pd.concat(s2id_supply_list, axis=0)
        return s2id_supply

    def main(self):
        data = pd.read_csv("data/interim/fs.csv")
        out = self.run_parallel(data)
        return out
I tried running this code in Spyder and it works fine, but when I execute it from another file I get an error. Following are the execution file code and the error:
from TestClass import TestClass  # assuming the class above lives in TestClass.py

def main():
    tc = TestClass()
    data = tc.main()

if __name__ == '__main__':
    main()
When I looked into the error traceback, I found that the error occurs on the line s2id_supply_list = p.map(prod_x, series), where the work starts to go parallel. I also tried running this in series and it worked. I also noticed that this particular error comes from client.py in the Google Cloud package. There is some code in which I upload data to Google Cloud, but that should be unrelated to this code. I searched hard for this error, but all the results are linked to Google Cloud package issues rather than to the multiprocessing package.
Can anyone help me understand this error and how I can fix it?
Other information:
I have the following versions of packages:
python==3.7.7
pandas==1.0.5
google-cloud-storage==1.20.0
google-cloud-core==1.0.3
I am running this on a MacBook Pro.
I figured it out. When we use Pool to run a function in parallel, it expects the first argument to be the iterable; in other words, the function will run in parallel over different values of its first argument. When we have a non-static method in a class, the first argument is self, the class instance itself. But the Pool function doesn't know how to iterate over self, because that is the wrong argument: the right one is the second.
We can solve this by either:
Taking the function out of the class and kicking the self out of the arguments.
Adding @staticmethod on top of the function and kicking the self out of the arguments, as sketched below.
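For instance, a minimal sketch of the second option, reusing names from the question (treat it as an illustration of the pattern rather than a drop-in fix):
import pandas as pd
from functools import partial
from multiprocessing import Pool, cpu_count

class TestClass:

    @staticmethod
    def process_s2id(df, col, new_col):
        # no self here: the first positional argument is now the dataframe
        # that Pool.map iterates over
        dim2 = ['s2id', 'date', 'hours']
        df_hour = df.groupby(dim2)[[col, 'orders']].sum().reset_index()
        df_hour[new_col] = df_hour[col] / df_hour['orders']
        return df_hour[dim2 + [new_col]]

    def run_parallel(self, df):
        series = [frame for _, frame in df.groupby('s2id')]
        prod_x = partial(TestClass.process_s2id,
                         col="total_supply", new_col="supply")
        with Pool(cpu_count()) as p:
            s2id_supply_list = p.map(prod_x, series)
        return pd.concat(s2id_supply_list, axis=0)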
I hope this helps someone who is struggling with a similar problem.

How to call CAPL general functions from Python 3.x?

Issue
I'm trying to call CAPL general functions (in my case timeNowNS) but I don't know if it's possible.
What I'm using?
I'm using Python 3.7 and Vector CANoe 11.0.
The connection is done using the .NET CANoe API.
This is how I've accessed the DLLs.
import sys
import clr

sys.path.append(r"C:\Program Files\Vector CANoe 11.0\Exec64")  # path to CANoe DLL files
clr.AddReference('Vector.CANoe.Interop')  # add reference to .NET DLL file
import CANoe  # import namespace from DLL file
What I've tried?
I opened the CANoe simulation successfully, started the measurement, and I have access to signals, environment variables and system variables.
Then I created the CAPL object and tried using the GetFunction method to obtain the CAPLFunction object so I could call it.
def begin_can(self, sCfgFile, fPrjInitFunc = None):
    self.open_can()
    self.load_can_configuration(sCfgFile)
    self.start_can_measurement(fPrjInitFunc)

def open_can(self):
    self.mCANoeApp = CANoe.Application()
    self.mCANoeMeasurement = CANoe.Measurement(self.mCANoeApp.Measurement)
    self.mCANoeEnv = CANoe.Environment(self.mCANoeApp.Environment)
    self.mCANoeBus = CANoe.Bus(self.mCANoeApp.get_Bus("CAN"))
    self.mCANoeSys = CANoe.System(self.mCANoeApp.System)
    self.mCANoeNamespaces = CANoe.Namespaces(self.mCANoeSys.Namespaces)
    self.mCANoeCAPL = CANoe.CAPL(self.mCANoeApp.CAPL)
    self.mCANoeCAPL.Compile()

def getFunction(self):
    function1 = self.mCANoeCAPL.GetFunction('timeNowNS')
    # here I tried also CANoe.CAPLFunction(self.mCANoeCAPL.GetFunction('timeNowNS'))
    # but i got attribute error: doesn't exist or something like that
    result = function1.Call()
Expected results
I should get the current simulation time using this function.
Actual results
Using the above code I get:
**COMException**: Catastrophic failure (Exception from HRESULT: 0x8000FFFF (E_UNEXPECTED))
at CANoe.ICAPL5.GetFunction(String Name)
I've tried different variations of the code but didn't get anywhere.
Could it be a hardware problem?
Should I change some settings in the CANoe simulation?
If you need more information, please ask me! Thanks in advance
Update: I've added a photo of my measure setup after adding the CAPL block
You have to write a CAPL function in which you call timeNowNS. This CAPL function can then be called from Python in the way you have implemented.
GetFunction only works with (user-written) CAPL functions. You cannot call CAPL intrinsics (i.e. built-in CAPL functions) directly.
Put this into a CAPL file:
int MyFunc()
{
  return timeNowNS();
}
and call like this from Python:
def getFunction(self):
    function1 = self.mCANoeCAPL.GetFunction('MyFunc')
    result = function1.Call()
After a long session of trial and error, and with the help of @m-spiller, I found the solution.
function2 = None

def open_can(self):
    self.mCANoeApp = CANoe.Application()
    self.mCANoeMeasurement = self.mCANoeApp.Measurement  # change here: no cast necessary
    self.mCANoeEnv = CANoe.Environment(self.mCANoeApp.Environment)
    self.mCANoeBus = CANoe.Bus(self.mCANoeApp.get_Bus("CAN"))
    self.mCANoeSys = CANoe.System(self.mCANoeApp.System)
    self.mCANoeNamespaces = CANoe.Namespaces(self.mCANoeSys.Namespaces)
    self.mCANoeCAPL = CANoe.CAPL(self.mCANoeApp.CAPL)
    self.mCANoeMeasurement.OnInit += CANoe._IMeasurementEvents_OnInitEventHandler(self.OnInit)
    # change here also: explained below

def OnInit(self):
    global function2
    function2 = CANoe.CAPLFunction(self.mCANoeCAPL.GetFunction('MyTime'))  # cast here is necessary

def callFunction(self):
    result = function2.Call()
What was the problem with the initial code?
The problem was that I tried to assign a function to a variable after the measurement had started.
As stated here in chapter 2.7, the assignment of a CAPL Function to a variable can only be done in the OnInit event handler of the Measurement object.
I've added this line, studying the documentation:
self.mCANoeMeasurement.OnInit += CANoe._IMeasurementEvents_OnInitEventHandler(self.OnInit)
After adding it, the OnInit handler was executed at measurement init, the CAPL function was assigned to a variable, and afterwards I could use that variable to call the function.
Thank you again, @m-spiller!

Python 2: function alias or macro

I am wondering if there is a way to create macros or aliases for functions in Python 2.7.
Example: I am trying to use the logging module and create aliases/macros for functions logging.debug, logging.info, logging.error, etc. If I use those functions as they are in the place where I want the log, everything works fine. But if I try to create an 'alias' function wrapper like this:
def debugLog(message):
    logging.debug(message)
... then the line number reporting no longer works as intended: the line reported always states the location of the wrapper, not of the actual log call, which isn't of any real use.
I did find this solution:
import logging
from logging import info as infoLog
from logging import debug as debugLog
from logging import error as errorLog
....
... but it is not suitable for me since I also create my own logging severity:
logging.addLevelName(60, "NORMAL")
... and I'd like to create an alias/macro like normalLog(message) = logging.log(60, message) for it as well, if that's possible. I couldn't find anything comprehensive in the Python docs or online.
You can use functools.partial:
import functools
import logging
normalLog = functools.partial(logging.log, 60)
It works pretty well:
normalLog("Hey!!")
Level 60:root:Hey!!
partial binds arguments to a function and returns a partial object (a callable that holds the necessary information), so you can also use it for the addLevelName call:
activateLevel = functools.partial(logging.addLevelName, 60, "NORMAL")
activateLevel()
Here you have a live working example, notice that the log line is properly reported.
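For completeness, a small self-contained sketch along the same lines (the format string is only there to make the reported line number visible; adjust as needed, and note this works the same under Python 2.7 and 3):
import functools
import logging

# show the caller's file name and line number in every record, so we can
# check that the reported location is the call site and not a wrapper
logging.basicConfig(level=logging.DEBUG,
                    format="%(levelname)s %(filename)s:%(lineno)d %(message)s")

logging.addLevelName(60, "NORMAL")
normalLog = functools.partial(logging.log, 60)

normalLog("Hey!!")  # reported with this line's number, not a wrapper's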
You can use a frame object to get the line number. You can get a frame object in a number of ways; in the example below I use sys._getframe(), where the argument 1 selects the previous stack frame. Note that sys._getframe() is not guaranteed to be present on non-CPython implementations. Several other functions return frame objects, including those in the inspect module.
import sys

def debugLog(message):
    line = sys._getframe(1).f_lineno
    print line, ':', message


x = 42
print x
debugLog("A")
y = x + 1
print y
debugLog("B")
Gives:
42
10 : A
43
13 : B
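For reference, an equivalent sketch using the inspect module instead of sys._getframe() (same idea, Python 2 print syntax to match the answer; inspect.currentframe() may also return None on some implementations):
import inspect

def debugLog(message):
    # f_back is the caller's frame; its f_lineno is the line of the call
    caller = inspect.currentframe().f_back
    print caller.f_lineno, ':', message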

Running rpy2 in parallel using multiprocessing raises weird exception that cannot be caught

So this is a problem that I have not been able to solve, nor do I know of a good way to make an MCVE out of it. Essentially, it has been briefly discussed here, but as the comments show, there was some disagreement, and the final verdict is still out. Hence I am posting a similar question again, hoping to get a better answer.
Background
I have sensor data from a couple of thousand sensors, that I get every minute. My interest lies in forecasting the data. For this I am using the ARIMA family of forecasting models. Long story short, after discussion with the rest of my research group, we decided to use the Arima function available in the R package forecast, instead of the statsmodels implementation of the same.
Problem Definition
Since I have data from a few thousand sensors, and I would like to analyse at least a whole week's worth of data (to begin with), and since a week has 7 days, I have seven times as many sensor-days as sensors: some 14k sensor-day combinations. Finding the best ARIMA order (which minimizes BIC) and forecasting the next day's worth of data takes about 1 minute for each sensor-day combination, which means upwards of 11 days just to process one week of data on a single core!
This is obviously a waste, when I have 15 more cores just idling away the whole time. So, obviously, this is a problem for parallel processing. Note that each sensor-day combination does not influence any other sensor-day combination. Also, the rest of my code is fairly well profiled, and optimized.
Issue
The issue is that I get this weird error that I cannot catch anywhere. Here is the error reproduced:
Exception in thread Thread-3:
Traceback (most recent call last):
File "/home/kartik/miniconda3/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/home/kartik/miniconda3/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "/home/kartik/miniconda3/lib/python3.5/multiprocessing/pool.py", line 429, in _handle_results
task = get()
File "/home/kartik/miniconda3/lib/python3.5/multiprocessing/connection.py", line 251, in recv
return ForkingPickler.loads(buf.getbuffer())
File "/home/kartik/miniconda3/lib/python3.5/site-packages/rpy2/robjects/robject.py", line 55, in _reduce_robjectmixin
rinterface_level=rinterface_factory(rdumps, rtypeof)
ValueError: Mismatch between the serialized object and the expected R type (expected 6 but got 24)
Here are a few characteristics of this error that I have discovered:
It is raised in the rpy2 package
It has something to do with Thread 3. Since Python is zero indexed, I am guessing this is the fourth thread. Therefore, 4x6 = 24, which adds up to the numbers shown in the final error statement
rpy2 is being used in only one place in my code where it might have to recode returned values into Python types. Protecting that line in try: ... except: ... does not catch that exception
The exception is not raised when I ditch the multiprocessing and call the function within a loop
The exception does not crash the program, just suspends it forever (till I Ctrl+C it into terminating)
All that I tried till now, have had no effect in resolving the error
Things Tried
I have tried everything from extreme procedural coding, with functions to deal with the least cases (that is only one function to be called in parallel), to extreme encapsulation, where the executable block in the if __name__ == '__main__': calls a function which reads in the data, does the necessary grouping, then passes the groups to another function, which imports multiprocessing and calls another function in parallel, which imports the processing module that imports rpy2, and passes the data to the Arima function in R.
Basically, it doesn't matter whether rpy2 is called and initialized deep inside function nests, such that it has no idea another instance might be initialized, or whether it is called and initialized once, globally: the error is raised whenever multiprocessing is involved.
Pseudo Code
Here is an attempt to present at least some basic pseudo code such that the error might be reproduced.
import numpy as np
import pandas as pd

def arima_select(y, order):
    from rpy2 import robjects as ro
    from rpy2.robjects.packages import importr
    from rpy2.robjects import pandas2ri
    pandas2ri.activate()
    forecast = importr('forecast')
    res = forecast.Arima(y, order=ro.FloatVector(order))
    return res

def arima_wrapper(data):
    data = data[['tstamp', 'val']]
    data.set_index('tstamp', inplace=True)
    return arima_select(data, (1,1,1))

def applyParallel(groups, func):
    from multiprocessing import Pool, cpu_count
    with Pool(cpu_count()) as p:
        ret_list = p.map(func, [group for _, group in groups])
    return pd.concat(ret_list, keys=[name for name, _ in groups])

def wrapper():
    df = pd.read_csv('file.csv', parse_dates=[1], infer_datetime_format=True)
    df['day'] = df['tstamp'].dt.day
    res = applyParallel(df.groupby(['sensor', 'day']), arima_wrapper)
    print(res)
Obviously, the above code can be encapsulated further, but I think it should reproduce the error quite accurately.
Data Sample
Here is the output of print(data.head(6)) when placed immediately below data.set_index('tstamp', inplace=True) in arima_wrapper from the pseudo code above:
Or alternatively, data for a sensor, for a whole week can be generated simply with:
def data_gen(start_day):
    r = pd.Series(pd.date_range('2016-09-{}'.format(str(start_day)),
                                periods=24*60, freq='T'),
                  name='tstamp')
    d = pd.Series(np.random.randint(10, 80, 1440), name='val')
    s = pd.Series(['sensor1']*1440, name='sensor')
    return pd.concat([s, r, d], axis=1)

df = pd.concat([data_gen(day) for day in range(1,8)], ignore_index=True)
Observations and Questions
The first observation is that this error is only raised when multiprocessing is involved, not when the function (arima_wrapper) is called in a loop. Therefore, it must be associated somehow with multiprocessing issues. R is not very multiprocess friendly, but when written in the way shown in the pseudo code, each instance of R should not know about the existence of the other instances.
The way the pseudo code is structured, there must be an initialization of rpy2 for each call inside the multiple subprocesses spawned by multiprocessing. If that were true, each instance of rpy2 should have spawned its own instance of R, which should just execute one function, and terminate. That would not raise any errors, because it would be similar to the single threaded operation. Is my understanding here accurate, or am I completely or partially missing the point?
Were all instances of rpy2 to somehow share an instance of R, then I might reasonably expect the error. What is true: is R shared among all instances of rpy2, or is there an instance of R for each instance of rpy2?
How might this issue be overcome?
Since SO hates question threads with multiple questions in them, I will prioritize my questions such that partial answers will be accepted. Here is my priority list:
How might this issue be overcome? A working code example that does not raise the issue will be accepted as answer even if it does not answer any other question, provided no other answer does better, or was posted earlier.
Is my understanding of Python imports accurate, or am I missing the point about multiple instances of R? If I am wrong, how should I edit the import statements such that a new instance is created within each subprocess? Answers to this question are likely to point me towards a probable solution, and will be accepted, provided no answer does better, or was posted earlier
Is R shared among all instances of rpy2 or is there an instance of R for each instance of rpy2? Answers to this question will be accepted only if they lead to a resolution of the problem.
(...) Long story short (...)
Really?
How might this issue be overcome? A working code example that does not raise the issue will be accepted as answer even if it does not answer any other question, provided no other answer does better, or was posted earlier.
Answers may leave quite a bit of work on your end...
Is my understanding of Python imports accurate, or am I missing the point about multiple instances of R? If I am wrong, how should I edit the import statements such that a new instance is created within each subprocess? Answers to this question are likely to point me towards a probable solution, and will be accepted, provided no answer does better, or was posted earlier.
Python packages/modules are "uniquely" imported across your process, which means that all code using the package/module within the process uses the same single import (you don't have a copy per import in a given block).
Because of this, I'd recommend using an initialization function when creating your Pool, rather than repeatedly importing rpy2 and setting up the conversion each time a task is sent to a worker. You may also gain in performance if each task is short.
def arima_select(y, order):
    # FIXME: check whether the rpy2.robjects package
    # should be (re)imported as ro to be visible
    res = forecast.Arima(y, order=ro.FloatVector(order))
    return res

forecast = None

def worker_init():
    from rpy2 import robjects as ro
    from rpy2.robjects.packages import importr
    from rpy2.robjects import pandas2ri
    pandas2ri.activate()
    global forecast
    forecast = importr('forecast')

def applyParallel(groups, func):
    from multiprocessing import Pool, cpu_count
    with Pool(cpu_count(), worker_init) as p:
        ret_list = p.map(func, [group for _, group in groups])
    return pd.concat(ret_list, keys=[name for name, _ in groups])
Is R shared among all instances of rpy2 or is there an instance of R for each instance of rpy2? Answers to this question will be accepted only if they lead to a resolution of the problem.
rpy2 makes R available by linking its C shared library. There is one such library per Python process, and it is a stateful library (R is not able to handle concurrency). I think that your issue has more to do with object serialization (see http://rpy2.readthedocs.io/en/version_2.8.x/robjects_serialization.html#object-serialization) than with concurrency.
What is happening is some apparent confusion when reconstructing the R objects after Python pickled the rpy2 object. More specifically, looking at the R object types mentioned in the error message:
>>> from rpy2.rinterface import str_typeint
>>> str_typeint(6)
'LANGSXP'
>>> str_typeint(24)
'RAWSXP'
I am guessing that the R object returned by forecast.Arima contains an unevaluated R expression (for example the call that led to that result object), and when serializing and deserializing it comes back as something different (a raw vector of bytes). This is possibly a bug in R's own serialization mechanism (since rpy2 is using it under the hood). For now, to solve your issue, you may want to extract from the forecast.Arima result only what you care most about, and return just that from the function call run by the worker.
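For illustration, a minimal sketch of that idea, assuming the usual 'coef', 'aic' and 'fitted' components of an R Arima fit are what you care about (adapt the extraction to your needs):
import numpy as np

def arima_fit_summary(y, order):
    # runs inside the worker process; only plain, picklable Python objects
    # are returned, so nothing rpy2-specific crosses the process boundary
    from rpy2 import robjects as ro
    from rpy2.robjects.packages import importr
    forecast = importr('forecast')
    res = forecast.Arima(ro.FloatVector(y), order=ro.FloatVector(order))
    return {
        'coef': np.asarray(res.rx2('coef')),
        'aic': float(res.rx2('aic')[0]),
        'fitted': np.asarray(res.rx2('fitted')),
    }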
The following changes to the arima_select function in the pseudo code presented in the question work:
import numpy as np
import pandas as pd
from rpy2 import rinterface as ri

ri.initr()

def arima_select(y, order):

    def rimport(packname):
        as_environment = ri.baseenv['as.environment']
        require = ri.baseenv['require']
        require(ri.StrSexpVector([packname]),
                quiet=ri.BoolSexpVector((True, )))
        packname = ri.StrSexpVector(['package:' + str(packname)])
        pack_env = as_environment(packname)
        return pack_env

    frcst = rimport("forecast")
    args = (('y', ri.FloatSexpVector(y)),
            ('order', ri.FloatSexpVector(order)),
            ('include.constant', ri.StrSexpVector(const)))
    return frcst['Arima'].rcall(args, ri.globalenv)
Keeping the rest of the pseudo code the same. Note that I have since optimized the code further, and it does not require all the functions presented in the question. Basically, the following is necessary and sufficient:
import numpy as np
import pandas as pd
from rpy2 import rinterface as ri

ri.initr()

def arima(y, order=(1,1,1)):
    # This is the same as arima_select above, just renamed to arima
    ...

def applyParallel(groups, func):
    from multiprocessing import Pool, cpu_count
    with Pool(cpu_count(), worker_init) as p:
        ret_list = p.map(func, [group for _, group in groups])
    return pd.concat(ret_list, keys=[name for name, _ in groups])

def main():
    # Create your df in your favorite way:
    def data_gen(start_day):
        r = pd.Series(pd.date_range('2016-09-{}'.format(str(start_day)),
                                    periods=24*60, freq='T'),
                      name='tstamp')
        d = pd.Series(np.random.randint(10, 80, 1440), name='val')
        s = pd.Series(['sensor1']*1440, name='sensor')
        return pd.concat([s, r, d], axis=1)

    df = pd.concat([data_gen(day) for day in range(1,8)], ignore_index=True)
    applyParallel(df.groupby(['sensor', pd.Grouper(key='tstamp', freq='D')]),
                  arima)  # Note one may use partial from functools to pass order to arima
Note that I also do not call arima directly from applyParallel since my goal is to find the best model for the given series (for a sensor and day). I use a function arima_wrapper to iterate through the order combinations, and call arima at each iteration.
