How to use IPython.parallel for functions with multiple inputs?

This is my first attempt at using IPython.parallel so please bear with me.
I read this question
Parfor for Python
and am having trouble implementing a simple example as follows:
import gmpy2 as gm
import numpy as np
from IPython.parallel import Client
rc = Client()
lview = rc.load_balanced_view()
lview.block = True
a = 1
def L2(ii,jj):
    out = []
    out.append(gm.fac(ii+jj+a))
    return out
Nloop = 100
ii = range(Nloop)
jj = range(Nloop)
R2 = lview.map(L2, zip(ii, jj))
The problems I have are:
a is defined outside the loop and I think I need to do something like "push" but am a bit confused by that. Do I need to "pull" after?
there are two arguments that are required for the function and I don't know how to pass them correctly. I tried things like zip(ii,jj) but got some errors.
Also, I assume the fact that I'm using an arbitrary third-party library (gmpy2) shouldn't affect things. Is this correct? Do I need to do anything special for it?
Ideally I would like your help so on this simple example the code runs error free.
If you think it would be beneficial to post my failed attempts at #2 let me know. I'm in the dark with #1.

I found two ways that make this work:
One is pushing the variable to the engines. There is no need to pull it afterwards; the variable will simply be defined in the namespace of each engine process.
rc[:].push({'a': a})
R2 = lview.map(L2, ii, jj)
The other way is to redefine L2 to take a as an input and pass an array of a's to the map function:
def L2(ii,jj,a):
    out = []
    out.append(gm.fac(ii+jj+a))
    return out
R2 = lview.map(L2, ii, jj, [a]*Nloop)
With regard to the import, as per this page:
http://ipython.org/ipython-doc/dev/parallel/parallel_multiengine.html#non-blocking-execution
you simply import the required libraries inside the function:
Note the import inside the function. This is a common model, to ensure
that the appropriate modules are imported where the task is run. You
can also manually import modules into the engine(s) namespace(s) via
view.execute('import numpy').
Or you can do as per this link
http://ipython.org/ipython-doc/dev/parallel/parallel_multiengine.html#remote-imports
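Putting both fixes together, here is a minimal end-to-end sketch (assuming a running ipcluster; it uses the push variant, with the import moved inside the function as the docs suggest):
from IPython.parallel import Client
rc = Client()
lview = rc.load_balanced_view()
lview.block = True
a = 1
rc[:].push({'a': a})            # define `a` in every engine's namespace
def L2(ii, jj):
    import gmpy2 as gm          # imported where the task actually runs
    return gm.fac(ii + jj + a)
Nloop = 100
R2 = lview.map(L2, range(Nloop), range(Nloop))   # map over both argument sequences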

Related

Properly organising code in Python: using the same data/variables in a multitude of modules

I am struggling with properly organising Python code. My struggle stems from namespaces in Python. More specifically, as I understand, every module has its own specific namespace and if not stated specifically, elements from one namespace will not be available in another namespace.
To give an example of what I mean, consider this function in subfile1.py:
def foo():
    print(a)
Now if I want the foo() function to print a from the namespace of another module/script, I will have to set it explicitly for that subfile1 module; e.g. like in main.py:
import subfile1
a = 4
subfile1.a = 4
subfile1.foo()
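(The value could of course also be passed explicitly, which avoids touching the module's namespace; a minimal sketch:
# subfile1.py
def foo(a):
    print(a)
# main.py
import subfile1
subfile1.foo(4)
The question below is more about how to organise this at the level of a whole project.)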
When I code (in R), it's generally to provide data analytics reports, so I have to import data, clean data and perform analyses on this data. My coding projects (to be re-run every couple of months) are organised as follows: I have a main script from which I run other sub-scripts. I find this very handy. I have separate sub-scripts for:
Setting paths
Loading data
Data cleaning
Specific analyses
Simply looking at the names of the sub-scripts listed in the main script gives me then a good overview of what is being done and where I need to add additional sub-scripts if needed.
With Python I don’t find this an easy way of working because of the namespaces. If I would like to keep this way of working in Python, it seems to me that I would have to import all sub-scripts as modules and set the necessary variables for each module explicitly (either in the module or in the main script as in the example above).
So I thought that maybe my way of working is simply not optimal, at least in a Python setting. Could anyone tell me how best to organise the project I have in mind? Or, put differently, what is a good way of organising code if the same data/variables have to be used in various places in the code?
Update:
After reading the comments, I made another example. Now with classes. Any comments on whether this is an appropriate way of structuring code or not would be very welcome. My main aim is to split the code in bits and pieces so a third party can quickly gauge what is going on and make changes to the code where necessary.
I have a master_file.py in which I run the code:
from sub_file_1 import DoSomething
from sub_file_2 import DoSomethingElse
from sub_file_3 import CombineTheStuff
x1 = DoSomething.c #could e.g. be import from a data source and transform the data
x2 = DoSomethingElse.d #could e.g. be import from another data source and transform the data, taking its particularities into account
y = CombineTheStuff(x1, x2) # e.g. with the data imported and cleaned (x1 and x2) I can now estimate a particular model
print(x1)
print(x2)
print(y.fin_result())
sub_file_1.py contains (in reality these sub_files are very lengthy)
from param_file import GeneralPar
class DoSomething():
    c = GeneralPar.a + GeneralPar.b
sub_file_2.py contains
from param_file import GeneralPar
class DoSomethingElse():
    d = GeneralPar.a * 4
sub_file_3.py contains
class CombineTheStuff():
    def __init__(self, input1, input2):
        self.input1 = input1
        self.input2 = input2
    def fin_result(self):
        k = self.input1 * self.input2 - 0.01
        return(k)
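For completeness, param_file.py (imported by sub_file_1.py and sub_file_2.py) only needs to define GeneralPar with attributes a and b; a minimal placeholder consistent with the snippets above (the values are made up):
# param_file.py -- placeholder values; only the attribute names are implied by the snippets above
class GeneralPar():
    a = 1.0
    b = 2.0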

How to always add all imports to new Python files?

I am working on a Python project in PyCharm.
Every time I create a new file, I have to import all of the basic libraries again. I keep getting kicked out of my workflow for a few seconds every time I notice that I forgot to import a library.
As far as I can tell, there is no reason not to just import every library I use anywhere in every single file, or is there? I would much rather import a library I don't need than risk losing my concentration because I forgot to import something and have to waste a few seconds on it. Since the auto-sorting of libraries doesn't seem to work properly, I even have to add the import statement manually.
Is there a way to give PyCharm a large list of libraries and just import those in absolutely every file by default?
For comparison, this is possible to do in Jupyter Lab: You can have one notebook that contains all import statements, and just call "%run import_everything.ipynb" in all the other notebooks. This has saved me a lot of time, and is also much more readable than having different imports in every notebook.
(I care specifically about how to do this in PyCharm, but if there is a more generic way to do it, that information would also be appreciated)
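A generic (non-PyCharm) approach along the lines of the Jupyter pattern above would be a shared module that collects the common imports and is star-imported everywhere; a rough sketch (the module name common_imports is made up):
# common_imports.py -- hypothetical shared module
import os
import re
import numpy as np
import pandas as pd
# in any other file:
from common_imports import *   # deliberately accepting the usual caveats about star imports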
I wrote a short script to go over all my files and combine all basic import statements. It also wraps them in comments that say "BASIC IMPORTS", so in the future I can just use PyCharm's "replace string in whole project" feature to mass-replace all imports by looking for these comments with a regex.
from pathlib import Path
import re
p = Path(__file__).absolute().parent.parent
excluded_files = ['__init__.py', 'setup.py']
affected_files = [a for a in p.glob('**/*.py') if a.name not in excluded_files]
acc = {}
for a in affected_files:
    b = a.read_text()
    b = b[:b.index("\n\n")]  # keep only the leading import block (up to the first blank line)
    finds = re.findall("(import |from )(.*)", b)
    for c in finds:
        acc[c] = True
        assert "intnet" not in c[1], a
lst = list(acc.keys())
lst.sort(key=lambda a: a[1])
res = '\n'.join(f"{a[0]}{a[1]}" for a in lst)
res = "# BASIC IMPORTS\n" + res + "\n# / BASIC IMPORTS"
for a in affected_files:
    b = a.read_text()
    b = res + b[b.index("\n\n"):]  # replace everything before the first blank line with the combined block
    a.write_text(b)

how to generate the values for the Q function

I am trying to apply the Q-function values for a problem. I don't know the function available for it in Python.
What is the python equivalent for the following code in octave?
>> f=0:0.01:1;
>> qfunc(f)
The Q-function can be expressed in terms of the error function. SciPy provides the error function, scipy.special.erf(), which can be used to calculate the Q-function.
import numpy as np
from scipy import special
f = np.linspace(0,1,101)
0.5 - 0.5*special.erf(f/np.sqrt(2)) # Q(f) = 0.5 - 0.5 erf(f/sqrt(2))
Take a look at this https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.norm.html
Looks like the norm.sf method (survival function) might be what you're looking for.
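For example, a quick sanity check that the survival function and the erf formula agree (Q(x) = 1 - Φ(x) = norm.sf(x)):
import numpy as np
from scipy import special
from scipy.stats import norm
f = np.linspace(0, 1, 101)
q_erf = 0.5 - 0.5*special.erf(f/np.sqrt(2))   # Q(f) via the error function
q_sf = norm.sf(f)                             # Q(f) via the standard normal survival function
print(np.allclose(q_erf, q_sf))               # True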
I've used this Q function for my code and it worked perfectly well:
from math import sqrt
from scipy import special as sp
def qfunc(x):
    return 0.5 - 0.5*sp.erf(x/sqrt(2))
I haven't used this one, but I think it should work:
def invQfunc(x):
    return sqrt(2)*sp.erfinv(1 - 2*x)
references:
https://mail.python.org/pipermail/scipy-dev/2016-February/021252.html
Python equivalent of MATLAB's qfuncinv()
Thanks @Anton for letting me know how to write a good answer

Getting a selection in 3ds Max into a list in Python

I am writing in Python, sometimes calling certain aspects of maxscript, and I have gotten most of the basics to work. However, I still don't understand FPValues. Even after looking through the examples and the Max help site, I don't understand how to get anything meaningful out of them. For example:
import MaxPlus as MP
import pymxs
MPEval = MP.Core.EvalMAXScript
objectList = []
def addBtnCheck():
    select = MPEval('''GetCurrentSelection()''')
    objectList.append(select)
    print(objectList)
MPEval('''
try (destroyDialog unnamedRollout) catch()
rollout unnamedRollout "Centered" width:262 height:350
(
    button 'addBtn' "Add Selection to List" pos:[16,24] width:88 height:38 align:#left
    on 'addBtn' pressed do
    (
        python.Execute "addBtnCheck()"
    )
)
''')
MP.Core.EvalMAXScript('''createDialog unnamedRollout''')
(I hope I got the indentation right, pretty new at this)
In the above code I successfully spawned my rollout, and used a button press to call a python function and then I try to put the selection of a group of objects in a variable that I can control through python.
The objectList print gives me this:
[<MaxPlus.FPValue; proxy of <Swig Object of type 'Autodesk::Max::FPValue *' at 0x00000000846E5F00> >]
That was with a selection of two objects, while what I would like are the object names, their positions, etc.!
If anybody can point me in the right direction, or explain FPValues and how to use them like I am an actual five year old, I would be eternally grateful!
Where to start, to me the main issue seems to be the way you're approaching it:
why use MaxPlus at all, that's a low-level SDK wrapper, about as unpythonic (and incomplete) as it gets
why call maxscript from python for things that can be done in python (getCurrentSelection)
why use maxscript to create UI, you're in python, use pySide
if you can do it in maxscript, why would you do it in python in the first place? Aside from faster math ops, most of the scene operations will be orders of magnitude slower in python. And if you want, you can import and use python modules in maxscript, too.
import MaxPlus as MP
import pymxs
mySel = MP.SelectionManager.Nodes
objectList = []
for each in mySel:
    x = each.Name
    objectList.append(x)
print(objectList)
The easiest way I know is with the
my_selection = rt.selection
command (where rt comes from pymxs: from pymxs import runtime as rt)...
However, I've found it works a little better for me to throw it into a list() function as well so I can get it as a Python list instead of a MAXscript array. This isn't required but some things get weird when using the default return from rt.selection.
my_selection = list(rt.selection)
Once you have the objects in a list you can access their attributes by looking up what they're called in MAXScript.
for obj in my_selection:
    print(obj.name)
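If you also want positions and other properties, the same wrapped nodes expose the usual MAXScript node properties directly; a short sketch (assuming the standard .name and .pos node properties):
from pymxs import runtime as rt
for obj in list(rt.selection):
    # .name and .pos mirror the MAXScript node properties of the same name
    print(obj.name, obj.pos.x, obj.pos.y, obj.pos.z)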

Running rpy2 in parallel using multiprocessing raises weird exception that cannot be caught

So this is a problem that I have not been able to solve, and I don't know of a good way to make an MCVE out of it. Essentially, it has been briefly discussed before, but as the comments show, there was some disagreement, and the final verdict is still out. Hence I am posting a similar question again, hoping to get a better answer.
Background
I have sensor data from a couple of thousand sensors, that I get every minute. My interest lies in forecasting the data. For this I am using the ARIMA family of forecasting models. Long story short, after discussion with the rest of my research group, we decided to use the Arima function available in the R package forecast, instead of the statsmodels implementation of the same.
Problem Definition
Since I have data from a few thousand sensors, and I would like to analyse at least a whole week's worth of data to begin with, and since a week has 7 days, I have 7 times the number of sensors' worth of data: essentially some 14k sensor-day combinations. Finding the best ARIMA order (which minimizes BIC) and forecasting the next day of data takes about 1 minute for each sensor-day combination, which means upwards of 11 days to process just one week of data on a single core!
This is obviously a waste, when I have 15 more cores just idling away the whole time. So, obviously, this is a problem for parallel processing. Note that each sensor-day combination does not influence any other sensor-day combination. Also, the rest of my code is fairly well profiled, and optimized.
Issue
The issue is that I get this weird error that I cannot catch anywhere. Here is the error reproduced:
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/kartik/miniconda3/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/home/kartik/miniconda3/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/home/kartik/miniconda3/lib/python3.5/multiprocessing/pool.py", line 429, in _handle_results
    task = get()
  File "/home/kartik/miniconda3/lib/python3.5/multiprocessing/connection.py", line 251, in recv
    return ForkingPickler.loads(buf.getbuffer())
  File "/home/kartik/miniconda3/lib/python3.5/site-packages/rpy2/robjects/robject.py", line 55, in _reduce_robjectmixin
    rinterface_level=rinterface_factory(rdumps, rtypeof)
ValueError: Mismatch between the serialized object and the expected R type (expected 6 but got 24)
Here are a few characteristics of this error that I have discovered:
It is raised in the rpy2 package
It has something to do with Thread 3. Since Python is zero indexed, I am guessing this is the fourth thread. Therefore, 4x6 = 24, which adds up to the numbers shown in the final error statement
rpy2 is being used in only one place in my code where it might have to recode returned values into Python types. Protecting that line in try: ... except: ... does not catch that exception
The exception is not raised when I ditch the multiprocessing and call the function within a loop
The exception does not crash the program, just suspends it forever (till I Ctrl+C it into terminating)
All that I tried till now, have had no effect in resolving the error
Things Tried
I have tried everything from extreme procedural coding, with functions to deal with the least cases (that is only one function to be called in parallel), to extreme encapsulation, where the executable block in the if __name__ == '__main__': calls a function which reads in the data, does the necessary grouping, then passes the groups to another function, which imports multiprocessing and calls another function in parallel, which imports the processing module that imports rpy2, and passes the data to the Arima function in R.
Basically, it doesn't matter whether rpy2 is called and initialized deep inside function nests, such that it has no idea another instance might be initialized, or whether it is called and initialized once, globally: the error is raised whenever multiprocessing is involved.
Pseudo Code
Here is an attempt to present at least some basic pseudo code such that the error might be reproduced.
import numpy as np
import pandas as pd
def arima_select(y, order):
    from rpy2 import robjects as ro
    from rpy2.robjects.packages import importr
    from rpy2.robjects import pandas2ri
    pandas2ri.activate()
    forecast = importr('forecast')
    res = forecast.Arima(y, order=ro.FloatVector(order))
    return res
def arima_wrapper(data):
    data = data[['tstamp', 'val']]
    data.set_index('tstamp', inplace=True)
    return arima_select(data, (1,1,1))
def applyParallel(groups, func):
    from multiprocessing import Pool, cpu_count
    with Pool(cpu_count()) as p:
        ret_list = p.map(func, [group for _, group in groups])
    return pd.concat(ret_list, keys=[name for name, _ in groups])
def wrapper():
    df = pd.read_csv('file.csv', parse_dates=[1], infer_datetime_format=True)
    df['day'] = df['tstamp'].dt.day
    res = applyParallel(df.groupby(['sensor', 'day']), arima_wrapper)
    print(res)
Obviously, the above code can be encapsulated further, but I think it should reproduce the error quite accurately.
Data Sample
Here is the output of print(data.head(6)) when placed immediately below data.set_index('tstamp', inplace=True) in arima_wrapper from the pseudo code above:
Or alternatively, data for a sensor, for a whole week can be generated simply with:
def data_gen(start_day):
    r = pd.Series(pd.date_range('2016-09-{}'.format(str(start_day)),
                                periods=24*60, freq='T'),
                  name='tstamp')
    d = pd.Series(np.random.randint(10, 80, 1440), name='val')
    s = pd.Series(['sensor1']*1440, name='sensor')
    return pd.concat([s, r, d], axis=1)
df = pd.concat([data_gen(day) for day in range(1,8)], ignore_index=True)
Observations and Questions
The first observation is that this error is only raised when multiprocessing is involved, not when the function (arima_wrapper) is called in a loop. Therefore, it must be associated somehow with multiprocessing issues. R is not very multiprocess friendly, but when written in the way shown in the pseudo code, each instance of R should not know about the existence of the other instances.
The way the pseudo code is structured, there must be an initialization of rpy2 for each call inside the multiple subprocesses spawned by multiprocessing. If that were true, each instance of rpy2 should have spawned its own instance of R, which should just execute one function, and terminate. That would not raise any errors, because it would be similar to the single threaded operation. Is my understanding here accurate, or am I completely or partially missing the point?
Were all instances of rpy2 to somehow share an instance of R, then I might reasonably expect the error. What is true: is R shared among all instances of rpy2, or is there an instance of R for each instance of rpy2?
How might this issue be overcome?
Since SO hates question threads with multiple questions in them, I will prioritize my questions such that partial answers will be accepted. Here is my priority list:
How might this issue be overcome? A working code example that does not raise the issue will be accepted as answer even if it does not answer any other question, provided no other answer does better, or was posted earlier.
Is my understanding of Python imports accurate, or am I missing the point about multiple instances of R? If I am wrong, how should I edit the import statements such that a new instance is created within each subprocess? Answers to this question are likely to point me towards a probable solution, and will be accepted, provided no answer does better, or was posted earlier
Is R shared among all instances of rpy2 or is there an instance of R for each instance of rpy2? Answers to this question will be accepted only if they lead to a resolution of the problem.
(...) Long story short (...)
Really?
How might this issue be overcome? A working code example that does not
raise the issue will be accepted as answer even if it does not answer
any other question, provided no other answer does better, or was
posted earlier.
Answers may leave quite a bit of work on your end...
Is my understanding of Python imports accurate, or am
I missing the point about multiple instances of R? If I am wrong, how
should I edit the import statements such that a new instance is
created within each subprocess? Answers to this question are likely to
point me towards a probable solution, and will be accepted, provided
no answer does better, or was posted earlier
Python packages/modules are "uniquely" imported across your process, which means that all code using the package/module within the process is using the same single import (you don't have a copy per import in a given block).
Because of this, I'd recommend using an initialization function when creating your Pool rather than repeatedly importing rpy2 and setting up the conversion each time a task is sent to a worker. You may also gain performance if each task is short.
def arima_select(y, order):
    # FIXME: check whether the rpy2.robjects package
    # should be (re)imported as ro to be visible
    res = forecast.Arima(y, order=ro.FloatVector(order))
    return res
forecast = None
def worker_init():
    from rpy2 import robjects as ro
    from rpy2.robjects.packages import importr
    from rpy2.robjects import pandas2ri
    pandas2ri.activate()
    global forecast
    forecast = importr('forecast')
def applyParallel(groups, func):
    from multiprocessing import Pool, cpu_count
    with Pool(cpu_count(), worker_init) as p:
        ret_list = p.map(func, [group for _, group in groups])
    return pd.concat(ret_list, keys=[name for name, _ in groups])
Is R shared among all
instances of rpy2 or is there an instance of R for each instance of
rpy2? Answers to this question will be accepted only if they lead to a
resolution of the problem.
rpy2 is making R available by linking its C shared library. There is one such library per Python process, and it is a stateful library (R is not able to handle concurrency). I think that your issue has more to do with object serialization (see http://rpy2.readthedocs.io/en/version_2.8.x/robjects_serialization.html#object-serialization) than with concurrency.
What is happening is some apparent confusion when reconstructing the R objects after Python pickled the rpy2 object. More specifically, looking at the R object types mentioned in the error message:
>>> from rpy2.rinterface import str_typeint
>>> str_typeint(6)
'LANGSXP'
>>> str_typeint(24)
'RAWSXP'
I am guessing that the R object returned by forecast.Arima contains an unevaluated R expression (for example the call that led to that result object), and when serializing and unserializing it comes back as something different (a raw vector of bytes). This is possibly a bug with R's own serialization mechanism (since rpy2 is using it under the hood). For now, to solve your issue, you may want to extract from the forecast.Arima result what you care most about and only return that from the function call run by the worker.
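For instance, a sketch of that idea at the robjects level, built on the pseudo code above (it assumes the usual 'aic' and 'coef' components of an R Arima fit are what you care about):
import numpy as np
def arima_wrapper(data):
    data = data[['tstamp', 'val']]
    data.set_index('tstamp', inplace=True)
    fit = arima_select(data, (1, 1, 1))
    # pull plain Python/NumPy values out of the rpy2 object before returning,
    # so nothing rpy2-specific is pickled between the worker and the parent
    return {'aic': fit.rx2('aic')[0],
            'coef': np.asarray(fit.rx2('coef'))}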
The following changes to the arima_select function in the pseudo code presented in the question work:
import numpy as np
import pandas as pd
from rpy2 import rinterface as ri
ri.initr()
def arima_select(y, order, const=('TRUE',)):
    # note: `const` was undefined in the original snippet; it is exposed
    # here as a parameter so the function is self-contained
    def rimport(packname):
        as_environment = ri.baseenv['as.environment']
        require = ri.baseenv['require']
        require(ri.StrSexpVector([packname]),
                quiet=ri.BoolSexpVector((True, )))
        packname = ri.StrSexpVector(['package:' + str(packname)])
        pack_env = as_environment(packname)
        return pack_env
    frcst = rimport("forecast")
    args = (('y', ri.FloatSexpVector(y)),
            ('order', ri.FloatSexpVector(order)),
            ('include.constant', ri.StrSexpVector(const)))
    return frcst['Arima'].rcall(args, ri.globalenv)
Keeping the rest of the pseudo code the same. Note that I have since optimized the code further, and it does not require all the functions presented in the question. Basically, the following is necessary and sufficient:
import numpy as np
import pandas as pd
from rpy2 import rinterface as ri
ri.initr()
def arima(y, order=(1,1,1)):
    # This is the same as arima_select above, just renamed to arima
    ...
def applyParallel(groups, func):
    from multiprocessing import Pool, cpu_count
    # worker_init is the Pool initializer shown in the answer above
    with Pool(cpu_count(), worker_init) as p:
        ret_list = p.map(func, [group for _, group in groups])
    return pd.concat(ret_list, keys=[name for name, _ in groups])
def main():
    # Create your df in your favorite way:
    def data_gen(start_day):
        r = pd.Series(pd.date_range('2016-09-{}'.format(str(start_day)),
                                    periods=24*60, freq='T'),
                      name='tstamp')
        d = pd.Series(np.random.randint(10, 80, 1440), name='val')
        s = pd.Series(['sensor1']*1440, name='sensor')
        return pd.concat([s, r, d], axis=1)
    df = pd.concat([data_gen(day) for day in range(1,8)], ignore_index=True)
    applyParallel(df.groupby(['sensor', pd.Grouper(key='tstamp', freq='D')]),
                  arima)  # note: one may use partial from functools to pass order to arima
Note that I also do not call arima directly from applyParallel since my goal is to find the best model for the given series (for a sensor and day). I use a function arima_wrapper to iterate through the order combinations, and call arima at each iteration.
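As a rough illustration of that wrapper, here is a sketch built on the robjects-level arima_select from the pseudo code earlier in the question (it assumes forecast::Arima exposes its BIC in the 'bic' element of the fit, and returns only plain values so nothing R-specific has to be pickled):
from itertools import product
import pandas as pd
def arima_wrapper(data, max_p=3, max_q=3, d=1):
    # try every (p, d, q) combination and keep the order with the lowest BIC
    data = data[['tstamp', 'val']]
    data.set_index('tstamp', inplace=True)
    best_order, best_bic = None, float('inf')
    for p, q in product(range(max_p + 1), range(max_q + 1)):
        fit = arima_select(data, (p, d, q))
        bic = fit.rx2('bic')[0]  # 'bic' element of the R Arima fit
        if bic < best_bic:
            best_order, best_bic = (p, d, q), bic
    # return a one-row DataFrame so the results still concatenate in applyParallel
    return pd.DataFrame({'p': [best_order[0]], 'd': [best_order[1]],
                         'q': [best_order[2]], 'bic': [best_bic]})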
