Proper handling of Spark broadcast variables in a Python class

I've been implementing a model with Spark via a Python class. I had some headaches calling class methods on an RDD defined in the class (see this question for details), but have finally made some progress. Here is an example of a class method I'm working with:
@staticmethod
def alpha_sampler(model):
    # all the variables in this block are numpy arrays or floats
    var_alpha = model.params.var_alpha
    var_rating = model.params.var_rating
    b = model.params.b
    beta = model.params.beta
    S = model.params.S
    Z = model.params.Z
    x_user_g0_inner_over_var = model.x_user_g0_inner_over_var

    def _alpha_sampler(row):
        feature_arr = row[2]
        var_alpha_given_rest = 1/((1/var_alpha) + feature_arr.shape[0]*(1/var_rating))
        i = row[0]
        items = row[1]
        O = row[3] - np.inner(feature_arr, b) - beta[items] - np.inner(S[i], Z[items])
        E_alpha_given_rest = var_alpha_given_rest * (x_user_g0_inner_over_var[i] + O.sum()/var_rating)
        return np.random.normal(E_alpha_given_rest, np.sqrt(var_alpha_given_rest))

    return _alpha_sampler
As you can see, to avoid serialization errors, I define a static method that returns a function that is in turn applied to each row of an RDD (model is the parent class here, and this is called from within another method of model):
# self.grp_user is the RDD
self.params.alpha = np.array(self.grp_user.map(model.alpha_sampler(self)).collect())
Now, this all works fine, but is not leveraging Spark's broadcast variables at all. Ideally, all the variables I'm passing in this function (var_alpha, beta, S, etc.) could first be broadcast to the workers, so that I wasn't redundantly passing them as part of the map. But I'm not sure how to do this.
My question, then, is the following: How/where should I make these into broadcast variables such that they are available to the alpha_sampler function that I map to grp_user? One thing I believe will work would be to make them globals, e.g.
global var_alpha
var_alpha = sc.broadcast(model.params.var_alpha)
# and similarly for the other variables...
Then the alpha_sampler could be much simplified:
@staticmethod
def _alpha_sampler(row):
    feature_arr = row[2]
    var_alpha_given_rest = 1/((1/var_alpha.value) + feature_arr.shape[0]*(1/var_rating.value))
    i = row[0]
    items = row[1]
    O = row[3] - np.inner(feature_arr, b.value) - beta.value[items] - np.inner(S.value[i], Z.value[items])
    E_alpha_given_rest = var_alpha_given_rest * (x_user_g0_inner_over_var.value[i] + O.sum()/var_rating.value)
    return np.random.normal(E_alpha_given_rest, np.sqrt(var_alpha_given_rest))
But of course this is a really dangerous use of globals that I would like to avoid. Is there a better way that lets me leverage broadcast variables?

Assuming that the variables you use here are simply scalars, there is probably nothing to gain from a performance perspective, and using broadcast variables will make your code less readable. Still, you can either pass a broadcast variable as an argument to the static method:
class model(object):
    @staticmethod
    def foobar(a_model, mu):
        y = a_model.y

        def _foobar(x):
            return x - mu.value + y

        return _foobar

    def __init__(self, sc):
        self.sc = sc
        self.y = -1
        self.rdd = self.sc.parallelize([1, 2, 3])

    def get_mean(self):
        return self.rdd.mean()

    def run_foobar(self):
        mu = self.sc.broadcast(self.get_mean())
        self.data = self.rdd.map(model.foobar(self, mu))
or initialize it there:
class model(object):
    @staticmethod
    def foobar(a_model):
        mu = a_model.sc.broadcast(a_model.get_mean())
        y = a_model.y

        def _foobar(x):
            return x - mu.value + y

        return _foobar

    def __init__(self, sc):
        self.sc = sc
        self.y = -1
        self.rdd = self.sc.parallelize([1, 2, 3])

    def get_mean(self):
        return self.rdd.mean()

    def run_foobar(self):
        self.data = self.rdd.map(model.foobar(self))
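For completeness, here is a minimal driver-side sketch of how the second variant might be exercised. The SparkContext construction and app name below are assumptions for illustration, not part of the original code:
from pyspark import SparkContext

sc = SparkContext("local[2]", "broadcast-example")  # hypothetical local context
m = model(sc)
m.run_foobar()
# mean([1, 2, 3]) is 2.0 and y is -1, so this should print [-2.0, -1.0, 0.0]
print(m.data.collect())
sc.stop()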

What's the best way to design a class that calls sequences of its methods?

I have a class similar to the one below that goes through a series of methods using variables in the class. The code used to be a massive series of functions passing variables around, and this felt more structured and easier to work with and test. However, it still feels as though there's a better way.
Is there a better design pattern or approach for situations like this? Or is going the object route a mistake?
In terms of testing process() and other methods, I can just mock the methods called and use assert_called_once. But ultimately it leads to ugly testing code with tons of mocks, which makes me wonder about the first question again.
class Analyzer:
    def __init__(self):
        self.a = None
        self.b = None

    def process(self):
        self.gather_data()
        self.build_analysis()
        self.calc_a()
        self.calc_b()
        self.build_output()
        self.export_data()
        ...

    def gather_data(self):
        self.get_a()
        self.get_b()
        self.get_c()
        ...

    def build_analysis(self):
        self.do_d()
        self.do_e()
        self.do_f()
        ...
As for testing: I know this code isn't technically right, but I just wanted to illustrate how it gets hard to read and sloppy.
class TestAnalyzer:
    # patch decorators apply bottom-up, so the mock arguments below are
    # listed in the same order as the methods they replace
    @patch.object(Analyzer, 'export_data')
    @patch.object(Analyzer, 'build_output')
    @patch.object(Analyzer, 'calc_b')
    @patch.object(Analyzer, 'calc_a')
    @patch.object(Analyzer, 'build_analysis')
    @patch.object(Analyzer, 'gather_data')
    def test_process(self, m_gather_data, m_build_analysis, m_calc_a,
                     m_calc_b, m_build_output, m_export_data):
        Analyzer().process()
        m_gather_data.assert_called_once()
        m_build_analysis.assert_called_once()
        m_calc_a.assert_called_once()
        ...
Any insight or thoughts would be appreciated. Thank you!
Maybe the information expert design principle can help you
Assign responsibility to the class that has the information needed to fulfill it
In your example it seems like you can split your class into different ones with better defined responsibilities. Assuming that you have to perform the following functions in order:
Gather data
Preprocess it
Analyse it
I would create a class for each of them. Here you have some example code that generates some data and performs some basic calculations:
from random import randint
from dataclasses import dataclass

@dataclass
class DataGatherer:
    path: str

    def get_data(self):
        """Generate fake x, y data"""
        return randint(0, 1), randint(0, 1)

@dataclass
class Preprocessing:
    x: int
    y: int

    def prep_x(self):
        return self.x + 1

    def prep_y(self):
        return self.y + 2

@dataclass
class Calculator:
    x: int
    y: int

    def calc(self):
        return self.x + self.y
All I have done is divide your code into blocks and assign methods and attributes to fulfill the functions of those blocks. Finally, you can create an Analysis class whose only responsibility is to put the whole process together:
@dataclass
class Analysis:
    data_path: str

    def process(self, save_path: str):
        x, y = DataGatherer(self.data_path).get_data()
        print("Data gathered x =", x, "y =", y)

        prep = Preprocessing(x, y)
        x, y = prep.prep_x(), prep.prep_y()
        print("After preprocessing x =", x, "y =", y)

        calc = Calculator(x, y)
        result = calc.calc()
        print("Result =", result)

        self.save(result, save_path)

    def save(self, result, save_path: str):
        print(f"saving result to {save_path}")
In the end this is all you have to do:
>>> Analysis("data_path_here").process("save_path_here")
Data gathered x = 0 y = 1
After preprocessing x = 1 y = 3
Result = 4
saving result to save_path_here
When it comes to testing, I use pytest. You can create a test file for each of your classes (e.g. test_datagatherer.py, test_preprocessing.py, ...) and have a unit test function for each of your methods, for example:
from your_module import Preprocessing

def test_prep_x():
    prep = Preprocessing(1, 2)
    assert type(prep.prep_x()) is int

def test_prep_y():
    prep = Preprocessing(1, 2)
    assert type(prep.prep_y()) is int
I will leave you to the pytest documentation for more details.
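If you still want a test for the orchestrating Analysis.process itself, one option is to patch only the boundary you care about rather than every step. A sketch, assuming the classes above live in your_module and using unittest.mock:
from unittest.mock import patch

from your_module import Analysis

def test_process_saves_result():
    with patch.object(Analysis, "save") as m_save:
        Analysis("data_path_here").process("save_path_here")
        # the collaborators run for real; only the output boundary is mocked
        m_save.assert_called_once()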

Function as class attribute

I am writing a class where I would like to pass a function as a class attribute and use it later, like this:
class Nevronska_mreza:
    def __init__(self, st_vhodni, st_skriti, st_izhod, prenosna_funkcija=pf.sigmoid):
        self.mreza = []
        self.st_vhodni = st_vhodni
        self.st_skriti = st_skriti
        self.st_izhodni = st_izhod
        self.prenosna_funckija = prenosna_funkcija
        self.mreza.append([{'utezi': [random() for i in range(st_vhodni + 1)]} for j in range(st_skriti)])
        self.mreza.append([{'utezi': [random() for i in range(st_skriti + 1)]} for j in range(st_izhod)])

    def razsirjanje_naprej(self, vhod):
        for sloj in self.mreza:
            nov_vhod = []
            for nevron in sloj:
                nevron['izhod'] = self.prenosna_funkcija(self.aktivacijska_funkcija(nevron['utezi'], vhod))
                nov_vhod.append(nevron['izhod'])
            vhod = nov_vhod
        return vhod
but it seems like this isn't the right way; I get the following error:
AttributeError: 'Nevronska_mreza' object has no attribute 'prenosna_funkcija'
Is it possible to do something like that?
Yes, you can pass a function around as an argument; however, you have made a couple of mistakes.
Firstly, you have used the word function; although it is not a reserved word, it should be avoided as the name of an entity such as a variable.
Secondly, you have used an optional parameter before mandatory parameters, which will cause an error such as:
File "test.py", line 5
def __init__(self, function=fun1, data1, data2):
^
SyntaxError: non-default argument follows default argument
Thirdly, when calling the method you have not specified the scope; the function name lives in the self scope of the object.
Taking all of these into account, the following is working code:
def fun1(x):
    return x + 1

class A:
    def __init__(self, data1, data2, fn=fun1):
        self.fn = fn
        self.data1 = data1
        self.data2 = data2

    def some_method(self):
        y = self.fn(self.data1)
        print(y)

b = A(1, 2, fun1)
b.some_method()
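Running this prints 2, since fun1 is applied to self.data1 == 1.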
After you posted your full code, I can see that you currently have self.prenosna_funckija instead of self.prenosna_funkcija in the following line:
self.prenosna_funckija = prenosna_funkcija
This explains the attribute error: when you call self.prenosna_funkcija, that attribute genuinely does not exist.
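In other words, the assignment in __init__ should read:
self.prenosna_funkcija = prenosna_funkcija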
You're close:
def fun1(x):
    return x + 1

class A:
    def __init__(self, function=fun1, data1=None, data2=None):
        self.function = function
        self.data1 = data1
        self.data2 = data2

    def some_method(self):
        y = self.function(self.data1)
        return y

a = A(data1=41)
result = a.some_method()
print(result)
prints
42

Python function replacing part of variable

I am writing code for a project in particle physics (using PyROOT).
In my first draft, I use the following line:
for i in MyTree:
    pion.SetXYZM(K_plus_PX, K_plus_PY, K_plus_PZ, K_plus_MM)
This basically assigns to the pion the values of the variables in the parentheses, i.e. the momenta and invariant mass of the kaon.
Physics aside, I would like to write a function "of the form":
def myfunc(particle):
    return %s_PX % particle
I know this is wrong. What I would like to achieve is a function that, for a given particle, sets particle_PX, particle_PY, etc. as the arguments of SetXYZM.
Thank you for your help,
B
To access class attributes from string variables you can use Python's getattr:
import ROOT

inputfile = ROOT.TFile.Open("somefile.root", "read")
inputtree = inputfile.Get("NameOfTTree")
inputtree.Print()
# observe that there are branches
# K_plus_PX
# K_plus_PY
# K_plus_PZ
# K_plus_MM
# K_minus_PX
# K_minus_PY
# K_minus_PZ
# K_minus_MM
# pi_minus_PX
# pi_minus_PY
# pi_minus_PZ
# pi_minus_MM

def getx(ttree, particlename):
    return getattr(ttree, particlename + "_PX")

def gety(ttree, particlename):
    return getattr(ttree, particlename + "_PY")

def getz(ttree, particlename):
    return getattr(ttree, particlename + "_PZ")

def getm(ttree, particlename):
    return getattr(ttree, particlename + "_MM")

def getallfour(ttree, particlename):
    x = getattr(ttree, particlename + "_PX")
    y = getattr(ttree, particlename + "_PY")
    z = getattr(ttree, particlename + "_PZ")
    m = getattr(ttree, particlename + "_MM")
    return x, y, z, m

for entry in xrange(inputtree.GetEntries()):
    inputtree.GetEntry(entry)

    pion1 = ROOT.TLorentzVector()
    x = getx(inputtree, "K_plus")
    y = gety(inputtree, "K_plus")
    z = getz(inputtree, "K_plus")
    m = getm(inputtree, "K_plus")
    pion1.SetXYZM(x, y, z, m)

    x, y, z, m = getallfour(inputtree, "pi_minus")
    pion2 = ROOT.TLorentzVector()
    pion2.SetXYZM(x, y, z, m)
As linked by Josh Caswell, you can similarly access variable names:
def getx(particlename):
    return globals()[particlename + "_PX"]
though that can get nasty quickly, depending on whether your variables are global or local and, for local ones, in which context they live.
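A self-contained illustration of that globals() lookup (plain Python, no ROOT required; the K_plus_* values here are made-up stand-ins for module-level variables):
K_plus_PX = 1.0
K_plus_PY = 2.0

def get_px(particlename):
    # look up a module-level variable by its constructed name
    return globals()[particlename + "_PX"]

print(get_px("K_plus"))  # 1.0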

Sharing a piece of code with methods inside a class in Python

I started making a draft for one of the classes that is supposed to be used in my program, and I first wrote this piece of code:
import math
import numpy as np

R = 6.371e6

phi_src, theta_src = 10, 40
phi_det, theta_det = -21, 10
depth_src, depth_det = 0, 0  # both on the surface
l = 0

class Trajectory:
    def __init__(self,
                 phi_src,
                 theta_src,
                 phi_det,
                 theta_det,
                 depth_src,
                 depth_det,
                 l):
        self.phi_src = phi_src
        self.theta_src = theta_src
        self.phi_det = phi_det
        self.theta_det = theta_det
        self.depth_src = depth_src
        self.depth_det = depth_det
        self.l = l

    @property
    def r(self):
        r_src = R - self.depth_src
        r_det = R - self.depth_det
        x_src = r_src * math.cos(self.phi_src) * math.cos(self.theta_src)
        y_src = r_src * math.cos(self.phi_src) * math.sin(self.theta_src)
        z_src = r_src * math.sin(self.phi_src)
        x_det = r_det * math.cos(self.phi_det) * math.cos(self.theta_det)
        y_det = r_det * math.cos(self.phi_det) * math.sin(self.theta_det)
        z_det = r_det * math.sin(self.phi_det)
        coord_src = np.array((x_src, y_src, z_src))
        coord_det = np.array((x_det, y_det, z_det))
        L = np.linalg.norm(coord_src - coord_det)
        return math.sqrt(r_src**2 + self.l * (1.0 - L - (r_src - r_det) * (r_src + r_det)/L))

    def phi(self, r):
        pass

trajectory = Trajectory(phi_src, theta_src, phi_det, theta_det, depth_src, depth_det, l)
print(trajectory.r)
But then I realized that the
r_src = R - self.depth_src
r_det = R - self.depth_det
x_src = r_src * math.cos(self.phi_src) * math.cos(self.theta_src)
y_src = r_src * math.cos(self.phi_src) * math.sin(self.theta_src)
z_src = r_src * math.sin(self.phi_src)
x_det = r_det * math.cos(self.phi_det) * math.cos(self.theta_det)
y_det = r_det * math.cos(self.phi_det) * math.sin(self.theta_det)
z_det = r_det * math.sin(self.phi_det)
coord_src = np.array((x_src, y_src, z_src))
coord_det = np.array((x_det, y_det, z_det))
L = np.linalg.norm(coord_src - coord_det)
part is common to all the methods of the class, and hence there's no point in calculating it numerous times; this piece should be shared by all the methods.
What would be the best way to do that? Do I have to put it into the __init__ method? I've heard it's not good practice to do any calculations inside the __init__ method.
The common way of declaring a function in a class that does not depend on the state of the object itself is to use the @staticmethod decorator, followed by the function definition. You only pass the function parameters.
If you need to use class-level parameters, use @classmethod instead, and note that you pass cls instead of self to the function (one could use any name, so really it doesn't matter; the point is that you are now accessing class attributes and methods instead of those of the object).
class Trajectory:
    c = 10  # <<< Class level property.

    def __init__(self):
        self.c = 5  # <<< Object level property.

    @staticmethod
    def foo(a, b):
        return a * b

    @classmethod
    def bar(cls, a, b):
        return cls.foo(a, b) * cls.c  # <<< References class level method and property.

    def baz(self, a, b):
        return self.foo(a, b) * self.c  # <<< References object level method and property.

t = Trajectory()

>>> t.foo(3, 5)
15
>>> t.bar(3, 5)
150
>>> t.baz(3, 5)
75
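Note that bar multiplies foo(3, 5) == 15 by the class-level c (10), giving 150, while baz multiplies it by the instance-level c (5), giving 75.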
Hmmm, not totally sure if I get what you want, but quoting you a bit...
def r(self):
    r_src = R - self.depth_src
    r_det = R - self.depth_det
    ....
    L = np.linalg.norm(coord_src - coord_det)
This is common, you say, because methods like def r(self) always use some of these variables, like r_src and L:
def r(self):
    return math.sqrt(r_src**2 + self.l * (1.0 - L - (r_src - r_det) * (r_src + r_det)/L))
This, IMHO, tells me that if you want to reuse those computations, then they should be part of __init__ (or called from __init__). But mostly, you need to set all those variables on self.
# ...wherever you compute them, in a common location...
self.r_src = R - self.depth_src
self.r_det = R - self.depth_det
....
self.L = np.linalg.norm(coord_src - coord_det)
Note that as you depend on instance variables such as self.depth_src in the above, this method can't be a class method; it needs to be an instance method.
Now, change your other methods to point to those precomputed attributes.
def r(self):
    return math.sqrt(self.r_src**2 + self.l * (1.0 - self.L ....
Now, you could get fancy and only compute those attributes on demand, via properties. But if you are asking a fairly basic Python question, which I think you are, then worry about optimization later and do the easiest for now. I.e. compute them all in the __init__ or from a method called from there.
Now, there are perfectly good reasons to break them out of __init__, but that mostly has to do with code clarity and modularity. If that chunk of code has some specific math/business-domain meaning, then create a method that is named appropriately and call it from __init__.
On the other hand, some IDEs and code analyzers are better at figuring out instance variables when they see them assigned in __init__, and with Python being as dynamic as it is, the poor things need all the help they can get.
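For reference, here is a sketch of the on-demand variant alluded to above, using functools.cached_property (Python 3.8+) so the shared geometry is computed once per instance, the first time any method needs it. The names mirror the question's code; the exact factoring is an assumption:
import math
from functools import cached_property

import numpy as np

R = 6.371e6

class Trajectory:
    def __init__(self, phi_src, theta_src, phi_det, theta_det,
                 depth_src, depth_det, l):
        self.phi_src = phi_src
        self.theta_src = theta_src
        self.phi_det = phi_det
        self.theta_det = theta_det
        self.depth_src = depth_src
        self.depth_det = depth_det
        self.l = l

    @cached_property
    def _geometry(self):
        # computed once, then cached on the instance
        r_src = R - self.depth_src
        r_det = R - self.depth_det
        coord_src = r_src * np.array([
            math.cos(self.phi_src) * math.cos(self.theta_src),
            math.cos(self.phi_src) * math.sin(self.theta_src),
            math.sin(self.phi_src),
        ])
        coord_det = r_det * np.array([
            math.cos(self.phi_det) * math.cos(self.theta_det),
            math.cos(self.phi_det) * math.sin(self.theta_det),
            math.sin(self.phi_det),
        ])
        L = np.linalg.norm(coord_src - coord_det)
        return r_src, r_det, L

    @property
    def r(self):
        r_src, r_det, L = self._geometry
        return math.sqrt(r_src**2 + self.l * (1.0 - L - (r_src - r_det) * (r_src + r_det)/L))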

Using class to define multiple variables in python

I'm still new to Python and this is probably going to be one of those (stupid) boring questions. However, any help will be much appreciated. I'm programming something that involves many variables, and I've decided to use a class to encapsulate all of them (hopefully making it easier for me to "read" as time passes), but it's not working as I thought it would. So, without further ado, here is a part of the class that captures the gist.
import numpy as np

class variable:
    def __init__(self, length):
        self.length = length  # time length

    def state_dynamic(self):
        length = self.length
        return np.zeros((2, np.size(length)))

    def state_static(self):
        length = self.length
        return np.zeros((2, np.size(length)))

    def control_dynamic(self):
        length = self.length
        return np.zeros((2, np.size(length)))

    def control_static(self):
        length = self.length
        return np.zeros((2, np.size(length)))

    def scheduling(self):
        length = self.length
        return np.zeros(np.size(length))

    def disturbance(self):
        length = self.length
        dummy = np.random.normal(0., 0.1, np.size(length))
        for i in range(20):
            dummy[i+40] = np.random.normal(0., 0.01) + 1.
        dummy[80:100] = 0.
        return dummy
I've also tried this one:
import numpy as np

class variable:
    def __init__(self, type_1, type_2, length):
        self.type_1 = type_1  # belongs to set {state, control, scheduling, disturbance}
        self.type_2 = type_2  # belongs to set {static, dynamic, none}
        self.length = length  # time length

    def type_v(self):
        type_1 = self.type_1
        type_2 = self.type_2
        length = self.length
        if type_1 == 'state' and type_2 == 'dynamic':
            return np.zeros((2, np.size(length)))
        elif type_1 == 'state' and type_2 == 'static':
            return np.zeros((2, np.size(length)))
        elif type_1 == 'control' and type_2 == 'dynamic':
            return np.zeros((2, np.size(length)))
        elif type_1 == 'control' and type_2 == 'static':
            return np.zeros((2, np.size(length)))
        elif type_1 == 'scheduling' and type_2 == 'none':
            return np.zeros(np.size(length))
        elif type_1 == 'disturbance' and type_2 == 'none':
            dummy = np.random.normal(0., 0.1, np.size(length))
            for i in range(20):
                dummy[i+40] = np.random.normal(0., 0.01) + 1.
            dummy[80:100] = 0.
            return dummy
Now, using the first one (the outcome is the same for the second class as well), when I write the following, say:
In [2]: time = np.linspace(0,10,100)
In [5]: v = variable(time)
In [6]: v1 = v.state_dynamic
In [7]: v1.size
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/home/<ipython-input-7-e6a5d17aeb75> in <module>()
----> 1 v1.size
AttributeError: 'function' object has no attribute 'size'
In [8]: v2 = variable(np.size(time)).state_dynamic
In [9]: v2
Out[9]: <bound method variable.state_dynamic of <__main__.variable instance at 0x3ad0a28>>
In [10]: v1[0,0]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/home/<ipython-input-10-092bc2b9f982> in <module>()
----> 1 v1[0,0]
TypeError: 'instancemethod' object has no attribute '__getitem__'
I was hoping that by writing
variable(length).state_dynamic
I'll access
np.zeros((2, np.size(length)))
Anyway, if I made something utterly stupid please let me know :) and feel free to give any kind of advice. Thank you in advance for your time and kind attention. Best regards.
EDIT #1:
#wheaties:
Thank you for a quick reply and help :)
What I'm currently trying to do is the following. I have to plot several "variables", e.g., state, control, dropouts, scheduling and disturbances. All the variables depend on three parameters, namely dynamic, static and horizon. Further, state and control are np.zeros((2, np.size(length))), dropouts and scheduling are np.zeros(np.size(length)), and disturbance has a specific form (see above). Initially, I declared them in the script, and the list is very long and looks ugly. I use these variables to store responses of the dynamical systems considered and to plot them. I don't know if this is a good way of doing this, and if you have any suggestions please share.
Thanks again for your help.
Do you mean you want named access to a bunch of state information? The ordinary Python idiom for class variables would look like this:
class Variable(object):
    def __init__(self, state_dynamic, state_static, control_static, control_dynamic, scheduling):
        self.state_dynamic = state_dynamic
        self.state_static = state_static
        self.control_static = control_static
        self.control_dynamic = control_dynamic
        self.scheduling = scheduling
This essentially creates a bucket with named fields that hold the values you put in via the constructor. You can also create lightweight data classes using the namedtuple factory function, which avoids some of the boilerplate.
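For example, a namedtuple version of the same bucket might look like this (the placeholder values are only for illustration):
from collections import namedtuple

Variable = namedtuple(
    "Variable",
    ["state_dynamic", "state_static", "control_static", "control_dynamic", "scheduling"],
)

v = Variable(state_dynamic=1, state_static=2, control_static=3,
             control_dynamic=4, scheduling=5)
print(v.state_dynamic)  # 1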
The other Python idiom that might apply is to use the @property decorator, as in @wheaties' answer. This basically disguises a function call to make it look like a field. If what you're doing can be reduced to a functional basis, this would make sense. Here is an example of the idea (not based on your problem set, since I'm not sure I grok what you're doing in detail with all those identical variables); in this case I'm making a convenience wrapper for pulling out individual flags that are stored in a Python integer but really make up a bit field:
class Bits(object):
    def __init__(self, integer):
        self.Integer = integer  # pretend this is an integer between 0 and 15 representing 4 flags

    @property
    def locked(self):
        # low bit = locked
        return self.Integer & 1 == 1

    @property
    def available(self):
        return self.Integer & 2 == 2

    @property
    def running_out_of_made_up_names(self):
        return self.Integer & 4 == 4

    @property
    def really_desperate_now(self):
        return self.Integer & 8 == 8

example = Bits(7)
print(example.locked)
# True
print(example.really_desperate_now)
# False
A method in Python is a function. If you want to get a value from a member function, you have to call it with (). That said, some refactoring may help eliminate boilerplate and reduce the problem size in your head. I'd suggest using a @property for some of these things, combined with a slight refactor:
class variable:
    def __init__(self, length):
        self.length = length  # time length

    @property
    def state_dynamic(self):
        return self.np_length

    @property
    def state_static(self):
        return self.np_length

    @property
    def control_dynamic(self):
        return self.np_length

    @property
    def control_static(self):
        return self.np_length

    @property
    def scheduling(self):
        return self.np_length

    @property
    def np_length(self):
        return np.zeros((2, np.size(self.length)))
That way you can use those functions as you would a member variable like you tried before:
var = variable(length).state_dynamic
What I can't tell from all this is what the difference is between all these variables; I don't see a single one. Are you assuming that you have to access them in order? If so, that's bad design and a problem. Never make that assumption.
