I'm attempting to create broadcast variables from within Python methods (trying to abstract some utility methods I'm creating that rely on distributed operations). However, I can't seem to access the broadcast variables from within the Spark workers.
Let's say I have this setup:
def main():
    sc = SparkContext()
    SomeMethod(sc)

def SomeMethod(sc):
    someValue = rand()
    V = sc.broadcast(someValue)
    A = sc.parallelize().map(worker)

def worker(element):
    element *= V.value  ### NameError: global name 'V' is not defined ###
However, if I instead eliminate the SomeMethod() middleman, it works fine.
def main():
    sc = SparkContext()
    someValue = rand()
    V = sc.broadcast(someValue)
    A = sc.parallelize().map(worker)

def worker(element):
    element *= V.value  # works just fine
I'd rather not have to put all my Spark logic in the main method, if I can. Is there any way to broadcast variables from within local functions and have them be globally visible to the Spark workers?
Alternatively, what would be a good design pattern for this kind of situation? For example, I want to write a method specifically for Spark which is self-contained and performs a specific function I'd like to reuse.
I am not sure I completely understood the question, but if you need the V object inside the worker function then you definitely should pass it as a parameter; otherwise the method is not really self-contained:
def worker(V, element):
    return element * V.value  # return the result so map() produces useful values
Now, in order to use it in map functions you need to use a partial, so that map only sees a one-parameter function:
from functools import partial

def SomeMethod(sc):
    someValue = rand()
    V = sc.broadcast(someValue)
    A = sc.parallelize().map(partial(worker, V))  # bind V positionally; map supplies element
I'm wondering what the difference is between this code:
import multiprocessing

g = {}  # global data dictionary

class Foo:
    @staticmethod
    def bar():
        elems = [1, 2, 3, 4, 5]
        g["important_data"] = get_important_data()
        with multiprocessing.Pool(10) as p:
            for res in p.imap(f, elems):
                pass  # do whatever
where f is a function that will use g["important_data"] in a read-only manner; and this code:
import multiprocessing

class Foo:
    @staticmethod
    def bar():
        elems = [1, 2, 3, 4, 5]
        important_data = get_important_data()
        with multiprocessing.Pool(10) as p:
            for res in p.imap(f, ((e, important_data) for e in elems)):
                pass  # do whatever
where f does exactly the same computation on elems as above, but is handed the additional data as the parameter important_data rather than accessing it via the global variable g.
I'm currently working with code where the original authors wrote the comment # Globals for multiprocessing to prevent shared memory above g, but I don't know what this means. I know that different processes in Python get their own copies of global variables by default (so the first implementation would not work at all if f were meant to write to g; as mentioned, though, this is read-only). But the second implementation seems to copy important_data as well, so where's the difference?
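For what it's worth, the difference hinges on when the copy happens: with a fork-based Pool, a global that is set before the pool is created is inherited once by each worker process, whereas anything passed through imap is pickled and shipped with every task. Below is a minimal sketch of a third, more explicit option, a Pool initializer that hands each worker its own read-only copy exactly once; the names init_worker, _shared, and f here are illustrative, not from the original code:

import multiprocessing

_shared = {}  # per-process storage, filled once per worker

def init_worker(important_data):
    # runs once in each worker process at pool startup
    _shared["important_data"] = important_data

def f(elem):
    # read-only access to the per-process copy
    return elem * len(_shared["important_data"])

if __name__ == "__main__":
    important_data = "abc"  # stand-in for get_important_data()
    with multiprocessing.Pool(4, initializer=init_worker,
                              initargs=(important_data,)) as p:
        print(list(p.imap(f, [1, 2, 3, 4, 5])))  # [3, 6, 9, 12, 15]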
I'm programming an optimizer that has to run through several possible variations. The team wants to implement multithreading to get through those variants faster. This means I've had to put all my functions inside a thread class. My problem is with my call to the wrapper function:
class variant_thread(threading.Thread):
    def __init__(self, name, variant, frequencies, fit_vals):
        threading.Thread.__init__(self)
        self.name = name
        self.elementCount = variant
        self.frequencies = frequencies
        self.fit_vals = fit_vals

    def run(self):
        print("Running Variant:", self.elementCount)  # display thread currently running
        fitFunction = self.Wrapper_Function(self.elementCount)
        self.popt, pcov, self.infoRes = curve_fit_my(fitFunction, self.frequencies, self.fit_vals)

    def Optimize_Wrapper(self, frequencies, *params):  # wrapper which returns values in a form the optimizer can work with
        cut = int(len(frequencies)/2)  # <---- ERROR OCCURS HERE
        freq = frequencies[:cut]
        vals = (stuff happens here)
        return (stuff in proper form for optimizer)
I've cut out as much as I could to simplify the example, and I hope you can understand what's going on. Essentially, after the thread is created it calls the optimizer. The optimizer sends the list of frequencies and the parameters it wants to change to the Optimize_Wrapper function.
The problem is that Optimize_Wrapper takes the frequencies list and saves it to "self". This means that the "frequencies" variable becomes a single float value, as opposed to the list of floats it should be. Of course this throws an error when I try to take len(frequencies). Keep in mind I also need to use self later in the function, so I can't just create a static method.
I've never had the problem that a class method saved any values to "self". I know it has to be declared explicitly in Python, but anything I've ever passed to the class method always skips "self" and saves to my declared variables. What's going on here?
Don't pass instance variables to methods. They are already accessible through self. And be careful about which variable is which. The first parameter to Optimize_Wrapper is called "frequencies", but you call it as self.Wrapper_Function(self.elementCount), so you have a self.frequencies and a frequencies... and they are different things. Very confusing!
class variant_thread(threading.Thread):
    def __init__(self, name, variant, frequencies, fit_vals):
        threading.Thread.__init__(self)
        self.name = name
        self.elementCount = variant
        self.frequencies = frequencies
        self.fit_vals = fit_vals

    def run(self):
        print("Running Variant:", self.elementCount)  # display thread currently running
        fitFunction = self.Optimize_Wrapper()
        self.popt, pcov, self.infoRes = curve_fit_my(fitFunction, self.frequencies, self.fit_vals)

    def Optimize_Wrapper(self):  # wrapper which returns values in a form the optimizer can work with
        cut = int(len(self.frequencies)/2)  # no more error: self.frequencies is the list
        freq = self.frequencies[:cut]
        vals = (stuff happens here)
        return (stuff in proper form for optimizer)
You don't have to subclass Thread to run a thread. It's frequently easier to define a function and have a Thread call that function. In your case, you may be able to put the variant processing in a function and use a thread pool to run them. This would save all the tedious handling of the thread object itself.
def run_variant(name, variant, frequencies, fit_vals):
    cut = int(len(frequencies)/2)  # frequencies is a plain parameter now, so len() works
    freq = frequencies[:cut]
    vals = (stuff happens here)
    fitFunction = (stuff in proper form for optimizer)
    return curve_fit_my(fitFunction, frequencies, fit_vals)

if __name__ == "__main__":
    variants = (make the variants)
    name = "name"
    frequencies = (make the frequencies)
    fit_vals = (make the fit_vals)

    from multiprocessing.pool import ThreadPool
    with ThreadPool() as pool:
        for popt, pcov, infoRes in pool.starmap(run_variant,
                ((name, variant, frequencies, fit_vals) for variant in variants)):
            pass  # do the other work here
I ran some tests on this setup, which appeared unexpectedly as a quick fix for my problem:
I want to call a multiprocessing.Pool.map() from inside a main function (that sets up the parameters). However, it is simpler for me to give a locally defined function as one of the args. Since the latter can't be pickled, I tried the laziest solution of declaring it as global. Should I expect some weird results? Would you advise a different strategy?
Here is some example (dummy) code:
#!/usr/bin/env python3

import random
import multiprocessing as mp

def processfunc(arg_and_func):
    arg, func = arg_and_func
    return "%7.4f:%s" % (func(arg), arg)

def main(*args):
    # the content of var depends on main:
    var = random.random()

    # Now I need to pass a func that uses `var`
    global thisfunc
    def thisfunc(x):
        return x + var

    # Test regular use
    for x in range(-5, 0):
        print(processfunc((x, thisfunc)))

    # Test parallel runs.
    with mp.Pool(2) as pool:
        for r in pool.imap_unordered(processfunc, [(x, thisfunc) for x in range(20)]):
            print(r)

if __name__ == '__main__':
    main()
PS: I know I could define thisfunc at module level, and pass the var argument through processfunc, but since my actual processfunc in real life already takes a lot of arguments, it seemed more readable to pass a single object thisfunc instead of many parameters...
What you have now looks OK, but might be fragile for later changes.
I might use partial in order to simplify the explicit passing of var to a globally defined function.
import random
import multiprocessing as mp
from functools import partial

def processfunc(arg_and_func):
    arg, func = arg_and_func
    return "%7.4f:%s" % (func(arg), arg)

def thisfunc(var, x):
    return x + var

def main(*args):
    # the content of var depends on main:
    var = random.random()
    f = partial(thisfunc, var)

    # Test regular use
    for x in range(-5, 0):
        print(processfunc((x, f)))

    # Test parallel runs.
    with mp.Pool(2) as pool:
        for r in pool.imap_unordered(processfunc, [(x, f) for x in range(20)]):
            print(r)

if __name__ == '__main__':
    main()
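Note that functools.partial objects pickle cleanly as long as the function they wrap is defined at module level, which is why f can be shipped to the pool workers here even though a function defined inside main cannot.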
I am writing some code that I have threaded, and I am using various different functions at once. I have a variable called ref that is different for each thread.
ref is defined within a function inside the threaded function, so when I use global ref, all the threads use the same value for ref (which I don't want). However, when I don't use global ref, other functions can't use ref, as it is not defined.
E.g.:
def threadedfunction():
    def getref():
        ref = [get some value of ref]
    getref()

    def useref():
        print(ref)
    useref()

threadedfunction()
If defining ref as global doesn't fit your needs, then you don't have many other options...
Edit your function's parameters and returns. A possible solution:
def threadedfunction():
    def getref():
        ref = "Hello, World!"
        return ref  # Return the value of ref, so the rest of the world can know it

    def useref(ref):
        print(ref)  # Print the parameter ref, whatever it is.

    ref = getref()  # Put in variable ref whatever function getref() returns
    useref(ref)  # Call function useref() with ref's value as parameter

threadedfunction()
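If you do want a module-level ref that still differs per thread, the standard library's threading.local gives each thread its own copy; a minimal sketch, where the value assigned to ref is just illustrative:

import threading

tls = threading.local()  # one independent namespace per thread

def getref():
    tls.ref = threading.current_thread().name  # some per-thread value

def useref():
    print(tls.ref)  # each thread sees only the value it set itself

def threadedfunction():
    getref()
    useref()

for i in range(3):
    threading.Thread(target=threadedfunction, name="thread-%d" % i).start()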
I have a series of functions that I apply to each record in a dataset to generate a new field I store in a dictionary (the records—"documents"—are stored using MongoDB). I broke them all up as they are basically unrelated, and tie them back together by passing them as a list to a function that iterates through each operation for each record and adds on the results.
What irks me is how I'm going about it in what seems like a fairly inelegant manner; semi-duplicating names among other things.
def _midline_length(blob):
    '''Generate a midline sequence for *blob*'''
    return 42

midline_length = {
    'func': _midline_length,
    'key': 'calc_seq_midlen'}  #: Midline sequence key/function pair.
Lots of these...
do_calcs = [midline_length, ] # all the functions ...
Then called like:
for record in mongo_collection.find():
    for calc in do_calcs:
        record[calc['key']] = calc['func'](record)  # add new data to record
    # update record in DB
Splitting up the keys like this makes it easier to remove all the calculated fields from the database (pointless after everything is set, but handy while developing the code and methodology).
I had the thought to maybe use classes, but it seems more like an abuse:
class midline_length(object):
    key = 'calc_seq_midlen'

    @staticmethod
    def __call__(blob):
        return 42
I could then make a list of instances (do_calcs = [midline_length(), ...]) and run through that, calling each thing or pulling out its key member. Alternatively, it seems like I can arbitrarily add members to functions: def myfunc(): then myfunc.key = 'mykey'... that seems even worse. Better ideas?
You might want to use decorators for this purpose.
import collections

RecordFunc = collections.namedtuple('RecordFunc', 'key func')

def record(key):
    def wrapped(func):
        return RecordFunc(key, func)
    return wrapped

@record('midline_length')
def midline_length(blob):
    return 42
Now, midline_length is not actually a function, but it is a RecordFunc object.
>>> midline_length
RecordFunc(key='midline_length', func=<function midline_length at 0x24b92f8>)
It has a func attribute, which is the original function, and a key attribute.
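A hypothetical sketch of how these RecordFunc objects then slot into the question's update loop (mongo_collection is assumed from the question), with attribute access replacing the dictionary keys:

do_calcs = [midline_length]  # RecordFunc instances, collected as before

for record in mongo_collection.find():
    for calc in do_calcs:
        record[calc.key] = calc.func(record)  # add new data to record
    # update record in DB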
If they get added to the same dictionary, you can do it in the decorator:
RECORD_PARSERS = {}

def record(key):
    def wrapped(func):
        RECORD_PARSERS[key] = func
        return func
    return wrapped

@record('midline_length')
def midline_length(blob):
    return 42
This is a perfect job for a decorator. Something like:
_CALC_FUNCTIONS = {}

def calcfunc(orig_func):
    # format the db key from the function name.
    key = 'calc_%s' % orig_func.__name__
    # register the function in the global table
    _CALC_FUNCTIONS[key] = orig_func
    return orig_func

@calcfunc
def _midline_length(blob):
    return 42

print(_CALC_FUNCTIONS)
# prints {'calc__midline_length': <function _midline_length at 0x035F7BF0>}

# then your document update is as follows
for record in mongo_collection.find():
    for key, func in _CALC_FUNCTIONS.items():
        record[key] = func(record)
    # update in db
Note that you could also store the attributes on the function object itself, like Dietrich pointed out, but you'll probably still need to keep a global structure that holds the list of functions.
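A hedged sketch of that function-attribute variant, reusing the registry above; the only change is that the key also rides along on the function itself:

_CALC_FUNCTIONS = {}

def calcfunc(orig_func):
    key = 'calc_%s' % orig_func.__name__
    orig_func.key = key               # the key rides along on the function object
    _CALC_FUNCTIONS[key] = orig_func  # a global registry is still needed to find them
    return orig_func

@calcfunc
def _midline_length(blob):
    return 42

print(_midline_length.key)  # calc__midline_length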