Global variables not recognized in lambda functions in Pyspark - python

I am working in Pyspark with a lambda function like the following:
udf_func = UserDefinedFunction(lambda value: method1(value, dict_global), IntegerType())
result_col = udf_func(df[atr1])
The implementation of method1 is the following:
def method1(value, dict_global):
    result = len(dict_global)
    if value in dict_global:
        result = dict_global[value]
    return result
'dict_global' is a global dictionary that contains some values.
The problem is that when I execute the lambda function the result is always None. For some reason the 'method1' function does not see the variable 'dict_global' as an external variable. Why? What could I do?

I finally found a solution, which I describe below.
Lambda functions (as well as map and reduce functions) executed in Spark have their execution scheduled across the different executors, and they run in separate execution contexts. So the problem in my code was that global variables are not always captured by the functions executed in parallel on the different executors, so I looked for a way to solve it.
Fortunately, Spark provides a mechanism called "Broadcast" which allows variables to be shipped to the executors so the functions running there can use them without problems. There are two types of shared variables: broadcast variables (immutable, read-only) and accumulators (mutable, but only numeric values are accepted by default).
I rewrote my code to show you how I fixed the problem:
broadcastVar = sc.broadcast(dict_global)
udf_func = UserDefinedFunction(lambda value: method1(value, broadcastVar.value), IntegerType())
result_col = udf_func(df[atr1])
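For context, here is a minimal self-contained sketch of the same pattern using the udf helper (the sample data, the column name atr1, and the local session setup are illustrative assumptions, not from the original code):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.master("local[2]").getOrCreate()
sc = spark.sparkContext

dict_global = {"a": 1, "b": 2}             # hypothetical lookup table
broadcastVar = sc.broadcast(dict_global)   # shipped once to every executor

def method1(value, lookup):
    # fall back to the dictionary size when the key is missing
    return lookup.get(value, len(lookup))

# read the broadcast value inside the function that runs on the executors
udf_func = udf(lambda value: method1(value, broadcastVar.value), IntegerType())

df = spark.createDataFrame([("a",), ("z",)], ["atr1"])
df.withColumn("result", udf_func(df["atr1"])).show()

The key point is that the function reads broadcastVar.value (the dictionary), not the broadcast wrapper object itself.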
Hope it helps!

Related

Python: sharing a dictionary using the multiprocessing capability of scipy.optimize.differential_evolution

I am running an optimisation problem using the module scipy.optimize.differential_evolution. The code I wrote is quite complex and I will try to summarise the difficulties I have:
1. The objective function is calculated with an external numerical model (i.e. I am not optimising an analytical function). To do that I created one function that runs the model and another that post-processes the results.
2. I am constraining my problem with some constraints. The constraints do not restrict the actual parameters of the problem but some dependent variables that can only be obtained at the end of the simulation of my external numerical model. Each constraint is defined in a separate function.
The problem with 2. is that the external model might be run twice for the same set of parameters: once to calculate the objective function and a second time to calculate the dependent variables assessed by the constraints. To avoid that and speed up my code, every time the external model is called I save the results for the dependent variables in a global dictionary keyed by the parameter set (as a look-up table). This prevents the constraint functions from running the model again for the same set of parameters.
This works very well with a single-CPU optimisation. However, it is my understanding that differential_evolution also supports multiprocessing by setting an appropriate value for the "workers" option (see here https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.differential_evolution.html#r108fc14fa019-1). My problem is that I have no idea how to update a global/shared variable when I enable the multiprocessing capability.
The webpage above states:
"If workers is an int the population is subdivided into workers sections and evaluated in parallel (uses multiprocessing.Pool) [...]"
So I deduced that I have to find a way to modify a shared variable when multiprocessing.Pool is used. In this regard I found these solutions:
Shared variable in python's multiprocessing
multiprocessing.Pool with a global variable
Why multiprocessing.Pool cannot change global variable?
Python sharing a dictionary between parallel processes
I think the last one is appropriate for my case. However, I am not sure how I have to set up my code and the workers option of the differential_evolution function.
Any help will be appreciated.
My code is something like:
import numpy as np
import scipy.optimize

def run_external_model(q):
    global dict_obj, dict_dep_var
    # .... (external model execution omitted)
    obj, dep_var = post_process_model(q)
    dict_dep_var[str(q)] = dep_var
    dict_obj[str(q)] = obj

def objective(q):
    global dict_obj
    if str(q) not in dict_obj:
        run_external_model(q)
    return dict_obj[str(q)]

def constraint(q):
    global dict_dep_var
    if str(q) not in dict_dep_var:
        run_external_model(q)
    return dict_dep_var[str(q)]

dict_obj = {}
dict_dep_var = {}

nlcs = scipy.optimize.NonlinearConstraint(constraint, 0., np.inf)
q0 = np.array([q1, ...., qn])  # placeholder values as in the original post
b = np.array([(0, 100.)] * len(q0))
solution = scipy.optimize.differential_evolution(objective, bounds=b, constraints=(nlcs,), seed=1)
The code above works on a single core. I am looking for a way to share the dictionaries dict_obj and dict_dep_var across the worker processes.
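A minimal sketch of the shared-dictionary direction mentioned above, using a multiprocessing.Manager dict as in the linked "sharing a dictionary between parallel processes" question (the toy model, bounds, and worker count are assumptions; it also assumes a fork-based start method so the pool workers inherit the manager proxies):

import multiprocessing as mp

import numpy as np
from scipy.optimize import NonlinearConstraint, differential_evolution

# created at import time; assumes fork so children inherit these proxies
manager = mp.Manager()
dict_obj = manager.dict()       # shared across worker processes
dict_dep_var = manager.dict()

def run_external_model(q):
    # toy stand-in for the external numerical model
    obj = float(np.sum(np.asarray(q) ** 2))
    dep_var = float(np.sum(q))
    dict_obj[str(q)] = obj
    dict_dep_var[str(q)] = dep_var

def objective(q):
    if str(q) not in dict_obj:
        run_external_model(q)
    return dict_obj[str(q)]

def constraint(q):
    if str(q) not in dict_dep_var:
        run_external_model(q)
    return dict_dep_var[str(q)]

if __name__ == "__main__":
    nlcs = NonlinearConstraint(constraint, 0., np.inf)
    b = [(0., 100.)] * 3
    solution = differential_evolution(objective, bounds=b, constraints=(nlcs,),
                                      seed=1, workers=2, updating='deferred')
    print(solution.x, solution.fun)

The Manager dict is slower than a plain dict because every access goes through a proxy, so it is worth measuring whether the saved model runs outweigh that overhead.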

Accessing broadcast variables in user defined function (udf) in separate files

I have a broadcast variable set up in a separate .py file, and I then import it in the file that contains my UDFs. But when I try to use this variable in a UDF, I find that the broadcast variable is not initialized (NoneType) when used in the scope of a DataFrame transformation function. Here is the supporting code.
The broadcast wrapper lives in utils.py and is defined as below:
class Broadcaster(object):
    _map = {}
    _bv = None

    @staticmethod
    def set_item(k, v):
        Broadcaster._map[k] = v

    @staticmethod
    def broadcast(sc):
        Broadcaster._bv = sc.broadcast(Broadcaster._map)

    @staticmethod
    def get_item(k):
        val = Broadcaster._bv.value
        return val.get(k)
The reason for doing this is to provide an interface where multiple k,v combinations can be set before broadcasting. In other words, in my main.py I can call Broadcaster.set_item(k, v) multiple times and then eventually call Broadcaster.broadcast(sc), which works fine. But now I want to use this broadcast variable in a UDF which lives in a separate file (say udfs.py). Note that these UDFs are used in my DataFrame processing. Below is a sample UDF:
def my_udf(col):
    bv = Broadcaster._bv.value  # this throws an exception :-(
    # more code
In my udfs.py file I have no trouble accessing Broadcaster._bv.value directly. It is only when it is used within the UDF, and the UDF is called from within a DataFrame transformation, that I get a "'NoneType' object has no attribute 'value'" exception. Basically the worker nodes are unable to access the broadcast variable. Why can't I use the broadcast variable across files? I have seen examples where people define the UDF in the same file where the broadcast variable is created, and that seems to work fine. But I need to keep these in separate files because of the volume of code. What are my options?
EDIT: I don't want to serialize the object, pass it to the UDF in the call, and de-serialize it within the UDF. I believe that defeats the purpose of a broadcast variable.
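For reference, the driver-side usage described above would look roughly like this (the column name, sample data, and session setup are assumptions; this only illustrates the pattern in the question, not a fix):

# main.py (hypothetical driver script)
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

from utils import Broadcaster
from udfs import my_udf

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# set several k,v pairs, then broadcast the whole map once
Broadcaster.set_item("key1", "value1")
Broadcaster.set_item("key2", "value2")
Broadcaster.broadcast(sc)

# register the UDF and apply it in a DataFrame transformation
my_spark_udf = udf(my_udf, StringType())
df = spark.createDataFrame([("key1",), ("key3",)], ["col"])
df.withColumn("mapped", my_spark_udf(df["col"])).show()  # fails on the executors as described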

Use function parameter to construct name of object or dataframe

I would like to use a function's parameter to create dynamic names of dataframes and/or objects in Python. I have about 40 different names so it would be really elegant to do this in a function. Is there a way to do this or do I need to do this via 'dict'? I read that 'exec' is dangerous (not that I could get this to work). SAS has this feature for their macros which is where I am coming from. Here is an example of what I am trying to do (using '#' for illustrative purposes):
def TrainModels(mtype):
    model_#mtype = ExtraTreesClassifier()
    model_#mtype.fit(X_#mtype, Y_#mtype)
TrainModels ('FirstModel')
TrainModels ('SecondModel')
You could use a dictionary for this:
models = {}

def TrainModels(mtype):
    models[mtype] = ExtraTreesClassifier()
    models[mtype].fit()  # pass the training data for this model here
First of all, any name you define within your TrainModels function will be local to that function, so it won't be accessible in the rest of your program. You would have to define a global name.
Namespaces in Python are essentially dictionaries, including the global namespace. You can define a new global name dynamically as follows:
my_name = 'foo'
globals()[my_name] = 'bar'
This is terrible and you should never do it. It adds too much indirection to your code. When someone else (or you yourself in three months, when the code is no longer fresh in your mind) reads the code and sees 'foo' used elsewhere, they'll have a hard time figuring out where it came from. Code analysis tools will not be able to help you.
I would use a dict as Milkboat suggested.
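A small sketch of the dictionary approach from the answers above, extended to keep the training data in dictionaries as well (the data-loading step is a placeholder assumption):

from sklearn.ensemble import ExtraTreesClassifier

models = {}
X = {}   # training features keyed by model name
Y = {}   # training labels keyed by model name

def train_model(mtype):
    # look up the data for this name instead of constructing a variable name
    models[mtype] = ExtraTreesClassifier()
    models[mtype].fit(X[mtype], Y[mtype])

# hypothetical data loading; replace with however the ~40 datasets are produced
X['FirstModel'], Y['FirstModel'] = [[0, 1], [1, 0]], [0, 1]
X['SecondModel'], Y['SecondModel'] = [[1, 1], [0, 0]], [1, 0]

for name in ('FirstModel', 'SecondModel'):
    train_model(name)

print(models['FirstModel'].predict([[0, 1]]))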

Not able to update variable in Pyspark

I am trying to update a variable in PySpark and want to use it in another method. I am using @property in a class; when I tested it in plain Python it worked as expected, but when I implement it in PySpark the variable is not updated. Please help me find out what I am doing wrong.
Code:
class Hrk(object):
    def __init__(self, hrkval):
        self.hrkval = hrkval

    @property
    def hrkval(self):
        return self._hrkval

    @hrkval.setter
    def hrkval(self, value):
        self._hrkval = value

    @hrkval.deleter
    def hrkval(self):
        del self._hrkval

filenme = sc.wholeTextFiles("/user/root/CCDs")
hrk = Hrk("No Value")

def add_demo(filename):
    pfname = []
    plname = []
    PDOB = []
    gender = []
    # .......i have not mentioned my logic, i skipped that part......
    hrk.hrkval = pfname[0] + "##" + plname[0] + PDOB[0] + gender[0]
    return str(hrk.hrkval)

def add_med(filename):
    return str(hrk.hrkval)

filenme.map(getname).map(add_demo).saveAsTextFile("/user/cloudera/Demo/")
filenme.map(getname).map(add_med).saveAsTextFile("/user/cloudera/Med/")
In my first method call (add_demo) I get the proper value, but when I try to use the same variable in the second method I get "No Value". I don't know why it is not updating the variable, whereas similar logic works fine in plain Python.
You are trying to mutate the state of a global variable using the map API. This is not a recommended pattern in Spark. You should try to use pure functions as much as possible, and use operations like .reduce, .reduceByKey, or .fold. The reason the following simplified example does not work is that when .map is called, Spark creates a closure for the function f1, creates a copy of the hrk object for each partition, and applies the function to the rows within each partition.
import pyspark
import pyspark.sql

number_cores = 2
memory_gb = 1
conf = (
    pyspark.SparkConf()
    .setMaster('local[{}]'.format(number_cores))
    .set('spark.driver.memory', '{}g'.format(memory_gb))
)
sc = pyspark.SparkContext(conf=conf)
spark = pyspark.sql.SQLContext(sc)

class Hrk(object):
    def __init__(self, hrkval):
        self.hrkval = hrkval

    @property
    def hrkval(self):
        return self._hrkval

    @hrkval.setter
    def hrkval(self, value):
        self._hrkval = value

    @hrkval.deleter
    def hrkval(self):
        del self._hrkval

hrk = Hrk("No Value")
print(hrk.hrkval)
# No Value

def f1(x):
    hrk.hrkval = str(x)
    return "str:" + str(hrk.hrkval)

data = sc.parallelize([1, 2, 3])
data.map(f1).collect()
# ['str:1', 'str:2', 'str:3']

print(hrk.hrkval)
# No Value
You can read more about closures in the Understanding Closures section of the RDD programming guide in the official Spark docs; here are some important snippets:
One of the harder things about Spark is understanding the scope and life cycle of variables and methods when executing code across a cluster. RDD operations that modify variables outside of their scope can be a frequent source of confusion. In the example below we’ll look at code that uses foreach() to increment a counter, but similar issues can occur for other operations as well.
In general, closures - constructs like loops or locally defined methods, should not be used to mutate some global state. Spark does not define or guarantee the behavior of mutations to objects referenced from outside of closures. Some code that does this may work in local mode, but that’s just by accident and such code will not behave as expected in distributed mode. Use an Accumulator instead if some global aggregation is needed.
https://spark.apache.org/docs/latest/rdd-programming-guide.html#understanding-closures-
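As a rough illustration of the accumulator suggestion in the quoted docs (a sketch, not part of the original answer; accumulators only help with numeric aggregation, not with carrying a string like hrkval between stages):

# continuing from the SparkContext `sc` created above
acc = sc.accumulator(0)

def f2(x):
    # accumulators can only be added to from the workers ...
    acc.add(x)
    return x

data = sc.parallelize([1, 2, 3])
data.map(f2).collect()

# ... and read back on the driver once an action has run
print(acc.value)
# 6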

Python - Pass variable handle to evaluate

I am writing a program using Python and the z3py module.
What I am trying to do is the following: I extract the condition of an if or a while statement from a function located in some other file. Additionally, I extract the variables used in the statement as well as their types.
As I do not want to parse the constraint by hand into a z3py-friendly form, I tried to use eval to do this for me, following the tip from this page: Z3 with string expressions
Now the problem is: I do not know what the variables in the constraint are called, but it seems I have to give each variable's handle the same name as the actual variable, otherwise eval won't find it. My code looks like this:
solver = Solver()

# Look up the constraint:
branch = bd.getBranchNum(0)
constr = branch.code

# Create a handle for each variable, depending on its type:
for k in mapper.getVariables():
    var = mapper.getVariables()[k]
    if k in constr:
        if var.type == "intNum":
            Int(k)
        else:
            Real(k)

# Evaluate the constraint, insert the result and solve it:
f = eval(constr)
solver.insert(f)
solve(f)
As you can see I saved the variables and constraints in classes. When executing this code I get the following error:
NameError: name 'real_x' is not defined
If I do not loop over the variables but instead use the following code, everything works fine:
solver = Solver()
branch = bd.getBranchNum(0)
constr = branch.code
print(constr)
real_x = Real('real_x')
int_y = Int('int_y')
f = eval(constr)
print(f)
solver.insert(f)
solve(f)
The problem is: I do not know that the variables are called "real_x" or "int_y". Furthermore, I do not know how many variables are used, which means I need something dynamic like a loop.
Now my question is: Is there a way around this? What can I do to tell python that the handles already exist, but have a different name? Or is my approach completely wrong and I have to do something totally different?
This kind of thing is almost always a bad idea (see Why eval/exec is bad for more details), but "almost always" isn't "always", and it looks like you're using a library that was specifically designed to be used this way, in which case you've found one of the exceptions.
And at first glance, it seems like you've also hit one of the rare exceptions to the Keep data out of your variable names guideline (also see Why you don't want to dynamically create variables). But you haven't.
The only reason you need these variables like real_x to exist is so that eval can see them, right? But the eval function already knows how to look for variables in a dictionary instead of in your global namespace. And it looks like what you're getting back from mapper.getVariables() is a dictionary.
So, skip that whole messy loop, and just do this:
variables = mapper.getVariables()
f = eval(constr, variables)
(eval takes the globals mapping as its second positional argument; older versions of Python do not accept it as a keyword argument, so the positional form is the portable one.)
As the documentation explains, if the globals dict you pass to eval does not contain a '__builtins__' entry, Python inserts the real builtins for you, so the evaluated code can still do all kinds of unsafe things. If you want to prevent that, do this:
variables = dict(mapper.getVariables())
variables['__builtins__'] = {}
f = eval(constr, variables)
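A self-contained sketch of this approach (the constraint string and variable names are invented for illustration, and a plain dict stands in for mapper.getVariables()):

from z3 import Solver, Real, Int, solve

# stand-in for mapper.getVariables(): names mapped to z3 handles
variables = {
    'real_x': Real('real_x'),
    'int_y': Int('int_y'),
}

constr = "real_x > 2.5"           # hypothetical extracted condition

env = dict(variables)
env['__builtins__'] = {}          # keep eval from reaching the builtins

f = eval(constr, env)

solver = Solver()
solver.add(f)
solve(f)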

Categories

Resources