I have a broadcast variable set up in a separate .py file, and I am importing it in a file that contains my UDFs. But when I try to use this variable in a UDF, I see that the broadcast variable is not initialized (NoneType) when used within the scope of a DataFrame transformation function. Here is the supporting code.
The broadcast model is in utils.py and is defined as below:
class Broadcaster(object):
    _map = {}
    _bv = None

    @staticmethod
    def set_item(k, v):
        Broadcaster._map[k] = v

    @staticmethod
    def broadcast(sc):
        Broadcaster._bv = sc.broadcast(Broadcaster._map)

    @staticmethod
    def get_item(k):
        val = Broadcaster._bv.value
        return val.get(k)
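For context, the driver-side usage looks roughly like the following sketch (model_a and model_b are placeholder values for my actual items; sc is the SparkContext):
# main.py (sketch of the intended usage)
from utils import Broadcaster

Broadcaster.set_item("model_a", model_a)
Broadcaster.set_item("model_b", model_b)
Broadcaster.broadcast(sc)   # broadcast once, after all items have been set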
The reason for doing this is to provide an interface where multiple k, v combinations can be set before broadcasting. This means that in my main.py I can call Broadcaster.set_item(k, v) multiple times and then eventually call Broadcaster.broadcast(sc), which works fine. But now I want to use this broadcast variable in a UDF that lives in a separate file (say udfs.py). Note that these UDFs are used in my DataFrame processing. Below is a sample UDF:
def my_udf(col):
    bv = Broadcaster._bv.value  # this throws an exception :-(
    # more code
In my udfs.py file itself, I have no trouble accessing Broadcaster._bv.value. It is only when it is used within the UDF, and the UDF is called from a DataFrame operation, that I get a "'NoneType' object has no attribute 'value'" exception. Basically, the worker nodes are unable to access the broadcast variable. Why can't I use the broadcast variable across files? I have seen examples where people define the UDF in the same file where the broadcast variable is created, and that seems to work fine. But I need to keep these in separate files due to the volume of code. What are my options?
EDIT: I don't want to serialize the object, pass it to the UDF during the call and de-serialize it within the UDF. I believe that defeats the purpose of a broadcast variable.
I am writing a program in Python that communicates with a spectrometer from Avantes. There are some proprietary DLLs available whose code I don't have access to, but they have some decent documentation. I am having trouble finding a good way to store the data received via callbacks.
The proprietary shared library
Basically, the DLL contains a function that I have to call to start measuring, and that receives a callback function which will be called whenever the spectrometer has finished a measurement. The function is the following:
int AVS_MeasureCallback(AvsHandle a_hDevice,void (*__Done)(AvsHandle*, int*),short a_Nmsr)
The first argument is a handle object that identifies the spectrometer, the second is the actual callback function, and the third is the number of measurements to be made.
The callback function will then receive another type of handle identifying the spectrometer, and information about the amount of data available after a measurement.
Python library
I am using a library that provides Python wrappers for many instruments, including my spectrometer.
def measure_callback(self, num_measurements, callback=None):
    self.sdk.AVS_MeasureCallback(self._handle, callback, num_measurements)
They have also defined the following decorator:
MeasureCallback = FUNCTYPE(None, POINTER(c_int32), POINTER(c_int32))
The idea is that when the callback function is finally called, this will trigger the get_data() function that will retrieve data from the equipment.
The recommended example is
@MeasureCallback
def callback_fcn(handle, info):
    print('The DLL handle is:', handle.contents.value)
    if info.contents.value == 0:  # equals 0 if everything is okay (see manual)
        print(' callback data:', ava.get_data())

ava.measure_callback(-1, callback_fcn)
My problem
I have to store the received data in a 2D numpy array that I have created somewhere else in my main code, but I can't figure out the best way to update this array with the new data available inside the callback function.
I wondered if I could pass this numpy array as an argument to the callback function, but even then I cannot find a good way to do this, since the callback function is expected to have only those two arguments.
Edit 1
I found a possible solution here, but I am not sure it is the best way to do it. I'd rather not create a new class just to hold a single numpy array.
Edit 2
I actually changed my mind about my approach, because inside my callback I'd like to perform many operations on the received data and save the results in several different variables. So I went back to the class approach mentioned here, where I would basically have a class with all the variables that will somehow be used in the callback function, and that would also inherit from, or hold an object of, the class ava.
However, as shown in this other question, the self parameter is a problem in this case.
If you don't want to create a new class, you can use a function closure:
# Initialize it however you want
numpy_array = ...

def callback_fcn(handle, info):
    # Do what you want with the value of the variable
    store_data(numpy_array, ...)

# After the callback is called, you can access the changes made to the object
print(get_data(numpy_array))
How this works is that when callback_fcn is defined, it keeps a reference to the variable numpy_array, so when it's called it can manipulate it, as if it had been passed as an argument to the function. So you get the effect of passing it in, without the callback caller having to worry about it.
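For example, a minimal sketch of this closure approach combined with the @MeasureCallback decorator from the question (ava, MeasureCallback, pixel_amount and num_measurements are assumed to exist as described above, and get_data() is assumed to return a timestamp and a spectrum, as in the solution below):
import numpy as np

# Preallocate storage; the callback mutates these objects in place.
spectra = np.zeros((num_measurements, pixel_amount))
next_row = [0]          # mutable counter so the closure can advance it

@MeasureCallback
def callback_fcn(handle, info):
    if info.contents.value == 0:
        timestamp, spectrum = ava.get_data()
        spectra[next_row[0], :] = np.ctypeslib.as_array(spectrum[0:pixel_amount])
        next_row[0] += 1

ava.measure_callback(num_measurements, callback_fcn)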
I finally managed to solve my problem with a solution involving a new class and also a closure function to deal with the self parameter, as described here. Besides that, another problem appeared due to garbage collection of the newly created method.
My final solution is:
class spectrometer():
    def measurement_callback(self, handle, info):
        if info.contents.value >= 0:
            timestamp, spectrum = self.ava.get_data()
            # pixel_amount is read from the enclosing (module) scope here
            self.spectral_data[self.spectrum_index, :] = np.ctypeslib.as_array(spectrum[0:pixel_amount])
            self.timestamps[self.spectrum_index] = timestamp
            self.spectrum_index += 1

    def __init__(self, ava):
        self.ava = ava
        self.measurement_callback = MeasureCallback(self.measurement_callback)

    def register_callback(self, scans, pattern_amount, pixel_amount):
        self.spectrum_index = 0
        self.timestamps = np.empty((pattern_amount), dtype=np.uint32)
        self.spectral_data = np.empty((pattern_amount, pixel_amount), dtype=np.float64)
        self.ava.measure_callback(scans, self.measurement_callback)
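A short usage sketch of this class (the scans, pattern_amount and pixel_amount values are placeholders; pixel_amount is also assumed to be defined at module scope, since measurement_callback reads it directly):
pixel_amount = 2048
spec = spectrometer(ava)
spec.register_callback(scans=100, pattern_amount=100, pixel_amount=pixel_amount)
# ... wait for the measurements to complete ...
print(spec.spectral_data.shape)               # (100, 2048)
print(spec.timestamps[:spec.spectrum_index])  # timestamps of completed spectra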
I am working in Pyspark with a lambda function like the following:
udf_func = UserDefinedFunction(lambda value: method1(value, dict_global), IntegerType())
result_col = udf_func(df[atr1])
The implementation of method1 is the following:
def method1(value, dict_global):
    result = len(dict_global)
    if value in dict_global:
        result = dict_global[value]
    return result
dict_global is a global dictionary that contains some values.
The problem is that when I execute the lambda function the result is always None. For some reason the method1 function doesn't pick up the external variable dict_global. Why? What can I do?
Finally I found a solution, which I describe below:
Lambda functions (as well as map and reduce functions) executed in Spark are scheduled across the different executors and run in separate executor processes. So the problem in my code was that global variables are sometimes not captured by the functions executed in parallel on different executors, so I looked for a way to solve it.
Fortunately, Spark has a mechanism called "broadcast" which allows passing variables to the functions executed across the executors so they can be used without problems. There are two types of shareable variables: broadcast variables (immutable, read-only) and accumulators (mutable, but only numeric values are accepted).
I rewrote my code to show you how I fixed the problem:
broadcastVar = sc.broadcast(dict_global)
udf_func = UserDefinedFunction(lambda value: method1(value, broadcastVar.value), IntegerType())
result_col = udf_func(df[atr1])
Hope it helps!
I am trying to update a variable in PySpark and want to use the same variable in another method. I am using @property in a class; when I tested it in plain Python it worked as expected, but when I try to implement it in PySpark it does not update the variable. Please help me find out what I am doing wrong.
Code:
class Hrk(object):
    def __init__(self, hrkval):
        self.hrkval = hrkval

    @property
    def hrkval(self):
        return self._hrkval

    @hrkval.setter
    def hrkval(self, value):
        self._hrkval = value

    @hrkval.deleter
    def hrkval(self):
        del self._hrkval
filenme = sc.wholeTextFiles("/user/root/CCDs")

hrk = Hrk("No Value")

def add_demo(filename):
    pfname = []
    plname = []
    PDOB = []
    gender = []
    # ...... I have not shown my logic here, I skipped that part ......
    hrk.hrkval = pfname[0] + "##" + plname[0] + PDOB[0] + gender[0]
    return str(hrk.hrkval)

def add_med(filename):
    return str(hrk.hrkval)

filenme.map(getname).map(add_demo).saveAsTextFile("/user/cloudera/Demo/")
filenme.map(getname).map(add_med).saveAsTextFile("/user/cloudera/Med/")
In my first method call (add_demo) I get the proper value, but when I want to use the same variable in the second method (add_med) I get No Value. I don't know why it is not updating the variable, whereas similar logic works fine in plain Python.
You are trying to mutate the state of a global variable using the map API. This is not a recommended pattern in Spark. Try to use pure functions as much as possible, and use operations like .reduce, .reduceByKey, or .foldLeft. The reason the following simplified example does not work is that when .map is called, Spark creates a closure for the function f1, creates a copy of the hrk object for each partition, and applies it to the rows within each partition.
import pyspark
import pyspark.sql

number_cores = 2
memory_gb = 1
conf = (
    pyspark.SparkConf()
        .setMaster('local[{}]'.format(number_cores))
        .set('spark.driver.memory', '{}g'.format(memory_gb))
)
sc = pyspark.SparkContext(conf=conf)
spark = pyspark.sql.SQLContext(sc)

class Hrk(object):
    def __init__(self, hrkval):
        self.hrkval = hrkval

    @property
    def hrkval(self):
        return self._hrkval

    @hrkval.setter
    def hrkval(self, value):
        self._hrkval = value

    @hrkval.deleter
    def hrkval(self):
        del self._hrkval

hrk = Hrk("No Value")
print(hrk.hrkval)
# No Value

def f1(x):
    hrk.hrkval = str(x)
    return "str:" + str(hrk.hrkval)

data = sc.parallelize([1, 2, 3])
data.map(f1).collect()
# ['str:1', 'str:2', 'str:3']

print(hrk.hrkval)
# No Value
You can read more about closures in the Understanding Closures section of the RDD programming guide in the official Spark docs; here are some important snippets:
One of the harder things about Spark is understanding the scope and life cycle of variables and methods when executing code across a cluster. RDD operations that modify variables outside of their scope can be a frequent source of confusion. In the example below we’ll look at code that uses foreach() to increment a counter, but similar issues can occur for other operations as well.
In general, closures - constructs like loops or locally defined methods, should not be used to mutate some global state. Spark does not define or guarantee the behavior of mutations to objects referenced from outside of closures. Some code that does this may work in local mode, but that’s just by accident and such code will not behave as expected in distributed mode. Use an Accumulator instead if some global aggregation is needed.
https://spark.apache.org/docs/latest/rdd-programming-guide.html#understanding-closures-
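For completeness, a minimal sketch of the accumulator approach the docs recommend, assuming the same sc as above (accumulators only support commutative, associative updates such as counting or summing):
counter = sc.accumulator(0)

def f2(x):
    counter.add(1)      # updating an accumulator inside a task is supported
    return "str:" + str(x)

data = sc.parallelize([1, 2, 3])
data.map(f2).collect()
print(counter.value)    # 3 on the driver (updates inside transformations
                        # may be re-applied if a task is retried)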
I'm trying to speed up some multiprocessing code in Python 3. I have a big read-only DataFrame and a function to make some calculations based on the read values.
I tried to solve the issue by writing a function in the same file and sharing the big DataFrame, as you can see here. This approach does not allow moving the process function to another file/module, and it's a bit weird to access a variable outside the scope of the function.
import pandas as pd
import multiprocessing

def process(user):
    # Locate all the user sessions in the *global* sessions dataframe
    user_session = sessions.loc[sessions['user_id'] == user]
    user_session_data = pd.Series()
    # Make calculations and append to user_session_data
    return user_session_data

# The DataFrame users contains the ID and other info for each user
users = pd.read_csv('users.csv')

# Each row is the details of one user action.
# There are several rows with the same user ID
sessions = pd.read_csv('sessions.csv')

p = multiprocessing.Pool(4)
sessions_id = sessions['user_id'].unique()

# I'm passing an integer ID argument to the process() function so
# there is no copy of the big sessions DataFrame
result = p.map(process, sessions_id)
Things I've tried:
Passing a DataFrame instead of integer ID arguments to avoid the sessions.loc... line of code. This approach slows down the script a lot.
Also, I've looked at How to share pandas DataFrame object between processes? but didn't find a better way.
You can try defining process as:
def process(sessions, user):
...
And put it wherever you prefer.
Then, when you call p.map, you can use the functools.partial function, which allows you to incrementally specify arguments:
from functools import partial
...
p.map(partial(process, sessions), sessions_id)
This should not slow down the processing too much and should answer your issue.
Note that you could do the same without partial, using:
p.map(lambda id: process(sessions, id), sessions_id)
(Keep in mind, though, that multiprocessing has to pickle the function it maps, and lambdas cannot be pickled by the standard pickle module, so partial is usually the safer choice here.)
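For illustration, an end-to-end sketch of this pattern with process living in its own module (the module name user_processing.py and the n_actions calculation are hypothetical, just to show the function in a separate file):
# user_processing.py (hypothetical module)
import pandas as pd

def process(sessions, user):
    user_session = sessions.loc[sessions['user_id'] == user]
    return pd.Series({'n_actions': len(user_session)})

# main.py
import multiprocessing
from functools import partial

import pandas as pd
from user_processing import process

if __name__ == '__main__':
    sessions = pd.read_csv('sessions.csv')
    sessions_id = sessions['user_id'].unique()
    with multiprocessing.Pool(4) as p:
        result = p.map(partial(process, sessions), sessions_id)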
I'm new to Python and using Anaconda (editor: Spyder) to write some simple functions. I've created a collection of 20 functions and saved them in separate .py files (file names are the same as function names).
For example
def func1(X):
    Y = ...
    return Y
I have another function that takes as input a function name as a string (one of those 20 functions), calls it, does some calculations, and returns the output.
def Main(String, X):
    Z = ...
    W = String(Z)
    V = ...
    return V
How can I choose the function based on string input?
More details:
The Main function calculates the Sobol indices of a given function. I write the Main function. My colleagues write their own functions (each might be more than 500 lines of code) and just want to use Main to get the Sobol indices. I will give Main to other people, so I do NOT know what function Main will receive in the future. I also do not want the user of Main to go through the trouble of making a dictionary.
Functions are objects in Python. This means you can store them in dictionaries. One approach is to dispatch the function calls by storing the names you wish to call as keys and the functions as values.
So for example:
import func1, func2

operation_dispatcher = {
    "func1": getattr(func1, "func1"),
    "func2": getattr(func2, "func2"),
}

def something_calling_funcs(func_name, param):
    """Calls func_name with param"""
    func_to_call = operation_dispatcher.get(func_name, None)
    if func_to_call:
        func_to_call(param)
Now, it might be possible to generate the dispatch table more automatically with something like __import__, but there might be a better design in this case (perhaps consider reorganizing your imports).
EDIT: It took me a minute to fully test this because I had to set up a few files. You can potentially do something like this if you have a lot of names to import and don't want to specify each one manually in the dictionary:
import importlib

func_names = ["func1", "func2"]

operation_dispatch = {
    name: getattr(importlib.import_module(name), name)
    for name in func_names
}

# usage
result = operation_dispatch[function_name](param)
Note that this assumes that the function names and module names are the same. It uses importlib to import the modules from the strings provided in func_names.
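A variant of the same importlib + getattr idea (a hypothetical helper, not part of the answer above) resolves a dotted "module.function" string directly, with no dictionary at all:
import importlib

def resolve(path):
    # "package.module.func" -> the function object
    module_name, func_name = path.rsplit(".", 1)
    return getattr(importlib.import_module(module_name), func_name)

# e.g. a module func1.py containing a function func1, as in the question
func = resolve("func1.func1")
Y = func(X)   # X as in the question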
You'll just want to have a dictionary of functions, like this:
call = {
    "func1": func1,
    "functionOne": func1,
    "func2": func2,
}
Note that you can have multiple keys for the same function if necessary, and the key doesn't need to match the function name exactly, as long as the user enters the right key.
Then you can call this function like this:
def Main(String, X):
    Z = ...
    W = call[String](Z)
    V = ...
    return V
Though I recommend catching an error in case the user fails to enter a valid key:
def Main(String, X):
    Z = ...
    try:
        W = call[String](Z)
    except KeyError:
        raise NameError(String + " is not a valid function key")
    V = ...
    return V
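A brief usage sketch, assuming func1 and func2 have been imported and the call dictionary above is in scope (the input values are placeholders):
user_choice = "functionOne"   # e.g. read from a config file or user input
V = Main(user_choice, X)      # dispatches to func1 through the dictionary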