Context
In PySpark I broadcast a variable to all nodes with the following code:
import csv
from pyspark import SparkFiles

sc = spark.sparkContext  # Get the SparkContext from the active SparkSession
# Extract stopwords from a file in HDFS.
# The result looks like stopwords = {"and", "foo", "bar", ...}
stopwords = set([line[0] for line in csv.reader(open(SparkFiles.get("stopwords.txt"), 'r'))])
# The set of stopwords is broadcast now
stopwords = sc.broadcast(stopwords)
After broadcasting the stopwords I want to make it accessible in mapPartitions:
# Some dummy-dataframe
df = spark.createDataFrame([("TESTA and TESTB",), ("TESTB and TESTA",)], ["text"])
# The method which will be applied to mapPartitions
def stopwordRemoval(partition, passed_broadcast):
    """
    Removes stopwords from the "text" column.
    partition: iterator over the rows of one partition.
    passed_broadcast: Broadcast wrapper around the stopword lookup set.
    """
    # Deserialize the broadcast value here, on the executor
    passed_stopwords = passed_broadcast.value
    for row in partition:
        yield [" ".join(word for word in row["text"].split(" ") if word not in passed_stopwords)]
# re-partitioning in order to get mapPartitions working
df = df.repartition(2)
# Now apply the method
df = df.select("text").rdd \
    .mapPartitions(lambda partition: stopwordRemoval(partition, stopwords)) \
    .toDF(["text"])
# Result
df.show()
#Result:
+------------+
| text |
+------------+
|TESTA TESTB |
|TESTB TESTA |
+------------+
Questions
Even though it works, I'm not quite sure whether this is the right usage of broadcast variables. So my questions are:
Is the broadcast correctly executed when I pass it to mapPartitions in the demonstrated way?
Is using broadcasting with mapPartitions useful at all, since stopwords would be distributed with the function to all nodes anyway (stopwords is never reused)?
The second question relates to this question, which partly answers my own. Anyhow, the specifics differ; that's why I've chosen to ask here as well.
Some time has passed and I have read some additional material that answered the question for me, so I want to share my insights.
Question 1: Is the broadcast correctly executed when I pass it to mapPartitions in the demonstrated way?
First, it is worth noting that SparkContext.broadcast() returns a wrapper around the variable to broadcast, as can be read in the docs. The wrapper serializes the variable and adds the information to the execution graph so that this serialized form is distributed over the nodes. Accessing the broadcast's .value attribute deserializes the variable again when it is used.
Additionally, the docs state:
After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v [the variable] is not shipped to the nodes more than once.
Secondly, I found several sources stating that this works with UDFs (user-defined functions), e.g. here. mapPartitions() and udf()s can be considered analogous since, in the case of PySpark, both pass the data to a Python instance on the respective nodes.
Regarding this, here is the important part: deserialization has to happen inside the Python function itself (the udf() or whatever function is passed to mapPartitions()), meaning the .value attribute must not be resolved on the driver and passed as a function parameter.
Thus, the broadcast above is done the right way: the broadcast wrapper is passed as the parameter and the variable is deserialized inside stopwordRemoval().
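To make that distinction concrete, here is a minimal sketch of both call sites, with rdd standing in for df.select("text").rdd and stopwordRemovalPlain being a hypothetical variant that expects the plain set instead of the wrapper:
# Correct: ship the lightweight Broadcast wrapper; .value is only
# called inside the function, i.e. on the executors
rdd.mapPartitions(lambda partition: stopwordRemoval(partition, stopwords))

# Problematic: .value is resolved on the driver, so the full set is
# serialized into every task closure and the broadcast gains nothing
rdd.mapPartitions(lambda partition: stopwordRemovalPlain(partition, stopwords.value))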
Question 2: Is using broadcasting with mapPartitions useful at all, since stopwords would be distributed with the function to all nodes anyway (stopwords is never reused)?
It is documented that broadcasting only provides an advantage if avoiding repeated serialization pays off for the task at hand:
The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
This might be the case when you have a large reference dataset to broadcast to your cluster:
[...] to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
If this applies to your broadcast, broadcasting has an advantage.
I am writing a program in Python that communicates with a spectrometer from Avantes. There are some proprietary DLLs available whose code I don't have access to, but they have some decent documentation. I am having some trouble finding a good way to store the data received via callbacks.
The proprietary shared library
Basically, the DLL contains a function that I have to call to start measuring; it receives a callback function that will be called whenever the spectrometer has finished a measurement. The function is the following:
int AVS_MeasureCallback(AvsHandle a_hDevice, void (*__Done)(AvsHandle*, int*), short a_Nmsr)
The first argument is a handle object that identifies the spectrometer, the second is the actual callback function and the third is the amount of measurements to be made.
The callback function will then receive another type of handle identifying the spectrometer and information about the amount of data available after a measurement.
Python library
I am using a library that has Python wrappers for many instruments, including my spectrometer.
def measure_callback(self, num_measurements, callback=None):
    self.sdk.AVS_MeasureCallback(self._handle, callback, num_measurements)
And they also have defined the following decorator:
MeasureCallback = FUNCTYPE(None, POINTER(c_int32), POINTER(c_int32))
The idea is that when the callback function is finally called, this will trigger the get_data() function that will retrieve data from the equipment.
The recommended example is:
@MeasureCallback
def callback_fcn(handle, info):
    print('The DLL handle is:', handle.contents.value)
    if info.contents.value == 0:  # equals 0 if everything is okay (see manual)
        print(' callback data:', ava.get_data())

ava.measure_callback(-1, callback_fcn)
My problem
I have to store the received data in a 2D numpy array that I have created somewhere else in my main code, but I can't figure out what is the best way to update this array with the new data available inside the callback function.
I wondered if I could pass this numpy array as an argument to the callback function, but even then I cannot find a good way to do it, since the callback function is expected to have only those two arguments.
Edit 1
I found a possible solution here but I am not sure it is the best way to do it. I'd rather not create a new class just to hold a single numpy array inside.
Edit 2
I actually changed my mind about my approach, because inside my callback I'd like to do many operations with the received data and save the results in many different variables. So I went back to the class approach mentioned here, where I would basically have a class holding all the variables that are used in the callback function, and which would also inherit from or hold an object of the ava class.
However, as shown in this other question, the self parameter is a problem in this case.
If you don't want to create a new class, you can use a function closure:
# Initialize it however you want
numpy_array = ...

def callback_fcn(handle, info):
    # Do what you want with the value of the variable
    store_data(numpy_array, ...)

# After the callback is called, you can access the changes made to the object
print(get_data(numpy_array))
How this works is that when callback_fcn is defined, it keeps a reference to the enclosing variable numpy_array, so when it's called it can manipulate that array as if it had been passed as an argument. You get the effect of passing it in without the callback caller having to worry about it.
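Applied to the callback from the question, a rough sketch might look like this (assuming ava, MeasureCallback, pattern_amount and pixel_amount exist as in the snippets above, and that ava.get_data() returns a (timestamp, spectrum) pair as in the final solution below; the state lives at module level, hence the global):
import numpy as np

# Pre-allocate storage in the enclosing scope; the callback closes over it
spectral_data = np.empty((pattern_amount, pixel_amount), dtype=np.float64)
spectrum_index = 0

@MeasureCallback
def callback_fcn(handle, info):
    global spectrum_index
    if info.contents.value >= 0:  # data is available (see manual)
        timestamp, spectrum = ava.get_data()
        spectral_data[spectrum_index, :] = np.ctypeslib.as_array(spectrum[0:pixel_amount])
        spectrum_index += 1

ava.measure_callback(-1, callback_fcn)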
I finally managed to solve my problem with a solution involving a new class and also a closure to deal with the self parameter, as described here. Besides that, another problem would have appeared due to garbage collection of the newly created method.
My final solution is:
class spectrometer():
    def measurement_callback(self, handle, info):
        if info.contents.value >= 0:
            timestamp, spectrum = self.ava.get_data()
            self.spectral_data[self.spectrum_index, :] = np.ctypeslib.as_array(spectrum[0:self.pixel_amount])
            self.timestamps[self.spectrum_index] = timestamp
            self.spectrum_index += 1

    def __init__(self, ava):
        self.ava = ava
        # Wrap the bound method once and keep the wrapper on the instance so
        # the ctypes callback object is not garbage collected
        self.measurement_callback = MeasureCallback(self.measurement_callback)

    def register_callback(self, scans, pattern_amount, pixel_amount):
        self.spectrum_index = 0
        self.pixel_amount = pixel_amount
        self.timestamps = np.empty(pattern_amount, dtype=np.uint32)
        self.spectral_data = np.empty((pattern_amount, pixel_amount), dtype=np.float64)
        self.ava.measure_callback(scans, self.measurement_callback)
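A hypothetical way to wire this up (the argument values are placeholders):
spec = spectrometer(ava)
spec.register_callback(scans=-1, pattern_amount=100, pixel_amount=2048)
# ... once the measurements have completed, the results are in
# spec.spectral_data and spec.timestamps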
I am writing a library that implements various out-of-core algorithms and have run into the issue that it is possible to build a circularly dependent computation graph by sourcing from and storing to the same memmapped object. For example:
import numpy as np
import dask.array as da
array_shape = (8000, 8000)
chunk_shape = (100, 100)
### create the example data
adisk = np.memmap('/tmp/a.npy', mode='w+', dtype=np.float32, shape=array_shape)
bdisk = np.memmap('/tmp/b.npy', mode='w+', dtype=np.float32, shape=array_shape)
a = da.ones(array_shape, chunks=chunk_shape, dtype=np.float32)
b = 2 * da.ones(array_shape, chunks=chunk_shape, dtype=np.float32)
a.store(adisk)
b.store(bdisk)
adisk.flush()
bdisk.flush()
### Begin demonstration of issue
c = da.from_array(adisk, chunks=chunk_shape)
d = da.from_array(bdisk, chunks=chunk_shape)
e = c @ d
e.store(adisk)
# Assert fails because source data is overwritten before re-read
assert np.all(adisk[:] == adisk[0,0])
My testing suggests that if the data is too large to cache in memory, dask will complete the operation, but the behaviour is undefined: source data can be overwritten by result data before it is re-read for other parts of the computation. Thus the above code ceases to produce correct results above a certain (machine-dependent) matrix size.
Does dask provide any methods to help detect and mitigate such circular dependencies?
I have looked into potential solutions and I am currently thinking that a plugin function could be useful here to check that the target of a store operation is not the same as any source further back in the chain.
I believe that dask.array tokenizes memmapped objects by filename and last modification time. Cases like this, where the target has not yet been written to, look like a failure case.
I would expect this to fail somewhere in the task optimization phase.
Short term, I recommend explicitly providing a name= keyword to the da.from_array function to supply your own unique ID, for example:
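For example (the names here are arbitrary, they just need to be unique per on-disk state):
c = da.from_array(adisk, chunks=chunk_shape, name="adisk-before-matmul")
d = da.from_array(bdisk, chunks=chunk_shape, name="bdisk-before-matmul")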
I created a solution for my purposes that can be used to check that there are no circular dependencies. In my case, because of the code structure, I can easily guard the potential issues with e.g. assert no_circular_dependency(e, adisk), but in general this would be tricky to achieve without monkeypatching dask.
Because each dask array keeps a reference to its task graph, it is straightforward to walk the graph, find the original data sources, and ensure that they are not the same as the destination array.
def get_true_base(ndarr):
    """Walk the .base chain down to the underlying buffer (an mmap for memmapped arrays)."""
    if not isinstance(ndarr, np.ndarray):
        raise TypeError("expected numpy array")
    base = ndarr
    while getattr(base, 'base', None) is not None:
        base = base.base
    return base

def get_memmap_bases(dsk):
    # better to build a set of names and then grab the arrays later
    nameset = set()
    for name, task in dsk.items():
        # da.from_array keeps the original memmap in the graph, so its
        # first row (itself a memmap) identifies a memmapped source
        if isinstance(task[0], np.memmap):
            nameset.add(name)
    bases = [get_true_base(dsk[name][0]) for name in nameset]
    return bases

def no_circular_dependency(dask_arr, target):
    if get_true_base(target) in get_memmap_bases(dask_arr.dask):
        return False
    return True
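Used as a guard before a store, the check would reject the problematic example from above:
e = c @ d
# The assertion fails because adisk is also a source of the graph behind e
assert no_circular_dependency(e, adisk)
e.store(adisk)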
I'd like to create a tf.data.Dataset.from_generator(...) dataset. I need to pass in a Python generator.
I would like to pass in a property of a previous dataset to the generator like so:
dataset = dataset.interleave(
    map_func=lambda x: tf.data.Dataset.from_generator(generator=lambda: gen(x), output_types=tf.int64),
    cycle_length=2
)
Where I define gen(...) to take a value (which is a pointer to some data such as a filename which gen knows how to access).
This fails because gen receives a tensor object, not a python/numpy value.
Is there a way to resolve the tensor object to a value inside of gen(...)?
The reason for interleaving the generators is so I can manipulate the list of data-pointers/filenames with other dataset operations such as .shuffle() and .repeat() without the need to bake those into the gen(...) function, which would be necessary if I started with the generator directly from the list of data-pointers/filenames.
I want to use the generator because a large number of data values will be generated per data-pointer/filename.
TensorFlow now supports passing tensor arguments to the generator:
def map_func(tensor):
    dataset = tf.data.Dataset.from_generator(generator, tf.float32, args=(tensor,))
    return dataset
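Plugged back into the interleave from the question, this might look roughly like the following; with args, gen receives a NumPy value (bytes for string filenames) instead of a symbolic tensor:
dataset = dataset.interleave(
    map_func=lambda x: tf.data.Dataset.from_generator(gen, output_types=tf.int64, args=(x,)),
    cycle_length=2
)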
The answer is indeed no. Here are a couple of relevant GitHub issues (open as of the time of this writing) to follow for further developments on the question:
https://github.com/tensorflow/tensorflow/issues/13101
https://github.com/tensorflow/tensorflow/issues/16343
I want to create two different collections for summaries: one for training summaries and one for validation summaries. Then I can use two different merge_all operations to retrieve the values:
merge_all(key=tf.GraphKeys.SUMMARIES)
The summary produced by the scalar function can be added to a collection via its collections argument:
tf.summary.scalar(
    name,
    tensor,
    collections=None,
    family=None
)
How do I create a new collection for summaries?
It should be possible to use an arbitrary string value as the key. It might look something like this:
tf.summary.scalar('tag_a', ...)
tf.summary.scalar('tag_b', ..., collections=["foo"])
merged_a = tf.summary.merge_all()
merged_b = tf.summary.merge_all(key="foo")
writer_a = tf.summary.FileWriter(log_dir + '/collection_a')
writer_b = tf.summary.FileWriter(log_dir + '/collection_b')
for step in range(1000):
    summary_a, summary_b = sess.run([merged_a, merged_b], ...)
    writer_a.add_summary(summary_a, step)
    writer_b.add_summary(summary_b, step)
It's worth mentioning that normally people configure things so that there is one merge_all operation and multiple run calls. For example: https://github.com/tensorflow/tensorflow/blob/cf7c008ab150ac8e5edb3ed053d38b2919699796/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py#L142 Even if the summary series are broken up using the collections parameter, TensorBoard will still visually represent them as if they were different runs. Please also note that the name parameter corresponds to each chart (tag) within a run.
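A rough sketch of that single-merge_all pattern, with train_op, train_feed and valid_feed as placeholders for your own training op and feed dictionaries:
merged = tf.summary.merge_all()
train_writer = tf.summary.FileWriter(log_dir + '/train', sess.graph)
valid_writer = tf.summary.FileWriter(log_dir + '/validation')

for step in range(1000):
    summary, _ = sess.run([merged, train_op], feed_dict=train_feed)
    train_writer.add_summary(summary, step)
    if step % 10 == 0:
        summary = sess.run(merged, feed_dict=valid_feed)
        valid_writer.add_summary(summary, step)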
I'm trying to speed up some multiprocessing code in Python 3. I have a big read-only DataFrame and a function to make some calculations based on the read values.
I tried to solve the issue by writing a function inside the same file and sharing the big DataFrame as a global, as you can see here. This approach does not allow moving the process function to another file/module, and it is a bit odd to access a variable outside the scope of the function.
import pandas as pd
import multiprocessing
def process(user):
    # Locate all the user sessions in the *global* sessions dataframe
    user_session = sessions.loc[sessions['user_id'] == user]
    user_session_data = pd.Series()
    # Make calculations and append to user_session_data
    return user_session_data
# The DataFrame users contains ID, and other info for each user
users = pd.read_csv('users.csv')
# Each row is the details of one user action.
# There is several rows with the same user ID
sessions = pd.read_csv('sessions.csv')
p = multiprocessing.Pool(4)
sessions_id = sessions['user_id'].unique()
# I'm passing an integer ID argument to process() function so
# there is no copy of the big sessions DataFrame
result = p.map(process, sessions_id)
Things I've tried:
Passing a DataFrame instead of integer ID arguments, to avoid the sessions.loc... line of code. This approach slowed the script down a lot.
Also, I've looked at How to share pandas DataFrame object between processes? but didn't find a better way.
You can try defining process as:
def process(sessions, user):
    ...
And put it wherever you prefer.
Then, when you call p.map, you can use the functools.partial function, which allows you to partially specify arguments:
from functools import partial
...
p.map(partial(process, sessions), sessions_id)
This should not slow down the processing too much and should address your issue.
Note that you could in principle do the same without partial, using:
p.map(lambda id: process(sessions, id), sessions_id)
but keep in mind that multiprocessing has to pickle the mapped function, and lambdas cannot be pickled by the standard pickle module, so partial is the safer choice here.