For performance monitoring I would like to keep an eye on the number of examples currently queued. I am balancing the number of threads I use for filling the queue against the optimal maximum size of the queue.
How do I obtain this information? I am using tf.train.batch(), but I guess the information might be somewhere down in the FIFOQueue?
I would have expected this to be a local variable but I haven't found it.
tl;dr: if your queue is created by tf.train.batch, you can get its size with sess.run("batch/fifo_queue_Size:0")
A FIFOQueue object provides a size() method which creates an op that gives the number of elements in the queue. However, if you are using tf.train.batch, the FIFOQueue is created inside that method and the queue object is not exposed externally.
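For comparison, if you construct the queue yourself you can simply keep a reference to its size() op. A minimal sketch (the queue parameters here are only illustrative):
import tensorflow as tf

queue = tf.FIFOQueue(capacity=32, dtypes=[tf.int32], shapes=[[]])
size_op = queue.size()  # op returning the current number of queued elements

with tf.Session() as sess:
    sess.run(queue.enqueue_many([[1, 2, 3]]))
    print(sess.run(size_op))  # -> 3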
In particular, you can see how tf.train.batch constructs its internal queue in input.py:
queue = _which_queue(dynamic_pad)(
    capacity=capacity, dtypes=types, shapes=shapes, shared_name=shared_name)
print("Enqueueing: ", enqueue_many, tensor_list, shapes)
_enqueue(queue, tensor_list, num_threads, enqueue_many)
summary.scalar("queue/%s/fraction_of_%d_full" % (queue.name, capacity),
               math_ops.cast(queue.size(), dtypes.float32) *
               (1. / capacity))
Since queue is local, you can't get hold of its size() method directly. However, since size() has been called in order to construct the summary, the appropriate size op is already in the graph and you can call it by name. You can find the name of the node by doing something like this:
import tensorflow as tf

x = tf.constant(1)
q = tf.train.batch([x], 2)
print(tf.get_default_graph().as_graph_def())
You will see
node {
  name: "batch/fifo_queue_Size"
  op: "QueueSize"
  input: "batch/fifo_queue"
  attr {
    key: "_class"
    value {
      list {
From this you can tell that batch/fifo_queue_Size is the name of the op, and hence batch/fifo_queue_Size:0 is the name of the first output, so you can get the size by doing something like this:
sess.run("batch/fifo_queue_Size:0")
If you have multiple batch ops, the names will be automatically deduplicated into batch_1/fifo_queue_Size, batch_2/fifo_queue_Size, etc.
Alternatively, you can call tf.train.batch(..., name="mybatch"), and then the name of the size tensor will be mybatch/fifo_queue_Size:0
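Putting it together, here is a minimal sketch of polling the queue size alongside the training loop; it assumes the default name "batch" was used, and the loop body is only illustrative:
import tensorflow as tf

x = tf.constant(1)
batched = tf.train.batch([x], batch_size=2)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    for step in range(10):
        sess.run(batched)  # stands in for your training step
        queued = sess.run("batch/fifo_queue_Size:0")
        print("step %d: %d examples queued" % (step, queued))
    coord.request_stop()
    coord.join(threads)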
Context
In pySpark I broadcast a variable to all nodes with the following code:
sc = spark.sparkContext # Get context
# Extract stopwords from a file in hdfs
# The result looks like stopwords = {"and", "foo", "bar" ... }
stopwords = set([line[0] for line in csv.reader(open(SparkFiles.get("stopwords.txt"), 'r'))])
# The set of stopwords is broadcasted now
stopwords = sc.broadcast(stopwords)
After broadcasting the stopwords I want to make them accessible in mapPartitions:
# Some dummy dataframe (plain strings in the "text" column so split() works below)
df = spark.createDataFrame([("TESTA and TESTB", ), ("TESTB and TESTA", )], ["text"])

# The method which will be applied to mapPartitions
def stopwordRemoval(partition, passed_broadcast):
    """
    Removes stopwords from the "text" column.
    #partition: iterator over the rows of one partition.
    #passed_broadcast: broadcast wrapper around the stopword lookup-table.
    """
    # Now the broadcast is deserialized
    passed_stopwords = passed_broadcast.value
    for row in partition:
        yield [" ".join(word for word in row["text"].split(" ") if word not in passed_stopwords)]
# re-partitioning in order to get mapPartitions working
df = df.repartition(2)
# Now apply the method
df = df.select("text").rdd \
    .mapPartitions(lambda partition: stopwordRemoval(partition, stopwords)) \
    .toDF(["text"])
# Result
df.show()
#Result:
+-----------+
|       text|
+-----------+
|TESTA TESTB|
|TESTB TESTA|
+-----------+
Questions
Even though it works, I'm not quite sure if this is the right usage of broadcast variables. So my questions are:
Is the broadcast correctly executed when I pass it to mapPartitions in the demonstrated way?
Is using broadcasting within mapPartitions useful, since the stopwords would be distributed with the function to all nodes anyway (stopwords is never reused)?
The second question relates to this question, which partly answers my own. Anyhow, the specifics differ; that's why I've chosen to also ask this question.
Some time went by and I read some additional information which answered the question for me. Thus, I wanted to share my insights.
Question 1: Is the broadcast correctly executed when I pass it to mapPartitions in the demonstrated way?
First, it is of note that SparkContext.broadcast() returns a wrapper around the variable to broadcast, as can be read in the docs. This wrapper serializes the variable and adds the information to the execution graph so that this serialized form is distributed over the nodes. Calling the broadcast's .value attribute is the command to deserialize the variable again when it is used.
Additionally, the docs state:
After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v [the variable] is not shipped to the nodes more than once.
Secondly, I found several sources stating that this works with UDFs (User Defined Functions), e.g. here. mapPartitions() and udf()s should be considered analogous, since they both, in the case of pySpark, pass the data to a Python instance on the respective nodes.
Regarding this, here is the important part: deserialization has to be part of the Python function (udf() or whatever function is passed to mapPartitions()) itself, meaning that its .value attribute must not be passed as a function parameter.
Thus, the broadcast is done the right way here: the broadcast wrapper is passed as a parameter and the variable is deserialized inside stopwordRemoval().
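To make the contrast explicit, here is a hedged sketch reusing the names from the question (rdd stands for df.select("text").rdd; stopwordRemovalPlain is a hypothetical variant that takes a plain set instead of the wrapper):
# Right: the Broadcast wrapper is passed and .value is called inside the worker function
rdd.mapPartitions(lambda partition: stopwordRemoval(partition, stopwords))

# Not recommended: .value is resolved on the driver, so the whole set is
# pickled into every task closure and the broadcast mechanism is bypassed
driver_side_set = stopwords.value
rdd.mapPartitions(lambda partition: stopwordRemovalPlain(partition, driver_side_set))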
Question 2: Is using broadcasting within mapPartitions useful since the stopwords would be distributed with the function to all nodes anyway (stopwords is never reused)?
It is documented that there is only an advantage if serialization yields any value for the task at hand:
The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
This might be the case when you have a large reference dataset to broadcast to your cluster:
[...] to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
If this applies to your broadcast, broadcasting has an advantage.
I have a dataset which I index with batch['data'] to get my MxM image output. After I get the image I want to process it with some numpy operations. During this process I want the dataset to give me the image on the GPU and then move the output's device to the CPU.
My question is: is a chain of function calls in Python executed in order? And can I do this with
base = batch['data'].cuda().function().cpu()
And is this the same as:
base = batch['data'].cuda().function()
base.cpu()
Thanks in advance!
Well, the CPU(s) will do the same work, but the result is not the same.
base = batch['data'].cuda().cpu()
After that line, you have the output of cpu() stored in the variable called base.
base = batch['data'].cuda()
base.cpu()
After these two lines, you have the output of cuda() stored in the variable called base and you have forgotten the result of cpu().
is a chain of function calls in Python executed in order?
Yes, of course: the first method returns some object, and the next one is called on that returned object.
No, these pieces of code are not the same:
The first one assigns the return value of cpu to base
The second one throws this value away
Also, if you need the object returned by batch['data'].cuda(), then the first code will call cpu on it and potentially throw it away afterwards. The second one saves that object but gets rid of the result of calling cpu, which may not be desirable.
The same applies to writing batch['data'].cuda() versus tmp = batch['data']; base = tmp.cuda(): batch['data'] returns some object, and then .cuda can be called on that object.
As long as the functions return objects that have the methods you want to call, you can chain as many methods as you want: thing().a().b().c().d()
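A minimal sketch of the difference, assuming PyTorch with a CUDA device is available (abs() stands in for the question's function()):
import torch

batch = {'data': torch.rand(4, 4)}

# Chained: each call returns a new tensor and the next method is called on it,
# so base ends up on the CPU.
base = batch['data'].cuda().abs().cpu()
print(base.device)  # cpu

# Unchained: .cpu() returns a new tensor that is discarded here,
# so base itself stays on the GPU.
base = batch['data'].cuda().abs()
base.cpu()
print(base.device)  # cuda:0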
I used a loop to iterate over the TensorFlow graph and retrieve the values of constant Tensors.
I tried to find a similar way to retrieve the values of Variable tensors by iterating through the elements but I did not find any solution.
Here is a sample code:
The session run method is already invoked.
This loop iterates over the graph and retrieves the values of constant tensors:
import tensorflow as tf
from tensorflow.python.framework import tensor_util

const = {}
for n in tf.get_default_graph().as_graph_def().node:
    if 'Const' in n.name:
        if not n.attr["value"].tensor.tensor_shape.dim:
            const[n.name] = n.attr.get('value').tensor.int_val[0]
        else:
            const[n.name] = tensor_util.MakeNdarray(n.attr['value'].tensor)
The snippet:
n.attr.get('value').tensor.int_val[0]
gets the value of the constant if it is a single number;
otherwise, the statement below retrieves the values of the tensor and stores them in an ndarray:
tensor_util.MakeNdarray(n.attr['value'].tensor)
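A small sketch of the two cases (the constant names are only illustrative):
import tensorflow as tf
from tensorflow.python.framework import tensor_util

scalar = tf.constant(7, name="my_scalar")                  # single number
matrix = tf.constant([[1, 2], [3, 4]], name="my_matrix")   # tensor with a shape

for n in tf.get_default_graph().as_graph_def().node:
    if n.name == "my_scalar":
        print(n.attr['value'].tensor.int_val[0])                 # 7
    elif n.name == "my_matrix":
        print(tensor_util.MakeNdarray(n.attr['value'].tensor))   # [[1 2] [3 4]]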
So I tried this:
if 'Variable' in n.name:
    var = tensor_util.MakeNdarray(n.attr['value'].tensor)
I am aware that I can retrieve the values of the variables with session run() or eval() methods for the specified elements.
But here I would like to loop over the graph elements.
Related links:
How do I get the current value of a Variable?
https://www.tensorflow.org/guide/variables
How to access tensor_content values in TensorProto in TensorFlow?
Resolved:
After debugging and observing the TensorBoard graph I realized that only the Variable_weights/initial_value node has the actual values after the session run.
So the solution above works, since Variable_weights/initial_value is a constant TensorFlow node that is automatically created and is the input to the Assign node of the TensorFlow Variable.
Every Variable in the TensorFlow graph has four operations/nodes: Variable, Identity, Assign, and Const, which holds the initial_value.
At first, I thought that the 'Variable' node would return the values.
But eventually, I invoked the snippet below on the 'initial_value' node and got the values.
Solution:
if 'Variable_weights/initial_value' in n.name:
    var = tensor_util.MakeNdarray(n.attr['value'].tensor)
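For completeness, here is a minimal sketch of the resolved approach; it generalises the check to every */initial_value constant, which is an assumption about how your variables are named:
import tensorflow as tf
from tensorflow.python.framework import tensor_util

initial_values = {}
for n in tf.get_default_graph().as_graph_def().node:
    if n.name.endswith('initial_value'):
        initial_values[n.name] = tensor_util.MakeNdarray(n.attr['value'].tensor)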
I'd like to create a tf.data.Dataset.from_generator(...) dataset. I need to pass in a Python generator.
I would like to pass in a property of a previous dataset to the generator like so:
dataset = dataset.interleave(
    map_func=lambda x: tf.data.Dataset.from_generator(generator=lambda: gen(x), output_types=tf.int64),
    cycle_length=2
)
Where I define gen(...) to take a value (which is a pointer to some data such as a filename which gen knows how to access).
This fails because gen receives a tensor object, not a python/numpy value.
Is there a way to resolve the tensor object to a value inside of gen(...)?
The reason for interleaving the generators is so I can manipulate the list of data-pointers/filenames with other dataset operations such as .shuffle() and .repeat() without the need to bake those into the gen(...) function, which would be necessary if I started with the generator directly from the list of data-pointers/filenames.
I want to use the generator because a large number of data values will be generated per data-pointer/filename.
TensorFlow now supports passing tensor arguments to the generator:
def map_func(tensor):
    dataset = tf.data.Dataset.from_generator(generator, tf.float32, args=(tensor,))
    return dataset
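A self-contained sketch of how this fits together; the generator and the data pointers below are only illustrative, and args converts each tensor to a numpy value before the generator is called:
import tensorflow as tf

def gen(pointer):
    # pointer arrives as a numpy value here, not as a tensor
    for i in range(3):
        yield float(pointer) + i

def map_func(tensor):
    return tf.data.Dataset.from_generator(gen, tf.float32, args=(tensor,))

pointers = tf.data.Dataset.from_tensor_slices([1.0, 10.0])
dataset = pointers.interleave(map_func, cycle_length=2)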
The answer is indeed no. Here is a reference to a couple of relevant GitHub issues (open as of the time of this writing) for further developments on the question:
https://github.com/tensorflow/tensorflow/issues/13101
https://github.com/tensorflow/tensorflow/issues/16343
I want to create two different collections for summaries: one for training summaries and one for validation summaries.
So I can use two different merge_all operations to retrieve the values:
merge_all(key=tf.GraphKeys.SUMMARIES)
The tf.summary.scalar function can add its summary to a collection:
tf.summary.scalar(
    name,
    tensor,
    collections=None,
    family=None
)
How do I create a new collection for summaries?
It should be possible to use an arbitrary string value as the key. It might look something like this:
tf.summary.scalar('tag_a', ...)
tf.summary.scalar('tag_b', ..., collections=["foo"])
merged_a = tf.summary.merge_all()
merged_b = tf.summary.merge_all(key="foo")
writer_a = tf.summary.FileWriter(log_dir + '/collection_a')
writer_b = tf.summary.FileWriter(log_dir + '/collection_b')
for step in range(1000):
    summary_a, summary_b = sess.run([merged_a, merged_b], ...)
    writer_a.add_summary(summary_a, step)
    writer_b.add_summary(summary_b, step)
It's worth mentioning that normally people configure things so that there is one merge_all operation and multiple run calls, for example: https://github.com/tensorflow/tensorflow/blob/cf7c008ab150ac8e5edb3ed053d38b2919699796/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py#L142 Even if the summary series are broken up using the collections parameter, TensorBoard will still visually represent them as if they were different runs, since they are written by separate FileWriters to separate directories. Please also note that the name parameter corresponds to each chart (tag) within a run.
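For reference, a hedged sketch of that more common pattern, with a single merged op and one FileWriter per run directory (train_op, train_feed and test_feed are hypothetical placeholders):
merged = tf.summary.merge_all()
train_writer = tf.summary.FileWriter(log_dir + '/train', sess.graph)
test_writer = tf.summary.FileWriter(log_dir + '/test')

for step in range(1000):
    summary, _ = sess.run([merged, train_op], feed_dict=train_feed)
    train_writer.add_summary(summary, step)
    if step % 10 == 0:
        summary = sess.run(merged, feed_dict=test_feed)
        test_writer.add_summary(summary, step)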