I want to create two different collections for summaries: one for training summaries and one for validation summaries. That way I can use two different merge_all operations to retrieve the values:
merge_all(key=tf.GraphKeys.SUMMARIES)
The tf.summary.scalar op can be added to a collection via its collections argument:
tf.summary.scalar(
    name,
    tensor,
    collections=None,
    family=None
)
How do I create a new collection for summaries?
It should be possible to use an arbitrary string value as the collection key. It might look something like this:
tf.summary.scalar('tag_a', ...)
tf.summary.scalar('tag_b', ..., collections=["foo"])
merged_a = tf.summary.merge_all()
merged_b = tf.summary.merge_all(key="foo")
writer_a = tf.summary.FileWriter(log_dir + '/collection_a')
writer_b = tf.summary.FileWriter(log_dir + '/collection_b')
for step in range(1000):
    summary_a, summary_b = sess.run([merged_a, merged_b], ...)
    writer_a.add_summary(summary_a, step)
    writer_b.add_summary(summary_b, step)
It's worth mentioning that people normally configure things so that there is one merge_all operation and multiple run calls. For example: https://github.com/tensorflow/tensorflow/blob/cf7c008ab150ac8e5edb3ed053d38b2919699796/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py#L142 Even if the summary series are broken up using the collections parameter, TensorBoard will still visually represent them as if they were different runs. Please also note that the name parameter corresponds to each chart (tag) within a run.
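For reference, here is a minimal sketch of that more common single-merge_all pattern, loosely following the linked example (train_op, the feed dicts, and the 10-step validation cadence are assumptions, not part of the original):

merged = tf.summary.merge_all()
train_writer = tf.summary.FileWriter(log_dir + '/train', sess.graph)
test_writer = tf.summary.FileWriter(log_dir + '/test')

for step in range(1000):
    if step % 10 == 0:
        # Evaluate the same merged summary op on validation data
        summary = sess.run(merged, feed_dict=test_feed)  # hypothetical feed dict
        test_writer.add_summary(summary, step)
    else:
        # Train and record training summaries
        summary, _ = sess.run([merged, train_op], feed_dict=train_feed)  # hypothetical train_op / feed dict
        train_writer.add_summary(summary, step)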
I am currently working on a MILP formulation that I want to solve using Gurobi with a branch-and-cut approach. My model is a variation of a classic Pickup and Delivery Problem with Time Windows (PDPTW), for which several classes of valid inequalities are defined. As the branch-and-bound solver runs, I want to add those inequalities (i.e., I want to add cuts), if certain conditions in the current node are met. My issue is as follows:
My variables are defined as dictionaries, which makes them easy to use when formulating constraints because I can refer to them by their original indexing. An example of how I define variables is provided below:
tauOD = {}
# Start and end service times of trucks
for i in range(0, Nt):
    tauOD[i, 0] = model.addVar(lb=0.0, ub=truckODTime[i][0],
                               vtype=GRB.CONTINUOUS, name='tauOD[%s,%s]' % (i, 0))
    tauOD[i, 1] = model.addVar(lb=0.0, ub=truckODTime[i][1],
                               vtype=GRB.CONTINUOUS, name='tauOD[%s,%s]' % (i, 1))
Once my model is defined in terms of variables, constraints, and cost function, in a classic branch-and-bound setting I would simply call model.optimize() to start the process. In this case, I am using model.optimize(mycallback), where mycallback is the callback function I defined to add cuts. My issue is that the callback function, for some reason, does not like model variables defined as dictionaries. The only workaround I found is as follows:
model._vars = model.getVars() #---> added this call right before the optimization starts
model.optimize(mycallback)
and then inside the callback I can retrieve variables by their ordering, not their indices, as follows:
def mycallback(model, where):
    if where == GRB.Callback.MIPNODE:
        status = model.cbGet(GRB.Callback.MIPNODE_STATUS)
        # If the current node was solved to optimality, add cuts to strengthen
        # the linear relaxation
        if status == GRB.OPTIMAL:
            this_Sol = model.cbGetNodeRel(model._vars)  # Get the relaxation values of the current node
            # Adding a cut (a dummy cut, just for illustration purposes)
            model.cbCut(lhs=this_Sol[123] + this_Sol[125], sense=GRB.LESS_EQUAL, rhs=1)
The aforementioned cut is just a dummy example to show that I can add cuts using the order in which variables appear in my solution vector, not their indexing. For example, I would like to be able to write a constraint inside my callback as
x[0,3,0] + x[0,5,0] <= 1
but the only thing I can do is to write
this_Sol[123] + this_Sol[125] <= 1 (assuming x[0,3,0] is the 124th variable of my solution vector, and x[0,5,0] is the 126th). Although knowing the order of the variables is doable, since it depends on how I create them when setting up the model, it is a much more challenging and error-prone process than using the indices, as I do when defining the original constraints of my model (see below for an example):
###################
### CONSTRAINTS ###
###################
# For each truck, one active connection from the origin depot
for i in range(0, Nt):
    thisLHS = LinExpr()
    for j in range(0, sigma):
        thisLHS += x[0, j+1, i]
    thisLHS += x[0, 2*sigma+1, i]
    model.addConstr(lhs=thisLHS, sense=GRB.EQUAL, rhs=1,
                    name='C1_' + str(i))
Did any of you experience a similar problem? A friend of mine told me that Gurobi, for some reason, does not like variables defined as dictionaries inside a callback function, but I do not know how to work around this.
Any help would be greatly appreciated.
Thanks!
Alessandro
You should store references to your variable dicts on the model object. To be able to loop over the original variable indices inside the callback, you also have to store the index lists.
Try this:
model._I = model.I
model._J = model.J
model._K = model.K
model._x = model.x
You need these index lists so you can loop over each target variable x and verify some condition, just as you would when writing a normal constraint for your model.
Then, inside your callback, you can iterate over the indices:
def mycallback(model, where):
    if where == GRB.Callback.MIPNODE:
        # In a MIPNODE callback, query the node relaxation values
        # (cbGetSolution is only valid in MIPSOL callbacks)
        if model.cbGet(GRB.Callback.MIPNODE_STATUS) == GRB.OPTIMAL:
            x = model.cbGetNodeRel(model._x)
            for i in model._I:
                if sum(x[i, j, k] for j in model._J for k in model._K) > 1:
                    Add_the_cut()
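As a rough sketch (not the asker's actual model), the dummy cut from the question could then be written with the original indices instead of positional ones, assuming model._x holds the x variables keyed by (i, j, k) and the node relaxation was solved to optimality; the separation check and tolerance are illustrative:

def mycallback(model, where):
    if where == GRB.Callback.MIPNODE:
        if model.cbGet(GRB.Callback.MIPNODE_STATUS) == GRB.OPTIMAL:
            # Relaxation values keyed by the original (i, j, k) indices
            xval = {key: model.cbGetNodeRel(var) for key, var in model._x.items()}
            # Hypothetical separation check mirroring the dummy cut above
            if xval[0, 3, 0] + xval[0, 5, 0] > 1 + 1e-6:
                model.cbCut(model._x[0, 3, 0] + model._x[0, 5, 0] <= 1)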
Context
In pySpark I broadcast a variable to all nodes with the following code:
sc = spark.sparkContext # Get context
# Extract stopwords from a file in hdfs
# The result looks like stopwords = {"and", "foo", "bar" ... }
stopwords = set([line[0] for line in csv.reader(open(SparkFiles.get("stopwords.txt"), 'r'))])
# The set of stopwords is broadcasted now
stopwords = sc.broadcast(stopwords)
After broadcasting the stopwords, I want to make them accessible in mapPartitions:
# Some dummy-dataframe
df = spark.createDataFrame([("TESTA and TESTB", ), ("TESTB and TESTA", )], ["text"])
# The method which will be applied via mapPartitions
def stopwordRemoval(partition, passed_broadcast):
    """
    Removes stopwords from the "text" column.
    #partition: iterator object over the rows of the partition.
    #passed_broadcast: broadcast wrapper holding the stopword lookup set.
    """
    # Now the broadcast is unwrapped (deserialized)
    passed_stopwords = passed_broadcast.value
    for row in partition:
        yield [" ".join(word for word in row["text"].split(" ") if word not in passed_stopwords)]
# re-partitioning in order to get mapPartitions working
df = df.repartition(2)
# Now apply the method
df = df.select("text").rdd \
.mapPartitions(lambda partition: stopwordRemoval(partition, stopwords)) \
.toDF()
# Result
df.show()
#Result:
+------------+
|        text|
+------------+
| TESTA TESTB|
| TESTB TESTA|
+------------+
Questions
Even though it works, I'm not quite sure if this is the right usage of broadcast variables. So my questions are:
Is the broadcast correctly executed when I pass it to mapPartitions in the demonstrated way?
Is using broadcasting within mapPartitions useful, since the stopwords would be distributed with the function to all nodes anyway (stopwords is never reused)?
The second question relates to this question, which partly answers mine. However, the specifics differ; that's why I've chosen to also ask this question.
Some time went by and I read some additional information which answered the question for me. Thus, I wanted to share my insights.
Question 1: Is the broadcast correctly executed when I pass it to mapPartitions in the demonstrated way?
First, it is worth noting that SparkContext.broadcast() returns a wrapper around the variable to broadcast, as can be read in the docs. This wrapper serializes the variable and adds the information to the execution graph so that the serialized form is distributed over the nodes. Calling the broadcast's .value attribute deserializes the variable again when it is used.
Additionally, the docs state:
After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v [the variable] is not shipped to the nodes more than once.
Secondly, I found several sources stating that this works with UDFs (User Defined Functions), e.g. here. mapPartitions() and udf()s should be considered analogous, since in the case of pySpark they both pass the data to a Python instance on the respective nodes.
Regarding this, here is the important part: deserialization has to happen inside the Python function itself (the udf() or whatever function is passed to mapPartitions()), meaning that the .value attribute must not be passed as a function parameter.
Thus, the broadcast is done the right way here: the broadcast wrapper is passed as a parameter and the variable is deserialized inside stopwordRemoval().
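A minimal sketch of that distinction, with names adapted from the question (remove_stopwords and stopwords_bc are illustrative, not part of the original code):

# Recommended pattern: pass the Broadcast wrapper and unwrap it on the executor.
stopwords_bc = sc.broadcast(stopwords)

def remove_stopwords(partition, bc):
    lookup = bc.value  # deserialization happens here, on the executor
    for row in partition:
        yield [" ".join(w for w in row["text"].split(" ") if w not in lookup)]

rdd = df.select("text").rdd.mapPartitions(lambda p: remove_stopwords(p, stopwords_bc))

# In contrast, calling stopwords_bc.value on the driver and passing the resulting
# plain set into the lambda would ship the data inside the pickled closure,
# bypassing the broadcast mechanism entirely.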
Question 2: Is using broadcasting within mapPartitions useful, since the stopwords would be distributed with the function to all nodes anyway (stopwords is never reused)?
It is documented that there is only an advantage if the broadcast's caching behaviour yields a benefit for the task at hand:
The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
This might be the case when you have a large reference dataset to broadcast to your cluster:
[...] to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
If this applies to your broadcast, broadcasting has an advantage.
I'd like to create a tf.data.Dataset.from_generator(...) dataset. I need to pass in a Python generator.
I would like to pass in a property of a previous dataset to the generator like so:
dataset = dataset.interleave(
    map_func=lambda x: tf.data.Dataset.from_generator(generator=lambda: gen(x), output_types=tf.int64),
    cycle_length=2
)
Where I define gen(...) to take a value (which is a pointer to some data such as a filename which gen knows how to access).
This fails because gen receives a tensor object, not a python/numpy value.
Is there a way to resolve the tensor object to a value inside of gen(...)?
The reason for interleaving the generators is so I can manipulate the list of data-pointers/filenames with other dataset operations such as .shuffle() and .repeat() without the need to bake those into the gen(...) function, which would be necessary if I started with the generator directly from the list of data-pointers/filenames.
I want to use the generator because a large number of data values will be generated per data-pointer/filename.
TensorFlow now supports passing tensor arguments to the generator:
def map_func(tensor):
    dataset = tf.data.Dataset.from_generator(generator, tf.float32, args=(tensor,))
    return dataset
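Put together with the interleave pattern from the question, usage might look roughly like this (the filename dataset and the body of gen are assumptions; with args, the generator receives numpy values, so byte strings need decoding):

import tensorflow as tf

def gen(filename):
    # `filename` arrives as a numpy bytes object when passed via args
    filename = filename.decode("utf-8")
    for value in range(3):  # stand-in for reading many records from `filename`
        yield value

filenames = tf.data.Dataset.from_tensor_slices(["a.txt", "b.txt"])
dataset = filenames.shuffle(2).repeat().interleave(
    lambda fname: tf.data.Dataset.from_generator(gen, output_types=tf.int64, args=(fname,)),
    cycle_length=2)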
The answer is indeed no. Here are references to a couple of relevant GitHub issues (open as of the time of this writing) for further developments on the question:
https://github.com/tensorflow/tensorflow/issues/13101
https://github.com/tensorflow/tensorflow/issues/16343
I started using hypothesis to write my tests. I like it, but I am stuck on generating a certain kind of data.
I have a test that uses a list of data, where each element can be constructed from a (key, value) tuple.
Keys can be text, integers or floats; values can be anything comparable.
For one test, all keys must be of the same type, and all values must be of the same type.
The only way I found to generate the data I want is something like this:
@given(
    st.one_of(
        st.lists(st.tuples(st.integers(), st.integers())),
        st.lists(st.tuples(st.integers(), st.floats())),
        st.lists(st.tuples(st.integers(), st.text())),
        st.lists(st.tuples(st.floats(), st.integers())),
        st.lists(st.tuples(st.floats(), st.floats())),
        st.lists(st.tuples(st.floats(), st.text())),
        # ...
    ))
def test_stuff(lst):
    data = [Mydata(k, v) for k, v in lst]
    # ...
Is there a better way to generate all the data type combinations I want to test?
My preferred way to do this is to choose the key and value strategies in @given, then construct your strategy and draw from it inside your test. The "all keys must be of the same one of these types" requirement is unusual, but interactive data is very powerful:
@given(
    st.data(),
    key_st=st.sampled_from([st.integers(), st.floats(), st.text()]),
    value_st=st.sampled_from([st.integers(), st.floats(), st.text()]),
)
def test_stuff(data, key_st, value_st):
    test_data = data.draw(st.lists(st.builds(Mydata, key_st, value_st)))
    ...  # TODO: assert things about test_data
I've also used st.builds() rather than going via tuples - since we're calling it inside the test, any exceptions in Mydata will be (minimised) test failures rather than errors.
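If you'd rather keep everything inside @given, a flatmap-based variant of the same idea is sketched below; it first draws one key strategy and one value strategy, then builds the whole list from them (Mydata is the class from the question):

from hypothesis import given, strategies as st

# Each generated list shares a single key type and a single value type,
# because the strategies are chosen once before the list is built.
typed_lists = st.tuples(
    st.sampled_from([st.integers(), st.floats(), st.text()]),
    st.sampled_from([st.integers(), st.floats(), st.text()]),
).flatmap(lambda kv: st.lists(st.builds(Mydata, kv[0], kv[1])))

@given(typed_lists)
def test_stuff(test_data):
    ...  # assert things about test_data

Which of the two reads better is mostly a matter of taste; the data.draw() version above makes the chosen strategies visible as test arguments.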
For performance monitoring I would like to keep an eye on the number of currently queued examples. I am balancing the number of threads I'm using to fill the queue against the optimal maximum size of the queue.
How do I obtain this information? I am using tf.train.batch(), but I guess the information might be somewhere down in the FIFOQueue?
I would have expected this to be a local variable but I haven't found it.
tl;dr: if your queue is created by tf.train.batch, you can get its size with sess.run("batch/fifo_queue_Size:0")
A FIFOQueue object provides a size() method which creates an op that gives the number of elements in the queue. However, if you are using tf.train.batch, the FIFOQueue is created inside that method and the queue object is not exposed externally.
In particular, you can see this in input.py:
queue = _which_queue(dynamic_pad)(
    capacity=capacity, dtypes=types, shapes=shapes, shared_name=shared_name)
print("Enqueueing: ", enqueue_many, tensor_list, shapes)
_enqueue(queue, tensor_list, num_threads, enqueue_many)
summary.scalar("queue/%s/fraction_of_%d_full" % (queue.name, capacity),
               math_ops.cast(queue.size(), dtypes.float32) *
               (1. / capacity))
Since the queue is local, you can't get hold of its size() method. However, since size() has been called in order to construct the summary, the appropriate size op is in the graph, and you can call it by name. You can find the name of the node by doing something like this:
x = tf.constant(1)
q = tf.train.batch([x], 2)
tf.get_default_graph().as_graph_def()
You will see
node {
  name: "batch/fifo_queue_Size"
  op: "QueueSize"
  input: "batch/fifo_queue"
  attr {
    key: "_class"
    value {
      list {
From this you can tell that batch/fifo_queue_Size is the name of the op, and hence batch/fifo_queue_Size:0 is the name of the first output, so you can get the size by doing something like this:
sess.run("batch/fifo_queue_Size:0")
If you have multiple batch ops, the names will automatically be made unique: batch_1/fifo_queue_Size, batch_2/fifo_queue_Size, etc.
Alternatively, you can create the op with tf.train.batch(..., name="mybatch"), and then the name of the size tensor will be mybatch/fifo_queue_Size:0.
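Putting that together, a small sketch of monitoring the queue during training might look like this (TF 1.x queue-runner API; the size tensor name follows the convention described above and may differ across versions):

import tensorflow as tf

x = tf.constant(1)
# Naming the batch op gives the size tensor a predictable name
batched = tf.train.batch([x], batch_size=2, num_threads=4, capacity=100, name="mybatch")

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    for step in range(10):
        # Fetch the batch and the current queue size in the same run call
        _, queued = sess.run([batched, "mybatch/fifo_queue_Size:0"])
        print("step %d: %d examples currently queued" % (step, queued))
    coord.request_stop()
    coord.join(threads)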