Access number of queued items in the TensorFlow Dataset API - python

I'm changing my TensorFlow code from the old queue interface to the new Dataset API. With the old interface I could monitor the actual filled queue size by accessing a raw counter in the graph, e.g. as follows:
queue = tf.train.shuffle_batch(..., name="training_batch_queue")
queue_size_op = "training_batch_queue/random_shuffle_queue_Size:0"
queue_size = session.run(queue_size_op)
However, with the new Dataset API I can't seem to find any variables in the graph related to the queues / datasets, so my old code doesn't work anymore. Is there any way to obtain the number of items in the queue using the new Dataset API (e.g. in the tf.data.Dataset.prefetch or tf.data.Dataset.shuffle buffer)?
It is important for me to monitor the number of items in the queue, as that tells me a lot about the behaviour of the pre-processing in the queues, including whether the pre-processing or the remainder (e.g. a neural network) is the speed bottleneck.

As a workaround, it is possible to keep a counter that indicates how many items are in the queue. Here's how to define the counter:
queue_size = tf.get_variable("queue_size", initializer=0,
                             trainable=False, use_resource=True)
Then, when pre-processing data (e.g. in the dataset.map function), we can increment that counter:
def pre_processing():
    data_size = ...  # compute this (could be just '1')
    queue_size_op = tf.assign_add(queue_size, data_size)  # adding items
    with tf.control_dependencies([queue_size_op]):
        ...  # do the actual pre-processing here
We can then decrement the counter every time we run our model with a batch of data:
def model():
    queue_size_op = tf.assign_add(queue_size, -batch_size)  # removing items
    with tf.control_dependencies([queue_size_op]):
        ...  # define the actual model here
Now, all we need to do is run the queue_size tensor in our training loop to find out what the current queue size is, i.e. the number of items in the queue at this moment:
current_queue_size = session.run(queue_size)
It's a bit less elegant compared to the old way (before the Dataset API), but it does the trick.
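Putting those pieces together, here is a minimal end-to-end sketch of the same counter workaround (assuming a recent TF 1.x in graph mode, with a toy range dataset and a stand-in model in place of the real pre-processing and network):

import tensorflow as tf

batch_size = 32

# shared counter: incremented per element during pre-processing,
# decremented per batch when the model consumes data
queue_size = tf.get_variable("queue_size", initializer=0,
                             trainable=False, use_resource=True)

def pre_process(x):
    enqueue_op = tf.assign_add(queue_size, 1)  # one item entered the pipeline
    with tf.control_dependencies([enqueue_op]):
        return tf.identity(x)  # the real pre-processing would go here

dataset = (tf.data.Dataset.range(10000)
           .map(pre_process)
           .batch(batch_size)
           .prefetch(1))
# an initializable iterator is used because the map function captures a variable
iterator = dataset.make_initializable_iterator()
batch = iterator.get_next()

dequeue_op = tf.assign_add(queue_size, -batch_size)  # a whole batch left the pipeline
with tf.control_dependencies([dequeue_op]):
    model_output = tf.reduce_sum(batch)  # stand-in for the real model

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), iterator.initializer])
    for _ in range(5):
        sess.run(model_output)
        print("items currently buffered:", sess.run(queue_size))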

Related

Is there a way to work around "recursion depth reached in comparison" when parallelizing a loop within a loop?

What the problem is about:
I am building an agent-based model with mesa & networkx in Python. In one sentence, the model tries to capture how changes in an agent's attitude can influence whether or not they decide to adopt a new technology. I'm currently attempting to parallelize part of it to speed up the run time. There are currently 4000 agents. I keep hitting the following error message:
'if i < 256
Recursion depth reached in comparison'
The pseudo-code below outlines the process (after which I explain what I've tried and how it failed).
Initializes a model of 4000 agents.
Gives each agent a set of agents to interact with at every time step, at two levels: a) geographic neighbours, b) 3 social circles.
For each interaction pair in the list, the agents' attitudes are compared and some modifications to the attitudes are made.
This process repeats for several time-steps, with results of one step carrying over to another.
import pandas as pd
import multiprocessing as mp
import dill
from pathos.multiprocessing import ProcessingPool


def model_initialization():
    df = pd.read_csv(path + '4000_household_agents.csv')
    for agent in df:
        model.schedule.add(agent)
        # assign three circles of influence
        agent.social_circle1 = social_circle1
        agent.social_circle2 = social_circle2
        agent.social_circle3 = social_circle3


def assign_interactions():
    for agent in schedule.agents:
        # geographic neighbours
        neighbours = agent.get_neighbhours()
        for neighbhour in neighbours:
            interaction_list.append((agent, neighbhour))
        # interactions in circles of influence
        interaction_list.append((agent, social_circle1))
        interaction_list.append((agent, social_circle2))
        interaction_list.append((agent, social_circle3))
    return interaction_list


def attitude_change(agent1, agent2):
    # compare attitudes
    if agent1.attitude > agent2.attitude:
        # make some change to attitudes
        agent1.attitude -= 0.2
        agent2.attitude += 0.2
    return agent1.attitude, agent2.attitude


def interactions(interaction):
    agent1 = interaction[0]
    agent2 = interaction[1]
    agent1.attitude, agent2.attitude = attitude_change(agent1, agent2)


def main():
    model_initialization()
    interaction_list = assign_interactions()
    # pool = mp.Pool(10)
    pool = ProcessingPool(10)
    # the interaction list can contain at least 89,000 interactions
    results = pool.map(interactions, [interaction for interaction in interaction_list])


# run this process several times
for i in range(12):
    main()
What I've tried
Because the model step is sequential, the only part I can parallelize is the interactions() function. Since the interaction loop is called more than 90,000 times, I raised sys.setrecursionlimit() to about 100,000. That still fails.
I have broken the interaction_list into several chunks of 500 each and pooled the processes for each chunk (sketched below). Same error.
To see if something was fundamentally wrong, I took only the first 35 elements (a small number) of the interaction list and ran just that. It still hits the recursion depth.
Can anyone help me see which part of the code hits the recursion depth? I tried both dill + multiprocessing as well as multiprocessing alone. The latter gives a 'pickling error'.
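For reference, the chunked attempt mentioned above might have looked roughly like this (a hypothetical sketch; CHUNK_SIZE is an assumed name, while interactions and interaction_list mirror the code above):

from pathos.multiprocessing import ProcessingPool

CHUNK_SIZE = 500

def run_in_chunks(interaction_list):
    pool = ProcessingPool(10)
    all_results = []
    # process the interaction list in fixed-size slices instead of all at once
    for start in range(0, len(interaction_list), CHUNK_SIZE):
        chunk = interaction_list[start:start + CHUNK_SIZE]
        all_results.extend(pool.map(interactions, chunk))
    return all_results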

multiprocessing a large dataset through a complex class function with an iterator

I've searched probably 10 threads on multiprocessing, but nothing seems to fit my use case perfectly. Here is a general idea of what I want to parallelize.
class foo():
    def boo(self):
        filename = 'path to the data file'
        with reader(filename) as fileReader:
            for id, feature in fileReader:
                self.boo2(id, feature)

    def boo2(self, id, feature):
        # process feature, then save the output to a folder
        ...
Here I want to parallelize the call to boo2(), where fileReader is an iterator (a sequentialMatrixReader from pykaldi) with tens of thousands of rows of id and feature, where id is a string and each feature is a matrix (hundreds of rows x tens of columns). boo2 will compute a smaller matrix and save the result to a folder based on id. Calls to boo2 are independent of one another, so I want to parallelize them.
From my understanding I can't use multiprocessing.Pool, since boo2 is a class function and I can't pull it out of the class due to its complexity.
I don't know how to use multiprocessing.Process, since the number of cores is much less than the number of rows of the iterator, and I am unsure how to queue new calls to boo2 once I've start()ed and join()ed processes (I've tried splitting the fileReader into n batches and setting a Process per batch, but I'd much prefer to queue the calls in one line vs. multiple batches).
I've also looked into the pathos module, since it doesn't have problems with class functions. However, from the sample use cases, the closest fit to my need is:
pool = pathos.threading.ThreadPool()
pool.imap(boo2, [feature for feature in fileReader])
But because of how large fileReader is I am unable to fit [feature for feature in fileReader] in memory.
Any and all help is appreciated. Thank you.
You won't be able to use multiprocessing because of the class members -- you're right that you'd need a separate function for that.
Regarding threads, I'd suggest not using a simple comprehension [feature for feature in fileReader], but instead reading the features from fileReader in batches sized according to the CPU threads you have available, then running the threads, waiting for them to complete, and then reading the next batch, and so on.
Something like:
def make_next_batch(fileReader):
    batch = []
    for feature in fileReader:
        if len(batch) == BATCH_SIZE:
            yield batch
            batch = []
        batch.append(feature)
    if len(batch):
        yield batch
Then you have to keep only BATCH_SIZE features in memory at the same time.
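As a rough illustration of that batch-by-batch approach (not part of the original answer; it assumes each item yielded by fileReader is an (id, feature) pair and that obj is the foo instance, so boo2 stays a bound method):

from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 8  # e.g. the number of CPU threads available

with ThreadPoolExecutor(max_workers=BATCH_SIZE) as executor:
    for batch in make_next_batch(fileReader):
        # one task per (id, feature) pair; wait for the whole batch to finish
        futures = [executor.submit(obj.boo2, id_, feature)
                   for id_, feature in batch]
        for f in futures:
            f.result()  # re-raises any exception from boo2

Because these are threads inside one process, the class instance stays in that process, which also sidesteps the pickling problem, and only BATCH_SIZE items are ever held in memory at once.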

Submit dask arrays to distributed client while using results at the same time

I have dask arrays that represent frames of a video and want to create multiple video files. I'm using the imageio library, which allows me to "append" frames to an ffmpeg subprocess. So I may have something like this:
my_frames = [[arr1f1, arr1f2, arr1f3], [arr2f1, arr2f2, arr2f3], ...]
So each internal list represents the frames for one video (or product). I'm looking for the best way to send/submit frames to be computed while also writing frames to imageio as they complete (in order). To make it more complicated, the internal lists above are actually generators and can hold hundreds or thousands of frames. Also keep in mind that, because of how imageio works, I think the writer needs to live in a single process. Here is a simplified version of what I have working so far:
for frame_arrays in frames_to_write:
    # 'frame_arrays' is [arr1f1, arr2f1, arr3f1, ...]
    future_list = _client.compute(frame_arrays)
    # key -> future
    future_dict = dict(zip(frame_keys, future_list))
    # future -> key
    rev_future_dict = {v: k for k, v in future_dict.items()}

    # write the current frame
    result_iter = as_completed(future_dict.values(), with_results=True)
    for future, result in result_iter:
        frame_key = rev_future_dict[future]
        # get the writer for this specific video and add a new frame
        w = writers[frame_key]
        w.append_data(result)
This works, and my actual code is reorganized from the above to submit the next frame while writing the current one, so I think there is some benefit. I'm thinking of a solution where the user says "I want to process X frames at a time", so I send 50 frames, write 50 frames, send 50 more frames, write 50 frames, etc.
My questions after working on this for a while:
When does result's data live in local memory? When it is returned by the iterator or when it is completed?
Is it possible to do something like this with the dask-core threaded scheduler so a user doesn't have to have distributed installed?
Is it possible to adapt how many frames are sent based on number of workers?
Is there a way to send a dictionary of dask arrays and/or use as_completed with the "frame_key" being included?
If I load the entire series of frames and submit them to the client/cluster I would probably kill the scheduler right?
Is using get_client() followed by Client() on ValueError the preferred way of getting the client (if not provided by the user)?
Is it possible to give dask/distributed one or more iterators that it pulls from as workers become available?
Am I being dumb? Overcomplicating this?
Note: This is kind of an extension to this issue that I made a while ago, but is slightly different.
After following a lot of the examples here I got the following:
try:
    # python 3
    from queue import Queue
except ImportError:
    # python 2
    from Queue import Queue
from threading import Thread


def load_data(frame_gen, q):
    for frame_arrays in frame_gen:
        future_list = client.compute(frame_arrays)
        for frame_key, arr_future in zip(frame_keys, future_list):
            q.put({frame_key: arr_future})
    q.put(None)


input_q = Queue(batch_size if batch_size is not None else 1)
load_thread = Thread(target=load_data, args=(frames_to_write, input_q,))
remote_q = client.gather(input_q)
load_thread.start()

while True:
    future_dict = remote_q.get()
    if future_dict is None:
        break

    # write the current frame
    # there should only be one element in the dictionary, but this is
    # also the easiest way to get access to the data
    for frame_key, result in future_dict.items():
        w = writers[frame_key]
        w.append_data(result)
    input_q.task_done()

load_thread.join()
This answers most of the questions I had and, in general, seems to work the way I want.

Multiplex between Iterators in TensorFlow

What I need, functionally: My dataset is partitioned in blocks, and each block sits in a binary file.
I have an algorithm that also operates on blocks to reduce computational complexity, and then merges the results together after visiting all blocks. It's important that a single minibatch of data originates from a single block, and that I know exactly which block, so that I can pass some block-specific parameters into the graph. On the next iteration, when starting again at block 0, the next minibatch from each block should be used. Blocks can have unequal lengths and should repeat forever.
My current solution: Currently, I create a tf.data.Iterator per block (i.e. per file) from a tf.data.FixedLengthRecordDataset:
# for every file:
ds = tf.data.FixedLengthRecordDataset(...)
ds = ds.repeat()
ds = ds.batch(...)
ds = ds.map(...)
ds = ds.prefetch(buffer_size=1)
it = ds.make_one_shot_iterator()
Then I have an "master" iterator that multiplexes between the file-level iterators. This is done through:
itr_handle = tf.placeholder(tf.string, shape=())
master_itr = tf.data.Iterator.from_string_handle(itr_handle, output_types)
master_next = master_itr.get_next()
So each time the graph is executed, I pass into the placeholder the string handle of the respective iterator I want to use for this execution. This way every file-level iterator still has its own state; so when the same block-file is asked for the next minibatch, it returns effectively the next minibatch, instead of reopening the file, and simply returning the first minibatch again.
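For completeness, the feeding step might look roughly like this (a sketch only; file_level_iterators, num_steps and sess are assumed names for the per-file iterators, the step count and the session):

# fetch each file-level iterator's string handle once, up front
handles = [sess.run(it.string_handle()) for it in file_level_iterators]

for step in range(num_steps):
    block_id = step % len(handles)  # e.g. round-robin over the blocks
    minibatch = sess.run(master_next,
                         feed_dict={itr_handle: handles[block_id]})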
The problem: Creating the file-level iterators is slow. It takes at least 200ms to create an Iterator per file. The dataset I use can easily contain up to 100 block-files, which causes TensorFlow/Python to be sitting there making these Iterator objects and graph nodes for 20 seconds, not actually processing any data.
Question:
Is there another approach to tackle this problem, for example with only one iterator?
If not, how can Iterator creation be sped up?

Setting App Engine mapreduce shard size

Does the App Engine Mapreduce API decide the shard size according to its own logic in the final reduce job?
I am using the App Engine mapreduce API and have supplied the shard_size kwarg to set my mapreduce shard size.
The shard size is particularly important in my mapreduce job because I don't want to batch too many results into any one given execution of the final step of my reduce function. In other words, I'm hardcoding the shard size to evenly divide the users up according to an external constraint on the system.
The map job seems to shard out just fine, but the reducer uses only a fraction of the shards I've designated.
Here is a rough outline of the sort of code I am dealing with:
SHARD_SIZE = 42

def map_fun(entity):
    shard_key = random.randint(1, SHARD_SIZE)
    yield (
        shard_key,
        db.model_to_protobuf(entity).SerializeToString().encode('base64')
    )

def reduce_fun(key, entities):
    batch = []
    for entity in entities:
        # check for stuff
        batch.append(entity)
    expensive_side_effect(batch)

class MyGreatPipeline(base_handler.PipelineBase):
    def run(self, *args, **kw):
        yield mapreduce_pipeline.MapreducePipeline(
            'label',
            'path.to.map_fun',
            'path.to.reduce_fun',
            'mapreduce.input_readers.DatastoreInputReader',
            'mapreduce.output_writers.BlobstoreOutputWriter',
            mapper_params={
                'entity_kind': 'path.to.entity',
                'queue_name': 'coolQueue'
            },
            reducer_params={},
            shard_size=SHARD_SIZE
        )
map_fun specifically assigns each entity a shard that is determined randomly according to the shard size. I'm confused about why my reducer would have fewer shards than SHARD_SIZE given that there are many entities and it is exceedingly unlikely that the same integers were picked repeatedly.
I'm puzzling over what you're doing here. Using the map phase to group stuff onto a small, sharded key, then processing those keys at reduce time, looks odd. You're going to end up with too much work to do per key, even if you engage as many reduce workers as mapper workers.
The 'batch' being processed is randomly arbitrary, so I assume that expensive_side_effect() isn't dependent on the content of the batch. Why not do that work at map time instead, emitting something that a reducer could pass straight through to the output writer?
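A rough sketch of that restructuring (illustrative only; it assumes expensive_side_effect can return something serializable per entity, and the exact emit format depends on the output writer you use):

def map_fun(entity):
    # do the expensive work per entity (or per small group) at map time
    result = expensive_side_effect([entity])
    yield (str(entity.key()), result)

def reduce_fun(key, values):
    # the reducer simply passes results through to the output writer
    for value in values:
        yield value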
