What I need, functionally: My dataset is partitioned in blocks, and each block sits in a binary file.
I have an algorithm that also operates on blocks to reduce computational complexity, and then merges the results together after visiting all blocks. It's important that a single minibatch of data originates from a single block, and that I know exactly which block, so I can pass some block-specific parameters into the graph. On the next iteration, when starting again at block 0, the next minibatch from each block should be used. Blocks can have unequal lengths and should repeat forever.
My current solution: Currently, I create one tf.data.Iterator per block (i.e. per file), built from a tf.data.FixedLengthRecordDataset:
# for every file:
ds = tf.data.FixedLengthRecordDataset(...)
ds = ds.repeat()
ds = ds.batch(...)
ds = ds.map(...)
ds = ds.prefetch(buffer_size=1)
it = ds.make_one_shot_iterator()
Then I have an "master" iterator that multiplexes between the file-level iterators. This is done through:
itr_handle = tf.placeholder(tf.string, shape=())
master_itr = tf.data.Iterator.from_string_handle(itr_handle, output_types)
master_next = master_itr.get_next()
So each time the graph is executed, I feed the placeholder the string handle of the iterator I want to use for that execution. This way every file-level iterator keeps its own state; when the same block-file is asked for the next minibatch, it actually returns the next minibatch, instead of reopening the file and simply returning the first minibatch again.
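For clarity, the per-step usage then looks roughly like this (a sketch only; file_level_iterators, block_id and sess are assumed names, not part of the code above):
# evaluate each file-level iterator's handle once, up front
handles = [sess.run(it.string_handle()) for it in file_level_iterators]

# for one training step: pick the block, feed its handle, fetch its next minibatch
minibatch = sess.run(master_next, feed_dict={itr_handle: handles[block_id]})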
The problem: Creating the file-level iterators is slow. It takes at least 200ms to create an Iterator per file. The dataset I use can easily contain up to 100 block-files, which causes TensorFlow/Python to be sitting there making these Iterator objects and graph nodes for 20 seconds, not actually processing any data.
Question:
Is there another approach to tackle this problem, for example one that uses only a single iterator?
If not, how can Iterator creation be sped up?
Related
I am trying to write a multiprocessed program, and it seems I have done so: I have verified with the System Monitor app that the Python processes are created. But it appears that almost none of them are actually being utilized. In my program I am trying to split audio files into chunks, so I don't consider this a "trivial computational load", as I have read in other threads.
A minimal example that shows the same behavior for me:
import os, random, time
from tqdm import tqdm
from multiprocessing import Pool

def myfunc(myli):
    print(len(myli))
    for item in myli:
        x = item * item * item
        time.sleep(2)
    return

mylist = [random.randint(1, 10000) for _ in range(0, 19999)]

with Pool(processes=8) as p, tqdm(total=len(mylist)) as pbar:
    for _ in p.imap_unordered(func=myfunc, iterable=(mylist,)):
        pbar.update()
As you can see, I added a print() inside the function, and every time it prints the length of the entire list, as if no splitting is happening.
I have naively tried using different chunksizes and removing tqdm (as if it plays any role).
If you could give me any insight, I would appreciate it.
The code is doing what you told it to do: you passed an iterable of length 1, a tuple containing a single item (mylist). So it passes that single item to a single worker to process.
But you can't do iterable=mylist instead, because myfunc() expects to get a sequence, not an integer. Whatever the iterable is, multiprocessing passes it to the worker one element at a time. chunksize has nothing to do with that. Whether chunksize is 1 or a billion, the worker functions see one element at a time. chunksize is an under-the-covers optimization, purely to reduce the number of expensive interprocess communication calls required.
If you want to split a sequence into chunks and use worker functions that expect chunks, then you have to do the "chunking" yourself. For example, add
# Generate slices of `xs` of length (at most) `n`.
def chunkit(xs, n):
    start = 0
    while start < len(xs):
        yield xs[start : start + n]
        start += n
and pass iterable=chunkit(mylist, 40). Then all 8 processes will be busy. One will work on mylist[0:40], another on mylist[40:80], another on mylist[80:120], and so on, until mylist is exhausted.
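For concreteness, the corrected call in the minimal example above might then look like this (a sketch; myfunc, mylist, and chunkit are the names already defined above):
with Pool(processes=8) as p:
    for _ in p.imap_unordered(myfunc, chunkit(mylist, 40)):
        pass  # each iteration corresponds to one finished chunk of (at most) 40 items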
I am new here but I wanted to ask something regarding multiprocessing.
So I have some huge raster tiles that I process to extract information, and I found that writing tons of pickle files is faster than appending to a dataframe. The point is that I loop over each of my tiles for processing, and I create pools inside a for loop.
# This creates a directory for my pickle files
if not os.path.exists('pkl_tmp'):
    os.mkdir('pkl_tmp')
Here I start looping over each one of my tiles and create a pool with the grid cells that I want to process, then I use the map function to apply all my nasty processing to each cell of my grid.
for GHSL_tile in ROI.iloc[4:].itertuples():
    ct += 1
    L18_cells = GHSL_query(GHSL_tile, L18_grid)
    vector_tile = poligonize_tile(GHSL_tile)
    print(datetime.today())
    subdir = './pkl_tmp/{}/'.format(ct)
    if not os.path.exists(subdir):
        os.mkdir(subdir)
    if vector_tile is not None:
        # assign how many cores will be used
        num_processes = int(multiprocessing.cpu_count() - 15)
        chunk_size = 1  # chunk size set to 1 to return cell-like outputs
        # break the dataframe into a list of chunks
        chunks = [L18_cells.iloc[i:i + chunk_size, :] for i in range(0, L18_cells.shape[0], chunk_size)]
        pool = multiprocessing.Pool(processes=num_processes)
        result = pool.map(process_cell, chunks)
        del result
    else:
        print('Tile # {} skipped'.format(ct))
print('GHSL database created')
In fact this has no errors; it takes around 2 days to execute due to the size of my data, and sometimes I have many idle cores (especially towards the end of a tile).
My question is:
I tried using map_async instead of map and it was creating files really fast, sometimes even processing multiple tiles at the same time, which is wonderful. The problem is that once it creates the directory for my last tile, the code exits the for loop and many tasks end up never being executed. What am I doing wrong? How can I make map_async work better, or how can I avoid idle cores (slow-down) when I use the map function?
Thank you in advance
PC resources are definitely not a problem.
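For reference, the difference in blocking behaviour described above comes down to this (a sketch only, reusing the process_cell, chunks, and num_processes names from the loop in the question): map waits for all tasks to finish, while map_async returns an AsyncResult immediately, so the enclosing loop moves on to the next tile unless the result is explicitly waited on.
pool = multiprocessing.Pool(processes=num_processes)

result = pool.map(process_cell, chunks)               # blocks until every chunk is finished
async_result = pool.map_async(process_cell, chunks)   # returns an AsyncResult immediately

# with map_async, nothing waits for the workers unless you do so explicitly:
async_result.get()   # block until all tasks are done (or: pool.close(); pool.join())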
I've searched probably 10 threads on multiprocessing, but nothing seems to fit my use case perfectly. Here is a general idea of what I want to parallelize.
class foo():
    def boo(self):
        filename = 'path to the data file'
        with reader(filename) as fileReader:
            for id, feature in fileReader:
                self.boo2(id, feature)

    def boo2(self, id, feature):
        # process feature, then save the output to a folder
        ...
Here I want to parallelize the call to boo2(), where fileReader is an iterator (a sequentialMatrixReader from pykaldi) with tens of thousands of rows of id and feature, where id is a string and each feature is a matrix (hundreds of rows x tens of columns). boo2 will compute a smaller matrix and save the result to a folder based on id. Each call to boo2 is independent of the others, so I want to parallelize it.
From my understanding I can't use multiprocessing.Pool, since boo2 is a class function and I can't pull it out of the class due to its complexity.
I don't know how to use multiprocessing.Process either, since the number of cores is much smaller than the number of rows of the iterator, and I am unsure how to queue new calls to boo2 once I've start()ed and join()ed the processes. (I've tried splitting the fileReader into n batches and setting up one Process per batch, but I'd much prefer to queue the calls in one line rather than in multiple batches.)
I've also looked into the pathos module, since it doesn't have problems with class functions. However, from the sample use cases, the closest fit to my need is:
pool = pathos.threading.ThreadPool()
pool.imap(boo2, [feature for feature in fileReader])
But because of how large fileReader is I am unable to fit [feature for feature in fileReader] in memory.
Any and all help is appreciated. Thank you.
You won't be able to use multiprocessing because of the class members; you need a separate function for that -- you're right about that.
Regarding threads, I'd suggest not using a simple comprehension like [feature for feature in fileReader], but instead reading the features from fileReader in batches sized to the CPU threads you have available, then running the threads, waiting for them to complete, reading the next batch, and so on.
Something like:
def make_next_batch(fileReader):
    batch = []
    for feature in fileReader:
        if len(batch) == BATCH_SIZE:
            yield batch
            batch = []
        batch.append(feature)
    if len(batch):
        yield batch
Then you have to keep only BATCH_SIZE features in memory at the same time.
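To make that concrete, here is one way the pieces might fit together. This is a sketch only: it uses the standard library's concurrent.futures thread pool rather than pathos, and it assumes the reader, boo2 and BATCH_SIZE names from the question and answer above.
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 8  # e.g. the number of worker threads available

class foo():
    def boo(self):
        filename = 'path to the data file'
        with reader(filename) as fileReader, \
             ThreadPoolExecutor(max_workers=BATCH_SIZE) as executor:
            for batch in make_next_batch(fileReader):  # each batch is a list of (id, feature) pairs
                # run boo2 on the whole batch in parallel, then wait before reading the next batch
                list(executor.map(lambda pair: self.boo2(*pair), batch))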
I'm changing my TensorFlow code from the old queue interface to the new Dataset API. With the old interface I could monitor the actual filled queue size by accessing a raw counter in the graph, e.g. as follows:
queue = tf.train.shuffle_batch(..., name="training_batch_queue")
queue_size_op = "training_batch_queue/random_shuffle_queue_Size:0"
queue_size = session.run(queue_size_op)
However, with the new Dataset API I can't seem to find any variables in the graph related to the queues / datasets, so my old code doesn't work anymore. Is there any way to obtain the number of items in the queue using the new Dataset API (e.g. in the tf.data.Dataset.prefetch or tf.data.Dataset.shuffle queue)?
It is important for me to monitor the number of items in the queue, as that tells me a lot about the behaviour of the pre-processing in the queues, including whether the pre-processing or the remainder (e.g. a neural network) is the speed bottleneck.
As a workaround, it is possible to keep a counter to indicate how many items are in the queue. Here's how to define the counter:
queue_size = tf.get_variable("queue_size", initializer=0,
                             trainable=False, use_resource=True)
Then, when pre-processing data (e.g. in the dataset.map function), we can increment that counter:
def pre_processing():
    data_size = ...  # compute this (could be just '1')
    queue_size_op = tf.assign_add(queue_size, data_size)  # adding items
    with tf.control_dependencies([queue_size_op]):
        ...  # do the actual pre-processing here
We can then decrement the counter every time we run our model with a batch of data:
def model():
    queue_size_op = tf.assign_add(queue_size, -batch_size)  # removing items
    with tf.control_dependencies([queue_size_op]):
        ...  # define the actual model here
Now, all we need to do is run the queue_size tensor in our training loop to find out what the current queue size is, i.e. the number of items in the queue at this moment:
current_queue_size = session.run(queue_size)
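If desired, this read can be folded into the training step itself (a sketch; train_op is an assumed name for whatever optimizer op the model defines):
_, current_queue_size = session.run([train_op, queue_size])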
It's a bit less elegant compared to the old way (before the Dataset API), but it does the trick.
I've got a function that takes a node id of a graph as input and calculates something in the graph (without altering the graph object), then saves the results to the filesystem. My code looks like this:
...
# graph file is being loaded
g = loadGraph(gfile='data/graph.txt')
# list of nodeids is being loaded
nodeids = loadSeeds(sfile='data/seeds.txt')
import multiprocessing as mp
# parallel part of the code
print ("entering the parallel part ..")
num_workers = mp.cpu_count() # 4 on my machine
p = mp.Pool(num_workers)
# _myParallelFunction(nodeid) {calculate something for nodeid in g and save it into a file}
p.map(_myParallelFunction, nodeids)
p.close()
...
The problem is that when I load the graph into Python it takes a lot of memory (about 2 GB; it's a big graph with thousands of nodes), but when execution reaches the parallel part of the code (the parallel map call) it seems that every process is given a separate copy of g, and I simply run out of memory on my machine (it has 6 GB of RAM and 3 GB of swap). So I wanted to ask: is there a way to give every process the same copy of g, so that only the memory to hold one copy is required? Any suggestions are appreciated, and thanks in advance.
If dividing the graph into smaller parts does not work, you may be able to find a solution using this or multiprocessing.sharedctypes, depending on what kind of object your graph is.
Your comment indicates that you are processing a single node at a time:
# _myParallelFunction(nodeid) {calculate something for nodeid in g and save it into a file}
I would create a generator function that returns a single node from the graph file each time it's called, and pass that generator to the p.map() function instead of the entire list of nodeids.
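A minimal sketch of that suggestion, assuming the seeds file simply holds one node id per line (the real format behind loadSeeds isn't shown in the question):
def iter_nodeids(sfile='data/seeds.txt'):
    # yield node ids one at a time instead of loading them all into a list
    with open(sfile) as f:
        for line in f:
            line = line.strip()
            if line:
                yield line

p = mp.Pool(num_workers)
p.map(_myParallelFunction, iter_nodeids())  # generator instead of the full nodeids list
p.close()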