Does the App Engine Mapreduce API decide compute shard size according to its own logic in the final reduce job?
I am using the App Engine mapreduce API and have supplied the shard_size
kwarg to set my mapreduce shard size.
The shard size is particularly important in my mapreduce job because I don't want to batch too many results into any one given execution of the final step of my reduce function. In other words, I'm hardcoding the shard size to evenly divide the users up according to an external constraint on the system.
The map job seems to shard out just fine, but the reducer uses only a fraction of the shards I've designated.
Here is a rough outline of the sort of code I am dealing with:
SHARD_SIZE = 42

def map_fun(entity):
    shard_key = random.randint(1, SHARD_SIZE)
    yield (
        shard_key,
        db.model_to_protobuf(entity).SerializeToString().encode('base64')
    )

def reduce_fun(key, entities):
    batch = []
    for entity in entities:
        # check for stuff
        batch.append(entity)
    expensive_side_effect(batch)

class MyGreatPipeline(base_handler.PipelineBase):
    def run(self, *args, **kw):
        yield mapreduce_pipeline.MapreducePipeline(
            'label',
            'path.to.map_fun',
            'path.to.reduce_fun',
            'mapreduce.input_readers.DatastoreInputReader',
            'mapreduce.output_writers.BlobstoreOutputWriter',
            mapper_params={
                'entity_kind': 'path.to.entity',
                'queue_name': 'coolQueue'
            },
            reducer_params={},
            shard_size=SHARD_SIZE
        )
map_fun specifically assigns each entity a shard that is determined randomly according to the shard size. I'm confused about why my reducer would have fewer shards than SHARD_SIZE given that there are many entities and it is exceedingly unlikely that the same integers were picked repeatedly.
I'm puzzling over what you're doing here. Using the map phase to group entities onto a small set of sharded keys and then processing those keys at reduce time looks odd: you're going to end up with a lot of work to do per key, even if you do engage as many reduce workers as map workers.
The 'batch' being processed is arbitrary (it's determined by a random key), so I assume that expensive_side_effect() isn't dependent on the content of the batch. Why not do that work at map time instead, emitting something that the reducer could simply pass through to the output writer?
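For illustration, here is a minimal sketch of that idea, reusing the names from the question. It assumes expensive_side_effect() can be applied per entity (or per small, locally accumulated batch) inside the mapper; the pass-through reducer and the use of the entity key as the emit key are my own assumptions, not part of the original code:

def map_fun(entity):
    # Do the expensive work at map time (assumption: it can run per entity),
    # so the reduce phase has nothing heavy left to do.
    expensive_side_effect([entity])
    yield (
        str(entity.key()),
        db.model_to_protobuf(entity).SerializeToString().encode('base64')
    )

def reduce_fun(key, values):
    # Pass-through reducer: hand the serialized entities straight to the output writer.
    for value in values:
        yield value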
I have a function which I will run using multiprocessing. However, the function returns a value and I do not know how to store that value once it's done.
I read somewhere online about using a queue, but I don't know how to implement it or whether it would even work.
import os
from multiprocessing import Process

cores = []
for i in range(os.cpu_count()):
    cores.append(Process(target=processImages, args=(dataSets[i],)))

for core in cores:
    core.start()

for core in cores:
    core.join()
Where the function 'processImages' returns a value. How do I save the returned value?
In your code fragment you have input dataSets which is a list of some unspecified size. You have a function processImages which takes a dataSet element and apparently returns a value you want to capture.
cpu_count == dataset length ?
The first problem I notice is that os.cpu_count() drives the range of values i which then determines which datasets you process. I'm going to assume you would prefer these two things to be independent. That is, you want to be able to crunch some X number of datasets and you want it to work on any machine, having anywhere from 1 - 1000 (or more...) cores.
An aside about CPU-bound work
I'm also going to assume that you have already determined that the task really is CPU-bound, so it makes sense to split the work by core. If, instead, your task is disk I/O-bound, you would want more workers. You could also be memory-bound or cache-bound. If optimal parallelization is important to you, you should consider running some trials to see which number of workers really gives you maximum performance.
Here's more reading if you like
Pool class
Anyway, as mentioned by Michael Butscher, the Pool class simplifies this for you. Yours is a standard use case. You have a set of work to be done (your list of datasets to be processed) and a number of workers to do it (in your code fragment, your number of cores).
TLDR
Use those simple multiprocessing concepts like this:
import os
from multiprocessing import Pool

# Renaming this variable just for clarity of the example here
work_queue = dataSets

# This is the number you might want to find experimentally. Or just run with cpu_count()
worker_count = os.cpu_count()

# This will create the worker processes (fork) and join them all for you behind the scenes
worker_pool = Pool(worker_count)

# Farm out the work, gather the results. Does not care whether dataset count equals cpu count
processed_work = worker_pool.map(processImages, work_queue)

# Do something with the results
print(processed_work)
You cannot return a variable from another process. The recommended way is to create a Queue (multiprocessing.Queue), have your subprocesses put their results onto that queue, and read them back once they're done -- this works well if you have a lot of results.
If you just need a single number, using Value or Array could be easier.
Just remember, you cannot use a plain variable for that; it has to be wrapped with the above-mentioned classes from the multiprocessing lib.
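For example, a minimal sketch of the Queue approach, keeping the processImages and dataSets names from the question (the small worker wrapper is my own addition, not part of the original code):

import os
from multiprocessing import Process, Queue

def worker(dataset, results_queue):
    # Call the real function and put its return value onto the shared queue.
    results_queue.put(processImages(dataset))

if __name__ == '__main__':
    results_queue = Queue()
    cores = []
    for i in range(os.cpu_count()):
        cores.append(Process(target=worker, args=(dataSets[i], results_queue)))
    for core in cores:
        core.start()
    # Collect one result per process; get() blocks until a result is available.
    # Draining the queue before join() avoids blocking on large results.
    results = [results_queue.get() for _ in cores]
    for core in cores:
        core.join()
    print(results)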
If you want to use the result object returned by a multiprocessing pool, try this:
from multiprocessing.pool import ThreadPool
def fun(fun_argument1, ..., fun_argumentn):
    <blabla>
    return object_1, object_2
pool = ThreadPool(processes=number_of_your_process)
async_num1 = pool.apply_async(fun, (fun_argument1, ... , fun_argumentn))
object_1, object_2 = async_num1.get()
then you can do whatever you want.
I've searched probably 10 threads on multiprocessing, but nothing seems to fit my use case perfectly. Here is a general idea of what I want to parallelize.
class foo():
    def boo(self):
        filename = 'path to the data file'
        with reader(filename) as fileReader:
            for id, feature in fileReader:
                self.boo2(id, feature)

    def boo2(self, id, feature):
        # process feature, then save the output to a folder
        ...
Here I want to parallelize the call to boo2(), where fileReader is an iterator (a sequentialMatrixReader from pykaldi) with tens of thousands of rows of id and feature, where id is a string and each feature is a matrix (hundreds of rows x tens of columns). boo2 will compute a smaller matrix and save the result to a folder based on id. Each call to boo2 is independent of the others, so I want to parallelize it.
From my understanding I can't use multiprocessing.Pool since boo2 is a class method and I can't pull it out of the class due to its complexity.
I don't know how to use multiprocessing.Process either, since the number of cores is much smaller than the number of rows in the iterator, and I am unsure how to queue new calls to boo2 once I've start()ed and join()ed the processes (I've tried splitting fileReader into n batches and creating one Process per batch, but I'd much prefer to queue the calls in one line rather than in multiple batches).
I've also looked into the pathos module since it doesn't have problems with class methods. However, from the sample use cases, the closest fit to my need is:
pathos.threading.ThreadPool().imap(boo2, [feature for feature in fileReader])
But because of how large fileReader is I am unable to fit [feature for feature in fileReader] in memory.
Any and all help is appreciated. Thank you.
You won't be able to use multiprocessing here because of the class methods; you need a separate function for that -- you're right about that.
Regarding using threads, I'd suggest not using a simple comprehension like [feature for feature in fileReader], but instead reading the features from fileReader in batches sized according to the CPU threads you have available, then running the threads, waiting for completion, reading the next batch, and so on.
Something like:
def make_next_batch(fileReader):
    batch = []
    for feature in fileReader:
        if len(batch) == BATCH_SIZE:
            yield batch
            batch = []
        batch.append(feature)
    if len(batch):
        yield batch
Then you only have to keep BATCH_SIZE features in memory at a time.
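For illustration, a sketch of how those batches might be consumed with a thread pool, under the assumption that each element yielded by fileReader is an (id, feature) pair as in the question; BATCH_SIZE and the boo2 method are the ones discussed above, while process_all is a hypothetical helper:

from multiprocessing.pool import ThreadPool

BATCH_SIZE = 8  # assumption: roughly the number of CPU threads available

def process_all(obj, fileReader):
    # 'obj' is an instance of the question's class; threads can call its bound
    # method directly because they all share the same process memory.
    pool = ThreadPool(BATCH_SIZE)
    for batch in make_next_batch(fileReader):
        # Each batch element is an (id, feature) pair, so starmap unpacks it
        # into boo2(id, feature). Waiting for each batch to finish keeps at
        # most BATCH_SIZE features in memory at a time.
        pool.starmap(obj.boo2, batch)
    pool.close()
    pool.join()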
Background
I need to send out a large batch of notifications, to around 1 million devices, and I'm building it out using Google Cloud Functions.
In the current setup I enqueue each device token as a PubSub message that:
stores a pending notification in DataStore, used for keeping track of retries and success status
attempts to send the notification
marks the notification as either successful or failed if it's retried enough and hasn't gone through
This works more or less fine and I get decent performance out of it, around 1.5K tokens processed per second.
Issue
I want to keep track of the current progress of the whole job. Given that I know how many notifications I'm expecting to process, I want to be able to report something like x/1_000_000 processed, and then consider the job done when the sum of failures and successes equals the number I wanted to process.
The DataStore documentation suggests not running a count on the entities themselves because it won't be performant, which I can confirm. I implemented a counter following their example documentation of a sharded counter which I'm including at the end.
The issue I'm seeing is that the counter is both quite slow and very prone to returning 409 Contention errors, which make my function invocations retry. That is not ideal, given that the count itself is not essential to the process and there is only a limited retry budget per notification. In practice the thing that fails most often is incrementing the counter, which happens at the end of the process; this increases the load from notification reads (to check their status on retry) and means I end up with a counter that is lower than the actual number of successful notifications.
I ran a quick benchmark using wrk and seem to get around 400 RPS out of incrementing the counter, with an average latency of 250 ms. This is quite slow compared to the notification logic itself, which does around 3 Datastore queries per notification and is presumably more complex than incrementing a counter. Combined with the contention errors, I end up with an implementation that I don't consider stable. I understand that Datastore usually auto-scales with continuous heavy usage, but this usage pattern is very rare and tied to the whole batch of tokens, so there would be no previous traffic to scale it up.
Questions
Is there something I'm missing about the counter implementation that could be improved to make it less slow?
Is there a different approach I should consider to get what I want?
Code
The code that interacts with datastore
import hashlib
import random

from google.cloud import datastore

client = datastore.Client()

DATASTORE_READ_BATCH_SIZE = 100


class Counter():
    kind = "counter"
    shards = 2000

    @staticmethod
    def _key(namespace, shard):
        return hashlib.sha1(":".join([str(namespace), str(shard)]).encode('utf-8')).hexdigest()

    @staticmethod
    def count(namespace):
        keys = []
        total = 0
        for shard in range(Counter.shards):
            if len(keys) == DATASTORE_READ_BATCH_SIZE:
                counters = client.get_multi(keys)
                total = total + sum([int(c["count"]) for c in counters])
                keys = []
            keys.append(client.key(Counter.kind, Counter._key(namespace, shard)))
        if len(keys) != 0:
            counters = client.get_multi(keys)
            total = total + sum([int(c["count"]) for c in counters])
        return total

    @staticmethod
    def increment(namespace):
        key = client.key(Counter.kind, Counter._key(namespace, random.randint(0, Counter.shards - 1)))
        with client.transaction():
            entity = client.get(key)
            if entity is None:
                entity = datastore.Entity(key=key)
                entity.update({
                    "count": 0,
                })
            entity.update({
                "count": entity["count"] + 1,
            })
            client.put(entity)
This is called from a Google Cloud Function like so
from flask import abort, jsonify, make_response

from src.notify import FCM, APNS
from src.lib.datastore import Counter


def counter(request):
    args = request.args
    if args.get("platform"):
        Counter.increment(args["platform"])
        return

    return jsonify({
        FCM: Counter.count(FCM),
        APNS: Counter.count(APNS)
    })
This is used both for incrementing and reading the counts and is split by platform for iOS and Android.
In the end I gave up on the counter and started also saving the status of the notifications in BigQuery. The pricing is still reasonable since it's pay-per-use, and streaming inserts seem to be fast enough that they don't cause me any issues in practice.
With this I can use a simple SQL query to count all the entities belonging to a batched job. That ends up taking around 3 seconds for all the entities which, compared to the alternative, is acceptable performance for me given that this is only for internal use.
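For illustration, a sketch of what that count query might look like with the BigQuery Python client. The dataset, table and column names (notifications.notification_status, batch_id, status) are hypothetical placeholders, not taken from the actual setup:

from google.cloud import bigquery

client = bigquery.Client()

# Count processed notifications for one batch, grouped by outcome.
query = """
    SELECT status, COUNT(*) AS total
    FROM `my_project.notifications.notification_status`
    WHERE batch_id = @batch_id
    GROUP BY status
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("batch_id", "STRING", "some-batch-id")]
)
for row in client.query(query, job_config=job_config).result():
    print(row.status, row.total)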
I'm changing my TensorFlow code from the old queue interface to the new Dataset API. With the old interface I could monitor the actual filled queue size by accessing a raw counter in the graph, e.g. as follows:
queue = tf.train.shuffle_batch(..., name="training_batch_queue")
queue_size_op = "training_batch_queue/random_shuffle_queue_Size:0"
queue_size = session.run(queue_size_op)
However, with the new Dataset API I can't seem to find any variables in the graph related to the queues / datasets, so my old code doesn't work anymore. Is there any way to obtain the number of items in the queue using the new Dataset API (e.g. in the tf.Dataset.prefetch or tf.Dataset.shuffle queue)?
It is important for me to monitor the number of items in the queue, as that tells me a lot about the behaviour of the pre-processing in the queues, including whether the pre-processing or the remainder (e.g. a neural network) is the speed bottleneck.
As a work around it is possible to keep a counter to indicate how many items are in the queue. Here's how to define the counter:
queue_size = tf.get_variable("queue_size", initializer=0,
                             trainable=False, use_resource=True)
Then, when pre-processing data (e.g. in the dataset.map function), we can increment that counter:
def pre_processing():
    data_size = ...  # compute this (could be just '1')
    queue_size_op = tf.assign_add(queue_size, data_size)  # adding items
    with tf.control_dependencies([queue_size_op]):
        # do the actual pre-processing here
        ...
We can then decrement the counter every-time we run our model with a batch of data:
def model():
    queue_size_op = tf.assign_add(queue_size, -batch_size)  # removing items
    with tf.control_dependencies([queue_size_op]):
        # define the actual model here
        ...
Now, all we need to do is run the queue_size tensor in our training loop to find out what the current queue size is, i.e. the number of items in the queue at this moment:
current_queue_size = session.run(queue_size)
It's a bit less elegant compared to the old way (before the Dataset API), but it does the trick.
I am running the same simulation in a loop with different parameters. Each simulation makes use of a pandas DataFrame (data) which is only read, never modified. Using ipyparallel (IPython parallel), I can put this DataFrame into the global variable space of each engine in my view before the simulations start:
view['data'] = data
The engines then have access to the DataFrame for all the simulations which get run on them. The process of copying the data (if pickled, data is 40MB) takes only a few seconds. However, it appears that as the number of simulations grows, memory usage grows very large. I imagine this shared data is getting copied for each task rather than just for each engine. What's the best practice for sharing static read-only data from a client with engines? Copying it once per engine is acceptable, but ideally it would only have to be copied once per host (I have 4 engines on host1 and 8 engines on host2).
Here's my code:
from ipyparallel import Client
import pandas as pd

rc = Client()
view = rc[:]  # use all engines
view.scatter('id', rc.ids, flatten=True)  # so we can track which engine performed what task

def do_simulation(tweaks):
    """ Run simulation with specified tweaks """
    # Do sim stuff using the global data DataFrame
    return results, id, tweaks

if __name__ == '__main__':
    data = pd.read_sql("SELECT * FROM my_table", engine)

    threads = []  # store list of tweaks dicts
    for i in range(4):
        for j in range(5):
            for k in range(6):
                threads.append(dict(i=i, j=j, k=k))

    # Set up globals for each engine. This is the read-only DataFrame
    view['data'] = data

    ar = view.map_async(do_simulation, threads)

    # Our async results should pop up over time. Let's measure our progress:
    for idx, (results, id, tweaks) in enumerate(ar):
        print 'Progress: {}%: Simulation {} finished on engine {}'.format(100.0 * ar.progress / len(ar), idx, id)

        # Store results as a pickle for the future
        pfile = '{}_{}_{}.pickle'.format(tweaks['i'], tweaks['j'], tweaks['k'])

        # Save our results to a pickle file
        pd.to_pickle(results, out_file_path + pfile)

    print 'Total execution time: {} (serial time: {})'.format(ar.wall_time, ar.serial_time)
If simulation counts are small (~50), it takes a while to get started, but I do start to see progress print statements. Strangely, multiple tasks get assigned to the same engine and I don't see a response until all of the tasks assigned to that engine have completed. I would expect to see a response from enumerate(ar) every time a single simulation task completes.
If simulation counts are large (~1000), it takes a long time to get started. I see the CPUs throttle up on all engines, but no progress print statements appear for a long time (~40 minutes), and when I do see progress, it appears that a large block (>100) of tasks went to the same engine and the loop awaited completion from that one engine before providing any progress. When that one engine did complete, I saw the ar object provide new responses every 4 seconds - this may have been the time it took to write the output pickle files.
Lastly, host1 also runs the ipcontroller task, and its memory usage goes up like crazy (a Python task shows using >6GB RAM, a kernel task shows using 3GB). The host2 engines don't really show much memory usage at all. What would cause this spike in memory?
I used this kind of logic in some code a couple of years ago, and this is how I got it working. My code was something like:
shared_dict = {
    # big dict with ~10k keys, each with a list of dicts
}

balancer = engines.load_balanced_view()

with engines[:].sync_imports():  # your 'view' variable
    import pandas as pd
    import ujson as json

engines[:].push(shared_dict)

results = balancer.map(lambda i: (i, my_func(i)), id)
results_data = results.get()
If simulation counts are small (~50), then it takes a while to get started, but I start to see progress print statements. Strangely, multiple tasks will get assigned to the same engine and I don't see a response until all of those assigned tasks are completed for that engine. I would expect to see a response from enumerate(ar) every time a single simulation task completes.
In my case, my_func() was a complex method where I wrote lots of logging messages to a file, so those were my 'print statements'.
About the task assignment: as I used load_balanced_view(), I left it to the library to find its way, and it did great.
If simulation counts are large (~1000), it takes a long time to get started, I see the CPUs throttle up on all engines, but no progress print statements are seen until a long time (~40 mins), and when I do see progress, it appears a large block (>100) of tasks went to the same engine, and awaited completion from that one engine before providing some progress. When that one engine did complete, I saw the ar object provided new responses every 4 secs - this may have been the time delay to write the output pickle files.
About the long startup time, I haven't experienced that myself, so I can't say anything.
I hope this casts some light on your problem.
PS: as I said in the comment, you could try multiprocessing.Pool. I admit I haven't tried sharing big, read-only data as a global variable with it, but I would give it a try, because it seems to work; a sketch of that idea follows.
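For illustration, a hedged sketch of that multiprocessing.Pool idea: the DataFrame is handed to each worker once via a pool initializer and stored as a module-level global, so it is copied once per worker process rather than once per task. do_simulation, data and the tweaks dicts here are simplified stand-ins for the ones in the question:

from multiprocessing import Pool

import pandas as pd

_data = None  # per-worker global, set once by the initializer


def init_worker(data):
    # Runs once in each worker process; stores the read-only DataFrame globally.
    global _data
    _data = data


def do_simulation(tweaks):
    # Use the worker-global DataFrame; it is only read, never modified.
    n_rows = len(_data)  # ... real simulation logic using _data and tweaks ...
    return tweaks, n_rows


if __name__ == '__main__':
    data = pd.DataFrame({'x': range(10)})  # stand-in for the real DataFrame
    tweak_list = [dict(i=i, j=j, k=k) for i in range(4) for j in range(5) for k in range(6)]
    with Pool(processes=4, initializer=init_worker, initargs=(data,)) as pool:
        results = pool.map(do_simulation, tweak_list)
    print(len(results), 'simulations finished')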
Sometimes you need to scatter your data grouped by a category, so that you are sure each subgroup will be entirely contained by a single cluster.
This is how I usually do it:
# Connect to the clusters
import ipyparallel as ipp
client = ipp.Client()
lview = client.load_balanced_view()
lview.block = True
CORES = len(client[:])

# Define the scatter_by function
def scatter_by(df, grouper, name='df'):
    sz = df.groupby([grouper]).size().sort_values().index.unique()
    for core in range(CORES):
        ids = sz[core::CORES]
        print("Pushing {0} {1}s into cluster {2}...".format(len(ids), grouper, core))
        client[core].push({name: df[df[grouper].isin(ids)]})

# Scatter the dataframe df grouping by `year`
scatter_by(df, 'year')
Notice that the function I'm suggesting makes sure each cluster will host a similar number of observations, which is usually a good idea.