I have an asynchronous measurement loop running in the background of my quart app which fills up the flask_caching FileSystemCache with measurement data like this:
i = 0
while True:
    data = np.ones(4)  # fake data
    cache.set(f'{i % N}', data)
    i += 1
    await asyncio.sleep(1 / freq)
This works fine and I can also read the data from the cache using cache.get(f'{i}'); however, if I set another variable in the cache, e.g. cache.set('volume', 100), it is lost some time after setting it. This time somehow depends on N and the chosen CACHE_THRESHOLD.
I do not know how to prevent this. I chose N much smaller than CACHE_THRESHOLD, but it still keeps deleting the 'volume' item. I have also set CACHE_DEFAULT_TIMEOUT = 0 to prevent automatic deletion. Interestingly, for N=100 and CACHE_THRESHOLD=1000 it does not delete the other items in the cache, but once both are increased by a factor of 10, something is lost.
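For reference, a minimal sketch of the cache configuration being described, assuming flask_caching's FileSystemCache backend (the cache directory and concrete values are assumptions, not the exact setup):

from flask_caching import Cache

# Hypothetical configuration matching the setup described above;
# cache.init_app(app) is still needed on the Quart/Flask app object.
cache = Cache(config={
    "CACHE_TYPE": "FileSystemCache",
    "CACHE_DIR": "/tmp/measurement_cache",  # assumed path
    "CACHE_THRESHOLD": 1000,                # max number of items before pruning
    "CACHE_DEFAULT_TIMEOUT": 0,             # 0 = do not expire entries by time
})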
I would like to know why this is happening and how I can make sure everything is kept in the cache even if I put a lot of entries into the cache, thanks!
I am using Dask for a complicated operation. First I do a reduction which produces a moderately sized df (a few MBs), which I then need to pass to each worker to calculate the final result, so my code looks a bit like this:
intermediate_result = ddf.reduction().compute()
final_result = ddf.reduction(
    chunk=function, chunk_kwargs={"intermediate_result": intermediate_result}
)
However, I am getting a warning message that looks like this:
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good

% (format_bytes(len(b)), s)
I have tried doing this
intermediate_result = client.scatter(intermediate_result, broadcast=True)
But this isn't working, as the function now receives a Future object rather than the DataFrame it expects.
I can't seem to find any documentation on how to use scatter with reductions. Does anyone know how to do this? Or should I just ignore the warning message and pass the moderately sized df as I am?
Actually, the best solution is probably not to scatter your materialised result, but to avoid computing it in the first place. You can simply remove the .compute(), which means the whole calculation gets done in one stage, with the results automatically moved to where you need them.
Alternatively, if you want to have a clear boundary between the stages, you can use
intermediate_result = ddf.reduction().persist()
which will kick off the reduction and store it on the workers without pulling it to the client. You can choose whether or not to wait for this to finish before the next step.
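As a rough sketch of that second approach (assuming a distributed client is already attached, and using wait only to make the stage boundary explicit):

from dask.distributed import wait

# Stage 1: run the reduction and keep the (small) result on the workers.
intermediate_result = ddf.reduction().persist()
wait(intermediate_result)  # optional: block here until stage 1 has finished

# Stage 2 then proceeds as before, without the intermediate result
# ever being pulled back to the client.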
I have dask arrays that represent frames of a video and want to create multiple video files. I'm using the imageio library which allows me to "append" the frames to an ffmpeg subprocess. So I may have something like this:
my_frames = [[arr1f1, arr1f2, arr1f3], [arr2f1, arr2f2, arr2f3], ...]
So each internal list represents the frames for one video (or product). I'm looking for the best way to send/submit frames to be computed while also writing frames to imageio as they complete (in order). To make it more complicated, the internal lists above are actually generators and can be 100s or 1000s of frames. Also keep in mind that, because of how imageio works, I think it needs to live in one single process. Here is a simplified version of what I have working so far:
for frame_arrays in frames_to_write:
    # 'frame_arrays' is [arr1f1, arr2f1, arr3f1, ...]
    future_list = _client.compute(frame_arrays)
    # key -> future
    future_dict = dict(zip(frame_keys, future_list))
    # future -> key
    rev_future_dict = {v: k for k, v in future_dict.items()}

    # write the current frame
    result_iter = as_completed(future_dict.values(), with_results=True)
    for future, result in result_iter:
        frame_key = rev_future_dict[future]
        # get the writer for this specific video and add a new frame
        w = writers[frame_key]
        w.append_data(result)
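For context, the writers dictionary used above could be built with imageio roughly like this (the file names, keys and fps are made up for illustration; the real keys would match frame_keys):

import imageio

# One ffmpeg-backed writer per output video, keyed the same way as frame_keys.
writers = {
    "video_a": imageio.get_writer("video_a.mp4", fps=30),  # hypothetical outputs
    "video_b": imageio.get_writer("video_b.mp4", fps=30),
}

# ... append frames as they are computed, then close every writer:
# for w in writers.values():
#     w.close()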
This works, and my actual code is reorganized from the above to submit the next frame while writing the current frame, so there is some benefit, I think. I'm thinking of a solution where the user says "I want to process X frames at a time", so I send 50 frames, write 50 frames, send 50 more frames, write 50 frames, etc.
My questions after working on this for a while:
When does result's data live in local memory? When it is returned by the iterator or when it is completed?
Is it possible to do something like this with the dask-core threaded scheduler so a user doesn't have to have distributed installed?
Is it possible to adapt how many frames are sent based on number of workers?
Is there a way to send a dictionary of dask arrays and/or use as_completed with the "frame_key" being included?
If I load the entire series of frames and submit them to the client/cluster I would probably kill the scheduler right?
Is using get_client() followed by Client() on ValueError the preferred way of getting the client (if not provided by the user)?
Is it possible to give dask/distributed one or more iterators that it pulls from as workers become available?
Am I being dumb? Overcomplicating this?
Note: This is kind of an extension to this issue that I made a while ago, but is slightly different.
After following a lot of the examples here I got the following:
try:
    # python 3
    from queue import Queue
except ImportError:
    # python 2
    from Queue import Queue
from threading import Thread


def load_data(frame_gen, q):
    for frame_arrays in frame_gen:
        future_list = client.compute(frame_arrays)
        for frame_key, arr_future in zip(frame_keys, future_list):
            q.put({frame_key: arr_future})
    q.put(None)


input_q = Queue(batch_size if batch_size is not None else 1)
load_thread = Thread(target=load_data, args=(frames_to_write, input_q,))
remote_q = client.gather(input_q)
load_thread.start()

while True:
    future_dict = remote_q.get()
    if future_dict is None:
        break

    # write the current frame
    # this should only be one element in the dictionary, but this is
    # also the easiest way to get access to the data
    for frame_key, result in future_dict.items():
        # frame_key = rev_future_dict[future]
        w = writers[frame_key]
        w.append_data(result)
    input_q.task_done()

load_thread.join()
This answers most of my questions that I had and seems to work the way I want in general.
Background
I need to send out a large batch of notifications to around 1 million devices, and I'm building it out using Google Cloud Functions.
In the current setup I enqueue each device token as a PubSub message that:
stores a pending notification in DataStore, used for keeping track of retries and success status
attempts to send the notification
marks the notification as either successful, or as failed if it has been retried enough times and still hasn't gone through
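Sketched very roughly, that per-token flow looks something like this (every name here, send_push, mark_status and so on, is a made-up placeholder, not my actual code):

# store_pending / send_push / mark_status / retries_exhausted are hypothetical helpers
def handle_token(token):
    store_pending(token)                  # record a pending notification in Datastore
    try:
        send_push(token)                  # attempt to send the notification
        mark_status(token, "success")     # mark it successful
    except Exception:
        if retries_exhausted(token):
            mark_status(token, "failed")  # or failed once the retry budget is used up
        else:
            raise                         # let PubSub redeliver the message and retry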
This works more or less fine and I get decent performance out of it, something like 1.5K tokens processed per second.
Issue
I want to keep track of the current progress of the whole job. Given that I know how many notifications I'm expecting to process, I want to be able to report something like x/1_000_000 processed and then consider it done when the sum of failures + successes is as much as what I wanted to process.
The DataStore documentation suggests not running a count on the entities themselves because it won't be performant, which I can confirm. I implemented a counter following their example documentation of a sharded counter, which I'm including at the end.
The issue I'm seeing is that it is both quite slow and very prone to returning 409 Contention errors, which makes my function invocations retry. This is not ideal given that the count itself is not essential to the process and there's only a limited retry budget per notification. In practice, the thing that fails most often is incrementing the counter, which happens at the end of the process; on retry this adds load from re-reading notifications to check their status, and it means I end up with a counter that is lower than the actual number of successful notifications.
I ran a quick benchmark using wrk and seem to get around 400 RPS out of incrementing the counter, with an average latency of 250 ms. This is quite slow compared to the notification logic itself, which does around 3 DataStore queries per notification and is presumably more complex than incrementing a counter. Combined with the contention errors, I end up with an implementation that I don't consider stable. I understand that Datastore usually auto-scales with continuous heavy usage, but this service is used very rarely, and then for the whole batch of tokens at once, so there would not be any previous traffic to scale it up.
Questions
Is there something I'm missing about the counter implementation that could be improved to make it less slow?
Is there a different approach I should consider to get what I want?
Code
The code that interacts with datastore
# imports and client setup implied by the code below
import hashlib
import random

from google.cloud import datastore

client = datastore.Client()

DATASTORE_READ_BATCH_SIZE = 100


class Counter():
    kind = "counter"
    shards = 2000

    @staticmethod
    def _key(namespace, shard):
        return hashlib.sha1(":".join([str(namespace), str(shard)]).encode('utf-8')).hexdigest()

    @staticmethod
    def count(namespace):
        keys = []
        total = 0
        for shard in range(Counter.shards):
            if len(keys) == DATASTORE_READ_BATCH_SIZE:
                counters = client.get_multi(keys)
                total = total + sum([int(c["count"]) for c in counters])
                keys = []
            keys.append(client.key(Counter.kind, Counter._key(namespace, shard)))
        if len(keys) != 0:
            counters = client.get_multi(keys)
            total = total + sum([int(c["count"]) for c in counters])
        return total

    @staticmethod
    def increment(namespace):
        key = client.key(Counter.kind, Counter._key(namespace, random.randint(0, Counter.shards - 1)))
        with client.transaction():
            entity = client.get(key)
            if entity is None:
                entity = datastore.Entity(key=key)
                entity.update({
                    "count": 0,
                })
            entity.update({
                "count": entity["count"] + 1,
            })
            client.put(entity)
This is called from a Google Cloud Function like so
from flask import abort, jsonify, make_response

from src.notify import FCM, APNS
from src.lib.datastore import Counter


def counter(request):
    args = request.args
    if args.get("platform"):
        Counter.increment(args["platform"])
        return

    return jsonify({
        FCM: Counter.count(FCM),
        APNS: Counter.count(APNS)
    })
This is used both for incrementing and reading the counts and is split by platform for iOS and Android.
In the end I gave up on the counter and started also saving the status of the notifications in BigQuery. The pricing is still reasonable as it's still per use, and the streaming version of data insertion seems to be fast enough that it doesn't cause me any issues in practice.
With this I can use a simple SQL query to count all the entities matching a batched job. This ends up taking around 3 seconds for all the entities which, compared to the alternative, is acceptable performance for me given that this is only for internal use.
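As a rough sketch of what the query side of that looks like with the BigQuery client library (the dataset, table, schema and job identifier are all made up for illustration):

from google.cloud import bigquery

bq = bigquery.Client()
job_id = "batch-2020-01-01"  # hypothetical batch identifier

# made-up project/dataset/table and schema
query = """
    SELECT status, COUNT(*) AS total
    FROM `my_project.notifications.status_events`
    WHERE job_id = @job_id
    GROUP BY status
"""
job = bq.query(
    query,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("job_id", "STRING", job_id)]
    ),
)
for row in job.result():
    print(row.status, row.total)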
I am running the same simulation in a loop with different parameters. Each simulation makes use of a pandas DataFrame (data) which is only read, never modified. Using ipyparallel (IPython parallel), I can put this DataFrame into the global variable space of each engine in my view before the simulations start:
view['data'] = data
The engines then have access to the DataFrame for all the simulations which get run on them. The process of copying the data (if pickled, data is 40MB) takes only a few seconds. However, it appears that if the number of simulations grows, memory usage grows very large. I imagine this shared data is getting copied for each task rather than just once for each engine. What's the best practice for sharing static read-only data from a client with engines? Copying it once per engine is acceptable, but ideally it would only have to be copied once per host (I have 4 engines on host1 and 8 engines on host2).
Here's my code:
from ipyparallel import Client
import pandas as pd

rc = Client()
view = rc[:]  # use all engines
view.scatter('id', rc.ids, flatten=True)  # So we can track which engine performed what task


def do_simulation(tweaks):
    """ Run simulation with specified tweaks """
    # Do sim stuff using the global data DataFrame
    return results, id, tweaks


if __name__ == '__main__':
    data = pd.read_sql("SELECT * FROM my_table", engine)

    threads = []  # store list of tweaks dicts
    for i in range(4):
        for j in range(5):
            for k in range(6):
                threads.append(dict(i=i, j=j, k=k))

    # Set up globals for each engine. This is the read-only DataFrame
    view['data'] = data
    ar = view.map_async(do_simulation, threads)

    # Our async results should pop up over time. Let's measure our progress:
    for idx, (results, id, tweaks) in enumerate(ar):
        print 'Progress: {}%: Simulation {} finished on engine {}'.format(100.0 * ar.progress / len(ar), idx, id)

        # Store results as a pickle for the future
        pfile = '{}_{}_{}.pickle'.format(tweaks['i'], tweaks['j'], tweaks['k'])
        # Save our results to a pickle file
        pd.to_pickle(results, out_file_path + pfile)

    print 'Total execution time: {} (serial time: {})'.format(ar.wall_time, ar.serial_time)
If simulation counts are small (~50), then it takes a while to get started, but I start to see progress print statements. Strangely, multiple tasks get assigned to the same engine and I don't see a response until all of those assigned tasks are completed for that engine. I would expect to see a response from enumerate(ar) every time a single simulation task completes.
If simulation counts are large (~1000), it takes a long time to get started. I see the CPUs throttle up on all engines, but no progress print statements appear for a long time (~40 mins), and when I do see progress, it appears that a large block (>100) of tasks went to the same engine, and it awaited completion from that one engine before providing any progress. When that one engine did complete, I saw the ar object provide new responses every 4 secs - this may have been the time delay to write the output pickle files.
Lastly, host1 also runs the ipcontroller task, and its memory usage goes up like crazy (a Python task shows using >6GB RAM, a kernel task shows using 3GB). The host2 engines don't really show much memory usage at all. What would cause this spike in memory?
I used this kind of logic in some code a couple of years ago, and this is what I ended up with. My code was something like:
shared_dict = {
    # big dict with ~10k keys, each with a list of dicts
}

balancer = engines.load_balanced_view()

with engines[:].sync_imports():  # your 'view' variable
    import pandas as pd
    import ujson as json

engines[:].push(shared_dict)

results = balancer.map(lambda i: (i, my_func(i)), id)
results_data = results.get()
If simulation counts are small (~50), then it takes a while to get started, but I start to see progress print statements. Strangely, multiple tasks get assigned to the same engine and I don't see a response until all of those assigned tasks are completed for that engine. I would expect to see a response from enumerate(ar) every time a single simulation task completes.
In my case, my_func() was a complex method in which I wrote lots of logging messages to a file, so those served as my print statements.
As for the task assignment, since I used load_balanced_view(), I left it to the library to find its way, and it did great.
If simulation counts are large (~1000), it takes a long time to get started. I see the CPUs throttle up on all engines, but no progress print statements appear for a long time (~40 mins), and when I do see progress, it appears that a large block (>100) of tasks went to the same engine, and it awaited completion from that one engine before providing any progress. When that one engine did complete, I saw the ar object provide new responses every 4 secs - this may have been the time delay to write the output pickle files.
As for the long startup time, I haven't experienced that, so I can't say anything about it.
I hope this casts some light on your problem.
PS: as I said in the comment, you could try multiprocessing.Pool. I don't think I have tried sharing big, read-only data as a global variable with it, but I would give it a try, because it seems like it should work.
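A minimal sketch of that multiprocessing.Pool idea (untested here, and assuming a platform that forks worker processes, so the module-level DataFrame is inherited rather than re-pickled per task; the file name and tweak list are made up):

import multiprocessing as mp
import pandas as pd

data = None  # read-only DataFrame, set once in the parent before the pool starts


def do_simulation(tweaks):
    # workers inherit the global `data` via fork and only read from it
    n_rows = len(data)  # stand-in for the real simulation logic
    return tweaks, n_rows


if __name__ == '__main__':
    data = pd.read_pickle('data.pickle')  # hypothetical source of the DataFrame
    tweak_list = [dict(i=i, j=j) for i in range(4) for j in range(5)]
    with mp.Pool(processes=4) as pool:
        results = pool.map(do_simulation, tweak_list)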
Sometimes you need to scatter your data grouped by a category, so that you are sure each subgroup will be entirely contained by a single cluster.
This is how I usually do it:
# Connect to the clusters
import ipyparallel as ipp
client = ipp.Client()
lview = client.load_balanced_view()
lview.block = True
CORES = len(client[:])

# Define the scatter_by function
def scatter_by(df, grouper, name='df'):
    sz = df.groupby([grouper]).size().sort_values().index.unique()
    for core in range(CORES):
        ids = sz[core::CORES]
        print("Pushing {0} {1}s into cluster {2}...".format(len(ids), grouper, core))
        client[core].push({name: df[df[grouper].isin(ids)]})

# Scatter the dataframe df grouping by `year`
scatter_by(df, 'year')
Notice that the function I'm suggesting makes sure each cluster hosts a similar number of observations, which is usually a good idea.
I had my django application configured with memcached and everything was working smoothly.
I am trying to populate the cache over time, adding to it as new data comes in from external API's. Here is the gist of what I have going on:
main view
api_query, more_results = apiQuery(**params)

cache_key = "mystring"
cache.set(cache_key, data_list, 600)

if more_results:
    t = Thread(target=apiMoreResultsQuery, args=(param1, param2, param3))
    t.daemon = True
    t.start()
more results function
cache_key = "mystring"
my_cache = cache.get(cache_key)
api_query, more_results = apiQuery(**params)
new_cache = my_cache + api_query
cache.set(cache_key, new_cache, 600)
if more_results:
apiMoreResultsQuery(param1, param2, param3)
This method works for several iterations through apiMoreResultsQuery, but at some point the cache returns None, causing the whole loop to crash. I've tried increasing the cache expiration but that didn't change anything. Why would the cache be vanishing all of a sudden?
For clarification, I am running apiMoreResultsQuery in a distinct thread because I need to return a response from the initial call faster than the full data set will populate, so I want to keep the populating going in the background while a response can still be returned.
When you set a particular cache key and the item you are setting is larger than the size allotted for a cached item, the set fails silently and cache.get for that key returns None. (I know this because I have been bitten by it.)
The memcached backend uses pickle to cache objects, so at some point new_cache is getting pickled and is simply larger than the size allotted for cached items.
The memcached default item size limit is 1MB, and you can increase it, but the bigger issue, which seems a bit odd, is that you are using the same key over and over again and your single cached item just gets bigger and bigger.
Wouldn't a better strategy be to set new items in the cache and to be sure that those items are small enough to be cached?
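A minimal sketch of that idea, assuming a hypothetical per-page counter so that each chunk of API results gets its own small key:

# `page` is a made-up counter, incremented on each apiMoreResultsQuery call
page = 3
cache.set("mystring:page:{}".format(page), api_query, 600)

# Later, reassemble only when needed:
chunks = [cache.get("mystring:page:{}".format(p)) for p in range(page + 1)]
results = [item for chunk in chunks if chunk for item in chunk]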
Anyway, if you want to see how large your item is growing, so you can test whether or not it's going to fit in the cache, you can do something like the following:
>>> import pickle
>>> some_object = [1, 2, 3]
>>> len(pickle.dumps(some_object, -1))
22
>>> new_object = list(range(1000000))
>>> len(pickle.dumps(new_object, -1))
4871352 # Wow, that got pretty big!
Note that this can grow a lot larger if you are pickling Django model instances, in which case it's probably better just to pickle the values you want from the instance.
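For example, something along these lines (the model and field names are made up), so only plain values end up in the pickled payload:

from django.core.cache import cache

# MyModel and the field names are hypothetical; .values() yields small, easily pickled dicts
rows = list(MyModel.objects.values("id", "name", "price"))
cache.set("mystring:rows", rows, 600)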
For more reading, see this other answer:
How to get the size of a python object in bytes on Google AppEngine?