Best practice to release memory after URL fetch on App Engine (Python)

My problem is how to best release the memory that the response of an asynchronous URL fetch needs on App Engine. Here is what I basically do in Python:
rpcs = []
for event in event_list:
    url = 'http://someurl.com'
    rpc = urlfetch.create_rpc()
    rpc.callback = create_callback(rpc)
    urlfetch.make_fetch_call(rpc, url)
    rpcs.append(rpc)

for rpc in rpcs:
    rpc.wait()
In my test scenario it does that for 1,500 requests. But I need an architecture that can handle even more within a short amount of time.
Then there is a callback function, which adds a task to a queue to process the results:
def event_callback(rpc):
    result = rpc.get_result()
    data = json.loads(result.content)
    taskqueue.add(queue_name='name', url='url', params={'data': data})
My problem is that I make so many concurrent RPC calls that the memory of my instance is exhausted: "Exceeded soft private memory limit with 159.234 MB after servicing 975 requests total"
I already tried three things:
del result
del data
and
result = None
data = None
and I ran the garbage collector manually after the callback function.
gc.collect()
But nothing seems to release the memory directly after a callback function has added the task to a queue, and therefore the instance crashes. Is there any other way to do it?

Wrong approach: put these URLs into a (push) queue, increase its rate to the desired value (default: 5/sec), and let each task handle one URL fetch (or a group of them). Please note that there is a safety limit of 3,000 URL-fetch API calls per minute (and one URL fetch might use more than one API call).

Use the task queue for the URL fetches as well: fan out to avoid exhausting memory, register named tasks, and pass the event_list cursor on to the next task. You might want to fetch and process in the same task in such a scenario instead of registering a new task for every result, especially if processing also includes datastore writes.
I also find that ndb makes these async solutions more elegant.
Check out Brett Slatkin's talk on scalable apps and perhaps pipelines.
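For illustration, a minimal sketch of that fan-out with deferred and an ndb cursor; the Event model, the fetch_and_process() helper, the batch size, and the queue name are placeholders, not from the question:

import json

from google.appengine.api import urlfetch
from google.appengine.ext import deferred, ndb

BATCH_SIZE = 50  # placeholder

class Event(ndb.Model):  # stand-in for whatever backs event_list
    url = ndb.StringProperty()

def fetch_and_process(event):
    # Placeholder: synchronous fetch plus whatever processing/datastore writes you need.
    result = urlfetch.fetch(event.url)
    data = json.loads(result.content)
    # ... process data here ...

def process_events(cursor=None):
    # Fetch one small batch plus a cursor instead of holding 1,500 RPCs at once.
    events, next_cursor, more = Event.query().fetch_page(BATCH_SIZE, start_cursor=cursor)
    for event in events:
        fetch_and_process(event)
    if more:
        # Hand the cursor to the next task; each task only ever holds one
        # batch of responses in memory.
        deferred.defer(process_events, cursor=next_cursor, _queue='events')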

Related

How to circumvent Django's req/resp cycle when updating it's internal state

I have a Django application that uses large in-memory data structures (due to performance constraints). This wouldn't be a problem, but I'm using Heroku, where if the Python web process takes more than 30 s to start, it is stopped because it's considered a timeout error. Because of the aforementioned problem, I've used a daemon process (a worker in Heroku) to handle the construction of the data structures and Redis to handle the message passing between processes.
When the worker finishes (approx. 1 minute), it stores the data structures (50 MB or so) in Redis.
And now comes the crux of the matter... Django follows the request/response paradigm and is synchronous. This implies a Django view should exist to handle the callback from the worker announcing it's done. Even if I use something fancier like Redis pub/sub, I'm still forced to evaluate the queue populated by a publisher inside a view.
How can I circumvent the necessity of using a Django view? Isn't there an async way of doing this?
Below is the solution where I use a pub/sub inside a view. This seems bad, but I can't think of another way.
views.py
...
# data_handler can enqueue tasks on the default queue
data_handler = DataHandler()
strict_redis = redis.from_url(settings.DEFAULT_QUEUE)
pub_sub = strict_redis.pubsub()

# this puts the job of constructing the large data structures
# on the default queue so a worker can pick it up. Being async,
# it returns with an empty set of data structures.
data_structures = data_handler.start()

pub_sub.subscribe(settings.FINISHED_DATA_STRUCTURES_CHANNEL)

@require_http_methods(['POST'])
def store_and_fetch(request):
    user_data = json.loads(request.body.decode('utf8'))
    message = pub_sub.get_message()
    if message:
        command = message['data'] if 'data' in message else ''
        if command == settings.FINISHED_DATA_STRUCTURES_INIT.encode('utf-8'):
            # this takes the data from redis and updates data_structures
            data_handler.update(data_structures)
    return HttpResponse(compute_response(user_data, data_structures))
Update: After working with this for multiple months, I can now say it's definitely better (and wiser) NOT to fiddle with Django's request/response cycle. There are things like Django RQ Scheduler or Celery that can do async tasks just fine. If you want to update the main web process after some repeatable job completes, it's simpler to use something like the Python requests package, sending a POST to the web process from the worker that did the scheduled job. This way we don't circumvent Django's mechanisms, and, more importantly, it's simpler overall.
Regarding the Heroku constraints I mentioned at the beginning of the post: at the moment I wrote this question I was quite a newbie with Heroku and didn't know much about the release phase. In the release phase we can set up all the complex logic we need for the main process. Thus, at the end of the release phase, we simply need to notify the web process, in the manner described above, and use some distributed memory buffer (even Redis will work just fine).
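As an illustration of that worker-to-web notification, here is a rough sketch; the endpoint URL and payload are hypothetical, not from the post:

import requests

WEB_PROCESS_URL = 'https://example-app.herokuapp.com/internal/data-ready/'  # placeholder

def notify_web_process(job_id):
    # A plain POST from the worker tells the web process that the data structures
    # are ready in Redis; the web process loads them when it handles this request.
    response = requests.post(WEB_PROCESS_URL, json={'job_id': job_id}, timeout=10)
    response.raise_for_status()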

Python: how many process can access to Database(PostgreSQL) table at the same time?

This is a simplified version of my code, in which each process crawls a link, gets the data, and stores it in the database in parallel.
import sys
import time
from multiprocessing import Pool

import requests

def crawl_and_save_data(url):
    while True:
        res = requests.get(url)
        price_list = res.json()
        if len(price_list) == 0:
            sys.exit()

        # Save all data in DB HERE
        # for price in price_list:
        #     Save price in PostgreSQL Database table (same table)

        until_date = convert_date_format(price_list[len(price_list) - 1]['candleDateTime'])
        time.sleep(1)

if __name__ == '__main__':
    # When executed with pure python
    pool = Pool()
    pool.map(
        crawl_and_save_data,
        get_bunch_of_url_list()
    )
The key point of this code is the

# Save all data in DB HERE
# for price in price_list:
#     Save price in PostgreSQL Database table (same table)

part, where each process accesses the same database table.
I wonder whether this kind of task prevents concurrency in my overall task.
Or is there a possibility of losing data because of the concurrent database accesses?
Or are all queries put into an I/O queue or something?
Need your advice. Thanks.
tl;dr - you should be fine, but the question doesn't include enough detail to answer definitively. You will need to run some tests, but you should expect to get a good amount of concurrency (a few dozen simultaneous writes) before things start to slow down.
Note though - as currently written, it seems like your workers will get the same URL over and over again, because of the while True loop that never breaks or exits. You detect whether the list is empty, but does the URL track state somehow? I would expect multiple, identical GETs to return the same data over and over...
As far as concurrency, that ultimately depends on -
The resources available to the database (memory, I/O, CPU)
The server-side resources consumed by each connection/operation.
That second point includes memory, etc., but also whether independent operations are competing for the same resources (are 10 different connections trying to update the same set of rows in the database?). Updating the same table is fine, more or less, because the database can use row-level locks.
Also note the difference between concurrency (how many things happen at once) and throughput (how many things happen within a period of time). Concurrency and throughput can relate to each other in counter-intuitive ways - it's not uncommon to see a situation where 1 process can do N operations per second, but M processes sustain far less than M x N operations per second, possibly even bringing the whole thing to a screeching halt (e.g., via a deadlock).
Thinking about your code snippet, here are some observations:
You are using multiprocessing.Pool, which uses sub-processes for concurrency and will work well for your case if you...
Make sure you open your connections in the sub-processes; trying to re-use a connection from the parent process will not work (a sketch follows below).
If you do nothing else to your code, you will be using a number of sub-processes equal to the number of cores on your db client machine.
This is a good starting point. If your function is CPU-bound, you really can't go higher. If your function is I/O-bound, the CPU will be idle waiting for I/O operations to return, and you can start ramping up the worker count in that case.
Thus, each sub-process will have a connection to the database, with some amount of server memory per connection.
This also means that each insert will be in its own isolated transaction, with no additional work on your part.
Given that, simple, append-only, row-by-row transactions should support relatively high concurrency and high throughput, again depending on how big and fast your DB server is.
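As a rough sketch of the "open the connection inside the sub-process" point (psycopg2, the DSN, the table, and the example values are assumptions, not from the question):

import psycopg2  # assumed driver; any DB-API driver behaves the same way
from multiprocessing import Pool

DSN = 'dbname=prices user=crawler'  # placeholder connection string

def crawl_and_save_data(url):
    # Open the connection inside the worker process, never in the parent.
    conn = psycopg2.connect(DSN)
    try:
        with conn, conn.cursor() as cur:
            # ... fetch price_list for this url here, then insert row by row;
            # each `with conn` block commits as its own transaction.
            cur.execute(
                'INSERT INTO price (url, candle_ts, value) VALUES (%s, %s, %s)',
                (url, '2020-01-01T00:00:00', 0.0),  # illustrative values only
            )
    finally:
        conn.close()

if __name__ == '__main__':
    with Pool() as pool:  # one connection per worker process
        pool.map(crawl_and_save_data, ['http://example.com/a', 'http://example.com/b'])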
Also, note that you are already queueing :) With no args, Pool() creates a number of child processes equal to os.cpu_count() (see the docs). If that's greater than the number of URLs in your collection, that collection is a queue of sorts, just not a durable one. If your master process dies, the list of URLs is gone.
Unrelated - unless you are worried about your URL fetches getting throttled, from a db perspective, there is no need for the time.sleep(1) statement.
Hope this helps.

Celery Task Grouping/Aggregation

I'm planning to use Celery to handle sending push notifications and emails triggered by events from my primary server.
These tasks require opening a connection to an external server (GCM, APS, email server, etc). They can be processed one at a time, or handled in bulk with a single connection for much better performance.
Often there will be several instances of these tasks triggered separately in a short period of time. For example, in the space of a minute, there might be several dozen push notifications that need to go out to different users with different messages.
What's the best way of handling this in Celery? It seems like the naïve way is to simply have a different task for each message, but that requires opening a connection for each instance.
I was hoping there would be some sort of task aggregator allowing me to process e.g. 'all outstanding push notification tasks'.
Does such a thing exist? Is there a better way to go about it, for example like appending to an active task group?
Am I missing something?
Robert
I recently discovered and have implemented the celery.contrib.batches module in my project. In my opinion it is a nicer solution than Tommaso's answer, because you don't need an extra layer of storage.
Here is an example straight from the docs:
A click counter that flushes the buffer every 100 messages, or every
10 seconds. Does not do anything with the data, but can easily be
modified to store it in a database.
from celery.contrib.batches import Batches

# Flush after 100 messages, or 10 seconds.
@app.task(base=Batches, flush_every=100, flush_interval=10)
def count_click(requests):
    from collections import Counter
    count = Counter(request.kwargs['url'] for request in requests)
    for url, count in count.items():
        print('>>> Clicks: {0} -> {1}'.format(url, count))
Be wary though: it works fine for my usage, but the documentation mentions that it is an "Experimental task class". This might deter some from using a feature with such a volatile description :)
An easy way to accomplish this is to write all the actions a task should take to persistent storage (e.g. a database) and let a periodic job do the actual processing in one batch (with a single connection), as in the sketch below.
Note: make sure you have some locking in place to prevent the queue from being processed twice!
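As an illustration of that buffer-then-flush pattern, here is a rough sketch with Celery; the broker URL, the sqlite outbox table, and the send_bulk() helper are all placeholders, not part of the original answer:

import sqlite3
from celery import Celery

app = Celery('notifications', broker='redis://localhost:6379/0')  # placeholder broker

DB_PATH = 'outbox.db'  # placeholder persistent buffer

@app.task
def queue_push_notification(user_id, message):
    # Instead of opening a connection to GCM/APNs per task, just persist the action.
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute('CREATE TABLE IF NOT EXISTS outbox (user_id TEXT, message TEXT)')
        conn.execute('INSERT INTO outbox VALUES (?, ?)', (user_id, message))

@app.task
def flush_push_notifications():
    # Run periodically (e.g. via celery beat) and drain the buffer in one batch,
    # holding a single connection to the external service.
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute('CREATE TABLE IF NOT EXISTS outbox (user_id TEXT, message TEXT)')
        rows = conn.execute('SELECT user_id, message FROM outbox').fetchall()
        conn.execute('DELETE FROM outbox')
    if rows:
        send_bulk(rows)  # hypothetical helper that owns the single connection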
There is a nice example of how to do something similar at the kombu level (http://ask.github.com/celery/tutorials/clickcounter.html).
Personally, I like the way Sentry does something like this to batch increments at the db level (the sentry.buffers module).

Python threading in web programming

I face a potential race condition in a web application:
# get the submissions so far from the cache
submissions = cache.get('user_data')

# add the data from this user to the local dict
submissions[user_id] = submission

# update the cached dict on server
submissions = cache.update('user_data', submissions)

if len(submissions) == some_number:
    ...
The logic is simple: we first fetch a shared dictionary stored in the cache of the web server, add the submission (delivered by each request to the server) to our local copy, and then update the cached copy by replacing it with this updated local copy. Finally, we do something else if we have received a certain number of pieces of data. Notice that
submissions = cache.update('user_data', submissions)
will return the latest copy of dictionary from the cache, i.e. the newly updated one.
Because the server may serve multiple requests (each in its own thread) at the same time, and all these threads access the shared dictionary in the cache as described above, there are potential race conditions.
I wonder, in the context of web programming, how should I efficiently handle threading to prevent race conditions in this particular case, without sacrificing too much performance. Some code examples would be much appreciated.
My preferred solution would be to have a single thread that modifies the submissions dict and a queue that feeds that thread. If you are paranoid, you can even expose a read-only view of the submissions dict. Using a queue and consumer pattern, you will not have a problem with locking.
Of course, this assumes that you have a web framework that will let you create that thread.
EDIT: multiprocess was not a good suggestion; removed.
EDIT: This sort of stuff is really simple in Python:
import threading, Queue

Stop = object()

def consumer(real_dict, queue):
    while True:
        try:
            item = queue.get(timeout=100)
            if item == Stop:
                break
            user, submission = item
            real_dict[user] = submission
        except Queue.Empty:
            continue

q = Queue.Queue()
thedict = {}
t = threading.Thread(target=consumer, args=(thedict, q))
t.start()
Then, you can try:
>>> thedict
{}
>>> q.put(('foo', 'bar'))
>>> thedict
{'foo': 'bar'}
>>> q.put(Stop)
>>> q.put(('baz', 'bar'))
>>> thedict
{'foo': 'bar'}
You appear to be transferring lots of data back and forth between your web application and your cache. That's already a problem. You're also right to be suspicious, since it would be possible for the pattern to be like this (remembering that sub is local to each thread):
Thread A                  Thread B                  Cache
------------------------------------------------------------------------
                                                    [A]=P, [B]=Q
sub = get()
[A]=P, [B]=Q
>>>> suspend
                          sub = get()
                          [A]=P, [B]=Q
                          sub[B] = Y
                          [A]=P, [B]=Y
                          update(sub)
                                                    [A]=P, [B]=Y
                          >>>> suspend
sub[A] = X
[A]=X, [B]=Q
update(sub)
                                                    [A]=X, [B]=Q   !!!!!!!!
This sort of pattern can happen for real, and it results in state getting wiped out. It's also inefficient because thread A should usually only need to know about its current user, not everything.
While you could fix this by great big gobs of locking, that would be horribly inefficient. So, you need to redesign so that you transfer much less data around, which will give a performance boost and reduce the amount of locking you need.
This is one of the more difficult questions to answer because it seems to be a bigger design problem.
One potential solution to this problem would be to have one well-defined place where this is updated. For instance, you might want to set up another service that's dedicated to updating the cache and nothing else. Alternatively, if these updates aren't time-sensitive, you may also want to consider using a task queue.
Another solution: you could give each item a separate key and store a list of the keys under a separate key. This doesn't necessarily solve the problem, but it does make it more manageable. Instead of worrying about separate threads overwriting the entire submissions cache, you just have to worry about them overwriting individual elements within it (see the sketch below).
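A rough sketch of the per-item-key idea, assuming a Django-style cache backend with add() and incr(); the key names and counter are illustrative only:

from django.core.cache import cache  # assumed backend; any memcache-style client is similar

def record_submission(user_id, submission, some_number):
    # Each submission gets its own key, so concurrent requests never
    # overwrite each other's data.
    cache.set('submission:%s' % user_id, submission)

    # Track the count with an atomic counter instead of re-writing a shared
    # dict; add() only creates the key if it is missing.
    cache.add('submission_count', 0)
    if cache.incr('submission_count') == some_number:
        # ... do the "we have enough data" work here
        pass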
If you have the time to add a new piece to your infrastructure, I'd highly recommend looking at Redis, more specifically Redis hashes[1]. The reason being that Redis handles this problem out of the box, with about the same speed as you'd get with memcache (although I definitely encourage you to benchmark it for yourself).
[1] Note: I just found this link through a quick Google search, and haven't verified it. I don't vouch for its correctness.
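For completeness, a minimal sketch of the Redis-hash variant with redis-py (the key names are made up for illustration):

import redis

r = redis.Redis()  # assumes a local Redis instance

def record_submission(user_id, submission, some_number):
    # HSET writes a single field of the hash atomically, so concurrent
    # requests cannot wipe out each other's submissions.
    r.hset('user_data', user_id, submission)

    # HLEN is atomic too, so the count is always consistent.
    if r.hlen('user_data') == some_number:
        # ... do the "we have enough data" work here
        pass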

AppEngine Task Queue API Calls increase on TaskAlreadyExistsError

I'm using deferred to put tasks in the default queue in a AppEngine app similar to this approach.
I'm naming the task with a timestamp that changes every 5 seconds. During that time a lot of calls are made to the queue with the same name, resulting in a TaskAlreadyExistsError, which is fine. The problem is that when I check the quotas, the "Task Queue API Calls" count increases for every call made, not only for those that are actually put in the queue.
I mean, if you look at the quota: Task Queue API Calls: 34,017 of 100,000, and compare it to the actual queue calls: /_ah/queue/deferred - 2.49K.
Here is the code that handles the queue:
try:
    deferred.defer(function_call, params, _name=task_name, _countdown=int(interval/2))
except (taskqueue.TaskAlreadyExistsError, taskqueue.TombstonedTaskError):
    pass
I suppose that is the way it works. Is there a good way to solve the problem with the quota? Can I use memcache to store the task_name and check whether the task has already been added, in addition to the above try/except? Or is there a way to check if the task already exists without using Task Queue API calls?
Thanks for pointing out that this is the case, because I hadn't realised it, but the same problem must be affecting me.
As far as I can see, yes, throwing a flag into memcache keyed on the task name should work fine, and if you want to reduce those hits on memcache you can also store the flag locally within the instance.
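Something like the following sketch is what I mean; the 5-second expiry mirrors the task-name window from the question, and the helper name is made up:

from google.appengine.api import memcache, taskqueue
from google.appengine.ext import deferred

def defer_once(task_name, function_call, params, interval):
    # memcache.add() is atomic: it returns True only when the key did not
    # already exist, so only the first caller in the window hits the Task Queue API.
    if memcache.add('deferred:%s' % task_name, True, time=5):
        try:
            deferred.defer(function_call, params,
                           _name=task_name, _countdown=int(interval / 2))
        except (taskqueue.TaskAlreadyExistsError, taskqueue.TombstonedTaskError):
            pass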
The "good way" to solve the quota problem is to eliminate destined-to-fail calls to the Task Queue API.
The _name you are using changes every 5 seconds, which might not be a bottleneck if you increase the execution rate of your task queue. But you also add tasks using _countdown.
