How to count results from many GAE tasks? - python

I run many tasks that fetch and process some information. After each task run I get an integer which indicates how many portions of the information were processed.
I would like to get the sum of these integers received from the different tasks.
Currently I use memcache to store the sum:
from google.appengine.api import memcache

def update_memcache_value(what, val, how_long=86400):
    value_old = get_memcache_value(what)
    memcache.set('system_' + what, value_old + val, how_long)

def get_memcache_value(what):
    value = memcache.get('system_' + what)
    if not value:
        value = 0
    return int(value)
update_memcache_value is called within each task (often more than once per task). But it looks like the data there is often lost during the day. I could use NDB to store the same data, but that would require a lot of write ops. Is there a better way to store this counter?

It sounds like you are specifically looking to have many tasks each do part of a sum and then have those all reduce down to one number at the end, so you want to use MapReduce. Or you could just use Pipelines, since MapReduce is actually built on top of it. If you're worried about write ops, then you aren't going to be able to take full advantage of App Engine's parallelism. A rough sketch of the fan-out/fan-in pattern follows the links below.
Google I/O 2010 - Data pipelines with Google App Engine
https://www.youtube.com/watch?v=zSDC_TU7rtc
Pipelines Library
https://github.com/GoogleCloudPlatform/appengine-pipelines/wiki
MapReduce
https://cloud.google.com/appengine/docs/python/dataprocessing/
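As a hedged sketch of the fan-out/fan-in pattern described in the Pipelines wiki (ProcessOnePortion, Sum and do_the_work are placeholder names, not part of the question's code), each child pipeline processes one item and the parent reduces the returned counts:

import pipeline  # appengine-pipelines library

class ProcessOnePortion(pipeline.Pipeline):
    def run(self, item):
        # Hypothetical per-item work; returns how many portions were found.
        return do_the_work(item)

class Sum(pipeline.Pipeline):
    def run(self, *counts):
        return sum(counts)

class CountEverything(pipeline.Pipeline):
    def run(self, *items):
        partial_counts = []
        for item in items:
            # Each yield fans out into its own task; the returned futures
            # are handed to Sum once all children have finished.
            partial_counts.append((yield ProcessOnePortion(item)))
        yield Sum(*partial_counts)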

Unfortunately, if your tasks span the whole day, memcache on its own is not an option.
If you want to reduce the write ops you could keep a second counter and back up the memcache value to the datastore every 100 tasks, or whatever works for you (a sketch follows below).
If you expect to do this without using write ops for every task, you could try backing up those results in third-party storage, for example a Google Spreadsheet through the Spreadsheets API, but that seems like overkill just to save some write ops (and it is not as performant, which I guess is not an issue here).
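A minimal sketch of the atomic-counter-plus-periodic-backup idea, assuming a hypothetical key name and backup entity. memcache.incr is atomic, unlike the get/set pair in the question, so concurrent tasks don't overwrite each other's updates:

from google.appengine.api import memcache
from google.appengine.ext import ndb

COUNTER_KEY = 'system_portions'   # hypothetical key name
BACKUP_EVERY = 100                # persist roughly every N increments

class CounterBackup(ndb.Model):   # hypothetical backup entity
    value = ndb.IntegerProperty(default=0)

def add_to_counter(delta):
    # Atomic increment; creates the key with 0 if it is missing.
    total = memcache.incr(COUNTER_KEY, delta=delta, initial_value=0)
    if total is not None and total % BACKUP_EVERY < delta:
        # Occasionally persist the running total so a memcache eviction
        # only loses the last few increments, not the whole day.
        backup = CounterBackup.get_or_insert('portions')
        backup.value = max(backup.value, total)
        backup.put()
    return total

On a cold cache you would also reseed the memcache value from CounterBackup before incrementing; that detail is omitted here.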

Related

How to speed up Elasticsearch scroll in python

I need to get data for a certain period of time via the ES API and use Python to do some customized analysis of this data and display the results on a dashboard.
There are about two hundred thousand records every 15 minutes, indexed by date.
Now I use scroll-scan to get the data, but it takes nearly a minute to get 200,000 records, which seems too slow.
Is there any way to process this data more quickly? And can I use something like Redis to save the results and avoid repetitive work?
Is it possible to do the analysis on the Elasticsearch side using aggregations?
Assuming you're not doing it already, you should use _source to only download the absolute minimum data required. You could also try increasing the size parameter to scan() from the default of 1000. I would expect only modest speed improvements from that, however.
If the historical data doesn't change, then a cache like Redis (or even just a local file) could be a good solution. If the historical data can change, then you'd have to manage cache invalidation.
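A minimal sketch of both suggestions using the elasticsearch-py scan helper; the endpoint, index name, field names and the 15-minute range filter are assumptions, not taken from the question:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch(['http://localhost:9200'])    # assumed endpoint

results = scan(
    es,
    index='records-2018.06.01',                  # hypothetical date-based index
    query={
        # Only fetch the fields the analysis actually needs.
        '_source': ['timestamp', 'value'],       # hypothetical field names
        'query': {'range': {'timestamp': {'gte': 'now-15m'}}},
    },
    size=5000,                                   # per-shard batch size, default 1000
)

total = sum(hit['_source']['value'] for hit in results)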

Distribute small collection output to multiple workers using Apache Beam Python over Google Dataflow

I have a pipeline that looks roughly like this:
_ = (
    p
    | SomeSourceProducingListOfFiles()
    | beam.Map(some_expensive_fn)
    | beam.FlatMap(some_inexpensive_agg)
)
SomeSourceProducingListOfFiles in my case is reading from a single CSV/TSV and doesn't currently support splitting.
some_expensive_fn is an expensive operation that may take a minute to run.
some_inexpensive_agg is perhaps not that important for the question but is to show that there are some results brought together for aggregation purpose.
In the case where SomeSourceProducingListOfFiles produces, say, 100 items, the load doesn't seem to get split across multiple workers.
I understand that in general Apache Beam tries to keep things on one worker to reduce serialisation overhead (and there is some hard-coded limit of 1000 items). How can I convince Apache Beam to split the load across multiple workers even for a very small number of items? If, say, I have three items and three workers, I would like each worker to execute one item.
Note: I disabled auto scaling and am using a fixed number of workers.
https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion discusses ways to prevent fusion. Beam Java 2.2+ has a built-in transform to do this, Reshuffle.viaRandomKey(); Beam Python doesn't yet have it, so you'll need to code something similar manually using one of the methods described at that link.
Can you try using beam.Reshuffle? It seems like this isn't well documented but I hear from some good sources that this is what you should use.
https://beam.apache.org/documentation/transforms/python/other/reshuffle/
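A minimal sketch of where the Reshuffle would go, reusing the placeholder stage names from the question:

import apache_beam as beam

with beam.Pipeline() as p:
    _ = (
        p
        | SomeSourceProducingListOfFiles()
        # Break fusion here so the expensive step can be
        # rebalanced across all available workers.
        | beam.Reshuffle()
        | beam.Map(some_expensive_fn)
        | beam.FlatMap(some_inexpensive_agg)
    )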

GAE Datastore write operations delay

I have a DB that maintains a list of calls. Every week I have to import an Excel file or a JSON object to make sure that the call list data is in sync with another DB, which has a different format (I have to do some interpretation of the data I get from the XLS).
Anyhow, I made a function that does all the import, but I noticed that each time I run it I get different results.
After some investigation, what I noticed is that if I do lots of put() calls in sequence there is a lag between the end of the put and when the data is available in the datastore, so queries sometimes return different values.
I fixed it by adding a delay:
time.sleep(1)
But I think there should be a way to just wait until the datastore is consistent, rather than sleeping for a fixed amount of time. I tried to find it but had no luck.
Any help?
This is an often-repeated question, though the other questions at first may not seem the same.
If you are using the datastore you MUST read up on "eventual consistency":
https://cloud.google.com/developers/articles/balancing-strong-and-eventual-consistency-with-google-cloud-datastore/
In my opinion the docs for appengine and the datastore should probably lead off with "If you haven't read about eventual consistency, please do so now!" in really big type ;-)
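For completeness, a minimal sketch of one common way around it, assuming a hypothetical Call model: queries scoped to a single entity group via an ancestor key are strongly consistent, so they see prior puts immediately (at the cost of the entity group's limited write rate):

from google.appengine.ext import ndb

class Call(ndb.Model):                               # hypothetical model
    number = ndb.StringProperty()

IMPORT_ROOT = ndb.Key('CallList', 'weekly-import')   # hypothetical parent key

def import_call(number):
    Call(parent=IMPORT_ROOT, number=number).put()

def imported_calls():
    # Ancestor query: strongly consistent, reflects all prior puts
    # in this entity group without any time.sleep().
    return Call.query(ancestor=IMPORT_ROOT).fetch()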

Import Data Efficiently from Datastore to BigQuery every Hour - Python

Currently, I'm using Google's two-step method to back up the datastore and then import it into BigQuery.
I also reviewed the code that uses the pipeline approach.
Both methods are not efficient and have a high cost since all the data is imported every time.
I need only to add the records added from last import.
What is the right way of doing it?
Is there a working example on how to do it in python?
You can look at Streaming inserts. I'm actually looking at doing the same thing in Java at the moment.
If you want to do it every hour, you could maybe add your inserts to a pull queue (either as serialised entities or keys/IDs) each time you put a new entity to Datastore. You could then process the queue hourly with a cron job.
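A minimal sketch of the pull-queue idea, assuming a pull queue named 'bq-rows' declared in queue.yaml and a hypothetical stream_rows_to_bigquery helper:

from google.appengine.api import taskqueue

def remember_for_export(entity_key):
    # Called right after put(): record which entity still needs exporting.
    taskqueue.Queue('bq-rows').add(
        taskqueue.Task(payload=entity_key.urlsafe(), method='PULL'))

def hourly_export():
    # Cron handler: lease a batch, stream it to BigQuery, then delete the tasks.
    queue = taskqueue.Queue('bq-rows')
    tasks = queue.lease_tasks(lease_seconds=300, max_tasks=1000)
    if tasks:
        stream_rows_to_bigquery([t.payload for t in tasks])  # hypothetical helper
        queue.delete_tasks(tasks)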
There is no full working example (as far as I know), but I believe that the following process could help you:
1- You'd need to add a "last time changed" timestamp to your entities, and keep it updated.
2- Every hour you can run a MapReduce job, where your mapper has a filter that checks the last-updated time and only picks up those entities that were updated in the last hour.
3- Manually add what needs to be added to your backup.
As I said, this is pretty high level, but the actual answer will require a bunch of code. I don't think it is suited to Stack Overflow's format honestly.
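A minimal sketch of steps 1 and 2, assuming a hypothetical Record model (the same filter could be used as the mapper's filter in a MapReduce job):

import datetime
from google.appengine.ext import ndb

class Record(ndb.Model):                            # hypothetical model
    payload = ndb.JsonProperty()
    updated = ndb.DateTimeProperty(auto_now=True)   # the "last time changed" field

def records_changed_last_hour():
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(hours=1)
    # Only these entities need to be pushed to BigQuery this hour.
    return Record.query(Record.updated >= cutoff).fetch()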

Is it possible to make writing to files/reading from files safe for a questionnaire type website?

My web app asks users 3 questions and simply writes the answers to a file as a1,a2,a3. I also have a real-time visualization of the average of the data (it reads from the file in real time).
Must I use a database to ensure that no (or minimal) information is lost? Is it possible to produce a queue of reads/writes? (Since the files are small I am not too worried about the execution time of each call.) Does Python/Flask already take care of this?
I am quite experienced in Python itself, but not in this area (with Flask).
I see a few solutions:
read /dev/urandom a few times, calculate the SHA-256 of the bytes and use it as a file name; a collision is extremely improbable
use Redis and a command like LPUSH; using it from Python is very easy. Then RPOP from the right end of the list, and there's your queue
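A minimal sketch of the Redis queue idea with redis-py; the list name 'answers', the JSON encoding and the local server address are assumptions:

import json
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)   # assumed local Redis

def record_answers(a1, a2, a3):
    # Producer side (the Flask view): push one questionnaire result.
    r.lpush('answers', json.dumps([a1, a2, a3]))

def drain_answers():
    # Consumer side: pop from the other end of the list and let a single
    # process append to the file, so writes never interleave.
    while True:
        raw = r.rpop('answers')
        if raw is None:
            break
        yield json.loads(raw)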
