Currently, I'm using Google's two-step method to back up the Datastore and then import it into BigQuery.
I also reviewed the code that uses a pipeline.
Both methods are inefficient and have a high cost, since all the data is imported every time.
I only need to add the records added since the last import.
What is the right way of doing it?
Is there a working example on how to do it in python?
You can look at Streaming inserts. I'm actually looking at doing the same thing in Java at the moment.
If you want to do it every hour, you could maybe add your inserts to a pull queue (either as serialised entities or keys/IDs) each time you put a new entity to Datastore. You could then process the queue hourly with a cron job.
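A minimal sketch of that idea, assuming NDB entities and a pull queue named 'bq-export' declared in queue.yaml (all the names here are made up for illustration):

from google.appengine.api import taskqueue
from google.appengine.ext import ndb

def record_for_export(entity_key):
    # Call this right after putting a new entity: remember its key for the next export.
    taskqueue.Queue('bq-export').add(
        taskqueue.Task(payload=entity_key.urlsafe(), method='PULL'))

def process_export_queue():
    # Hourly cron handler: lease the queued keys, send only those entities to
    # BigQuery, then delete the leased tasks so they are not processed twice.
    queue = taskqueue.Queue('bq-export')
    tasks = queue.lease_tasks(lease_seconds=300, max_tasks=100)
    if not tasks:
        return
    keys = [ndb.Key(urlsafe=t.payload) for t in tasks]
    entities = ndb.get_multi(keys)
    stream_to_bigquery(entities)  # placeholder for a tabledata().insertAll() call
    queue.delete_tasks(tasks)

That way only the rows queued since the last run are sent, instead of re-importing the whole Datastore.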
There is no full working example (as far as I know), but I believe that the following process could help you:
1- You'd need to add a "last time changed" property to your entities, and update it on every write.
2- Every hour you can run a MapReduce job, where your mapper can have a filter to check the last time updated and only pick up those entities that were updated in the last hour (a rough sketch of this filter follows after the list).
3- Manually add what needs to be added to your backup.
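To give a rough idea of step 2, here is just the filter a mapper would use, on a made-up model (not a full MapReduce job):

import datetime
from google.appengine.ext import ndb

class Record(ndb.Model):  # hypothetical entity kind
    payload = ndb.JsonProperty()
    last_updated = ndb.DateTimeProperty(auto_now=True)  # step 1: "last time changed"

def changed_in_last_hour():
    # Step 2: only pick up entities updated since the last hourly run.
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(hours=1)
    return Record.query(Record.last_updated >= cutoff).fetch()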
As I said, this is pretty high level, but the actual answer will require a bunch of code. I don't think it is suited to Stack Overflow's format honestly.
I have a Python script that will regularly check an API for data updates. Since it runs without supervision I would like to be able to monitor what the script does to make sure it works properly.
My initial thought is just to write every communication attempt with the API to a text file, with date, time and whether data was pulled or not. A new line for every input. My question to you is whether you would recommend doing it another way. Writing to Excel, for example, to be able to sort the columns? Or are there any other options worth considering?
I would say it really depends on two factors:
How often you update
How much interaction you want with the monitoring data (e.g. notifications, reporting, etc.)
I have had projects where we've updated Google Sheets (using the API) to be able to collaboratively extract reports from update data.
However, note that this means a web call at every update, so if your updates are close together, this will affect performance. Also, if your app is interactive, there may be a delay while the data gets updated.
The upside is you can build things like graphs and timelines really easily (and collaboratively) where needed.
Also - yes, definitely the logging module as answered below. I sort of assumed you were using the logging module already for the local file for some reason!
Take a look at the logging documentation.
A new line for every input is a good start. You can configure the logging module to print date and time automatically.
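For example, a minimal setup (the file name and the pull_from_api function are placeholders for whatever your script already has):

import logging

logging.basicConfig(
    filename='api_monitor.log',
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s')  # date and time added for you

def check_api():
    try:
        data = pull_from_api()  # your existing API call
        logging.info('Pulled %d records', len(data))
    except Exception:
        logging.exception('API call failed')  # also logs the traceback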
I run many, many tasks to get some information and process it. After each task runs, I have an integer which indicates how many portions of the information I've got.
I would like to get the sum of these integers received from the different tasks.
Currently I use memcache to store the sum:
def update_memcache_value(what, val, how_long=86400):
    value_old = get_memcache_value(what)
    memcache.set('system_'+what, value_old+val, how_long)

def get_memcache_value(what):
    value = memcache.get('system_'+what)
    if not value:
        value = 0
    return int(value)
update_memcache_value is called within each task (quite a bit more often than once). But it looks like the data there is often lost during the day. I could use NDB to store the same data, but that would require a lot of write ops. Is there any better way to store this data (a counter)?
It sounds like you are specifically looking to have many tasks each do a part of a sum and then have those all reduce down to one number at the end... so you want to use MapReduce. Or you could just use Pipelines, as MapReduce is actually built on top of it (a tiny Pipelines sketch follows after the links below). If you're worried about write ops, then you aren't going to be able to take advantage of App Engine's parallelism.
Google I/O 2010 - Data pipelines with Google App Engine
https://www.youtube.com/watch?v=zSDC_TU7rtc
Pipelines Library
https://github.com/GoogleCloudPlatform/appengine-pipelines/wiki
MapReduce
https://cloud.google.com/appengine/docs/python/dataprocessing/
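To give a taste of the Pipelines fan-in pattern, a sketch along these lines (this assumes the appengine-pipelines package is vendored into the app and importable as pipeline; do_one_task is a hypothetical worker function):

import pipeline

class CountPortions(pipeline.Pipeline):
    def run(self, task_input):
        # Do the real work for one task and return its integer result.
        return do_one_task(task_input)

class Sum(pipeline.Pipeline):
    def run(self, *values):
        return sum(values)

class CountAll(pipeline.Pipeline):
    def run(self, *task_inputs):
        results = []
        for task_input in task_inputs:
            results.append((yield CountPortions(task_input)))
        yield Sum(*results)  # resolves once every child pipeline has finished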
Unfortunately, if your tasks span the whole day, memcache by itself is not an option.
If you want to reduce the write ops you could keep a second counter and back up the memcache value to the datastore every 100 tasks, or whatever works for you (a rough sketch below).
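A rough sketch of that second-counter idea (the model name and flush interval are made up; note that memcache.incr is atomic, unlike a get followed by a set):

from google.appengine.api import memcache
from google.appengine.ext import ndb

FLUSH_EVERY = 100  # back up to the datastore once per 100 tasks

class CounterBackup(ndb.Model):  # hypothetical backup entity
    value = ndb.IntegerProperty(default=0)

def update_counter(what, val):
    total = memcache.incr('system_' + what, delta=val, initial_value=0)
    calls = memcache.incr('system_' + what + '_calls', delta=1, initial_value=0)
    if calls % FLUSH_EVERY == 0:
        backup = CounterBackup.get_or_insert(what)
        backup.value = total
        backup.put()  # one write op per FLUSH_EVERY tasks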
If you are expecting to do this without using write ops for every task, you could try backing up those results in third-party storage, for example a Google Spreadsheet through the Spreadsheets API, but that seems like overkill just to save some write ops (and it's not as performant, which I guess is not an issue here).
I have a DB that maintains a list of calls. Every week I have to import an Excel file or a JSON object to make sure that the list of calls is in sync with another DB, which has a different format (I have to do some interpretation of the data I get from the xls).
Anyhow, I made a function that does all the importing, but I noticed that each time I run it I get different results.
After some investigation, what I noticed is that if I do lots of put() calls in sequence there is a lag between the end of the put and when the data is available in the datastore, so queries sometimes return different values.
I fixed it by adding a delay:
time.sleep(1)
But I think there should be a way to just wait until the datastore is stable instead of waiting a fixed amount of time. I tried to find it but had no luck.
Any help?
This is an often repeated question, though the other questions at first may not seem the same.
If you are using the datastore you MUST read up on "Eventual consistency"
https://cloud.google.com/developers/articles/balancing-strong-and-eventual-consistency-with-google-cloud-datastore/
In my opinion the docs for appengine and the datastore should probably lead off with "If you haven't read about eventual consistency, please do so now!" in really big type ;-)
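For example (the model and key names are invented for illustration): gets by key and ancestor queries are strongly consistent, while global queries are only eventually consistent, which is what the sleep was papering over.

from google.appengine.ext import ndb

class Call(ndb.Model):  # hypothetical model for the imported calls
    number = ndb.StringProperty()

parent_key = ndb.Key('CallList', 'weekly-import')  # one entity group for the import

call_key = Call(parent=parent_key, number='555-0100').put()

fresh = call_key.get()                                # strongly consistent
also_fresh = Call.query(ancestor=parent_key).fetch()  # strongly consistent
maybe_stale = Call.query(Call.number == '555-0100').fetch()  # eventually consistent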
My web app asks users 3 questions and simply writes them to a file as a1,a2,a3. I also have a real-time visualization of the average of the data (it reads from the file in real time).
Must I use a database to ensure that no (or minimal) information is lost? Is it possible to produce a queue of reads/writes? (Since the files are small I am not too worried about the execution time of each call.) Does Python/Flask already take care of this?
I am quite experienced in Python itself, but not in this area (with Flask).
I see a few solutions:
read /dev/urandom a few times, calculate the SHA-256 of the number and use it as a file name; a collision is extremely improbable
use Redis and a command like LPUSH, which is very easy to use from Python; then RPOP from the right end of the linked list, and there's your queue (a small sketch below)
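A small sketch of the Redis option (the list name and connection settings are just examples; install the client with pip install redis):

import json
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)

def enqueue_answers(a1, a2, a3):
    # Producer (e.g. the Flask view): push onto the left end of the list.
    r.lpush('answers', json.dumps({'a1': a1, 'a2': a2, 'a3': a3}))

def process_one():
    # Consumer: pop from the right end, so answers come out in FIFO order.
    raw = r.rpop('answers')
    if raw is None:
        return None
    answers = json.loads(raw)
    # ... append to the file / update the running average here ...
    return answers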
I have a website I am looking to stay updated with and scrape some content from there every day. I know the site is updated manually at a certain time, and I've set cron schedules to reflect this, but since it is updated manually it could be 10 or even 20 minutes later.
Right now I have a hack-ish cron update every 5 minutes, but I'd like to use the deferred library to do things in a more precise manner. I'm trying to chain deferred tasks so I can check whether there was an update, defer that same check for a couple of minutes if there was none, and defer again if need be until there finally is an update.
I have some code I thought would work, but it only ever defers once, when instead I need to continue deferring until there is an update:
(I am using Python)
class Ripper(object):
    def rip(self):
        if siteHasNotBeenUpdated:
            deferred.defer(self.rip, _countdown=120)
        else:
            updateMySite()
This was just a simplified excerpt obviously.
I thought this was simple enough to work, but maybe I've just got it all wrong?
The example you give should work just fine. You need to add logging to determine whether deferred.defer is being called when you think it is. More information would help, too: how is siteHasNotBeenUpdated set?
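For instance, reusing the excerpt from the question (siteHasNotBeenUpdated and updateMySite stand in for whatever the real code does), a couple of log lines would show which branch runs on each attempt:

import logging
from google.appengine.ext import deferred

class Ripper(object):
    def rip(self):
        if siteHasNotBeenUpdated:
            logging.info('No update yet, deferring another check in 120s')
            deferred.defer(self.rip, _countdown=120)
        else:
            logging.info('Site updated, ripping now')
            updateMySite()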