I would like to wipe out all data for a specific kind in Google App Engine. What is the
best way to do this?
I wrote a delete script (hack), but since there is so much data is
timeout's out after a few hundred records.
I am currently deleting the entities by their key, and it seems to be faster.
from google.appengine.ext import db
class bulkdelete(webapp.RequestHandler):
def get(self):
self.response.headers['Content-Type'] = 'text/plain'
try:
while True:
q = db.GqlQuery("SELECT __key__ FROM MyModel")
assert q.count()
db.delete(q.fetch(200))
time.sleep(0.5)
except Exception, e:
self.response.out.write(repr(e)+'\n')
pass
from the terminal, I run curl -N http://...
You can now use the Datastore Admin for that: https://developers.google.com/appengine/docs/adminconsole/datastoreadmin#Deleting_Entities_in_Bulk
If I were a paranoid person, I would say Google App Engine (GAE) has not made it easy for us to remove data if we want to. I am going to skip discussion on index sizes and how they translate a 6 GB of data to 35 GB of storage (being billed for). That's another story, but they do have ways to work around that - limit number of properties to create index on (automatically generated indexes) et cetera.
The reason I decided to write this post is that I need to "nuke" all my Kinds in a sandbox. I read about it and finally came up with this code:
package com.intillium.formshnuker;
import java.io.IOException;
import java.util.ArrayList;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.labs.taskqueue.QueueFactory;
import com.google.appengine.api.labs.taskqueue.TaskOptions.Method;
import static com.google.appengine.api.labs.taskqueue.TaskOptions.Builder.url;
#SuppressWarnings("serial")
public class FormsnukerServlet extends HttpServlet {
public void doGet(final HttpServletRequest request, final HttpServletResponse response) throws IOException {
response.setContentType("text/plain");
final String kind = request.getParameter("kind");
final String passcode = request.getParameter("passcode");
if (kind == null) {
throw new NullPointerException();
}
if (passcode == null) {
throw new NullPointerException();
}
if (!passcode.equals("LONGSECRETCODE")) {
response.getWriter().println("BAD PASSCODE!");
return;
}
System.err.println("*** deleting entities form " + kind);
final long start = System.currentTimeMillis();
int deleted_count = 0;
boolean is_finished = false;
final DatastoreService dss = DatastoreServiceFactory.getDatastoreService();
while (System.currentTimeMillis() - start < 16384) {
final Query query = new Query(kind);
query.setKeysOnly();
final ArrayList<Key> keys = new ArrayList<Key>();
for (final Entity entity: dss.prepare(query).asIterable(FetchOptions.Builder.withLimit(128))) {
keys.add(entity.getKey());
}
keys.trimToSize();
if (keys.size() == 0) {
is_finished = true;
break;
}
while (System.currentTimeMillis() - start < 16384) {
try {
dss.delete(keys);
deleted_count += keys.size();
break;
} catch (Throwable ignore) {
continue;
}
}
}
System.err.println("*** deleted " + deleted_count + " entities form " + kind);
if (is_finished) {
System.err.println("*** deletion job for " + kind + " is completed.");
} else {
final int taskcount;
final String tcs = request.getParameter("taskcount");
if (tcs == null) {
taskcount = 0;
} else {
taskcount = Integer.parseInt(tcs) + 1;
}
QueueFactory.getDefaultQueue().add(
url("/formsnuker?kind=" + kind + "&passcode=LONGSECRETCODE&taskcount=" + taskcount).method(Method.GET));
System.err.println("*** deletion task # " + taskcount + " for " + kind + " is queued.");
}
response.getWriter().println("OK");
}
}
I have over 6 million records. That's a lot. I have no idea what the cost will be to delete the records (maybe more economical not to delete them). Another alternative would be to request a deletion for the entire application (sandbox). But that's not realistic in most cases.
I decided to go with smaller groups of records (in easy query). I know I could go for 500 entities, but then I started receiving very high rates of failure (re delete function).
My request from GAE team: please add a feature to delete all entities of a kind in a single transaction.
Presumably your hack was something like this:
# Deleting all messages older than "earliest_date"
q = db.GqlQuery("SELECT * FROM Message WHERE create_date < :1", earliest_date)
results = q.fetch(1000)
while results:
db.delete(results)
results = q.fetch(1000, len(results))
As you say, if there's sufficient data, you're going to hit the request timeout before it gets through all the records. You'd have to re-invoke this request multiple times from outside to ensure all the data was erased; easy enough to do, but hardly ideal.
The admin console doesn't seem to offer any help, as (from my own experience with it), it seems to only allow entities of a given type to be listed and then deleted on a page-by-page basis.
When testing, I've had to purge my database on startup to get rid of existing data.
I would infer from this that Google operates on the principle that disk is cheap, and so data is typically orphaned (indexes to redundant data replaced), rather than deleted. Given there's a fixed amount of data available to each app at the moment (0.5 GB), that's not much help for non-Google App Engine users.
Try using App Engine Console then you dont even have to deploy any special code
I've tried db.delete(results) and App Engine Console, and none of them seems to be working for me. Manually removing entries from Data Viewer (increased limit up to 200) didn't work either since I have uploaded more than 10000 entries. I ended writing this script
from google.appengine.ext import db
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app
import wsgiref.handlers
from mainPage import YourData #replace this with your data
class CleanTable(webapp.RequestHandler):
def get(self, param):
txt = self.request.get('table')
q = db.GqlQuery("SELECT * FROM "+txt)
results = q.fetch(10)
self.response.headers['Content-Type'] = 'text/plain'
#replace yourapp and YouData your app info below.
self.response.out.write("""
<html>
<meta HTTP-EQUIV="REFRESH" content="5; url=http://yourapp.appspot.com/cleanTable?table=YourData">
<body>""")
try:
for i in range(10):
db.delete(results)
results = q.fetch(10, len(results))
self.response.out.write("<p>10 removed</p>")
self.response.out.write("""
</body>
</html>""")
except Exception, ints:
self.response.out.write(str(inst))
def main():
application = webapp.WSGIApplication([
('/cleanTable(.*)', CleanTable),
])
wsgiref.handlers.CGIHandler().run(application)
The trick was to include redirect in html instead of using self.redirect. I'm ready to wait overnight to get rid of all the data in my table. Hopefully, GAE team will make it easier to drop tables in the future.
The official answer from Google is that you have to delete in chunks spread over multiple requests. You can use AJAX, meta refresh, or request your URL from a script until there are no entities left.
The fastest and efficient way to handle bulk delete on Datastore is by using the new mapper API announced on the latest Google I/O.
If your language of choice is Python, you just have to register your mapper in a mapreduce.yaml file and define a function like this:
from mapreduce import operation as op
def process(entity):
yield op.db.Delete(entity)
On Java you should have a look to this article that suggests a function like this:
#Override
public void map(Key key, Entity value, Context context) {
log.info("Adding key to deletion pool: " + key);
DatastoreMutationPool mutationPool = this.getAppEngineContext(context)
.getMutationPool();
mutationPool.delete(value.getKey());
}
One tip. I suggest you get to know the remote_api for these types of uses (bulk deleting, modifying, etc.). But, even with the remote api, batch size can be limited to a few hundred at a time.
Unfortunately, there's no way to easily do a bulk delete. Your best bet is to write a script that deletes a reasonable number of entries per invocation, and then call it repeatedly - for example, by having your delete script return a 302 redirect whenever there's more data to delete, then fetching it with "wget --max-redirect=10000" (or some other large number).
With django, setup url:
url(r'^Model/bdelete/$', v.bulk_delete_models, {'model':'ModelKind'}),
Setup view
def bulk_delete_models(request, model):
import time
limit = request.GET['limit'] or 200
start = time.clock()
set = db.GqlQuery("SELECT __key__ FROM %s" % model).fetch(int(limit))
count = len(set)
db.delete(set)
return HttpResponse("Deleted %s %s in %s" % (count,model,(time.clock() - start)))
Then run in powershell:
$client = new-object System.Net.WebClient
$client.DownloadString("http://your-app.com/Model/bdelete/?limit=400")
If you are using Java/JPA you can do something like this:
em = EntityManagerFactoryUtils.getTransactionalEntityManager(entityManagerFactory)
Query q = em.createQuery("delete from Table t");
int number = q.executeUpdate();
Java/JDO info can be found here: http://code.google.com/appengine/docs/java/datastore/queriesandindexes.html#Delete_By_Query
Yes you can:
Go to Datastore Admin, and then select the Entitiy type you want to delete and click Delete.
Mapreduce will take care of deleting!
On a dev server, one can cd to his app's directory then run it like this:
dev_appserver.py --clear_datastore=yes .
Doing so will start the app and clear the datastore. If you already have another instance running, the app won't be able to bind to the needed IP and therefore fail to start...and to clear your datastore.
You can use the task queues to delete chunks of say 100 objects.
Deleting objects in GAE shows how limited the Admin capabilities are in GAE. You have to work with batches on 1000 entities or less. You can use the bulkloader tool that works with csv's but the documentation does not cover java.
I am using GAE Java and my strategy for deletions involves having 2 servlets, one for doing the actually delete and another to load the task queues. When i want to do a delete, I run the queue loading servlet, it loads the queues and then GAE goes to work executing all the tasks in the queue.
How to do it:
Create a servlet that deletes a small number of objects.
Add the servlet to your task queues.
Go home or work on something else ;)
Check the datastore every so often ...
I have a datastore with about 5000 objects that i purge every week and it takes about 6 hours to clean out, so i run the task on Friday night.
I use the same technique to bulk load my data which happens to be about 5000 objects, with about a dozen properties.
This worked for me:
class ClearHandler(webapp.RequestHandler):
def get(self):
self.response.headers['Content-Type'] = 'text/plain'
q = db.GqlQuery("SELECT * FROM SomeModel")
self.response.out.write("deleting...")
db.delete(q)
Thank you all guys, I got what I need. :D
This may be useful if you have lots db models to delete, you can dispatch it in your terminal. And also, you can manage the delete list in DB_MODEL_LIST yourself.
Delete DB_1:
python bulkdel.py 10 DB_1
Delete All DB:
python bulkdel.py 11
Here is the bulkdel.py file:
import sys, os
URL = 'http://localhost:8080'
DB_MODEL_LIST = ['DB_1', 'DB_2', 'DB_3']
# Delete Model
if sys.argv[1] == '10' :
command = 'curl %s/clear_db?model=%s' % ( URL, sys.argv[2] )
os.system( command )
# Delete All DB Models
if sys.argv[1] == '11' :
for model in DB_MODEL_LIST :
command = 'curl %s/clear_db?model=%s' % ( URL, model )
os.system( command )
And here is the modified version of alexandre fiori's code.
from google.appengine.ext import db
class DBDelete( webapp.RequestHandler ):
def get( self ):
self.response.headers['Content-Type'] = 'text/plain'
db_model = self.request.get('model')
sql = 'SELECT __key__ FROM %s' % db_model
try:
while True:
q = db.GqlQuery( sql )
assert q.count()
db.delete( q.fetch(200) )
time.sleep(0.5)
except Exception, e:
self.response.out.write( repr(e)+'\n' )
pass
And of course, you should map the link to model in a file(like main.py in GAE), ;)
In case some guys like me need it in detail, here is part of main.py:
from google.appengine.ext import webapp
import utility # DBDelete was defined in utility.py
application = webapp.WSGIApplication([('/clear_db',utility.DBDelete ),('/',views.MainPage )],debug = True)
To delete all entities in a given kind in Google App Engine you only need to do as follows:
from google.cloud import datastore
query = datastore.Client().query(kind = <KIND>)
results = query.fetch()
for result in results:
datastore.Client().delete(result.key)
In javascript, the following will delete all the entries for on page:
document.getElementById("allkeys").checked=true;
checkAllEntities();
document.getElementById("delete_button").setAttribute("onclick","");
document.getElementById("delete_button").click();
given that you are on the admin-page (.../_ah/admin) with the entities you want to delete.
Related
I'm using the Python ibm-cloud-sdk in an attempt to iterate all resources in a particular IBM Cloud account. My trouble has been that pagination doesn't appear to "work for me". When I pass in the "next_url" I still get the same list coming back from the call.
Here is my test code. I successfully print many of my COS instances, but I only seem to be able to print the first page....maybe I've been looking at this too long and just missed something obvious...anyone have any clue why I can't retrieve the next page?
try:
####### authenticate and set the service url
auth = IAMAuthenticator(RESOURCE_CONTROLLER_APIKEY)
service = ResourceControllerV2(authenticator=auth)
service.set_service_url(RESOURCE_CONTROLLER_URL)
####### Retrieve the resource instance listing
r = service.list_resource_instances().get_result()
####### get the row count and resources list
rows_count = r['rows_count']
resources = r['resources']
while rows_count > 0:
print('Number of rows_count {}'.format(rows_count))
next_url = r['next_url']
for i, resource in enumerate(resources):
type = resource['id'].split(':')[4]
if type == 'cloud-object-storage':
instance_name = resource['name']
instance_id = resource['guid']
crn = resource['crn']
print('Found instance id : name - {} : {}'.format(instance_id, instance_name))
############### this is SUPPOSED to get the next page
r = service.list_resource_instances(start=next_url).get_result()
rows_count = r['rows_count']
resources = r['resources']
except Exception as e:
Error = 'Error : {}'.format(e)
print(Error)
exit(1)
From looking at the API documentation for listing resource instances, the value of next_url includes the URL path and the start parameter including its token for start.
To retrieve the next page, you would only need to pass in the parameter start with the token as value. IMHO this is not ideal.
I typically do not use the SDK, but a simply Python request. Then, I can use the endpoint (base) URI + next_url as full URI.
If you stick with the SDK, use urllib.parse to extract the query parameter. Not tested, but something like:
from urllib.parse import urlparse,parse_qs
o=urlparse(next_url)
q=parse_qs(o.query)
r = service.list_resource_instances(start=q['start'][0]).get_result()
Could you use the Search API for listing the resources in your account rather than the resource controller? The search index is set up for exactly that operation, whereas paginating results from the resource controller seems much more brute force.
https://cloud.ibm.com/apidocs/search#search
I recently found out about sqlalchemy in Python. I'd like to use it for data science rather than website applications.
I've been reading about it and I like that you can translate the sql queries into Python.
The main thing that I am confused about that I'm doing is:
Since I'm reading data from an already well established schema, I wish I didn't have to create the corresponding models myself.
I am able to get around that reading the metadata for the table and then just querying the tables and columns.
The problem is when I want to join to other tables, this metadata reading is taking too long each time, so I'm wondering if it makes sense to pickle cache it in an object, or if there's another built in method for that.
Edit: Include code.
Noticed that the waiting time was due to an error in the loading function, rather than how to use the engine. Still leaving the code in case people comment something useful. Cheers.
The code I'm using is the following:
def reflect_engine(engine, update):
store = f'cache/meta_{engine.logging_name}.pkl'
if update or not os.path.isfile(store):
meta = alq.MetaData()
meta.reflect(bind=engine)
with open(store, "wb") as opened:
pkl.dump(meta, opened)
else:
with open(store, "r") as opened:
meta = pkl.load(opened)
return meta
def begin_session(engine):
session = alq.orm.sessionmaker(bind=engine)
return session()
Then I use the metadata object to get my queries...
def get_some_cars(engine, metadata):
session = begin_session(engine)
Cars = metadata.tables['Cars']
Makes = metadata.tables['CarManufacturers']
cars_cols = [ getattr(Cars.c, each_one) for each_one in [
'car_id',
'car_selling_status',
'car_purchased_date',
'car_purchase_price_car']] + [
Makes.c.car_manufacturer_name]
statuses = {
'selling' : ['AVAILABLE','RESERVED'],
'physical' : ['ATOURLOCATION'] }
inventory_conditions = alq.and_(
Cars.c.purchase_channel == "Inspection",
Cars.c.car_selling_status.in_( statuses['selling' ]),
Cars.c.car_physical_status.in_(statuses['physical']),)
the_query = ( session.query(*cars_cols).
join(Makes, Cars.c.car_manufacturer_id == Makes.c.car_manufacturer_id).
filter(inventory_conditions).
statement )
the_inventory = pd.read_sql(the_query, engine)
return the_inventory
I am trying to add memcache to my webapp deployed on GAE, and to do this I am using memcache.Client() to prevent damage from any racing conditions:
from google.appengine.api import memcache
client = memcache.Client()
class BlogFront(BlogHandler):
def get(self):
global client
val = client.gets(FRONT_PAGE_KEY)
posts = list()
if val is not None:
posts = list(val)
else:
posts = db.GqlQuery("select * from Post order by created desc limit 10")
client.cas(FRONT_PAGE_KEY, list(posts))
self.render('front.html', posts = posts)
To test the problem I have a front page for a blog that displays the 10 most recent entries. If there is nothing in the cache, I hit the DB with a request, otherwise I just present the cached results to the user.
The problem is that no matter what I do, I always get val == None, thus meaning that I always hit the database with a useless request.
I have sifted through the documentation:
https://developers.google.com/appengine/docs/python/memcache/
https://developers.google.com/appengine/docs/python/memcache/clientclass
http://neopythonic.blogspot.pt/2011/08/compare-and-set-in-memcache.html
And it appears that I am doing everything correctly. What am I missing?
(PS: I am a python newb, if this is a retarded error, please bear with me xD )
from google.appengine.api import memcache
class BlogFront(BlogHandler):
def get(self):
client = memcache.Client()
client.gets(FRONT_PAGE_KEY)
client.cas(FRONT_PAGE_KEY, 'my content')
For a reason I cannot yet possible understand, the solution lies in having a gets right before having a cas call ...
I think I will stick with the memcache non-thread-safe version of the code for now ...
I suspect that the client.cas call is failing because there is no object. Perhaps client.cas only works to update and existing object (not to set a new object if there is none currently)? You might try client.add() (which will fail if an object already exists with the specified key, which I think is what you want?) instead of client.cas()
i have this issue with app engine's datastore. In interactive console, I always get no entities when i ask if a url already exists in my db. When I execute the following code....
from google.appengine.ext import db
class shortURL(db.Model):
url = db.TextProperty()
date = db.DateTimeProperty(auto_now_add=True)
q = shortURL.all()
#q.filter("url =", "httphello")
print q.count()
for item in q:
print item.url
i get the this response, which is fine
6
httphello
www.boston.com
http://www.boston.com
httphello
www.boston.com
http://www.boston.com
But when I uncomment the line "q.filter("url =", "httphello")", i get no entities at all (a response of 0). I know its something ultra simple, but I just can't see it! help.
TextProperty values are not indexed, and cannot be used in filters or sort orders.
You might want to try with StringProperty if you don't need more than 500 characters.
I think .fetch() is missing . you can do a fetch before you do some manipulation on a model.
Also. I don't think you need db.TextProperty() for this, you can use db.StringProperty().
I have a filter that I use for lang support in my webapp. But when I publish it to GAE it keeps telling me that it the usage of CPU is to high.
I think I located the problem to my filters I use for support. I use this in my templates:
<h1>{{ "collection.header"|translate:lang }}</h1>
The filter code looks like this:
import re
from google.appengine.ext import webapp
from util import dictionary
register = webapp.template.create_template_register()
def translate(key, lang):
d = dictionary.GetDictionaryKey(lang, key)
if d == False:
return "no key for " + key
else:
return d.value
register.filter(translate)
I'm to new to Python to see what's wrong with it. Or is the the entire wrong approach?
..fredrik
Little more about what I'm trying to do: I'm trying to find away to handle language support. A user needs to be able to update text elements via an admin page. As of now I have all text elements stored in a db.model. And use a filter to get the right key based on language.
After a lot of testing I still can't get to work well enough. When published I still get error messages in the logs about to much CPU usage. A typical page has about 30-50 text elements. And according to the logs it uses about 1500ms (900ms API) for each page load. I'm starting to think this might not be the best approach?
I've tried using both memcache and indexes to get around the CPU usage. It helps a little. Should one use memcache and manually added indexes?
This is how my filter looks like:
import re
from google.appengine.ext import webapp
from google.appengine.api import memcache
from util import dictionary
register = webapp.template.create_template_register()
def translate(key, lang):
re = "no key for " + key
data = memcache.get("dictionary" + lang)
if data is None:
data = dictionary.GetDictionaryKey(lang)
memcache.add("dictionary" + lang, data, 60)
if key in data:
return data[key]
else:
return "no key for " + key
register.filter(translate)
And util.dictionary looks like this:
from google.appengine.ext import db
class DictionaryEntries(db.Model):
lang = db.StringProperty()
dkey = db.StringProperty()
value = db.TextProperty()
params = db.StringProperty()
#property
def itemid(self):
return self.key().id()
def GetDictionaryKey(lang):
entries = DictionaryEntries.all().filter("lang = ", lang)
if entries.count() > 0:
langObj = {}
for entry in entries:
langObj[entry.dkey] = entry.value
return langObj
else:
return False
Your initial question is about high cpu usage, the answer i think is simple, with GAE and databases like BigTable (non-relational) the code with entries.count() is expensive and the for entry in entrie too if you have a lot of data.
I think you must have to do a couple of things:
in your utils.py
def GetDictionaryKey(lang, key):
chache_key = 'dictionary_%s_%s' % (lang, key)
data = memcache.get(cache_key)
if not data:
entry = DictionaryEntries.all().filter("lang = ", lang).filter("value =", key).get()
if entry:
data = memcache.add(cache_key, entry.value, 60)
else:
data = 'no result for %s' % key
return data
and in your filter:
def translate(key, lang):
return dictionary.GetDictionaryKey(lang, key)
This approach is better because:
You don't make the expensive query of count
You respect the MVC pattern, because a filter is part of the Template (View in the pattern) and the method GetDictionaryKey is part of the Controler.
Besides, if you are using django i suggest you slugify your cache_key:
from django.template.defaultfilters import slugify
def GetDictionaryKey(lang, key):
chache_key = 'dictionary_%s_%s' % (slugify(lang), slugify(key))
data = memcache.get(cache_key)
if not data:
entry = DictionaryEntries.all().filter("lang = ", lang).filter("value =", key).get()
if entry:
data = memcache.add(cache_key, entry.value, 60)
else:
data = 'no result for %s' % key
return data
Have you considered switching to standard gettext methods? Gettext is a widely spread approach for internationalization and very well embedded in the Python (and the Django) world.
Some links:
Python's gettext module
Django's support for gettext with special attention to unicode
PoEdit, an editor for .po-files produced by pygettext
Your template would then look like this:
{% load i18n %}
<h1>{% trans "Header of my Collection" %}</h1>
The files for translations can be generated by manage.py:
manage.py makemessages -l fr
for generating french (fr) messages, for example.
Gettext is quite performant, so I doubt that you will experience a significant slow-down with this approach compared to your storage of the translation table in memcache. And what's more, it let's you work with "real" messages instead of abstract dictionary keys, which is, at least in my experience, ways better, if you have to read and understand the code (or if you have to find and change a certain message).