Batch posting on Blogger using the gdata Python client - python

I'm trying to copy all my LiveJournal posts to my new blog on blogger.com. I do so with a slightly modified example that ships with the gdata Python client, and I have a JSON file with all of my posts exported from LiveJournal. The issue is that blogger.com limits new blog entries to 50 per day, so copying my 1300+ posts would take about a month, since I can't programmatically enter the captcha that appears after 50 imports.
I recently learned that there's also a batch operation mode somewhere in gdata, but I couldn't figure out how to use it. Googling didn't really help.
Any advice or help will be highly appreciated.
Thanks.
Update
Just in case, here is the code I use:
#!/usr/local/bin/python
import json
import requests
from gdata import service
import gdata
import atom
import getopt
import sys
from datetime import datetime as dt
from datetime import timedelta as td
from datetime import tzinfo as tz
import time

allEntries = json.load(open("todays_copy.json", "r"))

class TZ(tz):
    def utcoffset(self, dt): return td(hours=-6)

class BloggerExample:
    def __init__(self, email, password):
        # Authenticate using ClientLogin.
        self.service = service.GDataService(email, password)
        self.service.source = "Blogger_Python_Sample-1.0"
        self.service.service = "blogger"
        self.service.server = "www.blogger.com"
        self.service.ProgrammaticLogin()

        # Get the blog ID for the first blog.
        feed = self.service.Get("/feeds/default/blogs")
        self_link = feed.entry[0].GetSelfLink()
        if self_link:
            self.blog_id = self_link.href.split("/")[-1]

    def CreatePost(self, title, content, author_name, label, time):
        LABEL_SCHEME = "http://www.blogger.com/atom/ns#"
        # Create the entry to insert.
        entry = gdata.GDataEntry()
        entry.author.append(atom.Author(atom.Name(text=author_name)))
        entry.title = atom.Title(title_type="xhtml", text=title)
        entry.content = atom.Content(content_type="html", text=content)
        entry.published = atom.Published(time)
        entry.category.append(atom.Category(scheme=LABEL_SCHEME, term=label))

        # Ask the service to insert the new entry.
        return self.service.Post(entry,
                                 "/feeds/" + self.blog_id + "/posts/default")

    def run(self, data):
        for year in allEntries:
            for month in year["yearlydata"]:
                for day in month["monthlydata"]:
                    for entry in day["daylydata"]:
                        # print year["year"], month["month"], day["day"], entry["title"].encode("utf-8")
                        atime = dt.strptime(entry["time"], "%I:%M %p")
                        hr = atime.hour
                        mn = atime.minute
                        ptime = dt(year["year"], int(month["month"]), int(day["day"]), hr, mn, 0, tzinfo=TZ()).isoformat("T")
                        public_post = self.CreatePost(entry["title"],
                                                      entry["content"],
                                                      "My name",
                                                      ",".join(entry["tags"]),
                                                      ptime)
                        print "%s, %s - published, Waiting 30 minutes" % (ptime, entry["title"].encode("utf-8"))
                        time.sleep(30*60)

def main(data):
    email = "my@email.com"
    password = "MyPassW0rd"
    sample = BloggerExample(email, password)
    sample.run(data)

if __name__ == "__main__":
    main(allEntries)

I would recommend using Google Blog Converters instead (https://code.google.com/archive/p/google-blog-converters-appengine/).
To get started you will have to go through:
https://github.com/google/gdata-python-client/blob/master/INSTALL.txt - steps for setting up the Google GData API
https://github.com/pra85/google-blog-converters-appengine/blob/master/README.txt - steps for using the Blog Converters
Once you have everything set up, you just have to run the following command (with your LiveJournal username and password):
livejournal2blogger.sh -u <username> -p <password> [-s <server>]
Redirect its output into a .xml file. This file can then be imported into a Blogger blog directly from the Blogger Dashboard: your blog > Settings > Other > Blog tools > Import Blog.
Remember to check the "Automatically publish all imported posts and pages" option. I have tried this once before with a blog of over 400 posts, and Blogger imported and published them without issue.
In case you are worried that Blogger might have problems (because the number of posts is quite high), or you have other Blogger blogs in your account, then as a precaution create a separate Blogger (Google) account and import the posts there. After that you can transfer admin control to your real Blogger account (to transfer, first send an author invite, then raise your real Blogger account to admin level, and finally remove the dummy account; the option for sending invites is at Settings > Basic > Permissions > Blog Authors).
Also make sure that you are using Python 2.5, otherwise these scripts will not run. Before running livejournal2blogger.sh, change the following line (thanks to Michael Fleet for this fix: http://michael.f1337.us/2011/12/28/google-blog-converters-blogger2wordpress/)
PYTHONPATH=${PROJ_DIR}/lib python ${PROJ_DIR}/src/livejournal2blogger/lj2b.py $*
to
PYTHONPATH=${PROJ_DIR}/lib python2.5 ${PROJ_DIR}/src/livejournal2blogger/lj2b.py $*
P.S. I am aware this is not a direct answer to your question, but since its objective is the same (importing more than 50 posts in a day), I thought I'd share it. I don't have much knowledge of Python or the GData API; I just set up the environment, followed these steps, and was able to import posts from LiveJournal to Blogger with it.

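Another answer sketches gdata's batch feed mechanism. The snippet below builds a request feed of Google Base items purely to illustrate the AddInsert / ExecuteBatch pattern; you would need to adapt the entry types and feed URL to Blogger posts, and it is not clear that batching bypasses Blogger's daily posting limit: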
# build feed
request_feed = gdata.base.GBaseItemFeed(atom_id=atom.Id(text='test batch'))
# format each object
entry1 = gdata.base.GBaseItemFromString('--XML for your new item goes here--')
entry1.title.text = 'first batch request item'
entry2 = gdata.base.GBaseItemFromString('--XML for your new item here--')
entry2.title.text = 'second batch request item'
# Add each blog item to the request feed
request_feed.AddInsert(entry1)
request_feed.AddInsert(entry2)
# Execute the batch processes through the request_feed (all items)
result_feed = gd_client.ExecuteBatch(request_feed)

Related

Can CampaignPerformanceReportRequest return for all campaigns?

Trying to use the Bing Ads API to duplicate what I see on the Hourly report.
Unfortunately, even though I'm properly authenticated, the data I'm getting back is only for one campaign (one which has about 1 impression per day). I can see the data in the UI just fine, but authenticated as the same user via the API, I can only seem to get back the smaller data set. I'm using https://github.com/BingAds/BingAds-Python-SDK and basing my code on the example:
def get_hourly_report(
        account_id,
        report_file_format,
        return_only_complete_data,
        time):
    report_request = reporting_service.factory.create('CampaignPerformanceReportRequest')
    report_request.Aggregation = 'Hourly'
    report_request.Format = report_file_format
    report_request.ReturnOnlyCompleteData = return_only_complete_data
    report_request.Time = time
    report_request.ReportName = "Hourly Bing Report"

    scope = reporting_service.factory.create('AccountThroughCampaignReportScope')
    scope.AccountIds = {'long': [account_id]}
    # scope.Campaigns = reporting_service.factory.create('ArrayOfCampaignReportScope');
    # scope.Campaigns.CampaignReportScope.append();
    report_request.Scope = scope

    report_columns = reporting_service.factory.create('ArrayOfCampaignPerformanceReportColumn')
    report_columns.CampaignPerformanceReportColumn.append([
        'TimePeriod',
        'CampaignId',
        'CampaignName',
        'DeviceType',
        'Network',
        'Impressions',
        'Clicks',
        'Spend'
    ])
    report_request.Columns = report_columns

    return report_request
I'm not super familiar with these ad data APIs, so any insight will be helpful, even if you don't have a solution.
I spent weeks back and forth with Microsoft Support. Here's the result:
You can get logs out of the examples by adding this code:
import logging
logging.basicConfig(level=logging.INFO)
logging.getLogger('suds.client').setLevel(logging.DEBUG)
logging.getLogger('suds.transport').setLevel(logging.DEBUG)
The issue was related to the way the example is built. In the auth_helper.py file there is a method named authenticate that looks like this:
def authenticate(authorization_data):
    # import logging
    # logging.basicConfig(level=logging.INFO)
    # logging.getLogger('suds.client').setLevel(logging.DEBUG)
    # logging.getLogger('suds.transport.http').setLevel(logging.DEBUG)

    customer_service = ServiceClient(
        service='CustomerManagementService',
        version=13,
        authorization_data=authorization_data,
        environment=ENVIRONMENT,
    )

    # You should authenticate for Bing Ads services with a Microsoft Account.
    authenticate_with_oauth(authorization_data)

    # Set to an empty user identifier to get the current authenticated Bing Ads user,
    # and then search for all accounts the user can access.
    user = get_user_response = customer_service.GetUser(
        UserId=None
    ).User
    accounts = search_accounts_by_user_id(customer_service, user.Id)

    # For this example we'll use the first account.
    authorization_data.account_id = accounts['AdvertiserAccount'][0].Id
    authorization_data.customer_id = accounts['AdvertiserAccount'][0].ParentCustomerId
As you can see, at the very bottom, it says "For this example we'll use the first account." It turns out that my company had two accounts. This was not configurable anywhere and I had no idea this code was here, but you can add a breakpoint there to see your full list of accounts. We only had two, so I flipped the 0 to a 1 and everything started working.
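If you would rather not hard-code an index, a small tweak along these lines should work (a sketch only, reusing the accounts structure the example already returns; TARGET_ACCOUNT_ID is a placeholder):
# Sketch: select the advertiser account by ID instead of taking the first one.
# TARGET_ACCOUNT_ID is a placeholder -- use the account ID shown in the Bing Ads UI.
TARGET_ACCOUNT_ID = 123456789

accounts = search_accounts_by_user_id(customer_service, user.Id)
account = next(
    (a for a in accounts['AdvertiserAccount'] if a.Id == TARGET_ACCOUNT_ID),
    accounts['AdvertiserAccount'][0],  # fall back to the first account
)
authorization_data.account_id = account.Id
authorization_data.customer_id = account.ParentCustomerId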

Python times out after if in for loop

F.W. This isn't just a PRAW question, it leans toward Python more than PRAW. Python people are welcome to contribute, and please note this is not my mother language xD!
Essentially, I'm writing a Reddit bot using the PRAW that does the following:
Loop through "unsaved" posts
Loop through the comments of said posts (targeting subcomments)
If the comment contains "!completed", is written by the submitter OR by a moderator, and the parent comment is not by the submitter:
Do etc., e.x. print("Hey")
No, I didn't explain that too well. Examples are better, so here xD:
Use cases:
- Post by @dudeOne
  - Comment by @dudeTwo
    - Comment with "!completed" by @dudeOne
- Post by @dudeOne
  - Comment by @dudeTwo
    - Comment with "!completed" by @moderatorOne
print("Hey"), and:
- Post by @dudeOne
  - Comment by @dudeOne
    - Comment with "!completed" by @dudeOne
... does nothing, maybe even removes + messages @dudeOne.
Here's my messy code (xD):
import praw
import os
import re

sub = "RedditsQuests"
client_id = os.environ.get('client_id')
client_secret = os.environ.get('client_secret')
password = os.environ.get('pass')

reddit = praw.Reddit(client_id=client_id,
                     client_secret=client_secret,
                     password=password,
                     user_agent='r/RedditsQuests bot',
                     username='TheQuestMaster')

for submission in reddit.subreddit(sub).new(limit=None):
    submission.comments.replace_more(limit=None)
    if submission.saved is False:
        for comment in submission.comments.list():
            if ((("!completed" in comment.body)) and ((comment.is_submitter) or ('RedditsQuests' in comment.author.moderated())) and (comment.parent().author.name is not submission.author.name)):
                print("etc...")
There's a decently-sized stack trace, so I've added it to this bin for your reference. To me it looks like PRAW is timing out because the if-in-for loop is taking too long. I could be wrong though!
The issue (as you've said) is somewhat sporadic but I've narrowed it down. As it turns out, trying to fetch the subreddits moderated by /u/AutoModerator will sometimes time out (presumably because the list is long).
Figuring out the issue
Here's how I found the issue. Skip this section if you're only interested in the solution.
First, I modified your script to use try and except to catch the exception when it happened. Your traceback told me that it was happening on the line that starts with if ((("!completed" in comment.body)), specifically when fetching the subreddits that a user moderates. Here was my modified script:
for submission in reddit.subreddit(sub).new(limit=None):
    submission.comments.replace_more(limit=None)
    if submission.saved is False:
        for comment in submission.comments.list():
            try:
                if (
                    (("!completed" in comment.body))
                    and (
                        (comment.is_submitter)
                        or ("RedditsQuests" in comment.author.moderated())
                    )
                    and (comment.parent().author.name is not submission.author.name)
                ):
                    print("etc...")
            except Exception:
                print(f'Author: {comment.author} ({type(comment.author)})')
And the output:
etc...
etc...
Author: AutoModerator (<class 'praw.models.reddit.redditor.Redditor'>)
etc...
etc...
etc...
Author: AutoModerator (<class 'praw.models.reddit.redditor.Redditor'>)
etc...
etc...
etc...
etc...
etc...
etc...
etc...
Author: AutoModerator (<class 'praw.models.reddit.redditor.Redditor'>)
etc...
Author: AutoModerator (<class 'praw.models.reddit.redditor.Redditor'>)
etc...
etc...
With this in mind I wrote a very simple 3-line script to reproduce the issue:
import praw
reddit = praw.Reddit(...)
print(reddit.redditor("AutoModerator").moderated())
Sometimes this script would succeed but sometimes it would fail with the same socket read timeout. Presumably the timeout happens because AutoModerator moderates so many subreddits (at least 10,000), and the Reddit API takes too long to process the request.
Fixing the issue
Your script tries to determine whether the redditor in question is a moderator of the subreddit. You're doing this by checking if the subreddit is in the list of the user's moderated subreddits, but you can switch this to checking if the user is in the list of the subreddit's moderators. Not only should this not time out, but you'll be saving a lot of network requests because you can just fetch the list of moderators once.
The PRAW documentation of Subreddit shows how we can get a list of moderators of a subreddit. In your case, we can do
moderators = list(reddit.subreddit(sub).moderator())
Then, instead of checking "RedditsQuests" in comment.author.moderated(), we check
comment.author in moderators
Your code then becomes
import praw
import os
import re

sub = "RedditsQuests"
client_id = os.environ.get("client_id")
client_secret = os.environ.get("client_secret")
password = os.environ.get("pass")

reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    password=password,
    user_agent="r/RedditsQuests bot",
    username="TheQuestMaster",
)

moderators = list(reddit.subreddit(sub).moderator())

for submission in reddit.subreddit(sub).new(limit=None):
    submission.comments.replace_more(limit=None)
    if submission.saved is False:
        for comment in submission.comments.list():
            if (
                (("!completed" in comment.body))
                and ((comment.is_submitter) or (comment.author in moderators))
                and (comment.parent().author.name is not submission.author.name)
            ):
                print("etc...")
In my brief testing, this script runs many times faster, since we only get the list of moderators once, rather than fetching all subreddits moderated by all users who commented.
As an unrelated style note, instead of if submission.saved is False you should do if not submission.saved, which is the conventional way to check if a condition is false.
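One more small fix worth making while you're in there: is not compares object identity rather than string equality, so the author-name check is more reliably written with !=, e.g.
and (comment.parent().author.name != submission.author.name)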

Creating a Google shared contact using the API - contact is created but not in the shared Directory

I'm currently using the shared_contacts_profiles.py script to load contacts from an external system into our Google Shared Domain contacts. I'd like to make the process more automated so I've tried to create a shared contact (with just a full name and email address) using a basic python script. The contact is created but it gets added to the administrator's contacts and not the Directory.
My code is
#!/usr/bin/python
import atom
import gdata.data
import gdata.contacts.client
import gdata.contacts.data

def main():
    admin_email = 'admin@mydomain.com'
    admin_password = 'P4ssw0rd'
    domain_index = admin_email.find('@')
    domain = admin_email[domain_index+1:]

    contacts_client = gdata.contacts.client.ContactsClient(domain=domain)
    contacts_client.client_login(email=admin_email,
                                 password=admin_password,
                                 source='shared_contacts_profiles',
                                 account_type='HOSTED')

    new_contact = gdata.contacts.data.ContactEntry()
    new_contact.name = gdata.data.Name(
        full_name=gdata.data.FullName(text='John Doe'))
    new_contact.email.append(gdata.data.Email(address='john.doe@example.com',
                                              primary='true', rel=gdata.data.WORK_REL))

    contact_entry = contacts_client.CreateContact(new_contact)
    print "Contact's ID: %s" % contact_entry.id.text

if __name__ == '__main__':
    main()
I must be missing something fairly simple, but just can't see what it is.
EDIT: I think that shared_contacts_profiles.py sets the domain contact list when it sends batches to Google. I wasn't going to use batches, as there are only ever a couple of contacts to add. I also suspect I should be using gdata.contacts.service.ContactsService and not gdata.contacts.client.ContactsClient.
Thanks
Dave
In the end I used the original code as shown above with some additions. I needed to get the feed uri for the shared domain contact list and then supply that uri in the CreateContact.
feed_url = contacts_client.GetFeedUri(contact_list=domain, projection='full')
contact_entry = contacts_client.CreateContact(new_contact,insert_uri=feed_url)
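For context, a rough sketch of how those two lines slot into main() from the question (replacing the original CreateContact call, and reusing the contacts_client, domain, and new_contact variables defined there):
    # Insert into the domain's shared contact list instead of the admin's own contacts.
    feed_url = contacts_client.GetFeedUri(contact_list=domain, projection='full')
    contact_entry = contacts_client.CreateContact(new_contact, insert_uri=feed_url)
    print "Contact's ID: %s" % contact_entry.id.text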
Thanks
Dave

Google Analytics and Python

I'm brand new at Python and I'm trying to write an extension to an app that imports GA information and parses it into MySQL. There is a shamefully sparse amount of information on the topic. The Google docs only seem to have examples in JS and Java...
...I have gotten to the point where my user can authenticate into GA using SubAuth. That code is here:
import gdata.service
import gdata.analytics
from django import http
from django import shortcuts
from django.shortcuts import render_to_response

def authorize(request):
    next = 'http://localhost:8000/authconfirm'
    scope = 'https://www.google.com/analytics/feeds'
    secure = False  # set secure=True to request secure AuthSub tokens
    session = False
    auth_sub_url = gdata.service.GenerateAuthSubRequestUrl(next, scope, secure=secure, session=session)
    return http.HttpResponseRedirect(auth_sub_url)
So, step next is getting at the data. I have found this library: (beware, UI is offensive) http://gdata-python-client.googlecode.com/svn/trunk/pydocs/gdata.analytics.html
However, I have found it difficult to navigate. It seems like I should be using gdata.analytics.AnalyticsDataEntry.getDataEntry(), but I'm not sure what it is asking me to pass it.
I would love a push in the right direction. I feel I've exhausted google looking for a working example.
Thank you!!
EDIT: I have gotten farther, but my problem still isn't solved. The method below returns data (I believe). The error I get is: "'str' object has no attribute '_BecomeChildElement'". I believe I am returning a feed? However, I don't know how to drill into it. Is there a way for me to inspect this object?
def auth_confirm(request):
    gdata_service = gdata.service.GDataService('iSample_acctSample_v1.0')
    feedUri = 'https://www.google.com/analytics/feeds/accounts/default?max-results=50'
    # request feed
    feed = gdata.analytics.AnalyticsDataFeed(feedUri)
    print str(feed)
Maybe this post can help out. It seems like there are no Analytics-specific bindings yet, so you are working with the generic gdata.
I've been using GA for a little over a year now, and since about April 2009 I have used the Python bindings supplied in a package called python-googleanalytics by Clint Ecker et al. So far, it works quite well.
Here's where to get it: http://github.com/clintecker/python-googleanalytics.
Install it the usual way.
To use it: First, so that you don't have to manually pass in your login credentials each time you access the API, put them in a config file like so:
[Credentials]
google_account_email = youraccount@gmail.com
google_account_password = yourpassword
Name this file '.pythongoogleanalytics' and put it in your home directory.
And from an interactive prompt type:
from googleanalytics import Connection
import datetime
connection = Connection() # pass in id & pw as strings **if** not in config file
account = connection.get_account(<*your GA profile ID goes here*>)
start_date = datetime.date(2009, 12, 1)
end_date = datetime.date(2009, 12, 13)
# account object does the work, specify what data you want w/
# 'metrics' & 'dimensions'; see 'USAGE.md' file for examples
account.get_data(start_date=start_date, end_date=end_date, metrics=['visits'])
The 'get_account' method will return a Python list (in the above instance, bound to the variable 'account'), which contains your data.
You need 3 files within the app. client_secrets.json, analytics.dat and google_auth.py.
Create a module Query.py within the app:
class Query(object):
    def __init__(self, startdate, enddate, filter, metrics):
        self.startdate = startdate.strftime('%Y-%m-%d')
        self.enddate = enddate.strftime('%Y-%m-%d')
        self.filter = "ga:medium=" + filter
        self.metrics = metrics
Example models.py (it has the following function):
import google_auth

service = google_auth.initialize_service()

def total_visit(self):
    object = AnalyticsData.objects.get(utm_source=self.utm_source)
    trial = Query(object.date.startdate, object.date.enddate, object.utm_source, "ga:sessions")
    result = service.data().ga().get(ids='ga:<your-profile-id>', start_date=trial.startdate, end_date=trial.enddate, filters=trial.filter, metrics=trial.metrics).execute()
    total_visit = result.get('rows')
    <yr save command, ColumnName.object.create(data=total_visit) goes here>

Delete all data for a kind in Google App Engine

I would like to wipe out all data for a specific kind in Google App Engine. What is the best way to do this?
I wrote a delete script (hack), but since there is so much data it times out after a few hundred records.
I am currently deleting the entities by their key, and it seems to be faster.
from google.appengine.ext import db

class bulkdelete(webapp.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        try:
            while True:
                q = db.GqlQuery("SELECT __key__ FROM MyModel")
                assert q.count()
                db.delete(q.fetch(200))
                time.sleep(0.5)
        except Exception, e:
            self.response.out.write(repr(e)+'\n')
            pass
from the terminal, I run curl -N http://...
You can now use the Datastore Admin for that: https://developers.google.com/appengine/docs/adminconsole/datastoreadmin#Deleting_Entities_in_Bulk
If I were a paranoid person, I would say Google App Engine (GAE) has not made it easy for us to remove data if we want to. I am going to skip the discussion of index sizes and how they translate 6 GB of data into 35 GB of storage (which you are billed for). That's another story, but they do have ways to work around that: limit the number of properties to create indexes on (automatically generated indexes), et cetera.
The reason I decided to write this post is that I need to "nuke" all my Kinds in a sandbox. I read about it and finally came up with this code:
package com.intillium.formshnuker;

import java.io.IOException;
import java.util.ArrayList;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.labs.taskqueue.QueueFactory;
import com.google.appengine.api.labs.taskqueue.TaskOptions.Method;
import static com.google.appengine.api.labs.taskqueue.TaskOptions.Builder.url;

@SuppressWarnings("serial")
public class FormsnukerServlet extends HttpServlet {

    public void doGet(final HttpServletRequest request, final HttpServletResponse response) throws IOException {

        response.setContentType("text/plain");

        final String kind = request.getParameter("kind");
        final String passcode = request.getParameter("passcode");

        if (kind == null) {
            throw new NullPointerException();
        }
        if (passcode == null) {
            throw new NullPointerException();
        }
        if (!passcode.equals("LONGSECRETCODE")) {
            response.getWriter().println("BAD PASSCODE!");
            return;
        }

        System.err.println("*** deleting entities from " + kind);

        final long start = System.currentTimeMillis();
        int deleted_count = 0;
        boolean is_finished = false;

        final DatastoreService dss = DatastoreServiceFactory.getDatastoreService();

        while (System.currentTimeMillis() - start < 16384) {

            final Query query = new Query(kind);
            query.setKeysOnly();

            final ArrayList<Key> keys = new ArrayList<Key>();
            for (final Entity entity: dss.prepare(query).asIterable(FetchOptions.Builder.withLimit(128))) {
                keys.add(entity.getKey());
            }
            keys.trimToSize();

            if (keys.size() == 0) {
                is_finished = true;
                break;
            }

            while (System.currentTimeMillis() - start < 16384) {
                try {
                    dss.delete(keys);
                    deleted_count += keys.size();
                    break;
                } catch (Throwable ignore) {
                    continue;
                }
            }
        }

        System.err.println("*** deleted " + deleted_count + " entities from " + kind);

        if (is_finished) {
            System.err.println("*** deletion job for " + kind + " is completed.");
        } else {
            final int taskcount;
            final String tcs = request.getParameter("taskcount");
            if (tcs == null) {
                taskcount = 0;
            } else {
                taskcount = Integer.parseInt(tcs) + 1;
            }
            QueueFactory.getDefaultQueue().add(
                url("/formsnuker?kind=" + kind + "&passcode=LONGSECRETCODE&taskcount=" + taskcount).method(Method.GET));
            System.err.println("*** deletion task # " + taskcount + " for " + kind + " is queued.");
        }

        response.getWriter().println("OK");
    }
}
I have over 6 million records. That's a lot. I have no idea what the cost will be to delete them (maybe it's more economical not to delete them). Another alternative would be to request deletion of the entire application (sandbox), but that's not realistic in most cases.
I decided to go with smaller groups of records (via an easy query). I know I could go for 500 entities at a time, but then I started receiving very high failure rates (on the delete call).
My request from GAE team: please add a feature to delete all entities of a kind in a single transaction.
Presumably your hack was something like this:
# Deleting all messages older than "earliest_date"
q = db.GqlQuery("SELECT * FROM Message WHERE create_date < :1", earliest_date)
results = q.fetch(1000)
while results:
    db.delete(results)
    results = q.fetch(1000, len(results))
As you say, if there's sufficient data, you're going to hit the request timeout before it gets through all the records. You'd have to re-invoke this request multiple times from outside to ensure all the data was erased; easy enough to do, but hardly ideal.
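For example, a minimal external driver could look something like this (a sketch only; the URL and the "done" marker it looks for are assumptions about what your handler prints):
# Sketch: repeatedly hit the cleanup handler until it reports there is nothing left.
# CLEANUP_URL and the "done" marker are placeholders -- adapt them to your handler.
import time
import urllib2

CLEANUP_URL = "http://your-app.appspot.com/cleanup"

while True:
    body = urllib2.urlopen(CLEANUP_URL).read()
    print body
    if "done" in body.lower():  # assumed marker emitted by the handler when finished
        break
    time.sleep(1)  # small pause between requests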
The admin console doesn't seem to offer any help, as (from my own experience with it), it seems to only allow entities of a given type to be listed and then deleted on a page-by-page basis.
When testing, I've had to purge my database on startup to get rid of existing data.
I would infer from this that Google operates on the principle that disk is cheap, and so data is typically orphaned (indexes to redundant data replaced), rather than deleted. Given there's a fixed amount of data available to each app at the moment (0.5 GB), that's not much help for non-Google App Engine users.
Try using App Engine Console, then you don't even have to deploy any special code.
I've tried db.delete(results) and App Engine Console, and neither of them seems to be working for me. Manually removing entries from the Data Viewer (with the limit increased to 200) didn't work either, since I have uploaded more than 10000 entries. I ended up writing this script:
from google.appengine.ext import db
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app
import wsgiref.handlers
from mainPage import YourData  # replace this with your data

class CleanTable(webapp.RequestHandler):
    def get(self, param):
        txt = self.request.get('table')
        q = db.GqlQuery("SELECT * FROM "+txt)
        results = q.fetch(10)
        self.response.headers['Content-Type'] = 'text/plain'
        # replace yourapp and YourData with your app info below.
        self.response.out.write("""
        <html>
        <meta HTTP-EQUIV="REFRESH" content="5; url=http://yourapp.appspot.com/cleanTable?table=YourData">
        <body>""")
        try:
            for i in range(10):
                db.delete(results)
                results = q.fetch(10, len(results))
                self.response.out.write("<p>10 removed</p>")
            self.response.out.write("""
            </body>
            </html>""")
        except Exception, inst:
            self.response.out.write(str(inst))

def main():
    application = webapp.WSGIApplication([
        ('/cleanTable(.*)', CleanTable),
    ])
    wsgiref.handlers.CGIHandler().run(application)
The trick was to include a redirect in the HTML instead of using self.redirect. I'm ready to wait overnight to get rid of all the data in my table. Hopefully, the GAE team will make it easier to drop tables in the future.
The official answer from Google is that you have to delete in chunks spread over multiple requests. You can use AJAX, meta refresh, or request your URL from a script until there are no entities left.
The fastest and most efficient way to handle bulk deletes on Datastore is by using the new mapper API announced at the latest Google I/O.
If your language of choice is Python, you just have to register your mapper in a mapreduce.yaml file and define a function like this:
from mapreduce import operation as op

def process(entity):
    yield op.db.Delete(entity)
On Java you should have a look at this article, which suggests a function like this:
@Override
public void map(Key key, Entity value, Context context) {
    log.info("Adding key to deletion pool: " + key);
    DatastoreMutationPool mutationPool = this.getAppEngineContext(context)
            .getMutationPool();
    mutationPool.delete(value.getKey());
}
One tip: I suggest you get to know the remote_api for these types of uses (bulk deleting, modifying, etc.). But even with the remote_api, batch size can be limited to a few hundred at a time.
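A rough sketch of that approach, run from a local machine against the old Python 2 SDK (the app id, credentials, and kind are placeholders; check the remote_api docs for the exact setup your SDK version expects):
# Sketch: batch-delete a kind over remote_api from a local machine.
# 'your-app-id', the credentials, and 'MyModel' are placeholders.
from google.appengine.ext.remote_api import remote_api_stub
from google.appengine.ext import db

def auth_func():
    return ('admin@example.com', 'password')  # placeholder credentials

remote_api_stub.ConfigureRemoteApi(None, '/_ah/remote_api', auth_func,
                                   'your-app-id.appspot.com')

while True:
    keys = db.GqlQuery("SELECT __key__ FROM MyModel").fetch(200)
    if not keys:
        break
    db.delete(keys)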
Unfortunately, there's no way to easily do a bulk delete. Your best bet is to write a script that deletes a reasonable number of entries per invocation, and then call it repeatedly - for example, by having your delete script return a 302 redirect whenever there's more data to delete, then fetching it with "wget --max-redirect=10000" (or some other large number).
With Django, set up the URL:
url(r'^Model/bdelete/$', v.bulk_delete_models, {'model':'ModelKind'}),
Set up the view:
def bulk_delete_models(request, model):
    import time
    limit = request.GET['limit'] or 200
    start = time.clock()
    set = db.GqlQuery("SELECT __key__ FROM %s" % model).fetch(int(limit))
    count = len(set)
    db.delete(set)
    return HttpResponse("Deleted %s %s in %s" % (count, model, (time.clock() - start)))
Then run it in PowerShell:
$client = new-object System.Net.WebClient
$client.DownloadString("http://your-app.com/Model/bdelete/?limit=400")
If you are using Java/JPA you can do something like this:
EntityManager em = EntityManagerFactoryUtils.getTransactionalEntityManager(entityManagerFactory);
Query q = em.createQuery("delete from Table t");
int number = q.executeUpdate();
Java/JDO info can be found here: http://code.google.com/appengine/docs/java/datastore/queriesandindexes.html#Delete_By_Query
Yes you can:
Go to Datastore Admin, then select the entity type you want to delete and click Delete.
Mapreduce will take care of deleting!
On a dev server, you can cd to your app's directory and then run:
dev_appserver.py --clear_datastore=yes .
Doing so will start the app and clear the datastore. If you already have another instance running, the app won't be able to bind to the needed IP, and will therefore fail to start... and to clear your datastore.
You can use the task queues to delete chunks of say 100 objects.
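A minimal sketch of that idea using the old Python runtime's deferred library (the kind name and batch size are placeholders; the handler re-enqueues itself until the kind is empty):
# Sketch: delete a kind in batches via the task queue, using deferred.
# "MyModel" and BATCH_SIZE are placeholders.
from google.appengine.ext import db, deferred

BATCH_SIZE = 100

def delete_kind(kind, cursor=None):
    q = db.GqlQuery("SELECT __key__ FROM %s" % kind)
    if cursor:
        q.with_cursor(cursor)
    keys = q.fetch(BATCH_SIZE)
    if keys:
        db.delete(keys)
        # Chain another task until nothing is left.
        deferred.defer(delete_kind, kind, q.cursor())

# Kick it off once, e.g. from a request handler:
# deferred.defer(delete_kind, "MyModel")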
Deleting objects in GAE shows how limited the admin capabilities are in GAE. You have to work with batches of 1000 entities or fewer. You can use the bulkloader tool that works with CSVs, but the documentation does not cover Java.
I am using GAE Java, and my strategy for deletions involves having two servlets, one for doing the actual delete and another to load the task queues. When I want to do a delete, I run the queue-loading servlet; it loads the queues, and then GAE goes to work executing all the tasks in the queue.
How to do it:
Create a servlet that deletes a small number of objects.
Add the servlet to your task queues.
Go home or work on something else ;)
Check the datastore every so often ...
I have a datastore with about 5000 objects that I purge every week, and it takes about 6 hours to clean out, so I run the task on Friday night.
I use the same technique to bulk load my data, which happens to be about 5000 objects with about a dozen properties.
This worked for me:
class ClearHandler(webapp.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        q = db.GqlQuery("SELECT * FROM SomeModel")
        self.response.out.write("deleting...")
        db.delete(q)
Thank you all guys, I got what I need. :D
This may be useful if you have lots of db models to delete; you can dispatch it from your terminal. You can also manage the delete list in DB_MODEL_LIST yourself.
Delete DB_1:
python bulkdel.py 10 DB_1
Delete All DB:
python bulkdel.py 11
Here is the bulkdel.py file:
import sys, os

URL = 'http://localhost:8080'
DB_MODEL_LIST = ['DB_1', 'DB_2', 'DB_3']

# Delete Model
if sys.argv[1] == '10':
    command = 'curl %s/clear_db?model=%s' % (URL, sys.argv[2])
    os.system(command)

# Delete All DB Models
if sys.argv[1] == '11':
    for model in DB_MODEL_LIST:
        command = 'curl %s/clear_db?model=%s' % (URL, model)
        os.system(command)
And here is the modified version of alexandre fiori's code.
from google.appengine.ext import db

class DBDelete(webapp.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        db_model = self.request.get('model')
        sql = 'SELECT __key__ FROM %s' % db_model
        try:
            while True:
                q = db.GqlQuery(sql)
                assert q.count()
                db.delete(q.fetch(200))
                time.sleep(0.5)
        except Exception, e:
            self.response.out.write(repr(e)+'\n')
            pass
And of course, you should map the URL to the handler in a file (like main.py in GAE) ;)
In case someone like me needs it in detail, here is part of main.py:
from google.appengine.ext import webapp
import utility # DBDelete was defined in utility.py
application = webapp.WSGIApplication([('/clear_db',utility.DBDelete ),('/',views.MainPage )],debug = True)
To delete all entities of a given kind in Google App Engine, you only need to do the following:
from google.cloud import datastore

client = datastore.Client()
query = client.query(kind=<KIND>)
results = query.fetch()
for result in results:
    client.delete(result.key)
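If the kind is large, a batched variant along these lines should be considerably faster (a sketch; keys_only and delete_multi are part of the google-cloud-datastore client, and 500 is the per-call mutation limit):
# Sketch: fetch only keys and delete them in batches of up to 500 per call.
from google.cloud import datastore

client = datastore.Client()
query = client.query(kind=<KIND>)   # placeholder kind, as above
query.keys_only()                   # no need to pull full entities
keys = [entity.key for entity in query.fetch()]

for i in range(0, len(keys), 500):
    client.delete_multi(keys[i:i + 500])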
In JavaScript, the following will delete all the entries for one page:
document.getElementById("allkeys").checked=true;
checkAllEntities();
document.getElementById("delete_button").setAttribute("onclick","");
document.getElementById("delete_button").click();
given that you are on the admin-page (.../_ah/admin) with the entities you want to delete.
