How to delete entities not found in feed on GAE - python

I am updating and adding items from a feed (which can have about 40,000 items) to the datastore 200 items at a time. The problem is that the feed can change, and some items might be deleted from the feed.
I have this code:
class FeedEntry(db.Model):
    name = db.StringProperty(required=True)

def updateFeed(offset, number=200):
    response = fetchFeed(offset, number)
    feedItems = parseFeed(response)
    feedEntriesToAdd = []
    for item in feedItems:
        feedEntriesToAdd.append(
            FeedEntry(key_name=item.id, name=item.name)
        )
    db.put(feedEntriesToAdd)
How do I find out which items were not in the feed and delete them from the datastore?
I thought about creating a list of the items in the datastore, removing from it every item that I update, and treating whatever is left as the items to delete - but that seems rather slow.
PS: All item.id values are unique to their feed item and are consistent.

If you add a DateTimeProperty with auto_now=True, it will record the last modified time of each entity. Since you update every item in the feed, by the time you've finished they will all have times after the moment you started, so anything with a date before then isn't in the feed any more.
Xavier's generation counter is just as good - all we need is something guaranteed to increase between refreshes, and never decrease during a refresh.
Not sure from the docs, but I expect a DateTimeProperty is bigger than an IntegerProperty. The latter is a 64 bit integer, so they might be the same size, or it may be that DateTimeProperty stores several integers. A group post suggests maybe it's 10 bytes as opposed to 8.
But remember that by adding an extra property that you do queries on, you're adding another index anyway, so the difference in size of the field is diluted as a proportion of the overhead. Further, 40k times a few bytes isn't much even at $0.24/G/month.
With either a generation or a datetime, you don't necessarily have to delete the stale data immediately. Your other queries can filter on the date/generation of the most recent refresh, so entities from older refreshes are simply ignored. If the feed (or your parsing of it) goes funny and fails to produce any items, or only produces a few, it might be useful to have the last refresh lying around as a backup. Whether that's worth having depends entirely on the app.
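A minimal sketch of that timestamp approach, reusing the question's model and helpers (the deleteMissing name and the refresh_started_at value, recorded with datetime.datetime.utcnow() before the first updateFeed call, are assumptions):

class FeedEntry(db.Model):
    name = db.StringProperty(required=True)
    # Set automatically to the current time on every put().
    updated = db.DateTimeProperty(auto_now=True)

def deleteMissing(refresh_started_at):
    # Anything not touched since the refresh began is no longer in the feed.
    q = FeedEntry.all(keys_only=True).filter('updated <', refresh_started_at)
    keys = q.fetch(500)
    while keys:
        db.delete(keys)
        keys = q.fetch(500)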

I would add a generation counter
class FeedEntry(db.Model):
    name = db.StringProperty(required=True)
    generation = db.IntegerProperty(required=True)

def updateFeed(offset, generation, number=200):
    response = fetchFeed(offset, number)
    feedItems = parseFeed(response)
    feedEntriesToAdd = []
    for item in feedItems:
        feedEntriesToAdd.append(
            FeedEntry(key_name=item.id, name=item.name, generation=generation)
        )
    db.put(feedEntriesToAdd)

def deleteOld(generation):
    q = db.GqlQuery("SELECT * FROM FeedEntry " +
                    "WHERE generation != :1", generation)
    db.delete(q)
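A hypothetical driver tying the two functions together (the refreshFeed name, the time-based generation value and the 40,000-item bound are illustrative, not part of the original answer):

import time

def refreshFeed(total_items=40000, batch=200):
    # Any value that increases between refreshes works as the generation.
    generation = int(time.time())
    for offset in range(0, total_items, batch):
        updateFeed(offset, generation, number=batch)
    # Entities that did not receive this generation are no longer in the feed.
    deleteOld(generation)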

Related

How do I make sure a model field is an incremental number for my model?

I have the following model in django:
class Page(models.Model):
    page_number = models.IntegerField()
    ...
and I would like to make sure that this page number keeps being a sequence of integers without gaps, even if I delete some pages in the middle of the existing pages in the database. For example, I have pages 1, 2 and 3, delete page 2, and ensure page 3 becomes page 2.
At the moment, I am not updating the page_number, but rather reconstructing an increasing sequence without gaps in my front end by:
querying the pages
sorting them according to page_number
assigning a new page_order which is incremental and without gaps
But this does not seem to be the best way to go...
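For reference, a sketch of that front-end renumbering, assuming it happens in Python after the pages are fetched (page_order here is purely a display value and is never saved):

def renumbered(pages):
    # Sort by the stored page_number and assign a gap-free display order.
    ordered = sorted(pages, key=lambda p: p.page_number)
    for display_number, page in enumerate(ordered, start=1):
        page.page_order = display_number
    return ordered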
Basically you'd have to manually bump all of the pages down
When you get to custom views you'd do something like this:
def deletePage(request):
    if request.method == 'POST':
        pageObj = Page.objects.filter(page_number=request.POST.get('page_number')).first()
        if pageObj:
            pageObj.delete()
            # Note: Using F() means Django doesn't need to fetch the value
            # from the db before subtracting - it's a blind change, but it's faster.
            from django.db.models import F
            for i in Page.objects.filter(page_number__gt=request.POST.get('page_number')):
                i.page_number = F('page_number') - 1
                i.save()
        else:
            # No Page object found - raise some error here
            pass
The admin page is tougher though; you'd basically do the same thing, but in the methods described in: Django admin: override delete method
Note: Deleting multiple pages would be tough, especially if you're deleting pages 2 + 4 + 5. Possible, but a lot of thinking involved.
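As an aside, the per-page save loop can usually be replaced with a single UPDATE; this is a sketch against the same Page model, not part of the original answer (the two-value return of queryset delete() assumes Django 1.9+):

from django.db.models import F

def delete_page(page_number):
    deleted, _ = Page.objects.filter(page_number=page_number).delete()
    if deleted:
        # One SQL UPDATE shifts every later page down by one.
        Page.objects.filter(page_number__gt=page_number).update(
            page_number=F('page_number') - 1
        )
    return deleted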

Elasticsearch-Py bulk not indexing all documents

I am using the elasticsearch-py Python package to interact with Elasticsearch through code. I have a script that is meant to take each document from one index, generate a field + value, then re-index it into a new index.
The issue is that there are 1216 documents in the first index, but only 1000 documents make it to the second one. Typically it is exactly 1000 documents, occasionally a bit higher at around 1100, but it never reaches the full 1216.
I usually keep the batch_size at 200, but changing it around seems to have some effect on the number of documents that make it to the second index. Changing it to 400 will typically result in about 800 documents being transferred. Using parallel_bulk seems to have the same results as using bulk.
I believe the issue is with the generating process I am performing. For each document I am generating its ancestry (they are organized in a tree structure) by recursively getting its parent from the first index. This involves rapid document GET requests interwoven with Bulk API calls to index the documents and Scroll API calls to get the documents from the index in the first place.
Would activity like this cause the documents to not go through? If I remove (comment out) the recursive GET requests, all documents seem to go through every time. I have tried creating multiple Elasticsearch clients, but that wouldn't even help if ES itself is the bottleneck.
Here is the code if you're curious:
def complete_resources():
    for result in helpers.scan(client=es, query=query, index=TEMP_INDEX_NAME):
        resource = result["_source"]
        ancestors = []
        parent = resource.get("parent")
        while parent is not None:
            ancestors.append(parent)
            parent = es.get(
                index=TEMP_INDEX_NAME,
                doc_type=TEMPORARY_DOCUMENT_TYPE,
                id=parent["uid"]
            ).get("_source").get("parent")
        resource["ancestors"] = ancestors
        resource["_id"] = resource["uid"]
        yield resource
This generator is consumed by helpers.parallel_bulk()
for success, info in helpers.parallel_bulk(
        client=es,
        actions=complete_resources(),
        thread_count=10,
        queue_size=12,
        raise_on_error=False,
        chunk_size=INDEX_BATCH_SIZE,
        index=new_primary_index_name,
        doc_type=PRIMARY_DOCUMENT_TYPE,
):
    if success:
        successful += 1
    else:
        failed += 1
        print('A document failed:', info)
This gives me the following result:
Time: 7 seconds
Successful: 1000
Failed: 0

Mongoengine - How to perform a "save new item or increment counter" operation?

I'm using MongoEngine in a web-scraping project. I would like to keep track of all the images I've encountered on all the scraped webpages.
To do so, I store the image src URL and the number of times the image has been encountered.
The MongoEngine model definition is the following:
class ImagesUrl(Document):
    """ Model representing images encountered during web-scraping.

    When an image is encountered on a web-page during scraping,
    we store its url and the number of times it has been
    seen (default counter value is 1).

    If the image has been seen before, we do not insert a new document
    into the collection, but merely increment the corresponding counter value.
    """
    # The url of the image. There cannot be any duplicates.
    src = URLField(required=True, unique=True)

    # Counter of the total number of occurrences of the image during
    # the data-mining process.
    counter = IntField(min_value=0, required=True, default=1)
I'm looking for the proper way to implement the "save or increment" process.
So far, I'm handling it that way, but I feel there might be a better, built-in way of doing it with MongoEngine:
def save_or_increment(self):
    """ If it is the first time the image has been encountered, insert
    its src in mongo, along with a counter=1 value.
    If not, increment its counter value by 1.
    """
    # Check if the item is already stored; if not, save a new item.
    if not ImagesUrl.objects(src=self.src):
        ImagesUrl(
            src=self.src,
            counter=self.counter,
        ).save()
    else:
        # If the item is already stored in Mongo, just increment its counter.
        ImagesUrl.objects(src=self.src).update_one(inc__counter=1)
Is there a better way of doing it?
Thank you very much for your time.
You should be able to just do an upsert, e.g.:
ImagesUrl.objects(src=self.src).update_one(
    upsert=True,
    inc__counter=1,
    set__src=self.src)
update_one, as in @ross's answer, returns the count of modified documents (or the full result of the update); it will not return the document or the new counter value. If you want that, you should use upsert_one:
images_url = ImagesUrl.objects(src=self.src).upsert_one(
    inc__counter=1,
    set__src=self.src)
print images_url.counter
It will create the document if it does not exist, or modify it and increase the counter.
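Folding the upsert_one suggestion back into the question's method might look like this (a sketch against the same ImagesUrl model; with an upsert, $inc creates the counter at 1 on the first insert):

def save_or_increment(self):
    """ Insert the image src with counter=1 the first time it is seen,
    otherwise atomically increment its counter, all in a single upsert.
    """
    return ImagesUrl.objects(src=self.src).upsert_one(
        inc__counter=1,
        set__src=self.src,
    )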

Getting The Most Recent Data Item - Google App Engine - Python

I need to retrieve the most recent item added to a collection. Here is how I'm doing it:
class Box(db.Model):
    ID = db.IntegerProperty()

class Item(db.Model):
    box = db.ReferenceProperty(Box, collection_name='items')
    date = db.DateTimeProperty(auto_now_add=True)

# get the most recent item
lastItem = box.items.order('-date')[0]
Is this an expensive way to do it? Is there a better way?
If you are going to iterate over a list of boxes, that is a very bad way to do it. You will run an additional query for every box. You can easily see what is going on with Appstats.
If you are doing one of those per request, it may be OK, but it is not ideal. You might also want to use lastItem = box.items.order('-date').get(); get() will only return the first result to the app.
If possible, it would be significantly faster to add a lastItem property to Box, or to store the Box ID (your attribute) on Item. In other words, denormalize the data. If you are going to fetch a list of Boxes and their most recent items, you need this type of approach.
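A sketch of the denormalized version (the last_item_key property and the add_item helper are assumptions, not the answer's exact code; for strong consistency you would want the two puts in a transaction or the same entity group):

class Box(db.Model):
    ID = db.IntegerProperty()
    # Denormalized: encoded key of the most recently added Item.
    last_item_key = db.StringProperty()

def add_item(box):
    item = Item(box=box)  # Item as defined in the question
    item.put()
    box.last_item_key = str(item.key())
    box.put()
    return item

# Later, fetch by key with no query at all:
last_item = db.get(box.last_item_key) if box.last_item_key else None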

Aggregating across columns in Django

I'm trying to figure out if there's a way to do a somewhat-complex aggregation in Django using its ORM, or if I'm going to have to use extra() to stick in some raw SQL.
Here are my object models (stripped to show just the essentials):
class Submission(models.Model):
    favorite_of = models.ManyToManyField(User, related_name="favorite_submissions")

class Response(models.Model):
    submission = models.ForeignKey(Submission)
    voted_up_by = models.ManyToManyField(User, related_name="voted_up_responses")
What I want to do is sum all the votes for a given submission: that is, all of the votes for any of its responses, plus the number of people who marked the submission as a favorite.
I have the first part working using the following code; this returns the total votes for all responses of each submission:
submission_list = Response.objects\
    .values('submission')\
    .annotate(votes=Count('voted_up_by'))\
    .filter(votes__gt=0)\
    .order_by('-votes')[:TOP_NUM]
(So after getting the vote total, I sort in descending order and return the top TOP_NUM submissions, to get a "best of" listing.)
That part works. Is there any way you can suggest to include the number of people who have favorited each submission in its votes? (I'd prefer to avoid extra() for portability, but I'm thinking it may be necessary, and I'm willing to use it.)
EDIT: I realized after reading the suggestions below that I should have been clearer in my description of the problem. The ideal solution would be one that allowed me to sort by total votes (the sum of voted_up_by and favorited) and then pick just the top few, all within the database. If that's not possible then I'm willing to load a few of the fields of each response and do the processing in Python; but since I'll be dealing with 100,000+ records, it'd be nice to avoid that overhead. (Also, to Adam and Dmitry: I'm sorry for the delay in responding!)
One possibility would be to re-arrange your current query slightly. What if you tried something like the following:
submission_list = Response.objects\
    .annotate(votes=Count('voted_up_by'))\
    .filter(votes__gt=0)\
    .order_by('-votes')[:TOP_NUM]
submission_list.query.group_by = ['submission_id']
This will return a queryset of Response objects (objects with the same Submission will be lumped together). In order to access the related submission and/or the favorite_of list/count, you have two options:
num_votes = submission_list[0].votes
submission = submission_list[0].submission
num_favorite = submission.favorite_of.count()
or...
submissions = []
for response in submission_list:
    submission = response.submission
    submission.votes = response.votes
    submissions.append(submission)

num_votes = submissions[0].votes
submission = submissions[0]
num_favorite = submission.favorite_of.count()
Basically the first option has the benefit of still being a queryset, but you have to be sure to access the submission object in order to get any info about the submission (since each object in the queryset is technically a Response). The second option has the benefit of being a list of the submissions with both the favorite_of list as well as the votes, but it is no longer a queryset (so be sure you don't need to alter the query anymore afterwards).
You can count favorites in another query, like:
favorite_list = Submission.objects.annotate(favorites=Count('favorite_of'))
After that you add the values from two lists:
total_votes = {}
for item in submission_list:
    total_votes[item.submission.id] = item.votes
for item in favorite_list:
    has_votes = total_votes.get(item.id, 0)
    total_votes[item.id] = has_votes + item.favorites
I am using ids in the dictionary because Submission objects will not be identical. If you need the Submissions themselves, you may use one more dictionary, or store a (submission, votes) tuple instead of just the votes.
Added: this solution is better than the previous one because it needs only two DB queries.
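To finish with the "best of" listing the question asks for, the merged totals can then be sorted in Python; a sketch (TOP_NUM is the question's constant, the rest is illustrative):

top_ids = sorted(total_votes, key=total_votes.get, reverse=True)[:TOP_NUM]
top_submissions = Submission.objects.in_bulk(top_ids)
ranked = [(top_submissions[pk], total_votes[pk]) for pk in top_ids]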
