use standard datastore index or build my own - python

I am running a web app on Google App Engine with Python. The app lets users post topics and respond to them, and the website is basically a collection of these posts categorized onto different pages.
I only have around 200 posts and 30 visitors a day right now, but that is already using nearly 20% of my datastore read quota and 10% of my write quota. I am wondering whether it is more efficient to use Google App Engine's built-in get_by_id() function to retrieve posts by their IDs, or whether it is better to build my own index. Some of my queries will have to use GQL or the built-in query language because they filter on more than just an ID, but I wanted to see which approach was better.

Are you doing efficient caching? (or any caching at all).
Also, if you're using that many writes for 200 posts, it seems you might have a problem with your models. Have you looked at the Datastore viewer to see how many writes you use per entity?
You might read the docs on Exploding indexes, maybe that's part of your problem?
It's way better to use get_by_id(). It finds the exact object, and costs way less (counts as a query with only one entity).
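To illustrate the caching suggestion above, here is a minimal sketch of a read-through cache in front of a lookup by ID. It does not use the real App Engine APIs: DATASTORE, CACHE, STATS and all three functions are hypothetical in-memory stand-ins for the datastore and memcache, so treat it as a picture of the idea rather than working App Engine code.

```python
# In-memory stand-ins for the datastore and memcache (assumptions, not real APIs).
DATASTORE = {1: {"title": "First post"}, 2: {"title": "Second post"}}
CACHE = {}
STATS = {"reads": 0}  # counts simulated datastore reads


def datastore_get_by_id(post_id):
    """Simulate a single-entity lookup by ID and count the read."""
    STATS["reads"] += 1
    entity = DATASTORE.get(post_id)
    return dict(entity) if entity is not None else None


def get_post(post_id):
    """Return a post, touching the datastore only on a cache miss."""
    key = "post:%d" % post_id
    post = CACHE.get(key)
    if post is None:
        post = datastore_get_by_id(post_id)
        if post is not None:
            CACHE[key] = post
    return post


def update_post(post_id, **fields):
    """Write to the datastore and drop the now-stale cache entry."""
    DATASTORE[post_id].update(fields)
    CACHE.pop("post:%d" % post_id, None)


if __name__ == "__main__":
    for _ in range(3):
        get_post(1)
    print("datastore reads after three page views:", STATS["reads"])
```

With a pattern like this, repeated views of the same post cost one datastore read until the post is edited, which is usually where most of the read quota goes on a small site.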

I'd suggest using pre-existing code and building around that instead of re-inventing the wheel.


GCP Datastore vs Search API performance benchmarks?

Are there any existing benchmarks about GCP Datastore queries and Search queries performance?
I'm interested how the performance changes as the data grows. For instance, if we have:
class Project(ndb.Model):
    members = ndb.StringProperty(repeated=True)
and we have a document in Search like:
search.Document(fields=[search.AtomField(name='member', value='value...'), ...])
I want to run a search to get all project ids the user is a member of. Something like:
Project.query(Project.members == 'This Member').fetch(keys_only=True)
in Datastore and a similar query in Search.
How would the performance compare when there are 10, 100, ... 16 * 6 objects?
I'm interested in whether there is some rule of thumb about the latency we could expect for this simple kind of query. Of course I can go and try that, but I would like to get some intuitive idea about the performance I can expect beforehand, if someone has done similar benchmarks. Also, I would like to avoid spending $ and time on writing/reading data I would later need to delete, so if someone could share their experience, that would be much appreciated!
p.s. I use Python, but would assume the answer would be same/similar for all languages which have support for GCP.
At the moment, the Search API is only supported on Python 2. Unfortunately, that version of Python is no longer supported, so you should not expect to receive support for this service.
On the other hand, take a look at the code provided in this thread; it can give you an idea of how to run a benchmark test for Datastore using Python 3.
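If you do end up measuring it yourself, a small timing harness is enough to get started. The sketch below is an assumption on my part, not the code from that thread: benchmark() times any callable, and make_fake_membership_query() is a hypothetical in-memory stand-in that does a linear scan, so its timings say nothing about how Datastore's indexes actually scale. You would swap in a function that runs your real keys-only query.

```python
import statistics
import time


def benchmark(run_query, repeats=20):
    """Call run_query() repeatedly and return latency stats in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "min_ms": samples[0],
        "median_ms": statistics.median(samples),
        "max_ms": samples[-1],
    }


def make_fake_membership_query(n_projects, member="This Member"):
    """Build an in-memory stand-in for the keys-only membership query.

    Every tenth project has `member` in its member set. This is a linear
    scan, unlike an indexed Datastore query.
    """
    projects = {
        i: {"user%d" % i, member if i % 10 == 0 else "other"}
        for i in range(n_projects)
    }

    def run_query():
        return [pid for pid, members in projects.items() if member in members]

    return run_query


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, benchmark(make_fake_membership_query(n)))
```

Reporting min, median and max rather than a single average makes it easier to spot cold-start or network outliers once a real backend is behind run_query.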

Django + Scrapy multi scrapers architecture

I recently took over a Django project in which one component is a large set of Scrapy scrapers (they are core functionality). The scrapers simply feed the database several times a day, and the Django web app uses this data.
The scrapers have direct access to the Django models, but in my opinion that is not the best idea: it mixes responsibilities, since Django should act as a web app rather than also host the scrapers, shouldn't it? For example, after such a split the scrapers could be run serverless, saving money and being spawned only when needed.
I see the scrapers as at least a separate component in the architecture. But if I separated them from the Django website, I would need to populate the DB from there as well, and any model change in either the Django web app or the scraping app would require a matching change in the other.
I haven't really seen any articles about splitting such apps.
What are the best practices here? Is it worth splitting it? How would you organise deployment to a cloud solution (e.g. AWS)?
Thank you
Well, this is a big discussion and I have the same "good problem".
Short answer:
If you want to separate them, I suggest separating the logic from the data using different schemas. I have done it before and it is a good approach.
Long answer:
The questions are:
Once you gather information from the scrapers, are you doing something with it (aggregation, treatment, or anything else)?
If the answer is yes, you can split it into two DBs: one with the raw information and the other with the treated data (which is the one shared with Django).
If the answer is no, I don't see any reason to separate it. In the end, Django is only the visualizer of the data.
Is the Django website using a lot of stored data of its own that, for the sake of single responsibility, you want to keep separate from the scraped data?
If the answer is yes, separate it by schemas or even by DB.
If the answer is no, you can store it in the same DB as Django. In the end, the important data will be the extracted data. Django may have a configuration DB or other extra data to manage the web app, but the bulk of the DB will be the crawled/treated data. It depends on how much it will cost you to separate and maintain it. If you are starting from scratch, I would do it separately.

Searching with the pyramid framework

I'm trying to implement a search function into my website, which is running on pyramid, and I was wondering what is the most efficient way of approaching this problem. I am currently looking into Whoosh and MySQL full text searching with SqlAlchemy. I need a fast and simple implementation, and wondering which one would be the best choice.
I tried using full-text search with the native database for a while, and it was just too much work to keep things working across SQLite, MySQL, and PostgreSQL. I ported all the search code over to Whoosh and have been really happy ever since. It performs well for small workloads, is pure Python, and has no server to set up.
You implement it almost like writing and updating a file on disk. From what I've read it does well into the single millions of documents. I'm using it with some 18k documents with an index size of around 100MB. There's a lot of flexibility to implement various tokenizing and other config with it. I really suggest people start there and, if they outgrow Whoosh, then look at starting up extra processes with Elasticsearch, Lucene/Solr, and the like.
You can see how I've got it implemented here:
and I update it using SqlAlchemy event hooks:
and you can judge a basic implementation of it by searching:
I'm a huge fan of ElasticSearch. It's the easiest to set up, maintain, and work with.
I generally use requests.
to index:
to search:
you can get way more in depth in searching using the DSL:

Dynamically Created Top Articles List in Django?

I'm creating a Django-powered site for my newspaper-ish site. The least obvious and common-sense task that I have come across in getting the site together is how best to generate a "top articles" list for the sidebar of the page.
The first thing that came to mind was some sort of database column that is updated (based on what?) with every view. That seems (to my instincts) ridiculously database intensive and impractical and thus I think I'd like to find another solution.
Thanks all.
I would give celery a try (with django-celery). While it's not so easy to configure and use as cache, it enables you to queue tasks like incrementing counters and do them in background. It could be even combined with cache technique - in views increment counters in cache and define PeriodicTask that will run every now and then, resetting counters and writing them to the database.
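Here is a rough sketch of that "count in cache, flush periodically" idea. It leaves Celery and Django out entirely: the Counter is a hypothetical stand-in for counters kept in the cache, flush_views() is what the periodic task would call, and the articles table with its view_count column is an assumed schema.

```python
import sqlite3
import threading
from collections import Counter

_pending = Counter()      # stand-in for per-article counters kept in the cache
_lock = threading.Lock()


def record_view(article_id):
    """Called from the view: bump an in-memory counter, no DB write."""
    with _lock:
        _pending[article_id] += 1


def flush_views(conn):
    """Called by the periodic task: write accumulated counts, then reset.

    Returns the number of articles that were updated.
    """
    with _lock:
        batch = dict(_pending)
        _pending.clear()
    for article_id, views in batch.items():
        conn.execute(
            "UPDATE articles SET view_count = view_count + ? WHERE id = ?",
            (views, article_id),
        )
    conn.commit()
    return len(batch)
```

The database then sees one UPDATE per viewed article per flush interval instead of one per page view, at the cost of losing a few counts if the process dies between flushes.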
I just remembered - I once found this blog entry which provides nice way of incrementing 'viewed_count' (or similar) column in database with AJAX JS call. If you don't have heavy traffic maybe it's good idea?
Also mentioned in this post is django-tracking, but I don't know much about it, I never used it myself (yet).
This is premature optimization. First try the DB way and then see if it really is too database intensive. Any decent database has such good caches that it probably won't matter very much. And even if it is a problem, take a look at the other DB/cache suggestions here.
By the way, it is most likely that each view already runs many queries that are more intensive than a simple view-count update.
If you do something like sort by top views, it would be fast if you index the view column in the DB. Another option is to only collect the top x articles every hour or so, and toss that value into Django's cache framework.
The nice thing about caching the list is that the algorithm you use to determine top articles can be as complex as you like without hitting the DB hard with every page view. Django's cache framework can use memory, db, or file system. I prefer DB, but many others prefer memory. I believe it uses pickle, so you can also store Python objects directly. It's easy to use, recommended.
An index wouldn't help, as the main problem, I believe, is not so much getting the sorted list as having a DB write with every page view of an article. Another index actually makes that problem worse, albeit only a little.
So I'd go with the cache. I think django's cache shim is a problem here because it requires timeouts on all keys. I'm not sure if that's imposed by memcached, if not then go with redis. Actually just go with redis anyway, the python library is great, I've used it from django projects before, and it has atomic increments and powerful sorting - everything you need.

Reverse Search Best Practices?

I'm making an app that has a need for reverse searches. By this, I mean that users of the app will enter search parameters and save them; then, when any new objects get entered onto the system, if they match the existing search parameters that a user has saved, a notification will be sent, etc.
I am having a hard time finding solutions for this type of problem.
I am using Django and thinking of building the searches and pickling them using Q objects as outlined here:
The way I see it, when a new object is entered into the database, I will have to load every single saved query from the db and somehow run it against this one new object to see if it would match that search query... This doesn't seem ideal - has anyone tackled such a problem before?
At the database level, many databases offer 'triggers'.
Another approach is to have timed jobs that periodically fetch all items from the database that have a last-modified date since the last run; then these get filtered and alerts issued. You can perhaps put some of the filtering into the query statement in the database. However, this is a bit trickier if notifications need to be sent if items get deleted.
You can also put triggers manually into the code that submits data to the database, which is perhaps more flexible and certainly doesn't rely on specific features of the database.
A nice way for the triggers and the alerts to communicate is through message queues - queues such as RabbitMQ and other AMQP implementations will scale with your site.
The amount of effort you use to solve this problem is directly related to the number of stored queries you are dealing with.
Over 20 years ago we handled stored queries by treating them as minidocs and indexing them based on all of the must have and may have terms. A new doc's term list was used as a sort of query against this "database of queries" and that built a list of possibly interesting searches to run, and then only those searches were run against the new docs. This may sound convoluted, but when there are more than a few stored queries (say anywhere from 10,000 to 1,000,000 or more) and you have a complex query language that supports a hybrid of Boolean and similarity-based searching, it substantially reduced the number we had to execute as full-on queries -- often no more than 10 or 15 queries.
One thing that helped was that we were in control of the horizontal and the vertical of the whole thing. We used our query parser to build a parse tree and that was used to build the list of must/may have terms we indexed the query under. We warned the customer away from using certain types of wildcards in the stored queries because it could cause an explosion in the number of queries selected.
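A toy version of the "database of queries" idea might look like the sketch below. This is my own simplification, not the original system: the StoredQueryIndex class, whitespace tokenizing, and the matching rule (all must-have terms present, or at least one may-have term when there are no must-have terms) are all assumptions made for illustration.

```python
from collections import defaultdict


class StoredQueryIndex:
    """Index saved searches by their terms so new docs only hit likely matches."""

    def __init__(self):
        self._queries = {}                 # query_id -> (must terms, may terms)
        self._by_term = defaultdict(set)   # term -> ids of queries mentioning it

    def add(self, query_id, must=(), may=()):
        """Store a query and index it under every must-have and may-have term."""
        must, may = frozenset(must), frozenset(may)
        self._queries[query_id] = (must, may)
        for term in must | may:
            self._by_term[term].add(query_id)

    def candidates(self, doc_terms):
        """Use the doc's terms as a query against the index of stored queries."""
        found = set()
        for term in doc_terms:
            found |= self._by_term.get(term, set())
        return found

    def matches(self, text):
        """Run the full match only for the candidate queries."""
        doc_terms = set(text.lower().split())
        hits = []
        for query_id in self.candidates(doc_terms):
            must, may = self._queries[query_id]
            if must <= doc_terms and (must or may & doc_terms):
                hits.append(query_id)
        return sorted(hits)


if __name__ == "__main__":
    index = StoredQueryIndex()
    index.add("q1", must=["django", "search"])
    index.add("q2", must=["python"], may=["whoosh"])
    print(index.matches("Django full text search tips"))
```

Stored queries that share no term with the new document are never evaluated at all, which is where the savings come from when there are many saved searches.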
Update for comment:
Short answer: I don't know for sure.
Longer answer: We were dealing with a custom built text search engine and part of its query syntax allowed slicing the doc collection in certain ways very efficiently, with special emphasis on date_added. We played a lot of games because we were ingesting 4-10,000,000 new docs a day and running them against up to 1,000,000+ stored queries on DEC Alphas with 64MB of main memory. (This was in the late 80's/early 90's.)
I'm guessing that filtering on something equivalent to date_added could be used in combination with the date of the last time you ran your queries, or maybe the highest id at last query run time. If you need to re-run the queries against a modified record you could use its id as part of the query.
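As a guess at what the "highest id at last query run time" variant could look like, here is a small sketch using SQLite. The items table, its columns, and fetch_new_items() are hypothetical; the caller is assumed to persist the returned id between runs.

```python
import sqlite3


def fetch_new_items(conn, last_seen_id):
    """Return rows added since the previous run and the new high-water mark."""
    rows = conn.execute(
        "SELECT id, body FROM items WHERE id > ? ORDER BY id", (last_seen_id,)
    ).fetchall()
    new_last_seen = rows[-1][0] if rows else last_seen_id
    return rows, new_last_seen


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, body TEXT)")
    conn.execute("INSERT INTO items (body) VALUES ('first doc')")
    rows, last_seen = fetch_new_items(conn, 0)
    print(rows, last_seen)
```

Each run then only evaluates the stored searches against rows it has not seen before, instead of the whole table.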
For me to get any more specific, you're going to have to get a lot more specific about exactly what problem you are trying to solve and the scale of the solution you are trying to accomplish.
If you stored the type(s) of object(s) involved in each stored search as a generic relation, you could add a post-save signal to all involved objects. When the signal fires, it looks up only the searches that involve its object type and runs those. That probably will still run into scaling issues if you have a ton of writes to the db and a lot of saved searches, but it would be a straightforward Django approach.

