Is it possible to generate a hash from a queryset? - python

My idea is to create a hash of a queryset result. For example, product inventory.
Each update of this stock would generate a hash.
The idea is that the API consumer would only request this queryset when there is a change (for example, a new product in inventory).
Example for this use:
no change, same hash - no request to get queryset
there was change, different hash. Then a request will be made.
This would be a feature designed for those consuming the data, not for the Django application serving it.
Does this make any sense? I saw that in Python there is a way to generate a hash from a tuple; in my case I would build a frozenset and hash that. I don't know if it's a good idea.
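(For illustration, a minimal sketch of that idea, not from the original post: it uses hashlib instead of the built-in hash() so the fingerprint stays stable across processes. The Product model and its fields are made-up names.)

    import hashlib

    def inventory_hash(queryset):
        """Return a hex digest that changes whenever the rows change."""
        h = hashlib.sha256()
        # order_by() makes the iteration deterministic; values_list() keeps it small.
        for row in queryset.order_by("pk").values_list("pk", "name", "quantity"):
            h.update(repr(row).encode("utf-8"))
        return h.hexdigest()

    # current = inventory_hash(Product.objects.all())
    # The API can expose `current` so clients only re-fetch when it changes.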

I would comment, but I'm waiting on the 50 rep to be able to do that. It sounds like you're trying to cache results so you aren't querying on data that hasn't been changed. If you're not familiar with caching, the idea is to save hard-to-compute answers in memory for frequently queried endpoints/functions.
For example, if I had a program that calculated the first n digits of pi, I may choose to save a map of [digit count -> value] so that if 10 people asked me for the first thousand, I would only calculate it once. Redis is a popular option for caching, and I believe it exists for Django. It allows you to cache some information, set a time before expiration on it, and then wipe specific parts of that information (to force it to recalculate) every time something specific changes (like a new product in inventory).
Everybody should try writing their own cache at least once, like what you're describing, but the de facto professional option is to use a caching library. Your idea is good and it will definitely work; you will probably want a dict of [hash -> result], where result is the information you would send back over your API. If you plan to save data so it persists across multiple program starts, remember that Python randomizes its hash seed per process, so built-in hash() values are inconsistent between runs. Check out this post for more info.
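To make the [hash -> result] idea concrete, here is a hedged sketch using Django's cache framework (which can sit on top of Redis). It assumes the inventory_hash() helper sketched under the question, and it keys on a hashlib digest rather than hash() precisely because of the seed randomization mentioned above:

    from django.core.cache import cache

    def get_inventory_payload(queryset):
        # Key the cached payload by the content fingerprint, not by hash().
        key = "inventory:" + inventory_hash(queryset)
        payload = cache.get(key)
        if payload is None:
            payload = list(queryset.values("pk", "name", "quantity"))
            cache.set(key, payload, timeout=3600)  # expire stale versions after an hour
        return key, payload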

Related

Is there a good way to store a boolean array into a file or database in python?

I am building an image mosaic that detects whether the user's selected areas are taken or not.
My idea is to store the available_spots in a list, and I would just have to look through the list to check whether a spot is available or not.
The problem is that when I reload the website, available_spots also gets reset to a blank list,
so I want to store this array somewhere that is fast to read and write to.
I am currently thinking about a text file to store this, but that might take forever to read since the array length is over 1.4 million. Are there any other solutions that might be better?
You can't store the data in a file for a few reasons: (1) GAE standard won't let you, (2) the data is lost when your server is restarted, and (3) different instances will have different data.
Of course you can and should store the data in a database of your choice. Firestore is likely a better and cheaper option than SQL. It should be fast enough for you and you can implement caching if needed.
You might be able to store the data in a single Firestore entity and consider using compression if you are getting close to the max entity size.
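As a rough illustration of the compression idea (independent of which database ends up storing the blob): 1.4 million booleans pack into about 175 KB of bits, and usually far less after zlib:

    import zlib

    def pack_spots(available_spots):
        """Pack a list of booleans into a zlib-compressed bit string."""
        bits = bytearray((len(available_spots) + 7) // 8)
        for i, taken in enumerate(available_spots):
            if taken:
                bits[i // 8] |= 1 << (i % 8)
        return zlib.compress(bytes(bits))

    def unpack_spots(blob, length):
        bits = zlib.decompress(blob)
        return [bool(bits[i // 8] & (1 << (i % 8))) for i in range(length)]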
If you want to store it in a database you can use the "sqlite3" module.
It is a simple database that gets stored in a file, so you don't have to install a database server. It's great for small projects.
If you want to do more complex stuff with databases you can use "sqlalchemy".
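A minimal sqlite3 sketch of that suggestion (only an option if your host lets you keep a local file around, which the answer above notes is not the case on GAE standard); the table and column names are made up:

    import sqlite3

    conn = sqlite3.connect("spots.db")
    conn.execute("CREATE TABLE IF NOT EXISTS spots (idx INTEGER PRIMARY KEY, taken INTEGER)")

    def mark_taken(idx):
        with conn:  # the with-block commits automatically
            conn.execute("INSERT OR REPLACE INTO spots (idx, taken) VALUES (?, 1)", (idx,))

    def is_taken(idx):
        row = conn.execute("SELECT taken FROM spots WHERE idx = ?", (idx,)).fetchone()
        return bool(row and row[0])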

GAE: Best way to keep track of number of entities of an NDB kind?

I'm trying to figure out the best approach to keeping track of the number of entities of a certain NDB kind I have in my cloud datastore.
One approach is just, when I want to know how many I have, get the .count() of a query that I know will return all of them, but that costs a ton of datastore small operations (looks like it's proportional to the number of entities of that kind I have). So that's not ideal.
Another option would be having a counter in the datastore that gets updated every time I create or delete an entity, but that's also not ideal because it would add an extra read and write operation to every entity I create or destroy.
As of now, it looks like the second option is my best choice, so my question is--do you agree? Are there any other options that would be more cost-effective?
Thanks a lot.
PS: Working in Python if that makes a difference.
Second option is the way to go.
Other considerations:
If you have many writes per second you may wish to consider using a sharded counter
To reduce datastore writes, you could use a cron job to update the datastore at timed intervals (i.e. count how many entities have been created since the last run)
Also consider using memcache.incr() in conjunction with a cron job to persist the data. The downside of this is that your memcache key could be evicted, so it's only really an option if the count doesn't have to be accurate.
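A rough sketch of the "second option" with NDB: a single counter entity bumped in the same cross-group transaction that creates the tracked entity (swap in a sharded counter if write contention becomes an issue). All names here are illustrative:

    from google.appengine.ext import ndb

    class KindCounter(ndb.Model):
        count = ndb.IntegerProperty(default=0)

    @ndb.transactional(xg=True)
    def create_with_count(entity):
        counter = KindCounter.get_or_insert("my_kind_counter")
        counter.count += 1
        ndb.put_multi([entity, counter])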
There's actually a better/cheaper/faster way to see the info you are looking for, but it might not work if you need to know the EXACT number of entities at any given moment, since it's only updated a couple of times a day (i.e. you can access it anytime but it may be a few hours outdated).
The "Datastore Statistics" page in GAE dashboard displays some detailed data about kinds/entities including "count" numbers and there's a way to access it programmatically. See more info here: https://cloud.google.com/appengine/docs/python/datastore/stats

Collecting keys vs automatic indexing in Google App Engine

After enabling Appstats and profiling my application, I went on a panic rage trying to figure out how to reduce costs by any means. A lot of my costs per request came from queries, so I set out to eliminate querying as much as possible.
For example, I had one query where I wanted to get a User's StatusUpdates after a certain date X. I used a query to fetch: statusUpdates = StatusUpdates.query(StatusUpdates.date > X).
So I thought I might outsmart the system and avoid a query, but incur higher write costs for the sake of lower read costs. The idea was that every time a user writes a Status, I would store the key to that Status in a list property of the user. So instead of querying, I would just do ndb.get_multi(user.list_of_status_keys).
The question is, what is the difference for the system between these two approaches? Sure I avoid a query with the second case, but what is happening behind the scenes here? Is what I'm doing in the second case, where I'm collecting keys, just me doing a manual indexing that GAE would have done for me with queries?
In general, what is the difference between get_multi(keys) and a query? Which is more efficient? Which is less costly?
Check the docs on billing:
https://developers.google.com/appengine/docs/billing
It's pretty straightforward. Reads are $0.07/100k, smalls are $0.01/100k, so you want to do smalls.
A query is 1 read + 1 small / entity
A get is 1 read. If you are getting more than 1 entity back with a query, it's cheaper to do a query than reading entities from keys.
Query is likely more efficient too. The only benefit from doing the gets is that they'll be fully consistent (whereas a query is eventually consistent).
Storing the keys alone does not help, as you cannot do anything with just the keys; you will still have to fetch the Status objects themselves. Also, since you want to filter on the date of the Status object, you will need to fetch all the Status objects into memory and compare their dates yourself. If you use a Query, App Engine will fetch only the Status objects with the required date. Since you fetch less, your read costs will be lower.
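To make the comparison concrete, a sketch of the two approaches side by side (the StatusUpdates model comes from the question; the key list is the one the asker proposed storing on the user):

    from google.appengine.ext import ndb

    class StatusUpdates(ndb.Model):  # as described in the question
        date = ndb.DateTimeProperty()

    def recent_via_query(cutoff):
        # The index does the date filtering; only matching entities are read.
        return StatusUpdates.query(StatusUpdates.date > cutoff).fetch()

    def recent_via_keys(status_keys, cutoff):
        # Every stored key is read, then dates are compared in Python.
        statuses = ndb.get_multi(status_keys)  # e.g. user.list_of_status_keys
        return [s for s in statuses if s is not None and s.date > cutoff]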
As this is basically the same question as you have posed here, I suggest that you look at the answer I gave there.

Dynamically Created Top Articles List in Django?

I'm creating a Django-powered site for my newspaper-ish site. The least obvious and common-sense task that I have come across in getting the site together is how best to generate a "top articles" list for the sidebar of the page.
The first thing that came to mind was some sort of database column that is updated (based on what?) with every view. That seems (to my instincts) ridiculously database-intensive and impractical, so I'd like to find another solution.
Thanks all.
I would give celery a try (with django-celery). While it's not as easy to configure and use as the cache, it enables you to queue tasks like incrementing counters and run them in the background. It can even be combined with the caching technique: increment counters in the cache from your views, and define a PeriodicTask that runs every now and then, resetting the counters and writing them to the database. A sketch of that idea follows below.
I just remembered - I once found this blog entry which provides a nice way of incrementing a 'viewed_count' (or similar) column in the database with an AJAX JS call. If you don't have heavy traffic, maybe it's a good idea?
Also mentioned in this post is django-tracking, but I don't know much about it; I've never used it myself (yet).
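Here is that cache-plus-background-task idea as a hedged sketch: views bump a counter in the cache, and a Celery task (run periodically via django-celery's PeriodicTask or Celery beat) flushes the counts to the database. The Article model and its view_count field are hypothetical:

    from celery import shared_task
    from django.core.cache import cache
    from django.db.models import F

    from myapp.models import Article  # hypothetical model with a view_count field

    def record_view(article_id):
        key = "views:%d" % article_id
        cache.add(key, 0)   # no-op if the key already exists
        cache.incr(key)

    @shared_task
    def flush_view_counts(article_ids):
        for article_id in article_ids:
            key = "views:%d" % article_id
            count = cache.get(key) or 0
            if count:
                Article.objects.filter(pk=article_id).update(
                    view_count=F("view_count") + count)
                cache.set(key, 0)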
Premature optimization: first try the DB way and then see if it really is too database-intensive. Any decent database has caches good enough that it probably won't matter very much. And even if it is a problem, take a look at the other db/cache suggestions here.
Most likely, by the way, you will have far more intensive DB queries on each view than a simple view-count update.
If you do something like sort by top views, it would be fast if you index the view column in the DB. Another option is to only collect the top x articles every hour or so, and toss that value into Django's cache framework.
The nice thing about caching the list is that the algorithm you use to determine top articles can be as complex as you like without hitting the DB hard with every page view. Django's cache framework can use memory, db, or file system. I prefer DB, but many others prefer memory. I believe it uses pickle, so you can also store Python objects directly. It's easy to use, recommended.
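A small sketch of that caching approach; the Article model, the view_count field, and the one-hour timeout are all placeholder choices:

    from django.core.cache import cache

    from myapp.models import Article  # hypothetical model

    def top_articles(limit=10):
        articles = cache.get("top_articles")
        if articles is None:
            # The ranking query (or something fancier) only runs on a cache miss.
            articles = list(Article.objects.order_by("-view_count")[:limit])
            cache.set("top_articles", articles, timeout=60 * 60)
        return articles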
An index wouldn't help, as the main problem I believe is not so much getting the sorted list as having a DB write with every page view of an article. Another index actually makes that problem worse, albeit only a little.
So I'd go with the cache. I think Django's cache shim is a problem here because it requires timeouts on all keys. I'm not sure if that's imposed by memcached; if not, then go with redis. Actually, just go with redis anyway: the Python library is great, I've used it from Django projects before, and it has atomic increments and powerful sorting - everything you need.
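For the Redis route, a sorted set gives you the atomic increments and an already-sorted "top N" read in a couple of calls. A sketch with redis-py (the zincrby argument order shown is the redis-py 3.x one):

    import redis

    r = redis.Redis()  # assumes a local Redis instance

    def record_view(article_id):
        r.zincrby("article_views", 1, article_id)

    def top_article_ids(limit=10):
        # Highest-scored members first; returns the raw member values.
        return r.zrevrange("article_views", 0, limit - 1)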

Reverse Search Best Practices?

I'm making an app that has a need for reverse searches. By this, I mean that users of the app will enter search parameters and save them; then, when any new objects get entered onto the system, if they match the existing search parameters that a user has saved, a notification will be sent, etc.
I am having a hard time finding solutions for this type of problem.
I am using Django and thinking of building the searches and pickling them using Q objects as outlined here: http://www.djangozen.com/blog/the-power-of-q
The way I see it, when a new object is entered into the database, I will have to load every single saved query from the db and somehow run it against this one new object to see if it would match that search query... This doesn't seem ideal - has anyone tackled such a problem before?
At the database level, many databases offer 'triggers'.
Another approach is to have timed jobs that periodically fetch all items from the database that have a last-modified date since the last run; then these get filtered and alerts issued. You can perhaps put some of the filtering into the query statement in the database. However, this is a bit trickier if notifications need to be sent if items get deleted.
You can also put triggers manually into the code that submits data to the database, which is perhaps more flexible and certainly doesn't rely on specific features of the database.
A nice way for the triggers and the alerts to communicate is through message queues - queues such as RabbitMQ and other AMQP implementations will scale with your site.
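As a hedged sketch of the code-level trigger plus message queue combination: a Django post_save signal publishes each new item to a RabbitMQ queue (via pika) for a worker to match against the saved searches. The Item model and queue name are made up, and a real app would reuse the connection rather than open one per save:

    import json

    import pika
    from django.db.models.signals import post_save
    from django.dispatch import receiver

    from myapp.models import Item  # hypothetical model being watched

    @receiver(post_save, sender=Item)
    def publish_new_item(sender, instance, created, **kwargs):
        if not created:
            return
        connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = connection.channel()
        channel.queue_declare(queue="new_items", durable=True)
        channel.basic_publish(exchange="", routing_key="new_items",
                              body=json.dumps({"id": instance.pk}))
        connection.close()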
The amount of effort you use to solve this problem is directly related to the number of stored queries you are dealing with.
Over 20 years ago we handled stored queries by treating them as mini-docs and indexing them based on all of their must-have and may-have terms. A new doc's term list was used as a sort of query against this "database of queries", and that built a list of possibly interesting searches to run; only those searches were then run against the new docs. This may sound convoluted, but when there are more than a few stored queries (say anywhere from 10,000 to 1,000,000 or more) and you have a complex query language that supports a hybrid of Boolean and similarity-based searching, it substantially reduced the number we had to execute as full-on queries - often no more than 10 or 15 queries.
One thing that helped was that we were in control of the horizontal and the vertical of the whole thing. We used our query parser to build a parse tree and that was used to build the list of must/may have terms we indexed the query under. We warned the customer away from using certain types of wildcards in the stored queries because it could cause an explosion in the number of queries selected.
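A toy sketch of that "database of queries" idea: stored queries are indexed by their terms, and a new document's term list selects only the candidate queries that are then run in full:

    from collections import defaultdict

    queries_by_term = defaultdict(set)  # term -> ids of stored queries that use it

    def index_stored_query(query_id, must_have_terms, may_have_terms):
        for term in set(must_have_terms) | set(may_have_terms):
            queries_by_term[term].add(query_id)

    def candidate_queries(new_doc_terms):
        candidates = set()
        for term in set(new_doc_terms):
            candidates |= queries_by_term[term]
        return candidates  # only these get executed as full queries against the doc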
Update for comment:
Short answer: I don't know for sure.
Longer answer: We were dealing with a custom-built text search engine, and part of its query syntax allowed slicing the doc collection in certain ways very efficiently, with special emphasis on date_added. We played a lot of games because we were ingesting 4-10,000,000 new docs a day and running them against up to 1,000,000+ stored queries on DEC Alphas with 64MB of main memory. (This was in the late 80's/early 90's.)
I'm guessing that filtering on something equivalent to date_added could be used in combination with the date of the last time you ran your queries, or maybe the highest id at last query run time. If you need to re-run the queries against a modified record you could use its id as part of the query.
For me to get any more specific, you're going to have to get a lot more specific about exactly what problem you are trying to solve and the scale of the solution you are trying to accomplish.
If you stored the type(s) of object(s) involved in each stored search as a generic relation, you could add a post-save signal to all involved objects. When the signal fires, it looks up only the searches that involve its object type and runs those. That probably will still run into scaling issues if you have a ton of writes to the db and a lot of saved searches, but it would be a straightforward Django approach.
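A hedged sketch of that approach, combined with the pickled Q objects from the question; SavedSearch, its fields, and the notify() hook are hypothetical names:

    import pickle

    from django.contrib.contenttypes.models import ContentType
    from django.db.models.signals import post_save

    from myapp.models import SavedSearch  # hypothetical: content_type, pickled_q, user

    def run_matching_searches(sender, instance, created, **kwargs):
        if not created:
            return
        ct = ContentType.objects.get_for_model(sender)
        for saved in SavedSearch.objects.filter(content_type=ct):
            q = pickle.loads(saved.pickled_q)  # the pickled Q object from the question
            if sender.objects.filter(q, pk=instance.pk).exists():
                notify(saved.user, instance)   # hypothetical notification hook

    # post_save.connect(run_matching_searches, sender=SomeWatchedModel)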
