Optimizing join query performance in google app engine

Optimizing join query performance in google app engine - python

Scenario
Entity1 (id,itmname)
Entity2 (id,itmname,price)
Entity3 (id,itmname,profit)
profit and price are both IntegerProperty
I want to count all the item with price more then 500 and profit more then 10.
I know its join operation and is not supported by google. I tried my best to find out the way other then executing queries separately and performing count but I didn't get anything.
The reason for not executing queries separately is query execution time. In each query I am getting more then 50000 records as result so it takes nearly 20 seconds in fetching records from first query.

Google App Engine developers have always focused on read optimization and that is one thing where denormalization pops in. Before you design your data structure, you should work on possible cases in which the data could be retrieved. Designing of models comes later. A closer look at I/O session about
Building Scalable Web Applications with Google App Engine will prove helpful.
In the current situation, if you are interested in just the counts, you may go with a shard counter. It will require you to update every associated counter if the field updates.
Another approach involves performing nightly scheduled task which will do heavy calculations, and update counts and other stats you might need. You might find mapreduce helpful in this case. This approach will never give you real time data.

The standard solution to this problem is denormalization. Try storing a copy of price and profit in Entity1 and then you can answer your question with a single, simple query on Entity1.

Related

How to efficiently query a large database on a hourly basis?

Background:
I have multiple asset tables stored in a redshift database for each city, 8 cities in total. These asset tables display status updates on an hourly basis. 8 SQL tables and about 500 mil rows of data in a year.
(I also have access to the server that updates this data every minute.)
Example: One market can have 20k assets displaying 480k (20k*24 hrs) status updates a day.
These status updates are in a raw format and need to undergo a transformation process that is currently written in a SQL view. The end state is going into our BI tool (Tableau) for external stakeholders to look at.
Problem:
The current way the data is processed is slow and inefficient, and probably not realistic to run this job on an hourly basis in Tableau. The status transformation requires that I look back at 30 days of data, so I do need to look back at the history throughout the query.
Possible Solutions:
Here are some solutions that I think might work, I would like to get feedback on what makes the most sense in my situation.
Run a python script that looks at the most recent update and query the large history table 30 days as a cron job and send the result to a table in the redshift database.
Materialize the SQL view and run an incremental refresh every hour
Put the view in Tableau as a datasource and run an incremental refresh every hour
Please let me know how you would approach this problem. My knowledge is in SQL, limited Data Engineering experience, Tableau (Prep & Desktop) and scripting in Python or R.

So first things first - you say that the data processing is "slow and inefficient" and ask how to efficiently query a large database. First I'd look at how to improve this process. You indicate that the process is based on the past 30 days of data - is the large tables time sorted, vacuumed and analyzed? It is important to take maximum advantage of metadata when working with large tables. Make sure your where clauses are effective at eliminating fact table block - don't rely on dimension table where clauses to select the date range.
Next look at your distribution keys and how these are impacting the need for your critical query to move large amounts of data across the network. The internode network has the lowest bandwidth in a Redshift cluster and needlessly pushing lots of data across it will make things slow and inefficient. Using EVEN distribution can be a performance killer depending on your query pattern.
Now let me get to your question and let me paraphrase - "is it better to use summary tables, materialized views, or external storage (tableau datasource) to store summary data updated hourly?" All 3 work and each has its own pros and cons.
Summary tables are good because you can select the distribution of the data storage and if this data needs to be combined with other database tables it can be done most efficiently. However, there is more data management to be performed to keep this data up to data and in sync.
Materialized views are nice as there is a lot less management action to worry about - when the data changes, just refresh the view. The data is still in the database so is is easy to combine with other data tables but since you don't have control over storage of the data these action may not be the most efficient.
External storage is good in that the data is in your BI tool so if you need to refetch the results during the hour the data is local. However, it is not locked into your BI tool and far less efficient to combine with other database tables.
Summary data usually isn't that large so how it is stored isn't a huge concern and I'm a bit lazy so I'd go with a materialized view. Like I said at the beginning I'd first look at the "slow and inefficient" queries I'm running every hour first.
Hope this helps

Continuous aggregates over large datasets

I'm trying to think of an algorithm to solve this problem I have. It's not a HW problem, but for a side project I'm working on.
There's a table A that has about (order of) 10^5 rows and adds new in the order of 10^2 every day.
Table B has on the order of 10^6 rows and adds new at 10^3 every day. There's a one to many relation from A to B (many B rows for some row in A).
I was wondering how I could do continuous aggregates for this kind of data. I would like to have a job that runs every ~10mins and does this: For every row in A, find every row in B related to it that were created in the last day, week and month (and then sort by count) and save them in a different DB or cache them.
If this is confusing, here's a practical example: Say table A has Amazon products and table B has product reviews. We would like to show a sorted list of products with highest reviews in the last 4hrs, day, week etc. New products and reviews are added at a fast pace, and we'd like the said list to be as up-to-date as possible.
Current implementation I have is just a for loop (pseudo-code):
result = []
for product in db_products:
reviews = db_reviews(product_id=product.id, create>=some_time)
reviews_count = len(reviews)
result[product]['reviews'] = reviews
result[product]['reviews_count'] = reviews_count
sort(result, by=reviews_count)
return result
I do this every hour, and save the result in a json file to serve. The problem is that this doesn't really scale well, and takes a long time to compute.
So, where could I look to solve this problem?
UPDATE:
Thank you for your answers. But I ended up learning and using Apache Storm.

Summary of requirements
Having two bigger tables in a database, you need regularly creating some aggregates for past time periods (hour, day, week etc.) and store the results in another database.
I will assume, that once a time period is past, there are no changes to related records, in other words, the aggregate for past period has always the same result.
Proposed solution: Luigi
Luigi is framework for plumbing dependent tasks and one of typical uses is calculating aggregates for past periods.
The concept is as follows:
write simple Task instance, which defines required input data, output data (called Target) and process to create the target output.
Tasks can be parametrized, typical parameter is time period (specific day, hour, week etc.)
Luigi can stop tasks in the middle and start later. It will consider any task, for which is target already existing to be completed and will not rerun it (you would have to delete the target content to let it rerun).
In short: if the target exists, the task is done.
This works for multiple types of targets like files in local file system, on hadoop, at AWS S3, and also in database.
To prevent half done results, target implementations take care of atomicity, so e.g. files are first created in temporary location and are moved to final destination just after they are completed.
In databases there are structures to denote, that some database import is completed.
You are free to create your own target implementations (it has to create something and provide method exists to check, the result exists.
Using Luigi for your task
For the task you describe you will probably find everything you need already present. Just few tips:
class luigi.postgres.CopyToTable allowing to store records into Postgres database. The target will automatically create so called "marker table" where it will mark all completed tasks.
There are similar classes for other types of databases, one of them using SqlAlchemy which shall probably cover the database you use, see class luigi.contrib.sqla.CopyToTable
At Luigi doc is working example of importing data into sqlite database
Complete implementation is beyond extend feasible in StackOverflow answer, but I am sure, you will experience following:
The code to do the task is really clear - no boilerplate coding, just write only what has to be done.
nice support for working with time periods - even from command line, see e.g. Efficiently triggering recurring tasks. It even takes care of not going too far in past, to prevent generating too many tasks possibly overloading your servers (default values are very reasonably set and can be changed).
Option to run the task on multiple servers (using central scheduler, which is provided with Luigi implementation).
I have processed huge amounts of XML files with Luigi and also made some tasks, importing aggregated data into database and can recommend it (I am not author of Luigi, I am just happy user).
Speeding up database operations (queries)
If your task suffers from too long execution time to perform the database query, you have few options:
if you are counting reviews per product by Python, consider trying SQL query - it is often much faster. It shall be possible to create SQL query which uses count on proper records and returns directly the number you need. With group by you shall even get summary information for all products in one run.
set up proper index, probably on "reviews" table on "product" and "time period" column. This shall speed up the query, but make sure, it does not slow down inserting new records too much (too many indexes can cause that).
It might happen, that with optimized SQL query you will get working solution even without using Luigi.

Data Warehousing? Summary tables are the right way to go.
Does the data change (once it is written)? If it does, then incrementally updating Summary Tables becomes a challenge. Most DW applications do not have that problem
Update the summary table (day + dimension(s) + count(s) + sum(s)) as you insert into the raw data table(s). Since you are getting only one insert per minute, INSERT INTO SummaryTable ... ON DUPLICATE KEY UPDATE ... would be quite adequate, and simpler than running a script every 10 minutes.
Do any reporting from a summary table, not the raw data (the Fact table). It will be a lot faster.
My Blog on Summary Tables discusses details. (It is aimed at bigger DW applications, but should be useful reading.)

I agree with Rick, summary tables make the most sense for you. Update the summary tables every 10 minutes and just pull data from it, as user's request summaries.
Also, make sure that your DB is indexed properly for performance. I'm sure db_products.id set as a unique index. but, also make sure that db_products.create is defined as a DATE or DATETIME and also indexed since you are using it in your WHERE statement.

Reduce GAE hrd(db) read operation counts

To reduce GAE Python usage cost, I want to optimize DB read operation. Do you have any suggestions?
I can't understand why GAE shows quite a lot DB read operation than I thought. If you can give general logic how GAE counts DB read operation it also should be very helpful.
Thanks!

You can get the full breakdown of what a high-level operation (get, query, put, delete, ...) costs in low-level operations (small, read, write) here - https://developers.google.com/appengine/docs/billing (scroll down about half way).
I highly recommend using AppStats to help track down where your read operations are coming from. One big thing to watch out for is not to use the offset option with .fetch() for pagination, as this just skips results, but still costs reads. That means if you do .fetch(10, offset=20), it will cost you 30 reads. You want to use query cursors instead.
Another optimization is to fetch by key (.get(keys)) vs querying, which will only cost 1 read operation, as opposed to querying which cost 1 read for the query + 1 read for each entity returned (so a query with 1 entity returned cost 2 reads, but a .get() for that same entity would only cost 1 read. You might also want to look at using projection queries, which cost 1 read for the query, but only 1 small per projected entity retrieved (note: all properties projected must be indexed).
Also, if you're not already, you should be using the NDB API which automatically caches fetches and will help reduce your read operations. Along with the official docs, the NDB cheat sheet by Rodrigo and Guido is a great way to transition from ext.db to ndb.
There are some good tips under Managing Datastore Usage here:
https://developers.google.com/appengine/articles/managing-resources
Lastly, you might also be interested in using gae_mini_profiler, which provides convenient access to AppStats for the current request, as well as other helpful profiling and logging information.

Hard to say why without seeing your code but if you're not already, use memcache to save on db reads.
https://developers.google.com/appengine/docs/python/memcache/usingmemcache

Collecting keys vs automatic indexing in Google App Engine

After enabling Appstats and profiling my application, I went on a panic rage trying to figure out how to reduce costs by any means. A lot of my costs per request came from queries, so I sought out to eliminate querying as much as possible.
For example, I had one query where I wanted to get a User's StatusUpdates after a certain date X. I used a query to fetch: statusUpdates = StatusUpdates.query(StatusUpdates.date > X).
So I thought I might outsmart the system and avoid a query, but incur higher write costs for the sake of lower read costs. I thought that every time a user writes a Status, I store the key to that status in a list property of the user. So instead of querying, I would just do ndb.get_multi(user.list_of_status_keys).
The question is, what is the difference for the system between these two approaches? Sure I avoid a query with the second case, but what is happening behind the scenes here? Is what I'm doing in the second case, where I'm collecting keys, just me doing a manual indexing that GAE would have done for me with queries?
In general, what is the difference between get_multi(keys) and a query? Which is more efficient? Which is less costly?

Check the docs on billing:
https://developers.google.com/appengine/docs/billing
It's pretty straightforward. Reads are $0.07/100k, smalls are $0.01/100k, so you want to do smalls.
A query is 1 read + 1 small / entity
A get is 1 read. If you are getting more than 1 entity back with a query, it's cheaper to do a query than reading entities from keys.
Query is likely more efficient too. The only benefit from doing the gets is that they'll be fully consistent (whereas a query is eventually consistent).

Storing the keys does not query, as you cannot do anything with just the keys. You will still have to fetch the Status objects from memory. Also, since you want to query on the date of the Status object, you will need to fetch all the Status objects into memory and compare their dates yourself. If you use a Query, appengine will fetch only the Status with the required date. Since you fetch less, your read costs will be lower.
As this is basically the same question as you have posed here, I suggest that you look at the answer I gave there.

Reverse Search Best Practices?

I'm making an app that has a need for reverse searches. By this, I mean that users of the app will enter search parameters and save them; then, when any new objects get entered onto the system, if they match the existing search parameters that a user has saved, a notification will be sent, etc.
I am having a hard time finding solutions for this type of problem.
I am using Django and thinking of building the searches and pickling them using Q objects as outlined here: http://www.djangozen.com/blog/the-power-of-q
The way I see it, when a new object is entered into the database, I will have to load every single saved query from the db and somehow run it against this one new object to see if it would match that search query... This doesn't seem ideal - has anyone tackled such a problem before?

At the database level, many databases offer 'triggers'.
Another approach is to have timed jobs that periodically fetch all items from the database that have a last-modified date since the last run; then these get filtered and alerts issued. You can perhaps put some of the filtering into the query statement in the database. However, this is a bit trickier if notifications need to be sent if items get deleted.
You can also put triggers manually into the code that submits data to the database, which is perhaps more flexible and certainly doesn't rely on specific features of the database.
A nice way for the triggers and the alerts to communicate is through message queues - queues such as RabbitMQ and other AMQP implementations will scale with your site.

The amount of effort you use to solve this problem is directly related to the number of stored queries you are dealing with.
Over 20 years ago we handled stored queries by treating them as minidocs and indexing them based on all of the must have and may have terms. A new doc's term list was used as a sort of query against this "database of queries" and that built a list of possibly interesting searches to run, and then only those searches were run against the new docs. This may sound convoluted, but when there are more than a few stored queries (say anywhere from 10,000 to 1,000,000 or more) and you have a complex query language that supports a hybrid of Boolean and similarity-based searching, it substantially reduced the number we had to execute as full-on queries -- often no more that 10 or 15 queries.
One thing that helped was that we were in control of the horizontal and the vertical of the whole thing. We used our query parser to build a parse tree and that was used to build the list of must/may have terms we indexed the query under. We warned the customer away from using certain types of wildcards in the stored queries because it could cause an explosion in the number of queries selected.
Update for comment:
Short answer: I don't know for sure.
Longer answer: We were dealing with a custom built text search engine and part of it's query syntax allowed slicing the doc collection in certain ways very efficiently, with special emphasis on date_added. We played a lot of games because we were ingesting 4-10,000,000 new docs a day and running them against up to 1,000,000+ stored queries on a DEC Alphas with 64MB of main memory. (This was in the late 80's/early 90's.)
I'm guessing that filtering on something equivalent to date_added could be done used in combination the date of the last time you ran your queries, or maybe the highest id at last query run time. If you need to re-run the queries against a modified record you could use its id as part of the query.
For me to get any more specific, you're going to have to get a lot more specific about exactly what problem you are trying to solve and the scale of the solution you are trying accomplishing.

If you stored the type(s) of object(s) involved in each stored search as a generic relation, you could add a post-save signal to all involved objects. When the signal fires, it looks up only the searches that involve its object type and runs those. That probably will still run into scaling issues if you have a ton of writes to the db and a lot of saved searches, but it would be a straightforward Django approach.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.