AppEngine real time querying - cost, performance, latency balancing act and quotas - python

I am trying to design an app that uses Google App Engine to store/process/query data that is then served up to mobile devices via the Cloud Endpoints API in as close to real time as possible.
It is a straightforward enough solution; however, I am struggling to get the right balance between performance, cost and latency on App Engine.
The scenario (analogy) is that a user checks in many times per day from different locations, cities and countries, and we would like to allow the user to query all the data via their device and to provide information that is as up to date as possible.
Such as:
The number of check-ins over the last:
24 hours
1 week
1 month
All time
Where is the most checked in place/city/country over the same time periods
Where is the least checked in place over the same time periods
Other similar querying reports
We can use Memcache to store the most recent check-ins, pushing them to the Datastore every 5 minutes, but this may not scale very well and is not robust!
We can use a cron job to kick off a Task Queue/MapReduce job that computes the aggregates and averages for each location every 30 minutes and updates the Datastore.
The challenge is to keep Datastore reads/writes to a minimum, because the last "24 hours" of data changes every 5 minutes, and hence so does the last week's data, the last month's data and so on. The data has to be dynamic to some degree; the time windows are not fixed points in time, they are always moving - herein lies the issue!
It is not a problem to set this up, but to set it up in an efficient manner, balancing performance/latency for the user against cost/quotas for us, is not so easy!
The simple solution would be to use SQL and run date-range queries, but this will not scale very well.
We could eventually use Bigtable & BigQuery for the "All time" queries, but giving users data that is as close to real time as possible via the API for the other time periods is proving quite the challenge!
Any suggestions of AppEngine architecture/approaches would be seriously welcomed.
Many thanks.

Push Queue is more robust than Memcache for adding new check-ins. Memcache together with get_entity_group_version(key) reduces read volumes.
Aggregate statistical data (for example the most and least popular locations) ahead of time from user history over daily, weekly, monthly and yearly dimensions to reduce the number of records each query touches (just as analytical databases do). Design your real-time queries to merge stored aggregate data from the past with the small amount of current data that you have not yet aggregated.
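A minimal sketch of that merge step, assuming two hypothetical helpers: load_daily_aggregates(day), which returns the per-location counts stored by the aggregation job, and fetch_recent_checkins(since), which returns the raw check-ins written after the last aggregation run:

from collections import Counter
from datetime import datetime, timedelta

def checkin_counts_last_7_days():
    now = datetime.utcnow()
    totals = Counter()
    # 1. Sum the stored daily aggregates for the past 7 full days.
    for offset in range(1, 8):
        day = (now - timedelta(days=offset)).date()
        totals.update(load_daily_aggregates(day))   # hypothetical helper
    # 2. Add the small tail of check-ins that has not been aggregated yet.
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    for checkin in fetch_recent_checkins(since=midnight):  # hypothetical helper
        totals[checkin.location] += 1
    return totals.most_common()   # [(location, count), ...] sorted descending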

First, writes to the datastore take milliseconds. By the time your user hits the refresh button (or whatever you offer), the data will be as "real-time" as it gets.
Typically, developers become concerned with real-time when there is a synchronization/congestion issue, i.e. each user can update something (e.g. bid on an item), and all users have to get the same data (the highest bid) in real time. In your case, what's the harm if a user gets the number of check-ins which is 1 second old?
Second, data in Memcache can be lost at any moment. In your proposed solution (update the datastore every 5 minutes), you risk losing all data for the 5 min period.
I would rather use Memcache in the opposite direction: read data from datastore, put it in Memcache with 60 seconds (or more) expiration, serve all users from Memcache, then refresh it. This will minimize your reads. I would do it, of course, unless your users absolutely must know how many checkins happened in the last 60 seconds.
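A minimal sketch of that read-through pattern on App Engine (Python memcache API; compute_checkin_stats() is a hypothetical function that runs the Datastore aggregate queries):

from google.appengine.api import memcache

CACHE_KEY = 'checkin_stats'
CACHE_SECONDS = 60  # serve slightly stale data to minimize Datastore reads

def get_checkin_stats():
    stats = memcache.get(CACHE_KEY)
    if stats is None:
        # Cache miss or eviction: hit the Datastore once, then repopulate.
        stats = compute_checkin_stats()  # hypothetical: runs the aggregate queries
        memcache.set(CACHE_KEY, stats, time=CACHE_SECONDS)
    return stats

Every user is served from the cached copy, so the Datastore is queried at most about once per minute per statistic, regardless of traffic.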
The real question for you is how to model your data to optimize writes. If you don't want to lose data, you will have to record every check-in in the datastore. You can save by making sure you don't have unnecessary indexed fields, by separating frequently updated fields from the rest, etc.
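For example, a check-in model might keep only the properties you actually filter or sort on indexed (a sketch with hypothetical property names):

from google.appengine.ext import ndb

class Checkin(ndb.Model):
    # Queried by date range and place, so these stay indexed.
    created = ndb.DateTimeProperty(auto_now_add=True)
    place = ndb.StringProperty()
    # Never filtered or sorted on, so skip the index writes for these.
    note = ndb.StringProperty(indexed=False)
    device = ndb.StringProperty(indexed=False)

Each indexed property costs extra write operations per entity, so dropping indexes you never query against directly reduces the write bill.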

Related

How to efficiently query a large database on an hourly basis?

Background:
I have multiple asset tables stored in a Redshift database, one for each city, 8 cities in total. These asset tables record status updates on an hourly basis. That adds up to 8 SQL tables and about 500 million rows of data in a year.
(I also have access to the server that updates this data every minute.)
Example: One market can have 20k assets displaying 480k (20k*24 hrs) status updates a day.
These status updates are in a raw format and need to undergo a transformation process that is currently written in a SQL view. The end state is going into our BI tool (Tableau) for external stakeholders to look at.
Problem:
The current way the data is processed is slow and inefficient, and it is probably not realistic to run this job on an hourly basis in Tableau. The status transformation requires looking back at 30 days of data, so the query does need to scan that history every time it runs.
Possible Solutions:
Here are some solutions that I think might work; I would like feedback on which makes the most sense in my situation.
Run a Python script as a cron job that looks at the most recent update, queries the last 30 days from the large history table, and sends the result to a table in the Redshift database.
Materialize the SQL view and run an incremental refresh every hour
Put the view in Tableau as a datasource and run an incremental refresh every hour
Please let me know how you would approach this problem. My knowledge is in SQL, limited Data Engineering experience, Tableau (Prep & Desktop) and scripting in Python or R.
So first things first - you say that the data processing is "slow and inefficient" and ask how to efficiently query a large database. First I'd look at how to improve this process. You indicate that the process is based on the past 30 days of data - are the large tables time-sorted, vacuumed and analyzed? It is important to take maximum advantage of metadata when working with large tables. Make sure your WHERE clauses are effective at eliminating fact table blocks - don't rely on dimension table WHERE clauses to select the date range.
Next look at your distribution keys and how these are impacting the need for your critical query to move large amounts of data across the network. The internode network has the lowest bandwidth in a Redshift cluster and needlessly pushing lots of data across it will make things slow and inefficient. Using EVEN distribution can be a performance killer depending on your query pattern.
Now let me get to your question and let me paraphrase - "is it better to use summary tables, materialized views, or external storage (tableau datasource) to store summary data updated hourly?" All 3 work and each has its own pros and cons.
Summary tables are good because you can select the distribution of the data storage, and if this data needs to be combined with other database tables it can be done most efficiently. However, there is more data management to be performed to keep this data up to date and in sync.
Materialized views are nice as there is a lot less management to worry about - when the data changes, just refresh the view. The data is still in the database, so it is easy to combine with other tables, but since you don't have control over how the data is stored, these operations may not be the most efficient.
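As a rough sketch of the materialized-view route, the hourly refresh is just one SQL statement run on a schedule; here it is wrapped in a Python cron script (hypothetical view name hourly_status_mv and connection details, psycopg2 for connectivity):

import psycopg2

# Hypothetical connection string; in practice read this from config/secrets.
CONN_INFO = "host=my-cluster.example.redshift.amazonaws.com dbname=analytics user=etl password=secret"

def refresh_hourly_view():
    # Run hourly (e.g. from cron). Redshift refreshes the view incrementally
    # when the view definition allows it, otherwise it recomputes in full.
    with psycopg2.connect(CONN_INFO) as conn:
        with conn.cursor() as cur:
            cur.execute("REFRESH MATERIALIZED VIEW hourly_status_mv;")

if __name__ == "__main__":
    refresh_hourly_view()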
External storage is good in that the data is in your BI tool, so if you need to refetch the results during the hour, the data is local. However, it is now locked into your BI tool and far less efficient to combine with other database tables.
Summary data usually isn't that large, so how it is stored isn't a huge concern, and I'm a bit lazy, so I'd go with a materialized view. Like I said at the beginning, though, I'd first look at the "slow and inefficient" queries being run every hour.
Hope this helps

Is it possible to use Hour parameter in Google Analytics API request? Python

Hopefully this is a simple one. I want to pull data from GA every hour. I know that there are start-date and end-date parameters available. Is it possible to have an hour filter on the request, so I can run a query every 60 minutes and have the data appended to a file? I don't want to pull all available data for a day every time I send a request.
Thanks!
sampak
Yes it is possible
See Filters
https://developers.google.com/analytics/devguides/reporting/core/v4/basics#filtering_2
Don't forget to consider the usage quota when deciding how frequently you fetch data
https://developers.google.com/analytics/devguides/reporting/core/v4/limits-quotas
And remember that "intraday" data isn't "stable": it is re-processed before becoming consistent on the next day (or the day after, depending on the volume you collect and whether you are a 360 customer or not).
(These apply to Universal Analytics - is that the version you are referring to?)
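For reference, a minimal sketch of a Reporting API v4 request restricted to a single hour (this assumes Universal Analytics, a placeholder view ID, and an analytics service object already built and authorized with googleapiclient.discovery.build('analyticsreporting', 'v4', ...)):

def fetch_hour(analytics, view_id, date, hour):
    # e.g. fetch_hour(analytics, '12345678', '2019-06-01', '14') for 2pm on that day
    body = {
        'reportRequests': [{
            'viewId': view_id,
            'dateRanges': [{'startDate': date, 'endDate': date}],
            'metrics': [{'expression': 'ga:sessions'}],
            'dimensions': [{'name': 'ga:hour'}],
            # Restrict the report to the requested hour only.
            'dimensionFilterClauses': [{
                'filters': [{
                    'dimensionName': 'ga:hour',
                    'operator': 'EXACT',
                    'expressions': [hour],  # two-digit string, '00'..'23'
                }]
            }],
        }]
    }
    return analytics.reports().batchGet(body=body).execute()

Each hourly run can then append the returned rows to your file; just keep the intraday caveat above in mind when you later reconcile against the final daily numbers.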

Continuous aggregates over large datasets

I'm trying to think of an algorithm to solve this problem I have. It's not a HW problem, but for a side project I'm working on.
There's a table A that has on the order of 10^5 rows and adds new rows on the order of 10^2 every day.
Table B has on the order of 10^6 rows and adds around 10^3 new rows every day. There's a one-to-many relation from A to B (many B rows for a given row in A).
I was wondering how I could do continuous aggregates for this kind of data. I would like to have a job that runs every ~10 mins and does this: for every row in A, find all rows in B related to it that were created in the last day, week and month (and then sort by count) and save them in a different DB or cache them.
If this is confusing, here's a practical example: Say table A has Amazon products and table B has product reviews. We would like to show a sorted list of products with highest reviews in the last 4hrs, day, week etc. New products and reviews are added at a fast pace, and we'd like the said list to be as up-to-date as possible.
The current implementation I have is just a for loop (pseudo-code):
result = {}
for product in db_products:
    # pseudo-code: fetch this product's reviews created after some_time
    reviews = db_reviews(product_id=product.id, created_after=some_time)
    result[product.id] = {
        'reviews': reviews,
        'reviews_count': len(reviews),
    }
# sort products by review count, highest first
sorted_result = sorted(result.values(), key=lambda r: r['reviews_count'], reverse=True)
return sorted_result
I do this every hour, and save the result in a json file to serve. The problem is that this doesn't really scale well, and takes a long time to compute.
So, where could I look to solve this problem?
UPDATE:
Thank you for your answers. But I ended up learning and using Apache Storm.
Summary of requirements
Having two largish tables in a database, you need to regularly create aggregates for past time periods (hour, day, week, etc.) and store the results in another database.
I will assume that once a time period is past, there are no changes to the related records; in other words, the aggregate for a past period always gives the same result.
Proposed solution: Luigi
Luigi is a framework for plumbing together dependent tasks, and one of its typical uses is calculating aggregates for past periods.
The concept is as follows:
You write a simple Task subclass, which defines the required input data, the output data (called a Target) and the process to create that output.
Tasks can be parametrized; a typical parameter is the time period (a specific day, hour, week, etc.).
Luigi can be stopped in the middle of a run and started again later. It will consider any task whose target already exists to be completed and will not rerun it (you would have to delete the target content to let it rerun).
In short: if the target exists, the task is done.
This works for multiple types of targets, such as files in the local file system, on Hadoop, on AWS S3, and also in a database.
To prevent half-done results, target implementations take care of atomicity, so e.g. files are first created in a temporary location and moved to the final destination only after they are complete.
For databases there are structures to denote that a given import is completed.
You are free to create your own target implementations (a target has to create something and provide an exists method to check that the result exists).
Using Luigi for your task
For the task you describe you will probably find everything you need already present. Just a few tips:
The class luigi.postgres.CopyToTable allows storing records in a Postgres database. The target will automatically create a so-called "marker table" where it marks all completed tasks.
There are similar classes for other types of databases; one of them uses SQLAlchemy, which will probably cover the database you use - see the class luigi.contrib.sqla.CopyToTable.
The Luigi docs include a working example of importing data into an sqlite database.
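To give a flavour of what such a task looks like, here is a minimal sketch using a local-file target for brevity (count_reviews_per_product is a hypothetical helper that queries table B; a CopyToTable subclass would look very similar):

import json
import luigi

class DailyReviewCounts(luigi.Task):
    # Aggregate review counts per product for one past day.
    date = luigi.DateParameter()  # e.g. --date 2015-06-01 on the command line

    def output(self):
        # If this file exists, Luigi considers the task done and skips it.
        return luigi.LocalTarget('aggregates/review_counts_{}.json'.format(self.date))

    def run(self):
        # Hypothetical helper: count rows in table B created on `date`,
        # grouped by their parent row in table A -> {a_id: count}.
        counts = count_reviews_per_product(self.date)
        with self.output().open('w') as f:
            json.dump(counts, f)

if __name__ == '__main__':
    luigi.run()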
A complete implementation is beyond what is feasible in a Stack Overflow answer, but I am sure you will experience the following:
The code to do the task is really clear - no boilerplate coding, you write only what has to be done.
Nice support for working with time periods - even from the command line, see e.g. "Efficiently triggering recurring tasks". It even takes care of not going too far into the past, to prevent generating too many tasks and possibly overloading your servers (the default values are set very reasonably and can be changed).
The option to run tasks on multiple servers (using the central scheduler, which is provided with the Luigi implementation).
I have processed huge numbers of XML files with Luigi and have also written tasks importing aggregated data into a database, and I can recommend it (I am not the author of Luigi, just a happy user).
Speeding up database operations (queries)
If your task suffers from the database query taking too long to execute, you have a few options:
If you are counting reviews per product in Python, consider trying an SQL query instead - it is often much faster. It should be possible to write an SQL query that counts exactly the records you need and returns the number directly. With GROUP BY you can even get the summary for all products in one run (a sketch follows below).
Set up a proper index, probably on the "reviews" table on the "product" and "created time" columns. This should speed up the query, but make sure it does not slow down inserting new records too much (too many indexes can cause that).
It may well turn out that with an optimized SQL query you get a working solution even without using Luigi.
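A minimal sketch of that grouped count, assuming a DB-API connection with %s placeholders (e.g. psycopg2) and hypothetical table/column names reviews(product_id, created_at):

from datetime import datetime, timedelta

def review_counts_since(conn, days=1):
    # Return [(product_id, review_count), ...] for the last `days` days, busiest first.
    since = datetime.utcnow() - timedelta(days=days)
    cur = conn.cursor()
    cur.execute(
        """
        SELECT product_id, COUNT(*) AS review_count
        FROM reviews
        WHERE created_at >= %s
        GROUP BY product_id
        ORDER BY review_count DESC
        """,
        (since,),
    )
    return cur.fetchall()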
Data Warehousing? Summary tables are the right way to go.
Does the data change (once it is written)? If it does, then incrementally updating Summary Tables becomes a challenge. Most DW applications do not have that problem.
Update the summary table (day + dimension(s) + count(s) + sum(s)) as you insert into the raw data table(s). Since you are getting only one insert per minute, INSERT INTO SummaryTable ... ON DUPLICATE KEY UPDATE ... would be quite adequate, and simpler than running a script every 10 minutes.
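For illustration, this is the kind of statement meant here, executed alongside each raw insert (assuming MySQL's ON DUPLICATE KEY UPDATE, a hypothetical summary table review_summary with a unique key on (day, product_id), and a DB-API cursor cur):

def record_review(cur, product_id, review_day):
    # Keep the per-day summary in step with every raw insert.
    cur.execute(
        """
        INSERT INTO review_summary (day, product_id, review_count)
        VALUES (%s, %s, 1)
        ON DUPLICATE KEY UPDATE review_count = review_count + 1
        """,
        (review_day, product_id),
    )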
Do any reporting from a summary table, not the raw data (the Fact table). It will be a lot faster.
My Blog on Summary Tables discusses details. (It is aimed at bigger DW applications, but should be useful reading.)
I agree with Rick, summary tables make the most sense for you. Update the summary tables every 10 minutes and just pull data from them as users request summaries.
Also, make sure that your DB is indexed properly for performance. I'm sure db_products.id is set as a unique index, but also make sure that db_products.create is defined as a DATE or DATETIME and is indexed as well, since you are using it in your WHERE clause.

Download activity chart flask SQL

I am working on a web application for downloading resources of an unimportant type. It's written in python using the flask web framework. I use the SQLAlchemy DB system.
It has a user authentication system and you can download the resources only while logged in.
What I am trying to do is a download history chart for every resource and every user. To elaborate, each user could see two charts of their download activity on their profile page, for the last 7 days and the last year respectively. Each resource would also have a similar pair of charts, but they would instead visualize how many times the resource itself was downloaded in the time periods.
Here is an example screenshot of the charts
(Don't have enough reputation to embed images)
http://dl.dropbox.com/u/5011799/Selection_049.png
The problem is, I can't seem to figure out what the best way to store the downloads in a database would be. I found 2 ways that are relatively easy to implement and should work:
1) I could store the download count for each day of the last week in separate fields and, every 24 hours, drop the oldest one and shift the rest along by one. This, however, seems like a hacky way to do it.
2) I could also create a separate table for the downloads, and every time a user downloads a resource I would insert a row into the table with the datetime, the user_id of the downloader and the resource_id of the downloaded resource. This would allow me to do some nice querying of time periods etc. The problem with that configuration could be the row count in the table. I have no idea how heavily the website is going to be used, but if I do the math with 1,000 downloads/day, I am going to end up with over 360k rows in just the first year. I don't know how well that would perform. I know I could just archive old entries if performance started being a huge problem.
I would like to know whether the 2nd option would be fast enough for a web app and what configuration you would use.
Thanks in advance.
I recommend the second approach, with periodic aggregation to improve performance.
Storing counts by day will force you to SELECT the existing count so that you can either add to it with an UPDATE statement or know that you need to INSERT a new record. That's two trips to the database on every download. And if things get out of whack, there's really no easy way to determine what happened or what the correct numbers ought to be. (You're not saving information about the individual events.) That's probably not a significant concern for a simple download count, but if this were sensitive information it might matter.
The second approach simply requires a single INSERT for each download, and because each event is stored separately, it's easy to troubleshoot. And, as you point out, you can slice this data any way you like.
As for performance, 360,000 rows is trivial for a modern RDBMS on contemporary hardware, but you do want to make sure you have an index on date, username/resource name or any other columns that will be used to select data.
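A minimal sketch of option 2 as a Flask-SQLAlchemy model (hypothetical names; db is the usual SQLAlchemy() instance, and the user_id/resource_id columns would be foreign keys to your existing tables):

from datetime import datetime
from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()  # in the real app, initialized with the Flask app

class Download(db.Model):
    __tablename__ = 'downloads'
    id = db.Column(db.Integer, primary_key=True)
    # Indexed because the charts always filter on a date range plus user/resource.
    created = db.Column(db.DateTime, default=datetime.utcnow, nullable=False, index=True)
    user_id = db.Column(db.Integer, nullable=False, index=True)
    resource_id = db.Column(db.Integer, nullable=False, index=True)

Period queries then become simple filters on created, e.g. Download.query.filter(Download.created >= week_ago, Download.resource_id == rid).count().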
Still, you might have more volume than you expect, or maybe your DB is iffy (I'm not familiar with SQLAlchemy). To reduce your row count you could create a weekly batch process (yeah, I know, batch ain't dead despite what some people say) during non-peak hours to create summary records by week.
It would probably be easiest to create your summary records in a different table that is simply keyed by week and year, or start/end dates, depending on how you want to use it. After you've generated the weekly summary for a period, you can archive or delete the daily detail records for that period.

Should I optimize around reads or CPU time in Google App Engine

I'm trying to optimize my design, but it's really difficult to put things in perspective. Say I have the following cases:
A. A User has 1,000 status updates. These updates are stored in a separate entity, Statuses. I want to get a User's statuses which have an uploadDate after date X. So I do a query:
statuses = Statuses.query(Statuses.uploadDate > X).fetch()
B. A User has 1,000 status updates. Each User entity has a list property list_of_status_keys, which is a list of all keys to the user's statuses. I want to get all statuses with uploadDate after date X. So I easily get a list of statuses using statuses = ndb.get_multi(list_of_status_keys). Then I loop through each one, checking the date:
for a_status in statuses:
    if a_status.uploadDate > X:
        myList.append(a_status)
I really don't know which I should be optimizing for. A query seems more organized, but fetching by keys is quicker. Does anyone have any insight?
UPDATE
Here's what it comes down to:
In each http request to GAE, I get all notifications and status updates for a user (just like facebook). Using Appstats, it tells me that each request costs 490 micropennies (where 1 penny = 1,000,000 micropennies).
Getting notifications and statuses is important for a user, so you can expect them to do this many times. What I'm having a hard time with is determining if this is a lot or not. I'm freaking out trying to minimize this number in any way possible. I've never run a service before, so I don't know if this is how much it should cost. Here's the math:
Each request costs 490 micropennies when no results are returned (so just for a basic query it costs 490, but in some cases when several results are returned, it could cost 10,000 mp), so for 1 penny I can run 2,040 requests, or for $1 I can run 204,000 requests.
Let's say I have 50,000 users, and each user checks for notifications 75 times a day (reasonable):
75 requests X 490 mp per request X 50,000 users = 1,837,500,000 micropennies per day = 1837.5 pennies = 18.37 dollars per day. (is that right?)
I've never run a large scale service before, so are these usual costs? Or is this too high? Is 490 micropennies per request high? How would I find an answer to this if it depends?
Design A is superior.
In design A, GAE will use the date to perform an indexed query. What this means is that App Engine will automatically maintain an index for you on the Statuses kind, sorted by date. Since it has an index, it will read and fetch only the records after the date you specify. This will save you a large number of reads.
In design B you will basically have to do the indexing work yourself. Since you will need to fetch each Status and then compare its date, you will have to do more work, both in terms of CPU (i.e. cost) and in terms of performance.
EDIT
If your data is accessed as frequently as this, you may have other design options as well.
First, you could consider combining the Status objects into a StatusUpdatesPerDay entity. For each day you create a single instance and then append status updates to that object. This will reduce hundreds of reads to a couple of reads.
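A rough sketch of that idea with ndb (hypothetical property names; one bucket entity per user per day):

from google.appengine.ext import ndb

class StatusUpdate(ndb.Model):
    # Embedded in the per-day bucket, never stored as a standalone entity.
    text = ndb.StringProperty(indexed=False)
    uploadDate = ndb.DateTimeProperty()

class StatusUpdatesPerDay(ndb.Model):
    # Key idea: entity id '<user_id>:<YYYY-MM-DD>', so one day's updates load with a single get.
    user = ndb.KeyProperty(kind='User')
    updates = ndb.StructuredProperty(StatusUpdate, repeated=True)

def append_status(user_key, day_str, status):
    # Append one update to the day's bucket, creating the bucket if needed.
    bucket_id = '%s:%s' % (user_key.id(), day_str)
    bucket = StatusUpdatesPerDay.get_or_insert(bucket_id, user=user_key)
    bucket.updates.append(status)
    bucket.put()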
Second, since the status updates will be accessed very frequently, you can cache the Statuses in memcache. This will reduce costs and latency.
Third, even if you do not optimize as above, I believe ndb has built-in caching. I have never used this feature, but your actual read counts may be lower than in your calculations.
A fourth option is to avoid displaying all status updates at once. Maybe the user wants to see only the last few. Then you can use query cursors to get the remainder when (and if) the user requests them.
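A paged fetch with query cursors might look like this (a sketch that reuses the Statuses model from the question; the urlsafe cursor string is what you would hand back and forth through your Endpoints API):

from google.appengine.datastore.datastore_query import Cursor
from google.appengine.ext import ndb

def get_status_page(cursor_str=None, page_size=20):
    # Return one page of statuses plus a cursor the client can send back for the next page.
    start_cursor = Cursor(urlsafe=cursor_str) if cursor_str else None
    query = Statuses.query().order(-Statuses.uploadDate)
    results, next_cursor, more = query.fetch_page(page_size, start_cursor=start_cursor)
    next_token = next_cursor.urlsafe() if (more and next_cursor) else None
    return results, next_token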
