Getting DISTINCT users on Google App Engine - python

How to do this on Google App Engine (Python):
SELECT COUNT(DISTINCT user) FROM event WHERE event_type = "PAGEVIEW"
AND t >= start_time AND t <= end_time
Long version:
I have a Python Google App Engine application with users that generate events, such as pageviews. I would like to know in a given timespan how many unique users generated a pageview event. The timespan I am most interested in is one week, and there are about a million such events in a given week. I want to run this in a cron job.
My event entities look like this:
class Event(db.Model):
    t = db.DateTimeProperty(auto_now_add=True)
    user = db.StringProperty(required=True)
    event_type = db.StringProperty(required=True)
With an SQL database, I would do something like
SELECT COUNT(DISTINCT user) FROM event WHERE event_type = "PAGEVIEW"
AND t >= start_time AND t <= end_time
First thought that occurs is to get all PAGEVIEW events and filter out duplicate users. Something like:
query = Event.all()
query.filter("event_type =", "PAGEVIEW")
query.filter("t >=", start_time)
query.filter("t <=", end_time)
usernames = []
for event in query:
    usernames.append(event.user)
answer = len(set(usernames))
But this won't work, because it will only support up to 1000 events. Next thing that occurs to me is to get 1000 events, then when those run out get the next thousand and so on. But that won't work either, because going through a thousand queries and retrieving a million entities would take over 30 seconds, which is the request time limit.
Then I thought I could ORDER BY user to skip over duplicates faster. But that is not allowed, because I am already using the inequality filter "t >= start_time AND t <= end_time".
It seems clear this cannot be accomplished in under 30 seconds, so it needs to be fragmented. But finding distinct items doesn't seem to split well into subtasks. The best I can think of is, on every cron job call, to find 1000 pageview events, get the distinct usernames from those, and put them in an entity such as Chard. It could look something like
class Chard(db.Model):
    usernames = db.StringListProperty(required=True)
So each chard would have up to 1000 usernames in it, fewer if there were duplicates that got removed. After about 16 hours (which is fine) I would have all the chards and could do something like:
chards = Chard.all()
all_usernames = set()
for chard in chards:
    all_usernames = all_usernames.union(chard.usernames)
answer = len(all_usernames)
It seems like it might work, but it's hardly a beautiful solution, and with enough unique users this loop might take too long. I haven't tested it, in the hope that someone will come up with a better suggestion, so I don't know whether this loop would turn out to be fast enough.
Is there any prettier solution to my problem?
Of course all of this unique user counting could be accomplished easily with Google Analytics, but I am constructing a dashboard of application specific metrics, and intend this to be the first of many stats.

As of SDK v1.7.4, there is now experimental support for the DISTINCT function.
See : https://developers.google.com/appengine/docs/python/datastore/gqlreference
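For reference, a rough sketch of what that might look like from Python (an untested assumption: DISTINCT was experimental at the time, works on the projected property only, and will likely need a composite index on event_type, t and user):

from google.appengine.ext import db

# Sketch only: uses the experimental GQL DISTINCT support (SDK >= 1.7.4).
# DISTINCT applies to projected properties, so only 'user' is selected.
q = db.GqlQuery(
    "SELECT DISTINCT user FROM Event "
    "WHERE event_type = :1 AND t >= :2 AND t <= :3",
    "PAGEVIEW", start_time, end_time)

unique_users = [e.user for e in q.run()]
answer = len(unique_users)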

Here is a possibly-workable solution. It relies to an extent on using memcache, so there is always the possibility that your data would get evicted in an unpredictable fashion. Caveat emptor.
You would have a memcache variable called unique_visits_today or something similar. Every time that a user had their first pageview of the day, you would use the .incr() function to increment that counter.
Determining that this is the user's first visit of the day is accomplished by looking at a last_activity_day field attached to the user. When the user visits, you look at that field, and if it is earlier than today, you update it to today and increment your memcache counter.
At midnight each day, a cron job would take the current value in the memcache counter and write it to the datastore while setting the counter to zero. You would have a model like this:
class UniqueVisitsRecord(db.Model):
    # be careful setting date correctly if processing at midnight
    activity_date = db.DateProperty()
    event_count = db.IntegerProperty()
You could then simply, easily and quickly get all of the UniqueVisitsRecord entities that match any date range and add up the numbers in their event_count fields.
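A minimal sketch of the moving parts described above (the handler and cron wiring, the counter key name and the last_activity_day field on your user model are all assumptions, not a tested implementation):

import datetime
from google.appengine.api import memcache
from google.appengine.ext import db

COUNTER_KEY = 'unique_visits_today'

class UniqueVisitsRecord(db.Model):
    activity_date = db.DateProperty()
    event_count = db.IntegerProperty()

def record_pageview(user_record):
    """Call on every pageview; counts each user at most once per day."""
    today = datetime.date.today()
    if user_record.last_activity_day != today:  # assumed field on the user model
        user_record.last_activity_day = today
        user_record.put()
        memcache.add(COUNTER_KEY, 0)   # seed the counter if it is missing/evicted
        memcache.incr(COUNTER_KEY)

def midnight_cron():
    """Cron handler: persist yesterday's counter, then reset it."""
    count = int(memcache.get(COUNTER_KEY) or 0)
    UniqueVisitsRecord(
        activity_date=datetime.date.today() - datetime.timedelta(days=1),
        event_count=count).put()
    memcache.set(COUNTER_KEY, 0)

def unique_visits_between(start_date, end_date):
    """Sum the per-day records for any date range."""
    q = UniqueVisitsRecord.all()
    q.filter('activity_date >=', start_date)
    q.filter('activity_date <=', end_date)
    return sum(rec.event_count for rec in q)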

NDB still does not support DISTINCT. I have written a small utility method to be able to use distinct with GAE.
See here: http://verysimplescripts.blogspot.jp/2013/01/getting-distinct-properties-with-ndb.html

Google App Engine, and more particularly GQL, does not support a DISTINCT function.
But you can use Python's set as described in this blog and in this SO question.
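In the simplest case, that amounts to deduplicating in memory, along the lines of the question's own code (only viable when the result set is small enough to fetch within the request limits):

usernames = set()
for event in Event.all().filter("event_type =", "PAGEVIEW"):
    usernames.add(event.user)
answer = len(usernames)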

Related

Importing CSV takes too long

The Problem
I am writing an app engine Karaoke Catalogs app. The app is very simple: in the first release, it offers the ability to import CSV song lists into catalogs and display them.
I am having a problem with the CSV import: it takes a very long time (14 hours) to import 17,500 records in my development environment. In the production environment, it imports about 1000 records and then crashes with a 500 error. I have gone through the logs but did not find any useful clues.
The Code
class Song(ndb.Model):
    sid = ndb.IntegerProperty()
    title = ndb.StringProperty()
    singer = ndb.StringProperty()
    preview = ndb.StringProperty()

    @classmethod
    def new_from_csv_row(cls, row, parent_key):
        song = Song(
            sid=int(row['sid']),
            title=row['title'],
            singer=row['singer'],
            preview=row['preview'],
            key=ndb.Key(Song, row['sid'], parent=parent_key))
        return song
class CsvUpload(webapp2.RequestHandler):
    def get(self):
        # code omitted for brevity
        pass

    def post(self):
        catalog = get_catalog(…)  # retrieve old catalog or create new
        # upfile is the contents of the uploaded file, not the filename,
        # because the form uses enctype="multipart/form-data"
        upfile = self.request.get('upfile')

        # Create the songs
        csv_reader = csv.DictReader(StringIO(upfile))
        for row in csv_reader:
            song = Song.new_from_csv_row(row, catalog.key)
            song.put()

        self.redirect('/upload')
Sample Data
sid,title,singer,preview
19459,Zoom,Commodores,
19460,Zoot Suit Riot,Cherry Poppin Daddy,
19247,You Are Not Alone,Michael Jackson,Another day has gone. I'm still all alone
Notes
In the development environment, I tried importing up to 17,500 records and experienced no crashes.
At first, records are created and inserted quickly, but as the database grows into the thousands, the time it takes to create and insert each record increases to a few seconds.
How do I speed up the import operation? Any suggestion, hint, or tip will be greatly appreciated.
Update
I followed Murph's advice and used a KeyProperty to link a song back to the catalog. The result is about 4 minutes and 20 seconds for 17,500 records - a huge improvement. That means I had not fully understood how NDB works in App Engine, and I still have a long way to go.
While a big improvement, 4+ minutes is admittedly still too long. I am now looking into Tim's and Dave's advice to further shorten the perceived response time of my app.
In Google App Engine's Datastore, writes to an Entity Group are restricted to 1 write per second.
Since you are specifying a "parent" key for every Song, they all end up in one Entity Group, which is very slow.
Would it be acceptable for your use case to just use a KeyProperty to keep track of that relationship? That would be much faster, though the data might have more consistency issues.
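A sketch of what that could look like with ndb (the catalog property name and the Catalog kind are assumptions):

from google.appengine.ext import ndb

class Song(ndb.Model):
    # No parent key: each Song is its own entity group, so writes are not
    # funneled through a single group's write-rate limit.
    catalog = ndb.KeyProperty(kind='Catalog')  # assumed kind name
    sid = ndb.IntegerProperty()
    title = ndb.StringProperty()
    singer = ndb.StringProperty()
    preview = ndb.StringProperty()

# Eventually consistent query for a catalog's songs:
songs = Song.query(Song.catalog == catalog.key).fetch()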
In addition to the other answer about entity groups: if the import process is going to take longer than 60 seconds, use a task; then you have a 10-minute run time.
Store the CSV as a BlobProperty in an entity (if it compresses to under 1MB) or in GCS for larger files, then fire off a task which retrieves the CSV from storage and does the processing.
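For example, a rough sketch using the deferred library (it reuses Song and get_catalog from the question, assumes the deferred builtin is enabled in app.yaml, and glosses over the task payload size limit, which is why the answer suggests Blobstore/GCS for bigger files):

import csv
from StringIO import StringIO

import webapp2
from google.appengine.ext import deferred

def import_csv(catalog_key, csv_text):
    # Runs on a task queue, so it gets a 10-minute deadline instead of 60s.
    for row in csv.DictReader(StringIO(csv_text)):
        Song.new_from_csv_row(row, catalog_key).put()

class CsvUpload(webapp2.RequestHandler):
    def post(self):
        catalog = get_catalog(…)            # as in the question, arguments elided
        upfile = self.request.get('upfile')
        # Hand the heavy lifting to a task; this request returns immediately.
        deferred.defer(import_csv, catalog.key, upfile)
        self.redirect('/upload')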
First off, Tim is on the right track. If you can't get work done within 60 seconds, defer to a task. But if you can't get work done within 10 minutes, fall back on App Engine MapReduce, which apportions the work of processing your csv across multiple tasks. Consult the demo program, which has some of the pieces that you would need.
For development-time slowness, are you using the --use_sqlite option when starting the dev_appserver?
Murph touches on the other part of your problem. Using Entity groups, you're rate-limited on how many inserts (per Entity group) you can do. Trying to insert 17,500 rows using a single parent isn't going to work at all well. That'll take about 5 hours.
So, do you really need consistent reads? If this is a one-time upload, can you do non-ancestor inserts (with the catalog as a property), then wait a bit for the data to become eventually consistent? This simplifies querying.
If you really, absolutely need consistent reads, you'll probably need to split your writes across multiple parent keys. This will increase your write rate, at the expense of making your ancestor queries more complicated.
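If you do go that route, here is one possible sketch of sharded parents (NUM_SHARDS and the CatalogShard kind are assumptions; row and catalog come from the question's upload code; more shards means higher write throughput but more ancestor queries per read):

import random
from google.appengine.ext import ndb

NUM_SHARDS = 20  # tune to the write rate you need

def shard_parent_key(catalog_key):
    """Pick one of NUM_SHARDS synthetic parents under the catalog."""
    shard_id = random.randint(1, NUM_SHARDS)
    return ndb.Key('CatalogShard', shard_id, parent=catalog_key)

# Writing: spread Songs across the shard parents.
song = Song(parent=shard_parent_key(catalog.key),
            sid=int(row['sid']), title=row['title'],
            singer=row['singer'], preview=row['preview'])
song.put()

# Reading consistently: one ancestor query per shard.
songs = []
for shard_id in range(1, NUM_SHARDS + 1):
    parent = ndb.Key('CatalogShard', shard_id, parent=catalog.key)
    songs.extend(Song.query(ancestor=parent).fetch())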

Should I optimize around reads or CPU time in Google App Engine

I'm trying to optimize my design, but it's really difficult to put things in perspective. Say I have the following cases:
A. A User has 1,000 status updates. These updates are stored in a separate entity, Statuses. I want to get a User's statuses which have an uploadDate after date X. So I do a query:
statuses = Statuses.query(Statuses.uploadDate > X).fetch()
B. A User has 1,000 status updates. Each User entity has a list property list_of_status_keys, which is a list of all keys to the user's statuses. I want to get all statuses with uploadDate after date X. So I easily get a list of statuses using statuses = ndb.get_multi(list_of_status_keys). Then I loop through each one, checking the date:
for a_status in statuses:
    if a_status.uploadDate > X:
        myList.append(a_status)
I really don't know which I should be optimizing for. A query seems more organized, but fetching by keys is quicker. Does anyone have any insight?
UPDATE
Here's what it comes down to:
In each HTTP request to GAE, I get all notifications and status updates for a user (just like Facebook). Appstats tells me that each request costs 490 micropennies (where 1 penny = 1,000,000 micropennies).
Getting notifications and statuses is important for a user, so you can expect them to do this many times. What I'm having a hard time with is determining if this is a lot or not. I'm freaking out trying to minimize this number in any way possible. I've never run a service before, so I don't know if this is how much it should cost. Here's the math:
Each request costs 490 micropennies when no results are returned (so just for a basic query it costs 490, but in some cases when several results are returned, it could cost 10,000 mp), so for 1 penny I can run about 2,040 requests, or for $1 I can run about 204,000 requests.
Let's say I have 50,000 users, and each user checks for notifications 75 times a day (reasonable):
75 requests × 490 mp per request × 50,000 users = 1,837,500,000 micropennies per day = 1,837.5 pennies ≈ $18.38 per day (is that right?)
I've never run a large scale service before, so are these usual costs? Or is this too high? Is 490 micropennies per request high? How would I find an answer to this if it depends?
Design A is superior.
In design A, GAE will use the date to perform an indexed query. What this means is that App Engine will automatically maintain an index for you on the Statuses kind, sorted by date. Since it has an index, it will read and fetch only the records after the date you specify. This will save you a large number of reads.
In design B you basically have to do the indexing work yourself. Since you need to fetch every status and then compare its date, you do more work, both in terms of CPU (and therefore cost) and in terms of performance.
EDIT
If your data is accessed as frequently as this, you may have other design options as well.
First you could consider combining the Status objects into StatusUpdatesPerDay. For each day you create a single instance and then append status updates to that object. This will reduce hundreds of reads into a couple of reads.
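A possible shape for that idea (a sketch only; the key-name scheme is an assumption, Statuses and uploadDate come from the question, and uploadDate is assumed to be a DateTimeProperty):

from google.appengine.ext import ndb

class StatusUpdatesPerDay(ndb.Model):
    # One entity per (user, day); key name e.g. "<user_id>:<YYYY-MM-DD>".
    day = ndb.DateProperty()
    statuses = ndb.LocalStructuredProperty(Statuses, repeated=True)

def append_status(user_id, status):
    day = status.uploadDate.date()
    key_name = '%s:%s' % (user_id, day.isoformat())
    bucket = StatusUpdatesPerDay.get_or_insert(key_name, day=day)
    bucket.statuses.append(status)
    bucket.put()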
Second, since the status updates will be accessed very frequently, you can cache the statuses in memcache. This will reduce costs and latency.
Third, even if you do not optimize as above, I believe NDB has built-in caching. I have never used this feature, but your actual read counts may be lower than in your calculations.
A fourth option is to avoid displaying all status updates at once. Maybe the user wants to see only the last few. Then you can use query cursors to get the remainder when (and if) the user requests them.
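A small sketch of the cursor approach (reusing the Statuses model from the question; the page size and websafe-cursor plumbing are assumptions):

from google.appengine.datastore.datastore_query import Cursor

def fetch_status_page(page_size=20, cursor_str=None):
    """Return one page of recent statuses plus a cursor for the next page."""
    cursor = Cursor(urlsafe=cursor_str) if cursor_str else None
    query = Statuses.query().order(-Statuses.uploadDate)
    page, next_cursor, more = query.fetch_page(page_size, start_cursor=cursor)
    return page, (next_cursor.urlsafe() if more and next_cursor else None)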

Basic friend timeline algorithm?

I'm sure a lot of services online today must perform a task similar to what I'm doing. A user has friends, and I want to get, for each of the user's friends, all status updates posted after that friend's last known status update date.
That was a mouthful, but here's what I have:
A user has, say, 10 friends. What I want to do is get new status updates for all his friends. So, I prepare a dictionary for each friend with their last status date. Something like:
friends = []
for friend in user.friends:
    friends.append({
        'userId': friend.id,
        'lastDate': friend.mostRecentStatusUpdate.date,
    })
Then, on my server side, I do something like this:
for friend in friends:
    userId = friend['userId']
    lastDate = friend['lastDate']
    # each get below launches an RPC and does a separate table lookup,
    # so if I have 100 friends, this seems extremely inefficient
    get statusUpdates for userId where postDate > lastDate
The problem with the above approach is that on the server side each iteration of the for loop launches a new query, which launches an RPC. So if there are a lot of friends, it would seem to be really inefficient.
Is there a better way to design my structure to make this task more efficient? How does say Twitter do something like that, where it gets new time line updates?
From the high level, I'd suggest you follow the prescribed app-engine mantra - make writes expensive to make reads cheap.
For each user, you should keep a collection of their friends and those friends' latest status updates. This allows you to update the collection at write time. It is expensive for the write, but saves you processing and querying at read time. This also assumes that you read more often than you write.
Additionally, if you are just trying to display the N latest updates for each friend, I would suggest using an NDB StructuredProperty to store the Friend objects - this way you can create a matching data structure. As part of the object, keep a collection of keys that correspond to the status updates. When a status update is written, add it to the collection, and potentially remove older entries (if space is a concern).
This way, when you need to retrieve the updates, you get them by key instead of via more expensive query types.
An alternative that avoids any additional queries is to store the entire update instead of just the keys. However, this takes a lot more storage - 10 friends, all interconnected, means 100 copies of the same update.
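A very rough sketch of this fan-out-on-write idea, using a repeated KeyProperty for brevity where the answer suggests a StructuredProperty (the follower_keys field, the Timeline kind and the size cap are all assumptions):

from google.appengine.ext import ndb

MAX_TIMELINE_ENTRIES = 100  # keep only the latest N keys per user

class Timeline(ndb.Model):
    # One per user, keyed by the user's urlsafe key; holds keys of the
    # friends' recent status updates, appended at write time.
    status_keys = ndb.KeyProperty(repeated=True)

def publish_status(author, status):
    status.put()
    # Fan out to every follower's timeline when the status is written.
    for follower_key in author.follower_keys:       # assumed field
        timeline = Timeline.get_or_insert(follower_key.urlsafe())
        timeline.status_keys.append(status.key)
        timeline.status_keys = timeline.status_keys[-MAX_TIMELINE_ENTRIES:]
        timeline.put()

def read_timeline(user_key):
    # Reads are now a single get_multi by key, with no queries at all.
    timeline = Timeline.get_by_id(user_key.urlsafe())
    return ndb.get_multi(timeline.status_keys) if timeline else []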

Inequality Filters on a date and a number

I am trying to query my Google App Engine datastore [Python], which has an item_name, a manufacturing_date and a number_of_items_shipped. There are ~1.0 million records in the datastore, and the number is ever increasing.
The scenario:
Get all the item_names which have been shipped more than x_items [user input] and manufactured after some_date [user input].
Basically, kind of an inventory check.
Effectively, this is two inequality filters on two different properties.
But due to restrictions on queries in GAE, I am not able to do this.
I have searched SO for this issue, but no luck so far. Have you come across this issue? If so, were you able to resolve it? Please let me know.
Also, in the Google I/O 2010 talk Next Gen Queries, Alfred Fuller mentioned that they were going to remove this restriction soon. It's been more than 8 months, but the restriction is unfortunately still in place.
I would appreciate an answer from anyone who has been able to circumvent this restriction.
Thanks a lot.
Building on Sudhir's answer, I'd probably assign each record to a manufacture date "bucket", based on the granularity you care about. If your range of manufacturing dates is over a couple of years, use monthly buckets for example. If your range is just in the last year, weekly.
Now when you want to find records with > n sales and manufacturing date in a given range, do your query once per bucket in that range, and postfilter out the items you are not interested in.
For example (totally untested):
BUCKET_SIZE_DAYS = 10

def put(self):
    self.manufacture_bucket = int(self.manufacture_date.toordinal() / BUCKET_SIZE_DAYS)
    super(self.__class__, self).put()

def filter_date_after(self, date_start):
    first_bucket = int(date_start.toordinal() / BUCKET_SIZE_DAYS)
    last_bucket = int(datetime.datetime.today().toordinal() / BUCKET_SIZE_DAYS)
    for this_bucket in range(first_bucket, last_bucket + 1):
        for found in self.filter("manufacture_bucket =", this_bucket):
            if found.manufacture_date >= date_start:
                yield found
You should be then able to use this like:
widgets.filter("sold >", 7).filter_date_after(datetime.datetime(2010,11,21))
Left as an exercise for the reader:
Making it play nicely with other filters added to the end
Multiple bucket sizes allowing you to always query ln(days in date range) buckets.
Unfortunately, you can't circumvent this restriction, but I can help you model the data in a slightly different way.
First off, Bigtable is suited to very fast reads off large databases - the kind you do when you have a million people hitting your app at the same time. What you're trying to do here is a report on historical data. While I would recommend moving the reporting to an RDBMS, there is a way you can do it on Bigtable.
First, override the put() method on your item model to split the date before saving it. What you would do is something like
def put(self):
    self.manufacture_day = self.manufacture_date.day
    self.manufacture_month = self.manufacture_date.month
    self.manufacture_year = self.manufacture_date.year
    super(self.__class__, self).put()
You can do this to any level of granularity you want, even hours, minutes, seconds, whatever.
You can apply this retroactively to your database by just loading and saving your item entities. The mapper is very convenient for this.
Then change your query to use the inequality only on the item count, and select the days / months / years you want using normal equalities. You can do ranges by either firing multiple queries or using the IN clause (which does the same thing anyway).
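For example, one possible shape of the resulting query (the Item model name and the specific year/month values are assumptions; number_of_items_shipped and x_items follow the question):

# Single inequality on the shipped count; the date range is expressed
# with equality filters on the split-out date fields.
q = Item.all()
q.filter('number_of_items_shipped >', x_items)
q.filter('manufacture_year =', 2011)
q.filter('manufacture_month IN', [1, 2, 3])  # expands into one query per value
results = q.fetch(1000)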
This does seem contrived and tough to do, but keep in mind that your reports will run almost instantaneously if you do this, even when millions of people try to run them at the same time. You might not need this kind of scale, but well... that's what you get :D

Efficiently filter large number of datastore entities with a large number of property values

In my App Engine datastore I've got an entity type which may hold a large number of entities, each of which will have the property 'customer_id'. For example, let's say a given customer_id has 10,000 entities, and there are 50,000 customer_ids.
I'm trying to filter this effectively, so that a user could get information for at least 2000 customer_ids at one time. That is, read them out to the UI within the 30-second timeout limit (further filtering will be done at the front end, so the user isn't bombarded with all results at once).
Below I've listed a view of my current datastore models. 'Reports' refer to sets of customer_ids, so continuing the above example, I could get my 2000 customer_ids from ReportCids.
class Users(db.Model):
    user = db.StringProperty()
    report_keys_list = db.ListProperty(db.Key)

class Reports(db.Model):
    # report_key
    report_name = db.StringProperty()

class ReportCids(db.Model):
    report_key_reference = db.ReferenceProperty(Reports, collection_name="report_cid_set")
    customer_id = db.IntegerProperty()
    start_timestamp = db.IntegerProperty()
    end_timestamp = db.IntegerProperty()

class CustomerEvent(db.Model):
    customer_id = db.IntegerProperty()
    timestamp = db.IntegerProperty()
    event_type = db.IntegerProperty()
The options I considered:
- Perform a separate query for each customer_id in my set of 2000
- Use lists of keys indicating customer events, but this is limited to 5000 entries in a list (so I've read)
- Get all entries, and filter in my code
I'd really appreciate if anyone had some advice on how to do this in the most efficient way, or if I'm approaching the problem in completely the wrong way. I'm a novice in terms of using the datastore effectively.
Of course happy to provide any clarification or information if it helps.
Thanks very much!
Thanks for getting back to me. It looks like I had an issue with the account used when I posted so I need to respond in the comments here.
Having thought about it and from what you've said, fetching that many results is not going to work.
Here's what I'm trying to do:
I'm trying to make a report that shows, for multiple customer IDs, the events that happened for that group of customers. So let's say I have a report to view information for 2000 customers. I want to be able to get all events (CustomerEvent) and then filter them by event_type. I'm likely asking a lot here, but what I was hoping to do was get all of these events for 2000 customers and then do the event_type filtering at the front end, so that a user could dynamically adjust the event_type they want to look for and get some information on successful actions for that event type.
So my main problem is getting the right entities out of CustomerEvent effectively.
Right now, I'm grabbing a list of Customer IDs like this:
cid_list = []
this_report = models.Reports.get(report_key)
if this_report.report_cid_set:
    for entity in this_report.report_cid_set:
        cid_list.append(entity.customer_id)
My estimate of 10,000 CustomerEvent entities was quite high, but theoretically it could happen. Perhaps when I go to get the report results, I could filter straight away by the event_type specified by the user. This means that I have to go back to the datastore each time they choose a new option, which isn't ideal, but perhaps it's my only option given this setup.
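As a sketch, filtering by the chosen event_type up front might look something like this (reusing CustomerEvent and cid_list from above; the shared timestamp range, the composite index it would need and the 30-value IN batching are simplifying assumptions):

def events_for_customers(cid_list, event_type, start_ts, end_ts):
    """One query per batch of customer_ids, filtered by event_type up front."""
    BATCH = 30  # an IN filter expands into one sub-query per value
    results = []
    for i in range(0, len(cid_list), BATCH):
        q = CustomerEvent.all()
        q.filter('customer_id IN', cid_list[i:i + BATCH])
        q.filter('event_type =', event_type)
        q.filter('timestamp >=', start_ts)
        q.filter('timestamp <=', end_ts)
        results.extend(q.run(batch_size=1000))
    return results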
Thanks very much for taking the time to look at this!
