How to best design a date/geographic proximity query on GAE? - python

I'm building a directory for finding athletic tournaments on GAE with
web2py and a Flex front end. The user selects a location, a radius, and a maximum
date from a set of choices. I have a basic version of this query implemented, but it's
inefficient and slow. One way I know I can improve it is by condensing
the many individual queries I'm using to assemble the objects into
bulk queries. I just learned that was possible. But I'm also thinking about a more extensive redesign that utilizes memcache.
The main problem is that I can't query the datastore by location
because GAE won't allow multiple numerical comparison statements
(<,<=,>=,>) in one query. I'm already using one for date, and I'd need
TWO to check both latitude and longitude, so it's a no go. Currently,
my algorithm looks like this:
1.) Query by date and select
2.) Use destination function from geopy's distance module to find the
max and min latitude and longitudes for supplied distance
3.) Loop through results and remove all with lat/lng outside max/min
4.) Loop through again and use distance function to check exact
distance, because step 2 will include some areas outside the radius.
Remove results outside supplied distance (is this 2/3/4 combination
inefficient?)
5.) Assemble many-to-many lists and attach to objects (this is where I
need to switch to bulk operations)
6.) Return to client
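To make steps 2-4 above concrete, here is a minimal sketch of the bounding-box and exact-distance filtering; the entity attribute names (latitude, longitude) and the use of geopy's geodesic distance are assumptions of the sketch, not code from the question:
from geopy.distance import geodesic

def filter_by_radius(results, center_lat, center_lng, radius_miles):
    # results: entities with .latitude and .longitude attributes (assumed names)
    center = (center_lat, center_lng)
    reach = geodesic(miles=radius_miles)
    # Step 2: bounding box from the four compass bearings
    north = reach.destination(center, bearing=0).latitude
    south = reach.destination(center, bearing=180).latitude
    east = reach.destination(center, bearing=90).longitude
    west = reach.destination(center, bearing=270).longitude
    # Step 3: cheap rectangle test
    coarse = [r for r in results
              if south <= r.latitude <= north and west <= r.longitude <= east]
    # Step 4: exact great-circle check to trim the rectangle's corners
    return [r for r in coarse
            if geodesic(center, (r.latitude, r.longitude)).miles <= radius_miles]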
Here's my plan for using memcache. Let me know if I'm way out in left
field on this, as I have no prior experience with memcache or server
caching in general.
-Keep a list in the cache filled with "geo objects" that represent all
my data. These have five properties: latitude, longitude, event_id,
event_type (in anticipation of expanding beyond tournaments), and
start_date. This list will be sorted by date.
-Also keep a dict of pointers in the cache which represent the start
and end indices in the cache for all the date ranges my app uses (next
week, 2 weeks, month, 3 months, 6 months, year, 2 years).
-Have a scheduled task that updates the pointers daily at 12am.
-Add new inserts to the cache as well as the datastore; update
pointers.
Using this design, the algorithm would now look like:
1.) Use pointers to slice off appropriate chunk of list based on
supplied date.
2-4.) Same as above algorithm, except with geo objects
5.) Use bulk operation to select full tournaments using remaining geo
objects' event_ids
6.) Assemble many-to-manys
7.) Return to client
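For concreteness, a rough sketch of step 1 under this design; the memcache key names, the pointer layout, and the exclusive end index are assumptions, and a datastore fallback is needed because cached entries can be evicted at any time:
from google.appengine.api import memcache

GEO_LIST_KEY = 'geo_objects_by_date'   # list of geo objects, sorted by start_date
POINTERS_KEY = 'date_range_pointers'   # e.g. {'2_weeks': (0, 341), 'month': (0, 755)}

def slice_by_date_range(range_name):
    geo_objects = memcache.get(GEO_LIST_KEY)
    pointers = memcache.get(POINTERS_KEY)
    if geo_objects is None or pointers is None or range_name not in pointers:
        return None  # cache miss: rebuild from the datastore instead
    start, end = pointers[range_name]   # end treated as exclusive in this sketch
    return geo_objects[start:end]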
Thoughts on this approach? Many thanks for reading and any advice you
can give.
-Dane

GeoModel is the best solution I have found. You can look at how my GAE app returns geospatial queries. For instance, an HTTP query for India, using the geomodel library with an optional cc (country code) parameter, is lat=20.2095231&lon=79.560344&cc=IN

You might be interested in geohash, which enables you to do an inequality query like this:
SELECT latitude, longitude, title FROM
myMarkers WHERE geohash >= :sw_geohash
AND geohash <= :ne_geohash
Have a look at this fine article, which was featured in this month's Google App Engine Community Update blog post.
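If it helps, here is a small sketch of producing the two corner hashes for that query; it assumes the python-geohash package (geohash.encode) and a precision chosen by trial, neither of which the article mandates:
import geohash

def bounding_geohashes(sw_lat, sw_lng, ne_lat, ne_lng, precision=6):
    # south-west and north-east corners of the search rectangle
    sw_geohash = geohash.encode(sw_lat, sw_lng, precision)
    ne_geohash = geohash.encode(ne_lat, ne_lng, precision)
    return sw_geohash, ne_geohash
The pair can then be bound to :sw_geohash and :ne_geohash above; as with a plain lat/lng rectangle, the results still need an exact distance check afterwards.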
As a note on your proposed design, don't forget that entities in Memcache have no guarantee of staying in memory, and that you cannot rely on having them "sorted by date".

Related

smart way to structure my SQLite Database

I am new to database things and only have a very basic understanding of them.
I need to save historic data of a leaderboard and I am not sure how to do that in a good way.
I will get a list of accountName, characterName and xp.
Options I was thinking of so far:
An extra table for each account where I add their xp as another entry every 10 min (not sure where to put the character name in that option)
A new table added every 10 min containing all the data I got for that interval
I am not very sure about the first option, since there will be about 2000 players; I don't know if I want to have 2000 tables (would that be a problem?). But I also don't feel like the second option is a good idea.
It feels like with some basic dimensional modeling techniques you will be able to solve this.
Specifically it sounds like you are in need of a Player Dimension and a Play Fact table...maybe a couple more supporting tables along the way.
It is my pleasure to introduce you to the Guru of Dimensional Modeling (IMHO): Kimball Group - Dimensional Modeling Techniques
My advice - invest a bit of time there, put a few basic dimensional modeling tools in your toolbox, and this build should be quite enjoyable.
In general you want to have a small number of tables, and the number of rows per table doesn't matter so much. That's the case databases are optimized for. Technically you'd want to strive for a structure that implements the Third normal form.
If you wanted to know which account had the most xp, how would you do it? If each account has a separate table, you'd have to query each table. If there's a single table with all the accounts, it's a trivial single query. Expanding that to say the top 15 is likewise a simple single query.
If you had a history table with a snapshot every 10 minutes, that would get pretty big over time but should still be reasonable by database standards. A snapshot every 10 minutes for 2000 characters over 10 years would result in 1,051,920,000 rows, which is a lot, though still well within SQLite's limits (the practical ceiling is the maximum database size rather than a row count). But if you got to that point I think you might be better off splitting the data into multiple databases rather than multiple tables. How far back do you want easily accessible history?
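To make the "few tables, many rows" suggestion concrete, here is a minimal sketch using the standard sqlite3 module; the table and column names are invented for illustration:
import sqlite3

conn = sqlite3.connect('leaderboard.db')
conn.executescript("""
CREATE TABLE IF NOT EXISTS account (
    account_id     INTEGER PRIMARY KEY,
    account_name   TEXT NOT NULL,
    character_name TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS xp_snapshot (
    account_id INTEGER NOT NULL REFERENCES account(account_id),
    taken_at   TEXT NOT NULL,   -- ISO timestamp of the 10-minute poll
    xp         INTEGER NOT NULL,
    PRIMARY KEY (account_id, taken_at)
);
""")

# Top 15 accounts by their most recent snapshot, in a single query.
top15 = conn.execute("""
    SELECT a.account_name, s.xp
    FROM account a
    JOIN xp_snapshot s ON s.account_id = a.account_id
    WHERE s.taken_at = (SELECT MAX(taken_at) FROM xp_snapshot
                        WHERE account_id = a.account_id)
    ORDER BY s.xp DESC
    LIMIT 15
""").fetchall()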

Create a pseudo GTFS dataset from AVL (GPS) data in .CSV format

I have an automatic vehicle location (AVL) dataset in .csv format of the public transit system of a city. I would like to use this AVL dataset to build a GTFS dataset for the purpose of running accessibility analysis.
I've seen a solution for creating a GTFS dataset based on GPS data stored in a SQL database (here), but I haven't found a solution for when the GPS data is stored in .csv format, which is the case here. I would be glad to have any help on this, preferably with a solution in either R or Python.
I already have the stops.txt file of the GTFS, but I guess I would need to create the files shapes.txt, trips.txt, routes.txt and stop_times.txt.
This is what my GPS.csv dataset looks like:
timestamp order line lat long speed route_name
1: 2016-02-24 00:04:56 B27084 905 -22.9 -43.3 32.00 12860326
2: 2016-02-24 00:05:07 B41878 2302 -22.9 -43.2 0.19 12860386
3: 2016-02-24 00:04:37 B75563 928 -22.9 -43.2 0.00 12867184
4: 2016-02-24 00:05:17 D86084 852 -23.0 -43.6 24.26 12860043
5: 2016-02-24 00:04:58 C41420 -22.9 -43.2 0.00 NA
6: 2016-02-24 00:04:47 C30084 -23.0 -43.3 0.00 NA
There are five required files: agency.txt, routes.txt, trips.txt, stop_times.txt, and stops.txt. For a pseudo-GTFS that is only intended for computing accessibility, a lot of the optional fields in the required files can be omitted, as well as all of the optional files. However, you might want to copy real optional files or construct them, as they can be useful for this purpose (e.g. people consider fares when choosing how to travel, so fares.txt could be worth having).
Read the specification carefully.
agency
If it's acceptable to imagine that all routes are served by the same agency, yours could simply be:
agency_id,agency_name,agency_url,agency_timezone,agency_lang,agency_phone
XXX,My Awesome Agency,http://example.com,,,
i.e. you only need the first three fields.
agency.txt is intended to represent:
One or more transit agencies that provide the data in this feed.
routes
You need:
route_id (primary key)
route_short_name
route_long_name
route_type (must be in range 0–7; indicates mode)
Example:
route_id,agency_id,route_short_name,route_long_name,route_desc,route_type,route_url,route_color
12860326,XXX,12860326,12860326,,3,,
12860386,XXX,12860386,12860386,,3,,
12867184,XXX,12867184,12867184,,3,,
I don't know what to do with the records that do not have a route assigned to them in your example data. I also don't know what order refers to. Perhaps order is a name for the route? As long as you can come up with something that matches the concept of a "route" identifier, you can use that. For reference, a "route" is defined as:
A route is a group of trips that are displayed to riders as a single
service.
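As a rough illustration of this step, the following sketch derives routes.txt from the distinct route_name values in the GPS.csv shown above; the column names and the choice to use route_name as route_id are assumptions about your data:
import csv

with open('GPS.csv', newline='') as src:
    route_ids = sorted({row['route_name'] for row in csv.DictReader(src)
                        if row['route_name'] and row['route_name'] != 'NA'})

with open('routes.txt', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['route_id', 'agency_id', 'route_short_name',
                     'route_long_name', 'route_desc', 'route_type',
                     'route_url', 'route_color'])
    for rid in route_ids:
        writer.writerow([rid, 'XXX', rid, rid, '', 3, '', ''])  # 3 = bus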
trips
A trip is a sequence of two or more stops that occurs at a specific time.
You need:
trip_id (primary key)
route_id (foreign key)
service_id (foreign key)
Example:
route_id,service_id,trip_id,trip_headsign,direction_id,block_id,shape_id
12860326,1,1,,1,,12860326
12860326,1,2,,1,,12860326
12860386,1,3,,1,,12860386
12860386,1,4,,2,,12860386
direction_id, while optional, tends to be pretty useful, and I've had several applications that ingest GTFS require it despite its optional status.
service_id is tricky, and works in conjunction with calendar dates. It allows the GTFS to easily represent, say, "normal" weekday service, and holiday services when holidays fall on weekdays. For your purposes, you can probably just use 1 for everything, but it depends on your application and when your AVL data has been collected. When I worked on a similar application, I maintained a lookup table in my database that told me whether a particular date was a public holiday, and/or a school holiday, and/or during the university semesters, because bus routes changed to suit students.
shape_id is optional but will be critical if you want to draw your routes on maps, or use tools like OpenTripPlanner.
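Purely as a sketch of how trips.txt might be seeded from the AVL file, assuming one trip per (route, vehicle, service day) and service_id fixed at 1; real data usually needs a smarter split into individual runs (e.g. by gaps in the timestamps):
import csv

trips = {}  # (route_name, vehicle, day) -> trip_id
with open('GPS.csv', newline='') as src:
    for row in csv.DictReader(src):
        if not row['route_name'] or row['route_name'] == 'NA':
            continue
        key = (row['route_name'], row['order'], row['timestamp'][:10])
        trips.setdefault(key, len(trips) + 1)

with open('trips.txt', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['route_id', 'service_id', 'trip_id', 'trip_headsign',
                     'direction_id', 'block_id', 'shape_id'])
    for (route_id, _vehicle, _day), trip_id in sorted(trips.items(), key=lambda kv: kv[1]):
        writer.writerow([route_id, 1, trip_id, '', '', '', route_id])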
stop_times
Times that a vehicle arrives at and departs from individual stops for each trip.
You will need:
stop_id (foreign key to stops.txt)
trip_id (foreign key to trips.txt)
arrival_time
departure_time
stop_sequence
This will require the most work when scripting. It will be several orders of magnitude larger than all of the other files combined.
stop_id and trip_id happily relate to the stops and trips as already identified. The departure_time and arrival_time will be in two rows of the AVL data, and in many cases actually identifying when a service arrived at a stop is the most difficult aspect of this task. It's easier with access to passenger smartcard data, and when a service actually stops you're likely to find spatial clusters of AVL records as the vehicle would not have moved for a particular period of time. However if a stop is empty and no one wants to get off, it will be hard to determine when a service actually "arrived" at the stop---particularly because the behaviour of a driver can sometimes change if they do not intend to make a stop when one is scheduled (e.g. travelling faster or taking a shortcut if they see no one waiting). In your case, the speed value is likely to be helpful, but be careful not to confuse a passenger stop with an intersection.
stop_sequence is optional but is another case where applications often expect it to exist. Anyway, if your script can't identify stop_sequence then you probably can't correctly invent this file.
Example:
trip_id,arrival_time,departure_time,stop_id,stop_sequence,stop_headsign,pickup_type,drop_off_type,shape_dist_traveled
1,00:05:07,00:05:54,22018,1,,1,1,0
1,00:07:16,00:08:01,22557,2,,1,1,39
1,00:10:56,00:10:56,22559,3,,1,1,76
Indicating dwelling time is optional, so if this is too hard to work out, arrival_time and departure_time can validly be the same moment.
In practice, pickup_type and drop_off_type are very influential, but generally impossible to determine from AVL data alone, unless your AVL collector has really thought about supporting GTFS in their archival... which is unfortunately very unlikely. You will probably just have to allow both always, unless you have additional information that you can insert (e.g. "all trips on route 1 after stop 4 in weekday evenings only let passengers off").
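To show one way the arrival detection could start, here is a hedged sketch that flags an AVL record as a possible stop event when it is both close to a stop and nearly stationary; the distance and speed thresholds are guesses to be tuned against your data:
import math

def haversine_m(lat1, lon1, lat2, lon2):
    # great-circle distance in metres
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def looks_like_stop_event(avl_row, stop, max_dist_m=30.0, max_speed=2.0):
    # avl_row: dict with 'lat', 'long', 'speed'; stop: dict with 'stop_lat', 'stop_lon'
    close = haversine_m(float(avl_row['lat']), float(avl_row['long']),
                        stop['stop_lat'], stop['stop_lon']) <= max_dist_m
    slow = float(avl_row['speed']) <= max_speed
    return close and slow
Remember the caveat above: a slow vehicle near an intersection will also satisfy this test, so clustering consecutive records and checking them against the route's stop sequence is still needed before writing arrival_time and departure_time.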
stops
stop_id (primary key)
stop_name
stop_lon
stop_lat
You said that you have this already, which is great. The challenge is really in getting this to interface with stop_times via the stop_id foreign key. The AVL data I have worked with fortunately identified when services were stopped, and at what stop they were stopped at, using the same code as in the GTFS representation of the schedule.
calendar
To get good results with tools like OpenTripPlanner, you will probably need to include a calendar.txt file. This also helps to identify the period of validity for your pseudo-GTFS, if you're taking the approach of modelling a defined period of time. For example:
service_id,monday,tuesday,wednesday,thursday,friday,saturday,sunday,start_date,end_date
1,1,1,1,1,1,0,0,20160224,20160226
2,0,0,0,0,0,1,1,20160224,20160226
3,0,0,0,0,0,1,0,20160224,20160226
This indicates that the modelled period is from 2016-02-24 to 2016-02-26 for those services. Any route requested outside of that range has an undefined travel time. I recommend that you pick a period of no more than a week: more than that and applications consuming the GTFS will start to struggle with the volume of data. Real GTFS data benefits from redundancy in ways that this "pseudo" data cannot.
shapes
Don't worry about shape_dist_traveled; I just use dummy information for that (monotonically increasing): it can be inferred from the shape, unless the shape is too generalised.
Example:
shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled
12860386,-22.9,-43.3,1,1
12860386,-22.0,-42.9,2,2
Note
The general idea is to use the AVL data at hand to fulfil the minimum requirements of a specification-meeting transit feed. You will probably need to write your own scripts to create these files, because there is no standard for AVL data. You can make some information up, and you will probably need to: most applications will raise exceptions when you try to use an incomplete feed. Indeed, in my experience, quite a few applications will actually have problems with feeds that meet only the minimum requirements, because they are poorly written and most real-world data goes a bit beyond the minimum standard.
You will probably find deficiencies in your AVL data that make it hard to use. The most notable case of this is routes that did run, but the AVL did not work. In such a case, your pseudo-GTFS will not accurately represent the transit system in practice. These are nearly impossible to detect.
In this case, I don't understand the differences between your order, line, and route fields. You will need to determine where these best fit; I've ignored them because I don't understand what they represent. You need to match the AVL schema to the concepts of the GTFS.
Transit systems tend to be very complicated with lots of little exceptions. You might end up excluding some particularly aberrant cases.
Your latitude and longitude values do not look very precise: if that is real data, you probably will not be able to use shapes.txt. Try asking for more precision in the vehicle positions.

Django and location based app

I'm building a Django app that needs to display (on a Google map) items near the user's current location (i.e. the nearest-coffee-shop kind of request).
I haven't decided how I will capture those items' location in db.
The requirements aren't final but I'm guessing the user will want to sort by distance, etc.
My question is what solution should I look into?
I looked at GeoDjango but it seems to be overkill.
Could you please tell me what solution you would use and why?
Thanks everyone!
You will need to use an RDBMS like MySQL or PostgreSQL. Then, create your objects (e.g. coffee shops) with latitude and longitude stored as floats. Get the user's latitude and longitude and look nearby objects up via sin and cos functions.
You will need to write a raw SQL query to look up objects based on their latitude and longitude.
Read this: http://www.scribd.com/doc/2569355/Geo-Distance-Search-with-MySQL
Take a look at this: Filter zipcodes by proximity in Django with the Spherical Law of Cosines
and this: latitude/longitude find nearest latitude/longitude - complex sql or complex calculation
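As an illustration of the raw-SQL approach, here is a minimal sketch assuming a hypothetical CoffeeShop model with float lat/lng columns in a table named shops_coffeeshop, and a MySQL-style query (HAVING on an alias) using the spherical law of cosines, distances in km:
from django.db import connection

def shops_near(user_lat, user_lng, radius_km=5, limit=20):
    sql = """
        SELECT id, name,
               6371 * ACOS(LEAST(1.0,
                   COS(RADIANS(%s)) * COS(RADIANS(lat)) *
                   COS(RADIANS(lng) - RADIANS(%s)) +
                   SIN(RADIANS(%s)) * SIN(RADIANS(lat))
               )) AS distance_km
        FROM shops_coffeeshop
        HAVING distance_km <= %s
        ORDER BY distance_km
        LIMIT %s
    """
    with connection.cursor() as cur:
        cur.execute(sql, [user_lat, user_lng, user_lat, radius_km, limit])
        return cur.fetchall()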

How would I sort a set of lat/lons by distance from a single lat/lon?

A user signs up for my site and enters in their zip code. I want to query for other users, and sort by distance.
I have a database full of zip codes with lat/lon points for each zip code.
zip_code (char)
lat (float)
lon (float)
I have a method which will calculate the distance between two sets of lat/lons, but to run this on every other zip code in my db is expensive. I'd need to run this on every zip code combination. I suppose I can do it once and store it somewhere, but where would I store it? Seems strange to have a table for every zip code which would contain the distance to every other zip code. Is there a clean way to do this?
Doing it once and storing it somewhere sounds good to me. Here are some ideas that might give good performance with some consideration to storage space without sacrificing accuracy:
There are something like 43,191 zip codes, so the full matrix would be 1,865,462,481 entries. But the distances are of course symmetrical and the self-to-self ones are useless, which immediately cuts it down to 932,709,645 entries. We might also cut the space by realizing that a bunch of zip codes are either the same as each other, or one contains the other (e.g. 10178 seems to be inside 10016, and they're both geographically small). Many zip codes will have no users at all, so we might avoid populating those until they're needed (i.e. lazy load the cache). And finally, you can probably throw away large-distance results, where large is defined as a distance greater than is useful for your users.
For a more algorithmic view, see this previous question: Calculate distance between zip codes and users
Bonus tip: don't forget about non-US users. Poor non-US users.
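If "do it once and store it" is done per origin zip rather than for the whole matrix, a small in-process sketch might look like this; the haversine helper and the cache dict are illustrative, and in production the cached lists would live in the database or memcached:
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

_nearest_cache = {}  # origin zip_code -> [(distance_miles, zip_code), ...]

def zips_sorted_by_distance(origin, zip_rows):
    # zip_rows: iterable of (zip_code, lat, lon) tuples; origin must appear in it
    if origin not in _nearest_cache:
        coords = {z: (lat, lon) for z, lat, lon in zip_rows}
        olat, olon = coords[origin]
        _nearest_cache[origin] = sorted(
            (haversine_miles(olat, olon, lat, lon), z)
            for z, (lat, lon) in coords.items() if z != origin)
    return _nearest_cache[origin]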
Here's a solution with a fair amount of overhead, but which will pay off as your dataset size, user base, and/or number of transactions grow:
If you don't already have one, use a database that supports spatial types and spatial indexing. I recommend the PostGIS extension for PostgreSQL, but most of these steps apply to other spatially-enabled databases:
Store your zip code location as a Point geometry type instead of two columns for lat and long.
Create a spatial index against the Point geometry column. Every time you add a new zip code, its location will automatically be added to the spatial index.
Assuming you don't want to show "nearest" neighbors that are thousands of miles away, use a Within function (ST_DWithin in PostGIS) to filter out those zip codes that are too far away. This will significantly reduce the search space for close neighbors.
Finally use a Distance function (ST_Distance in PostGIS) to calculate the distance between your zip code of interest and its closer neighbors, and use the DB to return results sorted by distance.
By using a database with spatial index and a filtering function that uses that index, you can significantly speed up your search. And when the time comes to do more spatial analysis or show maps, you'll already have a framework in place to support that new functionality.
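A sketch of the ST_DWithin filter plus ST_Distance ordering from Python with psycopg2; the zip_codes table, its geography column geom, and the 80 km cutoff are illustrative, not prescribed:
import psycopg2

SQL = """
    SELECT z.zip_code,
           ST_Distance(z.geom, origin.geom) AS meters
    FROM zip_codes z,
         (SELECT geom FROM zip_codes WHERE zip_code = %s) AS origin
    WHERE z.zip_code <> %s
      AND ST_DWithin(z.geom, origin.geom, %s)   -- filtered via the spatial index
    ORDER BY meters
"""

def neighbours(conn, zip_code, radius_m=80000):
    with conn.cursor() as cur:
        cur.execute(SQL, (zip_code, zip_code, radius_m))
        return cur.fetchall()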

Algorithm in Python to store and search daily occurrence for thousands of numbered events?

I'm investigating solutions for storing and querying a historical record of event occurrences for a large number of items.
This is the simplified scenario: I'm getting a daily log of 200 000 streetlamps (labeled sl1 to sl200000) which shows whether the lamp was operational on the day or not. It does not matter for how long the lamp was in service, only that it was on a given calendar day.
Other bits of information are stored for each lamp as well and the beginning of the Python class looks something like this:
class Streetlamp(object):
    """Class for streetlamp record"""
    def __init__(self, **args):
        self.location = args['location']
        self.power = args['power']
        self.inservice = ???
My py-foo is not too great and I would like to avoid a solution which is too greedy on disk/memory storage. A dict of (year, month, day) tuples could be one solution, but I'm hoping to get pointers to something more efficient.
A record could be stored as a bit stream with each bit representing a day of a year starting with Jan 1. Hence, if a lamp was operational the first three days of 2010, then the record could be:
sl1000_up = {'2010': '11100000000000...', '2011': '11111100100...'}
Search across year boundaries would need a merge, leap years are a special case, plus I'd need to code/decode a fair bit with this home-grown solution. It seems not quite right. speed-up-bitstring-bit-operations, how-do-i-find-missing-dates-in-a-list and finding-data-gaps-with-bit-masking were interesting postings I came across. I also investigated python-bitstring and did some googling, but nothing seems to really fit.
Additionally, I'd like searching for 'gaps' to be possible, e.g. 'three or more days out of action', and it is essential that a flagged day can be converted into a real calendar date.
I would appreciate ideas or pointers to possible solutions. To add further detail, it might be of interest that the back-end DB used is ZODB and pure Python objects which can be pickled are preferred.
Create a 2D array in NumPy:
import numpy as np
nbLamps = 200000
nbDays = 365
arr = np.zeros((nbLamps, nbDays), dtype=bool)
It will be very memory-efficient and you can easily aggregate over the days and lamps.
In order to manipulate the days even better, have a look at scikits.timeseries. It will allow you to access the dates with datetime objects.
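To show how the aggregation mentioned above might look, a short sketch; the index conventions (row = lamp, column = day of a fixed 365-day year) are assumptions, and up plays the role of arr above:
import numpy as np

nbLamps, nbDays = 200000, 365
up = np.zeros((nbLamps, nbDays), dtype=bool)
# ... fill from the daily logs: up[lamp_index, day_of_year] = True ...

days_down_per_lamp = (~up).sum(axis=1)   # aggregate over days
lamps_down_per_day = (~up).sum(axis=0)   # aggregate over lamps

# lamps with three or more consecutive days out of action
down = ~up
three_day_gap = down[:, :-2] & down[:, 1:-1] & down[:, 2:]
flagged_lamps = np.flatnonzero(three_day_gap.any(axis=1))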
I'd probably put the lamps in a dictionary and have each of them contain a list of state changes, where the first element is the time of the change and the second the value that's valid since that time.
This way when you get to the next sample you do nothing unless the state changed compared to the last item.
Searching is quick and efficient as you can use binary search approaches on the times.
Persisting it is also easy, and you can append data to an existing, running system without any problems, as well as de-duplicate identical lamp state lists (the dictionary trick again) to further reduce resource usage.
If you want to search for a gap, you just go over all the items and compare the next and previous times. If you decided to de-duplicate the state lists, you'll be able to do it just once for every distinct list rather than for every lamp, and then get all the lamps that had the same "offline" states with just one iteration, which may sometimes help.
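A compact sketch of this state-change idea, with bisect for the lookup; the data layout is an assumption of the sketch, not an existing API:
import bisect
from datetime import date

changes = {}  # lamp_id -> [(date, is_up), ...] sorted by date

def record(lamp_id, day, is_up):
    history = changes.setdefault(lamp_id, [])
    if not history or history[-1][1] != is_up:   # store only actual changes
        history.append((day, is_up))

def state_on(lamp_id, day):
    history = changes.get(lamp_id, [])
    i = bisect.bisect_right(history, (day, True)) - 1
    return history[i][1] if i >= 0 else None

record('sl1000', date(2010, 1, 1), True)
record('sl1000', date(2010, 1, 4), False)
print(state_on('sl1000', date(2010, 1, 3)))   # True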
