Best data structure for finding nearby xyz points by distance?

I'm working in Python, but I suppose that doesn't affect the question itself.
I'm working on a game, and I need to store entities, each of which has an [x,y,z] position in the world. I need to be able to run a query like "all entities within Euclidean distance X of point Y".
These entities will be moving fairly often.
What would be the most efficient way to store the entities to make this as fast as possible?

As an alternative to what has been suggested already, if you don't need an exact distance, you could also use spatial hashing, which is quite easy to implement.
In summary, think of your world as a grid where each cell corresponds to one bucket in a hash table. Since your entities move often, on each new frame you could clear and rebuild the whole table, putting each entity into the bucket for its position. Then, for any given entity, you only need to check the nearby cells and collect their entity lists.
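A minimal sketch of that idea, assuming entities are kept in a dict of name -> (x, y, z). The names CELL_SIZE, cell_of, build_grid and query_radius are made up for illustration, and the cell size is an arbitrary assumption you would tune to your typical query radius. The final distance check can be dropped if "somewhere in the nearby cells" is good enough.

```python
import math
from collections import defaultdict
from itertools import product

CELL_SIZE = 10.0  # assumed edge length of one grid cell, in world units


def cell_of(pos, cell_size=CELL_SIZE):
    """Map an (x, y, z) position to the integer coordinates of its cell."""
    return tuple(math.floor(c / cell_size) for c in pos)


def build_grid(entities, cell_size=CELL_SIZE):
    """Rebuild the whole table: cell coordinates -> list of entity names."""
    grid = defaultdict(list)
    for name, pos in entities.items():
        grid[cell_of(pos, cell_size)].append(name)
    return grid


def query_radius(grid, entities, center, radius, cell_size=CELL_SIZE):
    """Return names of entities within `radius` of `center`."""
    reach = math.ceil(radius / cell_size)  # how many cells to look at each way
    cx, cy, cz = cell_of(center, cell_size)
    found = []
    for dx, dy, dz in product(range(-reach, reach + 1), repeat=3):
        for name in grid.get((cx + dx, cy + dy, cz + dz), ()):
            # Exact check on the few candidates that share nearby cells.
            if math.dist(entities[name], center) <= radius:
                found.append(name)
    return found
```

Calling build_grid once per frame and query_radius per lookup matches the "clear and rebuild" approach described above; updating only the entities that changed cell would be a possible refinement.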

You can use a kd-tree (link has photo and code and examples) or an octree (this link is a C++ class template which you can use). Real-world usage can be seen in this open-source game engine.
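For a rough idea of what a kd-tree radius query looks like, here is a small pure-Python sketch. It is not the code from the links above; build_kdtree and within_radius are hypothetical names, and the tree is a plain nested tuple. Since it is built once from a static list, frequently moving entities would mean rebuilding it regularly.

```python
import math


def build_kdtree(points, depth=0):
    """Build a tree of (point, left, right) tuples; None means empty."""
    if not points:
        return None
    axis = depth % 3  # cycle through x, y, z
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (
        points[mid],
        build_kdtree(points[:mid], depth + 1),
        build_kdtree(points[mid + 1:], depth + 1),
    )


def within_radius(node, center, radius, depth=0, out=None):
    """Collect all points within `radius` (Euclidean) of `center`."""
    if out is None:
        out = []
    if node is None:
        return out
    point, left, right = node
    axis = depth % 3
    if math.dist(point, center) <= radius:
        out.append(point)
    diff = center[axis] - point[axis]
    # Only descend into a side if the query ball can reach across the split.
    if diff <= radius:
        within_radius(left, center, radius, depth + 1, out)
    if diff >= -radius:
        within_radius(right, center, radius, depth + 1, out)
    return out
```

If a dependency is acceptable, an existing library implementation is likely to be faster than this sketch.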

Related

store large data python

I am new to Python. Recently, I started a project that processes a huge amount of health data stored in an XML file.
Here is an example:
In my data there are about 100 of them, and each has a different id, origin, type and text. I want to store all of them so that I can train on this dataset. My first idea was to use a 2D array (one stores id and origin, the other stores text). However, I found there are too many features, and I want to know which features belong to each document.
Could anyone recommend the best way to do this?

For scalability, simplicity and maintainability, you should normalise the data, design a database schema and move the data into a database (SQLite, Postgres, MySQL, whatever).
This moves the complicated data logic out of Python, which is typical practice in Model-view-controller designs.
Creating a Python dictionary and traversing it is quick and dirty. It will become a huge time sink very soon if you want to make practical sense of the data.
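As a rough sketch of what such a schema could look like with the sqlite3 module from the standard library: one table for documents and one for the features attached to each document. The table names, column names and helper functions here are assumptions based on the fields mentioned in the question (id, origin, type, text), not a prescribed design.

```python
import sqlite3


def make_db(path=":memory:"):
    """Create the two (hypothetical) tables and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE document (
            id     TEXT PRIMARY KEY,
            origin TEXT,
            type   TEXT,
            text   TEXT
        );
        CREATE TABLE feature (
            document_id TEXT REFERENCES document(id),
            name        TEXT,
            value       TEXT
        );
        """
    )
    return conn


def add_document(conn, doc_id, origin, doc_type, text, features):
    """Insert one document plus its features (a dict of name -> value)."""
    conn.execute(
        "INSERT INTO document (id, origin, type, text) VALUES (?, ?, ?, ?)",
        (doc_id, origin, doc_type, text),
    )
    conn.executemany(
        "INSERT INTO feature (document_id, name, value) VALUES (?, ?, ?)",
        [(doc_id, name, value) for name, value in features.items()],
    )
    conn.commit()


def features_of(conn, doc_id):
    """Answer 'which features belong to this document?' as a dict."""
    rows = conn.execute(
        "SELECT name, value FROM feature WHERE document_id = ?", (doc_id,)
    )
    return dict(rows.fetchall())
```

With this layout, "which features belong to each document" becomes a single query instead of bookkeeping across parallel arrays.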

Merging millions of polylines into single unique polyline

I am trying to figure out an efficient way to calculate the uniqueness of millions of paths. i.e. calculate coverage.
I have millions of [(lon,lat),...] points, that represent paths taken around a given city.
I am trying to figure out an efficient way to calculate unique mileage. Points from one side of the street should cancel out those on the other side, paths heading in different directions should cancel each other out, and so on, leaving only one path to represent that a given area has been walked.
Right now I can loop through the dataset, adding points to a new list while comparing each new point's distance to all points already in that list; if it is within a certain radius, I consider it non-unique and don't include it.
This is slow.
This has to be a solved problem, but alas I cannot find anything helpful online as of yet. Does anyone have any recommendations, formulas, or general advice on merging [(lon,lat),...] paths to represent unique coverage in miles?

Since your intermediate goal seems to be reducing the number of points in your polylines and they are conveniently situated in a city, you could try map matching them to known streets and paths to achieve that. This GIS SE question might be helpful in that regard.

How would I sort a set of lat/lons by distance from a single lat/lon?

A user signs up for my site and enters in their zip code. I want to query for other users, and sort by distance.
I have a database full of zip codes with lat/lon points for each zip code.
zip_code (char)
lat (float)
lon (float)
I have a method which will calculate the distance between two sets of lat/lons, but to run this on every other zip code in my db is expensive. I'd need to run this on every zip code combination. I suppose I can do it once and store it somewhere, but where would I store it? Seems strange to have a table for every zip code which would contain the distance to every other zip code. Is there a clean way to do this?

Doing it once and storing it somewhere sounds good to me. Here are some ideas that might give good performance with some consideration to storage space without sacrificing accuracy:
There are something like 43,191 zip codes, so the full matrix of pairs would have 1,865,462,481 entries. But the distances are of course symmetrical and the self-to-self ones are useless, which immediately cuts it down to 932,709,645 entries. We might also cut the space by realizing that a bunch of zip codes are either the same as each other, or one contains the other (e.g. 10178 seems to be inside 10016, and they're both geographically small). Many zip codes will have no users at all, so we might avoid populating those until they're needed (i.e. lazy load the cache). And finally, you can probably throw away large-distance results, where large is defined as a distance greater than is useful for your users.
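To make the arithmetic and the lazy-loading idea concrete, here is a small sketch. The haversine formula is a stand-in for whatever distance method you already have, and ZipDistanceCache is a hypothetical helper that stores each unordered pair only once and only when it is first asked for. In a real application the dict would probably be a database table instead.

```python
import math

ZIP_COUNT = 43191                                # figure quoted above
FULL_MATRIX = ZIP_COUNT * ZIP_COUNT              # every ordered pair
UNIQUE_PAIRS = ZIP_COUNT * (ZIP_COUNT - 1) // 2  # symmetric, no self-pairs

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def haversine_miles(a, b):
    """Great-circle distance between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


class ZipDistanceCache:
    """Lazily filled cache keyed by the unordered pair of zip codes."""

    def __init__(self, locations):
        self.locations = locations  # zip code -> (lat, lon)
        self._cache = {}

    def distance(self, zip_a, zip_b):
        if zip_a == zip_b:
            return 0.0  # self-to-self is never stored
        # Sort the pair so (a, b) and (b, a) share one entry.
        key = (zip_a, zip_b) if zip_a < zip_b else (zip_b, zip_a)
        if key not in self._cache:
            self._cache[key] = haversine_miles(
                self.locations[key[0]], self.locations[key[1]]
            )
        return self._cache[key]
```

Throwing away large-distance results could be added by refusing to store any value above a cutoff you choose.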
For a more algorithmic view, see this previous question: Calculate distance between zip codes and users
Bonus tip: don't forget about non-US users. Poor non-US users.

Here's a solution with a fair amount of overhead, but which will pay off as your dataset size, user base, and/or number of transactions grow:
If you don't already have one, use a database that supports spatial types and spatial indexing. I recommend the PostGIS extension for PostgreSQL, but most of these steps apply to other spatially-enabled databases:
Store your zip code location as a Point geometry type instead of two columns for lat and lon.
Create a spatial index against the Point geometry column. Every time you add a new zip code, its location will automatically be added to the spatial index.
Assuming you don't want to show "nearest" neighbors that are thousands of miles away, use a Within function (ST_DWithin in PostGIS) to filter out those zip codes that are too far away. This will significantly reduce the search space for close neighbors.
Finally use a Distance function (ST_Distance in PostGIS) to calculate the distance between your zip code of interest and its closer neighbors, and use the DB to return results sorted by distance.
By using a database with spatial index and a filtering function that uses that index, you can significantly speed up your search. And when the time comes to do more spatial analysis or show maps, you'll already have a framework in place to support that new functionality.

Python interval based sparse container

I am trying to create an interface between structured data and NLTK. NLP libraries generally work with bags of words, hence I need to turn my structured data into bags of words.
I need to associate the offset of a word with its meta-data. Therefore my best bet is to have some sort of container that holds ranges as keys (allowing nested ranges) and can retrieve all the meta-data (multiple entries if the word offset is part of a nested range).
What code can I pick up that would do this efficiently (i.e., with a sparse representation of the data)? It needs to be efficient because my global corpus will be at least a few hundred megabytes.
Note :
I am serialising structured forum posts, which will include posts with sections of quotes within them. I want to know which topic a word belonged to, and whether it's a quote or user text. There will probably be additional metadata as my work progresses. Note that a word belonging to a quote is what I meant by nested meta-data: the word is part of a quote, which belongs to a post made by a user.
I know that one can tag words in NLTK, but I haven't looked into it. If it's possible to do what I want that way, please comment. But I am still looking for the original approach.
There is probably something in numpy that can solve my problem; looking at that now.
edit
The input data is far too complex to rip out and post. I have found what I was looking for, though: http://packages.python.org/PyICL/. I needed to talk about intervals and not ranges :D I have used boost extensively, however making that a dependency makes me a bit uneasy (sadly, I am having compiler errors with PyICL :( ).
The question now is: does anyone know an interval container library or data structure that can be used to index nested intervals in a sparse fashion? Or, put differently, one that provides similar semantics to boost.icl?

If you don't want to use PyICL or boost.icl, then instead of relying on a specialized library you could just use sqlite3 to do the job. If you use an in-memory version it will still be a few orders of magnitude slower than boost.icl (from experience coding other data structures vs sqlite3), but it should be more effective than using a C++ std::vector style approach on top of Python containers.
You can use two integers and have a date_type_low < offset < date_type_high predicate in your WHERE clause. Depending on your table structure, this will return nested/overlapping ranges.
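Something along these lines, as a sketch only. The table and column names (span, low, high, meta) are invented for the example, and it treats intervals as half-open [low, high) rather than the strict inequalities above, which is just one possible convention.

```python
import sqlite3


def make_interval_db():
    """In-memory table of intervals with attached metadata."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE span (low INTEGER, high INTEGER, meta TEXT)")
    conn.execute("CREATE INDEX span_low_high ON span (low, high)")
    return conn


def add_span(conn, low, high, meta):
    """Register metadata for word offsets in [low, high)."""
    conn.execute(
        "INSERT INTO span (low, high, meta) VALUES (?, ?, ?)", (low, high, meta)
    )


def metadata_at(conn, offset):
    """Return metadata of every interval containing `offset`, outermost first."""
    rows = conn.execute(
        "SELECT meta FROM span WHERE low <= ? AND ? < high "
        "ORDER BY low, high DESC",
        (offset, offset),
    )
    return [row[0] for row in rows]
```

For a post spanning offsets 0 to 100 with a quote at 20 to 40, a lookup inside the quote returns both entries, which is the nested behaviour asked about in the question.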

How to convert from lat lon to zipcode or state to generate choropleth map

I have a large collection (and growing) of geospatial data (lat, lon) points (stored in mongodb, if that helps).
I'd like to generate a choropleth map (http://vis.stanford.edu/protovis/ex/choropleth.html), which requires knowing the state that contains each point. Is there a database or algorithm that can do this without requiring calls to external APIs (i.e. I'm aware of things like geopy and the Google Maps API)?

Actually, the web app you linked to contains the data you need -
If you look at http://vis.stanford.edu/protovis/ex/us_lowres.js for each state, borders[] contains a [lat,long] polyline which outlines the state. Load this data and check for point-in-polygon - http://en.wikipedia.org/wiki/Point_in_polygon
Per Reverse Geocoding Without Web Access you can speed it up a lot by pre-calculating a bounding box on each state and only testing point-in-polygon if point-in-bounding-box.
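A minimal sketch of that combination, using the standard ray-casting test. The function names and the layout of `regions` (name -> list of vertices) are assumptions; loading the actual borders[] data from the JS file is left out. Each polygon is a list of 2-tuples in whatever consistent coordinate order your data uses, and holes or multi-part states are not handled.

```python
def bounding_box(polygon):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) vertices."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def point_in_polygon(point, polygon):
    """Ray casting: count how many edges a ray going in +x crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def prepare(regions):
    """Pre-calculate one bounding box per region (done once, up front)."""
    return [(name, bounding_box(poly), poly) for name, poly in regions.items()]


def find_region(point, prepared):
    """Return the first region containing `point`, or None."""
    x, y = point
    for name, (min_x, min_y, max_x, max_y), poly in prepared:
        # Cheap rejection first; the full test runs only inside the box.
        if min_x <= x <= max_x and min_y <= y <= max_y:
            if point_in_polygon(point, poly):
                return name
    return None
```

Points lying exactly on a border may land on either side with this test, which is usually acceptable for colouring a choropleth.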

Here's how to do it in FORTRAN. Remember FORTRAN? Me neither. Anyway, it looks pretty simple, as every state has its own range.
EDIT It's been pointed out to me that your starting point is LAT-LONG, not the zipcode.
The algorithm for converting a lat-long to a political division is called "a map". Seriously, that's all an ordinary map is: a mapping of every point in some range to the division it belongs to. A detailed digital map of all 48 contiguous states would be a big database, and then you would need some (fairly simple) code to determine for each state (described as a series of line segments outlining the border) whether a given point was inside it or out.

You can try using the Geonames database. It has long/lat as well as city, postal and other location type data. It is free as well.
But if you need to host it locally or import it into your own database, the USGS and NGA provide a comprehensive list of cities with lat/lon. It's updated regularly, free, and reliable.
http://geonames.usgs.gov
http://earth-info.nga.mil/gns/html/index.html

Not sure about the quality of the data, but give this a shot: http://www.boutell.com/zipcodes/

If you don't mind a very crude solution, you could adapt the click-map here.
