Merging millions of polylines into a single unique polyline - python

I am trying to figure out an efficient way to calculate the uniqueness of millions of paths, i.e. to calculate coverage.
I have millions of [(lon, lat), ...] points that represent paths taken around a given city.
I am trying to figure out an efficient way to calculate unique mileage, so that points from one side of the street cancel out points from the other, paths heading in different directions cancel each other out, and so on, leaving only one path representing that a given area has been walked.
Currently I loop through the dataset, adding points to a new list while comparing each new point's distance to every point already in the new list; if it's within a certain radius I consider it non-unique and don't include it.
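In code, the brute-force version looks roughly like this (the 10 m radius is just an illustration, and the haversine helper is only there to make the sketch self-contained):

from math import radians, sin, cos, asin, sqrt

def haversine_m(p1, p2):
    # Great-circle distance in meters between two (lon, lat) points.
    lon1, lat1, lon2, lat2 = map(radians, (*p1, *p2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 6371000 * 2 * asin(sqrt(a))

def dedupe(points, radius_m=10):
    # Keep a point only if no already-kept point is within radius_m (O(n^2)).
    kept = []
    for p in points:
        if all(haversine_m(p, q) >= radius_m for q in kept):
            kept.append(p)
    return kept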
This is slow.
This has to be a solved problem, but alas I cannot find anything helpful online as of yet. Does anyone have any recommendations, formulas, or general advice on merging [(lon, lat), ...] paths to represent unique coverage in miles?

Since your intermediate goal seems to be reducing the number of points in your polylines, and they are conveniently situated in a city, you could try map matching them to known streets and paths to achieve that. This GIS SE question might be helpful in that regard.
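As a very rough sketch of that idea (this is naive nearest-edge snapping rather than proper map matching, and the place name is only a placeholder), osmnx can fetch the street network and snap each point to its nearest edge; unique coverage then becomes the set of distinct edges touched:

import osmnx as ox

# Walking network for the area of interest -- the place name is a placeholder.
G = ox.graph_from_place("Berlin, Germany", network_type="walk")

def unique_miles(paths):
    # Approximate unique coverage by the set of distinct OSM edges the paths touch.
    covered = set()
    for path in paths:                      # path is [(lon, lat), ...]
        xs = [lon for lon, lat in path]
        ys = [lat for lon, lat in path]
        for u, v, k in ox.distance.nearest_edges(G, xs, ys):
            covered.add((u, v, k))
    meters = sum(G.edges[u, v, k]["length"] for u, v, k in covered)
    return meters / 1609.34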

Related

Steps for representation of spatial boundaries, dividing into grids of equal size, assigning entities to grids

I have some data related to a phenomenon between two geolocated (lat/lon) entities for a certain country. My high-level algorithm is like this:
Get a boundary of the country (shp/geojson). I have this.
Divide the country's area into a 20 km x 20 km grid (haversine etc.). Not sure exactly how to do this.
Store this information in some data structure, not just a plot for visualisation.
Once I have the grid data structure, assign my geolocated entities to pairs of grid cells. Correlated entities are pairwise (X1, X2). Depending on where the two entities are located, I want to store this information in a data structure, like (X1, X2): grid20, grid55. Then I have a mapping of all my geolocated entities to the grid cells they fall into.
Once I have this data structure in place, I have to randomly assign (permute) each entity to a random grid cell. This is part of a permutation test to see whether entities have a certain geographical function or are where they are just by chance.
I understand the high level of what I need to do but not how to do it practically. I saw some posts about dividing the earth into grids, but they seemed more aimed at plotting than at use as a data structure. Any help on how I can do this is appreciated.
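For concreteness, the grid-assignment step described above could look roughly like this under a simple equirectangular approximation (the cell size, origin, and entity format here are assumptions, not a full answer to the pipeline):

import math

CELL_KM = 20.0

def cell_of(lat, lon, origin_lat, origin_lon):
    # Integer (row, col) grid cell for a point, relative to a fixed origin,
    # using an equirectangular approximation (fine at country scale).
    km_per_deg_lat = 111.32
    km_per_deg_lon = 111.32 * math.cos(math.radians(origin_lat))
    row = int((lat - origin_lat) * km_per_deg_lat // CELL_KM)
    col = int((lon - origin_lon) * km_per_deg_lon // CELL_KM)
    return row, col

def assign_pairs(pairs, origin):
    # pairs: [((name1, lat1, lon1), (name2, lat2, lon2)), ...]  (assumed format)
    # returns {(name1, name2): (cell_a, cell_b)}
    out = {}
    for (n1, lat1, lon1), (n2, lat2, lon2) in pairs:
        out[(n1, n2)] = (cell_of(lat1, lon1, *origin),
                         cell_of(lat2, lon2, *origin))
    return out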

Looking for repeated patterns in time series data

I have spent the best part of the last few days searching forums and reading papers trying to solve the following question. I have thousands of time series arrays, each of varying length, containing a single column vector. This column vector contains the time between clicks for dolphins using echolocation.
I have managed to cluster these into similar groups using DTW and want to check which trains have a high degree of self-similarity, i.e. repeated patterns. I only want to know the similarity of each train with itself and don't care to compare them with other trains, as I have already applied DTW for that. I'm hoping some of these clusters will contain trains with a high proportion of repeated patterns.
I have already applied the Ljung–Box test to each series to check for autocorrelation, but I think I should maybe be using something involving the FFT and the power spectrum. I don't have much experience in this but have tried to do so using the Python package waipy. Ultimately, I just want to know whether there is some kind of repeated pattern in the data, ideally tested with a p-value. The image I have attached shows an example train across the top; the maximum length of my trains is 550.
[Image: example output from waipy]
I know this is quite a complex question, but any help would be greatly appreciated, even if it is just a link to a helpful Python library.
Thanks,
Dex
For anyone in a similar position: I decided to go with motifs, as they can find repeated patterns in a time series using Euclidean distance. There is a really good Python package called Stumpy which makes this very easy!
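For example, a minimal sketch with Stumpy's matrix profile (the window length m has to be chosen to match the expected pattern length):

import numpy as np
import stumpy

def top_motif(train, m):
    # Find the best-matching repeated subsequence (motif) of length m
    # within a single click train using the matrix profile.
    train = np.asarray(train, dtype=float)
    mp = stumpy.stump(train, m)          # column 0: distances, column 1: indices
    motif_idx = int(np.argmin(mp[:, 0]))
    neighbor_idx = int(mp[motif_idx, 1])
    return motif_idx, neighbor_idx, float(mp[motif_idx, 0])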
Thanks,
Dex

How to find if a GPS coordinate lies on a specific road

Is there a way to get all the roads of an area, and then find out whether a GPS coordinate is on a specific road? Something like:
all_driveable_road_in_NY = [id1, id2, id3, ..., idn]  # where id_i represents road number i
gps_coordj = [lat1, lon1]
for p in range(len(all_driveable_road_in_NY)):
    if gps_coordj on road all_driveable_road_in_NY[p]:  # pseudocode for the check I'm after
        print("gps on road: " + all_driveable_road_in_NY[p])
How could one do that in Python using OpenStreetMap?
Any hints will be welcome.
Thanks
A linear search by road name is not appropriate for this problem. For example, we don't need to search every road in the Bronx to find a location in Manhattan. Instead, swap the search criteria.
Suppose we have a large database of (road, lat/long) rows, with a point on each road every x meters, so we have many data points for each road. Rather than searching all existing roads, we search for the location closest to the query point. Any database can index these points for faster searches, probably using some sort of tree under the hood, giving O(log n) lookups rather than O(n).
It's the same technique you would use when looking up a word in a dictionary (the book kind; okay, maybe you would just use Google, but hear me out). To find 'hello', you would first open the book halfway and see you are at 'R'. Now you know you only need to search the front half of the book, not the entire thing. In the same way, we order the lat/long points to help us search faster.
Unless you are running this database yourself, the lookup will need to be supported by your API. This sounds like a common use case, so it's likely that it is.
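A rough sketch of that nearest-point lookup, using an in-memory k-d tree over a few made-up sample points and a plain lat/long planar approximation (a real setup would use the database's own spatial index and proper geodesic distances):

import numpy as np
from scipy.spatial import cKDTree

# Hypothetical pre-computed sample: (road_id, lat, lon) every x meters along each road.
road_points = [("I-95", 40.8448, -73.8648),
               ("Broadway", 40.7580, -73.9855),
               ("Broadway", 40.7614, -73.9830)]

coords = np.array([(lat, lon) for _, lat, lon in road_points])
tree = cKDTree(coords)                      # spatial index over the sampled points

def nearest_road(lat, lon, max_deg=0.001):
    # Road whose sampled point is closest to (lat, lon), or None if nothing is
    # within roughly max_deg degrees.
    dist, idx = tree.query((lat, lon))
    return road_points[idx][0] if dist <= max_deg else None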

How would I sort a set of lat/lons by distance from a single lat/lon?

A user signs up for my site and enters in their zip code. I want to query for other users, and sort by distance.
I have a database full of zip codes with lat/lon points for each zip code.
zip_code (char)
lat (float)
lon (float)
I have a method which will calculate the distance between two sets of lat/lons, but running this against every other zip code in my db is expensive. I'd need to run it on every zip code combination. I suppose I could do it once and store it somewhere, but where would I store it? It seems strange to have a table for every zip code containing the distance to every other zip code. Is there a clean way to do this?
Doing it once and storing it somewhere sounds good to me. Here are some ideas that might give good performance with some consideration to storage space without sacrificing accuracy:
There are something like 43,191 zip codes, so the full pairwise table would have 1,865,462,481 entries. But the distances are of course symmetrical and the self-to-self ones are useless, which immediately cuts it down to 932,709,645 entries. We might also cut the space by noticing that a bunch of zip codes are either the same as each other, or one contains the other (e.g. 10178 seems to be inside 10016, and they're both geographically small). Many zip codes will have no users at all, so we might avoid populating those until they're needed (i.e. lazy-load the cache). And finally, you can probably throw away large-distance results, where "large" is defined as a distance greater than is useful for your users.
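The pair counts above are just the combinatorics:

n = 43_191                 # approximate number of US zip codes
full = n * n               # 1,865,462,481 ordered pairs
unique = n * (n - 1) // 2  # 932,709,645 unordered pairs, excluding self-to-self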
For a more algorithmic view, see this previous question: Calculate distance between zip codes and users
Bonus tip: don't forget about non-US users. Poor non-US users.
Here's a solution with a fair amount of overhead, but which will pay off as your dataset size, user base, and/or number of transactions grow:
If you don't already have one, use a database that supports spatial types and spatial indexing. I recommend the PostGIS extension for Postgres, but most of these steps apply to other spatially-enabled databases:
Store your zip code location as a Point geometry type instead of two separate columns for lat and long.
Create a spatial index against the Point geometry column. Every time you add a new zip code, its location will automatically be added to the spatial index.
Assuming you don't want to show "nearest" neighbors that are thousands of miles away, use a Within function (ST_DWithin in PostGIS) to filter out those zip codes that are too far away. This will significantly reduce the search space for close neighbors.
Finally, use a Distance function (ST_Distance in PostGIS) to calculate the distance between your zip code of interest and its closer neighbors, and have the DB return results sorted by distance.
By using a database with spatial index and a filtering function that uses that index, you can significantly speed up your search. And when the time comes to do more spatial analysis or show maps, you'll already have a framework in place to support that new functionality.
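As a sketch of what the final query could look like from Python (the table and column names here are assumptions):

import psycopg2

# Hypothetical table: zip_codes(zip_code char, geom geometry(Point, 4326))
conn = psycopg2.connect("dbname=app user=app")

def neighbors_within(zip_code, meters=80_000):
    # Zip codes within `meters` of the given one, nearest first.
    sql = """
        SELECT z.zip_code,
               ST_Distance(z.geom::geography, ref.geom::geography) AS meters
        FROM zip_codes AS z,
             (SELECT geom FROM zip_codes WHERE zip_code = %s) AS ref
        WHERE z.zip_code <> %s
          AND ST_DWithin(z.geom::geography, ref.geom::geography, %s)
        ORDER BY meters;
    """
    with conn.cursor() as cur:
        cur.execute(sql, (zip_code, zip_code, meters))
        return cur.fetchall()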

Best data structure for finding nearby xyz points by distance?

I'm working in Python, but I suppose that doesn't affect the question itself.
I'm working on a game, and I need to store entities, each of which has an [x, y, z] position in the world. I need to be able to run an "all entities within X euclidean distance of point Y" query.
These entities will be moving fairly often.
What would be the most efficient way to store the entities to make this as fast as possible?
As an alternative to what has been suggested already, if you don't need an exact distance you could also use spatial hashing, which is quite easy to implement.
In summary, you think of your world as a grid where each cell corresponds to one bucket in the hash table. Since your entities move often, on each new frame you can clear and rebuild the whole table, putting each entity into its bucket based on its position. Then, for any given entity, you just check the nearby cells and collect their entity lists.
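A minimal sketch of that per-frame rebuild and nearby lookup (the cell size and the entity's pos attribute are assumptions; the exact distance check at the end is optional):

from collections import defaultdict
from itertools import product

CELL = 10.0  # cell size; roughly the query radius tends to work well

def build_grid(entities):
    # Rebuild the hash grid for this frame: cell index -> list of entities.
    grid = defaultdict(list)
    for e in entities:                         # e.pos is assumed to be (x, y, z)
        cell = tuple(int(c // CELL) for c in e.pos)
        grid[cell].append(e)
    return grid

def nearby(grid, point, radius):
    # Entities within `radius` of `point`, checking only neighbouring cells.
    span = int(radius // CELL) + 1
    cx, cy, cz = (int(c // CELL) for c in point)
    out = []
    for dx, dy, dz in product(range(-span, span + 1), repeat=3):
        for e in grid.get((cx + dx, cy + dy, cz + dz), []):
            if sum((a - b) ** 2 for a, b in zip(e.pos, point)) <= radius ** 2:
                out.append(e)
    return out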
You can use a k-d tree (the link has a photo, code, and examples) or an octree (the link is a C++ class template which you can use). Real-world usage can be seen in this open-source game engine.
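For the k-d tree route, scipy's cKDTree handles the radius query directly; a minimal sketch (rebuilding the tree whenever entities move, which may or may not be acceptable per frame):

import numpy as np
from scipy.spatial import cKDTree

positions = np.random.rand(1000, 3) * 100.0   # example entity positions

tree = cKDTree(positions)                      # rebuild when entities move
idx = tree.query_ball_point([50.0, 50.0, 50.0], r=10.0)
print(positions[idx])                          # all entities within 10 units of the point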
