Google Distance Matrix API very slow on Python - python

I was using the Google Maps Distance Matrix API in Python to calculate bicycling distances between two points, given latitude and longitude. I was using a loop to calculate almost 300,000 rows of data for a student project (I am studying Data Science with Python). I added a debug line to output the row number and distance every 10,000 rows, but after the code hummed away for a while with no results, I stopped the kernel and changed it to every 1,000 rows. With that, it took about 5 minutes to finally reach row 1,000, and after over an hour it was only on row 70,000. Unbelievable. I stopped execution and later that day got an email from Google saying I had used up my free trial. So not only did it run incredibly slowly, I can't even use it at all anymore for a student project without incurring enormous fees.
So I rewrote the code to use geometry and just calculate "as the crow flies" distance. Not really what I want, but short of any alternatives, that's my only option.
Does anyone know of another (open-source, free) way to calculate distance to get what I want, or how to use the Google Distance Matrix API more efficiently?
thanks,
Here is some more information, as suggested. I am trying to calculate distances between "stations", and am given latitudes and longitudes for about 300K pairs. I was going to set up a function and then apply that function to the dataframe (bear with me, I'm still new to Python and dataframes), but for now I was using a loop to go through all the pairs. Here is my code:
i = 0
while i < len(trip):
    from_coords = str(result.loc[i, 'from_lat']) + " " + str(result.loc[i, 'from_long'])
    to_coords = str(result.loc[i, 'to_lat']) + " " + str(result.loc[i, 'to_long'])
    # now to get distances!!!
    distance = gmaps.distance_matrix([from_coords],   # origin lat & long, formatted for gmaps
                                     [to_coords],     # destination lat & long, formatted for gmaps
                                     mode='bicycling')['rows'][0]['elements'][0]  # mode=bicycling to use streets for cycling
    result.loc[i, 'distance'] = distance['distance']['value']  # store this row's distance
    # added this bit to see how quickly/slowly the code is running
    # ... and btw it's running very slowly. had the debug line at 10000 and changed it to 1000
    # ... and i am running on an i9-9900K with 48GB ram
    # ... why so slow?
    if i % 1000 == 0:
        print(distance['distance']['value'])
    i += 1
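One note on using the API more efficiently (independent of the billing issue): the googlemaps client used above accepts lists of origins and destinations, so several pairs can be batched into one request instead of one HTTP call per row. The response is the full origins x destinations matrix, so for strictly pairwise data only the diagonal elements are used, and the per-request limits (on the order of 25 origins/destinations and 100 elements, last I checked; verify against the current documentation) still apply. A minimal sketch, assuming a valid API key:
import googlemaps

gmaps = googlemaps.Client(key="YOUR_API_KEY")  # placeholder key

# Batch a handful of "lat,long" strings into a single request.
origins = ["45.5017,-73.5673", "45.5088,-73.5540"]
destinations = ["45.5122,-73.5544", "45.5200,-73.5800"]

resp = gmaps.distance_matrix(origins, destinations, mode="bicycling")

# resp['rows'][i]['elements'][j] is the result for origins[i] -> destinations[j];
# for pairwise data only the i == j entries are needed.
for i, row in enumerate(resp["rows"]):
    element = row["elements"][i]
    if element.get("status") == "OK":
        print(element["distance"]["value"])  # metres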

You could approximate the distance in km with the haversine distance.
Here I have my coordinates as lat/long pairs in random_distances, a NumPy array of shape (300000, 2):
import numpy as np
from sklearn.neighbors import DistanceMetric
dist = DistanceMetric.get_metric('haversine')
random_distances = np.random.random((300000, 2))
Then we can approximate the distances with:
distances = np.zeros(random_distances.shape[0] - 2)
for idx in range(random_distances.shape[0] - 2):
    distances[idx] = dist.pairwise(np.radians(random_distances[idx:idx+2]),
                                   np.radians(random_distances[idx:idx+2]))[0][1]
distances *= 6371000 / 1000  # to get output as KM
distances now contains the distances.
It is "all right" in terms of speed, but it can be improved: we could get rid of the for loop, for instance (a vectorized sketch follows below), and each call returns a 2x2 distance matrix of which only one entry is used.
The haversine distance is a good approximation, but not exact in the way I imagine the API is:
From sklearn:
As the Earth is nearly spherical, the haversine formula provides a good approximation of the distance between two points of the Earth surface, with a less than 1% error on average.
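Following up on getting rid of the loop: the haversine formula itself vectorizes directly in NumPy, so all 300,000 pairs can be computed in a single call. A minimal sketch, assuming two arrays from_coords and to_coords of shape (N, 2) holding (lat, long) in degrees (these names are chosen here for illustration):
import numpy as np

def haversine_km(from_coords, to_coords):
    """Vectorized haversine distance in km between row-aligned (lat, long) arrays."""
    lat1, lon1 = np.radians(from_coords).T
    lat2, lon2 = np.radians(to_coords).T
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371 * np.arcsin(np.sqrt(a))  # 6371 km = mean Earth radius

# Example: 300,000 random coordinate pairs are handled in one call, no Python loop.
rng = np.random.default_rng(0)
from_coords = rng.uniform([-90, -180], [90, 180], size=(300_000, 2))
to_coords = rng.uniform([-90, -180], [90, 180], size=(300_000, 2))
distances_km = haversine_km(from_coords, to_coords)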

Related

Time-efficient way to find connected spheres paths in Python

I have written code to find connected-sphere paths using the NetworkX library in Python. To do so, I need to find the distances between the spheres before building the graph. This part of the code (the calculation section, i.e. the numba function that finds distances and connections) led to memory leaks when using arrays in numba's parallel scheme (I had this problem when using np.linalg or scipy.spatial.distance.cdist, too). So, I wrote a non-parallel numba version using lists. Now it is memory-friendly but takes much more time to calculate these distances (it consumes just ~10-20% of 16 GB of memory and ~30-40% of each core of my 4-core CPU). For example, when testing on ~12,000 data points, the calculation section and the NetworkX graph creation each took less than one second, but for ~550,000 data points the calculation section (the numba part) took around 25 minutes, while graph creation and getting the output list took 7 seconds.
import numpy as np
import numba as nb
import networkx as nx
radii = np.load('rad_dist_12000.npy')
poss = np.load('pos_dist_12000.npy')
@nb.njit("(Tuple([float64[:, ::1], float64[:, ::1]]))(float64[::1], float64[:, ::1])", parallel=True)
def distances_numba_parallel(radii, poss):
    radii_arr = np.zeros((radii.shape[0], radii.shape[0]), dtype=np.float64)
    poss_arr = np.zeros((poss.shape[0], poss.shape[0]), dtype=np.float64)
    for i in nb.prange(radii.shape[0] - 1):
        for j in range(i+1, radii.shape[0]):
            radii_arr[i, j] = radii[i] + radii[j]
            poss_arr[i, j] = ((poss[i, 0] - poss[j, 0]) ** 2 +
                              (poss[i, 1] - poss[j, 1]) ** 2 +
                              (poss[i, 2] - poss[j, 2]) ** 2) ** 0.5
    return radii_arr, poss_arr

@nb.njit("(List(UniTuple(int64, 2)))(float64[::1], float64[:, ::1])")
def distances_numba_non_parallel(radii, poss):
    connections = []
    for i in range(radii.shape[0] - 1):
        connections.append((i, i))
        for j in range(i+1, radii.shape[0]):
            radii_arr_ij = radii[i] + radii[j]
            poss_arr_ij = ((poss[i, 0] - poss[j, 0]) ** 2 +
                           (poss[i, 1] - poss[j, 1]) ** 2 +
                           (poss[i, 2] - poss[j, 2]) ** 2) ** 0.5
            if poss_arr_ij <= radii_arr_ij:
                connections.append((i, j))
    return connections

def connected_spheres_path(radii, poss):
    # in parallel mode
    # maximum_distances, distances = distances_numba_parallel(radii, poss)
    # connections = distances <= maximum_distances
    # connections[np.tril_indices_from(connections, -1)] = False
    # in non-parallel mode
    connections = distances_numba_non_parallel(radii, poss)
    G = nx.Graph(connections)
    return list(nx.connected_components(G))
My datasets will contain a maximum of 10 million spheres (the data are positions and radii), but mostly up to 1 million. As mentioned above, most of the consumed time goes into the calculation section. I have little experience with graphs and don't know if (and how) this can be handled much faster using all CPU cores or the RAM capacity (max 12 GB), or whether it can be calculated internally (I doubt it is necessary to calculate and find the connected spheres separately before using graphs) by other Python libraries such as graph-tool, igraph, and networkit, so that the whole process is done in C or C++ in an efficient way.
I would be grateful for any answer that can make my code faster for large data volumes (performance is the first priority; if large memory capacities are needed for large data volumes, mentioning (with some benchmarks) how much would be helpful).
Update:
Since just using trees will not be enough to improve the performance, I have written an advanced, optimized code that speeds up the calculation section by combining tree-based algorithms and numba jitting.
Now I am curious whether it can be calculated internally (the calculation section is an integral part and basic need for such graphing) by other Python libraries such as graph-tool, igraph, and networkit, so that the whole process is done in C or C++ in an efficient way.
Data
radii: 12000, 50000, 550000
poss: 12000, 50000, 550000
If you are computing the pairwise distance between all points, that's N^2 calculations, which will take a very long time for sufficiently many data points.
If you can place an upper bound on the distance you need to consider for any two points, then there are some nice data structures for finding pairs of neighbors in a set of points. If you already have scipy installed, then the most convenient structure to reach for is the KDTree (or the optimized version, cKDTree). (Read more here.)
The basic recipe is:
Load your point set into the KDTree.
Ask the KDTree for all pairs of points which are within some maximum distance from each other.
Calculate the actual distances between each of the returned points.
Compare those distances with the summed radii associated with the point pair. Drop the pairs whose distances are too large.
Finally, you need to determine the clusters of spheres. Your question mentions "paths", but in your example code you're only concerned with connected components. Of course you can use networkx or graph-tool for that, but maybe that's overkill.
If connected components are all you need, then you don't even need a proper graph data structure. You just need a way to find the groups of linked nodes, without maintaining the specific connections that linked them. Again, scipy has a nice tool: DisjointSet. (Read more here.)
Here is a complete example. The execution time depends on not only the number of points, but how "dense" they are. I tried some reasonable (I think) test data with 1M points, which took 24 seconds to process on my laptop.
Your example data (the largest of the sets provided above) takes longer: about 45 seconds. The KDTree finds 312M pairs of points to consider, of which fewer than 1M are actually valid connections.
import numpy as np
from scipy.spatial import cKDTree
from scipy.cluster.hierarchy import DisjointSet
## Example data (2D)
# N = 1000
# D = 2
# max_point = 1000
# min_radius = 10
# max_radius = 20
# points = np.random.randint(0, max_point, size=(N, D))
# radii = np.random.randint(min_radius, max_radius+1, size=N)
## Example data (3D)
# N = 1_000_000
# D = 3
# max_point = 3000
# min_radius = 10
# max_radius = 20
# points = np.random.randint(0, max_point, size=(N, D))
# radii = np.random.randint(min_radius, max_radius+1, size=N)
# Question data (3D)
points = np.load('b (556024).npy')
radii = np.load('a (556024).npy')
N = len(points)
# Load into a KD tree and extract all pairs which could possibly be linked
# (using the maximum radius as the upper bound of the search distance.)
kd = cKDTree(points)
pairs = kd.query_pairs(2 * radii.max(), output_type='ndarray')
def filter_pairs(pairs):
    # Calculate the distance between each pair of points
    vectors = points[pairs[:, 1]] - points[pairs[:, 0]]
    distances = np.linalg.norm(vectors, axis=1)
    # Drop the pairs whose summed radii aren't large enough
    # to span the distance between the points.
    thresholds = radii[pairs].sum(axis=1)
    return pairs[distances <= thresholds]
# We could do this in one big step
# ...but that might require lots of RAM.
# It's cheaper to do it in big chunks, in a loop.
fp = []
CHUNK = 1_000_000
for i in range(0, len(pairs), CHUNK):
    fp.append(filter_pairs(pairs[i:i+CHUNK]))
filtered_pairs = np.concatenate(fp)
# Load the pairs into a DisjointSet (a.k.a. UnionFind)
# data structure and extract the groups.
ds = DisjointSet(range(N))
for u, v in filtered_pairs:
    ds.merge(u, v)
connected_sets = list(ds.subsets())
print(f"Found {len(connected_sets)} sets of circles/spheres")
Just for fun, here's a visualization of the 2D test data:
from bokeh.plotting import output_notebook, figure, show
output_notebook()
p = figure()
p.circle(*points.T, radius=radii, fill_alpha=0.25)
p.segment(*points[filtered_pairs[:, 0]].T,
          *points[filtered_pairs[:, 1]].T,
          line_color='red')
show(p)
"…to find connected spheres using NetworkX library in Python. For doing so, I need to find distances between the spheres…"
Are you calculating the distance between every pair of spheres?
If all you need is to know the pairs of spheres that touch, or maybe that overlap, then you do NOT need to calculate the distance between every pair of spheres, only between the ones that are in reasonable proximity to each other. The standard way of handling this is to use an octree: https://en.wikipedia.org/wiki/Octree
This takes some time to set up, but once you have it you can quickly find all the spheres that are close, without looking at any that are too far away. A reasonable search distance would be twice the radius of the largest sphere. For large datasets the improvement in performance can be spectacular.
( For more details about this test https://github.com/JamesBremner/quadtree )
So, the complete algorithm to find the paths through the connected spheres can be broken into four conceptual steps:
1. Find the connected spheres, using an octree to optimize finding them. Instead of searching through every pair of spheres, loop over the spheres and search through the spheres in the same octree cell. For more details on how to make this work, you might want to look at the C++ code at https://github.com/JamesBremner/quadtree
2. Create the adjacency matrix of connected spheres. Conceptually this is a separate step; however, you will probably want to do it as you search for connected spheres in the first step. Construct an empty N-by-N adjacency matrix, where N is the number of spheres. Each time you find a pair of connected spheres, fill in the matrix.
3. Load the matrix into a graph library. It may be more efficient to simply add the link between two connected spheres directly into the library and let it build the adjacency matrix.
4. Use the graph library methods to find the paths (a minimal sketch of steps 3 and 4 follows below).
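A possible sketch of those last two steps in Python with NetworkX, assuming pairs is whatever list of (i, j) index tuples the octree search produced (the names and numbers here are purely illustrative):
import networkx as nx

# Hypothetical output of the neighbour search: index pairs of touching spheres.
pairs = [(0, 1), (1, 2), (4, 5)]
n_spheres = 6

G = nx.Graph()
G.add_nodes_from(range(n_spheres))   # keep isolated spheres in the graph too
G.add_edges_from(pairs)              # one edge per connected pair

# Connected groups of spheres...
components = list(nx.connected_components(G))
# ...and, if an actual path is needed, e.g. the chain of touching spheres
# between sphere 0 and sphere 2:
path = nx.shortest_path(G, source=0, target=2)
print(components, path)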

Finding a vector direction that deviates approximately equally by e.g. 5 degree from all other vector directions in a set

I have a set of at most approximately 10,000 vectors (random directions) in 3D space, and I'm looking for a new direction v_dev (a vector) which deviates from all other directions in the set by, e.g., a minimum of 5 degrees. My naive initial attempt is the following, which of course has bad runtime complexity but succeeds in some cases.
#!/usr/bin/env python
import numpy as np

numVecs = 10000
vecs = np.random.rand(numVecs, 3)
randVec = np.random.rand(1, 3)
notFound = True
foundVec = randVec
below = False
iter = 1

for vec in vecs:
    angle = np.rad2deg(np.arccos(np.vdot(vec, foundVec) / (np.linalg.norm(vec) * np.linalg.norm(foundVec))))
    print("angle: %f\n" % angle)

while notFound:
    below = False  # reset the flag for each new candidate vector
    for vec in vecs:
        angle = np.rad2deg(np.arccos(np.vdot(vec, randVec) / (np.linalg.norm(vec) * np.linalg.norm(randVec))))
        if angle < 5:
            below = True
    if below:
        randVec = np.random.rand(1, 3)
    else:
        notFound = False
    print("iteration no. %i" % iter)
    iter = iter + 1
Any hints on how to approach this problem (language agnostic) would be appreciated.
Consider the vectors in a spherical coordinate system (u, w, r), where r is always 1 because vector length doesn't matter here. Any vector can be expressed as (u, w), and the "deadzone" around each vector x, in which the target vector t cannot fall, can be expressed as dist((u_x, w_x, 1), (u_x-u_t, w_x-w_t, 1)) < 5°. However, calculating this distance can be a bit tricky, so converting back into Cartesian coordinates might be easier. These deadzones are circular on the spherical shell around the origin, and you're looking for a t that doesn't hit any of them.
For any fixed u_t you can iterate over all x and, using the distance function, find the start and end points of a range of w_t that is blocked because it falls into the deadzone of the vector x. The union of all 10000 ranges gives the blocked values of w_t for that given u_t. The same can be done for any fixed w_t, looking for a u_t.
Now comes the part that I'm not entirely sure of: given that you have two unknowns u_t and w_t and 20000 knowns, the system is just a tad overdetermined, and if there's a solution, it should be possible to find it.
My suggestion: set u_t to a fixed random value and check which w_t are possible. If you find a non-empty range, great, you're done. If all w_t are blocked, select a different u_t and try again. Selecting u_t at random will work eventually, yet a smarter iteration should be possible. Maybe u_t(n) = u_t(n-1)*phi % 360°, where phi is the golden ratio. That way the u_t never repeat and will cover the whole space with finer and finer granularity, instead of starting from one end and going slowly to the other.
Edit: You might also have more luck on the Mathematics Stack Exchange, since this isn't so much a code question as it is a mathematics question. For example, I'm not sure what I wrote is all that rigorous, so I don't even know whether it works.
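Whichever search strategy is used, the per-candidate check against all 10,000 vectors can at least be vectorized with NumPy instead of the Python loop in the question. A minimal rejection-sampling sketch, with the 5-degree threshold and positive-octant test data mirroring the question (the function name and try budget are arbitrary):
import numpy as np

rng = np.random.default_rng(0)

# Mirror the question's data: 10,000 random directions (all in the positive
# octant, as np.random.rand would produce), normalized to unit length.
vecs = rng.random((10_000, 3))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

def find_deviating_direction(vecs, min_angle_deg=5.0, max_tries=10_000):
    """Rejection sampling: draw random unit vectors until one is at least
    min_angle_deg away from every direction in vecs (or give up)."""
    cos_limit = np.cos(np.radians(min_angle_deg))
    for _ in range(max_tries):
        cand = rng.normal(size=3)
        cand /= np.linalg.norm(cand)
        # angle >= 5 deg for every vector  <=>  every dot product < cos(5 deg)
        if np.all(vecs @ cand < cos_limit):
            return cand
    return None  # likely no such direction exists

v_dev = find_deviating_direction(vecs)
print(v_dev)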
One way would be to build a 2D manifold (an area on the sphere) of forbidden regions. You start by adding a point; the forbidden area is then a circle on the sphere's surface.
While true, pick a point on the boundary of the area. If it is not close (within 5 degrees) to any other vector, you're done: return it. If it is, you have just found a new circle of forbidden area; add it to your manifold of forbidden area. You'll need to chop the circle into line or arc segments and maintain the boundary as a list.
If the set of vectors has no solution, your boundary will eventually collapse to nothing; then you return failure.
It's not the easiest approach, and you'll have to deal with the boundaries of a complex shape over a sphere. But it's guaranteed to work and should have reasonable complexity.

What is the Most Efficient Way to Compute the (euclidean) Distance of the Nearest Neighbor in a List of (x,y,z) points?

What is the most efficient way to compute the (Euclidean) distance to the nearest neighbor for each point in an array?
I have a list of 100k (X,Y,Z) points and I would like to compute a list of nearest neighbor distances. The index of the distance would correspond to the index of the point.
I've looked into PYOD and sklearn neighbors, but those seem to require "teaching". I think my problem is simpler than that. For each point: find nearest neighbor, compute distance.
Example data:
points = [
    (0,           0, 1322.1695),
    (0.006711111, 0, 1322.1696),
    (0.026844444, 0, 1322.1697),
    (0.0604,      0, 1322.1649),
    (0.107377778, 0, 1322.1651),
    (0.167777778, 0, 1322.1634),
    (0.2416,      0, 1322.1629),
    (0.328844444, 0, 1322.1631),
    (0.429511111, 0, 1322.1627),
    ...
]
compute k = 1 nearest neighbor distances
result format:
results = [nearest neighbor distance]
example results:
results = [
0.005939372
0.005939372
0.017815632
0.030118587
0.041569616
0.053475883
0.065324964
0.077200014
0.089077602
]
UPDATE:
I've implemented two of the suggested approaches:
1. Use scipy.spatial.cdist to compute the full distance matrix.
2. Use a nearest-X-neighbors-within-radius-R search to find the subset of neighbor distances for every point and return the smallest.
The result is that Method 2 is faster than Method 1 but took a lot more effort to implement (which makes sense).
It seems the limiting factor for Method 1 is the memory needed to run the full computation, especially when my data set approaches 10^5 (x, y, z) points. For my data set of 23k points, it takes ~100 seconds to capture the minimum distances.
For Method 2, the speed scales as n_radius^2, i.e. with the neighbor radius squared, which really means the algorithm scales roughly linearly with the number of included neighbors. Using a radius of ~5 (more than enough for this application), it took 5 seconds for the set of 23k points to produce a list of minimums in the same order as the point list itself. The difference matrix between the "exact solution" and Method 2 is essentially zero.
Thanks for everyone's help!
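Not one of the two methods above, but for reference: SciPy's cKDTree gives the same k = 1 nearest-neighbor distances in a few lines, and typically very quickly for data sets of this size. A minimal sketch, assuming the points are already in an (N, 3) NumPy array:
import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(100_000, 3)        # stand-in for the (x, y, z) data

tree = cKDTree(points)
# k=2 because the closest hit for each point is the point itself (distance 0);
# the second column holds the distance to the true nearest neighbor.
dists, idx = tree.query(points, k=2)
nearest_neighbor_distances = dists[:, 1]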
Similar to Caleb's answer, but you could stop the iterative loop if you get a distance greater than some previous minimum distance (sorry - no code).
I used to program video games. It would take too much CPU to calculate the actual distance between two points. What we did was divide the "screen" into larger Cartesian squares and avoid the actual distance calculation if the Delta-X or Delta-Y was "too far away"; that's just subtraction, so maybe something like that could qualify where the actual Euclidean distance calculation is needed (extend to n dimensions as needed)? A small NumPy sketch of this idea appears after the steps below.
EDIT - expanding "too far away" candidate pair selection comments.
For brevity, I'll assume a 2-D landscape.
Take the point of interest (X0,Y0) and "draw" an nxn square around that point, with (X0,Y0) at the origin.
Go through the initial list of points and form a list of candidate points that are within that square. While doing that, if the DeltaX [ABS(Xi-X0)] is outside of the square, there is no need to calculate the DeltaY.
If there are no candidate points, make the square larger and iterate.
If there is exactly one candidate point and it is within the radius of the circle inscribed in the square, that is your minimum.
If there are "too many" candidates, make the square smaller, but you only need to reexamine the candidate list from this iteration, not all the points.
If there are not "too many" candidates, then calculate the distance for that list. When doing so, first calculate DeltaX^2 + DeltaY^2 for the first candidate. If for subsequent candidates the DeltaX^2 is greater than the minimum so far, there is no need to calculate the DeltaY^2.
The minimum from that calculation is the true minimum if it lies within the radius of the circle inscribed in the square.
If not, you need to go back to a previous candidate list that includes points within the circle that has the radius of that minimum. For example, if you ended with one candidate in a 2x2 square that happened to be on the vertex X=1, Y=1, the distance/radius would be SQRT(2). So go back to a previous candidate list whose square is greater than or equal to 2xSQRT(2) on a side.
If warranted, generate a new candidate list that only includes points within the +/- SQRT(2) square.
Calculate the distances for those candidate points as described above, omitting any that exceed the minimum calculated so far.
No need to do the square root of the sum of the Delta^2 until you have only one candidate.
How to size the initial square, or if it should be a rectangle, and how to increase or decrease the size of the square/rectangle could be influenced by application knowledge of the data distribution.
I would consider recursive algorithms for some of this if the language you are using supports that.
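A minimal NumPy sketch of the pre-filter described above, for the 2D case: only candidates whose DeltaX and DeltaY fall inside the square are passed to the full distance calculation (the square half-side here is an arbitrary, tunable value, not something taken from the answer):
import numpy as np

rng = np.random.default_rng(0)
pts = rng.random((10_000, 2)) * 100      # stand-in 2D point set
x0, y0 = pts[0]                          # point of interest
half_side = 2.0                          # half the side of the search square (tunable)

# Cheap rejection: keep only points whose DeltaX and DeltaY are inside the square.
dx = np.abs(pts[:, 0] - x0)
dy = np.abs(pts[:, 1] - y0)
candidates = pts[(dx <= half_side) & (dy <= half_side)]

# Full Euclidean distance only for the surviving candidates (excluding the point itself).
d2 = np.sum((candidates - (x0, y0)) ** 2, axis=1)
d2 = d2[d2 > 0]
if d2.size:
    print("nearest distance:", np.sqrt(d2.min()))
else:
    print("no candidates in the square; enlarge it and retry")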
How about this?
from scipy.spatial import distance
A = (0.003467119 ,0.01422762 ,0.0101960126)
B = (0.007279433 ,0.01651597 ,0.0045558849)
C = (0.005392258 ,0.02149997 ,0.0177409387)
D = (0.017898802 ,0.02790659 ,0.0006487222)
E = (0.013564214 ,0.01835688 ,0.0008102952)
F = (0.013375397 ,0.02210725 ,0.0286032185)
points = [A, B, C, D, E, F]
results = []
for point in points:
    distances = [{'point': point, 'neighbor': p, 'd': distance.euclidean(point, p)} for p in points if p != point]
    results.append(min(distances, key=lambda k: k['d']))
results will be a list of objects, like this:
results = [
{'point':(x1, y1, z1), 'neighbor':(x2, y2, z2), 'd':"distance from point to neighbor"},
...]
Where point is the reference point and neighbor is point's closest neighbor.
The fastest option available to you may be scipy.spatial.distance.cdist, which finds the pairwise distances between all of the points in its input. While finding all of those distances may not be the fastest algorithm for finding nearest neighbors, cdist is implemented in C, so it will likely run faster than anything you write in pure Python.
import numpy as np
from scipy.spatial.distance import cdist

points = np.array(...)  # your (N, 3) array of points
distances = cdist(points, points)
# An element is not its own nearest neighbor
np.fill_diagonal(distances, np.inf)
# Find the index of each element's nearest neighbor
mins = distances.argmin(0)
# Extract the nearest neighbors from the data by row indexing
nearest_neighbors = points[mins, :]
# Put the arrays in the specified shape
results = np.stack((points, nearest_neighbors), 1)
You could theoretically make this run faster (mostly by combining all of the steps into one algorithm), but unless you're writing in C, you won't be able to compete with SciPy/NumPy.
(cdist runs in Θ(n²) time (if the size of each point is fixed), and every other part of the algorithm in O(n) time, so even if you did try to optimize the code in Python, you wouldn't notice the change for small amounts of data, and the improvements would be overshadowed by cdist for more data.)

Error in computing velocity from distance between GPS points

I'm using a Kalman filter to track the position of a vehicle and for my measurement, I have a GPX file (WGS84 Format) containing information about the latitude, longitude, elevation and the timestamp of each point given by GPS. Using this data, I computed the distance between GPS points (Using Geodesic distance and Vincenty formula) and, since the timestamp information is known, the time difference between the points can be used to calculate the time delta. Since, we now have the distance and the time delta between the points, we can calculate the velocity (= distance between points/time delta) which could then be also used as a measurement input to the Kalman.
However, I have read that this is only the average velocity and not the instantaneous velocity at any given point. To obtain the instantaneous velocity, it is suggested that one should take a running average, and some implementations directly compute the velocity from the difference in time between the current point and the very first point. I'm a bit confused as to which method I should use to implement this in Python.
Firstly, is the method used in my implementation correct for calculating velocity? (I have also read about Doppler shifts that could be used, but sadly I only collect the GPS data through a running app (Strava) on my iPhone.)
How can the instantaneous velocity at every GPS point be calculated from my implementation? (Is the bearing information also necessary?)
What would be the error in this computed velocity? (The error in position itself from an iPhone can be about 10 metres and the error from the distance computation about 1 mm, and I want the focus to be on as much accuracy as possible.)
Current Implementation
import gpxpy
import pandas as pd
import numpy as np
from geopy.distance import vincenty, geodesic
import matplotlib.pyplot as plt
"Import GPS Data"
with open('my_run_001.gpx') as fh:
    gpx_file = gpxpy.parse(fh)
segment = gpx_file.tracks[0].segments[0]
coords = pd.DataFrame([
    {'lat': p.latitude,
     'lon': p.longitude,
     'ele': p.elevation,
     } for p in segment.points])
"Compute delta between timestamps"
times = pd.Series([p.time for p in segment.points], name='time')
dt = np.diff(times.values) / np.timedelta64(1, 's')
"Find distance between points using Vincenty and Geodesic methods"
vx = []
for i in range(len(coords.lat)-1):
    if i <= 2425:
        vincenty_distance = vincenty([coords.lat[i], coords.lon[i]], [coords.lat[i+1], coords.lon[i+1]]).meters
        vx.append(vincenty_distance)
print(vx)
vy = []
for i in range(len(coords.lat)-1):
    if i <= 2425:
        geodesic_distance = geodesic([coords.lat[i], coords.lon[i]], [coords.lat[i+1], coords.lon[i+1]]).meters
        vy.append(geodesic_distance)
print(vy)
"Compute and plot velocity"
velocity = vx/dt
time = [i for i in range(len(dt))]
plt.plot(velocity,time)
plt.xlabel('time')
plt.ylabel('velocity')
plt.title('Plot of Velocity vs Time')
plt.show()
Reference for GPX Data:
https://github.com/stevenvandorpe/testdata/blob/master/gps_coordinates/gpx/my_run_001.gpx
Interesting topic.
If you are planning to use standalone GPS location output to calculate the velocity of an object, be ready for some uncertainty in the results. As I am sure you know, there are certain propagation delays in the whole process, so there are several pieces of information you need to pay attention to.
1.
Basically, taking the distance and time and then calculating velocity from those deltas is the right approach, but as you said, that is the average velocity between two GPS measurements, since GPS has some propagation delay in its nature.
2.
As mentioned, this kind of calculation gives us the average velocity as a function of delta time and distance, and by nature we can't change that. What we can do is affect the frequency at which the GPS signal is sampled, and thereby increase or decrease the real-time accuracy of our system.
SUGGESTION: If you want somewhat more accurate real-time velocity data, I suggest involving the gyroscope sensor of your phone and processing its output. Collecting the first delta (average) speed from GPS and then detecting gyro changes would be an interesting way to continue with your idea.
3.
Let's say you are walking (or super-speed running :)) with your device. At one moment the device sends a request for a GPS location, but due to some issues (perhaps a bad satellite connection) you get the response with a 10-second delay. For the purpose of the example, let's say you are walking along an absolutely straight line on an absolutely flat surface :) One minute after the last response was received, you send another request for a GPS location, and you receive data telling you that you have walked 300 m north of your previous measurement, with a 2-second delay. If you measure from one sent request to the next, you would conclude your speed was 300/70 = 4.28 m/s (quite impressive), but here is one possible actual scenario:
- You didn't walk 300 m, you walked 270 m (GPS error)
- The time between the two received measurements is around 62 s
- You were actually even faster, at 270/62 = 4.35 m/s
With a phone it is tricky: you can't measure exactly when the request actually went out over the air or at which millisecond the response arrived; those things are only possible when you are working with the sensors at a near-hardware level. Therefore you will certainly lose some accuracy.
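Regarding the running average mentioned in the question: one simple post-processing step is to smooth the per-segment (average) velocities with a rolling window. A minimal sketch with pandas, assuming vx and dt from the question's code and an arbitrary window length of 5:
import numpy as np
import pandas as pd

# Per-segment average velocity from the question's distance and time deltas.
velocity = np.asarray(vx) / dt               # m/s, one value per segment

# Centered rolling mean as a crude estimate of "instantaneous" velocity;
# window=5 is an arbitrary smoothing length, tune it to your sampling rate.
smoothed = (
    pd.Series(velocity)
    .rolling(window=5, center=True, min_periods=1)
    .mean()
    .to_numpy()
)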

Python KD Tree Nearest Neigbour where distance is greater than zero

I am trying to implement a Nearest neighbour search for Lat and Lon data. Here is the Data.txt
61.3000183105 -21.2500038147 0
62.299987793 -23.750005722 1
66.3000488281 -28.7500038147 2
40.8000183105 -18.250005722 3
71.8000183105 -35.7500038147 3
39.3000183105 -19.7500019073 4
39.8000183105 -20.7500038147 5
41.3000183105 -20.7500038147 6
The problem is, when I run the nearest-neighbour search for each of the lat/lon points in the data set, each point matches itself. E.g. the nearest neighbour of (-21.2500038147, 61.3000183105) will be (-21.2500038147, 61.3000183105), and the resulting distance will be 0.0. I am trying to avoid this but with no luck. I tried doing if not (array_equal) but still...
Below is my python code
import numpy as np
from numpy import *
import decimal
from scipy import spatial
from scipy.spatial import KDTree
from math import radians,cos,sin,sqrt,exp
Lat =[]
Lon =[]
Day =[]
nja = []
Data = np.loadtxt('Data.txt',delimiter=" ")
for i in range(0, len(Data)):
    Lon.append(Data[i][:][0])
    Lat.append(Data[i][:][1])
    Day.append(Data[i][:][2])
tree = spatial.KDTree(zip(Lon, Lat))
print "Lon :", len(Lon)
print "Tree :", len(tree.data)
for i in range(0, len(tree.data)):
    pts = np.array([tree.data[i][0], tree.data[i][1]])
    nja.append(pts)
for i in range(0, len(nja)):
    if not (np.array_equal(nja, tree.data)):
        nearest = tree.query(pts, k=1, distance_upper_bound=9)
        print nearest
For each point P[i] in your data set, you're asking "Which is the point nearest to P[i] in my data set?" and you get the answer "It is P[i]".
If you ask a different question, "Which are the TWO points nearest to P[i]?", i.e., tree.query(pts,k=2) (the difference with your code being s/k=1/k=2/)
you will get P[i] and also a P[j], the second nearest point, that is the result you want.
Side note:
I'd recommend that you project your data before building the tree, because over your range of latitudes there is a large variation in what a one-degree difference in longitude means as a distance.
How about a low-tech solution? If you have a large number of points (say 10,000 or more), this is no longer reasonable, but for a smaller number this brute-force solution might be useful:
import numpy as np
Lat = np.asarray(Lat)   # make sure these are arrays, not plain lists
Lon = np.asarray(Lon)
dist = (Lat[:,None]-Lat[None,:])**2 + (Lon[:,None]-Lon[None,:])**2
Now you have an NxN array (N is the number of points) with the distances (or, to be precise, the squares of the distances) between all point pairs. Finding the shortest distance for each point is then a matter of finding the smallest value in each row. To exclude the point itself you may set the diagonal to NaN and use nanargmin:
np.fill_diagonal(dist, np.nan)
closest = np.nanargmin(dist, axis=1)
This approach is very simple and guaranteed to find the closest points, but has two significant downsides:
It is O(n^2), and at 10000 points it takes around one second
It consumes a lot of memory (800 MB for the above-mentioned case)
The latter problem can of course be avoided by doing this piecewise, but the first problem excludes large point sets.
This can be carried out also by using scipy.spatial.distance.pdist:
dist=scipy.spatial.distance.pdist(np.column_stack((Lon, Lat)))
This is a bit faster (by half at least), but the output matrix is in the condensed form, see the documentation for scipy.spatial.distance.squareform.
If you need to calculate the real distances, then this is a good alternative, as pdist can handle distances on a sphere.
Then, again, you may use your KDtree approach by just extending your query to two closest point:
nearest = tree.query(pts, k=2, distance_upper_bound=9)
Then nearest[1][0] has the point itself ("me, myself, and I"), nearest[1][1] the real nearest neighbour (or inf if there is nothing near enough).
The best solution depends on the number of points you have. Also, you might want to use something else than cartesian 2D distances if your map points are not close to each other on the globe.
Just a note about using latitudes and longitudes in finding distances: if you just pretend they are 2D Cartesian points, you get it wrong. At 60°N one degree of latitude is about 111 km, whereas one degree of longitude is only about 55.5 km. So at the very least you will have to divide the longitudes by cos(latitude), and even with that trick you will end up in trouble when the longitudes wrap around from east to west.
Probably the easiest way out of this trouble is to calculate the coordinate points into Cartesian 3D points:
x = cos(lat) * cos(lon)
y = cos(lat) * sin(lon)
z = sin(lat)
If you then calculate the shortest distances between these points, you will get the correct results. (Just note that the distances are not the same as real shortest distances on the surface of the globe.)
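For completeness, a minimal NumPy/SciPy sketch of that 3D conversion combined with a KDTree query (k=2 so that the first hit, the point itself, can be skipped); Lat and Lon are assumed to be the lists loaded in the question:
import numpy as np
from scipy.spatial import cKDTree

lat = np.radians(np.asarray(Lat))
lon = np.radians(np.asarray(Lon))

# Unit-sphere Cartesian coordinates (chord distances, not great-circle distances).
xyz = np.column_stack((np.cos(lat) * np.cos(lon),
                       np.cos(lat) * np.sin(lon),
                       np.sin(lat)))

tree = cKDTree(xyz)
dists, idx = tree.query(xyz, k=2)
nearest_idx = idx[:, 1]          # skip column 0: each point's nearest hit is itself
chord_dist = dists[:, 1]         # chord length on the unit sphere

# If needed, convert chord length to great-circle distance in km:
earth_radius_km = 6371.0
great_circle_km = 2 * earth_radius_km * np.arcsin(chord_dist / 2)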
