I'm building a program in Python that loads in the geographic location of all addresses in my country and then calculates a "remoteness" value for each address by considering its distance to all the other addresses.
Since there are 3-5 million addresses in my country, I ultimately need to run 3-5 million iterations of the algorithm that calculates the index.
I have already taken some measures to bring down running time, including:
processing the data efficiently with NumPy
instead of comparing each address to every other address, I divide the country into zones, each of which has already been assigned a population; each address then only needs to know how many of those zone centers fall within each of the distances listed in RADII
It's still taking 23 seconds to get through a list of 5000 addresses, which means I expect 24h+ running time for the full dataset of 3-5 million.
So my question is: Which parts of my algorithm should be improved in order to save time? What stands out to you as slow, redundant or inefficient code?
Does my use of JSON and NumPy make sense, for example? Could simplifying the "distance" function help much? Or would I gain a significant advantage by throwing it all out and using a language other than Python?
Any advice would be much appreciated. I'm a newbie programmer so there could easily be issues I'm not aware of.
Here's the code. It's currently working on a limited dataset (one minor island) which makes up about 0.1% of the full dataset. The algorithm starts at the "# Main loop" comment:
import json
import math

import numpy as np
from cs50 import SQL  # assumption: the SQL helper used below matches the CS50 library's interface

RADII = [888, 1480, 2368, 3848, 6216, 10064, 16280]
R = 6371000  # Earth radius in metres
remoteness = []
db = SQL("sqlite:///samsozones.db")

def main():
    with open("samso.json", "r") as json_data:
        data = json.load(json_data)
    rows = db.execute("SELECT * FROM zones")

    # Establish the number of zones with addresses in them
    ZONES = len(rows)

    # Initialize matrix with the location of the center of each zone and the population of the zone
    zonematrix = np.zeros((ZONES, 3), dtype="float")
    for i, row in enumerate(rows):
        zonematrix[i, :] = row["x"], row["y"], row["population"]

    # Matrix with the distance from the current address to each zone center and the population of
    # the zone (filled out for each address in the main loop)
    distances = np.zeros((ZONES, 2), dtype="float")

    # Main loop (calculate remoteness index for each address)
    for address in data:
        # Reset remoteness index for the new address
        index = 0
        # Calculate the distance from the address to each zone center and insert the distances
        # into the distances matrix along with the population
        for j in range(ZONES):
            distances[j, 0] = distance(address["x"], address["y"], zonematrix[j, 0], zonematrix[j, 1])
            distances[j, 1] = zonematrix[j, 2]
        # Calculate remoteness index
        for radius in RADII:
            # Count the zone centers within each radius and add up their population
            allwithincircle = distances[distances[:, 0] < radius]
            count = len(allwithincircle)
            pop = allwithincircle[:, 1].sum()
            # Calculate average within-radius zone population
            try:
                factor = pop / count
            except ZeroDivisionError:
                factor = 0
            # Increment remoteness index
            index += (1 / radius) * factor
        remoteness.append((address["betegnelse"], index))

# Haversine function by Deduplicator, adapted from
# https://stackoverflow.com/questions/27928/calculate-distance-between-two-latitude-longitude-points-haversine-formula
def distance(lat1, lon1, lat2, lon2):
    dLat = deg2rad(lat2 - lat1)
    dLon = deg2rad(lon2 - lon1)
    a = math.sin(dLat / 2) * math.sin(dLat / 2) + math.cos(deg2rad(lat1)) * math.cos(deg2rad(lat2)) * math.sin(dLon / 2) * math.sin(dLon / 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    d = R * c
    return d

def deg2rad(deg):
    return deg * (math.pi / 180)

if __name__ == "__main__":
    main()

# Running time (N = 5070): 23 seconds
# Results quite reasonable
# Scales more or less linearly
I can recommend a number of potential speed-ups:
Don't use haversine to calculate your distances. Find a good local projection for your country (you don't say where you are, so I can't suggest one) and reproject your data into that CRS, which will let you use simple Euclidean distances; these are much faster to compute, especially if you work with the square of the distance to save a bunch of square roots.
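As a rough sketch of that reprojection idea (my own illustration, not part of the answer: the EPSG code assumes Denmark-ish data given the Samsø file names, and the coordinates are toy values):

import numpy as np
from pyproj import Transformer

# Reproject once from WGS84 lon/lat to a local metric CRS (UTM zone 32N assumed here)
transformer = Transformer.from_crs("EPSG:4326", "EPSG:25832", always_xy=True)

zone_lon = np.array([10.57, 10.60, 10.63])      # toy zone centres (degrees)
zone_lat = np.array([55.85, 55.87, 55.90])
zone_x, zone_y = transformer.transform(zone_lon, zone_lat)   # metres

addr_x, addr_y = transformer.transform(10.61, 55.86)         # one address

RADII = [888, 1480, 2368, 3848, 6216, 10064, 16280]
radii_sq = np.array(RADII, dtype=float) ** 2

# Squared Euclidean distances: no trigonometry and no square roots needed
d_sq = (zone_x - addr_x) ** 2 + (zone_y - addr_y) ** 2
within = d_sq[:, None] < radii_sq[None, :]       # zones within each radius
print(within.sum(axis=0))                        # zone count per radius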
I would avoid calculating the distances to all the possible zones by converting the calculation to a raster surface of population, calculating the remoteness for each cell, and then looking up each address's point on that raster (see the sketch below). If one house on my street is remote, I don't need to look the rest up! For good examples, look at "Wilderness attribute mapping in the United Kingdom" by my former colleagues Carver, Evans and Fritz.
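A minimal sketch of that raster idea (my own illustration, not code from the cited paper; the cell size and toy data are assumptions): bin the zone population onto a grid once, precompute within-radius population sums for every cell by convolving with a disc kernel, and then each address becomes a single array lookup.

import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
zone_x = rng.uniform(0, 30_000, 2_000)            # toy projected coordinates (metres)
zone_y = rng.uniform(0, 30_000, 2_000)
zone_pop = rng.integers(1, 500, 2_000).astype(float)
addr_x = rng.uniform(0, 30_000, 5_000)
addr_y = rng.uniform(0, 30_000, 5_000)

cell = 500.0                                       # assumed raster resolution in metres
xs = np.arange(zone_x.min(), zone_x.max() + cell, cell)
ys = np.arange(zone_y.min(), zone_y.max() + cell, cell)
pop_grid, _, _ = np.histogram2d(zone_x, zone_y, bins=[xs, ys], weights=zone_pop)

def disc(radius_m, cell_m):
    """Boolean disc kernel with the given radius, expressed in cells."""
    r = int(np.ceil(radius_m / cell_m))
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    return (xx * xx + yy * yy <= r * r).astype(float)

# Population within 2368 m of every cell, computed once for the whole surface
pop_within = fftconvolve(pop_grid, disc(2368, cell), mode="same")

# Each address is now just a lookup on the precomputed surface
ix = np.clip(np.searchsorted(xs, addr_x) - 1, 0, pop_grid.shape[0] - 1)
iy = np.clip(np.searchsorted(ys, addr_y) - 1, 0, pop_grid.shape[1] - 1)
addr_pop_within = pop_within[ix, iy]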
My answer focuses on the computational tricks you can try in these kinds of situations rather than on domain-specific optimizations (which you should definitely prioritize, as they'll usually give you the largest speedup).
For that, I use Cython, which adds static typing to Python and transpiles as much of the code as possible to C.
calc_dist takes an Nx2 NumPy array of N latitudes and longitudes as input, and uses it to compute an NxN distance array where every value above the diagonal at coordinates (i, j) is the distance between locations i and j:
# To test, I recommend using a notebook with the %%cython cell magic
# %load_ext cython

# %%cython
import numpy as np
cimport numpy as np
from libc.math cimport sin, cos, asin, sqrt, pi
import cython

ctypedef np.float64_t dtype_t

@cython.boundscheck(False)
@cython.wraparound(False)
def calc_dist(dtype_t[:, :] coords):
    result = np.zeros((coords.shape[0], coords.shape[0]), dtype=np.float64)
    cdef dtype_t[:, :] result_view = result
    cdef int d1, d2
    for d1 in range(coords.shape[0]):
        for d2 in range(d1, coords.shape[0]):
            result_view[d1, d2] = haversine(coords[d1, 0], coords[d1, 1], coords[d2, 0], coords[d2, 1])
    return result

@cython.cdivision(True)
cdef dtype_t haversine(dtype_t lat1, dtype_t lon1, dtype_t lat2, dtype_t lon2):
    """Pure C haversine. Based on https://stackoverflow.com/a/29546836/13145954"""
    cdef dtype_t deg_to_rad = pi / 180
    lon1 = lon1 * deg_to_rad
    lon2 = lon2 * deg_to_rad
    lat1 = lat1 * deg_to_rad
    lat2 = lat2 * deg_to_rad
    cdef dtype_t dlon = lon2 - lon1
    cdef dtype_t dlat = lat2 - lat1
    a = sin(dlat / 2.0) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2.0) ** 2
    c = 2 * asin(sqrt(a))
    km = 6367 * c
    return km
Running calc_dist on the following random coordinates takes only 780ms on my laptop, which translates to a running time of about 13 minutes for 5 million rows.
coords = np.random.random((5000,2))
coords[:,0] = (coords[:,0]*2-1)*90
coords[:,1] = (coords[:,1]*2-1)*180
NB: the code will have to be adapted to process the data in chunks in order to avoid OOM errors on the full database.
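One way the chunked version could look (my own plain-NumPy sketch rather than Cython, and geared to the original address-versus-zone problem; all array names here are assumptions): process the addresses in blocks against the zone centres and reduce each block immediately, so the full pairwise matrix never has to sit in memory.

import numpy as np

def haversine_matrix(lat1, lon1, lat2, lon2, R=6371000.0):
    """Distances in metres between every point in (lat1, lon1) and every point in (lat2, lon2)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2[None, :] - lat1[:, None]
    dlon = lon2[None, :] - lon1[:, None]
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1)[:, None] * np.cos(lat2)[None, :] * np.sin(dlon / 2) ** 2
    return 2 * R * np.arcsin(np.sqrt(a))

RADII = np.array([888, 1480, 2368, 3848, 6216, 10064, 16280], dtype=float)

def remoteness_chunked(addr_lat, addr_lon, zone_lat, zone_lon, zone_pop, chunk=10_000):
    out = np.zeros(len(addr_lat))
    for start in range(0, len(addr_lat), chunk):
        sl = slice(start, start + chunk)
        d = haversine_matrix(addr_lat[sl], addr_lon[sl], zone_lat, zone_lon)  # (chunk, n_zones)
        for r in RADII:
            within = d < r                                       # per-address mask of zones in range
            count = within.sum(axis=1)
            pop = np.where(within, zone_pop[None, :], 0.0).sum(axis=1)
            factor = np.divide(pop, count, out=np.zeros_like(pop), where=count > 0)
            out[sl] += factor / r                                # same index increment as the original loop
    return out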
Related
I have a large dataset of around 2 million rows and 4 columns: observation_id, latitude, longitude and class_id, like that:
| Observation_id | Latitude  | Longitude | Class_id |
|----------------|-----------|-----------|----------|
| 10131188       | 45.146973 | 6.416794  | 101      |
| 10799362       | 46.783695 | -2.072855 | 700      |
| 10392536       | 48.604866 | -2.825003 | 1456     |
| ...            | ...       | ...       | ...      |
| 22068176       | 29.806055 | -98.41853 | 5532     |
There are 18,000 classes, but some of them are over-represented and some are under-represented. Note that each observation is either in France or in the USA.
I need to find, for each observation, the distance to the closest observation of every class.
For example, for the first observation (which belongs to the class 101 if we look at the table above), I will have a vector of size 18,000. The first value of the vector will represent the distance in km to the closest occurrence of class 1, the second value will represent the distance in km to the closest occurrence of class 2, and so on until the last value which will represent, you guessed it, the distance in km to the closest occurrence of class 18,000.
If the distance is too large (let's say more than 50km), I don't need the exact distance but a fixed value (50 km in this case). So if the closest occurrence from one class to my observation is more than 50km (whether it's 51km or 9,000km), I can fill 50 for the corresponding value of the observation's vector.
But I see two problems here:
My code will take forever to run.
The created file will be huge.
I started to create a small script that calculates the haversine distance, but for one observation it takes around 8 seconds to run, so it would be impossible for 2 million. Here it is anyway:
import numpy as np
from math import radians, sin, cos, asin, sqrt

lat1 = radians(45.705116)  # lat for observation 10561949
lon1 = radians(1.424622)   # lon for observation 10561949
df = df[df.observation_id != 10561949]  # removing observation 10561949 from the DataFrame
list_obs = np.full(18000, 50.0)  # Array of size 18,000 filled with the value 50

for class_id, lat2, lon2 in zip(df['class_id'], df['latitude'], df['longitude']):
    lat2, lon2 = radians(lat2), radians(lon2)  # convert to radians
    a = sin((lat2 - lat1)/2)**2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1)/2)**2  # Haversine distance (1/2)
    dist = 2 * asin(sqrt(a)) * 6371  # Haversine distance (2/2)
    if list_obs[class_id] >= dist:  # keep the nearest occurrence per class
        list_obs[class_id] = dist
Do you have an idea of how to speed up the algorithm (the distance doesn't have to be perfectly calculated; I just need a rough idea of the nearest neighbor of each class for each observation) and how to store the gigantic file afterwards (it will be an array-like of 2,000,000 x 18,000)?
The idea after this is to try to feed it to a neural network (let's say an MLP), to see the difference with a simple k-nearest-neighbor approach.
Since you only care about distances below 50 km, the best saving I can think of is to pre-calculate approximate distances on a grid so that you can rule out the far-away points without ever computing their exact distances.
Below is my best attempt to solve this. It has a setup complexity of O(len(df)) but a search complexity of just O(9 * avg. bin size), which is significantly less than the O(len(df)) of your example.
Note 1) there are large parts of this that can be vectorized to improve performance.
Note 2) there are most certainly better ways to bin distances on a sphere; I am just not that familiar with them. The key idea is to index the values first so that you can quickly find all data points within distance x (see the BallTree sketch after the code below).
Note 3) I would be surprised if this code is bug free.
# generate dummy data --------------------------
import pandas as pd
import random

random.seed(10)

rand_float = lambda: (random.random() - .5) * 90 * 2
rand_int = lambda: int(random.random() * 18000)
dummy_data = [(rand_float(), rand_float(), rand_int()) for i in range(100_000)]
df = pd.DataFrame(data=dummy_data, columns=('lat', 'lon', 'class'))

# bin data points -------------------------------
from collections import defaultdict
from math import radians, sin, cos, asin, sqrt

def find_bin(lat, lon, bin_size=60):
    """Approximately bin data points
    https://stackoverflow.com/questions/1253499/simple-calculations-for-working-with-lat-lon-and-km-distance
    approximate distance conversion to exclude "far away" distances - only needs to be approximate since we add a buffer,
    however I am sure there are much better methods one could use to do this approximation, I spent 10 mins googling
    """
    # ~110 km per degree of latitude; degrees of longitude shrink with cos(latitude)
    return int(110 * lat // bin_size), int(110 * cos(radians(lat)) * lon // bin_size)

bins = defaultdict(list)
for i, row in df.iterrows():  # O(len(df))
    bins[find_bin(row['lat'], row['lon'])].append(int(i))  # this is slow, it can be vectorized less elegantly but only needs to run once

print(f'average bin size {sum(map(len, bins.values()))/len(bins)}')

# find distances to neighbours ------------------
import numpy as np

def compute_distance(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])  # convert to radians
    a = sin((lat2 - lat1)/2)**2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1)/2)**2  # Haversine distance (1/2)
    return 2 * asin(sqrt(a)) * 6371  # Haversine distance (2/2)

def neighbours(x, y):
    yield x-1, y-1
    yield x-1, y
    yield x-1, y+1
    yield x  , y-1
    yield x  , y
    yield x  , y+1
    yield x+1, y-1
    yield x+1, y
    yield x+1, y+1

obs = 1000  # 10561949
lat1, lon1, _ = df.iloc[obs].values
print(f'finding {lat1}, {lon1}')
b = find_bin(lat1, lon1)

list_obs = np.full(18000, 50.0)  # Array of size 18,000 filled with the value 50
for adj_bin in neighbours(*b):  # O(9)
    for i in bins[adj_bin]:  # O(avg. bin size)
        lat2, lon2, class_ = df.loc[i].values
        dist = compute_distance(lat1, lon1, lat2, lon2)
        if dist < list_obs[int(class_)]:
            print(f'dist {dist} to {i} ({class_})')
            list_obs[int(class_)] = dist
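As a follow-up to Note 2 above (my own addition, not part of this answer): scikit-learn's BallTree supports the haversine metric directly, which gives you the "find everything within distance x" indexing without hand-rolled bins. It expects (lat, lon) in radians and returns distances in radians.

import numpy as np
from sklearn.neighbors import BallTree

coords_rad = np.radians(df[['lat', 'lon']].values)     # df from the dummy-data block above
tree = BallTree(coords_rad, metric='haversine')

query = np.radians([[45.705116, 1.424622]])            # one (lat, lon) point, in radians
dist_rad, idx = tree.query(query, k=5)                 # 5 nearest neighbours
print(dist_rad * 6371, idx)                            # convert radians to km

# All points within 50 km of the query point
idx_within = tree.query_radius(query, r=50 / 6371)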
I have longitude and latitude arrays of fixed resolution, i.e. 0.1 degrees. This gives me 1800 lats and 3600 lons. I want to create a 1800 x 3600 matrix that stores the area of each grid cell, based on the formula here, i.e.
A = 2 * pi * R^2 * |sin(lat1) - sin(lat2)| * |lon1 - lon2| / 360
The lats and lons are already in arrays and represent the centre of each grid cell.
Currently I use a function that calculates the area of a given rectangular box:
def grid_area(lat1, lon1, lat2, lon2, radius=6365000):
    """
    Calculate grid cell area from the lat-long corner points of a rectangular/square cell sized in degrees.
    Calculations are without any projection system.
    radius in metres is used to make it generic; defaults to Earth.
    Formula from: https://www.pmel.noaa.gov/maillists/tmap/ferret_users/fu_2004/msg00023.html
    """
    import numpy as np
    area = (np.pi / 180) * (radius ** 2) * np.abs(np.sin(np.radians(lat1)) - np.sin(np.radians(lat2))) * np.abs(lon1 - lon2) / 360
    return area
I use this in a double loop for each lat/lon combination to get the area_grid.
grid_areas = np.zeros((len(lats), len(longs)))
for ll in range(len(longs) - 1):
    for lt in range(len(lats) - 1):
        lt1 = np.round(lats[lt] + .05, 2)
        ll1 = np.round(longs[ll] - .05, 2)
        lt2 = np.round(lats[lt] - .05, 2)
        ll2 = np.round(longs[ll] + .05, 2)
        grid_areas[lt, ll] = grid_area(lt1, ll1, lt2, ll2)
This as expected is slow. I am not sure which approach I can use to make it efficient.
I looked through the forum for ways to create NxM matrices, but wasn't able to find a solution for this problem.
While writing this question, I came across a thread on Stack Overflow suggesting itertools.chain. I will try to change my code accordingly, if that helps, and will update my findings here.
In the meantime, any help in the right direction would help.
UPDATE:
I changed my code using itertools.product
lat_longs = np.array(list(itertools.product(*[lats.tolist(),longs.tolist()])))
and updated the function to accept centroids.
def grid_area(lat=None, lon=None, grid_size=.1, radius=6365000):
    """
    Calculate grid cell area from the lat-long centroid, for a rectangular/square cell sized in degrees.
    Calculations are without any projection system.
    radius in metres is used to make it generic; defaults to Earth.
    Formula from: https://www.pmel.noaa.gov/maillists/tmap/ferret_users/fu_2004/msg00023.html
    """
    import numpy as np
    grid_delta = grid_size / 2
    lat1 = lat + grid_delta
    lat2 = lat - grid_delta
    lon1 = lon - grid_delta
    lon2 = lon + grid_delta
    area = (np.pi / 180) * (radius ** 2) * np.abs(np.sin(np.radians(lat1)) - np.sin(np.radians(lat2))) * np.abs(lon1 - lon2) / 360
    return area
I then rearrange the return area array using
areas_mat = areas.reshape((lats.shape[0], longs.shape[0]))
Now the longest part of the code is the itertools.product call. It takes about 4.5 seconds, while the area calculation takes only about 350 ms.
Any other way to get that first combination faster?
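One possibility for that (my own sketch, not from the original post): build the grid of centre coordinates with NumPy broadcasting instead of itertools.product, which avoids the Python-level loop entirely.

import numpy as np

grid_size = 0.1
lats = np.arange(-90 + grid_size / 2, 90, grid_size)      # (1800,) cell centres
longs = np.arange(-180 + grid_size / 2, 180, grid_size)   # (3600,) cell centres

lat_grid, lon_grid = np.meshgrid(lats, longs, indexing="ij")
lat_longs = np.column_stack([lat_grid.ravel(), lon_grid.ravel()])  # shape (6_480_000, 2)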
Update2: Final code
Once I tried it, I found that the area was not correct, even when the code was aligned with the formula in the link, so I used the second source for the final version. The final code is:
def grid_area_vec(lat=None, lon=None, grid_size=.1, radius=6365000):
    """
    Calculate grid cell areas from the lat-long centroids, for rectangular/square cells sized in degrees.
    Calculations are without any projection system.
    radius in metres is used to make it generic; defaults to Earth.
    Original formula from: https://www.pmel.noaa.gov/maillists/tmap/ferret_users/fu_2004/msg00023.html
    Another source for the formula, used for the final version:
    https://gis.stackexchange.com/questions/413349/calculating-area-of-lat-lon-polygons-without-transformation-using-geopandas
    """
    import numpy as np
    grid_delta = 0.5 * grid_size
    # dlon: (3600,)
    dlon = np.full(lon.shape, np.deg2rad(grid_size))
    # dlat: (1800, 1)
    dlat = np.abs(np.sin(np.deg2rad(lat + grid_delta)) -
                  np.sin(np.deg2rad(lat - grid_delta)))[:, None]
    # area: (1800, 3600)
    # area = np.deg2rad(radius**2 * dlat * dlon)
    area = radius**2 * (dlat * dlon)
    return area
You can trivially vectorize this operation across all your arrays. Given an array lats with shape (1800,), and an array lons with shape (3600,), you can reshape them so that the broadcasted computation yields an array of the correct shape.
grid_delta = 0.5 * grid_size
# dlon: (3600,), longitude span of each cell in radians
dlon = np.full(lons.shape, np.deg2rad(grid_size))
# dlat: (1800, 1)
dlat = np.abs(np.sin(np.deg2rad(lats + grid_delta)) -
              np.sin(np.deg2rad(lats - grid_delta)))[:, None]
# area: (1800, 3600)
area = radius**2 * dlat * dlon
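As a quick sanity check of the broadcasting version (my own example, not from the answer), the cells should tile the sphere, so the areas should sum to roughly 4 * pi * R^2:

import numpy as np

radius = 6365000.0
grid_size = 0.1
grid_delta = 0.5 * grid_size
lats = np.arange(-90 + grid_delta, 90, grid_size)      # (1800,) cell centres
lons = np.arange(-180 + grid_delta, 180, grid_size)    # (3600,) cell centres

dlon = np.full(lons.shape, np.deg2rad(grid_size))
dlat = np.abs(np.sin(np.deg2rad(lats + grid_delta)) -
              np.sin(np.deg2rad(lats - grid_delta)))[:, None]
area = radius**2 * dlat * dlon                          # (1800, 3600)

print(area.shape, area.sum() / (4 * np.pi * radius**2))  # ratio should be ~1.0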
I've got two dataframes, each with a set of coordinates. Dataframe 1 is a list of biomass sites, with coordinates in columns 'lat' and 'lng'. Dataframe 2 is a list of postcode coordinates, linked to sale price, with coordinates in columns 'pc_lat' and 'pc_lng'.
I've used this stackoverflow question to work out the closest biomass site to each property. This is the code I am using:
def dist(lat1, long1, lat2, long2):
    return np.abs((lat1 - lat2) + (long1 - long2))

def find_site(lat, long):
    distances = biomass.apply(
        lambda row: dist(lat, long, row['lat'], row['lng']),
        axis=1)
    return biomass.loc[distances.idxmin(), 'Site Name']

hp1995['BiomassSite'] = hp1995.apply(
    lambda row: find_site(row['pc_lat'], row['pc_long']),
    axis=1)

print(hp1995.head())
print(hp1995.head())
This has worked well, in that I've got the name of the closest biomass generation site; however, I also want to know the distance calculated between the property and that site.
How would I calculate the distance?
What unit would the output distance be in? I am trying to find properties within 2 km of the biomass site.
To calculate the distance between two global coordinates you should use the haversine formula. Based on this page I have implemented the following method:
import math

def distanceBetweenCm(lat1, lon1, lat2, lon2):
    dLat = math.radians(lat2 - lat1)
    dLon = math.radians(lon2 - lon1)
    lat1 = math.radians(lat1)
    lat2 = math.radians(lat2)
    a = math.sin(dLat/2) * math.sin(dLat/2) + math.sin(dLon/2) * math.sin(dLon/2) * math.cos(lat1) * math.cos(lat2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return c * 6371 * 100000  # multiply by 100k to get the distance in cm
You can also modify it to return different units by multiplying by different powers of 10. In the example, multiplying by 100k gives the distance in centimetres; without the multiplication the method returns the distance in km. From there you could perform more unit conversions if necessary.
Edit: As suggested in the comments, one possible optimization for this would be using power operators instead of regular multiplication, like this:
a = math.sin(dLat/2)**2 + math.sin(dLon/2)**2 * math.cos(lat1) * math.cos(lat2)
Take a look at this question to read more about the relative speed of different ways of calculating powers in Python.
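For the original question (getting the distance itself and the 2 km filter), a sketch along these lines could work. It assumes the same dataframes and column names as in the question (biomass with lat/lng and 'Site Name', hp1995 with pc_lat/pc_long); the helper and the 'DistanceKm' column are my own names.

import math
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    dLat = math.radians(lat2 - lat1)
    dLon = math.radians(lon2 - lon1)
    a = (math.sin(dLat / 2) ** 2 +
         math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) * math.sin(dLon / 2) ** 2)
    return 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a)) * 6371

def find_site_and_distance(lat, lng):
    distances = biomass.apply(
        lambda row: haversine_km(lat, lng, row['lat'], row['lng']),
        axis=1)
    closest = distances.idxmin()
    return biomass.loc[closest, 'Site Name'], distances[closest]

hp1995[['BiomassSite', 'DistanceKm']] = hp1995.apply(
    lambda row: pd.Series(find_site_and_distance(row['pc_lat'], row['pc_long'])),
    axis=1)

within_2km = hp1995[hp1995['DistanceKm'] <= 2]   # distances are in km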
I'm writing a Flask application using some data extracted from a GPS sensor. I am able to draw the route on a map, and I want to calculate the distance the GPS sensor traveled. One way could be to just use the start and end coordinates, but due to the way the sensor travels this is quite inaccurate. Therefore I sample every 50th sensor reading: if the real sample size was 1000, I now have 20 samples.
Now I want to be able to put my list of samples through a function to calculate distance. So far I've been able to use the package geopy, but when I take large gps sample sets I do get "too many requests" errors, not to mention I will have extra processing time from processing the requests, which is not what I want.
Is there a better approach to calculating the cumulative distance of a list element containing latitude and longitude coordinates?
positions = [(lat_1, lng_1), (lat_2, lng_2), ..., (lat_n, lng_n)]
I found methods for lots of different mathematical ways of calculating distance using just 2 coordinates (lat1, lng1 and lat2 and lng2), but none supporting a list of coordinates.
Here's my current code using geopy:
from geopy.distance import vincenty

def calculate_distances(trips):
    temp = {}
    distance = 0
    for trip in trips:
        positions = trip['positions']
        for i in range(1, len(positions)):
            distance += (vincenty(positions[i - 1], positions[i]).meters) / 1000
            if i == len(positions):
                temp = {'distance': distance}
        trip.update(temp)
        distance = 0
trips is a list of dictionaries holding key-value pairs of information about a trip (duration, distance, start and stop coordinates, and so forth), and the positions object inside each trip is a list of coordinate tuples, as visualized above.
trips = [{data_1}, {data_2}, ..., {data_n}]
Here's the solution I ended up using. It's called the Haversine (distance) function if you want to look up what it does for yourself.
I changed my approach a little as well. My input (positions) is a list of tuple coordinates:
import math

def calculate_distance(positions):
    results = []
    for i in range(1, len(positions)):
        loc1 = positions[i - 1]
        loc2 = positions[i]

        lat1 = loc1[0]
        lng1 = loc1[1]
        lat2 = loc2[0]
        lng2 = loc2[1]

        degreesToRadians = (math.pi / 180)
        latrad1 = lat1 * degreesToRadians
        latrad2 = lat2 * degreesToRadians
        dlat = (lat2 - lat1) * degreesToRadians
        dlng = (lng2 - lng1) * degreesToRadians

        a = math.sin(dlat / 2) * math.sin(dlat / 2) + math.cos(latrad1) * \
            math.cos(latrad2) * math.sin(dlng / 2) * math.sin(dlng / 2)
        c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
        r = 6371000
        results.append(r * c)

    return (sum(results) / 1000)  # Converting from m to km
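Hypothetical usage, with made-up coordinates just to show the input format:

positions = [(57.0480, 9.9187), (57.0490, 9.9200), (57.0500, 9.9215)]
print(calculate_distance(positions))  # cumulative distance in km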
I'd recommend transforming your (x, y) coordinates into complex numbers, as it is computationally much easier to calculate distances with them. The following function should work:
def calculate_distances(trips):
    for trip in trips:
        positions = trip['positions']
        c_pos = [complex(c[0], c[1]) for c in positions]
        distance = 0
        for i in range(1, len(c_pos)):
            distance += abs(c_pos[i] - c_pos[i - 1])
        trip.update({'distance': distance})
What I'm doing is converting every (lat_1, lng_1) tuple into a single complex number c1 = lat_1 + j*lng_1, and creating the list [c1, c2, ..., cn].
A complex number is, all in all, a 2-dimensional number, so you can do this whenever you have 2D coordinates, which is perfect for geolocation, but it wouldn't work for 3D space coordinates, for instance.
Once you have this, you can easily compute the distance between two complex numbers c1 and c2 as dist12 = abs(c2 - c1). Doing this for each consecutive pair and summing gives the total distance.
Hope this helped!
I need to filter geocodes for near-ness to a location. For example, I want to filter a list of restaurant geocodes to identify those restaurants within 10 miles of my current location.
Can someone point me to a function that will convert a distance into latitude & longitude deltas? For example:
class GeoCode(object):
    """Simple class to store a geocode as lat, lng attributes."""
    def __init__(self, lat=0, lng=0, tag=None):
        self.lat = lat
        self.lng = lng
        self.tag = tag
def distance_to_deltas(geocode, max_distance):
    """Given a geocode and a distance, provides dlat, dlng
    such that

        |geocode.lat - dlat| <= max_distance
        |geocode.lng - dlng| <= max_distance
    """
    # implementation
    # uses inverse Haversine, or other function?
    return dlat, dlng
Note: I am using the supremum norm for distance.
There seems not to have been a good Python implementation. Fortunately the SO "Related articles" sidebar is our friend. This SO article points to an excellent article that gives the maths and a Java implementation. The actual function that you require is rather short and is embedded in my Python code below. Tested to extent shown. Read warnings in comments.
from math import sin, cos, asin, sqrt, degrees, radians

Earth_radius_km = 6371.0
RADIUS = Earth_radius_km

def haversine(angle_radians):
    return sin(angle_radians / 2.0) ** 2

def inverse_haversine(h):
    return 2 * asin(sqrt(h))  # radians

def distance_between_points(lat1, lon1, lat2, lon2):
    # all args are in degrees
    # WARNING: loss of absolute precision when points are near-antipodal
    lat1 = radians(lat1)
    lat2 = radians(lat2)
    dlat = lat2 - lat1
    dlon = radians(lon2 - lon1)
    h = haversine(dlat) + cos(lat1) * cos(lat2) * haversine(dlon)
    return RADIUS * inverse_haversine(h)

def bounding_box(lat, lon, distance):
    # Input and output lats/longs are in degrees.
    # Distance arg must be in same units as RADIUS.
    # Returns (dlat, dlon) such that
    # no points outside lat +/- dlat or outside lon +/- dlon
    # are <= "distance" from the (lat, lon) point.
    # Derived from: http://janmatuschek.de/LatitudeLongitudeBoundingCoordinates
    # WARNING: problems if North/South Pole is in circle of interest
    # WARNING: problems if longitude meridian +/-180 degrees intersects circle of interest
    # See quoted article for how to detect and overcome the above problems.
    # Note: the result is independent of the longitude of the central point, so the
    # "lon" arg is not used.
    dlat = distance / RADIUS
    dlon = asin(sin(dlat) / cos(radians(lat)))
    return degrees(dlat), degrees(dlon)

if __name__ == "__main__":
    # Examples from Jan Matuschek's article
    def test(lat, lon, dist):
        print("test bounding box", lat, lon, dist)
        dlat, dlon = bounding_box(lat, lon, dist)
        print("dlat, dlon degrees", dlat, dlon)
        print("lat min/max rads", [radians(x) for x in (lat - dlat, lat + dlat)])
        print("lon min/max rads", [radians(x) for x in (lon - dlon, lon + dlon)])

    print("liberty to eiffel")
    print(distance_between_points(40.6892, -74.0444, 48.8583, 2.2945))  # about 5837 km
    print()
    print("calc min/max lat/lon")
    degs = [degrees(x) for x in (1.3963, -0.6981)]
    test(*degs, dist=1000)
    print()
    degs = [degrees(x) for x in (1.3963, -0.6981, 1.4618, -1.6021)]
    print(degs, "distance", distance_between_points(*degs))  # 872 km
This is how you calculate distances between lat/long pairs using the haversine formula:
import math

R = 6371  # km
dLat = (lat2 - lat1)  # Make sure it's in radians, not degrees
dLon = (lon2 - lon1)  # Idem
a = (math.sin(dLat/2) * math.sin(dLat/2) +
     math.cos(lat1) * math.cos(lat2) *
     math.sin(dLon/2) * math.sin(dLon/2))
c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
d = R * c
It is now trivial to test "d" (also in km) against your threshold. If you want something other than km, adjust the radius.
I'm sorry I can't give you a drop-in solution, but I do not understand your code skeleton (see comment).
Also note that these days you probably want to use the spherical law of cosines rather than haversine: haversine's advantages in numerical stability are no longer worth it, and the law of cosines is a lot simpler to understand, code and use.
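A minimal sketch of that spherical law of cosines (my own example; inputs in degrees, result in km, reusing the Liberty-to-Eiffel coordinates from the answer above):

import math

def slc_distance_km(lat1, lon1, lat2, lon2, R=6371.0):
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    cos_angle = (math.sin(lat1) * math.sin(lat2) +
                 math.cos(lat1) * math.cos(lat2) * math.cos(lon2 - lon1))
    return R * math.acos(max(-1.0, min(1.0, cos_angle)))  # clamp to guard against rounding error

print(slc_distance_km(40.6892, -74.0444, 48.8583, 2.2945))  # about 5837 km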
If you store data in MongoDB, it does nicely indexed geolocation searches for you, and is superior to the pure-Python solutions above because it will handle optimization for you.
http://www.mongodb.org/display/DOCS/Geospatial+Indexing
John Machin's answer helped me a lot. There is just a small mistake: latitudes and longitudes are swapped in bounding_box:
dlon = distance / RADIUS
dlat = asin(sin(dlon) / cos(radians(lon)))
return degrees(dlat), degrees(dlon)
This solves the problem. The reason is that longitudes don't change their distance per degree, but latitudes do; their distance depends on the longitude.