Struggling to iterate through a list with a list

I need some help. The idea is to calculate the distance between two points from their latitude and longitude, namely the customer's and the store's, and then repeat this for the other stores. It seems to me that I have to iterate through one list with another list. I have been struggling to implement this for several hours now.
EXCEL VIEW
# Backup.xlsx/AccountLocation.xlsx
# column 0 = account numbers
# column 1 = postcodes
# Column 2 = latitude
# Column 3 = longitude
# Column 4 = order total (men)
# Column 5 = order total (women)
# Column 6 = order total (children)
# location.xlsx
# Column 1 = City name
# Column 2 = latitude
# Column 3 = longitude
# Income.xlsx
# Column 1 = City name
# Column 2 = estimated income (Men)
# Column 3 = estimated income (Woman)
# Column 4 = estimated income (Children)
Code
import pandas as pd
import numpy
import math
from openpyxl import load_workbook
def distance2Point(inputLat1, inputLon1, inputLat2, inputLon2):
    lat1 = float(inputLat1)
    lon1 = float(inputLon1)
    lat2 = float(inputLat2)
    lon2 = float(inputLon2)
    R = 6371  # mean Earth radius in kilometres
    radLat1 = lat1 * (numpy.pi)/180  # φ, λ in radians
    radLat2 = lat2 * (numpy.pi)/180
    diffLat = (lat2-lat1) * (numpy.pi)/180
    diffLong = (lon2-lon1) * (numpy.pi)/180
    a = numpy.sin(diffLat/2) * numpy.sin(diffLat/2) + numpy.cos(radLat1) * numpy.cos(radLat2) * numpy.sin(diffLong/2) * numpy.sin(diffLong/2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    d = R * c  # in kilometres, because R is in kilometres
    return d
accountLocation = pd.read_excel("Backup.xlsx", header=None)
storeLocation = pd.read_excel("locations.xlsx", header=None)
accountLatitude = (accountLocation.iloc[:,2]).tolist()
accountLongitude = (accountLocation.iloc[:,3]).tolist()
storeLatitude = (storeLocation.iloc[:,1]).tolist()
storeLongitude = (storeLocation.iloc[:,2]).tolist()
londonDistance = []
for list in a:
    for number in list:
        print(number)
Edit: Sorry, I forgot to mention that I have to use my own haversine formula. I got that to work; I am simply struggling to iterate through these lists.

There's a lot I would change about this code. For starters, do not provide lat and lon as separate variables. It's confusing. Instead use a tuple which is a standard way of dealing with coordinates.
Secondly, there's no reason to implement your own version of the haversine distance when reputable packages have it implemented already, for example, https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.haversine_distances.html.
So the code in the end could look something like:
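from __future__ import annotations
from math import radians
from sklearn.metrics.pairwise import haversine_distances
EARTH_RADIUS_KM = 6371
def distance2point(point1: tuple[float, float], point2: tuple[float, float]) -> float:
    # haversine_distances expects one [latitude, longitude] row per point, in radians
    rows = [[radians(point1[0]), radians(point1[1])],
            [radians(point2[0]), radians(point2[1])]]
    # the result is a 2x2 matrix of unit-sphere distances; scale by the Earth radius for km
    return haversine_distances(rows)[0, 1] * EARTH_RADIUS_KM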
or even simpler for multiple points:
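def distance2point(points: list[tuple[float, float]]):
    # one [latitude, longitude] row per point, in radians
    rows = [[radians(p[0]), radians(p[1])] for p in points]
    # pairwise distance matrix, scaled from the unit sphere to kilometres
    return haversine_distances(rows) * 6371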
EDIT: Oopsie, forgot that sklearn expects radians. Fixed now.
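To address the looping part of the question directly, here is a minimal sketch that uses only the standard library. It assumes the coordinates are already in two plain lists of (lat, lon) tuples; the names `customers`, `stores` and `haversine_km`, and the sample coordinates, are made up for illustration. With the lists from the question, the tuples could be built with `zip(accountLatitude, accountLongitude)` and `zip(storeLatitude, storeLongitude)`.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Hypothetical sample data: (latitude, longitude) in decimal degrees.
customers = [(51.5074, -0.1278), (53.4808, -2.2426)]
stores = [(51.5155, -0.0922), (52.4862, -1.8904)]

# Outer loop over customers, inner loop over stores:
# distances[i][j] is the distance from customer i to store j.
distances = []
for cust_lat, cust_lon in customers:
    row = []
    for store_lat, store_lon in stores:
        row.append(haversine_km(cust_lat, cust_lon, store_lat, store_lon))
    distances.append(row)

for customer, row in zip(customers, distances):
    print(customer, [round(d, 1) for d in row])
```

Each inner list lines up with `stores`, so a per-store column such as `londonDistance` would be `[row[0] for row in distances]`.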

Related

Python haversine formula in degrees is way off

This is all of my python code and it is very far off from returning the correct distance. I broke apart the haversine formula and know that it is going wrong somewhere at C. C is way too large of a number to allow for D to return the correct distance.
from math import sin, cos, atan2, sqrt, pi
First are my functions then my main part of the code
#-----FUNCTIONS------
#Header function
def Header():
    print("This program will calculate the distance between two geographic points!")
#get_location function
def Get_location():
    userLat = input("\n\n Please enter the latitude of your location in decimal degrees: ")
    userLon = input("Enter the longitude of the location in decimal degrees: ")
    return (userLat, userLon)
#Calculate distance function
#def Distance(lat1, lon1, lat2, lon2):
def Distance(location1, location2):
    radEarth = 6371 #km
    #location1 = Get_location()
    #location2 = Get_location()
    lat1 = location1[0]
    lon1 = location1[1]
    lat2 = location2[0]
    lon2 = location2[1]
    B = sin((lat1-lat2)/2)**2
    S = sin((lon1-lon2)/2)**2
    F = (cos(lat1))
    A = B + (F * (cos(lat2)) * S)
    C = 2 * (atan2(sqrt(A),sqrt(1-A)) * (180/pi))
    print(C)
    D = radEarth * C
    return D
This is the main part of my program
#-------MAIN---------
#Call header function
Header()
Begin do another loop while user continues:
doAnother = 'y'
while doAnother == 'y':
    #Collect location points from user
    location1 = Get_location()
    location2 = Get_location()
    print(location1)
    print(location2)
    #Calculate distance between locations
    distance = Distance(location1, location2)
    print('The distance between your two locations is: ' + str(distance))
    doAnother = raw_input('Do another (y/n)?'.lower())
#Display goodbye
print('Goodbye!')

It looks like you're implementing the Haversine formula as described here. (I've had to do the exact thing BTW.) You're correct there is a problem in C.
Your code (Python):
C = 2 * (atan2(sqrt(A),sqrt(1-A)) * (180/pi))
Code from the URL above (Javascript):
var c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1-a));
The problem is that you are converting C to degrees (with that (180/pi)), but the next calculation, D = radEarth * C, only makes mathematical sense if C is in radians: arc length is radius times angle only when the angle is measured in radians.
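A corrected version might look like the sketch below. It keeps C in radians and, as an extra assumption beyond the fix described here, converts the inputs from decimal degrees to radians before calling sin and cos, since the prompts ask for decimal degrees. It also converts the values to float, because input() returns strings in Python 3. The name `distance_km` is made up.

```python
from math import sin, cos, atan2, sqrt, radians, pi

def distance_km(location1, location2, rad_earth=6371.0):
    """Haversine distance in km between two (lat, lon) pairs given in decimal degrees."""
    lat1, lon1 = (radians(float(v)) for v in location1)
    lat2, lon2 = (radians(float(v)) for v in location2)
    a = sin((lat1 - lat2) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon1 - lon2) / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))  # central angle, in radians
    return rad_earth * c  # arc length = radius * angle (radians)

# A quarter of the way around the equator; strings are accepted as well as numbers.
quarter = distance_km(("0", "0"), ("0", "90"))
print(quarter, rad_earth_check := 6371.0 * pi / 2)
```

The two printed values should agree to within rounding error, which is a quick sanity check on the units.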

Calculating distances in TSPLIB

Hello, I have a problem with calculating distances between cities from the TSP library: http://www.math.uwaterloo.ca/tsp/world/countries.html. I have this set of data (cities in Djibouti): http://www.math.uwaterloo.ca/tsp/world/dj38.tsp. I used the distance function from the Q&A here: http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/TSPFAQ.html. I programmed this in Python, and here is my code:
cityCoords = {
1:(11003.611100,42102.500000),
2:(11108.611100,42373.888900),
3:(11133.333300,42885.833300),
4:(11155.833300,42712.500000),
5:(11183.333300,42933.333300),
6:(11297.500000,42853.333300),
7:(11310.277800,42929.444400),
8:(11416.666700,42983.333300),
9:(11423.888900,43000.277800),
10:(11438.333300,42057.222200),
11:(11461.111100,43252.777800),
12:(11485.555600,43187.222200),
13:(11503.055600,42855.277800),
14:(11511.388900,42106.388900),
15:(11522.222200,42841.944400),
16:(11569.444400,43136.666700),
17:(11583.333300,43150.000000),
18:(11595.000000,43148.055600),
19:(11600.000000,43150.000000),
20:(11690.555600,42686.666700),
21:(11715.833300,41836.111100),
22:(11751.111100,42814.444400),
23:(11770.277800,42651.944400),
24:(11785.277800,42884.444400),
25:(11822.777800,42673.611100),
26:(11846.944400,42660.555600),
27:(11963.055600,43290.555600),
28:(11973.055600,43026.111100),
29:(12058.333300,42195.555600),
30:(12149.444400,42477.500000),
31:(12286.944400,43355.555600),
32:(12300.000000,42433.333300),
33:(12355.833300,43156.388900),
34:(12363.333300,43189.166700),
35:(12372.777800,42711.388900),
36:(12386.666700,43334.722200),
37:(12421.666700,42895.555600),
38:(12645.000000,42973.333300)
}
def calcCityDistances(coordDict):
    cities = list(coordDict.keys())
    n = len(cities)
    distances = {}
    latitude = []
    longitude = []
    RRR = 6378.388
    PI = 3.141592
    for i in range(1,n+1):
        cityA = cities[i-1]
        latA, longA = coordDict[cityA]
        deg = int(latA)
        Min = latA - deg
        latitude.append(PI * (deg + 5 * Min / 3) / 180)
        deg = int(longA)
        Min = longA - deg
        longitude.append(PI * (deg + 5 * Min / 3) / 180)
    for i in range(1,n+1):
        for j in range(i + 1, n + 1):
            q1 = cos(longitude[i-1] - longitude[j-1])
            q2 = cos(latitude[i-1] - latitude[j-1])
            q3 = cos(latitude[i-1] + latitude[j-1])
            key = frozenset((i, j))
            distances[key] = {}
            dist = RRR * acos(0.5 * ((1.0 + q1) * q2 - (1.0 - q1) * q3)) + 1.0
            distances[key]['dist'] = dist
            distances[key]['pher'] = init_fer
            distances[key]['vis'] = 0
    return distances
distances = calcCityDistances(cityCoords)
My problem is that the distances calculated by this algorithm are off by a huge margin. The average length of one route between cities is 10 000 km, whereas the optimal TSP route is 6635. You can imagine that when I apply this to my Ant Colony System algorithm the result is around 110 000 km, which is very different from 6 thousand. Can someone explain what I am doing wrong, please?

I'm not familiar with the distance calculation listed in the TSP FAQ. Here's the resource I've used in the past: http://www.movable-type.co.uk/scripts/latlong.html
He gives two great-circle distance calculation methods. Neither one looks like the one the TSP FAQ provides, but both produced a distance that seemed to match reality (Diksa and Dikhil are about 31 km apart).
The input data is in 1000ths of a degree, and I'm not sure if the conversion to radians given takes that into account.
Here's an implementation that might give you better results: note I updated the input data to degrees:
import cmath
import math
cityCoords = {
1:(11.0036111,42.1025),
2:(11.1086111,42.3738889)
}
def spherical_cosines(coordDict):
    R = 6371; # kilometers
    cities = list(coordDict.keys())
    n = len(cities)
    for i in range(1,n+1):
        for j in range(i + 1, n + 1):
            cityA = cities[i-1]
            lat1, lon1 = coordDict[cityA]
            cityB = cities[j-1]
            lat2, lon2 = coordDict[cityB]
            lat1_radians = math.radians(lat1)
            lat2_radians = math.radians(lat2)
            lon1_radians = math.radians(lon1)
            lon2_radians = math.radians(lon2)
            print('A={},{} B={},{}'.format(lat1_radians, lon1_radians, lat2_radians, lon2_radians))
            delta_lon_radians = math.radians(lon2-lon1)
            distance = cmath.acos(cmath.sin(lat1_radians) * cmath.sin(lat2_radians) + cmath.cos(lat1_radians) *
                                  math.cos(lat2_radians) * cmath.cos(delta_lon_radians)) * R;
            print('spherical_cosines distance={}'.format(distance))
spherical_cosines(cityCoords)
update:
The code you posted is not producing the correct distance values. Here are the first two cities using calcCityDistances and spherical_cosines:
input loc=11003.6111, 42102.5
input loc=11108.6111, 42373.8889
radians A = 192.05631381917777,734.8329132074075
B=193.88890915251113,739.5740671363777
calcCityDistances distance = 8078.816781077703
input degrees A=11.0036111,42.1025 B=11.1086111,42.3738889
radians A=0.19204924330399503,0.7348272483209126
B=0.19388183901858905,0.7395638781792782
spherical_cosines> distance=(31.835225475974934+0j)
Units are kilometers. Spherical cosines produces approximately the right value. Is the code you're using the same as what you posted? Notice that the radians conversion doesn't seem to take into account that the input is in thousandths of a degree.
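The unit mismatch can also be handled in code rather than by editing the data. The sketch below assumes, as this answer does, that the raw values are thousandths of a degree; the function name and the `scale` parameter are made up, and whether great-circle distance is what the data set actually intends is not confirmed here.

```python
import math

def great_circle_km(raw_a, raw_b, scale=1000.0, radius_km=6371.0):
    """Spherical law of cosines on (lat, lon) pairs stored as degrees * scale."""
    lat1, lon1 = (math.radians(v / scale) for v in raw_a)
    lat2, lon2 = (math.radians(v / scale) for v in raw_b)
    cos_angle = (math.sin(lat1) * math.sin(lat2)
                 + math.cos(lat1) * math.cos(lat2) * math.cos(lon2 - lon1))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return radius_km * math.acos(cos_angle)

# First two cities from the question's data, unmodified.
d = great_circle_km((11003.611100, 42102.500000), (11108.611100, 42373.888900))
print(d)
```

For these two cities the result should be close to the roughly 31.8 km reported above, using real-valued math instead of cmath.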

Speeding up a nested for loop through two Pandas DataFrames

I have a latitude and longitude stored in a pandas dataframe (df) with filler spots as NaN for stop_id, stoplat, stoplon, and in another dataframe areadf, which contains more lats/lons and an arbitrary id; this is the information that is to be populated into df.
I'm trying to connect the two so that the stop columns in df contain information about the stop closest to that lat/lon point, or leave it as NaN if there is no stop within a radius R of the point.
Right now my code is as follows, but it takes a really long time (>40 minutes for what I'm running at the moment, before changing area to a df and using itertuples; I'm not sure how much difference that will make), as there are thousands of lat/lon points and stops for each set of data. That is a problem because I need to run this on multiple files, so I'm looking for suggestions to make it run faster. I've already made some very minor improvements (e.g. moving to a dataframe, using itertuples instead of iterrows, defining lats and lons outside of the loop to avoid retrieving them from df on every iteration), but I'm out of ideas for speeding it up. getDistance uses the Haversine formula to get the distance between the stop sign and the given lat,lon point.
import pandas as pd
from math import cos, asin, sqrt
R=5
lats = df['lat']
lons = df['lon']
for stop in areadf.itertuples():
    for index in df.index:
        if getDistance(lats[index],lons[index],
                       stop[1],stop[2]) < R:
            df.at[index,'stop_id'] = stop[0] # id
            df.at[index,'stoplat'] = stop[1] # lat
            df.at[index,'stoplon'] = stop[2] # lon
def getDistance(lat1,lon1,lat2,lon2):
    p = 0.017453292519943295 #Pi/180
    a = (0.5 - cos((lat2 - lat1) * p)/2 + cos(lat1 * p) *
         cos(lat2 * p) * (1 - cos((lon2 - lon1) * p)) / 2)
    return 12742 * asin(sqrt(a)) * 100
Sample data:
df
lat lon stop_id stoplat stoplon
43.657676 -79.380146 NaN NaN NaN
43.694324 -79.334555 NaN NaN NaN
areadf
stop_id stoplat stoplon
0 43.657675 -79.380145
1 45.435143 -90.543253
Desired:
df
lat lon stop_id stoplat stoplon
43.657676 -79.380146 0 43.657675 -79.380145
43.694324 -79.334555 NaN NaN NaN

One way would be to use the numpy haversine function from here, just slightly modified so that you can account for the radius you want.
Then just iterate through your df with apply and find the closest value within a given radius.
import numpy as np
def haversine_np(lon1, lat1, lon2, lat2,R):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    All args must be of equal length.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    # index of the closest stop if it lies within R km, otherwise -1
    if km.min() <= R:
        return km.argmin()
    else:
        return -1
df['dex'] = df[['lat','lon']].apply(lambda row: haversine_np(row['lon'],row['lat'],areadf.stoplon.values,areadf.stoplat.values,1),axis=1)
Then merge the two dataframes.
df.merge(areadf,how='left',left_on='dex',right_index=True).drop('dex',axis=1)
lat lon stop_id stoplat stoplon
0 43.657676 -79.380146 0.0 43.657675 -79.380145
1 43.694324 -79.334555 NaN NaN NaN
NOTE: If you choose to follow this method, you must be sure that both dataframes indexes are reset or that they are sequentially ordered from 0 to total len of df. So be sure to reset the indexes before you run this.
df.reset_index(drop=True,inplace=True)
areadf.reset_index(drop=True,inplace=True)
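The apply call still runs one Python-level function call per row of df. If memory allows, the whole points-by-stops distance matrix can be computed in one go with NumPy broadcasting. This is only a sketch under that assumption: the function name `nearest_within_radius` is made up, and for very large inputs the n x m matrix may need to be processed in chunks.

```python
import numpy as np

def nearest_within_radius(lat, lon, stop_lat, stop_lon, radius_km):
    """For each point, index of the nearest stop within radius_km, or -1 if none."""
    lat = np.radians(lat)[:, None]            # shape (n, 1)
    lon = np.radians(lon)[:, None]
    stop_lat = np.radians(stop_lat)[None, :]  # shape (1, m)
    stop_lon = np.radians(stop_lon)[None, :]
    a = (np.sin((stop_lat - lat) / 2.0) ** 2
         + np.cos(lat) * np.cos(stop_lat) * np.sin((stop_lon - lon) / 2.0) ** 2)
    km = 6367 * 2 * np.arcsin(np.sqrt(a))     # (n, m) matrix of distances in km
    nearest = km.argmin(axis=1)
    within = km[np.arange(km.shape[0]), nearest] <= radius_km
    return np.where(within, nearest, -1)

# Sample values from the question.
lat = np.array([43.657676, 43.694324])
lon = np.array([-79.380146, -79.334555])
stop_lat = np.array([43.657675, 45.435143])
stop_lon = np.array([-79.380145, -90.543253])
idx = nearest_within_radius(lat, lon, stop_lat, stop_lon, radius_km=1)
print(idx)
```

The returned array plays the same role as the 'dex' column, so it could be assigned with `df['dex'] = idx` before the merge.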

Pairwise calculations of distances between two sets of points

I'm facing some troubles with doing a pairwise calculation in Python.
I have two sets of nodes (e.g. suppliers and customers).
Set 1: SupplierCO = (Xco, Yco) for multiple suppliers
Set 2: Customer CO = (Xco, Yco) for multiple customers
I want to calculate the distances between a customer and all the suppliers, and save the shortest distance. This should be looped for all customers.
I realize I will have to work with two for loops, and an if function. But I don't understand how to select the coordinates from the correct points while looping.
Thanks for the responses!
Some more information:
- Haversine distance
- Each point in set 1 has to be compared to all the points of set 2
- This is what I've so far
import urllib.parse
from openpyxl import load_workbook, Workbook
import requests
from math import radians, cos, sin, asin, sqrt
"""load datafile"""
workbook = load_workbook('Macro.xlsm')
Companysheet = workbook.get_sheet_by_name("Customersheet")
Networksheet = workbook.get_sheet_by_name("Suppliersheet")
"""search for column with latitude/longitude - customers"""
numberlatC = -1
i = 0
for col in Customersheet.iter_cols():
    if col[2].value == "Latitude" :
        numberlatC = i
    i+=1
numberlongC = -1
j = 0
for col in Customersheet.iter_cols():
    if col[2].value == "Longitude" :
        numberlongC = j
    j+=1
latC = [row[numberlatC].value for row in Companysheet.iter_rows() ]
longC = [row[numberlongC].value for row in Companysheet.iter_rows()]
# haversine formula
dlon = lonC - lonS
dlat = latC - latS
a = sin(dlat/2)**2 + cos(latC) * cos(latS) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))
r = 6371 # Radius of earth in kilometers. Use 3956 for miles
distance = c*r
distances.append([distance])
return distances
customers = [latC, longC]
Thanks!

This should give you the general idea. In the following example I've just used regular coordinates, however, you should be able to convert this to your need.
import math
supplier = [(1,3),(2,4),(8,7),(15,14)]
customer = [(0,2),(8,8)]
def CoordinatesDistance(A, B):
    # Euclidean distance between A = (x1, y1) and B = (x2, y2)
    x1, y1 = A
    x2, y2 = B
    return math.sqrt((x2-x1)**2 + (y2-y1)**2)
def shortest_distance_pair(Cust, Sup):
    pairs = []
    for X in Cust:
        shortest_distance = 999999
        for Y in Sup:
            distance = CoordinatesDistance(X,Y)
            #customer_distance.append(distance)
            if distance < shortest_distance:
                shortest_distance = distance
                sdp = (X,Y)
        # after checking every supplier, keep the closest one for this customer
        pairs.append(sdp)
    return pairs
print(shortest_distance_pair(customer,supplier))
[((0, 2), (1, 3)), ((8, 8), (8, 7))]
Now if you create two lists, 1. Customer coordinates, and 2. Supplier coordinates; you should be able to utilize the above.
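Since the question asks for Haversine distance, the same idea can be written more compactly with min() and a key function. This is a sketch, not the asker's code: the coordinates are made up, the points are assumed to be (lat, lon) tuples in decimal degrees, and `haversine_km` and `nearest_supplier` are hypothetical helper names.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) tuples in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def nearest_supplier(customer, suppliers):
    """Return (supplier, distance_km) for the supplier closest to the customer."""
    best = min(suppliers, key=lambda s: haversine_km(customer, s))
    return best, haversine_km(customer, best)

# Hypothetical coordinates, for illustration only.
suppliers = [(52.37, 4.90), (51.92, 4.48), (50.85, 4.35)]
customers = [(52.09, 5.12), (51.05, 3.72)]

# One entry per customer: the closest supplier and the distance to it.
shortest = {c: nearest_supplier(c, suppliers) for c in customers}
for c, (s, d) in shortest.items():
    print(c, "->", s, round(d, 1), "km")
```

min() does the inner loop and the "if shorter" comparison in one step, so only the loop over customers remains explicit.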

Detecting geographic clusters

I have an R data.frame containing longitude and latitude values that span the entire USA map. When X number of entries are all within a small geographic region of, say, a few degrees longitude and a few degrees latitude, I want to be able to detect this and have my program return the coordinates of the geographic bounding box. Is there a Python or R CRAN package that already does this? If not, how would I go about ascertaining this information?

I was able to combine Joran's answer with Dan H's comment. This is an example output:
The Python code emits calls to the R functions map() and rect(). This USA example map was created with:
map('state', plot = TRUE, fill = FALSE, col = palette())
and then you can apply the rect() calls accordingly from within the R GUI interpreter (see below).
import math
from collections import defaultdict
to_rad = math.pi / 180.0 # convert lat or lng to radians
fname = "site.tsv" # file format: LAT\tLONG
threshhold_dist=50 # adjust to your needs
threshhold_locations=15 # minimum # of locations needed in a cluster
def dist(lat1,lng1,lat2,lng2):
    global to_rad
    earth_radius_km = 6371
    dLat = (lat2-lat1) * to_rad
    dLon = (lng2-lng1) * to_rad
    lat1_rad = lat1 * to_rad
    lat2_rad = lat2 * to_rad
    a = math.sin(dLat/2) * math.sin(dLat/2) + math.sin(dLon/2) * math.sin(dLon/2) * math.cos(lat1_rad) * math.cos(lat2_rad)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a));
    dist = earth_radius_km * c
    return dist
def bounding_box(src, neighbors):
    neighbors.append(src)
    # nw = NorthWest se=SouthEast
    nw_lat = -360
    nw_lng = 360
    se_lat = 360
    se_lng = -360
    for (y,x) in neighbors:
        if y > nw_lat: nw_lat = y
        if x > se_lng: se_lng = x
        if y < se_lat: se_lat = y
        if x < nw_lng: nw_lng = x
    # add some padding
    pad = 0.5
    nw_lat += pad
    nw_lng -= pad
    se_lat -= pad
    se_lng += pad
    # suitable for R's map() function
    return (se_lat,nw_lat,nw_lng,se_lng)
def sitesDist(site1,site2):
    # just a helper to shorten the list comprehension below
    return dist(site1[0],site1[1], site2[0], site2[1])
def load_site_data():
    global fname
    sites = defaultdict(tuple)
    data = open(fname,encoding="latin-1")
    data.readline() # skip header
    for line in data:
        line = line[:-1]
        slots = line.split("\t")
        lat = float(slots[0])
        lng = float(slots[1])
        lat_rad = lat * math.pi / 180.0
        lng_rad = lng * math.pi / 180.0
        sites[(lat,lng)] = (lat,lng) #(lat_rad,lng_rad)
    return sites
def main():
    sites_dict = {}
    sites = load_site_data()
    for site in sites:
        # for each site put it in a dictionary with its value being an array of neighbors
        sites_dict[site] = [x for x in sites if x != site and sitesDist(site,x) < threshhold_dist]
    results = {}
    for site in sites:
        j = len(sites_dict[site])
        if j >= threshhold_locations:
            coord = bounding_box( site, sites_dict[site] )
            results[coord] = coord
    for bbox in results:
        yx="ylim=c(%s,%s), xlim=c(%s,%s)" % (results[bbox]) #(se_lat,nw_lat,nw_lng,se_lng)
        print('map("county", plot=T, fill=T, col=palette(), %s)' % yx)
        # rect(xleft, ybottom, xright, ytop) = (nw_lng, se_lat, se_lng, nw_lat)
        rect='rect(%s,%s, %s,%s, col=c("red"))' % (results[bbox][2], results[bbox][0], results[bbox][3], results[bbox][1])
        print(rect)
        print("")
main()
Here is an example TSV file (site.tsv)
LAT LONG
36.3312 -94.1334
36.6828 -121.791
37.2307 -121.96
37.3857 -122.026
37.3857 -122.026
37.3857 -122.026
37.3895 -97.644
37.3992 -122.139
37.3992 -122.139
37.402 -122.078
37.402 -122.078
37.402 -122.078
37.402 -122.078
37.402 -122.078
37.48 -122.144
37.48 -122.144
37.55 126.967
With my data set, the output of my python script, shown on the USA map. I changed the colors for clarity.
rect(-74.989,39.7667, -73.0419,41.5209, col=c("red"))
rect(-123.005,36.8144, -121.392,38.3672, col=c("green"))
rect(-78.2422,38.2474, -76.3,39.9282, col=c("blue"))
Addition on 2013-05-01 for Yacob
These 2 lines give you the over all goal...
map("county", plot=T )
rect(-122.644,36.7307, -121.46,37.98, col=c("red"))
If you want to narrow in on a portion of a map, you can use ylim and xlim
map("county", plot=T, ylim=c(36.7307,37.98), xlim=c(-122.644,-121.46))
# or for more coloring, but choose one or the other map("country") commands
map("county", plot=T, fill=T, col=palette(), ylim=c(36.7307,37.98), xlim=c(-122.644,-121.46))
rect(-122.644,36.7307, -121.46,37.98, col=c("red"))
You will want to use the 'world' map...
map("world", plot=T )
It has been a long time since I have used this python code I have posted below so I will try my best to help you.
threshhold_dist is the size of the bounding box, i.e. the geographical area.
threshhold_locations is the number of lat/lng points needed within
the bounding box in order for it to be considered a cluster.
Here is a complete example. The TSV file is located on pastebin.com. I have also included an image generated from R that contains the output of all of the rect() commands.
# pyclusters.py
# May-02-2013
# -John Taylor
# latlng.tsv is located at http://pastebin.com/cyvEdx3V
# use the "RAW Paste Data" to preserve the tab characters
import math
from collections import defaultdict
# See also: http://www.geomidpoint.com/example.html
# See also: http://www.movable-type.co.uk/scripts/latlong.html
to_rad = math.pi / 180.0 # convert lat or lng to radians
fname = "latlng.tsv" # file format: LAT\tLONG
threshhold_dist=20 # adjust to your needs
threshhold_locations=20 # minimum # of locations needed in a cluster
earth_radius_km = 6371
def coord2cart(lat,lng):
    x = math.cos(lat) * math.cos(lng)
    y = math.cos(lat) * math.sin(lng)
    z = math.sin(lat)
    return (x,y,z)
def cart2corrd(x,y,z):
    lon = math.atan2(y,x)
    hyp = math.sqrt(x*x + y*y)
    lat = math.atan2(z,hyp)
    return(lat,lon)
def dist(lat1,lng1,lat2,lng2):
    global to_rad, earth_radius_km
    dLat = (lat2-lat1) * to_rad
    dLon = (lng2-lng1) * to_rad
    lat1_rad = lat1 * to_rad
    lat2_rad = lat2 * to_rad
    a = math.sin(dLat/2) * math.sin(dLat/2) + math.sin(dLon/2) * math.sin(dLon/2) * math.cos(lat1_rad) * math.cos(lat2_rad)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a));
    dist = earth_radius_km * c
    return dist
def bounding_box(src, neighbors):
    neighbors.append(src)
    # nw = NorthWest se=SouthEast
    nw_lat = -360
    nw_lng = 360
    se_lat = 360
    se_lng = -360
    for (y,x) in neighbors:
        if y > nw_lat: nw_lat = y
        if x > se_lng: se_lng = x
        if y < se_lat: se_lat = y
        if x < nw_lng: nw_lng = x
    # add some padding
    pad = 0.5
    nw_lat += pad
    nw_lng -= pad
    se_lat -= pad
    se_lng += pad
    #print("answer:")
    #print("nw lat,lng : %s %s" % (nw_lat,nw_lng))
    #print("se lat,lng : %s %s" % (se_lat,se_lng))
    # suitable for R's map() function
    return (se_lat,nw_lat,nw_lng,se_lng)
def sitesDist(site1,site2):
    # just a helper to shorten the list comprehension below
    return dist(site1[0],site1[1], site2[0], site2[1])
def load_site_data():
    global fname
    sites = defaultdict(tuple)
    data = open(fname,encoding="latin-1")
    data.readline() # skip header
    for line in data:
        line = line[:-1]
        slots = line.split("\t")
        lat = float(slots[0])
        lng = float(slots[1])
        lat_rad = lat * math.pi / 180.0
        lng_rad = lng * math.pi / 180.0
        sites[(lat,lng)] = (lat,lng) #(lat_rad,lng_rad)
    return sites
def main():
    color_list = ( "red", "blue", "green", "yellow", "orange", "brown", "pink", "purple" )
    color_idx = 0
    sites_dict = {}
    sites = load_site_data()
    for site in sites:
        # for each site put it in a dictionary with its value being an array of neighbors
        sites_dict[site] = [x for x in sites if x != site and sitesDist(site,x) < threshhold_dist]
    print("")
    print('map("state", plot=T)') # or use: county instead of state
    print("")
    results = {}
    for site in sites:
        j = len(sites_dict[site])
        if j >= threshhold_locations:
            coord = bounding_box( site, sites_dict[site] )
            results[coord] = coord
    for bbox in results:
        yx="ylim=c(%s,%s), xlim=c(%s,%s)" % (results[bbox]) #(se_lat,nw_lat,nw_lng,se_lng)
        # important!
        # if you want an individual map for each cluster, uncomment this line
        #print('map("county", plot=T, fill=T, col=palette(), %s)' % yx)
        if len(color_list) == color_idx:
            color_idx = 0
        rect='rect(%s,%s, %s,%s, col=c("%s"))' % (results[bbox][2], results[bbox][0], results[bbox][3], results[bbox][1], color_list[color_idx])
        color_idx += 1
        print(rect)
        print("")
main()

I'm doing this on a regular basis by first creating a distance matrix and then running clustering on it. Here is my code.
library(geosphere)
library(cluster)
clusteramounts <- 10
distance.matrix <- (distm(points.to.group[,c("lon","lat")]))
clustersx <- as.hclust(agnes(distance.matrix, diss = T))
points.to.group$group <- cutree(clustersx, k=clusteramounts)
I'm not sure if it completely solves your problem. You might want to test with different k, and also perhaps do a second run of clustering of some of the first clusters in case they are too big, like if you have one point in Minnesota and a thousand in California.
When you have the points.to.group$group, you can get the bounding boxes by finding max and min lat lon per group.
If you want X to be 20, and you have 18 points in New York and 22 in Dallas, you must decide if you want one small and one really big box (20 points each), if it is better to have the Dallas box include 22 points, or if you want to split the 22 points in Dallas into two groups. Clustering based on distance can be good in some of these cases. But it of course depends on why you want to group the points.
/Chris
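Getting the bounding boxes from the group labels is a small step. Here is a minimal Python sketch of the "max and min lat lon per group" idea, assuming the labels have already been computed (for example by the R code in this answer) and exported as (lat, lon, group) rows. The function name and the sample rows are made up.

```python
from collections import defaultdict

def bounding_boxes(rows):
    """Map each group label to (min_lat, max_lat, min_lon, max_lon)."""
    grouped = defaultdict(list)
    for lat, lon, group in rows:
        grouped[group].append((lat, lon))
    boxes = {}
    for group, pts in grouped.items():
        lats = [p[0] for p in pts]
        lons = [p[1] for p in pts]
        boxes[group] = (min(lats), max(lats), min(lons), max(lons))
    return boxes

# Hypothetical labelled points: (lat, lon, group).
rows = [
    (37.3857, -122.026, 1),
    (37.4020, -122.078, 1),
    (37.4800, -122.144, 1),
    (40.7128, -74.0060, 2),
    (40.6500, -73.9500, 2),
]
boxes = bounding_boxes(rows)
print(boxes)
```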

A few ideas:
Ad-hoc & approximate: The "2-D histogram". Create arbitrary "rectangular" bins, of the degree width of your choice, and assign each bin an ID. Placing a point in a bin means "associate the point with the ID of the bin". Upon each add to a bin, ask the bin how many points it has. Downsides: it doesn't correctly "see" a cluster of points that straddles a bin boundary, and bins of "constant longitudinal width" actually are (spatially) smaller as you move north.
Use the "Shapely" library for Python. Follow its stock example for "buffering points", and do a cascaded union of the buffers. Look for globs over a certain area, or that "contain" a certain number of original points. Note that Shapely is not intrinsically "geo-savvy", so you'll have to add corrections if you need them.
Use a true DB with spatial processing. MySQL, Oracle, Postgres (with PostGIS), MSSQL all (I think) have "Geometry" and "Geography" datatypes, and you can do spatial queries on them (from your Python scripts).
Each of these has different costs in dollars and time (in the learning curve)... and different degrees of geospatial accuracy. You have to pick what suits your budget and/or requirements.
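The first idea, the "2-D histogram", is simple enough to sketch with the standard library. The bin width, the threshold and the function names below are assumptions chosen for illustration, and the sketch shares the downsides already listed: clusters that straddle a bin boundary are missed, and bins shrink in real size toward the poles.

```python
import math
from collections import Counter

def dense_bins(points, bin_deg=1.0, min_points=3):
    """Return {(lat_bin, lon_bin): count} for bins holding at least min_points points."""
    counts = Counter(
        (math.floor(lat / bin_deg), math.floor(lon / bin_deg)) for lat, lon in points
    )
    return {b: n for b, n in counts.items() if n >= min_points}

def bin_bounds(bin_id, bin_deg=1.0):
    """Bounding box (south, north, west, east) of a bin, in degrees."""
    i, j = bin_id
    return (i * bin_deg, (i + 1) * bin_deg, j * bin_deg, (j + 1) * bin_deg)

# A few (lat, lon) points in the style of the sample TSV from another answer.
points = [(37.3857, -122.026), (37.3992, -122.139), (37.402, -122.078),
          (37.48, -122.144), (36.3312, -94.1334), (37.3895, -97.644)]
clusters = dense_bins(points)
for b, n in clusters.items():
    print(n, "points in", bin_bounds(b))
```

The bin ID doubles as the cluster's bounding box, which is what makes this approach cheap; it is also why it is only approximate.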

If you use Shapely, you could extend my cluster_points function
to return the bounding box of the cluster via the .bounds property of the Shapely geometry, for example like this (note that append takes a single argument, so the pair is wrapped in a tuple):
clusterlist.append((cluster, (poly.buffer(-b)).bounds))

maybe something like
from math import sqrt
def dist(lat1,lon1,lat2,lon2):
    # just return normal x,y dist
    return sqrt((lat1-lat2)**2+(lon1-lon2)**2)
def sitesDist(site1,site2):
    # just a helper to shorten the list comprehension below
    return dist(site1.lat,site1.lon,site2.lat,site2.lon)
sites_dict = {}
threshhold_dist=5 #example dist
for site in sites:
    # for each site put it in a dictionary with its value being an array of neighbors
    sites_dict[site] = [x for x in sites if x != site and sitesDist(site,x) < threshhold_dist]
print("\n".join(str(site) for site in sites_dict))
