Not getting direction value in my velocity calculation - python

I am working on a ground station to track our future CubeSat. One of the things it needs to do is calculate the Doppler shift in the frequency as the satellite passes overhead. To test this we are using the ISS TLE. I am using Skyfield and it has been super helpful, but I am having a simple issue that I can't seem to figure out. I need the velocity of the object, which was easy to obtain, but I also need its direction relative to my position. I assumed that would be part of the velocity vector, since velocity has both magnitude and direction. Maybe I'm missing something obvious in the code. My current workaround is to get the distance at two points in time and work out whether the satellite is closing in or moving away, then multiply the velocity by -1 if it's closing in and by 1 otherwise. I figured something like this would be handled by .velocity, but it does not seem so.
diff = satObsDiff.at(ts_now)
diff1 = satObsDiff.at(ts_next)
velocity = diff.speed().km_per_s * 1000 #converts km to m
print("Velocity: ")
print(velocity)
adjusted_velocity = velocity
range1 = diff.distance().km
range2 = diff1.distance().km
change = (range1 - range2)*1000
direction = 1
if change >= 0:
    direction = 1
else:
    direction = -1
observed_freq = ((C/(C + (adjusted_velocity * direction))) * emitted_freq)

The dot product of two vectors is positive if they are pointing in the same direction, zero if they are at right angles, and negative if they are pointing away from each other. So I suspect you can tell an approaching from a receding satellite by computing the dot product of the relative position with the velocity and checking whether it is more or less than zero:
d = np.dot(diff.velocity.km_per_s, diff.position.km)
if d > 0:
    print('receding')
else:
    print('approaching')
You can also toss in some division if you want to know how fast the range is changing:
https://stackoverflow.com/a/55226228/85360
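The division mentioned above amounts to dividing the dot product by the length of the relative position vector, which gives the range rate (how fast the distance is changing, with its sign). Here is a minimal sketch in plain Python, independent of Skyfield. The function names and the sample numbers are made up for illustration, and the Doppler formula is the same non-relativistic one used in the question.

```python
import math

C = 299_792_458.0  # speed of light in m/s

def range_rate(position, velocity):
    """Rate of change of the observer-to-satellite distance.

    position and velocity are 3-component sequences in consistent units
    (for example km and km/s). Positive means receding, negative approaching.
    """
    dot = sum(p * v for p, v in zip(position, velocity))
    distance = math.sqrt(sum(p * p for p in position))
    return dot / distance

def observed_frequency(emitted_freq, range_rate_m_per_s):
    """Non-relativistic Doppler shift; a negative range rate raises the frequency."""
    return C / (C + range_rate_m_per_s) * emitted_freq

# Hypothetical numbers: satellite 1000 km away along x, moving 7 km/s toward the observer.
rate_km_s = range_rate((1000.0, 0.0, 0.0), (-7.0, 0.0, 0.0))
freq = observed_frequency(437.0e6, rate_km_s * 1000.0)  # km/s converted to m/s
```

Because the sign comes out of the dot product, no separate "direction" flag or second time sample is needed.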

Related

Problem With Collision Detection In Turtle [Python]

I'm using an if statement to detect whether the player's coordinates after moving up (y coordinate increases) are equal to an open space's coordinates.
Ex:
player_coordinate = player.pos() # We can say that the coordinate returned is (-10.00,10.00)
space_coordinate = space.pos() # And the coordinate returned is (-10.00,20.00)
movement = 10 # How far the player can move at once
if (player_coordinate[0], player_coordinate[1] + movement) == space_coordinate:
    ...  # player can move
Now, I used this same method, however when the player's position has a negative y value, the statement is false.
For example:
# Works:
if (-90.0, 30.0) == (-90.00,30.00)
# Doesn't Work
if (-90.0, -10.0) == (-90.00,-10.00)
(By the way, the first tuple uses the values stated previously, player_coordinate[0], player_coordinate[1] + movement, so I have no clue why it comes back with one decimal place instead of two like in the original .pos() tuple, but it shouldn't matter because the problem only occurs when the y-value is negative.)
It looks like it is saying that -10.0 is not equal to -10.00. Any ideas on why this might not be working?
Here is the actual code too if it helps. I use 'in' because I'm storing all of the space coordinates in a dictionary:
def player_movement(player, total, direction):
    current = player.pos()
    if direction == 'up':
        if (current[0], current[1] + total) in space_coordinates.values():
            player.shape('Player_Up.gif') # Changes player sprite
            player.goto(current[0], current[1] + total) # Moves player
I already checked if the coordinate I needed was in the dictionary and it was
Turtles wander a floating point plane so exact comparisons are to be avoided. Instead of asking:
turtle_1.position() == turtle_2.position()
you should consider asking:
turtle_1.distance(turtle_2) < 10
That is, are they in close proximity to each other, not on the exact same coordinate.
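The same idea works when the open spaces are stored as plain coordinate tuples in a dictionary, as in the question. This is a small sketch without turtle itself; `same_spot`, `find_space` and the tolerance value are hypothetical names and choices made for illustration.

```python
import math

def same_spot(a, b, tolerance=1e-6):
    """True when two (x, y) points are within `tolerance` of each other."""
    return math.hypot(a[0] - b[0], a[1] - b[1]) <= tolerance

def find_space(target, spaces, tolerance=1e-6):
    """Return the key of the first stored coordinate that matches target, else None."""
    for name, coord in spaces.items():
        if same_spot(target, coord, tolerance):
            return name
    return None

spaces = {"a": (-90.0, -10.0), "b": (-90.0, 30.0)}

# A position that has drifted by one floating point step away from -10.0
drifted = (-90.0, -10.000000000000002)

exact_match = drifted in spaces.values()   # exact comparison misses it
close_match = find_space(drifted, spaces)  # tolerant comparison finds "a"
```

The tolerance only needs to be much smaller than the grid spacing (10 in the question) and much larger than the rounding error.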

Inverse square separation of boids repels boids unevenly

I'm new to programming and I'm trying to make a little boids algorithm in python, so far I've written a method to keep the boids apart from one another using an inverse square function, and it looks like this:
def separation(self, boids):
    repulsion = Vector(0, 0)
    magnitude = 0
    for Boid in boids:
        if Boid != self:
            distance = Boid.position - self.position
            if np.linalg.norm(distance) < 100:
                magnitude = 100/(np.linalg.norm(distance) ** 2)
                direction = math.atan2(distance.y, distance.x)
                repulsion = repulsion - Vector(magnitude * cos(direction), magnitude * sin(direction))
    return repulsion
Since there's no privilege or anything to one boid, any two boids should repel each other with the same amount of force. However, when I ran a test with 2 boids separated by 10 units and no initial velocity, one boid accelerated noticeably faster than the other. I traced the error to the distance variable which the boids use to calculate the strength of the repulsion, and I made both boids print this variable. On the first frame of time, one boid saw the other as 10 units away while the other saw it as 11 units away (actually, it's -11, but since it gets squared the sign doesn't matter). I then printed out their positions and subtracted them to manually calculate their distance values on the first frame of time to see if it had to do with the equation for distance, and it produced 10 and -10, the correct values. I've tried rewriting the distance variable as self.distance and then writing self.distance = Boid.distance to make both boids see each other as the same distance apart, but it made no difference.
You are updating each boid one at a time. Some of the boids see the others move before they have a chance to. This means they never saw the other as close as the other saw them. This is the source of the asymmetry.
The problem lies in when the result of the separation call is applied to the boid's position.
You need to update all of the positions simultaneously. That is you need all of the separation calls to be done before you update the positions in reaction to the repulsive force.
Think of it like this. I have two variables:
a = 1
b = 1
I want to add each of them to the other. If I do:
a += b
b += a
I get:
a == 2
b == 3
This isn't what I want.
I want them both to be 2.
If I do it like this instead:
(a, b) = (a + b, b + a)
I get what I wanted.
Now this fancy expression hides a truth. It can be written without the tuple assignment to show what is really going on:
a_ = a + b
b_ = b + a
a = a_
b = b_
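Applied to boids, the same two-phase pattern looks roughly like the sketch below: compute every force from one frozen snapshot of positions, and only then move anything. The `Boid` class here is a stripped-down stand-in using plain lists, not the asker's `Vector` class, and `step` is a hypothetical name.

```python
import math

class Boid:
    def __init__(self, x, y):
        self.position = [x, y]
        self.velocity = [0.0, 0.0]

    def separation(self, boids, radius=100.0, strength=100.0):
        """Inverse-square repulsion from every other boid within radius."""
        fx = fy = 0.0
        for other in boids:
            if other is self:
                continue
            dx = other.position[0] - self.position[0]
            dy = other.position[1] - self.position[1]
            dist = math.hypot(dx, dy)
            if 0.0 < dist < radius:
                magnitude = strength / dist ** 2
                fx -= magnitude * dx / dist
                fy -= magnitude * dy / dist
        return fx, fy

def step(boids, dt=1.0):
    # Phase 1: every force is computed before any boid has moved.
    forces = [b.separation(boids) for b in boids]
    # Phase 2: only now update velocities and positions.
    for b, (fx, fy) in zip(boids, forces):
        b.velocity[0] += fx * dt
        b.velocity[1] += fy * dt
        b.position[0] += b.velocity[0] * dt
        b.position[1] += b.velocity[1] * dt

a, b = Boid(0.0, 0.0), Boid(10.0, 0.0)
step([a, b])
```

With two boids 10 units apart, both now see a distance of 10 on the first frame and receive equal and opposite pushes.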

Need Help Trying to Simplify this algorithm to map points on an arbitrarily large 2d plane to unique integers

So, like the title says, I need help mapping points from a 2d plane to a number line in such a way that each point is associated with a unique positive integer. Put another way, I need a function f: Z x Z -> Z+ and I need f to be injective. Additionally, I need it to run in a reasonable time.
So the way I've thought about doing this is to basically just count points, starting at (1,1) and spiraling outwards.
Below I've written some python code to do this for some point (i,j)
def plot_to_int(i,j):
    a=max(i,j) #we want to find which "square" we are in
    b=(a-1)**2 #we can start the count from the last square
    J=abs(j)
    I=abs(i)
    if i>0 and j>0: #the first quadrant
        #we start counting anticlockwise
        if I>J:
            b+=J
            #we start from the edge and count up along j
        else:
            b+=J+(J-i)
            #when we turn the corner, we add to the count, increasing as i decreases
    elif i<0 and j>0: #the second quadrant
        b+=2*a-1 #the total count from the first quadrant
        if J>I:
            b+=I
        else:
            b+=I+(I-J)
    elif i<0 and j<0: #the third quadrant
        b+=(2*a-1)*2 #the count from the first two quadrants
        if I>J:
            b+=J
        else:
            b+=J+(J-I)
    else:
        b+=(2*a-1)*3
        if J>I:
            b+=I
        else:
            b+=I+(I-J)
    return b
I'm pretty sure this works, but as you can see it is quite a bulky function. I'm trying to think of some way to simplify this "spiral counting" logic. Or possibly, if there's another counting method that is simpler to code, that would work too.
Here's a half-baked idea:
1. For every point, calculate f = x + (y-y_min)/(y_max-y_min)
2. Find the smallest delta d between any given f_n and f_{n+1}. Multiply all the f values by 1/d so that all f values are at least 1 apart.
3. Take the floor() of all the f values.
This is sort of like a projection onto the x-axis, but it tries to spread out the values so that it preserves uniqueness.
UPDATE:
If you don't know all the data and will need to feed in new data in the future, maybe there's a way to hardcode an arbitrarily large or small constant for y_max and y_min in step 1, and an arbitrary delta d for step 2, according to the boundaries of the data values you expect. Or a way to calculate values for these according to the limits of floating point arithmetic.
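For comparison, here is a different and well-known construction that needs no knowledge of the data in advance. It is not the projection idea described above and not the asker's spiral: it folds each integer onto the non-negative integers and then applies the Cantor pairing function. The function names are made up for this sketch.

```python
def to_natural(n):
    """Fold the integers onto 0, 1, 2, ...: 0, -1, 1, -2, 2 map to 0, 1, 2, 3, 4."""
    return 2 * n if n >= 0 else -2 * n - 1

def point_to_int(i, j):
    """Injective map from Z x Z to the positive integers (Cantor pairing)."""
    a, b = to_natural(i), to_natural(j)
    return (a + b) * (a + b + 1) // 2 + b + 1

# Quick check on a small grid: every point gets its own positive integer.
grid = [(i, j) for i in range(-20, 21) for j in range(-20, 21)]
codes = {point_to_int(i, j) for i, j in grid}
```

Python integers are arbitrary precision, so this keeps working on an arbitrarily large plane; the trade-off is that the numbers grow roughly with the square of the coordinates.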

Number of shortest paths

Here is the problem:
Given the input n = 4 x = 5, we must imagine a chessboard that is 4 squares across (x-axis) and 5 squares tall (y-axis). (This input changes, all the way up to n = 200 x = 200.)
Then, we are asked to determine the minimum shortest path from the bottom left square on the board to the top right square on the board for the Knight (the Knight can move 2 spaces on one axis, then 1 space on the other axis).
My current ideas:
Use a 2d array to store all the possible moves, perform breadth-first search (BFS) on the 2d array to find the shortest path.
Floyd-Warshall shortest path algorithm.
Create an adjacency list and perform BFS on that (but I think this would be inefficient).
To be honest though I don't really have a solid grasp on the logic.
Can anyone help me with pseudocode, Python code, or even just a logical walk-through of the problem?
BFS is efficient enough for this problem, as its complexity is O(n*x): you explore each cell only once. To keep track of the number of shortest paths, you just need an auxiliary array to store them.
You can also use A* to solve this faster but it's not necessary in this case because it is a programming contest problem.
dist = {}
ways = {}
def bfs():
    start = 1,1
    goal = 6,6
    queue = [start]
    dist[start] = 0
    ways[start] = 1
    while len(queue):
        cur = queue[0]
        queue.pop(0)
        if cur == goal:
            print("reached goal in %d moves and %d ways"%(dist[cur],ways[cur]))
            return
        for move in [ (1,2),(2,1),(-1,-2),(-2,-1),(1,-2),(-1,2),(-2,1),(2,-1) ]:
            next_pos = cur[0]+move[0], cur[1]+move[1]
            if next_pos[0] > goal[0] or next_pos[1] > goal[1] or next_pos[0] < 1 or next_pos[1] < 1:
                continue
            if next_pos in dist and dist[next_pos] == dist[cur]+1:
                ways[next_pos] += ways[cur]
            if next_pos not in dist:
                dist[next_pos] = dist[cur]+1
                ways[next_pos] = ways[cur]
                queue.append(next_pos)
bfs()
Output
reached goal in 4 moves and 4 ways
Note that the number of ways to reach the goal can get exponentially big
I suggest:
Use BFS backwards from the target location to calculate (in just O(nx) total time) the minimum distance to the target (x, n) in knight's moves from each other square. For each starting square (i, j), store this distance in d[i][j].
Calculate c[i][j], the number of minimum-length paths starting at (i, j) and ending at the target (x, n), recursively as follows:
c[x][n] = 1
c[i][j] = the sum of c[p][q] over all (p, q) such that both
(p, q) is a knight's-move-neighbour of (i, j), and
d[p][q] = d[i][j]-1.
Use memoisation in step 2 to keep the recursion from taking exponential time. Alternatively, you can compute c[][] bottom-up with a slightly modified second BFS (also backwards) as follows:
c = x by n array with each entry initially 0;
c[x][n] = 1;
seen = x by n array with each entry initially 0;
s = createQueue();
push(s, (x, n));
while (notEmpty(s)) {
    (i, j) = pop(s);
    for (each location (p, q) that is a knight's-move-neighbour of (i, j)) {
        if (d[p][q] == d[i][j] + 1) {
            c[p][q] = c[p][q] + c[i][j];
            if (seen[p][q] == 0) {
                push(s, (p, q));
                seen[p][q] = 1;
            }
        }
    }
}
The idea here is to always compute c[][] values for all positions having some given distance from the target before computing any c[][] value for a position having a larger distance, as the latter depend on the former.
The length of a shortest path will be d[1][1], and the number of such shortest paths will be c[1][1]. Total computation time is O(nx), which is clearly best-possible in an asymptotic sense.
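A compact Python sketch of this approach follows. As a simplification it folds the distance pass and the counting pass into a single backward BFS from the target, which works because a FIFO queue finishes every square at distance k before it touches any square at distance k+1. The function name is hypothetical, and the board is taken as n squares across and x squares tall with 1-based coordinates, as in the question.

```python
from collections import deque

KNIGHT_MOVES = [(1, 2), (2, 1), (-1, -2), (-2, -1),
                (1, -2), (-1, 2), (-2, 1), (2, -1)]

def count_shortest_knight_paths(n, x):
    """Return (moves, paths) from (1, 1) to (n, x), or None if unreachable."""
    target = (n, x)
    d = {target: 0}   # d[square]: knight distance from square to the target
    c = {target: 1}   # c[square]: number of shortest paths from square to the target
    queue = deque([target])
    while queue:
        i, j = queue.popleft()
        for di, dj in KNIGHT_MOVES:
            p, q = i + di, j + dj
            if not (1 <= p <= n and 1 <= q <= x):
                continue                      # off the board
            if (p, q) not in d:
                d[(p, q)] = d[(i, j)] + 1     # first visit fixes the distance
                c[(p, q)] = 0
                queue.append((p, q))
            if d[(p, q)] == d[(i, j)] + 1:
                c[(p, q)] += c[(i, j)]        # (i, j) is one step closer to the target
    start = (1, 1)
    if start not in d:
        return None
    return d[start], c[start]

result_6x6 = count_shortest_knight_paths(6, 6)
```

Each square is queued once and examines eight neighbours, so the running time is O(nx), matching the bound stated above.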
My approach to this question would be backtracking as the number of squares in the x-axis and y-axis are different.
Note: Backtracking algorithms can be slow for certain cases and fast for the other
Create a 2-d array for the chessboard. You know the starting index and the final index. To reach the final index you need to keep close to the diagonal that joins the two indexes.
From the starting index, look at all the indexes the knight can travel to and choose the one closest to the diagonal, then keep traversing. If there is no way to travel any further, backtrack one step and move to the next location available from there.
PS: This is a bit similar to a well-known problem, the Knight's Tour, in which, choosing any starting point, you have to find a path along which the knight covers all squares. I have coded this as a Java GUI application; I can send you the link if you want any help.
Hope this helps!!
Try something. Draw boards of the following sizes: 1x1, 2x2, 3x3, 4x4, and a few odd ones like 2x4 and 3x4. Starting with the smallest board and working to the largest, start at the bottom left corner and write a 0, then find all moves from zero and write a 1, find all moves from 1 and write a 2, etc. Do this until there are no more possible moves.
After doing this for all 6 boards, you should have noticed a pattern: Some squares couldn't be moved to until you got a larger board, but once a square was "discovered" (ie could be reached), the number of minimum moves to that square was constant for all boards not smaller than the board on which it was first discovered. (Smaller means less than n OR less than x, not less than (n * x) )
This tells something powerful, anecdotally. All squares have a number associated with them that must be discovered. This number is a property of the square, NOT the board, and is NOT dependent on size/shape of the board. It is always true. However, if the square cannot be reached, then obviously the number is not applicable.
So you need to find the number of every square on a 200x200 board, and you need a way to see if a board is a subset of another board to determine if a square is reachable.
Remember, in these programming challenges, some questions that are really hard can be solved in O(1) time by using lookup tables. I'm not saying this one can, but keep that trick in mind. For this one, pre-calculating the 200x200 board numbers and saving them in an array could save a lot of time, whether it is done only once on first run or run before submission and then the results are hard coded in.
If the problem needs move sequences rather than number of moves, the idea is the same: save move sequences with the numbers.

speeding up processing 5 million rows of coordinate data

I have a csv file with two columns (latitude, longitude) that contains over 5 million rows of geolocation data.
I need to identify the points which are not within 5 miles of any other point in the list, and output everything back into another CSV that has an extra column (CloseToAnotherPoint) which is True if another point is within 5 miles, and False if there isn't.
Here is my current solution using geopy (not making any web calls, just using the function to calculate distance):
from geopy.point import Point
from geopy.distance import vincenty
import csv
class CustomGeoPoint(object):
    def __init__(self, latitude, longitude):
        self.location = Point(latitude, longitude)
        self.close_to_another_point = False
try:
    output = open('output.csv','w')
    writer = csv.writer(output, delimiter = ',', quoting=csv.QUOTE_ALL)
    writer.writerow(['Latitude', 'Longitude', 'CloseToAnotherPoint'])
    # 5 miles
    close_limit = 5
    geo_points = []
    with open('geo_input.csv', newline='') as geo_csv:
        reader = csv.reader(geo_csv)
        next(reader, None) # skip the headers
        for row in reader:
            geo_points.append(CustomGeoPoint(row[0], row[1]))
    # for every point, look at every point until one is found within 5 miles
    for geo_point in geo_points:
        for geo_point2 in geo_points:
            dist = vincenty(geo_point.location, geo_point2.location).miles
            if 0 < dist <= close_limit: # (0,close_limit]
                geo_point.close_to_another_point = True
                break
        writer.writerow([geo_point.location.latitude, geo_point.location.longitude,
                         geo_point.close_to_another_point])
finally:
    output.close()
As you might be able to tell from looking at it, this solution is extremely slow. So slow in fact that I let it run for 3 days and it still didn't finish!
I've thought about trying to split up the data into chunks (multiple CSV files or something) so that the inner loop doesn't have to look at every other point, but then I would have to figure out how to make sure the borders of each section checked against the borders of its adjacent sections, and that just seems overly complex and I'm afraid it would be more of a headache than it's worth.
So any pointers on how to make this faster?
Let's look at what you're doing.
You read all the points into a list named geo_points.
Now, can you tell me whether the list is sorted? Because if it was sorted, we definitely want to know that. Sorting is valuable information, especially when you're dealing with 5 million of anything.
You loop over all the geo_points. That's 5 million, according to you.
Within the outer loop, you loop again over all 5 million geo_points.
You compute the distance in miles between the two loop items.
If the distance is less than your threshold, you record that information on the first point, and stop the inner loop.
When the inner loop stops, you write information about the outer loop item to a CSV file.
Notice a couple of things. First, you're looping 5 million times in the outer loop. And then you're looping 5 million times in the inner loop.
This is what O(n²) means.
The next time you see someone talking about "Oh, this is O(log n) but that other thing is O(n log n)," remember this experience - you're running an n² algorithm where n in this case is 5,000,000. Sucks, dunnit?
Anyway, you have some problems.
Problem 1: You'll eventually wind up comparing every point against itself. Which should have a distance of zero, meaning they will all be marked as within whatever distance threshold. If your program ever finishes, all the cells will be marked True.
Problem 2: When you compare point #1 with, say, point #12345, and they are within the threshold distance from each other, you are recording that information about point #1. But you don't record the same information about the other point. You know that point #12345 (geo_point2) is reflexively within the threshold of point #1, but you don't write that down. So you're missing a chance to just skip over 5 million comparisons.
Problem 3: If you compare point #1 and point #2, and they are not within the threshold distance, what happens when you compare point #2 with point #1? Your inner loop is starting from the beginning of the list every time, but you know that you have already compared the start of the list with the end of the list. You can reduce your problem space by half just by making your outer loop go i in range(0, 5million) and your inner loop go j in range(i+1, 5million).
Answers?
Consider your latitude and longitude on a flat plane. You want to know if there's a point within 5 miles. Let's think about a 10 mile square, centered on your point #1. That's a square centered on (X1, Y1), with a top left corner at (X1 - 5miles, Y1 + 5miles) and a bottom right corner at (X1 + 5miles, Y1 - 5miles). Now, if a point is within that square, it might not be within 5 miles of your point #1. But you can bet that if it's outside that square, it's more than 5 miles away.
As @SeverinPappadeaux points out, distance on a spheroid like Earth is not quite the same as distance on a flat plane. But so what? Set your square a little bigger to allow for the difference, and proceed!
Sorted List
This is why sorting is important. If all the points were sorted by X, then Y (or Y, then X - whatever) and you knew it, you could really speed things up. Because you could simply stop scanning when the X (or Y) coordinate got too big, and you wouldn't have to go through 5 million points.
How would that work? Same way as before, except your inner loop would have some checks like this:
five_miles = ... # Whatever math, plus an error allowance!
list_len = len(geo_points) # Don't call this 5 million times
for i, pi in enumerate(geo_points):
    if pi.close_to_another_point:
        continue # Remember if close to an earlier point
    pi0max = pi[0] + five_miles
    pi1min = pi[1] - five_miles
    pi1max = pi[1] + five_miles
    for j in range(i+1, list_len):
        pj = geo_points[j]
        # Assumes geo_points is sorted on [0] then [1]
        if pj[0] > pi0max:
            # Can't possibly be close enough, nor any later points
            break
        if pj[1] < pi1min or pj[1] > pi1max:
            # Can't be close enough, but a later point might be
            continue
        # Now do "real" comparison using accurate functions.
        if ...:
            pi.close_to_another_point = True
            pj.close_to_another_point = True
            break
What am I doing there? First, I'm getting some numbers into local variables. Then I'm using enumerate to give me an i value and a reference to the outer point. (What you called geo_point). Then, I'm quickly checking to see if we already know that this point is close to another one.
If not, we'll have to scan. So I'm only scanning "later" points in the list, because I know the outer loop scans the early ones, and I definitely don't want to compare a point against itself. I'm using a few temporary variables to cache the result of computations involving the outer loop. Within the inner loop, I do some stupid comparisons against the temporaries. They can't tell me if the two points are close to each other, but I can check if they're definitely not close and skip ahead.
Finally, if the simple checks pass then go ahead and do the expensive checks. If a check actually passes, be sure to record the result on both points, so we can skip doing the second point later.
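To make the sorted scan concrete, here is a self-contained sketch that works on plain (latitude, longitude) tuples and uses a haversine great-circle distance as the "real" comparison. Several things are assumptions of this sketch rather than part of the answer: the Earth radius, the 69 miles per degree of latitude figure with a 5% allowance, the function names, and the choice to drop exact duplicates. It also deliberately does not skip already-flagged points or stop at the first hit, since a flagged point may still be the only neighbour of a later point.

```python
import math

EARTH_RADIUS_MILES = 3958.8    # mean radius; an assumption of this sketch
MILES_PER_DEGREE_LAT = 69.0    # rough figure, padded below

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))

def flag_close_points(points, limit_miles=5.0):
    """Return {point: bool}: does another point lie within limit_miles?"""
    pts = sorted(set(points))                 # sort by latitude, then longitude
    close = [False] * len(pts)
    lat_window = 1.05 * limit_miles / MILES_PER_DEGREE_LAT   # degrees, padded by 5%
    for i, (lat_i, lon_i) in enumerate(pts):
        for j in range(i + 1, len(pts)):
            lat_j, lon_j = pts[j]
            if lat_j - lat_i > lat_window:
                break                         # all later points are even farther north
            if haversine_miles(lat_i, lon_i, lat_j, lon_j) <= limit_miles:
                close[i] = close[j] = True    # record the result on both points
    return dict(zip(pts, close))

sample = [(40.0, -75.0), (40.06, -75.0), (40.12, -75.0), (45.0, -75.0)]
flags = flag_close_points(sample)
```

In the sample, the first three points form a chain (each about 4 miles from the next), so all three are flagged, while the point at latitude 45 is not.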
Unsorted List
But what if the list is not sorted?
@RootTwo points you at a kD tree (where D is for "dimensional" and k in this case is "2"). The idea is really simple, if you already know about binary search trees: you cycle through the dimensions, comparing X at even levels in the tree and comparing Y at odd levels (or vice versa). The idea would be this:
def insert_node(node, treenode, depth=0):
    dimension = depth % 2 # even/odd -> lat/long
    dn = node.coord[dimension]
    dt = treenode.coord[dimension]
    if dn < dt:
        # go left
        if treenode.left is None:
            treenode.left = node
        else:
            insert_node(node, treenode.left, depth+1)
    else:
        # go right
        if treenode.right is None:
            treenode.right = node
        else:
            insert_node(node, treenode.right, depth+1)
What would this do? This would get you a searchable tree where points could be inserted in O(log n) time. That means O(n log n) for the whole list, which is way better than n squared! (The log base 2 of 5 million is basically 23. So n log n is 5 million times 23, compared with 5 million times 5 million!)
It also means you can do a targeted search. Since the tree is ordered, it's fairly straightforward to look for "close" points (the Wikipedia link from @RootTwo provides an algorithm).
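One way such a targeted search could look is sketched below. It assumes a hypothetical `Node` class with `coord`, `left` and `right` attributes, treats the coordinates as points on a flat plane with ordinary Euclidean distance, and repeats the insertion rule so that the example runs on its own. For latitude and longitude the radius would need the same kind of allowance discussed earlier.

```python
class Node:
    def __init__(self, coord):
        self.coord = coord   # (x, y) tuple
        self.left = None
        self.right = None

def insert_node(node, treenode, depth=0):
    """Smaller values on the current axis go left, everything else goes right."""
    axis = depth % 2
    side = 'left' if node.coord[axis] < treenode.coord[axis] else 'right'
    child = getattr(treenode, side)
    if child is None:
        setattr(treenode, side, node)
    else:
        insert_node(node, child, depth + 1)

def has_neighbor_within(root, target, radius):
    """True if some point other than target itself lies within radius of target."""
    stack = [(root, 0)]
    while stack:
        treenode, depth = stack.pop()
        if treenode is None:
            continue
        dx = treenode.coord[0] - target[0]
        dy = treenode.coord[1] - target[1]
        if treenode.coord != target and dx * dx + dy * dy <= radius * radius:
            return True
        axis = depth % 2
        diff = target[axis] - treenode.coord[axis]
        # Always descend into the side the target falls on; visit the other
        # side only when the splitting line is no farther away than radius.
        near, far = ((treenode.left, treenode.right) if diff < 0
                     else (treenode.right, treenode.left))
        stack.append((near, depth + 1))
        if abs(diff) <= radius:
            stack.append((far, depth + 1))
    return False

points = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0), (10.5, 10.0)]
root = Node(points[0])
for p in points[1:]:
    insert_node(Node(p), root)
```

Inserting points in sorted order produces a very lopsided tree, so for real data it helps to shuffle the points first or build the tree from medians.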
Advice
My advice is to just write code to sort the list, if needed. It's easier to write, and easier to check by hand, and it's a separate pass you will only need to make one time.
Once you have the list sorted, try the approach I showed above. It's close to what you were doing, and it should be easy for you to understand and code.
As the answer to Python calculate lots of distances quickly points out, this is a classic use case for k-D trees.
An alternative is to use a sweep line algorithm, as shown in the answer to How do I match similar coordinates using Python?
Here's the sweep line algorithm adapted for your question. On my laptop, it takes < 5 minutes to run through 5M random points.
import itertools as it
import operator as op
import sortedcontainers # handy library on Pypi
import time
from collections import namedtuple
from math import cos, degrees, pi, radians, sqrt
from random import sample, uniform
Point = namedtuple("Point", "lat long has_close_neighbor")
miles_per_degree = 69
number_of_points = 5000000
data = [Point(uniform( -88.0, 88.0),   # lat
              uniform(-180.0, 180.0),  # long
              True
              )
        for _ in range(number_of_points)
        ]
start = time.time()
# Note: lat is first in Point, so data is sorted by .lat then .long.
data.sort()
print(time.time() - start)
# Parameter that determines the size of a sliding latitude window
# and therefore how close two points need to be to get flagged.
threshold = 5.0 # miles
lat_span = threshold / miles_per_degree
coarse_threshold = (.98 * threshold)**2
# Sliding latitude window. Within the window, observations are
# ordered by longitude.
window = sortedcontainers.SortedListWithKey(key=op.attrgetter('long'))
# lag_pt is the 'southernmost' point within the sliding window.
point = iter(data)
lag_pt = next(point)
milepost = len(data)//10
# lead_pt is the 'northernmost' point in the sliding window.
for i, lead_pt in enumerate(data):
    if i == milepost:
        print('.', end=' ')
        milepost += len(data)//10
    # Dec of lead_obs represents the leading edge of window.
    window.add(lead_pt)
    # Remove observations further than the trailing edge of window.
    while lead_pt.lat - lag_pt.lat > lat_span:
        window.discard(lag_pt)
        lag_pt = next(point)
    # Calculate 'east-west' width of window_size at dec of lead_obs
    long_span = lat_span / cos(radians(lead_pt.lat))
    east_long = lead_pt.long + long_span
    west_long = lead_pt.long - long_span
    # Check all observations in the sliding window within
    # long_span of lead_pt.
    for other_pt in window.irange_key(west_long, east_long):
        if other_pt != lead_pt:
            # lead_pt is at the top center of a box 2 * long_span wide by
            # 1 * long_span tall. other_pt is is in that box. If desired,
            # put additional fine-grained 'closeness' tests here.
            # coarse check if any pts within 80% of threshold distance
            # then don't need to check distance to any more neighbors
            average_lat = (other_pt.lat + lead_pt.lat) / 2
            delta_lat = other_pt.lat - lead_pt.lat
            delta_long = (other_pt.long - lead_pt.long)/cos(radians(average_lat))
            if delta_lat**2 + delta_long**2 <= coarse_threshold:
                break
            # put vincenty test here
            #if 0 < vincenty(lead_pt, other_pt).miles <= close_limit:
            #    break
    else:
        data[i] = data[i]._replace(has_close_neighbor=False)
print()
print(time.time() - start)
If you sort the list by latitude (n log(n)), and the points are roughly evenly distributed, it will bring it down to about 1000 points within 5 miles for each point (napkin math, not exact). By only looking at the points that are near in latitude, the runtime goes from n^2 to n*log(n)+.0004n^2. Hopefully this speeds it up enough.
I would give pandas a try. Pandas is made for efficient handling of large amounts of data. That may help with the efficiency of the csv portion anyhow. But from the sounds of it, you've got yourself an inherently inefficient problem to solve. You take point 1 and compare it against 4,999,999 other points. Then you take point 2 and compare it with 4,999,998 other points and so on. Do the math. That's 12.5 trillion comparisons you're doing. If you can do 1,000,000 comparisons per second, that's 144 days of computation. If you can do 10,000,000 comparisons per second, that's 14 days. For just additions in straight python, 10,000,000 operations can take something like 1.1 seconds, but I doubt your comparisons are as fast as an add operation. So give it at least a fortnight or two.
Alternately, you could come up with an alternate algorithm, though I don't have any particular one in mind.
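The arithmetic behind those estimates can be checked in a few lines. The two comparison rates are the hypothetical ones used above, not measurements.

```python
n = 5_000_000
comparisons = n * (n - 1) // 2      # every unordered pair exactly once

SECONDS_PER_DAY = 86_400
# Days of computation for each assumed rate (comparisons per second)
days_needed = {rate: comparisons / rate / SECONDS_PER_DAY
               for rate in (1_000_000, 10_000_000)}

for rate, days in days_needed.items():
    print(f"{rate:,} comparisons per second: about {days:.1f} days")
```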
I would redo the algorithm in three steps:
Use great-circle distance, and assume 1% error, so make the limit equal to 1.01*limit.
Code the great-circle distance as an inlined function; this test should be fast.
You'll get some false positives, which you could further test with vincenty.
A better solution, building on Oscar Smith's answer: you have a CSV file, so just sort it in Excel (it is very efficient). Then use binary search in your program to find the cities within 5 miles (you can make a small change to the binary search so that it stops as soon as it finds one city satisfying your condition).
Another improvement is to use a map to remember pairs of cities once you find that one city is within range of another. For example, when you find city A is within 5 miles of city B, use a Map to store the pair (B is the key and A is the value). Then the next time you meet B, look it up in the Map first; if it has a corresponding value, you do not need to check it again. This may use more memory, though, so keep an eye on it. Hope it helps you.
This is just a first pass, but I've sped it up by half so far by using great_circle() instead of vincenty(), and cleaning up a couple of other things. The difference is explained here, and the loss in accuracy is about 0.17%:
from geopy.point import Point
from geopy.distance import great_circle
import csv
CLOSE_LIMIT = 5 # miles, as in the question
class CustomGeoPoint(Point):
    def __init__(self, latitude, longitude):
        super(CustomGeoPoint, self).__init__(latitude, longitude)
        self.close_to_another_point = False
def isCloseToAnother(pointA, points):
    for pointB in points:
        dist = great_circle(pointA, pointB).miles
        if 0 < dist <= CLOSE_LIMIT: # (0, close_limit]
            return True
    return False
with open('geo_input.csv', 'r') as geo_csv:
    reader = csv.reader(geo_csv)
    next(reader, None) # skip the headers
    geo_points = sorted(map(lambda x: CustomGeoPoint(x[0], x[1]), reader))
with open('output.csv', 'w') as output:
    writer = csv.writer(output, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerow(['Latitude', 'Longitude', 'CloseToAnotherPoint'])
    # for every point, look at every point until one is found within CLOSE_LIMIT miles
    for point in geo_points:
        point.close_to_another_point = isCloseToAnother(point, geo_points)
        writer.writerow([point.latitude, point.longitude,
                         point.close_to_another_point])
I'm going to improve this further.
Before:
$ time python geo.py
real 0m5.765s
user 0m5.675s
sys 0m0.048s
After:
$ time python geo.py
real 0m2.816s
user 0m2.716s
sys 0m0.041s
This problem can be solved with a VP tree. These allow querying data
using any distance that is a metric obeying the triangle inequality.
The big advantage of VP trees over a k-D tree is that they can be blindly
applied to geographic data anywhere in the world without having to worry
about projecting it to a suitable 2D space. In addition a true geodesic
distance can be used (no need to worry about the differences between
geodesic distances and distances in the projection).
Here's my test: generate 5 million points randomly and uniformly on the
world. Put these into a VP tree.
Looping over all the points, query the VP tree to find any neighbor a
distance in (0km, 10km] away. (0km is not included in this set to avoid
the query point being found.) Count the number of points with no such
neighbor (which is 229573 in my case).
Cost of setting up the VP tree = 5000000 * 20 distance calculations.
Cost of the queries = 5000000 * 23 distance calculations.
Time for setup and queries is 5m 7s.
I am using C++ with GeographicLib for calculating distances, but
the algorithm can of course be implemented in any language and here's
the python version of GeographicLib.
ADDENDUM: The C++ code implementing this approach is given here.
