I'm working on a Project Euler problem (number 15, lattice paths). I've solved the problem another way, but I'm curious how to optimize the algorithm I initially used to try to solve it, because its running time grows very quickly and I'm kind of surprised at how long it actually takes. So I'm really looking to learn how to analyze and continue to optimize the algorithm.
This algorithm's approach is to use the corners as points - (0,0) in the top left, (2,2) in the bottom right for a 2x2 grid. From the top point, each step of a path will only be x+1 or y+1. So I pretty much iteratively form these paths by checking if the next allowable move exists in the space of points in the grid.
I initially started from the top left (x+1, y+1), but found it to be more efficient to go backwards from the bottom right, removed some redundancies, and started to store only the valuable data in memory. So that's where I am now. Can it be optimized any further? And what other types of applications would this have?
givenPoints is a list of all the points in the grid, each stored as a string, e.g. '0202'. The algorithm stores the most recent point of each unique path as opposed to the whole path, and at the end the number of entries in the list is equivalent to the number of unique paths.
import math
import time

def calcPaths4(givenPoints):
    paths = []
    paths.append(givenPoints[-1])
    dims = int(math.sqrt(len(givenPoints))) - 1
    numMoves = 2*dims
    numPaths = 0
    for x in range(0,numMoves):
        # time.clock() was removed in Python 3.8; perf_counter() replaces it
        t0 = time.perf_counter()
        newPaths = []
        for i in paths:
            origin = int(i)
            dest1 = origin - 1      # one step back along one axis
            dest3 = origin - 100    # one step back along the other axis
            if ('%04d' % dest1) in givenPoints:
                newPaths.append(('%04d' % dest1))
                numPaths += 1
            if ('%04d' % dest3) in givenPoints:
                newPaths.append(('%04d' % dest3))
                numPaths += 1
        t = time.perf_counter() - t0
        paths = newPaths
        print(str(x)+": " +str(t)+": " +str(len(paths)) )
    return(paths)
You've got the wrong approach. Going from the top left to the bottom right corner of the 20x20 grid takes 20 moves to the right and 20 moves down.
So you can think of any path as a sequence of length 40 with 20 elements that are right and 20 elements that are down. You simply have to count how many such arrangements there are.
Once you have fixed the positions of, say, the right moves, the down ones are fixed too, so the whole problem reduces to: in how many ways can you choose 20 positions from a set of 40?
This is simply the binomial coefficient C(40, 20).
Hence a solution is:
from math import factorial

def number_of_paths(k):
    """Number of paths from the top left to the bottom right corner in a kxk grid."""
    return factorial(2*k)//(factorial(k)**2)
Which can be made more efficient by noting that, with n = 2k, n!/(k!*k!) = (n·(n-1)···(k+1))/k!:
import operator as op
from functools import reduce
from math import factorial

def number_of_paths(k):
    """Number of paths from the top left to the bottom right corner in a kxk grid."""
    return reduce(op.mul, range(2*k, k, -1), 1)//factorial(k)
Note that the number of paths grows rapidly, which means any algorithm that works by creating the different paths is going to be slow. The only way to seriously "optimize" this is to change approach and avoid creating the paths but just counting them.
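If you are on Python 3.8 or later, the standard library already exposes the binomial coefficient as math.comb, so the same count can be written even more directly. A minimal sketch (the name number_of_paths_comb is only illustrative, not from the answer above):

```python
from math import comb

def number_of_paths_comb(k):
    """Number of monotone lattice paths across a k x k grid: C(2k, k)."""
    return comb(2 * k, k)

print(number_of_paths_comb(2))   # 6
print(number_of_paths_comb(20))  # 137846528820
```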
I'll point out a different, more general, approach: recursion and memoization/dynamic programming.
When the path is at a certain position (x,y) it can either go right to (x-1,y) or go down to (x, y-1). So the number of paths from that point to the bottom right is the sum of the number of paths that reach the bottom right from (x-1,y) and those that reach the bottom right from (x, y-1):
number_of_paths(x, y) = number_of_paths(x-1, y) + number_of_paths(x, y-1)
The base case is when you are on the edge, i.e. x==0 or y==0.
def number_of_paths(x, y):
    if not x or not y:
        return 1
    return number_of_paths(x-1, y) + number_of_paths(x, y-1)
This solution follows your reasoning, but it only keeps track of the number of paths. You can see that again it is very inefficient.
The problem is that when we try to compute number_of_paths(x, y)
we end up doing the following steps:
Compute number_of_paths(x-1, y)
This is done by computing number_of_paths(x-2, y) and number_of_paths(x-1, y-1)
Compute number_of_paths(x, y-1)
This is done by computing number_of_paths(x-1, y-1) and number_of_paths(x, y-2)
Note how number_of_paths(x-1, y-1) is computed twice. But the result is obviously the same! So we can just compute it the first time, and the next time we see that call we return the already known result:
def number_of_paths(x, y, table=None):
    table = table if table is not None else {(0,0):1}
    try:
        # first look if we already computed this:
        return table[x,y]
    except KeyError:
        # okay we didn't compute it, so we do it now:
        if not x or not y:
            result = table[x,y] = 1
        else:
            result = table[x,y] = number_of_paths(x-1, y, table) + number_of_paths(x, y-1, table)
        return result
And now this executes pretty fast:
>>> number_of_paths(20,20)
137846528820
You could think "performing a call twice isn't a big deal", but you have to take into account that if the call for (x-1,y-1) is computed twice, each of those times it does two calls for (x-2, y-2), thus resulting in computing (x-2, y-2) at least four times. And then (x-3, y-3) at least eight times, ... and then (x-20, y-20) at least 1048576 times! (These are lower bounds that only follow the diagonal; the true counts are larger still.)
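To see the blow-up concretely, here is a small sketch that counts how many calls the plain recursion makes and compares it with a memoized version built on functools.lru_cache. The names naive, memoized and counter are illustrative choices for this sketch, and lru_cache is swapped in for the hand-written table used above.

```python
from functools import lru_cache

def naive(x, y, counter):
    """Plain recursion; counter[0] records how many calls were made."""
    counter[0] += 1
    if not x or not y:
        return 1
    return naive(x - 1, y, counter) + naive(x, y - 1, counter)

@lru_cache(maxsize=None)
def memoized(x, y):
    """Same recurrence, but each (x, y) is computed only once."""
    if not x or not y:
        return 1
    return memoized(x - 1, y) + memoized(x, y - 1)

counter = [0]
print(naive(8, 8, counter), counter[0])              # 12870 25739
print(memoized(8, 8), memoized.cache_info().misses)  # 12870 80
```

Even for an 8x8 grid the plain recursion makes 25739 calls, while the memoized version evaluates only 80 distinct positions.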
Alternatively we could have built an (x+1) by (y+1) matrix and filled it from the bottom right:
def number_of_paths(x, y):
    # x+1 rows (index i) and y+1 columns (index j), so x != y also works
    table = [[0]*(y+1) for _ in range(x+1)]
    table[-1][-1] = 1
    for i in reversed(range(x+1)):
        for j in reversed(range(y+1)):
            if i == x or j == y:
                table[i][j] = 1
            else:
                table[i][j] = table[i+1][j] + table[i][j+1]
    return table[0][0]
Note that here the table represents the intersections so we end up with a +1 in the sizes.
This technique of remembering the results of previous calls so they can be reused later is called memoization. A more general principle is dynamic programming, where you basically reduce the problem to filling a tabular data structure as we did here, using recursion and memoization, and then you "backtrack" on the cells by using pointers you filled earlier to obtain a solution to the original problem.
For the following problem, I used a dictionary to track values while the provided answer used a list. Is there a quick way to determine the most efficient data structures for problems like these?
A robot moves in a plane starting from the original point (0,0). The
robot can move toward UP, DOWN, LEFT and RIGHT with a given steps. The
trace of robot movement is shown as the following: UP 5 DOWN 3 LEFT 3
RIGHT 2. The numbers after the direction are steps. Please write a
program to compute the distance from current position after a sequence
of movement and original point. If the distance is a float, then just
print the nearest integer. Example: If the following tuples are given
as input to the program: UP 5 DOWN 3 LEFT 3 RIGHT 2 Then, the output
of the program should be: 2
My answer uses a dictionary (origin["y"] for y and origin["x"] for x):
direction = 0
steps = 0
command = (direction, steps)
command_list = []
origin = {"x": 0, "y": 0}

# compare strings with !=, not "is not"
while direction != '':
    direction = input("Direction (U, D, L, R):")
    steps = input("Number of steps:")
    command = (direction, steps)
    command_list.append(command)
print(command_list)

while len(command_list) > 0:
    current = command_list[-1]
    if current[0] == 'U':
        origin["y"] += int(current[1])
    elif current[0] == 'D':
        origin["y"] -= int(current[1])
    elif current[0] == 'L':
        origin["x"] -= int(current[1])
    elif current[0] == 'R':
        origin["x"] += int(current[1])
    command_list.pop()

distance = ((origin["x"])**2 + (origin["y"])**2)**0.5
print(distance)
The provided answer uses a list (pos[0] for y, and pos[1] for x):
import math

pos = [0,0]
while True:
    s = input()  # raw_input() in the Python 2 original
    if not s:
        break
    movement = s.split(" ")
    direction = movement[0]
    steps = int(movement[1])
    if direction=="UP":
        pos[0]+=steps
    elif direction=="DOWN":
        pos[0]-=steps
    elif direction=="LEFT":
        pos[1]-=steps
    elif direction=="RIGHT":
        pos[1]+=steps
    else:
        pass

print(int(round(math.sqrt(pos[1]**2+pos[0]**2))))
I'll offer a few points on your question because I strongly disagree with the close recommendations. There's much in your question that's not opinion.
In general, your choice of dictionary wasn't appropriate. For a toy program like this it doesn't make much difference, but I assume you're interested in best practice for serious programs. In production software, you wouldn't make this choice. Why?
Error prone-ness. A typo in future code, e.g. origin["t"] = 3 when you meant origin["y"] = 3 is a nasty bug, maybe difficult to find. t = 3 is more likely to cause a "fast failure." (In a statically typed language like C++ or Java, it's a sure compile-time error.)
Space overhead. A simple scalar variable requires essentially no space beyond the value itself. An array has a fixed overhead for the "dope vector" that tracks its location, current, and maximum size. A dictionary requires yet more extra space for open addressing, unused hash buckets, and fill tracking.
Speed.
Accessing a scalar variable is very fast: just a few processor instructions.
Accessing a tuple or array element when you know its index is also very fast, though not as fast as variable access. Extra instructions are needed to check array bounds. Adding one element to an array may take O(current array size) to copy current contents into a larger block of memory. The advantage of tuples and arrays is that you can access elements quickly based on a computed integer index. Scalar variables don't do this. Choose an array/tuple when you need integer index access. Favor tuples when you know the exact size and it's unlikely to change. Their immutability tends to make code more understandable (and thread safe).
Accessing a dictionary element is still more expensive because a hash value must be computed and buckets traversed with possible collision resolution. Adding a single element can also trigger a table reorganization, which is O(table size) with a constant factor much bigger than list reorganization because all the elements must be rehashed. The big advantage of dictionaries is that looking up any stored pair by its key is likely to take about the same amount of time, no matter what the key is or how many pairs are stored. You should choose a dict only when you need that capability: to store a "map" from keys to values.
Conclude from all the above that the best choice for your origin coordinates would have been simple variables. If you later enhance the program in a way that requires passing (x, y) pairs to/from methods, then you'd consider a Point class.
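As a rough illustration of that conclusion, here is a minimal sketch of the robot problem with plain x and y variables. The function name robot_distance and the choice to pass the commands as a list of (direction, steps) tuples are assumptions made for this example, not part of either original program.

```python
import math

def robot_distance(commands):
    """Return the rounded distance from the origin after the given moves."""
    x = 0
    y = 0
    for direction, steps in commands:
        if direction == "UP":
            y += steps
        elif direction == "DOWN":
            y -= steps
        elif direction == "LEFT":
            x -= steps
        elif direction == "RIGHT":
            x += steps
    return round(math.hypot(x, y))

print(robot_distance([("UP", 5), ("DOWN", 3), ("LEFT", 3), ("RIGHT", 2)]))  # 2
```

A misspelled variable name here raises a NameError or is flagged by a linter as unused, whereas a misspelled dictionary key silently creates a new entry.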
Here is the problem:
Given the input n = 4 x = 5, we must imagine a chessboard that is 4 squares across (x-axis) and 5 squares tall (y-axis). (This input changes, all the way up to n = 200 x = 200)
Then, we are asked to determine the shortest path from the bottom left square on the board to the top right square on the board for the Knight (the Knight can move 2 spaces on one axis, then 1 space on the other axis).
My current ideas:
Use a 2d array to store all the possible moves, perform breadth-first
search(BFS) on the 2d array to find the shortest path.
Floyd-Warshall shortest path algorithm.
Create an adjacency list and perform BFS on that (but I think this would be inefficient).
To be honest though I don't really have a solid grasp on the logic.
Can anyone help me with pseudocode, python code, or even just a logical walk-through of the problem?
BFS is efficient enough for this problem as its complexity is O(n*x), since you explore each cell only one time. For keeping the number of shortest paths, you just have to keep an auxiliary array to save them.
You can also use A* to solve this faster but it's not necessary in this case because it is a programming contest problem.
dist = {}
ways = {}

def bfs():
    start = 1,1
    goal = 6,6
    queue = [start]
    dist[start] = 0
    ways[start] = 1
    while len(queue):
        cur = queue[0]
        queue.pop(0)
        if cur == goal:
            print("reached goal in %d moves and %d ways"%(dist[cur],ways[cur]))
            return

        for move in [ (1,2),(2,1),(-1,-2),(-2,-1),(1,-2),(-1,2),(-2,1),(2,-1) ]:
            next_pos = cur[0]+move[0], cur[1]+move[1]
            # skip squares that fall off the board
            if next_pos[0] > goal[0] or next_pos[1] > goal[1] or next_pos[0] < 1 or next_pos[1] < 1:
                continue
            # another shortest route into an already discovered square
            if next_pos in dist and dist[next_pos] == dist[cur]+1:
                ways[next_pos] += ways[cur]
            if next_pos not in dist:
                dist[next_pos] = dist[cur]+1
                ways[next_pos] = ways[cur]
                queue.append(next_pos)

bfs()
Output
reached goal in 4 moves and 4 ways
Note that the number of ways to reach the goal can get exponentially big
I suggest:
Use BFS backwards from the target location to calculate (in just O(nx) total time) the minimum distance to the target (x, n) in knight's moves from each other square. For each starting square (i, j), store this distance in d[i][j].
Calculate c[i][j], the number of minimum-length paths starting at (i, j) and ending at the target (x, n), recursively as follows:
c[x][n] = 1
c[i][j] = the sum of c[p][q] over all (p, q) such that both
(p, q) is a knight's-move-neighbour of (i, j), and
d[p][q] = d[i][j]-1.
Use memoisation in step 2 to keep the recursion from taking exponential time. Alternatively, you can compute c[][] bottom-up with a slightly modified second BFS (also backwards) as follows:
c = x by n array with each entry initially 0;
seen = x by n array with each entry initially 0;
s = createQueue();
c[x][n] = 1;
seen[x][n] = 1;
push(s, (x, n));
while (notEmpty(s)) {
    (i, j) = pop(s);
    for (each location (p, q) that is a knight's-move-neighbour of (i, j)) {
        if (d[p][q] == d[i][j] + 1) {
            c[p][q] = c[p][q] + c[i][j];
            if (seen[p][q] == 0) {
                push(s, (p, q));
                seen[p][q] = 1;
            }
        }
    }
}
The idea here is to always compute c[][] values for all positions having some given distance from the target before computing any c[][] value for a position having a larger distance, as the latter depend on the former.
The length of a shortest path will be d[1][1], and the number of such shortest paths will be c[1][1]. Total computation time is O(nx), which is clearly best-possible in an asymptotic sense.
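The two passes described above can be sketched in Python roughly as follows. This is only an illustration under some assumptions: the function name count_shortest_knight_paths, the 1-based (column, row) coordinates, and the use of dictionaries instead of 2-d arrays are choices made for this sketch.

```python
from collections import deque

KNIGHT_MOVES = [(1, 2), (2, 1), (-1, -2), (-2, -1),
                (1, -2), (-1, 2), (-2, 1), (2, -1)]

def count_shortest_knight_paths(width, height):
    """Return (length, count) of shortest knight paths from (1, 1) to
    (width, height), or (None, 0) if the target cannot be reached."""
    start, target = (1, 1), (width, height)

    # Pass 1: BFS backwards from the target gives the distance d of every square.
    d = {target: 0}
    order = [target]              # squares in non-decreasing distance order
    queue = deque([target])
    while queue:
        i, j = queue.popleft()
        for di, dj in KNIGHT_MOVES:
            p, q = i + di, j + dj
            if 1 <= p <= width and 1 <= q <= height and (p, q) not in d:
                d[(p, q)] = d[(i, j)] + 1
                order.append((p, q))
                queue.append((p, q))
    if start not in d:
        return None, 0

    # Pass 2: c[square] = sum of c over neighbours that are one step closer.
    c = {target: 1}
    for i, j in order[1:]:
        c[(i, j)] = sum(c[(i + di, j + dj)]
                        for di, dj in KNIGHT_MOVES
                        if d.get((i + di, j + dj)) == d[(i, j)] - 1)
    return d[start], c[start]

print(count_shortest_knight_paths(6, 6))  # (4, 4)
```

Processing squares in BFS order guarantees that every closer neighbour already has its count when a square is handled, which is the ordering requirement stated above.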
My approach to this question would be backtracking as the number of squares in the x-axis and y-axis are different.
Note: Backtracking algorithms can be slow for certain cases and fast for the other
Create a 2-d array for the chess-board. You know the starting index and the final index. To reach the final index you need to keep close to the diagonal that joins the two indexes.
From the starting index, look at all the indexes that the knight can travel to, choose the index which is closest to the diagonal, and keep on traversing; if there is no way to travel any further, backtrack one step and move to the next location available from there.
PS: This is a bit similar to a well-known problem, the Knight's Tour, in which, choosing any starting point, you have to find a path on which the knight covers all squares. I have coded this as a Java GUI application; I can send you the link if you want any help.
Hope this helps!!
Try something. Draw boards of the following sizes: 1x1, 2x2, 3x3, 4x4, and a few odd ones like 2x4 and 3x4. Starting with the smallest board and working to the largest, start at the bottom left corner and write a 0, then find all moves from zero and write a 1, find all moves from 1 and write a 2, etc. Do this until there are no more possible moves.
After doing this for all 6 boards, you should have noticed a pattern: Some squares couldn't be moved to until you got a larger board, but once a square was "discovered" (ie could be reached), the number of minimum moves to that square was constant for all boards not smaller than the board on which it was first discovered. (Smaller means less than n OR less than x, not less than (n * x) )
This tells something powerful, anecdotally. All squares have a number associated with them that must be discovered. This number is a property of the square, NOT the board, and is NOT dependent on size/shape of the board. It is always true. However, if the square cannot be reached, then obviously the number is not applicable.
So you need to find the number of every square on a 200x200 board, and you need a way to see if a board is a subset of another board to determine if a square is reachable.
Remember, in these programming challenges, some questions that are really hard can be solved in O(1) time by using lookup tables. I'm not saying this one can, but keep that trick in mind. For this one, pre-calculating the 200x200 board numbers and saving them in an array could save a lot of time, whether it is done only once on first run or run before submission and then the results are hard coded in.
If the problem needs move sequences rather than number of moves, the idea is the same: save move sequences with the numbers.
I have a csv file with two columns (latitude, longitude) that contains over 5 million rows of geolocation data.
I need to identify the points which are not within 5 miles of any other point in the list, and output everything back into another CSV that has an extra column (CloseToAnotherPoint) which is True if there is another point within 5 miles, and False if there isn't.
Here is my current solution using geopy (not making any web calls, just using the function to calculate distance):
from geopy.point import Point
from geopy.distance import vincenty
import csv


class CustomGeoPoint(object):
    def __init__(self, latitude, longitude):
        self.location = Point(latitude, longitude)
        self.close_to_another_point = False


try:
    output = open('output.csv','w')
    writer = csv.writer(output, delimiter = ',', quoting=csv.QUOTE_ALL)
    writer.writerow(['Latitude', 'Longitude', 'CloseToAnotherPoint'])

    # 5 miles
    close_limit = 5
    geo_points = []

    with open('geo_input.csv', newline='') as geo_csv:
        reader = csv.reader(geo_csv)
        next(reader, None) # skip the headers
        for row in reader:
            geo_points.append(CustomGeoPoint(row[0], row[1]))

    # for every point, look at every point until one is found within 5 miles
    for geo_point in geo_points:
        for geo_point2 in geo_points:
            dist = vincenty(geo_point.location, geo_point2.location).miles
            if 0 < dist <= close_limit: # (0,close_limit]
                geo_point.close_to_another_point = True
                break
        writer.writerow([geo_point.location.latitude, geo_point.location.longitude,
                         geo_point.close_to_another_point])

finally:
    output.close()
As you might be able to tell from looking at it, this solution is extremely slow. So slow in fact that I let it run for 3 days and it still didn't finish!
I've thought about trying to split up the data into chunks (multiple CSV files or something) so that the inner loop doesn't have to look at every other point, but then I would have to figure out how to make sure the borders of each section checked against the borders of its adjacent sections, and that just seems overly complex and I'm afraid it would be more of a headache than it's worth.
So any pointers on how to make this faster?
Let's look at what you're doing.
You read all the points into a list named geo_points.
Now, can you tell me whether the list is sorted? Because if it was sorted, we definitely want to know that. Sorting is valuable information, especially when you're dealing with 5 million of anything.
You loop over all the geo_points. That's 5 million, according to you.
Within the outer loop, you loop again over all 5 million geo_points.
You compute the distance in miles between the two loop items.
If the distance is less than your threshold, you record that information on the first point, and stop the inner loop.
When the inner loop stops, you write information about the outer loop item to a CSV file.
Notice a couple of things. First, you're looping 5 million times in the outer loop. And then you're looping 5 million times in the inner loop.
This is what O(n²) means.
The next time you see someone talking about "Oh, this is O(log n) but that other thing is O(n log n)," remember this experience - you're running an n² algorithm where n in this case is 5,000,000. Sucks, dunnit?
Anyway, you have some problems.
Problem 1: You'll eventually wind up comparing every point against itself. That distance is zero, and your 0 < dist check happens to exclude it, so the point isn't wrongly marked. But it is still a wasted (and expensive) distance computation, and the same check will also ignore a genuine second point recorded at exactly the same coordinates.
Problem 2: When you compare point #1 with, say, point #12345, and they are within the threshold distance from each other, you are recording that information about point #1. But you don't record the same information about the other point. You know, by symmetry, that point #12345 (geo_point2) is within the threshold of point #1, but you don't write that down. So you're missing a chance to just skip over 5 million comparisons.
Problem 3: If you compare point #1 and point #2, and they are not within the threshold distance, what happens when you compare point #2 with point #1? Your inner loop is starting from the beginning of the list every time, but you know that you have already compared the start of the list with the end of the list. You can reduce your problem space by half just by making your outer loop go i in range(0, 5million) and your inner loop go j in range(i+1, 5million).
Answers?
Consider your latitude and longitude on a flat plane. You want to know if there's a point within 5 miles. Let's think about a 10 mile square, centered on your point #1. That's a square centered on (X1, Y1), with a top left corner at (X1 - 5miles, Y1 + 5miles) and a bottom right corner at (X1 + 5miles, Y1 - 5miles). Now, if a point is within that square, it might not be within 5 miles of your point #1. But you can bet that if it's outside that square, it's more than 5 miles away.
As @SeverinPappadeaux points out, distance on a spheroid like Earth is not quite the same as distance on a flat plane. But so what? Set your square a little bigger to allow for the difference, and proceed!
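A minimal sketch of that square test, working directly in degrees. The constant of roughly 69 miles per degree of latitude, the 5% slack factor, and the helper names bounding_box and maybe_close are assumptions for illustration; the sketch also ignores the poles and the ±180° meridian.

```python
from math import cos, radians

MILES_PER_DEGREE = 69.0  # rough size of one degree of latitude (assumption)

def bounding_box(lat, lon, miles, slack=1.05):
    """Return (lat_min, lat_max, lon_min, lon_max) of a padded square
    around (lat, lon) that is meant to contain every point within `miles`."""
    lat_span = slack * miles / MILES_PER_DEGREE
    # A degree of longitude shrinks by cos(latitude) away from the equator.
    lon_span = lat_span / cos(radians(lat))
    return lat - lat_span, lat + lat_span, lon - lon_span, lon + lon_span

def maybe_close(box, lat, lon):
    """Cheap test: False means definitely too far; True means 'check properly'."""
    lat_min, lat_max, lon_min, lon_max = box
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max

box = bounding_box(40.0, -75.0, 5)
print(maybe_close(box, 40.03, -75.02))  # True: worth an exact distance check
print(maybe_close(box, 41.0, -75.0))    # False: about 69 miles north
```

Only the points that pass maybe_close need the expensive distance function.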
Sorted List
This is why sorting is important. If all the points were sorted by X, then Y (or Y, then X - whatever) and you knew it, you could really speed things up. Because you could simply stop scanning when the X (or Y) coordinate got too big, and you wouldn't have to go through 5 million points.
How would that work? Same way as before, except your inner loop would have some checks like this:
five_miles = ...  # Whatever math, plus an error allowance!
list_len = len(geo_points)  # Don't call this 5 million times

for i, pi in enumerate(geo_points):
    if pi.close_to_another_point:
        continue  # Remember if close to an earlier point
    pi1min = pi[1] - five_miles
    pi1max = pi[1] + five_miles
    # Assumes geo_points is sorted on [0] then [1].
    # Scan forward, then backward, from position i.
    for step in (1, -1):
        j = i + step
        while 0 <= j < list_len:
            pj = geo_points[j]
            if abs(pj[0] - pi[0]) > five_miles:
                # Can't possibly be close enough, nor any point beyond it
                break
            if pi1min <= pj[1] <= pi1max:
                # Now do "real" comparison using accurate functions.
                if ...:
                    pi.close_to_another_point = True
                    pj.close_to_another_point = True
                    break
            j += step
        if pi.close_to_another_point:
            break
What am I doing there? First, I'm getting some numbers into local variables. Then I'm using enumerate to give me an i value and a reference to the outer point (what you called geo_point). Then, I'm quickly checking to see if we already know that this point is close to another one.
If not, we'll have to scan. I scan outward from position i in both directions, forward and then backward, because a point's only close neighbor may sit on either side of it in the sorted list, and I never compare a point against itself. I'm using a few temporary variables to cache the result of computations involving the outer loop. Within the inner loop, I do some stupid comparisons against the temporaries. They can't tell me if the two points are close to each other, but I can check if they're definitely not close and either skip ahead or stop scanning in that direction.
Finally, if the simple checks pass then go ahead and do the expensive checks. If a check actually passes, be sure to record the result on both points, so we can skip doing the second point later.
Unsorted List
But what if the list is not sorted?
@RootTwo points you at a kD tree (where D is for "dimensional" and k in this case is "2"). The idea is really simple, if you already know about binary search trees: you cycle through the dimensions, comparing X at even levels in the tree and comparing Y at odd levels (or vice versa). The idea would be this:
def insert_node(node, treenode, depth=0):
    dimension = depth % 2  # even/odd -> lat/long
    dn = node.coord[dimension]
    dt = treenode.coord[dimension]

    if dn < dt:
        # go left
        if treenode.left is None:
            treenode.left = node
        else:
            insert_node(node, treenode.left, depth+1)
    else:
        # go right
        if treenode.right is None:
            treenode.right = node
        else:
            insert_node(node, treenode.right, depth+1)
What would this do? This would get you a searchable tree where points could be inserted in O(log n) time. That means O(n log n) for the whole list, which is way better than n squared! (The log base 2 of 5 million is basically 23. So n log n is 5 million times 23, compared with 5 million times 5 million!)
It also means you can do a targeted search. Since the tree is ordered, it's fairly straightforward to look for "close" points (the Wikipedia link from @RootTwo provides an algorithm).
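To make the targeted search a bit more concrete, here is a hedged sketch of a box query on such a tree. The Node class and the any_in_box helper are hypothetical names introduced for this example; insert_node is repeated in compact form so the block runs on its own. It answers "is there any other node inside this rectangle?", which is the cheap pre-filter before an exact distance test.

```python
class Node:
    """Hypothetical tree node: coord is a (lat, long) pair."""
    def __init__(self, coord):
        self.coord = coord
        self.left = None
        self.right = None

def insert_node(node, treenode, depth=0):
    dimension = depth % 2  # even/odd -> lat/long
    side = 'left' if node.coord[dimension] < treenode.coord[dimension] else 'right'
    child = getattr(treenode, side)
    if child is None:
        setattr(treenode, side, node)
    else:
        insert_node(node, child, depth + 1)

def any_in_box(treenode, lo, hi, exclude=None, depth=0):
    """True if some node other than `exclude` has lo[k] <= coord[k] <= hi[k]."""
    if treenode is None:
        return False
    c = treenode.coord
    if treenode is not exclude and all(lo[k] <= c[k] <= hi[k] for k in range(2)):
        return True
    dim = depth % 2
    # The left subtree holds smaller values in this dimension, the right
    # subtree holds equal or larger ones; skip a side the box cannot reach.
    if lo[dim] < c[dim] and any_in_box(treenode.left, lo, hi, exclude, depth + 1):
        return True
    if hi[dim] >= c[dim] and any_in_box(treenode.right, lo, hi, exclude, depth + 1):
        return True
    return False

nodes = [Node(c) for c in [(0, 0), (10, 10), (0.5, 0.5), (5, -5)]]
root = nodes[0]
for n in nodes[1:]:
    insert_node(n, root)

print(any_in_box(root, (-1, -1), (1, 1), exclude=nodes[0]))   # True
print(any_in_box(root, (9, 9), (11, 11), exclude=nodes[1]))   # False
```

Whole subtrees are skipped whenever the box lies entirely on one side of a node, which is where the speed-up over a linear scan comes from.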
Advice
My advice is to just write code to sort the list, if needed. It's easier to write, and easier to check by hand, and it's a separate pass you will only need to make one time.
Once you have the list sorted, try the approach I showed above. It's close to what you were doing, and it should be easy for you to understand and code.
As the answer to Python calculate lots of distances quickly points out, this is a classic use case for k-D trees.
An alternative is to use a sweep line algorithm, as shown in the answer to How do I match similar coordinates using Python?
Here's the sweep line algorithm adapted for your question. On my laptop, it takes < 5 minutes to run through 5M random points.
import sortedcontainers  # handy library on PyPI
import time
from collections import namedtuple
from math import cos, radians
from random import uniform

Point = namedtuple("Point", "lat long")

miles_per_degree = 69
number_of_points = 5000000

data = [Point(uniform( -88.0,  88.0),   # lat
              uniform(-180.0, 180.0))   # long
        for _ in range(number_of_points)]

start = time.time()
# Note: lat is first in Point, so data is sorted by .lat then .long.
data.sort()
print(time.time() - start)

# has_close_neighbor[i] becomes True once a close neighbor of data[i] is found.
has_close_neighbor = [False] * len(data)

# Parameter that determines the size of a sliding latitude window
# and therefore how close two points need to be to get flagged.
threshold = 5.0  # miles
lat_span = threshold / miles_per_degree
coarse_threshold = (.98 * threshold)**2  # squared miles

# Sliding latitude window, holding indices into data. Within the window,
# points are ordered by longitude.
window = sortedcontainers.SortedListWithKey(key=lambda idx: data[idx].long)

# lag is the index of the 'southernmost' point within the sliding window.
lag = 0
milepost = len(data)//10

# lead_pt is the 'northernmost' point in the sliding window.
for i, lead_pt in enumerate(data):
    if i == milepost:
        print('.', end=' ')
        milepost += len(data)//10

    # The latitude of lead_pt represents the leading edge of the window.
    window.add(i)

    # Remove points further south than the trailing edge of the window.
    while lead_pt.lat - data[lag].lat > lat_span:
        window.discard(lag)
        lag += 1

    # Calculate the 'east-west' width of the window at the latitude of lead_pt.
    long_span = lat_span / cos(radians(lead_pt.lat))
    east_long = lead_pt.long + long_span
    west_long = lead_pt.long - long_span

    # Check all points in the sliding window within long_span of lead_pt.
    for j in window.irange_key(west_long, east_long):
        if j == i:
            continue
        other_pt = data[j]
        # lead_pt is at the top center of a box 2 * long_span wide by
        # 1 * lat_span tall, and other_pt is in that box.
        # Coarse flat-earth distance in miles: anything within 98% of the
        # threshold is accepted without a finer test.
        average_lat = (other_pt.lat + lead_pt.lat) / 2
        delta_lat = (other_pt.lat - lead_pt.lat) * miles_per_degree
        delta_long = ((other_pt.long - lead_pt.long) * miles_per_degree
                      * cos(radians(average_lat)))
        if delta_lat**2 + delta_long**2 <= coarse_threshold:
            # The window only holds points south of lead_pt, so flag both
            # ends of the pair; the southern point is never revisited.
            has_close_neighbor[i] = True
            has_close_neighbor[j] = True
        # put vincenty test here for pairs that fail the coarse check,
        # e.g. if 0 < vincenty(lead_pt, other_pt).miles <= threshold

print()
print(time.time() - start)
If you sort the list by latitude (n log(n)), and the points are roughly evenly distributed, it will bring it down to about 1000 points within 5 miles for each point (napkin math, not exact). By only looking at the points that are near in latitude, the runtime goes from n^2 to n*log(n)+.0004n^2. Hopefully this speeds it up enough.
I would give pandas a try. Pandas is made for efficient handling of large amounts of data. That may help with the efficiency of the csv portion anyhow. But from the sounds of it, you've got yourself an inherently inefficient problem to solve. You take point 1 and compare it against 4,999,999 other points. Then you take point 2 and compare it with 4,999,998 other points and so on. Do the math. That's 12.5 trillion comparisons you're doing. If you can do 1,000,000 comparisons per second, that's 144 days of computation. If you can do 10,000,000 comparisons per second, that's 14 days. For just additions in straight python, 10,000,000 operations can take something like 1.1 seconds, but I doubt your comparisons are as fast as an add operation. So give it at least a fortnight or two.
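The arithmetic behind those estimates is easy to reproduce. A quick sketch, taking the two comparison rates from the paragraph above as given:

```python
n = 5_000_000
comparisons = n * (n - 1) // 2      # every unordered pair exactly once
seconds_per_day = 24 * 60 * 60

print(f"{comparisons:,} comparisons")
for rate in (1_000_000, 10_000_000):  # comparisons per second
    days = comparisons / rate / seconds_per_day
    print(f"{rate:>10,} comparisons/s -> {days:.1f} days")
```

This gives roughly 12.5 trillion comparisons, or about 144.7 days at one million comparisons per second and about 14.5 days at ten million.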
Alternately, you could come up with an alternate algorithm, though I don't have any particular one in mind.
I would redo the algorithm in three steps:
Use great-circle distance, and assume 1% error so make limit equal to 1.01*limit.
Code great-circle distance as inlined function, this test should be fast
You'll get some false positives, which you could further test with vincenty
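A rough sketch of the first two steps, using the haversine formula as the great-circle distance. The Earth radius constant, the function names great_circle_miles and is_candidate, and the default limit of 5 miles are assumptions for illustration; candidates that pass would still go through vincenty as in step three.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius (assumption of this sketch)

def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in miles between two lat/long points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def is_candidate(lat1, lon1, lat2, lon2, limit=5.0):
    """Fast pre-test with the limit widened by 1% to absorb the spherical error."""
    return 0 < great_circle_miles(lat1, lon1, lat2, lon2) <= 1.01 * limit

print(is_candidate(0.0, 0.0, 0.05, 0.0))  # True: roughly 3.5 miles apart
print(is_candidate(0.0, 0.0, 1.0, 0.0))   # False: roughly 69 miles apart
```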
A better solution, building on Oscar Smith's answer: you have a CSV file, so just sort it in Excel (that is very efficient). Then use binary search in your program to find the cities within 5 miles (you can make a small change to the binary search method so that it stops as soon as it finds one city satisfying your condition).
Another improvement is to use a map to remember the pair of cities whenever you find that one city is close to another. For example, when you find city A is within 5 miles of city B, use a Map to store the pair (B is the key and A is the value). The next time you meet B, look it up in the Map first; if it has a corresponding value, you do not need to check it again. This may use more memory, though, so be careful. Hope it helps you.
This is just a first pass, but I've sped it up by half so far by using great_circle() instead of vincenty(), and cleaning up a couple of other things. The difference is explained here, and the loss in accuracy is about 0.17%:
from geopy.point import Point
from geopy.distance import great_circle
import csv

CLOSE_LIMIT = 5  # miles (was used below without being defined)


class CustomGeoPoint(Point):
    def __init__(self, latitude, longitude):
        super(CustomGeoPoint, self).__init__(latitude, longitude)
        self.close_to_another_point = False


def isCloseToAnother(pointA, points):
    for pointB in points:
        dist = great_circle(pointA, pointB).miles
        if 0 < dist <= CLOSE_LIMIT:  # (0, CLOSE_LIMIT]
            return True
    return False


with open('geo_input.csv', 'r') as geo_csv:
    reader = csv.reader(geo_csv)
    next(reader, None)  # skip the headers
    geo_points = sorted(map(lambda x: CustomGeoPoint(x[0], x[1]), reader))

with open('output.csv', 'w') as output:
    writer = csv.writer(output, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerow(['Latitude', 'Longitude', 'CloseToAnotherPoint'])

    # for every point, look at every point until one is found within CLOSE_LIMIT miles
    for point in geo_points:
        point.close_to_another_point = isCloseToAnother(point, geo_points)
        writer.writerow([point.latitude, point.longitude,
                         point.close_to_another_point])
I'm going to improve this further.
Before:
$ time python geo.py
real 0m5.765s
user 0m5.675s
sys 0m0.048s
After:
$ time python geo.py
real 0m2.816s
user 0m2.716s
sys 0m0.041s
This problem can be solved with a VP tree. These allow querying data
with distances that are a metric obeying the triangle inequality.
The big advantage of VP trees over a k-D tree is that they can be blindly
applied to geographic data anywhere in the world without having to worry
about projecting it to a suitable 2D space. In addition a true geodesic
distance can be used (no need to worry about the differences between
geodesic distances and distances in the projection).
Here's my test: generate 5 million points randomly and uniformly on the
world. Put these into a VP tree.
Looping over all the points, query the VP tree to find any neighbor a
distance in (0km, 10km] away. (0km is not included in this set to avoid
the query point being found.) Count the number of points with no such
neighbor (which is 229573 in my case).
Cost of setting up the VP tree = 5000000 * 20 distance calculations.
Cost of the queries = 5000000 * 23 distance calculations.
Time for setup and queries is 5m 7s.
I am using C++ with GeographicLib for calculating distances, but
the algorithm can of course be implemented in any language and here's
the python version of GeographicLib.
ADDENDUM: The C++ code implementing this approach is given here.
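For readers who want to see the shape of the idea without the C++ code, here is a minimal Python sketch of a VP tree with a "any neighbor in (0, radius]" query. It is not the implementation linked above: the function names are made up for illustration, the vantage point is simply the first point of each sublist rather than a randomly chosen one, and the distance is a spherical haversine formula rather than a true geodesic from GeographicLib.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation


def great_circle_km(a, b):
    """Haversine distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def build_vp_tree(points, dist):
    """Return a nested tuple (vantage, mu, inside, outside), or None if empty."""
    if not points:
        return None
    vantage, rest = points[0], points[1:]
    if not rest:
        return (vantage, 0.0, None, None)
    dists = [dist(vantage, p) for p in rest]
    mu = sorted(dists)[len(dists) // 2]  # median distance to the vantage point
    inside = [p for p, d in zip(rest, dists) if d < mu]
    outside = [p for p, d in zip(rest, dists) if d >= mu]
    return (vantage, mu, build_vp_tree(inside, dist), build_vp_tree(outside, dist))


def has_neighbor_within(node, query, radius, dist):
    """True if some stored point lies at a distance in (0, radius] from query."""
    if node is None:
        return False
    vantage, mu, inside, outside = node
    d = dist(query, vantage)
    if 0 < d <= radius:
        return True
    # Triangle inequality: any point p within `radius` of the query satisfies
    # d - radius <= dist(vantage, p) <= d + radius, so a subtree that cannot
    # contain such a p is skipped entirely.
    if d - radius < mu and has_neighbor_within(inside, query, radius, dist):
        return True
    if d + radius >= mu and has_neighbor_within(outside, query, radius, dist):
        return True
    return False


points = [(51.5, -0.1), (51.52, -0.12), (48.85, 2.35), (40.7, -74.0)]
tree = build_vp_tree(points, great_circle_km)
lonely = [p for p in points if not has_neighbor_within(tree, p, 10.0, great_circle_km)]
print(len(lonely))  # number of points with no neighbor in (0km, 10km]
```

The pruning step is the whole point: only the metric property of the distance is used, which is why no projection to a 2D plane is needed.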
I am trying to solve Euler problem 18 where I am required to find out the maximum total from top to bottom. I am trying to use recursion, but am stuck with this.
I guess I didn't state my problem clearly earlier. What I am trying to achieve by recursion is to find the sum along the maximum path. I start from the top of the triangle and then check which is bigger: 7 + findsum() or 4 + findsum(). findsum() is supposed to find the sum of the numbers beneath it. I am storing the sum in the variable 'result'.
The problem is that I don't know the base case of this recursive function. I know it should stop when it has reached the last child elements, but I don't know how to write this logic in the program.
pyramid=[[0,0,0,3,0,0,0,],
         [0,0,7,0,4,0,0],
         [0,2,0,4,0,6,0],
         [8,0,5,0,9,0,3]]
pos=[0,3]
def downleft(pyramid,pos):#returns down left child
    try:
        return(pyramid[pos[0]+1][pos[1]-1])
    except:return(0)
def downright(pyramid,pos):#returns down right child
    try:
        return(pyramid[pos[0]+1][pos[1]+1])
    except:
        return(0)
result=0
def find_max(pyramid,pos):
    global result
    if downleft(pyramid,pos)+find_max(pyramid,[pos[0]+1,pos[1]-1]) > downright(pyramid,pos)+find_max(pyramid,[pos[0]+1,pos[1]+1]):
        new_pos=[pos[0]+1,pos[1]-1]
        result+=downleft(pyramid,pos)+find_max(pyramid,[pos[0]+1,pos[1]-1])
    elif downright(pyramid,pos)+find_max(pyramid,[pos[0]+1,pos[1]+1]) > downleft(pyramid,pos)+find_max(pyramid,[pos[0]+1,pos[1]-1]):
        new_pos=[pos[0]+1,pos[1]+1]
        result+=downright(pyramid,pos)+find_max(pyramid,[pos[0]+1,pos[1]+1])
    else :
        return(result)
find_max(pyramid,pos)
A big part of your problem is that you're recursing a lot more than you need to. You should really only ever call find_max twice recursively, and you need some base-case logic to stop after the last row.
Try this code:
def find_max(pyramid, x, y):
    if y >= len(pyramid): # base case, we're off the bottom of the pyramid
        return 0 # so, return 0 immediately, without recursing
    left_value = find_max(pyramid, x - 1, y + 1) # first recursive call
    right_value = find_max(pyramid, x + 1, y + 1) # second recursive call
    if left_value > right_value:
        return left_value + pyramid[y][x]
    else:
        return right_value + pyramid[y][x]
I changed the call signature to have separate values for the coordinates rather than using a list, as this made the indexing much easier to write. Call it with find_max(pyramid, 3, 0), and get rid of the global pos list. I also got rid of the result global (the function returns the result).
This algorithm could benefit greatly from memoization, as on bigger pyramids you'll calculate the values of the lower-middle areas many times. Without memoization, the code may be impractically slow for large pyramid sizes.
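As a rough sketch of what memoization could look like here (one possible way, using functools.lru_cache from the standard library; the wrapper name max_path_sum is just for illustration), the recursion stays the same but each (x, y) position is only computed once. It assumes the zero-padded layout from the question, where the start is the centre of the top row.

```python
from functools import lru_cache

pyramid = [[0, 0, 0, 3, 0, 0, 0],
           [0, 0, 7, 0, 4, 0, 0],
           [0, 2, 0, 4, 0, 6, 0],
           [8, 0, 5, 0, 9, 0, 3]]


def max_path_sum(pyramid, x, y):
    """Best top-to-bottom sum starting at column x of row y."""

    @lru_cache(maxsize=None)  # remembers the result for every (x, y) seen
    def best(x, y):
        if y >= len(pyramid):  # off the bottom: nothing more to add
            return 0
        return pyramid[y][x] + max(best(x - 1, y + 1), best(x + 1, y + 1))

    return best(x, y)


print(max_path_sum(pyramid, 3, 0))  # 23, via 3 + 7 + 4 + 9
```

With the cache, the number of distinct calls grows with the number of cells in the pyramid instead of doubling with every row.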
Edit: I see that you are having trouble with the logic of the code. So let's have a look at that.
At each position in the tree you want to make a choice of selecting
the path from this point on that has the highest value. So what
you do is, you calculate the score of the left path and the score of
the right path. I see this is something you try in your current code,
only there are some inefficiencies. You calculate everything
twice (first in the if, then in the elif), which is very expensive. You should only calculate the values of the children once.
You ask for the stopping condition. Well, if you reach the bottom of the tree, what is the score of the path starting at this point? It's just the value in the tree. And that is what you should return at that point.
So the structure should look something like this:
function getScoreAt(x, y):
    if at the end: return valueInTree(x, y)
    valueLeft = getScoreAt(x - 1, y + 1)
    valueRight = getScoreAt(x + 1, y + 1)
    valueHere = max(valueLeft, valueRight) + valueInTree(x, y)
    return valueHere
Extra hint:
Are you aware that in Python negative indices wrap around to the back of the array? So if you do pyramid[pos[0]+1][pos[1]-1] you may actually get to elements like pyramid[1][-1], which is at the other side of the row of the pyramid. What you probably expect is that this raises an error, but it does not.
To fix your problem, you should add explicit bounds checks and not rely on try blocks (using try blocks for this is also not good programming style).
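To make the wrap-around concrete, here is a small sketch. The helper name child_value is hypothetical, and returning 0 for out-of-range positions mirrors what the question's except branches were trying to do.

```python
pyramid = [[0, 0, 0, 3, 0, 0, 0],
           [0, 0, 7, 0, 4, 0, 0],
           [0, 2, 0, 4, 0, 6, 0],
           [8, 0, 5, 0, 9, 0, 3]]

# A negative column does not raise: it silently reads from the end of the row.
print(pyramid[3][-1])  # 3, the last element of the bottom row


def child_value(pyramid, row, col):
    """Return pyramid[row][col], or 0 when the position is outside the grid."""
    if 0 <= row < len(pyramid) and 0 <= col < len(pyramid[row]):
        return pyramid[row][col]
    return 0


print(child_value(pyramid, 3, -1))  # 0, the negative column is rejected
print(child_value(pyramid, 4, 0))   # 0, below the last row
print(child_value(pyramid, 1, 2))   # 7, a normal in-range lookup
```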
I am trying to code new data structures I learn in Python, and the following function is part of segment tree.
def query(root,interval,xy=ref_ll([False,False])):
    print interval,root
    if root.interval == interval or point(root.interval):
        return root.quadrant.reflect(root.xy * xy) #Is always gonna be of the form [a,b,c,d]
    a = q_list([0,0,0,0])
    if interval[0] < root.r.interval[0]:
        a = query(root.l,[interval[0],min(interval[1],root.l.interval[1])],root.xy * xy)
    if interval[1] > root.l.interval[1]:
        a = query(root.r,[max(interval[0],root.r.interval[0]), interval[1]],root.xy * xy)
    return a
I am expecting this to run in O(h) time (h is the height of the tree), but it does not. Can someone point out my mistake? Thanks.
EDIT For an idea of the segment tree, look at http://community.topcoder.com/i/education/lca/RMQ_004.gif
The function's termination condition is that the interval is of the form (1,1), i.e. it is a point and not a range. All the functions are implemented.
Working Input:
http://pastebin.com/LuisyYCY
Here is the whole code. http://pastebin.com/6kgtVWAq
It's probably because you are extending a list for every level of the tree. The average time complexity of extending a list is O(k), where k is the size of the list on the right-hand side. The size of the list on the right-hand side is O(h), so the average overall time complexity is then O(h^2).
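To illustrate the general point (this is not the code from the pastebin links, which I am not reproducing; the functions below work on an implicit tree over an index range and their names are made up), compare building a new list at every level with appending into one shared list. Both return the same segments, but the first copies the partial results again at each level on the way back up, while the second pays amortized O(1) per segment.

```python
def cover_by_concat(lo, hi, qlo, qhi):
    """Segments of the implicit tree over [lo, hi] that tile [qlo, qhi].

    Each level builds a new list from its children's lists, so results found
    deep in the tree are copied once per level above them.
    """
    if hi < qlo or qhi < lo:
        return []
    if qlo <= lo and hi <= qhi:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return (cover_by_concat(lo, mid, qlo, qhi)
            + cover_by_concat(mid + 1, hi, qlo, qhi))


def cover_into(lo, hi, qlo, qhi, out):
    """Same traversal, but every segment is appended once to a shared list."""
    if hi < qlo or qhi < lo:
        return
    if qlo <= lo and hi <= qhi:
        out.append((lo, hi))
        return
    mid = (lo + hi) // 2
    cover_into(lo, mid, qlo, qhi, out)
    cover_into(mid + 1, hi, qlo, qhi, out)


shared = []
cover_into(0, 15, 3, 12, shared)
print(shared)                                  # [(3, 3), (4, 7), (8, 11), (12, 12)]
print(cover_by_concat(0, 15, 3, 12) == shared)  # True
```

If the query only needs a fixed-size summary (such as the four-element result mentioned in the question), combining two fixed-size values at each level avoids the growing list altogether.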