Solving TSP with GA: Should a distance matrix speed up run-time?

I am trying to write a GA in Python to solve TSP, and I would like to speed it up: right now it takes 24 seconds to run 200 generations with a population size of 200.
I am using a map with 29 cities. Each city has an id and (x,y) coordinates.
I tried implementing a distance matrix, which calculates all the distances once and stores them in a list. So instead of calculating the distance using the sqrt() function 1M+ times, it only uses the function 406 times. Every time the distance between two cities is required, it is just retrieved from the matrix using the ids of the two cities as the index.
But even with this, it takes just as much time. I thought sqrt() would be more expensive than just indexing a list. Is it not? Would a dictionary make it faster?

The short answer:
Yes, a dictionary would make it faster, provided your current lookup has to search through the list.
The long answer:
Let's say you pre-process and calculate all distances once. Great! Now, let's say I want to find the distance between A and B. All I have to do is find that distance where I put it: it is in the list!
What is the time complexity of finding something in a list by searching for it? That's right: O(n). (Indexing a list directly by position is already O(1), so this cost only applies if the lookup scans the list.)
And how many times am I going to do it? My guess, according to your question: 1M+ times.
Now, that is a huge problem. I suggest you use a dictionary so that you can look up the pre-calculated distance between any two cities in O(1).
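A minimal sketch of the dictionary lookup, assuming cities are stored as id -> (x, y). The names cities, dist, distance and tour_length are made up for illustration, and the three sample cities stand in for the real 29-city map. If the existing matrix is already a nested list indexed as matrix[a][b], that lookup is O(1) too, so it is worth profiling before expecting a large gain from this change.

```python
import math
from itertools import combinations

# Hypothetical city data: id -> (x, y). The real map has 29 cities.
cities = {0: (0.0, 0.0), 1: (3.0, 4.0), 2: (6.0, 8.0)}

# Compute every pairwise distance exactly once.
# A frozenset key makes the lookup independent of the order of the two ids.
dist = {
    frozenset((a, b)): math.dist(cities[a], cities[b])
    for a, b in combinations(cities, 2)
}


def distance(a, b):
    """Average-case O(1) lookup of a precomputed distance."""
    return 0.0 if a == b else dist[frozenset((a, b))]


def tour_length(tour):
    """Length of a closed tour given as a sequence of city ids."""
    return sum(
        distance(tour[i], tour[(i + 1) % len(tour)]) for i in range(len(tour))
    )
```

With 29 cities the dictionary holds 406 entries, matching the number of sqrt() calls mentioned in the question.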

Related

Sort coordinates of pointcloud by distance to previous point

Pointcloud of rope with desired start and end point
I have a pointcloud of a rope-like object with about 300 points. I'd like to sort the 3D coordinates of that pointcloud so that one end of the rope has index 0 and the other end has index 300, as shown in the image. Other pointclouds of that object might be U-shaped, so I can't sort by X, Y or Z coordinate. Because of that I also can't sort by the distance to a single point.
I have looked at KDTree by sklearn or scipy to compute the nearest neighbour of each point but I don't know how to go from there and sort the points in an array without getting double entries.
Is there a way to sort these coordinates in an array, so that from a starting point the array gets appended with the coordinates of the next closest point?

First of all, obviously, there is no strict solution to this problem (there is not even a strict definition of what you want to get). So anything you write will be a heuristic of some sort, which will fail in some cases, especially as your point cloud takes on a non-trivial shape (do you allow loops in your rope, for example?).
That said, a simple approach may be to build a graph with the points as the vertices, and every two points connected by an edge with a weight equal to the straight-line distance between them.
Then build a minimum spanning tree of this graph. This will provide a kind of skeleton for your point cloud, and you can devise any simple algorithm on top of this skeleton.
For example, sort all points by their distance to the start of the rope, measured along this tree. There is only one path between any two vertices of the tree, so for each vertex of the tree calculate the length of the single path to the rope start, and sort all the vertices by this distance.
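The idea above can be sketched roughly as follows. This is only an illustration under some assumptions: points is a list of (x, y, z) tuples, start is the index of the known rope end, and the function name order_along_mst is made up. It uses a plain O(n^2) version of Prim's algorithm, which should be acceptable for a few hundred points.

```python
import math


def order_along_mst(points, start):
    """Order point indices by path length from `start` along the minimum spanning tree."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge linking each vertex to the tree
    parent = [-1] * n       # tree neighbour through which each vertex is attached
    tree_dist = [0.0] * n   # path length from `start`, measured along tree edges
    best[start] = 0.0

    # Prim's algorithm on the complete graph, grown from the rope start.
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] != -1:
            # The parent joined the tree earlier, so its distance is already final.
            tree_dist[u] = tree_dist[parent[u]] + best[u]
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u

    return sorted(range(n), key=tree_dist.__getitem__)
```

Because the tree is grown from the start point, the unique tree path to each vertex runs through its parent, so the along-tree distance can be accumulated while the tree is being built.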

As suggested in the other answer, there is no strict solution to this problem, and there can be edge cases such as loops, spirals and tubes, but you can use a heuristic approach that works for your use case. Read about heuristic approaches such as hill climbing, simulated annealing, genetic algorithms, etc.
For any heuristic approach you need a way to measure how good a solution is. Let's say I give you two arrays of 3000 elements: how will you decide which solution is better than the other? This measure depends on your use case.
One approach off the top of my head: hill climbing.
Method to measure the goodness of a solution: take the Euclidean distance between each pair of adjacent elements of the array and sum these distances.
Steps:
1. Create a randomised array of all 3000 elements.
2. Select two random indices out of these 3000, swap the elements at those indices, and see whether it improves your answer (that is, whether the sum of Euclidean distances between adjacent elements decreases).
3. If it improves your answer, keep those elements swapped.
4. Repeat steps 2 and 3 for a large number of epochs (10^6).
This solution tends to stagnate because it lacks diversity. For better results use simulated annealing or genetic algorithms.
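The steps above could look roughly like this in Python. It is a sketch, not a tuned implementation: the names path_length and hill_climb, the default epoch count and the fixed seed are all assumptions made for the example, and the full path length is recomputed on every iteration for simplicity.

```python
import math
import random


def path_length(points, order):
    """Sum of Euclidean distances between consecutive points in `order`."""
    return sum(
        math.dist(points[order[i]], points[order[i + 1]])
        for i in range(len(order) - 1)
    )


def hill_climb(points, epochs=100_000, seed=0):
    """Random-swap hill climbing over the visiting order of the points."""
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)                       # step 1: random starting order
    best = path_length(points, order)
    for _ in range(epochs):                  # step 4: repeat many times
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]      # step 2: try a swap
        candidate = path_length(points, order)
        if candidate < best:
            best = candidate                 # step 3: keep the improvement
        else:
            order[i], order[j] = order[j], order[i]  # otherwise undo the swap
    return order, best
```

A faster variant would update only the few distances affected by a swap instead of recomputing the whole sum.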

Finding the 10 nearest points in 3D Euclidean space, for EACH element in a 5-million element catalog

Suppose I have a catalog of 5 million points, with their x,y,z location in 3D space. For EACH of these 5 million points, I want to find the 10 points closest to it (straightforward 3D Euclidean distance formula).
In python, if I do a simple for loop over every element in the table, and within the for loop do an array operation (not a second for loop!) to find the distance between the current point and all other points in the catalog, this would take days/weeks. I've tried some stuff involving sorting and computing the distance between points only +/- a couple thousand rows around each table element, but that would still take days.
What is a faster way to do this in python? Is there a way to turn the for loop into some kind of vectorized operation? Would any machine learning techniques (e.g., in scikit-learn) be helpful? Or would somehow parallelizing the code help?

I've used a package called RANN in R that finds "approximate" nearest neighbors. I ran it in a few minutes with 25M observations and 8 dimensions, and the results were good enough for my use case.
I'm not sure if there is a Python version of the package I used, but I found this link that has a lot of alternatives: Benchmark of ANN Libraries

Finding points in space closer than a certain value

In a Python application I'm developing I have an array of 3D points (of size between 2 and 100000) and I have to find the points that are within a certain distance from each other (say between two values, like 0.1 and 0.2). I need this for a graphics application and this search should be very fast (~1/10 of a second for a sample of 10000 points).
As a first experiment I tried to use the scipy.spatial.KDTree.query_pairs implementation, and with a sample of 5000 points it takes 5 seconds to return the indices. Do you know any approach that may work for this specific case?
A bit more about the application:
The points represents atom coordinates and the distance search is useful to determine the bonds between atoms. Bonds are not necessarily fixed but may change at each step, such as in the case of hydrogen bonds.

Great question! Here is my suggestion:
Divide each coordinate by your "epsilon" value of 0.1/0.2/whatever and round the result to an integer. This creates a "quotient space" of points where distance no longer needs to be determined using the distance formula, but simply by comparing the integer coordinates of each point. If all coordinates are the same, then the original points were within approximately the square root of three times epsilon from each other (for example). This process is O(n) and should take 0.001 seconds or less.
(Note: you would want to augment the original point with the three additional integers that result from this division and rounding, so that you don't lose the exact coordinates.)
Sort the points in numeric order using dictionary-style rules and considering the three integers in the coordinates as letters in words. This process is O(n * log(n)) and should take certainly less than your 1/10th of a second requirement.
Now you simply proceed through this sorted list and compare each point's integer coordinates with the previous and following points. If all coordinates match, then both of the matching points can be moved into your "keep" list of points, and all the others can be marked as "throw away." This is an O(n) process which should take very little time.
The result will be a subset of all the original points, which contains only those points that could be possibly involved in any bond, with a bond being defined as approximately epsilon or less apart from some other point in your original set.
This process is not mathematically exact, but I think it is definitely fast and suited for your purpose.
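A rough sketch of the three steps (quantize, sort, compare neighbours in the sorted list). The function name candidates_by_cell is made up, and floor division is used here as one possible way to "round to an integer". As the answer says, this is not exact: two points that are close but fall on opposite sides of a cell boundary will not be matched.

```python
import math


def candidates_by_cell(points, eps):
    """Keep only points that share an integer grid cell with at least one other point."""
    # Augment each point with its integer cell coordinates so the
    # exact coordinates are not lost.
    keyed = [(tuple(math.floor(c / eps) for c in p), p) for p in points]

    # Dictionary-style (lexicographic) sort on the integer triple.
    keyed.sort(key=lambda item: item[0])

    keep = []
    for i, (cell, p) in enumerate(keyed):
        same_as_prev = i > 0 and keyed[i - 1][0] == cell
        same_as_next = i + 1 < len(keyed) and keyed[i + 1][0] == cell
        if same_as_prev or same_as_next:
            keep.append(p)
    return keep
```

The surviving points still need an exact distance check afterwards; the filter only shrinks the set that check has to run over.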

The first thing that comes to my mind is:
If we calculate the distance between every pair of atoms in the set, that is O(N^2) operations, which is very slow.
What about introducing a static orthogonal grid with some cell size (for example, close to the distance you are interested in) and then determining which atoms belong to each cell of the grid (this takes O(N) operations)? After this step, the neighbor search only has to look at atoms in the same or adjacent cells, which reduces the search time.
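One way the grid idea might be written, assuming the cell size is set to the larger distance bound so that every qualifying pair lies in the same cell or in one of the 26 neighbouring cells. The function name pairs_in_range is hypothetical.

```python
import math
from collections import defaultdict
from itertools import product


def pairs_in_range(points, d_min, d_max):
    """Return index pairs (i, j), i < j, with d_min <= distance <= d_max."""
    cell = d_max  # cell size equal to the cutoff: only adjacent cells can interact

    # Bin every point into its grid cell: O(N).
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        grid[tuple(math.floor(c / cell) for c in p)].append(idx)

    pairs = []
    for (cx, cy, cz), members in grid.items():
        # Visit the cell itself and its 26 neighbours.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            # .get() avoids inserting empty cells while iterating over the grid.
            for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                for i in members:
                    # i < j ensures each pair is reported exactly once.
                    if i < j and d_min <= math.dist(points[i], points[j]) <= d_max:
                        pairs.append((i, j))
    return pairs
```

For roughly uniform atom densities the number of points per cell stays small, so the total work grows about linearly with the number of atoms.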

Fastest way to approximately compare values in large numpy arrays?

I have two arrays, array A with ~1M lines and array B with ~400K lines. Each contains, among other things, coordinates of a point. For each point in array A, I need to find how many points in array B are within a certain distance of it. How do I avoid naively comparing everything to everything? Based on its speed at the start, running naively would take 10+ days on my machine. That required nested loops, but the arrays are too large to construct a distance matrix (400G entries!)
I thought the way would be to check only a limited set of B coordinates against each A coordinates. However, I haven't determined an easy way of doing that. That is, what's the easiest/quickest way to make a selection that doesn't require checking all the values in B (which is exactly the same task I'm trying to avoid)?
EDIT: I should've mentioned these aren't 2D (or nD) Cartesian, but spherical surface (lat/long), and distance is great-circle distance.

I cannot give a full answer right now, but here are some hints to get you started. It will be much more efficient to organise the points in B in a kd-tree. You can use the class scipy.spatial.KDTree to do this easily, and you can use its query_ball_point() method to request the points within a given distance.
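A possible sketch with scipy.spatial.KDTree. Since the question's coordinates are lat/long on a sphere, this example assumes the points are first converted to 3D unit vectors, and that the great-circle radius is given as an angle and converted to the equivalent straight-line (chord) length. The helper names to_unit_vectors and count_within are made up for illustration.

```python
import numpy as np
from scipy.spatial import KDTree


def to_unit_vectors(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to 3D unit vectors."""
    lat = np.radians(lat_deg)
    lon = np.radians(lon_deg)
    return np.column_stack(
        (np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat))
    )


def count_within(lat_a, lon_a, lat_b, lon_b, max_angle_deg):
    """For each point in A, count the points in B within the given angular distance."""
    tree = KDTree(to_unit_vectors(lat_b, lon_b))
    # On the unit sphere, an angular separation theta corresponds to a
    # chord of length 2 * sin(theta / 2).
    chord = 2.0 * np.sin(np.radians(max_angle_deg) / 2.0)
    neighbours = tree.query_ball_point(to_unit_vectors(lat_a, lon_a), r=chord)
    return np.array([len(found) for found in neighbours])
```

For a great-circle distance d on a sphere of radius R, the angle to pass in would be d / R converted to degrees.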

Here is one possible implementation of the cross match between list of points on the sphere using k-d tree.
http://code.google.com/p/astrolibpy/source/browse/my_utils/match_lists.py
Another way is to use healpy module and their get_neighbors method.

Bubble Breaker Game Solver better than greedy?

As a mental exercise I decided to try to solve the bubble breaker game found on many cell phones, as well as an example here: Bubble Break Game
The random (N,M,C) board consists N rows x M columns with C colors
The goal is to get the highest score by picking the sequence of bubble groups that ultimately leads to the highest score
A bubble group is 2 or more bubbles of the same color that are adjacent to each other in either x or y direction. Diagonals do not count
When a group is picked, the bubbles disappear, any holes are filled with bubbles from above first, ie shift down, then any holes are filled by shifting right
A bubble group score = n * (n - 1) where n is the number of bubbles in the bubble group
The first algorithm is a simple exhaustive recursive algorithm which explores going through the board row by row and column by column picking bubble groups. Once the bubble group is picked, we create a new board and try to solve that board, recursively descending down
Some of the ideas I am using include normalized memoization. Once a board is solved we store the board and the best score in a memoization table.
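A compact sketch of exhaustive search with memoization, for illustration only. It assumes the board is stored as a tuple of columns, each listed bottom to top with only the remaining bubbles; with that representation, falling bubbles and the closing of empty columns happen automatically, which also acts as a simple form of normalization. The function names groups, pop and best_score are made up here and are not the original prototype.

```python
from functools import lru_cache


def groups(board):
    """Yield each group of 2+ same-coloured, orthogonally adjacent bubbles as a set of (col, row)."""
    seen = set()
    for c, col in enumerate(board):
        for r, colour in enumerate(col):
            if (c, r) in seen:
                continue
            # Flood fill from this bubble.
            stack, group = [(c, r)], {(c, r)}
            while stack:
                x, y = stack.pop()
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < len(board) and 0 <= ny < len(board[nx])
                            and (nx, ny) not in group
                            and board[nx][ny] == colour):
                        group.add((nx, ny))
                        stack.append((nx, ny))
            seen |= group
            if len(group) >= 2:
                yield group


def pop(board, group):
    """Remove a group; remaining bubbles fall and empty columns are closed up."""
    cols = (
        tuple(v for r, v in enumerate(col) if (c, r) not in group)
        for c, col in enumerate(board)
    )
    return tuple(col for col in cols if col)


@lru_cache(maxsize=None)
def best_score(board):
    """Best total score reachable from this board; results are memoized per board."""
    best = 0
    for g in groups(board):
        n = len(g)
        best = max(best, n * (n - 1) + best_score(pop(board, g)))
    return best
```

For example, best_score((("R",), ("G", "G"), ("R",))) first pops the green pair, after which the two red bubbles become adjacent and form a second group.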
I created a prototype in Python which shows that a (2,15,5) board takes 8859 boards to solve in about 3 seconds. A (3,15,5) board takes 12,384,726 boards in 50 minutes on a server. The solver rate is ~3k-4k boards/sec and gradually decreases as the memoization search takes longer. The memoization table grows to 5,692,482 boards and is hit 6,713,566 times.
What other approaches could yield high scores besides the exhaustive search?
I don't see any obvious way to divide and conquer, but trending towards larger and larger bubble groups seems to be one approach.
Thanks to David Locke for posting the paper link, which talks about a window solver that uses a constant-depth lookahead heuristic.

According to this paper, determining if you can empty the board (which is related to the problem you want to solve) is NP-Complete. That doesn't mean that you won't be able to find a good algorithm, it just means that you likely won't find an efficient one.

I'm thinking you could try a branch and bound search with the following idea:
Given a state of the game S, you branch on S by breaking it up into m sets Si, where each Si is the state after taking one of the m legal moves available in state S.
You need two functions U(S) and L(S) that compute an upper and a lower bound respectively for a given state S.
For the U(S) function, I'm thinking of calculating the score that you would get if you were able to freely shuffle K bubbles on the board (each move) and arrange the blocks in the way that would result in the highest score, where K is a value you choose yourself. Calculating U(S) for a given S should go quicker if you choose a higher K (the conditions are relaxed), so choosing the value of K is a trade-off between how quickly U(S) can be found and its quality (how tight an upper bound U(S) is).
For the L(S) function, calculate the score that you would get if you simply kept clicking randomly until you reached a state that could not be solved any further. You can do this several times, taking the highest lower bound that you get.
Once you have these two functions you can apply a standard branch and bound search. Note that the speed of your search is going to depend greatly on how tight your upper bound and your lower bound are.
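The search loop itself can be sketched generically. Everything below is an assumption made for illustration: the function branch_and_bound and its parameters are hypothetical, and a tiny knapsack problem stands in for the bubble board, because writing good U(S) and L(S) functions for the game is the hard part and is not attempted here.

```python
def branch_and_bound(root, children, upper, lower):
    """Maximise a score. upper(s) must never underestimate the best total reachable
    from s; lower(s) must be the total of some actually reachable outcome."""
    best = lower(root)
    stack = [root]
    while stack:
        state = stack.pop()
        for child in children(state):
            best = max(best, lower(child))
            # Prune any branch that cannot beat the best known score.
            if upper(child) > best:
                stack.append(child)
    return best


# Toy stand-in problem: 0/1 knapsack, state = (next item index, capacity left, value so far).
ITEMS = [(2, 3), (3, 4), (4, 5), (5, 6)]  # (weight, value), made-up numbers


def children(state):
    i, cap, val = state
    if i == len(ITEMS):
        return []
    w, v = ITEMS[i]
    kids = [(i + 1, cap, val)]                  # skip item i
    if w <= cap:
        kids.append((i + 1, cap - w, val + v))  # take item i
    return kids


def upper(state):
    # Relaxed bound: pretend every remaining item that fits on its own can be taken.
    i, cap, val = state
    return val + sum(v for w, v in ITEMS[i:] if w <= cap)


def lower(state):
    # Feasible outcome: greedily take remaining items in order while they fit.
    i, cap, val = state
    for w, v in ITEMS[i:]:
        if w <= cap:
            cap -= w
            val += v
    return val
```

In the bubble game, children(S) would return the boards after each legal move, lower(S) would be the best of a few random playouts, and upper(S) would be the relaxed shuffle score described above.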

To get a faster solution than exhaustive search, I think what you want is probably dynamic programming. In dynamic programming, you find some sort of "step" that takes you possibly closer to your solution, and keep track of the results of each step in a big matrix. Then, once you have filled in the matrix, you can find the best result, and then work backward to get a path through the matrix that leads to the best result. The matrix is effectively a form of memoization.
Dynamic programming is discussed in The Algorithm Design Manual but there is also plenty of discussion of it on the web. Here's a good intro: http://20bits.com/articles/introduction-to-dynamic-programming/
I'm not sure exactly what the "step" is for this problem. Perhaps you could make a scoring metric for a board that simply sums the points for each of the bubble groups, and then record this score as you try popping balloons? Good steps would tend to cause bubble groups to coalesce, improving the score, and bad steps would break up bubble groups, making the score worse.

You can translate this problem into the problem of finding a shortest path in a graph. http://en.wikipedia.org/wiki/Shortest_path_problem
I would try A*, with a heuristic that includes the number of islands.

In my chess program I use some ideas which could probably be adapted to this problem.
Move Ordering. First find all possible moves, store them in a list, and sort them according to some heuristic. The "better" ones first, the "bad" ones last. For example, this could be a function of the size of the group (prefer medium-sized groups), or the number of adjacent colors, groups, etc.
Iterative Deepening. Instead of running a pure depth-first search, cut off the search after a certain depth and use some heuristic to assess the result. Now re-search the tree with "better" moves first.
Pruning. Don't search moves which seem "obviously" bad, according to some, again, heuristic. This involves the risk that you won't find the optimal solution anymore, but depending on your heuristics you will very likely find it much earlier.
Hash Tables. No need to store every board you come across, just remember a certain number and overwrite older ones.

I'm almost finished writing my version of the "solver" in Java. It does both exhaustive search, which takes fricking ages for larger board sizes, and a directed search based on a "pool" of possible paths, which is pruned after every generation, and a fitness function used to prune the pool. I'm just trying to tune the fitness function now...
Update - this is now available at http://bubblesolver.sourceforge.net/

This isn't my area of expertise, but I would like to recommend a book to you. Get a copy of The Algorithm Design Manual by Steven Skiena. This has a whole list of different algorithms, and once you read through it you can use it as a reference. If nothing else it will help you consider your options.
