Get paths of minimum MSE - python

I have a list of list of vectors.
[
[[1,2,3],[7,8,5]],
[[7,8,9],[2,8,6],[2,6,3]],
[[7,5,1],[1,7,3],[6,1,1],[5,2,7]]
]
For each of the vectors in the first list, I want to extract the path of minimum distance (MSE) across the vectors in each subsequent list.
For example, for the first element in the first list, I should obtain this path:
[1,2,3] -> [2,6,3] -> [1,7,3]
in terms of indexes:
[0,2,1]
I should obtain this path for each element in the first list. The lists are huge and the real vectors have about 300 elements each.
Is there some pythonic method that avoids explicit iteration with for loops?

My algorithms knowledge is a little limited. I don't think there is any particular Python-specific best method for this. The comment Rock LI made is accurate: there's a million-dollar prize if you can find the best method for this. Implement Dijkstra's algorithm or whatever your favorite search method is. You can auto-calculate the weights from one list to the next; beyond that it's pure algorithms.
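For illustration, here is a minimal numpy sketch of the greedy variant (an assumption: "path" means picking, level by level, the vector with the smallest MSE to the previous pick), which reproduces the example path from the question. A globally optimal path would still need a Dijkstra-style or dynamic-programming search; the function name is hypothetical.

import numpy as np

def greedy_min_mse_path(start, levels):
    """start: 1-D vector; levels: the remaining lists of vectors."""
    current = np.asarray(start, dtype=float)
    path = []
    for level in levels:
        level = np.asarray(level, dtype=float)
        # MSE from the current vector to every candidate in this level
        dists = ((level - current) ** 2).mean(axis=1)
        idx = int(dists.argmin())
        path.append(idx)
        current = level[idx]
    return path

data = [
    [[1, 2, 3], [7, 8, 5]],
    [[7, 8, 9], [2, 8, 6], [2, 6, 3]],
    [[7, 5, 1], [1, 7, 3], [6, 1, 1], [5, 2, 7]],
]
# For the first vector of the first list: indexes [2, 1],
# i.e. the path [1,2,3] -> [2,6,3] -> [1,7,3] from the question.
print(greedy_min_mse_path(data[0][0], data[1:]))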

Related

Solving TSP with GA: Should a distance matrix speed up run-time?

I am trying to write a GA in Python to solve TSP. I would like to speed it up, because right now it takes 24 seconds to run 200 generations with a population size of 200.
I am using a map with 29 cities. Each city has an id and (x,y) coordinates.
I tried implementing a distance matrix, which calculates all the distances once and stores them in a list. So instead of calculating the distance using the sqrt() function 1M+ times, it only uses the function 406 times. Every time a distance between two cities is required, it is just retrieved from the matrix using the ids of the two cities as the index.
But even with this, it takes just as much time. I thought sqrt() would be more expensive than just indexing a list. Is it not? Would a dictionary make it faster?
The short answer:
Yes. A dictionary would make it faster.
The long answer:
Let's say you pre-process and calculate all the distances once - great! Now, let's say I want to find the distance between A and B. All I have to do now is find that distance where I stored it - it is in the list!
What is the time complexity of finding something in a list? That's right - O(n).
And how many times am I going to do that? My guess, according to your question: 1M+ times.
Now, that is a huge problem. I suggest you use a dictionary so you can look up the pre-calculated distance between any two cities in O(1).
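As an illustration of the dictionary approach, here is a minimal sketch (the names cities and dist are hypothetical), assuming each city has an id and (x, y) coordinates:

import math

# Pre-compute every pairwise distance once; frozenset keys make (a, b) == (b, a).
cities = {0: (0.0, 0.0), 1: (3.0, 4.0), 2: (6.0, 8.0)}

dist = {
    frozenset((a, b)): math.hypot(xa - xb, ya - yb)
    for a, (xa, ya) in cities.items()
    for b, (xb, yb) in cities.items()
    if a < b
}

# O(1) average-case lookup during fitness evaluation
print(dist[frozenset((0, 2))])  # 10.0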

Iterative Divide and Conquer algorithms

I am trying to create an algorithm using the divide-and-conquer approach but using an iterative algorithm (that is, no recursion).
I am confused as to how to approach the loops.
I need to break up my problem into smaller subproblems until I hit a base case. I assume this is still true, but I am not sure how I can (without recursion) use the smaller subproblems to solve the bigger problem.
For example, I am trying to come up with an algorithm that will find the closest pair of points (in one-dimensional space - though I intend to generalize this on my own to higher dimensions). If I had a function closest_pair(L) where L is a list of integer co-ordinates in ℝ, how could I come up with a divide and conquer ITERATIVE algorithm that can solve this problem?
(Without loss of generality I am using Python)
The cheap way to turn any recursive algorithm into an iterative one is to take the recursive function, put it in a loop, and manage your own stack. This eliminates the function-call overhead and avoids saving any unneeded data on the call stack. However, this is not usually the "best" approach ("best" depends on the problem and context).
The way you've worded your problem, it sounds like the idea is to break the list into sublists, find the closest pair in each, and then take the closest pair out of those two results. To do this iteratively, I think a better approach than the generic one mentioned above is to start the other way around: look at lists of size 3 (there are three pairs to look at) and work your way up from there. Note that lists of size 2 are trivial.
Lastly, if your coordinates are integers, they are in ℤ (a much smaller subset of ℝ).
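As a sketch of that bottom-up idea for the 1-D case (purely illustrative; in 1-D, sorting and scanning adjacent pairs is simpler), start from singleton blocks and repeatedly merge adjacent blocks, checking the single pair that crosses each boundary:

def closest_pair_1d(points):
    pts = sorted(points)
    if len(pts) < 2:
        return float("inf")
    # start from blocks of size 1; a single point has no pair yet
    blocks = [[p] for p in pts]
    best = [float("inf")] * len(blocks)
    while len(blocks) > 1:
        merged_blocks, merged_best = [], []
        for i in range(0, len(blocks) - 1, 2):
            left, right = blocks[i], blocks[i + 1]
            # best of each half, plus the one pair crossing the boundary
            cross = right[0] - left[-1]
            merged_best.append(min(best[i], best[i + 1], cross))
            merged_blocks.append(left + right)
        if len(blocks) % 2:           # carry an odd leftover block forward
            merged_blocks.append(blocks[-1])
            merged_best.append(best[-1])
        blocks, best = merged_blocks, merged_best
    return best[0]

print(closest_pair_1d([7, 1, 12, 4, 10]))  # -> 2 (between 10 and 12)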

Find neighbour tuples

I'm looking for an algorithm but am missing the right keywords to get an overview. What I am trying to build is a function that finds correlations/patterns/... in a dataset of tuples (simplified). For example:
dataset = (('a','b','c'), ('1','a'), ('x','y','b','c'))
print(magic(1.0, dataset))
-> ('b','c')
As you can see, the function should return pairs of elements that always appear together (1.0 = 100%) or with a specific probability.
Can anybody please tell me which group of algorithms would suit my problem? Maybe point me to a lib that does the work and is tested? :)
Have a look at Frequent Itemset Mining (FIM) and Association rule mining.
In your question, you are essentially interested in association rules of the type A -> B with confidence 100%.
In particular, look at the APRIORI algorithm if you are interested in co-occurrences of more than three items.
Note that if you only want pairs, APRIORI boils down to scanning your database twice to count all pairs; you don't gain anything by pruning. Depending on the sparsity of your data, intersecting inverted lists can be much, much faster.
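If only pairs are needed, the counting the answer describes can be sketched directly in Python. The name magic is taken from the question; interpreting the probability as the fraction of records containing either item that contain both is an assumption:

from collections import Counter
from itertools import combinations

def magic(min_prob, dataset):
    item_counts, pair_counts = Counter(), Counter()
    for record in dataset:
        items = set(record)
        item_counts.update(items)
        pair_counts.update(frozenset(p) for p in combinations(sorted(items), 2))
    result = []
    for pair, together in pair_counts.items():
        a, b = tuple(pair)
        # fraction of records containing either item that contain both
        either = item_counts[a] + item_counts[b] - together
        if together / either >= min_prob:
            result.append(tuple(sorted(pair)))
    return result

dataset = (('a', 'b', 'c'), ('1', 'a'), ('x', 'y', 'b', 'c'))
print(magic(1.0, dataset))  # -> [('b', 'c')]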

Find sublist in list with different scaling / fuzzy pattern recognition

I have two lists, each containing an ordered set of numbers.
One list is small (~ 5 - 20 elements) the other one is large (~ 5000). The lists have a different "scaling" and there might be points missing in one or the other list. In general most elements will be in both lists.
I'm looking for a method to detect the position and the "scaling" between the two lists, such that the distance between the two lists has a minimum.
An example would be:
l1 = [ 100., 200., 400.]
l2 = [ 350., 1000., 2003., 3996., 7500., 23000.]
The scale would be 10.0 and the position of l1 in l2 would be 1.
The list 10.0*l1 appears at position 1 within l2; the lists have a distance of 7 (this depends on the metric I choose; here I just summed up the differences between all elements).
I'm wondering if there are already methods out there e.g. in pattern recognition which I can use (preferably in python). It seems to me that this could be a common problem when comparing patterns with unknown scaling factors. But I couldn't find a good keyword which describes my problem.
The application of this is to identify measured spectroscopic lines by comparing them to a catalog of the positions of known lines and therefore converting the unphysical unit "pixel on the detector" to actual wavelength.
In principle I could already provide a decent guess of the scaling factor of the two lists, but I guess this will not be necessary, as the solutions should be unique in most cases.
Any help is appreciated,
Julian
The problem you're trying to solve is an optimization in two degrees of freedom: the first is the scale and the second is the index (position). The broad version of your problem is generally difficult to solve efficiently. However, there are a few things that could simplify the calculations. First, are both lists sorted? Second, are you looking for a consecutive run in the second list that matches the first, or not? To explain further, consider 1, 2, 3 and 2, 3, 4, 6: is the scale better as 2 (skipping the 3 in the second list) or 1.something (not skipping the 3)? Third, what weighting do you want to use to measure the difference between the two (linear sum, root mean square, etc.)?
If you can provide some of these details I may be able to give you a better idea of some things to try.
UPDATE
So, based on your comment, you can skip values. That actually makes this problem very difficult to solve, on the order of O(2^n), because you are basically looking at all combinations of list one with list two.
Even though you can optimize some aspects of this problem because the lists are sorted, you will still have to check a lot of combinations.
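For the simpler consecutive (no-skip) case, a brute-force sketch is cheap: slide l1 over every window of l2, fit a least-squares scale per window, and keep the window with the smallest summed absolute difference. Function and variable names here are illustrative:

def best_alignment(l1, l2):
    best = (float("inf"), None, None)   # (distance, position, scale)
    n, m = len(l1), len(l2)
    for pos in range(m - n + 1):
        window = l2[pos:pos + n]
        # least-squares scale s minimizing sum (s*l1[i] - window[i])^2
        scale = sum(a * b for a, b in zip(l1, window)) / sum(a * a for a in l1)
        dist = sum(abs(scale * a - b) for a, b in zip(l1, window))
        if dist < best[0]:
            best = (dist, pos, scale)
    return best

l1 = [100., 200., 400.]
l2 = [350., 1000., 2003., 3996., 7500., 23000.]
print(best_alignment(l1, l2))   # roughly (distance ~7, position 1, scale ~10)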

Efficient way to find number of distinct elements in a list

I'm trying to do K-Means Clustering using Kruskal's Minimum Spanning Tree Algorithm. My original design was to run the full Kruskal algorithm on the input to produce an MST, and then delete the last k-1 edges added (or equivalently, the k-1 most expensive edges).
Of course, this is the same as running Kruskal's algorithm and stopping it just before it adds its last k-1 edges.
I want to use the second strategy, i.e. instead of running the full-length Kruskal algorithm, stop it as soon as the number of clusters so far equals K. I'm using a Union-Find data structure backed by a list object.
Each vertex of the graph is represented by its current cluster in this list, e.g. [1,2,3...] means vertices 1, 2, 3 are in their own distinct clusters. If two vertices are joined, their corresponding indices in the list are updated to reflect this,
e.g. merging vertices 2 and 3 leaves the list as [1,2,2,4,5.....]
My strategy is then, every time two nodes are merged, to count the number of DISTINCT elements in the list and stop if it equals the number of desired clusters. My worry is that this may not be the most efficient option. Is there a way I could count the number of distinct objects in a list efficiently?
The easiest and probably most efficient is
len(set(l))
where l is the list. You can consider storing the data in sets instead of lists in the first place, if it is appropriate.
Note that for this to work the elements of l have to be hashable, which is guaranteed for numbers, but not for generic "objects".
One way is to sort your list and then run over the elements, comparing each one to the previous one. If they are not equal, add 1 to your "distinct counter". The scan is O(n), and for sorting you can use whichever algorithm you prefer, such as quicksort or mergesort, but I guess there is an available sorting algorithm in the library you use.
Another option is to create a hash table and add all the elements. The number of insertions will be the number of distinct elements, since repeated elements will not be inserted. Each insertion is O(1) on average, so maybe this is the better solution. A small sketch of both approaches follows below. Good luck!
Hope this helps,
Dídac Pérez
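A small sketch of both counting strategies from the answers, using the cluster list from the question ([1, 2, 2, 4, 5] after merging vertices 2 and 3):

clusters = [1, 2, 2, 4, 5]

# 1) Hash-based: build a set and take its length (elements must be hashable).
print(len(set(clusters)))            # -> 4

# 2) Sort-and-scan: sort, then count positions where the value changes.
def count_distinct_sorted(values):
    if not values:
        return 0
    ordered = sorted(values)
    distinct = 1
    for prev, cur in zip(ordered, ordered[1:]):
        if cur != prev:
            distinct += 1
    return distinct

print(count_distinct_sorted(clusters))   # -> 4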
