Find sublist in list with different scaling / fuzzy pattern recognition - python

I have two lists, each containing an ordered set of numbers.
One list is small (~5-20 elements), the other is large (~5000). The lists have a different "scaling", and points may be missing from one list or the other, but in general most elements will appear in both lists.
I'm looking for a method to detect the position and the "scaling" between the two lists such that the distance between the two lists is minimal.
An example would be:
l1 = [100., 200., 400.]
l2 = [350., 1000., 2003., 3996., 7500., 23000.]
The scale would be 10, and the position of l1 in l2 is 1.
The list 10.*l1 appears at position 1 within l2; the lists have a distance of 7 (this depends on the metric I choose; here I just summed the absolute differences between the matched elements).
I'm wondering whether there are already methods out there, e.g. in pattern recognition, that I can use (preferably in Python). This seems like it could be a common problem when comparing patterns with unknown scaling factors, but I couldn't find a good keyword that describes it.
The application is to identify measured spectroscopic lines by comparing them to a catalog of the positions of known lines, thereby converting the unphysical unit "pixel on the detector" into actual wavelengths.
In principle I could already provide a decent guess of the scaling factor, but I suspect this will not be necessary, as the solution should be unique in most cases.
Any help is appreciated,
Julian

The problem you're trying to solve is an optimization over two degrees of freedom: the scale and the index. In its broadest form, your problem is difficult to solve efficiently, but a few things could simplify the calculation. First, are both sets sorted? Second, are you looking for a consecutive run in the second list that matches the first, or not? To explain that with an example, given 1, 2, 3 and 2, 3, 4, 6: is the scale better as 2 (skipping the 3 in the second list) or as 1.something (not skipping the 3)? Third, what weighting do you want to use to measure the difference between the two (linear sum, root mean square, etc.)?
If you can provide some of these details I may be able to give you a better idea of some things to try.
UPDATE
So based on your comment, you can skip values. That actually makes this problem very hard: in the worst case it is O(2^n), because you are essentially looking at all combinations of list one against list two.
Even though the fact that both lists are sorted lets you prune some of the work, you will still have to check a lot of combinations.
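For what it's worth, here is a rough brute-force sketch of one way to attack this (my own illustration, not an established method): anchor the first element of l1 at every element of l2, derive a candidate scale from that pairing, and score the candidate by matching each scaled element of l1 to its nearest neighbour in l2. It assumes l1[0] actually has a counterpart in l2.
import bisect

def best_scale_and_position(l1, l2):
    l2 = sorted(l2)
    best = (float("inf"), None, None)      # (distance, scale, position)
    for pos, anchor in enumerate(l2):
        scale = anchor / l1[0]             # candidate scale from this pairing
        dist = 0.0
        for v in l1:
            target = scale * v
            i = bisect.bisect_left(l2, target)
            # nearest neighbour of the scaled value in l2
            neighbours = [l2[k] for k in (i - 1, i) if 0 <= k < len(l2)]
            dist += min(abs(target - c) for c in neighbours)
        if dist < best[0]:
            best = (dist, scale, pos)
    return best

l1 = [100., 200., 400.]
l2 = [350., 1000., 2003., 3996., 7500., 23000.]
print(best_scale_and_position(l1, l2))     # (7.0, 10.0, 1)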

Related

Get paths of minimum MSE python

I have a list of lists of vectors.
[
[[1,2,3],[7,8,5]],
[[7,8,9],[2,8,6],[2,6,3]],
[[7,5,1],[1,7,3],[6,1,1],[5,2,7]]
]
For each of the vectors in the first list, I want to extract the path of minimum distance (MSE) across the vectors in each subsequent list.
For example, for the first element in the first list, I should obtain this path:
[1,2,3] -> [2,6,3] -> [1,7,3]
in terms of indexes:
[0,2,1]
I should obtain such a path for each element in the first list. The lists are huge and the real vectors have about 300 elements.
Is there some Pythonic method that avoids hard iterating with for loops?
My algorithms knowledge is a little limited. I don't think there is any particular Python-specific best method for this. The comment Rock LI made is accurate: there's a million-dollar prize if you can find the best method for this. Implement Dijkstra's algorithm, or whatever your favorite search method is; you can automatically calculate the weights from one list to the next. Beyond that it's pure algorithms.
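As a starting point, here is a greedy nearest-neighbour sketch (my own illustration, not the Dijkstra approach suggested above): at each step it picks the vector in the next list with the smallest squared distance to the current one. It reproduces the [0, 2, 1] path from the example, but a greedy walk is not guaranteed to minimise the total distance over the whole path.
import numpy as np

lists = [
    np.array([[1, 2, 3], [7, 8, 5]]),
    np.array([[7, 8, 9], [2, 8, 6], [2, 6, 3]]),
    np.array([[7, 5, 1], [1, 7, 3], [6, 1, 1], [5, 2, 7]]),
]

def greedy_path(start, later_lists):
    path, current = [], start
    for layer in later_lists:
        dists = ((layer - current) ** 2).sum(axis=1)   # squared distance to every vector
        idx = int(dists.argmin())
        path.append(idx)
        current = layer[idx]
    return path

for i, vec in enumerate(lists[0]):
    print([i] + greedy_path(vec, lists[1:]))           # [0, 2, 1] then [1, 0, 3]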

Nested list and dictionary efficiency

I'm working on a project that requires a 2D map with a list for every possible x and y coordinate on that map. Seeing as the map dimensions are constant, which is faster for creating, searching, and changing values?
Let's say that I have a 2x2 grid with a total of 4 positions, each storing 2 bits (0, 1, 2 or 3). Would having [0b00, 0b00, 0b00, 0b01] represent the grid be better than [[0b00, 0b00], [0b00, 0b01]] in terms of efficiency and readability?
I assumed that the first method would be quicker for creation and for iterating over all of the values, but that the second would be faster for looking up a given position (so listName[1][0] is easier to work out than listName[2]).
To clarify, I want to know which is more memory- and CPU-efficient for the three listed uses and (if it isn't too much trouble) why. Further, the actual lists I'm using are 4096x4096 (a total of 17 Mb of raw data).
Note: I DO already plan on splitting the 4096x4096 grid into sectors that will be part of a nested list; I'm just asking whether x and y should be on the same nesting level.
Thanks.
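To make the two layouts concrete, here is a small sketch (variable names are hypothetical): a flat row-major list addresses a cell with index arithmetic, while a nested list uses double indexing.
WIDTH = 2
flat = [0b00, 0b00, 0b00, 0b01]           # row-major: index = y * WIDTH + x
nested = [[0b00, 0b00], [0b00, 0b01]]     # nested[y][x]

x, y = 0, 1
assert flat[y * WIDTH + x] == nested[y][x]   # both refer to the same cell
flat[y * WIDTH + x] = 0b11                   # writes work the same way
nested[y][x] = 0b11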

Time complexity of python "set.intersection" for n sets

I want to know the complexity of Python's set.intersection. I looked in the documentation and the online wikis for Python, but I did not find the time complexity of this method for multiple sets.
The Python wiki on time complexity lists a single intersection as O(min(len(s), len(t))), where s and t are the two sets. (In English: the time is bounded by, and linear in, the size of the smaller set.)
Note: based on the comments below, this wiki entry was wrong if the argument passed is not a set. I've since corrected the wiki entry.
If you have n sets (sets, not iterables), you'll do n-1 intersections, and the total time can be
(n-1) * O(len(s)), where s is the set with the smallest size.
Note that as you intersect, the result may get smaller, so although this O bound is the worst case, in practice the time will be better.
However, looking at the specific code, this idea of taking the min() only applies to a single pair of sets and doesn't extend to multiple sets, so in that case we have to be pessimistic and take s as the set with the largest size.
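As a small illustration of the point about sizes (the set contents here are made up), folding the intersection from the smallest set outward keeps every intermediate result small:
from functools import reduce

sets = [set(range(1000)), set(range(0, 100, 2)), set(range(0, 10000, 3))]

# Start from the smallest set so no intermediate result can exceed its size.
result = reduce(set.intersection, sorted(sets, key=len))
print(sorted(result))    # multiples of 6 below 100: [0, 6, 12, ..., 96]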

Which data structure is appropriate for this?

I have a line in my code that currently does this at each step x:
myList = [(lo,hi) for lo,hi in myList if lo <= x <= hi]
This is pretty slow. Is there a more efficient way to eliminate things from a list that don't contain a given x?
Perhaps you're looking for an interval tree. From Wikipedia:
In computer science, an interval tree is an ordered tree data structure to hold intervals. Specifically, it allows one to efficiently find all intervals that overlap with any given interval or point.
So, instead of storing the (lo, hi) pairs sequentially in a list, you would have them define the intervals in an interval tree. Then you could perform queries on the tree with x, and retain only the intervals that overlap x.
While you don't give much context, I'll assume the rest of the loop looks like:
for x in xlist:
myList = [(lo,hi) for lo,hi in myList if lo <= x <= hi]
In this case, it may be more efficient to construct an interval tree (http://en.wikipedia.org/wiki/Interval_tree) first. Then, for each x, you walk the tree and find all intervals which intersect x, adding those intervals to a set as you find them.
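As a concrete sketch (this assumes the third-party intervaltree package, installed with pip install intervaltree, and made-up data), a point query returns exactly the intervals containing x:
from intervaltree import IntervalTree

pairs = [(1, 5), (3, 9), (10, 12)]        # hypothetical (lo, hi) pairs
tree = IntervalTree()
for lo, hi in pairs:
    # intervaltree treats intervals as half-open [lo, hi), so widen hi
    # if you need the inclusive behaviour of lo <= x <= hi.
    tree.addi(lo, hi + 1, (lo, hi))

x = 4
myList = [iv.data for iv in tree[x]]      # all stored pairs whose interval contains x
print(myList)                             # [(1, 5), (3, 9)] in some order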
Here I'm going to suggest what may seem like a really dumb solution favoring micro-optimizations over algorithmic ones. It'll depend on your specific needs.
The ultimate question is this: is a single linear pass over your array (list in Python), on average, expensive? In other words, is searching for lo/hi pairs that contain x generally going to yield results that are very small (e.g. 1% of the overall size of the list), or relatively large (e.g. 25% or more of the original list)?
If the answer is the latter, you might actually get a more efficient solution keeping a basic, contiguous, cache-friendly representation that you're accessing sequentially. The hardware cache excels at plowing through contiguous data where multiple adjacent elements fit into a cache line sequentially.
What you want to avoid in such a case is the expensive linear-time removal from the middle of the array as well as possibly the construction of a new one. If you trigger a linear-time operation for every single individual element you remove from the array, then naturally that's going to get very expensive very quickly.
To exchange that linear-time operation for a much faster constant-time one, all we have to do when we want to remove an element at a certain index in the array is to overwrite the element at that index with the element at the back of the array (last element). Now simply remove the redundant duplicate from the back of the array (a removal from the back of an array is a constant-time operation, and often involves just basic arithmetical instructions).
If your needs fit the criteria, then this can actually give you better results than a smarter algorithm. It's one of the peculiar cases where the practice can trump the theory due to the skewed performance of the hardware cache over DRAM, but if you're performing these types of hi/lo queries repeatedly and wanting to get very narrow results, then something smarter like an interval tree or at least sorting the data to allow binary searches can be considerably better.
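A minimal sketch of the swap-and-pop filtering described above (the data is made up); note that it does not preserve the order of the surviving intervals:
intervals = [(1, 5), (3, 9), (10, 12), (2, 4)]
x = 4

i = 0
while i < len(intervals):
    lo, hi = intervals[i]
    if lo <= x <= hi:
        i += 1                        # keep this interval and move on
    else:
        intervals[i] = intervals[-1]  # overwrite with the last element...
        intervals.pop()               # ...then drop the duplicate at the back: O(1)

print(intervals)                      # [(1, 5), (3, 9), (2, 4)]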

Efficient way to find number of distinct elements in a list

I'm trying to do K-Means Clustering using Kruskal's Minimum Spanning Tree algorithm. My original design was to run the full Kruskal algorithm on the input and produce an MST, and afterwards delete the last k-1 edges added (or, equivalently, the k-1 most expensive edges).
Of course this is the same as running Kruskal's algorithm and stopping it just before it adds its last k-1 edges.
I want to use the second strategy, i.e. instead of running the full-length Kruskal algorithm, stop it as soon as the number of clusters so far equals K. I'm using a Union-Find data structure backed by a list object.
Each vertex of the graph is represented by its current cluster in this list, e.g. [1, 2, 3, ...] means vertices 1, 2, 3 are each in their own distinct cluster. If two vertices are joined, their corresponding entries in the list are updated to reflect this,
e.g. merging vertices 2 and 3 leaves the list as [1, 2, 2, 4, 5, ...].
My strategy is then: every time two nodes are merged, count the number of DISTINCT elements in the list, and if it equals the desired number of clusters, stop. My worry is that this may not be the most efficient option. Is there a way I could count the number of distinct objects in a list efficiently?
The easiest and probably most efficient way is
len(set(l))
where l is the list. You can consider storing the data in sets instead of lists in the first place, if it is appropriate.
Note that for this to work the elements of l have to be hashable, which is guaranteed for numbers, but not for generic "objects".
One way is to sort your list and then scan over the elements, comparing each one to the previous one; if they are not equal, add 1 to your "distinct" counter. The scan itself is O(n), and for the sorting you can use whichever algorithm you prefer, such as quicksort or merge sort, but I guess there is a sorting routine available in the library you use.
Another option is to create a hash table and add all the elements. The number of elements actually inserted will be the number of distinct elements, since repeated elements will not be inserted again. Each insertion is O(1) on average, so the whole pass takes expected O(n) time, which is probably the better solution. Good luck!
Hope this helps,
Dídac Pérez
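A minimal sketch of the sort-and-scan count described in the answer above (the function name and test list are my own):
def count_distinct_sorted(items):
    if not items:
        return 0
    items = sorted(items)             # O(n log n)
    distinct = 1
    for prev, cur in zip(items, items[1:]):
        if cur != prev:               # a new value starts here
            distinct += 1
    return distinct

print(count_distinct_sorted([1, 2, 2, 4, 5]))   # 4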
