Python list reordering, remember original order? - python

I'm working on a Bayesian probability project, in which I need to adjust probabilities based on new information. I have yet to find an efficient way to do this. What I'm trying to do is start with an equal probability list for distinct scenarios. Ex.
There are 6 people: E, T, M, Q, L, and Z, and their initial respective probabilities of being chosen are represented in
myList=[.1667, .1667, .1667, .1667, .1667, .1667]
New information surfaces that people in the first third alphabetically have a collective 70% chance of being chosen. A new list is made, sorted alphabetically by name (E, L, M, Q, T, Z), that just includes the new information (.7/.333 = 2.1, .3/.667 = .45):
newList = [2.1, 2.1, .45, .45, .45, .45]
I need a way to order the newList the same as myList so I can multiply the right values in a list comprehension and arrive at the adjusted probabilities. Having a single consistent order is important because the process will be repeated several times, each time with different criteria (vowels, closest to P, etc.), and on a list of about 1000 items.
Each newList could instead be a newDictionary, and then once the adjustment criteria are created they could be ordered into a list, but transforming multiple dictionaries seems inefficient. Is it? Is there a simple way to do this I'm entirely missing?
Thanks!

For what it's worth, the best thing you can do for the speed of your methods in Python is to use numpy instead of the standard types (you'll thus be using pre-compiled C code to perform the arithmetic operations). This will lead to a dramatic speed increase. Numpy arrays have fixed orderings anyway, and the syntax maps more directly onto mathematical operations. You just need to consider how to express the operations as matrix operations. E.g. your example:
import numpy as np

myList = np.ones(6) / 6.
newInfo = np.array([.7/2, .7/2, .3/4, .3/4, .3/4, .3/4])
result = myList * newInfo
Both vectors here sum to one, though their elementwise product in general will not (I'm not sure exactly what normalisation you were doing in your example, I confess, so if there's a subtlety I've missed let me know); if you need the result to sum to one again it's trivial:
result /= np.sum(result)
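To scale this to ~1000 items and many criteria, one option is to fix a canonical order once and always build the factor vector in that order. A rough sketch, where the names list, the apply_criterion helper and the example criterion are illustrative rather than taken from the question:

import numpy as np

names = ['E', 'T', 'M', 'Q', 'L', 'Z']            # canonical order, fixed once
probs = np.ones(len(names)) / len(names)

def apply_criterion(probs, factor_for_name):
    # build the factor vector in the canonical order, then renormalise
    factors = np.array([factor_for_name(n) for n in names])
    updated = probs * factors
    return updated / updated.sum()

# e.g. "first third alphabetically has a collective 70% chance"
first_third = set(sorted(names)[:len(names) // 3])
probs = apply_criterion(probs, lambda n: 2.1 if n in first_third else 0.45)

Because every factor vector is built in the same canonical order as probs, nothing ever has to be re-sorted between updates.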

Try storing your info as a list of tuples:
bayesList = [('E', 0.1667), ('M', 0.1667), ...]
your list comprehension can be along the lines of
newBayes = [(person, prob * normalizeFactor) for person, prob in bayesList]
where normalizeFactor was calculated before setting up your list comprehension
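Concretely, a hypothetical sketch using the names from the question (the factor dict is illustrative): the per-person factor is looked up by name, so the original list order never has to change.

bayesList = [('E', 1/6), ('T', 1/6), ('M', 1/6), ('Q', 1/6), ('L', 1/6), ('Z', 1/6)]
factor = {'E': 2.1, 'L': 2.1, 'M': 0.45, 'Q': 0.45, 'T': 0.45, 'Z': 0.45}

newBayes = [(person, prob * factor[person]) for person, prob in bayesList]

# renormalise so the probabilities sum to 1 again
total = sum(prob for _, prob in newBayes)
newBayes = [(person, prob / total) for person, prob in newBayes]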

Related

Checking validity of permutations in python

Using python, I would like to generate all possible permutations of 10 labels (for simplicity, I'll call them a, b, c, ...), and return all permutations that satisfy a list of conditions. These conditions have to do with the ordering of the different labels - for example, let's say I want to return all permutations in which a comes before b and when d comes after e. Notably, none of the conditions pertain to any details of the labels themselves, only their relative orderings. I would like to know what the most suitable data structure and general approach is for dealing with these sorts of problems. For example, I can generate all possible permutations of elements within a list, but I can't see a simple way to verify whether a given permutation satisfies the conditions I want.
"The most suitable data structure and general approach" varies, depending on the actual problem. I can outline three basic approaches to the problem you give (generate all permutations of 10 labels a, b, c, etc. in which a comes before b and d comes after e).
First, generate all permutations of the labels using itertools.permutations, and remove/skip the ones where a comes after b or d comes before e. Given a particular permutation p (represented as a Python tuple) you can check for
p.index("a") < p.index("b") and p.index("d") > p.index("e")
This has the disadvantage that you reject three-fourths of the permutations that are initially generated, and that expression involves four passes through the tuple. But this is simple and short and most of the work is done in the fast code inside Python.
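For illustration, the whole first method fits in a lazy generator expression (the labels here are just the ten letters from the question):

from itertools import permutations

labels = "abcdefghij"
valid = (p for p in permutations(labels)
         if p.index("a") < p.index("b") and p.index("d") > p.index("e"))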
Second, generate all permutations of the locations 0 through 9. Consider these to represent the inverses of your desired permutations. In other words, the number at position 0 is not what will go to position 0 in the permutation but rather shows where label a will go in the permutation. Then you can quickly and easily check for your requirements:
p[0] < p[1] and p[3] > p[4]
since a is the 0'th label, etc. If the permutation passes this test, then find the inverse permutation of this and apply it to your labels. Finding the inverse involves one or two passes through the tuple, so it makes fewer passes than the first method. However, this is more complicated and does more work outside the innards of Python, so it is very doubtful that this will be faster than the first method.
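For illustration, turning such a position tuple back into the actual permutation of labels can be done in a single pass; the helper below is my own sketch and folds the inversion and the application together:

def apply_positions(p, labels="abcdefghij"):
    out = [None] * len(p)
    for label_index, position in enumerate(p):
        out[position] = labels[label_index]   # label i goes to position p[i]
    return tuple(out)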
Third, generate only the permutations you need. This can be done with these steps.
3a. Note that there are four special positions in the permutations (those for a, b, d, and e). So use itertools.combinations to choose 4 positions out of the 10 total positions. Note I said positions, not labels, so choose 4 integers between 0 and 9.
3b. Use itertools.combinations again to choose 2 of those positions out of the 4 already chosen in step 3a. Place a in the first (smaller) of those 2 positions and b in the other. Place e in the first of the other 2 positions chosen in step 3a and place d in the other.
3c. Use itertools.permutations to choose the order of the other 6 labels.
3d. Interleave all that into one permutation. There are several ways to do that. You could make one pass through, placing everything as needed, or you could use slices to concatenate the various segments of the final permutation.
That third method generates only what you need, but the time involved in constructing each permutation is sizable. I do not know which of the methods would be fastest--you could test with smaller sizes of permutations. There are multiple possible variations for each of the methods, of course.
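Putting steps 3a-3d together, a hedged sketch of the third method (the function name and the interleaving details are my own choices) could look like:

from itertools import combinations, permutations

def constrained_permutations(labels="abcdefghij"):
    rest = [x for x in labels if x not in "abde"]
    n = len(labels)
    for special in combinations(range(n), 4):        # 3a: positions for a, b, d, e
        for ab in combinations(special, 2):          # 3b: a goes in the smaller of the two
            ed = tuple(pos for pos in special if pos not in ab)
            placed = {ab[0]: "a", ab[1]: "b", ed[0]: "e", ed[1]: "d"}
            free = [pos for pos in range(n) if pos not in placed]
            for perm in permutations(rest):          # 3c: order the other 6 labels
                out = [None] * n                     # 3d: interleave everything
                for pos, lab in placed.items():
                    out[pos] = lab
                for pos, lab in zip(free, perm):
                    out[pos] = lab
                yield tuple(out)

Since combinations yields its tuples in sorted order, a automatically lands before b and e before d, so no permutation is ever generated and then thrown away.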

Efficient way to get family of subsets by its number in Python

I have an n-element set, and I want to consider families of its k-element subsets, where each family has a fixed size s.
For example, if n = 3, k = 1, s = 2, we have these families:
{{1}, {2}}, {{1}, {3}}, {{2}, {3}}
In my problem n, k, s are not so small, e.g. s = n = 50, k = 20.
Let us say all such families are ordered lexicographically (or really in any clearly stated order). I want to have an efficient way to get a family by its number.
I thought of using itertools, but I am afraid it won't work with such big numbers. Possibly I need to implement something myself, but I have no clear understanding of how to do it. I have only the following idea: enumerate all k-element subsets of an n-element set (there is an efficient algorithm to get the i-th element by i). Then enumerate all s-element subsets of a comb(n, k)-element set, using the same operation. Now we need to generate a number in range (0, comb(comb(n, k), s)) and turn it first into the number of an s-element subset and then into a family of k-element sets.
However, such an approach looks a bit complicated. Maybe there is an easier one?
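One way to flesh out the two-level idea sketched above is a standard lexicographic unranking routine. The sketch below is illustrative (function names are mine) and assumes 0-indexed ranks:

from math import comb

def unrank_combination(rank, n, k):
    # the rank-th k-element subset of range(n), in lexicographic order
    subset, x = [], 0
    while k > 0:
        c = comb(n - x - 1, k - 1)     # number of subsets that start with x
        if rank < c:
            subset.append(x)
            k -= 1
        else:
            rank -= c
        x += 1
    return subset

def unrank_family(rank, n, k, s):
    # the rank-th family of s distinct k-subsets, via the two-level scheme above
    subset_ranks = unrank_combination(rank, comb(n, k), s)
    return [unrank_combination(r, n, k) for r in subset_ranks]

Both levels only do integer arithmetic with math.comb, so ranks as large as comb(comb(50, 20), 50) are handled by Python's arbitrary-precision ints; whether this particular ordering matches the one you need is the part to verify.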

Fastest inverse operation of index to label in a numpy array: a dictionary (hash) of label to index

I find myself consistently facing this problem in a couple of different scenarios. So I thought about sharing it here and see if there is an optimal way to solve it.
Suppose that I have a big array X of whatever, and another array y of the same size as X that holds the label each element of X belongs to. Like the following:
import numpy as np

X = np.array(['object1', 'object2', 'object3', 'object4', 'object5'])
y = np.array([0, 1, 1, 0, 2])
What I want is to build a dictionary / hash that uses the set of labels as keys and, as values, the indexes of all the objects in X that carry each label. So in this case the desired output will be:
{0: (array([0, 3]),), 1: (array([1, 2]),), 2: (array([4]),)}
Note that what is actually in X does not matter, but I included it for the sake of completeness.
Now, my naive solution for the problem is to iterate over all the labels and use np.where(labels == label) to build the dictionary. In more detail, I use this function:
def get_key_to_indexes_dic(labels):
    """
    Builds a dictionary whose keys are the labels and whose
    items are all the indexes that have that particular key
    """
    # Get the unique labels and initialize the dictionary
    label_set = set(labels)
    key_to_indexes = {}
    for label in label_set:
        key_to_indexes[label] = np.where(labels == label)
    return key_to_indexes
So now the core of my question:
Is there a way to do better? Is there a natural way to solve this using numpy functions? Is my approach misguided somehow?
As a lateral matter of less importance: what is the complexity of the solution defined above? I believe it is O(m*n + n), where m is the number of distinct labels and n = len(y). Or in words: the number of labels times the complexity of using np.where on an array of the size of y, plus the complexity of making a set out of an array. Is this correct?
P.D. I could not find a related post with this specific question; if you have suggestions to change the title or anything, I would be grateful.
You only need to traverse once if you use a dictionary to store the indexes as you go through:
from collections import defaultdict

def get_key_to_indexes_ddict(labels):
    indexes = defaultdict(list)
    for index, label in enumerate(labels):
        indexes[label].append(index)
    return indexes
The scaling is much as you analysed for your version; for the function above it's O(N), where N is the size of y, since checking whether a key is in a dictionary is O(1).
The interesting thing is that because np.where traverses the array so much faster, your function is faster as long as there are only a small number of labels. Mine seems faster when there are many distinct labels.
Here is how the functions scale:
The blue lines are your function, the red lines are mine. The line styles indicate the number of distinct labels. {10: ':', 100: '--', 1000: '-.', 10000: '-'}. You can see that my function is relatively independent of number of labels, while yours quickly becomes slow when there are many labels. If you have few labels, you're better off with yours.
The numpy_indexed package (disclaimer: I am its author) can be used to solve such problems in a fully vectorized manner, and having O(nlogn) worst case time-complexity:
import numpy as np
import numpy_indexed as npi

indices = np.arange(len(labels))
unique_labels, indices_per_label = npi.group_by(labels, indices)
Note that for many common applications of such functionality, such as computing a sum or mean over group labels, it is more efficient not to compute the split list of indices, but to make use of the functions for that in npi; ie, npi.group_by(labels).mean(some_corresponding_array), rather than looping through indices_per_label and taking the mean over those indices.
Assuming that the labels are consecutive integers [0, m] and taking n = len(labels), the complexity for set(labels) is O(n) and the complexity for np.where in the loop is O(m*n). However, the overall complexity is written as O(m*n) and not O(m*n + n), see "Big O notation" on wikipedia.
There are two things you can do to improve the performance: 1) use a more efficient algorithm (lower complexity) and 2) replace Python loops with fast array operations.
The other answers currently posted do exactly this, and with very sensible code. However an optimal solution would be both fully vectorized and have O(n) complexity. This can be accomplished using a certain lower level function from Scipy:
import numpy as np

def sparse_hack(labels):
    from scipy.sparse._sparsetools import coo_tocsr
    labels = labels.ravel()
    n = len(labels)
    nlabels = np.max(labels) + 1
    indices = np.arange(n)
    sorted_indices = np.empty(n, int)
    offsets = np.zeros(nlabels + 1, int)
    dummy = np.zeros(n, int)
    coo_tocsr(nlabels, 1, n, labels, dummy, indices,
              offsets, dummy, sorted_indices)
    return sorted_indices, offsets
The source for coo_tocsr can be found here. The way I used it, it essentially performs an indirect counting sort. To be honest, this is a rather obscure method and I advise you to use one of the approaches in the other answers.
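For what it's worth, the pair it returns can be mapped back to the dictionary from the question, assuming (as above) that the labels are consecutive integers and that offsets acts as a CSR-style index pointer, which is how I read coo_tocsr:

sorted_indices, offsets = sparse_hack(labels)
key_to_indexes = {label: sorted_indices[offsets[label]:offsets[label + 1]]
                  for label in range(len(offsets) - 1)}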
I've also struggled to find a "numpythonic" way to solve this type of problem. This is the best approach I've come up with, although requiring a bit more memory:
import numpy

def get_key_to_indexes_dict(labels):
    indices = numpy.argsort(labels)
    bins = numpy.bincount(labels)
    indices = numpy.split(indices, numpy.cumsum(bins[bins > 0][:-1]))
    return dict(zip(numpy.unique(labels), indices))
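For instance, applied to the labels from the question (note the values come back as plain arrays rather than the one-element tuples np.where produces):

labels = numpy.array([0, 1, 1, 0, 2])
get_key_to_indexes_dict(labels)
# {0: array([0, 3]), 1: array([1, 2]), 2: array([4])}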

What is the fastest way of computing powerset of an array in Python?

Given a list of numbers, e.g. x = [1,2,3,4,5] I need to compute its powerset (set of all subsets of that list). Right now, I am using the following code to compute the powerset, however when I have a large array of such lists (e.g. 40K of such arrays), it is extremely slow. So I am wondering if there can be any way to speed this up.
superset = [sorted(x[:i]+x[i+s:]) for i in range(len(x)) for s in range(len(x))]
I also tried the following code, however it is much slower than the code above.
from itertools import chain, combinations

def powerset(x):
    xx = list(x)
    return chain.from_iterable(combinations(xx, i) for i in range(len(xx) + 1))
You can represent a powerset more efficiently by having all subsets reference the original set as a list and having each subset carry a number whose bits indicate inclusion in the set. Thus you can enumerate the power set by computing the number of elements and then iterating through the integers with that many bits. However, as has been noted in the comments, the power set grows extremely fast, so if you can avoid having to compute or iterate through the power set, you should do so if at all possible.
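A minimal sketch of that bitmask representation (the function name is mine, not a standard API):

def bitmask_powerset(x):
    # bit i of mask marks whether x[i] is included in the subset
    n = len(x)
    for mask in range(1 << n):
        yield mask, [x[i] for i in range(n) if mask >> i & 1]

Only the mask integers need to be stored; the element lists can be materialised lazily, which is where the memory saving comes from.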

Filter list to remove similar, but not identical, entries

I have a long list containing several thousand names that are all unique strings, but I would like to filter them to produce a shorter list so that if there are similar names only one is retained. For example, the original list could contain:
Mickey Mouse
Mickey M Mouse
Mickey M. Mouse
The new list would contain just one of them - it doesn't really matter which at this moment in time. It's possible to get a similarity score using the code below (where a and b are the text being compared), so provided I pick an appropriate ratio I have a way of making an include/exclude decision.
difflib.SequenceMatcher(None, a, b).ratio()
What I'm struggling to work out is how to populate the second list from the first one. I'm sure it's a trivial matter, but it's baffling my newbie brain.
I'd have thought something along the lines of this would have worked, but nothing ends up being populated in the second list.
for p in ppl1:
    for pp in ppl2:
        if difflib.SequenceMatcher(None, p, pp).ratio() <= 0.9:
            ppl2.append(p)
In fact, even if that did populate the list, it'd still be wrong. I guess it'd need to compare the name from the first list to all the names in the second list, keep track of the highest ratio scored, and then only add it if the highest ratio was less than the cutoff criterion.
Any guidance gratefully received!
I'm going to risk never getting an accept because this may be too advanced for you, but here's the optimal solution.
What you're trying to do is a variant of agglomerative clustering. A union-find algorithm can be used to solve this efficiently. From all pairs of distinct strings a and b, which can be generated using
def pairs(l):
    for i, a in enumerate(l):
        for j in range(i + 1, len(l)):
            yield (a, l[j])
you keep the pairs whose similarity ratio meets your cutoff (.9 here), i.e. the pairs that are actually similar:
similar = ((a, b) for a, b in pairs(l)
           if difflib.SequenceMatcher(None, a, b).ratio() >= .9)
then union those in a disjoint-set forest. After that, you loop over the sets to get their representatives.
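Putting the pieces together, a hedged sketch of the whole approach (dedupe_similar and the path-halving find are my own naming and implementation choices, and the cutoff is an assumption):

import difflib

def dedupe_similar(names, cutoff=0.9):
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]     # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    # union every pair of sufficiently similar names
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if difflib.SequenceMatcher(None, names[i], names[j]).ratio() >= cutoff:
                union(i, j)

    # one representative per cluster: the names that are their own root
    return [name for i, name in enumerate(names) if find(i) == i]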
Firstly, you shouldn't modify a list while you're iterating over it.
One strategy would be to go through all pairs of names and, if a certain pair is too similar to each other, only keep one, and then iterate this until no two pairs are too similar. Of course, the result would now depend on the initial order of the list, but if your data is sufficiently clustered and your similarity score metric sufficiently nice, it should produce what you're looking for.
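If you prefer something simpler than union-find, one way to realise that strategy is a single greedy pass that keeps a name only if it is not too similar to anything already kept (the cutoff value is an assumption):

import difflib

def filter_similar(names, cutoff=0.9):
    kept = []
    for name in names:
        if all(difflib.SequenceMatcher(None, name, k).ratio() < cutoff for k in kept):
            kept.append(name)
    return kept

The result depends on the initial order of the list, as noted above.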
