Using python, I would like to generate all possible permutations of 10 labels (for simplicity, I'll call them a, b, c, ...), and return all permutations that satisfy a list of conditions. These conditions have to do with the ordering of the different labels - for example, let's say I want to return all permutations in which a comes before b and when d comes after e. Notably, none of the conditions pertain to any details of the labels themselves, only their relative orderings. I would like to know what the most suitable data structure and general approach is for dealing with these sorts of problems. For example, I can generate all possible permutations of elements within a list, but I can't see a simple way to verify whether a given permutation satisfies the conditions I want.
"The most suitable data structure and general approach" varies, depending on the actual problem. I can outline three basic approaches to the problem you give (generate all permutations of 10 labels a, b, c, etc. in which a comes before b and d comes after e).
First, generate all permutations of the labels, using itertools.permutations, remove/skip over the ones where a comes after b and d comes before e. Given a particular permutation p (represented as a Python tuple) you can check for
p.index("a") < p.index("b") and p.index("d") > p.index("e")
This has the disadvantage that you reject three-fourths of the permutations that are initially generated, and that expression involves four passes through the tuple. But this is simple and short and most of the work is done in the fast code inside Python.
Second, general all permutation of the locations 0 through 9. Consider these to represent the inverses of your desired permutations. In other words, the number at position 0 is not what will go to position 0 in the permutation but rather shows where label a will go in the permutation. Then you can quickly and easily check for your requirements:
p[0] < p[1] and p[3] > p[4]
since a is the 0'th label, etc. If the permutation passes this test, then find the inverse permutation of this and apply it to your labels. Finding the inverse involves one or two passes through the tuple, so it makes fewer passes than the first method. However, this is more complicated and does more work outside the innards of Python, so it is very doubtful that this will be faster than the first method.
Third, generate only the permutations you need. This can be done with these steps.
3a. Note that there are four special positions in the permutations (those for a, b, d, and e). So use itertools.combinations to choose 4 positions out of the 10 total positions. Note I said positions, not labels, so choose 4 integers between 0 and 9.
3b. Use itertools.combinations again to choose 2 of those positions out of the 4 already chosen in step 3a. Place a in the first (smaller) of those 2 positions and b in the other. Place e in the first of the other 2 positions chosen in step 3a and place d in the other.
3c. Use itertools.permutations to choose the order of the other 6 labels.
3d. Interleave all that into one permutation. There are several ways to do that. You could make one pass through, placing everything as needed, or you could use slices to concatenate the various segments of the final permutation.
That third method generates only what you need, but the time involved in constructing each permutation is sizable. I do not know which of the methods would be fastest--you could test with smaller sizes of permutations. There are multiple possible variations for each of the methods, of course.
Related
I am working on a task of text segmentation in Python.
The texts I am working on should be segmented into 4 sections (based on what they're talking about), let's call them A, B, C and D, usually in this order. These texts are divided in relatively short text segments. The sections are unique (only one per text) and homogeneous (never split, which kinda repeats previous point).
I have got a neural network that identifies the section a segment belongs to with 90% precision, which I'm happy about.
However, when it comes to the last 10%, they are often an isolated segments erroneously tagged, surrounded by other elements righteously tagged.
I can visualise this trough a list of tuples looking like this:
[(segment1, A), (segment2, A), (segment3, B), (segment4, A), (segment5, A), (segment6, C), (segment6, C)]
In this case, segment3 should be tagged as A, not B, because the sections in the document are always homogeneous. How can I identify a homogeneous group and therefore correct isolated items?
My current method consists in saying "if the element before and the element after are tagged the same, but not the element in the middle, correct the element in the middle" but I'm convinced there's a better way to do this (maybe using a different way of formatting my data?).
However, what am I to do in the case where there are 2 isolated itams next to one another?
Thanks in advance.
If it is safe to assume that single outliers are always surrounded by two correctly labeled neighbors, one could try to use triplets like these (neglecting the first and last elements, which would be just pre-/appended):
def sort_by_label(items):
return sorted(items, key=lambda x: x[1])
[(unsorted[1][0], sort_by_label(unsorted)[1][1]) for unsorted in zip(segments, segments[1:], segments[2:])]
with segments being your result list of tuples.
Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 2 years ago.
Improve this question
The code is running fine and is executing in python compiler online but failing all test cases in the Google Foobar
from math import factorial
from collections import Counter
from fractions import gcd
def cycle_count(c, n):
cc=factorial(n)
for a, b in Counter(c).items():
cc//=(a**b)*factorial(b)
return cc
def cycle_partitions(n, i=1):
yield [n]
for i in range(i, n//2+1):
for p in cycle_partitions(n-i, i):
yield [i]+p
def solution(w, h, s):
grid=0
for cpw in cycle_partitions(w):
for cph in cycle_partitions(h):
m=cycle_count(cpw, w)*cycle_count(cph, h)
grid+=m*(s**sum([sum([gcd(i, j) for i in cpw]) for j in cph]))
return grid//(factorial(w)*factorial(h))
Check out this code which is to be executed .Would love suggestions !!!
This is a gorgeous problem, both mathematically and from the algorithmic point of view.
Let me try to explain each part.
The mathematics
This part is better read with nicely typeset formulas. See a concise explanation here where links to further reading are given.
Let me add a reference directly here: For example Harary and Palmer's Graphical enumeration, Chapter 2.
In short, there is a set (the whole set of h x w-matrices, where the entries can take any of s different values) and a group of permutations that transforms some matrices in others. In the problem the group consists of all permutations of rows and/or columns of the matrices.
The set of matrices gets divided into classes of matrices that can be transformed into one another. The goal of the problem is to count the number of these classes. In technical terminology the set of classes is called the quotient of the set by the action of the group, or orbit space.
The good thing is that there is a powerful theorem (with many generalizations and versions) that does exactly that. That is Polya's enumeration theorem. The theorem expresses the number of elements of the orbit space in terms of the value of a polynomial known in the area as Cycle Index. Now, in this problem the group is a direct product of two special groups the group of all permutations of h and w elements, respectively. The Cycle Index polynomials for these groups are known, and so are formulas for computing the Cycle Index polynomial of the product of groups in terms of the Cycle Index polynomials of the factors.
Maybe a comment worth making that motivates the name of the polynomial is the following:
Every permutation of elements can be seen as cycling disjoint subsets of those elements. For example, a permutation of (1,2,3,4,5) and can be (2,3,1,5,4), where we mean that 2 moved to the position of 1, 3 moved to the position of 2, 1 to the position of 3, 5 to the position of 4 and 4 to the position of 5. The effect of this permutation is the same as cycling 1-> 3 -> 2 and 2 back to 1, and cycling 4 -> 5 and 5 back to 4. Similar to how natural numbers can be factored into a product of prime factors, each permutation can be factored into disjoint cycles. For each permutation, the cycles are unique in a sense for each permutation. The Cycle Index polynomial is computed in terms of the number of cycles of each length for each permutation in the group.
Putting all these together we get that the total count is given by the last formula in the link.
Implementation
As seen in the final formula, we need to compute:
Partitions of a number
Greatest common divisors (gcd) of many numbers.
Factorials of many numbers.
For these, we can do:
To compute all partitions one can use the iterative algorithms here. Already written in Python here.
An efficient way to compute gcd one could use Euclidean algorithm. However, since we are going to need the gcd of all pairs of numbers in a range and each one many times. It is better to pre-compute the full table of gcd all at once by dynamic programming. If a>b, then gcd(a,b)=gcd(a-b,b). This recurrence equation allows to compute gcd of larger pairs in terms of that of smaller pairs. In the table, one has the initial values gcd(1,a)=gcd(a,1)=1 and gcd(a,a)=a, for all a.
The same happens for factorials. The formula will require the factorials of all numbers in a range many times each. So, it is better to compute them all from the bottom up using that n! = n(n-1)! and 0!=1!=1.
An implementation in Python could look like this. Feel free to improve it.
I know this is a copied code but you have to :
1) write your own factorial function
2) write your own gcd function
3) cast to string before returning final value.
I have a set of data:
(1438672131.185164, 377961152)
(1438672132.264816, 377961421)
(1438672133.333846, 377961690)
(1438672134.388937, 377961954)
(1438672135.449144, 377962220)
(1438672136.540044, 377962483)
(1438672137.172971, 377962763)
(1438672138.24253, 377962915)
(1438672138.652991, 377963185)
(1438672139.069998, 377963285)
(1438672139.44115, 377963388)
What I need to figure out is how to group them. Until now I've used a super-duper simple approach, just by diffing two of the second part of the tuple and if the diff was bigger than a certain pre-defined threshold I'd put them into different groups. But it's yielded only unsatisfactory results.
But theoretically I imagine, that it should be possible to determine wether a value of the second part of the tuple belongs to the same group or not, by fitting them on one or multiple line, because I know about the first part of the tuple that it's strictly monotenous because it's a timestamp (time.time()) and I know that all data sets that result will be close to linear. Let's say the tuple is (y, x). There are only three options:
Either all data fits the same equation y = mx + c
Or there is only a differing offset c or
there is an offset c and a different m
The above set would be one group only. The following group would resolve in three groups:
(1438672131.185164, 377961152)
(1438672132.264816, 961421)
(1438672133.333846, 477961690)
(1438672134.388937, 377961954)
(1438672135.449144, 962220)
(1438672136.540044, 377962483)
(1438672137.172971, 377962763)
(1438672138.24253, 377962915)
(1438672138.652991, 377963185)
(1438672139.069998, 477963285)
(1438672139.44115, 963388)
group1:
(1438672131.185164, 377961152)
(1438672134.388937, 377961954)
(1438672136.540044, 377962483)
(1438672137.172971, 377962763)
(1438672138.24253, 377962915)
(1438672138.652991, 377963185)
group2:
(1438672132.264816, 961421)
(1438672135.449144, 962220)
(1438672139.44115, 963388)
group3:
(1438672133.333846, 477961690)
(1438672139.069998, 477963285)
Is there a module or otherwise simple solution that will solve this problem? I've found least-squares in numpy and scipy, but I'm not quite sure how to properly use or apply them. If there is another way besides linear functions I'm happy to hear about them as well!
EDIT 2
It is a two dimensional problem unfortunately, not one-dimensional. For example
(1439005464, 477961152)
should (if assuming for this data a relationship approximately 1:300) would still be first group.
I have a long list containing several thousand names that are all unique strings, but I would like to filter them to produce a shorter list so that if there are similar names only one is retained. For example, the original list could contain:
Mickey Mouse
Mickey M Mouse
Mickey M. Mouse
The new list would contain just one of them - it doesn't really matter which at this moment in time. It's possible to get a similarity score using the code below (where a and b are the text being compared), so providing I pick an appropriate ratio it I have a way of making a include/exclude decision.
difflib.SequenceMatcher(None, a, b).ratio()
What I'm struggling to work out is how to populate the second list from the first one. I'm sure it's a trivial matter, but it baffling my newbie brain.
I'd have thought something along the lines of this would have worked, but nothing ends up being populated in the second list.
for p in ppl1:
for pp in ppl2:
if difflib.SequenceMater(None, p, pp).ratio() <=0.9:
ppl2.append(p)
In fact, even if that did populate the list, it'd still be wrong. I guess it'd need to compare the name from the first list to all the names in the second list, keep track of the highest ratio scored, and then only add it if the highest ratio was less that the cutoff criteria.
Any guidance gratefully received!
I'm going to risk never getting an accept because this may be too advanced for you, but here's the optimal solution.
What you're trying to do is a variant of agglomerative clustering. A union-find algorithm can be used to solve this efficiently. From all pairs of distinct strings a and b, which can be generated using
def pairs(l):
for i, a in enumerate(l):
for j in range(i + 1, len(l)):
yield (a, l[j])
you filter the pairs that have a similarity ratio <= .9:
similar = ((a, b) for a, b in pairs
if difflib.SequenceMatcher(None, p, pp).ratio() <= .9)
then union those in a disjoint-set forest. After that, you loop over the sets to get their representatives.
Firstly, you shouldn't modify a list while you're iterating over it.
One strategy would be to go through all pairs of names and, if a certain pair is too similar to each other, only keep one, and then iterate this until no two pairs are too similar. Of course, the result would now depend on the initial order of the list, but if your data is sufficiently clustered and your similarity score metric sufficiently nice, it should produce what you're looking for.
I'm working on a Bayesian probability project, in which I need to adjust probabilities based on new information. I have yet to find an efficient way to do this. What I'm trying to do is start with an equal probability list for distinct scenarios. Ex.
There are 6 people: E, T, M, Q, L, and Z, and their initial respective probabilities of being chosen are represented in
myList=[.1667, .1667, .1667, .1667, .1667, .1667]
New information surfaces that people in the first third alphabetically have a collective 70% chance of being chosen. A new list is made, sorted alphabetically by name (E, L, M, Q, T, Z), that just includes the new information. (.7/.333=2.33, .3/.667=.45)
newList=[2.33, 2.33, .45, .45, .45, .45)
I need a way to order the newList the same as myList so I can multiply the right values in list comprehension, and reach the adjust probabilities. Having a single consistent order is important because the process will be repeated several times, each with different criteria (vowels, closest to P, etc), and in a list with about 1000 items.
Each newList could instead be a newDictionary, and then once the adjustment criteria are created they could be ordered into a list, but transforming multiple dictionaries seems inefficient. Is it? Is there a simple way to do this I'm entirely missing?
Thanks!
For what it's worth, the best thing you can do for the speed of your methods in Python is to use numpy instead of the standard types (you'll thus be using pre-compiled C code to perform arithmetic operations). This will lead to a dramatic speed increase. Numpy arrays have fixed orderings anyway, and syntax is more directly applicable to mathematical operations. You just need to consider how to express the operations as matrix operations. E.g. your example:
myList = np.ones(6) / 6.
newInfo = np.array( [.7/2, .7/2, .3/4, .3/4, .3/4, .3/4] )
result = myList * newInfo
Since both vectors have unit sum there's no need to normalise (I'm not sure what you were doing in your example, I confess, so if there's a subtlety I've missed let me know), but if you do need to it's trivial:
result /= np.sum(result)
Try storing your info as a list of tuples:
bayesList = [('E', 0.1667), ('M', 0.1667), ...]
your list comprehension can be along the lines of
newBayes = [(person, prob * normalizeFactor) for person, prob in bayesList]
where you've normalizeFactor was calculated before setting up your list comprehension