Creating an algorithm to solve a problem in Python

Algorithm Objective:
Link to the pictures I took while giving the Amazon interview:
https://boards.wetransfer.com/board/shl7w5z1e62os7nwv20190618224258/latest
Eight houses, represented as cells, are arranged in a straight line. Each day every cell competes with its adjacent cells (neighbors). An integer value of 1 represents an active cell and a value of 0 represents an inactive cell. If the neighbors on both sides of a cell are either both active or both inactive, the cell becomes inactive on the next day; otherwise the cell becomes active. The two cells on each end have a single adjacent cell, so assume that the unoccupied space on the opposite side is an inactive cell. Even after updating the cell state, consider its previous state when updating the state of other cells. The state information of all cells should be updated simultaneously.
Create an algorithm to output the state of the cells after the given number of days.
Input:
The input to the function/method consists of two arguments:
states, a list of integers representing the current state of the cells;
days, an integer representing the number of days.
Output:
Return a list of integers representing the state of the cells after the given number of days
Note:
The elements of the list states contain 0s and 1s only.
TestCase 1:
Input: [1,0,0,0,0,1,0,0] , 1
Expected Return Value: 0 1 0 0 1 0 1 0
TestCase 2:
Input: [1,1,1,0,1,1,1,1] , 2
Expected Return Value: 0 0 0 0 0 1 1 0
What I Tried:
def cellCompete(states, days):
    # WRITE YOUR CODE HERE
    il = 0
    tl = len(states)
    intialvalue = states
    results = []
    states = []
    for i in range(days):
        # first range
        if(intialvalue[il] != intialvalue[il+1]):
            print('value of index 0 is : ', reverse(intialvalue[il]))
            results.append(reverse(intialvalue[il]))
        else:
            print('value of index 0 is :', intialvalue[il])
            results.append(intialvalue[il])
        print("-------------------")
        # range middle
        while il < tl-2:
            if(intialvalue[il] != intialvalue[il+1] or intialvalue[il+1] != intialvalue[il+2]):
                print('value of index', il+1, 'is : ', reverse(intialvalue[il+1]))
                results.append(reverse(intialvalue[il+1]))
            else:
                print('value of index', il+1, 'is :', intialvalue[il+1])
                results.append(intialvalue[il+1])
            print("-------------------")
            il += 1
        # range last
        if(intialvalue[tl-2] != intialvalue[tl-1]):
            print('value of index', tl-1, 'is : ', reverse(intialvalue[tl-1]))
            results.append(reverse(intialvalue[tl-1]))
        else:
            print('value of index', tl-1, 'is :', intialvalue[tl-1])
            results.append(intialvalue[tl-1])
        print("-------------------")
        print('Input: ', intialvalue)
        print('Results: ', results)
        initialvalue = results

def reverse(val):
    if(val == 0):
        return 1
    elif(val == 1):
        return 0

print("-------------------------Test case 1--------------------")
cellCompete([1,0,0,0,0,1,0,0], 1)
print("-------------------------Test case 2--------------------")
cellCompete([1,1,1,0,1,1,1,1], 2)
I am relatively new to Python and I could not complete this algorithm for the second test case.

Here is a much shorter routine that solves your problem.
def cellCompete(states, days):
    n = len(states)
    for day in range(days):
        houses = [0] + states + [0]
        states = [houses[i-1] ^ houses[i+1] for i in range(1, n+1)]
    return states

print(cellCompete([1,0,0,0,0,1,0,0], 1))
print(cellCompete([1,1,1,0,1,1,1,1], 2))
The printout from that is what you want (though with list brackets included):
[0, 1, 0, 0, 1, 0, 1, 0]
[0, 0, 0, 0, 0, 1, 1, 0]
This routine adds sentinel zeros to each end of the list of house states. It then uses a list comprehension to find the houses' new states. All this is repeated the proper number of times before the house states are returned.
The calculation of a new house state is houses[i-1] ^ houses[i+1]. That character ^ is bitwise exclusive-or. The value is 1 if the two values are different and 0 if the two values are the same. That is just what is needed in your problem.
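As a quick illustration (an added aside, not part of the original answer), here is one day of Test Case 1 traced with the same sentinel idea:
states = [1, 0, 0, 0, 0, 1, 0, 0]
houses = [0] + states + [0]  # sentinels stand in for the empty spaces at both ends
new = [houses[i-1] ^ houses[i+1] for i in range(1, len(states) + 1)]
print(new)  # [0, 1, 0, 0, 1, 0, 1, 0] - matches Test Case 1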

Recursive version:
def cell_compete(states, days):
    s = [0] + states + [0]
    states = [i ^ j for i, j in zip(s[:-2], s[2:])]  # Thanks @RoryDaulton
    return cell_compete(states, days - 1) if days > 1 else states
A non-recursive version that also avoids extending the list by adding edge [0] elements would be:
def cell_compete(states, days):
    for _ in range(days):
        states = [states[1]] + [i ^ j for i, j in zip(states[:-2], states[2:])] + [states[-2]]
    return states

Another possibility:
def cellCompete(states, days):
    newstates = []
    added_states = [0] + states + [0]
    for counter, value in enumerate(states):
        newstates.append(int(added_states[counter] != added_states[counter+2]))
    if days > 1:
        return cellCompete(newstates, days-1)
    else:
        return newstates

print(cellCompete([1,1,1,0,1,1,1,1], 2))

Similar to Rory's using XOR, but without the need for the internal comprehension: treat the row as an 8-bit number, XOR it with itself bit-shifted by 2, shift back by 1, and clip the extra bits from the left by taking the modulus:
def process(state, r):
    n = int(''.join(map(str, state)), 2)
    for i in range(r):
        n = ((n ^ n << 2) >> 1) % 256
    return list(map(int, format(n, "08b")))

process([1,1,1,0,1,1,1,1], 2)
# [0, 0, 0, 0, 0, 1, 1, 0]
process([1,0,0,0,0,1,0,0], 1)
# [0, 1, 0, 0, 1, 0, 1, 0]
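A sanity check I added (not part of the original answer): because << binds tighter than ^, the expression is n ^ (n << 2), and after the shift and modulus it equals the XOR of the row shifted one place left and one place right, i.e. exactly the neighbor rule:
assert all(((n ^ n << 2) >> 1) % 256 == ((n << 1) ^ (n >> 1)) % 256
           for n in range(256))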

While everyone is trying to make the simplest version possible, here's a more complex version. It's pretty similar to the previous answers, except that instead of keeping the state in the function, this solution is split into two parts. One is the utility function that we want to be able to call; the other is a generator that keeps track of the states.
The main difference here is that the generator takes a comparator and an initial state that will be mutated. The generator can also be sent as a parameter, so it can help divide the logic of how many states you want to generate and gives a way to mutate from an actual state indefinitely.
def mutator(state, comparator):
    while True:
        states = [0] + state + [0]
        state = [
            comparator(states[cellid-1], states[cellid+1])
            for cellid in range(1, len(states)-1)
        ]
        yield state

def cellCompete(states, days):
    generator = mutator(states, lambda x, y: x ^ y)
    for idx, states in enumerate(generator):
        if idx+2 > days:
            break
    return states

print(cellCompete([1,0,0,0,0,1,0,0], 1))
print(cellCompete([1,1,1,0,1,1,1,1], 2))
Also, I added a comparator that allows us to plug in an arbitrary operation on both elements. It can allow the code to be extended beyond the initial spec. It's obviously a superfluous implementation, but as mentioned, it's supposed to be an interview answer, and as much as I like to see a straight-to-the-point answer, if someone can come up with a flexible answer in the same timeframe, then why not.


Find longest adjacent repeating non-overlapping substring

(This question isn't about music, but I'm using music as an example of a use case.)
In music, a common way to structure phrases is as a sequence of notes where the middle part is repeated one or more times. Thus, the phrase consists of an introduction, a looping part and an outro. Here is one example:
[ E E E F G A F F G A F F G A F C D ]
We can "see" that the intro is [ E E E ], the repeating part is [ F G A F ] and the outro is [ C D ]. So the way to split the list would be
[ [ E E E ] 3 [ F G A F ] [ C D ] ]
where the first item is the intro, the second the number of times the repeating part is repeated, and the third the outro.
I need an algorithm to perform such a split.
But there is one caveat: there may be multiple ways to split the list. For example, the above list could be split into:
[ [ E E E F G A ] 2 [ F F G A ] [ F C D ] ]
But this is a worse split because the intro and outro are longer. So the criterion for the algorithm is to find the split that maximizes the length of the looping part and minimizes the combined length of the intro and outro. That means that the correct split for
[ A C C C C C C C C C A ]
is
[ [ A ] 9 [ C ] [ A ] ]
because the combined length of the intro and outro is 2 and the length of the looping part is 9.
Also, while the intro and outro can be empty, only "true" repeats are allowed. So the following split would be disallowed:
[ [ ] 1 [ E E E F G A F F G A F F G A F C D ] [ ] ]
Think of it as finding the optimal "compression" for the sequence. Note that there may not be any repeats in some sequences:
[ A B C D ]
For these degenerate cases, any sensible result is allowed.
Here is my implementation of the algorithm:
def find_longest_repeating_non_overlapping_subseq(seq):
    candidates = []
    for i in range(len(seq)):
        candidate_max = len(seq[i + 1:]) // 2
        for j in range(1, candidate_max + 1):
            candidate, remaining = seq[i:i + j], seq[i + j:]
            n_reps = 1
            len_candidate = len(candidate)
            while remaining[:len_candidate] == candidate:
                n_reps += 1
                remaining = remaining[len_candidate:]
            if n_reps > 1:
                candidates.append((seq[:i], n_reps,
                                   candidate, remaining))
    if not candidates:
        return (type(seq)(), 1, seq, type(seq)())

    def score_candidate(candidate):
        intro, reps, loop, outro = candidate
        return reps - len(intro) - len(outro)

    return sorted(candidates, key=score_candidate)[-1]
I'm not sure it is correct, but it passes the simple tests I've described. The problem with it is that it is way too slow. I've looked at suffix trees, but they don't seem to fit my use case because the substrings I'm after should be non-overlapping and adjacent.
Here's a way that's clearly quadratic-time, but with a relatively low constant factor because it doesn't build any substring objects apart from those of length 1. The result is a 2-tuple,
bestlen, list_of_results
where bestlen is the length of the longest substring of repeated adjacent blocks, and each result is a 3-tuple,
start_index, width, numreps
meaning that the substring being repeated is
the_string[start_index : start_index + width]
and there are numreps of those adjacent. It will always be that
bestlen == width * numreps
The problem description leaves ambiguities. For example, consider this output:
>>> crunch2("aaaaaabababa")
(6, [(0, 1, 6), (0, 2, 3), (5, 2, 3), (6, 2, 3), (0, 3, 2)])
So it found 5 ways to view "the longest" stretch as being of length 6:
The initial "a" repeated 6 times.
The initial "aa" repeated 3 times.
The leftmost instance of "ab" repeated 3 times.
The leftmost instance of "ba" repeated 3 times.
The initial "aaa" repeated 2 times.
It doesn't return the intro or outro because those are trivial to deduce from what it does return:
The intro is the_string[: start_index].
The outro is the_string[start_index + bestlen :].
If there are no repeated adjacent blocks, it returns
(0, [])
Other examples (from your post):
>>> crunch2("EEEFGAFFGAFFGAFCD")
(12, [(3, 4, 3)])
>>> crunch2("ACCCCCCCCCA")
(9, [(1, 1, 9), (1, 3, 3)])
>>> crunch2("ABCD")
(0, [])
The key to how it works: suppose you have adjacent repeated blocks of width W each. Then consider what happens when you compare the original string to the string shifted left by W:
... block1 block2 ... blockN-1 blockN ...
... block2 block3 ... blockN ... ...
Then you get (N-1)*W consecutive equal characters at the same positions. But that also works in the other direction: if you shift left by W and find (N-1)*W consecutive equal characters, then you can deduce:
block1 == block2
block2 == block3
...
blockN-1 == blockN
so all N blocks must be repetitions of block1.
So the code repeatedly shifts (a copy of) the original string left by one character, then marches left to right over both identifying the longest stretches of equal characters. That only requires comparing a pair of characters at a time. To make "shift left" efficient (constant time), the copy of the string is stored in a collections.deque.
EDIT: update() did far too much futile work in many cases; replaced it.
def crunch2(s):
    from collections import deque

    # There are zcount equal characters starting
    # at index starti.
    def update(starti, zcount):
        nonlocal bestlen
        while zcount >= width:
            numreps = 1 + zcount // width
            count = width * numreps
            if count >= bestlen:
                if count > bestlen:
                    results.clear()
                results.append((starti, width, numreps))
                bestlen = count
            else:
                break
            zcount -= 1
            starti += 1

    bestlen, results = 0, []
    t = deque(s)
    for width in range(1, len(s) // 2 + 1):
        t.popleft()
        zcount = 0
        for i, (a, b) in enumerate(zip(s, t)):
            if a == b:
                if not zcount:  # new run starts here
                    starti = i
                zcount += 1
            # else a != b, so equal run (if any) ended
            elif zcount:
                update(starti, zcount)
                zcount = 0
        if zcount:
            update(starti, zcount)
    return bestlen, results
Using regexps
[removed this due to size limit]
Using a suffix array
This is the fastest I've found so far, although can still be provoked into quadratic-time behavior.
Note that it doesn't much matter whether overlapping strings are found. As explained for the crunch2() program above (here elaborated on in minor ways):
Given string s with length n = len(s).
Given ints i and j with 0 <= i < j < n.
Then if w = j-i, and c is the number of leading characters in common between s[i:] and s[j:], the block s[i:j] (of length w) is repeated, starting at s[i], a total of 1 + c // w times.
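A quick numeric check of that rule (my own toy example, not from the original post): in s = "abcabcab", the suffixes starting at i=0 and j=3 share c=5 leading characters, so the width-3 block "abc" occurs 1 + 5 // 3 == 2 times adjacently:
s = "abcabcab"
i, j = 0, 3
w = j - i  # 3
c = 0
while j + c < len(s) and s[i + c] == s[j + c]:
    c += 1
print(c, 1 + c // w)  # 5 2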
The program below follows that directly to find all repeated adjacent blocks, and remembers those of maximal total length. Returns the same results as crunch2(), but sometimes in a different order.
A suffix array eases the search, but hardly eliminates it. A suffix array directly finds <i, j> pairs with maximal c, but only limits the search to maximize w * (1 + c // w). Worst cases are strings of the form letter * number, like "a" * 10000.
I'm not giving the code for the sa module below. It's long-winded and any implementation of suffix arrays will compute the same things. The outputs of suffix_array():
sa is the suffix array, the unique permutation of range(n) such that for all i in range(1, n), s[sa[i-1]:] < s[sa[i]:].
rank isn't used here.
For i in range(1, n), lcp[i] gives the length of the longest common prefix between the suffixes starting at sa[i-1] and sa[i].
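If you want to run these programs yourself, here is a deliberately naive stand-in for the sa module matching the contract above (my addition, not the author's module; sorting the suffixes directly makes construction O(n**2 log n) worst case, but it's short, and any real suffix-array implementation can be dropped in instead):
# sa.py - naive suffix_array(s) returning (sa, rank, lcp) as described above
def suffix_array(s):
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])  # sort the suffixes directly
    rank = [0] * n
    for sai, i in enumerate(sa):
        rank[i] = sai  # rank is sa's inverse permutation
    lcp = [0] * n  # lcp[0] is unused
    for k in range(1, n):
        i, j = sa[k - 1], sa[k]
        while (i + lcp[k] < n and j + lcp[k] < n
               and s[i + lcp[k]] == s[j + lcp[k]]):
            lcp[k] += 1
    return sa, rank, lcp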
Why does it win? In part because it never has to search for suffixes that start with the same letter (the suffix array, by construction, makes them adjacent), and checking for a repeated block, and for whether it's a new best, takes small constant time regardless of how large the block or how many times it's repeated. As above, that's just trivial arithmetic on c and w.
Disclaimer: suffix arrays/trees are like continued fractions for me: I can use them when I have to, and can marvel at the results, but they give me a headache. Touchy, touchy, touchy.
def crunch4(s):
    from sa import suffix_array
    sa, rank, lcp = suffix_array(s)
    bestlen, results = 0, []
    n = len(s)
    for sai in range(n-1):
        i = sa[sai]
        c = n
        for saj in range(sai + 1, n):
            c = min(c, lcp[saj])
            if not c:
                break
            j = sa[saj]
            w = abs(i - j)
            if c < w:
                continue
            numreps = 1 + c // w
            assert numreps > 1
            total = w * numreps
            if total >= bestlen:
                if total > bestlen:
                    results.clear()
                bestlen = total
                results.append((min(i, j), w, numreps))
    return bestlen, results
Some timings
I read a modest file of English words into a string, xs. One word per line:
>>> len(xs)
209755
>>> xs.count('\n')
25481
So about 25K words in about 210K bytes. These are quadratic-time algorithms, so I didn't expect it to go fast, but crunch2() was still running after hours - and still running when I let it go overnight.
Which caused me to realize its update() function could do an enormous amount of futile work, making the algorithm more like cubic-time overall. So I repaired that. Then:
>>> crunch2(xs)
(44, [(63750, 22, 2)])
>>> xs[63750 : 63750+50]
'\nelectroencephalograph\nelectroencephalography\nelec'
That took about 38 minutes, which was in the ballpark of what I expected.
The regexp version crunch3() took less than a tenth of a second!
>>> crunch3(xs)
(8, [(19308, 4, 2), (47240, 4, 2)])
>>> xs[19308 : 19308+10]
'beriberi\nB'
>>> xs[47240 : 47240+10]
'couscous\nc'
As explained before, the regexp version may not find the best answer, but something else is at work here: by default, "." doesn't match a newline, so the code was actually doing many tiny searches. Each of the ~25K newlines in the file effectively ends the local search range. Compiling the regexp with the re.DOTALL flag instead (so newlines aren't treated specially):
>>> crunch3(xs) # with DOTALL
(44, [(63750, 22, 2)])
in a bit over 14 minutes.
Finally,
>>> crunch4(xs)
(44, [(63750, 22, 2)])
in a bit under 9 minutes. The time to build the suffix array was an insignificant part of that (less than a second). That's actually pretty impressive, since the not-always-right brute force regexp version is slower despite running almost entirely "at C speed".
But that's in a relative sense. In an absolute sense, all of these are still pig slow :-(
NOTE: the version in the next section cuts this to under 5 seconds(!).
Enormously faster
This one takes a completely different approach. For the largish dictionary example above, it gets the right answer in less than 5 seconds.
I'm rather proud of this ;-) It was unexpected, and I haven't seen this approach before. It doesn't do any string searching, just integer arithmetic on sets of indices.
It remains dreadfully slow for inputs of the form letter * largish_integer. As is, it keeps going up by 1 so long as at least two (not necessarily adjacent, or even non-overlapping!) copies of a substring (of the current length being considered) exist. So, for example, in
'x' * 1000000
it will try all substring sizes from 1 through 999999.
However, looks like that could be greatly improved by doubling the current size (instead of just adding 1) repeatedly, saving the classes as it goes along, doing a mixed form of binary search to locate the largest substring size for which a repetition exists.
Which I'll leave as a doubtless tedious exercise for the reader. My work here is done ;-)
def crunch5(text):
    from collections import namedtuple, defaultdict

    # For all integers i and j in IxSet x.s,
    # text[i : i + x.w] == text[j : j + x.w].
    # That is, it's the set of all indices at which a specific
    # substring of length x.w is found.
    # In general, we only care about repeated substrings here,
    # so weed out those that would otherwise have len(x.s) == 1.
    IxSet = namedtuple("IxSet", "s w")

    bestlen, results = 0, []

    # Compute sets of indices for repeated (not necessarily
    # adjacent!) substrings of length xs[0].w + ys[0].w, by looking
    # at the cross product of the index sets in xs and ys.
    def combine(xs, ys):
        xw, yw = xs[0].w, ys[0].w
        neww = xw + yw
        result = []
        for y in ys:
            shifted = set(i - xw for i in y.s if i >= xw)
            for x in xs:
                ok = shifted & x.s
                if len(ok) > 1:
                    result.append(IxSet(ok, neww))
        return result

    # Check an index set for _adjacent_ repeated substrings.
    def check(s):
        nonlocal bestlen
        x, w = s.s.copy(), s.w
        while x:
            current = start = x.pop()
            count = 1
            while current + w in x:
                count += 1
                current += w
                x.remove(current)
            while start - w in x:
                count += 1
                start -= w
                x.remove(start)
            if count > 1:
                total = count * w
                if total >= bestlen:
                    if total > bestlen:
                        results.clear()
                    bestlen = total
                    results.append((start, w, count))

    ch2ixs = defaultdict(set)
    for i, ch in enumerate(text):
        ch2ixs[ch].add(i)
    size1 = [IxSet(s, 1)
             for s in ch2ixs.values()
             if len(s) > 1]
    del ch2ixs
    for x in size1:
        check(x)
    current_class = size1
    # Repeatedly increase size by 1 until current_class becomes
    # empty. At that point, there are no repeated substrings at all
    # (adjacent or not) of the then-current size (or larger).
    while current_class:
        current_class = combine(current_class, size1)
        for x in current_class:
            check(x)
    return bestlen, results
And faster still
crunch6() drops the largish dictionary example to under 2 seconds on my box. It combines ideas from crunch4() (suffix and lcp arrays) and crunch5() (find all arithmetic progressions with a given stride in a set of indices).
Like crunch5(), this also loops around a number of times equal to one more than the length of the repeated longest substring (overlapping or not). For if there are no repeats of length n, there are none for any size greater than n either. That makes finding repeats without regard to overlap easier, because it's an exploitable limitation. When constraining "wins" to adjacent repeats, that breaks down. For example, there are no adjacent repeats of even length 1 in "abcabc", but there is one of length 3. That appears to make any form of direct binary search futile (the presence or absence of adjacent repeats of size n says nothing about the existence of adjacent repeats of any other size).
Inputs of the form 'x' * n remain miserable. There are repeats of all lengths from 1 through n-1.
Observation: all the programs I've given generate all possible ways of breaking up repeated adjacent chunks of maximal length. For example, for a string of 9 "x", it says it can be gotten by repeating "x" 9 times or by repeating "xxx" 3 times. So, surprisingly, they can all be used as factoring algorithms too ;-)
def crunch6(text):
    from sa import suffix_array
    sa, rank, lcp = suffix_array(text)
    bestlen, results = 0, []
    n = len(text)

    # Generate maximal sets of indices s such that for all i and j
    # in s the suffixes starting at s[i] and s[j] start with a
    # common prefix of at least len minc.
    def genixs(minc, sa=sa, lcp=lcp, n=n):
        i = 1
        while i < n:
            c = lcp[i]
            if c < minc:
                i += 1
                continue
            ixs = {sa[i-1], sa[i]}
            i += 1
            while i < n:
                c = min(c, lcp[i])
                if c < minc:
                    yield ixs
                    i += 1
                    break
                else:
                    ixs.add(sa[i])
                    i += 1
            else:  # ran off the end of lcp
                yield ixs

    # Check an index set for _adjacent_ repeated substrings
    # w apart. CAUTION: this empties s.
    def check(s, w):
        nonlocal bestlen
        while s:
            current = start = s.pop()
            count = 1
            while current + w in s:
                count += 1
                current += w
                s.remove(current)
            while start - w in s:
                count += 1
                start -= w
                s.remove(start)
            if count > 1:
                total = count * w
                if total >= bestlen:
                    if total > bestlen:
                        results.clear()
                    bestlen = total
                    results.append((start, w, count))

    c = 0
    found = True
    while found:
        c += 1
        found = False
        for s in genixs(c):
            found = True
            check(s, c)
    return bestlen, results
Always fast, and published, but sometimes wrong
In bioinformatics, turns out this is studied under the names "tandem repeats", "tandem arrays", and "simple sequence repeats" (SSR). You can search for those terms to find quite a few academic papers, some claiming worst-case linear-time algorithms.
But those seem to fall into two camps:
Linear-time algorithms of the kind to be described, which are actually wrong :-(
Algorithms so complicated it would take dedication to even try to turn them into functioning code :-(
In the first camp, there are several papers that boil down to crunch4() above, but without its inner loop. I'll follow this with code for that, crunch4a(). Here's an example:
"SA-SSR: a suffix array-based algorithm for exhaustive and efficient SSR discovery in large genetic sequences."
Pickett et alia
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5013907/
crunch4a() is always fast, but sometimes wrong. In fact it finds at least one maximal repeated stretch for every example that appeared here, solves the largish dictionary example in a fraction of a second, and has no problem with strings of the form 'x' * 1000000. The bulk of the time is spent building the suffix and lcp arrays. But it can fail:
>>> x = "bcdabcdbcd"
>>> crunch4(x) # finds repeated bcd at end
(6, [(4, 3, 2)])
>>> crunch4a(x) # finds nothing
(0, [])
The problem is that there's no guarantee that the relevant suffixes are adjacent in the suffix array. The suffixes that start with "b" are ordered like so:
bcd
bcdabcdbcd
bcdbcd
To find the trailing repeated block by this approach requires comparing the first with the third. That's why crunch4() has an inner loop, to try all pairs starting with a common letter. The relevant pair can be separated by an arbitrary number of other suffixes in a suffix array. But that also makes the algorithm quadratic time.
# only look at adjacent entries - fast, but sometimes wrong
def crunch4a(s):
    from sa import suffix_array
    sa, rank, lcp = suffix_array(s)
    bestlen, results = 0, []
    n = len(s)
    for sai in range(1, n):
        i, j = sa[sai - 1], sa[sai]
        c = lcp[sai]
        w = abs(i - j)
        if c >= w:
            numreps = 1 + c // w
            total = w * numreps
            if total >= bestlen:
                if total > bestlen:
                    results.clear()
                bestlen = total
                results.append((min(i, j), w, numreps))
    return bestlen, results
O(n log n)
This paper looks right to me, although I haven't coded it:
"Simple and Flexible Detection of Contiguous Repeats Using a Suffix Tree"
Jens Stoye, Dan Gusfield
https://csiflabs.cs.ucdavis.edu/~gusfield/tcs.pdf
Getting to a sub-quadratic algorithm requires making some compromises, though. For example, "x" * n has n-1 substrings of the form "x"*2, n-2 of the form "x"*3, ..., so there are O(n**2) of those alone. So any algorithm that finds all of them is necessarily also at best quadratic time.
Read the paper for details ;-) One concept you're looking for is "primitive": I believe you only want repeats of the form S*n where S cannot itself be expressed as a repetition of shorter strings. So, e.g., "x" * 10 is primitive, but "xx" * 5 is not.
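To illustrate "primitive" (a sketch of my own, not from the paper): the standard test uses the doubled-string trick - a base string S is a repetition of a shorter string exactly when S occurs inside (S + S) with the first and last characters chopped off:
def is_primitive(s):
    return s not in (s + s)[1:-1]

print(is_primitive("x"))     # True:  "x" cannot be decomposed
print(is_primitive("xx"))    # False: "xx" == "x" * 2
print(is_primitive("xyxy"))  # False: "xyxy" == "xy" * 2
print(is_primitive("xyx"))   # True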
One step on the way to O(n log n)
crunch9() is an implementation of the "brute force" algorithm I mentioned in the comments, from:
"The enhanced suffix array and its applications to genome analysis"
Ibrahim et alia
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.2217&rep=rep1&type=pdf
The implementation sketch there only finds "branching tandem" repeats, and I added code here to deduce repeats of any number of repetitions, and to include non-branching repeats too. While it's still O(n**2) worst case, it's much faster than anything else here for the seq string you pointed to in the comments. As is, it reproduces (except for order) the same exhaustive account as most of the other programs here.
The paper goes on to fight hard to cut the worst case to O(n log n), but that slows it down a lot. So then it fights harder. I confess I lost interest ;-)
# Generate lcp intervals from the lcp array.
def genlcpi(lcp):
    lcp.append(0)
    stack = [(0, 0)]
    for i in range(1, len(lcp)):
        c = lcp[i]
        lb = i - 1
        while c < stack[-1][0]:
            i_c, lb = stack.pop()
            interval = i_c, lb, i - 1
            yield interval
        if c > stack[-1][0]:
            stack.append((c, lb))
    lcp.pop()

def crunch9(text):
    from sa import suffix_array
    sa, rank, lcp = suffix_array(text)
    bestlen, results = 0, []
    n = len(text)

    # generate branching tandem repeats
    def gen_btr(text=text, n=n, sa=sa):
        for c, lb, rb in genlcpi(lcp):
            i = sa[lb]
            basic = text[i : i + c]
            # Binary searches to find subrange beginning with
            # basic+basic. A more gonzo implementation would do this
            # character by character, never materializing the common
            # prefix in `basic`.
            rb += 1
            hi = rb
            while lb < hi:  # like bisect.bisect_left
                mid = (lb + hi) // 2
                i = sa[mid] + c
                if text[i : i + c] < basic:
                    lb = mid + 1
                else:
                    hi = mid
            lo = lb
            while lo < rb:  # like bisect.bisect_right
                mid = (lo + rb) // 2
                i = sa[mid] + c
                if basic < text[i : i + c]:
                    rb = mid
                else:
                    lo = mid + 1
            lead = basic[0]
            for sai in range(lb, rb):
                i = sa[sai]
                j = i + 2*c
                assert j <= n
                if j < n and text[j] == lead:
                    continue  # it's non-branching
                yield (i, c, 2)

    for start, c, _ in gen_btr():
        # extend left
        numreps = 2
        for i in range(start - c, -1, -c):
            if all(text[i+k] == text[start+k] for k in range(c)):
                start = i
                numreps += 1
            else:
                break
        totallen = c * numreps
        if totallen < bestlen:
            continue
        if totallen > bestlen:
            bestlen = totallen
            results.clear()
        results.append((start, c, numreps))
        # add non-branches
        while start:
            if text[start - 1] == text[start + c - 1]:
                start -= 1
                results.append((start, c, numreps))
            else:
                break
    return bestlen, results
Earning the bonus points ;-)
For some technical meaning ;-) crunch11() is worst-case O(n log n). Besides the suffix and lcp arrays, this also needs the rank array, sa's inverse:
assert all(rank[sa[i]] == sa[rank[i]] == i for i in range(len(sa)))
As code comments note, it also relies on Python 3 for speed (range() behavior). That's shallow but would be tedious to rewrite.
Papers describing this have several errors, so don't flip out if this code doesn't exactly match what you read about. Implement exactly what they say instead, and it will fail.
That said, the code is getting uncomfortably complex, and I can't guarantee there aren't bugs. It works on everything I've tried.
Inputs of the form 'x' * 1000000 still aren't speedy, but clearly no longer quadratic-time. For example, a string repeating the same letter a million times completes in close to 30 seconds. Most other programs here would never end ;-)
EDIT: changed genlcpi() to use semi-open Python ranges; made mostly cosmetic changes to crunch11(); added "early out" that saves about a third the time in worst (like 'x' * 1000000) cases.
# Generate lcp intervals from the lcp array.
def genlcpi(lcp):
    lcp.append(0)
    stack = [(0, 0)]
    for i in range(1, len(lcp)):
        c = lcp[i]
        lb = i - 1
        while c < stack[-1][0]:
            i_c, lb = stack.pop()
            yield (i_c, lb, i)
        if c > stack[-1][0]:
            stack.append((c, lb))
    lcp.pop()

def crunch11(text):
    from sa import suffix_array
    sa, rank, lcp = suffix_array(text)
    bestlen, results = 0, []
    n = len(text)

    # Generate branching tandem repeats.
    # (i, c, 2) is branching tandem iff
    #     i+c in interval with prefix text[i : i+c], and
    #     i+c not in subinterval with prefix text[i : i+c + 1]
    # Caution: this pragmatically relies on that, in Python 3,
    # `range()` returns a tiny object with O(1) membership testing.
    # In Python 2 it returns a list - should still work, but very
    # much slower.
    def gen_btr(text=text, n=n, sa=sa, rank=rank):
        from itertools import chain
        for c, lb, rb in genlcpi(lcp):
            origlb, origrb = lb, rb
            origrange = range(lb, rb)
            i = sa[lb]
            lead = text[i]
            # Binary searches to find subrange beginning with
            # text[i : i+c+1]. Note we take slices of length 1
            # rather than just index to avoid special-casing for
            # i >= n.
            # A more elaborate traversal of the lcp array could also
            # give us a list of child intervals, and then we'd just
            # need to pick the right one. But that would be even
            # more hairy code, and unclear to me it would actually
            # help the worst cases (yes, the interval can be large,
            # but so can a list of child intervals).
            hi = rb
            while lb < hi:  # like bisect.bisect_left
                mid = (lb + hi) // 2
                i = sa[mid] + c
                if text[i : i+1] < lead:
                    lb = mid + 1
                else:
                    hi = mid
            lo = lb
            while lo < rb:  # like bisect.bisect_right
                mid = (lo + rb) // 2
                i = sa[mid] + c
                if lead < text[i : i+1]:
                    rb = mid
                else:
                    lo = mid + 1
            subrange = range(lb, rb)
            if 2 * len(subrange) <= len(origrange):
                # Subrange is at most half the size.
                # Iterate over it to find candidates i, starting
                # with wa. If i+c is also in origrange, but not
                # in subrange, good: then i is of the form wwx.
                for sai in subrange:
                    i = sa[sai]
                    ic = i + c
                    if ic < n:
                        r = rank[ic]
                        if r in origrange and r not in subrange:
                            yield (i, c, 2, subrange)
            else:
                # Iterate over the parts outside subrange instead.
                # Candidates i are then the trailing wx in the
                # hoped-for wwx. We win if i-c is in subrange too
                # (or, for that matter, if it's in origrange).
                for sai in chain(range(origlb, lb),
                                 range(rb, origrb)):
                    ic = sa[sai] - c
                    if ic >= 0 and rank[ic] in subrange:
                        yield (ic, c, 2, subrange)

    for start, c, numreps, irange in gen_btr():
        # extend left
        crange = range(start - c, -1, -c)
        if (numreps + len(crange)) * c < bestlen:
            continue
        for i in crange:
            if rank[i] in irange:
                start = i
                numreps += 1
            else:
                break
        # check for best
        totallen = c * numreps
        if totallen < bestlen:
            continue
        if totallen > bestlen:
            bestlen = totallen
            results.clear()
        results.append((start, c, numreps))
        # add non-branches
        while start and text[start - 1] == text[start + c - 1]:
            start -= 1
            results.append((start, c, numreps))
    return bestlen, results
Here's my implementation of what you're talking about. It's pretty similar to yours, but it skips over substrings which have been checked as repetitions of previous substrings.
from collections import namedtuple

SubSequence = namedtuple('SubSequence', ['start', 'length', 'reps'])

def longest_repeating_subseq(original: str):
    winner = SubSequence(start=0, length=0, reps=0)
    checked = set()
    subsequences = (  # Evaluates lazily during iteration
        SubSequence(start=start, length=length, reps=1)
        for start in range(len(original))
        for length in range(1, len(original) - start)
        if (start, length) not in checked)
    for s in subsequences:
        subseq = original[s.start : s.start + s.length]
        for reps, next_start in enumerate(
                range(s.start + s.length, len(original), s.length),
                start=1):
            if subseq != original[next_start : next_start + s.length]:
                break
            else:
                checked.add((next_start, s.length))
                s = s._replace(reps=reps)
        if s.reps > 1 and (
                (s.length * s.reps > winner.length * winner.reps)
                or (  # When total lengths are equal, prefer the shorter substring
                    s.length * s.reps == winner.length * winner.reps
                    and s.reps > winner.reps)):
            winner = s
    # Check for default case with no repetitions
    if winner.reps == 0:
        winner = SubSequence(start=0, length=len(original), reps=1)
    return (
        original[ : winner.start],
        winner.reps,
        original[winner.start : winner.start + winner.length],
        original[winner.start + winner.length * winner.reps : ])

def test(seq, *, expect):
    print(f'Testing longest_repeating_subseq for {seq}')
    result = longest_repeating_subseq(seq)
    print(f'Expected {expect}, got {result}')
    print(f'Test {"passed" if result == expect else "failed"}')
    print()

if __name__ == '__main__':
    test('EEEFGAFFGAFFGAFCD', expect=('EEE', 3, 'FGAF', 'CD'))
    test('ACCCCCCCCCA', expect=('A', 9, 'C', 'A'))
    test('ABCD', expect=('', 1, 'ABCD', ''))
Passes all three of your examples for me. This seems like the sort of thing that could have a lot of weird edge cases, but given that it's an optimized brute force, it would probably be more a matter of updating the spec rather than fixing a bug in the code itself.
It looks like what you are trying to do is pretty much the LZ77 compression algorithm. You can check your code against the reference implementation in the Wikipedia article I linked to.
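To make the connection concrete, here is a toy LZ77-style encoder (a rough sketch of my own, not a reference implementation) emitting the (offset, length, next-char) triples that article describes; note how the single (4, 7, 'C') triple captures the repeating FGAF region:
def lz77(data):
    # Greedy longest-match search over the already-seen prefix;
    # matches may extend into the lookahead, as in real LZ77.
    i, out = 0, []
    while i < len(data):
        best_len, best_off = 0, 0
        for j in range(i):
            k = 0
            while i + k < len(data) - 1 and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

print(lz77("EEEFGAFFGAFFGAFCD"))
# [(0, 0, 'E'), (1, 2, 'F'), (0, 0, 'G'), (0, 0, 'A'), (3, 1, 'F'), (4, 7, 'C'), (0, 0, 'D')]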

Fill order from smaller packages?

The input is an integer that specifies the amount to be ordered.
There are predefined package sizes that have to be used to create that order.
e.g.
Packs
3 for $5
5 for $9
9 for $16
for an input order 13 the output should be:
2x5 + 1x3
So far I have the following approach:
remaining_order = 13
package_numbers = [9, 5, 3]
required_packages = []

while remaining_order > 0:
    found = False
    for pack_num in package_numbers:
        if pack_num <= remaining_order:
            required_packages.append(pack_num)
            remaining_order -= pack_num
            found = True
            break
    if not found:
        break
But this will lead to the wrong result:
1x9 + 1x3
remaining: 1
So, you need to fill the order with the packages such that the total price is maximal? This is known as the knapsack problem. In that Wikipedia article you'll find several solutions written in Python.
To be more precise, you need a solution for the unbounded knapsack problem, in contrast to the popular 0/1 knapsack problem (where each item can be packed only once). Here is working code from Rosetta:
from itertools import product

NAME, SIZE, VALUE = range(3)
items = (
    # NAME, SIZE, VALUE
    ('A', 3, 5),
    ('B', 5, 9),
    ('C', 9, 16))
capacity = 13

def knapsack_unbounded_enumeration(items, C):
    # find max of any one item
    max1 = [int(C / item[SIZE]) for item in items]
    itemsizes = [item[SIZE] for item in items]
    itemvalues = [item[VALUE] for item in items]

    # def totvalue(itemscount, itemsizes=itemsizes, itemvalues=itemvalues, C=C):
    def totvalue(itemscount):
        # nonlocal itemsizes, itemvalues, C
        totsize = sum(n * size for n, size in zip(itemscount, itemsizes))
        totval = sum(n * val for n, val in zip(itemscount, itemvalues))
        return (totval, -totsize) if totsize <= C else (-1, 0)

    # Try all combinations of bounty items from 0 up to max1
    bagged = max(product(*[range(n + 1) for n in max1]), key=totvalue)
    numbagged = sum(bagged)
    value, size = totvalue(bagged)
    size = -size
    # convert to (item, count) pairs in name order
    bagged = ['%dx%d' % (n, items[i][SIZE]) for i, n in enumerate(bagged) if n]
    return value, size, numbagged, bagged

if __name__ == '__main__':
    value, size, numbagged, bagged = knapsack_unbounded_enumeration(items, capacity)
    print(value)
    print(bagged)
Output is:
23
['1x3', '2x5']
Keep in mind that this is an NP-hard problem, so it will blow up when you enter some large values :)
You can use itertools.product:
import itertools

remaining_order = 13
package_numbers = [9, 5, 3]
required_packages = []

a = min([x for i in range(1, remaining_order+1//min(package_numbers))
         for x in itertools.product(package_numbers, repeat=i)],
        key=lambda x: abs(sum(x) - remaining_order))
remaining_order -= sum(a)
print(a)
print(remaining_order)
Output:
(5, 5, 3)
0
This simply does the steps below:
Get the value closest to 13 in the list with all the product values.
Then simply use it to reduce remaining_order.
If you want it output with 'x':
import itertools
from collections import Counter

remaining_order = 13
package_numbers = [9, 5, 3]
required_packages = []

a = min([x for i in range(1, remaining_order+1//min(package_numbers))
         for x in itertools.product(package_numbers, repeat=i)],
        key=lambda x: abs(sum(x) - remaining_order))
remaining_order -= sum(a)
print(' + '.join('{0}x{1}'.format(v, k) for k, v in Counter(a).items()))
print(remaining_order)
Output:
2x5 + 1x3
0
For your problem, I tried two implementations depending on what you want. In both solutions I assumed you absolutely need your remaining order to be 0; otherwise the algorithm will return -1. If you need those cases handled, tell me and I can adapt my algorithm.
As the algorithm is implemented via dynamic programming, it handles large inputs well, at least more than 130 packages!
In the first solution, I assumed we fill with the biggest package each time.
In the second solution, I try to minimize the price, while the remaining order must still be 0.
remaining_order = 13
package_numbers = sorted([9, 5, 3], reverse=True)  # To make sure the biggest package is the first element
prices = {9: 16, 5: 9, 3: 5}
required_packages = []

# First solution, using the biggest package each time, and making the total order remaining at 0 each time
ans = [[] for _ in range(remaining_order + 1)]
ans[0] = [0, 0, 0]
for i in range(1, remaining_order + 1):
    for index, package_number in enumerate(package_numbers):
        if i - package_number > -1:
            tmp = ans[i - package_number]
            if tmp != -1:
                ans[i] = [tmp[x] if x != index else tmp[x] + 1 for x in range(len(tmp))]
                break
    else:  # Using for-else instead of a boolean value `found`
        ans[i] = -1  # -1 is the not-found combination

print(ans[13])  # [0, 2, 1]
print(ans[9])   # [1, 0, 0]

# Second solution, minimizing the price with order at 0
def price(x):
    return 16*x[0] + 9*x[1] + 5*x[2]

ans = [[] for _ in range(remaining_order + 1)]
ans[0] = ([0, 0, 0], 0)  # combination + price
for i in range(1, remaining_order + 1):
    # The not-found packages will be (-1, float('inf'))
    minimal_price = float('inf')
    minimal_combinations = -1
    for index, package_number in enumerate(package_numbers):
        if i - package_number > -1:
            tmp = ans[i - package_number]
            if tmp != (-1, float('inf')):
                tmp_price = price(tmp[0]) + prices[package_number]
                if tmp_price < minimal_price:
                    minimal_price = tmp_price
                    minimal_combinations = [tmp[0][x] if x != index else tmp[0][x] + 1 for x in range(len(tmp[0]))]
    ans[i] = (minimal_combinations, minimal_price)

print(ans[13])  # ([0, 2, 1], 23)
print(ans[9])   # ([0, 0, 3], 15) because the price of three packages of 3 is lower than the price of a package of 9
In case you need a solution for a small number of possible package_numbers but a possibly very big remaining_order, in which case all the other solutions would fail, you can use this to reduce remaining_order:
import numpy as np

remaining_order = 13
package_numbers = [9, 5, 3]
required_packages = []

sub_max = np.sum([(np.product(package_numbers)/i - 1) * i for i in package_numbers])

while remaining_order > sub_max:
    remaining_order -= np.product(package_numbers)
    required_packages.append([max(package_numbers)] * (np.product(package_numbers) // max(package_numbers)))
Because if any package i appeared in required_packages more often than np.product(package_numbers)/i - 1 times, its sum would equal np.product(package_numbers). In case the package max(package_numbers) isn't the one with the smallest price per unit, take the one with the smallest price per unit instead.
Example:
remaining_order = 100
package_numbers = [5, 3]
Any part of remaining_order bigger than 5*2 plus 3*4 = 22 can be handled by adding 5 three times to the solution and taking remaining_order - 5*3.
So the remaining order that actually needs to be calculated is 10, which can then be solved as being 2 times 5. The rest is filled with 6 times 15, which is 18 times 5.
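A quick numeric check of that example (my own verification code, using the same quantities):
package_numbers = [5, 3]
prod = 5 * 3  # np.product(package_numbers)
sub_max = sum((prod // i - 1) * i for i in package_numbers)
print(sub_max)  # 22 == 5*2 + 3*4

remaining_order = 100
fives = 0
while remaining_order > sub_max:
    remaining_order -= prod
    fives += prod // 5  # each reduction adds three 5-packs
print(remaining_order)  # 10, solved as 2 times 5
print(fives)  # 18 fives fill the rest (6 reductions of 15)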
In case the number of possible package_numbers is bigger than just a handful, I recommend building a lookup table (with one of the other answers' code) for all numbers below sub_max, which will make this immensely fast for any input.
Since no objective function is declared, I assume your goal is to maximize the package value within the pack's capacity.
Explanation: the time complexity is fixed. The optimal solution may not be filling with the highest-valued item as many times as possible; you have to search all possible combinations. However, you can reuse the partial solutions you have already searched, to save space. For example, [5,5,3] is derived from adding 3 to a previous [5,5] try, so the intermediate result can be "cached". You may either use an array or a set to store possible solutions. The code below has the same performance as the Rosetta code, but I think it's clearer.
To further optimize, use a priority set for opts.
costs = [3, 5, 9]
value = [5, 9, 16]
volume = 130

# solutions
opts = set()
opts.add(tuple([0]))

# calc total value
cost_val = dict(zip(costs, value))

def total_value(opt):
    return sum([cost_val.get(cost, 0) for cost in opt])

def possible_solutions():
    solutions = set()
    for opt in opts:
        for cost in costs:
            if cost + sum(opt) > volume:
                continue
            cnt = (volume - sum(opt)) // cost
            for _ in range(1, cnt + 1):
                sol = tuple(list(opt) + [cost] * _)
                solutions.add(sol)
    return solutions

def optimize_max_return(opts):
    if not opts:
        return tuple([])
    cur = list(opts)[0]
    for sol in opts:
        if total_value(sol) > total_value(cur):
            cur = sol
    return cur

while sum(optimize_max_return(opts)) <= volume - min(costs):
    opts = opts.union(possible_solutions())

print(optimize_max_return(opts))
If your requirement is "just fill the pack", it'll be even simpler: use the volume for each item instead.

Python removing intersection from list 2 out of list 1 [duplicate]

My problem is as follows:
having a file with a list of intervals:
1 5
2 8
9 12
20 30
And a range of
0 200
I would like to do such an intersection that will report the positions [start end] between my intervals inside the given range.
For example:
8 9
12 20
30 200
Besides any ideas on how to tackle this, it would also be nice to read some thoughts on optimization, since as always the input files are going to be huge.
This solution works as long as the intervals are ordered by their start point, and it does not require creating a list as big as the total range.
code
with open("0.txt") as f:
    t = [x.rstrip("\n").split("\t") for x in f.readlines()]
    intervals = [(int(x[0]), int(x[1])) for x in t]

def find_ints(intervals, mn, mx):
    next_start = mn
    for x in intervals:
        if next_start < x[0]:
            yield next_start, x[0]
            next_start = x[1]
        elif next_start < x[1]:
            next_start = x[1]
    if next_start < mx:
        yield next_start, mx

print(list(find_ints(intervals, 0, 200)))
output:
(in the case of the example you gave)
[(0, 1), (8, 9), (12, 20), (30, 200)]
Rough algorithm:
Create an array of booleans, all set to false: seen = [False]*200
Iterate over the input file; for each line start end set seen[start] .. seen[end] to True.
Once done, you can trivially walk the array to find the unused intervals.
In terms of optimisations, if the list of input ranges is sorted on start number, then you can track the highest seen number and use that to filter ranges as they are processed -
e.g. something like
for (start, end) in input:
    if end <= lowest_unseen:
        continue
    if start < lowest_unseen:
        start = lowest_unseen
    ...
which (ignoring the cost of the original sort) should make the whole thing O(n) - you go through the array once to tag seen/unseen and once to output unseens.
Seems I'm feeling nice. Here is the (unoptimised) code, assuming your input file is called input
seen = [False]*200
file = open('input', 'r')
rows = file.readlines()
for row in rows:
    (start, end) = row.split(' ')
    print("%s %s" % (start, end))
    for x in range(int(start)-1, int(end)-1):
        seen[x] = True
print(seen[0:10])

in_unseen_block = False
start = 1
for x in range(1, 200):
    val = seen[x-1]
    if val and not in_unseen_block:
        continue
    if not val and in_unseen_block:
        continue
    # Must be at a change point.
    if val:
        # we have reached the end of the block
        print("%s %s" % (start, x))
        in_unseen_block = False
    else:
        # start of new block
        start = x
        in_unseen_block = True
# Handle end block
if in_unseen_block:
    print("%s %s" % (start, 200))
I'm leaving the optimizations as an exercise for the reader.
If you make a note every time one of your input intervals either opens or closes, you can put the keys of opens and closes together, sort them into an ordered set, and then essentially think: "okay, each adjacent pair of numbers forms an interval, so I can focus all of my logic on these intervals as discrete chunks."
myRange = range(201)
intervals = [(1,5), (2,8), (9,12), (20,30)]
opens = {}
closes = {}

def open(index):
    if index not in opens:
        opens[index] = 0
    opens[index] += 1

def close(index):
    if index not in closes:
        closes[index] = 0
    closes[index] += 1

for start, end in intervals:
    if end > start:  # Making sure to exclude empty intervals, which can be problematic later
        open(start)
        close(end)

# Sort all the interval-endpoints that we really need to look at
oset = {0: None, 200: None}
for k in opens.keys():
    oset[k] = None
for k in closes.keys():
    oset[k] = None
relevant_indices = sorted(oset.keys())

# Find the clear ranges
state = 0
results = []
for i in range(len(relevant_indices) - 1):
    start = relevant_indices[i]
    end = relevant_indices[i+1]
    start_state = state
    if start in opens:
        start_state += opens[start]
    if start in closes:
        start_state -= closes[start]
    end_state = start_state
    if end in opens:
        end_state += opens[end]
    if end in closes:
        end_state -= closes[end]
    state = end_state
    if start_state == 0:
        result_start = start
        result_end = end
        results.append((result_start, result_end))

for start, end in results:
    print(str(start) + " " + str(end))
This outputs:
0 1
8 9
12 20
30 200
The intervals don't need to be sorted.
This question seems to be a duplicate of Merging intervals in Python.
If I understood the problem well, you have a list of intervals (1 5; 2 8; 9 12; 20 30) and a range (0 200), and you want to get the positions outside your intervals but inside the given range. Right?
There's a Python library that can help you on that: python-intervals (also available from PyPI using pip). Disclaimer: I'm the maintainer of that library.
Assuming you import this library as follows:
import intervals as I
It's quite easy to get your answer. Basically, you first want to create a disjunction of intervals based on the ones you provide:
inters = I.closed(1, 5) | I.closed(2, 8) | I.closed(9, 12) | I.closed(20, 30)
Then you compute the complement of these intervals, to get everything that is "outside":
compl = ~inters
Then you intersect it with [0, 200], as you want to restrict the points to that interval:
print(compl & I.closed(0, 200))
This results in:
[0,1) | (8,9) | (12,20) | (30,200]

How to check if two permutations are symmetric?

Given two permutations A and B of L different elements, L is even; let's call these permutations "symmetric" (for lack of a better term) if there exist n and m, m > n, such that (in Python notation):
- A[n:m] == B[L-m:L-n]
- B[n:m] == A[L-m:L-n]
- all other elements are in place
Informally, consider
A = 0 1 2 3 4 5 6 7
Take any slice of it, for example 1 2. It starts at the second index and its length is 2. Now take a slice symmetric to it: it ends at the penultimate index and is 2 chars long too, so it's 5 6. Swapping these slices gives
B = 0 5 6 3 4 1 2 7
Now, A and B are "symmetric" in the above sense (n=1, m=3). On the other hand
A = 0 1 2 3 4 5 6 7
B = 1 0 2 3 4 5 7 6
are not "symmetric" (no n,m with above properties exist).
How can I write an algorithm in Python that finds whether two given permutations (= lists) are "symmetric" and, if yes, finds the n and m? For simplicity, let's consider only even L (because the odd case can be trivially reduced to the even one by eliminating the middle fixed element) and assume correct inputs (set(A) == set(B), len(set(A)) == len(A)).
(I have no problem brute-forcing all possible symmetries, but I'm looking for something smarter and faster than that.)
Fun fact: the number of symmetric permutations for the given L is a Triangular number.
I use this code to test out your answers.
Bounty update: many excellent answers here. @Jared Goguen's solution appears to be the fastest.
Final timings:
testing 0123456789 L= 10
test_alexis ok in 15.4252s
test_evgeny_kluev_A ok in 30.3875s
test_evgeny_kluev_B ok in 27.1382s
test_evgeny_kluev_C ok in 14.8131s
test_ian ok in 26.8318s
test_jared_goguen ok in 10.0999s
test_jason_herbburn ok in 21.3870s
test_tom_karzes ok in 27.9769s
Here is the working solution for the question:
def isSymmetric(A, B):
    L = len(A)  # assume equivalent to len(B); handling this would be as simple as: if len(A) != len(B), return []
    la = L // 2  # half-list length
    Al = A[:la]
    Ar = A[la:]
    Bl = B[:la]
    Br = B[la:]
    for i in range(la):
        lai = la - i  # just to reduce the number of computations we need to perform
        for j in range(1, lai + 1):
            k = lai - j  # same here, reduce computation
            if Al[i] != Br[k] or Ar[k] != Bl[i]:  # the key for efficient computation is here: do not proceed unnecessarily
                continue
            n = i  # written only for the sake of clarity. i is n, and we can use i directly
            m = i + j
            if A[n:m] == B[L-m:L-n] and B[n:m] == A[L-m:L-n]:  # possibly symmetric
                if A[0:n] == B[0:n] and A[m:L-m] == B[m:L-m] and A[L-n:] == B[L-n:]:
                    return [n, m]
    return []
As you have mentioned, though the idea looks simple, it is actually quite tricky. Once we see the patterns, however, the implementation is straightforward.
The central idea of the solution is this single line:
if Al[i] != Br[k] or Ar[k] != Bl[i]: #the key for efficient computation is here: do not proceed unnecessarily
All other lines are just either direct code translation from the problem statement or optimization made for more efficient computation.
There are a few steps involved in finding the solution:
Firstly, we need to split each of list A and list B into two half-lists (called Al, Ar, Bl, and Br). Each half-list contains half of the members of the original list:
Al = A[:la]
Ar = A[la:]
Bl = B[:la]
Br = B[la:]
Secondly, to make the evaluation efficient, the goal here is to find what I would call the pivot index: a way to decide whether a position in the list (an index) is worth evaluating when checking if the lists are symmetric. This pivot index is the central idea behind an efficient solution, so I will elaborate on it quite a bit:
Consider the left half part of the A list, suppose you have a member like this:
Al = [al1, al2, al3, al4, al5, al6]
We can imagine that there is a corresponding index list for the mentioned list like this
Al = [al1, al2, al3, al4, al5, al6]
iAl = [0, 1, 2, 3, 4, 5 ] #corresponding index list, added for explanation purpose
(Note: I mention imagining a corresponding index list for ease of explanation.)
Likewise, we can imagine that the other three lists may have similar index lists. Let's name them iAr, iBl, and iBr respectively; their members are all identical to iAl's.
It is the indexes of the lists that really matter for us to look into in order to solve the problem.
Here is what I mean: suppose we have two parameters:
index (let's give a variable name i to it, and I would use symbol ^ for current i)
length (let's give a variable name j to it, and I would use symbol == to visually represent its length value)
for each evaluation of an index element in iAl, each evaluation would mean:
Given an index value i and a length value j in iAl, do
something to determine if it is worth checking for symmetry
qualifications starting from that index and with that length.
(Hence the name pivot index.)
Now, let's take the example of one evaluation, when i = 0 and j = 1. The evaluation can be illustrated as follows:
iAl = [0, 1, 2, 3, 4, 5]
^ <-- now evaluate this index (i) = 0
== <-- now this has length (j) of 1
In order for that index i and length j to be worth evaluating further, the counterpart iBr must have the same item value with the same length but at a different index (let's name it index k):
iBr = [0, 1, 2, 3, 4, 5]
^ <-- must compare the value in this index to what is pointed by iAl
== <-- must evaluate with the same length = 1
For example, for the above case, this is a possible "symmetric" permutation just for the two lists Al-Br (we will consider the other two lists Ar-Bl later):
Al = [0, x, x, x, x, x] #x means don't care for now
Br = [x, x, x, x, x, 0]
At this moment, it is good to note that
it isn't worth evaluating further if even the above condition is not true.
And this is where the algorithm gets more efficient: by selectively evaluating only the few possible cases among all possible cases. And how do we find the few possible cases?
By trying to find the relationship between the indexes and lengths of the
four lists. That is, for a given index i and length j in a
list (say Al), what must be the index k in the counterpart
list (in this case Br)? The length for the counterpart list need not
be found because it is the same as in the list (that is, j).
Having known that, let's now proceed further to see if we can find more patterns in the evaluation process.
Consider now the effect of length (j). For example, if we are to evaluate from index 0 but the length is 2, then the counterpart list would need a different index k evaluated than when the length is 1:
iAl = [0, 1, 2, 3, 4, 5]
^ <-- now evaluate this index (i) = 0
===== <-- now this has length (j) of 2
iBr = [0, 1, 2, 3, 4, 5]
^ <-- must compare the value in this index to what is pointed by iAl
===== <-- must evaluate with the same length = 2
Or, for the illustration above, what really matters for i = 0 and y = 2 is something like this:
# when i = 0 and y = 2
Al = [0, y, x, x, x, x] #x means don't care for now
Br = [x, x, x, x, 0, y] #y means to be checked later
Notice that the above pattern is a bit different from when i = 0 and y = 1 - the index position for the 0 value in the example is shifted:
# when i = 0 and y = 1, k = 5
Al = [0, x, x, x, x, x] #x means don't care for now
Br = [x, x, x, x, x, 0]
# when i = 0 and y = 2, k = 4
Al = [0, y, x, x, x, x] #x means don't care for now
Br = [x, x, x, x, 0, y] #y means to be checked later
Thus, the length shifts where the index of the counterpart list must be checked. In the first case, when i = 0 and y = 1, k = 5. But in the second case, when i = 0 and y = 2, k = 4. Thus we found the pivot-index relationship when we change the length j for a fixed index i (in this case 0) in the counterpart list index k.
Now, consider the effects of index i with fixed length j for counterpart list index k. For example, let's fix the length as y = 4, then for index i = 0, we have:
iAl = [0, 1, 2, 3, 4, 5]
^ <-- now evaluate this index (i) = 0
========== <-- now this has length (j) of 4
iAl = [0, 1, 2, 3, 4, 5]
^ <-- now evaluate this index (i) = 1
========== <-- now this has length (j) of 4
iAl = [0, 1, 2, 3, 4, 5]
^ <-- now evaluate this index (i) = 2
========== <-- now this has length (j) of 4
#And no more needed
In the above example, it can be seen that we need to evaluate 3 possibilities for the given i and j, but if the index i is changed to 1 with the same length j = 4:
iAl = [0, 1, 2, 3, 4, 5]
^ <-- now evaluate this index (i) = 1
========== <-- now this has length (j) of 4
iAl = [0, 1, 2, 3, 4, 5]
^ <-- now evaluate this index (i) = 2
========== <-- now this has length (j) of 4
Note that we only need to evaluate 2 possibilities. Thus the increase of index i decreases the number of possible cases to be evaluated!
With all the above patterns found, we have almost found all the basics we need to make the algorithm work. But to complete it, we need to find the relationship between the indexes which appear in the Al-Br pair for a given [i, j] => [k, j] and the indexes in the Ar-Bl pair for the same [i, j].
Now, we can actually see that they are simply mirroring the relationship we found in the Al-Br pair!
(IMHO, this is really beautiful! And thus I think the term "symmetric" permutation is not far from the truth.)
For example, if we have the following Al-Br pair evaluated with i = 0 and y = 2
Al = [0, y, x, x, x, x] #x means don't care for now
Br = [x, x, x, x, 0, y] #y means to be checked later
Then, to make it symmetric, we must have the corresponding Ar-Bl:
Ar = [x, x, x, x, 3, y] #x means don't care for now
Bl = [3, y, x, x, x, x] #y means to be checked later
The indexing of the Al-Br pair mirrors (that is, is symmetric to) the indexing of the Ar-Bl pair!
Therefore, combining all the patterns we found above, we can now find the pivot indexes for evaluating Al, Ar, Bl, and Br. We only need to check the values of the lists at the pivot indexes first; if and only if the values at the pivot indexes of Al, Ar, Bl, and Br match do we need to check the full symmetry criteria (which is what makes the computation efficient!).
Putting all the knowledge above into code, the following is the resulting for-loop Python code to check for symmetry:
for i in range(len(Al)): #for every index in the list
    lai = la - i #just a simplification
    for j in range(1, lai + 1): #lengths from 1 to la - i
        k = lai - j #the mirror (pivot) index
        if Al[i] != Br[k] or Ar[k] != Bl[i]: #values at the pivot indexes do not match
            continue #skip, no need to evaluate
        #from this point on, the values at the pivot indexes match
        n = i #assign n
        m = i + j #assign m
        #test whether the first two conditions for symmetry pass
        if A[n:m] == B[L-m:L-n] and B[n:m] == A[L-m:L-n]: #possibly symmetric
            #if they pass, test the third condition: the rest of the elements must stay in place
            if A[0:n] == B[0:n] and A[m:L-m] == B[m:L-m] and A[L-n:] == B[L-n:]:
                return [n, m] #all three conditions pass: symmetric lists found, return [n, m] immediately
        #reaching this point means the pivot values match but one of the
        #3 symmetry conditions failed, so simply continue
return [] #nothing found - asymmetric lists
And there you go with the symmetric test!
(OK, this is quite a challenge and it takes quite a while for me to figure out how.)
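For completeness, here is a hedged, self-contained version of the loop above wrapped in a function. The setup lines are my own assumption, inferred from the diagrams: Al/Ar are the left/right halves of A, Bl/Br are the halves of B, and la = len(Al):
def symmetric_test(A, B):
    L = len(A)
    Al, Ar = A[:L//2], A[L//2:]  # assumed: left/right halves of A
    Bl, Br = B[:L//2], B[L//2:]  # assumed: left/right halves of B
    la = len(Al)
    for i in range(la):
        lai = la - i
        for j in range(1, lai + 1):
            k = lai - j  # the mirror (pivot) index
            if Al[i] != Br[k] or Ar[k] != Bl[i]:
                continue  # pivot values do not match; skip
            n, m = i, i + j
            if A[n:m] == B[L-m:L-n] and B[n:m] == A[L-m:L-n]:
                if A[0:n] == B[0:n] and A[m:L-m] == B[m:L-m] and A[L-n:] == B[L-n:]:
                    return [n, m]
    return []

print(symmetric_test([10, 11, 12, 13, 14, 15, 16, 17],
                     [10, 15, 16, 13, 14, 11, 12, 17]))  # [1, 3]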
I rewrote the code without some of the complexity (and errors).
def test_o_o(a, b):
    L = len(a)
    H = L//2
    n, m = 0, H-1
    # find the first difference in the left side
    while n < H:
        if a[n] != b[n]: break
        n += 1
    else: return
    # find the last difference in the left side
    while m > -1:
        if a[m] != b[m]: break
        m -= 1
    else: return
    # for slicing, we want end_index+1
    m += 1
    # compare each slice for equality
    # order: beginning, block 1, block 2, middle, end
    if (a[0:n] == b[0:n] and
            a[n:m] == b[L-m:L-n] and
            b[n:m] == a[L-m:L-n] and
            a[m:L-m] == b[m:L-m] and
            a[L-n:L] == b[L-n:L]):
        return n, m
The implementation is both elegant and efficient.
The while/break with else: return structure ensures that the function returns at the earliest possible point. It also validates that n and m have been set to valid values, although this does not appear to be necessary when the slices are checked explicitly; those lines can be removed with no noticeable impact on the timing.
Explicitly comparing the slices will also short-circuit as soon as one evaluates to False.
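For example, a quick check of the rewritten function (the sample pair is borrowed from the test cases used elsewhere in this thread):
print(test_o_o([10, 11, 12, 13, 14, 15, 16, 17],
               [10, 15, 16, 13, 14, 11, 12, 17]))  # (1, 3)
print(test_o_o([0, 1, 2, 3], [0, 1, 2, 3]))        # None: no difference found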
Originally, I checked whether a permutation existed by transforming b into a:
b = b[:]
b[n:m], b[L-m:L-n] = b[L-m:L-n], b[n:m]
if a == b:
    return n, m
But this is slower than explicitly comparing the slices. Let me know if the algorithm doesn't speak for itself and I can offer further explanation (maybe even proof) as to why it works and is minimal.
I tried to implement 3 different algorithms for this task. All of them have O(N) time complexity and require O(1) additional space. Interesting fact: all the other answers (known so far) implement 2 of these algorithms (though they do not always keep the optimal asymptotic time/space complexity). Here is a high-level description of each algorithm:
Algorithm A
Compare the lists, group "non-equal" intervals, and make sure there are exactly two such intervals (with a special case when the intervals meet in the middle).
Check whether the "non-equal" intervals are positioned symmetrically and their contents are also "symmetrical".
Algorithm B
Compare the first halves of the lists to guess where the "intervals to be exchanged" are.
Check whether the contents of these intervals are "symmetrical", and make sure the lists are equal outside of these intervals.
Algorithm C
Compare the first halves of the lists to find the first mismatched element.
Find this mismatched element of the first list in the second one. This hints at the position of the "intervals to be exchanged".
Check whether the contents of these intervals are "symmetrical", and make sure the lists are equal outside of these intervals.
There are two alternative implementations of step 1 for each algorithm: (1) using itertools, and (2) using plain loops (or list comprehensions). itertools is efficient for long lists but relatively slow on short ones.
Here is algorithm C with the first step implemented using itertools. It looks simpler than the other two algorithms (given at the end of this post), and it is pretty fast, even for short lists:
import itertools as it
import operator as op
def test_C(a, b):
    length = len(a)
    half = length // 2
    mismatches = it.imap(op.ne, a, b[:half]) # compare half-lists
    try:
        n = next(it.compress(it.count(), mismatches))
        nr = length - n
        mr = a.index(b[n], half, nr)
        m = length - mr
    except StopIteration: return None
    except ValueError: return None
    if a[n:m] == b[mr:nr] and b[n:m] == a[mr:nr] \
            and a[m:mr] == b[m:mr] and a[nr:] == b[nr:]:
        return (n, m)
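Under Python 2 (where itertools.imap exists), a quick check might look like this (the sample pair is my own):
print test_C([10, 11, 12, 13, 14, 15, 16, 17],
             [10, 15, 16, 13, 14, 11, 12, 17])  # (1, 3)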
This could be done using mostly itertools:
def test_A(a, b):
    equals = it.imap(op.eq, a, b) # compare lists
    e1, e2 = it.tee(equals)
    l = it.chain(e1, [True])
    r = it.chain([True], e2)
    borders = it.imap(op.ne, l, r) # delimit equal/non-equal intervals
    ranges = list(it.islice(it.compress(it.count(), borders), 5))
    if len(ranges) == 4:
        n1, m1 = ranges[0], ranges[1]
        n2, m2 = ranges[2], ranges[3]
    elif len(ranges) == 2:
        n1, m1 = ranges[0], len(a) // 2
        n2, m2 = len(a) // 2, ranges[1]
    else:
        return None
    if n1 == len(a) - m2 and m1 == len(a) - n2 \
            and a[n1:m1] == b[n2:m2] and b[n1:m1] == a[n2:m2]:
        return (n1, m1)
A high-level description of this algorithm is already provided in the comments on the question by @j_random_hacker. Here are some details:
Start with comparing the lists:
A 0 1 2 3 4 5 6 7
B 0 5 6 3 4 1 2 7
= E N N E E N N E
Then find borders between equal/non-equal intervals:
= E N N E E N N E
B _ * _ * _ * _ *
Then determine ranges for non-equal elements:
B _ * _ * _ * _ *
[1 : 3] [5 : 7]
Then check that there are exactly 2 ranges (with a special case when both ranges meet in the middle), that the ranges themselves are positioned symmetrically, and that their contents match.
The other alternative is to use itertools to process only half of each list. This allows a slightly simpler (and slightly faster) algorithm, because there is no need to handle the special case:
def test_B(a, b):
    equals = it.imap(op.eq, a, b[:len(a) // 2]) # compare half-lists
    e1, e2 = it.tee(equals)
    l = it.chain(e1, [True])
    r = it.chain([True], e2)
    borders = it.imap(op.ne, l, r) # delimit equal/non-equal intervals
    ranges = list(it.islice(it.compress(it.count(), borders), 2))
    if len(ranges) != 2:
        return None
    n, m = ranges[0], ranges[1]
    nr, mr = len(a) - n, len(a) - m
    if a[n:m] == b[mr:nr] and b[n:m] == a[mr:nr] \
            and a[m:mr] == b[m:mr] and a[nr:] == b[nr:]:
        return (n, m)
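As with test_C, a quick Python 2 check (sample pair mine):
print test_B([10, 11, 12, 13, 14, 15, 16, 17],
             [10, 15, 16, 13, 14, 11, 12, 17])  # (1, 3)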
This does the right thing:
Br = B[L//2:] + B[:L//2]
same_full = [a == b for (a, b) in zip(A, Br)]
same_part = [a + b for (a, b) in zip(same_full[L//2:], same_full[:L//2])]
for n, vn in enumerate(same_part):
    if vn != 2:
        continue
    m = n
    for vm in same_part[n+1:]:
        if vm != 2:
            break
        m += 1
    if m > n:
        print("n=", n, "m=", m+1)
I'm pretty sure you could do the counting a bit better, but... meh.
I believe the following pseudocode should work:
Find the first index i for which A[i] != B[i] and set n = i. If there is no such element, return success. If n >= L/2, return fail.
Find the first index i > n for which A[i] == B[i] and set m = i. If there is no such element, or m > L/2, set m = L/2.
Check that A[0:n] == B[0:n], A[n:m] == B[L-m:L-n], B[n:m] == A[L-m:L-n], A[m:L-m] == B[m:L-m] and A[L-n:L] == B[L-n:L]. If all are true, return success; else, return fail.
Complexity is O(n), which should be the lowest possible, since one always needs to compare all elements of the lists.
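A direct translation of this pseudocode into Python (my own sketch, returning True/False for success/fail) might look like this:
def check_sym(A, B):
    L = len(A)
    # step 1: first index where the lists differ
    n = next((i for i in range(L) if A[i] != B[i]), None)
    if n is None:
        return True            # no difference at all: success
    if n >= L // 2:
        return False
    # step 2: first index > n where the lists agree again, capped at L/2
    m = next((i for i in range(n + 1, L) if A[i] == B[i]), L // 2)
    m = min(m, L // 2)
    # step 3: the five slice comparisons
    return (A[0:n] == B[0:n] and A[n:m] == B[L-m:L-n]
            and B[n:m] == A[L-m:L-n] and A[m:L-m] == B[m:L-m]
            and A[L-n:L] == B[L-n:L])

print(check_sym(list("01234567"), list("05634127")))  # True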
I build a map of where the characters are in list B, then use that to determine the implied subranges in list A. Once I have the subranges, I can sanity check some of the info, and compare the slices.
If A[i] == x, then where does x appear in B? Call that position p.
I know i, the start of the left subrange.
I know L (= len(A)), so I know L-i, the end of the right subrange.
If I know p, then I know the implied start of the right subrange, assuming that B[p] and A[i] are the start of a symmetric pair of ranges. Thus, the OP's L - m would be p if the lists were symmetric.
Setting L-m == p gives me m, so I have all four end points.
Sanity tests are:
n and m are in left half of list(s)
n <= m (note: OP did not prohibit n == m)
L-n is in right half of list (computed)
L-m is in right half (this is a good check for quick fail)
If all those check out, compare A[left] == B[right] and B[left] == A[right]. Return left if true.
def find_symmetry(a: list, b: list) -> slice or None:
    assert len(a) == len(b)
    assert set(a) == set(b)
    assert len(set(a)) == len(a)
    length = len(a)
    assert length % 2 == 0
    half = length // 2
    b_loc = {bi: n for n, bi in enumerate(b)}
    for n, ai in enumerate(a[:half]):
        L_n = length - 1 - n    # L - n
        L_m = b_loc[ai]         # L - m (speculative)
        if L_m < half:          # Sanity: bail if on the wrong side
            continue
        m = b_loc[a[L_n]]       # If A[n] starts the range, A[m] ends it.
        if m < n or m > half:   # Sanity: bail if backwards or on the wrong side
            continue
        left = slice(n, m+1)
        right = slice(L_m, L_n+1)
        if a[left] == b[right] and b[left] == a[right]:
            return left
    return None
res = find_symmetry(
    [10, 11, 12, 13, 14, 15, 16, 17],
    [10, 15, 16, 13, 14, 11, 12, 17])
assert res == slice(1, 3)
res = find_symmetry(
    [0, 1, 2, 3, 4, 5, 6, 7],
    [1, 0, 2, 3, 4, 5, 7, 6])
assert res is None
res = find_symmetry("abcdefghijklmn", "nbcdefghijklma")
assert res == slice(0, 1)
res = find_symmetry("abcdefghijklmn", "abjklfghicdmen")
assert res == slice(3, 4)
res = find_symmetry("abcdefghijklmn", "ancjkfghidelmb")
assert res == slice(3, 5)
res = find_symmetry("abcdefghijklmn", "bcdefgaijklmnh")
assert res is None
res = find_symmetry("012345", "013245")
assert res == slice(2, 3)
Here's an O(N) solution which passes the test code:
def sym_check(a, b):
    cnt = len(a)
    ml = [a[i] == b[i] for i in range(cnt)]
    sl = [i for i in range(cnt) if (i == 0 or ml[i-1]) and not ml[i]]
    el = [i+1 for i in range(cnt) if not ml[i] and (i == cnt-1 or ml[i+1])]
    assert len(sl) == len(el)
    range_cnt = len(sl)
    if range_cnt == 1:
        start1 = sl[0]
        end2 = el[0]
        if (end2 - start1) % 2 != 0:
            return None
        end1 = (start1 + end2) // 2
        start2 = end1
    elif range_cnt == 2:
        start1, start2 = sl
        end1, end2 = el
    else:
        return None
    if end1 - start1 != end2 - start2:
        return None
    if start1 != cnt - end2:
        return None
    if a[start1:end1] != b[start2:end2]:
        return None
    if b[start1:end1] != a[start2:end2]:
        return None
    return start1, end1
I only tested it with Python 2, but I believe it will also work with Python 3.
It identifies the ranges where the two lists differ. It looks for two such ranges (if there is only one such range, it tries to divide it in half). It then checks that both ranges are the same length and in the proper positions relative to each other. If so, then it checks that the elements in the ranges match.
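For instance (the sample pair is mine, matching the tests used elsewhere in this thread):
print(sym_check([10, 11, 12, 13, 14, 15, 16, 17],
                [10, 15, 16, 13, 14, 11, 12, 17]))  # (1, 3)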
Yet another version:
def compare(a, b):
    i_zip = list(enumerate(zip(a, b)))
    llen = len(a)
    hp = llen // 2

    def find_index(i_zip):
        for i, (x, y) in i_zip:
            if x != y:
                return i
        return i_zip[0][0]

    # n and m are determined by the unmoved items:
    n = find_index(i_zip[:hp])
    p = find_index(i_zip[hp:])
    m = llen - p
    q = llen - n
    # Symmetric?
    if a[:n] + a[p:q] + a[m:p] + a[n:m] + a[q:] != b:
        return None
    return n, m
This solution is based on the following observation:
All validly permuted list pairs A, B adhering to the symmetry requirement will have the structure:
A = P1 + P2 + P3 + P4 + P5
B = P1 + P4 + P3 + P2 + P5
         ^n   ^m^hp^p   ^q  <- indexes
where len(P1) == len(P5) and len(P2) == len(P4).
Therefore, the last 3 lines of the above function determine the correct solution, provided the indexes n and m are correctly determined (p and q are just the mirror indexes of m and n).
Finding n is a matter of determining where the items of A and B start to diverge. The same method is then applied to find p, starting from the midpoint hp; m is just the mirror index of p. With that, all the involved indexes are found and the solution emerges.
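For example (sample pair mine):
print(compare([10, 11, 12, 13, 14, 15, 16, 17],
              [10, 15, 16, 13, 14, 11, 12, 17]))  # (1, 3)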
Make a list (ds) of the indices at which the first halves of the two lists differ.
A possible n is the first such index; the last such index is m - 1.
Check whether the symmetry is valid; len(ds) == m - n makes sure there aren't any gaps.
import itertools as it
import operator as op
def test(a, b):
    sz = len(a)
    ds = list(it.compress(it.count(), map(op.ne, a[:sz//2], b[:sz//2])))
    if not ds:  # the first halves are identical: no candidate ranges
        return None
    n, m = ds[0], ds[-1] + 1
    if a[n:m] == b[sz-m:sz-n] and b[n:m] == a[sz-m:sz-n] and len(ds) == m - n:
        return n, m
    else:
        return None
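A quick check (sample pair mine):
print(test([10, 11, 12, 13, 14, 15, 16, 17],
           [10, 15, 16, 13, 14, 11, 12, 17]))  # (1, 3)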
Here's a simple solution that passes my tests, and yours:
Compare the inputs, looking for a subsequence that does not match.
Transform A by transposing the mismatched subsequence according to the rules. Does the result match B?
The algorithm is O(N); there are no embedded loops, explicit or implicit.
In step 1, I need to detect the case where the swapped substrings are adjacent. This can only happen in the middle of the string, but I found it easier to just look out for the first element of the moved piece (firstval). Step 2 is simpler (and hence less error-prone) than explicitly checking all the constraints.
def compare(A, B):
    same = True
    for i, (a, b) in enumerate(zip(A, B)):
        if same and a != b: # Found the start of a presumed transposition
            same = False
            n = i
            firstval = a # First element of the transposed piece
        elif (not same) and (a == b or b == firstval): # end of the transposition
            m = i
            break
    if same: # no mismatch was found, so the inputs are already equal
        return True
    # Construct the transposed string, compare it to B
    origin = A[n:m]
    if n == 0: # swap begins at the edge
        dest = A[-m:]
        B_expect = dest + A[m:-m] + origin
    else:
        dest = A[-m:-n]
        B_expect = A[:n] + dest + A[m:-m] + origin + A[-n:]
    return bool(B_expect == B)
Sample use:
>>> compare("01234567", "45670123")
True
Bonus: I believe the name for this relationship would be "symmetric block transposition". A block transposition swaps two subsequences, taking ABCDE to ADCBE. (See definition 4 here; I actually found this by googling "ADCBE"). I've added "symmetric" to the name to describe the length conditions.

Find the nth lucky number generated by a sieve in Python

I'm trying to make a program in Python which will generate the nth lucky number according to the lucky number sieve. I'm fairly new to Python so I don't know how to do all that much yet. So far I've figured out how to make a function which determines all lucky numbers below a specified number:
def lucky(number):
    l = range(1, number + 1, 2)
    i = 1
    while i < len(l):
        del l[l[i] - 1::l[i]]
        i += 1
    return l
Is there a way to modify this so that I can instead find the nth lucky number? I thought about increasing the specified number gradually until a list of the appropriate length to find the required lucky number was created, but that seems like a really inefficient way of doing it.
Edit: I came up with this, but is there a better way?
def lucky(number):
    f = 2
    n = number * f
    while True:
        l = range(1, n + 1, 2)
        i = 1
        while i < len(l):
            del l[l[i] - 1::l[i]]
            i += 1
        if len(l) >= number:
            return l[number - 1]
        f += 1
        n = number * f
I came up with this, but is there a better way?
Truth is, there will always be a better way, the remaining question being: is it good enough for your need?
One possible improvement would be to turn all this into a generator function, so that new values are only computed as they are consumed. I came up with this version, which I have only validated up to about the 60th term:
import itertools
def _idx_after_removal(removed_indices, value):
    for removed in removed_indices:
        value -= value / removed
    return value

def _should_be_excluded(removed_indices, value):
    for j in range(len(removed_indices) - 1):
        value_idx = _idx_after_removal(removed_indices[:j + 1], value)
        if value_idx % removed_indices[j + 1] == 0:
            return True
    return False

def lucky():
    yield 1
    removed_indices = [2]
    for i in itertools.count(3, 2):
        if not _should_be_excluded(removed_indices, i):
            yield i
            removed_indices.append(i)
            removed_indices = list(set(removed_indices))
            removed_indices.sort()
If you want to extract, for example, the 100th term from this generator, you can use the itertools nth recipe:
def nth(iterable, n, default=None):
    "Returns the nth item or a default value"
    return next(itertools.islice(iterable, n, None), default)

print nth(lucky(), 100)
I hope this works, and there is without a doubt more room for code improvement (but as stated previously, there is always room for improvement!).
With numpy arrays, you can make use of boolean indexing, which may help. For example:
>>> import numpy as np
>>> a = np.arange(10)
>>> print a
[0 1 2 3 4 5 6 7 8 9]
>>> print a[a > 3]
[4 5 6 7 8 9]
>>> mask = np.array([True, False, True, False, True, False, True, False, True, False])
>>> print a[mask]
[0 2 4 6 8]
Here is a lucky number function using numpy arrays:
import numpy as np

class Didnt_Findit(Exception):
    pass

def lucky(n):
    '''Return the nth lucky number.
    n --> int
    returns int
    '''
    # initial seed
    lucky_numbers = [1]
    # how many numbers do you need to get to n?
    candidates = np.arange(1, n*100, 2)
    # use numpy array boolean indexing
    next_lucky = candidates[candidates > lucky_numbers[-1]][0]
    # accumulate lucky numbers till you have n of them
    while next_lucky < candidates[-1]:
        lucky_numbers.append(next_lucky)
        #print lucky_numbers
        if len(lucky_numbers) == n:
            return lucky_numbers[-1]
        mask_start = next_lucky - 1
        mask_step = next_lucky
        mask = np.array([True] * len(candidates))
        mask[mask_start::mask_step] = False
        #print mask
        candidates = candidates[mask]
        next_lucky = candidates[candidates > lucky_numbers[-1]][0]
    raise Didnt_Findit('n = ', n)
>>> print lucky(10)
33
>>> print lucky(50)
261
>>> print lucky(500)
3975
Checked mine and @icecrime's output for 10, 50 and 500 - they matched.
Yours is much faster than mine and scales better with n.
n = input('enter n ')
a = list(xrange(1, n))
x = a[1]
for i in range(1, n):
    del a[x-1::x]
    x = a[i]
    l = len(a)
    if i == l-1:
        break
print "lucky numbers till %d" % n
print a
Let's do this with an example: print the lucky numbers up to 100.
Put n = 100.
First, a = 1, 2, 3, 4, 5, ..., 100.
x = a[1] = 2
del a[1::2] leaves
a = 1, 3, 5, 7, ..., 99
Now l = 50, and x = 3,
so del a[2::3] leaves a = 1, 3, 7, 9, 13, 15, ...
and the loop continues until i == l-1.
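As a runnable illustration of those first two steps (my own Python 3 sketch of the same deletions):
a = list(range(1, 101))
x = a[1]          # x = 2
del a[x-1::x]     # remove every 2nd element -> 1, 3, 5, ..., 99
x = a[1]          # x = 3
del a[x-1::x]     # remove every 3rd element -> 1, 3, 7, 9, 13, 15, ...
print(a[:6])      # [1, 3, 7, 9, 13, 15]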
