I'm looking for speedy alternatives to my function. The goal is to make a list of 32-bit integers out of integers of any length. The length is explicitly given in a tuple of (value, bitlength). This is part of a bit-banging procedure for an asynchronous interface which takes four 32-bit integers per bus transaction.
All ints are unsigned, positive or zero, and the length can vary between 0 and 2000.
My input is a list of these tuples; the output should be integers with an implicit 32-bit length, with the bits in sequential order. The remaining bits that do not fit into 32 should also be returned.
input: [(0,128),(1,12),(0,32)]
output:[0, 0, 0, 0, 0x100000], 0, 12
I've spent a day or two profiling with cProfile and trying different methods, but I seem to be stuck with functions that handle ~100k tuples per second, which is rather slow. Ideally I would like a 10x speedup, but I haven't got enough experience to know where to start. The ultimate goal is to process more than 4M tuples per second.
Thanks for any help or suggestions.
The fastest I can do is:
def foo(tuples):
    '''make a list of tuples of (int, length) into a list of 32 bit integers [1,2,3]'''
    length = 0
    remlen = 0
    remint = 0
    i32list = []
    for a, b in tuples:
        n = (remint << (32-remlen)) | a  # n = (a << (remlen)) | remint
        length += b
        if length > 32:
            len32 = int(length/32)
            for i in range(len32):
                i32list.append((n >> i*32) & 0xFFFFFFFF)
            remint = n >> (len32*32)
            remlen = length - len32*32
            length = remlen
        elif length == 32:
            appint = n & 0xFFFFFFFF
            remint = 0
            remlen = 0
            length -= 32
            i32list.append(appint)
        else:
            remint = n
            remlen = length
    return i32list, remint, remlen
This has very similar performance:
def tpli_2_32ili(tuples):
    '''make a list of tuples of (int, length) into a list of 32 bit integers [1,2,3]'''
    # binarylist = "".join([np.binary_repr(a, b) for a, b in inp])  # bin(a)[2:].rjust(b, '0')
    binarylist = "".join([bin(a)[2:].rjust(b, '0') for a, b in tuples])
    totallength = len(binarylist)
    tot32 = int(totallength/32)
    i32list = [int(binarylist[i:i+32], 2) for i in range(0, tot32*32, 32)]
    remlen = totallength - tot32*32
    remint = int(binarylist[-remlen:], 2)
    return i32list, remint, remlen
The best I could come up with so far is a 25% speed-up:
from functools import reduce

intMask = 0xffffffff

def f(x, y):
    return (x[0] << y[1]) + y[0], x[1] + y[1]

def jens(input):
    n, length = reduce(f, input, (0, 0))
    remainderBits = length % 32
    intBits = length - remainderBits
    remainder = ((n & intMask) << (32 - remainderBits)) >> (32 - remainderBits)
    n >>= remainderBits
    ints = [n & (intMask << i) for i in range(intBits-32, -32, -32)]
    return ints, remainderBits, remainder

print([hex(x) for x in jens([(0,128),(1,12),(0,32)])[0]])
It uses a long to accumulate the tuple values according to their bit lengths, and then extracts the 32-bit values and the remaining bits from this number. The computation of the overall length (summing up the length values of the input tuples) and the computation of the large value are done in a single pass with reduce, so the loop runs inside a built-in rather than in Python bytecode.
Running martineau's benchmark harness, the best numbers I have seen are:
Fastest to slowest execution speeds using Python 3.6.5
(1,000 executions, best of 3 repetitions)
jens : 0.004151 secs, rel speed 1.00x, 0.00% slower
First snippet : 0.005259 secs, rel speed 1.27x, 26.70% slower
Second snippet : 0.008328 secs, rel speed 2.01x, 100.64% slower
You could probably gain a better speed-up if you use some C extension implementing a bit array.
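If a C extension is not an option, a pure-Python sketch of the same idea (accumulate everything into one big int, then slice it afterwards) could look like the code below. This is only an illustration, not benchmarked against the versions above, and pack32 is just a made-up name:
def pack32(tuples):
    """Pack (value, bitlength) tuples into 32-bit words, first bits ending up in the first word.

    Returns (list of 32-bit ints, leftover value, leftover bit count).
    """
    n = 0
    length = 0
    for value, bits in tuples:
        n = (n << bits) | value    # append the new bits at the low end
        length += bits
    remlen = length % 32
    remint = n & ((1 << remlen) - 1)    # bits that do not fill a whole word
    n >>= remlen
    nwords = length // 32
    words = [(n >> shift) & 0xFFFFFFFF
             for shift in range(32 * (nwords - 1), -1, -32)]
    return words, remint, remlen

print([hex(x) for x in pack32([(0, 128), (1, 12), (0, 32)])[0]])
# ['0x0', '0x0', '0x0', '0x0', '0x100000']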
This isn't an answer with a faster implementation. Instead it's the code in the two snippets you have in your question placed within an extensible benchmarking framework that makes comparing different approaches very easy.
Comparing just those two testcases, it indicates that your second approach does not have very similar performance to the first, based on the output shown. In fact, it's almost twice as slow.
from collections import namedtuple
import sys
from textwrap import dedent
import timeit
import traceback

N = 1000  # Number of executions of each "algorithm".
R = 3  # Number of repetitions of those N executions.

# Common setup for all testcases (executed before any algorithm specific setup).
COMMON_SETUP = dedent("""
    # Import any resources needed defined in outer benchmarking script.
    #from __main__ import ???  # Not needed at this time
""")
class TestCase(namedtuple('CodeFragments', ['setup', 'test'])):
    """ A test case is composed of separate setup and test code fragments. """
    def __new__(cls, setup, test):
        """ Dedent code fragment in each string argument. """
        return tuple.__new__(cls, (dedent(setup), dedent(test)))
testcases = {
    "First snippet": TestCase("""
        def foo(tuples):
            '''make a list of tuples of (int, length) into a list of 32 bit integers [1,2,3]'''
            length = 0
            remlen = 0
            remint = 0
            i32list = []
            for a, b in tuples:
                n = (remint << (32-remlen)) | a #n = (a << (remlen)) | remint
                length += b
                if length > 32:
                    len32 = int(length/32)
                    for i in range(len32):
                        i32list.append((n >> i*32) & 0xFFFFFFFF)
                    remint = n >> (len32*32)
                    remlen = length - len32*32
                    length = remlen
                elif length == 32:
                    appint = n & 0xFFFFFFFF
                    remint = 0
                    remlen = 0
                    length -= 32
                    i32list.append(appint)
                else:
                    remint = n
                    remlen = length
            return i32list, remint, remlen
        """, """
        foo([(0,128),(1,12),(0,32)])
        """
    ),
    "Second snippet": TestCase("""
        def tpli_2_32ili(tuples):
            '''make a list of tuples of (int, length) into a list of 32 bit integers [1,2,3]'''
            binarylist = "".join([bin(a)[2:].rjust(b, '0') for a, b in tuples])
            totallength = len(binarylist)
            tot32 = int(totallength/32)
            i32list = [int(binarylist[i:i+32],2) for i in range(0, tot32*32, 32) ]
            remlen = totallength - tot32*32
            remint = int(binarylist[-remlen:],2)
            return i32list, remint, remlen
        """, """
        tpli_2_32ili([(0,128),(1,12),(0,32)])
        """
    ),
}
# Collect timing results of executing each testcase multiple times.
try:
    results = [
        (label,
         min(timeit.repeat(testcases[label].test,
                           setup=COMMON_SETUP + testcases[label].setup,
                           repeat=R, number=N)),
        ) for label in testcases
    ]
except Exception:
    traceback.print_exc(file=sys.stdout)  # direct output to stdout
    sys.exit(1)
# Display results.
major, minor, micro = sys.version_info[:3]
print('Fastest to slowest execution speeds using Python {}.{}.{}\n'
      '({:,d} executions, best of {:d} repetitions)'.format(major, minor, micro, N, R))
print()

longest = max(len(result[0]) for result in results)  # length of longest label
ranked = sorted(results, key=lambda t: t[1])  # ascending sort by execution time
fastest = ranked[0][1]

for result in ranked:
    print('{:>{width}} : {:9.6f} secs, rel speed {:5,.2f}x, {:8,.2f}% slower '
          ''.format(
              result[0], result[1], round(result[1]/fastest, 2),
              round((result[1]/fastest - 1) * 100, 2),
              width=longest))
Output:
Fastest to slowest execution speeds using Python 3.6.5
(1,000 executions, best of 3 repetitions)
First snippet : 0.003024 secs, rel speed 1.00x, 0.00% slower
Second snippet : 0.005085 secs, rel speed 1.68x, 68.13% slower
So I'm writing a program in Python to get the GCD of any amount of numbers.
def GCD(numbers):
    if numbers[-1] == 0:
        return numbers[0]

    # i'm stuck here, this is wrong
    for i in range(len(numbers)-1):
        print GCD([numbers[i+1], numbers[i] % numbers[i+1]])

print GCD(30, 40, 36)
The function takes a list of numbers.
This should print 2. However, I don't understand how to use the algorithm recursively so it can handle multiple numbers. Can someone explain?
updated, still not working:
def GCD(numbers):
    if numbers[-1] == 0:
        return numbers[0]

    gcd = 0
    for i in range(len(numbers)):
        gcd = GCD([numbers[i+1], numbers[i] % numbers[i+1]])
        gcdtemp = GCD([gcd, numbers[i+2]])
        gcd = gcdtemp
    return gcd
Ok, solved it
def GCD(a, b):
    if b == 0:
        return a
    else:
        return GCD(b, a % b)
and then use reduce, like
reduce(GCD, (30, 40, 36))
Since GCD is associative, GCD(a,b,c,d) is the same as GCD(GCD(GCD(a,b),c),d). In this case, Python's reduce function would be a good candidate for reducing the cases for which len(numbers) > 2 to a simple 2-number comparison. The code would look something like this:
if len(numbers) > 2:
    return reduce(lambda x, y: GCD([x, y]), numbers)
Reduce applies the given function to each element in the list, so that something like
gcd = reduce(lambda x,y:GCD([x,y]),[a,b,c,d])
is the same as doing
gcd = GCD(a,b)
gcd = GCD(gcd,c)
gcd = GCD(gcd,d)
Now the only thing left is to code for when len(numbers) <= 2. Passing only two arguments to GCD in reduce ensures that your function recurses at most once (since len(numbers) > 2 only in the original call), which has the additional benefit of never overflowing the stack.
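Putting those pieces together, the whole function might look like this. This is only a sketch of the idea above; on Python 3 reduce has to be imported from functools:
from functools import reduce  # built in on Python 2

def GCD(numbers):
    if len(numbers) > 2:
        # Reduce the general case to repeated two-number GCDs.
        return reduce(lambda x, y: GCD([x, y]), numbers)
    if len(numbers) == 1:
        return numbers[0]
    a, b = numbers
    return GCD([b, a % b]) if b else a

print(GCD([30, 40, 36]))  # 2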
You can use reduce:
>>> from fractions import gcd
>>> reduce(gcd,(30,40,60))
10
which is equivalent to:
>>> lis = (30,40,60,70)
>>> res = gcd(*lis[:2]) #get the gcd of first two numbers
>>> for x in lis[2:]: #now iterate over the list starting from the 3rd element
... res = gcd(res,x)
>>> res
10
help on reduce:
>>> reduce?
Type: builtin_function_or_method
reduce(function, sequence[, initial]) -> value
Apply a function of two arguments cumulatively to the items of a sequence,
from left to right, so as to reduce the sequence to a single value.
For example, reduce(lambda x, y: x+y, [1, 2, 3, 4, 5]) calculates
((((1+2)+3)+4)+5). If initial is present, it is placed before the items
of the sequence in the calculation, and serves as a default when the
sequence is empty.
Python 3.9 introduced a multiple-argument version of math.gcd, so you can use:
import math
math.gcd(30, 40, 36)
3.5 <= Python <= 3.8.x:
import functools
import math
functools.reduce(math.gcd, (30, 40, 36))
3 <= Python < 3.5:
import fractions
import functools
functools.reduce(fractions.gcd, (30, 40, 36))
A solution for finding the LCM of more than two numbers in Python is as follows:
# finding LCM (Least Common Multiple) of a series of numbers
from functools import reduce  # reduce is a built-in on Python 2

def GCD(a, b):
    # Gives greatest common divisor using Euclid's Algorithm.
    while b:
        a, b = b, a % b
    return a

def LCM(a, b):
    # Gives lowest common multiple of two numbers.
    return a * b // GCD(a, b)

def LCMM(*args):
    # Gives LCM of a list of numbers passed as arguments.
    return reduce(LCM, args)
Here I've added +1 to the last argument of the range() function, because range(1, n) only runs from 1 up to n-1:
print ("LCM of numbers (1 to 5) : " + str(LCMM(*range(1, 5+1))))
print ("LCM of numbers (1 to 10) : " + str(LCMM(*range(1, 10+1))))
print (reduce(LCMM,(1,2,3,4,5)))
Those who are new to Python can read more about the reduce() function in the documentation.
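For the GCD asked about in the question, the same reduce pattern applies. A sketch reusing the GCD helper and the reduce import from the block above (GCDD is a made-up name):
def GCDD(*args):
    # Gives GCD of a list of numbers passed as arguments.
    return reduce(GCD, args)

print("GCD of numbers (30, 40, 36) :", GCDD(30, 40, 36))  # 2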
The GCD operator is commutative and associative. This means that
gcd(a,b,c) = gcd(gcd(a,b),c) = gcd(a,gcd(b,c))
So once you know how to do it for two numbers, you can do it for any number of them.
To do it for two numbers, you simply need to implement Euclid's formula, which is simply:
// Ensure a >= b >= 1, flip a and b if necessary
while b > 0
    t = a % b
    a = b
    b = t
end
return a
Define that function as, say euclid(a,b). Then, you can define gcd(nums) as:
if (len(nums) == 1)
    return nums[1]
else
    return euclid(nums[1], gcd(nums[2:]))
This uses the associative property of gcd() to compute the answer
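A direct Python translation of that pseudocode might look like this (a sketch, using 0-based indexing):
def euclid(a, b):
    # Euclid's algorithm for two numbers.
    while b > 0:
        a, b = b, a % b
    return a

def gcd(nums):
    # gcd(a, b, c) = gcd(a, gcd(b, c)), applied recursively over the list.
    if len(nums) == 1:
        return nums[0]
    return euclid(nums[0], gcd(nums[1:]))

print(gcd([30, 40, 36]))  # 2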
Try calling the GCD() as follows,
i = 0
temp = numbers[i]
for i in range(len(numbers)-1):
    temp = GCD(numbers[i+1], temp)
My way of solving it in Python. Hope it helps.
def find_gcd(arr):
    if len(arr) <= 1:
        return arr
    else:
        for i in range(len(arr)-1):
            a = arr[i]
            b = arr[i+1]
            while b:
                a, b = b, a % b
            arr[i+1] = a
        return a

def main(array):
    print(find_gcd(array))

main(array=[8, 18, 22, 24])  # 2
main(array=[8, 24])  # 8
main(array=[5])  # [5]
main(array=[])  # []
Some dynamics of how I understand it:
ex.[8, 18] -> [18, 8] -> [8, 2] -> [2, 0]
18 = 8x + 2 = (2y)x + 2 = 2z where z = xy + 1
ex.[18, 22] -> [22, 18] -> [18, 4] -> [4, 2] -> [2, 0]
22 = 18w + 4 = (4x+2)w + 4 = ((2y)x + 2)w + 2 = 2z
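A tiny helper to print those steps, just to illustrate the hand-worked traces above (gcd_trace is a made-up name):
def gcd_trace(a, b):
    # Print each Euclid step as an [a, b] pair, as in the examples above.
    while b:
        print([a, b])
        a, b = b, a % b
    print([a, b])
    return a

gcd_trace(8, 18)   # [8, 18] [18, 8] [8, 2] [2, 0]
gcd_trace(18, 22)  # [18, 22] [22, 18] [18, 4] [4, 2] [2, 0]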
As of Python 3.9 (beta 4), math.gcd has built-in support for finding the gcd of more than two numbers.
Python 3.9.0b4 (v3.9.0b4:69dec9c8d2, Jul 2 2020, 18:41:53)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import math
>>> A = [30, 40, 36]
>>> print(math.gcd(*A))
2
One of the issues is that many of the calculations only work with numbers greater than 1. I modified an existing solution so that it accepts numbers smaller than 1. Basically, we can rescale the array using the minimum value and then use that to calculate the GCD of numbers smaller than 1.
# GCD of more than two (or an array of) numbers - allows floating point numbers
# Function implements the Euclidean algorithm to find the H.C.F. of two numbers
def find_gcd(x, y):
    while y:
        x, y = y, x % y
    return x

# Driver Code
l_org = [60e-6, 20e-6, 30e-6]
min_val = min(l_org)
l = [item/min_val for item in l_org]

num1 = l[0]
num2 = l[1]
gcd = find_gcd(num1, num2)

for i in range(2, len(l)):
    gcd = find_gcd(gcd, l[i])

gcd = gcd * min_val
print(gcd)
Here is a simple method to find the GCD of two numbers:
a = int(input("Enter the value of first number:"))
b = int(input("Enter the value of second number:"))
c, d = a, b
while a != 0:
    b, a = a, b % a
print("GCD of", c, "and", d, "is", b)
As you said, you need a program that takes any number of numbers and prints their HCF.
In this code you enter numbers separated by spaces and press Enter to get the GCD.
num = list(map(int, input().split()))  # TAKES INPUT

def print_factors(x):  # MAKES LIST OF FACTORS OF ONE INPUT NUMBER
    factors = [i for i in range(1, x + 1) if x % i == 0]
    return factors

p = [print_factors(number) for number in num]

result = set(p[0])
for s in p[1:]:  # KEEPS ONLY THE FACTORS COMMON TO ALL INPUT NUMBERS
    result.intersection_update(s)

print('HCF', max(result))  # THE GCD/HCF IS THE LARGEST COMMON FACTOR
Hope it helped
Lemme clarify:
What would be the fastest way to get every number with all unique digits between two numbers, for example between 10,000 and 100,000?
Some obvious ones would be 12,345 or 23,456. I'm trying to find a way to gather all of them.
for i in xrange(LOW, HIGH):
    str_i = str(i)
    ...?
Use itertools.permutations:
from itertools import permutations
result = [
    a * 10000 + b * 1000 + c * 100 + d * 10 + e
    for a, b, c, d, e in permutations(range(10), 5)
    if a != 0
]
I used the facts that:
numbers between 10000 and 100000 have either 5 or 6 digits, and the only 6-digit number here (100,000) does not have unique digits anyway,
itertools.permutations creates all combinations, with all orderings (so both 12345 and 54321 will appear in the result), of a given length,
you can do permutations directly on a sequence of integers (so no overhead for converting the types).
EDIT:
Thanks for accepting my answer, but here is the data for the others, comparing mentioned results:
>>> from timeit import timeit
>>> stmt1 = '''
a = []
for i in xrange(10000, 100000):
    s = str(i)
    if len(set(s)) == len(s):
        a.append(s)
'''
>>> stmt2 = '''
result = [
    int(''.join(digits))
    for digits in permutations('0123456789', 5)
    if digits[0] != '0'
]
'''
>>> setup2 = 'from itertools import permutations'
>>> stmt3 = '''
result = [
    x for x in xrange(10000, 100000)
    if len(set(str(x))) == len(str(x))
]
'''
>>> stmt4 = '''
result = [
    a * 10000 + b * 1000 + c * 100 + d * 10 + e
    for a, b, c, d, e in permutations(range(10), 5)
    if a != 0
]
'''
>>> setup4 = setup2
>>> timeit(stmt1, number=100)
7.955858945846558
>>> timeit(stmt2, setup2, number=100)
1.879319190979004
>>> timeit(stmt3, number=100)
8.599710941314697
>>> timeit(stmt4, setup4, number=100)
0.7493319511413574
So, to sum up:
solution no. 1 took 7.96 s,
solution no. 2 (my original solution) took 1.88 s,
solution no. 3 took 8.6 s,
solution no. 4 (my updated solution) took 0.75 s,
The last solution is around 10x faster than the solutions proposed by others.
Note: My solution has some imports that I did not measure. I assumed your imports will happen once, and code will be executed multiple times. If it is not the case, please adapt the tests to your needs.
EDIT #2: I have added another solution, as operating on strings is not even necessary - it can be achieved by having permutations of real integers. I bet this can be sped up even more.
Cheap way to do this:
for i in xrange(LOW, HIGH):
    s = str(i)
    if len(set(s)) == len(s):
        # number has unique digits
This uses a set to collect the unique digits, then checks to see that there are as many unique digits as digits in total.
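Collected into a list, the same idea might look like the following sketch (LOW and HIGH are the bounds from the question; xrange is Python 2):
unique = []
for i in xrange(LOW, HIGH):   # e.g. LOW, HIGH = 10000, 100000
    s = str(i)
    if len(set(s)) == len(s):
        unique.append(i)      # number has unique digits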
List comprehension will work a treat here (logic stolen from nneonneo):
[x for x in xrange(LOW,HIGH) if len(set(str(x)))==len(str(x))]
And a timeit for those who are curious:
> python -m timeit '[x for x in xrange(10000,100000) if len(set(str(x)))==len(str(x))]'
10 loops, best of 3: 101 msec per loop
Here is an answer from scratch:
def permute(L, max_len):
    allowed = L[:]
    results, seq = [], range(max_len)
    def helper(d):
        if d == 0:
            results.append(''.join(seq))
        else:
            for i in xrange(len(L)):
                if allowed[i]:
                    allowed[i] = False
                    seq[d-1] = L[i]
                    helper(d-1)
                    allowed[i] = True
    helper(max_len)
    return results

A = permute(list("1234567890"), 5)
print A
print len(A)
print all(map(lambda a: len(set(a)) == len(a), A))
It perhaps could be further optimized by using an interval representation of the allowed elements, although for n=10, I'm not sure it will make a difference. I could also transform the recursion into a loop, but in this form it is more elegant and clear.
Edit: Here are the timings of the various solutions
2.75808000565 (My solution)
8.22729802132 (Sol 1)
1.97218298912 (Sol 2)
9.659760952 (Sol 3)
0.841020822525 (Sol 4)
no_list = ['115432', '555555', '1234567', '5467899', '3456789', '987654', '444444']
rep_list = []
nonrep_list = []
for no in no_list:
    u = []
    for digit in no:
        if digit not in u:
            u.append(digit)
    # If there is a repeat
    if len(no) != len(u):
        rep_list.append(no)
    # If there is no repetition
    else:
        nonrep_list.append(no)
print('Numbers which have repetition are =', rep_list)
print('Numbers which have no repetition are =', nonrep_list)
I have two numbers (binary or not, does not play any role) which differ in just one bit, e.g. (pseudocode)
a = 11111111
b = 11011111
I want a simple python function that returns the bit position that differs ('5' in the given example, when seen from right to left). My solution would be (python)
math.log(abs(a-b))/math.log(2)
but I wonder if there is a more elegant way to do this (without using floats etc.).
Thanks
Alex
You could use the binary exclusive or (XOR):
a = 0b11111111
b = 0b11011111
diff = a^b  # 0b100000
diff.bit_length()-1  # 5 (the highest position which differs; -1 if a == b)
Unless I am missing something...
this should work:
>>> def find_bit(a, b):
        a = a[::-1]
        b = b[::-1]
        for i in xrange(len(a)):
            if a[i] != b[i]:
                return i
        return None
>>> a = "11111111"
>>> b = "11011111"
>>> find_bit(a,b)
5
Maybe not so elegant, but it's easy to understand, and it gets the job done.
Without using bitwise operations you could do something like this:
In [1]: def difbit(a, b):
   ...:     if a == b: return None
   ...:     i = 0
   ...:     while a%2 == b%2:
   ...:         i += 1
   ...:         a //= 2
   ...:         b //= 2
   ...:     return i
   ...:
In [2]: difbit(0b11111111, 0b11011111)
Out[2]: 5
Using (a^b).bit_length()-1 is perfect for numbers which have only one difference bit. EX:
a = 0b1000000
b = 0b1000001
(a^b).bit_length()-1
Output: 0
But for numbers which have multiple difference bits, it gives the index of left most difference bit. EX:
a = 0b111111111111111111111111111111
b = 0b111111110111011111111111111111
c = a^b  # 0b1000100000000000000000
c.bit_length()-1
Output: 21  # Instead of 17; 21 is the leftmost difference bit
So to solve this problem we need to isolate the rightmost set bit and then get its index. Thus, using ((a^b) & (-(a^b))).bit_length()-1 works best for all inputs:
c = (a^b) & (-(a^b))  # 0b100000000000000000 - isolates the rightmost set bit
c.bit_length()-1
Output: 17
((a^b) & (-(a^b))).bit_length()-1
Output: 17
Isolating the rightmost set bit with n & -n is a standard bit-manipulation trick worth reading up on.
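Put together as a small helper (a sketch; lowest_diff_bit is a made-up name):
def lowest_diff_bit(a, b):
    # Index (from the right, 0-based) of the lowest bit where a and b differ,
    # or -1 if the numbers are equal.
    d = a ^ b          # set bits mark every position that differs
    if d == 0:
        return -1
    return (d & -d).bit_length() - 1   # isolate the lowest set bit, then index it

print(lowest_diff_bit(0b11111111, 0b11011111))  # 5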
This is a generalization of the "string contains substring" problem to (more) arbitrary types.
Given a sequence (such as a list or tuple), what's the best way of determining whether another sequence is inside it? As a bonus, it should return the index of the element where the subsequence starts:
Example usage (Sequence in Sequence):
>>> seq_in_seq([5,6], [4,'a',3,5,6])
3
>>> seq_in_seq([5,7], [4,'a',3,5,6])
-1 # or None, or whatever
So far, I just rely on brute force and it seems slow, ugly, and clumsy.
I second the Knuth-Morris-Pratt algorithm. By the way, your problem (and the KMP solution) is exactly recipe 5.13 in Python Cookbook 2nd edition. You can find the related code at http://code.activestate.com/recipes/117214/
It finds all the correct subsequences in a given sequence, and should be used as an iterator:
>>> for s in KnuthMorrisPratt([4,'a',3,5,6], [5,6]): print s
3
>>> for s in KnuthMorrisPratt([4,'a',3,5,6], [5,7]): print s
(nothing)
Here's a brute-force approach O(n*m) (similar to #mcella's answer). It might be faster than the Knuth-Morris-Pratt algorithm implementation in pure Python O(n+m) (see #Gregg Lind answer) for small input sequences.
#!/usr/bin/env python
def index(subseq, seq):
    """Return an index of `subseq`uence in the `seq`uence.

    Or `-1` if `subseq` is not a subsequence of the `seq`.

    The time complexity of the algorithm is O(n*m), where

        n, m = len(seq), len(subseq)

    >>> index([1,2], range(5))
    1
    >>> index(range(1, 6), range(5))
    -1
    >>> index(range(5), range(5))
    0
    >>> index([1,2], [0, 1, 0, 1, 2])
    3
    """
    i, n, m = -1, len(seq), len(subseq)
    try:
        while True:
            i = seq.index(subseq[0], i + 1, n - m + 1)
            if subseq == seq[i:i + m]:
                return i
    except ValueError:
        return -1

if __name__ == '__main__':
    import doctest; doctest.testmod()
I wonder how large "small" is in this case?
A simple approach: Convert to strings and rely on string matching.
Example using lists of strings:
>>> f = ["foo", "bar", "baz"]
>>> g = ["foo", "bar"]
>>> ff = str(f).strip("[]")
>>> gg = str(g).strip("[]")
>>> gg in ff
True
Example using tuples of strings:
>>> x = ("foo", "bar", "baz")
>>> y = ("bar", "baz")
>>> xx = str(x).strip("()")
>>> yy = str(y).strip("()")
>>> yy in xx
True
Example using lists of numbers:
>>> f = [1 , 2, 3, 4, 5, 6, 7]
>>> g = [4, 5, 6]
>>> ff = str(f).strip("[]")
>>> gg = str(g).strip("[]")
>>> gg in ff
True
Same thing as string matching sir...Knuth-Morris-Pratt string matching
>>> def seq_in_seq(subseq, seq):
...     while subseq[0] in seq:
...         index = seq.index(subseq[0])
...         if subseq == seq[index:index + len(subseq)]:
...             return index
...         else:
...             seq = seq[index + 1:]
...     else:
...         return -1
...
>>> seq_in_seq([5,6], [4,'a',3,5,6])
3
>>> seq_in_seq([5,7], [4,'a',3,5,6])
-1
Sorry I'm not an algorithm expert, it's just the fastest thing my mind can think about at the moment, at least I think it looks nice (to me) and I had fun coding it. ;-)
Most probably it's the same thing your brute force approach is doing.
Brute force may be fine for small patterns.
For larger ones, look at the Aho-Corasick algorithm.
Here is another KMP implementation:
from itertools import tee

def seq_in_seq(seq1, seq2):
    '''
    Return the index where seq1 appears in seq2, or -1 if
    seq1 is not in seq2, using the Knuth-Morris-Pratt algorithm

    based heavily on code by Neale Pickett <neale#woozle.org>
    found at: woozle.org/~neale/src/python/kmp.py

    >>> seq_in_seq(range(3),range(5))
    0
    >>> seq_in_seq(range(3)[-1:],range(5))
    2
    >>> seq_in_seq(range(6),range(5))
    -1
    '''
    def compute_prefix_function(p):
        m = len(p)
        pi = [0] * m
        k = 0
        for q in xrange(1, m):
            while k > 0 and p[k] != p[q]:
                k = pi[k - 1]
            if p[k] == p[q]:
                k = k + 1
            pi[q] = k
        return pi

    t, p = list(tee(seq2)[0]), list(tee(seq1)[0])
    m, n = len(p), len(t)
    pi = compute_prefix_function(p)
    q = 0
    for i in range(n):
        while q > 0 and p[q] != t[i]:
            q = pi[q - 1]
        if p[q] == t[i]:
            q = q + 1
        if q == m:
            return i - m + 1
    return -1
I'm a bit late to the party, but here's something simple using strings:
>>> def seq_in_seq(sub, full):
...     f = ''.join([repr(d) for d in full]).replace("'", "")
...     s = ''.join([repr(d) for d in sub]).replace("'", "")
...     #return f.find(s)  #<-- not reliable for finding indices in all cases
...     return s in f
...
>>> seq_in_seq([5,6], [4,'a',3,5,6])
True
>>> seq_in_seq([5,7], [4,'a',3,5,6])
False
>>> seq_in_seq([4,'abc',33], [4,'abc',33,5,6])
True
>>> seq_in_seq([4,'abc',33], [4,'abc',33,5,6])
True
As noted by Ilya V. Schurov, the find method in this case will not return the correct indices with multi-character strings or multi-digit numbers.
For what it's worth, I tried using a deque like so:
from collections import deque
from itertools import islice

def seq_in_seq(needle, haystack):
    """Generator of indices where needle is found in haystack."""
    needle = deque(needle)
    haystack = iter(haystack)  # Works with iterators/streams!
    length = len(needle)
    # Deque will automatically call deque.popleft() after deque.append()
    # with the `maxlen` set equal to the needle length.
    window = deque(islice(haystack, length), maxlen=length)
    if needle == window:
        yield 0  # Match at the start of the haystack.
    for index, value in enumerate(haystack, start=1):
        window.append(value)
        if needle == window:
            yield index
One advantage of the deque implementation is that it makes only a single linear pass over the haystack. So if the haystack is streaming then it will still work (unlike the solutions that rely on slicing).
The solution is still brute-force, O(n*m). Some simple local benchmarking showed it was ~100x slower than the C-implementation of string searching in str.index.
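For example, a quick check of the generator above against the question's sample input (not part of the original answer):
>>> list(seq_in_seq([5, 6], [4, 'a', 3, 5, 6]))
[3]
>>> list(seq_in_seq([5, 7], [4, 'a', 3, 5, 6]))
[]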
Another approach, using sets:
>>> set([5,6]) == set([5,6]) & set([4,'a',3,5,6])
True