Adversarial inputs for Python substring search

Adversarial inputs for Python substring search - python

The CPython implementation of substring search (e.g. via in) is implemented by the following algorithm.
def find(s, p):
# find first occurrence of p in s
n = len(s)
m = len(p)
skip = delta1(p)[p[m-1]]
i = 0
while i <= n-m:
if s[i+m-1] == p[m-1]: # (boyer-moore)
# potential match
if s[i:i+m-1] == p[:m-1]:
return i
if s[i+m] not in p:
i = i + m + 1 # (sunday)
else:
i = i + skip # (horspool)
else:
# skip
if s[i+m] not in p:
i = i + m + 1 # (sunday)
else:
i = i + 1
return -1 # not found
At least, according to this source (taken from this older answer) written by the author (?) of the CPython implementation.
This same source mentions a worst-case complexity of this algorithm as O(nm), where n and m are the lengths of the two strings. I am interested in whether this bound is tight. My question is:
Are there adversarial examples for the algorithm used in Python in? Can we give a sequence of pairs of strings (pattern, string) so that running pattern in string takes quadratic (or at least superlinear) time?
The standard example that demonstrates the quadratic worst-case run-time of naive substring search, where string = 'a'*n and pattern = 'a'*m + b does not work.

The naive example of s='a'*n and p='a'*m+'b' does not work because of the line
if s[i+m-1] == p[m-1]:
This checks the last character (not the first) of p ('b') with the corresponding current position in s. As this fails, then the result is to just a single iteration over s, which is why it is so fast.
If you flip p (s='a'*n and p='b'+'a'*m), then a similar thing occurs - this time the above line passes (the last character of p is now 'a'), but then p is iterated over forwards, so then the 'b' is found quickly, so again this example is linear and fast.
A simple change to the naive example that would show O(nm) behaviour is s='a'*n and p='a'*m+'ba'. In this case, the last character of p is 'a', so the initial check passes, but then it needs to iterate over the rest of p before it gets to the 'b'.
# full='a'*n; sub='a'*m+'b'
>>> timeit("sub in full", "sub='a'*10+'b'; full='a'*100")
0.13620498299860628
>>> timeit("sub in full", "sub='a'*10+'b'; full='a'*1000")
0.9594046580004942
>>> timeit("sub in full", "sub='a'*100+'b'; full='a'*1000")
0.9768632190007338
# Linear in n, but m has minimal effect: ~O(n)
# full='a'*n; sub='a'*m+'ba'
>>> timeit("sub in full", "sub='a'*10+'ba'; full='a'*100")
0.35251976200015633
>>> timeit("sub in full", "sub='a'*10+'ba'; full='a'*1000")
3.4642483099996753
>>> timeit("sub in full", "sub='a'*100+'ba'; full='a'*1000")
27.152958754999418
# Both n and m have linear effect: ~O(nm)

Try this:
import re
import time
def slow_match(n):
pat = 'a' + ('z' * n)
str = 'z' * (n + n)
start_time = time.time()
if re.search(pat, str):
print("Shouldn't happen")
print(("Searched", n, time.time() - start_time))
slow_match(10000)
slow_match(50000)
slow_match(100000)
slow_match(300000)

Related

forcing ndiff on very dissimilar strings

The ndiff function from difflib allows a nice interface to detect differences in lines. It does a great job when the lines are close enough:
>>> print '\n'.join(list(ndiff(['foo*'], ['foot'], )))
- foo*
? ^
+ foot
? ^
But when the lines are too dissimilar, the rich reporting is no longer possible:
>>> print '\n'.join(list(ndiff(['foo'], ['foo*****'], )))
- foo
+ foo*****
This is the use case I am hitting, and I am trying to find ways to use ndiff (or the underlying class Differ) to force the reporting even if the strings are too dissimilar.
For the failing example, I would like to have a result like:
>>> print '\n'.join(list(ndiff(['foo'], ['foo*****'], )))
- foo
+ foo*****
? +++++

The function responsible for printing the context (i.e. those lines starting with ?) is Differ._fancy_replace. That function works by checking whether the two lines are equal by at least 75% (see the cutoff variable). Unfortunately, that 75% cutoff is hard-coded and cannot be changed.
What I can suggest is to subclass Differ and provide a version of _fancy_replace that simply ignores the cutoff. Here it is:
from difflib import Differ, SequenceMatcher
class FullContextDiffer(Differ):
def _fancy_replace(self, a, alo, ahi, b, blo, bhi):
"""
Copied and adapted from https://github.com/python/cpython/blob/3.6/Lib/difflib.py#L928
"""
best_ratio = 0
cruncher = SequenceMatcher(self.charjunk)
for j in range(blo, bhi):
bj = b[j]
cruncher.set_seq2(bj)
for i in range(alo, ahi):
ai = a[i]
if ai == bj:
continue
cruncher.set_seq1(ai)
if cruncher.real_quick_ratio() > best_ratio and \
cruncher.quick_ratio() > best_ratio and \
cruncher.ratio() > best_ratio:
best_ratio, best_i, best_j = cruncher.ratio(), i, j
yield from self._fancy_helper(a, alo, best_i, b, blo, best_j)
aelt, belt = a[best_i], b[best_j]
atags = btags = ""
cruncher.set_seqs(aelt, belt)
for tag, ai1, ai2, bj1, bj2 in cruncher.get_opcodes():
la, lb = ai2 - ai1, bj2 - bj1
if tag == 'replace':
atags += '^' * la
btags += '^' * lb
elif tag == 'delete':
atags += '-' * la
elif tag == 'insert':
btags += '+' * lb
elif tag == 'equal':
atags += ' ' * la
btags += ' ' * lb
else:
raise ValueError('unknown tag %r' % (tag,))
yield from self._qformat(aelt, belt, atags, btags)
yield from self._fancy_helper(a, best_i+1, ahi, b, best_j+1, bhi)
And here is an example of how it works:
a = [
'foo',
'bar',
'foobar',
]
b = [
'foo',
'bar',
'barfoo',
]
print('\n'.join(FullContextDiffer().compare(a, b)))
# Output:
#
# foo
# bar
# - foobar
# ? ---
#
# + barfoo
# ? +++

It seems what you want to do here is not to compare across multiple lines, but across strings. You can then pass your strings directly, without a list, and you should get a behaviour close to the one you are looking for.
>>> print ('\n'.join(list(ndiff('foo', 'foo*****'))))
f
o
o
+ *
+ *
+ *
+ *
+ *
Even though the output format is not the exact one you are looking for, it encapsulate the correct information. We can make an output adapter to give the correct format.
def adapter(out):
chars = []
symbols = []
for c in out:
chars.append(c[2])
symbols.append(c[0])
return ''.join(chars), ''.join(symbols)
This can be used like so.
>>> print ('\n'.join(adapter(ndiff('foo', 'foo*****'))))
foo*****
+++++

Optimize Python math code for fixed values of variables in function

I have a very long math formula (just to put you in context: it has 293095 characters) which in practice will be the body of a python function. This function has 15 input parameters as in:
def math_func(t,X,P,n1,n2,R,r):
x,y,z = X
a,b,c = P
u1,v1,w1 = n1
u2,v2,w2 = n2
return <long math formula>
The formula uses simple math operations + - * ** / and one function call to arctan. Here an extract of it:
r*((-16*(r**6*t*u1**6 - 6*r**6*u1**5*u2 - 15*r**6*t*u1**4*u2**2 +
20*r**6*u1**3*u2**3 + 15*r**6*t*u1**2*u2**4 - 6*r**6*u1*u2**5 -
r**6*t*u2**6 + 3*r**6*t*u1**4*v1**2 - 12*r**6*u1**3*u2*v1**2 -
18*r**6*t*u1**2*u2**2*v1**2 + 12*r**6*u1*u2**3*v1**2 +
3*r**6*t*u2**4*v1**2 + 3*r**6*t*u1**2*v1**4 - 6*r**6*u1*u2*v1**4 -
3*r**6*t*u2**2*v1**4 + r**6*t*v1**6 - 6*r**6*u1**4*v1*v2 -
24*r**6*t*u1**3*u2*v1*v2 + 36*r**6*u1**2*u2**2*v1*v2 +
24*r**6*t*u1*u2**3*v1*v2 - 6*r**6*u2**4*v1*v2 -
12*r**6*u1**2*v1**3*v2 - 24*r**6*t*u1*u2*v1**3*v2 +
12*r**6*u2**2*v1**3*v2 - 6*r**6*v1**5*v2 - 3*r**6*t*u1**4*v2**2 + ...
Now the point is that in practice the bulk evaluation of this function will be done for fixed values of P,n1,n2,R and r which reduces the set of free variables to only four, and "in theory" the formula with less parameters should be faster.
So the question is: How can I implement this optimization in Python?
I know I can put everything in a string and do some sort of replace,compile and eval like in
formula = formula.replace('r','1').replace('R','2')....
code = compile(formula,'formula-name','eval')
math_func = lambda t,x,y,z: eval(code)
It would be good if some operations (like power) are substituted by their value, for example 18*r**6*t*u1**2*u2**2*v1**2 should become 18*t for r=u1=u2=v1=1. I think compile should do so but in any case I'm not sure. Does compile actually perform this optimization?
My solution speeds up the computation but if I can squeeze it more it will be great. Note: preferable within standard Python (I could try Cython later).
In general I'm interesting in a pythonic way to accomplish my goal maybe with some extra libraries: what is a reasonably good way of doing this? Is my solution a good approach?
EDIT: (To give more context)
The huge expression is the output of a symbolic line integral over an arc of circle. The arc is given in space by the radius r, two ortho-normal vectors (like the x and y axis in a 2D version) n1=(u1,v1,w1),n2=(u2,v2,w2) and the center P=(a,b,c). The rest is the point over which I'm performing the integration X=(x,y,z) and a parameter R for the function I'm integrating.
Sympy and Maple just take ages to compute this, the actual output is from Mathematica.
If you are curious about the formula here it is (pseudo-pseudo-code):
G(u) = P + r*(1-u**2)/(1+u**2)*n1 + r*2*u/(1+u**2)*n2
integral of (1-|X-G(t)|^2/R^2)^3 over t

You could use Sympy:
>>> from sympy import symbols
>>> x,y,z,a,b,c,u1,v1,w1,u2,v2,w2,t,r = symbols("x,y,z,a,b,c,u1,v1,w1,u2,v2,w2,t,r")
>>> r=u1=u2=v1=1
>>> a = 18*r**6*t*u1**2*u2**2*v1**2
>>> a
18*t
Then you can create a Python function like this:
>>> from sympy import lambdify
>>> f = lambdify(t, a)
>>> f(1)
18
And that f function is indeed simply 18*t:
>>> import dis
>>> dis.dis(f)
1 0 LOAD_CONST 1 (18)
3 LOAD_FAST 0 (_Dummy_18)
6 BINARY_MULTIPLY
7 RETURN_VALUE
If you want to compile the resulting code into machine code, you can try a JIT compiler such as Numba, Theano, or Parakeet.

Here's how I would approach this problem:
compile() your function to an AST (Abstract Syntax Tree) instead of a normal bytecode function - see the standard ast module for details.
Traverse the AST, replacing all references to the fixed parameters with their fixed value. There are libraries such as macropy that may be useful for this, I don't have any specific recommendation.
Traverse the AST again, performing whatever optimizations this might enable, such as Mult(1, X) => X. You don't have to worry about operations between two constants, as Python (since 2.6) optimizes that already.
compile() the AST into a normal function. Call it, and hope that the speed was increased by a sufficient amount to justify all the pre-optimization.
Note that Python will never optimize things like 1*X on its own, as it cannot know what type X will be at runtime - it could be an instance of a class that implements the multiplication operation in an arbitrary way, so the result is not necessarily X. Only your knowledge that all the variables are ordinary numbers, obeying the usual rules of arithmetic, makes this optimization valid.

The "right way" to solve a problem like this is one or more of:
Find a more efficient formulation
Symbolically simplify and reduce terms
Use vectorization (e.g. NumPy)
Punt to low-level libraries that are already optimized (e.g. in languages like C or Fortran that implicitly do strong expression optimization, rather than Python, which does nada).
Let's say for a moment, though, that approaches 1, 3, and 4 are not available, and you have to do this in Python. Then simplifying and "hoisting" common subexpressions is your primary tool.
The good news is, there are a lot of opportunities. The expression r**6, for example, is repeated 26 times. You could save 25 computations by simply assigning r_6 = r ** 6 once, then replacing r**6 every time it occurs.
When you start looking for common expressions here, you'll find them everywhere. It'd be nice to mechanize that process, right? In general, that requires a full expression parser (e.g. from the ast module) and is an exponential-time optimization problem. But your expression is a bit of a special case. While long and varied, it's not especially complicated. It has few internal parenthetical groupings, so we can get away with a quicker and dirtier approach.
Before the how, the resulting code is:
sa = r**6 # 26 occurrences
sb = u1**2 # 5 occurrences
sc = u2**2 # 5 occurrences
sd = v1**2 # 5 occurrences
se = u1**4 # 4 occurrences
sf = u2**3 # 3 occurrences
sg = u1**3 # 3 occurrences
sh = v1**4 # 3 occurrences
si = u2**4 # 3 occurrences
sj = v1**3 # 3 occurrences
sk = v2**2 # 1 occurrence
sl = v1**6 # 1 occurrence
sm = v1**5 # 1 occurrence
sn = u1**6 # 1 occurrence
so = u1**5 # 1 occurrence
sp = u2**6 # 1 occurrence
sq = u2**5 # 1 occurrence
sr = 6*sa # 6 occurrences
ss = 3*sa # 5 occurrences
st = ss*t # 5 occurrences
su = 12*sa # 4 occurrences
sv = sa*t # 3 occurrences
sw = v1*v2 # 5 occurrences
sx = sj*v2 # 3 occurrences
sy = 24*sv # 3 occurrences
sz = 15*sv # 2 occurrences
sA = sr*u1 # 2 occurrences
sB = sy*u1 # 2 occurrences
sC = sb*sc # 2 occurrences
sD = st*se # 2 occurrences
# revised formula
sv*sn - sr*so*u2 - sz*se*sc +
20*sa*sg*sf + sz*sb*si - sA*sq -
sv*sp + sD*sd - su*sg*u2*sd -
18*sv*sC*sd + su*u1*sf*sd +
st*si*sd + st*sb*sh - sA*u2*sh -
st*sc*sh + sv*sl - sr*se*sw -
sy*sg*u2*sw + 36*sa*sC*sw +
sB*sf*sw - sr*si*sw -
su*sb*sx - sB*u2*sx +
su*sc*sx - sr*sm*v2 - sD*sk
That avoids 81 computations. It's just a rough cut. Even the result could be further improved. The subexpressions sr*sw and su*sd for example, could be pre-computed as well. But we'll leave that next level for another day.
Note that this doesn't include the starting r*((-16*(. The majority of the simplification can be (and needs to be) done on the core of the expression, not on its outer terms. So I stripped those away for now; they can be added back once the common core is computed.
How do you do this?
f = """
r**6*t*u1**6 - 6*r**6*u1**5*u2 - 15*r**6*t*u1**4*u2**2 +
20*r**6*u1**3*u2**3 + 15*r**6*t*u1**2*u2**4 - 6*r**6*u1*u2**5 -
r**6*t*u2**6 + 3*r**6*t*u1**4*v1**2 - 12*r**6*u1**3*u2*v1**2 -
18*r**6*t*u1**2*u2**2*v1**2 + 12*r**6*u1*u2**3*v1**2 +
3*r**6*t*u2**4*v1**2 + 3*r**6*t*u1**2*v1**4 - 6*r**6*u1*u2*v1**4 -
3*r**6*t*u2**2*v1**4 + r**6*t*v1**6 - 6*r**6*u1**4*v1*v2 -
24*r**6*t*u1**3*u2*v1*v2 + 36*r**6*u1**2*u2**2*v1*v2 +
24*r**6*t*u1*u2**3*v1*v2 - 6*r**6*u2**4*v1*v2 -
12*r**6*u1**2*v1**3*v2 - 24*r**6*t*u1*u2*v1**3*v2 +
12*r**6*u2**2*v1**3*v2 - 6*r**6*v1**5*v2 - 3*r**6*t*u1**4*v2**2
""".strip()
from collections import Counter
import re
expre = re.compile('(?<!\w)\w+\*\*\d+')
multre = re.compile('(?<!\w)\w+\*\w+')
expr_saved = 0
stmts = []
secache = {}
seindex = 0
def subexpr(e):
global seindex
cached = secache.get(e)
if cached:
return cached
base = ord('a') if seindex < 26 else ord('A') - 26
name = 's' + chr(seindex + base)
seindex += 1
secache[e] = name
return name
def hoist(e, flat, c):
"""
Hoist the expression e into name defined by flat.
c is the count of how many times seen in incoming
formula.
"""
global expr_saved
assign = "{} = {}".format(flat, e)
s = "{:30} # {} occurrence{}".format(assign, c, '' if c == 1 else 's')
stmts.append(s)
print "{} needless computations quashed with {}".format(c-1, flat)
expr_saved += c - 1
def common_exp(form):
"""
Replace ALL exponentiation operations with a hoisted
sub-expression.
"""
# find the exponentiation operations
exponents = re.findall(expre, form)
# find and count exponentiation operations
expcount = Counter(re.findall(expre, form))
# for each exponentiation, create a hoisted sub-expression
for e, c in expcount.most_common():
hoist(e, subexpr(e), c)
# replace all exponentiation operations with their sub-expressions
form = re.sub(expre, lambda x: subexpr(x.group(0)), form)
return form
def common_mult(f):
"""
Replace multiplication operations with a hoisted
sub-expression if they occur > 1 time. Also, only
replaces one sub-expression at a time (the most common)
because it may affect further expressions
"""
mults = re.findall(multre, f)
for e, c in Counter(mults).most_common():
# unlike exponents, only replace if >1 occurrence
if c == 1:
return f
# occurs >1 time, so hoist
hoist(e, subexpr(e), c)
# replace in loop and return
return re.sub('(?<!\w)' + re.escape(e), subexpr(e), f)
# return f.replace(e, flat(e))
return f
# fix all exponents
form = common_exp(f)
# fix selected multiplies
prev = form
while True:
form = common_mult(form)
if form == prev:
# have converged; no more replacements possible
break
prev = form
print "--"
mults = re.split(r'\s*[+-]\s*', form)
smults = ['*'.join(sorted(terms.split('*'))) for terms in mults]
print smults
# print the hoisted statements and the revised expression
print '\n'.join(stmts)
print
print "# revised formula"
print form
Parsing with regular expressions is dicey business. That journey is prone to error, sorrow, and regret. I guarded against bad outcomes by hoisting some exponentiations that didn't strictly need to be, and by plugging random values into both the before and after formulas to make sure they both give the same results. I recommend the "punt to C" strategy if this is production code. But if you can't...

compare 2 strings for common substring

i wish to find longest common substring of 2 given strings recursively .i have written this code but it is too inefficient .is there a way i can do it in O(m*n) here m an n are respective lengths of string.here's my code:
def lcs(x,y):
if len(x)==0 or len(y)==0:
return " "
if x[0]==y[0]:
return x[0] + lcs(x[1:],y[1:])
t1 = lcs(x[1:],y)
t2 = lcs(x,y[1:])
if len(t1)>len(t2):
return t1
else:
return t2
x = str(input('enter string1:'))
y = str(input('enter string2:'))
print(lcs(x,y))

You need to memoize your recursion. Without that, you will end up with an exponential number of calls since you will be repeatedly solving the same problem over and over again. To make the memoized lookups more efficient, you can define your recursion in terms of the suffix lengths, instead of the actual suffixes.
You can also find the pseudocode for the DP on Wikipedia.

Here is a naive non-recursive solution which uses the powerset() recipe from itertools:
from itertools import chain, combinations, product
def powerset(iterable):
"powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
s = list(iterable)
return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))
def naive_lcs(a, b):
return ''.join(max(set(powerset(a)) & set(powerset(b)), key=len))
It has problems:
>>> naive_lcs('ab', 'ba')
'b'
>>> naive_lcs('ba', 'ab')
'b'
There can be more than one solution for some pairs of strings, but my program picks one arbitrarily.
Also, since any of the combinations can be the longest common one, and since calculating these combinations takes O(2 ^ n) time, this solution doesn't compute in O(n * m) time. With Dynamic Programming and memoizing OTOH we can find a solution that, in theory, should perform better:
from functools import lru_cache
#lru_cache()
def _dynamic_lcs(xs, ys):
if not (xs and ys):
return set(['']), 0
elif xs[-1] == ys[-1]:
result, rlen = _dynamic_lcs(xs[:-1], ys[:-1])
return set(each + xs[-1] for each in result), rlen + 1
else:
xlcs, xlen = _dynamic_lcs(xs, ys[:-1])
ylcs, ylen = _dynamic_lcs(xs[:-1], ys)
if xlen > ylen:
return xlcs, xlen
elif xlen < ylen:
return ylcs, ylen
else:
return xlcs | ylcs, xlen
def dynamic_lcs(xs, ys):
result, _ = _dynamic_lcs(xs, ys)
return result
if __name__ == '__main__':
seqs = list(powerset('abcde'))
for a, b in product(seqs, repeat=2):
assert naive_lcs(a, b) in dynamic_lcs(a, b)
dynamic_lcs() also solves the problem that some pairs strings can have multiple common longest sub-sequences. The result is the set of these, instead of one string. Finding the set of all common sub-sequences though is still of exponential complexity.
Thanks to Pradhan for reminding me of Dynamic Programming and memoization.

What is the simplest way to swap each pair of adjoining chars in a string with Python?

I want to swap each pair of characters in a string. '2143' becomes '1234', 'badcfe' becomes 'abcdef'.
How can I do this in Python?

oneliner:
>>> s = 'badcfe'
>>> ''.join([ s[x:x+2][::-1] for x in range(0, len(s), 2) ])
'abcdef'
s[x:x+2] returns string slice from x to x+2; it is safe for odd len(s).
[::-1] reverses the string in Python
range(0, len(s), 2) returns 0, 2, 4, 6 ... while x < len(s)

The usual way to swap two items in Python is:
a, b = b, a
So it would seem to me that you would just do the same with an extended slice. However, it is slightly complicated because strings aren't mutable; so you have to convert to a list and then back to a string.
Therefore, I would do the following:
>>> s = 'badcfe'
>>> t = list(s)
>>> t[::2], t[1::2] = t[1::2], t[::2]
>>> ''.join(t)
'abcdef'

Here's one way...
>>> s = '2134'
>>> def swap(c, i, j):
... c = list(c)
... c[i], c[j] = c[j], c[i]
... return ''.join(c)
...
>>> swap(s, 0, 1)
'1234'
>>>

''.join(s[i+1]+s[i] for i in range(0, len(s), 2)) # 10.6 usec per loop
or
''.join(x+y for x, y in zip(s[1::2], s[::2])) # 10.3 usec per loop
or if the string can have an odd length:
''.join(x+y for x, y in itertools.izip_longest(s[1::2], s[::2], fillvalue=''))
Note that this won't work with old versions of Python (if I'm not mistaking older than 2.5).
The benchmark was run on python-2.7-8.fc14.1.x86_64 and a Core 2 Duo 6400 CPU with s='0123456789'*4.

If performance or elegance is not an issue, and you just want clarity and have the job done then simply use this:
def swap(text, ch1, ch2):
text = text.replace(ch2, '!',)
text = text.replace(ch1, ch2)
text = text.replace('!', ch1)
return text
This allows you to swap or simply replace chars or substring.
For example, to swap 'ab' <-> 'de' in a text:
_str = "abcdefabcdefabcdef"
print swap(_str, 'ab','de') #decabfdecabfdecabf

Loop over length of string by twos and swap:
def oddswap(st):
s = list(st)
for c in range(0,len(s),2):
t=s[c]
s[c]=s[c+1]
s[c+1]=t
return "".join(s)
giving:
>>> s
'foobar'
>>> oddswap(s)
'ofbora'
and fails on odd-length strings with an IndexError exception.

There is no need to make a list. The following works for even-length strings:
r = ''
for in in range(0, len(s), 2) :
r += s[i + 1] + s[i]
s = r

A more general answer... you can do any single pairwise swap with tuples or strings using this approach:
# item can be a string or tuple and swap can be a list or tuple of two
# indices to swap
def swap_items_by_copy(item, swap):
s0 = min(swap)
s1 = max(swap)
if isinstance(item,str):
return item[:s0]+item[s1]+item[s0+1:s1]+item[s0]+item[s1+1:]
elif isinstance(item,tuple):
return item[:s0]+(item[s1],)+item[s0+1:s1]+(item[s0],)+item[s1+1:]
else:
raise ValueError("Type not supported")
Then you can invoke it like this:
>>> swap_items_by_copy((1,2,3,4,5,6),(1,2))
(1, 3, 2, 4, 5, 6)
>>> swap_items_by_copy("hello",(1,2))
'hlelo'
>>>
Thankfully python gives empty strings or tuples for the cases where the indices refer to non existent slices.

To swap characters in a string a of position l and r
def swap(a, l, r):
a = a[0:l] + a[r] + a[l+1:r] + a[l] + a[r+1:]
return a
Example:
swap("aaabcccdeee", 3, 7) returns "aaadcccbeee"

Do you want the digits sorted? Or are you swapping odd/even indexed digits? Your example is totally unclear.
Sort:
s = '2143'
p=list(s)
p.sort()
s = "".join(p)
s is now '1234'. The trick is here that list(string) breaks it into characters.

Like so:
>>> s = "2143658709"
>>> ''.join([s[i+1] + s[i] for i in range(0, len(s), 2)])
'1234567890'
>>> s = "badcfe"
>>> ''.join([s[i+1] + s[i] for i in range(0, len(s), 2)])
'abcdef'

re.sub(r'(.)(.)',r"\2\1",'abcdef1234')
However re is a bit slow.
def swap(s):
i=iter(s)
while True:
a,b=next(i),next(i)
yield b
yield a
''.join(swap("abcdef1234"))

One more way:
>>> s='123456'
>>> ''.join([''.join(el) for el in zip(s[1::2], s[0::2])])
'214365'

>>> import ctypes
>>> s = 'abcdef'
>>> mutable = ctypes.create_string_buffer(s)
>>> for i in range(0,len(s),2):
>>> mutable[i], mutable[i+1] = mutable[i+1], mutable[i]
>>> s = mutable.value
>>> print s
badcfe

def revstr(a):
b=''
if len(a)%2==0:
for i in range(0,len(a),2):
b += a[i + 1] + a[i]
a=b
else:
c=a[-1]
for i in range(0,len(a)-1,2):
b += a[i + 1] + a[i]
b=b+a[-1]
a=b
return b
a=raw_input('enter a string')
n=revstr(a)
print n

A bit late to the party, but there is actually a pretty simple way to do this:
The index sequence you are looking for can be expressed as the sum of two sequences:
0 1 2 3 ...
+1 -1 +1 -1 ...
Both are easy to express. The first one is just range(N). A sequence that toggles for each i in that range is i % 2. You can adjust the toggle by scaling and offsetting it:
i % 2 -> 0 1 0 1 ...
1 - i % 2 -> 1 0 1 0 ...
2 * (1 - i % 2) -> 2 0 2 0 ...
2 * (1 - i % 2) - 1 -> +1 -1 +1 -1 ...
The entire expression simplifies to i + 1 - 2 * (i % 2), which you can use to join the string almost directly:
result = ''.join(string[i + 1 - 2 * (i % 2)] for i in range(len(string)))
This will work only for an even-length string, so you can check for overruns using min:
N = len(string)
result = ''.join(string[min(i + 1 - 2 * (i % 2), N - 1)] for i in range(N))
Basically a one-liner, doesn't require any iterators beyond a range over the indices, and some very simple integer math.

While the above solutions do work, there is a very simple solution shall we say in "layman's" terms. Someone still learning python and string's can use the other answers but they don't really understand how they work or what each part of the code is doing without a full explanation by the poster as opposed to "this works". The following executes the swapping of every second character in a string and is easy for beginners to understand how it works.
It is simply iterating through the string (any length) by two's (starting from 0 and finding every second character) and then creating a new string (swapped_pair) by adding the current index + 1 (second character) and then the actual index (first character), e.g., index 1 is put at index 0 and then index 0 is put at index 1 and this repeats through iteration of string.
Also added code to ensure string is of even length as it only works for even length.
DrSanjay Bhakkad post above is also a good one that works for even or odd strings and is basically doing the same function as below.
string = "abcdefghijklmnopqrstuvwxyz123"
# use this prior to below iteration if string needs to be even but is possibly odd
if len(string) % 2 != 0:
string = string[:-1]
# iteration to swap every second character in string
swapped_pair = ""
for i in range(0, len(string), 2):
swapped_pair += (string[i + 1] + string[i])
# use this after above iteration for any even or odd length of strings
if len(swapped_pair) % 2 != 0:
swapped_adj += swapped_pair[-1]
print(swapped_pair)
badcfehgjilknmporqtsvuxwzy21 # output if the "needs to be even" code used
badcfehgjilknmporqtsvuxwzy213 # output if the "even or odd" code used

One of the easiest way to swap first two characters from a String is
inputString = '2134'
extractChar = inputString[0:2]
swapExtractedChar = extractChar[::-1] """Reverse the order of string"""
swapFirstTwoChar = swapExtractedChar + inputString[2:]
# swapFirstTwoChar = inputString[0:2][::-1] + inputString[2:] """For one line code"""
print(swapFirstTwoChar)

#Works on even/odd size strings
str = '2143657'
newStr = ''
for i in range(len(str)//2):
newStr += str[i*2+1] + str[i*2]
if len(str)%2 != 0:
newStr += str[-1]
print(newStr)

#Think about how index works with string in Python,
>>> a = "123456"
>>> a[::-1]
'654321'

Best way to determine if a sequence is in another sequence?

This is a generalization of the "string contains substring" problem to (more) arbitrary types.
Given an sequence (such as a list or tuple), what's the best way of determining whether another sequence is inside it? As a bonus, it should return the index of the element where the subsequence starts:
Example usage (Sequence in Sequence):
>>> seq_in_seq([5,6], [4,'a',3,5,6])
3
>>> seq_in_seq([5,7], [4,'a',3,5,6])
-1 # or None, or whatever
So far, I just rely on brute force and it seems slow, ugly, and clumsy.

I second the Knuth-Morris-Pratt algorithm. By the way, your problem (and the KMP solution) is exactly recipe 5.13 in Python Cookbook 2nd edition. You can find the related code at http://code.activestate.com/recipes/117214/
It finds all the correct subsequences in a given sequence, and should be used as an iterator:
>>> for s in KnuthMorrisPratt([4,'a',3,5,6], [5,6]): print s
3
>>> for s in KnuthMorrisPratt([4,'a',3,5,6], [5,7]): print s
(nothing)

Here's a brute-force approach O(n*m) (similar to #mcella's answer). It might be faster than the Knuth-Morris-Pratt algorithm implementation in pure Python O(n+m) (see #Gregg Lind answer) for small input sequences.
#!/usr/bin/env python
def index(subseq, seq):
"""Return an index of `subseq`uence in the `seq`uence.
Or `-1` if `subseq` is not a subsequence of the `seq`.
The time complexity of the algorithm is O(n*m), where
n, m = len(seq), len(subseq)
>>> index([1,2], range(5))
1
>>> index(range(1, 6), range(5))
-1
>>> index(range(5), range(5))
0
>>> index([1,2], [0, 1, 0, 1, 2])
3
"""
i, n, m = -1, len(seq), len(subseq)
try:
while True:
i = seq.index(subseq[0], i + 1, n - m + 1)
if subseq == seq[i:i + m]:
return i
except ValueError:
return -1
if __name__ == '__main__':
import doctest; doctest.testmod()
I wonder how large is the small in this case?

A simple approach: Convert to strings and rely on string matching.
Example using lists of strings:
>>> f = ["foo", "bar", "baz"]
>>> g = ["foo", "bar"]
>>> ff = str(f).strip("[]")
>>> gg = str(g).strip("[]")
>>> gg in ff
True
Example using tuples of strings:
>>> x = ("foo", "bar", "baz")
>>> y = ("bar", "baz")
>>> xx = str(x).strip("()")
>>> yy = str(y).strip("()")
>>> yy in xx
True
Example using lists of numbers:
>>> f = [1 , 2, 3, 4, 5, 6, 7]
>>> g = [4, 5, 6]
>>> ff = str(f).strip("[]")
>>> gg = str(g).strip("[]")
>>> gg in ff
True

Same thing as string matching sir...Knuth-Morris-Pratt string matching

>>> def seq_in_seq(subseq, seq):
... while subseq[0] in seq:
... index = seq.index(subseq[0])
... if subseq == seq[index:index + len(subseq)]:
... return index
... else:
... seq = seq[index + 1:]
... else:
... return -1
...
>>> seq_in_seq([5,6], [4,'a',3,5,6])
3
>>> seq_in_seq([5,7], [4,'a',3,5,6])
-1
Sorry I'm not an algorithm expert, it's just the fastest thing my mind can think about at the moment, at least I think it looks nice (to me) and I had fun coding it. ;-)
Most probably it's the same thing your brute force approach is doing.

Brute force may be fine for small patterns.
For larger ones, look at the Aho-Corasick algorithm.

Here is another KMP implementation:
from itertools import tee
def seq_in_seq(seq1,seq2):
'''
Return the index where seq1 appears in seq2, or -1 if
seq1 is not in seq2, using the Knuth-Morris-Pratt algorithm
based heavily on code by Neale Pickett <neale#woozle.org>
found at: woozle.org/~neale/src/python/kmp.py
>>> seq_in_seq(range(3),range(5))
0
>>> seq_in_seq(range(3)[-1:],range(5))
2
>>>seq_in_seq(range(6),range(5))
-1
'''
def compute_prefix_function(p):
m = len(p)
pi = [0] * m
k = 0
for q in xrange(1, m):
while k > 0 and p[k] != p[q]:
k = pi[k - 1]
if p[k] == p[q]:
k = k + 1
pi[q] = k
return pi
t,p = list(tee(seq2)[0]), list(tee(seq1)[0])
m,n = len(p),len(t)
pi = compute_prefix_function(p)
q = 0
for i in range(n):
while q > 0 and p[q] != t[i]:
q = pi[q - 1]
if p[q] == t[i]:
q = q + 1
if q == m:
return i - m + 1
return -1

I'm a bit late to the party, but here's something simple using strings:
>>> def seq_in_seq(sub, full):
... f = ''.join([repr(d) for d in full]).replace("'", "")
... s = ''.join([repr(d) for d in sub]).replace("'", "")
... #return f.find(s) #<-- not reliable for finding indices in all cases
... return s in f
...
>>> seq_in_seq([5,6], [4,'a',3,5,6])
True
>>> seq_in_seq([5,7], [4,'a',3,5,6])
False
>>> seq_in_seq([4,'abc',33], [4,'abc',33,5,6])
True
As noted by Ilya V. Schurov, the find method in this case will not return the correct indices with multi-character strings or multi-digit numbers.

For what it's worth, I tried using a deque like so:
from collections import deque
from itertools import islice
def seq_in_seq(needle, haystack):
"""Generator of indices where needle is found in haystack."""
needle = deque(needle)
haystack = iter(haystack) # Works with iterators/streams!
length = len(needle)
# Deque will automatically call deque.popleft() after deque.append()
# with the `maxlen` set equal to the needle length.
window = deque(islice(haystack, length), maxlen=length)
if needle == window:
yield 0 # Match at the start of the haystack.
for index, value in enumerate(haystack, start=1):
window.append(value)
if needle == window:
yield index
One advantage of the deque implementation is that it makes only a single linear pass over the haystack. So if the haystack is streaming then it will still work (unlike the solutions that rely on slicing).
The solution is still brute-force, O(n*m). Some simple local benchmarking showed it was ~100x slower than the C-implementation of string searching in str.index.

Another approach, using sets:
set([5,6])== set([5,6])&set([4,'a',3,5,6])
True

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Adversarial inputs for Python substring search - python

Try this: import re import time def slow_match(n): pat = 'a' + ('z' * n) str = 'z' * (n + n) start_time = time.time() if re.search(pat, str): print("Shouldn't happen") print(("Searched", n, time.time() - start_time)) slow_match(10000) slow_match(50000) slow_match(100000) slow_match(300000)

Related

forcing ndiff on very dissimilar strings

Optimize Python math code for fixed values of variables in function

compare 2 strings for common substring

What is the simplest way to swap each pair of adjoining chars in a string with Python?

Best way to determine if a sequence is in another sequence?

Categories

Resources