I would like to know if there is a library that will tell me approximately how similar two strings are.
I am not looking for anything exact, but in this case:
a = 'alex is a buff dude'
b = 'a;exx is a buff dud'
we could say that b and a are approximately 90% similar.
Is there a library which can do this?
>>> import difflib
>>> a = 'alex is a buff dude'
>>> b = 'a;exx is a buff dud'
>>> difflib.SequenceMatcher(None, a, b).ratio()
0.89473684210526316
Look at the Levenshtein distance: http://en.wikipedia.org/wiki/Levenshtein_distance
There are a few libraries on PyPI that implement it, but be aware that computing it is expensive, especially for longer strings.
You may also want to check out Python's difflib: http://docs.python.org/library/difflib.html
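difflib can also pick the closest candidates out of a list via get_close_matches, if that is the eventual goal; a quick illustration with the question's strings:

>>> import difflib
>>> difflib.get_close_matches('a;exx is a buff dud', ['alex is a buff dude', 'something else'])
['alex is a buff dude']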
Look into the Levenshtein algorithm for comparing strings. Here's an implementation found via Google: http://hetland.org/coding/python/levenshtein.py
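If you'd rather not rely on an external link, here is a minimal sketch of the textbook dynamic-programming recurrence (my own code, not the linked implementation):

def levenshtein(a, b):
    # prev holds the previous row of the edit-distance table
    prev = range(len(b) + 1)
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

print levenshtein('alex is a buff dude', 'a;exx is a buff dud')  # 3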
Another way is to use the longest common subsequence. Here is my LCS implementation from a DaniWeb post (similar functionality also exists in difflib).
Here is a simple length-only version using a list as the data structure:
def longest_common_sequence(a, b):
    n1 = len(a)
    n2 = len(b)
    # previous holds the sliding row of the DP table
    previous = []
    for i in range(n2):
        previous.append(0)
    over = 0
    for ch1 in a:
        left = corner = 0
        for ch2 in b:
            over = previous.pop(0)
            if ch1 == ch2:
                this = corner + 1          # extend the common subsequence
            else:
                this = over if over >= left else left
            previous.append(this)
            left, corner = this, over
    # similarity as a percentage: 200 * LCS / (len(a) + len(b))
    return 200.0 * previous.pop() / (n1 + n2)
Here is my second version, which actually recovers the common string, using a deque as the data structure (shown with the example data):
from collections import deque

a = 'alex is a buff dude'
b = 'a;exx is a buff dud'

def lcs_tuple(a, b):
    n1 = len(a)
    n2 = len(b)
    # each cell carries (length, subsequence-so-far)
    previous = deque()
    for i in range(n2):
        previous.append((0, ''))
    over = (0, '')
    for i in range(n1):
        left = corner = (0, '')
        for j in range(n2):
            over = previous.popleft()
            if a[i] == b[j]:
                this = corner[0] + 1, corner[1] + a[i]
            else:
                this = max(over, left)
            previous.append(this)
            left, corner = this, over
    return 200.0 * this[0] / (n1 + n2), this[1]

print lcs_tuple(a, b)
""" Output:
(89.47368421052632, 'aex is a buff dud')
"""
I'm trying to change some elements of a list based on the properties of previous ones. Because I need to assign an intermediate variable, I don't think this can be done as a list comprehension. The following code, with comment, is what I'm trying to achieve:
for H in header:
    if "lower" in H.lower():
        pref = "lower"
    elif "higher" in H.lower():
        pref = "higher"
    if header.count(H) > 1:
        # change H in place
        H = pref + H
The best solution I've come up with is:
for ii, H in enumerate(header):
    if "lower" in H.lower():
        pref = "lower"
    elif "higher" in H.lower():
        pref = "higher"
    if header.count(H) > 1:
        header[ii] = pref + H
It doesn't quite work, and feels un-pythonic to me because of the indexing. Is there a better way to do this?
Concrete example:
header = ['LowerLevel','Term','J','UpperLevel','Term','J']
desired output:
header = ['LowerLevel','LowerTerm','LowerJ','UpperLevel','UpperTerm','UpperJ']
Note that neither of my solutions works: the former never modifies header at all, and the latter only returns
header = ['LowerLevel','LowerTerm','LowerJ','UpperLevel','Term','J']
because count is wrong after the modifications.
header = ['LowerLevel','Term','J','UpperLevel','Term','J']
prefixes = ['lower', 'upper']

def prefixed(header):
    prefix = ''
    for h in header:
        for p in prefixes:
            if h.lower().startswith(p):
                prefix, h = h[:len(p)], h[len(p):]
        yield prefix + h

print list(prefixed(header))
I don't really know that this is better than what you had. It's different...
$ ./lower.py
['LowerLevel', 'LowerTerm', 'LowerJ', 'UpperLevel', 'UpperTerm', 'UpperJ']
Something like this, using a generator function:
In [62]: def func(lis):
    ...:     pref = ""
    ...:     for x in lis:
    ...:         if "lower" in x.lower():
    ...:             pref = "Lower"
    ...:         elif "upper" in x.lower():
    ...:             pref = "Upper"
    ...:         if header.count(x) > 1:
    ...:             yield pref + x
    ...:         else:
    ...:             yield x
    ...:
In [63]: list(func(header))
Out[63]: ['LowerLevel', 'LowerTerm', 'LowerJ', 'UpperLevel', 'UpperTerm', 'UpperJ']
This should work for the data you presented.
from collections import defaultdict

def find_dups(seq):
    '''Finds duplicates in a sequence and returns a dict
    of value:occurrences'''
    seen = defaultdict(int)
    for curr in seq:
        seen[curr] += 1
    d = dict([(i, seen[i]) for i in seen if seen[i] > 1])
    return d

if __name__ == '__main__':
    header = ['LowerLevel','Term','J','UpperLevel','Term','J']
    d = find_dups(header)
    for i, s in enumerate(header):
        if s in d:
            if d[s] % 2:
                pref = 'Upper'
            else:
                pref = 'Lower'
            header[i] = pref + s
            d[s] -= 1
But it gives me the creeps to suggest anything, knowing only a little about the entire set of data you will be working with.
good luck,
Mike
This is a follow-up question to this response and the pseudocode algorithm that the user posted. I didn't comment on that question because of its age. I am only interested in validating whether or not a string can be split into words; the algorithm doesn't need to actually split the string. This is the response from the linked question:
Let S[1..length(w)] be a table with Boolean entries. S[i] is true if
the word w[1..i] can be split. Then set S[1] = isWord(w[1]) and for
i = 2 to length(w) calculate
S[i] = isWord(w[1..i]) or (for any j in {2..i}: S[j-1] and isWord(w[j..i])).
I'm translating this algorithm into simple Python code, but I'm not sure if I'm understanding it properly. Code:
def is_all_words(a_string, dictionary):
    str_len = len(a_string)
    S = [False] * str_len
    S[0] = is_word(a_string[0], dictionary)
    for i in range(1, str_len):
        check = is_word(a_string[0:i], dictionary)
        if (check):
            S[i] = check
        else:
            for j in range(1, str_len):
                check = (S[j - 1] and is_word(a_string[j:i]), dictionary)
                if (check):
                    S[i] == True
                    break
    return S
I have two related questions. 1) Is this code a proper translation of the linked algorithm into Python, and if it is, 2) Now that I have S, how do I use it to tell if the string is only comprised of words? In this case, is_word is a function that simply looks a given word up in a list. I haven't implemented it as a trie yet.
UPDATE: After updating the code to include the suggested change, it doesn't work. This is the updated code:
def is_all_words(a_string, dictionary):
    str_len = len(a_string)
    S = [False] * str_len
    S[0] = is_word(a_string[0], dictionary)
    for i in range(1, str_len):
        check = is_word(a_string[0:i], dictionary)
        if (check):
            S[i] = check
        else:
            for j in range(1, i):  # THIS LINE WAS UPDATED
                check = (S[j - 1] and is_word(a_string[j:i]), dictionary)
                if (check):
                    S[i] == True
                    break
    return S
a_string = "carrotforever"
S = is_all_words(a_string, dictionary)
print(S[len(S) - 1]) #prints FALSE
a_string = "hello"
S = is_all_words(a_string, dictionary)
print(S[len(S) - 1]) #prints TRUE
It should return True for both of these.
Here is a modified version of your code that should return correct results.
Notice that your mistake was simply in the translation from pseudocode array indexing (starting at 1) to Python array indexing (starting at 0): S[0] and S[1] were populated with the same value, while S[str_len-1] was never actually computed. You can easily trace this mistake by printing the whole S table; you will find that S[3] is set true in the first example where it should be S[2], for the word "car".
Also you could speed up the process by storing the index of composite words found so far, instead of testing each position.
def is_all_words(a_string, dictionary):
    str_len = len(a_string)
    S = [False] * str_len
    # I replaced the is_word function by a simple list lookup;
    # feel free to replace it with whatever function you use.
    # Tries or suffix trees are best for this.
    S[0] = (a_string[0] in dictionary)
    for i in range(1, str_len):
        check = a_string[0:i+1] in dictionary  # i+1 instead of i
        if (check):
            S[i] = check
        else:
            for j in range(0, i+1):  # i+1 instead of i
                if (S[j-1] and (a_string[j:i+1] in dictionary)):  # i+1 instead of i
                    S[i] = True
                    break
    return S
a_string = "carrotforever"
S = is_all_words(a_string, ["a","car","carrot","for","eve","forever"])
print(S[len(a_string)-1]) #prints TRUE
a_string = "helloworld"
S = is_all_words(a_string, ["hello","world"])
print(S[len(a_string)-1]) #prints TRUE
For a real-world example of how to do English word segmentation, look at the source of the Python wordsegment module. It's a little more sophisticated because it uses word and phrase frequency tables but it illustrates the recursive approach. By modifying the score function you can prioritize longer matches.
Installation is easy with pip:
$ pip install wordsegment
And segment returns a list of words:
>>> import wordsegment
>>> wordsegment.segment('carrotforever')
['carrot', 'forever']
1) At first glance it looks good. One thing: for j in range(1, str_len): should be for j in range(1, i):, I think.
2) If S[str_len-1] == True, then the whole string should consist of whole words only.
After all, S[i] is true iff
the whole string from 0 to i consists of a single dictionary word,
OR there exists an S[j-1] == True with j < i such that string[j:i] is a single dictionary word;
so if S[str_len-1] is true, then the whole string is composed of dictionary words.
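For reference, here is a compact restatement of the whole DP with the off-by-one issues avoided by indexing prefixes rather than characters (a sketch assuming a simple set-based dictionary; the function name is mine, not the question's is_word):

def can_split(s, words):
    n = len(s)
    # S[i] is True when the prefix s[:i] can be segmented into words;
    # the empty prefix is trivially splittable.
    S = [False] * (n + 1)
    S[0] = True
    for i in range(1, n + 1):
        for j in range(i):
            if S[j] and s[j:i] in words:
                S[i] = True
                break
    return S[n]

print can_split("carrotforever", {"car", "carrot", "for", "eve", "forever"})  # True
print can_split("hello", {"hello"})  # True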
If I have an input string and an array:
s = "to_be_or_not_to_be"
pos = [15, 2, 8]
I am trying to find the longest common prefix between the consecutive elements of the array pos referencing the original s. I am trying to get the following output:
longest = [3,1]
The way I obtained this is by computing the longest common prefix of the following pairs:
s[15:] which is _be and s[2:] which is _be_or_not_to_be giving 3 ( _be )
s[2:] which is _be_or_not_to_be and s[8:] which is _not_to_be giving 1 ( _ )
However, if s is huge, I don't want to create multiple copies when I do something like s[x:]. After hours of searching, I found the function buffer that maintains only one copy of the input string but I wasn't sure what is the most efficient way to utilize it here in this context. Any suggestions on how to achieve this?
Here is a method without buffer which doesn't copy, as it only looks at one character at a time:
from itertools import islice, izip

s = "to_be_or_not_to_be"
pos = [15, 2, 8]
length = len(s)

for start1, start2 in izip(pos, islice(pos, 1, None)):
    pref = 0
    for pos1, pos2 in izip(xrange(start1, length), xrange(start2, length)):
        if s[pos1] == s[pos2]:
            pref += 1
        else:
            break
    print pref

# prints 3 1
I use islice, izip, and xrange in case you're talking about potentially very long strings.
I also couldn't resist this "One Liner" which doesn't even require any indexing:
[next((i for i, (a, b) in
enumerate(izip(islice(s, start1, None), islice(s, start2, None)))
if a != b),
length - max((start1, start2)))
for start1, start2 in izip(pos, islice(pos, 1, None))]
One final method, using os.path.commonprefix together with buffer views:
from os.path import commonprefix
[len(commonprefix((buffer(s, n), buffer(s, m)))) for n, m in zip(pos, pos[1:])]
>>> import os
>>> os.path.commonprefix([s[i:] for i in pos])
'_'
Let Python manage memory for you. Don't optimize prematurely.
To get the exact output you could do (as #agf suggested):
print [len(commonprefix([buffer(s, i) for i in adj_indexes]))
       for adj_indexes in zip(pos, pos[1:])]
# -> [3, 1]
I think your worrying about copies is unfounded. See below:
>>> s = "how long is a piece of string...?"
>>> t = s[12:]
>>> print t
a piece of string...?
>>> id(t[0])
23295440
>>> id(s[12])
23295440
>>> id(t[2:20]) == id(s[14:32])
True
Unless you're copying the slices and leaving references to the copies hanging around, I wouldn't think it could cause any problem.
edit: There are technical details with string interning and stuff that I'm not really clear on myself. But I'm sure that a string slice is not always a copy:
>>> x = 'google.com'
>>> y = x[:]
>>> x is y
True
I guess the answer I'm trying to give is to just let python manage its memory itself, to begin with, you can look at memory buffers and views later if needed. And if this is already a real problem occurring for you, update your question with details of what the actual problem is.
One way of doing this using buffer is given below. However, there could be much faster ways.
s = "to_be_or_not_to_be"
pos = [15, 2, 8]
lcp = []
length = len(pos) - 1
for index in range(0, length):
    pre = buffer(s, pos[index])
    cur = buffer(s, pos[index+1], pos[index+1] + len(pre))
    count = 0
    shorter, longer = min(pre, cur), max(pre, cur)
    for i, c in enumerate(shorter):
        if c != longer[i]:
            break
        else:
            count += 1
    lcp.append(count)

print
print lcp
Okay, basically what I want is to compress a file by reusing code and then, at runtime, replace the missing code. What I've come up with is really ugly and slow, but at least it works. The problem is that the file has no specific structure, for example 'aGVsbG8=\n'; as you can see, it's base64-encoded. My function is really slow because the file is 1700+ characters long and it checks for patterns one character at a time. Please help me with better code, or at least help me optimize what I have :). Anything that helps is welcome! BTW, I have already tried compression libraries, but they didn't compress as well as my ugly function.
def c_long(inp, cap=False, b=5):
    import re, string
    if cap is False: cap = len(inp)
    es = re.escape; le = len; ref = re.findall; ran = range; fi = string.find
    c = b; inpc = inp; pattern = inpc[:b]; l = []
    rep = string.replace; ins = list.insert
    while True:
        if c == le(inpc) and le(inpc) > b+1: c = b; inpc = inpc[1:]; pattern = inpc[:b]
        elif le(inpc) <= b: break
        if c == cap: c = b; inpc = inpc[1:]; pattern = inpc[:b]
        p = ref(es(pattern), inp)
        pattern += inpc[c]
        if le(p) > 1 and le(pattern) >= b+1:
            if l == []: l = [[pattern, le(p)+le(pattern)]]
            elif le(ref(es(inpc[:c+2]), inp)) + le(inpc[:c+2]) < le(p)+le(pattern):
                x = [pattern, le(p)+le(inpc[:c+1])]
                for i in ran(le(l)):
                    if x[1] >= l[i][1] and x[0][:-1] not in l[i][0]: ins(l, i, x); break
                    elif x[1] >= l[i][1] and x[0][:-1] in l[i][0]: l[i] = x; break
                inpc = inpc[:fi(inpc, x[0])] + inpc[le(x[0]):]
                pattern = inpc[:b]
                c = b-1
        c += 1
    d = {}; c = 0
    s = ran(le(l))
    for x in l: inp = rep(inp, x[0], '{%d}' % s[c]); d[str(s[c])] = x[0]; c += 1
    return [inp, d]
def decompress(inp,l): return apply(inp.format, [l[str(x)] for x in sorted([int(x) for x in l.keys()])])
The easiest way to compress base64-encoded data is to first convert it to binary data -- this will already save 25 percent of the storage space:
>>> s = "YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXo=\n"
>>> t = s.decode("base64")
>>> len(s)
37
>>> len(t)
26
In most cases, you can compress the string even further using some compression algorithm, like t.encode("bz2") or t.encode("zlib").
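A minimal sketch of that pipeline (Python 2 codecs; the sample string is tiny, so zlib's header overhead only pays off on longer inputs):

b64_text = "YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXo=\n"
raw = b64_text.decode("base64")       # 26 raw bytes instead of 37 base64 chars
packed = raw.encode("zlib")           # usually smaller still on real data
assert packed.decode("zlib") == raw   # lossless round trip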
A few remarks on your code: There are lots of factors that make the code hard to read: inconsistent spacing, overly long lines, meaningless variable names, unidiomatic code, etc. An example: Your decompress() function could be equivalently written as
def decompress(compressed_string, substitutions):
    subst_list = [substitutions[k] for k in sorted(substitutions, key=int)]
    return compressed_string.format(*subst_list)
Now it's already much more obvious what it does. You could go one step further: Why is substitutions a dictionary with the string keys "0", "1" etc.? Not only is it strange to use strings instead of integers -- you don't need the keys at all! A simple list will do, and decompress() will simplify to
def decompress(compressed_string, substitutions):
    return compressed_string.format(*substitutions)
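For instance, with made-up data:

>>> decompress('say {0} {0} {1}', ['hello', 'world'])
'say hello hello world'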
You might think all this is secondary, but if you make the rest of your code equally readable, you will find the bugs in your code yourself. (There are bugs -- it crashes for "abcdefgabcdefg" and many other strings.)
Typically one would pump the program through a compression algorithm optimized for text, then run that through exec, e.g.
code="""..."""
exec(somelib.decompress(code), globals=???, locals=???)
It may be the case that .pyc/.pyo files are compressed already, and one could check by creating one with x="""aaaaaaaa""", then increasing the length to x="""aaaaaaaaaaaaaaaaaaaaaaa...aaaa""" and seeing if the size changes appreciably.
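A quick sketch of that check (writes probe.py and probe.pyc to the current directory; filenames are my own choice, and Python 3 would put the .pyc under __pycache__ instead):

import os
import py_compile

for n in (10, 100000):
    with open("probe.py", "w") as f:
        f.write('x = """%s"""\n' % ("a" * n))
    py_compile.compile("probe.py")
    # if the size grows roughly with n, .pyc does no compression
    print n, os.path.getsize("probe.pyc")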
For example, the function could be something like def RandABCD(n, .25, .34, .25, .25), where n is the length of the string to be generated and the following numbers are the desired probabilities of A, B, C, and D.
I would imagine this is quite simple; however, I am having trouble creating a working program. Any help would be greatly appreciated.
Here's the code to select a single weighted value. You should be able to take it from here. It uses bisect and random to accomplish the work.
from bisect import bisect
from random import random
def WeightedABCD(*weights):
    chars = 'ABCD'
    breakpoints = [sum(weights[:x+1]) for x in range(4)]
    return chars[bisect(breakpoints, random())]
Call it like this: WeightedABCD(.25, .34, .25, .25).
EDIT: Here is a version that works even if the weights don't add up to 1.0:
from bisect import bisect_left
from random import uniform
def WeightedABCD(*weights):
    chars = 'ABCD'
    breakpoints = [sum(weights[:x+1]) for x in range(4)]
    return chars[bisect_left(breakpoints, uniform(0.0, breakpoints[-1]))]
The random module is quite powerful in Python. You can generate a list with the characters desired at the appropriate weights and then use random.choice to obtain a selection.
First, make sure you do an import random.
For example, let's say you wanted a truly random string from A,B,C, or D.
1. Generate a list with the characters
li = ['A','B','C','D']
Then obtain values from it using random.choice
output = "".join([random.choice(li) for i in range(0, n)])
You could easily make that a function with n as a parameter.
In the above case, you have an equal chance of getting A,B,C, or D.
You can use duplicate entries in the list to give characters higher probabilities. So, for example, let's say you wanted a 50% chance of A and 25% chances of B and C respectively. You could have an array like this:
li = ['A','A','B','C']
And so on.
It would not be hard to parameterize the characters coming in with desired weights; to model that, I'd use a dictionary.
characterbasis = {'A':25, 'B':25, 'C':25, 'D':25}
Make that the first parameter, and the second being the length of the string and use the above code to generate your string.
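A sketch of that dictionary-driven version, assuming integer weights (the function name and the weight scale are my own choices):

import random

def weighted_string(weights, n):
    # build a pool where each character appears `weight` times,
    # then draw from it uniformly
    pool = ''.join(ch * w for ch, w in weights.items())
    return ''.join(random.choice(pool) for _ in range(n))

characterbasis = {'A': 25, 'B': 25, 'C': 25, 'D': 25}
print weighted_string(characterbasis, 10)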
For four letters, here's something quick off the top of my head:
from random import random
def randABCD(n, pA, pB, pC, pD):
    # assumes pA + pB + pC + pD == 1
    cA = pA
    cB = cA + pB
    cC = cB + pC
    def choose():
        r = random()
        if r < cA:
            return 'A'
        elif r < cB:
            return 'B'
        elif r < cC:
            return 'C'
        else:
            return 'D'
    return ''.join([choose() for i in xrange(n)])
I have no doubt that this can be made much cleaner/shorter, I'm just in a bit of a hurry right now.
The reason I wouldn't be content with David in Dakota's answer of using a list of duplicate characters is that depending on your probabilities, it may not be possible to create a list with duplicates in the right numbers to simulate the probabilities you want. (Well, I guess it might always be possible, but you might wind up needing a huge list - what if your probabilities were 0.11235442079, 0.4072777384, 0.2297927874, 0.25057505341?)
EDIT: here's a much cleaner generic version that works with any number of letters with any weights:
from bisect import bisect
from random import uniform
def rand_string(n, content):
    ''' Creates a string of letters (or substrings) chosen independently
    with specified probabilities. content is a dictionary mapping
    a substring to its "weight" which is proportional to its probability,
    and n is the desired number of elements in the string.
    This does not assume the sum of the weights is 1.'''
    l, cdf = zip(*[(l, w) for l, w in content.iteritems()])
    cdf = list(cdf)
    for i in xrange(1, len(cdf)):
        cdf[i] += cdf[i - 1]
    return ''.join([l[bisect(cdf, uniform(0, cdf[-1]))] for i in xrange(n)])
Here is a rough idea of what might suit you:
import random

def distributed_choice(probs):
    # draw once; the draw must not shadow the module name
    x = random.random()
    cum = 0.0
    for pair in probs:
        if x < cum + pair[1]:
            return pair[0]
        cum += pair[1]
The parameter probs takes a list of pairs of the form (object, probability). It is assumed that the sum of probabilities is 1 (otherwise, it's trivial to normalize).
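If they don't sum to 1, a hypothetical normalization step could look like this:

total = float(sum(p for _, p in probs))
probs = [(obj, p / total) for obj, p in probs]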
To use it, just execute:
''.join([distributed_choice(probs) for i in range(4)])
Hmm, something like:
import random

class RandomDistribution:
    def __init__(self, kv):
        self.entries = kv.keys()
        self.where = []
        cnt = 0
        for x in self.entries:
            self.where.append(cnt)
            cnt += kv[x]
        self.where.append(cnt)

    def find(self, key):
        l, r = 0, len(self.where) - 1
        while l + 1 < r:
            m = (l + r) / 2
            if self.where[m] <= key:
                l = m
            else:
                r = m
        return self.entries[l]

    def randomselect(self):
        return self.find(random.random() * self.where[-1])

rd = RandomDistribution({"foo": 5.5, "bar": 3.14, "baz": 2.8})
for x in range(1000):
    print rd.randomselect()
should get you most of the way...
Thank you all for your help, I was able to figure something out, mostly with this info.
For my particular need, I did something like this:
import random

# Create a function to randomize a given string
def makerandom(seq):
    return ''.join(random.sample(seq, len(seq)))

def randomDNA(n, probA=0.25, probC=0.25, probG=0.25, probT=0.25):
    notrandom = ''
    A = int(n * probA)
    C = int(n * probC)
    T = int(n * probT)
    G = int(n * probG)
    # The remainder part here is used to make sure all n are used, as one
    # cannot have half an A, for example.
    remainder = ''
    for i in range(0, n - (A + G + C + T)):
        remainder += random.choice("ATGC")
    notrandom = notrandom + 'A' * A + 'C' * C + 'G' * G + 'T' * T + remainder
    return makerandom(notrandom)
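Example call (the output is shuffled, so this is just one possible result):

>>> print randomDNA(10, probA=0.4, probC=0.2, probG=0.2, probT=0.2)
AGATCATAGC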