Is there an edit-distance implementation in Python that takes accents into account? For example, one that holds the following property:
d('ab', 'ac') > d('àb', 'ab') > 0
With the Levenshtein module:
In [1]: import unicodedata, string

In [2]: from Levenshtein import distance

In [3]: def remove_accents(data):
   ...:     return ''.join(x for x in unicodedata.normalize('NFKD', data)
   ...:                    if x in string.ascii_letters).lower()

In [4]: def norm_dist(s1, s2):
   ...:     norm1, norm2 = remove_accents(s1), remove_accents(s2)
   ...:     d1, d2 = distance(s1, s2), distance(norm1, norm2)
   ...:     return (d1 + d2) / 2.

In [5]: norm_dist(u'ab', u'ac')
Out[5]: 1.0

In [6]: norm_dist(u'àb', u'ab')
Out[6]: 0.5
Unicode allows decomposition of accented characters into the base character plus a combining accent character; e.g. à decomposes into a followed by a combining grave accent.
You want to convert both strings using normalization form NFKD, which decomposes accented characters and converts compatibility characters to their canonical forms, then use an edit distance metric that ranks substitutions above insertions and deletions.
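As a quick illustration of that decomposition (a minimal check, separate from the solution below):

import unicodedata

# 'à' decomposes into the base letter plus a combining grave accent
print([unicodedata.name(c) for c in unicodedata.normalize('NFKD', 'à')])
# ['LATIN SMALL LETTER A', 'COMBINING GRAVE ACCENT']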
Here's a solution based on difflib and unicodedata with no dependencies whatsoever:
import unicodedata
from difflib import Differ

# function taken from https://stackoverflow.com/a/517974/1222951
def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore').decode()
    return only_ascii

def compare(wrong, right):
    # normalize both strings to make sure equivalent (but
    # different) unicode characters are canonicalized
    wrong = unicodedata.normalize('NFKC', wrong)
    right = unicodedata.normalize('NFKC', right)

    num_diffs = 0
    index = 0
    differences = list(Differ().compare(wrong, right))
    while True:
        try:
            diff = differences[index]
        except IndexError:
            break

        # diff is a string like "+ a" (meaning the character "a" was inserted)
        # extract the operation and the character
        op = diff[0]
        char = diff[-1]

        # if the character isn't equal in both
        # strings, increase the difference counter
        if op != ' ':
            num_diffs += 1

            # if a character is wrong, there will be two operations: one
            # "+" and one "-" operation
            # we want to count this as a single mistake, not as two mistakes
            if op in '+-':
                try:
                    next_diff = differences[index+1]
                except IndexError:
                    pass
                else:
                    next_op = next_diff[0]
                    if next_op in '+-' and next_op != op:
                        # skip the next operation, we don't want to count
                        # it as another mistake
                        index += 1

                        # we know that the character is wrong, but
                        # how wrong is it?
                        # if the only difference is the accent, it's
                        # a minor mistake
                        next_char = next_diff[-1]
                        if remove_accents(char) == remove_accents(next_char):
                            num_diffs -= 0.5
        index += 1

    # output the difference as a ratio of
    # (# of wrong characters) / (length of longest input string)
    return num_diffs / max(len(wrong), len(right))
Tests:
for w, r in (('ab', 'ac'),
             ('àb', 'ab'),
             ('être', 'etre'),
             ('très', 'trés'),
             ):
    print('"{}" and "{}": {}% difference'.format(w, r, compare(w, r)*100))
"ab" and "ac": 50.0% difference
"àb" and "ab": 25.0% difference
"être" and "etre": 12.5% difference
"très" and "trés": 12.5% difference
My respects, colleagues.
I need to write a function that determines the maximum number of consecutive BA/CA character pairs in a string.
print(f("BABABA125")) # -> 3
print(f("234CA4BACA")) # -> 2
print(f("BABACABACA56")) # -> 5
print(f("1BABA24CA")) # -> 2
Actually, I've written a function, but, to my mind, it's not very good.
def f(s: str) -> int:
    res = 0
    if not s:
        return res
    cur = 0
    i = len(s) - 1
    while i >= 0:
        if s[i] == "A" and (s[i-1] == "B" or s[i-1] == "C"):
            cur += 1
            i -= 2
        else:
            if cur > res:
                res = cur
            cur = 0
            i -= 1
    else:
        if cur > res:
            res = cur
    return res
In addition, I'm not allowed to use libraries or regular expressions (only string and list methods). Could you please help me or review my code under those constraints? I'll be very grateful.
Here's a function f2 that performs this operation.
if not re.search('(BA|CA)', s): return 0
First check if the string actually contains any BA or CA (to prevent ValueError: max() arg is an empty sequence on step 3), and return 0 if there aren't any.
matches = re.finditer(r'(?:CA|BA)+', s)
Find all consecutive runs of CA or BA. The group is non-capturing so that the pattern as a whole is what repeats; with re.finditer this makes no functional difference (m.group(0) is always the full match), but it keeps the pattern consistent with the re.findall variant below, where a capturing group would return only the group instead of the full match.
res = max(matches, key=lambda m: len(m.group(0)))
Then, among the matches (re.Match objects), fetch the matched substring using m.group(0) and compare their lengths to find the longest one.
return len(res.group(0))//2
Divide the length of the longest result by 2 to get the number of BA or CAs in this substring. Here we use floor division // to coerce the output into an int, since division would normally convert the answer to float.
import re

strings = [
    "BABABA125",            # 3
    "234CA4BACA",           # 2
    "BABACABACA56",         # 5
    "1BABA24CA",            # 2
    "NO_MATCH_TO_BE_FOUND", # 0
]

def f2(s: str):
    if not re.search('(BA|CA)', s): return 0
    matches = re.finditer(r'(?:CA|BA)+', s)
    res = max(matches, key=lambda m: len(m.group(0)))
    return len(res.group(0))//2

for s in strings:
    print(f2(s))
UPDATE: Thanks to @StevenRumbalski for providing a simpler version of the above answer. (I split it into multiple lines for readability.)
def f3(s):
    if not re.search('(BA|CA)', s): return 0
    matches = re.findall(r'(?:CA|BA)+', s)
    max_length = max(map(len, matches))
    return max_length // 2
if not re.search('(BA|CA)', s): return 0
Same as above
matches = re.findall(r'(?:CA|BA)+', s)
Find all consecutive sequences of CA or BA, but each value in matches is a str instead of a re.Match, which is easier to handle.
max_length = max(map(len, matches))
Map each matched substring to its length and find the maximum length among them.
return max_length // 2
Floor divide the length of the longest matching substring by 2 (the length of "BA"/"CA") to get the number of consecutive occurrences of BA or CA in this string.
Here's an alternative implementation without any imports. Do note however that it's quite slow compared to your C-style implementation.
The idea is simple: Transform the input string into a string consisting of only two types of characters c1 and c2, with c1 representing CA or BA, and c2 representing anything else. Then find the longest substring of consecutive c1s.
The implementation is as follows:
Pick a char that is guaranteed not to appear in the input string; here we use + as an example. Then pick a char different from the previous one; here we use -.
Replace each occurrence of CA and BA with a +.
Replace everything else in the string (that is not a +) with a - (this is why + cannot be present in the original input string). Now we have a string consisting purely of +s and -s.
Split the string with - as delimiter, and map each resulting substring to their length.
Return the maximum of these substring lengths.
strings = [
    "BABABA125",            # 3
    "234CA4BACA",           # 2
    "BABACABACA56",         # 5
    "1BABA24CA",            # 2
    "NO_MATCH_TO_BE_FOUND", # 0
]

def f4(string: str):
    string = string.replace("CA", "+")
    string = string.replace("BA", "+")
    string = "".join([(c if c == "+" else "-") for c in string])
    str_list = string.split("-")
    str_lengths = map(len, str_list)
    return max(str_lengths)

for s in strings:
    print(f4(s))
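Since the question rules out imports and regular expressions entirely, here is a sketch of a single forward pass over the string using only indexing and comparisons; f5 is a name chosen here, not taken from the answers above:

def f5(s: str) -> int:
    best = 0
    cur = 0
    i = 0
    while i < len(s) - 1:
        if s[i] in "BC" and s[i + 1] == "A":
            cur += 1            # one more BA/CA pair in the current run
            if cur > best:
                best = cur
            i += 2              # jump past the pair just counted
        else:
            cur = 0             # the run of pairs is broken
            i += 1
    return best

for s, expected in (("BABABA125", 3), ("234CA4BACA", 2),
                    ("BABACABACA56", 5), ("1BABA24CA", 2)):
    print(f5(s), expected)

It makes a single pass with constant extra space, so it should also address the speed concern mentioned above.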
I need to create a string based on a given pattern.
If the pattern is 222243243, the string that needs to be created is "2{4,6}[43]+2{1,3}[43]+".
The logic for creating that string: find every run of 2's in the pattern, count the 2's in each run, and allow from that count up to that count plus two. Here there are two runs of 2's: the first contains four 2's and the second contains one, so the first becomes 2{4,6} (4 to 4+2) and the second 2{1,3} (1 to 1+2). Wherever there are 3's or 4's, [43]+ needs to be added.
workings:
import re

data = '222243243'
TwosStart = []   # contains 2's start positions
TwosEnd = []     # contains 2's end positions
TwoLength = []   # number of 2's in each run
for match in re.finditer('2+', data):
    s = match.start()  # 2's start position
    e = match.end()    # 2's end position
    d = e - s
    print(s, e, d)
    TwosStart.append(s)
    TwosEnd.append(e)
    TwoLength.append(d)
So using the above code I know how many runs of 2's there are in a given pattern and their start and end positions, but I have no idea how to automatically build the string from that information.
Ex:
if pattern '222243243' string should be "2{4,6}[43]+2{1,3}[43]+"
if pattern '222432243' string should be "2{3,5}[43]+2{2,4}[43]+"
if pattern '22432432243' string should be "2{2,4}[43]+2{1,3}[43]+2{2,4}[43]+"
One approach is to use itertools.groupby:
from itertools import groupby

s = "222243243"

result = []
for key, group in groupby(s, key=lambda c: c == "2"):
    if key:
        size = sum(1 for _ in group)
        result.append(f"2{{{size},{size+2}}}")
    else:
        result.append("[43]+")

pattern = "".join(result)
print(pattern)
Output
2{4,6}[43]+2{1,3}[43]+
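For reuse, the same groupby logic can be wrapped in a function and checked against all three example patterns from the question; a sketch (build_pattern is a name chosen here):

from itertools import groupby

def build_pattern(data: str) -> str:
    parts = []
    for is_two, group in groupby(data, key=lambda c: c == "2"):
        if is_two:
            # a run of n 2's becomes 2{n,n+2}
            size = sum(1 for _ in group)
            parts.append(f"2{{{size},{size+2}}}")
        else:
            # anything between the runs of 2's becomes [43]+
            parts.append("[43]+")
    return "".join(parts)

for data, expected in (
    ("222243243", "2{4,6}[43]+2{1,3}[43]+"),
    ("222432243", "2{3,5}[43]+2{2,4}[43]+"),
    ("22432432243", "2{2,4}[43]+2{1,3}[43]+2{2,4}[43]+"),
):
    print(build_pattern(data) == expected)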
Using your base code:
import re

data = '222243243'
cpy = data
offset = 0  # each 'cpy' modification offsets the addition
for match in re.finditer('2+', data):
    s = match.start()  # 2's start position
    e = match.end()    # 2's end position
    d = e - s
    regex = "]+2{" + str(d) + "," + str(d+2) + "}["
    cpy = cpy[:s+offset] + regex + cpy[e+offset:]
    offset += len(regex) - d

# sometimes the borders can have wrong characters
if cpy[0] == ']':
    cpy = cpy[2:]  # remove "]+"
else:
    cpy = '[' + cpy
if cpy[len(cpy)-1] == '[':
    cpy = cpy[:-1]
else:
    cpy += "]+"

print(cpy)
Output
2{4,6}[43]+2{1,3}[43]+
I'm very new to python and I'm trying to work with strings.
I have some data with peptides, for example the test string KGSLADEE. I want to write a function which compares the test string to the reference string AGSTQKP, to see what percentage of the letters in the test string also occur in the reference string. How can I do this? When looking online I can only find code for exact string matches.
For this example:
(1*K) + (1*G) + (1*S) + (1*A) = 4 (letters which are the same)
Divide by 8 (total number of letters in the test string)
(4/8) * 100 = 50%
This yields the same results as the answer from Hoxha Alban, but I find this one a bit easier to read. It uses collections.Counter (see here: https://pymotw.com/2/collections/counter.html).
from collections import Counter

def f(test, ref):
    intersection = Counter(test) & Counter(ref)
    return len(list(intersection.elements())) / len(test) * 100
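As a quick check with the question's strings (assuming Python 3 division):

print(f('KGSLADEE', 'AGSTQKP'))  # 50.0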
Are you searching for something like this?
def f(test, ref):
    d = dict()
    for c in ref:
        if c not in d:
            d[c] = min(ref.count(c), test.count(c))
    return sum(d.values()) / len(test) * 100

f('KGSLADEE', 'AGSTQKP')  # 50%
f('hello', 'h')           # 20%
f('abc', 'cba')           # 100%
f('a', 'aaa')             # 100%
f('aaa', 'a')             # 33.333%
You will need to loop through every letter in your test string. Your loop will go through each letter and check that letter within your reference string and then give you an output.
You can then use this output to calculate your percentage.
word1 = 'KGSLADEE'
word2 = 'AGSTQKP'

same = 0
for letter1 in word1:
    for letter2 in word2:
        if letter1 == letter2:
            same += 1
            break

print(same/len(word1)*100)
It is not a solution for all situations, but you can expand on it.
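For example, here is a case (strings chosen here just for illustration) where the double loop diverges from the Counter/dict answers above, because every repeated letter in the test string gets counted:

word1 = 'aaa'
word2 = 'a'

same = 0
for letter1 in word1:
    for letter2 in word2:
        if letter1 == letter2:
            same += 1
            break

print(same/len(word1)*100)  # 100.0, while the dict-based f('aaa', 'a') above gives 33.333...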
I am looking to be able to recursively remove adjacent letters in a string that differ only in their case, e.g. if s = 'AaBbccDd' I would want to remove Aa, Bb and Dd but leave cc.
I can do this recursively using lists (shown further down), but I think it ought to be possible with a regex, and I am struggling:
with the test string 'fffAaaABbe' the answer should be 'fffe', but the regex I am using gives 'fe'.
import re

def test(line):
    res = re.compile(r'(.)\1{1}', re.IGNORECASE)
    #print(res.search(line))
    while res.search(line):
        line = res.sub('', line, 1)
    print(line)
The list-based version that does work is:
def test(line):
    result = ''
    chr = list(line)
    cnt = 0
    i = len(chr) - 1
    while i > 0:
        if ord(chr[i]) == ord(chr[i - 1]) + 32 or ord(chr[i]) == ord(chr[i - 1]) - 32:
            cnt += 1
            chr.pop(i)
            chr.pop(i - 1)
            i -= 2
        else:
            i -= 1
    if cnt > 0:  # until we can't find any duplicates.
        return test(''.join(chr))
    result = ''.join(chr)
    print(result)
Is it possible to do this using a regex?
re.IGNORECASE is not the way to solve this problem, as it treats aa, Aa, aA and AA all the same way. Technically it is possible using re.sub, in the following way:
import re
txt = 'fffAaaABbe'
after_sub = re.sub(r'Aa|aA|Bb|bB|Cc|cC|Dd|dD|Ee|eE|Ff|fF|Gg|gG|Hh|hH|Ii|iI|Jj|jJ|Kk|kK|Ll|lL|Mm|mM|Nn|nN|Oo|oO|Pp|pP|Qq|qQ|Rr|rR|Ss|sS|Tt|tT|Uu|uU|Vv|vV|Ww|wW|Xx|xX|Yy|yY|Zz|zZ', '', txt)
print(after_sub) # fffe
Note that I explicitly spelled out all possible letter pairs, because as far as I know there is no way to express "the same letter with inverted case" in an re pattern alone. Maybe another user will be able to provide a more concise re-based solution.
I suggest a different approach which uses groupby to group adjacent similar letters:
from itertools import groupby

def test(line):
    res = []
    for k, g in groupby(line, key=lambda x: x.lower()):
        g = list(g)
        if all(x == x.lower() for x in g):
            res.append(''.join(g))
    print(''.join(res))
Sample run:
>>> test('AaBbccDd')
cc
>>> test('fffAaaABbe')
fffe
r'(.)\1{1}' is wrong because it will match any character that is repeated twice, including non-letter characters. If you want to stick to letters, you can't use this.
However, even if we restrict it to letters with r'([A-z])\1{1}', this would still be bad: it only matches the same letter repeated with the same case, so it would catch xx and XX -- and you don't want to match consecutive same characters with matching case, as you said in the original question.
It just so happens that there is no short-hand to do this conveniently, but it is still possible. You could also just write a small function to turn it into a short-hand.
Building on @Daweo's answer, you can generate the regex pattern needed to match pairs of the same letter with non-matching case, giving the final pattern aA|Aa|bB|Bb|cC|Cc|dD|Dd|eE|Ee|fF|Ff|gG|Gg|hH|Hh|iI|Ii|jJ|Jj|kK|Kk|lL|Ll|mM|Mm|nN|Nn|oO|Oo|pP|Pp|qQ|Qq|rR|Rr|sS|Ss|tT|Tt|uU|Uu|vV|Vv|wW|Ww|xX|Xx|yY|Yy|zZ|Zz:
import re
import string

def consecutiveLettersNonMatchingCase():
    # Get all 'xX|Xx' with a list comprehension
    # and join them with '|'
    return '|'.join(['{0}{1}|{1}{0}'.format(s, t)
                     # Iterate through the upper/lowercase characters
                     # in lock-step
                     for s, t in zip(
                         string.ascii_lowercase,
                         string.ascii_uppercase)])

def test(line):
    res = re.compile(consecutiveLettersNonMatchingCase())
    print(res.search(line))
    while res.search(line):
        line = res.sub('', line, 1)
    print(line)

print(consecutiveLettersNonMatchingCase())
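As a quick sanity check, the generated pattern drives the same search/substitute loop on the question's test string and ends at the expected result:

pattern = re.compile(consecutiveLettersNonMatchingCase())
line = 'fffAaaABbe'
while pattern.search(line):
    # remove one case-mismatched pair per iteration
    line = pattern.sub('', line, 1)
print(line)  # fffe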
I'm trying to write a function that takes a string and an integer and removes adjacent duplicates beyond that count, i.e. caps every run of repeated characters at the given length, then outputs the remaining string. I have this function right now that removes all the duplicates in a string, and I'm not sure how to put the integer constraint into it:
def remove_duplicates(string):
    s = set()
    list = []
    for i in string:
        if i not in s:
            s.add(i)
            list.append(i)
    return ''.join(list)

string = "abbbccaaadddd"
print(remove_duplicates(string))
This outputs
abcd
What I would want is a function like
def remove_duplicates(string, int):
    .....
where, if for the same string I pass int=2, I want to keep at most 2 of each run of repeated characters rather than removing them all. The output should be
abbccaadd
I'm also concerned about run time and complexity for very large strings, so if my initial approach is bad, please suggest a different approach. Any help is appreciated!
Not sure I understand your question correctly. I think that, given m repetitions of a character, you want to remove up to k*n duplicates such that k*n < m.
You could try this, using groupby:
>>> from itertools import groupby
>>> string = "abbbccaaadddd"
>>> n = 2
>>> ''.join(c for k, g in groupby(string) for c in k * (len(list(g)) % n or n))
'abccadd'
Here, k * (len(list(g)) % n or n) means len(g) % n repetitions, or n if that number is 0.
Oh, you changed it... now my original answer with my "interpretation" of your output actually works. You can use groupby together with islice to get at most n characters from each group of duplicates.
>>> from itertools import groupby, islice
>>> string = "abbbccaaadddd"
>>> n = 2
>>> ''.join(c for _, g in groupby(string) for c in islice(g, n))
'abbccaadd'
Create groups of letters, but compute the length of each group, capped by your parameter.
Then rebuild the groups and join:
import itertools

def remove_duplicates(string, maxnb):
    groups = ((k, min(len(list(v)), maxnb)) for k, v in itertools.groupby(string))
    return "".join(itertools.chain.from_iterable(v*k for k, v in groups))

string = "abbbccaaadddd"
print(remove_duplicates(string, 2))
can be a one-liner as well (cover your eyes!)
return "".join(itertools.chain.from_iterable(v*k for k,v in ((k,min(len(list(v)),maxnb)) for k,v in itertools.groupby(string))))
I'm not sure about the min(len(list(v)), maxnb) repeat value; it can be adapted to suit your needs with a modulo (like len(list(v)) % maxnb), etc.
You should avoid using int as a variable name, as it shadows the built-in int type.
Here is a vanilla function that does the job:
def deduplicate(string: str, threshold: int) -> str:
    res = ""
    last = ""
    count = 0
    for c in string:
        if c != last:
            count = 1  # the first character of a new run is already kept
            res += c
            last = c
        else:
            if count < threshold:
                res += c
            count += 1
    return res
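A quick check with the question's example:

print(deduplicate("abbbccaaadddd", 2))  # abbccaadd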