Remove adjacent duplicates given a condition - python

I'm trying to write a function that takes a string and an integer n, trims every run of adjacent duplicate characters down to at most n characters, and returns the remaining string. Right now I have this function, which removes all duplicates from a string, and I'm not sure how to add the integer constraint to it:
def remove_duplicates(string):
    s = set()
    list = []
    for i in string:
        if i not in s:
            s.add(i)
            list.append(i)
    return ''.join(list)
string = "abbbccaaadddd"
print(remove_duplicates(string))
This outputs
abc
What I would want is a function like
def remove_duplicates(string, int):
    .....
where, for the same string and int=2, each run of repeated characters is cut down to at most 2 characters instead of being collapsed to a single one. The output should be
abbccaadd
I'm also concerned about run time and complexity for very large strings, so if my initial approach is bad, please suggest a different approach. Any help is appreciated!

Not sure I understand your question correctly. I think that, given m repetitions of a character, you want to remove up to k*n duplicates such that k*n < m.
You could try this, using groupby:
>>> from itertools import groupby
>>> string = "abbbccaaadddd"
>>> n = 2
>>> ''.join(c for k, g in groupby(string) for c in k * (len(list(g)) % n or n))
'abccadd'
Here, k * (len(list(g)) % n or n) means len(g) % n repetitions, or n if that number is 0.
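To see what the % n or n part evaluates to, here is a small sketch that computes the number of kept characters for a few run lengths. The helper name kept is only illustrative and is not part of the answer above.

```python
def kept(run_length, n):
    # Remainder of run_length divided by n; falls back to n when the remainder is 0
    return run_length % n or n

# Run lengths 1 to 4 with n = 2
print([kept(m, 2) for m in range(1, 5)])  # [1, 2, 1, 2]
```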
Oh, you changed it... now my original answer with my "interpretation" of your output actually works. You can use groupby together with islice to get at most n characters from each group of duplicates.
>>> from itertools import groupby, islice
>>> string = "abbbccaaadddd"
>>> n = 2
>>> ''.join(c for _, g in groupby(string) for c in islice(g, n))
'abbccaadd'
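Wrapped in a function, the groupby/islice idea might look like the sketch below. The name limit_runs is an assumption chosen for illustration. It walks the string once, so the work should grow linearly with the length of the input.

```python
from itertools import groupby, islice

def limit_runs(string, n):
    # groupby yields one group per run of equal adjacent characters;
    # islice takes at most n characters from each run
    return ''.join(c for _, g in groupby(string) for c in islice(g, n))

print(limit_runs("abbbccaaadddd", 2))  # abbccaadd
```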

Create groups of letters and compute the length of each group, capped by your parameter.
Then rebuild the groups and join them:
import itertools
def remove_duplicates(string,maxnb):
    groups = ((k,min(len(list(v)),maxnb)) for k,v in itertools.groupby(string))
    return "".join(itertools.chain.from_iterable(v*k for k,v in groups))
string = "abbbccaaadddd"
print(remove_duplicates(string,2))
this prints:
abbccaadd
It can be a one-liner as well (cover your eyes!):
    return "".join(itertools.chain.from_iterable(v*k for k,v in ((k,min(len(list(v)),maxnb)) for k,v in itertools.groupby(string))))
I'm not sure whether min(len(list(v)),maxnb) is the repeat value you want; it can be adapted to suit your needs, for example with a modulo (like len(list(v)) % maxnb).

You should avoid using int as a variable name, as it shadows Python's built-in int type.
Here is a vanilla function that does the job:
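def deduplicate(string: str, threshold: int) -> str:
    res = ""
    last = ""
    count = 0
    for c in string:
        if c != last:
            # New run: keep its first character and start counting at 1
            count = 1
            res += c
            last = c
        elif count < threshold:
            # Same run: keep the character only while under the threshold
            res += c
            count += 1
    return res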
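Since the question mentions very large strings, here is a hedged variant of the same single-pass idea that collects characters in a list and joins once at the end, because repeated string concatenation may be slow in some situations. The names limit_runs_loop and limit are illustrative assumptions.

```python
def limit_runs_loop(string, limit):
    pieces = []   # kept characters, joined once at the end
    last = None   # character of the current run
    count = 0     # characters already kept from the current run
    for c in string:
        if c != last:
            # A new run starts: reset the counter
            last = c
            count = 0
        if count < limit:
            pieces.append(c)
            count += 1
    return ''.join(pieces)

print(limit_runs_loop("abbbccaaadddd", 2))  # abbccaadd
```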

Related

Python detect if string contains specific length substring

I am given a string and need to find the first substring of a given length that consists of a single repeated character.
For example, given the string 'abaadddefggg':
for length = 3 I should get 'ddd',
for length = 2 I should get 'aa', and so on.
Any ideas?
You could iterate over the string's indexes and produce all the substrings of the given length. If any of these substrings is made up of a single character, that's the substring you're looking for:
def sequence(s, length):
    # + 1 so that the substring ending at the last character is also checked
    for i in range(len(s) - length + 1):
        candidate = s[i:i+length]
        if len(set(candidate)) == 1:
            return candidate
One approach in Python 3.8+ using itertools.groupby combined with the walrus operator:
from itertools import groupby
string = 'abaadddefggg'
k = 3
res = next(s for _, group in groupby(string) if len(s := "".join(group)) == k)
print(res)
Output
ddd
An alternative general approach:
from itertools import groupby
def find_substring(string, k):
    for _, group in groupby(string):
        s = "".join(group)
        if len(s) == k:
            return s
res = find_substring('abaadddefggg', 3)
print(res)
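Both groupby versions above only match a run whose length is exactly k, and the next() form raises StopIteration when nothing matches. If a longer run should also count, one possible sketch is the following. The name first_run_at_least is hypothetical, and returning None for "no match" is an assumption.

```python
from itertools import groupby

def first_run_at_least(string, k):
    for char, group in groupby(string):
        # Count the run length without building an intermediate string
        if sum(1 for _ in group) >= k:
            return char * k
    return None  # no run of at least k equal characters

print(first_run_at_least('abaadddefggg', 3))  # ddd
print(first_run_at_least('dddaa', 2))         # dd
```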

finding repeated substring in k length in a string using function

I just started using functions and I'm trying to build one that finds a repeated substring whose length is at least k and returns the result as a tuple that contains a dict.
The keys need to be the substrings and the values how many times each was repeated; the length of the substring is then added to the tuple.
I've only just started and didn't really know how to continue, but this is what I tried:
def longest_repeat(string, K)
    longest = {} ,
    if isinstance(K, int) and isinstance(string, str)
        for sub_str in string:
            if sub_str >= K:
                longest[0][sub_seq] = DNA_seq_slic = []
    a=0
    b=k
    for nuc in range(len(DNA_seq)-k+1):
        DNA_seq_slic.append(DNA_seq[a:b])
        a +=1
        b +=1
    import collections
    for sub_seq in DNA_seq_slic:
        repeated = [item for item, count in collections.Counter(DNA_seq_slic).items() if count > 1]
    repeated_subseq_dict = dict(zip(repeated,[0 for x in range(0,len(repeated))]))
    for key in repeated_subseq_dict:
        repeated_subseq_dict[key] = DNA_seq_slic.count(key)
    return(repeated_subseq_dict)
I'm sorry if it's a little messy; I didn't really have a direction, and I tried to reuse another function I built to solve this, but it didn't really work. I can clarify more if needed.
The output should be something like this:
longest_repeated("ATAATACATAATA", 5)
output: longest = {ATAATA: 2} , 6
Really appreciate any kind of help! Thanks!
You can try the re module:
import re
def longest_repeated(s, k):
    m = re.findall(f"(.{{{k},}})(?=.*\\1)", s)
    if m:
        mx = max(m, key=len)
        return {mx: s.count(mx)}, len(mx)
Some tests:
print(longest_repeated("ATAATACATAATA", 5))
({'ATAATA': 2}, 6)
print(longest_repeated("XXXXXATAATACATAATAXXXXX", 5))
({'ATAATA': 2}, 6)
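As a cross-check without regular expressions, here is a brute-force sketch that counts every substring of a given length with collections.Counter, starting from the longest candidates. Unlike str.count, which counts non-overlapping occurrences, this counts overlapping ones too. The function name and the ({}, 0) return value for "nothing repeated" are assumptions made for illustration, and the approach is not meant for long inputs.

```python
from collections import Counter

def longest_repeated_bruteforce(s, k):
    # Try the longest candidate length first, down to k
    for length in range(len(s) - 1, k - 1, -1):
        counts = Counter(s[i:i + length] for i in range(len(s) - length + 1))
        repeated = {sub: n for sub, n in counts.items() if n > 1}
        if repeated:
            return repeated, length
    return {}, 0  # no substring of length >= k occurs twice

print(longest_repeated_bruteforce("ATAATACATAATA", 5))  # ({'ATAATA': 2}, 6)
```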

Format a large integer with commas without using .format()

I'm trying to format any number by inserting ',' every 3 digits from the end, without using format():
123456789 becomes 123,456,789
1000000 becomes 1,000,000
What I have so far only seems to go from the start, I've tried different ideas to get it to reverse but they seem to not work as I hoped.
def format_number(number):
    s = [x for x in str(number)]
    for a in s[::3]:
        if s.index(a) is not 0:
            s.insert(s.index(a), ',')
    return ''.join(s)
print(format_number(1123456789))
>> 112,345,678,9
But obviously what I want is 1,123,456,789
I tried reversing the range [:-1:3] but I get 112,345,6789
Clarification: I don't want to use format to structure the number, I'd prefer to understand how to do it myself just for self-study's sake.
Here is a solution for you, without using format():
def format_number(number):
    s = list(str(number))[::-1]
    o = ''
    for a in range(len(s)):
        if a and a % 3 == 0:
            o += ','
        o += s[a]
    return o[::-1]
print(format_number(1123456789))
And here is the same solution using built-in functions:
def format_number(number):
    return '{:,}'.format(number)
print(format_number(1123456789))
I hope this helps. :D
One way to do it without built-in functions at all...
def format_number(number):
    i = 0
    r = ""
    while True:
        r = "0123456789"[number % 10] + r
        number //= 10
        if number == 0:
            return r
        i += 1
        if i % 3 == 0:
            r = "," + r
Here's a version that's almost free of built-in functions or methods (it does still have to use str)
def format_number(number):
    i = 0
    r = ""
    for character in str(number)[::-1]:
        if i > 0 and i % 3 == 0:
            r = "," + r
        r = character + r
        i += 1
    return r
Another way to do it without format but with other built-ins is to reverse the number, split it into chunks of 3, join them with a comma, and reverse it again.
def format_number(number):
    backward = str(number)[::-1]
    r = ",".join(backward[i:i+3] for i in range(0, len(backward), 3))
    return r[::-1]
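The versions above are written with non-negative integers in mind. As a hedged sketch of one way to also cover negative integers, the digits can be split into groups of three with divmod. The name format_number_signed is an assumption, not something from the answers.

```python
def format_number_signed(number):
    sign = "-" if number < 0 else ""
    number = abs(number)
    groups = []
    while True:
        number, rest = divmod(number, 1000)
        if number == 0:
            # Leading group: no zero padding
            groups.append(str(rest))
            break
        # Inner groups are always three digits wide
        groups.append(str(rest).zfill(3))
    return sign + ",".join(reversed(groups))

print(format_number_signed(1123456789))  # 1,123,456,789
print(format_number_signed(-6789))       # -6,789
```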
Your current approach has the following drawbacks:
checking for equality/inequality in most cases (especially for int) should be done using the ==/!= operators, not is/is not,
list.index returns the first occurrence from the left end (so s.index('1') will always be 0 in your example); we can iterate over a range of indices instead (using the range built-in).
We can have something like
def format_number(number):
    s = [x for x in str(number)]
    for index in range(len(s) - 3, 0, -3):
        s.insert(index, ',')
    return ''.join(s)
Test
>>> format_number(1123456789)
'1,123,456,789'
>>> format_number(6789)
'6,789'
>>> format_number(135)
'135'
If range, list.insert and str.join are not allowed
We can replace
range with while loop,
list.insert using slicing and concatenation,
str.join with concatenation,
like
def format_number(number):
    s = [x for x in str(number)]
    index = len(s) - 3
    while index > 0:
        s = s[:index] + [','] + s[index:]
        index -= 3
    result = ''
    for character in s:
        result += character
    return result
Using str.format
Finally, according to the docs:
The ',' option signals the use of a comma for a thousands separator. For a locale aware separator, use the 'n' integer presentation type instead.
your function can be simplified to
def format_number(number):
    return '{:,}'.format(number)
and it will even work for floats.
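A quick sketch of the ',' option on a few inputs, including a float with and without an explicit precision. The sample values are arbitrary.

```python
as_float = '{:,}'.format(1234567.5)
two_decimals = '{:,.2f}'.format(1234567.891)
negative = '{:,}'.format(-6789)

print(as_float)      # 1,234,567.5
print(two_decimals)  # 1,234,567.89
print(negative)      # -6,789
```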

Find values in list which differ from reference list by up to N characters

I have a list like the following:
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
And a reference list like this:
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
I want to extract the values from Test if they are N or fewer characters different from any one of the items in Ref.
For example, if N = 1, only the first two elements of Test should be output. If N = 2, all three elements fit this criterion and should be returned.
It should be noted that I am looking for values of the same character length (ASDFGY -> ASDFG matching doesn't work for N = 1), so I want something more efficient than Levenshtein distance.
I have over 1000 values in Ref and a couple hundred million in Test, so efficiency is key.
Using a generator expression with sum:
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
def comparer(x, y, n):
    return (len(x) == len(y)) and (sum(i != j for i, j in zip(x, y)) <= n)
res = [a for a, b in zip(Ref, Test) if comparer(a, b, 1)]
print(res)
['ASDFGY', 'QWERTYI']
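The snippet above compares the two lists pairwise with zip and reports the Ref items. If the goal is to keep each Test value that is within N substitutions of any Ref value, as the question describes, one possible sketch is the following. The names within_n and close_to_any are illustrative assumptions; the early exit stops counting as soon as the limit is exceeded.

```python
def within_n(x, y, n):
    # Only strings of equal length can match
    if len(x) != len(y):
        return False
    mismatches = 0
    for a, b in zip(x, y):
        if a != b:
            mismatches += 1
            if mismatches > n:
                return False  # stop early once the limit is exceeded
    return True

def close_to_any(test, ref, n):
    # Keep every test value that is close enough to at least one reference value
    return [t for t in test if any(within_n(t, r, n) for r in ref)]

Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
print(close_to_any(Test, Ref, 1))  # ['ASDFGH', 'QWERTYU']
```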
Using difflib
Demo:
import difflib
N = 1
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
result = []
for i,v in zip(Test, Ref):
    c = 0
    for j,s in enumerate(difflib.ndiff(i, v)):
        if s.startswith("-"):
            c += 1
    if c <= N:
        result.append( i )
print(result)
Output:
['ASDFGH', 'QWERTYU']
The newer regex module offers a "fuzzy" match possibility:
import regex as re
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA', 'ASDFGI', 'ASDFGX']
for item in Test:
    rx = re.compile('(' + item + '){s<=3}')
    for r in Ref:
        if rx.search(r):
            print(rf'{item} is similar to {r}')
This yields
ASDFGH is similar to ASDFGY
ASDFGH is similar to ASDFGI
ASDFGH is similar to ASDFGX
QWERTYU is similar to QWERTYI
ZXCVB is similar to ZXCAA
You can control it via the {s<=3} part, which allows three or fewer substitutions.
To have pairs, you could write
pairs = [(origin, difference)
         for origin in Test
         for rx in [re.compile(rf"({origin}){{s<=3}}")]
         for difference in Ref
         if rx.search(difference)]
Which would yield for
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA', 'ASDFGI', 'ASDFGX']
the following output:
[('ASDFGH', 'ASDFGY'), ('ASDFGH', 'ASDFGI'),
('ASDFGH', 'ASDFGX'), ('QWERTYU', 'QWERTYI'),
('ZXCVB', 'ZXCAA')]

Repeating characters results in wrong repetition counts

My function looks like this:
def accum(s):
    a = []
    for i in s:
        b = s.index(i)
        a.append(i * (b+1))
    x = "-".join(a)
    return x.title()
with the expected input of:
'abcd'
the output should be and is:
'A-Bb-Ccc-Dddd'
but if the input has a recurring character:
'abccba'
it returns:
'A-Bb-Ccc-Ccc-Bb-A'
instead of:
'A-Bb-Ccc-Cccc-Bbbbb-Aaaaaa'
how can I fix this?
Don't use str.index(), it'll return the first match. Since c and b and a appear early in the string you get 2, 1 and 0 back regardless of the position of the current letter.
Use the enumerate() function to give you position counter instead:
for i, letter in enumerate(s, 1):
    a.append(i * letter)
The second argument is the starting value; setting this to 1 means you can avoid having to + 1 later on. See What does enumerate mean? if you need more details on what enumerate() does.
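To make the difference concrete, this small sketch compares what str.index() reports for each letter of 'abccba' with the positions that enumerate() produces. The variable names are only illustrative.

```python
s = "abccba"

# str.index() always reports the first occurrence of each letter
first_positions = [s.index(ch) for ch in s]

# enumerate() reports the actual position of each letter
actual_positions = [i for i, _ in enumerate(s)]

print(first_positions)   # [0, 1, 2, 2, 1, 0]
print(actual_positions)  # [0, 1, 2, 3, 4, 5]
```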
You can use a list comprehension here rather than use list.append() calls:
def accum(s):
    a = [i * letter for i, letter in enumerate(s, 1)]
    x = "-".join(a)
    return x.title()
which could, at a pinch, be turned into a one-liner:
def accum(s):
    return '-'.join([i * c for i, c in enumerate(s, 1)]).title()
This is because s.index(i) returns the first index of the character. You can use enumerate to pair elements with their indices.
Here is a Pythonic solution:
def accum(s):
    return "-".join(c*(i+1) for i, c in enumerate(s)).title()
simple:
def accum(s):
    a = []
    for i in range(len(s)):
        a.append(s[i]*(i+1))
    x = "-".join(a)
    return x.title()
