How to remove multiple consecutive sequences of consecutive duplicate characters in python


I am trying to preprocess some tweets for an ML project, and I am having trouble with two types of strings, e.g.
str1 = "coooool" and str2 = "gooooaaaaaal".
After removing the repeated characters, I would like to keep the double letter in str1, i.e.
cleaned_str1 = "cool", while cleaned_str2 = "goal".
I tried a few approaches that I found but I couldn't get the right output. Could someone help me with this? Thank you in advance.

Use regular expressions:
import re
re.sub(r"(\w)\1+(\w)\2+", r"\1\2", "goooaaaal") # -> goal
re.sub(r"(\w)\1+(\w)\2+", r"\1\2", "coooool") # -> cool

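def removeDuplicates(S):
    # S must be a list of characters: str does not support item assignment.
    n = len(S)
    j = 0
    if (n < 2) :
        return S  # nothing to collapse
    for i in range(n):
        if (S[j] != S[i]):
            j += 1
            S[j] = S[i]
    j += 1
    S = S[:j]
    return S
This was taken directly from Geeks for Geeks. It collapses every run of repeated characters to a single character.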
There is no way for a program to intuitively know that "cool" needs two "o's" as in your example.
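One way around that limitation is to let a word list make the decision. The following is only a sketch under that assumption: KNOWN_WORDS is a made-up stand-in for a real vocabulary, and squeeze_candidates and clean are hypothetical helper names. Each run of repeated letters is tried at length 1 and 2, and the first spelling found in the vocabulary wins.

```python
import itertools
import re

# Hypothetical vocabulary; in practice this would be a real word list.
KNOWN_WORDS = {"cool", "goal"}

def squeeze_candidates(word):
    """Yield every spelling in which each repeated run is cut to 1 or 2 letters."""
    runs = [(ch, len(list(group))) for ch, group in itertools.groupby(word)]
    options = [(ch,) if n == 1 else (ch, ch * 2) for ch, n in runs]
    for combo in itertools.product(*options):
        yield "".join(combo)

def clean(word, vocabulary=KNOWN_WORDS):
    """Return the first candidate found in the vocabulary.

    Falls back to collapsing every run to a single letter when nothing matches.
    """
    for candidate in squeeze_candidates(word):
        if candidate in vocabulary:
            return candidate
    return re.sub(r"(.)\1+", r"\1", word)

print(clean("coooool"))       # cool
print(clean("gooooaaaaaal"))  # goal
```

The number of candidates doubles with every repeated run, so this is only reasonable for short tokens such as single words.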

Related

Find Longest Alphabetically Ordered Substring - Efficiently

The goal of a piece of code I wrote is to find the longest alphabetically ordered substring within a string.
"""
Find longest alphabetically ordered substring in string s.
"""
s = 'zabcabcd' # Test string.
alphabetical_str, temp_str = s[0], s[0]
for i in range(len(s) - 1): # Loop through string.
    if s[i] <= s[i + 1]: # Check if next character is alphabetically next.
        temp_str += s[i + 1] # Add character to temporary string.
        if len(temp_str) > len(alphabetical_str): # Check if temporary string is the longest string.
            alphabetical_str = temp_str # Assign longest string.
    else:
        temp_str = s[i + 1] # Assign last checked character to temporary string.
print(alphabetical_str)
I get an output of abcd.
But the instructor says there is a PEP 8 compliant way of writing this code that is 7-8 lines long, and a more computationally efficient way that is ~16 lines. Also, that there is a way of writing this code in only 1 line of 75 characters!
Can anyone provide some insight into what the code would look like in 7-8 lines, or what the most work-appropriate way of writing it would be? Any PEP 8 compliance critique would also be appreciated.

Linear time:
s = 'zabcabcd'
longest = current = []
for c in s:
    if [c] < current[-1:]:
        current = []
    current += c
    longest = max(longest, current, key=len)
print(''.join(longest))
The PEP 8 issues I see in your code:
"Limit all lines to a maximum of 79 characters." (link) - You have two lines longer than that.
"do not rely on CPython’s efficient implementation of in-place string concatenation for statements in the form a += b [...] the ''.join() form should be used instead" (link). Your code does that repeated string concatenation.
Also, yours crashes if the input string is empty.
1 line 72 characters:
s='zabcabcd';print(max([t:='']+[t:=t*(c>=t[-1:])+c for c in s],key=len))
Optimized linear time (I might add benchmarks tomorrow):
def Kelly_fast(s):
    maxstart = maxlength = start = length = 0
    prev = ''
    for c in s:
        if c >= prev:
            length += 1
        else:
            if length > maxlength:
                maxstart = start
                maxlength = length
            start += length
            length = 1
        prev = c
    if length > maxlength:
        maxstart = start
        maxlength = length
    return s[maxstart : maxstart+maxlength]

Depending on how you choose to count, this is only 6-7 lines and PEP 8 compliant:
def longest_alphabetical_substring(s):
    sub = ''  # best substring found so far
    for i in range(len(s)):
        j = i + len(sub) + 1
        while list(s[i:j]) == sorted(s[i:j]) and j <= len(s):
            sub, j = s[i:j], j+1
    return sub
print(longest_alphabetical_substring('zabcabcd'))
Your own code was PEP 8 compliant as far as I can tell, although it would make sense to capture code like this in a function, for easy reuse and logical grouping for improved readability.
The solution I provided here is not very efficient, as it keeps extracting copies of the best result so far. A slightly longer solution that avoids this:
def longest_alphabetical_substring(s):
    n = m = 0
    for i in range(len(s)):
        for j in range(i+1, len(s)+1):
            if j == len(s) or s[j] < s[j-1]:
                if j-i > m-n:
                    n, m = i, j
                break
    return s[n:m]
print(longest_alphabetical_substring('zabcabcd'))
There may be more efficient ways of doing this; for example you could detect that there's no need to keep looking because there is not enough room left in the string to find longer strings, and exit the outer loop sooner.
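As a rough sketch of that early-exit idea (the function name and variable names are made up for illustration, and this is not the code from the answer above), the outer loop can stop as soon as the remaining characters cannot beat the best run found so far:

```python
def longest_alphabetical_substring_early_exit(s):
    """Return the first longest non-decreasing substring of s."""
    best_start = best_end = 0
    for i in range(len(s)):
        # The characters left from i onward cannot beat the best run: stop.
        if len(s) - i <= best_end - best_start:
            break
        j = i + 1
        # Extend the run while the characters stay in non-decreasing order.
        while j < len(s) and s[j] >= s[j - 1]:
            j += 1
        if j - i > best_end - best_start:
            best_start, best_end = i, j
    return s[best_start:best_end]

print(longest_alphabetical_substring_early_exit('zabcabcd'))  # abcd
```

This still restarts the scan at every index, so it only trims the tail of the work; it does not make the approach linear.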
User #kellybundy is correct, a truly efficient solution would be linear in time. Something like:
def las_efficient(s):
    t = s[0]
    return max([(t := c) if c < t[-1] else (t := t + c) for c in s[1:]], key=len)
print(las_efficient('zabcabcd'))
No points for readability here, but PEP 8 otherwise, and very brief.
And for an even more efficient solution:
def las_very_efficient(s):
    m, lm, t, ls = '', 0, s[0], len(s)
    for n, c in enumerate(s[1:]):
        if c < t[-1]:
            t = c
        else:
            t += c
            if len(t) > lm:
                m, lm = t, len(t)
        if n + lm > ls:
            break
    return m

You can keep appending characters from the input string to a candidate list, but clear the list when the current character is lexicographically smaller than the last character in the list, and set the candidate list as the output list if it's longer than the current output list. Join the list into a string for the final output:
s = 'zabcabcdabc'
candidate = longest = []
for c in s:
    if candidate and c < candidate[-1]:
        candidate = []
    candidate.append(c)
    if len(candidate) > len(longest):
        longest = candidate
print(''.join(longest))
This outputs:
abcd

How to find the longest repeating sequence using python

I went through an interview where they asked me to print the longest repeated character sequence.
I got stuck. Is there any way to get it?
My code only prints the count of each character present in the string. Is there an approach that gets the expected output?
import pandas as pd
import collections
a = 'abcxyzaaaabbbbbbb'
lst = collections.Counter(a)
df = pd.Series(lst)
df
Expected output :
bbbbbbb
How can I add this logic to the code above?

A regex solution:
import re
max(re.split(r'((.)\2*)', a), key=len)
Or without library help (but less efficient):
s = ''
max((s := s * (c in s) + c for c in a), key=len)
Both compute the string 'bbbbbbb'.
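For comparison, here is a sketch of a different technique than the two above: itertools.groupby splits the string into runs of equal characters in a single pass, and max picks the longest one. The function name longest_run is made up for illustration.

```python
from itertools import groupby

def longest_run(text):
    """Return the longest run of one repeated character (first one on ties)."""
    runs = ("".join(group) for _, group in groupby(text))
    return max(runs, key=len, default="")

print(longest_run("abcxyzaaaabbbbbbb"))  # bbbbbbb
```

The default="" argument only matters for empty input, where max would otherwise raise a ValueError.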

Without any modules, you could use a comprehension to go backward through possible sizes and get the first character multiplication that is present in the string:
next(c*s for s in range(len(a),0,-1) for c in a if c*s in a)
That's quite bad in terms of efficiency, though.
Another approach would be to detect the positions of letter changes and take the longest subrange between them:
chg = [i for i,(x,y) in enumerate(zip(a,a[1:]),1) if x!=y]
s,e = max(zip([0]+chg,chg+[len(a)]),key=lambda se:se[1]-se[0])
longest = a[s:e]
Of course a basic for-loop solution will also work:
si,sc = 0,"" # current streak (start, character)
ls,le = 0,0 # longest streak (start, end)
for i,c in enumerate(a+" "): # extra space to force out last char.
    if i-si > le-ls: ls,le = si,i # new longest
    if sc != c: si,sc = i,c # new streak
longest = a[ls:le]
print(longest) # bbbbbbb

A more long winded solution, picked wholesale from:
maximum-consecutive-repeating-character-string
def maxRepeating(str):
    len_s = len(str)
    count = 0
    # Find the maximum repeating
    # character starting from str[i]
    res = str[0]
    for i in range(len_s):
        cur_count = 1
        for j in range(i + 1, len_s):
            if (str[i] != str[j]):
                break
            cur_count += 1
        # Update result if required
        if cur_count > count :
            count = cur_count
            res = str[i]
    return res, count
# Driver code
if __name__ == "__main__":
    str = "abcxyzaaaabbbbbbb"
    print(maxRepeating(str))
Solution:
('b', 7)
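The nested loops above rescan each run from every starting index. A single pass that tracks the current run is enough; the following is a sketch of that idea, not the code from the linked page. The name max_repeating_linear and the (None, 0) return value for empty input are assumptions made for this example.

```python
def max_repeating_linear(text):
    """Return (character, length) of the longest run, scanning text once."""
    if not text:
        return None, 0  # assumption: empty input has no run
    best_char, best_count = text[0], 1
    run_char, run_count = text[0], 1
    for ch in text[1:]:
        if ch == run_char:
            run_count += 1                # the current run continues
        else:
            run_char, run_count = ch, 1   # a new run starts here
        if run_count > best_count:
            best_char, best_count = run_char, run_count
    return best_char, best_count

print(max_repeating_linear("abcxyzaaaabbbbbbb"))  # ('b', 7)
```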

Generate tailor made thousand delimiter

I want to tailor-make a thousand delimiter in Python. I am generating HTML and want to use &nbsp; as the thousand separator. (It would look like: 1 000 000)
So far I have found the following way to add a , as a separator:
>>> '{0:,}'.format(1000000)
'1,000,000'
But I don't seem to be able to use a similar construction to get another delimiter. '{0:|}'.format(1000000), for example, does not work. Is there an easy way to use anything (e.g., &nbsp;) as a thousand separator?

Well, you can always do this:
'{0:,}'.format(1000000).replace(',', '|')
Result: '1|000|000'
Here's a simple algorithm for the same. The previous version of it (ThSep two revisions back) didn't handle long separators like &nbsp;:
def ThSep(num, sep = ','):
    num = int(num)
    if not num:
        return '0'
    ret = ''  # built in reverse, flipped at the end
    dig = 0
    neg = False
    if num < 0:
        num = -num
        neg = True
    while num != 0:
        dig += 1
        ret += str(num % 10)
        # Integer division (//) keeps num an int on Python 3 as well.
        if (dig == 3) and (num // 10):
            for ch in reversed(sep):
                ret += ch
            dig = 0
        num //= 10
    if neg:
        ret += '-'
    return ''.join(reversed(ret))
Call it with ThSep(1000000, '&nbsp;') or ThSep(1000000, '|') to get the result you want.
It's about 4 times slower than the first method, though, so if speed matters a lot you could try rewriting it as a C extension for production code. In my test, I converted 2 000 000 negative and positive numbers in half a minute.

There is no built-in way to do this, but you can use str.replace if the number is the only value present in the string:
>>> '{0:,}'.format(1000000).replace(',','|')
'1|000|000'
This is mentioned in PEP 378
The proposal works well with floats, ints, and decimals. It also allows easy substitution for other separators. For example:
format(n, "6,d").replace(",", "_")
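Wrapped up as a helper, the format-then-replace approach might look like the sketch below. The name format_thousands and its default separator are made up for illustration; the only library behaviour relied on is the "," format specifier and str.replace.

```python
def format_thousands(number, sep="&nbsp;"):
    """Format number with sep between each group of three digits."""
    # Let str.format insert commas, then swap them for the custom separator.
    return "{:,}".format(number).replace(",", sep)

print(format_thousands(1000000))        # 1&nbsp;000&nbsp;000
print(format_thousands(-1234567, "|"))  # -1|234|567
```

Because the replacement runs on the formatted number only, it is safe as long as the number is formatted on its own and not as part of a longer string that may contain other commas.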

python pattern count Find a “hidden message” in the replication origin

The question asks us to find a “hidden message” in the replication origin.
Input: A string Text (representing the replication origin of a genome).
Output: A hidden message in Text.
Translate to computational language,
Input: Strings Text and Pattern.
Output: Count(Text, Pattern).
For example,
Count(ACAACTATGCATACTATCGGGAACTATCCT, ACTAT) = 3.
In theory, we should account for overlapping occurrences of Pattern in Text, right? So one way to do it is to scan from the first element up to the length of the text minus the length of the pattern we are looking for?
Here's the pseudocode I came up with:
def PatternCount(Text, Pattern):
    count = 0
    for i = 0 to len(Text)-len(Pattern):
        if Text(i, len(Pattern)) = Pattern:
            count = count + 1
    return count
Any suggestions? I'm new to Python. Thanks in advance.

This is what I came up with:
def pattern_count(text, pattern):
    count = 0
    for i in range(0, len(text) - len(pattern) + 1):
        if text[i : len(pattern) + i] == pattern:
            count += 1
    return count
We're using string slicing (text[i : len(pattern) + i]) to check if the sub-string matches the pattern.
Input: text = "abc123!##654abcabc" and pattern = "abc"
Output: 3
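As a quick check against the example from the question, here is a minimal sketch of the same sliding-window idea written with str.startswith. The name count_overlapping and the choice to return 0 for an empty pattern are assumptions made for this example.

```python
def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    if not pattern:
        return 0  # assumption: an empty pattern counts as zero matches
    # True counts as 1, so summing the booleans gives the number of matches.
    return sum(text.startswith(pattern, i)
               for i in range(len(text) - len(pattern) + 1))

print(count_overlapping("ACAACTATGCATACTATCGGGAACTATCCT", "ACTAT"))  # 3
```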

import re
print(len(re.findall("abc", "abc123!##654abcabc")))

I think a more "pythonic" solution would be to use list comprehensions.
def pattern_count(text, pattern):
    return len([x for x in range(len(text) - len(pattern)+1) if pattern in text[x:len(pattern)+x]])

String count with overlapping occurrences [closed]

What's the best way to count the number of occurrences of a given string, including overlap in Python? This is one way:
def function(string, str_to_search_for):
    count = 0
    for x in range(len(string) - len(str_to_search_for) + 1):
        if string[x:x+len(str_to_search_for)] == str_to_search_for:
            count += 1
    return count
function('1011101111','11')
This method returns 5.
Is there a better way in Python?

Well, this might be faster since it does the comparing in C:
def occurrences(string, sub):
    count = start = 0
    while True:
        start = string.find(sub, start) + 1
        if start > 0:
            count+=1
        else:
            return count

>>> import re
>>> text = '1011101111'
>>> len(re.findall('(?=11)', text))
5
If you don't want to load the whole list of matches into memory (which would never be a problem!), you could do this instead:
>>> sum(1 for _ in re.finditer('(?=11)', text))
5
As a function (re.escape makes sure the substring doesn't interfere with the regex):
def occurrences(text, sub):
    return len(re.findall('(?={0})'.format(re.escape(sub)), text))
>>> occurrences(text, '11')
5

You can also try using the new Python regex module, which supports overlapping matches.
import regex as re
def count_overlapping(text, search_for):
    return len(re.findall(search_for, text, overlapped=True))
count_overlapping('1011101111','11') # 5

Python's str.count counts non-overlapping substrings:
In [3]: "ababa".count("aba")
Out[3]: 1
Here are a few ways to count overlapping sequences, I'm sure there are many more :)
Look-ahead regular expressions
How to find overlapping matches with a regexp?
In [10]: re.findall("a(?=ba)", "ababa")
Out[10]: ['a', 'a']
Generate all substrings
In [11]: data = "ababa"
In [17]: sum(1 for i in range(len(data)) if data.startswith("aba", i))
Out[17]: 2

def count_substring(string, sub_string):
    count = 0
    for pos in range(len(string)):
        if string[pos:].startswith(sub_string):
            count += 1
    return count
This could be the easiest way.

A fairly pythonic way would be to use list comprehension here, although it probably wouldn't be the most efficient.
sequence = 'abaaadcaaaa'
substr = 'aa'
counts = sum([
    sequence.startswith(substr, i) for i in range(len(sequence))
])
print(counts) # 5
The list would be [False, False, True, True, False, False, False, True, True, True, False] as it checks every index in the string, and because int(True) == 1, sum gives us the total number of matches.

s = "bobobob"
sub = "bob"
ln = len(sub)
print(sum(sub == s[i:i+ln] for i in range(len(s)-(ln-1))))

How to find a pattern in another string with overlapping
This function (another solution!) receives a pattern and a text. It returns a list of all the matching substrings in the text and their positions.
import re
def occurrences(pattern, text):
    """
    input: search a pattern (regular expression) in a text
    returns: a list of substrings and their positions
    """
    p = re.compile('(?=({0}))'.format(pattern))
    matches = re.finditer(p, text)
    return [(match.group(1), match.start()) for match in matches]
print (occurrences('ana', 'banana'))
print (occurrences('.ana', 'Banana-fana fo-fana'))
[('ana', 1), ('ana', 3)]
[('Bana', 0), ('nana', 2), ('fana', 7), ('fana', 15)]

My answer to the bob question on the course:
s = 'azcbobobegghaklbob'
total = 0
for i in range(len(s)-2):
    if s[i:i+3] == 'bob':
        total += 1
print('number of times bob occurs is:', total)

Here is my edX MIT "find bob"* solution (*find the number of "bob" occurrences in a string named s), which basically counts overlapping occurrences of a given substring:
s = 'azcbobobegghakl'
count = 0
while 'bob' in s:
    count += 1
    s = s[(s.find('bob') + 2):]
print("Number of times bob occurs is: {}".format(count))

If strings are large, you want to use Rabin-Karp, in summary:
a rolling window of substring size, moving over a string
a hash with O(1) overhead for adding and removing (i.e. move by 1 char)
implemented in C or relying on pypy
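To make the rolling-hash idea concrete, here is a pure-Python sketch of Rabin-Karp that counts overlapping matches. It is an illustration only: the function name and the base and modulus values are arbitrary choices for this example, and in CPython it will not beat str.find, which is why the answer suggests C or PyPy.

```python
def rabin_karp_count(text, pattern, base=256, modulus=1_000_000_007):
    """Count occurrences of pattern in text (overlaps included) with a rolling hash."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return 0
    high = pow(base, m - 1, modulus)  # weight of the window's leading character
    pattern_hash = window_hash = 0
    for i in range(m):
        pattern_hash = (pattern_hash * base + ord(pattern[i])) % modulus
        window_hash = (window_hash * base + ord(text[i])) % modulus
    count = 0
    for start in range(n - m + 1):
        # Compare the actual characters only when the hashes agree.
        if window_hash == pattern_hash and text[start:start + m] == pattern:
            count += 1
        if start + m < n:
            # Slide the window in O(1): drop text[start], append text[start + m].
            window_hash = (window_hash - ord(text[start]) * high) % modulus
            window_hash = (window_hash * base + ord(text[start + m])) % modulus
    return count

print(rabin_karp_count('1011101111', '11'))  # 5
```

The explicit slice comparison guards against hash collisions, so the count stays exact whatever base and modulus are chosen.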

That can be solved using regex.
import re
def function(string, sub_string):
    match = re.findall('(?='+sub_string+')',string)
    return len(match)
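One caveat with building the pattern by plain concatenation: characters such as . or + in the substring are treated as regex syntax. A hedged variant that escapes the substring first could look like this (the name count_overlapping_literal is made up for illustration):

```python
import re

def count_overlapping_literal(text, sub):
    """Count overlapping occurrences of sub, treating it as literal text."""
    # re.escape neutralises regex metacharacters in the substring.
    lookahead = "(?=" + re.escape(sub) + ")"
    return sum(1 for _ in re.finditer(lookahead, text))

print(count_overlapping_literal("1011101111", "11"))  # 5
print(count_overlapping_literal("a.b.c", "."))        # 2
```

Without the escape, the second call would count every character, because an unescaped . matches anything.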

def count_substring(string, sub_string):
    counter = 0
    for i in range(len(string)):
        if string[i:].startswith(sub_string):
            counter = counter + 1
    return counter
Above code simply loops throughout the string once and keeps checking if any string is starting with the particular substring that is being counted.

re.subn hasn't been mentioned yet:
>>> import re
>>> re.subn('(?=11)', '', '1011101111')[1]
5

def count_overlaps(string, look_for):
    start = 0
    matches = 0
    while True:
        start = string.find(look_for, start)
        if start < 0:
            break
        start += 1
        matches += 1
    return matches
print(count_overlaps('abrabra', 'abra'))

This function takes two strings as input and counts how many times sub occurs in string, including overlaps. To check whether sub is a substring, I used the in operator.
def count_Occurrences(string, sub):
    count=0
    for i in range(0, len(string)-len(sub)+1):
        if sub in string[i:i+len(sub)]:
            count=count+1
    print('Number of times sub occurs in string (including overlaps):', count)

For a duplicate question, I decided to step through the string three characters at a time and compare each chunk, e.g.
counted = 0
for i in range(len(string)):
    if string[i*3:(i+1)*3] == 'xox':
        counted = counted + 1
print(counted)

An alternative very close to the accepted answer but using while as the if test instead of including if inside the loop:
def countSubstr(string, sub):
    count = 0
    while sub in string:
        count += 1
        string = string[string.find(sub) + 1:]
    return count
This avoids while True: and is a little cleaner in my opinion

This is another example of using str.find() but a lot of the answers make it more complicated than necessary:
def occurrences(text, sub):
    c, n = 0, text.find(sub)
    while n != -1:
        c += 1
        n = text.find(sub, n+1)
    return c
In []:
occurrences('1011101111', '11')
Out[]:
5

Given
sequence = '1011101111'
sub = "11"
Code
In this particular case:
sum(x == tuple(sub) for x in zip(sequence, sequence[1:]))
# 5
More generally, this
windows = zip(*([sequence[i:] for i, _ in enumerate(sequence)][:len(sub)]))
sum(x == tuple(sub) for x in windows)
# 5
or extend to generators:
import itertools as it
iter_ = (sequence[i:] for i, _ in enumerate(sequence))
windows = zip(*(it.islice(iter_, None, len(sub))))
sum(x == tuple(sub) for x in windows)
Alternative
You can use more_itertools.locate:
import more_itertools as mit
len(list(mit.locate(sequence, pred=lambda *args: args == tuple(sub), window_size=len(sub))))
# 5

A simple way to count substring occurrence is to use count():
>>> s = 'bobob'
>>> s.count('bob')
1
You can use replace() to find overlapping strings if you know which part will overlap:
>>> s = 'bobob'
>>> s.replace('b', 'bb').count('bob')
2
Note that besides being static, there are other limitations:
>>> s = 'aaa'
>>> s.count('aa') # there must be two occurrences
1
>>> s.replace('a', 'aa').count('aa')
3

def occurance_of_pattern(text, pattern):
    text_len , pattern_len = len(text), len(pattern)
    return sum(1 for idx in range(text_len - pattern_len + 1) if text[idx: idx+pattern_len] == pattern)

I wanted to check whether the number of repeated prefix characters in the input equals the number of repeated suffix characters, e.g., "foo" and """foo"" but fail on """bar"":
from collections import deque
from functools import partial
from itertools import count, takewhile
from operator import eq
# From https://stackoverflow.com/a/15112059
def count_iter_items(iterable):
    """
    Consume an iterable not reading it into memory; return the number of items.
    :param iterable: An iterable
    :type iterable: ```Iterable```
    :return: Number of items in iterable
    :rtype: ```int```
    """
    counter = count()
    deque(zip(iterable, counter), maxlen=0)
    return next(counter)
def begin_matches_end(s):
    """
    Checks if the begin matches the end of the string
    :param s: Input string of length > 0
    :type s: ```str```
    :return: Whether the beginning matches the end (checks first match chars)
    :rtype: ```bool```
    """
    return (count_iter_items(takewhile(partial(eq, s[0]), s)) ==
            count_iter_items(takewhile(partial(eq, s[0]), s[::-1])))

Solution with replaced parts of the string
s = 'lolololol'
t = 0
t += s.count('lol')
s = s.replace('lol', 'lo1')
t += s.count('1ol')
print("Number of times lol occurs is:", t)
Answer is 4.

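If you want to count how often each substring of length 5 occurs (adjust if wanted for different lengths):
from collections import defaultdict
def MerCount(s):
    d = defaultdict(int)  # maps each 5-character substring to its count
    for i in range(len(s)-4):
        d[s[i:i+5]] += 1
    return d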
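A sketch of the same idea with the length as a parameter, using collections.Counter; the name kmer_counts is made up for illustration:

```python
from collections import Counter

def kmer_counts(text, k=5):
    """Count every length-k substring of text, overlaps included."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

counts = kmer_counts("1011101111", k=2)
print(counts["11"])  # 5
```

Looking up a substring that never occurs returns 0, since Counter supplies a zero for missing keys.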
