For a given substring, I need to determine all the lengths, in order, of the repeating chains of that substring in a given string.
Example: for the substring ATT and a string ATTATTATT GGG ATTATT GGG ATT, I want to return (3,2,1).
I think I have a solution, but it's inelegant and potentially slow (written below). I wanted to use more_itertools.consecutive_groups() on the start locations of the substring, but couldn't figure out how to adjust for the substring being longer than length 1.
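from re import finditer

def chain_lengths(substring, string):  # function name is illustrative
    # (start, end) span of every non-overlapping match
    spans = [i.span() for i in finditer(substring, string)]
    lengths = []
    runninglength = 1
    for i in range(len(spans)):
        if i == len(spans) - 1:
            lengths.append(runninglength)
        elif spans[i][1] == spans[i + 1][0]:
            # the next match starts where this one ends: same chain
            runninglength += 1
        else:
            lengths.append(runninglength)
            runninglength = 1
    return tuple(lengths)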
Is there a faster, less confusing way to accomplish this?
You could use re.findall to find all the non-overlapping matches in the string, then divide the length of the captured matches by the length of the search string to get the number of consecutive matches. For example:
import re
s = 'ATTATTATT GGG ATTATT GGG ATT'
sub = 'ATT'
sl = len(sub)
regex = re.compile(f'((?:{sub})+)')
lens = [len(m) // sl for m in regex.findall(s)]
print(lens)
Output:
[3, 2, 1]
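If the substring can contain regex metacharacters such as . or +, interpolating it directly into the pattern may match more than intended. A minimal sketch, assuming a hypothetical helper name chain_lengths, that escapes the substring first and returns a tuple as the question asks:

```python
import re

def chain_lengths(sub, s):
    """Return the lengths of consecutive runs of `sub` in `s`, in order."""
    # re.escape makes characters such as '.' or '+' in `sub` match literally.
    pattern = re.compile(f'(?:{re.escape(sub)})+')
    # With no capturing group, findall returns each full run as a string.
    return tuple(len(run) // len(sub) for run in pattern.findall(s))

print(chain_lengths('ATT', 'ATTATTATT GGG ATTATT GGG ATT'))  # (3, 2, 1)
```

An empty substring is not handled here; it would need a separate check before dividing by its length.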
I think the below is what you want to do:
sub_string = 'ATT'
string = 'ATTATTATT GGGATTATT GGGATT'
count = tuple(sub.count(sub_string) for sub in string.split(' '))
Try this (the zero counts from the GGG pieces are filtered out):
tuple(n for n in (each.count('ATT') for each in 'ATTATTATT GGG ATTATT GGG ATT'.split(' ')) if n)
Output:
(3, 2, 1)
My respects, colleagues.
I need to write a function that determines the maximum number of consecutive BA, CA character pairs per line.
print(f("BABABA125")) # -> 3
print(f("234CA4BACA")) # -> 2
print(f("BABACABACA56")) # -> 5
print(f("1BABA24CA")) # -> 2
Actually, I've written a function, but, to my mind, it's not very good.
def f(s: str) -> int:
    res = 0
    if not s:
        return res
    cur = 0
    i = len(s) - 1
    while i >= 0:
        if s[i] == "A" and (s[i-1] == "B" or s[i-1] == "C"):
            cur += 1
            i -= 2
        else:
            if cur > res:
                res = cur
            cur = 0
            i -= 1
    else:
        if cur > res:
            res = cur
    return res
In addition, I'm not allowed to use libraries or regular expressions (only string and list methods). Could you please help me or rate my code in this context? I'll be very grateful.
Here's a function f2 that performs this operation.
if not re.search('(BA|CA)', s): return 0
First check if the string actually contains any BA or CA (to prevent ValueError: max() arg is an empty sequence on step 3), and return 0 if there aren't any.
matches = re.finditer(r'(?:CA|BA)+', s)
Find all consecutive sequences of CA or BA, using non-capturing groups to ensure re.finditer outputs only full matches instead of partial matches.
res = max(matches, key=lambda m: len(m.group(0)))
Then, among the matches (re.Match objects), fetch the matched substring using m.group(0) and compare their lengths to find the longest one.
return len(res.group(0))//2
Divide the length of the longest result by 2 to get the number of BA or CAs in this substring. Here we use floor division // to coerce the output into an int, since division would normally convert the answer to float.
import re

strings = [
    "BABABA125",  # 3
    "234CA4BACA",  # 2
    "BABACABACA56",  # 5
    "1BABA24CA",  # 2
    "NO_MATCH_TO_BE_FOUND",  # 0
]

def f2(s: str):
    if not re.search('(BA|CA)', s): return 0
    matches = re.finditer(r'(?:CA|BA)+', s)
    res = max(matches, key=lambda m: len(m.group(0)))
    return len(res.group(0))//2

for s in strings:
    print(f2(s))
UPDATE: Thanks to @StevenRumbalski for providing a simpler version of the above answer. (I split it into multiple lines for readability.)
def f3(s):
    if not re.search('(BA|CA)', s): return 0
    matches = re.findall(r'(?:CA|BA)+', s)
    max_length = max(map(len, matches))
    return max_length // 2
if not re.search('(BA|CA)', s): return 0
Same as above
matches = re.findall(r'(?:CA|BA)+', s)
Find all consecutive sequences of CA or BA, but each value in matches is a str instead of a re.Match, which is easier to handle.
max_length = max(map(len, matches))
Map each matched substring to its length and find the maximum length among them.
return max_length // 2
Floor divide the length of the longest matching substring by the length of BA, CA to get the number of consecutive occurrences of BA or CA in this string.
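The initial re.search check is there only to keep max() from receiving an empty sequence. As a sketch (the name f3_default is hypothetical), max() also accepts a default keyword argument (Python 3.4+) that covers the no-match case without the extra search:

```python
import re

def f3_default(s):
    # Each element is a full run such as 'BACA', because the group is non-capturing.
    runs = re.findall(r'(?:CA|BA)+', s)
    # default=0 is returned when there are no runs at all.
    return max(map(len, runs), default=0) // 2

for s in ["BABABA125", "234CA4BACA", "BABACABACA56", "1BABA24CA", "NO_MATCH_TO_BE_FOUND"]:
    print(f3_default(s))
```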
Here's an alternative implementation without any imports. Do note however that it's quite slow compared to your C-style implementation.
The idea is simple: Transform the input string into a string consisting of only two types of characters c1 and c2, with c1 representing CA or BA, and c2 representing anything else. Then find the longest substring of consecutive c1s.
The implementation is as follows:
Pick a char that is guaranteed not to appear in the input string; here we use + as an example. Then pick a char different from the previous one; here we use -.
Replace each occurrence of CA and BA with a +.
Replace everything else in the string (that is not a +) with a - (this is why + cannot be present in the original input string). Now we have a string consisting purely of +s and -s.
Split the string with - as delimiter, and map each resulting substring to their length.
Return the maximum of these substring lengths.
strings = [
    "BABABA125",  # 3
    "234CA4BACA",  # 2
    "BABACABACA56",  # 5
    "1BABA24CA",  # 2
    "NO_MATCH_TO_BE_FOUND",  # 0
]

def f4(string: str):
    string = string.replace("CA", "+")
    string = string.replace("BA", "+")
    string = "".join([(c if c == "+" else "-") for c in string])
    str_list = string.split("-")
    str_lengths = map(len, str_list)
    return max(str_lengths)

for s in strings:
    print(f4(s))
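Since the question rules out libraries and regular expressions, a single left-to-right scan is another option that needs no placeholder characters. This is only a sketch under that assumption, and the name max_pairs is hypothetical. Scanning forward and always checking s[i + 1] also avoids indexing s[-1] at the start of the string:

```python
def max_pairs(s):
    best = 0  # longest run of BA/CA pairs seen so far
    cur = 0   # length of the run currently being extended
    i = 0
    while i < len(s) - 1:
        if s[i] in "BC" and s[i + 1] == "A":
            cur += 1
            if cur > best:
                best = cur
            i += 2  # skip past the pair just counted
        else:
            cur = 0  # the run is broken
            i += 1
    return best

print(max_pairs("BABACABACA56"))  # 5
```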
I am looking to be able to recursively remove adjacent letters in a string that differ only in their case, e.g. if s = AaBbccDd I would want to be able to remove Aa, Bb and Dd but leave cc.
I can do this recursively using lists:
I think it ought to be possible using a regex, but I am struggling:
With the test string 'fffAaaABbe' the answer should be 'fffe', but the regex I am using gives 'fe'.
import re

def test(line):
    res = re.compile(r'(.)\1{1}', re.IGNORECASE)
    #print(res.search(line))
    while res.search(line):
        line = res.sub('', line, 1)
    print(line)
The way that works is:
def test(line):
    result = ''
    chr = list(line)
    cnt = 0
    i = len(chr) - 1
    while i > 0:
        if ord(chr[i]) == ord(chr[i - 1]) + 32 or ord(chr[i]) == ord(chr[i - 1]) - 32:
            cnt += 1
            chr.pop(i)
            chr.pop(i - 1)
            i -= 2
        else:
            i -= 1
    if cnt > 0:  # until we can't find any duplicates.
        return test(''.join(chr))
    result = ''.join(chr)
    print(result)
Is it possible to do this using a regex?
re.IGNORECASE is not the way to solve this problem, as it treats aa, Aa, aA and AA the same way. Technically, it is possible using re.sub in the following way.
import re
txt = 'fffAaaABbe'
after_sub = re.sub(r'Aa|aA|Bb|bB|Cc|cC|Dd|dD|Ee|eE|Ff|fF|Gg|gG|Hh|hH|Ii|iI|Jj|jJ|Kk|kK|Ll|lL|Mm|mM|Nn|nN|Oo|oO|Pp|pP|Qq|qQ|Rr|rR|Ss|sS|Tt|tT|Uu|uU|Vv|vV|Ww|wW|Xx|xX|Yy|yY|Zz|zZ', '', txt)
print(after_sub) # fffe
Note that I explicitly listed all possible letter pairs because, as far as I know, there is no way to say "inverted-case letter" using just an re pattern. Maybe another user will be able to provide a more concise re-based solution.
I suggest a different approach which uses groupby to group adjacent similar letters:
from itertools import groupby

def test(line):
    res = []
    for k, g in groupby(line, key=lambda x: x.lower()):
        g = list(g)
        if all(x == x.lower() for x in g):
            res.append(''.join(g))
    print(''.join(res))
Sample run:
>>> test('AaBbccDd')
cc
>>> test('fffAaaABbe')
fffe
r'(.)\1{1}' is wrong because it will match any character that is repeated twice, including non-letter characters. If you want to stick to letters, you can't use this.
However, even if we restrict it to letters with r'([A-Za-z])\1{1}', this would still be bad under re.IGNORECASE because it matches any letter followed by the same letter in either case, so it would also catch xx and XX. You don't want to match consecutive characters with the same case, as you said in the original question.
It just so happens that there is no short-hand to do this conveniently, but it is still possible. You could also just write a small function to turn it into a short-hand.
Building on @Daweo's answer, you can generate the regex pattern needed to match pairs of same letters with non-matching case to get the final pattern of aA|Aa|bB|Bb|cC|Cc|dD|Dd|eE|Ee|fF|Ff|gG|Gg|hH|Hh|iI|Ii|jJ|Jj|kK|Kk|lL|Ll|mM|Mm|nN|Nn|oO|Oo|pP|Pp|qQ|Qq|rR|Rr|sS|Ss|tT|Tt|uU|Uu|vV|Vv|wW|Ww|xX|Xx|yY|Yy|zZ|Zz:
import re
import string

def consecutiveLettersNonMatchingCase():
    # Build 'xX|Xx' for every letter with a list comprehension,
    # iterating through the lower/uppercase alphabets in lock-step,
    # and join the pieces with '|'
    return '|'.join(['{0}{1}|{1}{0}'.format(s, t)
                     for s, t in zip(string.ascii_lowercase,
                                     string.ascii_uppercase)])

def test(line):
    res = re.compile(consecutiveLettersNonMatchingCase())
    print(res.search(line))
    while res.search(line):
        line = res.sub('', line, 1)
    print(line)

print(consecutiveLettersNonMatchingCase())
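For comparison, here is a sketch that swaps the regex for a different technique: a stack, which cancels pairs in a single pass rather than rescanning the string after every removal. The name remove_case_pairs is hypothetical, and the sketch assumes that "differ only in case" can be tested with str.swapcase():

```python
def remove_case_pairs(line):
    stack = []
    for ch in line:
        # A pair cancels when the previous kept character is the same letter
        # in the opposite case; the != check excludes non-letters, whose
        # swapcase() is the character itself.
        if stack and stack[-1] != ch and stack[-1] == ch.swapcase():
            stack.pop()
        else:
            stack.append(ch)
    return ''.join(stack)

print(remove_case_pairs('fffAaaABbe'))  # fffe
print(remove_case_pairs('AaBbccDd'))    # cc
```

Because removing a pair exposes the character before it on top of the stack, newly adjacent pairs such as the outer aA in 'aBbA' are cancelled without recursion.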
Is there a Python-way to split a string after the nth occurrence of a given delimiter?
Given a string:
'20_231_myString_234'
It should be split into (with the delimiter being '_', after its second occurrence):
['20_231', 'myString_234']
Or is the only way to accomplish this to count, split and join?
>>> text = '20_231_myString_234'
>>> n = 2
>>> groups = text.split('_')
>>> '_'.join(groups[:n]), '_'.join(groups[n:])
('20_231', 'myString_234')
This seems to be the most readable way; the alternative is a regex.
Using re to get a regex of the form ^((?:[^_]*_){n-1}[^_]*)_(.*) where n is a variable:
import re
n = 2
s = '20_231_myString_234'
m = re.match(r'^((?:[^_]*_){%d}[^_]*)_(.*)' % (n - 1), s)
if m:
    print(m.groups())
or have a nice function:
import re

def nthofchar(s, c, n):
    # c must be a single character, since it is formatted with %c
    regex = r'^((?:[^%c]*%c){%d}[^%c]*)%c(.*)' % (c, c, n - 1, c, c)
    l = ()
    m = re.match(regex, s)
    if m:
        l = m.groups()
    return l

s = '20_231_myString_234'
print(nthofchar(s, '_', 2))
Or without regexes, using iterative find:
def nth_split(s, delim, n):
    p, c = -1, 0
    while c < n:
        # s.index raises ValueError if there are fewer than n delimiters
        p = s.index(delim, p + 1)
        c += 1
    return s[:p], s[p + 1:]

s1, s2 = nth_split('20_231_myString_234', '_', 2)
print(s1, ":", s2)
I like this solution because it works without any actual regex and can easily be adapted to another "nth" or delimiter.
import re
string = "20_231_myString_234"
occur = 2  # on which occurrence you want to split
indices = [x.start() for x in re.finditer("_", string)]
part1 = string[0:indices[occur-1]]
part2 = string[indices[occur-1]+1:]
print(part1, ' ', part2)
I thought I would contribute my two cents. The second parameter to split() allows you to limit the split after a certain number of strings:
def split_at(s, delim, n):
    r = s.split(delim, n)[n]
    return s[:-len(r)-len(delim)], r
On my machine, the two good answers by @perreal, iterative find and regular expressions, actually measure 1.4 and 1.6 times slower (respectively) than this method.
It's worth noting that it can become even quicker if you don't need the initial bit. Then the code becomes:
def remove_head_parts(s, delim, n):
    return s.split(delim, n)[n]
Not so sure about the naming, I admit, but it does the job. Somewhat surprisingly, it is 2 times faster than iterative find and 3 times faster than regular expressions.
I put up my testing script online. You are welcome to review and comment.
>>> import re
>>> str = '20_231_myString_234'
>>> occurrence = [m.start() for m in re.finditer('_', str)]  # positions of every '_'
>>> occurrence
[2, 6, 15]
>>> result = [str[:occurrence[1]], str[occurrence[1] + 1:]]  # [str[:6], str[7:]]
>>> result
['20_231', 'myString_234']
It depends on the pattern you are splitting on. If the first two elements are always numbers, for example, you can build a regular expression and use the re module, which can split your string as well.
I had a larger string to split at every nth occurrence of a separator and ended up with the following code:
# Split every 6 spaces
n = 6
sep = ' '
n_split_groups = []
groups = err_str.split(sep)  # err_str is the input string
while len(groups):
    n_split_groups.append(sep.join(groups[:n]))
    groups = groups[n:]
print(n_split_groups)
Thanks @perreal!
In function form of @AllBlackt's solution:
def split_nth(s, sep, n):
    n_split_groups = []
    groups = s.split(sep)
    while len(groups):
        n_split_groups.append(sep.join(groups[:n]))
        groups = groups[n:]
    return n_split_groups

s = "aaaaa bbbbb ccccc ddddd eeeeeee ffffffff"
print(split_nth(s, " ", 2))
['aaaaa bbbbb', 'ccccc ddddd', 'eeeeeee ffffffff']
As @Yuval has noted in his answer, and @jamylak commented in his answer, the split and rsplit methods accept a second (optional) parameter maxsplit to avoid making splits beyond what is necessary. Thus, I find the better solution (both for readability and performance) is this:
text = '20_231_myString_234'
first_part = text.rsplit('_', 2)[0]  # Gives '20_231'
second_part = text.split('_', 2)[2]  # Gives 'myString_234'
This is not only simple, but also avoids performance hits of regex solutions and other solutions using join to undo unnecessary splits.
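Note that rsplit('_', 2)[0] yields '20_231' here only because the sample string contains exactly three delimiters; with more of them the head would be longer. A hedged sketch of a general version that still relies on maxsplit follows. The name split_at_nth is hypothetical, and returning an empty tail when there are fewer than n delimiters is an assumption about the desired behaviour:

```python
def split_at_nth(text, delim, n):
    # At most n splits are made, so at most n + 1 pieces come back.
    parts = text.split(delim, n)
    if len(parts) <= n:
        # Assumption: fewer than n delimiters means "no split".
        return text, ''
    # Re-join only the first n short pieces; the tail is never split further.
    return delim.join(parts[:n]), parts[n]

print(split_at_nth('20_231_myString_234', '_', 2))  # ('20_231', 'myString_234')
print(split_at_nth('a_b_c_d_e', '_', 2))            # ('a_b', 'c_d_e')
```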
I have strings like "aaaaabbbbbbbbbbbbbbccccccccccc".
The number of the chars can differ and sometimes there can be dash inside the string, like "aaaaa-bbbbbbbbbbbbbbccccccccccc".
Is there any smart way to either split it into "aaaaa", "bbbbbbbbbbbbbb", "ccccccccccc" and get the indices of where it is split, or just get the indices, without looping through every string? If the dash is between two patterns, it can end up in either the left or the right one as long as it is always handled the same way.
Any idea?
Regular expression MatchObject results include indices of the match. What remains is to match repeating characters:
import re
repeat = re.compile(r'(?P<start>[a-z])(?P=start)+-?')
would match only if a given letter character (a-z) is repeated at least once:
>>> for match in repeat.finditer("aaaaabbbbbbbbbbbbbbccccccccccc"):
...     print(match.group(), match.start(), match.end())
...
aaaaa 0 5
bbbbbbbbbbbbbb 5 19
ccccccccccc 19 30
The .start() and .end() methods on the match result give you the exact positions in the input string.
Dashes are included in the matches, but not non-repeating characters:
>>> for match in repeat.finditer("a-bb-cccccccc"):
...     print(match.group(), match.start(), match.end())
...
bb- 2 5
cccccccc 5 13
If you want the a- part to be a match, simply replace the + with a * multiplier:
repeat = re.compile(r'(?P<start>[a-z])(?P=start)*-?')
What about using itertools.groupby?
>>> s = 'aaaaabbbbbbbbbbbbbbccccccccccc'
>>> from itertools import groupby
>>> [''.join(v) for k,v in groupby(s)]
['aaaaa', 'bbbbbbbbbbbbbb', 'ccccccccccc']
This will put the - as their own substrings which could easily be filtered out.
>>> s = 'aaaaa-bbbbbbbbbbbbbb-ccccccccccc'
>>> [''.join(v) for k,v in groupby(s) if k != '-']
['aaaaa', 'bbbbbbbbbbbbbb', 'ccccccccccc']
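The question also asks for the indices where each run starts and ends, which groupby does not report directly. A minimal sketch that tracks a running position alongside the groups (the function name runs_with_indices is hypothetical):

```python
from itertools import groupby

def runs_with_indices(s):
    """Return (run, start, end) for each run of identical characters in s."""
    result = []
    pos = 0
    for char, group in groupby(s):
        length = sum(1 for _ in group)  # consume the group to count it
        result.append((char * length, pos, pos + length))
        pos += length
    return result

for run, start, end in runs_with_indices('aaaaabbbbbbbbbbbbbbccccccccccc'):
    print(run, start, end)
```

The end index is exclusive, matching the convention of match.end() and of slicing.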
str = "aaaaabbbbbbbbbbbbbbccccccccccc"
p = [0]
for i, c in enumerate(zip(str, str[1:])):
    if c[0] != c[1]:
        p.append(i + 1)
print(p)
# [0, 5, 19]
I have a shorter string s I'm trying to match to a longer string s1. 1's match 1's, but 0's will match either a 0 or a 1.
For instance:
s = '11111' would match s1 = '11111'
s = '11010' would match s1 = '11111' or '11011' or '11110' or '11010'
I know regular expressions would make this much easier but am confused on where to start.
Replace each instance of 0 with [01] to enable it matching either 0 or 1:
import re
s = '11010'
pattern = s.replace('0', '[01]')
regex = re.compile(pattern)
regex.match('11111')
regex.match('11011')
It looks to me like you're actually looking for bit arithmetic:
s = '11010'
n = int(s, 2)
for r in ('11111', '11011', '11110', '11010'):
    # every 1 bit in s must also be set in r
    if int(r, 2) & n == n:
        print(r, 'matches', s)
    else:
        print(r, 'doesnt match', s)
import re

def matches(pat, s):
    p = re.compile(pat.replace('0', '[01]') + '$')
    return p.match(s) is not None

print(matches('11111', '11111'))
print(matches('11111', '11011'))
print(matches('11010', '11111'))
print(matches('11010', '11011'))
You say "match to a longer string s1", but you don't say whether you'd like to match the start of the string, or the end etc. Until I better understand your requirements, this performs an exact match.
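If the goal is instead to find the pattern anywhere inside a longer s1, one possible reading of the question is "report every offset at which it fits". A sketch under that assumption (the name find_offsets is hypothetical), using a lookahead so that overlapping positions are all reported:

```python
import re

def find_offsets(pat, s1):
    # Each 0 in the pattern may match 0 or 1; each 1 must match 1.
    body = pat.replace('0', '[01]')
    # The lookahead is zero-width, so finditer tests every start position.
    regex = re.compile('(?=' + body + ')')
    return [m.start() for m in regex.finditer(s1)]

print(find_offsets('11010', '0011111'))  # [2]
print(find_offsets('10', '1101'))        # [0, 1]
```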