Detect repetitions in string - python

I have a simple problem, but can't come up with a simple solution :)
Let's say I have a string. I want to detect if there is a repetition in it.
I'd like:
"blablabla" # => (bla, 3)
"rablabla" # => (bla, 2)
The thing is I don't know what pattern I am searching for (I don't have "bla" as input).
Any idea?
EDIT:
Seeing the comments, I think I should be a bit more precise about what I have in mind:
In a string, there is either a pattern that is repeated or not.
The repeated pattern can be of any length.
If there is a pattern, it is repeated over and over again until the end. But the string can end in the middle of the pattern.
Example:
"testblblblblb" # => ("bl",4)

import re
def repetitions(s):
    r = re.compile(r"(.+?)\1+")
    for match in r.finditer(s):
        # integer division keeps the count an int on Python 3
        yield (match.group(1), len(match.group(0)) // len(match.group(1)))
This finds all non-overlapping repeating matches, using the shortest possible unit of repetition:
>>> list(repetitions("blablabla"))
[('bla', 3)]
>>> list(repetitions("rablabla"))
[('abl', 2)]
>>> list(repetitions("aaaaa"))
[('a', 5)]
>>> list(repetitions("aaaaablablabla"))
[('a', 5), ('bla', 3)]
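For the case described in the question's edit, where the repeated unit must run to the end of the string and the last copy may be cut short, a brute-force check is one option. This is only a sketch: the name trailing_repetition is made up for illustration, it returns the earliest-starting (then shortest) unit, and it is cubic in the worst case, so it suits short strings.

```python
def trailing_repetition(s):
    """Return (unit, count) if some suffix of s is a unit repeated at least
    twice, optionally followed by a truncated copy of the unit; else None."""
    n = len(s)
    for start in range(n):
        tail = s[start:]
        # the unit must fit at least twice into the tail
        for size in range(1, len(tail) // 2 + 1):
            unit = tail[:size]
            full, rest = divmod(len(tail), size)
            # whole copies of the unit plus a (possibly empty) prefix of it
            if tail == unit * full + unit[:rest]:
                return unit, full
    return None

print(trailing_repetition("testblblblblb"))  # ('bl', 4)
print(trailing_repetition("blablabla"))      # ('bla', 3)
```

Like the regex version, this reports ('abl', 2) for "rablabla", because the repetition that starts earliest wins.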

Related

How to extract each word consecutive to its own previous number in a string and sorting the result in Python

Input : x3b4U5i2
Output : bbbbiiUUUUUxxx
How can I solve this problem in Python? I have to print the word next to its number n times and sort the result.
It wasn't clear if multiple digit counts or groups of letters should be handled. Here's a solution that does all of that:
import re
def main(inp):
    # re.split with a capture group keeps the digit runs in the result
    parts = re.split(r"(\d+)", inp)
    # map each letter group to its repeat count
    parts_map = {parts[i]:int(parts[i+1]) for i in range(0, len(parts)-1, 2)}
    print(''.join([c*parts_map[c] for c in sorted(parts_map.keys(),key=str.lower)]))
main("x3b4U5i2")
main("x3brx4U5i2")
main("x23b4U35i2")
Result:
bbbbiiUUUUUxxx
brxbrxbrxbrxiiUUUUUxxx
bbbbiiUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUxxxxxxxxxxxxxxxxxxxxxxx
I'm assuming the formatting will always be <char><int> with <int> being in between 1 and 9...
input_ = "x3b4U5i2"
result_list = [input_[i]*int(input_[i+1]) for i in range(0, len(input_), 2)]
result_list.sort(key=str.lower)
result = ''.join(result_list)
There's probably a much more performance-oriented approach to solving this; it's just the first solution that came into my limited mind.
Edit
After the feedback in the comments I've tried to improve performance by sorting it first, but I have actually decreased performance in the following implementation:
input_ = "x3b4U5i2"
def sort_first(value):
    return value[0].lower()
tuple_construct = [(input_[i], int(input_[i+1])) for i in range(0, len(input_), 2)]
tuple_construct.sort(key=sort_first)
result = ''.join([tc[0] * tc[1] for tc in tuple_construct])
Execution time for 100,000 iterations on it:
1) The execution time is: 0.353036
2) The execution time is: 0.4361724
One option: extract the character/digit(s) pairs with a regex, sort them by letter (ignoring case), multiply the letter by the number of repeats, and join:
import re
s = 'x3b4U5i2'
out = ''.join([c*int(i) for c,i in
               sorted(re.findall(r'(\D)(\d+)', s),
                      key=lambda x: x[0].casefold())
              ])
print(out)
Output: bbbbiiUUUUUxxx
If you want to handle multiple characters you can use '(\D+)(\d+)'
No list comprehensions or generator expressions in sight. Just using re.sub with a lambda to expand the run-length encoding, then sorting that, and then joining that back into a string.
import re
s = "x3b4U5i2"
''.join(sorted(re.sub(r"(\D+)(\d+)",
                      lambda m: m.group(1)*int(m.group(2)),
                      s),
               key=lambda x: x[0].casefold()))
# 'bbbbiiUUUUUxxx'
If we use re.findall to extract a list of pairs of strings and multipliers:
import re
s = 'x3b4U5i2'
pairs = re.findall(r"(\D+)(\d+)", s)
Then we can use some functional style to sort that list before expanding it.
from operator import itemgetter
def compose(f, g):
    return lambda x: f(g(x))
sorted(pairs, key=compose(str.lower, itemgetter(0)))
# [('b', '4'), ('i', '2'), ('U', '5'), ('x', '3')]
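The answer stops once the pairs are sorted. One possible way to finish, assuming each sorted pair should simply be expanded and concatenated, is sketched here (the names ordered and result are illustrative, not from the answer):

```python
import re
from operator import itemgetter

def compose(f, g):
    return lambda x: f(g(x))

s = 'x3b4U5i2'
pairs = re.findall(r"(\D+)(\d+)", s)

# sort case-insensitively on the text part of each (text, count) pair
ordered = sorted(pairs, key=compose(str.lower, itemgetter(0)))

# expand each pair and glue the pieces together
result = ''.join(text * int(count) for text, count in ordered)
print(result)  # bbbbiiUUUUUxxx
```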

Python package for converting finite regex to a text array? [duplicate]

Suppose I have the following string:
trend = '(A|B|C)_STRING'
I want to expand this to:
A_STRING
B_STRING
C_STRING
The OR condition can be anywhere in the string, i.e. STRING_(A|B)_STRING_(C|D)
would expand to
STRING_A_STRING_C
STRING_B_STRING_C
STRING_A_STRING_D
STRING_B_STRING_D
I also want to cover the case of an empty conditional:
(|A_)STRING would expand to:
A_STRING
STRING
Here's what I've tried so far:
def expandOr(trend):
    parenBegin = trend.index('(') + 1
    parenEnd = trend.index(')')
    orExpression = trend[parenBegin:parenEnd]
    originalTrend = trend[0:parenBegin - 1]
    expandedOrList = []
    for oe in orExpression.split("|"):
        expandedOrList.append(originalTrend + oe)
But this is obviously not working.
Is there any easy way to do this using regex?
Here's a pretty clean way. You'll have fun figuring out how it works :-)
def expander(s):
    import re
    from itertools import product
    pat = r"\(([^)]*)\)"
    pieces = re.split(pat, s)
    pieces = [piece.split("|") for piece in pieces]
    for p in product(*pieces):
        yield "".join(p)
Then:
for s in ('(A|B|C)_STRING',
          '(|A_)STRING',
          'STRING_(A|B)_STRING_(C|D)'):
    print(s, "->")
    for t in expander(s):
        print(" ", t)
displays:
(A|B|C)_STRING ->
  A_STRING
  B_STRING
  C_STRING
(|A_)STRING ->
  STRING
  A_STRING
STRING_(A|B)_STRING_(C|D) ->
  STRING_A_STRING_C
  STRING_A_STRING_D
  STRING_B_STRING_C
  STRING_B_STRING_D
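For readers who would rather not puzzle it out, here is a step-by-step trace of the same idea. It is a sketch with the intermediate values spelled out, and the variable names choices and expanded are illustrative. The key point is that re.split keeps the text of a capture group in its result, so literal pieces and "A|B"-style groups alternate in the list.

```python
import re
from itertools import product

s = "STRING_(A|B)_STRING_(C|D)"

# Step 1: split on parenthesised groups; the capture group keeps their contents
pieces = re.split(r"\(([^)]*)\)", s)
print(pieces)   # ['STRING_', 'A|B', '_STRING_', 'C|D', '']

# Step 2: every piece becomes a list of choices (literals have a single choice)
choices = [piece.split("|") for piece in pieces]
print(choices)  # [['STRING_'], ['A', 'B'], ['_STRING_'], ['C', 'D'], ['']]

# Step 3: the Cartesian product picks one choice per piece, in order
expanded = ["".join(combo) for combo in product(*choices)]
print(expanded)
```

Note that this approach assumes the groups are not nested and that "|" never appears outside parentheses.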
import exrex
trend = '(A|B|C)_STRING'
trend2 = 'STRING_(A|B)_STRING_(C|D)'
>>> list(exrex.generate(trend))
[u'A_STRING', u'B_STRING', u'C_STRING']
>>> list(exrex.generate(trend2))
[u'STRING_A_STRING_C', u'STRING_A_STRING_D', u'STRING_B_STRING_C', u'STRING_B_STRING_D']
I would do this to extract the groups:
def extract_groups(trend):
    l_parens = [i for i,c in enumerate(trend) if c == '(']
    r_parens = [i for i,c in enumerate(trend) if c == ')']
    assert len(l_parens) == len(r_parens)
    return [trend[l+1:r].split('|') for l,r in zip(l_parens,r_parens)]
And then you can evaluate the product of those extracted groups using itertools.product:
expr = 'STRING_(A|B)_STRING_(C|D)'
from itertools import product
list(product(*extract_groups(expr)))
Out[92]: [('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D')]
Now it's just a question of splicing those back onto your original expression. I'll use re for that :)
#python3.3+
import re
def _gen(it):
    yield from it
p = re.compile(r'\(.*?\)')
# expr is the expression defined above, whose groups were just extracted
for tup in product(*extract_groups(expr)):
    gen = _gen(tup)
    print(p.sub(lambda x: next(gen), expr))
STRING_A_STRING_C
STRING_A_STRING_D
STRING_B_STRING_C
STRING_B_STRING_D
There's probably a more readable way to get re.sub to sequentially substitute things from an iterable, but this is what came off the top of my head.
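One arguably more readable variant, offered as a sketch rather than the answerer's method, hands re.sub a plain iterator so that each match consumes the next replacement. The function name expand is hypothetical, and the groups are found with re.findall instead of extract_groups.

```python
import re
from itertools import product

def expand(expr):
    """Expand every (a|b|...) group in expr into all combinations."""
    group = re.compile(r"\(([^)]*)\)")
    # one list of alternatives per parenthesised group, in order of appearance
    options = [inner.split("|") for inner in group.findall(expr)]
    results = []
    for combo in product(*options):
        replacements = iter(combo)
        # each successive match takes the next item from the iterator
        results.append(group.sub(lambda m: next(replacements), expr))
    return results

print(expand('STRING_(A|B)_STRING_(C|D)'))
```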
It is easy to achieve with the sre_yield module:
>>> import sre_yield
>>> trend = '(A|B|C)_STRING'
>>> strings = list(sre_yield.AllStrings(trend))
>>> print(strings)
['A_STRING', 'B_STRING', 'C_STRING']
The goal of sre_yield is to efficiently generate all values that can match a given regular expression, or count possible matches efficiently... It does this by walking the tree as constructed by sre_parse (same thing used internally by the re module), and constructing chained/repeating iterators as appropriate. There may be duplicate results, depending on your input string though -- these are cases that sre_parse did not optimize.

How to iterate over all overlapping matches in string?

I want to iterate over all overlapping matches of terms in a sentence/string. If it is not possible to solve the problem with just a single regex, it is fine to go with more than one -- but the number of expressions should be independent of the number of terms.
>>> import re
>>> text = 'X has effect on Y.'
>>> terms = ['has effect', 'effect', 'effect on']
>>> pattern = r"(?=\b(" + "|".join(re.escape(term) for term in terms) + r")\b)"
>>> print(pattern)
(?=\b(has\ effect|effect|effect\ on)\b)
>>> [(m.span(1), text[m.start(1):m.end(1)]) for m in re.finditer(pattern, text)]
[((2, 12), 'has effect'), ((6, 12), 'effect')]
I'm still not able to extract the 'effect on' term -- I think I'm stuck on the 'lookbehind assertion'. Thanks!
EDIT #1:
Problem description in terms of input/output specs with example:
Input:
a sentence, e.g. 'X has effect on Y.'
a list of terms, e.g. ['has effect', 'effect', 'effect on']
Output:
a list of mentions (((start, end), term) tuples), e.g. [((2, 12), 'has effect'), ((6, 12), 'effect'), ((6, 15), 'effect on')]
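The pattern above misses 'effect on' because the alternation yields only one capture per start position, and 'effect' wins at index 6. One possible workaround, sketched here under the assumption that a single compiled pattern with one capture group per term is acceptable, gives every term its own "lookahead or nothing" group. The helper name find_mentions is made up for illustration.

```python
import re

def find_mentions(text, terms):
    # Each term gets a group that either captures the term via a lookahead
    # or matches the empty string, so all terms are tested at every position.
    parts = [r"(?:(?=(%s\b))|)" % re.escape(term) for term in terms]
    pattern = re.compile(r"\b" + "".join(parts))
    mentions = []
    for m in pattern.finditer(text):
        for i in range(1, len(terms) + 1):
            if m.group(i) is not None:
                mentions.append((m.span(i), m.group(i)))
    return mentions

text = 'X has effect on Y.'
terms = ['has effect', 'effect', 'effect on']
print(find_mentions(text, terms))
# [((2, 12), 'has effect'), ((6, 12), 'effect'), ((6, 15), 'effect on')]
```

The number of groups still grows with the number of terms, but only one expression is compiled and the text is scanned once.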

How to Identify Repetitive Characters in a String Using Python?

I am new to Python and I want to write a program that determines whether a string consists of repetitive characters. The strings that I want to test are:
Str1 = "AAAA"
Str2 = "AGAGAG"
Str3 = "AAA"
The pseudo-code that I come up with:
WHEN len(str) % 2 with zero remainder:
- Divide the string into two sub-strings.
- Then, compare the two sub-strings and check if they have the same characters, or not.
- if the two sub-strings are not the same, divide the string into three sub-strings and compare them to check if repetition occurs.
I am not sure whether this is a workable way to solve the problem. Any ideas on how to approach it?
Thank you!
You could use Counter from the collections module to count how often each character occurs.
>>> from collections import Counter
>>> s = 'abcaaada'
>>> c = Counter(s)
>>> c.most_common()
[('a', 5), ('b', 1), ('c', 1), ('d', 1)]
To get the single most repetitive (common) character:
>>> c.most_common(1)
[('a', 5)]
You could do this using regex backreferences.
To find a pattern in Python, you are going to need to use regular expressions. A regular expression search is typically written as:
match = re.search(pat, str)
This is usually followed by an if-statement to determine whether the search succeeded.
For example, this is how you would find the pattern "AAAA" in a string:
import re
string = ' blah blahAAAA this is an example'
match = re.search(r'AAAA', string)
if match:
    print('found', match.group())
else:
    print('did not find')
This prints "found AAAA".
Do the same for your other two strings and it will work the same way.
Regular expressions can do a lot more than just this, so play around with them and see what else they can do.
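The answer mentions backreferences without showing one. A minimal sketch of what that could look like, assuming "repetitive" means the whole string is some unit repeated at least twice, is below. The function name repeated_unit is illustrative.

```python
import re

def repeated_unit(s):
    """Return (unit, count) if s is a shorter unit repeated two or more times."""
    # (.+?) lazily captures the shortest candidate unit;
    # \1+ requires one or more further copies of exactly that text.
    m = re.fullmatch(r"(.+?)\1+", s)
    if m is None:
        return None
    unit = m.group(1)
    return unit, len(s) // len(unit)

for s in ("AAAA", "AGAGAG", "AAA", "AGT"):
    print(s, repeated_unit(s))
```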
Assuming you mean the whole string is a repeating pattern, this answer has a good solution:
def principal_period(s):
    i = (s+s).find(s, 1, -1)
    return None if i == -1 else s[:i]
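A short usage sketch may help show why this works: s can only reappear inside s+s at an offset strictly between 0 and len(s) if s is a rotation of itself, which means it is built from a shorter repeating unit. The test strings are the ones from the question plus one made-up non-repeating string ("AGT").

```python
def principal_period(s):
    # search for s inside s+s, skipping the trivial hits at offsets 0 and len(s)
    i = (s + s).find(s, 1, -1)
    return None if i == -1 else s[:i]

for s in ("AAAA", "AGAGAG", "AAA", "AGT"):
    print(s, "->", principal_period(s))
# AAAA -> A
# AGAGAG -> AG
# AAA -> A
# AGT -> None
```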

extracting data from matchobjects

I have a long sequence with multiple repeats of a specific string (say 'GAATTC') scattered randomly throughout it. I'm currently using the match object's .span() method to get the indices where the pattern 'GAATTC' is found. Now I want to use those indices to cut the sequence between the G and the A of each occurrence (i.e. 'G|AATTC').
How do I use the data from the match object to slice those out?
If I understand you correctly, you have the string and an index where the sequence GAATTC starts, so do you need this (i here is the m.start for the group)?
>>> seq = "GAATTC"
>>> s = "AATCCTGAGAATTCAAC"
>>> i = 8 # the index where seq starts in s
>>> s[i:]
'GAATTCAAC'
>>> s[i:i+len(seq)]
'GAATTC'
That extracts it. You can also slice the original sequence at the G like this:
>>> s[:i+1]
'AATCCTGAG'
>>> s[i+1:]
'AATTCAAC'
>>>
If what you want is to replace each 'GAATTC' with 'G|AATTC' (I'm not sure what you want to do in the end), I think you can manage this without regex:
>>> string = 'GAATTCAAGAATTCTTGAATTCGAATTCAATATATA'
>>> string.replace('GAATTC', 'G|AATTC')
'G|AATTCAAG|AATTCTTG|AATTCG|AATTCAATATATA'
EDIT: ok, this way can be adapted to suit what you want to do:
>>> groups = string.replace('GAATTC', 'G|AATTC').split('|')
>>> groups
['G', 'AATTCAAG', 'AATTCTTG', 'AATTCG', 'AATTCAATATATA']
>>> list(map(len, groups))
[1, 8, 8, 6, 13]
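To answer the question as literally asked, the cut positions can also be taken straight from the match objects. The following is a sketch, assuming the cut always falls a fixed number of characters after the start of each match. The function name digest and the offset parameter are hypothetical.

```python
import re

def digest(sequence, site="GAATTC", offset=1):
    """Cut sequence `offset` characters into every occurrence of `site`."""
    # m.start() is the index where each match begins; the cut goes after the G
    cuts = [m.start() + offset for m in re.finditer(re.escape(site), sequence)]
    bounds = [0] + cuts + [len(sequence)]
    # slice between consecutive boundaries
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]

fragments = digest('GAATTCAAGAATTCTTGAATTCGAATTCAATATATA')
print(fragments)
# ['G', 'AATTCAAG', 'AATTCTTG', 'AATTCG', 'AATTCAATATATA']
```

This gives the same fragments as the replace-and-split approach, but keeps the indices available if they are needed for anything else.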
