How to Identify Repetitive Characters in a String Using Python? - python

I am new to python and I want to write a program that determines if a string consists of repetitive characters. The list of strings that I want to test are:
Str1 = "AAAA"
Str2 = "AGAGAG"
Str3 = "AAA"
The pseudo-code that I come up with:
WHEN len(str) % 2 with zero remainder:
- Divide the string into two sub-strings.
- Then, compare the two sub-strings and check if they have the same characters, or not.
- if the two sub-strings are not the same, divide the string into three sub-strings and compare them to check if repetition occurs.
I am not sure if this is applicable way to solve the problem, Any ideas how to approach this problem?
Thank you!

You could use the Counter library to count the most common occurrences of the characters.
>>> from collections import Counter
>>> s = 'abcaaada'
>>> c = Counter(s)
>>> c.most_common()
[('a', 5), ('c', 1), ('b', 1), ('d', 1)]
To get the single most repetitive (common) character:
>>> c.most_common(1)
[('a', 5)]

You could do this using a RegX backreferences.

To find a pattern in Python, you are going to need to use "Regular Expressions". A regular expression is typically written as:
match = re.search(pat, str)
This is usually followed by an if-statement to determine if the search succeeded.
for example this is how you would find the pattern "AAAA" in a string:
import re
string = ' blah blahAAAA this is an example'
match = re.search(r'AAAA', string)
if match:
print 'found', match.group()
else:
print 'did not find'
This returns "found 'AAAA'"
Do the same for your other two strings and it will work the same.
Regular expressions can do a lot more than just this so work around with them and see what else they can do.

Assuming you mean the whole string is a repeating pattern, this answer has a good solution:
def principal_period(s):
i = (s+s).find(s, 1, -1)
return None if i == -1 else s[:i]

Related

Python package for converting finite regex to a text array? [duplicate]

Suppose I have the following string:
trend = '(A|B|C)_STRING'
I want to expand this to:
A_STRING
B_STRING
C_STRING
The OR condition can be anywhere in the string. i.e STRING_(A|B)_STRING_(C|D)
would expand to
STRING_A_STRING_C
STRING_B_STRING C
STRING_A_STRING_D
STRING_B_STRING_D
I also want to cover the case of an empty conditional:
(|A_)STRING would expand to:
A_STRING
STRING
Here's what I've tried so far:
def expandOr(trend):
parenBegin = trend.index('(') + 1
parenEnd = trend.index(')')
orExpression = trend[parenBegin:parenEnd]
originalTrend = trend[0:parenBegin - 1]
expandedOrList = []
for oe in orExpression.split("|"):
expandedOrList.append(originalTrend + oe)
But this is obviously not working.
Is there any easy way to do this using regex?
Here's a pretty clean way. You'll have fun figuring out how it works :-)
def expander(s):
import re
from itertools import product
pat = r"\(([^)]*)\)"
pieces = re.split(pat, s)
pieces = [piece.split("|") for piece in pieces]
for p in product(*pieces):
yield "".join(p)
Then:
for s in ('(A|B|C)_STRING',
'(|A_)STRING',
'STRING_(A|B)_STRING_(C|D)'):
print s, "->"
for t in expander(s):
print " ", t
displays:
(A|B|C)_STRING ->
A_STRING
B_STRING
C_STRING
(|A_)STRING ->
STRING
A_STRING
STRING_(A|B)_STRING_(C|D) ->
STRING_A_STRING_C
STRING_A_STRING_D
STRING_B_STRING_C
STRING_B_STRING_D
import exrex
trend = '(A|B|C)_STRING'
trend2 = 'STRING_(A|B)_STRING_(C|D)'
>>> list(exrex.generate(trend))
[u'A_STRING', u'B_STRING', u'C_STRING']
>>> list(exrex.generate(trend2))
[u'STRING_A_STRING_C', u'STRING_A_STRING_D', u'STRING_B_STRING_C', u'STRING_B_STRING_D']
I would do this to extract the groups:
def extract_groups(trend):
l_parens = [i for i,c in enumerate(trend) if c == '(']
r_parens = [i for i,c in enumerate(trend) if c == ')']
assert len(l_parens) == len(r_parens)
return [trend[l+1:r].split('|') for l,r in zip(l_parens,r_parens)]
And then you can evaluate the product of those extracted groups using itertools.product:
expr = 'STRING_(A|B)_STRING_(C|D)'
from itertools import product
list(product(*extract_groups(expr)))
Out[92]: [('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D')]
Now it's just a question of splicing those back onto your original expression. I'll use re for that :)
#python3.3+
def _gen(it):
yield from it
p = re.compile('\(.*?\)')
for tup in product(*extract_groups(trend)):
gen = _gen(tup)
print(p.sub(lambda x: next(gen),trend))
STRING_A_STRING_C
STRING_A_STRING_D
STRING_B_STRING_C
STRING_B_STRING_D
There's probably a more readable way to get re.sub to sequentially substitute things from an iterable, but this is what came off the top of my head.
It is easy to achieve with sre_yield module:
>>> import sre_yield
>>> trend = '(A|B|C)_STRING'
>>> strings = list(sre_yield.AllStrings(trend))
>>> print(strings)
['A_STRING', 'B_STRING', 'C_STRING']
The goal of sre_yield is to efficiently generate all values that can match a given regular expression, or count possible matches efficiently... It does this by walking the tree as constructed by sre_parse (same thing used internally by the re module), and constructing chained/repeating iterators as appropriate. There may be duplicate results, depending on your input string though -- these are cases that sre_parse did not optimize.

Split a unicode string into components containing numbers and letters

I'd like to split the string u'123K into 123 and K. I've tried re.match("u'123K", "\d+") to match the number and re.match("u'123K", "K") to match the letter but they don't work. What is a Pythonic way to do this?
Use re.findall() to find all numbers and characters:
>>> s = u'123K'
>>> re.findall(r'\d+|[a-zA-Z]+', s) # or use r'\d+|\D+' as mentioned in comment in order to match all numbers and non-numbers.
['123', 'K']
If you are just dealing with this string or if you only want to split the string from the last character you can simply use a indexing:
num, charracter = s[:-1], s[-1:]
You can also use itertools.groupby method, grouping digits:
>>> import itertools as it
>>> for _,v in it.groupby(s, key=str.isdigit):
print(''.join(v))
123
K

Remove repeating characters from words

I was wondering what is the best way to convert something like "haaaaapppppyyy" to "haappyy".
Basically, when parsing slang, people sometimes repeat characters for added emphasis.
I was wondering what the best way to do this is? Using set() doesn't work because the order of the letters is obviously important.
Any ideas? I'm using Python + nltk.
It can be done using regular expressions:
>>> import re
>>> re.sub(r'(.)\1+', r'\1\1', "haaaaapppppyyy")
'haappyy'
(.)\1+ repleaces any character (.) followed by one or more of the same character (because of the backref \1 it must be the same) by twice the character.
You can squash multiple occurrences of letters with itertools.groupby:
>>> ''.join(c for c, _ in groupby("haaaaapppppyyy"))
'hapy'
Similarly, you can get haappyy from groupby with
>>> ''.join(''.join(s)[:2] for _, s in groupby("haaaaapppppyyy"))
'haappyy'
You should do it without reduce or regexps:
>>> s = 'hhaaaaapppppyyy'
>>> ''.join(['' if i>1 and e==s[i-2] else e for i,e in enumerate(s)])
'haappyy'
The number of repetitions are hardcoded to >1 and -2 above. The general case:
>>> reps = 1
>>> ''.join(['' if i>reps-1 and e==s[i-reps] else e for i,e in enumerate(s)])
'hapy'
This is one way of doing it (limited to the obvious constraint that python doesn't speak english).
>>> s="haaaappppyy"
>>> reduce(lambda x,y: x+y if x[-2:]!=y*2 else x, s, "")
'haappyy'

Detect repetitions in string

I have a simple problem, but can't come with a simple solution :)
Let's say I have a string. I want to detect if there is a repetition in it.
I'd like:
"blablabla" # => (bla, 3)
"rablabla" # => (bla, 2)
The thing is I don't know what pattern I am searching for (I don't have "bla" as input).
Any idea?
EDIT:
Seeing the comments, I think I should precise a bit more what I have in mind:
In a string, there is either a pattern that is repeted or not.
The repeted pattern can be of any length.
If there is a pattern, it would be repeted over and over again until the end. But the string can end in the middle of the pattern.
Example:
"testblblblblb" # => ("bl",4)
import re
def repetitions(s):
r = re.compile(r"(.+?)\1+")
for match in r.finditer(s):
yield (match.group(1), len(match.group(0))/len(match.group(1)))
finds all non-overlapping repeating matches, using the shortest possible unit of repetition:
>>> list(repetitions("blablabla"))
[('bla', 3)]
>>> list(repetitions("rablabla"))
[('abl', 2)]
>>> list(repetitions("aaaaa"))
[('a', 5)]
>>> list(repetitions("aaaaablablabla"))
[('a', 5), ('bla', 3)]

python: keep char only if it is within this list

i have a list:
a = ['a','b','c'.........'A','B','C'.........'Z']
and i have string:
string1= 's#$%ERGdfhliisgdfjkskjdfW$JWLI3590823r'
i want to keep ONLY those characters in string1 that exist in a
what is the most effecient way to do this? perhaps instead of having a be a list, i should just make it a string? like this a='abcdefg..........ABC..Z' ??
This should be faster.
>>> import re
>>> string1 = 's#$%ERGdfhliisgdfjkskjdfW$JWLI3590823r'
>>> a = ['E', 'i', 'W']
>>> r = re.compile('[^%s]+' % ''.join(a))
>>> print r.sub('', string1)
EiiWW
This is even faster than that.
>>> all_else = ''.join( chr(i) for i in range(256) if chr(i) not in set(a) )
>>> string1.translate(None, all_else)
'EiiWW'
44 microsec vs 13 microsec on my laptop.
How about that?
(Edit: turned out, translate yields the best performance.)
''.join([s for s in string1 if s in a])
Explanation:
[s for s in string1 if s in a]
creates a list of all characters in string1, but only if they are also in the list a.
''.join([...])
turns it back into a string by joining it with nothing ('') in between the elements of the given list.
List comprehension to the rescue!
wanted = ''.join(letter for letter in string1 if letter in a)
(Note that when passing a list comprehension to a function you can omit the brackets so that the full list isn't generated prior to being evaluated. While semantically the same as a list comprehension, this is called a generator expression.)
If, you are going to do this with large strings, there is a faster solution using translate; see this answer.
#katrielalex: To spell it out:
import string
string1= 's#$%ERGdfhliisgdfjkskjdfW$JWLI3590823r'
non_letters= ''.join(chr(i) for i in range(256) if chr(i) not in string.letters)
print string1.translate(None,non_letters)
print 'Simpler, but possibly less correct'
print string1.translate(None, string.punctuation+string.digits+string.whitespace)

Categories

Resources