Find the repeating substring a string is composed of, if it exists - python

How would you go about splitting a normal string into as many identical pieces as possible while using all of its characters? For example
a = "abab"
Would return "ab", whereas with
b= "ababc"
It would return "ababc", as it can't be split into identical pieces using all letters.

This is very similar, but not identical, to How can I tell if a string repeats itself in Python? – the difference being that that question only asks to determine whether a string is made up of identical repeating substrings, rather than what the repeating substring (if any) is.
The accepted (and by far the best performing) answer to that question can be adapted to return the repeating string if there is one:
def repeater(s):
    i = (s+s)[1:-1].find(s)
    if i == -1:
        return s
    else:
        return s[:i+1]
Examples:
>>> repeater('abab')
'ab'
>>> repeater('ababc')
'ababc'
>>> repeater('xyz' * 1000000)
'xyz'
>>> repeater('xyz' * 50 + 'q')
'xyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzq'
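Why the doubling trick works: s can only be found inside s+s at an offset other than 0 or len(s) if s is equal to a rotation of itself, which happens exactly when s is built from a shorter repeated unit. The following is a minimal sketch of the same idea that also reports the repeat count, with a slow brute-force version as a cross-check. The names repeating_unit and brute_force_unit are made up for this illustration and do not come from the linked answer.

```python
def repeating_unit(s):
    """Return (unit, count) with unit * count == s and unit as short as possible."""
    if not s:
        return s, 0
    # Searching from index 1 skips the trivial match at offset 0.
    # The first hit is at offset len(s) unless s is periodic.
    period = (s + s).find(s, 1)
    return s[:period], len(s) // period


def brute_force_unit(s):
    """Slow cross-check: try every prefix whose length divides len(s)."""
    for size in range(1, len(s) + 1):
        if len(s) % size == 0 and s[:size] * (len(s) // size) == s:
            return s[:size]
    return s


for sample in ["abab", "ababc", "aaaa", "xyzxyzxyz", "abcab"]:
    unit, count = repeating_unit(sample)
    print(sample, "->", unit, "x", count)
```

Using find(s, 1) on the doubled string avoids building the extra [1:-1] slice, at the cost of returning len(s) rather than -1 when nothing repeats.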

Since the repeated substring must make up the whole string, with no extra letters before or after it, an anchored regex also works:
In[4]: re.sub(r'^([a-z]+)\1$',r'\1','abab')
Out[4]: 'ab'
In[5]: re.sub(r'^([a-z]+)\1$',r'\1','ababc')
Out[5]: 'ababc'
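([a-z]+) captures a candidate substring, and \1 requires that same substring to appear again.
EDIT: to allow more than two repetitions, let \1 repeat and make the group lazy (+?) so that the shortest unit is captured; a greedy group would capture 'abcabc' here:
re.sub(r'^([a-z]+?)\1{1,}$',r'\1','abcabcabcabc')
'abc'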

Related

How can you group a very specific pattern with regex?

Problem:
https://coderbyte.com/editor/Simple%20Symbols
The str parameter will be composed of + and = symbols with
several letters between them (ie. ++d+===+c++==a) and for the string
to be true each letter must be surrounded by a + symbol. So the string
to the left would be false. The string will not be empty and will have
at least one letter.
Input:"+d+=3=+s+"
Output:"true"
Input:"f++d+"
Output:"false"
I'm trying to create a regular expression for the following problem, but I keep running into various problems. How can I write a pattern that enforces the stated rule (a letter between two plus signs, something like '+\D+')?
import re
plusReg = re.compile(r'[(+A-Za-z+)]')
plusReg.findall()
>>> []
Here I thought I could create my own character class that searches for the pattern.
import re
plusReg = re.compile(r'([\\+,\D,\\+])')
plusReg.findall('adf+a+=4=+S+')
>>> ['a', 'd', 'f', '+', 'a', '+', '=', '=', '+', 'S', '+']
Here I thought the '\\+' would single out the plus symbol and read it as a char.
mo = plusReg.search('adf+a+=4=+S+')
mo.group()
>>>'a'
Here using the same shell, I tried using the search instead of findall, but I just ended up with the first letter which isn't even surrounded by a plus.
My end result is to group the string 'adf+a+=4=+S+' into ['+a+','+S+'] and so on.
edit:
Solution:
import re
def SimpleSymbols(str):
    #added padding, because if str = 'y+4==+r+'
    #then program would return true when it should return false.
    string = '=' + str + '='
    #regex that matches if a letter *doesn't* have a + in front or back
    plusReg = re.compile(r'[^\+][A-Za-z].|.[A-Za-z][^\+]')
    #if statement that returns "true" if regex doesn't find any letters
    #without a + behind or in front
    if plusReg.search(string) is None:
        return "true"
    return "false"
print SimpleSymbols(raw_input())
I borrowed some code from ekhumoro and Sanjay. Thanks
One approach is to search the string for any letters that are either: (1) not preceded by a +, or (2) not followed by a +. This can be done using look ahead and look behind assertions:
>>> rgx = re.compile(r'(?<!\+)[a-zA-Z]|[a-zA-Z](?!\+)')
So if rgx.search(string) returns None, the string is valid:
>>> rgx.search('+a+') is None
True
>>> rgx.search('+a+b+') is None
True
but if it returns a match, the string is invalid:
>>> rgx.search('+ab+') is None
False
>>> rgx.search('+a=b+') is None
False
>>> rgx.search('a') is None
False
>>> rgx.search('+a') is None
False
>>> rgx.search('a+') is None
False
The important thing about look ahead/behind assertions is that they don't consume characters, so they can handle overlapping matches.
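To see the difference that non-consuming assertions make, here is a small sketch comparing a pattern that consumes both plus signs with one that only asserts them. The sample string '+a+b+' is the one used in the examples above; the variable names are just for illustration.

```python
import re

text = '+a+b+'

# Consuming both plus signs: the '+' after 'a' is used up by the first
# match, so it cannot also serve as the '+' in front of 'b'.
consuming = re.findall(r'\+([a-zA-Z])\+', text)

# Lookarounds only check that a '+' is present without consuming it,
# so the shared '+' can count for both letters.
lookaround = re.findall(r'(?<=\+)[a-zA-Z](?=\+)', text)

print(consuming)   # ['a']
print(lookaround)  # ['a', 'b']
```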
Something like this should do the trick:
import re
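def is_valid_str(s):
    # the lookahead leaves the closing + unconsumed, so it can also open the next letter ('+a+b+')
    return re.findall(r'[a-zA-Z]', s) == re.findall(r'\+([a-zA-Z])(?=\+)', s)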
Usage:
In [10]: is_valid_str("f++d+")
Out[10]: False
In [11]: is_valid_str("+d+=3=+s+")
Out[11]: True
I think you are on the right track. Your regular expression can be simplified to match just a letter between two plus signs:
search_pattern = re.compile(r'\+[a-zA-Z]\+')
This covers both upper- and lower-case letters. Now we can use this regex with the findall function:
results = re.findall(search_pattern, 'adf+a+=4=+S+') # returns ['+a+', '+S+']
The question asks for a boolean that says whether the string is valid, so we can wrap this all up into a function:
def is_valid_pattern(pattern_string):
    # the lookahead keeps the closing + available for the next letter, as in '+a+b+'
    search_pattern = re.compile(r'\+[a-zA-Z](?=\+)')
    letter_pattern = re.compile(r'[a-zA-Z]') # to search for all letters
    results = re.findall(search_pattern, pattern_string)
    letters = re.findall(letter_pattern, pattern_string)
    # if the number of letters equals the number of matches found with
    # the pattern, we can say that it is a valid string
    return len(results) == len(letters)
You should be looking for what isn't there, as opposed to what is. Search for something like ([^\+][A-Za-z]|[A-Za-z][^\+]). The | in the middle is a logical "or" operator. Each side checks whether it can find a letter without a "+" on its left or right, respectively. If it finds something, the string fails. If it can't find anything, there are no instances of a letter not being surrounded by "+"'s. Note that [^\+] needs an actual character to match, so a letter at the very start or end of the string is only caught if the string is padded first, as in the solution posted in the question.
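A minimal sketch of this "search for the violation" approach, assuming Python 3 and a boolean return value instead of the "true"/"false" strings the challenge asks for. The names BAD_LETTER and simple_symbols are made up for this example.

```python
import re

# Matches a letter whose left or right neighbour is something other than '+'.
BAD_LETTER = re.compile(r'[^+][A-Za-z]|[A-Za-z][^+]')


def simple_symbols(text):
    # Pad with a non-plus character so that a letter at either end of the
    # string has a neighbour for [^+] to match against.
    padded = '=' + text + '='
    return BAD_LETTER.search(padded) is None


for sample in ['+d+=3=+s+', 'f++d+', '+a+b+', 'y+4==+r+']:
    print(sample, simple_symbols(sample))
```

Without the padding, 'y+4==+r+' would pass, because the leading 'y' has no character to its left for the pattern to reject.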

How to make this simple string function "pythonic"

Coming from the C/C++ world and being a Python newbie, I wrote this simple string function that takes an input string (guaranteed to be ASCII) and returns the last four characters. If there are fewer than four characters, I want to fill the leading positions with the letter 'A'. (This was not an exercise, but a valuable part of another, more complex function.)
There are dozens of methods of doing this, from brute force, to simple, to elegant. My approach below, while functional, didn’t seem "Pythonic".
NOTE: I’m presently using Python 2.6 — and performance is NOT an issue. The input strings are short (2-8 characters), and I call this function only a few thousand times.
def copyFourTrailingChars(src_str):
    four_char_array = bytearray("AAAA")
    xfrPos = 4
    for x in src_str[::-1]:
        xfrPos -= 1
        four_char_array[xfrPos] = x
        if xfrPos == 0:
            break
    return str(four_char_array)
input_str = "7654321"
print("The output of {0} is {1}".format(input_str, copyFourTrailingChars(input_str)))
input_str = "21"
print("The output of {0} is {1}".format(input_str, copyFourTrailingChars(input_str)))
The output is:
The output of 7654321 is 4321
The output of 21 is AA21
Suggestions from Pythoneers?
I would use simple slicing and then str.rjust() to right-justify the result, using A as the fill character. Example:
def copy_four(s):
    return s[-4:].rjust(4,'A')
Demo -
>>> copy_four('21')
'AA21'
>>> copy_four('1233423')
'3423'
You can simply add four sentinel 'A' characters in front of the original string, then take the last four characters:
def copy_four(s):
    return ('AAAA'+s)[-4:]
That's simple enough!
How about something with string formatting? (Auto-numbered {} fields need Python 2.7 or later; on 2.6, write '{0}{1}{2}{3}'.)
def copy_four(s):
    return '{}{}{}{}'.format(*('A'*(4-len(s[-4:])) + s[-4:]))
Result:
>>> copy_four('abcde')
'bcde'
>>> copy_four('abc')
'Aabc'
Here's a nicer, more canonical option (on Python 2.6, write the field as '{0:A>4}'):
def copy_four(s):
    return '{:A>4}'.format(s[-4:])
Result:
>>> copy_four('abcde')
'bcde'
>>> copy_four('abc')
'Aabc'
You could use slicing to get the last 4 characters, then string repetition (* operator) and concatenation (+ operator) as below:
def trailing_four(s):
    s = s[-4:]
    s = 'A' * (4 - len(s)) + s
    return s
You can try this:
def copy_four_trailing_chars(input_string):
    list_a = ['A','A','A','A']
    # [-4:] takes the last four characters ([:-4] would drop them instead)
    str1 = input_string[-4:]
    if len(str1) < 4:
        str1 = "%s%s" % (''.join(list_a[:4-len(str1)]), str1)
    return str1
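As a quick sanity check, the sketch below runs three of the suggested variants side by side on a few inputs, including the empty string, and confirms that they agree. It assumes Python 3 (or 2.7); the function names pad_rjust, pad_prefix and pad_format are made up here to keep the variants apart.

```python
def pad_rjust(s):
    # Slice the last four characters, then left-pad with 'A'.
    return s[-4:].rjust(4, 'A')


def pad_prefix(s):
    # Prepend four sentinels, then keep the last four characters.
    return ('AAAA' + s)[-4:]


def pad_format(s):
    # Format spec: fill with 'A', right-align, minimum width 4.
    return '{:A>4}'.format(s[-4:])


for sample in ['', '1', '21', 'abc', '4321', '7654321']:
    results = {pad_rjust(sample), pad_prefix(sample), pad_format(sample)}
    assert len(results) == 1, (sample, results)
    print(repr(sample), '->', results.pop())
```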

Python: Finding Regex occurrence for variable char

I know, for example, that if I want to find lengths of all the occurrences of consecutive 'a's
in input = "1111aaaaa11111aaaaaaa111aaa", I can do
[len(s) for s in re.findall(r'a+', input)]
However, I'm not sure how to do this with a char variable. For instance,
CHAR = 'a'
[len(s) for s in re.findall(r'??????', input)] # Trying to find occurrences of CHARs..
Is there a way to do this??
Here is a general solution that should work when CHAR is a string of any length, since the non-capturing group repeats the whole string:
CHAR = 'a'
[len(s) for s in re.findall(r'(?:{})+'.format(re.escape(CHAR)), input)]
Or an alternative using itertools (single character only):
import itertools
[sum(1 for _ in g) for k, g in itertools.groupby(input) if k == CHAR]
I think what you're asking for is:
[len(s) for s in re.findall(r'{}+'.format(CHAR), input)]
Except of course that this won't work if CHAR is a special value, like \. If that's an issue:
[len(s) for s in re.findall(r'{}+'.format(re.escape(CHAR)), input)]
If you want to match two or more instead of one or more, the syntax for that is {2,}. As the docs say:
{m,n} Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 'a' characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. As an example, a{4,}b will match aaaab or a thousand 'a' characters followed by a b, but not aaab…
That gets a little ugly when we're using {} for string formatting, so let's switch to %-formatting:
[len(s) for s in re.findall(r'%s{2,}' % (re.escape(CHAR),), input)]
… or just simple concatenation:
[len(s) for s in re.findall(re.escape(CHAR) + r'{2,}', input)]
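Putting these pieces together, here is a small sketch of a helper that takes the token and the minimum run length as parameters. The name run_lengths and its parameters are hypothetical, and it assumes the token is a non-empty string. It reports each run as a number of repetitions, which equals the character count for a single-character token.

```python
import re


def run_lengths(text, token, min_run=1):
    """Return the repetition count of each run of `token` with at least `min_run` repeats."""
    # re.escape protects special characters such as '.' or '\';
    # the non-capturing group lets the quantifier apply to the whole token.
    pattern = r'(?:%s){%d,}' % (re.escape(token), min_run)
    return [len(run) // len(token) for run in re.findall(pattern, text)]


data = "1111aaaaa11111aaaaaaa111aaa"
print(run_lengths(data, 'a'))             # [5, 7, 3]
print(run_lengths(data, 'a', min_run=4))  # [5, 7]
```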

Regular expression to replace a character on odd repeated occurrences in Python

I can't get a regular expression to double a character wherever it appears in a run of odd length in Python.
Example:
char = ``...```.....``...`....`````...`
to
``...``````.....``...``....``````````...``
Runs of even length should not be replaced.
for example:
>>> import re
>>> s = "`...```.....``...`....`````...`"
>>> re.sub(r'((?<!`)(``)*`(?!`))', r'\1\1', s)
'``...``````.....``...``....``````````...``'
Maybe I'm old fashioned (or my regex skills aren't up to par), but this seems to be a lot easier to read:
import re
def double_odd(regex,string):
    """
    Look for groups that match the regex. Double every second one.
    """
    count = [0]
    def _double(match):
        count[0] += 1
        return match.group(0) if count[0]%2 == 0 else match.group(0)*2
    return re.sub(regex,_double,string)
s = "`...```.....``...`....`````...`"
print double_odd('`',s)
print double_odd('`+',s)
It seems that I might have been a little confused about what you were actually looking for. Based on the comments, this becomes even easier:
def odd_repl(match):
    """
    double a match (all of the matched text) when the length of the
    matched text is odd
    """
    g = match.group(0)
    return g*2 if len(g)%2 == 1 else g
re.sub(regex,odd_repl,your_string)
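For completeness, here is a self-contained sketch of the callback approach applied to the sample from the question, assuming Python 3 and the pattern `+ (one or more backticks) as the regex. The sample string is assembled from run lengths purely to keep the listing readable; it is the same string as in the question.

```python
import re


def odd_repl(match):
    # Double the whole run when it has an odd number of characters.
    run = match.group(0)
    return run * 2 if len(run) % 2 == 1 else run


# (number of backticks, dots that follow) for each run in the sample.
runs = [(1, "..."), (3, "....."), (2, "..."), (1, "...."), (5, "..."), (1, "")]
s = "".join("`" * n + dots for n, dots in runs)

result = re.sub(r'`+', odd_repl, s)
print(s)
print(result)
```

Matching whole runs with `+ means the callback sees each run once, so the parity test is a plain length check rather than a lookaround.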
This may be not as good as the regex solution, but works:
In [101]: s1=re.findall(r'`{1,}',char)
In [102]: s2=re.findall(r'\.{1,}',char)
In [103]: fill=s1[-1] if len(s1[-1])%2==0 else s1[-1]*2
In [104]: "".join("".join((x if len(x)%2==0 else x*2,y)) for x,y in zip(s1,s2))+fill
Out[104]: '``...``````.....``...``....``````````...``'

Swapping every second character in a string in Python

I have the following problem: I would like to write a function in Python which, given a string, returns a string where every group of two characters is swapped.
For example given "ABCDEF" it returns "BADCFE".
The length of the string would be guaranteed to be an even number.
Can you help me do this in Python?
To add another option:
>>> s = 'abcdefghijkl'
>>> ''.join([c[1] + c[0] for c in zip(s[::2], s[1::2])])
'badcfehgjilk'
import re
print re.sub(r'(.)(.)', r'\2\1', "ABCDEF")
from itertools import chain, izip_longest
''.join(chain.from_iterable(izip_longest(s[1::2], s[::2], fillvalue = '')))
You can also use islices instead of regular slices if you have very large strings or just want to avoid the copying.
Works for odd length strings even though that's not a requirement of the question.
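The print statement and izip_longest above are Python 2 only. A rough Python 3 equivalent of the itertools version, assuming the standard rename of izip_longest to zip_longest, might look like this (swap_pairs is just an illustrative name):

```python
from itertools import chain, zip_longest


def swap_pairs(s):
    # Pair each odd-indexed character with the even-indexed one before it,
    # emitting the odd-indexed character first. fillvalue='' lets a trailing
    # unpaired character pass through unchanged.
    pairs = zip_longest(s[1::2], s[::2], fillvalue='')
    return ''.join(chain.from_iterable(pairs))


print(swap_pairs("ABCDEF"))  # BADCFE
print(swap_pairs("abcde"))   # badce
```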
While the above solutions do work, there is a very simple solution in, shall we say, "layman's" terms. Someone still learning Python and strings can use the other answers without really understanding how they work or what each part of the code does, because the posters say "this works" rather than give a full explanation. The following swaps every second character in a string and is easy for beginners to follow.
It iterates through the string in steps of two (starting from index 0) and builds a new string (swapped_pair) by appending the character at the current index + 1 (the second character of the pair) followed by the character at the current index (the first character). In other words, index 1 is put at index 0, index 0 is put at index 1, and this repeats through the string.
Code is also included to deal with odd-length strings, as the swap itself only works on complete pairs.
string = "abcdefghijklmnopqrstuvwxyz123"
# optional: use this before the loop if an odd-length string should be trimmed to even length
if len(string) % 2 != 0:
    string = string[:-1]
# iteration to swap every second character in string (stops before an unpaired last character)
swapped_pair = ""
for i in range(0, len(string) - 1, 2):
    swapped_pair += (string[i + 1] + string[i])
# if the trimming step is left out, this keeps the unpaired last character of an odd-length string
if len(string) % 2 != 0:
    swapped_pair += string[-1]
print(swapped_pair)
badcfehgjilknmporqtsvuxwzy21 # output if the "needs to be even" code used
badcfehgjilknmporqtsvuxwzy213 # output if the "even or odd" code used
Here's a nifty solution:
def swapem (s):
    if len(s) < 2: return s
    return "%s%s%s"%(s[1], s[0], swapem (s[2:]))
for str in ("", "a", "ab", "abcdefgh", "abcdefghi"):
    print "[%s] -> [%s]"%(str, swapem (str))
though possibly not suitable for large strings :-)
Output is:
[] -> []
[a] -> [a]
[ab] -> [ba]
[abcdefgh] -> [badcfehg]
[abcdefghi] -> [badcfehgi]
If you prefer one-liners:
''.join(reduce(lambda x,y: x+y,[[s[1+(x<<1)],s[x<<1]] for x in range(0,len(s)>>1)]))
Here's another simple solution:
"".join([(s[i:i+2])[::-1]for i in range(0,len(s),2)])
