Regex-python: Match a string in alphabetical order [duplicate]

So I have a challenge I'm working on - find the longest string of alphabetical characters in a string. For example, "abcghiijkyxz" should result in "ghiijk" (Yes the i is doubled).
I've been doing quite a bit with loops to solve the problem - iterating over the entire string, then for each character, starting a second loop using lower and ord. No help needed writing that loop.
However, it was suggested to me that Regex would be great for this sort of thing. My regex is weak (I know how to grab a static set, my look-forwards knowledge extends to knowing they exist). How would I write a Regex to look forward, and check future characters for being next in alphabetical order? Or is the suggestion to use Regex not practical for this type of thing?
Edit: The general consensus seems to be that Regex is indeed terrible for this type of thing.

Just to demonstrate why regex is not practical for this sort of thing, here is a regex that matches ghiijk in your example abcghiijkyxz. Note that it also matches abc, y, x and z, since each of those is technically a run of characters in alphabetical order. Unfortunately, regex alone can't tell you which match is the longest, but this does give you all the candidates. Please note that this regex works in PCRE and will not work with Python's re module! Python's regex library does not currently support (*ACCEPT) either. Although I haven't tested it, the pyre2 package (a Python wrapper for Google's re2, built with Cython) claims to support the (*ACCEPT) control verb, so this may be possible in Python.
((?:a+(?(?!b)(*ACCEPT))|b+(?(?!c)(*ACCEPT))|c+(?(?!d)(*ACCEPT))|d+(?(?!e)(*ACCEPT))|e+(?(?!f)(*ACCEPT))|f+(?(?!g)(*ACCEPT))|g+(?(?!h)(*ACCEPT))|h+(?(?!i)(*ACCEPT))|i+(?(?!j)(*ACCEPT))|j+(?(?!k)(*ACCEPT))|k+(?(?!l)(*ACCEPT))|l+(?(?!m)(*ACCEPT))|m+(?(?!n)(*ACCEPT))|n+(?(?!o)(*ACCEPT))|o+(?(?!p)(*ACCEPT))|p+(?(?!q)(*ACCEPT))|q+(?(?!r)(*ACCEPT))|r+(?(?!s)(*ACCEPT))|s+(?(?!t)(*ACCEPT))|t+(?(?!u)(*ACCEPT))|u+(?(?!v)(*ACCEPT))|v+(?(?!w)(*ACCEPT))|w+(?(?!x)(*ACCEPT))|x+(?(?!y)(*ACCEPT))|y+(?(?!z)(*ACCEPT))|z+(?(?!$)(*ACCEPT)))+)
Results in:
abc
ghiijk
y
x
z
Explanation of a single option, i.e. a+(?(?!b)(*ACCEPT)):
a+ Matches a (literally) one or more times. This catches instances where several of the same characters are in sequence such as aa.
(?(?!b)(*ACCEPT)) If clause evaluating the condition.
(?!b) Condition for the if clause. Negative lookahead ensuring what follows is not b. This is because if it's not b, we want the following control verb to take effect.
(*ACCEPT) If the condition (above) is met, we accept the current solution. This control verb causes the regex to end successfully, skipping the rest of the pattern. Since this token is inside a capturing group, only that capturing group is ended successfully at that particular location, while the parent pattern continues to execute.
So what happens if the condition is not met? Well, that means that (?!b) evaluated to false. This means that the following character is, in fact, b and so we allow the matching (rather capturing in this instance) to continue. Note that the entire pattern is wrapped in (?:)+ which allows us to match consecutive options until the (*ACCEPT) control verb or end of line is met.
The only exception in this whole regular expression is z. Since it's the last character in the English alphabet (which I presume is the target of this question), we don't care what comes after it, so we can simply put z+(?(?!$)(*ACCEPT)), which ensures nothing matches after z. If you instead want to match za (circular alphabetical order matching - idk if this is the proper terminology, but it sounds right to me), you can use z+(?(?!a)(*ACCEPT)))+ as the last option.
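For comparison, here is a minimal sketch of the same idea that does run on Python's built-in re module, using plain lookaheads instead of (*ACCEPT). It is not the PCRE pattern above; the names RUN_RE, alphabetical_runs and longest_run are made up for illustration, and it assumes lowercase ASCII input only. Each letter is allowed to continue a run only when the next character is the same letter or its successor, and picking the longest match is still done in Python, not in the regex.

```python
import re
import string

LETTERS = string.ascii_lowercase

# One alternative per letter: the letter itself, but only when the next
# character is the same letter or its alphabetical successor.
# ('z' has no successor, so it may only be followed by another 'z'.)
_STEPS = "|".join(
    "%s(?=[%s%s])" % (cur, cur, nxt)
    for cur, nxt in zip(LETTERS, LETTERS[1:] + "z")
)

# Zero or more "continuing" letters, then the final letter of the run.
RUN_RE = re.compile("(?:%s)*[a-z]" % _STEPS)


def alphabetical_runs(text):
    """Return every run of consecutive alphabetical characters, left to right."""
    return RUN_RE.findall(text)


def longest_run(text):
    """Return the first longest run, or '' if there is none."""
    return max(alphabetical_runs(text), key=len, default="")


print(alphabetical_runs("abcghiijkyxz"))  # ['abc', 'ghiijk', 'y', 'x', 'z']
print(longest_run("abcghiijkyxz"))        # ghiijk
```

The pattern is generated instead of typed out by hand, which keeps the 26 near-identical alternatives out of the source.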

As mentioned, regex is not the best tool for this. Since you are interested in a continuous sequence, you can do this with a single for loop:
def LNDS(s):
    start = 0
    cur_len = 1
    max_len = 1
    for i in range(1, len(s)):
        if ord(s[i]) in (ord(s[i-1]), ord(s[i-1])+1):
            cur_len += 1
        else:
            if cur_len > max_len:
                max_len = cur_len
                start = i - cur_len
            cur_len = 1
    if cur_len > max_len:
        max_len = cur_len
        start = len(s) - cur_len
    return s[start:start+max_len]
>>> LNDS('abcghiijkyxz')
'ghiijk'
We keep a running total of how many non-decreasing characters we have seen, and when the non-decreasing sequence ends we compare it to the longest non-decreasing sequence we saw previously, updating our "best seen so far" if it is longer.

Generate all the regex substrings like ^a+b+c+$ (longest to shortest).
Then match each of those regexs against all the substrings (longest to shortest) of "abcghiijkyxz" and stop at the first match.
import re
def all_substrings(s):
    # yield every substring of s, longest first
    n = len(s)
    for i in range(n, 0, -1):
        for j in range(n - i + 1):
            yield s[j:j + i]
def longest_alphabetical_substring(s):
    for t in all_substrings("abcdefghijklmnopqrstuvwxyz"):
        # e.g. t == "abc" becomes the pattern ^a+b+c+$
        r = re.compile("^" + "".join(map(lambda x: x + "+", t)) + "$")
        for u in all_substrings(s):
            if r.match(u):
                return u
print(longest_alphabetical_substring("abcghiijkyxz"))
That prints "ghiijk".

Regex: char+ meaning a+b+c+...
Details:
+ Matches between one and unlimited times
Python code:
import re
def LNDS(text):
    array = []
    for y in range(97, 122):  # start letter: a - y
        st = r"%s+" % chr(y)
        for x in range(y+1, 123):  # following letters, up to z
            st += r"%s+" % chr(x)
            match = re.findall(st, text)
            if match:
                array.append(max(match, key=len))
            else:
                break
    if array:
        array = [max(array, key=len)]
    return array
Output:
print(LNDS('abababababab abc'))  # ['abc']
print(LNDS('abcghiijkyxz'))      # ['ghiijk']
For the string abcghiijkyxz, the regex patterns tried are (read down the left column, then the right):
a+b+            i+j+k+l+
a+b+c+          j+k+
a+b+c+d+        j+k+l+
b+c+            k+l+
b+c+d+          l+m+
c+d+            m+n+
d+e+            n+o+
e+f+            o+p+
f+g+            p+q+
g+h+            q+r+
g+h+i+          r+s+
g+h+i+j+        s+t+
g+h+i+j+k+      t+u+
g+h+i+j+k+l+    u+v+
h+i+            v+w+
h+i+j+          w+x+
h+i+j+k+        x+y+
h+i+j+k+l+      y+z+
i+j+
i+j+k+

To actually "solve" the problem, you could use
string = 'abcxyzghiijkl'
def sort_longest(string):
    stack = []
    result = []
    for idx, char in enumerate(string):
        c = ord(char)
        if idx == 0:
            # initialize our stack
            stack.append((char, c))
        elif c == stack[-1][1] or c == stack[-1][1] + 1:
            # compare it to the item before (a tuple)
            stack.append((char, c))
        else:
            # append the stack to the overall result
            # and reinitialize the stack
            result.append(stack)
            stack = [(char, c)]
    if stack:
        # flush the final run so the last character is not dropped
        result.append(stack)
    return ["".join(item[0] for item in sublst)
            for sublst in sorted(result, key=len, reverse=True)]
print(sort_longest(string))
Which yields
['ghiijkl', 'abc', 'xyz']
in this example.
The idea is to loop over the string and keep track of a stack variable which is filled by your requirements using ord().

It's really easy with regexps!
(Using trailing contexts here)
import re
rexp = re.compile(
    "".join(['(?:(?=.' + chr(ord(x)+1) + ')' + x + ')?'
             for x in "abcdefghijklmnopqrstuvwxyz"])
    + '[a-z]')
a = 'bcabhhjabjjbckjkjabckkjdefghiklmn90'
re.findall(rexp, a)
#Answer: ['bc', 'ab', 'h', 'h', 'j', 'ab', 'j', 'j', 'bc', 'k', 'jk', 'j', 'abc', 'k', 'k', 'j', 'defghi', 'klmn']

Related

Simplify and Improve for Multi-If-Statement

I am trying to randomly generate multiple short 5 base-pair DNA sequences. Among them, I want to pick the sequences that meet the following conditions:
If the first letter is A then the last letter cannot be T
If the first letter is T then the last letter cannot be A
If the first letter is C then the last letter cannot be G
If the first letter is G then the last letter cannot be C
The same requirements are repeated for the second and the second to the last letters.
I am currently using a very long If-Statement to make the first-last letter work, but I was wondering if there is a simple way to achieve the same result so I don't have to repeat the long statement for making the second-second-to-the-last letter work? If so, how should I change the code? Thank you.
import itertools
a = "ATCG"
for output in itertools.product(a, repeat=5):
    if((output[0] == 'A') and (output[4] != 'T')) or ((output[0] == 'T') and (output[4] != 'A')) or ((output[0] == 'C') and (output[4] != 'G')) or ((output[0] == 'G') and (output[4] != "C")):
        list = "".join(output)
        print(list)
Here is a dictionary containing the forbidden opposite:
forbidden = {
    'A': 'T',
    'T': 'A',
    'C': 'G',
    'G': 'C',
}
Now you can check that the character at index -1 - i is not the forbidden opposite of the one at i by doing a simple lookup. The trick is to loop only over the first half of the string:
def check(s):
    for i in range(len(s) // 2):
        if s[-1 - i] == forbidden[s[i]]:
            return False
    return True
Incidentally, this will work correctly on both even and odd string lengths.
for sequence in map(''.join, itertools.product(forbidden.keys(), repeat=5)):
    if check(sequence):
        print(sequence)
All that being as it may, it's a bit inefficient to generate a bunch of extra sequences when you only want the ones matching a specific pattern. The pattern is that each position in the first half of your string has 4 options, while each position in the second half is limited to 3. You can therefore generate only matching sequences with something like this:
import random
def generate(n=5):
    first = random.choices('ATCG', k=(n + 1) // 2)
    second = random.choices('ATC', k=n // 2)
    second = ['G' if s == forbidden[f] else s for f, s in zip(first, second)]
    return ''.join(first + second[::-1])
Given that only one character is forbidden per position, you can pick any of three characters for the second half and replace a forbidden one with the missing fourth character. The second half then gets reversed because of how you actually want to compare the halves.
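If you want every matching sequence and not a random one, the same "4 options, then 3 options" observation can be turned into a direct enumeration. The following is only a sketch under that assumption; valid_sequences is a hypothetical helper name, and FORBIDDEN repeats the dictionary from above so the snippet is self-contained.

```python
import itertools

FORBIDDEN = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}


def valid_sequences(n=5):
    """Yield every length-n sequence whose mirrored positions never pair up."""
    head_len = (n + 1) // 2   # first half, including the middle for odd n
    tail_len = n // 2         # mirrored second half
    for head in itertools.product('ATCG', repeat=head_len):
        # For each mirrored position, allow every base except the forbidden one.
        choices = [[b for b in 'ATCG' if b != FORBIDDEN[head[i]]]
                   for i in range(tail_len)]
        for tail in itertools.product(*choices):
            # tail[i] mirrors head[i], so it is placed in reverse order.
            yield ''.join(head) + ''.join(reversed(tail))


sequences = list(valid_sequences(5))
print(len(sequences))  # 576 == 4**3 * 3**2
```

Nothing is generated and then thrown away here, so the loop does exactly as much work as there are valid sequences.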
Are you looking for something like this?
You can define regular expression to filter your outputs.
To learn more about regular expression: https://docs.python.org/3/library/re.html
import itertools
import re
a = "ATCG"
case1 = ["(^[A].{3}[^T]$)",
         "(^[T].{3}[^A]$)",
         "(^[C].{3}[^G]$)",
         "(^[G].{3}[^C]$)"]
case2 = ["(^.[A].[^T].$)",
         "(^.[T].[^A].$)",
         "(^.[C].[^G].$)",
         "(^.[G].[^C].$)"]
case1_filter = '|'.join(case1)
case2_filter = '|'.join(case2)
for output in itertools.product(a, repeat=5):
    sequence = ''.join(output)
    if re.match(case1_filter, sequence) and re.match(case2_filter, sequence):
        print(''.join(output))
I'd use sets:
import itertools
a = "ATCG"
disallowed = [{'A', 'T'},
              {'C', 'G'}]
for output in itertools.product(a, repeat=5):
    first_last = {output[0], output[4]}
    second_fourth = {output[1], output[3]}
    pairs = (first_last, second_fourth)
    if all(pair not in disallowed for pair in pairs):
        sequence = "".join(output)
        print(sequence)
We're using the all function, which takes any Python iterable and returns True if all of its items evaluate to True. It short-circuits: once one value evaluates to False it stops comparing, because the result will always be False from then on.
pairs is just a tuple of the two sets. This makes it easy to iterate over the two sets. Otherwise we would just have to write the comparison twice. I'd rather do it in a loop and write the comparison once.
The sequence we pass to the all function is each of the two pairs, and we check to see whether it is not in disallowed. If both pairs are not in disallowed then all returns True.

How do i make the program print specific letters in this specific format i give to it?

So I need to code a program which, for example, if given the input 3[a]2[b], prints "aaabb", or when given 3[ab]2[c], prints "abababcc" (basically, it prints each bracketed string the given number of times, in the given order). I tried to use a for loop to iterate over the input and detect "[" characters so it would know what to print repeatedly, but I don't know how to make it also understand where that string ends.
Also, this is as far as I could get, which probably isn't too useful:
string=input()
string=string[::-1]
bulundu=6
for i in string:
    if i!="]":
        if i!="[":
            lst.append(i)
    if i=="[":
        break
The approach I took is to remove the brackets, split the items into a list, then walk the list, and if the item is a number, add that many repeats of the next item to the result for output:
import re
data = "3[a]2[b]"
# Remove brackets and convert to a list
data = re.sub(r'[\[\]]', ' ', data).split()
result = []
for i, item in enumerate(data):
    # If item is a number, print that many of the next item
    if item.isdigit():
        result.append(data[i+1] * int(item))
print(''.join(result))
# aaabb
A different approach, inspired by Subbu's use of re.findall. This approach finds all 'pairs' of numbers and letters using match groups, then multiplies them to produce the required text:
import re
data = "3[a]2[b]"
matches = re.findall(r'(\d+)\[([a-zA-Z]+)\]', data)
# [('3', 'a'), ('2', 'b')]
for x in matches:
    print(x[1] * int(x[0]), end='')
#aaabb
Lengthy and documented version using NO regex, just simple string and list manipulation:
first split the input into parts that are numbers and texts
then recombine them again
I opted to document with inline comments
This could be done like so:
# testcases are tuples of input and correct result
testcases = [("3[a]2[b]", "aaabb"),
             ("3[ab]2[c]", "abababcc"),
             ("5[12]6[c]", "1212121212cccccc"),
             ("22[a]", "a"*22)]
# now we use our algo for all those testcases
for inp, res in testcases:
    split_inp = []    # list that takes the split values of the input
    num = 0           # accumulator variable for more-than-1-digit numbers
    in_text = False   # bool that tells us if we are currently collecting letters
    # go over all letters : O(n)
    for c in inp:
        # when a [ is reached our num is complete and we need to store it
        # we collect all further letters until next ] in a list that we
        # add at the end of our split_inp
        if c == "[":
            split_inp.append(num)   # add the completed number
            num = 0                 # and reset it to 0
            in_text = True          # now in text
            split_inp.append([])    # add a list to collect letters
        # done collecting letters
        elif c == "]":
            in_text = False                         # no longer collecting, convert letters
            split_inp[-1] = ''.join(split_inp[-1])  # to text
        # between [ and ] ... simply add letter to list at end
        elif in_text:
            split_inp[-1].append(c)  # add letter
        # currently collecting numbers
        else:
            num *= 10      # increase current number by factor 10
            num += int(c)  # add newest number
    print(repr(inp), split_inp, sep="\n")  # debugging output for parsing part
    # now we need to build the string from our parsed data
    amount = 0
    result = []  # intermediate list to join ['aaa','bb']
    # iterate the list, if int remember it, if text, build composite
    for part in split_inp:
        if isinstance(part, int):
            amount = part
        else:
            result.append(part*amount)
    # join the parts
    result = ''.join(result)
    # check if all worked out
    if result == res:
        print("CORRECT: ", result + "\n")
    else:
        print(f"INCORRECT: should be '{res}' but is '{result}'\n")
Result:
'3[a]2[b]'
[3, 'a', 2, 'b']
CORRECT: aaabb
'3[ab]2[c]'
[3, 'ab', 2, 'c']
CORRECT: abababcc
'5[12]6[c]'
[5, '12', 6, 'c']
CORRECT: 1212121212cccccc
'22[a]'
[22, 'a']
CORRECT: aaaaaaaaaaaaaaaaaaaaaa
This will also handle cases like '5[12]', which some of the other solutions won't.
You can capture both the number of repetitions n and the string to repeat v in one go using the pattern below. It matches any sequence of digits (the first group we need to capture, which is why \d+ is in parentheses), followed by a [, followed by anything (the second group of interest, hence also in parentheses), followed by a ].
findall will find all these matches in the passed line; the first group of each match, the number, is cast to an int and used as a multiplier for the string. The resulting list of v * int(n) strings is then joined with an empty string. Malformed patterns may throw exceptions or return nothing.
Anyway, in code:
import re
pattern = re.compile(r"(\d+)\[(.*?)\]")
def func(x):
    return "".join([v*int(n) for n, v in pattern.findall(x)])
print(func("3[a]2[b]"))
print(func("3[ab]2[c]"))
OUTPUT
aaabb
abababcc
FOLLOW UP
Another solution which achieves the same result, without using regular expression (ok, not nice at all, I get it...):
def func(s): return "".join([int(x[0])*x[1] for x in map(lambda x:x.split("["), s.split("]")) if len(x) == 2])
I am not much more than a beginner and looking at the other answers, I thought understanding regex might be a challenge for a new contributor such as yourself since I myself haven't really dealt with regex.
The beginner friendly way to do this might be to loop through the input string and use string functions like isnumeric() and isalpha()
data = "3[a]2[b]"
chars = []
nums = []
substrings = []
for i, char in enumerate(data):
    if char.isnumeric():
        nums.append(char)
    if char.isalpha():
        chars.append(char)
for i, char in enumerate(chars):
    substrings.append(char * int(nums[i]))
string = "".join(substrings)
print(string)
OUTPUT:
aaabb
And on trying different values for data:
data = "0[a]2[b]3[p]"
OUTPUT bbppp
data = "1[a]1[a]2[a]"
OUTPUT aaaa
NOTE: In case you're not familiar with the above functions, they are string methods and are fairly self-explanatory. They are used as <your_string_here>.isalpha(), which returns True if and only if the string is non-empty and every character in it is alphabetic (whitespace, digits, and symbols return False).
And similarly for isnumeric().
For example,
"]".isnumeric() and "]".isalpha() return False
"a".isalpha() returns True
IF YOU NEED ANY CLARIFICATION ON A FUNCTION USED, PLEASE DO NOT HESITATE TO LEAVE A COMMENT

First and last occurence of a symbol (python without regex)

I am dealing with a string from 'ACGT' alphabet (a genetic sequence) padded by letters 'N' in the beginning and in the end:
NNN...NNACGT...GGCTAANNNN...NNN
I would like to find the positions where the actual sequence begins and ends. It could be easily done by using regular expressions, but I would like to have a simpler solution using basic python string operations. Your suggestions will be appreciated.
To get the remainder (removing padding from left and right) it seems like all you need is:
<YourString>.strip('N')
If you need to find indices maybe refer to lstrip and rstrip instead:
sStart = len(<YourString>)-len(<YourString>.lstrip('N'))+1
sEnd = len(<YourString>.rstrip('N'))
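As a runnable sketch of the lstrip/rstrip idea, here is a zero-based variant (the snippet above counts the start from 1). The function name sequence_bounds and the choice to return None for an all-padding string are assumptions made for illustration.

```python
def sequence_bounds(s, pad='N'):
    """Return (start, end) zero-based inclusive indices of the unpadded part,
    or None if the string is empty or consists only of padding."""
    start = len(s) - len(s.lstrip(pad))   # number of pad characters on the left
    end = len(s.rstrip(pad)) - 1          # index of the last non-pad character
    if start > end:
        return None
    return start, end


print(sequence_bounds('NNNNAANNNN'))  # (4, 5)
```

With these indices, s[start:end + 1] gives the same result as s.strip('N').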
Since you mentioned you wanted to find the 'positions'. The code below will give you the positions where the actual sequence starts and ends in the string.
s = 'NNNNAANNNN'
i, j = s.find(next((x for x in s if x != 'N'), None)), s.rfind(next((x for x in reversed(s) if x != 'N'), None))
print(i, j)
print(s[i:j+1])
#Output
4 5
AA
Use strip()
s = "NNNNNACGTGGCTAANNNNNNN"
s = s.strip('N')
print(s)

Python - packing/unpacking by letters

I'm just starting to learn python and I have this exercise that's puzzling me:
Create a function that can pack or unpack a string of letters.
So aaabb would be packed a3b2 and vice versa.
For the packing part of the function, I wrote the following
def packer(s):
    if s.isalpha():        # Defines if unpacked
        stack = []
        for i in s:
            if s.count(i) > 1:
                if (i + str(s.count(i))) not in stack:
                    stack.append(i + str(s.count(i)))
            else:
                stack.append(i)
        print("".join(stack))
    else:
        print("Something's not quite right..")
        return False
packer("aaaaaaaaaaaabbbccccd")
This seems to work all proper. But the assignment says that
if the input has (for example) the letter a after b or c, then
it should later be unpacked into its original form.
So "aaabbkka" should become a3b2k2a, not a4b2k2.
I hence figured, that I cannot use the "count()" command, since
that counts all occurrences of the item in the whole string, correct?
What would be my options here then?
On to the unpacking -
I've thought of the basics what my code needs to do -
between the " if s.isalpha():" and else, I should add an elif that
checks whether or not the string has digits in it. (I figured this would be
enough to determine whether it's the packed version or unpacked).
Create a for loop and inside of it an if sentence, which then checks for every element:
2.1. If it has a number behind it > Return (or add to an empty stack) the number times the digit
2.2. If it has no number following it > Return just the element.
Big question number 2 - how do I check whether it's a number or just another
alphabetical element following an element in the list? I guess this must be done with
slicing, but those only take integers. Could this be achieved with the index command?
Also - if this is of any relevance - so far I've basically covered lists, strings, if and for
and I've been told this exercise is doable with just those (...so if you wouldn't mind keeping this really basic)
All help appreciated for the newbie enthusiast!
SOLVED:
def packer(s):
    if s.isalpha():        # Defines if unpacked
        groups = []
        last_char = None
        for c in s:
            if c == last_char:
                groups[-1].append(c)
            else:
                groups.append([c])
            last_char = c
        return ''.join('%s%s' % (g[0], len(g)>1 and len(g) or '') for g in groups)
    else:                  # Seems to be packed
        stack = ""
        for i in range(len(s)):
            if s[i].isalpha():
                if i+1 < len(s) and s[i+1].isdigit():
                    digit = s[i+1]
                    char = s[i]
                    i += 2
                    while i < len(s) and s[i].isdigit():
                        digit += s[i]
                        i += 1
                    stack += char * int(digit)
                else:
                    stack += s[i]
            else:
                ""
        return "".join(stack)
print (packer("aaaaaaaaaaaabbbccccd"))
print (packer("a4b19am4nmba22"))
So this is my final code. Almost managed to pull it all off with just for loops and if statements.
In the end though I had to bring in the while loop to solve reading the multiple-digit numbers issue. I think I still managed to keep it simple enough. Thanks a ton millimoose and everyone else for chipping in!
A straightforward solution:
If a char is different, make a new group. Otherwise append it to the last group. Finally count all groups and join them.
def packer(s):
    groups = []
    last_char = None
    for c in s:
        if c == last_char:
            groups[-1].append(c)
        else:
            groups.append([c])
        last_char = c
    return ''.join('%s%s'%(g[0], len(g)) for g in groups)
Another approach is using re.
The regex r'(.)\1+' matches runs of the same character that are longer than 1. With re.sub you can easily encode them:
import re
regex = re.compile(r'(.)\1+')
def replacer(match):
    return match.group(1) + str(len(match.group(0)))
regex.sub(replacer, 'aaabbkka')
#=> 'a3b2k2a'
I think you can use the `itertools.groupby` function
for example
import itertools
data = 'aaassaaasssddee'
grouped_data = ((c, len(list(g))) for c, g in itertools.groupby(data))
result = ''.join(c + (str(n) if n > 1 else '') for c, n in grouped_data)
Of course, one can make this code more readable by using a generator function instead of a generator expression.
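A minimal sketch of what that generator-function variant might look like. The names pack_runs and pack are invented here for illustration and are not from the answer above.

```python
from itertools import groupby


def pack_runs(data):
    """Yield one packed token per run of identical characters."""
    for char, group in groupby(data):
        n = sum(1 for _ in group)  # length of the run, without building a list
        # A single character is emitted as-is, a longer run gets its count.
        yield char if n == 1 else char + str(n)


def pack(data):
    return ''.join(pack_runs(data))


print(pack('aaassaaasssddee'))  # a3s2a3s3d2e2
print(pack('aaabbkka'))         # a3b2k2a
```

Because groupby only groups adjacent equal characters, the trailing a in 'aaabbkka' stays a separate run, which is exactly what the exercise asks for.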
This is an implementation of the algorithm I outlined in the comments:
from collections import deque
from itertools import takewhile, count, islice
def consume(items):
    # exhaust an iterator without storing its items
    deque(items, maxlen=0)
def ilen(items):
    # count the items of an iterator by zipping it with a counter
    result = count()
    consume(zip(items, result))
    return next(result)
def pack_or_unpack(data):
    start = 0
    result = []
    while start < len(data):
        if data[start].isdigit():
            # `data` is packed, bail
            return unpack(data)
        run = run_len(data, start)
        # append the character that might repeat
        result.append(data[start])
        if run > 1:
            # append the length of the run of characters
            result.append(str(run))
        start += run
    return ''.join(result)
def run_len(data, start):
    """Return the length of the run of identical characters starting at
    `start`"""
    return ilen(takewhile(lambda c: c == data[start],
                          islice(data, start, None)))
def unpack(data):
    result = []
    for i in range(len(data)):
        if data[i].isdigit():
            # skip digits, we'll look for them below
            continue
        # packed character
        c = data[i]
        # number of repetitions
        n = 1
        if (i+1) < len(data) and data[i+1].isdigit():
            # if the next character is a digit, grab all the digits in the
            # substring starting at i+1
            n = int(''.join(takewhile(str.isdigit, data[i+1:])))
        # append the repeated character
        result.append(c*n)  # multiplying a string with a number repeats it
    return ''.join(result)
print(pack_or_unpack('aaabbc'))
print(pack_or_unpack('a3b2c'))
print(pack_or_unpack('a10'))
print(pack_or_unpack('b5c5'))
print(pack_or_unpack('abc'))
A regex-flavoured version of unpack() would be:
import re
UNPACK_RE = re.compile(r'(?P<char> [a-zA-Z]) (?P<count> \d+)?', re.VERBOSE)
def unpack_re(data):
    matches = UNPACK_RE.finditer(data)
    pairs = ((m.group('char'), m.group('count')) for m in matches)
    return ''.join(char * (int(count) if count else 1)
                   for char, count in pairs)
This code demonstrates the most straightforward (or "basic") approach of implementing that algorithm. It's not particularly elegant or idiomatic or necessarily efficient. (It would be if written in C, but Python has the caveats such as: indexing a string copies the character into a new string, and algorithms that seem to copy data excessively might be faster than trying to avoid this if the copying is done in C and the workaround was implemented with a Python loop.)
