Python regex split into characters except if followed by parentheses

I have a string like "F(230,24)F[f(22)_(23);(2)%[+(45)FF]]", where each character except for parentheses and what they enclose represents a kind of instruction. A character can be followed by an optional list of arguments specified in optional parentheses.
I would like to split such a string into
['F(230,24)', 'F', '[', 'f(22)', '_(23)', ';(2)', '%', '[', '+(45)', 'F', 'F', ']', ']'], however at the moment I only get ['F(230,24)', 'F', '[', 'f(22)_(23);(2)', '%', '[', '+(45)', 'F', 'F', ']', ']'] (the substring 'f(22)_(23);(2)' was not split).
Currently I am using list(filter(None, re.split(r'([A-Za-z\[\]\+\-\^\&\\\/%_;~](?!\())', string))), which is just a mess of characters and a negative lookahead for (. list(filter(None, <list>)) is used to remove empty strings from the result.
I am aware that this is likely caused by Python's re.split having been designed not to split on a zero length match, as discussed here.
However, I was wondering what a good solution would be. Is there a better way than re.findall?
Thank you.
EDIT: Unfortunately I am not allowed to use third-party packages like the regex module.

You can use re.findall to find all single characters optionally followed by a pair of parentheses:
import re
s = "F(230,24)F[f(22)_(23);(2)%[+(45)FF]]"
re.findall("[^()](?:\([^()]*\))?", s)
['F(230,24)',
'F',
'[',
'f(22)',
'_(23)',
';(2)',
'%',
'[',
'+(45)',
'F',
'F',
']',
']']
[^()] matches a single character other than a parenthesis;
(?:\([^()]*\))? is a non-capturing group (?:...) matching a pair of parentheses and everything between them, made optional with ?.
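If the pattern is used repeatedly, it could be compiled once and wrapped in a small helper; a minimal sketch (the TOKEN and tokenize() names are just illustrative, not part of any library):
import re

TOKEN = re.compile(r"[^()](?:\([^()]*\))?")

def tokenize(s):
    # one instruction character, optionally followed by a (...) argument list
    return TOKEN.findall(s)

print(tokenize("F(230,24)F[f(22)_(23);(2)%[+(45)FF]]"))
# ['F(230,24)', 'F', '[', 'f(22)', '_(23)', ';(2)', '%', '[', '+(45)', 'F', 'F', ']', ']']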

I am aware that this is likely caused by Python's re.split having been designed not to split on a zero length match
You can use the VERSION1 flag of the regex module. Taking the example from the thread you've linked, see how split() now splits on zero-width matches as well:
>>> import regex as re
>>> re.split(r"\s+|\b", "Split along words, preserve punctuation!", flags=re.V1)
['', 'Split', 'along', 'words', ',', 'preserve', 'punctuation', '!']
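Applied to the original string, the same zero-width-split idea could look like this sketch, splitting right before every instruction character (assumes the third-party regex module; since Python 3.7, plain re.split also accepts empty matches):
import regex

s = "F(230,24)F[f(22)_(23);(2)%[+(45)FF]]"
# split just before each instruction character, except at the very start;
# the characters inside parentheses are digits and commas, so no split happens there
parts = regex.split(r"(?<!^)(?=[A-Za-z\[\]+\-^&\\/%_;~])", s, flags=regex.V1)
print(parts)
# ['F(230,24)', 'F', '[', 'f(22)', '_(23)', ';(2)', '%', '[', '+(45)', 'F', 'F', ']', ']']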

Another solution. This time the pattern recognizes tokens with the structure SYMBOL[(NUMBER[,NUMBER...])]. The function parse_it returns True and the tokens if the whole string matches the regular expression, and False and an empty string if it doesn't.
import re

def parse_it(string):
    '''
    Input:  String to parse
    Output: True|False, Tokens|empty_string
    '''
    pattern = re.compile(r'[A-Za-z\[\]\+\-\^\&\\\/%_;~](?:\(\d+(?:,\d+)*\))?')
    tokens = pattern.findall(string)
    if ''.join(tokens) == string:
        res = (True, tokens)
    else:
        res = (False, '')
    return res

good_string = 'F(230,24)F[f(22)_(23);(2)%[+(45)FF]]'
bad_string = 'F(2a30,24)F[f(22)_(23);(2)%[+(45)FF]]' # There is an 'a' in a bad place.
print(parse_it(good_string))
print(parse_it(bad_string))
Output:
(True, ['F(230,24)', 'F', '[', 'f(22)', '_(23)', ';(2)', '%', '[', '+(45)', 'F', 'F', ']', ']'])
(False, '')

Related

Split a string after multiple delimiters and include it

Hello, I'm trying to split a string without removing the delimiter, and the string can have multiple delimiters.
The delimiters can be 'D', 'M' or 'Y'
For example:
>>> string = '1D5Y4D2M'
>>> re.split(someregex, string)  # should ideally return
['1D', '5Y', '4D', '2M']
To keep the delimiter I use Python split() without removing the delimiter
>>> re.split('([^D]+D)', '1D5Y4D2M')
['', '1D', '', '5Y4D', '2M']
For multiple delimiters I use In Python, how do I split a string and keep the separators?
>>> re.split('(D|M|Y)', '1D5Y4D2M')
['1', 'D', '5', 'Y', '4', 'D', '2', 'M', '']
Combining both doesn't quite make it.
>>> re.split('([^D]+D|[^M]+M|[^Y]+Y)', string)
['', '1D', '', '5Y4D', '', '2M', '']
Any ideas?
I'd use findall() in your case. How about:
re.findall(r'\d+[DYM]', string)
Which will result in:
['1D', '5Y', '4D', '2M']
(?<=(?:D|Y|M))
You need a zero-width assertion split, which can be done with the regex module for Python.
See the demo:
https://regex101.com/r/aKV13g/1
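A minimal sketch of that idea, assuming the third-party regex module is installed:
import regex

print(regex.split(r'(?<=[DYM])', '1D5Y4D2M', flags=regex.V1))
# ['1D', '5Y', '4D', '2M', '']  -- the lookbehind also matches at the end of the string,
# hence the trailing empty string (the next answer shows how to avoid it)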
You can split at the locations right after D, Y or M but not at the end of the string with
re.split(r'(?<=[DYM])(?!$)', text)
See the regex demo. Details:
(?<=[DYM]) - a positive lookbehind that matches a location that is immediately preceded with D or Y or M
(?!$) - a negative lookahead that fails the match if the current position is the string end position.
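For the string from the question this gives (zero-width splitting with re.split needs Python 3.7 or newer):
import re

text = '1D5Y4D2M'
print(re.split(r'(?<=[DYM])(?!$)', text))
# ['1D', '5Y', '4D', '2M']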
Note
In the current scenario, (?<=[DYM]) can be used instead of a more verbose (?<=D|Y|M) since all alternatives are single characters. If you have multichar delimiters, you would have to use a non-capturing group, (?:...), with lookbehind alternatives inside it. For example, to separate right after Y, DX and MZB you would use (?:(?<=Y)|(?<=DX)|(?<=MZB)). See Python Regex Engine - "look-behind requires fixed-width pattern" Error
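For instance, a sketch of that multicharacter case (the delimiters Y, DX and MZB and the input below are purely hypothetical):
import re

text = '1DX5Y4MZB2Y'  # made-up input using the hypothetical multicharacter delimiters
print(re.split(r'(?:(?<=Y)|(?<=DX)|(?<=MZB))(?!$)', text))
# ['1DX', '5Y', '4MZB', '2Y']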
I think this will work fine without regex or split.
Time complexity is O(n):
string = '1D5Y4D2M'
temp = ''
res = []
for x in string:
    if x == 'D':
        temp += 'D'
        res.append(temp)
        temp = ''
    elif x == 'M':
        temp += 'M'
        res.append(temp)
        temp = ''
    elif x == 'Y':
        temp += 'Y'
        res.append(temp)
        temp = ''
    else:
        temp += x
print(res)
Using translate:
string = '1D5Y4D2M'
delimiters = ['D', 'Y', 'M']
# append a '*' marker after every delimiter, strip the trailing marker, then split on the markers
result = string.translate({ord(c): f'{c}*' for c in delimiters}).strip('.*').split('*')
print(result)
# ['1D', '5Y', '4D', '2M']

How do I split up the following Python string in to a list of strings?

I have a string 'Predicate(big,small)'
How do I derive the following list of strings from that, ['Predicate','(','big',',','small',')']
The names can potentially be anything, and there can also be spaces between elements, like so: Predicate (big, small). I need the whitespace taken out of the list.
So far I've tried this, but this is clearly not the result that I want
>>> str1 = 'Predicate(big,small)'
>>> list(map(str,str1))
Output:
['P', 'r', 'e', 'd', 'i', 'c', 'a', 't', 'e', '(', 'b', 'i', 'g', ',', 's', 'm', 'a', 'l', 'l', ')']
You can use re.split() to split your string on ( or ). You can capture the delimiters in the regex to include them in your final output. Combined with str.strip() to handle spaces, and filtering out any empty strings, you get something like:
import re
s = 'Predicate ( big ,small )'
[s.strip() for s in re.split(r'([\(\),])', s.strip()) if s]
# ['Predicate', '(', 'big', ',', 'small', ')']
You can use re here.
import re
text='Predicate(big,small)'
parsed = re.findall(r'\w+|[^a-zA-Z,\s]', text)
# ['Predicate', '(', 'big', 'small', ')']
\w+ matches one or more word characters (equal to [a-zA-Z0-9_]).
[^a-zA-Z,\s] matches a single character that is not a letter, a comma or whitespace (\s), so the comma itself is left out of the result.
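If the comma should also appear in the result, as in the desired output at the top of the question, a slightly wider second alternative works; a minimal sketch:
import re

text = 'Predicate (big, small)'
# \w+ grabs the names; [^\w\s] grabs any single non-word, non-space character, including the comma
print(re.findall(r'\w+|[^\w\s]', text))
# ['Predicate', '(', 'big', ',', 'small', ')']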

How can I avoid those empty strings caused by preceding or trailing whitespaces?

>>> import re
>>> re.split(r'[ "]+', ' a n" "c ')
['', 'a', 'n', 'c', '']
When there is preceding or trailing whitespace, there will be empty strings after splitting.
How can I avoid those empty strings? Thanks.
The empty values are the (empty) substrings between the start/end of the string and the adjacent matches. re.split() is not the right tool for the job.
I recommend matching what you want instead.
>>> re.findall(r'[^ "]+', ' a n" "c ')
['a', 'n', 'c']
If you must use split, you could use a list comprehension and filter it directly.
>>> [x for x in re.split(r'[ "]+', ' a n" "c ') if x != '']
['a', 'n', 'c']
That's what re.split is supposed to do. You're asking it to split the string on any runs of whitespace or quotes; if it didn't return an empty string at the start, you wouldn't be able to distinguish that case from the case with no preceding whitespace.
If what you're actually asking for is to find all runs of non-whitespace-or-quote characters, just write that:
>>> re.findall(r'[^ "]+', ' a n" "c ')
['a', 'n', 'c']
I like abarnert's solution.
However, you can also call myString.strip() before your split (maybe not the most Pythonic way).
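For the example string from the question, that could look like this sketch (note that strip() only removes the outer whitespace, not the quotes):
import re

print(re.split(r'[ "]+', ' a n" "c '.strip()))
# ['a', 'n', 'c']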

Have I found a bug in Python's str.endswith()?

According to the Python documentation:
str.endswith(suffix[, start[, end]])
Return True if the string ends with the specified suffix, otherwise return False. suffix can also be a tuple of suffixes to look for. With optional start, test beginning at that position. With optional end, stop comparing at that position.
Changed in version 2.5: Accept tuples as suffix.
The following code should return True, but it returns False in Python 2.7.3:
"hello-".endswith(('.', ',', ':', ';', '-' '?', '!'))
It seems str.endswith() ignores anything beyond the fourth tuple element:
>>> "hello-".endswith(('.', ',', ':', '-', ';' '?', '!'))
>>> True
>>> "hello;".endswith(('.', ',', ':', '-', ';' '?', '!'))
>>> False
Have I found a bug, or am I missing something?
or am I missing something?
You're missing a comma after the ';' in your tuple, between ';' and '?':
>>> "hello;".endswith(('.', ',', ':', '-', ';' '?', '!'))
False
Due to this, ; and ? are concatenated. So, the string ending with ;? will return True for this case:
>>> "hello;?".endswith(('.', ',', ':', '-', ';' '?', '!'))
True
After adding a comma, it would work as expected:
>>> "hello;".endswith(('.', ',', ':', '-', ';', '?', '!'))
True
If you write the tuple as
>>> tuple_example = ('.', ',', ':', '-', ';' '?', '!')
then the tuple will actually be
>>> tuple_example
('.', ',', ':', '-', ';?', '!')
because the adjacent literals ';' and '?' are concatenated together. That is why it returns False.
It has already been pointed out that adjacent string literals are concatenated, but I wanted to add a little additional information and context.
This is a feature that is shared with (and borrowed from) C.
Additionally, this doesn't act like the concatenation operator '+': the adjacent literals are treated exactly as if they had been written as one literal in the source, without any additional overhead.
For example:
>>> 'a' 'b' * 2
'abab'
Whether this is a useful feature or an annoying design is really a matter of opinion, but it does allow breaking a string literal across multiple lines by enclosing the pieces in parentheses.
>>> print("I don't want to type this whole string"
"literal all on one line.")
I don't want to type this whole stringliteral all on one line.
That type of usage (along with being used with #defines) is why it was useful in C in the first place and was subsequently brought along in Python.

How is this weird behaviour of escaping special characters explained?

Just for fun, I wrote this simple function to reverse a string in Python:
def reverseString(s):
    ret = ""
    for c in s:
        ret = c + ret
    return ret
Now, if I pass in the following two strings, I get interesting results.
print reverseString("Pla\net")
print reverseString("Plan\et")
The output of this is
te
alP
te\nalP
My question is: Why does the special character \n get translated into a new line when passed into the function, but not when the function parses it together by reversing n\? Also, how could I stop the function from parsing \n and instead return n\?
You should take a look at the individual character sequences to see what happens:
>>> list("Pla\net")
['P', 'l', 'a', '\n', 'e', 't']
>>> list("Plan\et")
['P', 'l', 'a', 'n', '\\', 'e', 't']
So as you can see, \n is a single character, while \e is two characters because it is not a valid escape sequence.
To prevent this from happening, escape the backslash itself, or use raw strings:
>>> list("Pla\\net")
['P', 'l', 'a', '\\', 'n', 'e', 't']
>>> list(r"Pla\net")
['P', 'l', 'a', '\\', 'n', 'e', 't']
The reason is that '\n' is a single character in the string. I'm guessing \e isn't a valid escape, so it's treated as two characters.
Look into raw strings for what you want, or just use '\\' wherever you actually want a literal '\'.
The translation is part of Python's syntax, so it only occurs when Python parses source code. It doesn't occur at other times.
In the case of your program, one of your strings already contains the single character denoted by '\n' by the time it is constructed as an object, and the other contains the two-character substring '\e'. After you reverse them, Python doesn't re-parse them.
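A small illustration of that point, assuming the string is assembled at run time instead of written as one literal:
# Escape processing happened when each literal below was parsed; joining the
# pieces at run time does not trigger it again, so the backslash stays a backslash.
built = "Pla" + "\\" + "net"
print(list(built))          # ['P', 'l', 'a', '\\', 'n', 'e', 't']
print(built == r"Pla\net")  # True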
