Splitting on regex without removing delimiters - python

So, I would like to split this text into sentences.
s = "You! Are you Tom? I am Danny."
so I get:
["You!", "Are you Tom?", "I am Danny."]
That is, I want to split the text by the regex '[.!\?]' without removing the delimiters. What is the most Pythonic way to achieve this in Python?
I am aware of these questions:
JS string.split() without removing the delimiters
Python split() without removing the delimiter
But my problem has various delimiters (.?!) which complicates the problem.

You can use re.findall with the regex .*?[.!?]; the lazy quantifier *? makes each match stop at the first delimiter it reaches:
import re
s = """You! Are you Tom? I am Danny."""
re.findall(r'.*?[.!?]', s)
# ['You!', ' Are you Tom?', ' I am Danny.']
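The leading spaces in the second and third matches can be trimmed afterwards. A minimal sketch, assuming sentences only end in ., ! or ? (the name `sentences` is just illustrative):

```python
import re

s = "You! Are you Tom? I am Danny."

# Lazy match up to the first delimiter, then trim the space left over
# from the gap between sentences.
sentences = [match.strip() for match in re.findall(r'.*?[.!?]', s)]
print(sentences)
```

Note that any trailing text without a closing delimiter is not matched by this pattern, so it would be dropped.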

Strictly speaking, you don't want to split on '!?.', but rather on the whitespace that follows those characters. The following will work:
>>> import re
>>> re.split(r'(?<=[\.\!\?])\s*', s)
['You!', 'Are you Tom?', 'I am Danny.']
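This splits on whitespace, but only if it is preceded by a ., ! or ? character. Note that \s* can also match the empty string, and since Python 3.7 re.split splits on empty matches as well, so there the result ends with an extra '' after the final period; using \s+ instead avoids that.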
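A minimal sketch of the same whitespace-based split, written with \s+ so the pattern never matches an empty string. The helper name `split_sentences` is hypothetical, and it assumes sentences are separated by at least one whitespace character:

```python
import re

def split_sentences(text):
    """Split after ., ! or ? on the whitespace that follows."""
    # (?<=[.!?]) is a fixed-width lookbehind, so the delimiter stays with
    # its sentence; \s+ consumes only the separating whitespace.
    return re.split(r'(?<=[.!?])\s+', text.strip())

print(split_sentences("You! Are you Tom? I am Danny."))
```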

If Python supported split by zero-length matches, you could achieve this by matching an empty string preceded by one of the delimiters:
(?<=[.!?])
Demo: https://regex101.com/r/ZLDXr1/1
Unfortunately, Python versions before 3.7 do not split on zero-length matches (re.split only gained that ability in 3.7). Yet the solution may still be useful in other languages that support lookbehinds.
However, based on your input/output data samples, you actually need to split on the spaces preceded by one of the delimiters. So the regex would be:
(?<=[.!?])\s+
Demo: https://regex101.com/r/ZLDXr1/2
Python demo: https://ideone.com/z6nZi5
If the spaces are optional, the re.findall solution suggested by @Psidom is the best one, I believe.
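For the optional-space case, a split-based variant is also possible on newer interpreters. This is only a sketch, assuming Python 3.7 or later (where re.split accepts zero-length matches); the helper name `split_optional_space` is made up for illustration:

```python
import re

def split_optional_space(text):
    # \s* may match the empty string, so this relies on re.split
    # splitting on zero-length matches (Python 3.7+).
    pieces = re.split(r'(?<=[.!?])\s*', text)
    # Drop the empty piece produced after the final delimiter.
    return [piece for piece in pieces if piece]

print(split_optional_space("You!Are you Tom? I am Danny."))
```

Runs of delimiters such as "..." would be split after each character, so this is only suitable when every sentence ends with a single delimiter.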

If you prefer to use split rather than match, one solution is to split with a capturing group:
splitted = filter(None, re.split(r'(.*?[\.!\?])', s))
filter removes empty strings, if any. (In Python 3, filter returns an iterator; wrap it in list() if you need a list.)
This works even if there are no spaces between sentences, or if you need to catch a trailing sentence that ends with a different punctuation sign, such as a Unicode ellipsis (or that doesn't have any at all).
It is even possible to keep your regex as is (with the escaping corrected and parentheses added):
splitted = filter(None, re.split(r'([\.!\?])', s))
Then merge the even and odd elements and remove the extra spaces, as described in:
Python split() without removing the delimiter
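The merge step could look roughly like this. It is a sketch rather than the linked answer's exact code; `split_keep_delims` is a hypothetical name, and zip_longest is used so that trailing text without a delimiter is kept:

```python
import re
from itertools import zip_longest

def split_keep_delims(text):
    # With a capturing group, re.split returns text and delimiters
    # alternately: [text0, delim0, text1, delim1, ..., tail].
    parts = re.split(r'([.!?])', text)
    # Pair each text chunk (even index) with its delimiter (odd index).
    pairs = zip_longest(parts[0::2], parts[1::2], fillvalue='')
    merged = (body + delim for body, delim in pairs)
    # Trim the surrounding spaces and skip empty leftovers.
    return [piece.strip() for piece in merged if piece.strip()]

print(split_keep_delims("You! Are you Tom? I am Danny."))
```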

Easiest way is to use nltk.
import nltk
nltk.sent_tokenize(s)
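It will return a list of all your sentences without losing the delimiters. (sent_tokenize needs NLTK's Punkt tokenizer data, which has to be downloaded once via nltk.download().)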

Related

Python: split string by a multi-character delimiter unless inside quotes

In my case the delimiter string is '   ' (3 consecutive spaces, but the answer should work for any multi-character delimiter), and an edge case text to search in could be this:
'Coord="GLOB"AL Axis=X Type="Y ZR" Color="Gray Dark" Alt="Q Z"qz Loc=End'
The solution should return the following strings:
Coord="GLOB"AL
Axis=X
Type="Y ZR"
Color="Gray Dark"
Alt="Q Z"qz
Loc=End
I've looked for regex solutions, also evaluating the inverse problem (matching a multi-character delimiter unless it is inside quotes), since the re.split function of Python 3.4.3 makes it easy to split a text by a regex pattern. I'm not sure there is a regex solution, though, so I'm also open to (efficient) non-regex solutions.
I've seen some solutions to the inverse problem that use lookahead/lookbehind patterns, but they did not work for me because Python's lookbehind (unlike some other languages' engines) requires a fixed-width pattern.
This question is not a duplicate of Regex matching spaces, but not in "strings" or similar other questions, because:
matching a single space outside quotes is different from matching a multi-character delimiter (in my example the delimiter is 3 spaces, but the question is about any multi-character delimiter);
the Python regex engine is slightly different from C++ or other languages' regex engines;
matching a delimiter is side B of my question; the direct question is about splitting a string.
import re
x='Coord="GLOB"AL Axis=X Type="Y ZR" Color="Gray Dark" Alt="Q Z"qz Loc=End'
print(re.split(r'\s+(?=(?:[^"]*"[^"]*")*[^"]*$)', x))
You need a lookahead to check that the space is not between double quotes: it only succeeds where an even number of quotes remains before the end of the string.
Output: ['Coord="GLOB"AL', 'Axis=X', 'Type="Y ZR"', 'Color="Gray Dark"', 'Alt="Q Z"qz', 'Loc=End']
For a generalized version that splits on any delimiter not inside double quotes, use:
re.split(r'delimiter(?=(?:[^"]*"[^"]*")*[^"]*$)', x)
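The generalized form can be wrapped in a small helper. This is a sketch under the assumption that quotes are balanced and never escaped; `split_outside_quotes` is a hypothetical name, and re.escape is added so the delimiter is always treated literally:

```python
import re

def split_outside_quotes(text, delimiter):
    # The lookahead succeeds only when an even number of double quotes
    # remains up to the end of the string, i.e. we are outside quotes.
    pattern = re.escape(delimiter) + r'(?=(?:[^"]*"[^"]*")*[^"]*$)'
    return re.split(pattern, text)

DELIM = ' ' * 3  # the three-space delimiter from the question
fields = ['Coord="GLOB"AL', 'Axis=X', 'Type="Y' + DELIM + 'ZR"', 'Loc=End']
sample = DELIM.join(fields)

print(split_outside_quotes(sample, DELIM))
```

Because the lookahead rescans the rest of the string at every candidate delimiter, this may get slow on very long inputs; a hand-written scanner would likely scale better there.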

python regular expression of a string

I have a Python string:
wrong_data_type is not one of the allowed values `([one_two, two_three, three_four])`
and I have a regexp:
\w+ is not one of the allowed values`\(\[\w,+\)\]`
However, it is not correct. Any help?
The regexp should be
\w+ is not one of the allowed values `\(\[(?:\w+, )*\w+\]\)`
Fixes:
Added space after values.
\]\) at the end instead of \)\].
Inside the brackets, need to allow multiple \w, so it should be \w+.
Need to have a space after ,.
Need a group around \w+, to match multiple comma-separated words using the * quantifier.
Then have to match a single last word with no comma after it.
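To check the corrected pattern against the sample string, something like the following should do. The two capture groups are an addition for illustration (they are not part of the pattern above), as are the names `PATTERN`, `message` and `allowed`:

```python
import re

# Same pattern as above, with two capture groups added:
# group 1 is the offending value, group 2 the comma-separated allowed list.
PATTERN = re.compile(
    r'(\w+) is not one of the allowed values `\(\[((?:\w+, )*\w+)\]\)`'
)

message = ('wrong_data_type is not one of the allowed values '
           '`([one_two, two_three, three_four])`')

match = PATTERN.fullmatch(message)
allowed = match.group(2).split(', ') if match else []
print(allowed)
```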
You can use the following:
\w+ is not one of the allowed values `\(\[[\w,\s]+\]\)`
data = re.search(r'\(\[[\w,\s]+\]\)', string).group()

Using the .split() function based on conditions?

How can you use the .split() function based on conditions?
Let's say I have the raw data:
Apples,Oranges,Strawberries Green beans,Tomatoes,Broccoli
My intended result is:
['Apples','Oranges','Strawberries','Green beans','Tomatoes','Broccoli']
Is it possible to split at commas, and also at a space when a capital letter follows it?
The literal interpretation of what you asked for, using re.split:
import re
pat = re.compile(r'\s(?=[A-Z])|,')
pat.split(my_str)
This is more simply done, in your case:
pat = re.compile(r'.(?=[A-Z])')
Basically, split on any character that is followed by a capital letter.
Using regex will make the code simpler than a complicated split statement.
import re
...
re.findall(", [A-Z]",data)
Note that you asked for a split on comma, space, capital, but in your example there are no spaces after the commas.

Python regex example

If I want to replace a pattern in the following statement structure:
cat&345;
bat &#hut;
I want to replace everything starting from & and ending just before the ; (not including it). What is the best way to do so?
Including or not including the & in the replacement?
>>> re.sub(r'&.*?(?=;)','REPL','cat&345;') # including
'catREPL;'
>>> re.sub(r'(?<=&).*?(?=;)','REPL','bat &#hut;') # not including
'bat &REPL;'
Explanation:
Although not required here, use an r'raw string' to avoid having to escape the backslashes that often occur in regular expressions.
.*? is a "non-greedy" match of anything, which makes the match stop at the first semicolon.
(?=;) the match must be followed by a semicolon, but it is not included in the match.
(?<=&) the match must be preceded by an ampersand, but it is not included in the match.
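Applied to both sample lines, the two variants behave as follows. This is just a sketch that reuses the patterns from this answer; the list names are illustrative:

```python
import re

lines = ['cat&345;', 'bat &#hut;']

# Replace the ampersand together with what follows it, up to the semicolon.
including = [re.sub(r'&.*?(?=;)', 'REPL', line) for line in lines]

# Keep the ampersand (lookbehind) and replace only what lies between.
excluding = [re.sub(r'(?<=&).*?(?=;)', 'REPL', line) for line in lines]

print(including)  # ['catREPL;', 'bat REPL;']
print(excluding)  # ['cat&REPL;', 'bat &REPL;']
```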
Here is a good regex
import re
result = re.sub("(?<=\\&).*(?=;)", replacementstr, searchText)
Basically this will put the replacement in between the & and the ;
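One thing to watch with the greedy .* in this pattern: if a line holds more than one & ... ; pair, it swallows everything up to the last semicolon. A short sketch with a made-up input showing the difference from the lazy .*? form:

```python
import re

text = 'a&1;b&2;'  # hypothetical input with two pairs on one line

# Greedy: runs from the first & to the last ; on the line.
greedy = re.sub(r'(?<=&).*(?=;)', 'REPL', text)

# Lazy: stops at the first ; after each &.
lazy = re.sub(r'(?<=&).*?(?=;)', 'REPL', text)

print(greedy)  # a&REPL;
print(lazy)    # a&REPL;b&REPL;
```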
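Maybe go in a different direction altogether and use HTMLParser.unescape(). The unescape() method is undocumented, but it doesn't appear to be "internal" because it doesn't have a leading underscore. (In Python 3.4 and later, the documented html.unescape() function does the same job; HTMLParser.unescape() was removed in Python 3.9.)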
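If the real data contains genuine HTML character references, decoding could look like this on Python 3.4+. This is a sketch with a made-up input string; the samples in the question (&345; and &#hut;) are not valid references, so it only helps when the actual input holds real entities:

```python
import html

# html.unescape decodes named and numeric character references.
decoded = html.unescape('fish &amp; chips &#33;')
print(decoded)  # fish & chips !
```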
You can use negated character classes to do this:
import re
st='''\
cat&345;
bat &#hut;'''
for line in st.splitlines():
    print(line)
    print(re.sub(r'([^&]*)&[^;]*;', r'\1;', line))

Find several strings with regular expressions

I'm looking for an OR capability to match on several strings with regular expressions.
# I would like to find either "-hex", "-mos", or "-sig"
# the result would be -hex, -mos, or -sig
# You see I want to get rid of the double quotes around these three strings.
# Other double quoting is OK.
# I'd like something like.
messWithCommandArgs = ' -o {} "-sig" "-r" "-sip" '
messWithCommandArgs = re.sub(
r'"(-[hex|mos|sig])"',
r"\1",
messWithCommandArgs)
This works:
messWithCommandArgs = re.sub(
r'"(-sig)"',
r"\1",
messWithCommandArgs)
Square brackets are for character classes that can only match a single character. If you want to match multiple character alternatives you need to use a group (parentheses instead of square brackets). Try changing your regex to the following:
r'"(-(?:hex|mos|sig))"'
Note that I used a non-capturing group (?:...) because you don't need another capture group, but r'"(-(hex|mos|sig))"' would actually work the same way since \1 would still be everything but the quotes.
Alternatively, you could use r'"-(hex|mos|sig)"' with r"-\1" as the replacement (since the - is no longer part of the group).
You should remove the [] metacharacters in order to match hex or mos or sig: (?:-(hex|mos|sig))
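The difference between the character class and the alternation can be seen in a quick check. A sketch only: the "-hex" argument and the second test string are added here for illustration and are not from the question:

```python
import re

args = ' -o {} "-sig" "-r" "-sip" "-hex" '  # "-hex" added for illustration

# Alternation in a non-capturing group: only the whole words match.
unquoted = re.sub(r'"(-(?:hex|mos|sig))"', r'\1', args)

# The original character class matches exactly ONE character after the
# dash, so "-sig" is left alone while a one-letter flag is unquoted.
one_char = re.sub(r'"(-[hex|mos|sig])"', r'\1', '"-sig" "-h"')

print(unquoted)
print(one_char)
```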
