Python: split string by a multi-character delimiter unless inside quotes


In my case the delimiter string is '   ' (3 consecutive spaces, but the answer should work for any multi-character delimiter), and an edge-case text to search in could be this:
'Coord="GLOB"AL Axis=X Type="Y ZR" Color="Gray Dark" Alt="Q Z"qz Loc=End'
The solution should return the following strings:
Coord="GLOB"AL
Axis=X
Type="Y ZR"
Color="Gray Dark"
Alt="Q Z"qz
Loc=End
I've looked for regex solutions, also evaluating the inverse problem (match a multi-character delimiter unless inside quotes), since the re.split function of Python 3.4.3 makes it easy to split a text by a regex pattern. I'm not sure a regex solution exists, though, so I'm also open to (efficient) non-regex solutions.
I've seen some solutions to the inverse problem that put a regex pattern inside a lookahead/lookbehind, but they did not work because Python's lookbehind (unlike the engines of some other languages) requires a fixed-width pattern.
This question is not a duplicate of "Regex matching spaces, but not in "strings"" or similar other questions, because:
matching a single space outside quotes is different from matching a multi-character delimiter (in my example the delimiter is 3 spaces, but the question is about any multi-character delimiter);
the Python regex engine is slightly different from the regex engines of C++ or other languages;
matching a delimiter is side B of my question, the direct question is about splitting a string.

import re
x = 'Coord="GLOB"AL Axis=X Type="Y ZR" Color="Gray Dark" Alt="Q Z"qz Loc=End'
print(re.split(r'\s+(?=(?:[^"]*"[^"]*")*[^"]*$)', x))
You need a lookahead to check that the whitespace is not between "": the lookahead succeeds only when an even number of double quotes remains between the match and the end of the string.
Output: ['Coord="GLOB"AL', 'Axis=X', 'Type="Y ZR"', 'Color="Gray Dark"', 'Alt="Q Z"qz', 'Loc=End']
For a generalized version, if you want to split on delimiters that are not inside "", use
re.split(r'delimiter(?=(?:[^"]*"[^"]*")*[^"]*$)', x)
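The generalized pattern can be wrapped in a small helper. This is only a sketch: the function name split_outside_quotes is made up for illustration, the sample string assumes the question's example used three spaces wherever the delimiter occurs (including inside two of the quoted values), and the approach assumes the double quotes in the input are balanced.

```python
import re

def split_outside_quotes(text, delimiter):
    """Split text on a literal delimiter, ignoring occurrences inside "...".

    Hypothetical helper, assuming balanced double quotes in text.
    """
    # re.escape lets the delimiter contain regex metacharacters safely.
    # The lookahead only succeeds when an even number of quotes follows,
    # i.e. when the delimiter is not inside a quoted section.
    pattern = re.escape(delimiter) + r'(?=(?:[^"]*"[^"]*")*[^"]*$)'
    return re.split(pattern, text)

# Assumed reconstruction of the example, with a three-space delimiter.
line = 'Coord="GLOB"AL   Axis=X   Type="Y   ZR"   Color="Gray Dark"   Alt="Q   Z"qz   Loc=End'
parts = split_outside_quotes(line, '   ')
for part in parts:
    print(part)
```

Because the lookahead rescans the rest of the string for every candidate delimiter, this may get slow on very long inputs; a hand-written scanner that tracks the "inside quotes" state would be the non-regex alternative.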

Related

Issues with re.search and unicode in python [duplicate]

I have been trying to extract certain text from PDF converted into text files. The PDF came from various sources and I don't know how they were generated.
The pattern I was trying to extract was simply two digits, followed by a hyphen, and then another two digits, e.g. 12-34. So I wrote a simple regex \d\d-\d\d and expected that to work.
However, when I tested it I found that it missed some hits. Later I noticed that at least two other "hyphens" occur in the text, represented as \u2212 and \xad. So I changed my regex to \d\d[-\u2212\xad]\d\d and it worked.
My question is: since I am going to process so many PDFs that I can't know what other hyphen variations are out there, is there a regex expression that covers all "hyphens", and hopefully looks better than the [-\u2212\xad] expression?

The solution you ask for in the question title implies a whitelisting approach, which means that you need to find the chars that you think are similar to hyphens.
You may refer to the Punctuation, Dash category (Pd); that Unicode category lists all the Unicode dash and hyphen punctuation characters.
You may use the PyPI regex module and its \p{Pd} pattern to match any Unicode hyphen.
Or, if you can only work with re, use
[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]
You may expand this list with other Unicode chars that contain "minus" in their Unicode names, see this list.
A blacklisting approach means you do not want to match specific chars between the two pairs of digits. If you want to match any non-whitespace char, you may use \S. If you want to match any punctuation or symbol, use (?:[^\w\s]|_).
Note that the "soft hyphen", U+00AD, is not included in the \p{Pd} category and won't get matched with that construct. To include it, create a character class and add it:
[\xAD\p{Pd}]
[\xAD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]
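If you would rather not hard-code the list for re, one option is to build the character class from the Unicode database that ships with Python. This is only a sketch: the names dash_class and number_pair are made up for illustration, and the exact set of Pd characters depends on the Unicode version of your Python build. The soft hyphen (U+00AD) and the minus sign (U+2212) from the question are not in Pd, so they are added by hand.

```python
import re
import sys
import unicodedata

# Every code point whose general category is Pd (Punctuation, Dash).
dash_chars = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == 'Pd'
]
# Not in Pd: soft hyphen (category Cf) and minus sign (category Sm).
dash_chars += ['\u00ad', '\u2212']

# Escape each char so that '-' is taken literally inside the class.
dash_class = '[' + ''.join(re.escape(c) for c in dash_chars) + ']'
number_pair = re.compile(r'\d\d' + dash_class + r'\d\d')

print(number_pair.findall('12-34, 56\u221278, 90\u00ad12, 3456'))
```

Scanning all code points takes a moment, so the class is best built once at import time.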

This is also a possible solution, if your regex engine allows it
/\p{Dash}/u
This will include all these characters.

Substring between known two markers extraction with problem markers

@miernic asked long ago how to extract an arbitrary string that is located between two known markers in another string.
My problem is that the two markers include regular expression metacharacters. Specifically, I need to extract ABCD from the string ('ABCD',) where the parentheses, single quotes and comma are all part of the source string. The extracted string itself might include single and double quotes, dots, parentheses, and white space. The markers are always (' and ',).
I tried to use r'' strings and lots of escape characters, and nothing works.
Pleeeease....

Converting my comment to an answer so that the solution is easy to find for future visitors.
You may use this regex, with " as the string delimiter so that the single quotes need no escaping:
r"\('(.+?)',\)"
Use the above regex in re.findall so that you get only the captured group returned from it.
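A minimal sketch of that suggestion follows. The second half shows an alternative that builds the same kind of pattern with re.escape, so the markers never have to be escaped by hand; the variable names and the second sample string are made up for illustration.

```python
import re

source = "('ABCD',)"

# The parentheses are escaped; the quotes and the comma are ordinary chars.
pattern = re.compile(r"\('(.+?)',\)")
print(pattern.findall(source))  # ['ABCD']

# Alternative: let re.escape take care of the metacharacters in the markers.
start_marker, end_marker = "('", "',)"
escaped = re.compile(re.escape(start_marker) + r"(.+?)" + re.escape(end_marker))

tricky = "('a.b (c) \"d\" e',)"
print(escaped.findall(tricky))
```

The lazy .+? stops at the first occurrence of the end marker, so a payload that itself contains ',) would be cut short.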

replace a comma only if is between two numbers [duplicate]

I'm trying to replace commas for cases like:
123,123
where the output should be:
123123
for that I tried this:
re.sub('\d,\d','','123,123')
but that also deletes the digits. How can I avoid this?
I only want to remove the comma in that particular case, which is why I'm using a regex. For this input, e.g.
'123,123 hello,word'
The desired output is:
'123123 hello,word'

You can use regex lookarounds to restrict the comma: (?<=\d),(?=\d). Use ?<= for a lookbehind and ?= for a lookahead. They are zero-length assertions and don't consume characters, so the digits matched inside the lookarounds will not be removed:
import re
re.sub(r'(?<=\d),(?=\d)', '', '123,123 hello,word')
# '123123 hello,word'

This is one of the cases where you want regular expression "lookaround assertions", which match at a position without consuming any characters (they have zero length).
Doing so allows you to match cases which would otherwise be "overlapping" in your substitution.
Here's an example:
import re
num = '123,456,7,8.012,345,6,7,8'
pattern = re.compile(r'(?<=\d),(?=\d)')
pattern.sub('', num)
# '12345678.012345678'
Note that I'm using re.compile() to make this more readable and also because that usage pattern is likely to perform better in many cases. I'm using the same regular expression as @Psidom, but written as a Python 'raw' string, which is the more common way to express regular expressions in Python.
I'm deliberately using an example where the commas are spaced so that matches would overlap if I were using a regular expression such as re.compile(r'(\d),(\d)') and substituting with back references to the captured characters, pattern.sub(r'\1\2', num). That would work for many examples, but '1,2,3' would come out as '12,3': the first match consumes the 2, so the second comma no longer has a digit available in front of it.
This is one of the main reasons that these "lookaround" (lookahead and lookbehind) assertions exist: to avoid cases where you'd have to repeatedly/recursively apply a pattern due to capture and overlap semantics. These assertions don't capture; they match "zero" characters, as with meta patterns like \b, which matches the zero-length boundary between words rather than any of the whitespace (\s) or non-"word" (\W) characters which separate words.
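The overlap problem is easy to reproduce. The following sketch runs both patterns on the same input; the variable names are made up for illustration.

```python
import re

num = '1,2,3'

# Capturing version: the match '1,2' consumes the 2, so ',3' is left
# without a digit in front of it and the second comma survives.
with_groups = re.sub(r'(\d),(\d)', r'\1\2', num)

# Lookaround version: only the comma is consumed, so every comma that
# sits between two digits is removed in a single pass.
with_lookarounds = re.sub(r'(?<=\d),(?=\d)', '', num)

print(with_groups)       # 12,3
print(with_lookarounds)  # 123
```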

Splitting on regex without removing delimiters

So, I would like to split this text into sentences.
s = "You! Are you Tom? I am Danny."
so I get:
["You!", "Are you Tom?", "I am Danny."]
That is I want to split the text by the regex '[.!\?]' without removing the delimiters. What is the most pythonic way to achieve this in python?
I am aware of these questions:
JS string.split() without removing the delimiters
Python split() without removing the delimiter
But my problem has various delimiters (.?!) which complicates the problem.

You can use re.findall with the regex .*?[.!?]; the lazy quantifier *? makes sure each match extends only up to the next delimiter:
import re
s = """You! Are you Tom? I am Danny."""
re.findall(r'.*?[.!?]', s)
# ['You!', ' Are you Tom?', ' I am Danny.']

Strictly speaking, you don't want to split on '!?.', but rather on the whitespace that follows those characters. The following will work:
>>> import re
>>> re.split(r'(?<=[.!?])\s+', s)
['You!', 'Are you Tom?', 'I am Danny.']
This splits on whitespace, but only if it is preceded by either a ., !, or ? character. (With \s* instead of \s+, Python 3.7 and later would also split on the empty match at the end of the string and return a trailing empty string.)

With a regex engine that supports splitting on zero-length matches, you can achieve this by matching an empty string preceded by one of the delimiters:
(?<=[.!?])
Demo: https://regex101.com/r/ZLDXr1/1
Unfortunately, Python's re.split did not split on zero-length matches before Python 3.7; from 3.7 onward it does. The solution may also be useful in other languages that support lookbehinds.
However, based on your input/output data samples, you rather need to split on the whitespace preceded by one of the delimiters. So the regex would be:
(?<=[.!?])\s+
Demo: https://regex101.com/r/ZLDXr1/2
Python demo: https://ideone.com/z6nZi5
If the spaces are optional, the re.findall solution suggested by @Psidom is the best one, I believe.
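As a quick check of the two options, here is a small sketch; the second sample string without spaces is made up for illustration.

```python
import re

s = "You! Are you Tom? I am Danny."

# Split on the whitespace that follows a sentence delimiter.
by_split = re.split(r'(?<=[.!?])\s+', s)
print(by_split)  # ['You!', 'Are you Tom?', 'I am Danny.']

# When the whitespace is optional, collect the sentences with findall
# and strip the leading spaces afterwards.
no_spaces = "You!Are you Tom? I am Danny."
by_findall = [m.strip() for m in re.findall(r'.*?[.!?]', no_spaces)]
print(by_findall)  # ['You!', 'Are you Tom?', 'I am Danny.']
```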

If you prefer to use split rather than match, one solution is to split with a capturing group:
splitted = list(filter(None, re.split(r'(.*?[.!?])', s)))
filter removes the empty strings, if any; in Python 3 it returns an iterator, hence the list() call.
This will work even if there are no spaces between sentences, or if you need to catch a trailing sentence that ends with a different punctuation sign, such as a Unicode ellipsis (or doesn't have any at all).
It is even possible to keep your regex as is (with the escaping corrected and parentheses added):
splitted = list(filter(None, re.split(r'([.!?])', s)))
Then merge the even and odd elements and remove the extra spaces, see:
Python split() without removing the delimiter
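The merge step could look like the sketch below. It assumes every sentence ends with one of the three delimiters; text after the last delimiter would be dropped by zip, so handle that case separately if you need it.

```python
import re

s = "You! Are you Tom? I am Danny."

# With a capturing group, re.split keeps the delimiters in the result:
# text pieces land at even indexes, delimiters at odd indexes.
parts = re.split(r'([.!?])', s)

# Glue each text piece to the delimiter that follows it, then trim.
sentences = [
    (text + delim).strip()
    for text, delim in zip(parts[0::2], parts[1::2])
]
print(sentences)  # ['You!', 'Are you Tom?', 'I am Danny.']
```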

Easiest way is to use nltk.
import nltk
nltk.sent_tokenize(s)
It will return a list of all your sentences without losing the delimiters.

Weird Python Regex Issues

whitespace_pattern = u"\s" # bug: tried to use unicode \u0020, broke regex
time_sig_pattern = \
"""^%(ws)s*time signature:%(ws)s*(?P<top>\d+)%(ws)s*\/%(ws)s*(?P<bottom>\d+)%(ws)s*$""" %{"ws": whitespace_pattern}
time_sig = compile(time_sig_pattern, U|M)
For some reason, adding the Verbose flag, X, to compile breaks the pattern.
Also, I wanted to use unicode for whitespace_pattern recognition (supposedly, we'll get patterns that use non-unicode spaces and we need to explicitly check for that one unicode character as a valid space), but the pattern keeps breaking.

VERBOSE gives you the ability to write comments in your regex to document it.
In order to do so, it ignores spaces, since you need to use line breaks to write comments.
Replace all spaces in your regex by \s to specify they are spaces you want to match in your pattern, and not just some spaces to format your comments.
What's more, you may want to use the r prefix for the string you use as a pattern. It tells Python not to interpret special notations such as \n and lets you use backslashes without escaping them.

Always define regexes with the r prefix to indicate they are raw strings.
r"""^%(ws)s*time signature:%(ws)s*(?P<top>\d+)%(ws)s*\/%(ws)s*(?P<bottom>\d+)%(ws)s*$""" %{"ws": whitespace_pattern}

When creating a regex to match Unicode characters, you do not want to use a Python unicode string. In your example the regular expression needs to see the literal characters \u0020, so you should use whitespace_pattern = r"\u0020" instead of u"\u0020" (note that the re module only understands \uXXXX escapes inside patterns from Python 3.3 onward).
As other answers have mentioned, you should also use the r prefix for time_sig_pattern; after those two changes your code should work fine.
For VERBOSE to work correctly you need to escape all whitespace in the pattern, so towards the beginning of the pattern replace the space in time signature with "\ " (quotes for clarity), \s, or [ ] as documented here.
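Putting the advice from the answers together, a verbose version of the pattern might look like the sketch below. It is written for Python 3, uses \s directly instead of the %(ws)s substitution to keep the example short, and the sample input line is made up for illustration.

```python
import re

time_sig = re.compile(r"""
    ^\s*time[ ]signature:\s*   # the literal space sits in a class so it survives
    (?P<top>\d+)\s*/\s*        # upper number, then the slash
    (?P<bottom>\d+)\s*$        # lower number
""", re.UNICODE | re.MULTILINE | re.VERBOSE)

match = time_sig.match("time signature: 3 / 4")
print(match.group('top'), match.group('bottom'))  # 3 4
```

Without the [ ] (or "\ " or \s), VERBOSE would discard the space between "time" and "signature" and the pattern would no longer match the input.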
