Substring between known two markers extraction with problem markers - python

#miernic asked long ago how to extract an arbitrary string located between two known markers in another string.
My problem is that the two markers include regular expression metacharacters. Specifically, I need to extract ABCD from the string ('ABCD',), where the parentheses, single quotes, and comma are all part of the source string. The extracted string itself might include single and double quotes, dots, parentheses, and whitespace. The markers are always (' and ',).
I tried using r'' strings and lots of escape characters, and nothing works.
Pleeeease....

Converting my comment to an answer so that the solution is easy to find for future visitors.
You may use this regex, written as a raw string with " as the string delimiter so that the single quotes in the markers need no escaping:
r"\('(.+?)',\)"
Use the above regex with re.findall so that only the captured group is returned.
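A minimal sketch of that advice, assuming Python 3. The sample string and the variable names are made up for illustration:

```python
import re

# The markers are (' and ',) ; only the parentheses need a backslash.
MARKED = re.compile(r"\('(.+?)',\)")

# Hypothetical input: two marked values, the second one containing quotes,
# parentheses, a dot and whitespace.
source = """('ABCD',) and ('it's a "test" (v1.0)',)"""

# With exactly one capturing group, findall returns just the captured text.
values = MARKED.findall(source)
print(values)
```

The non-greedy .+? stops at the first ',) it meets, so a value that itself contains the closing marker would be cut short.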

Related

Issues with re.search and unicode in python [duplicate]

I have been trying to extract certain text from PDFs converted into text files. The PDFs came from various sources and I don't know how they were generated.
The pattern I was trying to extract was simply two digits, followed by a hyphen, and then another two digits, e.g. 12-34. So I wrote a simple regex \d\d-\d\d and expected that to work.
However, when I tested it I found that it missed some hits. Later I noticed that there are at least two other hyphens, represented as \u2212 and \xad. So I changed my regex to \d\d[-\u2212\xad]\d\d and it worked.
My question is, since I am going to extract from so many PDFs that I don't know what other variations of hyphen are out there, is there any regex covering all "hyphens", and hopefully looking better than the [-\u2212\xad] expression?
The solution you ask for in the question title implies a whitelisting approach and means that you need to find the chars that you think are similar to hyphens.
You may refer to the Punctuation, Dash category: that Unicode category lists the Unicode hyphens.
You may use the PyPI regex module and its \p{Pd} pattern to match any Unicode hyphen.
Or, if you can only work with re, use
[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]
You may expand this list with other Unicode characters that contain "minus" in their Unicode names.
A blacklisting approach means you do not want to match specific chars between the two pairs of digits. If you want to match any non-whitespace, you may use \S. If you want to match any punctuation or symbols, use (?:[^\w\s]|_).
Note that the "soft hyphen", U+00AD, is not included in the \p{Pd} category and won't be matched by that construct. To include it, create a character class and add it:
[\xAD\p{Pd}]
[\xAD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]
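As a rough sketch with the standard re module only, the explicit class above can be combined with the digit pattern from the question. The sample text and names are invented, and U+2212 MINUS SIGN is added by hand because it is a math symbol rather than dash punctuation:

```python
import re

# Dash punctuation code points from the list above, plus two extras from
# the question: U+2212 MINUS SIGN and U+00AD SOFT HYPHEN.
DASH_CHARS = (
    "\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B"
    "\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D"
    "\u2212\u00AD"
)

# Two digits, any one of the dash-like characters, two digits.
TWO_PAIRS = re.compile(r"\d\d[" + DASH_CHARS + r"]\d\d")

# Hypothetical text using four different "hyphens" and one plain number.
sample = "12-34, 56\u221278, 90\u00AD12, 34\u201356, 7788"
hits = TWO_PAIRS.findall(sample)
print(hits)
```

The plain hyphen is placed first in the class so that it is read literally, while \u2010-\u2015 remains a range.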
This is also a possible solution, if your regex engine allows it
/\p{Dash}/u
This will include all these characters.

Match between two single quotes and continue matching if two single quotes appear in a row or if '.' appears in the middle?

I'm trying to set up a regex pattern that will match 'hello' twice in 'hello'blah blah blah 'hello', but that will match the full string 'hello''hello' as well as the full string 'hello'.'hello'.
To put it simply, I want to start a match when there is a single quote, and continue the match until I encounter another single quote, unless there is another single quote immediately after it or if there is a .' after it, in which case I want to continue matching until I encounter a single quote that doesn't match those conditions.
This is what I have to match values between single quotes currently:
\'[^\']*\'
I already read the solution here: How to replace single quote to two single quotes, but do nothing if two single quotes are next to each other in perl but it doesn't quite fit what I'm looking for and can't get it to match the in-between stuff.
You can use this regex:
('[^']+'(?:\.?'[^']+')*)
It looks for a set of characters enclosed in single quotes, followed optionally by some number of sets of characters enclosed in single quotes, possibly preceded by a period.
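A quick way to check the pattern against the three cases from the question, assuming Python 3 (the variable names are illustrative):

```python
import re

# One quoted chunk, then any number of further quoted chunks that follow
# immediately or after a single period.
QUOTED_RUN = re.compile(r"('[^']+'(?:\.?'[^']+')*)")

samples = [
    "'hello'blah blah blah 'hello'",
    "'hello''hello'",
    "'hello'.'hello'",
]

results = [QUOTED_RUN.findall(s) for s in samples]
for s, r in zip(samples, results):
    print(s, "->", r)
```

The first sample yields two separate matches, while the other two are each matched as one run.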
'.*?'(\.?'.*?')*
If I understand your need correctly, it's easy to construct the above regex.

Python: split string by a multi-character delimiter unless inside quotes

In my case the delimiter string is '   ' (3 consecutive spaces, but the answer should work for any multi-character delimiter), and an edge case text to search in could be this:
'Coord="GLOB"AL Axis=X Type="Y ZR" Color="Gray Dark" Alt="Q Z"qz Loc=End'
The solution should return the following strings:
Coord="GLOB"AL
Axis=X
Type="Y ZR"
Color="Gray Dark"
Alt="Q Z"qz
Loc=End
I've looked for regex solutions, also evaluating the inverse problem (match a multi-character delimiter unless inside quotes), since the re.split command of Python 3.4.3 makes it easy to split a text by a regex pattern, but I'm not sure there is a regex solution, so I'm also open to (efficient) non-regex solutions.
I've seen some solutions to the inverse problem using regex patterns containing lookahead/lookbehind, but they did not work because Python lookahead/lookbehind (unlike other languages' engines) requires a fixed-width pattern.
This question is not a duplicate of Regex matching spaces, but not in "strings" or similar other questions, because:
matching a single space outside quotes is different from matching a multi-character delimiter (in my example the delimiter is 3 spaces, but the question is about any multi-character delimiter);
Python regex engine is slightly different from C++ or other languages regex engines;
matching a delimiter is side B of my question, the direct question is about splitting a string.
import re
x='Coord="GLOB"AL Axis=X Type="Y ZR" Color="Gray Dark" Alt="Q Z"qz Loc=End'
print(re.split(r'\s+(?=(?:[^"]*"[^"]*")*[^"]*$)',x))
You need to use a lookahead to check that the space is not between "" (that is, an even number of quotes remains up to the end of the string).
Output ['Coord="GLOB"AL', 'Axis=X', 'Type="Y ZR"', 'Color="Gray Dark"', 'Alt="Q Z"qz', 'Loc=End']
For a generalized version if you want to split on delimiters not present inside "" use
re.split(r'delimiter(?=(?:[^"]*"[^"]*")*[^"]*$)',x)
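The generalized version could be wrapped in a small helper along these lines. This is only a sketch: split_outside_quotes is a hypothetical name, the sample text is invented, and re.escape is used so that a delimiter containing regex metacharacters is taken literally:

```python
import re

def split_outside_quotes(text, delimiter):
    """Split text on delimiter, ignoring delimiters inside double quotes."""
    # The lookahead succeeds only when the rest of the string holds an even
    # number of double quotes, i.e. the delimiter is not inside a quoted part.
    pattern = re.escape(delimiter) + r'(?=(?:[^"]*"[^"]*")*[^"]*$)'
    return re.split(pattern, text)

# Hypothetical input: three spaces as the delimiter, also present inside quotes.
sample = 'Axis=X   Type="Y   ZR"   Loc=End'
parts = split_outside_quotes(sample, "   ")
print(parts)
```

It assumes the quotes are balanced. The lookahead rescans the rest of the string at every candidate delimiter, so it may be slow on very long inputs.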

Python regular expression to pull text inside of HTML quotation marks

I'm attempting to pull ticker symbols from corporations' 10-K filings on EDGAR. The ticker symbol typically appears between a pair of HTML quotation marks, e.g., "‘" or "’". An example of a typical portion of relevant text:
Our common stock has been listed on the New York Stock Exchange (“NYSE”) under the symbol “RXN”
At this point I am just trying to figure out how to deal with the occurrence of one or more of a variety of quotation marks. I can write a regex that matches one particular type of quotation mark:
re.findall(r'under[^<]*the[^<]*symbol[^<]*“*[^<]*\n',fileText)
However, I can't write a regex that looks for more than one type of quotation mark. This regex produces nothing:
re.findall(r'under[^<]*the[^<]*symbol[^<]*“*‘*’*“*[^<]*\n',fileText)
Any help would be appreciated.
Your regex looks for all of the quotes occurring together, one after another. If you're looking for any one of the possibilities, you need to put parentheses around each string and OR them together:
(?:“)*|(?:‘)*|(?:’)*|(?:“)*
The ?: makes the paren groups non-capturing. I.e., the parser won't save each one as important text. As an aside, you'll probably want to use group-capturing to save the ticker symbol -- what you're actually looking for. Very quick-and-dirty (and ugly) expression that will return ['NYSE', 'RXN'] from the given string:
re.findall(r'(?:(?:“)|(?:&#14[567];)|(?:&#822[01];))(.+?)(?:(?:“)|(?:&#14[567];)|(?:&#822[01];))', fileText)
You'd probably want to only include left-quotes in the first group and right-quotes in the last group. Plus either-or quotes in both.
You can use
re.sub("&#([0-9]+);", lambda x:chr(int(x.group(1))), text)
This works because re.sub accepts a callable as the replacement. The number after "#" is the Unicode code point of the character, and Python's chr function converts it to that character.
For example:
re.sub("&#([0-9]+);", lambda x:chr(int(x.group(1))),
"this is a &#8220;test&#8220;")
results in
'this is a “test“'
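Putting the two ideas together, one possible sketch decodes the numeric entities first and then looks for the ticker between curly quotes. The function name and the sample fragment are assumptions, and real filings may use other entities than the decimal ones shown here:

```python
import re

def decode_numeric_entities(text):
    """Replace decimal character references such as &#8220; with the character."""
    return re.sub(r"&#([0-9]+);", lambda m: chr(int(m.group(1))), text)

# Hypothetical filing fragment using decimal references for curly quotes.
raw = ("listed on the New York Stock Exchange (&#8220;NYSE&#8221;) "
       "under the symbol &#8220;RXN&#8221;")
decoded = decode_numeric_entities(raw)

# After decoding, one character class per side covers the quote variants.
tickers = re.findall(
    r"under the symbol\s*[\u201C\u2018](.+?)[\u201D\u2019]", decoded
)
print(tickers)
```

Decoding first keeps the extraction regex short, because it then deals with characters rather than with every spelling of each entity.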

Match LaTeX reserved characters with regex

I have an HTML to LaTeX parser tailored to what it's supposed to do (convert snippets of HTML into snippets of LaTeX), but there is a little issue with filling in variables. The issue is that variables should be allowed to contain the LaTeX reserved characters (namely # $ % ^ & _ { } ~ \). These need to be escaped so that they won't kill our LaTeX renderer.
The program that handles the conversion and everything is written in Python, so I tried to find a nice solution. My first idea was to simply do a .replace(), but replace doesn't allow you to match only if the first is not a \. My second attempt was a regex, but I failed miserably at that.
The regex I came up with is ([^\][#\$%\^&_\{\}~\\]). I hoped that this would match any of the reserved characters, but only if it didn't have a \ in front. Unfortunately, this matches every single character in my input text. I've also tried different variations on this regex, but I can't get it to work. The variations mainly consisted of removing/adding slashes in the second part of the regex.
Can anyone help with this regex?
EDIT Whoops, I seem to have included the slashes as well. Shows how awake I was when I posted this :) They shouldn't be escaped in my case, but it's relatively easy to remove them from the regexes in the answers. Thanks all!
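[^\] was meant as a character class for anything that is not a \, but a negated class still consumes a character, and the unescaped \ also escapes the ] after it, so the class does not end where you intended. That is why it matches nearly every character. You want a negative lookbehind assertion, with the backslash escaped:
((?<!\\)[#\$%\^&_\{\}~\\])
(?<!...) matches whatever follows it only when ... does not appear immediately before it. You can check this out in the Python docs.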
The regex ([^\][#\$%\^&_\{\}~\\]) is matching anything that isn't found between the first [ and the last ], so it should be matching everything except for what you want it to.
Moving the parentheses around should fix your original regex: ([^\\])[#\$%\^&_\{\}~\\].
I would try using regex lookbehinds, which won't match the character preceding what you want to escape. I'm not a regex expert so perhaps there is a better pattern, but this should work (?<!\\)[#\$%\^&_\{\}~\\].
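A small sketch of how the lookbehind could drive the actual escaping, assuming Python 3. The function name is made up. The backslash is left out of the class, as the EDIT in the question asks, and ^ and ~ are left out as well because a bare backslash in front of them is an accent command in LaTeX:

```python
import re

# Reserved characters that LaTeX accepts with a plain backslash prefix,
# matched only when no backslash stands immediately before them.
UNESCAPED = re.compile(r"(?<!\\)[#$%&_{}]")

def escape_latex(text):
    """Prefix each unescaped reserved character with a backslash."""
    return UNESCAPED.sub(lambda m: "\\" + m.group(0), text)

escaped = escape_latex(r"50% of $x_1, already \% escaped")
print(escaped)
```

The lookbehind consumes nothing, so the character before the match is left untouched and a reserved character at the very start of the string is handled too.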
If you're looking to find special characters that aren't escaped, without eliminating special chars preceded by escaped backslashes (e.g. you do want to match the last backslash in abc\\\def), try this:
(?<!\\)(\\\\)*[#\$%\^&_\{\}~\\]
This will match any of your special characters preceded by an even number (this includes 0) of backslashes. It says the character can be preceded by any number of pairs of backslashes, with a negative lookbehind to say those backslashes can't be preceded by another backslash.
The match will include the backslashes, but if you stick another in front of all of them, it'll achieve the same effect of escaping the special char, anyway.
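The even-number-of-backslashes idea can be tried out like this. It is only a sketch: the function name and the input are invented, and the class is reduced to characters that take a plain backslash prefix:

```python
import re

# Zero or more pairs of backslashes (each pair is an escaped backslash),
# not preceded by a further backslash, followed by a reserved character.
EVEN_BACKSLASHES = re.compile(r"(?<!\\)(?:\\\\)*[#$%&_{}]")

def escape_unescaped(text):
    """Put one backslash in front of each match, pairs included."""
    return EVEN_BACKSLASHES.sub(lambda m: "\\" + m.group(0), text)

# \\% is an escaped backslash followed by a bare %, \% is already escaped,
# and the last % is bare.
fixed = escape_unescaped(r"a\\% b\% c%")
print(fixed)
```

With two backslashes before the %, the result carries three, which LaTeX reads as an escaped backslash followed by an escaped percent sign.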
