strange behavior of parenthesis in python regex - python

I'm writing a python regex that looks through a text document for quoted strings (quotes of airline pilots recorded from blackboxes). I started by trying to write a regex with the following rules:
Return what is between quotes.
if it opens with single, only return if it closes with single.
if it opens with double, only return if it closes with double.
For instance I don't want to match "hi there', or 'hi there", but "hi there" and 'hi there'.
I use a testing page which contains things like:
CA "Runway 18, wind 230 degrees, five knots, altimeter 30."
AA "Roger that"
18:24:10 [flap lever moving into detent]
ST: "Some passenger's pushing a switch. May I?"
So I decided to start simple:
re.findall('("|\').*?\\1', page)
########## /("|').*?\1/ <-- raw regex I think I'm going for.
This regex acts very unexpectedly.
I thought it would:
( " | " ) Match EITHER single OR double quotes, save as back reference /1.
.*? Match non-greedy wildcard.
\1 Match whatever it finds in back reference \1 (step one).
Instead, it returns an array of quotes but never anything else.
['"', '"', "'", "'"]
I'm really confused because the equivalent (afaik) regex works just fine in VIM.
\("\|'\).\{-}\1/)
My question is this:
Why does it return only what is inside parenthesis as the match? Is this a flaw in my understanding of back references? If so then why does it work in VIM?
And how do I write the regex I'm looking for in python?
Thank you for your help!

You aren't capturing anything except for the quotes, which is what Python is returning.
If you add another group, things work much better:
for quote, match in re.finditer(r'("|\')(.*?)\1', page):
print match
I prefixed your string literal with an r to make it a raw string, which is useful when you need to use a ton of backslashes (\\1 becomes \1).

You need to catch everything with an extra pair of parentheses.
re.findall('(("|\').*?\\2)', page)

Read the documentation. re.findall returns the groups, if there are any. If you want the entire match you must group it all, or use re.finditer. See this question.

Related

Python3 regex not changing \" to "

i have a json file filled with user comments (from web scraping) which I've pulled into python with pandas
import pandas as pd
data = pd.DataFrame(pd.read_json(filename, orient=columnName,encoding="utf-8"),columns=columnName)
data['full_text'] = data['full_text'].replace('^#ABC(\\u2019s)*[ ,\n]*', '', regex=True)
data['full_text'] = data['full_text'].replace('(\\u2019)', "'", regex=True)
data.to_json('new_abc_short.json',orient='records')
The messages don't completely match the respective messages online. (emojis shown as \u0234 or something, apostrophes as \u2019, forward slash in links, and quote marks have back slash.
i want to clean them up so i learnt some regex, so i can pull into python, clean them up and then resave them back to json in a different name (for now) (https://docs.python.org/3/howto/regex.html)
second line helps to remove the twitter handle (if it exists in only in the beginning), then removes 's if it was used (e.g. #ABC's ). If there was no twitter handle at the beginning (maybe used in the middle of the message) then that is kept. then it removes any spaces and commas that were left behind (again only at the beginning of the string)
e.g. "#ABC, hi there" becomes "hi there". "hi there #ABC" stays the same. "#ABC's twitter is big" would become "twitter is big"
third line helps replace every apostrophe that could not be shown (e.g. don\u2019t changes back to don't)
i have thousands of records (not all of them have issues with apostrophes, quotes, links etc), and based on the very small examples i've looked at, they seem to work
but my third one doesn't work:
data['full_text'] = data['full_text'].replace('\\"', '"', regex=True)
Example message in the json: "full_text":"#ABC How can you \"accidentally close\" my account"
i want to remove the \ next to the double quotes so it looks like the real message (i assume it is a escape character which the user obviously didn't type)
but no matter what i do, i can't remove it
from my regex learning, " is't a metacharacter. so backslash shouldn't even be there. But anyway, I've tried:
\\" (which i think should be the obvious one, i have \", no special quirk in " but there is in \ so i need another back slash to escape that)
\\\\" (some forums posts online mention needing 4 slashes
\\\" ( i think someone mention in the forum posts that they got it workin with 3)
\\\(\") (i know that brackets provide groupings so i tried different combinations)
(\\\\")
all of the above expression i encased in single quotes, and they didn't work. I thought maybe the double quote was the problem since i only had one, so i replaced the single quotes with single quotes x3
'''\\"'''
but none of the above worked for triple single quotes either
I keep rechecking the newly saved json and i keep seeing:
"full_text":"How can you \"accidentally close\" my account"
(i.e. removing #ABC with space worked, but not the back slash bit)
originally, i tried looking into converting these unicode issues i.e. using encoding="utf-8") although my experience in this is limited and it kept failing, so regex is my best option
Ow, I missed the pandas hint, so pandas replace does use regexes. But, to be clear, str.replace doesn't work with regexes. re.sub does.
Now
to match a single backslash, your regex is: "\\"
string to describe that regex: "\\\\"
when using a raw string, a double backslash is enough: r'\\'
If your string really contains a \ preceding a ", a regex that would do is:
\\(?=\")
which does a lookahead for your " (Look at regex101).
You would have to use something like:
re.sub(r'\\(?=\")',"",s,0)
or a pandas equivalent using that regex.

Correctly parsing string literals with python's re module

I'm trying to add some light markdown support for a javascript preprocessor which I'm writing in Python.
For the most part it's working, but sometimes the regex I'm using is acting a little odd, and I think it's got something to do with raw-strings and escape sequences.
The regex is: (?<!\\)\"[^\"]+\"
Yes, I am aware that it only matches strings beginning with a " character. However, this project is born out of curiosity more than anything, so I can live with it for now.
To break it down:
(?<\\)\" # The group should begin with a quotation mark that is not escaped
[^\"]+ # and match any number of at least one character that is not a quotation mark (this is the biggest problem, I know)
\" # and end at the first quotation mark it finds
That being said, I (obviously) start hitting problems with things like this:
"This is a string with an \"escaped quote\" inside it"
I'm not really sure how to say "Everything but a quotation mark, unless that mark is escaped". I tried:
([^\"]|\\\")+ # a group of anything but a quote or an escaped quote
, but that lead to very strange results.
I'm fully prepared to hear that I'm going about this all wrong. For the sake of simplicity, let's say that this regex will always start and end with double quotes (") to avoid adding another element in the mix. I really want to understand what I have so far.
Thanks for any assistance.
EDIT
As a test for the regex, I'm trying to find all string literals in the minified jQuery script with the following code (using the unutbu's pattern below):
STRLIT = r'''(?x) # verbose mode
(?<!\\) # not preceded by a backslash
" # a literal double-quote
.*? # non-greedy 1-or-more characters
(?<!\\) # not preceded by a backslash
" # a literal double-quote
'''
f = open("jquery.min.js","r")
jq = f.read()
f.close()
literals = re.findall(STRLIT,jq)
The answer below fixes almost all issues. The ones that do arise are within jquery's own regular expressions, which is a very edge case. The solution no longer misidentifies valid javascript as markdown links, which was really the goal.
I think I first saw this idea in... Jinja2's source code? Later transplanted it to Mako.
r'''(\"\"\"|\'\'\'|\"|\')((?<!\\)\\\1|.)*?\1'''
Which does the following:
(\"\"\"|\'\'\'|\"|\') matches a Python opening quote, because this happens to be taken from code for parsing Python. You probably don't need all those quote types.
((?<!\\)\\\1|.) matches: EITHER a matching quote that was escaped ONLY ONCE, OR any other character. So \\" will still be recognized as the end of the string.
*? non-greedily matches as many of those as possible.
And \1 is just the closing quote.
Alas, \\\" will still incorrectly be detected as the end of the string. (The template engines only use this to check if there is a string, not to extract it.) This is a problem very poorly suited for regular expressions; short of doing insane things in Perl, where you can embed real code inside a regex, I'm not sure it's possible even with PCRE. Though I'd love to be proven wrong. :) The killer is that (?<!...) has to be constant-length, but you want to check that there's any even number of backslashes before the closing quote.
If you want to get this correct, and not just mostly-correct, you might have to use a real parser. Have a look at parsley, pyparsing, or any of these tools.
edit: By the way, there's no need to check that the opening quote doesn't have a backslash before it. That's not valid syntax outside a string in JS (or Python).
Perhaps use two negative look behinds:
import re
text = r'''"This is a string with an \"escaped quote\" inside it". While ""===r?+r:wt.test(r)?st.parseJSON(r) :r}catch(o){}st.data(e,n,r)}else r=t}return r}function s(e){var t;for(t in e)if(("data" '''
for match in (re.findall(r'''(?x) # verbose mode
(?<!\\) # not preceded by a backslash
" # a literal double-quote
.*? # 1-or-more characters
(?<!\\) # not preceded by a backslash
" # a literal double-quote
''', text)):
print(match)
yields
"This is a string with an \"escaped quote\" inside it"
""
"data"
The question mark in .+? makes the pattern non-greedy. The non-greediness causes the pattern to match when it encounters the first unescaped double quotation mark.
Using python, the correct regex matching double quoted string is:
pattern = r'"(\.|[^"])*"'
It describes strings starts and ends with ". For each character inside the two double quotes, it's either an escaped character OR any character expect ".
unutbu's ansever is wrong because for valid string "\\\\", cannot matched by that pattern.

Python Regex working different depending on the implementation?

I'm working on a file parser that needs to cut out comments from JavaScript code. The thing is it has to be smart so it won't take '//' sequence inside string as the beggining of the comment. I have following idea to do it:
Iterate through lines.
Find '//' sequence first, then find all strings surrounded with quotes ( ' or ") in line and then iterate through all string matches to check if the '//' sequence is inside or outside one of those strings. If it is outside of them it's obvious that it'll be a proper comment begining.
When testing code on following line (part of bigger js file of course):
document.getElementById("URL_LABEL").innerHTML="<a name=\"link\" href=\"http://"+url+"\" target=\"blank\">"+url+"</a>";
I've encountered problem. My regular expression code:
re_strings=re.compile(""" "
(?:
\\.|
[^\\"]
)*
"
|
'
(?:
[^\\']|
\\.
)*
'
""",re.VERBOSE);
for s in re.finditer(re_strings,line):
print(s.group(0))
In python 3.2.3 (and 3.1.4) returns the following strings:
"URL_LABEL"
"<a name=\"
" href=\"
"+url+"
" target=\"
">"
"</a>"
Which is obviously wrong because \" should not exit the string. I've been debugging my regex for quite a long time and it SHOULDN'T exit here. So i used RegexBuddy (with Python compatibility) and Python regex tester at http://re-try.appspot.com/ for reference.
The most peculiar thing is they both return same, correct results other than my code, that is:
"URL_LABEL"
"<a name=\"link\" href=\"http://"
"\" target=\"blank\">"
"</a>"
My question is what is the cause of those differences? What have I overlooked? I'm rather a beginer in both Python and regular expressions so maybe the answer is simple...
P.S. I know that finding if the '//' sequence is inside string quotes can be accomplished with one, bigger regex. I've already tried it and met the same problem.
P.P.S I would like to know what I'm doing wrong, why there are differences in behaviour of my code and regex test applications, not find other ideas how to parse JavaScript code.
You just need to use a raw string to create the regex:
re_strings=re.compile(r""" "
etc.
"
""",re.VERBOSE);
The way you've got it, \\.|[^\\"] becomes the regex \.|[^\"], which matches a literal dot (.) or anything that's not a quotation mark ("). Add the r prefix to the string literal and it works as you intended.
See the demo here. (I also used a raw string to make sure the backslashes appeared in the target string. I don't know how you arranged that in your tests, but the backslashes obviously are present; the problem is that they're missing from your regex.)
you cannot deal with matching quotes with regex ... in fact you cannot guarantee any matching pairs of anything(and nested pairs especially) ... you need a more sophisticated statemachine for that(LLVM, etc...)
source: lots of CS classes...
and also see : Matching pair tag with regex for a more detailed explanation
I know its not what you wanted to hear but its basically just the way it is ... and yes different implementations of regex can return different results for stuff that regex cant really do

How to properly process regex match

I have strings that come from outside to my application and may look like these with quotes:
Prefix content. "Some content goes here". More contents without quotes.
Prefix content. "Another "Additional" goes here". More contents without quotes.
Prefix content. "Just another "content". More contents without quotes.
The key note is that the strings come with quotes and I need to process these quotes properly. Actually I need to catch all the content inside quotes. I tried patterns like .*(".*").* and .*(".+").* but they seem to catch only content between two closest quotes.
Looks like you just want everything from the first quote to the last quote, even if there are other quotes between. This should suffice:
".*"
The leading and trailing .* in your regex were never needed, and the leading one was distorting your results. It would initially consume the whole input, then back off just far enough to let the rest of the regex match, meaning the (".*") would only ever match the last two quotes.
You don't need the parentheses, either. The part of the string you're after is now the entire match, so you can retrieve it with group(0) instead of group(1). If there may be newlines in the string and you want to match those, too, you can change it to:
(?s)".*"
The . metacharacter normally doesn't match newlines, but (?s) turns on DOTALL mode for the rest of the regex.
EDIT: I forgot to mention that you should be using the search() method in this case, not match(). match() only works if the match is found at the very beginning of the input, as if you had added the start anchor (e.g., ^".*"). search() performs a more traditional regex match, where the match can appear anywhere in the input. (ref)
I'm not certain what you're trying to extract, so I'm guessing. I'd suggest using partition and rpartition string methods.
Does this do what you want?
>>> samples = [
... 'Prefix content. "Some content goes here". More contents without quotes.',
... 'Prefix content. "Another "Additional" goes here". More contents without quotes.',
... 'Prefix content. "Just another "content". More contents without quotes.',
... ]
>>> def get_content(data):
... return data.partition('"')[2].rpartition('"')[0]
...
>>> for sample in samples:
... print get_content(sample)
...
Some content goes here
Another "Additional" goes here
Just another "content
EDIT: I now see the another answer and I might have misunderstood your question.
Try changing this
.*(".+").*
to
.*?(".+?")
The ? will make the search non-greedy and will stop as soon as it will find the next matching character (i.e. quote). I also removed the .* at the end as it would match rest of the string (regardless of the quotes). If you want to match empty quotes as well just change the + to *. Use re.findall to extract all the content from within the quotes.
PS: I assumed your last line is wrong as it doesn't have matching quotes.
I'm not quite sure if this is what you wanted to achieve.
finditer method from re module may be helpful here.
>>> import re
>>> s = '''Prefix content. "Some content goes here". More contents without quotes.
... Prefix content. "Another "Additional" goes here". More contents without quotes.
... Prefix content. "Just another "content". More contents without quotes.'''
>>> pattern = '".+?"'
>>> results = [m.group(0) for m in re.finditer(pattern, s)]
>>> print results
['"Some content goes here"', '"Another "', '" goes here"', '"Just another "']

Python: Regex to extract part of URL found between parentheses

I have this weirdly formatted URL. I have to extract the contents in '()'.
Sample URL : http://sampleurl.com/(K(ThinkCode))/profile/view.aspx
If I can extract ThinkCode out of it, I will be a happy man! I am having a tough time with regexing special chars like '(' and '/'.
>>> foo = re.compile( r"(?<=\(K\()[^\)]*" )
>>> foo.findall( r"http://sampleurl.com/(K(ThinkCode))/profile/view.aspx" )
['ThinkCode']
Explanation
In regex-world, a lookbehind is a way of saying "I want to match ham, but only if it's preceded by spam. We write this as (?<=spam)ham. So in this case, we want to match [^\)]*, but only if it's preceded by \(K\(.
Now \(K\( is a nice, easy regex, because it's plain text! It means, match exactly the string (K(. Notice that we have to escape the brackets (by putting \ in front of them), since otherwise the regex parser would think they were part of the regex instead of a character to match!
Finally, when you put something in square brackets in regex-world, it means "any of the characters in here is OK". If you put something inside square brackets where the first character is ^, it means "any character not in here is OK". So [^\)] means "any character that isn't a right-bracket", and [^\)]* means "as many characters as possible that aren't right-brackets".
Putting it all together, (?<=\(K\()[^\)]* means "match as many characters as you can that aren't right-brackets, preceded by the string (K(.
Oh, one last thing. Because \ means something inside strings in Python as well as inside regexes, we use raw strings -- r"spam" instead of just "spam". That tells Python to ignore the \'s.
Another way
If lookbehind is a bit complicated for you, you can also use capturing groups. The idea behind those is that the regex matches patterns, but can also remember subpatterns. That means that you don't have to worry about lookaround, because you can match the entire pattern and then just extract the subpattern inside it!
To capture a group, simply put it inside brackets: (foo) will capture foo as the first group. Then, use .groups() to spit out all the groups that you matched! This is the way the other answer works.
It's not too hard, especially since / isn't actually a special character in Python regular expressions. You just backslash the literal parens you want. How about this:
s = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
mo = re.match(r"http://sampleurl\.com/\(K\(([^)]+)\)\)/profile.view\.aspx", s);
print mo.group(1)
Note the use of r"" raw strings to preserve the backslashes in the regular expression pattern string.
If you want to have special characters in a regex, you need to escape them, such as \(, \/, \\.
Matching things inside of nested parenthesis is quite a bit of a pain in regex. if that format is always the same, you could use this:
\(.*?\((.*?)\).*?\)
Basically: find a open paren, match characters until you find another open paren, group characters until I see a close paren, then make sure there are two more close paren somewhere in there.
mystr = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
import re
re.sub(r'^.*\((\w+)\).*',r'\1',mystr)

Categories

Resources