Regex matching a curly brace doesn't match any way I try - Python

While solving the trivial task of finding the start of the body of a PHP function, I can't get a regex to match no matter what I try. Here's what I expected to do the job:
import re
print re.search(r"addToHead(){", "addToHead(){\n\tcode...").group()
# addToHead is the function I'm looking for.
# --> AttributeError: 'NoneType' object has no attribute 'group'
print re.search(r"addToHead()\{", "addToHead(){\n\tcode...").group()
# Neither a single nor a double backslash works.
print re.search(r"addToHead()[\{]", "addToHead(){\n\tcode...").group()
print re.search(r"addToHead()[\x7b]", "addToHead(){\n\tcode...").group()
# Nothing works... am I missing something?
I also tried re.DOTALL, with the same unpleasant result. Am I overlooking something obvious, or is this a bug?

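Parentheses () are used to group part of a regular expression, so they have a special meaning. An unescaped () is an empty group that matches the empty string, which means your patterns actually look for addToHead{ rather than addToHead(){. To match literal parentheses you have to escape them, like \(\).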
print re.search(r"addToHead\(\){", "addToHead(){\n\tcode...").group()
Output
addToHead(){
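
A minimal sketch of the difference, written for Python 3 (the variable names are only illustrative). It shows that the unescaped pattern finds nothing, that escaping the parentheses fixes it, and that re.escape() can produce the escaped pattern from the literal text:

```python
import re

source = "addToHead(){\n\tcode..."

# Unescaped, "()" is an empty group, so this pattern really means "addToHead{".
unescaped = re.search(r"addToHead(){", source)
print(unescaped)

# Escaping the parentheses makes them literal characters.
escaped = re.search(r"addToHead\(\)\{", source)
print(escaped.group())

# re.escape() builds the escaped pattern from the literal text.
auto = re.search(re.escape("addToHead(){"), source)
print(auto.group())
```

Checking the result of re.search() for None before calling .group() avoids the AttributeError when a pattern does not match.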

Oh, now, just a minute after I posted the question I found it, it's not with the curly brace, but the standard brackets...well, I should likely delete my question, but [Meta-question] would I be able to access it as a record of my past blindness?

Related

Is there something specific for using re.search() to match multiline pattern within a function? Python

I'm trying to search r'CONTENTS\.\n+CHAPTER I\.' within a string from Gutenberg project, but I'm getting AttributeError, as it doesn't match, but the same pattern does match outside the function. My code is below:
import re
from urllib import request
def gutenberg(url):
    response = request.urlopen(url)
    raw = response.read().decode('utf8')
    print(re.search(r"CONTENTS\.\n+CHAPTER I\.", raw).group())
a = gutenberg("https://www.gutenberg.org/files/76/76-0.txt")
Output:
...
print(re.search(r"CONTENTS\.\n+CHAPTER I\.",raw).group())
AttributeError: 'NoneType' object has no attribute 'group'
And outside the function:
a="""Complete
CONTENTS.
CHAPTER I. Civilizing"""
re.search(r"CONTENTS\.\n+CHAPTER I\.",a).group()
Output:
'CONTENTS.\n\nCHAPTER I.'
Though, it works fine within the function when there's no new line character in the pattern: print(re.search(r"CONTENTS\.",raw).group()).
So, I believe I need something like flags.
What I've tried:
print(re.search(r"CONTENTS\.\n+CHAPTER I\.",raw,re.M).group())
pattern=re.compile(r'CONTENTS.\n+CHAPTER I.')
print(pattern.search(raw).group())
I even tried to add a backslash into my pattern: r"CONTENTS\.\\n+CHAPTER I\." - the same AttributeError.
I read about flags=regex.VERSION1 here but I couldn't find information about it in the last Python's regex guide, so I haven't tried to use it.
Any ideas how to search for multiline pattern within a function?
In general, what confuses me most is the different behavior of re.search() inside and outside the function. Is there a concept I'm not aware of?
Thanks in advance! I'll appreciate any help!
No, there isn't something special, and it doesn't matter whether you're "in a function" or not. The data you pulled down from that URL simply doesn't match your pattern: it has \r\n line endings and not \n. Your "outside the function" test case with the literal string is testing on different data which does match the pattern.
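
A small sketch of that diagnosis, using a made-up sample string with \r\n line endings in place of the downloaded text (the sample and the variable names are assumptions for illustration). It shows the original pattern failing and two possible ways around it:

```python
import re

# Stand-in for the downloaded text: Windows-style \r\n line endings.
raw = "Complete\r\n\r\nCONTENTS.\r\n\r\nCHAPTER I. Civilizing"

# The original pattern expects bare \n, so it finds nothing here.
strict = re.search(r"CONTENTS\.\n+CHAPTER I\.", raw)
print(strict)

# Option 1: allow an optional \r before each \n.
tolerant = re.search(r"CONTENTS\.(?:\r?\n)+CHAPTER I\.", raw)
print(repr(tolerant.group()))

# Option 2: normalize the line endings first, then reuse the original pattern.
normalized = raw.replace("\r\n", "\n")
after_fix = re.search(r"CONTENTS\.\n+CHAPTER I\.", normalized)
print(repr(after_fix.group()))
```

Printing repr() of a slice of the data is a quick way to see which line endings it really contains.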

Escaping missing parenthesis using pandas str.match

I'm having trouble with regex. I'm trying to check whether my database contains an exact match for the item name I'm working with. The problem is that sometimes the data is incomplete and I get errors. I would like to bypass regex completely, as it is not necessary at this point.
For example the code below returns re.error: missing ), unterminated subpattern at position 10 as the last item on the list is missing a parenthesis. I've tried using if database['Item Name'].str.match(item, regex=False).any(): but it's not enough as the items can be named quite similarly and I would need perfect match. I've also tried to read re module documentation but I do not understand it well enough to get rid of the problem.
Any ideas how could I bypass the issue?
import pandas as pd
database = pd.read_csv("database.csv", sep=";")
items = ["Test Name !", "Test Name (2020)", "Test name ("]
for item in items:
    if database['Item Name'].str.match(item).any():
        # do something
        pass
    else:
        # do something else
        pass
If I understand your post correctly, you are using the data you read as regex patterns. Since you don't want these treated as regexes, you might simply use string comparisons.
However, if your application requires the use of regex, you can use re.escape() to render the string literal so the paren won't be magic.
For example:
import re
string1 = 'this is a magic ( that will break your regex'
string2 = re.escape(string1) # escapes your string
re.match(string2, "this won't cause issues")
#re.match(string1, "this will cause issues")
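
Both suggestions can be sketched with the standard library alone. The candidate names below stand in for the 'Item Name' column and are invented for illustration, as are the helper function names:

```python
import re

# Hypothetical stand-in for the values of the 'Item Name' column.
candidates = ["Test Name (2020)", "Test name (old)", "Test Name ! extra"]
items = ["Test Name !", "Test Name (2020)", "Test name ("]

def has_exact_match_plain(item, names):
    # Plain string comparison: no regex involved, so nothing needs escaping.
    return any(name == item for name in names)

def has_exact_match_regex(item, names):
    # re.escape() makes "(" and other metacharacters literal;
    # fullmatch() requires the whole name to match, not just a prefix.
    pattern = re.compile(re.escape(item))
    return any(pattern.fullmatch(name) for name in names)

plain_results = [has_exact_match_plain(item, candidates) for item in items]
regex_results = [has_exact_match_regex(item, candidates) for item in items]
print(plain_results)
print(regex_results)
```

With a pandas column, the plain comparison would presumably be written as (database['Item Name'] == item).any(), which avoids regex entirely.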

Regex to match and clean quotes in python

I have a bunch of quotes scraped from Goodreads stored in a bs4.element.ResultSet, with each element of type bs4.element.Tag. I'm trying to use regex with the re module in python 3.6.3 to clean the quotes and get just the text. When I iterate and print using [print(q.text) for q in quotes] some quotes look like this
“Don't cry because it's over, smile because it happened.”
―
while others look like this:
“If you want to know what a man's like, take a good look at how he
treats his inferiors, not his equals.”
―
,
Each also has some extra blank lines at the end. My thought was I could iterate through quotes and call re.match on each quote as follows:
cleaned_quotes = []
for q in quotes:
    match = re.match(r'“[A-Z].+$”', str(q))
    cleaned_quotes.append(match.group())
I'm guessing my regex pattern didn't match anything because I'm getting the following error:
AttributeError: 'NoneType' object has no attribute 'group'
Not surprisingly, printing the list gives me a list of None objects. Any ideas on what I might be doing wrong?
As you requested this for learning purposes, here's the regex answer:
(?<=“)[\s\S]+?(?=”)
Explanation:
We use a positive lookbehind and a positive lookahead to mark the beginning and end of the pattern and to keep the quotation marks out of the result at the same time.
Inside the quotes we lazily match anything, including newlines, with [\s\S]+?
Online Demo
Sample Code:
import re
regex = r"(?<=“)[\s\S]+?(?=”)"
cleaned_quotes = []
for q in quotes:
    m = re.search(regex, str(q))
    if m:
        cleaned_quotes.append(m.group())
We do not need any regex flags here. To collect multiple matches from one string, use re.findall() or re.finditer(); Python has no g|global flag. The m|multiline flag (re.MULTILINE) only changes the positional anchors ^ and $ so that they match at the start and end of each line instead of the whole string. It does not make the dot match newlines, which is why the pattern uses [\s\S] instead of the dot to get line-spanning results.
For the same reason, putting an anchor such as $ in the middle of the pattern, before the closing quote, is just wrong.
One more thing, I use re.search() since re.match() matches only from the beginning of the string. A common gotcha. See the documentation.
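
A short sketch of that gotcha on a made-up sample in the shape of the scraped text (the sample string and variable names are assumptions). re.match() fails because the quote does not start at position 0, while re.search() finds it, even across a line break:

```python
import re

# Illustrative sample: leading whitespace, a quote spanning two lines, trailing blanks.
raw_quote = (
    "\n  “If you want to know what a man's like, take a good look at how he\n"
    "treats his inferiors, not his equals.”\n\n\n"
)

regex = r"(?<=“)[\s\S]+?(?=”)"

# re.match() only tries position 0, where there is no opening quote behind us.
from_start = re.match(regex, raw_quote)
print(from_start)

# re.search() scans the whole string for the first position that matches.
found = re.search(regex, raw_quote)

# Collapse the line break and extra spaces left inside the quote.
cleaned = " ".join(found.group().split())
print(cleaned)
```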
First of all, in your expression r'“[A-Z].+$”' the end-of-line anchor $ comes before the closing ”, which is logically not possible.
To use $ in a regex on multiline strings, you should also specify the re.MULTILINE flag.
Second, re.match only matches at the beginning of the string; it does not look for a part of the string that matches the regular expression.
That means re.search should do what you initially expected.
So the resulting regex could be (add re.DOTALL as well if a quote spans several lines, since the dot does not match newlines by default):
re.search(r'“[A-Z].+”$', str(q), re.MULTILINE)

Python Regex working different depending on the implementation?

I'm working on a file parser that needs to cut out comments from JavaScript code. The thing is, it has to be smart enough not to take a '//' sequence inside a string as the beginning of a comment. I have the following idea for doing it:
Iterate through the lines.
Find the '//' sequence first, then find all strings surrounded by quotes (' or ") in the line, and then iterate through all string matches to check whether the '//' sequence is inside or outside one of those strings. If it is outside all of them, it must be the beginning of a proper comment.
When testing code on following line (part of bigger js file of course):
document.getElementById("URL_LABEL").innerHTML="<a name=\"link\" href=\"http://"+url+"\" target=\"blank\">"+url+"</a>";
I've encountered problem. My regular expression code:
re_strings=re.compile(""" "
(?:
\\.|
[^\\"]
)*
"
|
'
(?:
[^\\']|
\\.
)*
'
""",re.VERBOSE);
for s in re.finditer(re_strings, line):
    print(s.group(0))
In python 3.2.3 (and 3.1.4) returns the following strings:
"URL_LABEL"
"<a name=\"
" href=\"
"+url+"
" target=\"
">"
"</a>"
Which is obviously wrong because \" should not exit the string. I've been debugging my regex for quite a long time and it SHOULDN'T exit here. So i used RegexBuddy (with Python compatibility) and Python regex tester at http://re-try.appspot.com/ for reference.
The most peculiar thing is they both return same, correct results other than my code, that is:
"URL_LABEL"
"<a name=\"link\" href=\"http://"
"\" target=\"blank\">"
"</a>"
My question is: what is the cause of those differences? What have I overlooked? I'm rather a beginner in both Python and regular expressions, so maybe the answer is simple...
P.S. I know that finding if the '//' sequence is inside string quotes can be accomplished with one, bigger regex. I've already tried it and met the same problem.
P.P.S I would like to know what I'm doing wrong, why there are differences in behaviour of my code and regex test applications, not find other ideas how to parse JavaScript code.
You just need to use a raw string to create the regex:
re_strings=re.compile(r""" "
etc.
"
""",re.VERBOSE);
The way you've got it, \\.|[^\\"] becomes the regex \.|[^\"], which matches a literal dot (.) or anything that's not a quotation mark ("). Add the r prefix to the string literal and it works as you intended.
See the demo here. (I also used a raw string to make sure the backslashes appeared in the target string. I don't know how you arranged that in your tests, but the backslashes obviously are present; the problem is that they're missing from your regex.)
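
The effect of the r prefix can be reproduced with a compact, single-line version of the double-quoted half of the pattern (a simplification made here for illustration, not the asker's full verbose regex):

```python
import re

# Target line with real backslashes in it, kept intact by the raw string.
line = r'x="<a name=\"link\" href=\"http://"+url+"\" target=\"blank\">"+url+"</a>";'

# Without r, Python collapses \\ to \ first, so re sees "(?:\.|[^\"])*".
broken = re.compile('"(?:\\.|[^\\"])*"')

# With r, re sees "(?:\\.|[^\\"])*": an escaped character or a non-backslash, non-quote.
fixed = re.compile(r'"(?:\\.|[^\\"])*"')

broken_matches = broken.findall(line)
fixed_matches = fixed.findall(line)

for s in fixed_matches:
    print(s)
```

Printing the .pattern attribute of each compiled object shows exactly which text the regex engine received.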
You cannot deal with matching quotes with regex. In fact, you cannot guarantee matching pairs of anything (nested pairs especially); you need a more sophisticated state machine for that (LLVM, etc.).
Source: lots of CS classes.
Also see "Matching pair tag with regex" for a more detailed explanation.
I know it's not what you wanted to hear, but that's basically the way it is. And yes, different implementations of regex can return different results for things that regex can't really do.

Why doesn't the regex match when I add groups?

I have this regex code in python :
if re.search(r"\{\\fad|fade\(\d{1,4},\d{1,4}\)\}", text):
    print(re.search(r"\{\\fad|fade\((\d{1,4}),(\d{1,4})\)\}", text).groups())
text is {\fad(200,200)}Épisode 101 : {\i1}The Ghost{\i0}\Nv. 1.03 and read from a file (don't know if that helps).
This returns the following:
(None, None)
When I change the regex in the print to r"\{\\fad\((\d{1,4}),(\d{1,4})\)\}", it returns the correct values:
('200', '200')
Can anyone see why the conditional fad|fade matches the regex in the re.search but doesn't return the correct values of the groups in the print?
Thanks.
Put extra parens around the choice: re.search(r"{\\(?:fad|fade)\((\d{1,4}),(\d{1,4})\)}", text).groups()
Also, escaping the {} braces isn't necessary here, since they are not part of a quantifier such as {1,4}; it just needlessly clutters your regexp.
The bracket is part of the or branch starting with fade, so it's looking for either "{\fad" or "fade(...". You need to group the fad|fade part together. Try:
r"\{\\(?:fad|fade)\(\d{1,4},\d{1,4}\)\}"
[Edit]
The reason you do get into the if block is because the regex is matching, but only because it detects it starts with "{\fad". However, that part of the match contains no groups. You need to match with the part that defines the groups if you want to capture them.
Try this:
r"\{\\fade?\(\d{1,4},\d{1,4}\)\}"
I think your conditional is looking for "\fad" or "fade", I think you need to move a \ outside the grouping if you want to look for "\fad" or "\fade".
Try this instead:
r"\{\\fade?\((\d{1,4}),(\d{1,4})\)\}"
The e? is an optional e.
The way you have it now matches {\fad or fade(0000,0000)}
I don't know the python dialect of regular expressions, but wouldn't you need to 'group' the "fad|fade" somehow to make sure it isn't trying to find "fad OR fade(etc..."?
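
The answers above can be checked with a short sketch (the text is a shortened version of the sample from the question; the variable names are illustrative). It shows what the ungrouped alternation actually matches and how grouping, or an optional e, restores the captures:

```python
import re

text = r"{\fad(200,200)}Épisode 101"

# "|" has the lowest precedence: this is "\{\\fad" OR "fade\(...\)\}".
loose = re.search(r"\{\\fad|fade\((\d{1,4}),(\d{1,4})\)\}", text)
print(loose.group(0))   # only the first branch matched
print(loose.groups())   # the capture groups live in the other branch

# Restrict the alternation with a non-capturing group...
grouped = re.search(r"\{\\(?:fad|fade)\((\d{1,4}),(\d{1,4})\)\}", text)
print(grouped.groups())

# ...or make the trailing "e" optional.
optional_e = re.search(r"\{\\fade?\((\d{1,4}),(\d{1,4})\)\}", text)
print(optional_e.groups())
```

Note that groups() returns strings, so the values would still need int() if numbers are wanted.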
