For regex, r'[!-\.&]' what does it mean?

For regex, r'[!-\.&]' what does it mean? - python

Is it !-\ (characters from 33=ord('!') to 92=ord('\')
and '.' and '&' in a set?
I think my interpretation is incorrect based on my test.
But python reference doesn't say anything wrong with my interpretation.
http://docs.python.org/library/re.html

In short, r'[!-\.&]' is just a complicated form of writing r'[!-.]'.
It matches all characters with ord between 33 = ord('!') and 46 = ord('.'), i.e. any of the following:
!"#$%&\'()*+,-.
The escaping backslash before . is ignored in character classes; it is unnecessary (. matching all characters in a character class wouldn't make any sense). Since the ampersand & is already in the character class, it is superfluous as well.

Tests may show that the pattern matches chr(33) through chr(46), but the pattern is not guaranteed to work that way on all systems. Here's why. Character sets vary from system to system.
This is why the Perl regex documentation specifically recommends “to use only ranges that begin from and end at either alphabetics of equal case ([a-e], [A-E]), or digits ([0-9]). Anything else is unsafe.” (Perl regex is relevant because that's the regex used by Python.)
So, if this pattern is ever run on an EBCDIC based platform, it will match a different set of characters. It is only correct to say that the pattern matches chr(33) through chr(46) on ASCII based platforms.

It seems that the intention of this regex is to match any character between "!" and "." (notice that the slash is escaping the "." character), which are ! " # $ % & ' ( ) * + , - . (from the Unicode table at http://www.tamasoft.co.jp/en/general-info/unicode.html).
Two comments about the expression:
Usually, you don't need to escape characters within brackets [] (except, maybe, by the \ itself).
The ampersand symbol "&" is already contained in the range defined by "!-.", so it is redundant.

The backslash escapes the dot and the range will thus be from ! to .. The regex will match:
!"#$%&'()*+,-.
The last & is not necessary since it's included in the range, and escaping a dot is not needed either since it's inside a character class.

Related

Python regex OR expression

I have a file named Document.pdf and sometimes it is called Document-12345678.pdf where -12345678 is a random number.
I want to check a file is downloaded in folder. When the file is not finished it display Document.pdf.fkasfmq or Document-12345678.pdf.fkasfmq where .fkasfmq is a random hash from the downloader and I don't want it to match.
I try make a regex like r'Document(?:[\-0-9]+).pdf' and test it with either Document.pdf or Document-12345678.pdf it will always return false.
From my understanding (?:[\-0-9]+) means it can be or not in the set that matches any hyphen and any numbers before .pdf, is that correct? I am very very rusty with regex...

The parentheses only perform grouping, not optionality. If you want to make the expression optional, the ? quantifier does that (and actually the parentheses are unnecessary, as the character class is a single expression). Though as #anubhava notes in a comment, you might as well use the * quantifier then.
r'Document[-0-9]*\.pdf'
Notice also the backslash to match a literal dot; an unescaped . matches any character (other than newline). Inside a character class, an initial or final hyphen does not need to be backslash-escaped.
On the other hand, perhaps prefer a more precise expression:
r'^Document(-\d)?\.pdf$'
which says, opionally, a hyphen followed by numbers, and nothing before or after.

You should mark it as optional with the "?" symbol. Otherwise, you are requiring that the name should have the numbers and/or digits part.
r'Document(?:[\-0-9]+)?\.pdf'
Or as #anubhava pointed out in the comments, it can be simplified to:
r'Document[\-0-9]*\.pdf'
This way, it will also match e.g. "Document.pdf"
Also, you should consider putting the mark "$" to signify end of string so that it doesn't match e.g. "Document.pdf.fkasfmq"
r'^Document(?:[\-0-9]+)?\.pdf$'
Or
r'^Document[\-0-9]*\.pdf$'

You can just use (\d{8}) to see if there's a document there with 8 digits in the filename.

Backslashes in Python Regex

I'm writing a quick Python script to do a bit of inspection on some of our Hibernate mapping files. I'm trying to use this bit of Python to get the table name of a POJO, whether or not its class path is fully defined:
searchObj = re.search(r'<class name="(.*\\.|)' + pojo + '".*table="(.*?)"', contents)
However - say pojo is 'MyObject' - the regex is not matching it to this line:
<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">
If I print the string (while stopped in Pdb) I'm searching with, I see this:
'<class name="(.*\\\\.|)MyObject".*table="(.*?)"'
I'm quite confused as to what's going wrong here. For one, I was under the impression that the 'r' prefix made it so that the backslashes wouldn't be escaped. Even so, if I remove one of the backslashes such that my search string is this:
searchObj = re.search(r'<class name="(.*\.|)' + pojo + '".*table="(.*?)"', contents)
And the string searched becomes
'<class name="(.*\\.|)MyObject".*table="(.*?)"'
It still doesn't return a match. What's going wrong here? The regex expression I'm intending to use works on regex101.com (with just one backslash in the apparently problematic area.) Any idea what is going wrong here?

Given this:
re.search(r'<class name="(.*\\.|)' + pojo + '".*table="(.*?)"', contents)
The the first part of the pattern is interpreted like this:
1. class name=" a literal string beginning with c and ending with "
2. ( the beginning of a group
3. .* zero or more of any characters
4. \\ a literal single slash
5. . any single character
6. OR
7. nothing
8. ) end of the group
Since the string you're searching for does not have a literal backslash, it won't match.
If what you intend is for \\. to mean "a literal period", you need a single backslash since it is inside a raw string: \.
Also, ending the group with a pipe seems weird. I'm not sure what you think that's accomplishing. If you mean to say "any number of characters ending in a dot, or nothing", you can do that with (.*\.)?, since the ? means "zero or one of the preceding match".
This seems to work for me:
import re
contents1 = '''<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">'''
contents2 = '''<class name="MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">'''
pojo="MyObject"
pattern = r'<class name="(.*\.)?' + pojo + '.*table="(.*?)"'
assert(re.search(pattern, contents1))
assert(re.search(pattern, contents2))

On Pythex, I tried this regex:
<class name="(.*)\.MyObject" table="([^"]*)"
on this string:
<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">
and got these two match captures:
com.place.package
my_cool_object
So I think in your case, this line
searchObj = re.search(r'<class name="(.*)\.' + pojo + '"table="([^"]*)"', contents)
will produce the result you want.
About the confusing backslashes – you add two and then four show up, on the Python documentation 7.2. re — Regular expression operations it explains that r'' is “raw string notation”, used to circumvent Python’s regular character escaping, which uses a backslash. So:
'\\' means “a string composed of one backslash”, since the first backslash in the string escapes the second backslash. Python sees the first backslash and thinks, ‘the next character is a special one’; then it sees the second and says, ‘the special character is an actual backslash’. It’s stored as a single character \. If you ask Python to print this, it will escape the output and show you "\\".
r'\\' means “a string composed of two actual backslashes. It’s stored as character \ followed by character \. If you ask Python to print this, it will escape the output and show you "\\\\".

Match LaTeX reserved characters with regex

I have an HTML to LaTeX parser tailored to what it's supposed to do (convert snippets of HTML into snippets of LaTeX), but there is a little issue with filling in variables. The issue is that variables should be allowed to contain the LaTeX reserved characters (namely # $ % ^ & _ { } ~ \). These need to be escaped so that they won't kill our LaTeX renderer.
The program that handles the conversion and everything is written in Python, so I tried to find a nice solution. My first idea was to simply do a .replace(), but replace doesn't allow you to match only if the first is not a \. My second attempt was a regex, but I failed miserably at that.
The regex I came up with is ([^\][#\$%\^&_\{\}~\\]). I hoped that this would match any of the reserved characters, but only if it didn't have a \ in front. Unfortunately, this matches ever single character in my input text. I've also tried different variations on this regex, but I can't get it to work. The variations mainly consisted of removing/adding slashes in the second part of the regex.
Can anyone help with this regex?
EDIT Whoops, I seem to have included the slashes as well. Shows how awake I was when I posted this :) They shouldn't be escaped in my case, but it's relatively easy to remove them from the regexes in the answers. Thanks all!

The [^\] is a character class for anything not a \, that is why it is matching everything. You want a negative lookbehind assertion:
((?<!\)[#\$%\^&_\{\}~\\])
(?<!...) will match whatever follows it as long as ... is not in front of it. You can check this out at the python docs

The regex ([^\][#\$%\^&_\{\}~\\]) is matching anything that isn't found between the first [ and the last ], so it should be matching everything except for what you want it to.
Moving around the parenthesis should fix your original regex ([^\\])[#\$%\^&_\{\}~\\].
I would try using regex lookbehinds, which won't match the character preceding what you want to escape. I'm not a regex expert so perhaps there is a better pattern, but this should work (?<!\\)[#\$%\^&_\{\}~\\].

If you're looking to find special characters that aren't escaped, without eliminating special chars preceded by escaped backslashes (e.g. you do want to match the last backslash in abc\\\def), try this:
(?<!\\)(\\\\)*[#\$%\^&_\{\}~\\]
This will match any of your special characters preceded by an even number (this includes 0) of backslashes. It says the character can be preceded by any number of pairs of backslashes, with a negative lookbehind to say those backslashes can't be preceded by another backslash.
The match will include the backslashes, but if you stick another in front of all of them, it'll achieve the same effect of escaping the special char, anyway.

Weird Python Regex Issues

whitespace_pattern = u"\s" # bug: tried to use unicode \u0020, broke regex
time_sig_pattern = \
"""^%(ws)s*time signature:%(ws)s*(?P<top>\d+)%(ws)s*\/%(ws)s*(?P<bottom>\d+)%(ws)s*$""" %{"ws": whitespace_pattern}
time_sig = compile(time_sig_pattern, U|M)
For some reason, adding the Verbose flag, X, to compile breaks the pattern.
Also, I wanted to use unicode for whitespace_pattern recognition (supposedly, we'll get patterns that use non-unicode spaces and we need to explicitly check for that one unicode character as a valid space), but the pattern keeps breaking.

VERBOSE gives you the ability to write comments in your regex to document it.
In order to do so, it ignores spaces, since you need to use line breaks to write comments.
Replace all spaces in your regex by \s to specify they are spaces you want to match in your pattern, and not just some spaces to format your comments.
What's more, you may want to use the r prefix for the string you use as a pattern. It tells Python not to interpret special notations such as \n and let you use backslashes without escaping them.

Always define regexes with the r prefix to indicate they are raw strings.
r"""^%(ws)s*time signature:%(ws)s*(?P<top>\d+)%(ws)s*\/%(ws)s*(?P<bottom>\d+)%(ws)s*$""" %{"ws": whitespace_pattern}

When creating a regex to match unicode characters you do not want to use a Python unicode string. In your example regular expression needs to see the literal characters \u0020, so you should use whitespace_pattern = r"\u0020" instead of u"\u0020".
As other answers have mentioned, you should also use the r prefix for time_sig_pattern, after those two changes your code should work fine.
For VERBOSE to work correctly you need to escape all whitespace in the pattern, so towards the beginning of the pattern replace the space in time signature with "\ " (quotes for clarity), \s, or [ ] as documented here.

Python: Regex to extract part of URL found between parentheses

I have this weirdly formatted URL. I have to extract the contents in '()'.
Sample URL : http://sampleurl.com/(K(ThinkCode))/profile/view.aspx
If I can extract ThinkCode out of it, I will be a happy man! I am having a tough time with regexing special chars like '(' and '/'.

>>> foo = re.compile( r"(?<=\(K\()[^\)]*" )
>>> foo.findall( r"http://sampleurl.com/(K(ThinkCode))/profile/view.aspx" )
['ThinkCode']
Explanation
In regex-world, a lookbehind is a way of saying "I want to match ham, but only if it's preceded by spam. We write this as (?<=spam)ham. So in this case, we want to match [^\)]*, but only if it's preceded by \(K\(.
Now \(K\( is a nice, easy regex, because it's plain text! It means, match exactly the string (K(. Notice that we have to escape the brackets (by putting \ in front of them), since otherwise the regex parser would think they were part of the regex instead of a character to match!
Finally, when you put something in square brackets in regex-world, it means "any of the characters in here is OK". If you put something inside square brackets where the first character is ^, it means "any character not in here is OK". So [^\)] means "any character that isn't a right-bracket", and [^\)]* means "as many characters as possible that aren't right-brackets".
Putting it all together, (?<=\(K\()[^\)]* means "match as many characters as you can that aren't right-brackets, preceded by the string (K(.
Oh, one last thing. Because \ means something inside strings in Python as well as inside regexes, we use raw strings -- r"spam" instead of just "spam". That tells Python to ignore the \'s.
Another way
If lookbehind is a bit complicated for you, you can also use capturing groups. The idea behind those is that the regex matches patterns, but can also remember subpatterns. That means that you don't have to worry about lookaround, because you can match the entire pattern and then just extract the subpattern inside it!
To capture a group, simply put it inside brackets: (foo) will capture foo as the first group. Then, use .groups() to spit out all the groups that you matched! This is the way the other answer works.

It's not too hard, especially since / isn't actually a special character in Python regular expressions. You just backslash the literal parens you want. How about this:
s = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
mo = re.match(r"http://sampleurl\.com/\(K\(([^)]+)\)\)/profile.view\.aspx", s);
print mo.group(1)
Note the use of r"" raw strings to preserve the backslashes in the regular expression pattern string.

If you want to have special characters in a regex, you need to escape them, such as \(, \/, \\.
Matching things inside of nested parenthesis is quite a bit of a pain in regex. if that format is always the same, you could use this:
\(.*?\((.*?)\).*?\)
Basically: find a open paren, match characters until you find another open paren, group characters until I see a close paren, then make sure there are two more close paren somewhere in there.

mystr = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
import re
re.sub(r'^.*\((\w+)\).*',r'\1',mystr)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

For regex, r'[!-\.&]' what does it mean? - python

Is it !-\ (characters from 33=ord('!') to 92=ord('\') and '.' and '&' in a set? I think my interpretation is incorrect based on my test. But python reference doesn't say anything wrong with my interpretation. http://docs.python.org/library/re.html

The backslash escapes the dot and the range will thus be from ! to .. The regex will match: !"#$%&'()*+,-. The last & is not necessary since it's included in the range, and escaping a dot is not needed either since it's inside a character class.

Related

Python regex OR expression

Backslashes in Python Regex

Match LaTeX reserved characters with regex

Weird Python Regex Issues

Python: Regex to extract part of URL found between parentheses

Categories

Resources