Forward slash in a Python regex - python

I'm trying to use a Python regex to find a mathematical expression in a string. The problem is that the forward slash seems to do something unexpected. I'd have thought that [\w\d\s+-/*]* would work for finding math expressions, but it finds commas too for some reason. A bit of experimenting reveals that forward slashes are the culprit. For example:
>>> import re
>>> re.sub(r'[/]*', 'a', 'bcd')
'abacada'
Apparently forward slashes match between characters (even when it is in a character class, though only when the asterisk is present). Back slashes do not escape them. I've hunted for a while and not found any documentation on it. Any pointers?

Look here for documentation on Python's re module.
I think it is not the /, but rather the - in your first character class: [+-/] matches +, / and every ASCII character between them, which happens to include the comma.
Maybe this hint from the docs helps:
If you want to include a ']' or a '-' inside a set, precede it with a backslash, or place it as the first character.
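To see what the range +-/ actually covers, one can list the code points between its two endpoints. This is a small illustrative sketch rather than part of the original answer; the names in_range and comma_matches are made up here.

```python
import re

# Characters covered by the range '+-/' inside a character class:
# every code point from ord('+') through ord('/').
in_range = [chr(c) for c in range(ord('+'), ord('/') + 1)]
print(in_range)  # ['+', ',', '-', '.', '/']

# The comma is therefore matched by the class [+-/].
comma_matches = re.fullmatch(r'[+-/]', ',') is not None
print(comma_matches)  # True
```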

You are telling it to replace zero or more slashes with 'a', so it replaces each empty match (each "no character" position) with 'a'. :)
You probably meant [/]+, i.e. one or more slashes.
EDIT: Read Ber's answer for a solution to the original problem. I didn't read the whole question carefully enough.

r'[/]*' means "Match 0 or more forward-slashes". There are exactly 0 forward-slashes between 'b' & 'c' and between 'c' & 'd'. Hence, those matches are replaced with 'a'.
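The empty matches can be made visible with re.findall. The following is a minimal sketch using the same input as the question; the variable names are only for illustration.

```python
import re

# '*' allows zero repetitions, so the pattern matches the empty string
# at every position in 'bcd' (before b, before c, before d, and at the end).
empty_matches = re.findall(r'[/]*', 'bcd')
print(empty_matches)  # ['', '', '', '']

# '+' requires at least one slash, so nothing matches and nothing is replaced.
unchanged = re.sub(r'[/]+', 'a', 'bcd')
print(unchanged)  # bcd

# With slashes present, each run of slashes is replaced once.
replaced = re.sub(r'[/]+', 'a', 'b//c/d')
print(replaced)  # bacad
```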

The * matches its argument zero or more times, and thus matches the empty string. The empty string is (logically) between any two consecutive characters. Hence
>>> import re
>>> re.sub(r'x*', 'a', 'bcd')
'abacada'
As for the forward slash, it receives no special treatment:
>>> re.sub(r'/', 'a', 'b/c/d')
'bacad'
The documentation describes the syntax of regular expressions in Python. As you can see, the forward slash has no special function.
The reason that [\w\d\s+-/*]* also finds commas is that inside square brackets the dash - denotes a range. In this case you don't want all characters between + and /, but the literal characters +, - and /. So write the dash as the last character: [\w\d\s+/*-]*. That should fix it.
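A quick way to compare the two character classes is to run both against a sample expression and against a string containing a comma. This is a sketch; the sample strings and the names broken, fixed and expr are assumptions made for illustration.

```python
import re

broken = re.compile(r'[\w\d\s+-/*]*')  # '+-/' is a range that includes ',' and '.'
fixed = re.compile(r'[\w\d\s+/*-]*')   # trailing '-' is a literal hyphen

expr = '3 + 4/2 - 1*x'
print(fixed.fullmatch(expr) is not None)    # True
print(broken.fullmatch('1,2') is not None)  # True: the comma slips through
print(fixed.fullmatch('1,2') is not None)   # False
```

Note that \w already includes digits, so the \d in the class is redundant, though harmless.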

Related

Python regex OR expression

I have a file named Document.pdf and sometimes it is called Document-12345678.pdf where -12345678 is a random number.
I want to check a file is downloaded in folder. When the file is not finished it display Document.pdf.fkasfmq or Document-12345678.pdf.fkasfmq where .fkasfmq is a random hash from the downloader and I don't want it to match.
I tried a regex like r'Document(?:[\-0-9]+).pdf', but when I test it with either Document.pdf or Document-12345678.pdf it always returns false.
From my understanding (?:[\-0-9]+) means it can be or not in the set that matches any hyphen and any numbers before .pdf, is that correct? I am very very rusty with regex...
The parentheses only perform grouping, not optionality. If you want to make the expression optional, the ? quantifier does that (and actually the parentheses are unnecessary, as the character class is a single expression). Though as @anubhava notes in a comment, you might as well use the * quantifier then.
r'Document[-0-9]*\.pdf'
Notice also the backslash to match a literal dot; an unescaped . matches any character (other than newline). Inside a character class, an initial or final hyphen does not need to be backslash-escaped.
On the other hand, perhaps prefer a more precise expression:
r'^Document(-\d+)?\.pdf$'
which says: optionally, a hyphen followed by one or more digits, and nothing before or after.
You should mark it as optional with the "?" symbol. Otherwise, you are requiring that the name contain the hyphen and/or digits part.
r'Document(?:[\-0-9]+)?\.pdf'
Or as @anubhava pointed out in the comments, it can be simplified to:
r'Document[\-0-9]*\.pdf'
This way, it will also match e.g. "Document.pdf"
Also, you should consider putting the mark "$" to signify end of string so that it doesn't match e.g. "Document.pdf.fkasfmq"
r'^Document(?:[\-0-9]+)?\.pdf$'
Or
r'^Document[\-0-9]*\.pdf$'
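The anchored patterns can be checked against the four file names from the question. The sketch below uses re.fullmatch with a slightly stricter variant (one hyphen followed by digits), which is an assumption about the naming scheme rather than something the question guarantees.

```python
import re

# fullmatch() anchors at both ends, so '^' and '$' are not needed here.
pattern = re.compile(r'Document(?:-\d+)?\.pdf')

names = [
    'Document.pdf',
    'Document-12345678.pdf',
    'Document.pdf.fkasfmq',
    'Document-12345678.pdf.fkasfmq',
]
finished = [n for n in names if pattern.fullmatch(n)]
print(finished)  # ['Document.pdf', 'Document-12345678.pdf']
```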
You can just use (\d{8}) to see if there's a document there with 8 digits in the filename.

re.sub for string starting with special character

Sorry if this question seems too similar to others I have found. This is a variation of using re.sub to replace exact characters in a string.
I have a string that looks like:
C1([*:5])C([*:6])C2=NC1=C([*:1])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5
I would like to only replace, for example, the '*:1' with 'Ar'. My current attempt looks like this:
smiles_all='C1([*:5])C([*:6])C2=NC1=C([*:1])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5'
print(smiles_all)
new_smiles=re.sub('[*:]1','Ar',smiles_all)
print(new_smiles)
C1([*:5])C([*:6])C2=NC1=C([*Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*Ar0])C(=N4)C([*:3])=C5C([*Ar1])=C([*Ar2])C(=C2([*:4]))N5
As you can see, this is still changing the values that were previously 10,11, etc. I've tried different variations where I select [*:1], but that is also incorrect. Any help here would be greatly appreciated. In my current output, the * also remains. That needs to be swapped so that *:1 becomes Ar
Here is an example of what the output should be
C1([*:5])C([*:6])C2=NC1=C([Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5
*Edit:
At one point this question was flagged as answered by this question:
Escaping regex string
When I implement re.escape as suggested, I still get the wrong result:
new_smiles=re.sub(re.escape('*:1'),'Ar',smiles_all)
C1([*:5])C([*:6])C2=NC1=C([*:1])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5
C1([*:5])C([*:6])C2=NC1=C([Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([Ar0])C(=N4)C([*:3])=C5C([Ar1])=C([Ar2])C(=C2([*:4]))N5
Given:
smiles_all='C1([*:5])C([*:6])C2=NC1=C([*:1])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5'
desired='C1([*:5])C([*:6])C2=NC1=C([Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5'
You are trying to replace the literal string [*:1] with [Ar]. In a regex, however, [*:1] is a character class: it matches exactly one character, which may be any of *, : or 1. If you add a quantifier to a character class, it matches a run of those characters, in any order, up to the repetition limit.
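The character-class behaviour is easy to see with re.findall on a short fragment. This is only an illustrative sketch; the fragment 'C1([*:1])' is a made-up, shortened input.

```python
import re

# Unescaped, [*:1] is a character class: each match is ONE character
# that is either '*', ':' or '1'.
single_chars = re.findall(r'[*:1]', 'C1([*:1])')
print(single_chars)  # ['1', '*', ':', '1']
```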
The easiest way to replace the literal [*:1] with [Ar] is to use Python's string methods:
>>> smiles_all.replace('[*:1]','[Ar]')==desired
True
If you want to use a regex, you need to escape those metacharacters to get a literal string:
>>> re.sub(r'\[\*:1\]', "[Ar]", smiles_all)==desired
True
Or let Python do the escaping for you:
>>> re.sub(re.escape(r'[*:1]'), "[Ar]", smiles_all)==desired
True
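The difference between escaping '*:1' (as in the question's edit) and escaping the full '[*:1]' can be shown on a shortened string. This is a sketch; the input below is a hypothetical fragment, not the full SMILES string from the question.

```python
import re

smiles = 'C([*:1])C([*:10])C([*:11])'  # shortened, hypothetical input

# Escaping only '*:1' still matches the prefix of '*:10' and '*:11'.
too_loose = re.sub(re.escape('*:1'), 'Ar', smiles)
print(too_loose)  # C([Ar])C([Ar0])C([Ar1])

# Including the brackets in the escaped literal pins down the whole label.
exact = re.sub(re.escape('[*:1]'), '[Ar]', smiles)
print(exact)  # C([Ar])C([*:10])C([*:11])
```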
You can try:
re.sub(r"[*:]+1(?=])", "Ar", smiles_all)
The difference from your attempt is that it allows one or more repetitions of the literal * and :, followed by a 1 that must itself be followed by a ], which the positive lookahead (?=]) enforces.
This gives:
"C1([*:5])C([*:6])C2=NC1=C([Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5"

Backslashes in Python Regex

I'm writing a quick Python script to do a bit of inspection on some of our Hibernate mapping files. I'm trying to use this bit of Python to get the table name of a POJO, whether or not its class path is fully defined:
searchObj = re.search(r'<class name="(.*\\.|)' + pojo + '".*table="(.*?)"', contents)
However - say pojo is 'MyObject' - the regex is not matching it to this line:
<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">
If I print the string (while stopped in Pdb) I'm searching with, I see this:
'<class name="(.*\\\\.|)MyObject".*table="(.*?)"'
I'm quite confused as to what's going wrong here. For one, I was under the impression that the 'r' prefix made it so that the backslashes wouldn't be escaped. Even so, if I remove one of the backslashes such that my search string is this:
searchObj = re.search(r'<class name="(.*\.|)' + pojo + '".*table="(.*?)"', contents)
And the string searched becomes
'<class name="(.*\\.|)MyObject".*table="(.*?)"'
It still doesn't return a match. What's going wrong here? The regex expression I'm intending to use works on regex101.com (with just one backslash in the apparently problematic area.) Any idea what is going wrong here?
Given this:
re.search(r'<class name="(.*\\.|)' + pojo + '".*table="(.*?)"', contents)
The first part of the pattern is interpreted like this:
1. <class name=" a literal string beginning with < and ending with "
2. ( the beginning of a group
3. .* zero or more of any characters
4. \\ a single literal backslash
5. . any single character
6. OR
7. nothing
8. ) end of the group
Since the string you're searching for does not have a literal backslash, it won't match.
If what you intend is for \\. to mean "a literal period", you need a single backslash since it is inside a raw string: \.
Also, ending the group with a pipe seems weird. I'm not sure what you think that's accomplishing. If you mean to say "any number of characters ending in a dot, or nothing", you can do that with (.*\.)?, since the ? means "zero or one of the preceding match".
This seems to work for me:
import re
contents1 = '''<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">'''
contents2 = '''<class name="MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">'''
pojo="MyObject"
pattern = r'<class name="(.*\.)?' + pojo + r'".*table="(.*?)"'
assert(re.search(pattern, contents1))
assert(re.search(pattern, contents2))
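Building on the pattern above, the table name can be pulled out of the second capture group. The sketch below wraps this in a hypothetical helper, find_table, and makes a few assumptions of its own: the class name is passed through re.escape, the package prefix is restricted to word characters and dots, and the table attribute is expected inside the same tag.

```python
import re

def find_table(pojo, contents):
    """Return the table attribute for the given class name, or None."""
    pattern = (
        r'<class name="(?:[\w.]*\.)?'   # optional package prefix ending in a dot
        + re.escape(pojo)               # class name taken literally
        + r'"[^>]*?\btable="([^"]*)"'   # table attribute inside the same tag
    )
    match = re.search(pattern, contents)
    return match.group(1) if match else None

qualified = '<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true">'
bare = '<class name="MyObject" table="my_cool_object">'

print(find_table('MyObject', qualified))  # my_cool_object
print(find_table('MyObject', bare))       # my_cool_object
print(find_table('Other', qualified))     # None
```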
On Pythex, I tried this regex:
<class name="(.*)\.MyObject" table="([^"]*)"
on this string:
<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">
and got these two match captures:
com.place.package
my_cool_object
So I think in your case, this line
searchObj = re.search(r'<class name="(.*)\.' + pojo + '" table="([^"]*)"', contents)
will produce the result you want.
About the confusing backslashes (you add two and then four show up): the Python documentation for the re module (7.2, Regular expression operations) explains that r'' is “raw string notation”, used to circumvent Python’s regular character escaping, which uses a backslash. So:
'\\' means “a string composed of one backslash”, since the first backslash in the string escapes the second backslash. Python sees the first backslash and thinks, ‘the next character is a special one’; then it sees the second and says, ‘the special character is an actual backslash’. It’s stored as a single character \. If you ask Python to show its repr (as Pdb and the interactive prompt do), it escapes the output and shows you '\\'; print() shows the single backslash.
r'\\' means “a string composed of two actual backslashes”. It’s stored as character \ followed by character \. If you ask Python to show its repr, it escapes the output and shows you '\\\\'.
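The difference between the stored string and its escaped display can be checked directly. This is a minimal sketch; the names normal and raw are only for illustration.

```python
# A normal literal: the two characters \\ in source produce ONE backslash.
normal = '\\'
# A raw literal: both backslashes are kept.
raw = r'\\'

print(len(normal), len(raw))  # 1 2
print(normal)                 # prints a single backslash
print(repr(normal))           # '\\'
print(repr(raw))              # '\\\\'
```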

Is this regex syntax working?

I wanted to search a string for a substring beginning with ">"
Does this syntax say what I want it to say: this character followed by anything.
regex_firstline = re.compile("[>]{1}.*")
The Pythonic way to handle such tasks is the str.startswith() method; you don't need a regex.
As for your regex "[>]{1}.*": you don't need {1} after the character class, and you can mark the start of the string with the anchor ^. So it can be "^>.*"
Using http://regex101.com:
[>]{1} matches the single character > literally exactly one time (but it denotes {1} is a meaningless quantifier), and
.* then matches any character as many times as possible.
If a list was provided inside square brackets (as opposed to a single character), regex would attempt to match a single character within the list exactly one time. http://regex101.com has a good listing of tokens and what they mean.
An ideal regex would be ^[>].*, meaning: at the beginning of a string, find exactly one > character followed by anything else. With only one character in the square brackets, you can remove them to simplify it even further: ^>.*
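The options discussed above behave differently depending on whether the > has to be the first character or may appear anywhere. The sketch below compares them; the sample strings line and other are made up for illustration.

```python
import re

line = '>header text'
other = 'abc >def'

# Plain string method: is '>' the very first character?
print(line.startswith('>'))   # True
print(other.startswith('>'))  # False

# re.match only matches at the start of the string, so '^' is implied.
print(re.match(r'>.*', other))  # None

# re.search scans the whole string for the first '>' and takes the rest of the line.
found = re.search(r'>.*', other)
print(found.group())  # >def
```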

Python: Regex to extract part of URL found between parentheses

I have this weirdly formatted URL. I have to extract the contents in '()'.
Sample URL : http://sampleurl.com/(K(ThinkCode))/profile/view.aspx
If I can extract ThinkCode out of it, I will be a happy man! I am having a tough time with regexing special chars like '(' and '/'.
>>> foo = re.compile( r"(?<=\(K\()[^\)]*" )
>>> foo.findall( r"http://sampleurl.com/(K(ThinkCode))/profile/view.aspx" )
['ThinkCode']
Explanation
In regex-world, a lookbehind is a way of saying "I want to match ham, but only if it's preceded by spam." We write this as (?<=spam)ham. So in this case, we want to match [^\)]*, but only if it's preceded by \(K\(.
Now \(K\( is a nice, easy regex, because it's plain text! It means, match exactly the string (K(. Notice that we have to escape the brackets (by putting \ in front of them), since otherwise the regex parser would think they were part of the regex instead of a character to match!
Finally, when you put something in square brackets in regex-world, it means "any of the characters in here is OK". If you put something inside square brackets where the first character is ^, it means "any character not in here is OK". So [^\)] means "any character that isn't a right-bracket", and [^\)]* means "as many characters as possible that aren't right-brackets".
Putting it all together, (?<=\(K\()[^\)]* means "match as many characters as you can that aren't right-brackets, preceded by the string (K(".
Oh, one last thing. Because \ means something inside strings in Python as well as inside regexes, we use raw strings -- r"spam" instead of just "spam". That tells Python to leave the backslashes alone and pass them through to the regex engine.
Another way
If lookbehind is a bit complicated for you, you can also use capturing groups. The idea behind those is that the regex matches patterns, but can also remember subpatterns. That means that you don't have to worry about lookaround, because you can match the entire pattern and then just extract the subpattern inside it!
To capture a group, simply put it inside brackets: (foo) will capture foo as the first group. Then, use .groups() to spit out all the groups that you matched! This is the way the other answer works.
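The two approaches can be put side by side on the sample URL. This is a sketch; the capturing-group pattern shown here is one possible form and not necessarily the one the other answer uses.

```python
import re

url = 'http://sampleurl.com/(K(ThinkCode))/profile/view.aspx'

# Lookbehind: the match itself is only the text after '(K('.
with_lookbehind = re.search(r'(?<=\(K\()[^)]*', url).group()

# Capturing group: the match includes '(K(', but group 1 holds the inner text.
with_group = re.search(r'\(K\(([^)]*)', url).group(1)

print(with_lookbehind, with_group)  # ThinkCode ThinkCode
```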
It's not too hard, especially since / isn't actually a special character in Python regular expressions. You just backslash the literal parens you want. How about this:
import re
s = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
mo = re.match(r"http://sampleurl\.com/\(K\(([^)]+)\)\)/profile/view\.aspx", s)
print(mo.group(1))  # ThinkCode
Note the use of r"" raw strings to preserve the backslashes in the regular expression pattern string.
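If you want to match special characters literally in a regex, you need to escape them, such as \( and \\. The forward slash is not special in Python's re module, so \/ is accepted but unnecessary.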
Matching things inside of nested parenthesis is quite a bit of a pain in regex. if that format is always the same, you could use this:
\(.*?\((.*?)\).*?\)
Basically: find an open paren, match characters until you find another open paren, capture characters until you see a close paren, then make sure a second close paren follows somewhere after it.
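Applied to the sample URL with re.search, the pattern above behaves as described. This is a sketch assuming the URL format from the question; the name nested is only for illustration.

```python
import re

url = 'http://sampleurl.com/(K(ThinkCode))/profile/view.aspx'
nested = re.search(r'\(.*?\((.*?)\).*?\)', url)
print(nested.group(0))  # (K(ThinkCode))
print(nested.group(1))  # ThinkCode
```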
mystr = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
import re
re.sub(r'^.*\((\w+)\).*',r'\1',mystr)
