re.sub for string starting with special character - python

Sorry if this question seems too similar to others I have found. This is a variation on using re.sub to replace exact characters in a string.
I have a string that looks like:
C1([*:5])C([*:6])C2=NC1=C([*:1])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5
I would like to only replace, for example, the '*:1' with 'Ar'. My current attempt looks like this:
smiles_all='C1([*:5])C([*:6])C2=NC1=C([*:1])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5'
print(smiles_all)
new_smiles=re.sub('[*:]1','Ar',smiles_all)
print(new_smiles)
C1([*:5])C([*:6])C2=NC1=C([*Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*Ar0])C(=N4)C([*:3])=C5C([*Ar1])=C([*Ar2])C(=C2([*:4]))N5
As you can see, this also changes the values that were previously 10, 11, etc. I've tried different variations where I select [*:1], but that is also incorrect. Any help here would be greatly appreciated. In my current output, the * also remains. That needs to be replaced as well, so that *:1 becomes Ar.
Here is an example of what the output should be
C1([*:5])C([*:6])C2=NC1=C([Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5
Edit:
At one point this question was flagged as answered by this question:
Escaping regex string
When I implement re.escape as suggested, I still get the wrong output:
new_smiles=re.sub(re.escape('*:1'),'Ar',smiles_all)
C1([*:5])C([*:6])C2=NC1=C([*:1])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5
C1([*:5])C([*:6])C2=NC1=C([Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([Ar0])C(=N4)C([*:3])=C5C([Ar1])=C([Ar2])C(=C2([*:4]))N5

Given:
smiles_all='C1([*:5])C([*:6])C2=NC1=C([*:1])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5'
desired='C1([*:5])C([*:6])C2=NC1=C([Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5'
You are trying to replace the literal string [*:1] with [Ar]. In a regex, the unescaped expression [*:1] is a character class: it matches exactly one character, which may be *, : or 1. If you add a quantifier to a character class, it matches a run of those characters in any order, up to the repetition limit.
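To see the character-class behaviour concretely, here is a minimal sketch. The short string `fragment` is an illustrative assumption, not the question's full data:

```python
import re

# Hypothetical short fragment, used only for illustration.
fragment = "C([*:1])C([*:10])"

# Unescaped, [*:1] is a character class: each match is ONE of '*', ':' or '1'.
single_chars = re.findall(r"[*:1]", fragment)

# The question's pattern '[*:]1' means "one of '*' or ':' and then a '1'",
# so it matches ':1' wherever it occurs, including inside ':10'.
attempt = re.sub(r"[*:]1", "Ar", fragment)

print(single_chars)
print(attempt)
```

This reproduces the symptom in the question: the * is left behind and the label 10 is altered too.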
The easiest way to replace the literal [*:1] with [Ar] is to use Python's string methods:
>>> smiles_all.replace('[*:1]','[Ar]')==desired
True
If you want to use a regex, you need to escape those metacharacters to get a literal string:
>>> re.sub(r'\[\*:1\]', "[Ar]", smiles_all)==desired
True
Or let Python do the escaping for you:
>>> re.sub(re.escape(r'[*:1]'), "[Ar]", smiles_all)==desired
True

You can try:
re.sub(r"[*:]+1(?=])", "Ar", smiles_all)
The difference from yours is that it allows one or more repetitions of the literal * and :, followed by a 1 that must itself be followed by a ]. The positive lookahead (?=]) enforces that last condition without consuming the bracket.
This gives:
"C1([*:5])C([*:6])C2=NC1=C([Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5"

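The two answers above can be combined into a small reusable function. This is only a sketch: the helper name `replace_label` and the short test string are assumptions made for illustration, and the negative-lookahead variant is an alternative to the lookahead shown above rather than anything from the original answers.

```python
import re

def replace_label(smiles, label, replacement):
    """Replace the placeholder [*:label] with [replacement].

    Hypothetical helper for illustration. The brackets are matched
    literally, so label 1 never touches 10, 11 or 12.
    """
    pattern = re.escape("[*:%d]" % label)
    return re.sub(pattern, "[%s]" % replacement, smiles)

fragment = "C([*:1])C([*:10])C([*:11])"
by_helper = replace_label(fragment, 1, "Ar")

# Same idea with a negative lookahead instead of the closing bracket:
# '*:1' only matches when it is not followed by another digit.
by_lookahead = re.sub(r"\*:1(?!\d)", "Ar", fragment)

print(by_helper)
print(by_lookahead)
```

Both calls leave the labels 10 and 11 untouched, because the match is tied either to the closing bracket or to the absence of a following digit.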
Related

Is this regex syntax working?

I wanted to search a string for a substring beginning with ">"
Does this syntax say what I want it to say: this character followed by anything?
regex_firstline = re.compile("[>]{1}.*")
A more Pythonic way to do this is the str.startswith() method, which needs no regex.
As for your regex "[>]{1}.*": you don't need {1} after the character class, and you can anchor the pattern to the start of the string with ^. So it can be "^>.*"
Using http://regex101.com:
[>]{1} matches the single character > literally, exactly one time (the site notes that {1} is a meaningless quantifier), and
.* then matches any character as many times as possible.
If a list was provided inside square brackets (as opposed to a single character), regex would attempt to match a single character within the list exactly one time. http://regex101.com has a good listing of tokens and what they mean.
An ideal regex would be ^[>].*, meaning: at the beginning of the string, find exactly one > character followed by anything else. With only one character in the square brackets, you can remove them to simplify it even further: ^>.*
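As a rough comparison of the options discussed above, the following sketch runs the anchored pattern, str.startswith() and the original unanchored pattern over a few made-up strings (the sample `lines` are assumptions for illustration only):

```python
import re

# Made-up sample strings.
lines = [">first line", "plain text", "a > b"]

# Anchored pattern: '>' must be the first character of the string.
anchored = re.compile(r"^>.*")
by_regex = [line for line in lines if anchored.search(line)]

# Equivalent check without a regex.
by_startswith = [line for line in lines if line.startswith(">")]

# The unanchored original also finds a '>' in the middle of a string.
unanchored = re.compile(r"[>]{1}.*")
middle_hit = unanchored.search("a > b").group()

print(by_regex)
print(by_startswith)
print(middle_hit)
```

Whether the match in the middle of "a > b" is wanted depends on how the pattern is used: re.match() would reject it even without the ^, while re.search() would not.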

Dynamically Removing string with regex python

I am having trouble removing the end of strings using regex. I tried .partition without success, and my regex attempts have not worked either. All the strings follow the format some random words **X*.* Some more words, where * is a digit and X is a literal X, for example 21X2.5. Everything after this dynamic string should be removed. I am trying to use re.sub('\d\d\X\d.\d', string). Can someone point me in the right direction with regex and how to split the string?
The expected output should read:
some random words 21X2.5
Thanks!
Use the following regex:
re.search(r"(.*?\d\dX\d\.\d)", "some random words 21X2.5 Some more words").groups()[0]
Output:
'some random words 21X2.5'
Your regex is not correct. The biggest problem is that you need to escape the period. Otherwise, the regex treats the period as a match to any character. To match just that pattern, you can use something like:
re.findall(r'[\d]{2}X\d\.\d', 'asb12X4.4abc')
[\d]{2} matches a sequence of two digits, X matches the literal X, \d matches a single digit, \. matches the literal ., and \d matches the final digit.
This will match and return only 12X4.4.
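The effect of escaping the period is easy to check. In this small sketch the input "12X4x4" is a made-up string chosen so that only the unescaped pattern accepts it:

```python
import re

escaped = re.compile(r"\d{2}X\d\.\d")    # '\.' matches a literal period only
unescaped = re.compile(r"\d{2}X\d.\d")   # '.' matches any character

found = escaped.findall("asb12X4.4abc")
strict = escaped.search("12X4x4")          # no literal period, so no match
loose = unescaped.search("12X4x4").group()

print(found)
print(strict)
print(loose)
```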
It sounds like you instead want to remove everything after the matched expression. To get your desired output, you can do something like:
re.split(r'(.*?[\d]{2}X\d\.\d)', 'some random words 21X2.5 Some more words')[1]
which will return some random words 21X2.5. This expression pulls everything before and including the matched regex and returns it, discarding the end.
Let me know if this works.
To remove everything after the pattern, i.e. do exactly as you say:
s = re.sub(r'(\d\dX\d\.\d).*', r'\1', s)
Of course, if you mean something other than what you said, something different will be needed! E.g. if you also want to remove the pattern itself, not just (as you said) what's after it:
s = re.sub(r'\d\dX\d\.\d.*', r'', s)
and so forth, depending on what, exactly, are your specs!-)
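Another way to express "keep everything up to and including the pattern" is to search once and slice at the end of the match. This is a sketch under assumptions: the helper name `truncate_after_code` is hypothetical, and returning the text unchanged when no code is found is a design choice the question does not specify.

```python
import re

CODE = re.compile(r"\d\dX\d\.\d")

def truncate_after_code(text):
    """Return text up to and including the first NNXN.N code.

    Hypothetical helper; text without such a code is returned unchanged.
    """
    match = CODE.search(text)
    return text[:match.end()] if match else text

print(truncate_after_code("some random words 21X2.5 Some more words"))
print(truncate_after_code("no code here"))
```

Slicing at match.end() also drops any following lines, whereas .* in the re.sub versions stops at a newline unless re.DOTALL is set.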

Regular expression capturing entire match consisting of repeated groups

I've looked through the forums but could not find exactly how to solve my problem.
Let's say I have a string like the following:
UDK .636.32/38.082.4454.2(575.3)
and I would like to match the expression with a regex, capturing the actual number (in this case the '.636.32/38.082.4454.2(575.3)').
There could be some garbage characters between the 'UDK' and the actual number, and characters like '.', '/' or '-' are valid parts of the number. Essentially the number is a sequence of digits separated by some allowed characters.
What I've come up with is the following regex:
'UDK.*(\d{1,3}[\.\,\(\)\[\]\=\'\:\"\+/\-]{0,3})+'
but it does not group the '.636.32/38.082.4454.2(575.3)'! It leaves me with nothing more than the last digit of the last group (3 in this case).
Any help would be greatly appreciated.
First, you need a non-greedy .*?.
Second, you don't need to escape some chars in [ ].
Third, you might just treat it as a sequence of digits and allowed characters. Why is there a \d{1,3} when the number contains 4454?
>>> re.match(r'UDK.*?([\d.,()\[\]=\':"+/-]+)', s).group(1)
'.636.32/38.082.4454.2(575.3)'
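The reason the original pattern returned only a fragment is that a quantified capturing group keeps just its last repetition. The sketch below shows this on a tiny made-up input ("1.2.3.") and then applies a simplified version of the pattern above, with a smaller character class chosen for readability:

```python
import re

s = "UDK .636.32/38.082.4454.2(575.3)"

# A quantified group is overwritten on every repetition, so group(1)
# ends up holding only the final repetition.
repeated = re.match(r"(\d\.)+", "1.2.3.")
print(repeated.group(0))  # the whole match
print(repeated.group(1))  # the last repetition only

# Wrapping the repetition in an outer group captures the whole run.
wrapped = re.match(r"((?:\d\.)+)", "1.2.3.")
print(wrapped.group(1))

# Simplified version of the pattern above (smaller character class).
number = re.search(r"UDK.*?([\d./()]+)", s).group(1)
print(number)
```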
Not so much a direct answer to your problem, but a general regexp tip: use Kodos (http://kodos.sourceforge.net/). It is simply awesome for composing/testing out regexps. You can enter some sample text, and "try out" regular expressions against it, seeing what matches, groups, etc. It even generates Python code when you're done. Good stuff.
Edit: using Kodos I came up with:
UDK.*?(?P<number>[\d/.)(]+)
as a regexp which matches the given example. Code that Kodos produces is:
import re
rawstr = r"""UDK.*?(?P<number>[\d/.)(]+)"""
matchstr = """UDK .636.32/38.082.4454.2(575.3)"""
# method 1: using a compile object
compile_obj = re.compile(rawstr)
match_obj = compile_obj.search(matchstr)
# Retrieve group(s) by name
number = match_obj.group('number')

matching 3 or more of the same character in python

I'm trying to use regular expressions to find three or more of the same character in a string. So for example:
'hello' would not match
'ohhh' would.
I've tried doing things like:
re.compile('(?!.*(.)\1{3,})^[a-zA-Z]*$')
re.compile('(\w)\1{5,}')
but neither seem to work.
(\w)\1{2,} is the regex you are looking for.
In Python, write it as a raw string so the backslashes reach the regex engine intact: r"(\w)\1{2,}"
If you're looking for the same character three times consecutively, you can do this:
(\w)\1\1
If you want to find the same character three times anywhere in the string, you need to put a dot and an asterisk between the parts of the expression above, like so:
(\w).*\1.*\1
The .* matches any number of any character, so this expression should match any string which has any single word character that appears three or more times, with any number of any characters in between them.
Hope that helps.
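Here is a short sketch comparing the two patterns from the answers above. The word "banana" is an added test case, and the last line illustrates one likely reason the attempts in the question failed, namely the missing raw-string prefix:

```python
import re

consecutive = re.compile(r"(\w)\1{2,}")   # same character 3+ times in a row
anywhere = re.compile(r"(\w).*\1.*\1")    # same character 3+ times anywhere

results = {}
for word in ("hello", "ohhh", "banana"):
    results[word] = (bool(consecutive.search(word)), bool(anywhere.search(word)))
    print(word, results[word])

# Without the r prefix, Python turns "\1" into the control character
# chr(1) before the regex engine ever sees it.
print("\1" == "\x01")
```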

Find two of the same character in a string with regular expressions

This is in reference to a question I asked before here
I received a solution to the problem in that question but ended up needing to go with regex for this particular part.
I need a regular expression to search a string for instances of two identical vowels in a row, such as the "oo" in "took" or the "ee" in "bees", and replace each pair with one of the letters followed by a :.
Some examples of expected behavior:
"took" should become "to:k"
"waaeek" should become "wa:e:k"
"raaag" should become "ra:ag"
Thank you for the help.
Try this:
re.sub(r'([aeiou])\1', r'\1:', str)
Search for ([aeiou])\1 and replace it with \1:
I don't know about python, but you should be able to make the regex case insensitive and global with something like /([aeiou])\1/gi
What NOT to do:
As noted, this will match any two vowels together. Leaving this answer as an example of what NOT to do. The correct answer (in this case) is to use backreferences as mentioned in numerous other answers.
import re
data = ["took","waaeek","raaag"]
for s in data:
    print(re.sub(r'([aeiou]){2}',r'\1:',s))
This matches exactly two occurrences {2} of any member of the set [aeiou] and replaces them with the last vowel matched, which is captured by the parentheses () and placed in the replacement string by \1, followed by a ':'.
Output:
to:k
wa:e:k
ra:ag
You'll need to use a back reference in your search expression. Try something like: ([a-z])+\1 (or ([a-z])\1 for just a double).
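To make the difference between the two approaches visible, here is a sketch that wraps the backreference pattern in a function. The helper name `mark_double_vowels`, the re.IGNORECASE flag and the extra test word "read" are assumptions added for illustration:

```python
import re

def mark_double_vowels(word):
    """Replace each pair of identical adjacent vowels with vowel + ':'.

    Hypothetical helper name; the pattern is the backreference one above.
    """
    return re.sub(r"([aeiou])\1", r"\1:", word, flags=re.IGNORECASE)

for word in ("took", "waaeek", "raaag", "read"):
    print(word, "->", mark_double_vowels(word))

# For contrast, ([aeiou]){2} accepts two DIFFERENT vowels as well.
wrong = re.sub(r"([aeiou]){2}", r"\1:", "read")
print(wrong)
```

The three examples from the question come out the same with either pattern, which is why the flaw in ([aeiou]){2} only shows up on a word such as "read".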
