i am using python with re module for regular expressions
i want to remove all the characters from the string except numbers and characters.
To achieve this i am using sub function
Code Snippet:-
>>> text="foo.bar"
>>> re.sub("[^A-Z][^a-z]","",text)
'fobar'
I wanted to know why above expression removes the "o."?
I am not able to understand why it removes the "o"
Can someone please explain me what is going on behind this?
I know to correct solution of this problem is
>>> re.sub("[^A-Z ^a-z]","",text)
'foobar'
Thanks in advance
A very important aspect to realize is that [^A-Z][^a-z] represents two characters (one for each character group), while [^A-Za-z] represents only one.
Explained in detail:
The [^A-Z] means all characters except uppercase A to Z, the second o in foo.bar is not uppercase so it matches as a matter of fact everything in foo.bar is matched at this point.
Then you add [^a-z] so you look for a character that is not lowercase, only the dot matches.
Combine both and you look for a non-uppercase character followed by a non-lowercase character so this matches o.
The solution is the one proposed by Ignacio.
Because o matches [^A-Z] and . matches [^a-z].
And the correct solution is [^A-Za-z0-9].
Related
Sorry if this question seems too similar to other's I have found. This is a variation of using re.sub to replace exact characters in a string.
I have a string that looks like:
C1([*:5])C([*:6])C2=NC1=C([*:1])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5
I would like to only replace, for example, the '*:1' with 'Ar'. My current attempt looks like this:
smiles_all='C1([*:5])C([*:6])C2=NC1=C([*:1])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5'
print(smiles_all)
new_smiles=re.sub('[*:]1','Ar',smiles_all)
print(new_smiles)
C1([*:5])C([*:6])C2=NC1=C([*Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*Ar0])C(=N4)C([*:3])=C5C([*Ar1])=C([*Ar2])C(=C2([*:4]))N5
As you can see, this is still changing the values that were previously 10,11, etc. I've tried different variations where I select [*:1], but that is also incorrect. Any help here would be greatly appreciated. In my current output, the * also remains. That needs to be swapped so that *:1 becomes Ar
Here is an example of what the output should be
C1([*:5])C([*:6])C2=NC1=C([Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5
*Edit:
At one point this question was flagged as answered by this question:
Escaping regex string
When I implement re.escape as suggested, I still get an error:
new_smiles=re.sub(re.escape('*:1'),'Ar',smiles_all)
C1([*:5])C([*:6])C2=NC1=C([*:1])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5
C1([*:5])C([*:6])C2=NC1=C([Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([Ar0])C(=N4)C([*:3])=C5C([Ar1])=C([Ar2])C(=C2([*:4]))N5
Given:
smiles_all='C1([*:5])C([*:6])C2=NC1=C([*:1])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5'
desired='C1([*:5])C([*:6])C2=NC1=C([Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5'
You are trying to replace the literal string [*:1] with [Ar]. In a regex, the expression [*:1] is a character class that matches a single one of the characters inside the class with one match. If you add any regex repetition to a character class, it will match those characters in any order up to the repetition limit.
The easiest way to to replace the literal [*:1] with [Ar] is to use Python's string methods:
>>> smiles_all.replace('[*:1]','[Ar]')==desired
True
If you want to use a regex, you need to escape those metacharaters to get a literal string:
>>> re.sub(r'\[\*:1\]', "[Ar]", smiles_all)==desired
True
Or let Python do the escaping for you:
>>> re.sub(re.escape(r'[*:1]'), "[Ar]", smiles_all)==desired
True
You can try:
re.sub(r"[*:]+1(?=])", "Ar", smiles_all)
Difference from yours is to allow 1+ repetitions of literal * and : followed by 1 which is also ensured to be followed by a ] via the ?=, i.e., positive lookahead.
to get
"C1([*:5])C([*:6])C2=NC1=C([Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5"
Suppose I am using the following regular expression to match, logically the regular expression means match anything with prefix foo: and ends with anything which is not a space. Match group will be the parts exclude prefix foo
My question is what exactly means anything in Python 2.7? Any ASCII or? If anyone could share some document, it will be great. Thanks.
a = re.compile('foo:([^ ]+)')
thanks in advance,
Lin
Try:
a = re.compile('foo:\S*')
\S means anything but whitespace.
I recommend you check out http://pythex.org.
It's really good for testing out regular expresions and has a decent cheat-sheet.
UPDATE:
Anything (.) matches anything, all unicode/UTF-8 characters.
The regular expression metacharacter which matches any character is . (dot).
a = re.compile('foo:(.+)')
The character class [^ ] matches any one character which isn't one of the characters between the square brackets (a literal space, in this example). The quantifier + specifies one or more repetitions of the preceding expression.
string: XXaaaXXbbbXXcccXXdddOO
I want to match the minimal string that begin with 'XX' and end with 'OO'.
So I write the non-greedy reg: r'XX.*?OO'
>>> str = 'XXaaaXXbbbXXcccXXdddOO'
>>> re.findall(r'XX.*?OO', str)
['XXaaaXXbbbXXcccXXdddOO']
I thought it will return ['XXdddOO'] but it was so 'greedy'.
Then I know I must be mistaken, because the qualifier above will firstly match the 'XX' and then show it's 'non-greedy'.
But I still want to figure out how can I get my result ['XXdddOO'] straightly. Any reply appreciated.
Till now, the key point is actually not about non-greedy , or in other words, it is about the non-greedy in my eyes: it should match as few characters as possible between the left qualifier(XX) and the right qualifier(OO).
And of course the fact is that the string is processed from left to right.
How about:
.*(XX.*?OO)
The match will be in group 1.
Regex work from left to the right: non-greedy means that it will match XXaaaXXdddOO and not XXaaaXXdddOOiiiOO. If your data structure is that fixed, you could do:
XX[a-z]{3}OO
to select all patterns like XXiiiOO (it can be adjusted to fit your your needs, with XX[^X]+?OO for instance selecting everything in between the last XX pair before an OO up to that OO: for example in XXiiiXXdddFFcccOOlll it would match XXdddFFcccOO)
Indeed, issue is not with greedy/non-greedy… Solution suggested by #devnull should work, provided you want to avoid even a single X between your XX and OO groups.
Else, you’ll have to use a lookahead (i.e. a piece of regex that will go “scooting” the string ahead, and check whether it can be fulfilled, but without actually consuming any char). Something like that:
re.findall(r'XX(?:.(?!XX))*?OO', str)
With this negative lookahead, you match (non-greedily) any char (.) not followed by XX…
The behaviour is due to the fact that the string is processed from left to right. A way to avoid the problem is to use a negated character class:
XX(?:(?=([^XO]+|O(?!O)|X(?!X)))\1)+OO
I've looked thrould the forums but could not find exactly how exactly to solve my problem.
Let's say I have a string like the following:
UDK .636.32/38.082.4454.2(575.3)
and I would like to match the expression with a regex, capturing the actual number (in this case the '.636.32/38.082.4454.2(575.3)').
There could be some garbage characters between the 'UDK' and the actual number, and characters like '.', '/' or '-' are valid parts of the number. Essentially the number is a sequence of digits separated by some allowed characters.
What I've came up with is the following regex:
'UDK.*(\d{1,3}[\.\,\(\)\[\]\=\'\:\"\+/\-]{0,3})+'
but it does not group the '.636.32/38.082.4454.2(575.3)'! It leaves me with nothing more than a last digit of the last group (3 in this case).
Any help would be greatly appreciated.
First, you need a non-greedy .*?.
Second, you don't need to escape some chars in [ ].
Third, you might just consider it as a sequence of digits AND some allowed characters? Why there is a \d{1,3} but a 4454?
>>> re.match(r'UDK.*?([\d.,()\[\]=\':"+/-]+)', s).group(1)
'.636.32/38.082.4454.2(575.3)'
Not so much a direct answer to your problem, but a general regexp tip: use Kodos (http://kodos.sourceforge.net/). It is simply awesome for composing/testing out regexps. You can enter some sample text, and "try out" regular expressions against it, seeing what matches, groups, etc. It even generates Python code when you're done. Good stuff.
Edit: using Kodos I came up with:
UDK.*?(?P<number>[\d/.)(]+)
as a regexp which matches the given example. Code that Kodos produces is:
import re
rawstr = r"""UDK.*?(?P<number>[\d/.)(]+)"""
matchstr = """UDK .636.32/38.082.4454.2(575.3)"""
# method 1: using a compile object
compile_obj = re.compile(rawstr)
match_obj = compile_obj.search(matchstr)
# Retrieve group(s) by name
number = match_obj.group('number')
This is in reference to a question I asked before here
I received a solution to the problem in that question but ended up needing to go with regex for this particular part.
I need a regular expression to search and replace a string for instances of two vowels in a row that are the same, so the "oo" in "took", or the "ee" in "bees" and replace it with the one of the letters that was replaced and a :.
Some examples of expected behavior:
"took" should become "to:k"
"waaeek" should become "wa:e:k"
"raaag" should become "ra:ag"
Thank you for the help.
Try this:
re.sub(r'([aeiou])\1', r'\1:', str)
Search for ([aeiou])\1 and replace it with \1:
I don't know about python, but you should be able to make the regex case insensitive and global with something like /([aeiou])\1/gi
What NOT to do:
As noted, this will match any two vowels together. Leaving this answer as an example of what NOT to do. The correct answer (in this case) is to use backreferences as mentioned in numerous other answers.
import re
data = ["took","waaeek","raaag"]
for s in data:
print re.sub(r'([aeiou]){2}',r'\1:',s)
This matches exactly two occurrences {2} of any member of the set [aeiou]. and replaces it with the vowel, captured with the parens () and placed in the sub string by the \1 followed by a ':'
Output:
to:k
wa:e:k
ra:ag
You'll need to use a back reference in your search expression. Try something like: ([a-z])+\1 (or ([a-z])\1 for just a double).