Python: How to replace string enclosed by certain characters - python

I want to replace a certain variable name in a mathematical expression while avoid to replace in function names.
For example following replacement of n:
sin(2 pi*n d)" -> "sin(2 pi*REPL d) but not: siREPL(2 pi*REPL d)
My idea was to check whether the substr is enclosed by special characters (' ' , '(', '*', etc) at one side but I failed to put it in regex or python code.
Any ideas?

Use word boundary(\b)
>>> import re
>>> re.sub(r'\bn\b', 'REPL', 'sin(2 pi*n d)')
'sin(2 pi*REPL d)'
According to the re module documentation:
\b
Matches the empty string, but only at the beginning or end of a word.
A word is defined as a sequence of alphanumeric or underscore
characters, so the end of a word is indicated by whitespace or a
non-alphanumeric, non-underscore character. Note that formally, \b is
defined as the boundary between a \w and a \W character (or vice
versa), or between \w and the beginning/end of the string, ...

Related

How to add a space after a diacritical mark only if no space came after it

I'm using this regular expression to remove an Arabic diacritical mark from a subtitle file, How it could be modified to add a space after the diacritical mark only if no space came after the diacritical mark? I'm using python 2.7.
file_content = re.sub(u'\u0651', '', file_content)
like
أعطني المفكّ، الآن
I need to add space after ّ
to be
أعطني المفكّ ، الآن
With regular expressions you could search for all occurrences of your dictation mark that has no space immediately after it:
file_content = re.sub(u'\u0651[^ ]', '\u0651 ', file_content)
[^ ] would mean any character that is not a simple whitespace.
\S would also be possible instead of [^ ], since it would match anything that is not a space.
https://docs.python.org/2/library/re.html
[] Used to indicate a set of characters.
Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it’s not the first character in the set.
\S
Matches any character which is not a whitespace character. This is the opposite of \s. If the ASCII flag is used this becomes the equivalent of [^ \t\n\r\f\v].

strange output regular expression r'[-.\:alnum:](.*)'

I expect to fetch all alphanumeric characters after "-"
For an example:
>>> str1 = "12 - mystr"
>>> re.findall(r'[-.\:alnum:](.*)', str1)
[' mystr']
First, it's strange that white space is considered alphanumeric, while I expected to get ['mystr'].
Second, I cannot understand why this can be fetched, if there is no "-":
>>> str2 = "qwertyuio"
>>> re.findall(r'[-.\:alnum:](.*)', str2)
['io']
First of all, Python re does not support POSIX character classes.
The white space is not considered alphanumeric, your first pattern matches - with [-.\:alnum:] and then (.*) captures into Group 1 all 0 or more chars other than a newline. The [-.\:alnum:] pattern matches one char that is either -, ., :, a, l, n, u or m. Thus, when run against the qwertyuio, u is matched and io is captured into Group 1.
Alphanumeric chars can be matched with the [^\W_] pattern. So, to capture all alphanumeric chars after - that is followed with 0+ whitespaces you may use
re.findall(r'-\s*([^\W_]+)', s)
See the regex demo
Details
- - a hyphen
\s* - 0+ whitespaces
([^\W_]+) - Capturing group 1: one or more (+) chars that are letters or digits.
Python demo:
print(re.findall(r'-\s*([^\W_]+)', '12 - mystr')) # => ['mystr']
print(re.findall(r'-\s*([^\W_]+)', 'qwertyuio')) # => []
Your regex says: "Find any one of the characters -.:alnum, then capture any amount of any characters into the first capture group".
In the first test, it found - for the first character, then captured mystr in the first capture group. If any groups are in the regex, findall returns list of found groups, not the matches, so the matched - is not included.
Your second test found u as one of the -.:alnum characters (as none of qwerty matched any), then captured and returned the rest after it, io.
As #revo notes in comments, [....] is a character class - matching any one character in it. In order to include a POSIX character class (like [:alnum:]) inside it, you need two sets of brackets. Also, there is no order in a character class; the fact that you included - inside it just means it would be one of the matched characters, not that alphanumeric characters would be matched without it. Finally, if you want to match any number of alphanumerics, you have your quantifier * on the wrong thing.
Thus, "match -, then any number of alphanumeric characters" would be -([[:alnum:]]*), except... Python does not support POSIX character classes. So you have to write your own: -([A-Za-z0-9]*).
However, that will not match your string because the intervening space is, as you note, not an alphanumeric character. In order to account for that, -\s*([A-Za-z0-9]*).
Not quite sure what you want to match. I'll assume you don't want to include '-' in any matches.
If you want to get all alphanumeric chars after the first '-' and skip all other characters you can do something like this.
re.match('.*?(?<=-)(((?<=\s+)?[a-zA-Z\d]+(?=\s+)?)+)', inputString)
If you want to find each string of alphanumerics after a each '-' then you can do this.
re.findall('(?<=-)[a-zA-Z\d]+')

RegEx match word in string containing + and - using re.findall() Python

myreg = r"\babcb\"
mystr = "sdf ddabc"
mystr1 = "sdf abc"
print(re.findall(myreg,mystr))=[]
print(re.findall(myreg,mystr1))=[abc]
Until now everything works as expected but if i change my reg and my str to.
myreg = r"\b\+abcb\"
mystr = "sdf +abc"
print(re.findall(myreg,mystr)) = [] but i would like to get [+abc]
I have noticed that using the following works as expected.
myreg = "^\\+abc$"
mystr = "+abc"
mystr1 = "-+abc"
My question: Is it possible to achieve the same results as above without splitting the string?
Best regards,
Gabriel
Your problem is the following:
\b is defined as the boundary between a \w and a \W character
(or vice versa).
\w contains the character set [a-zA-Z0-9_]
\W contains the character set [^a-zA-Z0-9_], which means all characters except [a-zA-Z0-9_]
'+' is not contained in \w so you won't match the boundary between the whitespace and the '+'.
To get what you want, you should remove the first \b from your pattern:
import re
string = "sdf +abc"
pattern = r"\+abc\b"
matches = re.findall(pattern, string)
print matches
['+abc']
There are two problems
Before your + in +abc, there is no word boundary, so \b cannot match.
Your regex \b\+abcb\ tries to match a literal b character after abc (typo).
Word Boundaries
The word boundary \b matches at a position between a word character (letters, digits and underscore) and a non-word character (or a line beginning or ending). For instance, there is a word boundary between the + and the a
Solution: Make your Own boundary
If you want to match +abc but only when it is not preceded by a word character (for instance, you don't want it inside def+abc), then you can make your own boundary with a lookbehind:
(?<!\w)\+abc
This says "match +abc if it is not preceded by a word character (letter, digit, underscore)".

Python regex match literal asterisk

Given the following string:
s = 'abcdefg*'
How can I match it or any other string only made of lowercase letters and optionally ending with an asterisk? I thought the following would work, but it does not:
re.match(r"^[a-z]\*+$", s)
It gives None and not a match object.
How can I match it or any other string only made of lowercase letters and optionally ending with an asterisk?
The following will do it:
re.match(r"^[a-z]+[*]?$", s)
The ^ matches the start of the string.
The [a-z]+ matches one or more lowercase letters.
The [*]? matches zero or one asterisks.
The $ matches the end of the string.
Your original regex matches exactly one lowercase character followed by one or more asterisks.
\*? means 0-or-1 asterisk:
re.match(r"^[a-z]+\*?$", s)
re.match(r"^[a-z]+\*?$", s)
The [a-z]+ matches the sequence of lowercase letters, and \*? matches an optional literal * chatacter.
Try
re.match(r"^[a-z]*\*?$", s)
this means "a string consisting zero or more lowercase characters (hence the first asterisk), followed by zero or one asterisk (the question mark after the escaped asterisk).
Your regex means "exactly one lowercase character followed by one or more asterisks".
You forgot the + after the [a-z] match to indicate you want 1 or more of them as well (right now it's matching just one).
re.match(r"^[a-z]+\*+$", s)

Python RegEx Meaning

I'm new to python regular expressions and was wondering if someone could help me out by walking me through what this means (I'll state what I think each bit means here as well).
Thanks!
RegExp:
r'(^.*def\W*)(\w+)\W*\((.*)\):'
r'...' = python definition of regular expression within the ''
(...) = a regex term
(^. = match the beginning of any character
*def\W* = ???
(\w+) = match any of [a, z] 1 or more times
\W*\ = ? i think its the same as the line above this but from 0+ more times instead of 1 but since it matches the def\W line above (which i dont really know the meaning of) i'm not sure.
((.*)\): = match any additional character within brackets ()
thanks!
It seems like a failed attempt to match a Python function signature:
import re
regex = re.compile(r""" # r'' means that \n and the like is two chars
# '\\','n' and not a single newline character
( # begin capturing group #1; you can get it: regex.match(text).group(1)
^ # match begining of the string or a new line if re.MULTILINE is set
.* # match zero or more characters except newline (unless
# re.DOTALL is set)
def # match string 'def'
\W* # match zero or more non-\w chars i.e., [^a-zA-Z0-9_] if no
# re.LOCALE or re.UNICODE
) # end capturing group #1
(\w+) # second capturing group [a-zA-Z0-9_] one or more times if
# no above flags
\W* # see above
\( # match literal paren '('
(.*) # 3rd capturing group NOTE: `*` is greedy `.` matches even ')'
# therefore re.match(r'\((.*)\)', '(a)(b)').group(1) == 'a)(b'
\) # match literal paren ')'
: # match literal ':'
""", re.VERBOSE|re.DEBUG)
re.DEBUG flag causes the output:
subpattern 1
at at_beginning
max_repeat 0 65535
any None
literal 100
literal 101
literal 102
max_repeat 0 65535
in
category category_not_word
subpattern 2
max_repeat 1 65535
in
category category_word
max_repeat 0 65535
in
category category_not_word
literal 40
subpattern 3
max_repeat 0 65535
any None
literal 41
literal 58
more
r'..' = Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'.
(...) = a capture group which stores the captured value in a var to be used in replacement/mathing.
^ = the start of the string.
.* = 0 or more chars of any type.
def = the literal string def
\W* = 0 or more non word chars (Anything other than a-zA-Z or _)
\w+ = 1 or more word chars (see above)
\( = escapes the (, therefore means a literal (
\) = same as above.
: = literal :
PS: I like the effort that you made to try to understand the regex. It will serve you well, much better than people asking what does r'(^.*def\W*)(\w+)\W*\((.*)\):' mean.
r'...' = python definition of regular expression within the ''
The r'' syntax has nothing to with regular expressions (or at least, not directly). The r stands for raw and is simply an indicator to Python that no string interpolation should be performed on the string.
This is often used with regular expressions so that you don't have to escape backslash (\) characters, which would otherwise be eaten by the normal string interpolation mechanism.
(^. = match the beginning of any character
I'm not sure what "beginning of any character" means. The ^ character matches the beginning of a line.
def\W = ???
def matches the characters def. For \W, take a look at pydoc re, which describes the regular expression language.
\W*
As above.
Other than the above, your interpretation seems largely correct.
r'(^.*def\W\*)(\w+)\W*((.*)):'
==================================
r' tells python this is a raw string so you don't have to double escape all the \
^ match start of string
() in each case this means group this match where each () is a different group
.* match zero or more of any characters
def match the literal 'def'
\W* match zero or more of any non word character
() more grouping of the contained expression
\w+ match one or more of word character
\W* zero or more of any non word character
\( escape the left paren
() more grouping of the contained expression
.* zero of more of any character
\) escape the right paren
: match a single colon literal
This looks like it is trying to match a python method definition. Here is a link to play with this regular expression. Yes it is powered by Ruby, but the syntax is pretty much the same across all languages, I use this site to test regexes for Python, Java and Ruby.
Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it.
( ) groups a statement, it's treated as ONE thing, so you can do ()? or ()* or ()+ if the things in the brackets need to handled together.
^ matches it if it's the beginning of the string
(Dot.) In the default mode, this matches any character except a newline.
Since it is ".*" the * means match 0 or more matches of the previous thing, which in this case is any character.
def\W* - lines beginning with the string "def" and then \W matches any non-alphanumeric characters and is equivalent to [^a-zA-Z0-9_]. Since we have the * again, this time it matches 0 or more non-alphanumeric characters.
(\w+), the + is for 1 or more of the previous, which in this case is \w, equivalent to [a-zA-Z0-9_].
7.\W*, we know that one already.
"(" - means just to match "(", differentiate it from () grouping things, same for ")/"
(.*) - match 0 or more characters inside.
: - the matched string ends with a colon at the end.
The whole thing seems to be matching a definition of a function in python i.e. "def foo(x):" will be matched. Dealing with regular expressions is hard - using tools such as http://www.pythonregex.com/ helps me to try out different things. And since RE are slightly different in different languages, it's nice to have tools for those too.
r'...' → the preferred way to define the string of a regular expression in Python
(...) → regex term
^ → matches only with the beginning of the string
So, in the first pair of parenthesis (^.def\W), firstly it matches with the string.
. → matches any character
* → repeat the previous match 0 or more times
Then .* will match anything any number of times. The following 'def' is an exact match, that only matches with itself.
\W → matches anything that is NOT a letter, nor a number, nor the underscore character.
Then \W* will match zero or more of these non-letter-number-underscore characters. The next pair of parenthesis (\w+) you got it right. In the last part \W*\((.*)\): the initial \W* means the same thing as the previous \W* . Next, \( matches with ( , then there is the group (.*) which means, as previously, anything any number of times, followed by \): that matches ): .
An example of string that is matched by this regular expression is:
thing_def = function_name (123 anything in here):

Categories

Resources