Given the following string:
"-local locally local test local."
my objective is to replace the string "local" with "we" such that the result becomes
"-local locally we test local."
So far (with help from the folks here at Stack Overflow: Python: find exact match) I have come up with the following regular expression:
variable='local'
re.sub(r'\b%s([\b\s])' %variable, r'we\1', "-local locally local test local.")
However, I have two problems with this code:
1. The search goes through the minus sign, and the output becomes
'-we locally we test local.'
where it should have been
'-local locally we test local.'
2. Searching for a string that starts with a minus sign, such as "-local", fails.
Try the following:
re.sub(r'(^|\s)%s($|\s)' % re.escape(variable), r'\1we\2', some_string)
The regex that was suggested in the other question is kind of odd, since \b in a character class means a backspace character.
Basically what you have now is a regex that searches for your target string with a word boundary at the beginning (going from a word character to a non-word character or vice versa), and a whitespace character at the end.
Since you don't want to match the final "local" (it is followed by a period), I don't think word boundaries are the way to go here. Instead, you should look for whitespace or the beginning/end of the string, which is what the above regex does.
I also used re.escape on the variable; that way, if your target string includes characters like . or $ that usually have special meanings, they will be escaped and interpreted as literal characters.
Examples:
>>> s = "-local locally local test local."
>>> variable = 'local'
>>> re.sub(r'(^|\s)%s($|\s)' % re.escape(variable), r'\1we\2', s)
'-local locally we test local.'
>>> variable = '-local'
>>> re.sub(r'(^|\s)%s($|\s)' % re.escape(variable), r'\1we\2', s)
'we locally local test local.'
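If the target itself contains a regex metacharacter such as ., re.escape keeps it literal; continuing the same session (my own example, not from the original answer):
>>> variable = 'local.'
>>> re.sub(r'(^|\s)%s($|\s)' % re.escape(variable), r'\1we\2', s)
'-local locally local test we'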
You could separate the string into substrings using the spaces as the delimiter, then check each substring, replace it if it matches what you are looking for, and recombine them.
Certainly not efficient though :)
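For instance, a minimal sketch of that idea (the helper name is my own, not from the answer):

def replace_word(text, target, replacement):
    # Split on single spaces, swap exact whole-word matches, rejoin.
    parts = text.split(' ')
    return ' '.join(replacement if part == target else part for part in parts)

print(replace_word("-local locally local test local.", "local", "we"))
# -local locally we test local.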
sed 's/ local / we /g' filename
I don't do Python, but the idea is to just put a space before and after local in the search pattern, and to include the spaces in the replacement as well.
If you just want to replace all occurrences of the word that are separated by spaces, you could split the string and operate on the resulting list:
search = "local"
replace = "we"
s = "-local locally local test local."
result = ' '.join([x if x != search else replace for x in s.split(" ")])
I have a string where there is an errant comma (',') in an IP address that should just be a period ('.'). The whole string is:
a = 'This is a test, which uses commas for a bad IP Address. 54.128,5,5, 4.'
In the above string, the IP address 54.128,5,5 should be 54.128.5.5
I tried to use re.sub(), as follows, but it doesn't seem to work...
def stripBadCommas(string):
    newString = re.sub(r'/(?<=[0-9]),(?<=[0-9])/i', '.', string)
    return newString
a = 'This is a test, which uses commas for a bad IP Address. 54.128,5,5, 4.'
b = ''
b = stripBadCommas(a)
print a
print b
MY QUESTION: What is the proper way to use Regular Expressions to search for and replace only the commas that are bounded by whole numbers/numerics with periods without disrupting the other appropriate commas and periods?
Thanks, in advance, for any assistance you can offer.
You may use
def stripBadCommas(s):
    newString = re.sub(r'(?<=[0-9]),(?=[0-9])', '.', s)
    return newString
Note that Python re patterns are not written using regex literal notation, so the / and /i are treated as part of the pattern itself. Moreover, the pattern needs no case-insensitive modifier, as it contains no letters.
Also, you used a second lookbehind, (?<=[0-9]), where there must be a positive lookahead, (?=[0-9]): the pattern ,(?<=[0-9]) can never match, because the , is consumed first and then the engine checks whether that , is a digit, which is always false.
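For example, running the corrected pattern over the string from the question replaces only the commas that sit between digits:
>>> import re
>>> a = 'This is a test, which uses commas for a bad IP Address. 54.128,5,5, 4.'
>>> re.sub(r'(?<=[0-9]),(?=[0-9])', '.', a)
'This is a test, which uses commas for a bad IP Address. 54.128.5.5, 4.'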
I am using a small function to loop over files so that any hyphens - get replaced by en-dashes – (alt + 0150).
The function I use adds some regex flavor to a solution in a related problem (how to replace a character INSIDE the text content of many files automatically?)
def mychanger(fileName):
    with open(fileName, 'r') as file:
        str = file.read()
    str = str.decode("utf-8")
    str = re.sub(r"[^{]{1,4}(-)", "–", str).encode("utf-8")
    with open(fileName, 'wb') as file:
        file.write(str)
I used the regular expression [^{]{1,4}(-) because the search is actually performed on latex regression tables and I only want to replace the hyphens that occur around numbers.
To be clear: I want to replace all hyphens EXCEPT in cases where we have genuine latex code such as \cmidrule(lr){2-4}.
In this case there is a { close to the hyphen (within 3-4 characters at most) and to the left of it. Of course, this hyphen should not be changed into an en-dash, otherwise the latex code will break.
I think the left-side condition of the exclusion is important for writing the correct exception in the regex. Indeed, in a regression table you can have things like -0.062\sym{***} (that is, a { immediately to the right of the hyphen), and in that case I do want to replace the hyphen.
A typical line in my table is
variable & -2.061\sym{***}& 4.032\sym{**} & 1.236 \\
& (-2.32) & (-2.02) & (-0.14)
However, my regex does not appear to be correct. For instance, a (-1.2) will be replaced as –1.2, dropping the parenthesis.
What is the problem here?
Thanks!
I can offer the following two step replacement:
str = "-1 Hello \cmidrule(lr){2-4} range 1-5 other stuff a-5"
str = re.sub(r"((?:^|[^{])\d+)-(\d+[^}])","\\1$\\2", str).encode("utf-8")
str = re.sub(r"(^|[^0-9])-(\d+)","\\1$\\2", str).encode("utf-8")
print(str)
The first replacement targets all ranges which are not of the LaTex form {1-9} i.e. are not contained within curly braces. The second replacement targets all numbers prepended with a non number or the start of the string.
re.sub replaces the entire match. In this case that includes the non-{ character preceding your -. You can wrap that bit in parentheses to create a \1 group and include that group in your substitution (you also don't need the parentheses around your -):
re.sub(r"([^{]{1,4})-",r"\1–", str)
I have an input string for e.g:
input_str = 'this is a test for [blah] and [blah/blahhhh]'
and I want to retain [blah] but want to remove [blah/blahhhh] from the above string.
I tried the following codes:
>>>re.sub(r'\[.*?\]', '', input_str)
'this is a test for and '
and
>>>re.sub(r'\[.*?\/.*?\]', '', input_str)
'this is a test for '
what should be the right regex pattern to get the output as "this is a test for [blah] and"?
I didn't understand why your 2nd regex wouldn't work, so I tested it: yes, you are correct, it doesn't work. You can use the same idea, though, with a different approach.
Instead of using the wildcards you can use \w like this:
\[\w+\/\w+\]
By the way, if you can have non-word characters separated by the /, then you can use this regex:
\[[^\]]*\/[^\]]*]
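A quick check of both suggested patterns against the question's input:
>>> import re
>>> input_str = 'this is a test for [blah] and [blah/blahhhh]'
>>> re.sub(r'\[\w+\/\w+\]', '', input_str)
'this is a test for [blah] and '
>>> re.sub(r'\[[^\]]*\/[^\]]*]', '', input_str)
'this is a test for [blah] and '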
The reason the second regex in the original post matches more than the OP wants is that . matches any character, including ]. So \[.*?\/ (or just \[.*?/, since the \ before the / is superfluous) matches more than it seems the OP wanted: [blah] and [blah/ in input_str.
The ? adds confusion. It limits repetition of the .* part of the .*\] sub-expression, but you have to understand which repetition you are limiting [1]. It's better to explicitly match any non-closing-bracket character instead of the . wildcard to begin with.
So-called "greedy" matching of .* is often a stumbling block, since it matches zero or more occurrences of any character until the wildcard match fails (usually much more than people expect). In your case it greedily matches as much of the input as possible, up to the last occurrence of the next explicitly specified part of the regex (] or / in your regexes). Instead of using ? to counteract or limit greedy matching with lazy matching, it is often better to be explicit about what the greedy part should not match.
As an illustration, see the following example of .* grabbing everything until the last occurrence of the character after .*:
echo '////k////,/k' | sed -r 's|/.*/|XXX|'
XXXk
echo '////k////,/k' | sed -r 's|/(.*)?/|XXX|'
XXXk
And subtleties of greedy / lazy matching behavior can vary from one regex implementation to the next (pcre, python, grep/egrep). For portability and simplicity / clarity, be explicit when you can.
If you only want to look for strings with brackets that don't include a closing bracket character before the slash character, you could more explicitly look for "not-a-closing-bracket" instead of the wildcard match:
>>> re.sub(r'\[[^]]*/[^]]*\]', '', input_str)
'this is a test for [blah] and '
This uses a character class expression - [^]] - instead of the wildcard . to match any character that is explicitly not a closing bracket.
If it's "legal" in your input stream to have one or more closing brackets within enclosing brackets (before the slash), then things get more complicated since you have to determine if it's just a stray bracket character or the start of a nested sub-expression. That's starting to sound more like the job of a token parser.
Depending on what you are trying to really achieve (I assume this is just a dummy example of something that is probably more complex) and what is allowed in the input, you may need something more than my simple modification above. But it works for your example anyway.
[1] http://www.regular-expressions.info/repeat.html
You can write a function that takes input_str as an argument and loops through the string; if it sees a '/' between a '[' and a ']', it jumps back to the position of the '[' and removes all characters up to and including the ']'. A rough sketch of that idea is below.
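A minimal sketch of that loop (the function name is my own, not from the answer):

def drop_bracketed_with_slash(s):
    # Walk the string; when a [...] group contains a '/', skip the
    # whole group instead of copying it to the output.
    out = []
    i = 0
    while i < len(s):
        if s[i] == '[':
            end = s.find(']', i)
            if end != -1 and '/' in s[i:end + 1]:
                i = end + 1  # jump past the closing ']'
                continue
        out.append(s[i])
        i += 1
    return ''.join(out)

print(drop_bracketed_with_slash('this is a test for [blah] and [blah/blahhhh]'))
# this is a test for [blah] and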
I'm trying to get a python regex sub function to work but I'm having a bit of trouble. Below is the code that I'm using.
string = 'á:tdfrec'
newString = re.sub(ur"([aeioäëöáéíóàèìò])([aeioäëöáéíóúàèìò]):", ur"\1:\2", string)
#newString = re.sub(ur"([a|e|i|o|ä|ë|ö|á|é|í|ó|à|è|ì|ò])([a|e|i|o|ä|ë|ö|á|é|í|ó|ú|à|è|ì|ò]):", ur"\1:\2", string)
print newString
# a:́tdfrec is printed
So the above code is not working the way I intend. It's not displaying correctly here, but in the printed string the acute accent is over the :. The regex statement is moving the acute accent from over the a to over the :. For the string that I'm declaring, this regex is not supposed to be applied at all. My intention is for this regex statement to be applied only to examples like the following:
aä:dtcbd becomes a:ädtcbd
adfseì:gh becomes adfse:ìgh
éò:fdbh becomes é:òfdbh
but my regex statement is being applied, and I don't want it to be. I think my problem is that the second character set followed by the : (i.e. á:) is what's causing the regex statement to be applied. I've been staring at this for a while and have tried a few other things, and I feel like this should work, but I'm missing something. Any help is appreciated!
The following code with the re.UNICODE flag also doesn't achieve the desired output:
>>> import re
>>> original = u'á:tdfrec'
>>> pattern = re.compile(ur"([aeioäëöáéíóàèìò])([aeioäëöáéíóúàèìò]):", re.UNICODE)
>>> print pattern.sub(ur'\1:\2', original)
á:tdfrec
Is it because of the diacritic, as in the Tony the Pony example with les misérable? The diacritic is on the wrong character after reversing it:
>>> original = u'les misérable'
>>> print ''.join([i for i in reversed(original)])
elbarésim sel
edit: Definitely an issue with the combining diacritics, you need to normalize both the regular expression and the strings you are trying to match. For example:
import re
import unicodedata
regex = unicodedata.normalize('NFC', ur'([aeioäëöáéíóàèìò])([aeioäëöáéíóúàèìò]):')
string = unicodedata.normalize('NFC', u'aä:dtcbd')
newString = re.sub(regex, ur'\1:\2', string)
Here is an example that shows why you might hit an issue without the normalization. The string u'á' could either be the single code point LATIN SMALL LETTER A WITH ACUTE (U+00E1) or it could be two code points, LATIN SMALL LETTER A (U+0061) followed by COMBINING ACUTE ACCENT (U+0301). These will probably look the same, but they will have very different behaviors in a regex, because you can match the combining accent as its own character. That is what is happening here with the string 'á:tdfrec': a regular 'a' is captured in group 1, and the combining diacritic is captured in group 2.
By normalizing both the regex and the string you are matching you ensure this doesn't happen, because the NFC normalization will replace the diacritic and the character before it with a single equivalent character.
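A quick way to see the two forms (my own illustration): the decomposed string is two code points, and NFC collapses it back to one:
>>> import unicodedata
>>> decomposed = u'a\u0301'  # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT
>>> len(decomposed)
2
>>> len(unicodedata.normalize('NFC', decomposed))
1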
Original answer below.
I think your issue here is that the string you are attempting to do the replacement on is a byte string, not a Unicode string.
If these are string literals make sure you are using the u prefix, e.g. string = u'aä:dtcbd'. If they are not literals you will need to decode them, e.g. string = string.decode('utf-8') (although you may need to use a different codec).
You should probably also normalize your string, because part of the issue may have something to do with combining diacritics.
Note that in this case the re.UNICODE flag will not make a difference, because that only changes the meaning of character class shorthands like \w and \d. The important thing here is that if you are using a Unicode regular expression, it should probably be applied to a Unicode string.
In Python, I am extracting emails from a string like so:
split = re.split(" ", string)
emails = []
pattern = re.compile("^[a-zA-Z0-9_\.-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-\.]+$");
for bit in split:
    result = pattern.match(bit)
    if(result != None):
        emails.append(bit)
And this works, as long as there is a space in between the emails. But this might not always be the case. For example:
Hello, foo@foo.com
would return:
foo@foo.com
but, take the following string:
I know my best friend mailto:foo@foo.com!
This would return null. So the question is: how can I make it so that a regex is the delimiter to split? I would want to get
foo@foo.com
in all cases, regardless of punctuation next to it. Is this possible in Python?
By "splitting by regex" I mean that if the program encounters the pattern in a string, it will extract that part and put it into a list.
I'd say you're looking for re.findall:
>>> email_reg = re.compile(r'[a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
>>> email_reg.findall('I know my best friend mailto:foo@foo.com!')
['foo@foo.com']
Notice that findall can handle more than one email address:
>>> email_reg.findall('Text text foo@foo.com, text text, baz@baz.com!')
['foo@foo.com', 'baz@baz.com']
Use re.search or re.findall.
You also need to escape your expression properly (. needs to be escaped outside of character classes, not inside) and remove or replace the anchors ^ and $ (for example with \b), e.g.:
r"\b[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\b"
The problem I see in your regex is your use of ^, which matches the start of a string, and $, which matches the end. If you remove them and run it against your sample test cases, it will work:
>>> re.findall("[A-Za-z0-9\._-]+@[A-Za-z0-9-]+.[A-Za-z0-9-\.]+","I know my best friend mailto:foo@foo.com!")
['foo@foo.com']
>>> re.findall("[A-Za-z0-9\._-]+@[A-Za-z0-9-]+.[A-Za-z0-9-\.]+","Hello, foo@foo.com")
['foo@foo.com']