Say I have a string
"3434.35353"
and another string
"3593"
How do I make a single regular expression that is able to match both without me having to set the pattern to something else if the other fails? I know \d+ would match the 3593, but it would not do anything for the 3434.35353, but (\d+\.\d+) would only match the one with the decimal and return no matches found for the 3593.
I expect m.group(1) to return:
"3434.35353"
or
"3593"
You can put a ? after a group of characters to make it optional.
You want a dot followed by any number of digits \.\d+, grouped together (\.\d+), optionally (\.\d+)?. Stick that in your pattern:
import re
print(re.match(r"(\d+(\.\d+)?)", "3434.35353").group(1))
3434.35353
print(re.match(r"(\d+(\.\d+)?)", "3434").group(1))
3434
This regex should work:
\d+(\.\d+)?
It matches one or more digits (\d+), optionally followed by a dot and one or more digits ((\.\d+)?).
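As a minimal sketch of how the optional group behaves on both inputs from the question: the name NUMBER, the non-capturing group, and the use of re.fullmatch are illustrative choices here, not part of the answer above.

```python
import re

# Assumption: a non-capturing group (?:...) is used for the fractional part,
# so the only capturing group, group(1), is the whole number.
NUMBER = re.compile(r"(\d+(?:\.\d+)?)")

for text in ("3434.35353", "3593"):
    m = NUMBER.fullmatch(text)  # fullmatch requires the entire string to match
    print(m.group(1))
```

With fullmatch, a trailing dot with no digits after it (such as "3593.") is rejected, because the optional group needs at least one digit after the dot.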
Use the "one or zero" quantifier, ?. Your regex becomes: (\d+(\.\d+)?).
See Chapter 8 of the TextWrangler manual for more details about the different quantifiers available, and how to use them.
Use (?:<characters>|), replacing <characters> with the string you want to make optional. I tested this in the Python shell and got the following result:
>>> s = re.compile('python(?:3|)')
>>> s
re.compile('python(?:3|)')
>>> re.match(s, 'python')
<re.Match object; span=(0, 6), match='python'>
>>> re.match(s, 'python3')
<re.Match object; span=(0, 7), match='python3'>
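For comparison, an empty alternative such as (?:3|) should behave the same as the ? quantifier for this case. A small sketch, with variable names chosen only for illustration:

```python
import re

with_empty_alt = re.compile(r"python(?:3|)")  # "3" or nothing
with_question = re.compile(r"python3?")       # "3" zero or one time

for text in ("python", "python3", "python33"):
    a = with_empty_alt.fullmatch(text) is not None
    b = with_question.fullmatch(text) is not None
    print(text, a, b)
```

Both patterns accept "python" and "python3" and reject "python33" when the whole string must match.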
Read up on the Python RegEx library. The link answers your question and explains why.
However, to match a digit followed by more digits with an optional decimal, you can use
re.compile(r"(\d+(\.\d+)?)")
In this example, the ? after the (\.\d+) capture group specifies that this portion is optional.
I'm trying to match a list of keywords I have, taking care to include all Latin characters (e.g accented).
Here's an example
import regex as re
p = r'((?!\pL)|^)blah((?!\pL)|$)'
print(re.search(p, "blah u"))
print(re.search(p, "blahé u"))
print(re.search(p, "éblah u"))
print(re.search(p, "blahaha"))
gives:
<regex.Match object; span=(0, 4), match='blah'>
None
None
None
Which looks correct. However:
print(re.search(p, "u blah"))
gives:
None
This is wrong, as I expect a match for "u blah".
I've also tried to use Python's built-in re module, but I cannot get it to work with \pL or \p{Latin} due to "bad-escape" errors. I've also tried to use unicode strings (without the "r"), but despite adding slashes to \\\\pL, I just can't get this to work right.
Note: I'm using Python 3.8
The problem with your ((?!\pL)|^)blah((?!\pL)|$) regex is that the ((?!\pL)|^) group contains two alternatives, and the first one can never succeed. (?!\pL) is a negative lookahead that fails the match if the next char is a letter, and the next char to match is always the b in blah. Only the ^ alternative can succeed, so your regex is equivalent to ^blah((?!\pL)|$) and only matches at the start of the string.
Note (?!\pL) already matches a position at the end of string, so ((?!\pL)|$) = (?!\pL).
You should use
(?<!\pL)blah(?!\pL)
See the regex demo (switched to PCRE for the demo purposes).
Note that the re-compatible version of the regex is
(?<![^\W\d_])blah(?![^\W\d_])
See the regex demo.
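A quick check of the re-compatible pattern against the strings from the question, as a sketch using only the standard library. The accented letter is written as the escape \u00e9 (precomposed e-acute) to avoid any ambiguity about its encoding; the names pattern, samples and results are illustrative.

```python
import re

# [^\W\d_] means "a word character that is not a digit or underscore",
# i.e. a letter, so the lookarounds reject letters on either side of "blah".
pattern = re.compile(r"(?<![^\W\d_])blah(?![^\W\d_])")

samples = ["blah u", "blah\u00e9 u", "\u00e9blah u", "blahaha", "u blah"]
results = {s: pattern.search(s) is not None for s in samples}

for s, matched in results.items():
    print(repr(s), matched)
```

Only "blah u" and "u blah" should report a match.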
Please accept my apology if this is a dumb question.
I want to make a regex expression that can make the two following changes in Python.
$12345.67890 to 12345.67
$12345 to 12345
What would be an appropriate regex expression to make both changes?
Thank you in advance.
We can try using re.sub here:
import re
inp = "Here is a value $12345.67890 for replacement."
out = re.sub(r'\$(\d+(?:\.\d{1,2})?)\d*\b', '\\1', inp)
print(out)
This prints:
Here is a value 12345.67 for replacement.
Here is an explanation of the regex pattern:
\$                match $
(                 capture what follows
\d+               match one or more whole number digits
(?:\.\d{1,2})?    then match an optional decimal component, with up to 2 digits
)                 close capture group (the output number you want)
\d*               consume any remaining decimal digits, to remove them
\b                until hitting a word boundary
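To check the pattern against both inputs from the question, here is a small sketch. The helper name strip_price is hypothetical and only wraps the same re.sub call.

```python
import re

PRICE = re.compile(r"\$(\d+(?:\.\d{1,2})?)\d*\b")

def strip_price(text):
    """Drop the leading $ and any decimal digits beyond the second (hypothetical helper)."""
    return PRICE.sub(r"\1", text)

print(strip_price("$12345.67890"))
print(strip_price("$12345"))
```

Note that this truncates the extra decimal digits rather than rounding them.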
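In Notepad++ style I would do something like:
Find: \$(\d+\.)(\d\d)\d+
Replace: \1\2
Note that this only matches values with three or more decimal digits, so $12345 is left unchanged.
Hope that helps.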
Given an input string, I want to match text which starts with {(P) and ends with (P)}, and I only want to capture the part in the middle. Can this be done with a single regular expression?
For example, in the following example, for the input string, I want to retrieve hello world part. Using Python 2.7.
python {(P)hello world(P)} java
You can try {\(P\)(.*)\(P\)}, and use parentheses in the pattern to capture everything between {(P) and (P)}:
import re
re.findall(r'{\(P\)(.*)\(P\)}', "python {(P)hello world(P)} java")
# ['hello world']
.* also matches unicode characters, for example:
import re
str1 = "python {(P)£1,073,142.68(P)} java"
str2 = re.findall(r'{\(P\)(.*)\(P\)}', str1)[0]
str2
# '\xc2\xa31,073,142.68'
print str2
# £1,073,142.68
You can use positive look-arounds to ensure that it only matches if the text is preceded and followed by the start and end tags. For instance, you could use this pattern:
(?<={\(P\)).*?(?=\(P\)})
See the demo.
(?<={\(P\)) - Look-behind expression stating that a match must be preceded by {(P).
.*? - Matches all text between the start and end tags. The ? makes the star lazy (i.e. non-greedy). That means it will match as little as possible.
(?=\(P\)}) - Look-ahead expression stating that a match must be followed by (P)}.
For what it's worth, lazy patterns are technically less efficient, so if you know that there will be no ( characters in the match, it would be better to use a negated character class:
(?<={\(P\))[^(]*(?=\(P\)})
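The difference between the lazy, greedy, and negated-class variants shows up when a string contains more than one tagged section. A sketch in Python 3 follows; the sample text is made up, and the braces are escaped here as a precaution, which is an assumption beyond the patterns given above.

```python
import re

text = "a {(P)one(P)} b {(P)two(P)} c"  # made-up sample with two tagged sections

# Same look-arounds in each pattern; only the middle part differs.
lazy = re.findall(r"(?<=\{\(P\)).*?(?=\(P\)\})", text)
greedy = re.findall(r"(?<=\{\(P\)).*(?=\(P\)\})", text)
negated = re.findall(r"(?<=\{\(P\))[^(]*(?=\(P\)\})", text)

print(lazy)
print(greedy)
print(negated)
```

The lazy and negated-class patterns return each section separately, while the greedy one runs from the first start tag to the last end tag.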
You can also do this without regular expressions:
s = 'python {(P)hello world(P)} java'
r = s.split('(P)')[1]
print(r)
# hello world
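The split approach raises an IndexError when the marker is missing. A slightly more defensive sketch without regular expressions could use str.partition; the function name between_tags and its default arguments are hypothetical.

```python
def between_tags(text, start="{(P)", end="(P)}"):
    """Return the text between the first start tag and the next end tag, or None."""
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None  # start tag missing
    middle, found_end, _ = rest.partition(end)
    return middle if found_end else None  # None when the end tag is missing

print(between_tags("python {(P)hello world(P)} java"))
```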
If I have a given string s in Python, is it possible to easily check if a regex matches the string starting at a specific position i in the string?
I would rather not slice the entire string from i to the end as it doesn't seem very scalable (ruling out re.match I think).
re.match doesn't support this directly. However, if you pre-compile your regular expression with re.compile (often a good idea anyway), the resulting RegexObject's match and search methods both take an optional pos parameter:
The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start.
Example:
import re
s = 'this is a test 4242 did you get it'
pat = re.compile('[a-zA-Z]+ ([0-9]+)')
print(pat.match(s, 10).group(0))
Output:
test 4242
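The difference between pos and slicing that the quoted documentation describes can be seen with an anchored pattern. A minimal sketch, with variable names chosen for illustration:

```python
import re

s = "abc123"
anchored = re.compile(r"^\d+")
plain = re.compile(r"\d+")

via_pos = anchored.match(s, 3)     # ^ still refers to the real start of s
via_slice = anchored.match(s[3:])  # slicing builds a new string that starts at "1"
plain_pos = plain.match(s, 3)      # without ^, matching at pos works as expected

print(via_pos)
print(via_slice.group(0), via_slice.span())
print(plain_pos.group(0), plain_pos.span())
```

Spans from the pos version are relative to the original string, while spans from the sliced version are relative to the slice.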
Although re.match does not support this, the new regex module (intended to replace the re module) has a treasure trove of new features, including pos and endpos arguments for search, match, sub, and subn. Although not official yet, the regex module can be pip installed and works for Python versions 2.5 through 3.4. Here's an example:
>>> import regex
>>> regex.match(r'\d+', 'abc123def')
>>> regex.match(r'\d+', 'abc123def', pos=3)
<regex.Match object; span=(3, 6), match='123'>
>>> regex.match(r'\d+', 'abc123def', pos=3, endpos=5)
<regex.Match object; span=(3, 5), match='12'>
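For comparison, a similar result can be had from the standard library by compiling the pattern first, since compiled re patterns accept pos and endpos as positional arguments to match and search. A sketch, with illustrative variable names:

```python
import re

digits = re.compile(r"\d+")
text = "abc123def"

no_pos = digits.match(text)             # match() only tries index 0
with_pos = digits.match(text, 3)        # start matching at index 3
with_endpos = digits.match(text, 3, 5)  # treat the string as ending at index 5

print(no_pos)
print(with_pos.group(0), with_pos.span())
print(with_endpos.group(0), with_endpos.span())
```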
Python metacharacter negation.
After scouring the net and writing a few different syntaxes I'm out of ideas.
Trying to rename some files. They have a year in the title e.g. [2002].
Some don't have the brackets, which I want to rectify.
So I'm trying to find a regex (that I can compile preferably) that in my mind looks something like (^[\d4^]) because I want the set of 4 numbers that don't have square brackets around them. I'm using the brackets in the hope of binding this so that I can then rename using something like [\1].
If you want to check for things around a pattern you can use lookahead and lookbehind assertions. These don't form part of the match but say what you expect to find (or not find) around it.
As we don't want brackets, we'll need to use a negative lookbehind and a negative lookahead.
A negative lookahead looks like this (?!...) where it matches if ... does not come next. Similarly a negative lookbehind looks like this (?<!...) and matches if ... does not come before.
Our example is made slightly more complicated because we're using [ and ], which themselves have meaning in regular expressions, so we have to escape them with \.
So we can build up a pattern as follows:
A negative lookbehind for [ - (?<!\[)
Four digits - \d{4}
A negative lookahead for ] - (?!\])
This gives us the following Python code:
>>> import re
>>> r = re.compile(r"(?<!\[)\d{4}(?!\])")
>>> r.match(" 2011 ")
>>> r.search(" 2011 ")
<_sre.SRE_Match object at 0x10884de00>
>>> r.search("[2011]")
To rename you can use the re.sub function or the sub function on your compiled pattern. To make it work you'll need to add an extra set of brackets around the year to mark it as a group.
Also, when specifying your replacement you refer to the group as \1 and so you have to escape the \ or use a raw string.
>>> r = re.compile(r"(?<!\[)(\d{4})(?!\])")
>>> name = "2011 - This Year"
>>> r.sub(r"[\1]",name)
'[2011] - This Year'
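Putting this together for a batch of names, here is a sketch. The function name add_brackets and the sample names are hypothetical, and the extra \d inside the lookarounds is an assumption added here so that longer digit runs are not partially bracketed; it is not part of the pattern above.

```python
import re

# Assumption: also refuse a digit on either side, so "123456" is left alone.
YEAR = re.compile(r"(?<![\[\d])(\d{4})(?![\]\d])")

def add_brackets(name):
    """Wrap a bare four-digit year in square brackets (hypothetical helper)."""
    return YEAR.sub(r"[\1]", name)

for name in ("2011 - This Year", "[2002] Film", "Film 2002.avi", "Serial 123456"):
    print(add_brackets(name))
```

Names that already have a bracketed year come back unchanged, so the function can be applied repeatedly without doubling the brackets.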