regex: don't match number preceded by certain character - python

The following code extracts the first sequence of digits that appears in a string:
num = re.findall(r'^\D*(\d+)', string)
I'd like to change the regular expression so that it doesn't match numbers preceded by v or V.
Example:
string = 'foobarv2_34 423_wd'
Output: '34'

If you need to get the first match, you need to use re.search, not re.findall.
In this case, you can use a simpler regular expression like (?<!v)\d+ with re.I:
import re
m = re.search(r'(?<!v)\d+', 'foobarv2_34 423_wd', re.I)
if m:
    print(m.group()) # => 34
See the Python demo.
Details
(?<!v) - a negative lookbehind that fails the match if there is a v (or V since re.I is used) immediately to the left of the current location
\d+ - one or more digits.
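One caveat worth checking, as a minimal sketch: the lookbehind only inspects the single character to the left, so when a multi-digit number follows v (say v23), the match can start at the second digit. Adding \d to the lookbehind class is one possible way around that. The sample string 'foobarv23_34' and the variable names below are made up for illustration.

```python
import re

sample = 'foobarv23_34'  # hypothetical input: a two-digit number right after "v"

# Only "v" is excluded: the match starts at "3", which is preceded by "2", not "v".
loose = re.search(r'(?<!v)\d+', sample, re.I).group()

# Excluding a preceding digit as well forces the match to begin at the
# start of a digit run that does not follow "v".
strict = re.search(r'(?<![v\d])\d+', sample, re.I).group()

print(loose)   # 3
print(strict)  # 34
```

For the string from the question, both patterns return 34, so the stricter class only matters when the skipped number has more than one digit.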
If you cannot use re.search for some reason, you can use
^.*?(?<!v)(\d+)
See this regex demo. Note that \D* (zero or more non-digits) is replaced with .*? that matches zero or more chars other than line break chars as few as possible (with re.S or re.DOTALL, it will also match line breaks) since there is a need to match all digits not preceded with v.
More details:
^ - start of string
.*? - zero or more chars other than line break chars as few as possible
(?<!v) - a negative lookbehind that fails the match if there is a v (or V since re.I is used) immediately to the left of the current location
(\d+) - Group 1: one or more digits.

Related

How to extract string between space and symbol '>'?

String 1:
[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97
String 2:
[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17
In string 1, I want to extract CAR<7:5 and BIKE<4:0,
In string 2, I want to extract CAKE<4:0
Any regex for this in Python?
You can use \w+<[^>]+
\w matches any word character (equivalent to [a-zA-Z0-9_])
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy).
< matches the character <
[^>] Match a single character not present in the list
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
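As a quick check of the pattern described above, here is a small sketch that runs \w+<[^>]+ through re.findall on the two strings from the question. The variable names are illustrative only.

```python
import re

strings = [
    "[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97",
    "[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17",
]

# One list of matches per input string; [^>]+ stops right before the
# closing ">", so that character is never part of a match.
results = [re.findall(r'\w+<[^>]+', s) for s in strings]
print(results)  # [['CAR<7:5', 'BIKE<4:0'], ['CAKE<4:0']]
```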
We can use re.findall here with the pattern (\w+<.*?)>:
import re
inp = ["[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97", "[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17"]
for i in inp:
    matches = re.findall(r'(\w+<.*?)>', i)
    print(matches)
This prints:
['CAR<7:5', 'BIKE<4:0']
['CAKE<4:0']
In the first example, the BIKE part has no leading space but a pipe char.
A slightly more precise pattern asserts either a space or a pipe to the left, matches the digits separated by a colon, and asserts the > to the right.
(?<=[ |])[A-Z]+<\d+:\d+(?=>)
In parts, the pattern matches:
(?<=[ |]) Positive lookbehind, assert either a space or a pipe directly to the left
[A-Z]+ Match 1+ chars A-Z
<\d+:\d+ Match <, then 1+ digits, a colon, and 1+ digits
(?=>) Positive lookahead, assert > directly to the right
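The parts listed above can be tried out with re.findall. This is only a sketch using the first string from the question; the variable names are illustrative.

```python
import re

line = "[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97"

# Lookarounds assert the surrounding characters without consuming them,
# so the space, the pipe and the ">" stay out of the results.
fields = re.findall(r'(?<=[ |])[A-Z]+<\d+:\d+(?=>)', line)
print(fields)  # ['CAR<7:5', 'BIKE<4:0']
```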
Or the capture group variant:
(?:[ |])([A-Z]+<\d+:\d+)>

RegEx for matching two digits and everything except new lines and dot

Using Python 3, I'm trying to find a string only if it contains a number of one or two digits (and no more than that in the same number), along with everything that follows it. The match stops at a period or a newline.
\d{1,2}[^.\n]+ is almost right except it returns numbers greater than two digits.
For example:
"5+years {} experience. stop.
10 asdasdas . 255
1abc1
5555afasfasf++++s()(jn."
Should return:
5+years {} experience
10 asdasdas
1abc1
Based on your description and sample data, you can use the following regex to match the intended strings and discard the others:
^\d[^\d.]*\d?[^\d.\n]*(?=\.|$)
Regex Explanation:
^ - Start of line
\d - Matches a digit
[^\d.]* - This matches any character other than digit or dot zero or more times. This basically allows optionally matching of non-digit non-dot characters.
\d? - As you want to allow one or two digits, this is the second digit which is optional hence \d followed by ?
[^\d.\n]* - This matches any character other than digit or dot or newline
(?=\.|$) - This positive lookahead ensures the match is followed by either a dot or the end of a line
Also, note that multiline mode is enabled, since ^ and $ need to match the start and end of each line.
Code:
import re
s = '''5+years {} experience. stop.
10 asdasdas . 255
1abc1
5555afasfasf++++s()(2jn.'''
print(re.findall(r'(?m)^\d[^\d.]*\d?[^\d.\n]*(?=\.|$)', s))
Prints:
['5+years {} experience', '10 asdasdas ', '1abc1']
If matching lines don't necessarily start with a digit, you can use the regex below. Take Group 1 if the captured string must start with a digit; take the whole match if the leading non-digit characters should be included.
^[^\d\n]*(\d[^\d.]*\d?[^\d.\n]*)(?=\.|$)
Regex Explanation:
^ - Start of line
[^\d\n]* - Allows zero or more non-digit characters before first digit
( - Starts first grouping pattern to capture the string starting with first digit
\d - Matches a digit
[^\d.]* - This matches any character other than digit or dot zero or more times. This basically allows optionally matching of non-digit non-dot characters.
\d? - As you want to allow one or two digits, this is the second digit which is optional hence \d followed by ?
[^\d.\n]* - This matches any character other than digit or dot or newline
) - End of first capturing pattern
(?=\.|$) - This positive lookahead ensures the match is followed by either a dot or the end of a line
Multiline mode is required. Enable it either by placing the inline modifier (?m) at the start of the regex or by passing re.MULTILINE as the flags argument (the third argument of re.search or re.findall).
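The two ways of turning on multiline mode should behave identically. A minimal sketch comparing them on a shortened version of the sample text (the variable names are illustrative):

```python
import re

text = '''5+years {} experience. stop.
10 asdasdas . 255
1abc1'''

pattern = r'^\d[^\d.]*\d?[^\d.\n]*(?=\.|$)'

inline = re.findall('(?m)' + pattern, text)        # inline modifier
by_flag = re.findall(pattern, text, re.MULTILINE)  # flags argument

print(inline == by_flag)  # True
print(by_flag)            # ['5+years {} experience', '10 asdasdas ', '1abc1']
```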
Code:
import re
s = '''5+years {} experience. stop.
10 asdasdas . 255
1abc1
aaa1abc1
aa2aa1abc1
5555afasfasf++++s()(2jn.'''
print(re.findall(r'(?m)^[^\d\n]*(\d[^\d.]*\d?[^\d.\n]*)(?=\.|$)', s))
Prints:
['5+years {} experience', '10 asdasdas ', '1abc1', '1abc1']

Regex for a third-person verb

I'm trying to create a regex that matches a third person form of a verb created using the following rule:
If the verb ends in e not preceded by i,o,s,x,z,ch,sh, add s.
So I'm looking for a regex matching a word consisting of some letters, then not i,o,s,x,z,ch,sh, and then "es". I tried this:
\b\w*[^iosxz(sh)(ch)]es\b
According to regex101 it matches "likes", "hates" etc. However, it does not match "bathes", why doesn't it?
You may use
\b(?=\w*(?<![iosxz])(?<![cs]h)es\b)\w*
See the regex demo
Since Python re does not support variable length alternatives in a lookbehind, you need to split the conditions into two lookbehinds here.
Pattern details:
\b - a leading word boundary
(?=\w*(?<![iosxz])(?<![cs]h)es\b) - a positive lookahead requiring a sequence of:
\w* - 0+ word chars
(?<![iosxz]) - there must not be i, o, s, x, z chars right before the current location and...
(?<![cs]h) - no ch or sh right before the current location...
es - followed with es...
\b - at the end of the word
\w* - zero or more (maybe + is better here to match 1 or more) word chars.
See Python demo:
import re
r = re.compile(r'\b(?=\w*(?<![iosxz])(?<![cs]h)es\b)\w*')
s = 'it matches "likes", "hates" etc. However, it does not match "bathes", why doesn\'t it?'
print(re.findall(r, s))
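If you want to match an e that is not preceded by i, o, s, x, z, ch or sh, a single lookbehind such as (?<!i|o|s|x|z|ch|sh)e works in engines like PCRE, but Python's re rejects it because the alternatives differ in length. In Python, split it into two fixed-width lookbehinds:
(?<![iosxz])(?<![cs]h)e
Your regex [^iosxz(sh)(ch)] is a single negated character class: the ^ negates the set, and every other character between the brackets, including the parentheses, is just a member of the set (duplicates are ignored). So it's equivalent to:
[^iosxzch()]
which actually means: "match any one character that is not one of i, o, s, x, z, c, h, ( or )". Since h is in the set, the h before es in "bathes" is rejected, which is why that word does not match.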
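To see how the character class behaves in practice, here is a small sketch. It compares the class from the question with a flattened class that lists each character once, and it also checks that re refuses a lookbehind whose alternatives have different lengths. The word list and variable names are made up for illustration.

```python
import re

original = re.compile(r'\b\w*[^iosxz(sh)(ch)]es\b')
flattened = re.compile(r'\b\w*[^iosxzch()]es\b')  # same set, each char listed once

words = ['likes', 'hates', 'bathes', 'watches', 'goes']
for w in words:
    # Both classes accept and reject exactly the same words.
    print(w, bool(original.fullmatch(w)), bool(flattened.fullmatch(w)))

# A lookbehind whose alternatives differ in length is rejected by re.
try:
    re.compile(r'(?<!ch|s)es')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False
print(lookbehind_ok)  # False
```

Only likes and hates pass: bathes and watches are rejected because of the h, and goes because of the o.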

Match first parenthesis with Python

From a string such as
70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30
I want to get the first parenthesized content linux;u;android4.2.1;zh-cn.
My code looks like this:
s=r'70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30'
re.search("(\d+)\s.+\((\S+)\)", s).group(2)
but the result is the last brackets' contents khtml,likegecko.
How to solve this?
The main issue you have is the greedy dot matching .+ pattern. It grabs the whole string you have, and then backtracks, yielding one character from the right at a time, trying to accommodate for the subsequent patterns. Thus, it matches the last parentheses.
You can use
^(\d+)\s[^(]+\(([^()]+)\)
See the regex demo. Here, the [^(]+ restricts the matching to the characters other than ( (so, it cannot grab the whole line up to the end) and get to the first pair of parentheses.
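The difference between the greedy dot and the negated character class can be seen side by side. A minimal sketch using the string from the question (variable names are illustrative):

```python
import re

s = ('70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30'
     '(khtml,likegecko)version/4.0mobilesafari/534.30')

# Greedy .+ runs to the end of the string and backtracks to the LAST "(".
greedy = re.search(r'(\d+)\s.+\((\S+)\)', s).group(2)

# [^(]+ cannot cross a "(", so the match stops at the FIRST one.
negated = re.search(r'(\d+)\s[^(]+\(([^()]+)\)', s).group(2)

print(greedy)   # khtml,likegecko
print(negated)  # linux;u;android4.2.1;zh-cn
```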
Pattern explanation:
^ - string start (NOTE: If the number appears not at the start of the string, remove this ^ anchor)
(\d+) - Group 1: 1 or more digits
\s - a whitespace (if it is not a required character, it can be removed since the subsequent negated character class will match the space)
[^(]+ - 1+ characters other than (
\( - a literal (
([^()]+) - Group 2 matching 1+ characters other than ( and )
\) - closing ).
Here is the IDEONE demo:
import re
p = re.compile(r'^(\d+)\s[^(]+\(([^()]+)\)')
test_str = "70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30"
print(p.findall(test_str))
# or using re.search if the number is not at the beginning of the string
m = re.search(r'(\d+)\s[^(]+\(([^()]+)\)', test_str)
if m:
    print("Number: {0}\nString: {1}".format(m.group(1), m.group(2)))
# [('70849', 'linux;u;android4.2.1;zh-cn')]
# Number: 70849
# String: linux;u;android4.2.1;zh-cn
You can use a negated class \(([^)]*)\) to match anything between ( and ):
>>> s=r'70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30'
>>> m = re.search(r"(\d+)[^(]*\(([^)]*)\)", s)
>>> print(m.group(1))
70849
>>> print(m.group(2))
linux;u;android4.2.1;zh-cn

Python regex matching all but last occurrence

I have an expression such as "./folder/thisisa.test/file.cxx.h". How do I substitute/remove all the dots except the last one?
To match all but the last dot with a regex:
'\.(?=[^.]*\.)'
This uses a lookahead to check that there's another dot after the one we found (the lookahead is not part of the match).
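Applied with re.sub, that pattern might be used as in the following sketch (the variable names are illustrative):

```python
import re

path = "./folder/thisisa.test/file.cxx.h"

# Remove each dot that still has another dot somewhere after it.
cleaned = re.sub(r'\.(?=[^.]*\.)', '', path)
print(cleaned)  # /folder/thisisatest/filecxx.h
```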
Without regular expressions, using str.count and str.replace:
s = "./folder/thisisa.test/file.cxx.h"
s.replace('.', '', s.count('.')-1)
# '/folder/thisisatest/filecxx.h'
Specific one-char solution
In your current scenario, you may use
text = re.sub(r'\.(?![^.]*$)', '', text)
Here, \.(?![^.]*$) matches a . (with \.) that is not immediately followed ((?!...)) with any 0+ chars other than . (see [^.]*) up to the end of the string ($).
See the regex demo and the Python demo.
Generic solution for 1+ chars
If you need to handle . along with other chars, you may put the chars you need to match into a character class inside a capturing group, and add a positive lookahead with .* and a backreference to the captured value.
Say you need to remove all but the last occurrence of each of [, ], ^, \, /, - and .; you may use
([][^\\./-])(?=.*\1)
See the regex demo.
Details
([][^\\./-]) - a capturing group matching ], [, ^, \, ., /, - (note the order of these chars is important: - must be at the end, ] must be at the start, ^ should not be at the start and \ must be escaped)
(?=.*\1) - a positive lookahead that requires any 0+ chars as many as possible and then the value captured in Group 1.
Python sample code:
import re
text = r"./[\folder]/this-is-a.test/fi^le.cxx.LAST[]^\/-.h"
text = re.sub(r'([][^\\./-])(?=.*\1)', '', text, flags=re.S)
print(text)
Mind the r prefix with string literals. Note that flags=re.S makes . match line break chars as well.
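If the set of characters varies, the character class can be built with re.escape instead of ordering ], ^ and - by hand. The helper below is a hypothetical sketch, not part of the answer above; its name and parameters are assumptions, and it expects a non-empty chars string.

```python
import re

def remove_all_but_last(text, chars):
    """Remove every occurrence of each char in `chars` except its last one.

    Hypothetical helper: re.escape() escapes the special characters, so
    their position inside [...] no longer matters.
    """
    char_class = '[' + re.escape(chars) + ']'
    # Group 1 captures one char; the lookahead requires the same char later on.
    pattern = '(' + char_class + r')(?=.*\1)'
    return re.sub(pattern, '', text, flags=re.S)

print(remove_all_but_last("a.b-c.d-e", ".-"))  # abc.d-e
```

Each character is treated independently: the first . and the first - are removed because the same character occurs again later, while the last of each is kept.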
