Regex - Regular expression for counting no.of digits between alphabets - python

Need to construct a regular expression that counts numbers between alphabets.
schowalte3rguss77ie85 - 2
xyz1zyx - 1
x1y1z1 - 2
I have constructed this . But this doesn't work for case 3.
[[a-z]+[0-9]+[a-z]]*
Any help would be appreciated. Thanks in advance.

Use regx:
(?<=[a-z])\d+(?=[a-z])
Demo: https://regex101.com/r/tpss6x/1
[Javascript]

If you want a count only, the last part should be a lookahead assertion.
If you want to also match uppercase chars, you can make the pattern case insensitive.
[a-z]\d+(?=[a-z])
Explanation
[a-z] Match a single char a-z
\d+ Match 1+ digits
(?=[a-z]) Positive lookahead, assert a char a-z to the right
Regex demo

You can use
(?<=[^\W\d_])\d+(?=[^\W\d_])
See the regex demo. If you want to only support ASCII letters, replace [^\W\d_] (that matches any Unicode letter) with [a-zA-Z].
Details:
(?<=[^\W\d_]) - immediately before the current location, there must be any Unicode letter
\d+ - one or more digits
(?=[^\W\d_]) - immediately after the current location, there must be any Unicode letter.
Counting can be done with len(...), see this Python demo:
import re
text = "schowalte3rguss77ie85"
matches = re.findall(r'(?<=[^\W\d_])\d+(?=[^\W\d_])', text)
print(len(matches)) # => 2

Related

How to extract string between space and symbol '>'?

String 1:
[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97
String 2:
[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17
In string 1, I want to extract CAR<7:5 and BIKE<4:0,
In string 2, I want to extract CAKE<4:0
Any regex for this in Python?
You can use \w+<[^>]+
DEMO
\w matches any word character (equivalent to [a-zA-Z0-9_])
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy).
< matches the character <
[^>] Match a single character not present in the list
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
We can use re.findall here with the pattern (\w+.*?)>:
inp = ["[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97", "[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17"]
for i in inp:
matches = re.findall(r'(\w+<.*?)>', i)
print(matches)
This prints:
['CAR<7:5', 'BIKE<4:0']
['CAKE<4:0']
In the first example, the BIKE part has no leading space but a pipe char.
A bit more precise match might be asserting either a space or pipe to the left, and match the digits separated by a colon and assert the > to the right.
(?<=[ |])[A-Z]+<\d+:\d+(?=>)
In parts, the pattern matches:
(?<=[ |]) Positive lookbehind, assert either a space or a pipe directly to the left
[A-Z]+ Match 1+ chars A-Z
<\d+:\d+ Match < and 1+ digits betqeen :
(?=>) Positive lookahead, assert > directly to the right
Regex demo
Or the capture group variant:
(?:[ |])([A-Z]+<\d+:\d)>
Regex demo

strange output regular expression r'[-.\:alnum:](.*)'

I expect to fetch all alphanumeric characters after "-"
For an example:
>>> str1 = "12 - mystr"
>>> re.findall(r'[-.\:alnum:](.*)', str1)
[' mystr']
First, it's strange that white space is considered alphanumeric, while I expected to get ['mystr'].
Second, I cannot understand why this can be fetched, if there is no "-":
>>> str2 = "qwertyuio"
>>> re.findall(r'[-.\:alnum:](.*)', str2)
['io']
First of all, Python re does not support POSIX character classes.
The white space is not considered alphanumeric, your first pattern matches - with [-.\:alnum:] and then (.*) captures into Group 1 all 0 or more chars other than a newline. The [-.\:alnum:] pattern matches one char that is either -, ., :, a, l, n, u or m. Thus, when run against the qwertyuio, u is matched and io is captured into Group 1.
Alphanumeric chars can be matched with the [^\W_] pattern. So, to capture all alphanumeric chars after - that is followed with 0+ whitespaces you may use
re.findall(r'-\s*([^\W_]+)', s)
See the regex demo
Details
- - a hyphen
\s* - 0+ whitespaces
([^\W_]+) - Capturing group 1: one or more (+) chars that are letters or digits.
Python demo:
print(re.findall(r'-\s*([^\W_]+)', '12 - mystr')) # => ['mystr']
print(re.findall(r'-\s*([^\W_]+)', 'qwertyuio')) # => []
Your regex says: "Find any one of the characters -.:alnum, then capture any amount of any characters into the first capture group".
In the first test, it found - for the first character, then captured mystr in the first capture group. If any groups are in the regex, findall returns list of found groups, not the matches, so the matched - is not included.
Your second test found u as one of the -.:alnum characters (as none of qwerty matched any), then captured and returned the rest after it, io.
As #revo notes in comments, [....] is a character class - matching any one character in it. In order to include a POSIX character class (like [:alnum:]) inside it, you need two sets of brackets. Also, there is no order in a character class; the fact that you included - inside it just means it would be one of the matched characters, not that alphanumeric characters would be matched without it. Finally, if you want to match any number of alphanumerics, you have your quantifier * on the wrong thing.
Thus, "match -, then any number of alphanumeric characters" would be -([[:alnum:]]*), except... Python does not support POSIX character classes. So you have to write your own: -([A-Za-z0-9]*).
However, that will not match your string because the intervening space is, as you note, not an alphanumeric character. In order to account for that, -\s*([A-Za-z0-9]*).
Not quite sure what you want to match. I'll assume you don't want to include '-' in any matches.
If you want to get all alphanumeric chars after the first '-' and skip all other characters you can do something like this.
re.match('.*?(?<=-)(((?<=\s+)?[a-zA-Z\d]+(?=\s+)?)+)', inputString)
If you want to find each string of alphanumerics after a each '-' then you can do this.
re.findall('(?<=-)[a-zA-Z\d]+')

Regex matching digit in between?

I would like to get number in between these strings.
strings = ["point_right: account ISLAMIC: 860328 9221 asdsad",
"account 723123123",
"account823123213",
"account 823.123.213",
"account 823-123-213",
"account:123213123 ",
"account: 123213123 asdasdsad 017-299906",
"account: 123213123",
"point_right: account ISLAMIC: 860328 9221"
]
Result would be
[860328 9221,723123123, 823123213, 823.123.213, 823-123-213, 123213123, 123213123, 123213123]
And i can do processing later to make them into number. So far my strategy is to get everything after pattern and anything before a letter. I have tried:
for string in strings:
print(re.findall("(?<=account)(.*)", string.lower()))
Please help to give some pointers on the regex match.
Try this pattern:
(?=[^0-9]*)[0-9][0-9 .-]*[0-9]
Breakdown:
(?=[^0-9]*) Lookahead for a word, such as "account", non-matching
[0-9] Find a digit
[0-9 .-]* Find any number of digits or special characters (in your strings you have spaces, dashes, periods so I included those)
[0-9] Find another digit (to prevent spaces at the end)
Check it out here, and sample code here
(?!\W)([\d\s.-]+)(?<!\s)
The negative lookahead and lookbehind seems like overkills here but I wasn't able to get a clean match otherwise. You may see the results here
(?!\W) Negative lookahead to exclude any non-word characters [^a-zA-Z0-9_]
([\d\s.-]+) The capturing group for your numbers
(?<!\s) Negative lookbehind to exclude whitespace characters [\r\n\t\f\v ]
If the numbers must be the first numbers after the account substring use
re.findall("account\D*([\d\s.-]*\d)", s)
See the Python demo and the regex demo.
Pattern details
account - a literal substring
\D* - 0+ chars other than digits
([\d\s.-]*\d) - Capturing group 1 (the value returned by re.findall): 0 or more digits, whitespaces, . and - chars followed with a digit.

Extracting a string within a string and omitting the search string

I have a string:
string="soupnot$23.99dedarikjdf$44.65 notworryfence$98.44coyoteugle$33.94rock$2,300.00"
I want to extract the numbers 23.99, 44.65, 98.44,33.44, 2,300.00. I have this regex
\$(.*[^\s])
There are 2 issues with this.
It returns the '$' sign. I only want the number.
It only works when there is a space at the end of the number but sometimes there might be letters and it won't work in that case.
Thanks.
You can use regex as shown:
import re
string="soupnot$23.99dedarikjdf$44.65 notworryfence$98.44coyoteugle$33.94rock$2,300.00"
res = re.findall(pattern="[\d.,]+", string=string)
output:
['23.99', '44.65', '98.44', '33.94', '2,300.00']
Try this regex:
(?<=\$)\d+(?:,\d+)*(?:\.\d+)?
Click for Demo
Explanation
(?<=\$) - positive lookbehind to find the position just preceded by a $
\d+ - matches 1+ occurrences of a digit
(?:,\d+)* - matches 0+ occurrences of a , followed by 1 or more digits
(?:\.\d+)? - matches a . followed by 1+ digits. ? in the end makes this decimal part optional

Regex for a third-person verb

I'm trying to create a regex that matches a third person form of a verb created using the following rule:
If the verb ends in e not preceded by i,o,s,x,z,ch,sh, add s.
So I'm looking for a regex matching a word consisting of some letters, then not i,o,s,x,z,ch,sh, and then "es". I tried this:
\b\w*[^iosxz(sh)(ch)]es\b
According to regex101 it matches "likes", "hates" etc. However, it does not match "bathes", why doesn't it?
You may use
\b(?=\w*(?<![iosxz])(?<![cs]h)es\b)\w*
See the regex demo
Since Python re does not support variable length alternatives in a lookbehind, you need to split the conditions into two lookbehinds here.
Pattern details:
\b - a leading word boundary
(?=\w*(?<![iosxz])(?<![cs]h)es\b) - a positive lookahead requiring a sequence of:
\w* - 0+ word chars
(?<![iosxz]) - there must not be i, o, s, x, z chars right before the current location and...
(?<![cs]h) - no ch or sh right before the current location...
es - followed with es...
\b - at the end of the word
\w* - zero or more (maybe + is better here to match 1 or more) word chars.
See Python demo:
import re
r = re.compile(r'\b(?=\w*(?<![iosxz])(?<![cs]h)es\b)\w*')
s = 'it matches "likes", "hates" etc. However, it does not match "bathes", why doesn\'t it?'
print(re.findall(r, s))
If you want to match strings that end with e and are not preceded by i,o,s,x,z,ch,sh, you should use:
(?<!i|o|s|x|z|ch|sh)e
Your regex [^iosxz(sh)(ch)] consists of character group, the ^ simply negates, and the rest will be exactly matched, so it's equivalent to:
[^io)sxz(c]
which actually means: "match anything that's not one of "io)sxz(c".

Categories

Resources