I would like to extract the numbers from these strings.
strings = ["point_right: account ISLAMIC: 860328 9221 asdsad",
"account 723123123",
"account823123213",
"account 823.123.213",
"account 823-123-213",
"account:123213123 ",
"account: 123213123 asdasdsad 017-299906",
"account: 123213123",
"point_right: account ISLAMIC: 860328 9221"
]
The result would be
[860328 9221,723123123, 823123213, 823.123.213, 823-123-213, 123213123, 123213123, 123213123]
I can do further processing later to turn them into numbers. So far my strategy is to get everything after the pattern and anything before a letter. I have tried:
import re
for string in strings:
    print(re.findall("(?<=account)(.*)", string.lower()))
Please give me some pointers on the regex match.
Try this pattern:
(?=[^0-9]*)[0-9][0-9 .-]*[0-9]
Breakdown:
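(?=[^0-9]*) Lookahead for zero or more non-digit characters, such as the word "account", without consuming them (because it can match an empty string, it always succeeds)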
[0-9] Find a digit
[0-9 .-]* Find any number of digits or special characters (in your strings you have spaces, dashes, periods so I included those)
[0-9] Find another digit (to prevent spaces at the end)
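A minimal Python sketch of this pattern, assuming only the first number in each string is wanted (hence re.search rather than re.findall). The sample list is a subset of the strings from the question, and the variable names are illustrative:

```python
import re

# Pattern from the answer above, kept exactly as written.
pattern = re.compile(r"(?=[^0-9]*)[0-9][0-9 .-]*[0-9]")

strings = [
    "point_right: account ISLAMIC: 860328 9221 asdsad",
    "account 823.123.213",
    "account:123213123 ",
    "account: 123213123 asdasdsad 017-299906",
]

first_numbers = []
for s in strings:
    m = pattern.search(s)  # first match only; findall would also return 017-299906
    first_numbers.append(m.group() if m else None)

print(first_numbers)
# ['860328 9221', '823.123.213', '123213123', '123213123']
```

Note that the pattern needs at least two digits to match, because it requires a digit at both ends.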
(?!\W)([\d\s.-]+)(?<!\s)
The negative lookahead and lookbehind seem like overkill here, but I wasn't able to get a clean match otherwise.
(?!\W) Negative lookahead to exclude any non-word characters [^a-zA-Z0-9_]
([\d\s.-]+) The capturing group for your numbers
(?<!\s) Negative lookbehind to exclude whitespace characters [\r\n\t\f\v ]
If the numbers must be the first numbers after the account substring, use
re.findall(r"account\D*([\d\s.-]*\d)", s)
Pattern details
account - a literal substring
\D* - 0+ chars other than digits
([\d\s.-]*\d) - Capturing group 1 (the value returned by re.findall): 0 or more digits, whitespace, . and - chars followed by a digit.
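As a rough sketch, the pattern can be run over the full list from the question. The final clean-up step (stripping separators with re.sub) is an assumption about the "processing later" the question mentions, not part of the answer:

```python
import re

strings = [
    "point_right: account ISLAMIC: 860328 9221 asdsad",
    "account 723123123",
    "account823123213",
    "account 823.123.213",
    "account 823-123-213",
    "account:123213123 ",
    "account: 123213123 asdasdsad 017-299906",
    "account: 123213123",
    "point_right: account ISLAMIC: 860328 9221",
]

# Capture the first number-like run that follows "account".
pattern = re.compile(r"account\D*([\d\s.-]*\d)")

results = []
for s in strings:
    results.extend(pattern.findall(s))
print(results)

# Assumed post-processing: drop every non-digit character.
digits_only = [re.sub(r"\D", "", r) for r in results]
print(digits_only)
```

The trailing \d in the group is what keeps a trailing space out of results such as "123213123 ".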
Related
I need to construct a regular expression that counts the numbers that appear between letters.
schowalte3rguss77ie85 - 2
xyz1zyx - 1
x1y1z1 - 2
I have constructed this, but it doesn't work for case 3:
[[a-z]+[0-9]+[a-z]]*
Any help would be appreciated. Thanks in advance.
Use this regex:
(?<=[a-z])\d+(?=[a-z])
Demo: https://regex101.com/r/tpss6x/1
[Javascript]
If you want a count only, the last part should be a lookahead assertion.
If you want to also match uppercase chars, you can make the pattern case insensitive.
[a-z]\d+(?=[a-z])
Explanation
[a-z] Match a single char a-z
\d+ Match 1+ digits
(?=[a-z]) Positive lookahead, assert a char a-z to the right
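To illustrate why the last part should be a lookahead, here is a small Python sketch comparing the pattern above with a variant that consumes the trailing letter. The variant and the variable names are illustrative assumptions; the sample strings come from the question:

```python
import re

samples = ["schowalte3rguss77ie85", "xyz1zyx", "x1y1z1"]

# Trailing letter is only asserted, so it can start the next match.
with_lookahead = re.compile(r"[a-z]\d+(?=[a-z])")
# Trailing letter is consumed, so it is unavailable to the next match.
consuming = re.compile(r"[a-z]\d+[a-z]")

counts_lookahead = [len(with_lookahead.findall(s)) for s in samples]
counts_consuming = [len(consuming.findall(s)) for s in samples]

print(counts_lookahead)  # [2, 1, 2]
print(counts_consuming)  # [2, 1, 1]
```

In "x1y1z1" the consuming variant uses up the "y" in its first match, so the second number is no longer preceded by an available letter.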
You can use
(?<=[^\W\d_])\d+(?=[^\W\d_])
If you only want to support ASCII letters, replace [^\W\d_] (which matches any Unicode letter) with [a-zA-Z].
Details:
(?<=[^\W\d_]) - immediately before the current location, there must be any Unicode letter
\d+ - one or more digits
(?=[^\W\d_]) - immediately after the current location, there must be any Unicode letter.
Counting can be done with len(...), see this Python demo:
import re
text = "schowalte3rguss77ie85"
matches = re.findall(r'(?<=[^\W\d_])\d+(?=[^\W\d_])', text)
print(len(matches)) # => 2
I have a regex to match words that start and end with the same letter (excluding single characters like 'a', '1')
(^.).*\1$
and another regex to avoid matching any strings with the format 'xyyx' (e.g 'otto', 'trillion', 'xxxx', '-[[-', 'fitting')
^(?!.*(.)(.)\2\1)
How do I construct a single regex to meet both of the requirements?
You can start the pattern with the negative lookahead, followed by the pattern for the match. Note that the backreference in the last pattern must change to \3, because the lookahead already uses group 1 and group 2.
Note that the . also matches a space, so if you don't want to match spaces you can use \S to match non whitespace chars instead.
^(?!.*(.)(.)\2\1)(.).*\3$
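A quick Python check of the combined pattern, as a sketch. The word list is partly taken from the question ('otto', 'trillion', 'fitting') and partly made up for illustration ('level', 'area', 'a', 'abc'):

```python
import re

# Groups inside the lookahead still count, so the outer capture is group 3.
pattern = re.compile(r"^(?!.*(.)(.)\2\1)(.).*\3$")

words = ["level", "area", "otto", "trillion", "fitting", "a", "abc"]
accepted = [w for w in words if pattern.search(w)]

print(accepted)  # ['level', 'area']
```

'otto' and 'fitting' are rejected by the lookahead, 'a' is rejected because \3 needs a second character, and 'abc' does not end with its first letter.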
I would place the negative look-ahead after the initial character, and let it exclude the final character (as those two should be part of a positive capture):
^(.)(?!.*(.)\2.).*\1$
Note that the negative check concerns characters between the start and ending character, and so these words would not be rejected:
oopso
livewell
I have a text like this:
Film_relase_date:1970_films_by_20th_Century_Fox
I would like to create a regex that matches all text except 1970, resulting in:
Film_relase_date:_films_by_20th_Century_Fox
I tried with the regex:
[^\d{4}]
But this regex returns:
Film_relase_date:_films_by_th_Century_Fox
And therefore also excludes the 20 which instead I would like to be matched.
How can I improve the regex?
EDIT:
I want to use this regex to do something like:
x = 'Film_relase_date: 1970_films_by_20th_Century_Fox'
REPLACE (x, "Anything that is not a 4-digit number", "Non-Space") = 1970
Remember that {4} is supposed to be added after the character class, not inside.
Anyway, if you want to match "all text except 1970", you can use the following regex:
([^\d]|(?<!\d)\d(?!\d{3}(?!\d))\d*)?
This regex matches:
a non-digit character or
a digit char that is not preceded by another digit and is not followed by exactly 3 digits, together with any digits that come after it
If you want to match everything except 4-digit numbers, I would suggest an unrolled version that matches either 1-3 digits or 5 or more digits, asserting that no digit follows so that part of a longer run of digits is never matched.
If you don't want to cross newlines, you could use [^\d\r\n] instead of \D.
\D+(?:(?:\d{1,3}|\d{5,})(?!\d)\D*)*
Explanation
\D+ Match 1+ non digits
(?: Non capture group
(?:\d{1,3}|\d{5,}) Match either 1-3 or 5 or more digits
(?!\d)\D* Negative lookahead, assert not a digit directly to the right followed by matching optional non digits
)* Close the non capture group and repeat 0+ times
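As a sketch of how this could serve the REPLACE idea from the question's edit, the unrolled pattern can be passed to re.sub so that everything except the 4-digit number is removed. The variable names are illustrative:

```python
import re

x = "Film_relase_date: 1970_films_by_20th_Century_Fox"

# Matches text that is not a standalone 4-digit number.
not_four_digits = re.compile(r"\D+(?:(?:\d{1,3}|\d{5,})(?!\d)\D*)*")

remaining = not_four_digits.sub("", x)
print(remaining)  # 1970
```

The "20" in "20th" is swallowed by the \d{1,3} branch, so only 1970 survives the replacement.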
Note that if you want to match 4 digits only, you could perhaps extract the 4 digits using (?<!\d)\d{4}(?!\d) instead of replacing with an empty string.
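A minimal sketch of that direct approach in Python, using the sample string from the question. The same pattern also covers the opposite case of deleting the year and keeping the rest:

```python
import re

x = "Film_relase_date: 1970_films_by_20th_Century_Fox"

# Exactly four digits, with no digit directly before or after.
four_digits = re.compile(r"(?<!\d)\d{4}(?!\d)")

match = four_digits.search(x)
year = match.group() if match else None
print(year)  # 1970

without_year = four_digits.sub("", x)
print(without_year)  # Film_relase_date: _films_by_20th_Century_Fox
```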
I have a list like this (it's only a part):
not match me
norme
16/02574/REMMAJ
20160721
17/00016/FULM
OUT/2017/1071
SMD/2017/0391
17/01090/FULM
2017/30597
17/03940/MAO
18/00076/FULM
CH/17/323
18/00840/OUTMEI
17/00902/EIAM
PL/2017/02671/MINFOT
I need a general rule that matches all of them, but not the first rows (simple words), nor any plain \d or \w run that is not mixed with the others and a slash. Numbers like \d{8} are allowed.
I don't know how to express something like a MUST clause that applies to each of these 3 groups together, so that none of them can be missing.
These patterns either match only partially or also match plain words. I need a regex that is as simple as possible.
\d{8}|(\w+|/+|\d+)
\d{8}|[\w/\d]+
EDIT
It's funny, but some examples I did not provide don't match the proposed expressions. For example:
7/2018/4127
NWB/18CM032
But I know why, and this is outside the scope. However, supporting mixed digits and letters in one group, like NWB/18CM032, would be great and wouldn't break the previous idea, I think.
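You could match one or more letters or digits, followed by one or more repetitions of a forward slash and one or more letters or digits, or alternatively exactly 8 digits. Because the character classes use a-z, enable the case-insensitive flag to match the uppercase letters in your data: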
^(?:[a-z0-9]+(?:/[a-z0-9]+)+|\d{8})$
That will match
^ Start of string
(?: Non capturing group
[a-z0-9]+ Match a char a-z or a digit 1+ times
(?:/[a-z0-9]+)+ Match a / followed by a char or digit 1+ times and repeat 1+ times.
| Or
\d{8} Match 8 digits
) Close group
$ End of string
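A short Python sketch of this pattern against some of the rows from the question. It assumes the case-insensitive flag (re.IGNORECASE) is what lets [a-z] match the uppercase codes; the variable names are illustrative:

```python
import re

# re.IGNORECASE is assumed here so that [a-z] also matches uppercase letters.
pattern = re.compile(r"^(?:[a-z0-9]+(?:/[a-z0-9]+)+|\d{8})$", re.IGNORECASE)

candidates = [
    "not match me",
    "norme",
    "16/02574/REMMAJ",
    "20160721",
    "OUT/2017/1071",
    "2017/30597",
    "CH/17/323",
    "PL/2017/02671/MINFOT",
]

matched = [c for c in candidates if pattern.match(c)]
print(matched)
```

Plain words fail because the first alternative requires at least one slash-separated part, and the second alternative only accepts exactly 8 digits.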
I have a string:
string="soupnot$23.99dedarikjdf$44.65 notworryfence$98.44coyoteugle$33.94rock$2,300.00"
I want to extract the numbers 23.99, 44.65, 98.44, 33.94 and 2,300.00. I have this regex:
\$(.*[^\s])
There are 2 issues with this.
It returns the '$' sign. I only want the number.
It only works when there is a space at the end of the number but sometimes there might be letters and it won't work in that case.
Thanks.
You can use regex as shown:
import re
string="soupnot$23.99dedarikjdf$44.65 notworryfence$98.44coyoteugle$33.94rock$2,300.00"
res = re.findall(pattern=r"[\d.,]+", string=string)
print(res)
Output:
['23.99', '44.65', '98.44', '33.94', '2,300.00']
Try this regex:
(?<=\$)\d+(?:,\d+)*(?:\.\d+)?
Explanation
(?<=\$) - positive lookbehind to find the position just preceded by a $
\d+ - matches 1+ occurrences of a digit
(?:,\d+)* - matches 0+ occurrences of a , followed by 1 or more digits
(?:\.\d+)? - matches a . followed by 1+ digits. ? in the end makes this decimal part optional
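The lookbehind pattern can be tried in Python as follows. This is a minimal sketch; the conversion to float at the end is an assumed follow-up step, not something the question asked for:

```python
import re

string = "soupnot$23.99dedarikjdf$44.65 notworryfence$98.44coyoteugle$33.94rock$2,300.00"

# The lookbehind requires a preceding $ but keeps it out of the match.
amounts = re.findall(r"(?<=\$)\d+(?:,\d+)*(?:\.\d+)?", string)
print(amounts)  # ['23.99', '44.65', '98.44', '33.94', '2,300.00']

# Assumed follow-up: strip thousands separators and convert to float.
values = [float(a.replace(",", "")) for a in amounts]
print(values)
```

Unlike a bare character class such as [\d.,]+, this only picks up numbers that directly follow a dollar sign.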