capture the number iwth comma or dot with regex - python

I have regex code
https://regex101.com/r/o5gdDt/8
As you see this code
(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d,])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
can capture all digits which sperated by 3 digits in text like
"here is 100,100"
"23,456"
"1,435"
all more than 4 digit number like without comma separated
2345
1234 " here is 123456"
also this kind of number
65,656½
65,656½,
23,123½
The only tiny issue here is if there is a comma(dot) after the first two types it can not capture those. for example, it can not capture
"here is 100,100,"
"23,456,"
"1,435,"
unfortunately, there is a few number intext which ends with comma...can someone gives me an idea of how to modify this to capture above also?
I have tried to do it and modified version is so:
(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
basically I delete comma in (?![\d,]) but it causes to another problem in my context
it captures part of a number that is part of equation like this :
4,310,747,475x2
57,349,565,416,398x.
see here:
https://regex101.com/r/o5gdDt/10
I know that is kind of special question I would be happy to know your ides

The main problem here is that (?![\d,]) fails any match followed with a digit or comma while you want to fail the match when it is followed with a digit or a comma plus a digit.
Replace (?![\d,]) with (?!,?\d).
Also, (?<!\S)(?<![\d,]) looks redundant, as (?<!\S) requires a whitespace or start of string and that is certainly not a digit or ,. Either use (?<!\S) or (?<!\d)(?<!\d,) depending on your requirements.
Join the negative lookaheads with OR: (?!x)(?!/) => (?!x|/) => (?![x/]).
You wnat to avoid matching years, but you just fail all numbers that start with them, so 2020222 won't get matched. Add (?!\d) to the lookahead, (?!(?:1[2-9]\d\d|20[01]\d|2020)(?!\d)).
So, the pattern might look like
(?<!\S)(?:(?!(?:1[2-9]\d\d|20[01]\d|2020)(?!\d))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?!,?\d)[\u00BC-\u00BE\u2150-\u215E]?(?![x/])
See the regex demo.
IMPORTANT: You have [\u00BC-\u00BE\u2150-\u215E]?(?![x/]) at the end, a negative lookahead after an optional pattern. Once the engine fails to find the match for x or /, it will backtrack and will most probably find a partial match. If you do not want to match 65,656 in 65,656½x, replace [\u00BC-\u00BE\u2150-\u215E]?(?![x/]) with (?![\u00BC-\u00BE\u2150-\u215E]?[x/])[\u00BC-\u00BE\u2150-\u215E]?.
See another regex demo.

Related

Regular expression match when specific digits AND words appear

I am quite new to regex, working on string verification where I want both conditions to be met. I am matching text containing 7digit numbers starting with 4 or 7 + string needs to contain one of the provided words.
What I managed so far:
\b((4|7)\d{6})\b|(\border|Order|Bestellung|bestellung|commande|Commande|ordine|Ordine|objednavku|Objednavku|objednavka|Objednavka)
Regex above correctly finds numbers but words are after OR statement which I would need to follow AND logic instead.
Could you please help me implement a change that would work as AND statement between digits and words?
You can use
(?s)^(?=.*\b(?:order|Order|Bestellung|bestellung|commande|Commande|ordine|Ordine|objednavku|Objednavku|objednavka|Objednavka)\b).*\b([47]\d{6})\b
If you can and want use a case insensitive matching with re.I, you can use
(?si)^(?=.*\b(?:order|bestellung|commande|ordine|objednavk[ua])\b).*\b([47]\d{6})\b
See the regex demo.
This matches
^ - start of string
(?=.*\b(?:order|Order|Bestellung|bestellung|commande|Commande|ordine|Ordine|objednavku|Objednavku|objednavka|Objednavka)\b) - a positive lookahead that matches any zero or more chars, as many as possible, up to any of the whole words listed in the group
.* - zero or more chars, as many as possible
\b([47]\d{6})\b - a 7-digit number as a whole word that starts with 4 or 7.
Do not forget to use a raw string literal to define a regex in Python code:
pattern = r'(?si)^(?=.*\b(?:order|bestellung|commande|ordine|objednavk[ua])\b).*\b([47]\d{6})\b'
By default, everything in regex is AND
if you do
abc,
it means "a" AND "b" AND "c"
so there is no need for an AND in regex
just remove the | between the numbers match and the words
\b(4|7)\d{6}(border|Order|Bestellung|bestellung|commande|Commande|ordine|Ordine|objednavku|Objednavku|objednavka|Objednavka)\b
I assume the backslash with the first word \border was a mistake.
This can match stuff like : "4958374border"

Regex backreference to match opposite case

Before I begin — it may be worth stating, that: this technically does not have to be solved using a Regex, it's just that I immediately thought of a Regex when I started solving this problem, and I'm interested in knowing whether it's possible to solve using a Regex.
I've spent the last couple hours trying to create a Regex that does the following.
The regex must match a string that is ten characters long, iff the first five characters and last five characters are identical but each individual character is opposite in case.
In other words, if you take the first five characters, invert the case of each individual character, that should match the last five characters of the string.
For example, the regex should match abCDeABcdE, since the first five characters and the last five characters are the same, but each matching character is opposite in case. In other words, flip_case("abCDe") == "ABcdE"
Here are a few more strings that should match:
abcdeABCDE, abcdEABCDe, zYxWvZyXwV.
And here are a few that shouldn't match:
abcdeABCDZ, although the case is opposite, the strings themselves do not match.
abcdeABCDe, is a very close match, but should not match since the e's are not opposite in case.
Here is the first regex I tried, which is obviously wrong since it doesn't account for the case-swap process.
/([a-zA-Z]{5})\1/g
My next though was whether the following is possible in a regex, but I've been reading several Regex tutorials and I can't seem to find it anywhere.
/([A-Z])[\1+32]/g
This new regex (that obviously doesn't work) is supposed to match a single uppercase letter, immediately followed by itself-plus-32-ascii, so, in other words, it should match an uppercase letter followed immediately by its' lowercase counterpart. But, as far as I'm concerned, you cannot "add an ascii value" to backreference in a regex.
And, bonus points to whoever can answer this — in this specific case, the string in question is known to be 10 characters long. Would it be possible to create a regex that matches strings of an arbitrary length?
You want to use the following pattern with the Python regex module:
^(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L}))(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$)
See the regex demo
Details
^ - start of string
(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L})) - a positive lookahead with a sequence of five capturing groups that capture the first five letters individually
(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$) - a ppositive lookahead that make sure that, at the end of the string, there are 5 letters that are the same as the ones captured at the start but are of different case.
In brief, the first (\p{L}) in the first lookahead captures the first a in abcdeABCDE and then, inside the second lookahead, (?!\1)(?i:\1) makes sure the fifth char from the end is the same (with the case insensitive mode on), and (?!\1) negative lookahead make sure this letter is not identical to the one captured.
The re module does not support inline modifier groups, so this expression won't work with that moduue.
Python regex based module demo:
import regex
strs = ['abcdeABCDE', 'abcdEABCDe', 'zYxWvZyXwV', 'abcdeABCDZ', 'abcdeABCDe']
rx = r'^(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L}))(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$)'
for s in strs:
print("Testing {}...".format(s))
if regex.search(rx, s):
print("Matched")
Output:
Testing abcdeABCDE...
Matched
Testing abcdEABCDe...
Matched
Testing zYxWvZyXwV...
Matched
Testing abcdeABCDZ...
Testing abcdeABCDe...

regular expression ending

I have ONE string as plain text and want to extract phone numbers of any format from it.
Here is my regex:
r = re.compile(r"(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)[-\s*]\d{3}[-\.\s]??\d{4})")
It extracts the following matches correctly:
617.933.6444
(880)-567-4565
(880) 567-4565
222-333-8888
555 666 4444
9999999999
But how can I avoid getting 7986815059 when I have 798681505951 in the text?
How to make an ending for my regex? (it should not contain letters and digits after and before, exact number count must be 10)
!!!!
Decision
If somebody needs to find US phone numbers in string, use link from the last Wiktor Stribiżew comment.
You need to use word boundaries, but placing them into your pattern is not obvious. It is due to the fact that the second alternative starts with a non-word char, \(. Thus, the first \b must be added at the beginning of the first alternative, and the trailing one at the very end of the pattern:
r'(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\(\d{3}\)[-\s*]\d{3}[-.\s]?\d{4})\b'
^^ ^^
See the regex demo
You may also require a non-word char or start of string before (. Then add \B at the second alternative start:
r'(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\B\(\d{3}\)[-\s*]\d{3}[-.\s]?\d{4})\b'
^^
See another demo
Also, note that there is no need escaping a . inside a character class, it is already parsed as a literal dot in [.]. And no need using a lazy ?? quantifier, it does not make sense here and a greedy version, ?, will work equally well and will look "cleaner".

Why doesn't this regex pattern work as intended?

I needed a regex pattern to catch any 16 digit string of numbers (each four number group separated by a hyphen) without any number being repeated more than 3 times, with or without hyphens in between.
So the pattern I wrote is
a=re.compile(r'(?!(\d)\-?\1\-?\1\-?\1)(^d{4}\-?\d{4}\-?\d{4}\-?\d{4}$)')
But the example "5133-3367-8912-3456" gets matched even when 3 is repeated 4 times. (What is the problem with the negative lookahead section?)
Lookaheads only do the check at the position they are at, so in your case at the start of the string. If you want a lookahead to basically check the whole string, if a certain pattern can or can't be matched, you can add .* in front to make go deeper into the string.
In your case, you could change it to r'(?!.*(\d)\-?\1\-?\1\-?\1)(^d{4}\-?\d{4}\-?\d{4}\-?\d{4}$)'.
There is also no need to escape the minus at the position they are at and I would move the lookahead right after the ^. I don't know how well python regexes are optimized, but that way the start of the string anchor is matched first (only 1 valid position) instead of checking the lookahead at any place just to fail the match at ^. This would give r'^(?!.*(\d)-?\1-?\1-?\1)(\d{4}-?\d{4}-?\d{4}-?\d{4}$)'

Python regex positive look ahead

I have the following regex that is supposed to find sequence of words that are ended with a punctuation. The look ahead function assures that after the match there is a space and a capital letter or digit.
pat1 = re.compile(r"\w.+?[?.!](?=\s[A-Z\d])"
What is the function of the following lookahead?
pat2 = re.compile(r"\w.+?[?.!](?=\s+[A-Z\d])"
Is Python 3.2 supporting variable lookahead (\s+)? I do not get any error. Furthermore I cannot see any differences in both patterns. Both seem to work the same regardless the number of blanks that I have. Is there an explanation for the purpose of the \s+ in the look ahead?
I'm not really sure what you are tying to achieve here.
Sequence of words ended by a punctuation can be matched with something like:
re.findall(r'([\w\s]*[\?\!\.;])', s)
the lookahead requires another string to follow?
In any case:
\s requires one and only one space;
\s+ requires at least one space.
And yes, the lookahead accepts the "+" modifier even in python 2.x
The same as before but with a lookahead:
re.findall(r'([\w\s]*[\?\!\.;])(?=\s\w)', s)
or
re.findall(r'([\w\s]*[\?\!\.;])(?=\s+\w)', s)
you can try them all on something like:
s='Stefano ciao. a domani. a presto;'
Depending on your strings, the lookahead might be necessary or not, and might or might not change to have "+" more than one space option.
The difference is that the first lookahead expects exactly one whitespace character before the digit or capital letter while the second one expects at least one whitespace character but as many as possible.
The + is called a quantifier. It means 1 to n as many as possible.
To recap
\s (Exactly one whitespace character allowed. Will fail without it or with more than one.)
\s+ (At least one but maybe more whitespaces allowed.)
Further studying.
I have multiple blanks, the \w.+? continues to match the blanks until the last blank before the capital letter
To answer this comment please consider :
What does \w.+? actually matches?
A single word character [a-zA-Z0-9_] followed by at least one "any" character(except newline) but with the lazy quantifier +?. So in your case, it leaves one space so that the lookahead later matches. Therefore you consume all the blanks except one. This is why you see them at your output.

Categories

Resources