Regular expression match when specific digits AND words appear - python

I am quite new to regex and am working on string verification where both conditions must be met: the text must contain a 7-digit number starting with 4 or 7, and it must also contain one of the provided words.
What I managed so far:
\b((4|7)\d{6})\b|(\border|Order|Bestellung|bestellung|commande|Commande|ordine|Ordine|objednavku|Objednavku|objednavka|Objednavka)
The regex above correctly finds the numbers, but the words come after an OR (|), whereas I need AND logic between the two parts.
Could you please help me change it so that digits and words are combined with AND?

You can use
(?s)^(?=.*\b(?:order|Order|Bestellung|bestellung|commande|Commande|ordine|Ordine|objednavku|Objednavku|objednavka|Objednavka)\b).*\b([47]\d{6})\b
If you can use case-insensitive matching (re.I, or the inline (?i) flag), you can shorten it to
(?si)^(?=.*\b(?:order|bestellung|commande|ordine|objednavk[ua])\b).*\b([47]\d{6})\b
See the regex demo.
This matches
^ - start of string
(?=.*\b(?:order|Order|Bestellung|bestellung|commande|Commande|ordine|Ordine|objednavku|Objednavku|objednavka|Objednavka)\b) - a positive lookahead that matches any zero or more chars, as many as possible, up to any of the whole words listed in the group
.* - zero or more chars, as many as possible
\b([47]\d{6})\b - a 7-digit number as a whole word that starts with 4 or 7.
Do not forget to use a raw string literal to define a regex in Python code:
pattern = r'(?si)^(?=.*\b(?:order|bestellung|commande|ordine|objednavk[ua])\b).*\b([47]\d{6})\b'
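As a quick check, the case-insensitive pattern can be exercised with re.search. This is only a minimal sketch: the helper name find_order_number and the sample strings are made up for illustration and are not part of the original answer.

```python
import re

# Case-insensitive pattern from the answer above: the lookahead requires one of
# the keywords anywhere in the string, then group 1 captures the 7-digit number.
pattern = re.compile(
    r'(?si)^(?=.*\b(?:order|bestellung|commande|ordine|objednavk[ua])\b)'
    r'.*\b([47]\d{6})\b'
)

def find_order_number(text):
    """Hypothetical helper: return the number if both conditions hold, else None."""
    m = pattern.search(text)
    return m.group(1) if m else None

# Illustrative inputs (assumed, not from the question).
samples = [
    "Your Order 4123456 has shipped",   # keyword and number present
    "4123456 is my order",              # order of appearance does not matter
    "Number 4123456 only",              # number without a keyword
    "order 5123456",                    # keyword, but number starts with 5
]
for text in samples:
    print(repr(text), "->", find_order_number(text))
```

Because .* is greedy, group 1 holds the last qualifying number when the string contains several.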

By default, everything in a regex is AND, in the sense of "followed by". If you write abc, it means "a" AND then "b" AND then "c", so there is no need for a separate AND operator.
Just remove the | between the number match and the words:
\b(4|7)\d{6}(border|Order|Bestellung|bestellung|commande|Commande|ordine|Ordine|objednavku|Objednavku|objednavka|Objednavka)\b
I assume the backslash with the first word \border was a mistake.
This matches strings such as "4958374border", where the word directly follows the number.
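To see what the concatenated pattern accepts, here is a small sketch. The sample strings other than "4958374border" are assumptions added for illustration.

```python
import re

# Concatenated pattern from this answer: the number must be followed
# immediately by one of the listed words.
concatenated = re.compile(
    r'\b(4|7)\d{6}(border|Order|Bestellung|bestellung|commande|Commande'
    r'|ordine|Ordine|objednavku|Objednavku|objednavka|Objednavka)\b'
)

def matches_concatenated(text):
    """Hypothetical helper: True if the concatenated pattern is found in text."""
    return concatenated.search(text) is not None

for text in ["4958374border", "4958374Order", "Order 4958374", "4958374 Order"]:
    print(repr(text), "->", matches_concatenated(text))
```

Only the inputs where the word is glued to the number are accepted; a word elsewhere in the string does not satisfy this pattern.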

Related

Regex python do lookahead in a conditional statement

I'm trying to do lookaheads in a conditional statement.
Explanation by words:
(a specified string that must be a number (decimal or not) or a single word character, captured in a named group) (if the named group holds a word character, a lookahead checks that the next string is a number (decimal or not); otherwise a lookahead checks that the next string is a word character)
To illustrate, here are some examples that should or should not match:
a 6 or 6.4 b -> matched, since the first and the second string are not of the same "type"
ab 7 or 7 rt -> not matched, only a single word character is allowed
R 7.55t -> not matched, 7.55t is not a valid number
a r or 5 6 -> not matched, the first and the second string have the same "type" (number and number, or word character and word character)
I've already found the answer for the first string: (?P<var>([a-zA-Z]|(-?\d+(.\d+)?)))
I've found nothing on the Internet about lookaheads in a conditional statement in Python.
The problem is that Python doesn't support conditionals the way PCRE does:
Python supports conditionals using a numbered or named capturing group. Python does not support conditionals using lookaround, even though Python does support lookaround outside conditionals. Instead of a conditional like (?(?=regex)then|else), you can alternate two opposite lookarounds: (?=regex)then|(?!regex)else. (source: https://www.regular-expressions.info/conditional.html)
Maybe there's a better solution than the one I've planned, or maybe what I want is simply impossible; I don't know.
What I tried: (?P<var>([a-zA-Z]|(-?\d+(.\d+)?))) (?(?=[a-zA-Z])(?=(-?\d+(.\d+)?))|(?=[a-zA-Z]))(?P=var) but that doesn't work.
The named capture group (?P<var>...) contains the actual text which matched, not the regex itself. There is a way to create a named regex, too; but it's probably not particularly necessary or useful here.
Simply spell out the alternatives:
((?<![a-zA-Z0-9])[a-zA-Z]\s+-?\d+(\.\d+)?(?![a-zA-Z.0-9])|(?<![a-zA-Z.0-9])-?\d+(\.\d+)?\s+[a-zA-Z](?![a-zA-Z0-9]))
If you genuinely require the second token to remain unmatched, it should be obvious how to change the parts starting at each \s into a lookahead.
Demo: https://ideone.com/nPNAIN
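The spelled-out alternation can be checked against the examples from the question. This is a sketch under a few assumptions: the dots in the number part are escaped so that only a literal decimal point is accepted, the inner groups are made non-capturing, and the helper name has_mixed_pair is invented for illustration.

```python
import re

# Either "letter, whitespace, number" or "number, whitespace, letter".
# The lookarounds keep each token from being part of a longer word or number.
mixed_pair = re.compile(
    r'(?<![a-zA-Z0-9])[a-zA-Z]\s+-?\d+(?:\.\d+)?(?![a-zA-Z.0-9])'
    r'|(?<![a-zA-Z.0-9])-?\d+(?:\.\d+)?\s+[a-zA-Z](?![a-zA-Z0-9])'
)

def has_mixed_pair(text):
    """Hypothetical helper: True if text contains a letter/number pair."""
    return mixed_pair.search(text) is not None

# Examples taken from the question.
for text in ["a 6", "6.4 b", "ab 7", "7 rt", "R 7.55t", "a r", "5 6"]:
    print(repr(text), "->", has_mixed_pair(text))
```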

Python Regex: Match paragraph numbers

I am attempting to match paragraph numbers inside my block of text. Given the following sentence:
Refer to paragraph C.2.1a.5 for examples.
I would like to match the word C.2.1a.5.
My current code is:
([0-9a-zA-Z]{1,2}\.)
Only matches C.2.1a. and es., which is not what I want. Is there a way to match the full C.2.1a.5 and not match es.?
https://regex101.com/r/cO8lqs/13723
I have attempted to use ^ and $, but doing so returns no matches.
You can use the following regex to match the paragraph numbers in your text.
\b(?:[0-9a-zA-Z]{1,2}\.)+[0-9a-zA-Z]\b
Try this demo
Here is the explanation:
\b - Matches a word boundary, which avoids matching part of a larger word like examples.
(?:[0-9a-zA-Z]{1,2}\.)+ - Matches one or two alphanumeric characters followed by a literal dot, as in your own regex, repeated one or more times.
[0-9a-zA-Z] - Finally, the match ends with one alphanumeric character. If you want it to allow one or two alphanumeric characters at the end too, just add {1,2} after it.
\b - Matches a word boundary again to ensure it doesn't match part of a larger word.
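A minimal sketch of this first regex applied to the sentence from the question; the variable names are arbitrary.

```python
import re

# Word boundary, one or more "1-2 alphanumerics + dot" groups,
# one closing alphanumeric, word boundary.
paragraph_re = re.compile(r'\b(?:[0-9a-zA-Z]{1,2}\.)+[0-9a-zA-Z]\b')

sentence = 'Refer to paragraph C.2.1a.5 for examples.'
found = paragraph_re.findall(sentence)
print(found)
```

The trailing "es." of "examples." is no longer picked up, because there is no word boundary inside "examples" and nothing alphanumeric follows the final dot.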
EDIT:
As someone pointed out, your text may contain strings like A.A.A.A.A.A. or A.A.A or even 1.2. If you don't want to match those and only want strings that contain exactly three dots, use the following, more specific regex:
(?<!\.)\b(?:[0-9a-zA-Z]{1,2}\.){3}[0-9a-zA-Z]\b(?!\.)
This new regex matches only paragraph numbers with exactly three dots, and the negative lookbehind and lookahead ensure it doesn't match part of a longer string like A.A.A.A.A.A
Updated regex demo
Check these python sample codes,
import re
s = 'Refer to paragraph C.2.1a.5 for examples. Refer to paragraph A.A.A.A.A.A.A for examples. Some more A.A.A or like 1.22'
print(re.findall(r'(?<!\.)\b(?:[0-9a-zA-Z]{1,2}\.){3}[0-9a-zA-Z]\b(?!\.)', s))
Output,
['C.2.1a.5']
As for ^ and $: they are the start and end anchors. If you use them, the regex has to match from the start of the line to the end of the line, which is not what you intend here, so, as you already saw, they won't work in this case.
If a simpler version is enough, you can use this easy-to-understand and easy-to-modify regex: ([A-Z]{1}\.[0-9]{1,3}\.[0-9]{1,3}[a-z]{1}\.[0-9]{1,3})
I think we should keep the regex expression simple and readable.
You can use the regex
(?:[a-zA-Z0-9]+\.){3}[a-zA-Z0-9]+
Explanation -
The expression (?:[a-zA-Z0-9]+\.){3} ensures that the group (?:[a-zA-Z0-9]+\.) is repeated 3 times within the word. The group contains one or more alphanumeric characters followed by a dot.
The word ends with one or more alphanumeric characters.
Output:
['C.2.1a.5']
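A short sketch of why the character class needs digits for this input: a letters-only class cannot get past the "2" in C.2.1a.5. The variable names are illustrative, and the alphanumeric variant is an assumption about what this answer intends.

```python
import re

sentence = 'Refer to paragraph C.2.1a.5 for examples.'

# Letters only: the second segment "2" is not a letter, so nothing is found.
letters_only = re.findall(r'(?:[a-zA-Z]+\.){3}[a-zA-Z]+', sentence)

# Letters and digits: three "segment + dot" groups, then a final segment.
alphanumeric = re.findall(r'(?:[a-zA-Z0-9]+\.){3}[a-zA-Z0-9]+', sentence)

print(letters_only)
print(alphanumeric)
```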

Regex backreference to match opposite case

Before I begin — it may be worth stating, that: this technically does not have to be solved using a Regex, it's just that I immediately thought of a Regex when I started solving this problem, and I'm interested in knowing whether it's possible to solve using a Regex.
I've spent the last couple hours trying to create a Regex that does the following.
The regex must match a string that is ten characters long, iff the first five characters and last five characters are identical but each individual character is opposite in case.
In other words, if you take the first five characters, invert the case of each individual character, that should match the last five characters of the string.
For example, the regex should match abCDeABcdE, since the first five characters and the last five characters are the same, but each matching character is opposite in case. In other words, flip_case("abCDe") == "ABcdE"
Here are a few more strings that should match:
abcdeABCDE, abcdEABCDe, zYxWvZyXwV.
And here are a few that shouldn't match:
abcdeABCDZ, although the case is opposite, the strings themselves do not match.
abcdeABCDe, is a very close match, but should not match since the e's are not opposite in case.
Here is the first regex I tried, which is obviously wrong since it doesn't account for the case-swap process.
/([a-zA-Z]{5})\1/g
My next thought was whether the following is possible in a regex, but I've been reading several Regex tutorials and I can't seem to find it anywhere.
/([A-Z])[\1+32]/g
This new regex (that obviously doesn't work) is supposed to match a single uppercase letter, immediately followed by itself-plus-32-ascii; in other words, it should match an uppercase letter followed immediately by its lowercase counterpart. But, as far as I know, you cannot "add an ascii value" to a backreference in a regex.
And, bonus points to whoever can answer this — in this specific case, the string in question is known to be 10 characters long. Would it be possible to create a regex that matches strings of an arbitrary length?
You want to use the following pattern with the Python regex module:
^(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L}))(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$)
See the regex demo
Details
^ - start of string
(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L})) - a positive lookahead with a sequence of five capturing groups that capture the first five letters individually
(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$) - a positive lookahead that makes sure that, at the end of the string, there are 5 letters that are the same as the ones captured at the start but of the opposite case.
In brief, the first (\p{L}) in the first lookahead captures the first a in abcdeABCDE. Then, inside the second lookahead, (?i:\1) makes sure the fifth char from the end is the same letter (with case-insensitive mode on), while the (?!\1) negative lookahead makes sure this letter is not identical to the one captured.
The re module does not support \p{L} (and, before Python 3.6, did not support inline modifier groups such as (?i:...)), so this expression won't work with that module.
Python regex based module demo:
import regex
strs = ['abcdeABCDE', 'abcdEABCDe', 'zYxWvZyXwV', 'abcdeABCDZ', 'abcdeABCDe']
rx = r'^(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L}))(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$)'
for s in strs:
    print("Testing {}...".format(s))
    if regex.search(rx, s):
        print("Matched")
Output:
Testing abcdeABCDE...
Matched
Testing abcdEABCDe...
Matched
Testing zYxWvZyXwV...
Matched
Testing abcdeABCDZ...
Testing abcdeABCDe...
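For the bonus question, the same idea can be generated for any known half-length n. The following is a sketch using only the standard re module, under stated assumptions: Python 3.6 or later (for scoped flags like (?i:...)), ASCII letters only ([a-zA-Z] instead of \p{L}), and .{n} instead of .* so that the total length is exactly 2n. The function name build_case_mirror_pattern is hypothetical.

```python
import re

def build_case_mirror_pattern(n):
    """Build a pattern matching 2n-letter strings whose second half is the
    first half with every letter's case flipped (ASCII letters only)."""
    # First lookahead: capture the first n letters individually as groups 1..n.
    captures = '([a-zA-Z])' * n
    # For each group: not the identical character, but equal ignoring case.
    checks = ''.join(r'(?!\{0})(?i:\{0})'.format(i) for i in range(1, n + 1))
    # Second lookahead: skip exactly n characters, then apply the checks.
    return re.compile('^(?=' + captures + ')(?=.{' + str(n) + '}' + checks + '$)')

rx5 = build_case_mirror_pattern(5)
for s in ['abCDeABcdE', 'abcdeABCDE', 'zYxWvZyXwV', 'abcdeABCDZ', 'abcdeABCDe']:
    print(s, '->', bool(rx5.search(s)))
```

The pattern still has to be built for a specific n; a single fixed regex for truly arbitrary lengths is not what this sketch provides. Outside of regex, comparing first_half.swapcase() with the second half is a simpler check.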

Why doesn't this regex pattern work as intended?

I needed a regex pattern to match any 16-digit number (the groups of four digits may be separated by hyphens) in which no digit is repeated more than 3 times in a row, with or without hyphens in between.
So the pattern I wrote is
a=re.compile(r'(?!(\d)\-?\1\-?\1\-?\1)(^\d{4}\-?\d{4}\-?\d{4}\-?\d{4}$)')
But the example "5133-3367-8912-3456" gets matched even when 3 is repeated 4 times. (What is the problem with the negative lookahead section?)
Lookaheads only do the check at the position they are at, so in your case at the start of the string. If you want a lookahead to check the whole string for whether a certain pattern can or can't be matched, you can add .* in front to make it look deeper into the string.
In your case, you could change it to r'(?!.*(\d)\-?\1\-?\1\-?\1)(^\d{4}\-?\d{4}\-?\d{4}\-?\d{4}$)'.
There is also no need to escape the hyphens where they are used here, and I would move the lookahead right after the ^. I don't know how well Python regexes are optimized, but that way the start-of-string anchor is matched first (only 1 valid position) instead of checking the lookahead at every position just to fail the match at ^. This would give r'^(?!.*(\d)-?\1-?\1-?\1)(\d{4}-?\d{4}-?\d{4}-?\d{4}$)'
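A quick sketch of the final pattern against the example from the question; apart from "5133-3367-8912-3456", the test numbers and the helper name is_valid are made up for illustration.

```python
import re

# Anchor first, then a negative lookahead that scans the whole string for a
# digit repeated four times in a row (hyphens between repeats are ignored).
card = re.compile(r'^(?!.*(\d)-?\1-?\1-?\1)(\d{4}-?\d{4}-?\d{4}-?\d{4}$)')

def is_valid(number):
    """Hypothetical helper: True if number passes the pattern."""
    return card.search(number) is not None

for number in ['5133-3367-8912-3456',   # 3 appears four times in a row
               '5123-4567-8912-3456',   # no run of four
               '5123456789123456',      # same digits without hyphens
               '4444-1234-1234-1234']:  # run of four at the start
    print(number, '->', is_valid(number))
```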

Match to string length by using regex in python

I'm writing a Python regex for a string. I want the string to be at least 1 character and at most 30 characters long. The problem is that I'm using 3 sub-blocks in the regex, so the string must always be at least 3 characters long.
Is it possible to add that condition (1-30 characters in length) to this regex:
regex = re.compile("^[a-zA-Z]+[a-zA-Z0-9\.\-]+[a-zA-Z0-9]$")
r = regex.search(login)
Thank you.
Although it is not clear which 1- or 2-character strings you want to accept, I propose the following regex:
regex = re.compile(r"^[a-zA-Z][a-zA-Z0-9\.\-]{0,28}[a-zA-Z0-9]$")
Because the middle character class includes everything the other two allow, this directly matches all strings of length 2 to 30.
This regex therefore also accepts 2-character strings (I just assumed that the first character must be a letter); for single-letter matches you need to add an alternative (using '|').
In general, this is difficult and doing some work outside of the RE (as suggested in the comment by M. Buettner) is often required. Your problem is easier because it can be reduced to a pattern with only one repeating element.
You have one or more letters, followed by one or more of (letter, digit, dot, hyphen) followed by a single (letter or digit), right? If so, the repetition of the first group is not needed. Leave off the + to get
r"^[a-zA-Z][a-zA-Z0-9\.\-]+[a-zA-Z0-9]$"
and you will match exactly the same set of strings. Any extra leading letters past the first will be matched in the second group instead of the first.
Now, the only variable portion of your RE is the middle section. To limit the overall length to 30, all you need do is limit that middle portion to 28 characters. Change the + to {1,28} to get:
r"^[a-zA-Z][a-zA-Z0-9\.\-]{1,28}[a-zA-Z0-9]$"
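The length limit can be verified with a few boundary cases. This is a sketch: the second pattern, which makes everything after the first letter optional so that 1 to 30 characters are accepted, is an assumption about what the question wants and does not come from either answer.

```python
import re

# Pattern from this answer: letter, 1-28 middle characters, letter or digit
# (total length 3 to 30).
three_to_thirty = re.compile(r"^[a-zA-Z][a-zA-Z0-9.\-]{1,28}[a-zA-Z0-9]$")

# Assumed variant for 1 to 30 characters: the tail after the first letter
# is optional, so a single letter is accepted as well.
one_to_thirty = re.compile(r"^[a-zA-Z](?:[a-zA-Z0-9.\-]{0,28}[a-zA-Z0-9])?$")

for login in ["a", "ab", "abc", "a.b-c9", "abc-", "1abc", "a" + "b" * 29, "a" + "b" * 30]:
    print(repr(login), len(login),
          bool(three_to_thirty.search(login)),
          bool(one_to_thirty.search(login)))
```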
You can read more about Python REs at:
http://docs.python.org/2/library/re.html
