Regex optional character with a conditional clause

I have a regex problem that combines the ideas of optional characters and conditional regex statements that I'm unsure how to solve.
I want to find a pattern that, in addition to matching an initial number, will also match the following uppercase letter if and only if that character is not followed by a lowercase letter. The string will only ever have one number. For example:
'fds;o2Ko ' ==> '2'
'rejy 3ked' ==> '3'
's.fg6G hb' ==> '6G'
'3M- gfafg' ==> '3M'
'dgfAN adg' ==> no pattern found
I've tried various combinations of conditional statements but can't seem to combine the concepts properly. I'm working in python using the following code:
pattern = r'[1-9][A-Z]?'
ID = str(re.findall(pattern, 's.fg6G hb')).strip('[]\'')
The above is what I want without the conditional statement. I'm unsure how to include an appropriate conditional statement. I think pattern would be something like r'[1-9](?(?=[A-Z][a-z])[A-Z]?|)' but don't understand how I can look ahead beyond the current character.

It seems you can try to use:
\d+(?:[A-Z](?![a-z]))?
See the online demo
\d+ - 1+ Digits.
(?: - Open non-capture group:
[A-Z] - A single uppercase alpha.
(?![a-z]) - A nested negative lookahead to assert position is not followed by a lowercase alpha.
)? - Close non-capture group and make it optional.
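
As a quick check, the pattern above can be run against the sample strings from the question. This is only a minimal sketch: the helper name find_id is hypothetical, and it uses re.search instead of the findall/strip approach shown in the question.

```python
import re

# One or more digits, optionally followed by an uppercase letter
# that is not itself followed by a lowercase letter.
PATTERN = re.compile(r'\d+(?:[A-Z](?![a-z]))?')

def find_id(text):
    """Return the matched ID, or None when the string contains no digit."""
    match = PATTERN.search(text)
    return match.group() if match else None

samples = ['fds;o2Ko ', 'rejy 3ked', 's.fg6G hb', '3M- gfafg', 'dgfAN adg']
for s in samples:
    print(repr(s), '==>', find_id(s))
```

Returning None for the no-match case avoids having to strip list brackets from the string form of a findall result.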

Related

regex non greedy quantifier catching nothing, greedy catching too much

I'm writing a python regex formula that parses the content of a heading, however the greedy quantifier is not working well, and the non greedy quantifier is not working at all.
My string is
Step 1 Introduce The Assets:
Step2 Verifying the Assets
Step 3Making sure all the data is in the right place:
What I'm trying to do is extract the step number, and the heading, excluding the :.
Now I've tried multiple regex strings and came up with these 2:
r1 = r"Step ?([0-9]+) ?(.*) ?:?"
r2 = r"Step ?([0-9]+) ?(.*?) ?:?"
r1 is capturing the step number, but is also capturing : at the end.
r2 is capturing the step number, and ''. I'm not sure how to handle the case where there is a .* followed by a string.
Necessary Edit:
The heading might contain : inside the string, I just want to ignore the trailing one. I know I can strip(':') but I want to understand what I'm doing wrong.

You can write the pattern with a negated character class, dropping the non-greedy and optional parts:
\bStep ?(\d+) ?([^:\n]+)
\bStep ? Match the word Step and optional space
(\d+) ? Capture 1+ digits in group 1 followed by matching an optional space
([^:\n]+) Capture 1+ chars other than : or a newline in group 2
Regex demo
If the colon has to be at the end of the string:
\bStep ?(\d+) ?([^:\n]+):?$
Regex demo
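
The first pattern can be tried on the three headings from the question. This is a small sketch, assuming the headings arrive as one multi-line string; the variable names are made up for illustration.

```python
import re

text = """Step 1 Introduce The Assets:
Step2 Verifying the Assets
Step 3Making sure all the data is in the right place:"""

# Group 1: the step number. Group 2: everything up to the first ':' or newline.
step_re = re.compile(r'\bStep ?(\d+) ?([^:\n]+)')

steps = step_re.findall(text)
for number, heading in steps:
    print(number, '|', heading)
```

Because [^:\n]+ stops at the first colon, a heading that contains a colon in the middle would be cut off there, so this sketch only covers headings like the three shown.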

How to not capture a group in regex if it is followed by an another group

If I have a string eg.: 'hcto,231' or 'hcto.12' I want to be able to capture 'o,231' or 'o.12' and process it as a number ('hct' is random and any other string can replace it).
But I don't want to capture anything if the 'o' character is followed by a decimal number, e.g. 'wordo.23.12' or 'wordo,23,12'.
I've tried using the following regex:
([oO][.,][0-9]+)(?!([.,][0-9]+))
but it always matches.
In the string 'hcto.22.23' it matches the 'o.2' part, but I don't want it to match anything. Is there a way to combine groups so it won't match if the negative lookahead is true?

The match occurs in hcto.22.23 because the failed lookahead triggers backtracking, and since [0-9]+ can match a single 2 (it does not have to match 22), the match succeeds and returns a shorter, unexpected match.
It seems the simplest way to fix the current issue is to make the dot or comma pattern in the lookahead optional, and remove unnecessary groups:
[oO][.,]\d+(?![.,]?\d)
See the regex demo.
Details
[oO] - o or O
[.,] - a dot or comma
\d+ - one or more digits
(?![.,]?\d) - not followed by a . or , and a digit, or by just a digit.
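
To see the difference, the original pattern and the fixed one can be compared on the strings from the question. This is a minimal sketch; the helper name extract is hypothetical.

```python
import re

original = re.compile(r'([oO][.,][0-9]+)(?!([.,][0-9]+))')
fixed = re.compile(r'[oO][.,]\d+(?![.,]?\d)')

def extract(pattern, text):
    """Return the matched text, or None when there is no match."""
    m = pattern.search(text)
    return m.group() if m else None

for s in ['hcto,231', 'hcto.12', 'hcto.22.23', 'wordo,23,12']:
    print(s, '| original:', extract(original, s), '| fixed:', extract(fixed, s))
```

With the original pattern, backtracking shortens the digit run until the lookahead passes, which is why 'hcto.22.23' still yields a partial match. The fixed lookahead also rejects a bare following digit, so shortening the digit run no longer helps.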

Regex match for non hyphenated words

I am trying to create a regex expression in Python for non-hyphenated words but I am unable to figure out the right syntax.
The requirements for the regex are:
It should not contain hyphens AND
It should contain at least 1 number
The expressions that I tried are:
^(?!.*-)
This matches all non-hyphenated words but I am not able to figure out how to additionally add the second condition.
^(?!.*-(?=/d{1,}))
I tried using double lookahead but I am not sure about the syntax to use for it. This matches ID101 but also matches STACKOVERFLOW
Sample Words Which Should Match:
1DRIVE , ID100 , W1RELESS
Sample Words Which Should Not Match:
Basically any non-numeric string (like STACK , OVERFLOW) or any hyphenated words (Test-11 , 24-hours)
Additional Info:
I am using library re and compiling the regex patterns and using re.search for matching.
Any assistance would be very helpful as I am new to regex matching and am stuck on this for quite a few hours.

Maybe,
(?!.*-)(?=.*\d)^.+$
might simply work OK.
Test
import re
string = '''
abc
abc1-
abc1
abc-abc1
'''
expression = r'(?m)(?!.*-)(?=.*\d)^.+$'
print(re.findall(expression, string))
Output
['abc1']
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
RegEx 101 Explanation
/(?!.*-)(?=.*\d)^.+$/gm
Negative Lookahead (?!.*-)
Assert that the Regex below does not match
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
- matches the character - literally (case sensitive)
Positive Lookahead (?=.*\d)
Assert that the Regex below matches
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\d matches a digit (equal to [0-9])
^ asserts position at start of a line
.+ matches any character (except for line terminators)
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)

I came up with -
^[^-]*\d[^-]*$
so we need at LEAST one digit (\d)
We need the rest of the string to contain anything BUT a - ([^-])
We can have unlimited number of those characters, so [^-]*
But putting them together like [^-]*\d would wrongly accept aaa3- because the - comes after a valid match. Let's make sure no dashes can sneak in before or after our match: ^[^-]*\d$
Unfortunately that means that aaa555D fails. So we actually need to add the first group again: ^[^-]*\d[^-]*$, which says: start, any number of chars that aren't dashes, a digit, any number of chars that aren't dashes, end.
Depending on style, we could also do ^([^-]*\d)+[^-]*$ since the order of the digits and other characters doesn't matter, and we can have as many of those as we want.
However, finally... this is how I would ACTUALLY solve this particular problem, since regexes may be powerful, but they tend to make the code harder to understand...
if ("-" not in text) and re.search(r"\d", text):
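
Both approaches from this answer can be checked against the sample words in the question. This is a sketch only; the function names matches_regex and matches_plain are hypothetical, and it assumes each word is tested on its own, as with re.search in the question.

```python
import re

# Start, non-dash chars, one digit, non-dash chars, end.
NO_HYPHEN_WITH_DIGIT = re.compile(r'^[^-]*\d[^-]*$')

def matches_regex(word):
    """Single-regex version."""
    return NO_HYPHEN_WITH_DIGIT.search(word) is not None

def matches_plain(word):
    """Plain-Python version: no hyphen anywhere and at least one digit."""
    return '-' not in word and re.search(r'\d', word) is not None

words = ['1DRIVE', 'ID100', 'W1RELESS', 'STACK', 'OVERFLOW', 'Test-11', '24-hours']
for w in words:
    print(w, matches_regex(w), matches_plain(w))
```

The two functions agree on all of these words; the plain version just states the two requirements separately.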

capture the number with comma or dot with regex

I have regex code
https://regex101.com/r/o5gdDt/8
As you see this code
(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d,])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
can capture all numbers whose digits are separated by commas into groups of 3, in text like
"here is 100,100"
"23,456"
"1,435"
all numbers with 4 or more digits and no comma separators, like
2345
1234 " here is 123456"
also this kind of number
65,656½
65,656½,
23,123½
The only tiny issue here is that if there is a comma (or dot) after the first two types, it cannot capture them. For example, it cannot capture
"here is 100,100,"
"23,456,"
"1,435,"
Unfortunately, there are a few numbers in the text which end with a comma. Can someone give me an idea of how to modify this to capture the above as well?
I have tried to do it, and the modified version is:
(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
Basically I deleted the comma in (?![\d,]), but that causes another problem in my context:
it captures part of a number that is part of an equation, like this:
4,310,747,475x2
57,349,565,416,398x.
see here:
https://regex101.com/r/o5gdDt/10
I know this is a rather special question. I would be happy to hear your ideas.

The main problem here is that (?![\d,]) fails any match followed with a digit or comma while you want to fail the match when it is followed with a digit or a comma plus a digit.
Replace (?![\d,]) with (?!,?\d).
Also, (?<!\S)(?<![\d,]) looks redundant, as (?<!\S) requires a whitespace or start of string and that is certainly not a digit or ,. Either use (?<!\S) or (?<!\d)(?<!\d,) depending on your requirements.
Join the negative lookaheads with OR: (?!x)(?!/) => (?!x|/) => (?![x/]).
You want to avoid matching years, but you just fail all numbers that start with them, so 2020222 won't get matched. Add (?!\d) to the lookahead, (?!(?:1[2-9]\d\d|20[01]\d|2020)(?!\d)).
So, the pattern might look like
(?<!\S)(?:(?!(?:1[2-9]\d\d|20[01]\d|2020)(?!\d))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?!,?\d)[\u00BC-\u00BE\u2150-\u215E]?(?![x/])
See the regex demo.
IMPORTANT: You have [\u00BC-\u00BE\u2150-\u215E]?(?![x/]) at the end, a negative lookahead after an optional pattern. Once the engine fails to find the match for x or /, it will backtrack and will most probably find a partial match. If you do not want to match 65,656 in 65,656½x, replace [\u00BC-\u00BE\u2150-\u215E]?(?![x/]) with (?![\u00BC-\u00BE\u2150-\u215E]?[x/])[\u00BC-\u00BE\u2150-\u215E]?.
See another regex demo.
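
The suggested pattern can be exercised in Python roughly as follows. This is a sketch under the assumption that Python 3's re module is used (it accepts \uXXXX escapes inside the pattern); the sample strings are taken from or modeled on the question, and the variable name number_re is made up.

```python
import re

number_re = re.compile(
    r'(?<!\S)'
    r'(?:(?!(?:1[2-9]\d\d|20[01]\d|2020)(?!\d))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?'
    r'|\d{1,3}(?:,\d{3})+)'
    r'(?!,?\d)[\u00BC-\u00BE\u2150-\u215E]?(?![x/])'
)

samples = [
    'here is 100,100,',   # trailing comma no longer blocks the match
    '23,456,',
    '65,656\u00BD,',      # vulgar fraction one half after the number
    'here is 123456',
    '2020222',            # starts like a year but is longer, so it matches
    '2019',               # a bare year is still skipped
    '4,310,747,475x2',    # part of an equation, skipped
]
for s in samples:
    print(repr(s), '->', number_re.findall(s))
```

Since the pattern has no capturing groups, findall returns the full matched text for each hit.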

regex named group if exist

Good morning,
I have a string that I need to parse and print the content of two named group knowing that one might not exist.
The string looks like this (basically content of /proc/pid/cmdline):
"""
<some chars with letters / numbers / space / punctuation> /CLASS_NAME:myapp.server.starter.StarterHome /PARAM_XX:value_XX /PARAM_XX:value_XX /CONFIG_FILE:myapp.server.config.myconfig.txt /PARAM_XX:value_XX /PARAM_XX:value_XX /PARAM_XX:value_XX <some chars with letters / numbers / space / punctuation>
"""
my processes have almost the same pattern, that is:
/CLASS_NAME:myapp.server.starter.StarterHome is always present, but
/CONFIG_FILE:myapp.server.config.myconfig.txt is NOT always present.
I'm using python2 with re module to catch the values. So far my pattern looks like this and I'm able to catch the value I want corresponding to /CLASS_NAME
re.compile('CLASS_NAME:\w+\W\w+\W\w+\W(?P<class>\w+)')
Then, because /CONFIG_FILE may or may not be present, I added the following to my regexp:
re.compile(r"""CLASS_NAME:\w+\W\w+\W\w+\W(?P<class>\w+).*?
(CONFIG_FILE:\w+\W\w+\W\w+\W(?P<cnf>\w+.txt))?
""", re.X)
My understanding is that the second part of my regexp is optional because the whole part is between parentheses followed by ?.
Unfortunately my assumption is wrong, as it couldn't catch it.
I also tried by removing the 1st ? but it didn't help.
I gave several tries through PYTHEX to try to understand my regexp but couldn't find a solution.
Could anyone have any suggestion to resolve my case?

You can wrap the whole optional part within an optional non-capturing group and make the capturing group for CONFIG_FILE obligatory:
re.compile(r"""CLASS_NAME:(?:\w+\W+){3}(?P<class>\w+)(?:.*?
(CONFIG_FILE:(?:\w+\W+){3}(?P<cnf>\w+\.txt)))?
""", re.X)
In case there are newlines, use re.X | re.S modifier options. Note that \w+\W\w+\W\w+\W is better written as (?:\w+\W+){3}.
See the regex demo
The main difference is (?:.*?(CONFIG_FILE:(?:\w+\W+){3}(?P<cnf>\w+\.txt)))? part:
(?: - start of an optional (as there is a greedy ? quantifier after it) non-capturing group matching
.*? - any 0+ chars, as few as possible
(CONFIG_FILE:(?:\w+\W+){3}(?P<cnf>\w+\.txt)) - matches
CONFIG_FILE: - a literal substring
(?:\w+\W+){3} - three sequences of 1+ word chars followed with 1+ non-word chars
(?P<cnf>\w+\.txt) - Group cnf: 1+ word chars, a dot (note it should be escaped) and then txt
)? - end of the optional non-capturing group (that will be tried once)
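
A short sketch of how the pattern behaves with and without /CONFIG_FILE. The two command lines below are shortened, hypothetical samples modeled on the one in the question (the leading "java" is invented), and the helper name parse is made up. The code runs on Python 3; the pattern itself does not depend on the Python version.

```python
import re

cmdline_re = re.compile(r"""CLASS_NAME:(?:\w+\W+){3}(?P<class>\w+)(?:.*?
(CONFIG_FILE:(?:\w+\W+){3}(?P<cnf>\w+\.txt)))?
""", re.X | re.S)

# Hypothetical sample command lines, one with and one without CONFIG_FILE.
with_cfg = ("java /CLASS_NAME:myapp.server.starter.StarterHome /PARAM_XX:value_XX "
            "/CONFIG_FILE:myapp.server.config.myconfig.txt /PARAM_XX:value_XX")
without_cfg = "java /CLASS_NAME:myapp.server.starter.StarterHome /PARAM_XX:value_XX"

def parse(cmdline):
    """Return (class, cnf); cnf is None when CONFIG_FILE is absent."""
    m = cmdline_re.search(cmdline)
    if m is None:
        return None
    return m.group('class'), m.group('cnf')

print(parse(with_cfg))
print(parse(without_cfg))
```

When the optional group does not participate in the match, m.group('cnf') is None, so the caller can test for the missing config file directly.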
