Find first regex pattern following a different pattern - Python

Objective: find a second pattern and consider it a match only if it is the first time the pattern was seen following a different pattern.
Background:
I am using Python-2.7 Regex
I have a specific Regex match that I am having trouble with. I am trying to get the text between the square brackets in the following sample.
Sample comments:
[98 g/m2 Ctrl (No IP) 95 min 340oC ]
[ ]
I need the line:
98 g/m2 Ctrl (No IP) 95 min 340oC
The trouble is the undetermined number of spaces, tabs, and newlines between the search pattern Sample comments: and the text I want to match.
Best Attempt:
I am able to match the first part easily,
match = re.findall(r'Sample comments:[.+\n+]+', string)
But I can't get the match to the length I want to grab the portion between the square brackets,
match = re.findall(r'Sample comments:[.+\n+]+\[(.+)\]', string)
My Thinking:
Is there a way to use regex to find the first instance of the pattern \[(.+)\] after a match of the pattern Sample comments:? Or is there a more robust way to find the bit between the square brackets in my example case?
Thanks,
Michael

I suggest using
r'Sample comments:\s*\[(.*?)\s*]'
See the regex and IDEONE demo
The point is the \s* matches zero or more whitespace, both vertical (linebreaks) and horizontal. See Python re reference:
\s
When the UNICODE flag is not specified, it matches any whitespace character, this is equivalent to the set [ \t\n\r\f\v]. The LOCALE flag has no extra effect on matching of the space. If UNICODE is set, this will match the characters [ \t\n\r\f\v] plus whatever is classified as space in the Unicode character properties database.
Pattern details:
Sample comments: - a sequence of literal chars
\s* - 0 or more whitespaces
\[ - a literal [
(.*?) - Group 1 (returned by re.findall) capturing 0+ any chars but a newline as few as possible up to the first...
\s* - 0+ whitespaces and
] - a literal ] (note it does not have to be escaped outside the character class).
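As a quick check, the suggested pattern can be run against a string shaped like the sample in the question. The exact mix of newlines and tabs in text below is an assumption, since the original whitespace is not shown.

```python
import re

# Assumed input: the whitespace between the label and the bracket is made up.
text = "Sample comments:\n\t \n[98 g/m2 Ctrl (No IP) 95 min 340oC ]\n[ ]"

pattern = re.compile(r'Sample comments:\s*\[(.*?)\s*]')

# findall returns the contents of group 1 for every match.
matches = pattern.findall(text)
print(matches)  # ['98 g/m2 Ctrl (No IP) 95 min 340oC']
```

Because (.*?) is lazy and is followed by \s*], the space before the closing bracket stays out of the captured group.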

Not sure if I understand your problem correctly, but re.findall('Sample comments:[^\\[]*\\[([^\\]]*)\\]', string) seems to work.
Or maybe re.findall('Sample comments:[^\\[]*\\[[ \t]*([^\\]]*?)[ \t]*\\]', string) if you want to strip the final spaces from your line?
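A small sketch comparing the two negated-character-class patterns above, written as raw strings so the backslashes do not need doubling. The sample text is an assumption modelled on the question.

```python
import re

# Assumed input, modelled on the sample in the question.
text = "Sample comments:\n\n[98 g/m2 Ctrl (No IP) 95 min 340oC ]\n[ ]"

# Everything up to the first '[' is skipped by [^\[]*.
plain = re.findall(r'Sample comments:[^\[]*\[([^\]]*)\]', text)

# Same idea, but spaces and tabs next to the brackets are kept out of the group.
trimmed = re.findall(r'Sample comments:[^\[]*\[[ \t]*([^\]]*?)[ \t]*\]', text)

print(plain)    # ['98 g/m2 Ctrl (No IP) 95 min 340oC ']
print(trimmed)  # ['98 g/m2 Ctrl (No IP) 95 min 340oC']
```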

Related

Getting a correct regex for word starting and ending with different letters

I am quite new to regex and am having a problem formulating a regex to match a string where the first and last letters are different. I looked on the internet and found a regex that does just the opposite, i.e. it matches words that have the same starting and ending letter. Can anyone please help me understand whether I can negate this regex in some way, or create a new regex to match my requirements? The regex that needs to be modified or changed is:
^\s|^[a-z]$|^([a-z]).*\1$
This matches these Strings :
aba,
a,
b,
c,
d,
" ",
cccbbbbbbac,
aaaaba
But I want it to match strings like:
aaabbcz,
zba,
ccb,
cbbbba
Can anyone please help me in this regard? Thank you.
Note: I will be using this with Python regex, so the regex should be compatible with Python.
You don't need a regex for this, just use
s[0] != s[-1]
where s is your string. If you must use a regex, you can use this:
^(.).*(?!\1).$
This looks for
^ : beginning of string
(.) : a character (captured in group 1)
.* : some number of characters
(?!\1). : a character which is not the character captured in group 1
$ : end of string
Regex demo on regex101
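To see that the two approaches agree, here is a minimal sketch that runs both over the strings listed in the question. The helper name first_last_differ is hypothetical, and treating strings shorter than two characters as "not different" is an assumption chosen to mirror the regex, which needs at least two characters.

```python
import re

differs = re.compile(r'^(.).*(?!\1).$')

def first_last_differ(s):
    # Assumption: strings shorter than 2 characters do not count as a match.
    return len(s) > 1 and s[0] != s[-1]

samples = ["aaabbcz", "zba", "ccb", "cbbbba", "aba", "a", "cccbbbbbbac", "aaaaba"]

regex_result = {s: differs.match(s) is not None for s in samples}
plain_result = {s: first_last_differ(s) for s in samples}

for s in samples:
    print(repr(s), regex_result[s], plain_result[s])
```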
This part of your pattern ^([a-z]).*\1$ only accounts for chars a-z, but you also want to exclude " "
You can rewrite that pattern by putting the part after the capture group inside a negative lookahead.
^(.)(?!.*\1$).+
^ Start of string
(.) Capture a single char (including spaces) in group 1
(?!.*\1$) Negative lookahead, assert that the string does not end with the same character
.+ Match 1+ characters so that the string has a minimum of 2 characters
See a regex demo.
If the string should start and end with a non-whitespace character to prevent leading / trailing spaces, you can start the match with a non-whitespace character \S and also end the match with a non-whitespace character.
^(\S)(?!.*\1$).*\S$
See another regex demo.
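The difference between the two lookahead patterns shows up on strings with a space at either edge. The following is only a sketch, and the test strings are made up for illustration.

```python
import re

any_chars = re.compile(r'^(.)(?!.*\1$).+')
no_edge_space = re.compile(r'^(\S)(?!.*\1$).*\S$')

# Hypothetical inputs: two ordinary words, a single char, and two with edge spaces.
samples = ["zba", "aba", "a", " ab", "ab "]

loose = {s: any_chars.match(s) is not None for s in samples}
strict = {s: no_edge_space.match(s) is not None for s in samples}

for s in samples:
    print(repr(s), loose[s], strict[s])
```

Both patterns reject aba and a, but only the second one also rejects the strings that begin or end with a space.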

regex non greedy quantifier catching nothing, greedy catching too much

I'm writing a Python regex that parses the content of a heading. However, the greedy quantifier is not working well, and the non-greedy quantifier is not working at all.
My string is
Step 1 Introduce The Assets:
Step2 Verifying the Assets
Step 3Making sure all the data is in the right place:
What I'm trying to do is extract the step number, and the heading, excluding the :.
Now I've tried multiple regex strings and came up with these 2:
r1 = r"Step ?([0-9]+) ?(.*) ?:?"
r2 = r"Step ?([0-9]+) ?(.*?) ?:?"
r1 is capturing the step number, but is also capturing : at the end.
r2 is capturing the step number, and ''. I'm not sure how to handle the case where there is a .* followed by a string.
Necessary Edit:
The heading might contain : inside the string, I just want to ignore the trailing one. I know I can strip(':') but I want to understand what I'm doing wrong.
You can write the pattern using a negated character class, without the non-greedy and optional parts:
\bStep ?(\d+) ?([^:\n]+)
\bStep ? Match the word Step and optional space
(\d+) ? Capture 1+ digits in group 1 followed by matching an optional space
([^:\n]+) Capture 1+ chars other than : or a newline in group 2
Regex demo
If the colon has to be at the end of the string:
\bStep ?(\d+) ?([^:\n]+):?$
Regex demo
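A sketch of the negated-class pattern on the headings from the question. The fourth heading and the anchored pattern are additions for illustration, not part of the answer above. They cover the case from the question's edit where a heading contains a colon in the middle. The anchored variant also shows why r2 captured an empty string: with nothing after the lazy (.*?) that is forced to match, the group can stop immediately, whereas anchoring with $ makes it run to the end of the line.

```python
import re

text = (
    "Step 1 Introduce The Assets:\n"
    "Step2 Verifying the Assets\n"
    "Step 3Making sure all the data is in the right place:\n"
    "Step 4 Review: final checks:"  # hypothetical heading with an inner colon
)

# Pattern from the answer: the heading stops at the first colon.
negated = re.compile(r'\bStep ?(\d+) ?([^:\n]+)')

# Assumed variant: lazy group anchored to the end of each line, so only a
# trailing colon is dropped.
anchored = re.compile(r'^Step ?(\d+) ?(.*?):?$', re.MULTILINE)

negated_steps = negated.findall(text)
anchored_steps = anchored.findall(text)

print(negated_steps[3])   # ('4', 'Review')
print(anchored_steps[3])  # ('4', 'Review: final checks')
```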

Extract number when still attached to string [duplicate]

How can I match the first character when it is not followed by an integer?
I need to find every place that has a '[' without an integer right after it.
For example:
[abc] pass
[cxvjk234] pass
[123] fail
Right now, I have this:
((([[])([^0-9])))
It gets the first 2 characters while I need only one.
In general, to match some pattern not followed with a digit, you need to add a (?!\d) / (?![0-9]) negative lookahead to the expression:
\[(?!\d)
\[(?![0-9])
^^^^^^^^^
See the regex demo. This matches any [ symbol that is not immediately followed with a digit.
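The negative lookahead can be tried on the three examples from the question. This is a minimal sketch; the variable names are made up.

```python
import re

bracket_not_before_digit = re.compile(r'\[(?!\d)')

samples = ["[abc]", "[cxvjk234]", "[123]"]
verdicts = {
    s: "pass" if bracket_not_before_digit.search(s) else "fail"
    for s in samples
}
print(verdicts)

# The lookahead consumes nothing, so the match is the single '[' character.
matched_text = bracket_not_before_digit.search("[abc]").group()
print(matched_text)  # [
```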
Your current regex pattern is overloaded with capturing groups, and if we remove those redundant ones, it looks like (\[)([^0-9]) - it matches a [ and then a char other than an ASCII digit.
You may use
(?<=\[)\D
or (if you want the pattern to exclude only ASCII digits)
(?<=\[)[^0-9]
See the regex demo
Details:
(?<=\[) - a positive lookbehind requiring a [ (but not consuming the [ char, i.e. not returning it as part of the match value) before...
\D / [^0-9] - a non-digit. NOTE: to only negate ASCII digits, you may use \D with the RegexOptions.ECMAScript flag.
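For Python specifically (Python 3 and str patterns assumed here), the lookbehind works the same way, and the re.ASCII flag is what restricts \d and \D to ASCII digits. A short sketch with made-up sample strings:

```python
import re

text = "[abc] [cxvjk234] [123]"

# The lookbehind checks for '[' without including it in the match.
first_chars = re.findall(r'(?<=\[)[^0-9]', text)
print(first_chars)  # ['a', 'c']

# U+0663 is an Arabic-Indic digit: a digit for \d by default, but not under re.ASCII.
arabic_three = "[\u0663]"
default_hits = re.findall(r'(?<=\[)\D', arabic_three)
ascii_hits = re.findall(r'(?<=\[)\D', arabic_three, flags=re.ASCII)
print(default_hits, ascii_hits)
```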
One possible solution would be:
\[\D[^]]*\]
# look for [
# \D - not a digit
# anything not ], zero or more times
# followed by ]
See a demo on regex101.com.
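Run in Python, this pattern returns the whole bracketed token. The sketch below uses the question's examples plus [123abc], which is added here as an assumption to show that a leading digit is enough to reject a token.

```python
import re

whole_token = re.compile(r'\[\D[^]]*\]')

# "[123abc]" is a hypothetical extra case, not from the question.
text = "[abc] [cxvjk234] [123] [123abc]"

tokens = whole_token.findall(text)
print(tokens)  # ['[abc]', '[cxvjk234]']
```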
Don't use so many parentheses. Parentheses are both grouping and determining what's returned as the match.
\[([^0-9])
If you absolutely need to use the parentheses, use (?:…) for parentheses that group but are not returned as part of the match.
(?:(?:(?:\[)([^0-9])))
maybe what you're searching for is \[([^0-9])
Do you expect [123abc] to pass?

Regex match for non hyphenated words

I am trying to create a regex expression in Python for non-hyphenated words but I am unable to figure out the right syntax.
The requirements for the regex are:
It should not contain hyphens AND
It should contain at least 1 number
The expressions that I tried are:
^(?!.*-)
This matches all non-hyphenated words but I am not able to figure out how to additionally add the second condition.
^(?!.*-(?=/d{1,}))
I tried using double lookahead but I am not sure about the syntax to use for it. This matches ID101 but also matches STACKOVERFLOW
Sample Words Which Should Match:
1DRIVE , ID100 , W1RELESS
Sample Words Which Should Not Match:
Basically any non-numeric string (like STACK , OVERFLOW) or any hyphenated words (Test-11 , 24-hours)
Additional Info:
I am using library re and compiling the regex patterns and using re.search for matching.
Any assistance would be very helpful as I am new to regex matching and am stuck on this for quite a few hours.
Maybe,
(?!.*-)(?=.*\d)^.+$
might simply work OK.
Test
import re
string = '''
abc
abc1-
abc1
abc-abc1
'''
expression = r'(?m)(?!.*-)(?=.*\d)^.+$'
print(re.findall(expression, string))
Output
['abc1']
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
RegEx 101 Explanation
/
(?!.*-)(?=.*\d)^.+$
/
gm
Negative Lookahead (?!.*-)
Assert that the Regex below does not match
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
- matches the character - literally (case sensitive)
Positive Lookahead (?=.*\d)
Assert that the Regex below matches
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\d matches a digit (equal to [0-9])
^ asserts position at start of a line
.+ matches any character (except for line terminators)
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
I came up with -
^[^-]*\d[^-]*$
so we need at LEAST one digit (\d)
We need the rest of the string to contain anything BUT a - ([^-])
We can have unlimited number of those characters, so [^-]*
but putting them together like [^-]*\d would fail on aaa3- because the - comes after a valid match. Let's make sure no dashes can sneak in before or after our match: ^[^-]*\d$
Unfortunately that means that aaa555D fails. So we actually need to add the first group again: ^[^-]*\d[^-]*$, which says start - any number of chars that aren't dashes - a digit - any number of chars that aren't dashes - end
Depending on style, we could also do ^([^-]*\d)+[^-]*$ since the order of the digits and other characters doesn't matter; we can have as many of those groups as we want, and the trailing [^-]* lets the string end with a non-digit.
However, finally... this is how I would ACTUALLY solve this particular problem, since regexes may be powerful, but they tend to make the code harder to understand...
if ("-" not in text) and re.search(r"\d", text):
    pass  # text has no hyphen and contains at least one digit
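Wrapped in a function, the plain check can be compared against the single-regex version on the sample words from the question. The function name is hypothetical and this is only a sketch.

```python
import re

HYPHEN_FREE_WITH_DIGIT = re.compile(r'^[^-]*\d[^-]*$')

def has_digit_and_no_hyphen(text):
    # Two simple tests instead of one dense pattern.
    return "-" not in text and re.search(r"\d", text) is not None

words = ["1DRIVE", "ID100", "W1RELESS", "STACK", "OVERFLOW", "Test-11", "24-hours"]

plain_results = {w: has_digit_and_no_hyphen(w) for w in words}
regex_results = {w: HYPHEN_FREE_WITH_DIGIT.match(w) is not None for w in words}

print([w for w in words if plain_results[w]])  # ['1DRIVE', 'ID100', 'W1RELESS']
```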

Exclude the characters before a given pattern

For this question, I am not interested in alternative pythonic methods, I am only interested in solving the Regex in my code. I can't figure out why it does not work.
Let's say I have the following string:
hello.world
I want to get all characters, excluding all characters before the dot, except the first one before it. So, I am trying to extract the following substring:
o.world
This is my code:
re.sub('^.*[^.\..*]', '', string)
My Regex logic is broken down as follows, the first characters ^.* which are not one character followed by a dot followed by any number of characters [^.\..*], are removed.
However, the Regex doesn't work, can someone help me out?
Your current code is not working because your pattern is not matching what you think it is. Putting .* in a character set does not mean "zero or more characters". Instead, it means the characters . or * literally. Also, \. inside a set is just an escaped ., which is redundant there since . has no special meaning in a character set.
This means that your pattern is actually equivalent to:
^.*[^.*]
which matches:
^ # The start of the string
.* # Zero or more characters
[^.*] # A character that is not . or *
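A quick way to confirm how Python's re reads that character set is to probe it one character at a time. Note that \. inside a set is an escaped dot, so a backslash character is not one of the excluded characters. The probe list is made up for illustration.

```python
import re

asker_set = re.compile(r'[^.\..*]')  # the set from the question
plain_set = re.compile(r'[^.*]')     # the same set with the duplicates removed

probe = ['.', '*', '\\', 'a']
asker_hits = [c for c in probe if asker_set.fullmatch(c)]
plain_hits = [c for c in probe if plain_set.fullmatch(c)]
print(asker_hits == plain_hits)  # True

# ^.* backtracks just far enough for the set to take the final 'd',
# so the whole string is matched and removed.
stripped = re.sub(r'^.*[^.\..*]', '', 'hello.world')
print(repr(stripped))  # ''
```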
To do what you want with re.sub, you can use:
>>> import re
>>> re.sub(r'[^.]*(.\..*)', r'\1', 'hello.world')
'o.world'
>>>
Below is an explanation of what the pattern does:
[^.]* # Matches zero or more characters that are not .
( # Starts a capture group
. # Matches any character (save a newline).
\. # Matches a literal .
.* # Matches zero or more characters
) # Closes the capture group
The important part though is the capture group. Inside the replace string, \1 will refer to whatever was matched by it, which in this case is the text that you want to keep. So, the code above can be seen as replacing all of the text with only that which we need.
That said, it seems like it would be better to just use re.search:
>>> import re
>>> re.search(r'[^.]*(.\..*)', 'hello.world').group(1)
'o.world'
>>>
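One thing to watch with re.search is that it returns None when the string has no dot with a character before it, so calling .group(1) directly would raise AttributeError. A minimal guard, with a hypothetical helper name:

```python
import re

def from_char_before_dot(s):
    # Returns the text starting one character before the first dot,
    # or None when the pattern does not match at all.
    m = re.search(r'[^.]*(.\..*)', s)
    return m.group(1) if m else None

print(from_char_before_dot('hello.world'))  # o.world
print(from_char_before_dot('nodots'))       # None
```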
