Extract number when still attached to string [duplicate] - python

How can I get the first character after '[' only if it is not an integer?
I need to find every place that has a '[' without an integer right after it.
For example:
[abc] pass
[cxvjk234] pass
[123] fail
Right now, I have this:
((([[])([^0-9])))
It gets the first 2 characters while I need only one.

In general, to match some pattern not followed with a digit, you need to add a (?!\d) / (?![0-9]) negative lookahead to the expression:
\[(?!\d)
\[(?![0-9])
^^^^^^^^^
See the regex demo. This matches any [ symbol that is not immediately followed with a digit.
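As a minimal sketch, the lookahead can be checked against the three examples from the question. Python's re module is an assumption here (the question does not name a regex flavor), and the names BRACKET_NOT_DIGIT, samples and results are made up for illustration:

```python
import re

# "[" that is not immediately followed by a digit (pattern from the answer above)
BRACKET_NOT_DIGIT = re.compile(r"\[(?!\d)")

samples = ["[abc]", "[cxvjk234]", "[123]"]

# True corresponds to "pass" in the question, False to "fail"
results = {s: bool(BRACKET_NOT_DIGIT.search(s)) for s in samples}
print(results)  # {'[abc]': True, '[cxvjk234]': True, '[123]': False}
```

The match itself is only the [ character, because a lookahead checks what follows without consuming it.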
Your current regex pattern is overloaded with capturing groups, and if we remove those redundant ones, it looks like (\[)([^0-9]) - it matches a [ and then a char other than an ASCII digit.
You may use
(?<=\[)\D
or (if you want to only match the ASCII digits with the pattern only)
(?<=\[)[^0-9]
See the regex demo
Details:
(?<=\[) - a positive lookbehind requiring a [ (but not consuming the [ char, i.e. not returning it as part of the match value) before...
\D / [^0-9] - a non-digit. NOTE: to only negate ASCII digits, you may use \D with the RegexOptions.ECMAScript flag.
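The lookbehind variant can be tried the same way. This is a small sketch in Python (an assumed flavor; the RegexOptions note above refers to .NET), with a made-up sample string:

```python
import re

text = "[abc] [cxvjk234] [123]"

# The lookbehind requires "[" just before the match but does not include it,
# so only the single non-digit character is returned.
first_chars = re.findall(r"(?<=\[)[^0-9]", text)
print(first_chars)  # ['a', 'c']
```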

One possible solution would be:
\[\D[^]]*\]
# look for [
# \D - not a digit
# anything not ], zero or more times
# followed by ]
See a demo on regex101.com.
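A quick check of this pattern in Python (an assumed flavor; the sample string is invented and includes the [123abc] case raised further down):

```python
import re

# "[", one non-digit, then anything up to the closing "]"
BRACKETED = re.compile(r"\[\D[^]]*\]")

found = BRACKETED.findall("[abc] [cxvjk234] [123] [123abc]")
print(found)  # ['[abc]', '[cxvjk234]']
```

Unlike the lookaround versions, this returns the whole bracketed token rather than a single character.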

Don't use so many parentheses. Parentheses are both grouping and determining what's returned as the match.
\[([^0-9])
If you absolutely need to use the parentheses, use (?:…) for parentheses that group but are not returned as part of the match.
(?:(?:(?:\[)([^0-9])))
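To illustrate the difference between capturing and non-capturing groups, here is a short Python sketch (Python is an assumption; the variable names are illustrative only):

```python
import re

s = "[abc]"

# Two capturing groups: both "[" and the next character are returned as groups
two_groups = re.search(r"(\[)([^0-9])", s).groups()

# One capturing group: only the character after "[" is captured
one_group = re.search(r"\[([^0-9])", s).groups()

# Same result when the extra parentheses are made non-capturing
non_capturing = re.search(r"(?:(?:(?:\[)([^0-9])))", s).groups()

# The overall match (group 0) still contains the "["
whole_match = re.search(r"\[([^0-9])", s).group(0)

print(two_groups, one_group, non_capturing, whole_match)
```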

maybe what you're searching for is \[([^0-9])
Do you expect [123abc] to pass?

Related

How to not capture a group in regex if it is followed by an another group

If I have a string eg.: 'hcto,231' or 'hcto.12' I want to be able to capture 'o,231' or 'o.12' and process it as a number ('hct' is random and any other string can replace it).
But I don't want to capture anything if the 'o' character is followed by a decimal number, e.g. 'wordo.23.12' or 'wordo,23,12'.
I've tried using the following regex:
([oO][.,][0-9]+)(?!([.,][0-9]+))
but it always matches.
In the string 'hcto.22.23' it matches part of the string, but I don't want it to match anything. Is there a way to combine groups so it won't match if the negative lookahead is true?
The match occurs in hcto.22.23 because the lookahead triggers backtracking, and since [0-9]+ can match a single 2 (it does not have to match 22), the match succeeds and returns a smaller, unexpected match, o.2.
It seems the simplest way to fix the current issue is to make the dot or comma pattern in the lookahead optional, and remove unnecessary groups:
[oO][.,]\d+(?![.,]?\d)
See the regex demo.
Details
[oO] - o or O
[.,] - a dot or comma
\d+ - one or more digits
(?![.,]?\d) - not followed by a . or , and a digit, or by just a digit.
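The backtracking effect and the fix can be reproduced with a short Python sketch (Python's re is assumed; ORIGINAL, FIXED and first_match are illustrative names, and the inputs are the strings from the question):

```python
import re

ORIGINAL = re.compile(r"([oO][.,][0-9]+)(?!([.,][0-9]+))")
FIXED = re.compile(r"[oO][.,]\d+(?![.,]?\d)")

def first_match(pattern, text):
    """Return the matched text, or None when there is no match."""
    m = pattern.search(text)
    return m.group(0) if m else None

# The original pattern backtracks from "o.22" to "o.2", where the lookahead passes
print(first_match(ORIGINAL, "hcto.22.23"))  # o.2

# The fixed lookahead also rejects a bare following digit, so nothing matches
print(first_match(FIXED, "hcto.22.23"))     # None

# The wanted cases still match
print(first_match(FIXED, "hcto,231"))       # o,231
print(first_match(FIXED, "hcto.12"))        # o.12
```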

capture the number with comma or dot with regex

I have regex code
https://regex101.com/r/o5gdDt/8
As you see this code
(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d,])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
can capture all numbers with comma-separated groups of three digits in text, like
"here is 100,100"
"23,456"
"1,435"
all numbers of 4 or more digits without comma separators, like
2345
1234 " here is 123456"
also this kind of number
65,656½
65,656½,
23,123½
The only tiny issue here is that if there is a comma (or dot) after the first two types, it cannot capture them. For example, it cannot capture
"here is 100,100,"
"23,456,"
"1,435,"
Unfortunately, there are a few numbers in the text which end with a comma. Can someone give me an idea of how to modify this to capture the above as well?
I have tried to do it, and the modified version is:
(?<!\S)(?<![\d,])(?:(?!(?:1[2-9]\d\d|20[01]\d|2020))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?![\d])[\u00BC-\u00BE\u2150-\u215E]?(?!x)(?!/)
Basically I deleted the comma in (?![\d,]), but it causes another problem in my context:
it captures part of a number that is part of an equation, like this:
4,310,747,475x2
57,349,565,416,398x.
see here:
https://regex101.com/r/o5gdDt/10
I know this is a rather special question; I would be happy to hear your ideas.
The main problem here is that (?![\d,]) fails any match followed with a digit or comma while you want to fail the match when it is followed with a digit or a comma plus a digit.
Replace (?![\d,]) with (?!,?\d).
Also, (?<!\S)(?<![\d,]) looks redundant, as (?<!\S) requires a whitespace or start of string and that is certainly not a digit or ,. Either use (?<!\S) or (?<!\d)(?<!\d,) depending on your requirements.
Join the negative lookaheads with OR: (?!x)(?!/) => (?!x|/) => (?![x/]).
You want to avoid matching years, but you just fail all numbers that start with them, so 2020222 won't get matched. Add (?!\d) to the lookahead: (?!(?:1[2-9]\d\d|20[01]\d|2020)(?!\d)).
So, the pattern might look like
(?<!\S)(?:(?!(?:1[2-9]\d\d|20[01]\d|2020)(?!\d))\d{4,}[\u00BC-\u00BE\u2150-\u215E]?|\d{1,3}(?:,\d{3})+)(?!,?\d)[\u00BC-\u00BE\u2150-\u215E]?(?![x/])
See the regex demo.
IMPORTANT: You have [\u00BC-\u00BE\u2150-\u215E]?(?![x/]) at the end, a negative lookahead after an optional pattern. Once the engine fails to find the match for x or /, it will backtrack and will most probably find a partial match. If you do not want to match 65,656 in 65,656½x, replace [\u00BC-\u00BE\u2150-\u215E]?(?![x/]) with (?![\u00BC-\u00BE\u2150-\u215E]?[x/])[\u00BC-\u00BE\u2150-\u215E]?.
See another regex demo.
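Both variants of the suggested pattern can be exercised with a Python sketch. Python 3's re is an assumption (the question does not state the flavor), the sample sentence is invented from the question's examples, and FRAC, HEAD, LOOSE and STRICT are illustrative names. The pattern is split into pieces only for readability:

```python
import re

FRAC = r"[\u00BC-\u00BE\u2150-\u215E]?"  # optional vulgar fraction such as ½

# Shared part: whitespace/start boundary, the two number shapes, and (?!,?\d)
HEAD = (
    r"(?<!\S)"
    r"(?:(?!(?:1[2-9]\d\d|20[01]\d|2020)(?!\d))\d{4,}" + FRAC +
    r"|\d{1,3}(?:,\d{3})+)"
    r"(?!,?\d)"
)

# Variant from the answer: optional fraction, then (?![x/])
LOOSE = re.compile(HEAD + FRAC + r"(?![x/])")
# Variant from the IMPORTANT note: check for x or / before consuming the fraction
STRICT = re.compile(HEAD + r"(?!" + FRAC + r"[x/])" + FRAC)

text = "here is 100,100, and 23,456½ but not 4,310,747,475x2 or 1999, yet 2020222"
loose_hits = LOOSE.findall(text)
print(loose_hits)  # ['100,100', '23,456½', '2020222']

# The partial match described in the note, and its removal in the second variant
print(LOOSE.findall("65,656½x"))   # ['65,656']
print(STRICT.findall("65,656½x"))  # []
```

A trailing comma no longer blocks the match, numbers inside an equation and bare years are skipped, and 2020222 is matched because the year check now requires that no digit follows.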

Find first ReGex pattern following a different pattern

Objective: find a second pattern and consider it a match only if it is the first time the pattern was seen following a different pattern.
Background:
I am using Python-2.7 Regex
I have a specific Regex match that I am having trouble with. I am trying to get the text between the square brackets in the following sample.
Sample comments:
[98 g/m2 Ctrl (No IP) 95 min 340oC ]
[ ]
I need the line:
98 g/m2 Ctrl (No IP) 95 min 340oC
The problem is that the undetermined number of white-spaces, tabs, and new-lines between the search pattern Sample comments: and the match I want is giving me trouble.
Best Attempt:
I am able to match the first part easily,
match = re.findall(r'Sample comments:[.+\n+]+', string)
But I can't get the match to the length I want to grab the portion between the square brackets,
match = re.findall(r'Sample comments:[.+\n+]+\[(.+)\]', string)
My Thinking:
Is there a way to use ReGex to find the first instance of the pattern \[(.+)\] after a match of the pattern Sample comments:? Or is there a more robust way to find the bit between the square braces in my example case.
Thanks,
Michael
I suggest using
r'Sample comments:\s*\[(.*?)\s*]'
See the regex and IDEONE demo
The point is the \s* matches zero or more whitespace, both vertical (linebreaks) and horizontal. See Python re reference:
\s
When the UNICODE flag is not specified, it matches any whitespace character, this is equivalent to the set [ \t\n\r\f\v]. The LOCALE flag has no extra effect on matching of the space. If UNICODE is set, this will match the characters [ \t\n\r\f\v] plus whatever is classified as space in the Unicode character properties database.
Pattern details:
Sample comments: - a sequence of literal chars
\s* - 0 or more whitespaces
\[ - a literal [
(.*?) - Group 1 (returned by re.findall) capturing 0+ any chars but a newline as few as possible up to the first...
\s* - 0+ whitespaces and
] - a literal ] (note it does not have to be escaped outside the character class).
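A small sketch of the suggested pattern on text shaped like the sample in the question. The exact mix of newlines, spaces and tabs in the test string is an assumption, and the code is written so that it should also run on Python 2.7:

```python
import re

# Arbitrary whitespace (newlines, spaces, a tab) between the label and the bracket
text = "Sample comments:\n\n \t [98 g/m2 Ctrl (No IP) 95 min 340oC ]\n[ ]\n"

# \s* skips the whitespace; (.*?) captures lazily up to the first "]",
# and the trailing \s* keeps the space before "]" out of the group.
matches = re.findall(r"Sample comments:\s*\[(.*?)\s*]", text)
print(matches)
```

Only the first bracketed block after the label is returned, because the pattern is anchored on the literal Sample comments: text.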
Not sure if I understand your problem correctly, but re.findall('Sample comments:[^\\[]*\\[([^\\]]*)\\]', string) seems to work.
Or maybe re.findall('Sample comments:[^\\[]*\\[[ \t]*([^\\]]*?)[ \t]*\\]', string) if you want to strip the final spaces from your line?

Pattern for '.' separated words with arbitrary number of whitespaces

It's the first time that I'm using regular expressions in Python and I just can't get it to work.
Here is what I want to achieve: I want to find all strings, where there is a word followed by a dot followed by another word. After that an unknown number of whitespaces followed by either (off) or (on). For example:
word1.word2 (off)
Here is what I have come up with so far.
string_group = re.search(r'\w+\.\w+\s+[(\(on\))(\(off\))]', analyzed_string)
\w+ for the first word
\. for the dot
\w+ for the second word
\s+ for the whitespaces
[(\(on\))(\(off\))] for the (off) or (on)
I think that the last expression might not be doing what I need it to. With the implementation right now, the program does find the right place in the string, but the output of
string_group.group(0)
Is just
word1.word2 (
instead of the whole expression I'm looking for. Could you please give me a hint what I am doing wrong?
[ ... ] is used for character class, and will match any one character inside them unless you put a quantifier: [ ... ]+ for one or more time.
But simply adding that won't work...
\w+\.\w+\s+[(\(on\))(\(off\))]+
Will match garbage stuff like word1.word2 )(fno(nofn too, so you actually don't want to use a character class, because it'll match the characters in any order. What you can use is a capturing group, and a non-capturing group along with an OR operator |:
\w+\.\w+\s+(\((?:on|off)\))
(?:on|off) will match either on or off
Now, if you don't want the parentheses to be captured in the first group too, you can change that to:
\w+\.\w+\s+\((on|off)\)
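The three behaviours described above can be compared side by side in a short sketch (variable names are illustrative; the input strings come from the question and this answer):

```python
import re

# Character class: matches exactly one of the characters ( ) o n f
char_class = re.search(r"\w+\.\w+\s+[(\(on\))(\(off\))]", "word1.word2 (off)")
print(char_class.group(0))  # word1.word2 (

# Character class with +: accepts those characters in any order
garbage = re.search(r"\w+\.\w+\s+[(\(on\))(\(off\))]+", "word1.word2 )(fno(nofn")
print(garbage.group(0))  # word1.word2 )(fno(nofn

# Alternation inside literal parentheses: matches only (on) or (off)
fixed = re.search(r"\w+\.\w+\s+\((on|off)\)", "word1.word2    (off)")
print(fixed.group(0))  # word1.word2    (off)
print(fixed.group(1))  # off
```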
You've got your logical OR mixed up.
[(\(on\))(\(off\))]
should be
\((?:on|off)\)
[]s are just for matching single characters.
The square brackets are a character class, which matches any one of the characters in the brackets. You appear to be trying to use it to match one of the sub-regexes (\(one\)) and (\(two\)). The way to do that is with an alternation operation, the pipe symbol: (\(one\)|\(two\)).
I think your problem may be with the square brackets []
they indicate a set of single characters to match. So your expression would match a single instance of any of the following chars: "()ofn"
So for the string "word1.word2 (on)", you are matching only this part: "word1.word2 ("
Try using this one instead:
re.search(r'\w+\.\w+\s+\((on|off)\)', analyzed_string)
This match assumes that the () will be there, and looks for either "on" or "off" inside the parenthesis.

Python reference to regex in parentheses

I have a text file that needs to have the letter 't' removed if it is not immediately preceded by a number.
I am trying to do this using re.sub and I have this:
f=open('File.txt').read()
g=f
g=re.sub('([^0-9])t','',g)
This identifies the letters to be removed correctly but also removes the preceding character. How can I refer to the parenthesized regex in the replacement String?
Thanks!
Use a lookbehind (or negative lookbehind) instead.
g=re.sub('(?<=[^0-9])t','',g)
or
g=re.sub('(?<![0-9])t','',g)
Three options:
g=re.sub('([^0-9])t','\\1',g)
or
g=re.sub('(?<=[^0-9])t','',g)
or
g=re.sub('(?<![0-9])t','',g)
The first option is what you are looking for, a backreference to the captured string. \\1 will refer to the first captured group.
Lookarounds don't consume characters, so you don't need to replace them back. Here, I have used a positive lookbehind for the first one and a negative lookbehind for the second one. Those don't consume the characters within their brackets, so you are not taking the [^0-9] or [0-9] in the replacement. It might be better to use those since it prevents overlapping matches.
The positive lookbehind makes sure that t has a non-digit character before it. The negative lookbehind makes sure that t does not have a digit character before it.
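The options differ in a few edge cases, which the following sketch tries to make visible. The sample strings are invented for illustration, and raw strings are used instead of the doubled backslash:

```python
import re

s = "test 2t bat"

# Original attempt: the character before "t" is removed as well
broken = re.sub(r"([^0-9])t", "", s)
# Backreference: the captured character is put back
backref = re.sub(r"([^0-9])t", r"\1", s)
# Positive lookbehind: needs some non-digit before "t", so a leading "t" stays
positive = re.sub(r"(?<=[^0-9])t", "", s)
# Negative lookbehind: only forbids a digit before "t", so a leading "t" goes too
negative = re.sub(r"(?<![0-9])t", "", s)

print(broken)    # te 2t b
print(backref)   # tes 2t ba
print(positive)  # tes 2t ba
print(negative)  # es 2t ba

# Overlap: with the backreference the first "t" is consumed as part of the
# match, so the second "t" of "att" is never seen as "preceded by a non-digit".
print(re.sub(r"([^0-9])t", r"\1", "att"))  # at
print(re.sub(r"(?<![0-9])t", "", "att"))   # a
```

Which of the lookbehind forms is appropriate depends on whether a t at the very start of the text should be removed.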
