Regular expression - How to eliminate certain pattern in python - python

I have some articles containing match scores like 13-9, 34-12, 22-10, which I want to extract using a regular expression in Python. re.compile(r'[0-9]+-[0-9]') works, but how can I modify it to exclude values like 1999-06 and 2020-01? I tried re.compile(r'[0-9]{1,2}-[0-9]'), but those year values come back as 99-06, which is also invalid in my case.

You can match the exact number of digits required and use a lookbehind assertion so that long numbers are not sliced, like below:
(?<!\d)\d{2}-\d{1,2}

You can avoid matching in the middle of a number with
r'(?<!\d)[0-9]{1,2}-[0-9]'
The negative lookbehind prohibits matching immediately after another digit.
Perhaps also add
(?!\d)
at the end to impose a similar restriction at the end of the match.
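A minimal sketch of the combined lookbehind and lookahead, assuming scores have at most two digits on each side. The {1,2} on the right-hand side, the find_scores helper and the sample sentence are illustrative assumptions, not part of the original answers:

```python
import re

# (?<!\d) stops a match from starting inside a longer number;
# (?!\d) stops it from ending inside one.
SCORE = re.compile(r'(?<!\d)[0-9]{1,2}-[0-9]{1,2}(?!\d)')

def find_scores(text):
    """Return every score-like token such as '13-9' found in text."""
    return SCORE.findall(text)

sample = "Final scores were 13-9, 34-12 and 22-10 in the 1999-06 and 2020-01 seasons."
print(find_scores(sample))  # ['13-9', '34-12', '22-10']
```

Year-month values are skipped because every candidate start inside 1999 or 2020 is either preceded by a digit or not followed directly by a hyphen.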

Related

Regex Statement to only match parts of a string for comparison - Python

What I am trying to do is match values from one file to another, but I only need to match the first portion of the string and the last portion.
I am reading each file into a list, and manipulating these based on different Regex patterns I have created. Everything works, except when it comes to these types of values:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
In this example, I only want to match 'V-1\ZDS\R\EMBO-20' and then compare the '24' value at the end of the string. The number x in '20-x:', can vary and doesn't matter in terms of comparisons, as long as the first and last parts of this string match.
This is the Regex I am using:
re.compile(r"(?:.*V-1\\ZDS\\R\\EMBO-20-\d.*)(:\d*\w.*)")
Once I filter down the list, I use the following function to return the difference between the two sets:
funcDiff = lambda x, y: list((set(x)- set(y))) + list((set(y)- set(x)))
Is there a way to take the list of differences and filter out the ones that have matching values after the : as mentioned above?
I apologize if this is an obvious answer, I'm new to Python and Regex!
The output I get is the differences between the entire strings, so even if the first and last part of the string match, if the number following the 'EMBO-20-x' doesn't also match, it returns it as being different.
Before discussing your question, regex101 is an incredibly useful tool for this type of thing.
Your problem stems from two issues:
1.) The way you used .*
2.) Greedy vs. Nongreedy matches
.* kinda sucks
.* is a regex expression that is very rarely what you actually want.
As a quick aside, a useful regex expression is [^c]* or [^c]+. These expressions match any character except the letter c, with the first expression matching 0 or more, and the second matching 1 or more.
.* will match all characters as many times as it can. Instead, try to start your regex patterns with more concrete starting points. Two good ways to do this are lookbehind expressions and anchors.
Another quick aside: it's likely that you are confusing re.match and re.search. match only returns a match that begins at the start of the string, while search returns the first match anywhere in the input string. This could be the reason you included the .* in the first place, to allow a .match call to find a match deeper in the string.
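A small sketch of that difference using Python's re.match and re.search. The sample string and variable names are made up for illustration:

```python
import re

text = "testfoo"

# re.match only tries the pattern at the very start of the string.
at_start = re.match(r'foo', text)

# re.search scans forward and returns the first match anywhere.
anywhere = re.search(r'foo', text)

# A leading .* lets re.match succeed, but the match then includes the prefix.
padded = re.match(r'.*foo', text)

print(at_start)          # None
print(anywhere.span())   # (4, 7)
print(padded.group())    # testfoo
```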
Lookbehind Expressions
There are more complete explanations online, but in short, regex patterns like:
(?<=test)foo
will match the text foo, but only if test is right in front of it. To be more clear, the following strings will not match that regex:
foo
test-foo
test foo
but the following string will match:
testfoo
This will only match the text foo, though.
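The four strings above can be checked with a short sketch. The LOOKBEHIND name and the results dictionary are illustrative:

```python
import re

# (?<=test) asserts that "test" sits immediately before "foo"
# without making it part of the match.
LOOKBEHIND = re.compile(r'(?<=test)foo')

candidates = ['foo', 'test-foo', 'test foo', 'testfoo']
results = {}
for candidate in candidates:
    m = LOOKBEHIND.search(candidate)
    results[candidate] = m.group() if m else None

print(results)
# {'foo': None, 'test-foo': None, 'test foo': None, 'testfoo': 'foo'}
```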
Anchors
Another option is anchors. ^ and $ are special characters that match the start and end of the string (or of each line when the re.MULTILINE flag is set). If you know your regex pattern will match exactly one line of text, start it with ^ and end it with $.
Leading patterns with .* and ending with .* are likely the source of your issue. Although you did not include full examples of your input or your code, you likely used match as opposed to search.
In regex, . matches any character except a newline, and * means 0 or more times. This means that a pattern wrapped in .* stretches to cover the entire line, whatever surrounds the part you care about.
Greedy vs. Non-Greedy quantifiers
The second issue is related to greediness. When your regex patterns have a * in them, they can match 0 or more characters. This can hide problems, as entire * expressions can be skipped. Your regex is likely matching more text than you intended, hiding multiple records in a single .*.
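A quick sketch of greedy versus non-greedy matching, using a made-up record string. Adding ? after the * makes it match as little as possible:

```python
import re

record = "a:1 b:2"

# Greedy: .* runs to the end of the string, then backs off to the LAST colon.
greedy = re.match(r'.*:', record).group()

# Non-greedy: .*? stops at the FIRST colon it can.
lazy = re.match(r'.*?:', record).group()

print(repr(greedy))  # 'a:1 b:'
print(repr(lazy))    # 'a:'
```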
The Actual Answer
Taking all of this in to consideration, let's assume that your input data looks like this:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
V-1\ZDS\R\EMBO-20-3:93
V-1\ZDS\R\EMBO-20-6:22309
V-1\ZDS\R\EMBO-20-8:2238
V-1\ZDS\R\EMBO-20-3:28
A better regular expression would be:
^V-1\\ZDS\\R\\EMBO-20-\d:(\d+)$
There are several differences I would like to highlight:
Starting the expression with ^ and ending with $. This forces the regex to match exactly one line. Even though the pattern works without these characters, it's good practice when working with regex to be as explicit as possible.
No useless non-capturing group. Your example had a (?:) group at the start. This denotes a group that does not capture its match. It's useful if you want to match a subpattern multiple times ((?:ab){5} matches ababababab without capturing anything). However, in your example, it did nothing :)
Only capturing the number. This makes it easier to extract the value of the capture groups.
No use of *, one use of +. + works like *, but it matches 1 or more. This is often more correct, as it prevents 'skipping' entire characters.
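One possible way to apply this to the comparison itself is to reduce each string to a key made of its prefix and trailing number, then diff the keys instead of the raw strings. This is only a sketch: the comparison_key and diff_ignoring_middle helpers are hypothetical, the prefix is captured in an extra group, and \d+ is used for the middle number on the assumption that it may have more than one digit:

```python
import re

# Group 1: the fixed prefix. Group 2: the number after the colon.
# The number between "EMBO-20-" and ":" is matched but not captured.
PATTERN = re.compile(r'^(V-1\\ZDS\\R\\EMBO-20)-\d+:(\d+)$')

def comparison_key(value):
    """Reduce a value to (prefix, trailing number); leave other values whole."""
    m = PATTERN.match(value)
    return m.groups() if m else (value,)

def diff_ignoring_middle(xs, ys):
    """Symmetric difference of two lists, ignoring the middle number."""
    keys_x = {comparison_key(x) for x in xs}
    keys_y = {comparison_key(y) for y in ys}
    return sorted(keys_x ^ keys_y)

file_a = [r'V-1\ZDS\R\EMBO-20-1:24', r'V-1\ZDS\R\EMBO-20-3:93']
file_b = [r'V-1\ZDS\R\EMBO-20-6:24', r'V-1\ZDS\R\EMBO-20-3:28']
print(diff_ignoring_middle(file_a, file_b))
```

Here the two :24 entries compare as equal even though one has 20-1 and the other 20-6, so only the :28 and :93 keys are reported.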

Regex: Another way to match structure separated by commas

I want to know if a string is a collection of, by example, numbers ([0-9]).
In this case, I'm using the regular expression [0-9](,[0-9])* to find one or more numbers separated by commas (a collection of numbers).
Is there a better way to do it? I mean a shorter expression perhaps.
I would suggest the following pattern:
(?<=^|,|\s)(\d+)
(?<=...) is a lookbehind assertion that will not be captured into the groups nor be included into the matched string. It is used to identify the starting position of the number to be matched.
You can try the above pattern interactively on the following website:
https://regex101.com/r/IKGWtA/1
\d*(,\d*)* will catch the situation where you have multiple digits before and after a comma, e.g. 100,000. Your original [0-9](,[0-9])* will only grab 0,0 around the comma in that same number.
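If the goal is to test whether a whole string is a comma-separated collection of numbers, a sketch in Python could use re.fullmatch. This assumes Python and multi-digit numbers (\d+ rather than \d*, so empty items are rejected); the is_number_list helper is hypothetical:

```python
import re

# One number, then zero or more ",number" repetitions.
NUMBER_LIST = re.compile(r'\d+(?:,\d+)*')

def is_number_list(text):
    """True only if the entire string is digits separated by single commas."""
    return NUMBER_LIST.fullmatch(text) is not None

print(is_number_list("1,2,3"))    # True
print(is_number_list("100,000"))  # True
print(is_number_list("1,,2"))     # False
print(is_number_list("1,2,"))     # False
```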

How to search double digit numbers in python with regular expression?

I have a piece of code that records times in this format:
0.0-8.0
0.0-9.0
0.0-10.0
I want to use a regular expression that will find all of these strings and have checked here and here for help but am still confused. I understand how to do it if I only wanted to do single digit numbers, but I can't figure out how to handle double digit numbers like 10 or 20.
It is also important that the expression does not find the string
0.0-1.0
as it should be ignored.
So far my expression looks like this:
expression = re.compile(r',0\.0\-[0-2][0-9]')
If you want to match each line shown in your question, try an expression like this:
0\.0\-[0-2]?\d\.\d
\d is the same as [0-9]. The ? means 0 or 1 occurrences, so this will only match 1- or 2-digit numbers. If you need the comma at the start of the regex, add that in.
If you want to exclude 0.0-1.0, then you should do that in code, not in the regular expression, since that would make it less readable. But if you insist, I have included one that will exclude that string for you:
0\.0\-[0-2]?[0-9]\.(?<!0-1\.)\d
This uses a negative lookbehind to ensure the previous part is not 0-1., which would only occur in the match you didn't want.
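For comparison, a sketch of the "do it in code" option: match every time range with the simpler pattern, then drop the unwanted string afterwards. The EXCLUDED set and find_ranges helper are illustrative names:

```python
import re

# Simple pattern: 0.0-, an optional leading 0-2, a digit, a dot, a digit.
TIME_RANGE = re.compile(r'0\.0-[0-2]?\d\.\d')

# Values to ignore are handled in plain Python rather than in the regex.
EXCLUDED = {'0.0-1.0'}

def find_ranges(text):
    """Return all matched ranges except the explicitly excluded ones."""
    return [m for m in TIME_RANGE.findall(text) if m not in EXCLUDED]

log = "0.0-8.0\n0.0-9.0\n0.0-10.0\n0.0-1.0"
print(find_ranges(log))  # ['0.0-8.0', '0.0-9.0', '0.0-10.0']
```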

Match a sequence of numbers preceded by certain text

How do I match a sequence of numbers preceded by certain text but not return the text, just the sequence of numbers?
For example, let's assume I have the following string:
url = "sampleurl/485734/abcdefgh/83275/"
I want to match all numbers that come after the word sampleurl. So far, I've been using the following code
re.search("sampleurl/[0-9]+", url).group(0)[9:]
that works, but I'm assuming there is a fancier way of doing that instead of needing to use [9:] at the end.
For a quick reference, I've been using regex101 to check the validation of the regex.
You can place a capturing group around the part you want and refer to that group number for the match result.
re.search(r'sampleurl/(\d+)', url).group(1)
Another way would be implementing a lookaround assertion.
re.search(r'(?<=sampleurl/)\d+', url).group(0)
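Both approaches can be wrapped so that a missing match does not raise an AttributeError. This is a sketch; the two helper names are hypothetical:

```python
import re

url = "sampleurl/485734/abcdefgh/83275/"

def id_with_group(text):
    """Capture the digits in group 1; return None when nothing matches."""
    m = re.search(r'sampleurl/(\d+)', text)
    return m.group(1) if m else None

def id_with_lookbehind(text):
    """Match only the digits; the prefix is asserted but not consumed."""
    m = re.search(r'(?<=sampleurl/)\d+', text)
    return m.group(0) if m else None

print(id_with_group(url))       # 485734
print(id_with_lookbehind(url))  # 485734
```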

Negating match if a string is just before another string

I'm struggling to get a regex to work where it matches a certain pattern, so long as it isn't preceded by another. For example,
Accessory for MyProduct01 <<< Should be classified as an accessory
MyProduct01 with accessory << Should be classified as a product
So I need to add something to my 'accessory' regex, something like 'match "accessory" so long as the word before isn't "with"'.
I have seen some examples where people are using negative lookaheads to find if a word is anywhere in the string, but I want to be a bit more specific regarding the position of the word to negate. Something like:
(?!with\s)accessory
Just use a negative look-behind in your regex:
(?<!with\s)accessory
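A sketch of how this lookbehind could drive the classification in Python. The lookbehind is fixed-width (four letters plus one whitespace character), which Python's re module accepts. The classify helper and the use of re.IGNORECASE, needed so that the capitalized "Accessory" in the first example matches, are assumptions:

```python
import re

# "accessory" counts only when it is not directly preceded by "with ".
ACCESSORY = re.compile(r'(?<!with\s)accessory', re.IGNORECASE)

def classify(title):
    """Label a title as 'accessory' or 'product' based on the pattern."""
    return 'accessory' if ACCESSORY.search(title) else 'product'

print(classify("Accessory for MyProduct01"))   # accessory
print(classify("MyProduct01 with accessory"))  # product
```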
Since Python doesn't support unbounded lookbehinds, I think you are going to have to use a lookahead similar to what you are currently using, but change the original pattern a bit.
^(?!.*\bwith\b.*\baccessory\b)(?=.*\b(accessory)\b)
Here, the negative lookahead, which begins with .* so that it scans the whole string, is used to ensure that "accessory" doesn't come after the word "with". Then, the positive lookahead is used to ensure that the word "accessory" occurs within the string, captured with a group if you need to capture it for some reason.
The pattern above consumes no characters, so both search and match (which anchors at the start of the string) return an empty match when it succeeds. If you want the match to span the entire string, for example when using fullmatch, you'd need to add a bit more to the pattern:
^(?!.*\bwith\b.*\baccessory\b)(?=.*\b(accessory)\b).*$
