How can I match any substring except a particular one in python - python

I want to write a regular expression that will match the following string
a (any substring except 'ABC') ABC
An example for this would be a pqrs h js ABC
The tricky part is matching any substring except 'ABC'. The document I am searching can contain multiple lines with this pattern, and I want to find each of those lines separately, so I can't use the following expression
a.*ABC
because this would give me a single match running from the first a up to the last 'ABC' in the document.
There is this answer which says I can use a negative lookahead, but that is not working in Python. Maybe that is because in my case there is a substring before it; I have not tested that expression on its own because it would not serve my purpose.

Use the non-greedy quantifier, i.e. put a ? after the *:
^a.*?ABC
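A minimal sketch of how this could be applied to a multi-line document. The sample text and the use of re.MULTILINE (so that ^ anchors at the start of every line) are assumptions added for illustration, not part of the original answer:

```python
import re

# Hypothetical sample document with several candidate lines.
document = "a pqrs h js ABC\nsomething else\na xyz ABC trailing ABC\n"

# re.MULTILINE lets ^ match at the start of each line; .*? stops at the
# first ABC on the line instead of the last one.
matches = re.findall(r"^a.*?ABC", document, flags=re.MULTILINE)
print(matches)
```

Because . does not match a newline by default, each match stays within its own line.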

Related

python regex: match everything inside brackets including other brackets [duplicate]

In python, I can easily search for the first occurrence of a regex within a string like this:
import re
re.search("pattern", "target_text")
Now I need to find the last occurrence of the regex in a string, which doesn't seem to be supported by the re module.
I can reverse the string to "search for the first occurrence", but I also need to reverse the regex, which is a much harder problem.
I can also iterate to find all occurrences from left to right, and just keep the last one, but that looks awkward.
Is there a smart way to find the rightmost occurrence?
One approach is to prefix the regex with (?s:.*) and force the engine to try matching at the furthest position and gradually backing off:
re.search("(?s:.*)pattern", "target_text")
Do note that the result of this method may differ from re.findall("pattern", "target_text")[-1], since the findall method searches for non-overlapping matches, and not all substrings which can be matched are included in the result.
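For example, executing the regex a.a on abaca, findall would return aba as the only match and select it as the last match, while the code above will find aca as the last occurrence. Note that the full match also includes the text consumed by the (?s:.*) prefix, so wrap pattern in a capturing group if you want to extract only the occurrence itself.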
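The difference between the two approaches can be sketched as follows. The helper name last_occurrence and the group name last are hypothetical, and wrapping the pattern in an extra group shifts the numbers of any groups or backreferences inside it, so treat this as an illustration rather than a drop-in utility:

```python
import re

def last_occurrence(pattern, text):
    """Return the rightmost substring matching `pattern`, or None."""
    # The greedy (?s:.*) prefix consumes as much as possible and backs off
    # until `pattern` matches; the named group isolates the occurrence.
    m = re.search(r"(?s:.*)(?P<last>" + pattern + ")", text)
    return m.group("last") if m else None

via_prefix = last_occurrence("a.a", "abaca")
via_findall = re.findall("a.a", "abaca")[-1]
print(via_prefix, via_findall)
```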
Yet another alternative is to use regex package, which supports REVERSE matching mode.
The result would be more or less the same as the method with (?s:.*) in re package as described above. However, since I haven't tried the package myself, it's not clear how backreference works in REVERSE mode - the pattern might require modification in such cases.
import re
re.search("pattern(?!.*pattern)", "target_text")
or
import re
re.findall("pattern", "target_text")[-1]
You can use these 2 approaches.
If you want positions use
x="abc abc abc"
print([(i.start(), i.end(), i.group()) for i in re.finditer(r"abc", x)][-1])
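If the text is long, the same idea can be sketched without building the full list first. This is only one possible variation, and the variable name last is an assumption:

```python
import re

x = "abc abc abc"

# Walk the matches left to right and keep only the most recent one.
last = None
for last in re.finditer(r"abc", x):
    pass

if last is not None:
    print((last.start(), last.end(), last.group()))
```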
One approach is to use split. For example if you wanted to get the last group after ':' in this sample string:
mystr = 'dafdsaf:ewrewre:cvdsfad:ewrerae'
':'.join(mystr.split(':')[-1:])

Regex Statement to only match parts of a string for comparison - Python

What I am trying to do is match values from one file to another, but I only need to match the first portion of the string and the last portion.
I am reading each file into a list, and manipulating these based on different Regex patterns I have created. Everything works, except when it comes to these type of values:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
In this example, I only want to match 'V-1\ZDS\R\EMBO-20' and then compare the '24' value at the end of the string. The number x in '20-x:', can vary and doesn't matter in terms of comparisons, as long as the first and last parts of this string match.
This is the Regex I am using:
re.compile(r"(?:.*V-1\\ZDS\\R\\EMBO-20-\d.*)(:\d*\w.*)")
Once I filter down the list, I use the following function to return the difference between the two sets:
funcDiff = lambda x, y: list((set(x)- set(y))) + list((set(y)- set(x)))
Is there a way to take the list of differences and filter out the ones that have matching values after the
:
as mentioned above?
I apologize if this is an obvious answer, I'm new to Python and Regex!
The output I get is the differences between the entire strings, so even if the first and last part of the string match, if the number following the 'EMBO-20-x' doesn't also match, it returns it as being different.
Before discussing your question, regex101 is an incredibly useful tool for this type of thing.
Your problem stems from two issues:
1.) The way you used .*
2.) Greedy vs. Nongreedy matches
.* kinda sucks
.* is a regex expression that is very rarely what you actually want.
As a quick aside, a useful regex expression is [^c]* or [^c]+. These expressions match any character except the letter c, with the first expression matching 0 or more, and the second matching 1 or more.
.* will match all characters as many times as it can. Instead, try to start your regex patterns with more concrete starting points. Two good ways to do this are lookbehind expressions and anchors.
Another quick aside, it's likely that you are misusing re.match and re.search. match will only return a match that begins at the start of the string, while search will return a match anywhere in the input string. This could be the reason you included the .* in the first place, to allow a .match call to return a match deeper in the string.
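To illustrate the difference between anchored and unanchored matching in Python's re module (where the unanchored function is re.search), here is a small sketch with a made-up input string:

```python
import re

text = "record: V-1"  # hypothetical input; the value is not at the start

anchored = re.match(r"V-1", text)    # only tries at position 0
anywhere = re.search(r"V-1", text)   # scans the whole string

print(anchored)            # None
print(anywhere.group())    # V-1
```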
Lookbehind Expressions
There are more complete explanations online, but in short, regex patterns like:
(?<=test)foo
will match the text foo, but only if test is right in front of it. To be more clear, the following strings will not match that regex:
foo
test-foo
test foo
but the following string will match:
testfoo
This will only match the text foo, though.
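The four example strings above can be checked with a short sketch (the variable names are illustrative only):

```python
import re

pattern = re.compile(r"(?<=test)foo")
samples = ["foo", "test-foo", "test foo", "testfoo"]

# True only where "foo" is immediately preceded by "test".
results = {s: bool(pattern.search(s)) for s in samples}
print(results)

# The lookbehind is not part of the match: only "foo" is returned.
print(pattern.search("testfoo").group())
```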
Anchors
Another option is anchors. ^ and $ are special characters, matching the start and end of a line of text. If you know your regex pattern will match exactly one line of text, start it with ^ and end it with $.
Starting and ending your pattern with .* is likely the source of your issue. Although you did not include full examples of your input or your code, you likely used match as opposed to search.
In regex, . matches any character (except a newline, by default), and * means 0 or more times. This means that a pattern wrapped in .* on both sides will stretch its match to cover the entire line it appears on.
Greedy vs. Non-Greedy qualifiers
The second issue is related to greediness. When your regex patterns have a * in them, they can match 0 or more characters. This can hide problems, as entire * expressions can be skipped. Your regex is likely matching several lines of text as one match, hiding multiple records in a single .*.
The Actual Answer
Taking all of this into consideration, let's assume that your input data looks like this:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
V-1\ZDS\R\EMBO-20-3:93
V-1\ZDS\R\EMBO-20-6:22309
V-1\ZDS\R\EMBO-20-8:2238
V-1\ZDS\R\EMBO-20-3:28
A better regular expression would be:
^V-1\\ZDS\\R\\EMBO-20-\d:(\d+)$
To visualize this regex in action, follow this link.
There are several differences I would like to highlight:
Starting the expression with ^ and ending with $. This forces the regex to match exactly one line. Even though the pattern works without these characters, it's good practice when working with regex to be as explicit as possible.
No useless non-capturing group. Your example had a (?:) group at the start. This denotes a group that does not capture its match. It's useful if you want to match a subpattern multiple times ((?:ab){5} matches ababababab without capturing anything). However, in your example, it did nothing :)
Only capturing the number. This makes it easier to extract the value of the capture groups.
No use of *, one use of +. + works like *, but it matches 1 or more. This is often more correct, as it prevents 'skipping' entire characters.
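One way the anchored pattern could feed the comparison from the question is to reduce each entry to a (prefix, trailing value) key before taking the set difference. This is a sketch under assumptions: the names KEY_PATTERN, comparison_key and diff_by_key are hypothetical, an extra capture group is added around the prefix, and \d+ is used for the middle number so that multi-digit values are tolerated:

```python
import re

# Variation on the pattern above: capture the prefix and the trailing number,
# and ignore the number between "EMBO-20-" and ":".
KEY_PATTERN = re.compile(r"^(V-1\\ZDS\\R\\EMBO-20)-\d+:(\d+)$")

def comparison_key(entry):
    """Reduce an entry to the parts that matter for comparison."""
    m = KEY_PATTERN.match(entry)
    # Entries that do not fit the pattern are compared as whole strings.
    return (m.group(1), m.group(2)) if m else (entry,)

def diff_by_key(xs, ys):
    """Symmetric difference of two lists, compared by key."""
    return {comparison_key(x) for x in xs} ^ {comparison_key(y) for y in ys}

a = [r"V-1\ZDS\R\EMBO-20-1:24", r"V-1\ZDS\R\EMBO-20-3:93"]
b = [r"V-1\ZDS\R\EMBO-20-6:24", r"V-1\ZDS\R\EMBO-20-3:28"]
print(diff_by_key(a, b))
```

Here the two ":24" entries are treated as equal even though the middle number differs, so only the ":93" and ":28" keys are reported.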

Beautiful soup if class not like "string" or regex

I know that beautiful soup has a function to match classes based on regex that contains certain strings, based on a post here. Below is a code example from that post:
regex = re.compile('.*listing-col-.*')
for EachPart in soup.find_all("div", {"class" : regex}):
    print(EachPart.get_text())
Now, is it possible to do the opposite? Basically, find classes that do not contain a certain regex. In SQL language, it's like:
where class not like '%test%'
Thanks in advance!
This actually can be done by using Negative Lookahead
Negative Lookahead has the following syntax (?!«pattern») and matches if pattern does not match what comes after the current location in the input string.
In your case, you could use the following regex to match all classes that don’t contain listing-col- in their name:
regex = re.compile('^((?!listing-col-).)*$')
Here’s the pretty simple and straightforward explanation of this regex ^((?!listing-col-).)*$:
^ asserts position at start of a line
Capturing Group ((?!listing-col-).)*
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed
Negative Lookahead (?!listing-col-).
Assert that the Regex below does not match.
listing-col- matches the characters listing-col- literally (case sensitive)
. matches any character
$ asserts position at the end of a line
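The breakdown above can be checked against plain strings before using the pattern with Beautiful Soup. The class names below are made up for illustration:

```python
import re

regex = re.compile(r'^((?!listing-col-).)*$')

# Hypothetical class attribute values.
classes = ["listing-col-left", "header", "main listing-col-2", "footer-nav"]

# Keep only the values that do not contain "listing-col-" anywhere.
kept = [c for c in classes if regex.match(c)]
print(kept)
```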
Also, you may find the https://regex101.com site useful
It will help you test your patterns and show you a detailed explanation of each step. It's your best friend in writing regular expressions.
One possible solution is utilizing regex directly.
You can refer to Regular expression to match a line that doesn't contain a word.
Or you can introduce a function to implement the logic and pass it to find_all as a parameter.
You can refer to https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=find_all#find-all
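A sketch of the function-based approach. The helper name has_no_listing_col is hypothetical, and the sketch assumes an HTML parser in which Beautiful Soup exposes a tag's class attribute as a list of strings (or None when the attribute is missing):

```python
def has_no_listing_col(classes):
    """True if none of the class names contains 'listing-col-'."""
    return not any("listing-col-" in name for name in (classes or []))

# Assumed usage with an existing BeautifulSoup object named `soup`:
#   divs = soup.find_all(
#       lambda tag: tag.name == "div" and has_no_listing_col(tag.get("class"))
#   )

print(has_no_listing_col(["header"]))
print(has_no_listing_col(["main", "listing-col-2"]))
```

Passing a function that receives the whole tag avoids per-class matching, where a tag with several classes could pass the filter because just one of its classes lacks the substring.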
You can use css selector syntax with :not() pseudo class and * contains operator
data = [i.get_text() for i in soup.select('div[class]:not([class*="listing-col-"])')]

Python Regex: Match paragraph numbers

I am attempting to match paragraph numbers inside my block of text. Given the following sentence:
Refer to paragraph C.2.1a.5 for examples.
I would like to match the word C.2.1a.5.
My current code looks like this:
([0-9a-zA-Z]{1,2}\.)
Only matches C.2.1a. and es., which is not what I want. Is there a way to match the full C.2.1a.5 and not match es.?
https://regex101.com/r/cO8lqs/13723
I have attempted to use ^ and $, but doing so returns no matches.
You should use following regex to match the paragraph numbers in your text.
\b(?:[0-9a-zA-Z]{1,2}\.)+[0-9a-zA-Z]\b
Try this demo
Here is the explanation,
\b - Matches a word boundary hence avoiding matching partially in a large word like examples.
(?:[0-9a-zA-Z]{1,2}\.)+ - This matches an alphanumeric text with length one or two as you tried to match in your own regex.
[0-9a-zA-Z] - Finally the match ends with one alphanumeric character at the end. In case you want it to match one or two alphanumeric characters at the end too, just add {1,2} after it
\b - Matches a word boundary again to ensure it doesn't match partially in a large word.
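A short sketch comparing the pattern from the question with the one explained above, using the sentence from the question (the variable names are illustrative):

```python
import re

s = "Refer to paragraph C.2.1a.5 for examples."

# Pattern from the question: returns each short chunk followed by a dot.
original = re.findall(r"([0-9a-zA-Z]{1,2}\.)", s)

# Pattern from this answer: word boundaries plus a repeated non-capturing group.
improved = re.findall(r"\b(?:[0-9a-zA-Z]{1,2}\.)+[0-9a-zA-Z]\b", s)

print(original)
print(improved)
```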
EDIT:
As someone pointed out, your text may contain strings like A.A.A.A.A.A. or A.A.A or even 1.2. If you don't want to match these and only want strings that have exactly three dots in them, use the following regex, which is more specific in matching your paragraph numbers.
(?<!\.)\b(?:[0-9a-zA-Z]{1,2}\.){3}[0-9a-zA-Z]\b(?!\.)
This new regex matches only paragraph numbers having exactly three dots, and the negative lookahead and lookbehind ensure it doesn't match part of a larger string like A.A.A.A.A.A
Updated regex demo
Check these python sample codes,
import re
s = 'Refer to paragraph C.2.1a.5 for examples. Refer to paragraph A.A.A.A.A.A.A for examples. Some more A.A.A or like 1.22'
print(re.findall(r'(?<!\.)\b(?:[0-9a-zA-Z]{1,2}\.){3}[0-9a-zA-Z]\b(?!\.)', s))
Output,
['C.2.1a.5']
As for ^ and $: they are the start and end anchors. If you use them in your regex, the match must begin at the start of the line and finish at the end of the line, which is not what you intend here. That is why, as you already saw, using them returns no matches.
If simple version is required, you can use this easy to understand and modify regex ([A-Z]{1}\.[0-9]{1,3}\.[0-9]{1,3}[a-z]{1}\.[0-9]{1,3})
I think we should keep the regex expression simple and readable.
You can use the regex
(?:[a-zA-Z0-9]+\.){3}[a-zA-Z0-9]+
Explanation -
The expression (?:[a-zA-Z0-9]+\.){3} ensures that the group (?:[a-zA-Z0-9]+\.) is repeated 3 times within the word. The group contains one or more alphanumeric characters followed by a dot.
The word then ends with one or more alphanumeric characters.
Output:
['C.2.1a.5']

Regexp - find a value between a part of the string and a second part of the string OR end of line

I've looked through many regexp examples here, but still fail to find a solution.
I have to check a request string for a certain substring in it. The substring in question will have something before it and might have something after it:
?something=xxx&to_dep=YYY&from_dep=zzz&...
OR
?something=xxx&to_dep=YYY
I need to extract YYY without a & in first case and simply YYY in the second case.
For now I use this kind of regexp:
re.search('to_dep=(.+?)&', req.query_string)
but it works only in the first case and can't be used if I want to re.sub it (replace YYY with something else; the & gets replaced too).
Any help?
Just try with:
[?&]to_dep=([^&]*)
[^&]* matches any run of characters that are not &, so it stops either at the next & (first case) or at the end of the string (second case).
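A minimal sketch of this pattern applied to both query strings from the question. The variable names are assumptions, and the value is read from capture group 1:

```python
import re

queries = [
    "?something=xxx&to_dep=YYY&from_dep=zzz",
    "?something=xxx&to_dep=YYY",
]

pattern = re.compile(r"[?&]to_dep=([^&]*)")

# group(1) holds only the value, without any trailing "&".
values = [pattern.search(q).group(1) for q in queries]
print(values)
```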
For both, you might use a positive lookbehind and a negated class:
re.search(r'(?<=to_dep=)[^&]+', req.query_string)
And this will give you only YYY, which then means you can also use it in re.sub:
re.sub(r'(?<=to_dep=)[^&]+', 'new_value', req.query_string)
[^&] matches any character except &.
(?<=to_dep=) makes sure there's a to_dep= before the part to match.
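The lookbehind version can be sketched on the same two query strings. Plain strings stand in for req.query_string here, which is an assumption for the sake of a self-contained example:

```python
import re

queries = [
    "?something=xxx&to_dep=YYY&from_dep=zzz",
    "?something=xxx&to_dep=YYY",
]

pattern = re.compile(r"(?<=to_dep=)[^&]+")

# The match is only the value, so sub() leaves "to_dep=" and "&" untouched.
found = [pattern.search(q).group() for q in queries]
replaced = [pattern.sub("new_value", q) for q in queries]
print(found)
print(replaced)
```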
