regex expression to exclude lines based on beginning or ending patterns - python

I am searching a file for lines that do not match any of three possible regex patterns in Python. If I were to search for each individually, the patterns would be:
pattern1 = '_[AB]_[0-9]+$'
pattern2 = 'uce.+'
pattern3 = 'ENSOFAS.+'
Pattern2 and pattern3 occur near the beginning of the line (these lines technically start with >). Pattern1 occurs at the end of the string.
I've seen ways of combining pattern2 and pattern3 with something like ^>(?:(?!uce|ENSOFAS).+$) (I'm not sure if this is formatted correctly). How can I also include pattern1 in a single regex search? The reason I'm doing this is to skip over lines that match any one of these patterns.

In essence, you are combining three smaller regexes into one, telling the matcher that any one of them may match. The general method for this is the alternation operator (|), as TallChuck has commented. So, in keeping with his example and your variables, I might do this:
pattern1 = '_[AB]_[0-9]+$'
pattern2 = '^>uce.+'
pattern3 = '^>ENSOFAS.+'
re_pattern = '(?:{}|{}|{})'.format(pattern1, pattern2, pattern3)
your_re = re.compile( re_pattern )
There's no reason you cannot include the beginning-of-line anchor ^ in each subpattern, so I've done that. Meanwhile, your example used the non-capturing grouping operator, (?:...), so I've mimicked that here as well.
The above is exactly the same as if you had put it together all at once:
your_re = re.compile('(?:_[AB]_[0-9]+$|^>uce.+|^>ENSOFAS.+)')
Take your pick as to which is more readable and maintainable by you or your team.
Finally, note that it may be more efficient to pull out the beginning-of-line anchor (^), as the last paragraph of your question suggested, or the regex engine may be smart enough to do that on its own. I suggest getting it working first, then optimizing if you need to.
Another option is to match all three at the beginning of the line by simply adding the "match anything" operator (.*) to the first pattern:
^(?:.*_[AB]_[0-9]+$|>uce.+|>ENSOFAS.+)
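As a rough sketch of how the combined pattern could be used to skip lines, the snippet below filters a small list. The sample lines and the names skip_re and kept are invented for illustration; only the pattern comes from the answer above.

```python
import re

# Combined pattern from the answer: a line is skipped if it matches any alternative.
skip_re = re.compile(r'^(?:.*_[AB]_[0-9]+$|>uce.+|>ENSOFAS.+)')

# Hypothetical sample lines, made up for this example.
lines = [
    ">uce-1234_sample",
    ">ENSOFAS000123",
    ">contig_A_42",
    ">keep_this_one",
    "ACGTACGT",
]

# Keep only the lines that match none of the three sub-patterns.
kept = [line for line in lines if not skip_re.search(line)]
print(kept)
```

When iterating over a real file, each line usually ends with a newline; $ still matches just before a trailing newline, so the same filter should work without stripping.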

Related

Regex to match only part of certain line

I have some config file from which I need to extract only some values. For example, I have this:
PART
{
title = Some Title
description = Some description here. // this 2 params are needed
tags = qwe rty // don't need this param
...
}
I need to extract the value of a certain param, for example description's value. How do I do this in Python 3 with a regex?
Here is the regex, assuming that the file text is in txt:
import re
m = re.search(r'^\s*description\s*=\s*(.*?)(?=(//)|$)', txt, re.M)
print(m.group(1))
Let me explain.
^ matches at beginning of line.
Then \s* means zero or more spaces (or tabs)
description is your anchor for finding the value part.
After that we expect = sign with optional spaces before or after by denoting \s*=\s*.
Then we capture everything after the = and optional spaces with (.*?). The parentheses capture this part. Inside them we say: match anything (the dot) any number of times (the asterisk) in a non-greedy manner (the question mark), that is, stop as soon as the following expression can match.
The following expression is a lookahead, starting with (?=. It asserts that what comes next matches its contents, without consuming it.
Its contents are two options, separated by the vertical bar |.
The first option, to the left of the bar, is // (the parentheses around it are optional here), that is, the start of the comment, which, I suppose, you don't want to capture.
The second option is $, meaning the end of the line, which will be reached if there is no comment // on the line.
So we look for everything we can after the first = sign, until either we meet a // pattern, or we meet the end of the line. This is the essence of the (?=(//)|$) part.
We also need the re.M flag, to tell the regex engine that we want ^ and $ match the start and end of lines, respectively. Without the flag they match the start and end of the entire string, which isn't what we want in this case.
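To see the pieces above working together, here is a small sketch that wraps the same regex in a helper. The function name get_param is hypothetical, and the .strip() call is an addition: the lazy capture stops right before //, so it would otherwise keep the space in front of the comment.

```python
import re

# Sample text taken from the question (without the "..." line).
txt = """PART
{
    title = Some Title
    description = Some description here. // this 2 params are needed
    tags = qwe rty // don't need this param
}
"""

def get_param(name, text):
    """Return the value of `name`, without a trailing // comment, or None."""
    pattern = r'^\s*' + re.escape(name) + r'\s*=\s*(.*?)(?=//|$)'
    m = re.search(pattern, text, re.M)
    # strip() removes the space left between the value and the comment.
    return m.group(1).strip() if m else None

description = get_param("description", txt)
title = get_param("title", txt)
missing = get_param("author", txt)
print(description, title, missing, sep="\n")
```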
The better approach would be to use an established configuration file system. Python has built-in support for INI-like files in the configparser module.
However, if you just desperately need to get the string of text in that file after the description, you could do this:
def get_value_for_key(key, file):
    with open(file) as f:
        lines = f.readlines()
    for line in lines:
        line = line.lstrip()
        if line.startswith(key + " ="):
            return line.split("=", 1)[1].lstrip()
You can use it with a call like: get_value_for_key("description", "myfile.txt"). The function returns None if nothing is found. It assumes the file is formatted with a space and an equals sign after the key name, e.g. key = value.
This avoids regular expressions altogether and preserves any whitespace on the right side of the value. (If that's not important to you, you can use strip instead of lstrip.)
Why avoid regular expressions? They're expensive and really not ideal for this scenario. Use simple string matching. This avoids importing a module and simplifies your code. But really I'd say to convert to a supported configuration file format.
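For comparison, this is roughly what the configparser route could look like, assuming the file were rewritten as an INI file. The [PART] section layout is a hypothetical conversion of the question's block, not the asker's actual format, and treating // as an inline comment marker is a choice made for this sketch.

```python
import configparser

# Hypothetical INI version of the PART block from the question.
ini_text = """
[PART]
title = Some Title
description = Some description here. // this 2 params are needed
tags = qwe rty
"""

# Inline comments are off by default; here "//" is enabled as a prefix.
parser = configparser.ConfigParser(inline_comment_prefixes=("//",))
parser.read_string(ini_text)

description = parser["PART"]["description"]
title = parser.get("PART", "title")
print(description)
print(title)
```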
This is a pretty simple regex: you just need a positive lookbehind, as in the pattern below. To also drop a trailing comment, appending ?(//)? is not enough, because the lazy .*? can then match nothing; instead replace .* with .*?(?=\s*//|$) and pass re.M.
r"(?<=description = ).*"

NOT operator for regex

Using a Python script, I am cleaning a piece of text where I want to replace the following words:
promocode, promo code, promo, coupon code, code, coupon.
However, I don't want to replace them if they start with a '#'. Thus, #promocode, #promo, #code, #coupon should remain the way they are.
I tried following regex for it:
1. \b(promocode|promo code|promo|coupon code|code|coupon)\b
2. (?<!#)(promocode|promo code|promo|coupon code|code|coupon)
Neither of them works. I am basically looking for something that will allow me to say "Does NOT start with # and" (promocode|promo code|promo|coupon code|code|coupon)
Any suggestions ?
You need to use a negative look-behind:
(?<!#)\b(?:promocode|promo code|promo|coupon code|code|coupon)\b
This (?<!#) will ensure you will only match these words if there is no # before them and \b will ensure you only match whole words. The non-capturing group (?:...) is used just for grouping purposes so as not to repeat \b around each alternative in the list (e.g. \bpromo\b|\bcode\b...). Why use non-capturing group? So that it does not interfere with the Match result. We do not need unnecessary overhead with digging out the values (=groups) we need.
In the following demo, only the first promo is deleted:
import re
p = re.compile(r'(?<!#)\b(?:promocode|promo code|promo|coupon code|code|coupon)\b')
test_str = "promo #promo "
print(p.sub('', test_str))
A couple of words about your regular expressions.
The \b(promocode|promo code|promo|coupon code|code|coupon)\b is good, but it also matches the words when they are preceded by #, because # is a non-word character and so \b still matches between # and the word.
The (?<!#)(promocode|promo code|promo|coupon code|code|coupon) regex is better, but it still does not restrict matches to whole words.
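A quick way to see the difference between the three variants is to run them side by side. The test sentence and the variable names below are made up for this comparison.

```python
import re

words = r'(?:promocode|promo code|promo|coupon code|code|coupon)'
text = "Use promo or #promo, not xpromocode"  # invented test sentence

# Word boundaries only: still matches the word inside "#promo".
boundary_only = re.findall(r'\b' + words + r'\b', text)
# Lookbehind only: rejects "#promo" but matches inside "xpromocode".
lookbehind_only = re.findall(r'(?<!#)' + words, text)
# Both together: only the standalone, untagged word matches.
combined = re.findall(r'(?<!#)\b' + words + r'\b', text)

print(boundary_only)    # ['promo', 'promo']
print(lookbehind_only)  # ['promo', 'promocode']
print(combined)         # ['promo']
```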

How to perform this regex replacement more effectively in python without repeating the search?

In python, I want to search for a pattern in a given line and surround it with the html tags. I am doing it as follows:
pattern = "(boy|girl)"
line = "I am a boy"
m = re.search(pattern, line)
line = re.sub(pattern, "<strong><u>"+m.group(0)+"</u></strong>", line)
But I feel like I am running the search twice. In other words, I feel like I should be able to accomplish this in one line, but I just don't know the right command in Python yet.
Is there something like "&" from Perl that you can use to do something like:
s/pattern/<tag>&</tag>/;
Use:
line = re.sub(pattern, r'<strong><u>\1</u></strong>', line)
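The \1 is the key part: it is replaced by the text captured by group 1, which here is the whole match because the pattern is a single parenthesized group. (The r prefix is recommended for all RE patterns and replacement strings so that backslash escapes stay literal.)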
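If the pattern has no capture group, two other standard options are the \g<0> reference (the whole match, the closest analogue to Perl's &) and a replacement function. The sample sentence and the helper name wrap are invented for this sketch.

```python
import re

pattern = re.compile(r"boy|girl")  # no capture group needed
line = "I am a boy, she is a girl"  # invented sample sentence

# \g<0> stands for the entire match.
tagged = pattern.sub(r"<strong><u>\g<0></u></strong>", line)

# A function receives each match object and returns the replacement text.
def wrap(match):
    return "<strong><u>" + match.group(0) + "</u></strong>"

tagged_fn = pattern.sub(wrap, line)
print(tagged)
```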

Using regex to find multiple matches on the same line

I need to build a program that can read multiple lines of code, and extract the right information from each line.
Example text:
no matches
one match <'found'>
<'one'> match <found>
<'three'><'matches'><'found'>
For this case, the program should detect <'found'>, <'one'>, <'three'>, <'matches'> and <'found'> as matches because they all have "<" and "'".
However, I cannot work out a system using regex to account for multiple matches on the same line. I was using something like:
re.search('^<.*>$')
But if there are multiple matches on one line, the intermediate "'>" and "<'" are taken as part of the .*, rather than being counted as separate matches. How do I fix this?
This works -
>>> r = re.compile(r"\<\'.*?\'\>")
>>> r.findall(s)
["<'found'>", "<'one'>", "<'three'>", "<'matches'>", "<'found'>"]
Use findall instead of search:
re.findall( r"<'.*?'>", str )
You can use re.findall and match on non > characters inside of the angle brackets:
>>> re.findall('<[^>]*>', "<'three'><'matches'><'found'>")
["<'three'>", "<'matches'>", "<'found'>"]
Non-greedy quantifier '?' as suggested by anubhava is also an option.
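The snippet below compares a greedy pattern with the two fixes on the question's sample text. The negated class here is a slight variation, [^'>], and the quotes are kept in the pattern; as far as I can tell, the plain <[^>]*> form would also pick up <found>, which the question does not count as a match.

```python
import re

# Sample text from the question.
s = """no matches
one match <'found'>
<'one'> match <found>
<'three'><'matches'><'found'>"""

greedy = re.findall(r"<'.*'>", s)        # .* swallows the inner '><' pairs
lazy = re.findall(r"<'.*?'>", s)         # stops at the first closing '>
negated = re.findall(r"<'[^'>]*'>", s)   # cannot run past a quote or bracket

print(greedy)
print(lazy)
print(negated)
```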

Matching new lines

I have the following regexp:
pattern = re.compile(r"HESAID:|SHESAID:")
It's working correctly. I use it to split by multiple delimiters like this:
result = pattern.split(content)
What I want to add is a check so that the split does NOT happen unless HESAID: or SHESAID: is on a line of its own. This is not working:
pattern = re.compile(r"\nHESAID:\n|\nSHESAID:\n")
Please help.
It would be helpful if you elaborated on how exactly it is not working, but I am guessing that the issue is that it does not match consecutive lines of HESAID/SHESAID. You can fix this by using beginning and end of line anchors instead of actually putting \n in your regex:
pattern = re.compile(r'^HESAID:$|^SHESAID:$', re.MULTILINE)
The re.MULTILINE flag is necessary so that ^ and $ match at beginning and end of lines, instead of just the beginning and end of the string.
I would probably rewrite the regex as follows; the ? after the S makes it optional:
pattern = re.compile(r'^S?HESAID:$', re.MULTILINE)
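Here is a small sketch of the split with the anchored pattern. The content string is invented; note that the newlines around each delimiter stay in the pieces, so stripping them afterwards is one possible cleanup.

```python
import re

# Invented sample: two delimiters on their own lines, one inline mention.
content = "HESAID:\nhello\nSHESAID:\nhi, HESAID: is not a delimiter here\n"

pattern = re.compile(r'^S?HESAID:$', re.MULTILINE)
parts = pattern.split(content)

# Drop empty pieces and the newlines left around each delimiter.
cleaned = [p.strip() for p in parts if p.strip()]
print(cleaned)
```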
