Python regex: excluding square brackets and the text inside - python

I am trying to write a regex that excludes square brackets and the text inside them.
My sample text looks like this: 'WordA, WordB, WordC, [WordD]'
I want to match each text item in the string except '[WordD]'. I've tried using a negative lookahead, something like... [A-Z][A-Za-z]+(?!\[[A-Z]+\]) but doing so is still matching the text inside the brackets.
Is negative lookahead the best approach? If so, where am I going wrong?

Rather than a regex, you might consider splitting by commas and then filtering by whether the word starts with [:
output = [word for word in str.split(', ') if word[0] != '[']
If you use a regex, you can match either the beginning of the string, or lookbehind for a space:
re.findall(r'(?:^|(?<= ))[A-Z][A-Za-z]+', str)
Or you could negative lookahead for ] at the end, after a word boundary:
output = re.findall(r'[A-Z][A-Za-z]+\b(?!\])', str)

This can be as simple as
(\w+),
Regex Demo
Retrieve value of Group 1 for desired result.

I'm guessing that maybe you were trying to write some expression similar to:
[A-Z][a-z]*[A-Z](?=,|$)
or,
[A-Z][a-z]+[A-Z](?=,|$)
Test
import re
regex = r"[A-Z][a-z]*[A-Z](?=,|$)"
string = """
WordA, WordB, WordC, [WordD]
WordA, WordB, WordC, [WordD], WordE
"""
print(re.findall(regex, string))
Output
['WordA', 'WordB', 'WordC', 'WordA', 'WordB', 'WordC', 'WordE']
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.

Related

Regular expression with condition

I have been working on the python code to extract document Ids from text documents where IDs can be at the random line in the text using regex.
This document ID is comprised of four letters followed by a hyphen, followed by three numbers and optionally ending in a letter. For example, each of the following is valid document IDs:
ABCD-123
ABCD-123V
XKCD-999
COMP-200
I have tried following regular expression for finding all ids:
re = re.findall(r"([A-Z]{4})(-)([0-9]{3})([A-Z]{0,1})", text.read())
These expressions work correctly but I have a problem when Ids are connected to words like:
XKCD-999James
The regular expression should return XKCD-999 but it is returning XKCD-999J which is incorrect.
What changes should I do in RE to get the correct?
Use a negative lookahead assertion to ignore patterns that have trailing letters:
exp = re.findall(r"([A-Z]{4})(-)([0-9]{3})([A-Z](?![A-Za-z]))?", text.read())
# ^^^^^^^^^^^^^^^^^^^^
As you are using word characters, you can optionally match a char A-Z followed by a word boundary.
\b[A-Z]{4}-[0-9]{3}(?:[A-Z]\b)?
Regex demo
Note that using re.findall will return the captured groups, so if you want to return just the whole match, you can omit the groups.
With the capture groups, the pattern can be:
\b([A-Z]{4})(-)([0-9]{3}(?:[A-Z]\b)?)
Regex demo
How about you use a boundary operation \b ?
[A-Z]{4}-\d{3}(?:[A-Z]\b)?
Regex101 Sample - https://regex101.com/r/DhC5Vd/4
text = "XKCD-999James"
exp = re.findall(r"[A-Z]{4}-\d{3}(?:[A-Z]\b)?", text)
#OUTPUT: ['XKCD-999']

in regex how to match multiple Or conditions but exclude one condition

If I need to match a string "a" with any combination of symbols ##$ before and after it, such as #a#, #a#, $a$ etc, but not a specific pattern #a$. How can I exclude this? Suppose there're too many combinations to manually spell out one-by-one. And it's not negative lookahead or behind cases as seen in other SO answers.
import re
pattern = "[#|#|&]a[#|#|&]"
string = "something#a&others"
re.findall(pattern, string)
Currently the pattern returns results like '#a&' as expected, but also wrongly return on the string to be excluded. The correct pattern should return [] on re.findall(pattern,'#a$')
You can use the character class to list all the possible characters, and use a single negative lookbehind after the match to assert not #a$ directly to the left.
Note that you don't need the | in the character class, as it would match a pipe char and is the same as [#|#&]
[##&$]a[##&$](?<!#a\$)
Regex demo | Python demo
import re
pattern = r"[##&$]a[##&$](?<!#a\$)"
print(re.findall(pattern,'something#a&others#a$'))
Output
['#a&']
I was going to suggest a fairly ugly and complex regex pattern with lookarounds. But instead, you could just proceed with your current pattern and then use a list comprehension to remove the false positive case:
inp = "something#a&others #a$"
matches = re.findall(r'[##&$]+a[##&$]+', inp)
matches = [x for x in matches if x != '#a$']
print(matches) # ['#a&']

Match a whole string variable using python regex

I am new to python.
I want to write a regex to find a whole string using re.match.
For example,
word1=I
word2=U
str_to_match = word1+"[There could be some space in between]"+ word2[this is the end of string I want to match]"
In this case, it should match I[0 or more spaces in between]U.
It shouldn't match abI[0 or more spaces in between]U, OR abI[0 or more spaces in between]Ucd, OR
I[0 or more spaces in between]Ucd.
I know you could use boundary \b to set the word boundary but since there could be a number of spaces in between word1 and word2, the whole string is like a variable, not a fixed string, it won't work for me:
ret=re.match(r"\b"+word1+r"\s+" +word2+r"\b")
not working
Does anyone know how could I find the correct way to match this case?
Thanks!
match only matches at the beginning of the string. Which since you are using \b should not be what you want. findall will find all matching strings.
If you want to find the first occurrence and just test if it exists in the string, and not find all matches you can use search:
word1 = 'I'
word2 = 'U'
ret = re.findall(rf"\b{word1}\s*{word2}\b"," I U IU I U aI U I Ub")
print(ret)
ret = re.search(rf"\b{word1}\s*{word2}\b"," IUb .I U ")
print(ret)

insert space between regex match

I want to un-join typos in my string by locating them using regex and insert a space character between the matched expression.
I tried the solution to a similar question ... but it did not work for me -(Insert space between characters regex); solution- to use the replace string as '\1 \2' in re.sub .
import re
corpus = '''
This is my corpus1a.I am looking to convert it into a 2corpus 2b.
'''
clean = re.compile('\.[^(\d,\s)]')
corpus = re.sub(clean,' ', corpus)
clean2 = re.compile('\d+[^(\d,\s,\.)]')
corpus = re.sub(clean2,'\1 \2', corpus)
EXPECTED OUTPUT:
This is my corpus 1 a. I am looking to convert it into a 2 corpus 2 b.
You need to put the capture group parentheses around the patterns that match each string that you want to copy to the result.
There's also no need to use + after \d. You only need to match the last digit of the number.
clean = re.compile(r'(\d)([^\d,\s])')
corpus = re.sub(clean, r'\1 \2', corpus)
DEMO
I'm not sure about other possible inputs, we might be able to add spaces using an expression similar to:
(\d+)([a-z]+)\b
after that we would replace any two spaces with a single space and it might work, not sure though:
import re
print(re.sub(r"\s{2,}", " ", re.sub(r"(\d+)([a-z]+)\b", " \\1 \\2", "This is my corpus1a.I am looking to convert it into a 2corpus 2b")))
The expression is explained on the top right panel of this demo, if you wish to explore further or modify it, and in this link, you can watch how it would match against some sample inputs step by step, if you like.
Capture groups, marked by parenthesis ( and ), should be around the patterns you want to match.
So this should work for you
clean = re.compile(r'(\d+)([^\d,\s])')
corpus = re.sub(clean,'\1 \2', corpus)
The regex (\d+)([^\d,\s]) reads: match 1 or more digits (\d+) as group 1 (first set of parenthesis), match non-digit and non-whitespace as group 2.
The reason why your's doesn't work was that you did not have parenthesis surrounding the patterns you want to reuse.

Use python 3 regex to match a string in double quotes

I want to match a string contained in a pair of either single or double quotes. I wrote a regex pattern as so:
pattern = r"([\"\'])[^\1]*\1"
mytext = '"bbb"ccc"ddd'
re.match(pattern, mytext).group()
The expected output would be:
"bbb"
However, this is the output:
"bbb"ccc"
Can someone explain what's wrong with the pattern above? I googled and found the correct pattern to be:
pattern = r"([\"\'])[^\1]*?\1"
However, I don't understand why I must use ?.
In your regex
([\"'])[^\1]*\1
Character class is meant for matching only one character. So your use of [^\1] is incorrect. Think, what would have have happened if there were more than one characters in the first capturing group.
You can use negative lookahead like this
(["'])((?!\1).)*\1
or simply with alternation
(["'])(?:[^"'\\]+|\\.)*\1
or
(?<!\\)(["'])(?:[^"'\\]+|\\.)*\1
if you want to make sure "b\"ccc" does not matches in string bb\"b\"ccc"
You should use a negative lookahead assertion. And I assume there won't be any escaped quotes in your input string.
>>> pattern = r"([\"'])(?:(?!\1).)*\1"
>>> mytext = '"bbb"ccc"ddd'
>>> re.search(pattern, mytext).group()
'"bbb"'
You can use:
pattern = r"[\"'][^\"']*[\"']"
https://regex101.com/r/dO0cA8/1
[^\"']* will match everything that isn't " or '

Categories

Resources