Case-insensitivity exclusively in lookbehind / lookahead groups for Python regex [duplicate] - python

This question already has answers here:
How to set ignorecase flag for part of regular expression in Python?
(3 answers)
Closed 3 years ago.
I understand how to make matching case in-sensitive in Python, and I understand how to use lookahead / lookbehinds, but how do I combine the two?
For instance, my text is
mytext = 'I LOVE EATING popsicles at home.'
I want to extract popsicles from this text (my target food item). This regex works great:
import re
regex = r'(?<=I\sLOVE\sEATING\s)[a-z0-9]*(?=\sat\shome)'
re.search(regex, mytext)
However, I'd like to account for the scenario where someone writes
i LOVE eating apples at HOME.
That should match. But "I LOVE eating Apples at home" should NOT match, since Apples is uppercase.
Thus, I'd like to have local case insensitivity in my lookahead (?=\sat\shome) and lookbehind (?<=I\sLOVE\sEATING\s) groups. I know I can use the re.IGNORECASE flag for global case insensitivity, but I just want the lookahead/lookbehind groups to be case insensitive, not my actual target expression.
Traditionally, I'd prepend (?i:I LOVE EATING) to create a case-insensitive non-capturing group that is capable of matching both I LOVE EATING and I love eating. However, if I try to combine the two together:
(?i:<=I\sLOVE\sEATING\s)
I get no matches, since it now interprets the i: as a literal expression to match. Is there a way to combine lookaheads/lookbehinds with local case insensitivity?
Edit: I don't think this is a duplicate of the marked question. That question asks about making part of a regular expression case-insensitive in general; I'm asking about a specific subset, lookaheads and lookbehinds, where the syntax is different. The answers in that other post do not directly apply. As the answers on this post suggest, you need workarounds to achieve this functionality, and those don't apply to the supposed duplicate.

You can set the regex to case-insensitive globally with (?i) and switch a group to case-sensitive with (?-i:groupcontent):
regex = r'(?i)(?<=I\sLOVE\sEATING\s)(?-i:[a-z0-9]*)(?=\sat\shome)'
Instead of (?i), you can also use re.I in the search. The following is equivalent to the regex above:
regex = r'(?<=I\sLOVE\sEATING\s)(?-i:[a-z0-9]*)(?=\sat\shome)'
re.search(regex, mytext, re.I)
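As a quick check of this approach, here is a minimal sketch that runs the pattern against the two sentences from the question. The names PATTERN and extract_food are illustrative only and not part of the original answer; scoped flag groups such as (?-i:...) need Python 3.6 or later.

```python
import re

# Global (?i) makes the lookarounds case-insensitive; (?-i:...) switches
# case sensitivity back on for the target word only (Python 3.6+).
PATTERN = re.compile(r'(?i)(?<=I\sLOVE\sEATING\s)(?-i:[a-z0-9]*)(?=\sat\shome)')

def extract_food(text):
    """Hypothetical helper: return the lowercase food item, or None."""
    m = PATTERN.search(text)
    return m.group() if m else None

print(extract_food('i LOVE eating apples at HOME.'))   # lowercase target
print(extract_food('I LOVE eating Apples at home'))    # capitalised target
```

The first call should find apples; the second should find nothing, because the target group stays case-sensitive.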

Unfortunately, before Python 3.6 the re module did not support scoped inline modifiers such as (?i:...) in the middle of a regex; a global flag like (?i) applies to the whole pattern.
As a workaround, you may use this regex:
import re
reg = re.compile(r'(?<=[Ii]\s[Ll][Oo][Vv][Ee]\s[Ee][Aa][Tt][Ii][Nn][Gg]\s)[a-z0-9]*(?=\s[Aa][Tt]\s[Hh][Oo][Mm][Ee])')
print("Case 1:", reg.findall('I LOVE Eating popsicles at HOME.'))
print("Case 2:", reg.findall('I LOVE EATING popsicles at home.'))
print("Case 3:", reg.findall('I LOVE Eating Popsicles at HOME.'))
Output:
Case 1: ['popsicles']
Case 2: ['popsicles']
Case 3: []
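Writing the [Ii][Ll]... classes by hand is error-prone, so they could be generated instead. The following is only a sketch: any_case is a hypothetical helper name, and it assumes ASCII text, where every letter has exactly one upper-case and one lower-case form (which also keeps the lookbehind fixed-width).

```python
import re

def any_case(literal):
    """Hypothetical helper: turn 'at' into '[Aa][Tt]', whitespace into \\s.

    Assumes ASCII input, so each character maps to exactly one
    single-character regex item.
    """
    parts = []
    for ch in literal:
        if ch.isalpha():
            parts.append('[%s%s]' % (ch.upper(), ch.lower()))
        elif ch.isspace():
            parts.append(r'\s')
        else:
            parts.append(re.escape(ch))
    return ''.join(parts)

# Same shape as the hand-written pattern: only the lookarounds ignore case.
reg = re.compile('(?<=%s)[a-z0-9]*(?=%s)'
                 % (any_case('I LOVE EATING '), any_case(' at home')))

print(reg.findall('I LOVE Eating popsicles at HOME.'))
print(reg.findall('I LOVE Eating Popsicles at HOME.'))
```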

Using (?i:...) you can set a regex flag (in this case i)
locally (inline) for some part of the regex. This syntax requires Python 3.6 or later.
Such a local flag setting is also allowed within a lookbehind or
lookahead, while the rest of the regex stays without any option.
I modified your code so that it compiles the regex once and then
calls it twice, for different strings:
mytext1 = 'i LOVE eating Apples at HOME.'
mytext2 = 'i LOVE eating apples at HOME.'
pat = re.compile(r'(?<=(?i:I\sLOVE\sEATING\s))[a-z0-9]+(?=(?i:\sAT\sHOME))')
m = pat.search(mytext1)
print('1:', m.group() if m else '** Not found **')
m = pat.search(mytext2)
print('2:', m.group() if m else '** Not found **')
It prints:
1: ** Not found **
2: apples
so the match is only for the second source string.

Related

Regex - match until a group of multiple possibilities

I have the following text:
You may have that thing NO you dont BUT maybe yes
I'm trying to write a regex which can match everything until it finds some specific words, "NO" and "BUT" in this example, and if the string has both of the words, then stop at the first one:
You may have that thing NO you dont BUT maybe yes
You may have that thing
You may have that thing you dont BUT maybe yes
You may have that thing you dont
I was trying the regex below, but the problem is that it stops at BUT even when it has NO:
(.*)(?:NO|BUT)
Match example of the above regex, in bold being the full match and in italic being group 1:
You may have that thing NO you dont BUT maybe yes
What i expect:
You may have that thing NO you dont BUT maybe yes
Let us fix your regex pattern
^(.*?)\s*(?:NO|BUT)
Now we can use the above regex pattern with search
>>> import re
>>> s = 'You may have that thing NO you dont BUT maybe yes'
>>> match = re.search(r'^(.*?)\s*(?:NO|BUT)', s)
>>> match.group(1)
'You may have that thing'
Regex details:
^ : Assert position at the start of line
(.*?) : First capturing group
.*? : Matches any character zero or more times but as few times as possible
\s* : Zero or more whitespace characters
(?:NO|BUT) : Non capturing group
NO|BUT : Matches one of NO, BUT
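To check the pattern against both sample inputs from the question, a small sketch could look like this. The helper name text_before_stop_word is made up for illustration, and returning the whole string when neither word occurs is an assumption, since the question does not say what should happen in that case.

```python
import re

# Lazy (.*?) stops at the first NO or BUT, whichever comes earlier.
STOP = re.compile(r'^(.*?)\s*(?:NO|BUT)')

def text_before_stop_word(s):
    """Hypothetical helper: text before the first NO/BUT.

    Assumption: if neither word is present, return the input unchanged.
    """
    m = STOP.search(s)
    return m.group(1) if m else s

print(text_before_stop_word('You may have that thing NO you dont BUT maybe yes'))
print(text_before_stop_word('You may have that thing you dont BUT maybe yes'))
```

Note that the pattern as given matches NO and BUT anywhere, including inside longer words such as NOTHING; wrapping the alternation in \b word boundaries is one possible way to avoid that.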

How can I find the best fuzzy string match?

Python's new regex module supports fuzzy string matching. Sing praises aloud (now).
Per the docs:
The ENHANCEMATCH flag makes fuzzy matching attempt to improve the fit
of the next match that it finds.
The BESTMATCH flag makes fuzzy matching search for the best match
instead of the next match
The ENHANCEMATCH flag is set using (?e) as in
regex.search("(?e)(dog){e<=1}", "cat and dog")[1] returns "dog"
but there's nothing on actually setting the BESTMATCH flag. How's it done?
Documentation on the BESTMATCH flag functionality is partial (but improving). Poke-n-hope shows that BESTMATCH is set using (?b).
>>> import regex
>>> regex.search(r"(?e)(?:hello){e<=4}", "What did you say, oh - hello")[0]
'hat d'
>>> regex.search(r"(?b)(?:hello){e<=4}", "What did you say, oh - hello")[0]
'hello'

Split Sentence on Punctuation or Camel-Case

I have a very long string in Python and I'm trying to break it up into a list of sentences. However, some of these sentences are missing punctuation and spaces between them.
Example
I have 9 sheep in my garageVideo games are super cool.
I can't figure out the regex to separate the two! It's driving me nuts.
There are properly punctuated sentences as well, so I thought I'd make several different regex patterns, each splitting off different styles of combination.
Input
I have 9 sheep in my garageVideo games are super cool. Some peanuts can sing, though they taste a whole lot better than they sound!
Output
['I have 9 sheep in my garage',
'Video games are super cool.',
'Some peanuts can sing, though they taste a whole lot better than they sound!']
Thanks!
Position Split: Use the regex module
I will give you both a "Split" and a "Match All" option. Let's start with "Split".
In many engines you can split at a position defined by a zero-width match. Python's re module supports this only from version 3.7 onward; in earlier versions, re.split ignores empty matches.
In Python, to split on a position, I would use Matthew Barnett's outstanding regex module, whose features far outstrip those of Python's default re engine. That is my default regex engine in Python.
With your input, you can use this regex:
(?V1)(?<=[a-z])(?=[A-Z])|(?<=[.!?]) +(?=[A-Z])
Note that if you had strangely-formatted acronyms such as B. B. C., we would need to tweak this.
Sample Python Code:
import regex
string = "I have 9 sheep in my garageVideo games are super cool. Some peanuts can sing, though they taste a whole lot better than they sound!"
result = regex.split(r"(?V1)(?<=[a-z])(?=[A-Z])|(?<=[.!?]) +(?=[A-Z])", string)
print(result)
Output:
['I have 9 sheep in my garage',
'Video games are super cool.',
'Some peanuts can sing, though they taste a whole lot better than they sound!']
Explanation
(?V1) instructs the engine to use the new behavior, where we can split on zero-width matches.
(?<=[a-z])(?=[A-Z]) matches a position where the lookbehind (?<=[a-z]) can assert that what precedes is a lower-case letter and the lookahead (?=[A-Z]) can assert that what follows is an uppercase letter.
| OR...
(?<=[.!?]) +(?=[A-Z]) matches one or more spaces ( +) where the lookbehind (?<=[.!?]) can assert that what precedes is a dot, bang or question mark, and where the lookahead (?=[A-Z]) can assert that what follows is a capital letter.
Option 2: Use findall (again with the regex module)
Since the "Split" and "Match All" operations are two sides of the same coin, you can do this:
print(regex.findall(r".+?(?:(?<=[.!?])|(?<=[a-z])(?=[A-Z]))",string))
Again, this would not work with re (which would skip the V that starts the second sentence Video).
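For readers who cannot install the third-party regex module: on Python 3.7 and later, the standard library's re.split also splits on empty matches, so the same split pattern (without the (?V1) prefix, which is specific to regex) should work there too. This is a sketch under that version assumption, not part of the original answer.

```python
import re

text = ("I have 9 sheep in my garageVideo games are super cool. "
        "Some peanuts can sing, though they taste a whole lot better "
        "than they sound!")

# Split either at a lowercase/uppercase boundary (zero-width) or on the
# spaces that follow sentence-ending punctuation. Requires Python 3.7+.
sentences = re.split(r"(?<=[a-z])(?=[A-Z])|(?<=[.!?]) +(?=[A-Z])", text)

for sentence in sentences:
    print(sentence)
```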

detect emoticon in a sentence using regex python [duplicate]

This question already has answers here:
Capturing emoticons using regular expression in python
(4 answers)
Closed 9 years ago.
Here is the list of emoticons: http://en.wikipedia.org/wiki/List_of_emoticons
I want to form a regex which checks if any of these emoticons exist in the sentence. For example, "hey there I am good :)" or "I am angry and sad :(" but there are a lot of emoticons in the list on wikipedia so wondering how I can achieve this task.
I am new to regex and Python.
>>> s = "hey there I am good :)"
>>> import re
>>> q = re.findall(":",s)
>>> q
[':']
I see two approaches to your problem:
Either, you can create a regular expression for a "generic smiley" and try to match as many as possible without making it overly complicated and insane. For example, you could say that each smiley has some sort of eyes, a nose (optional), and a mouth.
Or, if you want to match each and every smiley from that list (and none else) you can just take those smileys, escape any regular-expression specific special characters, and build a huge disjunction from those.
Here is some code that should get you started for both approaches:
import re

# approach 1: pattern for "generic smiley" (eyes, optional nose, mouth)
eyes, noses, mouths = r":;8BX=", r"-~'^", r")(/\|DP"
pattern1 = "[%s][%s]?[%s]" % tuple(map(re.escape, [eyes, noses, mouths]))
# approach 2: disjunction of a list of smileys
smileys = """:-) :) :o) :] :3 :c) :> =] 8) =) :} :^)
:D 8-D 8D x-D xD X-D XD =-D =D =-3 =3 B^D""".split()
pattern2 = "|".join(map(re.escape, smileys))
text = "bla bla bla :-/ more text 8^P and another smiley =-D even more text"
print(re.findall(pattern1, text))  # [':-/', '8^P', '=-D']
Both approaches have pros, cons, and some general limitations. You will always have false positives, like in a mathematical term like 18^P. It might help to put spaces around the expression, but then you can't match smileys followed by punctuation. The first approach is more powerful and catches smileys the second approach won't match, but only as long as they follow a certain schema. You could use the same approach for "eastern" smileys, but it won't work for strictly symmetric ones, like =^_^=, as this is not a regular language. The second approach, on the other hand, is easier to extend with new smileys, as you just have to add them to the list.
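The snippet above only runs the first pattern. A sketch of how the second approach might be used follows; sorting the alternatives longest-first is an addition of this example, not something the answer does, and is meant to stop a short smiley from winning over a longer one that starts the same way. The names SMILEYS, PATTERN2 and find_smileys are illustrative.

```python
import re

SMILEYS = """:-) :) :o) :] :3 :c) :> =] 8) =) :} :^)
:D 8-D 8D x-D xD X-D XD =-D =D =-3 =3 B^D""".split()

# Alternation tries options left to right, so put longer smileys first.
PATTERN2 = re.compile(
    "|".join(re.escape(s) for s in sorted(SMILEYS, key=len, reverse=True))
)

def find_smileys(text):
    """Hypothetical helper: return every listed smiley found in text."""
    return PATTERN2.findall(text)

print(find_smileys("bla bla bla :-/ more text 8^P and another smiley =-D even more text"))
print(find_smileys("hey there I am good :)"))
```

On the first sample text only =-D is reported, since :-/ and 8^P are not in the list; that is the trade-off between the two approaches described above.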

How to find all words followed by symbol using Python Regex?

I need re.findall to detect words that are followed by a "="
So it works for an example like
re.findall(r'\w+(?=[=])', "I think Python=amazing")
but it won't work for "I think Python = amazing" or "Python =amazing"...
I do not know how to possibly integrate the whitespace issue here properly.
Thanks a bunch!
'(\w+)\s*=\s*'
re.findall(r'(\w+)\s*=\s*', 'I think Python=amazing')    # returns ['Python']
re.findall(r'(\w+)\s*=\s*', 'I think Python = amazing')  # returns ['Python']
re.findall(r'(\w+)\s*=\s*', 'I think Python =amazing')   # returns ['Python']
You said "Again stuck in the regex" probably in reference to your earlier question Looking for a way to identify and replace Python variables in a script where you got answers to the question that you asked, but I don't think you asked the question you really wanted the answer to.
You are looking to refactor Python code, and unless your tool understands Python, it will generate false positives and false negatives; that is, finding instances of variable = that aren't assignments and missing assignments that aren't matched by your regexp.
There is a partial list of tools at What refactoring tools do you use for Python? and more general searches with "refactoring Python your_editing_environment" will yield more still.
Just add some optional whitespace before the =:
\w+(?=\s*=)
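A short sketch that tries this lookahead on the three spacing variants from the question; the function name words_before_equals is made up for the example.

```python
import re

# \w+ must be followed by optional whitespace and then "=";
# the lookahead keeps the whitespace and "=" out of the match.
WORD_BEFORE_EQUALS = re.compile(r'\w+(?=\s*=)')

def words_before_equals(text):
    """Hypothetical helper: list every word directly followed by '='."""
    return WORD_BEFORE_EQUALS.findall(text)

for sample in ("I think Python=amazing",
               "I think Python = amazing",
               "Python =amazing"):
    print(sample, "->", words_before_equals(sample))
```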
Use this instead
re.findall(r'^(.+)(?=[=])', "I think Python=amazing")
Explanation
# ^(.+)(?=[=])
#
# Options: case insensitive
#
# Assert position at the beginning of the string «^»
# Match the regular expression below and capture its match into backreference number 1 «(.+)»
# Match any single character that is not a line break character «.+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=[=])»
# Match the character “=” «[=]»
You need to allow for whitespace between the word and the =:
re.findall(r'\w+(?=\s*[=])', "I think Python = amazing")
You can also simplify the expression by using a capturing group around the word, instead of a lookahead around the equals sign:
re.findall(r'(\w+)\s*=', "I think Python = amazing")
r'(.*)=.*' would do it as well ...
You have anything #1 followed with a = followed with anything #2, you get anything #1.
>>> re.findall(r'(.*)=.*', "I think Python=amazing")
['I think Python']
>>> re.findall(r'(.*)=.*', " I think Python = amazing oh yes very amazing ")
[' I think Python ']
>>> re.findall(r'(.*)=.*', "= crazy ")
['']
Then you can strip() the string that is in the list returned.
re.split(r'\s*=', "I think Python=amazing")[0].split() # returns ['I', 'think', 'Python']
