detect emoticon in a sentence using regex python [duplicate]

This question already has answers here:
Capturing emoticons using regular expression in python
(4 answers)
Closed 9 years ago.
Here is the list of emoticons: http://en.wikipedia.org/wiki/List_of_emoticons
I want to form a regex which checks whether any of these emoticons exists in a sentence, for example "hey there I am good :)" or "I am angry and sad :(". There are a lot of emoticons in the list on Wikipedia, though, so I am wondering how I can achieve this task.
I am new to regex and Python.
>>> s = "hey there I am good :)"
>>> import re
>>> q = re.findall(":",s)
>>> q
[':']

I see two approaches to your problem:
Either you can create a regular expression for a "generic smiley" and try to match as many as possible without making it overly complicated and insane. For example, you could say that each smiley has some sort of eyes, a nose (optional), and a mouth.
Or, if you want to match each and every smiley from that list (and nothing else), you can just take those smileys, escape any regex-specific special characters, and build a huge disjunction from them.
Here is some code that should get you started for both approaches:
import re

# approach 1: pattern for a "generic smiley" (eyes, optional nose, mouth)
eyes, noses, mouths = r":;8BX=", r"-~'^", r")(/\|DP"
pattern1 = "[%s][%s]?[%s]" % tuple(map(re.escape, [eyes, noses, mouths]))

# approach 2: disjunction of a list of smileys
smileys = """:-) :) :o) :] :3 :c) :> =] 8) =) :} :^)
:D 8-D 8D x-D xD X-D XD =-D =D =-3 =3 B^D""".split()
pattern2 = "|".join(map(re.escape, smileys))

text = "bla bla bla :-/ more text 8^P and another smiley =-D even more text"
print re.findall(pattern1, text)
print re.findall(pattern2, text)
Both approaches have pros, cons, and some general limitations. You will always have false positives, like in a mathematical term like 18^P. It might help to put spaces around the expression, but then you can't match smileys followed by punctuation. The first approach is more powerful and catches smileys the second approach won't match, but only as long as they follow a certain schema. You could use the same approach for "eastern" smileys, but it won't work for strictly symmetric ones, like =^_^=, as this is not a regular language. The second approach, on the other hand, is easier to extend with new smileys, as you just have to add them to the list.
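One possible refinement (a sketch of my own, not part of the answer above): anchoring the left side of the smiley to whitespace or the start of the string with a lookbehind cuts down on false positives like 18^P, while still allowing punctuation right after the smiley:
import re

eyes, noses, mouths = r":;8BX=", r"-~'^", r")(/\|DP"
# same "generic smiley" pattern as above, but only at the start of the string or after whitespace
pattern1b = r"(?:^|(?<=\s))[%s][%s]?[%s]" % tuple(map(re.escape, [eyes, noses, mouths]))

text = "bla bla bla :-/ more text 18^P and another smiley =-D"
print(re.findall(pattern1b, text))  # [':-/', '=-D']; the 8^P inside 18^P is no longer matched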

Related

Regular expression for 'b' not preceded by an odd number of 'a's [duplicate]

I've recently decided to jump into the deep end of the Python pool and start converting some of my R code over to Python and I'm stuck on something that is very important to me. In my line of work, I spend a lot of time parsing text data, which, as we all know, is very unstructured. As a result, I've come to rely on the lookaround feature of regex and R's lookaround functionality is quite robust. For example, if I'm parsing a PDF that might introduce some spaces in between letters when I OCR the file, I'd get to the value I want with something like this:
oAcctNum <- str_extract(textBlock[indexVal], "(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+")
In Python, this isn't possible because the use of ? makes the lookbehind a variable-width expression as opposed to a fixed-width. This functionality is important enough to me that it deters me from wanting to use Python, but instead of giving up on the language I'd like to know the Pythonista way of addressing this issue. Would I have to preprocess the string before extracting the text? Something like this:
oAcctNum = re.sub(r"(?<=\b\w)\s(?=\w\b)", "", textBlock[indexVal])
oAcctNum = re.search(r"(?<=ORIG:/)([A-Z0-9]+)", textBlock[indexVal]).group(1)
Is there a more efficient way to do this? Because while this example was trivial, this issue comes up in very complex ways with the data I work with and I'd hate to have to do this kind of preprocessing for every line of text I analyze.
Lastly, I apologize if this is not the right place to ask this question; I wasn't sure where else to post it. Thanks in advance.
Notice that if you can use groups, you generally do not need lookbehinds. So how about
match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
if match:
    text = match.group(1)
In practice:
>>> string = 'ORIG : / AB123'
>>> match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
>>> match
<_sre.SRE_Match object; span=(0, 14), match='ORIG : / AB123'>
>>> match.group(1)
'AB123'
You need to use a capture group in the case you described:
"(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+"
will become
r"ORIG\s?:\s?/\s?([A-Z0-9]+)"
The value will be in .group(1). Note that raw strings are preferred.
Here is some sample code:
import re
p = re.compile(r'ORIG\s?:\s?/\s?([A-Z0-9]+)', re.IGNORECASE)
test_str = "ORIG:/texthere"
print re.search(p, test_str).group(1)
Unless you need overlapping matches, using a capture group instead of a look-behind is rather straightforward:
print re.findall(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", test_str)
You can use findall directly; it returns the captured groups whenever the pattern matches.
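If the OCR really does scatter spaces between individual letters (so that even ORIG arrives as O R I G), a minimal sketch of the two-step route floated in the question (collapse those single-letter gaps first, then extract with a capture group) could look like this; the sample line is made up for illustration:
import re

line = "O R I G : / AB123 trailing text"
# drop a space that is preceded by a word-initial character and followed by a
# word-final character, i.e. the preprocessing idea from the question
collapsed = re.sub(r"(?<=\b\w)\s(?=\w\b)", "", line)        # "ORIG : / AB123 trailing text"
match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", collapsed)
if match:
    print(match.group(1))                                    # AB123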

Case-insensitivity exclusively in lookbehind / lookahead groups for Python regex [duplicate]

This question already has answers here:
How to set ignorecase flag for part of regular expression in Python?
(3 answers)
Closed 3 years ago.
I understand how to make matching case-insensitive in Python, and I understand how to use lookaheads and lookbehinds, but how do I combine the two?
For instance, my text is
mytext = "I LOVE EATING popsicles at home."
I want to extract popsicles from this text (my target food item). This regex works great:
import re
regex = r'(?<=I\sLOVE\sEATING\s)[a-z0-9]*(?=\sat\shome)'
re.search(regex, mytext)
However, I'd like to account for the scenario where someone writes
i LOVE eating apples at HOME.
That should match. But "I LOVE eating Apples at home" should NOT match, since Apples is uppercase.
Thus, I'd like to have local case insensitivity in my two lookahead (?=\sat\shome) and lookbehind (?<=I\sLOVE\sEATING\s) groups. I know I can use the re.IGNORECASE flag for global case insensitivity, but I just want the lookahead/lookbehind groups to be case insensitive, not my actual target expression.
Traditionally, I'd prepend (?i:I LOVE EATING) to create a case-insensitive non-capturing group that matches both I LOVE EATING and I love eating. However, if I try to combine the two:
(?i:<=I\sLOVE\sEATING\s)
I get no matches, since it now interprets the i: as a literal expression to match. Is there a way to combine lookaheads/behinds with case sensitivity?
Edit: I don't think this is a duplicate of the marked question. That question asks about setting the ignorecase flag for part of a regular expression in general; I'm asking about a specific subset: lookaheads and lookbehinds. The syntax is different here, and the answers in that other post do not directly apply. As the answers on this post show, you need workarounds to achieve this that don't apply to the supposed duplicate.
You can set the regex to case-insensitive globally with (?i) and switch a group to case-sensitive with (?-i:groupcontent):
regex = r'(?i)(?<=I\sLOVE\sEATING\s)(?-i:[a-z0-9]*)(?=\sat\shome)'
Instead of (?i), you can also use re.I in the search. The following is equivalent to the regex above:
regex = r'(?<=I\sLOVE\sEATING\s)(?-i:[a-z0-9]*)(?=\sat\shome)'
re.search(regex, mytext, re.I)
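For completeness, here is the whole thing end to end; note that the scoped (?-i:...) group is only understood by the re module from Python 3.6 onward, and the sample strings are my own:
import re

# assumes Python 3.6+ so that the scoped (?-i:...) flag group is supported
regex = r'(?i)(?<=I\sLOVE\sEATING\s)(?-i:[a-z0-9]*)(?=\sat\shome)'

print(re.search(regex, 'i LOVE eating apples at HOME.'))  # matches 'apples'
print(re.search(regex, 'I LOVE eating Apples at home.'))  # None, because 'Apples' is capitalised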
Unfortunately, before Python 3.6 the python re module does not allow an inline mode modifier to be applied to just part of a regex.
As a workaround, you may use this regex:
import re

reg = re.compile(r'(?<=[Ii]\s[Ll][Oo][Vv][Ee]\s[Ee][Aa][Tt][Ii][Nn][Gg]\s)[a-z0-9]*(?=\s[Aa][Tt]\s[Hh][Oo][Mm][Ee])')
print "Case 1: ", reg.findall('I LOVE Eating popsicles at HOME.')
print "Case 2: ", reg.findall('I LOVE EATING popsicles at home.')
print "Case 3: ", reg.findall('I LOVE Eating Popsicles at HOME.')
Output:
Case 1: ['popsicles']
Case 2: ['popsicles']
Case 3: []
Using (?i:...) you can set a regex flag (in this case i) locally (inline) for part of the regex. Such a local flag setting is also allowed within a lookbehind or lookahead, while the rest of the regex keeps no options at all. (Note that these scoped inline flags require Python 3.6 or later.) I modified your code so that it compiles the regex once and then calls it twice, for two different source strings:
import re

mytext1 = 'i LOVE eating Apples at HOME.'
mytext2 = 'i LOVE eating apples at HOME.'
pat = re.compile(r'(?<=(?i:I\sLOVE\sEATING\s))[a-z0-9]+(?=(?i:\sAT\sHOME))')
m = pat.search(mytext1)
print('1:', m.group() if m else '** Not found **')
m = pat.search(mytext2)
print('2:', m.group() if m else '** Not found **')
It prints:
1: ** Not found **
2: apples
so the match is only for the second source string.

How do I check if a string matches a set pattern in Python?

I want to match a string to a specific pattern or set of words, like below:
the apple is red is the query and
the apple|orange|grape is red|orange|violet is the pattern to match.
The pipes would represent words that could substitute for each other. The pattern could also be grouped, like [launch app]|[start program]. I would like the module to return True or False depending on whether the query matches the pattern.
What is the best way to accomplish this if there is not a library that does this already? If this can be done with simple regex, great; however, I know next to nothing about regex. I am using Python 2.7.11.
import re
string = 'the apple is red'
re.search(r'^the (apple|orange|grape) is (red|orange|violet)', string)
Here's an example of it running:
In [20]: re.search(r'^the (apple|orange|grape) is (red|orange|violet)', string).groups()
Out[20]: ('apple', 'red')
If there are no matches then re.search() will return nothing.
You may know "next to nothing about regex", but you nearly wrote the pattern yourself.
The sections within the parentheses can have their own regex patterns, too. So you could match "apple" and "apples" with
r'the (apple[s]*|orange|grape)'
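If you want to build such a pattern programmatically from the pipe syntax in the question and just get True or False back, here is a rough sketch (the helper name matches_pattern is mine, and it only handles the plain word|word|word form, not the grouped [launch app]|[start program] case):
import re

def matches_pattern(pattern, query):
    # turn each space-separated token like "apple|orange|grape" into a
    # non-capturing group of literal alternatives
    parts = []
    for token in pattern.split():
        alternatives = "|".join(re.escape(alt) for alt in token.split("|"))
        parts.append("(?:%s)" % alternatives)
    regex = r"^%s$" % r"\s+".join(parts)
    return re.match(regex, query) is not None

print(matches_pattern("the apple|orange|grape is red|orange|violet", "the apple is red"))   # True
print(matches_pattern("the apple|orange|grape is red|orange|violet", "the banana is red"))  # False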
The re based solutions for this kind of problem work great. But it would sure be nice if there were an easy way to pull data out of strings in Python without having to learn regex (or to learn it AGAIN, which is what I always end up having to do, since my brain is broken).
Thankfully, someone took the time to write parse.
parse
parse is a nice package for this kind of thing. It uses regular expressions under the hood, but the API is based on the string format specification mini-language, which most Python users will already be familiar with.
For a format spec you will use over and over again, you'd use parse.compile. Here is an example:
>>> import parse
>>> theaisb_parser = parse.compile('the {} is {}')
>>> fruit, color = theaisb_parser.parse('the apple is red')
>>> print(fruit, color)
apple red
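Since the question only needs a True/False answer, note that parse returns None when the string does not fit the format, so a simple is not None check does the job:
>>> parse.parse('the {} is {}', 'the apple is red') is not None
True
>>> parse.parse('the {} is {}', 'something else entirely') is not None
False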

regexp for nvda to put spaces between all capital letters?

So, I use NVDA, a free screen reader for the blind that many people use, together with a speech synthesizer. I am building a library of modified versions of the add-ons it takes, plus dictionaries that can contain regular expressions accepted by Python as well as standard word-replacement operations.
My problem is that I do not know how to design a regular expression that will place a space between capital letters, such as in ANM, which the synth says as one word rather than spelling it out like it should.
I do not know enough Python to manually code an add-on for this, so I only use regexps for this kind of thing. I do know the basics of regular expressions, the general implementation, which you can find by googling "regular expressions in about 55 minutes".
I want it to do something like this.
Input: ANM
Output: A N M
Also with the way this speech synth works, I may have to replace A with eh, which would make this.
Input: ANM
Output: Eh N M
Could any of you provide me a regular expression to do this if it is possible? And no, I don't think I can compile them in loops because I didn't write the python.
This should do the trick for the capital letters. It uses ?= to look ahead for the next capital letter without 'eating it up':
>>> import re
>>> re.sub("([A-Z])(?=[A-Z])", r"\1 ", "ABC thIs iS XYZ a Test")
'A B C thIs iS X Y Z a Test'
If you have a lot of replacements to make, it might be easiest to put them into a single variable:
replacements = [("A", "eh"), ("B", "bee"), ("X", "ex")]
result = re.sub("([A-Z])(?=[A-Z])", r"\1 ", "ABC thIs iS XYZX. A Xylophone")
for source, dest in replacements:
    result = re.sub("(" + source + r")(?=\W)", dest, result)
print(result)
Output:
eh bee C thIs iS ex Y Z ex. eh Xylophone
I built the regex in the 'replacements' code to handle capitalised words and standalone capitals at the end of sentences correctly. If you want to avoid replacing e.g. the standalone 'A' with 'eh', then the more advanced regex replacement function mentioned in @fjarri's answer is the way to go.
While @Galax's solution certainly works, it may be easier to perform further processing of abbreviations if you use a callback on matches (this way you won't replace any standalone capitals):
import re
s = "This is a normal sentence featuring an abbreviation ANM. One, two, three."
def process_abbreviation(match_object):
    spaced = ' '.join(match_object.group(1))
    return spaced.replace('A', 'Eh')
print(re.sub("([A-Z]{2,})", process_abbreviation, s))
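If more letters than 'A' need pronunciation fixes, the callback extends naturally with a lookup table; the spellings below are only illustrative guesses for the synthesizer:
import re

# illustrative pronunciation map; adjust the spellings to whatever the synth needs
letter_names = {'A': 'Eh', 'B': 'Bee', 'X': 'Ex'}

def spell_out(match_object):
    abbreviation = match_object.group(1)
    return ' '.join(letter_names.get(letter, letter) for letter in abbreviation)

s = "This is a normal sentence featuring an abbreviation ANM. One, two, three."
print(re.sub(r"([A-Z]{2,})", spell_out, s))
# This is a normal sentence featuring an abbreviation Eh N M. One, two, three.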
Okay, found the answer. Using a sequence of regexes in a certain order, I got it to work. Thanks, you guys; you helped me form the basis and you are appreciated.

Simple Python Regex Find pattern

I have a sentence. I want to find all occurrences of a word that start with a specific character in that sentence. I am very new to programming and Python, but from the little I know, this sounds like a Regex question.
What is the pattern match code that will let me find all words that match my pattern?
Many thanks in advance,
Brock
import re
print re.findall(r'\bv\w+', thesentence)
will print every word in the sentence that starts with 'v', for example.
Using the split method of strings, as another answer suggests, would not identify words, but space-separated chunks that may include punctuation. This re-based solution does identify words (letters and digits, net of punctuation).
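For instance, with the sentence in a variable and the target letter parameterised (first_char is just my name for it), that looks like:
import re

thesentence = "a very vocal fox visits the valley"
first_char = "v"  # the specific starting character you are interested in
print(re.findall(r'\b' + re.escape(first_char) + r'\w+', thesentence))
# ['very', 'vocal', 'visits', 'valley']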
I second the Dive Into Python recommendation. But it's basically:
m = re.findall(r'\bf.*?\b', 'a fast and friendly dog')
print(m)
\b means word boundary, and .*? matches as few characters as possible while still letting the match complete, so it stops at the end of the word rather than running on (technically, the trailing ? makes the quantifier lazy, i.e. non-greedy).
You could do (doesn't use re though):
matching_words = [x for x in sentence.split() if x.startswith(CHAR_TO_FIND)]
Regular expressions work too (see the other answers), but I think this solution will be a little more readable, and as a beginner learning Python you'll find it worthwhile to get comfortable with list comprehensions like the one above.
>>> sentence="a quick brown fox for you"
>>> pattern="fo"
>>> for word in sentence.split():
...     if word.startswith(pattern):
...         print word
...
fox
for
Split the sentence on spaces, use a loop to search for the pattern and print them out.
import re
s = "Your sentence that contains the word ROAD"
s = re.sub(r'\bROAD', 'RD.', s)
print s
Read: http://diveintopython3.org/regular-expressions.html
