Capturing emoticons using regular expression in python

I would like to have a regex pattern to match the smileys ":)" and ":(". It should also capture repeated smileys like ":) :)" and ":) :(", but filter out invalid syntax like ":( (".
I have this so far, but it matches ":( (":
bool( re.match("(:\()",str) )
I may be missing something obvious here, and I'd like some help with this seemingly simple task.

I think it finally "clicked" exactly what you're asking about here. Take a look at the below:
import re

smiley_pattern = r'^(:\(|:\))+$'  # matches only the smileys ":)" and ":("

def test_match(s):
    print('Value: %s; Result: %s' % (
        s,
        'Matches!' if re.match(smiley_pattern, s) else "Doesn't match."
    ))

should_match = [
    ':)',    # Single smile
    ':(',    # Single frown
    ':):)',  # Two smiles
    ':(:(',  # Two frowns
    ':):(',  # Mix of a smile and a frown
]
should_not_match = [
    '',       # Empty string
    ':(foo',  # Extraneous characters appended
    'foo:(',  # Extraneous characters prepended
    ':( :(',  # Space between frowns
    ':( (',   # Extraneous characters and space appended
    ':((',    # Extraneous duplicate of final character appended
]

print('The following should all match:')
for x in should_match:
    test_match(x)
print('')  # Newline for output clarity
print('The following should all not match:')
for x in should_not_match:
    test_match(x)
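The problem with your original code is that the regex (:\() only describes a single frown, and re.match only requires the pattern to match at the start of the string, so anything that follows (such as " (") is ignored. Let's break it down.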
The outside parentheses are a "grouping". They're what you'd reference if you were going to do a string replacement, and are used to apply regex operators on groups of characters at once. So, you're really saying:
( begin a group
:\( ... do regex stuff ...
) end the group
The : isn't a regex reserved character, so it's just a colon. The \ is, and it means "the following character is literal, not a regex operator". This is called an "escape sequence". Fully parsed into English, your regex says
( begin a group
: a colon character
\( a left parenthesis character
) end the group
The regex I used is slightly more complex, but not bad. Let's break it down: ^(:\(|:\))+$.
^ and $ mean "the beginning of the line" and "the end of the line" respectively. Now we have ...
^ beginning of line
(:\(|:\))+ ... do regex stuff ...
$ end of line
... so it only matches things that comprise the entire line, not simply occur in the middle of the string.
We know that ( and ) denote a grouping. + means "one or more of these". Now we have:
^ beginning of line
( start a group
:\(|:\) ... do regex stuff ...
) end the group
+ match one or more of this
$ end of line
Finally, there's the | (pipe) operator. It means "or". So, applying what we know from above about escaping characters, we're ready to complete the translation:
^ beginning of line
( start a group
: a colon character
\( a left parenthesis character
| or
: a colon character
\) a right parenthesis character
) end the group
+ match one or more of this
$ end of line
I hope this helps. If not, let me know and I'll be happy to edit my answer with a reply.
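The question also lists ":) :)" as something that should match, which the pattern above rejects because it allows no spaces. Here is a minimal sketch, assuming optional whitespace between smileys is acceptable; the names SMILEY_RUN and is_smiley_run are made up for illustration.

```python
import re

# Assumption: smileys may be separated by optional whitespace.
SMILEY_RUN = re.compile(r':[()](?:\s*:[()])*')

def is_smiley_run(s):
    """Return True if s consists only of ':)' / ':(' smileys."""
    # fullmatch requires the whole string to match, so no ^ or $ is needed.
    return SMILEY_RUN.fullmatch(s) is not None

for sample in [':)', ':) :)', ':) :(', ':( (', ':((', '']:
    print(repr(sample), is_smiley_run(sample))
```

Using fullmatch instead of match avoids the original problem of trailing characters being silently ignored.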

Maybe something like:
re.match('[:;][)(](?![)(])', str)

Try (?::|;|=)(?:-)?(?:\)|\(|D|P). Haven't tested it extensively, but does seem to match the right ones and not more...
In [15]: import re
In [16]: s = "Just: to :)) =) test :(:-(( ():: :):) :(:( :P ;)!"
In [17]: re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',s)
Out[17]: [':)', '=)', ':(', ':-(', ':)', ':)', ':(', ':(', ':P', ';)']

I got the answer I was looking for from the comments and answers posted here.
re.match("^(:[)(])*$",str)
Thanks to all.
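One thing worth checking with this pattern: because of the *, it also accepts the empty string. Below is a quick sketch comparing it with a + variant; the + variant is only a suggestion here and is not part of the post above.

```python
import re

star = re.compile(r'^(:[)(])*$')  # pattern from the post above
plus = re.compile(r'^(:[)(])+$')  # assumed variant: at least one smiley required

for s in ['', ':)', ':):(', ':( (']:
    print(repr(s), bool(star.match(s)), bool(plus.match(s)))
```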

Related

Split a string by comma except when in bracket and except when directly before and/or after the comma is a dash "-"?

I'm just trying to figure out how to split a string by comma, except when the comma is inside brackets AND except when there is a dash directly before and/or after the comma. I have already found some good solutions for the bracket problem, but I have no idea how to extend them to my case.
Here is an example:
example_string = 'A-la-carte-Küche, Garnieren (Speisen, Getränke), Kosten-, Leistungsrechnung, Berufsausbildung, -fortbildung'
aim = ['A-la-carte-Küche', 'Garnieren (Speisen, Getränke)', 'Kosten-, Leistungsrechnung', 'Berufsausbildung, -fortbildung']
So far, I have managed to do the following:
>>> re.split(r',\s*(?![^()]*\))', example_string)
>>> out: ['A-la-carte-Küche', 'Garnieren (Speisen, Getränke)', 'Kosten-', 'Leistungsrechnung', 'Berufsausbildung', '-fortbildung']
Note the difference between aim and out for the terms 'Kosten-, Leistungsrechnung' and 'Berufsausbildung, -fortbildung'.
Would be glad if someone could help me out such that the output looks like aim.
Thanks in advance!
Alex

If you can make use of the Python regex module (a third-party package, not the built-in re), you could do:
\([^()]*\)(*SKIP)(*F)|(?<!-)\s*,\s*(?!,)
The pattern matches:
\([^()]*\) Match from an opening till closing parenthesis
(*SKIP)(*F) Skip the match
| Or
(?<!-)\s*,\s*(?!,) Match a comma between optional whitespace chars to split on
import regex
example_string = 'A-la-carte-Küche, Garnieren (Speisen, Getränke), Kosten-, Leistungsrechnung, Berufsausbildung, -fortbildung'
print(regex.split(r"\([^()]*\)(*SKIP)(*F)|(?<!-)\s*,\s*(?!,)", example_string))
Output
['A-la-carte-Küche', ' Garnieren (Speisen, Getränke)', ' Kosten-, Leistungsrechnung', ' Berufsausbildung', ' -fortbildung']

You can use
re.split(r'(?<!-),(?!\s*-)\s*(?![^()]*\))', example_string)
See the Python demo. Details:
(?<!-) - a negative lookbehind that fails the match if there is a - char immediately to the left of the current location
, - a comma
(?!\s*-) - a negative lookahead that fails the match if there are zero or more whitespaces followed by a - char immediately to the right of the current location
\s* - zero or more whitespaces
(?![^()]*\)) - a negative lookahead that fails the match if there are zero or more chars other than ) and ( and then a ) char immediately to the right of the current location.
See the regex demo, too.
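As a quick check, the pattern can be run against the example from the question. This is a minimal sketch using only the standard re module; example_string and aim are copied from the question, and parts is just an illustrative name.

```python
import re

example_string = ('A-la-carte-Küche, Garnieren (Speisen, Getränke), '
                  'Kosten-, Leistungsrechnung, Berufsausbildung, -fortbildung')
aim = ['A-la-carte-Küche', 'Garnieren (Speisen, Getränke)',
       'Kosten-, Leistungsrechnung', 'Berufsausbildung, -fortbildung']

# Split on commas that are not next to a dash and not inside parentheses.
parts = re.split(r'(?<!-),(?!\s*-)\s*(?![^()]*\))', example_string)
print(parts)
print(parts == aim)
```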

How to use RegEx in an if statement in Python?

I'm building something like a "syntax analyzer" with Kivy, using re (regular expressions).
I only want to check for valid syntax in basic operations (like +|-|*|/|(|)).
The user types the string (with the keyboard) and I validate it with a regex.
But I don't know how to use a regex in an if statement. What I want is: if the string the user gives me isn't correct (or doesn't match the regex), print something like "Invalid string", and if it is correct, print "Valid string".
I've tried:
if re.match(patron, string) is not None:
    print("\nTrue")
else:
    print("False")
but no matter what the string contains, the app always shows True.
Sorry for my poor English. Any help would be greatly appreciated!
import re

patron = re.compile(r"""
(
    -?\d+[.\d+]?
    [+*-/]
    -?\d+[.\d+]?
    [+|-|*|/]?
)*
""", re.X)
obj1 = self.ids['text'].text  # TextInput
if re.match(patron, obj1) is not None:
    print("\nValid String")
else:
    print("Invalid string")
If obj1 = "53.22+22.11+10*555+62+55.2-66", the string is correct and the app prints "Valid...". But if I add an a, as in "a53.22+22.11+10*555+62+55.2-66", the string is incorrect and the app should print "Invalid...", yet it still reports it as valid.

Your regex always matches because it allows the empty string to match (since the entire regex is enclosed in an optional group).
If you test this live on regex101.com, you can immediately see this and also that it doesn't match the entire string but only parts of it.
I've already corrected two errors in your character classes concerning the use of unnecessary/harmful alternation operators (|) and incorrect placement of the dash, making it into a range operator (-), but it's still incorrect.
I think you want something more like this:
^                # Make sure the match begins at the start of the string
(?:              # Start a non-capturing group that matches...
  -?             # an optional minus sign,
  \d+            # one or more digits
  (?:\.\d+)?     # an optional group that contains a dot and one or more digits.
  (?:            # Start of a non-capturing group that either matches...
    [+*/-]       # an operator
    |            # or
    $            # the end of the string.
  )              # End of inner non-capturing group
)+               # End of outer non-capturing group, required to match at least once.
(?<![+*/-])      # Make sure that the final character isn't an operator.
$                # Make sure that the match ends at the end of the string.
Test it live on regex101.com.
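To tie this back to the if statement in the question, here is a minimal sketch that compiles the pattern above in verbose mode and uses the match result as the condition. The function name is_valid_expression is made up for illustration, and the Kivy TextInput is replaced by plain strings.

```python
import re

# Verbose form of the pattern above; re.VERBOSE ignores whitespace and comments.
expr_pattern = re.compile(r"""
    ^
    (?:
        -?\d+(?:\.\d+)?   # a number, optionally negative and with decimals
        (?:[+*/-]|$)      # followed by an operator or the end of the string
    )+
    (?<![+*/-])           # the last character must not be an operator
    $
""", re.VERBOSE)

def is_valid_expression(text):
    """Return True if text is a chain of numbers joined by + - * /."""
    return expr_pattern.match(text) is not None

for candidate in ["53.22+22.11+10*555+62+55.2-66",
                  "a53.22+22.11+10*555+62+55.2-66",
                  "1+"]:
    if is_valid_expression(candidate):
        print("Valid string:", candidate)
    else:
        print("Invalid string:", candidate)
```

A match object is truthy and None is falsy, so `if expr_pattern.match(text):` would work as well; the explicit `is not None` just mirrors the question's code.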

This answers your question about how to use if with regex:
Caveat: this regex will not weed out all invalid inputs, e.g., two decimal points (".."), two operators ("++"), and such, so please adjust it to suit your exact needs.
import re

regex = re.compile(r"[\d.+\-*\/]+")
input_list = [
    "53.22+22.11+10*555+62+55.2-66", "a53.22+22.11+10*555+62+55.2-66",
    "53.22+22.pq11+10*555+62+55.2-66", "53.22+22.11+10*555+62+55.2-66zz",
]
for input_str in input_list:
    mmm = regex.match(input_str)
    # Valid only if the match covers the whole input string
    if mmm and input_str == mmm.group():
        print('Valid: ', input_str)
    else:
        print('Invalid: ', input_str)
Above as a function for use with a single string instead of a list:
import re

regex = re.compile(r"[\d.+\-*\/]+")

def check_for_valid_string(in_string=""):
    mmm = regex.match(in_string)
    # Valid only if the match covers the whole input string
    if mmm and in_string == mmm.group():
        return 'Valid: ', in_string
    return 'Invalid: ', in_string

check_for_valid_string('53.22+22.11+10*555+62+55.2-66')
check_for_valid_string('a53.22+22.11+10*555+62+55.2-66')
check_for_valid_string('53.22+22.pq11+10*555+62+55.2-66')
check_for_valid_string('53.22+22.11+10*555+62+55.2-66zz')
Output:
## Valid: 53.22+22.11+10*555+62+55.2-66
## Invalid: a53.22+22.11+10*555+62+55.2-66
## Invalid: 53.22+22.pq11+10*555+62+55.2-66
## Invalid: 53.22+22.11+10*555+62+55.2-66zz

How to match the string in $(....) in python

with open('templates/data.xml', 'r') as s:
for line in s:
line = line.rstrip() #removes trailing whitespace and '\n' chars
if "\\$\\(" not in line:
if ")" not in line:
continue
print(line)
start = line.index("$(")
end = line.index(")")
print(line[start+2:end])
I need to match strings like $(hello). But right now this even matches (hello).
I'm really new to Python. So what am I doing wrong here?

Use the following regex:
\$\(([^)]+)\)
It matches $, followed by (, then one or more characters other than ) up to the next ), and captures the characters between the parentheses.
Here we escaped the $, ( and ) because, when you use a function that accepts a regex (like findall), you don't want $ to be treated as the special character $ but as the literal "$" (the same holds for ( and )). However, note that the inner parentheses are not escaped, since they form the group that captures the text between the outer parentheses.
Note that you don't need to escape the special characters when you're not using a regex.
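A minimal sketch of the regex in use with re.findall; the sample text and the name DOLLAR_PAREN are made up for illustration.

```python
import re

# \$ and \( \) are literal characters; the unescaped ( ) form the capture group.
DOLLAR_PAREN = re.compile(r'\$\(([^)]+)\)')

sample = "say $(hello) but not (world), then $(bye)"
print(DOLLAR_PAREN.findall(sample))
```

Because the pattern has exactly one capture group, findall returns only the captured text rather than the full $(...) match.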

You can do:
>>> import re
>>> escaper = re.compile(r'\$\((.*?)\)')
>>> escaper.findall("I like to say $(hello)")
['hello']

I believe something along the lines of:
import re
data = "$(hello)"
matchObj = re.match(r'\$\(([^)]+)\)', data, re.M|re.I)
if matchObj:
    print(matchObj.group(1))  # group(1) is the text between $( and )
might do the trick.

If you don't want to do it with regexes (I wouldn't necessarily; they can be hard to read), here are some notes on your code:
Your for loop indentation is wrong.
"\\$\\(" means the literal text \$\( (you're escaping the backslashes, not the $ and the ().
You don't need to escape $ or (. Just do if "$(" not in line.
You need to check that the $( is found before the ). Currently your code will match "foo)bar$(baz".
Rather than checking whether $( and ) are in the string and then searching for them a second time, it would be better to just call .index() anyway and catch the exception. Something like this:
with open('templates/data.xml', 'r') as s:
    for line in s:
        try:
            start = line.index("$(")
            end = line.index(")", start)  # search for ")" only after "$("
            print(line[start+2:end])
        except ValueError:
            pass  # no "$(...)" on this line
Edit: That will only match one $() per line; you'll want to add a loop.
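The loop mentioned in the edit could look roughly like this. It is a sketch only: the helper name extract_dollar_parens is made up, and it uses str.find (which returns -1 instead of raising) so that the loop can stop cleanly.

```python
def extract_dollar_parens(line):
    """Return the contents of every $(...) occurrence in line, left to right."""
    results = []
    pos = 0
    while True:
        start = line.find("$(", pos)
        if start == -1:
            break                      # no more "$(" in the rest of the line
        end = line.find(")", start + 2)
        if end == -1:
            break                      # unterminated "$(": stop looking
        results.append(line[start + 2:end])
        pos = end + 1                  # continue after the closing ")"
    return results

print(extract_dollar_parens("a $(x) (y) $(z)"))
```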

Python/Regex splitting a specifically formatted return string

I'm working with a search&replace programming assignment. I'm a student and I'm finding the regex documentation a bit overwhelming (e.g. https://docs.python.org/2/library/re.html), so I'm hoping someone here could explain to me how to accomplish what I'm looking for.
I've used regex to get a list of strings from my document. They all look like this:
%#import fileName (regexStatement)
An actual example:
%#import script_example.py ( *out =(.|\n)*?return out)
Now, I'm wondering how I can split these up so I get the fileName and regexStatement as separate strings. I'd assume using a regex or a string split function, but I'm not sure how to make it work on all kinds of variations of %#import fileName (regexStatement). Splitting on parentheses could hit the middle of the regex statement, or a parenthesis that is part of the fileName, for instance. The assignment doesn't specify whether it should only be able to import from Python files, so I don't believe I can use ".py (" as a splitting point before the regex statement either.
I'm thinking of something like a regex "%#import " to hit the gap after import, and "\..* " to hit the gap after fileName. But I'm not sure how to get rid of the parentheses that enclose the regex statement, or how to use all of it to actually split the string correctly so that I have one variable storing fileName and one storing regexStatement for each entry in my list.
Thanks a lot for your attention!

If the filename can't contain spaces, just split your string on spaces with maxsplit 2:
>>> line.split(' ', 2)
['%#import', 'script_example.py', '( *out =(.|\n)*?return out)']
The maxsplit 2 makes it split only the first two spaces, and leave intact any spaces within the regex. Now you have the filename as the second element and the regex as the third. It's not clear from your statement whether the parentheses are part of the regex or not (i.e., as a capturing group). If not, you can easily remove them by trimming the first and last characters from that part.
If you assign the values like this:
filename, regex = line.split(' ', 2)[1:]
then you can strip the parentheses with:
regex = regex[1:-1]
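Putting those two steps together, a small sketch might look like this. The function name parse_import_line is hypothetical, and it assumes (as above) that the filename contains no spaces.

```python
def parse_import_line(line):
    """Split '%#import fileName (regexStatement)' into (fileName, regexStatement)."""
    parts = line.split(' ', 2)          # at most two splits: tag, filename, rest
    if len(parts) != 3 or parts[0] != '%#import':
        return None                     # not an import line of the expected shape
    filename, regex = parts[1], parts[2]
    if regex.startswith('(') and regex.endswith(')'):
        regex = regex[1:-1]             # drop the enclosing parentheses
    return filename, regex

line = r'%#import script_example.py ( *out =(.|\n)*?return out)'
print(parse_import_line(line))
```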

That should do it nicely
^%#import (\S+) \((.*)\)
or, if the filename may have spaces:
^%#import ((?:(?! \().)+) \((.*)\)
Both expressions contain two groups, one for the file name and one for the contents of the parentheses. Run in multiline mode on the entire file or in normal mode if you work with single lines anyway.
This: ((?:(?! \().)+) breaks down as:
(              # group start
  (?:          # non-capturing group
    (?!        # negative look-ahead: a position NOT followed by
       \(      # " ("
    )          # end look-ahead
    .          # match any char (this is part of the filename)
  )+           # end non-capturing group, repeat
)              # end group
The other bits of the expression should be self-explanatory.
import re

line = "%#import script_example.py ( *out =(.|\\n)*?return out)"
pattern = r'^%#import (\S+) \((.*)\)'
match = re.match(pattern, line)
if match:
    print("match.group(1) '" + match.group(1) + "'")
    print("match.group(2) '" + match.group(2) + "'")
else:
    print("No match.")
prints
match.group(1) 'script_example.py'
match.group(2) ' *out =(.|\n)*?return out'
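The second expression (for filenames that may contain spaces) can be tried the same way. A minimal sketch follows; the filename "my script.py" is an invented example, not something from the question.

```python
import re

# Group 1: everything up to (but not including) the first " ("; group 2: paren contents.
IMPORT_LINE = re.compile(r'^%#import ((?:(?! \().)+) \((.*)\)')

line = r'%#import my script.py ( *out =(.|\n)*?return out)'
m = IMPORT_LINE.match(line)
if m:
    filename, statement = m.groups()
    print(repr(filename), repr(statement))
else:
    print("No match.")
```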

For matching something like %#import script_example.py ( *out =(.|\n)*?return out), I suggest:
r'%#impor[\w\W ]+'
Note that:
\w matches any word character [a-zA-Z0-9_]
\W matches any non-word character [^a-zA-Z0-9_]
so you can use re.findall() to find all the matches:
import re
re.findall(r'%#impor[\w\W ]+', your_string)

Match series of (non-nested) balanced parentheses at end of string

How can I match one or more parenthetical expressions appearing at the end of string?
Input:
'hello (i) (m:foo)'
Desired output:
['i', 'm:foo']
Intended for a python script. Paren marks cannot appear inside of each other (no nesting), and the parenthetical expressions may be separated by whitespace.
It's harder than it might seem at first glance, at least so it seems to me.

import re

paren_pattern = re.compile(r"\(([^()]*)\)(?=(?:\s*\([^()]*\))*\s*$)")

def getParens(s):
    return paren_pattern.findall(s)

or even shorter:
getParens = re.compile(r"\(([^()]*)\)(?=(?:\s*\([^()]*\))*\s*$)").findall
Explanation:
\( # opening paren
([^()]*) # content, captured into group 1
\) # closing paren
(?= # look ahead for...
(?:\s*\([^()]*\))* # a series of parens, separated by whitespace
\s* # possibly more whitespace after
$ # end of string
) # end of look ahead
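A short usage sketch for the pattern above. The second input is an invented example, added to show that a parenthetical followed by other text is skipped.

```python
import re

paren_pattern = re.compile(r"\(([^()]*)\)(?=(?:\s*\([^()]*\))*\s*$)")

# All trailing parentheticals are returned, in order.
trailing = paren_pattern.findall('hello (i) (m:foo)')
print(trailing)

# "(a)" is followed by plain text, so only the final "(b)" qualifies.
mixed = paren_pattern.findall('(a) hello (b)')
print(mixed)
```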

You don't need to use regex:
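def splitter(text):
    # Split on "(" and strip trailing whitespace and ")" from each piece;
    # the first piece is the text before the first "(" and is dropped.
    return [s.rstrip(" \t)") for s in text.split("(")][1:]

print(splitter('hello (i) (m:foo)'))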
Note: this solution only works if your input is already known to be valid. See MizardX's solution that will work on any input.
