Python regular expression for a sentence does not want to match

Can anyone explain why this re (in Python):
pattern = re.compile(r"""
^
([[a-zA-Zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]+\s{1}]+)
([a-zA-Zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]+) # Last word.
\.{1}
$
""", re.VERBOSE + re.UNICODE)
if re.match(pattern, line):
does not match "A sentence."
I would actually like to return the entire sentence (including the period) as a returned group (), but have been failing miserably.

I think that maybe you meant to do this:
(([a-zA-Zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]+\s{1})+)
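The outer square brackets of the first group became parentheses. Character classes cannot be nested: in your pattern the second [ is just a literal character inside the class, the first ] closes the class, and the later ] is a literal bracket that the final + applies to. So the nested square brackets you had don't do what you think they do.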
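To see the difference, here is a minimal sketch that reduces both versions of the first group to ASCII letters (the shortened character class and the variable names are my own simplification, not the original pattern):

```python
import re
import warnings

with warnings.catch_warnings():
    # Recent Python versions may emit a FutureWarning for a set that
    # starts with a literal "[", so it is silenced for this demo.
    warnings.simplefilter("ignore", FutureWarning)
    # Parsed as: class of "[" and letters, one whitespace, then one or more "]".
    broken_re = re.compile(r"[[a-zA-Z]+\s{1}]+")

# Parentheses group "word + whitespace" so the + repeats the whole unit.
fixed_re = re.compile(r"([a-zA-Z]+\s)+")

print(broken_re.match("A sentence"))        # None: no literal "]" after the space
print(broken_re.match("A ]]").group())      # A ]]
print(repr(fixed_re.match("A sentence").group()))  # 'A '
```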

This regex works:
import re
pattern = re.compile(r"""
^
([a-zA-Zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]+\s{1})+
([a-zA-Zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]+) # Last word.
\.{1}
$
""", re.VERBOSE + re.UNICODE)
line = "A sentence."
match = re.match(pattern, line)
>>> print("'%s'" % match.group(0))
'A sentence.'
>>> print("'%s'" % match.group(1))
'A '
>>> print("'%s'" % match.group(2))
'sentence'
To return the entire match (line in this case), use match.group(0).
Because the first group is repeated (it matches once for each word except the last one), it keeps only the text from its final repetition, so match.group(1) gives you just the next-to-last word (plus its trailing whitespace).
Btw, the {1} notation is not necessary in this case, matching once and only once is the default behavior, so this bit can be removed.
The extra set of square brackets definitely weren't helping you :)
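The point about a repeated group keeping only its last repetition can be sketched as follows. This assumes \w as a stand-in for the explicit letter class, and the second pattern (an outer group around a non-capturing repetition) is one possible way to capture all leading words, not something from the answer above:

```python
import re

line = "This is a sentence."

# A capturing group inside a repetition keeps only its last iteration.
m_last = re.match(r"^(?:(\w+)\s)+(\w+)\.$", line)
print(m_last.group(1))   # a
print(m_last.group(2))   # sentence

# Wrapping the whole repetition in one group captures every leading word.
m_all = re.match(r"^((?:\w+\s)+)(\w+)\.$", line)
print(repr(m_all.group(1)))   # 'This is a '
print(m_all.group(0))         # This is a sentence.
```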

It turns out the following actually works and includes all the extended ascii characters I wanted
^
([\w+\s{1}]+\w{1}\.{1})
$
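One caveat worth checking with this pattern: inside [...] the characters +, {, 1 and } are literals rather than quantifiers, so it accepts more than sentences. The sketch below illustrates this; the "tighter" pattern is my own suggestion under the assumption that words are runs of \w separated by single whitespace characters:

```python
import re

loose = re.compile(r"^([\w+\s{1}]+\w{1}\.{1})$")
# Hypothetical tighter variant: quantifiers moved outside the character class.
tighter = re.compile(r"^((?:\w+\s)+\w+\.)$")

print(loose.match("A sentence.").group(1))    # A sentence.
print(bool(loose.match("a+{1}b.")))           # True: "+", "{", "1", "}" are in the class
print(tighter.match("A sentence.").group(1))  # A sentence.
print(bool(tighter.match("a+{1}b.")))         # False
```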

Related

Python Find entire word in string using regex and user input

I'm trying to match an exact whole word using a regex, where the word I'm searching for is a variable value coming from user input. I've tried this:
regex = r"\b(?=\w)" + re.escape(user_input) + r"\b"
if re.match(regex, string_to_search[i], re.IGNORECASE):
<some code>...
but it matches every occurrence of the string. It matches "var"->"var" which is correct but also matches "var"->"var"iable and I only want it to match "var"->"var" or "string"->"string"
Input: "sword"
String_to_search = "There once was a swordsmith that made a sword"
Desired output: Match "sword" to "sword" and not "swordsmith"

It seems you want a pattern that matches an entire string. Note that the \b word boundary is needed when you want to find partial matches. When you need a full string match, you need anchors. Since re.match anchors the match at the start of the string, all you need is $ (end of string position) at the end of the pattern:
regex = '{}$'.format(re.escape(user_input))
and then use
re.match(regex, search_string, re.IGNORECASE)

You can try re.finditer like that:
>>> import re
>>> user_input = "var"
>>> text = "var variable var variable"
>>> regex = r"(?=\b%s\b)" % re.escape(user_input)
>>> [m.start() for m in re.finditer(regex, text)]
[0, 13]
It'll find all matches iteratively.
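If the goal is to find the word anywhere inside a longer string, as in the "sword" example from the question, a sketch along these lines may help. It assumes re.search is acceptable in place of re.match, and the variable names are illustrative:

```python
import re

user_input = "sword"  # stands in for real user input
text = "There once was a swordsmith that made a sword"

# re.escape keeps any regex metacharacters in the input literal;
# \b on both sides rejects "swordsmith".
word_re = re.compile(r"\b" + re.escape(user_input) + r"\b", re.IGNORECASE)

m = word_re.search(text)
print(m.group(), m.start())

# re.match only tries position 0, so it finds nothing here.
print(word_re.match(text))  # None
```

Note that \b only behaves as a word boundary when the input starts and ends with word characters.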

repetition in regular expression in python

I've got a file with lines for example:
aaa$bb$ccc$ddd$eee
fff$ggg$hh$iii$jj
I need to take what is inside $$ so expected result is:
$bb$
$ddd$
$ggg$
$iii$
My result:
$bb$
$ggg$
My solution:
m = re.search(r'$(.*?)$', line)
if m is not None:
    print(m.group(0))
Any ideas how to improve my regexp? I was trying with * and + sign, but I'm not sure how to finally create it.
I searched for a similar post but couldn't find one :(

You can use re.findall with r'\$[^$]+\$' regex:
import re
line = """aaa$bb$ccc$ddd$eee
fff$ggg$hh$iii$jj"""
m = re.findall(r'\$[^$]+\$', line)
print(m)
# => ['$bb$', '$ddd$', '$ggg$', '$iii$']
Note that you need to escape $s and remove the capturing group for the re.findall to return the $...$ substrings, not just what is inside $s.
Pattern details:
\$ - a dollar symbol (literal)
[^$]+ - 1 or more symbols other than $
\$ - a literal dollar symbol.
NOTE: The [^$] is a negated character class that matches any char but the one(s) defined in the class. Using a negated character class here speeds up matching since .*? lazy dot pattern expands at each position in the string between two $s, thus taking many more steps to complete and return a match.
And a variation of the pattern to get only the texts inside $...$s:
re.findall(r'\$([^$]+)\$', line)
Note the (...) capturing group: when the pattern contains a capturing group, re.findall returns only what is captured, not the whole match.

re.search finds only the first match. Perhaps you'd want re.findall, which returns list of strings, or re.finditer that returns iterator of match objects. Additionally, you must escape $ to \$, as unescaped $ means "end of line".
Example:
>>> re.findall(r'\$.*?\$', 'aaa$bb$ccc$ddd$eee')
['$bb$', '$ddd$']
>>> re.findall(r'\$(.*?)\$', 'aaa$bb$ccc$ddd$eee')
['bb', 'ddd']
One more improvement would be to use [^$]* instead of .*?; the former means "zero or more characters other than $", which can potentially avoid pathological backtracking behaviour.

Your regex is almost right: the $ characters need to be escaped as \$, since an unescaped $ means "end of string". Beyond that, re.search only finds the first match in a line. You are looking for re.findall or re.finditer, which find all non-overlapping matches. That last bit is important for you since you have the same start and end delimiter.
for m in re.finditer(r'\$(.*?)\$', line):
    print(m.group(0))
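Why "non-overlapping" matters here can be sketched as follows. The lookahead variant is only for contrast and is my own illustration, not something proposed in the answers:

```python
import re

line = "aaa$bb$ccc$ddd$eee"

# Non-overlapping: the closing $ of one match is consumed and cannot
# open the next one, so "ccc" is skipped.
pairs = re.findall(r"\$[^$]+\$", line)
print(pairs)        # ['$bb$', '$ddd$']

# For contrast: a lookahead consumes nothing, so every $ can start a match.
overlapping = re.findall(r"(?=(\$[^$]+\$))", line)
print(overlapping)  # ['$bb$', '$ccc$', '$ddd$']
```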

Python Regex - non-greedy match does not work

I have a flat file with one C++ function name and part of its declaration like this:
virtual void NameSpace1::NameSpace2::ClassName1::function_name1(int arg1) const
void function_name2
void NameSpace2::NameSpace4::ClassName2::function_name3
function_name4
I am trying to extract the function names alone by using this line:
fn_name = re.match(":(.*?)\(?", lines)
I can understand why function_name2 and function_name4 do not match (because there is no leading :). But I am seeing that even for function_name1 and function_name3, it does not do a non-greedy match. The output of fn_name.group() is
:NameSpace2::ClassName1::function_name1
I have three questions:
I expected just the string "function_name1" to be extracted from line 1, but the non-greedy match does not seem to work. Why?
Why is line 3 not being extracted?
How do I get the function names from all the lines using a single regex?
Please help.

This works pretty well, with your example at least:
^(?:\w+ +)*(?:\w+::)*(\w+)
i.e., in Python code:
import re
function_name = re.compile(r'^(?:\w+ +)*(?:\w+::)*(\w+)', re.MULTILINE)
matches = function_name.findall(your_txt)
# -> ['function_name1', 'function_name2', 'function_name3', 'function_name4']
Takeaway: If you can do it with greedy matching, do it with greedy matching.
Note that \w is not strictly correct for a C identifier (for example, it also allows a leading digit), but writing down the technically correct character class is beside the point here. Find and use the correct set of characters instead of \w.
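As a rough sketch of that last suggestion, the pattern can be built from an explicit identifier class. The class below covers only plain ASCII identifiers, which is an assumption on my part and not a complete C++ grammar:

```python
import re

declarations = """\
virtual void NameSpace1::NameSpace2::ClassName1::function_name1(int arg1) const
void function_name2
void NameSpace2::NameSpace4::ClassName2::function_name3
function_name4"""

# Assumed identifier shape: letter or underscore, then letters, digits, underscores.
ident = r"[A-Za-z_][A-Za-z0-9_]*"
function_name = re.compile(
    rf"^(?:{ident} +)*(?:{ident}::)*({ident})", re.MULTILINE
)

names = function_name.findall(declarations)
print(names)
# ['function_name1', 'function_name2', 'function_name3', 'function_name4']
```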

1) Always use r" " strings for regexes.
2)
I am trying to extract the function names alone by using this line:
fn_name = re.match(":(.*?)\(?", lines)
The output of fn_name.group() is
:NameSpace2::ClassName1::function_name1
I'm not seeing that:
import re
line = "virtual void NameSpace1::NameSpace2::ClassName1::function_name1(int arg1) const"
fn_name = re.search(r":(.*?)\(?", line)
print(fn_name.group())
--output:--
:
In any case, if you want to see how non-greedy works, look at this code:
import re
line = "N----1----2"
greedy_pattern = r"""
N
.*
\d
"""
match_obj = re.search(greedy_pattern, line, flags=re.X)
print(match_obj.group())
non_greedy_pattern = r"""
N
.*?
\d
"""
match_obj = re.search(non_greedy_pattern, line, flags=re.X)
print(match_obj.group())
--output:--
N----1----2
N----1
The non-greedy .*? consumes as few characters as possible, so the match stops at the first digit that is encountered, while the greedy .* consumes as many characters as possible, so the match extends to the last digit.
3) Warning! No regex zone!
func_names = [
"virtual void NameSpace1::NameSpace2::ClassName1::function_name1(int arg1) const",
"void function_name2",
"void NameSpace2::NameSpace4::ClassName2::function_name3",
"function_name4",
]
for func_name in func_names:
    name = func_name.rsplit("::", 1)[-1]
    pieces = name.rsplit(" ", 1)
    if pieces[-1] == "const":
        name = pieces[-2]
    else:
        name = pieces[-1]
    name = name.split('(', 1)[0]
    print(name)
--output:--
function_name1
function_name2
function_name3
function_name4

I expected just the string "function_name1" to be extracted from line 1, but the non-greedy match does not seem to work. Why?
This is the result from your regex ":(.*?)\(?"
I think your regex is "too lazy". It will match only : because (.*?) means "match any characters, as few as possible", so the regex engine chooses to match zero characters. It will not continue up to \(? as you expected, because ? just makes the parenthesis optional.
Why is line 3 not being extracted?
I've tested your regex, and it doesn't work on any of the lines, not only the third one.
How do I get the function names from all the lines using a single regex?
You can start from this minimal example
(?:\:\:|void\s+)(\w+)(?:\(|$)|(function_name4)
Here (?:\:\:|void\s+) stands for whatever precedes your function name, and (?:\(|$) stands for whatever follows it.
Note that function_name4 has to be listed explicitly because nothing around it fits the pattern.
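A small sketch of how this pattern could be applied line by line. The loop and variable names are my own, and the escapes on : are dropped since they are not needed:

```python
import re

lines = [
    "virtual void NameSpace1::NameSpace2::ClassName1::function_name1(int arg1) const",
    "void function_name2",
    "void NameSpace2::NameSpace4::ClassName2::function_name3",
    "function_name4",
]

name_re = re.compile(r"(?:::|void\s+)(\w+)(?:\(|$)|(function_name4)")

names = []
for line in lines:
    m = name_re.search(line)
    if m:
        # Group 1 is the general case, group 2 the hard-coded fallback.
        names.append(m.group(1) or m.group(2))

print(names)
# ['function_name1', 'function_name2', 'function_name3', 'function_name4']
```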

I've been stumped before by something similar when trying to capture the "N----1" from "N foo bar N----1----2". Adding a leading .* gave the desired result.
import re
line = "N foo bar N----1----2"
match_obj = re.search(r'(N.*?\d)', line)
print(match_obj.group(1))
match_obj = re.search(r'.*(N.*?\d)', line)
print(match_obj.group(1))
--output:--
N foo bar N----1
N----1

Python regex group() works explanation

Could someone please explain why each print below gives a different result? Thanks.
import re
s = "-h5ello"
m = re.match(r"-\w(\d\w+)", s)
print(' m.group(): ', m.group())
print(' m.group(0): ', m.group(0))
print(' m.group(1): ', m.group(1))

m.group() and m.group(0) simply return the entire matched text if there was a match.
The reason they're identical is that the group number defaults to zero, conceptually:
def group(num=0):
As for the groups:
m.group(1), m.group(2), ... return the text captured by each parenthesized group (in your case there's only one).
More about match groups can be found in the re module docs.

m.group() and m.group(0) should be, and are, identical.
m.group(1) only gives you the match from inside the first pair of parentheses.
EDIT to clarify what a "matched group" is:
In regular expressions, plain parentheses are called "captures". The reason for this is the fact that they capture submatches into capture groups. Consider this:
import re
m = re.match(r'a(b)c(d(e)f)g', 'abcdefg')
print(m.group())
# => 'abcdefg'
print(m.groups())
# => ('b', 'def', 'e')
m.group(0), or equivalently m.group(), is the whole match. Parentheses pick out submatches, numbered by the position of their opening parenthesis: the first pair yields m.group(1), the second m.group(2), and the third m.group(3).
In your example, you have parentheses too. They do not include -\w, so your m.group(1) does not include -h part of your string - they only include the submatch for \d\w+, which is 5ello.
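Applied to the string from the question, this works out as in the sketch below (written with print() and a raw string, which I am assuming is acceptable for your Python version):

```python
import re

s = "-h5ello"
m = re.match(r"-\w(\d\w+)", s)

print(m.group())    # -h5ello   (whole match)
print(m.group(0))   # -h5ello   (same thing)
print(m.group(1))   # 5ello     (only what the parentheses captured)
print(m.groups())   # ('5ello',)
print(m.span(1))    # (2, 7)    (where group 1 sits in s)
```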

Capturing emoticons using regular expression in python

I would like to have a regex pattern that matches the smileys ":)" and ":(". It should also capture repeated smileys like ":) :)" and ":) :(", but filter out invalid syntax like ":( (".
I have this, but it matches ":( (":
bool( re.match("(:\()",str) )
I may be missing something obvious here, and I'd like some help with this seemingly simple task.

I think it finally "clicked" exactly what you're asking about here. Take a look at the below:
import re
smiley_pattern = r'^(:\(|:\))+$' # matches only the smileys ":)" and ":("
def test_match(s):
    print('Value: %s; Result: %s' % (
        s,
        'Matches!' if re.match(smiley_pattern, s) else "Doesn't match."
    ))
should_match = [
':)', # Single smile
':(', # Single frown
':):)', # Two smiles
':(:(', # Two frowns
':):(', # Mix of a smile and a frown
]
should_not_match = [
'', # Empty string
':(foo', # Extraneous characters appended
'foo:(', # Extraneous characters prepended
':( :(', # Space between frowns
':( (', # Extraneous characters and space appended
':((' # Extraneous duplicate of final character appended
]
print('The following should all match:')
for x in should_match:
    test_match(x)
print('') # Newline for output clarity
print('The following should all not match:')
for x in should_not_match:
    test_match(x)
The problem with your original code is that your regex is wrong: (:\(). Let's break it down.
The outside parentheses are a "grouping". They're what you'd reference if you were going to do a string replacement, and are used to apply regex operators on groups of characters at once. So, you're really saying:
( begin a group
:\( ... do regex stuff ...
) end the group
The : isn't a regex reserved character, so it's just a colon. The \ is, and it means "the following character is literal, not a regex operator". This is called an "escape sequence". Fully parsed into English, your regex says
( begin a group
: a colon character
\( a left parenthesis character
) end the group
The regex I used is slightly more complex, but not bad. Let's break it down: ^(:\(|:\))+$.
^ and $ mean "the beginning of the line" and "the end of the line" respectively. Now we have ...
^ beginning of line
(:\(|:\))+ ... do regex stuff ...
$ end of line
... so it only matches things that comprise the entire line, not simply occur in the middle of the string.
We know that ( and ) denote a grouping. + means "one or more of these". Now we have:
^ beginning of line
( start a group
:\(|:\) ... do regex stuff ...
) end the group
+ match one or more of this
$ end of line
Finally, there's the | (pipe) operator. It means "or". So, applying what we know from above about escaping characters, we're ready to complete the translation:
^ beginning of line
( start a group
: a colon character
\( a left parenthesis character
| or
: a colon character
\) a right parenthesis character
) end the group
+ match one or more of this
$ end of line
I hope this helps. If not, let me know and I'll be happy to edit my answer with a reply.

Maybe something like:
re.match('[:;][)(](?![)(])', str)

Try (?::|;|=)(?:-)?(?:\)|\(|D|P). Haven't tested it extensively, but does seem to match the right ones and not more...
In [15]: import re
In [16]: s = "Just: to :)) =) test :(:-(( ():: :):) :(:( :P ;)!"
In [17]: re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',s)
Out[17]: [':)', '=)', ':(', ':-(', ':)', ':)', ':(', ':(', ':P', ';)']

I got the answer I was looking for from the comments and answers posted here.
re.match("^(:[)(])*$",str)
Thanks to all.
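A short sketch of how this pattern behaves at the edges. Note that * also accepts an empty string; the "strict" and "spaced" variants are my own assumptions about what might be wanted (the question mentions ":) :)" with a space), not patterns taken from the answers:

```python
import re

accepted = re.compile(r"^(:[)(])*$")   # pattern from the post above
strict = re.compile(r"^(:[)(])+$")     # requires at least one smiley
# Hypothetical variant: smileys optionally separated by a single space.
spaced = re.compile(r"^:[)(](?: ?:[)(])*$")

print(bool(accepted.match("")))       # True: zero repetitions are allowed
print(bool(strict.match("")))         # False
print(bool(accepted.match(":( (")))   # False
print(bool(accepted.match(":):(")))   # True
print(bool(spaced.match(":) :(")))    # True
print(bool(spaced.match(":( (")))     # False
```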
