extract words before optional parentheses and inside parentheses - python

I have a bunch of strings that look like the following two sentences:
A couple of words (abbreviation)
A couple of words
I am trying to get python to extract the 'a couple of words' part and the 'abbreviation' part with a single regex, while also allowing strings where no abbreviation is given.
I've come up with this:
re_both = re.compile(r"^(.*)(?:\((.*)\))$")
It works for the first case, but not for the second case:
[in] re_both.findall('a couple of words (abbreviation)')
[out] [('a couple of words ', 'abbreviation')]
[in] re_both.findall('a couple of words')
[out] []
I would like the second case to yield:
[out] [('a couple of words','')]
Can this be done somehow?

Your regex is fine, except that you have to make the second part optional and the first part non-greedy:
re_both = re.compile(r"^(.*?)(?:\((.*)\))?$")
# non-greedy: the ? right after .* ; optional: the ? after the non-capturing group

You need to make the second part optional by adding the quantifier ? after it, and you also need to add a ? just after .* inside the first capturing group so that it matches non-greedily.
^(.*?)(?:\((.*)\))?$
If you don't want the first capturing group to capture the space just before the (, you could try the regex below:
^(.*?)(?: \((.*)\))?$
>>> import re
>>> s = """A couple of words (abbreviation)
... A couple of words"""
>>> m = re.findall(r'^(.*?)(?: \((.*)\))?$', s, re.M)
>>> m
[('A couple of words', 'abbreviation'), ('A couple of words', '')]
>>> m = re.findall(r'^(.*?)(?:\((.*)\))?$', s, re.M)
>>> m
[('A couple of words ', 'abbreviation'), ('A couple of words', '')]
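As a rough sketch of how the second pattern above might be wrapped for single strings (the helper name split_abbreviation is made up for illustration, and the input is assumed to be one line without newlines):

```python
import re

# Pattern taken from the answer above: lazy first group, optional " (...)" tail.
RE_BOTH = re.compile(r"^(.*?)(?: \((.*)\))?$")

def split_abbreviation(text):
    """Return (words, abbreviation); abbreviation is '' when absent."""
    match = RE_BOTH.match(text)
    if match is None:  # e.g. multi-line input, which this sketch does not handle
        return text, ""
    # default="" turns the unmatched optional group into an empty string
    return match.groups(default="")

print(split_abbreviation("A couple of words (abbreviation)"))
print(split_abbreviation("A couple of words"))
```

Using match.groups(default="") gives the empty string asked for in the question instead of None.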

Related

Why can I not use re.sub to replace a group?

My goal is to find a group in a string using regex and replace it with a space.
The group I am looking to find is a group of symbols only when they fall between strings. When I use re.findall() it works exactly as expected
word = 'This##Is # A # Test#'
print(word)
re.findall(r"[a-zA-Z\s]*([\$\#\%\!\s]*)[a-zA-Z]",word)
>>> ['##', '# ', '# ', '']
But when I use re.sub(), instead of replacing the group, it replaces the entire regex.
x = re.sub(r"[a-zA-Z\s]*([\$\#\%\!\s]*)[a-zA-Z]",r' ',word)
print(x)
>>> '    #'
How can I use regular expressions to replace ONLY the group? The outcome I expect is:
'This Is A Test#'
First, there's no need to escape every "magic" character within a character class, [$#%!\s]* is equally fine and much more readable.
Second, matching (i.e. retrieving) is different from replacing and you could use backreferences to achieve your goal.
Third, if you only want to have # at the end, you could help yourself with a much easier expression:
(?:[\s#](?!\Z))+
Which would then need to be replaced by a space, see a demo on regex101.com.
In Python this could be:
import re
string = "This##Is # A # Test#"
rx = re.compile(r'(?:[\s#](?!\Z))+')
new_string = rx.sub(' ', string)
print(new_string)
# This Is A Test#
You can group the portions of the pattern you want to retain and use backreferences in your replacement string instead:
x = re.sub(r"([a-zA-Z\s]*)[\$\#\%\!\s]*([a-zA-Z])", r'\1 \2', word)
The problem is that your regex matches the wrong thing entirely.
x = re.sub(r'\b[$#%!\s]+\b', ' ', word)
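re.sub always replaces the whole match, not a capturing group, so one way to touch only the symbols is to keep the surrounding letters out of the match with lookarounds. A minimal sketch under that assumption (the lookaround pattern is an illustration, not taken from the answers above):

```python
import re

word = 'This##Is # A # Test#'

# The letters are only asserted, not consumed, so they survive the substitution.
symbols_between_letters = re.compile(r'(?<=[a-zA-Z])[$#%!\s]+(?=[a-zA-Z])')

result = symbols_between_letters.sub(' ', word)
print(result)  # This Is A Test#
```

The trailing # is left alone because no letter follows it, so the lookahead fails there.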

python regular expression grouping

My regular expression goal:
"If the sentence has a '#' in it, group all the stuff to the left of the '#' and group all the stuff to the right of the '#'. If the sentence doesn't have a '#', then just return the entire sentence as one group"
Examples of the two cases:
A) '120x4#Words' -> ('120x4', 'Words')
B) '120x4#9.5' -> ('120x4#9.5')
I made a regular expression that parses case A correctly
(.*)(?:#(.*))
# List the groups found
>>> r.groups()
(u'120x4', u'words')
But of course this won't work for case B -- I need to make "# and everything to the right of it" optional
So I tried to use the '?' "zero or one" operator on that second grouping to indicate it's optional.
(.*)(?:#(.*))?
But it gives me bad results. The first grouping eats up the entire string.
# List the groups found
>>> r.groups()
(u'120x4#words', None)
Guess I'm either misunderstanding the none-or-one '?' operator and how it works on groupings or I am misunderstanding how the first group is acting greedy and grabbing the entire string. I did try to make the first group 'reluctant', but that gave me a total no-match.
(.*?)(?:#(.*))?
# List the groups found
>>> r.groups()
(u'', None)
Simply use the standard str.split function:
s = '120x4#Words'
x = s.split( '#' )
If you still want a regex solution, use the following pattern:
([^#]+)(?:#(.*))?
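A short check of that pattern with re.match. The reluctant attempt in the question returned an empty group because, without an end anchor, (.*?) is satisfied with zero characters; the anchored variant below (an illustration, not part of the answer) adds a trailing $ to force it to expand up to the '#' or the end of the string.

```python
import re

pattern = re.compile(r'([^#]+)(?:#(.*))?')
print(pattern.match('120x4#Words').groups())  # ('120x4', 'Words')
print(pattern.match('120x4').groups())        # ('120x4', None)

# Variant of the question's reluctant pattern with an end anchor added.
anchored = re.compile(r'(.*?)(?:#(.*))?$')
print(anchored.match('120x4#Words').groups())  # ('120x4', 'Words')
print(anchored.match('120x4').groups())        # ('120x4', None)
```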
(.*?)#(.*)|(.+)
This should work. See the demo:
http://regex101.com/r/oC3nN4/14
Use re.split:
>>> import re
>>> a='120x4#Words'
>>> re.split('#',a)
['120x4', 'Words']
>>> b='120x4#9.5'
>>> re.split('#',b)
['120x4#9.5']
>>>
Here's a verbose re solution. But, you're better off using str.split.
import re
REGEX = re.compile(r'''
    \A
    (?P<left>.*?)
    (?:
        [#]
        (?P<right>.*)
    )?
    \Z
''', re.VERBOSE)
def parse(text):
    match = REGEX.match(text)
    if match:
        return tuple(filter(None, match.groups()))
print(parse('120x4#Words'))
print(parse('120x4#9.5'))
Better solution
def parse(text):
    return text.split('#', maxsplit=1)
print(parse('120x4#Words'))
print(parse('120x4#9.5'))
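If a tuple like the one in the question is wanted, str.partition is another option. The sketch below (the function name split_on_hash is hypothetical) returns one element when there is no '#' and two otherwise:

```python
def split_on_hash(text):
    # partition splits at the first '#' only; sep is '' when no '#' is found
    left, sep, right = text.partition('#')
    return (left, right) if sep else (left,)

print(split_on_hash('120x4#Words'))  # ('120x4', 'Words')
print(split_on_hash('120x4'))        # ('120x4',)
```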

Match any word in string except those preceded by a curly brace in python

I have a string like
line = u'I need to match the whole line except for {thisword for example'
I'm having difficulty doing this. Here is what I've tried, which doesn't work:
# in general case there will be Unicode characters in the pattern
matchobj = re.search(ur'[^\{].+', line)
matchobj = re.search(ur'(?!\{).+', line)
Could you please help me figure out what's wrong and how to do it right?
P.S. I don't think I need to substitute "{thisword" with empty string
I am not exactly clear on what you need. From your question title, it looks like you want to find all words in a string (e.g. line) that don't start with {, but you are using the re.search() function, which confuses me.
re.search() and re.findall()
The function re.search() returns a corresponding MatchObject instance; re.search is usually used to find and return a pattern in a long string. It doesn't return all possible matches. See the simple example below:
>>> re.search('a', 'aaa').group(0) # only first match
'a'
>>> re.search('a', 'aaa').group(1) # no group 1: the pattern has no capturing group, and search() stops at the first match
Traceback (most recent call last):
File "<console>", line 1, in <module>
IndexError: no such group
With the regex 'a', search returns only one match 'a' in the string 'aaa'; it doesn't return all possible matches.
If your objective is to find all words in a string that don't start with {, you should use the re.findall() function, which matches all occurrences of a pattern, not just the first one as re.search() does. See the example:
>>> re.findall('a', 'aaa')
['a', 'a', 'a']
Edit: Based on a comment, here is one more example to demonstrate the use of re.search and re.findall:
>>> re.search('a+', 'not itnot baaal laaaaaaall ').group()
'aaa' # returns only the first run of a's
>>> re.findall('a+', 'not itnot baaal laaaaaaall ')
['aaa', 'aaaaaaa'] # returns both runs
Here is a good tutorial for Python re module: re – Regular Expressions
Additionally, there is the concept of a group in Python regex: a pattern enclosed in parentheses. If one or more groups are present in your regex pattern, re.findall() returns a list of groups; this will be a list of tuples if the pattern has more than one group. See below:
>>> re.findall('(a(b))', 'abab') # 2 groups according to 2 pair of ( )
[('ab', 'b'), ('ab', 'b')] # list of tuples of groups captured
In Python, the regex (a(b)) contains two groups, one per pair of parentheses (this is unlike regular expressions in formal languages; regexes are not exactly the same as regular expressions in formal languages, but that is a different matter).
Answer: The words in the sentence line are separated by spaces (or sit at the start of the string), so the regex should be:
ur"(^|\s)(\w+)"
Regex description:
(^|\s) means: the word is either at the start of the string or follows a whitespace character.
\w+: matches one or more alphanumeric characters, including "_".
On applying regex r to your line:
>>> import pprint # for pretty-print, you can ignore these two lines
>>> pp = pprint.PrettyPrinter(indent=4)
>>> r = ur"(^|\s)(\w+)"
>>> L = re.findall(r, line)
>>> pp.pprint(L)
[ (u'', u'I'),
(u' ', u'need'),
(u' ', u'to'),
(u' ', u'match'),
(u' ', u'the'),
(u' ', u'whole'),
(u' ', u'line'),
(u' ', u'except'),
(u' ', u'for'), # notice 'for' after 'for'
(u' ', u'for'), # '{thisword' is not included
(u' ', u'example')]
>>>
To find all words in a single line use:
>>> [t[1] for t in re.findall(r, line)]
Note: this skips { and any other special character in line, because \w only matches alphanumeric characters and '_'.
If you specifically want to skip a word only when { appears at its start (a { in the middle is allowed), then use the regex: r = ur"(^|\s+)(?P<word>[^{]\S*)".
To understand the difference between this regex and the previous one, check this example:
>>> r = ur"(^|\s+)(?P<word>[^{]\S*)"
>>> [t[1] for t in re.findall(r, "I am {not yes{ what")]
['I', 'am', 'yes{', 'what']
Without Regex:
You could achieve same thing simply without any regex as follows:
>>> [w for w in line.split() if w[0] != '{']
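The ur'' prefix used above is Python 2 syntax and is a syntax error in Python 3, where plain str patterns are already Unicode. A minimal Python 3 sketch of the same idea follows; the slightly stricter pattern ([^{\s] instead of [^{], so that the first character of a word can never be whitespace) is an assumption of this sketch, not the answer's exact regex:

```python
import re

line = 'I need to match the whole line except for {thisword for example'

# A word starts at the beginning of the string or after whitespace,
# and its first character is neither '{' nor whitespace.
pattern = re.compile(r'(?:^|\s)([^{\s]\S*)')

words = pattern.findall(line)
print(words)

# The regex-free version gives the same list for this input.
assert words == [w for w in line.split() if not w.startswith('{')]
```

Because the pattern has exactly one capturing group, findall returns a flat list of words rather than a list of tuples.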
re.sub() to replace pattern
If you just want to remove one (or more) words that start with {, you should use re.sub() to replace the patterns starting with { with the empty string "". Check the following code:
>>> r = ur"{\w+"
>>> re.findall(r, line)
[u'{thisword']
>>> re.sub(r, "", line)
u'I need to match the whole line except for for example'
Edit: Reply to a comment:
(?P<name>...) is a Python regex extension (it has this meaning in Python). Similar to regular parentheses, it creates a group, but a named one. The group is accessible via its symbolic group name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. Example 1:
>>> r = "(?P<capture_all_A>A+)"
>>> mo = re.search(r, "aaaAAAAAAbbbaaaaa")
>>> mo.group('capture_all_A')
'AAAAAA'
Example 2: suppose you want to extract the name from a line that may also contain a title (e.g. mr). Use the regex: name_re = "(?P<title>(mr|ms)\.?)? ?(?P<name>[a-z ]*)"
We can then read the name from the input string using group('name'):
>>> re.search(name_re, "mr grijesh chauhan").group('name')
'grijesh chauhan'
>>> re.search(name_re, "grijesh chauhan").group('name')
'grijesh chauhan'
>>> re.search(name_re, "ms. xyz").group('name')
'xyz'
You can simply do:
(?<!{)(\b\w+\b) with the g flag enabled (all matches)
Demo: http://regex101.com/r/zA0sL6
Try this pattern:
(.*)(?:\{\w+)\s(.*)
Code:
import re
p = re.compile(r'(.*)(?:\{\w+)\s(.*)')
# named line rather than str, to avoid shadowing the built-in
line = "I need to match the whole line except for {thisword for example"
p.match(line)
Example:
http://regex101.com/r/wR8eP6

How to explode sentences with "。" but ignore the "。" in the double quotation marks

I am writing a program to extract the abstract of a Chinese article. First I have to split the text into sentences at symbols like “。!?”.
In Chinese articles, quoted speech is marked with double quotation marks, and the quoted words may contain "。", which should not be used as a split point. For example, take the following sentence:
他说:“今天天气很好。我很开心。”
It will be exploded into three sentences:
他说:“今天天气很好
我很开心
”
The result is wrong, but how can I solve it?
I have tried using regular expressions, but I am not good at them, so I couldn't figure it out.
PS: I am writing this program in Python 3.
Instead of splitting, I’m matching all sentences using re.findall:
>>> s = '今天天气很好。今天天气很好。今天天气很好。他说:“今天天气很好。我很开心。”'
>>> re.findall('[^。“]+(?:。|“.*?”)', s)
['今天天气很好。', '今天天气很好。', '今天天气很好。', '他说:“今天天气很好。我很开心。”']
If you want to accept those other characters as separators too, try this:
>>> re.findall('[^。?!;~“]+(?:[。?!;~]|“.*?”)', s)
Use a regex:
import re
st=u'''\
今天天气很好。今天天气很好。bad? good! 今天天气很好。他说:“今天天气很好。我很开心。”
Sentence one. Sentence two! “Sentence three. Sentence four.” Sentence five?'''
pat=re.compile(r'(?:[^“。?!;~.]*?[?!。.;~])|(?:[^“。?!;~.]*?“[^”]*?”)')
print(pat.findall(st))
Prints:
['今天天气很好。', '今天天气很好。', 'bad?', ' good!', ' 今天天气很好。',
'他说:“今天天气很好。我很开心。”', '\nSentence one.', ' Sentence two!',
' “Sentence three. Sentence four.”', ' Sentence five?']
And if you want the effect of a split (ie, won't include the delimiter), just move the capturing parenthesis and then print the match group:
pat=re.compile(r'([^“。?!;~.]*?)[?!。.;~]|([^“。?!;~.]*?“[^”]*?”)')
# note: the first capturing group now closes before the delimiter class
print([t[0] if t[0] else t[1] for t in pat.findall(st)])
Prints:
['今天天气很好', '今天天气很好', 'bad', ' good', ' 今天天气很好',
'他说:“今天天气很好。我很开心。”', '\nSentence one', ' Sentence two',
' “Sentence three. Sentence four.”', ' Sentence five']
Or, use re.split with the same regex and then filter for True values:
print(list(filter(None, pat.split(st))))
First of all, I'll assume the double quotes can't be nested. Then it's quite easy to do this without some complicated regular expression. You just split on ", and then you split the even parts on your punctuation.
>>> sentence = 'a: "b. c" and d. But e said: "f? g."'
>>> sentence.split('"')
['a: ', 'b. c', ' and d. But e said: ', 'f? g.', '']
You can see how the even parts are the ones not between quotes. We'll use index % 2 == 1 to select the odd parts.
result = []
part = []
for i, p in enumerate(sentence.split('"')):
    if i % 2 == 1:
        part.append(p)
    else:
        parts = p.split('.')
        if len(parts) == 1:
            part.append(p)
        else:
            first, *rest, last = parts
            part.append(first)
            result.append('"'.join(part))
            result.extend(rest)
            part = [last]
result.append('"'.join(part))
I think you need to do this in two steps: first, find the dots inside the double quotes, and "protect" them (for example, replace them with a string like $%$%$%$ that is unlikely to appear in a Chinese text.). Next, explode the strings as before. Finally, replace the $%$%$%$ with a dot again.
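A rough Python sketch of these steps, assuming quotes are written as “…” and are not nested, and using '\x00' as the placeholder on the assumption that it never occurs in the text (the function name split_sentences is made up for illustration):

```python
import re

PLACEHOLDER = '\x00'  # assumption: never appears in the input text

def split_sentences(text):
    # Step 1: protect every 。 that sits inside a “…” quotation.
    protected = re.sub(
        r'“[^”]*”',
        lambda m: m.group(0).replace('。', PLACEHOLDER),
        text,
    )
    # Step 2: split on the remaining (unquoted) 。 and drop empty pieces.
    pieces = [p for p in protected.split('。') if p]
    # Step 3: restore the protected 。 characters.
    return [p.replace(PLACEHOLDER, '。') for p in pieces]

print(split_sentences('今天天气很好。他说:“今天天气很好。我很开心。”'))
```

Like the original explode, this drops the 。 delimiters outside quotes and keeps the ones inside.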
Maybe this will work:
$str = '他说:“今天天气很好。我很开心。”';
print_r( preg_split('/(?=(([^"]*"){2})*[^"]*$)。/u', $str, -1, PREG_SPLIT_NO_EMPTY) );
This makes sure that 。 is matched only when outside double quotes.
OUTPUT:
Array
(
[0] => 他说:“今天天气很好
[1] => 我很开心
[2] => ”
)

Confusing Behaviour of regex in Python

I'm trying to match a specific pattern using the re module in python.
I wish to match a full sentence (More correctly I would say that they are alphanumeric string sequences separated by spaces and/or punctuation)
Eg.
"This is a regular sentence."
"this is also valid"
"so is This ONE"
I've tried out various combinations of regular expressions, but I am unable to grasp the working of the patterns properly, with each expression giving me a different yet inexplicable result (I do admit I am a beginner, but still).
I've tried:
"((\w+)(\s?))*"
To the best of my knowledge this should match one or more alphanumerics greedily, followed by either one or no white-space character, and then it should match this entire pattern greedily. This is not what it seems to do, so clearly I am wrong, but I would like to know why. (I expected this to return the entire sentence as the result.)
The result I get for the first sample string mentioned above is [('sentence', 'sentence', ''), ('', '', ''), ('', '', ''), ('', '', '')].
"(\w+ ?)*"
I'm not even sure how this one should work. The official documentation (python help('re')) says that *, +, ? match x or x (greedy) repetitions of the preceding RE.
In such a case, is simply the space the preceding RE for '?', or is '\w+ ' the preceding RE? And what will be the RE for the '*' operator? The output I get with this is ['sentence'].
Others such as "(\w+\s?)+)" ; "((\w*)(\s??)) etc., which are basically variations of the same idea that the sentence is a set of alphanumerics followed by a single/finite number of white spaces, with this pattern repeated over and over.
Can someone tell me where I go wrong and why, and why the above expressions do not work the way I was expecting them to?
P.S. I eventually got "[ \w]+" to work for me, but with this I cannot limit the number of consecutive white-space characters.
Your reasoning about the regex is correct; your problem comes from using capturing groups with *. Here's an alternative:
>>> s="This is a regular sentence."
>>> import re
>>> re.findall(r'\w+\s?', s)
['This ', 'is ', 'a ', 'regular ', 'sentence']
In this case it might make more sense for you to use \b in order to match word boundaries.
>>> re.findall(r'\w+\b', s)
['This', 'is', 'a', 'regular', 'sentence']
Alternatively, you can match the entire sentence via re.match and use m.group(0) on the returned match object to get the whole match:
>>> r = r"((\w+)(\s?))*"
>>> s = "This is a regular sentence."
>>> import re
>>> m = re.match(r, s)
>>> m.group(0)
'This is a regular sentence'
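The effect of a repeated capturing group can be seen directly: each repetition overwrites the group, so only the last one is reported, and re.findall returns groups rather than whole matches whenever the pattern contains any. A small sketch; the non-capturing variant at the end is an illustration, not part of the answer above:

```python
import re

s = "This is a regular sentence."

m = re.match(r"((\w+)(\s?))*", s)
print(m.group(0))  # 'This is a regular sentence' (the whole match)
print(m.groups())  # ('sentence', 'sentence', '') (last repetition only)

# With non-capturing groups, findall returns the whole matches instead.
whole = re.findall(r"(?:\w+\s?)+", s)
print(whole)       # ['This is a regular sentence']
```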
Here's an awesome Regular Expression tutorial website:
http://regexone.com/
Here's a Regular Expression that will match the examples given:
([a-zA-Z0-9,\. ]+)
Why do you want to limit the number of consecutive white-space characters? A sentence can have any number of words (sequences of alphanumeric characters) and spaces in a row; what defines a sentence is that it ends with a punctuation mark, or more generally with something that is not in the above character set, white space included.
([a-zA-Z0-9\s])*
The above regex will match a sentence made up of alphanumeric characters and white space, repeated zero or more times. You can refine it to the following, though:
([a-zA-Z0-9])([a-zA-Z0-9\s])*
Which simply states that the above sequence must be preceded by an alphanumeric character.
Hope this is what you were looking for.
Maybe this will help:
import re
source = """
This is a regular sentence.
this is also valid
so is This ONE
how about this one  followed by this one
"""
# The third alternative contains two spaces: a literal space followed by " +".
re_sentence = re.compile(r'[^ \n.].*?(\.|\n|  +)')
def main():
    i = 0
    for s in re_sentence.finditer(source):
        print("%d:%s" % (i, s.group(0)))
        i += 1
if __name__ == '__main__':
    main()
I am using alternation in the expression (\.|\n| +) to describe the end-of-sentence condition. Note the use of two spaces in the third alternation. The second space has the '+' meta-character so that two or more spaces in a row will be an end-of-sentence.
