Understanding a regular expression output to get strings between matched strings

I have the following string:
'Commenter:\n\sabc\n<!-- one -->\ntext1<!-- two -- -- -->\ntext2<!-- three -->text3\nCommenter'.
Initially, I was trying to extract all comments with this regexp re.findall ( '<!--(.*?)-->', string, re.DOTALL) which gave me the proper output [' one ', ' two -- -- ', ' three '].
Then, I tried to get comments made by a particular user "abc" with the following regexp: re.findall ( 'Commenter.*abc.*<!--(.*?)-->.*Commenter', string, re.DOTALL) but I get only [' three '].
I am having trouble understanding the output. Can anyone please help?

There may be a smarter way to do this, but I'd use two regexes, like this:
>>> s='Commenter:\n\sabc\n<!-- one -->\ntext1<!-- two -- -- -->\ntext2<!-- three -->text3\nCommenter'
>>> t=re.search(r'Commenter:.*?abc[^<]*?(.*?)Commenter', s, re.DOTALL).group(1);t
'\n<!-- one -->\ntext1<!-- two -- -- -->\ntext2<!-- three -->text3\n'
>>> re.findall(r'<!([^>]*)>([^<]*)', t)
[('-- one --', '\ntext1'), ('-- two -- -- --', '\ntext2'), ('-- three --', 'text3\n')]

You just need to make the first and second .* in your regex match as little as possible. You can do that by adding the reluctant (lazy) quantifier ? right after the *:
>>> re.findall ( 'Commenter.*?abc.*?<!--(.*?)-->.*Commenter', s, re.DOTALL)
[' one ']
>>> re.findall ( 'Commenter.*?text1.*?<!--(.*?)-->.*Commenter', s, re.DOTALL)
[' two -- -- ']
>>> re.findall ( 'Commenter.*?text2.*?<!--(.*?)-->.*Commenter', s, re.DOTALL)
[' three ']
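To see why the original pattern returned only [' three '], it may help to run the greedy and lazy versions side by side. The sketch below assumes Python 3 and writes the question's string with an escaped backslash (\\s) so that it parses without a warning. The last part is one possible way to collect every comment in the block that follows "abc": isolate the block first, then run the original comment regex on it.

```python
import re

s = ('Commenter:\n\\sabc\n<!-- one -->\ntext1<!-- two -- -- -->\n'
     'text2<!-- three -->text3\nCommenter')

# Greedy: the second .* runs to the end of the string and backtracks only
# as far as the LAST '<!--', so the group captures the last comment.
greedy = re.findall(r'Commenter.*abc.*<!--(.*?)-->.*Commenter', s, re.DOTALL)

# Lazy: .*? stops at the FIRST '<!--' after 'abc'.
lazy = re.findall(r'Commenter.*?abc.*?<!--(.*?)-->.*Commenter', s, re.DOTALL)

# Two steps: cut out the text between 'abc' and the next 'Commenter',
# then collect every comment inside that block.
block = re.search(r'Commenter:.*?abc(.*?)Commenter', s, re.DOTALL)
comments = re.findall(r'<!--(.*?)-->', block.group(1), re.DOTALL) if block else []

print(greedy)    # [' three ']
print(lazy)      # [' one ']
print(comments)  # [' one ', ' two -- -- ', ' three ']
```

A single findall call cannot return all three comments with these patterns, because each match consumes the whole string up to the final Commenter.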

Related

Not getting required output using findall in python

I could not phrase the question precisely earlier. My apologies.
Below is what I am looking for:
I am reading a string like the one below from a file, and the file can contain several such strings.
" VEGETABLE 1
POTATOE_PRODUCE 1.1 1SIMLA(INDIA)
BANANA 1.2 A_BRAZIL(OR INDIA)
CARROT_PRODUCE 1.3 A_BRAZIL/AFRICA"
I want to capture the entire string as output using findall only.
My script:
import re
import string
f=open('log.txt')
contents = f.read()
output=re.findall('(VEGETABLE.*)(\s+\w+\s+.*)+',contents)
print output
Above script is giving output as
[('VEGETABLE 1', '\n CARROT_PRODUCE 1.3 A_BRAZIL/AFRICA')]
But contents in between are missing.

Solution in last snippet in this answer.
>>> import re
>>> str2='d1 talk walk joke'
>>> re.findall('(\d\s+)(\w+\s)+',str2)
[('1 ', 'walk ')]
The output is a list with a single match of the pattern. The tuple in the list contains two strings, one for each of the two groups written in () in the pattern. Because the second group is repeated with +, only its last repetition ('walk ') is kept.
Experiment 1
Removed the last '+', so the second group matches only once and captures the first word instead of the last repetition.
>>> re.findall('(\d\s+)(\w+\s)',str2)
[('1 ', 'talk ')]
Experiment 2
Added one more group to match another word followed by whitespace. This still captures a fixed number of items: if the string has more words followed by spaces, the extra ones are not captured.
>>> re.findall('(\d\s+)(\w+\s)(\w+\s)',str2)
[('1 ', 'talk ', 'walk ')]
Experiment 3
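Using '|' to match the pattern multiple times. Note that the tuples have disappeared, because the pattern no longer contains groups. Also note that the first match is not just the number: the scan starts at 'd', where \d fails, and \w+ then matches 'd1' because \w also matches digits.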
>>> re.findall('\d\s+|\w+\s+',str2)
['d1 ', 'talk ', 'walk ']
Final Experiment
>>> re.findall('\d\s+|[a-z]+\s+',str2)
['1 ', 'talk ', 'walk ']
Hope this helps.
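The experiments above rely on the fact that a repeated capturing group keeps only its last repetition. A common way around that, sketched here under the assumption of Python 3, is to put the repetition inside a non-capturing group and wrap the whole thing in one capturing group. The names last_only and all_words are only for illustration.

```python
import re

str2 = 'd1 talk walk joke'

# (\w+\s)+ repeats the capturing group, so only the last repetition survives.
last_only = re.findall(r'(\d\s+)(\w+\s)+', str2)

# Repeating a non-capturing group inside one capturing group keeps every
# repetition in a single string.
all_words = re.findall(r'(\d\s+)((?:\w+\s)+)', str2)

print(last_only)  # [('1 ', 'walk ')]
print(all_words)  # [('1 ', 'talk walk ')]
```

The same restructuring should apply to the pattern in the question, though it has not been checked here against a file that contains several VEGETABLE blocks.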

Finding longest consecutive sequence of a character

Assume we have a string like 'w q a a a a a e d a a', I would like to find the longest sequence of 'a' with length of at least 2, which is 'a a a a a' in the above example. I tried the following:
re.findall(r'(a a a*)', text)
but it only gives the shortest possible match. Then I tried:
re.findall(r'([^a] a a a* [^a])', text)
but the results for the above example string is empty. How can I do that?

That's because there are spaces between your a characters. You can use a character class that matches any combination of a and space with a length of 5 or more:
>>> re.findall(r'([a ]{5,})', text)
[' a a a a a ']
Also note that you don't need a capture group around the whole regex. Instead, you can use a non-capturing group containing a and a space, repeated at least twice (so that a single a is rejected). If you only want one match, you can use re.search(), which returns the first such run:
>>> M = re.search(r'(?:a ){2,}', text)
>>>
>>> M.group(0)
'a a a a a '
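Note that re.search() returns the first run, which is the longest one here only because of how the sample string is laid out. If the longest run can appear later in the string, one option is to collect all runs and pick the longest. This is a minimal sketch assuming Python 3; the helper name longest_run is made up for this example.

```python
import re

def longest_run(text):
    """Return the longest run of at least two space-separated 'a's, or None."""
    # \b keeps the pattern from matching an 'a' that is part of a longer word.
    runs = re.findall(r'\ba(?: a)+\b', text)
    return max(runs, key=len) if runs else None

print(longest_run('w q a a a a a e d a a'))  # a a a a a
print(longest_run('a a b a a a'))            # a a a
print(longest_run('b c'))                    # None
```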

extract words before optional parentheses and inside parentheses

I have a bunch of strings that look like the following two sentences:
A couple of words (abbreviation)
A couple of words
I am trying to get python to extract the 'a couple of words' part and the 'abbreviation' part with a single regex, while also allowing strings where no abbreviation is given.
I've come up with this:
re_both = re.compile(r"^(.*)(?:\((.*)\))$")
It works for the first case, but not for the second case:
[in] re_both.findall('a couple of words (abbreviation)')
[out] [('a couple of words ', 'abbreviation')]
[in] re_both.findall('a couple of words')
[out] []
I would like the second case to yield:
[out] [('a couple of words','')]
Can this be done somehow?

Your regex is fine, except that you have to make the second part optional and the first part non-greedy:
re_both = re.compile(r"^(.*?)(?:\((.*)\))?$")
#                   here __^      here __^

You need to make the second part optional by adding the quantifier ? after it, and you also need to add ? inside the first capturing group, right after .*, so that it does a non-greedy match.
^(.*?)(?:\((.*)\))?$
    ^             ^
If you don't want the first capturing group to include the space just before the (, you could try the regex below:
^(.*?)(?: \((.*)\))?$
>>> import re
>>> s = """A couple of words (abbreviation)
... A couple of words"""
>>> m = re.findall(r'^(.*?)(?: \((.*)\))?$', s, re.M)
>>> m
[('A couple of words', 'abbreviation'), ('A couple of words', '')]
>>> m = re.findall(r'^(.*?)(?:\((.*)\))?$', s, re.M)
>>> m
[('A couple of words ', 'abbreviation'), ('A couple of words', '')]
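For a single string at a time, the same idea can be wrapped in a small helper. This is only a sketch, assuming Python 3; the name split_abbreviation is hypothetical, and \s* is used instead of a literal space so that any amount of whitespace before the ( is dropped. Unlike findall, a match object returns None for a group that did not take part in the match, so the helper converts that to ''.

```python
import re

# Lazy first group, optional non-capturing group for the parenthesised part.
RE_BOTH = re.compile(r'^(.*?)(?:\s*\((.*)\))?$')

def split_abbreviation(text):
    """Return (words, abbreviation); abbreviation is '' when none is given."""
    m = RE_BOTH.match(text)
    if m is None:  # only happens for multi-line input, since . skips newlines
        return text, ''
    return m.group(1), m.group(2) or ''

print(split_abbreviation('A couple of words (abbreviation)'))
print(split_abbreviation('A couple of words'))
```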

Match any word in string except those preceded by a curly brace in python

I have a string like
line = u'I need to match the whole line except for {thisword for example'
I have a difficulty doing this. What I've tried and it doesn't work:
# in general case there will be Unicode characters in the pattern
matchobj = re.search(ur'[^\{].+', line)
matchobj = re.search(ur'(?!\{).+', line)
Could you please help me figure out what's wrong and how to do it right?
P.S. I don't think I need to substitute "{thisword" with empty string

I am not exactly clear about what you need. From your question title, it looks like you want to find all the words in a string (e.g. line) that don't start with {, but you are using the re.search() function, which confuses me.
re.search() and re.findall()
The function re.search() returns a corresponding MatchObject instance. re.search is usually used to find and return one occurrence of a pattern in a long string. It doesn't return all possible matches. See the simple example below:
>>> re.search('a', 'aaa').group(0) # only first match
'a'
>>> re.search('a', 'aaa').group(1) # group(1) is capture group 1, not a second match; this pattern has no groups
Traceback (most recent call last):
File "<console>", line 1, in <module>
IndexError: no such group
With the regex 'a', search returns only one match, 'a', in the string 'aaa'; it doesn't return all possible matches.
If your objective is to find all the words in a string that don't start with {, you should use the re.findall() function, which matches all occurrences of a pattern, not just the first one as re.search() does. See the example:
>>> re.findall('a', 'aaa')
['a', 'a', 'a']
Edit: Based on a comment, here is one more example to demonstrate the use of re.search and re.findall:
>>> re.search('a+', 'not itnot baaal laaaaaaall ').group()
'aaa' # only the first run of a's (in 'baaal') is returned
>>> re.findall('a+', 'not itnot baaal laaaaaaall ')
['aaa', 'aaaaaaa'] # both runs of a's are returned
Here is a good tutorial for Python re module: re – Regular Expressions
Additionally, there is the concept of a group in Python regexes: a sub-pattern within parentheses. If one or more groups are present in your regex pattern, re.findall() returns a list of groups; this will be a list of tuples if the pattern has more than one group. See below:
>>> re.findall('(a(b))', 'abab') # 2 groups according to 2 pair of ( )
[('ab', 'b'), ('ab', 'b')] # list of tuples of groups captured
In a Python regex, (a(b)) contains two groups, one for each pair of parentheses (this is unlike regular expressions in formal languages; regexes are not exactly the same as regular expressions in formal languages, but that is a different matter).
Answer: The words in the sentence line are separated by spaces (or sit at the start of the string), so the regex should be:
ur"(^|\s)(\w+)"
Regex description:
(^|\s) means: the word is either at the start of the string or follows a whitespace character.
\w+: matches one or more alphanumeric characters, including "_".
On applying regex r to your line:
>>> import pprint # for pretty-printing; you can ignore these two lines
>>> pp = pprint.PrettyPrinter(indent=4)
>>> r = ur"(^|\s)(\w+)"
>>> L = re.findall(r, line)
>>> pp.pprint(L)
[ (u'', u'I'),
(u' ', u'need'),
(u' ', u'to'),
(u' ', u'match'),
(u' ', u'the'),
(u' ', u'whole'),
(u' ', u'line'),
(u' ', u'except'),
(u' ', u'for'), # notice 'for' after 'for'
(u' ', u'for'), # '{thisword' is not included
(u' ', u'example')]
>>>
To find all words in a single line use:
>>> [t[1] for t in re.findall(r, line)]
Note: this skips { and any other special characters in line, because \w only accepts alphanumeric and '_' characters.
If you specifically want to skip a word only when { appears at its start (allowing { in the middle), then use the regex: r = ur"(^|\s+)(?P<word>[^{]\S*)".
To understand the difference between this regex and the previous one, check this example:
>>> r = ur"(^|\s+)(?P<word>[^{]\S*)"
>>> [t[1] for t in re.findall(r, "I am {not yes{ what")]
['I', 'am', 'yes{', 'what']
Without Regex:
You could achieve same thing simply without any regex as follows:
>>> [w for w in line.split() if w[0] != '{']
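The console sessions in this answer are Python 2; the ur'' prefix is a syntax error in Python 3, where a plain r'' pattern already handles Unicode text. As a rough Python 3 sketch of the same two approaches (the variable names are only for illustration):

```python
import re

line = 'I need to match the whole line except for {thisword for example'

# Same pattern as above, written with a plain raw string for Python 3.
word_re = re.compile(r'(^|\s+)(?P<word>[^{]\S*)')
words_regex = [m.group('word') for m in word_re.finditer(line)]

# The regex-free version; startswith also copes with empty strings.
words_split = [w for w in line.split() if not w.startswith('{')]

print(words_regex)
print(words_split)
```

Both lists contain every word of the line except {thisword.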
re.sub() to replace pattern
If you just want to remove one (or more) words that start with {, you should use re.sub() to replace the patterns starting with { with the empty string "". Check the following code:
>>> r = ur"{\w+"
>>> re.findall(r, line)
[u'{thisword']
>>> re.sub(r, "", line)
u'I need to match the whole line except for  for example'
Edit, adding a reply to a comment:
(?P<name>...) is a Python regex extension. It is similar to regular parentheses in that it creates a group, but a named one. The group is accessible via its symbolic group name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. Example 1:
>>> r = "(?P<capture_all_A>A+)"
>>> mo = re.search(r, "aaaAAAAAAbbbaaaaa")
>>> mo.group('capture_all_A')
'AAAAAA'
Example 2: suppose you want to extract the name from a line that may also contain a title, e.g. mr. Use the regex: name_re = "(?P<title>(mr|ms)\.?)? ?(?P<name>[a-z ]*)"
We can then read the name from the input string using group('name'):
>>> re.search(name_re, "mr grijesh chauhan").group('name')
'grijesh chauhan'
>>> re.search(name_re, "grijesh chauhan").group('name')
'grijesh chauhan'
>>> re.search(name_re, "ms. xyz").group('name')
'xyz'

You can simply do:
(?<!{)(\b\w+\b) with the g flag enabled (all matches)
Demo: http://regex101.com/r/zA0sL6
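Python has no g flag; re.findall() plays that role. A minimal sketch of the pattern above in Python 3, with the brace escaped and the capturing group dropped since findall returns the whole match anyway:

```python
import re

line = 'I need to match the whole line except for {thisword for example'

# The negative lookbehind rejects a word that directly follows '{'.
words = re.findall(r'(?<!\{)\b\w+\b', line)
print(words)
```

One caveat: the lookbehind only checks the character immediately before the word, so in a token like a{b the a is kept and the b is dropped.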

Try this pattern:
(.*)(?:\{\w+)\s(.*)
Code:
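import re
p = re.compile(r'(.*)(?:\{\w+)\s(.*)')
# renamed from str so the built-in str type is not shadowed
line = "I need to match the whole line except for {thisword for example"
m = p.match(line)
print(m.groups())  # the text before and the text after the {word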
Example:
http://regex101.com/r/wR8eP6

How to explode sentences with "。" but ignore the "。" in the double quotation marks

I am writing a program that extracts the abstract of a Chinese article. First I have to explode (split) the text into sentences at symbols like “。!?”.
In Chinese articles, when quoting someone else's words, writers put double quotation marks around the quoted words. The quoted text may contain "。", but it should not be exploded there. For example, take the following sentence:
他说：“今天天气很好。我很开心。”
It will be exploded into three sentences:
他说：“今天天气很好
我很开心
”
The result is wrong, but how can I solve it?
I have tried to use a regular expression, but I am not good at it, so I couldn't figure it out.
PS: I am writing this program in python3

Instead of splitting, I’m matching all sentences using re.findall:
>>> s = '今天天气很好。今天天气很好。今天天气很好。他说：“今天天气很好。我很开心。”'
>>> re.findall('[^。“]+(?:。|“.*?”)', s)
['今天天气很好。', '今天天气很好。', '今天天气很好。', '他说：“今天天气很好。我很开心。”']
If you want to accept those other characters as separators too, try this:
>>> re.findall('[^。？！；~“]+(?:[。？！；~]|“.*?”)', s)

Use a regex:
import re
st=u'''\
今天天气很好。今天天气很好。bad? good! 今天天气很好。他说：“今天天气很好。我很开心。”
Sentence one. Sentence two! “Sentence three. Sentence four.” Sentence five?'''
pat=re.compile(r'(?:[^“。？！；~.]*?[?!。.；~])|(?:[^“。？！；~.]*?“[^”]*?”)')
print(pat.findall(st))
Prints:
['今天天气很好。', '今天天气很好。', 'bad?', ' good!', ' 今天天气很好。',
'他说：“今天天气很好。我很开心。”', '\nSentence one.', ' Sentence two!',
' “Sentence three. Sentence four.”', ' Sentence five?']
And if you want the effect of a split (i.e., the delimiter is not included), just move the capturing parentheses and then print the matched group:
pat=re.compile(r'([^“。？！；~.]*?)[?!。.；~]|([^“。？！；~.]*?“[^”]*?”)')
# note that the first group's closing paren now sits before the delimiter class [?!。.；~]
print([t[0] if t[0] else t[1] for t in pat.findall(st)])
Prints:
['今天天气很好', '今天天气很好', 'bad', ' good', ' 今天天气很好',
'他说：“今天天气很好。我很开心。”', '\nSentence one', ' Sentence two',
' “Sentence three. Sentence four.”', ' Sentence five']
Or, use re.split with the same regex and then filter for True values:
print(list(filter(None, pat.split(st))))

First of all, I'll assume the double quotes can't be nested. Then it's quite easy to do this without a complicated regular expression. You just split on ", and then you split the even parts on your punctuation.
>>> sentence = 'a: "b. c" and d. But e said: "f? g."'
>>> sentence.split('"')
['a: ', 'b. c', ' and d. But e said: ', 'f? g.', '']
You can see that the parts at even indexes are the ones not between quotes. We'll use index % 2 == 1 to select the odd (quoted) parts, which are kept whole, and split only the even parts on the period.
result = []
part = []
for i, p in enumerate(sentence.split('"')):
    if i % 2 == 1:
        part.append(p)
    else:
        parts = p.split('.')
        if len(parts) == 1:
            part.append(p)
        else:
            first, *rest, last = parts
            part.append(first)
            result.append('"'.join(part))
            result.extend(rest)
            part = [last]
result.append('"'.join(part))

I think you need to do this in three steps: first, find the dots inside the double quotes and "protect" them (for example, replace them with a string like $%$%$%$ that is unlikely to appear in a Chinese text). Next, explode the strings as before. Finally, replace the $%$%$%$ with a dot again.
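The protect, split, restore idea could be sketched in Python 3 roughly as follows. This assumes the quotations use the “ and ” characters and are not nested; the function name split_sentences is made up for this example, and only 。 is treated as a separator here.

```python
import re

PLACEHOLDER = '$%$%$%$'  # assumed never to occur in the input text

def split_sentences(text):
    # Step 1: protect every 。 that sits inside a “…” quotation.
    protected = re.sub(r'“[^”]*”',
                       lambda m: m.group(0).replace('。', PLACEHOLDER),
                       text)
    # Step 2: explode on the remaining 。 and drop empty pieces.
    pieces = [p for p in protected.split('。') if p]
    # Step 3: put the protected 。 back.
    return [p.replace(PLACEHOLDER, '。') for p in pieces]

print(split_sentences('今天天气很好。他说：“今天天气很好。我很开心。”'))
```

Other separators such as ？ and ！ could be handled the same way by protecting and splitting on a character class instead of a single character.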

Maybe this will work:
$str = '他说：“今天天气很好。我很开心。”';
print_r( preg_split('/(?=(([^"]*"){2})*[^"]*$)。/u', $str, -1, PREG_SPLIT_NO_EMPTY) );
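This is meant to match 。 only when it is followed by an even number of double quotes, i.e. when it is outside a quoted span. Note that the pattern tests for the ASCII " character, while the sample text uses “ and ”, which is why the output below is still split inside the quotation.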
OUTPUT:
Array
(
[0] => 他说：“今天天气很好
[1] => 我很开心
[2] => ”
)
