I want to find all the docblocks in a string using Python.
My first attempt was this:
b = re.compile('\/\*(.)*?\*/', re.M|re.S)
match = b.search(string)
print(match.group(0))
And that worked, but as you'll notice, it only prints out one docblock, not all of them.
So I wanted to use the findall function, which the documentation says returns all the matches, like this:
b = re.compile('\/\*(.)*?\*/', re.M|re.S)
match = b.findall(string)
print(match)
But I never get anything useful, only these kinds of arrays:
[' ', ' ', ' ', '\t', ' ', ' ', ' ', ' ', ' ', '\t', ' ', ' ', ' ']
The documentation does say it'll return empty strings, but I don't know how this can be useful.
You need to move the quantifier inside the capture group:
b = re.compile('\/\*(.*?)\*/', re.M|re.S)
To expand a bit on Rohit Jain's (correct) answer: with the quantifier outside the parentheses you're saying "match (non-greedily) any number of repetitions of the one character inside the parens, and capture only that single character". In other words, it would match " " or "aaaaaa", but in "abcde" it would only match the "a". (And since it's non-greedy, even in "aaaaaa" it would only match a single "a".) By moving the quantifier inside the parens, that is, (.*?) instead of what you had before, you're now saying "match any number of characters, and capture all of them".
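For instance, with a made-up test string (note that the forward slashes need no escaping in Python, and re.M only affects ^ and $, so re.S alone is enough here):
>>> import re
>>> string = '/* first\n * docblock */ code(); /* second */'
>>> re.findall(r'/\*(.*?)\*/', string, re.S)
[' first\n * docblock ', ' second ']
>>> re.findall(r'/\*.*?\*/', string, re.S)
['/* first\n * docblock */', '/* second */']
With the group around .*?, findall returns only the captured contents; without any group, it returns the whole docblocks.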
I hope this helps you understand what's going on a bit better.
Related
When I write following code:
m = re.findall('\sf.*?\s', 'a f fast and friendly dog')
I get output: [' f ', ' friendly ']
But when I put an extra space between f and fast, I get the following output, which is what I expected from the previous one.
Code is as follows:
m = re.findall('\sf.*?\s', 'a f  fast and friendly dog')
Output:
[' f ', ' fast ', ' friendly ']
Can anyone tell me why I am not getting the latter output in the first case (without the extra space between f and fast)?
Because your pattern ends in \s and regex matches are non-overlapping: the first match ' f ' consumes the trailing space, so the rest of the string begins with 'fast' instead of ' fast', and 'fast' does not match a pattern that starts with \s.
The space is consumed by ' f ' after it is matched. Now the next search starts from 'fast and friendly dog'. But now fast does not have a leading space and thus does not match.
If you don't want the space to be consumed, try a positive lookbehind instead.
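For example (the leading space is then only asserted, not consumed, so it no longer appears in the matches):
>>> re.findall(r'(?<=\s)f.*?\s', 'a f fast and friendly dog')
['f ', 'fast ', 'friendly ']
Since the lookbehind consumes nothing, the space that ends one match can still satisfy the lookbehind of the next match.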
The answer to the question at Python remove all whitespace in a string shows separate ways to remove leading/trailing, duplicated, and all spaces, respectively, from a string in Python. But strip() removes tabs and newlines as well, lstrip() only affects the leading end, and the .join(sentence.split()) solution also appears to remove Unicode whitespace characters.
Suppose I have a string, in this case scraped from a website using Scrapy, like this:
['\n   \n   ',
'\n   ',
'Some text',
' and some more text\n',
' and on another a line some more text',
' ']
The newlines preserve the formatting of the text when I use it in other contexts, but all the extra space is a nuisance. How do I remove all the leading, trailing, and duplicated internal spaces while preserving the newline characters (as well as any \r or \t characters, if there are any)?
The result I want (after I join the individual strings) would then be:
['\n\n\nSome text and some more text\nand on another a line some more text']
No sample code is provided because what I've tried so far is just the suggestions on the page referenced above, which gets the results I'm trying to avoid.
In that case str.strip() won't help you (even if you use " " as an argument), because it only removes spaces at the start and end of each string, not the ones inside, and it would also strip the single space before "and" that you need to keep.
Instead, use a regex to remove runs of 2 or more spaces from your strings:
l = ['\n   \n   ',
'\n   ',
'Some text',
' and some more text\n',
' and on another a line some more text']
import re
result = "".join([re.sub(" +","",x) for x in l])
print(repr(result))
prints:
'\n\n\nSome text and some more text\n and on another a line some more text'
EDIT: if we apply the regex to each string separately, we cannot handle the spaces that sit next to a \n in a neighbouring string, as you noted. So the alternate and more complex solution would be to join the strings before applying a regex, and to use a more complex regex (note that I changed the test list of strings to add more corner cases):
l= ['\n \n ',
'\n ',
'Some text',
' and some more text \n',
'\n and on another a line some more text ']
import re
result = re.sub(r"(^ |(?<=\n) |  +| (?=\n)| $)", "", "".join(l))
print(repr(result))
prints:
'\n\n\nSome text and some more text\n\nand on another a line some more text'
There are 5 cases in the regex that will be removed:
a space at the start of the string
a space following a newline
2 or more spaces
a space followed by a newline
a space at the end of the string
Afterthought: it looks (and is) complicated. There is a non-regex solution after all which gives exactly the same result (as long as there aren't multiple spaces between words):
result = "\n".join([x.strip(" ") for x in "".join(l).split("\n")])
print(repr(result))
just join the strings, split on newlines, apply strip with " " as the argument (so tabs are preserved), then join again with newlines.
Chain with re.sub(" +"," ",x.strip(" ")) to take care of possible double spaces between words:
result = "\n".join([re.sub(" +"," ",x.strip(" ")) for x in "".join(l).split("\n")])
You can also do the whole thing in terms of built in string operations if you like.
l = ['\n   \n   ',
'\n   ',
'Some text',
' and some more text\n',
' and on another a line some more text',
' ']
def remove_duplicate_spaces(line):
    words = [w for w in line.split(' ') if w != '']
    return ' '.join(words)
lines = ''.join(l).split('\n')
formatted_lines = map(remove_duplicate_spaces, lines)
u = "\n".join(formatted_lines)
print(repr(u))
gives
'\n\n\nSome text and some more text\nand on another a line some more text'
You can also collapse the whole thing into a one-liner:
s = '\n'.join([' '.join([w for w in x.strip(' ').split(' ') if w != '']) for x in ''.join(l).split('\n')])
# OR
t = '\n'.join(map(lambda x: ' '.join(filter(lambda w: w != '', x.strip(' ').split(' '))), ''.join(l).split('\n')))
My program needs to split my natural language text into sentences. I made a mock sentence splitter using re.split in Python 3+. It looks like this:
re.split('\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content)
I need the split to happen at the whitespace inside the pattern. But the code, as it should, splits the text at the whole pattern, not at the whitespace, so the end of the sentence, including the sentence terminator, is not kept.
"Is this the number 3? The text goes on..."
will look like
"Is this the number " and "he text goes on..."
Is there a way I can specify at which point the data should be split while keeping my patterns or do I have to look for alternatives?
As @jonrsharpe says, one can use lookaround to reduce the number of characters split away, down to a single one. If you don't mind losing the space characters, you could use something like:
>>> re.split('\s(?=[A-Z])',content)
['Is this the number 3?', 'The text goes on...']
This splits on whitespace whose next character is an uppercase letter; the T is not consumed, only the space.
Alternative approach: alternating split/capture item
You can however use another approach. When you split, you consume the matched content, but you can use the same regex with findall to generate the list of matches: those matches are exactly the data that sat between the split items. By interleaving the matches with the split items, you reconstruct the full list:
import re

def nonconsumesplit(regex, content):
    outer = re.split(regex, content)
    inner = re.findall(regex, content) + ['']
    return [val for pair in zip(outer, inner) for val in pair]
Which results in:
>>> nonconsumesplit('\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]',content)
['Is this the number ', '3? ', 'The text goes on...', '']
>>> list(nonconsumesplit('\s',content))
['Is', ' ', 'this', ' ', 'the', ' ', 'number', ' ', '3?', ' ', 'The', ' ', 'text', ' ', 'goes', ' ', 'on...', '']
Or you can use a string concatenation:
def nonconsumesplitconcat(regex, content):
    outer = re.split(regex, content)
    inner = re.findall(regex, content) + ['']
    return [pair[0] + pair[1] for pair in zip(outer, inner)]
Which results in:
>>> nonconsumesplitconcat('\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]',content)
['Is this the number 3? ', 'The text goes on...']
>>> nonconsumesplitconcat('\s',content)
['Is ', 'this ', 'the ', 'number ', '3? ', 'The ', 'text ', 'goes ', 'on...']
Why doesn't the following code block match the parentheses?
In [27]: import re
In [28]: re.match('.*?([\(]*)', ' (((( ' ).groups()
Out[28]: ('',)
Demonstrating my comment:
import re
>>> re.match('.*?([\(]*)', ' (((( ' ).groups()
('',)
>>> re.match('.*?([\(]+)', ' (((( ' ).groups()
('((((',)
>>>
Note - you don't even need the backslash inside the [], since most special characters lose their meaning inside a character class. So
>>> re.match('.*?([(]+)', ' (((( ' ).groups()
('((((',)
>>>
works too...
This is because your "non-greedy" first quantifier (*?) doesn't need to give anything to the second quantifier, since the second quantifier is happy with zero matches.
In your case .*? can consume everything because [\(]* means 0 or more. So changing * into + will work for you, as + means 1 or more.
re.match('.*?([\(]+)', ' (((( ' ).groups()
I am writing a program that extracts the abstract of a Chinese article. First I have to split the text into sentences at symbols like 。!?.
In a Chinese article, when quoting someone else's words, double quotation marks are used to mark the quoted text, which may contain 。 but should not be split at. For example, the following sentence:
他说:“今天天气很好。我很开心。”
will be split into three sentences:
他说:“今天天气很好
我很开心
”
This result is wrong, but how can I solve it?
I have tried regular expressions, but I am not good at them, so I could not figure it out.
PS: I am writing this program in Python 3.
Instead of splitting, I’m matching all sentences using re.findall:
>>> s = '今天天气很好。今天天气很好。今天天气很好。他说:“今天天气很好。我很开心。”'
>>> re.findall('[^。“]+(?:。|“.*?”)', s)
['今天天气很好。', '今天天气很好。', '今天天气很好。', '他说:“今天天气很好。我很开心。”']
If you want to accept those other characters as separators too, try this:
>>> re.findall('[^。?!;~“]+(?:[。?!;~]|“.*?”)', s)
Use a regex:
import re
st=u'''\
今天天气很好。今天天气很好。bad? good! 今天天气很好。他说:“今天天气很好。我很开心。”
Sentence one. Sentence two! “Sentence three. Sentence four.” Sentence five?'''
pat=re.compile(r'(?:[^“。?!;~.]*?[?!。.;~])|(?:[^“。?!;~.]*?“[^”]*?”)')
print(pat.findall(st))
Prints:
['今天天气很好。', '今天天气很好。', 'bad?', ' good!', ' 今天天气很好。',
'他说:“今天天气很好。我很开心。”', '\nSentence one.', ' Sentence two!',
' “Sentence three. Sentence four.”', ' Sentence five?']
And if you want the effect of a split (i.e., the delimiter is not included), just move the capturing parenthesis and then print the match group:
pat=re.compile(r'([^“。?!;~.]*?)[?!。.;~]|([^“。?!;~.]*?“[^”]*?”)')
# note: the first capturing group now closes before the terminator class
print([t[0] if t[0] else t[1] for t in pat.findall(st)])
Prints:
['今天天气很好', '今天天气很好', 'bad', ' good', ' 今天天气很好',
'他说:“今天天气很好。我很开心。”', '\nSentence one', ' Sentence two',
' “Sentence three. Sentence four.”', ' Sentence five']
Or, use re.split with the same regex and then filter for True values:
print(list(filter(None, pat.split(st))))
First of all, I'll assume the double quotes can't be nested. Then it's quite easy to do this without a complicated regular expression: you just split on ", and then split the even parts on your punctuation.
>>> sentence = 'a: "b. c" and d. But e said: "f? g."'
>>> sentence.split('"')
['a: ', 'b. c', ' and d. But e said: ', 'f? g.', '']
You can see how the even parts are the ones not between quotes. We'll use index % 2 == 1 to select the odd (quoted) parts and keep them intact.
result = []
part = []
for i, p in enumerate(sentence.split('"')):
    if i % 2 == 1:
        # odd parts are inside quotes: keep them whole
        part.append(p)
    else:
        # even parts are outside quotes: split them on the punctuation
        parts = p.split('.')
        if len(parts) == 1:
            part.append(p)
        else:
            first, *rest, last = parts
            part.append(first)
            result.append('"'.join(part))
            result.extend(rest)
            part = [last]
result.append('"'.join(part))
I think you need to do this in two steps: first, find the dots inside the double quotes and "protect" them (for example, replace them with a string like $%$%$%$ that is unlikely to appear in a Chinese text). Next, split the string as before. Finally, replace the $%$%$%$ with a dot again.
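A minimal sketch of that idea in Python 3 (assuming the quotes are never nested and the placeholder never occurs in the text):
import re

def split_sentences(text, placeholder='$%$%$%$'):
    # Step 1: protect the 。 inside “...” quotes with the placeholder.
    protected = re.sub(r'“[^”]*”',
                       lambda m: m.group(0).replace('。', placeholder),
                       text)
    # Step 2: split on the remaining (unquoted) 。 and drop empty pieces.
    sentences = [s for s in protected.split('。') if s]
    # Step 3: put the protected 。 back.
    return [s.replace(placeholder, '。') for s in sentences]

print(split_sentences('今天天气很好。他说:“今天天气很好。我很开心。”'))
# ['今天天气很好', '他说:“今天天气很好。我很开心。”']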
Maybe this will work (it's PHP, but the pattern carries over to Python's re.split):
$str = '他说:“今天天气很好。我很开心。”';
print_r( preg_split('/。(?=(?:[^“”]*“[^”]*”)*[^“”]*$)/u', $str, -1, PREG_SPLIT_NO_EMPTY) );
This makes sure that 。 is matched only when outside double quotes.
OUTPUT:
Array
(
    [0] => 他说:“今天天气很好。我很开心。”
)
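Since the question is about Python 3: the same lookahead carries over to re.split (a sketch, assuming the quotes are balanced and never nested):
>>> import re
>>> s = '今天天气很好。他说:“今天天气很好。我很开心。”'
>>> [p for p in re.split(r'。(?=(?:[^“”]*“[^”]*”)*[^“”]*$)', s) if p]
['今天天气很好', '他说:“今天天气很好。我很开心。”']
The lookahead only succeeds when zero or more complete “...” pairs (and no stray quote mark) lie ahead, i.e. when the 。 is outside the quotes.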