Splitting longer patterns using regex without losing characters (Python 3+)

My program needs to split my natural language text into sentences. I made a mock sentence splitter using re.split in Python 3+. It looks like this:
re.split('\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content)
I need to split the text at the whitespace wherever the pattern occurs. But re.split, as designed, removes everything the pattern matches rather than just the whitespace, so the last character of each sentence and its terminator are lost.
"Is this the number 3? The text goes on..."
will look like
"Is this the number " and "he text goes on..."
Is there a way I can specify at which point the data should be split while keeping my patterns or do I have to look for alternatives?

As @jonrsharpe says, you can use lookaround to reduce the number of characters that the split consumes, for instance to a single one. If you don't mind losing the space characters, you could use something like:
>>> re.split(r'\s(?=[A-Z])', content)
['Is this the number 3?', 'The text goes on...']
This splits at any whitespace character that is followed by an uppercase letter. The lookahead does not consume the T; only the space is removed.
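To keep the split tied to a sentence terminator as well, the lookahead can be combined with a lookbehind. This is a minimal sketch, assuming that a sentence ends in ., ! or ? followed by whitespace and an uppercase letter; the pattern is an illustration and not the asker's full rule set:

```python
import re

content = "Is this the number 3? The text goes on..."

# Split only on the whitespace that sits between a terminator and an
# uppercase letter. Both lookarounds are zero-width, so only the
# whitespace itself is removed.
sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])', content)
print(sentences)
```

Because the terminator and the capital letter are only asserted, not matched, both stay in the resulting pieces.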
Alternative approach: alternating split/capture item
You can, however, use another approach. Splitting consumes the matched text, but you can use the same regex with re.findall to get the list of matches, which is exactly the data that sat between the split items. By interleaving the matches with the split items, you reconstruct the full list:
import re

def nonconsumesplit(regex, content):
    outer = re.split(regex, content)           # text between the matches
    inner = re.findall(regex, content) + ['']  # the matches, padded with one empty string
    return [val for pair in zip(outer, inner) for val in pair]
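Which results in (on Python 3.7+, where re.split also splits on the empty match of $, the list may end with extra empty strings):
>>> nonconsumesplit(r'\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content)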
['Is this the number ', '3? ', 'The text goes on...', '']
>>> nonconsumesplit(r'\s', content)
['Is', ' ', 'this', ' ', 'the', ' ', 'number', ' ', '3?', ' ', 'The', ' ', 'text', ' ', 'goes', ' ', 'on...', '']
Or you can use a string concatenation:
def nonconsumesplitconcat(regex, content):
    outer = re.split(regex, content)
    inner = re.findall(regex, content) + ['']
    return [pair[0] + pair[1] for pair in zip(outer, inner)]
Which results in:
>>> nonconsumesplitconcat(r'\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content)
['Is this the number 3? ', 'The text goes on...']
>>> nonconsumesplitconcat(r'\s', content)
['Is ', 'this ', 'the ', 'number ', '3? ', 'The ', 'text ', 'goes ', 'on...']
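The same result can be produced in a single pass over the matches instead of running the regex twice. The following is a sketch only; the function name split_keep_delimiter is hypothetical, and it assumes that empty matches (such as those of $) should simply be ignored:

```python
import re

def split_keep_delimiter(pattern, text):
    """Cut text after every non-empty match, keeping the match with the piece before it."""
    pieces = []
    start = 0
    for match in re.finditer(pattern, text):
        if match.start() == match.end():
            continue  # ignore empty matches such as those produced by '$'
        pieces.append(text[start:match.end()])
        start = match.end()
    if start < len(text):
        pieces.append(text[start:])  # trailing text after the last match
    return pieces

content = "Is this the number 3? The text goes on..."
print(split_keep_delimiter(r'\d[!?]\s', content))
print(split_keep_delimiter(r'\s', content))
```

Since each piece is sliced directly from the original string, joining the pieces always gives back the input unchanged.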

Related

Why is Python re not splitting multiple instances of punctuation?

I am trying to split inputted text at spaces, and all special characters like punctuation, while keeping the delimiters. My re pattern works exactly the way I want except that it will not split multiple instances of the punctuation.
Here is my re pattern: wordsWithPunc = re.split(r'([^-\w]+)', words)
If I have a word like "hello" with two punctuation marks after it, those punctuation marks are split off but remain together in the same element. For example,
"hello,-" becomes "hello", ",-" but I want it to be "hello", ",", "-".
Another example: My name is mud!!! is split into "My", "name", "is", "mud", "!!!" but I want it to be "My", "name", "is", "mud", "!", "!", "!".
You need to remove the + quantifier from your pattern so that each match is a single non-word character, something like:
import re
words = 'My name is mud!!!'
splitted = re.split(r'([^-\w])', words)
# ['My', ' ', 'name', ' ', 'is', ' ', 'mud', '!', '', '!', '', '!', '']
This also produces empty strings between adjacent non-word characters (because you're splitting on each of them), but you can mitigate that by post-processing the result to remove the empty strings:
splitted = [match for match in re.split(r'([^-\w])', words) if match]
# ['My', ' ', 'name', ' ', 'is', ' ', 'mud', '!', '!', '!']
You can further strip spaces in the list comprehension (i.e. ... if match.strip() ...) if you want to get rid of the space matches as well.
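Put together, the split plus the strip-based filter could look like the sketch below. The function name tokenize is hypothetical and used only for illustration:

```python
import re

def tokenize(text):
    # Split on every single character that is neither a word character nor
    # a hyphen, keep that character through the capturing group, then drop
    # the empty and whitespace-only pieces.
    return [piece for piece in re.split(r'([^-\w])', text) if piece.strip()]

print(tokenize('My name is mud!!!'))
print(tokenize('hello,-'))
```

Note that the hyphen is excluded from the delimiter class, so in "hello,-" it survives as ordinary text after the comma rather than as a captured delimiter.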

Remove leading/ending and internal multiple spaces but NOT tabs, newlines, or return characters, in Python

The answer to the question at Python remove all whitespace in a string shows separate ways to remove leading/ending, duplicated, and all spaces, respectively, from a string in Python. But strip() removes tabs and newlines, and lstrip() only affects leading spaces. The solution using .join(sentence.split()) also appears to remove Unicode whitespace characters.
Suppose I have a string, in this case scraped from a website using Scrapy, like this:
['\n \n ',
'\n ',
'Some text',
' and some more text\n',
' and on another a line some more text',
' ']
The newlines preserve the formatting of the text when I use it in other contexts, but all the extra space is a nuisance. How do I remove all the leading, trailing, and duplicated internal spaces while preserving the newline characters (in addition to any \r or \t characters, if there are any)?
The result I want (after I join the individual strings) would then be:
['\n\n\nSome text and some more text\nand on another line some more text']
No sample code is provided because what I've tried so far is just the suggestions on the page referenced above, which gets the results I'm trying to avoid.
In that case str.strip() won't help you, even if you use " " as an argument: it won't remove the spaces inside the strings, only those at the start and end, and it would also remove the single space before "and".
Instead, use a regex to remove runs of 2 or more spaces from your strings:
l= ['\n \n ',
'\n ',
'Some text',
' and some more text\n',
' and on another a line some more text']
import re
result = "".join([re.sub(" {2,}", "", x) for x in l])
print(repr(result))
prints:
'\n\n\nSome text and some more text\n and on another a line some more text'
EDIT: if we apply the regex to each line, we cannot detect \n in some cases, as you noted. So, the alternate and more complex solution would be to join the strings before applying regex, and apply a more complex regex (note that I changed the test list of strings to add more corner cases):
l= ['\n \n ',
'\n ',
'Some text',
' and some more text \n',
'\n and on another a line some more text ']
import re
result = re.sub(r"(^ |(?<=\n) | {2,}| (?=\n)| $)", "", "".join(l))
print(repr(result))
prints:
'\n\n\nSome text and some more text\n\nand on another a line some more text'
The regex now removes five cases:
a single space at the start of the string
a space following a newline
2 or more consecutive spaces
a space followed by a newline
a single space at the end of the string
Afterthought: this looks (and is) complicated. There is a non-regex solution after all, which gives exactly the same result (if there aren't multiple spaces between words):
result = "\n".join([x.strip(" ") for x in "".join(l).split("\n")])
print(repr(result))
Just join the strings, split on newlines, apply strip with " " as the argument to preserve tabs, and join again with newlines.
Chain with re.sub(" +"," ",x.strip(" ")) to take care of possible double spaces between words:
result = "\n".join([re.sub(" +"," ",x.strip(" ")) for x in "".join(l).split("\n")])
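The same per-line idea can also be written as two regex passes over the joined text. This is a sketch under the assumption that only the plain space character should be touched; the function name normalize_spaces and the sample list are hypothetical:

```python
import re

def normalize_spaces(parts):
    text = "".join(parts)
    # Pass 1: with re.MULTILINE, ^ and $ match at every line boundary, so this
    # drops the spaces at the start and end of each line. Tabs are untouched.
    text = re.sub(r"^ +| +$", "", text, flags=re.MULTILINE)
    # Pass 2: collapse any remaining run of spaces into a single space.
    return re.sub(r" {2,}", " ", text)

sample = ['\n  \n ',
          '\n ',
          'Some text',
          ' and some   more text \n',
          '\n  and on another line\tsome more text  ']
print(repr(normalize_spaces(sample)))
```

Using an explicit {2,} quantifier instead of a literal run of spaces keeps the pattern readable even if whitespace in the source code gets reformatted.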
You can also do the whole thing in terms of built-in string operations if you like.
l = ['\n \n ',
'\n ',
'Some text',
' and some more text\n',
' and on another a line some more text',
' ']
def remove_duplicate_spaces(line):
    words = [w for w in line.split(' ') if w != '']
    return ' '.join(words)
lines = ''.join(l).split('\n')
formatted_lines = map(remove_duplicate_spaces, lines)
u = "\n".join(formatted_lines)
print(repr(u))
gives
'\n\n\nSome text and some more text\nand on another a line some more text'
You can also collapse the whole thing into a one-liner:
s = '\n'.join([' '.join([s for s in x.strip(' ').split(' ') if s!='']) for x in ''.join(l).split('\n')])
# OR
t = '\n'.join(map(lambda x: ' '.join(filter(lambda s: s!='', x.strip(' ').split(' '))), ''.join(l).split('\n')))

python: split string without discarding anything

I'd like to do something like this:
import re
s = 'This is a test'
re.split('(?<= )', s)
and get back something like this:
['This ', 'is ', 'a ', 'test']
but that doesn't work.
Can anybody suggest a simple way to split a string based on a regular expression (my actual code is more complicated and does require a regex) without discarding any content?
The purpose of re.split() is to define a delimiter to split by. While you will find other answers that can actually make your case work, I sense that you would be happier with something like re.findall()
re.findall(r'(\S+\s*)', s)
gives you
['This ', 'is ', 'a ', 'test']
You can use regex module here.
import regex
s = 'This is a test'
print(regex.split('(?<= )', s, flags=regex.VERSION1))
Output:
['This ', 'is ', 'a ', 'test']
or
import re
s = 'This is a test'
print([i for i in re.split(r'(\w+\s+)', s) if i])
Note: before Python 3.7, re.split() did not split on zero-width matches such as the lookbehind above; since Python 3.7 it does.
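On newer interpreters the original attempt works with the standard re module. A small sketch, assuming Python 3.7 or later (where re.split gained support for patterns that match an empty string):

```python
import re

s = 'This is a test'

# The lookbehind matches the empty position right after each space, so the
# string is cut there and no character is discarded (requires Python 3.7+).
parts = re.split(r'(?<= )', s)
print(parts)
```

On older versions the same call returns the string unsplit, which is why the answers in this thread fall back on findall or the regex module.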
Why not just use re.findall?
re.findall(r"(\w+\s*)", s)
Capture the delimiter and then rejoin the delimiter to the previous word:
>>> it = iter(re.split('( )', s)+[''])
>>> [word+delimiter for word, delimiter in zip(it, it)]
['This ', 'is ', 'a ', 'test']
Split on at least one word character followed by at least one space:
[i for i in re.split('(\w+ +)',s) if i] # ['This ', 'is ', 'a ', 'test']

Not getting required output using findall in python

My apologies; I could not phrase the exact question earlier.
Here is what I am looking for:
I am reading a string like the one below from a file, and the file can contain several such strings.
" VEGETABLE 1
POTATOE_PRODUCE 1.1 1SIMLA(INDIA)
BANANA 1.2 A_BRAZIL(OR INDIA)
CARROT_PRODUCE 1.3 A_BRAZIL/AFRICA"
I want to capture the entire string as output using findall only.
My script:
import re

with open('log.txt') as f:
    contents = f.read()
output = re.findall(r'(VEGETABLE.*)(\s+\w+\s+.*)+', contents)
print(output)
Above script is giving output as
[('VEGETABLE 1', '\n CARROT_PRODUCE 1.3 A_BRAZIL/AFRICA')]
But contents in between are missing.
Solution in last snippet in this answer.
>>> import re
>>> str2='d1 talk walk joke'
>>> re.findall('(\d\s+)(\w+\s)+',str2)
[('1 ', 'walk ')]
The output is a list with only one occurrence of the given pattern. The tuple in the list contains the two strings matched by the two groups in parentheses; because the second group is repeated, it keeps only its last repetition.
Experiment 1
Removing the last '+' makes the second group match only once, so it captures the first word instead of the last repetition:
>>> re.findall('(\d\s+)(\w+\s)',str2)
[('1 ', 'talk ')]
Experiment 2
Added one more group to capture a third word followed by whitespace. But if the string has more than three words followed by spaces, this will still find only three.
>>> re.findall('(\d\s+)(\w+\s)(\w+\s)',str2)
[('1 ', 'talk ', 'walk ')]
Experiment 3
Using '|' to match either alternative repeatedly. Note that the tuples have disappeared, because the pattern no longer contains groups. Also note that the first match does not contain only the number: \w also matches digits, so the second alternative matches 'd1 ' as a whole.
>>> re.findall('\d\s+|\w+\s+',str2)
['d1 ', 'talk ', 'walk ']
Final Experiment
>>> re.findall('\d\s+|[a-z]+\s+',str2)
['1 ', 'talk ', 'walk ']
Hope this helps.
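Applied to the original question, the experiments suggest dropping the capturing groups so that findall returns each whole match. The sketch below is one possible pattern, not the answerer's; the sample text is a hypothetical stand-in for log.txt, and the negative lookahead assumes every block starts with the word VEGETABLE:

```python
import re

# Hypothetical sample standing in for the contents of log.txt
contents = (
    " VEGETABLE 1\n"
    " POTATOE_PRODUCE 1.1 1SIMLA(INDIA)\n"
    " BANANA 1.2 A_BRAZIL(OR INDIA)\n"
    " VEGETABLE 2\n"
    " CARROT_PRODUCE 2.1 A_BRAZIL/AFRICA\n"
)

# (?:...) groups without capturing, so findall returns the full match.
# (?!VEGETABLE) stops a block from swallowing the header of the next one.
block_pattern = r'VEGETABLE.*(?:\s+(?!VEGETABLE)\w+\s+.*)+'
blocks = re.findall(block_pattern, contents)

for block in blocks:
    print(repr(block))
```

With a repeated capturing group, only the last repetition would be reported; the non-capturing group avoids that while keeping the repetition.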

Regex findall produces strange results in python3

I want to find all the docblocks of a string using python.
My first attempt was this:
b = re.compile('\/\*(.)*?\*/', re.M|re.S)
match = b.search(string)
print(match.group(0))
And that worked, but as you'll notice yourself: it'll only print out 1 docblock, not all of them.
So I wanted to use the findall function, which says it would output all the matches, like this:
b = re.compile('\/\*(.)*?\*/', re.M|re.S)
match = b.findall(string)
print(match)
But I never get anything useful, only these kinds of arrays:
[' ', ' ', ' ', '\t', ' ', ' ', ' ', ' ', ' ', '\t', ' ', ' ', ' ']
The documentation does say it'll return empty strings, but I don't know how this can be useful.
You need to move the quantifier inside the capture group:
b = re.compile(r'/\*(.*?)\*/', re.M|re.S)
To expand a bit on Rohit Jain's (correct) answer: with the quantifier outside the parentheses, the group (.) matches exactly one character and is repeated (non-greedily) as often as needed. Each repetition overwrites the previous capture, so the group ends up holding only the last character before the closing */. Since findall returns the group's contents when the pattern has a single group, you get a list of single characters. By moving the quantifier inside the parens (that is, (.*?) instead of what you had before) you're now saying "match any number of characters, and capture all of them".
I hope this helps you understand what's going on a bit better.
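The difference between the two placements of the quantifier can be seen side by side. The sample string here is made up for illustration:

```python
import re

source = "/* first */ code /* second\nline */"

# Quantifier outside the group: each repetition overwrites the capture,
# so only the last character before '*/' is kept for every match.
last_chars = re.findall(r'/\*(.)*?\*/', source, re.S)

# Quantifier inside the group: the whole comment body is captured.
bodies = re.findall(r'/\*(.*?)\*/', source, re.S)

# No group at all: findall returns each complete match, delimiters included.
whole = re.findall(r'/\*.*?\*/', source, re.S)

print(last_chars)
print(bodies)
print(whole)
```

If the delimiters are wanted in the result, leaving out the group is the simplest option.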
