python: split string without discarding anything - python

I'd like to do something like this:
import re
s = 'This is a test'
re.split('(?<= )', s)
and get back something like this:
['This ', 'is ', 'a ', 'test']
but that doesn't work.
Can anybody suggest a simple way to split a string based on a regular expression (my actual code is more complicated and does require a regex) without discarding any content?

The purpose of re.split() is to define a delimiter to split by. While you will find other answers that can actually make your case work, I sense that you would be happier with something like re.findall()
re.findall(r'(\S+\s*)', s)
gives you
['This ', 'is ', 'a ', 'test']

You can use the regex module here.
import regex
s = 'This is a test'
print(regex.split('(?<= )', s, flags=regex.VERSION1))
Output:
['This ', 'is ', 'a ', 'test']
or
import re
s = 'This is a test'
print([i for i in re.split(r'(\w+\s+)', s) if i])
Note: zero-width assertions are not supported by split in the re module before Python 3.7.
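On Python 3.7 or later, re.split accepts patterns that can match an empty string, so the lookbehind from the question can be used directly. A minimal sketch, assuming such an interpreter is available:

```python
import re

s = 'This is a test'

# Zero-width lookbehind: split at every position that is preceded by a space.
# Assumes Python 3.7+, where re.split splits on empty matches.
parts = re.split(r'(?<= )', s)
print(parts)  # ['This ', 'is ', 'a ', 'test']
```

On older interpreters the same call returns the string unsplit, which is why the other answers use findall or a captured delimiter.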

Why not just use re.findall?
re.findall(r"(\w+\s*)", s)

Capture the delimiter and then rejoin the delimiter to the previous word:
>>> it = iter(re.split('( )', s)+[''])
>>> [word+delimiter for word, delimiter in zip(it, it)]
['This ', 'is ', 'a ', 'test']
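The same rejoin idea can be wrapped in a small function. This is only a sketch: the name split_keep_delimiter is hypothetical, and it assumes the pattern passed in has no capturing groups of its own.

```python
import re

def split_keep_delimiter(pattern, text):
    """Split text on pattern, keeping each delimiter attached to the piece before it.

    Hypothetical helper; assumes pattern contains no capturing groups.
    """
    # Wrapping the pattern in a group makes re.split return the delimiters too,
    # so the list alternates: piece, delimiter, piece, delimiter, ..., piece.
    parts = re.split('(' + pattern + ')', text)
    parts.append('')  # pad so the last piece also has a (blank) delimiter
    it = iter(parts)
    # zip(it, it) walks the list two items at a time; drop fully empty pairs.
    return [piece + delim for piece, delim in zip(it, it) if piece or delim]

print(split_keep_delimiter(r' ', 'This is a test'))  # ['This ', 'is ', 'a ', 'test']
```

Because nothing is discarded, joining the result with '' gives back the original string.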

Split on at least one word character followed by at least one space:
[i for i in re.split(r'(\w+ +)', s) if i]  # ['This ', 'is ', 'a ', 'test']

Related

Splitting list elements after many delimiters

I would like to cut list elements at any of several chosen delimiters: '-', ',' and ':'.
I have an example list:
list_1 = ['some text – some another', 'some text, some another', 'some text: some another']
I'd like to cut the list elements (strings, in this case) so that I get the following output:
splitted_list = ['some text', 'some text', 'some text']
I already tried with split() but it only takes 1 delimiter at a time:
splited_list = [i.split(',', 1)[0] for i in list_1]
I would prefer something that is easier for me to understand and that lets me decide which delimiters to use. For example, I don't want to cut the string after '-' but after ' - '.
List of delimiters:
': ', ' - ', ', '
Note that - has a space before and after, while : and , have a space only after.
You may use this regex in re.sub and replace it with an empty string:
\s*[^\w\s].*
This will match 0 or more whitespace followed by a character that is not a whitespace and not a word character and anything afterwards.
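A minimal sketch of that substitution applied to the example list (the variable name trimmed is only for illustration):

```python
import re

list_1 = ['some text – some another', 'some text, some another', 'some text: some another']

# Remove optional whitespace, the first punctuation-like character,
# and everything after it on the same line.
trimmed = [re.sub(r'\s*[^\w\s].*', '', item) for item in list_1]
print(trimmed)  # ['some text', 'some text', 'some text']
```

Note that this cuts at the first character of any kind that is neither a word character nor whitespace, so a hyphenated word such as 'well-known' would also be cut.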
import re
list_1 = ['some text – some another', 'some text, some another', 'some text: some another']
delims = [',', ':', ' –']
delimre = '(' + '|'.join(delims) + r')\s.*'
splited_list = [re.sub(delimre, '', i) for i in list_1]
print (splited_list)
Output:
['some text', 'some text', 'some text']
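If the delimiters should be matched literally, spaces included, one option is to escape each of them and split once. This is a sketch under the assumption that the delimiters are the space-padded strings described in the question; both the plain hyphen and the en dash from the example list are included.

```python
import re

list_1 = ['some text – some another', 'some text, some another', 'some text: some another']

# Literal delimiters, spaces included; edit this list to choose where to cut.
delims = [' - ', ' – ', ', ', ': ']

# re.escape keeps any regex-special characters in a delimiter from being interpreted.
pattern = '|'.join(re.escape(d) for d in delims)

# maxsplit=1 stops at the first delimiter; [0] keeps the text before it.
cut = [re.split(pattern, item, maxsplit=1)[0] for item in list_1]
print(cut)  # ['some text', 'some text', 'some text']
```

With this pattern a hyphen without surrounding spaces, as in 'well-known', is left alone.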

Extract all substrings between two markers for a very long string

This is a continuation of the question Extract all substrings between two markers. The answers by @Daweo and @Tim Biegeleisen work for small strings.
But for very large strings, regular expressions don't seem to work. This could be because of a limit on string length, as seen below:
>>> import re
>>> teststr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
>>> for i in range(0, 23):
...     teststr += teststr  # creating a very long string here
...
>>> len(teststr)
603979776
>>> found = re.findall(r"\&marker1\n(.*?)/\n", newstr)
>>> len(found)
46
>>> found
['The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ']
What could I do to resolve this and find all occurrences between the markers start="&marker1" and end="/\n"? What is the maximum string length that re can handle?
I couldn't get re.findall to work. Now I do use re but to find the location of markers and extract the substrings manually.
locs_start = [match.start() for match in re.finditer(r"&marker1", mylongstring)]
locs_end = [match.start() for match in re.finditer(r"/\n", mylongstring)]
substrings = []
for i in range(0, len(locs_start)):
    substrings.append(mylongstring[locs_start[i]:locs_end[i]+1])
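An alternative that avoids regular expressions altogether is to walk the string with str.find and yield one substring at a time. The sketch below is not taken from the question; the function name iter_between is hypothetical, and it is demonstrated on a short string rather than the very long one.

```python
def iter_between(text, start, end):
    """Yield every substring found between a start marker and the next end marker.

    Hypothetical helper; works lazily, so the results are never all held at once.
    """
    pos = 0
    while True:
        i = text.find(start, pos)
        if i == -1:
            return  # no further start marker
        i += len(start)
        j = text.find(end, i)
        if j == -1:
            return  # start marker without a closing end marker
        yield text[i:j]
        pos = j + len(end)

teststr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
for _ in range(3):
    teststr += teststr  # 8 copies, so 16 marked substrings

found = list(iter_between(teststr, "&marker1\n", "/\n"))
print(len(found))  # 16
print(found[:2])   # ['The String that I want ', 'Another string that I want ']
```

Because it is a generator, the caller can also process each substring as it is produced instead of building a list.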

Punctuation not detected between words with no space

How can I split sentences, when punctuation is detected (.?!) and occurs between two words without a space?
Example:
>>> splitText = re.split("(?<=[.?!])\s+", "This is an example. Not working as expected.Because there isn't a space after dot.")
output:
['This is an example.',
"Not working as expected.Because there isn't a space after dot."]
expected:
['This is an example.',
'Not working as expected.',
"Because there isn't a space after dot."]
splitText = re.split("[.?!]\s*", "This is an example. Not working as expected.Because there isn't a space after dot.")
+ matches one or more of something, * zero or more.
If you need to keep the punctuation, you probably don't want to split; instead you could do:
splitText = re.findall(".*?[.?!]", "This is an example. Not working as expected.Because there isn't a space after dot.")
which gives
['This is an example.',
' Not working as expected.',
"Because there isn't a space after dot."]
You can trim those by adjusting the regex (e.g. r'\s*(.*?[.?!])', where the capturing group leaves out the leading whitespace) or by calling .strip() on each item.
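A minimal sketch of the trimming step, using str.strip() on each match (the variable names are only for illustration):

```python
import re

text = ("This is an example. Not working as expected."
        "Because there isn't a space after dot.")

# Lazy match up to and including each terminator, then trim surrounding whitespace.
sentences = [m.strip() for m in re.findall(r".*?[.?!]", text)]
print(sentences)
```

This prints the three sentences without the leading space. Any trailing text that does not end in ., ? or ! is not matched and is therefore dropped.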
Use re.findall with a lazy pattern (see https://regex101.com/r/icrJNl/3/):
import re
from pprint import pprint
split_text = re.findall(".*?[?.!]", "This is an example! Working as "
"expected?Because.")
pprint(split_text)
Note: .*? is a lazy (non-greedy) quantifier, as opposed to .*, which is greedy.
Output:
['This is an example!',
' Working as expected?',
'Because.']
Another solution:
import re
from pprint import pprint
split_text = re.split("([?.!])", "This is an example! Working as "
"expected?Because.")
pprint(split_text)
Output:
['This is an example',
'!',
' Working as expected',
'?',
'Because',
'.',
'']
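The alternating list above can be stitched back into sentences by pairing each piece of text with the terminator that follows it. A short sketch, assuming the same input string:

```python
import re

text = "This is an example! Working as expected?Because."

# With a capturing group, re.split returns: text, terminator, text, terminator, ..., text.
parts = re.split(r"([?.!])", text)

# Pair even-indexed items (text) with odd-indexed items (terminators), then trim.
sentences = [(body + mark).strip() for body, mark in zip(parts[0::2], parts[1::2])]
print(sentences)  # ['This is an example!', 'Working as expected?', 'Because.']
```

zip stops at the shorter sequence, so a trailing piece with no terminator (here the empty string) is left out.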

Splitting longer patterns using regex without losing characters Python 3+

My program needs to split my natural language text into sentences. I made a mock sentence splitter using re.split in Python 3+. It looks like this:
re.split('\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content)
I need to split the sentence at the whitespace when the pattern occurs. But the code, as it should, will split the text at the point the pattern occurs and not at the whitespace. It will not save the last character of the sentence including the sentence terminator.
"Is this the number 3? The text goes on..."
will look like
"Is this the number " and "he text goes on..."
Is there a way I can specify at which point the data should be split while keeping my patterns or do I have to look for alternatives?
As @jonrsharpe says, you can use lookarounds to reduce the number of characters consumed by the split, for instance to a single one. If you don't mind losing the space characters, you could use something like:
>>> re.split('\s(?=[A-Z])',content)
['Is this the number 3?', 'The text goes on...']
This splits on whitespace that is followed by an uppercase letter. The T is not consumed, only the space.
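The lookahead can be combined with a lookbehind so that the split only happens after a sentence terminator. This is a sketch with a simplified rule (terminator, whitespace, uppercase letter), not a reproduction of the full pattern from the question:

```python
import re

content = "Is this the number 3? The text goes on..."

# Split on whitespace that follows ., ! or ? and precedes an uppercase letter.
# Both lookarounds are zero-width, so only the whitespace itself is consumed.
sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])', content)
print(sentences)  # ['Is this the number 3?', 'The text goes on...']
```

Unlike the lookahead-only version, this does not split before a capitalized word in the middle of a sentence.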
Alternative approach: alternating split/capture item
You can, however, use another approach. Splitting eats the matched content, but you can use the same regex to generate a list of the matches. These matches are the data that sat between the split items. By interleaving the matches with the split items, you reconstruct the full list:
import re
def nonconsumesplit(regex, content):
    outer = re.split(regex, content)
    inner = re.findall(regex, content) + ['']
    return [val for pair in zip(outer, inner) for val in pair]
Which results in:
>>> nonconsumesplit('\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]',content)
['Is this the number ', '3? ', 'The text goes on...', '']
>>> list(nonconsumesplit('\s',content))
['Is', ' ', 'this', ' ', 'the', ' ', 'number', ' ', '3?', ' ', 'The', ' ', 'text', ' ', 'goes', ' ', 'on...', '']
Or you can use a string concatenation:
def nonconsumesplitconcat(regex, content):
    outer = re.split(regex, content)
    inner = re.findall(regex, content) + ['']
    return [pair[0] + pair[1] for pair in zip(outer, inner)]
Which results in:
>>> nonconsumesplitconcat('\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]',content)
['Is this the number 3? ', 'The text goes on...']
>>> nonconsumesplitconcat('\s',content)
['Is ', 'this ', 'the ', 'number ', '3? ', 'The ', 'text ', 'goes ', 'on...']
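On Python 3.7 and later, re.split also splits on empty matches such as $, so the split/findall pairing above may produce extra empty strings. A variant that sidesteps this is to walk the matches with re.finditer and cut just after each non-empty one. This is a sketch; the name split_keep_matches is hypothetical.

```python
import re

def split_keep_matches(pattern, text):
    """Cut text after each non-empty match of pattern, keeping the matched characters.

    Hypothetical helper; empty matches (for example from $) are ignored.
    """
    pieces = []
    last = 0
    for m in re.finditer(pattern, text):
        if m.start() == m.end():
            continue  # skip zero-width matches
        pieces.append(text[last:m.end()])
        last = m.end()
    if last < len(text):
        pieces.append(text[last:])  # whatever follows the final match
    return pieces

content = "Is this the number 3? The text goes on..."
print(split_keep_matches(r'\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content))
# ['Is this the number 3? ', 'The text goes on...']
```

Since the pieces are plain slices of the input, concatenating them reproduces the original text.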

python: split string after comma and dots

I have a piece of code which splits a string after commas and dots (but not when a digit is before or after a comma or dot):
text = "This is, a sample text. Some more text. $1,200 test."
print(re.split(r'(?<!\d)[,.]|[,.](?!\d)', text))
The result is:
['This is', ' a sample text', ' Some more text', ' $1,200 test', '']
I don't want to lose the commas and dots. So what I am looking for is:
['This is,', 'a sample text.', 'Some more text.', '$1,200 test.']
Besides, if there is a dot at the end of the text, it produces an empty string at the end of the list. Furthermore, there is whitespace at the beginning of the split strings. Is there a better method without using re? How would you do this?
Unfortunately, before Python 3.7 you can't use re.split() on a zero-length match, so unless you can guarantee that there will be whitespace after the comma or dot you will need to use a different approach.
Here is one option that uses re.findall():
>>> text = "This is, a sample text. Some more text. $1,200 test."
>>> print(re.findall(r'(?:\d[,.]|[^,.])*(?:[,.]|$)', text))
['This is,', ' a sample text.', ' Some more text.', ' $1,200 test.', '']
This doesn't strip whitespace and you will get an empty match at the end if the string ends with a comma or dot, but those are pretty easy fixes.
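One way to apply both fixes is to strip each match and drop the ones that end up empty. A minimal sketch, reusing the pattern from this answer:

```python
import re

text = "This is, a sample text. Some more text. $1,200 test."

matches = re.findall(r'(?:\d[,.]|[^,.])*(?:[,.]|$)', text)

# Strip surrounding whitespace and discard the empty match at the end of the string.
pieces = [m.strip() for m in matches if m.strip()]
print(pieces)  # ['This is,', 'a sample text.', 'Some more text.', '$1,200 test.']
```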
If it is a safe assumption that there will be whitespace after every comma and dot you want to split on, then we can just split the string on that whitespace which makes it a little simpler:
>>> print(re.split(r'(?<=[,.])(?<!\d.)\s', text))
['This is,', 'a sample text.', 'Some more text.', '$1,200 test.']
