Python re can't split zero-width anchors? [duplicate] - python

This question already has answers here:
Python regex: splitting on pattern match that is an empty string
(2 answers)
Closed 5 years ago.
import re
s = 'PythonCookbookListOfContents'
# the first line does not work (on Python versions before 3.7)
print(re.split(r'(?<=[a-z])(?=[A-Z])', s))
# the second line works well
print(re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', s))
# it should be ['Python', 'Cookbook', 'List', 'Of', 'Contents']
How can I split a string at the boundary between a lowercase character and an uppercase character using Python's re module?
Why does the first line fail while the second line works?

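According to the re.split documentation, as it read before Python 3.7:
Note that split will never split a string on an empty pattern match.
Since Python 3.7, re.split does split on empty matches, so the outputs below show the older behavior.
For example: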
>>> re.split('x*', 'foo')
['foo']
>>> re.split("(?m)^$", "foo\n\nbar\n")
['foo\n\nbar\n']
How about using re.findall instead? (Instead of focusing on separators, focus on the item you want to get.)
>>> import re
>>> s = 'PythonCookbookListOfContents'
>>> re.findall('[A-Z][a-z]+', s)
['Python', 'Cookbook', 'List', 'Of', 'Contents']
UPDATE
Using the regex module (an alternative regular expression module intended to replace re), you can split on a zero-width match:
>>> import regex
>>> s = 'PythonCookbookListOfContents'
>>> regex.split('(?<=[a-z])(?=[A-Z])', s, flags=regex.VERSION1)
['Python', 'Cookbook', 'List', 'Of', 'Contents']
NOTE: Specify regex.VERSION1 flag to enable split-on-zero-length-match behavior.
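For readers on Python 3.7 or later, here is a minimal sketch of the same split using only the standard re module. It assumes the empty-match splitting introduced in Python 3.7 and falls back to the re.sub approach from the question on older interpreters; the names boundary and parts are illustrative only.

```python
import re
import sys

s = 'PythonCookbookListOfContents'

# Zero-width boundary: a lowercase letter behind, an uppercase letter ahead.
boundary = re.compile(r'(?<=[a-z])(?=[A-Z])')

if sys.version_info >= (3, 7):
    # Python 3.7+ splits on empty matches.
    parts = boundary.split(s)
else:
    # Older versions: insert a space at each boundary, then split on it.
    # This fallback assumes the input contains no spaces of its own.
    parts = boundary.sub(' ', s).split(' ')

print(parts)
```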

Related

How can I split a string based on the title of the words? [duplicate]

This question already has answers here:
Split a string at uppercase letters
(22 answers)
Closed 11 months ago.
For example I have followed string:
names = "JohnDoeEmmyGooseRichardNickson"
How can I split that string based on the title of the words?
So every time a capital letter occurs, the string will be split up.
Is there a way to do this with the split() method? (no regex)
That I will get:
namesL = ["John","Doe","Emmy","Goose","Richard","Nickson"]
You can do it with regex
>>> import re
>>> s = "TheLongAndWindingRoad ABC A123B45"
>>> re.sub( r"([A-Z])", r" \1", s).split()
# output
['The', 'Long', 'And', 'Winding', 'Road', 'A', 'B', 'C', 'A123', 'B45']
Can't do it with the split method, but doable with re:
import re
namesL = re.split(r"(?=[A-Z])", names)[1:]
Keep in mind the first entry will be an empty string (because the first word is also capitalized), so we remove it with [1:]. This relies on Python 3.7 or later, where re.split splits on empty matches.
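Since the question asks for a way without regex, a plain loop is one option. The sketch below is not taken from the answers above; split_on_capitals is a hypothetical helper name, and it assumes that "capital letter" means whatever str.isupper() reports.

```python
def split_on_capitals(text):
    """Start a new word at every uppercase character."""
    words = []
    for ch in text:
        if ch.isupper() or not words:
            # An uppercase character (or the very first character) opens a new word.
            words.append(ch)
        else:
            # Anything else extends the current word.
            words[-1] += ch
    return words


names = "JohnDoeEmmyGooseRichardNickson"
namesL = split_on_capitals(names)
print(namesL)
```

Unlike the re.split version, this never produces a leading empty string, so no [1:] slice is needed.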

Using split on multiple strings with different delimiters [duplicate]

This question already has answers here:
Split Strings into words with multiple word boundary delimiters
(31 answers)
Closed 8 years ago.
I found some answers online, but I have no experience with regular expressions, which I believe is what is needed here.
I have a string that needs to be split by either a ';' or ', '
That is, it has to be either a semicolon or a comma followed by a space. Individual commas without trailing spaces should be left untouched
Example string:
"b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3], mesitylene [000108-67-8]; polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]"
should be split into a list containing the following:
('b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3]' , 'mesitylene [000108-67-8]', 'polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]')
Luckily, Python has this built-in :)
import re
re.split('; |, ', string_to_split)
Update: Following your comment:
>>> a='Beautiful, is; better*than\nugly'
>>> import re
>>> re.split(r'; |, |\*|\n',a)
['Beautiful', 'is', 'better', 'than', 'ugly']
Do a str.replace('; ', ', ') and then a str.split(', ')
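The replace-then-split suggestion can be sketched on the string from the question. It assumes, as in that example, that the semicolon is always followed by a space; a bare ';' would not be treated as a separator here.

```python
text = ("b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3], "
        "mesitylene [000108-67-8]; "
        "polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]")

# Normalize '; ' to ', ' so a single str.split call handles both separators.
parts = text.replace('; ', ', ').split(', ')

for part in parts:
    print(part)
```

Commas without a trailing space, such as those in "1,2-dihydro-2,2,4-", are left untouched because only the two-character sequence ', ' is a separator.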
Here's a safe way for any iterable of delimiters, using regular expressions:
>>> import re
>>> delimiters = "a", "...", "(c)"
>>> example = "stackoverflow (c) is awesome... isn't it?"
>>> regex_pattern = '|'.join(map(re.escape, delimiters))
>>> regex_pattern
'a|\\.\\.\\.|\\(c\\)'
>>> re.split(regex_pattern, example)
['st', 'ckoverflow ', ' is ', 'wesome', " isn't it?"]
re.escape lets you build the pattern automatically, with every delimiter escaped properly.
Here's this solution as a function for your copy-pasting pleasure:
def split(delimiters, string, maxsplit=0):
    import re
    regex_pattern = '|'.join(map(re.escape, delimiters))
    return re.split(regex_pattern, string, maxsplit=maxsplit)
If you're going to split often using the same delimiters, compile your regular expression beforehand and call the compiled pattern's split method (documented as RegexObject.split in Python 2).
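A minimal sketch of the compile-once idea follows. make_splitter and split_fields are hypothetical names, and sorting the delimiters longest first is an added assumption so that a longer delimiter wins over a shorter one that is its prefix.

```python
import re


def make_splitter(delimiters):
    """Compile the delimiter pattern once and return its bound split method."""
    # Longest first, so that e.g. '...' is tried before '.'.
    ordered = sorted(delimiters, key=len, reverse=True)
    pattern = re.compile('|'.join(map(re.escape, ordered)))
    return pattern.split


split_fields = make_splitter([';', ', '])

first = split_fields("a, b;c,d")
second = split_fields("x;y, z")
print(first)
print(second)
```

The compiled pattern is built a single time and reused for every call to split_fields.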
If you'd like to leave the original delimiters in the string, you can change the regex to use a lookbehind assertion instead:
>>> import re
>>> delimiters = "a", "...", "(c)"
>>> example = "stackoverflow (c) is awesome... isn't it?"
>>> regex_pattern = '|'.join('(?<={})'.format(re.escape(delim)) for delim in delimiters)
>>> regex_pattern
'(?<=a)|(?<=\\.\\.\\.)|(?<=\\(c\\))'
>>> re.split(regex_pattern, example)
['sta', 'ckoverflow (c)', ' is a', 'wesome...', " isn't it?"]
(replace ?<= with ?= to attach the delimiters to the righthand side, instead of left)
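As a sketch of that lookahead variant, the same delimiters and example string can be reused with ?= in place of ?<=. This assumes Python 3.7 or later, since the pattern only ever produces empty matches.

```python
import re

delimiters = "a", "...", "(c)"
example = "stackoverflow (c) is awesome... isn't it?"

# Lookahead: split just before each delimiter, so the delimiter
# stays attached to the start of the following piece.
regex_pattern = '|'.join('(?={})'.format(re.escape(delim)) for delim in delimiters)
parts = re.split(regex_pattern, example)

print(parts)
```

Because nothing is consumed, joining the pieces back together reproduces the original string.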
In response to Jonathan's answer above, this only seems to work for certain delimiters. For example:
>>> a='Beautiful, is; better*than\nugly'
>>> import re
>>> re.split(r'; |, |\*|\n',a)
['Beautiful', 'is', 'better', 'than', 'ugly']
>>> b='1999-05-03 10:37:00'
>>> re.split('- :', b)
['1999-05-03 10:37:00']
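The pattern '- :' matches only the literal three-character sequence hyphen, space, colon, which never occurs in b. Putting the delimiters in square brackets creates a character class that matches any single one of them: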
>>> re.split('[- :]', b)
['1999', '05', '03', '10', '37', '00']
This is what the regex looks like:
import re
# "semicolon or (a comma followed by a space)"
pattern = re.compile(r";|, ")
# "(semicolon or a comma) followed by a space"
pattern = re.compile(r"[;,] ")
# text is the string you want to split
print(pattern.split(text))
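The two patterns are not interchangeable. The sketch below runs both against a made-up sample string (sample is an assumption, not from the question) to show where they differ.

```python
import re

sample = "a;b, c; d"

# "semicolon or (a comma followed by a space)"
either = re.compile(r";|, ")
# "(semicolon or a comma) followed by a space"
followed_by_space = re.compile(r"[;,] ")

either_parts = either.split(sample)
spaced_parts = followed_by_space.split(sample)

print(either_parts)
print(spaced_parts)
```

The first pattern splits on every semicolon, even one with no space after it, and leaves the space after "; " in the next piece. The second pattern ignores the bare semicolon in "a;b".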

Split string based on regexp without consuming characters [duplicate]

This question already has answers here:
Non-consuming regular expression split in Python
(2 answers)
Closed 8 years ago.
I would like to split a string like the following
text="one,two;three.four:"
into the list
textOut=["one", ",two", ";three", ".four", ":"]
I have tried with
import re
textOut = re.split(r'(?=[.:,;])', text)
But this does not split anything.
I would use re.findall here instead of re.split:
>>> from re import findall
>>> text = "one,two;three.four:"
>>> findall(r"(?:^|\W)\w*", text)
['one', ',two', ';three', '.four', ':']
>>>
Below is a breakdown of the Regex pattern used above:
(?: # The start of a non-capturing group
^|\W # The start of the string or a non-word character (symbol)
) # The end of the non-capturing group
\w* # Zero or more word characters (characters that are not symbols)
I don't know what else can occur in your string, but will this do the trick?
>>> s='one,two;three.four:'
>>> [x for x in re.findall(r'[.,;:]?\w*', s) if x]
['one', ',two', ';three', '.four', ':']
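On Python 3.7 or later, the lookahead split from the question should work as written, because re.split now splits on empty matches. A minimal sketch under that assumption:

```python
import re

text = "one,two;three.four:"

# Split just before each punctuation character without consuming it.
# Requires Python 3.7+, where re.split accepts empty matches.
textOut = re.split(r'(?=[.:,;])', text)

print(textOut)
```

On older versions the same call returns the string unsplit, which is why the findall approaches above were suggested.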

Trying to split a string [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Python: Split string with multiple delimiters
I have a small syntax problem. I have a string and another string that has a list of separators. I need to split it via the .split method.
I can't seem to figure out how; this certainly raises a TypeError:
String.split([' ', '{', '='])
How can I split it with multiple separators?
str.split() only accepts one separator.
Use re.split() to split using a regular expression.
import re
re.split(r"[ {=]", "foo bar=baz{qux")
Output:
['foo', 'bar', 'baz', 'qux']
That's not how the built-in split() method works. It simply uses a single string as the separator, not a list of single-character separators.
You can use regular-expression based splitting, instead. This would probably mean building a regular expression that is the "or" of all your desired delimiters:
import re
# re.escape keeps characters such as "{" from being read as regex syntax
splitters = "|".join(map(re.escape, [" ", "{", "="]))
re.split(splitters, my_string)
You can do this with the re (regex) library like so:
import re
result=re.split("[abc]", "my string with characters i want to split")
Where the characters in the square brackets are the characters you want to split with.
Use split from regular expressions instead:
>>> import re
>>> s = 'toto + titi = tata'
>>> re.split('[+=]', s)
['toto ', ' titi ', ' tata']
>>>
import re
string_test = "abc cde{fgh=ijk"
re.split(r'[\s{=]', string_test)
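The answers above hard-code the character class. When the separators arrive as a list, as in the question, the class can be built from it. This is only a sketch: it assumes every separator is a single character, and the trailing + and the empty-string filter are additions so that runs of separators do not produce empty pieces.

```python
import re

separators = [' ', '{', '=']

# Build a character class such as [\ \{=]+ from the list.
# re.escape guards characters that are special inside a class.
char_class = '[' + ''.join(map(re.escape, separators)) + ']+'

text = "foo  bar={baz"
# Drop empty strings that appear if the text starts or ends with a separator.
parts = [piece for piece in re.split(char_class, text) if piece]

print(parts)
```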
