How can I split at word boundaries with regexes? - python

I'm trying to do this:
import re
sentence = "How are you?"
print(re.split(r'\b', sentence))
The result being
[u'How are you?']
I want something like [u'How', u'are', u'you', u'?']. How can this be achieved?

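Before Python 3.7, re.split could not split on empty (zero-width) matches, which is why the call above returns the whole string.
To get around this, you can use findall instead of split.
\b simply means word boundary: a zero-width position between a word character and a non-word character.
Inside a string, it is equivalent to (?<=\w)(?=\W)|(?<=\W)(?=\w); it also matches at the start or end of the string when the adjacent character is a word character.
That means the following code would work: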
import re
sentence = "How are you?"
print(re.findall(r'\w+|\W+', sentence))
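For reference, here is a minimal sketch of what these calls return. The last part assumes Python 3.7 or later, where re.split does split on empty matches; the variable names are only illustrative.

```python
import re

sentence = "How are you?"

# findall keeps the whitespace runs as tokens of their own.
tokens = re.findall(r'\w+|\W+', sentence)
print(tokens)  # ['How', ' ', 'are', ' ', 'you', '?']

# Matching non-space punctuation instead of \W+ drops the spaces.
words = re.findall(r'\w+|[^\w\s]+', sentence)
print(words)  # ['How', 'are', 'you', '?']

# Assumes Python 3.7+: splitting on the zero-width \b works there,
# but it yields empty and whitespace-only pieces that need filtering.
pieces = [p for p in re.split(r'\b', sentence) if p.strip()]
print(pieces)  # ['How', 'are', 'you', '?']
```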

import re
split = re.findall(r"[\w']+|[.,!?;]", "How are you?")
print(split)
Output:
['How', 'are', 'you', '?']
Regex Explanation:
"[\w']+|[.,!?;]"
1st Alternative: [\w']+
[\w']+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\w match any word character [a-zA-Z0-9_]
' the literal character '
2nd Alternative: [.,!?;]
[.,!?;] match a single character present in the list below
.,!?; a single character in the list .,!?; literally
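Because the apostrophe is part of the first character class, contractions stay in one piece. A small sketch with a made-up sentence (the sample text is an assumption, not from the question):

```python
import re

# Same pattern as above: runs of word characters or apostrophes,
# or a single punctuation mark from the listed set.
pattern = r"[\w']+|[.,!?;]"

sample = "Don't panic, it's fine!"
tokens = re.findall(pattern, sample)
print(tokens)  # ["Don't", 'panic', ',', "it's", 'fine', '!']
```

Note that any punctuation not listed in the second class (for example a colon) is silently skipped rather than returned.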

Here is my approach to split on word boundaries:
re.split(r"\b\W\b", "How are you?") # Reprocess list to split on special characters.
# Result: ['How', 'are', 'you?']
and using findall on word boundaries
re.findall(r"\b\w+\b", "How are you?")
# Result: ['How', 'are', 'you']
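The "reprocess" step mentioned in the comment above is not shown, so here is one possible sketch of it. The second pattern is an assumption about what "split on special characters" should mean: each piece is broken into word runs and single punctuation characters.

```python
import re

# First pass: split on a single non-word character that sits
# between two words (here, the spaces).
parts = re.split(r"\b\W\b", "How are you?")
print(parts)  # ['How', 'are', 'you?']

# Second pass (assumed): separate trailing punctuation from each piece.
tokens = []
for part in parts:
    tokens.extend(re.findall(r"\w+|[^\w\s]", part))
print(tokens)  # ['How', 'are', 'you', '?']
```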

Related

python re split at all space and punctuation except for the apostrophe

I want to split a string by all spaces and punctuation except for the apostrophe sign. Preferably a single quote should still be used as a delimiter except for when it is an apostrophe. I also want to keep the delimiters.
example string
words = """hello my name is 'joe.' what's your's"""
Here is my re pattern thus far splitted = re.split(r"[^'-\w]",words.lower())
I tried throwing the single quote after the ^ character but it is not working.
My desired output is this. splitted = [hello,my,name,is,joe,.,what's,your's]
It might be simpler to post-process your list after splitting, without accounting for the quotes at first:
>>> import re
>>> words = """hello my name is 'joe.' what's your's"""
>>> split_words = re.split(r"[ ,!?]", words.lower()) # add punctuation you want to split on
>>> split_words
['hello', 'my', 'name', 'is', "'joe.'", "what's", "your's"]
>>> [word.strip("'") for word in split_words]
['hello', 'my', 'name', 'is', 'joe.', "what's", "your's"]
One option is to use lookarounds to split at the desired positions, and a capture group for what you want to keep in the split.
After the split, you can remove the empty entries from the resulting list.
\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])
The pattern matches
\s+ Match 1 or more whitespace chars
| Or
(?<=\s)' Match ' preceded by a whitespace char
| Or
'(?=\s) Match ' when followed by a whitespace char
| Or
(?<=\w)([,.!?]) Capture one of , . ! ? in group 1, when preceded by a word character
Example
import re
pattern = r"\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])"
words = """hello my name is 'joe.' what's your's"""
result = [s for s in re.split(pattern, words) if s]
print(result)
Output
['hello', 'my', 'name', 'is', 'joe', '.', "what's", "your's"]
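The filtering step matters because of how re.split treats capture groups: captured text is returned as part of the result, and a group that did not take part in a match contributes None. A reduced sketch with a simplified pattern and a made-up input (both are illustrations, not the pattern from the answer):

```python
import re

# Whitespace is dropped; punctuation is captured and therefore kept.
simple = r"\s+|([,.!?])"

raw = re.split(simple, "hi, you")
print(raw)  # ['hi', ',', '', None, 'you']

# Dropping None and empty strings leaves only the real tokens.
clean = [s for s in raw if s]
print(clean)  # ['hi', ',', 'you']
```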
I love regex golf!
words = """hello my name is 'joe.' what's your's"""
splitted = re.findall(r"\b(?:\w'\w|\w)+\b", words)
The part in the parentheses is a non-capturing group that matches either an apostrophe surrounded by word characters or a single word character.
EDIT:
This is more flexible:
re.findall(r"\b(?:(?<=\w)'(?=\w)|\w)+\b", words)
It's getting a bit unreadable at this point, though; in practice you should probably use Woodford's answer.
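To see where the two findall patterns differ, here is a small sketch. The input "rock'n'roll" is a made-up example: the first pattern consumes a word character on each side of an apostrophe, so it cannot handle two apostrophes that share a neighbour, while the lookaround version can.

```python
import re

basic = r"\b(?:\w'\w|\w)+\b"
flexible = r"\b(?:(?<=\w)'(?=\w)|\w)+\b"

words = """hello my name is 'joe.' what's your's"""

# On the question's input both patterns agree.
print(re.findall(basic, words))
# ['hello', 'my', 'name', 'is', 'joe', "what's", "your's"]

# With adjacent apostrophes only the lookaround version keeps the word whole.
first = re.findall(basic, "rock'n'roll")
second = re.findall(flexible, "rock'n'roll")
print(first)   # ["rock'n", 'roll']
print(second)  # ["rock'n'roll"]
```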

Regex to find words starting with capital letters not at beginning of sentence

I've managed to find the words beginning with capital Letters but can't figure out a regex to filter out the ones starting at the beginning of the sentence.
Each sentence ends with a full stop and a space.
Test_string = This is a Test sentence. The sentence is Supposed to Ignore the Words at the beginning of the Sentence.
Desired output = ['Test', 'Supposed', 'Ignore', 'Words', 'Sentence']
I'm coding in Python.
Will be glad if someone can help me out with the regex :)
You may use the following expression:
(?<!^)(?<!\. )[A-Z][a-z]+
import re
mystr="This is a Test sentence. The sentence is Supposed to Ignore the Words at the beginning of the Sentence."
print(re.findall(r'(?<!^)(?<!\. )[A-Z][a-z]+',mystr))
Prints:
['Test', 'Supposed', 'Ignore', 'Words', 'Sentence']
A very basic option:
[^.]\s([A-Z]\w+)
import re
s = 'This is a Test sentence. The sentence is Supposed to Ignore the Words at the beginning of the Sentence, And others.'
re.findall(r'[^.]\s([A-Z]\w+)', s)
output
['Test', 'Supposed', 'Ignore', 'Words', 'Sentence', 'And']
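The question states that every sentence ends with a full stop and a space. If sentences could also end with "!" or "?", the lookbehind from the first answer can be widened to a character class. This is a sketch under that assumption, with a made-up test string:

```python
import re

# (?<!^) rejects a match at the very start of the string;
# (?<![.!?] ) rejects a match right after a sentence terminator plus space.
pattern = r'(?<!^)(?<![.!?] )[A-Z][a-z]+'

original = ("This is a Test sentence. The sentence is Supposed to Ignore "
            "the Words at the beginning of the Sentence.")
print(re.findall(pattern, original))
# ['Test', 'Supposed', 'Ignore', 'Words', 'Sentence']

mixed = "Is it a Test? Yes it Is! Really Good."
result = re.findall(pattern, mixed)
print(result)  # ['Test', 'Is', 'Good']
```

Like the original expression, this only recognises a capital letter followed by lowercase ASCII letters.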

Find all the words in string except one word in python with regex

I'm working with regex in python and I'd like to search for all the words in a string except one word.
Code:
import re
string = "The world is too big"
print re.findall("regex", string)
If I want to get all the words except for the word "too" (so the output will be ["The", "world", "is", "big"]), how can I implement this in regex?
You don't even need to use regex for this task, simply use split and filter:
sentence = "The world is too big"
sentence = list(filter(lambda x: x != 'too', sentence.split()))
print(sentence)
Delete 'too' from the string, then split the string:
re.sub(r'\btoo\b','',string).split()
Out[15]: ['The', 'world', 'is', 'big']
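If a single findall call is preferred, a negative lookahead can skip the unwanted word. This is a sketch rather than any of the answers above; the helper name words_except is hypothetical.

```python
import re

def words_except(text, excluded):
    """Return all words in text except the exact word `excluded`."""
    # \b anchors at the start of a word; (?!excluded\b) refuses to match
    # there if the whole word equals `excluded`.
    pattern = r'\b(?!' + re.escape(excluded) + r'\b)\w+'
    return re.findall(pattern, text)

result = words_except("The world is too big", "too")
print(result)  # ['The', 'world', 'is', 'big']

# Words that merely start with the excluded word are kept.
print(words_except("too tool", "too"))  # ['tool']
```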

Python reverse iteration to split a word

I need to split my words from my punctuation-
I'm thinking about having the function look at each word and determining if there is a punctuation mark in it by starting at [-1], index of -1, and then splitting the word from the punctuation as soon as it hits a letter and not a punctuation mark...
sent=['I', 'went', 'to', 'the', 'best', 'movie!','Do','you', 'want', 'to', 'see', 'it', 'again!?!']
import string
def revF(List):
    for word in List:
        for ch in word[::-1]:
            if ch is string.punctuation:
                #go to next character and check if it's punctuation
                newList= #then split the word between the last letter and first punctuation mark
    return newList
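If all you want is to remove punctuation from a string, Python provides much better ways of doing this. The call below is the Python 2 form; in Python 3, use line.translate(str.maketrans('', '', string.punctuation)) instead: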
>>> import string
>>> line
'I went to the best movie! Do you want to see it again!?!'
>>> line.translate(None, string.punctuation)
'I went to the best movie Do you want to see it again'
From your example, I gather that you want to split your string on a space. You can do this quite simply like this:
my_str = "I went to the best movie! Do you want to see it again!?!"
sent = my_str.split(' ')
Use a regexp containing the punctuation you want to split on:
re.split(r"[!?.]", text)
Demonstration:
>>> import re
>>> re.split(r"[!?.]", 'bra det. där du! som du gjorde?')
['bra det', ' d\xc3\xa4r du', ' som du gjorde', '']
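None of the answers above keeps the punctuation as separate items, which is what the question describes. A minimal sketch of that idea: str.rstrip does the "walk backwards from the end" part, so no explicit reverse loop is needed. The function name split_trailing_punctuation is hypothetical, and only trailing punctuation is handled.

```python
import string

def split_trailing_punctuation(words):
    """Split each word into its core and its trailing punctuation run."""
    result = []
    for word in words:
        # rstrip removes punctuation from the right until it hits
        # a character that is not punctuation.
        core = word.rstrip(string.punctuation)
        tail = word[len(core):]
        if core:
            result.append(core)
        if tail:
            result.append(tail)
    return result

sent = ['I', 'went', 'to', 'the', 'best', 'movie!', 'Do', 'you',
        'want', 'to', 'see', 'it', 'again!?!']
new_list = split_trailing_punctuation(sent)
print(new_list)
# ['I', 'went', 'to', 'the', 'best', 'movie', '!', 'Do', 'you',
#  'want', 'to', 'see', 'it', 'again', '!?!']
```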

python regex find all words in text

This sounds very simple, I know, but for some reason I can't get all the results I need
A word in this case is any run of non-white-space characters that is separated by white-space
for example in the following string: "Hello there stackoverflow."
the result should be: ['Hello','there','stackoverflow.']
My code:
import re
word_pattern = "^\S*\s|\s\S*\s|\s\S*$"
result = re.findall(word_pattern,text)
print result
but after using this pattern on a string like the one I've shown, it only puts the first and the last words in the list and not the words separated with two spaces
What is the problem with this pattern?
Use the \b boundary test instead:
r'\b\S+\b'
Result:
>>> import re
>>> re.findall(r'\b\S+\b', 'Hello there StackOverflow.')
['Hello', 'there', 'StackOverflow']
or not use a regular expression at all and just use .split(); the latter would include the punctuation in a sentence (the regex above did not match the . in the sentence).
To find all the words in a string, it is best to use split:
>>> "Hello there stackoverflow.".split()
['Hello', 'there', 'stackoverflow.']
But if you must use regular expressions, you should change your regex to something simpler and faster: r'\b\S+\b'.
r makes the string a 'raw' string, meaning Python does not treat backslashes in it as escape characters.
\b means a word boundary: a zero-width position between a word character and a non-word character (such as a space, newline, or punctuation), or the start or end of the string next to a word character.
\S, as you should know, is any non-whitespace character.
+ means one or more of the previous.
So together it means: find every run of non-whitespace characters that begins and ends with a word character (words/numbers).
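A short sketch of how this pattern behaves next to plain split(). The sample text is made up; the point is that leading and trailing punctuation is trimmed by the \b anchors, while punctuation inside a token is kept.

```python
import re

text = "Hello, (world) it's 3.14!"

# \b forces each match to start and end on a word character,
# so surrounding punctuation is dropped but inner punctuation stays.
bounded = re.findall(r'\b\S+\b', text)
print(bounded)  # ['Hello', 'world', "it's", '3.14']

# split() keeps every non-whitespace character.
plain = text.split()
print(plain)  # ['Hello,', '(world)', "it's", '3.14!']
```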
How about simply using -
>>> s = "Hello there stackoverflow."
>>> s.split()
['Hello', 'there', 'stackoverflow.']
The other answers are good. Depending on what you want (eg. include/exclude punctuation or other non-word characters) an alternative could be to use a regex to split by one or more whitespace characters:
re.split(r'\s+', 'Hello there StackOverflow.')
['Hello', 'there', 'StackOverflow.']
