Python reverse iteration to split a word - python

I need to split my words from my punctuation.
I'm thinking about having the function look at each word and determine whether it ends in punctuation by starting at index -1, then splitting the word from the punctuation as soon as it hits a letter rather than a punctuation mark.
sent = ['I', 'went', 'to', 'the', 'best', 'movie!', 'Do', 'you', 'want', 'to', 'see', 'it', 'again!?!']
import string
def revF(List):
    for word in List:
        for ch in word[::-1]:
            if ch in string.punctuation:
                # go to next character and check if it's punctuation
        newList = # then split the word between the last letter and first punctuation mark
    return newList
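The idea described above (scan each word backwards from index -1 until the first non-punctuation character, then cut there) could be sketched roughly as follows. This is only one reading of the intent: the function name `split_trailing_punctuation` is made up for illustration, and it assumes the trailing punctuation should become its own list item.

```python
import string

def split_trailing_punctuation(words):
    """Split trailing punctuation off each word, scanning from the end."""
    result = []
    for word in words:
        cut = len(word)
        # Move the cut point left while the character before it is punctuation.
        while cut > 0 and word[cut - 1] in string.punctuation:
            cut -= 1
        if cut > 0:
            result.append(word[:cut])   # the word itself
        if cut < len(word):
            result.append(word[cut:])   # the run of trailing punctuation
    return result

sent = ['I', 'went', 'to', 'the', 'best', 'movie!', 'Do', 'you',
        'want', 'to', 'see', 'it', 'again!?!']
print(split_trailing_punctuation(sent))
```

Punctuation inside a word (such as the apostrophe in "don't") is left alone, because the scan stops at the first letter it meets from the right.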

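If all you want is to remove punctuation from a string, Python provides much better ways of doing this:
>>> import string
>>> line
'I went to the best movie! Do you want to see it again!?!'
>>> line.translate(str.maketrans('', '', string.punctuation))
'I went to the best movie Do you want to see it again'
(On Python 2 the equivalent call is line.translate(None, string.punctuation); that two-argument form does not work on Python 3.)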

From your example, I gather that you want to split your string on a space. You can do this quite simply like this:
my_str = "I went to the best movie! Do you want to see it again!?!"
sent = my_str.split(' ')

Use a regexp containing the punctuation you want to split on:
re.split(r"[!?.]", text)
Demonstration:
>>> import re
>>> re.split(r"[!?.]", 'bra det. där du! som du gjorde?')
['bra det', ' där du', ' som du gjorde', '']
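If the punctuation should be kept rather than thrown away, one option is to wrap the character class in a capture group, since re.split then returns the matched delimiters as well. A small sketch, assuming runs of !, ? and . should stay together as one piece:

```python
import re

text = "I went to the best movie! Do you want to see it again!?!"

# The parentheses make re.split include each matched run of punctuation.
parts = re.split(r"([!?.]+)", text)

# Drop empty pieces and trim the whitespace left around each piece.
pieces = [p.strip() for p in parts if p.strip()]
print(pieces)
```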

Related

python re split at all space and punctuation except for the apostrophe

I want to split a string on all spaces and punctuation except for the apostrophe. Preferably a single quote should still be treated as a delimiter unless it is acting as an apostrophe. I also want to keep the delimiters.
Example string:
words = """hello my name is 'joe.' what's your's"""
Here is my re pattern so far:
splitted = re.split(r"[^'-\w]", words.lower())
I tried putting the single quote after the ^ character, but it is not working.
My desired output is:
splitted = [hello,my,name,is,joe,.,what's,your's]
It might be simpler to ignore the quotes while splitting and clean up the list afterwards:
>>> import re
>>> words = """hello my name is 'joe.' what's your's"""
>>> split_words = re.split(r"[ ,.!?]", words.lower()) # add punctuation you want to split on
>>> split_words
['hello', 'my', 'name', 'is', "'joe", "'", "what's", "your's"]
>>> [word.strip("'") for word in split_words if word.strip("'")] # strip quotes, drop empty pieces
['hello', 'my', 'name', 'is', 'joe', "what's", "your's"]
Note that this approach discards the delimiters instead of keeping them.
One option is to use lookarounds to split at the desired positions, with a capture group around the punctuation you want to keep in the result.
After the split, you can remove the empty entries from the resulting list.
\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])
The pattern matches
\s+ Match 1 or more whitespace chars
| Or
(?<=\s)' Match ' preceded by a whitespace char
| Or
'(?=\s) Match ' when followed by a whitespace char
| Or
(?<=\w)([,.!?]) Capture one of , . ! ? in group 1, when preceded by a word character
Example
import re
pattern = r"\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])"
words = """hello my name is 'joe.' what's your's"""
result = [s for s in re.split(pattern, words) if s]
print(result)
Output
['hello', 'my', 'name', 'is', 'joe', '.', "what's", "your's"]
I love regex golf!
words = """hello my name is 'joe.' what's your's"""
splitted = re.findall(r"\b(?:\w'\w|\w)+\b", words)
The part in the parentheses is a non-capturing group that matches either an apostrophe surrounded by word characters or a single word character.
EDIT:
This is more flexible (it also copes with words that contain more than one apostrophe):
re.findall(r"\b(?:(?<=\w)'(?=\w)|\w)+\b", words)
It's getting a bit unreadable at this point though; in practice you should probably use Woodford's answer.
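To see where the two findall patterns differ, here is a small comparison. The test string "rock'n'roll" is an illustrative input, not one from the question. In the first pattern each apostrophe consumes the word character on both sides, so two apostrophes separated by a single letter cannot both match; the lookaround version consumes only the apostrophe itself.

```python
import re

simple = r"\b(?:\w'\w|\w)+\b"
flexible = r"\b(?:(?<=\w)'(?=\w)|\w)+\b"

words = """hello my name is 'joe.' what's your's"""
tricky = "rock'n'roll"

# Both patterns agree on the string from the question.
print(re.findall(simple, words))
print(re.findall(flexible, words))

# They differ once a word has apostrophes one letter apart.
print(re.findall(simple, tricky))
print(re.findall(flexible, tricky))
```

Neither pattern keeps the full stop after joe, so they cover only the word-splitting part of the question.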

Regex to find words starting with capital letters not at beginning of sentence

I've managed to find the words beginning with capital Letters but can't figure out a regex to filter out the ones starting at the beginning of the sentence.
Each sentence ends with a full stop and a space.
Test_string = This is a Test sentence. The sentence is Supposed to Ignore the Words at the beginning of the Sentence.
Desired output = ['Test', 'Supposed', 'Ignore', 'Words', 'Sentence']
I'm coding in Python.
Will be glad if someone can help me out with the regex :)
You may use the following expression:
(?<!^)(?<!\. )[A-Z][a-z]+
import re
mystr="This is a Test sentence. The sentence is Supposed to Ignore the Words at the beginning of the Sentence."
print(re.findall(r'(?<!^)(?<!\. )[A-Z][a-z]+',mystr))
Prints:
['Test', 'Supposed', 'Ignore', 'Words', 'Sentence']
A very basic option: capture a capitalized word that follows whitespace, as long as the character before that whitespace is not a full stop.
[^.]\s([A-Z]\w+)
import re
s = 'This is a Test sentence. The sentence is Supposed to Ignore the Words at the beginning of the Sentence, And others.'
re.findall(r'[^.]\s([A-Z]\w+)', s)
output
['Test', 'Supposed', 'Ignore', 'Words', 'Sentence', 'And']
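The question states that every sentence ends with a full stop and a space. If sentences could also end with ! or ?, the lookbehind approach might be extended as sketched below. The widened character class and the second test string are assumptions for illustration, not part of the original question.

```python
import re

# (?<!^) rejects a match at the very start of the text;
# (?<![.!?] ) rejects a match right after a sentence terminator plus a space.
pattern = re.compile(r"(?<!^)(?<![.!?] )[A-Z][a-z]+")

original = ("This is a Test sentence. The sentence is Supposed to Ignore "
            "the Words at the beginning of the Sentence.")
mixed = "Is this a Test? Yes it Is! Maybe Not."

print(pattern.findall(original))
print(pattern.findall(mixed))
```

The lookbehind has a fixed width of two characters, which Python's re module requires, so this sketch does not handle several spaces or a newline after the terminator.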

How can I split at word boundaries with regexes?

I'm trying to do this:
import re
sentence = "How are you?"
print(re.split(r'\b', sentence))
The result being
[u'How are you?']
I want something like [u'How', u'are', u'you', u'?']. How can this be achieved?
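Unfortunately, before Python 3.7 re.split could not split on an empty match, and \b always matches an empty string.
To get around this, you would need to use findall instead of split.
\b just means a word boundary.
Away from the start and end of the string, it is equivalent to (?<=\w)(?=\W)|(?<=\W)(?=\w).
That means the following code would work (note that the whitespace between the words comes back as tokens too):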
import re
sentence = "How are you?"
print(re.findall(r'\w+|\W+', sentence))
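On Python 3.7 and later, re.split does accept a pattern that can match an empty string, so the original attempt works with a little post-processing. A minimal sketch, assuming Python 3.7+ and that whitespace-only pieces should be dropped:

```python
import re

sentence = "How are you?"

# Splitting at every word boundary yields words, the gaps between them,
# and an empty string for the boundary at the very start.
parts = re.split(r"\b", sentence)

# Keep only the pieces that contain something other than whitespace.
tokens = [p for p in parts if p.strip()]
print(tokens)
```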
import re
split = re.findall(r"[\w']+|[.,!?;]", "How are you?")
print(split)
Output:
['How', 'are', 'you', '?']
Regex explanation:
"[\w']+|[.,!?;]"
1st alternative: [\w']+
    [\w']+ matches a single character from the set below
    Quantifier +: between one and unlimited times, as many times as possible, giving back as needed (greedy)
    \w matches any word character [a-zA-Z0-9_]
    ' matches the literal character '
2nd alternative: [.,!?;]
    [.,!?;] matches a single character from the set below
    .,!?; matches one of the characters .,!?; literally
Here is my approach to split on word boundaries:
re.split(r"\b\W\b", "How are you?") # Reprocess list to split on special characters.
# Result: ['How', 'are', 'you?']
and using findall on word boundaries
re.findall(r"\b\w+\b", "How are you?")
# Result: ['How', 'are', 'you']

Creating a function that returns all capitalized words in a sentence (commas excluded)

I need to create a function that will return all capitalized words from a sentence into a list. If the word ends with a comma, you need to exclude it (the comma). This is what I came up with:
def find_cap(sentence):
    s = []
    for word in sentence.split():
        if word.startswith(word.capitalize()):
            s.append(word)
        if word.endswith(","):
            word.replace(",", "")
    return s
My problem: The function seems to work, but if I have a sentence and a word is in quotes, it returns the word in quotes even if it isn't capitalized. Also the commas aren't replaced, even though I used word.replace(",", ""). Any tips would be appreciated.
Strings are an immutable type in Python. This means that word.replace(",", "") will not mutate the string word is pointing at; it will return a new string with the commas replaced.
Also, since this is a stripping problem (and commas are not likely to be in the middle of words), why not use str.strip() instead?
Try something like this:
import string
def find_cap(sentence):
    s = []
    for word in sentence.split():
        # strip() removes every leading and trailing character found in its argument
        word = word.strip(string.punctuation)
        # skip tokens that were nothing but punctuation and are now empty
        if word and word.startswith(word.capitalize()):
            s.append(word)
    return s
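As a quick check of the stripping idea against the two problems from the question (quoted words and trailing commas), here is a self-contained variant. The name `find_cap_sketch` and the sample sentence are made up for illustration, and it tests only the first character with isupper(), which is an assumption about what "capitalized" should mean (it also accepts all-caps words).

```python
import string

def find_cap_sketch(sentence):
    """Return words whose first character is uppercase, without edge punctuation."""
    caps = []
    for word in sentence.split():
        cleaned = word.strip(string.punctuation)
        # cleaned[:1] is '' for an empty string, and ''.isupper() is False,
        # so tokens made only of punctuation are skipped.
        if cleaned[:1].isupper():
            caps.append(cleaned)
    return caps

print(find_cap_sketch('Hello, "world" and "Python", NASA -- ok'))

# str.replace returns a new string; the original is left untouched.
word = "Hello,"
word.replace(",", "")
print(word)
```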
Use a regular expression to do this:
>>> import re
>>> string = 'This Is a String With a Comma, Capital and small Letters'
>>> newList = re.findall(r'([A-Z][a-z]*)', string)
>>> newList
['This', 'Is', 'String', 'With', 'Comma', 'Capital', 'Letters']
Using re.findall:
import re
a = "Hellow how, Are You"
re.findall('[A-Z][a-z]+', a)
# ['Hellow', 'Are', 'You']

Python Regular Expression help. Combinations

I'm trying to read from a text file and create a list of seed words that begin a sentence and a second list containing all adjacent words excluding the seed words.
The problem I'm encountering is that words containing an apostrophe get split after the apostrophe and the rest of the word omitted. How would you keep them as they appear in the file?
Text contained in file:
This doesn't seem to work. Is findall or sub the correct approach? Or neither?
CODE:
my_string = open('sample.txt', 'r').read()
starter = list(set(re.findall(r"(?<![a-z]\s)[A-Z]\w+", my_string)))
adjacent = re.findall(r"(?<!(?<![a-z]\s))\w+", my_string)
print(my_string)
RESULT:
['doesn', 'seem', 'to', 'work', 'sub', 'or', 'findall', 'the', 'correct', 'approach', 'neither']
It is easier with two regexes:
import re
txt = """\
This doesn't seem to work. Is findall or sub the correct approach? Or neither? Isn't it grand?
"""
# words at the start of the text or right after a sentence terminator and whitespace
first_words = set(re.findall(r'(?:^|(?:[.!?]\s))(\b[a-zA-Z\']+)', txt))
# every other word
rest = {word for word in re.findall(r'(\b[a-zA-Z\']+)', txt) if word not in first_words}
print(first_words)
# {'This', 'Is', 'Or', "Isn't"} (set order may vary)
print(rest)
# {"doesn't", 'sub', 'grand', 'the', 'work', 'it', 'findall', 'to', 'neither', 'correct', 'seem', 'approach', 'or'} (set order may vary)
The problem I'm encountering is that words containing an apostrophe get split after the apostrophe and the rest of the word omitted.
\w+ isn't your friend here. It is shorthand for one or more letters, digits, and underscores. It does not include hyphens or apostrophes.
Use a character range instead. That way you can include apostrophes and exclude numbers and underscores:
r"[A-Za-z\']+" # works better than \w+
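The difference is easy to see side by side. A minimal sketch, using a shortened sample sentence chosen for illustration:

```python
import re

text = "This doesn't seem to work. Isn't it grand?"

# \w+ stops at the apostrophe, so each contraction is cut in two.
with_w = re.findall(r"\w+", text)

# An explicit character class that includes ' keeps contractions whole.
with_class = re.findall(r"[A-Za-z']+", text)

print(with_w)
print(with_class)
```

The character class also matches a lone or leading quote character, so quoted text such as 'joe' would keep its quotes; stripping them afterwards with str.strip("'") is one way around that.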
