I am trying to split input text at spaces and at special characters like punctuation, while keeping the delimiters. My re pattern works exactly the way I want, except that it will not split apart consecutive punctuation marks.
Here is my re pattern: wordsWithPunc = re.split(r'([^-\w]+)', words)
If I have a word like "hello" with two punctuation marks after it, the punctuation is split off but stays together as a single element. For example,
"hello,-" will equal "hello", ",-" but I want it to be "hello", ",", "-"
Another example: "My name is mud!!!" would be split into "My", "name", "is", "mud", "!!!" but I want it to be "My", "name", "is", "mud", "!", "!", "!"
You need to drop the + so the pattern matches a single non-word character at a time (with the + it greedily consumes whole runs of them), something like:
import re
words = 'My name is mud!!!'
splitted = re.split(r'([^-\w])', words)
# ['My', ' ', 'name', ' ', 'is', ' ', 'mud', '!', '', '!', '', '!', '']
This will also produce 'empty' matches between consecutive non-word characters (because you're splitting on each of them), but you can mitigate that by postprocessing the result to remove the empty matches:
splitted = [match for match in re.split(r'([^-\w])', words) if match]
# ['My', ' ', 'name', ' ', 'is', ' ', 'mud', '!', '!', '!']
You can further strip spaces in the comprehension (i.e. ... if match.strip() ...) if you want to get rid of the space matches as well.
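For example, filtering on match.strip() drops both the empty strings and the space matches (a small sketch based on the snippet above):
splitted = [match for match in re.split(r'([^-\w])', words) if match.strip()]
# ['My', 'name', 'is', 'mud', '!', '!', '!']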
I am attempting to write a regex in Python that will match all non-word characters (spaces, slashes, colons, etc.) excluding those that are part of a URL. I know I can match all non-word characters with \W+, and I also have a regex to match URLs: https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+|https*:\/\/[\w\.]+\.[a-zA-Z]*|[\w\.]+\.[a-zA-Z]*\/[\w\/\-]+ but I can't figure out a way to combine them. What would be the best way to get what I need here?
EDIT
To clarify, I am trying to split on this regex. When I attempt to use re.split() with the following regex: https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+|https*:\/\/[\w\.]+\.[a-zA-Z]*|[\w\.]+\.[a-zA-Z]*\/[\w\/\-]+|(\W) I end up with something like the following:
INPUT:
this is a test: https://www.google.com
OUTPUT:
['this', ' ', 'is', ' ', 'a', ' ', 'test', ':', '', ' ', '', None, '']
What I'm hoping to get is this:
['this', 'is', 'a', 'test', 'https://www.google.com']
This is how I'm splitting:
import re
message = 'this is a test: https://www.google.com'
re.split("https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+|https*:\/\/[\w\.]+\.[a- zA-Z]*|[\w\.]+\.[a-zA-Z]*\/[\w\/\-]+|(\W)", message)
You should use reverse logic: match a URL pattern or any run of one or more word chars:
import re
rx = r"https*://[\w.]+\.[\w/-]*|[\w.]+\.[a-zA-Z]*/[\w/-]+|\w+"
message = 'this is a test: https://www.google.com'
print( re.findall(rx, message) )
# => ['this', 'is', 'a', 'test', 'https://www.google.com']
Note that I shortened your URL pattern: you had two similar alternatives, https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+ and https*:\/\/[\w\.]+\.[a-zA-Z]*, where [a-zA-Z]* is redundant because it matches zero or more letters and the following [\w\/\-]+ already requires one or more letter, / or - chars. You also do not have to escape dots inside character classes, nor slashes at all; the unnecessary escapes are removed here.
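For what it's worth, the second alternative also picks up scheme-less URLs that have a path; a quick sketch with a made-up input:
print( re.findall(rx, 'see example.com/docs for details') )
# => ['see', 'example.com/docs', 'for', 'details']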
I want to split a string at all spaces and punctuation except for the apostrophe. Preferably a single quote should still be used as a delimiter, except when it is an apostrophe. I also want to keep the delimiters.
example string
words = """hello my name is 'joe.' what's your's"""
Here is my re pattern thus far: splitted = re.split(r"[^'-\w]", words.lower())
I tried throwing the single quote after the ^ character but it is not working.
My desired output is this: splitted = [hello, my, name, is, joe, ., what's, your's]
It might be simpler to process your list after splitting, without accounting for the apostrophes in the pattern at first:
>>> words = """hello my name is 'joe.' what's your's"""
>>> split_words = re.split(r"[ ,!?]", words.lower()) # add punctuation you want to split on (omitting . keeps 'joe.' intact)
>>> split_words
['hello', 'my', 'name', 'is', "'joe.'", "what's", "your's"]
>>> [word.strip("'") for word in split_words]
['hello', 'my', 'name', 'is', 'joe.', "what's", "your's"]
One option is to make use of lookarounds to split at the desired positions, and use a capture group for the parts you want to keep in the split.
After the split, you can remove the empty entries from the resulting list.
\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])
The pattern matches
\s+ Match 1 or more whitespace chars
| Or
(?<=\s)' Match ' preceded by a whitespace char
| Or
'(?=\s) Match ' when followed by a whitespace char
| Or
(?<=\w)([,.!?]) Capture one of , . ! ? in group 1, when preceded by a word character
Example
import re
pattern = r"\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])"
words = """hello my name is 'joe.' what's your's"""
result = [s for s in re.split(pattern, words) if s]
print(result)
Output
['hello', 'my', 'name', 'is', 'joe', '.', "what's", "your's"]
I love regex golf!
words = """hello my name is 'joe.' what's your's"""
splitted = re.findall(r"\b(?:\w'\w|\w)+\b", words)
The part in the parentheses is a group that matches either an apostrophe surrounded by letters or a single letter.
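For the example string above this yields (note that, unlike the desired output, the lone . delimiter is dropped):
print(splitted)
# ['hello', 'my', 'name', 'is', 'joe', "what's", "your's"]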
EDIT:
This is more flexible:
re.findall(r"\b(?:(?<=\w)'(?=\w)|\w)+\b", words)
It's getting a bit unreadable at this point though, in practice you should probably use Woodford's answer.
I'm currently trying to tokenize some language data using Python and was curious if there was an efficient or built-in method for splitting strings of sentences into separate words and also separate punctuation characters. For example:
"Hello, my name is John. What's your name?"
If I used split() on this sentence then I would get
['Hello,', 'my', 'name', 'is', 'John.', "What's", 'your', 'name?']
What I want to get is:
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
I've tried searching the string for punctuation, storing the indices, removing the punctuation from the string, splitting the string, and then inserting the punctuation back in, but this approach seems too inefficient, especially when dealing with large corpora.
Does anybody know if there's a more efficient way to do this?
Thank you.
You can do a trick:
text = "Hello, my name is John. What's your name?"
text = text.replace(",", " , ") # Add an space before and after the comma
text = text.replace(".", " . ") # Add an space before and after the point
text = text.replace(" ", " ") # Remove possible double spaces
mListtext.split(" ") # Generates your list
Or just this with input:
mList = input().replace(",", " , ").replace(".", " . ").replace("  ", " ").split(" ")
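For the sample sentence this gives (each punctuation mark you care about needs its own replace, so the trailing '?' stays attached here):
text = "Hello, my name is John. What's your name?"
mList = text.replace(",", " , ").replace(".", " . ").replace("  ", " ").split(" ")
# ['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name?']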
Here is an approach using re.finditer which at least seems to work with the sample data you provided:
inp = "Hello, my name is John. What's your name?"
parts = []
for match in re.finditer(r'[^.,?!\s]+|[.,?!]', inp):
parts.append(match.group())
print(parts)
Output:
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
The idea here is to match one of the following two patterns:
[^.,?!\s]+ which matches one or more characters that are neither punctuation nor whitespace
[.,?!] which matches a single punctuation character
Presumably anything which is not whitespace or punctuation should be a matching word/term in the sentence.
Note that the really nice way to solve this problem would be a regex split on punctuation or whitespace boundaries. But before Python 3.7, re.split could not split on zero-width lookarounds, so we are forced to use re.finditer instead.
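For example, on Python 3.7 or newer (where re.split accepts patterns that can match the empty string), a zero-width split gives the same result; a sketch splitting on runs of whitespace or immediately before a punctuation char:
parts = re.split(r'\s+|(?=[.,?!])', inp)
# ['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']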
Word tokenisation is not as trivial as it sounds. The previous answers using regular expressions or string replacement won't always deal with things such as acronyms or abbreviations (e.g. a.m., p.m., N.Y., D.I.Y., A.D., B.C., e.g., etc., i.e., Mr., Ms., Dr.). These will be split into separate tokens (e.g. B, ., C, .) by such approaches unless you write more complex patterns to deal with such cases (and there will always be annoying exceptions). You will also have to decide what to do with other punctuation like " and ', $, %, such things as email addresses and URLs, sequences of digits (e.g. 5,000.99, 33.3%), hyphenated words (e.g. pre-processing, avant-garde), names that include punctuation (e.g. O'Neill), contractions (e.g. aren't, can't, let's), the English possessive marker ('s), and so on.
I recommend using an NLP library to do this, as they should be set up to deal with most of these issues for you (although they do still make "mistakes" that you can try to fix). See:
spaCy (especially geared towards efficiency on large corpora)
NLTK
Stanford CoreNLP
TreeTagger
The first three are full toolkits with many functionalities besides tokenisation. The last is a part-of-speech tagger that tokenises the text. These are just a few and there are other options out there, so try some out and see which works best for you. They will all tokenise your text differently, but in most cases (not sure about TreeTagger) you can modify their tokenisation decisions to correct mistakes.
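As a quick illustration, here is a minimal sketch with NLTK (assuming nltk is installed and its 'punkt' tokenizer data has been downloaded); note that it makes different decisions than the regex answers, e.g. it splits "What's" into two tokens:
from nltk.tokenize import word_tokenize
print(word_tokenize("Hello, my name is John. What's your name?"))
# e.g. ['Hello', ',', 'my', 'name', 'is', 'John', '.', 'What', "'s", 'your', 'name', '?']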
You can use re.sub to insert a space before each char defined in string.punctuation that is not followed by a word character (the \B), and finally use str.split to split the words
>>> s = "Hello, my name is John. What's your name?"
>>>
>>> import string, re
>>> re.sub(fr'([{string.punctuation}])\B', r' \1', s).split()
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
In Python 2:
>>> re.sub(r'([%s])\B' % string.punctuation, r' \1', s).split()
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
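Note that the raw string.punctuation happens to form a valid character class here (its embedded ] is preceded by a backslash), but escaping it with re.escape is safer when building classes this way:
>>> re.sub(fr'([{re.escape(string.punctuation)}])\B', r' \1', s).split()
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']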
TweetTokenizer from nltk can also be used for this.
from nltk.tokenize import TweetTokenizer
tokenizer = TweetTokenizer()
tokenizer.tokenize('''Hello, my name is John. What's your name?''')
# output
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
My program needs to split my natural language text into sentences. I made a mock sentence splitter using re.split in Python 3+. It looks like this:
re.split(r'\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content)
I need the split to happen at the whitespace where the pattern occurs. But re.split, as it should, splits the text at the point the whole pattern matches and consumes it, so the sentence terminator and the first character of the next sentence are lost.
"Is this the number 3? The text goes on..."
will look like
"Is this the number " and "he text goes on..."
Is there a way I can specify at which point the data should be split while keeping my patterns or do I have to look for alternatives?
As #jonrsharpe says, one can use lookaround to reduce the number of characters split away, for instance to a single one. If you don't mind losing the space characters, you could use something like:
>>> re.split(r'\s(?=[A-Z])', content)
['Is this the number 3?', 'The text goes on...']
You split at a space whenever the next character is uppercase; the T is not consumed, only the space.
Alternative approach: alternating split/capture item
You can however use another approach. When you split, you consume content, but you can use the same regex with re.findall to generate a list of the matches, which are exactly the data that sat in between the split pieces. By interleaving the matches with the split pieces, you reconstruct the full list:
import re

def nonconsumesplit(regex, content):
    outer = re.split(regex, content)
    inner = re.findall(regex, content) + ['']  # pad so zip covers the last piece
    return [val for pair in zip(outer, inner) for val in pair]
Which results in:
>>> nonconsumesplit(r'\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content)
['Is this the number ', '3? ', 'The text goes on...', '']
>>> list(nonconsumesplit(r'\s', content))
['Is', ' ', 'this', ' ', 'the', ' ', 'number', ' ', '3?', ' ', 'The', ' ', 'text', ' ', 'goes', ' ', 'on...', '']
Or you can use a string concatenation:
def nonconsumesplitconcat(regex, content):
    outer = re.split(regex, content)
    inner = re.findall(regex, content) + ['']
    return [pair[0] + pair[1] for pair in zip(outer, inner)]
Which results in:
>>> nonconsumesplitconcat(r'\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content)
['Is this the number 3? ', 'The text goes on...']
>>> nonconsumesplitconcat(r'\s', content)
['Is ', 'this ', 'the ', 'number ', '3? ', 'The ', 'text ', 'goes ', 'on...']
I would like to replace all spaces in a string with '#', except for the trailing space at the end of the string.
Example:
input = ' hello  world '
output = '#hello##world'
I know that using rstrip() I can remove the space at the end of the string; I just want to try it without using rstrip().
Use regular expressions.
import re
a = ' hello  world '
a = re.sub(' +$', '', a)
output = re.sub(' ', '#', a)
but really, this is better:
output = re.sub(' ', '#', a.rstrip())
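Applied to the sample input, both versions print the expected result:
print(output)  # '#hello##world'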