I'm trying to split a string in Python using regular expressions. This line works almost perfectly for me:
from string import punctuation
import re
row = re.findall(r'\w+|[{0}]+'.format(punctuation), string)
However, it doesn't also split the string on instances of _. For instance:
>>> string = "Hi my name is _Mark. I like apples!! Do you?!"
>>> row = re.findall(r'\w+|[{0}]+'.format(punctuation), string)
>>> row
['Hi', 'my', 'name', 'is', '_Mark', '.', 'I', 'like', 'apples', '!!', 'Do', 'you', '?!']
What I want is:
['Hi', 'my', 'name', 'is', '_', 'Mark', '.', 'I', 'like', 'apples', '!!', 'Do', 'you', '?!']
I've read it's because _ is considered a word character. Does anyone know how to accomplish this? Thanks for the help.
Since \w will match the underscore, you can more directly specify what you consider a character without too much more work:
re.findall(r'[a-zA-Z0-9]+|[{0}]+'.format(punctuation), string)
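For the sample input this should give exactly the desired split:
>>> re.findall(r'[a-zA-Z0-9]+|[{0}]+'.format(punctuation), string)
['Hi', 'my', 'name', 'is', '_', 'Mark', '.', 'I', 'like', 'apples', '!!', 'Do', 'you', '?!']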
Because the left side of an alternation is tried first, you can simply list _ with the punctuation characters before you match word characters:
row = re.findall(r'[{0}_]+|\w+'.format(string.punctuation), mystring)
But you can do the same without bothering with string.punctuation at all. "Punctuation" is anything that's neither a space nor a word character:
row = re.findall(r"(?:[^\s\w]|_)+|\w+", mystring)
PS. In your code sample, the string named string "shadows" the module string. Don't do that; it's bad practice and leads to bugs.
It is clearly stated in the Python docs that \w includes not only alphanumeric characters but also the underscore:
\w
When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
So, as Eric pointed out in his solution, it is better to specify a set of only alphanumeric characters: [a-zA-Z0-9].
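You can verify this in the interpreter; \w happily matches the underscore:
>>> import re
>>> re.findall(r'\w', 'a_1')
['a', '_', '1']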
I am attempting to write a regex in Python that will match all non-word characters (spaces, slashes, colons, etc.) excluding those that exist in a URL. I know I can get all non-word characters with \W+, and I also have a regex to get URLs: https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+|https*:\/\/[\w\.]+\.[a-zA-Z]*|[\w\.]+\.[a-zA-Z]*\/[\w\/\-]+ but I can't figure out a way to combine them. What would be the best way to get what I need here?
EDIT
To clarify, I am trying to split on this regex. So when I attempt to use re.split() with the following regex: https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+|https*:\/\/[\w\.]+\.[a-zA-Z]*|[\w\.]+\.[a-zA-Z]*\/[\w\/\-]+|(\W) I end up with something like the following:
INPUT:
this is a test: https://www.google.com
OUTPUT:
['this', ' ', 'is', ' ', 'a', ' ', 'test', ':', '', ' ', '', None, '']
What I'm hoping to get is this:
['this', 'is', 'a', 'test', 'https://www.google.com']
This is how I'm splitting:
import re
message = 'this is a test: https://www.google.com'
re.split("https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+|https*:\/\/[\w\.]+\.[a- zA-Z]*|[\w\.]+\.[a-zA-Z]*\/[\w\/\-]+|(\W)", message)
You should use reverse logic: match a URL pattern, or else any run of one or more word chars:
import re
rx = r"https*://[\w.]+\.[\w/-]*|[\w.]+\.[a-zA-Z]*/[\w/-]+|\w+"
message = 'this is a test: https://www.google.com'
print( re.findall(rx, message) )
# => ['this', 'is', 'a', 'test', 'https://www.google.com']
Note that I shortened your URL pattern: you had two similar alternatives, https*:\/\/[\w\.]+\.[a-zA-Z]*[\w\/\-]+ and https*:\/\/[\w\.]+\.[a-zA-Z]*, where [a-zA-Z]* is redundant because it matches zero or more letters while the following [\w\/\-]+ already requires one or more letter, / or - chars, so the two collapse into a single alternative. You also do not have to escape dots inside character classes, nor slashes at all; the unnecessary escapes are removed here.
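The second alternative still catches scheme-less URLs with a path; a quick check with a made-up input:
print( re.findall(rx, 'search on google.com/maps today') )
# => ['search', 'on', 'google.com/maps', 'today']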
I want to split a string on all spaces and punctuation except for the apostrophe. Preferably a single quote should still be used as a delimiter, except when it is an apostrophe. I also want to keep the delimiters.
example string
words = """hello my name is 'joe.' what's your's"""
Here is my re pattern thus far: splitted = re.split(r"[^'-\w]", words.lower())
I tried throwing the single quote after the ^ character but it is not working.
My desired output is this: splitted = [hello, my, name, is, joe, ., what's, your's]
It might be simpler to process your list after splitting, without accounting for the apostrophes at first:
>>> words = """hello my name is 'joe.' what's your's"""
>>> split_words = re.split(r"[ ,.!?]", words.lower()) # add punctuation you want to split on
>>> split_words
['hello', 'my', 'name', 'is', "'joe.'", "what's", "your's"]
>>> [word.strip("'") for word in split_words]
['hello', 'my', 'name', 'is', 'joe.', "what's", "your's"]
One option is to make use of lookarounds to split at the desired positions, and use a capture group for the delimiters you want to keep in the result.
After the split, you can remove the empty entries from the resulting list.
\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])
The pattern matches
\s+ Match 1 or more whitespace chars
| Or
(?<=\s)' Match ' preceded by a whitespace char
| Or
'(?=\s) Match ' when followed by a whitespace char
| Or
(?<=\w)([,.!?]) Capture one of , . ! ? in group 1, when preceded by a word character
Example
import re
pattern = r"\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])"
words = """hello my name is 'joe.' what's your's"""
result = [s for s in re.split(pattern, words) if s]
print(result)
Output
['hello', 'my', 'name', 'is', 'joe', '.', "what's", "your's"]
I love regex golf!
words = """hello my name is 'joe.' what's your's"""
splitted = re.findall(r"\b(?:\w'\w|\w)+\b", words)
The part in the parentheses is a group that matches either an apostrophe surrounded by letters or a single letter.
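On the sample it yields (note that, unlike the desired output, the . token is dropped):
>>> splitted
['hello', 'my', 'name', 'is', 'joe', "what's", "your's"]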
EDIT:
This is more flexible:
re.findall(r"\b(?:(?<=\w)'(?=\w)|\w)+\b", words)
It's getting a bit unreadable at this point though; in practice you should probably use Woodford's answer.
I'm currently trying to tokenize some language data using Python and was curious if there was an efficient or built-in method for splitting strings of sentences into separate words and also separate punctuation characters. For example:
"Hello, my name is John. What's your name?"
If I used split() on this sentence then I would get
['Hello,', 'my', 'name', 'is', 'John.', "What's", 'your', 'name?']
What I want to get is:
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
I've tried methods such as searching the string, finding punctuation, storing their indices, removing them from the string and then splitting the string, and inserting the punctuation accordingly, but this approach seems too inefficient, especially when dealing with large corpora.
Does anybody know if there's a more efficient way to do this?
Thank you.
You can do a trick:
text = "Hello, my name is John. What's your name?"
text = text.replace(",", " , ") # Add an space before and after the comma
text = text.replace(".", " . ") # Add an space before and after the point
text = text.replace(" ", " ") # Remove possible double spaces
mListtext.split(" ") # Generates your list
Or just this with input():
mList = input().replace(",", " , ").replace(".", " . ").replace("  ", " ").split(" ")
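Run on the sample sentence, the chain gives the following (the trailing name? stays glued together because only the comma and the period are handled):
>>> "Hello, my name is John. What's your name?".replace(",", " , ").replace(".", " . ").replace("  ", " ").split(" ")
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name?']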
Here is an approach using re.finditer which at least seems to work with the sample data you provided:
inp = "Hello, my name is John. What's your name?"
parts = []
for match in re.finditer(r'[^.,?!\s]+|[.,?!]', inp):
parts.append(match.group())
print(parts)
Output:
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
The idea here is to match one of the following two patterns:
[^.,?!\s]+ which matches one or more characters that are neither whitespace nor one of the listed punctuation characters
[.,?!] which matches a single punctuation character
Presumably anything which is not whitespace or punctuation should be a matching word/term in the sentence.
Note that a really nice way to solve this problem would be a regex split on the zero-width boundary between a word and a punctuation character. However, re.split only gained support for splitting on zero-width matches in Python 3.7, so re.finditer is the more portable choice.
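On Python 3.7+ the split approach does work; a minimal sketch, splitting on whitespace or at the zero-width position between a word character and a punctuation character:
import re
inp = "Hello, my name is John. What's your name?"
print(re.split(r'\s+|(?<=\w)(?=[.,?!])', inp))
# => ['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']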
Word tokenisation is not as trivial as it sounds. The previous answers using regular expressions or string replacement won't always deal with such things as acronyms or abbreviations (e.g. a.m., p.m., N.Y., D.I.Y, A.D., B.C., e.g., etc., i.e., Mr., Ms., Dr.). These will be split into separate tokens (e.g. B, ., C, .) by such approaches unless you write more complex patterns to deal with such cases (but there will always be annoying exceptions). You will also have to decide what to do with other punctuation like " and ', $, %, such things as email addresses and URLs, sequences of digits (e.g 5,000.99, 33.3%), hyphenated words (e.g. pre-processing, avant-garde), names that include punctuation (e.g. O'Neill), contractions (e.g. aren't, can't, let's), the English possessive marker ('s) etc. etc., etc.
I recommend using an NLP library to do this as they should be set up to deal with most of these issues for you (although they do still make "mistakes" that you can try to fix). See:
spaCy (especially geared towards efficiency on large corpora)
NLTK
Stanford CoreNLP
TreeTagger
The first three are full toolkits with many functionalities besides tokenisation. The last is a part-of-speech tagger that tokenises the text. These are just a few and there are other options out there, so try some out and see which works best for you. They will all tokenise your text differently, but in most cases (not sure about TreeTagger) you can modify their tokenisation decisions to correct mistakes.
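For example, a minimal spaCy sketch (assuming the small English model en_core_web_sm is installed; note that spaCy makes its own tokenisation decisions, e.g. splitting What's into What and 's):
import spacy
nlp = spacy.load("en_core_web_sm")  # load a pretrained English pipeline
doc = nlp("Hello, my name is John. What's your name?")
print([token.text for token in doc])
# e.g. ['Hello', ',', 'my', 'name', 'is', 'John', '.', 'What', "'s", 'your', 'name', '?']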
You can use re.sub to insert a space before every char from string.punctuation that is not followed by a word character (that is what the \B assertion checks, so the apostrophe in What's is left alone), then use str.split to split on whitespace:
>>> s = "Hello, my name is John. What's your name?"
>>>
>>> import string, re
>>> re.sub(fr'([{string.punctuation}])\B', r' \1', s).split()
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
In Python 2:
>>> re.sub(r'([%s])\B' % string.punctuation, r' \1', s).split()
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
TweetTokenizer from NLTK can also be used for this:
from nltk.tokenize import TweetTokenizer
tokenizer = TweetTokenizer()
tokenizer.tokenize('''Hello, my name is John. What's your name?''')
# output:
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
Input: s = "test1 this is a sample subscript o₁"
I've tried: re.compile(r'\b[^\W\d_]{2,}\b').findall(s)
It finds words with 2 or more chars that don't contain numbers:
'this', 'is', 'sample', 'subscript', 'o₁'
but it still includes the subscript number.
Is there a way to remove words that contain a subscript in them?
Desired output: 'this', 'is', 'sample', 'subscript'
The point is that the Unicode-aware \d in Python 3 regex only matches decimal digits (Unicode category Nd) and does not match the No Unicode category, while \w does, so a character like ₁ slips through [^\W\d_].
If you need to work with ASCII only letter words, use
r'\b[a-zA-Z]{2,}\b'
Or, make the pattern non-Unicode aware by using re.A / re.ASCII flag:
re.compile(r'\b[^\W\d_]{2,}\b', re.A)
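A quick check of the re.A variant on the sample:
import re
s = "test1 this is a sample subscript o₁"
print(re.findall(r'\b[^\W\d_]{2,}\b', s, re.A))
# => ['this', 'is', 'sample', 'subscript']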
If you need to work with any Unicode letters, you may fix it either by adding all the No characters to the negated character class (which might make for a tedious solution), or by adding a programmatic check after a match is found to see if the match contains any char from the No category.
Here is a Python 3 demo:
import re, sys, unicodedata
s = "test1 this is a sample subscript o₁"
# Collect every character in the Unicode "No" (other number) category
No = [chr(i) for i in range(sys.maxunicode) if unicodedata.category(chr(i)) == 'No']
# Keep only the matches that contain no character from the No list
print([x for x in re.findall(r'\b[^\W\d_]{2,}\b', s) if not any(y in x for y in No)])
# => ['this', 'is', 'sample', 'subscript']
Make sure you are using the latest Python version to support the latest Unicode standard, or rely on the PyPI regex module:
import regex
p = regex.compile(r"\b\p{L}{2,}\b")
print(p.findall(s))
# => ['this', 'is', 'sample', 'subscript']
Suppose I have this variable, named string.
string = "Hello(There|World!!"
Since I want to split on multiple delimiters, I'm using re.split() to do the job. Unfortunately, this string contains special characters used by the re module. I don't want to use re.escape() because that would escape the exclamation points too. How can I split on re's special characters without using re.escape()?
Use a character class to define the characters you want to split on.
I assume you may want to keep those exclamation marks. If this is the case:
>>> s = "Hello(There|World!!"
>>> re.split(r'[(|]+', s)
['Hello', 'There', 'World!!']
If you want to split on the exclamation marks as well.
>>> s = "Hello(There|World!!"
>>> re.split(r'[(|!]+', s)
['Hello', 'There', 'World', '']
If you want to split on other characters, simply keep adding them to your class.
>>> s = "Hello(There|World!!Hi[There]"
>>> re.split(r'[(|!\[\]]+', s)
['Hello', 'There', 'World', 'Hi', 'There', '']
Then use filter to remove the empty strings from the list; for instance, continuing the last example:
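>>> list(filter(None, re.split(r'[(|!\[\]]+', s)))
['Hello', 'There', 'World', 'Hi', 'There']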
re.split(r"\(|\||!",x)
Output:['Hello', 'There', 'World', '', '']
You can split using multiple delimiters.