In Python, how do I split on either a space or a hyphen?
Input:
You think we did this un-thinkingly?
Desired output:
["You", "think", "we", "did", "this", "un", "thinkingly"]
I can get as far as
mystr.split(' ')
But I don't know how to split on hyphens as well as spaces and the Python definition of split only seems to specify a string. Do I need to use a regex?
If your pattern is simple enough for one (or maybe two) replace calls, use them:
mystr.replace('-', ' ').split(' ')
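For example (a quick check, not part of the original answer), with the input above:
>>> mystr = "You think we did this un-thinkingly?"
>>> mystr.replace('-', ' ').split(' ')
['You', 'think', 'we', 'did', 'this', 'un', 'thinkingly?']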
Otherwise, use a regex as suggested by @jamylak.
>>> import re
>>> text = "You think we did this un-thinkingly?"
>>> re.split(r'\s|-', text)
['You', 'think', 'we', 'did', 'this', 'un', 'thinkingly?']
As @larsmans noted, to split on runs of multiple spaces/hyphens (emulating .split() with no arguments), use a character class [...] with + for readability:
>>> re.split(r'[\s-]+', text)
['You', 'think', 'we', 'did', 'this', 'un', 'thinkingly?']
Without regex (regex is the most straightforward option in this case):
>>> [y for x in text.split() for y in x.split('-')]
['You', 'think', 'we', 'did', 'this', 'un', 'thinkingly?']
Actually, the answer by @Elazar without a regex is quite straightforward as well (I would still vouch for the regex though).
A regex is far easier and better, but if you're staunchly opposed to using one:
import itertools
itertools.chain.from_iterable((i.split(" ") for i in myStr.split("-")))
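Materialising the iterator with list() on the question's sample string (reusing the variable name myStr from above) gives something like:
>>> import itertools
>>> myStr = "You think we did this un-thinkingly?"
>>> list(itertools.chain.from_iterable(i.split(" ") for i in myStr.split("-")))
['You', 'think', 'we', 'did', 'this', 'un', 'thinkingly?']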
Related
I'm currently trying to tokenize some language data using Python and was curious if there was an efficient or built-in method for splitting strings of sentences into separate words and also separate punctuation characters. For example:
"Hello, my name is John. What's your name?"
If I used split() on this sentence then I would get
['Hello,', 'my', 'name', 'is', 'John.', "What's", 'your', 'name?']
What I want to get is:
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
I've tried to use methods such as searching the string, finding punctuation, storing their indices, removing them from the string and then splitting the string, and inserting the punctuation accordingly but this method seems too inefficient especially when dealing with large corpora.
Does anybody know if there's a more efficient way to do this?
Thank you.
You can do a trick:
text = "Hello, my name is John. What's your name?"
text = text.replace(",", " , ") # Add an space before and after the comma
text = text.replace(".", " . ") # Add an space before and after the point
text = text.replace(" ", " ") # Remove possible double spaces
mListtext.split(" ") # Generates your list
Or just this with input:
mList = input().replace(",", " , ").replace(".", " . ").replace("  ", " ").split(" ")
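Note that as written this only pads commas and periods, so a trailing "?" stays attached to the last word. Running it on the sample sentence should give something like:
>>> text = "Hello, my name is John. What's your name?"
>>> text.replace(",", " , ").replace(".", " . ").replace("  ", " ").split(" ")
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name?']
You would need further replace calls for "?" (and any other punctuation you care about) to fully match the desired output.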
Here is an approach using re.finditer which at least seems to work with the sample data you provided:
inp = "Hello, my name is John. What's your name?"
parts = []
for match in re.finditer(r'[^.,?!\s]+|[.,?!]', inp):
parts.append(match.group())
print(parts)
Output:
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
The idea here is to match one of the following two patterns:
[^.,?!\s]+ which matches one or more non-punctuation, non-whitespace characters
[.,?!] which matches a single punctuation character
Presumably anything which is not whitespace or punctuation should be a matching word/term in the sentence.
Note that the really nice way to solve this problem would be to do a regex split on punctuation or whitespace. However, older versions of re.split did not support splitting on zero-width lookarounds, so re.finditer is used here instead.
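On Python 3.7 and newer, re.split does accept zero-width matches, so a lookaround-based split might look something like this (a rough sketch, only checked against the sample input):
import re

inp = "Hello, my name is John. What's your name?"
# Split on whitespace, or at the zero-width position just before or just
# after one of the punctuation characters; drop any stray empty strings.
parts = [p for p in re.split(r'\s+|(?=[.,?!])|(?<=[.,?!])(?=\S)', inp) if p]
print(parts)
# ['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']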
Word tokenisation is not as trivial as it sounds. The previous answers using regular expressions or string replacement won't always deal with such things as acronyms or abbreviations (e.g. a.m., p.m., N.Y., D.I.Y., A.D., B.C., e.g., etc., i.e., Mr., Ms., Dr.). These will be split into separate tokens (e.g. B, ., C, .) by such approaches unless you write more complex patterns to deal with such cases (but there will always be annoying exceptions). You will also have to decide what to do with other punctuation like " and ', $, %, such things as email addresses and URLs, sequences of digits (e.g. 5,000.99, 33.3%), hyphenated words (e.g. pre-processing, avant-garde), names that include punctuation (e.g. O'Neill), contractions (e.g. aren't, can't, let's), the English possessive marker ('s), and so on.
I recommend using an NLP library to do this, as they should be set up to deal with most of these issues for you (although they do still make "mistakes" that you can try to fix). See:
spaCy (especially geared towards efficiency on large corpora)
NLTK
Stanford CoreNLP
TreeTagger
The first three are full toolkits with many functionalities besides tokenisation. The last is a part-of-speech tagger that tokenises the text. These are just a few and there are other options out there, so try some out and see which works best for you. They will all tokenise your text differently, but in most cases (not sure about TreeTagger) you can modify their tokenisation decisions to correct mistakes.
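For example, a minimal sketch with NLTK (assuming nltk is installed and the punkt tokenizer data has been downloaded; the exact output can vary by version, but should be roughly):
>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("Hello, my name is John. What's your name?")
['Hello', ',', 'my', 'name', 'is', 'John', '.', 'What', "'s", 'your', 'name', '?']
Note that it splits "What's" into "What" and "'s", which differs from the output the asker wanted; as said above, every tokeniser makes slightly different decisions.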
You can use re.sub to put a space in front of every character from string.punctuation that is not directly followed by a word character, and then use str.split to split on whitespace:
>>> s = "Hello, my name is John. What's your name?"
>>>
>>> import string, re
>>> re.sub(fr'([{string.punctuation}])\B', r' \1', s).split()
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
In Python 2 (no f-strings):
>>> re.sub(r'([%s])\B' % string.punctuation, r' \1', s).split()
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
TweetTokenizer from nltk can also be used for this:
from nltk.tokenize import TweetTokenizer
tokenizer = TweetTokenizer()
tokenizer.tokenize('''Hello, my name is John. What's your name?''')
# output
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
My goal is to achieve this:
Input:
Hi, Are you happy? I am "extremely happy" today
Output:
['Hi,', 'Are', 'you', 'happy?', 'I', 'am', 'extremely happy', 'today']
Is there a straightforward approach to achieve this? I tried using another pattern I found:
pattern = r'"([A-Za-z0-9_\./\\-]*)"'
I assume this should find the text inside the quote, but did not manage to find a way to nail it.
EDIT
I also tried splitting with the following regex, but this obviously only splits on spaces, which cuts the text inside the quotes into segments:
tokens = [token for token in re.split(r"(\W)", text) if token.strip()]
Is there a way to combine the pattern I supplied with this loop so that it returns a list in which each word gets its own cell, except that whatever is inside quotes stays together in a single cell?
You could use shlex.split instead of a regex:
import shlex
print(shlex.split('input: Hi, Are you happy? I am "extremely happy" today'))
result:
['input:', 'Hi,', 'Are', 'you', 'happy?', 'I', 'am', 'extremely happy', 'today']
Another fun way to do it: First split on quotes, then split every non-quoted part (every other):
s = 'I am "super happy" today'
ss = s.split('"')
res = sum(([w] if i % 2 else w.split() for i, w in enumerate(ss)), [])
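Running this on the string above gives something like:
>>> res
['I', 'am', 'super happy', 'today']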
To remove punctuation, you need to replace the w.split() call with a proper regexp, but I think you had that covered already.
This will not remove punctuation inside quotes of course, and you cannot nest quotes. So you cannot be "super "super" happy" :)
Suppose I have this variable, named string.
string = "Hello(There|World!!"
Since I want to split on multiple delimiters, I'm using re.split() to do the job. Unfortunately, this string contains special characters used by the re module. I don't want to use re.escape() because that would escape the exclamation points too. How can I split on re's special characters without using re.escape()?
Use a character class to define the characters you want to split on.
I assume you may want to keep those exclamation marks. If this is the case:
>>> s = "Hello(There|World!!"
>>> re.split(r'[(|]+', s)
['Hello', 'There', 'World!!']
If you want to split on the exclamation marks as well.
>>> s = "Hello(There|World!!"
>>> re.split(r'[(|!]+', s)
['Hello', 'There', 'World', '']
If you want to split on other characters, simply keep adding them to your class.
>>> s = "Hello(There|World!!Hi[There]"
>>> re.split(r'[(|!\[\]]+', s)
['Hello', 'There', 'World', 'Hi', 'There', '']
You can split on multiple delimiters by alternating them in the pattern:
re.split(r"\(|\||!", x)
Output: ['Hello', 'There', 'World', '', '']
Then use filter to remove the empty strings from the list.
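Putting the two steps together (a sketch; the variable name x is just for illustration):
>>> import re
>>> x = "Hello(There|World!!"
>>> list(filter(None, re.split(r"\(|\||!", x)))
['Hello', 'There', 'World']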
I'm trying to read from a text file and create a list of seed words that begin a sentence and a second list containing all adjacent words excluding the seed words.
The problem I'm encountering is that words containing an apostrophe get split after the apostrophe and the rest of the word omitted. How would you keep them as they appear in the file?
Text contained in file:
This doesn't seem to work. Is findall or sub the correct approach? Or neither?
CODE:
import re

my_string = open('sample.txt', 'r').read()
starter = list(set(re.findall(r"(?<![a-z]\s)[A-Z]\w+", my_string)))
adjacent = re.findall(r"(?<!(?<![a-z]\s))\w+", my_string)
print(adjacent)
RESULT:
['doesn', 'seem', 'to', 'work', 'sub', 'or', 'findall', 'the', 'correct', 'approach', 'neither']
It is easier with two regexes:
import re
txt="""\
This doesn't seem to work. Is findall or sub the correct approach? Or neither? Isn't it grand?
"""
first_words=set(re.findall(r'(?:^|(?:[.!?]\s))(\b[a-zA-Z\']+)', txt))
rest={word for word in re.findall(r'(\b[a-zA-Z\']+)', txt) if word not in first_words}
print(first_words)
# {'This', 'Is', 'Or', "Isn't"}
print(rest)
# {"doesn't", 'sub', 'grand', 'the', 'work', 'it', 'findall', 'to', 'neither', 'correct', 'seem', 'approach', 'or'}
The problem I'm encountering is that words containing an apostrophe get split after the apostrophe and the rest of the word omitted.
The slash-w-plus isn't your friend. It is a short-cut for alphabetic characters, numbers, and underscores. It does not include hyphens or apostrophes.
Use a character range instead. That way you can include apostrophes and exclude numbers and underscores:
r"[A-Za-z\']+" # works better than \w+
I have sentences that I want to tokenize, including the punctuation. But I need to handle contractions, so that something+not words like "can't" are tokenized into "ca" and "n't" (the split is one character before the apostrophe), while the rest of the contractions split at the apostrophe, so "you've" and "It's" turn into "you" "'ve" and "It" "'s". This is where I'm stuck. Basically I want behavior roughly equivalent to NLTK's TreebankWordTokenizer:
NLTK Word Tokenization Demo
I've been using one of the solutions proposed here, which doesn't handle contractions the way I want it to:
re.findall("'\w+|[\w]+|[^\s\w]", "Hello, I'm a string! Please don't kill me? It's his car.")
and I get this result:
['Hello', ',', 'I', "'m", 'a', 'string', '!', 'Please', 'don', "'t", 'kill', 'me', '?', 'It', "'s", 'his', 'car', '.']
This handles the apostrophes correctly except in the "don't" case, where it should be "do" and "n't". Anyone know how to fix that?
I can only use the standard library, so the NLTK is not an option in this case.
Regex:
\w+(?=n't)|n't|\w+(?=')|'\w+|\w+
Usage
match_list = re.findall(r"\w+(?=n't)|n't|\w+(?=')|'\w+|\w+","you've it's couldn't don't", re.IGNORECASE | re.DOTALL)
Matches:
['you', "'ve", "it", "'s", 'could', "n't", "do", "n't"]
Try:
r"[n]'[\w]+|[\w]+(?!')(?:[A-Za-mo-z](?='))?|(?<=\s)[\w](?=')|[^\s\w']"
The first alternative, [n]'[\w]+, matches an n followed by an apostrophe and more word characters (e.g. n't), and because alternatives are tried left to right it will match first even when the later patterns could also match.
Catch n't and \w+(?=n't) before \w+
r"'\w+|n't|\w+(?=n't)|\w+|[^\s\w]"