how to split string on word and punctuation [duplicate] - python

I am relatively new to Python. Is there a way to split the string "James kicked Bob's ball, laughed and ran away." so that the words and the punctuation end up as separate list items, i.e. ["James", "kicked", "Bob's", "ball", ",", "laughed", "and", "ran", "away", "."]?

You can try this:
import re

# [\w']+ matches runs of word characters (the apostrophe is included so "Bob's" stays whole);
# [.,!?;] matches a single punctuation mark from that set
s = "James kicked Bob's ball, laughed and ran away."
x = re.findall(r"[\w']+|[.,!?;]", s)
print(x)
Output:
['James', 'kicked', "Bob's", 'ball', ',', 'laughed', 'and', 'ran', 'away', '.']
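If you also need to catch punctuation that is not in that explicit character class, a slightly more general pattern (a sketch, not part of the original answer) is to match any single character that is neither a word character nor whitespace:
import re

s = "James kicked Bob's ball, laughed and ran away."
# [^\w\s] matches one character that is neither a word character nor whitespace,
# so brackets, dashes, quotes etc. also come back as separate tokens
print(re.findall(r"[\w']+|[^\w\s]", s))
# ['James', 'kicked', "Bob's", 'ball', ',', 'laughed', 'and', 'ran', 'away', '.']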

It seems you are trying to tokenize a sentence.
Tokenizers for this already exist and perform well.
For example, you can use spaCy.
Once it is installed, you will need to download the model for your language:
python -m spacy download en
Then you will be able to use it in your script:
import spacy

nlp = spacy.load('en')
# nlp() returns a Doc of Token objects; take each token's text to get plain strings
tokens = [token.text for token in nlp("James kicked Bob's ball, laughed and ran away.")]
print(tokens)
Output:
['James', 'kicked', 'Bob', "'s", 'ball', ',', 'laughed', 'and', 'ran', 'away', '.']
A tokenizer will also take care of some corner cases. For example, the sentence 'I tried but it failed...' is tokenized as ['I', 'tried', 'but', 'it', 'failed', '...']: the dots at the end are grouped together into a single token. In the same way, "don't" is tokenized as ['do', "n't"] instead of the naive ['don', "'t"].
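A quick sketch of those corner cases with spaCy (assuming the same 'en' model as above is installed):
import spacy

nlp = spacy.load('en')
# the trailing ellipsis stays together as a single token
print([t.text for t in nlp("I tried but it failed...")])
# the contraction is split into its grammatical parts: 'do' and "n't"
print([t.text for t in nlp("I don't know.")])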

Related

Sentences into words in Python [duplicate]

I'm supposed to store all the words of a long text from a .txt file in a binary tree.
Example english.txt:
A blurb or a tag is a statement about a book,
record or video, supplied by the publisher or
distributor, like "The best-selling novel" or
"Greatest hit tunes" or even "Perverse sex".
How do I store every single word of the text in the tree?
I have tried:
from bintreeFile import Bintree
engelska = Bintree()
with open("english.txt", "r", encoding="utf-8") as english_file:
    for rad in english_file:
        words = rad.strip().split(" ")
        engelska.put(words)
engelska.write()
This ends up printing, for example:
['A', 'blurb', 'or', 'a', 'tag', 'is', 'a', 'statement', 'about', 'a', 'book,']
['record', 'or', 'video,', 'supplied', 'by', 'the', 'publisher', 'or']
How can I fix this so it only prints the words?
A
blurb
or
tag
...etc
engelska is probably a list or an array of some sort. Arrays/lists are iterable, so you need to do something like
for x in engelska:
    print(x)
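If the goal is for the tree to hold individual words rather than whole lists, a minimal sketch (assuming Bintree exposes put() and write() exactly as in the question) is to put one word at a time:
from bintreeFile import Bintree

engelska = Bintree()
with open("english.txt", "r", encoding="utf-8") as english_file:
    for rad in english_file:
        for word in rad.strip().split():
            # store one word per node instead of a whole list per line
            engelska.put(word)
engelska.write()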

How to separate beginning and ending punctuation from words using python? [duplicate]

I have a list of words with possible punctuation at their beginning and end. I need to separate the punctuation using a regex, as follows:
sample_input = ["I", "!Go", "I'm", "call.", "exit?!"]
sample_output = ["I", "!", "Go", "I'm", "call", ".", "exit", "?", "!"]
The original string looks like this:
string ="It's a mountainous wonderland decorated with ancient glaciers, breathtaking national parks and sumptuous vineyards, but behind its glossy image New Zealand is failing many of its children."
Does anybody have an idea how to solve this problem?
Thank you.
You can tokenize each list item first:
import re
words = ["I", "!Go", "I'm", "call.", "exit?!"]
newwords = []
for i in words:
    # [\w']+ keeps word characters and apostrophes together; \W matches a single non-word character
    newwords.append(re.findall(r"[\w']+|[\W]", i))
print(newwords)
>>> [['I'], ['!', 'Go'], ["I'm"], ['call', '.'], ['exit', '?', '!']]
then get the result by flattening the nested lists:
result = [item for sublist in newwords for item in sublist]
print(result)
>>> ['I', '!', 'Go', "I'm", 'call', '.', 'exit', '?', '!']
Each string is broken on either the [\w']+ group or the \W group, which gives the final list in your desired output format.
You can adapt this approach to whatever your code requires.
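The two steps can also be collapsed into a single comprehension, and the same kind of pattern can be applied to the original sentence string directly (a sketch reusing only the regex ideas above):
import re

words = ["I", "!Go", "I'm", "call.", "exit?!"]
# tokenize and flatten in one pass
tokens = [tok for w in words for tok in re.findall(r"[\w']+|\W", w)]
print(tokens)  # ['I', '!', 'Go', "I'm", 'call', '.', 'exit', '?', '!']

# on the full sentence, use [^\w\s] instead of \W so spaces are not returned as tokens
string = "It's a mountainous wonderland decorated with ancient glaciers, breathtaking national parks and sumptuous vineyards, but behind its glossy image New Zealand is failing many of its children."
print(re.findall(r"[\w']+|[^\w\s]", string))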

Getting words out of a numpy array of sentence strings

I have a numpy array of sentences (strings)
arr = np.array(["It's the most wonderful time of the year.",
                "With the kids jingle belling.",
                "And everyone telling you be of good cheer.",
                "It's the hap-happiest season of all."])
(that I read from a csv file). I need to make a numpy array with all the unique words in these sentences.
So what I need is
array(["It's", "the", "most", "wonderful", "time", "of" "year", "With", "the", "kids", "jingle", "belling" "and", "everyone", "telling", "you", "be", "good", "cheer", "It's", "hap-happiest", "season", "all"])
I could do this like
o = []
for x in arr:
    o += x.split()
words = np.array(o)
unique_words = np.array(list(set(words.tolist())))
but as this involves first making lists and then converting them to a numpy array, it's obviously going to be slow and inefficient for large data.
I also tried nltk as in
words = np.array([])
for x in arr:
    words = np.append(words, nltk.word_tokenize(x))
but this too seems inefficient, as a new array is created on each iteration instead of the old one being modified.
I suppose there's some elegant way of achieving what I want using more of numpy.
Can you point me in the right direction?
I think you can try something like this:
vocab = set()
for x in arr:
    vocab.update(nltk.word_tokenize(x))
set.update() takes an iterable and adds its elements to the existing set.
Update:
Also, you can look at the working of CountVectorizer in scikit-learn which:
converts a collection of text documents to a matrix of token counts.
And it uses a dictionary to keep track of the unique words:
# raw_documents is an iterable of sentences.
for doc in raw_documents:
    feature_counter = {}
    # analyze will split the sentences into tokens
    # and apply some preprocessing on them (like stemming, lemma etc)
    for feature in analyze(doc):
        try:
            # vocabulary is a dictionary containing the words and their counts
            feature_idx = vocabulary[feature]
            ...
        ...
And I think it works pretty efficiently. So you can also use a dict() instead of a set. I am not familiar with the internals of NLTK, but I think it must also contain something equivalent to CountVectorizer.
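If the counts are useful to you, a minimal sketch of that dict-based idea (using nltk.word_tokenize on the same arr as above) could look like this:
import nltk

vocab_counts = {}
for x in arr:
    for tok in nltk.word_tokenize(x):
        # the dict keys are the unique words; the values count how often each occurs
        vocab_counts[tok] = vocab_counts.get(tok, 0) + 1
unique_words = list(vocab_counts)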
I'm not sure numpy is the best way to go here. You can achieve what you want with nested lists and sets or dictionaries.
One useful thing to know is that the tokenizer methods from nltk can process a list of sentences, and will return a list of tokenized sentences. For example:
from nltk.tokenize import WordPunctTokenizer
wpt = WordPunctTokenizer()
tokenized = wpt.tokenize_sents(arr)
This will return a list of lists of the tokenized sentences in arr, i.e.:
[['It', "'", 's', 'the', 'most', 'wonderful', 'time', 'of', 'the', 'year', '.'],
['With', 'the', 'kids', 'jingle', 'belling', '.'],
['And', 'everyone', 'telling', 'you', 'be', 'of', 'good', 'cheer', '.'],
['It', "'", 's', 'the', 'hap', '-', 'happiest', 'season', 'of', 'all', '.']]
nltk comes with lots of different tokenizers, and so will give you options for how best to split the sentences into word tokens. You can then use something like the following to get the unique set of words / tokens:
unique_words = set()
for toks in tokenized:
    unique_words.update(toks)
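If you do want the result back as a numpy array, as in the question, you can convert the set at the end (a small sketch; the sort just makes the order deterministic):
import numpy as np

unique_words_arr = np.array(sorted(unique_words))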

tokenize array consisting of strings

I have an array called allchats consisting of long strings. Some of the entries look like the following:
allchats[5,0] = "Hi, have you ever seen something like that? no?"
allchats[106,0] = "some word blabla some more words yes"
allchats[410,0] = "I don't know how we will ever get through this..."
I wish to tokenize each string in the array. Furthermore, I wish to use a regex tool to eliminate question marks, commas, etc.
I have tried the following:
import nltk
from nltk.tokenize import RegexpTokenizer
tknzr = RegexpTokenizer(r'\w+')
allchats1 = [[tknzr.tokenize(chat) for chat in str] for str in allchats]
I wish to end up with:
allchats[5,0] = ['Hi', 'have', 'you', 'ever', 'seen', 'something', 'like', 'that', 'no']
allchats[106,0] = ['some', 'word', 'blabla', 'some', 'more', 'words', 'yes']
allchats[410,0] = ['I', 'dont', 'know', 'how', 'we', 'will', 'ever', 'get', 'through', 'this']
I am quite sure that I am doing something wrong with the strings (str) in the for loop, but cannot figure out what I need to correct in order to succeed.
Thank you in advance for your help!
There is an error in your list comprehension: nested iteration is written with chained for clauses, not nested comprehensions:
allchats1 = [tknzr.tokenize(chat) for str in allchats for chat in str]
If you want to iterate over words instead of just characters, you are looking for the str.split() method. Here is a fully working example:
allchats = ["Hi, have you ever seen something like that? no?", "some word blabla some more words yes", "I don't know how we will ever get through this..."]
def tokenize(word):
# use real logic here
return word + 'tokenized'
tokenized = [tokenize(word) for sentence in allchats for word in sentence.split()]
print(tokenized)
If you're not sure your list contains only strings and you want to process only the strings, you can check with isinstance:
tokenized = [tokenize(word) for sentence in allchats if isinstance(sentence, str) for word in sentence.split()]
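If the goal from the question is to keep one token list per chat string (rather than one flat list) and to drop the punctuation, a minimal sketch with nltk's RegexpTokenizer (assuming allchats is a flat list of strings, as in the example above) would be:
from nltk.tokenize import RegexpTokenizer

tknzr = RegexpTokenizer(r"\w+")
# one list of word tokens per chat; punctuation is simply never matched by \w+
allchats1 = [tknzr.tokenize(chat) for chat in allchats]
print(allchats1[0])  # ['Hi', 'have', 'you', 'ever', 'seen', 'something', 'like', 'that', 'no']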

Splitting words using nltk module in Python

I am trying to find a way to split words in Python using the nltk module. I am unsure how to reach my goal given the raw data I have, which is a list of tokenized words, e.g.
['usingvariousmolecularbiology', 'techniques', 'toproduce', 'genotypes', 'following', 'standardoperatingprocedures', '.', 'Operateandmaintainautomatedequipment', '.', 'Updatesampletrackingsystemsandprocess', 'documentation', 'toallowaccurate', 'monitoring', 'andrapid', 'progression', 'ofcasework']
As you can see many words are stuck together (i.e. 'to' and 'produce' are stuck in one string 'toproduce'). This is an artifact of scraping data from a PDF file and I would like to find a way using the nltk module in python to split the stuck-together words (i.e. split 'toproduce' into two words: 'to' and 'produce'; split 'standardoperatingprocedures' into three words: 'standard', 'operating', 'procedures').
I appreciate any help!
I believe you will want to use word segmentation in this case, and I am not aware of any word segmentation features in the NLTK that will deal with English sentences without spaces. You could use pyenchant instead. I offer the following code only by way of example. (It would work for a modest number of relatively short strings--such as the strings in your example list--but would be highly inefficient for longer strings or more numerous strings.) It would need modification, and it will not successfully segment every string in any case.
import enchant # pip install pyenchant
eng_dict = enchant.Dict("en_US")
def segment_str(chars, exclude=None):
    """
    Segment a string of chars using the pyenchant vocabulary.
    Keeps longest possible words that account for all characters,
    and returns list of segmented words.

    :param chars: (str) The character string to segment.
    :param exclude: (set) A set of strings to exclude from consideration.
                    (These have been found previously to lead to dead ends.)
                    If an excluded word occurs later in the string, this
                    function will fail.
    """
    words = []

    if not chars.isalpha():  # don't check punctuation etc.; needs more work
        return [chars]

    if not exclude:
        exclude = set()

    working_chars = chars
    while working_chars:
        # iterate through segments of the chars starting with the longest segment possible
        for i in range(len(working_chars), 1, -1):
            segment = working_chars[:i]
            if eng_dict.check(segment) and segment not in exclude:
                words.append(segment)
                working_chars = working_chars[i:]
                break
        else:  # no matching segments were found
            if words:
                exclude.add(words[-1])
                return segment_str(chars, exclude=exclude)
            # let the user know a word was missing from the dictionary,
            # but keep the word
            print('"{chars}" not in dictionary (so just keeping as one segment)!'
                  .format(chars=chars))
            return [chars]
    # return a list of words based on the segmentation
    return words
As you can see, this approach (presumably) mis-segments only one of your strings:
>>> t = ['usingvariousmolecularbiology', 'techniques', 'toproduce', 'genotypes', 'following', 'standardoperatingprocedures', '.', 'Operateandmaintainautomatedequipment', '.', 'Updatesampletrackingsystemsandprocess', 'documentation', 'toallowaccurate', 'monitoring', 'andrapid', 'progression', 'ofcasework']
>>> [segment_str(chars) for chars in t]
"genotypes" not in dictionary (so just keeping as one segment)!
[['using', 'various', 'molecular', 'biology'], ['techniques'], ['to', 'produce'], ['genotypes'], ['following'], ['standard', 'operating', 'procedures'], ['.'], ['Operate', 'and', 'maintain', 'automated', 'equipment'], ['.'], ['Updates', 'ample', 'tracking', 'systems', 'and', 'process'], ['documentation'], ['to', 'allow', 'accurate'], ['monitoring'], ['and', 'rapid'], ['progression'], ['of', 'casework']]
You can then use chain to flatten this list of lists:
>>> from itertools import chain
>>> list(chain.from_iterable(segment_str(chars) for chars in t))
"genotypes" not in dictionary (so just keeping as one segment)!
['using', 'various', 'molecular', 'biology', 'techniques', 'to', 'produce', 'genotypes', 'following', 'standard', 'operating', 'procedures', '.', 'Operate', 'and', 'maintain', 'automated', 'equipment', '.', 'Updates', 'ample', 'tracking', 'systems', 'and', 'process', 'documentation', 'to', 'allow', 'accurate', 'monitoring', 'and', 'rapid', 'progression', 'of', 'casework']
You can easily install the following library and use it for your purpose:
pip install wordsegment
import wordsegment
help(wordsegment)
from wordsegment import load, segment
load()
segment('usingvariousmolecularbiology')
The output will be like this:
Out[4]: ['using', 'various', 'molecular', 'biology']
Please refer to http://www.grantjenks.com/docs/wordsegment/ for more details.
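Applied to a few of the tokens from the question, a small sketch (note that wordsegment normalizes its output to lowercase and ignores punctuation, so non-alphabetic tokens are passed through unchanged here):
from itertools import chain
from wordsegment import load, segment

load()
tokens = ['usingvariousmolecularbiology', 'techniques', 'toproduce', '.', 'ofcasework']
# segment() only the alphabetic tokens; keep '.' and other punctuation as-is
flat = list(chain.from_iterable(segment(tok) if tok.isalpha() else [tok] for tok in tokens))
print(flat)
# e.g. ['using', 'various', 'molecular', 'biology', 'techniques', 'to', 'produce', '.', 'of', 'casework']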
