Python split text into tokens using regex


Hi, I have a question about splitting strings into tokens.
Here is an example string:
string = "As I was waiting, a man came out of a side room, and at a glance I was sure he must be Long John. His left leg was cut off close by the hip, and under the left shoulder he carried a crutch, which he managed with wonderful dexterity, hopping about upon it like a bird. He was very tall and strong, with a face as big as a ham—plain and pale, but intelligent and smiling. Indeed, he seemed in the most cheerful spirits, whistling as he moved about among the tables, with a merry word or a slap on the shoulder for the more favoured of his guests."
I'm trying to split this string correctly into its tokens.
Here is my function count_words:
def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()  # dictionary of { <word>: <count> } pairs to return
    #counts["I"] = 1
    print(text)
    # TODO: Convert to lowercase
    lowerText = text.lower()
    # TODO: Split text into tokens (words), leaving out punctuation
    # (Hint: Use regex to split on non-alphanumeric characters)
    split = re.split("[\s.,!?:;'\"-]+",lowerText)
    print(split)
    # TODO: Aggregate word counts using a dictionary
And here is the result of split:
['as', 'i', 'was', 'waiting', 'a', 'man', 'came', 'out', 'of', 'a',
'side', 'room', 'and', 'at', 'a', 'glance', 'i', 'was', 'sure', 'he',
'must', 'be', 'long', 'john', 'his', 'left', 'leg', 'was', 'cut',
'off', 'close', 'by', 'the', 'hip', 'and', 'under', 'the', 'left',
'shoulder', 'he', 'carried', 'a', 'crutch', 'which', 'he', 'managed',
'with', 'wonderful', 'dexterity', 'hopping', 'about', 'upon', 'it',
'like', 'a', 'bird', 'he', 'was', 'very', 'tall', 'and', 'strong',
'with', 'a', 'face', 'as', 'big', 'as', 'a', 'ham—plain', 'and',
'pale', 'but', 'intelligent', 'and', 'smiling', 'indeed', 'he',
'seemed', 'in', 'the', 'most', 'cheerful', 'spirits', 'whistling',
'as', 'he', 'moved', 'about', 'among', 'the', 'tables', 'with', 'a',
'merry', 'word', 'or', 'a', 'slap', 'on', 'the', 'shoulder', 'for',
'the', 'more', 'favoured', 'of', 'his', 'guests', '']
As you can see, there is an empty string '' at the last index of the split list.
Please help me understand why this empty string is in the list and how to split this example string correctly.

You could use a list comprehension to iterate over the list items produced by re.split and only keep them if they are not empty strings:
def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()  # dictionary of { <word>: <count> } pairs to return
    #counts["I"] = 1
    print(text)
    # TODO: Convert to lowercase
    lowerText = text.lower()
    # TODO: Split text into tokens (words), leaving out punctuation
    # (Hint: Use regex to split on non-alphanumeric characters)
    split = re.split("[\s.,!?:;'\"-]+",lowerText)
    split = [x for x in split if x != '']  # <- list comprehension
    print(split)
You should also consider returning the data from the function, and printing it from the caller rather than printing it from within the function. That will provide you with flexibility in future.

This happens because the string ends with '.', which is in the split pattern. re.split splits at that final match, and the text after it is empty, which is why you see '' at the end of the list.
I suggest using re.findall instead, which works the opposite way (it matches the tokens rather than the separators):
def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()  # dictionary of { <word>: <count> } pairs to return
    #counts["I"] = 1
    print(text)
    # TODO: Convert to lowercase
    lowerText = text.lower()
    # TODO: Split text into tokens (words), leaving out punctuation
    # (Hint: Use regex to split on non-alphanumeric characters)
    split = re.findall(r"[a-z\-]+", lowerText)
    print(split)
    # TODO: Aggregate word counts using a dictionary

Python's documentation for re.split explains this behavior:
If there are capturing groups in the separator and it matches at the
start of the string, the result will start with an empty string. The
same holds for the end of the string
Even though yours is not actually a capturing group, the effect is the same. Note that it could be at the end as well as at the start (for instance if your string started with a whitespace).
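A minimal sketch of the behavior described above, using a made-up sample string (not the one from the question): when the separator pattern matches at the very start or the very end of the input, re.split reports the empty text on the far side of that match.

```python
import re

# Same separator class as in the question, written as a raw string
pattern = r"[\s.,!?:;'\"-]+"

# Hypothetical sample: leading whitespace and a trailing period
sample = " hello, world."

tokens = re.split(pattern, sample)
print(tokens)  # ['', 'hello', 'world', '']
```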
The two solutions already proposed (more or less) by others are these:
Solution 1: findall
As other users pointed out you can use findall and try to inverse the logic of the pattern. With yours, you can easily negate your character class: [^\s\.,!?:;'\"-]+.
But it depends on your regex pattern, because negating it is not always that easy.
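As a small sketch of the negated character class, here it is applied with re.findall to a made-up sample string (the sample is an assumption, not taken from the question). Because findall collects the tokens instead of cutting at the separators, no empty strings are produced.

```python
import re

# Negation of the question's separator class: match runs of non-separators
negated = r"[^\s.,!?:;'\"-]+"

# Hypothetical sample: leading whitespace and a trailing period
sample = " hello, world."

tokens = re.findall(negated, sample)
print(tokens)  # ['hello', 'world']
```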
Solution 2: check on the starting and ending token
Instead of checking whether each token is != '', you can just look at the first and the last token: since the pattern greedily consumes every run of separator characters, an empty string can only appear at the start or at the end of the list.
split = re.split("[\s\.,!?:;'\"-]+",lowerText)
if split[0] == '':
    split = split[1:]
if split[-1] == '':
    split = split[:-1]

You get an empty string because the final period also matches the split pattern at the end of the string, and nothing follows it. You can, however, filter out empty strings with the filter function and thus complete your function:
import re
import collections
def count_words(text):
    """Count how many times each unique word occurs in text."""
    lowerText = text.lower()
    split = re.split("[ .,!?:;'\"\-]+",lowerText)
    ## filter out empty strings and count
    ## words:
    return collections.Counter( filter(None, split) )
count_words(text=string)
# Counter({'a': 9, 'he': 6, 'the': 6, 'and': 5, 'as': 4, 'was': 4, 'with': 3, 'his': 2, 'about': 2, 'i': 2, 'of': 2, 'shoulder': 2, 'left': 2, 'dexterity': 1, 'seemed': 1, 'managed': 1, 'among': 1, 'indeed': 1, 'favoured': 1, 'moved': 1, 'it': 1, 'slap': 1, 'cheerful': 1, 'at': 1, 'in': 1, 'close': 1, 'glance': 1, 'face': 1, 'pale': 1, 'smiling': 1, 'out': 1, 'tables': 1, 'cut': 1, 'ham': 1, 'for': 1, 'long': 1, 'intelligent': 1, 'waiting': 1, 'wonderful': 1, 'which': 1, 'under': 1, 'must': 1, 'bird': 1, 'guests': 1, 'more': 1, 'hip': 1, 'be': 1, 'sure': 1, 'leg': 1, 'very': 1, 'big': 1, 'spirits': 1, 'upon': 1, 'but': 1, 'like': 1, 'most': 1, 'carried': 1, 'whistling': 1, 'merry': 1, 'tall': 1, 'word': 1, 'strong': 1, 'by': 1, 'on': 1, 'john': 1, 'off': 1, 'room': 1, 'hopping': 1, 'or': 1, 'crutch': 1, 'man': 1, 'plain': 1, 'side': 1, 'came': 1})

import string
def count_words(text):
    counts = dict()
    text = text.translate(text.maketrans('', '', string.punctuation))
    text = text.lower()
    words = text.split()
    print(words)
    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return counts
It works.
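Pulling the suggestions above together, here is one possible sketch of a finished count_words. It assumes that matching runs of word characters with \w+ is acceptable for your text; note that this also splits contractions such as "it's" into two tokens, and that the short input sentence is adapted from the question's string for illustration only.

```python
import re
from collections import Counter

def count_words(text):
    """Count how many times each unique word occurs in text."""
    # \w+ matches runs of letters, digits and underscores, so ordinary
    # punctuation and the dash in "ham—plain" both act as separators.
    words = re.findall(r"\w+", text.lower())
    return Counter(words)

counts = count_words("A face as big as a ham—plain and pale.")
print(counts["as"], counts["a"], counts["ham"], counts["plain"])  # 2 2 1 1
```

The function returns the counts instead of printing them, so the caller decides what to do with the result.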

Related

Create dictionary with unique elements in list and their collocation

I'd need some help with the following:
I'd like to create a function that, given a string, returns a dictionary with the unique elements in the list as keys and, as values, the words that come before and after each one.
For example, with the following string:
I'd like to have the following:
Note that some words are repeated and have different values next to them.
I am trying with the following function:
def ffg(txt):
    txt = re.sub(r'[^\w\s]','',txt).lower().split()
    words = list(set(txt))
    indx = [words.index(i) for i in txt]
    for x in range(len(txt)):
        res = txt[x]
But as you can see, it doesn't work at all.

I am assuming you pass a sequence of words, so split the text into words however you please.
from collections import defaultdict
def word_context(l):
    context = defaultdict(set)
    for i, w in enumerate(l):
        if i + 1 < len(l):
            context[w].add(l[i+1])
        if i - 1 >= 0:
            context[w].add(l[i-1])
    return dict(context)
Result:
>>> l
['half', 'a', 'league', 'half', 'a', 'league', 'half', 'a', 'league', 'onward', 'all', 'in', 'the', 'valley', 'of', 'death', 'rode', 'the', 'six', 'hundred']
>>> word_context(l)
{'half': {'a', 'league'}, 'a': {'half', 'league'}, 'league': {'half', 'a', 'onward'}, 'onward': {'all', 'league'}, 'all': {'onward', 'in'}, 'in': {'all', 'the'}, 'the': {'six', 'rode', 'in', 'valley'}, 'valley': {'the', 'of'}, 'of': {'death', 'valley'}, 'death': {'rode', 'of'}, 'rode': {'death', 'the'}, 'six': {'the', 'hundred'}, 'hundred': {'six'}}

Another variation:
import re
def collocate(txt):
    txt = re.sub(r'[^\w\s]', '', txt).lower().split()
    neighbors = {}
    for i in range(len(txt)):
        if txt[i] not in neighbors:
            neighbors[txt[i]] = set()
        if i > 0:
            neighbors[txt[i]].add(txt[i-1])
        if i < len(txt) - 1:
            neighbors[txt[i]].add(txt[i+1])
    return neighbors
print(collocate("Half a league, half a league, Half a league onward, All in the valley of Death Rode the six hundred."))

Determining the most frequent casing of a word in Python

I have a text and I want to determine the most frequent casing of each word and create a dictionary with it. This is an extract of the text:
PENCIL: A pencil is an object we use to write. Pencil should not be confused by pen, which is a different object. A pencil is usually made from a pigment core inside a protective casing.
For example, a word like "pencil" could appear as "Pencil", "PENCIL" or "pencil" in my text. I would like to create a function that would first determine which of those options is the most frequent one. I have started by classifying all the words into three groups, depending on casing, though I don't know how to determine which case is the most frequent one (I guess I'd have to do a comparison across the three lists, but I don't know how to do that):
list_upper = []
list_lower = []
list_both = []
for word in text:
    if word.isupper():
        list_upper.append(word)
    if word.islower():
        list_lower.append(word)
    if word == word.title():
        list_both.append(word)
Then, it will create a dictionary in which the first key would be the lowercase words and the values would be the most frequent type. For example: pencil, Pencil. I'm not sure how to do this either... This is my desired output:
my_dictionary = {"pencil":"Pencil", "the":"THE"...}

I'm assuming that text is already an iterable of words and that words like 'pEnCiL' cannot occur.
Instead of building these three lists, you can start constructing a dictionary with the counts right away. I suggest using a defaultdict which returns Counter instances when a key is missing.
from collections import defaultdict, Counter
cases = defaultdict(Counter)
for word in text:
    cases[word.lower()][word] += 1
For a list text with the content
['pencil', 'pencil', 'PENCIL', 'Pencil', 'Pencil', 'PENCIL', 'rubber', 'PENCIL']
this will produce the following cases dictionary.
defaultdict(collections.Counter,
{'pencil': Counter({'PENCIL': 3, 'Pencil': 2, 'pencil': 2}),
'rubber': Counter({'rubber': 1})})
From here, you can construct the final result as follows.
result = {w:c.most_common(1)[0][0] for w, c in cases.items()}
This will give you
{'pencil': 'PENCIL', 'rubber': 'rubber'}
in this example. If two cases appear equally often, an arbitrary one is picked as the most common.
~edit~
Turns out text is not an iterable of words. Daniel Mesejo's answer has a regular expression which can help you extract the words from a string.
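If ties matter to you, one option is to make the choice explicit rather than relying on most_common (on Python 3.7 and later, most_common lists equal counts in first-encountered order). The sketch below is only an illustration: the function name most_frequent_casing and the tie-break rule (alphabetical order of the variants) are assumptions, not part of the answer above.

```python
from collections import Counter, defaultdict

def most_frequent_casing(words):
    """Map each lowercased word to its most frequent casing."""
    cases = defaultdict(Counter)
    for word in words:
        cases[word.lower()][word] += 1
    # Highest count wins; on a tie, the alphabetically smallest variant
    # wins (a hypothetical rule chosen only to make the result stable).
    return {
        key: min(counter.items(), key=lambda item: (-item[1], item[0]))[0]
        for key, counter in cases.items()
    }

words = ['pencil', 'Pencil', 'PENCIL', 'PENCIL', 'Pencil', 'rubber']
result = most_frequent_casing(words)
print(result)  # {'pencil': 'PENCIL', 'rubber': 'rubber'}
```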

You could use Counter with defaultdict:
import re
from collections import Counter, defaultdict
def words(t):
    return re.findall(r'\w+', t)
text = """PENCIL: A pencil is an object we use to write.
Pencil should not be confused by pen, which is a different object.
A pencil is usually made from a pigment core inside a protective casing.
Another casing with different Casing"""
table = defaultdict(list)
for word in words(text):
    table[word.lower()].append(word)
result = {key: Counter(values).most_common(1)[0][0] for key, values in table.items()}
print(result)
Output
{'casing': 'casing', 'be': 'be', 'core': 'core', 'another': 'Another', 'object': 'object', 'should': 'should', 'from': 'from', 'write': 'write', 'pen': 'pen', 'protective': 'protective', 'a': 'a', 'which': 'which', 'pencil': 'pencil', 'different': 'different', 'not': 'not', 'is': 'is', 'by': 'by', 'inside': 'inside', 'to': 'to', 'confused': 'confused', 'with': 'with', 'pigment': 'pigment', 'we': 'we', 'use': 'use', 'an': 'an', 'made': 'made', 'usually': 'usually'}
First create a dictionary where the keys are the lower case variant of each word and the values are a list of the corresponding occurrences. Then use Counter to count the number of each casing and get the most common. Note the use of regex to extract the words.

You have two great answers already. Just for fun I figured we could try just using the builtins since you have already tokenized the words:
# Create a temp dict within the main dict that counts the occurrences of cases
d = {}
for word in words:
    d.setdefault(word.lower(), {}).setdefault(word, 0)
    d[word.lower()][word] += 1
# Create a function to convert the temp d back to its most common occurrence
def func(dct):
    return sorted(dct.items(), key=lambda x: x[-1])[-1][0]
# Use function and dictionary comprehension to convert the results.
result = {k: func(v) for k, v in d.items()}
Test case:
text = """
PENCIL: A pencil is an object we use to write.
Pencil should not be confused by pen, which is a different object.
A pencil is usually made from a pigment core inside a protective casing.
PENCIL PENCIL PENCIL Pigment Pigment Pigment Pigment
"""
# Added last line to produce a different result
result
# {'pencil': 'PENCIL',
# 'a': 'a', 'is': 'is',
# 'an': 'an', 'object': 'object',
# 'we': 'we', 'use': 'use', 'to': 'to',
# 'write': 'write', 'should': 'should',
# 'not': 'not', 'be': 'be', 'confused':
# 'confused', 'by': 'by', 'pen': 'pen',
# 'which': 'which', 'different': 'different',
# 'usually': 'usually', 'made': 'made',
# 'from': 'from', 'pigment': 'Pigment',
# 'core': 'core', 'inside': 'inside',
# 'protective': 'protective', 'casing': 'casing'}

Removing punctuation and creating a dictionary Python

I'm trying to create a function that removes punctuation and lowercases every letter in a string. Then, it should return all this in the form of a dictionary that counts the word frequency in the string.
This is the code I wrote so far:
def word_dic(string):
    string = string.lower()
    new_string = string.split(' ')
    result = {}
    for key in new_string:
        if key in result:
            result[key] += 1
        else:
            result[key] = 1
    for c in result:
        "".join([ c if not c.isalpha() else "" for c in result])
    return result
But this is what I'm getting after executing it:
{'am': 3,
'god!': 1,
'god.': 1,
'i': 2,
'i?': 1,
'thanks': 1,
'to': 1,
'who': 2}
I just need to remove the punctuation at the end of the words.

Another option is to use Python's famous "batteries included" standard library.
>>> import re
>>> sentence = 'Is this a test? It could be!'
>>> from collections import Counter
>>> Counter(re.sub(r'\W', ' ', sentence.lower()).split())
Counter({'a': 1, 'be': 1, 'this': 1, 'is': 1, 'it': 1, 'test': 1, 'could': 1})
Leverages collections.Counter for counting words, and re.sub for replacing everything that's not a word character.

"".join([ c if not c.isalpha() else "" for c in result]) creates a new string without the punctuation, but it doesn't do anything with it; it's thrown away immediately, because you never store the result.
Really, the best way to do this is to normalize your keys before counting them in result. For example, you might do:
for key in new_string:
    # Keep only the alphabetic parts of each key, and replace key for future use
    key = "".join([c for c in key if c.isalpha()])
    if key in result:
        result[key] += 1
    else:
        result[key] = 1
Now result never has keys with punctuation (and the counts for "god." and "god!" are summed under the key "god" alone), and there is no need for another pass to strip the punctuation after the fact.
Alternatively, if you only care about leading and trailing punctuation on each word (so "it's" should be preserved as is, not converted to "its"), you can simplify a lot further. Simply import string, then change:
key = "".join([c for c in key if c.isalpha()])
to:
key = key.rstrip(string.punctuation)
This matches what you specifically asked for in your question (remove punctuation at the end of words, but not at the beginning or embedded within the word).
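To make the difference between these options concrete, here is a small comparison on a few made-up words (the sample list is an assumption for illustration). It also shows str.strip, which removes punctuation from both ends.

```python
import string

# Hypothetical sample words with punctuation in different positions
words = ["god!", "it's", "(hello)", "i?"]

# Keep only alphabetic characters: embedded apostrophes are lost too
alpha_only = ["".join(c for c in w if c.isalpha()) for w in words]
print(alpha_only)  # ['god', 'its', 'hello', 'i']

# Strip trailing punctuation only: a leading "(" survives
trailing = [w.rstrip(string.punctuation) for w in words]
print(trailing)  # ['god', "it's", '(hello', 'i']

# Strip punctuation from both ends, keeping embedded apostrophes
both_ends = [w.strip(string.punctuation) for w in words]
print(both_ends)  # ['god', "it's", 'hello', 'i']
```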

You can use string.punctuation to recognize punctuation and use collections.Counter to count occurrences once the string is correctly decomposed.
from collections import Counter
from string import punctuation
line = "It's a test and it's a good ol' one."
Counter(word.strip(punctuation) for word in line.casefold().split())
# Counter({"it's": 2, 'a': 2, 'test': 1, 'and': 1, 'good': 1, 'ol': 1, 'one': 1})
Using str.strip instead of str.replace preserves words such as It's.
The method str.casefold is simply a more general case of str.lower.
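A quick illustration of where the two methods differ, using the German word "Straße" as an example chosen here (it does not appear in the question): casefold applies more aggressive folding intended for caseless comparison.

```python
word = "Straße"

print(word.lower())     # straße
print(word.casefold())  # strasse

# Caseless comparison succeeds with casefold but not with lower
print("STRASSE".casefold() == word.casefold())  # True
print("STRASSE".lower() == word.lower())        # False
```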

If you want to reuse the words later, you can store each one in a sub-dictionary along with its number of occurrences. Each word will have its place in a dictionary. We can write our own function to remove punctuation, which is pretty simple.
See if the code below serves your needs:
def remove_punctuation(word):
    for c in word:
        if not c.isalpha():
            word = word.replace(c, '')
    return word
def word_dic(s):
    words = s.lower().split(' ')
    result = {}
    for word in words:
        word = remove_punctuation(word)
        if not result.get(word, None):
            result[word] = {
                'word': word,
                'ocurrences': 1,
            }
            continue
        result[word]['ocurrences'] += 1
    return result
phrase = 'Who am I and who are you? Are we gods? Gods are we? We are what we are!'
print(word_dic(phrase))
and you'll have an output like this:
{
'who': {
'word': 'who',
'ocurrences': 2},
'am': {
'word': 'am',
'ocurrences': 1},
'i': {
'word': 'i',
'ocurrences': 1},
'and': {
'word': 'and',
'ocurrences': 1},
'are': {
'word': 'are',
'ocurrences': 5},
'you': {
'word': 'you',
'ocurrences': 1},
'we': {
'word': 'we',
'ocurrences': 4},
'gods': {
'word': 'gods',
'ocurrences': 2},
'what': {
'word': 'what',
'ocurrences': 1}
}
Then you can easily access each word and its occurrences simply by doing:
word_dic(phrase)['are']['word'] # output: are
word_dic(phrase)['are']['ocurrences'] # output: 5

Find a length of a sub list of a list Python

I'm currently writing a program that needs to figure out the lengths of the words in a sublist of a list and then add them together. Currently the code I have can break the list that is a sentence into individual words, and now I need to find the lengths of the individual words in the sublist. Whenever I use counter_split_list, it counts the words in the sentence.
Here is my code:
def split_by_whitespace (['It', 'is', 'a', 'truth', 'universally', 'acknowledged'], ['that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune', 'must', 'be', 'in', 'want', 'of', 'a', 'wife']):
    l = my_string
    split_list =[i.split() for i in l]
    #counter_split_list=[len(i) for i in split_list]
    return(split_list)

I assume your input is not a string but a list of lists. These sub-lists contain words as shown in your input example. If that's the case, then the following function returns the length of each word, as you wanted.
# Modified version of your function
def split_by_whitespace (my_string):
    sentence_all_len = []
    for sentence in my_string:
        sentence_len = [len(word) for word in sentence]
        sentence_all_len.append(sentence_len)
    return sentence_all_len
# I assume your input looks like this, as you provided in your question (though written incorrectly there).
list_of_strings = [['It', 'is', 'a', 'truth', 'universally', 'acknowledged'], ['that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune', 'must', 'be', 'in', 'want', 'of', 'a', 'wife']]
# Output statement with a call to your function with the list of lists as an input
print(split_by_whitespace(list_of_strings))
Output:
[[2, 2, 1, 5, 11, 12], [4, 1, 6, 3, 2, 10, 2, 1, 4, 7, 4, 2, 2, 4, 2, 1, 4]]
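Since the question also mentions adding the lengths together, here is a possible follow-up sketch. The helper name total_word_lengths is hypothetical, and it assumes the same list-of-lists input as above.

```python
def total_word_lengths(sentences):
    """Return the summed word length for each sentence (a list of words)."""
    return [sum(len(word) for word in sentence) for sentence in sentences]

list_of_strings = [
    ['It', 'is', 'a', 'truth', 'universally', 'acknowledged'],
    ['that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good',
     'fortune', 'must', 'be', 'in', 'want', 'of', 'a', 'wife'],
]

totals = total_word_lengths(list_of_strings)
print(totals)  # [33, 59]
```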

How to sort unique words in order of appearance?

restart = True
while restart == True:
    option = input("Would you like to compress or decompress this file?\nIf you would like to compress type c \nIf you would like to decompress type d.\n").lower()
    if option == 'c':
        text = input("Please type the text you would like to compress.\n")
        text = text.split()
        for count,word in enumerate(text):
            if text.count(word) < 2:
                order.append (max(order)+1)
            else:
                order.append (text.index(word)+1)
        print (uniqueWords)
        print (order)
        break
    elif option == 'd':
        pass
    else:
        print("Sorry that was not an option")
For part of my assignment I need to identify unique words and send them to a text file. I understand how to write text to a text file, but I do not understand how to order this code so that it produces the following in a text file (if I were to input "the world of the flowers is a small world to be in"):
the,world,of,flowers,is,a,small,to,be,in
1, 2, 3, 1, 5, 6, 7, 8, 2, 9, 10
The top line stating the unique words and the second line showing the order of the words in order to be later decompressed. I have no issue with the decompression or the sorting of the numbers but only the unique words being in order.
Any assistance would be much appreciated!

text = "the world of the flowers is a small world to be in"
words = text.split()
unique_ordered = []
for word in words:
    if word not in unique_ordered:
        unique_ordered.append(word)

from collections import OrderedDict
text = "the world of the flowers is a small world to be in"
words = text.split()
print(list(OrderedDict.fromkeys(words)))
output
['the', 'world', 'of', 'flowers', 'is', 'a', 'small', 'to', 'be', 'in']
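On Python 3.7 and later, the built-in dict also preserves insertion order, so the same idea can be sketched without OrderedDict:

```python
text = "the world of the flowers is a small world to be in"
words = text.split()

# dict.fromkeys keeps the first occurrence of each key, in insertion order
unique_words = list(dict.fromkeys(words))
print(unique_words)
# ['the', 'world', 'of', 'flowers', 'is', 'a', 'small', 'to', 'be', 'in']
```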

That's an interesting problem. In fact, it can be solved using a dictionary to keep the index of the first occurrence and to check whether a word was already encountered:
string = "the world of the flowers is a small world to be in"
dct = {}
words = []
indices = []
idx = 1
for substring in string.split():
    # Check if you've seen it already.
    if substring in dct:
        # Already seen it, so append the index of the first occurrence
        indices.append(dct[substring])
    else:
        # Add it to the dictionary with the index and just append the word and index
        dct[substring] = idx
        words.append(substring)
        indices.append(idx)
        idx += 1
>>> print(words)
['the', 'world', 'of', 'flowers', 'is', 'a', 'small', 'to', 'be', 'in']
>>> print(indices)
[1, 2, 3, 1, 4, 5, 6, 7, 2, 8, 9, 10]
If you don't want the indices there are also some external modules that have such a function to get the unique words in order of appearance:
>>> from iteration_utilities import unique_everseen
>>> list(unique_everseen(string.split()))
['the', 'world', 'of', 'flowers', 'is', 'a', 'small', 'to', 'be', 'in']
>>> from more_itertools import unique_everseen
>>> list(unique_everseen(string.split()))
['the', 'world', 'of', 'flowers', 'is', 'a', 'small', 'to', 'be', 'in']
>>> from toolz import unique
>>> list(unique(string.split()))
['the', 'world', 'of', 'flowers', 'is', 'a', 'small', 'to', 'be', 'in']

To remove duplicate entries from a list whilst preserving order, you may check the answers to "How do you remove duplicates from a list whilst preserving order?". For example:
my_sentence = "the world of the flowers is a small world to be in"
wordlist = my_sentence.split()
# Accepted approach in linked post
def get_ordered_unique(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]
unique_list = get_ordered_unique(wordlist)
# where `unique_list` holds:
# ['the', 'world', 'of', 'flowers', 'is', 'a', 'small', 'to', 'be', 'in']
Then, in order to print the position of each word, you may use list.index() in a list comprehension:
>>> [unique_list.index(word)+1 for word in wordlist]
[1, 2, 3, 1, 4, 5, 6, 7, 2, 8, 9, 10]
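Since the question mentions decompressing later, here is a rough sketch of the round trip. The function names compress and decompress are hypothetical, and the sketch assumes words are separated by single spaces, so joining with " " restores the original text.

```python
def compress(text):
    """Return (unique words in order of appearance, 1-based positions)."""
    words = text.split()
    positions = {}
    for word in words:
        # Only the first occurrence of a word assigns it a new position
        positions.setdefault(word, len(positions) + 1)
    unique_words = list(positions)
    order = [positions[word] for word in words]
    return unique_words, order

def decompress(unique_words, order):
    """Rebuild the text from the unique words and their positions."""
    return " ".join(unique_words[i - 1] for i in order)

text = "the world of the flowers is a small world to be in"
unique_words, order = compress(text)
print(",".join(unique_words))  # the,world,of,flowers,is,a,small,to,be,in
print(order)                   # [1, 2, 3, 1, 4, 5, 6, 7, 2, 8, 9, 10]
print(decompress(unique_words, order) == text)  # True
```

Looking positions up in a dictionary avoids calling list.index() once per word, which can matter for longer texts.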
