Determining the most frequent casing of a word in Python

I have a text and I want to determine the most frequent casing of each word and create a dictionary from it. This is an extract of the text:
PENCIL: A pencil is an object we use to write. Pencil should not be confused by pen, which is a different object. A pencil is usually made from a pigment core inside a protective casing.
For example, a word like "pencil" could appear as "Pencil", "PENCIL" or "pencil" in my text. I would like to create a function that would first determine which of those options is the most frequent one. I have started by classifying all the words into three groups, depending on casing, though I don't know how to determine which case is the most frequent one (I guess I'd have to do a comparison across the three lists, but I don't know how to do that):
list_upper = []
list_lower = []
list_both = []
for word in text:
    if word.isupper():
        list_upper.append(word)
    if word.islower():
        list_lower.append(word)
    if word == word.title():
        list_both.append(word)
Then, I want to create a dictionary in which the keys would be the lowercase words and the values would be the most frequent casing. For example: pencil, Pencil. I'm not sure how to do this either... This is my desired output:
my_dictionary = {"pencil":"Pencil", "the":"THE"...}

I'm assuming that text is already an iterable of words and that words like 'pEnCiL' cannot occur.
Instead of building these three lists, you can start constructing a dictionary with the counts right away. I suggest using a defaultdict which returns Counter instances when a key is missing.
from collections import defaultdict, Counter

cases = defaultdict(Counter)
for word in text:
    cases[word.lower()][word] += 1
For a list text with the content
['pencil', 'pencil', 'PENCIL', 'Pencil', 'Pencil', 'PENCIL', 'rubber', 'PENCIL']
this will produce the following cases dictionary.
defaultdict(collections.Counter,
            {'pencil': Counter({'PENCIL': 3, 'Pencil': 2, 'pencil': 2}),
             'rubber': Counter({'rubber': 1})})
From here, you can construct the final result as follows.
result = {w: c.most_common(1)[0][0] for w, c in cases.items()}
This will give you
{'pencil': 'PENCIL', 'rubber': 'rubber'}
in this example. If two cases appear equally often, an arbitrary one is picked as the most common.
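If you need a deterministic tie-break instead, you can pick the maximum yourself with an explicit key. A minimal sketch, assuming the cases dictionary from above (the preference for the all-lowercase form is my own choice, not something the question asks for):

def most_common_case(counter):
    # Highest count wins; ties prefer the all-lowercase form,
    # then fall back to plain string comparison.
    return max(counter, key=lambda w: (counter[w], w.islower(), w))

result = {w: most_common_case(c) for w, c in cases.items()}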
Edit:
Turns out text is not an iterable of words. Daniel Mesejo's answer has a regular expression which can help you extract the words from a string.

You could use Counter with defaultdict:
import re
from collections import Counter, defaultdict

def words(t):
    return re.findall(r'\w+', t)

text = """PENCIL: A pencil is an object we use to write.
Pencil should not be confused by pen, which is a different object.
A pencil is usually made from a pigment core inside a protective casing.
Another casing with different Casing"""

table = defaultdict(list)
for word in words(text):
    table[word.lower()].append(word)

result = {key: Counter(values).most_common(1)[0][0] for key, values in table.items()}
print(result)
Output
{'casing': 'casing', 'be': 'be', 'core': 'core', 'another': 'Another', 'object': 'object', 'should': 'should', 'from': 'from', 'write': 'write', 'pen': 'pen', 'protective': 'protective', 'a': 'a', 'which': 'which', 'pencil': 'pencil', 'different': 'different', 'not': 'not', 'is': 'is', 'by': 'by', 'inside': 'inside', 'to': 'to', 'confused': 'confused', 'with': 'with', 'pigment': 'pigment', 'we': 'we', 'use': 'use', 'an': 'an', 'made': 'made', 'usually': 'usually'}
First create a dictionary where the keys are the lowercase variant of each word and the values are a list of the corresponding occurrences. Then use Counter to count the occurrences of each casing and take the most common one. Note the use of a regex to extract the words.

You have two great answers already. Just for fun I figured we could try just using the builtins since you have already tokenized the words:
# Create a temp dict within the main dict that counts the occurrences of cases
d = {}
for word in words:
    d.setdefault(word.lower(), {}).setdefault(word, 0)
    d[word.lower()][word] += 1

# Create a function to convert the temp dict back to its most common occurrence
def func(dct):
    return sorted(dct.items(), key=lambda x: x[-1])[-1][0]

# Use the function in a dictionary comprehension to convert the results.
result = {k: func(v) for k, v in d.items()}
Test case:
text = """
PENCIL: A pencil is an object we use to write.
Pencil should not be confused by pen, which is a different object.
A pencil is usually made from a pigment core inside a protective casing.
PENCIL PENCIL PENCIL Pigment Pigment Pigment Pigment
"""
# Added last line to produce a different result
result
# {'pencil': 'PENCIL', 'a': 'a', 'is': 'is', 'an': 'an',
#  'object': 'object', 'we': 'we', 'use': 'use', 'to': 'to',
#  'write': 'write', 'should': 'should', 'not': 'not', 'be': 'be',
#  'confused': 'confused', 'by': 'by', 'pen': 'pen', 'which': 'which',
#  'different': 'different', 'usually': 'usually', 'made': 'made',
#  'from': 'from', 'pigment': 'Pigment', 'core': 'core',
#  'inside': 'inside', 'protective': 'protective', 'casing': 'casing'}

Related

Create dictionary with unique elements in list and their collocation

I'd need some help with the following:
I'd like to create a function so that, when I insert a string, I get a dictionary with the unique elements in the list as keys, and as values the words that appear before and after them.
For example, with the following string:
I'd like to have the following:
Important to note that, for example, some words are repeated and have different values next to them.
I am trying with the following function:
import re

def ffg(txt):
    txt = re.sub(r'[^\w\s]', '', txt).lower().split()
    words = list(set(txt))
    indx = [words.index(i) for i in txt]
    for x in range(len(txt)):
        res = txt[x]
But as you can see, it doesn't work at all.
I am assuming you pass a sequence of words, so split the text into words however you please.
from collections import defaultdict

def word_context(l):
    context = defaultdict(set)
    for i, w in enumerate(l):
        if i + 1 < len(l):
            context[w].add(l[i + 1])
        if i - 1 >= 0:
            context[w].add(l[i - 1])
    return dict(context)
Result:
>>> l
['half', 'a', 'league', 'half', 'a', 'league', 'half', 'a', 'league', 'onward', 'all', 'in', 'the', 'valley', 'of', 'death', 'rode', 'the', 'six', 'hundred']
>>> word_context(l)
{'half': {'a', 'league'}, 'a': {'half', 'league'}, 'league': {'half', 'a', 'onward'}, 'onward': {'all', 'league'}, 'all': {'onward', 'in'}, 'in': {'all', 'the'}, 'the': {'six', 'rode', 'in', 'valley'}, 'valley': {'the', 'of'}, 'of': {'death', 'valley'}, 'death': {'rode', 'of'}, 'rode': {'death', 'the'}, 'six': {'the', 'hundred'}, 'hundred': {'six'}}
Another variation:
import re

def collocate(txt):
    txt = re.sub(r'[^\w\s]', '', txt).lower().split()
    neighbors = {}
    for i in range(len(txt)):
        if txt[i] not in neighbors:
            neighbors[txt[i]] = set()
        if i > 0:
            neighbors[txt[i]].add(txt[i - 1])
        if i < len(txt) - 1:
            neighbors[txt[i]].add(txt[i + 1])
    return neighbors
print(collocate("Half a league, half a league, Half a league onward, All in the valley of Death Rode the six hundred."))

Python split text into tokens using regex

Hi, I have a question about splitting strings into tokens.
Here is an example string:
string = "As I was waiting, a man came out of a side room, and at a glance I was sure he must be Long John. His left leg was cut off close by the hip, and under the left shoulder he carried a crutch, which he managed with wonderful dexterity, hopping about upon it like a bird. He was very tall and strong, with a face as big as a ham—plain and pale, but intelligent and smiling. Indeed, he seemed in the most cheerful spirits, whistling as he moved about among the tables, with a merry word or a slap on the shoulder for the more favoured of his guests."
and I'm trying to split the string correctly into its tokens.
Here is my function count_words:
import re

def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()  # dictionary of { <word>: <count> } pairs to return
    #counts["I"] = 1
    print(text)
    # TODO: Convert to lowercase
    lowerText = text.lower()
    # TODO: Split text into tokens (words), leaving out punctuation
    # (Hint: Use regex to split on non-alphanumeric characters)
    split = re.split(r"[\s.,!?:;'\"-]+", lowerText)
    print(split)
    # TODO: Aggregate word counts using a dictionary
and here is the result of split:
['as', 'i', 'was', 'waiting', 'a', 'man', 'came', 'out', 'of', 'a',
'side', 'room', 'and', 'at', 'a', 'glance', 'i', 'was', 'sure', 'he',
'must', 'be', 'long', 'john', 'his', 'left', 'leg', 'was', 'cut',
'off', 'close', 'by', 'the', 'hip', 'and', 'under', 'the', 'left',
'shoulder', 'he', 'carried', 'a', 'crutch', 'which', 'he', 'managed',
'with', 'wonderful', 'dexterity', 'hopping', 'about', 'upon', 'it',
'like', 'a', 'bird', 'he', 'was', 'very', 'tall', 'and', 'strong',
'with', 'a', 'face', 'as', 'big', 'as', 'a', 'ham—plain', 'and',
'pale', 'but', 'intelligent', 'and', 'smiling', 'indeed', 'he',
'seemed', 'in', 'the', 'most', 'cheerful', 'spirits', 'whistling',
'as', 'he', 'moved', 'about', 'among', 'the', 'tables', 'with', 'a',
'merry', 'word', 'or', 'a', 'slap', 'on', 'the', 'shoulder', 'for',
'the', 'more', 'favoured', 'of', 'his', 'guests', '']
As you can see, there is an empty string '' at the last index of the split list.
Please help me understand where this empty string comes from and how to correctly split this example string.
You could use a list comprehension to iterate over the list items produced by re.split and only keep them if they are not empty strings:
def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()  # dictionary of { <word>: <count> } pairs to return
    #counts["I"] = 1
    print(text)
    # TODO: Convert to lowercase
    lowerText = text.lower()
    # TODO: Split text into tokens (words), leaving out punctuation
    # (Hint: Use regex to split on non-alphanumeric characters)
    split = re.split(r"[\s.,!?:;'\"-]+", lowerText)
    split = [x for x in split if x != '']  # <- list comprehension
    print(split)
You should also consider returning the data from the function, and printing it from the caller rather than printing it from within the function. That will provide you with flexibility in future.
That happens because the string ends with a ., which is in the split pattern: when the final . is matched, the next match starts at the very end of the string and is therefore empty, which is why you see ''.
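A minimal demonstration of the effect, using the same pattern on a tiny input:

import re

# The trailing '.' matches the separator pattern, so re.split
# emits the empty string that follows it.
print(re.split(r"[\s.,!?:;'\"-]+", "a cat."))  # ['a', 'cat', '']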
I suggest using re.findall instead, which works the opposite way, like this:
def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()  # dictionary of { <word>: <count> } pairs to return
    #counts["I"] = 1
    print(text)
    # TODO: Convert to lowercase
    lowerText = text.lower()
    # TODO: Split text into tokens (words), leaving out punctuation
    # (Hint: Use regex to split on non-alphanumeric characters)
    split = re.findall(r"[a-z\-]+", lowerText)
    print(split)
    # TODO: Aggregate word counts using a dictionary
Python's documentation for re explains this behavior:
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string.
Even though yours is not actually a capturing group, the effect is the same. Note that it could be at the end as well as at the start (for instance if your string started with a whitespace).
The two solutions already proposed (more or less) by others are these:
Solution 1: findall
As other users pointed out, you can use findall and try to invert the logic of the pattern. With yours, you can easily negate your character class: [^\s\.,!?:;'\"-]+.
But it depends on your regex pattern, because it is not always that easy.
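As a minimal sketch, here is that negated class applied to the question's text (the names string and lowerText follow the question's code; whether this tokenization suits you depends on your data):

import re

lowerText = string.lower()
# Keep runs of characters that are NOT separators,
# instead of splitting on runs that ARE separators.
tokens = re.findall(r"[^\s\.,!?:;'\"-]+", lowerText)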
Solution 2: check the starting and ending tokens
Instead of checking whether each token is != '', you can just look at the first and the last of the tokens, since the pattern eagerly consumes every run of the characters you split on.
split = re.split("[\s\.,!?:;'\"-]+",lowerText)
if split[0] == '':
split = split[1:]
if split[-1] == '':
split = split[:-1]
You get an empty string because the final period also matches the split pattern at the end of the string, and nothing follows it. You can, however, filter out empty strings with the filter function and thus complete your function:
import re
import collections

def count_words(text):
    """Count how many times each unique word occurs in text."""
    lowerText = text.lower()
    split = re.split(r"[ .,!?:;'\"\-]+", lowerText)
    # filter out empty strings and count words:
    return collections.Counter(filter(None, split))

count_words(text=string)
# Counter({'a': 9, 'he': 6, 'the': 6, 'and': 5, 'as': 4, 'was': 4, 'with': 3, 'his': 2, 'about': 2, 'i': 2, 'of': 2, 'shoulder': 2, 'left': 2, 'dexterity': 1, 'seemed': 1, 'managed': 1, 'among': 1, 'indeed': 1, 'favoured': 1, 'moved': 1, 'it': 1, 'slap': 1, 'cheerful': 1, 'at': 1, 'in': 1, 'close': 1, 'glance': 1, 'face': 1, 'pale': 1, 'smiling': 1, 'out': 1, 'tables': 1, 'cut': 1, 'ham': 1, 'for': 1, 'long': 1, 'intelligent': 1, 'waiting': 1, 'wonderful': 1, 'which': 1, 'under': 1, 'must': 1, 'bird': 1, 'guests': 1, 'more': 1, 'hip': 1, 'be': 1, 'sure': 1, 'leg': 1, 'very': 1, 'big': 1, 'spirits': 1, 'upon': 1, 'but': 1, 'like': 1, 'most': 1, 'carried': 1, 'whistling': 1, 'merry': 1, 'tall': 1, 'word': 1, 'strong': 1, 'by': 1, 'on': 1, 'john': 1, 'off': 1, 'room': 1, 'hopping': 1, 'or': 1, 'crutch': 1, 'man': 1, 'plain': 1, 'side': 1, 'came': 1})
import string

def count_words(text):
    counts = dict()
    text = text.translate(text.maketrans('', '', string.punctuation))
    text = text.lower()
    words = text.split()
    print(words)
    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return counts
It works.

When setting the value of a dictionary to an index i + 1 in a list, item is set to the item after the last occurrence of the value i

with open("text.txt", "r") as file:
contents = file.read().replace('\n',' ')
words = contents.split(' ')
wordsDict = {}
for i in range(len(words) - 1):
wordsDict[words[i]] = words[i + 1]
def assemble():
start = words[random.randint(0, len(words))]
print(start.capitalize())
assemble()
I am currently creating a Markov chain-esque project. When I ran this code, I expected the dictionary to look as follows:
(if text.txt read: the cat chased the rat while the dog chased the cat into the rat house)
{'the': 'cat', 'cat': 'chased', 'chased': 'the', 'the': 'rat', 'rat': 'while', 'while': 'the', 'the': 'dog', 'dog': 'chased', 'chased': 'the', 'the': 'cat', 'cat': 'into', 'into': 'the', 'the': 'rat', 'rat': 'house'}
but instead, I get
{'the': 'rat', 'cat': 'into', 'chased': 'the', 'rat': 'house', 'while': 'the', 'dog': 'chased', 'into': 'the'}
If you don't sense a pattern, it's that the value isn't just the next item in the array; it is the item after the last occurrence of the word. In this case, our first key-value pair is 'the': 'rat' because the last occurrence of the is followed by rat.
I have no idea why this happens or how to fix it.
The dictionary you wanted isn't valid; you can't have duplicate keys. You could try to do this with lists of lists instead.
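If the goal is a Markov-style transition table, one common shape (an assumption on my part, not something from the question) is a dictionary mapping each word to the list of words that follow it, duplicates included:

import random
from collections import defaultdict

transitions = defaultdict(list)
for current, following in zip(words, words[1:]):
    transitions[current].append(following)

# With the example text, 'the' keeps every successor:
# transitions['the'] -> ['cat', 'rat', 'dog', 'cat', 'rat']
start = random.choice(words)
print(start.capitalize())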

Find matching phrases and words in a string python

Using Python, what would be the most efficient way to extract common phrases or words from two given strings?
For example,
string1="once upon a time there was a very large giant called Jack"
string2="a very long time ago was a very brave young man called Jack"
Would return:
["a","time","there","was a very","called Jack"]
How would one go about doing this efficiently (in my case I would need to do this over thousands of 1000-word documents)?
You can split each string, then intersect the sets.
string1="once upon a time there was a very large giant called Jack"
string2="a very long time ago was a very brave young man called Jack"
set(string1.split()).intersection(set(string2.split()))
Result
set(['a', 'very', 'Jack', 'time', 'was', 'called'])
Note this only matches individual words. You have to be more specific about what you would consider a "phrase". The longest consecutive matching substring? That could get more complicated.
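For the longest consecutive run of matching words specifically, one standard-library option is difflib; a minimal word-level sketch, assuming the two example strings from the question:

from difflib import SequenceMatcher

w1 = string1.split()
w2 = string2.split()

# find_longest_match works on any hashable sequences, here lists of words.
m = SequenceMatcher(None, w1, w2).find_longest_match(0, len(w1), 0, len(w2))
print(' '.join(w1[m.a:m.a + m.size]))  # -> 'was a very'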
In natural language processing, you usually extract common patterns and sequences from sentences using n-grams.
In python, you can use the excellent NLTK module for that.
For counting and finding the most common, you can use collections.Counter.
Here's an example for 2-grams:
from nltk.util import ngrams
from collections import Counter
from itertools import chain

string1 = "once upon a time there was a very large giant called Jack"
string2 = "a very long time ago was a very brave young man called Jack"

n = 2
ngrams1 = ngrams(string1.split(" "), n)
ngrams2 = ngrams(string2.split(" "), n)
counter = Counter(chain(ngrams1, ngrams2))  # count occurrences of each n-gram
print([k for k, v in counter.items() if v > 1])  # print all n-grams that come up more than once
output:
[('called', 'Jack'), ('was', 'a'), ('a', 'very')]
output with n=3:
[('was', 'a', 'very')]
output with n=1 (printing k[0] instead of k, to unwrap the 1-tuples):
['Jack', 'a', 'was', 'time', 'called', 'very']
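If you want the repeated n-grams back as plain phrase strings rather than tuples, a small follow-up (assuming the counter built above):

phrases = [' '.join(k) for k, v in counter.items() if v > 1]
# with n=2 -> ['called Jack', 'was a', 'a very']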
This is a classic dynamic programming problem. All you need to do is build a suffix tree for string1, with words instead of letters (which is the usual formulation). Here is an illustrative example of a suffix tree.
1. Label all nodes in your tree as s1.
2. Insert all suffixes of string2 one by one.
3. All nodes that the suffixes in step 2 pass through are labeled s2.
4. Any new nodes created in step 2 are also labeled s2.
In the final tree, the path label of every node labeled both s1 and s2 is a common substring.
This algorithm is succinctly explained in this lecture note.
For two strings of lengths n and m, the suffix tree construction takes O(max(n,m)), and all the matching substrings (in your case, words or phrases) can be searched in O(#matches).
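A true suffix tree needs Ukkonen's algorithm to reach those bounds; as a much simpler (quadratic-space) illustration of the same labeling idea, here is a word-level suffix trie sketch (my own simplification, not the linear-time structure described above):

def common_word_substrings(s1, s2):
    root = {}

    def insert_suffixes(words, label):
        # Insert every suffix, labeling each node on the way down.
        for i in range(len(words)):
            node = root
            for word in words[i:]:
                node = node.setdefault(word, {'_labels': set()})
                node['_labels'].add(label)

    insert_suffixes(s1.split(), 's1')
    insert_suffixes(s2.split(), 's2')

    # The path label of every node carrying both labels is a
    # common word-level substring.
    out = []
    def walk(node, path):
        for word, child in node.items():
            if word == '_labels':
                continue
            if {'s1', 's2'} <= child['_labels']:
                out.append(' '.join(path + [word]))
            walk(child, path + [word])
    walk(root, [])
    return out

On the question's two strings this returns every common word-level substring, e.g. 'a', 'a very', 'was a very' and 'called Jack', including sub-phrases of longer matches.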
A couple of years later, but I tried it this way using Counter:
Input:
from collections import Counter

string1 = "once upon a time there was a very large giant called Jack"
string2 = "a very long time ago was a very brave young man called Jack"

string1 += ' ' + string2
string1 = string1.split()
count = Counter(string1)

tag_count = []
for n, c in count.most_common(10):
    dics = {'tag': n, 'count': c}
    tag_count.append(dics)
Output:
[{'tag': 'a', 'count': 4},
{'tag': 'very', 'count': 3},
{'tag': 'time', 'count': 2},
{'tag': 'was', 'count': 2},
{'tag': 'called', 'count': 2},
{'tag': 'Jack', 'count': 2},
{'tag': 'once', 'count': 1},
{'tag': 'upon', 'count': 1},
{'tag': 'there', 'count': 1},
{'tag': 'large', 'count': 1}]
Hopefully, it would be useful for someone :)

compare words in two lists in python

I would appreciate someone's help on this probably simple matter: I have a long list of words in the form ['word', 'another', 'word', 'and', 'yet', 'another']. I want to compare these words against a list that I specify, checking whether or not my target words are contained in the first list.
I would like to output which of my "search" words are contained in the first list and how many times they appear. I tried something like list(set(a).intersection(set(b))) - but it splits up the words and compares letters instead.
How can I write in a list of words to compare with the existing long list? And how can I output co-occurrences and their frequencies? Thank you so much for your time and help.
>>> lst = ['word', 'another', 'word', 'and', 'yet', 'another']
>>> search = ['word', 'and', 'but']
>>> [(w, lst.count(w)) for w in set(lst) if w in search]
[('and', 1), ('word', 2)]
This code basically iterates through the unique elements of lst, and if an element is in the search list, it adds the word, along with its number of occurrences, to the resulting list.
Preprocess your list of words with a Counter:
from collections import Counter
a = ['word', 'another', 'word', 'and', 'yet', 'another']
c = Counter(a)
# c == Counter({'word': 2, 'another': 2, 'and': 1, 'yet': 1})
Now you can iterate over your new list of words and check whether they are contained in this Counter dictionary; the value gives you their number of appearances in the original list:
words = ['word', 'no', 'another']
for w in words:
    print(w, c.get(w, 0))
which prints:
word 2
no 0
another 2
or output it in a list:
[(w, c.get(w, 0)) for w in words]
# returns [('word', 2), ('no', 0), ('another', 2)]
