Create dictionary with unique elements in list and their collocation

Create dictionary with unique elements in list and their collocation - python

Id need some help with the following:
Id like to create a function that when inserted a string, i would get a dictionary with the unique elements in the list as the key, and as values the texts that are before and after it.
for example with the following string:
Id like to have the following:
Important to note that for example some words a repeated and have different values next to it.
I am trying with the following function:
def ffg(txt):
txt = re.sub(r'[^\w\s]','',txt).lower().split()
words = list(set(txt))
indx = [words.index(i) for i in txt]
for x in range(len(txt)):
res = txt[x]
But as you can see, it doesnt work at all.

I am assuming you past a sequence of words, so split the text into words however you please.
from collections import defaultdict
def word_context(l):
context = defaultdict(set)
for i, w in enumerate(l):
if i + 1 < len(l):
context[w].add(l[i+1])
if i - 1 >= 0:
context[w].add(l[i-1])
return dict(context)
Result:
>>> l
['half', 'a', 'league', 'half', 'a', 'league', 'half', 'a', 'league', 'onward', 'all', 'in', 'the', 'valley', 'of', 'death', 'rode', 'the', 'six', 'hundred']
>>> word_context(l)
{'half': {'a', 'league'}, 'a': {'half', 'league'}, 'league': {'half', 'a', 'onward'}, 'onward': {'all', 'league'}, 'all': {'onward', 'in'}, 'in': {'all', 'the'}, 'the': {'six', 'rode', 'in', 'valley'}, 'valley': {'the', 'of'}, 'of': {'death', 'valley'}, 'death': {'rode', 'of'}, 'rode': {'death', 'the'}, 'six': {'the', 'hundred'}, 'hundred': {'six'}}

Another variation:
import re
def collocate(txt):
txt = re.sub(r'[^\w\s]', '', txt).lower().split()
neighbors={}
for i in range(len(txt)):
if txt[i] not in neighbors:
neighbors[txt[i]]=set()
if i>0:
neighbors[txt[i]].add(txt[i-1])
if i < len(txt) - 1:
neighbors[txt[i]].add(txt[i+1])
return neighbors
print(collocate("Half a league, half a league, Half a league onward, All in the valley of Death Rode the six hundred."))

Related

Missing keys from dictionary counting words in list

My output is incomplete. There are 3 element which don't count.
# A programm to count words in a string and put them in a dictionary as key = word and value = count
def word_in_str (S):
dict_s = {} # make a empty dict
s = S.lower() # make string lowercase
l = s.split() # split string into a list and separate theme by spase
print (l) # original list contain all words
for word in l:
counter = l.count (str(word))
print (str(word)) # for testing the code, it's value = count
print (counter) # for testing the code, it's key = word
dict_s[str(word)] = counter
l[:] = (value for value in l if value != str(word)) #delete the word after count it
print (l) # for testing the code, it's the list after deleting the word
print (dict_s) # main print code, but there is no ('when', 'young', 'and') in result
if __name__ == '__main__':
word_in_str ('I am tall when I am young and I am short when I am old')
the output for this code is:
['i', 'am', 'tall', 'when', 'i', 'am', 'young', 'and', 'i', 'am', 'short', 'when', 'i', 'am', 'old']
i
4
['am', 'tall', 'when', 'am', 'young', 'and', 'am', 'short', 'when', 'am', 'old']
tall
1
['am', 'when', 'am', 'young', 'and', 'am', 'short', 'when', 'am', 'old']
am
4
['when', 'young', 'and', 'short', 'when', 'old']
short
1
['when', 'young', 'and', 'when', 'old']
old
1
['when', 'young', 'and', 'when'] <==what happened to this words?
{'i': 4, 'tall': 1, 'am': 4, 'short': 1, 'old': 1} <==result without the words above

I think you're over thinking the problem. A Counter already counts elements of an iterable, and it is a type of dict
from collections import Counter
def word_in_str(S):
return dict(Counter(S.split()))
The problem with your code is that your for loop is over l, but then you're attempting to "delete" and reassign l[:], where you don't really need to. Just count and store the dict entry.

def word_in_str (S):
dict_s = {}
s = S.lower()
l = s.split()
for word in l:
counter = l.count (word)
dict_s[word] = counter
print (dict_s)
word_in_str ('I am tall when I am young and I am short when I am old')

My function that has to return the 10 most frequent words while excluding words doesn't work when it seems like it should

Python beginner here. I made this function to find the 10 most frequent words in a dictionary called "Counts".
The thing is, I have to exclude all the items from the englprep, englconj, englpronouns and specialwords lists from the "Counts" dictionary, and then get the top 10 most frequent words returned as a dictionary. Basically I have to get the "getMostFrequent()" function to take the "Counts" dictionary and the specified lists of "no-no" words as an input to output a new dictionary containing the 10 most frequent words.
I have tried for hours but I can't for the life of me get this to work.
expected output should be somewhere along the lines of: {'river': 755, 'party': 527, 'water': 472, etc...}
but i just get: {'the': 16517,
'of': 8550,
'and': 6390,
'to': 5471,
'a': 3508,
'in': 3298,
'was': 2371,
'on': 2094,
'that': 1893,
'he': 1557}, Which contains words that i specified not to be included :/
Would really aprecciate some help or maybe even a possible solution. Thanks in advance to anyone willing to help.
PS! I use python 3.8
def countWords():
Counts = {}
for x in wordList:
if not x in Counts:
Counts[x] = wordList.count(x)
return Counts
def getMostFrequent():
exclWordList = tuple(englConj), tuple(englPrep), tuple(englPronouns), tuple(specialWords)
topNumber = 10
topFreqWords = dict(sorted(Counts.items(), key=lambda x: x[1], reverse=True)[:topNumber])
new_dict = {}
for key, value in topFreqWords.items():
for index in exclWordList:
for y in index:
if value is not y:
new_dict[key] = value
topFreqWords = new_dict
return topFreqWords
if __name__ == "__main__":
Counts = countWords()
englPrep = ['about', 'beside', 'near', 'to', 'above', 'between', 'of',
'towards', 'across', 'beyond', 'off', 'under', 'after', 'by',
'on', 'underneath', 'against', 'despite', 'onto', 'unlike',
'along', 'down', 'opposite', 'until', 'among', 'during', 'out',
'up', 'around', 'except', 'outside', 'along', 'as', 'for',
'over', 'via', 'at', 'from', 'past', 'with', 'before', 'in',
'round', 'within', 'behind', 'inside', 'since', 'without',
'below', 'into', 'than', 'beneath', 'like', 'through']
englConj = ['for', 'and', 'nor', 'but', 'or', 'yet', 'so']
englPronouns = ['you', 'he', 'she', 'him', 'her', 'his', 'hers', 'yours']
specialWords = ['the']
topFreqWords = getMostFrequent()

Try to pass Counts dictionary in getMostFrequent(Counts).
Your function should accept it unless Counts is declared in global scope.

In your code you take top 10 most frequent words including stopwords. You need to remove stopwords from Counts before sorting dict by value.
def getMostFrequent(Counts, englConj, englPronouns, specialWords):
exclWordList = set(englConj + englPrep + englPronouns + specialWords)
popitems = exclWordList.intersection(Counts.keys())
for i in popitems:
Counts.pop(i)
topNumber = 10
topFreqWords = dict(sorted(Counts.items(), key=lambda x: x[1], reverse=True)[:topNumber])
return topFreqWords

Determining the most frequent casing of a word in Python

I have a text and a I want to determine the most frequent casing of each word and create a dictionary with it. This is an extract of the text:
PENCIL: A pencil is an object we use to write. Pencil should not be confused by pen, which is a different object. A pencil is usually made from a pigment core inside a protective casing.
For example, a word like "pencil" could appear as "Pencil", "PENCIL" or "pencil" in my text. I would like to create a function that would first determine which of those options is the most frequent one. I have started by classifying all the words into three groups, depending on casing, though I don't know how to determine which case is the most frequent one (I guess I'd have to do a comparison across the three lists, but I don't know how to do that):
list_upper = []
list_lower = []
list_both = []
for word in text:
if isupper(word):
list_upper.append(word)
if islower(word):
list_lower.append(word)
if word == word.title():
list_both.append(word)
Then, it will create a dictionary in which the first key would be the lowercase words and the values would be the most frequent type. For example: pencil, Pencil. I'm not sure how to do this either... This is my desired output:
my_dictionary = {"pencil":"Pencil", "the":"THE"...}

I'm assuming that text is already an iterable of words and that words like 'pEnCiL' cannot occur.
Instead of building these three lists, you can start constructing a dictionary with the counts right away. I suggest using a defaultdict which returns Counter instances when a key is missing.
from collections import defaultdict, Counter
cases = defaultdict(Counter)
for word in text:
cases[word.lower()][word] += 1
For a list text with the content
['pencil', 'pencil', 'PENCIL', 'Pencil', 'Pencil', 'PENCIL', 'rubber', 'PENCIL']
this will produce the following cases dictionary.
defaultdict(collections.Counter,
{'pencil': Counter({'PENCIL': 3, 'Pencil': 2, 'pencil': 2}),
'rubber': Counter({'rubber': 1})})
From here, you can construct the final result as follows.
result = {w:c.most_common(1)[0][0] for w, c in cases.items()}
This will give you
{'pencil': 'PENCIL', 'rubber': 'rubber'}
in this example. If two cases appear equally often, an arbitrary one is picked as the most common.
~edit~
Turns out text is not an iterable of words. Daniel Mesejo's answer has a regular expression which can help you extract the words from a string.

You could use Counter with defaultdict:
import re
from collections import Counter, defaultdict
def words(t):
return re.findall('\w+', t)
text = """PENCIL: A pencil is an object we use to write.
Pencil should not be confused by pen, which is a different object.
A pencil is usually made from a pigment core inside a protective casing.
Another casing with different Casing"""
table = defaultdict(list)
for word in words(text):
table[word.lower()].append(word)
result = {key: Counter(values).most_common(1)[0][0] for key, values in table.items()}
print(result)
Output
{'casing': 'casing', 'be': 'be', 'core': 'core', 'another': 'Another', 'object': 'object', 'should': 'should', 'from': 'from', 'write': 'write', 'pen': 'pen', 'protective': 'protective', 'a': 'a', 'which': 'which', 'pencil': 'pencil', 'different': 'different', 'not': 'not', 'is': 'is', 'by': 'by', 'inside': 'inside', 'to': 'to', 'confused': 'confused', 'with': 'with', 'pigment': 'pigment', 'we': 'we', 'use': 'use', 'an': 'an', 'made': 'made', 'usually': 'usually'}
First create a dictionary where the keys are the lower case variant of each word and the values are a list of the corresponding occurrences. Then use Counter to count the number of each casing and get the most common. Note the use of regex to extract the words.

You have two great answers already. Just for fun I figured we could try just using the builtins since you have already tokenized the words:
# Create a temp dict within the main dict that counts the occurrences of cases
d= {}
for word in words:
d.setdefault(word.lower(), {}).setdefault(word, 0)
d[word.lower()][word] += 1
# Create a function to convert the temp d back to its most common occurrence
def func(dct):
return sorted(dct.items(), key=lambda x: x[-1])[-1][0]
# Use function and dictionary comprehension to convert the results.
result = {k: func(v) for k, v in d.items()}
Test case:
text = """
PENCIL: A pencil is an object we use to write.
Pencil should not be confused by pen, which is a different object.
A pencil is usually made from a pigment core inside a protective casing.
PENCIL PENCIL PENCIL Pigment Pigment Pigment Pigment
"""
# Added last line to produce a different result
result
# {'pencil': 'PENCIL',
# 'a': 'a', 'is': 'is',
# 'an': 'an', 'object': 'object',
# 'we': 'we', 'use': 'use', 'to': 'to',
# 'write': 'write', 'should': 'should',
# 'not': 'not', 'be': 'be', 'confused':
# 'confused', 'by': 'by', 'pen': 'pen',
# 'which': 'which', 'different': 'different',
# 'usually': 'usually', 'made': 'made',
# 'from': 'from', 'pigment': 'Pigment',
# 'core': 'core', 'inside': 'inside',
# 'protective': 'protective', 'casing': 'casing'}

When setting the value of a dictionary to an index i + 1 in a list, item is set to the the item after the last occurance of the value i

with open("text.txt", "r") as file:
contents = file.read().replace('\n',' ')
words = contents.split(' ')
wordsDict = {}
for i in range(len(words) - 1):
wordsDict[words[i]] = words[i + 1]
def assemble():
start = words[random.randint(0, len(words))]
print(start.capitalize())
assemble()
I am currently creating a markov chain-esque project. When I ran this code, I had expected for the dictionary to look as follows:
(if text.txt read: the cat chased the rat while the dog chased the cat into the rat house)
{'the': 'cat', 'cat': 'chased', 'chased': 'the', 'the': 'rat', 'rat': 'while', 'while': 'the', 'the': 'dog', 'dog': 'chased', 'chased': 'the', 'the': 'cat', 'cat': 'into', 'into': 'the', 'the': 'rat', 'rat': 'house'}
but instead, I get
{'the': 'rat', 'cat': 'into', 'chased': 'the', 'rat': 'house', 'while': 'the', 'dog': 'chased', 'into': 'the'}
If you don't sense a pattern, it's that the value isn't just the next item in the array, it is the next item after the last occurance of the word. In this case, our first key & value pair is 'the': 'rat' because the last occurance of the is followed by rat.
I have no idea why this happens or how to fix it.

The dictionary you wanted isn't valid, you can't duplicate keys. You could try to do this with lists of lists instead.

How to sort unique words in order of appearance?

restart = True
while restart == True:
option = input("Would you like to compress or decompress this file?\nIf you would like to compress type c \nIf you would like to decompress type d.\n").lower()
if option == 'c':
text = input("Please type the text you would like to compress.\n")
text = text.split()
for count,word in enumerate(text):
if text.count(word) < 2:
order.append (max(order)+1)
else:
order.append (text.index(word)+1)
print (uniqueWords)
print (order)
break
elif option == 'd':
pass
else:
print("Sorry that was not an option")
For part of my assignment I need to identify unique words and send them to a text file. I understand how to write text to a text file I do not understand how I can order this code appropriately so it reproduces in a text file (if I was to input "the world of the flowers is a small world to be in":
the,world,of,flowers,is,a,small,to,be,in
1, 2, 3, 1, 5, 6, 7, 8, 2, 9, 10
The top line stating the unique words and the second line showing the order of the words in order to be later decompressed. I have no issue with the decompression or the sorting of the numbers but only the unique words being in order.
Any assistance would be much appreciated!

text = "the world of the flowers is a small world to be in"
words = text.split()
unique_ordered = []
for word in words:
if word not in unique_ordered:
unique_ordered.append(word)

from collections import OrderedDict
text = "the world of the flowers is a small world to be in"
words = text.split()
print list(OrderedDict.fromkeys(words))
output
['the', 'world', 'of', 'flowers', 'is', 'a', 'small', 'to', 'be', 'in']

That's an interesting problem, in fact it can be solved using a dictionary to keep the index of the first occurence and to check if it was already encountered:
string = "the world of the flowers is a small world to be in"
dct = {}
words = []
indices = []
idx = 1
for substring in string.split():
# Check if you've seen it already.
if substring in dct:
# Already seen it, so append the index of the first occurence
indices.append(dct[substring])
else:
# Add it to the dictionary with the index and just append the word and index
dct[substring] = idx
words.append(substring)
indices.append(idx)
idx += 1
>>> print(words)
['the', 'world', 'of', 'flowers', 'is', 'a', 'small', 'to', 'be', 'in']
>>> print(indices)
[1, 2, 3, 1, 4, 5, 6, 7, 2, 8, 9, 10]
If you don't want the indices there are also some external modules that have such a function to get the unique words in order of appearance:
>>> from iteration_utilities import unique_everseen
>>> list(unique_everseen(string.split()))
['the', 'world', 'of', 'flowers', 'is', 'a', 'small', 'to', 'be', 'in']
>>> from more_itertools import unique_everseen
>>> list(unique_everseen(string.split()))
['the', 'world', 'of', 'flowers', 'is', 'a', 'small', 'to', 'be', 'in']
>>> from toolz import unique
>>> list(unique(string.split()))
['the', 'world', 'of', 'flowers', 'is', 'a', 'small', 'to', 'be', 'in']

To remove the duplicate entries from list whilst preserving the order, you my check How do you remove duplicates from a list in whilst preserving order?'s answers. For example:
my_sentence = "the world of the flowers is a small world to be in"
wordlist = my_sentence.split()
# Accepted approach in linked post
def get_ordered_unique(seq):
seen = set()
seen_add = seen.add
return [x for x in seq if not (x in seen or seen_add(x))]
unique_list = get_ordered_unique(wordlist)
# where `unique_list` holds:
# ['the', 'world', 'of', 'flowers', 'is', 'a', 'small', 'to', 'be', 'in']
Then in order to print the position of word, you may list.index() with list comprehension expression as:
>>> [unique_list.index(word)+1 for word in wordlist]
[1, 2, 3, 1, 4, 5, 6, 7, 2, 8, 9, 10]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Create dictionary with unique elements in list and their collocation - python

Related

Missing keys from dictionary counting words in list

My function that has to return the 10 most frequent words while excluding words doesn't work when it seems like it should

Determining the most frequent casing of a word in Python

When setting the value of a dictionary to an index i + 1 in a list, item is set to the the item after the last occurance of the value i

How to sort unique words in order of appearance?

Categories

Resources