I am trying to count every word from text files and appending the word and count to a dictionary as the key-value pairs. It throws me this error: if key not in wordDict:
TypeError: unhashable type: 'list'
Also, I am wondering of .split() is good because my text files contain different punctuation marks.
fileref = open(mypath + '/' + i, 'r')
wordDict = {}
for line in fileref.readlines():
key = line.split()
if key not in wordDict:
wordDict[key] = 1
else:
wordDict[key] += 1
from collections import Counter
text = '''I am trying to count every word from text files and appending the word and count to a dictionary as the key-value pairs. It throws me this error: if key not in wordDict: TypeError: unhashable type: 'list' Also, I am wondering of .split() is good because my text files contain different punctuation marks. Thanks ahead for those who help!'''
split_text = text.split()
counter = Counter(split_text)
print(counter)
out:
Counter({'count': 2, 'and': 2, 'text': 2, 'to': 2, 'I': 2, 'files': 2, 'word': 2, 'am': 2, 'the': 2, 'dictionary': 1, 'a': 1, 'not': 1, 'in': 1, 'ahead': 1, 'me': 1, 'trying': 1, 'every': 1, '.split()': 1, 'type:': 1, 'my': 1, 'punctuation': 1, 'is': 1, 'key': 1, 'error:': 1, 'help!': 1, 'those': 1, 'different': 1, 'throws': 1, 'TypeError:': 1, 'contain': 1, 'wordDict:': 1, 'appending': 1, 'if': 1, 'It': 1, 'Also,': 1, 'unhashable': 1, 'from': 1, 'because': 1, 'marks.': 1, 'pairs.': 1, 'this': 1, 'key-value': 1, 'wondering': 1, 'Thanks': 1, 'of': 1, 'good': 1, "'list'": 1, 'for': 1, 'who': 1, 'as': 1})
key is a list of space-delimited words found in the current line. You would need to iterate over that list as well.
for line in fileref:
keys = line.split()
for key in keys:
if key not in wordDict:
wordDict[key] = 1
else:
wordDict[key] += 1
This can be cleaned up considerably by either using the setdefault method or a defaultdict from the collections module; both allow you to avoid explicitly checking for a key by automatically adding the key with an initial value if it isn't already in the dict.
for key in keys:
wordDict.setdefault(key, 0) += 1
or
from collections import defaultdict
wordDict = defaultdict(int) # Default to 0, since int() == 0
...
for key in keys:
wordDict[key] += 1
key is a list and you're trying to see if a list is in a dictionary which is equivalent to seeing if it is one of the keys. Dictionary keys canot be lists hence the "unhashable type" error.
str.split return a list of words
>>> "hello world".split()
['hello', 'world']
>>>
and lists or any other mutable object cannot be used as a key of a dictionary, and that is why you get the error TypeError: unhashable type: 'list'.
You need to iterate over it to include each one of those, also the recommended way to work with a file is with the with statement
wordDict = {}
with open(mypath + '/' + i, 'r') as fileref:
for line in fileref:
for word in line.split():
if word not in wordDict:
wordDict[word] = 1
else:
wordDict[word] += 1
the above can be shortened with the use Counter and an appropriate call to it
from collections import Counter
with open(mypath + '/' + i, 'r') as fileref:
wordDict = Counter( word for line in fileref for word in line.split() )
Related
This question already has answers here:
Word count from a txt file program
(9 answers)
Closed 3 years ago.
I want to count the occurence of each word in a file using a dictionary (all words contained in the file are in lower-case and the file does not contain any punctuation).
I want to optimize my code because I am aware that the list takes unnecessary time.
def create_dictionary(filename):
d = {}
flat_list = []
with open(filename,"r") as fin:
for line in fin:
for word in line.split():
flat_list.append(word)
for i in flat_list:
if d.get(i,0) == 0:
d[i] = 1
else :
d[i] +=1
return d
For example, a file containing :
i go to the market to buy some things to
eat and drink because i want
to eat and drink
should return:
{'i': 2, 'go': 1, 'to': 4, 'the': 1, 'market': 1, 'buy': 1, 'some': 1, 'things': 1, 'eat': 2, 'and': 2, 'drink': 2, 'because': 1, 'want': 1}
What can I improve?
Just use collections.Counter:
with open(filename,"r") as fin:
print(Counter(fin.read().split()))
This question already has answers here:
Iterating through a string word by word
(7 answers)
Closed 4 years ago.
I have made a text string and removed all non alphabetical symbols and added whitespaces in between the words, but when I add them to a dictionary to count the frequency of the words it counts the letters instead. How do I count the words from a dictionary?
dictionary = {}
for item in text_string:
if item in dictionary:
dictionary[item] = dictionary[item]+1
else:
dictionary[item] = 1
print(dictionary)
Change this
for item in text_string:
to this
for item in text_string.split():
Function .split() splits the string to words using whitespace characters (including tabs and newlines) as delimiters.
You are very close. Since you state that your words are already whitespace separated, you need to use str.split to make a list of words.
An example is below:
dictionary = {}
text_string = 'there are repeated words in this sring with many words many are repeated'
for item in text_string.split():
if item in dictionary:
dictionary[item] = dictionary[item]+1
else:
dictionary[item] = 1
print(dictionary)
{'there': 1, 'are': 2, 'repeated': 2, 'words': 2, 'in': 1,
'this': 1, 'sring': 1, 'with': 1, 'many': 2}
Another solution is to use collections.Counter, available in the standard library:
from collections import Counter
text_string = 'there are repeated words in this sring with many words many are repeated'
c = Counter(text_string.split())
print(c)
Counter({'are': 2, 'repeated': 2, 'words': 2, 'many': 2, 'there': 1,
'in': 1, 'this': 1, 'sring': 1, 'with': 1})
My function is supposed to have:
One parameter as a tweet.
This tweet can involve numbers, words, hashtags, links and punctuations.
A second parameter is a dictionary that counts the words in that string with tweets, disregarding the hashtag's, mentions, links, and punctuation included in it.
The function returns all individual words in the dictionary as lowercase letters without any punctuation.
If the tweet had Don't then the dictionary would count it as dont.
Here is my function:
def count_words(tweet, num_words):
''' (str, dict of {str: int}) -> None
Return a NoneType that updates the count of words in the dictionary.
>>> count_words('We have made too much progress', num_words)
>>> num_words
{'we': 1, 'have': 1, 'made': 1, 'too': 1, 'much': 1, 'progress': 1}
>>> count_words("#utmandrew Don't you wish you could vote? #MakeAmericaGreatAgain", num_words)
>>> num_words
{'dont': 1, 'wish': 1, 'you': 2, 'could': 1, 'vote': 1}
>>> count_words('I am fighting for you! #FollowTheMoney', num_words)
>>> num_words
{'i': 1, 'am': 1, 'fighting': 1, 'for': 1, 'you': 1}
>>> count_words('', num_words)
>>> num_words
{'': 0}
'''
I might misunderstand your question, but if you want to update the dictionary you can do it in this manner:
d = {}
def update_dict(tweet):
for i in tweet.split():
if i not in d:
d[i] = 1
else:
d[i] += 1
return d
I'm trying to compute the frequencies of words using a dictionary in a nested lists. Each nested list is a sentence broken up into each word. Also, I want to delete proper nouns and lower case words at the beginning of the sentence. Is it even possible to get ride of proper nouns?
x = [["Hey", "Kyle","are", "you", "doing"],["I", "am", "doing", "fine"]["Kyle", "what", "time" "is", "it"]
from collections import Counter
def computeFrequencies(x):
count = Counter()
for listofWords in L:
for word in L:
count[word] += 1
return count
It is returning an error: unhashable type: 'list'
I want to return exactly this without the Counter() around the dictionary:
{"hey": 1, "how": 1, "are": 1, "you": 1, "doing": 2, "i": , "am": 1, "fine": 1, "what": 1, "time": 1, "is": 1, "it": 1}
Since your data is nested, you can flatten it with chain.from_iterable like this
from itertools import chain
from collections import Counter
print Counter(chain.from_iterable(x))
# Counter({'doing': 2, 'Kyle': 2, 'what': 1, 'timeis': 1, 'am': 1, 'Hey': 1, 'I': 1, 'are': 1, 'it': 1, 'you': 1, 'fine': 1})
If you want to use generator expression, then you can do
from collections import Counter
print Counter(item for items in x for item in items)
If you want to do this without using Counter, then you can use a normal dictionary like this
my_counter = {}
for line in x:
for word in line:
my_counter[word] = my_counter.get(word, 0) + 1
print my_counter
You can also use collections.defaultdict, like this
from collections import defaultdict
my_counter = defaultdict(int)
for line in x:
for word in line:
my_counter[word] += 1
print my_counter
Okay, if you simply want to convert the Counter object to a dict object (which I believe is not necessary at all since Counter is actually a dictionary. You can access key-values, iterate, delete update the Counter object just like a normal dictionary object), you can use bsoist's suggestion,
print dict(Counter(chain.from_iterable(x)))
The problem is that you are iterating over L twice.
Replace the inner loop:
for word in L:
with:
for word in listofWords:
Though, if want to go "pythonic" - check out #thefourtheye's solution.
Hello I want to count single word and double word count from input text in python .
Ex.
"what is your name ? what you want from me ?
You know best way to earn money is Hardwork
what is your aim ?"
output:
sinle W.C. :
what 3
is 3
your 2
you 2
and so on..
Double W.C. :
what is 2
is your 2
your name 1
what you 1
ans so on..
please post the way to do this ?
i use following code for the singl word count :
ws={}
for line in text:
for wrd in line:
if wrd not in ws:
ws[wrd]=1
else:
ws[wrd]+=1
from collections import Counter
s = "..."
words = s.split()
pairs = zip(words, words[1:])
single_words, double_words = Counter(words), Counter(pairs)
Output:
print "sinle W.C."
for word, count in sorted(single_words.items(), key=lambda x: -x[1]):
print word, count
print "double W.C."
for pair, count in sorted(double_words.items(), key=lambda x: -x[1]):
print pair, count
import nltk
from nltk import bigrams
from nltk import trigrams
tokens = nltk.word_tokenize(text)
tokens = [token.lower() for token in tokens if len(token) > 1]
bi_tokens = bigrams(tokens)
print [(item, tokens.count(item)) for item in sorted(set(tokens))]
print [(item, bi_tokens.count(item)) for item in sorted(set(bi_tokens))]
this works. using defaultdict. python 2.6
>>> from collections import defaultdict
>>> d = defaultdict(int)
>>> string = "what is your name ? what you want from me ?\n
You know best way to earn money is Hardwork\n what is your aim ?"
>>> l = string.split()
>>> for i in l:
d[i]+=1
>>> d
defaultdict(<type 'int'>, {'me': 1, 'aim': 1, 'what': 3, 'from': 1, 'name': 1,
'You': 1, 'money': 1, 'is': 3, 'earn': 1, 'best': 1, 'Hardwork': 1, 'to': 1,
'way': 1, 'know': 1, 'want': 1, 'you': 1, 'your': 2, '?': 3})
>>> d2 = defaultdict(int)
>>> for i in zip(l[:-1], l[1:]):
d2[i]+=1
>>> d2
defaultdict(<type 'int'>, {('You', 'know'): 1, ('earn', 'money'): 1,
('is', 'Hardwork'): 1, ('you', 'want'): 1, ('know', 'best'): 1,
('what', 'is'): 2, ('your', 'name'): 1, ('from', 'me'): 1,
('name', '?'): 1, ('?', 'You'): 1, ('?', 'what'): 1, ('to', 'earn'): 1,
('aim', '?'): 1, ('way', 'to'): 1, ('Hardwork', 'what'): 1,
('money', 'is'): 1, ('me', '?'): 1, ('what', 'you'): 1, ('best', 'way'): 1,
('want', 'from'): 1, ('is', 'your'): 2, ('your', 'aim'): 1})
>>>
I realize this question is a few years old. I wrote a little routine today to just count individual words from a word doc (docx). I used docx2txt to get the text from the word document, and used my first regex expression ever to remove every character other than alpha, numeric or spaces, and switched all to uppercase. I put this in because the question isn't answered.
Here is my little test routine in case it may help anyone.
mydoc = 'I:/flashdrive/pmw/pmw_py.docx'
words_all = {}
#####
import docx2txt
my_text = docx2txt.process(mydoc)
print(my_text)
my_text_org = my_text
import re
#added this code for the double words
from collections import Counter
pairs = zip(words, words[1:])
pair_list = Counter(pairs)
print('before pair listing')
for pair, count in sorted(pair_list.items(), key=lambda x: -x[1]):
#print (''.join('{} {}'.format(*pair)), count) #worked
#print(' '.join(pair), '', count) #worked
new_pair = ("{} {}")
my_pair = new_pair.format(pair[0],pair[1])
print ((my_pair), ": ", count)
#end of added code
my_text = re.sub('[\W_]+', ' ', my_text.upper(), flags=re.UNICODE)
print(my_text)
words = my_text.split()
words_org = words #just in case I may need the original version later
for i in words:
if not i in words_all:
words_all[i] = words.count(i)
for k,v in sorted(words_all.items()):
print(k, v)
print("Number of items in word list: {}".format(len(words_all)))