The task is to read a file, create a dict and print out the word and its counter value. Below is code that works fine, but I can't seem to get my mind to understand why in the print_words() function, I can't change the sort to:
words = sorted(word_count.values())
and then print the word and its counter, sorted by the counter (number of times that word is in word_count[]).
def word_count_dict(filename):
    word_count = {}
    input_file = open(filename, 'r')
    for line in input_file:
        words = line.split()
        for word in words:
            word = word.lower()
            if not word in word_count:
                word_count[word] = 1
            else:
                word_count[word] = word_count[word] + 1
    input_file.close()
    return word_count
def print_words(filename):
    word_count = word_count_dict(filename)
    words = sorted(word_count.keys())
    for word in words:
        print word, word_count[word]
If you want the output sorted by value (while still including the keys), the simplest approach is to sort the items (key-value pairs), passing a key argument to sorted so that it sorts on the value, then iterate over the result. So for your example, you'd replace:
words = sorted(word_count.keys())
for word in words:
    print word, word_count[word]
with (adding from operator import itemgetter to the top of the module):
# key=itemgetter(1) means the sort key is the second value in each key-value
# tuple, meaning the value
sorted_word_counts = sorted(word_count.items(), key=itemgetter(1))
for word, count in sorted_word_counts:
    print word, count
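To see why sorting word_count.values() alone cannot work: it yields only the numbers, with nothing tying each one back to its word. Below is a minimal Python 3 sketch of the same idea, using a made-up word_count dictionary (an assumption for illustration), that sorts the pairs from most to least frequent and breaks ties alphabetically:

```python
# Hypothetical sample data standing in for the result of word_count_dict().
word_count = {"the": 3, "cat": 1, "sat": 1, "on": 2}

# Sorting only the values loses the words entirely.
values_only = sorted(word_count.values())

# Sorting the (word, count) pairs keeps both. Negating the count puts the
# largest counts first, and the word itself breaks ties alphabetically.
by_count = sorted(word_count.items(), key=lambda pair: (-pair[1], pair[0]))

for word, count in by_count:
    print(word, count)
```

The lambda plays the same role as itemgetter(1) above; it is used here only because the sort key combines two fields.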
The first thing to note is that dictionaries were not traditionally ordered. Since Python 3.7 they preserve insertion order, but they are still not sorted by value. Therefore, it is good practice to convert your dict to a list of tuples ordered in some way.
The function below converts a dictionary to a list of tuples ordered by value.
d = {'a': 5, 'b': 1, 'c': 7, 'd': 3}
def order_by_values(dct):
    rev = sorted((v, k) for k, v in dct.items())
    return [t[::-1] for t in rev]
order_by_values(d) # [('b', 1), ('d', 3), ('a', 5), ('c', 7)]
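As an alternative sketch (not the method used in the answer above), sorted can take the dictionary's own get method as its key, which avoids flipping the tuples. On Python 3.7 and later, the sorted pairs can also be turned back into a dictionary that keeps that order. Note that ties are then left in their original order rather than broken by key:

```python
d = {'a': 5, 'b': 1, 'c': 7, 'd': 3}

# Keys ordered by their values.
keys_by_value = sorted(d, key=d.get)

# (key, value) pairs ordered by value.
pairs_by_value = [(k, d[k]) for k in keys_by_value]

# On Python 3.7+, a dict built from the sorted pairs keeps that order.
ordered = dict(pairs_by_value)

print(pairs_by_value)
print(list(ordered))
```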
"tree" to "eert", as 'e' occurred twice, and between 'r' and 't', 'r' has the higher index, so it comes first.
I was able to get the number of occurrences of each character and group together the characters with the same count:
def stringFunction(mystr):
    mydict = {}
    for i in range(len(mystr)):
        if mystr[i] in mydict:
            mydict[mystr[i]] = mydict.get(mystr[i]) + 1
        else:
            mydict[mystr[i]] = 1
    print(mydict)
    print(set(sorted(mydict.values())))
    final_list = []
    for each in set(sorted(mydict.values())):
        print(each)
        listOfKeys = []
        listOfItems = mydict.items()
        for item in listOfItems:
            if item[1] == each:
                listOfKeys.append(item[0])
        print(listOfKeys)
The output of the above code was:
{'r': 1, 'e': 2, 't': 1}
set([1, 2])
1
['r', 't']
2
['e']
Expected result = "eert"
Store each character, its frequency, and its index in a list. Then sort the list first by frequency and then by index (the solution below sorts in reverse, so the most frequent character comes first). Once you have the sorted result, read the characters from the first element to the last; if you had not reversed the sort, you would read them from the last element to the first.
def func(st):
    # store each character, its count, and its first index in a tuple,
    # then collect all the tuples in a list
    l = [(i, st.count(i), st.index(i)) for i in set(st)]
    # sort the list in descending order by frequency,
    # then by the index of the character
    l.sort(key=lambda x: [x[1], x[2]], reverse=True)
    # finally repeat each character by its count (if 'p' occurs 2 times
    # it becomes 'pp') and join everything into the final string
    return ''.join([i[0] * i[1] for i in l])
print(func("apple")) # ppela
print(func("deer")) # eerd
print(func("tree")) # eert
This may work:
from collections import Counter
def stringFunction(mystr):
    # most_common() yields (character, count) pairs, highest count first
    return "".join(char * n for char, n in Counter(reversed(mystr)).most_common())
print(stringFunction("apple")) # ppela
print(stringFunction("deer")) # eerd
print(stringFunction("tree")) # eert
Here collections.Counter counts the number of occurrences of each letter, and reversing the string takes care of the order of letters that occur the same number of times.
If you really want to avoid imports you could do this (this only produces the correct order on Python 3.7 or later, where dictionaries are guaranteed to keep insertion order; CPython 3.6 already behaves this way as an implementation detail):
def stringFunction(mystr):
    counter = {}
    for char in reversed(mystr):
        counter[char] = counter.get(char, 0) + 1
    ret = "".join(
        n * char
        for char, n in sorted(counter.items(), key=lambda x: x[1], reverse=True)
    )
    return ret
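Both versions above rely on dictionary insertion order to break ties. As a hedged alternative sketch, the tie-break can be made explicit in the sort key so the result does not depend on dictionary ordering at all. The name sort_by_frequency and the first_index helper dictionary are illustrative assumptions, not part of the answers above:

```python
def sort_by_frequency(text):
    counts = {}
    first_index = {}
    for index, char in enumerate(text):
        counts[char] = counts.get(char, 0) + 1
        # Remember where each character first appears.
        first_index.setdefault(char, index)
    # Highest count first; among equal counts, the later first index wins.
    ordered = sorted(counts, key=lambda c: (-counts[c], -first_index[c]))
    return "".join(c * counts[c] for c in ordered)

print(sort_by_frequency("apple"))
print(sort_by_frequency("deer"))
print(sort_by_frequency("tree"))
```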
I wrote this function:
def make_upper(words):
    for word in words:
        ind = words.index(word)
        words[ind] = word.upper()
I also wrote a function that counts the frequency of occurrences of each letter:
def letter_cnt(word,freq):
    for let in word:
        if let == 'A': freq[0]+=1
        elif let == 'B': freq[1]+=1
        elif let == 'C': freq[2]+=1
        elif let == 'D': freq[3]+=1
        elif let == 'E': freq[4]+=1
Counting letter frequency would be much more efficient with a dictionary, yes. Note that you are manually lining up each letter with a number ("A" with 0, et cetera). Wouldn't it be easier if we could have a data type that directly associated a letter with the number of times it occurs, without adding an extra set of numbers in between?
Consider the code:
freq = {"A":0, "B":0, "C":0, "D":0, ... ..., "Z":0}
for letter in text:
    freq[letter] += 1
This dictionary is used to count frequencies much more efficiently than your current code does. You just add one to an entry for a given letter each time you see it.
I will also mention that you can count frequencies effectively with certain libraries. If you are interested in analyzing frequencies, look into collections.Counter() and possibly the collections.Counter.most_common() method.
Whether or not you decide to just use collections.Counter(), I would attempt to learn why dictionaries are useful in this context.
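For reference, a minimal sketch of what the Counter route might look like, using a made-up sample string. most_common(n) returns the n highest counts as (letter, count) pairs:

```python
from collections import Counter

text = "HELLO WORLD"  # made-up sample input

# Count only alphabetic characters so the space is ignored.
letter_counts = Counter(char for char in text if char.isalpha())

# most_common(2) returns the two highest counts as (letter, count) pairs.
top_two = letter_counts.most_common(2)
print(top_two)
```

A letter that never appears simply has a count of zero, so no entries need to be set up in advance.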
One final note: I personally found typing out the values for the "freq" dictionary to be tedious. If you want you could construct an empty dictionary of alphabet letters on-the-fly with this code:
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
freq = {letter:0 for letter in alphabet}
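Putting the pieces together, here is a small sketch under a couple of assumptions: the function name count_letters is made up for illustration, and string.ascii_uppercase from the standard library replaces the hand-typed alphabet. Because freq[letter] += 1 raises a KeyError for any character that is not already a key (spaces, punctuation, digits), the sketch skips those characters:

```python
import string

def count_letters(text):
    # Start every uppercase ASCII letter at zero.
    freq = {letter: 0 for letter in string.ascii_uppercase}
    for char in text.upper():
        # Ignore anything that is not one of the 26 letters.
        if char in freq:
            freq[char] += 1
    return freq

freq = count_letters("Hello, World!")
print(freq["L"], freq["O"], freq["Z"])
```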
If you want to convert strings in the list to upper case using lambda, you may use it with map() as:
>>> words = ["Hello", "World"]
>>> map(lambda word: word.upper(), words) # In Python 2
['HELLO', 'WORLD']
# In Python 3, use it as: list(map(...))
As per the map() documentation (Python 2):
map(function, iterable, ...)
Apply function to every item of iterable and return a list of the results.
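A short sketch of the Python 3 behaviour mentioned in the comment above: there, map() returns a lazy iterator rather than a list, so it has to be wrapped in list() to see the results. A list comprehension is shown alongside for comparison:

```python
words = ["Hello", "World"]

# In Python 3, map() returns a lazy iterator, not a list.
lazy = map(str.upper, words)

# Wrap it in list() to get the results all at once.
upper_from_map = list(lazy)

# A list comprehension gives the same result without map() or lambda.
upper_from_comprehension = [word.upper() for word in words]

print(upper_from_map)
```

Either form builds a new list; neither changes words in place.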
To find the frequency of each character in a word, you may use collections.Counter() (a dict subclass) as:
>>> from collections import Counter
>>> my_word = "hello world"
>>> c = Counter(my_word)
# where c holds dictionary as:
# {'l': 3,
# 'o': 2,
# ' ': 1,
# 'e': 1,
# 'd': 1,
# 'h': 1,
# 'r': 1,
# 'w': 1}
As per the Counter documentation:
A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values.
For the letter counting, don't reinvent the wheel: use collections.Counter.
A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value including zero or negative counts. The Counter class is similar to bags or multisets in other languages.
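The part of the quote about zero and negative counts can be illustrated with a small sketch. The sample strings are made up; subtract() and unary plus are standard Counter operations:

```python
from collections import Counter

stock = Counter("hello")      # h:1, e:1, l:2, o:1
stock.subtract("help")        # subtract() keeps zero and negative counts

print(stock["p"])             # a letter that was never in "hello"
print(stock["h"])

# Unary plus drops every entry whose count is zero or negative.
positive_only = +stock
print(positive_only)
```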
def punc_remove(words):
    for word in words:
        if word.isalnum() == False:
            charl = []
            for char in word:
                if char.isalnum()==True:
                    charl.append(char)
            ind = words.index(word)
            delimeter = ""
            words[ind] = delimeter.join(charl)
def letter_cnt_dic(word,freq_d):
    for let in word:
        freq_d[let] += 1
import string
def letter_freq(fname):
    fhand = open(fname)
    freqs = dict()
    alpha = list(string.uppercase[:26])
    for let in alpha: freqs[let] = freqs.get(let,0)
    for line in fhand:
        line = line.rstrip()
        words = line.split()
        punc_remove(words)
        #map(lambda word: word.upper(),words)
        words = [word.upper() for word in words]
        for word in words:
            letter_cnt_dic(word,freqs)
    fhand.close()
    return freqs.values()
You can read the docs about the Counter and the List Comprehensions or run this as a small demo:
from collections import Counter
words = ["acdefg","abcdefg","abcdfg"]
#list comprehension no need for lambda or map
new_words = [word.upper() for word in words]
print(new_words)
# Let's create a dict and a Counter
letters = {}
letters_counter = Counter()
for word in words:
    # Adding two Counters sums their per-letter counts.
    letters_counter += Counter(word)
    # We can do the same with a plain dict
    for letter in word:
        letters[letter] = letters.get(letter,0) + 1
print(letters_counter)
print(letters)
So I have the code below to count the number of words in a text file. I'd like to sort the output of this by words that appeared the greatest number of times to words that appeared the least number of times. How can this be accomplished?
ally = open("alice.txt", "r")
wordcount={}
for word in ally.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
for k, v in wordcount.items():
    print(k,v)
Simply use Counter. It will shorten your code, and its most_common() method gives you the ordering that you want.
Quoting from the documentation:
A Counter is a dict subclass for counting hashable objects. It is an
unordered collection where elements are stored as dictionary keys and
their counts are stored as dictionary values. Counts are allowed to be
any integer value including zero or negative counts. The Counter class
is similar to bags or multisets in other languages.
>>> c = Counter(['eggs', 'ham'])
>>> c['bacon'] # count of a missing element is zero
0
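Applied to the word-count problem, a minimal sketch might look like this. The text variable is a made-up stand-in for the contents of alice.txt:

```python
from collections import Counter

# Hypothetical stand-in for the contents of alice.txt.
text = "the cat and the hat and the bat"

wordcount = Counter(text.split())

# most_common() with no argument returns every (word, count) pair,
# ordered from the highest count to the lowest.
for word, count in wordcount.most_common():
    print(word, count)
```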
You can view the sorted dictionary using operator.itemgetter():
from operator import itemgetter
wordcount = {'test': 1, 'hello': 3, 'test2':0}
sortedWords = sorted(wordcount.items(), key=itemgetter(1), reverse = True)
Output:
>>> sortedWords
[('hello', 3), ('test', 1), ('test2', 0)]
This should do it for you:
ally = open("alice.txt", "r")
wordcount={}
for word in ally.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
for k, v in sorted(wordcount.items(), key=lambda words: words[1], reverse = True):
    print(k,v)
I'm trying to make a list that contains the most frequent tuple of a dictionary according to the first element. For example:
If d is my dictionary:
d = {('Hello', 'my'): 1, ('Hello', 'world'): 2, ('my', 'name'): 3, ('my', 'house'): 1}
I want to obtain a list like this:
L = [('Hello', 'world'), ('my', 'name')]
So I try this:
L = [k for k,val in d.iteritems() if val == max(d.values())]
But that only gives me the max of all the tuples:
L = [('my', 'name')]
I was thinking that maybe I have to go through my dictionary and make a new one for every first word of each tuple, then find the most frequent and put it in a list, but I'm having trouble translating that into code.
from itertools import groupby
# your input data
d = {('Hello', 'my'): 1,('Hello', 'world'):2, ('my', 'name'):3, ('my','house'):1}
key_fu = lambda x: x[0][0] # first element of first element,
# i.e. of ((a,b), c), return a
groups = groupby(sorted(d.iteritems(), key=key_fu), key_fu)
l = [max(g, key=lambda x:x[1])[0] for _, g in groups]
This is achievable in O(n) if you just re-key the mapping off the first word:
>>> d = {('Hello','my'): 1, ('Hello','world'): 2, ('my','name'): 3, ('my','house'): 1}
>>> d_max = {}
>>> for (first, second), count in d.items():
...     if count >= d_max.get(first, (None, 0))[1]:
...         d_max[first] = (second, count)
...
>>> d_max
{'Hello': ('world', 2), 'my': ('name', 3)}
>>> output = [(first, second) for (first, (second, count)) in d_max.items()]
>>> output
[('my', 'name'), ('Hello', 'world')]
In my opinion you should not just take the max over all the values of d, because that only gives you the largest value contained in your dictionary, which is 3 in this case.
What I would do is create an intermediate list (maybe this can be hidden) that stores the counter as the first element and the key as the second element. That way, you can take the first element of the sorted list to get the real max key for each first word.
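One way to read this suggestion, sketched in Python 3 with the dictionary from the question (the variable names are illustrative and not from the answer): collect a list of (count, key) entries for each first word, sort each list in descending order, and take the key of its first element.

```python
d = {('Hello', 'my'): 1, ('Hello', 'world'): 2,
     ('my', 'name'): 3, ('my', 'house'): 1}

# Intermediate structure: first word -> list of (count, full key) entries.
by_first_word = {}
for key, count in d.items():
    by_first_word.setdefault(key[0], []).append((count, key))

# Sort each list with the highest count first and keep the key of its head.
L = [sorted(entries, reverse=True)[0][1] for entries in by_first_word.values()]
print(L)
```

Sorting each small list is not required; max(entries) would pick the same entry, but sorting follows the wording of the answer more closely.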
You have pairs of words and a count associated to each of them. You could store your information in (or convert it to) 3-tuples:
d = [
    ('Hello', 'my', 1),
    ('Hello', 'world', 2),
    ('my', 'name', 3),
    ('my', 'house', 1)
]
For each word in the first position, you want to find the word in the second position that occurs most frequently. Sort the data according to the first word (any order, just to group them), then according to the count (descending).
d.sort(key=lambda t: (t[0], -t[2]))
Finally, iterate through the resulting array, keeping track of the last word encountered, and append only when encountering a new word in 1st position.
L = []
last_word = ""
for word1, word2, count in d:
    if word1 != last_word:
        L.append((word1,word2))
        last_word = word1
print L
By running this code, you obtain [('Hello', 'world'), ('my', 'name')].
I have a list of tokenized text sentences (youtube comments):
sample_tok = [['How', 'does', 'it', 'call', 'them', '?', '\xef\xbb\xbf'],
['Thats', 'smart\xef\xbb\xbf'],
... # and sooo on.....
['1:45', ':', 'O', '\xef\xbb\xbf']]
Now I want to make a dictionary with the words and the amount of times they are mentioned.
from collections import Counter
d = Counter()
for sent in [sample_tok]:
    for words in sent:
        d = Counter(words)
Unfortunately, this just counts the last sublist...
[(':', 1), ('1:45', 1), ('\xef\xbb\xbf', 1), ('O', 1)]
Is there a way to make it count all the tokenized sentences?
You are replacing your counter, not updating it. Each time in the loop you produce a new Counter() instance, discarding the previous copy.
Pass each word in a nested generator expression to your Counter():
d = Counter(word for sublist in sample_tok for word in sublist)
or, if you need to somehow process each sublist first, use Counter.update():
d = Counter()
for sent in [sample_tok]:
    for words in sent:
        d.update(words)
You can use the update method of Counter instances. This counts the passed values and adds them to the counter.
d = Counter()
for sent in [sample_tok]:
    for words in sent:
        d.update(words)
Or you can add the new counter to the old one:
d = Counter()
for sent in [sample_tok]:
    for words in sent:
        d += Counter(words)
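Another option, offered here as a sketch rather than as part of the answers above, is to flatten the nested list with itertools.chain.from_iterable and hand the result to Counter in one call. The sample_tok below is a small made-up list in the same shape as the one in the question:

```python
from collections import Counter
from itertools import chain

# Small made-up sample in the same shape as sample_tok.
sample_tok = [['How', 'does', 'it', 'work', '?'],
              ['Does', 'it', 'work', '?']]

# chain.from_iterable() walks every sublist in turn, so Counter sees
# one continuous stream of tokens.
d = Counter(chain.from_iterable(sample_tok))
print(d.most_common(3))
```

Note that counting is case-sensitive, so 'does' and 'Does' are tallied separately here.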