I have a dictionary that's two levels deep. Each key in the outer dictionary is a URL, and its value is another dictionary whose keys are words and whose values are the number of times each word appeared at that URL. It looks something like this:
dic = {
    'http://www.cs.rpi.edu/news/seminars.html': {
        'hyper': 1,
        'summer': 2,
        'expert': 1,
        'koushk': 1,
        'semantic': 1,
        'feedback': 1,
        'sandia': 1,
        'lewis': 1,
        'global': 1,
        'yener': 1,
        'laura': 1,
        'troy': 1,
        'session': 1,
        'greenhouse': 1,
        'human': 1
        ...and so on...
The dictionary itself is very long and has 25 URLs in it. Each URL maps to another dictionary containing every word found at that URL and the number of times it was found.
I want to find the word or words that appear in the most different urls in the dictionary. So the output should look something like this:
The following words appear x times on y pages: list of words
It seems that you should use a Counter for this:
from collections import Counter
print sum((Counter(x) for x in dic.values()),Counter()).most_common()
Or the multiline version:
c = Counter()
for d in dic.values():
    c += Counter(d)
print c.most_common()
To get the words which are common in all of the subdicts:
subdicts = iter(dic.values())
s = set(next(subdicts)).intersection(*subdicts)
Now you can use that set to filter the resulting counter, removing words which don't appear in every subdict:
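# Build the filtered Counter from a dict; passing (k, v) tuples would count the tuples themselves
c = Counter({k: v for k, v in c.items() if k in s})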
print c.most_common()
A Counter isn't quite what you want. From the output you show, it looks like you want to keep track of both the total number of occurrences, and the number of pages the word occurs on.
data = {
    'page1': {
        'word1': 5,
        'word2': 10,
        'word3': 2,
    },
    'page2': {
        'word2': 2,
        'word3': 1,
    }
}
from collections import defaultdict
class Entry(object):
    def __init__(self):
        self.pages = 0
        self.occurrences = 0
    def __iadd__(self, occurrences):
        self.pages += 1
        self.occurrences += occurrences
        return self
    def __str__(self):
        return '{} occurrences on {} pages'.format(self.occurrences, self.pages)
    def __repr__(self):
        return '<Entry {} occurrences, {} pages>'.format(self.occurrences, self.pages)
counts = defaultdict(Entry)
for page_words in data.itervalues():
    for word, count in page_words.iteritems():
        counts[word] += count
for word, entry in counts.iteritems():
    print word, ':', entry
This produces the following output:
word1 : 5 occurrences on 1 pages
word3 : 3 occurrences on 2 pages
word2 : 12 occurrences on 2 pages
That captures the information you want. The next step is to find the most common n words. You could do that with a heap via heapq.nlargest, which has the handy feature of not sorting the whole list of words by number of pages and then occurrences. That may matter if you have a lot of words in total but the n in 'top n' is relatively small.
from heapq import nlargest
def by_pages_then_occurrences(item):
    entry = item[1]
    return entry.pages, entry.occurrences
print nlargest(2, counts.iteritems(), key=by_pages_then_occurrences)
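If only the words found on the most pages are needed, as in the output format the question asks for, a shorter route is to count each word once per page. This is a minimal Python 3 sketch, assuming the same `{url: {word: count}}` layout as `data` above. The names `page_counts`, `total_counts`, `most_pages`, and `top_words` are illustrative and not from the original answers.

```python
from collections import Counter

# Sample data in the same shape as the question: {url: {word: count}}.
data = {
    'page1': {'word1': 5, 'word2': 10, 'word3': 2},
    'page2': {'word2': 2, 'word3': 1},
}

page_counts = Counter()   # word -> number of pages containing it
total_counts = Counter()  # word -> total occurrences across all pages
for page_words in data.values():
    page_counts.update(page_words.keys())  # each word counted once per page
    total_counts.update(page_words)        # a mapping adds its counts

# Every word tied for the highest page count.
most_pages = max(page_counts.values())
top_words = sorted(w for w, n in page_counts.items() if n == most_pages)
print('The following words appear on {} pages: {}'.format(
    most_pages, ', '.join(top_words)))
```

Ties are kept on purpose, since the question asks for "the word or words" with the highest page count.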
Related
I was trying to find the occurrence of every 2 consecutive characters from a string.
The result will be in a dictionary as key = 2 characters and value = number of occurrence.
I tried the following :
seq = "AXXTAGXXXTA"
d = {seq[i:i+2]:seq.count(seq[i:i+2]) for i in range(0, len(seq)-1)}
The problem is that the result for XX should be 3, not 2.
You can use collections.Counter.
from collections import Counter
seq = "AXXTAGXXXTA"
Counter((seq[i:i+2] for i in range(len(seq)-1)))
Output:
Counter({'AX': 1, 'XX': 3, 'XT': 2, 'TA': 2, 'AG': 1, 'GX': 1})
Or without additional libraries. You can use dict.setdefault.
seq = "AXXTAGXXXTA"
d = {}
for i in range(len(seq)-1):
    key = seq[i:i+2]
    d[key] = d.setdefault(key, 0) + 1
print(d)
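The comprehension in the question undercounts because `str.count` only counts non-overlapping matches, so in `XXX` it finds one `XX` rather than two. The sketch below shows the difference, using `zip` as one more way to walk the sliding window. The names `non_overlapping` and `pairs` are illustrative.

```python
from collections import Counter

seq = "AXXTAGXXXTA"

# str.count skips past each match, so overlapping pairs are missed.
non_overlapping = seq.count("XX")

# Pairing each character with its successor visits every window once.
pairs = Counter(a + b for a, b in zip(seq, seq[1:]))

print(non_overlapping)  # 2
print(pairs["XX"])      # 3
```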
I've got two problems with the following code
S = "acbcbba"
def count_letters(text):
    result = {}
    for letter in text:
        if letter.isalpha():
            if letter.lower() in result.keys():
                result[letter.lower()] += 1
            else:
                result[letter.lower()] = 1
        print(result)
    return(result)
count_letters(S)
Firstly, I can't figure out how to modify it so that it only returns one dictionary instead of as many dictionaries as there are letters in the string.
Secondly, I then need to access each key, check whether its value is odd, and return the keys that have odd values.
Does anyone have any ideas on how to do this?
It isn't returning multiple dictionaries. It returns one dictionary and prints the others along the way. Just remove your print statement.
To find the items with an odd count, use a list comprehension over the dictionary's items() and keep those whose value (i.e. count) is odd.
>>> d = count_letters(S)
>>> d
{'a': 2, 'c': 2, 'b': 3}
>>> [key for key, value in d.items() if value % 2 == 1]
['b']
If you want a list of the key-value pairs, you can do something similar:
>>> [(key, value) for key, value in d.items() if value % 2 == 1]
[('b', 3)]
The problem was only the indentation. Here is a corrected version:
S = "acbcbba"
def count_letters(text):
    result = {}
    for letter in text:
        if letter.isalpha():
            if letter.lower() in result.keys():
                result[letter.lower()] += 1
            else:
                result[letter.lower()] = 1
    print(result)
    return(result)
count_letters(S)
Output:
{'a': 2, 'c': 2, 'b': 3}
In any case, there is no reason to return the result if the function already prints it. Alternatively, return the result only and print it afterwards, like this:
S = "acbcbba"
def count_letters(text):
    result = {}
    for letter in text:
        if letter.isalpha():
            if letter.lower() in result.keys():
                result[letter.lower()] += 1
            else:
                result[letter.lower()] = 1
    return(result)
print(count_letters(S))
You can use built-in functions for that. To count a specific character, just do S.count('a'). To get a dictionary with all characters, you could do something like this:
S = "acbcbba"
my_dict = {k:S.count(k) for k in set(S)}
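Calling `S.count` once per distinct character rescans the string each time, and it neither filters non-letters nor folds case as the original function does. A single-pass alternative with `collections.Counter`, which also answers the second part of the question (keys with odd counts), might look like the sketch below. The names `counts` and `odd_letters` are illustrative.

```python
from collections import Counter

S = "acbcbba"

# One pass over the string, keeping letters only and ignoring case.
counts = Counter(ch.lower() for ch in S if ch.isalpha())

# Keys whose count is odd, sorted for a stable result.
odd_letters = sorted(k for k, v in counts.items() if v % 2 == 1)

print(dict(counts))
print(odd_letters)
```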
I'm having trouble turning the words of a string into a dictionary whose values are the number of times each word appears.
For example:
string = 'How many times times appeared in this many times'
The dict I want is:
dict = {'times':3, 'many':2, 'how':1 ...}
Using Counter
from collections import Counter
res = dict(Counter(string.split()))
#{'How': 1, 'many': 2, 'times': 3, 'appeared': 1, 'in': 1, 'this': 1}
You can loop through the words and increment the count like so:
d = {}
for word in string.split(" "):
    d.setdefault(word, 0)
    d[word] += 1
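Both approaches above are case-sensitive, so 'How' keeps its capital, whereas the desired output in the question shows 'how'. If case-insensitive counting is what is wanted (an assumption based on that desired output), lowercasing before splitting is enough. The names `text` and `word_counts` are illustrative and chosen to avoid shadowing the built-ins `str` and `dict`.

```python
from collections import Counter

text = 'How many times times appeared in this many times'

# Lowercase first so 'How' and 'how' share a key; split() with no
# argument also collapses runs of whitespace.
word_counts = dict(Counter(text.lower().split()))
print(word_counts)
```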
I'm new to Python 3.
What I want to do is alternate upper and lower case, but only on the dictionary keys.
My dictionary is created from a list: each key is a word (a list element) and its value is the number of times that element appears in the list.
kb = str(input("Give me a string: "));
txt = kb.lower(); #Turn string into lowercase
cadena = txt.split(); #Turn string into list
dicc = {};
for word in cadena:
    if (word in dicc):
        dicc[word] = dicc[word] + 1
    else:
        dicc[word] = 1
print(dicc)
With this code I can get, for example:
input: "Hi I like PYthon i am UsING python"
{'hi': 1, 'i': 2, 'like': 1, 'python': 2, 'am': 1, 'using': 1}
but what I am actually trying to get is:
{'hi': 1, 'I': 2, 'like': 1, 'PYTHON': 2, 'am': 1, 'USING': 1}
I tried using this:
for n in dicc.keys():
    if (g%2 == 0):
        n.upper()
    else:
        n.lower()
print(dicc)
But it seems that I have no idea of what I'm doing.
Any help would be appreciated.
Using itertools and collections.OrderedDict (to guarantee order in Python < 3.7)
Setup
import itertools
from collections import OrderedDict
s = 'Hi I like PYthon i am UsING python'
switcher = itertools.cycle((str.lower, str.upper))
d = OrderedDict()
final = OrderedDict()
First, create an OrderedDict just to count the occurrences of strings in your list (since, based on your output, you want matches to be case insensitive):
for word in s.lower().split():
    d.setdefault(word, 0)
    d[word] += 1
Next, use itertools.cycle to call str.lower or str.upper on keys and create your final dictionary:
for k, v in d.items():
    final[next(switcher)(k)] = v
print(final)
OrderedDict([('hi', 1), ('I', 2), ('like', 1), ('PYTHON', 2), ('am', 1), ('USING', 1)])
Your n in dicc.keys() line is wrong. You are trying to use n as both the position in the array of keys and the key itself.
Also the semicolons are unnecessary.
This should do what you want:
from collections import OrderedDict
# Receive user input
kb = str(input("Give me a string: "))
txt = kb.lower()
cadena = txt.split()
dicc = OrderedDict()
# Construct the word counter
for word in cadena:
    if word in dicc:
        dicc[word] += 1
    else:
        dicc[word] = 1
If you just want to print the output with alternating case, you can do something like this:
# Print the word counter with alternating case
elems = []
for i, (word, wordcount) in enumerate(dicc.items()):
    if i % 2 == 1:  # upper-case every second word, as in the desired output
        word = word.upper()
    elems.append('{}: {}'.format(word, wordcount))
print('{' + ', '.join(elems) + '}')
Or you can make a new OrderedDict with alternating case...
dicc_alt_case = OrderedDict((word.upper() if (i % 2 == 1) else word, wordcount)
                            for i, (word, wordcount) in enumerate(dicc.items()))
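On Python 3.7 and later, a plain `dict` preserves insertion order, so `OrderedDict` is optional there. The sketch below hard-codes the sample sentence from the question instead of calling `input()`, and the name `alternated` is illustrative.

```python
s = 'Hi I like PYthon i am UsING python'

# Count words case-insensitively; dict keeps first-seen order (3.7+).
counts = {}
for word in s.lower().split():
    counts[word] = counts.get(word, 0) + 1

# Upper-case every second key (index 1, 3, 5, ...), matching the
# output requested in the question.
alternated = {
    (word.upper() if i % 2 == 1 else word): n
    for i, (word, n) in enumerate(counts.items())
}
print(alternated)
```

The keys of an existing dictionary cannot be changed in place, which is why a new dictionary is built rather than calling `upper()` on the old keys.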
I have the following sample data
docs_word = ["this is a test", "this is another test"]
docs_txt = ["this is a great test", "this is another test"]
What I want to do now is create two dictionaries of the words in the sample files, compare them, and store the words that are in the docs_txt file but not in the docs_word file in a separate dictionary. To that end I wrote the following:
count_txtDoc = Counter()
for file in docs_word:
    words = file.split(" ")
    count_txtDoc.update(words)
count_wrdDoc = Counter()
for file in docs_txt:
    words = file.split(" ")
    count_wrdDoc.update(words)
# Create a list of the dictionary keys
words_worddoc = count_wrdDoc.keys()
words_txtdoc = count_txtDoc.keys()
# Look for values that are in word_doc but not in txt_doc
count_all = Counter()
for val in words_worddoc:
    if val not in words_txtdoc:
        count_all.update(val)
        print(val)
The thing now is that the correct values are printed. It shows: "great".
However if I print:
print(count_all)
I get the following output:
Counter({'a': 1, 'r': 1, 'e': 1, 't': 1, 'g': 1})
While I expected
Counter({'great': 1})
Any thoughts on how I can achieve this?
Update the counter using an iterable containing the word, not the word itself (since the word is also iterable):
count_all.update([val])
# ^ ^
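The difference comes from `Counter.update` treating a non-mapping argument as an iterable of items to count. Iterating a string yields its characters, while iterating a one-element list yields the whole word. A small sketch of both calls, with the illustrative names `by_chars` and `by_word`:

```python
from collections import Counter

by_chars = Counter()
by_chars.update('great')    # iterates the string: one count per character

by_word = Counter()
by_word.update(['great'])   # iterates the list: one count for the word

print(by_chars)
print(by_word)
```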
However, you may not need to create a new counter if you only need the item. You can take the symmetric difference of the keys:
words_worddoc = count_wrdDoc.viewkeys() # use .keys() in Py3
words_txtdoc = count_txtDoc.viewkeys() # use .keys() in Py3
print(words_txtdoc ^ words_worddoc)
# set(['great'])
If you want the count also, you can compute the symmetric difference between both counters like so:
count_all = (count_wrdDoc - count_txtDoc) | (count_txtDoc - count_wrdDoc)
print (count_all)
# Counter({'great': 1})
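A symmetric difference reports words missing from either side. The question only asks for words that occur in docs_txt but not in docs_word, so a one-directional filter on the keys may be closer to the intent. This is a Python 3 sketch under that reading. The helper `count_words` and the names `word_counts`, `txt_counts`, and `only_in_txt` are hypothetical and not from the original post.

```python
from collections import Counter

docs_word = ["this is a test", "this is another test"]
docs_txt = ["this is a great test", "this is another test"]

def count_words(docs):
    """Count whitespace-separated words across a list of strings."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

word_counts = count_words(docs_word)
txt_counts = count_words(docs_txt)

# Keep only words whose key is absent from docs_word entirely.
only_in_txt = Counter({w: n for w, n in txt_counts.items()
                       if w not in word_counts})
print(only_in_txt)
```

Plain Counter subtraction (`txt_counts - word_counts`) would also keep words that appear in both sets of documents but more often in docs_txt, which is why the sketch filters on key membership instead.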