I'm having a little problem with an exercise I have to do:
Basically, the assignment is to open a URL, convert the page into a given encoding, and count the number of occurrences of given strings in the text.
import urllib2 as ul
def word_counting(url, code, words):
    page = ul.urlopen(url)
    text = page.read()
    decoded = ext.decode(code)
    result = {}
    for word in words:
        count = decoded.count(word)
        counted = str(word) + ":" + " " + str(count)
        result.append(counted)
    return finale
The result I should get looks like "word1: x, word2: y, word3: z", with x, y, z being the number of occurrences. But it seems that I only get ONE number: when I run the test program I get only something like 9 for the first list, 14 for the second list, and 5 for the third, with the other occurrences and the word labels missing.
What am I doing wrong? Thanks in advance.
You're not appending to the dictionary correctly.
The correct way is result[key] = value.
So for your loop it would be
for word in words:
    count = decoded.count(word)
    result[word] = str(count)
An example without decode but using .count():
words = ['apple', 'apple', 'pear', 'banana']
result = {}
for word in words:
    count = words.count(word)
    result[word] = count
>>> result
{'pear': 1, 'apple': 2, 'banana': 1}
Or you can use collections.Counter:
>>> from collections import Counter
>>> words = ['apple', 'apple', 'pear', 'banana']
>>> Counter(words)
Counter({'apple': 2, 'pear': 1, 'banana': 1})
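Putting the fix together, here is a minimal sketch of the whole function for Python 3. It assumes urllib.request in place of urllib2 (which only exists in Python 2). The helper names count_words and format_counts are made up for illustration, and the counting is kept separate from the download so it can be tried without a network connection.

```python
from urllib.request import urlopen


def count_words(text, words):
    """Map each search string to its number of (substring) occurrences in text."""
    return {word: text.count(word) for word in words}


def format_counts(counts):
    """Render the counts as 'word1: x, word2: y, ...'."""
    return ", ".join("{}: {}".format(word, n) for word, n in counts.items())


def word_counting(url, code, words):
    """Download url, decode the bytes with the encoding `code`, count the words."""
    with urlopen(url) as page:
        decoded = page.read().decode(code)
    return format_counts(count_words(decoded, words))


# Offline check of the counting and formatting steps (no URL needed).
print(format_counts(count_words("spam eggs spam", ["spam", "eggs", "ham"])))
# spam: 2, eggs: 1, ham: 0
```

The key point is the same as in the answer: entries are added with result[key] = value (here via a dict comprehension), never with append.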
Don't forget about list and dictionary comprehensions. They can be quite efficient on larger data sets (for example, if you are analysing a large web page, as in your case). And even if your data set is small, one could argue that the dict comprehension syntax is cleaner and more Pythonic.
So in this case I would use something like:
result = {word : decoded.count(word) for word in words}
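One caveat worth keeping in mind: str.count counts substring matches, not whole words. The sketch below contrasts it with a word-boundary regex. count_whole_words is a hypothetical helper and the sample sentence is invented; neither comes from the answer above.

```python
import re


def count_whole_words(text, words):
    """Count each word only where it appears as a whole word."""
    return {
        word: len(re.findall(r"\b" + re.escape(word) + r"\b", text))
        for word in words
    }


decoded = "the cat sat on the mat, then the theatre closed"
words = ["the", "cat"]

# Substring counting also hits "then" and "theatre".
print({word: decoded.count(word) for word in words})
# {'the': 5, 'cat': 1}

print(count_whole_words(decoded, words))
# {'the': 3, 'cat': 1}
```

Which of the two is wanted depends on how the exercise defines an "occurrence".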
I'm learning Python by myself, and I'm starting to refactor Python code to learn new and efficient ways to code.
I tried to write a dictionary comprehension for word_dict, but I can't find a way to do it. I had two problems with it:
I tried to do word_dict[word] += 1 inside the dictionary comprehension using word_dict[word]:=word_dict[word]+1
I wanted to check whether the element was already in the dictionary being created, using if word not in word_dict, and it didn't work.
The dictionary comprehension is:
word_dict = {word_dict[word] := 0 if word not in word_dict else word_dict[word] := word_dict[word] + 1 for word in text_split}
Here is the code. It reads a text and counts the different words in it. If you know a better way to do it, just let me know.
import re

text = "hello Hello, water! WATER:HELLO. water , HELLO"
# clean the text
text_cleaned = re.sub(r':|!|,|\.', " ", text)
# Output 'hello Hello water WATER HELLO water HELLO'
# creates list without spaces elements
text_split = [element for element in text_cleaned.split(' ') if element != '']
# Output ['hello', 'Hello', 'water', 'WATER', 'HELLO', 'water', 'HELLO']
word_dict = {}
for word in text_split:
    if word not in word_dict:
        word_dict[word] = 0
    word_dict[word] += 1
word_dict
# Output {'hello': 1, 'Hello': 1, 'water': 2, 'WATER': 1, 'HELLO': 2}
Right now you're using a regex to remove some undesirable characters, and then you split on whitespace to get a list of words. Why not use a regex to get the words right away? You can also take advantage of collections.Counter to create a dictionary, where the keys are words, and the associated values are counts/occurrences:
import re
from collections import Counter
text = "hello Hello, water! WATER:HELLO. water , HELLO"
pattern = r"\b\w+\b"
print(Counter(re.findall(pattern, text)))
Output:
Counter({'water': 2, 'HELLO': 2, 'hello': 1, 'Hello': 1, 'WATER': 1})
Here's what the regex pattern is composed of:
\b - represents a word boundary (will not be included in the match)
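\w+ - one or more word characters, i.e. [a-zA-Z0-9_] (for str patterns in Python 3, \w also matches Unicode letters and digits unless the re.ASCII flag is set).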
\b - another word boundary (will also not be included in the match)
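To see what the pattern actually produces, here is a small sketch on the same sample text. The case-insensitive variant at the end is an assumption about what might be wanted; the question itself treats 'hello', 'Hello' and 'HELLO' as different words.

```python
import re
from collections import Counter

text = "hello Hello, water! WATER:HELLO. water , HELLO"

# findall returns the matched words only; punctuation and spaces are skipped.
words = re.findall(r"\b\w+\b", text)
print(words)
# ['hello', 'Hello', 'water', 'WATER', 'HELLO', 'water', 'HELLO']

# Optional: fold case before counting so all spellings share one key.
folded = Counter(word.lower() for word in words)
print(folded)
# Counter({'hello': 4, 'water': 3})
```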
Welcome to Python. There is the collections library (https://docs.python.org/3/library/collections.html), which has a class called Counter. It seems very likely that it would fit your code. Would that work for you?
from collections import Counter
...
word_dict = Counter(text_split)
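As for why the attempted comprehension cannot work: the name word_dict is only bound after the whole comprehension has been evaluated, so the comprehension cannot look at its own partial result, and := only accepts a plain name as its target, not word_dict[word]. A minimal sketch of a comprehension that does work, assuming the text_split list from the question, computes each count independently:

```python
from collections import Counter

text_split = ['hello', 'Hello', 'water', 'WATER', 'HELLO', 'water', 'HELLO']

# Each value is computed on its own; set() avoids recounting repeated words.
word_dict = {word: text_split.count(word) for word in set(text_split)}

# Same counts as Counter, which needs only one pass over the list.
print(word_dict == dict(Counter(text_split)))  # True
```

Note that list.count rescans the list for every distinct word, so for long texts Counter (or the explicit loop from the question) is the cheaper option.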
How do I convert a string to a dictionary, where each entry to the dictionary is assigned a value?
At the minute, I have this code:
text = "Here's the thing. She doesn't have anything to prove, but she is going to anyway. That's just her character. She knows she doesn't have to, but she still will just to show you that she can. Doubt her more and she'll prove she can again. We all already know this and you will too."
d = {}
lst = []
def findDupeWords():
    string = text.lower()
    #Split the string into words using built-in function
    words = string.split(" ")
    print("Duplicate words in a given string : ")
    for i in range(0, len(words)):
        count = 1
        for j in range(i+1, len(words)):
            if(words[i] == (words[j])):
                count = count + 1
                #Set words[j] to 0 to avoid printing visited word
                words[j] = "0"
        #Displays the duplicate word if count is greater than 1
        if(count > 1 and words[i] != "0"):
            print(words[i])
for key in d:
    text = text.replace(key,d[key])
print(text)
findDupeWords()
The output I get when I run this is:
Here's the thing. She doesn't have anything to prove, but she is going to anyway. That's just her character. She knows she doesn't have to, but she still will just to show you that she can. Doubt her more and she'll prove she can again. We all already know this and you will too.
Duplicate words in a given string :
she
doesn't
have
to
but
just
her
will
you
and
How can I turn this list of words into a dictionary, like the following:
{'she': 1, 'doesn't': 2, 'have': 3, 'to': 4} , etc...
Don't reinvent the wheel: use an instance of collections.Counter from the standard library.
from collections import Counter
def findDupeWords(text):
    counter = Counter(text.lower().split(" "))
    for word in counter:
        if counter[word] > 1:
            print(word)
text = "Here's the thing. She doesn't have anything to prove, but she is going to anyway. That's just her character. She knows she doesn't have to, but she still will just to show you that she can. Doubt her more and she'll prove she can again. We all already know this and you will too."
findDupeWords(text)
Well, you could replace your call to print with an assignment to a dictionary:
def findDupeWords():
    duplicates = {}
    duplicate_counter = 0
    ...
        #Displays the duplicate word if count is greater than 1
        if(count > 1 and words[i] != "0"):
            duplicate_counter += 1
            duplicates[words[i]] = duplicate_counter
But there are easier ways to achieve this, for example with collections.Counter:
from collections import Counter
words = text.lower().split()
word_occurrences = Counter(words)
dupes_in_order = sorted(
    (word for word in set(words) if word_occurrences[word] > 1),
    key=lambda w: words.index(w),
)
dupe_dictionary = {word: i+1 for i, word in enumerate(dupes_in_order)}
Afterwards:
>>> dupe_dictionary
{'she': 1,
"doesn't": 2,
'have': 3,
'to': 4,
'but': 5,
'just': 6,
'her': 7,
'will': 8,
'you': 9,
'and': 10}
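The sorted(..., key=words.index) step rescans the word list for every duplicate. As a possible alternative, here is a sketch that gets the same numbering in a single ordered pass. It relies on dicts keeping insertion order (Python 3.7+), and number_duplicates is a hypothetical name, as is the short sample sentence.

```python
from collections import Counter


def number_duplicates(text):
    """Number the repeated words 1, 2, 3, ... in order of first appearance."""
    words = text.lower().split()
    counts = Counter(words)
    # dict.fromkeys keeps first-seen order and drops later repeats.
    dupes = [word for word in dict.fromkeys(words) if counts[word] > 1]
    return {word: i for i, word in enumerate(dupes, start=1)}


print(number_duplicates("to be or not to be that is not all"))
# {'to': 1, 'be': 2, 'not': 3}
```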
Seeking help on Homework
I am given a list and asked to find the most frequently occurring value in it and return the number of times it occurs. The assignment is fairly big and I have managed to get through the other parts by myself, but this one stumped me. I should add that this is for an assignment; any guidance would be appreciated.
Question Statement : Maximum (word) Frequency
For example in a book with the following words ['big', 'big', 'bat', 'bob', 'book'] the maximum frequency is 2, i.e., big is the most frequently occurring word, therefore 2 is the maximum frequency.
def maximum_frequency(new_list):
    word_counter = {}
    for word in new_list:
        if word in word_counter:
            word_counter[word] += 1
        else:
            word_counter[word] = 1
I have gotten this far, but I am not sure if it's right or where to go from here.
Try this:
from collections import Counter
c = Counter(['big', 'big', 'bat', 'bob', 'book'])
max(c.items(), key=lambda x:x[1])
The max call returns the item with the highest count. You can unpack it:
key,rate = max(c.items(), key=lambda x:x[1])
The key will be big and the rate will be 2.
Also, you can access all of the counts through c.items(); as a dictionary (dict(c)) they are
{'big': 2, 'bat': 1, 'bob': 1, 'book': 1}
Edit:
As schwobaseggl said, the best practice for finding the most frequent item in a Counter is to use most_common.
c.most_common(1)[0]
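Wrapped up as the function the assignment asks for, this could look like the sketch below. Returning 0 for an empty list is an assumption; the question does not say what should happen in that case.

```python
from collections import Counter


def maximum_frequency(new_list):
    """Return the highest count of any single item, or 0 for an empty list."""
    if not new_list:
        return 0  # assumption: an empty book has maximum frequency 0
    # most_common(1) is a one-element list holding the (item, count) pair.
    word, count = Counter(new_list).most_common(1)[0]
    return count


print(maximum_frequency(['big', 'big', 'bat', 'bob', 'book']))  # 2
```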
You just need to count the occurrence of all the unique elements and compare the frequency with the previously computed frequency.
sample is a list of words.
def maxfreq(sample):
    m=0
    frequency=0
    word=''
    set_sample=list(set(sample))
    for i in range(len(set_sample)):
        c=sample.count(set_sample[i])
        if c>m:
            m=c
            frequency=m
            word=set_sample[i]
    return (frequency,word)
Since it sounds like this is some kind of challenge and/or homework you're supposed to be working on, instead of directly providing a code sample let me give you some concepts.
First off, the best way to know whether you've seen a word or not is to use a map. In Python the term is "dict" and the syntax is simply {}; you can store values like this: my_dict['value'] = True, or whatever key/value you need.
So if you're going to read your words one by one and store them in this dict, then what should the value be? You want to know the maximum frequency, right? So let's use the frequency as our value. When we add a word for the first time, we should set its initial value to 1 (we've seen it once). And if we see a word a second time, we increment its frequency.
Now that you have a dict full of words and their frequencies, perhaps you might be able to figure out how to find the one with the largest frequency?
So that being said, things you should look into are:
How to determine if a key exists in a dict
How to modify the value of a key in a dict
How to iterate a dict's key/value pairs
After that, your answer should be pretty easy to figure out.
try this :
>>> MyList = ["above", "big", "above", "cat", "cat", "above", "cat"]
>>> my_dict = {i:MyList.count(i) for i in MyList}
>>> my_dict
{'above': 3, 'big': 1, 'cat': 3}
It can also be accomplished using collections.Counter, which is available in Python 2.7 and 3.x:
>>> from collections import Counter
>>> MyList = ['big', 'big', 'bat', 'bob', 'book']
>>> dict(Counter(MyList))
{'big': 2, 'bat': 1, 'bob': 1, 'book': 1}
If you are open to Pandas then it can be done as follows:
>>> import pandas as pd
>>> pd.Series(MyList).value_counts()
big 2
book 1
bob 1
bat 1
dtype: int64
# Answer to the OP's follow-up question in the comments: what if I wanted just the maximum value instead of the word?
>>> pd.Series(MyList).value_counts().max()
2
How about this:
def maximum_frequency(new_list):
    word_counter = {}
    for word in new_list:
        if word in word_counter:
            word_counter[word] += 1
        else:
            word_counter[word] = 1
    max_freq = max(word_counter.items(), key=(lambda x: x[1]))
    return max_freq
if __name__ == '__main__':
    test_data = ['big', 'big', 'bat', 'bob', 'book']
    print(maximum_frequency(test_data))
Output:
('big', 2)
Works fine with Python 2 and 3 and returns result as a tuple of most frequent word and occurrences count.
EDIT:
If you don't care at all which word has the highest count and you want only the frequency number you can simplify it a bit to:
def maximum_frequency(new_list):
    word_counter = {}
    for word in new_list:
        if word in word_counter:
            word_counter[word] += 1
        else:
            word_counter[word] = 1
    return max(word_counter.values())
if __name__ == '__main__':
    test_data = ['big', 'big', 'bat', 'bob', 'book']
    print(maximum_frequency(test_data))
I wrote this function:
def make_upper(words):
    for word in words:
        ind = words.index(word)
        words[ind] = word.upper()
I also wrote a function that counts the frequency of occurrences of each letter:
def letter_cnt(word,freq):
    for let in word:
        if let == 'A': freq[0]+=1
        elif let == 'B': freq[1]+=1
        elif let == 'C': freq[2]+=1
        elif let == 'D': freq[3]+=1
        elif let == 'E': freq[4]+=1
Counting letter frequency would be much more efficient with a dictionary, yes. Note that you are manually lining up each letter with a number ("A" with 0, et cetera). Wouldn't it be easier if we could have a data type that directly associated a letter with the number of times it occurs, without adding an extra set of numbers in between?
Consider the code:
freq = {"A":0, "B":0, "C":0, "D":0, ... ..., "Z":0}
for letter in text:
    freq[letter] += 1
This dictionary is used to count frequencies much more efficiently than your current code does. You just add one to an entry for a given letter each time you see it.
I will also mention that you can count frequencies effectively with certain libraries. If you are interested in analyzing frequencies, look into collections.Counter() and possibly the collections.Counter.most_common() method.
Whether or not you decide to just use collections.Counter(), I would attempt to learn why dictionaries are useful in this context.
One final note: I personally found typing out the values for the "freq" dictionary to be tedious. If you want you could construct an empty dictionary of alphabet letters on-the-fly with this code:
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
freq = {letter:0 for letter in alphabet}
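One thing to watch with a pre-filled dictionary: freq[letter] += 1 raises a KeyError for any character that is not a key, such as a space, a digit, or a lowercase letter. A minimal sketch that upper-cases the text and skips everything else (letter_frequencies is a hypothetical name, and string.ascii_uppercase stands in for the hand-typed alphabet):

```python
import string


def letter_frequencies(text):
    """Count A-Z in text, ignoring case and any non-letter characters."""
    freq = {letter: 0 for letter in string.ascii_uppercase}
    for char in text.upper():
        if char in freq:  # skip spaces, digits, punctuation
            freq[char] += 1
    return freq


freq = letter_frequencies("Hello, World!")
print(freq["L"], freq["O"], freq["Z"])  # 3 2 0
```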
If you want to convert strings in the list to upper case using lambda, you may use it with map() as:
>>> words = ["Hello", "World"]
>>> map(lambda word: word.upper(), words) # In Python 2
['HELLO', 'WORLD']
# In Python 3, use it as: list(map(...))
As per the map() documentation (Python 2):
map(function, iterable, ...)
Apply function to every item of iterable and return a list of the results.
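In Python 3, map() returns a lazy iterator rather than a list, so the result has to be materialised before it can be printed or indexed. A short sketch of the two usual spellings, passing str.upper directly instead of a lambda:

```python
words = ["Hello", "World"]

# map() is lazy in Python 3; list() consumes it.
upper_words = list(map(str.upper, words))
print(upper_words)  # ['HELLO', 'WORLD']

# The equivalent list comprehension, often considered more readable.
upper_words_lc = [word.upper() for word in words]
print(upper_words_lc)  # ['HELLO', 'WORLD']
```

Both build a new list and leave words unchanged.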
To find the frequency of each character in a word, you may use collections.Counter() (a dict subclass) as:
>>> from collections import Counter
>>> my_word = "hello world"
>>> c = Counter(my_word)
# where c holds dictionary as:
# {'l': 3,
# 'o': 2,
# ' ': 1,
# 'e': 1,
# 'd': 1,
# 'h': 1,
# 'r': 1,
# 'w': 1}
As per the Counter documentation:
A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values.
For the letter counting, don't reinvent the wheel; use collections.Counter:
A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value including zero or negative counts. The Counter class is similar to bags or multisets in other languages.
def punc_remove(words):
    for word in words:
        if word.isalnum() == False:
            charl = []
            for char in word:
                if char.isalnum()==True:
                    charl.append(char)
            ind = words.index(word)
            delimeter = ""
            words[ind] = delimeter.join(charl)
def letter_cnt_dic(word,freq_d):
    for let in word:
        freq_d[let] += 1
import string
def letter_freq(fname):
    fhand = open(fname)
    freqs = dict()
    # string.ascii_uppercase works in Python 2 and 3 (string.uppercase is Python 2 only)
    alpha = list(string.ascii_uppercase)
    for let in alpha: freqs[let] = freqs.get(let,0)
    for line in fhand:
        line = line.rstrip()
        words = line.split()
        punc_remove(words)
        #map(lambda word: word.upper(),words)
        words = [word.upper() for word in words]
        for word in words:
            letter_cnt_dic(word,freqs)
    fhand.close()
    return freqs.values()
You can read the docs about the Counter and the List Comprehensions or run this as a small demo:
from collections import Counter
words = ["acdefg","abcdefg","abcdfg"]
# List comprehension: no need for lambda or map
new_words = [word.upper() for word in words]
print(new_words)
# Let's create a dict and a counter
letters = {}
letters_counter = Counter()
for word in words:
    # Adding two Counters sums their counts.
    letters_counter += Counter(word)
    # We can also do it by hand
    for letter in word:
        letters[letter] = letters.get(letter,0) + 1
print(letters_counter)
print(letters)
I'm trying to write a function that will count the number of word duplicates in a string and then return that word if the number of duplicates exceeds a certain number (n). Here's what I have so far:
from collections import defaultdict
def repeat_word_count(text, n):
    words = text.split()
    tally = defaultdict(int)
    answer = []
    for i in words:
        if i in tally:
            tally[i] += 1
        else:
            tally[i] = 1
I don't know where to go from here when it comes to comparing the dictionary values to n.
How it should work:
repeat_word_count("one one was a racehorse two two was one too", 3) should return ['one']
Try
for i in words:
    tally[i] = tally.get(i, 0) + 1
instead of
for i in words:
    if i in tally:
        tally[words] += 1 #you are using words the list as key, you should use i the item
    else:
        tally[words] = 1
If you simply want to count the words, collections.Counter would do fine.
>>> import collections
>>> a = collections.Counter("one one was a racehorse two two was one too".split())
>>> a
Counter({'one': 3, 'two': 2, 'was': 2, 'a': 1, 'racehorse': 1, 'too': 1})
>>> a['one']
3
Here is a way to do it:
from collections import defaultdict
tally = defaultdict(int)
text = "one two two three three three"
for i in text.split():
    tally[i] += 1
print(tally)  # Python 3.7+: defaultdict(<class 'int'>, {'one': 1, 'two': 2, 'three': 3})
Putting this in a function:
def repeat_word_count(text, n):
    output = []
    tally = defaultdict(int)
    for i in text.split():
        tally[i] += 1
    for k in tally:
        if tally[k] > n:
            output.append(k)
    return output
text = "one two two three three three four four four four"
repeat_word_count(text, 2)
Out[141]: ['four', 'three']
If what you want is a dictionary counting the words in a string, you can try this:
string = 'hello world hello again now hi there hi world'.split()
d = {}
for word in string:
    d[word] = d.get(word, 0) + 1
print(d)
Output (Python 3.7+, keys in first-seen order):
{'hello': 2, 'world': 2, 'again': 1, 'now': 1, 'hi': 2, 'there': 1}
Why don't you use the Counter class for this:
from collections import Counter
cnt = Counter(text.split())
Here the elements are stored as dictionary keys and their counts as dictionary values. Then it's easy to keep the words whose count exceeds n by looping over the keys (cnt.iterkeys() in Python 2; in Python 3, iterate over cnt directly), like
result = []
for k in cnt:
    if cnt[k] > n:
        result.append(k)
result then holds your list of words.
**Edited: sorry, that's if you need many words; BrianO has the right one for your case.
As luoluo says, use collections.Counter.
To get the item with the highest tally, use the Counter.most_common method with argument 1, which returns a list containing a single pair (word, tally) whose tally is the maximum. If the "sentence" is nonempty then that list is too. So, the following function returns some word that occurs at least n times if there is one, and returns None otherwise:
from collections import Counter
def repeat_word_count(text, n):
    if not text: return None # guard against '' and None!
    counter = Counter(text.split())
    max_pair = counter.most_common(1)[0]
    # >= so that "at least n times" holds, as in the asker's example with n = 3
    return max_pair[0] if max_pair[1] >= n else None
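The example in the question expects a list (['one']) rather than a single word, so a variant that returns every qualifying word may be closer to what was asked. This is a sketch under the assumption that "exceeds n" was meant as "occurs at least n times", which is what the question's example with n = 3 implies:

```python
from collections import Counter


def repeat_word_count(text, n):
    """Return every word that occurs at least n times, in first-seen order."""
    counts = Counter(text.split())
    return [word for word, count in counts.items() if count >= n]


sentence = "one one was a racehorse two two was one too"
print(repeat_word_count(sentence, 3))  # ['one']
print(repeat_word_count(sentence, 2))  # ['one', 'was', 'two']
```

An empty string simply yields an empty list, so no separate guard is needed.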