Counting unique words in python - python

My code so far is this:
from glob import glob
pattern = "D:\\report\\shakeall\\*.txt"
filelist = glob(pattern)
def countwords(fp):
    with open(fp) as fh:
        return len(fh.read().split())
print "There are", sum(map(countwords, filelist)), "words in the files. " "From directory", pattern
I want to add code that counts the unique words across the files matched by pattern (42 txt files in this path), but I don't know how. Can anybody help me?

The best way to count objects in Python is to use the collections.Counter class, which was created for that purpose. It acts like a Python dict but is a bit easier to use when counting. You can pass it a list of objects and it counts them for you automatically.
>>> from collections import Counter
>>> c = Counter(['hello', 'hello', 1])
>>> print c
Counter({'hello': 2, 1: 1})
Counter also has some useful methods such as most_common; see the documentation to learn more.
Another very useful method of the Counter class is update. After you have instantiated a Counter by passing it a list of objects, you can pass more objects to update and it will continue counting without discarding the existing counts:
>>> from collections import Counter
>>> c = Counter(['hello', 'hello', 1])
>>> print c
Counter({'hello': 2, 1: 1})
>>> c.update(['hello'])
>>> print c
Counter({'hello': 3, 1: 1})

print len(set(w.lower() for w in open('filename.dat').read().split()))
This reads the entire file into memory, splits it into words on whitespace, converts each word to lower case, builds a set of the (unique) lowercase words, counts them and prints the result.

If you want the count of each unique word, use a dict:
words = ['Hello', 'world', 'world']
count = {}
for word in words:
    if word in count:
        count[word] += 1
    else:
        count[word] = 1
And you will get the dict
{'Hello': 1, 'world': 2}
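To tie the answers above back to the original question (unique words across many files), here is a minimal sketch that feeds every file into a single Counter. The function name count_words_in_files is made up for illustration, and lowercasing each word is an assumption about what should count as "the same word".

```python
from collections import Counter


def count_words_in_files(paths):
    """Return a Counter mapping each lowercased word to its total count."""
    counts = Counter()
    for path in paths:
        with open(path) as fh:
            # Read line by line so a large file is never held in memory at once.
            for line in fh:
                counts.update(word.lower() for word in line.split())
    return counts
```

With the question's filelist, counts = count_words_in_files(filelist) would give len(counts) as the number of unique words and sum(counts.values()) as the total word count. Punctuation is not stripped here, so "word" and "word," are counted separately.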

Related

Counting repeated strings in a list and printing

I have a list containing a number of strings. Some of the strings are repeated, so I want to count how many times they are repeated. Strings that occur once should simply be printed; for repeated strings I want to print the number of occurrences. The code is as follows:
for string in list:
    if list.count(string) > 1:
        print(string+" appeared: ")
        print(list.count(string))
    elif list.count(string) == 1:
        print(string)
However, it prints every instance of the repeated strings. For example, if there are two "hello" strings in the list, it prints "hello appeared: 2" twice. Is there a way to skip the remaining instances of a repeated string? Thanks for the help.
Calling list.count in a loop is expensive: it scans the entire list for each word, which is O(n^2) complexity. You can loop over a set of the words instead, but that is O(m*n) complexity (m unique words, n words in total), which is still not great.
Instead, you can use collections.Counter to pass over your list once, then iterate over the counter's key-value pairs. This has O(m+n) complexity.
lst = ['hello', 'test', 'this', 'is', 'a', 'test', 'hope', 'this', 'works']
from collections import Counter
c = Counter(lst)
for word, count in c.items():
    if count == 1:
        print(word)
    else:
        print(f'{word} appeared: {count}')
hello
test appeared: 2
this appeared: 2
is
a
hope
works
Use a set. For example:
for string in set(list):
    if list.count(string) > 1:
        print(string+" appeared: ")
        print(list.count(string))
    elif list.count(string) == 1:
        print(string)
Use a Counter
To create:
In [166]: import collections
In [169]: d = collections.Counter(['hello', 'world', 'hello'])
To display:
In [170]: for word, freq in d.items():
     ...:     if freq > 1:
     ...:         print('{0} appeared {1} times'.format(word, freq))
     ...:     else:
     ...:         print(word)
     ...:
hello appeared 2 times
world
You can use Python's collections.Counter like so:
import collections
result = dict(collections.Counter(list))
Another way to do this manually is:
result = {k: 0 for k in set(list)}
for item in list:
    result[item] += 1
Also, you should not name your list list, as that is the name of Python's built-in type. Both methods will give you a dict like this:
{"a": 3, "b": 1, "c": 4, "d": 1}
where the keys are the unique values from your list and the values are how many times each key appears in your list.

Python - How do i build a dictionary from a text file?

For the Data Structures and Algorithms class at Tilburg University I got this question in an in-class test:
Build a dictionary from testfile.txt with only unique entries; if an entry appears again, its number should be added to the total for that product class.
The text file looked like this (it was not a .csv file):
apples,1
pears,15
oranges,777
apples,-4
oranges,222
pears,1
bananas,3
So apples will be -3 and the output would be {"apples": -3, "oranges": 999...}
In the exam I am not allowed to import any external packages besides the usual ones (pcinput, math, etc.). I am also not allowed to use the internet.
I have no idea how to accomplish this, and it seems to be a big gap in my Python skills: this kind of question is not covered in a 'dictionaries in Python' video on YouTube (perhaps too hard), nor in an expert course (where it would be too simple).
Hope you guys can help!
from collections import Counter
from sys import exit
from os.path import exists, isfile
## I did not finish it, but what I wanted to achieve was to build a list of the
## strings and their belonging integers, then use the Counter method to add
## them together
## by splitting the string, marking the comma as the split point.
filename = input("filename voor input: ")
if not isfile(filename):
    print(filename, "bestaat niet")
    exit()
keys = []
values = []
with open(filename) as f:
    xs = f.read().split()
    for i in xs:
        keys.append([i])
print(keys)
my_dict = {}
for i in range(len(xs)):
    my_dict[xs[i]] = xs.count(xs[i])
print(my_dict)
word_and_integers_dict = dict(zip(keys, values))
print(word_and_integers_dict)
values2 = my_dict.split(",")
for j in values2:
    print( value2 )
The output is this:
[['schijndel,-3'], ['amsterdam,0'], ['tokyo,5'], ['tilburg,777'], ['zaandam,5']]
{'zaandam,5': 1, 'tilburg,777': 1, 'amsterdam,0': 1, 'tokyo,5': 1, 'schijndel,-3': 1}
{}
So I got a dictionary out of it, but I did not separate the values.
The error message is this:
28 values2 = my_dict.split(",") <-- here was the error
29 for j in values2:
30 print( value2 )
AttributeError: 'dict' object has no attribute 'split'
I don't understand what your code is actually doing, and I think you don't know what your variables contain, but this is an easy problem to solve in Python. Split into a list, split each item again, and sum:
>>> data = "apples,1 pears,15 oranges,777 apples,-4 oranges,222 pears,1 bananas,3"
>>> parts = data.split()
>>> parts
['apples,1', 'pears,15', 'oranges,777', 'apples,-4', 'oranges,222', 'pears,1', 'bananas,3']
Then split again. Behold the list comprehension. This is an idiomatic way to transform one list into another in Python. Note that the numbers are still strings, not ints.
>>> pairs = [s.split(',') for s in parts]
>>> pairs
[['apples', '1'], ['pears', '15'], ['oranges', '777'], ['apples', '-4'], ['oranges', '222'], ['pears', '1'], ['bananas', '3']]
Now you want to iterate over the pairs and sum the numbers for each fruit. This calls for a dict:
>>> result = {}
>>> for fruit, countstr in pairs:
...     if fruit not in result:
...         result[fruit] = 0
...     result[fruit] += int(countstr)
...
>>> result
{'pears': 16, 'apples': -3, 'oranges': 999, 'bananas': 3}
This pattern of adding an element if it doesn't exist comes up frequently. You should check out defaultdict in the collections module. If you use that, you don't even need the if.
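As a rough sketch of the defaultdict variant mentioned above: defaultdict(int) supplies 0 for a missing key, so the membership test disappears. The helper name sum_by_key and the sample list are illustrative only; the lines are assumed to have the "name,number" shape shown in the question.

```python
from collections import defaultdict


def sum_by_key(lines):
    """Sum the integer after the comma for every name before it."""
    totals = defaultdict(int)  # missing keys start at 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, amount = line.split(",")
        totals[name] += int(amount)
    return dict(totals)


sample = ["apples,1", "pears,15", "oranges,777", "apples,-4",
          "oranges,222", "pears,1", "bananas,3"]
print(sum_by_key(sample))
```

Because an open file iterates over its lines, the same function should also accept a file object directly, e.g. sum_by_key(open("testfile.txt")).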
Let's walk through what you need to do. First, check that the file exists and read its contents into a variable. Second, parse each line: split the line on the comma, convert the number from a string to an integer, and then add the values to a dictionary. In this case I would recommend using defaultdict from collections, but we can also do it with a standard dictionary.
from os.path import isfile
from sys import exit
filename = input("filename voor input: ")
if not isfile(filename):
    print(filename, "bestaat niet")
    exit()
# this reads the file to a list, removing newline characters
# and skipping blank lines
with open(filename) as f:
    line_list = [x.strip() for x in f if x.strip()]
# create a dictionary
my_dict = {}
# update the value in the dictionary if it already exists,
# otherwise add it to the dictionary
for line in line_list:
    k, v_str = line.split(',')
    if k in my_dict:
        my_dict[k] += int(v_str)
    else:
        my_dict[k] = int(v_str)
# print the dictionary
table_str = '{:<30}{}'
print(table_str.format('Item','Count'))
print('='*35)
for k,v in sorted(my_dict.items()):
    print(table_str.format(k,v))

using lambda and dictionaries functions

I wrote this function:
def make_upper(words):
    for word in words:
        ind = words.index(word)
        words[ind] = word.upper()
I also wrote a function that counts the frequency of occurrences of each letter:
def letter_cnt(word,freq):
    for let in word:
        if let == 'A': freq[0]+=1
        elif let == 'B': freq[1]+=1
        elif let == 'C': freq[2]+=1
        elif let == 'D': freq[3]+=1
        elif let == 'E': freq[4]+=1
Counting letter frequency would be much more efficient with a dictionary, yes. Note that you are manually lining up each letter with a number ("A" with 0, et cetera). Wouldn't it be easier if we could have a data type that directly associated a letter with the number of times it occurs, without adding an extra set of numbers in between?
Consider the code:
freq = {"A":0, "B":0, "C":0, "D":0, ... ..., "Z":0}
for letter in text:
    freq[letter] += 1
This dictionary is used to count frequencies much more efficiently than your current code does. You just add one to an entry for a given letter each time you see it.
I will also mention that you can count frequencies effectively with certain libraries. If you are interested in analyzing frequencies, look into collections.Counter() and possibly the collections.Counter.most_common() method.
Whether or not you decide to just use collections.Counter(), I would attempt to learn why dictionaries are useful in this context.
One final note: I personally found typing out the values for the "freq" dictionary to be tedious. If you want you could construct an empty dictionary of alphabet letters on-the-fly with this code:
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
freq = {letter:0 for letter in alphabet}
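For comparison, the collections.Counter route mentioned above might look like the following sketch. The function name letter_frequencies is made up for illustration, and filtering with str.isalpha() is an assumption about which characters should count as letters (it also accepts non-ASCII letters).

```python
from collections import Counter


def letter_frequencies(text):
    """Count only the alphabetic characters of text, ignoring case."""
    return Counter(ch for ch in text.upper() if ch.isalpha())


freq = letter_frequencies("Hello, World!")
print(freq.most_common(3))  # the three most frequent letters with their counts
```

Unlike the pre-filled dictionary, a Counter only holds letters that actually occur; looking up a letter that never appeared returns 0 rather than raising KeyError.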
If you want to convert strings in the list to upper case using lambda, you may use it with map() as:
>>> words = ["Hello", "World"]
>>> map(lambda word: word.upper(), words) # In Python 2
['HELLO', 'WORLD']
# In Python 3, use it as: list(map(...))
As per the map() documentation:
map(function, iterable, ...)
Apply function to every item of iterable and return a list of the results.
For finding the frequency of each character in a word, you may use collections.Counter() (a subclass of dict) as:
>>> from collections import Counter
>>> my_word = "hello world"
>>> c = Counter(my_word)
# where c holds dictionary as:
# {'l': 3,
# 'o': 2,
# ' ': 1,
# 'e': 1,
# 'd': 1,
# 'h': 1,
# 'r': 1,
# 'w': 1}
As per the Counter documentation:
A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values.
For the letter counting, don't reinvent the wheel: use collections.Counter.
A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value including zero or negative counts. The Counter class is similar to bags or multisets in other languages.
def punc_remove(words):
    for word in words:
        if not word.isalnum():
            charl = []
            for char in word:
                if char.isalnum():
                    charl.append(char)
            ind = words.index(word)
            delimeter = ""
            words[ind] = delimeter.join(charl)
def letter_cnt_dic(word,freq_d):
    for let in word:
        freq_d[let] += 1
import string
def letter_freq(fname):
    fhand = open(fname)
    freqs = dict()
    # ascii_uppercase exists in both Python 2 and 3 (string.uppercase is Python 2 only)
    alpha = list(string.ascii_uppercase[:26])
    for let in alpha: freqs[let] = freqs.get(let,0)
    for line in fhand:
        line = line.rstrip()
        words = line.split()
        punc_remove(words)
        #map(lambda word: word.upper(),words)
        words = [word.upper() for word in words]
        for word in words:
            letter_cnt_dic(word,freqs)
    fhand.close()
    return freqs.values()
You can read the docs about Counter and list comprehensions, or run this as a small demo:
from collections import Counter
words = ["acdefg","abcdefg","abcdfg"]
# list comprehension, no need for lambda or map
new_words = [word.upper() for word in words]
print(new_words)
# Let's create a dict and a Counter
letters = {}
letters_counter = Counter()
for word in words:
    # Adding two Counters sums their counts.
    letters_counter += Counter(word)
    # We can also do it by hand with a plain dict
    for letter in word:
        letters[letter] = letters.get(letter,0) + 1
print(letters_counter)
print(letters)

Python - counting duplicate strings

I'm trying to write a function that will count the number of word duplicates in a string and then return that word if the number of duplicates exceeds a certain number (n). Here's what I have so far:
from collections import defaultdict
def repeat_word_count(text, n):
    words = text.split()
    tally = defaultdict(int)
    answer = []
    for i in words:
        if i in tally:
            tally[i] += 1
        else:
            tally[i] = 1
I don't know where to go from here when it comes to comparing the dictionary values to n.
How it should work:
repeat_word_count("one one was a racehorse two two was one too", 3) should return ['one']
Try
for i in words:
    tally[i] = tally.get(i, 0) + 1
instead of
for i in words:
    if i in tally:
        tally[words] += 1  # you are using words (the list) as the key; you should use i, the item
    else:
        tally[words] = 1
If you simply want to count the words, collections.Counter would do fine.
>>> import collections
>>> a = collections.Counter("one one was a racehorse two two was one too".split())
>>> a
Counter({'one': 3, 'two': 2, 'was': 2, 'a': 1, 'racehorse': 1, 'too': 1})
>>> a['one']
3
Here is a way to do it:
from collections import defaultdict
tally = defaultdict(int)
text = "one two two three three three"
for i in text.split():
    tally[i] += 1
print tally # defaultdict(<type 'int'>, {'three': 3, 'two': 2, 'one': 1})
Putting this in a function:
def repeat_word_count(text, n):
    output = []
    tally = defaultdict(int)
    for i in text.split():
        tally[i] += 1
    for k in tally:
        if tally[k] > n:
            output.append(k)
    return output
text = "one two two three three three four four four four"
repeat_word_count(text, 2)
Out[141]: ['four', 'three']
If what you want is a dictionary counting the words in a string, you can try this:
string = 'hello world hello again now hi there hi world'.split()
d = {}
for word in string:
    d[word] = d.get(word, 0) + 1
print d
Output:
{'again': 1, 'there': 1, 'hi': 2, 'world': 2, 'now': 1, 'hello': 2}
Why not use the Counter class for this:
from collections import Counter
cnt = Counter(text.split())
Here the elements are stored as dictionary keys and their counts as dictionary values. It is then easy to keep the words whose count exceeds n by looping over iterkeys() (Python 2), like so:
result = []
for k in cnt.iterkeys():
    if cnt[k] > n:
        result.append(k)
result then holds your list of words.
**Edited: sorry, that's if you need many words; BrianO has the right one for your case.
As luoluo says, use collections.Counter.
To get the item with the highest tally, use the Counter.most_common method with argument 1, which returns a list holding the single (word, tally) pair with the maximum tally. If the "sentence" contains at least one word then that list is nonempty. So the following function returns a word that occurs at least n times if there is one, and returns None otherwise:
from collections import Counter
def repeat_word_count(text, n):
    words = text.split() if text else []  # guard against '', None and whitespace-only text
    if not words: return None
    counter = Counter(words)
    max_pair = counter.most_common(1)[0]
    return max_pair[0] if max_pair[1] >= n else None
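The question's example expects a list (['one']), so a variant that returns every qualifying word may be closer to what was asked. The sketch below is one possible reading, not any answerer's code: the name words_repeated_at_least is invented, and "exceeds n" is interpreted as "occurs at least n times" because that is what the question's example implies.

```python
from collections import Counter


def words_repeated_at_least(text, n):
    """Return every word that occurs at least n times, most frequent first."""
    counts = Counter(text.split())
    return [word for word, count in counts.most_common() if count >= n]


print(words_repeated_at_least("one one was a racehorse two two was one too", 3))
```

If a strict "more than n" comparison is wanted instead, changing >= to > is the only adjustment needed.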

counting words from a file using dictionary doesn't work

I'm trying to count hashtags from a JSON file of tweets. The goal of my program is to first extract the hashtags into a list, and then to build a dictionary of those hashtags (for which I wrote the "hashtags_dic" function) that counts how many times each hashtag occurs. My problem is that right now the program returns the hashtag values but does not sum up the number of times each particular hashtag occurs.
I created a function named "hashtags_dic" that creates a dictionary, but it doesn't work.
Here is the code:
from twitter_DB import load_from_DB
def get_entities(tweet):
    if 'entities' in tweet.keys():
        hashtag_list = [hashtag['text'] for hashtag in tweet['entities']['hashtags']]
        return hashtag_list
    else:
        return []
def hashtags_dic(hashtag_list):
    hashtag_count = {}
    for text in hashtag_list:
        if text != None:
            if text in hashtag_count.keys():
                hashtag_count[text] = hashtag_count[text] + 1
            else:
                hashtag_count[text] = 1
    return hashtag_count
if __name__ == '__main__':
    DBname = 'search-results'
    tweet_results = load_from_DB(DBname)
    print 'number loaded', len(tweet_results)
    for tweet in tweet_results[:100]:
        labels = get_entities(tweet)
        dic=hashtags_dic(labels)
        print ' Hashtags:', labels[:20]
        print ' Hastags count: ', dic
I'd appreciate any hints or ideas on what's wrong with my code. Thanks in advance... Norpa
There are several techniques for counting using dicts or dict subclasses (including dict.setdefault, collections.defaultdict, and collections.Counter).
As you might guess from its name, collections.Counter() is ideally suited to the task of counting :-)
import collections
import pprint
hash_counts = collections.Counter(hashtags)
print("Fifty most popular hashtags")
pprint.pprint(hash_counts.most_common(50))
FWIW, your original hashtags_dic() function seems to work just fine:
>>> hashtags_dic(['obama', 'putin', 'cameron', 'putin', 'obama'])
{'cameron': 1, 'putin': 2, 'obama': 2}
The hashtags_dic() function would do much less work if you substituted text in hashtag_count for text in hashtag_count.keys(). The former does a high-speed hashed dictionary lookup, while the latter (in Python 2) builds a list of keys and then searches it linearly.
You can use defaultdict to easily count unique hashtag occurrences. For example:
from collections import defaultdict
hashtags = ['nice', 'cool', 'great', 'fun', 'nice', 'cool']
hashtag_dict = defaultdict(int)
for k in hashtags:
    hashtag_dict[k] += 1
defaultdict(<type 'int'>, {'fun': 1, 'great': 1, 'cool': 2, 'nice': 2})
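One possible explanation for the counts never summing up is that the question's main loop calls hashtags_dic once per tweet, so each dictionary only ever sees a single tweet's hashtags. The sketch below keeps one Counter across all tweets instead. The function name count_hashtags and the sample tweets are hypothetical; the tweet layout (an 'entities' dict holding a 'hashtags' list of dicts with a 'text' key) is taken from the question's get_entities().

```python
from collections import Counter


def count_hashtags(tweets):
    """Tally hashtag texts over all tweets, not one tweet at a time."""
    totals = Counter()
    for tweet in tweets:
        # Same structure the question's get_entities() reads; missing keys yield no tags.
        tags = tweet.get("entities", {}).get("hashtags", [])
        totals.update(tag["text"] for tag in tags if tag.get("text") is not None)
    return totals


# Hypothetical sample data standing in for load_from_DB('search-results').
sample_tweets = [
    {"entities": {"hashtags": [{"text": "python"}, {"text": "code"}]}},
    {"entities": {"hashtags": [{"text": "python"}]}},
    {"text": "a tweet without entities"},
]
print(count_hashtags(sample_tweets).most_common(2))
```

The same effect could be had with the original functions by collecting every tweet's labels into one list first and calling hashtags_dic on that list once, after the loop.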
