Counting repeated strings in a list and printing - python

I have a list containing a number of strings. Some of the strings are repeated, so I want to count how many times each one appears. A string that occurs once should just be printed; for a repeated string I want to print how many times it occurs. The code is as follows:
for string in list:
    if list.count(string) > 1:
        print(string+" appeared: ")
        print(list.count(string))
    elif list.count(string) == 1:
        print(string)
However, it prints a result for every instance of a repeated string. For example, if there are two "hello" strings in the list, it prints "hello appeared: 2" twice. Is there a way to skip the remaining instances of a string that has already been reported? Thanks for the help.

Calling list.count in a loop is expensive: it scans the entire list for each word, which is O(n^2) for a list of n words. You can loop over a set of the words instead, but that is O(m*n) for m unique words, still not great.
Instead, you can use collections.Counter to scan your list once, then iterate over its key-value pairs. This has O(m+n) complexity.
lst = ['hello', 'test', 'this', 'is', 'a', 'test', 'hope', 'this', 'works']
from collections import Counter
c = Counter(lst)
for word, count in c.items():
    if count == 1:
        print(word)
    else:
        print(f'{word} appeared: {count}')
hello
test appeared: 2
this appeared: 2
is
a
hope
works

Use set
Ex:
for string in set(list):
    if list.count(string) > 1:
        print(string+" appeared: ")
        print(list.count(string))
    elif list.count(string) == 1:
        print(string)

Use a Counter
To create:
In [166]: import collections
In [169]: d = collections.Counter(['hello', 'world', 'hello'])
To display:
In [170]: for word, freq in d.items():
     ...:     if freq > 1:
     ...:         print('{0} appeared {1} times'.format(word, freq))
     ...:     else:
     ...:         print(word)
     ...:
hello appeared 2 times
world

You can use Python's collections.Counter like so:
import collections
result = dict(collections.Counter(list))
Another way to do this manually is:
result = {k: 0 for k in set(list)}
for item in list:
    result[item] += 1
Also, you should not name your list "list", since that shadows Python's built-in type. Both methods will give you a dict like this:
{"a": 3, "b": 1, "c": 4, "d": 1}
The keys are the unique values from your list, and the values are how many times each key appears in your list.
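As a quick check that the two approaches above agree, here is a minimal sketch. The list name `words` and its contents are made up for illustration (the name is chosen so the built-in `list` is not shadowed).

```python
from collections import Counter

# `words` is a hypothetical sample list; the name avoids shadowing the built-in `list`.
words = ["a", "b", "a", "c", "a", "c"]

# Approach 1: let Counter do the counting, then convert to a plain dict.
counted = dict(Counter(words))

# Approach 2: count manually, starting every unique word at zero.
manual = {word: 0 for word in set(words)}
for word in words:
    manual[word] += 1

# Both dicts hold the same word -> count mapping.
print(counted == manual)
```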

Related

Maximum frequency value for word

Seeking help on Homework
I am given a list and asked to find the most frequently occurring value in it and return the number of times it occurs. The question is fairly big and I have managed to get through the other parts by myself, but this one stumped me. I should add that this is for an assignment; any guidance would be appreciated.
Question Statement : Maximum (word) Frequency
For example in a book with the following words ['big', 'big', 'bat', 'bob', 'book'] the maximum frequency is 2, i.e., big is the most frequently occurring word, therefore 2 is the maximum frequency.
def maximum_frequency(new_list):
    word_counter = {}
    for word in new_list:
        if word in word_counter:
            word_counter[word] += 1
        else:
            word_counter[word] = 1
I have gotten this far, but I am not sure if it's right or where to go from here.
Try this:
from collections import Counter
c = Counter(['big', 'big', 'bat', 'bob', 'book'])
max(c.items(), key=lambda x:x[1])
max returns the item with the highest count, so you can unpack it:
key, rate = max(c.items(), key=lambda x: x[1])
The key will be 'big' and the rate will be 2.
Also, you can see the counts of all the items with dict(c); the output will be
{'big': 2, 'bat': 1, 'bob': 1, 'book': 1}
Edit:
As schwobaseggl said, the best practice for finding the most frequent item in a Counter is to use most_common.
c.most_common(1)[0]
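One caveat worth sketching: most_common(1) reports a single pair even when several words share the highest count. The helper below is a hypothetical illustration (the name `most_frequent_words` is not from the answers) that returns the maximum frequency together with every word that reaches it.

```python
from collections import Counter

def most_frequent_words(words):
    """Return (max_count, sorted list of words that reach it); (0, []) if empty."""
    counts = Counter(words)
    if not counts:
        # max() would raise on an empty sequence, so handle it explicitly.
        return 0, []
    max_count = max(counts.values())
    # Collect every word tied for the highest count.
    tied = sorted(word for word, count in counts.items() if count == max_count)
    return max_count, tied

print(most_frequent_words(['big', 'big', 'bat', 'bob', 'book']))
```

If only the number is needed, the first element of the returned tuple is the maximum frequency.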
You just need to count the occurrence of all the unique elements and compare the frequency with the previously computed frequency.
sample is a list of words.
def maxfreq(sample):
    m = 0
    frequency = 0
    word = ''
    set_sample = list(set(sample))
    for i in range(len(set_sample)):
        c = sample.count(set_sample[i])
        if c > m:
            m = c
            frequency = m
            word = set_sample[i]
    return (frequency, word)
Since it sounds like this is some kind of challenge and/or homework you're supposed to be working on, instead of directly providing a code sample let me give you some concepts.
First off, the best way to know whether you've seen a word or not is to use a map. In Python the term is "dict" and the syntax is simply {}; you can store values like this: my_dict['value'] = True, or whatever key/value you need.
So if you're going to read your words one by one and store them in this dict, then what should the value be? You know you want the maximum frequency, right? Well, let's use the frequency as our value. When we first add a word, we should set its initial value to 1 (we've seen it once). If we see a word a second time, we increment its frequency.
Now that you have a dict full of words and their frequencies, perhaps you might be able to figure out how to find the one with the largest frequency?
So that being said, things you should look into are:
How to determine if a key exists in a dict
How to modify the value of a key in a dict
How to iterate a dict's key/value pairs
After that, your answer should be pretty easy to figure out.
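For reference, the three dict operations listed above can be sketched on a toy dict. The name `seen` and its contents are hypothetical, and this is deliberately not the full homework solution.

```python
# `seen` is a hypothetical toy dict used only to show the three operations.
seen = {"apple": 1}

# 1. Determine whether a key exists in a dict.
has_apple = "apple" in seen
has_pear = "pear" in seen

# 2. Modify the value of a key (or add a new key).
seen["apple"] += 1
seen["pear"] = 1

# 3. Iterate over the dict's key/value pairs.
for word, count in seen.items():
    print(word, count)
```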
try this :
>>> MyList = ["above", "big", "above", "cat", "cat", "above", "cat"]
>>> my_dict = {i:MyList.count(i) for i in MyList}
>>> my_dict
{'above': 3, 'big': 1, 'cat': 3}
This can also be accomplished using collections.Counter, which is available in Python 2.7 and 3.x:
>>> from collections import Counter
>>> MyList = ['big', 'big', 'bat', 'bob', 'book']
>>> dict(Counter(MyList))
{'big': 2, 'bat': 1, 'bob': 1, 'book': 1}
If you are open to Pandas then it can be done as follows:
>>> import pandas as pd
>>> pd.Series(MyList).value_counts()
big 2
book 1
bob 1
bat 1
dtype: int64
# Answer to the OP's follow-up question in the comments: what if I want just the maximum value instead of the word?
>>> pd.Series(MyList).value_counts().max()
2
How about this:
def maximum_frequency(new_list):
    word_counter = {}
    for word in new_list:
        if word in word_counter:
            word_counter[word] += 1
        else:
            word_counter[word] = 1
    max_freq = max(word_counter.items(), key=(lambda x: x[1]))
    return max_freq
if __name__ == '__main__':
    test_data = ['big', 'big', 'bat', 'bob', 'book']
    print(maximum_frequency(test_data))
Output:
('big', 2)
Works fine with Python 2 and 3 and returns result as a tuple of most frequent word and occurrences count.
EDIT:
If you don't care at all which word has the highest count and you want only the frequency number you can simplify it a bit to:
def maximum_frequency(new_list):
    word_counter = {}
    for word in new_list:
        if word in word_counter:
            word_counter[word] += 1
        else:
            word_counter[word] = 1
    return max(word_counter.values())
if __name__ == '__main__':
    test_data = ['big', 'big', 'bat', 'bob', 'book']
    print(maximum_frequency(test_data))

Creating a dictionary where the key is an integer and the value is the length of a random sentence

Super new to Python here; I've been struggling with this code for a while now. Basically, the function should return a dictionary with integers as keys, where each value is the list of words whose length equals the key.
So far I'm able to create a dictionary where the values are the number of words of each length, but not the actual words themselves.
So passing the following text
"the faith that he had had had had an affect on his life"
to the function
def get_word_len_dict(text):
    result_dict = {'1':0, '2':0, '3':0, '4':0, '5':0, '6' :0}
    for word in text.split():
        if str(len(word)) in result_dict:
            result_dict[str(len(word))] += 1
    return result_dict
returns
1 - 0
2 - 3
3 - 6
4 - 2
5 - 1
6 - 1
Where I need the output to be:
2 - ['an', 'he', 'on']
3 - ['had', 'his', 'the']
4 - ['life', 'that']
5 - ['faith']
6 - ['affect']
I think I need to have to return the values as a list. But I'm not sure how to approach it.
I think that what you want is a dict of lists.
def get_word_len_dict(text):
    result_dict = {'1':[], '2':[], '3':[], '4':[], '5':[], '6' :[]}
    for word in text.split():
        if str(len(word)) in result_dict:
            result_dict[str(len(word))].append(word)
    return result_dict
Fixing Sabian's answer so that duplicates aren't added to the list:
def get_word_len_dict(text):
    result_dict = {1:[], 2:[], 3:[], 4:[], 5:[], 6 :[]}
    for word in text.split():
        n = len(word)
        if n in result_dict and word not in result_dict[n]:
            result_dict[n].append(word)
    return result_dict
Check out dict comprehensions.
Integers are legal dictionary keys, so there is no need to make the numbers strings unless you want it that way for some other reason.
The if statement in the for loop ensures each word is added only once. You could get this effect automatically by using a set instead of a list as the value data structure. See more in the docs. I believe the following does the job:
def get_word_len_dict(text):
    result_dict = {len(word) : [] for word in text.split()}
    for word in text.split():
        if word not in result_dict[len(word)]:
            result_dict[len(word)].append(word)
    return result_dict
try to make it better ;)
Instead of setting the default value to 0, make it set(), and inside the if condition call result_dict[str(len(word))].add(word).
Also, instead of pre-populating result_dict, you should use collections.defaultdict.
Since you need non-repeating words, I am using a set as the value instead of a list.
Hence, your final code should be:
from collections import defaultdict
def get_word_len_dict(text):
    result_dict = defaultdict(set)
    for word in text.split():
        result_dict[str(len(word))].add(word)
    return result_dict
If you must have lists as values (I think a set should satisfy your requirement), you need to iterate over the result and convert each set:
for key, value in result_dict.items():
    result_dict[key] = list(value)
What you need is a map-to-list construct (if there are not many words; otherwise a Counter would be fine):
Each list stands for a word class (number of characters). The map is checked to see whether the word class ('3') was found before. The list is checked to see whether the word ('had') was found before.
def get_word_len_dict(text):
    result_dict = {}
    for word in text.split():
        if not result_dict.get(str(len(word))): # add list to map?
            result_dict[str(len(word))] = []
        if word not in result_dict[str(len(word))]: # add word to list?
            result_dict[str(len(word))].append(word)
    return result_dict
-->
3 ['the', 'had', 'his']
2 ['he', 'an', 'on']
5 ['faith']
4 ['that', 'life']
6 ['affect']
The problem here is that you are counting the words by length, when instead you want to group them. You can achieve this by storing a collection of words instead of an int:
def get_word_len_dict(text):
    result_dict = {}
    for word in text.split():
        if len(word) in result_dict:
            result_dict[len(word)].add(word)
        else:
            result_dict[len(word)] = {word} #using a set instead of list to avoid duplicates
    return result_dict
Other improvements:
don't hardcode the keys in the initialized dict; leave it empty instead and let the code add new keys dynamically when necessary
you can use int as keys instead of strings, it will save you the conversion
use sets to avoid repetitions
Using groupby
Well, I'll try to propose something different: you can group by length using groupby from the python standard library
import itertools
def get_word_len_dict(text):
    # sort by length, then group by length (you get an iterable of (key, group of values) tuples)
    groups = itertools.groupby(sorted(text.split(), key=len), len)
    # convert to a dictionary with sets
    return {l: set(words) for l, words in groups}
You say you want the keys to be integers but then you convert them to strings before storing them as a key. There is no need to do this in Python; integers can be dictionary keys.
Regarding your question, simply initialize the values of the keys to empty lists instead of the number 0. Then, in the loop, append the word to the list stored under the appropriate key (the length of the word), like this:
string = "the faith that he had had had had an affect on his life"
def get_word_len_dict(text):
    result_dict = {i : [] for i in range(1, 7)}
    for word in text.split():
        length = len(word)
        if length in result_dict:
            result_dict[length].append(word)
    return result_dict
This results in the following:
>>> get_word_len_dict(string)
{1: [], 2: ['he', 'an', 'on'], 3: ['the', 'had', 'had', 'had', 'had', 'his'], 4: ['that', 'life'], 5: ['faith'], 6: ['affect']}
If, as you mentioned, you wish to remove duplicate words, it is elegant to collect them in a set and convert to a list as a final processing step, if a list is needed. Also note the use of defaultdict, so you don't have to initialize the dictionary keys manually: a default value of set() (i.e. the empty set) is inserted for each key we access, and no key is created for lengths that never occur:
from collections import defaultdict
string = "the faith that he had had had had an affect on his life"
def get_word_len_dict(text):
    result_dict = defaultdict(set)
    for word in text.split():
        length = len(word)
        result_dict[length].add(word)
    return {k : list(v) for k, v in result_dict.items()}
This gives the following output:
>>> get_word_len_dict(string)
{2: ['he', 'on', 'an'], 3: ['his', 'had', 'the'], 4: ['life', 'that'], 5: ['faith'], 6: ['affect']}
Your code is counting the occurrence of each word length - but not storing the words themselves.
In addition to capturing each word into a list of words with the same size, you also appear to want:
If a word length is not represented, do not return an empty list for that length - just don't have a key for that length.
No duplicates in each word list
Each word list is sorted
A set container is ideal for accumulating the words - sets naturally eliminate any duplicates added to them.
Using defaultdict(set) sets up an empty dictionary of sets; a dictionary key is only created when it is referenced in the loop that examines each word.
from collections import defaultdict
def get_word_len_dict(text):
    # create empty dictionary of sets
    d = defaultdict(set)
    # the key is the length of each word
    # The value is a growing set of words
    # sets automatically eliminate duplicates
    for word in text.split():
        d[len(word)].add(word)
    # the sets in the dictionary are unordered
    # so sort them into a new dictionary, which is returned
    # as a dictionary of lists
    return {i:sorted(d[i]) for i in d.keys()}
In your example string of
a="the faith that he had had had had an affect on his life"
Calling the function like this:
z=get_word_len_dict(a)
Returns the following dictionary:
print(z)
{2: ['an', 'he', 'on'], 3: ['had', 'his', 'the'], 4: ['life', 'that'], 5: ['faith'], 6: ['affect']}
The type of each value in the dictionary is "list".
print(type(z[2]))
<class 'list'>

Python - Count letters in random strings

I have a bunch of integers which are allocated values using the random module, then converted to letters depending on their position of the alphabet.
I then combine a random sample of these variables into a "master" variable, which is printed to the console.
I want to then count the occurrence of each character, which will later be written to an output file.
Any help on how I would go about doing this?
>>> from collections import Counter
>>> for letter, count in Counter("aaassd").items():
...     print("letter", letter, "count", count)
...
letter s count 2
letter a count 3
letter d count 1
Probably better to use collections.Counter(), but here is a dict comprehension
>>> li = 'aaassd'
>>> res = {ch: sum(1 for x in li if x==ch) for ch in set(li)}
>>> res
{'d': 1, 's': 2, 'a': 3}
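The question also mentions writing the counts to an output file later. A minimal sketch of that step follows, assuming a simple "letter count" line format; the function name `write_letter_counts`, the file layout, and the UTF-8 encoding are illustrative choices, not something specified in the question.

```python
from collections import Counter

def write_letter_counts(text, path):
    """Count each character in `text` and write 'char count' lines to `path`."""
    counts = Counter(text)
    with open(path, "w", encoding="utf-8") as out:
        # Sort by character so the file layout is deterministic.
        for letter in sorted(counts):
            out.write(f"{letter} {counts[letter]}\n")
    return counts
```

Calling write_letter_counts("aaassd", "counts.txt") would write one line per distinct letter; the path here is only an example.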

Python - counting duplicate strings

I'm trying to write a function that will count the number of word duplicates in a string and then return that word if the number of duplicates exceeds a certain number (n). Here's what I have so far:
from collections import defaultdict
def repeat_word_count(text, n):
    words = text.split()
    tally = defaultdict(int)
    answer = []
    for i in words:
        if i in tally:
            tally[i] += 1
        else:
            tally[i] = 1
I don't know where to go from here when it comes to comparing the dictionary values to n.
How it should work:
repeat_word_count("one one was a racehorse two two was one too", 3) should return ['one']
Try
for i in words:
    tally[i] = tally.get(i, 0) + 1
instead of
for i in words:
    if i in tally:
        tally[words] += 1 #you are using words the list as key, you should use i the item
    else:
        tally[words] = 1
If you simply want to count the words, collections.Counter would be fine.
>>> import collections
>>> a = collections.Counter("one one was a racehorse two two was one too".split())
>>> a
Counter({'one': 3, 'two': 2, 'was': 2, 'a': 1, 'racehorse': 1, 'too': 1})
>>> a['one']
3
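To go from the Counter to the list the question asks for, one possible sketch is a list comprehension over the counts. The `>= n` comparison is an assumption inferred from the question's example (n = 3 should return ['one'], and 'one' occurs exactly 3 times).

```python
from collections import Counter

def repeat_word_count(text, n):
    """Return the words that occur at least `n` times, in first-seen order."""
    counts = Counter(text.split())
    # Assumption: "exceeds n" in the question means "at least n", per its example.
    return [word for word, count in counts.items() if count >= n]

print(repeat_word_count("one one was a racehorse two two was one too", 3))
```

Switching the comparison to `> n` gives the strict reading if that is what is actually wanted.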
Here is a way to do it:
from collections import defaultdict
tally = defaultdict(int)
text = "one two two three three three"
for i in text.split():
    tally[i] += 1
print(tally) # defaultdict(<class 'int'>, {'one': 1, 'two': 2, 'three': 3})
Putting this in a function:
def repeat_word_count(text, n):
    output = []
    tally = defaultdict(int)
    for i in text.split():
        tally[i] += 1
    for k in tally:
        if tally[k] > n:
            output.append(k)
    return output
text = "one two two three three three four four four four"
repeat_word_count(text, 2)
Out[141]: ['four', 'three']
If what you want is a dictionary counting the words in a string, you can try this:
string = 'hello world hello again now hi there hi world'.split()
d = {}
for word in string:
    d[word] = d.get(word, 0) + 1
print(d)
Output:
{'hello': 2, 'world': 2, 'again': 1, 'now': 1, 'hi': 2, 'there': 1}
Why don't you use the Counter class for this case:
from collections import Counter
cnt = Counter(text.split())
Elements are stored as dictionary keys and their counts as dictionary values. Then it's easy to keep the words that exceed your number n by iterating over the keys in a for loop, like
result = []
for k in cnt:
    if cnt[k] > n:
        result.append(k)
In result you'll get your list of words.
**Edited: sorry, that's if you need many words; BrianO has the right one for your case.
As luoluo says, use collections.Counter.
To get the item with the highest tally, use the Counter.most_common method with argument 1, which returns a list containing the single (word, tally) pair with the highest tally. That list is empty only if the "sentence" contains no words. So, the following function returns some word that occurs at least n times if there is one, and returns None otherwise:
from collections import Counter
def repeat_word_count(text, n):
    if not text: return None # guard against '' and None!
    counter = Counter(text.split())
    if not counter: return None # whitespace-only text has no words
    max_pair = counter.most_common(1)[0]
    return max_pair[0] if max_pair[1] >= n else None

Counting unique words in python

Put simply, my code so far is this:
from glob import glob
pattern = "D:\\report\\shakeall\\*.txt"
filelist = glob(pattern)
def countwords(fp):
    with open(fp) as fh:
        return len(fh.read().split())
print("There are", sum(map(countwords, filelist)), "words in the files.", "From directory", pattern)
I want to add code that counts the unique words in the files matching pattern (42 txt files in this path), but I don't know how. Can anybody help me?
The best way to count objects in Python is to use the collections.Counter class, which was created for that purpose. It acts like a Python dict but is a bit easier to use when counting. You can just pass it a list of objects and it counts them for you automatically.
>>> from collections import Counter
>>> c = Counter(['hello', 'hello', 1])
>>> print(c)
Counter({'hello': 2, 1: 1})
Counter also has some useful methods like most_common; see the documentation to learn more.
Another very useful method of the Counter class is update. After you've instantiated a Counter by passing a list of objects, you can pass more objects to update and it will continue counting without dropping the existing counts:
>>> from collections import Counter
>>> c = Counter(['hello', 'hello', 1])
>>> print(c)
Counter({'hello': 2, 1: 1})
>>> c.update(['hello'])
>>> print(c)
Counter({'hello': 3, 1: 1})
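Applied to the question, update can accumulate counts across all the files matched by the glob pattern. The sketch below is one possible way to do that; the function name `count_words_in_files`, the lowercasing, and the UTF-8 encoding are assumptions made for illustration.

```python
from collections import Counter
from glob import glob

def count_words_in_files(pattern):
    """Return a Counter of lowercased words across all files matching `pattern`."""
    totals = Counter()
    for path in glob(pattern):
        # Assumption: the files are UTF-8 text; adjust the encoding if they are not.
        with open(path, encoding="utf-8") as fh:
            # update() adds to the existing tallies instead of replacing them.
            totals.update(word.lower() for word in fh.read().split())
    return totals
```

With totals = count_words_in_files(pattern), len(totals) is the number of unique words and sum(totals.values()) is the total word count.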
print(len(set(w.lower() for w in open('filename.dat').read().split())))
This reads the entire file into memory, splits it into words on whitespace, converts each word to lower case, creates a set of the (unique) lowercase words, counts them, and prints the result.
If you want to get count of each unique word, then use dicts:
words = ['Hello', 'world', 'world']
count = {}
for word in words:
    if word in count:
        count[word] += 1
    else:
        count[word] = 1
And you will get the dict
{'Hello': 1, 'world': 2}
