Python - counting duplicate strings


I'm trying to write a function that will count the number of word duplicates in a string and then return that word if the number of duplicates exceeds a certain number (n). Here's what I have so far:
from collections import defaultdict
def repeat_word_count(text, n):
    words = text.split()
    tally = defaultdict(int)
    answer = []
    for i in words:
        if i in tally:
            tally[i] += 1
        else:
            tally[i] = 1
I don't know where to go from here when it comes to comparing the dictionary values to n.
How it should work:
repeat_word_count("one one was a racehorse two two was one too", 3) should return ['one']

Try
for i in words:
    tally[i] = tally.get(i, 0) + 1
instead of
for i in words:
    if i in tally:
        tally[words] += 1  # you are using words (the list) as the key; you should use i (the item)
    else:
        tally[words] = 1
If you simply want to count the words, collections.Counter would do fine.
>>> import collections
>>> a = collections.Counter("one one was a racehorse two two was one too".split())
>>> a
Counter({'one': 3, 'two': 2, 'was': 2, 'a': 1, 'racehorse': 1, 'too': 1})
>>> a['one']
3
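To connect the Counter back to the original question, here is a minimal sketch of the full function. It assumes "exceeds" means "occurs at least n times", since the question's example expects ['one'] for n = 3 and 'one' occurs exactly three times. It also assumes Python 3.7+, where the Counter keeps words in order of first appearance.

```python
from collections import Counter

def repeat_word_count(text, n):
    """Return the words that occur at least n times (assumed reading of the question)."""
    counts = Counter(text.split())
    # Counter is a dict subclass, so items() yields (word, count) pairs.
    return [word for word, count in counts.items() if count >= n]

print(repeat_word_count("one one was a racehorse two two was one too", 3))  # ['one']
```

If a strict "more than n" is wanted instead, change >= to >.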

Here is a way to do it:
from collections import defaultdict
tally = defaultdict(int)
text = "one two two three three three"
for i in text.split():
    tally[i] += 1
print(tally)  # defaultdict(<class 'int'>, {'one': 1, 'two': 2, 'three': 3})
Putting this in a function:
def repeat_word_count(text, n):
    output = []
    tally = defaultdict(int)
    for i in text.split():
        tally[i] += 1
    for k in tally:
        if tally[k] > n:
            output.append(k)
    return output
text = "one two two three three three four four four four"
repeat_word_count(text, 2)
Out[141]: ['three', 'four']

If what you want is a dictionary counting the words in a string, you can try this:
string = 'hello world hello again now hi there hi world'.split()
d = {}
for word in string:
    d[word] = d.get(word, 0) + 1
print(d)
Output:
{'hello': 2, 'world': 2, 'again': 1, 'now': 1, 'hi': 2, 'there': 1}

Why not use the Counter class for this case:
from collections import Counter
cnt = Counter(text.split())
Elements are stored as dictionary keys and their counts as dictionary values. It is then easy to keep the words whose count exceeds n by iterating over the keys in a for loop:
result = []
for k in cnt:
    if cnt[k] > n:
        result.append(k)
result then holds your list of words.
Edit: sorry, that applies if you need many words; BrianO has the right answer for your case.

As luoluo says, use collections.Counter.
To get the item with the highest tally, use the Counter.most_common method with argument 1, which returns a list holding the single pair (word, tally) with the maximum tally. If the "sentence" contains at least one word, that list is nonempty. So, the following function returns some word that occurs at least n times if there is one, and returns None otherwise:
from collections import Counter
def repeat_word_count(text, n):
    if not text: return None  # guard against '' and None!
    counter = Counter(text.split())
    max_pair = counter.most_common(1)[0]
    return max_pair[0] if max_pair[1] >= n else None
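One edge case worth illustrating: a string made only of whitespace is truthy but splits into no words, and most_common(1) on an empty Counter returns an empty list. The sketch below uses a hypothetical name, most_repeated_word, and assumes "at least n times" as the threshold, as the prose above states.

```python
from collections import Counter

def most_repeated_word(text, n):
    """Return the most frequent word if it occurs at least n times, else None."""
    counter = Counter((text or "").split())  # tolerate None and ''
    top = counter.most_common(1)             # [] when there are no words at all
    if not top:
        return None
    word, count = top[0]
    return word if count >= n else None
```

Checking the list before indexing avoids an IndexError on inputs such as "   ".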

Related

How do I convert a string to a dictionary, where each entry to the dictionary is assigned a value?

At the minute, I have this code:
text = "Here's the thing. She doesn't have anything to prove, but she is going to anyway. That's just her character. She knows she doesn't have to, but she still will just to show you that she can. Doubt her more and she'll prove she can again. We all already know this and you will too."
d = {}
lst = []
def findDupeWords():
    string = text.lower()
    # Split the string into words using built-in function
    words = string.split(" ")
    print("Duplicate words in a given string : ")
    for i in range(0, len(words)):
        count = 1
        for j in range(i+1, len(words)):
            if(words[i] == (words[j])):
                count = count + 1
                # Set words[j] to 0 to avoid printing visited word
                words[j] = "0"
        # Displays the duplicate word if count is greater than 1
        if(count > 1 and words[i] != "0"):
            print(words[i])
for key in d:
    text = text.replace(key,d[key])
print(text)
findDupeWords()
The output I get when I run this is:
Here's the thing. She doesn't have anything to prove, but she is going to anyway. That's just her character. She knows she doesn't have to, but she still will just to show you that she can. Doubt her more and she'll
prove she can again. We all already know this and you will too.
Duplicate words in a given string :
she
doesn't
have
to
but
just
her
will
you
and
How can I turn this list of words into a dictionary, like the following:
{'she': 1, "doesn't": 2, 'have': 3, 'to': 4}, etc.

Don't reinvent the wheel; use collections.Counter from the standard library.
from collections import Counter
def findDupeWords(text):
    counter = Counter(text.lower().split(" "))
    for word in counter:
        if counter[word] > 1:
            print(word)
text = "Here's the thing. She doesn't have anything to prove, but she is going to anyway. That's just her character. She knows she doesn't have to, but she still will just to show you that she can. Doubt her more and she'll prove she can again. We all already know this and you will too."
findDupeWords(text)
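Note that splitting on spaces keeps punctuation attached, so "prove," and "prove" count as different words. A possible variant, sketched here under the assumption that a word is any run of letters and apostrophes, extracts words with a regular expression first. The name find_dupe_words and the pattern are illustrative choices, not part of the answer above.

```python
import re
from collections import Counter

def find_dupe_words(text):
    """Return words that occur more than once, ignoring case and punctuation."""
    # Assumed word definition: runs of lowercase letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counter = Counter(words)
    return [word for word, count in counter.items() if count > 1]

print(find_dupe_words("She can. She can't, she can!"))  # ['she', 'can']
```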

Well, you could replace your call to print with an assignment to a dictionary:
def findDupeWords():
    duplicates = {}
    duplicate_counter = 0
    ...
        # Displays the duplicate word if count is greater than 1
        if(count > 1 and words[i] != "0"):
            duplicate_counter += 1
            duplicates[words[i]] = duplicate_counter
But there are easier ways to achieve this, for example with collections.Counter:
from collections import Counter
words = text.lower().split()
word_occurrences = Counter(words)
dupes_in_order = sorted(
    (word for word in set(words) if word_occurrences[word] > 1),
    key=lambda w: words.index(w),
)
dupe_dictionary = {word: i+1 for i, word in enumerate(dupes_in_order)}
Afterwards:
>>> dupe_dictionary
{'she': 1,
"doesn't": 2,
'have': 3,
'to': 4,
'but': 5,
'just': 6,
'her': 7,
'will': 8,
'you': 9,
'and': 10}
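On Python 3.7+, a Counter remembers the order in which keys were first inserted, so the same numbering can be built without sorting by words.index (which rescans the list for every word). A minimal sketch, with dupe_ranks as a hypothetical helper name:

```python
from collections import Counter

def dupe_ranks(text):
    """Map each repeated word to its rank by first appearance (1-based)."""
    counts = Counter(text.lower().split())
    # Iteration follows first-appearance order on Python 3.7+.
    dupes = [word for word, count in counts.items() if count > 1]
    return {word: rank for rank, word in enumerate(dupes, start=1)}

print(dupe_ranks("She said she would but she did not say when she would"))
# {'she': 1, 'would': 2}
```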

Python- How to return a dictionary that counts occurrences in a list of strings?

I'm trying to make a function that counts occurrences of the first letters of a list of strings and returns them as a dictionary.
For example:
list=["banana","ball", "cat", "hat"]
dictionary would look like: {b:2, c:1, h:1}
Here is the code I have which iterates but doesn't count properly. That's where I'm getting stuck. How do I update the values to be count?
def count_starts(text):
    new_list=[]
    for word in range(len(text)):
        for letter in text[word]:
            if letter[0]=='':
                new_list.append(None)
            else:
                new_list.append(letter[0])
    new_dict= {x:new_list.count(x) for x in new_list}
    return new_dict
Also, how can I avoid the out of range error given the following format:
def count_starts(text):
    import collections
    c=collections.Counter(x[0] for x in text)
    return c
Also, what do I need to do if the list contains "None" as a value? I need to count None.

The problem with your code is that you iterate over every letter of each word. letter is already a single-character string, so letter[0] is just that same character.
You can do it more simply. There is no need for a double loop; just take the first letter of each word:
for word in text:
    if word:  # to filter out empty strings
        first_letter = word[0]
But once again, collections.Counter fed a generator expression that extracts the first letter is the best choice, and a one-liner (with an added condition to filter out empty strings):
import collections
c = collections.Counter(x[0] for x in ["banana","ball", "cat", "", "hat"] if x)
c is now a dict subclass instance: Counter({'b': 2, 'c': 1, 'h': 1})
One variant that inserts None instead of filtering out empty values would be:
c = collections.Counter(x[0] if x else None for x in ["banana","ball", "cat", "", "hat"])
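Wrapped in a function, the None variant might look like the sketch below. It assumes that both empty strings and None entries in the input should be tallied under the key None, which is one reading of the question's "I need to count None".

```python
from collections import Counter

def count_starts(words):
    """Count first letters; empty strings and None are tallied under None."""
    # `word[0] if word else None` avoids the IndexError on '' and the
    # TypeError on None that a bare word[0] would raise.
    return Counter(word[0] if word else None for word in words)

counts = count_starts(["banana", "ball", "cat", "", None, "hat"])
print(counts["b"], counts[None])  # 2 2
```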

my_list=["banana","ball", "cat", "hat"]
my_dict = dict()
for word in my_list:
    try:
        my_dict[word[0]] += 1
    except KeyError:
        my_dict[word[0]] = 1
This increases the value by 1 if the key already exists; if the key has not been seen before, it creates it with the value 1.
Alternative:
my_list=["banana","ball", "bubbles", "cat", "hat"]
my_dict = dict()
for word in my_list:
    if word[0] in my_dict:
        my_dict[word[0]] += 1
    else:
        my_dict[word[0]] = 1

Also, what do I need to do if the list contains "None" as a value? I need to count None.
To remove None:
lst_no_Nones = [x for x in lst if x is not None]
To count None:
total_None = sum(x is None for x in lst)

You need Counter:
from collections import Counter
lst = ["banana","ball", "cat", "hat"]
dct = Counter(word[0] for word in lst)
Now dct stores the number of times each first letter occurs in lst:
Counter({'b': 2, 'c': 1, 'h': 1})

using lambda and dictionaries functions

I wrote this function:
def make_upper(words):
    for word in words:
        ind = words.index(word)
        words[ind] = word.upper()
I also wrote a function that counts the frequency of occurrences of each letter:
def letter_cnt(word,freq):
    for let in word:
        if let == 'A': freq[0]+=1
        elif let == 'B': freq[1]+=1
        elif let == 'C': freq[2]+=1
        elif let == 'D': freq[3]+=1
        elif let == 'E': freq[4]+=1

Counting letter frequency would be much more efficient with a dictionary, yes. Note that you are manually lining up each letter with a number ("A" with 0, et cetera). Wouldn't it be easier if we could have a data type that directly associated a letter with the number of times it occurs, without adding an extra set of numbers in between?
Consider the code:
freq = {"A":0, "B":0, "C":0, "D":0, ... ..., "Z":0}
for letter in text:
    freq[letter] += 1
This dictionary is used to count frequencies much more efficiently than your current code does. You just add one to an entry for a given letter each time you see it.
I will also mention that you can count frequencies effectively with certain libraries. If you are interested in analyzing frequencies, look into collections.Counter() and possibly the collections.Counter.most_common() method.
Whether or not you decide to just use collections.Counter(), I would attempt to learn why dictionaries are useful in this context.
One final note: I personally found typing out the values for the "freq" dictionary to be tedious. If you want you could construct an empty dictionary of alphabet letters on-the-fly with this code:
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
freq = {letter:0 for letter in alphabet}
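Putting the pieces of this answer together, a self-contained sketch might look as follows. The function name count_letters is hypothetical, and skipping characters that are not A to Z is an added assumption, since indexing freq with a space or digit would otherwise raise KeyError.

```python
import string

def count_letters(text):
    """Return a dict mapping each uppercase letter A-Z to its frequency in text."""
    freq = {letter: 0 for letter in string.ascii_uppercase}
    for ch in text.upper():
        if ch in freq:  # assumption: ignore spaces, digits and punctuation
            freq[ch] += 1
    return freq

freq = count_letters("Hello, World!")
print(freq["L"], freq["O"], freq["Z"])  # 3 2 0
```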

If you want to convert strings in the list to upper case using lambda, you may use it with map() as:
>>> words = ["Hello", "World"]
>>> map(lambda word: word.upper(), words) # In Python 2
['HELLO', 'WORLD']
# In Python 3, use it as: list(map(...))
As per the map() documentation (Python 2):
map(function, iterable, ...)
Apply function to every item of iterable and return a list of the results.
To find the frequency of each character in a word, you may use collections.Counter() (a dict subclass) as:
>>> from collections import Counter
>>> my_word = "hello world"
>>> c = Counter(my_word)
# where c holds dictionary as:
# {'l': 3,
# 'o': 2,
# ' ': 1,
# 'e': 1,
# 'd': 1,
# 'h': 1,
# 'r': 1,
# 'w': 1}
As per the Counter documentation:
A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values.

For the letter counting, don't reinvent the wheel: use collections.Counter.
A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value including zero or negative counts. The Counter class is similar to bags or multisets in other languages.
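The remark about zero and negative counts can be seen with Counter.subtract, which keeps such entries, while the unary plus operator drops them. A short illustration, with variable names chosen only for this example:

```python
from collections import Counter

stock = Counter("aab")       # a: 2, b: 1
stock.subtract("abbb")       # subtract() keeps zero and negative results
print(dict(stock))           # {'a': 1, 'b': -2}

positive_only = +stock       # unary plus removes counts that are <= 0
print(dict(positive_only))   # {'a': 1}
```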

def punc_remove(words):
    for word in words:
        if word.isalnum() == False:
            charl = []
            for char in word:
                if char.isalnum()==True:
                    charl.append(char)
            ind = words.index(word)
            delimeter = ""
            words[ind] = delimeter.join(charl)
def letter_cnt_dic(word,freq_d):
    for let in word:
        freq_d[let] += 1
import string
def letter_freq(fname):
    fhand = open(fname)
    freqs = dict()
    alpha = list(string.ascii_uppercase)  # string.uppercase exists only in Python 2
    for let in alpha: freqs[let] = freqs.get(let,0)
    for line in fhand:
        line = line.rstrip()
        words = line.split()
        punc_remove(words)
        #map(lambda word: word.upper(),words)
        words = [word.upper() for word in words]
        for word in words:
            letter_cnt_dic(word,freqs)
    fhand.close()
    return freqs.values()

You can read the docs about Counter and list comprehensions, or run this as a small demo:
from collections import Counter
words = ["acdefg","abcdefg","abcdfg"]
# A list comprehension: no need for lambda or map
new_words = [word.upper() for word in words]
print(new_words)
# Let's create a dict and a Counter
letters = {}
letters_counter = Counter()
for word in words:
    # Adding Counters sums their counts.
    letters_counter += Counter(word)
    # We can also do it by hand
    for letter in word:
        letters[letter] = letters.get(letter,0) + 1
print(letters_counter)
print(letters)

Python - Count letters in random strings

I have a bunch of integers which are allocated values using the random module, then converted to letters depending on their position of the alphabet.
I then combine a random sample of these variables into a "master" variable, which is printed to the console.
I want to then count the occurrence of each character, which will later be written to an output file.
Any help on how I would go about doing this?

>>> from collections import Counter
>>> for letter, count in Counter("aaassd").items():
...     print("letter", letter, "count", count)
...
letter s count 2
letter a count 3
letter d count 1

It is probably better to use collections.Counter(), but here is a dict comprehension:
>>> li = 'aaassd'
>>> res = {ch: sum(1 for x in li if x==ch) for ch in set(li)}
>>> res
{'d': 1, 's': 2, 'a': 3}
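The question also mentions generating the letters randomly and writing the counts to an output file. A rough sketch of both steps is below; the helper names random_letters and write_counts, the lowercase alphabet, and the "letter count" line format are all assumptions made for illustration.

```python
import random
import string
from collections import Counter

def random_letters(length, seed=None):
    """Build a random lowercase string (seed is optional, for repeatable runs)."""
    rng = random.Random(seed)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def write_counts(text, out):
    """Write one 'letter count' line per distinct character, sorted by letter."""
    for letter, count in sorted(Counter(text).items()):
        out.write(f"{letter} {count}\n")
```

Here out can be any writable text file object, for example the result of open("counts.txt", "w") inside a with block.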

sum of counter object of a list within a list

I am trying to find the total number of occurrences of each word from a list across multiple lists. The real list of lists is huge, so I am using just a dummy instance:
multiple=[['apple','ball','cat'],['apple','ball'],['apple','cat'].......]
words=['apple','ball','cat','duck'......]
word = 'apple'
cnt = Counter()
total = 0
for i in multiple:
    for j in i:
        if word in j:
            cnt[word] +=1
            total += cnt[word]
I wanted an output like this:
{'apple':3,'ball':2,'cat':2}

You can just feed the Counter a generator expression:
cnt = Counter(word for sublist in multiple for word in sublist)
cnt
Out[40]: Counter({'apple': 3, 'ball': 2, 'cat': 2})
sum(cnt.values())
Out[41]: 7
I didn't really see the point of your words list. You didn't use it.
If you need to filter out words that are not in words, make words a set, not a list.
words = {'apple','ball','cat','duck'}
cnt = Counter(word for sublist in multiple for word in sublist if word in words)
Otherwise each `word in words` test is a linear scan of the list, and you get O(n**2) behavior in what should be an O(n) operation.
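For reference, the set-based filter can be run end to end as in the sketch below. It uses itertools.chain.from_iterable to flatten the sublists, which is equivalent to the nested generator expression; the variable name wanted is an illustrative stand-in for the words set.

```python
from collections import Counter
from itertools import chain

multiple = [['apple', 'ball', 'cat'], ['apple', 'ball'], ['apple', 'cat']]
wanted = {'apple', 'ball', 'cat', 'duck'}  # a set gives O(1) membership tests

# Flatten the sublists lazily and count only the words we care about.
cnt = Counter(word for word in chain.from_iterable(multiple) if word in wanted)
total = sum(cnt.values())

print(dict(cnt))  # {'apple': 3, 'ball': 2, 'cat': 2}
print(total)      # 7
```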

This works in Python 2.7 and Python 3.x:
from collections import Counter
multiple=[['apple','ball','cat'],['apple','ball'],['apple','cat']]
words=['apple','ball','cat','duck']
cnt = Counter()
total = 0
for i in multiple:
    for word in i:
        if word in words:
            cnt[word] +=1
            total += 1
print(cnt) #: Counter({'apple': 3, 'ball': 2, 'cat': 2})
print(dict(cnt)) #: {'apple': 3, 'ball': 2, 'cat': 2}
print(total) #: 7
print(sum(cnt.values())) #: 7
In Python 2.x, .itervalues() avoids building an intermediate list, so prefer it over .values() there, even though both work.
A slightly shorter solution, based on roippi's answer:
from collections import Counter
multiple=[['apple','ball','cat'],['apple','ball'],['apple','cat']]
cnt = Counter(word for sublist in multiple for word in sublist)
print(cnt) #: Counter({'apple': 3, 'ball': 2, 'cat': 2})
