JSON File: Count the Full Number of Words with Python

For a current research project, I want to measure the relative occurrence of a given word within a JSON file. I already have the number of unique words in the file and the number of occurrences of each (e.g. "technology": 325), but I still lack a way to get the full word count.
The code I am using for the full word count (total = sum(d[key])) raises the error below. I have looked at solutions to similar problems but have not found one that applies. Is there a clean way to solve this?
total = sum(d[key]) - TypeError: 'int' object is not iterable
The corresponding code section looks like this:
# Create an empty dictionary
d = dict()
# processing:
for row in data:
    line = row['Text Main']
    # Remove the leading spaces and newline character
    line = line.strip()
    # Convert the characters in line to
    # lowercase to avoid case mismatch
    line = line.lower()
    # Remove the punctuation marks from the line
    line = line.translate(line.maketrans("", "", string.punctuation))
    # Split the line into words
    words = line.split(" ")
    # Iterate over each word in line
    for word in words:
        # Check if the word is already in dictionary
        if word in d:
            # Increment count of word by 1
            d[word] = d[word] + 1
        else:
            # Add the word to dictionary with count 1
            d[word] = 1
# Print the contents of dictionary
for key in list(d.keys()):
    print(key, ":", d[key])
# Count the total number of words
total = sum(d[key])
print(total)

https://docs.python.org/3/library/functions.html#sum
The signature is sum(iterable, /, start=0), but you are passing it d[key], which is a single integer. That fails because sum expects an iterable: roughly, anything you can loop over with a for loop, such as a list.
You could modify your # Print the contents of dictionary loop in one of the two following ways:
# Print the contents of dictionary
total = 0
for key in list(d.keys()):
    print(key, ":", d[key])
    # Count the total number of words
    total += d[key]
    print(total)  # running total so far
print("Actual total:", total)
Or, more condensed:
# Print the contents of dictionary
for key in list(d.keys()):
    print(key, ":", d[key])
# Get the total word count
total = sum(d.values())

Python's built-in sum function takes an iterable as its argument, but you are passing a single number to it. Your code is equivalent to
total = sum(1)
but sum needs an iterable to add up, e.g.
sum([1,2,3,4,5,6,7])
If you want to compute the total number of words, you can try:
sum(d.values())

d = dict()
d['A'] = 1
d['B'] = 2
d['C'] = 3
# Count the total number of words
total = sum(d.values())
print(total)
for key in list(d.keys()):
    print(key, ":", d[key], float(d[key]) / total)
d[key] is a single int
d.values() is an iterable of all the counts (a list in Python 2, a view object in Python 3)
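Putting the answers together for the original goal (the relative occurrence of one word), a minimal sketch might look like the following. The inline JSON sample and the helper name word_counts are assumptions made for illustration; only the 'Text Main' key comes from the question.

```python
import json
import string
from collections import Counter

# Hypothetical sample standing in for the JSON file from the question.
raw = '[{"Text Main": "Technology drives change."}, {"Text Main": "Change needs technology, too."}]'
data = json.loads(raw)


def word_counts(rows, field="Text Main"):
    """Count lowercase, punctuation-free words across all rows."""
    counts = Counter()
    table = str.maketrans("", "", string.punctuation)
    for row in rows:
        counts.update(row[field].lower().translate(table).split())
    return counts


counts = word_counts(data)
total = sum(counts.values())           # full word count
share = counts["technology"] / total   # relative occurrence of one word
print(total, share)
```

Using split() with no argument also avoids counting the empty strings that split(" ") produces when two spaces follow each other.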

Related

How to extract words from repeating strings

Here I have a string in a list:
['aaaaaaappppppprrrrrriiiiiilll']
I want to get the word 'april' from the list, not just once, but as many times as 'april' actually occurs in the string.
The output should be something like:
['aprilaprilapril']
Because the word 'april' occurred three times in that string.
Well, the word itself doesn't actually occur three times; each of its characters does. So I want to arrange these characters into 'april' as many times as they appear in the string.
My idea is basically to extract words from arbitrary strings: not just a single copy of the word, but every copy that the string's characters can form, with the characters put in the order I want.
But there are some annoying conditions: you can't delete all the elements in the list and simply replace them with the word 'april' (you can't replace the whole string with 'april'); you can only extract 'april' from the string. You also can't delete the list that holds the string. Think of the string as very important data: we want only some of it, in order, and we need to delete everything that doesn't match our "data chain" (the word 'april'). But if you delete the whole string, you lose all the important data, and you don't know how to build another "data chain", so we can't just put the word 'april' back in the list.
If anyone knows how to solve my weird problem, please help me out. I am a beginner Python programmer. Thank you!

One way is to use itertools.groupby, which groups consecutive identical characters, and then unpack the groups into zip, which yields n tuples, where n is the number of characters in the smallest group (i.e. the group with the fewest characters):
from itertools import groupby
result = ''
for each in zip(*[list(g) for k, g in groupby('aaaaaaappppppprrrrrriiiiiilll')]):
    result += ''.join(each)
# result = 'aprilaprilapril'
Another possible solution is to create a custom counter that counts each unique run of characters. Please note that this method relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 onward (in CPython 3.6 it is an implementation detail, and in earlier versions the order is not guaranteed):
def getCounts(strng):
    if not strng:
        return [], 0
    counts = {}
    current = strng[0]
    for c in strng:
        if c in counts.keys():
            if current==c:
                counts[c] += 1
        else:
            current = c
            counts[c] = 1
    return counts.keys(), min(counts.values())
result = ''
counts=getCounts('aaaaaaappppppprrrrrriiiiiilll')
for i in range(counts[1]):
    result += ''.join(counts[0])
# result = 'aprilaprilapril'

How about using regex?
import re
word = 'april'
text = 'aaaaaaappppppprrrrrriiiiiilll'
regex = "".join(f"({c}+)" for c in word)
match = re.match(regex, text)
if match:
    # Find the lowest amount of character repeats
    lowest_amount = min(len(g) for g in match.groups())
    print(word * lowest_amount)
else:
    print("no match")
Outputs:
aprilaprilapril
Works like a charm
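One caveat worth illustrating: re.match only matches at the start of the text, so a string with leading unrelated characters reports "no match". The sketch below is a variation, not the answerer's code; it swaps in re.search and re.escape, and the function name repeat_word is made up for the example.

```python
import re


def repeat_word(word, text):
    """Return `word` repeated by the shortest run of its letters found in `text`."""
    if not word:
        return ""
    # re.escape guards against characters that are regex metacharacters.
    pattern = "".join(f"({re.escape(c)}+)" for c in word)
    match = re.search(pattern, text)  # search, not match: allows a leading prefix
    if not match:
        return ""
    return word * min(len(g) for g in match.groups())


print(repeat_word("april", "zzzaaaaaaappppppprrrrrriiiiiilll"))
```

Like the original pattern, this still requires the runs to be adjacent and in the word's order.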

Here is a more native approach, with plain iteration.
It has a time complexity of O(n).
It uses an outer loop to iterate over the characters in the search key, then an inner while loop that consumes all occurrences of that character in the search string while maintaining a counter. Once all consecutive occurrences of the current letter have been consumed, it updates minLetterCount to be the minimum of its previous value and this new count. Once we have iterated over all letters in the key, we return this accumulated minimum.
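def countCompleteSequenceOccurences(searchString, key):
    # Start at the first occurrence of the key's first letter so that a
    # leading prefix of unrelated characters is skipped (the test cases
    # with a leading "z" expect 3).
    left = searchString.find(key[0]) if key else -1
    if left == -1:
        return 0
    minLetterCount = 0
    letterCount = 0
    for i, searchChar in enumerate(key):
        while left < len(searchString) and searchString[left] == searchChar:
            letterCount += 1
            left += 1
        minLetterCount = letterCount if i == 0 else min(minLetterCount, letterCount)
        letterCount = 0
    return minLetterCount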
Testing:
testCasesToOracles = {
    "aaaaaaappppppprrrrrriiiiiilll": 3,
    "ppppppprrrrrriiiiiilll": 0,
    "aaaaaaappppppprrrrrriiiiii": 0,
    "aaaaaaapppppppzzzrrrrrriiiiiilll": 0,
    "pppppppaaaaaaarrrrrriiiiiilll": 0,
    "zaaaaaaappppppprrrrrriiiiiilll": 3,
    "zzzaaaaaaappppppprrrrrriiiiiilll": 3,
    "aaaaaaappppppprrrrrriiiiiilllzzz": 3,
    "zzzaaaaaaappppppprrrrrriiiiiilllzzz": 3,
}
key = "april"
for case, oracle in testCasesToOracles.items():
    result = countCompleteSequenceOccurences(case, key)
    assert result == oracle
Usage:
key = "april"
result = countCompleteSequenceOccurences("aaaaaaappppppprrrrrriiiiiilll", key)
print(result * key)
Output:
aprilaprilapril

A word can only occur as many times as its least frequent letter recurs. To account for repeated letters in the word (for example, appril), you need to factor that count out. Here is one way of doing this using collections.Counter:
from collections import Counter
def count_recurrence(kernel, string):
    # we need to count both strings
    kernel_counter = Counter(kernel)
    string_counter = Counter(string)
    # now get the effective count by dividing the occurrence in string
    # by the occurrence in kernel (floor division keeps it an integer)
    effective_counter = {
        k: string_counter.get(k, 0) // v
        for k, v in kernel_counter.items()
    }
    # min occurrence of kernel is min of effective counter
    min_recurring_count = min(effective_counter.values())
    return kernel * min_recurring_count
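A short usage sketch may help show what this approach does and does not check. The function is restated in condensed form so the block runs on its own, and it assumes a non-empty kernel. Note that letter counts ignore order, so a scrambled string still yields the word.

```python
from collections import Counter


def count_recurrence(kernel, string):
    """Repeat `kernel` as often as the letters available in `string` allow."""
    need = Counter(kernel)
    have = Counter(string)
    # Floor-divide so a letter used twice in the kernel needs two copies per repeat.
    return kernel * min(have[k] // v for k, v in need.items())


text = "aaaaaaappppppprrrrrriiiiiilll"
print(count_recurrence("april", text))      # limited by the three l's
print(count_recurrence("appril", text))     # 7 p's allow only 3 double-p copies
print(count_recurrence("april", "lirpa"))   # order is not checked
```

If the letters must appear as consecutive runs in the word's order, the groupby or regex approaches are the better fit.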

finding the word with most repeated letters from a string containing a sentence in python

I want to find the word with the most repeated letters in a given sentence.
I know how to find the most repeated letter in the sentence, but I can't work out how to print the word.
For example:
this is an elementary test example
should print
elementary
def most_repeating_word(strg):
    words =strg.split()
    for words1 in words:
        dict1 = {}
        max_repeat_count = 0
        for letter in words1:
            if letter not in dict1:
                dict1[letter] = 1
            else:
                dict1[letter] += 1
            if dict1[letter]> max_repeat_count:
                max_repeat_count = dict1[letter]
                most_repeated_char = letter
                result=words1
    return result

You are resetting the max_repeat_count variable to 0 for each word. You should move it up in your code, above the first for loop, like this:
def most_repeating_word(strg):
    words =strg.split()
    max_repeat_count = 0
    for words1 in words:
        dict1 = {}
        for letter in words1:
            if letter not in dict1:
                dict1[letter] = 1
            else:
                dict1[letter] += 1
            if dict1[letter]> max_repeat_count:
                max_repeat_count = dict1[letter]
                most_repeated_char = letter
                result=words1
    return result
Hope this helps

Use a regex instead. It is simple and easy. Iteration is an expensive operation compared to regular expressions.
Please refer to the solution for your problem in this post:
Count repeated letters in a string

Interesting exercise! +1 for using Counter(). Here's my suggestion also making use of max() and its key argument, and the * unpacking operator.
For a final solution note that this (and the other proposed solutions to the question) don't currently consider case, other possible characters (digits, symbols etc) or whether more than one word will have the maximum letter count, or if a word will have more than one letter with the maximum letter count.
from collections import Counter
def most_repeating_word(strg):
    # Create list of word tuples: (word, max_letter, max_count)
    counters = [ (word, *max(Counter(word).items(), key=lambda item: item[1]))
                 for word in strg.split() ]
    max_word, max_letter, max_count = max(counters, key=lambda item: item[2])
    return max_word
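The caveats mentioned above (case, non-letter characters, ties between words) could be handled along these lines. This is only a sketch under assumed rules: letters are compared case-insensitively, non-letters are ignored, and all tied words are returned. The name most_repeating_words is invented for the example.

```python
from collections import Counter


def most_repeating_words(sentence):
    """Return (words, count): every word tied for the highest single-letter repeat count."""
    best_count, best_words = 0, []
    for word in sentence.split():
        letters = Counter(ch for ch in word.casefold() if ch.isalpha())
        if not letters:
            continue  # skip tokens with no letters, e.g. "123" or "--"
        top = max(letters.values())
        if top > best_count:
            best_count, best_words = top, [word]
        elif top == best_count:
            best_words.append(word)
    return best_words, best_count


print(most_repeating_words("this is an elementary test example"))
print(most_repeating_words("Anna otto"))
```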

word="SBDDUKRWZHUYLRVLIPVVFYFKMSVLVEQTHRUOFHPOALGXCNLXXGUQHQVXMRGVQTBEYVEGMFD"
def most_repeating_word(strg):
    # Counts letters within a single string and returns every letter
    # that reaches the maximum count (named counts, not dict, to avoid
    # shadowing the built-in).
    counts={}
    max_repeat_count = 0
    for letter in strg:
        if letter not in counts:
            counts[letter] = 1
        else:
            counts[letter] += 1
        if counts[letter]> max_repeat_count:
            max_repeat_count = counts[letter]
    result={}
    for letter, value in counts.items():
        if value==max_repeat_count:
            result[letter]=value
    return result
print(most_repeating_word(word))

10 most frequent words in a string Python

I need to display the 10 most frequent words in a text file, from the most frequent to the least as well as the number of times it has been used. I can't use the dictionary or counter function. So far I have this:
import urllib
cnt = 0
i=0
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")
uniques = []
for line in txtFile:
    words = line.split()
    for word in words:
        if word not in uniques:
            uniques.append(word)
for word in words:
    while i<len(uniques):
        i+=1
        if word in uniques:
            cnt += 1
print cnt
Now I think I should look for every word in the array 'uniques' and see how many times it is repeated in this file and then add that to another array that counts the instance of each word. But this is where I am stuck. I don't know how to proceed.
Any help would be appreciated. Thank you

The above problem can easily be solved with Python's collections module (note that this uses Counter, which the question rules out).
The solution is below.
from collections import Counter
data_set = ("Welcome to the world of Geeks "
            "This portal has been created to provide well written well "
            "thought and well explained solutions for selected questions "
            "If you like Geeks for Geeks and would like to contribute "
            "here is your chance You can write article and mail your article "
            " to contribute at geeksforgeeks org See your article appearing on "
            "the Geeks for Geeks main page and help thousands of other Geeks. ")
# split() returns list of all the words in the string
split_it = data_set.split()
# Pass the split_it list to instance of Counter class.
Counters_found = Counter(split_it)
#print(Counters)
# most_common(k) produces the k most frequently encountered
# input values and their respective counts.
most_occur = Counters_found.most_common(4)
print(most_occur)

You're on the right track. Note that this algorithm is quite slow because for each unique word, it iterates over all of the words. A much faster approach without hashing would involve building a trie.
# The following assumes that we already have alice30.txt on disk.
# Start by splitting the file into lowercase words.
with open('alice30.txt') as f:
    words = f.read().lower().split()
# Get the list of unique words.
uniques = []
for word in words:
    if word not in uniques:
        uniques.append(word)
# Make a list of (count, unique) tuples.
counts = []
for unique in uniques:
    count = 0  # Initialize the count to zero.
    for word in words:  # Iterate over the words.
        if word == unique:  # Is this word equal to the current unique?
            count += 1  # If so, increment the count
    counts.append((count, unique))
counts.sort()  # Sorting the list puts the lowest counts first.
counts.reverse()  # Reverse it, putting the highest counts first.
# Print the ten words with the highest counts.
for i in range(min(10, len(counts))):
    count, word = counts[i]
    print('%s %d' % (word, count))
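The nested loops above compare every unique word against every word. One alternative that still avoids dictionaries and Counter, offered here as a sketch rather than the trie the answer mentions, is to sort the words first so that equal words sit next to each other and each run can be counted in a single pass. The function name top_words and the sample sentence are made up for illustration.

```python
def top_words(words, n=10):
    """Return up to n (count, word) pairs, most frequent first, using only lists."""
    ordered = sorted(words)          # equal words end up next to each other
    counts = []
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1                   # advance to the end of this run
        counts.append((j - i, ordered[i]))
        i = j
    counts.sort(reverse=True)        # highest count first; ties in reverse alphabetical order
    return counts[:n]


sample = "the cat and the hat and the bat".split()
print(top_words(sample, 3))
```

The sort dominates the cost, so this runs in O(n log n) rather than the quadratic time of the nested loops.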

from string import punctuation  # you will need it to strip the punctuation
from urllib.request import urlopen  # Python 3 location of urlopen
txtFile = urlopen("http://textfiles.com/etext/FICTION/alice30.txt")
counter = {}
for line in txtFile:
    # urlopen yields bytes in Python 3, so decode each line first
    words = line.decode("utf-8", errors="replace").split()
    for word in words:
        k = word.strip(punctuation).lower()  # "the"/"The" or "you"/"You" are counted as one word
        # you still have words like I've, you're, Alice's
        # you could change re to are, ve to have, etc...
        if "'" in k:
            ks = k.split("'")
        else:
            ks = [k]
        # now the tally
        for k in ks:
            counter[k] = counter.get(k, 0) + 1
# and sort the counter by the value, which holds the tally
for word in sorted(counter, key=lambda k: counter[k], reverse=True)[:10]:
    print(word, "\t", counter[word])

from urllib.request import urlopen  # Python 3 location of urlopen
lines = urlopen("http://textfiles.com/etext/FICTION/alice30.txt").readlines()
# decode the bytes and join the lines with spaces
txtFile = " ".join(line.decode("utf-8", errors="replace") for line in lines)
# remove everything that's not alphanumeric or whitespace
txtFile = "".join(char for char in txtFile if char.isalnum() or char.isspace())
word_counter = {}
for word in txtFile.split(" "):  # split at every space
    if len(word) > 0 and word != '\r\n':
        if word not in word_counter:  # if 'word' not in word_counter, add it, and set value to 1
            word_counter[word] = 1
        else:
            word_counter[word] += 1  # if 'word' already in word_counter, increment it by 1
# sort the dict keys by their values, from top to bottom, and take the top 10 items
for i, word in enumerate(sorted(word_counter, key=word_counter.get, reverse=True)[:10]):
    print("%s: %s - %s" % (i + 1, word, word_counter[word]))
output:
1: the - 1432
2: and - 734
3: to - 703
4: a - 579
5: of - 501
6: she - 466
7: it - 440
8: said - 434
9: I - 371
10: in - 338
This method ensures that only alphanumeric characters and whitespace end up in the counter, although that doesn't matter much here.

Personally I'd make my own implementation of collections.Counter. I assume you know how that object works, but if not I'll summarize:
import collections
text = "some words that are mostly different but are not all different not at all"
words = text.split()
resulting_count = collections.Counter(words)
# {'all': 2,
# 'are': 2,
# 'at': 1,
# 'but': 1,
# 'different': 2,
# 'mostly': 1,
# 'not': 2,
# 'some': 1,
# 'that': 1,
# 'words': 1}
We can certainly sort that based on frequency by using the key keyword argument of sorted, and return the first 10 items in that list. However that doesn't much help you because you don't have Counter implemented. I'll leave THAT part as an exercise for you, and show you how you might implement Counter as a function rather than an object.
def counter(iterable):
    d = {}
    for element in iterable:
        if element in d:
            d[element] += 1
        else:
            d[element] = 1
    return d
Not difficult, actually. Go through each element of an iterable. If that element is NOT in d, add it to d with a value of 1. If it IS in d, increment that value. It's more easily expressed by:
def counter(iterable):
    d = {}
    for element in iterable:
        # dict.get returns 0 for a missing key, so one line covers both cases
        d[element] = d.get(element, 0) + 1
    return d
Note that in your use case, you probably want to strip out the punctuation and possibly casefold the whole thing (so that someword gets counted the same as Someword rather than as two separate words). I'll leave that to you as well, but I will point out str.strip takes an argument as to what to strip out, and string.punctuation contains all the punctuation you're likely to need.
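As a rough illustration of that last note, the counter function could be fed words that have already been stripped and case-folded. The helper name normalized_words and the sample text are assumptions for this sketch, and sorting with a key is just one way to pick the top ten.

```python
import string


def counter(iterable):
    """Tally how often each element occurs, as in the answer above."""
    d = {}
    for element in iterable:
        d[element] = d.get(element, 0) + 1
    return d


def normalized_words(text):
    """Yield case-folded words with surrounding punctuation stripped."""
    for raw in text.split():
        word = raw.strip(string.punctuation).casefold()
        if word:                      # drop tokens that were only punctuation
            yield word


text = "Someword, someword! Another -- word; another WORD."
counts = counter(normalized_words(text))
top = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:10]
print(top)
```

Note that str.strip only removes punctuation at the ends of a token, so apostrophes inside words such as "Alice's" are kept.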

You can also do it with pandas DataFrames and get the result in a convenient form, as a table of words and their frequencies, ordered by count.
import pandas as pn
def count_words(words_list):
    words_df = pn.DataFrame(words_list)
    words_df.columns = ["word"]
    words_df_unique = pn.DataFrame(words_df["word"].unique())
    words_df_unique.columns = ["unique"]
    words_df_unique["count"] = 0
    i = 0
    for word in words_df_unique["unique"].tolist():
        # number of rows in words_df equal to this unique word
        words_df_unique.iloc[i, 1] = len(words_df.word[words_df.word == word])
        i+=1
    res = words_df_unique.sort_values('count', ascending = False)
    return(res)

To do the same operation on a pandas DataFrame, you can use Counter from the collections module:
from collections import Counter
cnt = Counter()
for text in df['text']:
    for word in text.split():
        cnt[word] += 1
# Find most common 10 words from the Pandas dataframe
cnt.most_common(10)

iterating and printing words in a python dictionary

I'm in the process of learning python and I love how much can be accomplished in such a small amount of code but I'm getting confused about the syntax. I'm just trying to iterate through a dictionary and print out each item and value.
Here is my code:
words = {}
value = 1
for line in open("test.txt", 'r'):
    for word in line.split():
        print (word)
        try:
            words[word] += 1
        except KeyError:
            #wur you at key?
            print("no")
            words[word]=1
for item in words:
    print ("{",item, ": ", words[item][0], " }")
My current print statement doesn't work and I can't find a good example of a large print statement using multiple variables. How would I print this properly?

Your problem seems to be that you're trying to print words[item][0], but words[item] is always going to be a number, and numbers can't be indexed.
So, just… don't do that:
print ("{",item, ": ", words[item], " }")
That's enough to fix it, but there are ways you could improve this code:
print with multiple arguments puts a space between each one, so you're going to end up printing { item : 3 }, when you probably didn't want all those spaces. You can fix that by using the keyword argument sep='', but a better solution is to use string formatting or the % operator.
You can get the keys and values at the same time by iterating over words.items() instead of words.
You can simplify the whole "store a default value if one isn't already there" by using the setdefault method, or by using a defaultdict—or, even more simply, you can use a Counter.
You should always close files that you open—preferably by using a with statement.
Be consistent in your style—don't put spaces after some functions but not others.
So:
import collections
with open("test.txt") as f:
    words = collections.Counter(word for line in f for word in line.split())
for item, count in words.items():
    print("{%s: %d}" % (item, count))
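Two of the suggestions above that the final code does not show, defaultdict and the sep keyword, can be sketched as follows. The inline sentence is a hypothetical stand-in for test.txt, and str.format is used here as one of the string formatting options.

```python
from collections import defaultdict

words = defaultdict(int)                  # missing keys start at 0
for word in "a rose is a rose".split():   # hypothetical input in place of test.txt
    words[word] += 1

# Doubled braces produce literal { and } in str.format.
lines = ["{{{}: {}}}".format(item, count) for item, count in words.items()]
for line in lines:
    print(line)

# sep='' removes the spaces that print normally puts between its arguments.
print("{", "a", ": ", words["a"], "}", sep="")
```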

You can use dict.get and eliminate the try/except block.
words = {}
for line in open("test.txt", 'r'):
    for word in line.split():
        print (word)
        words[word] = words.get(word,0) +1
for word,count in words.items():
    print(word,count)
dict.get returns the value for the key if the key is present in the dictionary, else the default value.
syntax: dict.get(key[,default])
You can also override __missing__:
class my_dict(dict):
    def __missing__(self,key):
        return 0
words = my_dict()
for line in open("test.txt", 'r'):
    for word in line.split():
        print (word)
        words[word] += 1
for word,count in words.items():
    print(word,count)

The best way to iterate through a dictionary as you're doing here is to loop by key AND value, unpacking the key-value tuple each time through the loop:
for item, count in words.items():
    print("{", item, ": ", count, "}")
And as a side note, you don't really need that exception handling logic in the loop where you build the dictionary. A dictionary's get() method can return a default value if the key isn't in the dictionary, simplifying your code to this:
words[word] = words.get(word, 0) + 1

Counting every word in a text file only once using python

I have a small python script I am working on for a class homework assignment. The script reads a file and prints the 10 most frequent and infrequent words and their frequencies. For this assignment, a word is defined as 2 letters or more. I have the word frequencies working just fine, however the third part of the assignment is to print the total number of unique words in the document. Unique words meaning count every word in the document, only once.
Without changing my current script too much, how can I count all the words in the document only one time?
p.s. I am using Python 2.6 so please don't mention the use of collections.Counter
from string import punctuation
from collections import defaultdict
import re
number = 10
words = {}
total_unique = 0
words_only = re.compile(r'^[a-z]{2,}$')
counter = defaultdict(int)
"""Define words as 2+ letters"""
def count_unique(s):
    count = 0
    if word in line:
        if len(word) >= 2:
            count += 1
    return count
"""Open text document, read it, strip it, then filter it"""
txt_file = open('charactermask.txt', 'r')
for line in txt_file:
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            counter[word] += 1
# Most Frequent Words
top_words = sorted(counter.iteritems(),
                   key=lambda(word, count): (-count, word))[:number]
print "Most Frequent Words: "
for word, frequency in top_words:
    print "%s: %d" % (word, frequency)
# Least Frequent Words:
least_words = sorted(counter.iteritems(),
                     key=lambda (word, count): (count, word))[:number]
print " "
print "Least Frequent Words: "
for word, frequency in least_words:
    print "%s: %d" % (word, frequency)
# Total Unique Words:
print " "
print "Total Number of Unique Words: %s " % total_unique

Count the number of keys in your counter dictionary:
total_unique = len(counter.keys())
Or more simply:
total_unique = len(counter)

A defaultdict is great, but it might be more than what you need. You will need it for the part about most frequent words. But in the absence of that question, using a defaultdict is overkill. In such a situation, I would suggest using a set instead:
words = set()
for line in txt_file:
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            words.add(word)
num_unique_words = len(words)
Now words contains only unique words.
I am only posting this because you say that you are new to Python, so I want to make sure that you are aware of sets as well. Again, for your purposes, a defaultdict works fine and is justified.
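To make the equivalence of the two answers concrete, the sketch below fills a plain dictionary and a set from the same words and compares their sizes. It is written for Python 3 (the question uses 2.6), and the sample text is a hypothetical stand-in for the file.

```python
import re
from string import punctuation

words_only = re.compile(r'^[a-z]{2,}$')      # same 2+ letter rule as the question
text = "The cat saw the other cat. A cat!"   # hypothetical stand-in for the file

counter = {}
unique = set()
for word in text.split():
    word = word.strip(punctuation).lower()
    if words_only.match(word):
        counter[word] = counter.get(word, 0) + 1
        unique.add(word)

# Both report the number of distinct words; only the dict keeps the counts.
print(len(counter), len(unique))
```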
