I have a small Python script I am working on for a class homework assignment. The script reads a file and prints the 10 most frequent and 10 least frequent words along with their frequencies. For this assignment, a word is defined as 2 letters or more. I have the word frequencies working just fine; however, the third part of the assignment is to print the total number of unique words in the document, where each word in the document is counted only once no matter how often it appears.
Without changing my current script too much, how can I count all the words in the document only one time?
p.s. I am using Python 2.6 so please don't mention the use of collections.Counter
from string import punctuation
from collections import defaultdict
import re
number = 10
words = {}
total_unique = 0
words_only = re.compile(r'^[a-z]{2,}$')
counter = defaultdict(int)
"""Define words as 2+ letters"""
def count_unique(s):
    count = 0
    if word in line:
        if len(word) >= 2:
            count += 1
    return count
"""Open text document, read it, strip it, then filter it"""
txt_file = open('charactermask.txt', 'r')
for line in txt_file:
for word in line.strip().split():
word = word.strip(punctuation).lower()
if words_only.match(word):
counter[word] += 1
# Most Frequent Words
top_words = sorted(counter.iteritems(),
                   key=lambda (word, count): (-count, word))[:number]
print "Most Frequent Words: "
for word, frequency in top_words:
    print "%s: %d" % (word, frequency)
# Least Frequent Words:
least_words = sorted(counter.iteritems(),
                     key=lambda (word, count): (count, word))[:number]
print " "
print "Least Frequent Words: "
for word, frequency in least_words:
    print "%s: %d" % (word, frequency)
# Total Unique Words:
print " "
print "Total Number of Unique Words: %s " % total_unique
Count the number of keys in your counter dictionary:
total_unique = len(counter.keys())
Or more simply:
total_unique = len(counter)
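In your script, that means setting total_unique after the read loop instead of hard-coding it to 0 at the top:

total_unique = len(counter)  # one key per distinct word that passed the filter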
A defaultdict is great, but it might be more than what you need. You do need it for the part about most frequent words, but if that part of the question went away, a defaultdict would be overkill. In such a situation, I would suggest using a set instead:
words = set()
for line in txt_file:
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            words.add(word)
num_unique_words = len(words)
Now words contains only unique words.
I am only posting this because you say that you are new to Python, so I want to make sure that you are aware of sets as well. Again, for your purposes, a defaultdict works fine and is justified.
Related
I have to write a function that counts how many times a word (or a series of words) appears in a given text.
This is my function so far. What I noticed is that with a series of 3 words the function works well, but not with 4 words and so on.
from nltk import ngrams
def function(text, word):
    for char in ".?!-":
        text = text.replace(char, ' ')
    n = len(word.split())
    countN = 0
    bigram_lower = text.lower()
    word_lower = word.lower()
    n_grams = ngrams(bigram_lower.split(), n)
    for gram in n_grams:
        for i in range(0, n):
            if gram[i] == word_lower.split()[i]:
                countN = countN + 1
    print(countN)
First thing, please fix your indentation and don't use bigram-style names (bigram_lower) for arbitrary n-grams, as it's a bit confusing (since you are not storing just bigrams in that variable). Secondly, let's look at this part of your code -
for gram in n_grams:
    for i in range(0, n):
        if gram[i] == word_lower.split()[i]:
            countN = countN + 1
print(countN)
Here you are increasing countN by one each time a single word in your n-gram matches, instead of increasing it when the whole n-gram matches. You should instead only increase countN if all the words have matched up -
for gram in n_grams:
    if list(gram) == word_lower.split():
        countN = countN + 1
print(countN)
Maybe it was already done here.
Is nltk mandatory?
# Open the file in read mode
text = open("sample.txt", "r")

# Create an empty dictionary
d = dict()

# Loop through each line of the file
for line in text:
    # Remove the leading spaces and newline character
    line = line.strip()
    # Convert the characters in line to
    # lowercase to avoid case mismatch
    line = line.lower()
    # Split the line into words
    words = line.split(" ")
    # Iterate over each word in line
    for word in words:
        # Check if the word is already in dictionary
        if word in d:
            # Increment count of word by 1
            d[word] = d[word] + 1
        else:
            # Add the word to dictionary with count 1
            d[word] = 1

# Print the contents of dictionary
for key in list(d.keys()):
    print(key, ":", d[key])
This should work for you:
import nltk

def function(text, word):
    for char in ".?!-,":
        text = text.replace(char, ' ')
    n = len(word.split())
    countN = 0
    bigram_lower = text.lower()
    word_lower = tuple(word.lower().split())
    bigrams = nltk.ngrams(bigram_lower.split(), n)
    for gram in bigrams:
        if gram == word_lower:
            countN += 1
    print(countN)
>>> tekst="this is the text i want to search, i want to search it for the words i want to search for, and it should count the occurances of the words i want to search for"
>>> function(tekst, "i want to search")
4
>>> function(tekst, "i want to search for")
2
I have a program that counts the words of a text file. Now I want to restrict the counter to strings with more than x characters.
from collections import Counter
input = 'C:/Users/micha/Dropbox/IPCC_Boox/FOD_v1_ch15.txt'
Counter = {}
words = {}
with open(input, 'r', encoding='utf-8-sig') as fh:
    for line in fh:
        word_list = line.replace(',', '').replace('\'', '').replace('.', '').lower().split()
        for word in word_list:
            if word not in Counter:
                Counter[word] = 1
            else:
                Counter[word] = Counter[word] + 1
N = 20
top_words = Counter(Counter).most_common(N)
for word, frequency in top_words:
    print("%s %d" % (word, frequency))
I tried the re approach, but it did not work:
re.sub(r'\b\w{1,3}\b')
I don't know how to implement it...
At the end I would like to have an output that ignores all the short words like and, you, be, etc.
You could do this more simply with:
for word in word_list:
    if len(word) < 5:  # check the length of each word is less than 5, for example
        continue       # this skips the counter portion and jumps to the next word in word_list
    elif word not in Counter:
        Counter[word] = 1
    else:
        Counter[word] = Counter[word] + 1
A few notes.
1) You import a Counter but don't use it properly (you do Counter = {}, thus overwriting the import).
from collections import Counter
2) Instead of doing several replaces, use a list comprehension with a set of unwanted characters; it's faster and only does one iteration (two with the join) instead of several:
sentence = ''.join([char for char in line if char not in {'.', ',', "'"}])
word_list = sentence.split()
3) Use the counter and list comp for length:
c = Counter(word for word in word_list if len(word) > 3)
That's it.
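Putting those notes together, a minimal sketch of the whole script might look like this (reusing the path from the question):

from collections import Counter

input_path = 'C:/Users/micha/Dropbox/IPCC_Boox/FOD_v1_ch15.txt'

words = []
with open(input_path, 'r', encoding='utf-8-sig') as fh:
    for line in fh:
        # strip unwanted characters in a single pass
        sentence = ''.join([char for char in line if char not in {'.', ',', "'"}])
        # keep only words longer than 3 characters
        words.extend(word for word in sentence.lower().split() if len(word) > 3)

c = Counter(words)
for word, frequency in c.most_common(20):
    print("%s %d" % (word, frequency))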
Counter already does what you want. You can "feed" it with an iterable and this will work.
https://docs.python.org/2/library/collections.html#counter-objects
You can use the filter function too https://docs.python.org/3.7/library/functions.html#filter
That could look like:
counted = Counter(filter(lambda x: len(x) >= 5, words))
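For instance (a quick demonstration with made-up words):

from collections import Counter

words = "apple pie banana apple cherry banana apple".split()
counted = Counter(filter(lambda x: len(x) >= 5, words))
print(counted)  # Counter({'apple': 3, 'banana': 2, 'cherry': 1})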
Need to write a Python program that analyzes a file and counts:
The number of words
The average length of a word
How many times each word occurs
How many words start with each letter of the alphabet
I've got the code to do the first 2 things:
with open(input('Please enter the full name of the file: '),'r') as f:
    w = [len(word) for line in f for word in line.rstrip().split(" ")]
    total_w = len(w)
    avg_w = sum(w)/total_w
    print('The total number of words in this file is:', total_w)
    print('The average length of the words in this file is:', avg_w)
But I'm not sure on how to do the others. Any help is appreciated.
Btw, when I say "How many words start with each letter of the alphabet" I mean how many words start with "A", how many start with "B", how many start with "C", etc all the way through to "Z".
There are many ways to achieve this. A more advanced approach would involve an initial, simple gathering of the text and its words, then working on the data with ML/DS tools, which would let you extrapolate more statistics (things like "a new paragraph starts mostly with X words", "X words are mostly preceded/succeeded by Y words", etc.).
If you just need very basic statistics you can gather them while iterating over each word and do the calculations at the end of it, like:
stats = {
    'amount': 0,
    'length': 0,
    'word_count': {},
    'initial_count': {}
}

with open('lorem.txt', 'r') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        for word in line.split():
            word = word.lower()
            initial = word[0]
            # Add word and length count
            stats['amount'] += 1
            stats['length'] += len(word)
            # Add initial count
            if initial not in stats['initial_count']:
                stats['initial_count'][initial] = 0
            stats['initial_count'][initial] += 1
            # Add word count
            if word not in stats['word_count']:
                stats['word_count'][word] = 0
            stats['word_count'][word] += 1

# Calculate average word length
stats['average_length'] = stats['length'] / stats['amount']
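From there, displaying the remaining two statistics is just a matter of sorting the dictionaries, e.g. (a small follow-up sketch using the stats dict built above):

# Ten most frequent words, highest count first
for word, count in sorted(stats['word_count'].items(), key=lambda item: item[1], reverse=True)[:10]:
    print(word, count)

# How many words start with each letter, alphabetically
for letter in sorted(stats['initial_count']):
    print(letter, stats['initial_count'][letter])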
Interesting challenge you were given. I made a proposition for question 3, how many times a word occurs inside the string. This code is not optimal at all, but it does work.
I also used the file text.txt.
Edit: I noticed I forgot to create wordlist, as it was still saved in memory from an earlier run.
with open('text.txt', 'r') as doc:
    print('opened txt')
    for words in doc:
        wordlist = words.split()

for numbers in range(len(wordlist)):
    for inner_numbers in range(len(wordlist)):
        if inner_numbers != numbers:
            if wordlist[numbers] == wordlist[inner_numbers]:
                print('word: %s == %s' % (wordlist[numbers], wordlist[inner_numbers]))
Answer to question four: this one wasn't really hard once you have created a list with all the words, since strings can be indexed like lists. You can get the first letter of a string simply with string[0], and for a list of strings with stringList[position of word][0].
for numbers in range(len(wordlist)):
    if wordlist[numbers][0] == 'a':
        print(wordlist[numbers])
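The same idea generalizes to all letters at once by tallying first characters into a dictionary (a small sketch building on the wordlist above):

initial_count = {}
for word in wordlist:
    if word and word[0].isalpha():
        first = word[0].lower()
        initial_count[first] = initial_count.get(first, 0) + 1

for letter in sorted(initial_count):
    print(letter, ':', initial_count[letter])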
My program opens a file and can count the words contained in it, but I want to create a dictionary consisting of all the unique words in the text.
For example, if the word 'computer' appears three times, I want that to count as one unique word.
def main():
    file = input('Enter the name of the input file: ')
    infile = open(file, 'r')
    file_contents = infile.read()
    infile.close()
    words = file_contents.split()
    number_of_words = len(words)
    print("There are", number_of_words, "words contained in this paragraph")

main()
Use a set. This will only include unique words:
words = set(words)
If you don't care about case, you can do this:
words = set(word.lower() for word in words)
This assumes there is no punctuation. If there is, you will need to strip the punctuation.
import string
words = set(word.lower().strip(string.punctuation) for word in words)
If you need to keep track of how many of each word you have, just replace set with Counter in the examples above:
import string
from collections import Counter
words = Counter(word.lower().strip(string.punctuation) for word in words)
This will give you a dictionary-like object that tells you how many of each word there is.
You can also get the number of unique words from this (although it is slower if that is all you care about):
import string
from collections import Counter
words = Counter(word.lower().strip(string.punctuation) for word in words)
nword = len(words)
@TheBlackCat's solution works, but it only gives you how many unique words are in the string/file. This solution also shows you how many times each one occurs.
dictionaryName = {}
for word in words:
    if word not in dictionaryName:
        dictionaryName[word] = 1
    else:
        dictionaryName[word] = dictionaryName.get(word) + 1

print dictionaryName
tested with:
words = "Foo", "Bar", "Baz", "Baz"
output: {'Foo': 1, 'Bar': 1, 'Baz': 2}
Probably a cleaner and quicker solution:
words_dict = {}
for word in words:
    word_count = words_dict.get(word, 0)
    words_dict[word] = word_count + 1
I need to display the 10 most frequent words in a text file, from the most frequent to the least as well as the number of times it has been used. I can't use the dictionary or counter function. So far I have this:
import urllib
cnt = 0
i = 0
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")
uniques = []
for line in txtFile:
    words = line.split()
    for word in words:
        if word not in uniques:
            uniques.append(word)
for word in words:
    while i < len(uniques):
        i += 1
        if word in uniques:
            cnt += 1
print cnt
Now I think I should look for every word in the array 'uniques' and see how many times it is repeated in this file and then add that to another array that counts the instance of each word. But this is where I am stuck. I don't know how to proceed.
Any help would be appreciated. Thank you
The above problem can be easily done by using Python collections. Below is the solution.
from collections import Counter
data_set = "Welcome to the world of Geeks " \
"This portal has been created to provide well written well" \
"thought and well explained solutions for selected questions " \
"If you like Geeks for Geeks and would like to contribute " \
"here is your chance You can write article and mail your article " \
" to contribute at geeksforgeeks org See your article appearing on " \
"the Geeks for Geeks main page and help thousands of other Geeks. " \
# split() returns list of all the words in the string
split_it = data_set.split()
# Pass the split_it list to instance of Counter class.
Counters_found = Counter(split_it)
#print(Counters)
# most_common() produces k frequently encountered
# input values and their respective counts.
most_occur = Counters_found.most_common(4)
print(most_occur)
You're on the right track. Note that this algorithm is quite slow because for each unique word, it iterates over all of the words. A much faster approach without hashing would involve building a trie.
# The following assumes that we already have alice30.txt on disk.
# Start by splitting the file into lowercase words.
words = open('alice30.txt').read().lower().split()

# Get the set of unique words.
uniques = []
for word in words:
    if word not in uniques:
        uniques.append(word)

# Make a list of (count, unique) tuples.
counts = []
for unique in uniques:
    count = 0               # Initialize the count to zero.
    for word in words:      # Iterate over the words.
        if word == unique:  # Is this word equal to the current unique?
            count += 1      # If so, increment the count.
    counts.append((count, unique))

counts.sort()     # Sorting the list puts the lowest counts first.
counts.reverse()  # Reverse it, putting the highest counts first.

# Print the ten words with the highest counts.
for i in range(min(10, len(counts))):
    count, word = counts[i]
    print('%s %d' % (word, count))
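For reference, a minimal sketch of the trie idea mentioned above (the dict-of-dicts node layout and the helper names trie_count/trie_items are my own, not part of the original answer; a strict no-hashing trie would use fixed-size child arrays instead of dicts):

def trie_count(words):
    # Each node is a dict of child characters; a word's count lives under '#'.
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['#'] = node.get('#', 0) + 1
    return root

def trie_items(node, prefix=''):
    # Walk the trie, yielding (word, count) pairs.
    if '#' in node:
        yield prefix, node['#']
    for ch, child in node.items():
        if ch != '#':
            for pair in trie_items(child, prefix + ch):
                yield pair

root = trie_count(open('alice30.txt').read().lower().split())
pairs = sorted(trie_items(root), key=lambda pair: pair[1], reverse=True)
for word, count in pairs[:10]:
    print('%s %d' % (word, count))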
from string import punctuation  # you will need it to strip the punctuation
import urllib

txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt")
counter = {}
for line in txtFile:
    words = line.split()
    for word in words:
        k = word.strip(punctuation).lower()  # the The or you You counted only once
        # you still have words like I've, you're, Alice's
        # you could change re to are, ve to have, etc...
        if "'" in k:
            ks = k.split("'")
        else:
            ks = [k, ]
        # now the tally
        for k in ks:
            counter[k] = counter.get(k, 0) + 1

# and sorting the counter by the value which holds the tally
for word in sorted(counter, key=lambda k: counter[k], reverse=True)[:10]:
    print word, "\t", counter[word]
import urllib
import operator
txtFile = urllib.urlopen("http://textfiles.com/etext/FICTION/alice30.txt").readlines()
txtFile = " ".join(txtFile)  # this with .readlines() replaces new lines with spaces
txtFile = "".join(char for char in txtFile if char.isalnum() or char.isspace())  # removes everything that's not alphanumeric or spaces

word_counter = {}
for word in txtFile.split(" "):  # split on every space
    if len(word) > 0 and word != '\r\n':
        if word not in word_counter:  # if 'word' not in word_counter, add it, and set value to 1
            word_counter[word] = 1
        else:
            word_counter[word] += 1  # if 'word' already in word_counter, increment it by 1

for i, word in enumerate(sorted(word_counter, key=word_counter.get, reverse=True)[:10]):
    # sorts the dict by the values, from top to bottom, takes the 10 top items
    print "%s: %s - %s" % (i + 1, word, word_counter[word])
output:
1: the - 1432
2: and - 734
3: to - 703
4: a - 579
5: of - 501
6: she - 466
7: it - 440
8: said - 434
9: I - 371
10: in - 338
This method ensures that only alphanumeric characters and spaces end up in the counter. It doesn't matter that much, though.
Personally I'd make my own implementation of collections.Counter. I assume you know how that object works, but if not I'll summarize:
text = "some words that are mostly different but are not all different not at all"
words = text.split()
resulting_count = collections.Counter(words)
# {'all': 2,
# 'are': 2,
# 'at': 1,
# 'but': 1,
# 'different': 2,
# 'mostly': 1,
# 'not': 2,
# 'some': 1,
# 'that': 1,
# 'words': 1}
We can certainly sort that based on frequency by using the key keyword argument of sorted, and return the first 10 items in that list. However that doesn't much help you because you don't have Counter implemented. I'll leave THAT part as an exercise for you, and show you how you might implement Counter as a function rather than an object.
def counter(iterable):
    d = {}
    for element in iterable:
        if element in d:
            d[element] += 1
        else:
            d[element] = 1
    return d
Not difficult, actually. Go through each element of an iterable. If that element is NOT in d, add it to d with a value of 1. If it IS in d, increment that value. It's more easily expressed by:
def counter(iterable):
    d = {}
    for element in iterable:
        d[element] = d.get(element, 0) + 1
    return d
Note that in your use case, you probably want to strip out the punctuation and possibly casefold the whole thing (so that someword gets counted the same as Someword rather than as two separate words). I'll leave that to you as well, but I will point out str.strip takes an argument as to what to strip out, and string.punctuation contains all the punctuation you're likely to need.
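A minimal sketch of those last steps, using the counter function above (the file name is just a placeholder):

import string

text = open('alice30.txt').read()  # placeholder file name
words = (word.strip(string.punctuation).casefold() for word in text.split())
counts = counter(w for w in words if w)

# Ten most frequent words, via the key argument of sorted.
for word in sorted(counts, key=counts.get, reverse=True)[:10]:
    print(word, counts[word])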
You can also do it with pandas dataframes and get the result in a convenient form as an ordered "word: its frequency" table.
import pandas as pn

def count_words(words_list):
    words_df = pn.DataFrame(words_list)
    words_df.columns = ["word"]
    words_df_unique = pn.DataFrame(pn.unique(words_list))
    words_df_unique.columns = ["unique"]
    words_df_unique["count"] = 0
    i = 0
    for word in pn.Series.tolist(words_df_unique.unique):
        words_df_unique.iloc[i, 1] = len(words_df.word[words_df.word == word])
        i += 1
    res = words_df_unique.sort_values('count', ascending=False)
    return res
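For example (a small usage sketch; the sample sentence is mine):

words_list = "the cat sat on the mat and the dog sat too".split()
print(count_words(words_list))  # 'the' counted 3 times, 'sat' twice, the rest once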
To do the same operation on a pandas dataframe, you may use the Counter function from collections:
from collections import Counter
cnt = Counter()
for text in df['text']:
    for word in text.split():
        cnt[word] += 1
# Find most common 10 words from the Pandas dataframe
cnt.most_common(10)
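A self-contained version of the same idea (the toy dataframe here is mine, standing in for whatever df you have):

import pandas as pd
from collections import Counter

df = pd.DataFrame({'text': ["the cat sat", "the dog sat", "the cat ran"]})

cnt = Counter()
for text in df['text']:
    for word in text.split():
        cnt[word] += 1

print(cnt.most_common(3))  # [('the', 3), ('cat', 2), ('sat', 2)]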