To count the word frequency in multiple documents python [duplicate]

This question already has an answer here:
how to find frequency of the keys in a dictionary across multiple text files?
(1 answer)
Closed 9 years ago.
I have a list of the addresses of multiple text files in a dictionary 'd':
'd:/individual-articles/9.txt', 'd:/individual-articles/11.txt', 'd:/individual-articles/12.txt',...
and so on...
Now I need to read each file in the dictionary and keep a count of the occurrences of every word that appears in any of those files.
My output should be of the form:
the-500
a-78
in-56
and so on..
where 500 is the number of times the word "the" occurs in all the files in the dictionary, and so on.
I need to do this for all the words.
I am a Python newbie, please help!
My code below doesn't work; it shows no output. There must be a mistake in my logic, please help me fix it!
import collections
import itertools
import os
from glob import glob
from collections import Counter
folderpaths='d:/individual-articles'
counter=Counter()
filepaths = glob(os.path.join(folderpaths,'*.txt'))
folderpath='d:/individual-articles/'
# i am creating my dictionary here, can be ignored
d = collections.defaultdict(list)
with open('topics.txt') as f:
    for line in f:
        value, *keys = line.strip().split('~')
        for key in filter(None, keys):
            if key=='earn':
                d[key].append(folderpath+value+".txt")
for key, value in d.items() :
    print(value)
word_count_dict={}
for file in d.values():
    with open(file,"r") as f:
        words = re.findall(r'\w+', f.read().lower())
        counter = counter + Counter(words)
        for word in words:
            word_count_dict[word].append(counter)
for word, counts in word_count_dict.values():
    print(word, counts)

Inspired by the Counter collection that you are already using:
import os
import re
from glob import glob
from collections import Counter
folderpaths = 'd:/individual-articles'
counter = Counter()
filepaths = glob(os.path.join(folderpaths, '*.txt'))
for file in filepaths:
    with open(file) as f:
        words = re.findall(r'\w+', f.read().lower())
        counter.update(words)  # add this file's word counts to the running total
print(counter)
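To get the word-count output format asked for in the question, the counter can be formatted with most_common(). This is only a minimal sketch: the helper names count_words and format_counts are made up for illustration, and the two temporary files merely stand in for the real articles.

```python
import os
import re
import tempfile
from collections import Counter


def count_words(filepaths):
    """Return one Counter with the word totals over all given files."""
    total = Counter()
    for path in filepaths:
        with open(path) as f:
            # update() adds to the existing counts in place
            total.update(re.findall(r'\w+', f.read().lower()))
    return total


def format_counts(counter):
    """Return lines like 'the-500', most frequent word first."""
    return ['{}-{}'.format(word, count) for word, count in counter.most_common()]


# Demo with two throwaway files standing in for the real articles
with tempfile.TemporaryDirectory() as folder:
    demo_paths = []
    for name, text in [('9.txt', 'The cat and the dog'), ('11.txt', 'The end')]:
        path = os.path.join(folder, name)
        with open(path, 'w') as f:
            f.write(text)
        demo_paths.append(path)
    demo_counts = count_words(demo_paths)

print('\n'.join(format_counts(demo_counts)))
```

With the real data, the list returned by glob(...) would be passed to count_words instead of demo_paths.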

Your code should give you an error in this line:
word_count_dict[word][file]+= 1
Because your word_count_dict is empty, evaluating word_count_dict[word][file] should raise a KeyError: word_count_dict[word] doesn't exist, so you can't look up [file] on it.
And I found another error:
while file in d.items():
This would make file a tuple. But then you do f = open(file,"r"), so you assume file is a string. This would also raise an error.
This means that none of these lines are ever executed. That in turn means that either while file in d.items(): is empty or for file in filepaths: is empty.
And to be honest, I don't understand why you have both of them, or what you are trying to achieve there. You have generated a list of filenames to parse. You should just iterate over them. I also don't know why d is a dict. All you need is a list of all the files. You don't need to keep track of which key in the topics list each file came from, do you?
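If the dictionary is kept anyway, note that each value of d is a list of paths, not a single path, so the lists have to be flattened before the files can be opened. A small sketch of that step, assuming d has the shape built in the question; the keys and paths below are made-up sample data:

```python
from collections import defaultdict
from itertools import chain

# Hypothetical stand-in for the dictionary built from topics.txt
d = defaultdict(list)
d['earn'].append('d:/individual-articles/9.txt')
d['earn'].append('d:/individual-articles/11.txt')
d['grain'].append('d:/individual-articles/12.txt')

# Each value of d is a list of paths, so the lists are chained together
all_paths = list(chain.from_iterable(d.values()))

for path in all_paths:
    print(path)  # each path is now a plain string that open() accepts
```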

Related

The output is unsorted and sorting on the second value is not possible. Is there a special method to sort on the second value?

The output is unsorted, and sorting on the second column is not possible. Is there a special method to sort on the second value?
This program takes a text and counts how many times each word occurs in it:
import string
with open("romeo.txt") as file:  # opens the file with text
    lst = []
    d = dict()
    uniquewords = open('romeo_unique.txt', 'w')
    for line in file:
        words = line.split()
        for word in words:  # loops through all words
            word = word.translate(str.maketrans('', '', string.punctuation)).upper()  # removes the punctuation
            if word not in d:
                d[word] = 1
            else:
                d[word] = d[word] + 1
            if word not in lst:
                lst.append(word)  # append only this unique word to the list
                uniquewords.write(str(word) + '\n')  # write the unique word to the file
print(d)

Dictionaries with default value
The code snippet:
d = dict()
...
if word not in d:
    d[word] = 1
else:
    d[word] = d[word] + 1
has become so common in Python that a subclass of dict has been created to get rid of it. It goes by the name defaultdict and can be found in the collections module.
Thus we can simplify your code snippet to:
from collections import defaultdict
d = defaultdict(int)
...
d[word] = d[word] + 1
No need for this manual if/else test; if word is not in the defaultdict, it will be added automatically with initial value 0.
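A tiny self-contained run of the same idea, just to show the automatic initial value in action; the sample sentence is made up:

```python
from collections import defaultdict

d = defaultdict(int)
for word in 'but soft what light but soft'.split():
    d[word] += 1  # a missing key is first created with int() == 0

print(dict(d))
```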
Counters
Counting occurrences is also something that is frequently useful; so much so that there exists a subclass of dict called Counter in module collections. It will do all the hard work for you.
from collections import Counter
import string
with open('romeo.txt') as input_file:
    counts = Counter(
        word.translate(str.maketrans('', '', string.punctuation)).upper()
        for line in input_file
        for word in line.split()
    )
with open('romeo_unique.txt', 'w') as output_file:
    for word in counts:
        output_file.write(word + '\n')
As far as I can tell from the documentation, Counters are not guaranteed to be ordered by number of occurrences by default; however:
When I use them in the interactive python interpreter they are always printed in decreasing number of occurrences;
they provide a method .most_common() which is guaranteed to return in decreasing number of occurrences.
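A short sketch of .most_common() on some made-up words, since that is the part that answers the sorting question:

```python
from collections import Counter

counts = Counter('BUT SOFT WHAT LIGHT BUT SOFT BUT'.split())

# most_common() returns (word, count) pairs, highest count first
for word, count in counts.most_common():
    print(word, count)

# most_common(n) keeps only the n most frequent words
top_two = counts.most_common(2)
```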

In Python, standard dictionaries are an unsorted data type, but you can look here, assuming that by sorting your output you mean sorting d.
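One common way to do that, sketched here with made-up counts rather than the real romeo.txt data, is to sort the dictionary's items with a key function that picks the second value of each pair:

```python
d = {'ROMEO': 2, 'THE': 5, 'SUN': 1}

# Sort the (word, count) pairs by the count, i.e. the second value
by_count = sorted(d.items(), key=lambda item: item[1], reverse=True)

for word, count in by_count:
    print(word, count)
```

The dictionary itself is left untouched; sorted() returns a new list of pairs.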

A couple of remarks first:
You are not sorting explicitly (e.g. by using sorted) by a given property. A dictionary iterates in insertion order (guaranteed since Python 3.7), not in any order derived from its keys or values, so if you want sorted output it is better to sort the dict explicitly.
You check the existence of a word in the lst variable, which is very slow, since checking a list requires going through its entries until the word is found (or the end is reached). It would be much better to check for existence in a dict.
I'm assuming that by "the second column" you mean the information for each word that records the order in which the word first appeared.
With that, I'd change the code to also record the word index of the first occurrence of each word, which then allows for sorting on exactly that.
Edit: Fixed the code. sorted sorts the items by key, not by value, unless told otherwise. That's what I get for not testing code before posting an answer.
import string
from operator import itemgetter
with open("romeo.txt") as file:  # opens the file with text
    first_occurence = {}
    uniqueness = {}
    word_index = 1
    uniquewords = open('romeo_unique.txt', 'w')
    for line in file:
        words = line.split()
        for word in words:  # loops through all words
            word = word.translate(str.maketrans('', '', string.punctuation)).upper()  # removes the punctuation
            if word not in uniqueness:
                uniqueness[word] = 1
            else:
                uniqueness[word] += 1
            if word not in first_occurence:
                first_occurence[word] = word_index
                uniquewords.write(str(word) + '\n')  # write the unique word to the file
            word_index += 1
print(sorted(uniqueness.items(), key=itemgetter(1)))
print(sorted(first_occurence.items(), key=itemgetter(1)))

Append multiple Counter() objects and convert to Data Frame

I would like to find word frequencies of a list of reserved words in multiple .txt files as a pandas data frame. I am using collections.Counter() objects and if a certain word does not appear in a text, the value of that word (key) is zero in the Counter().
Ideally, the result is a data frame where each row corresponds to each .txt file, column headers correspond to the reserved words and the entry in the row i column j corresponds to the frequency of j-th word in the i-th .txt file.
Here is my code, but the problem is that the Counter() objects are not appended, in the sense of a dictionary with multiple values for each key (or reserved word), but summed instead:
for filepath in iglob(os.path.join(folder_path, '*.txt')):
    with open(filepath) as file:
        cnt = Counter()
        tokens = re.findall(r'\w+', file.read().lower())
        for word in tokens:
            if word in mylist:
                cnt[word] += 1
        for key in mylist:
            if key not in cnt:
                cnt[key] = 0
    dictionary = defaultdict(list)
    for key, value in cnt.items():
        dictionary[key].append(value)
print(dictionary)
Any hint will be much appreciated!

You need to create the dictionary for the dataframe before the loop and then copy/append the Counter values of each text file over.
#!/usr/bin/env python3
import os
import re
from collections import Counter
from glob import iglob
def main():
    folder_path = '...'
    keywords = ['spam', 'ham', 'parrot']
    keyword2counts = {keyword: list() for keyword in keywords}
    for filename in iglob(os.path.join(folder_path, '*.txt')):
        with open(filename) as file:
            words = re.findall(r'\w+', file.read().lower())
        keyword2count = Counter(word for word in words if word in keywords)
        for keyword in keywords:
            keyword2counts[keyword].append(keyword2count[keyword])
    print(keyword2counts)
if __name__ == '__main__':
    main()
Testing whether an item is in a list can be significantly slower than doing the same test against a set. So if this is too slow, you might use a set for keywords, or an additional set just for the test.
And use a collections.OrderedDict on Python versions before 3.7 (before CPython 3.6) if the order of the columns is relevant.
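The set variant could look roughly like this. It is a sketch, not a drop-in replacement: count_keywords is a made-up helper name, and it takes the texts as in-memory strings instead of reading files so that it can be run as is.

```python
import re
from collections import Counter


def count_keywords(texts, keywords):
    """Return {keyword: [count in text 0, count in text 1, ...]}."""
    keyword_set = set(keywords)  # used only for the fast membership test
    keyword2counts = {keyword: [] for keyword in keywords}
    for text in texts:
        words = re.findall(r'\w+', text.lower())
        counts = Counter(word for word in words if word in keyword_set)
        for keyword in keywords:
            # a Counter returns 0 for keywords that did not occur
            keyword2counts[keyword].append(counts[keyword])
    return keyword2counts


table = count_keywords(['Spam, spam and ham', 'A dead parrot'],
                       ['spam', 'ham', 'parrot'])
print(table)
```

A dictionary of equal-length lists like table is a shape that the pandas.DataFrame constructor accepts, with each key becoming a column and each list position a row.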

Counting occurrences of words that appear in a list using Python

I have appended Excel sheet values to a list using xlrd. I called the list a_master. I have a text file with words whose occurrences in this list I want to count (I called this file dictionary, and there is 1 word per line). Here is the code:
with open("dictionary.txt","r") as f:
    for line in f:
        print "Count " + line + str((a_master).count(line))
For some reason, though, the count comes back as zero for every word in the text file. If I write out the count for one of these words myself:
print str((a_master).count("server"))
it counts the occurrences with no problem. I have also tried
print line
in order to see whether it is reading the words in the dictionary.txt file correctly, and it is.

Each line read from the file is terminated by a newline character. There may also be whitespace at the end. It is better to strip any whitespace before doing the lookup:
with open("dictionary.txt","r") as f:
    for line in f:
        print "Count " + line + str((a_master).count(line.strip()))
Note: searching a list is linear, which may not be optimal in most cases. I think collections.Counter is suitable for the situation you describe.
Re-interpret your list as a dictionary where the key is the item and the value is the number of occurrences, by passing it through collections.Counter as shown below
import collections
a_master = collections.Counter(a_master)
and you can rewrite your code as
from itertools import imap
with open("dictionary.txt","r") as f:
    for line in imap(str.strip, f):
        print "Count {} {}".format(line, a_master[line])
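The snippets above are Python 2 (print statement, itertools.imap). For anyone on Python 3, the same idea might be sketched as follows; the list contents are invented sample data standing in for the spreadsheet values and for the lines of dictionary.txt:

```python
from collections import Counter

# Hypothetical stand-ins for the spreadsheet values and the lines of dictionary.txt
a_master = ['server', 'client', 'server', 'router']
dictionary_lines = ['server\n', 'router \n', 'switch\n']

master_counts = Counter(a_master)

for line in dictionary_lines:
    word = line.strip()  # drop the newline and any surrounding whitespace
    print('Count {} {}'.format(word, master_counts[word]))
```

Looking up 'server\n' without the strip() would find nothing, which is the zero count seen in the question.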

Use collections.Counter():
import re
import collections
words = re.findall(r'\w+', open('dictionary.txt').read().lower())
collections.Counter(words)
Why is this question tagged xlrd by the way?

How do I see if a value matches another value in a text file in Python?

Here's what I have so far.
from itertools import permutations
original = str(input('What word would you like to unscramble?: '))
for bob in permutations(original):
    print(''.join(bob))
inputFile = open('dic.txt', 'r')
compare = inputFile.read()
inputFile.close()
Basically, what I'm trying to do is create a word unscrambler by having Python find all possible rearrangements of a string and then only print the rearrangements that are actual words, which can be found out by running each rearrangement through a dictionary file (in this case dic.txt) to see if there is a match. I am running Python 3.3, if that matters. What do I need to add in order to compare the rearrangements with the dictionary file?

You could store the permutations in a list, read the dictionary into another list, and select the entries that are in both lists.
For example this way:
from itertools import permutations
original = str(input('What word would you like to unscramble?: '))
perms = []
for bob in permutations(original):
    perms.append(''.join(bob))
inputFile = open('dic.txt', 'r')  # the filename must be a quoted string
dict_entries = inputFile.read().split('\n')
inputFile.close()
for word in [perm for perm in perms if perm in dict_entries]:
    print(word)  # print() is a function in Python 3
(Assuming the dictionary contains one word per line.)

Read the dictionary file into a list line by line, iterate through each of the rearrangements and check if it's in the dictionary like so:
if word in dict_list:
    ...
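Put together, that check might look like the sketch below. The function name unscramble is made up, the word list is passed in directly rather than read from dic.txt, and sets are swapped in for the lists so that each lookup does not scan the whole dictionary:

```python
from itertools import permutations


def unscramble(scrambled, dict_words):
    """Return the rearrangements of scrambled that are in dict_words, sorted."""
    dict_set = set(dict_words)  # set lookups do not scan the whole list
    # a set also removes repeated rearrangements, e.g. for repeated letters
    candidates = {''.join(p) for p in permutations(scrambled)}
    return sorted(candidates & dict_set)


print(unscramble('tca', ['act', 'cat', 'dog']))
```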

Although this puts a little more up-front effort into processing the input file, once you've built the word_dict it's much more efficient to look up the sorted form of the word than to build and check all permutations:
def get_word_dict(filename):
    words = {}
    with open(filename) as word_dict:
        for line in word_dict:
            word = line.strip()
            # sorted() returns a list, which cannot be a dict key, so join it into a string
            key = ''.join(sorted(word))
            if key not in words:
                words[key] = []
            words[key].append(word)
    return words
word_dict = get_word_dict('dic.txt')
original = input("What word would you like to unscramble: ")
key = ''.join(sorted(original))
if key in word_dict:
    for word in word_dict[key]:
        print(word)
else:
    print("Not in the dictionary.")
This will be particularly beneficial if you want to look up more than one word: you can process the file once, then repeatedly refer to the word_dict.
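To illustrate the "process once, look up many times" point, here is a sketch that builds the index from an in-memory word list instead of dic.txt. build_index and lookup are made-up names; the sorted letters are joined into a string because a list cannot be used as a dictionary key.

```python
from collections import defaultdict


def build_index(words):
    """Map each word's sorted letters to all words made of those letters."""
    index = defaultdict(list)
    for word in words:
        index[''.join(sorted(word))].append(word)
    return index


def lookup(index, scrambled):
    """Return the known words that are rearrangements of scrambled."""
    return index.get(''.join(sorted(scrambled)), [])


index = build_index(['act', 'cat', 'dog', 'god'])  # built once
for scrambled in ['tca', 'odg', 'xyz']:            # looked up many times
    print(scrambled, lookup(index, scrambled))
```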

how to find word frequency without duplicates across multiple files? [duplicate]

This question already has an answer here:
word frequency calculation in multiple files [duplicate]
(1 answer)
Closed 9 years ago.
I am trying to find the frequency of words in multiple files in a folder. I need to increment a word's count by 1 if it is found in a file, no matter how often it occurs there.
For example, the line "all's well that ends well", if read in file 1, must increment the count of "well" by 1 and not 2,
and if "she is not well" is read in file 2, the count of "well" shall become 2.
I need to increment the counter without including the duplicates, but my program does not take that into account, so please help!
import os
import re
import sys
sys.stdout=open('f1.txt','w')
from collections import Counter
from glob import glob
def removegarbage(text):
    text=re.sub(r'\W+',' ',text)
    text=text.lower()
    sorted(text)
    return text
def removeduplicates(l):
    return list(set(l))
folderpath='d:/articles-words'
counter=Counter()
filepaths = glob(os.path.join(folderpath,'*.txt'))
num_files = len(filepaths)
# Add all words to counter
for filepath in filepaths:
    with open(filepath,'r') as filehandle:
        lines = filehandle.read()
        words = removegarbage(lines).split()
        cwords=removeduplicates(words)
        counter.update(cwords)
# Display most common
for word, count in counter.most_common():
    # Break out if the frequency is less than 0.1 * the number of files
    if count < 0.1*num_files:
        break
    print('{} {}'.format(word,count))
I have tried the sort and remove duplicate techniques, but it still doesn't work!

If I understand your problem correctly, you basically want to know, for each word, in how many files it appears (regardless of whether the same word occurs more than once in the same file).
In order to do this, I wrote the following sketch, which simulates a list of many files (I only cared about the process, not the files per se, so you will have to replace "files" with the actual list you want to process).
d = {}
i = 0
for f in files:
    i += 1
    for line in f:
        words = line.split()
        for word in words:
            if word not in d:
                d[word] = {}
            d[word][i] = 1  # mark that this word was seen in file number i
d2 = {}
for word, occurences in d.items():  # items() works in both Python 2 and 3
    d2[word] = sum(d[word].values())
The result will give you something like the following:
{'ends': 1, 'that': 1, 'is': 1, 'well': 2, 'she': 1, 'not': 1, "all's": 1}
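The same per-file counting can also be sketched with a set and a Counter. As in the answer, the "files" here are simulated as lists of lines, using the two example sentences from the question:

```python
from collections import Counter

# Hypothetical stand-ins for two files, each a list of lines
files = [
    ["all's well that ends well"],
    ["she is not well"],
]

document_frequency = Counter()
for lines in files:
    words_in_file = set()  # each word is kept once per file
    for line in lines:
        words_in_file.update(line.split())
    document_frequency.update(words_in_file)

print(dict(document_frequency))
```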

I'd do it a much different way, but the crux of it is using a set.
from collections import Counter
frequency = Counter()
for line in open("file", "r"):
    for word in set(line.split()):  # split first; set(line) would give single characters
        frequency[word] += 1
I'm not sure if it's preferable to use .readline() or whatnot; I typically use for loops because they're so damn simple.
Edit: I see what you're doing wrong. You read the entire contents of the file with .read(), (perform removegarbage() on it) and then .split() the result. That'll give you a single list, destroying the newlines:
>>> "Hello world!\nFoo bar!".split()
['Hello', 'world!', 'Foo', 'bar!']
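For comparison, a quick sketch of how str.split() and str.splitlines() treat the same text; only the latter keeps the line boundaries:

```python
text = "Hello world!\nFoo bar!"

words = text.split()       # splits on any whitespace, newlines included
lines = text.splitlines()  # splits on line boundaries only

print(words)
print(lines)
```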
