how to find word frequency without duplicates across multiple files? [duplicate] - python

This question already has an answer here:
word frequency calculation in multiple files [duplicate]
(1 answer)
Closed 9 years ago.
I am trying to find the frequency of words across multiple files in a folder. I need to increment a word's count by 1 if it is found in a file, no matter how many times it occurs in that file.
For example, the line "all's well that ends well", if read in file 1, must increment the count of "well" by 1 and not 2,
and if "she is not well" is read in file 2, the count of "well" should become 2.
I need to increment the counter without including the duplicates, but my program does not take that into account. Please help!
import os
import re
import sys
sys.stdout=open('f1.txt','w')
from collections import Counter
from glob import glob
def removegarbage(text):
    text=re.sub(r'\W+',' ',text)
    text=text.lower()
    sorted(text)
    return text
def removeduplicates(l):
    return list(set(l))
folderpath='d:/articles-words'
counter=Counter()
filepaths = glob(os.path.join(folderpath,'*.txt'))
num_files = len(filepaths)
# Add all words to counter
for filepath in filepaths:
    with open(filepath,'r') as filehandle:
        lines = filehandle.read()
        words = removegarbage(lines).split()
        cwords=removeduplicates(words)
        counter.update(cwords)
# Display most common
for word, count in counter.most_common():
    # Break out if the frequency is less than 0.1 * the number of files
    if count < 0.1*num_files:
        break
    print('{} {}'.format(word,count))
I have tried the sort and remove duplicate techniques, but it still doesn't work!

If I understand your problem correctly, you want to know, for each word, how many files it appears in (regardless of whether the word occurs more than once in the same file).
To do this, I used the following scheme, which simulates a list of several files. I only cared about the process, not the files per se, so you will have to replace "files" with the actual list of files you want to process.
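d = {}
i = 0
for f in files:
    i += 1
    for line in f:
        words = line.split()
        for word in words:
            if word not in d:
                d[word] = {}
            d[word][i] = 1  # mark the word as seen in file number i
d2 = {}
for word, occurrences in d.items():  # use d.iteritems() on Python 2
    d2[word] = sum(occurrences.values())  # number of files containing the word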
The result will give you something like the following:
{'ends': 1, 'that': 1, 'is': 1, 'well': 2, 'she': 1, 'not': 1, "all's": 1}

I'd do it quite differently, but the crux of it is using a set.
from collections import Counter
frequency = Counter()
for line in open("file", "r"):
    for word in set(line.split()):  # split first; set(line) would yield single characters
        frequency[word] += 1
I'm not sure if it's preferable to use .readline() or whatnot; I typically use for loops because they're so damn simple.
Edit: I see what you're doing wrong. You read the entire contents of the file with .read(), (perform removegarbage() on it) and then .split() the result. That'll give you a single list, destroying the newlines:
>>> "Hello world!\nFoo bar!".split()
['Hello', 'world!', 'Foo', 'bar!']
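The per-file counting discussed above can be sketched as follows. This is a minimal illustration, not the asker's program: the helper name document_frequency is hypothetical, the "files" are plain in-memory strings taken from the question's example, and words are split on whitespace only (no punctuation cleanup).

```python
from collections import Counter

def document_frequency(texts):
    """Count, for each word, how many texts contain it at least once."""
    counts = Counter()
    for text in texts:
        # A set keeps one copy of each word, so a text adds at most 1 per word.
        counts.update(set(text.lower().split()))
    return counts

# The two example "files" from the question, held in memory for illustration.
df = document_frequency(["all's well that ends well", "she is not well"])
print(df["well"])
```

With real files, each string would be the result of reading one file in full, so the set is built once per file rather than once per line.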

Related

Count common words

I have two .txt files and I need to compare them and count only the words they have in common. What I should get is just the total number of words the two files share. How can I do it? This is the code that I have tried, but I only need the total count, e.g. "I have found 125 occurrences".
So now you have a dict with non-zero values for common words. Counting them is as simple as:
sum(v != 0 for v in dct.values())
If I understood correctly, I would create one set of words for each file, converting everything to lowercase to be sure of getting proper matches. Then I would take the intersection of those sets and get its length.
file1_words = set()
file2_words = set()
file1 = open("text1.txt")
file2 = open("verbs.txt")
for word in file1.read().strip().split():
    file1_words.add(word.lower())
for word in file2.read().strip().split():
    file2_words.add(word.lower())
print(file1_words)
print(file2_words)
common_words = file1_words.intersection(file2_words)
print(len(common_words))
Based on your explanation, what you want to do is to sum the values of the dictionary you have created. You can do this simply by:
print(sum(dct.values()))
This counts all the occurrences of the words in common.
It uses Counter; see https://docs.python.org/fr/3/library/collections.html#collections.Counter
import re
from collections import Counter
with open('text1.txt', 'r') as f:
    words1 = re.findall(r'\w+', f.read().lower())
with open('verbs.txt', 'r') as f:
    words2 = re.findall(r'\w+', f.read().lower())
counter1 = Counter(words1)
counter2 = Counter(words2)
common = set(counter1.keys()).intersection(counter2.keys())
# total occurrences, in both files, of the words they share
print(sum(counter1[e] + counter2[e] for e in common))
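The answers above read "common words" in different ways. The sketch below contrasts two of them on made-up word lists: the number of distinct shared words (set intersection) and the shared occurrences counted with multiplicity (Counter intersection, which keeps the minimum count per word). The function name common_word_counts and the sample data are assumptions for illustration only.

```python
from collections import Counter

def common_word_counts(words1, words2):
    """Return (distinct shared words, shared occurrences with multiplicity)."""
    # Distinct words present in both lists.
    distinct = len(set(words1) & set(words2))
    # Counter & Counter keeps min(count1, count2) for each shared word.
    overlap = sum((Counter(words1) & Counter(words2)).values())
    return distinct, overlap

distinct, overlap = common_word_counts(
    ["run", "run", "jump", "swim"],
    ["run", "run", "run", "swim", "fly"],
)
print(distinct, overlap)
```

Which number is the "right" total depends on what the asker means by occurrences, which the question does not settle.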

Counting words in python from the text file

I need to open a text file and count the occurrences of the names listed in another file. The program should write name; count pairs, separated by semicolons, to a file in .csv format.
It should look like:
Jane; 77
Hector; 34
Anna; 39
...
I tried to use "Counter", but the output looks like a list, so I think this is the wrong way to do the task:
import re
import collections
from collections import Counter
wanted = re.findall('\w+', open('iliadcounts.csv').read().lower())
cnt = Counter()
words = re.findall('\w+', open('pg6130.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print (cnt)
but this is definitely not the right code for this task...
You can feed the whole list of words to Counter at once and it will count them for you.
You can then print only the words in wanted by iterating over it:
import re
import collections
from collections import Counter
# create some demo data as I do not have your data at hand - uses your filenames
def create_demo_files():
    with open('iliadcounts.csv',"w") as f:
        f.write("hug,crane,box")
    with open('pg6130.txt',"w") as f:
        f.write("hug,shoe,blues,crane,crane,box,box,box,wood")
create_demo_files()
# work with your files
with open('iliadcounts.csv') as f:
    wanted = re.findall(r'\w+', f.read().lower())
with open('pg6130.txt') as f:
    cnt = Counter( re.findall(r'\w+', f.read().lower()) )
# printed output for all words in wanted (all words are counted)
for word in wanted:
    print("{}; {}".format(word, cnt.get(word)))
# would work as well:
# https://docs.python.org/3/library/string.html#string-formatting
# print(f"{word}; {cnt.get(word)}")
Output:
hug; 1
crane; 2
box; 3
Or you can print the whole Counter:
print(cnt)
Output:
Counter({'box': 3, 'crane': 2, 'hug': 1, 'shoe': 1, 'blues': 1, 'wood': 1})
Links:
https://pyformat.info/
string formatting
with open(...) as f:
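To get the "name; count" lines into a file rather than onto the screen, something along these lines might work. It is only a sketch: write_name_counts is a hypothetical helper, the text and names are invented, and an in-memory io.StringIO buffer stands in for the output file (any object with a write method, such as a file opened with open(..., "w"), could be passed instead).

```python
import io
import re
from collections import Counter

def write_name_counts(names, text, out):
    """Write one 'name; count' line per wanted name to the file-like object out."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    for name in names:
        # Indexing a Counter returns 0 for names that never occur.
        out.write(f"{name}; {counts[name.lower()]}\n")

buffer = io.StringIO()
write_name_counts(
    ["Hector", "Anna"],
    "Hector met Anna. Anna left; Hector stayed with hector's men.",
    buffer,
)
print(buffer.getvalue())
```

Indexing the Counter instead of calling .get() means a missing name is written with a count of 0 rather than None.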

Word Count from File: Is it having problems opening the file, or have I coded it incorrectly?

Problem: Program seems to get stuck opening a file to read.
My problem is that at the very beginning the program seems to be broken. It just displays
[(1, 'C:\Users\....\Desktop\Sense_and_Sensibility.txt')]
over and over, never-ending.
(NOTE: .... is a replacement for the purpose of posting because my computer username is my full name).
I'm not sure if I've coded this completely incorrectly, or if it's having problems opening the file. Any help is appreciated.
The program should:
1: open a file, replace all punctuation with spaces, change all words to lowercase, then store them in a dictionary.
2: look at a list of words (stop words) that will be removed from the original dictionary.
3: count the remaining words and sort based on frequency.
fname = r"C:\Users\....\Desktop\Sense_and_Sensibility.txt" # file to read
swfilename = r"C:\Users\....\Desktop\stopwords.txt" # words to delete
with open(fname) as file: # have the program run the file
    for line in file: # loop through
        fname.replace('-.,"!?', " ") # replace punc. with space
        words = fname.lower() # make all words lowercase
        word_list = fname.split() # separate the words, store
        word_dict = {} # create a dictionary
        with open(swfilename) as delete: # open stop word list
            for line in delete:
                sw_list = swfilename.split() # separate the words, store them
                sw_dict = {}
        for key in sw_dict:
            word_dict.pop(key, None) # delete common words
        for word in word_list: # loop through
            word_dict[word] = word_dict.get(word, 0) + 1 # count frequency
        word_freq = [] # create index
        for key, value in word_dict.items(): # count occurrences
            word_freq.append((value, key)) # append freq list
        word_freq.sort(reverse=True) # sort the words by freq
        print(word_freq) # print most to least
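Opening files on Windows with Python is somewhat different compared to macOS and Linux.
Just change the file path from fname = r"C:\Users\....\Desktop\Sense_and_Sensibility.txt"
to fname = "C:\\Users\\....\\Desktop\\Sense_and_Sensibility.txt"
Use double backslashes. (Note that a raw string with single backslashes and an ordinary string with doubled backslashes produce the same path, so this change by itself does not alter which file the program opens.)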
There are a number of issues with your code. I will only discuss the most obvious ones, given that it is impossible to reproduce your exact observations because the input you are using is not accessible to readers.
I will first reproduce your code verbatim and mark weak points with ??? followed by a number, which I will address after the code.
fname = r"C:\Users\....\Desktop\Sense_and_Sensibility.txt" #file to read
swfilename = r"C:\Users\....\Desktop\stopwords.txt" #words to delete
with open(fname) as file: #???(1) have the program run the file
    for line in file: #loop through
        fname.replace ('-.,"!?', " ") #???(2) replace punc. with space
        words = fname.lower() #???(3) make all words lowercase
        word_list = fname.split() #separate the words, store
        word_dict = {} #???(4) create a dictionary
        with open(swfilename) as delete: #open stop word list
            for line in delete:
                sw_list = swfilename.split() #separate the words, store them
                sw_dict = {}
        for key in sw_dict:
            word_dict.pop(key, None) #???(5) delete common words
        for word in word_list: #???(6) loop through
            word_dict[word] = word_dict.get(word, 0) + 1 #???(7) count frequency
        word_freq = [] #???(8)create index
        for key, value in word_dict.items(): #count occurrences
            word_freq.append((value, key)) #append freq list
        word_freq.sort(reverse = True) #sort the words by freq
        print(word_freq) #print most to least
1. (minor) file was the name of a built-in in Python 2 (it is not a reserved word), and it is good practice not to reuse such names for your own purposes as you are doing.
2. (major) .replace() replaces the exact string on the left with the exact string on the right, but what you want is some sort of multi_replace(), which you could implement yourself (for example as a function) by consecutive calls to .replace(), for example in a loop (or using functools.reduce()). Also, .replace() returns a new string, and you discard its result.
3. (major) fname contains the file name (the path, actually) and not the content of the file you want to work with.
4. (major) You are looping through the lines of the file, but if you create your word_list and word_dict for each line, you will "overwrite" the content at each iteration. Also, the word_dict is created empty and never filled before you try to remove the stop words from it.
5. (major) The logic you are trying to implement will not work on a dictionary, because dictionaries cannot contain multiple identical keys. A more effective approach would be to create a filtered_list from the word_list by excluding the stop_words. The dictionary can then be used to implement a counter. I do understand that at your level it may be worth learning how to implement a counter, but please keep in mind that the class collections.Counter from the standard library (thus accessible using import collections) does exactly what you want.
6. (major) Given the points above, little of this loop is useful as written; moreover, looping through the original list instead of the filtered list ignores the stop words entirely.
7. (major) dictionary[key] can be used both for accessing (which you do not do) and for writing (which you do) the value associated with a specific key in a dictionary.
8. (minor) Obviously, your approach for sorting according to word frequency would work, but a much better approach would be to use the key parameter of .sort() and sorted().
Hope this helps!
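Putting the points above together, the three steps from the question (replace punctuation, drop stop words, count and sort) might look roughly like the sketch below. It is not the asker's program: word_frequencies is a hypothetical helper, it works on a string instead of opening the files, it uses str.translate in place of the multi_replace() idea, and the sample sentence and stop-word list are invented.

```python
import string
from collections import Counter

# Translation table mapping every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def word_frequencies(text, stop_words):
    """Return (word, count) pairs, most frequent first, ignoring stop words."""
    words = text.translate(PUNCT_TO_SPACE).lower().split()
    stop = {w.lower() for w in stop_words}
    # Filter the word list first, then let Counter do the counting.
    return Counter(w for w in words if w not in stop).most_common()

result = word_frequencies(
    "The family of Dashwood had long been settled; the family was large.",
    ["the", "of", "had", "was"],
)
print(result[0])
```

For the real files, text would be the result of f.read() on the book and stop_words the result of f.read().split() on the stop-word list, each read once outside any loop.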

How do I only use one column of imported text file? [closed]

Closed 6 years ago.
In my code I am importing 3 different lists of names and numbers, and I want to get the names that occur least often. Right now I get a list of all the names and how many times they occur, but the code also counts all the other columns, which I do not need.
1. How do I analyze the data of only one column of the text files?
2. How do I get only the words that occur once, not multiple times?
import re
import string
import operator
filelist = ['D.txt','A.txt','S.txt']
wordbank = {}
for file in filelist:
    article_one = re.findall(r'\w+', open(file,).read().lower())
    for word in article_one:
        word = word.lower().strip(string.punctuation)
        if word not in wordbank:
            wordbank[word] = 1
        else:
            wordbank[word] += 1
sortedwords = sorted(wordbank.items(), key=operator.itemgetter(1))
for word in sortedwords:
    print (word[1], word[0])
What separates your columns in the text files? For the sake of example, let's say they are tab-separated. Rather than use regular expressions, all you need to do is read each line of the text file and split it on '\t'. Then, to use only the first column, take index zero of the list that contains your split line.
What you are doing with wordbank should suffice for finding words that occur only once. All you have to do is check that the count of each word is exactly 1. For example:
filelist = ['D.txt','A.txt','S.txt']
wordbank = {}
for file in filelist:
    f = open(file, 'r')
    lines = f.readlines()
    for l in lines:
        line = l.split('\t')
        word = line[0]
        if word not in wordbank:
            wordbank[word] = 1
        else:
            wordbank[word] += 1
    f.close()
# Gather unique words
unique_words = []
for word in wordbank.keys():
    if wordbank[word] == 1:
        unique_words.append(word)
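The same idea can be written more compactly with Counter. This is a sketch under the answer's own assumption that the columns are tab-separated; names_occurring_once is a hypothetical helper and the rows are made-up data held in a list instead of being read from D.txt, A.txt and S.txt.

```python
from collections import Counter

def names_occurring_once(lines, sep="\t"):
    """Return first-column values that appear exactly once, in first-seen order."""
    # Only column 0 is counted; blank lines are skipped.
    counts = Counter(line.split(sep)[0].strip() for line in lines if line.strip())
    return [name for name, n in counts.items() if n == 1]

rows = ["Anna\t12\tx", "Bob\t7\ty", "Anna\t3\tz", "Cleo\t9\tw"]
once = names_occurring_once(rows)
print(once)
```

If the files use another separator, such as a comma or a semicolon, it can be passed as sep.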

To count the word frequency in multiple documents python [duplicate]

This question already has an answer here:
how to find frequency of the keys in a dictionary across multiple text files?
(1 answer)
Closed 9 years ago.
I have a list of the addresses of multiple text files in a dictionary 'd':
'd:/individual-articles/9.txt', 'd:/individual-articles/11.txt', 'd:/individual-articles/12.txt',...
and so on...
Now, I need to read each file in the dictionary and keep a list of the word occurrences of each and every word that occurs in the entire dictionary.
My output should be of the form:
the-500
a-78
in-56
and so on..
where 500 is the number of times the word "the" occurs in all the files in the dictionary..and so on..
I need to do this for all the words.
I am a Python newbie, please help!
My code below doesn't work; it shows no output! There must be a mistake in my logic. Please help me fix it!
import collections
import itertools
import os
from glob import glob
from collections import Counter
folderpaths='d:/individual-articles'
counter=Counter()
filepaths = glob(os.path.join(folderpaths,'*.txt'))
folderpath='d:/individual-articles/'
# i am creating my dictionary here, can be ignored
d = collections.defaultdict(list)
with open('topics.txt') as f:
    for line in f:
        value, *keys = line.strip().split('~')
        for key in filter(None, keys):
            if key=='earn':
                d[key].append(folderpath+value+".txt")
for key, value in d.items() :
    print(value)
word_count_dict={}
for file in d.values():
    with open(file,"r") as f:
        words = re.findall(r'\w+', f.read().lower())
        counter = counter + Counter(words)
        for word in words:
            word_count_dict[word].append(counter)
for word, counts in word_count_dict.values():
    print(word, counts)
Inspired by the Counter collection that you use:
import os
import re
from glob import glob
from collections import Counter
folderpaths = 'd:/individual-articles'
counter = Counter()
filepaths = glob(os.path.join(folderpaths,'*.txt'))
for file in filepaths:
    with open(file) as f:
        words = re.findall(r'\w+', f.read().lower())
        counter = counter + Counter(words)
print(counter)
Your code should give you an error in this line:
word_count_dict[word][file]+= 1
Because your word_count_dict is empty, word_count_dict[word][file] should raise a KeyError: word_count_dict[word] doesn't exist, so you can't do [file] on it.
And I found another error:
while file in d.items():
This would make file a tuple. But then you do f = open(file,"r"), so you assume file is a string. This would also raise an error.
This means that none of these lines is ever executed, which in turn means that either d.items() or filepaths is empty.
And to be honest, I don't understand why you have both loops or what you are trying to achieve there. You have generated a list of filenames to parse; you should just iterate over them. I also don't know why d is a dict. All you need is a list of all the files. You don't need to keep track of which key each file came from in the topics list, do you?
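Iterating directly over a list of file paths, as suggested above, could be sketched like this. The function count_words_in_folder, the temporary folder, and the two tiny sample files are all assumptions made so the example is self-contained; with real data the folder would be the one holding the articles.

```python
import os
import re
import tempfile
from collections import Counter
from glob import glob

def count_words_in_folder(folder):
    """Total occurrences of every word across all .txt files in folder."""
    totals = Counter()
    for path in sorted(glob(os.path.join(folder, "*.txt"))):
        with open(path, encoding="utf-8") as handle:
            # update() adds this file's counts to the running totals in place.
            totals.update(re.findall(r"\w+", handle.read().lower()))
    return totals

# Demo data written to a temporary folder, only so the sketch can run anywhere.
with tempfile.TemporaryDirectory() as folder:
    for name, text in [("9.txt", "the cat and the hat"), ("11.txt", "a cat in the hat")]:
        with open(os.path.join(folder, name), "w", encoding="utf-8") as handle:
            handle.write(text)
    totals = count_words_in_folder(folder)

# Print in the word-count form requested in the question.
for word, count in totals.most_common():
    print(f"{word}-{count}")
```

Using totals.update(...) avoids building a new Counter on every iteration, which counter = counter + Counter(words) does.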
