I would like to find word frequencies of a list of reserved words in multiple .txt files as a pandas data frame. I am using collections.Counter() objects and if a certain word does not appear in a text, the value of that word (key) is zero in the Counter().
Ideally, the result is a data frame where each row corresponds to each .txt file, column headers correspond to the reserved words and the entry in the row i column j corresponds to the frequency of j-th word in the i-th .txt file.
Here is my code. The problem is that the Counter() objects are not appended, in the sense of building a dictionary with a list of values for each key (reserved word), but end up summed instead:
for filepath in iglob(os.path.join(folder_path, '*.txt')):
    with open(filepath) as file:
        cnt = Counter()
        tokens = re.findall(r'\w+', file.read().lower())
        for word in tokens:
            if word in mylist:
                cnt[word] += 1
        for key in mylist:
            if key not in cnt:
                cnt[key] = 0
        dictionary = defaultdict(list)
        for key, value in cnt.items():
            dictionary[key].append(value)
print(dictionary)
Any hint will be much appreciated!
You need to create the dictionary for the dataframe before the loop and then copy/append the Counter values of each text file over.
#!/usr/bin/env python3

import os
import re

from collections import Counter
from glob import iglob


def main():
    folder_path = '...'
    keywords = ['spam', 'ham', 'parrot']

    keyword2counts = {keyword: list() for keyword in keywords}
    for filename in iglob(os.path.join(folder_path, '*.txt')):
        with open(filename) as file:
            words = re.findall(r'\w+', file.read().lower())
        keyword2count = Counter(word for word in words if word in keywords)
        for keyword in keywords:
            keyword2counts[keyword].append(keyword2count[keyword])

    print(keyword2counts)


if __name__ == '__main__':
    main()
Testing whether an item is in a list can be significantly slower than doing the same test on a set. So if this is too slow, you might use a set for keywords, or an additional set used just for the membership test.
And use a collections.OrderedDict prior to Python 3.7 (or CPython 3.6) if the order of the columns is relevant, since plain dicts did not preserve insertion order before then.
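Since the desired end result is a pandas data frame, the dictionary of lists that main() builds maps directly onto the DataFrame constructor. A minimal sketch, assuming pandas is available (the counts below are made up for illustration):

import pandas as pd

# Hypothetical counts for three files, in the shape produced by main() above
keyword2counts = {'spam': [3, 0, 1], 'ham': [0, 2, 2], 'parrot': [1, 1, 0]}

# Each key becomes a column header; row i holds the counts for the i-th file
df = pd.DataFrame(keyword2counts)
print(df)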
The output is unsorted, and sorting on the second column is not possible. Is there a special method to sort on the second value?
This program takes a text and counts how many times each word appears in it:
import string

with open("romeo.txt") as file:  # opens the file with text
    lst = []
    d = dict()
    uniquewords = open('romeo_unique.txt', 'w')
    for line in file:
        words = line.split()
        for word in words:  # loops through all words
            word = word.translate(str.maketrans('', '', string.punctuation)).upper()  # removes the punctuation
            if word not in d:
                d[word] = 1
            else:
                d[word] = d[word] + 1
            if word not in lst:
                lst.append(word)  # append only this unique word to the list
                uniquewords.write(str(word) + '\n')  # write the unique word to the file
print(d)
Dictionaries with default value
The code snippet:
d = dict()
...
if word not in d:
    d[word] = 1
else:
    d[word] = d[word] + 1
has become so common in Python that a subclass of dict was created to get rid of it. It goes by the name defaultdict and can be found in the module collections.
Thus we can simplify your code snippet to:
from collections import defaultdict
d = defaultdict(int)
...
d[word] = d[word] + 1
No need for this manual if/else test; if word is not in the defaultdict, it will be added automatically with initial value 0.
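For instance, a minimal self-contained sketch of the same counting pattern with a defaultdict (the sample words are made up):

from collections import defaultdict

d = defaultdict(int)  # missing keys start with value 0
for word in ['ROMEO', 'JULIET', 'ROMEO']:
    d[word] = d[word] + 1  # no if/else membership test needed

print(dict(d))  # {'ROMEO': 2, 'JULIET': 1}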
Counters
Counting occurrences is also something that is frequently useful; so much so that there exists a subclass of dict called Counter in module collections. It will do all the hard work for you.
from collections import Counter
import string

with open('romeo.txt') as input_file:
    counts = Counter(word.translate(str.maketrans('', '', string.punctuation)).upper()
                     for line in input_file for word in line.split())

with open('romeo_unique.txt', 'w') as output_file:
    for word in counts:
        output_file.write(word + '\n')
As far as I can tell from the documentation, Counters are not guaranteed to be ordered by number of occurrences by default; however:
- when I use them in the interactive Python interpreter they are always printed in decreasing order of occurrences;
- they provide a method .most_common() which is guaranteed to return its entries in decreasing order of occurrences.
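A quick illustration of .most_common() (the sample string is made up):

from collections import Counter

counts = Counter('abracadabra')
print(counts.most_common(2))  # [('a', 5), ('b', 2)]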
In Python, standard dictionaries are historically an unsorted data type; assuming that by sorting your output you mean sorting d, you have to sort it explicitly.
A couple of remarks first:
You are not sorting explicitly (e.g. by using sorted) on a given property. Dictionaries might appear to iterate in some "natural" order (e.g. when printing), but it is better to explicitly sort a dict.
You check the existence of a word in the lst variable, which is very slow since checking a list requires checking all entries until something is found (or not). It would be much better to check for existence in a dict.
I'm assuming by "the second column" you mean the information for each word that counts the order in which the word first appeared.
With that, I'd change the code to also record the word index of the first occurrence of each word, which then allows sorting on exactly that.
Edit: Fixed the code. The sorting yielded by sorted sorts by key, not value. That's what I get for not testing code before posting an answer.
import string
from operator import itemgetter

with open("romeo.txt") as file:  # opens the file with text
    first_occurence = {}
    uniqueness = {}
    word_index = 1
    uniquewords = open('romeo_unique.txt', 'w')
    for line in file:
        words = line.split()
        for word in words:  # loops through all words
            word = word.translate(str.maketrans('', '', string.punctuation)).upper()  # removes the punctuation
            if word not in uniqueness:
                uniqueness[word] = 1
            else:
                uniqueness[word] += 1
            if word not in first_occurence:
                first_occurence[word] = word_index
                uniquewords.write(str(word) + '\n')  # write the unique word to the file
            word_index += 1

print(sorted(uniqueness.items(), key=itemgetter(1)))
print(sorted(first_occurence.items(), key=itemgetter(1)))
I am an absolute beginner in Python. I am doing a textual analysis of Greek plays and counting the word frequencies of each word. Because the plays are very long, I am unable to see my full set of data; it only shows the words with the lowest frequencies because there is not enough space in the Python window. I am thinking of converting it to a .csv file. My full code is below:
#read the file as one string and split the string into a list of separate words
input = open('Aeschylus.txt', 'r')
text = input.read()
wordlist = text.split()

#read file containing stopwords and split the string into a list of separate words
stopwords = open("stopwords .txt", 'r').read().split()

#remove stopwords
wordsFiltered = []
for w in wordlist:
    if w not in stopwords:
        wordsFiltered.append(w)

#create dictionary by counting no of occurences of each word in list
wordfreq = [wordsFiltered.count(x) for x in wordsFiltered]

#create word-frequency pairs and create a dictionary
dictionary = dict(zip(wordsFiltered, wordfreq))

#sort by decreasing frequency and print
aux = [(dictionary[word], word) for word in dictionary]
aux.sort()
aux.reverse()
for y in aux: print y

import csv

with open('Aeschylus.csv', 'w') as csvfile:
    fieldnames = ['dictionary[word]', 'word']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()
    writer.writerow({'dictionary[word]': '1', 'word': 'inherited'})
    writer.writerow({'dictionary[word]': '1', 'word': 'inheritance'})
    writer.writerow({'dictionary[word]': '1', 'word': 'inherit'})
I found the code for the CSV part on the internet. What I'm hoping to get is the full list of data from the highest to lowest frequency. Using the code I have right now, Python seems to be totally ignoring the CSV part and just printing the data as if I hadn't coded for the CSV at all.
Any idea on what I should code to see my intended result?
Thank you.
Since you have a dictionary where the words are keys and their frequencies the values, a DictWriter is ill-suited. It is good for sequences of mappings that share some common set of keys, used as the columns of the csv. For example, if you had a list of dicts such as the ones you manually create:
a_list = [{'dictionary[word]': '1', 'word': 'inherited'},
          {'dictionary[word]': '1', 'word': 'inheritance'},
          {'dictionary[word]': '1', 'word': 'inherit'}]
then a DictWriter would be the tool for the job. But instead you have a single dictionary like:
dictionary = {'inherited': 1,
              'inheritance': 1,
              'inherit': 1,
              ...: ...}
But, you've already built a sorted list of (freq, word) pairs as aux, which is perfect for writing to csv:
with open('Aeschylus.csv', 'wb') as csvfile:
    header = ['frequency', 'word']
    writer = csv.writer(csvfile)
    writer.writerow(header)
    # Note the plural method name
    writer.writerows(aux)
python seems to be totally ignoring the csv part and just printing the data as if I didn't code for the csv.
sounds rather odd. At least you should've gotten a file Aeschylus.csv containing:
dictionary[word],word
1,inherited
1,inheritance
1,inherit
Your frequency counting method could also be improved. At the moment
#create dictionary by counting no of occurences of each word in list
wordfreq = [wordsFiltered.count(x) for x in wordsFiltered]
has to loop through the list wordsFiltered for each word in wordsFiltered, so O(n²). You could instead iterate through the words in the file, filter, and count as you go. Python has a specialized dictionary for counting hashable objects called Counter:
from __future__ import print_function
from collections import Counter
import csv

# Many ways to go about this, could for example yield from (<gen expr>)
def words(filelike):
    for line in filelike:
        for word in line.split():
            yield word

def remove(iterable, stopwords):
    stopwords = set(stopwords)  # O(1) lookups instead of O(n)
    for word in iterable:
        if word not in stopwords:
            yield word

if __name__ == '__main__':
    with open("stopwords.txt") as f:
        stopwords = f.read().split()
    with open('Aeschylus.txt') as wordfile:
        wordfreq = Counter(remove(words(wordfile), stopwords))
Then, as before, print the words and their frequencies, beginning from most common:
for word, freq in wordfreq.most_common():
    print(word, freq)
And/or write as csv:
# Since you're using python 2, 'wb' and no newline=''
with open('Aeschylus.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['word', 'freq'])
    # If you want to keep most common order in CSV as well.
    # Otherwise wordfreq.items() would do as well.
    writer.writerows(wordfreq.most_common())
I have code that returns a dictionary from a CSV file. With this, I would like to count the number of dictionary words that appear in any given string. For instance, if I input:
"How many words in this string are from dict1"
How would I loop through this string and count the words from dict1 that appear in the string?
Code:
import csv

def read_csv(filename, col_list):
    """This function expects the name of a CSV file and a list of strings
    representing a subset of the headers of the columns in the file, and
    returns a dictionary of the data in those columns."""
    with open(filename, 'r') as f:
        # Better convert reader to a list (items represent every row)
        reader = list(csv.DictReader(f))
        dict1 = {}
        for col in col_list:
            dict1[col] = []
            # Going through every row of the file
            for row in reader:
                # Append to the list the row item of this key
                dict1[col].append(row[col])
    return dict1
How about something like this:
str_ = "How many words in this string are from dict1"
dict_1 = dict(a='many', b='words')

# With a list comprehension
len([x for x in str_.split() if x in dict_1.values()])
# 2

# These ones don't count duplicates because they use sets
len(set(str_.split()).intersection(dict_1.values()))
# 2
len(set(str_.split()) & set(dict_1.values()))  # Equivalent, but different syntax
# 2

# To be case insensitive, do e.g.
words = str_.lower().split()
dict_words = map(str.lower, dict_1.values())
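For instance, plugging those lowercased pieces back into the same set intersection (a minimal sketch continuing the snippet above):

str_ = "How MANY Words in this string are from dict1"
dict_1 = dict(a='many', b='words')

words = str_.lower().split()
dict_words = map(str.lower, dict_1.values())
# set() consumes the map iterator; duplicates are ignored as before
print(len(set(words) & set(dict_words)))  # 2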
Here's what I have so far.
from itertools import permutations

original = str(input('What word would you like to unscramble?: '))

for bob in permutations(original):
    print(''.join(bob))

inputFile = open('dic.txt', 'r')  # note: the filename must be a string
compare = inputFile.read()
inputFile.close()
Basically, what I'm trying to do is create a word unscrambler by having Python find all possible rearrangements of a string and then only print the rearrangements that are actual words, which can be found out by running each rearrangement through a dictionary file (in this case dic.txt) to see if there is a match. I am running Python 3.3, if that matters. What do I need to add in order to compare the rearrangements with the dictionary file?
You could store the permutations in a list, read the dictionary into another list, and select the entries present in both lists…
For example this way:
from itertools import permutations

original = str(input('What word would you like to unscramble?: '))

perms = []
for bob in permutations(original):
    perms.append(''.join(bob))

inputFile = open('dic.txt', 'r')
dict_entries = inputFile.read().split('\n')
inputFile.close()

for word in [perm for perm in perms if perm in dict_entries]:
    print(word)
(Assuming the dictionary contains one word per line…)
Read the dictionary file into a list line by line, iterate through each of the rearrangements and check if it's in the dictionary like so:
if word in dict_list:
    ...
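A minimal sketch of that approach, assuming dic.txt holds one word per line (a set is used here instead of a list for faster membership tests):

from itertools import permutations

# Read the dictionary file into a set of words
with open('dic.txt') as f:
    dict_list = set(line.strip() for line in f)

original = input('What word would you like to unscramble?: ')
for p in permutations(original):
    word = ''.join(p)
    if word in dict_list:
        print(word)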
Although this puts a little more up-front effort into processing the input file, once you've built the word_dict it's much more efficient to look up the sorted form of the word rather than build and check for all permutations:
def get_word_dict(filename):
    words = {}
    with open(filename) as word_dict:
        for line in word_dict:
            word = line.strip()
            # sorted() returns a list, which is unhashable; join it into a string key
            key = ''.join(sorted(word))
            if key not in words:
                words[key] = []
            words[key].append(word)
    return words

word_dict = get_word_dict('dic.txt')

original = input("What word would you like to unscramble: ")
key = ''.join(sorted(original))
if key in word_dict:
    for word in word_dict[key]:
        print(word)
else:
    print("Not in the dictionary.")
This will be particularly beneficial if you want to look up more than one word - you can process the file once, then repeatedly refer to the word_dict.
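For example, a hypothetical session reusing the same word_dict for several lookups (the scrambled inputs are made up):

word_dict = get_word_dict('dic.txt')  # process the file once

for scrambled in ['tac', 'odg', 'zzzz']:
    key = ''.join(sorted(scrambled))
    # .get() returns the fallback list when no anagram is known
    print(scrambled, '->', word_dict.get(key, ['(no match)']))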
I have a list of the addresses of multiple text files in a dictionary 'd':
'd:/individual-articles/9.txt', 'd:/individual-articles/11.txt', 'd:/individual-articles/12.txt',...
and so on...
Now, I need to read each file listed in the dictionary and keep a count of the occurrences of every word that occurs across the entire collection.
My output should be of the form:
the-500
a-78
in-56
and so on..
where 500 is the number of times the word "the" occurs across all the files in the dictionary, and so on.
I need to do this for all the words.
I am a Python newbie, so please help!
My code below doesn't work; it shows no output. There must be a mistake in my logic, please rectify it!
import collections
import itertools
import os
from glob import glob
from collections import Counter

folderpaths = 'd:/individual-articles'
counter = Counter()
filepaths = glob(os.path.join(folderpaths, '*.txt'))
folderpath = 'd:/individual-articles/'

# i am creating my dictionary here, can be ignored
d = collections.defaultdict(list)
with open('topics.txt') as f:
    for line in f:
        value, *keys = line.strip().split('~')
        for key in filter(None, keys):
            if key == 'earn':
                d[key].append(folderpath + value + ".txt")

for key, value in d.items():
    print(value)

word_count_dict = {}
for file in d.values():
    with open(file, "r") as f:
        words = re.findall(r'\w+', f.read().lower())
        counter = counter + Counter(words)
        for word in words:
            word_count_dict[word].append(counter)

for word, counts in word_count_dict.values():
    print(word, counts)
Inspired by the Counter collection that you use:
import os
import re
from glob import glob
from collections import Counter

folderpaths = 'd:/individual-articles'
counter = Counter()
filepaths = glob(os.path.join(folderpaths, '*.txt'))

for file in filepaths:
    with open(file) as f:
        words = re.findall(r'\w+', f.read().lower())
        counter = counter + Counter(words)
print(counter)
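As a side note (an optimization not in the original answer): Counter.update adds counts in place, so the loop can avoid building and discarding an intermediate Counter for every file, and most_common() then prints the output in the word-count format you described:

import os
import re
from glob import glob
from collections import Counter

counter = Counter()
for filepath in glob(os.path.join('d:/individual-articles', '*.txt')):
    with open(filepath) as f:
        # update() adds the counts in place; no intermediate Counter objects
        counter.update(re.findall(r'\w+', f.read().lower()))

# Print each word and its total count, most frequent first, e.g. "the-500"
for word, count in counter.most_common():
    print('{}-{}'.format(word, count))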
Your code should give you an error in this line:
word_count_dict[word][file]+= 1
Because your word_count_dict is empty, when you do word_count_dict[word][file] you get a KeyError: word_count_dict[word] doesn't exist, so you can't do [file] on it.
And I found another error:
while file in d.items():
This would make file a tuple. But then you do f = open(file,"r"), so you assume file is a string. This would also raise an error.
This means that none of these lines are ever executed, which in turn means that either the while file in d.items(): loop or the for file in filepaths: loop never runs.
And to be honest, I don't understand why you have both of them; I don't understand what you are trying to achieve there. You have generated a list of filenames to parse; you should just iterate over them. I also don't know why d is a dict. All you need is a list of all the files. You don't need to keep track of which key the file came from in the topics list, do you?
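Putting that advice together: since each value in d is a list of paths, the file loop needs to flatten them before opening. A minimal sketch, assuming d was built as in the question:

import re
from itertools import chain
from collections import Counter

counter = Counter()
# d maps topic keys to *lists* of file paths; chain flattens them into one stream
for filepath in chain.from_iterable(d.values()):
    with open(filepath) as f:
        counter.update(re.findall(r'\w+', f.read().lower()))
print(counter)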