I am an absolute beginner in Python. I am doing a textual analysis of Greek plays and counting the word frequencies of each word. Because the plays are very long, I am unable to see my full set of data; it only shows the words with the lowest frequencies because there is not enough space in the Python window. I am thinking of converting it to a .csv file. My full code is below:
#read the file as one string and split the string into a list of separate words
input = open('Aeschylus.txt', 'r')
text = input.read()
wordlist = text.split()
#read file containing stopwords and split the string into a list of separate words
stopwords = open("stopwords .txt", 'r').read().split()
#remove stopwords
wordsFiltered = []
for w in wordlist:
    if w not in stopwords:
        wordsFiltered.append(w)
#create dictionary by counting no of occurrences of each word in list
wordfreq = [wordsFiltered.count(x) for x in wordsFiltered]
#create word-frequency pairs and create a dictionary
dictionary = dict(zip(wordsFiltered,wordfreq))
#sort by decreasing frequency and print
aux = [(dictionary[word], word) for word in dictionary]
aux.sort()
aux.reverse()
for y in aux: print y
import csv
with open('Aeschylus.csv', 'w') as csvfile:
    fieldnames = ['dictionary[word]', 'word']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({'dictionary[word]': '1', 'word': 'inherited'})
    writer.writerow({'dictionary[word]': '1', 'word': 'inheritance'})
    writer.writerow({'dictionary[word]': '1', 'word': 'inherit'})
I found the code for the csv on the internet. What I'm hoping to get is the full list of data from the highest to lowest frequency. Using this code I have right now, python seems to be totally ignoring the csv part and just printing the data as if I didn't code for the csv.
Any idea on what I should code to see my intended result?
Thank you.
Since you have a dictionary where the words are keys and their frequencies the values, a DictWriter is ill suited. It is good for sequences of mappings that share some common set of keys, used as the columns of the csv. For example, if you had a list of dicts such as the ones you manually create:
a_list = [{'dictionary[word]': '1', 'word': 'inherited'},
          {'dictionary[word]': '1', 'word': 'inheritance'},
          {'dictionary[word]': '1', 'word': 'inherit'}]
then a DictWriter would be the tool for the job. But instead you have a single dictionary like:
dictionary = {'inherited': 1,
              'inheritance': 1,
              'inherit': 1,
              ...: ...}
But, you've already built a sorted list of (freq, word) pairs as aux, which is perfect for writing to csv:
with open('Aeschylus.csv', 'wb') as csvfile:
    header = ['frequency', 'word']
    writer = csv.writer(csvfile)
    writer.writerow(header)
    # Note the plural method name
    writer.writerows(aux)
"python seems to be totally ignoring the csv part and just printing the data as if I didn't code for the csv"
sounds rather odd. At least you should've gotten a file Aeschylus.csv containing:
dictionary[word],word
1,inherited
1,inheritance
1,inherit
Your frequency counting method could also be improved. At the moment
#create dictionary by counting no of occurrences of each word in list
wordfreq = [wordsFiltered.count(x) for x in wordsFiltered]
has to loop through the list wordsFiltered for each word in wordsFiltered, so O(n²). You could instead iterate through the words in the file, filter, and count as you go. Python has a specialized dictionary for counting hashable objects called Counter:
from __future__ import print_function

from collections import Counter
import csv


# Many ways to go about this, could for example yield from (<gen expr>)
def words(filelike):
    for line in filelike:
        for word in line.split():
            yield word


def remove(iterable, stopwords):
    stopwords = set(stopwords)  # O(1) lookups instead of O(n)
    for word in iterable:
        if word not in stopwords:
            yield word


if __name__ == '__main__':
    with open("stopwords.txt") as f:
        stopwords = f.read().split()

    with open('Aeschylus.txt') as wordfile:
        wordfreq = Counter(remove(words(wordfile), stopwords))
Then, as before, print the words and their frequencies, beginning from most common:
for word, freq in wordfreq.most_common():
    print(word, freq)
And/or write as csv:
# Since you're using python 2, 'wb' and no newline=''
with open('Aeschylus.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['word', 'freq'])
    # If you want to keep most common order in CSV as well. Otherwise
    # wordfreq.items() would do as well.
    writer.writerows(wordfreq.most_common())
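For reference, on Python 3 the same write would use text mode with newline='' instead of 'wb'. A minimal sketch, assuming the same wordfreq Counter as above:
import csv

# Python 3 variant (assumption: wordfreq is the Counter built above)
with open('Aeschylus.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['word', 'freq'])
    writer.writerows(wordfreq.most_common())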
Related
I'm attempting to write out the results of a frequency count of specific words in a text file, based on a collection of words in a Python list (I haven't included the list in the code listing as it has several hundred entries).
file_path = 'D:/TestHedges/Hedges_Test_11.csv'
corpus_root = test_path
wordlists = PlaintextCorpusReader(corpus_root, '.*')
print(wordlists.fileids())

CIK_List = []
freq_out = []

for filename in glob.glob(os.path.join(test_path, '*.txt')):
    CIK = filename[33:39]
    CIK = CIK.strip('_')
    # CIK = CIK.strip('_0') commented out to see if it deals with just removing _. It does not 13/9/2020
    newstext = wordlists.words()
    fdist = nltk.FreqDist([w.lower() for w in newstext])
    CIK_List.append(CIK)

    with open(file_path, 'w', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["CIK"] + word_list)
        for val in CIK_List:
            writer.writerow([val])
        for m in word_list:
            print(CIK, [fdist[m]], end='')
            writer.writerows([fdist[m]])
My problem is with the writing of fdist[m] as a row into a .csv file. It is generating an error
_csv.Error: iterable expected, not int
How can I re-write this to place the frequency distribution into a row in a .csv file?
Thanks in advance
You have two choices - either use writerow instead of writerows, or first create a list of values and then pass it to writer.writerows instead of fdist[m]. Each of the row values in that list should be a tuple (or an iterable). Therefore, for writerows to work you would have to encapsulate it in a tuple:
writer.writerows([(fdist[m],)])
Here, the comma denotes a 1-value tuple.
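As a minimal illustration of the difference (a sketch with made-up values, not your data):
import csv

with open('demo.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([3])             # writerow: one row with a single cell
    writer.writerows([(3,), (5,)])   # writerows: an iterable of rows, one cell each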
In order to write all of the values in one row instead of this code:
for m in word_list:
    print(CIK, [fdist[m]], end='')
    writer.writerows([fdist[m]])
You should use:
for m in word_list:
    print(CIK, [fdist[m]], end='')
writer.writerows(([fdist[m] for m in word_list],))
Note the use of a list comprehension.
On a different note, just by looking at your code, it seems to me that you could do the same without involving NLTK library just by using collections.Counter from standard library. It is the underlying container in FreqDist class.
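A rough sketch of what that could look like (assuming word_list holds your reserved words and filename comes from your existing glob loop):
from collections import Counter
import re

word_set = set(word_list)  # assumption: word_list is your list of target words
with open(filename) as f:
    tokens = re.findall(r'\w+', f.read().lower())
counts = Counter(w for w in tokens if w in word_set)  # missing words count as 0
row = [counts[w] for w in word_list]  # frequencies in word_list order, ready for writer.writerow(row)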
I want to do lexical normalization on a corpus using a dictionary. The corpus has eight thousand lines and the dictionary has thousands of word pairs (nonstandard : standard).
I have adopted an approach which is discussed here. The code looks like this:
with open("corpus.txt", 'r', encoding='utf8') as main:
words = main.read().split()
lexnorm = {'nonstandard1': 'standard1', 'nonstandard2': 'standard2', 'nonstandard3': 'standard3', and so on}
for x in lexnorm:
for y in words:
if lexnorm[x][0] == y:
y == x[1]
text = ' '.join(lexnorm.get(y, y) for y in words)
print(text)
The code above works well, but I'm facing a problem since there are thousands of word pairs in the dictionary. Is it possible to represent the dictionary through a text file?
Last question: the output file of the code consists of only one line. It would be great if it had the same number of lines as the original corpus does.
Could anyone help me with this? I'd be thankful.
One way to output the dictionary as a text file is as a JSON string:
import json

lexnorm = {'nonstandard1': 'standard1', 'nonstandard2': 'standard2', 'nonstandard3': 'standard3'} # etc.

with open('lexnorm.txt', 'w') as f:
    json.dump(lexnorm, f)
See my comment to your original. I am only guessing what you are trying to do:
import json, re

with open('lexnorm.txt') as f:
    lexnorm = json.load(f) # read back lexnorm dictionary

with open("corpus.txt", 'r', encoding='utf8') as main, open('new_corpus.txt', 'w') as new_main:
    for line in main:
        words = re.split(r'[^a-zA-Z]+', line)
        for word in words:
            if word in lexnorm:
                line = line.replace(word, lexnorm[word])
        new_main.write(line)
The above program reads in the corpus.txt file line by line and attempts to intelligently split the line into words. Splitting on a single space is not adequate. Consider the following sentence:
'"The fox\'s foot grazed the sleeping dog, waking it."'
A standard split on a single space yields:
['"The', "fox's", 'foot', 'grazed', 'the', 'sleeping', 'dog,', 'waking', 'it."']
You would never be able to match The, fox, dog nor it.
There are several ways to handle it. I am splitting on one or more non-alpha characters. This may need to be "tweaked" if the words in lexnorm consist of characters other than a-z:
re.split(r'[^a-zA-Z]+', '"The fox\'s foot grazed the sleeping dog, waking it."')
Yields:
['', 'The', 'fox', 's', 'foot', 'grazed', 'the', 'sleeping', 'dog', 'waking', 'it', '']
Once the line is split into words, each word is looked up in the lexnorm dictionary and if found then a simple replace of that word is done in the original line. Finally, the line and any replacements done to that line are written out to a new file. You can then delete the old file and rename the new file.
Think about how you might handle words that would match if they had been converted to lower case first.
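For example, one possibility (just a sketch, assuming the lexnorm keys are stored in lower case) is to look each word up by its lower-cased form inside the same loop:
for word in words:
    key = word.lower()
    if key in lexnorm:
        line = line.replace(word, lexnorm[key])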
Update (Major Optimization)
Since there are likely to be a lot of duplicate words in a file, an optimization is to process each unique word once, which can be done if the file is not so large that it cannot be read into memory:
import json, re

with open('lexnorm.txt') as f:
    lexnorm = json.load(f) # read back lexnorm dictionary

with open("corpus.txt", 'r', encoding='utf8') as main:
    text = main.read()

word_set = set(re.split(r'[^a-zA-Z]+', text))

for word in word_set:
    if word in lexnorm:
        text = text.replace(word, lexnorm[word])

with open("corpus.txt", 'w', encoding='utf8') as main:
    main.write(text)
Here the entire file is read into text, split into words and then the words are added to a set word_set guaranteeing the uniqueness of words. Then each word in word_set is looked up and replaced in the entire text and the entire text rewritten back out to the original file.
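One caveat with str.replace is that it substitutes substrings wherever they occur, so replacing a short word could also alter longer words that contain it. If that matters for your dictionary, a regex with word boundaries avoids it; a sketch, assuming the keys are plain alphabetic words:
import re

for word in word_set:
    if word in lexnorm:
        # \b restricts the match to whole words only
        text = re.sub(r'\b' + re.escape(word) + r'\b', lexnorm[word], text)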
I would like to find word frequencies of a list of reserved words in multiple .txt files as a pandas data frame. I am using collections.Counter() objects and if a certain word does not appear in a text, the value of that word (key) is zero in the Counter().
Ideally, the result is a data frame where each row corresponds to each .txt file, column headers correspond to the reserved words and the entry in the row i column j corresponds to the frequency of j-th word in the i-th .txt file.
Here is my code, but the problem is that the Counter() values are not appended to the dictionary as multiple values per key (one per reserved word); instead they are summed:
for filepath in iglob(os.path.join(folder_path, '*.txt')):
    with open(filepath) as file:
        cnt = Counter()
        tokens = re.findall(r'\w+', file.read().lower())
        for word in tokens:
            if word in mylist:
                cnt[word] += 1
        for key in mylist:
            if key not in cnt:
                cnt[key] = 0
        dictionary = defaultdict(list)
        for key, value in cnt.items():
            dictionary[key].append(value)

print(dictionary)
Any hint will be much appreciated!
You need to create the dictionary for the dataframe before the loop and then copy/append the Counter values of each text file over.
#!/usr/bin/env python3

import os
import re
from collections import Counter
from glob import iglob


def main():
    folder_path = '...'
    keywords = ['spam', 'ham', 'parrot']

    keyword2counts = {keyword: list() for keyword in keywords}

    for filename in iglob(os.path.join(folder_path, '*.txt')):
        with open(filename) as file:
            words = re.findall(r'\w+', file.read().lower())
        keyword2count = Counter(word for word in words if word in keywords)
        for keyword in keywords:
            keyword2counts[keyword].append(keyword2count[keyword])

    print(keyword2counts)


if __name__ == '__main__':
    main()
Testing whether an item is in a list can be significantly slower than doing the same test against a set. So if this is too slow, you might use a set for keywords, or an additional set just for the membership test.
And use a collections.OrderedDict prior to Python 3.7 (or CPython 3.6) if the order of the columns is relevant.
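From there, building the data frame you described is one more step. A sketch, assuming pandas is installed, keyword2counts was filled as above, and you also collected the file names in a list (hypothetically filenames) in the same order:
import pandas as pd

# assumption: filenames was appended to inside the same loop, one entry per .txt file
df = pd.DataFrame(keyword2counts, index=filenames)
print(df)  # rows are files, columns are the reserved words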
So I'm trying to attach a bunch of values in a huge list to their corresponding headings stored in another list. For example, right now I have a list with only the headings (e.g. ['Target'], ['Choice']), and I want to take values corresponding with those headings (e.g. 'Target':3, 'Choice':1) and append them to the corresponding value in the headings list, while also stripping off the headings from the values in the second list, so everything after the ':' gets attached to its corresponding value in the original list. Then I want to take each of these strings in the new list (e.g. [Target: 1, 2, 1, 3 . . .]) and import them into a column in a csv or excel file. I'm admittedly a bit of a noob, but I've worked really hard thus far. Here is what I have that doesn't work (excluding exporting to csv, as I have no idea how to do that).
key_name = ['distractor_shape','stimulus_param','prompt_format','test_format','d0_type','d1_type','d2_type','encoding_rt','verification_rt', 'target', 'choice','correct','debrief','response','CompletionCode']

def headers(list):
    brah_list = []
    for dup in list:
        z = dup.count(',')
        nw = dup.split(',',z)
        brah_list.append(nw)
    return brah_list

def parse_data(filename):
    big_list = []
    with open(filename) as f:
        for word in f:
            x = word.count(',')
            new_word = word.split(',',x)
            big_list.append(new_word)
    return big_list

b_list = parse_data('A1T6RFUU0OTS0M.txt')
k_list = headers(key_name)

def f_un(things):
    for t in things:
        return t

h_list = f_un(k_list)

def f_in(stuff):
    for sf in stuff:
        for s in sf:
            print(s)
            z = 0
            head_r = "h_list[z]"
            if s.startswith(head_r):
                s.strip(head_r)
                h_list.append(s)
                z += 1
    print(stuff)

f_in(b_list)
If I understand your question correctly, you have a list, and you have some corresponding values in another list relative to the first list. If that is the case, and assuming that all your values correspond in the correct order, you could use a list comprehension:
lst1 = ['th', 'an', 'bu']
lst2 = ['e', 'd', 't']
#here
lst1 = [lst1[i] + lst2[i] for i in range(len(lst1))]
#here
print(lst1) #output: ['the', 'and', 'but']
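The same pairing can also be written with zip, which avoids indexing; a small variation on the snippet above:
lst1 = [a + b for a, b in zip(lst1, lst2)]
print(lst1)  # output: ['the', 'and', 'but']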
I'm still having a hard time understanding the question, but if my guess is right, I think this is what you need. Note that this code uses the excellent tablib module:
from collections import defaultdict
import re

import tablib

data = defaultdict(list)

# input.txt:
# [t_show:[1471531808217,1471531809222,1471531809723,1471531812159,1471531815049],t_remove:[1471531809222,1471531809722,1471531812158,1471531815046,null],param:target:0,distractor_shape:square,stimulus_param:shape1:star,shape2:plus,relation:below,negate:false,prompt_format:verbal,test_format:visual,d0_type:shape:false,order:false,relation:true,d1_type:shape:2,order:true,relation:false,d2_type:shape:2,order:true,relation:true,encoding_rt:2435,verification_rt:2887,target:0,choice:0,correct:true]

# I had to remove "debrief", "response", and "CompletionCode", since those weren't in the example data
headers = ['distractor_shape','stimulus_param','prompt_format','test_format','d0_type','d1_type','d2_type','encoding_rt','verification_rt', 'target', 'choice','correct']#,'debrief','response','CompletionCode']

with open('input.txt', 'rt') as f:
    # The input data seems to have some zero-width unicode characters interspersed.
    # This next line removes them.
    input = f.read()[1:-1].replace('\u200c\u200b', '')

# This matches things like foo:bar,... as well as foo:[1,2,3],...
for item in re.findall(r'[^,\[]+(?:\[[^\]]*\])?[^,\[]*', input):
    # Split on the first colon
    key, value = item.partition(':')[::2]
    data[key].append(value)

dataset = tablib.Dataset()
dataset.headers = headers

for row in zip(*(data[header] for header in headers)):
    dataset.append(row)

with open('out.csv', 'wt') as f:
    f.write(dataset.csv)

# out.csv:
# distractor_shape,stimulus_param,prompt_format,test_format,d0_type,d1_type,d2_type,encoding_rt,verification_rt,target,choice,correct
# square,shape1:star,verbal,visual,shape:false,shape:2,shape:2,2435,2887,0,0,true

with open('out.xlsx', 'wb') as f:
    f.write(dataset.xlsx)
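If tablib isn't available, the csv output could be produced with the standard csv module alone; a sketch, assuming the same data and headers as in the code above (the xlsx export is what tablib adds on top):
import csv

with open('out.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(zip(*(data[header] for header in headers)))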
I'm using python to convert the words in sentences in a text file to individual tokens in a list for the purpose of counting up word frequencies. I'm having trouble converting the different sentences into a single list. Here's what I do:
f = open('music.txt', 'r')
sent = [word.lower().split() for word in f]
That gives me the following list:
[['party', 'rock', 'is', 'in', 'the', 'house', 'tonight'],
 ['everybody', 'just', 'have', 'a', 'good', 'time'], ...]
Since the sentences in the file were in separate lines, it returns this list of lists and defaultdict can't identify the individual tokens to count up.
I tried the following list comprehension to isolate the tokens in the different lists and return them to a single list, but it returns an empty list instead:
sent2 = [[w for w in word] for word in sent]
Is there a way to do this using list comprehensions? Or perhaps another easier way?
Just use a nested loop inside the list comprehension:
sent = [word for line in f for word in line.lower().split()]
There are some alternatives to this approach, for example using itertools.chain.from_iterable(), but I think the nested loop is much easier in this case.
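For completeness, the itertools version would look roughly like this; the nested comprehension above is usually easier to read:
from itertools import chain

with open('music.txt') as f:
    sent = list(chain.from_iterable(line.lower().split() for line in f))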
Just read the entire file into memory, as a single string, and apply split once to that string.
There is no need to read the file line by line in such a case.
Therefore your code can be as short as:
sent = open("music.txt").read().split()
(A few niceties, like closing the file and checking for errors, make the code a little longer, of course.)
Since you want to be counting word frequencies, you can use the collections.Counter class for that:
from collections import Counter

counter = Counter()
for word in open("music.txt").read().split():
    counter[word] += 1
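Counter can also consume the iterable directly, and most_common gives the words ordered by frequency:
from collections import Counter

counter = Counter(open("music.txt").read().split())
print(counter.most_common(10))  # the ten most frequent words and their counts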
List comprehensions can do the job but will accumulate everything in memory. For large inputs this could be an unacceptable cost. The below solution will not accumulate large amounts of data in memory, even for large files. The final product is a dictionary of the form {token: occurrences}.
import itertools

def distinct_tokens(filename):
    tokendict = {}
    f = open(filename, 'r')
    # lazily lower-case and split each line, then chain all the words together
    tokens = itertools.imap(lambda L: iter(L.lower().split()), f)
    for tok in itertools.chain.from_iterable(tokens):
        if tok in tokendict:
            tokendict[tok] += 1
        else:
            tokendict[tok] = 1
    f.close()
    return tokendict
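A usage sketch (Python 2, to match the imap above):
freqs = distinct_tokens('music.txt')
# print tokens from most to least frequent
for tok in sorted(freqs, key=freqs.get, reverse=True):
    print tok, freqs[tok]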