TAB break for line split in a dictionary in Python

I am attempting to write a Python program that builds a dictionary from a file containing a list of words, each given a score and standard deviation. My program looks like this:
theFile = open('word-happiness.csv', 'r')
theFile.close()

def make_happiness_table(filename):
    '''make_happiness_table: string -> dict
    creates a dictionary of happiness scores from the given file'''
    with open(filename) as f:
        d = dict(line.split(' ') for line in f)
    return d

make_happiness_table("word-happiness.csv")
table = make_happiness_table("word-happiness.csv")

(score, stddev) = table['hunger']
print("the score for 'hunger' is %f" % score)
My .csv file is in the form
word{TAB}score{TAB}standard_deviation
and I am trying to create the dictionary that way. How can I create such a dictionary so that I can look up a word such as 'hunger' and get its score and standard deviation?

def make_happiness_table(filename):
    with open(filename) as f:
        d = dict()
        for line in f:
            word, score, std = line.split()  # splits on any consecutive run of whitespace
            d[word] = score, std  # may want to make these floats: d[word] = float(score), float(std)
    return d
Note that if your word can have a tab character in it, but you're guaranteed that you only have 3 fields (word, score, std), you can split the string from the right (str.rsplit), only splitting twice (resulting in 3 fields at the end), e.g. word, score, std = line.rsplit(None, 2).
As mentioned in the comments above, you can also use the csv module to read these sorts of files -- csv really shines if your fields can be "quoted". e.g.:
"this is field 0" "this is field 1" "this is field 2"
If you don't have that scenario, then I find that str.split works just fine.
Also, unrelated, but your code calls make_happiness_table twice (the first time you don't assign the return value to anything). The first call is useless (all it does is read the file and build a dictionary which you can never use). Finally, opening and closing theFile at the beginning of your script is also just a waste, since you don't do anything with the file there.

If you are sure your word will not contain spaces, you can just split the line, e.g.
word, score, stddev = line.split()
But if the word can contain spaces, use the tab character \t to split, e.g.
word, score, stddev = line.split('\t')
But for the very generic case, when the word may itself contain a tab, use the csv module (note that csv.reader takes a file object, not a file name):
import csv

with open(filename) as f:
    reader = csv.reader(f, dialect='excel-tab')
    for word, score, stddev in reader:
        ...
and then you can build a dict mapping each word to its (score, stddev), e.g.
word_dict[word] = (score, stddev)
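Putting that together, a minimal sketch of the whole lookup (converting to float is my assumption, since the scores are numeric):

import csv

def make_happiness_table(filename):
    # assumes three tab-separated fields per line: word, score, stddev
    word_dict = {}
    with open(filename) as f:
        for word, score, stddev in csv.reader(f, dialect='excel-tab'):
            word_dict[word] = (float(score), float(stddev))
    return word_dict

table = make_happiness_table('word-happiness.csv')
score, stddev = table['hunger']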

Related

Finding specific words and adding them to a dictionary

I want to find the words that start with "CHAPTER" and add them to a dictionary.
I have written some code, but it gives me 0 as output every time:
def wordcount(filename, listwords):
    try:
        file = open(filename, "r")
        read = file.readlines()
        file.close()
        for word in listwords:
            lower = word.lower()
            count = 0
            for sentence in read:
                line = sentence.split()
                for each in line:
                    line2 = each.lower()
                    line2 = line2.strip("")
                    if lower == line2:
                        count += 1
            print(lower, ":", count)
    except FileExistError:
        print("The file is not there ")

wordcount("dad.txt", ["CHAPTER"])
the txt file is here
EDIT:
The problem was the encoding type, and I solved it. But now the question is: how can I add these words to a dictionary? And how can I make this code case sensitive? I mean, when I type wordcount("dad.txt", ["CHAPTER"]), I want it to find only the upper-case CHAPTER words.
It cannot work because of this line:
if lower == line2:
you can use this line to find the words that start with "CHAPTER"
if line2.startswith(lower):
I notice that you need to check whether a word starts with one of the words from listwords, rather than testing equality (lower == line2). Hence, you should use the startswith method.
You can have simpler code, something like this.
def wordcount(filename, listwords):
    listwords = [s.lower() for s in listwords]
    wordCount = {s: 0 for s in listwords}  # a dict to store the counts
    with open(filename, "r") as f:
        for line in f.readlines():
            for word in line.split():
                for s in listwords:
                    if word.lower().startswith(s):
                        wordCount[s] += 1
    return wordCount
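The edit also asks for a case-sensitive match. A sketch of that variant (my adaptation, not part of the answer above): drop the .lower() calls so only the exact upper-case form is counted.

def wordcount_case_sensitive(filename, listwords):
    # no lowercasing anywhere, so "CHAPTER" matches only the upper-case form
    counts = {s: 0 for s in listwords}
    with open(filename, "r", encoding="utf-8") as f:
        for line in f:
            for word in line.split():
                for s in listwords:
                    if word.startswith(s):
                        counts[s] += 1
    return counts

print(wordcount_case_sensitive("dad.txt", ["CHAPTER"]))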
If the goal is to find chapters and paragraphs, don't try to count words or split any line.
For example, start simpler. Since chapters are in numeric order, you only need a list, not a dictionary:
chapters = []  # list of chapters
chapter = ""   # store one chapter

with open(filename, encoding="UTF-8") as f:
    for line in f.readlines():
        # TODO: should skip to the first line that starts with "CHAPTER",
        # otherwise the 'chapters' variable gets extra header information
        if line.startswith("CHAPTER"):
            print("Found chapter: " + line)
            # save off the most recent, non-empty chapter text, and reset
            if chapter:
                chapters.append(chapter)
                chapter = ""
        else:
            # up to you if you want to skip empty lines
            chapter += line  # don't manipulate any data yet

# capture the last chapter at the end of the file
if chapter:
    chapters.append(chapter)
del chapter  # no longer needed
# del chapters[0] if you want to remove the header information before the first chapter header

# done reading the file; now work with the strings in your list
print(len(chapters))  # find how many chapters there are
If you actually did want the text following "CHAPTER", then you can split that line in the first if statement; however, note that the chapter numbers repeat between volumes, and this solution assumes the volume header is part of a chapter.
If you want to count the paragraphs, start by finding the empty lines (for example, split each element on '\n\n').
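For example, a minimal sketch of that split (assuming paragraphs are separated by blank lines):

# count paragraphs per chapter by splitting on blank lines
paragraphs_per_chapter = [
    len([p for p in chapter_text.split("\n\n") if p.strip()])
    for chapter_text in chapters
]
print(paragraphs_per_chapter)  # one paragraph count per chapter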

Take tokens from a text file, calculate their frequency, and return them in a new text file in Python

After a long time researching and asking friends, I am still a dumb-dumb and don't know how to solve this.
So, for homework, we are supposed to define a function which accesses two files, the first of which is a text file with the following sentence, from which we are to calculate the word frequencies:
In a Berlin divided by the Berlin Wall , two angels , Damiel and Cassiel , watch the city , unseen and unheard by its human inhabitants .
We are also to include commas and periods: each single item has already been tokenised (individual items are surrounded by whitespace, including the commas and periods). Then the word frequencies must be written to a new txt file as "word:count", in the order in which the words appear, i.e.:
In:1
a:1
Berlin:2
divided:1
etc.
I have tried the following:
def find_token_frequency(x, y):
    with open(x, encoding='utf-8') as fobj_1:
        with open(y, 'w', encoding='utf-8') as fobj_2:
            fobj_1list = fobj_1.split()
            unique_string = []
            for i in fobj_1list:
                if i not in unique_string:
                    unique_string.append(i)
            for i in range(0, len(unique_string)):
                fobj_2.write("{}: {}".format(unique_string[i], fobj_1list.count(unique_string[i])))
I am not sure I need to actually use .split() at all, but I don't know what else to do, and it does not work anyway, since it tells me I cannot split that object.
I am told:
Traceback (most recent call last):
[...]
fobj_1list = fobj_1.split()
AttributeError: '_io.TextIOWrapper' object has no attribute 'split'
When I remove the .split(), the displayed error is:
fobj_2.write("{}: {}".format(unique_string[i], fobj_1list.count(unique_string[i])))
AttributeError: '_io.TextIOWrapper' object has no attribute 'count'
Let's divide your problem into smaller problems so we can more easily solve this.
First we need to read a file, so let's do so and save it into a variable:
with open("myfile.txt") as fobj_1:
sentences = fobj_1.read()
OK, so now we have your file as a string stored in sentences. Let's turn it into a list and count the occurrences of each word:
words = sentences.split(" ")
frequency = {word: words.count(word) for word in set(words)}
Here frequency is a dictionary where each word in the sentences is a key, with the value being how many times it appears. Note the usage of set(words). A set does not have repeated elements; that's why we iterate over the set of words and not the word list.
Finally, we can save the word frequencies into a file
with open("results.txt", 'w') as fobj_2:
for word in frequency: fobj_2.write(f"{word}:{frequency[word]}\n")
Here we use f-strings to format each line into the desired output. Note that f-strings are available in Python 3.6+.
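One caveat: iterating over a dict built from set(words) does not guarantee the order in which the words first appear, which the assignment asks for. A sketch that preserves first-appearance order (relying on dicts keeping insertion order in Python 3.7+):

# dict.fromkeys keeps each word at its first-appearance position (Python 3.7+)
ordered_words = list(dict.fromkeys(words))
with open("results.txt", 'w') as fobj_2:
    for word in ordered_words:
        fobj_2.write(f"{word}:{words.count(word)}\n")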
I'm unable to comment as I don't have the required reputation, but the reason split() isn't working is that you're calling it on the file object itself, not on a string. Try calling:
fobj_1list = fobj_1.readline().split()
instead. Also, when I ran this locally, I got an error saying TypeError: 'encoding' is an invalid keyword argument for this function. You may want to remove the encoding argument from your function calls.
I think that should be enough to get you going.
The following script should do what you want.
#!/usr/local/bin/python3

def find_token_frequency(inputFileName, outputFileName):
    # wordOrderList to maintain order
    # dict to keep track of counts
    wordOrderList = []
    wordCountDict = dict()

    # read the file
    inputFile = open(inputFileName, encoding='utf-8')
    lines = inputFile.readlines()
    inputFile.close()

    # iterate over all lines in the file
    for line in lines:
        # and split them into words
        words = line.split()
        # now, iterate over all words
        for word in words:
            # and add them to the list and dict
            if word not in wordOrderList:
                wordOrderList.append(word)
                wordCountDict[word] = 1
            else:
                # or increment their count
                wordCountDict[word] = wordCountDict[word] + 1

    # store the result in the output file
    outputFile = open(outputFileName, 'w', encoding='utf-8')
    for index in range(0, len(wordOrderList)):
        word = wordOrderList[index]
        outputFile.write(f'{word}:{wordCountDict[word]}\n')
    outputFile.close()

find_token_frequency("input.txt", "output.txt")
I changed your variable names a bit to make the code more readable.
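As a side note, a more compact sketch of the same idea using collections.Counter (which, as a dict subclass, keeps first-insertion order on Python 3.7+):

from collections import Counter

def find_token_frequency(input_name, output_name):
    # compact sketch; the file names below are just placeholders
    with open(input_name, encoding='utf-8') as f:
        counts = Counter(f.read().split())  # first-seen order is kept (3.7+)
    with open(output_name, 'w', encoding='utf-8') as f:
        for word, count in counts.items():
            f.write(f'{word}:{count}\n')

find_token_frequency("input.txt", "output.txt")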

Word Count from File: Is it having problems opening the file, or have I coded it incorrectly?

Problem: Program seems to get stuck opening a file to read.
My problem is that at the very beginning the program seems to be broken. It just displays
[(1, 'C:\Users\....\Desktop\Sense_and_Sensibility.txt')]
over and over, never-ending.
(NOTE: .... is a replacement for the purpose of posting because my computer username is my full name).
I'm not sure if I've coded this completely incorrectly, or if it's having problems opening the file. Any help is appreciated.
The program should:
1: open a file, replace all punctuation with spaces, change all words to lowercase, then store them in a dictionary.
2: look at a list of words (stop words) that will be removed from the original dictionary.
3: count the remaining words and sort based on frequency.
fname = r"C:\Users\....\Desktop\Sense_and_Sensibility.txt" # file to read
swfilename = r"C:\Users\....\Desktop\stopwords.txt" # words to delete
with open(fname) as file: # have the program run the file
for line in file: # loop through
fname.replace('-.,"!?', " ") # replace punc. with space
words = fname.lower() # make all words lowercase
word_list = fname.split() # separate the words, store
word_dict = {} # create a dictionary
with open(swfilename) as delete: # open stop word list
for line in delete:
sw_list = swfilename.split() # separate the words, store them
sw_dict = {}
for key in sw_dict:
word_dict.pop(key, None) # delete common words
for word in word_list: # loop through
word_dict[word] = word_dict.get(word, 0) + 1 # count frequency
word_freq = [] # create index
for key, value in word_dict.items(): # count occurrences
word_freq.append((value, key)) # append freq list
word_freq.sort(reverse=True) # sort the words by freq
print(word_freq) # print most to least
Opening files on Windows using Python is somewhat different compared to Mac and Linux OS.
Just change the path of the file from fname = r"C:\Users\....\Desktop\Sense_and_Sensibility.txt"
to fname = "C:\\Users\\....\\Desktop\\Sense_and_Sensibility.txt"
Use double backslashes.
There are a couple of issues with your code. I will only discuss the most obvious ones, given that it is impossible to reproduce your exact observations because the input you are using is not accessible to the readers.
I will first report your code verbatim and mark weak points with ??? followed by a number, which I will address after the code.
fname = r"C:\Users\....\Desktop\Sense_and_Sensibility.txt" #file to read
swfilename = r"C:\Users\....\Desktop\stopwords.txt" #words to delete
with open(fname) as file: #???(1) have the program run the file
for line in file: #loop through
fname.replace ('-.,"!?', " ") #???(2) replace punc. with space
words = fname.lower() #???(3) make all words lowercase
word_list = fname.split() #separate the words, store
word_dict = {} #???(4) create a dictionary
with open(swfilename) as delete: #open stop word list
for line in delete:
sw_list = swfilename.split() #separate the words, store them
sw_dict = {}
for key in sw_dict:
word_dict.pop(key, None) #???(5) delete common words
for word in word_list: #???(6) loop through
word_dict[word] = word_dict.get(word, 0) + 1 #???(7) count frequency
word_freq = [] #???(8)create index
for key, value in word_dict.items(): #count occurrences
word_freq.append((value, key)) #append freq list
word_freq.sort(reverse = True) #sort the words by freq
print(word_freq) #print most to least
(minor) file is a built-in name in Python 2, and it is good practice not to reuse it for custom purposes as you are doing.
(major) .replace() will replace the exact string on the left with the exact string on the right, but what you would like to do is perform some sort of multi_replace(), which you could implement yourself (for example as a function) by consecutive calls to .replace(), for example in a loop (or using functools.reduce()); see the sketch after this list.
(major) fname contains the file name (path, actually) and not the content of the file you want to work with.
(major) You are looping through the lines of the file, but if you create your word_list and word_dict for each line, you will "overwrite" the content at each iteration. Also, word_dict is created empty and never filled.
(major) The logic you are trying to implement will not work on a dictionary, because dictionaries cannot contain multiple identical keys. A more effective approach would be to create a filtered_list from the word_list by excluding the stop_words. The dictionary can then be used to implement a counter. I do understand that at your level it may be worth learning how to implement a counter, but please keep in mind that collections.Counter from the standard library (accessible via import collections) does exactly what you want.
(major) Given that at this point there is little useful state left from your code: looping through the original list instead of through the filtered list means you have no information about the stop words.
(major) dictionary[key] can be used both for accessing (which you do not do) and for writing (which you do) the value associated with a specific key in a dictionary.
(minor) Obviously, your approach to sorting according to word frequency would work, but a much better approach would be to use the key parameter of .sort() and sorted().
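As mentioned in point 2, here is a minimal sketch of such a multi_replace(), combined with the collections.Counter approach from point 5 (the helper name and the exact punctuation set are illustrative assumptions):

from collections import Counter

def multi_replace(text, chars, replacement=" "):
    # hypothetical helper: replace every character in `chars`, one at a time
    for ch in chars:
        text = text.replace(ch, replacement)
    return text

with open(fname) as f:
    text = multi_replace(f.read().lower(), '-.,"!?')
word_list = text.split()

with open(swfilename) as f:
    stop_words = set(f.read().split())

filtered_list = [word for word in word_list if word not in stop_words]
word_freq = Counter(filtered_list)  # does the counting for you
print(word_freq.most_common())  # most to least frequent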
Hope this helps!

Python: merging dictionaries with adding values but conserving other fields

I have a textfile with the following format:
word_form root_form morphological_form frequency
word_form root_form morphological_form frequency
word_form root_form morphological_form frequency
... with 1 million items
But some of the word_forms contain an apostrophe ('), others do not, so I would like to count them as instances of the same word, that's to say I would like to merge lines like these two:
cup'board cup blabla 12
cupboard cup blabla2 10
into this one (frequencies added):
cupboard cup blabla2 22
I am looking for a solution in Python 2.7 to do that. My first idea was to read the text file, store the words with apostrophes and the words without in two different dictionaries, then go over the dictionary of words with apostrophes and test whether each word is already in the dictionary without apostrophes; if it is, update the frequency, and if not, simply add the line with the apostrophe removed. Here is my code:
class Lemma:
    """Creates a Lemma with the word form, the root, the morphological analysis and the frequency in the corpus"""
    def __init__(self, lop):
        self.word_form = lop[0]
        self.root = lop[1]
        self.morph = lop[2]
        self.freq = int(lop[3])

def Reader(filename):
    """Keeps the lines of a file in memory for a single reading, memory efficient"""
    with open(filename) as f:
        for line in f:
            yield line

def get_word_dict(filename):
    '''Separates the word list into two dictionaries, one for words without apostrophe and one for words with apostrophe'''
    '''Works in a reasonable time'''
    '''This step can be done writing line by line, avoiding all storage in memory'''
    word_dict = {}
    word_dict_striped = {}
    # we store the lemmas in two dictionaries: word_dict for words without apostrophe,
    # word_dict_striped for words with apostrophe
    with open('word_dict.txt', 'wb') as f:
        with open('word_dict_striped.txt', 'wb') as g:
            reader = Reader(filename)
            for line in reader:
                items = line.split("\t")
                word_form = items[0]
                if "'" in word_form:
                    # we remove the apostrophe in the word form and morphological analysis
                    # and add the lemma to the dictionary word_dict_striped
                    items[0] = word_form.replace("'", "")
                    items[2] = items[2].replace("\+Apos", "")
                    g.write("%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict_striped.update({items[0]: Lemma(items)})
                else:
                    # we just add the lemma to the dictionary word_dict
                    f.write("%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict.update({items[0]: Lemma(items)})
    return word_dict, word_dict_striped

def merge_word_dict(word_dict, word_dict_striped):
    '''Takes two dictionaries and merges them by adding their frequencies when there is a common key'''
    '''Does not run in reasonable time on the whole list'''
    with open('word_compiled_dict.txt', 'wb') as f:
        for word in word_dict_striped.keys():
            if word in word_dict.keys():
                word_dict[word].freq += word_dict_striped[word].freq
                f.write("%s\t%s\t%s\t%s" % (word_dict[word].word_form, word_dict[word].root, word_dict[word].morph, word_dict[word].freq))
            else:
                word_dict.update(word_dict_striped[word])
    print "Number of words: ",
    print(len(word_dict))
    for x in word_dict:
        print x, word_dict[x].root, word_dict[x].morph, word_dict[x].freq
    return word_dict
This solution works in a reasonable time up to the storage of the two dictionaries, whether I write two text files line by line to avoid any storage or store them as dict objects in the program. But the merging of the two dictionaries never ends!
The dictionaries' update method would work, but it overrides one frequency count instead of adding the two. I saw some solutions for merging dictionaries with addition using Counter:
Python: Elegantly merge dictionaries with sum() of values
Merge and sum of two dictionaries
How to sum dict elements
How to merge two Python dictionaries in a single expression?
Is there any pythonic way to combine two dicts (adding values for keys that appear in both)?
but they seem to work only when the dictionaries are of the form (word, count), whereas I want to carry the other fields in the dictionary as well. I am open to all your ideas or reframings of the problem, since my goal is to run this program only once to obtain the merged list in a text file. Thank you in advance!
Here's something that does more or less what you want. Just change the file names at the top. It doesn't modify the original file.
input_file_name = "input.txt"
output_file_name = "output.txt"
def custom_comp(s1, s2):
word1 = s1.split()[0]
word2 = s2.split()[0]
stripped1 = word1.translate(None, "'")
stripped2 = word2.translate(None, "'")
if stripped1 > stripped2:
return 1
elif stripped1 < stripped2:
return -1
else:
if "'" in word1:
return -1
else:
return 1
def get_word(line):
return line.split()[0].translate(None, "'")
def get_num(line):
return int(line.split()[-1])
print "Reading file and sorting..."
lines = []
with open(input_file_name, 'r') as f:
for line in sorted(f, cmp=custom_comp):
lines.append(line)
print "File read and sorted"
combined_lines = []
print "Combining entries..."
i = 0
while i < len(lines) - 1:
if get_word(lines[i]) == get_word(lines[i+1]):
total = get_num(lines[i]) + get_num(lines[i+1])
new_parts = lines[i+1].split()
new_parts[-1] = str(total)
combined_lines.append(" ".join(new_parts))
i += 2
else:
combined_lines.append(lines[i].strip())
i += 1
print "Entries combined"
print "Writing to file..."
with open(output_file_name, 'w+') as f:
for line in combined_lines:
f.write(line + "\n")
print "Finished"
It sorts the words and messes up the spacing a bit. If that's important, let me know and it can be adjusted.
Another thing is that it sorts the whole file. For only a million lines, that probably won't take overly long, but again, let me know if that's an issue.
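Incidentally, the never-ending merge in the question's code most likely comes from if word in word_dict.keys(): in Python 2, .keys() builds a list and membership testing scans it linearly, so the merge loop is quadratic. Testing membership on the dict itself is a constant-time hash lookup; a sketch of the question's merge loop with that one change:

for word in word_dict_striped:
    if word in word_dict:  # O(1) hash lookup instead of scanning a list
        word_dict[word].freq += word_dict_striped[word].freq
    else:
        word_dict[word] = word_dict_striped[word]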

Python dictionary key error

I am attempting to write a Python program that builds a dictionary from a file containing a list of words, each given a score and standard deviation. My program looks like this:
theFile = open('word-happiness.csv', 'r')
theFile.close()

def make_happiness_table(filename):
    ''' make_happiness_table: string -> dict
    creates a dictionary of happiness scores from the given file '''
    return {}

make_happiness_table("word-happiness.csv")
table = make_happiness_table("word-happiness.csv")

(score, stddev) = table['hunger']
print("the score for 'hunger' is %f" % score)
I have the word 'hunger' in my file but when I run this program to take 'hunger' and return its given score and std deviation, I get:
(score, stddev) = table['hunger']
KeyError: 'hunger'
How is it that I get a key error even though 'hunger' is in the dictionary?
"hunger" isn't in the dictionary (that's what the KeyError tells you). The problem is probably your make_happiness_table function. I don't know if you've posted the full code or not, but it doesn't really matter. At the end of the function, you return an empty dictionary ({}) regardless of what else went on inside the function.
You probably want to open your file inside that function, create the dictionary and return it. For example, if your csv file is just 2 columns (separated by a comma), you could do:
def make_happiness_table(filename):
    with open(filename) as f:
        d = dict(line.split(',') for line in f)
        # alternative, if you find it easier to understand:
        # d = {}
        # for line in f:
        #     key, value = line.split(',')
        #     d[key] = value
    return d
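Since the file in the question actually has three tab-separated fields and the lookup unpacks (score, stddev), a variant for that case might look like the sketch below (the float conversion is an assumption, as the scores are numeric):

def make_happiness_table(filename):
    # assumes lines of the form: word<TAB>score<TAB>stddev
    d = {}
    with open(filename) as f:
        for line in f:
            word, score, stddev = line.split('\t')
            d[word] = (float(score), float(stddev))
    return d

table = make_happiness_table("word-happiness.csv")
score, stddev = table['hunger']
print("the score for 'hunger' is %f" % score)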
