Python: merging dictionaries by adding values while preserving other fields

I have a text file with the following format:
word_form root_form morphological_form frequency
word_form root_form morphological_form frequency
word_form root_form morphological_form frequency
... with 1 million items
Some of the word_forms contain an apostrophe (') and others do not, so I would like to count them as instances of the same word; that is, I would like to merge lines like these two:
cup'board cup blabla 12
cupboard cup blabla2 10
into this one (frequencies added):
cupboard cup blabla2 22
I am looking for a Python 2.7 solution. My first idea was to read the text file and store the words in two dictionaries, one for words with an apostrophe and one for words without. Then I would go over the dictionary of words with an apostrophe and test whether each word is already in the dictionary of words without an apostrophe; if it is, I update the frequency, and if not, I simply add the line with the apostrophe removed. Here is my code:
class Lemma:
    """Creates a Lemma with the word form, the root, the morphological analysis and the frequency in the corpus"""
    def __init__(self, lop):
        self.word_form = lop[0]
        self.root = lop[1]
        self.morph = lop[2]
        self.freq = int(lop[3])

def Reader(filename):
    """Keeps the lines of a file in memory for a single reading, memory efficient"""
    with open(filename) as f:
        for line in f:
            yield line
def get_word_dict(filename):
    '''Separates the word list into two dictionaries, one for words with an apostrophe and one for words without.
    Works in a reasonable time.
    This step can be done writing line by line, avoiding all storage in memory.'''
    word_dict = {}
    word_dict_striped = {}
    # We store the lemmas in two dictionaries: word_dict for words without apostrophe, word_dict_striped for words with apostrophe
    with open('word_dict.txt', 'wb') as f:
        with open('word_dict_striped.txt', 'wb') as g:
            reader = Reader(filename)
            for line in reader:
                items = line.split("\t")
                word_form = items[0]
                if "'" in word_form:
                    # we remove the apostrophe in the word form and the morphological analysis
                    # and add the lemma to the dictionary word_dict_striped
                    items[0] = word_form.replace("'", "")
                    items[2] = items[2].replace("\+Apos", "")
                    g.write("%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict_striped.update({items[0]: Lemma(items)})
                else:
                    # we just add the lemma to the dictionary word_dict
                    f.write("%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict.update({items[0]: Lemma(items)})
    return word_dict, word_dict_striped
def merge_word_dict(word_dict, word_dict_striped):
    '''Takes two dictionaries and merges them, adding their frequency counts when there is a common key.
    Does not run in reasonable time on the whole list.'''
    with open('word_compiled_dict.txt', 'wb') as f:
        for word in word_dict_striped.keys():
            if word in word_dict.keys():
                word_dict[word].freq += word_dict_striped[word].freq
                f.write("%s\t%s\t%s\t%s" % (word_dict[word].word_form, word_dict[word].root, word_dict[word].morph, word_dict[word].freq))
            else:
                word_dict.update({word: word_dict_striped[word]})
    print "Number of words: ",
    print(len(word_dict))
    for x in word_dict:
        print x, word_dict[x].root, word_dict[x].morph, word_dict[x].freq
    return word_dict
This solution works in a reasonable time up to the storage of the two dictionaries, whether I write to two text files line by line to avoid any storage or store them as dict objects in the program. But the merging of the two dictionaries never ends!
The dictionaries' update method would work, but it overrides one frequency count instead of adding the two. I saw some solutions for merging dictionaries
with addition of values using Counter:
Python: Elegantly merge dictionaries with sum() of values
Merge and sum of two dictionaries
How to sum dict elements
How to merge two Python dictionaries in a single expression?
Is there any pythonic way to combine two dicts (adding values for keys that appear in both)?
but they seem to work only when the dictionaries are of the form (word, count), whereas I want to carry the other fields along in the dictionary as well.
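For illustration, this is what the Counter approach from those links does: it adds the counts, but there is nowhere to keep the root and morphological fields (a quick sketch with my example words):
from collections import Counter

freq_a = Counter({'cupboard': 12})
freq_b = Counter({'cupboard': 10})
print(freq_a + freq_b)  # Counter({'cupboard': 22}) -- the root and morph columns are lost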
I am open to any ideas or reframings of the problem, since my goal is to run this program only once to obtain the merged list in a text file. Thank you in advance!

Here's something that does more or less what you want. Just change the file names at the top. It doesn't modify the original file.
input_file_name = "input.txt"
output_file_name = "output.txt"

def custom_comp(s1, s2):
    word1 = s1.split()[0]
    word2 = s2.split()[0]
    stripped1 = word1.translate(None, "'")
    stripped2 = word2.translate(None, "'")
    if stripped1 > stripped2:
        return 1
    elif stripped1 < stripped2:
        return -1
    else:
        if "'" in word1:
            return -1
        else:
            return 1

def get_word(line):
    return line.split()[0].translate(None, "'")

def get_num(line):
    return int(line.split()[-1])

print "Reading file and sorting..."

lines = []
with open(input_file_name, 'r') as f:
    for line in sorted(f, cmp=custom_comp):
        lines.append(line)

print "File read and sorted"

combined_lines = []

print "Combining entries..."

i = 0
while i < len(lines) - 1:
    if get_word(lines[i]) == get_word(lines[i+1]):
        total = get_num(lines[i]) + get_num(lines[i+1])
        new_parts = lines[i+1].split()
        new_parts[-1] = str(total)
        combined_lines.append(" ".join(new_parts))
        i += 2
    else:
        combined_lines.append(lines[i].strip())
        i += 1

print "Entries combined"
print "Writing to file..."

with open(output_file_name, 'w+') as f:
    for line in combined_lines:
        f.write(line + "\n")

print "Finished"
It sorts the words and messes up the spacing a bit. If that's important, let me know and it can be adjusted.
Another thing is that it sorts the whole thing. For only a million lines, that probably won't take overly long, but again, let me know if that's an issue.

Related

File Names Chain in Python

I CANNOT USE ANY IMPORTED LIBRARY. I have a task where I have some directories containing some files; every file contains, besides some words, the name of the next file to be opened in its first line. Once every word of every file contained in a directory has been read, the words have to be processed in a way that returns a single string; that string contains, in its first position, the most frequent first letter of every word seen before, in its second position the most frequent second letter, and so on. I have managed to do this with a directory containing 3 files, but it does not use any chain-like mechanism, just a passing of local variables. Some of my college colleagues suggested I use slicing of lists, but I can't figure out how. I CANNOT USE ANY IMPORTED LIBRARY.
This is what I got:
'''
The objective of the homework assignment is to design and implement a function
that reads some strings contained in a series of files and generates a new
string from all the strings read.
The strings to be read are contained in several files, linked together to
form a closed chain. The first string in each file is the name of another
file that belongs to the chain: starting from any file and following the
chain, you always return to the starting file.
Example: the first line of file "A.txt" is "B.txt," the first line of file
"B.txt" is "C.txt," and the first line of "C.txt" is "A.txt," forming the
chain "A.txt"-"B.txt"-"C.txt".
In addition to the string with the name of the next file, each file also
contains other strings separated by spaces, tabs, or carriage return
characters. The function must read all the strings in the files in the chain
and construct the string obtained by concatenating the characters with the
highest frequency in each position. That is, in the string to be constructed,
at position p, there will be the character with the highest frequency at
position p of each string read from the files. In the case where there are
multiple characters with the same frequency, consider the alphabetical order.
The generated string has a length equal to the maximum length of the strings
read from the files.
Therefore, you must write a function that takes as input a string "filename"
representing the name of a file and returns a string.
The function must construct the string according to the directions outlined
above and return the constructed string.
Example: if the contents of the three files A.txt, B.txt, and C.txt in the
directory test01 are as follows
test01/A.txt test01/B.txt test01/C.txt
-------------------------------------------------------------------------------
test01/B.txt test01/C.txt test01/A.txt
house home kite
garden park hello
kitchen affair portrait
balloon angel
surfing
the function most_frequent_chars ("test01/A.txt") will return "hareennt".
'''
def file_names_list(filename):
    intermezzo = []
    lista_file = []
    a_file = open(filename)
    lines = a_file.readlines()
    for line in lines:
        intermezzo.extend(line.split())
        del intermezzo[1:]
        lista_file.append(intermezzo[0])
        intermezzo.pop(0)
    return lista_file

def words_list(filename):
    lista_file = []
    a_file = open(filename)
    lines = a_file.readlines()[1:]
    for line in lines:
        lista_file.extend(line.split())
    return lista_file

def stuff_list(filename):
    file_list = file_names_list(filename)
    the_rest = words_list(filename)
    second_file_name = file_names_list(file_list[0])
    the_lists = words_list(file_list[0]) and words_list(second_file_name[0])
    the_rest += the_lists[0:]
    return the_rest

def most_frequent_chars(filename):
    huge_words_list = stuff_list(filename)
    maxOccurs = ""
    list_of_chars = []
    for i in range(len(max(huge_words_list, key=len))):
        for item in huge_words_list:
            try:
                list_of_chars.append(item[i])
            except IndexError:
                pass
        maxOccurs += max(sorted(set(list_of_chars)), key=list_of_chars.count)
        list_of_chars.clear()
    return maxOccurs

print(most_frequent_chars("test01/A.txt"))
This assignment is relatively easy, if the code has a good structure. Here is a full implementation:
def read_file(fname):
    with open(fname, 'r') as f:
        return list(filter(None, [y.rstrip(' \n').lstrip(' ') for x in f for y in x.split()]))

def read_chain(fname):
    seen = set()
    new = fname
    result = []
    while not new in seen:
        A = read_file(new)
        seen.add(new)
        new, words = A[0], A[1:]
        result.extend(words)
    return result

def most_frequent_chars(fname):
    all_words = read_chain(fname)
    result = []
    for i in range(max(map(len, all_words))):
        chars = [word[i] for word in all_words if i < len(word)]
        result.append(max(sorted(set(chars)), key=chars.count))
    return ''.join(result)

print(most_frequent_chars("test01/A.txt"))
# output: "hareennt"
In the code above, we define 3 functions:
read_file: simple function to read the contents of a file and return a list of strings. The command x.split() takes care of any spaces or tabs used to separate words. The final command list(filter(None, arr)) makes sure that empty strings are erased from the solution.
read_chain: Simple routine to iterate through the chain of files, and return all the words contained in them.
most_frequent_chars: Easy routine, where the most frequent characters are counted carefully.
PS. This line of code you had is very interesting:
maxOccurs += max(sorted(set(list_of_chars)), key = list_of_chars.count)
I edited my code to include it.
Space complexity optimization
The space complexity of the previous code can be improved by orders of magnitude, if the files are scanned without storing all the words:
def scan_file(fname, database):
    with open(fname, 'r') as f:
        next_file = None
        for x in f:
            for y in x.split():
                if next_file is None:
                    next_file = y
                else:
                    for i, c in enumerate(y):
                        while len(database) <= i:
                            database.append({})
                        if c in database[i]:
                            database[i][c] += 1
                        else:
                            database[i][c] = 1
    return next_file

def most_frequent_chars(fname):
    database = []
    seen = set()
    new = fname
    while not new in seen:
        seen.add(new)
        new = scan_file(new, database)
    return ''.join(max(sorted(d.keys()), key=d.get) for d in database)

print(most_frequent_chars("test01/A.txt"))
# output: "hareennt"
Now we scan the files tracking the frequency of the characters in database, without storing intermediate arrays.
Ok, here's my solution:
def parsi_file(filename):
    visited_files = set()
    words_list = []

    # Getting words from all files
    while filename not in visited_files:
        visited_files.add(filename)
        with open(filename) as f:
            filename = f.readline().strip()
            words_list += [line.strip() for line in f.readlines()]

    # Creating dictionaries of letters:count for each index
    letters_dicts = []
    for word in words_list:
        for i in range(len(word)):
            if i > len(letters_dicts) - 1:
                letters_dicts.append({})
            letter = word[i]
            if letters_dicts[i].get(letter):
                letters_dicts[i][letter] += 1
            else:
                letters_dicts[i][letter] = 1

    # Sorting dicts and getting the "best" letter
    code = ""
    for dic in letters_dicts:
        sorted_letters = sorted(dic, key=lambda letter: (-dic[letter], letter))
        code += sorted_letters[0]
    return code
We first get the words_list from all files.
Then, for each index, we create a dictionary of the letters in all words at that index, with their count.
Now we sort the dictionary keys by descending count (-count) then by alphabetical order.
Finally we get the first letter (thus the one with the max count) and add it to the "code" word for this test battery.
Edit: in terms of efficiency, parsing through all words for each index will get worse as the number of words grows, so it would be better to tweak the code to simultaneously create the dictionaries for each index and parse through the list of words only once. Done.

Finding specific words and adding them to a dictionary

I want to find the words that start with "CHAPTER" and add them to a dictionary.
I have written some code, but it gives me 0 as the output all the time:
def wordcount(filename, listwords):
    try:
        file = open(filename, "r")
        read = file.readlines()
        file.close()
        for word in listwords:
            lower = word.lower()
            count = 0
            for sentence in read:
                line = sentence.split()
                for each in line:
                    line2 = each.lower()
                    line2 = line2.strip("")
                    if lower == line2:
                        count += 1
            print(lower, ":", count)
    except FileNotFoundError:
        print("The file is not there ")

wordcount("dad.txt", ["CHAPTER"])
the txt file is here
EDIT:
The problem was the encoding type and I solved it, but the new question is: how can I add these words into a dictionary?
And how can I make this code case sensitive? I mean, when I type wordcount("dad.txt", ["CHAPTER"]), I want it to find only CHAPTER words in upper case.
It cannot work because of this line:
    if lower == line2:
You can use this line instead to find the words that start with "CHAPTER":
    if line2.startswith(lower):
I notice that you need to check whether a word starts with a certain word from listwords rather than test for equality (lower == line2). Hence, you should use the startswith method.
You can have simpler code, something like this:
def wordcount(filename, listwords):
    listwords = [s.lower() for s in listwords]
    wordCount = {s: 0 for s in listwords}  # A dict to store the counts
    with open(filename, "r") as f:
        for line in f.readlines():
            for word in line.split():
                for s in listwords:
                    if word.lower().startswith(s):
                        wordCount[s] += 1
    return wordCount
If the goal is to find chapters and paragraphs, don't try to count words or split any line.
For example, start simpler. Since chapters are in numeric order, you only need a list, not a dictionary:
chapters = []  # list of chapters
chapter = ""   # store one chapter

with open(filename, encoding="UTF-8") as f:
    for line in f.readlines():
        # TODO: should skip to the first line that starts with "CHAPTER",
        # otherwise 'chapters' variable gets extra, header information
        if line.startswith("CHAPTER"):
            print("Found chapter: " + line)
            # Save off the most recent, non empty chapter text, and reset
            if chapter:
                chapters.append(chapter)
                chapter = ""
        else:
            # up to you if you want to skip empty lines
            chapter += line  # don't manipulate any data yet

# Capture the last chapter at the end of the file
if chapter:
    chapters.append(chapter)
del chapter  # no longer needed
# del chapters[0] if you want to remove the header information before the first chapter header

# Done reading the file, now work with strings in your lists
print(len(chapters))  # find how many chapters there are
If you actually did want the text following "CHAPTER", you can split that line in the first if statement; however, note that the chapter numbers repeat between volumes, and this solution assumes the volume header is part of a chapter.
If you want to count the paragraphs, start by finding the empty lines (for example, split each element on '\n\n').
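A minimal sketch of that paragraph count, assuming the chapters list built above and blank-line-separated paragraphs:
# count blank-line-separated blocks in each chapter
paragraph_counts = [
    len([p for p in chapter_text.split("\n\n") if p.strip()])
    for chapter_text in chapters
]
print(paragraph_counts)  # one paragraph count per chapter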

How can I merge two snippets of text that both contain a desired keyword?

I have a program that pulls out text around a specific keyword. I'm trying to modify it so that if two keywords are close enough together, it just shows one longer snippet of text instead of two individual snippets.
My current code, below, adds words after the keyword to a list and resets the counter if another keyword is found. However, I've found two problems with this. The first is that the data rate limit in my spyder notebook is exceeded, and I haven't been able to deal with that. The second is that though this would make a longer snippet, it wouldn't get rid of the duplicate.
Does anyone know a way to get rid of the duplicate snippet, or know how to merge the snippets in a way that doesn't exceed the data rate limit (or know how to change the spyder rate limit)? Thank you!!
def occurs(word1, word2, file, filewrite):
    import os
    infile = open(file, 'r')  # opens file, reads, splits into lines
    lines = infile.read().splitlines()
    infile.close()
    wordlist = [word1, word2]  # this list allows for multiple words
    wordsString = ''.join(lines)  # splits file into individual words
    words = wordsString.split()
    f = open(file, 'w')
    f.write("start")
    f.write(os.linesep)
    g = open(filewrite, 'w')
    g.write("start")
    g.write(os.linesep)
    for item in wordlist:  # multiple words
        matches = [i for i, w in enumerate(words) if w.lower().find(item) != -1]
        # above line goes through lines, finds the specific words we want
        for m in matches:  # next three lines find each instance of the word, print out surrounding words
            list = []
            s = ""
            l = " ".join(words[m-20:m+1])
            j = 0
            while j < 20:
                list.append(words[m+i])
                j = j + 1
                if words[m+i] == word1 or words[m+i] == word2:
                    j = 0
            print(list)
            k = " ".join(list)
            f.write(f"{s}...{l}{k}...")  # writes the data to the external file
            f.write(os.linesep)
            g.write(str(m))
            g.write(os.linesep)
    f.close
    g.close

Python - Performance when iterating over dictionary keys

I have a relatively large text file (around 7 million lines) and I want to run some specific logic over it, which I'll try to explain below:
A1KEY1
A2KEY1
B1KEY2
C1KEY3
D1KEY3
E1KEY4
I want to count the frequency of appearance of the keys, and then output those with a frequency of 1 into one text file, those with a frequency of 2 into another, and those with a frequency higher than 2 into a third.
This is the code I have so far, but it iterates over the dictionary painfully slowly, and it gets slower the more it progresses.
def filetoliststrip(file):
    file_in = str(file)
    lines = list(open(file_in, 'r'))
    content = [x.strip() for x in lines]
    return content

dict_in = dict()
seen = []
fileinlist = filetoliststrip(file_in)
out_file = open(file_ot, 'w')
out_file2 = open(file_ot2, 'w')
out_file3 = open(file_ot3, 'w')
counter = 0

for line in fileinlist:
    counter += 1
    keyf = line[10:69]
    print("Loading line " + str(counter) + " : " + str(line))
    if keyf not in dict_in.keys():
        dict_in[keyf] = []
        dict_in[keyf].append(1)
        dict_in[keyf].append(line)
    else:
        dict_in[keyf][0] += 1
        dict_in[keyf].append(line)

for j in dict_in.keys():
    print("Processing key: " + str(j))
    #print(dict_in[j])
    if dict_in[j][0] < 2:
        out_file.write(str(dict_in[j][1]))
    elif dict_in[j][0] == 2:
        for line_in in dict_in[j][1:]:
            out_file2.write(str(line_in) + "\n")
    elif dict_in[j][0] > 2:
        for line_in in dict_in[j][1:]:
            out_file3.write(str(line_in) + "\n")

out_file.close()
out_file2.close()
out_file3.close()
I'm running this on a Windows PC (i7, 8 GB RAM); this should not be taking hours to perform. Is the problem the way I read the file into a list? Should I use a different method? Thanks in advance.
You have multiple points that slow down your code - there is no need to load the whole file into memory only to iterate over it again, there is no need to get a list of keys each time you want to do a lookup (if key not in dict_in: ... will suffice and will be blazingly fast), you don't need to keep the line count as you can post-check the lines length anyway... to name but a few.
I'd completely restructure your code as:
import collections

dict_in = collections.defaultdict(list)  # save some time with a dictionary factory

with open(file_in, "r") as f:  # open the file_in for reading
    for line in f:  # read the file line by line
        key = line.strip()[10:69]  # assuming this is how you get your key
        dict_in[key].append(line)  # add the line as an element of the found key

# now that we have the lines in their own key brackets, lets write them based on frequency
with open(file_ot, "w") as f1, open(file_ot2, "w") as f2, open(file_ot3, "w") as f3:
    selector = {1: f1, 2: f2}  # make our life easier with a quick length-based lookup
    for values in dict_in.values():  # use dict_in.itervalues() on Python 2.x
        selector.get(len(values), f3).writelines(values)  # write the collected lines
And you'll hardly get more efficient than that, at least in Python.
Keep in mind that this will not guarantee the order of lines in the output prior to Python 3.7 (or CPython 3.6). The order within a key itself will be preserved, however. If you need to keep the line order prior to the aforementioned Python versions, you'll have to keep a separate key order list and iterate over it to pick up the dict_in values in order.
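A minimal sketch of that ordering workaround, under the same assumptions as the code above (key taken as line.strip()[10:69]):
import collections

dict_in = collections.defaultdict(list)
key_order = []  # remembers the order in which keys were first seen

with open(file_in, "r") as f:
    for line in f:
        key = line.strip()[10:69]
        if key not in dict_in:  # membership test does not create the key
            key_order.append(key)
        dict_in[key].append(line)

with open(file_ot, "w") as f1, open(file_ot2, "w") as f2, open(file_ot3, "w") as f3:
    selector = {1: f1, 2: f2}
    for key in key_order:  # write in first-seen order instead of dict order
        values = dict_in[key]
        selector.get(len(values), f3).writelines(values)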
The first function:
def filetoliststrip(file):
    file_in = str(file)
    lines = list(open(file_in, 'r'))
    content = [x.strip() for x in lines]
    return content
Here a list of raw lines is produced only to be stripped. That will require roughly twice as much memory as necessary, and just as importantly, several passes over data that doesn't fit in cache. We also don't need to make str of things repeatedly. So we can simplify it a bit:
def filetoliststrip(filename):
    return [line.strip() for line in open(filename, 'r')]
This still produces a list. If we're reading through the data only once, not storing each line, replace [] with () to turn it into a generator expression; in this case, since lines are actually held intact in memory until the end of the program, we'd only save the space for the list (which is still at least 30MB in your case).
Then we have the main parsing loop (I adjusted the indentation as I thought it should be):
counter = 0
for line in fileinlist:
    counter += 1
    keyf = line[10:69]
    print("Loading line " + str(counter) + " : " + str(line))
    if keyf not in dict_in.keys():
        dict_in[keyf] = []
        dict_in[keyf].append(1)
        dict_in[keyf].append(line)
    else:
        dict_in[keyf][0] += 1
        dict_in[keyf].append(line)
There are several suboptimal things here.
First, the counter could be an enumerate (when you don't have an iterable, there's range or itertools.count). Changing this will help with clarity and reduce the risk of mistakes.
for counter, line in enumerate(fileinlist, 1):
Second, it's more efficient to form a string in one operation than add it from bits:
print("Loading line {} : {}".format(counter, line))
Third, there's no need to extract the keys for a dictionary member check. In Python 2 that builds a new list, which means copying all the references held in the keys, and gets slower with every iteration. In Python 3, it still means building a key view object needlessly. Just use keyf not in dict_in if the check is needed.
Fourth, the check really isn't needed. Catching the exception when a lookup fails is pretty much as fast as the if check, and repeating the lookup after the if check is almost certainly slower. For that matter, stop repeating lookups in general:
try:
    dictvalue = dict_in[keyf]
    dictvalue[0] += 1
    dictvalue.append(line)
except KeyError:
    dict_in[keyf] = [1, line]
This is such a common pattern, however, that we have two standard library implementations of it: Counter and defaultdict. We could use both here, but the Counter is more practical when you only want the count.
from collections import defaultdict

def newentry():
    return [0]

dict_in = defaultdict(newentry)

for counter, line in enumerate(fileinlist, 1):
    keyf = line[10:69]
    print("Loading line {} : {}".format(counter, line))
    dictvalue = dict_in[keyf]
    dictvalue[0] += 1
    dictvalue.append(line)
Using defaultdict lets us not worry about whether the entries existed or not.
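As an aside, if only the frequencies were wanted (without keeping the lines), the Counter mentioned above would be enough on its own; a minimal sketch:
from collections import Counter

# counts per key only, no lines stored
key_counts = Counter(line[10:69] for line in fileinlist)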
We now arrive at the output phase. Again we have needless lookups, so let's reduce them to one iteration:
for key, value in dict_in.iteritems():  # just items() in Python 3
    print("Processing key: " + key)
    #print(value)
    count, lines = value[0], value[1:]
    if count < 2:
        out_file.write(lines[0])
    elif count == 2:
        for line_in in lines:
            out_file2.write(line_in + "\n")
    elif count > 2:
        for line_in in lines:
            out_file3.write(line_in + "\n")
That still has a few annoyances. We've repeated the writing code, it builds other strings (tagging on "\n"), and it has a whole chunk of similar code for each case. In fact, the repetition probably caused a bug: there's no newline separator for the single occurrences in out_file. Let's factor out what really differs:
for key, value in dict_in.iteritems():  # just items() in Python 3
    print("Processing key: " + key)
    #print(value)
    count, lines = value[0], value[1:]
    if count < 2:
        key_outf = out_file
    elif count == 2:
        key_outf = out_file2
    else:  # elif count > 2: # Test not needed
        key_outf = out_file3
    key_outf.writelines(line_in + "\n" for line_in in lines)
I've left the newline concatenation because it's more complex to mix them in as separate calls. The string is short-lived and it serves a purpose to have the newline in the same place: it makes it less likely at OS level that a line is broken up by concurrent writes.
You'll have noticed there are Python 2 and 3 differences here. Most likely your code wasn't all that slow if run in Python 3 in the first place. There exists a compatibility module called six to write code that more easily runs in either; it lets you use e.g. six.viewkeys and six.iteritems to avoid this gotcha.
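For instance, a minimal sketch of the six spelling (assuming six is installed):
import six

# iterates the dict the same way on Python 2 and 3
for key, value in six.iteritems(dict_in):
    print("Processing key: " + key)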
You load a very large file into memory at once. When you don't actually need the lines and just need to process them, use a generator. It is more memory-efficient.
Counter is a collection where elements are stored as dictionary keys and their counts are stored as dictionary values. You can use that to count the frequency of the keys. Then simply iterate over the new dict and append the key to the relevant file:
from collections import Counter

keys = ['A1KEY1', 'A2KEY1', 'B1KEY2', 'C1KEY3', 'D1KEY3', 'E1KEY4']
count = Counter(keys)

with open('single.txt', 'w') as f1:
    with open('double.txt', 'w') as f2:
        with open('more_than_double.txt', 'w') as f3:
            for k, v in count.items():
                if v == 1:
                    f1.write(k + '\n')
                elif v == 2:
                    f2.write(k + '\n')
                else:
                    f3.write(k + '\n')

TAB break for line split in a dictionary in Python

I am attempting to write a Python program that builds a dictionary from a file containing a list of words, where each word is given a score and a standard deviation. My program looks like this:
theFile = open('word-happiness.csv', 'r')
theFile.close()

def make_happiness_table(filename):
    '''make_happiness_table: string -> dict
    creates a dictionary of happiness scores from the given file'''
    with open(filename) as f:
        d = dict(line.split(' ') for line in f)
        return d

make_happiness_table("word-happiness.csv")
table = make_happiness_table("word-happiness.csv")
(score, stddev) = table['hunger']
print("the score for 'hunger' is %f" % score)
My .csv file is in the form
word{TAB}score{TAB}standard_deviation
and I am trying to create the dictionary in that way. How can I create such a dictionary so that I can print a word such as 'hunger' from the function and get its score and std deviation?
def make_happiness_table(filename):
    with open(filename) as f:
        d = dict()
        for line in f:
            word, score, std = line.split()  # splits on any consecutive runs of whitespace
            d[word] = score, std  # may want to make floats: `d[word] = float(score), float(std)`
        return d
Note that if your word can have a tab character in it, but you're guaranteed that you only have 3 fields (word, score, std), you can split the string from the right (str.rsplit), only splitting twice (resulting in 3 fields at the end). e.g. word,score,std = line.rsplit(None,2).
As mentioned in the comments above, you can also use the csv module to read these sorts of files -- csv really shines if your fields can be "quoted". e.g.:
"this is field 0" "this is field 1" "this is field 2"
If you don't have that scenario, then I find that str.split works just fine.
Also, unrelated, but your code calls make_happiness_table twice (the first time you don't assign the return value to anything). The first call is useless (all it does is read the file and build a dictionary which you can never use). Finally, opening and closing theFile at the beginning of your script is also just a waste, since you don't do anything with the file there.
If you are sure your word will not have a space, you can just split the line, e.g.
    word, score, stddev = line.split()
But if the word can have a space, use the tab char \t to split, e.g.
    word, score, stddev = line.split('\t')
But for a very generic case, when the word may contain a tab itself, use the csv module:
    reader = csv.reader(open(filename), dialect='excel-tab')
    for word, score, stddev in reader:
        ...
and then you can create a dict of word to (score, stddev), e.g.
    word_dict[word] = (score, stddev)
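Putting that together, a minimal sketch of the whole flow the question asks for, assuming the tab-separated word-happiness.csv layout and converting the numbers to floats:
import csv

def make_happiness_table(filename):
    table = {}
    with open(filename) as f:
        reader = csv.reader(f, dialect='excel-tab')  # word<TAB>score<TAB>stddev
        for word, score, stddev in reader:
            table[word] = (float(score), float(stddev))
    return table

table = make_happiness_table("word-happiness.csv")
score, stddev = table['hunger']
print("the score for 'hunger' is %f" % score)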
