I am trying to write a code that will take a .txt file containing words and their definitions and produce a dictionary of {'word1':['definition1', 'definition2'...]}. The .txt file is in the following format:
word1
definition1
definition2
(blank line)
word2
definition1
definition2
...
so far the body of the function I have written is as follows:
line = definition_file.readline()
dictx = {}
while line != '':
    key = line.strip()
    defs = []
    line = definition_file.readline()
    while line != '\n':
        defx = [line.strip()]
        defs += defx
        line = definition_file.readline()
    if key not in dictx:
        dictx[key] = defs
    return dictx
I quickly realized the problem with this code is that it will only return a dictionary with the very first word within it. I need a way to make the code loop so that it returns a dictionary with all the words + definitions. I was hoping to do this without using a break.
thanks!
This should do it:
from collections import defaultdict

d = defaultdict(list)
is_definition = False
with open('test.txt') as f:
    for line in f:
        line = line.strip()     # strip() already removes the trailing '\n'
        if line == '':          # blank line
            is_definition = False
            continue
        if is_definition:       # definition line
            d[word].append(line)
        else:                   # word line
            word = line
            is_definition = True
This one-liner will also do the trick:
>>> tesaurus = open('tesaurus.txt').read()
>>> dict(map(lambda x: (x[0], x[1].splitlines()), [term.split("\n", 1) for term in tesaurus.replace("\r", "").split("\n\n")]))
{'word1': ['definition1', 'definition2'], 'word3': ['def1', 'def2'], 'word2': ['definition1', 'definition2']}
Here's another possibility:
d = dict()
defs = list()
with open('test.txt') as infile:
    for line in infile:
        if not line.strip():      # blank line: close out the current group
            d[defs[0]] = defs[1:]
            defs = list()
        else:
            defs.append(line.strip())
    if defs:                      # the file may not end with a blank line
        d[defs[0]] = defs[1:]
Read the whole file
d = dict()
with open('file.txt') as f:
    stuff = f.read()
Split the file on blank lines.
word_defs = stuff.split('\n\n')
Iterate over the definition groups and split the word from the definitions.
for word_def in word_defs:
    word_def = word_def.split('\n')
    word = word_def[0]
    defs = word_def[1:]
    d[word] = defs
If you prefer something more functional/compact (the same thing, just written differently), first build an iterator that produces [word, def, def, ...] groups:
definition_groups = (thing.split('\n') for thing in stuff.split('\n\n'))
Then use a dict comprehension to build the dictionary:
import operator

word = operator.itemgetter(0)
defs = operator.itemgetter(slice(1, None))
g = {word(group): defs(group) for group in definition_groups}
Here is my best answer that meets your criteria.
import sys

d = {}
with open(sys.argv[1], "r") as f:
    done = False
    while not done:
        word = f.readline().strip()
        done = not word
        line = True
        defs = []
        while line:
            line = f.readline().rstrip('\n')
            if line.strip():
                defs.append(line)
        if not done:
            d[word] = defs
print(d)
But I don't understand why you are trying to avoid using break. I think this code is clearer with break... the flow of control is simpler and we don't need as many variables. When word is an empty string, this code just breaks out (immediately stops what it is doing) and that is very easy to understand. You have to study the first code to make sure you know how it works when end-of-file is reached.
import sys

d = {}
with open(sys.argv[1], "r") as f:
    while True:
        word = f.readline().strip()
        defs = []
        if not word:
            break
        while True:
            line = f.readline().rstrip('\n')
            if not line:
                break
            defs.append(line)
        d[word] = defs
print(d)
But I think the best way to write this is to make a helper function that packages up the job of parsing out the definitions:
import sys

def _read_defs(f):
    while True:
        line = f.readline().rstrip('\n')
        if not line:
            break
        yield line

d = {}
with open(sys.argv[1], "r") as f:
    while True:
        word = f.readline().strip()
        if not word:
            break
        d[word] = list(_read_defs(f))
print(d)
The first one is trickier because it is avoiding the use of break. The others are simpler to understand, with two similar loops that have similar flow of control.
I have a text file which is named test.txt. I want to read it and return a list of all words (with newlines removed) from the file.
This is my current code:
def read_words(words_file):
    open_file = open(words_file, 'r')
    words_list = []
    contents = open_file.readlines()
    for i in range(len(contents)):
        words_list.append(contents[i].strip('\n'))
    return words_list
    open_file.close()
Running this code produces this list:
['hello there how is everything ', 'thank you all', 'again', 'thanks a lot']
I want the list to look like this:
['hello','there','how','is','everything','thank','you','all','again','thanks','a','lot']
Depending on the size of the file, this seems like it would be as easy as:
with open(file) as f:
    words = f.read().split()
Replace the words_list.append(...) line in the for loop with the following:
words_list.extend(contents[i].split())
This will split each line on whitespace characters, and then add each element of the resulting list to words_list.
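To see what extend() does here, below is a small sketch with hard-coded lines standing in for contents (the sample text is made up):

```python
# Hypothetical sample lines standing in for the file's contents.
contents = ['hello there how is everything \n', 'thank you all\n']

words_list = []
for line in contents:
    # append() would add the whole line as one element;
    # extend() adds each word of the split list individually.
    words_list.extend(line.split())

print(words_list)
# ['hello', 'there', 'how', 'is', 'everything', 'thank', 'you', 'all']
```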
Or as an alternative method for rewriting the entire function as a list comprehension:
def read_words(words_file):
    return [word for line in open(words_file, 'r') for word in line.split()]
Here is how I'd write that:
def read_words(words_file):
    with open(words_file, 'r') as f:
        ret = []
        for line in f:
            ret += line.split()
        return ret

print(read_words('test.txt'))
The function can be somewhat shortened by using itertools, but I personally find the result less readable:
import itertools

def read_words(words_file):
    with open(words_file, 'r') as f:
        return list(itertools.chain.from_iterable(line.split() for line in f))

print(read_words('test.txt'))
The nice thing about the second version is that it can be made to be entirely generator-based and thus avoid keeping all of the file's words in memory at once.
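For instance, a fully generator-based variant might look like this (the helper name iter_words and the sample file are illustrative, not from the original answer):

```python
import os
import tempfile

def iter_words(words_file):
    # A generator: the file stays open only while words are being consumed,
    # and only the current line is held in memory at any time.
    with open(words_file) as f:
        for line in f:
            yield from line.split()

# Write a small sample file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')
with open(path, 'w') as f:
    f.write('hello there\nthanks a lot\n')

print(list(iter_words(path)))  # ['hello', 'there', 'thanks', 'a', 'lot']
```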
There are several ways to do this. Here are a few:
If you don't care about repeated words:
import itertools

def getWords(filepath):
    with open(filepath) as f:
        return list(itertools.chain.from_iterable(line.split() for line in f))
If you want to return a list of words in which each word appears only once:
Note: this does not preserve the order of the words
def getWords(filepath):
    with open(filepath) as f:
        return {word for line in f for word in line.split()}        # python2.7
        return set(word for line in f for word in line.split())     # python 2.6
If you want a set --and-- want to preserve the order of words:
import itertools

def getWords(filepath):
    with open(filepath) as f:
        words = []
        pos = {}
        position = itertools.count()
        for line in f:
            for word in line.split():
                if word not in pos:
                    pos[word] = position.next()  # next(position) in python 3
                    words.append(word)
        return sorted(words, key=pos.__getitem__)
If you want a word-frequency dictionary:
import collections
import itertools

def getWords(filepath):
    with open(filepath) as f:
        return collections.Counter(itertools.chain.from_iterable(line.split() for line in f))
Hope these help
The actual question has already been answered, but I would like to point out that the line open_file.close() will never be executed, because the function returns before reaching it. Move open_file.close() before the return statement.
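For completeness, a with-based sketch that sidesteps the misplaced close() entirely (the file name and contents here are illustrative):

```python
import os
import tempfile

def read_words(words_file):
    # The with block closes the file automatically, even though we
    # return from inside it, so no explicit close() can be misplaced.
    with open(words_file) as open_file:
        return [line.strip('\n') for line in open_file]

# Self-contained demo with a throwaway sample file.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')
with open(path, 'w') as f:
    f.write('thank you all\nagain\n')

print(read_words(path))  # ['thank you all', 'again']
```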
So, I have this text file which contains this info:
student_num1 student_name1 student_grade1
student_num2 student_name2 student_grade2
student_num3 student_name3 student_grade3
What I want to do is take each line of this text file as a dictionary entry with this format:
students = { student_num1: [student_name1, student_grade1], student_num2: [student_name2, student_grade2], student_num3: [student_name3, student_grade3] }
Basically, the first string of the line should be the key, and the two strings next to it would be the value. But I don't know how to make Python separate the strings in each line and assign them as the key and value for the dictionary.
EDIT:
So, I've tried some code. (I saw all your solutions, and I think they'll all definitely work, but I also want to learn to create my own solution, so I would really appreciate it if you could check mine!)
for line in fh:
    line = line.split(";")
    student_num = line[0]
    student_name = line[1]
    student_grade = line[2]
    count =+ 1
    direc[student_num] = [student_name, student_grade]
    student_num = "student_num" + str(count)
    student_grade = "student_grade" + str(count)
    student_name = "student_name" + str(count)
print(direc)
The problem is I get a "list index out of range" error on the line student_name = line[1].
EDIT: THANK YOU EVERYONE! Every single one of your suggested solutions works! I've also fixed my own solution. This is the fixed one (as suggested by @norok2):
for line in fh:
    line = line.split(" ")
    student_num = line[0]
    student_name = line[1]
    student_grade = line[2]
    count =+ 1
    direc[student_num] = [student_name, student_grade]
    student_num = "student_num" + str(count)
    student_grade = "student_grade" + str(count)
    student_name = "student_name" + str(count)
As a dict comprehension:
with open("data.txt", "r") as f:
    students = {k: v for k, *v in map(str.split, f)}
Explanation:
The file object f is already an iterator (that yields each line), we want to split the lines, so we can use map(str.split, f) or (line.split() for line in f).
After that we know, that the first item is the key of the dictionary, and the remaining items are the values. We can use unpacking for that. An unpacking example:
>>> a, *b = [1,2,3]
>>> a
1
>>> b
[2, 3]
Then we use a comprehension to build the dict with the values we are capturing in the unpacking.
A dict comprehension is an expression to build up dictionaries, for example:
>>> {x:x+1 for x in range(5)}
{0: 1, 1: 2, 2: 3, 3: 4, 4: 5}
Example,
File data.txt:
student_num1 student_name1 student_grade1
student_num2 student_name2 student_grade2
student_num3 student_name3 student_grade3
Reading it:
>>> with open("data.txt", "r") as f:
... students = {k:v for k, *v in map(str.split, f)}
...
>>> students
{'student_num1': ['student_name1', 'student_grade1'], 'student_num2': ['student_name2', 'student_grade2'], 'student_num3': ['student_name3', 'student_grade3']}
My current approach uses file handling to open a file in read mode and then read the lines in the file. For each line, it removes the extra newline and whitespace and splits the line on spaces to create a list, then uses unpacking to store the single first value as the key and the list of the remaining two values as the value, adding each pair to the dictionary.
temp.txt
student_num1 student_name1 student_grade1
student_num2 student_name2 student_grade2
student_num3 student_name3 student_grade3
main.py
d = dict()
with open("temp.txt", "r") as f:
    for line in f.readlines():
        key, *values = line.strip().split(" ")
        d[key] = values
print(d)
Output
{'student_num1': ['student_name1', 'student_grade1'], 'student_num2': ['student_name2', 'student_grade2'], 'student_num3': ['student_name3', 'student_grade3']}
with open('data.txt') as f:
    lines = f.readlines()

d = {}
for line in lines:
    tokens = line.split()
    d[tokens[0]] = tokens[1:]
print(d)
I hope this is understandable. To split the lines into the different tokens, we use the split function.
The reason why your solution is giving you that error is that your lines do not contain the character ;, yet you try to split on that character with line = line.split(";").
You should replace that with:
line = line.split(" ") to split on the space character
or
line = line.split() to split on any blank character
However, for a more elegant solution, see here.
Have you tried something as simple as this:
d = {}
with open('students.txt') as f:
    for line in f:
        key, *rest = line.split()
        d[key] = rest

print(d)
# {'student_num1': ['student_name1', 'student_grade1'], 'student_num2': ['student_name2', 'student_grade2'], 'student_num3': ['student_name3', 'student_grade3']}
file.txt:
student_num1 student_name1 student_grade1
student_num2 student_name2 student_grade2
student_num3 student_name3 student_grade3
Main.py:
def main():
    file = open('file.txt', 'r')
    students = {}
    for line in file:
        fields = line.split(" ")
        fields[2] = fields[2].replace("\n", "")
        students[fields[1]] = [fields[0], fields[2]]
    print(students)

main()
Output:
{'student_name1': ['student_num1', 'student_grade1'], 'student_name2': ['student_num2', 'student_grade2'], 'student_name3': ['student_num3', 'student_grade3']}
In the following code, if I use:
for line in fin:
It only executes for 'a'
But if I use:
wordlist = fin.readlines()
for line in wordlist:
Then it executes for a thru z.
But readlines() reads the whole file at once, which I don't want.
How to avoid this?
def avoids():
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    num_words = {}
    fin = open('words.txt')
    for char in alphabet:
        num_words[char] = 0
        for line in fin:
            not_found = True
            word = line.strip()
            if word.lower().find(char.lower()) != -1:
                num_words[char] += 1
    fin.close()
    return num_words
The syntax for line in fin can only be used once. After that, you've exhausted the file and you can't read from it again unless you "reset the file pointer" with fin.seek(0). Conversely, fin.readlines() gives you a list, which you can iterate over again and again.
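A quick demonstration of the exhaustion and of seek(0), using io.StringIO as a stand-in for an open text file (the sample words are made up):

```python
import io

# io.StringIO behaves like an open text file for this demonstration.
fin = io.StringIO('apple\nbanana\n')

first_pass = [line.strip() for line in fin]
second_pass = [line.strip() for line in fin]  # exhausted: nothing left
fin.seek(0)                                   # reset the file pointer
third_pass = [line.strip() for line in fin]

print(first_pass)   # ['apple', 'banana']
print(second_pass)  # []
print(third_pass)   # ['apple', 'banana']
```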
I think a simple refactor with Counter (python2.7+) could save you this headache:
from collections import Counter

with open('file') as fin:
    result = Counter()
    for line in fin:
        result += Counter(set(line.strip().lower()))
which will count the number of words in your file (1 word per line) that contain a particular character (which is what your original code does I believe ... Please correct me if I'm wrong)
You could also do this easily with a defaultdict (python2.5+):
from collections import defaultdict

with open('file') as fin:
    result = defaultdict(int)
    for line in fin:
        chars = set(line.strip().lower())
        for c in chars:
            result[c] += 1
And finally, kicking it old-school -- I don't even know when setdefault was introduced...:
fin = open('file')
result = dict()
for line in fin:
    chars = set(line.strip().lower())
    for c in chars:
        result[c] = result.setdefault(c, 0) + 1
fin.close()
You have three options:
Read in the whole file anyway.
Seek back to the beginning of the file before attempting to iterate over it again.
Rearchitect your code so that it doesn't need to iterate over the file more than once.
Try:
from collections import defaultdict
from itertools import product

def avoids():
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    num_words = defaultdict(int)
    with open('words.txt') as fin:
        words = [x.strip() for x in fin.readlines() if x.strip()]
    for ch, word in product(alphabet, words):
        if ch not in word:
            continue
        num_words[ch] += 1
    return num_words
Write a program that reads the contents of a random text file. The program should create a dictionary in which the keys are individual words found in the file and the values are the number of times each word appears.
How would I go about doing this?
def main():
    c = 0
    dic = {}
    words = set()
    inFile = open('text2', 'r')
    for line in inFile:
        line = line.strip()
        line = line.replace('.', '')
        line = line.replace(',', '')
        line = line.replace("'", '')  # strips the punctuation
        line = line.replace('"', '')
        line = line.replace(';', '')
        line = line.replace('?', '')
        line = line.replace(':', '')
        words = line.split()
        for x in words:
            for y in words:
                if x == y:
                    c += 1
            dic[x] = c
    print(dic)
    print(words)
    inFile.close()

main()
Sorry for the vague question. Never asked any questions here before. This is what I have so far. Also, this is the first ever programming I've done so I expect it to be pretty terrible.
with open('path/to/file') as infile:
    # code goes here

That's how you open a file

for line in infile:
    # code goes here
That's how you read a file line-by-line
line.strip().split()
That's how you split a line into (white-space separated) words.
some_dictionary['abcd']
That's how you access the key 'abcd' in some_dictionary.
Questions for you:
What does it mean if you can't access the key in a dictionary?
What error does that give you? Can you catch it with a try/except block?
How do you increment a value?
Is there some function that GETS a default value from a dict if the key doesn't exist?
For what it's worth, there's also a function that does almost exactly this, but since this is pretty obviously homework it won't fulfill your assignment requirements anyway. It's in the collections module. If you're interested, try and figure out what it is :)
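Without giving the whole assignment away, here is what that default-value idea looks like on a hard-coded sentence (the sample text is made up):

```python
counts = {}
for word in 'the cat sat on the mat'.split():
    # get() returns a default (0 here) when the key is missing,
    # so the first occurrence of each word starts from zero.
    counts[word] = counts.get(word, 0) + 1

print(counts)  # {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
```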
There are at least three different approaches to add a new word to the dictionary and count the number of occurrences in this file.
def add_element_check1(my_dict, elements):
    for e in elements:
        if e not in my_dict:
            my_dict[e] = 1
        else:
            my_dict[e] += 1

def add_element_check2(my_dict, elements):
    for e in elements:
        if e not in my_dict:
            my_dict[e] = 0
        my_dict[e] += 1

def add_element_except(my_dict, elements):
    for e in elements:
        try:
            my_dict[e] += 1
        except KeyError:
            my_dict[e] = 1

my_words = {}
with open('pathtomyfile.txt', 'r') as in_file:
    for line in in_file:
        words = [word.strip().lower() for word in line.strip().split()]
        add_element_check1(my_words, words)
        # or add_element_check2(my_words, words)
        # or add_element_except(my_words, words)
If you are wondering which is the fastest, the answer is: it depends. It depends on how often a given word occurs in the file. If a word occurs only (relatively) few times, the try-except would be the best choice in your case.
I have done some simple benchmarks here
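If you want to try a rough comparison yourself, something like the following illustrative harness works (the word lists and repetition counts are made up):

```python
import timeit

def add_element_except(my_dict, elements):
    # The try/except variant: cheap when the key usually exists already.
    for e in elements:
        try:
            my_dict[e] += 1
        except KeyError:
            my_dict[e] = 1

# Mostly-unique words: the KeyError path fires almost every time.
unique_words = [str(i) for i in range(1000)]
# One repeated word: the no-exception path dominates.
repeated_words = ['word'] * 1000

for label, words in [('unique', unique_words), ('repeated', repeated_words)]:
    t = timeit.timeit(lambda: add_element_except({}, words), number=100)
    print('%s: %.4f seconds' % (label, t))
```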
This is a perfect job for Python's built-in collections module. From it, you can import Counter, which is a dictionary subclass made for just this.
How you want to process your data is up to you. One way to do this would be something like this
from collections import Counter

# Open your file and split by white spaces
with open("yourfile.txt", "r") as infile:
    textData = infile.read()

# Replace characters you don't want with empty strings
textData = textData.replace(".", "")
textData = textData.replace(",", "")
textList = textData.split(" ")

# Put your data into the counter container datatype
dic = Counter(textList)

# Print out the results
for key, value in dic.items():
    print("Word: %s\n Count: %d\n" % (key, value))
Hope this helps!
Matt
I would like to define a function scaryDict() which takes one parameter (a text file) and returns the words from the text file in alphabetical order (basically produce a dictionary), but does not print any one- or two-letter words.
Here is what I have so far...it isn't much but I don't know the next step
def scaryDict(fileName):
    inFile = open(fileName, 'r')
    lines = inFile.read()
    line = lines.split()
    myDict = {}
    for word in inFile:
        myDict[words] = []
    # I am not sure what goes between the line above and below
    for x in lines:
        print(word, end='\n')
You are doing fine till line = lines.split(). But your for loop must loop through the line array, not the inFile.
for word in line:
    if len(word) > 2:  # Make sure to check the word length!
        myDict[word] = 'something'
I'm not sure what you want with the dictionary (maybe get the word count?), but once you have it, you can get the words you added to it by,
allWords = list(myDict.keys())  # so allWords is now a list of words (keys() is a view in Python 3)
And then you can sort allWords to get them in alphabetical order:
allWords.sort()
I would store all of the words into a set (to eliminate dups), then sort that set:
#!/usr/bin/python3
def scaryDict(fileName):
    with open(fileName) as inFile:
        return sorted(set(word
                          for line in inFile
                          for word in line.split()
                          if len(word) > 2))

scaryWords = scaryDict('frankenstein.txt')
print('\n'.join(scaryWords))
Also keep in mind that as of Python 2.5, file objects support the with statement (via their __enter__ and __exit__ methods), which can prevent some issues (such as the file never getting closed):
with open(...) as f:
    for line in f:
        <do something with line>
Make the words a unique set, sort the set, and now you can put it all together.
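Putting those steps together on a hard-coded word list (the sample words are made up):

```python
words = 'raven raven grave tomb ow grave'.split()

long_words = set(w for w in words if len(w) > 2)  # unique set, no short words
result = sorted(long_words)                       # sort the set

print(result)  # ['grave', 'raven', 'tomb']
```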
Sorry that I am 3 years late :) Here is my version:
def scaryDict():
    infile = open('filename', 'r')
    content = infile.read()
    infile.close()
    table = str.maketrans('.`/()|,\';!:"?=-', 15 * ' ')
    content = content.translate(table)
    words = content.split()
    new_words = list()
    for word in words:
        if len(word) > 2:
            new_words.append(word)
    new_words = list(set(new_words))
    new_words.sort()
    for word in new_words:
        print(word)