I am learning Python from an introductory textbook and I am stuck on the following problem:
You will implement function index() that takes as input the name of a text file and a list of words. For every word in the list, your function will find the lines in the text file where the word occurs and print the corresponding line numbers.
Ex:
>>> index('raven.txt', ['raven', 'mortal', 'dying', 'ghost', 'ghastly', 'evil', 'demon'])
ghost 9
dying 9
demon 122
evil 99, 106
ghastly 82
mortal 30
raven 44, 53, 55, 64, 78, 97, 104, 111, 118, 120
Here is my attempt at the problem:
def index(filename, lst):
    infile = open(filename, 'r')
    lines = infile.readlines()
    lst = []
    dic = {}
    for line in lines:
        words = line.split()
        lst.append(words)
    for i in range(len(lst)):
        for j in range(len(lst[i])):
            if lst[i][j] in lst:
                dic[lst[i][j]] = i
    return dic
When I run the function, I get back an empty dictionary. I do not understand why I am getting an empty dictionary. So what is wrong with my function? Thanks.
You are overwriting the value of lst. You use it as both a parameter to a function (in which case it is a list of strings) and as the list of words in the file (in which case it's a list of list of strings). When you do:
if lst[i][j] in lst
The comparison always returns False because lst[i][j] is a str, but lst contains only lists of strings, not strings themselves. This means that the assignment to the dic is never executed and you get an empty dict as result.
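A minimal illustration of that point, using invented sample data rather than the raven text: once the name refers to a list of lists, a membership test for a plain string cannot succeed.

```python
# Hypothetical sample data standing in for the words read from a file.
lines_as_words = [['a', 'b', 'c'], ['d', 'e', 'f']]

# A string is compared against each element of the outer list, and every
# element is itself a list, so this test is always False.
print('a' in lines_as_words)                # False
print(['a', 'b', 'c'] in lines_as_words)    # True: a whole inner list matches
print('a' in lines_as_words[0])             # True: membership in one inner list
```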
To avoid this you should use a different name for the list in which you store the words, for example:
In [4]: !echo 'a b c\nd e f' > test.txt
In [5]: def index(filename, lst):
   ...:     infile = open(filename, 'r')
   ...:     lines = infile.readlines()
   ...:     words = []
   ...:     dic = {}
   ...:     for line in lines:
   ...:         line_words = line.split()
   ...:         words.append(line_words)
   ...:     for i in range(len(words)):
   ...:         for j in range(len(words[i])):
   ...:             if words[i][j] in lst:
   ...:                 dic[words[i][j]] = i
   ...:     return dic
   ...:
In [6]: index('test.txt', ['a', 'b', 'c'])
Out[6]: {'a': 0, 'c': 0, 'b': 0}
There are also a lot of things you can change.
When you want to iterate over a list you don't have to use indexes explicitly. If you need the index you can use enumerate:
for i, line_words in enumerate(words):
    for word in line_words:
        if word in lst:
            dic[word] = i
You can also iterate directly over a file (refer to the Reading and Writing Files section of the Python tutorial for a bit more information):
# use the with statement to make sure that the file gets closed
with open('test.txt') as infile:
    for i, line in enumerate(infile):
        print('Line {}: {}'.format(i, line))
In fact I don't see why you would first build that words list of lists. Just iterate over the file directly while building the dictionary:
def index(filename, lst):
    with open(filename, 'r') as infile:
        dic = {}
        for i, line in enumerate(infile):
            for word in line.split():
                if word in lst:
                    dic[word] = i
    return dic
Your dic values should be lists, since more than one line can contain the same word. As it stands your dic would only store the last line where a word is found:
from collections import defaultdict
def index(filename, words):
    # a frozenset makes the later `in` checks faster
    words = frozenset(words)
    with open(filename) as infile:
        dic = defaultdict(list)
        for i, line in enumerate(infile):
            for word in line.split():
                if word in words:
                    dic[word].append(i)
    return dic
If you don't want to use collections.defaultdict you can replace dic = defaultdict(list) with dic = {} and then change:
dic[word].append(i)
to:
if word not in dic:
    dic[word] = [i]
else:
    dic[word].append(i)
Or, alternatively, you can use dict.setdefault:
dic.setdefault(word, []).append(i)
although this last way is a bit slower than the original code.
Note that with all these solutions, a word that isn't found in the file will not appear in the result at all. However, you may want it in the result, with an empty list as its value. In that case it's simpler to initialize the dict with empty lists before starting the loop, as in:
dic = {word : [] for word in words}
for i, line in enumerate(infile):
    for word in line.split():
        if word in words:
            dic[word].append(i)
Refer to the documentation about List Comprehensions and Dictionaries to understand the first line.
You can also iterate over words instead of the line, like this:
dic = {word : [] for word in words}
for i, line in enumerate(infile):
    for word in words:
        if word in line.split():
            dic[word].append(i)
Note however that this is going to be slower because:
line.split() returns a list, so word in line.split() will have to scan all the list.
You are repeating the computation of line.split().
You can try to solve these two problems by doing:
dic = {word : [] for word in words}
for i, line in enumerate(infile):
    line_words = frozenset(line.split())
    for word in words:
        if word in line_words:
            dic[word].append(i)
Note that here we iterate once over line.split() to build the set and once over words. Depending on the sizes of the two sets, this may be slower or faster than the original version (iterating over line.split()).
However, at this point it's probably faster to intersect the sets:
dic = {word : [] for word in words}
for i, line in enumerate(infile):
    line_words = frozenset(line.split())
    for word in words & line_words: # & stands for set intersection
        dic[word].append(i)
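The snippets above return 0-based line indexes in a dictionary, while the textbook problem asks for printed, 1-based line numbers. Below is a rough sketch of how the pieces might be combined. The names build_index and format_index are invented for this example, and the function takes any iterable of lines (an open file would work too) so that it can be tried without raven.txt.

```python
import io

def build_index(lines, words):
    """Map each word to the 1-based numbers of the lines containing it."""
    wanted = frozenset(words)
    found = {word: [] for word in wanted}
    # enumerate(..., 1) starts counting at 1, matching the textbook output.
    for number, line in enumerate(lines, 1):
        # A set per line avoids listing a line twice when a word repeats in it.
        for word in wanted & set(line.split()):
            found[word].append(number)
    return found

def format_index(found):
    """Return one 'word n1, n2' string per word that was actually found."""
    return ['{} {}'.format(word, ', '.join(map(str, numbers)))
            for word, numbers in sorted(found.items()) if numbers]

# Invented three-line sample text in place of the poem.
sample = io.StringIO('the raven sat\nno ghost here\nraven raven again\n')
for entry in format_index(build_index(sample, ['raven', 'ghost', 'demon'])):
    print(entry)
```

Words that never occur (demon in this sample) are kept in the dictionary with an empty list but skipped when printing; whether that is the desired behaviour depends on the exercise.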
Try this,
def index(filename, lst):
    dic = {w:[] for w in lst}
    for n,line in enumerate( open(filename,'r') ):
        for word in lst:
            if word in line.split(' '):
                dic[word].append(n+1)
    return dic
There are some features of the language introduced here that you should be aware of because they will make life a lot easier in the long run.
The first is a dictionary comprehension. It basically initializes a dictionary using the words in lst as keys and an empty list [] as the value for each key.
Next is the enumerate function. It lets us iterate over the items in a sequence while also getting the index of each item. In this case, because we passed a file object to enumerate, it loops over the lines. For each iteration, n is the 0-based index of the line (hence the n+1 when storing the line number) and line is the line itself. Next we iterate over the words in lst.
Notice that we don't need any indices here. Python encourages looping directly over the objects in a sequence rather than looping over indices and then looking the objects up by index (for example, it discourages for i in range(len(lst)): do something with lst[i]).
Finally, the in operator is a very straightforward way to test membership for many types of objects, and the syntax is very intuitive. In this case, we are asking whether the current word from lst is in the current line.
Note that we use line.split(' ') to get a list of the words in the line. If we didn't, 'the' in 'there was a ghost' would return True, because 'the' is a substring of one of the words.
On the other hand, 'the' in ['there', 'was', 'a', 'ghost'] would return False. If the conditional returns True, we append the line number to the list associated with the key in our dictionary.
That might be a lot to chew on, but these concepts make problems like this more straightforward.
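A small check of the substring point, using an invented sample line. It also shows one caveat that is easy to miss: a line read from a file normally ends with a newline, and split(' ') leaves that newline attached to the last word, whereas split() with no argument strips all whitespace.

```python
# Invented sample line, including the trailing newline a file would provide.
line = 'there was a ghost\n'

print('the' in line)                 # True: substring match inside 'there'
print('the' in line.split())         # False: compared against whole words
print(line.split(' '))               # ['there', 'was', 'a', 'ghost\n']
print('ghost' in line.split(' '))    # False: the last word keeps its newline
print('ghost' in line.split())       # True: split() drops the newline
```

So for words that may appear at the end of a line, line.split() is probably the safer choice.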
First, your function parameter holding the words is named lst, and the list where you put the words from the file is also named lst, so the words passed to your function are lost: line 4 of your function (lst = []) rebinds the name.
Second, the first for loop goes over each line in the file and appends that line's words to lst, so afterwards lst holds one list of words per line. The for i loop then walks over those per-line lists and the for j loop over the words inside each of them.
In summary, in that if you are asking "is this single word one of the elements of lst?", and because lst now contains only lists, the answer is always no, so the dict is never filled. With a flat list of words from the file (called words here) and the search words kept in lst, a single loop is enough:
for i in range(len(words)):
    if words[i] in lst:
        dic[words[i]] = dic[words[i]] + i # To count repetitions
You need to rethink the problem; even my answer will fail, because the word will not yet exist in the dict and the lookup raises an error, but you get the point. Good luck!
Related
I have been trying to fit the following list of words from a text file into one list comprehension:
import string
file = open("Lincoln.txt", "r").read().split()
word_list = []
world_list = []
for v in file:
    word_list.append(v.translate(str.maketrans("", "", string.punctuation)).lower())
for i in word_list:
    if i != '':
        world_list.append(i)
This is successful, but I'm not sure how to include the second part of the for loop in that same list comprehension:
word_list = [v.translate(str.maketrans("", "",string.punctuation)).lower() for i,v in enumerate(word_list)]
Without the second part, I still get empty strings from my word extraction.
[i for i in (v.translate(str.maketrans("", "",string.punctuation)).lower() for v in word_list) if i != '']
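A runnable sketch of the same idea, with an invented sentence in place of Lincoln.txt. The names raw_words and strip_punct are made up for this example; building the translation table once, outside the comprehension, is a small change from the one-liner above and not something the original code does.

```python
import string

# Hypothetical sample text; the original reads Lincoln.txt instead.
raw_words = 'Four score -- and seven years ago, our fathers'.split()

# Build the translation table once instead of once per word.
strip_punct = str.maketrans('', '', string.punctuation)

# The inner generator cleans each word; the outer comprehension drops
# the empty strings left behind by tokens that were pure punctuation.
cleaned = [w for w in (v.translate(strip_punct).lower() for v in raw_words) if w]
print(cleaned)
```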
I have the following code:
result = set()
with open("words.txt") as fd:
    for line in fd:
        matching_words = {word for word in line.lower().split() if len(word)==4 and "'" not in word}
        result.update(matching_words)
print(result)
print(len(result))
result_dict = {}
for word in result:
    result_dict[word[2:]] = result_dict.get(word[2:], []) + [word]
print(result_dict)
print({key: len(value) for key, value in result_dict.items()})
This takes a .txt file, finds all the unique four-letter words, and excludes any that contain an apostrophe. The words are then grouped by their last two characters: each ending becomes a dictionary key whose value is the list of words with that ending, and the final print shows the number of words per ending.
What I now need to do is disregard any list with fewer than 30 words in it.
Then randomly select one word from each of the remaining lists and print the list of words.
The following comprehension should work:
[random.choice(v) for v in result_dict.values() if len(v) >= 30]
Why not use random.choice and use a list comprehension to limit the values given to it:
random.choice([k for k, v in result_dict.items() if len(v) >= 30])
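The two one-liners above do different things, which a small experiment makes visible. This is only a sketch: the dictionary contents are invented, and the threshold is lowered from 30 to a hypothetical MIN_WORDS of 2 to keep the data short.

```python
import random

# Invented data; keys are the last two letters of each four-letter word.
result_dict = {'ke': ['bake', 'lake', 'make'], 'ly': ['only'], 'nd': ['band', 'hand']}
MIN_WORDS = 2  # stands in for the 30 used in the question

# First suggestion: one random word from every sufficiently large group.
picks = [random.choice(group) for group in result_dict.values()
         if len(group) >= MIN_WORDS]

# Second suggestion: a single random *ending* among the large groups.
ending = random.choice([k for k, v in result_dict.items() if len(v) >= MIN_WORDS])

print(picks)
print(ending)
```

Only the first form yields one word per remaining list, which appears to be what the question asks for.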
I'm sort of new to programming. I have created a class that uses list comprehension in its initializer. It is as follows:
class Collection_of_word_counts():
    '''this class has one instance variable, called counts which stores a
    dictionary where the keys are words and the values are their occurrences'''
    def __init__(self:'Collection_of_words', file_name: str) -> None:
        ''' this initializer will read in the words from the file,
        and store them in self.counts'''
        l_words = open(file_name).read().split()
        s_words = set(l_words)
        self.counts = dict([ [word, l_words.count(word)]
                            for word
                            in s_words])
I think I did alright for a novice. It works! But I don't exactly understand how this would be represented in a for-loop. My guess was terribly wrong:
self.counts =[]
for word in s_words:
    self.counts = [word, l_words.count(word)]
dict(self.counts)
This is what your comprehension is as a for loop:
dictlist = []
for word in s_words:
    dictlist.append([word, l_words.count(word)])
self.counts = dict(dictlist)
Your guess was not wrong at all; you just forgot to append and assign back to self.counts:
counts = []
for word in s_words:
    counts.append([word, l_words.count(word)])
self.counts = dict(counts)
That's what a list comprehension does, essentially; build a list from the loop expression.
You could also translate that to a dictionary comprehension instead:
self.counts = {word: l_words.count(word) for word in s_words}
or better still, use a collections.Counter() object and save yourself all that work:
from collections import Counter
def __init__(self:'Collection_of_words', file_name: str) -> None:
    ''' this initializer will read in the words from the file,
    and store them in self.counts'''
    with open(file_name) as infile:
        self.counts = Counter(infile.read().split())
The Counter() object goes about counting your words a little more efficiently, and gives you additional helpful functionality such as listing the top N counts and the ability to merge counts.
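A brief sketch of the Counter features mentioned above, run on invented strings rather than a file:

```python
from collections import Counter

# Two made-up word lists standing in for the contents of two files.
first = Counter('the raven and the ghost'.split())
second = Counter('the demon'.split())

print(first['the'])            # 2
print(first['missing'])        # 0: absent words count as zero, no KeyError
print(first.most_common(1))    # [('the', 2)]

merged = first + second        # adding Counters merges their counts
print(merged['the'])           # 3
```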
You are essentially creating a dictionary, where the key of the dictionary is the word and the value corresponding to that key is the number of times the word appears.
self.counts ={}
for word in s_words:
    self.counts[word] = l_words.count(word)
I'm currently trying to create an index of words, reading each line from a text file and checking to see if the word is in that line. If so, it prints out the number line and continues the check. I've gotten it to work how I wanted to when printing each word and line number, but I'm not sure what storage system I could use to contain each number.
Code example:
def index(filename, wordList):
    'string, list(string) ==> string & int, returns an index of words with the line number\
    each word occurs in'
    indexDict = {}
    res = []
    infile = open(filename, 'r')
    count = 0
    line = infile.readline()
    while line != '':
        count += 1
        for word in wordList:
            if word in line:
                #indexDict[word] = [count]
                print(word, count)
        line = infile.readline()
    #return indexDict
This prints the word and whatever the count is at the time (line number), but what I'm trying to do is store the numbers so that later on I can make it print out
word linenumber
word2 linenumber, linenumber
And so on. I felt a dictionary would work for this if I put each line number inside a list so each key can contain more than one value, but the closest I got was this:
{'mortal': [30], 'dying': [9], 'ghastly': [82], 'ghost': [9], 'raven': [120], 'evil': [106], 'demon': [122]}
When I wanted it to show up as:
{'mortal': [30], 'dying': [9], 'ghastly': [82], 'ghost': [9], 'raven': [44, 53, 55, 64, 78, 97, 104, 111, 118, 120], 'evil': [99, 106], 'demon': [122]}
Any ideas?
Try something like this:
import collections
def index(filename, wordList):
    indexDict = collections.defaultdict(list)
    with open(filename) as infile:
        for (i, line) in enumerate(infile.readlines()):
            for word in wordList:
                if word in line:
                    indexDict[word].append(i+1)
    return indexDict
This yields the exact same results as in your example (using Poe's Raven).
Alternatively, you might consider using a normal dict instead of a defaultdict and initializing it with all the words in the list, to make sure that indexDict contains an entry even for words that are not in the text.
Also, note the use of enumerate. This builtin function is very useful for iterating over both the index and the item at that index of some list (like the lines in the file).
You are replacing the old value with this line:
indexDict[word] = [count]
Changing it to
indexDict[word] = indexDict.setdefault(word, []) + [count]
will yield the answer you want. It gets the current value of indexDict[word] and appends the new count to it; if there is no indexDict[word] yet, it creates a new empty list and appends count to that.
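A small sketch of setdefault in isolation, fed with a few hand-picked (word, line number) pairs instead of a file. It uses the in-place variant, setdefault(...).append(...), which should give the same result as the + [count] form without building a new list on every hit; the name index_dict is just for this example.

```python
index_dict = {}
for word, count in [('raven', 44), ('evil', 99), ('raven', 53)]:
    # setdefault returns the list already stored under word, or stores
    # a new empty list first and returns that.
    index_dict.setdefault(word, []).append(count)

print(index_dict)  # {'raven': [44, 53], 'evil': [99]}
```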
There is probably a more pythonic way to write this, but just for readability you could try this (a simple example):
dict = {1: [], 2: [], 3: []}
list = [1,2,2,2,3,3]
for k in dict.keys():
    for i in list:
        if i == k:
            dict[k].append(i)
In [7]: dict
Out[7]: {1: [1], 2: [2, 2, 2], 3: [3, 3]}
You need to append your next item to the list, if the list already exists.
The easiest way to have the list already be there even for the first time you find a word, is to use the collections.defaultdict class to track your word-to-lines mapping:
from collections import defaultdict
def index(filename, wordList):
    indexDict = defaultdict(list)
    with open(filename, 'r') as infile:
        # start counting at 1 so the numbers match the line numbers in the question
        for i, line in enumerate(infile, 1):
            for word in wordList:
                if word in line:
                    indexDict[word].append(i)
                    print(word, i)
    return indexDict
I've simplified your code a little using best practices; opening the file as a context manager so it'll close automatically when done, and using enumerate() to create line numbers on the fly.
You could speed this up a little further still (and make it more accurate) if you turned your lines into a set of words (set(line.split()) perhaps, but that won't remove punctuation), as then you could use set intersection tests against wordList (also a set), which could be considerably faster to find matching words.
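A possible shape for that set-based variant, offered as a sketch rather than a drop-in replacement. The name index_words is invented, the function takes any iterable of lines so it can be tried on an in-memory sample, and lowercasing plus stripping everything in string.punctuation is an assumption about what counts as a word (it would also turn "o'er" into "oer", for instance).

```python
import io
import string
from collections import defaultdict

def index_words(lines, word_list):
    """Map each wanted word to the 1-based lines where it occurs as a whole word."""
    wanted = set(word_list)
    strip_punct = str.maketrans('', '', string.punctuation)
    index_dict = defaultdict(list)
    for number, line in enumerate(lines, 1):
        # Normalise the line, then reduce it to a set of distinct words.
        line_words = set(line.lower().translate(strip_punct).split())
        # Set intersection finds every wanted word on this line in one step.
        for word in wanted & line_words:
            index_dict[word].append(number)
    return index_dict

# Invented two-line sample; 'ravenous' must not count as 'raven'.
sample = io.StringIO('Quoth the Raven, "Nevermore."\nThe ravenous ghost!\n')
print(dict(index_words(sample, ['raven', 'ghost'])))
```

Unlike the substring test if word in line, this does not match raven inside ravenous, so the line numbers it reports can differ from those of the versions above.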
I have to write a function that takes an open file with one lowercase word per line. It has to return a dictionary whose keys are single lowercase letters and whose values are lists of the words from the file that start with that letter. (The keys in the dictionary come only from the first letters of the words that appear in the file.)
This is my code:
def words(file):
    line = file.readline()
    dict = {}
    list = []
    while (line != ""):
        list = line[:].split()
        if line[0] not in dict.keys():
            dict[line[0]] = list
        line = file.readline()
    return dict
However, when I was testing it myself, my function doesn't seem to return all the values. If there are more than two words that start with a certain letter, only the first one shows up as the values in the output. What am I doing wrong?
For example, the file should return:
{'a': ['apple'], 'p': ['peach', 'pear', 'pineapple'], \
'b': ['banana', 'blueberry'], 'o': ['orange']}, ...
... but returns ...
{'a': ['apple'], 'p': ['pear'], \
'b': ['banana'], 'o': ['orange']}, ...
Try this solution, it takes into account the case where there are words starting with the same character in more than one line, and it doesn't use defaultdict. I also simplified the function a bit:
def words(file):
    dict = {}
    for line in file:
        lst = line.split()
        dict.setdefault(line[0], []).extend(lst)
    return dict
You aren't adding to the list for each additional letter. Try:
if line[0] not in dict.keys():
    dict[line[0]] = list
else:
    dict[line[0]] += list
The specific problem is that dict[line[0]] = list replaces the value for the new key. There are many ways to fix this... I'm happy to provide one, but you asked what was wrong and that's it. Welcome to Stack Overflow.
It seems like every dictionary entry should be a list. Use the append method on the list stored under the dictionary key.
Sacrificing performance (to a certain extent) for elegance:
with open(whatever) as f:
    words = f.read().split()
result = {
    first: [word for word in words if word.startswith(first)]
    for first in set(word[0] for word in words)
}
Something like this should work
def words(file):
    dct = {}
    for line in file:
        word = line.strip()
        try:
            dct[word[0]].append(word)
        except KeyError:
            dct[word[0]] = [word]
    return dct
The first time a new letter is found, there will be a KeyError; subsequent occurrences of the letter will cause the word to be appended to the existing list.
Another approach would be to prepopulate the dict with the keys you need
import string
def words(file):
    # string.ascii_lowercase replaces string.lowercase, which Python 3 removed
    dct = dict.fromkeys(string.ascii_lowercase, [])
    for line in file:
        word = line.strip()
        dct[word[0]] = dct[word[0]] + [word]
    return dct
I'll leave it as an exercise to work out why dct[word[0]] += [word] won't work
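For readers who want to check their answer to that exercise, here is a small experiment with a two-letter alphabet (the names shared and separate are invented). dict.fromkeys stores the very same list object under every key, so the difference comes down to whether that one list is mutated in place or replaced.

```python
# Every key refers to the same single list object.
shared = dict.fromkeys('ab', [])
shared['a'] += ['apple']                     # += extends that list in place
print(shared)                                # {'a': ['apple'], 'b': ['apple']}

separate = dict.fromkeys('ab', [])
separate['a'] = separate['a'] + ['apple']    # + builds a new list for 'a' only
print(separate)                              # {'a': ['apple'], 'b': []}
```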
Try this function
def words(file):
    dict = {}
    line = file.readline()
    while (line != ""):
        my_key = line[0].lower()
        dict.setdefault(my_key, []).extend(line.split() )
        line = file.readline()
    return dict