Creating an index of words - python

I'm trying to build an index of words: I read each line of a text file and check whether each word occurs in that line. If it does, the code prints the line number and carries on checking. I've got it working as far as printing each word and line number, but I'm not sure what data structure to use to store the line numbers.
Code example:
def index(filename, wordList):
    'string, list(string) ==> string & int, returns an index of words with the line number\
    each word occurs in'
    indexDict = {}
    res = []
    infile = open(filename, 'r')
    count = 0
    line = infile.readline()
    while line != '':
        count += 1
        for word in wordList:
            if word in line:
                #indexDict[word] = [count]
                print(word, count)
        line = infile.readline()
    #return indexDict
This prints the word and whatever the count is at the time (line number), but what I'm trying to do is store the numbers so that later on I can make it print out
word linenumber
word2 linenumber, linenumber
And so on. I felt a dictionary would work for this if I put each line number inside a list so each key can contain more than one value, but the closest I got was this:
{'mortal': [30], 'dying': [9], 'ghastly': [82], 'ghost': [9], 'raven': [120], 'evil': [106], 'demon': [122]}
When I wanted it to show up as:
{'mortal': [30], 'dying': [9], 'ghastly': [82], 'ghost': [9], 'raven': [44, 53, 55, 64, 78, 97, 104, 111, 118, 120], 'evil': [99, 106], 'demon': [122]}
Any ideas?

Try something like this:
import collections

def index(filename, wordList):
    indexDict = collections.defaultdict(list)
    with open(filename) as infile:
        for (i, line) in enumerate(infile.readlines()):
            for word in wordList:
                if word in line:
                    indexDict[word].append(i+1)
    return indexDict
This yields the exact same results as in your example (using Poe's Raven).
Alternatively, you might use a normal dict instead of a defaultdict and initialize it with all the words in the list, to make sure that indexDict contains an entry even for words that do not appear in the text.
Also, note the use of enumerate. This builtin function is very useful for iterating over both the index and the item at that index of some list (like the lines in the file).
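The plain-dict alternative mentioned above could look roughly like the sketch below. The function name index_all_words and the sample lines are made up for illustration, and the function takes any iterable of lines (an open file would work too) so that it can be tried without a file on disk.

```python
def index_all_words(lines, word_list):
    """Map every word in word_list to the 1-based numbers of the lines containing it."""
    # Pre-populate the dict so that words never found still map to an empty list.
    index_dict = {word: [] for word in word_list}
    for line_number, line in enumerate(lines, start=1):
        for word in word_list:
            if word in line:  # substring test, same as in the answer above
                index_dict[word].append(line_number)
    return index_dict


# Hypothetical sample input, not the actual text of the poem.
sample = ["a raven", "nothing here", "raven again"]
print(index_all_words(sample, ["raven", "demon"]))
```

Here 'demon' ends up with an empty list instead of being missing, which can make later printing code simpler.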

You are replacing the old value with this line:
indexDict[word] = [count]
Changing it to
indexDict[word] = indexDict.setdefault(word, []) + [count]
will yield the answer you want. It gets the current value of indexDict[word] and appends the new count to it; if there is no indexDict[word] yet, it creates a new empty list and appends count to that.
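A slightly shorter variant of the same idea is to call append directly on the list that setdefault returns, which avoids building a new list on every match. This is only a sketch with made-up (word, line number) pairs:

```python
index_dict = {}
# Hypothetical matches in the order they might be found while scanning a file.
for word, count in [("raven", 44), ("evil", 99), ("raven", 53)]:
    # setdefault returns the existing list, or stores and returns a new empty one.
    index_dict.setdefault(word, []).append(count)

print(index_dict)  # {'raven': [44, 53], 'evil': [99]}
```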

There is probably a more pythonic way to write this, but just for readability you could try this (a simple example):
dict = {1: [], 2: [], 3: []}
list = [1,2,2,2,3,3]
for k in dict.keys():
    for i in list:
        if i == k:
            dict[k].append(i)
In [7]: dict
Out[7]: {1: [1], 2: [2, 2, 2], 3: [3, 3]}

You need to append your next item to the list, if the list already exists.
The easiest way to have the list already be there even for the first time you find a word, is to use the collections.defaultdict class to track your word-to-lines mapping:
from collections import defaultdict

def index(filename, wordList):
    indexDict = defaultdict(list)
    with open(filename, 'r') as infile:
        # start counting at 1 so the stored values are 1-based line numbers
        for i, line in enumerate(infile, 1):
            for word in wordList:
                if word in line:
                    indexDict[word].append(i)
                    print(word, i)
    return indexDict
I've simplified your code a little using best practices; opening the file as a context manager so it'll close automatically when done, and using enumerate() to create line numbers on the fly.
You could speed this up a little further still (and make it more accurate) if you turned your lines into a set of words (set(line.split()) perhaps, but that won't remove punctuation), as then you could use set intersection tests against wordList (also a set), which could be considerably faster to find matching words.
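The set-based idea could be sketched as follows. This is an illustration under assumptions rather than a drop-in replacement: the name index_by_sets is invented, the words in word_list are assumed to be lowercase, and punctuation is handled only by stripping it from both ends of each token.

```python
import string


def index_by_sets(lines, word_list):
    """Index whole-word matches using set intersection per line."""
    wanted = set(word_list)
    index_dict = {}
    for line_number, line in enumerate(lines, start=1):
        # Lowercase each token and strip punctuation from its ends.
        tokens = {token.strip(string.punctuation).lower() for token in line.split()}
        for word in wanted & tokens:  # only words present on this line
            index_dict.setdefault(word, []).append(line_number)
    return index_dict


# Hypothetical sample lines.
sample = ['Quoth the Raven, "Nevermore."', "Ravens are not ravenous.", "The raven sat."]
result = index_by_sets(sample, ["raven", "nevermore"])
print(result)
```

Unlike the substring test, this does not count "Ravens" or "ravenous" as hits for "raven", which is the accuracy gain mentioned above.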

Related

Write a program that reads the contents of a text file and return index of words into Values

I am doing an exercise from a textbook and have been stuck for 3 days, so I finally decided to ask for help here.
The question is: write a program that reads the contents of a text file. The program should create a dictionary in which the key-value pairs are described as follows:
Key. The keys are the individual words found in the file.
Values. Each value is a list that contains the line numbers in the file where the word (the key) is found.
For example: suppose the word “robot” is found in lines 7, 18, 94, and 138. The dictionary would contain an element in which the key was the string “robot”, and the value was a list containing the numbers 7, 18, 94, and 138.
Once the dictionary is built, the program should create another text file, known as a word index, listing the contents of the dictionary. The word index file should contain an alphabetical listing of the words that are stored as keys in the dictionary, along with the line numbers where the words appear in the original file.
Figure 9-1 shows an example of an original text file (Kennedy.txt) and its index file (index.txt).
Here is the code I have tried so far. The functions are not complete, and I'm not sure what to do next:
def create_Kennedytxt():
    f = open('Kennedy.txt','w')
    f.write('We observe today not a victory\n')
    f.write('of party but a celebration\n')
    f.write('of freedom symbolizing an end\n')
    f.write('as well as a beginning\n')
    f.write('signifying renewal as well\n')
    f.write('as change\n')
    f.close()
create_Kennedytxt()
def split_words():
    f = open('Kennedy.txt','r')
    count = 0
    for x in f:
        y = x.strip()
        z = y.split(' ') #split the line into individual words
        count+=1 #line number, incremented once per line in the for loop
split_words()
Can anyone help me with the code or give me some hints? The answer shouldn't import anything; it should only use built-in methods and functions. I would really appreciate it!
You are on the right track. This is how it can be done:
def build_word_index(txt):
    out = {}
    for i, line in enumerate(txt.split("\n")):
        for word in line.strip().split(" "):
            if word not in out:
                out[word] = [i + 1]
            else:
                out[word].append(i + 1)
    return out
# The text starts right after the opening quotes; a leading blank line
# would be counted as line 1 and shift every line number by one.
print(build_word_index('''We observe today not a victory
of party but a celebration
of freedom symbolizing an end
as well as a beginning
signifying renewal as well
as change'''))
This works by first defining a dictionary
out = {}
Then we are going to loop over the input line by line (we use enumerate just so we have an index that starts from 0 and goes up by one each line):
for i, line in enumerate(txt.split("\n")):
Next we are going to loop for each word in that line
for word in line.strip().split(" "):
Finally we are going to examine two cases by checking if our dictionary does not contain the word
if word not in out:
If we haven't seen the word before, we need to create an entry in our dictionary for it. We use a list so that we can handle words that appear on multiple lines. (We add 1 to i here to offset the fact that enumerate starts at 0.)
out[word] = [i + 1]
In the case we have seen the word before we can just add the line we are currently on to the end of it
out[word].append(i + 1)
This will get us a dictionary where each word is the key and the value is a list of what lines the word appears in.
I am going to leave how to actually output the dictionary correctly to you.
This is a three-step process:
1. Read the file line by line and split each line into words.
2. Identify all unique words in each line (use set to do this).
3. For each word, check whether it exists in the dictionary.
   - If it exists, append the line number to its list (enumerate starts at 0, so you need to add 1).
   - If it does NOT exist, create a new entry for the word with a list containing the line number.
The dictionary will map each word (the key) to a list of line numbers.
To do this, you can create a program like this:
keys_in_file = {}
with open('Kennedy.txt', 'r') as f:
    for i, line in enumerate(f):
        words = line.strip().split()
        for word in set(words):
            keys_in_file.setdefault(word, []).append(i+1)
print(keys_in_file)
The output of the file you provided (Kennedy.txt) is:
{'today': [1], 'victory': [1], 'observe': [1], 'a': [1, 2, 4], 'We': [1], 'not': [1], 'celebration': [2], 'of': [2, 3], 'party': [2], 'but': [2], 'freedom': [3], 'an': [3], 'symbolizing': [3], 'end': [3], 'as': [4, 5, 6], 'well': [4, 5], 'beginning': [4], 'renewal': [5], 'signifying': [5], 'change': [6]}
If you want to ensure that all words (We, WE, we) get counted as same word, you need to convert words to lowercase.
words = line.lower().strip().split()
If you want the values to be printed in the format of index.txt, then you add the following to the code:
for k in sorted(keys_in_file):
    print(k+':', *keys_in_file[k])
The output will be as follows:
Note: I converted We to lowercase so it will show up later in the alphabetic order
a: 1 2 4
an: 3
as: 4 5 6
beginning: 4
but: 2
celebration: 2
change: 6
end: 3
freedom: 3
not: 1
observe: 1
of: 2 3
party: 2
renewal: 5
signifying: 5
symbolizing: 3
today: 1
victory: 1
we: 1
well: 4 5
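The exercise asks for the index to be written to a file rather than printed. One way to do that without any imports is sketched below; the helper names format_index and write_index are invented for illustration, and the output file name is whatever the caller passes in (for example 'index.txt').

```python
def format_index(word_index):
    """Return one 'word: n1 n2 ...' string per word, in alphabetical order."""
    lines = []
    for word in sorted(word_index):
        numbers = " ".join(str(n) for n in word_index[word])
        lines.append(word + ": " + numbers)
    return lines


def write_index(word_index, path):
    # 'path' is a hypothetical output file name such as 'index.txt'.
    with open(path, "w") as out_file:
        for line in format_index(word_index):
            out_file.write(line + "\n")


print(format_index({"we": [1], "a": [1, 2, 4]}))  # ['a: 1 2 4', 'we: 1']
```

Keeping the formatting separate from the file writing makes it easy to check the text before anything touches the disk.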
from collections import Counter
fname = input("Enter file name: ")
with open(fname, 'r') as input_file:
    count = Counter(word for line in input_file
                    for word in line.split())
print(count.most_common(20))
f= open("index.txt","w+")
s = str(count.most_common(20))
f.write(s)
f.close()

Remove duplicates from large list but remove both if it does exist?

So I have a text file like this
123
1234
123
1234
12345
123456
You can see 123 appears twice so both instances should be removed. but 12345 appears once so it stays. My text file is about 70,000 lines.
Here is what I came up with.
file = open("test.txt",'r')
lines = file.read().splitlines() #to ignore the '\n' and turn to list structure
for appId in lines:
    if(lines.count(appId) > 1): #if element count is not unique remove both elements
        lines.remove(appId) #first instance removed
        lines.remove(appId) #second instance removed
writeFile = open("duplicatesRemoved.txt",'a') #output the left over unique elements to file
for element in lines:
    writeFile.write(element + "\n")
When I run this I feel like my logic is correct, but I know for a fact that the output is supposed to be around 950 elements, and I'm still getting 23000 elements in my output, so a lot is not getting removed. Any ideas where the bug could be?
Edit: I FORGOT TO MENTION. An element can only appear twice MAX.
Use Counter from built in collections:
In [1]: from collections import Counter
In [2]: a = [123, 1234, 123, 1234, 12345, 123456]
In [3]: a = Counter(a)
In [4]: a
Out[4]: Counter({123: 2, 1234: 2, 12345: 1, 123456: 1})
In [5]: a = [k for k, v in a.items() if v == 1]
In [6]: a
Out[6]: [12345, 123456]
For your particular problem I will do it like this:
from collections import defaultdict

out = defaultdict(int)
with open('input.txt') as f:
    for line in f:
        out[line.strip()] += 1
with open('out.txt', 'w') as f:
    for k, v in out.items():
        if v == 1: #here you use logic suitable for what you want
            f.write(k + '\n')
Be careful about removing elements from a list while still iterating over that list. This changes the behavior of the list iterator, and can make it skip over elements, which may be part of your problem.
Instead, I suggest creating a filtered copy of the list using a list comprehension: rather than removing the elements that appear more than once, you keep only the elements that appear exactly once:
file = open("test.txt",'r')
lines = file.read().splitlines()
unique_lines = [line for line in lines if lines.count(line) == 1] # keep only lines that appear once
with open("duplicatesRemoved.txt", "w") as writefile:
    # splitlines() stripped the newlines, so add them back when writing
    writefile.writelines(line + "\n" for line in unique_lines)
You could also easily modify this code to use a different threshold (for example if lines.count(line) <= 2 to keep everything that appears at most twice).
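The skipping behaviour described above can be reproduced on a small list. The sketch below uses made-up sample values (not the asker's file) in which every value except "123456" appears twice, and compares removing inside the loop with filtering into a new list.

```python
# Hypothetical sample data: every value except "123456" appears twice.
lines = ["123", "123", "1234", "12345", "12345", "1234", "123456"]

# Removing while iterating: each removal shifts later items to the left,
# so the iterator steps over some of them without ever testing them.
mutated = list(lines)
for app_id in mutated:
    if mutated.count(app_id) > 1:
        mutated.remove(app_id)
        mutated.remove(app_id)

# Filtering into a new list never changes the list being iterated.
filtered = [line for line in lines if lines.count(line) == 1]

print(mutated)   # ['1234', '1234', '123456']
print(filtered)  # ['123456']
```

In the mutated list both copies of "1234" survive because each one was skipped by an earlier removal.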
You can count all of the elements and store them in a dictionary:
dic = {a:lines.count(a) for a in lines}
Then remove all the duplicated ones from the list:
for k in dic:
    if dic[k]>1:
        while k in lines:
            lines.remove(k)
NOTE: The while loop is needed because lines.remove(k) removes only the first occurrence of k, so it must be repeated until no k is left in the list.
If the for loop is too complicated, you can use the dictionary in another way to get rid of the duplicated values:
lines = [k for k, v in dic.items() if v==1]
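Calling lines.count(a) for every element rescans the whole list each time, which adds up on a file of roughly 70,000 lines. A single-pass count with a plain dict avoids that; the sketch below is one possible version, with keep_unique being a name chosen here for illustration.

```python
def keep_unique(lines):
    """Return the lines that occur exactly once, in their original order."""
    counts = {}
    for line in lines:
        # One pass over the data: bump the tally for each line.
        counts[line] = counts.get(line, 0) + 1
    return [line for line in lines if counts[line] == 1]


print(keep_unique(["123", "1234", "123", "1234", "12345", "123456"]))
# ['12345', '123456']
```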

Python - How do i build a dictionary from a text file?

For the Data Structures and Algorithms class at Tilburg University I got this question in an in-class test:
Build a dictionary from testfile.txt with only unique keys; if a key appears again, its value should be added to the running total for that product class.
The text file looked like this (it was not a .csv file):
apples,1
pears,15
oranges,777
apples,-4
oranges,222
pears,1
bananas,3
So apples will be -3 and the output would be {"apples": -3, "oranges": 999...}
In the exam I am not allowed to import any external packages besides the usual ones (pcinput, math, etc.), and I am also not allowed to use the internet.
I have no idea how to accomplish this, and it seems to be a big gap in my Python skills: this kind of question is not covered in a 'dictionaries in Python' video on YouTube (maybe it would be too hard), nor in an expert course (there it would be too simple).
Hope you guys can help!
from collections import Counter
from sys import exit
from os.path import exists, isfile
## I did not finish it, but what I wanted to achieve was to build a list of the
## strings and their corresponding integers, then use Counter to add
## them together
## by splitting each string at the comma.
filename = input("filename voor input: ")  # prompt: "filename for input"
if not isfile(filename):
    print(filename, "bestaat niet")  # "does not exist"
    exit()
keys = []
values = []
with open(filename) as f:
    xs = f.read().split()
    for i in xs:
        keys.append([i])
print(keys)
my_dict = {}
for i in range(len(xs)):
    my_dict[xs[i]] = xs.count(xs[i])
print(my_dict)
word_and_integers_dict = dict(zip(keys, values))
print(word_and_integers_dict)
values2 = my_dict.split(",")
for j in values2:
    print( value2 )
The output is this:
[['schijndel,-3'], ['amsterdam,0'], ['tokyo,5'], ['tilburg,777'], ['zaandam,5']]
{'zaandam,5': 1, 'tilburg,777': 1, 'amsterdam,0': 1, 'tokyo,5': 1, 'schijndel,-3': 1}
{}
so i got the dictionary from it, but i did not separate the values.
the error message is this:
28 values2 = my_dict.split(",") <-- here was the error
29 for j in values2:
30 print( value2 )
AttributeError: 'dict' object has no attribute 'split'
I don't understand what your code is actually doing, and I think you don't know what your variables contain, but this is an easy problem to solve in Python. Split into a list, split each item again, and count:
>>> text = "apples,1 pears,15 oranges,777 apples,-4 oranges,222 pears,1 bananas,3"
>>> parts = text.split()
>>> parts
['apples,1', 'pears,15', 'oranges,777', 'apples,-4', 'oranges,222', 'pears,1', 'bananas,3']
Then split again. Behold the list comprehension. This is an idiomatic way to transform one list into another in Python. Note that the numbers are still strings, not ints.
>>> pairs = [s.split(',') for s in parts]
>>> pairs
[['apples', '1'], ['pears', '15'], ['oranges', '777'], ['apples', '-4'], ['oranges', '222'], ['pears', '1'], ['bananas', '3']]
Now you want to iterate over the pairs and sum the counts for each fruit. This calls for a dict:
>>> result = {}
>>> for fruit, countstr in pairs:
...     if fruit not in result:
...         result[fruit] = 0
...     result[fruit] += int(countstr)
...
>>> result
{'pears': 16, 'apples': -3, 'oranges': 999, 'bananas': 3}
This pattern of adding an element if it doesn't exist comes up frequently. You should check out defaultdict in the collections module. If you use that, you don't even need the if.
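For comparison, the same summation with defaultdict might look like the sketch below. It reuses the sample data from the question; whether collections counts as an allowed import in the exam is a separate matter, so treat this as an illustration of the pattern only.

```python
from collections import defaultdict

text = "apples,1 pears,15 oranges,777 apples,-4 oranges,222 pears,1 bananas,3"

totals = defaultdict(int)  # missing keys start at int(), which is 0
for item in text.split():
    fruit, count_str = item.split(",")
    totals[fruit] += int(count_str)  # no 'if fruit not in totals' needed

print(dict(totals))
# {'apples': -3, 'pears': 16, 'oranges': 999, 'bananas': 3}
```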
Let's walk through what you need to do. First, check that the file exists and read its contents into a variable. Second, parse each line: split the line on the comma, convert the number from a string to an integer, and then add the values to a dictionary. In this case I would recommend using defaultdict from collections, but we can also do it with a standard dictionary.
from os.path import isfile
filename = input("filename voor input: ")  # prompt: "filename for input"
if not isfile(filename):
    print(filename, "bestaat niet")  # "does not exist"
    exit()
# this reads the file to a list, removing newline characters
# and skipping blank lines
with open(filename) as f:
    line_list = [x.strip() for x in f if x.strip()]
# create a dictionary
my_dict = {}
# update the value in the dictionary if it already exists,
# otherwise add it to the dictionary
for line in line_list:
    k, v_str = line.split(',')
    if k in my_dict:
        my_dict[k] += int(v_str)
    else:
        my_dict[k] = int(v_str)
# print the dictionary
table_str = '{:<30}{}'
print(table_str.format('Item','Count'))
print('='*35)
for k,v in sorted(my_dict.items()):
    print(table_str.format(k,v))

Why am i getting an empty dictionary?

I am learning python from an introductory Python textbook and I am stuck on the following problem:
You will implement function index() that takes as input the name of a text file and a list of words. For every word in the list, your function will find the lines in the text file where the word occurs and print the corresponding line numbers.
Ex:
>>> index('raven.txt', ['raven', 'mortal', 'dying', 'ghost', 'ghastly', 'evil', 'demon'])
ghost 9
dying 9
demon 122
evil 99, 106
ghastly 82
mortal 30
raven 44, 53, 55, 64, 78, 97, 104, 111, 118, 120
Here is my attempt at the problem:
def index(filename, lst):
    infile = open(filename, 'r')
    lines = infile.readlines()
    lst = []
    dic = {}
    for line in lines:
        words = line.split()
        lst.append(words)
    for i in range(len(lst)):
        for j in range(len(lst[i])):
            if lst[i][j] in lst:
                dic[lst[i][j]] = i
    return dic
When I run the function, I get back an empty dictionary. I do not understand why I am getting an empty dictionary. So what is wrong with my function? Thanks.
You are overwriting the value of lst. You use it as both a parameter to a function (in which case it is a list of strings) and as the list of words in the file (in which case it's a list of list of strings). When you do:
if lst[i][j] in lst
The comparison always returns False because lst[i][j] is a str, but lst contains only lists of strings, not strings themselves. This means that the assignment to the dic is never executed and you get an empty dict as result.
To avoid this you should use a different name for the list in which you store the words, for example:
In [4]: !echo 'a b c\nd e f' > test.txt
In [5]: def index(filename, lst):
   ...:     infile = open(filename, 'r')
   ...:     lines = infile.readlines()
   ...:     words = []
   ...:     dic = {}
   ...:     for line in lines:
   ...:         line_words = line.split()
   ...:         words.append(line_words)
   ...:     for i in range(len(words)):
   ...:         for j in range(len(words[i])):
   ...:             if words[i][j] in lst:
   ...:                 dic[words[i][j]] = i
   ...:     return dic
   ...:
In [6]: index('test.txt', ['a', 'b', 'c'])
Out[6]: {'a': 0, 'c': 0, 'b': 0}
There are also a lot of things you can change.
When you want to iterate a list you don't have to explicitly use indexes. If you need the index you can use enumerate:
for i, line_words in enumerate(words):
    for word in line_words:
        if word in lst:
            dic[word] = i
You can also iterate directly on a file (refer to Reading and Writing Files section of the python tutorial for a bit more information):
# use the with statement to make sure that the file gets closed
with open('test.txt') as infile:
    for i, line in enumerate(infile):
        print('Line {}: {}'.format(i, line))
In fact I don't see why you would first build that words list of lists. Just iterate over the file directly while building the dictionary:
def index(filename, lst):
    with open(filename, 'r') as infile:
        dic = {}
        for i, line in enumerate(infile):
            for word in line.split():
                if word in lst:
                    dic[word] = i
    return dic
Your dic values should be lists, since more than one line can contain the same word. As it stands your dic would only store the last line where a word is found:
from collections import defaultdict

def index(filename, words):
    # a frozenset makes the "in" check below faster
    words = frozenset(words)
    with open(filename) as infile:
        dic = defaultdict(list)
        for i, line in enumerate(infile):
            for word in line.split():
                if word in words:
                    dic[word].append(i)
    return dic
If you don't want to use collections.defaultdict you can replace dic = defaultdict(list) with dic = {} and then change:
dic[word].append(i)
to:
if word not in dic:
    dic[word] = [i]
else:
    dic[word].append(i)
Or, alternatively, you can use dict.setdefault:
dic.setdefault(word, []).append(i)
although this last way is a bit slower than the original code.
Note that all these solutions have the property that if a word isn't found in the file it will not appear in the result at all. However, you may want it in the result, with an empty list as its value. In that case it's simpler to initialize the dict with empty lists before starting the loop, such as in:
dic = {word : [] for word in words}
for i, line in enumerate(infile):
    for word in line.split():
        if word in words:
            dic[word].append(i)
Refer to the documentation about List Comprehensions and Dictionaries to understand the first line.
You can also iterate over words instead of the line, like this:
dic = {word : [] for word in words}
for i, line in enumerate(infile):
    for word in words:
        if word in line.split():
            dic[word].append(i)
Note however that this is going to be slower because:
- line.split() returns a list, so word in line.split() has to scan the whole list.
- You are repeating the computation of line.split() for every word.
You can try to solve these two problems by doing:
dic = {word : [] for word in words}
for i, line in enumerate(infile):
    line_words = frozenset(line.split())
    for word in words:
        if word in line_words:
            dic[word].append(i)
Note that here we are iterating once over line.split() to build the set and also over words. Depending on the sizes of the two sets this may be slower or faster than the original version (iterating over line.split()).
However, at this point it's probably faster to intersect the sets:
dic = {word : [] for word in words}
for i, line in enumerate(infile):
    line_words = frozenset(line.split())
    for word in words & line_words: # & stands for set intersection
        dic[word].append(i)
Try this:
def index(filename, lst):
    dic = {w:[] for w in lst}
    for n,line in enumerate( open(filename,'r') ):
        for word in lst:
            # split() with no argument also drops the trailing newline
            if word in line.split():
                dic[word].append(n+1)
    return dic
There are some features of the language introduced here that you should be aware of because they will make life a lot easier in the long run.
The first is a dictionary comprehension. It basically initializes a dictionary using the words in lst as keys and an empty list [] as the value for each key.
Next the enumerate command. This allows us to iterate over the items in a sequence but also gives us the index of those items. In this case, because we passed a file object to enumerate it will loop over the lines. For each iteration, n will be the 0-based index of the line and line will be the line itself. Next we iterate over the words in lst.
Notice that we don't need any indices here. Python encourages looping over objects in sequences rather than looping over indices and then accessing the objects in a sequence based on index (for example discourages doing for i in range(len(lst)): do something with lst[i]).
Finally, the in operator is a very straightforward way to test membership for many types of objects and the syntax is very intuitive. In this case, we are asking is the current word from lst in the current line.
Note that we split the line to get a list of the words in the line. If we don't do this, 'the' in 'there was a ghost' would return True, as 'the' is a substring of one of the words.
On the other hand 'the' in ['there', 'was', 'a', 'ghost'] would return False. If the conditional returns True, we append the line number to the list associated with the key in our dictionary.
That might be a lot to chew on, but these concepts make problems like this more straightforward.
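The difference between the substring test and the word test, plus one pitfall with how the line is split, can be checked in a few lines. The sample line is made up, with a trailing newline as it would have when read from a file:

```python
line = "there was a ghost\n"  # a line as read from a file, newline included

substring_hit = "the" in line                  # True: matches inside "there"
word_hit = "the" in line.split()               # False: no token equals "the"
last_word_space = "ghost" in line.split(" ")   # False: last token is "ghost\n"
last_word_any = "ghost" in line.split()        # True: split() drops the newline

print(substring_hit, word_hit, last_word_space, last_word_any)
```

Splitting on a single space leaves the newline attached to the last word of each line, so calling split() with no argument is the safer choice when lines come straight from a file.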
First, the function parameter holding the words is named lst, and the list where you put the words from the file is also named lst, so you lose the words passed to your function: on line 4 you rebind lst to an empty list.
Second, you iterate over each line in the file (the first for) and append that line's list of words to lst, so afterwards lst holds one list of words per line. The for i loop then iterates over those per-line lists and the for j loop over the words inside each one.
In short, in that if you are asking "is this single word in the list of per-line word lists?", which it never is, so the dict will never be filled.
With a separate, flat list words holding the words read from the file, the check would look more like:
for i in range(len(words)):
    if words[i] in lst:
        dic[words[i]] = dic[words[i]] + i # To count repetitions
You need to rethink the problem; even my snippet will fail with an error because the word will not exist in the dict yet, but you get the point. Good luck!

Not working: indexing the words in a file in a dict by first letter

I have to write a function based on a open file that has one lowercase word per line. I have to return a dictionary with keys in single lowercase letters and each value is a list of the words from the file that starts with that letter. (The keys in the dictionary are from only the letters of the words that appear in the file.)
This is my code:
def words(file):
    line = file.readline()
    dict = {}
    list = []
    while (line != ""):
        list = line[:].split()
        if line[0] not in dict.keys():
            dict[line[0]] = list
        line = file.readline()
    return dict
However, when I was testing it myself, my function doesn't seem to return all the values. If there are more than two words that start with a certain letter, only the first one shows up as the values in the output. What am I doing wrong?
For example, the file should return:
{'a': ['apple'], 'p': ['peach', 'pear', 'pineapple'], \
'b': ['banana', 'blueberry'], 'o': ['orange']}, ...
... but returns ...
{'a': ['apple'], 'p': ['pear'], \
'b': ['banana'], 'o': ['orange']}, ...
Try this solution, it takes into account the case where there are words starting with the same character in more than one line, and it doesn't use defaultdict. I also simplified the function a bit:
def words(file):
    dict = {}
    for line in file:
        lst = line.split()
        dict.setdefault(line[0], []).extend(lst)
    return dict
You aren't adding to the list for each additional word that starts with the same letter. Try:
if line[0] not in dict.keys():
    dict[line[0]] = list
else:
    dict[line[0]] += list
The specific problem is that dict[line[0]] = list only ever stores the value for a new key; words whose first letter is already a key are dropped. There are many ways to fix this... I'm happy to provide one, but you asked what was wrong and that's it. Welcome to Stack Overflow.
It seems like every dictionary entry should be a list. Use the append method on the list stored under each key.
Sacrificing performance (to a certain extent) for elegance:
with open(whatever) as f:
    words = f.read().split()
result = {
    first: [word for word in words if word.startswith(first)]
    for first in set(word[0] for word in words)
}
Something like this should work:
def words(file):
    dct = {}
    for line in file:
        word = line.strip()
        try:
            dct[word[0]].append(word)
        except KeyError:
            dct[word[0]] = [word]
    return dct
The first time a new letter is found there will be a KeyError; subsequent occurrences of the letter cause the word to be appended to the existing list.
Another approach would be to prepopulate the dict with the keys you need:
import string
def words(file):
    # string.ascii_lowercase works in Python 2 and 3 (string.lowercase is Python 2 only)
    dct = dict.fromkeys(string.ascii_lowercase, [])
    for line in file:
        word = line.strip()
        dct[word[0]] = dct[word[0]] + [word]
    return dct
I'll leave it as an exercise to work out why dct[word[0]] += [word] won't work
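For readers who want to check their answer to that exercise, the short sketch below shows the effect on a two-key dict. The variable names are made up; the behaviour follows from dict.fromkeys storing the same list object under every key.

```python
shared = dict.fromkeys("ab", [])
same_object = shared["a"] is shared["b"]   # True: one list shared by all keys
shared["a"] += ["apple"]                   # extends that single list in place

separate = dict.fromkeys("ab", [])
separate["a"] = separate["a"] + ["apple"]  # builds a new list just for 'a'

print(shared)    # {'a': ['apple'], 'b': ['apple']}
print(separate)  # {'a': ['apple'], 'b': []}
```

With +=, every key appears to receive every word, because all keys still point at the one list that was passed to fromkeys.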
Try this function:
def words(file):
    dict = {}
    line = file.readline()
    while (line != ""):
        my_key = line[0].lower()
        dict.setdefault(my_key, []).extend(line.split() )
        line = file.readline()
    return dict
