Indexing and search in a text file

I have a text file that contains the contents of a book. I want to build an index from this file that lets the user search it.
A search consists of entering a word. The program then returns the following:
Every chapter which includes that word.
The line number of each line which contains the word.
The entire line the word is on.
I tried the following code:
infile = open(file)
Dict = {}
word = input("Enter a word to search: ")
linenum = 0
line = infile.readline()
for line in infile
    linenum += 1
    for word in wordList:
        if word in line:
            Dict[word] = Dict.setdefault(word, []) + [linenum]
            print(count, word)
    line = infile.readline()
return Dict
This does not work, and it seems too awkward a base for the other modules, which would require:
An "or" operator to search for one word or another
An "and" operator to search for one word and another in the same chapter
Any suggestions would be great.

import string
def classify_lines_on_chapter(book_contents):
    # For every line, record the chapter heading it falls under
    lines_vs_chapter = []
    current_chapter = None  # no heading seen yet
    for line in book_contents:
        if line.isupper():  # an all-caps line is treated as a chapter heading
            current_chapter = line.strip()
        lines_vs_chapter.append(current_chapter)
    return lines_vs_chapter
def classify_words_on_lines(book_contents):
    # Map each word to the indices of the lines it appears on
    words_vs_lines = {}
    for i, line in enumerate(book_contents):
        for word in set([word.strip(string.punctuation) for word in line.split()]):
            if word:
                words_vs_lines.setdefault(word, []).append(i)
    return words_vs_lines
def main():
    skip_lines = 93  # number of leading lines to skip
    with open('book.txt') as book:
        book_contents = book.readlines()[skip_lines:]
    lines_vs_chapter = classify_lines_on_chapter(book_contents)
    words_vs_lines = classify_words_on_lines(book_contents)
    while True:
        word = input("Enter word to search - ")
        # Enter a blank input to exit
        if not word:
            break
        line_numbers = words_vs_lines.get(word, None)
        if not line_numbers:
            print("Word not found!!\n")
            continue
        for line_number in line_numbers:
            line = book_contents[line_number]
            chapter = lines_vs_chapter[line_number]
            print("Line " + str(line_number + 1 + skip_lines))
            print("Chapter '" + str(chapter) + "'")
            print(line)
if __name__ == '__main__':
    main()
Try it on this input file. Rename it as book.txt before running it.
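The question also asks for "or" and "and" searches. One possible way to get there, sketched below, is to map each word to the set of chapters it occurs in and then combine those sets. The names build_chapter_index, search_or and search_and are hypothetical, and the sketch keeps the answer's assumption that a chapter heading is a line written entirely in upper case. Unlike the answer, it lower-cases words so that searches are case-insensitive.

```python
import string

def build_chapter_index(lines):
    """Map each word to the set of chapter headings it appears under."""
    index = {}
    current_chapter = None  # lines before the first heading have no chapter
    for line in lines:
        if line.isupper():  # assumption: an all-caps line is a chapter heading
            current_chapter = line.strip()
            continue
        for token in line.split():
            word = token.strip(string.punctuation).lower()
            if word:
                index.setdefault(word, set()).add(current_chapter)
    return index

def search_or(index, first, second):
    """Chapters that contain either word (set union)."""
    return index.get(first.lower(), set()) | index.get(second.lower(), set())

def search_and(index, first, second):
    """Chapters that contain both words (set intersection)."""
    return index.get(first.lower(), set()) & index.get(second.lower(), set())

# Made-up sample text standing in for book.txt
sample = [
    "CHAPTER ONE\n",
    "The whale surfaced near the ship.\n",
    "CHAPTER TWO\n",
    "The ship sailed on without a whale in sight.\n",
    "CHAPTER THREE\n",
    "Only the captain remained.\n",
]
index = build_chapter_index(sample)
print(sorted(search_and(index, "whale", "ship")))
print(sorted(search_or(index, "whale", "captain")))
```

Sets keep the combination step to a single operator, and further operators (for example "not") could be added as set differences in the same style.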

Related

Finding longest word in a txt file

I am trying to create a function in which a filename is taken as a parameter and the function returns the longest word in the file with the line number attached to the front of it.
This is what I have so far but it is not producing the expected output I need.
def word_finder(file_name):
    with open(file_name) as f:
        lines = f.readlines()
        line_num = 0
        longest_word = None
        for line in lines:
            line = line.strip()
            if len(line) == 0:
                return None
            else:
                line_num += 1
                tokens = line.split()
                for token in tokens:
                    if longest_word is None or len(token) > len(longest_word):
                        longest_word = token
            return (str(line_num) + ": " + str(longest_word))

I think this is the shortest way to find the word; correct me if not.
def wordFinder(filename):
    with open(filename, "r") as f:
        words = f.read().split() # split() returns a list with the words in the file
        longestWord = max(words, key = len) # key = len compares the words by length
        print(longestWord) # prints the longest word

Issue
Exactly what ewong diagnosed:
The last return statement is indented too deeply.
Currently:
It returns the longest word in the first line only.
Solution
The return should be aligned with the loop's column, so that it executes after the loop.
def word_finder(file_name):
    with open(file_name) as f:
        lines = f.readlines()
        line_num = 0
        longest_word = None
        for line in lines:
            line = line.strip()
            if len(line) == 0:
                return None
            else:
                line_num += 1
                tokens = line.split()
                for token in tokens:
                    if longest_word is None or len(token) > len(longest_word):
                        longest_word = token
            # return here would exit the loop too early after 1st line
        # loop ended
        return (str(line_num) + ": " + str(longest_word))
Then:
the longest word in the file with the line number attached to the front of it.
Improved
def word_finder(file_name):
    with open(file_name) as f:
        line_word_longest = None # global max: tuple of (line-index, longest_word)
        for i, line in enumerate(f): # line-index and line-content
            line = line.strip()
            if len(line) > 0: # split words only if line present
                max_token = max(line.split(), key = len) # max of tokens by length
                if line_word_longest is None or len(max_token) > len(line_word_longest[1]):
                    line_word_longest = (i, max_token)
        # loop ended
        if line_word_longest is None:
            return "No longest word found!"
        return f"{line_word_longest[0]}: '{line_word_longest[1]}' ({len(line_word_longest[1])} chars)"
See also:
Basic python file-io variables with enumerate
List Comprehensions in Python to compute minimum and maximum values of a list
Some SO research for similar questions:
inspiration from all languages: longest word in file
only python: [python] longest word in file
non python: -[python] longest word in file

how to find the longest_word by using searching loop

Write a function longest_word that asks the user for words and returns the longest word entered by the user. It should stop when the user hits return without typing a word. If multiple words have the same maximum length, return the first word entered by the user. If the user quits before entering any words, return “No words were entered”. This function should use a searching loop.
(Hint: remember that the len function returns the length of a string.)
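A minimal sketch of one way to approach this exercise, assuming "searching loop" means a loop that keeps running until a stop condition (here, a blank entry) is met. The read_word parameter is an addition for illustration only; it defaults to input and lets the function be tried without typing at the keyboard.

```python
def longest_word(read_word=input):
    """Ask for words until a blank entry; return the first longest word."""
    longest = ""
    while True:  # searching loop: runs until the stop condition is met
        word = read_word("Enter a word (blank to stop): ").strip()
        if not word:
            break
        # Strictly greater, so an earlier word wins a tie on length.
        if len(word) > len(longest):
            longest = word
    return longest if longest else "No words were entered"

# Demonstration with canned answers instead of the keyboard.
answers = iter(["pear", "banana", "cherry", ""])
print(longest_word(lambda prompt: next(answers)))  # banana
```

Calling longest_word() with no argument reads from the keyboard as the exercise describes.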

Takes a list of files and a text file, and checks whether the words in that text file appear in the listed files.
Code:
def list_search_dictionary(fname, listfname):
    for search in listfname:
        with open(fname,'r') as f1:
            with open(search,'r') as f2:
                #splits word by word
                All = f2.read().split()
                formater = lambda x:x.strip().lower()
                #formats each element
                All = list(map(formater,All))
                for word in f1:
                    word = formater(word)
                    if word in All:
                        print(f'Found: {word}')
Using your functions:
Change your function to:
def my_search_dictionary(search_dic, fname):
    # read the file
    infile = open(fname,"r")
    # look for each line in the file
    for line in infile:
        # remove the whitespace at both ends of the line
        line_strip = line.strip()
        # lowercase
        line_lower = line_strip.lower()
        # i love dogs
        # more than hotdogs
        # split the line on " " into a list of words
        line_split = line_lower.split(" ")
        # ["i", "love", "dogs"]
        # ["more", "than", "hotdogs"]
        # look for each search term in the line_split
        for word in line_split:
            # compare every word in the line with every search term
            for key in search_dic:
                if word == key:
                    search_dic[key] += [fname]
    infile.close()
    # return the updated dictionary so the caller can reassign it
    return search_dic
The third function would be:
def list_search_dictionary(fname, listfname):
    search_dic = key_emptylist(fname)
    for file in listfname:
        search_dic = my_search_dictionary(search_dic,file)
    return search_dic
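key_emptylist is the asker's own function and is not shown here. As a purely hypothetical stand-in, the sketch below assumes it reads one search term per line from fname and maps each term to an empty list, which is the shape my_search_dictionary expects. The real function may well differ.

```python
import os
import tempfile

def key_emptylist(fname):
    """Hypothetical stand-in: one search term per line -> {term: []}."""
    with open(fname, "r") as infile:
        # Skip blank lines; lower-case terms to match my_search_dictionary.
        return {line.strip().lower(): [] for line in infile if line.strip()}

# Demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "terms.txt")
    with open(path, "w") as handle:
        handle.write("Dogs\n\nhotdogs\n")
    print(key_emptylist(path))  # {'dogs': [], 'hotdogs': []}
```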

Deleting a specific word from a text file in python

I used this code to delete a word from a text file.
f = open('./test.txt','r')
a = ['word1','word2','word3']
lst = []
for line in f:
    for word in a:
        if word in line:
            line = line.replace(word,'')
    lst.append(line)
f.close()
f = open('./test.txt','w')
for line in lst:
    f.write(line)
f.close()
But for some reason, if other words share characters with the target word, all of those characters get deleted. For example, in my code:
def cancel():
    global refID
    f1=open("refID.txt","r")
    line=f1.readline()
    flag = 0
    while flag==0:
        refID=input("Enter the reference ID or type 'q' to quit: ")
        for i in line.split(','):
            if refID == i:
                flag=1
        if flag ==1:
            print("reference ID found")
            cancelsub()
        elif (len(refID))<1:
            print("Reference ID not found, please re-enter your reference ID\n")
            cancel()
        elif refID=="q":
            flag=1
        else:
            print("reference ID not found\n")
    menu()
def cancelsub():
    global refIDarr, index
    refIDarr=[]
    index=0
    f = open('flightbooking.csv')
    csv_f = csv.reader(f)
    for row in csv_f:
        refIDarr.append(row[1])
    for i in range (len(refIDarr)):
        if refID==refIDarr[i]:
            index=i
    print(index)
    while True:
        proceed=input("You are about to cancel your flight booking, are you sure you would like to proceed? y/n?: ")
        while proceed>"y" or proceed<"n" or (proceed>"n" and proceed<"y") :
            proceed=input("Invalid entry. \nPlease enter y or n: ")
        if proceed=="y":
            Continue()
            break
        elif proceed=="n":
            main_menu
            break
            exit
        break
def Continue():
    lines = list()
    with open('flightbooking.csv', 'r') as readFile:
        reader = csv.reader(readFile)
        for row in reader:
            lines.append(row)
            for field in row:
                if field ==refID:
                    lines.remove(row)
                    break
    with open('flightbooking.csv', 'w') as writeFile:
        writer = csv.writer(writeFile)
        writer.writerows(lines)
    f = open('refID.txt','r')
    a=refIDarr[index]
    print(a)
    lst = []
    for line in f:
        for word in a:
            if word in line:
                line = line.replace(word,'')
        lst.append(line)
        print(lst)
    f.close()
    f = open('refID.txt','w')
    for line in lst:
        f.write(line)
    f.close()
    print("Booking successfully cancelled")
    menu()
When the code runs, the refID variable holds one word, and the code should remove just that word. Instead it takes that word, e.g. 'AB123', finds every other word that contains an 'A', a 'B', or any of the digits, and removes those characters too. How do I make it delete only the word?
Text file before running code:
AD123,AB123
Expected Output in the text file:
AD123,
Output in text file:
D,
Edit: I have added the entire code, and maybe you can help now after seeing that the array is being appended to and then being used to delete from a text file.

Here's my opinion.
refIDarr = ["AB123"]
a = refIDarr[0] => a = "AB123"
Strings in Python are iterable, so when you do for word in a, you get 5 iterations where each word is actually a single character.
Something like the following is being executed.
if "A" in line:
    line = line.replace("A","")
if "B" in line:
    line = line.replace("B","")
if "1" in line:
    line = line.replace("1","")
if "2" in line:
    line = line.replace("2","")
if "3" in line:
    line = line.replace("3","")
The correct way to do this is to loop over refIDarr:
for word in refIDarr:
    line = line.replace(word,'')
NOTE: You don't need the if statement, since replace returns the line unchanged when the word is not in it.
"abc".replace("bananan", "") => "abc"
Here's a working example:
refIDarr = ["hello", "world", "lol"]
with open('mytext.txt', "r") as f:
    data = f.readlines()
for word in refIDarr:
    data = [line.replace(word, "") for line in data]
with open("mytext.txt", "w") as newf:
    newf.writelines(data)

The problem is here:
a=refIDarr[index]
If refIDarr is a list of words, accessing a specific index makes a a single word. Later, when you iterate over a (for word in a:), word becomes a letter rather than a word as you expect, which ends up replacing the characters of the word instead of the word itself in your file.
To avoid that, remove a=refIDarr[index] and change your loop to be:
for line in f:
    for word in refIDarr:
        if word in line:
            line = line.replace(word,'')
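One caveat with both answers: str.replace removes every occurrence of the substring, so 'AB123' would also be cut out of a longer ID such as 'AB1234'. A minimal sketch of an alternative, assuming the file holds comma-separated IDs on a line as in the sample above; remove_id is a hypothetical helper, not part of the original code. It compares whole entries rather than substrings.

```python
def remove_id(line, ref_id):
    """Drop entries equal to ref_id from a comma-separated line."""
    kept = [entry for entry in line.strip().split(",") if entry != ref_id]
    return ",".join(kept)

print(remove_id("AD123,AB123\n", "AB123"))   # AD123
print(remove_id("AB1234,AB123\n", "AB123"))  # AB1234
```

Note that this drops the separator together with the entry, so the result is 'AD123' rather than the 'AD123,' shown in the expected output of the question.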

Parsing txt file for word occurrences python

I'm fairly new to python and found a personal project for myself. I am trying to parse through a large text file to find word occurrences, but I cannot seem to get the code to work. It starts like this:
file = 'randomfile.txt'
with open(file) as f:
    word = input("Enter a word: ")
    line = f.readline()
    num_line = 1
    while line:
        if word in line:
            print("line {}: {}".format(num_line, line.strip()))
    print("Here are ___ I found with the word" + word)
    num_line += 1
f.close()
This code runs but gives no output for search words, and I can't see why, unless Python is not reading the text file at that path, or the fourth line of code is not being read properly. How can I fix this?

This will work:
with open("randomfile.txt",'r') as f:
    word = input("Enter a word: ")
    num_line = 1
    for line in f:  # step through every line, not just the first
        for words in line.split():
            if word in words:
                print("line {}: {}".format(num_line, line.strip()))
                break  # report each matching line once
        num_line += 1
print("Here are ___ I found with the word" + word)

The problem with your code is that it only reads the first line and never loops through the rest of the file.
with open(file) as f:
    word = input("Enter a word: ")
    line = f.readline() # this is the only place you read a line
    num_line = 1
    while line:
        if word in line:
            print("line {}: {}".format(num_line, line.strip()))
    print("Here are ___ I found with the word" + word)
    num_line += 1
You should read the next line in the while loop, like this:
with open(file) as f:
    word = input("Enter a word: ")
    line = f.readline()
    num_line = 1
    while line:
        if word in line:
            print("line {}: {}".format(num_line, line.strip()))
        line = f.readline() # <-- read the next line
    print("Here are ___ I found with the word" + word)
    num_line += 1
Now you are reading through the file. The next problem is that you only increment the line number (num_line += 1) once the while loop is over. You need to move this into the loop so it tracks how many lines have been processed.
with open(file) as f:
    word = input("Enter a word: ")
    line = f.readline()
    num_line = 1
    while line:
        if word in line:
            print("line {}: {}".format(num_line, line.strip()))
        line = f.readline() # <-- read the next line
        num_line +=1 # <-- increase the counter in the loop
    print("Here are ___ I found with the word" + word)
You don't need f.close(); the with statement closes the file automatically. You can also loop directly over the file object f to read each line, like this:
file = 'randomfile.txt'
with open(file) as f:
    word = input("Enter a word: ")
    num_line = 1
    for line in f: # step through each line of the file
        if word in line:
            print("line {}: {}".format(num_line, line.strip()))
        num_line +=1 # <-- increase the counter in the loop
print("Here are ___ I found with the word" + word)
# f.close() -- not needed
I will leave it for you to fix the print statement.
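For readers who want a hint on that last step, one possible approach is to collect the matches first and report the count afterwards. This is only a sketch: find_lines is a hypothetical helper, and the sample list stands in for the open file (a file object can be passed in its place).

```python
def find_lines(lines, word):
    """Return (line_number, text) pairs for lines containing word."""
    matches = []
    for num_line, line in enumerate(lines, start=1):
        if word in line:  # substring test, as in the answer above
            matches.append((num_line, line.strip()))
    return matches

# Made-up sample lines standing in for randomfile.txt
sample = ["the cat sat\n", "a dog barked\n", "cats everywhere\n"]
found = find_lines(sample, "cat")
for num_line, text in found:
    print("line {}: {}".format(num_line, text))
print("Found {} line(s) containing '{}'".format(len(found), "cat"))
```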

You can do something like this:
word = input("Enter a word: ")
with open(file) as f:
    for idx, line in enumerate(f, start=1):  # start=1 so line numbers are 1-based
        # lower() makes the match case-insensitive and split() compares whole words,
        # so word = 'aaa' does not match a line containing 'AAAaaa'
        if word.lower() in line.lower().split():
            print("line {}: {}".format(idx, line.strip()))
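One limitation of split() is that punctuation stays attached to the words, so 'cat.' or 'cat,' would not equal 'cat'. A sketch of an alternative using the standard re module with word boundaries follows; find_whole_word is a hypothetical name, and the sample lines are made up.

```python
import re

def find_whole_word(lines, word):
    """Return (line_number, text) pairs where word occurs as a whole word."""
    # re.escape guards against regex metacharacters in the search term.
    pattern = re.compile(r"\b{}\b".format(re.escape(word)), re.IGNORECASE)
    return [
        (number, line.strip())
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]

sample = ["The cat sat.\n", "Cats everywhere\n", "A cat, a dog\n"]
for number, text in find_whole_word(sample, "cat"):
    print("line {}: {}".format(number, text))
```

The \b anchors only behave as expected for search terms that begin and end with a letter, digit or underscore.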

Read special characters from .txt file in python

The goal of this code is to find the frequency of words used in a book.
I am trying to read in the text of a book, but the following line keeps throwing my code off:
precious protégés. No, gentlemen; he'll always show 'em a clean pair
Specifically, the é character.
I have looked at the following documentation, but I don't quite understand it: https://docs.python.org/3.4/howto/unicode.html
Here's my code:
import string
# Create word dictionary from the comprehensive word list
word_dict = {}
def create_word_dict ():
    # open words.txt and populate dictionary
    word_file = open ("./words.txt", "r")
    for line in word_file:
        line = line.strip()
        word_dict[line] = 1
# Removes punctuation marks from a string
def parseString (st):
    st = st.encode("ascii", "replace")
    new_line = ""
    st = st.strip()
    for ch in st:
        ch = str(ch)
        if (n for n in (1,2,3,4,5,6,7,8,9,0)) in ch or ' ' in ch or ch.isspace() or ch == u'\xe9':
            print (ch)
            new_line += ch
        else:
            new_line += ""
    # now remove all instances of 's or ' at end of line
    new_line = new_line.strip()
    print (new_line)
    if (new_line[-1] == "'"):
        new_line = new_line[:-1]
    new_line.replace("'s", "")
    # Conversion from ASCII codes back to useable text
    message = new_line
    decodedMessage = ""
    for item in message.split():
        decodedMessage += chr(int(item))
    print (decodedMessage)
    return new_line
# Returns a dictionary of words and their frequencies
def getWordFreq (file):
    # Open file for reading the book.txt
    book = open (file, "r")
    # create an empty set for all Capitalized words
    cap_words = set()
    # create a dictionary for words
    book_dict = {}
    total_words = 0
    # remove all punctuation marks other than '[not s]
    for line in book:
        line = line.strip()
        if (len(line) > 0):
            line = parseString (line)
            word_list = line.split()
            # add words to the book dictionary
            for word in word_list:
                total_words += 1
                if (word in book_dict):
                    book_dict[word] = book_dict[word] + 1
                else:
                    book_dict[word] = 1
    print (book_dict)
    # close the file
    book.close()
def main():
    wordFreq1 = getWordFreq ("./Tale.txt")
    print (wordFreq1)
main()
The error that I received is as follows:
Traceback (most recent call last):
  File "Books.py", line 80, in <module>
    main()
  File "Books.py", line 77, in main
    wordFreq1 = getWordFreq ("./Tale.txt")
  File "Books.py", line 60, in getWordFreq
    line = parseString (line)
  File "Books.py", line 36, in parseString
    decodedMessage += chr(int(item))
OverflowError: Python int too large to convert to C long

When you open a text file in Python without specifying an encoding, the default is platform-dependent (on Windows it is often an ANSI code page), so your é character may not be decoded correctly. Try
word_file = open ("./words.txt", "r", encoding='utf-8')

The best way I could think of is to read each character as an ASCII value into an array, and then take the character for that value. For example, 97 is the ASCII code for "a", and chr(97) returns "a". Check out some online ASCII tables, which also list values for special characters.

Try:
def parseString(st):
    st = st.encode("ascii", "replace")
    # rest of code here
The new error you are getting is because you are calling isalpha on an int (i.e. a number).
Try this:
for ch in st:
    ch = str(ch)
    # any() is needed here: a bare generator expression is always truthy
    if any(str(n) in ch for n in (1,2,3,4,5,6,7,8,9,0)) or ' ' in ch or ch.isspace() or ch == u'\xe9':
        print (ch)
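If the goal is only word frequencies, the round trip through ASCII may not be needed at all. The sketch below is one possible approach, not the asker's method: it strips punctuation from the ends of each word and leaves accented letters alone. word_frequencies is a hypothetical name, and reading the real book this way assumes the file is saved as UTF-8.

```python
import string
from collections import Counter

def word_frequencies(lines):
    """Count words, keeping accented letters such as é intact."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            # Strip punctuation from both ends only, so "he'll" survives.
            word = token.strip(string.punctuation).lower()
            if word:
                counts[word] += 1
    return counts

# The problem line from the question, used as sample input.
sample = ["precious protégés. No, gentlemen; he'll always show 'em a clean pair\n"]
counts = word_frequencies(sample)
print(counts["protégés"])  # 1

# With the real book (assumption: the file is UTF-8 encoded):
# with open("./Tale.txt", "r", encoding="utf-8") as book:
#     counts = word_frequencies(book)
```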
