Python XML search for string - python

I wrote my own version of string.find() in Python for a computer science class I'm in.
Basically, this program opens a text file, asks the user for the text they want to search for in the file, and outputs the line number on which the string resides, or outputs 'not found' if the string doesn't exist in the file.
However, it takes about 34 seconds to get through 250,000 lines of XML.
Where is the bottleneck in my code? I wrote the same program in C# and C++ as well, and those run in about 0.3 seconds for 4.5 million lines. I also performed the same search using Python's built-in string.find(), and that takes around 4 seconds for 250,000 lines of XML. So, I'm trying to understand why my version is so slow.
https://github.com/zach323/Python/blob/master/XML_Finder.py
fhand = open('C:\\Users\\User\\filename')
import time
str = input('Enter string you would like to locate: ') #string to be located in file
start = time.time()
delta_time = 0
def find(str):
    time.sleep(0.01)
    found_str ='' #initialize placeholder for found string
    next_index = 0 #index for comparison checking
    line_count = 1
    for line in fhand: #each line in file
        line_count = line_count +1
        for letter in line: #each letter in line
            if letter == str[next_index]: #compare current letter index to beginning index of string you want to find
                found_str += letter #if a match, concatenate to string placeholder
                #print(found_str) #print for visualization of inline search per iteration
                next_index = next_index + 1
                if found_str == str: #if complete match is found, break out of loop.
                    print('Result is: ', found_str, ' on line %s '%(line_count))
                    print (line)
                    return found_str #return string to function caller
                    break
            else:
                #if a match was found but the next_index match was False, reset the indexes and try again.
                next_index=0 # reset index back to zero
                found_str = '' #reset string back to empty
    if found_str == str:
        print(line)
if str != "":
    result = find(str)
    delta_time = time.time() - start
    print(result)
    print('Seconds elapsed: ', delta_time)
else:
    print('sorry, empty string')

Try this:
with open(filename) as f:
    for row in f:
        if string in row:
            print(row)
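If you also need the line number, as in the question, the same idea extends with enumerate. This is only a sketch: find_line is a name made up here, and it takes any iterable of lines so it can be tried without a real file.

```python
import io

def find_line(lines, needle):
    """Return (line_number, line) for the first line containing needle, else None."""
    for line_number, line in enumerate(lines, start=1):
        if needle in line:  # built-in substring test, no Python-level character loop
            return line_number, line.rstrip("\n")
    return None

# io.StringIO stands in for an open file here
sample = io.StringIO("<a>\n<b>needle</b>\n</a>\n")
print(find_line(sample, "needle"))
```

An open file object can be passed in place of sample, since iterating over a file yields its lines.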

The following code runs on a text file of size comparable to the size of your file. Your code doesn't run too slowly on my computer.
fhand = open('test3.txt')
import time
string = input('Enter string you would like to locate: ') #string to be located in file
start = time.time()
delta_time = 0
def find(string):
    next_index_to_match = 0
    sl = len(string)
    ct = 0
    for line in fhand: #each line in file
        ct += 1
        for letter in line: #each letter in line
            if letter == string[next_index_to_match]: #compare current letter index to beginning index of string you want to find
                # print(line)
                next_index_to_match += 1
                if sl == next_index_to_match: #if complete match is found, break out of loop.
                    print('Result is: ', string, ' on line %s '%(ct))
                    print (line)
                    return True
            else:
                #if a match was found but the next_index match was False, reset the indexes and try again.
                next_index_to_match=0 # reset index back to zero
    return False
if string != "":
    find(string)
    delta_time = time.time() - start
    print('Seconds elapsed: ', delta_time)
else:
    print('sorry, empty string')
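One thing to watch in both versions above: when a partial match fails, the index is reset to 0 but the current letter is never re-checked against the start of the string, so some matches are missed (for example, searching for 'ab' in 'aab'). The match state also carries over from one line to the next. Below is a minimal sketch of a character-by-character search that avoids both problems. The names naive_find and find_in_lines are made up for this example.

```python
def naive_find(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    last_start = len(text) - len(pattern)
    for start in range(last_start + 1):
        matched = 0
        # compare pattern against text beginning at this start position
        while matched < len(pattern) and text[start + matched] == pattern[matched]:
            matched += 1
        if matched == len(pattern):
            return start
    return -1

def find_in_lines(lines, pattern):
    """Return the 1-based number of the first line containing pattern, or None."""
    for line_number, line in enumerate(lines, start=1):
        # each line is searched on its own, so no state leaks between lines
        if naive_find(line, pattern) != -1:
            return line_number
    return None

print(naive_find("aab", "ab"))
print(find_in_lines(["xx\n", "aab\n"], "ab"))
```

Expect this to remain much slower than line.find(pattern) or pattern in line, because every comparison here runs as interpreted Python.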

Related

Can I find a line in a text file, if I know its number in python?

word = "some string"
file1 = open("songs.txt", "r")
flag = 0
index = 0
for line in file1:
    index += 1
    if word in line:
        flag = 1
        break
if flag == 0:
    print(word + " not found")
else:
    #I would like to print not only the line that has the string, but also the previous and next lines
    print(?)
    print(line)
    print(?)
file1.close()
Use contents = file1.readlines() which converts the file into a list.
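Then, loop through contents and, if word is found at index i, you can print contents[i-1], contents[i], and contents[i+1]. Make sure to handle the edges: if word is in the first line, contents[i-1] silently wraps around to the last line instead of raising an error, and if word is in the last line, contents[i+1] raises an IndexError.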
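A rough sketch of that approach, with the edge cases handled by checking the index bounds. find_with_context is a hypothetical helper name, and it takes the list that readlines() would return.

```python
def find_with_context(contents, word):
    """Return (previous, matching, next) lines for the first match, or None.

    previous or next is None when the match is on the first or last line.
    """
    for i, line in enumerate(contents):
        if word in line:
            previous_line = contents[i - 1] if i > 0 else None
            next_line = contents[i + 1] if i + 1 < len(contents) else None
            return previous_line, line, next_line
    return None

contents = ["first\n", "some string here\n", "last\n"]  # stand-in for file1.readlines()
print(find_with_context(contents, "some string"))
```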
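word = "some string"
flag = 0
index = 0
previousline = ''
with open("songs.txt", "r") as file1:
    for line in file1:
        index += 1
        if flag == 0 and word in line:
            # remember the matching line and its number
            finalindex = index
            finalline = line
            flag = 1
        elif flag == 1:
            # this is the line right after the match: print all three and stop
            print(previousline + finalline + line)
            print(finalindex - 1, finalindex, finalindex + 1)
            flag = 2
            break
        else:
            previousline = line
if flag == 0:
    print(word + " not found")
elif flag == 1:
    # the match was on the last line, so there is no next line
    print(previousline + finalline)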
You basically already had the main ingredients:
you have line (the line you are currently evaluating)
you have the index (index)
What remains is to store the previous and next lines in variables and then print the results.
I have not tested it, but the code should be something like the above.
It splits into three cases: you have just found the word, you found it and flagged it on the previous iteration, or you have not flagged it yet.
I believe the else-if shouldn't fire unless flag == 1.

How do I print words in a specific alphabetical range coming from lines in a file?

I have to write code that first reads in the name of an input file, followed by two strings representing the lower and upper bounds of a search range. The file should be read using the file.readlines() method. The input file contains a list of alphabetical, ten-letter strings, each on a separate line. The program should output all strings from the list that are within that range (inclusive of the bounds).
The text file (input1.txt) contains:
aspiration
classified
federation
graduation
millennium
philosophy
quadratics
transcript
wilderness
zoologists
so for example, if i input:
input1.txt
ammoniated
millennium
the output should be:
aspiration
classified
federation
graduation
millennium
So far, I tried:
# prompt user to enter input1.txt as filepath
filepath = input()
start = input()
end = input()
apending = False
out = ""
with open(filepath) as fp:
    line = fp.readline()
    while line:
        txt = line.strip()
        if(txt == end):
            apending = False
            # how do I make it terminate after end is printed??
        if(apending):
            out+=txt + '\n'
        if(txt == start):
            apending = True
        line = fp.readline()
print(out)
However, it does not seem to output anything. Any help debugging or fixing my code is greatly appreciated~
Here is the code:
# prompt user to enter input1.txt as filepath
filepath = input()
start = input()
end = input()
# apending = False
# out = ""
with open(filepath) as fp:
    while True:
        line = fp.readline()
        if not line:
            break
        txt = line.strip()
        if txt >= start and txt <= end:
            print(txt)
        if txt > end:
            break
None of the strings in the input is equal to ammoniated, therefore txt == start is never true, therefore apending is never set to True, therefore out += txt + '\n' is never executed, therefore print(out) prints an empty string at the end.
You should use < or > to compare the strings from the input with start and end. They are not supposed to be exact matches.
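To see why, here is a small sketch of how the comparison operators behave on the example words. in_range is a made-up helper name.

```python
def in_range(word, lower, upper):
    """True if lower <= word <= upper in Python's character-by-character ordering."""
    return lower <= word <= upper

words = ["aspiration", "classified", "millennium", "philosophy"]
print([w for w in words if in_range(w, "ammoniated", "millennium")])
```

Note that the comparison is case-sensitive: 'Zebra' < 'apple' is True because uppercase letters have lower code points than lowercase ones.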
This can be done by comparing strings using the <= and >= operators, and then filtering the words that match the lower and upper bounds criteria.
As said in this article, "Python string comparison is performed using the characters in both strings. The characters in both strings are compared one by one. When different characters are found then their Unicode value is compared. The character with lower Unicode value is considered to be smaller."
f = open('input.txt')
words = f.readlines()
lower_bound = input('Set the lower bound for query')
upper_bound = input('Set the upper bound for query')
result = [word.strip() for word in words if lower_bound <= word.strip() <= upper_bound]
print(result)
f.close()
output for print(result), with the input you provided:
['aspiration', 'classified', 'federation', 'graduation', 'millennium']
You can exit a for loop early by using the keyword break.
For example:
for x in range(50):
    if x > 20:
        break
    else:
        print(x)

Reading file and converting values "'list' object has no attribute 'find' on"

The main issue is that I cannot identify what is causing the code to produce this error. It is supposed to read the values in the text file and then calculate the average confidence of the values, but I've received repeated errors: the one in the title, and another which states 'could not convert string to float'. If I have it tell me which line fails, it is the first one.
I'm using Repl.it to run Python 3. I've tried doing this on my own computer and I get similar results; however, the error is very hard to read there, so I moved it to Repl.it to see it better.
# Asks usr input
usrin = input("Enter in file name: ")
# establishes variabls
count = 0
try:
    fmbox = open(usrin, 'r')
    rd = fmbox.readlines()
    # loops through each line and reads the file
    for line in rd:
        # line that is being read
        fmLen = len(rd)
        srchD = rd.find("X-DSPAM-Confidence: ")
        fmNum = rd[srchD + 1:fmLen] # extracts numeric val
        fltNum = float(fmNum.strip().replace(' ', ''))
        #only increments if there is a value
        if (fltNum > 0.0):
            count += 1
            total = fltNum + count
    avg = total / count
    print("The average confiedence is: ", avg)
    print("lines w pattern ", count)
The return should be the average of the numbers stripped from the file and the count of how many had values above 0.
if you need to view the txt file here it is http://www.pythonlearn.com/code3/mbox.txt
There are several problems with your code:
You're using string methods like find() and strip() on the list rd instead of on the individual line.
find() returns the lowest index of the substring if there's a match (since "X-DSPAM-Confidence: " seems to occur at the beginning of the line in the text file, it will return index 0), otherwise it returns -1. However, you're not checking the return value (so you're always assuming there's a match), and rd[srchD + 1:fmLen] should be line[srchD + len("X-DSPAM-Confidence: "):fmLen-1], since you want to extract everything after the substring up to the end of the line.
total is never initialized in the code shown (count is set to 0 at the top, but total is not), although it might be somewhere else in your code.
With total = fltNum + count, you're replacing the total at each iteration with fltNum + count. You should be adding fltNum to the total every time a match is found.
Working implementation:
try:
    fmbox = open('mbox.txt', 'r')
    rd = fmbox.readlines()
    # loops through each line and reads the file
    count = 0
    total = 0.0
    for line in rd:
        # line that is being read
        fmLen = len(line)
        srchD = line.find("X-DSPAM-Confidence: ")
        # only parse the confidence value if there is a match
        if srchD != -1:
            fmNum = line[srchD + len("X-DSPAM-Confidence: "):fmLen-1] # extracts numeric val
            fltNum = float(fmNum.strip().replace(' ', ''))
            # only increment if the value is non-zero
            if fltNum > 0.0:
                count += 1
                total += fltNum
    avg = total / count
    print("The average confidence is: ", avg)
    print("lines w pattern ", count)
except Exception as e:
    print(e)
Output:
The average confidence is: 0.8941280467445736
lines w pattern 1797
Demo: https://repl.it/#glhr/55679157
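A slightly shorter way to write the same parsing, offered as a sketch rather than a drop-in replacement: assuming the header always sits at the start of the line, str.startswith plus slicing avoids the index arithmetic. average_confidence is a made-up name, and it takes any iterable of lines so it can be tried without mbox.txt.

```python
PREFIX = "X-DSPAM-Confidence:"

def average_confidence(lines):
    """Return (average, count) over the positive confidence values found."""
    count = 0
    total = 0.0
    for line in lines:
        if line.startswith(PREFIX):
            # everything after the prefix is the number; strip() drops spaces and the newline
            value = float(line[len(PREFIX):].strip())
            if value > 0.0:
                count += 1
                total += value
    average = total / count if count else 0.0  # avoid dividing by zero when nothing matched
    return average, count

sample = ["From: someone\n", "X-DSPAM-Confidence: 0.8475\n", "X-DSPAM-Confidence: 0.6178\n"]
print(average_confidence(sample))
```

With a real file, the call would look like average_confidence(open('mbox.txt')).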

If-else statement not functioning properly in for loop

I'm trying to count the different characters in two individual strings using an if-else statement in a for-loop. However, it never counts the different characters.
for char in range(len(f1CurrentLine)): # Compare line by line
    if f1CurrentLine[char] != f2CurrentLine[char]: # If the lines have different characters
        print("Unmatched characters: ", count, ":", char)
        diffCharCount = diffCharCount + 1 # add 1 to the difference counter
        count = count + 1
        text1Count = text1Count + len(f1CurrentLine)
        text2Count = text2Count + len(f2CurrentLine)
        return CharByChar(count=count, text2Count=text2Count, text1Count=text1Count,
                          diffCharCount=diffCharCount) # return difference count
    else:
        print("Characters matched in line:", count, ". Moving to next line.")
        text1Count = text1Count + len(f1CurrentLine)
        text2Count = text2Count + len(f2CurrentLine)
        count = count + 1
        return CharByChar(count, diffCharCount=diffCharCount, text1Count=text1Count,
                          text2Count=text2Count,
                          diffLineCount=diffLineCount)
I have two files with the following in them
File 1:
1 Hello World
2 bazzle
3 foobar
File 2:
1 Hello world
2 bazzle
3 fooBar
It should return 2 different characters, but it does not. If you want to take a look at the entire function I have linked it here: Pastebin. Hopefully you can see something I have missed.
Your code is too complicated for this sort of application. I've tried my best to understand the code and I've come up with a better solution.
text1 = open("file1.txt")
text2 = open("file2.txt")
# Difference variables
diffLineCount = diffCharCount = 0
# Iterate through both files line by line; enumerate keeps a 1-based line number
# that stays correct even when a line is skipped with continue
for line_num, (line1, line2) in enumerate(zip(text1.readlines(), text2.readlines()), start=1):
    if line1 == "\n" or line2 == "\n": continue # If newline, go to next line
    if len(line1) != len(line2): # If lines are of different length
        diffLineCount += 1
        continue # Go to next line
    for c1, c2 in zip(list(line1.strip()), list(line2.strip())): # Iterate through both lines character by character
        if c1 != c2: # If they do not match
            print("Unmatched characters: ", line_num, ":", c1)
            diffCharCount += 1
# Goes back to the beginning of each file
text1.seek(0)
text2.seek(0)
# Prints the stats
print("Number of characters in the first file: ", len(text1.read()))
print("number of characters in the second file: ", len(text2.read()))
print("Number of characters that do not match in lines of the same length: ", diffCharCount)
print("Number of lines that are not the same length: ", diffLineCount)
# Closes the files
text1.close()
text2.close()
I hope you understand how this works and are able to make it fit your needs specifically. Good luck!
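If it helps, the per-line comparison can be pulled out into a small function and tried on the sample lines from the question. This is a sketch; count_diff_chars is a made-up name, and it only compares up to the length of the shorter line.

```python
def count_diff_chars(line1, line2):
    """Count positions at which two lines differ (up to the shorter length)."""
    return sum(1 for c1, c2 in zip(line1, line2) if c1 != c2)

# the three sample lines from each file in the question
file1_lines = ["Hello World", "bazzle", "foobar"]
file2_lines = ["Hello world", "bazzle", "fooBar"]
total = sum(count_diff_chars(a, b) for a, b in zip(file1_lines, file2_lines))
print(total)
```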
Unlike the other solution, I edited your own code so that you can see what was going wrong. I agree with the other answer that you should organize your code better anyway, because it is complex.
text1 = open("file1.txt")
text2 = open("file2.txt")
def CharByChar(count, diffCharCount, text1Count, text2Count, diffLineCount):
    """
    This function compares two files character by character and prints the number of characters that are different
    :param count: What line of the file the program is comparing
    :param diffCharCount: The sum of different characters
    :param text1Count: Sum of characters in file 1
    :param text2Count: Sum of characters in file 2
    :param diffLineCount: Sum of different lines
    """
    # see comment below for strip removal
    f1CurrentLine = text1.readline()
    f2CurrentLine = text2.readline()
    while f1CurrentLine != '' or f2CurrentLine != '':
        count = count + 1
        print(f1CurrentLine)
        print(f2CurrentLine)
        #if f1CurrentLine != '' or f2CurrentLine != '':
        if len(f1CurrentLine) != len(f2CurrentLine): # If the line lengths are not equal return the line number
            print("Lines are a different length. The line number is: ", count)
            diffLineCount = diffLineCount + 1
            count = count + 1
            #text1Count = text1Count + len(f1CurrentLine)
            #text2Count = text2Count + len(f2CurrentLine)
            # return CharByChar(count)
        elif len(f1CurrentLine) == len(f2CurrentLine): # If the lines lengths are equal
            for char in range(len(f1CurrentLine)): # Compare line by line
                print(char)
                if f1CurrentLine[char] != f2CurrentLine[char]: # If the lines have different characters
                    print("Unmatched characters: ", count, ":", char)
                    diffCharCount = diffCharCount + 1 # add 1 to the difference counter
                    #count = count + 1
                    text1Count = text1Count + len(f1CurrentLine)
                    text2Count = text2Count + len(f2CurrentLine)
                    # return CharByChar(count=count, text2Count=text2Count, text1Count=text1Count,diffCharCount=diffCharCount) # return difference count
                else:
                    print("Characters matched in line:", count, ". Moving to next char.")
                    #text1Count = text1Count + len(f1CurrentLine)
                    #text2Count = text2Count + len(f2CurrentLine)
                    #count = count + 1
                    #return CharByChar(count, diffCharCount=diffCharCount, text1Count=text1Count,text2Count=text2Count,diffLineCount=diffLineCount)
        #elif len(f1CurrentLine) == 0 or len(f2CurrentLine) == 0:
            #print(count, "lines are not matching")
            #diffLineCount = diffLineCount + 1
            #return CharByChar(diffLineCount=diffLineCount)
        else:
            print("Something else happened!")
        f1CurrentLine = text1.readline()
        f2CurrentLine = text2.readline()
    print("Number of characters in the first file: ", text1Count)
    print("number of characters in the second file: ", text2Count)
    print("Number of characters that do not match in lines of the same length: ", diffCharCount)
    print("Number of lines that are not the same length: ", diffLineCount)
def main():
    "Calls the primary function"
    CharByChar(count=0, diffCharCount=0, text1Count=0, text2Count=0, diffLineCount=0)
    input("Hit enter to close the program...")
main() #Runs this bad boi
I think the general trouble is that you organized your CharByChar() function to scan all the lines in the file [which is something we keep in this solution] but then call the same function again at the end of every character check.
Some parts have no reason to be there: for example, you set count in main() when calling CharByChar() and then create a branch with if(count == 0). You can cut this out, and the code will look cleaner.
Some variables should be removed as well to keep the code as clean as possible: you never use text1Count and text2Count.
You enter the while with a condition and the next if has the same condition: if you enter the while you will also enter the if [or neither of them], so you can cut one of them out.
I suggest removing the branch with if len(f1CurrentLine) == 0 or len(f2CurrentLine) == 0, because both files can have length 0 for the same line, and then the lines would be equal [see the example just below].
I suggest removing the strip() so that the check is not interrupted early for files that have empty lines in the middle, e.g.
1 Hello
3 foobar

Run Length Encoding in python

I got homework to do "Run Length Encoding" in Python. I wrote some code, but it prints something I don't want: it just prints the string exactly as it was written. What I want is to print the string so that any character that appears more than once is printed only one time, followed by the number of times it appears in the string. How can I do this?
For example:
the string : 'lelamfaf'
the result : 'l2ea2mf2'
def encode(input_string):
    count = 1
    prev = ''
    lst = []
    for character in input_string:
        if character != prev:
            if prev:
                entry = (prev, count)
                lst.append(entry)
                #print(lst)
            count = 1
            prev = character
        else:
            count += 1
    else:
        entry = (character, count)
        lst.append(entry)
    return lst
def decode(lst):
    q = ""
    for character, count in lst:
        q += character * count
    return q
def main():
    s = 'emanuelshmuel'
    print(decode(encode(s)))
if __name__ == "__main__":
    main()
Three remarks:
You should use the existing method str.count for the encode function.
The decode function repeats each character count times; it does not print the character followed by its count.
Actually the decode(encode(string)) combination is a coding function since you do not retrieve the starting string from the encoding result.
Here is a working code:
def encode(input_string):
    characters = []
    result = ''
    for character in input_string:
        # End loop if all characters were counted
        if set(characters) == set(input_string):
            break
        if character not in characters:
            characters.append(character)
            count = input_string.count(character)
            result += character
            if count > 1:
                result += str(count)
    return result
def main():
    s = 'emanuelshmuel'
    print(encode(s))
    assert(encode(s) == 'e3m2anu2l2sh')
    s = 'lelamfaf'
    print(encode(s))
    assert(encode(s) == 'l2ea2mf2')
if __name__ == "__main__":
    main()
Came up with this quickly, maybe there's room for optimization (for example, if the strings are too large and there's enough memory, it would be better to use a set of the letters of the original string for look ups rather than the list of characters itself). But, does the job fairly efficiently:
text = 'lelamfaf'
counts = {s:text.count(s) for s in text}
char_lst = []
for l in text:
    if l not in char_lst:
        char_lst.append(l)
        if counts[l] > 1:
            char_lst.append(str(counts[l]))
encoded_str = ''.join(char_lst)
print(encoded_str)
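Another possible sketch uses collections.Counter, which does the counting in a single pass. It relies on Counter keeping its keys in first-seen order, which holds on Python 3.7 and later. encode_counts is a made-up name.

```python
from collections import Counter

def encode_counts(text):
    """Each distinct character once, followed by its total count when above 1."""
    counts = Counter(text)  # keys stay in first-seen order on Python 3.7+
    return ''.join(ch + (str(n) if n > 1 else '') for ch, n in counts.items())

print(encode_counts('lelamfaf'))
```

Strictly speaking, this counts every occurrence in the whole string, which is what the question asks for; classic run-length encoding only counts consecutive repeats.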
