CS50 'DNA': Ways to speed up my Week 6 'dna.py' program?

CS50 'DNA': Ways to speed up my Week 6 'dna.py' program? - python

So for this problem I had to create a program that takes in two arguments. A CSV database like this:
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
And a DNA sequence like this:
TAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
My program works by first getting the "Short Tandem Repeat" (STR) headers from the database (AGATC, etc.), then counting the highest number of times each STR repeats consecutively within the sequence. Finally, it compares these counted values to the values of each row in the database, printing out a name if a match is found, or "No match" otherwise.
The program works for sure, but is ridiculously slow whenever ran using the larger database provided, to the point where the terminal pauses for an entire minute before returning any output. And unfortunately this is causing the 'check50' marking system to time-out and return a negative result upon testing with this large database.
I'm presuming the slowdown is caused by the nested loops within the 'STR_count' function:
def STR_count(sequence, seq_len, STR_array, STR_array_len):
# Creates a list to store max recurrence values for each STR
STR_count_values = [0] * STR_array_len
# Temp value to store current count of STR recurrence
temp_value = 0
# Iterates over each STR in STR_array
for i in range(STR_array_len):
STR_len = len(STR_array[i])
# Iterates over each sequence element
for j in range(seq_len):
# Ensures it's still physically possible for STR to be present in sequence
while (seq_len - j >= STR_len):
# Gets sequence substring of length STR_len, starting from jth element
sub = sequence[j:(j + (STR_len))]
# Compares current substring to current STR
if (sub == STR_array[i]):
temp_value += 1
j += STR_len
else:
# Ensures current STR_count_value is highest
if (temp_value > STR_count_values[i]):
STR_count_values[i] = temp_value
# Resets temp_value to break count, and pushes j forward by 1
temp_value = 0
j += 1
i += 1
return STR_count_values
And the 'DNA_match' function:
# Searches database file for DNA matches
def DNA_match(STR_values, arg_database, STR_array_len):
with open(arg_database, 'r') as csv_database:
database = csv.reader(csv_database)
name_array = [] * (STR_array_len + 1)
next(database)
# Iterates over one row of database at a time
for row in database:
name_array.clear()
# Copies entire row into name_array list
for column in row:
name_array.append(column)
# Converts name_array number strings to actual ints
for i in range(STR_array_len):
name_array[i + 1] = int(name_array[i + 1])
# Checks if a row's STR values match the sequence's values, prints the row name if match is found
match = 0
for i in range(0, STR_array_len, + 1):
if (name_array[i + 1] == STR_values[i]):
match += 1
if (match == STR_array_len):
print(name_array[0])
exit()
print("No match")
exit()
However, I'm new to Python, and haven't really had to consider speed before, so I'm not sure how to improve upon this.
I'm not particularly looking for people to do my work for me, so I'm happy for any suggestions to be as vague as possible. And honestly, I'll value any feedback, including stylistic advice, as I can only imagine how disgusting this code looks to those more experienced.
Here's a link to the full program, if helpful.
Thanks :) x

Thanks for providing a link to the entire program. It seems needlessly complex, but I'd say it's just a lack of knowing what features are available to you. I think you've already identified the part of your code that's causing the slowness - I haven't profiled it or anything, but my first impulse would also be the three nested loops in STR_count.
Here's how I would write it, taking advantage of the Python standard library. Every entry in the database corresponds to one person, so that's what I'm calling them. people is a list of dictionaries, where each dictionary represents one line in the database. We get this for free by using csv.DictReader.
To find the matches in the sequence, for every short tandem repeat in the database, we create a regex pattern (the current short tandem repeat, repeated one or more times). If there is a match in the sequence, the total number of repetitions is equal to the length of the match divided by the length of the current tandem repeat. For example, if AGATCAGATCAGATC is present in the sequence, and the current tandem repeat is AGATC, then the number of repetitions will be len("AGATCAGATCAGATC") // len("AGATC") which is 15 // 5, which is 3.
count is just a dictionary that maps short tandem repeats to their corresponding number of repetitions in the sequence. Finally, we search for a person whose short tandem repeat counts match those of count exactly, and print their name. If no such person exists, we print "No match".
def main():
import argparse
from csv import DictReader
import re
parser = argparse.ArgumentParser()
parser.add_argument("database_filename")
parser.add_argument("sequence_filename")
args = parser.parse_args()
with open(args.database_filename, "r") as file:
reader = DictReader(file)
short_tandem_repeats = reader.fieldnames[1:]
people = list(reader)
with open(args.sequence_filename, "r") as file:
sequence = file.read().strip()
count = dict(zip(short_tandem_repeats, [0] * len(short_tandem_repeats)))
for short_tandem_repeat in short_tandem_repeats:
pattern = f"({short_tandem_repeat}){{1,}}"
match = re.search(pattern, sequence)
if match is None:
continue
count[short_tandem_repeat] = len(match.group()) // len(short_tandem_repeat)
try:
person = next(person for person in people if all(int(person[k]) == count[k] for k in short_tandem_repeats))
print(person["name"])
except StopIteration:
print("No match")
return 0
if __name__ == "__main__":
import sys
sys.exit(main())

Related

Runtime Error (Python3) when you manipulate lists with very long strings

I wrote a Python3 code to manipulate lists of strings but the code gives Runtime Error for long strings. Here is my code for the problem:
string = "BANANA"
slist= list (string)
mark = list(range(len(slist)))
vowel_substrings = list()
consonants_substrings = list()
#print(mark)
for i in range(len(slist)):
if slist[i]=='A' or slist[i]=='E' or slist[i]=='I' or slist[i]=='O' or mark[i]=='U':
mark[i] = 1
else:
mark[i] = 0
#print(mark)
for j in range(len(slist)):
if mark[j] == 1:
for l in range(j,len(string)):
vowel_substrings.append(string[j:l+1])
#print(string[j:l+1])
else:
for l in range(j,len(string)):
consonants_substrings.append(string[j:l+1])
#print(consonants_substrings)
unique_consonants = list(set(consonants_substrings))
unique_vowels = list(set(vowel_substrings))
##add two lists
all_substrings = consonants_substrings+(vowel_substrings)
#print(all_substrings)
##Find points earned by vowel guy and consonant guy
vowel_guy_score = 0
consonant_guy_score = 0
for strng in unique_vowels:
vowel_guy_score += vowel_substrings.count(strng)
for strng in unique_consonants:
consonant_guy_score += consonants_substrings.count(strng)
#print(vowel_guy_score) #Kevin
#print(consonant_guy_score) #Stuart
if vowel_guy_score > consonant_guy_score:
print("Kevin ",vowel_guy_score)
elif vowel_guy_score < consonant_guy_score:
print("Stuart ",consonant_guy_score)
else:
print("Draw")
gives the right answer. But if you have a long string, shown below, it fails.
NANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANAN
I think initialization or memory allocation might be a problem but I don't know how to allocate memory before even knowing how much memory the code will need. Thank you in advance for any help you can provide.

In the middle there, you generate a data structure of size O(n³): for each starting position × each ending position × length of the substring. That's probably where your memory problems appear (you haven't posted a traceback).
One possible optimisation would be, instead of having a list of substrings and then generating the set, use instead a Counter class. That would let you know how many times each substring appears without storing all the copies:
vowel_substrings = collections.Counter()
consonant_substrings = collections.Counter()
for j in range(len(slist)):
if mark[j] == 1:
for l in range(j,len(string)):
vowel_substrings[string[j:l+1]] += 1
#print(string[j:l+1])
else:
for l in range(j,len(string)):
consonants_substrings[string[j:l+1]] += 1
Even better would be to calculate the scores as you go along, without storing any of the substrings. If I'm reading the code correctly, the substrings aren't actually used for anything — each letter is effectively scored based on its distance from the end of the string, and the scores are added up. This can be calculated in a single pass through the string, without making any additional copies or keeping track of anything other than the cumulative scores and the length of the string.

How to read a text file and convert into a list for use with statistics package in Python

The code I am running so far is as follows
import os
import math
import statistics
def main ():
infile = open('USPopulation.txt', 'r')
values = infile.read()
infile.close()
index = 0
while index < len(values):
values(index) = int(values(index))
index += 1
print(values)
main()
The text file contains 41 rows of numbers each entered on a single line like so:
151868
153982
156393
158956
161884
165069
168088
etc.
My tasks is to create a program which shows average change in population during the time period. The year with the greatest increase in population during the time period. The year with the smallest increase in population (from the previous year) during the time period.
The code will print each of the text files entries on a single line, but upon trying to convert to int for use with the statistics package I am getting the following error:
values(index) = int(values(index))
SyntaxError: can't assign to function call
The values(index) = int(values(index)) line was taken from reading as well as resources on stack overflow.

You can change values = infile.read() to values = list(infile.read())
and it will have it ouput as a list instead of a string.
One of the things that tends to happen whenever reading a file like this is, at the end of every line there is an invisible '\n' that declares a new line within the text file, so an easy way to split it by lines and turn them into integers would be, instead of using values = list(infile.read()) you could use values = values.split('\n') which splits the based off of lines, as long as values was previously declared.
and the while loop that you have can be easily replace with a for loop, where you would use len(values) as the end.
the values(index) = int(values(index)) part is a decent way to do it in a while loop, but whenever in a for loop, you can use values[i] = int(values[i]) to turn them into integers, and then values becomes a list of integers.
How I would personally set it up would be :
import os
import math
import statistics
def main ():
infile = open('USPopulation.txt', 'r')
values = infile.read()
infile.close()
values = values.split('\n') # Splits based off of lines
for i in range(0, len(values)) : # loops the length of values and turns each part of values into integers
values[i] = int(values[i])
changes = []
# Use a for loop to get the changes between each number.
for i in range(0, len(values)-1) : # you put the -1 because there would be an indexing error if you tried to count i+1 while at len(values)
changes.append(values[i+1] - values[i]) # This will get the difference between the current and the next.
print('The max change :', max(changes), 'The minimal change :', min(changes))
#And since there is a 'change' for each element of values, meaning if you print both changes and values, you would get the same number of items.
print('A change of :', max(changes), 'Happened at', values[changes.index(max(changes))]) # changes.index(max(changes)) gets the position of the highest number in changes, and finds what population has the same index (position) as it.
print('A change of :', min(changes), 'Happened at', values[changes.index(min(changes))]) #pretty much the same as above just with minimum
# If you wanted to print the second number, you would do values[changes.index(min(changes)) + 1]
main()
If you need any clarification on anything I did in the code, just ask.

I personally would use numpy for reading a text file.
in your case I would do it like this:
import numpy as np
def main ():
infile = np.loadtxt('USPopulation.txt')
maxpop = np.argmax(infile)
minpop = np.argmin(infile)
print(f'maximum population = {maxpop} and minimum population = {minpop}')
main()

AIO Castle Cavalry - My code is too slow, is there a way I can shorten this?

So I am currently preparing for a competition (Australian Informatics Olympiad) and in the training hub, there is a problem in AIO 2018 intermediate called Castle Cavalry. I finished it:
input = open("cavalryin.txt").read()
output = open("cavalryout.txt", "w")
squad = input.split()
total = squad[0]
squad.remove(squad[0])
squad_sizes = squad.copy()
squad_sizes = list(set(squad))
yn = []
for i in range(len(squad_sizes)):
n = squad.count(squad_sizes[i])
if int(squad_sizes[i]) == 1 and int(n) == int(total):
yn.append(1)
elif int(n) == int(squad_sizes[i]):
yn.append(1)
elif int(n) != int(squad_sizes[i]):
yn.append(2)
ynn = list(set(yn))
if len(ynn) == 1 and int(ynn[0]) == 1:
output.write("YES")
else:
output.write("NO")
output.close()
I submitted this code and I didn't pass because it was too slow, at 1.952secs. The time limit is 1.000 secs. I wasn't sure how I would shorten this, as to me it looks fine. PLEASE keep in mind I am still learning, and I am only an amateur. I started coding only this year, so if the answer is quite obvious, sorry for wasting your time 😅.
Thank you for helping me out!

One performance issue is calling int() over and over on the same entity, or on things that are already int:
if int(squad_sizes[i]) == 1 and int(n) == int(total):
elif int(n) == int(squad_sizes[i]):
elif int(n) != int(squad_sizes[i]):
if len(ynn) == 1 and int(ynn[0]) == 1:
But the real problem is your code doesn't work. And making it faster won't change that. Consider the input:
4
2
2
2
2
Your code will output "NO" (with missing newline) despite it being a valid configuration. This is due to your collapsing the squad sizes using set() early in your code. You've thrown away vital information and are only really testing a subset of the data. For comparison, here's my complete rewrite that I believe handles the input correctly:
with open("cavalryin.txt") as input_file:
string = input_file.read()
total, *squad_sizes = map(int, string.split())
success = True
while squad_sizes:
squad_size = squad_sizes.pop()
for _ in range(1, squad_size):
try:
squad_sizes.remove(squad_size) # eliminate n - 1 others like me
except ValueError:
success = False
break
else: # no break
continue
break
with open("cavalryout.txt", "w") as output_file:
print("YES" if success else "NO", file=output_file)
Note that I convert all the input to int early on so I don't have to consider that issue again. I don't know whether this will meet AIO's timing constraints.

I can see some things in there that might be inefficient, but the best way to optimize code is to profile it: run it with a profiler and sample data.
You can easily waste time trying to speed up parts that don't need it without having much effect. Read up on the cProfile module in the standard library to see how to do this and interpret the output. A profiling tutorial is probably too long to reproduce here.
My suggestions, without profiling,
squad.remove(squad[0])
Removing the start of a big list is slow, because the rest of the list has to be copied as it is shifted down. (Removing the end of the list is faster, because lists are typically backed by arrays that are overallocated (more slots than elements) anyway, to make .append()s fast, so it only has to decrease the length and can keep the same array.
It would be better to set this to a dummy value and remove it when you convert it to a set (sets are backed by hash tables, so removals are fast), e.g.
dummy = object()
squad[0] = dummy # len() didn't change. No shifting required.
...
squad_sizes = set(squad)
squad_sizes.remove(dummy) # Fast lookup by hash code.
Since we know these will all be strings, you can just use None instead of a dummy object, but the above technique works even when your list might contain Nones.
squad_sizes = squad.copy()
This line isn't required; it's just doing extra work. The set() already makes a shallow copy.
n = squad.count(squad_sizes[i])
This line might be the real bottleneck. It's effectively a loop inside a loop, so it basically has to scan the whole list for each outer loop. Consider using collections.Counter for this task instead. You generate the count table once outside the loop, and then just look up the numbers for each string.
You can also avoid generating the set altogether if you do this. Just use the Counter object's keys for your set.
Another point unrelated to performance. It's unpythonic to use indexes like [i] when you don't need them. A for loop can get elements from an iterable and assign them to variables in one step:
from collections import Counter
...
count_table = Counter(squad)
for squad_size, n in count_table.items():
...

You can collect all occurences of the preferred number for each knight in a dictionary.
Then test if the number of knights with a given preferred number is divisible by that number.
with open('cavalryin.txt', 'r') as f:
lines = f.readlines()
# convert to int
list_int = [int(a) for a in lines]
#initialise counting dictionary: key: preferred number, item: empty list to collect all knights with preferred number.
collect_dict = {a:[] for a in range(1,1+max(list_int[1:]))}
print(collect_dict)
# loop though list, ignoring first entry.
for a in list_int[1:]:
collect_dict[a].append(a)
# initialise output
out='YES'
for key, item in collect_dict.items():
# check number of items with preference for number is divisilbe
# by that number
if item: # if list has entries:
if (len(item) % key) > 0:
out='NO'
break
with open('cavalryout.txt', 'w') as f:
f.write(out)

Conversion to Logn Python 3.7

I have this code that works great and does what I want, however it does it in linear form which is way to slow for the size of my data files so I want to convert it to Log. I tried this code and many others posted here but still no luck at getting it to work. I will post both sets of code and give examples of what I expect.
import pandas
import fileinput
'''This code runs fine and does what I expect removing duplicates from big
file that are in small file, however it is a linear function.'''
with open('small.txt') as fin:
exclude = set(line.rstrip() for line in fin)
for line in fileinput.input('big.txt', inplace=True):
if line.rstrip() not in exclude:
print(line, end='')
else:
print('')
'''This code is my attempt at conversion to a log function.'''
def log_search(small, big):
first = 0
last = len(big.txt) - 1
while first <= last:
mid = (first + last) / 2
if str(mid) == small.txt:
return True
elif small.txt < str(mid):
last = mid - 1
else:
first = mid + 1
with open('small.txt') as fin:
exclude = set(line.rstrip() for line in fin)
for line in fileinput.input('big.txt', inplace=True):
if line.rstrip() not in exclude:
print(line, end='')
else:
print('')
return log_search(small, big)
big file has millions of lines of int data.
small file has hundreds of lines of int data.
compare data and remove duplicated data in big file but leave line number blank.
running the first block of code works but it takes too long to search through the big file. Maybe I am approaching the problem in a wrong way. My attempt at converting it to log runs without error but does nothing.

I don't think there is a better or faster way to do this that what you are currently doing in your first approach. (Update: There is, see below.) Storing the lines from small.txt in a set and iterating the lines in big.txt, checking whether they are in that set, will have complexity of O(b), with b being the number of lines in big.txt.
What you seem to be trying is to reduce this to O(s*logb), with s being the number of lines in small.txt, by using binary search to check for each line in small.txt whether it is in big.txt and removing/overwriting it then.
This would work well if all the lines were in a list with random access to any array, but you have just the file, which does not allow random access to any line. It does, however, allow random access to any character with file.seek, which (at least in some cases?) seems to be O(1). But then you will still have to find the previous line break to that position before you can actually read that line. Also, you can not just replace lines with empty lines, but you have to overwrite the number with the same number of characters, e.g. spaces.
So, yes, theoretically it can be done in O(s*logb), if you do the following:
implement binary search, searching not on the lines, but on the characters of the big file
for each position, backtrack to the last line break, then read the line to get the number
try again in the lower/upper half as usual with binary search
if the number is found, replace with as many spaces as there are digits in the number
repeat with the next number from the small file
On my system, reading and writing a file with 10 million lines of numbers only took 3 seconds each, or about 8 seconds with fileinput.input and print. Thus, IMHO, this is not really worth the effort, but of course this may depend on how often you have to do this operation.
Okay, so I got curious myself --and who needs a lunch break anyway?-- so I tried to implement this... and it works surprisingly well. This will find the given number in the file and replace it with an accordant number of - (not just a blank line, that's impossible without rewriting the entire file). Note that I did not thoroughly test the binary-search algorithm for edge cases, off-by-one erros etc.
import os
def getlineat(f, pos):
pos = f.seek(pos)
while pos > 0 and f.read(1) != "\n":
pos = f.seek(pos-1)
return pos+1 if pos > 0 else 0
def bsearch(f, num):
lower = 0
upper = os.stat(f.name).st_size - 1
while lower <= upper:
mid = (lower + upper) // 2
pos = getlineat(f, mid)
line = f.readline()
if not line: break # end of file
val = int(line)
if val == num:
return (pos, len(line.strip()))
elif num < val:
upper = mid - 1
elif num > val:
lower = mid + 1
return (-1, -1)
def overwrite(filename, to_remove):
with open(filename, "r+") as f:
positions = [bsearch(f, n) for n in to_remove]
for n, (pos, length) in sorted(zip(to_remove, positions)):
print(n, pos)
if pos != -1:
f.seek(pos)
f.write("-" * length)
import random
to_remove = [random.randint(-500, 1500) for _ in range(10)]
overwrite("test.txt", to_remove)
This will first collect all the positions to be overwritten, and then do the actual overwriting in a second stes, otherwise the binary search will have problems when it hits one of the previously "removed" lines. I tested this with a file holding all the numbers from 0 to 1,000 in sorted order and a list of random numbers (both in- and out-of-bounds) to be removed and it worked just fine.
Update: Also tested it with a file with random numbers from 0 to 100,000,000 in sorted order (944 MB) and overwriting 100 random numbers, and it finished immediately, so this should indeed be O(s*logb), at least on my system (the complexity of file.seek may depend on file system, file type, etc.).
The bsearch function could also be generalized to accept another parameter value_function instead of hardcoding val = int(line). Then it could be used for binary-searching in arbitrary files, e.g. huge dictionaries, gene databases, csv files, etc., as long as the lines are sorted by that same value function.

Python Min-Max Function - List as argument to return min and max element

Question: write a program which first defines functions minFromList(list) and maxFromList(list). Program should initialize an empty list and then prompt user for an integer and keep prompting for integers, adding each integer to the list, until the user enters a single period character. Program should than call minFromList and maxFromList with the list of integers as an argument and print the results returned by the function calls.
I can't figure out how to get the min and max returned from each function separately. And now I've added extra code so I'm totally lost. Anything helps! Thanks!
What I have so far:
def minFromList(list)
texts = []
while (text != -1):
texts.append(text)
high = max(texts)
return texts
def maxFromList(list)
texts []
while (text != -1):
texts.append(text)
low = min(texts)
return texts
text = raw_input("Enter an integer (period to end): ")
list = []
while text != '.':
textInt = int(text)
list.append(textInt)
text = raw_input("Enter an integer (period to end): ")
print "The lowest number entered was: " , minFromList(list)
print "The highest number entered was: " , maxFromList(list)

I think the part of the assignment that might have confused you was about initializing an empty list and where to do it. Your main body that collects data is good and does what it should. But you ended up doing too much with your max and min functions. Again a misleading part was that assignment is that it suggested you write a custom routine for these functions even though max() and min() exist in python and return exactly what you need.
Its another story if you are required to write your own max and min, and are not permitted to use the built in functions. At that point you would need to loop over each value in the list and track the biggest or smallest. Then return the final value.
Without directly giving you too much of the specific answer, here are some individual examples of the parts you may need...
# looping over the items in a list
value = 1
for item in aList:
if item == value:
print "value is 1!"
# basic function with arguments and a return value
def aFunc(start):
end = start + 1
return end
print aFunc(1)
# result: 2
# some useful comparison operators
print 1 > 2 # False
print 2 > 1 # True
That should hopefully be enough general information for you to piece together your custom min and max functions. While there are some more advanced and efficient ways to do min and max, I think to start out, a simple for loop over the list would be easiest.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

CS50 'DNA': Ways to speed up my Week 6 'dna.py' program? - python

Related

Runtime Error (Python3) when you manipulate lists with very long strings

How to read a text file and convert into a list for use with statistics package in Python

AIO Castle Cavalry - My code is too slow, is there a way I can shorten this?

Conversion to Logn Python 3.7

Python Min-Max Function - List as argument to return min and max element

Categories

Resources