I'm learning Python, and for practice I'm writing a script that reads a file (containing a graph in Trivial Graph Format) and runs a couple of graph algorithms on it.
I thought about storing the graph in a list of n dictionaries, where n is the number of vertexes, so that all the edges of a vertex are stored in that vertex's dictionary.
I tried this
edges = [{} for i in xrange(num_vertexes)]
for line in file:
    args = line.split(' ')
    vertex1 = int(args[0])
    vertex2 = int(args[1])
    label = int(args[2])
    edges[vertex1][vertex2] = label
but I'm getting this error for the last line:
IndexError: list index out of range
It looks like vertex1 is greater than the largest valid index. Given that Python indexes from 0 and the example on the format's wiki numbers its vertices from 1, the last line's vertex number is probably one higher than the largest list index (I'd need to see the file to know for sure, of course). So in Python, lst[0] is the first element and lst[n-1] is the last, whereas for the vertexes 1 is the first and n is the last.
So the fix here is to use vertex1 = int(args[0]) - 1 (and likewise for vertex2, so both endpoints use the same 0-based numbering).
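A minimal sketch of the adjusted loop, reusing num_vertexes and file from your snippet and assuming the file numbers its vertices from 1 as in the wiki example:
import a
edges = [{} for i in xrange(num_vertexes)]
for line in file:
    args = line.split(' ')
    vertex1 = int(args[0]) - 1  # shift 1-based vertex numbers to 0-based list indices
    vertex2 = int(args[1]) - 1
    label = int(args[2])
    edges[vertex1][vertex2] = label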
The issue is somewhere in your data; add some validation to make sure your code doesn't choke on bad input. Currently your code will fail if a line contains non-numbers, has fewer than three numbers, or if vertex1 >= len(edges).
edges = [{} for i in xrange(num_vertexes)]
for line in file:
    args = line.split(' ')
    if len(args) >= 3:
        try:
            vertex1 = int(args[0])
            vertex2 = int(args[1])
            label = int(args[2])
            if vertex1 < len(edges):
                edges[vertex1][vertex2] = label
            else:
                # value for vertex1 is too large
                pass
        except ValueError:
            # you got some non-number data
            pass
    else:
        # you got a line with not enough data
        pass
Replace any of those pass statements with logging if needed (you can also remove the two else blocks if you don't intend to use them).
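For instance, a sketch of how the standard logging module could replace those pass statements, reusing num_vertexes and file from the snippet above (the message texts are just placeholders):

import logging

logging.basicConfig(level=logging.WARNING)

edges = [{} for i in xrange(num_vertexes)]
for line in file:
    args = line.split(' ')
    if len(args) >= 3:
        try:
            vertex1 = int(args[0])
            vertex2 = int(args[1])
            label = int(args[2])
            if vertex1 < len(edges):
                edges[vertex1][vertex2] = label
            else:
                # value for vertex1 is too large
                logging.warning("vertex %d out of range: %r", vertex1, line)
        except ValueError:
            # non-number data on this line
            logging.warning("non-number data: %r", line)
    else:
        # line with not enough data
        logging.warning("fewer than three fields: %r", line)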
So for this problem I had to create a program that takes in two arguments: a CSV database like this:
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
And a DNA sequence like this:
TAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
My program works by first getting the "Short Tandem Repeat" (STR) headers from the database (AGATC, etc.), then counting the highest number of times each STR repeats consecutively within the sequence. Finally, it compares these counted values to the values of each row in the database, printing out a name if a match is found, or "No match" otherwise.
The program works for sure, but it is ridiculously slow whenever run with the larger database provided, to the point where the terminal pauses for an entire minute before returning any output. Unfortunately, this is causing the 'check50' marking system to time out and return a negative result when testing with this large database.
I'm presuming the slowdown is caused by the nested loops within the 'STR_count' function:
def STR_count(sequence, seq_len, STR_array, STR_array_len):
    # Creates a list to store max recurrence values for each STR
    STR_count_values = [0] * STR_array_len
    # Temp value to store current count of STR recurrence
    temp_value = 0
    # Iterates over each STR in STR_array
    for i in range(STR_array_len):
        STR_len = len(STR_array[i])
        # Iterates over each sequence element
        for j in range(seq_len):
            # Ensures it's still physically possible for STR to be present in sequence
            while (seq_len - j >= STR_len):
                # Gets sequence substring of length STR_len, starting from jth element
                sub = sequence[j:(j + (STR_len))]
                # Compares current substring to current STR
                if (sub == STR_array[i]):
                    temp_value += 1
                    j += STR_len
                else:
                    # Ensures current STR_count_value is highest
                    if (temp_value > STR_count_values[i]):
                        STR_count_values[i] = temp_value
                    # Resets temp_value to break count, and pushes j forward by 1
                    temp_value = 0
                    j += 1
        i += 1
    return STR_count_values
And the 'DNA_match' function:
# Searches database file for DNA matches
def DNA_match(STR_values, arg_database, STR_array_len):
    with open(arg_database, 'r') as csv_database:
        database = csv.reader(csv_database)
        name_array = [] * (STR_array_len + 1)
        next(database)
        # Iterates over one row of database at a time
        for row in database:
            name_array.clear()
            # Copies entire row into name_array list
            for column in row:
                name_array.append(column)
            # Converts name_array number strings to actual ints
            for i in range(STR_array_len):
                name_array[i + 1] = int(name_array[i + 1])
            # Checks if a row's STR values match the sequence's values, prints the row name if match is found
            match = 0
            for i in range(0, STR_array_len, + 1):
                if (name_array[i + 1] == STR_values[i]):
                    match += 1
            if (match == STR_array_len):
                print(name_array[0])
                exit()
        print("No match")
        exit()
However, I'm new to Python, and haven't really had to consider speed before, so I'm not sure how to improve upon this.
I'm not particularly looking for people to do my work for me, so I'm happy for any suggestions to be as vague as possible. And honestly, I'll value any feedback, including stylistic advice, as I can only imagine how disgusting this code looks to those more experienced.
Here's a link to the full program, if helpful.
Thanks :) x
Thanks for providing a link to the entire program. It seems needlessly complex, but I'd say it's just a lack of knowing what features are available to you. I think you've already identified the part of your code that's causing the slowness - I haven't profiled it or anything, but my first impulse would also be the three nested loops in STR_count.
Here's how I would write it, taking advantage of the Python standard library. Every entry in the database corresponds to one person, so that's what I'm calling them. people is a list of dictionaries, where each dictionary represents one line in the database. We get this for free by using csv.DictReader.
To find the matches in the sequence, for every short tandem repeat in the database, we create a regex pattern (the current short tandem repeat, repeated one or more times). If there is a match in the sequence, the total number of repetitions is equal to the length of the match divided by the length of the current tandem repeat. For example, if AGATCAGATCAGATC is present in the sequence, and the current tandem repeat is AGATC, then the number of repetitions will be len("AGATCAGATCAGATC") // len("AGATC") which is 15 // 5, which is 3.
count is just a dictionary that maps short tandem repeats to their corresponding number of repetitions in the sequence. Finally, we search for a person whose short tandem repeat counts match those of count exactly, and print their name. If no such person exists, we print "No match".
def main():
    import argparse
    from csv import DictReader
    import re

    parser = argparse.ArgumentParser()
    parser.add_argument("database_filename")
    parser.add_argument("sequence_filename")
    args = parser.parse_args()

    with open(args.database_filename, "r") as file:
        reader = DictReader(file)
        short_tandem_repeats = reader.fieldnames[1:]
        people = list(reader)

    with open(args.sequence_filename, "r") as file:
        sequence = file.read().strip()

    count = dict(zip(short_tandem_repeats, [0] * len(short_tandem_repeats)))

    for short_tandem_repeat in short_tandem_repeats:
        pattern = f"({short_tandem_repeat}){{1,}}"
        match = re.search(pattern, sequence)
        if match is None:
            continue
        count[short_tandem_repeat] = len(match.group()) // len(short_tandem_repeat)

    try:
        person = next(person for person in people if all(int(person[k]) == count[k] for k in short_tandem_repeats))
        print(person["name"])
    except StopIteration:
        print("No match")

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
I have this code that works great and does what I want, however it does it in a linear fashion, which is way too slow for the size of my data files, so I want to convert it to a logarithmic (binary) search. I tried the code below and many others posted here, but still no luck at getting it to work. I will post both sets of code and give examples of what I expect.
import pandas
import fileinput

'''This code runs fine and does what I expect removing duplicates from big
file that are in small file, however it is a linear function.'''

with open('small.txt') as fin:
    exclude = set(line.rstrip() for line in fin)

for line in fileinput.input('big.txt', inplace=True):
    if line.rstrip() not in exclude:
        print(line, end='')
    else:
        print('')
'''This code is my attempt at conversion to a log function.'''

def log_search(small, big):
    first = 0
    last = len(big.txt) - 1
    while first <= last:
        mid = (first + last) / 2
        if str(mid) == small.txt:
            return True
        elif small.txt < str(mid):
            last = mid - 1
        else:
            first = mid + 1
    with open('small.txt') as fin:
        exclude = set(line.rstrip() for line in fin)
    for line in fileinput.input('big.txt', inplace=True):
        if line.rstrip() not in exclude:
            print(line, end='')
        else:
            print('')
    return log_search(small, big)
The big file has millions of lines of integer data.
The small file has hundreds of lines of integer data.
The goal: compare the data and remove duplicated data from the big file, but leave that line blank (so the line numbering is preserved).
Running the first block of code works, but it takes too long to search through the big file. Maybe I am approaching the problem the wrong way. My attempt at converting it to a logarithmic search runs without error but does nothing.
I don't think there is a better or faster way to do this than what you are currently doing in your first approach. (Update: there is, see below.) Storing the lines from small.txt in a set and iterating over the lines in big.txt, checking whether they are in that set, will have complexity O(b), with b being the number of lines in big.txt.
What you seem to be trying to do is reduce this to O(s*log b), with s being the number of lines in small.txt, by using binary search to check for each line in small.txt whether it is in big.txt, and removing/overwriting it if so.
This would work well if all the lines were in a list with random access to any element, but you just have the file, which does not allow random access to any line. It does, however, allow random access to any character with file.seek, which (at least in some cases?) seems to be O(1). But then you still have to find the previous line break before that position before you can actually read the line. Also, you cannot just replace lines with empty lines; you have to overwrite the number with the same number of characters, e.g. spaces.
So, yes, theoretically it can be done in O(s*logb), if you do the following:
implement binary search, searching not on the lines, but on the characters of the big file
for each position, backtrack to the last line break, then read the line to get the number
try again in the lower/upper half as usual with binary search
if the number is found, replace with as many spaces as there are digits in the number
repeat with the next number from the small file
On my system, reading and writing a file with 10 million lines of numbers only took 3 seconds each, or about 8 seconds with fileinput.input and print. Thus, IMHO, this is not really worth the effort, but of course this may depend on how often you have to do this operation.
Okay, so I got curious myself (and who needs a lunch break anyway?) and tried to implement this... and it works surprisingly well. This will find the given number in the file and replace it with a matching number of - characters (not just a blank line; that's impossible without rewriting the entire file). Note that I did not thoroughly test the binary-search algorithm for edge cases, off-by-one errors, etc.
import os

def getlineat(f, pos):
    pos = f.seek(pos)
    while pos > 0 and f.read(1) != "\n":
        pos = f.seek(pos-1)
    return pos+1 if pos > 0 else 0

def bsearch(f, num):
    lower = 0
    upper = os.stat(f.name).st_size - 1
    while lower <= upper:
        mid = (lower + upper) // 2
        pos = getlineat(f, mid)
        line = f.readline()
        if not line:
            break  # end of file
        val = int(line)
        if val == num:
            return (pos, len(line.strip()))
        elif num < val:
            upper = mid - 1
        elif num > val:
            lower = mid + 1
    return (-1, -1)

def overwrite(filename, to_remove):
    with open(filename, "r+") as f:
        positions = [bsearch(f, n) for n in to_remove]
        for n, (pos, length) in sorted(zip(to_remove, positions)):
            print(n, pos)
            if pos != -1:
                f.seek(pos)
                f.write("-" * length)

import random
to_remove = [random.randint(-500, 1500) for _ in range(10)]
overwrite("test.txt", to_remove)
This will first collect all the positions to be overwritten, and then do the actual overwriting in a second step; otherwise the binary search will have problems when it hits one of the previously "removed" lines. I tested this with a file holding all the numbers from 0 to 1,000 in sorted order and a list of random numbers (both in- and out-of-bounds) to be removed, and it worked just fine.
Update: Also tested it with a file with random numbers from 0 to 100,000,000 in sorted order (944 MB) and overwriting 100 random numbers, and it finished immediately, so this should indeed be O(s*logb), at least on my system (the complexity of file.seek may depend on file system, file type, etc.).
The bsearch function could also be generalized to accept another parameter value_function instead of hardcoding val = int(line). Then it could be used for binary-searching in arbitrary files, e.g. huge dictionaries, gene databases, csv files, etc., as long as the lines are sorted by that same value function.
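For illustration, a sketch of that generalization, reusing getlineat and the imports from the code above (value_function and the lambda in the usage comment are hypothetical, not part of the original code):

def bsearch(f, target, value_function=int):
    lower = 0
    upper = os.stat(f.name).st_size - 1
    while lower <= upper:
        mid = (lower + upper) // 2
        pos = getlineat(f, mid)
        line = f.readline()
        if not line:
            break  # end of file
        val = value_function(line)  # was hardcoded as int(line)
        if val == target:
            return (pos, len(line.strip()))
        elif target < val:
            upper = mid - 1
        else:
            lower = mid + 1
    return (-1, -1)

# e.g. binary-search a CSV file sorted by its first column:
# bsearch(f, "some_key", value_function=lambda line: line.split(",")[0])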
I have a data structure Line whose outline is:
class Line:
    x1
    y1
    x2
    y2
    m
    c
    id
    # other functions pertaining to the class
In the main loop I have a list of lines, which is already populated at this point.
What I want to do is consolidate lines whose m and c values are very close, so that I get a single line instead of the multiple lines that come out of detection:
for line1 in allLines:
    consolidateLines = []
    for line2 in allLines:
        if line1.id() == line2.id():
            continue;
        if abs(line1.m() - line2.m()) < SomeValue:
            if abs(line1.c() - line2.c()) < someOtherValue:
                consolidateLines.append(line2);
                consolidateLines.append(line1);
    # I want to remove all the lines in consolidatedLines.
    # But since this is already in the loop, that is a problem.
    # How do I accomplish this.
Explaining the problem:
I have a list of lines. Since these lines are detected using a computer vision algorithm (Hough transforms), some of the lines are very close to each other, which is not ideal. So I am trying to consolidate all the lines that are very close and have a similar orientation. If a line is represented by y = mx + c, I'm trying to:
consolidate all lines within the list (there may be 5 lines that are close by) with nearly the same values of m and c, and get one line for those.
remove all the consolidated lines
add the new line that I get to the list.
To remove duplicates from a list you basically need to compare every element with every other element in the list. In order not to compare each pair twice, you start the second loop at the position of the first loop + 1.
The following code does that, and if it finds a duplicate it skips the first of the two values (via the break statement):
consolidateLines = []
for i, line1 in enumerate(allLines):
    for j, line2 in enumerate(allLines[i+1:]):
        if (abs(line1.m() - line2.m()) < SomeValue and
                abs(line1.c() - line2.c()) < someOtherValue):
            break  # found a duplicate later in the list, skipping first occurrence
    else:
        # no duplicate found -> add to list
        consolidateLines.append(line1)
Question: write a program which first defines the functions minFromList(list) and maxFromList(list). The program should initialize an empty list and then prompt the user for an integer, and keep prompting for integers, adding each integer to the list, until the user enters a single period character. The program should then call minFromList and maxFromList with the list of integers as an argument and print the results returned by the function calls.
I can't figure out how to get the min and max returned from each function separately. And now I've added extra code so I'm totally lost. Anything helps! Thanks!
What I have so far:
def minFromList(list)
    texts = []
    while (text != -1):
        texts.append(text)
    high = max(texts)
    return texts

def maxFromList(list)
    texts []
    while (text != -1):
        texts.append(text)
    low = min(texts)
    return texts

text = raw_input("Enter an integer (period to end): ")
list = []
while text != '.':
    textInt = int(text)
    list.append(textInt)
    text = raw_input("Enter an integer (period to end): ")

print "The lowest number entered was: " , minFromList(list)
print "The highest number entered was: " , maxFromList(list)
I think the part of the assignment that might have confused you was about initializing an empty list and where to do it. Your main body that collects data is good and does what it should, but you ended up doing too much in your max and min functions. Another misleading part of the assignment is that it suggests you write custom routines for these functions, even though max() and min() exist in Python and return exactly what you need.
It's another story if you are required to write your own max and min and are not permitted to use the built-in functions. In that case you would need to loop over each value in the list, track the biggest or smallest seen so far, and then return that final value.
Without directly giving you too much of the specific answer, here are some individual examples of the parts you may need...
# looping over the items in a list
value = 1
for item in aList:
    if item == value:
        print "value is 1!"

# basic function with arguments and a return value
def aFunc(start):
    end = start + 1
    return end

print aFunc(1)
# result: 2

# some useful comparison operators
print 1 > 2  # False
print 2 > 1  # True
That should hopefully be enough general information for you to piece together your custom min and max functions. While there are some more advanced and efficient ways to do min and max, I think to start out, a simple for loop over the list would be easiest.
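If it helps, here is one more generic piece in the same spirit: a sketch of tracking a "best value seen so far" while looping over a list of numbers (nums is just a placeholder name, and flipping the comparison tracks the smallest instead):

# tracking a running maximum over a list of numbers
nums = [3, 7, 2]
best = nums[0]
for item in nums:
    if item > best:  # use < here to track the minimum instead
        best = item
print "largest so far:", best
# result: 7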
I'm working with huge CSV data files. Each file contains millions of records, and each record has a key. The records are sorted by their key. I don't want to go over the whole file when searching for certain data.
I've seen this solution : Reading Huge File in Python
But it assumes that all lines in the file have the same length, which is not the case for my data.
I thought about adding padding to each line and keeping a fixed line length, but I'd like to know if there is a better way to do it.
I'm working with Python.
You don't have to have a fixed width record because you don't have to do a record-oriented search. Instead you can just do a byte-oriented search and make sure that you realign to keys whenever you do a seek. Here's a (probably buggy) example of how to modify the solution you linked to from record-oriented to byte-oriented:
bytes = 24935502  # total size of the file in bytes
for i, search in enumerate(list):  # list contains the list of search keys
    left, right = 0, bytes - 1
    key = None
    while key != search and left <= right:
        mid = (left + right) / 2
        fin.seek(mid)
        # now realign to a record
        if mid:
            fin.readline()
        key, value = map(int, fin.readline().split())
        if search > key:
            left = mid + 1
        else:
            right = mid - 1
    if key != search:
        value = None  # for when search key is not found
    search.result = value  # store the result of the search
To solve this, you can also use binary search, but you need to change it a bit:
Get the file size.
Seek to the middle of the file with file.seek.
Search for the next EOL character there; then you are at the start of a new line.
Check this line's key, and if it is not what you want, update begin/end and go back to step 2.
Here is some sample code:
fp = open('your file')
fp.seek(0, 2)
begin = 0
end = fp.tell()
while (begin < end):
    fp.seek((end + begin) / 2, 0)
    fp.readline()
    line_key = get_key(fp.readline())
    if (key == line_key):
        pass  # find what you want
    elif (key > line_key):
        begin = fp.tell()
    else:
        end = fp.tell()
Maybe the code has bugs; verify it yourself. And please check the performance if you really want the fastest way.
The answer on the referenced question that says binary search only works with fixed-length records is wrong. And you don't need to do a search at all, since you have multiple items to look up. Just walk through the entire file one line at a time, build a dictionary of key:offset for each line, and then for each of your search items jump to the record of interest using os.lseek on the offset corresponding to each key.
Of course, if you don't want to read the entire file even once, you'll have to do a binary search. But if building the index can be amortized over several lookups, perhaps saving the index if you only do one lookup per day, then a search is unnecessary.
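A minimal sketch of that index-then-seek idea, under a few assumptions: the file is named 'data.csv', the key is the first comma-separated field on each line, and the file object's own seek is used rather than os.lseek; build_index and lookup are hypothetical helper names.

def build_index(path):
    # map each key to the byte offset where its line starts
    index = {}
    with open(path, 'rb') as f:
        offset = f.tell()
        for raw in iter(f.readline, b''):
            key = raw.split(b',', 1)[0].decode()
            index[key] = offset
            offset = f.tell()
    return index

def lookup(path, index, key):
    # jump straight to the record for `key` using the saved offset
    if key not in index:
        return None
    with open(path, 'rb') as f:
        f.seek(index[key])
        return f.readline().decode().rstrip('\n')

# usage sketch:
# index = build_index('data.csv')
# print(lookup('data.csv', index, 'some_key'))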