Binary search over a huge file with unknown line length - python

I'm working with huge CSV data files. Each file contains millions of records, and each record has a key. The records are sorted by their key. I don't want to scan the whole file when searching for certain data.
I've seen this solution: Reading Huge File in Python
But it assumes that all lines in the file have the same length, which is not the case for my data.
I thought about adding padding to each line to keep a fixed line length, but I'd like to know if there is a better way to do it.
I'm working with Python.

You don't need fixed-width records, because you don't have to do a record-oriented search. Instead you can do a byte-oriented search and make sure that you realign to a record boundary whenever you seek. Here's a (probably buggy) example of how to modify the solution you linked to from record-oriented to byte-oriented:
bytes = 24935502  # total size of the file in bytes
for i, search in enumerate(search_keys):  # search_keys contains the list of search keys
    left, right = 0, bytes - 1
    key = None
    while key != search and left <= right:
        mid = (left + right) // 2
        fin.seek(mid)  # fin is the already-open data file
        # now realign to a record
        if mid:
            fin.readline()
        key, value = map(int, fin.readline().split())
        if search > key:
            left = mid + 1
        else:
            right = mid - 1
    if key != search:
        value = None  # for when the search key is not found
    search.result = value  # store the result of the search

To solve it you can also use binary search, but you need to change it a bit:
1. Get the file size.
2. Seek to the middle of the file with file.seek.
3. Search for the first EOL character from there; now you are at the start of a new line.
4. Check this line's key and, if it is not what you want, update the boundaries and go back to step 2.
Here is some sample code:
fp = open('your file')
fp.seek(0, 2)
begin = 0
end = fp.tell()
while begin < end:
    fp.seek((end + begin) // 2, 0)
    fp.readline()  # skip the partial line so we are aligned to a full record
    line_key = get_key(fp.readline())
    if key == line_key:  # key is the value you are searching for
        pass  # found what you want
    elif key > line_key:
        begin = fp.tell()
    else:
        end = fp.tell()
The code may have bugs, so verify it yourself. And please measure the performance if you really want the fastest way.

The answer on the referenced question that says binary search only works with fixed-length records is wrong. And you don't need to do a search at all, since you have multiple items to look up. Just walk through the entire file one line at a time, build a dictionary of key:offset for each line, and then for each of your search items jump to the record of interest using os.lseek on the offset corresponding to each key.
Of course, if you don't want to read the entire file even once, you'll have to do a binary search. But if the cost of building the index can be amortized over several lookups (for example by saving the index to disk when you only do one lookup per day), then a search is unnecessary.
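A minimal sketch of that index-then-seek approach, assuming each line holds an integer key and value separated by whitespace (as in the example above); the names build_offset_index and lookup are just placeholders:

def build_offset_index(path):
    # Scan the file once, recording each line's starting byte offset, keyed by that line's key.
    index = {}
    with open(path, 'rb') as f:
        offset = f.tell()
        for line in iter(f.readline, b''):
            key = int(line.split()[0])
            index[key] = offset
            offset = f.tell()
    return index

def lookup(path, index, key):
    # Jump straight to the record for the given key using its saved offset.
    if key not in index:
        return None
    with open(path, 'rb') as f:
        f.seek(index[key])
        _, value = map(int, f.readline().split())
        return value

Building the index costs one linear pass, but every subsequent lookup is a single seek and readline.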

Related

CS50 'DNA': Ways to speed up my Week 6 'dna.py' program?

So for this problem I had to create a program that takes in two arguments. A CSV database like this:
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
And a DNA sequence like this:
TAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
My program works by first getting the "Short Tandem Repeat" (STR) headers from the database (AGATC, etc.), then counting the highest number of times each STR repeats consecutively within the sequence. Finally, it compares these counted values to the values of each row in the database, printing out a name if a match is found, or "No match" otherwise.
The program works for sure, but it is ridiculously slow whenever run with the larger database provided, to the point where the terminal pauses for an entire minute before returning any output. Unfortunately, this is causing the 'check50' marking system to time out and return a negative result when testing with the large database.
I'm presuming the slowdown is caused by the nested loops within the 'STR_count' function:
def STR_count(sequence, seq_len, STR_array, STR_array_len):
    # Creates a list to store max recurrence values for each STR
    STR_count_values = [0] * STR_array_len
    # Temp value to store current count of STR recurrence
    temp_value = 0
    # Iterates over each STR in STR_array
    for i in range(STR_array_len):
        STR_len = len(STR_array[i])
        # Iterates over each sequence element
        for j in range(seq_len):
            # Ensures it's still physically possible for STR to be present in sequence
            while (seq_len - j >= STR_len):
                # Gets sequence substring of length STR_len, starting from jth element
                sub = sequence[j:(j + (STR_len))]
                # Compares current substring to current STR
                if (sub == STR_array[i]):
                    temp_value += 1
                    j += STR_len
                else:
                    # Ensures current STR_count_value is highest
                    if (temp_value > STR_count_values[i]):
                        STR_count_values[i] = temp_value
                    # Resets temp_value to break count, and pushes j forward by 1
                    temp_value = 0
                    j += 1
        i += 1
    return STR_count_values
And the 'DNA_match' function:
# Searches database file for DNA matches
def DNA_match(STR_values, arg_database, STR_array_len):
    with open(arg_database, 'r') as csv_database:
        database = csv.reader(csv_database)
        name_array = [] * (STR_array_len + 1)
        next(database)
        # Iterates over one row of database at a time
        for row in database:
            name_array.clear()
            # Copies entire row into name_array list
            for column in row:
                name_array.append(column)
            # Converts name_array number strings to actual ints
            for i in range(STR_array_len):
                name_array[i + 1] = int(name_array[i + 1])
            # Checks if a row's STR values match the sequence's values, prints the row name if match is found
            match = 0
            for i in range(0, STR_array_len, + 1):
                if (name_array[i + 1] == STR_values[i]):
                    match += 1
            if (match == STR_array_len):
                print(name_array[0])
                exit()
        print("No match")
        exit()
However, I'm new to Python, and haven't really had to consider speed before, so I'm not sure how to improve upon this.
I'm not particularly looking for people to do my work for me, so I'm happy for any suggestions to be as vague as possible. And honestly, I'll value any feedback, including stylistic advice, as I can only imagine how disgusting this code looks to those more experienced.
Here's a link to the full program, if helpful.
Thanks :) x
Thanks for providing a link to the entire program. It seems needlessly complex, but I'd say it's just a lack of knowing what features are available to you. I think you've already identified the part of your code that's causing the slowness - I haven't profiled it or anything, but my first impulse would also be the three nested loops in STR_count.
Here's how I would write it, taking advantage of the Python standard library. Every entry in the database corresponds to one person, so that's what I'm calling them. people is a list of dictionaries, where each dictionary represents one line in the database. We get this for free by using csv.DictReader.
To find the matches in the sequence, for every short tandem repeat in the database, we create a regex pattern (the current short tandem repeat, repeated one or more times). If there is a match in the sequence, the total number of repetitions is equal to the length of the match divided by the length of the current tandem repeat. For example, if AGATCAGATCAGATC is present in the sequence, and the current tandem repeat is AGATC, then the number of repetitions will be len("AGATCAGATCAGATC") // len("AGATC") which is 15 // 5, which is 3.
count is just a dictionary that maps short tandem repeats to their corresponding number of repetitions in the sequence. Finally, we search for a person whose short tandem repeat counts match those of count exactly, and print their name. If no such person exists, we print "No match".
def main():
    import argparse
    from csv import DictReader
    import re

    parser = argparse.ArgumentParser()
    parser.add_argument("database_filename")
    parser.add_argument("sequence_filename")
    args = parser.parse_args()

    with open(args.database_filename, "r") as file:
        reader = DictReader(file)
        short_tandem_repeats = reader.fieldnames[1:]
        people = list(reader)

    with open(args.sequence_filename, "r") as file:
        sequence = file.read().strip()

    count = dict(zip(short_tandem_repeats, [0] * len(short_tandem_repeats)))
    for short_tandem_repeat in short_tandem_repeats:
        pattern = f"({short_tandem_repeat}){{1,}}"
        match = re.search(pattern, sequence)
        if match is None:
            continue
        count[short_tandem_repeat] = len(match.group()) // len(short_tandem_repeat)

    try:
        person = next(person for person in people if all(int(person[k]) == count[k] for k in short_tandem_repeats))
        print(person["name"])
    except StopIteration:
        print("No match")
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())

Conversion to log n - Python 3.7

I have this code that works great and does what I want; however, it runs in linear time, which is way too slow for the size of my data files, so I want to convert it to a logarithmic search. I tried the code below and many others posted here, but still no luck getting it to work. I will post both sets of code and give examples of what I expect.
import pandas
import fileinput

'''This code runs fine and does what I expect removing duplicates from big
file that are in small file, however it is a linear function.'''

with open('small.txt') as fin:
    exclude = set(line.rstrip() for line in fin)
    for line in fileinput.input('big.txt', inplace=True):
        if line.rstrip() not in exclude:
            print(line, end='')
        else:
            print('')

'''This code is my attempt at conversion to a log function.'''

def log_search(small, big):
    first = 0
    last = len(big.txt) - 1
    while first <= last:
        mid = (first + last) / 2
        if str(mid) == small.txt:
            return True
        elif small.txt < str(mid):
            last = mid - 1
        else:
            first = mid + 1
    with open('small.txt') as fin:
        exclude = set(line.rstrip() for line in fin)
        for line in fileinput.input('big.txt', inplace=True):
            if line.rstrip() not in exclude:
                print(line, end='')
            else:
                print('')
        return log_search(small, big)
The big file has millions of lines of integer data. The small file has hundreds of lines of integer data. I want to compare the data and remove duplicated data from the big file but leave a blank line in its place.
Running the first block of code works, but it takes too long to search through the big file. Maybe I am approaching the problem the wrong way. My attempt at converting it to a logarithmic search runs without error but does nothing.
I don't think there is a better or faster way to do this than what you are currently doing in your first approach. (Update: There is, see below.) Storing the lines from small.txt in a set and iterating over the lines in big.txt, checking whether they are in that set, has a complexity of O(b), with b being the number of lines in big.txt.
What you seem to be trying is to reduce this to O(s*log b), with s being the number of lines in small.txt, by using binary search to check for each line in small.txt whether it is in big.txt, and then removing/overwriting it.
This would work well if all the lines were in a list with random access, but you just have the file, which does not allow random access to any line. It does, however, allow random access to any character with file.seek, which (at least in some cases?) seems to be O(1). But then you still have to find the previous line break before that position before you can actually read the line. Also, you cannot just replace lines with empty lines; you have to overwrite the number with the same number of characters, e.g. spaces.
So, yes, theoretically it can be done in O(s*log b), if you do the following:
- implement binary search, searching not on the lines, but on the characters of the big file
- for each position, backtrack to the last line break, then read the line to get the number
- try again in the lower/upper half as usual with binary search
- if the number is found, replace it with as many spaces as there are digits in the number
- repeat with the next number from the small file
On my system, reading and writing a file with 10 million lines of numbers only took 3 seconds each, or about 8 seconds with fileinput.input and print. Thus, IMHO, this is not really worth the effort, but of course this may depend on how often you have to do this operation.
Okay, so I got curious myself --and who needs a lunch break anyway?-- so I tried to implement this... and it works surprisingly well. This will find the given number in the file and replace it with a matching number of - characters (not just a blank line; that's impossible without rewriting the entire file). Note that I did not thoroughly test the binary-search algorithm for edge cases, off-by-one errors, etc.
import os

def getlineat(f, pos):
    pos = f.seek(pos)
    while pos > 0 and f.read(1) != "\n":
        pos = f.seek(pos-1)
    return pos+1 if pos > 0 else 0

def bsearch(f, num):
    lower = 0
    upper = os.stat(f.name).st_size - 1
    while lower <= upper:
        mid = (lower + upper) // 2
        pos = getlineat(f, mid)
        line = f.readline()
        if not line: break # end of file
        val = int(line)
        if val == num:
            return (pos, len(line.strip()))
        elif num < val:
            upper = mid - 1
        elif num > val:
            lower = mid + 1
    return (-1, -1)

def overwrite(filename, to_remove):
    with open(filename, "r+") as f:
        positions = [bsearch(f, n) for n in to_remove]
        for n, (pos, length) in sorted(zip(to_remove, positions)):
            print(n, pos)
            if pos != -1:
                f.seek(pos)
                f.write("-" * length)

import random
to_remove = [random.randint(-500, 1500) for _ in range(10)]
overwrite("test.txt", to_remove)
This will first collect all the positions to be overwritten, and then do the actual overwriting in a second step; otherwise the binary search would have problems when it hits one of the previously "removed" lines. I tested this with a file holding all the numbers from 0 to 1,000 in sorted order and a list of random numbers (both in- and out-of-bounds) to be removed, and it worked just fine.
Update: I also tested it with a file of random numbers from 0 to 100,000,000 in sorted order (944 MB) and overwriting 100 random numbers, and it finished immediately, so this should indeed be O(s*log b), at least on my system (the complexity of file.seek may depend on the file system, file type, etc.).
The bsearch function could also be generalized to accept another parameter value_function instead of hardcoding val = int(line). Then it could be used for binary-searching in arbitrary files, e.g. huge dictionaries, gene databases, csv files, etc., as long as the lines are sorted by that same value function.
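For example, a minimal sketch of that generalization, reusing getlineat and os from the snippet above (the name bsearch_generic and the value_function parameter are illustrative, not tested against real data):

def bsearch_generic(f, target, value_function=int):
    # Binary-search a file whose lines are sorted by value_function(line).
    lower = 0
    upper = os.stat(f.name).st_size - 1
    while lower <= upper:
        mid = (lower + upper) // 2
        pos = getlineat(f, mid)  # realign to the start of a line
        line = f.readline()
        if not line:
            break  # ran past the end of the file
        val = value_function(line)
        if val == target:
            return (pos, len(line.strip()))
        elif target < val:
            upper = mid - 1
        else:
            lower = mid + 1
    return (-1, -1)

# e.g. searching a CSV file sorted by its first column:
# pos, length = bsearch_generic(f, "some_key", value_function=lambda line: line.split(",")[0])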

Python: Extract substring from text file based on character index

So I have a file with a few thousand entries of the form (FASTA format, if anyone wants to know):
>scaffold1110_len145113_cov91
TAGAAAATTGAATAATTGATAGTTCTTAACGAAAAGTAAAAGTTTAAAGTATACAGAAATTTCAGGCTATTCACTCTTTT
ATAATCCAAAATTAGAAATACCACACCTTGCATAAAGTTTAAGATATTTACAAAAACCTGAAGTGGATAATCCGAAATCG
...
>Next_Header
ATGCTA...
And I have a python-dictionary from part of my code that contains information like the following for a number of headers:
{'scaffold1110_len145113_cov91': [[38039, 38854, 106259], [40035, 40186, 104927]]}
This describes the entry by header and a list of start position, end position and rest of characters in that entry (so start=1 means the first character of the line below that corresponding header). [start, end, left]
What I want to do is extract the string for each interval, including 25 (or a variable number of) characters in front of and behind it if the entry allows for it; otherwise include all characters up to the beginning/end. (For example, when the start position is 8, I can't include 25 characters in front, only 8.)
And that for every entry in my dict.
It probably doesn't sound too hard, but I am struggling to come up with a clever way to do it.
For now my idea was to read lines from my file, check if they begin with ">" and look up if they exist in my dict. Then add up the chars per line until they exceed my start position and from there somehow manage to get the right part of that line to match my startPos - X.
for line in genomeFile:
    line = line.strip()
    if(line[0] == ">"):
        header = line
        currentCluster = foundClusters.get(header[1:])
        if(currentCluster is not None):
            outputFile.write(header + "\n")
    if(currentCluster is not None):
        charCount += len(line)
        # *crazy calculations to find the actual part i want to extract*
I am quite the python beginner so maybe someone has a better idea how to solve this?
-- While typing this I got the idea to use file.read(startPos-X-1) after a line matches a header I am looking for, to read up to my desired position, and from there use file.read((endPos+X - startPos-X)) to extract the part I am looking for. If this works, it seems pretty easy to accomplish what I want.
I'll post this anyway, maybe someone has an even better way or maybe my idea wont work.
thanks for any input.
EDIT:
turns out you can't mix for line in file with file.read(x), since the former uses buffering, soooooo back to the batcave. Also, file.read(x) counts newlines too, which my data for start and end positions do not.
(also fixed some stupid errors in my posted code)
Perhaps you could use a function to generate the slice indices you need.
def biggerFrame(start, end, left, frameSize=25):  # defaults to frameSize of 25
    newStart = start - frameSize
    if newStart < 0:
        newStart = 0
    if frameSize > left:
        newEnd = left
    else:
        newEnd = end + frameSize
    return newStart, newEnd
With that function, you can add something like the following to your code.
for indices in currentCluster:
    slice, dice = biggerFrame(indices[0], indices[1], indices[2], 50)  # frameSize is 50 here; you can make it whatever you want.
    outputFile.write(line[slice:dice] + '\n')
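The loop above assumes line already holds the whole sequence for the current header; since FASTA records span multiple lines, one possible sketch (with a hypothetical helper name, reusing biggerFrame and the question's foundClusters dict) is to accumulate each record first and then slice it:

def extract_regions(genomeFileName, foundClusters, frameSize=25):
    # First pass: collect the full sequence for each header.
    sequences = {}
    header = None
    with open(genomeFileName) as genomeFile:
        for line in genomeFile:
            line = line.strip()
            if line.startswith(">"):
                header = line[1:]
                sequences[header] = []
            elif header is not None:
                sequences[header].append(line)
    # Second pass: slice out each padded interval.
    regions = {}
    for header, clusters in foundClusters.items():
        seq = "".join(sequences.get(header, []))
        for start, end, left in clusters:
            # Adjust for 1-based positions here if necessary (e.g. start - 1).
            lo, hi = biggerFrame(start, end, left, frameSize)
            regions.setdefault(header, []).append(seq[lo:hi])
    return regions

This reads the whole file once and keeps the joined sequences in memory, which should be fine for a few thousand entries.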

Python algorithm error when trying to find the next largest value

I've written an algorithm that scans through a file of IDs and compares each value with the value of an integer i (I've converted the integer to a string for comparison, and I've trimmed the "\n" from the line). The algorithm compares these values for each line in the file (each ID). If they are equal, the algorithm increases i by 1 and recurses with the new value of i. If the values aren't equal, it compares i to the next line in the file. It does this until it has a value of i that isn't in the file, then returns that value for use as the ID of the next record.
My issue: I have a file of IDs that lists 1,3,2, as I removed a record with ID 2 and then created a new record. This shows the algorithm working correctly, as it gave the new record the ID of 2, which was previously removed. However, when I then create another record, the next ID is 3, resulting in my ID list reading 1,3,2,3 instead of 1,3,2,4. Below is my algorithm, with the results of the print() command. I can see where it's going wrong but can't work out why. Any ideas?
Algorithm:
def _getAvailableID(iD):
    i = iD
    f = open(IDFileName, "r")
    lines = f.readlines()
    for line in lines:
        print("%s,%s,%s" % ("i=" + str(i), "ID=" + line[:-1], (str(i) == line[:-1])))
        if str(i) == line[:-1]:
            i += 1
            f.close()
            _getAvailableID(i)
    return str(i)
Output:
(The output for when the algorithm was run for finding an appropriate ID for the record that should have ID of 4):
i=1,ID=1,True
i=2,ID=1,False
i=2,ID=3,False
i=2,ID=2,True
i=3,ID=1,False
i=3,ID=3,True
i=4,ID=1,False
i=4,ID=3,False
i=4,ID=2,False
i=4,ID=2,False
i=2,ID=3,False
i=2,ID=2,True
i=3,ID=1,False
i=3,ID=3,True
i=4,ID=1,False
i=4,ID=3,False
i=4,ID=2,False
i=4,ID=2,False
I think your program is failing because you need to change:
_getAvailableID(i)
to
return _getAvailableID(i)
(At the moment the recursive function finds the correct answer which is discarded.)
However, it would probably be better to simply put all the ids you have seen into a set to make the program more efficient.
e.g. in pseudocode:
S = set()
loop over all items and S.add(int(line.rstrip()))
i = 0
while i in S:
    i += 1
return i
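A runnable version of that pseudocode might look like this (a minimal sketch, keeping the question's function name and assuming IDs start at 1):

def _getAvailableID(IDFileName):
    # Collect every ID already used in the file.
    with open(IDFileName, "r") as f:
        seen = {int(line.rstrip()) for line in f if line.strip()}
    # Return the smallest unused ID, starting from 1.
    i = 1
    while i in seen:
        i += 1
    return str(i)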
In case you are simply looking for the max ID in the file and then want to return the next available value:
def _getAvailableID(IDFileName):
    iD = '0'
    with open(IDFileName, "r") as f:
        for line in f:
            print("ID=%s, line=%s" % (iD, line))
            if line > iD:
                iD = line
    return str(int(iD)+1)

print(_getAvailableID("IDs.txt"))
with an input file containing
1
3
2
it outputs
ID=1, line=1
ID=1
, line=3
ID=3
, line=2
4
However, we can solve it in a more pythonic way:
def _getAvailableID(IDFileName):
    with open(IDFileName, "r") as f:
        mx_id = max(f, key=int)
    return int(mx_id)+1

python mmap regex searching common entries in two files

I have 2 huge XML files. One is around 40 GB, the other is around 2 GB. Assume the XML format is something like this:
<xml>
...
<page>
    <id> 123 </id>
    <title> ABC </title>
    <text> .....
    .....
    .....
    </text>
</page>
...
</xml>
I have created an index file for both file 1 and file 2 using mmap.
Each of the index files complies with this format:
Id <page>_byte_position </page>_byte_position
So, basically given an Id, from the index files, I know where the tag starts for that Id and where it ends i.e. tag byte pos.
Now, what I need to do is:
- For each id in the smaller index file (for the 2 GB file), figure out whether the id exists in the larger index file.
- If the id exists, get the <page>_byte_position and </page>_byte_position for that id from the larger index file (for the 40 GB file).
My current code is awfully slow. I guess I am using an O(m*n) algorithm, where m is the size of the larger file and n the size of the smaller file.
with open(smaller_idx_file, "r+b") as f_small_idx:
    for line in f_small_idx.readlines():
        split = line.split(" ")
        with open(larger_idx_file, "r+b") as f_large_idx:
            for line2 in f_large_idx.readlines():
                split2 = line2.split(" ")
                if split[0] in split2:
                    print split[0]
                    print split2[1] + " " + split2[2]
This is AWFULLY slow !!!!
Any better suggestions ??
Basically, given two huge files, how do you check whether each word in a particular column of the smaller file exists in the huge file, and if it does, extract the other relevant fields as well?
Any suggestions would be greatly appreciated!! : )
Don't have time for an elaborate answer right now but this should work (assuming the temporary dict will fit into memory):
Iterate over the smaller file and put all the words of the relevant column in a dict (lookup in a dict has an average-case performance of O(1)).
Iterate over the larger file and look up each word in the dict, storing the relevant information either directly with the dict entries or elsewhere (a rough sketch follows below).
If this does not work I would suggest sorting (or filtering) the files first so that chunks can then be processed independently (i.e. compare only everything that starts with A then B...)
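A rough Python 3 sketch of that dict-based approach, reusing the file-name variables from the question (treat it as an outline, not a drop-in fix):

# Build a set of ids from the smaller index, then stream through the larger index once.
small_ids = set()
with open(smaller_idx_file) as f_small_idx:
    for line in f_small_idx:
        parts = line.split()
        if parts:
            small_ids.add(parts[0])

with open(larger_idx_file) as f_large_idx:
    for line in f_large_idx:
        parts = line.split()
        if parts and parts[0] in small_ids:
            # id present in both files: parts[1] and parts[2] are the byte positions
            print(parts[0], parts[1], parts[2])

This makes one pass over each file, so it is O(m + n) instead of O(m*n).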
