I am looking for suggestions on how to speed up the process described below, which involves a fuzzy regex search.
What I am trying to do
I am fuzzy searching (4 mismatches max) for keywords, stored in a dictionary d (example just below; the value is always a list of two, and I need to keep track of which of the two was found, if any), in a set of strings stored in a file testFile (one string per line, ~150 characters each).
d = {"kw1": ["AGCTCGATGTATGGGTATATGATCTTGAC", "GTCAAGATCATATACCCATACATCGAGCT"], "kw2": ["GGTCAGGTCAGTACGGTACGATCGATTTCGA", "TCGAAATCGATCGTACCGTACTGACCTGACC"]} #simplified to just two keywords
How I do it
For this, I first compile my regexes and store them in a dictionary compd. I then read the file line by line and search for each keyword in each line (string). I cannot stop the search once a keyword has been found, as multiple keywords may be found in one string/line, but I can skip the second element in the list associated with a keyword if the first is found.
Here is how I am doing it:
#!/usr/bin/env python3
import argparse
import regex
parser = argparse.ArgumentParser()
parser.add_argument('file', help='file with strings')
args = parser.parse_args()
#dictionary with keywords
d = {"kw1": ["AGCTCGATGTATGGGTATATGATCTTGAC", "GTCAAGATCATATACCCATACATCGAGCT"],"kw2": ["GGTCAGGTCAGTACGGTACGATCGATTTCGA", "TCGAAATCGATCGTACCGTACTGACCTGACC"]}
#Compile regex (4 mismatches max)
compd = {"kw1": [], "kw2": []} #to store regex
for k, v in d.items(): #for each keyword
    compd[k].append(regex.compile(r'(?b)(' + v[0] + '){s<=4}')) #compile 1st elt of list
    compd[k].append(regex.compile(r'(?b)(' + v[1] + '){s<=4}')) #compile second
#Search keywords
with open(args.file) as f: #open file with strings
    line = f.readline() #first line/string
    while line: #go through each line
        for k, v in compd.items(): #for each keyword (ID, regex)
            for val in [v[0], v[1]]: #for each elt of list
                found = val.search(line) #regex search
                if found != None: #if match
                    print("Keyword " + k + " found as " + found[0]) #print match
                    if val == v[0]: #if 1st elt of list
                        break #don't search 2nd
        line = f.readline() #next line
I have tested the script using the testFile:
AGCTCGATGTATGGGTATATGATCTTGACAGAGAGA
GTCGTAGCTCGTATTCGATGGCTATTCGCTATATGCTAGCTAT
and get the following expected result:
Keyword kw1 found as AGCTCGATGTATGGGTATATGATCTTGAC
Efficiency
With the current script, it takes about 3-4 minutes to process 500k strings and six keywords. There will be cases where I have 2 million strings, which should take 12-16 minutes, and I would like to reduce this if possible.
Having a separate regex for each keyword requires running a match against each regex separately. Instead, combine all the regexes into one using the keywords as names for named groups:
patterns = []
for k, v in d.items(): #for each keyword
    patterns.append(f'(?P<{k}>{v[0]}|{v[1]})')
pattern = '(?b)(?:' + '|'.join(patterns) + '){s<=4}'
reSeqs = regex.compile(pattern)
With this, the program can check for which named group was matched in order to get the keyword. You can replace the loop over all the regexes in compd with loops over matches in a line (in case there is more than 1 match) and dictionary items in each match (which could be implemented as a comprehension):
for matched in reSeqs.finditer(line):
    try:
        keyword = [kw for kw, val in matched.groupdict().items() if val][0]
        # perform further processing of keyword
    except IndexError: # no named group matched
        pass
(Note that you don't need to call readline on a file object to loop over lines; instead, you can loop over the file object directly: for line in f:.)
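For illustration, here is one way the pieces above could fit together, looping over the file directly (a sketch only; reSeqs is the combined regex compiled above, and args.file comes from the question's argparse setup):

with open(args.file) as f:
    for line in f:
        for matched in reSeqs.finditer(line):
            # exactly one named group is non-None: the keyword that matched
            keyword = next(kw for kw, val in matched.groupdict().items() if val)
            print("Keyword " + keyword + " found as " + matched[0])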
If you need further optimizations, have memory to burn and can sacrifice a little readability, also test whether replacing the loop over lines with a comprehension over matches is more performant:
with open(args.file) as f:
    contents = f.read() # slurp entire file

matches = [{
    val: kw for kw, val in found.groupdict().items() if val
    } for found in reSeqs.finditer(contents)
]
This solution doesn't distinguish between repetitions of a given sequence; in particular, repetitions on a single line are lost. You could merge entries having the same keys into lists, or, if repetitions should be treated as a single instance, you can merge the dictionaries as-is. If you want to distinguish separate instances of a matched sequence, include file position information in the keys:
matches = [{
    (val, found.span()): kw for kw, val in found.groupdict().items() if val
    } for found in reSeqs.finditer(contents)
]
To merge:
results = {}
for match in matches:
    results.update(match)
# or:
results = {k:v for d in matches for k,v in d.items()}
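If you instead want the list-merging behaviour mentioned earlier (keeping every repetition of a sequence under its keyword), a minimal sketch, assuming matches was built with one of the comprehensions above, could be:

from collections import defaultdict

merged = defaultdict(list)
for match in matches:
    for key, kw in match.items():
        merged[kw].append(key) # keyword -> list of matched sequences (or (sequence, span) pairs)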
If memory is an issue, another option would be to break up the file into chunks ending on line breaks (either line-based, or by reading blocks and separating partial lines at block ends) and use finditer on each chunk:
# implementation of `chunks` left as exercise
def file_chunks(path, size=2**12):
    with open(path) as file:
        yield from chunks(file, size=size)

results = {}
for block in file_chunks(args.file, size=2**20):
    for found in reSeqs.finditer(block):
        results.update({
            (val, found.span()): kw for kw, val in found.groupdict().items() if val
        })
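The chunks helper is left as an exercise above; a minimal line-based sketch (one possible implementation, not the only one) accumulates whole lines until roughly size characters, so a record is never cut mid-line:

def chunks(file, size=2**12):
    buf, buffered = [], 0
    for line in file: # accumulate whole lines
        buf.append(line)
        buffered += len(line)
        if buffered >= size: # chunk is big enough, hand it out
            yield ''.join(buf)
            buf, buffered = [], 0
    if buf: # whatever is left over
        yield ''.join(buf)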
Related
I'm very new to Python and I'm stuck on a task. First I read a file containing a number of FASTA sequences with sequence names into a dictionary, then managed to select only those I want, based on substrings included in the keys, which are defined in the list "flu_genes".
Now I'm trying to reorder the items in this dictionary based on the order of substrings defined in the list "flu_genes". I'm completely stuck; I found a way of reordering based on the key order in a list BUT it is not my case, as the order is defined not by the keys but by a substring within the keys.
I should also add that in this case the substring is at the end with the format "_GENE"; however, it could be in the middle of the string with the same format, therefore I'd rather not rely on code that only finds the substring at the end of the string.
I hope this is clear enough and thanks in advance for any help!
"full_genome.fasta"
>A/influenza/1/1_NA
atgcg
>A/influenza/1/1_NP
ctgat
>A/influenza/1/1_FluB
agcta
>A/influenza/1/1_HA
tgcat
>A/influenza/1/1_FluC
agagt
>A/influenza/1/1_M
tatag
consensus = {}
flu_genes = ['_HA', '_NP', '_NA', '_M']
with open("full_genome.fasta", 'r') as myseq:
for line in myseq:
line = line.rstrip()
if line.startswith('>'):
key = line[1:]
else:
if key in consensus:
consensus[key] += line
else:
consensus[key] = line
flu_fas = {key : val for key, val in consensus.items() if any(ele in key for ele in flu_genes)}
print("Dictionary after removal of keys : " + str(flu_fas))
>>>Dictionary after removal of keys : {'>A/influenza/1/1_NA': 'atgcg', '>A/influenza/1/1_NP': 'ctgat', '>A/influenza/1/1_HA': 'tgcat', '>A/influenza/1/1_M': 'tatag'}
#reordering by keys order (not going to work!) as in: https://try2explore.com/questions/12586065
reordered_dict = {k: flu_fas[k] for k in flu_genes}
A dictionary is fundamentally unsorted, but since Python 3.7 it is guaranteed to remember its insertion order (it was an implementation detail in CPython 3.6), and you're not going to change anything later, so you can do what you're doing.
The problem is, of course, that you're not working with the actual keys. So let's just set up a list of the keys, and sort that according to your criteria. Then you can do the other thing you did, except using the actual keys.
flu_genes = ['_HA', '_NP', '_NA', '_M']
def get_gene_index(k):
    for index, gene in enumerate(flu_genes):
        if k.endswith(gene):
            return index
    raise ValueError('I thought you removed those already')

reordered_keys = sorted(flu_fas.keys(), key=get_gene_index)
reordered_dict = {k: flu_fas[k] for k in reordered_keys}

for k, v in reordered_dict.items():
    print(k, v)
A/influenza/1/1_HA tgcat
A/influenza/1/1_NP ctgat
A/influenza/1/1_NA atgcg
A/influenza/1/1_M tatag
Normally, I wouldn't use a key function that does a linear scan for each element, but I'm assuming the number of lines in the data file is much larger than the number of flu_genes, making that scan essentially a fixed constant.
This may or may not be the best data structure for your application, but I'll leave that to code review.
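Since the question mentions the gene tag may not always sit at the end of the key, a variant of the key function that tests membership instead of endswith would also work; this is a sketch, assuming each key contains exactly one of the tags in flu_genes:

def get_gene_index_anywhere(k):
    # like get_gene_index, but accepts the tag anywhere in the key
    for index, gene in enumerate(flu_genes):
        if gene in k:
            return index
    raise ValueError('no known gene tag in ' + repr(k))

reordered_dict = {k: flu_fas[k] for k in sorted(flu_fas, key=get_gene_index_anywhere)}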
It's because you are trying to reorder it with non-existent dictionary keys. Your keys are
['>A/influenza/1/1_NA', '>A/influenza/1/1_NP', '>A/influenza/1/1_HA', '>A/influenza/1/1_M']
which doesn't match the list
['_HA', '_NP', '_NA', '_M']
You first need to transform them to make them match, and since we know the pattern (it's at the end of the string, starting with an underscore), we can split at underscores and take the last piece.
consensus = {}
flu_genes = ['_HA', '_NP', '_NA', '_M']
with open("full_genome.fasta", 'r') as myseq:
for line in myseq:
line = line.rstrip()
if line.startswith('>'):
sequence = line
gene = line.split('_')[-1]
key = f"_{gene}"
else:
consensus[key] = {
'sequence': sequence,
'data': line
}
flu_fas = {key : val for key, val in consensus.items() if any(ele in key for ele in flu_genes)}
print("Dictionary after removal of keys : " + str(flu_fas))
reordered_dict = {k: flu_fas[k] for k in flu_genes}
dict = {}
tag = ""
with open('/storage/emulated/0/Download/sequence.fasta.txt','r') as sequence:
    seq = sequence.readlines()
    for line in seq:
        if line.startswith(">"):
            tag = line.replace("\n", "")
        else:
            seq = "".join(seq[1:])
            dict[tag] = seq.replace("\n", "")
print(dict)
Background for those who aren't familiar with FASTA files: this format contains one or multiple DNA, RNA, or protein sequences, each with a one-line descriptive tag that starts with a ">" followed by the sequence on the following lines (e.g. for DNA it would be a lot of repeating A, T, G, and C). It also comes with many unnecessary line breaks. So far this code works when I only have one sequence per file, but it seems to ignore the if condition if there are multiple. For example, it should add each new tag: sequence pair into the dictionary every time it notices a ">", but instead it only runs once, puts the first description as the key in the dictionary, and joins the rest of the file regardless of ">" characters and uses that as the value. How can I get this loop to notice a new ">" after the first occurrence?
I am purposefully steering away from the biopython module.
UPDATE: the code below now works for multiple-line sequences.
The following code works fine for me:
import re
from collections import defaultdict
sequences = defaultdict(str)
with open('fasta.txt') as f:
    lines = f.readlines()

current_tag = None
for line in lines:
    m = re.match('^>(.+)', line)
    if m:
        current_tag = m.group(1)
    else:
        sequences[current_tag] += line.strip()

for k, v in sequences.items():
    print(f"{k}: {v}")
It uses a number of features you may be unfamiliar with, such as regular expressions (which are probably very useful in bioinformatics) and f-string formatting. If anything confuses you, ask away. One thing I should add is that you don't want to name a variable dict, because that will shadow the built-in dict type that Python defines at startup. I chose sequences, which doesn't do this and is more informative.
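As a tiny, hypothetical illustration of that shadowing point:

dict = {} # shadows the built-in dict type
# dict(a=1) # would now raise TypeError: 'dict' object is not callable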
For reference, this is the content of the example FASTA file fasta.txt I used in this instance:
>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME
LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
>seq4
EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL
>seq5
SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR
>seq6
FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLRSI
>seq7
SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF
>seq8
SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq9
KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK
>seq10
FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK
I am trying to match a token multiple times, but I only get back the last occurrence, which I understand is the normal behavior as per this answer, but I haven't been able to get the solution presented there in my example.
My text looks something like this:
&{dict1_name}= key1=key1value key2=key2value
&{dict2_name}= key1=key1value
So basically multiple lines, each with a starting string, spaces, then a variable number of key pairs. If you are wondering where this comes from, it is a robot framework variables file that I am trying to transform into a python variables file.
I will be iterating per line to match the key pairs and construct a python dictionary from them.
My current regex pattern is:
&{([^ ]+)}=[ ]{2,}(?:[ ]{2,}([^\s=]+)=([^\s=]+))+
This correctly gets me the dict name but the key pairs only match the last occurrence, as mentioned above. How can I get it to return a tuple containing: ("dict1_name","key1","key1value"..."keyn","keynvalue") so that I can then iterate over this and construct the python dictionary like so:
dict1_name= {"key1": "key1value",..."keyn": "keynvalue"}
Thanks!
As you point out, you will need to work around the fact that capture groups will only catch the last match. One way to do so is to take advantage of the fact that lines in a file are iterable, and to use two patterns: one for the "line name", and one for its multiple keyvalue pairs:*
import re
dname = re.compile(r'^&{(?P<name>\w+)}=')
keyval = re.compile(r'(?P<key>\w+)=(?P<val>\w+)')
data = {}
with open('input/keyvals.txt') as f:
    for line in f:
        name = dname.search(line)
        if name:
            name = name.group('name')
            data[name] = dict(keyval.findall(line))
*Admittedly, this is a tad inefficient since you're conducting two searches per line. But for moderately sized files, you should be fine.
Result:
>>> from pprint import pprint
>>> pprint(data)
{'d5': {'key1': '28f_s', 'key2': 'key2value'},
'name1': {'key1': '5', 'key2': 'x'},
'othername2': {'key1': 'key1value', 'key2': '7'}}
Note that \w matches Unicode word characters.
Sample input, keyvals.txt:
&{name1}= key1=5 key2=x
&{othername2}= key1=key1value key2=7
&{d5}= key1=28f_s key2=aaa key2=key2value
You could use two regexes one for the names and other for the items, applying the one for the items after the first space:
import re
lines = ['&{dict1_name}= key1=key1value key2=key2value',
         '&{dict2_name}= key1=key1value']

name = re.compile(r'^&\{(\w+)\}=')
item = re.compile(r'(\w+)=(\w+)')

for line in lines:
    n = name.search(line).group(1)
    i = '{{{}}}'.format(','.join("'{}' : '{}'".format(m.group(1), m.group(2)) for m in item.finditer(' '.join(line.split()[1:]))))
    exec('{} = {}'.format(n, i))
    print(locals()[n])
Output
{'key2': 'key2value', 'key1': 'key1value'}
{'key1': 'key1value'}
Explanation
The '^&\{(\w+)\}=' matches an '&' followed by a word (\w+) surrounded by curly braces '\{', '\}'. The second regex matches any words joined by a '='. The line:
i = '{{{}}}'.format(','.join("'{}' : '{}'".format(m.group(1), m.group(2)) for m in item.finditer(' '.join(line.split()[1:]))))
creates a dictionary literal; finally, you create a dictionary with the required name using exec. You can access the value of the dictionary by querying locals().
Use two expressions in combination with a dict comprehension:
import re
junkystring = """
lorem ipsum
&{dict1_name}= key1=key1value key2=key2value
&{dict2_name}= key1=key1value
lorem ipsum
"""
rx_outer = re.compile(r'^&{(?P<dict_name>[^{}]+)}(?P<values>.+)', re.M)
rx_inner = re.compile(r'(?P<key>\w+)=(?P<value>\w+)')
result = {m_outer.group('dict_name'): {m_inner.group('key'): m_inner.group('value')
                                       for m_inner in rx_inner.finditer(m_outer.group('values'))}
          for m_outer in rx_outer.finditer(junkystring)}
print(result)
Which produces
{'dict1_name': {'key1': 'key1value', 'key2': 'key2value'},
'dict2_name': {'key1': 'key1value'}}
With the two expressions being
^&{(?P<dict_name>[^{}]+)}(?P<values>.+)
# the outer format
See a demo on regex101.com. And the second
(?P<key>\w+)=(?P<value>\w+)
# the key/value pairs
See a demo for the latter on regex101.com as well.
The rest is simply sorting the different expressions in the dict comprehension.
Building off of Brad's answer, I made some modifications. As mentioned in my comment on his reply, it failed at empty lines or comment lines. I modified it to ignore these and continue. I also added handling of spaces: it now matches spaces in dictionary names but replaces them with underscore since python cannot have spaces in variable names. Keys are left untouched since they are strings.
import re
def robot_to_python(filename):
"""
This function can be used to convert robot variable files containing dicts to a python
variables file containing python dict that can be imported by both python and robot.
"""
dname = re.compile(r"^&{(?P<name>.+)}=")
keyval = re.compile(r"(?P<key>[\w|:]+)=(?P<val>[\w|:]+)")
data = {}
with open(filename + '.robot') as f:
for line in f:
n = dname.search(line)
if n:
name = dname.search(line).group("name").replace(" ", "_")
if name:
data[name] = dict(keyval.findall(line))
with open(filename + '.py', 'w') as file:
for dictionary in data.items():
dict_name = dictionary[0]
file.write(dict_name + " = { \n")
keyvals = dictionary[1]
for k in sorted(keyvals.keys()):
file.write("'%s':'%s', \n" % (k, keyvals[k]))
file.write("}\n\n")
file.close()
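If it helps, a hypothetical call would look like this (the base name "myvars" is just an example); it reads myvars.robot and writes myvars.py next to it:

robot_to_python("myvars") # assumes myvars.robot exists in the working directory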
Really been struggling with this one for some time now. I have many text files with a specific format from which I need to extract all the data and file it into different fields of a database. The struggle is tweaking the parameters for parsing, ensuring I get all the info correctly.
The format is shown below:
WHITESPACE HERE of unknown length.
K PA DETAILS
2 4565434 i need this sentace as one DB record
2 4456788 and this one
5 4879870 as well as this one, content will vary!
X Max - there sometimes is a line beginning with 'Max' here which i don't need
There is a Line here that i do not need!
WHITESPACE HERE of unknown length.
The tough parts were 1) getting rid of whitespace, and 2) separating the fields from each other; see my best attempt below:
dict = {}
XX = (open("XX.txt", "r")).readlines()
for line in XX:
    if line.isspace():
        pass
    elif line.startswith('There is'):
        pass
    elif line.startswith('Max', 2):
        pass
    elif line.startswith('K'):
        pass
    else:
        for word in line.split():
            if word.startswith('4'):
                tmp_PA = word
            elif word == "1" or word == "2" or word == "3" or word == "4" or word == "5":
                tmp_K = word
            else:
                tmp_DETAILS = word
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',(tmp_PA,tmp_K,tmp_DETAILS))
At the minute, I can pull the K & PA fields no problem using this; however, my DETAILS is only pulling one word, and I need the entire sentence, or at least 25 chars of it.
Thanks very much for reading and I hope you can help! :)
You are splitting the whole line into words. You need to split into first word, second word and the rest. Like line.split(None, 2).
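For example, on one of the sample record lines from the question, split(None, 2) yields exactly the three fields:

line = "2 4565434 i need this sentace as one DB record"
k, pa, details = line.split(None, 2)
# k == '2', pa == '4565434'
# details == 'i need this sentace as one DB record'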
I would probably use regular expressions, and use the opposite logic: that is, if the line starts with a number 1 through 5, use it; otherwise pass. Like:
import re

pattern = re.compile(r'([12345])\s+(\d+)\s+(.*\S)')

f = open('XX.txt', 'r') # No calling readlines; lazy iteration is better
for line in f:
    m = pattern.match(line)
    if m:
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   (m.group(2), m.group(1), m.group(3)))
Oh, and of course, you should be using prepared statements. Parsing SQL is orders of magnitude slower than executing it.
If I understand your file format correctly, you can try this script:
filename = 'bug.txt'
f = open(filename, 'r')

foundHeaders = False
records = []

for rawline in f:
    line = rawline.strip()
    if not foundHeaders:
        tokens = line.split()
        if tokens == ['K','PA','DETAILS']:
            foundHeaders = True
        continue
    else:
        tokens = line.split(None, 2)
        if len(tokens) != 3:
            break
        try:
            K = int(tokens[0])
            PA = int(tokens[1])
        except ValueError:
            break
        records.append((K, PA, tokens[2]))

f.close()

for r in records:
    print(r) # replace this by your DB insertion code
This will start reading the records when it encounters the header line, and stop as soon as the format of the line is no longer (K,PA,description).
Hope this helps.
Here is my attempt using re
import re
stuff = open("source", "r").readlines()
whitey = re.compile(r"^[\s]+$")
header = re.compile(r"K PA DETAILS")
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    if whitey.match(line):
        pass
    elif header.match(line):
        pass
    elif juicy_info.match(line):
        result = juicy_info.search(line)
        print(result.group('third'))
        print(result.group('second'))
        print(result.group('first'))
Using re I can pull the data out and manipulate it on a whim. If you only need the juicy info lines, you can actually take out all the other checks, making this a REALLY concise script.
import re
stuff = open("source", "r").readlines()
#create a regular expression using subpatterns.
#'first', 'second' and 'third' are our own tags,
# we could call them Adam, Betty, etc.
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    result = juicy_info.search(line)
    if result: #do stuff with data here, just use the tag we declared earlier.
        print(result.group('third'))
        print(result.group('second'))
        print(result.group('first'))
import re
reg = re.compile('K[ \t]+PA[ \t]+DETAILS[ \t]*\r?\n'
                 + 3*'([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*\r?\n')

with open('XX.txt') as f:
    mat = reg.search(f.read())

for tripl in ((2,1,3),(5,4,6),(8,7,9)):
    cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
               mat.group(*tripl))
I prefer to use [ \t] instead of \s because \s matches the following characters:
' ', '\f', '\n', '\r', '\t', '\v'
and I don't see any reason to use a symbol representing more than what is to be matched, with the risk of matching erratic newlines at places where they shouldn't be.
Edit
It may be sufficient to do:
import re
reg = re.compile(r'^([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*$',re.MULTILINE)
with open('XX.txt') as f:
    for mat in reg.finditer(f.read()):
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   mat.group(2,1,3))
The text file contains two columns: an index number (5 characters wide) and a keyword (30 characters wide).
It is arranged in lexicographic order. I want to perform a binary search for the keyword.
Here's an interesting way to do it with Python's built-in bisect module.
import bisect
import os
class Query(object):
    def __init__(self, query, index=5):
        self.query = query
        self.index = index

    def __lt__(self, comparable):
        return self.query < comparable[self.index:]

class FileSearcher(object):
    def __init__(self, file_pointer, record_size=35):
        self.file_pointer = file_pointer
        self.file_pointer.seek(0, os.SEEK_END)
        self.record_size = record_size + len(os.linesep)
        self.num_bytes = self.file_pointer.tell()
        self.file_size = (self.num_bytes // self.record_size)

    def __len__(self):
        return self.file_size

    def __getitem__(self, item):
        self.file_pointer.seek(item * self.record_size)
        return self.file_pointer.read(self.record_size)

if __name__ == '__main__':
    with open('data.dat') as file_to_search:
        query = input('Query: ')
        wrapped_query = Query(query)
        searchable_file = FileSearcher(file_to_search)
        print("Located # line: ", bisect.bisect(searchable_file, wrapped_query))
Do you need to do a binary search? If not, try converting your flatfile into a cdb (constant database). This will give you very speedy hash lookups to find the index for a given word:
import cdb
# convert the corpus file to a constant database one time
db = cdb.cdbmake('corpus.db', 'corpus.db_temp')
for line in open('largecorpus.txt', 'r'):
    index, word = line.split()
    db.add(word, index)
db.finish()
In a separate script, run queries against it:
import cdb
db = cdb.init('corpus.db')
db.get('chaos')
12345
If you need to find a single keyword in a file:
line_with_keyword = next((line for line in open('file') if keyword in line), None)
if line_with_keyword is not None:
    print(line_with_keyword) # found
To find multiple keywords you could use set() as @kriegar suggested:
def extract_keyword(line):
    return line[5:35] # assuming the keyword starts at position 6 and has length 30

with open('file') as f:
    keywords = set(extract_keyword(line) for line in f) # O(n) creation

if keyword in keywords: # O(1) search
    print(keyword)
You could use dict() above instead of set() to preserve index information.
Here's how you could do a binary search on a text file:
import bisect
lines = open('file').readlines() # O(n) list creation
keywords = list(map(extract_keyword, lines))
i = bisect.bisect_left(keywords, keyword) # O(log(n)) search
if keyword == keywords[i]:
    print(lines[i]) # found
There is no advantage compared to the set() variant.
Note: all variants except the first one load the whole file into memory. The FileSearcher() suggested by @Mahmoud Abdelkader doesn't require loading the whole file into memory.
I wrote a simple Python 3.6+ package that can do this. (See its github page for more information!)
Installation: pip install binary_file_search
Example file:
1,one
2,two_a
2,two_b
3,three
Usage:
from binary_file_search.BinaryFileSearch import BinaryFileSearch
with BinaryFileSearch('example.file', sep=',', string_mode=False) as bfs:
    # assert bfs.is_file_sorted() # test if the file is sorted.
    print(bfs.search(2))
Result: [[2, 'two_a'], [2, 'two_b']]
It is quite possible, with a slight loss of efficiency, to perform a binary search on a sorted text file with records of unknown length, by repeatedly bisecting the range and reading forward past the line terminator. Here's what I do to look through a csv file with 2 header lines for a numeric value in the first field. Give it an open file and the first field to look for. It should be fairly easy to modify this for your problem. A match on the very first line at offset zero will fail, so this may need to be special-cased. In my circumstance, the first 2 lines are headers and are skipped.
Please excuse my lack of polished python below. I use this function, and a similar one, to perform GeoCity Lite latitude and longitude calculations directly from the CSV files distributed by Maxmind.
Hope this helps
========================================
import os # needed for os.fstat

# See if the input loc is in file
def look1(f, loc):
    # Compute filesize of open file sent to us
    hi = os.fstat(f.fileno()).st_size
    lo = 0
    lookfor = int(loc)
    # print "looking for: ",lookfor
    while hi - lo > 1:
        # Find midpoint and seek to it
        loc = int((hi + lo) / 2)
        # print " hi = ",hi," lo = ",lo
        # print "seek to: ",loc
        f.seek(loc)
        # Skip to beginning of line
        while f.read(1) != '\n':
            pass
        # Now skip past lines that are headers
        while 1:
            # read line
            line = f.readline()
            # print "read_line: ", line
            # Crude csv parsing, remove quotes, and split on ,
            row = line.replace('"', "")
            row = row.split(',')
            # Make sure 1st field is numeric
            if row[0].isdigit():
                break
        s = int(row[0])
        if lookfor < s:
            # Split into lower half
            hi = loc
            continue
        if lookfor > s:
            # Split into higher half
            lo = loc
            continue
        return row # Found
    # If not found
    return False
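A hypothetical call (the file name geo.csv is just an example) looking for the record whose first field is 12345:

with open('geo.csv') as f:
    row = look1(f, 12345)
    if row:
        print(row) # the matching record, already split on commas
    else:
        print("not found")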
Consider using a set instead of a binary search for finding a keyword in your file.
Set:
O(n) to create, O(1) to find, O(1) to insert/delete
If your input file is separated by a space then:
f = open('file')
keywords = set( (line.strip().split(" ")[1] for line in f.readlines()) )
f.close()
my_word in keywords
<returns True or False>
Dictionary:
f = open('file')
keywords = dict( [ (pair[1],pair[0]) for pair in [line.strip().split(" ") for line in f.readlines()] ] )
f.close()
keywords[my_word]
<returns index of my_word>
Binary Search is:
O(n log n) create, O(log n) lookup
edit: for your case of 5 characters and 30 characters you can just use string slicing
f = open('file')
keywords = set( (line[5:-1] for line in f.readlines()) )
f.close()
my_word in keywords
or
f = open('file')
keywords = dict( [(line[5:-1],line[:5]) for line in f.readlines()] )
f.close()
keywords[my_word]