Fastest way to grep multiple values from file in python - python

I have a file of 300m lines (inputFile), all with 2 columns separated by a tab.
I also have a list of 1000 unique items (vals).
I want to create a dictionary with column 1 as key and column 2 as value for all lines in inputFile where the first column occurs in vals. A few items in vals do not occur in the file; these values have to be saved in a new list. I can use up to 20 threads to speed up this process.
What is the fastest way to achieve this?
My best try till now:
import subprocess

newDict = {}
foundVals = []
cmd = "grep \"" + vals[0]
for val in vals:
    cmd = cmd + "\|^" + val + "[[:space:]]"
cmd = cmd + "\" " + self.inputFile
p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
for line in iter(p.stdout.readline, ''):
    info = line.split()
    foundVals.append(info[0])
    newDict.update({info[0]: info[1]})
p.wait()
notFound = [x for x in vals if x not in set(foundVals)]
Example
inputFile:
2 9913
3 9913
4 9646
...
594592886 32630
594592888 32630
594592890 32630
vals:
[1,2,594592888]
wanted dictionary:
{2:9913,594592888:32630}
And in notFound:
[1]

You clarified in a comment that each key occurs at most once in the data. It follows from that and the fact that there are only 1000 keys that the amount of work being done in Python is trivial; almost all your time is spent waiting for output from grep. Which is fine; your strategy of delegating line extraction to a specialized utility remains sound. But it means that performance gains have to be found on the line-extraction side.
You can speed things up some by optimizing your regex. E.g., instead of
^266[[:space:]]\|^801[[:space:]]\|^810[[:space:]]
you could use:
^\(266\|801\|810\)[[:space:]]
so that the anchor doesn't have to be separately matched for each alternative. I see about a 15% improvement on test data (10 million rows, 25 keys) with that change.
A further optimization is to unify common prefixes in the alternation: 266\|801\|810 can be replaced with the equivalent 266\|8\(01\|10\). Rewriting the 25-key regex in this way gives close to a 50% speedup on test data.
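For reference, here is a minimal Python sketch of how such a grouped pattern could be built from vals and handed to grep; the variable values are illustrative, and the prefix-merging step is left out for brevity:

import subprocess

# Illustrative inputs: numeric keys as strings, as in the example data.
vals = ["266", "801", "810"]
input_file = "inputFile"

# One anchor for the whole grouped alternation instead of one per key.
pattern = r"^\(" + r"\|".join(vals) + r"\)[[:space:]]"

# Passing the arguments as a list avoids the shell quoting of the original command.
p = subprocess.Popen(["grep", pattern, input_file], stdout=subprocess.PIPE)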
At this point grep is starting to show its limits. It seems that it's CPU-bound: iostat reveals that each successive improvement in the regex increases the number of IO requests per second while grep is running. And re-running grep with a warmed page cache and the --mmap option doesn't speed things up (as it likely would if file IO were a bottleneck). Greater speed therefore probably requires a utility with a faster regex engine.
One such tool is ag (source here), whose regex implementation also performs automatic optimization, so you needn't do much hand-tuning. While I haven't been able to get grep to process the test data in less than ~12s on my machine, ag finishes in ~0.5s for all of the regex variants described above.

If I understand you correctly, you don't want any file row whose first column doesn't match your vals.
Since you're talking about huge files and quite smaller number of wanted values, I would go for something like:
vals_set = set(vals)
found_vals = {}
with open(inputfile, "r") as in_file:
    for line in in_file:
        line = line.split()  # Assuming tabs or whitespaces
        if line[0] in vals_set:
            found_vals[line[0]] = line[1]
not_found_vals = vals_set.difference(found_vals)
It will be memory conservative, and you'll have your dict in found_vals and your list in not_found_vals. In fact, memory usage (AFAIK) will depend only on the number of vals you want to search for, not on the size of the file.
EDIT:
I think the easiest way to parallelize this task is just to split the file and search each piece separately in a different process. That way you don't need to deal with communication between threads (easier and faster, I think).
A good way to do it, since I deduce you're using bash (you used grep :P ), is what is mentioned in this answer:
split -l 1000000 filename
will generate files with 1000000 lines each.
You could easily modify your script to save its matches into a new file for each process, and then merge the different output files.
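If you would rather stay in Python than merge output files by hand, here is a rough multiprocessing sketch of the same idea; the chunk file names are the hypothetical output of split, and vals is a made-up example:

from multiprocessing import Pool

vals = ["1", "2", "594592888"]   # hypothetical keys to look for
vals_set = set(vals)

def scan_chunk(path):
    """Scan one chunk produced by `split` and return the matches it contains."""
    found = {}
    with open(path) as chunk:
        for line in chunk:
            fields = line.split()
            if fields and fields[0] in vals_set:
                found[fields[0]] = fields[1]
    return found

if __name__ == "__main__":
    # Hypothetical chunk names as produced by `split -l 1000000 filename`.
    chunk_paths = ["xaa", "xab", "xac"]
    with Pool(processes=20) as pool:
        partial_results = pool.map(scan_chunk, chunk_paths)
    found_vals = {}
    for part in partial_results:
        found_vals.update(part)
    not_found_vals = vals_set.difference(found_vals)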

This is not terribly memory efficient (for a file of 300 million lines, that may be a problem). I can't think of a way to save the not-found values within a comprehension except by saving all the values (or reading the file twice). I don't think threads will help much, since file I/O is likely to be the performance bottleneck.
I'm assuming that a tab is the delimiting character in the file. (You didn't say, but the example data looks to have a tab.)
vals = [1, 2, 594592888]
with open(self.inputfile, 'r') as i_file:
    all_vals = {
        int(t[0]): int(t[1])
        for t in (
            line.strip().split('\t')
            for line in i_file
        )
    }
newDict = {
    t[0]: t[1] for t in filter(lambda t: t[0] in vals, all_vals.items())
}
notFound = [v for v in vals if v not in newDict]

Related

Get unique values of every column from a gz file

I have a gz file, and I want to extract the unique values from each column of the file; the field separator is |. I tried using Python as below.
import sys,os,csv,gzip
from sets import Set

ig = 0
max_d = 1
with gzip.open("fundamentals.20170724.gz","rb") as f:
    reader = csv.reader(f,delimiter="|")
    for i in range(0,400):
        unique = Set()
        print "Unique_value for column "+str(i+1)
        flag = 0
        for line in reader:
            try:
                unique.add(line[i])
                max_d +=1
                if len(unique) >= 10:
                    print unique
                    flag = 1
                    break
            except:
                continue
        if flag == 0: print unique
I don't find this efficient for large files. Although it works after a fashion, I'm looking at the problem from a bash point of view.
Any shell script solution?
For example, I have the data in my file as
5C4423,COMP,ISIN,CA2372051094,2016-04-19,
41C528,COMP,ISIN,US2333774071,2000-01-01,
B62545,COMP,ISIN,NL0000344265,2000-01-01,2007-05-11
9E7F41,COMP,ISIN,CA39260W1023,2013-02-13,2013-08-09
129DC8,COMP,ISIN,US37253A1034,2012-09-07,
4DE8CD,COMP,ISIN,QA000A0NCQB1,2008-03-06,
and I want all unique values from each column.
With the gunzipped file, you could do:
awk -F, 'END { for (i=1;i<=NF;i++) { print "cut -d\",\" -f "i" filename | uniq" } }' filename | sh
Set the field separator to a comma, then, for each field in the file, construct a cut command piping through uniq, and finally pipe the whole awk output through sh. The use of cut, uniq and sh will slow things down and there is probably a more efficient way, but it's worth a go.
A shell-built pipeline could indeed do this job faster, though likely less memory efficiently. The primary reasons are two: parallelism and native code.
First, since we have little description of the task, I'll have to read the Python code and figure out what it does.
from sets import Set is an odd line; sets are part of the standard library, and I don't know what your sets module contains. I'll have to guess it's at best another name for the standard set type, or at least a less efficient variant of the same concept.
gzip.open lets the script read a gzipped file. We can replace this with a zcat process.
csv.reader reads character-separated values, in this case splitting on '|'. Deeper inside the code we find only one column (line[i]) is read, so we could replace it with cut or awk ... until i changes. awk can handle that case too, but it's a little trickier.
The trickiest part is the end logic. Every time 10 unique values are found in a column, the program outputs those values and switches to the next column. By the way, Python's for has an else clause specifically for this case, so you don't need a flag variable.
One of the odder parts of the code is how you catch all exceptions from the inner data processing block. Why is this? There are basically only two sources of exceptions in there: Firstly, the indexing could fail if there aren't that many columns. Secondly, the unknown Set type could be throwing exceptions; the standard set type would not.
So, the analysis of your function is: in a diagonal manner (since the file is never rewound, and columns are not processed in parallel), collect unique values from each column until ten are found, and print them. This means, for instance, that if the first column had less than ten unique items nothing is ever printed for any other columns. I'm not sure this is the logic you intended.
With such complicated logic, Python's set functionality actually is a good choice; if we could partition the data more easily then uniq might have been better. What throws us off is how the program moves from column to column and only wants a specific number of values.
Thus, the two big time wasters in the Python program are decompressing in the same thread as we do all the parsing, and splitting into all columns when we only need one. The former can be addressed using a thread, and the latter is probably best done using a regular expression such as r'^(?:[^|]*\|){3}([^|]*)'. That expression would skip three columns and the fourth can be read as group 1. It gets more complicated if the CSV has quoting to contain the separator within some column. We could do the line parsing itself in a separate thread, but that wouldn't solve the issue of the many unneeded string allocations.
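A rough sketch of that combination (zcat in a separate process plus a single-column regex), assuming the fourth |-separated column is the one wanted and that no field contains a quoted separator:

import re
import subprocess

# Decompress in a separate process so gunzipping and parsing overlap.
zcat = subprocess.Popen(["zcat", "fundamentals.20170724.gz"],
                        stdout=subprocess.PIPE, universal_newlines=True)

# Skip three columns, capture the fourth.
fourth_col = re.compile(r'^(?:[^|]*\|){3}([^|]*)')

unique = set()
for line in zcat.stdout:
    m = fourth_col.match(line)
    if m:
        unique.add(m.group(1))
        if len(unique) >= 10:   # same early-exit rule as the original script
            break

zcat.terminate()
zcat.wait()
print(unique)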
Note that the problem actually becomes considerably different if what you really want is to process all columns from the start of the file. I also don't know why you specifically process 400 columns regardless of the amount that exist. If we remove those two constraints, the logic would be more like:
firstline = next(reader)
sets = [{column} for column in firstline]
for line in reader:
    for column, columnset in zip(line, sets):
        columnset.add(column)
This is a pure Python version based on your idea:
from io import StringIO
from csv import reader

txt = '''5C4423,COMP,ISIN,CA2372051094,2016-04-19,
41C528,COMP,ISIN,US2333774071,2000-01-01,
B62545,COMP,ISIN,NL0000344265,2000-01-01,2007-05-11
9E7F41,COMP,ISIN,CA39260W1023,2013-02-13,2013-08-09
129DC8,COMP,ISIN,US37253A1034,2012-09-07,
4DE8CD,COMP,ISIN,QA000A0NCQB1,2008-03-06,'''

with StringIO(txt) as file:
    rows = reader(file)
    first_row = next(rows)
    unique = [{item} for item in first_row]
    for row in rows:
        for item, s in zip(row, unique):
            s.add(item)
which yields for your input:
[{'129DC8', '41C528', '4DE8CD', '5C4423', '9E7F41', 'B62545'},
{'COMP'},
{'ISIN'},
{'CA2372051094',
'CA39260W1023',
'NL0000344265',
'QA000A0NCQB1',
'US2333774071',
'US37253A1034'},
{'2000-01-01', '2008-03-06', '2012-09-07', '2013-02-13', '2016-04-19'},
{'', '2007-05-11', '2013-08-09'}]
Oops, now that I have posted my answer I see that this is exactly what Yann Vernier proposes at the end of his answer. Please upvote that answer, which was here well before mine...
If you want to limit the number of unique values collected per column, you can cap the size of each set:
from io import StringIO
from csv import reader

MAX_LEN = 3

with StringIO(txt) as file:
    rows = reader(file)
    first_row = next(rows)
    unique = [{item} for item in first_row]
    for row in rows:
        for item, s in zip(row, unique):
            if len(s) < MAX_LEN:
                s.add(item)

print(unique)
with the result:
[{'41C528', '5C4423', 'B62545'},
{'COMP'},
{'ISIN'},
{'CA2372051094', 'NL0000344265', 'US2333774071'},
{'2000-01-01', '2013-02-13', '2016-04-19'},
{'', '2007-05-11', '2013-08-09'}]
This way you would save some memory if one of your columns holds only unique values.

python script for removing duplicates taking 24 hrs+ to loop through 10^7 records

input t1
P95P,71655,LINC-JP,pathogenic
P95P,71655,LINC-JP,pathogenic
P71P,71655,LINC-JP,pathogenic
P71P,71655,LINC-JP,pathogenic
output op
P95P,71655,LINC-JP,pathogenic
P71P,71655,LINC-JP,pathogenic
myCode
def dup():
    fi=open("op","a")
    l=[];final=""
    q=[];dic={}
    for i in open("t1"):
        k=i.split(",")
        q.append(k[1])
        q.append(k[0])
        if q in l:
            pass
        else:
            final= final + i.strip() + "\n"
            fi.write(str(i.strip()))
            fi.write("\n")
            l.append(q)
        q=[]
        #print i.strip()
    fi.close()
    return final.strip()

d=dup()
In the above input, lines 1, 2 and lines 3, 4 are duplicates, so the output has these duplicates removed. My input file has around 10^7 entries.
Why has my code been running for over 24 hours on a 76 MB input file? It has yet to complete even one pass over the input, though it works fine for small files.
Can anyone point out the reason for this long runtime and how I can optimize my program? Thanks.
You're using an O(n²) algorithm, which scales poorly for larger files:
for i in open("t1"): # linear pass of file takes O(n) time
...
if q in l: # linear pass of list l takes O(n) time
...
...
You should consider using a set (i.e. make l a set) or itertools.groupby if duplicates will always be next to each other. These approaches will be O(n).
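For example, here is a sketch of the groupby variant, assuming duplicates are always adjacent (as they are in your sample input) and keyed on the first two comma-separated fields:

from itertools import groupby

def dedup_key(line):
    # Duplicate detection uses the first two fields, like q in the original code.
    parts = line.split(",")
    return parts[0], parts[1]

with open("t1") as src, open("op", "w") as out:
    for _, group in groupby(src, key=dedup_key):
        out.write(next(group))   # keep only the first line of each run of duplicates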
If you have access to a Unix system, uniq is a nice utility made for your problem.
uniq input.txt output.txt
see https://www.cs.duke.edu/csl/docs/unix_course/intro-89.html
I know this is a Python question, but sometimes Python is not the tool for the task.
And you can always embed a system call in your python script.
It's not clear why you're building a huge string (final) that holds the same thing the file does, or what dic is for. In terms of performance, you can look up x in y much faster if y is a set than if y is a list. Also, a minor point: shorter variable names don't improve performance, so use descriptive ones instead. I would suggest:
def deduplicate(infile, outfile):
    seen = set()
    #final = []
    with open(outfile, "a") as out, open(infile) as in_:
        for line in in_:
            check = tuple(line.split(",")[:2])
            if check not in seen:
                #final.append(line.strip())
                out.write(line)  # why 'strip' the '\n' then 'write' a new one?
                seen.add(check)
    #return "\n".join(final)
If you do really need final, make it a list until the last moment (see commented-out lines) - gradual string concatenation means the creation of lots of unnecessary objects.
There are a couple of things you are doing very inefficiently. The largest is that you made l a list, so the line if q in l has to search through everything already in the list to check whether q matches. If you make l a set, the membership check becomes a hash calculation and an array lookup, which takes the same (small) amount of time no matter how much you add to the set (though it means l no longer remembers the order in which items were added).
Other little speedups you can make include:
Using a tuple (k[1], k[0]) instead of a list for q.
You are writing to your output file fi on every loop iteration. Your OS will try to batch and background the writes, but it may be faster to just do one big write at the end. I am not sure on this point, but try it.

Removing duplicates in a huge .csv file

I have a csv file of this format
testname unitname time data
test1 1 20131211220159 123123
test1 1 20131211220159 12345
test1 1 20131211230180 1234
I am trying to remove all old data from this file and retain only the data with the latest timestamp (the first two rows above should be deleted because the last timestamp is greater than the first two). I want to keep all test data unless the same test and same unit was repeated at a later time. The input file is sorted by time (so older data goes down below).
The file (output_temp.csv) is about 15 MB. I copied it as output_temp2.csv.
This is what I have:
file1=open("output_temp.csv","r")
file2=open("output_temp2.csv","r")
file3=open("output.csv","w")
flag=0
linecounter=0
for line in file1:
testname=line[0]
vid=line[1]
tstamp=line[2]
file2.seek(0) #reset
for i in range(linecounter):
file2.readline() #came down to the line #
for line2 in file2:
if testname==line2.split(",")[0] and vid==line2.split(",")[1] and tstamp!=line2.split(",")[2]:
flag==1
print line
if flag==1:
break
if flag==0:
file3.write(line)
linecounter=linecounter+1 #going down is ok dont go up.
flag=0
This is taking really long to process. I think the logic might be OK, but it's literally taking 10 minutes per 100 KB and I have a long way to go.
The main reason this is slow is that you're reading the entire file (or, rather, a duplicate copy of it) for each line in the file. So, if there are 10000 lines, you're reading 10000 lines 10000 times, meaning 100,000,000 total line reads!
If you have enough memory to save the lines read so far, there's a really easy solution: Store the lines seen so far in a set. (Or, rather, for each line, store the tuple of the three keys that count for being a duplicate.) For each line, if it's already in the set, skip it; otherwise, process it and add it to the set.
For example:
seen = set()
for line in infile:
    testname, vid, tstamp = line.split(",", 3)[:3]
    if (testname, vid, tstamp) in seen:
        continue
    seen.add((testname, vid, tstamp))
    outfile.write(line)
The itertools recipes in the docs have a function unique_everseen that lets you wrap this up even more nicely:
def keyfunc(line):
    return tuple(line.split(",", 3)[:3])

for line in unique_everseen(infile, key=keyfunc):
    outfile.write(line)
If the set takes too much memory, you can always fake a set on top of a dict, and you can fake a dict on top of a database by using the dbm module, which will do a pretty good job of keeping enough in memory to make things fast but not enough to cause a problem. The only problem is that dbm keys have to be strings, not tuples of three strings… but you can always just keep them joined up (or re-join them) instead of splitting, and then you've got a string.
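A rough sketch of that dbm variant, assuming Python 3's dbm module and the same three-field key (file names are illustrative):

import dbm

# dbm keys must be strings/bytes, so the three key fields are re-joined into one string.
with dbm.open("seen_keys", "c") as seen, \
        open("output_temp.csv") as infile, \
        open("output.csv", "w") as outfile:
    for line in infile:
        key = ",".join(line.split(",", 3)[:3])
        if key in seen:
            continue
        seen[key] = ""   # only membership matters; the value is unused
        outfile.write(line)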
I'm assuming that when you say the file is "sorted", you mean in terms of the timestamp, not in terms of the key columns. That is, there's no guarantee that two rows that are duplicates will be right next to each other. If there were, this is even easier. It may not seem easier if you use the itertools recipes; you're just replacing everseen with justseen:
def keyfunc(line):
    return tuple(line.split(",", 3)[:3])

for line in unique_justseen(infile, key=keyfunc):
    outfile.write(line)
But under the covers, this is only keeping track of the last line, rather than a set of all lines. Which is not only faster, it also saves a lot of memory.
Now that (I think) I understand your requirements better, what you actually want to get rid of is not all but the first line with the same testname, vid, and tstamp, but rather all lines with the same testname and vid except the one with the highest tstamp. And since the file is sorted in ascending order of tstamp, that means you can ignore the tstamp entirely; you just want the last match for each.
This means the everseen trick won't work—we can't skip the first one, because we don't yet know there's a later one.
If we just iterated the file backward, that would solve the problem. It would also double your memory usage (because, in addition to the set, you're also keeping a list so you can stack up all of those lines in reverse order). But if that's acceptable, it's easy:
def keyfunc(line):
    return tuple(line.split(",", 2)[:2])

for line in reversed(list(unique_everseen(reversed(list(infile)), key=keyfunc))):
    outfile.write(line)
If turning those lazy iterators into lists so we can reverse them takes too much memory, it's probably fastest to do multiple passes: reverse the file on disk, then filter the reversed file, then reverse it again. It does mean two extra file writes, but that can be a lot better than, say, your OS's virtual memory swapping to and from disk hundreds of times (or your program just failing with a MemoryError).
If you're willing to do the work, it wouldn't be that hard to write a reverse file iterator, which reads buffers from the end and splits on newlines and yields the same way the file/io.Whatever object does. But I wouldn't bother unless you turn out to need it.
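In case you do, here is a minimal sketch of one (buffered reads from the end of the file; treat it as a starting point rather than a tested helper):

import os

def reverse_lines(path, buf_size=1 << 20):
    """Yield the lines of a text file in reverse order by reading blocks from the end."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        tail = b""
        while position > 0:
            read_size = min(buf_size, position)
            position -= read_size
            f.seek(position)
            block = f.read(read_size) + tail
            lines = block.split(b"\n")
            tail = lines.pop(0)   # possibly a partial line; prepend to the next block
            for line in reversed(lines):
                if line:          # skip the empty fragment after a trailing newline
                    yield line.decode() + "\n"
        if tail:
            yield tail.decode() + "\n"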
If you ever do need to repeatedly read particular line numbers out of a file, the linecache module will usually speed things up a lot. Nowhere near as fast as not re-reading at all, of course, but a lot better than reading and parsing thousands of newlines.
You're also wasting time repeating some work in the inner loop. For example, you call line2.split(",") three times, instead of just splitting it once and stashing the value in a variable, which would be three times as fast. A 3x constant gain is nowhere near as important as a quadratic to linear gain, but when it comes for free by making your code simpler and more readable, might as well take it.
For a file of this size (~15 MB), Pandas would be an excellent choice.
Like this:
import pandas as pd

raw_data = pd.read_csv('/path/to/output_temp.csv')
clean_data = raw_data.drop_duplicates()
clean_data.to_csv('/path/to/clean_csv.csv')
I was able to process a CSV file about 151 MB in size, containing more than 5.9 million rows, in less than a second with the above snippet.
Please note that the duplicate check can be a conditional operation or restricted to a subset of fields to match on.
Pandas provides a lot of these features out of the box. Documentation here
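For this particular question (keep only the newest row per testname/unitname pair), a sketch using drop_duplicates with a subset and keep='last'; the column names come from the sample data and are assumed to be present as a header row:

import pandas as pd

raw_data = pd.read_csv('/path/to/output_temp.csv')

# The file is sorted by time, so the last row of each pair is the newest one.
clean_data = raw_data.drop_duplicates(subset=['testname', 'unitname'], keep='last')

clean_data.to_csv('/path/to/clean_csv.csv', index=False)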

Finding common phrases between files with millions of lines

I have two files with the following number of lines :
file1 - 110433003
file2 - 4838810
I need to find common phrases between these . Each line is of the form :
p1 ||| p2 ||| .......
The p1 of file1 can be the p2 in file2. Unfortunately, the code I have written is taking way too long to do this.
import sys
import os

if(len(sys.argv)<2):
    print 'python CommonPhrases.py enFr hrEn commonFile'
    sys.exit()

enFr = open(sys.argv[1],'r')
hrEn = open(sys.argv[2],'r')
common = open(sys.argv[3],'w')

sethrEn = set([])
setenFr= set([])

for line in hrEn:
    englishPhrase = line.split(' ||| ')[1]
    sethrEn.add(englishPhrase)

for line in enFr:
    englishPhrase = line.split(' ||| ')[0]
    if(englishPhrase in sethrEn):
        common.write(englishPhrase+'\n')
Is there a faster way to do this ?
Thanks
This sounds like a tree problem. Maybe these ideas can help you.
Using a tree can help find the common phrases. I think there can be two solutions based on the idea of creating a tree.
A tree, once built, stores every phrase of one file (just one file). Then you start reading the second file and look up each of its phrases in the tree.
This solution has some problems, of course. Storing a tree for that many phrases (or lines) in memory can need a lot of RAM.
Let's suppose only the data itself (the characters of the lines) counts toward memory. In the worst case you would need 255^N bytes, where N is the length of the longest line (supposing you hit almost every character combination of length N). So storing every possible combination of length 10 would need about 1.16e+24 bytes of RAM. That is a lot. This solution is (as far as I know) "fast", but needs more RAM than you may be able to find.
The other solution, a very slow one, is to read one line of file A and then compare it with every single line of file B. It needs almost no RAM, but takes far too much time; you may not really be able to wait for it.
So maybe another solution is to divide the problem.
You say you have a list of lines; we don't know whether they are alphabetically sorted or not. So maybe you can start reading file A and create new files: one file stores, for example, the lines starting with 'A', another the lines starting with 'B', and so on. Then do the same for file B, so that you end up with a pair of files for each prefix (the 'A' lines of the original A file and the 'A' lines of the original B file). Then compare each pair with a tree.
In the best case the lines divide evenly, letting you use an in-memory tree. In the worst case you end up with only one file, the same as the original A file, because perhaps all lines start with 'A'.
So maybe you can implement a way to divide the files further if they are still too big: first by the first character of the lines, then splitting the 'A' lines into 'AA', 'AB', 'AC', etc. (deleting the previous split files as you go), until the files are small enough to search for common lines with a better method (maybe an in-memory tree).
This solution can also take a long time, though maybe not as long as the low-RAM option, and it doesn't need much RAM to work.
These are the solutions I can think of at the moment. Maybe they work, maybe not.
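For what it's worth, here is a minimal sketch of the bucketing step described above (one pass per input file; it assumes the first characters are safe to use in file names, and in practice you would probably bucket on the relevant ||| field rather than the raw line):

def bucket_by_first_char(path, prefix):
    """Write each line of `path` to a file named prefix.<first character>."""
    handles = {}
    with open(path) as src:
        for line in src:
            first = line[:1] or "_"
            if first not in handles:
                handles[first] = open("%s.%s" % (prefix, first), "w")
            handles[first].write(line)
    for handle in handles.values():
        handle.close()
    return sorted(handles)

# Usage sketch: bucket both files, then compare matching buckets
# (fileA.x against fileB.x) with an in-memory set or tree.
buckets_a = bucket_by_first_char("fileA", "fileA")
buckets_b = bucket_by_first_char("fileB", "fileB")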
You definitely need a trie for something like this. It seems like you will be spending most of your time searching the set for a match.
Also every time I find myself trying to make python go faster, I switch to pypy. (http://pypy.org/)
It is extremely easy to set up (just download the binaries, put it in your path and change #!/usr/bin/env python to #!/usr/bin/env pypy) and gives speedups in the range of 5-10x for such tasks.
For a reference implementation using PyTrie see below.
#!/usr/bin/env pypy

import sys
import os

sys.path.append('/usr/local/lib/python2.7/dist-packages/PyTrie-0.1-py2.7.egg/')
from pytrie import SortedStringTrie as trie

if(len(sys.argv)<2):
    print 'python CommonPhrases.py enFr hrEn commonFile'
    sys.exit()

enFr = open(sys.argv[1],'r')
hrEn = open(sys.argv[2],'r')
common = open(sys.argv[3],'w')

sethrEn = trie()

for line in hrEn:
    englishPhrase = line.strip().split(' ||| ')[1]
    sethrEn[englishPhrase] = None

for line in enFr:
    englishPhrase = line.strip().split(' ||| ')[0]
    if(englishPhrase in sethrEn):
        common.write(englishPhrase+'\n')
Note that it requires minimal changes (4 lines) and you will need to install PyTrie 0.1. On my Ubuntu system, "sudo easy_install PyTrie" did the trick.
Hope that helps.

Searching for a string in a large text file - profiling various methods in python

This question has been asked many times. After spending some time reading the answers, I did some quick profiling to try out the various methods mentioned previously...
I have a 600 MB file with 6 million lines of strings (Category paths from DMOZ project).
The entry on each line is unique.
I want to load the file once & keep searching for matches in the data
For the methods I tried, listed below, I give the time taken to load the file, the search time for a negative match, and the memory usage in Task Manager.
1) set :
(i) data = set(f.read().splitlines())
(ii) result = search_str in data
Load time ~ 10s, Search time ~ 0.0s, Memory usage ~ 1.2GB
2) list :
(i) data = f.read().splitlines()
(ii) result = search_str in data
Load time ~ 6s, Search time ~ 0.36s, Memory usage ~ 1.2GB
3) mmap :
(i) data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
(ii) result = data.find(search_str)
Load time ~ 0s, Search time ~ 5.4s, Memory usage ~ NA
4) Hash lookup (using code from #alienhard below):
Load time ~ 65s, Search time ~ 0.0s, Memory usage ~ 250MB
5) File search (using code from #EOL below):
with open('input.txt') as f:
    print search_str in f  # search_str ends with the ('\n' or '\r\n') as in the file
Load time ~ 0s, Search time ~ 3.2s, Memory usage ~ NA
6) sqlite (with primary index on url):
Load time ~ 0s, Search time ~ 0.0s, Memory usage ~ NA
For my use case, it seems like going with the set is the best option as long as I have sufficient memory available. I was hoping to get some comments on these questions :
A better alternative e.g. sqlite ?
Ways to improve the search time using mmap. I have a 64-bit setup.
[edit] e.g. bloom filters
As the file size grows to a couple of GB, is there any way I can keep using 'set' e.g. split it in batches ..
[edit 1] P.S. I need to search frequently, add/remove values and cannot use a hash table alone because I need to retrieve the modified values later.
Any comments/suggestions are welcome !
[edit 2] Update with results from methods suggested in answers
[edit 3] Update with sqlite results
Solution: based on all the profiling & feedback, I think I'll go with sqlite, with method 4 as the second alternative. One downside of sqlite is that the database ends up more than double the size of the original CSV file of URLs; this is due to the primary index on url.
Variant 1 is great if you need to launch many sequential searches. Since set is internally a hash table, it's rather good at search. It takes time to build, though, and only works well if your data fit into RAM.
Variant 3 is good for very big files, because you have plenty of address space to map them and the OS caches enough data. You do a full scan; it can become rather slow once your data stops fitting into RAM.
SQLite is definitely a nice idea if you need several searches in a row and you can't fit the data into RAM. Load your strings into a table, build an index, and SQLite builds a nice b-tree for you. The tree can fit into RAM even if the data doesn't (it's a bit like what #alienhard proposed), and even if it doesn't, the amount of I/O needed is dramatically lower. Of course, you need to create a disk-based SQLite database. I doubt that memory-based SQLite will beat Variant 1 significantly.
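For illustration, a rough sqlite3 sketch of that approach (table and file names are made up for the example):

import sqlite3

conn = sqlite3.connect("urls.db")
conn.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)")

# Load once; the primary key gives us the b-tree index.
with open("input.txt") as f, conn:
    conn.executemany("INSERT OR IGNORE INTO urls VALUES (?)",
                     ((line.rstrip("\n"),) for line in f))

# Each lookup is an index search, not a scan.
def found(search_str):
    cur = conn.execute("SELECT 1 FROM urls WHERE url = ? LIMIT 1", (search_str,))
    return cur.fetchone() is not None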
Custom hash table search with externalized strings
To get fast access time and a lower memory consumption you could do the following:
for each line compute a string hash and add it to a hash table, e.g., index[hash] = position (do not store the string). If there is a collision, store all file positions for that key in a list.
to look up a string, compute its hash and look it up in the table. If the key is found, read the string at position from the file to verify you really have a match. If there are multiple positions check each one until you find a match or none.
Edit 1: replaced line_number by position (as pointed out by a commenter, one obviously needs the actual position and not line numbers)
Edit 2: provide code for an implementation with a custom hash table, which shows that this approach is more memory efficient than the other approaches mentioned:
from collections import namedtuple

Node = namedtuple('Node', ['pos', 'next'])

def build_table(f, size):
    table = [ None ] * size
    while True:
        pos = f.tell()
        line = f.readline()
        if not line: break
        i = hash(line) % size
        if table[i] is None:
            table[i] = pos
        else:
            table[i] = Node(pos, table[i])
    return table

def search(string, table, f):
    i = hash(string) % len(table)
    entry = table[i]
    while entry is not None:
        pos = entry.pos if isinstance(entry, Node) else entry
        f.seek(pos)
        if f.readline() == string:
            return True
        entry = entry.next if isinstance(entry, Node) else None
    return False

SIZE = 2**24
with open('data.txt', 'r') as f:
    table = build_table(f, SIZE)
    print search('Some test string\n', table, f)
The hash of a line is only used to index into the table (if we used a normal dict, the hashes would also be stored as keys). The file position of the line is stored at the given index. Collisions are resolved with chaining, i.e., we create a linked list. However, the first entry is never wrapped in a node (this optimization makes the code a bit more complicated but it saves quite some space).
For a file with 6 million lines I chose a hash table size of 2^24. With my test data I got 933132 collisions. (A hash table of half the size was comparable in memory consumption, but resulted in more collisions. Since more collisions means more file access for searches, I would rather use a large table.)
Hash table: 128MB (sys.getsizeof([None]*(2**24)))
Nodes: 64MB (sys.getsizeof(Node(None, None)) * 933132)
Pos ints: 138MB (6000000 * 24)
-----------------
TOTAL: 330MB (real memory usage of python process was ~350MB)
You could also try
with open('input.txt') as f:
    # search_str is matched against each line in turn; returns on the first match:
    print search_str in f
with search_str ending with the proper newline sequence ('\n' or '\r\n'). This should use little memory, as the file is read progressively. It should also be quite fast, since only part of the file is read.
I would guess many of the paths start out the same on DMOZ.
You should use a trie data structure and store the individual characters on nodes.
Tries have O(m) lookup time (where m is the key length) and also save a lot of space when storing large dictionaries or tree-like data.
You could also store path parts on nodes to reduce the node count; this is called a Patricia trie. But that makes lookups slower by roughly the average string-comparison time. See the SO question Trie (Prefix Tree) in Python for more info about implementations.
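For illustration, a minimal character-level trie along those lines (a plain dict-of-dicts, not memory-optimized; the sample path is made up):

END = object()   # sentinel marking the end of a stored string

def trie_insert(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True

def trie_contains(root, word):
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return END in node

root = {}
trie_insert(root, "Top/Computers/Programming")
print(trie_contains(root, "Top/Computers/Programming"))   # True
print(trie_contains(root, "Top/Computers"))                # False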
There are a couple of trie implementations on Python Package Index, but they are not very good. I have written one in Ruby and in Common Lisp, which is especially well suited for this task – if you ask nicely, I could maybe publish it as open source... :-)
What about a text-indexing solution?
I would use Lucene in the Java world, but there is a Python engine called Whoosh:
https://bitbucket.org/mchaput/whoosh/wiki/Home
Without building an index your searching will be too slow, and building one is not such a simple task, so it's better to use already-developed software. A good option is the Sphinx search engine.
