I'm going to have one small dictionary (between 5 and 20 keys) that will be referenced up to a hundred times or so per page load in Python 2.5.
I'm starting to name the keys it will be looking up, and I was wondering if there is a key naming convention I could follow to help dict lookup times.
I had to test ;-)
Using each of the following as one of the keys into a dictionary of length 4:
f1: integer key 1
f2: short string "one"
f3: long string "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
Iterating 10,000,000 times and measuring the times, I get this result:
<function f1 at 0xb779187c>
f1 3.64
<function f2 at 0xb7791bfc>
f2 3.48
<function f3 at 0xb7791bc4>
f3 3.65
I.e., no difference...
My code:
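(The exact snippet isn't reproduced above, so the following is only a minimal reconstruction of the benchmark described: a 4-key dict, three lookup functions, each run 10,000,000 times. The fourth padding key, the timing loop, and the key values are assumptions of this sketch.)

import time

SHORT = "one"
LONG = "a" * 42                    # roughly the length of the long key above
d = {1: None, SHORT: None, LONG: None, "pad": None}   # dictionary of length 4

def f1():
    d[1]        # integer key

def f2():
    d[SHORT]    # short string key

def f3():
    d[LONG]     # long string key

for f in (f1, f2, f3):
    print f                                  # e.g. <function f1 at 0x...>
    start = time.time()
    for _ in xrange(10000000):
        f()
    print f.__name__, round(time.time() - start, 2)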
There may be sensible names for them that just so happen to produce names whose hashes aren't clashing. However, CPython dicts are already one of the most optimized data structures in the known universe, producing few collisions for most inputs, working well with the hash schemes of other builtin types, resolving clashes very fast, etc. It's extremely unlikely you'll see any benefit at all even if you found something, especially since a hundred lookups aren't really that many.
Take, for example, this timeit benchmark run on my four-year-old desktop machine (sporting a laughably low-budget dual core CPU at 3.1 GHz):
...>python -mtimeit --setup="d = {chr(i)*100: i for i in range(15)};\
k = chr(7)*100" "d[k]"
1000000 loops, best of 3: 0.222 usec per loop
And those strings are a dozen times longer than anything that's remotely sensible to type out manually as a variable name. Cutting the length down from 100 to 10 brings it to 0.0778 microseconds per lookup. Now measure your page's load speed and compare those (alternatively, just ponder how long the work you're actually doing while building the page will take); and take into account caching, framework overhead, and all those things.
Nothing you do in this regard can make a difference performance-wise, period, full stop.
Because the Python string hash function iterates over the chars (at least if this is still applicable), I'd opt for short strings.
To add another aspect:
for very small dictionaries and heavy timing constraints, the time to compute hashes might be a substantial fraction of the overall time. Therefore, for (say) 5 elements, it might be faster to use an array and a sequential search (of course, wrapped up in some MiniDictionary object), maybe even augmented by a binary search. This might find the element with 2-3 comparisons, which may or may not be faster than hash computation plus one compare.
The break-even depends on the hash speed, the average number of elements, and the number of hash collisions to expect, so some measurement is required, and there is no "one-size-fits-all" answer.
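As an illustration, a minimal sketch of such a MiniDictionary (the class name and interface are made up for this example; it keeps the keys in a sorted list and looks them up with bisect):

import bisect

class MiniDictionary(object):
    """Tiny mapping backed by parallel sorted lists instead of a hash table."""

    def __init__(self, pairs):
        pairs = sorted(pairs)
        self._keys = [k for k, _ in pairs]
        self._values = [v for _, v in pairs]

    def __getitem__(self, key):
        # binary search for the key; 2-3 comparisons for a handful of entries
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        raise KeyError(key)

# d = MiniDictionary([("alpha", 1), ("beta", 2), ("gamma", 3)])
# d["beta"]  -> 2

Whether this ever beats a plain dict is exactly the measurement question raised above.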
Python dictionaries have a fast path for string keys, so use these (rather than, say, tuples). The hash value of a string is cached in that string, so it's more important that the strings remain the same ones than their actual value; string constants (i.e., strings that appear verbatim in the program and are not the result of a calculation) always remain exactly the same, so as long as you use those, there's no need to worry.
Related
I'm looking for a set-like data structure in Python that allows a fast lookup (O(1), as for sets), for 100 million short strings (or byte strings) of length ~10.
With 10M strings, this already takes 750 MB of RAM on Python 3.7 or 3.10.2 (or 900 MB if we replace the byte strings with str):
S = set(b"a%09i" % i for i in range(10_000_000)) # { b"a000000000", b"a000000001", ... }
whereas the "real data" here is 10 bytes * 10M ~ 100 MB. So there is a 7.5x memory-consumption factor because of the set structure, pointers, buckets... (for a study of this in the case of a list, see the answer to Memory usage of a list of millions of strings in Python).
When working with "short" strings, the pointers to the strings (64 bits = 8 bytes each) in the internal structure are probably already responsible for a 2x factor, along with the bucket structure of the hash table, etc.
Are there any "short string optimization" techniques that would allow a memory-efficient set of short byte strings in Python? (Or any other structure allowing fast lookup/membership tests.)
Maybe one without pointers to strings, but rather storing the strings directly in the data structure if the string length is <= 16 characters, etc.?
Or would using bisect or a sorted list help (lookup in O(log n) might be OK), while keeping memory usage small (smaller than the 7.5x factor of a set)?
Here are the methods I have tested so far, thanks to the comments, and that seem to work.
Sorted list + bisection search (+ bloom filter)
1. Insert everything into a standard list L, in sorted order. This takes much less memory than a set.
2. (Optional) Create a Bloom filter; only a very small amount of code is needed to do it.
3. (Optional) First test membership with the Bloom filter (fast).
4. Check whether it really is a match (and not a false positive) with the fast in_sorted_list() from this answer, using bisect; much faster than a standard lookup b"hello" in L.
If the bisection search is fast enough, we can even bypass the Bloom filter (steps 2 and 3). It will be O(log n).
In my test with 100M strings, even without a Bloom filter, the lookup took 2 µs on average.
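For illustration, a minimal sketch of the bisect-only variant (no Bloom filter); the b"a%09i" strings are the same toy data as above, and in_sorted_list is just the obvious bisect-based helper, not necessarily the exact one from the linked answer:

import bisect

L = sorted(b"a%09i" % i for i in range(10_000_000))   # sorted list of byte strings

def in_sorted_list(sorted_list, item):
    """O(log n) membership test via binary search."""
    i = bisect.bisect_left(sorted_list, item)
    return i < len(sorted_list) and sorted_list[i] == item

print(in_sorted_list(L, b"a000004242"))   # True
print(in_sorted_list(L, b"b000000000"))   # False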
Sqlite3
As suggested in @tomalak's comment, inserting all the data in a Sqlite3 database works very well.
Querying if a string exists in the database was done in 50 µs on average on my 8 GB database, even without any index.
Adding an index made the DB grow to 11 GB, but then the queries were still done in ~50 µs on average, so no gain here.
Edit: as mentioned in a comment, using CREATE TABLE t(s TEXT PRIMARY KEY) WITHOUT ROWID; even made the DB smaller: 3.3 GB, and the queries are still done in ~50 µs on average. Sqlite3 is (as always) really amazing.
In this case, it's even possible to load it totally in RAM with the method from How to load existing db file to memory in Python sqlite3?, and then it's ~9 µs per query!
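A hedged sketch of this approach (the database file name is arbitrary; the WITHOUT ROWID table is the variant mentioned above, and membership is a single primary-key lookup):

import sqlite3

con = sqlite3.connect("strings.db")
con.execute("CREATE TABLE IF NOT EXISTS t(s TEXT PRIMARY KEY) WITHOUT ROWID")
con.executemany("INSERT OR IGNORE INTO t VALUES (?)",
                (("a%09i" % i,) for i in range(10_000_000)))
con.commit()

def contains(s):
    # membership test = one indexed primary-key lookup
    return con.execute("SELECT 1 FROM t WHERE s = ?", (s,)).fetchone() is not None

print(contains("a000004242"))   # True
print(contains("b000000000"))   # False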
Bisection in file with sorted lines
This works, with very fast queries (~35 µs per query), without loading the file into memory! See:
Bisection search in the sorted lines of an opened file (not loaded in memory)
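This is not the exact code from that answer, but a sketch of the idea: binary-search byte offsets in the file, realigning to the next line start after every seek (the file must be opened in binary mode and its lines must be sorted):

def in_sorted_file(f, target):
    """Membership test for `target` (bytes, no newline) in a file of sorted lines."""
    f.seek(0, 2)                     # jump to the end to get the file size
    lo, hi = 0, f.tell()
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid)
        f.readline()                 # skip the (possibly partial) line we landed in
        line = f.readline().rstrip(b"\r\n")
        if not line or line > target:
            hi = mid                 # the next full line is past the target
        elif line == target:
            return True
        else:
            lo = mid + 1             # the next full line is still before the target
    f.seek(0)                        # the first line is never examined by the loop above
    return f.readline().rstrip(b"\r\n") == target

# with open("sorted_lines.txt", "rb") as f:
#     print(in_sorted_file(f, b"a000004242"))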
Dict with prefixes as keys and concatenation of suffixes as values
This is the solution described here: Set of 10-char strings in Python is 10 times bigger in RAM as expected.
The idea is: we have a dict D and, for a given word,
prefix, suffix = word[:4], word[4:]
D[prefix] += suffix + b' '
With this method the RAM used is even smaller than the actual data (I tested with 30M strings of average length 14, and it used 349 MB), and the queries seem very fast (2 µs), but the initial creation time of the dict is a bit high.
I also tried with dict values = list of suffixes, but it's much more RAM-consuming.
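For reference, a hedged sketch of the concatenated-suffixes scheme (it assumes fixed-length words that never contain the b" " separator, like the toy strings above):

from collections import defaultdict

D = defaultdict(bytes)   # 4-byte prefix -> concatenation of the remaining suffixes

def add(word):
    prefix, suffix = word[:4], word[4:]
    D[prefix] += suffix + b" "          # repeated += is what makes creation slow

def contains(word):
    prefix, suffix = word[:4], word[4:]
    return suffix + b" " in D[prefix]

add(b"a000004242")
add(b"a000007777")
print(contains(b"a000004242"))   # True
print(contains(b"a000000001"))   # False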
Regarding storing the strings directly in the data structure when the string length is <= 16 characters: while not a set data structure but rather a list, I think pyarrow has a quite optimized way of storing a large number of small strings. There is a pandas integration as well, which should make it easy to try out:
https://pythonspeed.com/articles/pandas-string-dtype-memory/
EDIT: Python 2.7.8
I have two files. p_m has a few hundred records that contain acceptable values in column 2. p_t has tens of millions of records, and I want to make sure that column 14 is from the set of acceptable values already mentioned. So in the first while loop I read in all the acceptable values, make a set (for de-duping), and then turn that set into a list (I didn't benchmark whether a set would have been faster than a list, actually...). I got the second loop down to about as few lines as possible, but I don't know if they are the FASTEST few lines (I'm using the [14] index twice because exceptions are so rare that I didn't want to bother with an assignment to a variable). Currently a scan takes about 40 minutes. Any ideas on how to improve that?
import sets  # deprecated module; this code was originally written against sets.Set

def contentScan(p_m,p_t):
    """ """
    vcont=sets.Set()
    i=0
    h = open(p_m,"rb")
    while(True):
        line = h.readline()
        if not line:
            break
        i += 1
        vcont.add(line.split("|")[2])
    h.close()
    vcont = list(vcont)
    vcont.sort()
    i=0
    h = open(p_t,"rb")
    while(True):
        line = h.readline()
        if not line:
            break
        i += 1
        if line.split("|")[14] not in vcont:
            print "%s is not defined in the matrix." %line.split("|")[14]
            return 1
    h.close()
    print "PASS All variable_content_id values exist in the matrix."
    return 0
Checking for membership in a set of a few hundred items is much faster than checking for membership in the equivalent list. However, given your staggering 40-minute running time, the difference may not be that meaningful. E.g.:
ozone:~ alex$ python -mtimeit -s'a=list(range(300))' '150 in a'
100000 loops, best of 3: 3.56 usec per loop
ozone:~ alex$ python -mtimeit -s'a=set(range(300))' '150 in a'
10000000 loops, best of 3: 0.0789 usec per loop
so if you're checking "tens of millions of times", using the set should save you tens of seconds -- better than nothing, but barely measurable.
The same consideration applies to other very advisable improvements, such as turning the loop structure:
h = open(p_t,"rb")
while(True):
    line = h.readline()
    if not line:
        break
    ...
h.close()
into a much-sleeker:
with open(p_t, 'rb') as h:
    for line in h:
        ...
again, this won't save you as much as a microsecond per iteration -- so, over, say, 50 million lines, that's less than one of those 40 minutes. Ditto for the removal of the completely unused i += 1 -- it makes no sense for it to be there, but taking it away will make little difference.
One answer focused on the cost of the split operation. That depends on how many fields per record you have, but, for example:
ozone:~ alex$ python -mtimeit -s'a="xyz|"*20' 'a.split("|")[14]'
1000000 loops, best of 3: 1.81 usec per loop
so, again, whatever optimization here could save you maybe at most a microsecond per iteration -- again, another minute shaved off, if that.
Really, the key issue here is why reading and checking, e.g., 50 million records should take as much as 40 minutes: 2400 seconds, i.e., 48 microseconds per line; and no doubt still more than 40 microseconds per line even with all the optimizations mentioned here and in other answers and comments.
So once you have applied all the optimizations (and confirmed the code is still just too slow), try profiling the program -- see e.g. http://ymichael.com/2014/03/08/profiling-python-with-cprofile.html -- to find out exactly where all of the time is going.
Also, just to make sure it's not simply the I/O to some peculiarly slow disk, do a run with the meaty part of the big loop "commented out", just reading the big file and doing no processing or checking at all on it; this will tell you the "irreducible" I/O overhead. If I/O is responsible for the bulk of your elapsed time, then you can't do much to improve things in code, though changing the open to open(thefile, 'rb', HUGE_BUFFER_SIZE) might help a bit, and you may want to consider improving the hardware set-up instead: defragment the disk, use a local rather than a remote filesystem, and so on.
The list lookup was the issue (as you correctly noticed). Searching a list has O(n) time complexity, where n is the number of items stored in the list. On the other hand, finding a value in a hash table (which is what the Python dictionary actually is) has O(1) complexity. Since you have hundreds of items in the list, the list lookup is about two orders of magnitude more expensive than the dictionary lookup. This is in line with the 34x improvement you saw when replacing the list with the dictionary.
To further reduce execution time by 5-10x you can use a Python JIT. I personally like PyPy (http://pypy.org/features.html). You do not need to modify your script; just install pypy and run:
pypy [your_script.py]
EDIT: Made it more Pythonic.
EDIT 2: Using the set builtin rather than a dict.
Based on the comments, I decided to try using a dict instead of a list to store the acceptable values against which I'd be checking the big file (I did keep a watchful eye on .split but did not change it). Just changing the list to a dict gave an immediate and HUGE improvement in execution time.
Using timeit and running 5 iterations over a million-line file, I get 884.2 seconds for the list-based checker and 25.4 seconds for the dict-based checker! So, roughly a 34x improvement for changing 2 or 3 lines.
Thanks all for the inspiration! Here's the solution I landed on:
def contentScan(p_m,p_t):
    """ """
    vcont=set()
    with open(p_m,'rb') as h:
        for line in h:
            vcont.add(line.split("|")[2])
    with open(p_t,"rb") as h:
        for line in h:
            if line.split("|")[14] not in vcont:
                print "%s is not defined in the matrix." %line.split("|")[14]
                return 1
    print "PASS All variable_content_id values exist in the matrix."
    return 0
Yes, it's not optimal at all. split is EXPENSIVE as hell (it creates a new list and N new strings and appends them to the list). Instead: scan for the 13th "|", scan for the 14th "|" (starting from the 13th's position), and take the slice between the two positions, line[pos13 + 1:pos14].
I'm pretty sure you can make it run 2-10x faster with this small change. To go further, you could avoid extracting the string at all: loop through the valid strings and, for each one, compare char by char starting at pos13 + 1 while the chars match; if you reach the "|" at the end for one of the strings, it's a good one. It will also help a bit to sort the valid-strings list by frequency in the data file. But not creating a list with a dozen-plus strings on every line is far more important. A sketch of the no-split extraction follows below.
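For illustration, a hedged sketch of the no-split idea (the helper name is made up; it walks the "|" separators with str.find and slices out just the one field, assuming the line actually has that many fields):

def field(line, index, sep="|"):
    """Equivalent to line.split(sep)[index], without building the full list."""
    start = 0
    for _ in xrange(index):              # skip the first `index` separators
        start = line.find(sep, start) + 1
    end = line.find(sep, start)
    if end == -1:
        end = len(line)
    return line[start:end]

# field(line, 14) == line.split("|")[14]  for well-formed lines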
Here are your tests:
a generator (ts; you can adjust it to produce some realistic data)
a no-op (just reading)
the original solution
the no-split solution
I wasn't able to get the code to format here, so it is in this gist:
https://gist.github.com/iced/16fe907d843b71dd7b80
Test conditions: VBox with 2 cores and 1 GB RAM running Ubuntu 14.10 with the latest updates. Each variation was executed 10 times, rebooting the VBox before each run and throwing out the lowest and highest run times.
Results:
no_op: 25.025
original: 26.964
no_split: 25.102
original - no_op: 1.939
no_split - no_op: 0.077
Though in this case this particular optimization is useless, as the majority of the time was spent in I/O. I was unable to find a test setup that made I/O less than 70% of the total. In any case, split IS expensive and should be avoided when it's not needed.
PS. Yes, I understand that with even 1K of good items it's much better to use a hash (actually, it's better from the point where computing the hash is faster than the lookup, probably at around 100 elements); my point is that split is expensive in this case.
I have a dict with 50,000,000 keys (strings) mapped to a count of that key (which is a subset of one with billions).
I also have a series of objects with a class set member containing a few thousand strings that may or may not be in the dict keys.
I need the fastest way to find the intersection of each of these sets.
Right now, I do it like this code snippet below:
for block in self.blocks:
    # a block is a python object containing the set in the thousands range
    # block.get_kmers() returns the set
    count = sum([kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts)])
    # kmerCounts is the dict mapping millions of strings to ints
From my tests so far, this takes about 15 seconds per iteration. Since I have around 20,000 of these blocks, I am looking at half a week just to do this. And that is for the 50,000,000 items, not the billions I need to handle...
(And yes I should probably do this in another language, but I also need it done fast and I am not very good at non-python languages).
There's no need to do a full intersection; you just want the matching elements from the big dictionary, if they exist. If an element doesn't exist, you can substitute 0 and there will be no effect on the sum. There's also no need to convert the input of sum to a list.
count = sum(kmerCounts.get(x, 0) for x in block.get_kmers())
Remove the square brackets around your list comprehension to turn it into a generator expression:
sum(kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts))
That will save you some time and some memory, which may in turn reduce swapping, if you're experiencing that.
There is a lower bound to how much you can optimize here. Switching to another language may ultimately be your only option.
I need to find identical sequences of characters in a collection of texts. Think of it as finding identical/plagiarized sentences.
The naive way is something like this:
from collections import defaultdict

ht = defaultdict(int)
for s in sentences:
    ht[s] += 1
I usually use Python, but I'm beginning to think that Python is not the best choice for this task. Am I wrong about that? Is there a reasonable way to do it with Python?
If I understand correctly, Python dictionaries use open addressing, which means that the key itself is also saved in the table. If this is indeed the case, it means that a Python dictionary allows efficient lookup but is VERY bad in memory usage: if I have millions of sentences, they are all saved in the dictionary, which is horrible since it exceeds the available memory, making the Python dictionary an impractical solution.
Can someone confirm the previous paragraph?
One solution that comes into mind is explicitly using a hash function (either use the builtin hash function, implement one or use the hashlib module) and instead of inserting ht[s]+=1, insert:
ht[hash(s)]+=1
This way the key stored in the array is an int (that will be hashed again) instead of the full sentence.
Will that work? Should I expect collisions? any other Pythonic solutions?
Thanks!
Yes, a dict stores its keys in memory. If your data fits in memory, this is the easiest approach.
Hashing should work. Try MD5: it is a 16-byte value, so collisions are unlikely.
Try BerkeleyDB for a disk-based approach.
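A small sketch of the MD5 idea applied to the counting loop from the question (using hashlib rather than the old md5 module; the 16-byte digest replaces the sentence as the dict key):

import hashlib
from collections import defaultdict

ht = defaultdict(int)
for s in sentences:                      # `sentences` as in the question
    digest = hashlib.md5(s).digest()     # 16 bytes; use s.encode("utf-8") on Python 3
    ht[digest] += 1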
Python dicts are indeed monsters in memory. You can hardly operate on millions of keys when storing anything larger than integers. Consider the following code:
import random
BITS = 64   # also tried 128, 256, 512
d = {}
for x in xrange(5000000):  # that's 5 million
    d[x] = random.getrandbits(BITS)
For BITS = 64 it takes 510 MB of my RAM, for BITS = 128 it takes 550 MB, for BITS = 256 it takes 650 MB, and for BITS = 512 it takes 830 MB. Increasing the number of iterations to 10 million doubles the memory usage. However, consider this snippet:
for x in xrange(5000000):  # it's still 5 million
    d[x] = (random.getrandbits(64), random.getrandbits(64))
It takes 1.1 GB of my memory. Conclusion? If you want to keep two 64-bit integers, use one 128-bit integer, like this:
for x in xrange(5000000):  # it's still 5 million
    d[x] = random.getrandbits(64) | (random.getrandbits(64) << 64)
It'll reduce the memory usage by a factor of two.
It depends on your actual memory limit and the number of sentences, but you should be safe using dictionaries with 10-20 million keys when using just integers. You have a good idea with hashes, but you probably want to keep a pointer to the sentence, so that in case of a collision you can investigate (compare the sentences char by char and probably print them out). You could create the pointer as an integer, for example by packing the file number and offset into it. If you don't expect a massive number of collisions, you can simply set up another dictionary for storing only the collisions, for example:
hashes = {}
collisions = {}
for s in sentences:
    ptr_value = pointer(s)   # make it an integer, e.g. file number and offset packed together
    hash_value = hash(s)     # make it an integer
    if hash_value in hashes:
        collisions.setdefault(hashes[hash_value], []).append(ptr_value)
    else:
        hashes[hash_value] = ptr_value
So at the end you will have a collisions dictionary where the key is a pointer to a sentence and the value is an array of pointers to the sentences it collides with. It sounds pretty hacky, but working with integers is just fine (and fun!).
Perhaps try passing the keys through md5: http://docs.python.org/library/md5.html
I'm not sure exactly how large the data set you are comparing across is, but I would recommend looking into Bloom filters (be careful of false positives): http://en.wikipedia.org/wiki/Bloom_filter ... Another avenue to consider would be something simple like cosine similarity or edit distance between documents, but if you are trying to compare one document against many, I would suggest looking into Bloom filters; you can encode the items however you find most efficient for your problem.
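For reference, a minimal Bloom filter sketch (the bit-array size and number of hashes are arbitrary, and the k hash functions are simulated by salting MD5, which is an assumption of this example rather than a requirement):

import hashlib

class BloomFilter(object):
    def __init__(self, m_bits=8 * 1024 * 1024, k=7):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # derive k bit positions from k salted MD5 digests of the item (bytes)
        for i in range(self.k):
            digest = hashlib.md5(b"%d:" % i + item).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# bf = BloomFilter()
# bf.add(b"some sentence")
# b"some sentence" in bf       -> True
# b"another sentence" in bf    -> False (with high probability)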
This problem might be relatively simple, but I'm given two text files. One text file contains all the encrypted passwords, encrypted via crypt.crypt in Python. The other list contains over 400k+ normal dictionary words.
The assignment gives three different functions that transform strings: one generates all the different permutations of capitalization of a word, one transforms a letter to a number if it looks alike (e.g. G -> 6, B -> 8), and one reverses a string. Given the 10-20 encrypted passwords in the password file, what is the most efficient way to write the fastest-running Python solution that runs those functions on each dictionary word in the words file? It is given that all those words, when transformed in some way, will encrypt to a password in the password file.
Here is the function which checks if a given string, when encrypted, is the same as the encrypted password passed in:
import crypt

def check_pass(plaintext, encrypted):
    crypted_pass = crypt.crypt(plaintext, encrypted)
    if crypted_pass == encrypted:
        return True
    else:
        return False
Thanks in advance.
Without knowing the details of the underlying hash algorithm and its possible weaknesses, all you can do is run a brute-force attack, trying all possible transformations of the words in your word list.
The only way to speed up such a brute-force attack is to get more powerful hardware and to split the task and run the cracker in parallel.
On my slow laptop, crypt.crypt takes about 20 microseconds:
$ python -mtimeit -s'import crypt' 'crypt.crypt("foobar", "zappa")'
10000 loops, best of 3: 21.8 usec per loop
so, the brute force approach (really the only sensible one) is "kinda" feasible. By applying your transformation functions you'll get (ballpark estimate) about 100 transformed words per dictionary word (mostly from the capitalization changes), so, about 40 million transformed words out of your whole dictionary. At 20 microseconds each, that will take about 800 seconds, call it 15 minutes, for the effort of trying to crack one of the passwords that doesn't actually correspond to any of the variations; expected time about half that, to crack a password that does correspond.
So, if you have 10 passwords to crack, and they all do correspond to a transformed dictionary word, you should be done in an hour or two. Is that OK? Because there isn't much else you can do except distribute this embarrassingly parallel problem over as many nodes and cores as you can grasp (oh, and use a faster machine in the first place -- that might buy you perhaps a factor of two or thereabouts).
There is no deep optimization trick that you can add, so the general logic will be that of a triple-nested loop: one level loops over the encrypted passwords, one over the words in the dictionary, one over the variants of each dictionary word. There isn't much difference regarding how you nest things (except the loop on the variants must come within the loop on the words, for simplicity). I recommend encapsulating "give me all variants of this word" as a generator (for simplicity, not for speed) and otherwise minimizing the number of function calls (e.g. there is no reason to use that check_pass function since the inline code is just as clear, and will be microscopically faster).
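As a sketch of that structure (variants() stands in for the assignment's three transformation functions, which aren't shown here; everything else is just the triple-nested loop described above, with the crypt check inlined):

import crypt

def crack_one(encrypted, words, variants):
    """Return the plaintext that encrypts to `encrypted`, or None."""
    for word in words:
        for candidate in variants(word):
            if crypt.crypt(candidate, encrypted) == encrypted:
                return candidate
    return None

def crack_all(encrypted_passwords, words, variants):
    return dict((e, crack_one(e, words, variants)) for e in encrypted_passwords)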