EDIT: Python 2.7.8
I have two files. p_m has a few hundred records that contain acceptable values in column 2. p_t has tens of millions of records in which I want to make sure that column 14 is from the set of acceptable values already mentioned. So in the first while loop I'm reading in all the acceptable values, making a set (for de-duping), and then turning that set into a list (I didn't benchmark to see if a set would have been faster than a list, actually...). I got it down to about as few lines as possible in the second loop, but I don't know if they are the FASTEST few lines (I'm using the [14] index twice because exceptions are so very rare that I didn't want to bother with an assignment to a variable). Currently it takes about 40 minutes to do a scan. Any ideas on how to improve that?
import sets

def contentScan(p_m,p_t):
    """ """
    vcont=sets.Set()
    i=0
    h = open(p_m,"rb")
    while(True):
        line = h.readline()
        if not line:
            break
        i += 1
        vcont.add(line.split("|")[2])
    h.close()
    vcont = list(vcont)
    vcont.sort()
    i=0
    h = open(p_t,"rb")
    while(True):
        line = h.readline()
        if not line:
            break
        i += 1
        if line.split("|")[14] not in vcont:
            print "%s is not defined in the matrix." %line.split("|")[14]
            return 1
    h.close()
    print "PASS All variable_content_id values exist in the matrix."
    return 0
Checking for membership in a set of a few hundred items is much faster than checking for membership in the equivalent list. However, given your staggering 40-minute running time, the difference may not be that meaningful. E.g.:
ozone:~ alex$ python -mtimeit -s'a=list(range(300))' '150 in a'
100000 loops, best of 3: 3.56 usec per loop
ozone:~ alex$ python -mtimeit -s'a=set(range(300))' '150 in a'
10000000 loops, best of 3: 0.0789 usec per loop
so if you're checking "tens of millions of times" using the set should save you tens of seconds -- better than nothing, but barely measurable.
The same consideration applies for other very advisable improvements, such as turning the loop structure:
h = open(p_t,"rb")
while(True):
    line = h.readline()
    if not line:
        break
    ...
h.close()
into the much sleeker:
with open(p_t, 'rb') as h:
    for line in h:
        ...
again, this won't save you as much as a microsecond per iteration -- so, over, say, 50 million lines, that's less than one of those 40 minutes. Ditto for the removal of the completely unused i += 1 -- it makes no sense for it to be there, but taking it away will make little difference.
One answer focused on the cost of the split operation. That depends on how many fields per record you have, but, for example:
ozone:~ alex$ python -mtimeit -s'a="xyz|"*20' 'a.split("|")[14]'
1000000 loops, best of 3: 1.81 usec per loop
so, again, whatever optimization here could save you maybe at most a microsecond per iteration -- again, another minute shaved off, if that.
Really, the key issue here is why reading and checking e.g. 50 million records should take as much as 40 minutes -- 2400 seconds -- 48 microseconds per line; and no doubt still more than 40 microseconds per line even with all the optimizations mentioned here and in other answers and comments.
So once you have applied all the optimizations (and confirmed the code is still just too slow), try profiling the program -- see e.g. http://ymichael.com/2014/03/08/profiling-python-with-cprofile.html -- to find out exactly where all of the time is going.
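For example, a minimal profiling sketch (the output file name scan.prof is a placeholder, and p_m/p_t stand for your two file paths):

import cProfile, pstats

cProfile.run('contentScan(p_m, p_t)', 'scan.prof')
pstats.Stats('scan.prof').sort_stats('cumulative').print_stats(10)   # top 10 by cumulative time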
Also, just to make sure it's not simply the I/O to some peculiarly slow disk, do a run with the meaty part of the big loop "commented out": just read the big file and do no processing or checking at all on it. That will tell you the "irreducible" I/O overhead. If I/O is responsible for the bulk of your elapsed time, then you can't do much in the code to improve things, though changing the open to open(thefile, 'rb', HUGE_BUFFER_SIZE) might help a bit, and you may want to consider improving the hardware set-up instead -- defragment the disk, use a local rather than a remote filesystem, and so on.
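A minimal sketch of such an I/O-only baseline run (the path and buffer size below are placeholders you'd adjust):

import time

p_t = '/path/to/the_big_file'           # placeholder: the tens-of-millions-of-lines file
HUGE_BUFFER_SIZE = 16 * 1024 * 1024     # 16 MB; tune to taste

start = time.time()
nlines = 0
with open(p_t, 'rb', HUGE_BUFFER_SIZE) as h:
    for line in h:
        nlines += 1                     # no splitting, no membership check
print "read %d lines in %.1f seconds" % (nlines, time.time() - start)

If this alone takes close to 40 minutes, the problem is I/O, not the Python code.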
The list lookup was the issue (as you correctly noticed). Searching a list has O(n) time complexity, where n is the number of items stored in the list. On the other hand, finding a value in a hashtable (which is what the Python dictionary actually is) has O(1) complexity. Since you have hundreds of items in the list, the list lookup is about two orders of magnitude more expensive than the dictionary lookup. This is in line with the 34x improvement you saw when replacing the list with the dictionary.
To further reduce execution time, by perhaps 5-10x, you can use a JIT-compiled Python implementation. I personally like PyPy (http://pypy.org/features.html). You do not need to modify your script; just install pypy and run:
pypy [your_script.py]
EDIT: Made more Pythonic.
EDIT 2: Using set builtin rather than dict.
Based on the comments, I decided to try using a dict instead of a list to store the acceptable values against which I'd be checking the big file (I did keep a watchful eye on .split, but did not change it). Just changing the list to a dict gave an immediate and HUGE improvement in execution time.
Using timeit and running 5 iterations over a million-line file, I get 884.2 seconds for the list-based checker, and 25.4 seconds for the dict-based checker! So like a 34x improvement for changing 2 or 3 lines.
Thanks all for the inspiration! Here's the solution I landed on:
def contentScan(p_m,p_t):
    """ """
    vcont=set()
    with open(p_m,'rb') as h:
        for line in h:
            vcont.add(line.split("|")[2])
    with open(p_t,"rb") as h:
        for line in h:
            if line.split("|")[14] not in vcont:
                print "%s is not defined in the matrix." %line.split("|")[14]
                return 1
    print "PASS All variable_content_id values exist in the matrix."
    return 0
Yes, it's not optimal at all. split is EXPENSIVE as hell (it creates a new list, creates N strings, and appends them to that list). Instead, scan for the 14th "|", then scan for the next "|" starting from that position, and take line[pos14 + 1:pos15].
Pretty sure you can make it run 2-10x faster with this small change. To go further, you don't even have to extract the substring: loop through the valid strings and, for each one, compare character by character starting right after the field's opening "|"; if you reach the closing "|" at the end of one of the strings, it's a match. It will also help a bit to sort the list of valid strings by their frequency in the data file. But not creating a list with dozens of strings on every line is far more important. A rough sketch of the slicing idea is below.
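This sketch is my own code, not from the gist; it assumes the field of interest is split("|")[14], i.e. the text between the 14th and 15th "|":

def field14(line):
    # walk to the 14th "|" without building a list of all the fields
    pos = -1
    for _ in xrange(14):                 # Python 2; use range() on Python 3
        pos = line.find("|", pos + 1)
        if pos == -1:
            return None                  # fewer than 14 delimiters on this line
    end = line.find("|", pos + 1)
    if end == -1:
        end = len(line)                  # field 14 happens to be the last field
    return line[pos + 1:end]

The check then becomes if field14(line) not in vcont: ... -- only one substring is created per line, instead of a whole list of fields.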
Here are the tests:
a data generator (you can adjust it to produce something closer to your real data)
no-op (just reading)
the original solution
the no-split solution
I wasn't able to get the code to format properly here, so it's in a gist:
https://gist.github.com/iced/16fe907d843b71dd7b80
Test conditions: a VBox VM with 2 cores and 1 GB RAM running Ubuntu 14.10 with the latest updates. Each variation was executed 10 times, rebooting the VM before each run and discarding the lowest and highest run times.
Results:
no_op: 25.025
original: 26.964
no_split: 25.102
original - no_op: 1.939
no_split - no_op: 0.077
In this case, though, this particular optimization turns out to be nearly useless, as the majority of the time was spent in I/O; I was unable to find a test setup where I/O was less than 70% of the total. In any case, split IS expensive and should be avoided when it's not needed.
PS. Yes, I understand that with even 1K of good items it's much better to use a hash (actually, it becomes better as soon as computing the hash is faster than the linear lookup, probably at around 100 elements); my point is that split is expensive in this case.
Related
I'm looking for a set-like data structure in Python that allows a fast lookup (O(1) for sets), for 100 million short strings (or byte-strings) of length ~ 10.
With 10M strings, this already takes 750 MB of RAM on Python 3.7 or 3.10.2 (or 900 MB if we replace the byte-strings with strings):
S = set(b"a%09i" % i for i in range(10_000_000)) # { b"a000000000", b"a000000001", ... }
whereas the "real data" here is 10 bytes * 10M ~ 100 MB. So there is a 7.5x memory consumption factor because of the set structure, pointers, buckets... (for a study about this in the case of a list, see the answer of Memory usage of a list of millions of strings in Python).
When working with "short" strings, having pointers to the strings (probably taking 64 bit = 8 bytes) in the internal structure is probably already responsible for a 2x factor, and also the buckets structure of the hash-table, etc.
Are there some "short string optimizations" techniques allowing to have a memory-efficient set of short bytes-strings in Python? (or any other structure allowing fast lookup/membership test)
Maybe without pointers to strings, but rather storing the strings directly in the data structure if string length <= 16 characters, etc.
Or would using a bisect or a sorted list help (lookup in O(log n) might be ok), while keeping memory usage small? (smaller than a 7.5x factor as with a set)
Here are the methods I have tested so far, thanks to the comments, that seem to work.
Sorted list + bisection search (+ bloom filter)
1. Insert everything in a standard list L, in sorted order. This takes a lot less memory than a set.
2. (optional) Create a Bloom filter; here is a very small piece of code to do it.
3. (optional) First test membership with the Bloom filter (fast).
4. Check if it really is a match (and not a false positive) with the fast in_sorted_list() from this answer, using bisect; this is much faster than a standard lookup b"hello" in L.
If the bisection search is fast enough, we can even bypass the bloom filter (steps 2 and 3). It will be O(log n).
In my test with 100M strings, even without bloom filter, the lookup took 2 µs on average.
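A minimal sketch of the bisect-based membership test (without the optional Bloom filter); in_sorted_list here is my own small helper, following the linked answer:

import bisect

L = sorted(S)                             # S: the set/iterable of byte-strings from above

def in_sorted_list(item, sorted_list):
    i = bisect.bisect_left(sorted_list, item)
    return i < len(sorted_list) and sorted_list[i] == item

print(in_sorted_list(b"a000004567", L))   # True with the example data above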
Sqlite3
As suggested by #tomalak's comment, inserting all the data in a Sqlite3 database works very well.
Querying if a string exists in the database was done in 50 µs on average on my 8 GB database, even without any index.
Adding an index made the DB grow to 11 GB, but then the queries were still done in ~50 µs on average, so no gain here.
Edit: as mentioned in a comment, using CREATE TABLE t(s TEXT PRIMARY KEY) WITHOUT ROWID; even made the DB smaller: 3.3 GB, and the queries are still done in ~50 µs on average. Sqlite3 is (as always) really amazing.
In this case, it's even possible to load it totally in RAM with the method from How to load existing db file to memory in Python sqlite3?, and then it's ~9 µs per query!
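A minimal sketch of the Sqlite3 approach (the file name is a placeholder; this uses the WITHOUT ROWID schema from the edit above, and plain str values rather than bytes):

import sqlite3

db = sqlite3.connect("strings.db")
db.execute("CREATE TABLE IF NOT EXISTS t(s TEXT PRIMARY KEY) WITHOUT ROWID")
with db:                                              # one transaction for the bulk insert
    db.executemany("INSERT OR IGNORE INTO t VALUES (?)",
                   (("a%09i" % i,) for i in range(10_000_000)))

def in_db(s):
    return db.execute("SELECT 1 FROM t WHERE s = ?", (s,)).fetchone() is not None

print(in_db("a000004567"))   # True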
Bisection in file with sorted lines
Working, and with very fast queries (~ 35 µs per query), without loading the file in memory! See
Bisection search in the sorted lines of an opened file (not loaded in memory)
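The idea, sketched for the simplest case of fixed-length records (10 bytes plus a newline, as in the example at the top) in a file whose lines are already sorted:

import os

RECLEN = 11                                  # 10 data bytes + b"\n"

def in_sorted_file(item, path):
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path) // RECLEN
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECLEN)
            rec = f.read(RECLEN - 1)         # the 10-byte record, newline excluded
            if rec < item:
                lo = mid + 1
            elif rec > item:
                hi = mid
            else:
                return True
        return False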
Dict with prefixes as keys and concatenation of suffixes as values
This is the solution described here: Set of 10-char strings in Python is 10 times bigger in RAM as expected.
The idea is: we have a dict D and, for a given word,
prefix, suffix = word[:4], word[4:]
D[prefix] += suffix + b' '
With this method, the RAM used is even smaller than the actual data (I tested with 30M strings of average length 14, and it used 349 MB), and the queries seem very fast (2 µs), but the initial creation time of the dict is a bit high.
I also tried using a list of suffixes as the dict values, but it consumes much more RAM.
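A sketch of building and querying such a structure (my own code, following the description above; the leading-space trick in contains() avoids matching a suffix that is merely the tail of a longer stored suffix):

from collections import defaultdict

D = defaultdict(bytes)

def add(word):
    prefix, suffix = word[:4], word[4:]
    D[prefix] += suffix + b' '                        # suffixes separated by b' '

def contains(word):
    prefix, suffix = word[:4], word[4:]
    return (b' ' + suffix + b' ') in (b' ' + D.get(prefix, b''))

for i in range(1000):
    add(b"a%09i" % i)
print(contains(b"a000000042"), contains(b"a000999999"))   # True False

Building by repeated += is quadratic-ish; collecting the suffixes per prefix and joining once is faster, but this keeps the sketch close to the description.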
Maybe without pointers to strings, but rather storing the strings directly in the data structure if string length <= 16 characters, etc.
While it is not a set data structure but rather a list, I think pyarrow has a very optimized way of storing a large number of small strings. There is a pandas integration as well, which should make it easy to try out:
https://pythonspeed.com/articles/pandas-string-dtype-memory/
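For instance, something along these lines (a sketch; it needs pyarrow installed and a reasonably recent pandas, and isin is a vectorized membership test rather than a per-item O(1) lookup):

import pandas as pd

words = ["a%09i" % i for i in range(1_000_000)]
s = pd.Series(words, dtype="string[pyarrow]")    # Arrow-backed string storage
print(s.memory_usage(deep=True))                 # compare against a plain Python set
print(s.isin(["a000004567"]).any())              # True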
I'm surely not the Python guru I'd like to be, and I mostly learn by studying and experimenting in my spare time, so it is very likely I'm asking a trivial question for experienced users... yet I really want to understand, and this is a place that helps me a lot.
Now, after the due premise, Python documentation says:
4.6.3. Mutable Sequence Types
s.append(x) appends x to the end of the sequence (same as
s[len(s):len(s)] = [x])
[...]
s.insert(i, x) inserts x into s at the index given by i (same as
s[i:i] = [x])
and, moreover:
5.1. More on Lists
list.append(x) Add an item to the end of the list. Equivalent to
a[len(a):] = [x].
[...]
list.insert(i, x) Insert an item at a given position. The first
argument is the index of the element before which to insert, so
a.insert(0, x) inserts at the front of the list, and a.insert(len(a),
x) is equivalent to a.append(x).
So now I'm wondering why there are two methods to do, basically, the same thing. Wouldn't it have been possible (and simpler) to have just one append/insert(x, i=len(this)), where i would be an optional parameter and, when not present, would mean "add to the end of the list"?
The difference between append and insert here is the same as in normal usage, and in most text editors. Append adds to the end of the list, while insert adds in front of a specified index. The reason they are different methods is both because they do different things, and because append can be expected to be a quick operation, while insert might take a while depending on the size of the list and where you're inserting, because everything after the insertion point has to be shifted.
I'm not privy to the actual reasons insert and append were made different methods, but I would make an educated guess that it is to help remind the developer of the inherent performance difference. Rather than one insert method, with an optional parameter, which would normally run in linear time except when the parameter was not specified, in which case it would run in constant time (very odd), a second method which would always be constant time was added. This type of design decision can be seen in other places in Python, such as when methods like list.sort return None, instead of a new list, as a reminder that they are in-place operations, and not creating (and returning) a new list.
The other answer is great -- but another fundamental difference is that insert is much slower:
$ python -m timeit -s 'x=list(range(100000))' 'x.append(1)'
10000000 loops, best of 3: 0.19 usec per loop
$ python -m timeit -s 'x=list(range(100000))' 'x.insert(len(x), 1)'
1000000 loops, best of 3: 0.514 usec per loop
insert is O(n) in general: inserting somewhere in the middle means everything after that position has to be shifted, so the cost grows with the size of the list. append is amortized O(1), meaning it's always very fast no matter how big your list is. (Inserting at len(x), as in the benchmark above, doesn't actually shift anything, but it still pays the overhead of the more general code path, which is what the timings show.)
I am trying to write a list of the numbers from 0 to 1000000000 as strings, directly to a text file. I would also like each number to have leading zeros up to ten digit places, e.g. 0000000000, 0000000001, 0000000002, 0000000003, ... n. However I find that it is taking much too long for my taste.
I can use seq, but there is no support for leading zeros and I would prefer to avoid using awk and other auxiliary tools to handle these tasks. I am aware of dramatic speed-up benefits just from coding this in C, but I don't want to resort to it. I was considering mapping some functions to a large list and executing them in a loop, however I only have 2GB of RAM available, so please keep this in mind when approaching my problem.
I am using Python-Progressbar, and I am getting an ETA of approximately 2 hours. I would appreciate it if someone can offer me some advice as to how to approach this problem:
pbar = ProgressBar(widgets=[Percentage(), Bar(), ' ', ETA(), ' ', FileTransferSpeed()], maxval=1000000000).start()
with open('numlistbegin','w') as numlist:
    limit, nw, pu = 1000000000, numlist.write, pbar.update
    for x in range(limit):
        nw('%010d\n'%(x,))
        pu(x)
pbar.finish()
EDIT: So I have discovered that the formatting (regardless of which programming language you are using) creates vast amounts of overhead. seq gets the job done quickly, but is much slower with the formatting option (-f). However, if anyone would like to offer a Python solution nonetheless, it would be most welcome.
FWIW seq does have a formatting option:
$ seq -f "%010g" 1 5
0000000001
0000000002
0000000003
0000000004
0000000005
Your long run time might be memory-related. With large ranges, using xrange is more memory-efficient, since it doesn't try to calculate and store the entire range in memory before it starts. See the post titled Should you always favor xrange() over range()?
Edit: In Python 3, range already behaves like xrange, so the distinction is irrelevant there.
I did some experimenting and noticed that writing in larger batches improved the performance by about 30%. I'm not sure why your code is taking 2 hours to generate the file -- unless the progress bar is killing performance. If so, you should apply the same batching logic to updating the progress bar. My old Windows box will create a file one-tenth the required size in about 73 seconds.
# Python 2.
# Change xrange to range for Python 3.
import time

start_time = time.time()
limit = 100000000   # 1/10 your limit.
skip = 1000         # Batch size.
with open('numlistbegin', 'w') as fh:
    for i in xrange(0, limit, skip):
        batch = ''.join('%010d\n' % j for j in xrange(i, i + skip, 1))
        fh.write(batch)
print time.time() - start_time   # 73 sec. (106 sec. without batching).
I'm going to have one small dictionary (between 5 and 20 keys) that will be referenced up to a hundred times or so for one page load, in Python 2.5.
I'm starting to name the keys it will be looking up, and I was wondering whether there is a key-naming convention I could follow to help dict lookup times.
I had to test ;-)
using
f1: integer key 1
f2: short string "one"
f3: long string "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
as one of the keys into a dictionary of length 4, iterating 10,000,000 times and measuring the times, I get this result:
<function f1 at 0xb779187c>
f1 3.64
<function f2 at 0xb7791bfc>
f2 3.48
<function f3 at 0xb7791bc4>
f3 3.65
I.e. no difference...
My code
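The code was only linked, not shown; a minimal reconstruction of that kind of benchmark (the names, padding keys and loop count here are mine, not the original) could look like:

import timeit

candidates = [('f1 (integer key)', 1),
              ('f2 (short string)', "one"),
              ('f3 (long string)', "a" * 42)]

for name, key in candidates:
    # build a 4-entry dict containing the key under test, then time plain lookups
    setup = "d = {%r: 0, 'pad1': 1, 'pad2': 2, 'pad3': 3}; k = %r" % (key, key)
    t = timeit.Timer("d[k]", setup).timeit(number=10000000)
    print "%-20s %.2f" % (name, t)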
There may be sensible names for them that just so happen to produce names whose hashes aren't clashing. However, CPython dicts are already one of the most optimized data structures in the known universe, producing few collisions for most inputs, working well with the hash schemes of other builtin types, resolving clashes very fast, etc. It's extremely unlikely you'll see any benefit at all even if you found something, especially since a hundred lookups aren't really that many.
Take, for example, this timeit benchmark run on my 4-year-old desktop machine (sporting a laughably low-budget dual-core CPU at 3.1 GHz):
...>python -mtimeit --setup="d = {chr(i)*100: i for i in range(15)};\
k = chr(7)*100" "d[k]"
1000000 loops, best of 3: 0.222 usec per loop
And those strings are a dozen times larger than everything that's remotely sensible to type out manually as a variable name. Cutting down the length from 100 to 10 leads to 0.0778 microseconds per lookup. Now measure your page's load speed and compare those (alternatively, just ponder how long the work you're actually doing when building the page will take); and take into account caching, framework overhead, and all these things.
Nothing you do in this regard can make a difference performance-wise, period, full stop.
Because the Python string hash function iterates over the chars (at least if this is still applicable), I'd opt for short strings.
To add another aspect:
for very small dictionaries and heavy timing constraints, the time to compute hashes might be a substantial fraction of the overall time. Therefore, for (say) 5 elements, it might be faster to use an array and a sequential search (of course, wrapped up in some MiniDictionary object), maybe even augmented by a binary search. This might find the element with 2-3 comparisons, which may or may not be faster than hash computation plus one compare.
The break-even point depends on the hash speed, the average number of elements, and the number of hash collisions to expect, so some measurement is required, and there is no "one-size-fits-all" answer.
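A sketch of what such a MiniDictionary could look like (purely illustrative; whether it actually beats a plain dict for 5-20 keys has to be measured):

class MiniDictionary(object):
    """Tiny mapping backed by parallel lists and a linear scan."""
    def __init__(self, items):
        self._keys = [k for k, _ in items]
        self._values = [v for _, v in items]

    def __getitem__(self, key):
        for i, k in enumerate(self._keys):
            if k == key:
                return self._values[i]
        raise KeyError(key)

md = MiniDictionary([('alpha', 1), ('beta', 2), ('gamma', 3)])
print md['beta']   # 2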
Python dictionaries have a fast path for string keys, so use those (rather than, say, tuples). The hash value of a string is cached on the string object itself, so it matters more that you keep reusing the same string objects than what their actual values are; string constants (i.e., strings that appear verbatim in the program and are not the result of a calculation) always remain exactly the same, so as long as you use those, there's no need to worry.
This problem might be relatively simple, but I'm given two text files. One text file contains all the encrypted passwords, encrypted via crypt.crypt in Python. The other contains over 400k normal dictionary words.
The assignment provides 3 different functions that transform strings: one generates all the different permutations of capitalization, one transforms letters to look-alike numbers (e.g. G -> 6, B -> 8), and one reverses a string. Given the 10-20 encrypted passwords in the password file, what is the most efficient way, in Python, to run those functions on every dictionary word in the words file? It is given that all those words, when transformed in whatever way, will encrypt to a password in the password file.
Here is the function which checks if a given string, when encrypted, is the same as the encrypted password passed in:
def check_pass(plaintext,encrypted):
    crypted_pass = crypt.crypt(plaintext,encrypted)
    if crypted_pass == encrypted:
        return True
    else:
        return False
Thanks in advance.
Without knowing details about the underlying hash algorithm and possible weaknesses of the algorithm all you can do is to run a brute-force attack trying all possible transformations of the words in your password list.
The only way to speed up such a brute-force attack is to get more powerful hardware and to split the task and run the cracker in parallel.
On my slow laptop, crypt.crypt takes about 20 microseconds:
$ python -mtimeit -s'import crypt' 'crypt.crypt("foobar", "zappa")'
10000 loops, best of 3: 21.8 usec per loop
so, the brute force approach (really the only sensible one) is "kinda" feasible. By applying your transformation functions you'll get (ballpark estimate) about 100 transformed words per dictionary word (mostly from the capitalization changes), so, about 40 million transformed words out of your whole dictionary. At 20 microseconds each, that will take about 800 seconds, call it 15 minutes, for the effort of trying to crack one of the passwords that doesn't actually correspond to any of the variations; expected time about half that, to crack a password that does correspond.
So, if you have 10 passwords to crack, and they all do correspond to a transformed dictionary word, you should be done in an hour or two. Is that OK? Because there isn't much else you can do except distribute this embarrassingly parallel problem over as many nodes and cores as you can grasp (oh, and, use a faster machine in the first place -- that might buy you perhaps a factor of two or thereabouts).
There is no deep optimization trick that you can add, so the general logic will be that of a triple-nested loop: one level loops over the encrypted passwords, one over the words in the dictionary, one over the variants of each dictionary word. There isn't much difference regarding how you nest things (except the loop on the variants must come within the loop on the words, for simplicity). I recommend encapsulating "give me all variants of this word" as a generator (for simplicity, not for speed) and otherwise minimizing the number of function calls (e.g. there is no reason to use that check_pass function since the inline code is just as clear, and will be microscopically faster).
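A sketch of that structure (variants() is a placeholder for the three given transformation functions combined; it is not code from the assignment):

import crypt

def variants(word):
    # placeholder generator: plug in the capitalization permutations,
    # letter-to-digit substitutions and reversal here
    yield word

def crack(encrypted_passwords, dictionary_words):
    cracked = {}                                   # encrypted -> recovered plaintext
    for word in dictionary_words:
        for candidate in variants(word):           # variants loop inside the words loop
            for enc in encrypted_passwords:
                # the encrypted password doubles as the salt, exactly as in check_pass
                if enc not in cracked and crypt.crypt(candidate, enc) == enc:
                    cracked[enc] = candidate
    return cracked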