Time complexity of a Python brute-force password script - python

I am writing a small script that guesses numeric passwords (including ones with leading zeros). The script works fine but I am having trouble understanding what the worst case time complexity would be for this algorithm. Any insight on the complexity of this implementation would be great, thanks.
import hashlib
import itertools

def bruteforce(cipherText):
    # Try every numeric password, shortest first (pLen == 0 is the empty string).
    for pLen in itertools.count():
        for password in itertools.product("0123456789", repeat=pLen):
            candidate = "".join(password)
            if hashlib.sha256(candidate.encode()).hexdigest() == cipherText:
                return candidate

First, it's always possible that you're going to find a hash collision before you find the right password. And, for a long-enough input string, this is guaranteed. So, really, the algorithm is constant time: it will complete in about 2^256 steps no matter what the input is.
But this isn't very helpful when you're asking about how it scales with more reasonable N, so let's assume we had some upper limit that was low enough where hash collisions weren't relevant.
Now, if the password is length N, the outer loop will run N times.*
* I'm assuming here that the password really will be purely numeric. Otherwise, of course, it'll fail to find the right answer at N and keep going until it finds a hash collision.
How long does the inner loop take? Well, the main thing it does is iterate each element in product("0123456789", repeat=pLen). That just iterates the Cartesian product of the 10-character digit string with itself pLen times; in other words, there are 10^pLen elements in the product.
Since 10**pLen is greater than sum(10**i for i in range(pLen)) (e.g., 100000 > 11111), we can ignore all but the last time through the outer loop, so that 10**pLen is the total number of inner loops.
The stuff that it does inside each inner loop is all linear on pLen (joining a string, hashing a string) or constant (comparing two hashes), so there are (10^pLen)*pLen total steps.
So, the worst-case complexity is exponential: O(10^N). Since 10^N = 2^(log2(10)*N), i.e., 2 raised to a constant times N, this is exponential time; strictly speaking O(10^N) is a larger class than O(2^N), but both count as exponential.
If you want to combine the two together, you could call this O(2^min(log2(10)*N, 256)). Which, again, is constant-time (since the asymptote is still 2^256), but shows how it's practically exponential if you only care about smaller N.
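As a quick sanity check of the claim that the final pass dominates, here is a minimal sketch (my own illustration, not part of the question) that counts the total number of candidates hashed up to and including length N and compares it with 10**N:
def total_candidates(N):
    # Candidates of every length from 0 through N: sum of 10**i.
    return sum(10**i for i in range(N + 1))

for N in (3, 6, 9):
    total = total_candidates(N)
    print(N, total, total / 10**N)   # ratio tends to 10/9, so 10**N dominates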

Related

Two number sum: why doesn't anybody do it this way?

I was looking for the solution to the "two number sum" problem, and I saw everybody using two for loops. Another way I saw was using a hash table:
def twoSumHashing(num_arr, pair_sum):
    sums = []
    hashTable = {}
    for i in range(len(num_arr)):
        complement = pair_sum - num_arr[i]
        if complement in hashTable:
            print("Pair with sum", pair_sum, "is: (", num_arr[i], ",", complement, ")")
        hashTable[num_arr[i]] = num_arr[i]

# Driver Code
num_arr = [4, 5, 1, 8]
pair_sum = 9

# Calling function
twoSumHashing(num_arr, pair_sum)
But why does nobody discuss this solution?
def two_num_sum(array, target):
    for num in array:
        match = target - num
        if match in array:
            return [match, num]
    return "no result found"
When using a hash table we have to store values in the hash table, but here there is no need for that.
1) Does that affect the time complexity of the solution?
2) Looking up a value in a hash table is easy compared to an array, but if there are a huge number of values, does storing them in a hash table take more space?
First of all, the second function you provide as a solution is not correct and does not return a complete list of answers.
Second, as a Pythonista, it's better to say "dictionary" rather than "hash table": a Python dictionary is one implementation of a hash table.
Anyhow, regarding the other questions that you asked:
Using two for loops is a brute-force approach and is usually not an optimal approach in practice. Dictionary lookups are much faster than list lookups in Python, so as far as time complexity goes, the dictionary is the clear winner.
From the point of view of space complexity, a dictionary certainly allocates more memory, but on current hardware that is rarely a problem, even for billions of numbers. It comes down to your situation: whether speed or memory is more important to you.
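For reference, here is a hedged sketch (my own illustration, not code from the question) of how the dictionary approach can be adjusted to return every pair in a single O(n) pass; the function name and return format are arbitrary choices:
def two_sum_all_pairs(num_arr, pair_sum):
    # Remember every value seen so far in a dictionary, and report a pair
    # whenever the current value's complement has already been seen.
    seen = {}
    pairs = []
    for num in num_arr:
        complement = pair_sum - num
        if complement in seen:
            pairs.append((complement, num))
        seen[num] = True
    return pairs

print(two_sum_all_pairs([4, 5, 1, 8], 9))   # [(4, 5), (1, 8)]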
First function
uses O(n) time complexity, as you iterate over the n members of the array once
uses O(n) space complexity: the only matching pair might be the first and last elements, in which case you end up storing up to n-1 numbers in the hash table
Second function
uses O(n^2) time complexity: you iterate over the array, and for each element use in, which calls __contains__ on the list and is O(n) in the worst case.
So the second function is like doing two loops to brute-force the solution.
Another thing to point out about the second function is that you don't return all the pairs, just the first pair you find.
You could try to fix that by searching only from the index after num, but you will still get duplicates.
This all comes down to a preference for what's more important: time complexity or space complexity.
This is one of many interview-preparation questions where you need to explain why you would use function two (if it worked properly) over function one, and vice versa.
Answers to your questions
1. When using a hash table we have to store values in the hash table, but here there is no need for that. Does that affect the time complexity of the solution?
Yes: the time complexity is now O(n^2), which is worse.
2. Looking up a value in a hash table is easy compared to an array, but if there are a huge number of values, does storing them in a hash table take more space?
In a computer, numbers are just bits. Larger numbers can take up more space because they need more bits to represent them, but the cost of storing them is the same no matter where you store them.
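If you want to see that last point for yourself, here is a small sketch (my own, using CPython's sys.getsizeof; exact byte counts vary by build):
import sys

# CPython integers are variable-size objects: larger values need more bits,
# so the int object itself grows...
print(sys.getsizeof(1))        # e.g. 28 bytes on a 64-bit CPython build
print(sys.getsizeof(10**100))  # considerably larger

# ...but a list and a dictionary both store only references to those objects,
# so the cost of the numbers themselves is the same wherever you keep them.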

Questions regarding code in a Data Structures and Algorithms book

I have been refreshing my knowledge of data structures and algorithms using a book. I came across some sample code in the book, including some run time analysis values that I cannot seem to make sense of. I don't know if I am overthinking this or I am missing something extremely simple. Please help me out.
This is the case where they explain the logic behind adding an element at a specific index in a list. I get the logic, it is pretty simple. Move all the elements, starting with the rightmost one, one index to the right to make space for the element at the index. The code for this in the book is given by:
for j in range(self._n, k, -1):
    self._A[j] = self._A[j-1]
What I do not get is the range of the loop. Technically, self._n is equivalent to len(list) (an internal state maintained by the list). And if you start at len(list), you immediately get an IndexError. Secondly, even if that were not true, the first pass of the loop overwrites the element at index n with the one at index n-1; nowhere does it first move the element at n up to n+1, so that value would be lost. Am I missing something here? I actually tried these conditions out in the Python interpreter and they seem to validate my understanding.
Some of the run time analyses for list operations seems to confuse me. For example:
data.index(value) --> O(k+1)
value in data --> O(k+1)
data1 == data2 (similarly !=, <, <=, >, >=) --> O(k+1)
data[j:k] --> O(k−j+1)
I do not get why the +1 at the end of each running time analysis. Let us consider the data.index(value) operation, which basically returns the first index at which a certain value is found. In the worst case, it should iterate through all n elements of the list, but if not, if the search finds something at index k, then it returns from there. Why the O(k+1) there? The same logic applies to the other cases too, especially the list slicing. When you slice a list, isn't it just O(k-j)? After all, the actual indices are j to k-1.
This understanding should be quite elementary and I really feel silly not being able to understand it. Or I don't know if there are genuine errata in the book and I understand it correctly. Could someone please clarify this for me? Help is much appreciated.
Note (from the comments): the book in question is Data Structures and Algorithms in Python by Goodrich, Tamassia and Goldwasser, and the questions are about pages 202 to 204.
If you actually look at the whole definition of insert from the book, it makes more sense.
def insert(self, k, value):
    if self.n == self.capacity:
        self.resize(2 * self.capacity)
    for j in range(self.n, k, -1):
        self.A[j] = self.A[j-1]
    self.A[k] = value
    self.n += 1
The first line implies that self.n is the number of elements, and corresponds to the index past-the-end, which means that, for a user of the list, accessing it at that index would be erroneous. But this code belongs to the list, and because it has a capacity in addition to a size, it can use self.A[n] if self.n < self.capacity (which is true when the for loop starts).
The first iteration of the loop simply moves the last element (at index n-1) to the next space in memory, which is out of bounds for a user, but not internally; the remaining iterations shift the rest of the slice one slot to the right. At the end, n is incremented to reflect the new size, and n-1 becomes the index of that "next space in memory", which now contains the last element.
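To see the shift concretely, here is a minimal, self-contained sketch (my own illustration, not from the book) using a plain Python list padded with None to stand in for the unused capacity:
# Backing array with capacity 8, current size n = 4; slots past n are unused.
A = [10, 20, 30, 40, None, None, None, None]
n = 4
k = 1          # position to insert at
value = 15

# Shift A[k..n-1] one slot to the right, starting from the rightmost element.
# Writing to A[n] is safe because n < capacity.
for j in range(n, k, -1):
    A[j] = A[j - 1]

A[k] = value
n += 1
print(A[:n])   # [10, 15, 20, 30, 40]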
As for the time complexity of the different operations: well, they are not incorrect. Even though O(n+1) = O(n), you can still write O(n+1) if you want to, and it might be more "precise" in some cases.
For example, it is written that data.index(value) has a complexity of O(k+1), with k the index of the value being searched for. Well, if that value is at the very beginning, then k = 0 and the complexity is O(0+1) = O(1). And it's true: if you always search for a value that you know is at the very beginning, even though this operation is pointless, it has a constant time complexity. If you wrote O(k) instead, then you would get O(0) for that operation, which I have never seen used, but which would suggest that the operation is instantaneous.
The same thing happens for slicing: they probably wrote O(k−j+1) because if you only take one element, then j = k and the complexity is O(1) instead of O(0).
Note that time complexity isn't usually defined in terms of the actual indices of a particular application of the function, but instead in terms of the total number of elements in the container on which the function is used. You can think of it as the mean complexity for using the function with every possible index, which in the cases of index and slicing, is simply O(n).
For the first case, I think the assumption is that you have a list of fixed maximum length and you are supposed to lose the last data point. Also, are you certain that self._n == len(list) and not self._n == len(list) - 1?
For the second case, as far as I understand, O(k+1) is the same as O(k), so it doesn't make much sense to say O(k+1). But if we really want to know how someone might count up to k+1, I would guess they are counting from 0, so going from the 0th to the kth index takes k+1 operations.
This is just an opinion, and an uninformed one, so please take it with a tablespoon of salt. I don't think that book is any good.

Count number of distinct subarrays with at most m even elements

You are given an array A of integers, each of which is in the range [0, 1000], along with some number m. For example, you might get this input:
A=[5,6,7,8] m=1
The question is to determine, as efficiently as possible, how many distinct, nonempty subarrays there are of the array A that contain at most m even numbers. For example, for the above array, there are eight distinct subarrays with at most one even number, as shown here:
[(5, 6, 7), (6, 7), (5, 6), (8), (5), (6), (7), (7, 8)]
Here's the solution I have so far, which runs in time O(n^3):
def beautiful(A, m):
    subs = [tuple(A[i:j]) for i in range(0, len(A)) for j in range(i + 1, len(A) + 1)]
    uniqSubs = set(subs)
    return len([n for n in uniqSubs if sum(int(i) % 2 == 0 for i in n) <= m])
Is there a better solution to this problem - ideally, one that runs in linear time, or at least O(n^2)?
I believe you can do this in linear time by using suffix trees. This is certainly not a lightweight solution - good luck coding up a linear-time algorithm for building a suffix tree with a variable-size alphabet! - but it shows that it's possible.
Here's the idea. Start by building a suffix tree for the array, treating it not as a list of numbers, but rather as a string of characters, where each character is a number. Since you know all the numbers are at most 1,000, the number of distinct characters is a constant, so using a fast suffix tree construction algorithm (for example, SA-IS), you can build the suffix tree in time O(n).
Suffix trees are a nice structure here because they collapse repeated copies of the same substrings together into overlapping groups, which makes it easier to deduplicate things. For example, if the pattern [1, 3, 7] appears multiple times in the array, then the root will contain exactly one path starting with [1, 3, 7].
The question now is how to go from the suffix tree to the number of distinct subarrays. For now, let's tackle an easier question - how do you count up the number of distinct subarrays, period, completely ignoring the restriction on odd and even numbers? This, fortunately, turns out to be a well-studied problem that can be solved in linear time. Essentially, every prefix encoded in the suffix tree corresponds to a distinct subarray of the original array, so you just need to count up how many prefixes there are. That can be done by recursively walking the tree, adding up, for each edge in the tree, how many characters are along that edge. This can be done in time O(n) because a suffix tree for an array/string of length n has O(n) nodes, and we spend a constant amount of time processing each node (just by looking at the edge above it.)
So now we just need to incorporate the restriction on the number of even numbers you're allowed to use. This complicates things a little bit, but the reason why is subtle. Intuitively, it seems like this shouldn't be a problem. We could, after all, just do a DFS of the suffix tree and, as we go, count the number of even numbers on the path we've traversed, stopping as soon as we exceed m.
The problem with this approach is that even though the suffix tree has O(n) nodes in it, the edges, implicitly, encode ranges whose lengths can be as high as n itself. As a result, the act of scanning the edges could itself blow the runtime up to Ω(n^2): visiting Θ(n) edges and doing Ω(n) work per edge.
We can, however, speed things up a little bit. Each edge in a suffix tree is typically represented as a pair of indices [start, stop] into the original array. So let's imagine that, as an additional preprocessing step, we build a table Evens such that Evens[n] returns the number of even numbers in the array up to and including position n. Then we can count the number of even numbers in any range [start, stop] by computing Evens[stop] - Evens[start - 1] (treating Evens[-1] as 0). That takes time O(1), and it means that we can aggregate the number of even numbers we encounter along a path in time proportional to the number of edges followed, not the number of characters encountered.
... except that there's one complication. What happens if we have a very long edge where, prior to reading that edge, we know that we're below the even number limit, and after reading that edge, we know that we're above the limit? That means that we need to stop partway through the edge, but we're not sure exactly where that is. That might require us to do a linear search over the edge to find the crossover point, and there goes our runtime.
But fortunately, there's a way out of that little dilemma. (This next section contains an improvement found by @Matt Timmermans.) As part of the preprocessing, in addition to the Evens array, build a second table KthEven, where KthEven[k] returns the position of the kth even number in the array. This can be built in time O(n) using the Evens array. Once you have this, let's imagine that you have a bad edge, one that will push you over the limit. If you know how many even numbers you've encountered so far, you can determine the index of the even number that will push you over the limit. Then, you can look up where that even number is by indexing into the KthEven table in time O(1). This means that we only need to spend O(1) work per edge in the suffix tree, pushing our runtime down to O(n)!
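As a concrete illustration of the two preprocessing tables, here is a hedged sketch (the names Evens and KthEven follow the answer, but the exact indexing conventions are my own choice):
def build_even_tables(A):
    # Evens[i]   = number of even values in A[0..i] (inclusive prefix counts).
    # KthEven[k] = index in A of the (k+1)-th even value (0-based k).
    Evens, KthEven = [], []
    count = 0
    for i, x in enumerate(A):
        if x % 2 == 0:
            count += 1
            KthEven.append(i)
        Evens.append(count)
    return Evens, KthEven

def evens_in_range(Evens, start, stop):
    # Number of even values in A[start..stop], in O(1).
    return Evens[stop] - (Evens[start - 1] if start > 0 else 0)

A = [5, 6, 7, 8]
Evens, KthEven = build_even_tables(A)
print(Evens)                         # [0, 1, 1, 2]
print(KthEven)                       # [1, 3]
print(evens_in_range(Evens, 1, 3))   # 2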
So, to recap, here's a linear-time solution to this problem:
Build a suffix tree for the array using a fast suffix tree construction algorithm, like SA-IS or Ukkonen's algorithm. This takes time O(n) because there are at most 1,000 different numbers in the string, and 1,000 is a constant.
Compute the table Evens in time O(n).
Compute the table KthEven in time O(n).
Do a DFS over the tree, keeping track of the number of even numbers encountered so far. When encountering an edge [start, stop], compute how many even numbers are in that range using Evens in time O(1). If that's below the limit, keep recursing. If not, use the KthEven table to figure out how much of the edge is usable in time O(1). Either way, increment the global count of the number of distinct subarrays by the usable length of the current edge. This does O(1) work for each of the O(n) edges in the suffix tree, for a total of O(n) work.
Phew! That wasn't an easy problem. I imagine there's some way to simplify this construction, and I'd welcome comments and suggestions about how to do this. But it shows that it is indeed possible to solve this problem in O(n) time, which is not immediately obvious!

Is there a form of 'for' that evaluates the "expression list" every time?

lt = 1000                                       # list primes to ...
remaining = list(range(2, lt + 1))              # remaining primes
for c in remaining:                             # current "prime" being tested
    for t in remaining[0:remaining.index(c)]:   # test divisor
        if c % t == 0 and c != t:
            if c in remaining:
                remaining.remove(c)
If you don't need context:
How can I either re-run the same target-list value, or use something other than for that reads the expression list every iteration?
If you need context:
I am currently creating a program that lists primes from 2 up to a given value (lt). I have a list 'remaining' that starts as all integers from 2 to the given value. One at a time, it takes a value 'c' from the list and tests it for divisibility by each smaller number 't' on the list. If 'c' is divisible by 't', it removes 'c' from the list. By the end of the program, in theory, only primes should remain. But I have run into the problem that, because I am removing items from the list and for only reads remaining once, for skips values in remaining and thus leaves composites in the list.
What you're trying to do is almost never the right answer (and it's definitely not the right answer here, for reasons I'll get to later), which is why Python doesn't give you a way to do it automatically. In fact, it's illegal to delete from or insert into a list while you're iterating over it, even if CPython and other Python implementations usually don't check for that error.
But there is a way you can simulate what you want, with a little verbosity:
for i in range(remaining.index(c)):
    if i >= remaining.index(c): break
    t = remaining[i]
Now we're not iterating over remaining, we're iterating over its indices. So, if we remove values, we'll be iterating over the indices of the modified list. (Of course we're not really relying on the range there, since the if…break tests the same thing; if you prefer for i in itertools.count():, that will work too.)
And, depending on what you want to do, you can expand it in different ways, such as:
end = remaining.index(c)
for i in range(end):
    if i >= end: break
    t = remaining[i]
    # possibly subtract from end within the loop
    # so we don't have to recalculate remaining.index(c)
… and so on.
However, as I mentioned at the top, this is really not what you want to be doing. If you look at your code, it's not only looping over all the primes less than c, it's calling a bunch of functions inside that loop that also loop over either all the primes less than c or your entire list (that's how index, remove, and in work for lists), meaning you're turning linear work into quadratic work.
The simplest way around this is to stop trying to mutate the original list to remove composite numbers, and instead build a set of primes as you go along. You can search, add, and remove from a set in constant time. And you can just iterate your list in the obvious way because you're no longer mutating it.
Finally, this isn't actually implementing a proper prime sieve, but a much less efficient algorithm that for some reason everyone has been teaching as a Scheme example for decades and more recently translating into other languages. See The Genuine Sieve of Eratosthenes for details, or this project for sample code in Python and Ruby that shows how to implement a proper sieve and a bit of commentary on performance tradeoffs.
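For comparison, here is a minimal sketch of a conventional Sieve of Eratosthenes (my own illustration of the idea the linked article describes, not code taken from that project):
def primes_up_to(lt):
    # Classic Sieve of Eratosthenes: cross off multiples of each prime.
    is_prime = [True] * (lt + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(lt ** 0.5) + 1):
        if is_prime[p]:
            for multiple in range(p * p, lt + 1, p):
                is_prime[multiple] = False
    return [n for n in range(2, lt + 1) if is_prime[n]]

print(primes_up_to(30))   # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]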
(In the following, I ignore the XY problem of finding primes using a "mutable for".)
It's not entirely trivial to design an iteration over a sequence with well-defined (and efficient) behavior when the sequence is modified. In your case, where the sequence is merely being depleted, one reasonable thing to do is to use a list but "delete" elements by replacing them with a special value. (This makes it easy to preserve the current iteration position and avoids the cost of shifting the subsequent elements.)
To make it efficient to skip the deleted elements (both for the outer iteration and any inner iterations like in your example), the special value should be (or contain) a count of any following deleted elements. Note that there is a special case of deleting the current element, where for maximum efficiency you must move the cursor while you still know how far to move.
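A simplified sketch of that idea, using a plain None tombstone and leaving out the follow-on-count optimisation described above (so it is illustrative rather than maximally efficient):
def iterate_with_tombstones(items):
    # "Delete" by overwriting with None; iteration skips tombstones and its
    # position stays valid even as elements are logically removed.
    i = 0
    while i < len(items):
        if items[i] is not None:
            yield i, items[i]
        i += 1

data = [2, 3, 4, 5, 6]
for i, value in iterate_with_tombstones(data):
    if value % 2 == 0 and value != 2:
        data[i] = None        # logical delete: no shifting of later elements
print([x for x in data if x is not None])   # [2, 3, 5]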

Maximal Length of List to Shuffle with Python random.shuffle?

I have a list which I shuffle with Python's built-in shuffle function (random.shuffle).
However, the Python reference states:
Note that for even rather small len(x), the total number of permutations of x is larger than the period of most random number generators; this implies that most permutations of a long sequence can never be generated.
Now, I wonder what this "rather small len(x)" means. 100, 1000, 10000,...
TL;DR: It "breaks" on lists with over 2080 elements, but don't worry too much :)
Complete answer:
First of all, notice that "shuffling" a list can be understood (conceptually) as generating all possible permutations of the elements of the list, and picking one of these permutations at random.
Then, you must remember that all self-contained computerised random number generators are actually "pseudo" random. That is, they are not actually random, but rely on a series of factors to try to generate a number that is hard to guess in advance, or to reproduce on purpose. Among these factors is usually the previously generated number. So, in practice, if you use a random generator continuously a certain number of times, you'll eventually start getting the same sequence all over again (this is the "period" that the documentation refers to).
Finally, the docstring on Lib/random.py (the random module) says that "The period [of the random number generator] is 2**19937-1."
So, given all that, if your list is such that there are 2**19937 or more permutations, some of these will never be obtained by shuffling the list. You'd (again, conceptually) generate all permutations of the list, then generate a random number x, and pick the xth permutation. Next time, you generate another random number y, and pick the yth permutation. And so on. But, since there are more permutations than you'll get random numbers (because, at most after 2**19937-1 generated numbers, you'll start getting the same ones again), you'll start picking the same permutations again.
So, you see, it's not exactly a matter of how long your list is (though that does enter into the equation). Also, 2**19937-1 is quite a large number. But, still, depending on your shuffling needs, you should bear all that in mind. In a simple case (and with a quick calculation), for a list without repeated elements, 2081 elements would yield 2081! permutations, which is more than 2**19937.
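If you want to find that threshold programmatically rather than checking individual values by hand, here is a quick sketch (my own check; it just reproduces the 2081 figure quoted in this thread):
import math

# Smallest list length whose number of permutations (n!) exceeds the
# Mersenne Twister period of 2**19937 - 1.
period = 2**19937 - 1
n = 1
while math.factorial(n) <= period:
    n += 1
print(n)   # 2081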
I wrote that comment in the Python source originally, so maybe I can clarify ;-)
When the comment was introduced, Python's Wichmann-Hill generator had a much shorter period, and we couldn't even generate all the permutations of a deck of cards.
The period is astronomically larger now, and 2080 is correct for the current upper bound. The docs could be beefed up to say more about that - but they'd get awfully tedious.
There's a very simple explanation: A PRNG of period P has P possible starting states. The starting state wholly determines the permutation produced. Therefore a PRNG of period P cannot generate more than P distinct permutations (and that's an absolute upper bound - it may not be achieved). That's why comparing N! to P is the correct computation here. And, indeed:
>>> math.factorial(2080) > 2**19937 - 1
False
>>> math.factorial(2081) > 2**19937 - 1
True
What they mean is that the number of permutations of n objects (written n!) grows absurdly fast.
Basically n! = n × (n-1) × ... × 1; for example, 5! = 5 × 4 × 3 × 2 × 1 = 120, which means there are 120 possible ways of shuffling a 5-item list.
On the same Python page documentation they give 2^19937-1 as the period, which is 4.something × 10^6001 or something. Based on the Wikipedia page on factorials, I guess 2000! should be around that. (Sorry, I didn't find the exact figure.)
So basically there are so many possible permutations for shuffle to pick from that there's probably no real reason to worry about the ones it will never produce.
But if it really is an issue (pesky customer asking for a guarantee of randomness perhaps?), you could also offload the task to some third-party; see http://www.random.org/ for example.
