I have just finished writing an optimised solution for Project Euler's fourth problem. While I was implementing the algorithm, I experienced some internal conflict over design choices. I was uncertain whether I should store the result of an operation in its own variable for later reference, or skip the variable and recompute the product of the two operands whenever it is required. Here is a snippet of the code:
product = x * y
if (checkPalindrome(product) and product > largest_product):
    largest_product = product
The result is stored in 'product' and is referenced in the following lines. What I am curious about is whether this is considered better practice than recomputing the product wherever it is needed, like this:
if (checkPalindrome(x * y) and x * y > largest_product):
    largest_product = x * y
Can this difference in implementation yield a difference in space, or time performance when scaled?
Arithmetic in Python isn't particularly fast, so it's better to avoid performing the same computation multiple times. BTW, rather than determining the maximum product "by hand" you could do it with the built-in max function.
I should also mention that you aren't saving much RAM by avoiding doing product = x * y. The code still has to create an int object to hold the result of x * y anyway, binding that object to a name doesn't consume much RAM. OTOH, performing the same calculation 3 times not only wastes time, it means that 3 objects need to be created (and recycled) to store the result.
I suggest you take a look at Other languages have "variables", Python has "names". For a more in-depth examination of this important topic, please see Facts and myths about Python names and values, which was written by SO veteran Ned Batchelder.
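As a minimal sketch of the max suggestion above, the search can be written as a generator expression that binds each product to a name once and then reuses it. The helper names is_palindrome and largest_palindrome_product are made up for this illustration, and the := operator assumes Python 3.8 or newer.

```python
def is_palindrome(n):
    """Return True if the decimal digits of n read the same both ways."""
    digits = str(n)
    return digits == digits[::-1]


def largest_palindrome_product(lo, hi):
    """Largest palindromic x * y with lo <= x <= y < hi, or None if there is none."""
    return max(
        (
            product
            for x in range(lo, hi)
            for y in range(x, hi)
            # bind the product to a name once, then reuse the name
            if is_palindrome(product := x * y)
        ),
        default=None,
    )


print(largest_palindrome_product(10, 100))  # two-digit factors
```

Calling largest_palindrome_product(100, 1000) covers the three-digit factors of the original problem; it is slower but uses the same code.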
A number takes an amount of storage proportional to its length (when printed out). For modest sized numbers, up to 20 digits say, you are not likely to notice this effect at all. Above that you are not likely to notice the effect unless you have huge numbers (thousands of digits) or lots of them.
The multiplication of two numbers takes... well, this is an active area of research, but for these purposes let's say it takes time proportional to the square of the length (when printed out) of the longest number (if they are similar in length). But again, you aren't likely to notice this effect unless you have huge numbers.
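A quick way to see the storage effect described above is to ask CPython for the size of a few int objects. This is only a sketch: sys.getsizeof reports CPython-specific object sizes, and int_size_bytes is a name invented for this example.

```python
import sys


def int_size_bytes(digits):
    """Size in bytes of the int object holding 10 ** digits (CPython-specific)."""
    return sys.getsizeof(10 ** digits)


# Small numbers differ by a few bytes at most; the growth only
# becomes visible once the numbers have thousands of digits.
for digits in (1, 20, 1000, 100000):
    print(digits, int_size_bytes(digits))
```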
As others have observed, multiplication is not quick in Python, so that's a concern.
I suggest you write what's clearest and then clean up your performance problems when you encounter them.
In practice, the first approach is best. Here you are multiplying small integers, so the latter approach doesn't cost much. But when you run this inside a loop that computes many products, it really matters.
Let's assume the loop runs 10 times, and let's loosely write O(1) for the cost of one multiplication (strictly speaking O(2) and O(1) are the same class; the notation is used here only to count operations). In the second approach each iteration performs roughly 2 such calculations, so it costs O(2). Over 10 iterations that is O(20).
if (checkPalindrome(x * y) and x * y > largest_product): # O(1)
    largest_product = x * y # O(1)
# total O(2) for two calculations
In the first approach, the calculation is done only once and the stored value is reused in the later steps, so it costs O(1) for the calculation and essentially nothing extra when the value is referenced in the condition. Over 10 iterations that is O(10). Thus you save about 50 % of the multiplication time.
product = x * y # O(1) for one calculation
if (checkPalindrome(product) and product > largest_product):
    largest_product = product
Yes, storing a variable needs some memory, so if memory is a constraint you might think of the second approach. Likewise, if the product is only needed at a single point, the second approach is fine. But even when memory is constrained, a single extra variable does not take a significant amount of it. So, either way, I find the first approach (calculating once and storing, rather than calculating each time) the best and most efficient.
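Rather than estimating, the two variants can simply be timed. The sketch below assumes a straightforward check_palindrome (the question does not show its implementation) and uses made-up function names store_once and recompute; the actual timings depend on the machine and Python version.

```python
import timeit


def check_palindrome(n):
    """Assumed implementation: compare the decimal string with its reverse."""
    s = str(n)
    return s == s[::-1]


def store_once(pairs):
    """First approach: compute the product once and reuse the name."""
    largest = 0
    for x, y in pairs:
        product = x * y
        if check_palindrome(product) and product > largest:
            largest = product
    return largest


def recompute(pairs):
    """Second approach: recompute x * y wherever it is needed."""
    largest = 0
    for x, y in pairs:
        if check_palindrome(x * y) and x * y > largest:
            largest = x * y
    return largest


pairs = [(x, y) for x in range(900, 1000) for y in range(x, 1000)]
for func in (store_once, recompute):
    seconds = timeit.timeit(lambda: func(pairs), number=20)
    print(f"{func.__name__}: {seconds:.3f} s")
```

Most of the time here is likely spent in check_palindrome rather than in the multiplication, which is itself a useful thing to learn before optimising.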
For the sake of the argument, consider the following (very bad) sorting algorithm in Python:
def so(ar):
    while True:
        le = len(ar)
        switch = False
        for y in range(le):
            if y+1 == le:
                break
            if ar[y] > ar[y+1]:
                ar[y],ar[y+1] = ar[y+1],ar[y]
                switch = True
        if switch == False:
            break
    return ar
I'm trying to understand the concept of "complexity of the algorithm" and there is one thing I don't get.
I came across the post that explains how to find the complexity of the algorithm here:
You add up how many machine instructions it will execute as a function
of the size of its input, and then simplify the expression to the
largest (when N is very large) term and can include any simplifying
constant factor.
But well, the problem is, that I cannot calculate how many machine instructions will be executed just
by knowing the length of the list.
Consider first example:
import random
import time

li = [random.randint(1,5000) for x in range(3000)]
start = time.time()
so(li)
end = time.time() - start
print(end)
Output: 2.96921706199646
Now have a look at the second example:
ok = [5000,43000,232] + [x for x in range(2997)]
start = time.time()
so(ok)
end = time.time() - start
print(end)
Output: 0.010689020156860352
We can see that the same sorting algorithm, two different lists, lists are the same length, and two completely different execution times.
When people are talking about algorithm complexity (big O notation) they normally assume that the only variable that determines complexity of the algo is the size of the input, but clearly, in the example above it is not the case. It is not only the size of the list, but also the positioning of each value within the list that determines the speed of the sorting.
So my question is, why do we only consider size of input when estimating complexity?
And, if it is possible, can you tell me what the complexity of the algorithm above will be?
You're correct, complexity doesn't only depend on N. That's why you'll often see indications about average, worst and best cases.
Timsort is used in Python because it's O(n log n) on average, still fast in the worst case (O(n log n)) and extremely fast in the best case (O(n), when the list is already sorted).
Quicksort also has an average complexity of O(n log n), but its worst case is O(n²). With a naive pivot choice (for example, always the first element) that worst case is hit when the list is already sorted. This use case happens very often, so it might be worth it to actually shuffle the list before sorting it!
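To make the quicksort remark concrete, here is a small sketch that counts pivot comparisons for a deliberately naive quicksort that always picks the first element as pivot. This is an illustrative variant written for this example, not the implementation any particular library uses.

```python
import random


def quicksort_comparisons(items):
    """Count pivot comparisons made by a quicksort that always picks the first element."""
    if len(items) <= 1:
        return 0
    pivot, rest = items[0], items[1:]
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x >= pivot]
    # one comparison against the pivot per remaining element
    return len(rest) + quicksort_comparisons(smaller) + quicksort_comparisons(larger)


n = 200
already_sorted = list(range(n))
shuffled = already_sorted[:]
random.Random(0).shuffle(shuffled)  # fixed seed so the run is repeatable

print(quicksort_comparisons(already_sorted))  # n * (n - 1) / 2 comparisons
print(quicksort_comparisons(shuffled))
```

On the sorted input every pivot splits off an empty "smaller" part, so the recursion is n levels deep and the comparison count is quadratic; the shuffled input needs far fewer.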
why do we only consider size of input when estimating complexity?
In the narrow sense of complexity, as used with Big O notation in computer science, it is simply by definition:
In computer science, big O notation is used to classify algorithms according to how their running time or space requirements grow as the input size grows.
In the broader sense your question could be interpreted as "why do we use Big O notation to describe algorithm complexity when the nature of the data can be just as important as its size."
The answer here lies in the fact that algorithm development is often done on small datasets to make it easy, while in the real world the datasets are huge. When you are writing your sorting function you're most likely going to try it first on small lists of random data. You'd want the result small enough that you can verify that it worked by simply looking at the result...
Time complexity does not depend only on the size of the input. When we look at sorting algorithms run on randomized input, the pattern of the input can play a significant role in determining the running time.
We usually calculate time complexity in terms of the best, average and worst case, and we can study it for the specific input orders or patterns that lead to each of those cases.
For example, in the first case you provided, the input is random, so (assuming distinct values) any particular ordering occurs with probability 1/n!. The best case (when the list is already sorted) is Ω(n) and the worst case (when the list is sorted in reverse) is O(n²), but the probability of hitting exactly the best or the worst case is low.
Therefore, the sorting algorithm has Θ(n²) average time complexity, since with randomly distributed numbers a large share of the adjacent comparisons end in a swap.
In the second case, the input is almost entirely ordered, which means it falls close to either the best or the worst case. Here it tends towards the best case, hence the much shorter time.
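The best and worst cases can be observed directly by counting comparisons instead of measuring wall-clock time. The sketch below is an instrumented copy of the sort from the question; the name count_comparisons is invented for this example.

```python
def count_comparisons(ar):
    """Run the sort from the question on a copy of ar; return (sorted_list, comparisons)."""
    ar = list(ar)
    comparisons = 0
    while True:
        switched = False
        for y in range(len(ar) - 1):
            comparisons += 1
            if ar[y] > ar[y + 1]:
                ar[y], ar[y + 1] = ar[y + 1], ar[y]
                switched = True
        if not switched:
            break
    return ar, comparisons


n = 10
print(count_comparisons(range(n))[1])            # best case: one pass, n - 1 comparisons
print(count_comparisons(range(n, 0, -1))[1])     # worst case: n passes, n * (n - 1) comparisons
```

The same input length gives a linear count for sorted input and a quadratic count for reversed input, which is why best, average and worst cases are quoted separately.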
Building a string through repeated string concatenation is an anti-pattern, but I'm still curious why its performance switches from linear to quadratic after string length exceeds approximately 10 ** 6:
# this will take time linear in n with the optimization
# and quadratic time without the optimization
import time
start = time.perf_counter()
s = ''
for i in range(n):
    s += 'a'
total_time = time.perf_counter() - start
time_per_iteration = total_time / n
For example, on my machine (Windows 10, python 3.6.1):
for 10 ** 4 < n < 10 ** 6, the time_per_iteration is almost perfectly constant at 170±10 µs
for 10 ** 6 < n, the time_per_iteration is almost perfectly linear, reaching 520 µs at n == 10 ** 7.
Linear growth in time_per_iteration is equivalent to quadratic growth in total_time.
The linear complexity results from the optimization in the more recent CPython versions (2.4+) that reuse the original storage if no references remain to the original object. But I expected the linear performance to continue indefinitely rather than switch to quadratic at some point.
My question was prompted by this comment. For some odd reason running
python -m timeit -s"s=''" "for i in range(10**7):s+='a'"
takes an incredibly long time (much longer than quadratic), so I never got the actual timing results from timeit. So instead, I used a simple loop as above to obtain performance numbers.
Update:
My question might as well have been titled "How can a list-like append have O(1) performance without over-allocation?". From observing constant time_per_iteration on small-size strings, I assumed the string optimization must be over-allocating. But realloc is (unexpectedly to me) quite successful at avoiding memory copy when extending small memory blocks.
In the end, the platform C allocators (like malloc()) are the ultimate source of memory. When CPython tries to reallocate string space to extend its size, it's really the system C realloc() that determines the details of what happens. If the string is "short" to begin with, chances are decent the system allocator finds unused memory adjacent to it, so extending the size is just a matter of the C library's allocator updating some pointers. But after repeating this some number of times (depending again on details of the platform C allocator), it will run out of space. At that point, realloc() will need to copy the entire string so far to a brand new larger block of free memory. That's the source of quadratic-time behavior.
Note, e.g., that growing a Python list faces the same tradeoffs. However, lists are designed to be grown, so CPython deliberately asks for more memory than is actually needed at the time. The amount of this overallocation scales up as the list grows, enough to make it rare that realloc() needs to copy the whole list-so-far. But the string optimizations do not overallocate, making cases where realloc() needs to copy far more frequent.
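Because lists over-allocate and strings do not, the usual way to sidestep the question is to collect the pieces in a list and join them once at the end. A minimal sketch (the function names are made up for this comparison):

```python
def build_with_concat(n):
    """Repeated += on a string; relies on the CPython in-place optimization."""
    s = ''
    for _ in range(n):
        s += 'a'
    return s


def build_with_join(n):
    """Append to a list (amortized O(1) per append), then join once."""
    parts = []
    for _ in range(n):
        parts.append('a')
    return ''.join(parts)


print(build_with_concat(1000) == build_with_join(1000))
```

Both produce the same string; only the second has a linear-time guarantee that does not depend on what the platform's realloc() happens to do.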
[XXXXXXXXXXXXXXXXXX............]
 \________________/\__________/
     used space      reserved
                      space
When growing a contiguous array data structure (illustrated above) through appending to it, linear performance can be achieved if the extra space reserved while reallocating the array is proportional to the current size of the array. Obviously, for large strings this strategy is not followed, most probably with the purpose of not wasting too much memory. Instead a fixed amount of extra space is reserved during each reallocation, resulting in quadratic time complexity. To understand where the quadratic performance comes from in the latter case, imagine that no overallocation is performed at all (which is the boundary case of that strategy). Then at each iteration a reallocation (requiring linear time) must be performed, and the full runtime is quadratic.
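The difference between the two reservation strategies can be simulated without touching real memory. The sketch below counts how many elements would be copied in total if every reallocation had to move the existing contents (a pessimistic assumption; a real realloc() can sometimes extend in place). The function name and the two growth policies are chosen for illustration only.

```python
def elements_copied(n, next_capacity):
    """Total elements copied while appending n items one at a time."""
    capacity = 0
    size = 0
    copied = 0
    for _ in range(n):
        if size == capacity:
            capacity = next_capacity(capacity)
            copied += size  # assume the whole array moves on every reallocation
        size += 1
    return copied


def doubling(capacity):
    """Reserve space proportional to the current size."""
    return max(1, 2 * capacity)


def fixed_step(capacity):
    """Reserve a constant 8 extra slots each time."""
    return capacity + 8


n = 1024
print(elements_copied(n, doubling))    # grows linearly with n
print(elements_copied(n, fixed_step))  # grows quadratically with n
```

Doubling n roughly doubles the first number and roughly quadruples the second, which matches the linear versus quadratic behaviour described above.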
TL;DR: Just because string concatenation is optimized under certain circumstances doesn't mean it's necessarily O(1); it's just not always O(n). What determines the performance is ultimately your system, and it could be smart (beware!). Lists that "guarantee" amortized O(1) append operations are still much faster and better at avoiding reallocations.
This is an extremely complicated problem, because it's hard to "measure quantitatively". If you read the announcement:
String concatenations in statements of the form s = s + "abc" and s += "abc" are now performed more efficiently in certain circumstances.
If you take a closer look at it, you'll note that it mentions "certain circumstances". The tricky thing is to find out what these certain circumstances are. One is immediately obvious:
Nothing else holds a reference to the original string.
Otherwise it wouldn't be safe to change s in place.
But another condition is:
If the system can do the reallocation in O(1) - that means without needing to copy the contents of the string to a new location.
That's where it gets tricky, because the system is responsible for doing the reallocation. That's nothing you can control from within Python. However, your system is smart. That means in many cases it can actually do the reallocation without needing to copy the contents. You might want to take a look at @TimPeters' answer, which explains some of it in more detail.
I'll approach this problem from an experimentalist's point of view.
You can easily check how many reallocations actually need a copy by checking how often the ID changes (because the id function in CPython returns the memory address):
changes = []
s = ''
changes.append((0, id(s)))
for i in range(10000):
    s += 'a'
    if id(s) != changes[-1][1]:
        changes.append((len(s), id(s)))
print(len(changes))
This gives a different number each run (or almost each run). It's somewhere around 500 on my computer. Even for range(10000000) it's just 5000 on my computer.
But if you think that's really good at "avoiding" copies, you're wrong. Compare it to the number of resizes a list needs (lists over-allocate intentionally so append is amortized O(1)):
import sys
changes = []
s = []
changes.append((0, sys.getsizeof(s)))
for i in range(10000000):
    s.append(1)
    if sys.getsizeof(s) != changes[-1][1]:
        changes.append((len(s), sys.getsizeof(s)))
len(changes)
That only needs 105 reallocations (always).
I mentioned that realloc could be smart and I intentionally kept the "sizes" when the reallocs happened in a list. Many C allocators try to avoid memory fragmentation and at least on my computer the allocator does something different depending on the current size:
# changes is the one from the 10 million character run
%matplotlib notebook # requires IPython!
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure(1)
ax = plt.subplot(111)
#ax.plot(sizes, num_changes, label='str')
ax.scatter(np.arange(len(changes)-1),
           np.diff([i[0] for i in changes]), # plotting the difference!
           s=5, c='red',
           label='measured')
ax.plot(np.arange(len(changes)-1),
        [8]*(len(changes)-1),
        ls='dashed', c='black',
        label='8 bytes')
ax.plot(np.arange(len(changes)-1),
        [4096]*(len(changes)-1),
        ls='dotted', c='black',
        label='4096 bytes')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('x-th copy')
ax.set_ylabel('characters added before a copy is needed')
ax.legend()
plt.tight_layout()
Note that the x-axis represents the number of "copies done" not the size of the string!
That graph was actually very interesting for me, because it shows clear patterns: For small arrays (up to 465 elements) the steps are constant. It needs to reallocate for every 8 elements added. Then it needs to actually allocate a new array for every character added, and then at roughly 940 all bets are off until (roughly) one million elements. Then it seems it allocates in blocks of 4096 bytes.
My guess is that the C allocator uses different allocation schemes for differently sized objects. Small objects are allocated in blocks of 8 bytes, then for bigger-than-small-but-still-small arrays it stops overallocating, and then for medium sized arrays it probably positions them where they "may fit". Then for huge (comparatively speaking) arrays it allocates in blocks of 4096 bytes.
I guess the 8 bytes and 4096 bytes aren't random. 8 bytes is the size of an int64 (or float64, aka double), and I'm on a 64-bit computer with Python compiled for 64 bits. And 4096 is the page size of my computer. I assume there are lots of "objects" that need to have these sizes, so it makes sense that the allocator uses these sizes, because that could avoid memory fragmentation.
You probably know but just to make sure: For O(1) (amortized) append behaviour the overallocation must depend on the size. If the overallocation is constant it will be O(n**2) (the greater the overallocation the smaller the constant factor but it's still quadratic).
So on my computer the runtime behaviour will always be O(n**2), except for lengths of (roughly) 1 000 to 1 000 000, where it really seems to be undefined. In my test run it was able to add many (tens of) thousands of elements without ever needing a copy, so it would probably "look like O(1)" when timed.
Note that this is just my system. It could look totally different on another computer or even with another compiler on my computer. Don't take these too seriously. I provided the code to do the plots yourself, so you can analyze your system yourself.
You also asked (in the comments) whether there would be downsides to over-allocating strings. That's really easy: Strings are immutable. So any overallocated byte is wasting resources. There are only a few limited cases where a string really does grow, and these are considered implementation details. The developers probably don't want to throw away space to make implementation details perform better; some Python developers also think that adding this optimization was a bad idea.
Each sample is an array of features (ints). I need to split my samples into two separate groups by figuring out what the best feature, and the best splitting value for that feature, is. By "best", I mean the split that gives me the greatest entropy difference between the pre-split set and the weighted average of the entropy values on the left and right sides. I need to try all (2^m−2)/2 possible ways to partition these items into two nonempty lists, where m is the number of distinct values (all samples with the same value for that feature are moved together as a group).
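For concreteness, the "entropy difference" described above could be computed along the following lines for binary labels. This is only a sketch of the standard Shannon-entropy formula; the function names and the argument layout (counts of 0s and 1s on each side) are assumptions made for this example, not taken from the original code.

```python
import math


def entropy(zeros, ones):
    """Shannon entropy (in bits) of a set containing the given label counts."""
    total = zeros + ones
    result = 0.0
    for count in (zeros, ones):
        if count:  # skip empty classes; 0 * log(0) is taken as 0
            p = count / total
            result -= p * math.log2(p)
    return result


def entropy_drop(left_0, left_1, right_0, right_1):
    """Entropy before the split minus the size-weighted entropy after it."""
    n_left = left_0 + left_1
    n_right = right_0 + right_1
    before = entropy(left_0 + right_0, left_1 + right_1)
    after = (n_left * entropy(left_0, left_1)
             + n_right * entropy(right_0, right_1)) / (n_left + n_right)
    return before - after


print(entropy_drop(5, 0, 0, 5))  # a split that separates the labels perfectly
print(entropy_drop(2, 2, 3, 3))  # a split that tells us nothing
```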
The following is extremely slow so I need a more reasonable/ faster way of doing this.
sorted_by_feature is a list of (feature_value, 0_or_1) tuples.
import itertools

same_vals = {}
for ele in sorted_by_feature:
    if ele[0] not in same_vals:
        same_vals[ele[0]] = [ele]
    else:
        same_vals[ele[0]].append(ele)
l = same_vals.keys()
orderings = list(itertools.permutations(l))
for ordering in orderings:
    list_tups = []
    for dic_key in ordering:
        list_tups += same_vals[dic_key]
    left_1 = 0
    left_0 = 0
    right_1 = num_one
    right_0 = num_zero
    for index, tup in enumerate(list_tups):
        # number of 0's or 1's on the left +/- 1
        # calculate entropy on left/right, calculate entropy drop, etc.
Trivial details (continuing the code above):
        if index == len(sorted_by_feature) - 1:
            break
        if tup[1] == 1:
            left_1 += 1
            right_1 -= 1
        if tup[1] == 0:
            left_0 += 1
            right_0 -= 1
        # only calculate entropy if values to left and right of split are different
        if list_tups[index][0] != list_tups[index+1][0]:
            ...  # entropy calculation not shown in the question
tl;dr
You're asking for a miracle. No programming language can help you out of this one. Use better approaches than what you're considering doing!
Your Solution has Exponential Time Complexity
Let's assume a perfect algorithm: one that can give you a new partition in constant O(1) time. In other words, no matter what the input, a new partition can be generated in a guaranteed constant amount of time.
Let's in fact go one step further and assume that your algorithm is only CPU-bound and is operating under ideal conditions. Under ideal circumstances, a high-end CPU can process upwards of 100 billion instructions per second. Since this algorithm takes O(1) time, we'll say, oh, that every new partition is generated in a hundred billionth of a second. So far so good?
Now you want this to perform well. You say you want this to be able to handle an input of size m. You know that that means you need about pow(2,m) iterations of your algorithm - that's the number of partitions you need to generate, and since generating each partition takes a finite amount of time O(1), the total time is just pow(2,m) times O(1). Let's take a quick look at the numbers here:
m = 20 means your time taken is pow(2,20)*10^-11 seconds = 0.00001 seconds. Not bad.
m = 40 means your time taken is pow(2,40)*10^-11 seconds = about 1 trillion/100 billion = about 10 seconds. Also not bad, but note how small m = 40 is. In the vast panopticon of numbers, 40 is nothing. And remember we're assuming ideal conditions.
m = 100 means pow(2,100)*10^-11 seconds, which is about 10^19 seconds, or roughly 400 billion years! What happened?
You're a victim of algorithmic theory. Simply put, a solution that has exponential time complexity - any solution that takes 2^m time to complete - cannot be sped up by better programming. Generating or producing pow(2,m) outputs is always going to take you the same proportion of time.
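The arithmetic can be reproduced in a couple of lines, keeping the same idealised assumption of 10^11 partitions per second. The constant RATE and the function name are introduced only for this sketch.

```python
RATE = 1e11  # partitions per second, the idealised figure assumed in this answer
SECONDS_PER_YEAR = 31_557_600  # 365.25 days


def seconds_needed(m, rate=RATE):
    """Time to generate 2**m partitions at the given rate."""
    return 2 ** m / rate


for m in (20, 40, 100):
    secs = seconds_needed(m)
    print(f"m = {m:3d}: {secs:.3g} s  (~{secs / SECONDS_PER_YEAR:.3g} years)")
```

Each extra value doubles the running time, so no constant-factor speed-up keeps pace for long.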
Note further that 100 billion instructions/sec is an ideal for high-end desktop computers - your CPU also has to worry about processes other than this program you're running, in which case kernel interrupts and context switches eat into processing time (especially when you're running a few thousand system processes, which you no doubt are). Your CPU also has to read and write from disk, which is I/O bound and takes a lot longer than you think. Interpreted languages like Python also eat into processing time, since the interpreter has to execute bytecode instruction by instruction at run time, which takes additional resources. You can benchmark your code right now and I can pretty much guarantee your numbers will be way higher than the simplistic calculations I provide above. Even worse: storing 2^40 permutations requires on the order of 1000 GB of memory, even at a single byte each. Do you have that much to spare? :)
Switching to a lower-level language, using generators, etc. is all a pointless affair: they're not the main bottleneck, which is simply the large and unreasonable time complexity of your brute force approach of generating all partitions.
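None of this removes the exponential growth, but for reference the (2^m−2)/2 two-way splits from the question can be enumerated directly, each exactly once, instead of going through every permutation of the values. The sketch below fixes the first value on the left side so that mirror-image splits are not produced twice; the function name two_way_splits is made up for this example and it assumes the values are distinct.

```python
from itertools import combinations


def two_way_splits(values):
    """Yield every split of distinct values into two nonempty groups, once each."""
    values = list(values)
    if len(values) < 2:
        return
    first, rest = values[0], values[1:]
    # k extra values join `first` on the left; k < len(rest) keeps the right side nonempty
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = [first, *extra]
            right = [v for v in rest if v not in extra]
            yield left, right


for left, right in two_way_splits([1, 2, 3]):
    print(left, right)
```

For m distinct values this yields 2^(m-1) − 1 splits, which is the same number as (2^m−2)/2 and far smaller than the m! orderings produced by itertools.permutations, but it is still exponential in m.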
What You Can Do Instead
Use a better algorithm. Generating pow(2,m) partitions and investigating all of them is an unrealistic ambition. You want, instead, to consider a dynamic programming approach. Instead of walking through the entire space of possible partitions, you want to only consider walking through a reduced space of optimal solutions only. That is what dynamic programming does for you. An example of it at work in a problem similar to this one: unique integer partitioning.
Dynamic programming approaches work best on problems that can be formulated as linearized directed acyclic graphs (Google it if not sure what I mean!).
If a dynamic approach is out, consider investing in parallel processing with a GPU instead. Your computer already has a GPU - it's what your system uses to render graphics - and GPUs are built to be able to perform large numbers of calculations in parallel. A parallel calculation is one in which different workers can do different parts of the same calculation at the same time - the net result can then be joined back together at the end. If you can figure out a way to break this into a series of parallel calculations - and I think there is good reason to suggest you can - there are good tools for GPU interfacing in Python.
Other Tips
Be very explicit on what you mean by best. If you can provide more information on what best means, we folks on Stack Overflow might be of more assistance and write such an algorithm for you.
Using a bare-metal compiled language might help reduce the amount of real time your solution takes in ordinary situations, but the difference in this case is going to be marginal. Compiled languages are useful when you have to do operations like searching through an array efficiently, since there's no interpreter overhead at each iteration. They're not all that much more useful when it comes to generating new partitions, because removing the interpreter overhead does nothing about the number of partitions that have to be generated.
A couple of minor improvements I can see:
Use try/except instead of if not in to avoid double lookup of keys
if ele[0] not in same_vals:
    same_vals[ele[0]] = [ele]
else:
    same_vals[ele[0]].append(ele)
# Should be changed to
try:
    same_vals[ele[0]].append(ele) # Most of the time this will work
except KeyError:
    same_vals[ele[0]] = [ele]
Don't explicitly convert an iterator to a list if you don't have to. I don't immediately see any need for the conversion to a list, which slows things down and holds every permutation in memory at once
orderings = list(itertools.permutations(l))
for ordering in orderings:
# Should be changed to
for ordering in itertools.permutations(l):
But, like I said, these are only minor improvements. If you really needed this to be faster, consider using a different language.
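As a variation on the first tip above (a swapped-in technique, not what this answer proposes), the grouping step can also be written with collections.defaultdict, which creates the empty list on first access and so needs neither the membership test nor the exception handler. The wrapper function group_by_value is a name invented for this sketch.

```python
from collections import defaultdict


def group_by_value(sorted_by_feature):
    """Group (feature_value, label) tuples by their feature value."""
    same_vals = defaultdict(list)
    for ele in sorted_by_feature:
        same_vals[ele[0]].append(ele)  # a missing key starts as an empty list
    return same_vals


groups = group_by_value([(1, 0), (1, 1), (2, 0)])
print(dict(groups))
```

Like the other tips, this only trims a constant factor; it does not change how many orderings are examined afterwards.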
for example
f(2)->1
f(3)->2
f(4)->-1 //4 is not a prime
f(5)->3
...
In general, I make a prime generator and count until it reaches x:
def f(x):
    p = primeGenerator()
    count=1
    while True:
        y = next(p)
        if y>x:
            return -1
        elif y==x:
            return count
        else:
            count+=1
Isn't this too slow? I can cache the list for the next call, but if I guarantee that the input MUST be a prime, so that I don't have to test whether the input number is prime, is there a faster formula to get the answer?
The best method depends on what inputs you get, and whether the function will be called many times or just once or a few times.
If it will be called often, and all inputs you are going to receive are small, not larger than 10^7 say, the best method is to create a lookup table in advance, and just look up the input.
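A minimal sketch of such a lookup table, built with a plain sieve of Eratosthenes. The limit of 10^5 is chosen only to keep the example quick, the names build_prime_index and PRIME_INDEX are invented here, and the table is only meaningful for inputs up to the chosen limit (which is assumed to be at least 2).

```python
def build_prime_index(limit):
    """Map every prime p <= limit to its 1-based position among the primes."""
    is_prime = bytearray([1]) * (limit + 1)
    is_prime[0:2] = b"\x00\x00"  # 0 and 1 are not prime
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            multiples = range(p * p, limit + 1, p)
            is_prime[p * p :: p] = bytearray(len(multiples))  # cross out multiples of p
    table = {}
    for n in range(2, limit + 1):
        if is_prime[n]:
            table[n] = len(table) + 1
    return table


PRIME_INDEX = build_prime_index(10 ** 5)


def f(x):
    """Index of x among the primes, or -1 if x is not a prime in the table."""
    return PRIME_INDEX.get(x, -1)


print(f(2), f(3), f(4), f(5))
```

After the one-time cost of building the table, each call is a single dictionary lookup.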
If it will not be called often, and all inputs are small, just generating the primes not exceeding the input and counting them is certainly good enough. It might be an enhancement to remember what you already have for the next call, so that when the first argument is 19394489, and the next is 20889937, you don't need to start from 0 again, but only need to find the primes between them. But whether the extra storage is worth having depends on the arguments passed.
If it will be called often and the arguments are not too large, not exceeding 10^13 say, the best method is to precompute the values of π(n) for some select values of n, and for each argument look up the value for the next smaller precomputed point, and then generate and count the primes between that point and the target value (or if the target is closer to the next larger precomputed point, count the primes between the target and that).
If you calculate e.g. π(n) for all multiples of 10^7 not exceeding 10^13, you get a lookup table with one million entries, that's not very taxing on the memory nowadays, and never need to sieve a range larger than five million, which doesn't take long.
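The checkpoint idea can be sketched at toy scale as follows. To keep the example short it uses trial division where a real implementation would sieve the interval, it only counts upwards from the checkpoint below the target, and the step of 100 and limit of 1000 are stand-ins for the 10^7 and 10^13 mentioned above. All names are invented for this illustration.

```python
def is_prime(n):
    """Trial division; a stand-in for sieving the interval."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def build_checkpoints(limit, step):
    """checkpoints[k] holds pi(k * step) for every multiple of step up to limit."""
    checkpoints = [0]
    count = 0
    for n in range(1, limit + 1):
        count += is_prime(n)
        if n % step == 0:
            checkpoints.append(count)
    return checkpoints


def prime_count(x, checkpoints, step):
    """pi(x): start at the nearest checkpoint below x and count the primes after it."""
    k = min(x // step, len(checkpoints) - 1)
    count = checkpoints[k]
    for n in range(k * step + 1, x + 1):
        count += is_prime(n)
    return count


STEP = 100
CHECKPOINTS = build_checkpoints(1000, STEP)
print(prime_count(150, CHECKPOINTS, STEP))
```

If the input is known to be prime, prime_count(x, ...) is directly its index; otherwise a primality test on x decides whether to return -1.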
You could also have the lookup table as a file or database on disk, which would allow much shorter intervals between the precomputed points. That would also eliminate the time for reading in the precomputed table on startup, but the lookup would now involve an access to the file system, which takes much longer than a memory read. What would be the best strategy depends on the expected inputs and the system it's run on.
Computing the lookup table will however take rather long if the upper limit isn't small, but that's a one-time cost.
If the expected inputs are larger, up to 10^16 say, and you're not willing to spend the time necessary for precomputing a lookup table for that range, your best bet is to implement one of the better algorithms for the prime counting function. Meissel's method as refined by Lehmer is relatively easy to implement (not so easy that I'll give an example implementation here, though, but here's a Haskell implementation that might help). Better, but more complicated, is the method as improved by Miller et al.
Beyond that, you'd need to research the current state-of-the-art, and probably should use a lower-level language than Python.
You have to check all preceding candidates for primality. There are no shortcuts. As you say, you can cache the result of a prior calculation and start from there, but that's really the best you can do.
Which is more CPU intensive, to do an if(x==num): check, or to do a sum x+y?
Your question is somewhat incomplete because you are comparing two different operations. If you need to add two things together then testing x==y isn't going to get you anywhere. So presumably you want to compare
if y != 0:
    sum += y
with
sum += y
It's a lot more complex for interpreted languages like Python, but on the hardware a test for non-zero introduces a branch and that in itself can be expensive. But I wouldn't want to say which would be faster without timing.
Throw into the equation different performance characteristics of different architectures and you have another confounding factor.
As always, you are best off writing your code in the most natural, maintainable way first and then timing it. If you feel you need to extract more performance, use a profiler to find hot spots and then optimise.
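Timing the two variants from this answer takes only a few lines with timeit. This is a sketch with made-up function names and test data; the result will differ between machines, Python versions and, above all, the proportion of zeros in the input.

```python
import timeit


def sum_with_check(values):
    """Add only the non-zero values."""
    total = 0
    for y in values:
        if y != 0:
            total += y
    return total


def sum_unconditional(values):
    """Add every value, zero or not."""
    total = 0
    for y in values:
        total += y
    return total


values = [0, 3, 0, 7, 1] * 20_000
for func in (sum_with_check, sum_unconditional):
    seconds = timeit.timeit(lambda: func(values), number=10)
    print(f"{func.__name__}: {seconds:.3f} s")
```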