Why does naive string concatenation become quadratic above a certain length? - python

Building a string through repeated string concatenation is an anti-pattern, but I'm still curious why its performance switches from linear to quadratic after string length exceeds approximately 10 ** 6:
# this will take time linear in n with the optimization
# and quadratic time without the optimization
import time

n = 10 ** 6  # string length to build (varied in the measurements below)
start = time.perf_counter()
s = ''
for i in range(n):
    s += 'a'
total_time = time.perf_counter() - start
time_per_iteration = total_time / n
For example, on my machine (Windows 10, python 3.6.1):
for 10 ** 4 < n < 10 ** 6, the time_per_iteration is almost perfectly constant at 170±10 µs
for 10 ** 6 < n, the time_per_iteration is almost perfectly linear, reaching 520 µs at n == 10 ** 7.
Linear growth in time_per_iteration is equivalent to quadratic growth in total_time.
The linear complexity results from the optimization in the more recent CPython versions (2.4+) that reuses the original storage if no other references to the original object remain. But I expected the linear performance to continue indefinitely rather than switch to quadratic at some point.
My question is based on this comment. For some odd reason running
python -m timeit -s"s=''" "for i in range(10**7):s+='a'"
takes an incredibly long time (much longer than quadratic would suggest), so I never got the actual timing results from timeit. So instead, I used the simple loop above to obtain performance numbers.
Update:
My question might as well have been titled "How can a list-like append have O(1) performance without over-allocation?". From observing constant time_per_iteration on small-size strings, I assumed the string optimization must be over-allocating. But realloc is (unexpectedly to me) quite successful at avoiding memory copy when extending small memory blocks.

In the end, the platform C allocators (like malloc()) are the ultimate source of memory. When CPython tries to reallocate string space to extend its size, it's really the system C realloc() that determines the details of what happens. If the string is "short" to begin with, chances are decent the system allocator finds unused memory adjacent to it, so extending the size is just a matter of the C library's allocator updating some pointers. But after repeating this some number of times (depending again on details of the platform C allocator), it will run out of space. At that point, realloc() will need to copy the entire string so far to a brand new larger block of free memory. That's the source of quadratic-time behavior.
Note, e.g., that growing a Python list faces the same tradeoffs. However, lists are designed to be grown, so CPython deliberately asks for more memory than is actually needed at the time. The amount of this overallocation scales up as the list grows, enough to make it rare that realloc() needs to copy the whole list-so-far. But the string optimizations do not overallocate, making cases where realloc() needs to copy far more frequent.
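To see that over-allocation from Python, here is a minimal sketch (the exact byte counts are CPython implementation details and vary between versions):
import sys

lst = []
last_size = sys.getsizeof(lst)
for i in range(64):
    lst.append(None)
    size = sys.getsizeof(lst)
    if size != last_size:
        # the allocated size jumps ahead of the length: room was reserved
        # for future appends, so most appends need no reallocation at all
        print(len(lst), size)
        last_size = size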

[XXXXXXXXXXXXXXXXXX............]
 \________________/\__________/
     used space      reserved
                       space
When growing a contiguous array data structure (illustrated above) through appending to it, linear performance can be achieved if the extra space reserved while reallocating the array is proportional to the current size of the array. Obviously, for large strings this strategy is not followed, most probably with the purpose of not wasting too much memory. Instead a fixed amount of extra space is reserved during each reallocation, resulting in quadratic time complexity. To understand where the quadratic performance comes from in the latter case, imagine that no overallocation is performed at all (which is the boundary case of that strategy). Then at each iteration a reallocation (requiring linear time) must be performed, and the full runtime is quadratic.
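To make the difference concrete, here is a rough simulation (pure bookkeeping, not the real allocator) that counts how many elements end up being copied under each growth strategy; the 64-element increment is an arbitrary stand-in for "a fixed amount of extra space":
def copied_elements(n, new_capacity):
    capacity = copied = 0
    for size in range(1, n + 1):      # appending the size-th element
        if size > capacity:           # no reserved space left: reallocate
            copied += size - 1        # all existing elements get moved
            capacity = new_capacity(size)
    return copied

n = 10 ** 5
print(copied_elements(n, lambda s: 2 * s))    # proportional growth: copies stay O(n)
print(copied_elements(n, lambda s: s + 64))   # fixed extra space: copies grow like O(n**2)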

TL;DR: Just because string concatenation is optimized under certain circumstances doesn't mean it's necessarily O(1); it's just not always O(n). What determines the performance is ultimately your system, and it could be smart (beware!). Lists that "guarantee" amortized O(1) append operations are still much faster and better at avoiding reallocations.
This is an extremely complicated problem, because it's hard to measure quantitatively. If you read the announcement:
String concatenations in statements of the form s = s + "abc" and s += "abc" are now performed more efficiently in certain circumstances.
If you take a closer look at it, then you'll note that it mentions "certain circumstances". The tricky thing is to find out what these certain circumstances are. One is immediately obvious:
Nothing else may hold a reference to the original string.
Otherwise it wouldn't be safe to change s in place.
But another condition is:
If the system can do the reallocation in O(1) - that means without needing to copy the contents of the string to a new location.
That's where it gets tricky, because the system is responsible for doing the reallocation, and that's nothing you can control from within Python. However, your system is smart. That means in many cases the reallocation can actually be done without needing to copy the contents. You might want to take a look at @TimPeters' answer, which explains some of it in more detail.
I'll approach this problem from an experimentalist's point of view.
You can easily check how many reallocations actually need a copy by checking how often the ID changes (because the id function in CPython returns the memory address):
changes = []
s = ''
changes.append((0, id(s)))
for i in range(10000):
    s += 'a'
    if id(s) != changes[-1][1]:
        changes.append((len(s), id(s)))
print(len(changes))
This gives a different number each run (or almost each run). It's somewhere around 500 on my computer. Even for range(10000000) it's just 5000 on my computer.
But if you think that's really good at "avoiding" copies you're wrong. If you compare it to the number of resizes a list needs (lists over-allocate intentionally so append is amortized O(1)):
import sys

changes = []
s = []
changes.append((0, sys.getsizeof(s)))
for i in range(10000000):
    s.append(1)
    if sys.getsizeof(s) != changes[-1][1]:
        changes.append((len(s), sys.getsizeof(s)))
len(changes)
That only needs 105 reallocations (always).
I mentioned that realloc could be smart, and I intentionally kept the sizes at which the reallocs happened in a list. Many C allocators try to avoid memory fragmentation, and at least on my computer the allocator does something different depending on the current size:
# changes is the one from the 10 million character run
%matplotlib notebook   # requires IPython!
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure(1)
ax = plt.subplot(111)
#ax.plot(sizes, num_changes, label='str')
ax.scatter(np.arange(len(changes)-1),
           np.diff([i[0] for i in changes]),  # plotting the difference!
           s=5, c='red',
           label='measured')
ax.plot(np.arange(len(changes)-1),
        [8]*(len(changes)-1),
        ls='dashed', c='black',
        label='8 bytes')
ax.plot(np.arange(len(changes)-1),
        [4096]*(len(changes)-1),
        ls='dotted', c='black',
        label='4096 bytes')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('x-th copy')
ax.set_ylabel('characters added before a copy is needed')
ax.legend()
plt.tight_layout()
Note that the x-axis represents the number of "copies done" not the size of the string!
That graph was actually very interesting to me, because it shows clear patterns: for small arrays (up to 465 elements) the steps are constant. It needs to reallocate for every 8 elements added. Then it needs to actually allocate a new array for every character added, and then at roughly 940 all bets are off until (roughly) one million elements. Then it seems it allocates in blocks of 4096 bytes.
My guess is that the C allocator uses different allocation schemes for differently sized objects. Small objects are allocated in blocks of 8 bytes, then for bigger-than-small-but-still-small arrays it stops over-allocating, and then for medium-sized arrays it probably positions them where they "may fit". Then for huge (comparatively speaking) arrays it allocates in blocks of 4096 bytes.
I guess the 8 bytes and 4096 bytes aren't random. 8 bytes is the size of an int64 (or float64, aka double), and I'm on a 64-bit computer with Python compiled for 64 bits. And 4096 is the page size of my computer. I assume there are lots of "objects" that need to have these sizes, so it makes sense that the allocator uses these sizes because it could avoid memory fragmentation.
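The 4096-byte figure is easy to check from Python itself; a quick sketch (results are of course platform-specific):
import mmap, sys
print(mmap.PAGESIZE)          # memory page size, typically 4096 on x86 machines
print(sys.maxsize > 2**32)    # True on a 64-bit CPython build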
You probably know but just to make sure: For O(1) (amortized) append behaviour the overallocation must depend on the size. If the overallocation is constant it will be O(n**2) (the greater the overallocation the smaller the constant factor but it's still quadratic).
So on my computer the runtime behaviour will always be O(n**2), except for lengths of (roughly) 1,000 to 1,000,000 - there it really seems to be undefined. In my test run it was able to add many (tens of) thousands of elements without ever needing a copy, so it would probably "look like O(1)" when timed.
Note that this is just my system. It could look totally different on another computer or even with another compiler on my computer. Don't take these too seriously. I provided the code to do the plots yourself, so you can analyze your system yourself.
You also asked the question (in the comments) whether there would be downsides if you over-allocated strings. That's really easy: strings are immutable. So any over-allocated byte is wasting resources. There are only a few limited cases where a string really does grow in place, and these are considered implementation details. The developers probably don't throw away space to make implementation details perform better; some Python developers also think that adding this optimization was a bad idea.


Does every simple mathematical operation use the same amount of power (as in, battery power)?

Recently I have been revising some of my old Python code, which is essentially loops of algebra, in order to have it execute faster, generally by eliminating some unnecessary operations. Often I found code changing the value of an entry in a list from 0 (as a Python float, which I believe is a double by default) to the same value, which is obviously not necessary. Or checking if a float is equal to something when it MUST be that thing, because a preceding "if" would not have triggered if it wasn't, or some other extraneous operation. This got me wondering about what will preserve my battery more, as I do some of my coding on the bus where I can't plug my laptop in.
For example, which of the following two operations would be expected to use less battery power?
if b != 0:  # b was assigned previously, and I know it is zero already
    b = 0
or:
b = 0
The first one checks if b is zero, and it is, so it doesn't do the next part. The second one just assigns b to zero without bothering to check. I believe the first one is more time-efficient, as you don't have to change anything in memory. Is that correct, and if so, would it also be more power-efficient? Does "more time efficient" always imply "more power efficient"?
I suggest watching this talk by Chandler Carruth: "Efficiency with Algorithms, Performance with Data Structures"
He addresses the idea of "power efficient instructions" at 4m 49s in the video. I agree with him: thinking about how many watts a particular piece of code consumes is useless. As he put it:
Q: "How to save battery life?"
A: "Finish ruining the program".
Also, in Python you do not have low-level control to even be thinking about low-level problems like this. Use appropriate data structures and algorithms, and pray that the Python interpreter will give you well-optimized bytecode.
Does every simple mathematical operation use the same amount of power (as in, battery power)?
No. Adding two numbers is not the same as computing the Fourier transform of a 20-megapixel photo.
I believe the first one is more time-efficient, as you don't have to change anything in memory. Is that correct, and if so, would it also be more power-efficient? Does "more time efficient" always imply "more power efficient"?
Yes. Your intuitions are right, but these are very trivial examples. And if you dig deeper, you will get into the uncharted territory of weird optimizations that is quite difficult to grasp (e.g., see this question: Times two faster than bit shift?).
In general, the more your code utilizes system resources, the more power those resources will use. However, it is more useful to optimize your code based on time or size constraints instead of thinking about high-level code in terms of power draw.
One way of doing this is Big O notation. In essence, Big O notation is a way of comparing the space and/or runtime complexity of an algorithm. https://rob-bell.net/2009/06/a-beginners-guide-to-big-o-notation/
A computer at its lowest level is large quantity of transistors which require power to change and maintain their state.
It would be extremely difficult to anticipate how much power any one line of python code would draw.
I once had questions like this. Still do sometimes. Here's the answer I wish someone told me earlier.
Summary
You are correct that generally, if your computer does less work, it'll use less power.
But we have to be really careful in figuring out which logical operations involve more work and which ones involve less work - in this case:
Reading vs writing memory is usually the same amount of work.
if and any other conditional execution also costs work.
Python's "simple operations" are not "simple operations" for the CPU.
But the idea you had is probably correct for some cases you had in mind.
If you're concerned about power consumption, measure where power is being used.
For some perspective: You're asking about which Python code costs you one more drop of water, but really in Python every operation costs a bucket and your whole Python program is using a river and your computer as a whole is using an ocean.
Direct Answers
Don't apply these answers to Python yet. Read the rest of the answer first, because there's so much indirection between Python and the CPU that you'll mislead yourself about how they're connected if you don't take that into account.
I believe the first one is more time-efficient, as you don't have to change anything in memory.
As a general rule, reading memory is just as slow as writing to memory, or even slower depending on exactly what your computer is doing. For further reading you'll want to look into CPU memory cache levels, memory access times, and how out-of-order execution and data dependencies factor into modern CPU architectures.
As a general rule, the if statement in a language is itself an operation which can have a non-negligible cost. For further reading you should look into how CPU pipelining relates to branch prediction and branch penalties. Also look into how if statements are implemented in typical CPU instruction sets.
Does "more time efficient" always imply "more power efficient"?
As a general rule, more work efficient (doing less work - fewer machine instructions, for example) implies more power efficient, because on modern hardware (this wasn't always the case) your hardware will use less power when it's not doing anything.
You should be careful about the idea of "more time efficient" though, because modern hardware doesn't always execute the same amount of work in the same amount of time: for further reading you'll want to look into CPU frequency scaling, ARM's big.LITTLE architectures, and discussions about the "Race to Idle" concept as a starting point.
"One Simple Operation" - CPU vs. Python
Your question is about Python, so it's important to realize that Python's x != 0, if, and x = 0 do not map directly to simple operations in the CPU.
For further reading, especially if you're familiar with C, I would recommend taking a long look at how Python is implemented. There are many implementations - the main one is CPython, which is a C program that reads and interprets Python source, converts it into Python "bytecode", and then when running interprets that bytecode one instruction at a time.
As a baseline, if you're using Python, any one "simple" operation is actually a lot of CPU operations, as each step in the Python interpreter is multiple CPU operations, but which ones cost more might be surprising.
Let's break down the three used in our example (I'm primarily describing this from the perspective of the main Python implementation written in C, called "CPython", which I am the most closely familiar with, but in general this explanation is roughly applicable to all of them, though some will be able to optimize out certain steps):
x != 0
It looks like a simple operation, and if this was C and x was an int it would be just one machine instruction - but Python allows for operator overloading, so first Python has to:
look up x (at least one memory read, but may involve one or more hashmap lookups in Python's internals, which is many machine operations),
check the type of x (more memory reading),
based on the type, look up a function pointer that implements the not-equality operation (one or arbitrarily many memory reads and one or arbitrarily many conditional branches, with data dependencies between them),
only then can it finally call that function with references to the Python objects holding the values of x and 0 (which is also not "free" - look up "function calling ABI" for more on that).
All that and more has to be done by the CPU even if x is a Python int or float mapping closely to the CPU's native numerical data types.
x = 0
An assignment is actually far cheaper in Python (though still not trivial): it only has to get as far as step 1 above, because once it knows "where" x is, it can just overwrite that pointer with the pointer to the Python object representing 0.
if
Abstractly speaking, the Python if statement has to be able to handle "truthy" and "falsey" values, which in the most naive implementation involves running through more CPU instructions to evaluate what the result of the condition is according to Python's semantics of what's true and what's false.
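One way to see at least the first layer of this machinery is to look at the bytecode CPython generates for the two snippets from the question; a small sketch (the exact opcodes differ between Python versions), keeping in mind that every bytecode shown expands into many machine instructions inside the interpreter loop:
import dis

def check_then_assign(b):
    if b != 0:
        b = 0

def just_assign(b):
    b = 0

dis.dis(check_then_assign)   # load, compare, conditional jump, then the store
dis.dis(just_assign)         # just a load of the constant and a store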
Sidenote About Optimizations
Different Python implementations go to different lengths to get Python operations down to as few CPU operations as possible. For example, an optimizing JIT (Just-In-Time) compiler might notice that, inside some loop on an array, all elements of the array are native integers and actually reduce the if x != 0 and x = 0 parts into their respective minimal machine instructions, but that only happens in very specific circumstances when the optimizing logic can prove that it can safely bypass a lot of the behavior it would normally need to perform.
The biggest thing here is this: a high-level language like Python is so removed from the hardware that "simple" operations are often complex "under the covers".
What You Asked vs. What I Think You Wanted To Ask
Correct me if I'm wrong, but I suspect the use-case you actually had in mind was this:
if x != 0:
    # some code
    x = 0
vs. this:
if x != 0:
    # some code
x = 0
In that case, the first option is superior to the second, because you are already paying the cost of if x != 0 anyway.
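If you want to put a number on it rather than reason about it, a throwaway timeit sketch along these lines works (the difference is tiny and swamped by interpreter overhead, which is rather the point):
import timeit

# x is already zero, matching the scenario in the question
print(timeit.timeit("if x != 0:\n    x = 0", setup="x = 0"))
print(timeit.timeit("x = 0", setup="x = 0"))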
Last Point of Emphasis
The hardest breakthrough for me was moving away from trying to reason about individual instructions in my head, and instead switching into looking at how things work and measuring real systems.
Looking at how things work will teach you how to optimize, but measuring will show you where to optimize.
This question is great for exploring the former, but for your stated motivation of reducing power consumption on your laptop, you would benefit more from the latter.

Regarding variable storage

I have just finished writing an optimised solution for Project Euler's fourth problem. While I was implementing the algorithm, I experienced some internal conflict over design choices. I was uncertain whether I should store the product of an operation in its own variable, for future reference, or instead not store it as a variable and recompute the product of the two operands whenever required. Here is a snippet of the code:
product = x * y
if (checkPalindrome(product) and product > largest_product):
    largest_product = product
The product is stored in 'product' and is referenced in the following lines. What I am curious about is whether this is considered better practice compared to recomputing the product whenever a reference to it is required, like this:
if (checkPalindrome(x * y) and x * y > largest_product):
    largest_product = x * y
Can this difference in implementation yield a difference in space, or time performance when scaled?
Arithmetic in Python isn't particularly fast, so it's better to avoid performing the same computation multiple times. BTW, rather than determining the maximum product "by hand" you could do it with the built-in max function.
I should also mention that you aren't saving much RAM by avoiding doing product = x * y. The code still has to create an int object to hold the result of x * y anyway, binding that object to a name doesn't consume much RAM. OTOH, performing the same calculation 3 times not only wastes time, it means that 3 objects need to be created (and recycled) to store the result.
I suggest you take a look at Other languages have "variables", Python has "names". For a more in-depth examination of this important topic, please see Facts and myths about Python names and values, which was written by SO veteran Ned Batchelder.
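Purely as an illustration of the max() suggestion, assuming the usual three-digit-factor version of the problem and with is_palindrome standing in for your checkPalindrome:
def is_palindrome(n):
    s = str(n)
    return s == s[::-1]

products = (x * y for x in range(100, 1000) for y in range(x, 1000))
print(max(p for p in products if is_palindrome(p)))   # 906609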
A number takes an amount of storage proportional to its length (when printed out). For modest sized numbers, up to 20 digits say, you are not likely to notice this effect at all. Above that you are not likely to notice the effect unless you have huge numbers (thousands of digits) or lots of them.
The multiplication of two numbers takes... well, this is an active area of research, but for these purposes let's say it takes time proportional to the square of the length (when printed out) of the longest number (if they are similar in length). But again, you aren't likely to notice this effect unless you have huge numbers.
As others have observed, multiplication is not quick in Python, so that's a concern.
I suggest you write what's clearest and then clean up your performance problems when you encounter them.
In practice, the first approach is best. Here you are multiplying small integers, so the second approach doesn't hurt that much. But when you run this inside a loop, calculating a number of products, then it really matters and has an impact.
Let's assume you have 10 loop iterations. In the second approach, if one multiplication takes O(1) time, then within each iteration you have 2 such calculations, hence it will take O(2) time. For 10 such iterations, you will have O(20) time.
if (checkPalindrome(x * y) and x * y > largest_product):  # O(1)
    largest_product = x * y  # O(1)
# total O(2) for two calculations
But in the first approach, since you do the calculation only once and in the later steps use the calculated value, it takes only O(1) time, spent during the calculation, and essentially no time when you reference it for the condition check. For 10 such iterations, you will have O(10) time. Thus you are saving 50% of your time.
product = x * y  # O(1) for one calculation
if (checkPalindrome(product) and product > largest_product):
    largest_product = product
Yes, if memory is a constraint, then storing the variable requires some memory, and in that case you might consider the second approach. Or, if there is only a single point of calculation, you are also fine with the second approach. But even where memory is a constraint, it would not take a significant amount of memory just to store one variable. So, in any case, I find the first one (calculating once and storing, rather than calculating each time) the best and most efficient.
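If you want to see the cost of the repeated multiplication in isolation, a quick sketch (arbitrary operands; the additions are just filler so each statement does comparable work):
import timeit

setup = "x, y = 123456789, 987654321"
print(timeit.timeit("p = x * y; total = p + p + p", setup=setup))    # one multiplication
print(timeit.timeit("total = x * y + x * y + x * y", setup=setup))   # three multiplications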

all (2^m−2)/2 possible ways to partition list

Each sample is an array of features (ints). I need to split my samples into two separate groups by figuring out what the best feature, and the best splitting value for that feature, is. By "best", I mean the split that gives me the greatest entropy difference between the pre-split set and the weighted average of the entropy values on the left and right sides. I need to try all (2^m−2)/2 possible ways to partition these items into two nonempty lists (where m is the number of distinct values (all samples with the same value for that feature are moved together as a group))
The following is extremely slow so I need a more reasonable/ faster way of doing this.
sorted_by_feature is a list of (feature_value, 0_or_1) tuples.
import itertools

same_vals = {}
for ele in sorted_by_feature:
    if ele[0] not in same_vals:
        same_vals[ele[0]] = [ele]
    else:
        same_vals[ele[0]].append(ele)
l = same_vals.keys()
orderings = list(itertools.permutations(l))
for ordering in orderings:
    list_tups = []
    for dic_key in ordering:
        list_tups += same_vals[dic_key]
    left_1 = 0
    left_0 = 0
    right_1 = num_one
    right_0 = num_zero
    for index, tup in enumerate(list_tups):
        # 0's or 1's on the left +/- 1
        # calculate entropy on left/right, calculate entropy drop, etc.
Trivial details (continuing the code above):
        if index == len(sorted_by_feature) - 1:
            break
        if tup[1] == 1:
            left_1 += 1
            right_1 -= 1
        if tup[1] == 0:
            left_0 += 1
            right_0 -= 1
        # only calculate entropy if values to left and right of split are different
        if list_tups[index][0] != list_tups[index+1][0]:
tl;dr
You're asking for a miracle. No programming language can help you out of this one. Use better approaches than what you're considering doing!
Your Solution has Exponential Time Complexity
Let's assume a perfect algorithm: one that can give you a new partition in constant O(1) time. In other words, no matter what the input, a new partition can be generated in a guaranteed constant amount of time.
Let's in fact go one step further and assume that your algorithm is only CPU-bound and is operating under ideal conditions. Under ideal circumstances, a high-end CPU can process upwards of 100 billion instructions per second. Since this algorithm takes O(1) time, we'll say, oh, that every new partition is generated in a hundred billionth of a second. So far so good?
Now you want this to perform well. You say you want this to be able to handle an input of size m. You know that means you need about pow(2,m) iterations of your algorithm - that's the number of partitions you need to generate, and since generating each partition takes a finite amount of time O(1), the total time is just pow(2,m) times O(1). Let's take a quick look at the numbers here:
m = 20 means your time taken is pow(2,20)*10^-11 seconds = 0.00001 seconds. Not bad.
m = 40 means your time taken is pow(2,40) * 10^-11 seconds ≈ 1 trillion / 100 billion = 10 seconds. Also not bad, but note how small m = 40 is. In the vast panopticon of numbers, 40 is nothing. And remember we're assuming ideal conditions.
m = 100 means roughly 10^19 seconds, which is hundreds of billions of years! What happened?
You're a victim of algorithmic theory. Simply put, a solution that has exponential time complexity - any solution that takes 2^m time to complete - cannot be sped up by better programming. Generating or producing pow(2,m) outputs is always going to take you on the order of pow(2,m) time.
Note further that 100 billion instructions/sec is an ideal for high-end desktop computers - your CPU also has to worry about processes other than this program you're running, in which case kernel interrupts and context switches eat into processing time (especially when you're running a few thousand system processes, which you no doubt are). Your CPU also has to read and write from disk, which is I/O bound and takes a lot longer than you think. Interpreted languages like Python also eat into processing time since each line is dynamically converted to bytecode, forcing additional resources to be devoted to that. You can benchmark your code right now and I can pretty much guarantee your numbers will be way higher than the simplistic calculations I provide above. Even worse: storing 2^40 permutations requires 1000 GBs of memory. Do you have that much to spare? :)
Switching to a lower-level language, using generators, etc. is all a pointless affair: they're not the main bottleneck, which is simply the large and unreasonable time complexity of your brute force approach of generating all partitions.
What You Can Do Instead
Use a better algorithm. Generating pow(2,m) partitions and investigating all of them is an unrealistic ambition. You want, instead, to consider a dynamic programming approach. Instead of walking through the entire space of possible partitions, you want to only consider walking through a reduced space of optimal solutions only. That is what dynamic programming does for you. An example of it at work in a problem similar to this one: unique integer partitioning.
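For a flavour of what the dynamic-programming idea looks like on that related problem (counting integer partitions), here is a short memoized sketch; the point is that caching subproblem results keeps the work polynomial instead of enumerating every partition explicitly:
from functools import lru_cache

@lru_cache(maxsize=None)
def partitions(n, max_part):
    """Ways to write n as a sum of parts no larger than max_part."""
    if n == 0:
        return 1
    if n < 0 or max_part == 0:
        return 0
    # either use one part of size max_part, or decide never to use that size again
    return partitions(n - max_part, max_part) + partitions(n, max_part - 1)

print(partitions(100, 100))   # 190569292, without generating any partition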
Dynamic programming approaches work best on problems that can be formulated as linearized directed acyclic graphs (Google it if you're not sure what I mean!).
If a dynamic approach is out, consider investing in parallel processing with a GPU instead. Your computer already has a GPU - it's what your system uses to render graphics - and GPUs are built to be able to perform large numbers of calculations in parallel. A parallel calculation is one in which different workers can do different parts of the same calculation at the same time - the net result can then be joined back together at the end. If you can figure out a way to break this into a series of parallel calculations - and I think there is good reason to suggest you can - there are good tools for GPU interfacing in Python.
Other Tips
Be very explicit on what you mean by best. If you can provide more information on what best means, we folks on Stack Overflow might be of more assistance and write such an algorithm for you.
Using a bare-metal compiled language might help reduce the amount of real time your solution takes in ordinary situations, but the difference in this case is going to be marginal. Compiled languages are useful when you have to do operations like searching through an array efficiently, since there's no instruction-compilation overhead at each iteration. They're not all that more useful when it comes to generating new partitions, because that's not something that removing the dynamic bytecode generation barrier actually affects.
A couple of minor improvements I can see:
Use try/except instead of if not in to avoid a double lookup of keys:
if ele[0] not in same_vals:
    same_vals[ele[0]] = [ele]
else:
    same_vals[ele[0]].append(ele)

# Should be changed to

try:
    same_vals[ele[0]].append(ele)  # Most of the time this will work
except KeyError:
    same_vals[ele[0]] = [ele]
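A closely related alternative that avoids the explicit exception handling is collections.defaultdict; a minimal sketch with a toy stand-in for the question's sorted_by_feature:
from collections import defaultdict

sorted_by_feature = [(1.5, 0), (1.5, 1), (2.0, 0)]   # toy data
same_vals = defaultdict(list)
for ele in sorted_by_feature:
    same_vals[ele[0]].append(ele)   # a missing key starts out as an empty list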
Don't explicitly convert a generator to a list if you don't have to. I don't immediately see any need for your conversion to a list, which only slows things down:
orderings = list(itertools.permutations(l))
for ordering in orderings:
# Should be changed to
for ordering in itertools.permutations(l):
But, like I said, these are only minor improvements. If you really needed this to be faster, consider using a different language.

How can I rewrite this Python operation so it doesn't hang my system?

Beginner here, looked for an answer, but can't find one.
I know (or rather suspect) that part of the problem with the following code is how big the list of combinations gets.
(Maybe too, the last line seems like an error, in that, if I just run 'print ...' rather than 'comb += ...' it runs quickly and quits. Would 'append' be more graceful?)
I'm not 100% sure if the system hang is due to disk I/O (swapping?), CPU use, or memory... running it under Windows seems to result in a rather large disk I/O by 'System', while under Linux, top was showing high CPU and memory use before it was killed. In both cases though, the rest of the system was unusable while this operation was going (tried it in the Python interpreter directly, as well as in PyCharm).
So two part question: 1) is there some 'safe' way to test code like this that won't affect the rest of the system negatively, and 2) for this specific example, how should I rewrite it?
Trying this code (which I do not recommend!):
from itertools import combinations_with_replacement as cwr
comb = []
iterable = [1,2,3,4]
for x in xrange(4,100):
    comb += cwr(iterable, x)
Thanks!
EDIT: Should have specified, but it is python2.7 code here as well (guess the xrange makes it obvious it's not 3 anyways). The Windows machine that's hanging has 4 GB of RAM, but it looks like the hang is on disk I/O. The original problem I was (and still am) working on was a question at codewars.com, about how many ways to make change given a list of possible coins and an amount to make. The solution I'd come up with worked for small amounts, and not big ones. Obviously, I need to come up with a better algorithm to solve that problem... so this is non-essential code, certainly. However, I would like to know if there's something I can do to set the programming environment so that bugs in my code don't propagate and choke my system this way.
FURTHER EDIT:
I was working on the problem again tonight, and realized that I didn't need to append to a master list (as some of you hinted to me in the comments), but just work on the subset that was collected. I hadn't really given enough of the code to make that obvious, but my key problem here was the line:
comb += cwr(iterable, x)
which should have been
comb = cwr(iterable, x)
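In other words, something along these lines, where each combination is consumed as it is generated and never accumulated; the sum test and the small upper bound are just hypothetical placeholders for whatever per-combination work is actually needed:
from itertools import combinations_with_replacement as cwr

iterable = [1, 2, 3, 4]
count = 0
for x in range(4, 12):                  # keep the upper bound modest
    for combo in cwr(iterable, x):      # consumed lazily, never stored
        if sum(combo) == 20:            # hypothetical per-combination check
            count += 1
print(count)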
Since you are trying to compute combinations with replacement, the number of orderings that must be considered will be 4 raised to the nth power (4 because your iterable has 4 items).
More generally speaking, the number of orderings to be computed is the number of elements that can be at any spot in the list, raised to the power of how long the list is.
You are trying to compute 4^n for n between 4 and 99. 4^99 is about 4.01734511064748 * 10^59.
I'm afraid not even a quantum computer would be much help computing that.
This isn't a very powerful laptop (3.7 GiB RAM, Intel® Celeron® CPU N2820 @ 2.13GHz × 2, 64-bit Ubuntu), but it did it in 15 s or so (though it did slow noticeably; top showed 100% CPU (dual core) and 35% memory). It took about 15 s to release the memory when it finished.
len(comb) was 4,421,240
I had to change your code to
from itertools import combinations_with_replacement as cwr
comb = []
iterable = [1,2,3,4]
for x in xrange(4,100):
    comb.extend(list(cwr(iterable, x)))
ED - just re-tried as per your original and it does run OK. My mistake. It looks as though it is the memory requirement. If you really need to do this you could write it to a file.
re-ED: being curious about the back-of-an-envelope complexity calculation above not squaring with my experience, I tried plotting n (X axis) against the length of the list returned by combinations_with_replacement() (Y axis) for iterable lengths i = 2, 3, 4, 5. The result seems to be below n**(i-1) (which ties in with the figure I got for the 4, 99 case above). It's actually (i+n-1)! / n! / (i-1)!, which approximates to n**(i-1) / (i-1)! for n much bigger than i.
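That closed form is easy to check directly against itertools (math.comb needs Python 3.8+):
from itertools import combinations_with_replacement as cwr
from math import comb

i, n = 4, 10
assert len(list(cwr(range(i), n))) == comb(n + i - 1, n)

# total for the question's loop over x in range(4, 100):
print(sum(comb(x + i - 1, x) for x in range(4, 100)))   # 4421240, matching len(comb) above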
Also, when running the plot I didn't keep the full comb list in memory, and this did improve computer performance quite a bit, so maybe that's the relevant point: rather than produce a giant list and then work on it afterwards, do the calculations in the loop.

Performance of multiple iterations

Wondering about the performance impact of doing one iteration vs many iterations. I work in Python -- I'm not sure if that affects the answer or not.
Consider trying to perform a series of data transformations to every item in a list.
def one_pass(my_list):
    for i in xrange(0, len(my_list)):
        my_list[i] = first_transformation(my_list[i])
        my_list[i] = second_transformation(my_list[i])
        my_list[i] = third_transformation(my_list[i])
    return my_list

def multi_pass(my_list):
    range_end = len(my_list)
    for i in xrange(0, range_end):
        my_list[i] = first_transformation(my_list[i])
    for i in xrange(0, range_end):
        my_list[i] = second_transformation(my_list[i])
    for i in xrange(0, range_end):
        my_list[i] = third_transformation(my_list[i])
    return my_list
Now, apart from issues with readability, strictly in performance terms, is there a real advantage to one_pass over multi_pass? Assuming most of the work happens in the transformation functions themselves, wouldn't each iteration in multi_pass only take roughly 1/3 as long?
The difference will be how often the values and code you're reading are in the CPU's cache.
If the elements of my_list are large, but fit into the CPU cache, the first version may be beneficial. On the other hand, if the (byte)code of the transformations is large, caching the operations may be better than caching the data.
Both versions are probably slower than the way more readable:
def simple(my_list):
    return [third_transformation(second_transformation(first_transformation(e)))
            for e in my_list]
Timing it yields:
one_pass: 0.839533090591
multi_pass: 0.840938806534
simple: 0.569097995758
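A harness along these lines is one way to reproduce that kind of comparison; the transformation functions here are trivial stand-ins, so only the relative times mean anything (and on Python 3, replace xrange with range in the functions above):
import timeit

def first_transformation(x):  return x + 1
def second_transformation(x): return x * 2
def third_transformation(x):  return x - 3

def simple(my_list):
    return [third_transformation(second_transformation(first_transformation(e)))
            for e in my_list]

data = list(range(100000))
print("simple:", timeit.timeit(lambda: simple(list(data)), number=10))
# time one_pass and multi_pass from the question in exactly the same way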
Assuming you're considering a program that can easily be one loop with multiple operations, or multiple loops doing one operation each, then it never changes the computational complexity (e.g. an O(n) algorithm is still O(n) either way).
One advantage of the single-pass approach is that you save on the "book-keeping" of the looping. Whether the iteration mechanism is incrementing and comparing counters, or retrieving "next" pointers and checking for null, or whatever, you do it less when you do everything in one pass. Assuming that your operations do any significant amount of work at all (and that your looping mechanism is simple and straightforward, not looping over an expensive generator or something), then this "book-keeping" work will be dwarfed by the actual work of your operations, which makes this definitely a micro-optimisation that you shouldn't be doing unless you know your program is too slow and you've exhausted all more significant available optimisations.
Another advantage can be that applying all your operations to each element of the iteration before you move on to the next one tends to benefit better from the CPU cache, since each item could still be in the cache in subsequent operations on the same item, whereas using multiple passes makes that almost impossible (unless your entire collection fits in the cache). Python has so much indirection via dictionaries going on though that it's probably not hard for each individual operation to overflow the cache by reading hash buckets scattered all over the memory space. So this is still a micro-optimisation, but this analysis gives it more of a chance (though still no certainty) of making a significant difference.
One advantage of multi-pass can be that if you need to keep state between loop iterations, the single-pass approach forces you to keep the state of all operations around. This can hurt the CPU cache (maybe the state of each operation individually fits in the cache for an entire pass, but not the state of all the operations put together). In the extreme case this effect could theoretically make the difference between the program fitting in memory and not (I have encountered this once in a program that was chewing through very large quantities of data). But in the extreme cases you know that you need to split things up, and the non-extreme cases are again micro-optimisations that are not worth making in advance.
So performance generally favours single-pass by an insignificant amount, but can in some cases favour either single-pass or multi-pass by a significant amount. The conclusion you can draw from this is the same as the general advice that applies to all programming: start by writing code in whatever way is most clear and maintainable and still adequately solves your problem. Only once you've got a mostly finished program, and if it turns out to be "not fast enough", then measure the performance effects of the various parts of your code to find out where it's worth spending your time.
Time spent worrying about whether to write single-pass or multi-pass algorithms for performance reasons will almost always turn out to have been wasted. So unless you have unlimited development time available to you, you will get the "best" results from your total development effort (where best includes performance) by not worrying about this up-front, and addressing it on an as-needed basis.
Personally, I would no doubt prefer the one_pass option. It definitely performs better. You may be right that the difference wouldn't be huge. Python has optimized the xrange iterator really well, but you are still doing 3 times more iterations than needed.
You may get fewer cache misses in either version compared to the other. It depends on what those transform functions actually do.
If those functions have a lot of code and operate on different sets of data (besides the input and output), multipass may be better. Otherwise the single pass is likely to be better because the current list element will likely remain cached and the loop operations are only done once instead of three times.
This is a case where comparing actual run times would be very useful.
