I'm sorry if this has been asked before, but I'm not really sure what I'm looking for and I lack the domain knowledge to correctly frame the question which make answers rather hard to find!
Anyway, I am trying to implement in Python a simulated annealing algorithm from a paper (IBM J. Res. Dev., 2001; 45(3/4); 545). The author gives a clear outline of the algorithm which he has implemented in C++, however at the end of his definition he states the follwing
"To avoid repeated and potentially expensive memory allocation S and S* are implemented as a single object that is able to revert to its initial state after an unfavourable mutation."
(S and S* represent the original and step changed state of whatever is being optimised).
In a previous more naive version I have used two lists to hold each state, but his comment seems to suggest that such an approach is memory inefficient. Thus, my questions are:
Is his comment C++ specific and in Python I can continue to use lists and not worry about it?
If I do need to worry about it, what Python data structure should I use? Simply define a class with original and mutated attributes and a method to do the mutation or is there something else I'm missing?
I still need the two states, so does wrapping that in a class change the way the memory is allocated to make the class representation more compact?
You could theoretically log and replay all manipulations to a mutable Python object to achieve it's state prior to the misguided mutation. But that's painful and expensive. So is mapping logged events to inverse functions in such a way that a mutation can be reverted.
The trend with functional programming, however seems to strongly encourage expanded use of immutable data...especially with concurrent programming. This mindset isn't just limited to functional languages like Haskell, OCaml, and Erlang. It's even infiltrated the Java world:
An object is considered immutable if its state cannot change after it
is constructed. Maximum reliance on immutable objects is widely
accepted as a sound strategy for creating simple, reliable code.
Immutable objects are particularly useful in concurrent applications.
Since they cannot change state, they cannot be corrupted by thread
interference or observed in an inconsistent state.
Programmers are often reluctant to employ immutable objects, because
they worry about the cost of creating a new object as opposed to
updating an object in place. The impact of object creation is often
overestimated, and can be offset by some of the efficiencies
associated with immutable objects. These include decreased overhead
due to garbage collection, and the elimination of code needed to
protect mutable objects from corruption.
The following subsections take a class whose instances are mutable and
derives a class with immutable instances from it. In so doing, they
give general rules for this kind of conversion and demonstrate some of
the advantages of immutable objects.
Generate a new list with a map or list comprehension and modify accordingly. If Ram really is an issue, consider using a generator with that gives you an iterable with desired modification and a lower memory footprint.
It depends on the object being created. If it is a large object (in-memory) then creating two of them S and S' as lists or otherwise would be less memory efficient then designing a way to transform S -> S' and vice versa as long as the transformation itself does not consume much memory.
These type of Data Structures are classified as RetroActive Data Structures with a few other synonyms like Kinetic data structures.
Using a deque and managing all the object states is one way of doing it.
Related
The Python/C API has a number of related functions that perform similar operations where one is usually for general use and another is somehow more efficient or convenient for a specific situation.
For example, PyDict_SetItem and PyDict_SetItemString are the same but the latter is specialized for a C string as a key. For PyList_SetItem and PyList_SET_ITEM, the second has no error checking and doesn't decref the existing item (explained here).
But what is the reason for having special conversions like PyInt_FromLong when there is always Py_BuildValue? Both return a new reference to a Python object and seem to do the exact same thing. This same question applies to functions like PyInt_FromSsize_t, PyTuple_Pack, etc. (all compared to Py_BuildValue with the appropriate format string of course).
When should you use one over the other?
For me, it is a trade-off between performance and simplicity. For code that I've written, I tend to use Py_BuildValue when I don't care about the running time (i.e. returning a set of values that won't change, or the running time is so long the extra overhead doesn't matter.) For performance critical functions, especially those that can be at the innermost level of a loop, I will use the use lowest level API call that I can.
Whether it is worth it or not is a different question. I'll never recoup the time I've spent trying to get the best performance but I know other applications that use my code are measurably faster.
Does Python stores similar objects at memory locations nearer to each other?
Because id of similar objects, say lists and tuples, are nearer to each other than an object of type str.
No, except of course by coincidence. While this is highly implementation- and environment-specific, and there are actually memory management schemes which would dedicate page-sized memory regions to objects of the same type, no Python implementations I'm aware of exhibits the behavior you describe. With the possible exception of small numbers, which are sometimes cached under the hood and will likely be located next to each other.
What you're seeing may be because string literals are created at import time (part of the constants in the byte code) and interned, while lists and tuples (that don't contain literals) are created while running code. If a bunch of memory is allocated in between (especially if it isn't freed), the state of the heap may be sufficiently different that quite different addresses are handed out when you're checking.
Is there an article or forum discussion or something somewhere that explains why lists use append/extend, but sets and dicts use add/update?
I frequently find myself converting lists into sets and this difference makes that quite tedious, so for my personal sanity I'd like to know what the rationalization is.
The need to convert between these occurs regularly as we iterate on development. Over time as the structure of the program morphs, various structures gain and lose requirements like ordering and duplicates.
For example, something that starts out as an unordered bunch of stuff in a list might pick up the the requirement that there be no duplicates and so need to be converted to a set.
All such changes require finding and changing all places where the relevant structure is added/appended and extended/updated.
So I'm curious to see the original discussion that led to this language choice, but unfortunately I didn't have any luck googling for it.
append has a popular definition of "add to the very end", and extend can be read similarly (in the nuance where it means "...beyond a certain point"); sets have no "end", nor any way to specify some "point" within them or "at their boundaries" (because there are no "boundaries"!), so it would be highly misleading to suggest that these operations could be performed.
x.append(y) always increases len(x) by exactly one (whether y was already in list x or not); no such assertion holds for s.add(z) (s's length may increase or stay the same). Moreover, in these snippets, y can have any value (i.e., the append operation never fails [except for the anomalous case in which you've run out of memory]) -- again no such assertion holds about z (which must be hashable, otherwise the add operation fails and raises an exception). Similar differences apply to extend vs update. Using the same name for operations with such drastically different semantics would be very misleading indeed.
it seems pythonic to just use a list
on the first pass and deal with the
performance on a later iteration
Performance is the least of it! lists support duplicate items, ordering, and any item type -- sets guarantee item uniqueness, have no concept of order, and demand item hashability. There is nothing Pythonic in using a list (plus goofy checks against duplicates, etc) to stand for a set -- performance or not, "say what you mean!" is the Pythonic Way;-). (In languages such as Fortran or C, where all you get as a built-in container type are arrays, you might have to perform such "mental mapping" if you need to avoid using add-on libraries; in Python, there is no such need).
Edit: the OP asserts in a comment that they don't know from the start (e.g.) that duplicates are disallowed in a certain algorithm (strange, but, whatever) -- they're looking for a painless way to make a list into a set once they do discover duplicates are bad there (and, I'll add: order doesn't matter, items are hashable, indexing/slicing unneeded, etc). To get exactly the same effect one would have if Python's sets had "synonyms" for the two methods in question:
class somewhatlistlikeset(set):
def append(self, x): self.add(x)
def extend(self, x): self.update(x)
Of course, if the only change is at the set creation (which used to be list creation), the code may be much more challenging to follow, having lost the useful clarity whereby using add vs append allows anybody reading the code to know "locally" whether the object is a set vs a list... but this, too, is part of the "exactly the same effect" above-mentioned!-)
set and dict are unordered. "Append" and "extend" conceptually only apply to ordered types.
It's written that way to annoy you.
Seriously. It's designed so that one can't simply convert one into the other easily. Historically, sets are based off dicts, so the two share naming conventions. While you could easily write a set wrapper to add these methods ...
class ListlikeSet(set):
def append(self, x):
self.add(x)
def extend(self, xs):
self.update(xs)
... the greater question is why you find yourself converting lists to sets with such regularity. They represent substantially different models of a collection of objects; if you have to convert between the two a lot, it suggests you may not have a very good handle on the conceptual architecture of your program.
All the books I've read on data structures so far seem to use C/C++, and make heavy use of the "manual" pointer control that they offer. Since Python hides that sort of memory management and garbage collection from the user is it even possible to implement efficient data structures in this language, and is there any reason to do so instead of using the built-ins?
Python gives you some powerful, highly optimized data structures, both as built-ins and as part of a few modules in the standard library (lists and dicts, of course, but also tuples, sets, arrays in module array, and some other containers in module collections).
Combinations of these data structures (and maybe some of the functions from helper modules such as heapq and bisect) are generally sufficient to implement most richer structures that may be needed in real-life programming; however, that's not invariably the case.
When you need something more than the rich library provides, consider the fact that an object's attributes (and items in collections) are essentially "pointers" to other objects (without pointer arithmetic), i.e., "reseatable references", in Python just like in Java. In Python, you normally use a None value in an attribute or item to represent what NULL would mean in C++ or null would mean in Java.
So, for example, you could implement binary trees via, e.g.:
class Node(object):
__slots__ = 'payload', 'left', 'right'
def __init__(self, payload=None, left=None, right=None):
self.payload = payload
self.left = left
self.right = right
plus methods or functions for traversal and similar operations (the __slots__ class attribute is optional -- mostly a memory optimization, to avoid each Node instance carrying its own __dict__, which would be substantially larger than the three needed attributes/references).
Other examples of data structures that may best be represented by dedicated Python classes, rather than by direct composition of other existing Python structures, include tries (see e.g. here) and graphs (see e.g. here).
For some simple data structures (eg. a stack), you can just use the builtin list to get your job done. With more complex structures (eg. a bloom filter), you'll have to implement them yourself using the primitives the language supports.
You should use the builtins if they serve your purpose really since they're debugged and optimised by a horde of people for a long time. Doing it from scratch by yourself will probably produce an inferior data structure.
If however, you need something that's not available as a primitive or if the primitive doesn't perform well enough, you'll have to implement your own type.
The details like pointer management etc. are just implementation talk and don't really limit the capabilities of the language itself.
C/C++ data structure books are only attempting to teach you the underlying principles behind the various structures - they are generally not advising you to actually go out and re-invent the wheel by building your own library of stacks and lists.
Whether you're using Python, C++, C#, Java, whatever, you should always look to the built in data structures first. They will generally be implemented using the same system primitives you would have to use doing it yourself, but with the advantage of having been tried and tested.
Only when the provided data structures do not allow you to accomplish what you need, and there isn't an alternative and reliable library available to you, should you be looking at building something from scratch (or extending what's provided).
How Python handles objects at a low level isn't too strange anyway. This document should disambiguate it a tad; it's basically all the pointer logic you're already familiar with.
With Python you have access to a vast assortment of library modules written and debugged by other people. Odds are very good that somewhere out there, there is a module that does at least part of what you want, and odds are even good that it might be implemented in C for performance.
For example, if you need to do matrix math, you can use NumPy, which was written in C and Fortran.
Python is slow enough that you won't be happy if you try to write some sort of really compute-intensive code (example, a Fast Fourier Transform) in native Python. On the other hand, you can get a C-coded Fourier Transform as part of SciPy, and just use it.
I have never had a situation where I wanted to solve a problem in Python and said "darn, I just can't express the data structure I need."
If you are a pioneer, and you are doing something in Python for which there just isn't any library module out there, then you can try writing it in pure Python. If it is fast enough, you are done. If it is too slow, you can profile it, figure out where the slow parts are, and rewrite them in C using the Python C API. I have never needed to do this yet.
It's not possible to implement something like a C++ vector in Python, since you don't have array primitives the way C/C++ do. However, anything more complicated can be implemented (efficiently) on top of it, including, but not limited to: linked lists, hash tables, multisets, bloom filters, etc.
In order to save space and the complexity of having to maintain the consistency of data between different sources, I'm considering storing start/end indices for some substrings instead of storing the substrings themselves. The trick is that if I do so, it's possible I'll be creating slices ALL the time. Is this something to be avoided? Is the slice operator fast enough I don't need to worry? How about the new object creation/destruction overhead?
Okay, I learned my lesson. Don't optimize unless there's a real problem you're trying to fix. (Of course this doesn't mean to right needlessly bad code, but that's beside the point...) Also, test and profile before coming to stack overflow. =D Thanks everyone!
Fast enough as opposed to what? How do you do it right now? What exactly are you storing, what exactly are you retrieving? The answer probably highly depends on this. Which brings us to ...
Measure! Don't discuss and analyze theoretically; try and measure what is the more performant way. Then decide whether the possible performance gain justifies refactoring your database.
Edit: I just ran a test measuring string slicing versus lookup in a dict keyed on (start, end) tuples. It suggests that there's not much of a difference. It's a pretty naive test, though, so take it with a pinch of salt.
In a comment the OP mentions bloat "in the database" -- but no information regarding what database he's talking about; from the scant information in that comment it would seem that Python string slices aren't necessarily what's involved, rather, the "slicing" would be done by the DB engine upon retrieval.
If that's the actual situation then I would recommend on general principles against storing redundant information in the DB -- a "normal form" (maybe in a lax sense of the expression;-) whereby information is stored just once and derived information is recomputed (or cached charge of the DB engine, etc;-) should be the norm, and "denormalization" by deliberately storing derived information very much the exception and only when justified by specific, well measured retrieval-performance needs.
If the reference to "database" was a misdirection;-), or rather used in a lax sense as I did for "normal form" above;-), then another consideration may apply: since Python strings are immutable, it would seem to be natural to not have to do slices by copying, but rather have each slice reuse part of the memory space of the parent it's being sliced from (much as is done for numpy arrays' slices). However that's not currently part of the Python core. I did once try a patch to that purpose, but the problem of adding a reference to the big string and thus making it stay in memory just because a tiny substring thereof is still referenced loomed large for general-purpose adaptation. Still it would be possible to make a special purpose subclass of string (and one of unicode) for the case in which the big "parent" string needs to stay in memory anyway. Currently buffer does a tiny bit of that, but you can't call string methods on a buffer object (without explicitly copying it to a string object first), so it's only really useful for output and a few special cases... but there's no real conceptual block against adding string method (I doubt that would be adopted in the core, but it should be decently easy to maintain as a third party module anyway;-).
The worth of such an approach can hardly be solidly proven by measurement, one way or another -- speed would be very similar to the current implicitly-copying approach; the advantage would come entirely in terms of reducing memory footprint, which wouldn't so much make any given Python code faster, but rather allow a certain program to execute on a machine with a bit less RAM, or multi-task better when several instances are being used at the same time in separate processes. See rope for a similar but richer approach once experimented with in the context of C++ (but note it didn't make it into the standard;-).
I haven't done any measurements either, but since it sounds like you're already taking a C approach to a problem in Python, you might want to take a look at Python's built-in mmap library:
Memory-mapped file objects behave like both strings and like file objects. Unlike normal string objects, however, these are mutable. You can use mmap objects in most places where strings are expected; for example, you can use the re module to search through a memory-mapped file. Since they’re mutable, you can change a single character by doing obj[index] = 'a', or change a substring by assigning to a slice: obj[i1:i2] = '...'. You can also read and write data starting at the current file position, and seek() through the file to different positions.
I'm not sure from your question if that's exactly what you're looking for. And it bears repeating that you need to take some measurements. Python's timeit library is the easy one to use, but there's also cProfile or hotshot, although hotshot is at risk of being removed from the standard library as I understand it.
Would slices be ineffective because they create copies of the source string? This may or may not be an issue. If it turns out to be an issue, would it not be possible to simply implement a "String view"; an object that has a reference to the source string and has a start and end point.. Upon access/iteration, it just reads from the source string.
premature optimization is the rool of all evil.
Prove to yourself that you really have a need to optimize code, then act.