Suggestions for practicing generators (Python)

I understand the concept behind generators and why one would choose them over lists, but I'm struggling to get quality practice actually implementing them in my code. Any suggestions on the types of problems I should play around with? I've already done the Fibonacci exercise, but I'd like to practice with other problems that put generators to good use. Thanks!

How about this one: implement a generator that reads chunks from a large file or a big database (so big that it wouldn't fit into memory). Alternatively, consider a stream of infinitely many values as input.
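For instance, a minimal sketch of a chunked file reader might look like this (the file name and chunk size are just placeholders):

def read_in_chunks(path, chunk_size=64 * 1024):
    """Yield successive chunks of a file without loading it all into memory."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

# Hypothetical usage: total bytes processed, one chunk at a time
total = sum(len(chunk) for chunk in read_in_chunks("big_data.bin"))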
As you might already have learned, this is a common use case in real world applications:
https://docs.python.org/3/howto/functional.html
With a list comprehension, you get back a Python list; [...] Generator expressions return an iterator that computes the values as necessary, not needing to materialize all the values at once. This means that list comprehensions aren’t useful if you’re working with iterators that return an infinite stream or a very large amount of data. Generator expressions are preferable in these situations.
http://naiquevin.github.io/python-generators-and-being-lazy.html
Now you may ask how does this differ from an ordinary list and what is the use of all this anyway? The key difference is that the generator gives out new values on the fly and doesn't keep the elements in memory.
https://wiki.python.org/moin/Generators
The performance improvement from the use of generators is the result of the lazy (on demand) generation of values, which translates to lower memory usage. Furthermore, we do not need to wait until all the elements have been generated before we start to use them. This is similar to the benefits provided by iterators, but the generator makes building iterators easy.

Related

Python Equivalent of Scheme/Lisp CAR and CDR Functions

I am currently trying to implement fold/reduce in Python, since I don't like the version from functools. This naturally involved implementing something like the Lisp CDR function, since Python doesn't seem to have anything like it. Here is what I am thinking of trying:
def tail(lat):
    # all elements of the list except the first
    acc = []
    for i in range(1, len(lat)):
        acc = acc + [lat[i]]
    return acc
Would this be an efficient way of implementing this function? Am I missing some kind of built-in function? Thanks in advance!
"Something like the Lisp CDR function" is trivial:
acc[1:]
This will be significantly faster than your attempt, but only by a constant factor.
However, it doesn't make much sense to do this in the first place. The whole point of CDR is that, when your lists are linked lists stored in CONS cells, going from one cell to its tail is a single machine-language operation. But with arrays (which is what Python lists are), lat[1:]—or the more complicated thing you tried to write, or in fact any possible implementation—allocates a whole new array of size N-1 and copies over N-1 values.
The efficiency cost of doing that over and over again (in an algorithm that was expecting it to be nearly free) is going to be so huge that the constant-factor speedup from using lat[1:] is unlikely to be nearly enough of an improvement to make it acceptable.
Most algorithms that are fast with CDR are going to be slow with this kind of slicing, and most algorithms that are fast with this kind of slicing would be slow with CDR. That's why we have multiple data structures in the first place: because they're good for different things.
If you want to know the most efficient way to fold/reduce on an array—it's the way functools.reduce (and the variations of it that libraries like toolz offer) do it: just iterate.
And just iterating has another huge advantage. Python doesn't just have lists, it has an abstraction called iterables, which include iterators and other types that can generate their contents lazily. If you're folding forward, you can take advantage of that laziness. (Folding backward does of course take linear space, either explicitly or on the stack—but it's still better than quadratic copying.) Ignoring that fact defeats the purpose.
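As a rough illustration of "just iterate", a forward fold over any iterable is nothing more than a loop; this is a sketch, not the actual functools.reduce implementation:

def foldl(function, iterable, initial):
    """Left fold: apply function cumulatively, consuming the iterable lazily."""
    acc = initial
    for item in iterable:
        acc = function(acc, item)
    return acc

# Works on lists, generators, or any other iterable
total = foldl(lambda acc, x: acc + x, range(10), 0)   # 45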

Pythonic pattern for building up parallel lists

I am new-ish to Python and I am finding that I am writing the same pattern of code over and over again:
def foo(list):
    results = []
    for n in list:
        # do some or a lot of processing on n and possibly other variables
        nprime = operation(n)
        results.append(nprime)
    return results
I am thinking in particular about the creation of the empty list followed by the append call. Is there a more Pythonic way to express this pattern? append might not have the best performance characteristics, but I am not sure how else I would approach it in Python.
I often know exactly the length of my output, so calling append each time seems like it might be causing memory fragmentation, or performance problems, but I am also wondering if that is just my old C ways tripping me up. I am writing a lot of text parsing code that isn't super performance sensitive on any particular loop or piece because all of the performance is really contained in gensim or NLTK code and is in much more capable hands than mine.
Is there a better/more pythonic pattern for doing this type of operation?
First, a list comprehension may be all you need (assuming all the processing mentioned in your comment happens inside operation):
def foo(list):
    return [operation(n) for n in list]
If a list comprehension will not work in your situation, consider whether foo really needs to build the list and could be a generator instead.
def foo(list):
    for n in list:
        # Processing...
        yield operation(n)
In this case, you can iterate over the sequence, and each value is calculated on demand:
for x in foo(myList):
    ...
or you can let the caller decide if a full list is needed:
results = list(foo(myList))
If neither of the above is suitable, then building up the return list in the body of the loop as you are now is perfectly reasonable.
[..] so calling append each time seems like it might be causing memory fragmentation, or performance problems, but I am also wondering if that is just my old C ways tripping me up.
If you are worried about this, don't be. Python over-allocates whenever the list needs to be resized (lists grow dynamically) in order to keep appends amortized O(1). Whether you call list.append manually or build the list with a list comprehension (which internally also uses .append), the effect, memory-wise, is similar.
The list comprehension just performs a bit better speed-wise; it is optimized for creating lists with specialized bytecode instructions that help it (mainly LIST_APPEND, which calls the list's append directly in C).
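If you're curious, you can see that instruction yourself with the dis module (the exact bytecode varies between CPython versions, so treat this as illustrative; operation and data are just placeholder names):

import dis

# Disassemble a list comprehension; the loop body ends in a LIST_APPEND opcode
dis.dis('[operation(n) for n in data]')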
Of course, if memory usage is a concern, you could always opt for the generator approach, as highlighted in chepner's answer, to produce your results lazily.
In the end, for loops are still great. They might seem clunky in comparison to comprehensions and maps but they still offer a recognizable and readable way to achieve a goal. for loops deserve our love too.

Efficiency of for loops in python3

I am currently learning Python (3), having mostly worked with R as my main programming language. While for loops in R have largely the same functionality as in Python, I was taught to avoid using them for big operations and to use apply instead, which is more efficient.
My question is: how efficient are for-loops in Python, are there alternatives and is it worth exploring those possibilities as a Python newbie?
For example:
p = some_candidate_parameter_generator(data)
for i in p:
    fit_model_with_parameter(data, i)
Bear with me: it is tricky to give an example without going too deep into specific code. But this is something that in R I would have written with apply, especially if p is large.
The comments correctly point out that for loops are "only as efficient as your logic"; however, range and xrange in Python do have performance implications, and this may be what you had in mind when asking this question. These functions have nothing to do with the intrinsic performance of for loops, though.
In Python 3, range behaves the way xrange used to (and xrange itself is gone); in Python versions before 3.0 there was a distinction – range built the entire list in memory and then iterated over each item, while xrange was more akin to a generator, where each item was loaded into memory only when needed and discarded after it was iterated over.
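One quick way to see that Python 3's range is lazy is to compare its memory footprint against a materialized list (exact byte counts vary by platform and version):

import sys

lazy = range(10**6)          # stores only start, stop, and step
eager = list(range(10**6))   # materializes a million integers

print(sys.getsizeof(lazy))   # a few dozen bytes
print(sys.getsizeof(eager))  # several megabytes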
After your updated question:
In other words, if you have a giant list of items that you need to iterate over with a for loop, it is often more memory efficient to use a generator rather than a list or a tuple. Again, though, this has nothing to do with how the Python for loop itself operates; it has to do with what you're iterating over. If in doubt, use a generator, and your memory efficiency will be as good as it gets in Python.
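Applied to your example, that might look like the sketch below; fit_model_with_parameter here is just a stand-in for whatever fitting routine you actually use:

def fit_model_with_parameter(data, p):
    # placeholder for a real fitting routine
    return sum(d * p for d in data)

data = [1.0, 2.0, 3.0]
candidate_parameters = (p / 10.0 for p in range(1000))   # lazy parameter stream

# Nothing is fitted until a value is pulled from the generator expression
fits = (fit_model_with_parameter(data, p) for p in candidate_parameters)
best = max(fits)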

How best to store large sequences of text in Python?

I recently discovered that a student of mine was doing an independent project in which he was using very large strings (2-4MB) as values in a dictionary.
I've never had a reason to work with such large blocks of text and it got me wondering if there were performance issues associated with creating such large strings.
Is there a better way of doing it than to simply create a string? I realize this question is largely context dependent, but I'm looking for generalized answers that may cover more than one possible use-case.
If you were working with that much text, how would you store it in your code, and would you do anything different than if you were simply working with an ordinary string of only a few characters?
It depends a lot on what you're doing with the strings. I'm not exactly sure how Python stores strings but I've done a lot of work on XEmacs (similar to GNU Emacs) and on the underlying implementation of Emacs Lisp, which is a dynamic language like Python, and I know how strings are implemented there. Strings are going to be stored as blocks of memory similar to arrays. There's not a huge issue creating large arrays in Python, so I don't think simply storing the strings this way will cause performance issues. Some things to consider though:
How are you building up the string? If you build it up piece by piece by appending to an ever larger string, you have an O(N^2) algorithm that will be very slow. Java handles this with a StringBuilder class. I'm not sure whether there's an exact equivalent in Python, but you can simply collect all the parts you want to join in a list and join them at the end with ''.join(parts); there's a short sketch of this below.
Do you need to search the string? This isn't related to creating the strings but it's something to consider. Searching will in general be O(n) in the size of the string; there are speedups that make it O(n/m) where m is the size of the substring you're searching for, but that's about it. The main consideration here is whether to store one big string or a series of substrings. If you need to search all the substrings, that won't help much over searching a big string, but it's possible you might know in advance that some parts don't need to be searched.
Do you need to access substrings? Again, this isn't related to creating the strings, it's something to consider. Accessing a substring by position is just a matter of indexing to the right memory location, but if you need to take large substrings, it may be inefficient, and you might be able to speed things up by storing your string as an array of substrings, and then creating a new string as another array with some of the strings shared. However, doing it this way takes work, and shouldn't be done unless it's really necessary.
In sum, I think for simple cases it's fine to have large strings like this, but you should think about the sorts of operations you're going to perform and what their O(...) time is.
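To make the first point concrete, here is a rough comparison of the two building strategies (the piece count is arbitrary, and CPython sometimes optimizes in-place += on strings, so measure if it matters to you):

pieces = ["chunk%d " % i for i in range(10000)]

# Potentially O(N^2): each += may copy the ever-growing string
slow = ""
for p in pieces:
    slow += p

# Roughly O(N): collect the parts, join once at the end
fast = "".join(pieces)

assert slow == fast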
I would say that potential issues depend on two things:
how many strings of this kind are held in memory at the same time, compared to the capacity of the memory (the RAM)?
what operations are performed on these strings?
It seems to me I've read that string operations in Python are very efficient, so working on very long strings shouldn't in itself be a problem. But in practice it depends on the algorithm of each operation performed on a big string.
This answer is rather vague; I don't have enough experience to make a more useful estimate of the problem. But the question is also very broad.

Memory efficiency: One large dictionary or a dictionary of smaller dictionaries?

I'm writing an application in Python (2.6) that requires me to use a dictionary as a data store.
I am curious as to whether or not it is more memory efficient to have one large dictionary, or to break that down into many (much) smaller dictionaries, then have an "index" dictionary that contains a reference to all the smaller dictionaries.
I know there is a lot of overhead in general with lists and dictionaries. I read somewhere that Python internally allocates enough space for the dictionary's/list's number of items to the power of 2.
I'm new enough to Python that I'm not sure whether there are other unexpected internal complexities/surprises like that, not apparent to the average user, that I should take into consideration.
One of the difficulties is knowing how the power-of-2 scheme counts "items". Is each key:value pair counted as one item? That seems important to know, because if you have a 100-item monolithic dictionary, then space for 100^2 items would be allocated. If you have 100 single-item dictionaries (one key:value pair each), would each dictionary only allocate space for 1^2 items (i.e., no extra allocation)?
Any clearly laid out information would be very helpful!
Three suggestions:
Use one dictionary.
It's easier, it's more straightforward, and someone else has already optimized this problem for you. Until you've actually measured your code and traced a performance problem to this part of it, you have no reason not to do the simple, straightforward thing.
Optimize later.
If you are really worried about performance, then abstract the problem: make a class to wrap whatever lookup mechanism you end up using, and write your code to use that class. You can change the implementation later if you find you need some other data structure for greater performance.
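A minimal sketch of that kind of wrapper, assuming a plain dict underneath for now (the backing store can be swapped out later without touching the callers):

class DataStore(object):
    """Thin wrapper so the lookup mechanism can be changed later."""
    def __init__(self):
        self._data = {}          # one plain dict for now

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = DataStore()
store.put("user:42", {"name": "Ada"})
print(store.get("user:42"))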
Read up on hash tables.
Dictionaries are hash tables, and if you are worried about their time or space overhead, you should read up on how they're implemented. This is basic computer science. The short of it is that hash tables are:
average case O(1) lookup time
O(n) space (Expect about 2n, depending on various parameters)
I do not know where you read that they were O(n^2) space, but if they were, then they would not be in widespread, practical use as they are in most languages today. There are two advantages to these nice properties of hash tables:
O(1) lookup time implies that you will not pay a cost in lookup time for having a larger dictionary, as lookup time doesn't depend on size.
O(n) space implies that you don't gain much of anything from breaking your dictionary up into smaller pieces. Space scales linearly with number of elements, so lots of small dictionaries will not take up significantly less space than one large one or vice versa. This would not be true if they were O(n^2) space, but lucky for you, they're not.
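You can convince yourself of the O(1) lookup claim with a quick timing check; exact numbers will vary by machine, but the per-lookup time barely changes with dictionary size:

import timeit

small = dict((i, i) for i in range(100))
large = dict((i, i) for i in range(1000000))

print(timeit.timeit(lambda: small[50], number=1000000))
print(timeit.timeit(lambda: large[500000], number=1000000))
# Per-lookup time is roughly the same despite the 10000x size difference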
Here are some more resources that might help:
The Wikipedia article on Hash Tables gives a great listing of the various lookup and allocation schemes used in hashtables.
The GNU Scheme documentation has a nice discussion of how much space you can expect hashtables to take up, including a formal discussion of why "the amount of space used by the hash table is proportional to the number of associations in the table". This might interest you.
Here are some things you might consider if you find you actually need to optimize your dictionary implementation:
Here is the C source code for Python's dictionaries, in case you want ALL the details. There's copious documentation in here:
dictobject.h
dictobject.c
Here is a python implementation of that, in case you don't like reading C.
(Thanks to Ben Peterson)
The Java Hashtable class docs talk a bit about how load factors work, and how they affect the space your hash takes up. Note there's a tradeoff between your load factor and how frequently you need to rehash. Rehashes can be costly.
If you're using Python, you really shouldn't be worrying about this sort of thing in the first place. Just build your data structure the way it best suits your needs, not the computer's.
This smacks of premature optimization, not performance improvement. Profile your code if something is actually bottlenecking, but until then, just let Python do what it does and focus on the actual programming task, and not the underlying mechanics.
"Simple" is generally better than "clever", especially if you have no tested reason to go beyond "simple". And anyway "Memory efficient" is an ambiguous term, and there are tradeoffs, when you consider persisting, serializing, cacheing, swapping, and a whole bunch of other stuff that someone else has already thought through so that in most cases you don't need to.
Think "Simplest way to handle it properly" optimize much later.
Premature optimization bla bla, don't do it bla bla.
I think you're mistaken about what the power-of-two extra allocation does. I think it's just a multiplier of two, i.e. x*2, not x^2.
I've seen this question a few times on various python mailing lists.
With regards to memory, here's a paraphrased version of one such discussion (the post in question wanted to store hundreds of millions integers):
A set() is more space efficient than a dict(), if you just want to test for membership
gmpy has a bitvector type class for storing dense sets of integers
Dicts are kept between 30% and 50% empty, and an entry takes about 12 bytes (though the exact amount varies a bit by platform).
So, the fewer objects you have, the less memory you're going to use, and the fewer lookups you're going to do (since with an index you'd have to look up in the index first, then do a second lookup for the actual value).
Like others said, profile to find your bottlenecks. Keeping a membership set() alongside a value dict() might be faster, but you'll be using more memory.
I'd also suggest reposting this to a python specific list, such as comp.lang.python, which is full of much more knowledgeable people than myself who would give you all sorts of useful information.
If your dictionary is so big that it does not fit into memory, you might want to have a look at ZODB, a very mature object database for Python.
The 'root' of the db has the same interface as a dictionary, and you don't need to load the whole data structure into memory at once; e.g., you can iterate over only a portion of the structure by providing start and end keys.
It also provides transactions and versioning.
Honestly, you won't be able to tell the difference either way, in terms of either performance or memory usage. Unless you're dealing with tens of millions of items or more, the performance or memory impact is just noise.
From the way you worded your second sentence, it sounds like the one big dictionary is your first inclination, and matches more closely with the problem you're trying to solve. If that's true, go with that. What you'll find about Python is that the solutions that everyone considers 'right' nearly always turn out to be those that are as clear and simple as possible.
Oftentimes, dictionaries of dictionaries are useful for reasons other than performance; e.g., they allow you to store context information about the data without having extra fields on the objects themselves, and they can make querying subsets of the data faster.
In terms of memory usage, it stands to reason that one large dictionary will use less RAM than multiple smaller ones. Remember, if you're nesting dictionaries, each additional layer of nesting roughly doubles the number of dictionaries you need to allocate.
In terms of query speed, multiple dicts will take longer due to the increased number of lookups required.
So I think the only way to answer this question is for you to profile your own code. However, my suggestion is to use the method that makes your code the cleanest and easiest to maintain. Of all the features of Python, dictionaries are probably the most heavily tweaked for optimal performance.
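If you do want to measure rather than guess, sys.getsizeof gives a first approximation (it counts only the container objects themselves, not the keys and values they reference); this sketch compares one flat dict against an index of per-group dicts:

import sys

keys = [("group%d" % g, "item%d" % i) for g in range(100) for i in range(100)]

# One flat dictionary keyed by (group, item) tuples
flat = dict(((g, i), 0) for g, i in keys)

# An index dictionary of smaller per-group dictionaries
nested = {}
for g, i in keys:
    nested.setdefault(g, {})[i] = 0

print(sys.getsizeof(flat))
print(sys.getsizeof(nested) + sum(sys.getsizeof(d) for d in nested.values()))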
