If I am reducing over a collection in Python, what's the most efficient way to get the rest of the collection (the unvisited items)? Quite often I need to reduce over a collection, but I want my reducing function to receive the unvisited items of the collection I am reducing over.
edit - to clarify, I want something like:
reduce(lambda to-return, item, rest: (code here), collection, initial)
where rest is the items not yet seen by my lambda
This is the best I can do. It expects that the "collection" be sliceable:
def myreduce(func, collection, *args):
    """func takes 3 parameters: the previous value,
    the current value, and the rest of the collection."""
    def new_func(x, y):
        try:
            return func(x[1], y[1], collection[y[0]:])
        except TypeError:
            return func(x, y[1], collection[y[0]:])
    return reduce(new_func, enumerate(collection), *args)

print myreduce(lambda x, y, rest: x + y + sum(rest), range(30))
Note that this is very poorly tested, so please test thoroughly before you use it in any real code. If you really want this to work for any iterable, you could add a collection = tuple(collection) at the top, I suppose (assuming you have enough memory to hold your entire iterable in memory at once).
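For comparison, here is a minimal sketch (the name reduce_with_rest and its exact signature are my own invention, not the answer above) of that tuple(collection) idea written as a plain loop: the iterable is materialised first, an explicit initial value is required, and rest is strictly the items the function has not seen yet.

def reduce_with_rest(func, iterable, initial):
    """Sketch: func(accumulated, item, rest), where rest holds the items not yet visited."""
    items = tuple(iterable)  # assumes the whole iterable fits in memory
    acc = initial
    for i, item in enumerate(items):
        acc = func(acc, item, items[i + 1:])
    return acc

print(reduce_with_rest(lambda acc, item, rest: acc + item + sum(rest), range(30), 0))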
I'm caching values that are slow to calculate but are usually needed several times. I have a dictionary that looks something like this:
stored_values = {
    hash1: slow_to_calc_value1,
    hash2: slow_to_calc_value2,
    # And so on x5000
}
I'm using it like this, to quickly fetch the value if it has been calculated before.
def calculate_value_for_item(item):
    item_hash = hash_item(item)  # Hash the item, used as the dictionary key
    stored_value = stored_values.get(item_hash, None)
    if stored_value is not None:
        return stored_value
    calculated_value = do_heavy_math(item)  # This is slow and I want to avoid it
    # Storing the result for re-use makes me run out of memory at some point
    stored_values[item_hash] = calculated_value
    return calculated_value
However, I'm running out of memory if I try to store all values that are calculated throughout the program.
How can I manage the size of the lookup dictionary efficiently? It's a reasonable assumption that values which were needed most recently are also most likely to be needed in the future.
Things to note
I have simplified the scenario a lot.
The stored values actually use a lot of memory. The dictionary itself doesn't contain too many items, only several thousand. I can definitely afford some parallel book-keeping data structures if needed.
An ideal solution would let me store n last needed values while removing the rest. But any heuristic close enough is good enough.
Have you tried using the @lru_cache decorator? It seems to do exactly what you are asking for.
from functools import lru_cache

store_this_many_values = 5

@lru_cache(maxsize=store_this_many_values)
def calculate_value_for_item(item):
    calculated_value = do_heavy_math(item)
    return calculated_value

@lru_cache also adds new functions to the decorated callable, which might help you to optimise for memory and/or performance, such as cache_info:

for i in [1, 1, 1, 2]:
    calculate_value_for_item(i)

print(calculate_value_for_item.cache_info())
# CacheInfo(hits=2, misses=2, maxsize=5, currsize=2)
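If lru_cache doesn't fit your setup, for example because item itself isn't hashable and you rely on your hash_item helper as the key, a hand-rolled LRU is one option. This is only a sketch: MAX_STORED is a made-up cap, and hash_item/do_heavy_math are the functions from your question.

from collections import OrderedDict

MAX_STORED = 5000  # hypothetical cap, tune to your memory budget
stored_values = OrderedDict()

def calculate_value_for_item(item):
    item_hash = hash_item(item)               # your existing hashing helper
    if item_hash in stored_values:
        stored_values.move_to_end(item_hash)  # mark as most recently used
        return stored_values[item_hash]
    calculated_value = do_heavy_math(item)    # the slow part
    stored_values[item_hash] = calculated_value
    if len(stored_values) > MAX_STORED:
        stored_values.popitem(last=False)     # evict the least recently used entry
    return calculated_value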
I've got a django app that, at a very pseudo-codey level, does something like this:
class SerialisedContentItem():
    def __init__(self, item):
        self.__output = self.__jsonify(item)

    def fetch_output(self):
        return self.__output

    def __jsonify(self, item):
        serialized = do_a_bunch_of_serialisey_stuff(item)
        return json.dumps(serialized)
So basically, as soon as the class is instantiated, it:
runs an internal function to generate an output string of JSON
stores it in an internal variable
exposes a public function that can be called later to retrieve the JSON
It's then being used to generate a page something like this:
for item in page.items:
    json_item = SerialisedContentItem(item)
    yield json_item.fetch_output()
This, to me, seems a bit pointless. And it's also causing issues with some business logic changes we need to make.
What I'd prefer to do is defer the calling of the "jsonify" function until I actually want it. Roughly speaking, changing the above to:
class SerialisedContentItem():
    def __init__(self, item):
        self.__item = item

    def fetch_output(self):
        return self.__jsonify(self.__item)
This seems simpler, and mucks with my logic slightly less.
But: is there a downside I'm not seeing? Is my change less performant, or not a good way of doing things?
As long as you only call fetch_output once per item, there's no performance hit (there would be one, obviously, if you called fetch_output twice on the same SerializedContentItem instance). And not doing useless operations is usually a good thing too (you don't expect open("/path/to/some/file.ext") to read the file's content, do you?)
The only caveat is that, with the original version, if item is mutated between the initialization of SerializedContentItem and the call to fetch_output, the change won't be reflected in the json output (since it's created right at initialisation time), while with your "lazy" version those changes WILL reflect in the json. Whether this is a no-go, a potential issue or actually just what you want depends on the context, so only you can tell.
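To make that caveat concrete, here is a small self-contained sketch (with json.dumps standing in for do_a_bunch_of_serialisey_stuff, and simplified class names of my own) showing the difference:

import json

class EagerItem:
    def __init__(self, item):
        self.__output = json.dumps(item)   # serialised right away: a snapshot

    def fetch_output(self):
        return self.__output

class LazyItem:
    def __init__(self, item):
        self.__item = item                 # just keep a reference

    def fetch_output(self):
        return json.dumps(self.__item)     # serialised on demand

item = {"title": "draft"}
eager, lazy = EagerItem(item), LazyItem(item)
item["title"] = "final"                    # mutate after construction
print(eager.fetch_output())                # {"title": "draft"} - the snapshot
print(lazy.fetch_output())                 # {"title": "final"} - sees the change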
EDIT:
What prompted my question: according to my (poor) understanding of yield, using it here makes sense: I only need to iterate over page items once, so I do it in a way that minimises the memory footprint. But the bulk of the work isn't currently being done at the yield; it's being done on the line above it, when the class is instantiated, making yield a bit pointless. Or am I misunderstanding how it works?
I'm afraid you are indeed misunderstanding yield. Deferring the json serialization until the yield json_item.fetch_output() line will change nothing (nada, zero, zilch, shunya) about memory consumption wrt/ the original version.
yield is not a function, it's a keyword. What it does is turn the function containing it into a "generator function" - a function that returns a generator (a lazy iterator) object, which you can then iterate over. It will not change anything about the memory used to jsonify an item, and whether this jsonification happens "on the same line" as the yield keyword or not is totally irrelevant.
What a generator brings you (wrt/ memory use) is that you don't have to create a whole list of contents at once, ie:
def eager():
    result = []
    for i in range(1000):
        result.append("foo {}\n".format(i))
    return result

with open("file.txt", "w") as outfile:
    for item in eager():
        outfile.write(item)
This FIRST creates a 1000-item list in memory, then iterates over it.
vs
def lazy():
    for i in range(1000):
        yield "foo {}\n".format(i)

with open("file.txt", "w") as outfile:
    for item in lazy():
        outfile.write(item)
this lazily generates one string after another on each iteration, so you don't end up with a 1000-item list in memory - BUT you still generated 1000 strings, each of them using the same amount of space as in the first solution. The difference is that since (in this example) you don't keep any reference to those strings, they can be garbage collected on each iteration, while storing them in a list prevents them from being collected until there's no more reference to the list itself.
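For what it's worth, if you want the view code itself to stay lazy with your deferred version, a generator expression does the same job as the explicit yield loop (a sketch using the names from the question):

def serialised_page_items(page):
    # Each item is only serialised when the consumer pulls it from the generator.
    return (SerialisedContentItem(item).fetch_output() for item in page.items)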
I have a situation in one of my projects where I can use either lists or dictionaries, and I am having a hard time picking which one to use.
I am analyzing a large number of items (>400k), and I will have (>400k) lists or dictionaries which I will use very frequently (get/set/update).
In my particular situation, using a dictionary feels more convenient than a list if I don't think about performance at all. However, I know I could write the same thing using lists.
Should I go for readability and use a dictionary, or would going with a dictionary add too much overhead and dramatically decrease my performance in terms of both memory and time?
I know this question is a bit too broad. But I wanted to ask it before I start building all my logic on top of this decision.
My situation in a nutshell:
I have values for keys 0,1,...,n. For now, the keys will always be integers from 0 to n, which I can keep in a list.
However, I can think of some situations that might arise in future that I will need to keep some items for keys which are not integers. Or integers which are not consecutive.
So, the question is: if using dictionaries instead of lists in the first place wouldn't add much of a memory/time cost, I will go with dictionaries from the start. However, I am not sure whether having >400k dictionaries vs. >400k lists makes a big difference in terms of performance.
In direct answer to your question: dictionaries have significantly more overhead than lists:
Each item consumes memory for both key and value, in contrast to only values for lists.
Adding or removing an item requires consulting a hash table.
Despite the fact that Python dictionaries are extremely well designed and surprisingly fast, if you have an algorithm that can use direct indexing, you will save space and time.
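As a rough, hedged illustration of the container overhead (exact numbers depend on the Python version and platform):

import sys

n = 400000
as_list = list(range(n))
as_dict = dict.fromkeys(range(n), 0)

# getsizeof reports the container itself, not the objects it references
print(sys.getsizeof(as_list))   # a few megabytes of pointers
print(sys.getsizeof(as_dict))   # typically several times larger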
However, from the sound of your question and the subsequent discussion, it sounds like your needs may change over time and you have some uncertainty ("However, I can think of some situations that might arise in future that I will need to keep some items for keys which are not integers")
If this is the case, I suggest creating a hybrid data structure of your own so that as your needs evolve you can address the efficiency of storage in an isolated place while allowing your application to use simple, readable code to store and retrieve objects.
For example, here is a Python3 class called maybelist that is derived from a list, but detects the presence of non-numeric keys, storing exceptions in a dictionary while providing mappings for some common list operations:
import itertools

class maybelist(list):
    def __init__(self, *args):
        super().__init__(*args)
        self._extras = dict()

    def __setitem__(self, index, val):
        try:
            super().__setitem__(index, val)
            return
        except TypeError:
            # Index is not an integer, store in dict
            self._extras[index] = val
            return
        except IndexError:
            pass
        distance = index - len(self)
        if distance > 0:
            # Put 'None' in empty slots if need be
            self.extend((None,) * distance)
        self.append(val)

    def __getitem__(self, index):
        try:
            return super().__getitem__(index)
        except TypeError:
            return self._extras[index]

    def __str__(self):
        return str([item for item in self])

    def __len__(self):
        return super().__len__() + len(self._extras)

    def __iter__(self):
        for item in itertools.chain(super().__iter__(), self._extras):
            yield item
So, you could treat it like an array, and have it auto expand:
>>> x = maybelist()
>>> x[0] = 'first'
>>> x[1] = 'second'
>>> x[10] = 'eleventh'
>>> print(x)
['first', 'second', None, None, None, None, None, None, None, None, 'eleventh']
>>> print(x[10])
eleventh
Or you could add items with non-numeric keys if they were present:
>>> x['unexpected'] = 'something else'
>>> print(x['unexpected'])
something else
And yet have the object appear to behave properly if you access it using iterators or other methods of your choosing:
>>> print(x)
['first', 'second', None, None, None, None, None, None, None, None, 'eleventh', 'unexpected']
>>> print(len(x))
12
This is just an example, and you would need to tailor such a class to meet the needs of your application. For example, the resulting object does not strictly behave like a list (x[len(x)-1] is not the last item, for example). However, your application may not need such strict adherence, and if you are careful and plan properly, you can create an object which both provides highly optimized storage while leaving room for evolving data structure needs in the future.
dict uses a lot more memory than a list. Probably not enough to be a concern if the computer isn't very busy. There are exceptions of course - if it's a web server with 100 connections per second, you may want to consider saving memory at the expense of readability. The numbers below are from Python 2, where range() returns a list:
>>> import sys
>>> L = range(400000)
>>> sys.getsizeof(L)
3200072   # ~3 Megabytes
>>> D = dict(zip(range(400000), range(400000)))
>>> sys.getsizeof(D)
25166104  # ~25 Megabytes
Lists are what they seem: a list of values. In a dictionary, you have an 'index' of words, and for each of them a definition. Python dictionaries work the same way, but their properties differ from lists because they map keys to values. That means you use a dictionary when:
You have to retrieve things based on some identifier, like names, addresses, or anything that can be a key.
You don't need things to be in order. Dictionaries did not traditionally guarantee any ordering (plain dicts preserve insertion order only since Python 3.7), so a list is the natural fit when order matters.
You are going to be adding and removing elements and their keys.
Efficiency constraints are discussed in the Stack Overflow posts Link1 & Link2.
Go for a dictionary, as you have doubts regarding future values and there are no memory constraints to worry about.
Not exactly a spot-on answer to your not-so-clear question, but here are my thoughts:
You said
I am analyzing large number of items (>400k)
In that case, I'd advise you to use generators and/or process your data in chunks.
A better option would be to put your data, which are key-value pairs, into Redis and take out chunks of it at a time. Redis can handle that volume of data very easily.
You could write a script that processes one chunk at a time, and using the concurrent.futures module, you could parallelize the chunk processing.
Something like this:
from concurrent import futures

def chunk_processor(data):
    """
    Process your list data here
    """
    pass

def parallelizer(map_func, your_data_list, n_workers=3):
    with futures.ThreadPoolExecutor(max_workers=n_workers) as executor:
        for result in executor.map(map_func, your_data_list):
            # Do whatever with your result
            pass

# Take out a chunk of your data from Redis here
chunk_of_list = get_next_chunk_from_redis()

# Your processing starts here
parallelizer(chunk_processor, chunk_of_list)
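The get_next_chunk_from_redis() call above is just a placeholder. One possible shape for it, sketched with the redis-py client (the key pattern, chunk size, and the assumption that values were stored as plain strings are all mine), is a generator that yields successive chunks:

import redis

r = redis.Redis()  # assumes a local Redis server holding your key-value pairs

def iter_chunks_from_redis(pattern="item:*", chunk_size=1000):
    """Yield lists of values, chunk_size keys at a time (hypothetical helper)."""
    batch = []
    for key in r.scan_iter(match=pattern, count=chunk_size):
        batch.append(key)
        if len(batch) >= chunk_size:
            yield r.mget(batch)
            batch = []
    if batch:
        yield r.mget(batch)

# Feed each chunk to the parallelizer in turn
for chunk_of_list in iter_chunks_from_redis():
    parallelizer(chunk_processor, chunk_of_list)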
Again, something better could be done, but I'm presenting one of the ways to go about it.
Here's what I have so far:
def is_ordered(collection):
    if isinstance(collection, set):
        return False
    if isinstance(collection, list):
        return True
    if isinstance(collection, dict):
        return False
    raise Exception("unknown collection")
Is there a much better way to do this?
NB: I do mean ordered and not sorted.
Motivation:
I want to iterate over an ordered collection. e.g.
def most_important(priorities):
    for p in priorities:
        print p
In this case the fact that priorities is ordered is important; what kind of collection it is is not. I'm trying to live by duck-typing here, and I have frequently been dissuaded from type checking by Pythonistas.
If the collection is truly arbitrary (meaning it can be of any class whatsoever), then the answer has to be no.
Basically, there are two possible approaches:
know about every possible class that can be presented to your method, and whether it's ordered;
test the collection yourself by inserting into it every possible combination of keys, and seeing whether the ordering is preserved.
The latter is clearly infeasible. The former is along the lines of what you already have, except that you have to know about every derived class such as collections.OrderedDict; checking for dict is not enough.
Frankly, I think the whole is_ordered check is a can of worms. Why do you want to do this anyway?
Update: In essence, you are trying to unittest the argument passed to you. Stop doing that, and unittest your own code. Test your consumer (make sure it works with ordered collections), and unittest the code that calls it, to ensure it is getting the right results.
In a statically-typed language you would simply restrict yourself to specific types. If you really want to replicate that, simply specify the only types you accept, and test for those. Raise an exception if anything else is passed. It's not pythonic, but it reliably achieves what you want to do.
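A sketch of that whitelist approach applied to the example from the question (which types you choose to trust is, of course, up to you):

from collections import OrderedDict

ORDERED_TYPES = (list, tuple, OrderedDict)   # the types you decide to accept

def most_important(priorities):
    if not isinstance(priorities, ORDERED_TYPES):
        raise TypeError("most_important() requires an ordered collection")
    for p in priorities:
        print(p)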
Well, you have two possible approaches:
Anything with an append method is almost certainly ordered; and
If it only has an add method, you can try adding a nonce value, then iterating over the collection to see whether the nonce appears at the end (or, perhaps, at one end); you could try adding a second nonce and doing it again just to be more confident (a rough sketch of this check follows this answer).
Of course, this won't work where e.g. the collection is empty, or there is an ordering function that doesn't result in addition at the ends.
Probably a better solution is simply to specify that your code requires ordered collections, and only pass it ordered collections.
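For what it's worth, here is a rough sketch of the two heuristics above combined into one function (the name and the exact checks are mine, and note that the nonce trick mutates the collection):

def appears_ordered(collection):
    if hasattr(collection, "append"):      # approach 1: append() almost certainly means ordered
        return True
    if hasattr(collection, "add"):         # approach 2: the nonce trick
        nonce = object()
        collection.add(nonce)
        items = list(collection)
        try:
            collection.discard(nonce)      # clean up where possible (e.g. sets)
        except AttributeError:
            pass
        return items[0] is nonce or items[-1] is nonce
    return False                           # no idea, assume unordered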
I think that enumerating the 90% case is about as good as you're going to get (if using Python 3, replace basestring with str). You'll probably also want to consider how you would handle generator expressions and their ilk, too (again, if using Py3, skip the xrangor):
import collections

generator = type((i for i in xrange(0)))
enumerator = type(enumerate(range(0)))
xrangor = type(xrange(0))

is_ordered = lambda seq: isinstance(seq, (tuple, list, collections.OrderedDict,
                                          basestring, generator, enumerator, xrangor))
If your callers start using itertools, then you'll also need to add itertools types as returned by islice, imap, groupby. But the sheer number of these special cases really starts to point to a code smell.
What if the list is not ordered, e.g. [1,3,2]?
I have one or more unordered sequences of (immutable, hashable) objects with possible duplicates and I want to get a sorted sequence of all those objects without duplicates.
Right now I'm using a set to quickly gather all the elements discarding duplicates, convert it to a list and then sort that:
result = set()
for s in sequences:
    result = result.union(s)
result = list(result)
result.sort()
return result
It works but I wouldn't call it "pretty". Is there a better way?
This should work:
sorted(set(itertools.chain.from_iterable(sequences)))
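For completeness, a runnable version of that one-liner (sequences here is just a made-up example input):

import itertools

sequences = [[3, 1, 2], (2, 5), {1, 4}]
print(sorted(set(itertools.chain.from_iterable(sequences))))   # [1, 2, 3, 4, 5]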
I like your code just fine. It is straightforward and easy to understand.
We can shorten it just a little bit by using sorted() in place of the list()/sort() steps:
result = set()
for s in sequences:
    result = result.union(s)
return sorted(result)
I really have no desire to try to boil it down beyond that, but you could do it with reduce():
result = reduce(lambda s, x: s.union(x), sequences, set())
return sorted(result)
Personally, I think this is harder to understand than the above, but people steeped in functional programming might prefer it.
EDIT: #agf is much better at this reduce() stuff than I am. From the comments below:
return sorted(reduce(set().union, sequences))
I had no idea this would work. If I correctly understand how this works, we are giving reduce() a callable which is really a method function on one instance of a set() (call it x for the sake of discussion, but note that I am not saying that Python will bind the name x with this object). Then reduce() will feed this function the first two iterables from sequences, returning x, the instance whose method function we are using. Then reduce() will repeatedly call the .union() method and ask it to take the union of x and the next iterable from sequences. Since the .union() method is likely smart enough to notice that it is being asked to take the union with its own instance and not bother to do any work, it should be just as fast to call x.union(x, some_iterable) as to just call x.union(some_iterable). Finally, reduce() will return x, and we have the set we want.
This is a bit tricky for my personal taste. I had to think this through to understand it, while the itertools.chain() solution made sense to me right away.
EDIT: #agf made it less tricky:
return sorted(reduce(set.union, sequences, set()))
What this is doing is much simpler to understand! If we call the instance returned by set() by the name of x again (and just like above with the understanding that I am not claiming that Python will bind the name x with this instance); and if we use the name n to refer to each "next" value from sequences; then reduce() will be repeatedly calling set.union(x, n). And of course this is exactly the same thing as x.union(n). IMHO if you want a reduce() solution, this is the best one.
--
If you want it to be fast, ask yourself: is there any way we can apply itertools to this? There is a pretty good way:
from itertools import chain
return sorted(set(chain(*sequences)))
itertools.chain() called with *sequences serves to "flatten" the list of lists into a single iterable. It's a little bit tricky, but only a little bit, and it's a common idiom.
EDIT: As #Jbernardo wrote in the most popular answer, and as #agf observes in comments, itertools.chain has a from_iterable() alternate constructor, and the documentation says it evaluates the iterable lazily. The * notation forces the whole outer iterable to be unpacked up front, which may consume considerable memory if it is a long sequence. In fact, you could have a never-ending generator, and with itertools.chain.from_iterable() you would be able to pull values from it for as long as you want to run your program, while the * notation would just run out of memory.
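A tiny demonstration of that difference, using an endless generator (itertools.count() never stops):

from itertools import chain, count, islice

endless = (range(i, i + 3) for i in count())   # an endless stream of small ranges
lazy = chain.from_iterable(endless)            # fine: nothing is materialised up front
print(list(islice(lazy, 5)))                   # [0, 1, 2, 1, 2]
# chain(*endless) would try to unpack the endless generator first and never finish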
As #Jbernardo wrote:
sorted(set(itertools.chain.from_iterable(sequences)))
This is the best answer, and I already upvoted it.