Saving calculation results for re-use, while managing memory consumption - python

I'm caching values that are slow to calculate but are usually needed several times. I have a dictionary that looks something like this:
stored_values = {
    hash1: slow_to_calc_value1,
    hash2: slow_to_calc_value2,
    # And so on x5000
}
I'm using it like this, to quickly fetch the value if it has been calculated before.
def calculate_value_for_item(item):
    item_hash = hash_item(item)  # Hash the item, used as the dictionary key
    stored_value = stored_values.get(item_hash, None)
    if stored_value is not None:
        return stored_value
    calculated_value = do_heavy_math(item)  # This is slow and I want to avoid it
    # Storing the result for re-use makes me run out of memory at some point
    stored_values[item_hash] = calculated_value
    return calculated_value
However, I'm running out of memory if I try to store all values that are calculated throughout the program.
How can I manage the size of the lookup dictionary efficiently? It's a reasonable assumption that values which were needed most recently are also most likely to be needed in the future.
Things to note
I have simplified the scenario a lot.
The stored values actually use a lot of memory. The dictionary itself doesn't contain too many items, only several thousand. I can definitely afford some parallel book-keeping data structures if needed.
An ideal solution would let me store n last needed values while removing the rest. But any heuristic close enough is good enough.

Have you tried using the @lru_cache decorator? It seems to do exactly what you are asking for.
from functools import lru_cache

store_this_many_values = 5

@lru_cache(maxsize=store_this_many_values)
def calculate_value_for_item(item):
    calculated_value = do_heavy_math(item)
    return calculated_value
@lru_cache also adds new functions to the decorated function, such as cache_info, which might help you to optimise for memory and/or performance:
for i in [1, 1, 1, 2]:
    calculate_value_for_item(i)
print(calculate_value_for_item.cache_info())
# CacheInfo(hits=2, misses=2, maxsize=5, currsize=2)
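If the items themselves are not hashable (the question hashes them with hash_item first), @lru_cache cannot be applied directly to calculate_value_for_item. Here is a minimal hand-rolled LRU sketch built on collections.OrderedDict, reusing the question's hash_item/do_heavy_math placeholders; MAX_STORED_VALUES is an assumed limit you would tune to your memory budget:
from collections import OrderedDict

MAX_STORED_VALUES = 5000          # assumption: tune to whatever fits in memory

stored_values = OrderedDict()     # item hash -> value, least recently used first

def calculate_value_for_item(item):
    item_hash = hash_item(item)   # hash_item/do_heavy_math as in the question
    if item_hash in stored_values:
        stored_values.move_to_end(item_hash)   # mark as most recently used
        return stored_values[item_hash]
    calculated_value = do_heavy_math(item)
    stored_values[item_hash] = calculated_value
    if len(stored_values) > MAX_STORED_VALUES:
        stored_values.popitem(last=False)      # evict the least recently used entry
    return calculated_value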

Related

Python - get object - Is dictionary or if statements faster?

I am making a POST request to a Python script; the POST has two parameters, Name and Location, and it returns one string. My question is: since I am going to have hundreds of these options, is it faster to do it in a dictionary like this:
myDictionary = {"Name": {"Location": "result", "LocationB": "resultB"},
                "Name2": {"Location2": "result2A", "Location2B": "result2B"}}
And then I would use .get("Name").get("Location") to get the results.
OR do something like this:
if Name == "Name":
    if Location == "Location":
        result = "result"
    elif Location == "LocationB":
        result = "resultB"
elif Name == "Name2":
    if Location == "Location2":
        result = "result2A"
    elif Location == "Location2B":
        result = "result2B"
Now if there are hundreds or thousands of these, which is faster? Or is there a better way altogether?
First of all:
Generally, it's much more pythonic to match keys to values using dictionaries. You should do that as a matter of style.
Secondly:
If you really care about performance, Python might not always be the optimal tool. However, the dict approach should be much, much faster, unless your lookups happen about as often as the creation of these dicts. The creation of thousands and thousands of PyObjects just to check your cases is a really bad idea.
Thirdly:
If you care about your application so much, you might really want to benchmark both solutions. As usual when it comes to performance questions, there are a million factors, including your computing platform, that only experiments will help to sort out.
Fourth(ly?):
It looks like you're building something like a protocol parser. That's really not Python's forte, performance-wise. Maybe you'd want to look into one of the dozens of tools that can generate C parser code for you and wrap that in a native module; done right, it's pretty sure to be faster than either of your implementations.
Here's the python documentation on Extending Python with C or C++
I decided to test the two scenarios of 1000 Names and 2 locations
The Test Samples
Team Dictionary:
di = {}
for i in range(1000):
    di["Name{}".format(i)] = {'Location': 'result{}'.format(i), 'LocationB': 'result{}B'.format(i)}

def get_dictionary_value():
    di.get("Name999").get("LocationB")
Team If Statement:
I used a Python script to generate a 5000-line function if_statements(name, location) following this pattern:
elif name == 'Name994':
    if location == 'Location':
        return 'result994'
    elif location == 'LocationB':
        return 'result994B'
# Some time later ...

def get_if_value():
    if_statements("Name999", "LocationB")
Timing Results
You can use the timeit module to measure how long a function takes to complete.
import timeit
print(timeit.timeit(get_dictionary_value))
# 0.06353...
print(timeit.timeit(get_if_value))
# 6.3684...
So there you have it: on my machine, the dictionary lookup was 100 times faster than the hefty 165 KB if-statement function.
I will root for dict().
In most cases [key] lookup is much faster than a chain of conditional checks; as a rule of thumb, conditionals are for boolean logic, not for mapping values.
The reason for this is that when you create a dictionary, you essentially build a registry of that data, stored as hashes in buckets. When you write dictionary_name['key'], if that value exists, Python knows (via the hash) the exact location of the value and returns it almost instantly.
Conditionals are different. They are sequential checks, meaning that in the worst case every condition provided has to be checked before the value's existence is established and the respective data is returned.
As you can see, with hundreds of statements this can be problematic, so in this case dictionaries are faster. You also need to be aware of how often and how early these lookups happen: if they can run before your dictionary has finished building, you might get a value-not-found error.
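As for the asker's "is there a better way altogether": one common variant, shown here only as a hedged sketch (lookup and get_result are made-up names, not from the original answers), is to flatten the nested structure into a single dict keyed by (name, location) tuples. This keeps a single O(1) lookup and avoids chained .get calls:
# Hypothetical flattened lookup table keyed by (name, location) tuples.
lookup = {
    ("Name", "Location"): "result",
    ("Name", "LocationB"): "resultB",
    ("Name2", "Location2"): "result2A",
    ("Name2", "Location2B"): "result2B",
}

def get_result(name, location, default=None):
    # Single hash lookup; returns default if the pair is unknown.
    return lookup.get((name, location), default)

print(get_result("Name2", "Location2B"))  # result2B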

What is the overhead of using a dictionary instead of a list?

I have a situation in one of my projects where I can either use lists or dictionaries, and I am having a hard time picking which one to use.
I am analyzing a large number of items (>400k), and I will have (>400k) lists or dictionaries which I will use very frequently (get/set/update).
In my particular situation, using a dictionary feels more convenient than a list if I don't think about performance at all. However, I know I could write the same thing using lists.
Should I go for readability and use dictionaries, or might going with dictionaries add so much overhead that it dramatically decreases my performance in terms of both memory and time?
I know this question is a bit too broad, but I wanted to ask it before I start building all my logic on top of this decision.
My situation in a nutshell:
I have values for keys 0, 1, ..., n. For now, the keys will always be integers from 0 to n, which I could keep in a list.
However, I can think of some situations that might arise in the future where I will need to keep values for keys which are not integers, or for integers which are not consecutive.
So, the question is: if using dictionaries instead of lists in the first place doesn't add much of a memory/time cost, I will go with dictionaries from the start. However, I am not sure whether having >400k dictionaries vs. >400k lists makes a big difference in terms of performance.
In direct answer to your question: dictionaries have significantly more overhead than lists:
Each item consumes memory for both key and value, in contrast to only values for lists.
Adding or removing an item requires consulting a hash table.
Despite the fact that Python dictionaries are extremely well-designed and surprisingly fast, if you have an algorithm that can use direct indexing, you will save space and time.
However, from your question and the subsequent discussion, it sounds like your needs may change over time and you have some uncertainty ("I can think of some situations that might arise in the future where I will need to keep values for keys which are not integers").
If this is the case, I suggest creating a hybrid data structure of your own so that as your needs evolve you can address the efficiency of storage in an isolated place while allowing your application to use simple, readable code to store and retrieve objects.
For example, here is a Python3 class called maybelist that is derived from a list, but detects the presence of non-numeric keys, storing exceptions in a dictionary while providing mappings for some common list operations:
import itertools

class maybelist(list):
    def __init__(self, *args):
        super().__init__(*args)
        self._extras = dict()

    def __setitem__(self, index, val):
        try:
            super().__setitem__(index, val)
            return
        except TypeError:
            # Index is not an integer, store in dict
            self._extras[index] = val
            return
        except IndexError:
            pass
        distance = index - len(self)
        if distance > 0:
            # Put 'None' in empty slots if need be
            self.extend((None,) * distance)
        self.append(val)

    def __getitem__(self, index):
        try:
            return super().__getitem__(index)
        except TypeError:
            return self._extras[index]

    def __str__(self):
        return str([item for item in self])

    def __len__(self):
        return super().__len__() + len(self._extras)

    def __iter__(self):
        for item in itertools.chain(super().__iter__(), self._extras):
            yield item
So, you could treat it like an array, and have it auto expand:
>>> x = maybelist()
>>> x[0] = 'first'
>>> x[1] = 'second'
>>> x[10] = 'eleventh'
>>> print(x)
['first', 'second', None, None, None, None, None, None, None, None, 'eleventh']
>>> print(x[10])
eleventh
Or you could add items with non-numeric keys if they were present:
>>> x['unexpected'] = 'something else'
>>> print(x['unexpected'])
something else
And yet have the object appear to behave properly if you access it using iterators or other methods of your choosing:
>>> print(x)
['first', 'second', None, None, None, None, None, None, None, None, 'eleventh', 'unexpected']
>>> print(len(x))
12
This is just an example, and you would need to tailor such a class to meet the needs of your application. For example, the resulting object does not strictly behave like a list (x[len(x)-1] is not the last item, for example). However, your application may not need such strict adherence, and if you are careful and plan properly, you can create an object which both provides highly optimized storage while leaving room for evolving data structure needs in the future.
A dict uses a lot more memory than a list. Probably not enough to be a concern if the computer isn't very busy. There are exceptions of course: if it's a web server handling 100 connections per second, you may want to consider saving memory at the expense of readability.
>>> import sys
>>> L = range(400000)
>>> sys.getsizeof(L)
3200072    # ~3 megabytes (Python 2, where range returns a list)
>>> D = dict(zip(range(400000), range(400000)))
>>> sys.getsizeof(D)
25166104   # ~25 megabytes
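Memory is only one side of the trade-off; for access time the gap is much smaller, since both positional list access and integer-keyed dict lookup are O(1). A rough, hedged timeit sketch (as_list/as_dict are made-up names; absolute numbers depend on machine and Python version):
import timeit

setup = """
n = 400000
as_list = list(range(n))
as_dict = dict(zip(range(n), range(n)))
"""

# Both lookups are O(1); the dict pays a small extra cost for hashing the key.
print(timeit.timeit("as_list[399999]", setup=setup))
print(timeit.timeit("as_dict[399999]", setup=setup))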
Lists are what they seem - a list of values, but in a dictionary, you
have an 'index' of words, and for each of them a definition.
Python dictionaries are just like that: their properties differ from lists because they work by mapping keys to values. That means you use a dictionary when (see the short sketch after these points):
You have to retrieve things based on some identifier, like names, addresses, or anything that can be a key.
You don't need things to be in order. Dictionaries do not normally have any notion of order, so you have to use a list for that (since Python 3.7 dicts do preserve insertion order, but a list is still the natural choice for positional data).
You are going to be adding and removing elements and their keys.
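A tiny illustration of those points, with made-up data:
# Keyed by an identifier -- lookups go by name, not by position.
ages = {"alice": 34, "bob": 27}
ages["carol"] = 41           # add an entry under a new key
del ages["bob"]              # remove an entry together with its key
print(ages.get("alice"))     # 34

# Positional, ordered data is still more natural as a list.
queue = ["first", "second", "third"]
print(queue[1])              # second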
Efficiency constraints are discussed in the Stack Overflow posts Link1 & Link2.
Go for a dictionary, since you have doubts regarding future key types; also, there are no memory constraints to worry about.
Not exactly a spot-on answer to your (not so clear) question, but here are my thoughts:
You said
I am analyzing large number of items (>400k)
In that case, I'd advise you to use generators and/or process your data in chunks, as sketched below.
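A minimal sketch of chunked processing with a generator; chunks, big_list_of_items and process are placeholders, not names from the original post:
def chunks(items, chunk_size=10000):
    """Yield successive chunk_size-sized slices of a list."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]

# Usage sketch: only one slice is materialised at a time.
# for chunk in chunks(big_list_of_items):
#     process(chunk)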
A better option would be to put your data, which are key-value pairs, into Redis and take out chunks of it at a time. Redis can handle your volume of data very easily.
You could write a script that processes one chunk at a time, and using the concurrent.futures module you could parallelize the chunk processing.
Something like this:
from concurrent import futures

def chunk_processor(data):
    """
    Process your list data here
    """
    pass

def parallelizer(map_func, your_data_list, n_workers=3):
    with futures.ThreadPoolExecutor(max_workers=n_workers) as executor:
        for result in executor.map(map_func, your_data_list):
            # Do whatever with your result
            pass

# Take out a chunk of your data from Redis here
chunk_of_list = get_next_chunk_from_redis()

# Your processing starts here
parallelizer(chunk_processor, chunk_of_list)
Again, something better could be done, but I'm presenting one of the ways to go about it.

Python memory explosion with embedded functions

I have used Python for a while, and from time to time I run into memory explosion problems. I have searched for sources to resolve my question, such as
Memory profiling embedded python
and
https://mflerackers.wordpress.com/2012/04/12/fixing-and-avoiding-memory-leaks-in-python/
and
https://docs.python.org/2/reference/datamodel.html#object.__del__
However, none of them works for me.
My current problem is the memory explosion when using embedded functions. The following code works fine:
class A:
    def fa(self):
        # some operations
        # get dictionary1
        # combine dictionary1 to get string1
        dictionary1 = None
        return string1

    def fb(self):
        for i in range(0, j):
            # call self.fa()
            # get dictionary2 by processing string1
            # (dictionary1 and dictionary2 are basically the same)
            # update dictionary3 by processing dictionary2
            dictionary2 = None
        return dictionary3

class B:
    def ga(self):
        for n in range(0, m):
            # call A.fb (as one argument is updated dynamically, I have to call it within the loop)
            # process dictionary3
            pass
        return something
The problem arises when I notice that I don't need to combine dictionary1 into string1; I can pass dictionary1 directly to A.fb. I implemented it that way, and the program became extremely slow while the memory usage exploded by more than 10 times. I have verified that both methods return the correct result.
Can anybody suggest why such a small modification results in such a large difference?
Previously, I also noticed this when I was levelizing nodes in a multi-source tree (with 100,000+ nodes). If I start levelizing from the source node with the largest height, the memory usage is 100 times worse than when starting from the source node with the smallest height, while the levelization time is about the same.
This has baffled me for a long time. Thank you so much in advance!
If anybody is interested, I can email you the source code for a clearer explanation.
The fact that you're solving the same problem doesn't imply anything about the efficiency of the solution. The same can be said of sorting arrays: you can use bubble sort O(n^2), merge sort O(n log n), or, if you can apply some restrictions, a non-comparison sorting algorithm like radix or bucket sort, which have linear runtime.
Starting the traversal from different nodes will generate different ways of traversing the graph, some of which might be inefficient (visiting nodes more times).
As for "combine dictionary1 to string1": it might be a very expensive operation, and since this function is called repeatedly (many times), the performance could be significantly poorer. But that's just an educated guess and cannot be answered without more details about the complexity of the operations performed in these functions.

Quick way to extend a set if we know elements are unique

I am performing multiple iterations of the type:
masterSet = masterSet.union(setA)
As the set grows, the time taken to perform these operations grows as well (as one would expect, I guess).
I expect the time is taken up by checking whether each element of setA is already in masterSet?
My question is: if I KNOW that masterSet does not already contain any of the elements in setA, can I do this quicker?
[UPDATE]
Given that this question is still attracting views I thought I would clear up a few of the things from the comments and answers below:
While iterating, there were many iterations where I knew setA would be distinct from masterSet because of how it was constructed (without having to run any checks), but in a few iterations I did need the uniqueness check.
I wondered if there was a way to 'tell' the masterSet.union() procedure not to bother with the uniqueness check this time around, as I know this set is distinct from masterSet: just add these elements quickly, trusting the programmer's assertion that they are definitely distinct. Perhaps through calling some different ".unionWithDistinctSet()" procedure or something.
I think the responses have suggested that this isn't possible (and that set operations really should be quick enough anyway), but that using masterSet.update(setA) instead of union is slightly quicker.
I have accepted the clearest response along those lines, resolved the issue I was having at the time, and got on with my life, but I would still love to hear whether my hypothesised .unionWithDistinctSet() could ever exist.
You can use set.update to update your master set in place. This saves allocating a new set all the time so it should be a little faster than set.union...
>>> s = set(range(3))
>>> s.update(range(4))
>>> s
set([0, 1, 2, 3])
Of course, if you're doing this in a loop:
masterSet = set()
for setA in iterable:
    masterSet = masterSet.union(setA)
You might get a performance boost by doing something like:
masterSet = set().union(*iterable)
Ultimately, membership testing of a set is O(1) (in the average case), so testing if the element is already contained in the set isn't really a big performance hit.
As mgilson points out, you can use update to update a set in-place from another set. That actually works out slightly quicker:
import timeit

def union():
    i = set(range(10000))
    j = set(range(5000, 15000))
    return i.union(j)

def update():
    i = set(range(10000))
    j = set(range(5000, 15000))
    i.update(j)
    return i

timeit.Timer(union).timeit(10000)   # 10.351907968521118
timeit.Timer(update).timeit(10000)  # 8.83384895324707
If you know your elements are unique, a set is not necessarily the best structure.
A simple list is way faster to extend.
masterList = list(masterSet)
masterList.extend(setA)
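That claim is easy to sanity-check with timeit; a rough, hedged sketch (numbers vary by machine, and remember you give up set semantics such as fast membership tests by switching to a list):
import timeit

setup = "src = set(range(10000))"

# extend copies references without hashing; update must hash every element
print(timeit.timeit("x = []; x.extend(src)", setup=setup, number=2000))
print(timeit.timeit("x = set(); x.update(src)", setup=setup, number=2000))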
For sure, forgoing this check could be a big saving when the __eq__(..) method is very expensive. In the CPython implementation, __eq__(..) is called for every element already in the set that hashes to the same number. (Reference: source code for set.)
However, there will never be this functionality, not in a million years, because it opens up another way to violate the integrity of a set. The trouble associated with that far outweighs the (typically negligible) performance gain. If this really is determined to be a performance bottleneck, it's not hard to write a C++ extension and use its STL <set>, which should be faster by one or more orders of magnitude.

Reducing collection of reduce in Python

If I am reducing over a collection in Python, what's the most efficient way to get the rest of the collection (the unvisited items)? Quite often I need to reduce over a collection, but I want my reducing function to take the unvisited items of the collection I am reducing over.
edit - to clarify, I want something like:
reduce(lambda to_return, item, rest: (code here), collection, initial)
where rest is the items not yet seen by my lambda
This is the best I can do. It expects that the "collection" be sliceable:
def myreduce(func,collection,*args):
"""func takes 3 parameters. The previous value,
the current value, and the rest of the collection"""
def new_func(x,y):
try:
return func(x[1],y[1],collection[y[0]:])
except TypeError:
return func(x,y[1],collection[y[0]:])
return reduce(new_func,enumerate(collection),*args)
print myreduce(lambda x,y,rest:x+y+sum(rest),range(30))
Note that this is very poorly tested, so please test it thoroughly before you attempt to use it in any real code. If you really want this to work for any iterable, you could put a collection = tuple(collection) at the top, I suppose (assuming you have enough memory to store the entire iterable in memory at once).
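If you don't need reduce itself, an explicit loop over a sliceable collection does the same job more transparently. This is only a sketch (reduce_with_rest is a made-up name); note that, unlike myreduce above, it passes the tail excluding the current item and takes an explicit initial value:
def reduce_with_rest(func, collection, initial=0):
    """Like reduce, but func also receives the not-yet-visited tail."""
    acc = initial
    for i, item in enumerate(collection):
        acc = func(acc, item, collection[i + 1:])  # rest excludes the current item
    return acc

print(reduce_with_rest(lambda acc, item, rest: acc + item + sum(rest), range(30)))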
