Is there anything faster than dict()? - python

I need a faster way to store and access around 3GB of k:v pairs, where k is a string or an integer and v is an np.array() that can be of different shapes.
Is there any object faster than the standard Python dict at storing and accessing such a table? For example, a pandas.DataFrame?
As far as I have understood, the Python dict is a quite fast implementation of a hash table. Is there anything better than that for my specific case?

No, there is nothing faster than a dictionary for this task, and that's because the complexity of its indexing (getting and setting items) and even of membership checking is O(1) on average. (Check the complexity of the rest of its operations in the Python docs: https://wiki.python.org/moin/TimeComplexity )
Once you have saved your items in a dictionary, you can access them in constant time, which means it's unlikely that your performance problem has anything to do with dictionary indexing. That said, you might still be able to make this process slightly faster by making some changes to your objects and their types that enable optimizations in the under-the-hood operations.
E.g. if your strings (keys) are not very large, you can intern the lookup key and your dictionary's keys. Interning is caching the objects in memory -- or, as in Python, in a table of "interned" strings -- rather than creating them as separate objects.
Python provides an intern() function within the sys module that you can use for this. From its documentation:
Enter string in the table of "interned" strings and return the interned string -- which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup...
also ...
If the keys in a dictionary are interned and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer comparison instead of comparing the string values themselves, which in consequence reduces the access time to the object.
Here is an example:
In [49]: d = {'mystr{}'.format(i): i for i in range(30)}
In [50]: %timeit d['mystr25']
10000000 loops, best of 3: 46.9 ns per loop
In [51]: d = {sys.intern('mystr{}'.format(i)): i for i in range(30)}
In [52]: %timeit d['mystr25']
10000000 loops, best of 3: 38.8 ns per loop

No, I don't think there is anything faster than dict. The time complexity of its indexing is O(1).
-------------------------------------------------------
Operation | Average Case | Amortized Worst Case |
-------------------------------------------------------
Copy[2] | O(n) | O(n) |
Get Item | O(1) | O(n) |
Set Item[1] | O(1) | O(n) |
Delete Item | O(1) | O(n) |
Iteration[2] | O(n) | O(n) |
-------------------------------------------------------
PS https://wiki.python.org/moin/TimeComplexity
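A quick sketch of that constant-time claim (absolute timings are machine-dependent; the point is that lookup time barely moves as the dict grows):

```python
import timeit

# Time 1M lookups of a middle key in dicts of very different sizes.
for size in (10**3, 10**4, 10**5, 10**6):
    d = {i: i for i in range(size)}
    key = size // 2
    t = timeit.timeit(lambda: d[key], number=1_000_000)
    print("%8d keys: %.3fs per 1M lookups" % (size, t))
```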

A numpy array and a simple dict = {} comparison:
import numpy
from timeit import default_timer as timer

my_array = numpy.ones([400, 400])

def read_out_array_values():
    cumsum = 0
    for i in range(400):
        for j in range(400):
            cumsum += my_array[i, j]

start = timer()
read_out_array_values()
end = timer()
print("Time for array calculations:" + str(end - start))

my_dict = {}
for i in range(400):
    for j in range(400):
        my_dict[i, j] = 1

def read_out_dict_values():
    cumsum = 0
    for i in range(400):
        for j in range(400):
            cumsum += my_dict[i, j]

start = timer()
read_out_dict_values()
end = timer()
print("Time for dict calculations:" + str(end - start))
Prints:
Time for array calculations:0.07558204099999999
Time for dict calculations:0.046898419999999996
============= RESTART: C:/Users/user/Desktop/dict-vs-numpyarray.py =============
Time for array calculations:0.07849989000000002
Time for dict calculations:0.047769446000000104

One would think that array indexing is faster than hash lookup.
So if we could store this data in a numpy array, and assume the keys are not strings but numbers, would that be faster than a Python dictionary?
Unfortunately not, because NumPy is optimized for vector operations, not for individual lookups of values.
Pandas fares even worse.
See the experiment here: https://nbviewer.jupyter.org/github/annotation/text-fabric/blob/master/test/pandas/pandas.ipynb
The other candidate could be the Python array, in the array module. But that is not usable for variable-size values.
And in order to make this work, you would probably need to wrap it in some pure Python code, which would undo all the performance gains that the array offers.
So, even if the requirements of the OP are relaxed, there still does not seem to be a faster option than dictionaries.
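A hedged sketch of why: per-element access into a NumPy array has to box each value into a Python object, so it is typically no faster (often slower) than a dict lookup, while vectorized access is in a different league. Numbers printed are illustrative only:

```python
import timeit
import numpy as np

n = 100_000
arr = np.arange(n, dtype=np.int64)
d = {i: i for i in range(n)}

# Individual scalar access: dict lookup vs numpy element indexing.
t_dict = timeit.timeit(lambda: d[54321], number=1_000_000)
t_elem = timeit.timeit(lambda: arr[54321], number=1_000_000)
print("dict lookup:   %.3fs" % t_dict)
print("array element: %.3fs" % t_elem)

# Vectorized access is where numpy wins: one call touches all n values.
t_vec = timeit.timeit(lambda: arr.sum(), number=1_000)
print("vectorized sum of all %d values, 1000x: %.3fs" % (n, t_vec))
```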

You could consider storing them in a data structure like a trie, given that your key is a string. Even to store and retrieve from a trie you need O(N), where N is the maximum length of a key. The same happens with the hash calculation that computes a hash for the key: the hash is used to find and store the entry in the hash table, and we often don't consider the hashing time.
You may give a trie a shot. It should give almost equal performance, maybe a little faster if the hash value is computed with something like
HASH[i] = (HASH[i-1] + key[i-1]*256^i) % BUCKET_SIZE
or similar, where due to collisions we need to use 256^i.
You can try to store them in a trie and see how it performs.
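For illustration only, a minimal dict-of-dicts trie sketch (class and method names are mine). Note that in CPython a pure-Python trie like this will almost certainly be slower than a flat dict; the point is just to show the structure:

```python
class Trie:
    """Minimal string-keyed trie; each node is a dict of child characters."""

    _VALUE = object()  # sentinel key marking a stored value at a node

    def __init__(self):
        self.root = {}

    def set(self, key, value):
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        node[Trie._VALUE] = value

    def get(self, key, default=None):
        node = self.root
        for ch in key:
            node = node.get(ch)
            if node is None:
                return default
        return node.get(Trie._VALUE, default)

t = Trie()
t.set("mystr25", 25)
print(t.get("mystr25"))  # 25
print(t.get("missing"))  # None
```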


Most efficient way to find mode in an array using python? Return type is an array of integers

Here is my solution, which works in O(N) time and O(N) space:
def find_mode(array):
    myDict = {}
    result = []
    for i in range(len(array)):
        if array[i] in myDict:
            myDict[array[i]] += 1
        else:
            myDict[array[i]] = 1
    maximum = max(myDict.values())
    for key, value in myDict.items():
        if value == maximum:
            result.append(key)
    return result
I can't think of a more efficient solution than O(N) but if anyone has any improvements to this function please let me know. The return type is an array of integers.
First, you should note that O(n) worst-case time cannot be improved upon with a deterministic, non-randomized algorithm, since we may need to check all elements.
Second, since you want all modes, not just one, the best space complexity of any possible algorithm is O(|output|), not O(1).
Third, this is as hard as the Element distinctness problem. This implies that any algorithm that is 'expressible' in terms of decision trees only, can at best achieve Omega(n log n) runtime. To beat this, you need to be able to hash elements or use numbers to index the computer's memory or some other non-combinatorial operation. This isn't a rigorous proof that O(|output|) space complexity with O(n) time is impossible, but it means you'll need to specify a model of computation to get a more precise bound on runtime, or specify bounds on the range of integers in your array.
Lastly, and most importantly, you should profile your code if you are worried about performance. If this is truly the bottleneck in your program, then Python may not be the right language to achieve the absolute minimum number of operations needed to solve this problem.
Here's a more Pythonic approach, using the standard library's very useful collections.Counter(). The Counter initialization (in CPython) is usually done through a C function, which will be faster than your for loop. It is still O(n) time and space, though.
import collections
from typing import List

def find_mode(array: List[int]) -> List[int]:
    counts = collections.Counter(array)
    maximum = max(counts.values())
    return [key for key, value in counts.items()
            if value == maximum]

Why `s = [x for x in "hello"]; sum(s)` doesn't work in Python [duplicate]

Python has a built-in function sum, which is effectively equivalent to:
from functools import reduce  # reduce was a builtin in Python 2
import operator

def sum2(iterable, start=0):
    return start + reduce(operator.add, iterable)
for all types of parameters except strings. It works for numbers and lists, for example:
sum([1,2,3], 0) = sum2([1,2,3],0) = 6 #Note: 0 is the default value for start, but I include it for clarity
sum({888:1}, 0) = sum2({888:1},0) = 888
Why were strings specially left out?
sum( ['foo','bar'], '') # TypeError: sum() can't sum strings [use ''.join(seq) instead]
sum2(['foo','bar'], '') = 'foobar'
I seem to remember discussions in the Python list for the reason, so an explanation or a link to a thread explaining it would be fine.
Edit: I am aware that the standard way is to do "".join. My question is why the option of using sum for strings was banned, and no banning was there for, say, lists.
Edit 2: Although I believe this is not needed given all the good answers I got, the question is: Why does sum work on an iterable containing numbers or an iterable containing lists but not an iterable containing strings?
Python tries to discourage you from "summing" strings. You're supposed to join them:
"".join(list_of_strings)
It's a lot faster, and uses much less memory.
A quick benchmark:
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = reduce(operator.add, strings)'
100 loops, best of 3: 8.46 msec per loop
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = "".join(strings)'
1000 loops, best of 3: 296 usec per loop
Edit (to answer OP's edit): As to why strings were apparently "singled out", I believe it's simply a matter of optimizing for a common case, as well as of enforcing best practice: you can join strings much faster with ''.join, so explicitly forbidding strings on sum will point this out to newbies.
BTW, this restriction has been in place "forever", i.e., since the sum was added as a built-in function (rev. 32347)
You can in fact use sum(..) to concatenate strings, if you use the appropriate starting object! Of course, if you go this far you have already understood enough to use "".join(..) anyway..
>>> class ZeroObject(object):
...     def __add__(self, other):
...         return other
...
>>> sum(["hi", "there"], ZeroObject())
'hithere'
Here's the source: http://svn.python.org/view/python/trunk/Python/bltinmodule.c?revision=81029&view=markup
In the builtin_sum function we have this bit of code:
    /* reject string values for 'start' parameter */
    if (PyObject_TypeCheck(result, &PyBaseString_Type)) {
        PyErr_SetString(PyExc_TypeError,
                        "sum() can't sum strings [use ''.join(seq) instead]");
        Py_DECREF(iter);
        return NULL;
    }
    Py_INCREF(result);
}
So.. that's your answer.
It's explicitly checked in the code and rejected.
From the docs:
The preferred, fast way to concatenate a sequence of strings is by calling ''.join(sequence).
By making sum refuse to operate on strings, Python has encouraged you to use the correct method.
Short answer: Efficiency.
Long answer: The sum function has to create an object for each partial sum.
Assume that the amount of time required to create an object is directly proportional to the size of its data. Let N denote the number of elements in the sequence to sum.
doubles are always the same size, which makes sum's running time O(1)×N = O(N).
int (formerly known as long) is arbitrary-length. Let M denote the absolute value of the largest sequence element. Then sum's worst-case running time is lg(M) + lg(2M) + lg(3M) + ... + lg(NM) = N×lg(M) + lg(N!) = O(N log N).
For str (where M = the length of the longest string), the worst-case running time is M + 2M + 3M + ... + NM = M×(1 + 2 + ... + N) = O(N²).
Thus, summing strings would be much slower than summing numbers.
str.join does not allocate any intermediate objects. It preallocates a buffer large enough to hold the joined strings, and copies the string data. It runs in O(N) time, much faster than sum.
The Reason Why
@dan04 has an excellent explanation of the costs of using sum on large lists of strings.
The missing piece as to why str is not allowed for sum is that many, many people were trying to use sum for strings, and not many use sum for lists and tuples and other O(n**2) data structures. The trap is that sum works just fine for short lists of strings, but then gets put in production where the lists can be huge, and the performance slows to a crawl. This was such a common trap that the decision was made to ignore duck-typing in this instance, and not allow strings to be used with sum.
Edit: Moved the parts about immutability to history.
Basically, it's a question of preallocation. When you use a statement such as
sum(["a", "b", "c", ..., ])
and expect it to work similarly to a reduce statement, the code generated looks something like
v1 = "" + "a" # must allocate v1 and set its size to len("") + len("a")
v2 = v1 + "b" # must allocate v2 and set its size to len(v1) + len("b")
...
res = v10000 + "$" # must allocate res and set its size to len(v9999) + len("$")
In each of these steps a new string is created, which for one might give some copying overhead as the strings get longer and longer. But that's maybe not the point here. What's more important is that every new string on each line must be allocated to its specific size (which, I don't know, may mean an allocation in every iteration of the reduce statement; there might be some obvious heuristics to use, and Python might allocate a bit more here and there for reuse -- but at several points the new string will be large enough that this won't help anymore, and Python must allocate again, which is rather expensive).
A dedicated method like join, however, has the job of figuring out the real size of the string before it starts, and would therefore in theory only allocate once, at the beginning, and then just fill that new string, which is much cheaper than the other solution.
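A hedged sketch of the difference (CPython can sometimes resize a string in place in the repeated-+ pattern, so the measured gap varies by version and build; join is reliably the fast path):

```python
import timeit

strings = ["a"] * 10_000

def concat_with_add():
    # Repeated + may create a fresh string at every step.
    result = ""
    for s in strings:
        result = result + s
    return result

def concat_with_join():
    # join computes the total length first and allocates once.
    return "".join(strings)

assert concat_with_add() == concat_with_join()
print("add: ", timeit.timeit(concat_with_add, number=100))
print("join:", timeit.timeit(concat_with_join, number=100))
```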
I don't know why, but this works!
from functools import reduce  # reduce is a builtin in Python 2
import operator

def sum_of_strings(list_of_strings):
    return reduce(operator.add, list_of_strings)

Implementation of hashing and dictionaries

Let's take this code which does nothing really interesting. It creates a dictionary with a single key 'x' * N (N being an argument of the script), then accesses this item 10000000 times and prints the execution time.
import sys, time

X = 'x' * int(sys.argv[1])
LOOPS = 10000000

attrs = {X: 1}
t1 = time.time()
n = 0
for _ in range(LOOPS):
    n += attrs[X]
    attrs[X] += 1
t2 = time.time()
print(t2 - t1)
I launched it with different values of N in {1, 10, 100, 1000} and I did not observe any increase in the execution time (around 2.8 seconds each time on my machine). I was expecting some, as my first guess was that each access to attrs would cause a hash value to be computed for 'x' * N. So I'm curious about the magic behind this. Is there some caching mechanism applied? Or am I wrong about my assumption on the implementation of dictionaries?
Python's implementation of str.__hash__() caches the hash the first time it's computed for that string, so the hash only has to be computed once for any given string.
Some (but not all) other immutable objects do the same sort of caching. See https://bugs.python.org/issue1462796 (a rejected request to apply the same caching logic to tuples).
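A small sketch that makes this caching visible: the first hash() of a large string pays for a pass over all its characters, while a second call on the same object just returns the stored value. (Exact timings vary by machine.)

```python
import time

s = 'x' * 10_000_000  # one large string object

t0 = time.perf_counter()
h1 = hash(s)          # computes the hash over all 10M characters
t1 = time.perf_counter()
h2 = hash(s)          # returns the hash cached on the object
t2 = time.perf_counter()

assert h1 == h2
print("first hash:  %.6fs" % (t1 - t0))
print("second hash: %.6fs" % (t2 - t1))  # far faster than the first
```

Note that the cache lives on the object: a distinct but equal string built again from 'x' * 10_000_000 would pay the full hashing cost once more.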
The value/length of the X variable changes based on your input, i.e. {1, 10, 100, 1000}. This means the memory used to store the string increases with the input size.
The value of X is used in the dictionary only to hash/index its value. Since the hash is computed once and then reused for lookups into the hash table, the time to access the value is the same irrespective of the length of X. This is the reason the CPU time taken is constant.

What is optimal algorithm to check if a given integer is equal to sum of two elements of an int array?

def check_set(S, k):
    S2 = k - S
    set_from_S2 = set(S2.flatten())
    for x in S:
        if x in set_from_S2:
            return True
    return False
I have a given integer k. I want to check if k is equal to the sum of two elements of the array S.
S = np.array([1,2,3,4])
k = 8
It should return False in this case, because no two elements of S sum to 8. The above code counts 8 = 4 + 4 (the same element used twice), so it returned True.
I can't find an algorithm that solves this problem with O(n) complexity.
Can someone help me?
You have to account for multiple instances of the same item, so a set is not a good choice here.
Instead you can exploit a dictionary with value = number of occurrences of the key (or, as a variant, collections.Counter):
A = [3,1,2,3,4]
Cntr = {}
for x in A:
    if x in Cntr:
        Cntr[x] += 1
    else:
        Cntr[x] = 1

#k = 11
k = 8
ans = False
for x in A:
    if (k-x) in Cntr:
        if k == 2 * x:
            if Cntr[k-x] > 1:
                ans = True
                break
        else:
            ans = True
            break
print(ans)
Returns True for k=5,6 (I added one more 3) and False for k=8,11
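The same idea can be sketched with collections.Counter; the function name and wrapper here are mine:

```python
from collections import Counter

def has_pair_sum(values, k):
    """Return True if two distinct elements of values sum to k."""
    counts = Counter(values)
    for x in values:
        need = k - x
        if need not in counts:
            continue
        # Using the same value twice is only allowed if it occurs twice.
        if need != x or counts[x] > 1:
            return True
    return False

print(has_pair_sum([3, 1, 2, 3, 4], 8))  # False: only one 4, so 4 + 4 is out
print(has_pair_sum([3, 1, 2, 3, 4], 6))  # True: 3 + 3
print(has_pair_sum([3, 1, 2, 3, 4], 5))  # True: 1 + 4 (or 2 + 3)
```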
Adding onto MBo's answer.
"Optimal" can be an ambiguous term in terms of algorithmics, as there is often a compromise between how fast the algorithm runs and how memory-efficient it is. Sometimes we may also be interested in either worst-case resource consumption or in average resource consumption. We'll look at worst-case here because it's simpler and roughly equivalent to average in our scenario.
Let's call n the length of our array, and let's consider 3 examples.
Example 1
We start with a very naive algorithm for our problem, with two nested loops that iterate over the array, and check for every two items of different indices if they sum to the target number.
Time complexity: the worst-case scenario (where the answer is False, or where it's True but we only find it on the last pair of items we check) has n^2 loop iterations. If you're familiar with big-O notation, we'll say the algorithm's time complexity is O(n^2), which basically means that in terms of our input size n, the time it takes to solve the algorithm grows more or less like n^2 up to a multiplicative factor (well, technically the notation means "at most like n^2 up to a multiplicative factor", but it's a generalized abuse of language to use it as "more or less like" instead).
Space complexity (memory consumption): we only store an array, plus a fixed set of objects whose sizes do not depend on n (everything Python needs to run, the call stack, maybe two iterators and/or some temporary variables). The part of the memory consumption that grows with n is therefore just the size of the array, which is n times the amount of memory required to store an integer in an array (let's call that sizeof(int)).
Conclusion: Time is O(n^2), Memory is n*sizeof(int) (+O(1), that is, up to an additional constant factor, which doesn't matter to us, and which we'll ignore from now on).
Example 2
Let's consider the algorithm in MBo's answer.
Time complexity: much, much better than in Example 1. We start by creating a dictionary. This is done in a loop over n. Setting keys in a dictionary is a constant-time operation in proper conditions, so the time taken by each step of that first loop does not depend on n. Therefore, so far we've used O(n) in terms of time complexity. Now we only have one remaining loop over n. The time spent accessing elements of our dictionary is independent of n, so once again, the total complexity is O(n). Combining our two loops together: since they both grow like n up to a multiplicative factor, so does their sum (up to a different multiplicative factor). Total: O(n).
Memory: Basically the same as before, plus a dictionary of n elements. For the sake of simplicity, let's consider that these elements are integers (we could have used booleans), and forget about some of the aspects of dictionaries to only count the size used to store the keys and the values. There are n integer keys and n integer values to store, which uses 2*n*sizeof(int) in terms of memory. Add to that what we had before and we have a total of 3*n*sizeof(int).
Conclusion: Time is O(n), Memory is 3*n*sizeof(int). The algorithm is considerably faster when n grows, but uses three times more memory than example 1. In some weird scenarios where almost no memory is available (embedded systems maybe), this 3*n*sizeof(int) might simply be too much, and you might not be able to use this algorithm (admittedly, it's probably never going to be a real issue).
Example 3
Can we find a trade-off between Example 1 and Example 2?
One way to do that is to replicate the same kind of nested loop structure as in Example 1, but with some pre-processing to replace the inner loop with something faster. To do that, we sort the initial array, in place. Done with well-chosen algorithms, this has a time-complexity of O(n*log(n)) and negligible memory usage.
Once we have sorted our array, we write our outer loop (which is a regular loop over the whole array), and then inside that outer loop, use dichotomy (binary search) to look for the number we're missing to reach our target k. This dichotomy approach has a memory consumption of O(log(n)) (for the call stack, if implemented recursively), and its time complexity is O(log(n)) as well.
Time complexity: The pre-processing sort is O(n*log(n)). Then in the main part of the algorithm, we have n calls to our O(log(n)) dichotomy search, which totals to O(n*log(n)). So, overall, O(n*log(n)).
Memory: Ignoring the constant parts, we have the memory for our array (n*sizeof(int)) plus the memory for our call stack in the dichotomy search (O(log(n))). Total: n*sizeof(int) + O(log(n)).
Conclusion: Time is O(n*log(n)), Memory is n*sizeof(int) + O(log(n)). Memory is almost as small as in Example 1. Time complexity is slightly more than in Example 2. In scenarios where the Example 2 cannot be used because we lack memory, the next best thing in terms of speed would realistically be Example 3, which is almost as fast as Example 2 and probably has enough room to run if the very slow Example 1 does.
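A sketch of Example 3 under those assumptions (sort in place, then use the bisect module for the dichotomy; the function name is mine):

```python
import bisect

def has_pair_sum_sorted(arr, k):
    """O(n log n): sort in place, then binary-search k - x for each x."""
    arr.sort()
    for i, x in enumerate(arr):
        target = k - x
        j = bisect.bisect_left(arr, target)
        # Accept a match only if it sits at a different index than i.
        while j < len(arr) and arr[j] == target:
            if j != i:
                return True
            j += 1
    return False

print(has_pair_sum_sorted([1, 2, 3, 4], 8))  # False
print(has_pair_sum_sorted([1, 2, 3, 4], 5))  # True (1 + 4, 2 + 3)
```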
Overall conclusion
This answer was just to show that "optimal" is context-dependent in algorithmics. It's very unlikely that in this particular example, one would choose to implement Example 3. In general, you'd see either Example 1 if n is so small that one would choose whatever is simplest to design and fastest to code, or Example 2 if n is a bit larger and we want speed. But if you look at the wikipedia page I linked for sorting algorithms, you'll see that none of them is best at everything. They all have scenarios where they could be replaced with something better.

Optimized searching in Python against a list

Problem:
Given a list of n objects (n's order of magnitude is 10^5), search for a given item very fast, with a minimum of spacetime tradeoff. The current, unoptimized, prototype-y solution takes too long and consumes too much RAM (the optimization is not premature, that is).
There is not a primary key to sort against in the object, but it can be sorted to a certain degree, such as the following example, where the first column is sorted.
o1 => f, g, h
o2 => f, g, i
o3 => f, j, k
o4 => k, j, m
To date, the solution has been nested filters:
filter(test1, filter(test2, filter(test3, the_list)))
But that has been slow, since it involves n * (n - 1) * (n - 2) operations, which approximates to O(n^3) speed, and at least n*2 extra lists of references.
As a note, it would be vastly preferably to have an in-place search.
I haven't found a standard library for handling this. What is the typical solution to this problem?
filter(test1, filter(test2, filter(test3, the_list)))
Firstly, this is O(n) time, not O(n^3) time. The times add, not multiply. The only way this could be worse than that is if test3/test2/test1 are doing something odd, in which case we should look at those.
If we suppose that each test function takes 10 ms, then we have 10 * 3 * 10^5 ms = 50 minutes. If it were n^3, then we'd have (10*10^5)^3 ms, about 31 million years. I'm pretty sure you are only at linear time; you just have a ton of data.
Replace filter with itertools.ifilter (Python 2; in Python 3 the built-in filter is already lazy): it avoids generating the intermediate lists. Instead, Python will pull one item out of the list at a time, pass it through the three tests, and give it to you if and only if it passes. It avoids the memory requirement and will probably be faster as well.
You aren't going to be able to improve on O(n) time unless you use some indexing techniques. However, the applicability of indexing techniques depends on what you are doing inside the test1/test2/test3 functions. If you want help on that, show an example for those functions.
As others have noted, databases were designed to solve these problems. You can make this faster only by badly reimplementing what databases already do for you.
Concatenate the attribute values for each object to make unique keys. You may have to pad the attributes out to the same length to guarantee uniqueness. Construct a hash table to return the object that matches a key.
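A sketch of that idea (objects and attribute names are invented for illustration): build the composite-key dict once, then each search is a single hash probe instead of a scan.

```python
# Invented sample objects mirroring the o1..o4 rows in the question.
objects = [
    {"a": "f", "b": "g", "c": "h", "name": "o1"},
    {"a": "f", "b": "g", "c": "i", "name": "o2"},
    {"a": "f", "b": "j", "c": "k", "name": "o3"},
    {"a": "k", "b": "j", "c": "m", "name": "o4"},
]

def make_key(obj, width=4):
    # Pad each attribute to a fixed width so concatenation stays unambiguous.
    return "".join(obj[attr].ljust(width) for attr in ("a", "b", "c"))

index = {make_key(o): o for o in objects}

# O(1) lookup instead of filtering the whole list:
probe = {"a": "f", "b": "g", "c": "i"}
match = index.get(make_key(probe))
print(match["name"])  # o2
```

In Python, a tuple key such as (obj["a"], obj["b"], obj["c"]) sidesteps the padding concern entirely, since a tuple of strings hashes without any ambiguity about attribute boundaries.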
10^5 is not really that big a number of objects, even in-memory. littletable is a little module I wrote as an experiment for simulating queries, pivots, etc. using just Python dicts. One nice thing about littletable queries is that the result of any query or join is itself a new littletable Table. Indexes are kept as dicts of keys->table objects, and index keys can be defined to be unique or not.
I created a table of 140K objects with 3 single-letter keys, and then queried for a specific key. Building the table itself took the longest; the indexing and querying were pretty fast.
from itertools import product
from littletable import Table, DataObject
import time

objects = Table()
alphas = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
alphas += alphas.lower()

print "building table", time.time()
objects.insert_many(
    DataObject(k1=k1, k2=k2, k3=k3, created=time.time())
    for k1, k2, k3 in product(alphas.upper(), alphas, alphas)
)
print "table complete", time.time()
print len(objects)

print "indexing table", time.time()
for k in "k1 k2 k3".split():
    objects.create_index(k)
print "index complete", time.time()

print "get specific row", time.time()
matches = objects.query(k1="X", k2="k", k3="W")
for o in matches:
    print o
print time.time()
Prints:
building table 1309377011.63
table complete 1309377012.52
140608
indexing table 1309377012.52
index complete 1309377012.98
get specific row 1309377012.98
{'k3': 'W', 'k2': 'k', 'k1': 'X', 'created': 1309377011.9960001}
{'k3': 'W', 'k2': 'k', 'k1': 'X', 'created': 1309377012.4260001}
1309377013.0
It seems to me one typical solution would be to use a database query. Either SQL (raw or with some kind of ORM), or some kind of object database, maybe MongoDB?
If your data is in a CSV file, you could try sql2csv: https://sourceforge.net/projects/sql2csv/.
EDIT: Pardon my early-onset senility, I meant this project: https://github.com/ccoffey/sql4csv/wiki/Examples.
