I have a large Python dictionary (65,535 key:value pairs), where the keys come from range(0, 65536) and the values are integers.
The solution I found to sorting this data structure is posted here:
Sort a Python dictionary by value
That solution works, but is not necessarily very fast.
To further complicate the problem, there is a potential for me to have many (thousands) of these dictionaries that I must combine prior to sorting. I am currently combining these dictionaries by iterating over the pairs in one dictionary, doing a key lookup in the other dictionary, and adding/updating the entry as appropriate.
This makes my question twofold:
1) Is a dictionary the right data structure for this problem? Would a custom tree or something else make more sense?
2) If a dictionary is the right choice, what is the best way to combine multiple dictionaries and then sort the result?
One solution may be for me to redesign my program's flow in order to decrease the number of dictionaries being maintained to one, though this is more of a last resort.
Thanks
A dictionary populated with 65,535 entries whose keys come from range(0, 65536) sounds suspiciously like an array. If you need sorted arrays, why are you using dictionaries?
Normally, in Python, you would use a list for this type of data. In your case, since the values are integers, you might also want to consider using the array module. You should also have a look at the heapq module since if your data can be represented in this way, there is a builtin merge function that could be used.
In any case, if you need to merge data structures and produce a sorted data structure as a result, it is best to use a merge algorithm, and one possibility is mergesort.
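For illustration, here is a minimal sketch of the array idea, assuming every key really does fall in range(0, 65536) and that missing keys count as 0; combine_and_sort and its argument are made-up names, not anything from the question's code.
from array import array

def combine_and_sort(dicts):
    """Accumulate many {key: count} dicts into one array, then sort by value."""
    totals = array('q', [0] * 65536)   # one signed 64-bit slot per possible key
    for d in dicts:
        for key, value in d.items():
            totals[key] += value
    # Produce (key, total) pairs ordered by total -- the "sort by value" step.
    return sorted(enumerate(totals), key=lambda kv: kv[1])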
There's not enough information here to say which data structure you should use, because we don't know what else you're doing with it.
If you need to be able to quickly insert records into the data structure later one at a time, then you do need a tree-like data structure, which unfortunately doesn't have a standard implementation (or even a standard interface, for some operations) in Python.
If you only need to be able to do what you said--to sort existing data--then you can use lists. The sorting is quick, especially if parts of the data are already sorted, and you can use binary searching for fast lookups. However, inserting elements will be O(n) rather than the O(log n) you'll get with a tree.
Here's a simple example, converting the dicts to a list of tuples, sorting the combined result, and using the bisect module to search for items.
Note that you can have duplicate keys, showing up in more than one dict. This is handled easily: they'll be sorted together naturally, and the bisection will give you a [start, end) range containing all of those keys.
If you want to add blocks of data later, append them to the end and re-sort the list; Python's sort (Timsort) takes advantage of the already-sorted portion, so this will usually run much closer to linear time than a full O(n log n) sort.
This code assumes your keys are integers, as you said.
import bisect

dataA = {1: 'data1', 3: 'data3', 5: 'data5', 2: 'data2'}
dataB = {2: 'more data2', 4: 'data4', 6: 'data6'}

# Combine both dicts into a single sorted list of (key, value) tuples.
combined_list = list(dataA.items()) + list(dataB.items())
combined_list.sort()
print(combined_list)

def get_range(data, value):
    # Return the [start, end) slice of entries whose key equals value.
    lower_bound = bisect.bisect_left(data, (value,))
    upper_bound = bisect.bisect_left(data, (value + 1,))
    return lower_bound, upper_bound

lower_bound, upper_bound = get_range(combined_list, 2)
print(lower_bound, upper_bound)
print(combined_list[lower_bound:upper_bound])
With that quantity of data, I would bite the bullet and use the built-in sqlite3 module. Yes, you give up some Python flexibility and have to use SQL, but right now it's sorting 65k values; next it will be finding values that meet certain criteria. So instead of reinventing relational databases, just go the SQL route now.
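As a rough illustration of that route, here is a minimal sqlite3 sketch, assuming integer keys and values; the table and column names are invented for the example.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE counts (key INTEGER, value INTEGER)')

def add_dict(d):
    # Each dict is simply appended; the GROUP BY below does the combining.
    conn.executemany('INSERT INTO counts VALUES (?, ?)', d.items())

add_dict({1: 10, 2: 5})
add_dict({2: 7, 3: 1})

# Combining and sorting become a single query; later criteria become WHERE clauses.
rows = conn.execute(
    'SELECT key, SUM(value) AS total FROM counts GROUP BY key ORDER BY total').fetchall()
print(rows)   # [(3, 1), (1, 10), (2, 12)]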
According to my research there are two easy ways to remove duplicates from a list:
a = list(dict.fromkeys(a))
and
a = list(set(a))
Is one of them more efficient than the other?
The second one is generally more efficient: sets exist precisely for holding unique elements, and you skip the overhead of building a dict, which carries key-value machinery you don't need here.
Performance-wise, though, it depends on what the payload actually is.
import timeit
import random
input_data = [random.choice(range(100)) for i in range(1000)]
from_keys = timeit.timeit('list(dict.fromkeys(input_data))', number=10000, globals={'input_data': input_data})
from_set = timeit.timeit('list(set(input_data))', number=10000, globals={'input_data': input_data})
print(f"From keys performance: {from_keys:.3f}")
print(f"From set performance: {from_set:.3f}")
Prints:
From keys performance: 0.230
From set performance: 0.140
That said, the absolute difference here is tiny, so don't read too much into the roughly 1.6x ratio. Try it for yourself with different random data.
The second option is better not only because it's faster, but because it expresses the programmer's intention more clearly. set() is designed specifically to model mathematical sets, in which elements cannot be duplicated, so it fits the purpose and the intent is obvious to the reader. dict(), on the other hand, is for storing key-value pairs and says nothing about the intention.
Suppose we have a list a = [1, 16, 2, 3, 4, 5, 6, 8, 10, 3, 9, 15, 7].
If we use a = list(set(a)), the set drops the duplicates but does not preserve the original order; for small integers the result typically comes out as [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 16], though set ordering is an implementation detail. If we use a = list(dict.fromkeys(a)) instead, the duplicates are dropped and the elements keep their original order: [1, 16, 2, 3, 4, 5, 6, 8, 10, 9, 15, 7].
To sum up: if you want to drop duplicates from a list and don't care about order, set() is what you're looking for; if keeping the original order is required, use dict.fromkeys().
CAUTION: since Python 3.7 the keys of a dict are ordered.
So the first form that uses
list(dict.fromkeys(a)) # preserves order!!
preserves the order, while using a set will potentially (and probably) change the order of the elements of the list 'a'.
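A tiny demonstration of the difference, using the list from the earlier answer (on CPython 3.7+, where dict preserves insertion order; the exact set order is an implementation detail):
a = [1, 16, 2, 3, 4, 5, 6, 8, 10, 3, 9, 15, 7]
print(list(dict.fromkeys(a)))  # [1, 16, 2, 3, 4, 5, 6, 8, 10, 9, 15, 7] -- input order kept
print(list(set(a)))            # arbitrary order; for small ints it often happens to look sorted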
I have a dict with 50,000,000 keys (strings) mapped to a count of that key (which is a subset of one with billions).
I also have a series of objects, each with a set attribute containing a few thousand strings that may or may not be among the dict keys.
I need the fastest way to find the intersection of each of these sets.
Right now, I do it like this code snippet below:
for block in self.blocks:
    # a block is a python object containing the set in the thousands range
    # block.get_kmers() returns the set
    count = sum([kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts)])
    # kmerCounts is the dict mapping millions of strings to ints
From my tests so far, this takes about 15 seconds per iteration. Since I have around 20,000 of these blocks, I am looking at half a week just to do this. And that is for the 50,000,000 items, not the billions I need to handle...
(And yes I should probably do this in another language, but I also need it done fast and I am not very good at non-python languages).
There's no need to do a full intersection; you just want the matching elements from the big dictionary if they exist. If an element doesn't exist you can substitute 0, which has no effect on the sum. There's also no need to convert the input of sum to a list.
count = sum(kmerCounts.get(x, 0) for x in block.get_kmers())
Remove the square brackets around your list comprehension to turn it into a generator expression:
sum(kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts))
That will save you some time and some memory, which may in turn reduce swapping, if you're experiencing that.
There is a lower bound to how much you can optimize here. Switching to another language may ultimately be your only option.
I'm working on a script for a piece of software, and it doesn't really give me direct access to the data I need. Instead, I need to ask for each piece of information I need, and build a list of the data I'm getting. For various reasons, I need the list to be sorted. It's very easy to just build the list once, and then sort it, followed by doing stuff with it. However, I assume it would be faster to run through everything once, rather than build the list and then sort it.
So, at the moment I've basically got this:
my_list = []
for item in "query for stuff":
    my_list.append("query for %s data" % item)
my_list.sort()
do_stuff(my_list)
The "query for stuff" bit is the query interface with the software, which will give me an iterable. my_list needs to contain a list of data from the contents of said iterable. By doing it like this, I'm querying for the first list, then looping over it to extract the data and put it into my_list. Then I'm sorting it. Lastly, I'm doing stuff to it with the do_stuff() method, which will loop over it and do stuff to each item.
The problem is that I can't do_stuff() to it before it's sorted, as the list order is important for various reasons. I don't think I can get away from having to loop over lists twice — once to build the list and once to do stuff to each item in it, as we won't know in advance if a recently added item at position N will stay at position N after we've added the next item — but it seems cleaner to insert each item in a sorted fashion, rather than just appending them at the end. Kind of like this:
for item in "query for stuff":
my_list.append_sorted(item)
Is it worth bothering trying to do it like this, or should I just stick to building the list, and then sorting it?
Thanks!
The short answer is: it's not worth it.
Have a look at insertion sort. The worst-case running time is O(n^2) (average case is also quadratic). On the other hand, Python's sort (also known as Timsort) will take O(n log n) in the worst case.
Yes, it does "seem" cleaner to keep the list sorted as you're inserting, but that's a fallacy.
There is no real benefit to it. The only time you'd consider using insertion sort is when you need to show the sorted list after every insertion.
The two approaches are asymptotically equivalent.
Sorting is O(n lg n) (Python uses Timsort by default, except for very small arrays), and inserting in a sorted list is O(lg n) (using binary search), which you would have to do n times.
In practice, one method or the other may be slightly faster, depending on how much of your data is already sorted.
EDIT: I assumed that inserting in the middle of a sorted list after you've found the insertion point would be constant time (i.e. the list behaved like a linked list, which is the data structure you would use for such an algorithm). This probably isn't the case with Python lists, as pointed out by Sven. This would make the "keep the list sorted" approach O(n^2), i.e. insertion sort.
I say "probably" because some list implementations switch from array to linked list as the list grows, the most notable example being CFArray/NSArray in CoreFoundation/Cocoa. This may or may not be the case with Python.
Take a look at the bisect module. It gives you various tools for maintaining a list order. In your case, you probably want to use bisect.insort.
import bisect

for item in query_for_stuff():
    bisect.insort(my_list, "query for %s data" % item)
I'm new to Python and still learning. I was wondering if there was a standard 'best practice' for storing more than one key value in a tuple. Here's an example:
I have a label called 'red' with a value of 3, and I need to divide another number (say 10) by it. I need to store 3 values: 'red' (the name), 3 (the number of times it divides 10), and 1 (the remainder). There are other similar values that will need to be included as well, so this applies to red, but the same goes for blue, green, etc. (the numbers differ for each label).
I read around and the approach I found was to use nested lists, but I am doing this type of storage for a billion records (and I'll need to search through it, so I thought nesting anything might slow me down).
I tried to create something like {'red': 3: 1, ...} but that's not valid syntax, and I'm considering putting a delimiter in the value and then splitting on it (such as {'red': '3a1', ...}, then parsing on the letter 'a'), but I'm not sure that's efficient.
I'm wondering if there's any better ways to store this or is nested tuples my only solution? I'm using Python 2.
The syntax for tuples is: (a,b,c).
If you want a dictionary with multiple values you can have a list as the value: {'red':[3,1]}.
You may want to also consider named tuples, or even classes. This will allow you to name the fields instead of accessing them by index, which will make the code more clear and structured.
I read around and I think way I found was to use nested lists, but I am doing this type of storage for a billion records(and I'll need to search through it so I thought maybe nested anything might slow me down).
If you have a billion records you probably should be persisting the data (for example in a database). You will likely run out of memory if you try to keep all the data in memory at once.
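As a hedged sketch of the "named tuples" suggestion above, something like the following could work; the type name, field names, and sample numbers are invented for illustration:
from collections import namedtuple

# A named field per piece of information, instead of remembering index positions.
DivResult = namedtuple('DivResult', ['name', 'quotient', 'remainder'])

results = {}
for name, value in {'red': 3, 'blue': 2}.items():
    results[name] = DivResult(name, 10 // value, 10 % value)

print(results['red'])  # DivResult(name='red', quotient=3, remainder=1)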
Use tuple. For example:
`('red', 3, 1)`
Perhaps you mean dictionaries instead of tuples?
{'red': [3,1], 'blue': [2,2]}
If you are trying to store key/value pairs the best way would be to store them in a dictionary. And if you need more than one value to each key, just put those values in a list.
I don't think you would want to store such things in a tuple, because tuples aren't mutable. So if you decide to swap the order of the quotient and remainder, storing (1, 3) instead of (3, 1), you would need to create new tuples, whereas with lists you could simply rearrange the elements in place.
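A small illustration of that point, purely to show the mutability difference:
t = (3, 1)
# t[0], t[1] = t[1], t[0]        # TypeError: tuples can't be modified in place
t = (t[1], t[0])                  # you have to build a new tuple instead

lst = [3, 1]
lst[0], lst[1] = lst[1], lst[0]   # lists can be rearranged in place
print(t, lst)                     # (1, 3) [1, 3]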
Is there a C data structure equatable to the following python structure?
data = {'X': 1, 'Y': 2}
Basically I want a structure where I can give it a pre-defined string and get back an integer.
The data-structure you are looking for is called a "hash table" (or "hash map"). You can find the source code for one here.
A hash table is a mutable mapping from a key (here a string, hashed to an integer) to another value, just like the dict your sample Python code instantiates.
It's called a "hash table" because it performs a hash function on the string to return an integer result, and then directly uses that integer to point to the address of your desired data.
This system makes it extremely quick to access and change your information, even if you have tons of it. It also means the data is effectively unordered, because a good hash function spreads keys uniformly and places your data unpredictably all over the table (in a perfect world).
Also note that if you're doing a quick one-off lookup table with only two or three static keys, look at gperf, which generates a perfect hash function and emits simple code for it.
The above data structure is a dict type.
In C/C++ parlance, a hashmap is the equivalent; Google for a hashmap implementation.
There's nothing built into the language or standard library itself but, depending on your requirements, there are a number of ways to do it.
If the data set will remain relatively small, the easiest solution is probably to just have an array of structures along the lines of:
typedef struct {
    char *key;
    int val;
} tElement;
then use a sequential search to look them up. Have functions which insert keys, delete keys and look up keys so that, if you need to change it in future, the API itself won't change. Pseudo-code:
def init:
    create g.key[100] as string
    create g.val[100] as integer
    set g.size to 0

def add (key, val):
    if find(key) != not_found:
        return already_exists
    if g.size == 100:
        return no_space
    g.key[g.size] = key
    g.val[g.size] = val
    g.size = g.size + 1
    return okay

def del (key):
    pos = find(key)
    if pos == not_found:
        return no_such_key
    if pos < g.size - 1:
        g.key[pos] = g.key[g.size-1]
        g.val[pos] = g.val[g.size-1]
    g.size = g.size - 1

def find (key):
    for pos goes from 0 to g.size-1:
        if g.key[pos] == key:
            return pos
    return not_found
Insertion means ensuring it doesn't already exist then just tacking an element on to the end (you'll maintain a separate size variable for the structure). Deletion means finding the element then simply overwriting it with the last used element and decrementing the size variable.
Now this isn't the most efficient method in the world, but keep in mind that it usually only makes a difference as your dataset gets much larger. The difference between a binary tree or hash and a sequential search is irrelevant for, say, 20 entries. I've even used bubble sort for small data sets where a more efficient sort wasn't available, because it's massively quick to code up and the performance is irrelevant.
Stepping up from there, you can remove the fixed upper size by using a linked list. The search is still relatively inefficient since you're doing it sequentially but the same caveats apply as for the array solution above. The cost of removing the upper bound is a slight penalty for insertion and deletion.
If you want a little more performance and a non-fixed upper limit, you can use a binary tree to store the elements. This gets rid of the sequential search when looking for keys and is suited to somewhat larger data sets.
If you don't know how big your data set will be getting, I would consider this the absolute minimum.
A hash is probably the next step up from there. This performs a function on the string to get a bucket number (usually treated as an array index of some sort). This is O(1) lookup but the aim is to have a hash function that only allocates one item per bucket, so that no further processing is required to get the value.
A degenerate case of "all items in the same bucket" is no different to an array or linked list.
For maximum performance, and assuming the keys are fixed and known in advance, you can actually create your own hashing function based on the keys themselves.
Knowing the keys up front, you have extra information that allows you to fully optimise a hashing function to generate the actual value so you don't even involve buckets - the value generated by the hashing function can be the desired value itself rather than a bucket to get the value from.
I had to put one of these together recently for converting textual months ("January", etc) in to month numbers. You can see the process here.
I mention this possibility because of your "pre-defined string" comment. If your keys are limited to "X" and "Y" (as in your example) and you're using a character set with contiguous {W,X,Y} characters (which even covers EBCDIC as well as ASCII though not necessarily every esoteric character set allowed by ISO), the simplest hashing function would be:
char *s = "X";
int val = *s - 'W';
Note that this doesn't work well if you feed it bad data. These are ideal for when the data is known to be restricted to certain values. The cost of checking data can often swamp the saving given by a pre-optimised hash function like this.
C doesn't have any collection classes. C++ has std::map.
You might try searching for C implementations of maps, e.g. http://elliottback.com/wp/hashmap-implementation-in-c/
A 'trie' or a 'hashmap' should do. The simplest implementation is an array of struct { char *s; int i; } pairs.
Check out 'trie' in 'include/nscript.h' and 'src/trie.c' here: http://github.com/nikki93/nscript . Change the 'trie_info' type to 'int'.
Try a Trie for strings, or a Tree of some sort for integer/pointer types (or anything that can be compared as "less than" or "greater than" another key). Wikipedia has reasonably good articles on both, and they can be implemented in C.