Python: Compare Dictionaries with similar and exact keys

I have a scenario where I need to compare two dictionaries based on a set of keys, i.e.:
TmpDict = {}
TmpDict2 = {}
for line in reader:
    line = line.strip()
    TmpArr = line.split('|')
    TmpDict[TmpArr[2], TmpArr[3], TmpArr[11], TmpArr[12], TmpArr[13], TmpArr[14]] = line
for line in reader2:
    line = line.strip()
    TmpArr = line.split('|')
    TmpDict2[TmpArr[2], TmpArr[3], TmpArr[11], TmpArr[12], TmpArr[13], TmpArr[14]] = line
This works fine for comparing two dictionaries with exactly the same keys, but there is a tolerance I need to consider: TmpArr[12] and TmpArr[14] are a time and a duration where the tolerance must be checked. Please see the example below.
Example:
dict1={(111,12,23,12:22:30,12:23:34,64): 4|1994773966623|773966623|754146741|\N|359074037474030|413025600032728|}
dict2={(111,12,23,12:22:34,12:23:34,60) :4|1994773966623|773966623|754146741|\N|359074037474030|413025600032728|}
Let's assume I have two dictionaries of length 1 each and the tolerance is 4 seconds, so the above keys must be considered matched lines even though the time and duration differ by 4 seconds.
I know that dictionary key lookup is O(1) irrespective of the length; how could I achieve this scenario with the same performance?
Thanks

You have at least these 4 options:
1. Store all keys within the tolerance (consumes memory).
2. Look up all keys within the tolerance on each query. Notice that if the tolerance is defined and constant, this is a constant number C of O(1) lookups per query, so each query is still O(1).
3. Combine the previous ones: compress the keys using some scheme, say round the tolerant fields down to a multiple of 4 and up to a multiple of 4, store the value under these keys in the dictionary, and then verify against the exact value whether it really is within the tolerance.
4. Do not use a dictionary at all but some kind of tree structure instead; notice that you can still store the exact part of the key in a dictionary.
You do not provide enough information to decide which of these is best. However, I personally would go for 3.
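For illustration, here is a minimal sketch of the bucketing idea behind option 3 (storing each entry under a rounded-down bucket and probing the neighbouring buckets at lookup time), assuming the time and duration fields (positions 3 and 5 of the key tuple, i.e. TmpArr[12] and TmpArr[14]) have already been converted to integer seconds; the helper names are made up for the example:

TOLERANCE = 4  # seconds

def bucket(exact_key):
    # The time (index 3) and duration (index 5) fields are mapped to
    # TOLERANCE-sized buckets; the other fields are kept exact.
    a, b, c, t, e, d = exact_key
    return (a, b, c, t // TOLERANCE, e, d // TOLERANCE)

index = {}

def insert(exact_key, line):
    index.setdefault(bucket(exact_key), []).append((exact_key, line))

def lookup(exact_key):
    # Return stored lines whose time and duration are within TOLERANCE.
    a, b, c, t, e, d = exact_key
    tb, db = t // TOLERANCE, d // TOLERANCE
    matches = []
    for dt in (-1, 0, 1):          # probe the neighbouring buckets too
        for dd in (-1, 0, 1):
            for stored_key, stored_line in index.get((a, b, c, tb + dt, e, db + dd), []):
                if (abs(stored_key[3] - t) <= TOLERANCE and
                        abs(stored_key[5] - d) <= TOLERANCE):
                    matches.append(stored_line)
    return matches

Each query probes at most 9 buckets, each an O(1) dictionary lookup, so lookups stay O(1) per key.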

If you can use more memory to keep the performance of the code above, you can insert more than one entry for each element. For example, "111,12,23,12:22:30,12:23:34,60", "111,12,23,12:22:30,12:23:34,61", ..., "111,12,23,12:22:30,12:23:34,68" are all inserted for the single key "111,12,23,12:22:30,12:23:34,64".
If you do not want to waste memory but still want to keep O(1) performance, you can check 8 extra keys (4 before and 4 after) for each key. That is 8 times more comparisons than the code above, but it is still O(1).
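As a hedged sketch of that second approach (probing neighbouring keys), assuming for illustration that only the duration field (the last element of the key) carries the tolerance and has been converted to an integer:

TOLERANCE = 4

def find_with_tolerance(key, d):
    # Probe the exact key plus the 2 * TOLERANCE neighbouring duration
    # values; each probe is an O(1) dictionary lookup.
    prefix, duration = key[:-1], key[-1]
    for offset in range(-TOLERANCE, TOLERANCE + 1):
        hit = d.get(prefix + (duration + offset,))
        if hit is not None:
            return hit
    return None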

Related

Time Complexity of finding the first item in a dictionary in Python 3.7+

Since Python 3.7, dictionaries preserve order based on insertion.
It seems like you can get the first item in a dictionary using next(iter(my_dict))?
My question is about the big-O time complexity of that operation.
Can I regard next(iter(my_dict)) as a constant time (O(1)) operation? Or what's the best way to retrieve the first item in the dictionary in constant time?
The reason I'm asking is that I'm hoping to use this for coding interviews, where there's a significant emphasis on the time complexity of your solution, rather than how fast it runs in milliseconds.
It's probably the best way (note that this actually gets you the first key; next(iter(d.values())) gets the first value).
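For instance, a quick illustration of the difference (the example dictionary is made up):

d = {'a': 1, 'b': 2, 'c': 3}

first_key = next(iter(d))             # 'a'
first_value = next(iter(d.values()))  # 1
first_item = next(iter(d.items()))    # ('a', 1)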
This operation (any iteration through keys, values or items for combined tables at least) iterates through an array holding the dictionary entries:
PyDictKeyEntry *entry_ptr = &DK_ENTRIES(k)[i];
while (i < n && entry_ptr->me_value == NULL) {
    entry_ptr++;
    i++;
}
entry_ptr->me_value holds the value for each respective key.
If your dictionary is freshly created, this finds the first inserted item during the first iteration (the dictionary entries array is append-only, hence preserving order).
If your dictionary has been altered (you've deleted many of the items), this might, in the worst case, take O(N) time to find the first remaining inserted item (where N is the total number of items originally inserted). This is because dictionaries are not resized when items are removed, so entry_ptr->me_value is NULL for many entries.
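A rough illustration of that worst case (sizes chosen arbitrarily):

d = {i: i for i in range(1_000_000)}
for i in range(999_999):
    del d[i]              # the dict does not shrink on deletion

# Only key 999999 is left, but next(iter(d)) has to skip the ~1,000,000
# emptied slots at the front of the entries array before finding it.
print(next(iter(d)))      # 999999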
Note that this is CPython specific. I'm not aware of how other implementations of Python implement this.

Inefficient code for removing duplicates from a list in Python - interpretation?

I am writing a Python program to remove duplicates from a list. My code is the following:
some_values_list = [2,2,4,7,7,8]
unique_values_list = []
for i in some_values_list:
    if i not in unique_values_list:
        unique_values_list.append(i)
print(unique_values_list)
This code works fine. However, an alternative solution is given and I am trying to interpret it (as I am still a beginner in Python). Specifically, I do not understand the added value or benefit of creating an empty set - how does that make the code clearer or more efficient? Isn't it enough to create an empty list as I have done in the first example?
The code for the alternative solution is the following:
a = [10,20,30,20,10,50,60,40,80,50,40]
dup_items = set()
uniq_items = []
for x in a:
    if x not in dup_items:
        uniq_items.append(x)
        dup_items.add(x)
print(dup_items)
This code also throws an error: TypeError: set() missing 1 required positional argument: 'items'. (This is from a website of Python exercises with an answer key, so it is supposed to be correct.)
Determining if an item is present in a set is generally faster than determining if it is present in a list of the same size. Why? Because for a set (at least, for a hash table, which is how CPython sets are implemented) we don't need to traverse the entire collection of elements to check if a particular value is present (whereas we do for a list). Rather, we usually just need to check at most one element. A more precise way to frame this is to say that containment tests for lists take "linear time" (i.e. time proportional to the size of the list), whereas containment tests in sets take "constant time" (i.e. the runtime does not depend on the size of the set).
Lookup of an element in a list takes O(N) time (you can find an element in logarithmic time, but only if the list is sorted, which is not your case). So if you use the same list to keep unique elements and to look up newly added ones, your whole algorithm runs in O(N²) time (N elements, O(N) average lookup). A set is a hash set in Python, so lookup in it takes O(1) on average. Thus, if you use an auxiliary set to keep track of the unique elements already found, your whole algorithm takes only O(N) time on average, which is one order better.
In most cases sets are faster than lists. One of those cases is when you look for an item using the "in" keyword. The reason sets are faster is that they are implemented as hash tables.
So, in short, if x not in dup_items in the second code snippet runs faster than if i not in unique_values_list.
If you want to check the time complexity of different Python data structures and operations, you can check this link.
I also think your code is inefficient in that, for each item in the list, you search through a larger list, whereas the second snippet looks the item up in a smaller set. But that is not the case all the time: for example, if the list contains only unique items, the two are the same size.
Hope that clarifies it.
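To make the speed difference concrete, here is a rough timing sketch (not part of the answers above; absolute numbers will vary by machine):

import timeit

data = list(range(10_000))
as_list = list(data)
as_set = set(data)

# Membership test for an element near the "end" of the collection:
# the list scan is linear, the set lookup is (amortized) constant time.
print(timeit.timeit('9999 in as_list', globals=globals(), number=1000))
print(timeit.timeit('9999 in as_set', globals=globals(), number=1000))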

Finding intersections of huge sets with huge dicts

I have a dict with 50,000,000 keys (strings) mapped to a count of that key (which is a subset of one with billions).
I also have a series of objects with a class set member containing a few thousand strings that may or may not be in the dict keys.
I need the fastest way to find the intersection of each of these sets.
Right now, I do it like this code snippet below:
for block in self.blocks:
    # a block is a python object containing the set in the thousands range
    # block.get_kmers() returns the set
    count = sum([kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts)])
    # kmerCounts is the dict mapping millions of strings to ints
From my tests so far, this takes about 15 seconds per iteration. Since I have around 20,000 of these blocks, I am looking at half a week just to do this. And that is for the 50,000,000 items, not the billions I need to handle...
(And yes I should probably do this in another language, but I also need it done fast and I am not very good at non-python languages).
There's no need to do a full intersection, you just want the matching elements from the big dictionary if they exist. If an element doesn't exist you can substitute 0 and there will be no effect on the sum. There's also no need to convert the input of sum to a list.
count = sum(kmerCounts.get(x, 0) for x in block.get_kmers())
Remove the square brackets around your list comprehension to turn it into a generator expression:
sum(kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts))
That will save you some time and some memory, which may in turn reduce swapping, if you're experiencing that.
There is a lower bound to how much you can optimize here. Switching to another language may ultimately be your only option.

Redis / Dictionaries / sqlite3 on millions of pairs

I have pairs (key, value) where key = string and value = int. I am trying to build an index from a large text corpus, so I store the string and an identifier. For every term I read from the corpus I have to check the index to see if it already exists, so I need fast lookups (O(1) if possible). I was using a Python dictionary to create the index. The problem is that I run out of RAM (16 GB). My alternative was to use the dictionary and, when my RAM reached 90% usage, store the pairs in an sqlite3 database on disk. But now the problem is that lookups take too much time (first check the dict; on a miss, go and check the database on disk).
I am thinking of switching to Redis. My question is: should I store the keys as strings, or should I hash them and then store them? (Keys are strings of 2~100 characters.) And what about the values - should I do anything to them (values are int32 numbers)?
Edit: I want to store every term and its identifier (unique pairs), and if I read a term that already exists in the index, skip it.
Edit 2: I tried using Redis but it seems to go really slow (?). I use the same code, just with Redis SET & GET instead of the dictionary - these are supposed to have O(1) complexity - but building the index is far too slow. Any advice?
A Python dictionary can be simulated with C hash tables quite easily. GLib provides a working hash table implementation that is not difficult to use with some C training. The advantage is that it will be faster and (much) less memory-hungry than the Python dictionary:
https://developer.gnome.org/glib/2.40/glib-Hash-Tables.html
See also: GLib Hash Table Loop Problem.
You can also add some algorithmic tricks to improve performance, for example storing a compressed key.
Even easier, you can segment your large text corpus into sections, create an independent index for each section, and then "merge" the indexes.
So, for example, index 1 will look like:
key1 -> page 1, 3, 20
key2 -> page 2, 7
...
index 2:
key1 -> page 50, 70
key2 -> page 65
...
Then you can merge index 1 and 2:
key1 -> page 1, 3, 20, 50, 70
key2 -> page 2, 7, 65
...
You can even parallelize this across N machines.
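A minimal Python sketch of that merge step might look like this (the per-section index layout, a plain dict of term to page list, is an assumption for the example):

from collections import defaultdict

index1 = {'key1': [1, 3, 20], 'key2': [2, 7]}
index2 = {'key1': [50, 70], 'key2': [65]}

merged = defaultdict(list)
for section_index in (index1, index2):
    for term, pages in section_index.items():
        merged[term].extend(pages)

# merged == {'key1': [1, 3, 20, 50, 70], 'key2': [2, 7, 65]}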
should I store the key values as strings or should I hash them and then store them? [...] what about the values?
The most naive way to use Redis in your case is to perform a SET for every unique pair, e.g. SET foo 1234, and so on.
As demonstrated by Instagram (x), what you can do instead is use Redis hashes, which feature transparent memory optimizations behind the scenes:
Hashes [...] when smaller than a given number of elements, and up to a maximum element size, are encoded in a very memory efficient way that uses up to 10 times less memory
(see Redis memory optimization documentation for more details).
As suggested by Instagram what you can do is:
hash every key with a 64-bit hash function: n = hash(key)
compute the corresponding bucket: b = n/1000 (with 1,000 elements per bucket)
store the hash, value (= i) pair in this bucket: HSET b n i
Note: keep your integer value i as-is since behind the scenes integers are encoded using a variable number of bytes in ziplists.
Of course make sure to configure Redis with hash-max-ziplist-entries 1000 to make sure every hash will be memory optimized (xx).
To speed up your initial insertion, you may want to use the raw Redis protocol via mass insertion.
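A hedged redis-py sketch of that bucketing scheme might look like the following; the hash function choice and the connection details are assumptions, not part of the answer:

import hashlib
import redis

r = redis.Redis(host='localhost', port=6379)  # assumed local instance

def key_hash(term):
    # Deterministic 64-bit hash (Python's built-in hash() is randomized
    # across runs for strings, so a hashlib digest is used instead).
    return int.from_bytes(hashlib.blake2b(term.encode(), digest_size=8).digest(), 'big')

def store(term, identifier):
    n = key_hash(term)
    b = n // 1000                        # bucket: up to ~1,000 fields per hash
    r.hset('idx:%d' % b, n, identifier)  # HSET b n i

def lookup(term):
    n = key_hash(term)
    value = r.hget('idx:%d' % (n // 1000), n)
    return int(value) if value is not None else None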
(x) Storing hundreds of millions of simple key-value pairs in Redis.
Edit:
(xx) Even though in practice most (if not all) of your hashes will contain a single element because of the sparsity of the hash function. In other words, since your keys are hashed strings and not monotonically increasing IDs as in the Instagram example, this approach may NOT be as interesting in terms of memory savings (all your ziplists will contain a single pair). You may want to load your dataset and see how it behaves with real data compared to the basic SET key (= string) value (= integer) approach.

Storing more than one key value in a tuple with python?

I'm new to Python and still learning. I was wondering if there was a standard 'best practice' for storing more than one key value in a tuple. Here's an example:
I have a value called 'red' which has a value of 3, and I need to divide a number (say 10) by it. I need to store 3 values: red (the name), 3 (the number of times it divides 10) and 1 (the remainder). There are other similar values that will need to be included as well, so this is for red, but there are analogous results for blue, green, etc. (the numbers are different for each label).
I read around and I think the approach I found was to use nested lists, but I am doing this type of storage for a billion records (and I'll need to search through it, so I thought anything nested might slow me down).
I tried to create something like {'red':3:1, ...} but that is not valid syntax, and I'm considering adding a delimiter to the value and then splitting on it (such as {'red':3a1, ...}, then parsing on the letter a), but I'm not sure that's efficient.
I'm wondering if there are any better ways to store this, or are nested tuples my only solution? I'm using Python 2.
The syntax for tuples is: (a,b,c).
If you want a dictionary with multiple values you can have a list as the value: {'red':[3,1]}.
You may want to also consider named tuples, or even classes. This will allow you to name the fields instead of accessing them by index, which will make the code more clear and structured.
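For instance, a small sketch of the namedtuple approach (the field names are made up for the example; this works in both Python 2.6+ and 3):

from collections import namedtuple

Division = namedtuple('Division', ['name', 'quotient', 'remainder'])

results = {'red': Division('red', 3, 1), 'blue': Division('blue', 2, 2)}

print(results['red'].quotient)   # 3
print(results['red'].remainder)  # 1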
I read around and I think way I found was to use nested lists, but I am doing this type of storage for a billion records(and I'll need to search through it so I thought maybe nested anything might slow me down).
If you have a billion records you probably should be persisting the data (for example in a database). You will likely run out of memory if you try to keep all the data in memory at once.
Use a tuple. For example:
`('red', 3, 1)`
Perhaps you mean dictionaries instead of tuples?
{'red': [3,1], 'blue': [2,2]}
If you are trying to store key/value pairs the best way would be to store them in a dictionary. And if you need more than one value to each key, just put those values in a list.
I don't think you would want to store such things in a tuple because tuples aren't mutable. So if you decide to change the order of the quotient and remainder, storing (1, 3) instead of (3, 1), you would need to create new tuples, whereas with lists you could simply rearrange the order.
