I have a look-up table which contains <word: dictionary> pairs.
Then, given a word list,
I can produce a list of dictionaries using this look-up table.
(The length of the word list is not fixed from one call to the next.)
The values in these dictionaries are the log probabilities of their keys.
Here is an example:
Given a word list
['fruit','animal','plant'],
we can consult the look-up table and get
dict_list = [{'apple':-1, 'flower':-2}, {'apple':-3, 'dog':-1}, {'apple':-2, 'flower':-1}].
We can see from the list that we have a set of keys: {'apple', 'flower', 'dog'}.
For each key, I want the sum of its values across dict_list. If a key does not exist in one of the dictionaries, we add a small value of -10 instead (you can regard -10 as a very small log probability).
The result dictionary looks like:
dict_merge = {'apple':-6, 'flower':-13, 'dog':-21},
because 'apple' = (-1) + (-3) + (-2), 'flower' = (-2) + (-10) + (-1), 'dog' = (-10) + (-1) + (-10)
Here is my python3 code:
dict_list = [{'apple':-1, 'flower':-2}, {'apple':-3, 'dog':-1}, {'apple':-2, 'flower':-1}]
key_list = []
for dic in dict_list:
    key_list.extend(dic.keys())
dict_merge = dict.fromkeys(key_list, 0)
for key in dict_merge:
    for dic in dict_list:
        dict_merge[key] += dic.get(key, -10)
This code works, but if some of the dictionaries in dict_list are very large (for example, 100,000 entries), it can take over 200ms, which is not acceptable in practice.
The main computation is the for key in dict_merge loop; imagine it as a loop of size 100,000.
Are there any ways to speed this up? Thanks, and thanks for reading!
P.S.
Only a few dictionaries in the look-up table are very large, so there may be an opportunity there.
As I understand it, sum(len(d) for d in dict_list) is much smaller than len(key_list) * len(dict_list).
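The following solution exploits exactly that observation: every key starts from the all-missing total len(dict_list) * (-10), and each entry that actually exists adds its value and cancels one -10 penalty. The total work is therefore proportional to sum(len(d) for d in dict_list) rather than len(key_list) * len(dict_list).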
from collections import defaultdict

dict_list = [{'apple':-1, 'flower':-2}, {'apple':-3, 'dog':-1}, {'apple':-2, 'flower':-1}]

# every key starts from the "missing everywhere" total
default_value = len(dict_list) * (-10)
dict_merge = defaultdict(lambda: default_value)
for d in dict_list:
    for key, value in d.items():
        # add the real value and cancel one -10 penalty
        dict_merge[key] += value + 10
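On the example this reproduces dict_merge exactly:
print(dict(dict_merge))   # {'apple': -6, 'flower': -13, 'dog': -21}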
I have two datasets which are in a dictionary format:
ch1 = {1000: 128,
2830: 1022,
3438: 198,
5908: 109}
ch2 = {1295: 1203,
2836: 1238,
4901: 8367,
7608: 249}
Currently, I have code to look for matches between keys in one dictionary and keys in the other:
coin = [int(ch1[key])+int(ch2[key]) for key in ch1.keys() & ch2.keys()]
I'm looking to change this code so that it finds keys that are within a given range of one another. For example, with a range of 10, the output list from the example dictionaries would be [2260], the sum of 1022 + 1238, obtained by matching key 2830 from ch1 with key 2836 from ch2.
One limitation is that the data files are large (~500MB), which rules out the solutions I had thought of that iterate through the data as lists.
In the rare case where two keys in one dictionary are within range of a key in the other dictionary, this should give one match.
ch1 = {1000: 128,
2830: 1022,
3438: 198,
5908: 109}
ch2 = {1295: 1203,
2836: 1238,
2839: 8367,
7608: 249}
Should still yield [5825], i.e. round((1238+8367)/2 + 1022).
In the even rarer case that there are two pairs, it does not matter which of them are matched together; there should be only two outputs in this case. E.g.:
ch1 = {1000: 128,
2837: 1022,
2838: 198,
5908: 109}
ch2 = {1295: 1203,
2836: 1238,
2839: 8367,
7608: 249}
Result = [2260, 8565], which comes from 1022+1238 and 198+8367.
I propose a simple answer that runs in 9 seconds for two dictionaries that are 15MB each (31MB total), so it may be used as a baseline for comparison. What is the desired speed?
I don't sum the results but find all eligible pairs, as I don't quite understand how they should be summed. Once you have the combinations, it should be easy to apply your own rules.
Create two dictionaries
import sys
from numpy.random import default_rng

rng = default_rng(12345)
MAX_KEY = 10000000
MAX_VALUE = 10000
M = 1000000

# fill both dictionaries with M random key/value pairs each
dictionaries = {'1': {}, '2': {}}
for _ in range(M):
    for name in dictionaries:
        key = rng.integers(low=1, high=MAX_KEY)
        value = rng.integers(low=1, high=MAX_VALUE)
        dictionaries[name][key] = value
ch1 = dictionaries['1']
ch2 = dictionaries['2']

TOT_SIZE = 0
TOT_SIZE += sys.getsizeof(list(ch1))
TOT_SIZE += sys.getsizeof(list(ch2))
TOT_SIZE += sys.getsizeof([ch1[key] for key in ch1])
TOT_SIZE += sys.getsizeof([ch2[key] for key in ch2])
TOT_SIZE /= (1024**2)
print(f"TOT_SIZE = {TOT_SIZE} MB")
Function
def get_possiblePairs(TH=10):   # TH: matching tolerance, e.g. 10 as in the question
    # merge and sort all keys so that nearby keys become adjacent
    list_keys = list(ch1) + list(ch2)
    list_keys.sort()
    N = len(list_keys)
    possible_sums = {}
    # first pass: pairs where the ch1 key is the smaller of the two
    for i in range(N):
        for j in range(i + 1, N):
            key1 = list_keys[i]
            key2 = list_keys[j]
            if key2 < key1 + TH:
                if key1 in ch1 and key2 in ch2:
                    coin = ch1[key1] + ch2[key2]
                    possible_sums[(key1, key2)] = coin
            else:
                break
    # second pass: pairs where the ch1 key is the larger of the two
    for j in range(N):
        for i in range(j + 1, N):
            key1 = list_keys[i]
            key2 = list_keys[j]
            if key1 < key2 + TH:
                if key1 in ch1 and key2 in ch2:
                    coin = ch1[key1] + ch2[key2]
                    possible_sums[(key1, key2)] = coin
            else:
                break
    return possible_sums
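Usage, with ch1 and ch2 as built above (the returned dict maps each matched (key1, key2) pair to its sum):
possible_sums = get_possiblePairs(TH=10)
print(len(possible_sums), "eligible pairs")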
I like the solution posted by konrad-h, but I've decided to write mine, which is somewhat simpler and appears to have fewer loops. However, it relies more on numpy functions, so I'm not sure which one is more efficient, and mine has a drawback: it won't work for the 3rd case.
I'll present the solution, and clarify why I think the 3rd case is difficult to solve regardless of the approach.
So, my approach is:
1.) Create numpy arrays of the keys. I call these arrays c1_keys and c2_keys, for dictionaries ch1 and ch2 respectively.
2.) For every key1 in c1_keys, find all keys key2 in c2_keys which are within +-10 of key1.
3.) Average the corresponding values from ch2 of all key2's found.
4.) Append to the results list the sum of that average and the value from ch1 for key1.
5.) Remove the found key2's from the numpy array c2_keys.
import numpy as np

def find_pairs(ch1, ch2):
    # dict keys to np arrays
    c1_keys = np.array(list(ch1.keys()))
    c2_keys = np.array(list(ch2.keys()))
    results = []
    # find pairs of keys
    for key1 in c1_keys:
        # boolean mask of the ch2 keys within +-10 of key1
        indices = np.logical_and(key1 - 10 <= c2_keys, c2_keys <= key1 + 10)
        if np.any(indices):
            sum_ = 0
            for index in c2_keys[indices]:
                sum_ += ch2[index]
            sum_ = sum_ / indices.sum()   # average of the matched ch2 values
            # note: Python 3's round() rounds half to even, so e.g. round(5824.5) == 5824
            results.append(round(ch1[key1] + sum_))
            c2_keys = c2_keys[~indices]   # matched ch2 keys are consumed and never reused
    return results
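A quick check on the second example from the question (note the round() comment in the code):
print(find_pairs(ch1, ch2))   # [5824]: round(5824.5) rounds half to even, whereas the question's hand-rounding gave 5825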
Pros:
- This solution works faster and faster as the search progresses, because we keep removing keys from the 2nd dictionary (that is, from the np array).
- It solves the most common case and the rare case.
Cons:
- It won't solve the 3rd case.
Explanation:
As konrad-h stated as well, it is somewhat confusing how the eligible pairs should be summed; this is why his solution has more loops. Suppose the dictionaries look like this:
ch1 = {
    1: 100,
    2: 200,
    3: 300,
    ...
}
ch2 = {
    1: 100,
    2: 200,
    3: 300,
    ...
}
How should we deal with this? For a +-10 tolerance, all keys 1-10 in ch1 will be compatible with all keys 1-10 from ch2. The only way to know that perfect pairs exist is to do the complete search twice. First, we find all eligible pairs (which is what konrad-h did). Second, we find the best pair combos (which also isn't simple).
This is why I highly suggest you reduce the accepted tolerance and sum the first eligible pairs you encounter. There is no way to know whether two pairs in ch1 and ch2 exist without passing through them twice. But you can easily handle cases 1 and 2.
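For what it's worth, the "sum the first eligible pairs you encounter" strategy can be done in a single sorted pass. Here is a minimal sketch (my own, not from either answer above; it pairs greedily one-to-one and deliberately skips the averaging rule of the rare cases):
def greedy_pairs(ch1, ch2, tol=10):
    keys2 = sorted(ch2)
    results = []
    j = 0
    for key1 in sorted(ch1):
        # skip ch2 keys that are already too small to match this or any later key1
        while j < len(keys2) and keys2[j] < key1 - tol:
            j += 1
        if j < len(keys2) and keys2[j] <= key1 + tol:
            results.append(ch1[key1] + ch2[keys2[j]])
            j += 1   # each ch2 key is consumed at most once
    return results
On the first example this yields [2260], in O(n log n) for the sorts plus O(n) for the scan.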
How do I code a function in Python which can:
iterate through a list of word strings, which may contain duplicate words, referencing a dictionary,
find the word with the highest absolute sum, and
output it along with the corresponding absolute value.
The function also has to ignore words which are not in the dictionary.
For example,
Assume the function is called H_abs_W().
Given the following list and dict:
list_1 = ['apples','oranges','pears','apples']
Dict_1 = {'apples':5.23,'pears':-7.62}
Then calling the function as:
H_abs_W(list_1,Dict_1)
Should give the output:
'apples',10.46
EDIT:
I managed to do it in the end with the code below. Looking over the answers, it turns out I could have done it in a shorter fashion, lol.
def H_abs_W(list_1, Dict_1):
    freqW = {}
    for char in list_1:
        if char in freqW:
            freqW[char] += 1
        else:
            freqW[char] = 1
    ASum_W = 0
    i_word = ''
    for a, b in freqW.items():
        d = Dict_1.get(a, 0)
        x = abs(float(b) * float(d))
        if x > ASum_W:
            ASum_W = x
            i_word = a
    return (i_word, ASum_W)
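A shorter version along those lines (a sketch using only builtins; it assumes at least one word from list_1 appears in Dict_1):
def H_abs_W(list_1, Dict_1):
    # accumulate the absolute contribution of each word that is in the dictionary
    scores = {}
    for word in list_1:
        if word in Dict_1:
            scores[word] = scores.get(word, 0) + abs(Dict_1[word])
    best = max(scores, key=scores.get)
    return best, scores[best]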
list_1 = ['apples','oranges','pears','apples']
Dict_1 = {'apples':5.23,'pears':-7.62}

# accumulate the sum of values for each word in the list
d = {k: 0 for k in list_1}
for x in list_1:
    if x in Dict_1:
        d[x] += Dict_1[x]

# pick the word with the highest absolute accumulated sum
m = max(d, key=lambda k: abs(d[k]))
print(m, abs(d[m]))
try this,
key, value = max(Dict_1.items(), key=lambda x: abs(list_1.count(x[0]) * x[1]))
print(f"{key}, {abs(list_1.count(key) * value)}")
# apples, 10.46
you can use Counter to calculate the frequency (number of occurrences) of each item in the list.
counter[word] gives the count for any word (0 if it never occurs), and
max(Dict_1, key=...) picks the dictionary word whose count times value is the
largest in absolute terms, so words missing from the dictionary are never returned.
from collections import Counter

def H_abs_W(list_1, Dict_1):
    counter = Counter(list_1)
    # only words present in Dict_1 compete; words absent from list_1 count as 0
    item = max(Dict_1, key=lambda w: abs(counter[w] * Dict_1[w]))
    return item, abs(counter[item] * Dict_1[item])
I want to sort this nested dictionary twice: first by time, and then by key. This is a nested nested dictionary. The entries should be sorted by time first and then by the keys ("FileNameXXX") of the inner dictionary.
data = {1: {"05:00:00": {"FileName123": "LineString1"}},
2: {"16:00:00": {"FileName456": "LineString2"}},
3: {"07:00:00": {"FileName789": "LineString3"}},
4: {"07:00:00": {"FileName555": "LineString4"}}}
Expected Result:
1: {"05:00:00": {"FileName123": "LineString1"}}
3: {"07:00:00": {"FileName789": "LineString3"}}
4: {"07:00:00": {"FileName555": "LineString4"}}
2: {"16:00:00": {"FileName456": "LineString2"}}
You can achieve that by building some notion of value for each entry in data. For example, I defined the "value" of a data entry in the following function, but notice that it relies heavily on there being exactly one key inside the second nested dict, which must also be strictly a time formatted as a string.
from datetime import datetime

def get_comparable(key):
    # the single time string inside the entry, e.g. "05:00:00"
    raw_time = list(data[key].keys())[0]
    time = datetime.strptime(raw_time, "%H:%M:%S").time()
    # seconds since midnight, plus the outer key as a small tie-breaker
    return time.hour * 3600 + time.minute * 60 + time.second + key * 0.001
Then you can just use:
for k in sorted(data, key=get_comparable):
    print(k, data[k])
output:
1 {'05:00:00': {'FileName123': 'LineString1'}}
3 {'07:00:00': {'FileName789': 'LineString3'}}
4 {'07:00:00': {'FileName555': 'LineString4'}}
2 {'16:00:00': {'FileName456': 'LineString2'}}
Using
sorted(data, key=lambda x: list(data[x].keys())[0])
will produce the same output, but be careful: it does not take the first-level keys (the numbers) into account, and it sorts the times lexicographically.
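If you want both criteria explicitly (time first, then the outer numeric key as a tie-breaker), a tuple key is a simple alternative. A sketch; it still compares the time strings lexicographically, which is safe for zero-padded HH:MM:SS values:
for k in sorted(data, key=lambda k: (list(data[k])[0], k)):
    print(k, data[k])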
I have the following random selection script:
import random
length_of_list = 200
my_list = list(range(length_of_list))
num_selections = 10
numbers = random.sample(my_list, num_selections)
It looks at a list of predetermined size and randomly selects 10 numbers. Is there a way to run this section 500 times and then get the top 10 numbers which were selected the most? I was thinking that I could feed the numbers into a dictionary and then get the top 10 numbers from there. So far, I've done the following:
for run in range(0, 500):
    numbers = random.sample(my_list, num_selections)
    for number in numbers:
        current_number = my_dict.get(number)
        key_number = number
        my_dict.update(number = number+1)
print(my_dict)
Here I want the code to take the current count assigned to that key and add 1, but I cannot manage to make it work. It seems like the key for the dictionary update has to be that specific key; I cannot insert a variable. Also, I think this nested loop might not be very efficient, as I have to run this 500 times 1500 times 23... so I am concerned about performance. If anyone has an idea of what I should try, it would be great! Thanks
SOLUTION:
import random
from collections import defaultdict
from collections import OrderedDict

length_of_list = 50
my_list = list(range(length_of_list))
num_selections = 10

di = defaultdict(int)
for run in range(0, 500):
    numbers = random.sample(my_list, num_selections)
    for number in numbers:
        di[number] += 1

def get_top_numbers(data, n, order=False):
    """Gets the top n numbers from the dictionary"""
    top = sorted(data.items(), key=lambda x: x[1], reverse=True)[:n]
    if order:
        return OrderedDict(top)
    return dict(top)

print(get_top_numbers(di, n=10))
my_dict.update(number = number+1): in this line you are passing a keyword argument named number to a function call. For a plain dict this does not use your variable number as the key; dict.update(key=value) treats the argument name as the literal string key, so you keep overwriting the single entry my_dict['number'].
Aside from keyword arguments, dict.update doesn't accept an integer, only another dictionary (or an iterable of key/value pairs). You should read the documentation for this function: https://www.tutorialspoint.com/python3/dictionary_update.htm
There it says dict.update(dict2) takes a dictionary which it will integrate into the original dict. See the example below:
d1 = {'Name': 'Zara', 'Age': 17}
d2 = {'Gender': 'female'}
d1.update(d2)
print("updated dict : ", d1)
Gives as result:
updated dict : {'Name': 'Zara', 'Age': 17, 'Gender': 'female'}
So much for the errors in your code; I see a good answer has already been given, so I won't repeat it.
Check out defaultdict from the collections module.
So basically, you create a defaultdict with default value 0, then iterate over your numbers list and increment the count for each number by 1:
import random
from collections import defaultdict

di = defaultdict(int)
for run in range(0, 500):
    numbers = random.sample(my_list, num_selections)   # my_list, num_selections as defined in the question
    for number in numbers:
        di[number] += 1
print(di)
You can use collections.Counter for this task, which supports addition. Use two counters: one that accumulates the overall totals and one that holds the counts of the current sample.
import random
import collections

counter = collections.Counter()
for run in range(500):
    samples = random.sample(my_list, num_selections)   # my_list, num_selections as defined in the question
    sample_counter = collections.Counter(samples)
    counter = counter + sample_counter
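Counter can also produce the "top 10" directly, with no manual sorting:
print(counter.most_common(10))   # list of (number, count) pairs, most frequent first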
I have a dictionary which looks like this:
cq = {'A1_B2M_01':2.04, 'A2_B2M_01':2.58, 'A3_B2M_01':2.80, 'B1_B2M_02':5.00,
      'B2_B2M_02':4.30, 'B3_B2M_02':2.40 etc.}
I need to calculate the mean of each triplet whose keys agree from position 2 onwards (key[2:]). So, I would ideally like to get another dictionary:
new = {'_B2M_01': 2.47, '_B2M_02': 3.9}
The data is/should be in triplets, so in theory I could just take the means of consecutive values, but first of all, I have it in a dictionary, so the keys/values will likely get reordered; besides, I'd rather stick to the names as a quality check for the triplets assigned to them (I will later add a bit that shows an error message when there are more than three per group).
I've tried creating a dictionary whose keys would be _B2M_01 and _B2M_02, then looping through the original dictionary to first append all the values assigned to each group of keys, so I could later calculate an average. But I am getting errors even in the first step, and anyway, I am not sure this is the most effective way to do it...
cq = {'A1_B2M_01':2.4, 'A2_B2M_01':5, 'A3_B2M_01':4, 'B1_B2M_02':3, 'B2_B2M_02':7, 'B3_B2M_02':6}
trips = set([x[2:] for x in cq.keys()])
new = {}
for each in trips:
    for k, v in cq.items():
        if k[2:] == each:
            new[each].append(v)

Traceback (most recent call last):
  File "<pyshell#28>", line 4, in <module>
    new[each].append(v)
KeyError: '_B2M_01'
I would be very grateful for any suggestions. It seems like a fairly easy operation but I got stuck.
An alternative result, which would be even better, is a dictionary that keeps all the names used in cq but whose values are the means of each group. The end result would be:
final = {'A1_B2M_01':2.47, 'A2_B2M_01':2.47, 'A3_B2M_01':2.47, 'B1_B2M_02':3.9,
         'B2_B2M_02':3.9, 'B3_B2M_02':3.9}
Something like this should work. You can probably make it a little more elegant.
cq = {'A1_B2M_01':2.04, 'A2_B2M_01':2.58, 'A3_B2M_01':2.80, 'B1_B2M_02':5.00, 'B2_B2M_02':4.30, 'B3_B2M_02':2.40}

sums = {}
counts = {}
mean = {}
for k in cq:
    group = k[2:]
    if group in sums:
        sums[group] += cq[k]
        counts[group] += 1
    else:
        sums[group] = cq[k]
        counts[group] = 1
for k in sums:
    mean[k] = sums[k] / counts[k]
cq = {'A1_B2M_01':2.4, 'A2_B2M_01':5, 'A3_B2M_01':4, 'B1_B2M_02':3, 'B2_B2M_02':7, 'B3_B2M_02':6}

sums = dict()
for k, v in cq.items():
    _, p2 = k.split('_', 1)   # group name: everything after the first '_'
    if p2 not in sums:
        sums[p2] = [0, 0]
    sums[p2][0] += v          # running total
    sums[p2][1] += 1          # member count
res = {}
for k, v in sums.items():
    res[k] = v[0] / float(v[1])
print(res)
This could also be done with a single iteration.
Grouping:
SEPARATOR = '_'
cq = {'A1_B2M_01':2.4, 'A2_B2M_01':5, 'A3_B2M_01':4, 'B1_B2M_02':3, 'B2_B2M_02':7, 'B3_B2M_02':6}

groups = {}
for key in cq:
    group_key = SEPARATOR.join(key.split(SEPARATOR)[1:])
    if group_key in groups:
        groups[group_key].append(cq[key])
    else:
        groups[group_key] = [cq[key]]
Generate means:
def means(groups):
    for group, group_vals in groups.items():
        yield (group, float(sum(group_vals)) / len(group_vals))
print(list(means(groups)))
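For reference, a compact modern version (a sketch using statistics.mean, not from the original answers) that produces both the grouped means and the per-name final dictionary the asker mentioned:
from collections import defaultdict
from statistics import mean

cq = {'A1_B2M_01':2.4, 'A2_B2M_01':5, 'A3_B2M_01':4, 'B1_B2M_02':3, 'B2_B2M_02':7, 'B3_B2M_02':6}

groups = defaultdict(list)
for key, value in cq.items():
    groups[key[2:]].append(value)

new = {group: mean(values) for group, values in groups.items()}   # {'_B2M_01': 3.8, '_B2M_02': 5.33...}
final = {key: new[key[2:]] for key in cq}                         # every original name mapped to its group mean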