I am working on a dictionary structure where I have a dictionary of documents and each document has a dictionary of words (where each key is word_id (integer) and values are counts) such that:
document_dict = { "doc1": {1:2, 2:10, 10:2, 100: 1}, "doc2": {10:2, 20:10, 30:2, 41: 19},...}
Note that the inner dictionaries are pretty sparse, so even though I have 250K words, I don't expect to have more than 1K keys per document.
In each iteration, I need to add a dict of word:count pairs to one of the documents, summing the counts of shared keys, e.g. I need to merge a new dict {1:2, 2:10, 10:2, 120: 1} into "doc1": {1:2, 2:10, 10:2, 100: 1}.
Right now, my implementation runs quite fast, however after 2 hours it runs out of memory (I am using a 40GB server).
The way I was summing up the keys was something like this:
Assume that new_dict is the new word:count pairs that I want to add to doc1 such as:
new_dict = {1:2, 2:10, 10:2, 120: 1}
doc1 = {1:2, 2:10, 10:2, 100: 1}
for item in new_dict:
    doc1[item] = doc1.get(item, 0) + new_dict[item]
Since it was simply impossible to run the code with dictionaries, because my dicts get quite large in a very short time, I tried to implement each dictionary as a list of two lists, e.g. doc1 = [[],[]], where the first list keeps the keys and the second list keeps the values.
Now when I want to union two structures like this, I first try to get the index of each item of new_dict in doc1. If I successfully obtain an index, the key is already in doc1, so I can just update the corresponding value. Otherwise, it is not in doc1 yet, so I append() the new key and value to the end of the lists. However, this approach runs extremely slowly (with the dict version I was able to process up to 600K documents in 2 hours; now I could only process 250K documents in 15 hours).
So my question is: If I want to use a dictionary structure (key, val) pairs where I need to union keys of 2 dicts and sum their values in each iteration, is there a way to implement this more space efficiently?
It's not necessarily more space efficient, but I would suggest switching to a disk-based dictionary by using the shelve module so you don't have to have the entire dictionary in memory at once.
They're very easy to use since they support the familiar dictionary interface, as shown below:
import shelve
document_dict = shelve.open('document_dict', writeback=True)
document_dict.update({"doc1": {1:2, 2:10, 10:2, 100: 1},
                      "doc2": {10:2, 20:10, 30:2, 41: 19},
                      "doc3": {1:2, 2:10, 10:2, 100: 1},})
new_dict = {1:2, 2:10, 10:2, 120: 1}
doc = document_dict.get("doc3", {}) # get current value, if any
for item in new_dict:
    doc[item] = doc.get(item, 0) + new_dict[item] # update version in memory
document_dict["doc3"] = doc # write modified (or new) entry to disk
document_dict.sync() # write back cached entries and empty the cache
print(dict(document_dict))
document_dict.close()
Output:
{'doc2': {41: 19, 10: 2, 20: 10, 30: 2},
'doc3': {120: 1, 1: 4, 2: 20, 100: 1, 10: 4},
'doc1': {1: 2, 2: 10, 100: 1, 10: 2}}
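The merge loop itself can also be written with collections.Counter, whose update() method adds counts rather than replacing them. This is only a sketch of the summing step, using the values from the question; it does not by itself reduce memory use.

```python
from collections import Counter

# Existing word_id -> count mapping for one document (values from the question).
doc1 = Counter({1: 2, 2: 10, 10: 2, 100: 1})
new_dict = {1: 2, 2: 10, 10: 2, 120: 1}

# Counter.update adds the incoming counts to the existing ones in place,
# creating entries for keys that were not present before.
doc1.update(new_dict)

print(dict(doc1))
```

Since Counter is a dict subclass, the rest of the code can keep treating doc1 as a plain dictionary.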
Related
I have a rather algorithmic question. I’m using Python so that would be a bonus but in essence it’s about how to approach a problem.
I have a nested dictionary where, for each customer, there are their relations with other customers. Basically, if they enjoy the same things, they are more compatible. For the first customer, the compatibility is calculated with every other customer, but in the second customer's loop the first one is skipped because it has already been calculated. You end up with a dictionary something like this:
Custdict={'1':{'2':1,'3':0,'4':3},'2':{'3':1,'4':2}…}
So for the last customer, let's say the 10th, there is no entry as a key/value pair, since it was calculated in the previous ones. My question is, how can I obtain this data from the previous entries and add it as key/values in the later ones? The above dictionary should become:
Custdict={'1':{'2':1,'3':0,'4':3},'2':{'1':1,'3':1,'4':2}…}
I did some online searching to see if there is such an algorithm but couldn't find anything.
As a simple solution, you can just iterate over all values in the dictionary, and for each customer-customer pair (i, j), you also set the value for (j, i).
from collections import defaultdict

cust = {
    1: {
        2: 1,
        3: 0
    },
    2: {
        3: 1
    }
}

new_cust = defaultdict(dict)
for customer in cust.keys():
    for neighbour in cust[customer].keys():
        new_cust[neighbour][customer] = cust[customer][neighbour]
        new_cust[customer][neighbour] = cust[customer][neighbour]

print(dict(new_cust))
# prints {2: {1: 1, 3: 1}, 1: {2: 1, 3: 0}, 3: {1: 0, 2: 1}}
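The same idea can be wrapped in a small function and applied to the string-keyed data from the question. This is a sketch only; the function name symmetrize is made up for illustration, and it assumes the score for (i, j) should be copied unchanged to (j, i).

```python
def symmetrize(pairs):
    """Return a new nested dict in which every (i, j) score also appears as (j, i)."""
    full = {}
    for customer, neighbours in pairs.items():
        for neighbour, score in neighbours.items():
            # setdefault creates the inner dict the first time a customer is seen.
            full.setdefault(customer, {})[neighbour] = score
            full.setdefault(neighbour, {})[customer] = score
    return full


# The two entries shown in the question (the elided part is left out).
custdict = {'1': {'2': 1, '3': 0, '4': 3}, '2': {'3': 1, '4': 2}}
full = symmetrize(custdict)
print(full['2'])
```

Building a new dictionary avoids modifying the input while iterating over it.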
I have two dictionaries, that look like:
dict1 = {1: 10, 2: 23, .... 999: 12}
dict2 = {1: 42, 2: 90, .... 999: 78}
I want to perform a simple calculation: multiply the value in dict1 by the value in dict2 for each key, e.g. for 1 and 2.
The code so far is:
dict1[1] * dict2[1]
This calculates 10*42, which is exactly what I want.
Now I want to perform this calculation for every index in the dictionary, so for 1 up to 999.
I tried:
i = {1,2,3,4,5,6 ... 999}
dict1[i] * dict2[i]
But it didn't work.
This creates a new dict with the results:
out = { i: dict1[i] * dict2[i] for i in range(1,1000) }
If you need to work with vectors and matrices take a look at the numpy module. It has data structures and a huge collection of tools for working with them.
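A variant of the comprehension above that does not hard-code the key range may be useful when the two dicts might not share every key. This is a sketch assuming Python 3 (where dict.keys() supports set operations) and a shortened sample containing only the values shown in the question.

```python
dict1 = {1: 10, 2: 23, 999: 12}
dict2 = {1: 42, 2: 90, 999: 78}

# dict1[{1, 2}] would raise TypeError because a set is unhashable and
# cannot be used as a dictionary key; loop over the keys instead.
shared_keys = dict1.keys() & dict2.keys()
out = {key: dict1[key] * dict2[key] for key in shared_keys}

print(out[1])
```

Keys present in only one of the dicts are simply skipped rather than raising KeyError.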
New to Stack Overflow, apologies if the title isn't that clear.
Effectively I am working with two Excel files converted to CSV, both turned into nested dictionaries using the to_dict method, where the index is the key of each outer dictionary and the columns are the keys of each nested dictionary.
i.e.
DICTA = {0: {x:1, y:2, v:3}, 1: {x:5, y:6, v:7}, 2: {x:8, y:9, v:10}}
DICTB = {0: {a:3, b:12, c:13, d:14}, 1: {a:15, b:16, c:17, d:18}, 2: {a:19, b:20, c:21, d:22}}
Values are arbitrary for the example above (the length of both dictionaries will always be the same; the nested dictionaries have different numbers of keys).
Each nested dictionary in DICTB can only be used once to update a nested dictionary in DICTA, i.e. each nested dict in DICTA 'belongs' to a nested dict in DICTB, but not in any specific order.
My aim is to update values (of nested dicts) in DICTA with values from DICTB (the keys differ between the two) if other conditions/values are met. This is what I have so far:
for k, v in DICTA.items():
    i=0
    h=0
    if DICTA[i].get('v') in (DICTB[h].get('a'), DICTB[h].get('b')):
        if DICTB[h].get('a') != '15': #another condition I need to put in
            DICTA[i].update({'x': DICTB[h].get('c')})
            DICTA[i].update({'y': DICTB[h].get('d')})
            i+=1
        else:
            DICTA[i].update({'y': DICTB[h].get('c')})
            DICTA[i].update({'x': DICTB[h].get('d')})
            i+=1
    else:
        h+=1
Actual output:
In: DICTA
Out: {0: {x:13, y:14, v:3}, 1: {x:5, y:6, v:7}, 2: {x:8, y:9, v:10}}
Expected Output for the above:
In: DICTA
Out: {0: {x:13, y:14, v:3}, 1: {x:18, y:17, v:7}, 2: {x:21, y:22, v:10}}
My issue is that this works for the first DICTA entry but then fails to update the next two i.e. this clearly doesn't update i or h correctly to loop through the next nested dictionary.
Fully aware the above might be painfully un-pythonic and am very much open to easier ways of solving this.
Thanks guys appreciate any help with the above.
If I understand you correctly this should work:
for row_a, row_b in zip(DICTA.values(), DICTB.values()):
    if row_a.get('v') in (row_b.get('a'), row_b.get('b')):
        if row_b.get('a') != '15':
            row_a.update({
                'x': row_b.get('c'),
                'y': row_b.get('d')
            })
        else:
            row_a.update({
                'y': row_b.get('c'),
                'x': row_b.get('d')
            })
Also instead of:
row_a.update({
    'x': row_b.get('c'),
    'y': row_b.get('d')
})
You could use:
row_a['x'] = row_b.get('c')
row_a['y'] = row_b.get('d')
but that's a question of preference.
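If the rows of DICTB are not aligned with the rows of DICTA (the question says they come in no specific order), zip() would pair the wrong rows. The following is a minimal sketch under that reading: each DICTA row takes the first not-yet-used DICTB row that matches. The function name update_from_matches, the sample values, and the comparison with the integer 15 (the question compares with the string '15') are assumptions made for illustration.

```python
def update_from_matches(dict_a, dict_b):
    """Update rows of dict_a in place from the first unused matching row of dict_b."""
    unused = set(dict_b)  # keys of dict_b rows that have not been consumed yet
    for row_a in dict_a.values():
        for key_b in sorted(unused):
            row_b = dict_b[key_b]
            if row_a.get('v') in (row_b.get('a'), row_b.get('b')):
                if row_b.get('a') != 15:
                    row_a['x'], row_a['y'] = row_b.get('c'), row_b.get('d')
                else:
                    # Swapped assignment for the special case.
                    row_a['y'], row_a['x'] = row_b.get('c'), row_b.get('d')
                unused.discard(key_b)  # each dict_b row is used at most once
                break
    return dict_a


# Hypothetical sample data, deliberately not index-aligned.
dict_a = {0: {'x': 1, 'y': 2, 'v': 3},
          1: {'x': 5, 'y': 6, 'v': 7},
          2: {'x': 8, 'y': 9, 'v': 10}}
dict_b = {0: {'a': 7, 'b': 0, 'c': 17, 'd': 18},
          1: {'a': 15, 'b': 10, 'c': 21, 'd': 22},
          2: {'a': 3, 'b': 12, 'c': 13, 'd': 14}}

result = update_from_matches(dict_a, dict_b)
print(result)
```

The nested loop is quadratic in the number of rows, which is probably fine for spreadsheet-sized data; a lookup table keyed on the 'a' and 'b' values would be the next step for larger inputs.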
I am not used to coding in Python, but I have to do this one with it. What I am trying to do is reproduce the result of an SQL statement like this:
SELECT T2.item, AVG(T1.Value) AS MEAN FROM TABLE_DATA T1 INNER JOIN TABLE_ITEMS T2 ON T1.ptid = T2.ptid GROUP BY T2.item.
In Python, I have two lists of dictionaries with the common key 'ptid'. dctData contains around 100,000 ptid entries and dctItems around 7,000. Using a comparison like [i for i in dctData for j in dctItems if i['ptid'] == j['ptid']] takes far too long:
ptid = 1
for line in lines[6:]: # Skipping header
    data = line.split()
    for d in data:
        dctData.append({'ptid' : ptid, 'Value': float(d)})
        ptid += 1
dctData = [{'ptid':1,'Value': 0}, {'ptid':2,'Value': 2}, {'ptid':3,'Value': 2}, {'ptid':4,'Value': 5}, {'ptid':5,'Value': 3}, {'ptid':6,'Value': 2}]
for line in lines[1:]: # Skipping header
    data = line.split(';')
    dctItems.append({'ptid' : int(data[1]), 'item' : data[3]})
dctItems = [{'item':21, 'ptid':1}, {'item':21, 'ptid':2}, {'item':21, 'ptid':6}, {'item':22, 'ptid':2}, {'item':22, 'ptid':5}, {'item':23, 'ptid':4}]
Now, what I would like to get as a result is a third list that gives the average value for each item in the dctItems list, where the link between the two lists is based on the 'ptid' value.
For example, for item 21 it would calculate the mean value of 1.3 from the values (0, 2, 2) of ptid 1, 2 and 6.
Finally, the result would look something like this, where the key Value holds the calculated mean:
dctResults = [{'id':21, 'Value':1.3}, {'id':22, 'Value':2.5}, {'id':23, 'Value':5}]
How can I achieve this?
Thank you all for your help.
Given those data structures that you use, this is not trivial, but it will become much easier if you use a single dictionary mapping items to their values, instead.
First, let's try to re-structure your data in that way:
values = {entry['ptid']: entry['Value'] for entry in dctData}
items = {}
for item in dctItems:
    items.setdefault(item['item'], []).append(values[item['ptid']])
Now, items has the form {21: [0, 2, 2], 22: [2, 3], 23: [5]}. Of course, it would be even better if you could create the dictionary in this form in the first place.
Now, we can pretty easily calculate the average for all those lists of values:
avg = lambda lst: float(sum(lst))/len(lst)
result = {item: avg(values) for item, values in items.items()}
This way, result is {21: 1.3333333333333333, 22: 2.5, 23: 5.0}
Or if you prefer your "list of dictionaries" style:
dctResult = [{'id': item, 'Value': avg(values)} for item, values in items.items()]
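The steps above can be combined into one function. This is a sketch assuming Python 3 (for statistics.mean) and that each ptid appears at most once in dctData; the function name mean_value_per_item is made up for illustration. Items whose ptid has no matching value are skipped, which mirrors the INNER JOIN in the SQL statement.

```python
from collections import defaultdict
from statistics import mean


def mean_value_per_item(data_rows, item_rows):
    """Join the two lists on 'ptid' and average 'Value' per 'item'."""
    value_by_ptid = {row['ptid']: row['Value'] for row in data_rows}
    grouped = defaultdict(list)
    for row in item_rows:
        if row['ptid'] in value_by_ptid:  # inner join: skip unmatched ptids
            grouped[row['item']].append(value_by_ptid[row['ptid']])
    return {item: mean(vals) for item, vals in grouped.items()}


# Sample data from the question.
dctData = [{'ptid': 1, 'Value': 0}, {'ptid': 2, 'Value': 2}, {'ptid': 3, 'Value': 2},
           {'ptid': 4, 'Value': 5}, {'ptid': 5, 'Value': 3}, {'ptid': 6, 'Value': 2}]
dctItems = [{'item': 21, 'ptid': 1}, {'item': 21, 'ptid': 2}, {'item': 21, 'ptid': 6},
            {'item': 22, 'ptid': 2}, {'item': 22, 'ptid': 5}, {'item': 23, 'ptid': 4}]

result = mean_value_per_item(dctData, dctItems)
print(result)
```

Building the ptid lookup once means each of the two lists is scanned a single time, instead of comparing every pair of rows as the nested list comprehension does.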
I have a dictionary of "documents" in python with document ID numbers as keys and dictionaries (again) as values. These internal dictionaries each have a 'weight' key that holds a floating-point value of interest. In other words:
documents[some_id]['weight'] = ...
What I want to do is obtain a list of my document IDs sorted in descending order of the 'weight' value. I know that dictionaries are inherently unordered (and there seem to be a lot of ways to do things in Python), so what is the most painless way to go? It feels like kind of a messy situation...
I would convert the dictionary to a list of tuples and sort it based on weight (in reverse order for descending), then just remove the objects to get a list of the keys:
l = sorted(documents.items(), key=lambda x: x[1]['weight'], reverse=True)
result = [d[0] for d in l]
I took the approach that you might want the keys as well as the rest of the object:
# Build a random dictionary
from random import randint
ds = {} # A |D|ata |S|tructure
for i in range(20,1,-1):
    ds[i]={'weight':randint(0,100)}
# ascending order; pass reverse=True for descending
sortedDS = sorted(ds.keys(),key=lambda x:ds[x]['weight'])
for i in sortedDS :
    print(i,ds[i]['weight'])
sorted() is a Python built-in that takes an iterable and returns a new sorted list; it can also take a key function that it uses to determine the rank of each object. In the above case it uses the 'weight' value as the key to sort on.
The advantage of this over Ameer's answer is that it returns the order of keys rather than the items. It's an extra step, but it means you can refer back into the original data structure.
This seems to work for me. The inspiration for it came from OrderedDict and question #9001509
from collections import OrderedDict
d = {
    14: {'weight': 90},
    12: {'weight': 100},
    13: {'weight': 101},
    15: {'weight': 5}
}
sorted_dict = OrderedDict(sorted(d.items(), key=lambda rec: rec[1].get('weight')))
print(sorted_dict)
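For the descending order asked for in the question, the IDs can also be sorted directly. This is a sketch reusing the sample weights above as floats; the variable name ids_by_weight is made up for illustration.

```python
documents = {
    14: {'weight': 90.0},
    12: {'weight': 100.0},
    13: {'weight': 101.0},
    15: {'weight': 5.0},
}

# Iterating a dict yields its keys, so sorted() can rank the IDs directly;
# reverse=True puts the largest weight first.
ids_by_weight = sorted(documents,
                       key=lambda doc_id: documents[doc_id]['weight'],
                       reverse=True)
print(ids_by_weight)  # [13, 12, 14, 15]
```

Since Python 3.7 plain dicts preserve insertion order, so dict(sorted(...)) would also keep the sorted order without OrderedDict.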