Link two lists of dictionaries and calculate average values - python

I am not used to coding in Python, but I have to use it for this task. What I am trying to do is reproduce the result of an SQL statement like this:
SELECT T2.item, AVG(T1.Value) AS MEAN FROM TABLE_DATA T1 INNER JOIN TABLE_ITEMS T2 ON T1.ptid = T2.ptid GROUP BY T2.item
In Python, I have two lists of dictionaries with the common key 'ptid'. dctData contains around 100 000 ptid entries and dctItems around 7000. Using a comparison like [i for i in dctData for j in dctItems if i['ptid'] == j['ptid']] takes forever. The lists are built like this:
ptid = 1
for line in lines[6:]: # Skipping header
    data = line.split()
    for d in data:
        dctData.append({'ptid' : ptid, 'Value': float(d)})
        ptid += 1
dctData = [{'ptid':1,'Value': 0}, {'ptid':2,'Value': 2}, {'ptid':3,'Value': 2}, {'ptid':4,'Value': 5}, {'ptid':5,'Value': 3}, {'ptid':6,'Value': 2}]
for line in lines[1:]: # Skipping header
    data = line.split(';')
    dctItems.append({'ptid' : int(data[1]), 'item' : data[3]})
dctItems = [{'item':21, 'ptid':1}, {'item':21, 'ptid':2}, {'item':21, 'ptid':6}, {'item':22, 'ptid':2}, {'item':22, 'ptid':5}, {'item':23, 'ptid':4}]
Now, what I would like to get as a result is a third list that presents the average value for each item in dctItems, with the link between the two lists based on the 'ptid' value.
For example, for item 21 it would calculate the mean value 1.3 from the values (0, 2, 2) of ptid 1, 2 and 6.
Finally, the result would look something like this, where the key Value holds the calculated mean:
dctResults = [{'id':21, 'Value':1.3}, {'id':22, 'Value':2.5}, {'id':23, 'Value':5}]
How can I achieve this?
Thank you all for your help.

Given those data structures that you use, this is not trivial, but it will become much easier if you use a single dictionary mapping items to their values, instead.
First, let's try to re-structure your data in that way:
values = {entry['ptid']: entry['Value'] for entry in dctData}
items = {}
for item in dctItems:
    items.setdefault(item['item'], []).append(values[item['ptid']])
Now, items has the form {21: [0, 2, 2], 22: [2, 3], 23: [5]}. Of course, it would be even better if you could create the dictionary in this form in the first place.
Now, we can pretty easily calculate the average for all those lists of values:
avg = lambda lst: float(sum(lst))/len(lst)
result = {item: avg(values) for item, values in items.items()}
This way, result is {21: 1.3333333333333333, 22: 2.5, 23: 5.0}
Or if you prefer your "list of dictionaries" style:
dctResult = [{'id': item, 'Value': avg(values)} for item, values in items.items()]
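For reference, the steps above can be put together into one runnable sketch. It uses the sample data from the question and swaps in collections.defaultdict and statistics.mean for setdefault and the lambda. It assumes every ptid appears only once in dctData; the `if ptid in values` check is an addition that mimics the INNER JOIN by skipping ptids without data.

```python
from collections import defaultdict
from statistics import mean

dctData = [{'ptid': 1, 'Value': 0}, {'ptid': 2, 'Value': 2}, {'ptid': 3, 'Value': 2},
           {'ptid': 4, 'Value': 5}, {'ptid': 5, 'Value': 3}, {'ptid': 6, 'Value': 2}]
dctItems = [{'item': 21, 'ptid': 1}, {'item': 21, 'ptid': 2}, {'item': 21, 'ptid': 6},
            {'item': 22, 'ptid': 2}, {'item': 22, 'ptid': 5}, {'item': 23, 'ptid': 4}]

# Index the data by ptid once, so each lookup is a constant-time dict access
# instead of a scan over all of dctData.
values = {entry['ptid']: entry['Value'] for entry in dctData}

# Group the values by item.
grouped = defaultdict(list)
for entry in dctItems:
    ptid = entry['ptid']
    if ptid in values:  # assumption: ptids without data are skipped, like an INNER JOIN
        grouped[entry['item']].append(values[ptid])

# One result dictionary per item, holding the mean of its values.
dctResults = [{'id': item, 'Value': mean(vals)} for item, vals in grouped.items()]
print(dctResults)
```

Because the lookup table is built once, the work grows with len(dctData) + len(dctItems) rather than with their product, which is what made the nested comprehension so slow.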

Related

Producing key:value pairs from existing pairs in nested dictionary

I have a rather algorithmic question. I’m using Python so that would be a bonus but in essence it’s about how to approach a problem.
I have a nested dictionary where, for each customer, there are their relations with other customers. Basically, if they enjoy the same things, they are more compatible. The way this is calculated is that, for the first customer, the compatibility is calculated with every other customer, but in the second customer's loop the first one is skipped because it has been calculated already. You end up with a dictionary something like this
Custdict={'1':{'2':1,'3':0,'4':3},'2':{'3':1,'4':2}…}
So for the last customer, let's say the 10th, there is no entry for it as a key/value pair since it was calculated in the previous ones. My question is, how can I obtain this data from the previous entries and add it as key/value pairs to the later ones? So the above dictionary should become
Custdict={'1':{'2':1,'3':0,'4':3},'2':{'1':1,'3':1,'4':2}…}
I did some online searching to see if there is such an algorithm but couldn't find anything.
As a simple solution, you can just iterate over all values in the dictionary, and for each customer-customer pair (i, j), you also set the value for (j, i).
from collections import defaultdict
cust = {
    1: {
        2: 1,
        3: 0
    },
    2: {
        3: 1
    }
}
new_cust = defaultdict(dict)
for customer in cust.keys():
    for neighbour in cust[customer].keys():
        new_cust[neighbour][customer] = cust[customer][neighbour]
        new_cust[customer][neighbour] = cust[customer][neighbour]
print(dict(new_cust))
# prints {2: {1: 1, 3: 1}, 1: {2: 1, 3: 0}, 3: {1: 0, 2: 1}}
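The same idea can be sketched as a small function and applied to the string-keyed data from the question. The name symmetrize is hypothetical, and the elided part of Custdict is left out, so only customers '1' to '4' appear.

```python
def symmetrize(cust):
    """Return a new dict in which every (i, j) score is also stored under (j, i)."""
    full = {}
    for customer, neighbours in cust.items():
        for neighbour, score in neighbours.items():
            # setdefault creates the inner dict the first time a customer is seen.
            full.setdefault(customer, {})[neighbour] = score
            full.setdefault(neighbour, {})[customer] = score
    return full

# Only the part of Custdict that is spelled out in the question.
custdict = {'1': {'2': 1, '3': 0, '4': 3}, '2': {'3': 1, '4': 2}}
full = symmetrize(custdict)
print(full['2'])
```

Building a new dictionary instead of modifying the input avoids changing a dictionary while iterating over it.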

count how often a key appears in a dataset

I have a pandas dataframe with 3 columns. The third is the second one with some string slicing.
For every warranty_claim_number, there is a key_part_number (first column).
This dataframe has a lot of rows.
I have a second list, which contains 70 randomly selected warranty_claim_numbers.
I was hoping to find the corresponding key_part_number for those 70 claims in my dataset.
Then I would like to create a dictionary with the key_part_number as key and the corresponding warranty_claim_number as value.
Finally, count how often each key_part_number appears in this dataset and update the key.
It should look like this:
dicti = {4:'000120648353',10:'000119582589',....}
First of all, you need to change the datatype of warranty_claim_number to string, or you won't get the leading 0's.
You can subset your df from that list of claim numbers:
df = df[df["warranty_claim_number"].isin(claimnumberlist)]
This gives you a dataframe with only the rows with those claim numbers.
countofkeyparts = df["key_part_number"].value_counts()
This gives you a pandas Series with the counts, which you can convert to a dict with to_dict():
countofkeyparts = countofkeyparts.to_dict()
The keys in a dict have to be unique, so if you want the count as the key, the value can be a list of key_part_numbers:
values = {}
for key, value in countofkeyparts.items():
    values[value] = values.get(value, [])
    values[value].append(key)
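To illustrate the inversion step on its own, here is a minimal sketch without pandas. The counts below are hypothetical and only mimic the shape of value_counts().to_dict(); collections.defaultdict replaces the get/append pair.

```python
from collections import defaultdict

# Hypothetical counts, shaped like the result of value_counts().to_dict():
# key_part_number -> number of occurrences.
countofkeyparts = {'000120648353': 4, '09824091': 4, '000119582589': 10}

# Invert the mapping: count -> list of key_part_numbers with that count.
by_count = defaultdict(list)
for key_part_number, count in countofkeyparts.items():
    by_count[count].append(key_part_number)

by_count = dict(by_count)
print(by_count)
```

Keeping the part numbers as strings preserves their leading zeros.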
Judging by your example, you can't use the number of occurrences as the dictionary key: keys must be unique, and several key_part_numbers may occur equally often. I therefore recommend storing the result in this format: dicti = {4:['000120648353', '09824091'],10:['000119582589'] ,....}
I'll use randomly generated data as an example:
from collections import Counter
import random
lst = [random.randint(1, 10) for i in range(20)]
counter = Counter(lst)
print(counter) # First element, then number of occurrences
nums = set(counter.values()) # All occurrences
res = {item: [val for val in counter if counter[val] == item] for item in nums}
print(res)
# Counter({5: 6, 8: 4, 3: 2, 4: 2, 9: 2, 2: 2, 6: 1, 10: 1})
# {1: [6, 10], 2: [3, 4, 9, 2], 4: [8], 6: [5]}
This does what you want:
# Select the key_part_number of rows whose warranty_claim_number is in lst:
df_wanted = df.loc[df["warranty_claim_number"].isin(lst), "key_part_number"]
# Count the values in that column:
count_values = df_wanted.value_counts()
# Transform to Dictionary:
print(count_values.to_dict())

Run calculation multiple times with different values

I have two dictionaries, that look like:
dict1 = {1: 10, 2: 23, .... 999: 12}
dict2 = {1: 42, 2: 90, .... 999: 78}
I want to perform a simple calculation: multiply the value in dict1 by the value in dict2 for each key, e.g. for 1 and 2.
The code so far is:
dict1[1] * dict2[1]
This calculates 10*42, which is exactly what I want.
Now I want to perform this calculation for every index in the dictionary, so for 1 up to 999.
I tried:
i = {1,2,3,4,5,6 ... 999}
dict1[i] * dict2[i]
But it didn't work.
This creates a new dict with the results:
out = { i: dict1[i] * dict2[i] for i in range(1,1000) }
If you need to work with vectors and matrices take a look at the numpy module. It has data structures and a huge collection of tools for working with them.
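If the keys are not guaranteed to be exactly 1 to 999, a variant of the comprehension can iterate over the keys that both dictionaries share. This is only a sketch; the three-entry dictionaries below stand in for the full ones from the question.

```python
# Shortened stand-ins for the dictionaries in the question.
dict1 = {1: 10, 2: 23, 999: 12}
dict2 = {1: 42, 2: 90, 999: 78}

# dict1.keys() & dict2.keys() is the set of keys present in both dictionaries,
# so a key missing from one of them cannot raise a KeyError.
products = {key: dict1[key] * dict2[key] for key in dict1.keys() & dict2.keys()}
print(products[1])  # 10 * 42
```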

How to query values in a dictionary of a dictionary in python?

I have a dictionary in Python whose values are dictionaries.
{'METTS MARK': {'salary': 365788, 'po': 1}, 'HARRY POTTER':{'salary': 3233233, 'po': 0}}
How do I count the number of records with 'po' = 1?
I tried this:
sum = 0
for key, values in DIC:
    if values[po] == 1:
        sum = sum + 1
But it raises: too many values to unpack
Thanks in advance
You can simply use sum and sum over the condition:
total = sum(values.get('po') == 1 for values in DIC.values())
which is equivalent to (as @VPfB says):
total = sum(1 for item in DIC.values() if item.get('po') == 1)
but the latter is a bit less efficient.
You should also use 'po' instead of po, since it is a string, and you had better use .get('po'), since it also works when 'po' is not present in every dictionary.
I think you forgot to use .items() in your for loop. By iterating over the dictionary itself, you iterate over its keys, and in this case you cannot unpack the keys into tuples with two elements.
Nevertheless, the generator version is more memory efficient (and probably faster as well), and it is cleaner code. Furthermore, iterating directly over .values() instead of .items() should improve performance, because it avoids packing and unpacking tuples.
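Both fixes can be checked side by side in a short sketch. It uses the data from the question, renames the counter to count so that the built-in sum is not shadowed, and shows the corrected loop next to the generator version.

```python
DIC = {
    'METTS MARK': {'salary': 365788, 'po': 1},
    'HARRY POTTER': {'salary': 3233233, 'po': 0},
}

# Corrected loop: .items() yields (key, value) pairs, and 'po' is a string key.
count = 0
for name, record in DIC.items():
    if record.get('po') == 1:
        count += 1

# Generator version: True counts as 1 and False as 0 when summed.
total = sum(record.get('po') == 1 for record in DIC.values())

print(count, total)
```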
You can get it like this:
a = {
'METTS MARK': {'salary': 365788, 'po': 1},
'HARRY POTTER': {'salary': 3233233, 'po': 0}
}
print(len([b for b in a.values() if b.get('po')==1]))
Output:
1
Here we create a list of the inner dictionaries whose 'po' value equals 1, and then take the length of that list.

How can I mimic a space efficient version of dict in Python?

I am working on a dictionary structure where I have a dictionary of documents and each document has a dictionary of words (where each key is word_id (integer) and values are counts) such that:
document_dict = { "doc1": {1:2, 2:10, 10:2, 100: 1}, "doc2": {10:2, 20:10, 30:2, 41: 19},...}
Note that the inner dictionaries are pretty sparse, so even though I have 250K words, I don't expect to have more than 1K keys per document.
In each iteration, I need to sum up a dict of words:counts to one of the documents, e.g. I need to union a new dict of {1:2, 2:10, 10:2, 120: 1} to "doc1": {1:2, 2:10, 10:2, 100: 1}.
Right now, my implementation runs quite fast, however after 2 hours it runs out of memory (I am using a 40GB server).
The way I was summing up the keys was something like this:
Assume that new_dict is the new word:count pairs that I want to add to doc1 such as:
new_dict = {1:2, 2:10, 10:2, 120: 1}
doc1 = {1:2, 2:10, 10:2, 100: 1}
for item in new_dict:
    doc1[item] = doc1.get(item, 0) + new_dict[item]
Since it was simply impossible to run the code with dictionaries, because my dicts get quite large in a very short time, I tried to implement each dictionary as a list of 2 lists: e.g. doc1 = [[],[]], where the first list keeps the keys and the second list keeps the values.
Now when I want to union 2 structures like this, I first try to get the index of each item of new_dict in doc1. If I successfully obtain an index, it means the key is already in doc1, so I can just update the corresponding value. Otherwise, it is not in doc1 yet, so I append() the new key and value to the end of the lists. However, this approach runs extremely slowly (in the dict version, I was able to process up to 600K documents in 2 hours; now I could only process 250K documents in 15 hours).
So my question is: If I want to use a dictionary structure (key, val) pairs where I need to union keys of 2 dicts and sum their values in each iteration, is there a way to implement this more space efficiently?
It's not necessarily more space efficient, but I would suggest switching to a disk-based dictionary by using the shelve module so you don't have to have the entire dictionary in memory at once.
They're very easy to use since they support the familiar dictionary interface, as shown below:
import shelve
document_dict = shelve.open('document_dict', writeback=True)
document_dict.update({"doc1": {1:2, 2:10, 10:2, 100: 1},
                      "doc2": {10:2, 20:10, 30:2, 41: 19},
                      "doc3": {1:2, 2:10, 10:2, 100: 1},})
new_dict = {1:2, 2:10, 10:2, 120: 1}
doc = document_dict.get("doc3", {}) # get current value, if any
for item in new_dict:
    doc[item] = doc.get(item, 0) + new_dict[item] # update version in memory
document_dict["doc3"] = doc # write modified (or new) entry to disk
document_dict.sync() # write cached entries to disk and empty the cache
print(dict(document_dict)) # copy into a plain dict so the contents are printed
document_dict.close()
Output (key order may vary):
{'doc2': {41: 19, 10: 2, 20: 10, 30: 2},
'doc3': {120: 1, 1: 4, 2: 20, 100: 1, 10: 4},
'doc1': {1: 2, 2: 10, 100: 1, 10: 2}}
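As a variation on the same idea, the sketch below opens the shelf without writeback, so no cache builds up in memory, and lets collections.Counter do the summing. The helper name add_counts and the temporary directory are assumptions made for illustration only; a real run would use a persistent path.

```python
import os
import shelve
import tempfile
from collections import Counter

def add_counts(db, doc_id, new_counts):
    """Add new_counts to the counts stored under doc_id and write them back."""
    merged = Counter(db.get(doc_id, {}))
    merged.update(new_counts)  # Counter.update adds counts instead of replacing them
    db[doc_id] = dict(merged)  # reassigning the key stores the new value on disk

# A temporary directory keeps the example self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "document_dict")
    with shelve.open(path) as db:
        db["doc1"] = {1: 2, 2: 10, 10: 2, 100: 1}
        add_counts(db, "doc1", {1: 2, 2: 10, 10: 2, 120: 1})
        doc1 = db["doc1"]

print(doc1)
```

Only the document being updated is loaded into memory at any time, at the cost of one disk read and one disk write per update.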
