Matching from dictionaries based on sum of multiple values - python

I have two dictionaries as follows. One for services and the other for suppliers who can do the service. Each service can be provided by multiple suppliers.
service = {'service1': {'serviceId': 's0001', 'cost': 220},
           'service2': {'serviceId': 's0002', 'cost': 130}....}
supplier = {'supplier1': {'supplierId': 'sup1', 'bid': 30},
            'supplier2': {'supplierId': 'sup2', 'bid': 12},
            'supplier3': {'supplierId': 'sup3', 'bid': 30}....}
I want to build a new dictionary matching services to suppliers, based on the sum of multiple bids being greater than or equal to the cost of the service. Something like:
matched = {'service1': [sup1, sup2, sup100],
           'service2': [sup20, sup64, sup200, sup224]....}
Assuming we have a huge number of entries in both dictionaries, what is a good way to do this matching? There are no restrictions on the number of suppliers that can provide a single service.
I tried the following, but it did not work.
match = {}
for key, value in service.items():
    if service[key]['cost'] >= supplier[key]['bid']:
        match[key] = [sup for sup in supplier[key]['supplierID']]
Here is the expected output:
matched = {'service1': [sup1, sup2, sup100], 'service2': [sup20, sup64, sup200, sup224]....}

I assume that we have a huge number of entries in both dictionaries. This is how I would approach the problem:
import numpy as np
# data
service = {'service1': {'serviceId': 's0001', 'cost': 12},
           'service2': {'serviceId': 's0002', 'cost': 30}}
supplier = {'supplier1': {'supplierId': 'sup1', 'bid': 30},
            'supplier2': {'supplierId': 'sup2', 'bid': 12},
            'supplier3': {'supplierId': 'sup3', 'bid': 30}}
# lists of suppliers' IDs and bids
sups, bids = list(), list()
for key, info in supplier.items():
    sups.append(info['supplierId'])
    bids.append(info['bid'])
# sorted lists of suppliers' IDs and bids to allow binary search
bids, sups = zip(*sorted(zip(bids, sups)))
# main loop
matched = dict()
for key, info in service.items():
    matched[key] = sups[:np.searchsorted(bids, info['cost'], side='right')]
matched:
{'service1': ('sup2',), 'service2': ('sup2', 'sup1', 'sup3')}
This code does not implement handling of new entries, but it can easily be extended to do so (see the sketch below): for every new service record we only need a single binary search, and for every new supplier record a single binary search plus a loop over service to update matched.
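As an illustration (not part of the code above), a rough sketch of that incremental handling could look like the following; the helper names add_service and add_supplier are placeholders, and bids/sups are assumed to be converted to plain lists first:
import bisect

bids, sups = list(bids), list(sups)  # zip(*sorted(...)) above produced tuples

def add_service(matched, key, cost):
    # one binary search: every supplier whose bid does not exceed the cost qualifies
    matched[key] = sups[:bisect.bisect_right(bids, cost)]

def add_supplier(matched, service, sup_id, bid):
    # one binary search keeps bids/sups sorted by bid
    pos = bisect.bisect_right(bids, bid)
    bids.insert(pos, bid)
    sups.insert(pos, sup_id)
    # one loop over service refreshes only the affected matches
    for key, info in service.items():
        if bid <= info['cost']:
            matched[key] = sups[:bisect.bisect_right(bids, info['cost'])]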
The code can and should be adapted depending on your specific requirements and on the way you store the data.

Fastest way to match 2 lists of dicts on a key value

I have a script that retrieves user data from a CSV (~2.5m) and record data from Salesforce via API (~2m) and matches them based on a unique user_id.
For each user, I need the relevant record_id (if it exists). There is a one-to-one relationship between users and records, so a user_id should only appear on one record.
To try to increase performance, both lists are sorted ascending by user_id, and I break the inner loop if record['user_id'] > user['user_id'], since that means there is no relevant record.
It works, but matching the two datasets is slow, taking ~1.5 hours. Is there a faster method of performing the matching to retrieve the relevant record_id?
Here is an example of the data, current function, and expected result:
users = [
    {"user_id": 11111, "name": "Customer A", "age": 34, 'record_id': None},
    {"user_id": 22222, "name": "Customer B", "age": 18, 'record_id': None},
    {"user_id": 33333, "name": "Customer C", "age": 66, 'record_id': None}
]
records = [
    {"user_id": 11111, "record_id": "ABC123"},
    {"user_id": 33333, "record_id": "GHI789"}
]
upload = []
for user in users:
    for record in records:
        if user['user_id'] == record['user_id']:
            user['record_id'] = record['record_id']
            records.remove(record)
            break
        elif record['user_id'] > user['user_id']:
            break
    if user['record_id']:
        upload.append(user)
print(upload)
This outputs:
[
    {'user_id': 11111, 'name': 'Customer A', 'age': 34, 'record_id': 'ABC123'},
    {'user_id': 33333, 'name': 'Customer C', 'age': 66, 'record_id': 'GHI789'}
]
Create a dictionary that maps from a user's id to its corresponding dictionary. Then, you can add the relevant record_id fields using a for loop. Finally, you can remove the entries without an assigned record_id using a list comprehension.
This doesn't require any preprocessing (e.g. sorting) to obtain speedup; the efficiency gain comes from the fact that lookups in a large dictionary are faster than searching a large list:
user_id_mapping = {entry["user_id"]: entry for entry in users}

for record in records:
    if record["user_id"] in user_id_mapping:
        user_id_mapping[record["user_id"]]["record_id"] = record["record_id"]

result = [item for item in user_id_mapping.values() if item["record_id"] is not None]
print(result)
This outputs:
[
    {'user_id': 11111, 'name': 'Customer A', 'age': 34, 'record_id': 'ABC123'},
    {'user_id': 33333, 'name': 'Customer C', 'age': 66, 'record_id': 'GHI789'}
]
That said, if you have to execute similar flavors of this operation repeatedly, I would recommend using some sort of database rather than performing this in Python.
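As a rough sketch of that database route (my own illustration, using the standard library's sqlite3 with an in-memory database and the example users/records lists from the question):
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.execute("CREATE TABLE records (user_id INTEGER PRIMARY KEY, record_id TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(u['user_id'], u['name'], u['age']) for u in users])
conn.executemany("INSERT INTO records VALUES (?, ?)",
                 [(r['user_id'], r['record_id']) for r in records])
# Let SQL do the join on user_id
rows = conn.execute("SELECT u.user_id, u.name, u.age, r.record_id "
                    "FROM users u JOIN records r ON r.user_id = u.user_id").fetchall()
upload = [dict(zip(('user_id', 'name', 'age', 'record_id'), row)) for row in rows]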
You could use pandas.read_csv() to read your CSV data into a dataframe, and then merge that with the records on the user_id value:
import pandas as pd
users = pd.read_csv('csv file')
records = pd.DataFrame('result of salesforce query')
result = users.drop('record_id', axis=1).merge(records, on='user_id')
If you want to keep the users which have no matching value in records, change the merge to
merge(records, on='user_id', how='left')
To output the result as a list of dictionaries, use to_dict():
result.to_dict('records')
Note - it may be possible to execute your Salesforce query directly into a dataframe. See for example this Q&A
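For instance, with the simple_salesforce package, something along these lines should get the Salesforce result straight into a dataframe (the object and field names, Record__c, User_Id__c and Record_Id__c, are placeholders, not taken from the question):
from simple_salesforce import Salesforce
import pandas as pd

sf = Salesforce(username='user@example.com', password='xxx', security_token='xxx')
# query_all returns a dict whose 'records' key holds the matching rows
result = sf.query_all("SELECT User_Id__c, Record_Id__c FROM Record__c")
records = pd.DataFrame(result['records']).drop(columns='attributes')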
For scalability, you can use pandas dataframes, like so:
result = pd.merge(pd.DataFrame(users), pd.DataFrame(records), on='user_id').to_dict('records')
If you want to keep the entries which do not have a record_id, you can add how="left" to the arguments of the merge function.
Your approach isn't unreasonable, but removing a record after it's used has a cost, and sorting your two lists ahead of time also has a cost. These costs may add up more than you think they do.
One possible approach would be to NOT sort the lists, but instead build a dict of record_ids, e.g.:
rdict = {r['user_id']: r['record_id'] for r in records}

upload = []  # as in the question
for user in users:
    user_id = user['user_id']
    record_id = rdict.get(user_id)
    if record_id:
        user['record_id'] = record_id
        upload.append(user)
This way you're paying the price once for building the hash, and everything else is very efficient.

Comparing values of two dictionaries' items

I need to compare the values of the items in two different dictionaries.
Let's say that dictionary RawData has items that represent phone numbers and number names.
RawData, for example, has items like: {'name': 'Customer Service', 'number': '123987546'} {'name': 'Switchboard', 'number': '48621364'}
Now, I've got dictionary FilteredData, which already contains some items from RawData: {'name': 'IT-support', 'number': '32136994'} {'name': 'Company Customer Service', 'number': '123987546'}
As you can see, Customer Service and Company Customer Service both have the same values, but different keys. In my project there might be hundreds of similar duplicates, and we only want unique numbers to end up in FilteredData.
FilteredData is what we will be using later in the code, and RawData will be discarded.
Their names (keys) can be close duplicates, but not their numbers (values).
There are two ways to do this.
A. Remove the duplicate items in RawData before appending them to FilteredData.
B. Append them to FilteredData and go through the numbers (values) there, removing the duplicates. Can I use a set here to do that? It would work on a list, obviously.
I'm not looking for the most time-efficient solution. I'd like the simplest and easiest-to-learn one, for if and when someone takes over my job someday. In my project it's mandatory for the next guy working on the code to get a quick grip of it.
I've already looked at sets, and tried to tackle the problem by nesting two for loops, but something tells me there's got to be an easier way.
Of course I might have missed the obvious solution here.
Thanks in advance!
I hope I understand your problem here:
data = [{'name': 'Customer Service', 'number': '123987546'},
        {'name': 'Switchboard', 'number': '48621364'}]
newdata = [{'name': 'IT-support', 'number': '32136994'},
           {'name': 'Company Customer Service', 'number': '123987546'}]

def main():
    numbers = set()
    for entry in data:
        numbers.add(entry['number'])
    for entry in newdata:
        if entry['number'] not in numbers:
            data.append(entry)
    print(data)

main()
Output:
[{'name': 'Customer Service', 'number': '123987546'},
{'name': 'Switchboard', 'number': '48621364'},
{'name': 'IT-support', 'number': '32136994'}]
What you can do is take dict.values(), create a set of those to remove duplicates, and then go through the old dictionary, find the first key with that value and add it to a new one. Keep the set around: when you get the next entry, try adding its number to the set and see whether the length of the set is longer than before adding it. If it is, it's a unique element and you can add the entry to the result.
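A minimal sketch of that idea, assuming RawData and FilteredData are lists of the records shown in the question:
RawData = [{'name': 'Customer Service', 'number': '123987546'},
           {'name': 'Switchboard', 'number': '48621364'}]
FilteredData = [{'name': 'IT-support', 'number': '32136994'},
                {'name': 'Company Customer Service', 'number': '123987546'}]

seen = {entry['number'] for entry in FilteredData}  # numbers already kept
for entry in RawData:
    before = len(seen)
    seen.add(entry['number'])
    if len(seen) > before:  # the set grew, so this number is new
        FilteredData.append(entry)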
If you're willing to change how FilteredData is currently structured, you can just use a dict with the number as your key:
RawData = [
    {'name': 'Customer Service', 'number': '123987546'},
    {'name': 'Switchboard', 'number': '48621364'}
]

# Change how FilteredData is structured
FilteredDataMap = {
    '32136994': {'name': 'IT-support', 'number': '32136994'},
    '123987546': {'name': 'Company Customer Service', 'number': '123987546'}
}

for item in RawData:
    number = item.get('number')
    if number not in FilteredDataMap:
        FilteredDataMap[number] = item

# If you need the list of items
FilteredData = list(FilteredDataMap.values())
You can just pull the actual list from the Map using .values()
I take it the numbers are unique. Then another solution would be to take advantage of the uniqueness of dictionary keys. This means converting each list of dictionaries to a dictionary of number:name pairs. Then you simply need to update RawData with FilteredData.
RawData = [
    {'name': 'Customer Service', 'number': '123987546'},
    {'name': 'Switchboard', 'number': '48621364'}
]
FilteredData = [
    {'name': 'IT-support', 'number': '32136994'},
    {'name': 'Company Customer Service', 'number': '123987546'}
]

def convert_list(input_list):
    return {item['number']: item['name'] for item in input_list}

def unconvert_dict(input_dict):
    return [{'name': val, 'number': key} for key, val in input_dict.items()]

NewRawData = convert_list(RawData)
NewFilteredData = convert_list(FilteredData)

NewRawData.update(NewFilteredData)  # update() modifies NewRawData in place and returns None
DesiredResultConverted = NewRawData
DesiredResult = unconvert_dict(DesiredResultConverted)
In this example, the variables will have the following values:
NewRawData = {'123987546':'Customer Service', '48621364': 'Switchboard'}
NewFilteredData = {'32136994': 'IT-support', '123987546': 'Company Customer Service'}
When you update NewRawData with NewFilteredData, Company Customer Service will overwrite Customer Service as the value associated with the key 123987546. So,
DesiredResultConverted = {'123987546':'Company Customer Service', '48621364': 'Switchboard', '32136994': 'IT-support'}
Then, if you still prefer the original format, you can "unconvert" back.
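Unconverting the merged dictionary above gives back a list of records (the order may vary, since it comes from a dict):
DesiredResult = [{'name': 'Company Customer Service', 'number': '123987546'},
                 {'name': 'Switchboard', 'number': '48621364'},
                 {'name': 'IT-support', 'number': '32136994'}]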

Filtering through a list with embedded dictionaries

I've got a JSON-format list with some dictionaries within it; it looks like the following:
[{"id":13, "name":"Albert", "venue":{"id":123, "town":"Birmingham"}, "month":"February"},
{"id":17, "name":"Alfred", "venue":{"id":456, "town":"London"}, "month":"February"},
{"id":20, "name":"David", "venue":{"id":14, "town":"Southampton"}, "month":"June"},
{"id":17, "name":"Mary", "venue":{"id":56, "town":"London"}, "month":"December"}]
The number of entries within the list can be up to 100. I plan to present the 'name' for each entry, one result at a time, for those that have London as the town. The rest are of no use to me. I'm a beginner at Python, so I would appreciate a suggestion on how to go about this efficiently. I initially thought it would be best to remove all entries that don't have London and then go through the remainder one by one.
I also wondered if it might be quicker not to filter, but to cycle through the entire JSON and select the names of entries whose town is London.
You can use filter:
data = [{"id":13, "name":"Albert", "venue":{"id":123, "town":"Birmingham"}, "month":"February"},
{"id":17, "name":"Alfred", "venue":{"id":456, "town":"London"}, "month":"February"},
{"id":20, "name":"David", "venue":{"id":14, "town":"Southampton"}, "month":"June"},
{"id":17, "name":"Mary", "venue":{"id":56, "town":"London"}, "month":"December"}]
london_dicts = filter(lambda d: d['venue']['town'] == 'London', data)
for d in london_dicts:
print(d)
This is as efficient as it can get because:
The loop is written in C (in the case of CPython)
filter returns an iterator (in Python 3), which means that the results are loaded into memory one by one, as required
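Since the question asks for the names one result at a time, a small variation on the same idea (using the same data list) is a generator expression over just the names:
london_names = (d['name'] for d in data if d['venue']['town'] == 'London')
print(next(london_names))  # Alfred
print(next(london_names))  # Mary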
One way is to use list comprehension:
>>> data = [{"id":13, "name":"Albert", "venue":{"id":123, "town":"Birmingham"}, "month":"February"},
{"id":17, "name":"Alfred", "venue":{"id":456, "town":"London"}, "month":"February"},
{"id":20, "name":"David", "venue":{"id":14, "town":"Southampton"}, "month":"June"},
{"id":17, "name":"Mary", "venue":{"id":56, "town":"London"}, "month":"December"}]
>>> [d for d in data if d['venue']['town'] == 'London']
[{'id': 17,
'name': 'Alfred',
'venue': {'id': 456, 'town': 'London'},
'month': 'February'},
{'id': 17,
'name': 'Mary',
'venue': {'id': 56, 'town': 'London'},
'month': 'December'}]

Filtering Items in dictionary

I would like to filter items out of one dictionary where that dictionary contains items of another dictionary. So, say that I have two dictionaries, dict1 and dict2, where
dict1 = {
    1: {'account_id': 1234, 'case': 1234, 'date': '12/31/15', 'content': 'some content'},
    2: {'account_id': 1235, 'case': 1235, 'date': '12/15/15', 'content': 'some content'}
}
dict2 = {
    1: {'account_id': 1234, 'case': 1234, 'date': '12/31/15', 'content': 'some different content'},
    2: {'account_id': 4321, 'case': 4321, 'date': '6/12/15', 'content': 'some different content'},
    3: {'account_id': 1235, 'case': 1235, 'date': '12/15/15', 'content': 'some different content'}
}
I would like to match on account_id, case and date and have the output be a third dictionary with matched entries from dict2 being 1 and 3.
out = {
    1: {'account_id': 1234, 'case': 1234, 'date': '12/31/15', 'content': 'some different content'},
    2: {'account_id': 1235, 'case': 1235, 'date': '12/15/15', 'content': 'some different content'}
}
How would I accomplish this? I am using Python 3.5
Well then, I believe this is what you are looking for:
from itertools import count
from operator import itemgetter
# Set the criteria for unique entry (prevents us from needing to write this twice)
get_identifier = itemgetter("account_id","case","date")
# Create a set of all unique items.
unique_entries = set(map(get_identifier, dict1.values()))
# Get all entries that match one of the unique entries
matched_entries = (d for d in dict2.values() if get_identifier(d) in unique_entries)
# Recreate a new dict together with a counter for items.
out = dict(zip(count(1), matched_entries))
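For instance, get_identifier just pulls the matching fields out of a record as a tuple:
get_identifier({'account_id': 1234, 'case': 1234, 'date': '12/31/15', 'content': 'some content'})
# -> (1234, 1234, '12/31/15')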
For more info about count() and itemgetter(), see their respective docs.
Using a set and a generator expression keeps this efficient: membership tests against the set are constant-time, and dict2 is only traversed once.

Python: How to store multiple values for one dictionary key

I want to store a list of ingredients in a dictionary, using the ingredient name as the key and storing both the ingredient quantity and measurement as values. For example, I would like to be able to do something like this:
ingredientList = {'flour' 500 grams, 'tomato' 10, 'mozzarella' 250 grams}
With the 'tomato' key, tomatoes do not have a measurement, only a quantity. Would I be able to achieve this in Python? Is there an alternate or more efficient way of going about this?
If you want lists just use lists:
ingredientList = {'flour': [500, "grams"], 'tomato': [10], 'mozzarella': [250, "grams"]}
To get the items:
weight, meas = ingredientList['flour']
print(weight, meas)
500 grams
If you want to update an entry, just use ingredientList[key].append(item).
You could use another dict.
ingredientList = {
    'flour': {'quantity': 500, 'measurement': 'grams'},
    'tomato': {'quantity': 10},
    'mozzarella': {'quantity': 250, 'measurement': 'grams'}
}
Then you could access them like this:
>>> print(ingredientList['mozzarella']['quantity'])
250
