I would like to filter one dictionary down to the entries that match entries of another dictionary. So, say that I have two dictionaries, dict1 and dict2, where
dict1 = {
    1: {'account_id': 1234, 'case': 1234, 'date': '12/31/15', 'content': 'some content'},
    2: {'account_id': 1235, 'case': 1235, 'date': '12/15/15', 'content': 'some content'}
}
dict2 = {
    1: {'account_id': 1234, 'case': 1234, 'date': '12/31/15', 'content': 'some different content'},
    2: {'account_id': 4321, 'case': 4321, 'date': '6/12/15', 'content': 'some different content'},
    3: {'account_id': 1235, 'case': 1235, 'date': '12/15/15', 'content': 'some different content'}
}
I would like to match on account_id, case, and date, and have the output be a third dictionary holding the matched entries from dict2 (entries 1 and 3 above), renumbered from 1:
out = {
    1: {'account_id': 1234, 'case': 1234, 'date': '12/31/15', 'content': 'some different content'},
    2: {'account_id': 1235, 'case': 1235, 'date': '12/15/15', 'content': 'some different content'}
}
How would I accomplish this? I am using Python 3.5
Well then, I believe this is what you are looking for:
from itertools import count
from operator import itemgetter
# Set the criteria for unique entry (prevents us from needing to write this twice)
get_identifier = itemgetter("account_id","case","date")
# Build a set of the identifying tuples found in dict1.
unique_entries = set(map(get_identifier, dict1.values()))
# Keep the entries of dict2 whose identifying tuple appears in that set.
matched_entries = (d for d in dict2.values() if get_identifier(d) in unique_entries)
# Build a new dict, renumbering the matched entries from 1.
out = dict(zip(count(1), matched_entries))
For more info about count() and itemgetter(), see their respective docs.
Using a set and a generator expression keeps this efficient: each membership test is a constant-time set lookup, and no intermediate list is built.
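As a quick sanity check, get_identifier returns the tuple used for matching, and out holds the renumbered matches (a sketch using the sample data above, with the dates written as strings so they are hashable literals; note that dict iteration order is arbitrary on Python 3.5, so the renumbering in out may differ):

print(get_identifier(dict1[1]))
# (1234, 1234, '12/31/15')
print(out)
# {1: {'account_id': 1234, 'case': 1234, 'date': '12/31/15', 'content': 'some different content'},
#  2: {'account_id': 1235, 'case': 1235, 'date': '12/15/15', 'content': 'some different content'}}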
For a JSON file, json.load converts the data into a list of dictionaries.
Sample rows from the list of dictionaries are below:
config_items_list = [
    {'hostname': '"abc2164"', 'Status': '"InUse"', 'source': '"excel"', 'port': '"[445]"', 'tech': '"Others"', 'ID': '"123456"'},
    {'hostname': '"xyz2164"', 'Status': '"InUse"', 'source': '"web"', 'port': '"[123]"', 'tech': '"Others"', 'ID': '"456789"'},
    {'hostname': '"pqr2164"', 'Status': '"NotInUse"', 'source': '"web"', 'port': '"[777]"', 'tech': '"Others"', 'ID': '"123456"'}
]
The requirement is to parse this list of dictionaries and extract all rows that have a specific value in the ID key. For example, 123456 (which matches two rows in the sample above).
Try this solution; it should solve your problem:
def filter_json_with_id(config_items_list, target_id):
    # Collect every item whose 'ID' field equals the requested id.
    items = []
    for item in config_items_list:
        if item['ID'] == target_id:
            items.append(item)
    return items
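For the sample list above, calling it with the quoted id returns both matching rows (note that the IDs in the data carry literal double quotes inside the strings):

print(filter_json_with_id(config_items_list, '"123456"'))
# [{'hostname': '"abc2164"', ...}, {'hostname': '"pqr2164"', ...}]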
select_id = '"123456"'
[d for d in config_items_list if d["ID"] == select_id]
will produce a list of the dictionaries in config_items_list whose "ID" equals the given select_id.
The solution below uses a substring check instead, so it finds the id value even when it is surrounded by symbols or other numbers/characters:
expectedResult = [d for d in config_items_list if "123456" in d['ID']]
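Note that the sample values carry literal double quotes inside the strings (e.g. '"123456"'). If you would rather compare plain ids, one option is to strip those quotes first; a small sketch, assuming every value is a string:

# Strip the embedded double quotes from every value, then filter on the bare id.
cleaned = [{k: v.strip('"') for k, v in d.items()} for d in config_items_list]
expectedResult = [d for d in cleaned if d['ID'] == '123456']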
I have a script that retrieves user data from a CSV (~2.5m rows) and record data from Salesforce via API (~2m rows), and matches them based on a unique user_id.
For each user, I need the relevant record_id (if it exists). There is a one-to-one relationship with users and records, so the user_id should only appear on 1 record.
To try to increase performance, both lists are sorted ascending by user_id, and I break the inner loop if record['user_id'] > user['user_id'], as that means there is no relevant record.
It works, but it's slow: matching the two datasets takes ~1.5 hrs. Is there a faster method of performing the matching to retrieve the relevant record_id?
Here is an example of the data, current function, and expected result:
users = [
    {"user_id": 11111, "name": "Customer A", "age": 34, 'record_id': None},
    {"user_id": 22222, "name": "Customer B", "age": 18, 'record_id': None},
    {"user_id": 33333, "name": "Customer C", "age": 66, 'record_id': None}
]
records = [
    {"user_id": 11111, "record_id": "ABC123"},
    {"user_id": 33333, "record_id": "GHI789"}
]
upload = []
for user in users:
    for record in records:
        if user['user_id'] == record['user_id']:
            user['record_id'] = record['record_id']
            records.remove(record)
            break
        elif record['user_id'] > user['user_id']:
            break
    if user['record_id']:
        upload.append(user)
print(upload)
This outputs:
[
    {'user_id': 11111, 'name': 'Customer A', 'age': 34, 'record_id': 'ABC123'},
    {'user_id': 33333, 'name': 'Customer C', 'age': 66, 'record_id': 'GHI789'}
]
Create a dictionary that maps from a user's id to its corresponding dictionary. Then, you can add the relevant record_id fields using a for loop. Finally, you can remove the entries without an assigned record_id using a list comprehension.
This doesn't require any preprocessing (e.g. sorting) to obtain a speedup; the efficiency gain comes from the fact that a lookup in a dictionary takes roughly constant time, while searching through a large list takes time proportional to its length:
user_id_mapping = {entry["user_id"]: entry for entry in users}

for record in records:
    if record["user_id"] in user_id_mapping:
        user_id_mapping[record["user_id"]]["record_id"] = record["record_id"]

result = [item for item in user_id_mapping.values() if item["record_id"] is not None]
print(result)
This outputs:
[
    {'user_id': 11111, 'name': 'Customer A', 'age': 34, 'record_id': 'ABC123'},
    {'user_id': 33333, 'name': 'Customer C', 'age': 66, 'record_id': 'GHI789'}
]
With this being said, if you have to execute similar flavors of this operation repeatedly, I would recommend using some sort of a database rather than performing this in Python.
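For instance, even the standard library's sqlite3 can push the join into a database engine. Here is a minimal sketch reusing the sample users and records lists above (the schema and table names are illustrative):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
con.execute("CREATE TABLE records (user_id INTEGER PRIMARY KEY, record_id TEXT)")
con.executemany("INSERT INTO users VALUES (?, ?, ?)",
                [(u["user_id"], u["name"], u["age"]) for u in users])
con.executemany("INSERT INTO records VALUES (?, ?)",
                [(r["user_id"], r["record_id"]) for r in records])
# An inner join keeps only the users that have a matching record.
rows = con.execute(
    "SELECT u.user_id, u.name, u.age, r.record_id "
    "FROM users AS u JOIN records AS r USING (user_id)"
).fetchall()
print(rows)  # [(11111, 'Customer A', 34, 'ABC123'), (33333, 'Customer C', 66, 'GHI789')]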
You could use pandas.read_csv() to read your CSV data into a dataframe, and then merge that with the records on the user_id value:
import pandas as pd

users = pd.read_csv('csv file')
# salesforce_results is a stand-in name for the list of dicts your Salesforce query returned
records = pd.DataFrame(salesforce_results)
result = users.drop('record_id', axis=1).merge(records, on='user_id')
If you want to keep the users which have no matching value in records, change the merge to
merge(records, on='user_id', how='left')
To output the result as a list of dictionaries, use to_dict():
result.to_dict('records')
Note - it may be possible to execute your Salesforce query directly into a dataframe. See for example this Q&A
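For the sample data above, the whole pipeline looks like this (building the users dataframe in memory stands in for the pd.read_csv call):

import pandas as pd

users_df = pd.DataFrame(users)      # stands in for pd.read_csv('csv file')
records_df = pd.DataFrame(records)
result = users_df.drop('record_id', axis=1).merge(records_df, on='user_id')
print(result.to_dict('records'))
# [{'user_id': 11111, 'name': 'Customer A', 'age': 34, 'record_id': 'ABC123'},
#  {'user_id': 33333, 'name': 'Customer C', 'age': 66, 'record_id': 'GHI789'}]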
For scalability, you can use pandas dataframes, like so:
result = pd.merge(pd.DataFrame(users), pd.DataFrame(records), on='user_id').to_dict('records')
If you want to keep the entries which do not have a record_id, you can add how="left" to the arguments of the merge function.
Your approach isn't unreasonable. But removing each record after it's used has a cost, and sorting your two lists ahead of time does too. These costs may add up more than you think.
One possible approach would be to NOT sort the lists, but instead build a dict of record_ids, e.g.:
rdict = {r['user_id']: r['record_id'] for r in records}

upload = []
for user in users:
    user_id = user['user_id']
    record_id = rdict.get(user_id)
    if record_id:
        user['record_id'] = record_id
        upload.append(user)
This way you pay the price of building the dict once, and every lookup after that is very efficient.
I need to compare the values of the items in two different dictionaries.
Let's say that dictionary RawData has items that represent phone numbers and number names.
RawData, for example, has items like: {'name': 'Customer Service', 'number': '123987546'} and {'name': 'Switchboard', 'number': '48621364'}.
Now, I got dictionary FilteredData, which already contains some items from RawData: {'name': 'IT-support', 'number': '32136994'} {'name': 'Company Customer Service', 'number': '123987546'}
As you can see, Customer Service and Company Customer Service both have the same values, but different keys. In my project, there might be hundreds of similar duplicates, and we only want unique numbers to end up in FilteredData.
FilteredData is what we will be using later in the code, and RawData will be discarded.
Their names (keys) can be close duplicates, but not their numbers (values).
There are two ways to do this.
A. Remove the duplicate items in RawData before appending them into FilteredData.
B. Append them into FilteredData, then go through the numbers (values) there and remove the duplicates. Could I use a set here to do that? It would obviously work on a list.
I'm not looking for the most time-efficient solution; I'd like the most simple and easy-to-learn one, for if and when someone takes over my job someday. In my project it's mandatory for the next person working on the code to get a quick grip of it.
I've already looked at sets and tried to tackle the problem by nesting two for loops, but something tells me there has to be an easier way.
Of course I might have missed the obvious solution here.
Thanks in advance!
I hope I understand your problem here:
data = [{'name': 'Customer Service', 'number': '123987546'},
        {'name': 'Switchboard', 'number': '48621364'}]

newdata = [{'name': 'IT-support', 'number': '32136994'},
           {'name': 'Company Customer Service', 'number': '123987546'}]

def main():
    # Collect the numbers already present in data.
    numbers = set()
    for entry in data:
        numbers.add(entry['number'])
    # Append only the entries of newdata whose number is new.
    for entry in newdata:
        if entry['number'] not in numbers:
            data.append(entry)
    print(data)

main()
Output:
[{'name': 'Customer Service', 'number': '123987546'},
{'name': 'Switchboard', 'number': '48621364'},
{'name': 'IT-support', 'number': '32136994'}]
What you can do is take the dict.values(), build a set of those to remove duplicates, then go through the old dictionary, find the first key with each value, and add it to a new one. Keep the set around: when you get the next entry, try adding its number to the set and see whether the set is longer than before. If it is, the number is unique and you can keep the entry. A minimal sketch of that idea follows.
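Here is a small sketch of that set-growth trick, written against the list-of-dicts shape used elsewhere in this question (the variable names are illustrative):

raw_data = [{'name': 'Customer Service', 'number': '123987546'},
            {'name': 'Switchboard', 'number': '48621364'}]
filtered_data = [{'name': 'IT-support', 'number': '32136994'},
                 {'name': 'Company Customer Service', 'number': '123987546'}]

seen = {entry['number'] for entry in filtered_data}
for entry in raw_data:
    before = len(seen)
    seen.add(entry['number'])
    if len(seen) > before:  # the set grew, so this number was not seen before
        filtered_data.append(entry)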
If you're willing to change how FilteredData is currently structured, you can just use a dict and use the number as your key:
RawData = [
    {'name': 'Customer Service', 'number': '123987546'},
    {'name': 'Switchboard', 'number': '48621364'}
]

# Change how FilteredData is structured: key each entry by its number.
FilteredDataMap = {
    '32136994': {'name': 'IT-support', 'number': '32136994'},
    '123987546': {'name': 'Company Customer Service', 'number': '123987546'}
}

for item in RawData:
    number = item.get('number')
    if number not in FilteredDataMap:
        FilteredDataMap[number] = item
# If you need the list of items
FilteredData = list(FilteredDataMap.values())
You can just pull the actual list from the Map using .values()
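With the sample data above, the raw 'Customer Service' entry is skipped because its number is already mapped; the resulting list looks like this (dict ordering may vary on older Python versions):

print(FilteredData)
# [{'name': 'IT-support', 'number': '32136994'},
#  {'name': 'Company Customer Service', 'number': '123987546'},
#  {'name': 'Switchboard', 'number': '48621364'}]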
I take it the numbers are unique. Then another solution would be to take advantage of the uniqueness of dictionary keys: convert each list of dictionaries to a dictionary of number:name pairs. Then you simply need to update RawData with FilteredData.
RawData = [
    {'name': 'Customer Service', 'number': '123987546'},
    {'name': 'Switchboard', 'number': '48621364'}
]
FilteredData = [
    {'name': 'IT-support', 'number': '32136994'},
    {'name': 'Company Customer Service', 'number': '123987546'}
]
def convert_list(input_list):
    return {item['number']: item['name'] for item in input_list}

def unconvert_dict(input_dict):
    return [{'name': val, 'number': key} for key, val in input_dict.items()]

NewRawData = convert_list(RawData)
NewFilteredData = convert_list(FilteredData)
NewRawData.update(NewFilteredData)  # update() modifies NewRawData in place and returns None
DesiredResultConverted = NewRawData
DesiredResult = unconvert_dict(DesiredResultConverted)
In this example, the variables will have the following values:
NewRawData = {'123987546':'Customer Service', '48621364': 'Switchboard'}
NewFilteredData = {'32136994': 'IT-support', '123987546': 'Company Customer Service'}
When you update NewRawData with NewFilteredData, Company Customer Service will overwrite Customer Service as the value associated with the key 123987546. So,
DesiredResultConverted = {'123987546':'Company Customer Service', '48621364': 'Switchboard', '32136994': 'IT-support'}
Then, if you still prefer the original format, you can "unconvert" back.
I have a collection with heavily nested docs in MongoDB that I want to flatten and import into Pandas. There are some nested dicts, but also a list of dicts that I want to transform into columns (see the examples below for details).
I already have a function that works for smaller batches of documents. But the solution (I found it in the answer to this question) uses json, and the json.loads operation fails with a MemoryError on bigger selections from the collection.
I tried many solutions suggesting other json parsers (e.g. ijson), but for different reasons none of them solved my problem. The only way left, if I want to keep the transformation via json, would be to chunk bigger selections into smaller groups of documents and iterate the parsing.
At this point I thought (and that is my main question here): maybe there is a smarter way to do the unnesting without the detour through json, directly in MongoDB or in Pandas, or somehow combined?
This is a shortened example Doc:
{
    '_id': ObjectId('5b40fcc4affb061b8871cbc5'),
    'eventId': 2,
    'sId': 6833,
    'stage': {
        'value': 1,
        'Name': 'FirstStage'
    },
    'quality': [
        {
            'type': {'value': 2, 'Name': 'Color'},
            'value': '124'
        },
        {
            'type': {'value': 7, 'Name': 'Length'},
            'value': 'Short'
        },
        {
            'type': {'value': 15, 'Name': 'Printed'}
        }
    ]
}
This is what a successful dataframe representation would look like (I skipped the columns '_id' and 'sId' for readability):
   eventId  stage.value  stage.name    q_color  q_length  q_printed
1  2        1            'FirstStage'  124      'Short'   1
My code so far (which runs into memory problems - see above):
import json
import pandas as pd
from bson import json_util
from pandas.io.json import json_normalize  # pd.json_normalize in newer pandas

# 'events' is assumed to be the MongoDB collection handle.
def load_events(filter='sId', id=6833, all=False):
    if all:
        print('Loading all events.')
        cursor = events.find()
    else:
        print('Loading events with %s equal to %s.' % (filter, id))
        print('Filtering...')
        cursor = events.find({filter: id})
    print('Loading...')
    l = list(cursor)
    print('Parsing json...')
    sanitized = json.loads(json_util.dumps(l))
    print('Parsing quality...')
    for ev in sanitized:
        for q in ev['quality']:
            name = 'q_' + str(q['type']['Name'])
            value = q.pop('value', 1)
            ev[name] = value
        ev.pop('quality', None)
    normalized = json_normalize(sanitized)
    df = pd.DataFrame(normalized)
    return df
You don't need to convert the nested structures using json parsers. Just create your dataframe from the record list:
df = pandas.DataFrame(list(cursor))
and afterwards use pandas in order to unpack your lists and dictionaries:
import pandas
import numpy
from itertools import chain

df = pandas.DataFrame(list(cursor))

# Unpack the nested 'stage' dict into flat columns.
df['stage.value'] = df['stage'].apply(lambda cell: cell['value'])
df['stage.name'] = df['stage'].apply(lambda cell: cell['Name'])

# Turn each 'quality' list into (name, value) pairs, defaulting a missing value to 1.
df['q_'] = df['quality'].apply(lambda cell: [(m['type']['Name'], m.get('value', 1)) for m in cell])
df['q_'] = df['q_'].apply(dict)

# Extract all property names once, then give each one its own column.
keys = set(chain(*df['q_'].apply(lambda column: column.keys())))
for key in keys:
    column_name = 'q_{}'.format(key).lower()
    df[column_name] = df['q_'].apply(lambda cell: cell.get(key, numpy.nan))

df.drop(['stage', 'quality', 'q_'], axis=1, inplace=True)
I use three steps to unpack the nested data types. First, the names and values are used to create a flat list of (name, value) pairs. In the second step, a dictionary built from those tuples takes its keys from the first and its values from the second position of each tuple. Then all existing property names are extracted once using a set, and each property gets a new column in a loop, where the value of each pair is mapped to the respective column cell.
What is a good way to get an ordered dictionary from a regular dictionary? I need the keys (and these keys are known ahead of time) to be in a certain order. I will be "dump"ing a list of these dictionaries into a JSON file and need things ordered a certain way.
Edit: added the following.
For instance, I have a dictionary ...
employee = { 'phone': '1234567890', 'department': 'HR', 'country': 'us', 'name': 'Smith' }
When I dump it into JSON format, I would like it to print out as
{ 'name': 'Smith', 'department': 'HR', 'country': 'us', 'phone': '1234567890'}
Build an OrderedDict by pulling the keys out of employee in the order you want:
from collections import OrderedDict
order = ("name","department","country","phone")
employee = { 'phone': '1234567890', 'department': 'HR', 'country': 'us', 'name': 'Smith' }
od = OrderedDict((k, employee[k]) for k in order)
But if you dump to a json file and load it again, the order will not be maintained and you will not get an OrderedDict back. When you dump, it will look like:
{"name": "Smith", "department": "HR", "country": "us", "phone": "1234567890"}
But the loaded result will not be in the same order, because normal dicts have no order, like below:
{'phone': '1234567890', 'name': 'Smith', 'country': 'us', 'department': 'HR'}
If you are trying to just store the dicts to use again and want to maintain order you can pickle:
import pickle

with open("foo.pkl", "wb") as f:
    pickle.dump(od, f)

with open("foo.pkl", "rb") as f:
    d = pickle.load(f)

print(d)
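Alternatively, if you do want to round-trip through JSON and still get the order back on load, json.load accepts an object_pairs_hook that rebuilds an OrderedDict (a small sketch; the file name is illustrative):

import json
from collections import OrderedDict

with open("foo.json", "w") as f:
    json.dump(od, f)  # the dump keeps the OrderedDict's key order

with open("foo.json") as f:
    d = json.load(f, object_pairs_hook=OrderedDict)  # rebuilds an OrderedDict

print(d)
# OrderedDict([('name', 'Smith'), ('department', 'HR'), ('country', 'us'), ('phone', '1234567890')])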
You could do something like the following: collect the keys in order in a list of strings, traverse the list, look each key up in the dictionary, and create an ordered dictionary from the pairs.
from collections import OrderedDict

def makeOrderedDict(dictToOrder, keyOrderList):
    # Build (key, value) tuples in the requested key order.
    tupleList = []
    for key in keyOrderList:
        tupleList.append((key, dictToOrder[key]))
    return OrderedDict(tupleList)
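For example, with the employee dict from the question (json.dumps is used here just to show the resulting order):

import json

employee = {'phone': '1234567890', 'department': 'HR', 'country': 'us', 'name': 'Smith'}
ordered = makeOrderedDict(employee, ['name', 'department', 'country', 'phone'])
print(json.dumps(ordered))
# {"name": "Smith", "department": "HR", "country": "us", "phone": "1234567890"}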