So I have a dict of working jobs, each holding a dict of tags:
{
    "hacker": {"crime": "high"},
    "mugger": {"crime": "high", "morals": "low"},
    "office drone": {"work_drive": "high", "tolerance": "high"},
    "farmer": {"work_drive": "high"},
}
And I have roughly 21,000 more unique jobs to handle.
How would I go about scanning through them faster?
And is there any type of data structure that makes this faster and better to scan through? Such as a lookup table for each of the tags?
I'm using python 3.10.4
NOTE: If it helps, everything is loaded up at the start of runtime and doesn't change during runtime at all
Here's my current code:
test_data = {
    "hacker": {"crime": "high"},
    "mugger": {"crime": "high", "morals": "low"},
    "shop_owner": {"crime": "high", "morals": "high"},
    "office_drone": {"work_drive": "high", "tolerance": "high"},
    "farmer": {"work_drive": "high"},
}
class NULL: pass

class Conditional(object):
    def __init__(self, data):
        self.dataset = data

    def find(self, *target, **tags):
        dataset = self.dataset.items()
        if target:
            dataset = (
                (entry, data) for entry, data in dataset
                if all((t in data) for t in target)
            )
        if tags:
            return [
                entry for entry, data in dataset
                if all(
                    (data.get(tag, NULL) == val) for tag, val in tags.items()
                )
            ]
        else:
            return [data[0] for data in dataset]

jobs = Conditional(test_data)
print(jobs.find(work_drive="high"))
>>> ['office_drone', 'farmer']
print(jobs.find("crime"))
>>> ['hacker', 'mugger', 'shop_owner']
print(jobs.find("crime", "morals"))
>>> ['mugger', 'shop_owner']
print(jobs.find("crime", morals="high"))
>>> ['shop_owner']
And is there any type of data structure that makes this faster and better to scan through?
Yes, and it is called a dict =)
Just turn your dict into two dictionaries: one keyed by tag and another keyed by tag and tag value, both of which contain sets of job names:
from collections import defaultdict
...
by_tag = defaultdict(set)
by_tag_value = defaultdict(lambda: defaultdict(set))

for job, tags in test_data.items():
    for tag, val in tags.items():
        by_tag[tag].add(job)
        by_tag_value[tag][val].add(job)

# example
# to search crime:high and morals
crime_high = by_tag_value["crime"]["high"]
morals = by_tag["morals"]
result = crime_high.intersection(morals)  # {'mugger', 'shop_owner'}
Then, for each query, look up the needed sets and return the jobs that are present in all of them, as sketched below.
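For instance, here's a minimal sketch of a find() helper over these two indexes, mirroring the question's interface (the helper name and the all_jobs fallback are my additions, not part of the original answer):

all_jobs = set(test_data)  # fallback result when no criteria are given

def find(*target, **tags):
    # one set per criterion: tag presence, plus exact tag:value matches
    # (note: indexing a defaultdict inserts an empty set for unknown tags)
    sets = [by_tag[t] for t in target]
    sets += [by_tag_value[tag][val] for tag, val in tags.items()]
    if not sets:
        return all_jobs
    return set.intersection(*sets)

print(find("crime", morals="high"))  # {'shop_owner'}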
When looking up the first level in the dictionary, the way to do that is either with my_dict[key] or my_dict.get(key) (the difference being that get() returns None instead of raising a KeyError when the key is missing). So I think you just want to do that with your target lookup.
Then, if you want to look up which jobs include anything about one of the tags, I think making a lookup dictionary for that is reasonable. You could make a dictionary where each tag maps to a list of those jobs.
The code below would be run once at the beginning to build the lookup from test_data. It loops through the entire dictionary, and any time it encounters a tag in the values for an item, it adds that item's key to the list of jobs for that tag:
lookup = dict()
for k, v in test_data.items():
    for kk, vv in v.items():
        try:
            lookup[kk].append(k)
        except KeyError:
            lookup[kk] = [k]
Output (lookup):
{'crime': ['hacker', 'mugger', 'shop_owner'],
 'morals': ['mugger', 'shop_owner'],
 'work_drive': ['office_drone', 'farmer'],
 'tolerance': ['office_drone']}
With this lookup table, you could ask 'Which jobs have a crime stat?' with lookup['crime'], which would output ['hacker', 'mugger', 'shop_owner']
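If you also need to filter on tag values, one option (my sketch, not part of the original answer) is to follow the lookup with a check against test_data:

# jobs that have a 'crime' tag AND whose morals are 'high'
result = [job for job in lookup['crime']
          if test_data[job].get('morals') == 'high']
print(result)  # ['shop_owner']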
I'm writing a script to handle millions of dictionaries coming from many files of 1 million lines each.
The main purpose of this script is to create a JSON file and send it to the Elasticsearch bulk API.
What I'm trying to do is read lines from so-called "entity" files, and for each of those entities, find the matching addresses in "sub" files (for sub-entities). My problem is that the function which should associate them takes REALLY too long for a single iteration... Even after trying to optimize it as much as possible, the association is still insanely slow.
So, just to be clear about the data structure:
Entities are objects like Persons: an id, a uniqueId, a name, a list of postal addresses, and a list of email addresses:
{id: 0, uniqueId: 'Z5ER1ZE5', name: 'John DOE', postalList: [], emailList: []}
Sub-entities are objects representing different types of addresses (email, postal, etc.): {personUniqueId: 'Z5ER1ZE5', 'Email': 'john.doe@gmail.com'}
So, I read the files' content with Pandas using pd.read_csv(filename).
To optimize as much as possible, I've decided to handle every iteration over the data using multiprocessing (working fine, even though I haven't handled RAM usage yet...):
## Using a manager to be able to pass my main object and update it through processes
manager = multiprocessing.Manager()
main_obj = manager.dict({
    'dataframes': manager.dict(),
    'dicts': manager.dict(),
    'json': manager.dict()
})

## Just an example of how I use multiprocessing
pool = multiprocessing.Pool()
result = pool.map(partial(func, obj=main_obj), data_list_to_iterate)
pool.close()
pool.join()
And I have some reference tables: sub_associations is a dict with sub-entity names as keys and the entity field that holds their related collection as values, and identifiers is a dict with entity names as keys and their unique-id field as values:
sub_associations = {
    'Persons_emlAddr': 'emailList',
    'Persons_pstlAddr': 'postalList'
}
identifiers = {
    'Animals': 'uniqueAnimalId',
    'Persons': 'uniquePersonId',
    'Persons_emlAddr': 'uniquePersonId',
    'Persons_pstlAddr': 'uniquePersonId'
}
That said, I'm now hitting a big issue in the function that attaches sub-entities to each entity:
for key in list(main_obj['dicts'].keys()):
    main_obj['json'][key] = ''
    with mp.Pool() as stringify_pool:
        res_stringify = stringify_pool.map(
            partial(convert_to_json, obj=main_obj, name=key),
            main_obj['dicts'][key]['records']
        )
        stringify_pool.close()
        stringify_pool.join()
This is where I call the problematic function. I feed it the keys I have in main_obj['dicts'], where each key is just an entity filename (Persons, Animals, ...), and main_obj['dicts'][key] is a dict of the form {name: 'Persons', records: []}, where records is the list of entity dicts I need to iterate over.
def convert_to_json(item, obj, name):
    global sub_associations
    global identifiers

    dump = ''
    subs = [val for val in sub_associations.keys() if val.startswith(name)]
    if subs:
        for sub in subs:
            df = obj['dataframes'][sub]
            id_name = identifiers[name]
            sub_items = df[df[id_name] == item[id_name]].to_dict('records')
            if sub_items:
                item[sub_associations[sub]] = sub_items
            else:
                item[sub_associations[sub]] = []

    index = {
        "index": {
            "_index": name,
            "_id": item[identifiers[name]]
        }
    }
    dump += f'{json.dumps(index)}\n'
    dump += f'{json.dumps(item)}\n'

    obj['json'][name] += dump

    return 'Done'
Does anyone have an idea what the main issue could be, and how I could change the function to make it faster?
If you need any additional information, or if I haven't been clear on some things, feel free to ask.
Thanks in advance! :)
I have a JSON with a dict of keys which are not always present, or at least not always at the same position. For example, "producers" is not always at array index [2], and "directors" is not always at [1]; it fully depends on the JSON I pass into my function. Depending on which keys are present in ['plist']['dict']['key'], the content is mapped to array indices 0, 1, 2, 3 (except for studio).
How can I find the corresponding array entry for cast, directors, producers, etc., when each of them is not always located at the same array index?!
In the end I always want to be able to pull out the right data for the right field, even if ['plist']['dict']['key'] varies according to the mapped dict.
...
def get_plist_meta(element):
    if isinstance(element, dict):
        return element["string"]
    return ", ".join(i["string"] for i in element)
...
### Default map if all fields are present
# 0 = cast
# 1 = directors
# 2 = producers
# 3 = screenwriters
plist_metadata = json.loads(dump_json)
### make fields match the given sequence: 0 = cast, 1 = directors, etc.
if 'cast' in plist_metadata['plist']['dict']['key']:
    print("Cast: ", get_plist_meta(plist_metadata['plist']['dict']['array'][0]['dict']))
if 'directors' in plist_metadata['plist']['dict']['key']:
    print("Directors: ", get_plist_meta(plist_metadata['plist']['dict']['array'][1]['dict']))
if 'producers' in plist_metadata['plist']['dict']['key']:
    print("Producers: ", get_plist_meta(plist_metadata['plist']['dict']['array'][2]['dict']))
if 'screenwriters' in plist_metadata['plist']['dict']['key']:
    print("Screenwriters: ", get_plist_meta(plist_metadata['plist']['dict']['array'][3]['dict']))
if 'studio' in plist_metadata['plist']['dict']['key']:
    print("Studio: ", plist_metadata['plist']['dict']['string'])
JSON:
{
    "plist": {
        "#version": "1.0",
        "dict": {
            "key": [
                "cast",
                "directors",
                "screenwriters",
                "studio"
            ],
            "array": [
                {
                    "dict": [
                        {
                            "key": "name",
                            "string": "Martina Piro"
                        },
                        {
                            "key": "name",
                            "string": "Ralf Stark"
                        }
                    ]
                },
                {
                    "dict": {
                        "key": "name",
                        "string": "Franco Camilio"
                    }
                },
                {
                    "dict": {
                        "key": "name",
                        "string": "Kai Meisner"
                    }
                }
            ],
            "string": "Helix Films"
        }
    }
}
JSON can also be obtained here: https://pastebin.com/JCXRs3Rw
Thanks in advance
If you prefer a more pythonic solution, try this:
# We will use this function to extract the names from the subdicts. We put single
# items in a new array so the result is consistent, no matter how many names there were.
def get_names(name_dict):
    arrayfied = name_dict if isinstance(name_dict, list) else [name_dict]
    return [o["string"] for o in arrayfied]

# Make a list of tuples (avoid naming the variable `dict`, which would shadow the built-in)
plist_dict = plist_metadata['plist']['dict']
zipped = zip(plist_dict["key"], plist_dict["array"])

# Get the names from the subdicts and put them into a new dict
result = {k: get_names(v["dict"]) for k, v in zipped}
This will give you a new dict that looks like this
{'cast': ['Martina Piro', 'Ralf Stark'], 'directors': ['Franco Camilio'], 'screenwriters': ['Kai Meisner']}
The new dict will only have the keys present in the original dict.
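So if a role such as "producers" is absent from the input, a safe way to read it is a plain .get with a default (a usage sketch of mine, not part of the original answer):

producers = result.get("producers", [])  # [] when the role is missing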
I'd advise checking out zip, map and so on, as well as list comprehensions and dict comprehensions.
I think this solves your problem:
import json

dump_json = """{"plist":{"#version":"1.0","dict":{"key":["cast","directors","screenwriters","studio"],"array":[{"dict":[{"key":"name","string":"Martina Piro"},{"key":"name","string":"Ralf Stark"}]},{"dict":{"key":"name","string":"Franco Camilio"}},{"dict":{"key":"name","string":"Kai Meisner"}}],"string":"Helix Films"}}}"""
plist_metadata = json.loads(dump_json)

roles = ['cast', 'directors', 'producers', 'screenwriters']  # all roles
names = {'cast': [], 'directors': [], 'producers': [], 'screenwriters': []}  # stores the final output
j = 0  # keeps count of which array entry we are looking at in plist_metadata['plist']['dict']['array']

for x in names.keys():  # cycle through all the possible roles
    if x in plist_metadata['plist']['dict']['key']:  # if a role exists in the keys, we'll store it in names[x]
        y = plist_metadata['plist']['dict']['array'][j]['dict']  # keep track of the value
        if isinstance(y, dict):  # if it's a single dict, encase it in a list
            y = [y]
        j += 1  # advance our plist-dict-array index
        names[x] = list(map(lambda entry: entry['string'], y))  # map {"key": "name", "string": "Martina Piro"} to just "Martina Piro"

print(names)

def list_names(role_name):
    if role_name not in names.keys():
        return f'Invalid list request: Role name "{role_name}" not found.'
    return f'{role_name.capitalize()}: {", ".join(names[role_name])}'

print(list_names('cast'))
print(list_names('audience'))
Output:
{'cast': ['Martina Piro', 'Ralf Stark'], 'directors': ['Franco Camilio'], 'producers': [], 'screenwriters': ['Kai Meisner']}
Cast: Martina Piro, Ralf Stark
Invalid list request: Role name "audience" not found.
I'm using this as a reference: Elegant way to remove fields from nested dictionaries
I have a large number of JSON-formatted data here and we've determined a list of unnecessary keys (and all their underlying values) that we can remove.
I'm a bit new to working with JSON and Python specifically (I mostly did sysadmin work) and initially thought the data was just a plain dictionary of dictionaries. While some of it looks like that, several more pieces consist of dictionaries of lists, which can furthermore contain more lists or dictionaries with no specific pattern.
The idea is to keep the data identical EXCEPT for the specified keys and associated values.
Test Data:
to_be_removed = ['leecher_here']

easy_modo = {
    'hello_wold': 'konnichiwa sekai',
    'leeching_forbidden': 'wanpan kinshi',
    'leecher_here': 'nushiyowa'
}

lunatic_modo = {
    'hello_wold': {
        'leecher_here': 'nushiyowa',
        'goodbye_world': 'aokigahara'
    },
    'leeching_forbidden': 'wanpan kinshi',
    'leecher_here': 'nushiyowa',
    'something_inside': {
        'hello_wold': 'konnichiwa sekai',
        'leeching_forbidden': 'wanpan kinshi',
        'leecher_here': 'nushiyowa'
    },
    'list_o_dicts': [
        {
            'hello_wold': 'konnichiwa sekai',
            'leeching_forbidden': 'wanpan kinshi',
            'leecher_here': 'nushiyowa'
        }
    ]
}
Obviously, the answers on the original question don't account for lists.
Here's my code, modified to work with my requirements:
from copy import deepcopy

def remove_key(json, trash):
    """
    <snip>
    """
    keys_set = set(trash)
    modified_dict = {}
    if isinstance(json, dict):
        for key, value in json.items():
            if key not in keys_set:
                if isinstance(value, dict):
                    modified_dict[key] = remove_key(value, keys_set)
                elif isinstance(value, list):
                    for ele in value:
                        modified_dict[key] = remove_key(ele, trash)
                else:
                    modified_dict[key] = deepcopy(value)
    return modified_dict
I'm sure something's messing with the structure, since the function doesn't pass the test I wrote (the expected data is exactly the same as the input, minus the removed keys). The test shows that, yes, it's properly removing the data, but the parts that are supposed to be lists of dictionaries come back as plain dictionaries instead, which will have unfortunate implications down the line.
I'm sure it's because the function returns a dictionary, but I don't know how to proceed from here in order to maintain the structure.
At this point, I need help figuring out what I could have overlooked.
When you go through your json file, you only need to determine whether it is a list, a dict or neither. Here is a recursive way to modify your input dict in place:
def remove_key(d, trash=None):
    if not trash: trash = []
    if isinstance(d, dict):
        keys = [k for k in d]
        for key in keys:
            if any(key == s for s in trash):
                del d[key]
        for value in d.values():
            remove_key(value, trash)
    elif isinstance(d, list):
        for value in d:
            remove_key(value, trash)

remove_key(lunatic_modo, to_be_removed)
remove_key(easy_modo, to_be_removed)
Result:
{
    "hello_wold": {
        "goodbye_world": "aokigahara"
    },
    "leeching_forbidden": "wanpan kinshi",
    "something_inside": {
        "hello_wold": "konnichiwa sekai",
        "leeching_forbidden": "wanpan kinshi"
    },
    "list_o_dicts": [
        {
            "hello_wold": "konnichiwa sekai",
            "leeching_forbidden": "wanpan kinshi"
        }
    ]
}
{
    "hello_wold": "konnichiwa sekai",
    "leeching_forbidden": "wanpan kinshi"
}
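For reference, the pretty-printed result above can be reproduced with something like this (my sketch; the original answer doesn't show the printing step):

import json

print(json.dumps(lunatic_modo, indent=4))
print(json.dumps(easy_modo, indent=4))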
So I have these two lists:
image_names = ["IMG_1.jpg", "IMG_2.jpg"]

data = [{"name": "IMG_1.jpg", "id": "53567"},
        {"name": "IMG_2.jpg", "id": "53568"},
        {"name": "IMG_3.jpg", "id": "53569"},
        {"name": "IMG_4.jpg", "id": "53570"}]
I want to look up each item of image_names in data, one after another, and whenever an entry has the same name, get its id and add it to a list.
This is how I'm doing this:
images_ids = []
for image_name in image_names:
    for datum in data:
        datum_name = datum.get("name", None)
        if datum_name == image_name:
            images_ids.append(datum.get("id", None))
Right now it works great, but I think this will be really inefficient once there is a lot of data in image_names and data. What's the best way in Python to do this? I'm using Python 2.7.
The main problem is that your data structure isn't set up to give you the access you want. Instead of a list of dicts, make this the natural dict that you want to use:
data = {"IMG_1.jpg": "53567",
"IMG_2.jpg": "53568",
"IMG_3.jpg": "53569",
"IMG_4.jpg": "53570"}
Now, all you need to make the list of corresponding ids is
images_ids = [data[img] for img in image_names]
If you have a need for both methods of access (if you still need the name and id labels), then I recommend that you learn to use a Pandas data frame, with name and id as the columns. This will give you the best of both methods.
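Here is a minimal sketch of that Pandas idea, assuming data is still the original list of dicts from the question (this sketch is mine, not part of the original answer):

import pandas as pd

df = pd.DataFrame(data)  # 'name' and 'id' become the columns
images_ids = df.set_index('name').loc[image_names, 'id'].tolist()
# ['53567', '53568']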
>>> images_ids = [filter(lambda x: x['name'] == name, data) for name in image_names]
>>> images_ids = [i[0]['id'] for i in images_ids if i]
>>> images_ids
['53567', '53568']
Other option:
[ item["id"] for item in data if item["name"] in image_names]
#=> ['53567', '53568']
It also works when images with the same name exist under different ids:
data = [{"name": "IMG_1.jpg", "id": "53500"},{"name": "IMG_1.jpg", "id": "53501"}]
#=> ['53500', '53501']
You are correct, it is inefficient. Instead of using a list of dictionaries, you should use either a dictionary of dictionaries or a dictionary of objects:
data = {"IMG_1.jpg": {"id": "53567"},
"IMG_2.jpg": {"id": "53568"},
"IMG_3.jpg": {"id": "53569"},
"IMG_4.jpg": {"id": "53570"}}
for image_name in image_names:
if (image_name in data):
image_ids.append(data[image_name]["id"])
Instead of O(n) for lookup in a list, you'll get O(1) for lookup in a dictionary.
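If the data arrives as the original list of dicts, a one-liner can build this structure (a sketch of mine; original_list is a hypothetical name for that incoming list):

data = {d["name"]: {"id": d["id"]} for d in original_list}  # original_list: the question's list of dicts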
Of course, you can still have name as a key in your sub-dictionary if you want, I just removed it for simplicity. But the real holy grail here would be to build a class:
class ImageData:
    def __init__(self, name, id):
        self.Name = name
        self.Id = id

data = {"IMG_1.jpg": ImageData("IMG_1.jpg", "53567"),
        "IMG_2.jpg": ImageData("IMG_2.jpg", "53568"),
        "IMG_3.jpg": ImageData("IMG_3.jpg", "53569"),
        "IMG_4.jpg": ImageData("IMG_4.jpg", "53570")}

image_ids = []
for image_name in image_names:
    if image_name in data:
        image_ids.append(data[image_name].Id)
Using a list comprehension as a filter, you can try this. It works with your existing data, though I would highly recommend you restructure your dictionary as per the recommendations of others:
images_ids = [datum.get("id", None) for datum in data
              for image_name in image_names
              if datum.get("name", None) == image_name]
No need for two loops here. You can iterate over data once and check whether each name is in image_names; if it matches, add the id to images_ids, like below:
for datum in data:
    datum_name = datum.get("name", None)
    if datum_name in image_names:
        images_ids.append(datum.get("id", None))
I have a dictionary that looks like:
{u'message': u'Approved', u'reference': u'A71E7A739E24', u'success': True}
I would like to retrieve the key-value pair for reference, i.e. { 'reference' : 'A71E7A739E24' }.
I'm trying to do this using iteritems, which does return k, v pairs, and then I'm adding them to a new dictionary. But the resulting value is unicode rather than str for some reason, and I'm not sure this is the most straightforward way to do it:
ref = {}
for k, v in charge.iteritems():
    if k == 'reference':
        ref['reference'] = v

print ref
{'reference': u'A71E7A739E24'}
Is there a built-in way to do this more easily? Or, at least, to avoid using iteritems and simply return:
{ 'reference' : 'A71E7A739E24' }
The trouble with using iteritems is that it increases lookup time to O(n), where n is the dictionary size, because you are no longer using the hash table.
If you only need to get one key-value pair, it's as simple as
ref = {key: charge[key]}
If there may be multiple pairs that are selected by some condition,
either use the dict-from-iterable constructor (the 2nd version is better if your condition depends on values, too):
ref = dict((k, charge[k]) for k in charge if <condition>)
ref = dict((k, v) for k, v in charge.iteritems() if <condition>)
or (since 2.7) a dict comprehension (which is syntactic sugar for the above):
ref = {k: charge[k] for k in charge if <condition>}
ref = {k: v for k, v in charge.iteritems() if <condition>}
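Applied to your charge dict, a concrete example (my sketch; the str() call coerces the ASCII-safe unicode value to str, addressing your unicode concern, and can be dropped otherwise):

ref = {k: str(v) for k, v in charge.iteritems() if k == 'reference'}
# {'reference': 'A71E7A739E24'}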
I don't understand the question;
is this what you are trying to do:
ref = {'reference': charge["reference"]}