Iterate and append faster through millions of dictionaries in Python

I'm writing a script to handle millions of dictionaries coming from many files of 1 million lines each.
The main purpose of this script is to build a JSON payload and send it to the Elasticsearch bulk API.
What I'm trying to do is read lines from the "entity" files and, for each entity, find its matching addresses in the "sub" files (for sub-entities). My problem is that the function which should associate them takes far too much time for a single iteration. Even after trying to optimize it as much as possible, the association is still insanely slow.
So, just to be clear about the data structure:
Entities are objects like persons: an id, a unique id, a name, a list of postal addresses and a list of email addresses:
{id: 0, uniqueId: 'Z5ER1ZE5', name: 'John DOE', postalList: [], emailList: []}
Sub-entities are objects representing the different types of addresses (email, postal, etc.): {personUniqueId: 'Z5ER1ZE5', 'Email': 'john.doe@gmail.com'}
I read the file contents with Pandas, using pd.read_csv(filename).
To optimize as much as possible, I decided to handle every iteration over the data using multiprocessing (this works fine, even though I haven't handled RAM usage):
## Using a manager to be able to pass my main object and update it across processes
manager = multiprocessing.Manager()
main_obj = manager.dict({
    'dataframes': manager.dict(),
    'dicts': manager.dict(),
    'json': manager.dict()
})

## Just an example of how I use multiprocessing
pool = multiprocessing.Pool()
result = pool.map(partial(func, obj=main_obj), data_list_to_iterate)
pool.close()
pool.join()
And I have some referentials: identifiers is a dict with entity names as keys and their unique-id field as values, and sub_associations is a dict with sub-entity names as keys and their related collection as values:
sub_associations = {
    'Persons_emlAddr': 'emailList',
    'Persons_pstlAddr': 'postalList'
}
identifiers = {
    'Animals': 'uniqueAnimalId',
    'Persons': 'uniquePersonId',
    'Persons_emlAddr': 'uniquePersonId',
    'Persons_pstlAddr': 'uniquePersonId'
}
That said, I'm hitting a big performance issue in the function that fetches the sub-entities for each entity:
for key in list(main_obj['dicts'].keys()):
    main_obj['json'][key] = ''
    with mp.Pool() as stringify_pool:
        res_stringify = stringify_pool.map(partial(convert_to_json, obj=main_obj, name=key), main_obj['dicts'][key]['records'])
        stringify_pool.close()
        stringify_pool.join()
This is where I call the problematic function. I feed it with the keys of main_obj['dicts'], where each key is an entity filename (Persons, Animals, ...), and main_obj['dicts'][key] is a dict of the form {name: 'Persons', records: []}, where records is the list of entity dicts I need to iterate over.
def convert_to_json(item, obj, name):
    global sub_associations
    global identifiers

    dump = ''
    subs = [val for val in sub_associations.keys() if val.startswith(name)]
    if subs:
        for sub in subs:
            df = obj['dataframes'][sub]
            id_name = identifiers[name]
            sub_items = df[df[id_name] == item[id_name]].to_dict('records')
            if sub_items:
                item[sub_associations[sub]] = sub_items
            else:
                item[sub_associations[sub]] = []

    index = {
        "index": {
            "_index": name,
            "_id": item[identifiers[name]]
        }
    }
    dump += f'{json.dumps(index)}\n'
    dump += f'{json.dumps(item)}\n'

    obj['json'][name] += dump

    return 'Done'
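For what it's worth, the filter df[df[id_name] == item[id_name]] rescans the whole sub-entity DataFrame for every single record, which alone makes the loop quadratic. A minimal sketch of a pre-grouped lookup, assuming the column and key names from the snippets above (everything else here is illustrative):

# Build once, before the pool starts: unique id -> list of sub-entity dicts.
grouped = {}
for sub in sub_associations:
    df = main_obj['dataframes'][sub]
    id_col = identifiers[sub]
    grouped[sub] = {
        uid: rows.to_dict('records')
        for uid, rows in df.groupby(id_col)
    }

# Inside convert_to_json, the per-record DataFrame scan then becomes a dict access:
# sub_items = grouped[sub].get(item[id_name], [])

Separately, every obj['json'][name] += dump goes through the Manager proxy, i.e. an inter-process round trip per record; collecting the strings locally (for example from the pool's return values) and joining them once afterwards is likely to be far cheaper.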
Does anyone have an idea of what the main issue could be, and how I could change it to make it faster?
If you need any additional information, or if I haven't been clear on some things, feel free to ask.
Thank you in advance! :)

Related

Efficient and fast way to search through dict of dicts

So I have a dict of jobs, each holding a dict of tags:
{
    "hacker": {"crime": "high"},
    "mugger": {"crime": "high", "morals": "low"},
    "office drone": {"work_drive": "high", "tolerance": "high"},
    "farmer": {"work_drive": "high"},
}
And I have roughly 21,000 more unique jobs to handle.
How would I go about scanning through them faster?
And is there any type of data structure that makes this faster and better to scan through, such as a lookup table for each of the tags?
I'm using Python 3.10.4.
NOTE: If it helps, everything is loaded at the start of runtime and doesn't change during runtime at all.
Here's my current code:
test_data = {
    "hacker": {"crime": "high"},
    "mugger": {"crime": "high", "morals": "low"},
    "shop_owner": {"crime": "high", "morals": "high"},
    "office_drone": {"work_drive": "high", "tolerance": "high"},
    "farmer": {"work_drive": "high"},
}

class NULL: pass

class Conditional(object):
    def __init__(self, data):
        self.dataset = data

    def find(self, *target, **tags):
        dataset = self.dataset.items()
        if target:
            dataset = (
                (entry, data) for entry, data in dataset
                if all((t in data) for t in target)
            )
        if tags:
            return [
                entry for entry, data in dataset
                if all(
                    (data.get(tag, NULL) == val) for tag, val in tags.items()
                )
            ]
        else:
            return [data[0] for data in dataset]

jobs = Conditional(test_data)

print(jobs.find(work_drive="high"))
>>> ['office_drone', 'farmer']
print(jobs.find("crime"))
>>> ['hacker', 'mugger', 'shop_owner']
print(jobs.find("crime", "morals"))
>>> ['mugger', 'shop_owner']
print(jobs.find("crime", morals="high"))
>>> ['shop_owner']
And is there any type of data structure that makes this faster and better to scan through?
Yes. And it is called dict =)
Just turn your dict into two dictionaries: one keyed by tag, and another keyed by tag and tag value, both containing sets of job names:
from collections import defaultdict
...
by_tag = defaultdict(set)
by_tag_value = defaultdict(lambda: defaultdict(set))
for job, tags in test_data.items():
    for tag, val in tags.items():
        by_tag[tag].add(job)
        by_tag_value[tag][val].add(job)

# example
# to search crime:high and morals
crime_high = by_tag_value["crime"]["high"]
morals = by_tag["morals"]
result = crime_high.intersection(morals)  # {'mugger', 'shop_owner'}
Then, to answer a query, look up the needed sets and return the jobs that are present in all of them.
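For instance, a minimal sketch of a find() with the same shape as the one in the question, built on these two indexes (the function name and no-criteria fallback are illustrative):

def find(*target, **tags):
    # One set per criterion: bare tags from by_tag, tag=value pairs from by_tag_value.
    sets = [by_tag[t] for t in target]
    sets += [by_tag_value[tag][val] for tag, val in tags.items()]
    if not sets:
        return list(test_data)  # no criteria: every job matches
    # A job matches only if it appears in every set.
    return list(set.intersection(*sets))

print(find("crime", morals="high"))  # ['shop_owner']

Each query is then a handful of intersections over pre-built sets instead of a scan over all 21,000 jobs.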
When looking up the first level of the dictionary, the way to do that is either my_dict[key] or my_dict.get(key) (they do nearly the same thing; get returns None instead of raising a KeyError when the key is missing). So I think you just want to do that for your target lookup.
Then, if you want to look up which jobs include anything about one of the tags, then yes, building a lookup dictionary for that is reasonable. You could make a dictionary where each key maps to a list of the jobs that have that tag.
The code below would be run once at the beginning to build the lookup from test_data. It loops through the entire dictionary, and any time it encounters a tag in the values of an item, it adds that item's key to the list of jobs for that tag:
lookup = dict()
for k, v in test_data.items():
    for kk, vv in v.items():
        try:
            lookup[kk].append(k)
        except KeyError:
            lookup[kk] = [k]
Output (lookup):
{'crime': ['hacker', 'mugger', 'shop_owner'],
 'morals': ['mugger', 'shop_owner'],
 'work_drive': ['office_drone', 'farmer'],
 'tolerance': ['office_drone']}
With this lookup table, you could ask 'Which jobs have a crime stat?' with lookup['crime'], which would output ['hacker', 'mugger', 'shop_owner']
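If you also need tag=value queries on top of this list-based lookup, one possible extension (a sketch, not part of the original answer) is to filter the candidate list against the original data:

# Hypothetical combined query: jobs that have a 'crime' tag AND morals == 'high'.
candidates = lookup['crime']
matches = [job for job in candidates if test_data[job].get('morals') == 'high']
print(matches)  # ['shop_owner']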

Pymongo updating documents using list of dictionaries [duplicate]

First, some background.
I have a function in Python which consults an external API to retrieve some information associated with an ID. The function takes an ID as its argument and returns a list of numbers (they correspond to some metadata associated with that ID).
For example, let's feed the IDs {0001, 0002, 0003} into the function. Say it returns the following arrays for each ID:
0001 → [45,70,20]
0002 → [20,10,30,45]
0003 → [10,45]
My goal is to implement a collection which structures the data like so:
{
    "_id": 45,
    "list": [0001, 0002, 0003]
},
{
    "_id": 70,
    "list": [0001]
},
{
    "_id": 20,
    "list": [0001, 0002]
},
{
    "_id": 10,
    "list": [0002, 0003]
},
{
    "_id": 30,
    "list": [0002]
}
As can be seen, I want my collection to index the information by the metadata itself. With this structure, the document with _id 45 contains a list of all the IDs that have metadata 45 associated with them. This way I can retrieve, with a single request to the collection, all IDs mapped to a particular metadata value.
The class method in charge of inserting IDs and metadata into the collection is the following:
def add_entries(self, id, metadataVector):
    start = time.time()
    id = int(id)
    for data in metadataVector:
        self.SegmentDB.update_one(
            filter={"_id": data},
            update={"$addToSet": {"list": id}},
            upsert=True
        )
    end = time.time()
    duration = end - start
    return duration
metadataVector is the list which contains all the metadata (integers) associated with a given ID (e.g. [45, 70, 20]).
id is the ID associated with the metadata in metadataVector (e.g. 0001).
This method currently iterates through the list and performs an operation for every element (every metadata value) in the list. It implements the collection I want: it updates the document whose "_id" is a given metadata value and adds to its list the ID from which that metadata originated (and if such a document doesn't exist yet, it inserts it - that's what upsert=True is all for).
However, this implementation ends up being somewhat slow in the long run. metadataVector usually has around 1000-3000 items for each ID (metadata integers which can range from 800 to 23000000), and I have around 40000 IDs to analyze. As a result, the collection grows quickly. At the moment I have around 3.2M documents in the collection (one specifically dedicated to each individual metadata integer). I would like to implement a faster solution; if possible, I would like to insert all the metadata in one single DB request instead of calling an update for each item in metadataVector individually.
I tried this approach but it doesn't seem to work as I intended:
def add_entries(self, id, metadataVector):
    start = time.time()
    id = int(id)
    self.SegmentDB.update_many(
        filter={"_id": {"$in": metadataVector}},
        update={"$addToSet": {"list": id}},
        upsert=True
    )
    end = time.time()
    duration = end - start
    return duration
I tried using update_many (as it seemed the natural approach to the problem), specifying a filter which, to my understanding, states "any document whose _id is in metadataVector". That way, every document involved would add the originating ID to its list (or would be created, thanks to the upsert condition, if it didn't exist). Instead, the collection ends up being filled with documents containing a single element in the list and an ObjectId() _id. (As it turns out, this is documented MongoDB behaviour: when nothing matches the filter, an upserting update_many inserts at most one new document, and since a $in filter doesn't pin down a concrete _id, MongoDB generates an ObjectId for it.)
Is there a way to implement what I want? Should I restructure the DB differently altogether?
Thanks a lot in advance!
Here is an example, and it uses Bulk Write operations. A bulk operation submits multiple inserts, updates and deletes (possibly a combination) as a single call to the database and returns a single result. This is more efficient than making multiple single calls to the database.
Scenario 1:
Input: 3 -> [10, 45]
def some_fn(id):
    # id = 3; and after some process... returns a dictionary
    # (values as lists, so they can be fed to $push/$each below)
    return { 10: [ 3 ], 45: [ 3 ], }
Scenario 2:
Input (as a list):
3 -> [10, 45]
1 -> [45, 70, 20]
def some_fn(ids):
    # ids are 1 and 3; and after some process... returns a dictionary
    return { 10: [ 3 ], 45: [ 3, 1 ], 20: [ 1 ], 70: [ 1 ] }
Perform Bulk Write
Now, perform the bulk operation on the database using the returned value from some_fn.
from pymongo import UpdateOne

data = some_fn(id)  # or some_fn(ids)
requests = []
for k, v in data.items():
    op = UpdateOne({ '_id': k }, { '$push': { 'list': { '$each': v }}}, upsert=True)
    requests.append(op)
result = db.collection.bulk_write(requests, ordered=False)
Note the ordered=False - this option is used, again, for better performance, as the writes can happen in parallel.
References:
collection.bulk_write

Parsing JSON output efficiently in Python?

The block of code below works; however, I'm not satisfied that it is very optimal, due to my limited understanding of JSON, and I can't seem to figure out a more efficient method.
The steam_game_db is like this:
{
    "applist": {
        "apps": [
            {
                "appid": 5,
                "name": "Dedicated Server"
            },
            {
                "appid": 7,
                "name": "Steam Client"
            },
            {
                "appid": 8,
                "name": "winui2"
            },
            {
                "appid": 10,
                "name": "Counter-Strike"
            }
        ]
    }
}
and my Python code so far is:
i = 0
x = 570
req_name_from_id = requests.get(steam_game_db)
j = req_name_from_id.json()
while j["applist"]["apps"][i]["appid"] != x:
    i += 1
returned_game = j["applist"]["apps"][i]["name"]
print(returned_game)
Instead of looping through the entire app list, is there a smarter way to search for it? Ideally, the elements in the data structure would be numbered the same as their corresponding 'appid':
i.e. index 570 in the list would be appid 570 (Dota 2).
In reality, element 570 in the data structure is appid 5069, Red Faction.
Also, what type of data structure is this? Perhaps not knowing has already limited my ability to search for an answer. (It seems like a dictionary with an 'appid' and a 'name' for each element?)
EDIT: Changed to a for loop as suggested:
# returned_id string for appid from another query
req_name_from_id = requests.get(steam_game_db)
j_2 = req_name_from_id.json()
for app in j_2["applist"]["apps"]:
    if app["appid"] == int(returned_id):
        returned_game = app["name"]
print(returned_game)
The most convenient way to access things by a key (like the app ID here) is to use a dictionary.
You pay a little extra performance cost up-front to fill the dictionary, but after that pulling out values by ID is basically free.
However, it's a trade-off. If you only want to do a single look-up during the life-time of your Python program, then paying that extra performance cost to build the dictionary won't be beneficial, compared to a simple loop like you already did. But if you want to do multiple look-ups, it will be beneficial.
# build a dictionary: appid -> name
app_by_id = {}
for app in j["applist"]["apps"]:
    app_by_id[app["appid"]] = app["name"]

# use it (appids are ints in this JSON, so look up by int, not by string)
print(app_by_id[570])
Also think about caching the JSON file on disk. This will save time during your program's startup.
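For instance, a minimal sketch of such caching (the cache path and function here are illustrative assumptions, not part of the original answer):

import json
import os

import requests

CACHE_PATH = 'steam_apps.json'  # hypothetical cache location

def load_app_list(url):
    # Reuse the cached copy when it exists; otherwise fetch and store it.
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, encoding='utf-8') as fp:
            return json.load(fp)
    data = requests.get(url).json()
    with open(CACHE_PATH, 'w', encoding='utf-8') as fp:
        json.dump(data, fp)
    return data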
It's better to have the JSON file on disk: you can dump it directly into a dictionary and start building up your lookup table. In the example below I've tried to keep your logic while using the dict for lookups. Don't forget to specify the encoding; the JSON has special characters in it.
Setup:
import json

apps = {}
with open('bigJson.json', encoding="utf-8") as handle:
    dictdump = json.loads(handle.read())

for item in dictdump['applist']['apps']:
    apps.setdefault(item['appid'], item['name'])
Usage 1:
That's the way you have used it:
for appid in range(0, 570):
    if appid in apps:
        print(appid, apps[appid].encode("utf-8"))
Usage 2: That's how you can query a key. Using get instead of [] will prevent a KeyError exception if the appid isn't recorded:
print(apps.get(570, 0))

Removing JSON items from an array if a value is a duplicate in Python

I am incredibly new to Python.
I have an array full of JSON objects, some of which contain duplicated values. The array looks like this:
[{"id":"1","name":"Paul","age":"21"},
{"id":"2","name":"Peter","age":"22"},
{"id":"3","name":"Paul","age":"23"}]
What I am trying to do is remove an item if its name is the same as in another JSON object, keeping the first one in the array.
So in this case I should be left with
[{"id":"1"."name":"Paul","age":"21"},
{"id":"2","name":"Peter","age":"22"}]
The code I currently have can be seen below and is largely based on this answer:
import json
ds = json.loads('python.json') #this file contains the json
unique_stuff = { each['name'] : each for each in ds }.values()
all_ids = [ each['name'] for each in ds ]
unique_stuff = [ ds[ all_ids.index(text) ] for text in set(texts) ]
print unique_stuff
I am not even sure that the line ds = json.loads('python.json') is working, as when I try to print ds nothing shows up in the console.
You might have overdone it in your approach. I would tend to rewrite the list as a dictionary with "name" as the key and then fetch the values:
ds = [{"id":"1","name":"Paul","age":"21"},
      {"id":"2","name":"Peter","age":"22"},
      {"id":"3","name":"Paul","age":"23"}]

{elem["name"]: elem for elem in ds}.values()
Out[2]:
[{'age': '23', 'id': '3', 'name': 'Paul'},
 {'age': '22', 'id': '2', 'name': 'Peter'}]
Of course, the items within the dictionary and the list may not be ordered, but I do not see that as much of a concern. If it is, let us know and we can think it over.
If you need to keep the first instance of "Paul" in your data, note that a dictionary comprehension gives you the opposite result (it keeps the last occurrence of each name).
A simple solution could be as follows:
new = []
seen = set()
for record in old:
    name = record['name']
    if name not in seen:
        seen.add(name)
        new.append(record)
del seen
First of all, your JSON snippet has an invalid format - there is a dot instead of a comma separating some keys.
You can solve your problem using a dictionary with names as keys:
import json

with open('python.json') as fp:
    ds = json.load(fp)  # this file contains the json

mem = {}
for record in ds:
    name = record["name"]
    if name not in mem:
        mem[name] = record

print(mem.values())

How to access and modify complex list/dictionary data structure in Python?

I am trying to analyze web server logs to get IP address, user agent and request path data. I would like to store the different paths visited by a particular IP. In addition, some clients spoof user agents, so one IP can present many user agent strings. I would like to store this user agent and path data for each IP.
Right now I have created a data structure as follows:
ip_dict[ip_address] = [ total_count_int, [{'path_name_1': path_name_1_count_int}, {'path_name_2': path_name_2_count}], [{'crawler': crawler_count_int}] ]
First item in the list - Total number of requests
Second item in the list - List of {'visited site path' : count }
Third item in the list - List of {'visited user agent' : count }
However, it's getting complicated to modify existing items: I would like to increment the count whenever the respective key matches.
Any help on creating a better data structure, or on modifying elements in the above one, would be appreciated.
Seems like a dict of dicts is what you want:
def update(ip_dict, ip_address, site_path, user_agent):
    if ip_address in ip_dict:
        ip_entry = ip_dict[ip_address]
        ip_entry['total_count'] += 1
        if site_path in ip_entry['site_paths']:
            ip_entry['site_paths'][site_path] += 1
        else:
            ip_entry['site_paths'][site_path] = 1
        if user_agent in ip_entry['user_agents']:
            ip_entry['user_agents'][user_agent] += 1
        else:
            ip_entry['user_agents'][user_agent] = 1
    else:
        ip_dict[ip_address] = {
            'total_count': 1,
            'site_paths': {site_path: 1},
            'user_agents': {user_agent: 1}
        }
# initialize the ip dict
ip_dict = {}

# read from your log file and, for every entry, call
update(ip_dict, '1.2.3.4', site_path, user_agent)
Of course, you can optimize this by using defaultdict, but that's outside the scope of this question.
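For the record, a possible defaultdict variant of the same update (a sketch, not part of the original answer):

from collections import defaultdict

def make_entry():
    # One entry per IP: a running total plus two counter dicts.
    return {
        'total_count': 0,
        'site_paths': defaultdict(int),
        'user_agents': defaultdict(int),
    }

ip_dict = defaultdict(make_entry)

def update(ip_address, site_path, user_agent):
    entry = ip_dict[ip_address]  # created on first access
    entry['total_count'] += 1
    entry['site_paths'][site_path] += 1
    entry['user_agents'][user_agent] += 1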
