First, some background.
I have a Python function that consults an external API to retrieve information associated with an ID. The function takes an ID as its argument and returns a list of numbers (the metadata associated with that ID).
For example, suppose we feed the function the IDs {0001, 0002, 0003} and it returns the following arrays:
0001 → [45,70,20]
0002 → [20,10,30,45]
0003 → [10,45]
My goal is to build a collection that structures the data like so:
{
"_id":45,
"list":[0001,0002,0003]
},
{
"_id":70,
"list":[0001]
},
{
"_id":20,
"list":[0001,0002]
},
{
"_id":10,
"list":[0002,0003]
},
{
"_id":30,
"list":[0002]
}
As can be seen, I want the collection to index the information by the metadata values themselves. With this structure, the document whose _id is 45 contains a list of all the IDs that have metadata 45 associated with them. This way I can retrieve, with a single request to the collection, all IDs mapped to a particular metadata value.
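For instance, retrieving every ID mapped to one metadata value then becomes a single query (a minimal sketch, assuming a pymongo collection handle called SegmentDB, as in the method below):

# All IDs whose metadata include the value 45, fetched in one request
doc = SegmentDB.find_one({"_id": 45})
ids = doc["list"] if doc else []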
The class method in charge of inserting IDs and metadata in the collection is the following:
def add_entries(self, id, metadataVector):
    start = time.time()
    id = int(id)
    for data in metadataVector:
        self.SegmentDB.update_one(
            filter={"_id": data},
            update={"$addToSet": {"list": id}},
            upsert=True
        )
    end = time.time()
    duration = end - start
    return duration
metadataVector is the list containing all the metadata integers associated with a given ID (e.g. [45,70,20]).
id is the ID those metadata values originated from (e.g. 0001).
This method currently iterates through the list and performs one update per element (per metadata value). It builds exactly the collection I want: it updates the document whose "_id" is a given metadata value and adds to its list the ID from which that metadata originated (and if the document doesn't exist yet, it inserts it - that's what upsert=True is for).
However, this implementation ends up being rather slow in the long run. metadataVector usually has around 1000-3000 items per ID (metadata integers ranging from 800 to 23000000), and I have around 40000 IDs to analyze, so the collection grows quickly. At the moment I have around 3.2M documents in the collection (one per individual metadata integer). I would like a faster solution; if possible, I would like to insert all the metadata in a single DB request instead of calling an update for each item in metadataVector individually.
I tried this approach but it doesn't seem to work as I intended:
def add_entries(self, id, metadataVector):
    start = time.time()
    id = int(id)
    self.SegmentDB.update_many(
        filter={"_id": {"$in": metadataVector}},
        update={"$addToSet": {"list": id}},
        upsert=True
    )
    end = time.time()
    duration = end - start
    return duration
I tried using update_many (it seemed the natural way to tackle the problem) with a filter which, to my understanding, means "any document whose _id is in metadataVector". That way every matching document would add the originating ID to its list (or be created if it didn't exist, thanks to the upsert option). Instead, the collection ends up filled with documents containing a single element in their list and an ObjectId() _id. This happens because an upsert inserts at most one document, and since a filter like {"_id": {"$in": [...]}} doesn't pin _id to a literal value, MongoDB generates a fresh ObjectId for the inserted document.
Is there a way to implement what I want? Should I restructure the DB differently altogether?
Thanks a lot in advance!
Here is an example that uses bulk write operations. A bulk operation submits multiple inserts, updates and deletes (or a combination of them) as a single call to the database and returns a single result. This is more efficient than making many individual calls to the database.
Scenario 1:
Input: 3 -> [10, 45]
def some_fn(id):
    # id = 3; after some processing, return a dictionary
    # mapping each metadata value to a list of IDs
    return { 10: [3], 45: [3] }
Scenario 2:
Input (as a list):
3 -> [10, 45]
1 -> [45, 70, 20]
def some_fn(ids):
    # ids are 1 and 3; after some processing, return a dictionary
    return { 10: [3], 45: [3, 1], 20: [1], 70: [1] }
Perform Bulk Write
Now, perform the bulk operation on the database using the returned value from some_fn.
from pymongo import UpdateOne

data = some_fn(id)  # or some_fn(ids)
requests = []
for k, v in data.items():
    op = UpdateOne({'_id': k}, {'$push': {'list': {'$each': v}}}, upsert=True)
    requests.append(op)

result = db.collection.bulk_write(requests, ordered=False)
Note the ordered=False - this option is used, again, for better performance, since the writes can happen in parallel.
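Applied to the original add_entries method, a minimal sketch (assuming self.SegmentDB is the pymongo collection from the question, and keeping $addToSet so duplicates are still avoided) could look like this:

from pymongo import UpdateOne
import time

def add_entries(self, id, metadataVector):
    start = time.time()
    id = int(id)
    # One UpdateOne per metadata value, all sent to the server in a single bulk_write call
    requests = [
        UpdateOne({"_id": data}, {"$addToSet": {"list": id}}, upsert=True)
        for data in metadataVector
    ]
    if requests:
        self.SegmentDB.bulk_write(requests, ordered=False)
    return time.time() - start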
References:
collection.bulk_write
Related
I'm writing a script to handle millions of dictionaries coming from many files of 1 million lines each.
The main purpose of this script is to build a JSON payload and send it to the Elasticsearch bulk API.
What I'm trying to do is read lines from so-called "entity" files and, for each entity, find its matching addresses in "sub" files (for sub-entities). My problem is that the function which should associate them takes far too long for a single iteration... Even after trying to optimize it as much as possible, the association is still extremely slow.
So, just to be clear about the data structure:
Entities are objects like Persons: an id, a uniqueId, a name, a list of postal addresses and a list of email addresses:
{id: 0, uniqueId: 'Z5ER1ZE5', name: 'John DOE', postalList: [], emailList: []}
Sub-entities are objects representing different types of addresses (email, postal, etc.): {personUniqueId: 'Z5ER1ZE5', 'Email': 'john.doe@gmail.com'}
So, I read the files' content with Pandas using pd.read_csv(filename).
To optimize as much as possible, I decided to handle every iteration over the data using multiprocessing (which works fine, even if I haven't handled RAM usage yet):
import multiprocessing
from functools import partial

## Using a manager to be able to pass my main object around and update it through processes
manager = multiprocessing.Manager()
main_obj = manager.dict({
    'dataframes': manager.dict(),
    'dicts': manager.dict(),
    'json': manager.dict()
})

## Just an example of how I use multiprocessing
pool = multiprocessing.Pool()
result = pool.map(partial(func, obj=main_obj), data_list_to_iterate)
pool.close()
pool.join()
I also have some referentials: identifiers is a dict with entity names as keys and their unique-ID field name as values, and sub_associations is a dict with sub-entity names as keys and the related entity list field as values:
sub_associations = {
    'Persons_emlAddr': 'emailList',
    'Persons_pstlAddr': 'postalList'
}
identifiers = {
'Animals': 'uniqueAnimalId',
'Persons': 'uniquePersonId',
'Persons_emlAddr': 'uniquePersonId',
'Persons_pstlAddr': 'uniquePersonId'
}
That said, I'm now hitting a big issue in the function that attaches sub-entities to each entity:
for key in list(main_obj['dicts'].keys()):
    main_obj['json'][key] = ''
    with multiprocessing.Pool() as stringify_pool:
        res_stringify = stringify_pool.map(
            partial(convert_to_json, obj=main_obj, name=key),
            main_obj['dicts'][key]['records']
        )
        stringify_pool.close()
        stringify_pool.join()
Here is where I call the problematic function. I feed it with the keys of main_obj['dicts'], where the keys are just the entity filenames (Persons, Animals, ...), and main_obj['dicts'][key] is a dict of the form {name: 'Persons', records: []}, where records is the list of entity dicts I need to iterate over.
def convert_to_json(item, obj, name):
    global sub_associations
    global identifiers
    dump = ''
    subs = [val for val in sub_associations.keys() if val.startswith(name)]
    if subs:
        for sub in subs:
            df = obj['dataframes'][sub]
            id_name = identifiers[name]
            sub_items = df[df[id_name] == item[id_name]].to_dict('records')
            if sub_items:
                item[sub_associations[sub]] = sub_items
            else:
                item[sub_associations[sub]] = []
    index = {
        "index": {
            "_index": name,
            "_id": item[identifiers[name]]
        }
    }
    dump += f'{json.dumps(index)}\n'
    dump += f'{json.dumps(item)}\n'
    obj['json'][name] += dump
    return 'Done'
Does anyone have an idea what the main issue could be, and how I could change the code to make it faster?
If you need any additional information, or if I haven't been clear on something, feel free to ask.
Thank you in advance ! :)
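One pattern that often helps with a per-record filter such as df[df[id_name] == item[id_name]] is to group each sub-entity dataframe by its identifier once, up front, so that every lookup becomes a dictionary access instead of a full dataframe scan. A minimal sketch of that idea, reusing the identifiers and sub_associations names from the question (not a drop-in replacement for the multiprocessing setup), could look like this:

# Build one lookup table per sub-entity dataframe: uniqueId -> list of records
sub_lookups = {}
for sub in sub_associations:
    df = main_obj['dataframes'][sub]
    id_name = identifiers[sub]
    sub_lookups[sub] = {
        uid: group.to_dict('records')
        for uid, group in df.groupby(id_name)
    }

# Inside convert_to_json, the per-item dataframe filter then becomes a dict lookup:
# sub_items = sub_lookups[sub].get(item[id_name], [])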
I'm trying to clean AWS CloudWatch log data, which is delivered in JSON format when queried via boto3. Each log line is stored as an array of dictionaries. For example, one log line takes the following form:
[
{
"field": "field1",
"value": "abc"
},
{
"field": "field2",
"value": "def"
},
{
"field": "field3",
"value": "ghi"
}
]
If this were in a standard key-value format (e.g., {'field1':'abc'}), I would know exactly what to do with it. I'm just getting stuck on untangling the extra layer of hierarchy introduced by the field/value keys. The ultimate goal is to convert the entire response object into a data frame like the following:
| field1 | field2 | field3 |
|--------|--------|--------|
| abc    | def    | ghi    |
(and so on for the rest of the response object, one row per log line.)
Last bit of info: each array has the same set of fields, and there is no nesting deeper than the example I've provided here. Thank you in advance :)
I was able to do this using nested loops. It's not my favorite approach - I always feel like there has to be a more elegant solution than crawling through every single inch of the object - but this data is simple enough that it's still very fast.
import pandas as pd

logList = []  # list of flattened log lines (one dict per line)
for line in logs:  # logs = response object
    line_dict = {}
    # Flatten each field/value pair into a single key-value entry
    for i in range(len(line)):
        line_dict[line[i]['field']] = line[i]['value']
    logList.append(line_dict)

df = pd.json_normalize(logList)
For anyone else working with CloudWatch logs: the actual log lines (like the one shown above) are nested in an array called 'results' in the boto3 response object, so you need to extract that array first or point the outer loop at it (i.e., for line in response['results']).
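As a slightly more concise variant (a sketch assuming the same boto3 response object mentioned above), the inner loop can be replaced with a dict comprehension:

import pandas as pd

# One dict per log line: {'field1': 'abc', 'field2': 'def', ...}
rows = [
    {col['field']: col['value'] for col in line}
    for line in response['results']
]
df = pd.DataFrame(rows)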
So I have a MongoDB that I start from Docker, and I have put several entries in it:
(
id - int,
mod - boolean,
data - text,
)
I want to search my database for N entries with mod=false:
results = self.dbcol.find({'mod': False}).limit(3)
My results return 3 elements, and for each entry in this result I want to change the mod field to true.
This is what I have tried:
for entry in results:
    entry['mod'] = True
But this did not change my field.
You can use update_one
update = {"$set": {'mod': True}}
for entry in results:
    self.dbcol.update_one({'_id': entry['_id']}, update)
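If all you need is to flip the flag on a batch of documents, a single update_many over their _id values (a sketch, assuming the N _ids fit comfortably in memory) avoids one round trip per document:

# Collect the _ids of the N matching documents, then update them in one call
ids = [entry['_id'] for entry in self.dbcol.find({'mod': False}).limit(3)]
self.dbcol.update_many({'_id': {'$in': ids}}, {'$set': {'mod': True}})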
Iterating over a cursor is not performance-efficient if N is a big number.
An efficient way would be to take advantage of pymongo's find_one_and_update method and call it N times, thereby keeping resource usage low when N is big.
import pymongo
from pymongo.collection import ReturnDocument

for _ in range(0, N):
    returned_document = self.dbcol.find_one_and_update(
        filter={'mod': False},
        update={'$set': {'mod': True}},
        return_document=ReturnDocument.AFTER
    )
    # return_document=ReturnDocument.AFTER  --> returns the modified document after updating it
    # return_document=ReturnDocument.BEFORE --> returns the original document before updating it
The block of code below works; however, I suspect it isn't very optimal, due to my limited understanding of JSON, and I can't seem to figure out a more efficient method.
The steam_game_db is like this:
{
"applist": {
"apps": [
{
"appid": 5,
"name": "Dedicated Server"
},
{
"appid": 7,
"name": "Steam Client"
},
{
"appid": 8,
"name": "winui2"
},
{
"appid": 10,
"name": "Counter-Strike"
}
]
}
}
and my Python code so far is
i = 0
x = 570
req_name_from_id = requests.get(steam_game_db)
j = req_name_from_id.json()

while j["applist"]["apps"][i]["appid"] != x:
    i += 1

returned_game = j["applist"]["apps"][i]["name"]
print(returned_game)
Instead of looping through the entire app list, is there a smarter way to search for it? Ideally the elements of the data structure holding 'appid' and 'name' would be numbered the same as their corresponding 'appid',
i.e.
appid 570 in the list would be Dota 2.
However, element 570 in the data structure is appid 5069, Red Faction.
Also, what type of data structure is this? Perhaps it has limited my ability to search for this answer already. (It seems like a dictionary with 'appid' and 'name' entries for each element?)
EDIT: Changed to a for loop as suggested
# returned_id: string for the appid from another query
req_name_from_id = requests.get(steam_game_db)
j_2 = req_name_from_id.json()

for app in j_2["applist"]["apps"]:
    if app["appid"] == int(returned_id):
        returned_game = app["name"]

print(returned_game)
The most convenient way to access things by a key (like the app ID here) is to use a dictionary.
You pay a little extra performance cost up front to fill the dictionary, but after that, pulling out values by ID is essentially free.
However, it's a trade-off. If you only do a single look-up during the lifetime of your Python program, paying the extra cost to build the dictionary won't be beneficial compared to the simple loop you already have. But if you do multiple look-ups, it will be.
# build the dictionary once
app_by_id = {}
for app in j["applist"]["apps"]:
    app_by_id[app["appid"]] = app["name"]

# use it (the appid keys are integers)
print(app_by_id[570])
Also think about caching the JSON file on disk. This will save time during your program's startup.
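A minimal sketch of that caching idea (the file name applist_cache.json is just an assumption for illustration):

import json
import os
import requests

CACHE_PATH = 'applist_cache.json'  # hypothetical cache file name

def load_applist(url):
    # Reuse the on-disk copy when it exists; otherwise fetch it once and store it.
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, encoding='utf-8') as fh:
            return json.load(fh)
    data = requests.get(url).json()
    with open(CACHE_PATH, 'w', encoding='utf-8') as fh:
        json.dump(data, fh)
    return data

j = load_applist(steam_game_db)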
It's better to have the JSON file on disk: you can load it directly into a dictionary and start building your lookup table. In the example below I've tried to keep your logic while using a dict for the lookups. Don't forget to handle the encoding; the JSON has special characters in it.
Setup:
import json

apps = {}
with open('bigJson.json', encoding="utf-8") as handle:
    dictdump = json.loads(handle.read())

for item in dictdump['applist']['apps']:
    apps.setdefault(item['appid'], item['name'])
Usage 1:
That's the way you have used it
for appid in range(0, 570):
    if appid in apps:
        print(appid, apps[appid].encode("utf-8"))
Usage 2: This is how you can query a single key; using get instead of [] prevents a KeyError exception if the appid isn't recorded.
print(apps.get(570, 0))
I'm using MongoDB as my datastore and wish to store a "clustered" configuration of my documents in a separate collection.
So one collection would hold my original set of objects, and the second would contain something like:
kMeansCollection: {
1: [mongoObjectCopy1], [mongoObjectCopy2]...
2: [mongoObjectCopy3], [mongoObjectCopy4]...
}
I'm following the K-means implementation for text clustering here, http://tech.swamps.io/recipe-text-clustering-using-nltk-and-scikit-learn/, but I'm having a hard time figuring out how to tie the outputs back into MongoDB.
An example (taken from the link):
if __name__ == "__main__":
    tags = collection.find({}, {'tag_data': 1, '_id': 0})
    clusters = cluster_texts(tags, 5)  # algo runs here with 5 clusters
    pprint(dict(clusters))
The var "tags" is the required input for the algo to run.
It must be in the form of an array, but currently tags returns an array of objects (I must therefore extract the text values from the query)
However, after magically clustering my collection 5 ways, how can I reunite them with their respective object entry from mongo?
I am only feeding specific text content from one property of the object.
Thanks a lot!
You would need to have some identifier for the documents. It is probably a good idea to include the _id field in your query so that you do have a unique document identifier. Then you can create parallel lists of ids and tag_data.
docs = collection.find({}, {'tag_data': 1, '_id': 1})
ids = [doc['_id'] for doc in docs]
tags = [doc['tag_data'] for doc in docs]
Then call the cluster function on the tag data.
clusters = cluster_text(tags)
And zip the results back with the ids.
doc_clusters = zip(ids, clusters)
From here you have built tuples of (_id, cluster) so you can update the cluster labels on your mongo documents.
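From those tuples, a bulk write (a sketch, assuming pymongo's UpdateOne and a cluster field name chosen for illustration) can store each label back on its document:

from pymongo import UpdateOne

# One small update per document, sent to the server in a single round trip
ops = [
    UpdateOne({'_id': doc_id}, {'$set': {'cluster': label}})
    for doc_id, label in doc_clusters
]
if ops:
    collection.bulk_write(ops, ordered=False)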
The efficient way to do this is to use the aggregation framework to create the lists of _id and tag_data in a single server-side operation. This also reduces both the amount of data sent over the wire and the time and memory used to decode documents on the client side.
You need to $group your documents and use the $push accumulator operator to return the list of _ids and the list of tag_data values. The aggregate() method gives access to the aggregation pipeline.
cursor = collection.aggregate([{
    '$group': {
        '_id': None,
        'ids': {'$push': '$_id'},
        'tags': {'$push': '$tag_data'}
    }
}])
You then retrieve your data using the .next() method on the CommandCursor; because we grouped by None, the cursor holds a single document.
data = cursor.next()
After that, simply call your function and zip the result.
clusters = cluster_text(data['tags'])
doc_clusters = zip(data['ids'], clusters)