I want to know whether importing is faster using UpdateOne or UpdateMany with bulk_write. My code for importing the data into the collection with pymongo looks like this:
for file in sorted_files:
    df = process_file(file)
    for row, item in df.iterrows():
        data_dict = item.to_dict()
        bulk_request.append(UpdateOne(
            {"nsamples": {"$lt": 12}},
            {
                "$push": {"samples": data_dict},
                "$inc": {"nsamples": 1}
            },
            upsert=True
        ))
result = mycol1.bulk_write(bulk_request)
When I tried UpdateMany, the only thing I changed is this:
...
...
bulk_request.append(UpdateMany(..
..
..
I didn't see any major difference in insertion time. Shouldn't UpdateMany be way faster?
Maybe I am doing something wrong. Any advice would be helpful!
Thanks in advance!
Note: My data consists of 1.2M rows. I need each document to contain 12 subdocuments.
Wernfried Domscheit's answer is correct; this answer is specific to your scenario.
If you don't mind skipping updates to existing documents and just inserting new documents altogether, use the code below, which is the best optimized for your use case:
# sorted_files and process_file are as defined in your question
for file in sorted_files:
    df = process_file(file)
    sample_data = []
    for row, item in df.iterrows():
        sample_data.append(item.to_dict())
        if len(sample_data) == 12:
            mycol1.insert_one({
                "samples": sample_data,
                "nsamples": len(sample_data),
            })
            sample_data = []
    if sample_data:  # flush whatever is left over from this file
        mycol1.insert_one({
            "samples": sample_data,
            "nsamples": len(sample_data),
        })
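If throughput matters, you can go one step further and batch the finished documents into a single insert_many call instead of one insert_one per document. A minimal sketch under the same assumptions (sorted_files, process_file, and mycol1 from the question):

docs = []
for file in sorted_files:
    df = process_file(file)
    sample_data = []
    for row, item in df.iterrows():
        sample_data.append(item.to_dict())
        if len(sample_data) == 12:
            docs.append({"samples": sample_data, "nsamples": 12})
            sample_data = []
    if sample_data:  # leftover rows from this file
        docs.append({"samples": sample_data, "nsamples": len(sample_data)})
if docs:
    mycol1.insert_many(docs)  # one round trip per driver batch instead of per document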
If you want to fill up your existing records with 12 objects and then create new records, use the code logic below.
Note: I have not tested this code locally; it is just to illustrate the flow for you to use.
for file in sorted_files:
    df = process_file(file)
    sample_data = []
    continuity_flag = False
    for row, item in df.iterrows():
        sample_data.append(item.to_dict())
        if not continuity_flag:
            sample_rec = mycol1.find_one({"nsamples": {"$lt": 12}}, {"nsamples": 1})
            if sample_rec is None:
                continuity_flag = True
            elif sample_rec["nsamples"] + len(sample_data) == 12:
                mycol1.update_one(
                    {"_id": sample_rec["_id"]},
                    {
                        "$push": {"samples": {"$each": sample_data}},
                        "$inc": {"nsamples": len(sample_data)}
                    }
                )
                sample_data = []  # these rows are stored; start a fresh batch
        if len(sample_data) == 12:
            mycol1.insert_one({
                "samples": sample_data,
                "nsamples": len(sample_data),
            })
            sample_data = []
    if sample_data:
        mycol1.insert_one({
            "samples": sample_data,
            "nsamples": len(sample_data),
        })
updateOne updates only the first document matching nsamples: {$lt: 12}, whereas updateMany would apply the update to every matching document, which is not what you want here. So updateOne should be the faster choice.
However, why do you push the rows one by one? Collect them all and make a single update, similar to this:
sample_data = []
for row, item in df.iterrows():
    data_dict = item.to_dict()
    sample_data.append(data_dict)
mycol1.update_one(
    {"nsamples": {"$lt": 12}},
    {
        "$push": {"samples": {"$each": sample_data}},
        "$inc": {"nsamples": len(sample_data)}
    }
)
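To combine that idea with the bulk_write you are already using, here is a minimal sketch (assuming sorted_files, process_file, and mycol1 from the question) that sends one UpdateOne per group of 12 rows. Because every chunk except possibly the last holds exactly 12 rows, each upsert creates one full document and nothing overfills:

from pymongo import UpdateOne

rows = []
for file in sorted_files:
    df = process_file(file)
    rows.extend(item.to_dict() for _, item in df.iterrows())

bulk_request = [
    UpdateOne(
        {"nsamples": {"$lt": 12}},
        {
            "$push": {"samples": {"$each": rows[i:i + 12]}},
            "$inc": {"nsamples": len(rows[i:i + 12])}
        },
        upsert=True,
    )
    for i in range(0, len(rows), 12)
]
if bulk_request:
    result = mycol1.bulk_write(bulk_request, ordered=True)

This turns 1.2M single-row pushes into roughly 100k operations, which is where the real time savings come from.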
I'm using Python 3 and I have a data set that contains the following data. I'm trying to get the desired output from this list. I have tried many ways but I am unable to figure out how to do it.
slots_data = [
    {"id": 551, "user_id": 1, "time": "199322002", "expire": "199322002"},
    {"id": 552, "user_id": 1, "time": "199322002", "expire": "199322002"},
    {"id": 525, "user_id": 3, "time": "199322002", "expire": "199322002"},
    {"id": 524, "user_id": 3, "time": "199322002", "expire": "199322002"},
    {"id": 553, "user_id": 1, "time": "199322002", "expire": "199322002"},
    {"id": 550, "user_id": 2, "time": "199322002", "expire": "199322002"},
]
# Desired output
# [
# {"user_id":1,"slots_ids":[551,552,553]}
# {"user_id":2,"slots_ids":[550]}
# {"user_id":3,"slots_ids":[524,525]}
# ]
I have tried the following, which is obviously not correct, and I couldn't figure out the solution to this problem:
final_list = []
for item in slots_data:
    obj = obj.dict()
    obj = {
        "user_id": item["user_id"],
        "slot_ids": item["id"]
    }
    final_list.append(obj)
print(set(final_list))
The other answer added here has a nice solution, but here's one without using pandas:
users = {}
for item in slots_data:
    # Check if we've seen this user before,
    if item['user_id'] not in users:
        # if not, create a new entry for them
        users[item['user_id']] = {'user_id': item['user_id'], 'slot_ids': []}
    # Add their slot ID to their dictionary
    users[item['user_id']]['slot_ids'].append(item['id'])
# We only need the values (dicts)
output_list = list(users.values())
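With the slots_data above, print(output_list) then gives (in insertion order, since plain dicts preserve it):
[{'user_id': 1, 'slot_ids': [551, 552, 553]}, {'user_id': 3, 'slot_ids': [525, 524]}, {'user_id': 2, 'slot_ids': [550]}]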
Lots of good answers here.
If I were doing this, I would base my answer on setdefault and/or collections.defaultdict, which can be used in a similar way. I think the defaultdict version (sketched after the output below) is very readable, but if you are not already importing collections you can do without it.
Given your data:
slots_data = [
    {"id": 551, "user_id": 1, "time": "199322002", "expire": "199322002"},
    {"id": 552, "user_id": 1, "time": "199322002", "expire": "199322002"},
    # ....
]
You can reshape it into your desired output via:
## -------------------
## get the value for the key user_id if it exists
## if it does not, set the value for that key to a default
## use the value to append the current id to the sub-list
## -------------------
reshaped = {}
for slot in slots_data:
    user_id = slot["user_id"]
    id = slot["id"]
    reshaped.setdefault(user_id, []).append(id)

## -------------------
## take a second pass to finish the shaping in a sorted manner
## -------------------
reshaped = [
    {
        "user_id": user_id,
        "slots_ids": sorted(reshaped[user_id])
    }
    for user_id in sorted(reshaped)
]
## -------------------
print(reshaped)
That will give you:
[
{'user_id': 1, 'slots_ids': [551, 552, 553]},
{'user_id': 2, 'slots_ids': [550]},
{'user_id': 3, 'slots_ids': [524, 525]}
]
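For completeness, here is a minimal sketch of the defaultdict variant mentioned above, using the same slots_data:

from collections import defaultdict

# the default list appears automatically on first access,
# so no setdefault call is needed
reshaped = defaultdict(list)
for slot in slots_data:
    reshaped[slot["user_id"]].append(slot["id"])

print([
    {"user_id": user_id, "slots_ids": sorted(ids)}
    for user_id, ids in sorted(reshaped.items())
])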
I would say try using pandas to group the user IDs together and convert the result back to a dictionary:
pd.DataFrame(slots_data).groupby('user_id')['id'].agg(list).reset_index().to_dict('records')
[{'user_id': 1, 'id': [551, 552, 553]},
{'user_id': 2, 'id': [550]},
{'user_id': 3, 'id': [525, 524]}]
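If you also want the key named slots_ids to match the desired output exactly, a rename before to_dict would do it (same idea, untested sketch):

import pandas as pd

out = (
    pd.DataFrame(slots_data)
      .groupby('user_id')['id'].agg(list)
      .reset_index()
      .rename(columns={'id': 'slots_ids'})  # match the desired key name
      .to_dict('records')
)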
Or through just a simple loop:
>>> result = {}
>>> for i in slots_data:
...     if i['user_id'] not in result:
...         result[i['user_id']] = []
...     result[i['user_id']].append(i['id'])
...
>>> output = []
>>> for i in result:
...     dict_obj = dict(user_id=i, slots_id=result[i])
...     output.append(dict_obj)
...
>>> output
[{'user_id': 1, 'slots_id': [551, 552, 553]}, {'user_id': 3, 'slots_id': [525, 524]}, {'user_id': 2, 'slots_id': [550]}]
You can use the following to get it done. Pure Python, without any dependencies.
slots_data = [
    {"id": 551, "user_id": 1, "time": "199322002", "expire": "199322002"},
    {"id": 552, "user_id": 1, "time": "199322002", "expire": "199322002"},
    {"id": 525, "user_id": 3, "time": "199322002", "expire": "199322002"},
    {"id": 524, "user_id": 3, "time": "199322002", "expire": "199322002"},
    {"id": 553, "user_id": 1, "time": "199322002", "expire": "199322002"},
    {"id": 550, "user_id": 2, "time": "199322002", "expire": "199322002"},
]
user_wise_slots = {}
for slot_detail in slots_data:
    if slot_detail["user_id"] not in user_wise_slots:
        user_wise_slots[slot_detail["user_id"]] = {
            "user_id": slot_detail["user_id"],
            "slot_ids": []
        }
    user_wise_slots[slot_detail["user_id"]]["slot_ids"].append(slot_detail["id"])
print(list(user_wise_slots.values()))
This can be done in a single list comprehension:
final_list = [{"user_id": user_id, "id":sorted([slot["id"] for slot in slots_data if slot["user_id"] == user_id])} for user_id in sorted(set([slot["user_id"] for slot in slots_data]))]
A more verbose and better formatted version of the same code:
all_user_ids = [slot["user_id"] for slot in slots_data]
unique_user_ids = sorted(set(all_user_ids))
final_list = [
{
"user_id": user_id,
"id": sorted([slot["id"] for slot in slots_data if slot["user_id"] == user_id])
}
for user_id in unique_user_ids]
Explanation:
- get all the user IDs with a list comprehension
- get the unique user IDs by creating a set
- create the final list of dictionaries using a list comprehension
- the id field is itself built with a nested list comprehension: we take each slot's id and add it to the list only if the user IDs match
Using pandas you can easily achieve the result.
First install pandas if you don't have it:
pip install pandas
import pandas as pd

df = pd.DataFrame(slots_data)  # create dataframe
# group by user_id, combine the id values into a list, and name the new column slots_ids
df1 = df.groupby("user_id")['id'].apply(list).reset_index(name="slots_ids")
final_slots_data = df1.to_dict('records')  # convert dataframe into a list of dictionaries
final_slots_data
Output:
[{'user_id': 1, 'slots_ids': [551, 552, 553]},
{'user_id': 2, 'slots_ids': [550]},
{'user_id': 3, 'slots_ids': [525, 524]}]
I am looking at building lists of lists within a dictionary from an Excel spreadsheet.
My spreadsheet looks like this:
source_item_id | target_item_id | find_string | replace_string
---------------|----------------|-------------|---------------
source_id1     | target_id1     | abcd1       | efgh1
source_id1     | target_id1     | ijkl1       | mnop1
source_id1     | target_id2     | abcd2       | efgh2
source_id1     | target_id2     | ijkl2       | mnop2
source_id2     | target_id3     | qrst        | uvwx
source_id2     | target_id3     | yzab        | cdef
source_id2     | target_id4     | ghij        | klmn
source_id2     | target_id4     | opqr        | stuv
My output dictionary should look like this:
{
    "source_id1": [{
        "target_id1": [{
            "find_string": "abcd1",
            "replace_string": "efgh1"
        },
        {
            "find_string": "ijkl1",
            "replace_string": "mnop1"
        }]
    },
    {
        "target_id2": [{
            "find_string": "abcd2",
            "replace_string": "efgh2"
        },
        {
            "find_string": "ijkl2",
            "replace_string": "mnop2"
        }]
    }],
    "source_id2": [{
        "target_id3": [{
            "find_string": "qrst",
            "replace_string": "uvwx"
        },
        {
            "find_string": "yzab",
            "replace_string": "cdef"
        }]
    },
    {
        "target_id4": [{
            "find_string": "ghij",
            "replace_string": "klmn"
        },
        {
            "find_string": "opqr",
            "replace_string": "stuv"
        }]
    }]
}
With the following code I only get the last values in each of the lists:
import xlrd

xls_path = r"C:\data\ItemContent.xlsx"
book = xlrd.open_workbook(xls_path)
sheet_find_replace = book.sheet_by_index(1)
find_replace_dict = dict()
for line in range(1, sheet_find_replace.nrows):
    source_item_id = sheet_find_replace.cell(line, 0).value
    target_item_id = sheet_find_replace.cell(line, 1).value
    find_string = sheet_find_replace.cell(line, 2).value
    replace_sting = sheet_find_replace.cell(line, 3).value
    find_replace_list = [{"find_string": find_string, "replace_sting": replace_sting}]
    find_replace_dict[source_item_id] = [target_item_id]
    find_replace_dict[source_item_id].append(find_replace_list)
print(find_replace_dict)
--> result
{
    "source_id1": ["target_id2", [{
        "find_string": "ijkl2",
        "replace_sting": "mnop2"
    }]],
    "source_id2": ["target_id4", [{
        "find_string": "opqr",
        "replace_sting": "stuv"
    }]]
}
Your problem is rather complicated by the fact that you have a list of single-key dictionaries as the value of your source IDs, but you can follow a pattern of parsing each line for the relevant items and then using those to decide where to append or, alternatively, to create new lists:
from typing import List, Tuple

def process_line(line) -> Tuple[str, str, dict]:
    source_item_id = sheet_find_replace.cell(line, 0).value
    target_item_id = sheet_find_replace.cell(line, 1).value
    find_string = sheet_find_replace.cell(line, 2).value
    replace_string = sheet_find_replace.cell(line, 3).value
    return source_item_id, target_item_id, {
        "find_string": find_string,
        "replace_string": replace_string
    }

def find_target(target: str, ls: List[dict]) -> int:
    # Find the index of the target id in the list
    for i in range(len(ls)):
        if ls[i].get(target):
            return i
    return -1  # Or some other marker
import xlrd

xls_path = r"C:\data\ItemContent.xlsx"
book = xlrd.open_workbook(xls_path)
sheet_find_replace = book.sheet_by_index(1)
result_dict = dict()
for line in range(1, sheet_find_replace.nrows):
    source, target, replacer = process_line(line)
    # You can check here that the above three are correct
    source_list = result_dict.get(source, [])  # Leverage the default value of the get function
    target_idx = find_target(target, source_list)
    target_dict = source_list[target_idx] if target_idx >= 0 else {}
    replace_list = target_dict.get(target, [])
    replace_list.append(replacer)
    target_dict[target] = replace_list
    if target_idx >= 0:
        source_list[target_idx] = target_dict
    else:
        source_list.append(target_dict)
    result_dict[source] = source_list
print(result_dict)
I would note that if each source_id pointed to a dictionary rather than a list, this could be radically simplified, since we wouldn't need to search the list for a potentially existing item and then awkwardly replace or append as needed. If you can change this constraint (remember, you can always convert a dictionary to a list downstream), I would consider doing that; see the sketch below.
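A minimal sketch of that simplification, reusing process_line from above and assuming the dict-of-dicts shape is acceptable (untested):

# {source_id: {target_id: [{find/replace}, ...]}} -- no list search needed
result = {}
for line in range(1, sheet_find_replace.nrows):
    source, target, replacer = process_line(line)
    result.setdefault(source, {}).setdefault(target, []).append(replacer)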
I have a mongodb document I am trying to update. This answer was helpful, but every time I insert into the database, the data is inserted as an array nested inside the array, whereas I just want to insert the object directly into the array.
Here is what I am doing.
# My function to update the array
def append_site(gml_id, new_site):
    col.update_one({'gml_id': gml_id}, {'$push': {'websites': new_site}}, upsert=True)

# My Dataframe
data = {'name': ['ABC'],
        'gml_id': ['f9395e09'],
        'url': ['ABC.com']}
df = pd.DataFrame(data)

# Grouping data for upsert
df = df.groupby(['gml_id']).apply(lambda x: x[['name', 'url']].to_dict('r')).reset_index().rename(columns={0: 'websites'})

# Apply function to every row
df.apply(lambda row: append_site(row['gml_id'], row['websites']), axis=1)
Here is the outcome:
{
    "gml_id": "f9395e09",
    "websites": [
        {
            "name": "XYZ.com",
            "url": "...xyz.com"
        },
        [
            {
                "name": "ABC.com",
                "url": "...abc.com"
            }
        ]
    ]
}
Here is the goal:
{
    "gml_id": "f9395e09",
    "websites": [
        {
            "name": "XYZ.com",
            "url": "...xyz.com"
        },
        {
            "name": "ABC.com",
            "url": "...abc.com"
        }
    ]
}
Your issue is that the websites array is being appended with a list object rather than a dict, i.e. new_site is a list.
As you haven't posted where you call append_site(), this is a little speculative, but you could try changing this line and seeing if it gives the effect you need:
col.update_one({'gml_id': gml_id}, {'$push': {'websites': new_site[0]}}, upsert=True)
Alternatively, make sure you are passing a dict object to the function.
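Another option, if new_site may legitimately contain several sites, is $push with $each, which appends the list's elements rather than the list itself (a sketch, untested):

def append_sites(gml_id, new_sites):
    # new_sites is a list of dicts; $each pushes each element individually
    col.update_one(
        {'gml_id': gml_id},
        {'$push': {'websites': {'$each': new_sites}}},
        upsert=True,
    )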
Instead of doing an unnecessary groupby, I decided to leave the dataframe flat and then adjust the function like this:

def append_site(gml_id, name, url):
    col.update_one({'gml_id': gml_id}, {'$push': {'websites': {'name': name, 'url': url}}}, upsert=True)

I now call it like this: df.apply(lambda row: append_site(row['gml_id'], row['name'], row['url']), axis=1)
Works perfectly fine.
I would like to create a tree-style structure from a DataFrame in Python.
The desired output would look something like this:
{
    "Cat1": {
        "Band1": {
            "PE1": {
                "1995-01": {
                    "GroupCalcs": {
                        "Count": 382.0,
                        "EqWghtRtn": -0.015839621267015727,
                        "Idx_EqWghtRtn": 0.9841603787329842,
                        "Idx_WghtRtn": 0.9759766819565102,
                        "WghtRtn": -0.02402331804348973
                    }
                },
                "1995-02": {
                    "GroupCalcs": {
                        "Count": 382.0,
                        "EqWghtRtn": -0.015839621267015727,
                        "Idx_EqWghtRtn": 0.9841603787329842,
                        "Idx_WghtRtn": 0.9759766819565102,
                        "WghtRtn": -0.02402331804348973
                    }
                }
            }
        }
    }
}
I am looking for the most efficient way of building this from a DataFrame. I have 20k+ rows to parse, which currently look like the snippet below.
(data snippet shown as an image in the original post)
I was thinking something like this but I know this is very inefficient.
dctcat = {}
for cat in alldf.index.levels[0]:
    dctMktBand = {}
    for mktband in alldf.index.levels[1]:
        dctpe = {}
        for pe in alldf.index.levels[2]:
            dctMonth = {}
            for month in alldf.index.levels[3]:
                dctMonth[month] = alldf.loc[[cat, mktband, pe, month]].filter(
                    items=['Count', 'EqWghtRtn', 'Idx_EqWghtRtn', 'Idx_WghtRtn', 'WghtRtn']).to_dict()
            dctpe[str(pe)] = dctMonth
        dctMktBand[str(mktband)] = dctpe
    dctcat[str(cat)] = dctMktBand
Possible Solution
I stumbled across this article https://gist.github.com/hrldcpr/2012250 and, using defaultdict(tree), I was able to accomplish what I needed in under 2 seconds. While that may sound long, it is considerably faster than what I had. Sharing the code below; if someone has updates or improvements on this, I would appreciate it.
from collections import defaultdict

def tree():
    return defaultdict(tree)

def dicts(t):
    # convert the defaultdict tree back into plain nested dicts
    if isinstance(t, defaultdict):
        outval = {k: dicts(t[k]) for k in t}
    else:
        outval = t
    return outval

alldf.set_index(['category', 'marketCapBand', 'peGreaterThanMarket', 'Month'], inplace=True)
tmp = alldf.filter(items=['Count', 'EqWghtRtn', 'Idx_EqWghtRtn', 'Idx_WghtRtn', 'WghtRtn']).to_dict('index')
outjson = tree()
for k, v in tmp.items():
    (cat, mkb, pe, mon) = k
    outjson[str(cat)][str(mkb)][str(pe)][str(mon)] = v
# convert back to dictionary
outjson = dicts(outjson)
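For anyone unfamiliar with the gist's idiom, here is a tiny self-contained example of what tree() buys you (hypothetical keys, just for illustration):

from collections import defaultdict
import json

def tree():
    return defaultdict(tree)

t = tree()
t["a"]["b"]["c"] = 1   # intermediate levels spring into existence on first access
print(json.dumps(t))   # {"a": {"b": {"c": 1}}}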
This is my JSON string; I want to read it into a DataFrame in the following tabular format.
I have no idea what to do after pd.DataFrame(json.loads(data)).
JSON data, edited
{
    "data": [
        {
            "data": {
                "actual": "(0.2)",
                "upper_end_of_central_tendency": "-"
            },
            "title": "2009"
        },
        {
            "data": {
                "actual": "2.8",
                "upper_end_of_central_tendency": "-"
            },
            "title": "2010"
        },
        {
            "data": {
                "actual": "-",
                "upper_end_of_central_tendency": "2.3"
            },
            "title": "longer_run"
        }
    ],
    "schedule_id": "2014-03-19"
}
That's a somewhat overly nested JSON. But if that's what you have to work with, and assuming your parsed JSON is in jdata:
datapts = jdata['data']
rownames = ['actual', 'upper_end_of_central_tendency']
colnames = [ item['title'] for item in datapts ] + ['schedule_id' ]
sched_id = jdata['schedule_id']
rows = [ [item['data'][rn] for item in datapts ] + [sched_id] for rn in rownames]
df = pd.DataFrame(rows, index=rownames, columns=colnames)
df is now:

                                2009 2010 longer_run schedule_id
actual                         (0.2)  2.8          -  2014-03-19
upper_end_of_central_tendency      -    -        2.3  2014-03-19
If you wanted to simplify that a bit, you could construct the core data without the asymmetric schedule_id field, then add that after the fact:
datapts = jdata['data']
rownames = ['actual', 'upper_end_of_central_tendency']
colnames = [ item['title'] for item in datapts ]
rows = [ [item['data'][rn] for item in datapts ] for rn in rownames]
d2 = pd.DataFrame(rows, index=rownames, columns=colnames)
d2['schedule_id'] = jdata['schedule_id']
That will make an identical DataFrame (i.e. df == d2). It helps when learning pandas to try a few different construction strategies and get a feel for what is more straightforward. There are more powerful tools for unfolding nested structures into flatter tables, but they're not as easy to understand the first time out of the gate; one is sketched below.
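One such tool, for reference, is pandas.json_normalize, which flattens the nested records in one call (a sketch on the same jdata):

import pandas as pd

flat = pd.json_normalize(jdata['data'])
# columns: title, data.actual, data.upper_end_of_central_tendency
# from here you can pivot/transpose into the shape built above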
(Update) If you wanted a better structuring on your JSON to make it easier to put into this format, ask pandas what it likes. E.g. df.to_json() output, slightly prettified:
{
    "2009": {
        "actual": "(0.2)",
        "upper_end_of_central_tendency": "-"
    },
    "2010": {
        "actual": "2.8",
        "upper_end_of_central_tendency": "-"
    },
    "longer_run": {
        "actual": "-",
        "upper_end_of_central_tendency": "2.3"
    },
    "schedule_id": {
        "actual": "2014-03-19",
        "upper_end_of_central_tendency": "2014-03-19"
    }
}
That is a format from which pandas' read_json function will immediately construct the DataFrame you desire.
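For example, assuming that restructured JSON is in a string json_str:

import pandas as pd
from io import StringIO

# columns: 2009, 2010, longer_run, schedule_id; index: actual, upper_end_of_central_tendency
df = pd.read_json(StringIO(json_str))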