Nested arrays: How to append a new datapoint? - python

Sorry for the confusing title, but I'm not sure how to explain this briefly.
I have a database with Arrays within Arrays, for example:
r.db('test').table('Example').insert({
    'id': 'Object1',
    'History': {'16-07-2018': {'Price': 25, 'Volume': 200}}
})
What I want to do is add a new object. If the ID doesn't exist, create it. If it already exists, add new dates to the history (this is my first question: how do I do this? Using insert and then conflict=update?), something like:
Insert Object1 -> History -> 17-07-2018 -> {Price:40,Volume:150}
So the result would be:
{
    "History": {
        "16-07-2018": {
            "Volume": 200,
            "Price": 25
        },
        "17-07-2018": {
            "Volume": 150,
            "Price": 40
        }
    },
    "id": "Object1"
}
Summarizing:
1) How do I tell Rethink to insert a new document if it doesn't exist, and update it if it already exists, based on id?
2) How do I append to arrays within the db?
Thanks!

As commented, the title is misleading since you're working with objects and not arrays.
For objects, the conflict: 'update' option does everything required (it creates or updates the object, and updates/"appends" to the history).
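In Python, that could look like the following (a minimal sketch, assuming an open connection conn):
r.db('test').table('Example').insert({
    'id': 'Object1',
    'History': {'17-07-2018': {'Price': 40, 'Volume': 150}}
}, conflict='update').run(conn)
# conflict='update' merges the new date into the existing History object,
# or creates the whole document if id 'Object1' does not exist yet.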
For (real) arrays, one can use the conflict option as well, since it can take a function as a value:
r.db....insert({
    id: 'Object1',
    history: [{
        date: '2018-01-01',
        price: 40,
        volume: 250
    }]
}, {
    conflict: function(id, oldDoc, newDoc) {
        return r.branch(
            newDoc('history').count().ne(1),
            r.error("When updating, only one history item is allowed"),
            r.do(function() {
                var offsets = oldDoc('history').offsetsOf(function(item) {
                    return item('date').eq(newDoc('history').nth(0)('date'));
                });
                return oldDoc.merge({
                    history: r.branch(
                        offsets.isEmpty(),
                        oldDoc('history').append(newDoc('history').nth(0)),
                        oldDoc('history').changeAt(offsets.nth(0), newDoc('history').nth(0))
                    )
                });
            })
        );
    }
})

Related

"Document interning" in Mongo

I have a lot of documents which I know will rarely change and are very similar to each other. Specifically, I know they have a nested field that is always the same (for some of them):
{
    "docid": 1,
    "nested_field_that_will_always_be_the_same": {
        "title": "this will always be the same",
        "desc": "this will always be the same, too"
    }
}
{
    "docid": 2,
    "nested_field_that_will_always_be_the_same": {
        "title": "this will always be the same",
        "desc": "this will always be the same, too"
    }
}
I don't want to store the same document over and over again, instead I want Mongo to "intern" this field, i.e only store it once and the rest will only store pointers to it.
Something like:
{
    "docid": 1,
    "nested_field_that_will_always_be_the_same": {
        "title": "this will always be the same",
        "desc": "this will always be the same, too"
    }
}
{
    "docid": 2,
    "nested_field_that_will_always_be_the_same": <pointer to doc1.nested_field_that_will_always_be_the_same>
}
Now, of course, I could move this nested field into a separate document and have Mongo reference its _id field, but I am not looking for an app-side solution, because this collection is accessed by multiple workers and I don't have all the documents that share the same nested_field_that_will_always_be_the_same at any given moment.
Instead, I want a solution provided by Mongo to only store this field once for every instance it is unique.
How can I do that?
I am using Pymongo.
This is quite an interesting challenge - I don't think a "pure" Mongo solution is possible; you'll still have to modify your app code at insertion time. I'm quite interested to see if anyone comes up with a pure solution.
What I'd probably do is add a unique index on the nested document with a partialFilterExpression. The index ensures you can quickly find the _id of the matching document, and unique enforces this strictly.
Something like this (I shortened your field to nested for brevity):
collection.createIndex(
    { nested: 1 },
    { unique: true, partialFilterExpression: { nested: { $type: "object" } } }
);
Then for my inserts, I'd do the following (pseudo code):
try {
    found = collection.findOne({ nested }, { projection: { _id: 1 } })
    if (found) {
        // The nested object already exists: store a pointer to it instead
        collection.insertOne({ docId, nested: found._id })
    } else {
        collection.insertOne({ docId, nested })
    }
}
catch (e) {
    // test for E11000 (duplicate key) and retry
}
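Since the question mentions PyMongo, here is a rough Python equivalent (a sketch under the same assumptions; the collection name and the intern_insert helper are illustrative):
from pymongo import MongoClient, ASCENDING
from pymongo.errors import DuplicateKeyError

collection = MongoClient().test.docs

# Unique partial index: only documents whose `nested` field is a full
# object participate, so the small "pointer" documents don't collide.
collection.create_index(
    [("nested", ASCENDING)],
    unique=True,
    partialFilterExpression={"nested": {"$type": "object"}},
)

def intern_insert(docid, nested):
    try:
        found = collection.find_one({"nested": nested}, {"_id": 1})
        if found:
            # Store a "pointer" to the document owning the full object.
            collection.insert_one({"docid": docid, "nested": found["_id"]})
        else:
            collection.insert_one({"docid": docid, "nested": nested})
    except DuplicateKeyError:
        # Another worker inserted the same nested object first; retry,
        # which will now take the pointer branch.
        intern_insert(docid, nested)
Note that both the unique index and the equality match compare the embedded document field by field, in order, so all workers must build the nested object with a consistent key order.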

The best way to transform a response into JSON format, as in the example

I'd appreciate it if you could help me with the best way to transform a result into JSON as below.
We have a result like the one below, where we are getting information on employees and companies. In the result, somehow, we are getting a prefix like T. on some, but not all, of the properties.
[{
    "T.id": "Employee_11",
    "T.category": "Employee",
    "node_id": ["11"]
},
{
    "T.id": "Company_12",
    "T.category": "Company",
    "node_id": ["12"],
    "employeecount": 800
},
{
    "T.id": "id~Employee_11_to_Company_12",
    "T.category": "WorksIn"
},
{
    "T.id": "Employee_13",
    "T.category": "Employee",
    "node_id": ["13"]
},
{
    "T.id": "Parent_Company_14",
    "T.category": "ParentCompany",
    "node_id": ["14"],
    "employeecount": 900,
    "childcompany": "Company_12"
},
{
    "T.id": "id~Employee_13_to_Parent_Company_14",
    "T.category": "Contractorin"
}]
We need to transform this result into a different structure, grouping by category: if the category is Employee, Company or ParentCompany, the item should go under the node_properties object; otherwise it should go under edge_properties. Also, apart from the common properties (property_id, property_category and node), additional properties are to be added if the category is Company or ParentCompany. There is some further logic where we have to derive the from and to properties of the edge object from the '_to_' part of its id. The expected response is:
"node_properties":[
{
"property_id":"Employee_11",
"property_category":"Employee",
"node":{node_id: "11"}
},
{
"property_id":"Company_12",
"property_category":"Company",
"node":{node_id: "12"},
"employeecount":800
},
{
"property_id":"Employee_13",
"property_category":"Employee",
"node":{node_id: "13"}
},
{
"property_id":"Company_14",
"property_category":"ParentCompany",
"node":{node_id: "14"},
"employeecount":900,
"childcompany":"Company_12"
}
],
"edge_properties":[
{
"from":"Employee_11",
"to":"Company_12",
"property_id":"Employee_11_to_Company_12",
},
{
"from":"Employee_13",
"to":"Parent_Company_14",
"property_id":"Employee_13_to_Parent_Company_14",
}
]
In Java, we would use an enhanced for loop, a switch, etc. How can we write Python code to produce the structure above from the initial result structure? (I am new to Python.) Thank you in advance.
Regards
Here is a method that I quickly put together; you can adjust it to your requirements. You can use a regex or your own function to get the IDs for the edge_properties and then assign them to an object the way I did. I am not sure of your full requirements, but if the list you gave covers all the categories, then this will be sufficient.
def transform(input_list):
    node_properties = []
    edge_properties = []
    for input_obj in input_list:
        new_obj = {}
        if input_obj['T.category'] in ('Employee', 'Company', 'ParentCompany'):
            new_obj['property_id'] = input_obj['T.id']
            new_obj['property_category'] = input_obj['T.category']
            new_obj['node'] = {'node_id': input_obj['node_id'][0]}
            if 'employeecount' in input_obj:
                new_obj['employeecount'] = input_obj['employeecount']
            if 'childcompany' in input_obj:
                new_obj['childcompany'] = input_obj['childcompany']
            node_properties.append(new_obj)
        else:  # add elif branches here if there are other outliers
            # Edge ids look like "id~Employee_11_to_Company_12": strip the
            # "id~" prefix, then split on "_to_" to recover from and to.
            raw_id = input_obj['T.id'].split('~', 1)[-1]
            from_id, _, to_id = raw_id.partition('_to_')
            new_obj['from'] = from_id
            new_obj['to'] = to_id
            new_obj['property_id'] = raw_id
            edge_properties.append(new_obj)
    return {'node_properties': node_properties, 'edge_properties': edge_properties}
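For example, with the sample result bound to input_list:
import json

result = transform(input_list)
print(json.dumps(result, indent=2))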

How to index list of object in Elasticsearch?

A document format I ingest into Elasticsearch looks like this:
{
    'id': '514d4e9f-09e7-4f13-b6c9-a0aa9b4f37a0',
    'created': '2019-09-06 06:09:33.044433',
    'meta': {
        'userTags': [
            {
                'intensity': '1',
                'sentiment': '0.84',
                'keyword': 'train'
            },
            {
                'intensity': '1',
                'sentiment': '-0.76',
                'keyword': 'amtrak'
            }
        ]
    }
}
...ingested with Python:
r = requests.put(itemUrl, auth=authObj, json=document, headers=headers)
The idea here is that Elasticsearch will treat keyword, intensity and sentiment as fields that can be queried later. However, on the Elasticsearch side I can observe that this is not happening (I use Kibana for the search UI) -- instead, I see a field "meta.userTags" whose value is the whole list of objects.
How can I make Elasticsearch index the elements within a list?
I used the document body you provided to create a new index 'testind' with type 'testTyp' using the Postman REST client:
POST http://localhost:9200/testind/testTyp
{
    "id": "514d4e9f-09e7-4f13-b6c9-a0aa9b4f37a0",
    "created": "2019-09-06 06:09:33.044433",
    "meta": {
        "userTags": [
            {
                "intensity": "1",
                "sentiment": "0.84",
                "keyword": "train"
            },
            {
                "intensity": "1",
                "sentiment": "-0.76",
                "keyword": "amtrak"
            }
        ]
    }
}
When I queried for the index's mapping, this is what I get:
GET http://localhost:9200/testind/testTyp/_mapping
{
    "testind": {
        "mappings": {
            "testTyp": {
                "properties": {
                    "created": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    },
                    "id": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    },
                    "meta": {
                        "properties": {
                            "userTags": {
                                "properties": {
                                    "intensity": {
                                        "type": "text",
                                        "fields": {
                                            "keyword": {
                                                "type": "keyword",
                                                "ignore_above": 256
                                            }
                                        }
                                    },
                                    "keyword": {
                                        "type": "text",
                                        "fields": {
                                            "keyword": {
                                                "type": "keyword",
                                                "ignore_above": 256
                                            }
                                        }
                                    },
                                    "sentiment": {
                                        "type": "text",
                                        "fields": {
                                            "keyword": {
                                                "type": "keyword",
                                                "ignore_above": 256
                                            }
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
As you can see, the fields are part of the mapping and can be queried as needed in the future, so I don't see a problem here as long as the field names are not among the reserved words listed at https://www.elastic.co/guide/en/elasticsearch/reference/6.4/sql-syntax-reserved.html (you might want to avoid the term 'keyword', as it can be confusing later when writing search queries, since the field name and the type are both 'keyword'). Also note that the mapping gets created via dynamic mapping (https://www.elastic.co/guide/en/elasticsearch/reference/6.3/dynamic-field-mapping.html#dynamic-field-mapping) in Elasticsearch, so the data types are determined by Elasticsearch based on the values you have provided. However, this may not always be accurate, so to prevent that you can use the PUT _mapping API to define your own mapping for the index, and then prevent new fields within a type from being added to the mappings.
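For instance, a sketch of such an explicit mapping (assuming Elasticsearch 6.x as in the linked docs and reusing the question's requests approach; the numeric types are my assumption):
import requests

# Explicit field types plus "dynamic": "strict", so documents with
# unexpected fields are rejected instead of silently extending the
# mapping. Run against a fresh index: existing field types cannot change.
mapping = {
    "dynamic": "strict",
    "properties": {
        "id": {"type": "keyword"},
        "created": {"type": "text"},
        "meta": {
            "properties": {
                "userTags": {
                    "properties": {
                        "intensity": {"type": "integer"},
                        "sentiment": {"type": "float"},
                        "keyword": {"type": "keyword"}
                    }
                }
            }
        }
    }
}
r = requests.put("http://localhost:9200/testind/_mapping/testTyp", json=mapping)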
You don't need a special mapping to index a list - every field can contain one or more values of the same type. See the array datatype.
In the case of a list of objects, they can be indexed as the object or nested datatype. By default Elastic uses the object datatype. In this case you can query meta.userTags.keyword and/or meta.userTags.sentiment. The result will always contain whole documents with values matched independently, i.e. searching for keyword=train and sentiment=-0.76 you WILL find the document with id=514d4e9f-09e7-4f13-b6c9-a0aa9b4f37a0.
If this is not what you want, you need to define a nested datatype mapping for the field userTags and use a nested query.
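A sketch of that (again assuming Elasticsearch 6.x and the index/type names used above; the nested mapping must be created on a fresh index):
import requests

# Map userTags as "nested" so each tag object is indexed as its own
# hidden document.
requests.put("http://localhost:9200/testind", json={
    "mappings": {
        "testTyp": {
            "properties": {
                "meta": {
                    "properties": {
                        "userTags": {"type": "nested"}
                    }
                }
            }
        }
    }
})

# A nested query only matches when all conditions hold within the SAME
# userTags element, unlike the default object datatype behaviour.
query = {
    "query": {
        "nested": {
            "path": "meta.userTags",
            "query": {
                "bool": {
                    "must": [
                        {"match": {"meta.userTags.keyword": "train"}},
                        {"match": {"meta.userTags.sentiment": "0.84"}}
                    ]
                }
            }
        }
    }
}
r = requests.post("http://localhost:9200/testind/_search", json=query)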

Upsert in mongoengine is not generating ObjectId

I am trying to execute an upsert function in mongoengine. That is, if a document is present, I want to update it with new values, and if it isn't present, I want to create and insert it.
I have a list of objects. These objects may or may not have ObjectIds. An example is:
[
    {
        "id": ObjectId("5c1791b7397df4a9c8518342"),
        "type": "Line"
    },
    {
        "type": "Line"
    }
]
As you can see, the second object does not have an id.
I have written my query as:
updates = Collection.objects(
    id=obj.get('id', None)
).modify(
    new=True,
    upsert=True,
    **update_dict
)
obj is each object as I iterate through the list.
Note: update_dict is another dict that gets its values from a function that returns the attributes to set (for example: set__type: "Line").
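For illustration, such a function might look like this (purely hypothetical, as the question does not show it; the field handling depends on your schema):
def build_update_dict(obj):
    # Turn each plain attribute into a mongoengine set__<field> keyword.
    return {'set__' + key: value for key, value in obj.items() if key != 'id'}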
Problem
The first object is modified just fine. However, there is an error:
"'None' is not a valid ObjectId, it must be a 12-byte input or a
24-character hex string"
Clearly it's because of the obj.get('id', None) part.
So, is there a way that an id can be generated if it is passed as None?
I tried the same thing with Mongoose and Node.js, and it works for me when I use it like below.
Here is my array of objects:
var arr = [
    {
        _id: "5c13de7d47zfe91e3484362f",
        email: 'test1@gmail.com',
    },
    {
        _id: "5c13de7d47zfe91e3484362f",
        email: 'test2@gmail.com',
    },
    {
        // _id: "5c66aa87751fz5368759f9bc", // Commented
        email: 'test3@gmail.com',
    }
]
Now I iterate through the array as below with Node.js:
arr.forEach(async element => {
    await Driver.findOneAndUpdate(
        {
            // Types.ObjectId(element._id) casts the string id; when _id is
            // undefined it generates a brand-new ObjectId, so the upsert
            // inserts a new document for that element.
            _id: Types.ObjectId(element._id)
        },
        {
            email: element.email
        },
        { upsert: true, new: true }
    ).lean().exec();
});
And it works for me. It updates the documents in the first two cases and inserts a new doc in the last case.
The main thing is to use Types.ObjectId, which casts the value to a proper ObjectId (and generates a new one when the value is missing). If I do it without Types.ObjectId, it does not work.
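The same idea carries over to mongoengine: generate a fresh ObjectId yourself when one is missing (a sketch; Collection and update_dict are from the question, and the list is assumed to be bound to objects):
from bson import ObjectId

for obj in objects:
    # bson.ObjectId() creates a brand-new id, mirroring what
    # Types.ObjectId(undefined) does in the Mongoose example above.
    doc_id = obj.get('id') or ObjectId()
    updates = Collection.objects(id=doc_id).modify(
        new=True,
        upsert=True,
        **update_dict
    )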

MongoDB count distinct items in an array

My actors collection contains an array-of-documents field called acted_in. Instead of returning the size of acted_in.idmovies like so: {$size: '$acted_in.idmovies'}, I want to return the number of distinct values inside acted_in.idmovies. How can I do that?
c1 = actors.aggregate([{"$match": {'$and': [{'fname': f_name},
{'lname': l_name}]}},
{"$project": {'first_name': '$fname',
'last_name': '$lname',
'gender': '$gender',
'distinct_movies_played_in': {'$size': '$acted_in.idmovies'}}}])
You basically need to include $setDifference in there to obtain the "distinct" items. All "sets" are "distinct" by design, and by obtaining the "difference" between the present array and an empty one [] you get the desired result. Then you can apply $size.
You also have some common mistakes/misconceptions. Firstly, when using $match or any MongoDB query expression you do not need to use $and unless there is an explicit case to do so. All query expression arguments are "already" AND conditions unless explicitly stated otherwise, as with $or. So don't use it explicitly for this case.
Secondly, your $project was using explicit field path variables for every field. You do not need to do that just to return a field; outside of usage in an "expression", you can simply use a 1 to note that you want it included:
c1 = actors.aggregate([
    { "$match": { "fname": f_name, "lname": l_name } },
    { "$project": {
        "fname": 1,
        "lname": 1,
        "gender": 1,
        "distinct_movies_played_in": {
            "$size": { "$setDifference": [ "$acted_in.idmovies", [] ] }
        }
    }}
])
In fact, if you are actually using MongoDB 3.4 or greater (and your notation of an element within an array, "$acted_in.idmovies", says you have at least MongoDB 3.2), which has support for $addFields, then use that instead of specifying all the other fields in the document:
c1 = actors.aggregate([
    { "$match": { "fname": f_name, "lname": l_name } },
    { "$addFields": {
        "distinct_movies_played_in": {
            "$size": { "$setDifference": [ "$acted_in.idmovies", [] ] }
        }
    }}
])
Unless you explicitly need to return just "some" other fields.
The basic case here is: do not use $unwind for array operations unless you specifically need to perform a $group operation with its _id key pointing at a value obtained from "within" the array.
In all other cases, MongoDB has far more efficient operators for working with arrays than what $unwind does.
This should give you what you want:
actors.aggregate([
    {
        $match: {fname: f_name, lname: l_name}
    },
    {
        $unwind: '$acted_in'
    },
    {
        $group: {
            _id: '$_id',
            first_name: {$first: '$fname'},
            last_name: {$first: '$lname'},
            gender: {$first: '$gender'},
            movies: {$addToSet: '$acted_in.idmovies'}
        }
    },
    {
        $project: {
            first_name: 1,
            last_name: 1,
            gender: 1,
            distinct: {$size: '$movies'}
        }
    }
])
After the acted_in array is deconstructed and its idmovies values are gathered back into a set, you just need to take the number of items, i.e. the $size of that set.
