How to restructure a collection in MongoDB - python

I'm looking to restructure my MongoDB collection and haven't been able to do so. I'm quite new to it and looking for some help. I'm struggling to access and move the data within the "itemsList" field.
My collection documents are currently structured like this:
{
"_id": 1,
"pageName": "List of Fruit",
"itemsList":[
{
"myID": 101,
"itemName": "Apple"
},
{
"myID": 102,
"itemName": "Orange"
}
]
},
{
"_id": 2,
"pageName": "List of Computers",
"itemsList":[
{
"myID": 201,
"itemName": "MacBook"
},
{
"myID": 202,
"itemName": "Desktop"
}
]
}
The end result
But I would like the data to be restructured so that each "itemName" value becomes its own document.
I would also like to change the name of "myID" to "itemID".
And save the new documents to another collection.
{
"_id": 1,
"itemName": "Apple",
"itemID": 101,
"pageName": "List of Fruit"
},
{
"_id": 2,
"itemName": "Orange",
"itemID": 102,
"pageName": "List of Fruit"
},
{
"_id": 3,
"itemName": "MacBook",
"itemID": 201,
"pageName": "List of Computers"
},
{
"_id": 4,
"itemName": "Desktop",
"itemID": 202,
"pageName": "List of Computers"
}
What I've tried
I have tried using MongoDB's aggregate functionality, but because there are multiple "itemName" fields in each document, it will add both of them to one Array - instead of one in each document.
db.collection.aggregate([
    {$project: {
        itemName: "$itemsList.itemName",
        itemID: "$itemsList.myID",
        pageName: "$pageName"
    }},
    {$out: "myNewCollection"}
])
I've also tried using PyMongo 3.x to loop through the document's fields and save as a new document, but haven't been successful.
Ways to implement it
I'm open to using MongoDB's aggregate functionality, if it can move these items to their own documents, or a Python script (3.x) - or any other means you think can help.
Thanks in advance for your help!

You just need a $unwind stage to "break" the array into one document per element. Then you can reshape the fields with $project and write the result to the new collection with $out.
Note that you didn't specify an exact requirement for _id, so you might need extra handling there. The demonstration below drops the original _id and relies on native _id generation, which assigns ObjectIds automatically.
db.collection.aggregate([
    {
        "$unwind": "$itemsList"
    },
    {
        "$project": {
            "_id": 0,
            "itemName": "$itemsList.itemName",
            "itemID": "$itemsList.myID",
            "pageName": "$pageName"
        }
    },
    {
        $out: "myNewCollection"
    }
])
Here is the Mongo playground for your reference.
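Since the question also allows a Python script, here is a minimal sketch of the same unwind-and-rename step in plain Python. It assumes the documents have already been loaded as dicts (for example from a PyMongo find() cursor). The function name unwind_items is hypothetical. The resulting list could then be written to the new collection with insert_many, which would leave _id generation to MongoDB.

```python
def unwind_items(docs):
    """Emit one flat document per entry of each document's itemsList."""
    flat = []
    for doc in docs:
        for item in doc.get("itemsList", []):
            flat.append({
                "itemName": item["itemName"],
                "itemID": item["myID"],      # renamed from myID
                "pageName": doc["pageName"],
            })
    return flat


# Sample data copied from the question.
pages = [
    {"_id": 1, "pageName": "List of Fruit",
     "itemsList": [{"myID": 101, "itemName": "Apple"},
                   {"myID": 102, "itemName": "Orange"}]},
    {"_id": 2, "pageName": "List of Computers",
     "itemsList": [{"myID": 201, "itemName": "MacBook"},
                   {"myID": 202, "itemName": "Desktop"}]},
]

new_docs = unwind_items(pages)
print(len(new_docs))  # 4
print(new_docs[0])
```

No _id is set here on purpose, matching the "_id": 0 projection in the pipeline above.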

Related

Filter with jsonpath-ng

Working with the following json data:
{
"data":
{
"level1":
[
{
"levelName": "level11",
"cost": 1,
"child":
{
"childName": "first",
"status": "running"
}
},
{
"levelName": "level12",
"cost": 2,
"child":
{
"childName": "second",
"status": "asleep"
}
}
]
}
}
A jsonpath search/filter using the expression
"$.data.level1[*][?(childName=='first')]"
correctly locates the data.
However, using the expression
"$.data.level1[*][?(levelName=='level11')]"
returns blank
How do I search at the "levelName": "level11" level?
In the latter case, if I have the "levelName": "level11" in the json data at the same level as "childName": "first", the search works successfully.
If I understand correctly you only need to slightly change your syntax to select the node in question:
$.data.level1[?(@.levelName="level11")]
We are already in the level1 array and can directly filter.
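To make the difference concrete without depending on any JSONPath library, here is the same selection sketched in plain Python. The original expression appears to fail because [*] has already stepped into each array element, so the filter is tested against that element's children (such as "child") rather than against the element itself. Filtering the level1 list directly avoids that.

```python
data = {"data": {"level1": [
    {"levelName": "level11", "cost": 1,
     "child": {"childName": "first", "status": "running"}},
    {"levelName": "level12", "cost": 2,
     "child": {"childName": "second", "status": "asleep"}},
]}}

# Filter the array elements themselves, not their children.
matches = [entry for entry in data["data"]["level1"]
           if entry.get("levelName") == "level11"]

print(matches[0]["child"]["childName"])  # first
```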

Get field value in MongoDB without parent object name

I'm trying to find a way to retrieve some data from MongoDB through Python scripts, but I got stuck on the following situation:
I have to retrieve some data, check a field value and compare it with other data (MongoDB documents).
But the object's name may vary from module to module, see below:
Document 1
{
"_id": "001",
"promotion": {
"Avocado": {
"id": "01",
"timestamp": "202005181407",
},
"Banana": {
"id": "02",
"timestamp": "202005181407",
}
},
"product" : {
"id" : "11"
}
}
Document 2
{
"_id": "002",
"promotion": {
"Grape": {
"id": "02",
"timestamp": "202005181407",
},
"Dragonfruit": {
"id": "02",
"timestamp": "202005181407",
}
},
"product" : {
"id" : "15"
}
}
I'll always have an object called promotion, but the child's name may vary: sometimes it's an ordered number, sometimes it is not. The field whose value I need is the id inside promotion, and it will always have the same name.
So if the document matches the criteria, I'll retrieve it with Python and get the rest of the work done.
PS.: I'm not the one responsible for this kind of document structure.
I've already tried the docs for the following operators, but couldn't get them to work the way I need.
$all
$elemMatch
Try this python pipeline:
[
    {
        '$addFields': {
            'fruits': {
                '$objectToArray': '$promotion'
            }
        }
    }, {
        '$addFields': {
            'FruitIds': '$fruits.v.id'
        }
    }, {
        '$project': {
            '_id': 0,
            'FruitIds': 1
        }
    }
]
Output produced:
{FruitIds:["01","02"]},
{FruitIds:["02","02"]}
Is this the desired output?
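If the documents are fetched first and checked in Python anyway, the same extraction can be sketched client-side. This is only an illustration of what $objectToArray does here, namely ignoring the varying child names and keeping their values. The helper name promotion_ids is hypothetical.

```python
def promotion_ids(doc):
    """Collect the id of every child of 'promotion', whatever the child is called."""
    return [entry["id"] for entry in doc.get("promotion", {}).values()]


doc1 = {
    "_id": "001",
    "promotion": {
        "Avocado": {"id": "01", "timestamp": "202005181407"},
        "Banana": {"id": "02", "timestamp": "202005181407"},
    },
    "product": {"id": "11"},
}
doc2 = {
    "_id": "002",
    "promotion": {
        "Grape": {"id": "02", "timestamp": "202005181407"},
        "Dragonfruit": {"id": "02", "timestamp": "202005181407"},
    },
    "product": {"id": "15"},
}

print(promotion_ids(doc1))  # ['01', '02']
print(promotion_ids(doc2))  # ['02', '02']
```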

How to join multiple collections in MongoDB (one to many relationship)?

I have two collections: document and citation. Their structures are shown below:
# document
{id:001, title:'foo'}
{id:002, title:'bar'}
{id:003, title:'abc'}
# citation
{from_id:001, to_id:002}
{from_id:001, to_id:003}
I want to query the information of cited documents (called references, which is denoted by to_id) of each document. In SQL, I would use the document table left joins citation, and then left joins document to get full information of the references (not just their ids).
However, I can only achieve the first step with $lookup in MongoDB. Here is my aggregate pipeline:
[
{'$lookup':{
'from': 'citation',
'localField': 'id',
'foreignField': 'from_id',
'as': 'references'
}}
]
I am able to get the following results with this pipeline:
{
id:001,
title:'foo',
references:[{from_id:001, to_id:002}, {from_id:001, to_id:003}]
}
The desired result is:
{
id:001,
title:'foo',
references:[{id:002, title:'bar'}, {id:003, title:'abc'}]
}
I have found this answer but it seems to be a one-to-one relationship that is not applicable in my case.
EDIT: Some people said that join should be avoided in MongoDB as it's not a relational database. I choose MongoDB because it's much faster than MySQL in my case.
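You need to $unwind the looked-up citations and run a second $lookup against the documents collection, then $group by _id to rebuild the references array.
Try the pipeline below. Note that it joins on _id and reads from a collection named doc; with the schema in the question you would use id and document instead: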
[
{
"$lookup": {
"from": "citation",
"localField": "_id",
"foreignField": "from_id",
"as": "references"
}
},
{
"$unwind": "$references"
},
{
"$lookup": {
"from": "doc",
"localField": "references.to_id",
"foreignField": "_id",
"as": "map"
}
},
{
"$unwind": "$map"
},
{
"$project": {
"_id": 1,
"title": 1,
"map_id": "$map._id",
"map_title": "$map.title"
}
},
{
"$group": {
"_id": "$_id",
"title": {
"$first": "$title"
},
"references": {
"$push": {
"id": "$map_id",
"title": "$map_title"
}
}
}
}
]
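To see what the pipeline computes, here is a rough in-memory sketch of the same two-step join in plain Python. It is not a replacement for $lookup. The ids are written as the integers 1, 2 and 3 instead of 001, 002 and 003, and the function name attach_references is hypothetical.

```python
def attach_references(documents, citations):
    """For each document, replace its outgoing citations by the cited documents."""
    by_id = {d["id"]: d for d in documents}
    result = []
    for doc in documents:
        refs = [
            {"id": by_id[c["to_id"]]["id"], "title": by_id[c["to_id"]]["title"]}
            for c in citations
            if c["from_id"] == doc["id"] and c["to_id"] in by_id
        ]
        # Like a plain $unwind, documents without any citation are dropped.
        if refs:
            result.append({"id": doc["id"], "title": doc["title"], "references": refs})
    return result


documents = [
    {"id": 1, "title": "foo"},
    {"id": 2, "title": "bar"},
    {"id": 3, "title": "abc"},
]
citations = [
    {"from_id": 1, "to_id": 2},
    {"from_id": 1, "to_id": 3},
]

joined = attach_references(documents, citations)
print(joined)
```

Documents 2 and 3 do not appear in the result because they cite nothing, which mirrors how $unwind discards documents whose references array is empty.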

Searching period and hyphen-delimited fields in Elasticsearch

I'm trying to find a way to use Elasticsearch to query a field that is both period and hyphen-delimited.
I have a (MySQL) data-set like this (using SQLAlchemy to access it):
id   text          tag
====================================
1    some-text     A.B.c3
2    more. text    A.B-C.c4
3    even more.    B.A-32.D-24.f9
The core reason I use ES for search in the first place is that I want to query against the text field. That part works awesome!
But (I think) I want the tag to appear in the inverted index like this (I probably won't take case into account, just including it for illustration):
A.B.c3            1
A.B-C.c4          2
B.A-32.D-24.f9    3
Then, I want to search the tag field like this:
{ "query": {
"prefix" : { "tag" : "A.B" }
}
}
And have the query return id/rows/documents 1 and 2.
Basically, I want the query to match the index(es) in this truth table:
"A." = 1, 2
"A-" = 3
How do I accomplish both the "A." match at the beginning, differentiate between a period and a hyphen (possibly boost this), and match mid-phrase based on those same delimiters?
I'd also like to weight these matches higher if they occur at the beginning of the tag field if possible.
How do I do this, or is Elasticsearch not the right tool for the job? It seems like Elasticsearch works great for my text-field comparisons on normally delimited English text, but the tag-based searches seem much harder.
UPDATE: It seems that when I index only a subset of the data that my searches return the results I would expect but when querying against the full data-set, I get fewer hits.
This can be done with the N-gram tokenizer.
Based on what you've provided in the question, I've created a corresponding mapping, sample documents and a sample query to give you what you are looking for.
Mapping
PUT idtesttag
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 5
}
}
}
},
"mappings": {
"mydocs": {
"properties": {
"id": {
"type": "long"
},
"text": {
"type": "text",
"analyzer": "my_analyzer"
},
"tag": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
What this does: if the document with id = 1 had the tag A.B, the following groups of characters would be stored in the inverted index.
A. -> 1
.B -> 1
A.B -> 1
So if your query contains any of these three tokens, the document with id=1 is returned.
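As a rough illustration of how those tokens come about, here is a small Python sketch that slices a string into character n-grams with the same min_gram and max_gram as the mapping. It only mimics the idea and is not Elasticsearch's implementation. The function name ngrams is hypothetical.

```python
def ngrams(text, min_gram=2, max_gram=5):
    """Return every substring of text whose length is between min_gram and max_gram."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(text):
                break  # longer grams from this position would run past the end
            grams.append(text[start:start + size])
    return grams


print(ngrams("A.B"))  # ['A.', 'A.B', '.B']
```

Because the period and hyphen are kept inside the grams, "A." and "A-" end up as different tokens.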
Sample Documents
POST idtesttag/mydocs/1
{
"id": 1,
"text": "some-text",
"tag": "A.B.c3"
}
POST idtesttag/mydocs/2
{
"id": 2,
"text": "more. text",
"tag": "A.B-C.c4"
}
POST idtesttag/mydocs/3
{
"id": 3,
"text": "even more.",
"tag": "B.A-32.D-24.f9"
}
POST idtesttag/mydocs/4
{
"id": 3,
"text": "even more.",
"tag": "B.A.B-32.D-24.f9"
}
Sample Query
POST idtesttag/_search
{
"query": {
"match": {
"tag": "A.B"
}
}
}
Query Response
{
"took": 139,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.8630463,
"hits": [
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "1",
"_score": 0.8630463,
"_source": {
"id": 1,
"text": "some-text",
"tag": "A.B.c3"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "2",
"_score": 0.66078395,
"_source": {
"id": 2,
"text": "more. text",
"tag": "A.B-C.c4"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "4",
"_score": 0.46659434,
"_source": {
"id": 3,
"text": "even more.",
"tag": "B.A.B-32.D-24.f9"
}
}
]
}
}
Note that documents 1, 2 and 4 are returned in the response. Document 4 is the mid-tag match, while documents 1 and 2 match at the beginning.
Also note how the score values differ.
Boosting based on hyphen
Now, with regard to boosting based on the hyphen character, I'd suggest a Bool query combined with a Regexp query that carries a boost. Below is the sample query I came up with.
Note that, for the sake of simplicity, the regex only boosts when the hyphen comes right after A.B.
POST idtesttag/_search
{
"query": {
"bool": {
"must" : {
"match" : { "tag" : "A.B" }
},
"should": [
{
"regexp": {
"tag": {
"value": "A.B-.*",
"boost": 3
}
}
}
]
}
}
}
Boosting Query Response
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 3.660784,
"hits": [
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "2",
"_score": 3.660784,
"_source": {
"id": 2,
"text": "more. text",
"tag": "A.B-C.c4"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "4",
"_score": 3.4665942,
"_source": {
"id": 3,
"text": "even more.",
"tag": "B.A.B-32.D-24.f9"
}
},
{
"_index": "idtesttag",
"_type": "mydocs",
"_id": "1",
"_score": 0.8630463,
"_source": {
"id": 1,
"text": "some-text",
"tag": "A.B.c3"
}
}
]
}
}
Just make sure your testing is thorough when it comes to boosting, because it's all about influencing the score, and make sure you do it with production data ingested into a DEV/TEST Elasticsearch index.
That way you won't be spooked by totally different results when you move to PROD.
I'm sorry it's a pretty long answer, but I hope this helps!
Based on what you've described in your post regarding the 'tag' field, here's my 2 cents.
Your MySQL data should be in one type (in 6.5 it's 'doc' by default). You do need to define your index mapping explicitly though, especially for the 'tag' field, as you have specific search requirements for it.
I would define your 'tag' field as a multi-field of:
type 'keyword' for aggregations
type 'text' for searches, with a custom analyzer (that might use the 'whitespace' tokenizer and an 'edge ngram' token filter)
(if you don't need aggregations, then just define a 'text' type field with the custom analyzer)
FYI, the Analyze API will show you what ES is doing with your 'tag' data, and will help you define the mapping that meets your requirements.
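To get a feel for what a whitespace tokenizer followed by an edge n-gram filter would index, here is a small Python approximation. The min_gram and max_gram values are assumptions picked for the illustration, and the function names are hypothetical. The Analyze API remains the authoritative way to inspect the real tokens.

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Return the prefixes of token with lengths from min_gram to max_gram."""
    upper = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, upper + 1)]


def analyze(tag, min_gram=1, max_gram=10):
    """Split on whitespace, then expand every token into its prefixes."""
    terms = []
    for token in tag.split():
        terms.extend(edge_ngrams(token, min_gram, max_gram))
    return terms


# Tags from the question, keyed by row id.
tags = {1: "A.B.c3", 2: "A.B-C.c4", 3: "B.A-32.D-24.f9"}
index = {doc_id: set(analyze(tag)) for doc_id, tag in tags.items()}

hits = sorted(doc_id for doc_id, terms in index.items() if "A.B" in terms)
print(hits)  # [1, 2]
```

With these settings only prefixes of the tag are indexed, so a mid-tag match such as "A-" in row 3 would not be found this way.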

Extracting values from deeply nested JSON structures

This is a structure I'm getting from elsewhere, that is, a list of deeply nested dictionaries:
{
"foo_code": 404,
"foo_rbody": {
"query": {
"info": {
"acme_no": "444444",
"road_runner": "123"
},
"error": "no_lunch",
"message": "runner problem."
}
},
"acme_no": "444444",
"road_runner": "123",
"xyzzy_code": 200,
"xyzzy_rbody": {
"api": {
"items": [
{
"desc": "OK",
"id": 198,
"acme_no": "789",
"road_runner": "123",
"params": {
"bicycle": "2wheel",
"willie": "hungry",
"height": "1",
"coyote_id": "1511111"
},
"activity": "TRAP",
"state": "active",
"status": 200,
"type": "chase"
}
]
}
}
}
{
"foo_code": 200,
"foo_rbody": {
"query": {
"result": {
"acme_no": "260060730303258",
"road_runner": "123",
"abyss": "26843545600"
}
}
},
"acme_no": "260060730303258",
"road_runner": "123",
"xyzzy_code": 200,
"xyzzy_rbody": {
"api": {
"items": [
{
"desc": "OK",
"id": 198,
"acme_no": "789",
"road_runner": "123",
"params": {
"bicycle": "2wheel",
"willie": "hungry",
"height": "1",
"coyote_id": "1511111"
},
"activity": "TRAP",
"state": "active",
"status": 200,
"type": "chase"
}
]
}
}
}
Asking for different structures is out of the question (legacy APIs etc.).
So I'm wondering if there's some clever way of extracting selected values from such a structure.
The candidates I was thinking of:
flatten particular dictionaries, building composite keys, something like:
{
"foo_rbody.query.info.acme_no": "444444",
"foo_rbody.query.info.road_runner": "123",
...
}
Pro: getting every value with one access and if predictable key is not there, it means that the structure was not there (as you might have noticed, dictionaries may have different structures depending on whether it was successful operation, error happened, etc).
Con: what to do with lists?
Use some recursive function that would do successive key lookups, say by "foo_rbody", then by "query", "info", etc.
Any better candidates?
You can try this rather trivial function to access nested properties:
import re

def get_path(dct, path):
    # Walk the path one segment at a time: digits index into lists, words look up dict keys.
    for i, p in re.findall(r'(\d+)|(\w+)', path):
        dct = dct[p or int(i)]
    return dct
Usage:
value = get_path(data, "xyzzy_rbody.api.items[0].params.bicycle")
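The flattening idea from the question can also be sketched in a few lines. One possible answer to "what to do with lists?" is to put the element index into the composite key. The function name flatten and the key format are assumptions, and the sample is a trimmed version of the structure in the question.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts and lists into one dict with composite keys."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            flat.update(flatten(child, path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}[{index}]"))
    else:
        flat[prefix] = value  # leaf value: store it under the accumulated path
    return flat


sample = {
    "foo_code": 404,
    "foo_rbody": {"query": {"info": {"acme_no": "444444"}, "error": "no_lunch"}},
    "xyzzy_rbody": {"api": {"items": [{"desc": "OK", "params": {"bicycle": "2wheel"}}]}},
}

flat = flatten(sample)
print(flat["foo_rbody.query.info.acme_no"])             # 444444
print(flat.get("foo_rbody.query.result.acme_no"))       # None
print(flat["xyzzy_rbody.api.items[0].params.bicycle"])  # 2wheel
```

A missing structure then simply shows up as a missing key, which is the advantage the question describes.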
Maybe the function byPath in my answer to this post might help you.
You could create your own path mechanism and then query the complicated dict with paths. Example:
/ : get the root object
/key: get the value of root_object['key'], e.g. /foo_code --> 404
/key/key: nesting: /foo_rbody/query/info/acme_no -> 444444
/key[i]: get ith element of that list, e.g. /xyzzy_rbody/api/items[0]/desc --> "OK"
The path can also return a dict which you then run more queries on, etc.
It would be fairly easy to implement recursively.
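A minimal recursive sketch of such a path mechanism might look like this. It is an illustration of the scheme listed above, not the byPath function mentioned earlier. The name query_path is hypothetical, and the sample is a trimmed version of the structure in the question.

```python
import re


def query_path(obj, path):
    """Resolve a slash-separated path such as /a/b[0]/c against nested dicts and lists."""
    path = path.strip("/")
    if not path:
        return obj  # "/" (or nothing left to resolve) returns the current object
    head, _, rest = path.partition("/")
    match = re.fullmatch(r"([^\[\]]+)((?:\[\d+\])*)", head)
    if match is None:
        raise ValueError(f"malformed path segment: {head!r}")
    key, indexes = match.groups()
    obj = obj[key]
    for index in re.findall(r"\[(\d+)\]", indexes):
        obj = obj[int(index)]  # apply each [i] in order
    return query_path(obj, rest)


sample = {
    "foo_code": 404,
    "foo_rbody": {"query": {"info": {"acme_no": "444444"}}},
    "xyzzy_rbody": {"api": {"items": [{"desc": "OK", "id": 198}]}},
}

print(query_path(sample, "/foo_code"))                       # 404
print(query_path(sample, "/foo_rbody/query/info/acme_no"))   # 444444
print(query_path(sample, "/xyzzy_rbody/api/items[0]/desc"))  # OK
```

A missing key or index raises KeyError or IndexError, which the caller can catch to detect that the expected structure was not there.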
I think about two more solutions:
You can try the Pynq package, described here: a structured query language for JSON (in Python). As far as I understand, it's a kind of LINQ for Python.
You may also try converting your JSON to XML and then using the XQuery language to get data from it: XQuery library under Python
