I'm using MongoDB Compass to export my data as a CSV file, but I can only select which fields I want to export, not specific elements within a field.
MongoDB export data:
Actually, I'm only interested in saving the "scores" for objects 0, 1 and 2.
Here is a screenshot from MongoDB Compass:
Is this something I should handle with Python?
One option could be to "rewrite" "scoreTable" so that each element's "scores" array holds at most 3 entries, and then "$out" the result to a new collection that can be exported in full.
db.devicescores.aggregate([
    {
        "$set": {
            // Rebuild "scoreTable", keeping at most the first 3 entries of each "scores" array
            "scoreTable": {
                "$map": {
                    "input": "$scoreTable",
                    "as": "player",
                    "in": {
                        "$mergeObjects": [
                            "$$player",
                            {"scores": {"$slice": ["$$player.scores", 3]}}
                        ]
                    }
                }
            }
        }
    },
    // Write the reshaped documents to a new collection that can be exported in full
    {"$out": "outCollection"}
])
Try it on mongoplayground.net.
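If you would rather drive this from Python, the same pipeline can be run through PyMongo and the trimmed collection exported afterwards. A minimal sketch, assuming a local connection string and a database named mydb (the collection and field names are taken from the pipeline above; adjust everything else to your deployment):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed connection string
db = client["mydb"]                                 # assumed database name

pipeline = [
    {"$set": {
        "scoreTable": {
            "$map": {
                "input": "$scoreTable",
                "as": "player",
                "in": {"$mergeObjects": [
                    "$$player",
                    {"scores": {"$slice": ["$$player.scores", 3]}}
                ]}
            }
        }
    }},
    {"$out": "outCollection"}
]

db.devicescores.aggregate(pipeline)
# "outCollection" can now be exported in full from Compass (or with mongoexport).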
I need to modify the structure of this json file:
[
{
"id":"3333",
"properties":{
"label":"Computer",
"name":"My-Laptop"
}
},
{
"id":"9998",
"type":"file_system",
"properties":{
"mount_point":"/opt",
"name":"/dev/mapper/rhel-opt",
"root_container":"3333"
},
"label":"FileSystem"
},
{
"id":"9999",
"type":"file_system",
"properties":{
"mount_point":"/var",
"name":"/dev/mapper/rhel-var",
"root_container":"3333"
},
"label":"FileSystem"
}
]
in order to have this kind of output:
[
{
"id":"3333",
"properties":{
"label":"Computer",
"name":"My-Laptop",
"file_system":[
"/opt",
"/var"
]
}
}
]
The idea is to have, in the new JSON structure, the laptop with its two file-system partitions in an array named "file_system".
As you can see, the two partitions are related to the first object through its id and their root_container.
Now imagine having not just one laptop but thousands of laptops, each with a different id and its own partitions, related to the laptop by the root_container key.
Is there a way to do this with jq functions or a Python script?
Many thanks
You could employ reduce to iterate over the items while extracting their id, mount_point and root_container. Then, if a root_container is present, delete that entry and add its mount_point to the entry whose id matches the root_container. For convenience, I also employed INDEX on the items' id fields to simplify access as .[$id] and .[$root_container], which has to be undone at the end using map(.).
jq '
reduce .[] as {$id, properties: {$mount_point, $root_container}} (
INDEX(.id);
if $root_container then
del(.[$id])
| .[$root_container].properties.file_system += [$mount_point]
else . end
)
| map(.)
'
[
{
"id": "3333",
"properties": {
"label": "Computer",
"name": "My-Laptop",
"file_system": [
"/opt",
"/var"
]
}
}
]
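If you prefer Python over jq, the same regrouping can be done with a plain dictionary keyed by id. A minimal sketch, assuming the input is a list of dicts shaped like your example (the field names come from your sample; the input file name is illustrative):

import json

def collapse(items):
    by_id = {item["id"]: item for item in items}        # index items by id
    for item in list(by_id.values()):
        root = item["properties"].get("root_container")
        if root is not None:
            # Attach this partition's mount_point to its parent laptop ...
            parent = by_id[root]
            parent["properties"].setdefault("file_system", []).append(
                item["properties"]["mount_point"]
            )
            # ... and drop the partition entry itself
            del by_id[item["id"]]
    return list(by_id.values())

with open("input.json") as f:                            # assumed input file name
    print(json.dumps(collapse(json.load(f)), indent=2))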
I am currently trying to import a lot of JSON files into MongoDB. Some of the JSONs are simple, with just key:value pairs, and those uploads I can query just fine from Python.
Example
[
{
"platform_id": 28,
"mhz": 2400,
"version": "1.1.1l"
}
]
MongoDB Compass shows it like this:
The problem lies with one of the tools, which creates a doc in Mongo that I cannot figure out how to query. The tool creates a JSON with system information that is pushed into the db. Example:
...
{
"systeminfo": [
{
"component": "system board",
"description": "sys board123"
},
{
"component": "bios",
"version": "xyz",
"date": "06/28/2021"
},
{
"component": "processors",
"htt": true,
"turbo": false
},
...
and so on, for a total of 23 objects.
If I push it directly into MongoDB, it looks like this in Compass:
So the question is: is there a way to collapse the hardware JSON one level, or a way to query the db as-is? I have found a way to collapse the JSON, but it moves each key/value pair into a new dictionary for upload, and every parameter has to be handled individually. That is not sustainable, as the tool is constantly adding new fields and my app needs to handle the changes.
Here is an example of the HW query; using the same pattern works fine for the other collection:
db = myclient['db_name']
col = db['HW_collection']
myquery = {"component": "processors"}
mydoc = col.find(myquery)
The follow-up issue that almost always arises from {"systeminfo.component":"processors"} is that the whole doc will be returned for any array that contains at least one "processors" entry: matching does not mean filtering. Below is a slightly more comprehensive solution that also "collapses" the processor info into the top-level doc.
Assume input is something like this:
{
"doc":1, "systeminfo": [
{"component": "system board","description": "sys board123"},
{"component": "bios","version": "xyz","date": "06/28/2021"},
{"component": "processors","htt": true,"turbo": false}
]
},{
"doc":2, "systeminfo": [
{"component": "RAM","description": "64G DIMM"},
{"component": "processors","htt": false,"turbo": false},
{"component": "bios","version": "abc","date": "06/28/2018"}
]
},{
"doc":3, "systeminfo": [
{"component": "RAM","description": "32G DIMM"},
{"component": "SCSI","version": "X","date": "01/01/2000"}
]
}
then
db.foo.aggregate([
{$project: {
doc: true, // carry doc num along for ride
// Walk the $systeminfo array and filter for component = processors and
// assign to field P (temporary field, any name is fine):
P: {$filter: {input: "$systeminfo", as: "z",
cond: {$eq:["$$z.component","processors"]} }}
}}
// Remove docs that had no processors:
,{$match: {P: {$ne:[]}}}
// A little complex but read it "backwards" to better understand. The P
// array will be left with 1 entry for processors. "Lift" that doc out of
// the array with $arrayElemAt[0] and merge it with the info in the containing
// top level doc which is $$CURRENT, and then make that merged entity the
// new root (essentially the new $$CURRENT)
,{$replaceRoot: {newRoot: {$mergeObjects: [ {$arrayElemAt:["$P",0]}, "$$CURRENT" ]}} }
// Get rid of the tmp field:
,{$unset: "P"}
]);
yields
{
"component" : "processors",
"htt" : true,
"turbo" : false,
"_id" : ObjectId("61eab547ba7d8bb5090611ee"),
"doc" : 1
}
{
"component" : "processors",
"htt" : false,
"turbo" : false,
"_id" : ObjectId("61eab547ba7d8bb5090611ef"),
"doc" : 2
}
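Since the querying is done from Python, the same pipeline can be run through PyMongo. A minimal sketch, reusing the 'db_name' and 'HW_collection' names from your snippet (the connection string is an assumption):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["db_name"]   # assumed connection string

pipeline = [
    # Keep only the "processors" entry of the systeminfo array in a temp field P
    {"$project": {
        "doc": True,
        "P": {"$filter": {
            "input": "$systeminfo",
            "as": "z",
            "cond": {"$eq": ["$$z.component", "processors"]}
        }}
    }},
    # Drop docs that had no processors entry
    {"$match": {"P": {"$ne": []}}},
    # Merge the single P entry into the top-level doc and make it the new root
    {"$replaceRoot": {"newRoot": {"$mergeObjects": [{"$arrayElemAt": ["$P", 0]}, "$$CURRENT"]}}},
    # Remove the temp field
    {"$unset": "P"}
]

for doc in db["HW_collection"].aggregate(pipeline):
    print(doc)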
For example, if this is my record
{
"_id":"123",
"name":"google",
"ip_1":"10.0.0.1",
"ip_2":"10.0.0.2",
"ip_3":"10.0.1",
"ip_4":"10.0.1",
"description":""}
I want to get only the fields starting with 'ip_'. Consider that I have 500 fields and only 15 of them start with 'ip_'.
Can we do something like this to get the output -
db.collection.find({id:"123"}, {'ip*':1})
Output -
{
"ip_1":"10.0.0.1",
"ip_2":"10.0.0.2",
"ip_3":"10.0.1",
"ip_4":"10.0.1"
}
The following aggregate query, using PyMongo, returns only the fields whose names start with "ip_".
Note the aggregation operators used: $filter, $regexMatch, $objectToArray, $arrayToObject. The pipeline has two stages, $project and $replaceWith.
import pprint

# "collection" is assumed to be a PyMongo collection handle
pipeline = [
    {
        "$project": {
            "ipFields": {
                "$filter": {
                    "input": { "$objectToArray": "$$ROOT" },
                    "cond": { "$regexMatch": { "input": "$$this.k", "regex": "^ip" } }
                }
            }
        }
    },
    {
        "$replaceWith": { "$arrayToObject": "$ipFields" }
    }
]

pprint.pprint(list(collection.aggregate(pipeline)))
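If you only want this for a single document, as in the find({id:"123"}) example, a $match stage can be prepended. A sketch, assuming the identifier is the "_id" shown in your record:

pipeline = [
    { "$match": { "_id": "123" } },   # restrict to the one record first
    { "$project": { "ipFields": { "$filter": {
        "input": { "$objectToArray": "$$ROOT" },
        "cond": { "$regexMatch": { "input": "$$this.k", "regex": "^ip" } } } } } },
    { "$replaceWith": { "$arrayToObject": "$ipFields" } }
]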
I am unaware of a way to specify an expression that would decide which hash keys would be projected. MongoDB has projection operators but they deal with arrays and text search.
If you have a fixed possible set of ip fields, you can simply request all of them regardless of which fields are present in a particular document, e.g. project with
{ip_1: true, ip_2: true, ...}
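In PyMongo such a fixed projection can be built programmatically instead of typed out. A sketch, assuming the fields really are named ip_1 through ip_15 and using placeholder database/collection names:

from pymongo import MongoClient

collection = MongoClient()["db_name"]["my_collection"]   # assumed names

projection = {f"ip_{i}": True for i in range(1, 16)}     # {"ip_1": True, ..., "ip_15": True}
projection["_id"] = False                                # omit _id unless you need it

print(collection.find_one({"_id": "123"}, projection))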
A document format I ingest into ElasticSearch looks like this:
{
'id':'514d4e9f-09e7-4f13-b6c9-a0aa9b4f37a0',
'created':'2019-09-06 06:09:33.044433',
'meta':{
'userTags':[
{
'intensity':'1',
'sentiment':'0.84',
'keyword':'train'
},
{
'intensity':'1',
'sentiment':'-0.76',
'keyword':'amtrak'
}
]
}
}
...ingested with python:
r = requests.put(itemUrl, auth = authObj, json = document, headers = headers)
The idea here is that Elasticsearch will treat keyword, intensity and sentiment as fields that can later be queried. However, on the Elasticsearch side I can see that this is not happening (I use Kibana for the search UI) -- instead, I see a field "meta.userTags" whose value is the whole list of objects.
How can I make ElasticSearch index elements within a list?
I used the document body you provided to create a new index 'testind' and type 'testTyp' using the Postman REST client:
POST http://localhost:9200/testind/testTyp
{
"id":"514d4e9f-09e7-4f13-b6c9-a0aa9b4f37a0",
"created":"2019-09-06 06:09:33.044433",
"meta":{
"userTags":[
{
"intensity":"1",
"sentiment":"0.84",
"keyword":"train"
},
{
"intensity":"1",
"sentiment":"-0.76",
"keyword":"amtrak"
}
]
}
}
When I queried the index's mapping, this is what I get:
GET http://localhost:9200/testind/testTyp/_mapping
{
"testind":{
"mappings":{
"testTyp":{
"properties":{
"created":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"id":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"meta":{
"properties":{
"userTags":{
"properties":{
"intensity":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"keyword":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"sentiment":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
}
}
}
}
}
}
}
}
}
As you can see in the mapping, the fields are part of the mapping and can be queried as needed in the future, so I don't see a problem here as long as the field names are not one of these reserved words: https://www.elastic.co/guide/en/elasticsearch/reference/6.4/sql-syntax-reserved.html (you might also want to avoid the term 'keyword', as it can be confusing later when writing search queries since the field name and the type are both 'keyword'). Also note that the mapping gets created via dynamic mapping (https://www.elastic.co/guide/en/elasticsearch/reference/6.3/dynamic-field-mapping.html#dynamic-field-mapping), so the data types are determined by Elasticsearch based on the values you have provided. This may not always be accurate, so to prevent that you can use the PUT mapping API to define your own mapping for the index, and then prevent new fields within a type from being added to mappings.
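For example, a minimal sketch of creating the index with an explicit mapping up front, using the same requests-based approach as your ingestion code (the localhost URL, the chosen field types and the "dynamic": "strict" setting are illustrative assumptions for a 6.x cluster, not your exact setup):

import requests

mapping = {
    "mappings": {
        "testTyp": {
            "dynamic": "strict",          # reject fields that are not declared below
            "properties": {
                "meta": {
                    "properties": {
                        "userTags": {
                            "properties": {
                                "intensity": {"type": "integer"},
                                "sentiment": {"type": "float"},
                                "keyword":   {"type": "keyword"}
                            }
                        }
                    }
                }
            }
        }
    }
}

# Create the index with the explicit mapping before ingesting any documents
r = requests.put("http://localhost:9200/testind", json=mapping)
print(r.json())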
You don't need a special mapping to index a list - every field can contain one or more values of the same type. See the array datatype.
In the case of a list of objects, they can be indexed as the object or the nested datatype. By default, Elasticsearch uses the object datatype. In this case you can query meta.userTags.keyword and/or meta.userTags.sentiment. The result will always contain whole documents, with values matched independently of each other, i.e. searching for keyword=train and sentiment=-0.76 you WILL find the document with id=514d4e9f-09e7-4f13-b6c9-a0aa9b4f37a0.
If this is not what you want, you need to define a nested datatype mapping for the field userTags and use a nested query.
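For illustration, a minimal sketch of a nested mapping plus a nested query against a 6.x cluster, reusing the field names from above (the index name, type name and client setup are assumptions):

import requests

base = "http://localhost:9200/testind_nested"   # assumed new index

# Declare userTags as a nested field so each tag object is matched as a unit
requests.put(base, json={
    "mappings": {
        "testTyp": {
            "properties": {
                "meta": {
                    "properties": {
                        "userTags": {"type": "nested"}
                    }
                }
            }
        }
    }
})

# A nested query only matches when keyword AND sentiment hold within the SAME tag object
query = {
    "query": {
        "nested": {
            "path": "meta.userTags",
            "query": {
                "bool": {
                    "must": [
                        {"match": {"meta.userTags.keyword": "train"}},
                        {"match": {"meta.userTags.sentiment": "0.84"}}
                    ]
                }
            }
        }
    }
}
print(requests.get(base + "/_search", json=query).json())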
I'm a little stuck on how to re-index data in Elasticsearch after a mapping or a data type has been changed.
According to the Elasticsearch docs:
Pull the documents in from your old index, using a scrolled search and index them into the new index using the bulk API. Many of the client APIs provide a reindex() method which will do all of this for you. Once you are done, you can delete the old index.
This is my old mapping
{
"test-index2": {
"mappings": {
"business": {
"properties": {
"address": {
"type": "nested",
"properties": {
"country": {
"type": "string"
},
"full_address": {
"type": "string"
}
}
}
}
}
}
}
}
New index mapping; I'm changing full_address -> location_address:
{
"test-index2": {
"mappings": {
"business": {
"properties": {
"address": {
"type": "nested",
"properties": {
"country": {
"type": "string"
},
"location_address": {
"type": "string"
}
}
}
}
}
}
}
}
I'm using the python client for elasticsearch
https://elasticsearch-py.readthedocs.org/en/master/helpers.html#elasticsearch.helpers.reindex
from elasticsearch import Elasticsearch
from elasticsearch.helpers import reindex
es = Elasticsearch(["es.node1"])
reindex(es, "source_index", "target_index")
However, this only transfers the data from one index to another.
How can I use it to change the mappings/data types for my case above?
It's straightforward if you use scan&scroll and the Bulk API, both already implemented in the Python client for Elasticsearch:
First, fetch all the documents with the scan&scroll method.
Then loop through them and make the necessary modifications to each document.
Finally, insert the modified documents into a new index using the Bulk API.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

# Use the scan&scroll helper to fetch all documents from your old index
res = helpers.scan(es, query={
    "query": {
        "match_all": {}
    },
    "size": 1000
}, index="old_index")

new_insert_data = []
# Loop through all your documents and make the necessary modifications
for x in res:
    # Re-target each document at the new index (it must already exist with the new mapping)
    x['_index'] = 'new_index'
    # Rename "full_address" to "location_address" inside the "address" field
    addr = x['_source'].get('address')
    if isinstance(addr, list):
        for a in addr:
            a['location_address'] = a.pop('full_address')
    elif isinstance(addr, dict):
        addr['location_address'] = addr.pop('full_address')
    # The score returned by scan is not needed for indexing
    x.pop('_score', None)
    # Add the modified document to the list
    new_insert_data.append(x)

# Use the Bulk API to insert the list of modified documents into the new index
helpers.bulk(es, new_insert_data)
es.indices.refresh(index="new_index")
print(new_insert_data)
The reindex() API simply "moves" documents from one index to another. There is no way for it to detect or infer that the field name full_address in documents of the old index should become location_address in documents of the new index, and I doubt any standard Elasticsearch client provides an API that does this for you. The only way I can think of achieving it is with additional custom logic on the client side: maintain a dictionary mapping field names in the old index to field names in the new index, read documents from the old index, and index the corresponding documents into the new index with the new field names taken from that dictionary.
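A minimal sketch of that client-side approach, using the scan and bulk helpers (the rename map, the index names and the assumption that the field lives under a single "address" object are all illustrative):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

# Dictionary of field names: old index -> new index
RENAMES = {"full_address": "location_address"}

def renamed_docs():
    for hit in helpers.scan(es, index="source_index", query={"query": {"match_all": {}}}):
        src = hit["_source"]
        addr = src.get("address") or {}
        for old, new in RENAMES.items():
            if old in addr:
                addr[new] = addr.pop(old)
        yield {"_index": "target_index", "_source": src}

helpers.bulk(es, renamed_docs())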
After updating the mapping, this can be done by updating the existing documents using the Bulk API.
POST /_bulk
{"update":{"_id":"59519","_type":"asset","_index":"assets"}}
{"doc":{"facility_id":491},"detect_noop":false}
Note: the 'detect_noop' option controls whether an update that makes no change is detected as a noop; it is set to false here so the update is always applied.
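The same request can be sent from the Python client as well. A minimal sketch, reusing the index/type/_id from the bulk call above and assuming a pre-8.x elasticsearch-py client (which accepts a list of action/body dicts for bulk):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Mirrors the raw _bulk request above: one action line followed by one partial-doc line
body = [
    {"update": {"_id": "59519", "_type": "asset", "_index": "assets"}},
    {"doc": {"facility_id": 491}, "detect_noop": False},
]
print(es.bulk(body=body))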