Python [mongo] - convert return fields of find()

I need to get specific fields from MongoDB. The DB is huge, so I'd prefer to get the values back in the right format rather than post-processing them.
For example, there are two fields whose format needs converting:
1. _id: ObjectId('604e0dbc96a0c93a45bfc5b0') to the string "604e0dbc96a0c93a45bfc5b0"
2. birthdate: ISODate('1999-11-10T00:00:00.000Z') to a string in date format "10/11/1999"
Example of a JSON document in MongoDB:
{
  _id: ObjectId('604e0dbc96a0c93a45bfc5b0'),
  address: 'BOB addrees',
  name: 'BOB',
  last_name: 'Habanero',
  birthdate: ISODate('1000-11-10T00:00:00.000Z')
}
Retrieving the specific fields:
customers_cursor = DB.customer.find({}, {"_id": 1, "name": 1, "last_name": 1, "customer_type": 1, "address.0": 1, "email": 1, "birthdate": 1, "customer_status": 1})
Is there an option to use conversion functions on the values returned by find()?
If not, what are my best options for formatting the values of several fields when there are millions of records in MongoDB?

Demo - https://mongoplayground.net/p/4OcF0O74PvU
You have to use an aggregation query to do that:
Convert the ObjectId to a string using $toString.
Use $dateToString to format the date.
db.collection.aggregate([
  {
    "$project": {
      "_id": {
        "$toString": "$_id"
      },
      "name": 1,
      "last_name": 1,
      "customer_type": 1,
      "address.0": 1,
      "email": 1,
      "birthdate": {
        "$dateToString": {
          "format": "%d/%m/%Y",
          "date": "$birthdate"
        }
      },
      "customer_status": 1
    }
  }
])
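For completeness, the same pipeline can be issued from PyMongo; a minimal sketch, assuming a local server and a database named "mydb" (the pipeline itself mirrors the answer above):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["mydb"]                                # placeholder database name

pipeline = [
    {
        "$project": {
            "_id": {"$toString": "$_id"},
            "name": 1,
            "last_name": 1,
            "customer_type": 1,
            "address.0": 1,
            "email": 1,
            "birthdate": {
                "$dateToString": {"format": "%d/%m/%Y", "date": "$birthdate"}
            },
            "customer_status": 1
        }
    }
]

for customer in db.customer.aggregate(pipeline):
    print(customer)  # _id and birthdate arrive already converted to strings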

Related

Export part of a data field - MongoDB

I'm using MongoDB Compass to export my data as a CSV file, but I can only choose which fields to export, not elements within a specific field.
MongoDB export data:
Actually, I'm only interested in saving the "scores" for objects 0, 1 and 2.
Here is a screenshot from MongoDB Compass:
Is this something I should handle with Python?
One option could be to "rewrite" "scoreTable" so that there are a maximum of 3 elements in the "scores" array and then "$out" to a new collection that can be exported in full.
db.devicescores.aggregate([
  {
    "$set": {
      "scoreTable": {
        "$map": {
          "input": "$scoreTable",
          "as": "player",
          "in": {
            "$mergeObjects": [
              "$$player",
              {"scores": {"$slice": ["$$player.scores", 3]}}
            ]
          }
        }
      }
    }
  },
  {"$out": "outCollection"}
])
Try it on mongoplayground.net.
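If you'd rather run that from Python than from the shell, a minimal PyMongo sketch (connection string and database name are placeholders; the pipeline mirrors the one above):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["mydb"]                                # placeholder database name

db.devicescores.aggregate([
    {
        "$set": {
            "scoreTable": {
                "$map": {
                    "input": "$scoreTable",
                    "as": "player",
                    "in": {
                        "$mergeObjects": [
                            "$$player",
                            {"scores": {"$slice": ["$$player.scores", 3]}}
                        ]
                    }
                }
            }
        }
    },
    {"$out": "outCollection"}
])
# "outCollection" now holds the trimmed documents and can be exported in full
# from Compass or mongoexport.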

Bulk Update for elasticsearch documents using Python

I have Elasticsearch documents like the ones below, where I need to rectify the age value based on creationtime and currentdate (age = creationtime - currentdate):
hits = [
    {
        "_id": "CrRvuvcC_uqfwo-WSwLi",
        "creationtime": "2018-05-20T20:57:02",
        "currentdate": "2021-02-05 00:00:00",
        "age": "60 months"
    },
    {
        "_id": "CrRvuvcC_uqfwo-WSwLi",
        "creationtime": "2013-07-20T20:57:02",
        "currentdate": "2021-02-05 00:00:00",
        "age": "60 months"
    },
    {
        "_id": "CrRvuvcC_uqfwo-WSwLi",
        "creationtime": "2014-08-20T20:57:02",
        "currentdate": "2021-02-05 00:00:00",
        "age": "60 months"
    },
    {
        "_id": "CrRvuvcC_uqfwo-WSwLi",
        "creationtime": "2015-09-20T20:57:02",
        "currentdate": "2021-02-05 00:00:00",
        "age": "60 months"
    }
]
I want to do a bulk update based on each document's ID, but the problem is that I need to correct 6 months of data, and the index contains about 535,329 documents. I want to efficiently bulk-update the age based on _id for each day across all documents using Python.
Is there a way to do this without looping through everything? All the examples I came across that use Pandas DataFrames for updates are based on a known value, but here I only get the _id when the code runs.
The logic I had written was to fetch all documents, store their _ids, and then update the age for each _id. But that is not an efficient way to update all documents in bulk for each day over 6 months.
Can anyone give me some ideas or point me in the right direction?
As mentioned in the comments, fetching the IDs won't be necessary. You don't even need to fetch the documents themselves!
A single _update_by_query call will be enough. You can use ChronoUnit to get the difference after you've parsed the dates:
POST your-index-name/_update_by_query
{
  "query": {
    "match_all": {}
  },
  "script": {
    "source": """
      def created = LocalDateTime.parse(ctx._source.creationtime, DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss"));
      def currentdate = LocalDateTime.parse(ctx._source.currentdate, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
      def months = ChronoUnit.MONTHS.between(created, currentdate);
      ctx._source.age = months + ' month' + (months > 1 ? 's' : '');
    """,
    "lang": "painless"
  }
}
The official Python client has this method too.
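As an illustration (the originally linked example isn't reproduced here), a minimal sketch using the client's update_by_query helper; the endpoint, index name and the 7.x-style body argument are assumptions:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

script_source = """
    def created = LocalDateTime.parse(ctx._source.creationtime, DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss"));
    def currentdate = LocalDateTime.parse(ctx._source.currentdate, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
    def months = ChronoUnit.MONTHS.between(created, currentdate);
    ctx._source.age = months + ' month' + (months > 1 ? 's' : '');
"""

es.update_by_query(
    index="your-index-name",            # placeholder index name
    body={
        "query": {"match_all": {}},     # narrow this down for a test run first
        "script": {"source": script_source, "lang": "painless"},
    },
)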
🔑 Try running this update script on a small subset of your documents before letting it loose on your whole index, by adding a query other than the match_all I put there.
💡 It's worth mentioning that unless you search on this age field, it doesn't need to be stored in your index because it can be calculated at query time.
You see, if your index mapping's dates are properly defined like so:
{
  "mappings": {
    "properties": {
      "creationtime": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ss"
      },
      "currentdate": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      },
      ...
    }
  }
}
the age can be calculated as a script field:
POST ttimes/_search
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "age_calculated": {
      "script": {
        "source": """
          def months = ChronoUnit.MONTHS.between(
            doc['creationtime'].value,
            doc['currentdate'].value );
          return months + ' month' + (months > 1 ? 's' : '');
        """
      }
    }
  }
}
The only caveat is that the value won't be inside _source but rather in its own group called fields (which implies that multiple script fields are possible at once!).
"hits" : [
{
...
"_id" : "FFfPuncBly0XYOUcdIs5",
"fields" : {
"age_calculated" : [ "32 months" ] <--
}
},
...
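From Python, the same script field can be requested through the client's search call; a small sketch, with the endpoint and the 7.x-style body argument as assumptions:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

resp = es.search(
    index="ttimes",  # index name from the example above
    body={
        "query": {"match_all": {}},
        "script_fields": {
            "age_calculated": {
                "script": {
                    "source": """
                        def months = ChronoUnit.MONTHS.between(
                            doc['creationtime'].value,
                            doc['currentdate'].value);
                        return months + ' month' + (months > 1 ? 's' : '');
                    """
                }
            }
        },
    },
)

# Script fields come back under "fields", not "_source".
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["fields"]["age_calculated"])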

PyMongo: include only the fields starting with a given name

For example, if this is my record
{
  "_id": "123",
  "name": "google",
  "ip_1": "10.0.0.1",
  "ip_2": "10.0.0.2",
  "ip_3": "10.0.1",
  "ip_4": "10.0.1",
  "description": ""
}
I want to get only those fields starting with 'ip_'. Suppose I have 500 fields and only 15 of them start with 'ip_'.
Can we do something like this to get the output -
db.collection.find({id:"123"}, {'ip*':1})
Output -
{
  "ip_1": "10.0.0.1",
  "ip_2": "10.0.0.2",
  "ip_3": "10.0.1",
  "ip_4": "10.0.1"
}
The following aggregate query, using PyMongo, returns documents with only the field names starting with "ip_".
Note the various aggregation operators used: $filter, $regexMatch, $objectToArray, $arrayToObject. The aggregation pipeline has two stages: $project and $replaceWith.
import pprint
from pymongo import MongoClient

collection = MongoClient()["mydb"]["mycollection"]  # placeholder database/collection names

pipeline = [
    {
        "$project": {
            "ipFields": {
                "$filter": {
                    "input": {"$objectToArray": "$$ROOT"},
                    "cond": {"$regexMatch": {"input": "$$this.k", "regex": "^ip"}}
                }
            }
        }
    },
    {
        "$replaceWith": {"$arrayToObject": "$ipFields"}
    }
]
pprint.pprint(list(collection.aggregate(pipeline)))
I am unaware of a way to specify an expression that would decide which hash keys would be projected. MongoDB has projection operators but they deal with arrays and text search.
If you have a fixed possible set of ip fields, you can simply request all of them regardless of which fields are present in a particular document, e.g. project with
{ip_1: true, ip_2: true, ...}
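In PyMongo that is just an ordinary find() projection; a small sketch, with placeholder client/collection names and the ip_* list taken from the example document:

from pymongo import MongoClient

collection = MongoClient()["mydb"]["mycollection"]  # placeholder names

# Request the known ip_* fields explicitly; fields missing from a document
# are simply omitted from its result.
doc = collection.find_one(
    {"_id": "123"},
    {"ip_1": True, "ip_2": True, "ip_3": True, "ip_4": True, "_id": False},
)
print(doc)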

How to index a list of objects in Elasticsearch?

A document format I ingest into ElasticSearch looks like this:
{
'id':'514d4e9f-09e7-4f13-b6c9-a0aa9b4f37a0'
'created':'2019-09-06 06:09:33.044433',
'meta':{
'userTags':[
{
'intensity':'1',
'sentiment':'0.84',
'keyword':'train'
},
{
'intensity':'1',
'sentiment':'-0.76',
'keyword':'amtrak'
}
]
}
}
...ingested with python:
r = requests.put(itemUrl, auth = authObj, json = document, headers = headers)
The idea here is that ElasticSearch will treat keyword, intensity and sentiment as fields that can be queried later. However, on the ElasticSearch side I can observe that this is not happening (I use Kibana as the search UI) -- instead, I see a field "meta.userTags" whose value is the whole list of objects.
How can I make ElasticSearch index the elements within the list?
I used the document body you provided to create a new index 'testind' and type 'testTyp' using the Postman REST client:
POST http://localhost:9200/testind/testTyp
{
  "id": "514d4e9f-09e7-4f13-b6c9-a0aa9b4f37a0",
  "created": "2019-09-06 06:09:33.044433",
  "meta": {
    "userTags": [
      {
        "intensity": "1",
        "sentiment": "0.84",
        "keyword": "train"
      },
      {
        "intensity": "1",
        "sentiment": "-0.76",
        "keyword": "amtrak"
      }
    ]
  }
}
When I queried the index's mapping, this is what I got:
GET http://localhost:9200/testind/testTyp/_mapping
{
  "testind": {
    "mappings": {
      "testTyp": {
        "properties": {
          "created": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "id": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "meta": {
            "properties": {
              "userTags": {
                "properties": {
                  "intensity": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  },
                  "keyword": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  },
                  "sentiment": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
As you can see, the fields are part of the mapping and can be queried as needed in the future, so I don't see a problem here as long as the field names are not among the reserved words listed at https://www.elastic.co/guide/en/elasticsearch/reference/6.4/sql-syntax-reserved.html (you might want to avoid the term 'keyword', as it can be confusing later when writing search queries, since the field name and the type are both 'keyword'). Also note that the mapping is created via dynamic mapping (https://www.elastic.co/guide/en/elasticsearch/reference/6.3/dynamic-field-mapping.html#dynamic-field-mapping) in Elasticsearch, so the data types are determined by Elasticsearch based on the values you provide. However, this may not always be accurate, so to prevent that you can use the PUT _mapping API to define your own mapping for the index and then prevent new fields within a type from being added to the mappings.
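As a rough illustration of that last point, a hedged sketch with the Python client; it assumes a freshly created index (existing field types cannot be changed in place, so the mapping is defined at index-creation time rather than via PUT _mapping) and a 6.x-style mapping type:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Hypothetical explicit mapping: "dynamic": "strict" rejects unexpected new
# fields, and the tag attributes get concrete types instead of dynamic "text".
es.indices.create(
    index="testind",
    body={
        "mappings": {
            "testTyp": {  # mapping type; drop this level on 7.x+ clusters
                "dynamic": "strict",
                "properties": {
                    "id": {"type": "keyword"},
                    "created": {"type": "text"},  # or a date type with an explicit format
                    "meta": {
                        "properties": {
                            "userTags": {
                                "properties": {
                                    "intensity": {"type": "integer"},
                                    "sentiment": {"type": "float"},
                                    "keyword": {"type": "keyword"}
                                }
                            }
                        }
                    }
                }
            }
        }
    }
)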
You don't need a special mapping to index a list - every field can contain one or more values of the same type. See array datatype.
In the case of a list of objects, they can be indexed as the object or nested datatype. By default Elastic uses the object datatype. In this case you can query meta.userTags.keyword and/or meta.userTags.sentiment. The result will always contain whole documents with values matched independently, i.e. when searching keyword=train and sentiment=-0.76 you WILL find the document with id=514d4e9f-09e7-4f13-b6c9-a0aa9b4f37a0.
If this is not what you want, you need to define nested datatype mapping for field userTags and use a nested query.
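A hedged sketch of what that could look like with the Python client, assuming a 7.x-style mapping without types, numeric types for the tag attributes, and an index name of my choosing:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Assumed index name; userTags declared as "nested" so each tag object is
# matched as a unit instead of having its fields flattened together.
es.indices.create(
    index="tags-nested",
    body={
        "mappings": {
            "properties": {
                "meta": {
                    "properties": {
                        "userTags": {
                            "type": "nested",
                            "properties": {
                                "keyword": {"type": "keyword"},
                                "intensity": {"type": "float"},
                                "sentiment": {"type": "float"}
                            }
                        }
                    }
                }
            }
        }
    }
)

# With the nested mapping, keyword=train AND sentiment=-0.76 must occur in the
# SAME tag object, so the example document would no longer match this query.
resp = es.search(
    index="tags-nested",
    body={
        "query": {
            "nested": {
                "path": "meta.userTags",
                "query": {
                    "bool": {
                        "must": [
                            {"term": {"meta.userTags.keyword": "train"}},
                            {"term": {"meta.userTags.sentiment": -0.76}}
                        ]
                    }
                }
            }
        }
    },
)
print(resp["hits"]["total"])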

MongoDB count distinct items in an array

My actors collection contains an array-of-documents field, called acted_in. Instead of returning the size of acted_in.idmovies like so: {$size: $acted_in.idmovies}, I want to return the number of distinct values inside $acted_in.idmovies. How can I do that?
c1 = actors.aggregate([
    {"$match": {'$and': [{'fname': f_name}, {'lname': l_name}]}},
    {"$project": {
        'first_name': '$fname',
        'last_name': '$lname',
        'gender': '$gender',
        'distinct_movies_played_in': {'$size': '$acted_in.idmovies'}
    }}
])
You basically need to include $setDifference in there to obtain the "distinct" items. All "sets" are "distinct" by design and by obtaining the "difference" from the present array to an empty one [] you get the desired result. Then you can apply the $size.
You also have some common mistakes/misconceptions. Firstly, when using $match or any MongoDB query expression you do not need to use $and unless there is an explicit case to do so. All query expression arguments are "already" AND conditions unless explicitly stated otherwise, as with $or. So don't use it explicitly for this case.
Secondly, your $project was using explicit field path variables for every field. You do not need to do that just to return a field; outside of usage in an "expression", you can simply use a 1 to indicate that you want it included:
c1 = actors.aggregate([
    {"$match": {"fname": f_name, "lname": l_name}},
    {"$project": {
        "first_name": 1,
        "last_name": 1,
        "gender": 1,
        "distinct_movies_played_in": {
            "$size": {"$setDifference": ["$acted_in.idmovies", []]}
        }
    }}
])
In fact, if you are actually using MongoDB 3.4 or greater (and your notation of an element within an array, "$acted_in.idmovies", says you have at least MongoDB 3.2), which has support for $addFields, then use that instead of specifying all the other fields in the document.
c1 = actors.aggregate([
    {"$match": {"fname": f_name, "lname": l_name}},
    {"$addFields": {
        "distinct_movies_played_in": {
            "$size": {"$setDifference": ["$acted_in.idmovies", []]}
        }
    }}
])
Unless you explicitly need to just specify "some" other fields.
The basic rule here is: do not use $unwind for array operations unless you specifically need to perform a $group operation with its _id key pointing at a value obtained from "within" the array.
In all other cases, MongoDB has far more efficient operators for working with arrays than what $unwind does.
This should give you what you want:
actors.aggregate([
  {
    $match: {fname: f_name, lname: l_name}
  },
  {
    $unwind: '$tags'
  },
  {
    $group: {
      _id: '$_id',
      first_name: {$first: '$fname'},
      last_name: {$last: '$lname'},
      gender: {$first: '$gender'},
      tags: {$addToSet: '$tags'}
    }
  },
  {
    $project: {
      first_name: 1,
      last_name: 1,
      gender: 1,
      distinct: {$size: '$tags'}
    }
  }
])
After the tags array is deconstructed and then put back together as a set, you just need to get the number of items, i.e. the length, of that set.
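The pipeline above uses a generic tags array; adapted to the question's acted_in.idmovies field, a hedged PyMongo sketch of the same idea (actors, f_name and l_name as in the question) would be:

# actors, f_name and l_name as defined in the question
c1 = actors.aggregate([
    {"$match": {"fname": f_name, "lname": l_name}},
    {"$unwind": "$acted_in"},
    {"$group": {
        "_id": "$_id",
        "first_name": {"$first": "$fname"},
        "last_name": {"$first": "$lname"},
        "gender": {"$first": "$gender"},
        "movies": {"$addToSet": "$acted_in.idmovies"}
    }},
    {"$project": {
        "first_name": 1,
        "last_name": 1,
        "gender": 1,
        "distinct_movies_played_in": {"$size": "$movies"}
    }}
])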
