# json and bson.json_util.dumps are needed to serialize ObjectId/ISODate values
import json
from bson.json_util import dumps

def get(self):
    res = json.loads(dumps(
        self.devices_col.aggregate([
            {"$lookup": {
                "from": "participants",
                "localField": "_id.docgroupid",
                "foreignField": "device_id",
                "as": "participants"
            }},
            {"$unwind": "$participants"}
        ])
    ))
    return res
participants document
{
"_id" : ObjectId("5f7230502930714468ed892c"),
"hash" : "83a84e8bf170114cffcc3b1e178d6468",
"name" : "BOMW0000029529",
"persona_id" : "i123",
"command" : "start",
"va_info" : [
{
"device_id" : "5f722a742930714468ed8929",
"automation_config" : "",
"status" : "false",
"remote_path" : "/datadrive/gatewayfolder",
"version" : "1.3.0.9",
"latest_va_version" : "1.3.1.2",
"version_updated_on" : "",
"latest_va_build_number" : "20200525",
"last_connected_on" : "02/08/2020 11:25:55",
"last_seen_on" : "02/08/2020 11:25:55",
"last_activity_processed_on" : "02/07/2020 11:25:55"
}
],
"inclusions" : [
"myfinancewnscom",
"OUTLOOK",
"jp2launcher",
"EXCEL"
],
"created_by" : "",
"created_on" : "",
"modified_by" : "",
"modified_on" : ""
}
devices document
{
"_id" : ObjectId("5f722a742930714468ed8929"),
"name" : "",
"unique_id" : "u168381",
"os" : {
"version" : "6.2.9200.0",
"name" : "Microsoft Windows 10 Home",
"locale" : {
"geo_location" : null,
"time_zone" : "IST",
"day_light_saving_support" : false
},
"culture" : {
"name" : "en-US",
"LCID" : "1032",
"language" : "English (United States)"
},
"browser" : [
{
"name" : "IE",
"value" : "9.11.17763.0"
},
{
"name" : "Chrome",
"value" : "84.0.4147.105"
},
{
"name" : "Firefox",
"value" : "Not Found"
}
]
},
"created_by" : "",
"created_on" : "",
"modified_by" : "",
"modified_on" : ISODate("2020-07-21T06:08:50.876Z")
}
Here is my data, and above is my piece of Python code; I am using the pymongo client to query MongoDB.
In the code I am trying to join the two collections (devices and participants) on device_id (which is inside participants).
I have only two records in each collection, but the output gives me 4 results, with duplicate records.
Please have a look at where I am going wrong.
It doesn't double, it multiplies: number of devices * number of participants.
In your pipeline you join the collections as:
{"$lookup": {
"from": "participants",
"localField": "_id.docgroupid",
"foreignField": "device_id",
"as": "participants"
}
}
There is no _id.docgroupid field in devices, and there is no top-level device_id field in participants (it is nested inside va_info), so the lookup compares null to null and makes a perfect match of each participant to each device.
After the lookup stage the participants field holds the whole participants collection. When you unwind it you see the same parent document once per participant. Even though the _id values of the documents are the same, they are not identical duplicates: they differ in the participants field.
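A corrected join has to reference where the id really lives. Here is a minimal sketch, assuming MongoDB 4.0+ (for $toObjectId, since va_info.device_id stores the device's _id as a string):
pipeline = [
    {"$lookup": {
        "from": "participants",
        "let": {"deviceId": "$_id"},
        "pipeline": [
            {"$unwind": "$va_info"},
            # va_info.device_id is a string; the devices _id is an ObjectId
            {"$match": {"$expr": {
                "$eq": [{"$toObjectId": "$va_info.device_id"}, "$$deviceId"]
            }}}
        ],
        "as": "participants"
    }},
    {"$unwind": "$participants"}
]
res = json.loads(dumps(self.devices_col.aggregate(pipeline)))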
I have an explicit JSON input that I want to transform into metadata-driven generic objects within an array. I have successfully done this on an individual basis; however, I now want to drive it from a configuration file instead.
Below is an example of the input data, the configuration I want to apply, and the expected output data.
Since it is outputting into a generic schema, I want the output value to always be a string, no matter what the input value's data type is.
In addition, a given section may not always exist in the origin payload. When I wrote the individual version I used try, which worked really well. With a configuration file I expect the same approach still applies: loop through the configuration entries, create whatever it can, and skip to the next one otherwise, until completed.
INPUT ORIGIN DATA
{
"activities_acceptance" : {
"contractors_sub_contractors" : {
"contractors_subcontractors_engaged" : "yes"
},
"cooking_deep_frying" : {
"deep_frying_engaged" : "yes",
"deep_fryer_vat_limit" : 10
}
},
"situation_acceptance" : {
"building_construction" : {
"wall_materials" : "CONCRETE"
}
}
}
CONFIGURATION PARAMETERS
{
"processiong_configuration" : [
{
"origin_path" : "activities_acceptance.contractors_sub_contractors",
"set_category" : "business-activity",
"set_type" : "contractors-subcontractors",
"set_value" : [
{
"use_value" : "activities_acceptance.contractors_sub_contractors.contractors_subcontractors_engaged",
"set_value" : "value"
}
]
},
{
"origin_path" : "activities_acceptance.cooking_deep_frying",
"set_category" : "business-activity",
"set_type" : "cooking-deep-frying",
"set_value" : [
{
"use_value" : "activities_acceptance.cooking_deep_frying.deep_frying_engaged",
"set_value" : "value"
},
{
"use_value" : "activities_acceptance.cooking_deep_frying.deep_fryer_vat_limit",
"set_value" : "details"
}
]
},
{
"origin_path" : "situation_acceptance.building_construction",
"set_category" : "situation-materials",
"set_type" : "wall-materials",
"set_value" : [
{
"use_value" : "situation_acceptance.building_construction.wall_materials",
"set_value" : "CONCRETE"
}
]
}
]
}
EXPECTED OUTPUT
{
"characteristics" : [
{
"category" : "business-activity",
"type" : "contractors-subcontractors",
"value" : "yes"
},
{
"category" : "business-activity",
"type" : "deep-frying",
"value" : "yes",
"details" : "10"
},
{
"category" : "situation-materials",
"type" : "wall-materials",
"value" : "CONCRETE"
}
]
}
What I currently have for a single transform without configuration is the following:
# Create Business Characteristics
business_characteristics = {
    "characteristics": []
}

# Create Characteristics - Business - Liability
# if liability section exists logic to go in here
try:
    acc_liability = {
        "category": "business-activities",
        "type": "contractors-sub-contractors-engaged",
        "description": "",
        "value": "",
        "details": ""
    }
    acc_liability['value'] = d['line_of_businesses'][0]['assets']['commercial_operations'][0]['liability_asset']['acceptance']['contractors_and_subcontractors']['contractors_and_subcontractors_engaged']
    acc_liability['details'] = d['line_of_businesses'][0]['assets']['commercial_operations'][0]['liability_asset']['acceptance']['contractors_and_subcontractors']['types_of_work_contractors_performed']
    business_characteristics['characteristics'].append(acc_liability)
except (KeyError, IndexError):
    # section missing from the origin payload, so skip it
    acc_liability = {}
CURRENT OUTPUT in Jupyter
{
"characteristics": [
{
"category": "business-activities",
"type": "contractors-sub-scontractors-engaged",
"description": "",
"value": "YES",
"details": ""
}
]
}
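A minimal sketch of the configuration-driven loop described above (all helper names are mine; I assume the set_value inside each mapping names the output field, as in the first two configuration entries, and that the third entry's "CONCRETE" was meant to be "value"):
import json

def get_path(obj, dotted_path):
    # Walk a nested dict by a dotted path; return None if any key is missing.
    for key in dotted_path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj

def transform(origin, config):
    characteristics = []
    for rule in config["processiong_configuration"]:  # key spelled as in the config above
        if get_path(origin, rule["origin_path"]) is None:
            continue  # section absent from the origin payload: skip it
        item = {"category": rule["set_category"], "type": rule["set_type"]}
        for mapping in rule["set_value"]:
            value = get_path(origin, mapping["use_value"])
            if value is not None:
                item[mapping["set_value"]] = str(value)  # always output strings
        characteristics.append(item)
    return {"characteristics": characteristics}

# origin_data / processing_config assumed to hold the two JSON documents above
print(json.dumps(transform(origin_data, processing_config), indent=2))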
I have a field distribution in my record schema that looks like this:
...
"distribution": {
"properties": {
"availability": {
"type": "keyword"
}
}
}
...
I want to rank the records with distribution.availability == "ondemand" lower than other records.
I looked in the Elasticsearch docs but can't find a way to reduce the scores of this type of record at index time so that they appear lower in search results.
How can I achieve this? Any pointers to a related source would be enough as well.
More Info:
I was completely omitting these ondemand records with the help of the Python client at query time, like this:
from elasticsearch_dsl.query import Q
_query = Q("query_string", query=query_string) & ~Q('match', **{'availability.keyword': 'ondemand'})
Now, I want to include these records, but I want to place them lower than the other records.
If it is not possible to implement something like this at index time, please suggest how I can achieve it at query time with the Python client.
After applying the suggestion from llermaly, the Python client query looks like this:
boosting_query = Q(
"boosting",
positive=Q("match_all"),
negative=Q(
"bool", filter=[Q({"term": {"distribution.availability.keyword": "ondemand"}})]
),
negative_boost=0.5,
)
if query_string:
_query = Q("query_string", query=query_string) & boosting_query
else:
_query = Q() & boosting_query
EDIT2: elasticsearch-dsl-py version of the boosting query:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
from elasticsearch_dsl import Q
client = Elasticsearch()
q = Q(
    'boosting',
    positive=Q("match_all"),
    negative=Q('bool', filter=[Q({"term": {"test.available.keyword": "ondemand"}})]),
    negative_boost=0.5,
)
s = Search(using=client, index="test_parths007").query(q)
response = s.execute()
print(response)
for hit in response:
print(hit.meta.score, hit.test.available)
EDIT: Just read that you need to do it at index time.
Elasticsearch deprecated index-time boosting in 5.0:
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/mapping-boost.html
You can use a Boosting query to achieve that at query time.
Ingest Documents
POST test_parths007/_doc
{
"name": "doc1",
"test": {
"available": "ondemand"
}
}
POST test_parths007/_doc
{
"name": "doc1",
"test": {
"available": "higherscore"
}
}
POST test_parths007/_doc
{
"name": "doc2",
"test": {
"available": "higherscore"
}
}
Query (query time)
POST test_parths007/_search
{
"query": {
"boosting": {
"positive": {
"match_all": {}
},
"negative": {
"term": {
"test.available.keyword": "ondemand"
}
},
"negative_boost": 0.5
}
}
}
Response
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "test_parths007",
"_type" : "_doc",
"_id" : "VMdY7XcB50NMsuQPelRx",
"_score" : 1.0,
"_source" : {
"name" : "doc2",
"test" : {
"available" : "higherscore"
}
}
},
{
"_index" : "test_parths007",
"_type" : "_doc",
"_id" : "Vcda7XcB50NMsuQPiVRB",
"_score" : 1.0,
"_source" : {
"name" : "doc1",
"test" : {
"available" : "higherscore"
}
}
},
{
"_index" : "test_parths007",
"_type" : "_doc",
"_id" : "U8dY7XcB50NMsuQPdlTo",
"_score" : 0.5,
"_source" : {
"name" : "doc1",
"test" : {
"available" : "ondemand"
}
}
}
]
}
}
For more advanced manipulation you can check the Function Score Query.
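For example, a hedged elasticsearch-dsl sketch of the same demotion using function_score (field name taken from the mapping above; the 0.5 weight is arbitrary):
from elasticsearch_dsl import Q

q = Q(
    "function_score",
    query=Q("query_string", query=query_string),
    functions=[{
        "filter": {"term": {"distribution.availability.keyword": "ondemand"}},
        "weight": 0.5,  # halve the score of ondemand records
    }],
    boost_mode="multiply",
)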
New to MongoDB, and the documentation doesn't seem to tell me what I'm looking for.
I have a document like this:
{
    "_id" : "5G",
    " dump" : [
        {
            "severity" : "B - Major",
            "problemReportId" : "x",
            "groupInCharge" : "test",
            "feature" : null,
            "_id" : "1"
        },
        {
            "severity" : "BM",
            "problemReportId" : "x",
            "groupInCharge" : "test",
            "feature" : null,
            "_id" : "1"
        }
    ]
}
Where dump could have any number of entries... [0, 1, 2, ..., X, Y, Z]
I want to add a new field, let's call it duplicate, into each of the dump dictionaries, so it will look like this:
{
    "_id" : "5G",
    " dump" : [
        {
            "severity" : "B - Major",
            "problemReportId" : "x",
            "groupInCharge" : "test",
            "feature" : null,
            "_id" : "1",
            "duplicate" : "0"
        },
        {
            "severity" : "M",
            "problemReportId" : "y",
            "groupInCharge" : "testX",
            "feature" : null,
            "_id" : "1",
            "duplicate" : "0"
        }
    ]
}
I have tried the code below, but all it does is replace the array, and I cannot figure out how to iterate through it.
for issue in issues:
    # Adding a field to tell if issue is a duplicate, 1 = Yes, 0 = No.
    duplicate_value = {'dump': {'duplicate': 0}}
    _key = {"_id": project}
    db.dump.update(_key, duplicate_value, upsert=True)
You can try the following $map aggregation:
db.collection.aggregate([
{
$project: {
dump: {
$map: {
input: "$dump",
as: "dp",
in: {
"severity": "$$dp.severity",
"problemReportId": "$$dp.problemReportId",
"groupInCharge": "$$dp.groupInCharge",
"feature": "$$dp.feature",
"_id": "$$dp._id",
"duplicate": "0"
}
}
}
}
}
])
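A hedged pymongo version of the same idea, using $mergeObjects (MongoDB 3.6+) so the element's fields do not have to be spelled out one by one; note that a bare aggregation only returns transformed copies and does not persist them:
pipeline = [
    {"$project": {
        "dump": {
            "$map": {
                "input": "$dump",
                "as": "dp",
                # copy each array element and add the new field to it
                "in": {"$mergeObjects": ["$$dp", {"duplicate": "0"}]},
            }
        }
    }}
]
for doc in db.collection.aggregate(pipeline):
    print(doc)
On MongoDB 4.2+ you can persist the change in place by passing a pipeline with a $set stage (instead of $project) to update_many.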
If you only need to update your field, you should implement it using Mongo's native operators, as @Anthony Winzlet suggested.
However, if for some reason you really need to parse and update your array within your Python code, I believe using .save() should work better:
client = MongoClient(host, port)
collection = client.db.collection  # database/collection names assumed

for item in collection.find({ ... }):
    # Do anything you want with the item...
    # "dump" is an array, so set the field on each of its elements
    for entry in item["dump"]:
        entry["duplicate"] = 0
    # This will update the item if the dict has an "_id" field,
    # else it will insert it as a new doc.
    collection.save(item)
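Be aware that save() was deprecated in PyMongo 3.0 and removed in 4.0; on current versions the closest equivalent is replace_one, roughly:
# mirrors the old save() behaviour for items that already carry an _id
collection.replace_one({"_id": item["_id"]}, item, upsert=True)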
I am trying to extract the value of 'login' from a dump of JSON which is in the form of text (response.text).
Here's the string:
{
"name":"master",
"commit":{
"sha":"adc3208a9ac76262250a",
"commit":{
"author":{
"name":"root",
"email":"dan.ja#foo.ca",
"date":"2018-02-26T20:14:41Z"
},
"committer":{
"name":"GitHub Enterprise",
"date":"2018-02-26T20:14:41Z"
},
"message":"Update README.md",
"tree":{
"sha":"3e4710d0e021a0a7",
"comment_count":0,
"verification":{
"verified":false,
"reason":"unsigned",
"signature":null,
"payload":null
}
},
"author":{
"login":"kyle",
"id":5
}
I am just trying to pull the value 'kyle' from the login in the last line. The value can be a different login each time, so I need whatever string appears in "login":"string".
Here's what I have right now, but that only gets me "login":
/"login"[^\a]*"/g
Never parse JSON with regex, use a JSON parser.
With jq:
Input file:
{
"commit" : {
"commit" : {
"tree" : {
"verification" : {
"payload" : null,
"verified" : false,
"signature" : null,
"reason" : "unsigned"
},
"sha" : "3e4710d0e021a0a7",
"comment_count" : 0
},
"author" : {
"id" : 5,
"login" : "kyle"
},
"committer" : {
"name" : "GitHub Enterprise",
"date" : "2018-02-26T20:14:41Z"
},
"message" : "Update README.md"
},
"sha" : "adc3208a9ac76262250a"
},
"name" : "master"
}
Command:
$ jq '.commit.commit.author.login' file.json
Or via a Python script:
#!/usr/bin/env python3
import json
string = """
{
"commit" : {
"commit" : {
"tree" : {
"verification" : {
"payload" : null,
"verified" : false,
"signature" : null,
"reason" : "unsigned"
},
"sha" : "3e4710d0e021a0a7",
"comment_count" : 0
},
"author" : {
"id" : 5,
"login" : "kyle"
},
"committer" : {
"name" : "GitHub Enterprise",
"date" : "2018-02-26T20:14:41Z"
},
"message" : "Update README.md"
},
"sha" : "adc3208a9ac76262250a"
},
"name" : "master"
}
"""
j = json.loads(string)
print(j['commit']['commit']['author']['login'])
Output:
kyle
I have many documents with the following structure:
{
"_id" : ObjectId("52be9d8dbfbc2c17e6a4e06b"),
"contest" : "Teamcode",
"data" : [
{
"status" : "0",
"message" : "Correct",
"runtime" : 0.10917782783508301,
"score" : 20
},
{
"status" : "0",
"message" : "Correct",
"runtime" : 0.12033200263977051,
"score" : 20
},
{
"status" : "0",
"message" : "Correct",
"runtime" : 0.35556793212890625,
"score" : 20
},
{
"status" : "0",
"message" : "Correct",
"runtime" : 1.8789710998535156,
"score" : 20
},
{
"status" : "0",
"message" : "Correct",
"runtime" : 0.9521079063415527,
"score" : 20
}
],
"id" : 242,
"lang" : "c",
"problem" : "roate",
"result" : [ ],
"score" : 100,
"status" : "done",
"time" : 1388223885.051975,
"user" : {
"email" : "orizont1",
"user_class" : 0,
"name" : "orizont1"
}
}
Each user has many submissions for each problem in one contest.
I have a variable called "contest", and I want to take the last submission of each user for each problem. I use pymongo.
How can I do that?
The query can be formed like this:
for each problem (say, a Teamcode problem), give me the last submission of all users
-> While querying you need to keep in mind that the object array (data) must be non-empty. Note that $size does not accept a range like {$gte: 1} in a query, so the usual idiom is to check that the array's first element exists.
-> query: { "contest": "Teamcode", "data.0": { "$exists": true } }
-> projection: { "data": { "$slice": -1 }, "id": 1 }. $slice: -1 will give you the last element of the object array (data) in each document that matches the query.
For $slice read this:
http://docs.mongodb.org/manual/reference/operator/projection/slice/#proj._S_slice
YOUR_COLLECTION_NAME.find( { "contest": "Teamcode", "data.0": { "$exists": true } }, { "data": { "$slice": -1 }, "id": 1 } )
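Since the question mentions pymongo, a hedged equivalent (database and collection names are placeholders):
from pymongo import MongoClient

client = MongoClient()
submissions = client.mydb.submissions  # database/collection names assumed

cursor = submissions.find(
    {"contest": "Teamcode", "data.0": {"$exists": True}},  # non-empty data array
    {"data": {"$slice": -1}, "id": 1},                     # keep only the last element
)
for doc in cursor:
    print(doc["id"], doc["data"][0])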