Python: How to find duplicated names/documents in MongoDB?

I want to find duplicated documents in my MongoDB collection based on name. I have the following code:
def Check_BFA_DB(options):
    issue_list=[]
    client = MongoClient(options.host, int(options.port))
    db = client[options.db]
    collection = db[options.collection]
    names = [{'$project': {'name':'$name'}}]
    name_cursor = collection.aggregate(names, cursor={})
    for name in name_cursor:
        issue_list.append(name)
        print(name)
It prints all names. How can I print only the duplicated ones?
Any help is appreciated!

The following query will show only duplicates:
db['collection_name'].aggregate([{'$group': {'_id':'$name', 'count': {'$sum': 1}}}, {'$match': {'count': {'$gt': 1}}}])
How it works:
Step 1:
Go over the whole collection, and group the documents by the property called name, and for each name count how many times it is used in the collection.
Step 2:
Filter (using the $match stage) to keep only the groups whose count is greater than 1 (the $gt operator).
An example (written for mongo shell, but can be easily adapted for python):
db.a.insert({name: "name1"})
db.a.insert({name: "name1"})
db.a.insert({name: "name2"})
db.a.aggregate([{"$group": {_id:"$name", count: {"$sum": 1}}}, {$match: {count: {"$gt": 1}}}])
Result is { "_id" : "name1", "count" : 2 }
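For intuition, the same two steps can be mimicked client-side in plain Python. This is only a sketch of what the pipeline computes, not how MongoDB executes it; the function name find_duplicate_names and the sample list are made up for illustration.

```python
from collections import Counter


def find_duplicate_names(documents):
    """Return {name: count} for every name that occurs more than once."""
    # Step 1 ($group): count how many documents share each name.
    counts = Counter(doc.get("name") for doc in documents)
    # Step 2 ($match): keep only the names whose count is greater than 1.
    return {name: count for name, count in counts.items() if count > 1}


# Same three documents as in the shell example.
sample = [{"name": "name1"}, {"name": "name1"}, {"name": "name2"}]
duplicates = find_duplicate_names(sample)
print(duplicates)
```

On a large collection the aggregation is preferable, because the counting happens on the server and only the duplicated names travel over the network.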
So your code should look something like this:
from pymongo import MongoClient

def Check_BFA_DB(options):
    issue_list = []
    client = MongoClient(options.host, int(options.port))
    db = client[options.db]
    # Group by name, then keep only the names that occur more than once
    name_cursor = db[options.collection].aggregate([
        {'$group': {'_id': '$name', 'count': {'$sum': 1}}},
        {'$match': {'count': {'$gt': 1}}}
    ])
    for document in name_cursor:
        name = document['_id']
        issue_list.append(name)
        print(name)
BTW (not related to the question), the Python naming convention for function names is lowercase with underscores, so you might want to call it check_bfa_db().
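If you also want to know which documents share a name, the $group stage can collect their _id values with $push. A minimal sketch, assuming the field is called name; the helper duplicate_pipeline and the output field ids are hypothetical names, not part of PyMongo.

```python
def duplicate_pipeline(field="name"):
    """Build an aggregation pipeline reporting values of `field` used more than once."""
    return [
        # Group by the field, count the documents and remember their _id values.
        {"$group": {
            "_id": f"${field}",
            "count": {"$sum": 1},
            "ids": {"$push": "$_id"},
        }},
        # Keep only the groups that contain more than one document.
        {"$match": {"count": {"$gt": 1}}},
    ]


pipeline = duplicate_pipeline("name")
print(pipeline)
# With a live collection this would be used as:
#     for group in collection.aggregate(pipeline):
#         print(group["_id"], group["count"], group["ids"])
```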

Related

How do I get the document ID in Marqo?

I added a document to Marqo with add_documents(), but I didn't pass an ID. Now I am trying to get the document, but I don't know what its document ID is.
Here is what my code looks like:
mq = marqo.Client(url='http://localhost:8882')
mq.index("my-first-index").add_documents([
    {
        "Title": title,
        "Description": document_body
    }
])
I tried to check whether the document was added:
no_of_docs = mq.index("my-first-index").get_stats()
print(no_of_docs)
I got:
{'numberOfDocuments': 1}
meaning it was added.
If you don't include "_id" as one of the key/value pairs, Marqo generates a random ID for you by default. To find it, you can search for the document by its Title:
doc = mq.index("my-first-index").search(title_of_your_document, searchable_attributes=['Title'])
You should get a dictionary as the result, something like this:
{'hits': [{'Description': your_description,
'Title': title_of_your_document,
'_highlights': relevant part of the doc,
'_id': 'ac14f87e-50b8-43e7-91de-ee72e1469bd3',
'_score': 1.0}],
'limit': 10,
'processingTimeMs': 122,
'query': 'The Premier League'}
The _id field is the ID of your document.
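Pulling the IDs out of such a response is a plain dictionary lookup. A small sketch, assuming the response has the shape shown above; the helper name ids_from_hits and the sample values are made up for illustration.

```python
def ids_from_hits(response):
    """Return the _id of every hit in a search response shaped like the one above."""
    return [hit["_id"] for hit in response.get("hits", [])]


# Hand-written stand-in for the dictionary returned by search().
sample_response = {
    "hits": [
        {
            "Title": "The Premier League",
            "_id": "ac14f87e-50b8-43e7-91de-ee72e1469bd3",
            "_score": 1.0,
        }
    ],
    "limit": 10,
    "query": "The Premier League",
}

doc_ids = ids_from_hits(sample_response)
print(doc_ids)
```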

What to write in pipeline of MongoDB to find one element of 'keys'?

I would like to create a query to find the number of trees whose species name ends with 'um', by arrondissement.
My code is here:
from pymongo import MongoClient
from utils import get_my_password, get_my_username
from pprint import pprint
client = MongoClient(
host='127.0.0.1',
port=27017,
username=get_my_username(),
password=get_my_password(),
authSource='admin'
)
db = client['paris']
col = db['trees']
pprint(col.find_one())
{'_id': ObjectId('5f3276d8c22f704983b3f681'),
'adresse': 'JARDIN DU CHAMP DE MARS / C04',
'arrondissement': 'PARIS 7E ARRDT',
'circonferenceencm': 115.0,
'domanialite': 'Jardin',
'espece': 'hippocastanum',
'genre': 'Aesculus',
'geo_point_2d': [48.8561906007, 2.29586827747],
'hauteurenm': 11.0,
'idbase': 107224.0,
'idemplacement': 'P0040937',
'libellefrancais': 'Marronnier',
'remarquable': '0',
'stadedeveloppement': 'A',
'typeemplacement': 'Arbre'}
I tried to do it with the following lines:
import re
regex = re.compile('um')
pipeline = [
{'$group': {'_id': '$arrondissement',
'CountNumberTrees': {'$count': '${'espece': regex}'}
}
}
]
results = col.aggregate(pipeline)
pprint(list(results))
But it returns:
File "<ipython-input-114-fba3a8bf5bfd>", line 8
'CountNumberTrees': {'$count': '${'espece': regex}'}
^
SyntaxError: invalid syntax
When I check like this, it shows results: '25245'
results = col.count_documents(filter={'espece': regex})
print(results)
Could you help me please to understand what should I put in pipeline?
Best regards
Try this syntax for your aggregate query:
The $match stage filters on espece ending in um.
The $group stage counts the matching records grouped by arrondissement.
The $project stage is optional, but it provides a tidier list of fields.
cursor = col.aggregate([
    {'$match': {'espece': {'$regex': 'um$'}}},
    {'$group': {'_id': '$arrondissement', 'CountNumberTrees': {'$sum': 1}}},
    {'$project': {'_id': 0, 'arrondissement': '$_id', 'CountNumberTrees': '$CountNumberTrees'}}
])
print(list(cursor))
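Note that the pattern matters: 'um' matches those letters anywhere in the name, while 'um$' only matches them at the end. A quick client-side sketch with Python's re module; apart from hippocastanum, the strings in the list are sample values chosen for illustration, not taken from the dataset.

```python
import re

anywhere = re.compile("um")   # pattern from the question
at_end = re.compile("um$")    # pattern used in the $match stage

species = ["hippocastanum", "umbellata", "platanoides"]

matches_anywhere = [s for s in species if anywhere.search(s)]
matches_at_end = [s for s in species if at_end.search(s)]

print(matches_anywhere)
print(matches_at_end)
```

So a count obtained with re.compile('um') may include species that merely contain 'um' and can be higher than the total of the grouped counts.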

PyMongo query - check if value exists or not

I'm looking to perform a query on a database and extract some data to be processed. Here is my query so far:
pipeline = [{'$match':{"Timestamp":{'$gte':m(), '$lt':current()},
'Frequency Survey Reference':{'$regex':'Ch2'}}},
{'$group': {
'_id': '$Timestamp',
'Trace' : {'$push': '$TR Trace'}
}},
{'$sort': {'_id': -1}},
{'$limit': 1}
]
get_tr = collection.aggregate(pipeline, allowDiskUse=True)
However, some of the records don't have any value for TR Trace (an empty array), and I want to perform a check where it ignores those entries and doesn't include them in the pipeline. How would I perform such a check?
Filter them out as part of the $match stage with the $exists query operator. Since the empty values are empty arrays, compare against [] rather than '':
pipeline = [{'$match':{"Timestamp":{'$gte':m(), '$lt':current()},
                       'Frequency Survey Reference':{'$regex':'Ch2'},
                       'TR Trace': {'$exists': True, '$ne': []}}},
...
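To check which records a filter of the form {'$exists': True, '$ne': []} keeps, the same test can be mirrored in plain Python. This is only a client-side sketch of the condition; the helper has_trace and the sample records are hypothetical.

```python
def has_trace(record, field="TR Trace"):
    """True if the field is present and is not an empty list."""
    return field in record and record[field] != []


records = [
    {"Timestamp": 1, "TR Trace": [0.1, 0.2]},  # kept
    {"Timestamp": 2, "TR Trace": []},          # dropped: empty array
    {"Timestamp": 3},                          # dropped: field missing
]

kept = [r["Timestamp"] for r in records if has_trace(r)]
print(kept)
```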

Pymongo - aggregating from two databases

Noob here. I need to get a report that is an aggregate of two collections from two databases. Tried to wrap my head around the problem but failed. The examples I have found are for aggregating two collections from the same database.
Collection 1
SweetsResults = client.ItemsDB.sweets
Collection sweets : _id, type, color
Collection2
SearchesResults = client.LogsDB.searches
Collection searches : _id, timestamp, type, color
The report I need will list all the sweets from the type “candy” with all the listed colors in the sweets collection, and for each line, the number (count) of searches for any available combination of “candy”+color.
Any help will be appreciated.
Thanks.
You can use the below script in mongo shell.
Get the distinct color for each type followed by count for each type and color combination.
var itemsdb = db.getSiblingDB('ItemsDB');
var logsdb = db.getSiblingDB('LogsDB');
var docs = [];
itemsdb.getCollection("sweets").aggregate([
    {$match:{"type":"candy"}},
    {$group: {_id:{type:"$type", color:"$color"}}},
    {$project: {_id:0, type:"$_id.type", color:"$_id.color"}}
]).forEach(function(doc){
    doc.count = logsdb.getCollection("searches").count({ "type":"candy","color":doc.color});
    docs.push(doc)
});
Exactly the same as in @Veeram's answer, but in Python:
from pymongo import MongoClient

uri = 'mongodb://localhost'
client = MongoClient(uri)
items_db = client.get_database('ItemsDB')
logs_db = client.get_database('LogsDB')
docs = []
aggr = items_db.get_collection('sweets').aggregate([
    {'$match': {"type": "candy"}},
    {'$group': {'_id': {'type': "$type", 'color': "$color"}}},
    {'$project': {'_id': 0, 'type': "$_id.type", 'color': "$_id.color"}},
])
for doc in aggr:
    # count_documents replaces Collection.count, which was removed in PyMongo 4
    doc['count'] = logs_db.get_collection("searches").count_documents({"type": "candy", "color": doc['color']})
    docs.append(doc)
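The loop above runs one count query per color. If that becomes slow, the searches can be counted once and then looked up per color. A client-side sketch of that merge, with plain lists standing in for the two collections; candy_report and the sample rows are made up for illustration.

```python
from collections import Counter


def candy_report(sweets, searches, sweet_type="candy"):
    """List each color of `sweet_type` together with its number of searches."""
    # Distinct colors of the requested type (what $match + $group produce).
    colors = sorted({s["color"] for s in sweets if s["type"] == sweet_type})
    # One pass over the searches instead of one count per color.
    counts = Counter(
        s["color"] for s in searches if s["type"] == sweet_type
    )
    return [
        {"type": sweet_type, "color": color, "count": counts[color]}
        for color in colors
    ]


sweets = [
    {"type": "candy", "color": "red"},
    {"type": "candy", "color": "blue"},
    {"type": "gum", "color": "red"},
]
searches = [
    {"type": "candy", "color": "red"},
    {"type": "candy", "color": "red"},
    {"type": "gum", "color": "red"},
]

report = candy_report(sweets, searches)
print(report)
```

Colors that were never searched still appear in the report, with a count of 0.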

Passing variables onto a MongoDB Query

My collection has the following documents:
{
cust_id: "0044234",
Address: "1234 Dunn Hill",
city: "Pittsburg",
comments : "4"
},
{
cust_id: "0097314",
Address: "5678 Dunn Hill",
city: "San Diego",
comments : "99"
},
{
cust_id: "012345",
Address: "2929 Dunn Hill",
city: "Pittsburg",
comments : "41"
}
I want to write a block of code that extracts and stores all cust_id's from the same city. I am able to get the answer by running the below query on MongoDB :
db.custData.find({"city" : 'Pittsburg'},{business_id:1}).
However, I am unable to do the same using Python. Below is what I have tried :
ctgrp=[{"$group":{"_id":"$city","number of cust":{"$sum":1}}}]
myDict={}
for line in collection.aggregate(ctgrp) : #for grouping all the cities in the dataset
    myDict[line['_id']]=line['number of cust']
for key in myDict:
    k=db.collection.find({"city" : 'key'},{'cust_id:1'})
    print k
client.close()
Also, I am unable to figure out how I can store this. The only thing that comes to mind is a dictionary with a list of values for each key. However, I could not come up with an implementation. I was looking for an output like this:
For Pittsburg, the values would be 0044234 and 012345.
You can use the .distinct method, which is the best way to do this.
import pymongo
client = pymongo.MongoClient()
db = client.test
collection = db.collection
then:
collection.distinct('cust_id', {'city': 'Pittsburg'})
Yields:
['0044234', '012345']
Or do this client side, which is less efficient:
>>> cust_ids = set()
>>> for element in collection.find({'city': 'Pittsburg'}):
...     cust_ids.add(element['cust_id'])
...
>>> cust_ids
{'0044234', '012345'}
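Regarding the loop in the question: {"city" : 'key'} searches for the literal string 'key'. Pass the variable itself, and write the projection as a dictionary. A minimal sketch; find_args is a hypothetical helper that only builds the two arguments for find().

```python
def find_args(city):
    """Build the filter and projection for looking up cust_id values in one city."""
    query = {"city": city}                 # the variable, not the string 'key'
    projection = {"cust_id": 1, "_id": 0}  # a dict, not the set {'cust_id:1'}
    return query, projection


query, projection = find_args("Pittsburg")
print(query, projection)
# With a live collection:
#     for doc in collection.find(query, projection):
#         print(doc["cust_id"])
```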
If you want all "cust_id" values for a given city as a single list, here it is:
>>> list(collection.aggregate([{'$match': {'city': 'Pittsburg'} }, {'$group': {'_id': None, 'cust_ids': {'$push': '$cust_id'}}}]))[0]['cust_ids']
['0044234', '012345']
If you want to group your documents by city and find the distinct "cust_id" values for each city, here it is:
>>> from pprint import pprint
>>> pipeline = [{'$group': {'_id': '$city', 'cust_ids': {'$addToSet': '$cust_id'}, 'count': {'$sum': 1}}}]
>>> pprint(list(collection.aggregate(pipeline)))
[{'_id': 'San Diego', 'count': 1, 'cust_ids': ['0097314']},
{'_id': 'Pittsburg', 'count': 2, 'cust_ids': ['012345', '0044234']}]
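To get the dictionary of lists that the question asks for, the grouped result can be turned into a mapping from city to IDs. A sketch using the rows printed above as literal data; ids_by_city is a made-up name.

```python
def ids_by_city(rows):
    """Map each city (_id) to a sorted list of its cust_id values."""
    return {row["_id"]: sorted(row["cust_ids"]) for row in rows}


# The two rows from the aggregation output above.
rows = [
    {"_id": "San Diego", "count": 1, "cust_ids": ["0097314"]},
    {"_id": "Pittsburg", "count": 2, "cust_ids": ["012345", "0044234"]},
]

by_city = ids_by_city(rows)
print(by_city["Pittsburg"])
```

The lists are sorted because $addToSet does not guarantee the order of its elements.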
