My collection has the following documents:
{
cust_id: "0044234",
Address: "1234 Dunn Hill",
city: "Pittsburg",
comments : "4"
},
{
cust_id: "0097314",
Address: "5678 Dunn Hill",
city: "San Diego",
comments : "99"
},
{
cust_id: "012345",
Address: "2929 Dunn Hill",
city: "Pittsburg",
comments : "41"
}
I want to write a block of code that extracts and stores all cust_ids from the same city. I am able to get the answer by running the query below in MongoDB:
db.custData.find({"city": 'Pittsburg'}, {cust_id: 1})
However, I am unable to do the same using Python. Below is what I have tried:
ctgrp = [{"$group": {"_id": "$city", "number of cust": {"$sum": 1}}}]
myDict = {}
for line in collection.aggregate(ctgrp):  # for grouping all the cities in the dataset
    myDict[line['_id']] = line['number of cust']
for key in myDict:
    k = db.collection.find({"city": 'key'}, {'cust_id:1'})
    print k
client.close()
Also, I am unable to figure out how I can store this. The only thing that comes to mind is a dictionary with a list of values corresponding to a particular key, but I could not come up with an implementation for it. I was looking for an output like this:
For Pittsburg, the values would be 0044234 and 012345.
You can use the .distinct method, which is the best way to do this:
import pymongo
client = pymongo.MongoClient()
db = client.test
collection = db.collection
then:
collection.distinct('cust_id', {'city': 'Pittsburg'})
Yields:
['0044234', '012345']
Or do this client-side, which is less efficient:
>>> cust_ids = set()
>>> for element in collection.find({'city': 'Pittsburg'}):
... cust_ids.add(element['cust_id'])
...
>>> cust_ids
{'0044234', '012345'}
Now if you want all "cust_id" values for a given city, here it is:
>>> list(collection.aggregate([{'$match': {'city': 'Pittsburg'} }, {'$group': {'_id': None, 'cust_ids': {'$push': '$cust_id'}}}]))[0]['cust_ids']
['0044234', '012345']
Now if what you want is to group your documents by city and find the distinct "cust_id" values, here it is:
>>> from pprint import pprint
>>> pipeline = [{'$group': {'_id': '$city', 'cust_ids': {'$addToSet': '$cust_id'}, 'count': {'$sum': 1}}}]
>>> pprint(list(collection.aggregate(pipeline)))
[{'_id': 'San Diego', 'count': 1, 'cust_ids': ['0097314']},
{'_id': 'Pittsburg', 'count': 2, 'cust_ids': ['012345', '0044234']}]
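If you want the dictionary of city to cust_id lists that the question describes, you can build it directly from that pipeline. A minimal sketch, reusing the collection and pipeline defined above:
city_to_cust_ids = {
    doc['_id']: doc['cust_ids']  # city name -> list of cust_ids in that city
    for doc in collection.aggregate(pipeline)
}
# e.g. {'San Diego': ['0097314'], 'Pittsburg': ['012345', '0044234']}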
I would like to create a query that finds the number of trees whose species name ends with 'um', grouped by arrondissement.
My code is here:
from pymongo import MongoClient
from utils import get_my_password, get_my_username
from pprint import pprint
client = MongoClient(
    host='127.0.0.1',
    port=27017,
    username=get_my_username(),
    password=get_my_password(),
    authSource='admin'
)
db = client['paris']
col = db['trees']
pprint(col.find_one())
{'_id': ObjectId('5f3276d8c22f704983b3f681'),
'adresse': 'JARDIN DU CHAMP DE MARS / C04',
'arrondissement': 'PARIS 7E ARRDT',
'circonferenceencm': 115.0,
'domanialite': 'Jardin',
'espece': 'hippocastanum',
'genre': 'Aesculus',
'geo_point_2d': [48.8561906007, 2.29586827747],
'hauteurenm': 11.0,
'idbase': 107224.0,
'idemplacement': 'P0040937',
'libellefrancais': 'Marronnier',
'remarquable': '0',
'stadedeveloppement': 'A',
'typeemplacement': 'Arbre'}
I tried to do it with the following lines:
import re
regex = re.compile('um')
pipeline = [
    {'$group': {'_id': '$arrondissement',
                'CountNumberTrees': {'$count': '${'espece': regex}'}
                }
    }
]
results = col.aggregate(pipeline)
pprint(list(results))
But it returns:
File "<ipython-input-114-fba3a8bf5bfd>", line 8
'CountNumberTrees': {'$count': '${'espece': regex}'}
^
SyntaxError: invalid syntax
When I check like this, it does show a result, 25245:
results = col.count_documents(filter={'espece': regex})
print(results)
Could you please help me understand what I should put in the pipeline?
Try this syntax for your aggregate query:
The $match stage filters on espece ending in um.
The $group stage counts the matching records, grouped by arrondissement.
The $project stage is optional but it provides a tidier list of fields.
cursor = col.aggregate([
    {'$match': {'espece': {'$regex': 'um$'}}},
    {'$group': {'_id': '$arrondissement', 'CountNumberTrees': {'$sum': 1}}},
    {'$project': {'_id': 0, 'arrondissement': '$_id', 'CountNumberTrees': '$CountNumberTrees'}}
])
print(list(cursor))
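Since the question already compiles a Python regex, you can also pass a compiled pattern directly in the $match stage; PyMongo encodes compiled patterns as BSON regular expressions, so this sketch should be equivalent (note the $ anchor, so only species ending in 'um' match):
import re
from pprint import pprint

pattern = re.compile('um$')  # anchored at the end, unlike re.compile('um')
cursor = col.aggregate([
    {'$match': {'espece': pattern}},
    {'$group': {'_id': '$arrondissement', 'CountNumberTrees': {'$sum': 1}}},
])
pprint(list(cursor))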
I want to add new keys to an existing object in a MongoDB document. I am trying to update the specific object with an update query, but I don't see the new keys in the database.
I have an object like this:
{'_id': 'patent_1023',
'raw': {'id': 'CN-109897889-A',
'title': 'A kind of LAMP(ring mediated isothermal amplification) product visible detection method',
'assignee': '北京天恩泽基因科技有限公司',
'inventor/author': '徐堤',
'priority_date': '2019-04-17',
'filing/creation_date': '2019-04-17',
'publication_date': '2019-06-18',
'grant_date': None,
'result_link': 'https://patents.google.com/patent/CN109897889A/en', 'representative_figure_link': None
},
'source': 'Google Patent'}
I added two new keys to raw, and I want to update only 'raw' with the new keys 'abstract' and 'description'.
Here is what I have done.
d = client.find_one({'_id': {'$in': ids}})
d['raw'].update(missing_data) # missing_data contain new keys to be added in raw.
here = client.find_one_and_update({'_id': d['_id']}, {'$set': {"raw": d['raw']}})
Both update_one and update_many will work with this:
missing_data = {'abstract': 'a book', 'description': 'a fun book'}
ids = ['patent_1023', 'X']
rc = db.foo.update_one(
    {'_id': {'$in': ids}},
    # Use the pipeline form of update to exploit richer agg framework
    # functions like $mergeObjects. Below we are saying "take the
    # incoming raw object, overlay the missing_data object on top of
    # it, and then set that back into raw and save":
    [{'$set': {
        'raw': {'$mergeObjects': ['$$ROOT.raw', missing_data]}
    }}]
)
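A quick way to confirm the new keys landed in raw (a sketch, using the same db.foo collection and _id as above):
doc = db.foo.find_one({'_id': 'patent_1023'}, {'raw.abstract': 1, 'raw.description': 1})
print(doc)
# Expect something like:
# {'_id': 'patent_1023', 'raw': {'abstract': 'a book', 'description': 'a fun book'}}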
I want to find the duplicated documents in my MongoDB based on name. I have the following code:
def Check_BFA_DB(options):
    issue_list = []
    client = MongoClient(options.host, int(options.port))
    db = client[options.db]
    collection = db[options.collection]
    names = [{'$project': {'name': '$name'}}]
    name_cursor = collection.aggregate(names, cursor={})
    for name in name_cursor:
        issue_list.append(name)
        print(name)
It will print all names; how can I print only the duplicated ones?
Any help is appreciated!
The following query will show only duplicates:
db['collection_name'].aggregate([{'$group': {'_id':'$name', 'count': {'$sum': 1}}}, {'$match': {'count': {'$gt': 1}}}])
How it works:
Step 1:
Go over the whole collection, group the documents by the property called name, and for each name count how many times it is used in the collection.
Step 2:
Filter (using the $match stage) to keep only documents in which the count is greater than 1 (the $gt operator).
An example (written for mongo shell, but can be easily adapted for python):
db.a.insert({name: "name1"})
db.a.insert({name: "name1"})
db.a.insert({name: "name2"})
db.a.aggregate([{"$group": {_id:"$name", count: {"$sum": 1}}}, {$match: {count: {"$gt": 1}}}])
Result is { "_id" : "name1", "count" : 2 }
So your code should look something like this:
def Check_BFA_DB(options):
    issue_list = []
    client = MongoClient(options.host, int(options.port))
    db = client[options.db]
    name_cursor = db[options.collection].aggregate([
        {'$group': {'_id': '$name', 'count': {'$sum': 1}}},
        {'$match': {'count': {'$gt': 1}}}
    ])
    for document in name_cursor:
        name = document['_id']
        issue_list.append(name)
        print(name)
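If you also want to know which documents are the duplicates, not just the duplicated names, you could collect their _id values in the same $group stage. A sketch along the same lines, using the same db and options as inside the function above:
pipeline = [
    {'$group': {'_id': '$name', 'count': {'$sum': 1}, 'ids': {'$addToSet': '$_id'}}},
    {'$match': {'count': {'$gt': 1}}}
]
for document in db[options.collection].aggregate(pipeline):
    print(document['_id'], document['ids'])  # the duplicated name and the _ids that share it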
BTW (not related to the question), the Python naming convention for function names is lowercase with underscores, so you might want to call it check_bfa_db().
I have a pretty big dictionary which looks like this:
{
'startIndex': 1,
'username': 'myemail@gmail.com',
'items': [{
'id': '67022006',
'name': 'Adopt-a-Hydrant',
'kind': 'analytics#accountSummary',
'webProperties': [{
'id': 'UA-67522226-1',
'name': 'Adopt-a-Hydrant',
'websiteUrl': 'https://www.udemy.com/',
'internalWebPropertyId': '104343473',
'profiles': [{
'id': '108333146',
'name': 'Adopt a Hydrant (Udemy)',
'type': 'WEB',
'kind': 'analytics#profileSummary'
}, {
'id': '132099908',
'name': 'Unfiltered view',
'type': 'WEB',
'kind': 'analytics#profileSummary'
}],
'level': 'STANDARD',
'kind': 'analytics#webPropertySummary'
}]
}, {
'id': '44222959',
'name': 'A223n',
'kind': 'analytics#accountSummary',
And so on....
When I copy this dictionary into my Jupyter notebook and run the exact same function I run in my Django code, it works as expected. Everything is literally the same: in my Django code I even print the dictionary out, copy it to the notebook, run it, and I get what I'm expecting.
Just for more info, this is the function:
google_profile = gp.google_profile  # Get google_profile from DB
print(google_profile)
all_properties = []
for properties in google_profile['items']:
    all_properties.append(properties)
site_selection = []
for single_property in all_properties:
    single_property_name = single_property['name']
    for single_view in single_property['webProperties'][0]['profiles']:
        single_view_id = single_view['id']
        single_view_name = single_view['name']
        selections = single_property_name + ' (View: ' + single_view_name + ' ID: ' + single_view_id + ')'
        site_selection.append(selections)
print(site_selection)
So my guess is that my notebook has some sort of JSON parser installed, or something like that? Is that possible? Why can't I access dictionaries in Django the same way I can in my IPython notebooks?
EDITS
More info:
The error is at the line: for properties in google_profile['items']:
Django debug is: TypeError at /gconnect/ string indices must be integers
Local Vars are:
all_properties =[]
current_user = 'myemail@gmail.com'
google_profile = `the above dictionary`
So just to make it clear for whoever finds this question: if you save a dictionary in a database, Django will save it as a string, so you won't be able to access it as a dictionary afterwards.
To solve this you can convert it back to a dictionary:
The answer from this post worked perfectly for me, in other words:
import json
s = "{'muffin' : 'lolz', 'foo' : 'kitty'}"
json_acceptable_string = s.replace("'", "\"")
d = json.loads(json_acceptable_string)
# d = {u'muffin': u'lolz', u'foo': u'kitty'}
There are many ways to convert a string to a dictionary; this is only one. If you stumbled into this problem, you can quickly check whether you have a string instead of a dictionary with:
print(type(var))
In my case I had:
<class 'str'>
before converting it with the above method and then I got
<class 'dict'>
and everything worked as expected.
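Another option worth mentioning, assuming the stored string is a Python-style dict literal (single quotes) rather than strict JSON, is ast.literal_eval from the standard library, which avoids the quote replacement. A minimal sketch:
import ast

s = "{'muffin' : 'lolz', 'foo' : 'kitty'}"
d = ast.literal_eval(s)  # safely evaluates literal Python structures only
# d = {'muffin': 'lolz', 'foo': 'kitty'}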
I am trying to get the country name from the latitude and longitude points from my pandas dataframe.
Currently I have used geolocator.reverse(latitude, longitude) to get the full address of the geographic location, but there is no option to retrieve only the country name from the full address, as it returns a list.
Method used:
def get_country(row):
pos = str(row['StartLat']) + ', ' + str(row['StartLong'])
locations = geolocator.reverse(pos)
return locations
Call to get_country by passing the dataframe:
df4['country'] = df4.apply(lambda row: get_country(row), axis = 1)
Current output:
StartLat StartLong Address
52.509669 13.376294 Potsdamer Platz, Mitte, Berlin, Deutschland, Europe
Just wondering whether there is some Python library to retrieve the country when we pass the geographic points.
Any help would be appreciated.
In your get_country function, your return value location will have an attribute raw, which is a dict that looks like this:
{
'address': {
'attraction': 'Potsdamer Platz',
'city': 'Berlin',
'city_district': 'Mitte',
'country': 'Deutschland',
'country_code': 'de',
'postcode': '10117',
'road': 'Potsdamer Platz',
'state': 'Berlin'
},
'boundingbox': ['52.5093982', '52.5095982', '13.3764983', '13.3766983'],
'display_name': 'Potsdamer Platz, Mitte, Berlin, 10117, Deutschland',
... and so on ...
}
So location.raw['address']['country'] gives 'Deutschland'.
If I read your question correctly, a possible solution could be:
def get_country(row):
    pos = str(row['StartLat']) + ', ' + str(row['StartLong'])
    location = geolocator.reverse(pos)
    return location.raw['address']['country']
EDIT: The format of the location.raw object will differ depending on which geolocator service you are using. My example uses geopy.geocoders.Nominatim, from the example on geopy's documentation site, so your results might differ.
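For completeness, a sketch of the whole flow with Nominatim (the user_agent string is an arbitrary name you choose; the dataframe and column names are taken from the question):
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="country-lookup-example")  # hypothetical agent name

def get_country(row):
    pos = str(row['StartLat']) + ', ' + str(row['StartLong'])
    location = geolocator.reverse(pos)
    return location.raw['address']['country']

df4['country'] = df4.apply(get_country, axis=1)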
My code, hopefully that helps:
from geopy.geocoders import Nominatim
nm = Nominatim()
place, (lat, lng) = nm.geocode("3995 23rd st, San Francisco,CA 94114")
print('Country' + ": " + place.split()[-1])
I'm not sure what service you're using with geopy, but as a small plug (which I'm probably biased towards), I think this could be a simpler solution for you:
https://github.com/Ziptastic/ziptastic-python
from ziptastic import Ziptastic
# Set API key.
api = Ziptastic('<your api key>')
result = api.get_from_coordinates('42.9934', '-84.1595')
Which will return a list of dictionaries like so:
[
{
"city": "Owosso",
"geohash": "dpshsfsytw8k",
"country": "US",
"county": "Shiawassee",
"state": "Michigan",
"state_short": "MI",
"postal_code": "48867",
"latitude": 42.9934,
"longitude": -84.1595,
"timezone": "America/Detroit"
}
]
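To pull just the country out of that response, assuming the list-of-dicts shape shown above, something like this should work:
result = api.get_from_coordinates('42.9934', '-84.1595')
if result:  # assuming an empty list is returned when nothing is found
    print(result[0]['country'])  # e.g. 'US'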