MongoDB watch() aggregation match by field value

MongoDB watch() aggregation match by field value - python

When I use the watch() function on my collection, I am passing a aggregation to filter what comes through. I was able to get operationType to work correctly, but I also only want to include documents in which the city field is equal to Vancouver. The current syntax I am using does not work:
change_stream = client.mydb.mycollection.watch([
{
'$match': {
'operationType': { '$in': ['replace', 'insert'] },
'fullDocument': {'city': {'$eq': 'Vancouver'} }
}
}
])
And for reference, this is the what the dictionary that I'm aggregating looks like:
{'_id': {'_data': '825F...E0004'},
'clusterTime': Timestamp(1595565179, 2),
'documentKey': {'_id': ObjectId('70fc7871...')},
'fullDocument': {'_id': ObjectId('70fc7871...'),
'city': 'Vancouver',
'ns': {'coll': 'notification', 'db': 'pipeline'},
'operationType': 'replace'}

I found I just have to use a dot to access the nested dictionary:
change_stream = client.mydb.mycollection.watch([
{
'$match': {
'operationType': { '$in': ['replace', 'insert'] },
'fullDocument.city': 'Vancouver' }
}
}
])

Related

Generalize algorithm for a loop comparing to last record?

I have a data set which I can represent by this toy example of a list of dictionaries:
data = [{
"_id" : "001",
"Location" : "NY",
"start_date" : "2022-01-01T00:00:00Z",
"Foo" : "fruits"
},
{
"_id" : "002",
"Location" : "NY",
"start_date" : "2022-01-02T00:00:00Z",
"Foo" : "fruits"
},
{
"_id" : "011",
"Location" : "NY",
"start_date" : "2022-02-01T00:00:00Z",
"Bar" : "vegetables"
},
{
"_id" : "012",
"Location" : "NY",
"Start_Date" : "2022-02-02T00:00:00Z",
"Bar" : "vegetables"
},
{
"_id" : "101",
"Location" : "NY",
"Start_Date" : "2022-03-01T00:00:00Z",
"Baz" : "pizza"
},
{
"_id" : "102",
"Location" : "NY",
"Start_Date" : "2022-03-2T00:00:00Z",
"Baz" : "pizza"
},
]
Here is an algorithm in Python which collects each of the keys in each 'collection' and whenever there is a key change, the algorithm adds those keys to output.
data_keys = []
for i, lst in enumerate(data):
all_keys = []
for k, v in lst.items():
all_keys.append(k)
if k.lower() == 'start_date':
start_date = v
this_coll = {'start_date': start_date, 'all_keys': all_keys}
if i == 0:
data_keys.append(this_coll)
else:
last_coll = data_keys[-1]
if this_coll['all_keys'] == last_coll['all_keys']:
continue
else:
data_keys.append(this_coll)
The correct output given here records each change of field name: Foo, Bar, Baz as well as the change of case in field start_date to Start_Date:
[{'start_date': '2022-01-01T00:00:00Z',
'all_keys': ['_id', 'Location', 'start_date', 'Foo']},
{'start_date': '2022-02-01T00:00:00Z',
'all_keys': ['_id', 'Location', 'start_date', 'Bar']},
{'start_date': '2022-02-02T00:00:00Z',
'all_keys': ['_id', 'Location', 'Start_Date', 'Bar']},
{'start_date': '2022-03-01T00:00:00Z',
'all_keys': ['_id', 'Location', 'Start_Date', 'Baz']}]
Is there a general algorithm which covers this pattern comparing current to previous item in a stack?
I need to generalize this algorithm and find a solution to do exactly the same thing with MongoDB documents in a collection. In order for me to discover if Mongo has an Aggregation Pipeline Operator which I could use, I must first understand if this basic algorithm has other common forms so I know what to look for.
Or someone who knows MongoDB aggregation pipelines really well could suggest operators which would produce the desired result?

EDIT: If you want to use a query for this, one option is something like:
The $objectToArray allow to format the keys as values, and the $ifNull allows to check several options of start_date.
The $unwind allows us to sort the keys.
The $group allow us to undo the $unwind, but now with sorted keys
$reduce to create a string from all keys, so we'll have something to compare.
group again, but now with our string, so we'll only have documents for changes.
db.collection.aggregate([
{
$project: {
data: {$objectToArray: "$$ROOT"},
start_date: {$ifNull: ["$start_date", "$Start_Date"]}
}
},
{$unwind: "$data"},
{$project: {start_date: 1, key: "$data.k", _id: 0}},
{$sort: {start_date: 1, key: 1}},
{$group: {_id: "$start_date", all_keys: {$push: "$key"}}},
{
$project: {
all_keys: 1,
all_keys_string: {
$reduce: {
input: "$all_keys",
initialValue: "",
in: {$concat: ["$$value", "$$this"]}
}
}
}
},
{
$group: {
_id: "$all_keys_string",
all_keys: {$first: "$all_keys"},
start_date: {$first: "$_id"}
}
},
{$unset: "_id"}
])
Playground example

itertools.groupby iterates subiterators when a key value has changed. It does the work of tracking a changing key for you. In your case, that's the keys of the dictionary. You can create a list comprehension that takes the first value from each of these subiterators.
import itertools
data = ... your data ...
data_keys = [next(val)
for _, val in itertools.groupby(data, lambda record: record.keys())]
for row in data_keys:
print(row)
Result
{'_id': '001', 'Location': 'NY', 'start_date': '2022-01-01T00:00:00Z', 'Foo': 'fruits'}
{'_id': '011', 'Location': 'NY', 'start_date': '2022-02-01T00:00:00Z', 'Bar': 'vegetables'}
{'_id': '012', 'Location': 'NY', 'Start_Date': '2022-02-02T00:00:00Z', 'Bar': 'vegetables'}
{'_id': '101', 'Location': 'NY', 'Start_Date': '2022-03-01T00:00:00Z', 'Baz': 'pizza'}

How do I get the value of a dict item within a list, within a dict?

How do I get the value of a dict item within a list, within a dict in Python? Please see the following code for an example of what I mean.
I use the following lines of code in Python to get data from an API.
res = requests.get('https://api.data.amsterdam.nl/bag/v1.1/nummeraanduiding/', params)
data = res.json()
data then returns the following Python dictionary:
{
'_links': {
'next': {
'href': null
},
'previous': {
"href": null
},
'self': {
'href': 'https://api.data.amsterdam.nl/bag/v1.1/nummeraanduiding/'
}
},
'count': 1,
'results': [
{
'_display': 'Maple Street 99',
'_links': {
'self': {
'href': 'https://api.data.amsterdam.nl/bag/v1.1/nummeraanduiding/XXXXXXXXXXXXXXXX/'
}
},
'dataset': 'bag',
'landelijk_id': 'XXXXXXXXXXXXXXXX',
'type_adres': 'Hoofdadres',
'vbo_status': 'Verblijfsobject in gebruik'
}
]
}
Using Python, how do I get the value for 'landelijk_id', represented by the twelve Xs?

This should work:
>>> data['results'][0]['landelijk_id']
"XXXXXXXXXXXXXXXX"
You can just chain those [] for each child you need to access.

I'd recommend using the jmespath package to make handling nested Dictionaries easier. https://pypi.org/project/jmespath/
import jmespath
import requests
res = requests.get('https://api.data.amsterdam.nl/bag/v1.1/nummeraanduiding/', params)
data = res.json()
print(jmespath.search('results[].landelijk_id', data)

Validating arbitrary dict keys with strict schemas with Cerberus

I am trying to validate JSON, the schema for which specifies a list of dicts with arbitrary string keys, the corresponding values of which are dicts with a strict schema (i.e, the keys of the inner dict are strictly some string, here 'a'). From the Cerberus docs, I think that what I want is the 'keysrules' rule. The example in the docs seems to only show how to use 'keysrules' to validate arbitrary keys, but not their values. I wrote the below code as an example; the best I could do was assume that 'keysrules' would support a 'schema' argument for defining a schema for these values.
keysrules = {
'myDict': {
'type': 'dict',
'keysrules': {
'type': 'string',
'schema': {
'type': 'dict',
'schema': {
'a': {'type': 'string'}
}
}
}
}
}
keysRulesTest = {
'myDict': {
'arbitraryStringKey': {
'a': 'arbitraryStringValue'
},
'anotherArbitraryStringKey': {
'shouldNotValidate': 'arbitraryStringValue'
}
}
}
def test_rules():
v = Validator(keysrules)
if not v.validate(keysRulesTest):
print(v.errors)
assert(0)
This example does validate, and I would like it to not validate on 'shouldNotValidate', because that key should be 'a'. Does the flexibility implied by 'keysrules' (i.e, keys governed by 'keysrules' have no constraint other than {'type': 'string'}) propagate down recursively to all schemas underneath it? Or have I made some different error? How can I achieve my desired outcome?

I didn't want keysrules, I wanted valuesrules:
keysrules = {
'myDict': {
'type': 'dict',
'valuesrules': {
'type': 'dict',
'schema': {
'a': {'type': 'string'}
}
}
}
}
keysRulesTest = {
'myDict': {
'arbitraryStringKey': {
'a': 'arbitraryStringValue'
},
'anotherArbitraryStringKey': {
'shouldNotValidate': 'arbitraryStringValue'
}
}
}
def test_rules():
v = Validator(keysrules)
if not v.validate(keysRulesTest):
print(v.errors)
assert(0)
This produces my desired outcome.

A Python dictionary with repeated fields

I'm constructing a dictionary with Python to use with a SOAP API.
My SOAP API takes an input like this:
<dataArray>
<AccountingYearData>
<Handle>
<Year>string</Year>
</Handle>
<Year>string</Year>
<FromDate>dateTime</FromDate>
<ToDate>dateTime</ToDate>
<IsClosed>boolean</IsClosed>
</AccountingYearData>
<AccountingYearData>
<Handle>
<Year>string</Year>
</Handle>
<Year>string</Year>
<FromDate>dateTime</FromDate>
<ToDate>dateTime</ToDate>
<IsClosed>boolean</IsClosed>
</AccountingYearData>
</dataArray>
Se this for the full string
https://api.e-conomic.com/secure/api1/EconomicWebService.asmx?op=AccountingYear_CreateFromDataArray
Notice how the field appears multiple times.
How can I create a Python dict with this data?
If I do this:
data = {
'dataArray':{
'AccountingYearData':{
'Handle':{'Year':'2017'},
'Year':'2017',
'FromDate':'2017-01-01',
'ToDate':'2017-12-31',
'IsClosed':'False'
},
'AccountingYearData':{
'Handle':{'Year':'2017'},
'Year':'2017',
'FromDate':'2017-01-01',
'ToDate':'2017-12-31',
'IsClosed':'False'
}
}
}
I get:
>>> type (data)
<type 'dict'>
>>> data {
'dataArray': {
'AccountingYearData': {
'IsClosed': 'False',
'FromDate': '2017-01-01',
'Handle': {'Year': '2017'},
'ToDate': '2017-12-31',
'Year': '2017'
}
}
}
It's as expected I think, but now what I need.

Well, the answer seems obvious and is even hinted by the "dataArray" name: if you have a list of items, then you want to use a list to store them:
data = {
'dataArray':[
{
'AccountingYearData':{
'Handle':{'Year':'2017'},
'Year':'2017',
'FromDate':'2017-01-01',
'ToDate':'2017-12-31',
'IsClosed':'False'
},
},
{
'AccountingYearData':{
'Handle':{'Year':'2017'},
'Year':'2017',
'FromDate':'2017-01-01',
'ToDate':'2017-12-31',
'IsClosed':'False'
},
},
]
}

Unable to append data to array

I am retrieving a record set from a database.
Then using a for statement I am trying to construct my data to match a 3rd party API.
But I get this error and can't figure it out:
"errorType": "TypeError", "errorMessage": "list indices must be
integers, not str"
"messages['english']['merge_vars']['vars'].append({"
Below is my code:
cursor = connect_to_database()
records = get_records(cursor)
template = dict()
messages = dict()
template['english'] = "SOME_TEMPLATE reminder-to-user-english"
messages['english'] = {
'subject': "Reminder (#*|code|*)",
'from_email': 'mail#mail.com',
'from_name': 'Notifier',
'to': [],
'merge_vars': [],
'track_opens': True,
'track_clicks': True,
'important': True
}
for record in records:
record = dict(record)
if record['lang'] == 'english':
messages['english']['to'].append({
'email': record['email'],
'type': 'to'
})
messages['english']['merge_vars'].append({
'rcpt': record['email']
})
for (key, value) in record.iteritems():
messages['english']['merge_vars']['vars'].append({
'name': key,
'content': value
})
else:
template['other'] = "SOME_TEMPLATE reminder-to-user-other"
close_database_connection()
return messages
The goal is to get something like this below:
messages = {
'subject': "...",
'from_email': "...",
'from_name': "...",
'to': [
{
'email': '...',
'type': 'to',
},
{
'email': '...',
'type': 'to',
}
],
'merge_vars': [
{
'rcpt': '...',
'vars': [
{
'content': '...',
'name': '...'
},
{
'content': '...',
'name': '...'
}
]
},
{
'rcpt': '...',
'vars': [
{
'content': '...',
'name': '...'
},
{
'content': '...',
'name': '...'
}
]
}
]
}

This code seems to indicate that messages['english']['merge_vars'] is a list, since you initialize it as such:
messages['english'] = {
...
'merge_vars': [],
...
}
And call append on it:
messages['english']['merge_vars'].append({
'rcpt': record['email']
})
However later, you treat it as a dictionary when you call:
messages['english']['merge_vars']['vars']
It seems what you want is something more like:
vars = [{'name': key, 'content': value} for key, value in record.iteritems()]
messages['english']['merge_vars'].append({
'rcpt': record['email'],
'vars': vars,
})
Then, the for loop is unnecessary.

What the error is saying is that you are trying to access an array element with the help of string not index (int).
I believe your mistake is in this line:
messages['english']['merge_vars']['vars'].append({..})
You declared merge_vars as array like so:
'merge_vars': []
So, you either make it dict like this:
'merge_vars': {}
Or, use it as array:
messages['english']['merge_vars'].append({..})
Hope it helps

Your issues, as the Error Message is saying, is here: messages['english']['merge_vars']['vars'].append({'name': key,'content': value})
The item messages['english']['merge_vars'] is a list and thus you're trying to access an element when you do something like list[i] and i cannot be a string, as is the case with 'vars'. You probably either need to drop the ['vars'] part or set messages['english']['merge_vars'] to be a dict so that it allows for additional indexing.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

MongoDB watch() aggregation match by field value - python

I found I just have to use a dot to access the nested dictionary: change_stream = client.mydb.mycollection.watch([ { '$match': { 'operationType': { '$in': ['replace', 'insert'] }, 'fullDocument.city': 'Vancouver' } } } ])

Related

Generalize algorithm for a loop comparing to last record?

How do I get the value of a dict item within a list, within a dict?

Validating arbitrary dict keys with strict schemas with Cerberus

A Python dictionary with repeated fields

Unable to append data to array

Categories

Resources