Parsing a JSON string within pymongo - python

I have a field called financials that contains a json string.
{
    "_id" : ObjectId("57506d74c469888f0d631be6"),
    "financials" : "{\"year\":[2015], ...}"
}
What I currently do is extract the data, convert it to a pandas dataframe, parse the string using json.loads and fiddle with the financial data from there.
Is there any way to parse the JSON string in pymongo, preferably as part of the aggregation pipeline, as I wish to use some of its stages (namely $unwind)?

I do not know how to do it via pymongo (which probably means there is no way to; the $convert operator, for example, has no option for parsing a string into a document), but an alternative is to do it from the mongo shell using JSON.parse.
db.YourCollection.find().forEach(function (doc) {
    var modified_data = JSON.parse(doc.financials);
    db.YourCollection.updateOne({ _id: doc._id }, { $set: { financials: modified_data } });
});
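If you would rather keep everything in Python, a rough pymongo equivalent of that shell loop might look like the following (a sketch; it assumes the same YourCollection name and that every string-typed financials value parses as JSON):
import json

from pymongo import MongoClient

client = MongoClient()
db = client.test  # assumption: substitute your database name

# Parse the embedded JSON string and write it back as a real subdocument,
# after which aggregation stages such as $unwind can work with it directly.
for doc in db.YourCollection.find({"financials": {"$type": "string"}}):
    parsed = json.loads(doc["financials"])
    db.YourCollection.update_one({"_id": doc["_id"]},
                                 {"$set": {"financials": parsed}})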

Related

PySpark: validate JSON schema matches expected schema, for a JSON column in dataframe

I am using PySpark to build an ETL job that loads 25M log records per day, each having a column containing a string whose content is nested JSON. The JSON is large (~400 distinct fields) and complex, with nesting, arrays, and so on. My process extracts a subset of the JSON data into flattened/un-nested/exploded columns.
Problem: the source JSON has an inconsistent schema, e.g. some records have scalar values for a field while other records have objects (maps) under the same name, or the data types are otherwise inconsistent.
I have a schema of the expected/allowed JSON, and if I know that a record does not meet the schema, I can clean the data in many cases. But I don't know which records are problematic, or why.
When the JSON string I am parsing does not meet the schema, I want to use the PySpark options mode set to PERMISSIVE and columnNameOfCorruptRecord, so that JSON not meeting my schema is stored in a dedicated column of my dataframe.
Here's one case of inconsistencies that I would need to clean first:
{
    "location": "123 Main Street"
}
while a different record expresses location in this format:
{
    "location": {
        "value": "456 Elm Street",
        "updated_at": "2023-02-06"
    }
}
There are many such inconsistencies, and when they are found, it appears that Spark mangles most or all of the transformed record. When the JSON meets the schema, everything is good.
The PySpark code uses a specific JSON document to provide "sample JSON" that I turn into a valid JSON schema. Then, PySpark parses the column with a string of JSON (called attributes) as follows:
# Imports assumed by the snippets below.
from pyspark.sql.functions import col, from_json, schema_of_json

# `sample_json` is a string of the expected format of JSON, that this turns into a schema.
# This JSON has a node called `_attributes_json_schema_errors`. See below.
json_schema = schema_of_json(sample_json)

# Later in the code, I use the `from_json` function, with options, to apply the schema.
# The code that isn't working is the `options` dict, so bad records are separated out.
attributes_df = (attributes_df
    .withColumn(
        'attributes_struct',
        from_json(
            col=col('attributes'),
            schema=json_schema,
            options={
                "mode": "PERMISSIVE",
                "columnNameOfCorruptRecord": "attributes_struct._attributes_json_schema_errors"
            }
        )
    )
    .drop('attributes')
    .cache()
)
I understand that the from_json PySpark function accepts options that let me define how errors are handled. In my case, I expected that with the mode and columnNameOfCorruptRecord options, when PySpark encounters a record that cannot be converted using from_json and the supplied schema, the record would instead be stored in the special column for corrupted records.
But this does not seem to be working. I have found several examples that do not meet the schema, and the transformed results have almost everything as null, with nothing in the special column.
Is there something about the way I am calling this, or other assumptions I am making that are incorrect?
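For comparison, here is a minimal, self-contained sketch of how mode and columnNameOfCorruptRecord are typically wired up with from_json. Two assumptions worth checking against your Spark version: the corrupt-record column must itself be declared in the schema as a top-level StringType field (a nested path like attributes_struct._attributes_json_schema_errors will not be resolved by the parser), and a field whose value cannot be converted to the schema's type is what triggers the corrupt-record path:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# The corrupt-record column is part of the schema, as a top-level string.
schema = StructType([
    StructField("location", StringType(), True),
    StructField("visits", LongType(), True),
    StructField("_attributes_json_schema_errors", StringType(), True),
])

df = spark.createDataFrame(
    [('{"location": "123 Main Street", "visits": 5}',),       # matches the schema
     ('{"location": "456 Elm Street", "visits": "lots"}',)],  # "lots" is not a long
    ["attributes"],
)

parsed = df.withColumn(
    "attributes_struct",
    from_json(
        col("attributes"),
        schema,
        {"mode": "PERMISSIVE",
         "columnNameOfCorruptRecord": "_attributes_json_schema_errors"},
    ),
)

# Rows that failed conversion should carry their raw JSON in the error field.
parsed.select("attributes_struct._attributes_json_schema_errors").show(truncate=False)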

Pymongo Insert Doc with String ID's rather than ObjectIDs

I am trying to get pymongo to insert new documents whose IDs are strings rather than ObjectIds. The app I am building integrates Meteor and Python; Meteor inserts string IDs, so having to work with both strings and ObjectIds adds complexity.
Example:
Meteor-inserted doc:
{
    "_id" : "22FHWpvqrAeyfvh7B"
}
Pymongo-inserted doc:
{
    "_id" : ObjectId("5880387d1fd21c2dc66e9b7d")
}
You could just switch your Meteor app to insert ObjectIds instead of strings. Just use the idGeneration option and set its value to 'MONGO'.
Here is an example.
var todos = new Mongo.Collection('todos', {
    idGeneration: 'MONGO'
});
It is described in the Meteor docs here.
Or, if you want Meteor to keep strings and can't figure out how to get pymongo to store strings, you can take the approach described here to convert between ObjectIds and strings in pymongo.
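For the pymongo side, note that nothing needs to be configured: if you set _id yourself before inserting, MongoDB keeps the string instead of generating an ObjectId. A small sketch that mimics Meteor-style random string IDs (the 17-character length and the alphabet here are assumptions, not Meteor's exact generator):
import secrets
import string

from pymongo import MongoClient

def random_string_id(length=17):
    # Hypothetical helper producing alphanumeric IDs shaped like Meteor's.
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))

client = MongoClient()
db = client.test  # assumption: substitute your database name

# Supplying _id explicitly prevents pymongo from generating an ObjectId.
db.todos.insert_one({"_id": random_string_id(), "text": "buy milk"})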

How to use MongoDB find() to perform range queries on numeric strings?

How can I use find() in MongoDB with a range operator such as $gt when the stored value is a numeric string?
If I run the following line (that searches the MongoDB database for modes higher than 1):
cursor = db.foo.find({"mode": {"$gt": 1}})
This will work only if the data in MongoDB is in the format:
data = {"mode":3}
But I need to use the find() with this data:
data = {"mode":'3'} # as string
How can I do this?
Here is my example:
from pymongo import MongoClient

client = MongoClient()
db = client.test
db.foo.drop()

data = {"mode": 3}    # works, because the value is numeric
data = {"mode": '3'}  # won't work! But my database contains only numeric strings... how can I use it like this?
db.foo.insert_one(data)
print(db.foo.count())

cursor = db.foo.find({"mode": {"$gt": 1}})
for document in cursor:
    print(document)
If you leave your numeric data stored in the database as strings, then in order to query it with range operators such as $gt and $lt you're going to have to use one of the approaches below.
First, you can use JavaScript's automatic conversion to run your range queries. This works as shown below, but it is very limited as you will not be able to use any indexes, as explained in the comments to previous answers. Thus for big data sets, this will be prohibitively slow.
db.foo.find("this.mode > 1");
A second approach would involve regular expressions. You will have to figure out what regex to use, but once you have that, you can use the syntax below to run your query or use the $regex operator as highlighted here.
db.foo.find({ mode: /pattern/<options> });
Aside from having to figure out some complex regex, again there are possible performance issues with this approach, as explained here (see extract below). Most likely, you will also run into issues where your query is not taking advantage of indexes.
If an index exists for the field, then MongoDB matches the regular expression against the values in the index, which can be faster than a collection scan. Further optimization can occur if the regular expression is a “prefix expression”, which means that all potential matches start with the same string. This allows MongoDB to construct a “range” from that prefix and only match against those values from the index that fall within that range.
Because of this, if you're going to be running these queries often, I would recommend that you follow a third approach, which would be to change your schema and store your data as numbers. You can achieve this with a simple migration script such as the following in JavaScript, which you could run in the shell.
var cursor = db.foo.find();
while (cursor.hasNext()) {
    var doc = cursor.next();
    var _id = doc._id;
    if (doc.mode) {
        var modeString = doc.mode;
        var modeInt = parseInt(modeString, 10);
        db.foo.update({ _id: _id }, { $set: { mode: modeInt } });
    }
}
Having done that you will be able to query your data using operators such as $gt and $lt, sort it without much hassle, and take advantage of indexes.
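Since the question is about pymongo, a rough Python equivalent of that migration might be (a sketch; it assumes mode only ever holds base-10 numeric strings):
from pymongo import MongoClient, UpdateOne

client = MongoClient()
db = client.test

# Convert every string-typed `mode` value to an integer in one bulk write.
updates = [
    UpdateOne({"_id": doc["_id"]}, {"$set": {"mode": int(doc["mode"])}})
    for doc in db.foo.find({"mode": {"$type": "string"}}, {"mode": 1})
]
if updates:
    db.foo.bulk_write(updates)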
From Mongo docs,
$type selects the documents where the value of the field is an instance of the specified BSON type. Querying by data type is useful when dealing with highly unstructured data where data types are not predictable.
{ field: { $type: BSON type number | String alias } }
$type returns documents where the BSON type of the field matches the BSON type passed to $type.
I guess you'll have to pass the $type explicitly in your case, which might look like:
db.foo.find({ mode: { $type: "string" } })
You could try this syntax (JavaScript's automatic conversion):
db.test.find("this.mode > 1")

Use ElasticSearch search URI into Python PyEs client

I have the following working query, using CURL:
curl -X GET 'http://myhost/myindex/_search?q=text:%7B1933%20TO%201949%7D'
which identifies all documents inside the index myindex whose text field (string type) contains mentions of dates between 1933 and 1949. I would like to use this query programmatically from Python, and to this end I have the Python ElasticSearch client installed, Pyes:
from elasticsearch import ElasticSearch
from pyes import *
and then I would like to call
es = ElasticSearch('myhost')
totalDocs = es.search('myindex', body={'query':{'query_string':{'query': 'text:%7B1933%20TO%201949%7D'}},"size":0})['hits']['total']
but this syntax is not working. Also it is important that I look into the text field only for these date mentions. Are there any ways to use the initial query in Python? Many thanks!
Later edit: I have also tried it like this:
totalDocs = es.search('myindex', 'text:%7B1933%20TO%201949%7D')['hits']['total']
It works, but it returns no documents at all.
Actually my question boils down to this one: ElasticSearch - specify range for a string field.
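One detail worth checking: %7B, %7D and %20 are simply the URL-encoded forms of {, } and a space, which the curl URI needs but a JSON body does not. A sketch of the same exclusive-range query with the plain Lucene query string, using the elasticsearch client (the host and client version here are assumptions):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://myhost:9200")  # assumption: adjust host/port

result = es.search(
    index="myindex",
    body={
        "query": {"query_string": {"query": "text:{1933 TO 1949}"}},
        "size": 0,
    },
)
# Note: on Elasticsearch 7+ this is a dict like {"value": ..., "relation": ...}.
totalDocs = result["hits"]["total"]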

How do I get json output from a date value from mongodb using pymongo

I am collecting stock market data periodically and I am storing in mongodb using pymongo as:
db.apple_stock.insert({'time': datetime.datetime.utcnow(), 'price': price})
Now I want the output in JSON format so that I can use highstocks:
[
    [1403546401372, 343],
    [1403560801637, 454],
    [1403575202199, 345],
    [1403618402379, 345]
]
Tornado is running on the server, and 'mysite.com/api/stock.json' should provide the above data as JSON.
So I query my database and use pymongo's json_util to dump to JSON:
from bson.json_util import dumps
dumps(query_result)
I am getting output as:
[
    [{"$date": 1403546401372}, 343],
    [{"$date": 1403560801637}, 454],
    [{"$date": 1403575202199}, 353]
]
So how do I change the first item from a dictionary to a date, keeping only the value part? Is there a function available that does this, or do I have to iterate through the list and convert it myself?
Secondly, if I really do have to iterate the list, what is the proper way of storing the data in MongoDB so that I get the required output directly?
The dumps functionality is deliberately supplied in the pymongo driver utilities in order to produce what is known as MongoDB Extended JSON.
The purpose of this is to preserve "type fidelity": JSON itself is not aware of the strict "types" that the BSON format is designed around and that MongoDB itself uses. Extended JSON lets you transfer JSON to clients that "may" understand the "types" presented, such as "$date" or "$oid", and use those keys to de-serialize the JSON into a specific "type" under that language specification.
What you will find "otherwise" with a standard form of JSON encode under most language implementations is one of the following:
An interesting tuple of the time values
A string representing the time value
An epoch timestamp representing the time value
The best cases for "custom serialization" are either to iterate the returned structure and implement serialization of the types yourself, or to just use the dumps form along with a "decode" of the JSON and then remove those "keys" identifying the "type".
Or, of course, just live with the base JSON encode outside of the special library and its results.
The BSON date types that produce this output are the proper way of storing dates in MongoDB; the behavior of the library function is by design. How you ultimately consume its output is up to you.
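To make the iterate-it-yourself option concrete, here is a minimal sketch that emits the [epoch_ms, price] pairs Highstock expects instead of extended JSON (it assumes the apple_stock collection from the question, and that pymongo returns BSON dates as timezone-naive UTC datetimes, which is its default):
import calendar
import json

from pymongo import MongoClient

client = MongoClient()
db = client.test  # assumption: substitute your database name

def epoch_millis(dt):
    # Convert a naive UTC datetime to milliseconds since the Unix epoch.
    return calendar.timegm(dt.timetuple()) * 1000 + dt.microsecond // 1000

pairs = [
    [epoch_millis(doc["time"]), doc["price"]]
    for doc in db.apple_stock.find({}, {"time": 1, "price": 1, "_id": 0})
]
print(json.dumps(pairs))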
