I have a large amount of data in a MongoDB collection which I need to analyze using pandas and pymongo in Jupyter. I am trying to import specific data into a dataframe.
Sample data:
{
    "stored": "2022-04-xx",
    ...
    ...
    "completedQueues": [
        "STATEMENT_FORWARDING_QUEUE",
        "STATEMENT_PERSON_QUEUE",
        "STATEMENT_QUERYBUILDERCACHE_QUEUE"
    ],
    "activities": [
        "https://example.com"
    ],
    "hash": "xxx",
    "agents": [
        "mailto:example@example.com"
    ],
    "statement": { <=== I want to import the data from "statement"
        "authority": {
            "objectType": "Agent",
            "name": "xxx",
            "mbox": "mailto:example@example.com"
        },
        "stored": "2022-04-xxx",
        "context": {
            "platform": "Unknown",
            "extensions": {
                "http://example.com": "",
                "xxx.com": {
                    "user_agent": "xxx"
                }
            }
        },
        "actor": {
            "objectType": "xxx",
            "name": "xxx",
            "mbox": "mailto:example@example.com"
        },
        "timestamp": "2022-04-xxx",
        "version": "1.0.0",
        "id": "xxx",
        "verb": {
            "id": "http://example.com",
            "display": {
                "en-US": "viewed"
            }
        },
        "object": {
            "objectType": "xxx",
            "id": "https://example.com",
            "definition": {
                "type": "http://example.com",
                "name": {
                    "en-US": ""
                },
                "description": {
                    "en-US": "Viewed"
                }
            }
        }
    }, <=== up to here
    "hasGeneratedId": true,
    ...
    ...
}
Notice that I am only interested in the data nested under "statement", and not in any other field whose name merely contains that string, e.g. the "STATEMENT_FORWARDING_QUEUE" entry above it.
What I am trying to accomplish is to import the data from "statement" (as indicated above) into a dataframe and arrange it like this:
id | authority objectType | authority name | authority mbox | stored | context platform | context extensions | actor objectType | actor name | ...
00 | Agent                | xxx            | mailto         | 2022-  | Unknown          | http://1           | xxx              | xxx        | ...
01 | Agent                | yyy            | mailto         | 2022-  | Unknown          | http://2           | yyy              | yyy        | ...
The idea is to be able to access any data like "authority name" or "actor objectType".
I have tried:
df = pd.DataFrame(list(collection.find(query, filters)))
df = json_normalize(list(collection.find(query, filters)))
with various queries, filters and slices, and also aggregate and map/reduce, but nothing produces the correct output.
I would also like to sort (newest to oldest) based on the "stored" field (sort('$natural',-1) ?), and maybe apply limit(xx) to the dataframe as well.
Any ideas?
Thanks in advance.
Try this:
df = json_normalize(list(
    collection.aggregate([
        {
            "$match": query
        },
        {
            "$replaceRoot": {
                "newRoot": "$statement"
            }
        }
    ])
))
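The $replaceRoot stage promotes the embedded "statement" sub-document to the root of each result, so json_normalize only sees the fields of interest and flattens them into dotted columns such as authority.name.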
Thanks for the answer, @pavel. It is spot on and pretty much solves the problem.
I also added sorting and limit, so if anyone is interested, the final code looks like this:
df = json_normalize(list(
    statements_coll.aggregate([
        {
            "$match": query
        },
        {
            "$replaceRoot": {
                "newRoot": "$statement"
            }
        },
        {
            "$sort": {
                "stored": -1
            }
        },
        {
            "$limit": 10
        }
    ])
))
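For anyone piecing this together end to end, here is a minimal sketch; the connection string, the database and collection names, and the empty query are assumptions, not part of the original question, and pd.json_normalize assumes pandas >= 1.0:

import pandas as pd
from pymongo import MongoClient

# Assumed connection details; adjust to your deployment.
client = MongoClient("mongodb://localhost:27017")
statements_coll = client["mydb"]["statements"]
query = {}  # placeholder $match filter

# Promote "statement" to the root, newest first, capped at 10 documents.
df = pd.json_normalize(list(
    statements_coll.aggregate([
        {"$match": query},
        {"$replaceRoot": {"newRoot": "$statement"}},
        {"$sort": {"stored": -1}},
        {"$limit": 10},
    ])
))

# Flattened columns use dotted paths, e.g. df["authority.name"].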
Related
I'm using Flask with the Jinja2 template engine and MongoDB via pymongo. These are my documents from two collections (phone and factory):
phone = db.get_collection("phone")
{
    "_id": ObjectId("63d8d39206c9f93e68d27206"),
    "brand": "Apple",
    "model": "iPhone XR",
    "year": NumberInt("2016"),
    "image": "https://apple-mania.com.ua/media/catalog/product/cache/e026f651b05122a6916299262b60c47d/a/p/apple-iphone-xr-yellow_1.png",
    "CPU": {
        "manufacturer": "A12 Bionic",
        "cores": NumberInt("10")
    },
    "misc": [
        "Bluetooth 5.0",
        "NFC",
        "GPS"
    ],
    "factory_id": ObjectId("63d8d42b7a4d7a7e825ef956")
}
factory = db.get_collection("factory")
{
    "_id": ObjectId("63d8d42b7a4d7a7e825ef956"),
    "name": "Foxconn",
    "stock": NumberInt("1000")
}
In my Python code, to retrieve the data I do:
models = list(
phone.find({"brand": brand}, projection={"model": True, "image": True, "factory_id": True})
)
How can I retrieve relative factory document by factory_id and have it as an embedded document in a models list?
I think you are looking for this query using the aggregation stage $lookup.
So this query:
First, $match by your desired brand.
Then do a "join" between the collections based on factory_id and store the matches in an array called "factories". The $lookup output is always an array because there can be more than one match.
Last, $project only the values you want. In this case, as _id is unique, you can take the factory from the array with $arrayElemAt at position 0.
So the code can be like this (I'm not a Python expert):
models = list(
    phone.aggregate([
        {
            "$match": {
                "brand": brand
            }
        },
        {
            "$lookup": {
                "from": "factory",
                "localField": "factory_id",
                "foreignField": "_id",
                "as": "factories"
            }
        },
        {
            "$project": {
                "model": True,
                "image": True,
                "factory": {
                    "$arrayElemAt": [
                        "$factories",
                        0
                    ]
                }
            }
        }
    ])
)
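Each returned document then embeds its factory sub-document, e.g. (field values taken from the sample documents above):

for m in models:
    print(m["model"], "->", m["factory"]["name"])  # iPhone XR -> Foxconn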
I have some stored data like this:
{
    "_id": 1,
    "serverAddresses": {
        "name1": "0.0.0.0:8000",
        "name2": "0.0.0.0:8001"
    }
}
I need the data aggregated like this:
[
    {
        "gameId": "1",
        "name": "name1",
        "url": "0.0.0.0:8000"
    },
    {
        "gameId": "1",
        "name": "name2",
        "url": "0.0.0.0:8001"
    }
]
What is the solution without using a for loop?
$project - Add an addresses field by converting $serverAddresses to a (key-value) array.
$unwind - Deconstruct the addresses field into multiple documents.
$replaceRoot - Shape the output document based on (2).
db.collection.aggregate([
    {
        "$project": {
            "addresses": {
                "$objectToArray": "$serverAddresses"
            }
        }
    },
    {
        "$unwind": "$addresses"
    },
    {
        "$replaceRoot": {
            "newRoot": {
                "gameId": "$_id",
                "name": "$addresses.k",
                "url": "$addresses.v"
            }
        }
    }
])
Sample Mongo Playground
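For reference, the same pipeline run from pymongo (a sketch; the connection string and the database/collection names are assumed):

from pymongo import MongoClient

games = MongoClient("mongodb://localhost:27017")["mydb"]["games"]  # assumed names

results = list(games.aggregate([
    {"$project": {"addresses": {"$objectToArray": "$serverAddresses"}}},
    {"$unwind": "$addresses"},
    {"$replaceRoot": {"newRoot": {
        "gameId": "$_id",
        "name": "$addresses.k",
        "url": "$addresses.v"
    }}}
]))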
I am currently using this to push a 'review' to my array of reviews in my perfumes collection:
mongo.db.perfumes.update(
    {"_id": perfume["_id"]},
    {
        "$push": {
            "reviews": {
                "_id": review_id,
                "review_content": form.review.data,
                "reviewer": current_user.username,
                "date_reviewed": datetime.utcnow(),
                "reviewer_picture": current_user.avatar,
            }
        }
    },
)
So as a result my document is:
[
    {
        "_id": {
            "$oid": "5ebf29dd1f3fe19434e41761"
        },
        "author": "Guillermo",
        "brand": "A test brand",
        "name": "A test perfume",
        "perfume_type": "Woody",
        "description": "<p>A test description</p>",
        "date_updated": {
            "$date": "2020-05-15T23:46:37.242Z"
        },
        "public": false,
        "picture": "generic.png",
        "reviews": [
            {
                "_id": {
                    "$oid": "5ebf29e90000000000000000"
                },
                "review_content": "<p>A test review</p>",
                "reviewer": "Guillermo",
                "date_reviewed": {
                    "$date": "2020-05-15T23:46:49.308Z"
                },
                "reviewer_picture": "a92de23ae01cdfde.jpg"
            }
        ]
    }
]
I want to create another route to update or edit the contents of my review (review_content).
What's the way to update that subarray in my collection?
Thank you!!
Let's assume you want to update the review_content of a particular review. You can use the query below:
mongo.db.perfumes.update(
    {"_id": perfume["_id"], "reviews._id": review["_id"]},
    {"$set": {"reviews.$.review_content": "This is my new content"}},
)
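The positional $ operator updates the one array element matched by "reviews._id" in the filter. On newer PyMongo versions, where update() is deprecated, the same call would look like this (a sketch reusing the form data from the question's route):

mongo.db.perfumes.update_one(
    {"_id": perfume["_id"], "reviews._id": review["_id"]},
    {"$set": {"reviews.$.review_content": form.review.data}},
)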
I have this Python code where I first create an Elasticsearch mapping, and then after the data is inserted I search that data:
# Create Data mapping
data_mapping = {
    "mappings": {
        (doc_type): {
            "properties": {
                "data_id": {
                    "type": "string",
                    "fields": {
                        "stemmed": {
                            "type": "string",
                            "analyzer": "english"
                        }
                    }
                },
                "data": {
                    "type": "array",
                    "fields": {
                        "stemmed": {
                            "type": "string",
                            "analyzer": "english"
                        }
                    }
                },
                "resp": {
                    "type": "string",
                    "fields": {
                        "stemmed": {
                            "type": "string",
                            "analyzer": "english"
                        }
                    }
                },
                "update": {
                    "type": "integer",
                    "fields": {
                        "stemmed": {
                            "type": "integer",
                            "analyzer": "english"
                        }
                    }
                }
            }
        }
    }
}
# Search
data_search = {
    "query": {
        "function_score": {
            "query": {
                "match": {
                    "data": question
                }
            },
            "field_value_factor": {
                "field": "update",
                "modifier": "log2p"
            }
        }
    }
}
response = es.search(index=doc_type, body=data_search)
Now what I am unable to figure out is where and how to specify stopwords in the above code. This link gives an example of using stopwords, but I am unable to relate it to my code. Do I need to specify them in the data mapping section, the search section, or both? And how do I specify them?
Any example help would be appreciated!
UPDATE: Based on some comments, the suggestion is to add either an analysis section or a settings section, but I am not sure how I should add those to the mapping I have written above.
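One possible shape (a sketch only, not from the original thread; it relies on the standard analyzer accepting a stopwords parameter such as the built-in _english_ list) is to put an analysis block under settings in the same body you pass when creating the index, then point the stemmed subfields at that analyzer:

data_mapping = {
    "settings": {
        "analysis": {
            "analyzer": {
                "english_stop": {
                    "type": "standard",
                    "stopwords": "_english_"
                }
            }
        }
    },
    "mappings": {
        # ... the same properties as above, with
        # "analyzer": "english_stop" on the stemmed subfields
    }
}

Stopwords configured this way apply at index time and, by default, to query-time analysis of those same fields, so the search section needs no extra changes.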
I have an index with the following mapping:
{
    "mappings": {
        "my_stuff_type": {
            "properties": {
                "location": {
                    "type": "geo_point",
                    "null_value": -1
                }
            }
        }
    }
}
I have to use the property null_value because some of my documents don't have information about their location (latitude/longitude), but I still would like to search by distance on a location, cf. here: https://www.elastic.co/guide/en/elasticsearch/reference/current/null-value.html
When checking the index mapping details, I can verify that the geo mapping is there:
curl -XGET http://localhost:9200/my_stuff_index/_mapping | jq '.my_stuff_index.mappings.my_stuff_type.properties.location'
{
    "properties": {
        "lat": {
            "type": "float"
        },
        "lon": {
            "type": "float"
        }
    }
}
However when trying to search for documents on that index using a geo distance filter (cf. https://www.elastic.co/guide/en/elasticsearch/guide/current/geo-distance.html), then I see this:
curl -XPOST http://localhost:9200/my_stuff_index/_search -d'
{
    "query": {
        "bool": {
            "filter": {
                "geo_distance": {
                    "location": {
                        "lat": <PUT_LATITUDE_FLOAT_HERE>,
                        "lon": <PUT_LONGITUDE_FLOAT_HERE>
                    },
                    "distance": "200m"
                }
            }
        }
    }
}' | jq
{
    "error": {
        "root_cause": [
            {
                "type": "query_shard_exception",
                "reason": "failed to find geo_point field [location]",
                "index_uuid": "mO94yEsHQseQDFPkHjM6tA",
                "index": "my_stuff_index"
            }
        ],
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "phase": "query",
        "grouped": true,
        "failed_shards": [
            {
                "shard": 0,
                "index": "my_stuff_index",
                "node": "MDueSn31TS2z0Lamo64zbw",
                "reason": {
                    "type": "query_shard_exception",
                    "reason": "failed to find geo_point field [location]",
                    "index_uuid": "mO94yEsHQseQDFPkHjM6tA",
                    "index": "my_stuff_index"
                }
            }
        ],
        "caused_by": {
            "type": "query_shard_exception",
            "reason": "failed to find geo_point field [location]",
            "index_uuid": "mO94yEsHQseQDFPkHjM6tA",
            "index": "my_stuff_index"
        }
    },
    "status": 400
}
I think the null_value property should allow me to insert documents without that location field, and at the same time I should be able to search with filters on that same "optional" field.
Why am I not able to filter on that "optional" field? How could I do this?
Edit:
To reproduce this issue with Python, run the following code snippet before performing the curl/jq operations from the command line.
The Python code depends on: pip install elasticsearch==5.4.0.
from elasticsearch import Elasticsearch
from elasticsearch import helpers
my_docs = [
    {"xyz": "foo", "location": {"lat": 0.0, "lon": 0.0}},
    {"xyz": "bar", "location": {"lat": 50.0, "lon": 50.0}}
]

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

index_mapping = '''
{
    "mappings": {
        "my_stuff_type": {
            "properties": {
                "location": {
                    "type": "geo_point",
                    "null_value": -1.0
                }
            }
        }
    }
}'''

es.indices.create(index='my_stuff_index', ignore=400, body=index_mapping)
helpers.bulk(es, my_docs, index='my_stuff_index', doc_type='my_stuff_type')
As @Val has said, you should change your mapping. If you define the location field in this way:
"location": {
    "type": "geo_point"
}
you can index lat and lon as two different subfields - without declaring them in the mapping, as I have shown - as described in the documentation - look here
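Tying that back to the reproduction script above, a sketch of recreating the index with the corrected mapping (the index and type names are reused from the question; the delete just clears the old, broken mapping):

from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Drop the old index (ignore a 404 if it does not exist yet).
es.indices.delete(index='my_stuff_index', ignore=404)

# Recreate it with a plain geo_point mapping, without null_value.
es.indices.create(index='my_stuff_index', body='''
{
    "mappings": {
        "my_stuff_type": {
            "properties": {
                "location": {
                    "type": "geo_point"
                }
            }
        }
    }
}''')

After reindexing the documents, the geo_distance filter from the question should find the location field.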