MongoDB: how can I structure this data? - python

Basically, I am designing and developing a Python application that runs each night: it takes a website and a list of keywords and queries the Google API to obtain the site's position for each keyword.
I want to use a NoSQL approach, and the document model that MongoDB offers seems like the best fit; however, I'm confused about how to structure the data inside the database.
Each night new data will be generated, containing 50 keywords and their positions. I presume this will be stored inside its own document and will be identifiable by a specific URL.
Will it therefore be possible to query the database given a URL over a date range of, say, the past 30 or 60 days? I'm confused about whether I will be able to fetch all of the documents back.

The main requirement for the structure is the ability to query on a daily basis.
So let's say we have a website, www.stackoverflow.com, and our X keywords.
The basic document shape could look like this:
{
    _id : ObjectId, // the ObjectId embeds a creation timestamp
    www : "www.stackoverflow.com",
    rankings : [{
            "key1" : "val1"
        }, {
            "key2" : "val2"
        }
    ]
}
Then, if we want to see the ranking history for key1, we can query with the aggregation framework:
db.ranking.aggregate([
    { $unwind : "$rankings" },
    { $match : { "rankings.key1" : { $exists : true } } }
])
and the response will be similar to:
{
    "_id" : ObjectId("584dbe04f4ce077869fee3dc"),
    "www" : "www.stackoverflow.com",
    "rankings" : {
        "key1" : "val1"
    }
},
{
    "_id" : ObjectId("584dbe07f4ce077869fee3dd"),
    "www" : "www.stackoverflow.com",
    "rankings" : {
        "key1" : "val1"
    }
}
Read up on grouping in the aggregation framework to uncover the power of Mongo!
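To answer the date-range part of the question: since an ObjectId embeds its creation time, you can filter on the last 30 or 60 days without storing a separate date field. Below is a minimal PyMongo sketch of that idea; the connection details, database/collection names, and the 30-day window are assumptions.

from datetime import datetime, timedelta, timezone

from bson.objectid import ObjectId
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
ranking = client["mydb"]["ranking"]  # hypothetical database/collection names

# ObjectIds sort by creation time, so an _id range works as a date range.
cutoff = datetime.now(timezone.utc) - timedelta(days=30)
min_id = ObjectId.from_datetime(cutoff)

docs = ranking.find({
    "www": "www.stackoverflow.com",
    "_id": {"$gte": min_id},
})
for doc in docs:
    print(doc["_id"].generation_time, doc["rankings"])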

How to ensure all data is captured in ES API?

I am trying to create an API in Python to pull the data from ES and feed it into a data warehouse. The data is live and is being filled every second, so I am going to create a near-real-time pipeline.
The current URL format is {{url}}/{{index}}/_search and the test payload I am sending is:
{
    "from" : 0,
    "size" : 5
}
On the next refresh it will pull using payload:
{
    "from" : 6,
    "size" : 5
}
And so on until it reaches the total number of records. The PROD environment has about 250M rows, and I'll set the size to 10K per extract.
I am worried, though, because I don't know whether the records are being reordered within ES. Currently there is a plugin which uses a timestamp generated by the user, but that is flawed: sometimes documents are skipped due to a delay in the JSONs being made available for extract in ES, and possibly due to the way the time is generated.
Does anyone know what the default sorting is when pulling the data using /_search?
I suppose what you're looking for is a streaming/changes API, which is nicely described by @Val here and is also an open feature request.
In the meantime, you cannot really count on the size and from parameters -- you could probably make redundant queries and handle the duplicates before they reach your data warehouse.
Another option would be to skip ES in this regard and stream directly to the warehouse. What I mean is: take an ES snapshot up until a given time once (so you keep the historical data), feed it to the warehouse, and then stream directly from wherever you're getting your data into the warehouse.
Addendum
AFAIK the default sorting is by insertion date, but there's no internal _insertTime or similar. You can use cursors -- this is called scrolling, and here's a py implementation. Note that it goes from the 'latest' doc to the 'first', not vice versa, so it'll give you all the existing docs, but I'm not so sure about docs newly added while you were scrolling. You'd then want to run the scroll again, which is suboptimal.
You could also pre-sort your index, which should work quite nicely for your use case when combined with scrolling.
Thanks for the responses. After discussing it with my colleagues, we decided to use the _ingest API instead to create a pipeline in ES which inserts the server-side ingestion date on each document.
Steps:
Create the timestamp pipeline:
PUT _ingest/pipeline/timestamp_pipeline
{
    "description" : "Inserts timestamp field for all documents",
    "processors" : [
        {
            "set" : {
                "field" : "insert_date",
                "value" : "{{_ingest.timestamp}}"
            }
        }
    ]
}
Update the indexes to add the new default pipeline:
PUT /*/_settings
{
    "index" : {
        "default_pipeline" : "timestamp_pipeline"
    }
}
In Python I then use the _scroll API like so:
from elasticsearch import Elasticsearch

es = Elasticsearch(cfg.esUrl, port=cfg.esPort, timeout=200)

# Only fetch documents ingested since the last successful run.
doc = {
    "query": {
        "range": {
            "insert_date": {
                "gte": lastRowDateOffset
            }
        }
    }
}

res = es.search(
    index=Index,
    sort="insert_date:asc",   # oldest first, so pages arrive in ingestion order
    scroll="2m",              # keep the scroll context alive between pages
    size=NumberOfResultsPerPage,
    body=doc
)
Where lastRowDateOffset is the date of the last run
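The search call above only returns the first page; the remaining pages come from the scroll cursor. A minimal sketch of draining them (error handling and persisting lastRowDateOffset are omitted, and process() is a hypothetical handler):

scroll_id = res["_scroll_id"]
hits = res["hits"]["hits"]

while hits:
    for hit in hits:
        process(hit["_source"])  # hypothetical: load one row into the warehouse

    # Fetch the next page and renew the scroll context for another 2 minutes.
    res = es.scroll(scroll_id=scroll_id, scroll="2m")
    scroll_id = res["_scroll_id"]
    hits = res["hits"]["hits"]

# Free the scroll context once done.
es.clear_scroll(scroll_id=scroll_id)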

MongoDB PyMongo Listing all keys in a document

I have a question about how to get a document in PyMongo to list all of its current keys, and I'm not quite sure how to do it. For example, if I had a document that looked like this:
{
    "_id" : ObjectId("..."),
    "name" : "ABCD",
    "info" : {
        "description" : "XYZ",
        "type" : "QPR"
    }
}
and I had a variable document that held this document as its value, how could I write code to print its three keys:
"_id"
"name"
"info"
I don't want it to list the values, simply the names. The motivation for this is that the user would type one of the names and my program would do additional things after that.
As mentioned in the documentation:
In PyMongo we use dictionaries to represent documents.
So you can get all keys using .keys():
print(document.keys())
Using Python, we can fetch all the documents into a variable mydoc:
mydoc = collection.find()
for x in mydoc:
    keys = list(x.keys())
    print(keys)
This gives us the keys of each document as a list, which we can then use for whatever the user needs.
The document is a Python dictionary, so you can just print its keys, e.g.:
document = db.collection_name.find_one()
for k in document:
    print(k)
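Since the stated motivation was to let the user type one of the key names, here is a small sketch of that flow (the prompt wording and follow-up action are made up):

document = db.collection_name.find_one()
keys = list(document.keys())
print("Available fields:", ", ".join(keys))

choice = input("Type a field name: ")  # e.g. "_id", "name" or "info"
if choice in keys:
    print(choice, "->", document[choice])
else:
    print("Unknown field:", choice)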

MongoDB: update by replacing the document while using an update operator ($currentDate) [pymongo]

I'm using pymongo in Python.
I have a MongoDB document like this:
{u'_id': ObjectId('55110d55a5bd910f2513fc91'), u'ghi': u'jkl'}
I want to update the document by replacing it:
db['table_name'].update({'ghi':'jkl'},{'ghio':'jkl'}, True)
The problem is that I want to use $currentDate along with the update query, as I'm required to store the update time with the document. How do I do that?
This is what I've tried so far:
db['table_name'].update({'ghi':'jkl'},{'$set':{'ghik':'jkl'}, '$currentDate':{'date':True}}, True)
The issue with the above code is that I do not want to use $set, as it will retain the other fields, which I do not require.
db['table_name'].update({'ghi':'jkl'},{'$set':{'ghik':'jkl'}, '$unset':{'ghi':True}, '$currentDate':{'date':True}}, True)
The above code works, but I would like to know if there is a better way to do it.
$currentDate only works with update operators like $set, not with a full-document replacement. You can use the $unset update as you pointed out, although this only wipes out fields you specifically name. Alternatively, you can set the timestamp client-side:
db.test.update({ "ghi" : "jkl" }, { "ghio" : "jkl", "date" : datetime.today() })
or you can do two updates
db.test.update({ "ghi" : "jkl" }, { "ghio" : "jkl" })
db.test.update({ "ghio" : "jkl" }, { "$currentDate" : { "date" : true } })
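In PyMongo terms, the client-side-timestamp variant could look like this (a sketch; the connection details are assumptions, and note that a replacement document drops every field not listed):

from datetime import datetime, timezone

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mydb"]  # hypothetical connection

# Replace the whole document and stamp the update time client-side.
db.test.replace_one(
    {"ghi": "jkl"},
    {"ghio": "jkl", "date": datetime.now(timezone.utc)},
    upsert=True,
)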

How to add data to a (heavily) nested collection in MongoDB

I am a noob at Python and MongoDB and would really appreciate your help with my problem. My collection in MongoDB looks like this:
{
    "Segments" : [
        {
            "Devices" : [
                {
                    "IP" : "",
                    "Interfaces" : [
                        {
                            "Name" : ""
                        }
                    ],
                    "DeviceName" : ""
                }
            ],
            "SegmentName" : ""
        }
    ]
}
I have an object like so:
Node Details: {'node:98a': ['Sweden', 'Stockholm', '98a-3470'], 'node:98b': ['Denmark', 'Copenhagen', '98b-3471', '98b-3472']}
I need to update the 'Name' values within the 'Interfaces' part of the collection above with values from the Node Details dictionary. I have tried using $set, $addToSet, $push, etc., but nothing is helping. I have already added the Segment and DeviceName information.
The output should be as follows:
{
    "Segments" : [
        {
            "Devices" : [
                {
                    "Interfaces" : [
                        { "Name" : "98a-3470" }
                    ],
                    "DeviceName" : "node:98a"
                },
                {
                    "Interfaces" : [
                        { "Name" : "98b-3471" },
                        { "Name" : "98b-3472" }
                    ],
                    "DeviceName" : "node:98b"
                }
            ],
            "SegmentName" : "segmentA"
        }
    ]
}
Any help would be greatly appreciated. I have tried a lot in the MongoDB shell and also on Google, but to no avail. Thank you all.
Regards,
trupsster
[[ EDITED ]]
Okay, here is what I have got so far after continuing to poke around after posting the question. I used the following query in the MongoDB shell:
db.test.mycoll.update({'Segments.SegmentName':'segmentA','Segments.Devices.Name':'node:98a'}, {$set: {"Segments.$.Devices.0.Interfaces.Name" : "98b-3470"}})
Now this inserted the value in the correct place as per my 'schema', but when I try to add the second interface, it simply replaces the earlier one. I tried using $push (it complained about the target not being an array) and $addToSet (which showed another error), but neither helped. Can you please help me from this point on?
Thanks,
trupsster
[[ Edited again ]]
I found the solution! Here is what I did:
To add an interface to an existing device:
db.test.mycoll.update({'Segments.SegmentName':'segmentA','Segments.Devices.Name':'node:98a'}, {$addToSet: {"Segments.$.Devices.0.Interfaces.Name" : "98a-3471"}})
Now, to append a new dict with a 'Name' to the 'Interfaces' array:
db.test.mycoll.update({'Segments.SegmentName':'segmentA','Segments.Devices.Name':'node:98a'}, {$addToSet: {"Segments.$.Devices.0.Interfaces" : {"Name" : "98a-3472"}}})
As you can see, I used $addToSet.
Now, the next step was to add the same information (with different values) to the 2nd device, which was done like so:
db.test.mycoll.update({'Segments.SegmentName':'segmentA','Segments.Devices.Name':'node:98b'}, {$addToSet: {"Segments.$.Devices.1.Interfaces" : {"Name" : "98b-3473"}}})
So that was it! I am so chuffed with myself! Thank you all who took the time to read my problem. I hope my solution helps someone.
Regards,
trupsster
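For anyone wanting to drive this from Python rather than the shell, here is a rough PyMongo sketch of the same idea, looping over the Node Details dictionary (the connection details and the device-index mapping are assumptions):

from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["test"]["mycoll"]  # hypothetical connection

node_details = {
    "node:98a": ["Sweden", "Stockholm", "98a-3470"],
    "node:98b": ["Denmark", "Copenhagen", "98b-3471", "98b-3472"],
}

# Device positions within the Devices array, mirroring the .0 / .1 paths above.
device_index = {"node:98a": 0, "node:98b": 1}

for device, details in node_details.items():
    idx = device_index[device]
    for name in details[2:]:  # interface names start at the third element
        coll.update_one(
            {"Segments.SegmentName": "segmentA"},
            {"$addToSet": {f"Segments.$.Devices.{idx}.Interfaces": {"Name": name}}},
        )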
You did not say what you actually tried. To access a sub-document inside an array, you need to use dot notation with numeric indices. So to address the Name field in your example:
Segments.Devices.0.Interfaces.0.Name
Did you try that? Does it work?

pymongo update through $pull

I have a mongo document:
{ "_id" : 0, "name" : "Vasya", "fav" : [ { "type" : "t1", "weight" : 1.4163 }, { "type" : "t2", "weight" : 11.7772 }, { "type" : "t2", "weight" : 6.4615 }, { "type" : "homework", "score" : 35.8742 } ] }
To delete the lowest element in the array "fav", I use the following Python code:
db.people.update({"fav": {"type": "t2", "weight": lowest}}, {"$pull": {"fav": {"type": "t2", "weight": lowest}}})
where the variable lowest holds the lowest value between 6.4615 and 35.8742.
The problem is that this code does nothing: there are no errors, but the values are not deleted from the array. Yet if I run the same code in the mongo shell, the result is positive.
Unfortunately my experience with pymongo and mongo is not so good, so if someone knows what the problem is, that would be great.
The syntax works fine for me in the Mongo shell and with pymongo, so as suspected the issue is the precision of floating-point numbers.
I don't know how you are deriving/computing lowest, but you may want to standardize on a maximum number of significant digits after the decimal point, or maybe even have a function that normalizes your floats to the same precision, both when you originally save documents and when you later query or update them.
Neither Mongo nor Python considers 6.676176060654615 to be equal to 6.67617606065, which explains why your update has no effect.
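A minimal sketch of that normalization idea (the four-decimal precision and connection details are arbitrary assumptions):

from pymongo import MongoClient

def normalize(weight, places=4):
    # Round to a fixed precision so saved and queried values compare equal.
    return round(weight, places)

db = MongoClient("mongodb://localhost:27017")["mydb"]  # hypothetical connection
lowest = normalize(6.461538461538462)  # normalize however `lowest` was computed

# Apply the same normalization on write and on update.
db.people.update_one(
    {"fav": {"$elemMatch": {"type": "t2", "weight": lowest}}},
    {"$pull": {"fav": {"type": "t2", "weight": lowest}}},
)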
