Parsing key:value pairs in a list - python

I have inherited a Mongo structure with key:value pairs within an array. I need to extract the collected and spent values from the tags below, but I don't see an easy way to do this using the $regex commands in the MongoDB query documentation.
{
    "_id" : "94204a81-9540-4ba8-bb93-fc5475c278dc",
    "tags" : ["collected:172", "donuts_used:1", "spent:150"]
}
The ideal output would dump the extracted values into the format below when querying with pymongo. I don't know how best to return only the values I need. Please advise.
94204a81-9540-4ba8-bb93-fc5475c278dc, 172, 150

print d['_id'], ' '.join([x.replace('collected:', '').replace('spent:', '')
                          for x in d['tags'] if 'collected' in x or 'spent' in x])
>>>
94204a81-9540-4ba8-bb93-fc5475c278dc 172 150

In case you are having a hard time writing the Mongo query (the elements inside your list are plain strings rather than key:value pairs, so they require parsing), here is a solution in plain Python that might be helpful.
>>> import pymongo
>>> from pymongo import MongoClient
>>> client = MongoClient('localhost', 27017)
>>> db = client['test']
>>> collection = db['stackoverflow']
>>> collection.find_one()
{u'_id': u'94204a81-9540-4ba8-bb93-fc5475c278dc', u'tags': [u'collected:172', u'donuts_used:1', u'spent:150']}
>>> record = collection.find_one()
>>> print record['_id'], record['tags'][0].split(':')[-1], record['tags'][2].split(':')[-1]
94204a81-9540-4ba8-bb93-fc5475c278dc 172 150
Instead of using find_one(), you can retrieve all the records with the appropriate function and loop through every record. I am not sure how consistent your data is, so I hard-coded the first and third elements in the list; you may want to tweak that part and add a try/except at the record level.
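As a sketch of that suggestion: loop over the records with a per-record try/except, parsing each "key:value" tag into a dict so list positions no longer matter (the split-on-':' logic is an assumption about the tag format; the sample records below stand in for `collection.find()`):

```python
def extract(record):
    # Parse "key:value" tag strings into a dict so the code does not
    # rely on list positions (assumes every tag contains a ':').
    parsed = dict(tag.split(":", 1) for tag in record["tags"])
    return record["_id"], parsed["collected"], parsed["spent"]

# Stand-ins for `for record in collection.find():`
records = [
    {"_id": "94204a81-9540-4ba8-bb93-fc5475c278dc",
     "tags": ["collected:172", "donuts_used:1", "spent:150"]},
    {"_id": "some-other-id", "tags": ["donuts_used:2"]},
]

rows = []
for record in records:
    try:
        rows.append(extract(record))
    except KeyError:
        pass  # skip records missing 'collected' or 'spent'

print(rows)
```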

Here is one way to do it, if all you had was that sample JSON object.
Pay attention to the note about the ordering of tags. It is probably best to revise your "schema" so that you can more easily query, collect and aggregate your "tags", as you call them.
import re

# Returns a CSV string of _id, collected, spent
def parse(obj):
    _id = obj["_id"]
    # This is terribly brittle, since the insertion of any other type of tag
    # between 'collected' and 'spent' will throw these indices off.
    # It is probably much better to query these directly, or store them as
    # individual fields in your mongo "schema".
    collected = re.sub(r"collected:(\d+)", r"\1", obj["tags"][0])
    spent = re.sub(r"spent:(\d+)", r"\1", obj["tags"][2])
    return ", ".join([_id, collected, spent])

# A sample object
parse_me = {
    "_id" : "94204a81-9540-4ba8-bb93-fc5475c278dc",
    "tags" : ["collected:172", "donuts_used:1", "spent:150"]
}

print parse(parse_me)
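A slightly less brittle variant, as a sketch: search every tag with re.search instead of indexing into fixed positions, so inserting another tag between 'collected' and 'spent' no longer breaks the parse (this assumes exactly one 'collected' and one 'spent' tag per document):

```python
import re

def parse(obj):
    # Scan all tags for the pattern rather than trusting list positions.
    joined = " ".join(obj["tags"])
    collected = re.search(r"collected:(\d+)", joined).group(1)
    spent = re.search(r"spent:(\d+)", joined).group(1)
    return ", ".join([obj["_id"], collected, spent])

parse_me = {
    "_id": "94204a81-9540-4ba8-bb93-fc5475c278dc",
    "tags": ["donuts_used:1", "spent:150", "collected:172"],  # shuffled order
}
print(parse(parse_me))  # 94204a81-9540-4ba8-bb93-fc5475c278dc, 172, 150
```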

Related

How do you iterate over a set or a list in Flask and PyMongo?

I have produced a set of matching IDs from a database collection that looks like this:
{ObjectId('5feafffbb4cf9e627842b1d9'), ObjectId('5feaffcfb4cf9e627842b1d8'), ObjectId('5feb247f1bb7a1297060342e')}
Each ObjectId represents an ID on a collection in the DB.
I got that list by doing this: (which incidentally I also think I am doing wrong, but I don't yet know another way)
# Find all question IDs
question_list = list(mongo.db.questions.find())
all_questions = []
for x in question_list:
    all_questions.append(x["_id"])

# Find all con IDs that match the question IDs
con_id = list(mongo.db.cons.find())
con_id_match = []
for y in con_id:
    con_id_match.append(y["question_id"])

matches = set(con_id_match).intersection(all_questions)
print("matches", matches)
print("all_questions", all_questions)
print("con_id_match", con_id_match)
And that brings up all the IDs that are associated with a match such as the three at the top of this post. I will show what each print prints at the bottom of this post.
Now I want to get each ObjectId separately as a variable so I can search for these in the collection.
mongo.db.cons.find_one({"con": matches})
Where matches (will probably need to be a new variable) will be one of each ObjectId's that match the DB reference.
So, how do I separate the ObjectIds in matches so I get one at a time when iterating? I tried a for loop, but it threw an error; I guess I am writing it wrong for a set. Thanks for the help.
Print Statements:
**matches** {ObjectId('5feafffbb4cf9e627842b1d9'), ObjectId('5feaffcfb4cf9e627842b1d8'), ObjectId('5feb247f1bb7a1297060342e')}
**all_questions** [ObjectId('5feafb52ae1b389f59423a91'), ObjectId('5feafb64ae1b389f59423a92'), ObjectId('5feaffcfb4cf9e627842b1d8'), ObjectId('5feafffbb4cf9e627842b1d9'), ObjectId('5feb247f1bb7a1297060342e'), ObjectId('6009b6e42b74a187c02ba9d7'), ObjectId('6010822e08050e32c64f2975'), ObjectId('601d125b3c4d9705f3a9720d')]
**con_id_match** [ObjectId('5feb247f1bb7a1297060342e'), ObjectId('5feafffbb4cf9e627842b1d9'), ObjectId('5feaffcfb4cf9e627842b1d8')]
Usually you can just use the find method, which yields documents one by one, and filter documents while iterating in Python like this:
# fetch only ids
question_ids = {question['_id'] for question in mongo.db.questions.find({}, {'_id': 1})}

matches = []
for con in mongo.db.cons.find():
    con_id = con['question_id']
    if con_id in question_ids:
        matches.append(con_id)
        # you can process the matched and already-loaded con here

print(matches)
If you have a huge amount of data, you can take a look at the aggregation framework.
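For instance, the matching above could be pushed to the server with a $lookup stage joining cons to questions (a sketch using the collection and field names from the question; the aggregate call itself is left commented out since it needs a live connection):

```python
# Pipeline: keep only cons whose question_id matches a question _id,
# then project just that id.
pipeline = [
    {"$lookup": {
        "from": "questions",          # join cons -> questions
        "localField": "question_id",
        "foreignField": "_id",
        "as": "question",
    }},
    {"$match": {"question": {"$ne": []}}},  # drop cons with no match
    {"$project": {"question_id": 1}},
]
# matches = [doc["question_id"] for doc in mongo.db.cons.aggregate(pipeline)]
```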

Inserting documents in MongoDB in specific order using pymongo

I have to insert documents in MongoDB in a left-shift manner, i.e. if the collection contains 60 documents, I remove the first document and insert the new document at the rear. But when I insert the 61st element and onward, the documents end up in random positions.
Is there any way I can insert the documents in the order that I specified above?
Or do I have to do this processing when I am retrieving the values from the database? If yes then how?
The data format is :
data = {"time": "10:14:23",  # timestamp
        "stats": [<list of dictionaries>]}
The code I am using is
from pymongo import MongoClient
db = MongoClient().test
db.timestamp.delete_one({"_id":db.timestamp.find()[0]["_id"]})
db.timestamp.insert_one(new_data)
Here, timestamp is the name of the collection.
Edit: Changed the code. Is there any better way?
from pymongo.operations import InsertOne, DeleteOne

def save(collection, data, cap=60):
    if collection.count() == cap:
        top_doc_time = min(doc['time'] for doc in collection.find())
        collection.delete_one({'time': top_doc_time})
    collection.insert_one(data)
A bulk write operation guarantees query ordering by default.
This means that the queries are executed sequentially.
from pymongo.operations import DeleteOne, InsertOne

def left_shift_insert(collection, doc, cap=60):
    ops = []
    variance = max((collection.count() + 1) - cap, 0)
    delete_ops = [DeleteOne({})] * variance
    ops.extend(delete_ops)
    ops.append(InsertOne(doc))
    return collection.bulk_write(ops)

left_shift_insert(db.timestamp, new_data)
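On the retrieval side of the original question: MongoDB does not guarantee that documents come back in insertion order, so if FIFO behaviour is only needed when reading, sorting on the existing time field is enough (a sketch; it relies on zero-padded "HH:MM:SS" strings, which compare correctly as plain strings):

```python
# With a live connection this would be:
# oldest_first = list(db.timestamp.find().sort("time", 1))

# Why lexicographic sorting works for zero-padded "HH:MM:SS" strings:
times = ["10:14:23", "09:59:59", "10:14:22"]
print(sorted(times))
```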

How do I get a list of just the ObjectId's using pymongo?

I have the following code:
client = MongoClient()
data_base = client.hkpr_restore
agents_collection = data_base.agents
agent_ids = agents_collection.find({},{"_id":1})
This gives me a result of:
{u'_id': ObjectId('553020a8bf2e4e7a438b46d9')}
{u'_id': ObjectId('553020a8bf2e4e7a438b46da')}
{u'_id': ObjectId('553020a8bf2e4e7a438b46db')}
How do I just get at the ObjectId's so I can then use each ID to search another collection?
Use distinct
In [27]: agent_ids = agents_collection.distinct('_id')
In [28]: agent_ids
Out[28]:
[ObjectId('553662940acf450bef638e6d'),
ObjectId('553662940acf450bef638e6e'),
ObjectId('553662940acf450bef638e6f')]
In [29]: agent_id2 = [str(id) for id in agents_collection.distinct('_id')]
In [30]: agent_id2
Out[30]:
['553662940acf450bef638e6d',
'553662940acf450bef638e6e',
'553662940acf450bef638e6f']
I solved the problem by following this answer.
Add a hint to the find call, then simply iterate through the returned cursor.
db.c.find({}, {_id: 1}).hint({_id: 1});
I am guessing that without the hint, the cursor fetches the whole document back when iterated, making the iteration extremely slow. With the hint, the cursor returns only the ObjectId, and the iteration finishes very quickly.
For background, I am working on an ETL job that requires syncing one Mongo collection to another while modifying the data by some criteria. The total number of ObjectIds is around 100,000,000.
I tried using distinct but got the following error:
Error in : distinct too big, 16mb cap
I tried using aggregation with $group, as answered in other similar questions, only to hit a memory-consumption error.
Try creating a list comprehension with just the _ids as follows:
>>> client = MongoClient()
>>> data_base = client.hkpr_restore
>>> agents_collection = data_base.agents
>>> result = agents_collection.find({},{"_id":1})
>>> agent_ids = [x["_id"] for x in result]
>>>
>>> print agent_ids
[ ObjectId('553020a8bf2e4e7a438b46d9'), ObjectId('553020a8bf2e4e7a438b46da'), ObjectId('553020a8bf2e4e7a438b46db')]
>>>
I would like to add something more general than querying for all _ids.
import bson
[...]
results = agents_collection.find({})
objects = [v for result in results for k, v in result.items()
           if isinstance(v, bson.objectid.ObjectId)]
Context: saving objects in gridfs creates ObjectIds, to retrieve all of them for further querying, this function helped me out.
Although I wasn't searching for the _id, I was extracting another field. I found this method was fast (assuming you have an index on the field):
field_values = {x["MY_FIELD"] for x in db.col.find({}, {"_id": 0, "MY_FIELD": 1}).hint("MY_FIELDIdx")}  # a set, so duplicates collapse
Where MY_FIELDIdx is the name of the index for the field I'm trying to extract.

How to use ResultSet in PyES

I'm using PyES to use ElasticSearch in Python.
Typically, I build my queries in the following format:
# Create connection to server.
conn = ES('127.0.0.1:9200')
# Create a filter to select documents with 'stuff' in the title.
myFilter = TermFilter("title", "stuff")
# Create query.
q = FilteredQuery(MatchAllQuery(), myFilter).search()
# Execute the query.
results = conn.search(query=q, indices=['my-index'])
print type(results)
# > <class 'pyes.es.ResultSet'>
And this works perfectly. My problem begins when the query returns a large list of documents.
Converting the results to a list of dictionaries is computationally demanding, so I'm trying to return the query results already in a dictionary. I came across with this documentation:
http://pyes.readthedocs.org/en/latest/faq.html#id3
http://pyes.readthedocs.org/en/latest/references/pyes.es.html#pyes.es.ResultSet
https://github.com/aparo/pyes/blob/master/pyes/es.py (line 1304)
But I can't figure out what exactly I'm supposed to do.
Based on the previous links, I've tried this:
from pyes import *
from pyes.query import *
from pyes.es import ResultSet
from pyes.connection import connect
# Create connection to server.
c = connect(servers=['127.0.0.1:9200'])
# Create a filter to select documents with 'stuff' in the title.
myFilter = TermFilter("title", "stuff")
# Create query / Search object.
q = FilteredQuery(MatchAllQuery(), myFilter).search()
# (How to) create the model ?
mymodel = lambda x, y: y
# Execute the query.
# class pyes.es.ResultSet(connection, search, indices=None, doc_types=None,
# query_params=None, auto_fix_keys=False, auto_clean_highlight=False, model=None)
resSet = ResultSet(connection=c, search=q, indices=['my-index'], model=mymodel)
# > resSet = ResultSet(connection=c, search=q, indices=['my-index'], model=mymodel)
# > TypeError: __init__() got an unexpected keyword argument 'search'
Anyone was able to get a dict from the ResultSet?
Any good suggestion for efficiently converting the ResultSet to a (list of) dictionaries would be appreciated too.
I tried many ways to cast a ResultSet directly to a dict, but got nothing. The best way I have found is to append the ResultSet's items to another list or dict; the ResultSet exposes each of its items as a dict.
Here is how I use:
# create a response dictionary
response = {"status_code": 200, "message": "Successful", "content": []}

# set the result set as the content of the response
response["content"] = [result for result in resultset]

# return a json object
return json.dumps(response)
It's not that complicated: just iterate over the result set, for example with a for loop:
for item in results:
    print item

MongoEngine query list for objects having properties starting with prefixes specified in a list

I need to query Mongo database for elements that have a certain property beginning with any prefix in the list. Now I have a piece of code like this:
query = mymodel(terms__term__in=query_terms)
This matches objects that have an item in the list "terms" whose StringField "term" occurs verbatim in the list "query_terms". What I want to achieve is matching objects that have an item in "terms" whose "term" begins with any prefix occurring in "query_terms". Is it possible to do this in one query, without storing every possible prefix of "term" in the database?
EDIT:
The solution below works great, but now I have to find objects with terms starting with every prefix on the list. I changed
query = reduce(lambda q1, q2: q1.__or__(q2),
               map(lambda prefix: Q(terms__term__startswith=prefix), prefixes))
to
query = reduce(lambda q1, q2: q1.__and__(q2),
               map(lambda prefix: Q(terms__term__startswith=prefix), prefixes))
but this does not work. I end up getting the following error:
InvalidQueryError: Duplicate query conditions: terms__term__startswith
Any ideas?
If you're querying a term for its value, you can filter the values that begin with a prefix like so:
MyModel.objects.filter(terms__term__startswith='foo')
If you need to filter for multiple prefixes you'll have to create Q objects for that:
MyModel.objects.filter(Q(terms__term__startswith='foo') | Q(terms__term__startswith='bar'))
If you need to create the query dynamically:
prefixes = ['foo', 'bar', 'baz']
query = reduce(lambda q1, q2: q1.__or__(q2),
               map(lambda prefix: Q(terms__term__startswith=prefix), prefixes))
MyModel.objects.filter(query)
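The same reduce can be written a little more idiomatically with operator.or_ instead of calling __or__ directly. The pattern works for anything supporting '|'; since Q comes from mongoengine, the fold is demonstrated here with plain sets, and the Q line is left as a commented sketch:

```python
from functools import reduce
from operator import or_

prefixes = ['foo', 'bar', 'baz']

# With mongoengine this would read:
# query = reduce(or_, (Q(terms__term__startswith=p) for p in prefixes))
# MyModel.objects.filter(query)

# The same fold demonstrated with sets, which also overload '|':
combined = reduce(or_, ({p} for p in prefixes))
print(combined == {'foo', 'bar', 'baz'})  # True
```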
You can use a regular expression like ^(prefix1|prefix2|...):
prefixes = [....]
regex = '^(%s)' % '|'.join(map(re.escape, prefixes))
docs = your_collection.find({'term': {'$regex': regex}})
Update: I didn't notice this question is about MongoEngine. The above is for pure pymongo; I don't know if MongoEngine allows this.
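The anchored alternation built above can be checked in plain Python before handing it to Mongo (re.escape also guards against prefixes containing regex metacharacters):

```python
import re

prefixes = ['foo', 'bar']
regex = '^(%s)' % '|'.join(map(re.escape, prefixes))

print(regex)                             # ^(foo|bar)
print(bool(re.match(regex, 'foobar')))   # True: starts with 'foo'
print(bool(re.match(regex, 'bazfoo')))   # False: 'foo' not at the start
```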
