I'm trying to write a generic method for querying one MongoDB collection (A) using ids from a separate MongoDB collection (B). Here's what I have so far:
def getOtherCollInfo(self, otherCollObj, queryField, outputField="_id"):
    selfIdList = self.getIds()  # gets a set of ids from the whole collection (B)
    returned_dict = {}
    for selfId in selfIdList:
        # otherCollObj is collection (A)
        curs_obj = otherCollObj.find({queryField: str(selfId)}).distinct(outputField)
        temp_list = []
        for obj in curs_obj:
            temp_list.append(obj)
        returned_dict[selfId] = temp_list
    return returned_dict
This works fine for collection A, where the query field is stored as a plain hex string:
542de00c763f4a7f558be12f
When attempting this method on a third collection (C), it fails, I think because the query field there is stored as an ObjectId rather than a hex string:
ObjectId('542de00c763f4a7f558be12f')
Is there a way to test the formatting of the id so I can make the method more generic?
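One way to make the method format-agnostic (a minimal sketch, assuming pymongo's bundled bson package; query_both_formats is a hypothetical helper, not part of the original code):
from bson.objectid import ObjectId

def query_both_formats(otherCollObj, queryField, selfId, outputField="_id"):
    # Try the id both as a plain hex string and, when valid, as an ObjectId;
    # $in matches whichever representation the collection actually stores.
    candidates = [str(selfId)]
    if ObjectId.is_valid(str(selfId)):
        candidates.append(ObjectId(str(selfId)))
    return otherCollObj.find({queryField: {"$in": candidates}}).distinct(outputField)
ObjectId.is_valid lets you test whether a given string is a well-formed 24-character hex id before constructing an ObjectId from it.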
I am reading data from nested json with this code:
data = json.loads(json_file.json)
for nodesUni in data["data"]["queryUnits"]['nodes']:
    try:
        tm = (nodesUni['sql']['busData'][0]['engine']['engType'])
    except:
        tm = ''
    try:
        to = (nodesUni['sql']['carData'][0]['engineData']['producer']['engName'])
    except:
        to = ''
    json_output_for_one_GU_owner = {
        "EngineType": tm,
        "EngineName": to,
    }
I am having an issue with a NoneType error (e.g. nodesUni['sql']['busData'][0]['engine']['engType'] doesn't exist at all because there is no data), so I am using try/except. But my code is more complex, and having a try/except for every value is crazy. Is there any other option for dealing with this?
Error: "TypeError: 'NoneType' object is not subscriptable"
This is non-trivial, as your requirement is to traverse the dictionaries without errors and get an empty string value in the end, all of that in a very simple expression like cascading [] operators.
First method
My approach is to add a hook when loading the json file, so it creates default dictionaries in an infinite way
import collections, json

def superdefaultdict():
    return collections.defaultdict(superdefaultdict)

def hook(s):
    c = superdefaultdict()
    c.update(s)
    return c

data = json.loads('{"foo":"bar"}', object_hook=hook)
print(data["x"][0]["zzz"])  # doesn't exist
print(data["foo"])          # exists
prints:
defaultdict(<function superdefaultdict at 0x000001ECEFA47160>, {})
bar
when accessing some combination of keys that don't exist (at any level), superdefaultdict recursively creates a defaultdict of itself (this is a nice pattern, you can read more about it in Is there a standard class for an infinitely nested defaultdict?), allowing any number of non-existing key levels.
Now the only drawback is that it returns a defaultdict(<function superdefaultdict at 0x000001ECEFA47160>, {}) which is ugly. So
print(data["x"][0]["zzz"] or "")
prints empty string if the dictionary is empty. That should suffice for your purpose.
Use it like this in your context:
def superdefaultdict():
    return collections.defaultdict(superdefaultdict)

def hook(s):
    c = superdefaultdict()
    c.update(s)
    return c

data = json.loads(json_file.json, object_hook=hook)
for nodesUni in data["data"]["queryUnits"]['nodes']:
    tm = nodesUni['sql']['busData'][0]['engine']['engType'] or ""
    to = nodesUni['sql']['carData'][0]['engineData']['producer']['engName'] or ""
Drawbacks:
It creates a lot of empty dictionaries in your data object. This shouldn't be a problem (except if you're very low on memory), as the object isn't dumped to a file afterwards (where the non-existent values would appear)
If a value already exists, trying to access it as a dictionary crashes the program
Also, if some value is 0 or an empty list, the or operator will pick "". This can be worked around with another wrapper that tests whether the object is an empty superdefaultdict instead, as sketched below. Less elegant, but doable.
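A minimal sketch of that wrapper (the helper name orempty is hypothetical, not from the original answer):
import collections

def superdefaultdict():
    return collections.defaultdict(superdefaultdict)

def orempty(value):
    # Return "" only for a missing key (an empty superdefaultdict); keep real values,
    # including falsy ones like 0 or [].
    if isinstance(value, collections.defaultdict) and not value:
        return ""
    return value

d = superdefaultdict()
d["count"] = 0
print(orempty(d["count"]))    # 0, preserved instead of being replaced by ""
print(orempty(d["missing"]))  # "" for a key that doesn't exist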
Second method
Convert the access of your successive dictionaries as a string (for instance just double quote your expression like "['sql']['busData'][0]['engine']['engType']", parse it, and loop on the keys to get the data. If there's an exception, stop and return an empty string.
import json, re

def get(key, data):
    key_parts = [x.strip("'") if x.startswith("'") else int(x)
                 for x in re.findall(r"\[([^\]]*)\]", key)]
    try:
        for k in key_parts:
            data = data[k]
        return data
    except (KeyError, IndexError, TypeError):
        return ""
testing with some simple data:
data = json.loads('{"foo":"bar","hello":{"a":12}}')
print(get("['sql']['busData'][0]['engine']['engType']",data))
print(get("['hello']['a']",data))
print(get("['hello']['a']['e']",data))
we get: an empty string (some keys are missing), 12 (the path is valid), and an empty string (we tried to traverse an existing value that isn't a dict).
The syntax could be simplified (ex: "sql"."busData".0."engine"."engType") but would still have to retain a way to differentiate keys (strings) from indices (integers), as sketched below.
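A minimal sketch of that simplified syntax (the dotted-path format and the get_dotted name are assumptions, not from the original answer; keys containing dots would need escaping):
import json

def get_dotted(key, data):
    # Quoted parts are dictionary keys, bare numbers are list indices.
    parts = [p.strip('"') if p.startswith('"') else int(p) for p in key.split('.')]
    try:
        for p in parts:
            data = data[p]
        return data
    except (KeyError, IndexError, TypeError, ValueError):
        return ""

data = json.loads('{"hello": {"a": [12]}}')
print(get_dotted('"hello"."a".0', data))      # 12
print(get_dotted('"sql"."busData".0', data))  # "" (missing path)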
The second approach is probably the most flexible one.
I'm trying to create a generator object for a list of records with data from a MySQL database, so I'm passing the MySQL cursor object to the function as a parameter.
My issue is that if the "if block" containing yield records is commented out, the cust_records function works perfectly fine, but if I uncomment that line, the function is not working.
I'm not sure if this is not the way to yield a list object in Python 3.
My code so far:
def cust_records(result_set):
    block_id = None
    records = []
    i = 0
    for row in result_set:
        records.append((row['id'], row, smaller_ids))
    if records:
        yield records
Note that as soon as a function body contains yield, calling the function no longer runs it; it returns a generator object whose body only executes when you iterate over it, which is why your function appears to do nothing once the yield is uncommented. The point of generators is lazy evaluation, so storing all records in a list and yielding the list makes no sense at all. If you want to retain lazy evaluation (which is IMHO preferable, especially if you have to work on arbitrary datasets that might get huge), you want to yield each record, i.e.:
def cust_records(result_set):
    for row in result_set:
        yield (row['id'], row, smaller_ids)

# then
def example():
    cursor.execute(<your_sql_query_here>)
    for record in cust_records(cursor):
        print(record)
else (if you really want to consume as much memory as possible) just make cust_records a plain function:
def cust_records(result_set):
    records = []
    for row in result_set:
        records.append((row['id'], row, smaller_ids))
    return records
I am trying to print the results of my query to the console with the code below, but it keeps giving me the object's memory address instead.
test = connection.execute('SELECT EXISTS(SELECT 1 FROM "my_table" WHERE Code = 08001)')
print(test)
Result that I get is - <sqlalchemy.engine.result.ResultProxy object at 0x000002108508B278>
How do I print out the value instead of the object?
Reason I am asking is because I want to use the value to do comparison tests.
EG:
if each_item not in test:
# do stuffs
As written, test is the result object wrapping the rows returned by your query, not the values themselves.
If you want something you can iterate over, you can use fetchall, for example. Like this:
test = connection.execute('SELECT EXISTS(SELECT 1 FROM "my_table" WHERE Code = 08001)').fetchall()
Then you can iterate over a list of rows, where each row contains all the fields. In your example you can access fields by their position; your query returns a single column, at position 0, so that's how you access it:
for row in test:
print(row[0])
test is an object containing the row values, so if the column's name is value, you can access it using test.value.
If you're looking for a more "convenient" way of doing this (like iterating through each column of test), you'd have to explicitly define such functions (either as methods of test or as separate functions designed to iterate through those rows).
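For a single-value query like this EXISTS check, the result object also offers scalar(), which returns the first column of the first row (a sketch, reusing the connection and query from the question):
# scalar() returns the first column of the first row, or None if there are no rows
exists = connection.execute(
    'SELECT EXISTS(SELECT 1 FROM "my_table" WHERE Code = 08001)'
).scalar()
if exists:
    # do stuffs
    print("a matching row exists")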
I have the following models:
class Member(models.Model):
    ref = models.CharField(max_length=200)
    # some other stuff

    def __str__(self):
        return self.ref

class Feature(models.Model):
    feature_id = models.BigIntegerField(default=0)
    members = models.ManyToManyField(Member)
    # some other stuff
A Member is basically just a pointer to a Feature. So let's say I have Features:
feature_id = 2, members = 1, 2
feature_id = 4
feature_id = 3
Then the members would be:
id = 1, ref = 4
id = 2, ref = 3
I want to find all of the Features which contain one or more Members from a list of "ok members." Currently my query looks like this:
# ndtmp is a query set of member-less Features which Members can point to
sids = [str(i) for i in list(ndtmp.values('feature_id'))]
# now make a query set that contains all rels and ways with at least one member with an id in sids
okmems = Member.objects.filter(ref__in=sids)
relsways = Feature.geoobjects.filter(members__in=okmems)
# now combine with nodes
op = relsways | ndtmp
This is enormously slow, and I'm not even sure if it's working. I've tried using print statements to debug, just to make sure anything is actually being parsed, and I get the following:
print(ndtmp.count())
>>> 12747
print(len(sids))
>>> 12747
print(okmems.count())
... and then the code just hangs for minutes, and eventually I quit it. I think that I just overcomplicated the query, but I'm not sure how best to simplify it. Should I:
Migrate Feature to use a CharField instead of a BigIntegerField? There is no real reason for me to use a BigIntegerField; I just did so because I was following a tutorial when I began this project. I tried a simple migration by just changing it in models.py, and I got a "numeric" value in the column in PostgreSQL with the format 'Decimal:( the id )', but there's probably some way around that which would force it to just shove the id into a string.
Use some feature of many-to-many fields which I don't know about to more efficiently check for matches?
Calculate the bounding box of each Feature and store it in another column so that I don't have to do this calculation every time I query the database (so just the single fixed cost of calculation upon Migration + the cost of calculating whenever I add a new Feature or modify an existing one)?
Or something else? In case it helps, this is for a server-side script for an ongoing OpenStreetMap related project of mine, and you can see the work in progress here.
EDIT - I think a much faster way to get ndids is like this:
ndids = ndtmp.values_list('feature_id', flat=True)
This works, producing a non-empty set of ids.
Unfortunately, I am still at a loss as to how to get okmems. I tried:
okmems = Member.objects.filter(ref__in=str(ndids))
But it returns an empty query set. And I can confirm that the ref values are correct, via the following test:
Member.objects.values('ref')[:1]
>>> [{'ref': '2286047272'}]
Feature.objects.filter(feature_id='2286047272').values('feature_id')[:1]
>>> [{'feature_id': '2286047272'}]
You should take a look at annotate:
okmems = Member.objects.annotate(
    feat_count=models.Count('feature')).filter(feat_count__gte=1)
relsways = Feature.geoobjects.filter(members__in=okmems)
Ultimately, I was wrong to set up the database using a numeric id in one table and a text-type id in the other. I am not very familiar with migrations yet, but at some point I'll have to take a deep dive into that world and figure out how to migrate my database to use numerics in both. For now, this works:
# ndtmp is a query set of member-less Features which Members can point to
# get the unique ids from ndtmp as strings
strids = ndtmp.extra(
    {'feature_id_str': "CAST(feature_id AS VARCHAR)"}
).order_by('-feature_id_str').values_list('feature_id_str', flat=True).distinct()
# find all members whose ref values can be found in strids
okmems = Member.objects.filter(ref__in=strids)
# find all features containing one or more members in the accepted members list
relsways = Feature.geoobjects.filter(members__in=okmems)
# combine that with my existing list of allowed member-less features
op = relsways | ndtmp
# prove that this set is not empty
op.count()
# takes about 10 seconds
>>> 8997148 # looks like it worked!
Basically, I am making a query set of feature_ids (numerics) and casting it to be a query set of text-type (varchar) field values. I am then using values_list to make it only contain these string id values, and then I am finding all of the members whose ref ids are in that list of allowed Features. Now I know which members are allowed, so I can filter out all the Features which contain one or more members in that allowed list. Finally, I combine this query set of allowed Features which contain members with ndtmp, my original query set of allowed Features which do not contain members.
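On Django 1.10+, the same cast can be expressed without extra() (a sketch, assuming the ndtmp query set from above; Cast and CharField are standard Django imports):
from django.db.models import CharField
from django.db.models.functions import Cast

# annotate each Feature with a text version of its numeric feature_id
strids = ndtmp.annotate(
    feature_id_str=Cast('feature_id', CharField())
).values_list('feature_id_str', flat=True).distinct()

okmems = Member.objects.filter(ref__in=strids)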
I have inherited a Mongo structure with key:value pairs within an array. I need to extract the collected and spent values in the below tags, however I don't see an easy way to do this using the $regex commands in the Mongo Query documentation.
{
    "_id" : "94204a81-9540-4ba8-bb93-fc5475c278dc",
    "tags" : ["collected:172", "donuts_used:1", "spent:150"]
}
The ideal output when querying these values with pymongo would be the format below. I really don't know how best to return only the values I need. Please advise.
94204a81-9540-4ba8-bb93-fc5475c278dc, 172, 150
print(d['_id'], ' '.join(x.replace('collected:', '').replace('spent:', '')
                         for x in d['tags'] if 'collected' in x or 'spent' in x))
>>>
94204a81-9540-4ba8-bb93-fc5475c278dc 172 150
In case you are having a hard time writing the mongo query (the elements inside your list are actually strings rather than key/value pairs, which requires parsing), here is a solution in plain Python that might be helpful.
>>> import pymongo
>>> from pymongo import MongoClient
>>> client = MongoClient('localhost', 27017)
>>> db = client['test']
>>> collection = db['stackoverflow']
>>> collection.find_one()
{'_id': '94204a81-9540-4ba8-bb93-fc5475c278dc', 'tags': ['collected:172', 'donuts_used:1', 'spent:150']}
>>> record = collection.find_one()
>>> print(record['_id'], record['tags'][0].split(':')[-1], record['tags'][2].split(':')[-1])
94204a81-9540-4ba8-bb93-fc5475c278dc 172 150
Instead of using find_one(), you can retrieve all the records with find() and loop through every record. I am not sure how consistent your data might be, so I hard-coded the first and third elements of the list; you may want to tweak that part and add a try/except at the record level, as sketched below.
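A sketch of that order-independent, record-level variant (parse_tags is a hypothetical helper, not from the original answer):
def parse_tags(record):
    # Split "key:value" tag strings into a dict, regardless of their order in the list.
    tags = dict(t.split(':', 1) for t in record.get('tags', []))
    return record['_id'], tags.get('collected', ''), tags.get('spent', '')

for record in collection.find():
    try:
        print(*parse_tags(record))
    except (KeyError, ValueError):
        continue  # skip malformed records instead of crashing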
Here is one way to do it, if all you had was that sample JSON object.
Please pay attention to the note about the ordering of tags etc. It is probably best to revise your "schema" so that you can more easily query, collect and aggregate your "tags" as you call them.
import re

# Returns a csv string of _id, collected, spent
def parse(obj):
    _id = obj["_id"]
    # This is terribly brittle since the insertion of any other type of tag
    # between 'c' and 's' will cause these indices to be messed up.
    # It is probably much better to directly query these, or store them as individual
    # entities in your mongo "schema".
    collected = re.sub(r"collected:(\d+)", r"\1", obj["tags"][0])
    spent = re.sub(r"spent:(\d+)", r"\1", obj["tags"][2])
    return ", ".join([_id, collected, spent])

# Some sample object
parse_me = {
    "_id": "94204a81-9540-4ba8-bb93-fc5475c278dc",
    "tags": ["collected:172", "donuts_used:1", "spent:150"]
}

print(parse(parse_me))