I'm using bulbflow (python) with Neo4j and I'm trying to add an index only on a subset of my keys (for now, simply keys named 'name' for optional index-based lookup).
I don't love the bulbflow Models (too restrictive) and I couldn't figure out how to do selective indexing without changing code since the 'autoindex' is a global setting -- I don't see how to configure it based on the key.
Has anyone done something like this?
-Andrew
You can disable Bulbs auto-indexing by setting g.config.autoindex to False.
See https://github.com/espeed/bulbs/blob/master/bulbs/config.py#L62
>>> from bulbs.neo4jserver import Graph
>>> g = Graph()
>>> g.config.autoindex = False
>>> g.vertices.create(name="James")
In the example above, the name property will not be indexed automatically.
Setting autoindex to False will switch to using the low-level client's create_vertex() method instead of the create_indexed_vertex() method:
See https://github.com/espeed/bulbs/blob/master/bulbs/neo4jserver/client.py#L422
The create_indexed_vertex() method has a keys arg, which you can use for selective indexing:
See https://github.com/espeed/bulbs/blob/master/bulbs/neo4jserver/client.py#L424
This is the low-level client method used by Bulbs models. You generally don't need to explicitly call the low-level client methods, but if you do, you can selectively index properties by including the property name in the keys arg.
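For illustration, here is a hedged sketch of calling the low-level client directly; the exact signature (a data dict, the index name, and the keys arg) is assumed from the linked client.py, and "vertex" is assumed to be the default vertex index name:
>>> from bulbs.neo4jserver import Graph
>>> g = Graph()
>>> data = dict(name="James", city="Dallas")
>>> # only "name" will be indexed; "city" is stored but not indexed
>>> resp = g.client.create_indexed_vertex(data, "vertex", keys=["name"])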
To selectively index properties in a Model, simply override get_index_keys() in your Model definition:
See https://github.com/espeed/bulbs/blob/master/bulbs/model.py#L383
By default, Bulbs models index all properties. If no keys are provided, then all properties are indexed (like in TinkerPop/Blueprints).
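For example, here is a hedged sketch of a Model that overrides get_index_keys() (the class and property names are illustrative, not from the question):
from bulbs.model import Node
from bulbs.property import String

class Person(Node):
    element_type = "person"

    name = String(nullable=False)
    city = String()

    def get_index_keys(self):
        # index only the "name" property; "city" will not be indexed
        return ["name"]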
See the Model _create() and get_bundle() methods:
_create() https://github.com/espeed/bulbs/blob/master/bulbs/model.py#L583
get_bundle() https://github.com/espeed/bulbs/blob/master/bulbs/model.py#L363
get_index_keys() https://github.com/espeed/bulbs/blob/master/bulbs/model.py#L383
To enable selective indexing for generic vertices and edges, I updated the Bulbs generic vertex/edge methods to include a _keys arg where you can supply a list of property names (keys) to index.
See https://github.com/espeed/bulbs/commit/4fe39d5a76675020286ec9aeaa8e71d58e3a432a
Now, to selectively index properties on generic vertices/edges, you can supply a list of property names to index:
>>> from bulbs.neo4jserver import Graph
>>> g = Graph()
>>> g.config.autoindex = False
>>> james = g.vertices.create(name="James", city="Dallas", _keys=["name"])
>>> julie = g.vertices.create(name="Julie", city="Dallas", _keys=["name"])
>>> g.edges.create(james, "knows", julie, timestamp=12345, someprop="somevalue", _keys=["someprop"])
In the example above, the name property will be indexed for each vertex, and someprop will be indexed for the edge. Note that city and timestamp will not be indexed because those property names were not explicitly included in the list of index keys.
If g.config.autoindex is True and _keys is None (the default), all properties will be indexed (just like before).
If g.config.autoindex is False and _keys is None, no properties will be indexed.
If _keys is explicitly set to a list of property names, only those properties will be indexed, regardless of whether g.config.autoindex is True or False.
See https://github.com/espeed/bulbs/blob/master/bulbs/neo4jserver/client.py#L422
NOTE: How auto-indexing works differs somewhat depending on whether you're using Neo4j Server, Rexster, or Titan Server, and the indexing architecture for all the graph-database servers has been in a state of flux for the past few months. It appears that all are moving from a manual-indexing system to auto-indexing.
For graph-database servers that did not have auto-indexing capability until recently (e.g. Neo4j Server), Bulbs enabled auto-indexing via custom Gremlin scripts that used the database's low-level manual indexing methods:
https://github.com/espeed/bulbs/blob/master/bulbs/neo4jserver/client.py#L1008
https://github.com/espeed/bulbs/blob/master/bulbs/neo4jserver/gremlin.groovy#L11
However, manual indexing has been deprecated in Neo4j Server, TinkerPop/Rexster, and Titan Server, so the Bulbs 0.4 indexing architecture will change accordingly. Selective indexing will still be possible by declaring your index keys upfront, like you would in an SQL create table statement.
BTW: What did you find restrictive about Models? Bulbs Models (and really the entire library) are designed to be flexible so you can modify them to whatever you need.
See the Lightbulb example for how to customize Bulbs Models: Is there an equivalent to commit in the bulbs framework for neo4j
Let me know if you have any questions.
The D-Bus API uses a special format to describe complex parameters.
Since the D-Bus specification wasn't written with Python in mind, it takes some digging to figure out exactly what parameter structure you have to pass.
In my example I want to call the Mount() method of the Filesystem object. This method has the signature a{sv}.
Mount() is defined like this
org.freedesktop.UDisks2.Filesystem
...
The Mount() method
Mount (IN a{sv} options,
OUT s mount_path);
source: http://storaged.org/doc/udisks2-api/latest/gdbus-org.freedesktop.UDisks2.Filesystem.html#gdbus-method-org-freedesktop-UDisks2-Filesystem.Mount
The complete code to mount a partition is this:
import dbus

bus = dbus.SystemBus()
device = "/org/freedesktop/UDisks2/block_devices/sdi1"
obj = bus.get_object('org.freedesktop.UDisks2', device)
obj.Mount(..., dbus_interface="org.freedesktop.UDisks2.Filesystem")
Where ... is the parameters in question.
The answer is separated into different layers:
parameter structure
key names
legal values
The parameter structure for dbus is defined here: https://dbus.freedesktop.org/doc/dbus-specification.html#type-system
We learn from it that a{sv} is an ARRAY of DICT_ENTRYs, i.e. a dictionary of key-value pairs. Each key is a STRING and each value is a VARIANT, which is data of any type preceded by a type code.
Thankfully we don't have to deal with low-level details. Python is going to deal with that.
So the solution simply is:
obj.Mount(dict(key="value", key2="value2"),
dbus_interface="org.freedesktop.UDisks2.Filesystem")
The actual key names are defined in the UDisks docs:
IN a{sv} options: Options - known options (in addition to standard options)
includes fstype (of type 's') and options (of type 's').
OUT s mount_path: The filesystem path where the device was mounted.
from http://storaged.org/doc/udisks2-api/latest/gdbus-org.freedesktop.UDisks2.Filesystem.html#gdbus-method-org-freedesktop-UDisks2-Filesystem.Mount
while standard options refers to
Option name, Value type, Description
auth.no_user_interaction, 'b', If set to TRUE, then no user interaction will happen when checking if the method call is authorized.
from http://storaged.org/doc/udisks2-api/latest/udisks-std-options.html
So, adding the key names we have
obj.Mount(dict(fstype="value", options="value2"),
dbus_interface="org.freedesktop.UDisks2.Filesystem")
Regarding the values, I think you have to study the sections Filesystem Independent Mount Options and Filesystem Dependent Mount Options from https://linux.die.net/man/8/mount
So the final solution looks like
obj.Mount(dict(fstype="vfat", options="ro"),
dbus_interface="org.freedesktop.UDisks2.Filesystem")
I have the following model. I want to change the length of name, but when I run the migration it does not detect the change.
class Client(db.Model):
    __tablename__ = "client"
    client_id = db.Column(
        db.Integer,
        primary_key=True,
        autoincrement=True
    )
    name = db.Column(db.String(65))
    email = db.Column(db.String(255))
For example change to
name = db.Column(db.String(100))
INFO [alembic.env] No changes in schema detected.
But when I rename the column, it does detect the changes:
INFO [alembic.autogenerate.compare] Detected added column 'client.name_test'
INFO [alembic.autogenerate.compare] Detected removed column 'client.name'
Update - June 2020
Type comparison changed in Alembic 1.4, so field length changes should be more reliably identified. From the changelog:
A major rework of the “type comparison” logic is in place which
changes the entire approach by which column datatypes are compared.
Types are now compared based on the DDL string generated by the
metadata type vs. the datatype reflected from the database. This means
we compare types based on what would actually render and additionally
if elements of the types change like string length, those changes are
detected as well. False positives like those generated between
SQLAlchemy Boolean and MySQL TINYINT should also be resolved. Thanks
very much to Paul Becotte for lots of hard work and patience on this
one.
The change log also cites this issue and this documentation.
Original Answer
TL;DR:
context.configure(
# ...
compare_type = True
)
I've tested this on a string length change with the PG backend and it does work; however, as you can see below, the docs currently state that it should not. Here's the relevant section of the docs:
Autogenerate can optionally detect:
Change of column type. This will occur if you set the
EnvironmentContext.configure.compare_type parameter to True, or to a
custom callable function. The default implementation only detects
major type changes, such as between Numeric and String, and does not
detect changes in arguments such as lengths, precisions, or
enumeration members. The type comparison logic is extensible to work
around these limitations, see Comparing Types for details.
And the API Reference for compare_type states:
Indicates type comparison behavior during an autogenerate operation.
Defaults to False which disables type comparison. Set to True to turn
on default type comparison, which has varied accuracy depending on
backend. See Comparing Types for an example as well as information on
other type comparison options.
Lastly, in the section titled, Comparing Types, the following example is given for how to enable the type comparisons:
context.configure(
# ...
compare_type = True
)
You'll find the context.configure() call in the env.py script that is automatically generated by alembic, nested inside the connect context:
with connectable.connect() as connection:
    context.configure(
        connection=connection, target_metadata=target_metadata
    )
... and just add the compare_type param in there.
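After the change, the configure call would look roughly like this:
with connectable.connect() as connection:
    context.configure(
        connection=connection,
        target_metadata=target_metadata,
        compare_type=True,  # detect type changes, including string lengths
    )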
In the same section, they go on to say:
Note The default type comparison logic (which is end-user extensible)
currently works for major changes in type only, such as between
Numeric and String. The logic will not detect changes such as:
changes between types that have the same “type affinity”, such as
between VARCHAR and TEXT, or FLOAT and NUMERIC
changes between the arguments within the type, such as the lengths of
strings, precision values for numerics, the elements inside of an
enumeration.
Detection of these kinds of parameters is a long term project on the SQLAlchemy side.
So it's interesting to see it mentioned a couple of times in the docs that this should not work. As mentioned earlier, I have tested this on postgres and can confirm that setting compare_type=True does generate a revision for the length of the column, so perhaps the docs are lagging a little behind on this, or the maintainers aren't ready to declare it as a feature yet.
I've also tested on MySQL and can confirm that string length changes are also picked up if compare_type=True.
I am assuming you are using the Flask-Migrate package for migrations.
By default, Alembic doesn't detect changes to the attributes of existing columns.
You should add compare_type=True when you create the Migrate class:
migrate = Migrate(app, db, compare_type=True)
Now run flask db migrate again to generate a new Alembic migration .py file for the changes to the existing column.
A short version of SuperShoot's answer above:
Open migrations/env.py
Find
with connectable.connect() as connection:
    context.configure(
        ...
    )
Add compare_type=True, inside.
Run again.
I have an ICatalogTool and a catalog that I can query using AdvancedQuery, and I want to learn how to use this tool and which queries I can use to find something in that catalog.
I have an example of usage of this tool:
results = ICatalogTool(dc).search(query=Eq('id', self._object.ip))
# Eq - is an "EQUALS" in AdvancedQuery
# dc - instance of DeviceList class
# self._object.ip - some IP for search
I have read the documentation and found that each function like Eq takes some index. So I want to know which indexes other than 'id' are in my catalog. How can I look that up? Are there any tools for introspection?
Look in the Zope Management Interface in the Indexes tab. Otherwise, you can list index names programmatically by calling the indexes() method of the catalog object.
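For example, a rough sketch (whether the tool exposes its underlying ZCatalog as a .catalog attribute is an assumption, so adapt this to however you reach the catalog object):
catalog = ICatalogTool(dc).catalog  # assumption: the tool wraps a ZCatalog here
for name in catalog.indexes():      # indexes() lists the available index names
    print(name)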
IMHO, you should familiarize yourself with the basic query interface (calling the searchResults() method with queries specified as mappings) before attempting to use the AdvancedQuery add-on.
In the old Google App Engine datastore API, "required" and "default" could be used together in property definitions. Using ndb I get a
ValueError: repeated, required and default are mutally exclusive.
Sample code:
from google.appengine.ext import ndb
from google.appengine.ext import db
class NdbCounter(ndb.Model):
    # raises ValueError
    count = ndb.IntegerProperty(required=True, default=1)

class DbCounter(db.Model):
    # Doesn't raise ValueError
    count = db.IntegerProperty(required=True, default=1)
I want to instantiate a Counter without having to specify a value. I also want to prevent anyone from overriding that value with None. The example above is contrived. I could probably live without the required attribute and instead add an increment() method. Still, I don't see why required and default are mutually exclusive.
Is it a bug or a feature?
I think you are right. Perhaps I was confused when I wrote that part of the code. It makes sense that "required=True" means "do not allow writing the value None", so it should be possible to combine this with a default value. Please file a feature request in the NDB tracker: http://code.google.com/p/appengine-ndb-experiment/issues/list
Note that for repeated properties things are more complicated: repeated will probably continue to be incompatible with both required and default, even if the above feature is implemented.
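In the meantime, one possible workaround (a sketch of mine, not from the answer above, using NDB's _pre_put_hook) is to keep the default and enforce the "not None" constraint at write time:
from google.appengine.ext import ndb

class NdbCounter(ndb.Model):
    count = ndb.IntegerProperty(default=1)

    def _pre_put_hook(self):
        # assumption: enforcing the constraint here stands in for required=True
        if self.count is None:
            raise ValueError("count must not be None")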
I'm not sure what was intended; here is the "explanation" from appengine.ext.ndb.model.py:
The repeated, required and default options are mutually exclusive: a
repeated property cannot be required nor can it specify a default
value (the default is always an empty list and an empty list is always
an allowed value), and a required property cannot have a default.
Beware that ndb has some other really annoying behaviour (Text > 500 bytes is not possible without monkey-patching the Expando model, filtering by .IN([]) raises an exception, ...).
So unless you need the speed improvements from its caching, you might consider staying with ext.db for now.
I have some database structure; as most of it is irrelevant here, I'll describe just the relevant pieces. Let's take the Item object as an example:
items_table = Table("invtypes", gdata_meta,
    Column("typeID", Integer, primary_key = True),
    Column("typeName", String, index=True),
    Column("marketGroupID", Integer, ForeignKey("invmarketgroups.marketGroupID")),
    Column("groupID", Integer, ForeignKey("invgroups.groupID"), index=True))

mapper(Item, items_table,
    properties = {"group" : relation(Group, backref = "items"),
                  "_Item__attributes" : relation(Attribute, collection_class = attribute_mapped_collection('name')),
                  "effects" : relation(Effect, collection_class = attribute_mapped_collection('name')),
                  "metaGroup" : relation(MetaType,
                                         primaryjoin = metatypes_table.c.typeID == items_table.c.typeID,
                                         uselist = False),
                  "ID" : synonym("typeID"),
                  "name" : synonym("typeName")})
I want to achieve some performance improvements in the sqlalchemy/database layer, and I have a couple of ideas:
1) Requesting the same item twice:
item = session.query(Item).get(11184)
item = None  # reference to item is lost, object is garbage collected
item = session.query(Item).get(11184)
Each request generates and issues an SQL query. To avoid this, I use two custom maps for item objects:
itemMapId = {}
itemMapName = {}

#cachedQuery(1, "lookfor")
def getItem(lookfor, eager=None):
    if isinstance(lookfor, (int, float)):
        id = int(lookfor)
        if eager is None and id in itemMapId:
            item = itemMapId[id]
        else:
            item = session.query(Item).options(*processEager(eager)).get(id)
            itemMapId[item.ID] = item
            itemMapName[item.name] = item
    elif isinstance(lookfor, basestring):
        if eager is None and lookfor in itemMapName:
            item = itemMapName[lookfor]
        else:
            # Items have unique names, so we can fetch just first result w/o ensuring its uniqueness
            item = session.query(Item).options(*processEager(eager)).filter(Item.name == lookfor).first()
            itemMapId[item.ID] = item
            itemMapName[item.name] = item
    return item
I believe sqlalchemy does similar object tracking, at least by primary key (item.ID). If it does, I can wipe both maps (although wiping the name map will require minor modifications to the application which uses these queries) so as not to duplicate functionality, and use the stock methods. The actual question is: if there is such functionality in sqlalchemy, how do I access it?
2) Eager loading of relationships often helps to save a lot of requests to the database. Say I'll definitely need the following set of item = Item() properties:
item.group (Group object, according to groupID of our item)
item.group.items (fetch all items from items list of our group)
item.group.items.metaGroup (metaGroup object/relation for every item in the list)
If I have some item ID and no item is loaded yet, I can request it from the database, eagerly loading everything I need: sqlalchemy will join the group, its items and the corresponding metaGroups within a single query. If I accessed them with default lazy loading, sqlalchemy would need to issue 1 query to grab the item + 1 to get the group + 1*#items for all items in the list + 1*#items to get the metaGroup of each item, which is wasteful.
2.1) But what if I already have the Item object fetched, and some of the properties which I want to load are already loaded? As far as I understand, when I re-fetch some object from the database, its already loaded relations do not become unloaded, am I correct?
2.2) If I have the Item object fetched and want to access its group, I can just getGroup using item.groupID, applying any eager statements I'll need ("items" and "items.metaGroup"). It should properly load the group and its requested relations without touching the item's own attributes. Will sqlalchemy properly map this fetched group to item.group, so that when I access item.group it won't fetch anything from the underlying database?
2.3) If I have the following things fetched from the database: the original item, item.group and some portion of the items from the item.group.items list, some of which may have metaGroup loaded, what would be the best strategy for completing the data structure to the same state as the eager load above: re-fetch the group with ("items", "items.metaGroup") eager load, or check each item from the items list individually, and if the item or its metaGroup is not loaded, load them? It seems to depend on the situation, because if everything has already been loaded some time ago, issuing such a heavy query is pointless. Does sqlalchemy provide a way to track whether some object relation is loaded, with the ability to look deeper than just one level?
As an illustration to 2.3: I can fetch the group with ID 83, eagerly fetching "items" and "items.metaGroup". Is there a way to determine from an item (which has a groupID of 83) whether it has "group", "group.items" and "group.items.metaGroup" loaded or not, using sqlalchemy tools (in this case all of them should be loaded)?
To force loading of lazy attributes just access them. This is the simplest way and it works fine for relations, but it is not as efficient for Columns (you will get a separate SQL query for each column in the same table). You can get a list of all unloaded properties (both relations and columns) from sqlalchemy.orm.attributes.instance_state(obj).unloaded.
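For example (a small sketch using the Item mapping from the question):
from sqlalchemy.orm.attributes import instance_state

item = session.query(Item).get(11184)
print(instance_state(item).unloaded)  # names of relations/columns not loaded yet
item.group                            # touching the relation triggers the lazy load
print(instance_state(item).unloaded)  # "group" is no longer listed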
You don't use deferred columns in your example, but I'll describe them here for completeness. The typical scenario for handling deferred columns is the following:
Decorate selected columns with deferred(). Combine them into one or several groups by using the group parameter of deferred().
Use undefer() and undefer_group() options in the query when desired.
Accessing a deferred column that was put in a group will load all columns in that group.
Unfortunately this doesn't work in reverse: you can combine columns into groups without deferring their loading by default with column_property(Column(…), group=…), but the defer() option won't affect them (it works for Columns only, not column properties, at least in 0.6.7).
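A quick sketch of that workflow (the description column is illustrative, it's not in the question's table); the deferred() property would go into the existing mapper() call's properties dict:
from sqlalchemy.orm import deferred, undefer, undefer_group

# inside the existing mapper(Item, items_table, properties={...}) dict:
#     "description" : deferred(items_table.c.description, group="details"),

# undefer it up front when you know you'll need it
item = session.query(Item).options(undefer("description")).get(11184)
# or undefer the whole group at once
item = session.query(Item).options(undefer_group("details")).get(11184)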
To force loading of deferred column properties, session.refresh(obj, attribute_names=…) as suggested by Nathan Villaescusa is probably the best solution. The only disadvantage I see is that it expires the attributes first, so you have to ensure there are no already-loaded attributes among those passed in the attribute_names argument (e.g. by using an intersection with state.unloaded).
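Something along these lines (a sketch):
from sqlalchemy.orm.attributes import instance_state

wanted = {"group", "metaGroup"}
still_unloaded = wanted & instance_state(item).unloaded
if still_unloaded:
    # refresh only what isn't loaded yet, so already-loaded attributes aren't expired
    session.refresh(item, attribute_names=list(still_unloaded))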
Update
1) SQLAlchemy does track loaded objects. That's how the ORM works: there must be only one object in the session for each identity. Its internal cache is weak by default (use weak_identity_map=False to change this), so an object is expunged from the cache as soon as there is no reference to it in your code. SQLAlchemy won't issue an SQL request for query.get(pk) when the object is already in the session. But this works for the get() method only, so query.filter_by(id=pk).first() will issue an SQL request and refresh the object in the session with the loaded data.
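A quick illustration (a sketch using the question's Item, whose primary key is typeID):
item1 = session.query(Item).get(11184)   # issues SQL the first time
item2 = session.query(Item).get(11184)   # no SQL: served from the identity map
assert item1 is item2

item3 = session.query(Item).filter_by(typeID=11184).first()  # always issues SQL
assert item3 is item1  # but it is still the same identity-mapped object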
2) Eager loading of relations will lead to fewer requests, but it's not always faster. You have to check this for your database and data.
2.1) Refetching data from the database won't unload objects bound via relations.
2.2) item.group is loaded using the query.get() method, so it won't lead to an SQL request if the object is already in the session.
2.3) Yes, it depends on the situation. In most cases it's best to hope SQLAlchemy will use the right strategy :). For an already loaded relation you can check whether the related objects' relations are loaded via state.unloaded, and so on recursively to any depth. But when the relation is not loaded yet, you can't know whether the related objects and their relations are already loaded: even when a relation is not yet loaded, the related object(s) might already be in the session (just imagine you request the first item, load its group and then request another item that has the same group). For your particular example I see no problem with just checking state.unloaded recursively.
1)
From the Session documentation:
[The Session] is somewhat used as a cache, in that
it implements the identity map
pattern, and stores objects keyed to
their primary key. However, it doesn’t
do any kind of query caching. ... It’s only
when you say query.get({some primary
key}) that the Session doesn’t have to
issue a query.
2.1) You are correct, relationships are not modified when you refresh an object.
2.2) Yes, the group will be in the identity map.
2.3) I believe your best bet will be to attempt to reload the entire group.items in a single query. From my experience it is usually much quicker to issue one large request than several smaller ones. The only time it would make sense to reload just a specific group.item is if there was exactly one of them that needed to be loaded. Though in that case you would be doing one large query instead of one small one, so you don't actually reduce the number of queries.
I have not tried it, but I believe you should be able to use the sqlalchemy.orm.util.identity_key method to determine whether an object is in sqlalchemy's identity map. I would be interested to find out what calling identity_key(Group, 83) returns.
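For what it's worth, a rough sketch (untested; the exact key format varies between SQLAlchemy versions):
from sqlalchemy.orm.util import identity_key

key = identity_key(Group, 83)          # typically something like (Group, (83,))
group = session.identity_map.get(key)  # None if that Group isn't in the session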
Initial Question)
If I understand correctly, you have an object that you fetched from the database where some of its relationships were eager-loaded, and you would like to fetch the rest of the relationships with a single query? I believe you may be able to use the Session.refresh() method, passing in the names of the relationships that you want to load.