Add label to all nodes in an arbitrary Cypher query - Python

I am writing a web app with a single Neo4j database as the datastore. Different users have their own separate graphs. From the documentation I have read, Neo4j doesn't have a concept of schemas; instead it is suggested that a label be added to every node to indicate which graph it belongs to.
I am writing middleware for the web app in Python which handles user logins and forwards queries on to the database. I am building a web service in this middleware which accepts a Cypher query string and returns the query result. Before forwarding the query string to the database, the middleware needs to alter it, adding a label that depends on the logged-in user.
I need to build a function as follows:
def query_for_user(original_query, users_label):
    return XXX
My first idea is to do this with string processing, based on the information here: https://neo4j.com/docs/getting-started/current/cypher-intro/patterns/
Every node in a query is surrounded by parentheses, so I could replace every part of the query inside parentheses with a new version. The parentheses always seem to start with a node identifier, so I could read past this and add the required label. This seems like it would work, but it would be error-prone and brittle because I don't know all the features of Cypher.
Does anyone know of a more elegant way of achieving my goal? Is there a Cypher parser somewhere that I could use instead of string manipulation?

Neo4j labels are not meant to be used to separate nodes into different "graphs".
A label is meant to be used as a node "type". For example, "Person" would be an example of a typical label, but having a huge number of labels like "User049274658188" or "Graph93458428834" would not be good practice.
As appropriate, nodes should have a "type"-like label, and can include a property (say, id) whose value is a unique identifier. Your queries can then start by matching a specific label/id combination in order to indicate which "graph" you want to work on. All other nodes in that "graph" would be linked to that start node via relationship paths.
And typically one would generate Cypher from raw data, not Cypher templates. However, if you really want to use Cypher templating, the templates could just contain Cypher parameters, as in:
MATCH (p:Person)-[:LIVES_AT]->(a:Address) WHERE p.id = $personId RETURN p, a;
Your client code will just need to provide the parameter values when invoking the query.
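For example, with the official Neo4j Python driver the values are supplied alongside the query string rather than spliced into it (a minimal sketch; the URI, credentials, and the personId value are placeholders, not taken from the question):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (p:Person)-[:LIVES_AT]->(a:Address)
WHERE p.id = $personId
RETURN p, a
"""

with driver.session() as session:
    # The driver substitutes $personId safely; no string concatenation needed.
    result = session.run(query, personId=42)
    for record in result:
        print(record["p"], record["a"])

driver.close()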

How can I safely parameterize table/column names in BigQuery SQL?

I am using Python's BigQuery client to create and keep up to date some tables in BigQuery that contain daily counts of certain Firebase events joined with data from other sources (sometimes grouped by country, etc.). Keeping them up to date requires deleting and replacing data for past days, because the day tables for Firebase events can be changed after they are created (see here and here). I keep them up to date this way to avoid querying the entire dataset, which is very financially/computationally expensive.
This deletion and replacement process needs to be repeated for many tables, so I need to reuse some queries stored in text files. For example, one deletes everything in the table from a particular date onward (delete from x where event_date >= y). But because BigQuery disallows the parameterization of table names (see here), I have to duplicate these query text files for each table I need to do this for. If I want to run tests I would also have to duplicate the queries for the test tables.
I basically need something like psycopg2.sql for BigQuery so that I can safely parameterize table and column names while avoiding SQL injection. I actually tried to repurpose that module by calling the as_string() method and using the result to query BigQuery, but the resulting syntax doesn't match and I would need to open a Postgres connection to do it (as_string() expects a cursor/connection object). I also tried something similar with sqlalchemy.text to no avail. So I concluded I'd have to implement some way of parameterizing the table name myself, or find some workaround using the Python client library. Any ideas on how I should go about doing this in a safe way that won't lead to SQL injection? I can't go into detail, but unfortunately I cannot store the tables in Postgres or any other database.
As discussed in the comments, the best option for avoiding SQLi in your case is ensuring your server's security.
If you still need or want to validate your input parameters before building your query, I recommend using a regular expression to check the input strings.
In Python you could use the re library.
As I don't know how your code works, how your datasets/tables are organized, or exactly how you are planning to check whether a string is a valid source, I created the basic example below, which shows how you could check a string using this library:
import re

tests = ["your-dataset.your-table", "(SELECT * FROM <another table>)", "dataset-09123.my-table-21112019"]

# Supposing that the input pattern is <dataset>.<table>
regex = re.compile(r"[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+")

for t in tests:
    if regex.fullmatch(t):
        print("This source is ok")
    else:
        print("This source is not ok")
In this example, only strings that match the pattern <dataset>.<table> (where both the dataset and the table must contain only alphanumeric characters and dashes) are considered valid.
When running the code, the first and the third elements of the list are considered valid, while the second (which could potentially change your whole query) is considered invalid.
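Once a table name has passed the check, you can interpolate it into the SQL text and still use query parameters for the values. A rough sketch with the google-cloud-bigquery client, where the table name, date column, and cutoff value are made-up placeholders:

import datetime

from google.cloud import bigquery

def delete_from_date(client, table_name, cutoff_date):
    # table_name has already been validated by the regex above, so interpolating
    # it into the SQL text should not open the door to injection.
    sql = f"DELETE FROM `{table_name}` WHERE event_date >= @cutoff"
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("cutoff", "DATE", cutoff_date)]
    )
    client.query(sql, job_config=job_config).result()

client = bigquery.Client()
delete_from_date(client, "dataset-09123.my-table-21112019", datetime.date(2019, 11, 1))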

How to check if a specific Grakn instance already exists before trying to insert it into the KG?

Let's assume that a Grakn KG contains entities of type 'product' and that they are uniquely identified by the key 'id_prod'. As I understand it, the attempt to insert an instance of product with a repeated id_prod will generate an error.
Assuming that the insertion is being done through a console script, how could the previous existence of the instance be checked with Graql during the insertion? And via the Python client, are there any special recommendations or patterns to follow?
Your assertion is correct. At present Graql doesn't have a PUT behaviour built-in that would check for existence and insert only if not present. This is a feature that should be included in the future (I work at Grakn).
Instead, you have broadly two options:
You match for concepts by their keys. If there are no results, then you insert them. You can then use match insert for the keyed concepts to add relations to them, etc.
You first ensure that you've inserted all of the keyed concepts into the KB (this may not be possible). You then make the match insert queries directly, matching for the keyed concepts, with no need to check that the keys exist.
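A minimal sketch of the first option in Python, expressed purely as Graql query strings; run_query is a hypothetical stand-in for however your client executes queries (e.g. a transaction from the Grakn Python client), and the exact Graql syntax may differ between Grakn versions:

# run_query is a placeholder: pass in whatever callable executes a Graql string
# against your keyspace and returns an iterable of results.

def product_exists(run_query, id_prod):
    # Match on the key; an empty result means the product is not in the KG yet.
    results = run_query(f'match $p isa product, has id_prod "{id_prod}"; get $p;')
    return len(list(results)) > 0

def insert_product_if_absent(run_query, id_prod):
    if not product_exists(run_query, id_prod):
        run_query(f'insert $p isa product, has id_prod "{id_prod}";')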

DELETE and CREATE Cypher statements in one transaction

I'm writing a Python 3.6 application that uses Neo4j as a backend with the official Python driver. I'm new to Neo4j and Cypher. Data coming into the database needs to replace previous 'versions' of that data. I'm attempting to do this by creating a root node indicating the version, e.g.:
MATCH (root:root_node)-[*..]->(any_node:) DETACH DELETE root, any_node
CREATE(root:new_root_node)
...
...
... represents all the new data I'm attaching to the new_root_node
The above doesn't work. How do I incorporate DELETE and CREATE statements in one transaction?
Thanks!
There's not a problem with DELETE and CREATE statements in a single transaction.
There are two problems here needing fixing.
The first is with (any_node:). The : separates the variable from the node label. If : is present, then the node label must also be present, and because no label is supplied here you're getting an error. To fix this, remove the : completely, like so: (any_node)
The second problem is with CREATE (root:new_root_node). The problem here is that the root variable is already in scope from your previous MATCH, so you'll need to use a different variable.
Also, your :new_root_node label doesn't seem useful, as any existing queries that read data from the root node would need to change to use the new label. I have a feeling you might be misunderstanding something about Neo4j labels, so a quick review of the relevant documentation may be helpful.
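For completeness, a rough sketch of running the delete and the create in one transaction with the official Python driver; the URI, credentials, and the Cypher statements are placeholders built from the question, and write_transaction assumes a 4.x-era driver:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def replace_graph(tx):
    # Delete the old root and everything reachable from it.
    tx.run("MATCH (root:root_node)-[*..]->(n) DETACH DELETE root, n")
    # Recreate the root, reusing the same label so existing queries keep working.
    tx.run("CREATE (newRoot:root_node)")
    # ... further statements attaching the new data to the new root ...

with driver.session() as session:
    session.write_transaction(replace_graph)  # execute_write in newer driver versions

driver.close()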

Neo4j - Cypher read-write-return query

I'm fairly new to Neo4j. I've played a little bit with Cypher and the REST API. I want to be able to create a leaf node along a certain path; consider these nodes to be some type of event. I won't know at run time the id of the node this event will be attached to, so I need to do a look-up to get the id of that node and then create my new node.
So at run time I was hoping I could do a MATCH using Cypher to get the node to which I can attach the event, and CREATE the new node along with the relationship to the existing node returned by MATCH. I came across the Cypher cheat sheet, which has a read-write-return query that I thought would be a good fit, but there isn't much about it in the documentation (or maybe I'm not a super googler!).
Could someone please tell me if this (read-write-return) is the right/valid approach?
Many Thanks!
Yep. That's a good approach. That's one of the nice things about how CREATE works in Cypher. You can also optionally use create unique which creates the rel/node at the same time. Something like:
start n=node(1)
create unique n-[:event]->(event {prop:"val"})
return n, event;
Or without create unique:
start n=node(1)
create (event {prop:"val"}), n-[:event]->event
return n, event;
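Note that start and create unique are legacy syntax that later Neo4j versions removed. A rough modern equivalent, run from Python with the official driver, where the :Node/:Event labels and the name property are made-up placeholders:

from neo4j import GraphDatabase

read_write_return = """
MATCH (n:Node {name: $name})
CREATE (n)-[:EVENT]->(event:Event {prop: $prop})
RETURN n, event
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    record = session.run(read_write_return, name="some-node", prop="val").single()
    print(record["n"], record["event"])
driver.close()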

Reverse Search Best Practices?

I'm making an app that has a need for reverse searches. By this, I mean that users of the app will enter search parameters and save them; then, when any new objects get entered into the system, if they match the existing search parameters that a user has saved, a notification will be sent, etc.
I am having a hard time finding solutions for this type of problem.
I am using Django and thinking of building the searches and pickling them using Q objects as outlined here: http://www.djangozen.com/blog/the-power-of-q
The way I see it, when a new object is entered into the database, I will have to load every single saved query from the db and somehow run it against this one new object to see if it would match that search query... This doesn't seem ideal - has anyone tackled such a problem before?
At the database level, many databases offer 'triggers'.
Another approach is to have timed jobs that periodically fetch all items from the database whose last-modified date is later than the last run; these then get filtered and alerts are issued. You can perhaps push some of the filtering into the query statement in the database. However, this is a bit trickier if notifications also need to be sent when items get deleted.
You can also put triggers manually into the code that submits data to the database, which is perhaps more flexible and certainly doesn't rely on specific features of the database.
A nice way for the triggers and the alerts to communicate is through message queues - queues such as RabbitMQ and other AMQP implementations will scale with your site.
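A bare-bones sketch of the timed-job idea; fetch_items_modified_since, the saved-search objects with a matches() method, and notify are all hypothetical, and scheduling (cron, Celery beat, etc.) is left out:

import datetime

def run_once(last_run, fetch_items_modified_since, saved_searches, notify):
    """One pass of the periodic job; returns the timestamp to use as the next last_run."""
    started_at = datetime.datetime.utcnow()
    for item in fetch_items_modified_since(last_run):
        for search in saved_searches:
            if search.matches(item):   # hypothetical predicate for the stored query
                notify(search.owner, item)
    return started_at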
The amount of effort you use to solve this problem is directly related to the number of stored queries you are dealing with.
Over 20 years ago we handled stored queries by treating them as mini-docs and indexing them based on all of their must-have and may-have terms. A new doc's term list was used as a sort of query against this "database of queries", and that built a list of possibly interesting searches to run; then only those searches were run against the new docs. This may sound convoluted, but when there are more than a few stored queries (say anywhere from 10,000 to 1,000,000 or more) and you have a complex query language that supports a hybrid of Boolean and similarity-based searching, it substantially reduced the number we had to execute as full-on queries -- often no more than 10 or 15 queries.
One thing that helped was that we were in control of the horizontal and the vertical of the whole thing. We used our query parser to build a parse tree and that was used to build the list of must/may have terms we indexed the query under. We warned the customer away from using certain types of wildcards in the stored queries because it could cause an explosion in the number of queries selected.
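A toy sketch of that "database of queries" idea, ignoring the may-have terms and the real query language; term extraction and the final full query evaluation are stubbed out:

from collections import defaultdict

# Stored queries reduced to the set of terms a matching doc must contain.
stored_queries = {
    "q1": {"neo4j", "cypher"},
    "q2": {"bigquery", "python"},
}

# Inverted index: term -> ids of the stored queries that require it.
index = defaultdict(set)
for qid, must_have in stored_queries.items():
    for term in must_have:
        index[term].add(qid)

def candidate_queries(doc_terms):
    # Any query sharing a required term with the doc is a candidate...
    candidates = set()
    for term in doc_terms:
        candidates |= index.get(term, set())
    # ...but only the ones whose required terms are all present get run in full.
    return [qid for qid in candidates if stored_queries[qid] <= doc_terms]

print(candidate_queries({"python", "bigquery", "sql"}))  # ['q2']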
Update for comment:
Short answer: I don't know for sure.
Longer answer: We were dealing with a custom-built text search engine, and part of its query syntax allowed slicing the doc collection in certain ways very efficiently, with special emphasis on date_added. We played a lot of games because we were ingesting 4 to 10 million new docs a day and running them against up to 1,000,000+ stored queries on DEC Alphas with 64MB of main memory. (This was in the late 80's/early 90's.)
I'm guessing that filtering on something equivalent to date_added could be used in combination with the date of the last time you ran your queries, or maybe the highest id at the last query run time. If you need to re-run the queries against a modified record you could use its id as part of the query.
For me to get any more specific, you're going to have to get a lot more specific about exactly what problem you are trying to solve and the scale of the solution you are trying to accomplish.
If you stored the type(s) of object(s) involved in each stored search as a generic relation, you could add a post-save signal to all involved objects. When the signal fires, it looks up only the searches that involve its object type and runs those. That probably will still run into scaling issues if you have a ton of writes to the db and a lot of saved searches, but it would be a straightforward Django approach.
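A minimal sketch of that Django approach, assuming each saved search stores a pickled Q object (as in the linked article) plus the content type of the model it applies to; the model fields and the notify helper are made up for illustration:

import pickle

from django.contrib.contenttypes.models import ContentType
from django.db import models
from django.db.models.signals import post_save
from django.dispatch import receiver

class SavedSearch(models.Model):
    user = models.ForeignKey("auth.User", on_delete=models.CASCADE)
    content_type = models.ForeignKey(ContentType, on_delete=models.CASCADE)
    pickled_q = models.BinaryField()  # pickle.dumps(Q(...)) of the saved search

@receiver(post_save)
def run_saved_searches(sender, instance, created, **kwargs):
    if not created:
        return
    ct = ContentType.objects.get_for_model(sender)
    # Only load the searches that target this model, not every saved query.
    for search in SavedSearch.objects.filter(content_type=ct):
        q = pickle.loads(search.pickled_q)
        # Re-run the stored Q against just the newly created row.
        if sender.objects.filter(q, pk=instance.pk).exists():
            notify(search.user, instance)  # hypothetical notification helper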
