I have a performance-critical application which has to match multiple nodes to another node based on regex matching. My current query is as follows:
MATCH (person: Person {name: 'Mark'})
WITH person
UNWIND person.match_list AS match
MATCH (pet: Animal)
WHERE pet.name_regex =~ match
MERGE (person)-[:OWNS_PET]->(pet)
RETURN pet
However, this query runs VERY slow (around 500ms on my workstation).
The graph contains around 500K nodes, and around 10K will match the regex.
I'm wondering whether there is a more efficient way to rewrite this query so it produces the same result but performs better.
EDIT:
When I run this query for several Persons from multiple threads, I get a TransientError exception:
neo4j.exceptions.TransientError: ForsetiClient[3] can't acquire ExclusiveLock{owner=ForsetiClient[14]} on NODE(1889), because holders of that lock are waiting for ForsetiClient[3].
EDIT 2:
Person:name is unique and indexed
Animal:name_regex is not indexed
First, I would start by simplifying your query as much as possible. The way you are doing it now creates a lot of wasted effort after a match has been found:
MATCH (person: Person {name: 'Mark'}), (pet: Animal)
WHERE ANY(match in person.match_list WHERE pet.name_regex =~ match)
MERGE (person)-[:OWNS_PET]->(pet)
RETURN pet
This will make it so that only one MERGE is attempted if there are multiple matches, and once one match is found, the rest won't be attempted on the same pet. This also allows Cypher to optimize to the best of its ability on your data.
To improve the Cypher further, you will need to optimize your data. For example, regex matching is expensive (it requires a node + string scan). If the match patterns are largely shared between people, it would be better to break them out into their own nodes and connect people to those, so that the work of one regex match can be reused everywhere it's repeated.
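On the TransientError from the edit: that deadlock comes from concurrent transactions locking the same nodes. One way to cope with it from Python is to run the simplified query inside a managed write transaction, since the official neo4j driver automatically retries transaction functions on transient errors such as deadlocks. A minimal sketch, assuming the simplified query above; connection details are placeholders:

from neo4j import GraphDatabase

QUERY = """
MATCH (person:Person {name: $name}), (pet:Animal)
WHERE ANY(m IN person.match_list WHERE pet.name_regex =~ m)
MERGE (person)-[:OWNS_PET]->(pet)
RETURN pet
"""

def link_pets(tx, name):
    # Runs inside a managed transaction; the driver retries it on TransientError.
    return [record["pet"] for record in tx.run(QUERY, name=name)]

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
with driver.session() as session:
    pets = session.write_transaction(link_pets, "Mark")  # execute_write in 5.x drivers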
Related
I have a dataset which has first and last names along with their respective email ids. Some of the email ids follow a certain pattern such as:
Fn1 = John , Ln1 = Jacobs, eid1= jj#xyz.com
Fn2 = Emily , Ln2 = Queens, eid2= eq#pqr.com
Fn3 = Harry , Ln3 = Smith, eid3= hsm#abc.com
The content after # has no importance for finding the pattern. I want to find out how many people follow a certain pattern and what that pattern is. Is it possible to do so using NLP and Python?
EXTRA: To know what kind of pattern applies to some number of people, could we store examples of that pattern along with its count in an Excel sheet?!
You certainly could - e.g., you could try to learn a relationship between your input and output data as
(Fn, Ln) --> eid
and further dissect this relationship into patterns.
However, before hitting the problem with complex tools (especially if you're new to ML/NLP), I'd do further analysis of the data first.
For example, I'd first be curious to see what portion of your data displays the clear patterns you've shown in the examples - using the first character(s) of the individual's first/last name to build the corresponding eid (which could be determined easily programmatically).
Setting aside that portion of the data that satisfies this clear pattern - what does the remainder look like?
Is there another clear, but different, pattern in some of this data?
If there is - I'd then perform the same exercise again - construct a proper filter to collect and set aside data satisfying that pattern - and examine the remainder.
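For instance, a rough sketch of that first programmatic check in Python (the '#' separator and the three rows come from the examples above; the candidate pattern list is purely illustrative):

from collections import Counter

rows = [
    ("John", "Jacobs", "jj#xyz.com"),
    ("Emily", "Queens", "eq#pqr.com"),
    ("Harry", "Smith", "hsm#abc.com"),
]

# candidate patterns: n_first chars of the first name + n_last chars of the last name
candidates = [(1, 1), (1, 2), (2, 1)]

def pattern_of(fn, ln, eid):
    local = eid.split("#")[0].lower()   # only the part before '#' matters
    for n_first, n_last in candidates:
        if local == (fn[:n_first] + ln[:n_last]).lower():
            return f"{n_first} first-name chars + {n_last} last-name chars"
    return None  # no simple pattern -- set aside for the next round of analysis

counts = Counter(pattern_of(*row) for row in rows)
print(counts)
# Counter({'1 first-name chars + 1 last-name chars': 2,
#          '1 first-name chars + 2 last-name chars': 1})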
Doing this analysis might help determine at least a partial answer to your inquiry rather quickly:
To know what kind of pattern applies to some number of people, could we store examples of that pattern along with its count in an Excel sheet?!
Moreover, it will help determine
a) whether you even need to use more complex tooling (if enough patterns can be easily sieved out this way - is it worth the investment to go heavier?) or
b) if not, which portion of the data to target with heavier tools (the remainder of this process - those not containing simple patterns).
I'm trying to perform searches by multiple prefixes at Google Cloud Bigtable with the Python SDK. I'm using read_rows, and I can't see a good way to search by prefix explicitly.
My first option is RowSet + RowRange. I'm testing three queries, and the times that I'm getting are ~1.5s, ~3.5s and ~4.2s, which are an order of magnitude slower than the searches with the Node SDK (which has a filter option) ~0.19, ~0.13, ~0.46.
The second option is using RowFilterChain + RowKeyRegexFilter. Performance is terrible for two of the queries: ~0.124s, ~72s, ~69s. It looks like it's doing a full scan. This is the code section:
regex = f'^{prefix}.*'.encode()
filters.append(RowKeyRegexFilter(regex))
My third option is using the alternative Happybase-based SDK, which has prefix filtering. With that, I'm getting ~0.4s, ~0.1s, ~0.17s. The first query involves multiple prefixes, and there doesn't seem to be support for multiple filters in the same request, so I'm performing as many requests as there are prefixes and then concatenating the iterators. The other two seem to leverage the prefix filter.
UPDATE: I deleted the first set of times because there was a mistake with the environment. After doing it properly, the times are not bad for the range query, but there seems to be room for improvement, as the Happybase tests are still faster when they leverage prefix search.
Would appreciate help with using multiple prefix searches in Happybase, or an actual prefix search in the main Python SDK.
The read_rows method has two parameters, start_key and end_key, that you can use to efficiently filter rows based on the row key (see docs). Behind the scenes, this method performs a Scan, which is why it is probably the most efficient way to filter rows based on their row keys.
For example, let's suppose you have the following row keys in your table:
a
aa
b
bb
bbb
and you want to retrieve all rows with a row key prefixed by a, you can run :
rows_with_prefix_a = my_table.read_rows(start_key="a", end_key="b")
This will only scan rows between a and b (b excluded), so this will return all rows with row key prefix a (a and aa in the previous example).
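For multiple prefixes, one possible approach (a sketch only: it assumes the prefixes are bytes, that no prefix ends in 0xFF, and a client version where the value returned by read_rows can be iterated directly) is to turn each prefix into such a start/end pair and chain the per-prefix read_rows calls:

from itertools import chain

def prefix_to_range(prefix: bytes):
    # The end key is the prefix with its last byte incremented, so the scan
    # stops right after the last row key that starts with the prefix.
    # (Naive version: assumes the last byte is not 0xFF.)
    return prefix, prefix[:-1] + bytes([prefix[-1] + 1])

prefixes = [b"a", b"b"]
rows = chain.from_iterable(
    my_table.read_rows(start_key=start, end_key=end)
    for start, end in (prefix_to_range(p) for p in prefixes)
)
for row in rows:
    ...  # process each row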
A beginner question with Python, probably.
I am able to iterate over the results of an Aerospike DB query like this:
import aerospike

def process_result(record_tuple):
    # the callback receives the (key, metadata, record) tuple as a single argument
    (key, metadata, record) = record_tuple
    expiresIn = record.get("expiresIn")

client = aerospike.client(config).connect()
scan = client.scan('namespace', 'setName')
scan.select('PK', 'expiresIn', 'clientId', 'scopes', 'roles')  # scan from aerospike
scan.foreach(process_result)
Now, all I want to do is get the nth record from this set, without having to iterate through all of them.
I tried looking at Get the nth item of a generator in Python but could not make much sense.
Results from scan operation come from all the nodes in the cluster, pipelined, in no particular order. In that sense, there is no difference between the first record or the Nth record in terms of ordering. There is no order.
I wrote some Medium posts on how to sort results from a scan or query:
Sorted Results from a Secondary Index Query in Aerospike — Part I
Sorted Results from a Secondary Index Query — Part II
As usual, the workaround would be to set the scan policy to return just the digests, store them as a list (or as several records with smaller lists) and paginate over those with batch reads. You can set reasonable TTLs so that this result set lives for a reasonable length of time.
I can provide sample code if needed.
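Purely as an illustration (not tested; option names and the key-tuple layout may vary slightly between client versions), that digest approach could look roughly like this:

# Pass 1: scan with 'nobins' so only keys/metadata come back, and collect digests.
digests = []

def collect_digest(record_tuple):
    key, metadata, record = record_tuple   # key is (namespace, set, pk, digest)
    digests.append(key[3])

scan = client.scan('namespace', 'setName')
scan.foreach(collect_digest, {}, {'nobins': True})

# Pass 2: paginate with batch reads over the stored digests.
page = digests[100:150]                    # e.g. the 101st..150th records
keys = [('namespace', 'setName', None, d) for d in page]
for key, metadata, record in client.get_many(keys):
    ...  # process each record in the page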
I am trying to apply filters on two different properties, but GAE doesn't allow me to do this. What would the solution be? Here is the code snippet:
if searchParentX:
    que.filter("parentX >=", searchParentX).filter("parentX <=", unicode(searchParentX) + u"\ufffd")
    que.order('parentX')
if searchParentY:
    que.filter("parentY >=", searchParentY).filter("parentY <=", unicode(searchParentY) + u"\ufffd")
The solution would be to do in-memory filtering:
You can run two queries (filtering on one property each) and do an intersection on the results (depending on the size of the data, you may need to limit your results for one query but not the other so it can fit in memory)
Run one query and filter out the other property in memory (in this case it would be beneficial if you know which property would return a more filtered list) - a rough sketch of this follows below
Alternatively, if your data is structured in such a way that you can break the data into sets, you can perform equality filters on that set and finish filtering in memory. For example, if you are searching on strings but you know the strings to be a fixed length (say 6 characters), you can create a "lookup" field with the beginning 3/4 characters. Then, when you need to search on this field, you do so by matching on the first few characters and finish the search in memory. Another example: when searching for integer ranges, if you can define common groupings of ranges (say decades for a year, or price ranges), then you can define a "range" field to do equality searches on and continue filtering in memory.
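A rough sketch of the second option above (filter on parentX in the datastore, on parentY in memory), using the same old db-style query API as the question; the model name and fetch limit are placeholders:

que = MyModel.all()
if searchParentX:
    que.filter("parentX >=", searchParentX)
    que.filter("parentX <=", unicode(searchParentX) + u"\ufffd")
    que.order("parentX")

candidates = que.fetch(1000)  # placeholder limit
if searchParentY:
    upper = unicode(searchParentY) + u"\ufffd"
    candidates = [e for e in candidates
                  if searchParentY <= e.parentY <= upper]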
Inequality filters are limited to at most one property. I think this restriction exists because the data in Bigtable is stored in lexically sorted form, so only one range scan can be performed at a time.
https://developers.google.com/appengine/docs/python/datastore/queries#Restrictions_on_Queries
I'm using Django and PostgreSQL, but I'm not absolutely tied to the Django ORM if there's a better way to do this with raw SQL or database specific operations.
I've got a model that needs sequential ordering. Lookup operations will generally retrieve the entire list in order. The most common operation on this data is to move a row to the bottom of a list, with a subset of the intervening items bubbling up to replace the previous item like this:
(operation on A, with subset B, C, E)
A -> B
B -> C
C -> E
D -> D
E -> A
Notice how D does not move.
In general, the subset of items will not be more than about 50 items, but the base list may grow to tens of thousands of entries.
The most obvious way of implementing this is with a simple integer order field. This seems suboptimal. It requires the compromise of making the position ordering column non-unique, where non-uniqueness is only required for the duration of a modification operation. To see this, imagine the minimal operation using A with subset B:
oldpos = B.pos
B.pos = A.pos
A.pos = oldpos
Even though you've stored the position, at the second line you've violated the uniqueness constraint. Additionally, this method makes atomicity problematic - your read operation has to happen before the write, during which time your records could change. Django's default transaction handling documentation doesn't address this, though I know it should be possible in the SQL using the "REPEATABLE READ" level of transaction locking.
I'm looking for alternate data structures that suit this use pattern more closely. I've looked at this question for ideas.
One proposal there is the Dewey decimal style solution, which makes insert operations occur numerically between existing values, so inserting A between B and C results in:
A=1 -> B=2
B=2 -> A=2.5
C=3 -> C=3
This solves the column uniqueness problem, but introduces the issue that the column must be a float of a specified number of decimals. Either I over-estimate, and store way more data than I need to, or the system becomes limited by whatever arbitrary decimal length I impose. Furthermore, I don't expect usage to be spread evenly over the database - some keys are going to be moved far more often than others, making this solution hit the limit sooner. I could solve this problem by periodically re-numbering the database, but it seems that a good data structure should avoid needing this.
Another structure I've considered is the linked list (and variants). This has the advantage of making modification straightforward, but I'm not certain of its properties with respect to SQL - ordering such a list in the SQL query seems like it would be painful, and extracting a non-sequential subset of the list has terrible retrieval properties.
Beyond this, there are B-Trees, various Binary Trees, and so on. What do you recommend for this data structure? Is there a standard data structure for this solution in SQL? Is the initial idea of going with sequential integers really going to have scaling issues, or am I seeing problems where there are none?
Preferred solutions:
A linked list would be the usual way to achieve this. A query to return the items in order is trivial in Oracle, but I'm not sure how you would do it in PostgreSQL.
Another option would be to implement this using the ltree module for postgresql.
Less graceful (and write-heavy) solution:
Start a transaction. "select for update" within scope for row-level locks. Move the target record to position 0, update the target's succeeding records to +1 where their position is higher than the target's original position (or vice versa), and then update the target to the new position - a single additional write over what would be needed without a unique constraint. Commit :D
Simple (yet still write-heavy) solution if you can wait for PostgreSQL 8.5 (an alpha is available) :)
Wrap it in a transaction, select for update in scope, and use a deferred constraint (postgresql 8.5 has support for deferred unique constraints like Oracle).
A temp table and a transaction should maintain atomicity and the unique constraint on sort order. Restating the problem, you want to go from:
A 10  ->  B 10
B 25  ->  C 25
C 26  ->  E 26
E 34  ->  A 34
Where there can be any number of items in between each row. So, first you read in the records and create a list [['A',10],['B',25],['C',26],['E',34]]. Through some pythonic magic you shift the identifiers around and insert them into a temp table:
create temporary table reorder (
id varchar(20), -- whatever
sort_order integer,
primary key (id));
Now for the update:
update XYZ
set sort_order = (select sort_order from reorder where xyz.id = reorder.id)
where id in (select id from reorder)
I'm only assuming pgsql can handle that query. If it can, it will be atomic.
Optionally, create table REORDER as a permanent table and the transaction will ensure that attempts to reorder the same record twice will be serialized.
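A sketch of that "pythonic magic" plus the temp-table fill, assuming a DB-API cursor (e.g. from psycopg2 or Django's connection.cursor()):

rows = [['A', 10], ['B', 25], ['C', 26], ['E', 34]]       # read in sort order
ids = [r[0] for r in rows]
positions = [r[1] for r in rows]
rotated = ids[1:] + ids[:1]               # B, C, E, A -- the head moves to the bottom
reorder = list(zip(rotated, positions))   # [('B', 10), ('C', 25), ('E', 26), ('A', 34)]

cursor.executemany(
    "insert into reorder (id, sort_order) values (%s, %s)",
    reorder,
)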
EDIT: There are some transaction issues. You might need to implement both of my ideas. If two processes both want to update item B (for example) there can be issues. So, assume all order values are even:
1. Begin the transaction.
2. Increment all the orders being used by 1. This puts row-level write locks on all the rows you are going to update.
3. Select the data you just updated; if any sort_order fields are even, some other process has added a record that matches your criteria. You can either abort the transaction and restart, or you can just drop that record and finish the operation using only the records that were updated in step 2. The "right" thing to do depends on what you need this code to accomplish.
4. Fill your temporary reorder table as above, using the proper even sort_orders.
5. Update the main table as above.
6. Drop the temporary table.
7. Commit the transaction.
Step 2 ensures that if two lists overlap, only the first one will have access to the row in question until the transaction completes:
update XYZ set sort_order = sort_order + 1
where -- whatever your select criteria are
select * from XYZ
where -- same select criteria
order by sort_order
Alternatively, you can add a control field to the table to get the same effect, and then you don't need to play with the sort_order field. The benefit of using the sort_order field is that indexing by a BIT field or a LOCK_BY_USERID field that is usually null tends to have poor performance, since the index is meaningless 99% of the time. SQL engines don't like indexes that spend most of their time empty.
It seems to me that your real problem is the need to lock a table for the duration of a transaction. I don't immediately see a good way to solve this problem in a single operation, hence the need for locking.
So the question is whether you can do this in a "Django way" as opposed to using straight SQL. Searching "django lock table" turned up some interesting links, including this snippet; there are many others that implement similar behavior.
A straight SQL linked-list style solution can be found in this stack overflow post; it appeared logical and succinct to me, but again it's two operations.
I'm very curious to hear how this turns out and what your final solution is, be sure to keep us updated!
Why not use a simple character field of some length, like a max of 16 (or 255), initially?
Start initially with labeling things aaa through zzz (that should be 17576 entries). (You could also add in 0-9, and the uppercase letters and symbols for an optimization.)
As items are added, they can go at the end, up to the maximum length you allow for these additional 'end' entries (zzza, zzzaa, zzzaaa, zzzaab, zzzaac, zzzaad, etc.)
This should be reasonably simple to program, and it's very similar to the Dewey Decimal system.
Yes, you will need to rebalance it occasionally, but that should be a simple operation. The simplest approach is two passes: pass 1 would set the new ordering tag to '0' (or any character earlier than the first character) followed by the new tag of the appropriate length, and pass 2 would remove the '0' from the front.
Obviously, you could do the same thing with floats and rebalance regularly; this is just a variation on that. The one advantage is that most databases will allow you to set a ridiculously large maximum size for the character field, large enough to make it very, very, very unlikely that you would run out of digits to do the ordering, and also make it unlikely that you would ever have to modify the schema, while not wasting a lot of space.
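For what it's worth, generating the initial labels (and confirming the 17576 figure) is a short sketch in Python:

import itertools
import string

# every three-letter label from 'aaa' to 'zzz', in lexicographic order
labels = [''.join(t) for t in itertools.product(string.ascii_lowercase, repeat=3)]
print(len(labels))             # 26 ** 3 == 17576
print(labels[0], labels[-1])   # aaa zzz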
You can solve the renumbering issue by making the order column an integer that is always an even number. When you are moving the data, you change the order field to the new sort value + 1 and then do a quick update to convert all the odd order fields to even:
update table set sort_order = bitand(sort_order, '0xFFFFFFFE')
where sort_order <> bitand(sort_order, '0xFFFFFFFE')
Thus you can keep the uniqueness of sort_order as a constraint
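In Python terms, the bit trick above just clears the low bit, mapping the temporary odd values back onto the even ones, e.g.:

in_flight = 25                      # odd value written while the move is in progress
settled = in_flight & 0xFFFFFFFE    # -> 24, even again, so the unique constraint holds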
EDIT: Okay, looking at the question again, I've started a new answer.