Is there a python based rule engine that can be applied to dataframe
I have a daily transaction file with about 2M records. It is as follows:
TransactionId,region,account,currency,cost_center,amount
T1,R1,A1,USD,CC1,1200000
T2,R1,A1,USD,CC2,1000000
T3,R2,A1,EUR,CC1,1100000
T4,R2,A2,EUR,CC3,900000
T5,R3,A1,XNY,CC2,1200000
Rules are defined as follows, with the attributes region, account, currency, and cost_center used for matching against transactions, plus inclusion and exclusion criteria:
region,account,currency,cost_center,pct,priority,Inclusion,Exclusion
Rule1:ALL,A1,USD,CC2,0.2,10,NONE,NONE
Rule2:ALL,A1,ALL,ALL,0.1,1,NONE,NONE
Rule3:R2,A2,EUR,CC3,0.3,10,NONE,NONE
Rule4:ALL,A2,EUR,ALL,0.3,1,<Region in R1,R2>&<cost_center in CC1,CC3>,<Account in A4,A5>
The task is to find the matching rule with the highest priority.
So, for example, Rule1 and Rule2 both apply to T2, but Rule1 is picked because it has the higher priority.
Is there a Python-based rule engine that can be applied to a dataframe and can handle inclusion and exclusion criteria?
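For illustration, here is a rough pandas sketch of just the priority-based matching (treating ALL as a wildcard; the Inclusion/Exclusion expressions are ignored and would need a small parser of their own). It assumes the transactions and rules are already loaded into dataframes tx and rules, with the rule name in a rule column:
import pandas as pd

key_cols = ["region", "account", "currency", "cost_center"]

# Cross join transactions with rules (requires pandas >= 1.2), then keep rows
# where every rule attribute is either 'ALL' or equal to the transaction's value
merged = tx.merge(rules, how="cross", suffixes=("", "_rule"))
mask = pd.Series(True, index=merged.index)
for col in key_cols:
    mask &= (merged[col + "_rule"] == "ALL") | (merged[col + "_rule"] == merged[col])
matched = merged[mask]

# Per transaction, keep only the matching rule with the highest priority
best = (matched.sort_values("priority", ascending=False)
               .drop_duplicates("TransactionId"))
print(best[["TransactionId", "rule", "priority"]])
With 2M transactions a full cross join gets large, so in practice you would merge on the exact-valued columns first and handle the ALL wildcards separately.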
Related
I have two dataframes, df1 and df2, with ~40,000 rows and ~70,000 rows respectively of data about polling stations in country A.
The two dataframes have some common columns like 'polling_station_name', 'province', 'district', etc.; however, df1 has latitude and longitude columns, whereas df2 doesn't, so I am trying to do string matching between the two dataframes so that at least some rows of df2 will have geolocations available. I am blocking on the 'district' column while doing the string matching.
This is the code that I have so far:
import recordlinkage

# Block on 'district' so only rows from the same district are compared
indexer = recordlinkage.Index()
indexer.block('district')
candidate_links = indexer.index(df1, df2)

# Fuzzy-match the station names within each block
compare = recordlinkage.Compare()
compare.string('polling_station_name', 'polling_station_name', method='damerau_levenshtein', threshold=0.75)
compare_vectors = compare.compute(candidate_links, df1, df2)
This produced about 12,000 matches; however, I have noticed that some polling station names are being matched incorrectly because their names are very similar even though they refer to different locations - e.g. 'government girls primary school meilabu' and 'government girls primary school muzaka' are clearly different, yet they are being matched.
I think utilising NLP might help here: certain words occur very frequently in the data, like 'government', 'girls', 'boys', 'primary', 'school', etc., so I would like to put less emphasis on those words and more emphasis on 'meilabu', 'muzaka', etc. while doing the string matching, but I am not sure where to start.
(For reference, many of the polling stations are 'government' (i.e. public) schools.)
Any advice would be greatly appreciated!
The topic is very broad; just pay attention to the standard approaches:
TFIDF: term frequency–inverse document frequency is often used as a weighting factor.
Measure similarity between two sentences using cosine similarity
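For instance, a minimal scikit-learn sketch along these lines (assuming the names sit in the 'polling_station_name' columns of df1 and df2, as above), where TF-IDF automatically down-weights frequent tokens like 'government' or 'school':
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names1 = df1['polling_station_name'].fillna('').tolist()
names2 = df2['polling_station_name'].fillna('').tolist()

# Fit one shared vocabulary so both sides use the same IDF weights;
# very common tokens ('government', 'girls', 'primary', ...) end up with low weight
vectorizer = TfidfVectorizer()
vectorizer.fit(names1 + names2)

sim = cosine_similarity(vectorizer.transform(names1), vectorizer.transform(names2))
# sim[i, j] is the weighted similarity of df1 row i and df2 row j; in practice
# you would only score the blocked candidate_links pairs rather than all pairs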
#ipj said it correctly: the topic is very broad. You can try out the methods below.
def get_sim_measure(sentence1, sentence2):
    vec1 = get_vector(sentence1)
    vec2 = get_vector(sentence2)
    return cosine_similarity(vec1, vec2)
Now the get_vector method can be many things.
Remove the stop words first and then you can use word2vec, GloVe on a word level and average them for the sentence. (simple)
Use doc2vec from Gensim for vector embedding of the sentence. (medium)
Use BERT (DistilBERT or something lighter) for dynamic, context-aware embeddings. (hard)
Use TF-IDF and then use GloVe embedding. (simple)
Use spaCy's entity recognition and then do similarity matching (in this case words from government girls primary school will act as stop words) on entity labels. (slow process but simple)
Use BLEU score for measuring similar words (in case you need it). (maybe misleading)
There can be many situations, so it's better to try a few of the simple ones first and go from there; the first option is sketched below.
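As a sketch of that first (simple) option, assuming a spaCy model that ships word vectors (such as en_core_web_md) is installed:
import numpy as np
import spacy

nlp = spacy.load('en_core_web_md')  # any spaCy model with word vectors

def get_vector(sentence):
    # drop stop words and punctuation, then average the remaining word vectors
    doc = nlp(sentence)
    vecs = [t.vector for t in doc if t.has_vector and not t.is_stop and not t.is_punct]
    return np.mean(vecs, axis=0) if vecs else np.zeros(nlp.vocab.vectors_length)

def cosine_similarity(vec1, vec2):
    denom = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    return float(np.dot(vec1, vec2) / denom) if denom else 0.0

def get_sim_measure(sentence1, sentence2):
    return cosine_similarity(get_vector(sentence1), get_vector(sentence2))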
I'm trying to perform searches by multiple prefixes in Google Cloud Bigtable with the Python SDK. I'm using read_rows, and I can't see a good way to search by prefix explicitly.
My first option is RowSet + RowRange. I'm testing three queries, and the times that I'm getting are ~1.5s, ~3.5s and ~4.2s, which are an order of magnitude slower than the searches with the Node SDK (which has a filter option): ~0.19s, ~0.13s and ~0.46s.
The second option is using RowFilterChain + RowKeyRegexFilter. Performance is terrible for two of the queries: ~0.124s, ~72s and ~69s. It looks like it's doing a full scan. This is the code section:
from google.cloud.bigtable.row_filters import RowKeyRegexFilter

regex = f'^{prefix}.*'.encode()  # anchor the pattern at the start of the row key
filters.append(RowKeyRegexFilter(regex))
My third option is using the alternative Happybase-based SDK, which has prefix filtering. With that, I'm getting ~0.4s, ~0.1s and ~0.17s. The first query involves multiple prefixes, and there doesn't seem to be support for multiple filters in the same request, so I'm performing as many requests as prefixes and then concatenating the iterators. The other two queries seem to leverage the prefix filter.
UPDATE: I deleted the first set of times because there was a mistake with the environment. After doing it properly, the times are not bad for the range query, but there still seems to be room for improvement, as the Happybase tests are faster when they can leverage the prefix filter.
I would appreciate help with using multiple prefix searches in Happybase, or with an actual prefix search in the main Python SDK.
The read_rows method has two parameters, start_key and end_key, that you can use to efficiently filter rows based on the row key (see the docs). Behind the scenes this method performs a scan, which is why it is probably the most efficient way to filter rows by row key.
For example, let's suppose you have the following row keys in your table:
a
aa
b
bb
bbb
and you want to retrieve all rows with a row key prefixed by a, you can run :
rows_with_prefix_a = my_table.read_rows(start_key="a", end_key="b")
This will only scan rows between a and b (b excluded), so this will return all rows with row key prefix a (a and aa in the previous example).
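If you need several prefixes in one go, one approach (a sketch, not benchmarked) is to turn each prefix into a start_key/end_key pair by incrementing its last byte and to issue one ranged read per prefix:
def prefix_to_range(prefix: bytes):
    # b'abc' -> (b'abc', b'abd'); assumes the last byte is not 0xff
    return prefix, prefix[:-1] + bytes([prefix[-1] + 1])

rows = []
for prefix in (b'a', b'bb'):  # hypothetical prefixes
    start_key, end_key = prefix_to_range(prefix)
    rows.extend(my_table.read_rows(start_key=start_key, end_key=end_key))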
I have a performance-critical application which has to match multiple nodes to another node based on regex matching. My current query is as follows:
MATCH (person: Person {name: 'Mark'})
WITH person
UNWIND person.match_list AS match
MATCH (pet: Animal)
WHERE pet.name_regex =~ match
MERGE (person)-[:OWNS_PET]->(pet)
RETURN pet
However, this query runs VERY slowly (around 500ms on my workstation).
The graph contains around 500K nodes, and around 10K will match the regex.
I'm wondering whether there is a more efficient way to re-write this query to work the same but provide a performance increase.
EDIT:
When I run this query for several Persons in multiple threads, I get a TransientError exception:
neo4j.exceptions.TransientError: ForsetiClient[3] can't acquire ExclusiveLock{owner=ForsetiClient[14]} on NODE(1889), because holders of that lock are waiting for ForsetiClient[3].
EDIT 2:
Person:name is unique and indexed
Animal:name_regex is not indexed
First, I would start by simplifying your query as much as possible. The way you are doing it now creates a lot of wasted effort after a match has been found:
MATCH (person: Person {name: 'Mark'}), (pet: Animal)
WHERE ANY(match in person.match_list WHERE pet.name_regex =~ match)
MERGE (person)-[:OWNS_PET]->(pet)
RETURN pet
This will make it so that only one MERGE is attempted per pet even if there are multiple matches; once one pattern matches, the rest won't be tried against the same pet. This also allows Cypher to optimize to the best of its ability on your data.
To improve the Cypher further, you will need to optimize your data. For example, regex matching is expensive (it requires a node + string scan). If the match patterns are largely reused between people, it would be better to break them out into their own nodes and connect people to those, so that the work of one regex match can be reused everywhere it's repeated.
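For completeness, running the simplified query from Python with the official neo4j driver could look roughly like this (the connection details are placeholders, and the person name is passed as a parameter instead of being hard-coded):
from neo4j import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'secret'))  # placeholder credentials

query = '''
MATCH (person:Person {name: $name}), (pet:Animal)
WHERE ANY(match IN person.match_list WHERE pet.name_regex =~ match)
MERGE (person)-[:OWNS_PET]->(pet)
RETURN pet
'''

with driver.session() as session:
    pets = session.run(query, name='Mark').data()
driver.close()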
I've been using Graphite for some time now to power our backend Python program. As part of my usage of it, I need to sum (using sumSeries) different metrics using wildcards.
Thing is, I need to group them according to a pattern; say I have the following range of metric names:
group.*.item.*
I need to sum the values of all items, for a given group (meaning: group.1.item.*, group.2.item.*, etc)
Unfortunately, I do not know in advance the set of existing group values, so what I do right now is query metrics/index.json, parse the list, and generate the desired query (manually creating sumSeries(group.NUMBER.item.*) for every NUMBER I find in the metrics index).
I was wondering if there is a way to have Graphite do this for me and save that first round trip, as the communication and pre-processing are costly (taking more than half the time of the entire process).
Thanks in advance!
If you want a separate line for each group you could use the groupByNode function.
groupByNode(group.*.item.*, 1, "sumSeries")
Where '1' is the node you're selecting (nodes are indexed from 0) and "sumSeries" is the function you are feeding each group into.
You can read more about this here: http://graphite.readthedocs.io/en/latest/functions.html#graphite.render.functions.groupByNode
If you want to restrict the second node to only numeric values you can use a character range. You do this by specifying the range in square brackets [...]. A character range is indicated by 2 characters separated by a dash (-).
group.[0-9].item.*
You can read more about this here:
http://graphite.readthedocs.io/en/latest/render_api.html#paths-and-wildcards
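From the Python side, this then becomes a single render call; a rough sketch with requests (the Graphite host and time range are placeholders):
import requests

target = 'groupByNode(group.*.item.*, 1, "sumSeries")'  # or use group.[0-9].item.* to restrict the group node
resp = requests.get(
    'http://graphite.example.com/render',  # placeholder host
    params={'target': target, 'format': 'json', 'from': '-1h'},
)
for series in resp.json():
    print(series['target'], series['datapoints'][-1])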
I am trying to apply filters on two different properties, but GAE doesn't allow me to do this. What would the solution be? Here is the code snippet:
if searchParentX:
    que.filter("parentX >=", searchParentX).filter("parentX <=", unicode(searchParentX) + u"\ufffd")
    que.order('parentX')
if searchParentY:
    que.filter("parentY >=", searchParentY).filter("parentY <=", unicode(searchParentY) + u"\ufffd")
The solution would be to do the filtering in memory:
You can run two queries (filtering on one property each) and intersect the results; depending on the size of the data, you may need to limit the results of one query but not the other so it can fit in memory (a sketch of this appears after these options).
Run one query and filter on the other property in memory (in this case it helps to know which property returns the more selective result set).
Alternatively, if your data is structured in such a way that you can break it into sets, you can perform equality filters on those sets and finish filtering in memory. For example, if you are searching on strings that you know to be a fixed length (say 6 characters), you can create a "lookup" field with the beginning 3-4 characters; when you need to search on this field, you match on those first few characters and finish the search in memory. Another example: when searching integer ranges, if you can define common groupings of ranges (say decades for a year, or price ranges), you can define a "range" field to do equality searches on and continue filtering in memory.
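A rough sketch of the first option with the old db API from the question (MyModel and the fetch limits are placeholders):
# One query per property, then intersect by key in memory
qx = MyModel.all().filter("parentX >=", searchParentX) \
                  .filter("parentX <=", unicode(searchParentX) + u"\ufffd")
qy = MyModel.all().filter("parentY >=", searchParentY) \
                  .filter("parentY <=", unicode(searchParentY) + u"\ufffd")

keys_x = set(e.key() for e in qx.fetch(1000))  # keep the smaller result set in memory
results = [e for e in qy.fetch(1000) if e.key() in keys_x]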
Inequality filters are limited to at most one property. I think this restriction exists because the data in Bigtable is stored in lexically sorted form, so only one range scan can be performed at a time.
https://developers.google.com/appengine/docs/python/datastore/queries#Restrictions_on_Queries