I want to recognize my custom pattern using Microsoft's Presidio library in Python.
While passing the regex, I am getting this error:
AttributeError: 'str' object has no attribute 'regex'
from presidio_analyzer import PatternRecognizer
regex = ("^[2-9]{1}[0-9]{3}\\" +
"s[0-9]{4}\\s[0-9]{4}$")
#p = re.compile(regex)
aadhar_number_recognizer = PatternRecognizer(supported_entity="AADHAR_NUMBER",
                                             patterns=[regex])
PatternRecognizer receives a list of Pattern objects as its patterns argument, but you are passing a plain regex string.
Should be:
from presidio_analyzer import PatternRecognizer, Pattern
aadhar_number_recognizer = PatternRecognizer(supported_entity="AADHAR_NUMBER",
                                             deny_list=[],
                                             patterns=[Pattern(name="AADHAR Number",
                                                               score=0.8,
                                                               regex="(^[2-9]{1}[0-9]{3}\\s[0-9]{4}\\s[0-9]{4}$)")],
                                             context=[])
For more references you can take a look at how Presidio implements PatternRecognizer in its built-in recognizers.
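To sanity-check the recognizer on its own, you can call its analyze method directly. A minimal sketch (the sample number is invented; because the pattern is anchored with ^ and $, the text has to be exactly the number):
results = aadhar_number_recognizer.analyze(text="2345 6789 1234",
                                           entities=["AADHAR_NUMBER"])
print(results)  # something like [type: AADHAR_NUMBER, start: 0, end: 14, score: 0.8]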
I was working on this. Although I'm using a different Regex, you can use the code below as a template.
In the example I'm using the base sentence:
"Hi, Java is awesome"
and by using a custom Presidio regex it will be "anonymized" into:
"Hi, Python is awesome"
The code below is just an example; if you only want to replace "Java" with "Python" there are easier ways, this was simply the first approach that came to mind. When anonymizing, it usually makes more sense to replace "Java" or "Python" with a placeholder such as <PROGRAMMING_LANGUAGE>.
from presidio_analyzer import PatternRecognizer, Pattern
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
base_sentence = "Hi, Java is awesome!"
# Define the regex pattern in a Presidio `Pattern` object:
java_pattern = Pattern(name="java_pattern", regex="Java", score=0.5)
# Define the recognizer with one or more patterns
java_recognizer = PatternRecognizer(supported_entity="JAVA", patterns=[java_pattern])
java_pattern_result = java_recognizer.analyze(text=base_sentence, entities=["JAVA"])
print("Sentence:", base_sentence)
print("Found:", java_pattern_result)
print()
# Now anonymize
# Initialize the engine:
engine = AnonymizerEngine()
anonymize_result = engine.anonymize(
    text=base_sentence,
    analyzer_results=java_pattern_result,
    operators={"JAVA": OperatorConfig("replace", {"new_value": "Python"})})
print("Anonymized result:")
print(anonymize_result)
This will print:
Sentence: Hi, Java is awesome!
Found: [type: JAVA, start: 4, end: 8, score: 0.5]
Anonymized result:
text: Hi, Python is awesome!
items:
[
{'start': 4, 'end': 10, 'entity_type': 'JAVA', 'text': 'Python', 'operator': 'replace'}
]
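As noted above, for real anonymization you would usually replace the match with a placeholder rather than another language name; a minimal variant of the operator configuration (same API, only new_value changes):
operators={"JAVA": OperatorConfig("replace", {"new_value": "<PROGRAMMING_LANGUAGE>"})}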
In Python, I'm looking for a way to extract regex groups given a string and a matching template pattern, for example:
file_path = "/101-001-015_fg01/4312x2156/101-001-015_fg01.0001.exr"
file_template = "/{CODE}_{ELEMENT}/{WIDTH}x{HEIGHT}/{CODE}_{ELEMENT}.{FRAME}.exr"
The output I'm looking for is the following:
{
    "CODE": "101-001-015",
    "ELEMENT": "fg01",
    "WIDTH": "4312",
    "HEIGHT": "2156",
    "FRAME": "0001"
}
My initial approach was to format my template and find any and all matches, but it's not ideal:
import re
re_format = file_template.format(CODE='(.*)', ELEMENT='(.*)', WIDTH='(.*)', HEIGHT='(.*)', FRAME='(.*)')
search = re.compile(re_format)
result = search.findall(file_path)
# result: [('101-001-015', 'fg01', '4312', '2156', '101-001-015', 'fg01.000', '')]
All template keys could contain various characters and be of various lengths, so I'm looking for a good matching approach. Any ideas if and how this could be done with Python re or any alternative libraries?
Thanks!
I would go for named capturing groups and extract the desired results with the groupdict() function:
import re
file_path = "/101-001-015_fg01/4312x2156/101-001-015_fg01.0001.exr"
rx = r"\/(?P<CODE>.+)_(?P<ELEMENT>.+)\/(?P<WIDTH>.+)x(?P<HEIGHT>.+)\/.+\.(?P<FRAME>\w+).exr"
m = re.match(rx, file_path)
result = m.groupdict()
# {'CODE': '101-001-015', 'ELEMENT': 'fg01', 'WIDTH': '4312', 'HEIGHT': '2156', 'FRAME': '0001'}
Similar to what Simon did, I'll also try with named capture groups:
import re
regex = r"(?P<CODE>[0-9-]+)_(?P<ELEMENT>[0-9a-z]+)\/(?P<WIDTH>[0-9]+)x(?P<HEIGHT>[0-9]+)\/\1_\2\.(?P<FRAME>[0-9]+)\.exr"
test_str = "101-001-015_fg01/4312x2156/101-001-015_fg01.0001.exr"
matches = re.match(regex, test_str)
print(matches.groupdict())
DEMO: https://rextester.com/BEZH21139
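Following up on the named-group approach, here is a minimal sketch that builds the pattern directly from the {KEY} template instead of hand-writing the regex. The helper name template_to_regex is my own invention; repeated keys become backreferences so both occurrences must agree:
import re

file_path = "/101-001-015_fg01/4312x2156/101-001-015_fg01.0001.exr"
file_template = "/{CODE}_{ELEMENT}/{WIDTH}x{HEIGHT}/{CODE}_{ELEMENT}.{FRAME}.exr"

def template_to_regex(template):
    # escape the literal parts, then turn each {KEY} placeholder into a named
    # group; a key seen before becomes a backreference instead
    seen = set()
    def placeholder(m):
        key = m.group(1)
        if key in seen:
            return r"(?P=%s)" % key
        seen.add(key)
        return r"(?P<%s>.+?)" % key
    return "^" + re.sub(r"\\\{(\w+)\\\}", placeholder, re.escape(template)) + "$"

m = re.match(template_to_regex(file_template), file_path)
print(m.groupdict())
# {'CODE': '101-001-015', 'ELEMENT': 'fg01', 'WIDTH': '4312', 'HEIGHT': '2156', 'FRAME': '0001'}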
I want to replace the value of the first variable using the second variable, but I want to keep the commas. I used regex, but I don't know if it's possible because I'm still learning it. Here is my code:
import re
names = 'Mat,Rex,Jay'
nicknames = 'AgentMat LegendRex KillerJay'
split_nicknames = nicknames.split(' ')
for a in range(len(split_nicknames)):
    replace = re.sub('\\w+', split_nicknames[a], names)
print(replace)
My output is:
KillerJay,KillerJay,KillerJay
and I want an output like this:
AgentMat,LegendRex,KillerJay
I suspect what you are looking for should resemble something like this:
import re
testString = 'This is my complicated test string where Mat, Rex and Jay are all having a lark, but MatReyRex is not changed'
mapping = {'Mat': 'AgentMat',
           'Jay': 'KillerJay',
           'Rex': 'LegendRex'}
reNames = re.compile(r'\b('+'|'.join(mapping)+r')\b')
res = reNames.sub(lambda m: mapping[m.group(0)], testString)
print(res)
Executing this results in the mapped result:
This is my complicated test string where AgentMat, LegendRex and KillerJay are all having a lark, but MatReyRex is not changed
We can build the mapping as follows:
import re
names = 'Mat,Rex,Jay'
nicknames = 'AgentMat LegendRex KillerJay'
my_dict = dict(zip(names.split(','), nicknames.split(' ')))
replace = re.sub(r'\b\w+\b', lambda m:my_dict[m[0]], names)
print(replace)
Then use lambda to apply the mapping.
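For reference, the intermediate mapping and the final output of the code above look like this (just a restatement of what it produces):
print(my_dict)   # {'Mat': 'AgentMat', 'Rex': 'LegendRex', 'Jay': 'KillerJay'}
print(replace)   # AgentMat,LegendRex,KillerJay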
I'm looking for faster alternatives to NLTK to analyze big corpora and do basic things like calculating frequencies, PoS tagging, etc. spaCy seems great and easy to use in many ways, but I can't find any built-in function to count the frequency of a specific word, for example. I've looked at the spaCy documentation, but I can't find a straightforward way to do it. Am I missing something?
What I would like would be the NLTK equivalent of:
tokens.count("word") #where tokens is the tokenized text in which the word is to be counted
In NLTK, the above code would tell me that in my text, the word "word" appears X number of times.
Note that I've come across the count_by function, but it doesn't seem to do what I'm looking for.
I use spaCy for frequency counts in corpora quite often. This is what I usually do:
import spacy
nlp = spacy.load("en_core_web_sm")
list_of_words = ['run', 'jump', 'catch']
def word_count(string):
    words_counted = 0
    my_string = nlp(string)
    for token in my_string:
        # actual word
        word = token.text
        # lemma
        lemma_word = token.lemma_
        # part of speech
        word_pos = token.pos_
        if lemma_word in list_of_words:
            words_counted += 1
            print(lemma_word)
    return words_counted
sentence = "I ran, jumped, and caught the ball."
words_counted = word_count(sentence)
print(words_counted)
The Python stdlib includes collections.Counter for this kind of purpose. You haven't said whether this suits your case.
from collections import Counter
text = "Lorem Ipsum is simply dummy text of the ...."
freq = Counter(text.split())
print(freq)
>>> Counter({'the': 6, 'Lorem': 4, 'of': 4, 'Ipsum': 3, 'dummy': 2 ...})
print(freq['Lorem'])
>>> 4
Alright, just to give some timing reference, I have used this script:
import random, timeit
from collections import Counter
def loadWords():
    with open('corpora.txt', 'w') as corpora:
        randWords = ['foo', 'bar', 'life', 'car', 'wrong',
                     'right', 'left', 'plain', 'random', 'the']
        for i in range(100000000):
            corpora.write(randWords[random.randint(0, 9)] + " ")

def countWords():
    with open('corpora.txt', 'r') as corpora:
        content = corpora.read()
        myDict = Counter(content.split())
        print("foo: ", myDict['foo'])

print(timeit.timeit(loadWords, number=1))
print(timeit.timeit(countWords, number=1))
Results:
149.01646934738716
foo: 9998872
18.093295297389773
Still, I am not sure if this is fast enough for you.
Adding this answer because this is the page I found when searching for a solution to this specific problem. I find it easier than the answers provided before, and it only uses spaCy.
As you mentioned, the spaCy Doc object has the built-in method Doc.count_by. From what I understand of your question, it does what you ask for, but it is not obvious.
It counts the occurrences of a given attribute and returns a dictionary with the attribute's hash (an integer) as key and the count as value.
Solution
First of all, we need to import ORTH from spacy.attrs. ORTH is the exact verbatim text of a token. We also need to load the model and provide a text.
import spacy
from spacy.attrs import ORTH
nlp = spacy.load("en_core_web_sm")
doc = nlp("apple apple orange banana")
Then we create a dictionary of word counts
count_dict = doc.count_by(ORTH)
You could count by other attributes like LEMMA, just import the attribute you wish to use.
If we look at the dictionary, we will see that it contains the hash of the lexeme and the word count.
count_dict
Results:
{8566208034543834098: 2, 2208928596161743350: 1, 2525716904149915114: 1}
We can get the text for the word if we look up the hash in the vocab.
nlp.vocab.strings[8566208034543834098]
Returns
'apple'
With this we can create a simple function that takes the search word and a count dict created with the Doc.count_by method.
def get_word_count(word, count_dict):
    return count_dict[nlp.vocab.strings[word]]
If we run the function with our search word 'apple' and the count dict we created earlier:
get_word_count('apple', count_dict)
We get:
2
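Putting the pieces together, a compact variant (the same calls as above, just wrapped into a Counter keyed by the token text rather than the hash):
from collections import Counter

word_counts = Counter({nlp.vocab.strings[key]: value
                       for key, value in doc.count_by(ORTH).items()})
print(word_counts["apple"])  # 2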
https://spacy.io/api/doc#count_by
Okay, so I have the following piece of code.
out = out + re.sub('\{\{([A-z]+)\}\}', values[re.search('\{\{([A-z]+)\}\}',item).group().strip('{}')],item) + " "
Or, more broken down:
out = out + re.sub(
    '\{\{([A-z]+)\}\}',
    values[
        re.search(
            '\{\{([A-z]+)\}\}',
            item
        ).group().strip('{}')
    ],
    item
) + " "
So, basically, if you give it a string which contains {{reference}}, it will find instances of that and replace them with the given reference. The issue with it in its current form is that it can only work based on the first reference. For example, say my values dictionary was
values = {
    'bob': 'steve',
    'foo': 'bar'
}
and we passed it the string
item = 'this is a test string for {{bob}}, made using {{foo}}'
I want it to put into out
'this is a test string for steve, made using bar'
but what it currently outputs is
'this is a test string for steve, made using steve'
How can I change the code so that it takes into account the position in the loop?
It should be noted that doing a word split would not work, as the code needs to work even if the input is {{foo}}{{steve}}.
I got the output using the following code:
import re

replace_dict = {'bob': 'steve', 'foo': 'bar'}
item = 'this is a test string for {{bob}}, made using {{foo}}'
replace_lst = re.findall('\{\{([A-z]+)\}\}', item)
for r in replace_lst:
    if r in replace_dict:
        item = item.replace('{{' + r + '}}', replace_dict[r])
print(item)
How's this?
import re
values = {
    'bob': 'steve',
    'foo': 'bar'
}
item = 'this is a test string for {{bob}}, made using {{foo}}'
pat = re.compile(r'\{\{(.*?)\}\}')
fields = pat.split(item)
fields[1] = values[fields[1]]
fields[3] = values[fields[3]]
print ''.join(fields)
If you could change the format of the reference from {{reference}} to {reference}, you could achieve what you need just with the format method (instead of using regex):
values = {
    'bob': 'steve',
    'foo': 'bar'
}
item = 'this is a test string for {bob}, made using {foo}'
print(item.format(**values))
# prints: this is a test string for steve, made using bar
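Even without changing the stored strings, you could convert the double braces on the fly before formatting; a small sketch of that idea (it assumes the text contains no other literal braces):
item = 'this is a test string for {{bob}}, made using {{foo}}'
print(item.replace('{{', '{').replace('}}', '}').format(**values))
# prints: this is a test string for steve, made using bar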
In your code, re.search will start looking from the beginning of the string each time you call it, thus always returning the first match {{bob}}.
You can access the match object you are currently replacing by passing a function as replacement to re.sub:
values = { 'bob': 'steve','foo': 'bar'}
item = 'this is a test string for {{bob}}, made using {{foo}}'
pattern = r'{{([A-Za-z]+)}}'
# replacement function
def get_value(match):
    return values[match.group(1)]
result = re.sub(pattern, get_value, item)
# print result => 'this is a test string for steve, made using bar'
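If some references might be missing from the dictionary, a small defensive variant of the replacement function (my addition, not part of the original answer) leaves unknown references untouched:
def get_value(match):
    # fall back to the original {{reference}} text when the key is unknown
    return values.get(match.group(1), match.group(0))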
This is a follow-up to my question. I am using nltk to parse out persons, organizations, and their relationships. Using this example, I was able to create chunks of persons and organizations; however, I am getting an error in the nltk.sem.extract_rels command:
AttributeError: 'Tree' object has no attribute 'text'
Here is the complete code:
import nltk
import re
#billgatesbio from http://www.reuters.com/finance/stocks/officerProfile?symbol=MSFT.O&officerId=28066
with open('billgatesbio.txt', 'r') as f:
    sample = f.read()
sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.batch_ne_chunk(tagged_sentences)
# tried plain ne_chunk instead of batch_ne_chunk as given in the book
#chunked_sentences = [nltk.ne_chunk(sentence) for sentence in tagged_sentences]
# pattern to find <person> served as <title> in <org>
IN = re.compile(r'.+\s+as\s+')
for doc in chunked_sentences:
    for rel in nltk.sem.extract_rels('ORG', 'PERSON', doc, corpus='ieer', pattern=IN):
        print nltk.sem.show_raw_rtuple(rel)
This example is very similar to the one given in the book, but the book's example uses prepared 'parsed docs', which appear out of nowhere, and I don't know where to find their object type. I scoured through the git libraries as well. Any help is appreciated.
My ultimate goal is to extract persons, organizations, titles (dates) for some companies; then create network maps of persons and organizations.
It looks like, to be a "parsed doc", an object needs to have a headline member and a text member, both of which are lists of tokens, where some of the tokens are marked up as trees. For example, this (hacky) example works:
import nltk
import re
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
class doc():
    pass

doc.headline = ['foo']
doc.text = [nltk.Tree('ORGANIZATION', ['WHYY']), 'in', nltk.Tree('LOCATION', ['Philadelphia']), '.', 'Ms.', nltk.Tree('PERSON', ['Gross']), ',']

for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
    print nltk.sem.relextract.show_raw_rtuple(rel)
When run this provides the output:
[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
Obviously you wouldn't actually code it like this, but it provides a working example of the data format expected by extract_rels; you just need to work out how to massage your data into that format in your preprocessing steps.
Here is the source code of the nltk.sem.extract_rels function:
def extract_rels(subjclass, objclass, doc, corpus='ace', pattern=None, window=10):
    """
    Filter the output of ``semi_rel2reldict`` according to specified NE classes and a filler pattern.
    The parameters ``subjclass`` and ``objclass`` can be used to restrict the
    Named Entities to particular types (any of 'LOCATION', 'ORGANIZATION',
    'PERSON', 'DURATION', 'DATE', 'CARDINAL', 'PERCENT', 'MONEY', 'MEASURE').
    :param subjclass: the class of the subject Named Entity.
    :type subjclass: str
    :param objclass: the class of the object Named Entity.
    :type objclass: str
    :param doc: input document
    :type doc: ieer document or a list of chunk trees
    :param corpus: name of the corpus to take as input; possible values are
        'ieer' and 'conll2002'
    :type corpus: str
    :param pattern: a regular expression for filtering the fillers of
        retrieved triples.
    :type pattern: SRE_Pattern
    :param window: filters out fillers which exceed this threshold
    :type window: int
    :return: see ``mk_reldicts``
    :rtype: list(defaultdict)
    """
    ....
So if you pass the corpus parameter as 'ieer', the nltk.sem.extract_rels function expects the doc parameter to be an IEERDocument object. You should pass corpus as 'ace' or simply not pass it at all (the default is 'ace'). In that case it expects a list of chunk trees, which is what you wanted. I modified the code as below:
import nltk
import re
from nltk.sem import extract_rels,rtuple
#billgatesbio from http://www.reuters.com/finance/stocks/officerProfile?symbol=MSFT.O&officerId=28066
with open('billgatesbio.txt', 'r') as f:
    sample = f.read().decode('utf-8')
sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
# here i changed reg ex and below i exchanged subj and obj classes' places
OF = re.compile(r'.*\bof\b.*')
for i, sent in enumerate(tagged_sentences):
    sent = nltk.ne_chunk(sent)  # ne_chunk method expects one tagged sentence
    rels = extract_rels('PER', 'ORG', sent, corpus='ace', pattern=OF, window=7)  # extract_rels method expects one chunked sentence
    for rel in rels:
        print('{0:<5}{1}'.format(i, rtuple(rel)))
And it gives the result:
[PER: u'Chairman/NNP'] u'and/CC Chief/NNP Executive/NNP Officer/NNP of/IN the/DT' [ORG: u'Company/NNP']
This is an nltk version problem. Your code should work in nltk 2.x, but for nltk 3 you should code it like this:
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.relextract.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.relextract.rtuple(rel))