Entity Recognition in Stanford NLP using Python

I am using Stanford CoreNLP from Python. I have taken the code from here.
Following is the code:
from stanfordcorenlp import StanfordCoreNLP
from collections import defaultdict
import logging
import json

class StanfordNLP:
    def __init__(self, host='http://localhost', port=9000):
        self.nlp = StanfordCoreNLP(host, port=port,
                                   timeout=30000, quiet=True, logging_level=logging.DEBUG)
        self.props = {
            'annotators': 'tokenize,ssplit,pos,lemma,ner,parse,depparse,dcoref,relation,sentiment',
            'pipelineLanguage': 'en',
            'outputFormat': 'json'
        }

    def word_tokenize(self, sentence):
        return self.nlp.word_tokenize(sentence)

    def pos(self, sentence):
        return self.nlp.pos_tag(sentence)

    def ner(self, sentence):
        return self.nlp.ner(sentence)

    def parse(self, sentence):
        return self.nlp.parse(sentence)

    def dependency_parse(self, sentence):
        return self.nlp.dependency_parse(sentence)

    def annotate(self, sentence):
        return json.loads(self.nlp.annotate(sentence, properties=self.props))

    @staticmethod
    def tokens_to_dict(_tokens):
        tokens = defaultdict(dict)
        for token in _tokens:
            tokens[int(token['index'])] = {
                'word': token['word'],
                'lemma': token['lemma'],
                'pos': token['pos'],
                'ner': token['ner']
            }
        return tokens

if __name__ == '__main__':
    sNLP = StanfordNLP()
    text = "China on Wednesday issued a $50-billion list of U.S. goods including soybeans and small aircraft for possible tariff hikes in an escalating technology dispute with Washington that companies worry could set back the global economic recovery. The country's tax agency gave no date for the 25 percent increase..."
    ANNOTATE = sNLP.annotate(text)
    POS = sNLP.pos(text)
    TOKENS = sNLP.word_tokenize(text)
    NER = sNLP.ner(text)
    PARSE = sNLP.parse(text)
    DEP_PARSE = sNLP.dependency_parse(text)
I am only interested in Entity Recognition, which is being saved in the variable NER. Printing NER gives the following result:
If I run the same text on the Stanford website, the output for NER is:
There are 2 problems with my Python code:
1. '$' and '50-billion' should be combined and tagged as a single entity. Similarly, I want '25' and 'percent' as a single entity, as shown in the online Stanford output.
2. In my output, 'Washington' is shown as State and 'China' is shown as Country. I want both of them to be shown as 'Loc', as in the Stanford website output. The possible solution to this problem lies in the documentation.
But I don't know which model I am using or how to change it.
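As a hedged side note (not from the original post, and assuming CoreNLP 3.9+ on the server): the stanfordcorenlp wrapper forwards whatever properties dict you pass to annotate, so coarse-grained NER can also be requested per call by setting the same ner.applyFineGrained property that the answer below configures server-wide.
# Hedged sketch: ask the server for coarse-grained NER (LOCATION instead of
# COUNTRY / STATE_OR_PROVINCE) through the props dict the wrapper already accepts.
coarse_props = {
    'annotators': 'tokenize,ssplit,pos,lemma,ner',
    'ner.applyFineGrained': 'false',
    'pipelineLanguage': 'en',
    'outputFormat': 'json'
}
print(sNLP.nlp.annotate(text, properties=coarse_props))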

Here is a way you can solve this:
Make sure to download Stanford CoreNLP 3.9.1 and the necessary model jars.
Set up the server properties in a file called "ner-server.properties":
annotators = tokenize,ssplit,pos,lemma,ner
ner.applyFineGrained = false
Start the server with this command:
java -Xmx12g edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000 -serverProperties ner-server.properties
Make sure you've installed this Python package:
https://github.com/stanfordnlp/python-stanford-corenlp
Run this Python code:
import corenlp
client = corenlp.CoreNLPClient(start_server=False, annotators=["tokenize", "ssplit", "pos", "lemma", "ner"])
sample_text = "Joe Smith was born in Hawaii."
ann = client.annotate(sample_text)
for mention in ann.sentence[0].mentions:
    print([x.word for x in ann.sentence[0].token[mention.tokenStartInSentenceInclusive:mention.tokenEndInSentenceExclusive]])
Here are all the fields available in the EntityMention for each entity:
sentenceIndex: 0
tokenStartInSentenceInclusive: 5
tokenEndInSentenceExclusive: 7
ner: "MONEY"
normalizedNER: "$5.0E10"
entityType: "MONEY"
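For completeness, here is a minimal sketch (my addition, built only on the ann object and the EntityMention fields listed above) that prints the surface text, type, and normalized value of every mention:
# Hedged sketch: walk each sentence's mentions and join the underlying tokens.
for sentence in ann.sentence:
    for mention in sentence.mentions:
        tokens = sentence.token[mention.tokenStartInSentenceInclusive:
                                mention.tokenEndInSentenceExclusive]
        surface = " ".join(t.word for t in tokens)
        print(surface, mention.entityType, mention.normalizedNER)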

Related

Missing value on Streamlit

I tried to connect the backend (FastAPI) with the frontend that I built with Streamlit to predict a value.
I followed this tutorial : https://medium.com/codex/streamlit-fastapi-%EF%B8%8F-the-ingredients-you-need-for-your-next-data-science-recipe-ffbeb5f76a92
But whatever I do I keep getting the same error on Streamlit:
Response from API ={"detail":[{"loc":["body","name_contract_type"],"msg":"field required","type":"value_error.missing"},{"loc":["body","children_count"],"msg":"field required","type":"value_error.missing"},{"loc":["body","fam_members"],"msg":"field required","type":"value_error.missing"},{"loc":["body","amt_credit_sum"],"msg":"field required","type":"value_error.missing"},{"loc":["body","DAYS_INSTALMENT_delay"],"msg":"field required","type":"value_error.missing"},{"loc":["body","bureau_year"],"msg":"field required","type":"value_error.missing"}]}
Here's my API code:
from fastapi import FastAPI
from pydantic import BaseModel
import pickle
import json

app = FastAPI()

class User_input(BaseModel):
    name_contract_type: int
    children_count: int
    fam_members: int
    amt_credit_sum: float
    DAYS_INSTALMENT_delay: int
    amt_income_total: float
    credit_active: int
    bureau_year: int

with open(PATH + "lr.pkl", "rb") as f:
    model = pickle.load(f)

@app.post('/Loan')
def loan_pred(input_parameters: User_input):
    input_data = input_parameters.json()
    input_dictionary = json.loads(input_data)
    # input features
    contract = input_dictionary['name_contract_type']
    children = input_dictionary['children_count']
    members = input_dictionary['fam_members']
    credit_amt = input_dictionary['amt_credit_sum']
    delay = input_dictionary['DAYS_INSTALMENT_delay']
    amt_income_total = input_dictionary['amt_income_total']
    credit_active = input_dictionary['credit_active']
    bureau = input_dictionary['bureau_year']
    input_list = [contract, children, members, credit_amt, delay, credit_active,
                  amt_income_total, bureau]
    prediction = model.predict([input_list])
    if prediction[0] == 0:
        return 'The customer will refund his loan'
    else:
        return 'The customer will not refund his loan'
Here's my code for Streamlit:
import json
import requests
import streamlit as st

# input features
contract = st.sidebar.slider("X", 0, 100, 20)
children = st.sidebar.slider("a", 0, 100, 20)
credit_amnt = st.sidebar.slider("b", 0, 100, 20)
members = st.sidebar.slider("c", 0, 100, 20)
credit_active = st.sidebar.slider("d", 0, 100, 20)
amt_income_total = st.sidebar.slider("e", 0, 100, 20)
bureau = st.sidebar.slider("f", 0, 100, 20)
delay = st.sidebar.slider("g", 0, 100, 20)

user_input_dict = {"contract": contract, "children": children, "credit_amnt": credit_amnt, "members": members,
                   "credit_active": credit_active, "amt_income_total": amt_income_total, "delay": delay, "bureau": bureau}

btn_predict = st.sidebar.button("Predict")
if btn_predict:
    res = requests.post(url='https://66c4-34-73-148-78.ngrok.io/Loan', data=json.dumps(user_input_dict))
    st.subheader(f"Response from API ={res.text}")
Thanks for your help
I tried everything I could but still could not figure it out.
Well, let me recap the whole context:
I did a project that consists of predicting whether a customer will be able to refund a loan or not: 0 means the customer refunds it, 1 means the customer doesn't.
Then I deployed my model with FastAPI as the backend and Streamlit as the frontend; my goal is to connect Streamlit and FastAPI. For this, I followed the tutorial that I mentioned in my question.
My FastAPI code works and returns a prediction. What I did next was use Streamlit for the model inputs and a POST request to get a prediction from the backend (FastAPI). But when I did that I got the error mentioned above (the "loc ... field required" missing-value errors). I hope it's less confusing and clearer now.
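For reference, FastAPI validates the posted JSON against the field names declared on User_input, which is why every field in the model is reported as missing: the request body uses different keys. A minimal sketch of a payload whose keys match the model (my assumption of the intended mapping, reusing the slider variables from the Streamlit snippet above):
# Hedged sketch: the JSON keys must match the Pydantic model's field names exactly.
user_input_dict = {
    "name_contract_type": contract,
    "children_count": children,
    "fam_members": members,
    "amt_credit_sum": credit_amnt,
    "DAYS_INSTALMENT_delay": delay,
    "amt_income_total": amt_income_total,
    "credit_active": credit_active,
    "bureau_year": bureau,
}
res = requests.post(url='https://66c4-34-73-148-78.ngrok.io/Loan', data=json.dumps(user_input_dict))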

Luis python SDK Utterance addition

We are trying to create a chatbot using the LUIS framework and the Python SDK, with the Azure documentation as a reference. We have been able to add intents, entities, and pre-built entities this way. These changes show up on the portal, verifying the addition.
But the utterances added by the code below are not showing up on the portal or being listed in the terminal.
def create_utterance(intent, utterance, *labels):
    """
    Add an example LUIS utterance from utterance text and a list of
    labels. Each label is a 2-tuple containing a label name and the
    text within the utterance that represents that label.

    Utterances apply to a specific intent, which must be specified.
    """
    text = utterance.lower()

    def label(name, value):
        value = value.lower()
        start = text.index(value)
        return dict(entity_name=name, start_char_index=start,
                    end_char_index=start + len(value), role=None)

    return dict(text=text, intent_name=intent,
                entity_labels=[label(n, v) for (n, v) in labels])

utterances = [create_utterance("FindFlights", "find flights in economy to Madrid",
                               ("Flight", "economy to Madrid"),
                               ("Location", "Madrid"),
                               ("Class", "economy")),
              create_utterance("FindFlights", "find flights to London in first class",
                               ("Flight", "London in first class"),
                               ("Location", "London"),
                               ("Class", "first")),
              create_utterance("FindFlights", "find flights from seattle to London in first class",
                               ("Flight", "flights from seattle to London in first class"),
                               ("Location", "London"),
                               ("Location", "Seattle"),
                               ("Class", "first"))]

client.examples.batch(appId, appVersion, utterances, raw=True)
client.examples.list(appId, appVersion)
This code does not return any error, but it does not list the utterances either.
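One thing worth checking (a hedged observation, not from the original post): the snippet calls client.examples.list but never captures or prints its return value, so nothing would be listed in the terminal even if the upload succeeded. Capturing and iterating the result makes any stored utterances visible:
# Hedged sketch: print whatever the authoring SDK returns for this app version.
examples = client.examples.list(appId, appVersion)
for example in examples:
    print(example)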

Django/PostgreSQL Full Text Search - Different search results when using SearchVector versus SearchVectorField on AWS RDS PostgreSQL

I'm trying to use the Django SearchVectorField to support full text search. However, I'm getting different search results when I use the SearchVectorField on my model vs. instantiating a SearchVector class in my view. The problem is isolated to an AWS RDS PostgreSQL instance. Both perform the same on my laptop.
Let me try to explain it with some code:
# models.py
class Tweet(models.Model):
    def __str__(self):
        return self.tweet_id

    tweet_id = models.CharField(max_length=25, unique=True)
    text = models.CharField(max_length=1000)
    text_search_vector = SearchVectorField(null=True, editable=False)

    class Meta:
        indexes = [GinIndex(fields=['text_search_vector'])]
I've populated all rows with a search vector and have established a trigger on the database to keep the field up to date.
# views.py
query = SearchQuery('chance')
vector = SearchVector('text')

on_the_fly = Tweet.objects.annotate(
    rank=SearchRank(vector, query)
).filter(
    rank__gte=0.001
)

from_field = Tweet.objects.annotate(
    rank=SearchRank(F('text_search_vector'), query)
).filter(
    rank__gte=0.001
)

# len(on_the_fly) == 32
# len(from_field) == 0
The on_the_fly queryset, which uses a SearchVector instance, returns 32 results. The from_field queryset, which uses the SearchVectorField, returns 0 results.
The empty result prompted me to drop into the shell to debug. Here's some output from the command line in my python manage.py shell environment:
>>> qs = Tweet.objects.filter(
... tweet_id__in=[949763170863865857, 961432484620787712]
... ).annotate(
... vector=SearchVector('text')
... )
>>>
>>> for tweet in qs:
... print(f'Doc text: {tweet.text}')
... print(f'From db: {tweet.text_search_vector}')
... print(f'From qs: {tweet.vector}\n')
...
Doc text: #Espngreeny Run your 3rd and long play and compete for a chance on third down.
From db: '3rd':4 'chanc':12 'compet':9 'espngreeni':1 'long':6 'play':7 'run':2 'third':14
From qs: '3rd':4 'a':11 'and':5,8 'chance':12 'compete':9 'down':15 'espngreeny':1 'for':10 'long':6 'on':13 'play':7 'run':2 'third':14 'your':3
Doc text: No chance. It was me complaining about Girl Scout cookies. <url-removed-for-stack-overflow>
From db: '/aggcqwddbh':13 'chanc':2 'complain':6 'cooki':10 'girl':8 'scout':9 't.co':12 't.co/aggcqwddbh':11
From qs: '/aggcqwddbh':13 'about':7 'chance':2 'complaining':6 'cookies':10 'girl':8 'it':3 'me':5 'no':1 'scout':9 't.co':12 't.co/aggcqwddbh':11 'was':4
You can see that the search vector looks very different when comparing the value from the database to the value that's generated via Django.
Does anyone have any ideas as to why this would happen? Thanks!
SearchQuery translates the terms the user provides into a search query object that the database compares to a search vector. By default, all the words the user provides are passed through the stemming algorithms, and then it looks for matches for all of the resulting terms.
There are two issues that need to be solved. First, give the stemming algorithm information about the language:
query = SearchQuery('chance', config="english")
Second, replace this line
rank=SearchRank(F('text_search_vector'), query)
with
rank=SearchRank('text_search_vector', query)
About the missing words in text_search_vector: it is standard procedure for stemming algorithms to remove common words, known as stop words.
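Putting the answer's two changes together against the question's own view code, a minimal sketch would look like this:
# Hedged sketch combining both fixes described above: give SearchQuery a language
# config for stemming, and pass the field name rather than an F() expression.
query = SearchQuery('chance', config='english')
from_field = Tweet.objects.annotate(
    rank=SearchRank('text_search_vector', query)
).filter(
    rank__gte=0.001
)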

How to retrieve well formatted JSON from AWS Lambda using Python

I have a function in AWS Lambda that connects to the Twitter API and returns the tweets which match a specific search query I provide via the event. A simplified version of the function is below. There are a few helper functions I use, like get_secret to manage API keys and process_tweet, which limits what data gets sent back and does things like convert the created-at date to a string. The net result is that I should get back a list of dictionaries.
def lambda_handler(event, context):
    twitter_secret = get_secret("twitter")
    auth = tweepy.OAuthHandler(twitter_secret['api-key'],
                               twitter_secret['api-secret'])
    auth.set_access_token(twitter_secret['access-key'],
                          twitter_secret['access-secret'])
    api = tweepy.API(auth)
    cursor = tweepy.Cursor(api.search,
                           q=event['search'],
                           include_entities=True,
                           tweet_mode='extended',
                           lang='en')
    tweets = list(cursor.items())
    tweets = [process_tweet(t) for t in tweets if not t.retweeted]
    return json.dumps({"tweets": tweets})
From my desktop then, I have code which invokes the lambda function.
aws_lambda = boto3.client('lambda', region_name="us-east-1")
payload = {"search": "paint%20protection%20film filter:safe"}
lambda_response = aws_lambda.invoke(FunctionName="twitter-searcher",
InvocationType="RequestResponse",
Payload=json.dumps(payload))
results = lambda_response['Payload'].read()
tweets = results.decode('utf-8')
The problem is that somewhere between json.dumps-ing the output in Lambda and reading the payload in Python, the data has gotten screwy. For example, a line break which should be \n becomes \\\\n, all of the double quotes are stored as \\", and Unicode characters are all prefixed by \\. So everything that was escaped was received by Python on my desktop with the escaping character itself escaped. Consider this element of the list that was returned (with manual formatting).
'{\\"userid\\": 190764134,
\\"username\\": \\"CapitalGMC\\",
\\"created\\": \\"2018-09-02 15:00:00\\",
\\"tweetid\\": 1036267504673337344,
\\"text\\": \\"Protect your vehicle\'s paint! Find out how on this week\'s blog.
\\\\ud83d\\\\udc47\\\\n\\\\nhttps://url/XYMxPhVhdH https://url/mFL2Zv8nWW\\"}'
I can use regex to fix some problems (\\" and \\\\n) but the Unicode is tricky because even if I match it, how do I replace it with a properly escaped character? When I do this in R, using the aws.lambda package, everything is fine, no weird escaped escapes.
What am I doing wrong on my desktop with the response from AWS Lambda that's garbling the data?
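For what it's worth, the payload has effectively been JSON-encoded twice: once by json.dumps in the handler, and once more by Lambda when it serializes the return value. A hedged sketch (my addition, reusing the variables from the invoke snippet above) that undoes both layers:
# Hedged sketch: decode twice because the handler's json.dumps string was
# JSON-encoded again when Lambda serialized the return value.
results = lambda_response['Payload'].read()
tweets = json.loads(json.loads(results.decode('utf-8')))["tweets"]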
Update
The process tweet function is below. It literally just pulls out the bits I care to keep, formats the datetime object to be a string and returns a dictionary.
def process_tweet(tweet):
    bundle = {
        "userid": tweet.user.id,
        "username": tweet.user.screen_name,
        "created": str(tweet.created_at),
        "tweetid": tweet.id,
        "text": tweet.full_text
    }
    return bundle
Just for reference, in R the code looks like this.
payload = list(search="paint%20protection%20film filter:safe")
results = aws.lambda::invoke_function("twitter-searcher"
,payload = jsonlite::toJSON(payload
,auto_unbox=TRUE)
,type = "RequestResponse"
,key = creds$key
,secret = creds$secret
,session_token = creds$session_token
,region = creds$region)
tweets = jsonlite::fromJSON(results)
str(tweets)
#> 'data.frame': 133 obs. of 5 variables:
#> $ userid : num 2231994854 407106716 33553091 7778772 782310 ...
#> $ username: chr "adaniel_080213" "Prestige_AdamL" "exclusivedetail" "tedhu" ...
#> $ created : chr "2018-09-12 14:07:09" "2018-09-12 11:31:56" "2018-09-12 10:46:55" "2018-09-12 07:27:49" ...
#> $ tweetid : num 1039878080968323072 1039839019989983232 1039827690151444480 1039777586975526912 1039699310382931968 ...
#> $ text : chr "I liked a #YouTube video https://url/97sRShN4pM Tesla Model 3 - Front End Package - Suntek Ultra Paint Protection Film" "Another #Corvette #ZO6 full body clearbra wrap completed using #xpeltech ultimate plus PPF ... Paint protection"| __truncated__ "We recently protected this Tesla Model 3 with Paint Protection Film and Ceramic Coating.#teslamodel3 #charlotte"| __truncated__ "Tesla Model 3 - Front End Package - Suntek Ultra Paint Protection Film https://url/AD1cl5dNX3" ...
tweets[131,]
#> userid username created tweetid
#> 131 190764134 CapitalGMC 2018-09-02 15:00:00 1036267504673337344
#> text
#> 131 Protect your vehicle's paint! Find out how on this week's blog.👇\n\nhttps://url/XYMxPhVhdH https://url/mFL2Zv8nWW
In your lambda function you should return a response object with a JSON object in the response body.
# Lambda Function
def get_json(event, context):
    """Retrieve JSON from server."""
    # Business Logic Goes Here.
    response = {
        "statusCode": 200,
        "headers": {},
        "body": json.dumps({
            "message": "This is the message in a JSON object."
        })
    }
    return response
Don't use json.dumps()
I had a similar issue, and when I just returned "body": content instead of "body": json.dumps(content) I could easily access and manipulate my data. Before that, I got that weird form that looks like JSON, but it's not.
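As a hedged sketch of the caller side (my addition, reusing the question's variable names): if the handler returns the dictionary directly instead of a json.dumps string, Lambda serializes it exactly once, and a single json.loads on the desktop is enough.
# Hedged sketch, assuming the handler ends with `return {"tweets": tweets}`:
results = lambda_response['Payload'].read()
tweets = json.loads(results.decode('utf-8'))["tweets"]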

Google Cloud NL entity recognizer grouping words together

When attempting to find the entities in a long input of text, Google Cloud's Natural Language API is grouping words together and then assigning them an incorrect entity type. Here is my program:
def entity_recognizer(nouns):
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/Users/superaitor/Downloads/link"
    text = ""
    for words in nouns:
        text += words + " "
    client = language.LanguageServiceClient()
    if isinstance(text, six.binary_type):
        text = text.decode('utf-8')
    document = types.Document(
        content=text.encode('utf-8'),
        type=enums.Document.Type.PLAIN_TEXT)
    encoding = enums.EncodingType.UTF32
    if sys.maxunicode == 65535:
        encoding = enums.EncodingType.UTF16
    entity = client.analyze_entities(document, encoding).entities
    entity_type = ('UNKNOWN', 'PERSON', 'LOCATION', 'ORGANIZATION',
                   'EVENT', 'WORK_OF_ART', 'CONSUMER_GOOD', 'OTHER')
    for entity in entity:
        #if entity_type[entity.type] is "PERSON":
        print(entity_type[entity.type])
        print(entity.name)
Here nouns is a list of words. I then turn that into a string (I've tried multiple ways of doing so, all give the same result), yet the program spits out output like:
PERSON
liberty secularism etching domain professor lecturer tutor royalty
government adviser commissioner
OTHER
business view society economy
OTHER
business
OTHER
verge industrialization market system custom shift rationality
OTHER
family kingdom life drunkenness college student appearance income family
brink poverty life writer variety attitude capitalism age process
production factory system
Any input on how to fix this?
To analyze entities in a text you can use a sample from the documentation which looks something like this:
import argparse
import sys

from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types
import six

def entities_text(text):
    """Detects entities in the text."""
    client = language.LanguageServiceClient()

    if isinstance(text, six.binary_type):
        text = text.decode('utf-8')

    # Instantiates a plain text document.
    document = types.Document(
        content=text,
        type=enums.Document.Type.PLAIN_TEXT)

    # Detects entities in the document. You can also analyze HTML with:
    #   document.type == enums.Document.Type.HTML
    entities = client.analyze_entities(document).entities

    # entity types from enums.Entity.Type
    entity_type = ('UNKNOWN', 'PERSON', 'LOCATION', 'ORGANIZATION',
                   'EVENT', 'WORK_OF_ART', 'CONSUMER_GOOD', 'OTHER')

    for entity in entities:
        print('=' * 20)
        print(u'{:<16}: {}'.format('name', entity.name))
        print(u'{:<16}: {}'.format('type', entity_type[entity.type]))
        print(u'{:<16}: {}'.format('metadata', entity.metadata))
        print(u'{:<16}: {}'.format('salience', entity.salience))
        print(u'{:<16}: {}'.format('wikipedia_url',
                                   entity.metadata.get('wikipedia_url', '-')))

entities_text("Donald Trump is president of United States of America")
The output of this sample is:
====================
name : Donald Trump
type : PERSON
metadata : <google.protobuf.pyext._message.ScalarMapContainer object at 0x7fd9d0125170>
salience : 0.9564903974533081
wikipedia_url : https://en.wikipedia.org/wiki/Donald_Trump
====================
name : United States of America
type : LOCATION
metadata : <google.protobuf.pyext._message.ScalarMapContainer object at 0x7fd9d01252b0>
salience : 0.04350961744785309
wikipedia_url : https://en.wikipedia.org/wiki/United_States
As you can see in this example, Entity Analysis inspects the given text for known entities (proper nouns such as public figures, landmarks, etc.). It is not going to provide an entity for each word in the text.
Instead of classifying according to entities, I would use Google default categories directly, changing
entity = client.analyze_entities(document, encoding).entities
to
categories = client.classify_text(document).categories
and consequently updating the code. I wrote the following sample code based on this tutorial, further developed on GitHub.
def run_quickstart():
    # [START language_quickstart]
    # Imports the Google Cloud client library
    # [START migration_import]
    from google.cloud import language
    from google.cloud.language import enums
    from google.cloud.language import types
    # [END migration_import]

    # Instantiates a client
    # [START migration_client]
    client = language.LanguageServiceClient()
    # [END migration_client]

    # The text to analyze
    text = u'For its part, India has said it will raise taxes on 29 products imported from the US - including some agricultural goods, steel and iron products - in retaliation for the wide-ranging US tariffs.'
    document = types.Document(
        content=text,
        type=enums.Document.Type.PLAIN_TEXT)

    # Detects the sentiment of the text
    sentiment = client.analyze_sentiment(document=document).document_sentiment

    # Classify content categories
    categories = client.classify_text(document).categories

    # User category feedback
    for category in categories:
        print(u'=' * 20)
        print(u'{:<16}: {}'.format('name', category.name))
        print(u'{:<16}: {}'.format('confidence', category.confidence))

    # User sentiment feedback
    print('Text: {}'.format(text))
    print('Sentiment: {}, {}'.format(sentiment.score, sentiment.magnitude))
    # [END language_quickstart]

if __name__ == '__main__':
    run_quickstart()
Does this solution work for you? If not, why?
