I am running spaCy on a paragraph of text, and it isn't extracting quoted text the same way in each case, and I don't understand why that is.
nlp = spacy.load("en_core_web_lg")
doc = nlp("""A seasoned TV exec, Greenblatt spent eight years as chairman of NBC Entertainment before WarnerMedia. He helped revive the broadcast network's primetime lineup with shows like "The Voice," "This Is Us," and "The Good Place," and pushed the channel to the top of the broadcast-rating ranks with 18-49-year-olds, Variety reported. He also drove Showtime's move into original programming, with series like "Dexter," "Weeds," and "Californication." And he was a key programming exec at Fox Broadcasting in the 1990s.""")
Here's the whole output:
A
seasoned
TV
exec
,
Greenblatt
spent
eight years
as
chairman
of
NBC Entertainment
before
WarnerMedia
.
He
helped
revive
the
broadcast
network
's
primetime
lineup
with
shows
like
"
The Voice
,
"
"
This
Is
Us
,
"
and
"The Good Place
,
"
and
pushed
the
channel
to
the
top
of
the
broadcast
-
rating
ranks
with
18-49-year-olds
,
Variety
reported
.
He
also
drove
Showtime
's
move
into
original
programming
,
with
series
like
"
Dexter
,
"
"
Weeds
,
"
and
"
Californication
.
"
And
he
was
a
key
programming
exec
at
Fox Broadcasting
in
the 1990s
.
The one that bothers me the most is The Good Place, which is extracted as "The Good Place (with the opening quote attached). Since the quotation mark is part of the token, I then can't extract the quoted text with a token Matcher later on… Any idea what's going on here?
The issue isn't the tokenization (which should always split " off in this case), but the NER, which uses a statistical model and doesn't always make 100% perfect predictions.
I don't think you've shown all your code here, but from the output, I would assume you've merged entities by adding merge_entities to the pipeline. These are the resulting tokens after entities are merged, and if an entity wasn't predicted correctly, you'll get slightly incorrect tokens.
I tried the most recent en_core_web_lg and couldn't replicate these NER results, but the models for each version of spaCy give slightly different results. If you haven't already, try v2.2, whose models use some data augmentation techniques to improve the handling of quotes.
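If the goal is just to pull out quoted text with a token Matcher, one workaround is to run the matcher on a pipeline without merge_entities (or even a blank, tokenizer-only pipeline), where the quote characters are always split off as their own tokens. A minimal sketch, using spaCy v3's Matcher.add signature (in v2 the pattern is passed as a positional argument after an on_match callback), with a made-up example sentence:

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only pipeline: no NER, so nothing gets merged and the
# quote characters are reliably separate tokens.
nlp = spacy.blank("en")
doc = nlp('He revived the lineup with shows like "The Good Place," and more.')

matcher = Matcher(nlp.vocab)
# An opening quote, one or more non-quote tokens, then a closing quote.
pattern = [{"ORTH": '"'}, {"ORTH": {"NOT_IN": ['"']}, "OP": "+"}, {"ORTH": '"'}]
matcher.add("QUOTED", [pattern])

for match_id, start, end in matcher(doc):
    # Slice off the quote tokens themselves to get just the quoted text.
    print(doc[start + 1 : end - 1].text)
```

Note this naive pattern treats every `"` as a potential opener, so in text with several quoted spans it can also match the stretch between a closing and the next opening quote; for real data you'd want to pair quotes more carefully.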
Related
I have spent a few days exploring the excellent FARM library and its modular approach to building models. The default output (result), however, is very verbose, including a multiplicity of texts, values, and ASCII artwork. For my research I only require the predicted labels from my NLP text classification model, together with the individual probabilities. How do I do that? I have been experimenting with nested lists/dictionaries but am unable to neatly produce a simple list of output labels and probabilities.
# Test your model on a sample (Inference)
from farm.infer import Inferencer
from pprint import PrettyPrinter
infer_model = Inferencer(processor=processor, model=model, task_type="text_classification", gpu=True)
basic_texts = [
    # a snippet or two from Dickens
    {"text": "Mr Dombey had remained in his own apartment since the death of his wife, absorbed in visions of the youth, education, and destination of his baby son. Something lay at the bottom of his cool heart, colder and heavier than its ordinary load; but it was more a sense of the child’s loss than his own, awakening within him an almost angry sorrow."},
    {"text": "Soon after seven o'clock we went down to dinner, carefully, by Mrs. Jellyby's advice, for the stair-carpets, besides being very deficient in stair-wires, were so torn as to be absolute traps."},
    {"text": "Walter passed out at the door, and was about to close it after him, when, hearing the voices of the brothers again, and also the mention of his own name, he stood irresolutely, with his hand upon the lock, and the door ajar, uncertain whether to return or go away."},
    # from Lewis Carroll
    {"text": "I have kept one for many years, and have found it of the greatest possible service, in many ways: it secures my _answering_ Letters, however long they have to wait; it enables me to refer, for my own guidance, to the details of previous correspondence, though the actual Letters may have been destroyed long ago;"},
    {"text": "The Queen gasped, and sat down: the rapid journey through the air had quite taken away her breath and for a minute or two she could do nothing but hug the little Lily in silence."},
    {"text": "Rub as she could, she could make nothing more of it: she was in a little dark shop, leaning with her elbows on the counter, and opposite to her was an old Sheep, sitting in an arm-chair knitting, and every now and then leaving off to look at her through a great pair of spectacles."},
    # G K Chesterton
    {"text": "Basil and I walked rapidly to the window which looked out on the garden. It was a small and somewhat smug suburban garden; the flower beds a little too neat and like the pattern of a coloured carpet; but on this shining and opulent summer day even they had the exuberance of something natural, I had almost said tropical."},
    {"text": "This is the whole danger of our time. There is a difference between the oppression which has been too common in the past and the oppression which seems only too probable in the future."},
    {"text": "But whatever else the worst doctrine of depravity may have been, it was a product of spiritual conviction; it had nothing to do with remote physical origins. Men thought mankind wicked because they felt wicked themselves."},
]
result = infer_model.inference_from_dicts(dicts=basic_texts)
PrettyPrinter().pprint(result)
#print(result)
All logging (incl. the ASCII artwork) is done in FARM via Python's logging framework. You can simply disable the logs up to a certain level like this at the beginning of your script:
import logging
logging.disable(logging.ERROR)
Is that what you are looking for or do you rather want to adjust the output format of the model predictions? If you only need label and probability, you could do something like this:
...
basic_texts = [
    {"text": "Stackoverflow is a great community"},
    {"text": "It's snowing"},
]
infer_model = Inferencer(processor=processor, model=model, task_type="text_classification", gpu=True)
result = infer_model.inference_from_dicts(dicts=basic_texts)

minimal_results = []
for sample in result:
    # Only extract the top prediction per sample
    top_pred = sample["predictions"][0]
    minimal_results.append({"label": top_pred["label"], "probability": top_pred["probability"]})

PrettyPrinter().pprint(minimal_results)
infer_model.close_multiprocessing_pool()
(I left out the initial model loading etc. - see this example for more details)
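For anyone without a trained model handy, the filtering above only depends on the shape of the result: a list of dicts, each holding a "predictions" list sorted best-first. A standalone sketch with a mocked result (the structure is my reading of FARM's output; the labels and probabilities below are made up):

```python
from pprint import PrettyPrinter

# Mocked Inferencer output in the assumed shape: one dict per input
# text, with predictions sorted by probability, best first.
result = [
    {"predictions": [{"label": "Dickens", "probability": 0.91},
                     {"label": "Carroll", "probability": 0.09}]},
    {"predictions": [{"label": "Chesterton", "probability": 0.77},
                     {"label": "Dickens", "probability": 0.23}]},
]

# Keep only the top label and probability per sample.
minimal_results = [
    {"label": sample["predictions"][0]["label"],
     "probability": sample["predictions"][0]["probability"]}
    for sample in result
]
PrettyPrinter().pprint(minimal_results)
```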
I'm dealing with Victor Hugo's well-known novel "Les Misérables".
A part of my project is to detect the occurrences of each of the novel's characters in a sentence and count them. This can be done easily by something like this:
from collections import OrderedDict

def character_frequency(character_per_sentences_dict):
    characters_frequency = OrderedDict()
    for character, sentences in character_per_sentences_dict.items():
        if len(sentences) != 0:
            characters_frequency[character] = len(sentences)
    return characters_frequency
This piece of code works well for all of the characters except "Èponine".
I also read the text with the following piece of code:
from nltk.tokenize import sent_tokenize

with open(path_to_volume + '.txt', 'r', encoding='latin1') as fp:
    novel = ' '.join(fp.readlines())

# Tokenize sentences and calculate the number of sentences
sentences = sent_tokenize(novel)
num_volume = path_to_volume.split("-v")[-1]
I should add that the spelling of "Èponine" is the same everywhere.
Any idea what's going on?
Here is a sample in which this name appears:
" ONE SHOULD ALWAYS BEGIN BY ARRESTING THE VICTIMS
At nightfall, Javert had posted his men and had gone into ambush himself between the trees of the Rue de la Barrieredes-Gobelins which faced the Gorbeau house, on the other side of the boulevard. He had begun operations by opening his pockets, and dropping into it the two young girls who were charged with keeping a watch on the approaches to the den. But he had only caged Azelma. As for Èponine, she was not at her post, she had disappeared, and he had not been able to seize her. Then Javert had made a point and had bent his ear to waiting for the signal agreed upon. The comings and goings of the fiacres had greatly agitated him. At last, he had grown impatient, and, sure that there was a nest there, sure of being in luck, having recognized many of the ruffians who had entered, he had finally decided to go upstairs without waiting for the pistol-shot."
I agree with @BoarGules that there is likely a more efficient and effective way to approach this problem. With that said, I'm not sure what your problem is here. Python fully supports Unicode: you can "just do it" in terms of using Unicode in your program logic with Python's standard string operations and libraries.
For example, this works:
#!/usr/bin/env python
import requests

names = [
    u'Éponine',
    u'Cosette',
]

# Retrieve Les Misérables from Project Gutenberg
t = requests.get("http://www.gutenberg.org/files/135/135-0.txt").text

for name in names:
    c = t.count(name)
    print("{}: {}".format(name, c))
Results:
Éponine: 81
Cosette: 1004
I obviously don't have the text you have, so I don't know whether the problem is how it is encoded or how it is being read; I can't test that without having it. In this code, I fetch the source text from the internet. My point is just that non-ASCII characters should not pose any impediment to you as long as your inputs are reasonable.
Virtually all of the runtime here is spent retrieving the text; even if you added dozens of names, the counting wouldn't add a noticeable delay on any decent computer. So this method works just fine.
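One concrete thing to check, since the question's code opens the file with encoding='latin1': if the file is actually UTF-8, every accented character is silently mangled on read, so t.count() finds zero occurrences even though the spelling looks identical on screen. A small demonstration (it's also worth double-checking whether the text uses É with an acute accent rather than È with a grave, since those are distinct characters):

```python
name = "Éponine"

# What 'Éponine' looks like on disk when the file is saved as UTF-8:
raw = name.encode("utf-8")

# Reading those bytes back with the wrong codec does not raise an
# error; it just produces a different string that will never match.
mangled = raw.decode("latin-1")

print(repr(mangled))        # two junk characters where 'É' should be
print(raw.decode("utf-8"))  # correct round-trip with the right codec
```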
I'm very new to spaCy. I have been reading the documentation for hours and I'm still not sure whether what I'm asking is possible. Anyway...
As the title says, is there a way to get a given noun chunk from a token it contains? For example, given the sentence:
"Autonomous cars shift insurance liability toward manufacturers"
would it be possible to get the "autonomous cars" noun chunk when all I have is the "cars" token? Here is an example snippet of the scenario that I'm going for.
startingSentence = "Autonomous cars and magic wands shift insurance liability toward manufacturers"
doc = nlp(startingSentence)
noun_chunks = doc.noun_chunks
for token in doc:
    if token.dep_ == "dobj":
        print(token)  # this will print "liability"
        # Is it possible to get the "insurance liability" noun chunk from here?
Any help will be greatly appreciated. Thanks!
You can easily find the noun chunk that contains the token you've identified by checking if the token is in one of the noun chunk spans:
doc = nlp("Autonomous cars and magic wands shift insurance liability toward manufacturers")
interesting_token = doc[7]  # or however you identify the token you want
for noun_chunk in doc.noun_chunks:
    if interesting_token in noun_chunk:
        print(noun_chunk)
The output is not correct with en_core_web_sm and spaCy 2.0.18, because "shift" isn't identified as a verb, so you get:
magic wands shift insurance liability
With en_core_web_md, it's correct:
insurance liability
(It makes sense for the documentation (https://spacy.io/usage/linguistic-features#noun-chunks) to include examples with real ambiguities, because that's a realistic scenario, but it's confusing for new users when the examples are ambiguous enough that the analysis is unstable across versions/models.)
As part of an Information Retrieval project in Python (building a mini search engine), I want to keep the clean text from downloaded tweets (a .csv data set of 27,000 tweets, to be exact). A tweet will look like:
"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —#POTUS https://twitter.com/OZRd5o4wRL
or
"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —#POTUS in Greece https://twitter.com/PIO9dG2qjX
I want, using regex, to remove the unnecessary parts of the tweets, like URLs, punctuation, and so on.
So the result will be:
"The basic longing to live with dignity these yearnings are universal They burn in every human heart POTUS"
and
"Democracy allows us to peacefully work through our differences and move closer to our ideals POTUS in Greece"
I tried this: pattern = RegexpTokenizer(r'[A-Za-z]+|^[0-9]'), but it doesn't do a perfect job; parts of the URL, for example, are still present in the result.
Please help me find a regex pattern that will do what I want.
This might help.
Demo:
import re

s1 = """"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —#POTUS in Greece https://twitter.com/PIO9dG2qjX"""
s2 = """"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —#POTUS https://twitter.com/OZRd5o4wRL"""

def cleanString(text):
    res = []
    for word in text.strip().split():
        if not re.search(r"(https?)", word):  # skip URLs; note this only works if "http"/"https" appears in the word
            res.append(re.sub(r"[^A-Za-z\.]", "", word).replace(".", " "))  # strip everything that is not a letter
    return " ".join(map(str.strip, res))

print(cleanString(s1))
print(cleanString(s2))
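If you'd rather do the whole thing with plain re substitutions instead of token-by-token filtering, here is a sketch that produces exactly the outputs asked for above. The order of the substitutions matters: URLs must be removed before the general punctuation strip, or their remnants survive as stray words.

```python
import re

def clean_tweet(text):
    text = re.sub(r"https?://\S+", "", text)  # remove URLs first, while they are still intact
    text = re.sub(r"[#@]", "", text)          # drop hashtag/mention markers but keep the word
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # every other non-letter (digits, quotes, dashes) becomes a space
    return " ".join(text.split())             # collapse runs of whitespace

tweet = ('"Democracy...allows us to peacefully work through our differences, '
         'and move closer to our ideals" \u2014#POTUS in Greece '
         'https://twitter.com/PIO9dG2qjX')
print(clean_tweet(tweet))
```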
This is a query result from a dataset of articles, ordered by article title ascending with LIMIT 10 in MySQL. The collation is set to utf8_unicode_ci.
'GTA 5' Sells $800 Million In One Day
'Infinity Blade III' hits the App Store ahead of i...
‘Have you lurked her texts?’ How the directors of ...
‘Second Moon’ by Katie Paterson now on a journey a...
"Do not track" effort in trouble as Digital Advert...
"Forbes": Bill Gates wciąż najbogatszym obywatelem...
"Here Is Something False: You Only Live Once"
“That's The Dumbest Thing I've Ever Heard Of.”
[Introduction to Special Issue] The Future Is Now
1 Great Dividend You Can Buy Right Now
I thought ordering worked by the position of each character in the encoding table: ' is 39 and " is 34 in Unicode, but the apostrophe ʼ and double quote “ have much higher code points. By my understanding, ʼ and “ shouldn't make it into the result at all, and " should be at the top. I'm clearly missing something here.
My goal is to order this data by title in Python and get the same result as when the data is ordered in MySQL.
The gist of it is that, to get better sort orders, MySQL uses the Unicode Collation Algorithm, which would (probably) treat “ like " and ‘ like ' when sorting.
Unfortunately this is not simple to emulate in Python as the algorithm is non-trivial. You can look for a wrapper library like PyICU to do the hard work, although I've no guarantees they'll work.
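To see why plain sorted() can't reproduce the MySQL order: Python compares strings by raw code point, so the typographic quotes (U+2018, U+201C) sort after every ASCII character rather than next to their straight counterparts. A quick check, using titles abbreviated from the result above:

```python
titles = [
    "\u2018Have you lurked her texts?\u2019 How the directors of ...",
    "'GTA 5' Sells $800 Million In One Day",
    '"Do not track" effort in trouble as Digital Advert...',
]

# Python's default sort is by code point, so the order of the leading
# characters is " (U+0022) < ' (U+0027) < LEFT SINGLE QUOTE (U+2018),
# which does not match the collated MySQL result.
for t in sorted(titles):
    print(hex(ord(t[0])), t)
```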