Working with span objects. [spaCy, python] - python

I'm not sure if this is a really dumb question, but here goes.
text_corpus = '''Insurance bosses plead guilty\n\nAnother three US insurance executives have pleaded guilty to fraud charges stemming from an ongoing investigation into industry malpractice.\n\nTwo executives from American International Group (AIG) and one from Marsh & McLennan were the latest. The investigation by New York attorney general Eliot Spitzer has now obtained nine guilty pleas. The highest ranking executive pleading guilty on Tuesday was former Marsh senior vice president Joshua Bewlay.\n\nHe admitted one felony count of scheming to defraud and faces up to four years in prison. A Marsh spokeswoman said Mr Bewlay was no longer with the company. Mr Spitzer\'s investigation of the US insurance industry looked at whether companies rigged bids and fixed prices. Last month Marsh agreed to pay $850m (£415m) to settle a lawsuit filed by Mr Spitzer, but under the settlement it "neither admits nor denies the allegations".\n'''
import spacy

def get_entities(document_text, model):
    analyzed_doc = model(document_text)
    entities = [entity for entity in analyzed_doc.ents if entity.label_ in ["PER", "ORG", "LOC", "GPE"]]
    return entities

model = spacy.load("en_core_web_sm")
entities_1 = get_entities(text_corpus, model)
entities_2 = get_entities(text_corpus, model)
but when I run the following,
entities_1[0] in entities_2
The output is False.
Why is that? The objects in both entity lists are the same, yet an item from one list is not in the other. That seems extremely odd. Can someone please explain why that is?

This is due to the way ents are represented in spaCy. They are classes with specific implementations, so even entities_2[0] == entities_1[0] will evaluate to False. By the looks of it, the Span class does not have an implementation of __eq__, which, at first glance at least, is the simple reason why.
If you print out the value of entities_2[0] it will give you US, but this is simply because the Span class has a __repr__ method implemented in the same file. If you want to do a boolean comparison, one way would be to use the text property of Span and do something like:
entities_1[0].text in [e.text for e in entities_2]
Edit:
As @abb pointed out, Span does implement __richcmp__; however, this is only applicable to the same instance of Span, since it checks the position of the token itself.
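If you need a stricter comparison than just the text, a minimal sketch (assuming the entities_1 and entities_2 lists from the question) is to compare spans by their character offsets and label rather than by object identity:
# Sketch: compare Span objects from two separately processed Docs by their
# attributes instead of object identity (entities_1/entities_2 as in the question).
def span_key(span):
    # character offsets plus entity label identify the mention
    return (span.start_char, span.end_char, span.label_)

keys_2 = {span_key(e) for e in entities_2}
print(span_key(entities_1[0]) in keys_2)  # True when the same mention occurs in both lists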

Related

Python package to extract sentence from a textfile based on keyword

I need a Python package that can extract the relevant sentence from a text, based on the keywords provided.
For example, below is the Wikipedia page of J.J Oppenheimer -
Early life
Childhood and education
J. Robert Oppenheimer was born in New York City on April 22, 1904,[note 1][7] to Julius Oppenheimer, a wealthy Jewish textile importer who had immigrated to the United States from Germany in 1888, and Ella Friedman, a painter.
Julius came to the United States with no money, no baccalaureate studies, and no knowledge of the English language. He got a job in a textile company and within a decade was an executive with the company. Ella was from Baltimore.[8] The Oppenheimer were non-observant Ashkenazi Jews.[9]
The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico.
Oppenheimer later remarked that it brought to mind words from the Bhagavad Gita: "Now I am become Death, the destroyer of worlds."
If my passed string is - "JJ Oppenheimer birth date", it should return "J. Robert Oppenheimer was born in New York City on April 22, 1904"
If my passed string is - "JJ Openheimer Trinity test", it should return "The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico"
I tried searching a lot but nothing comes close to what I want, and I don't know much about NLP vectorization techniques. It would be great if someone could suggest a package if they know of one (or if one exists).
You could use fuzzywuzzy.
fuzz.ratio(search_text, sentence).
This gives you a score of how similar two strings are.
https://github.com/seatgeek/fuzzywuzzy
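A minimal sketch of that idea, assuming the article text is in a variable named text and using a naive sentence split (fuzz.token_set_ratio may work better than fuzz.ratio for short keyword queries):
# Sketch: rank sentences by fuzzy similarity to the query.
# Assumes `text` holds the article and fuzzywuzzy is installed.
from fuzzywuzzy import fuzz

query = "JJ Oppenheimer birth date"
sentences = [s.strip() for s in text.split(".") if s.strip()]  # naive sentence split
best = max(sentences, key=lambda s: fuzz.ratio(query, s))
print(best)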
I am pretty sure a module exists that could do this for you, but you could also try to make it yourself by parsing through the text and creating keyword lists like ["date of birth", "born", "birth date", etc.], and doing this for multiple fields. This would allow you to find the information that is available.
The idea is:
you grab your text (or whatever you have),
you grab what you are looking for (for example, date of birth),
you then assign "date of birth" to a list of similar words,
you look through your file to see if you find a sentence that contains one of them.
I am pretty sure there is no ready-made module for this, maybe I am wrong, but something like this should work.
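A rough sketch of that idea (the field names and keyword lists below are made-up examples):
# Sketch: keyword groups per field, then scan sentences for any of them.
import re

field_keywords = {
    "birth": ["born", "birth date", "date of birth"],
    "trinity test": ["trinity test", "atomic bomb", "detonated"],
}

def find_sentences(text, field):
    sentences = re.split(r"(?<=[.!?])\s+", text)
    keywords = field_keywords[field]
    return [s for s in sentences if any(k in s.lower() for k in keywords)]

# find_sentences(article_text, "birth") -> sentences mentioning the birth date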
The task you describe looks like Information Retrieval: given a query (the keywords), the model should return a list of documents (the sentences) that best match the query.
This is essentially what the answer using fuzzywuzzy suggests. But maybe just counting the number of occurrences of the query words in each sentence is enough (and more efficient).
The next step would be to use Tf-Idf. It is a weighting scheme that gives high scores to words that are specific to a document relative to the rest of the corpus.
This results in every document having an associated vector, and you can then sort the documents according to their similarity to the query vector (see this SO answer for how to do that).
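A minimal sketch of that pipeline with scikit-learn (the sentence split and the query string are illustrative assumptions):
# Sketch: rank sentences by TF-IDF cosine similarity to the query.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = re.split(r"(?<=[.!?])\s+", article_text)  # article_text assumed to hold the page text
query = "JJ Oppenheimer birth date"

vectorizer = TfidfVectorizer()
sentence_vectors = vectorizer.fit_transform(sentences)  # one TF-IDF vector per sentence
query_vector = vectorizer.transform([query])

scores = cosine_similarity(query_vector, sentence_vectors)[0]
print(sentences[scores.argmax()])  # best-matching sentence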

Extract probabilities and labels from FARM TextClassification

I have spent a few days exploring the excellent FARM library and its modular approach to building models. The default output (result) however is very verbose, including a multiplicity of texts, values and ASCII artwork. For my research I only require the predicted labels from my NLP text classification model, together with the individual probabilities. How do I do that? I have been experimenting with nested lists/dictionaries but am unable to neatly produce a simple list of output labels and probabilities.
# Test your model on a sample (Inference)
from farm.infer import Inferencer
from pprint import PrettyPrinter
infer_model = Inferencer(processor=processor, model=model, task_type="text_classification", gpu=True)
basic_texts = [
# a snippet or two from Dickens
{"text": "Mr Dombey had remained in his own apartment since the death of his wife, absorbed in visions of the youth, education, and destination of his baby son. Something lay at the bottom of his cool heart, colder and heavier than its ordinary load; but it was more a sense of the child’s loss than his own, awakening within him an almost angry sorrow."},
{"text": "Soon after seven o'clock we went down to dinner, carefully, by Mrs. Jellyby's advice, for the stair-carpets, besides being very deficient in stair-wires, were so torn as to be absolute traps."},
{"text": "Walter passed out at the door, and was about to close it after him, when, hearing the voices of the brothers again, and also the mention of his own name, he stood irresolutely, with his hand upon the lock, and the door ajar, uncertain whether to return or go away."},
# from Lewis Carroll
{"text": "I have kept one for many years, and have found it of the greatest possible service, in many ways: it secures my _answering_ Letters, however long they have to wait; it enables me to refer, for my own guidance, to the details of previous correspondence, though the actual Letters may have been destroyed long ago;"},
{"text": "The Queen gasped, and sat down: the rapid journey through the air had quite taken away her breath and for a minute or two she could do nothing but hug the little Lily in silence."},
{"text": "Rub as she could, she could make nothing more of it: she was in a little dark shop, leaning with her elbows on the counter, and opposite to her was an old Sheep, sitting in an arm-chair knitting, and every now and then leaving off to look at her through a great pair of spectacles."},
# G K Chesterton
{"text": "Basil and I walked rapidly to the window which looked out on the garden. It was a small and somewhat smug suburban garden; the flower beds a little too neat and like the pattern of a coloured carpet; but on this shining and opulent summer day even they had the exuberance of something natural, I had almost said tropical. "},
{"text": "This is the whole danger of our time. There is a difference between the oppression which has been too common in the past and the oppression which seems only too probable in the future."},
{"text": "But whatever else the worst doctrine of depravity may have been, it was a product of spiritual conviction; it had nothing to do with remote physical origins. Men thought mankind wicked because they felt wicked themselves. "},
]
result = infer_model.inference_from_dicts(dicts=basic_texts)
PrettyPrinter().pprint(result)
#print(result)
All logging (incl. the ASCII artwork) is done in FARM via Python's logging framework. You can simply disable the logs up to a certain level like this at the beginning of your script:
import logging
logging.disable(logging.ERROR)
Is that what you are looking for, or would you rather adjust the output format of the model predictions? If you only need the label and probability, you could do something like this:
...
basic_texts = [
    {"text": "Stackoverflow is a great community"},
    {"text": "It's snowing"},
]
infer_model = Inferencer(processor=processor, model=model, task_type="text_classification", gpu=True)
result = infer_model.inference_from_dicts(dicts=basic_texts)

minimal_results = []
for sample in result:
    # Only extract the top 1 prediction per sample
    top_pred = sample["predictions"][0]
    minimal_results.append({"label": top_pred["label"], "probability": top_pred["probability"]})

PrettyPrinter().pprint(minimal_results)
infer_model.close_multiprocessing_pool()
(I left out the initial model loading etc. - see this example for more details)

Segmentation Error using python nested list in Cython cdef

I am trying to implement spaCy tokenization through Cython for speed optimization. The input is a list of texts, and the expected output is a nested 2-d list where each sublist is a list of tokens. I refer to the post at https://github.com/huggingface/100-times-faster-nlp/blob/master/100-times-faster-nlp-in-python.ipynb for implementation.
# cython: infer_types=True
import re

cimport numpy
from cpython cimport *
from cymem.cymem cimport Pool
from spacy.tokens.doc cimport Doc
from spacy.lexeme cimport Lexeme
from spacy.structs cimport TokenC
from spacy.typedefs cimport hash_t
from spacy.attrs cimport IS_SPACE, IS_PUNCT, LIKE_NUM

cdef struct DocElement:
    TokenC* c
    int length

cdef list tokenize(DocElement* docs, int n_docs):
    cdef list out = []
    cdef int i
    cdef list sub_out
    for doc in docs[:n_docs]:
        sub_out = []
        for c in doc.c[:doc.length]:
            if (not Lexeme.c_check_flag(c.lex, IS_SPACE) and not Lexeme.c_check_flag(c.lex, IS_PUNCT)
                    and not Lexeme.c_check_flag(c.lex, LIKE_NUM)):
                sub_out.append(c.lex.lower)
        out.append(sub_out)
    return out

def tokenization(text_ls, stopwords, tokenizer):
    cdef int i, n_out, n_docs = len(text_ls)
    cdef Pool mem = Pool()
    cdef DocElement* docs = <DocElement*>mem.alloc(n_docs, sizeof(DocElement))
    cdef Doc doc
    text_ls = [re.sub('[^\\w ]+', ' ', text).lower() if text else '' for text in text_ls]
    text_ls = tokenizer.tokenizer.pipe(text_ls)
    for i, doc in enumerate(text_ls):
        docs[i].c = doc.c
        docs[i].length = (<Doc>doc).length
    stops = set()
    for word in stopwords:
        stops.add(tokenizer.vocab.strings[word])
    out = tokenize(docs, n_docs)
    out = [[tokenizer.vocab.strings[item] for item in subset] for subset in out]
    out = [[item for item in subset if item not in stops] for subset in out]
    return out
I use setup.py to build this package with C++ as the language. My concern with this design is that, in the tokenize function, which takes C-struct data as input, the declared output is a Python list. I tested it with a single text and it works:
import spacy
text = 'SINGAPORE - Protecting the jobs of Singaporeans and ensuring the survival of businesses will be the Government\'s primary focus, said Trade and Industry Minister Chan Chun Sing, as the country hunkers down for what could be a protracted battle with the novel coronavirus.\n\n"I would like to reassure Singaporean businesses and workers that we stand together with them. We do have the means to help them tide over this difficult moment but we must do this with a long-term perspective," said Mr Chan.\n\nThe impact of the Wuhan virus could be "wider, deeper and longer" than that of the severe acute respiratory syndrome (Sars) epidemic in 2003, and Singaporeans need to be mentally prepared for this, he said, adding that measures put in place must be sustainable.\n\n\nHe was speaking to reporters after visiting Oasia Hotel Downtown with Manpower Minister Josephine Teo, where they inspected precautionary measures put in place by the hotel after a hotel guest was found to have come down with the virus.\n\nSingapore\'s 13th case of the novel coronavirus, a 73-year-old female Chinese national, had stayed there.\n\nMr Chan\'s comments echoed those he made earlier in the day at a Chinese New Year lunch for residents of Tanjong Pagar GRC and Radin Mas constituency.\n\n\nIn that speech, Mr Chan called on Singaporeans to gird themselves "psychologically, emotionally, economically and socially" as the battle with the virus could be one for the long haul.\n\nGet Wuhan virus alerts\nReceive e-mail updates and top stories from The Straits Times.\n\nEnter your e-mail\n Sign up\nBy signing up, you agree to our Privacy Policy and Terms and Conditions.\n\nPrevious epidemics have lasted from a few months to a year, but they have had wide implications, disrupting global supply chains and affecting industries from tourism to manufacturing.\n\n"Because we don\'t know how long this situation will last, all the measures we take, be it in health, or economics and jobs... must be sustainable. We cannot just be taking measures for the short haul, thinking that it will blow over," said Mr Chan.\n\nThe novel coronavirus, which first emerged in the Chinese city of Wuhan in December last year, has so far proved to be more infectious than Sars.\n\nIt seems, however, to be less deadly, with a fatality rate of 2 to 3 per cent in China, said Mr Chan. On the other hand, Sars had a fatality rate of about 9.6 per cent.\n\nRelated Story\nWuhan virus: Get latest updates\nRelated Story\nInteractive: What we know so far about the Wuhan virus\nRelated Story\nWuhan virus: 15 people refused entry into Singapore following new travel restrictions\nChina has been grappling with containing the infectious virus, which has sickened thousands and killed over 300 people. 
So far 18 people, including two Singaporeans, have been found infected by the virus here.\n\nLater at the hotel, while Mr Chan said it was still too early to to put a number to the economic hit from the outbreak, he said the Government would be taking several measures with immediate effect to help tourism businesses mitigate the impact.\n\nIt will waive licence fees for hotels, travel agents and tourist guides, as well as defray the cleaning and disinfection costs of hotels that had confirmed and suspected cases of the novel coronavirus.\n\nThis initial package is part of a full raft of measures that will be detailed by Finance Minister Heng Swee Keat at the upcoming Budget speech on Feb 18.\n\nHotel operators typically have to pay between $300 and $500 to renew their licences yearly, depending on the number of rooms each hotel has.\n\nThose hotels where suspected and confirmed cases of the virus had been found, have also had to do enhanced environmental cleaning and disinfection - the Singapore Tourism Board will bear up to half the cost of such cleaning fees.'
tokenizer = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'ner', 'textcat'])
spacy_test.tokenization([text], {}, tokenizer)
However, when I use a long list of texts (this text replicated 50 times), it produces a segmentation fault. If I replicate it even more, it gives me "KeyError: [E018] Can't retrieve string for hash '10283392'", which is likely a spaCy string-store error. I also tried using PyObject instead of the direct list declaration in the tokenize function, but that does not solve it. I am new to Cython and have only written C extensions before; what would be the best way to write such functions?

Want to extract text from a text or pdf file as different paragraphs

Check the following text piece
IN THE HIGH COURT OF GUJARAT AT AHMEDABAD
R/CRIMINAL APPEAL NO. 251 of 2009
FOR APPROVAL AND SIGNATURE:
HONOURABLE MR.JUSTICE R.P.DHOLARIA
==========================================================
1 Whether Reporters of Local Papers may be allowed to see the judgment ?
2 To be referred to the Reporter or not ?
3 Whether their Lordships wish to see the fair copy of the judgment ?
4 Whether this case involves a substantial question of law as to the interpretation of the Constitution of India or any order made thereunder ?
========================================================== STATE OF GUJARAT,S M RAO,FOOD INSPECTOR,OFFICE OF THE Versus DHARMESHBHAI NARHARIBHAI GANDHI ========================================================== Appearance: MS HB PUNANI, APP (2) for the Appellant(s) No. 1 MR DK MODI(1317) for the Opponent(s)/Respondent(s) No. 1 ==========================================================
CORAM: HONOURABLE MR.JUSTICE R.P.DHOLARIA
Date : 12/03/2019
ORAL JUDGMENT
1. The appellant State of Gujarat has
preferred the present appeal under section 378(1)
(3) of the Code of Criminal Procedure, 1973
against the judgment and order of acquittal dated
Page 1 of 12
R/CR.A/251/2009 JUDGMENT
17.11.2008 rendered by learned 2nd Additional
Civil Judge and Judicial Magistrate, First Class,
Nadiad in Food Case No.1 of 2007.
The short facts giving rise to the
present appeal are that on 10.11.2006 at about
18.00 hours, the complainant visited the place of
the respondent accused situated at Juna
Makhanpura, Rabarivad, Nadiad along with panch
witness and the respondent was found dealing in
provisional items. The complainant identified
himself as a Food Inspector and after giving
intimation in Form No.6 has purchased muddamal
sample of mustard seeds in the presence of the
panchas for the purpose of analysis. Thereafter,
the complainant Food Inspector has divided the
said sample in equal three parts and after
completing formalities of packing and sealing
obtained signatures of the vendor and panchas and
out of the said three parts, one part was sent to
the Public Analyst, Vadodara for analysis and
remaining two parts were sent to the Local Health
Authority, Gandhinagar. Thereafter, the Public
Analyst forwarded his report. In the said report,
it is stated that the muddamal sample of mustard
seeds is misbranded which is in breach of the
provisions of the Food Adulteration Act, 1954
(for short “the Act”) and the Rules framed
thereunder. It is alleged that, therefore, the
sample of mustard seeds was misbranded and,
thereby, the accused has committed the offence.
**Page 2 of 12
R/CR.A/251/2009* JUDGMENT*
Hence, the complaint came to be lodged against
the respondent accused.
I want to write a program that follows the constraints below. Keep in mind that this is only a single file; I have about 40k files and it should run on all of them. The files differ slightly, but the basic format is the same for every file.
Constraints.
It should start the text extraction after the "metadata". The metadata is the information at the start of the file, i.e. from "In the high court of gujarat" until "Oral Judgment". In all the files I have, there are various numbered POINTS after that string ends, and I need each of these points as a separate paragraph (the sample text has 2 points; I need them as different paragraphs).
Check the lines in italics; these are the page markers in the text/PDF file. I need to remove these, as they do not add any meaning to the text content I want.
The files are available in both TEXT and PDF format, so I can use either. But I am new to Python, so I don't know how and where to start. I only have basic knowledge of Python.
This data is going to be turned into a "corpus" for further processing to build a large expert system, so I hope it is clear what needs to be done.
Read the official python docs!
Start with python's basic str type and its methods. One of its methods, find, will find substrings in your text.
Use the python slicing notation to extract the portion of text you need, e.g.
text = """YOUR TEXT HERE..."""
meta_start = 'In the high court of gujarat'
meta_end = 'ORAL JUDGMENT'
pos1 = text.find(meta_start)
pos2 = text.find(meta_end)
if pos2 > pos1 and pos1 > -1:
# text is found, extract it
text1 = text[meta_start + len(meta_start):meta_end - 1]
After that, you can go ahead and save your extracted text to a database.
Of course, a better and more complicated solution would be to use regular expressions, but that's another story -- try finding the right way for yourself!
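For example, here is a rough sketch with the re module; the marker string and the page-marker patterns are assumptions based on the sample file above:
# Sketch: skip the metadata and drop page-marker lines with regular expressions.
import re

# keep everything after the metadata block
match = re.search(r'ORAL JUDGMENT(.*)', text, flags=re.DOTALL | re.IGNORECASE)
if match:
    body = match.group(1)
    # drop page-marker lines such as "Page 1 of 12" or "R/CR.A/251/2009 JUDGMENT"
    cleaned_lines = [line for line in body.splitlines()
                     if not re.match(r'\s*\**(Page \d+ of \d+|R/CR\.A/\S+ JUDGMENT)', line)]
    body = "\n".join(cleaned_lines)
    # split the remaining text into paragraphs on blank lines
    paragraphs = [p.strip() for p in re.split(r'\n\s*\n', body) if p.strip()]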
As to italics and other text formatting, you won't ever be able to mark it out in plain text (unless you have some 'meta' markers, like e.g. [i] tags).

Matching company names in the news data using Python

I have a news dataset which contains almost 10,000 news items from the last 3 years.
I also have a list of companies (company names) which are registered on the NYSE. Now I want to check whether the company names in the list appear in the news dataset or not.
Example:
company Name: 'E.I. du Pont de Nemours and Company'
News: 'Monsanto and DuPont settle major disputes with broad patent-licensing deal, with DuPont agreeing to pay at least $1.75 billion over 10 years for rights to technology for herbicide-resistant soybeans.'
Now, I can find the news items that contain a company name when the exact company name appears in the news, but as you can see from the above example, that is not always the case.
I also tried another way: I took the distinctive word in the company's full name, i.e. in the above example 'Pont' is a word which should definitely be part of the text whenever this company is mentioned. This worked the majority of the time, but then a problem occurs in the following example:
Company Name: Ennis, Inc.
News: L D`ennis` Kozlowski, former chief executive convicted of looting nearly $100 million from Tyco International, has emerged into far more modest life after serving six-and-a-half year sentence and probation; Kozlowski, who became ultimate symbol of corporate greed in era that included scandals at Enron and WorldCom, describes his personal transformation and more humble pleasures that have replaced his once high-flying lifestyle.
Now you can see that Ennis matches Dennis in the text, so it gives irrelevant news results.
Can someone suggest the right way of doing this? Thanks.
Use a regex with word boundaries for exact matches. Whether you choose the full name or some partial string you think is unique is up to you, but with word boundaries, D'ennis' won't match Ennis:
import re

companies = ["name1", "name2", ...]
companies_re = re.compile(r"|".join([r"\b{}\b".format(name) for name in companies]))
Depending on how many matches you expect per news item, you may want to use companies_re.search(article) or companies_re.findall(article).
Also for case insensitive matches pass re.I to compile.
If the only line you want to check is always the one starting with "company Name:", you can narrow down the search:
for line in all_lines:
    if line.startswith("company Name:"):
        name = companies_re.search(line)
        if name:
            ...
        break
It sounds like you need the Aho-Corasick algorithm. There is a nice and fast implementation for python here: https://pypi.python.org/pypi/pyahocorasick/
It will only do exact matching, so you would need to index both "Du pont" and "Dupont", for example. But that's not too hard, you can use the Wikidata to help you find aliases: for example, look at the aliases of Dupont's entry: it includes both "Dupont" and "Du pont".
Ok so let's assume you have the list of company names with their aliases:
import ahocorasick

A = ahocorasick.Automaton()
companies = ["google", "apple", "tesla", "dupont", "du pont"]
for idx, key in enumerate(companies):
    A.add_word(key, idx)
Next, make the automaton (see the link above for details on the algorithm):
A.make_automaton()
Great! Now you can simply search for all companies in some text:
your_text = """
I love my Apple iPhone. Do you know what a Googleplex is?
I ate some apples this morning.
"""
for end_index, idx in A.iter(your_text.lower()):
    print(end_index, companies[idx])
This is the output:
15 apple
49 google
74 apple
The numbers correspond to the index of the last character of the company name in the text.
Easy, right? And super fast, this algorithm is used by some variants of GNU grep.
Saving/loading the automaton
If there are a lot of company names, creating the automaton may take some time, so you may want to create it just once, save it to disk (using pickle), then load it every time you need it:
# create_company_automaton.py
# ... create the automaton (see above)
import pickle
pickle.dump(A, open('company_automaton.pickle', 'wb'))
In the program that will use this automaton, you start by loading the automaton:
# use_company_automaton.py
import ahocorasick
import pickle
A = pickle.load(open("company_automaton.pickle", "rb"))
# ... use the automaton
Hope this helps! :)
Bonus details
If you want to match "Apple" in "Apple releases a new iPhone" but not in "I ate an apple this morning", you are going to have a hard time. But it is doable: for example, you could gather a set of articles containing the word "apple" and about the company, and a set of articles not about the company, then identify words (or n-grams) that are more likely when it's about the company (e.g. "iPhone"). Unfortunately you would need to do this for every company whose name is ambiguous.
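A very rough sketch of that idea; the indicator word lists are invented for illustration, not learned from real articles:
# Sketch: decide whether a mention of "apple" is about the company by checking
# which set of indicator words occurs around it. The lists below are invented
# examples, not real learned features.
company_indicators = {"iphone", "ipad", "shares", "ceo", "stock"}
fruit_indicators = {"ate", "juice", "pie", "tree", "orchard"}

def is_company_mention(sentence):
    words = set(sentence.lower().split())
    return len(words & company_indicators) > len(words & fruit_indicators)

print(is_company_mention("Apple releases a new iPhone"))   # True
print(is_company_mention("I ate an apple this morning"))   # False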
You can try difflib.get_close_matches with the full company name.
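For instance, a small sketch; it uses lower-cased short forms rather than the full legal names so the similarity ratios stay meaningful:
# Sketch: fuzzy-match a token from the news text against known company names
# using difflib from the standard library.
import difflib

companies = ["dupont", "du pont", "ennis", "monsanto"]
print(difflib.get_close_matches("DuPont".lower(), companies, n=2, cutoff=0.8))
# -> ['dupont', 'du pont']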
