I am working on a Kaggle dataset and trying to extract BILUO entities using spaCy's 'training.offsets_to_biluo_tags' function. The original data is in CSV format, which I have managed to convert into the JSON format below:
{
"entities": [
{
"feature_text": "Lack-of-other-thyroid-symptoms",
"location": "['564 566;588 600', '564 566;602 609', '564 566;632 633', '564 566;634 635']"
},
{
"feature_text": "anxious-OR-nervous",
"location": "['13 24', '454 465']"
},
{
"feature_text": "Lack of Sleep",
"location": "['289 314']"
},
{
"feature_text": "Insomnia",
"location": "['289 314']"
},
{
"feature_text": "Female",
"location": "['6 7']"
},
{
"feature_text": "45-year",
"location": "['0 5']"
}
],
"pn_history": "45 yo F. CC: nervousness x 3 weeks. Increased stress at work. Change in role from researcher to lecturer. Also many responsibilities at home, caring for elderly mother and in-laws, and 17 and 19 yo sons. Noticed decreased appetite, but forces herself to eat 3 meals a day. Associated with difficulty falling asleep (duration 30 to 60 min), but attaining full 7 hours with no interruptions, no early morning awakenings. Also decreased libido for 2 weeks. Nervousness worsened on Sunday and Monday when preparing for lectures for the week. \r\nROS: no recent illness, no headache, dizziness, palpitations, tremors, chest pain, SOB, n/v/d/c, pain\r\nPMH: none, no pasMeds: none, Past hosp/surgeries: 2 vaginal births no complications, FHx: no pysch hx, father passed from acute MI at age 65 yo, no thyroid disease\r\nLMP: 1 week ago \r\nSHx: English literature professor, no smoking, occasional EtOH, no ilicit drug use, sexually active."
}
In the JSON, the entities part contains each feature text and its location(s) in the text, and the pn_history part contains the entire text document.
The first problem I have is that the dataset contains instances where a single text portion is tagged with more than one unique entity. For instance, the text located at position [289 314] belongs to two different entities, 'Insomnia' and 'Lack of Sleep'. While processing this type of instance, spaCy raises:
ValueError [E103] Trying to set conflicting doc.ents while creating
custom NER
The second problem I have in the dataset is that in some cases the start and end positions are given plainly, for instance [13 24], but in other cases the indices are scattered, e.g. '564 566;588 600', which contains a semicolon: the first set of word(s) is taken from location 564 566 and the second set of word(s) from location 588 600. These kinds of indices I cannot pass to the spaCy function.
Please advise how I can solve these problems.
OK, it sounds like you have two separate problems.
Overlapping entities. You'll need to decide what to do with these and filter your data; spaCy won't automatically handle this for you. It's up to you to decide what's "correct". Usually you would want to keep the longest entities, as in the sketch below. You could also use the recently released spancat, which is like NER but can handle overlapping annotations.
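For the overlapping case, here is a minimal sketch, assuming the JSON shown above has been loaded into a dict called data (my name, not yours), that builds character spans and keeps only the longest of any overlapping pair with spacy.util.filter_spans:

import ast
import spacy
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp(data["pn_history"])

spans = []
for ent in data["entities"]:
    label = ent["feature_text"]
    for loc in ast.literal_eval(ent["location"]):
        if ";" in loc:  # skip discontinuous locations for now (see below)
            continue
        start, end = (int(x) for x in loc.split())
        span = doc.char_span(start, end, label=label, alignment_mode="expand")
        if span is not None:
            spans.append(span)

doc.ents = filter_spans(spans)  # drops the shorter of any overlapping pair

If you go the spancat route instead, you would put the unfiltered spans into doc.spans["sc"] rather than doc.ents, since span groups are allowed to overlap.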
Discontinuous entities. These are your annotations with ;. These are harder; spaCy has no way to handle them at the moment (and in my experience, few systems handle discontinuous entities). Here's an example annotation from your sample:
[no] headache, dizziness, [palpitations]
Sometimes with discontinuous entities you can just include the middle part, but that won't work here. I don't think there's any good way to translate this into spaCy, because your input tag is "lack of thyroid symptoms". Usually I would model this as "thyroid symptoms" and handle negation separately; in this case that means you could just tag palpitations.
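If you do want to do something with the fragments (for example, treating each contiguous piece as its own span), a small parsing sketch for the semicolon-separated locations could look like this; the helper name is mine:

def split_location(loc):
    """'564 566;588 600' -> [(564, 566), (588, 600)]"""
    fragments = []
    for part in loc.split(";"):
        start, end = (int(x) for x in part.split())
        fragments.append((start, end))
    return fragments

print(split_location("564 566;588 600"))  # [(564, 566), (588, 600)]

Whether you then tag each fragment individually or drop these annotations is a modelling decision, not something spaCy will make for you.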
I am creating a script to extract text from a scanned PDF to build a JSON dictionary for later insertion into MongoDB. The issue I have run into is that tesseract-ocr via the textract module successfully extracted all the text, but because Python reads the result as raw bytes, all of the whitespace in the PDF is being turned into '\n', making it very hard to extract the necessary information.
I have tried cleaning it up with a bunch of lines of code, but it still is not very readable, and the cleanup strips all the colons, which I feel would make identifying the keys and values a lot easier.
stringedText = str(text)
cleanText = rmStop.replace('\n','')
splitText = re.split(r'\W+', cleanText)
caseingText = [word.lower() for word in splitText]
cleanOne = [word for word in caseingText if word != 'n']
dexStop = cleanOne.index("od260")
dexStart = cleanOne.index("sheet")
clean = cleanOne[dexStart + 1:dexStop]
I am still left with quite a bit of unclean, almost over-processed data, so at this point I don't know how to use it.
This is how I extracted the data:
text = textract.process(filename, method="tesseract", language="eng")
I have tried nltk as well, and that took out some data and made it a little easier to read, but there are still a lot of \n sequences muddling up the data.
Here is the nltk code:
stringedText = str(text)
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(stringedText)
rmStop = [i for i in tokens if not i in ENGLISH_STOP_WORDS]
Here is what I get from the first cleanup I tried:
['n21', 'feb', '2019', 'nsequence', 'lacz', 'rp', 'n5', 'gat', 'ctc', 'tac', 'cat', 'ggc', 'gca', 'cat', 'ttc', 'ccc', 'gaa', 'aag', 'tgc', '3', 'norder', 'no', '15775199', 'nref', 'no', '207335463', 'n25', 'nmole', 'dna', 'oligo', '36', 'bases', 'nproperties', 'amount', 'of', 'oligo', 'shipped', 'to', 'ntm', '50mm', 'nacl', '66', '8', 'xc2', 'xb0c', '11', '0', '32', '6', 'david', 'cook', 'ngc', 'content', '52', '8', 'd260', 'mmoles', 'kansas', 'state', 'university', 'biotechno', 'nmolecular', 'weight', '10', '965', '1', 'nnmoles']
From that I need a JSON object that looks like:
"lacz-rp" : {
"Date" : "21-feb-2019",
"Sequence" : "gatctctaccatggcgcacatttccccgaaaagtgc"
"Order No." : "15775199"
"Ref No." : "207335463"
}
and so on... I am just not sure what to do. I can also provide the raw output; this is what it looked like before I touched it. The data above is all the information I need to build the complete object.
b' \n\nIDT\nINTEGRATED DNA TECHNOLOGIES,\nOLIGONUCLEOTIDE SPECIFICATION SHEET\n\n \n\n21-Feb-2019\n\nSequence - LacZ-RP\n\n5\'- GAT CTC TAC CAT GGC GCA CAT TTC CCC GAA AAG TGC -3\'\n\nOrder No. 15775199\nref.No. 207335463\n\n25 nmole DNA Oligo, 36 bases\n\n \n\nProperties Amount Of Oligo Shipped To\nTm (50mM NaCl)*:66.8 \xc2\xb0C 11.0= 32.6 DAVID COOK\nGC Content: 52.8% D260 mmoles KANSAS STATE UNIVERSITY-BIOTECHNO.\n\nMolecular Weight: 10,965.1\nnmoles/OD260: 3.0\nug/OD260: 32.6\nExt. Coefficient: 336,200 L/(mole-cm)\n\nSecondary Structure Calcul\n\n \n\nns\nLowest folding free energy (kcal/mole): -3.53 at 25 \xc2\xb0C\n\nStrongest Folding Tm: 46.6 \xc2\xb0C\n\n \n\nOligo Base Types Quantity\n\nDi eo\nModifications and Services Quantity\nStandard Desalting 7\n\nMfg. 1D 289505556\n\n207335463 ~<<IDT\nD.cooK,\n\n2eosoesse 2uren20%9\n\n207335463 ~XIDT\nD.cooK,\n\n \n \n \n\n \n\nINSTRUCTIONS\n\n.d contents may appear as either a translucent film or a white powder.\nice does not affect the quality of the oligo,\n\n\xe2\x80\x9cPlease centrifuge tubes prior to opening. Some of the product may have been\ndislodged during shipping.\n\n\xe2\x80\x9cThe Tm shown takes no account of Mg?* and dNTP concentrations. Use the\nOligoAnalyzer\xc2\xae Program at www.idtdna.com/scitools to calculate accurate Tm for\nyour reaction conditions.\n\nFor 100 |M: add 326 [iL\n\nBURT HALL #207\n\nMANHATTAN, KS 66506\n\nUSA\n\n7855321362\n\nCustomer No. 378741 PO No.06BF3000\n\nDisclaimer\n\nSee on reverse page notes (I) (Il) & (lll) for usage, label\nlicense, and product warranties\n\x0cUse Restrictions: Oligonucleotides and nucleic acid products are manufactured and sold by IDT for the\ncustomer\'s research purposes only. Resale of IDT products requires the express written consent of IDT.\nUnless pursuant to a separate signed agreement by authorized IDT officials, IDT products are not sold\nfor (and have not been approved) for use in any clinical, diagnostic or therapeutic applications.\nObtaining license or approval to use IDT products in proprietary applications or in any non-research\n(clinical) applications is the customer\'s exclusive responsibility. DT will not be responsible or liable for\nany losses, costs, expenses, or other forms of liability arising out of the unauthorized or unlicensed use\nof IDT products. Purchasers of IDT products shall indemnify and hold IDT harmless for any and all\ndamages and/or liability, however characterized, related to the unauthorized or unlicensed use of IDT\nproducts. Under no circumstances shall IDT be liable for any consequential damages, resulting from\nany use (approved or otherwise) of IDT products. All orders received by IDT, and all sales of IDT\nproducts are made subject to the aforementioned use restrictions and customer indemnification of IDT.\n\nGeneral Warranty: IDT\'s products are guaranteed to meet or exceed our published specifications for\nidentity, purity and yield as measured under normal laboratory conditions. If our product fails to meet\nsuch specifications, IDT will promptly replace the product. 
A// other warranties are hereby expressly\ndisclaimed, including but not limited to, the implied warranties of merchantability and fitness for a\nparticular purpose, and any warranty that the products, or the use of products, manufactured by IDT will\nnot infringe the patents of one or more third-partiesAll orders received by IDT, and all sales of IDT\nproducts are made subject to the aforementioned disclaimers of warranties.\n\nSee http://www.idtdna.com/Catalog/Usage/Page1.aspx for further details\na) Cy Dyes: The purchase of this Product includes a limited non-exclusive sublicense under U.S\n\nPatent Nos. 5 556 959 and 5 808 044 and foreign equivalent patents and other foreign and U.S\ncounterpart applications to use the amidites in the Product to perform research. NO OTHER\nLICENSE IS GRANTED EXPRESSLY, IMPLIEDLY OR BY ESTOPPEL. Use of the Product for\ncommercial purposes is strictly prohibited without written permission from Amersham Biosciences\nCorp. For information concerning availability of additional licenses to practice the patented\nmethodologies, please contact Amersham Biosciences Corp, Business Licensing Manager,\nAmersham Place, Little Chalfont, Bucks, HP79NA.\n\nb) \xe2\x80\x94 BHQ: Black Hole Quencher, BHQ-0, BHQ-1, BHQ-2 and BHQ-3 are registered trademarks of\nBiosearch Technologies, Inc., Novato, California, U.S.A Patents are currently pending for the BHQ\ntechnology and such BHQ technology is licensed by the manufacturer pursuant to an agreement\nwith BTI and these products are sold exclusively for research and development use only. They\nmay not be used for human veterinary in vitro or clinical diagnostic purposes and they may not be\nre-sold, distributed or re-packaged. For information on licensing programs to permit use for human\nor veterinary in vitro or clinical diagnostic purposes, please contact Biosearch at\nlicensing#biosearchtech.com.\n\nc) MPI dyes: MPI dyes. This product is provided under license from Molecular Probes, Inc., for\nresearch use only, and is covered by pending and issued patents.\n\nd) Molecular Beacons: Molecular Beacons. This product is sold under license from the Public Health\nResearch Institute only for use in the purchaser\'s research and development activities.\n\ne) ddRNAi: This product is sold solely for use for research purposes in fields other than plants. This\nproduct is not transferable. If the purchaser is not willing to accept the conditions of this label\nlicense, supplier is willing to accept the return of the unopened product and provide the purchaser\nwith a full refund. However if the product is opened, then the purchaser agrees to be bound by the\nconditions of this limited use statement. This product is sold by supplier under license from\nBenitec Australia Ltd and CSIRO as co-owners of U.S Patent No. 6,573,099 and foreign\ncounterparts. For information regarding licenses to these patents for use of ddRNAi as a\ntherapeutic agent or as a method to treat/prevent human disease, please contact Benitec at\nlicensing#benitec.com. For the use of ddRNAi in other fields, please contact CSIRO at\nwww.pi.csiro.au/RNAi.\n\x0cf)\n\n9)\n\nh)\n\nk)\n\n))\n\nm)\n\nn)\n\nDicer Substrate RNAi:\n\n* These products are not for use in humans or non-human animals and may not be used for\nhuman or veterinary diagnostic, prophylactic or therapeutic purposes. 
Sold under license of\npatents pending jointly assigned to IDT and the City of Hope Medical Center.\n\nThis product is licensed under European Patents 1144623, 121945 and foreign equivalents\nfrom Alnylam Pharmaceuticals, Inc., Cambridge, USA and is provided only for use in\nacademic and commercial research whose purpose is to elucidate gene function, including\nresearch to validate potential gene products and pathways for drug discovery and\ndevelopment and to screen non-siRNA based compounds (but excluding the evaluation or\ncharacterization of this product as the potential basis for a siRNA based drug) and not for\nany other commercial purposes. Information about licenses for commercial use (including\ndiscovery and development of siRNA-based drugs) is available from Alnylam\nPharmaceuticals, Inc., 300 Third Street, Cambridge MA 02142, USA\n\nLicense under U.S. Patent # 6506559; Domestic and Foreign Progeny; including European\nPatent Application # 98964202\n\nLNAs: Protected by US. Pat No. 6,268,490 and foreign applications and patents owned or\ncontrolled by Exiqon A/S. For Research Use Only. Not for resale or for therapeutic use or use in\nhumans\n\nOther siRNA duplexes: This product is provided under license from Molecular Probes, Inc., for\nresearch use only, and is covered by pending and issued patents.\n\nAcrydite: IDT is licensed under U.S Patent Number 6,180,770 and 5,932,711 to sell this product\nfor use solely in the purchaser\'s own life sciences research and development activities. Resale, or\nuse of this product in clinical or diagnostic applications, or other commercial applications, requires\nseparate license from Mosaic, Inc.\n\nlso-Bases: Licensed under EraGen, Inc. United States Patents Number 5,432,272; 6,001,983;\n6,037,120; and 6,140,496. For research use Only.\n\nDig: Licensed from Roche Diagnostics GmbH\n\n5\' Nuclease Assay: The 5\' Nuclease Assay and other homogenous amplification methods used in\nconnection with the Polymerase Chain Reaction (PCR) process are covered by patents owned by\nRoche Molecular Systems, Inc. and F. Hoffman La-Roche Ltd (Roche). No license to use the 5"\nNuclease Assay or any Roche patented homogenous amplification process is conveyed expressly\nor by implication to the purchaser by the purchase of the above listed products or any other IDT\nproducts.\n\nlowa Black\xc2\xae FQ and RQ: lowa Black is a registered trademark of IDT, and lowa Black-labeled\noligos are covered by pending patents owned and controlled by IDT.\n\nIRDye\xc2\xae 700 and IRDye\xc2\xae 800: IRDye\xc2\xae 700 and IRDye\xc2\xae 800 are products manufactured under\nlicense from LI-COR\xc2\xae Biosciences, which expressly excludes the right to use this product in\nQPCR or AFLP applications.\n\x0c'
You can replace the escaped \n sequences with actual newlines. Please use the following:
formatted_text = stringedText.replace('\\n', '\n')
This will replace the escaped newlines with actual newlines in the output.
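Since textract.process() returns bytes, a cleaner alternative (a sketch, using the same filename variable from the question) is to decode the bytes instead of calling str() on them, which avoids the literal escape sequences entirely:

text = textract.process(filename, method="tesseract", language="eng")
decoded = text.decode("utf-8")  # real newlines, no b'...' wrapper
lines = [line.strip() for line in decoded.splitlines() if line.strip()]

From those non-empty lines you can then pick out the fields you need (date, sequence, order number, and so on) with string matching or regular expressions.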
I am trying to remove the bold tags (<b> Some text in bold here </b>) from this XML document (but want to keep the text covered by the tags intact). The bold tags are present around the following words/text: Objectives, Design, Setting, Participants, Interventions, Main outcome measures, Results, Conclusion, and Trial registrations.
This is my Python code:
import requests
import urllib
from urllib.request import urlopen
import xml.etree.ElementTree as etree
from time import sleep
import json
urlHead = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&rettype=abstract&id='
pmid = "28420629"
completeUrl = urlHead + pmid
response = urllib.request.urlopen(completeUrl)
tree = etree.parse(response)
studyAbstractParts = tree.findall('.//AbstractText')
for studyAbstractPart in studyAbstractParts:
    print(studyAbstractPart.text)
The problem with this code is that it finds all the text under the "AbstractText" tag, but it stops at (or ignores) the text inside the bold tags and everything after them. In principle, I need all the text between the "<AbstractText> </AbstractText>" tags, but the bold formatting <b> </b> is just an annoying obstruction to that.
You can use the itertext() method to get all the text in <AbstractText> and its subelements.
studyAbstractParts = tree.findall('.//AbstractText')
for studyAbstractPart in studyAbstractParts:
    for t in studyAbstractPart.itertext():
        print(t)
Output:
Objectives
To determine whether preoperative dexamethasone reduces postoperative vomiting in patients undergoing elective bowel surgery and whether it is associated with other measurable benefits during recovery from surgery, including quicker return to oral diet and reduced length of stay.
Design
Pragmatic two arm parallel group randomised trial with blinded postoperative care and outcome assessment.
Setting
45 UK hospitals.
Participants
1350 patients aged 18 or over undergoing elective open or laparoscopic bowel surgery for malignant or benign pathology.
Interventions
Addition of a single dose of 8 mg intravenous dexamethasone at induction of anaesthesia compared with standard care.
Main outcome measures
Primary outcome: reported vomiting within 24 hours reported by patient or clinician.
vomiting with 72 and 120 hours reported by patient or clinician; use of antiemetics and postoperative nausea and vomiting at 24, 72, and 120 hours rated by patient; fatigue and quality of life at 120 hours or discharge and at 30 days; time to return to fluid and food intake; length of hospital stay; adverse events.
Results
1350 participants were recruited and randomly allocated to additional dexamethasone (n=674) or standard care (n=676) at induction of anaesthesia. Vomiting within 24 hours of surgery occurred in 172 (25.5%) participants in the dexamethasone arm and 223 (33.0%) allocated standard care (number needed to treat (NNT) 13, 95% confidence interval 5 to 22; P=0.003). Additional postoperative antiemetics were given (on demand) to 265 (39.3%) participants allocated dexamethasone and 351 (51.9%) allocated standard care (NNT 8, 5 to 11; P<0.001). Reduction in on demand antiemetics remained up to 72 hours. There was no increase in complications.
Conclusions
Addition of a single dose of 8 mg intravenous dexamethasone at induction of anaesthesia significantly reduces both the incidence of postoperative nausea and vomiting at 24 hours and the need for rescue antiemetics for up to 72 hours in patients undergoing large and small bowel surgery, with no increase in adverse events.
Trial registration
EudraCT (2010-022894-32) and ISRCTN (ISRCTN21973627).
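If you would rather have each abstract section as a single string instead of separate fragments, the pieces from itertext() can simply be joined:

for studyAbstractPart in studyAbstractParts:
    print("".join(studyAbstractPart.itertext()))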
I have a text file where I have this information:
BRIEF DESCRIPTION A herbaceous, upright, often much branched, slightly woody plant, up to 2-4 m in height, with spiny pubescence, large yellow flowers, and fruits which at maturity dry to a longitudinally dehiscent capsule, 25 cm long or more. USES The young immature fruits are eaten fresh, cooked or fried as vegetables and the can be frozen, canned or dried. Fruits have medicinal properties. Ripe seeds contain 20% edible oil and they can be used as a substitute for coffee. In India, mucilage from the roots and stems has industrial value for clarifying sugarcane juice in gur manufacture. Dried okra powder is used in salad dressings, ice creams, cheese spreads, and confectionery. The stems provide a fiber of inferior quality. GROWING PERIOD Annual. May require 50-90 days to first harvest and the harvest period may continue up to 180 days. COMMON NAMES Okra, Ochro, Lady's Finger, Gumbo, Gombo, Cantarela, Quingombo, Rosenapfel, Bindi, Bhindee, Bhindi, Mesta, Vendakai, Kachang bendi, Kachang lender, Sayur bendi, Kachieb, Grajee-ap morn, You-padi, Ch'aan K'e, Tsau Kw'ai, Ila, Ilasha, Ilashodo, Quimbambo, Kopi arab, Khua ngwang, Krachiap mon, Dau bap. FURTHER INF Scientific synonym: Hibiscus esculentus. Okra originated in South-East Asia. Most varieties grow well in the lowland humid tropics up to elevations of 1000 m. Adapted to moderate to high humidity. Okra is a short-day plant, but it has a wide geographic distribution, up to latitudes 35-40°S and N. Yields of green pods are often low, about 2-4 t/ha owing to extreme growing conditiuons, but up to 10-40 t/ha may be produced.
I am using the library quantulum, which extracts all the measurements automatically.
BriefDescription is a variable that contains the text.
quantsDescription stores all the quantities parsed from BriefDescription.
I need to get the values whose unit is the string "metre" (the second element of the quantity tuple).
I need to figure out how to index into the tuples.
quantsDescription = parser.parse(BriefDescription)
quantsUses = parser.parse(Uses)
quantsPeriod = parser.parse(GrowingPeriod)
print 'BriefDescription Quant:'
print quantsDescription
print 'Uses Quant:'
print quantsUses
print 'GrowingPeriod Quant:'
print quantsPeriod
for i in quantsDescription:
    print type(i[1])  # indexing the second element of the tuple?
This is the output list for quantsDescription:
[Quantity(2, "metre"), Quantity(4, "metre"), Quantity(25, "centimetre")]
print quantsDescription[0].unit.name  # to get the unit's name
print quantsDescription[0].value  # to get the numeric value of the quantity
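To keep only the values whose unit is "metre", a minimal sketch (Python 2 to match the code above, assuming the Quantity attributes shown) is a list comprehension over the parsed quantities:

metre_values = [q.value for q in quantsDescription if q.unit.name == 'metre']
print metre_values  # [2, 4] for the output list above (the values may come back as floats)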