Extract disease names from raw data which has no pattern - Python

I want to extract disease words from medical data (doctors' notes, test results) to build a disease word dictionary. I'm using Python. I tried the following approaches:
Used the Google Custom Search API to check whether a word is a disease based on the search results. It didn't go well because it was extracting general medical terms too; I tried modifying the search query, and I would also have to buy Google CSE, which I feel is costly because I have a huge amount of data. The code is too long to include in the post.
Used Weka to predict the words, but my data is plain text that doesn't follow any rules and isn't in ARFF or CSV format.
Tried NER for extracting disease words, but all the models I have seen need a predefined dictionary to search against and perform TF-IDF on the input data. I don't have such a dictionary.
All the models I have seen suggest tokenizing and POS-tagging the data, which I did, but I couldn't find a way to extract only the disease words from there.
I even tried extracting only the nouns, which didn't work well because other medical terms are also tagged as nouns.
My data looks like the following, and the format is not consistent throughout the document:
After conducting clinical reviews the patient was suffering with
diabetes,htn which was revealed when a complete blood picture of the
patient's blood was done. He was advised to take PRINIVIL TABS 20 MG
(LISINOPRIL) 1.
Believe me, I googled a lot and couldn't come up with a good solution. Please suggest a way for me to move forward.
The following is one of the approaches I tried; it extracted general medical terms too. Sorry, the code looks a bit clumsy, and I am posting only the main function because posting the whole code would be very lengthy. The main logic lies in the search_word variable:
def search(self, wordd):  # implemented with the Google Custom Search Engine API
    #responseData = 'None'
    global flag
    global page
    search_word = "\"is+%s+an+organ?\"" % (wordd)
    search_word = str(search_word)
    if flag == 1:
        search_word = "\"%s+is+a+disease\"" % (wordd)
    try:  # searching Google for the word
        url = 'https://www.googleapis.com/customsearch/v1?key=AIzaSyAUGKCa2oHSYeZynSMD6zElBKUrg596G_k&cx=00262342415310682663:xy7prswaherw&num=3&q=' + search_word
        print url
        data = urllib2.urlopen(url)
        response_data = json.load(data)
        results = response_data['queries']['request'][0]['totalResults']
        results_count = int(results)
        print "the results is: ", results_count
        if results_count == 0:
            print "no results found"
            flag = 0
            return 0
        else:
            return 1
    #except IOError:
    #    print "network issues!"
    except ValueError:
        print "Problem while decoding JSON data!"

Related

What is the best way to extract the body of an article with Python?

Summary
I am building a text summarizer in Python. The documents I am mainly targeting are scholarly papers, which are usually in PDF format.
What I Want to Achieve
I want to effectively extract the body of the paper (abstract to conclusion), excluding the title of the paper, publisher names, images, equations, and references.
Issues
I have tried looking for effective ways to do this, but I was not able to find anything tangible and useful. The current code I have splits the PDF document into sentences and then filters out the entries that have fewer than the average number of characters per sentence. Below is the code:
from pdfminer import high_level

# input: string (path to the file)
# output: list of sentences
def pdf2sentences(pdf):
    article_text = high_level.extract_text(pdf)
    sents = article_text.split('.')  # splitting on '.' roughly splits on every sentence
    run_ave = 0
    for s in sents:
        run_ave += len(s)
    run_ave /= len(sents)
    sents_strip = []
    for sent in sents:
        if len(sent.strip()) >= run_ave:
            sents_strip.append(sent)
    return sents_strip
Note: I am using this article as input.
The above code seems to work fine, but I am still not able to filter out things like the title and publisher names that come before the abstract section, or things like the references section that comes after the conclusion. Moreover, images cause gibberish characters to show up in the text, which hurts the overall quality of the output, and because of the weird Unicode characters I am not able to write the output to a txt file.
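For what it's worth, here is a minimal sketch of one way to trim the text to the abstract-to-conclusion span and drop the characters that break plain-text output. It assumes the paper literally contains the headings "Abstract" and "References", which is an assumption worth checking against your inputs, and the file names below are placeholders:

import re
from pdfminer import high_level

def pdf2body(pdf):
    # Extract the raw text, then keep only the span between "Abstract" and "References".
    text = high_level.extract_text(pdf)
    match = re.search(r'Abstract(.*?)References', text, flags=re.S | re.I)
    if match:
        text = match.group(1)
    # Crude cleanup: drop characters that cannot be encoded as ASCII (image gibberish).
    return text.encode('ascii', errors='ignore').decode('ascii')

with open('out.txt', 'w', encoding='utf-8') as f:  # writing no longer fails on odd characters
    f.write(pdf2body('paper.pdf'))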
Appeal
Are there ways I can improve the performance of this parser and make it more consistent?
Thank you for your answers!

Is there a way to identify cities in a text without maintaining a prior vocabulary, in Python?

I have to identify cities in a document (plain text only). I do not want to maintain an entire vocabulary, as that is not a practical solution, and I do not have an Azure Text Analytics API account.
I have already tried spaCy: I ran NER, identified geolocation entities, and passed that output to spellchecker() to train the model. The issue with this is that NER expects sentences, while my input consists of single words.
I am relatively new to this field.
You can check out the geotext library.
Working example with a sentence:
text = "The capital of Belarus is Minsk. Minsk is not so far away from Kiev or Moscow. Russians and Belarussians are nice people."
from geotext import GeoText
places = GeoText(text)
print(places.cities)
Output:
['Minsk', 'Minsk', 'Kiev', 'Moscow']
Working example with list of words:
wordList = ['London', 'cricket', 'biryani', 'Vilnius', 'Delhi']
for word in wordList:
    places = GeoText(word)
    if places.cities:
        print(places.cities)
Output:
['London']
['Vilnius']
['Delhi']
geograpy is another alternative. However, I find geotext lighter because it has fewer external dependencies.
There is a list of libraries that may help you, but from my experience there is no perfect library for this. If you know all the cities that may appear in the text, then a vocabulary is the best option.
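A minimal sketch of that vocabulary approach, assuming you already have (or can build) a set of known city names; the names below are placeholders:

# Hypothetical vocabulary; in practice this would come from a gazetteer file.
known_cities = {"London", "Vilnius", "Delhi", "Minsk"}

words = ['London', 'cricket', 'biryani', 'Vilnius', 'Delhi']
matches = [w for w in words if w in known_cities]
print(matches)  # ['London', 'Vilnius', 'Delhi']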

Finding full taxonomy (hierarchical hypernymy sequence) of a given DBpedia resource using SPARQL

Given a DBpedia resource, I want to find the entire taxonomy up to the root.
For example, in plain English: for Barack Obama I want to know the entire taxonomy, which goes Barack Obama → Politician → Person → Being.
I have written the following recursive function for the same:
import requests
import json
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")

def get_taxonomy(results, entity, hypernym_list):
    '''This recursive function keeps on fetching the hypernyms of the
    DBpedia resource recursively till the highest concept or root is reached'''
    if entity == 'null':
        return hypernym_list
    else:
        query = ''' SELECT ?hypernyms WHERE {<''' + entity + '''> <http://purl.org/linguistics/gold/hypernym> ?hypernyms .}
        '''
        sparql.setQuery(query)
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()
        for result in results["results"]["bindings"]:
            hypernym_list.append(result['hypernyms']['value'])
        if len(results["results"]["bindings"]) == 0:
            return get_taxonomy(results, 'null', hypernym_list)
        return get_taxonomy(results, results["results"]["bindings"][0]['hypernyms']['value'], hypernym_list)

def get_taxonomy_of_resource(dbpedia_resource):
    list_for_hypernyms = []
    results = {}
    results["results"] = {}
    results["results"]["bindings"] = [1, 2, 3]
    taxonomy_list = get_taxonomy(results, dbpedia_resource, list_for_hypernyms)
    return taxonomy_list
The code works for the following input:
get_taxonomy_of_resource('http://dbpedia.org/resource/Barack_Obama')
Output:
['http://dbpedia.org/resource/Politician',
'http://dbpedia.org/resource/Person', 'http://dbpedia.org/resource/Being']
Problem:
But for the following input it only returns the hypernym one level up and then stops:
get_taxonomy_of_resource('http://dbpedia.org/resource/Steve_Jobs')
Output:
['http://dbpedia.org/resource/Entrepreneur']
Research:
While doing some research on their site (dbpedia.org/page/<term>), I realized that the reason it stopped at Entrepreneur is that when I click on this resource on their site, it takes me to the resource 'Entrepreneurship' and states its hypernym as 'Process'. So now my problem comes down to this question:
How do I know that Entrepreneur redirects to Entrepreneurship, even though both are valid DBpedia entities? My recursive function fails because of this: in the next iteration it attempts to find the hypernym for Entrepreneur rather than Entrepreneurship.
Any help is duly appreciated
I have faced this same problem before while writing a program to generate taxonomies, and my solution was to additionally use Wiktionary when my main resource failed to provide a hypernym.
The Wiktionary dump can be downloaded and parsed into a Python dictionary.
For example, the Wiktionary entry for 'entrepreneur' contains the following:
Noun
entrepreneur (plural entrepreneurs)
A person who organizes and operates a business venture and assumes much of the associated risk.
From this definition, the hypernym ('person') can be extracted.
Naturally, this approach entails writing code to extract the hypernym from a definition (a task which is at times easy and at times hard depending on the wording of the definition).
This approach provides a fallback routine for cases where the main resource (DBpedia in your case) fails to provide a hypernym.
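A minimal sketch of such a fallback, assuming the definition text is already available as a string; the regex is just one heuristic for wordings like "A person who ...":

import re

def hypernym_from_definition(definition):
    # Heuristic: take the noun after a leading "A/An/The", e.g.
    # "A person who organizes and operates a business venture ..." -> "person"
    match = re.match(r'(?:a|an|the)\s+([a-z]+)\s+(?:who|that|which)\b',
                     definition.strip(), re.I)
    return match.group(1).lower() if match else None

definition = ("A person who organizes and operates a business venture "
              "and assumes much of the associated risk.")
print(hypernym_from_definition(definition))  # person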
Finally, as stated by AKSW, it is good to have a method to catch incorrect hypernym relations (e.g. Entrepreneur - Process). There is the area of textual entailment in natural language processing, which studies methods for determining whether a statement contradicts (or implies, etc.) another statement.
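As an aside, for the specific Entrepreneur → Entrepreneurship issue, one option is to resolve wiki page redirects before asking for the hypernym. A sketch, assuming the endpoint exposes redirects via dbo:wikiPageRedirects (worth verifying for your resources):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")

def resolve_redirect(entity):
    # If the resource redirects elsewhere (e.g. Entrepreneur -> Entrepreneurship),
    # return the redirect target; otherwise return the resource itself.
    sparql.setQuery('SELECT ?target WHERE {<' + entity +
                    '> <http://dbpedia.org/ontology/wikiPageRedirects> ?target .}')
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    return bindings[0]['target']['value'] if bindings else entity

print(resolve_redirect('http://dbpedia.org/resource/Entrepreneur'))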

Python Whoosh - Combining Results

Thanks in advance for taking the time to answer this. I'm relatively new to both Python (3.6) and Whoosh (2.7.4), so forgive me if I'm missing something obvious.
Whoosh 2.7.4 — Combining Results Error
I'm trying to follow the instructions in the Whoosh Documentation here on How to Search > Combining Results. However, I'm really lost in this section:
# Get the terms searched for
termset = set()
userquery.existing_terms(termset)
As I run my code, it produces this error:
'set' object has no attribute 'schema'
What went wrong?
I also looked into the docs about the Whoosh API on this, but I just got more confused about the role of ixreader. (Or is it index.Index.reader()?) Shrugs
A Peek at My Code
Schema
schema = Schema(uid=ID(unique=True, stored=True),  # unique ID
                indice=ID(stored=True, sortable=True),
                title=TEXT,
                author=TEXT,
                body=TEXT(analyzer=LanguageAnalyzer(lang)),
                hashtag=KEYWORD(lowercase=True, commas=True,
                                scorable=True)
                )
The relevant field names are 'hashtag' and 'body'. Hashtags are user-selected keywords for each document, and body is the text of the document. Pretty self-explanatory, no?
Search Function
Much of this is lifted directly from the Whoosh documentation. Note that dic is just a dictionary containing the query string. Also, the error occurs at userquery.existing_terms(termset), so if the rest of it is bunk, my apologies; I haven't gotten that far.
try:
    ix = index.open_dir(self.w_path, indexname=lang)
    qp = QueryParser('body', schema=ix.schema)
    userquery = qp.parse(dic['string'])

    termset = set()
    userquery.existing_terms(termset)

    bbq = Or([Term('hashtag', text) for fieldname, text
              in termset if fieldname == 'body'])
    s = ix.searcher()
    results = s.search(bbq, limit=5)
    allresults = s.search(userquery, limit=10)
    results.upgrade_and_extend(allresults)
    for r in results:
        print(r)
except Exception as e:
    print('failed to search')
    print(e)
    return False
finally:
    s.close()
Goal of My Code
I am taking pages from different files (PDF, EPUB, etc.) and storing each page's text as a separate 'document' in a Whoosh index (i.e. in the field 'body'). Each 'document' is also labeled with a unique ID (uid) that lets me take the search Results and determine which file a hit comes from and which pages contain it (e.g. the document for page 2 of "1.pdf" has the uid 1.2). In other words, I want to give the user a list of page numbers that contain the search term, and perhaps the pages with the most hits. For each file, the only document that has hashtags (or keywords) is the one with a uid ending in zero (i.e. page zero, e.g. uid 1.0 for "1.pdf"). Page zero may or may not have a 'body' too (e.g. the publish date, author names, summary, etc.). I did this to prevent a file with many pages from being ranked dramatically higher than one with considerably fewer pages just because the keyword is repeated across each of its 'documents' (i.e. pages).
Ultimately, I just want the code to rank documents with a matching hashtag above documents with only search hits in the body text. I thought about just boosting the hashtag field instead, but I'm not sure how the mechanics of that work, and the documentation recommends against it.
Suggestions and corrections would be greatly appreciated. Thank you again!
The code from your link doesn't look right to me; it gives me the same error too. Try rearranging your code as follows:
try:
    ix = index.open_dir(self.w_path, indexname=lang)
    qp = QueryParser('body', schema=ix.schema)
    userquery = qp.parse(dic['string'])

    s = ix.searcher()
    allresults = s.search(userquery, limit=10)
    termset = userquery.existing_terms(s.reader())

    bbq = Or([Term('hashtag', text) for fieldname, text
              in termset if fieldname == 'body'])
    results = s.search(bbq, limit=5)
    results.upgrade_and_extend(allresults)
    for r in results:
        print(r)
except Exception as e:
    print('failed to search')
    print(e)
    return False
finally:
    s.close()
existing_terms requires a reader, so I create the searcher first and pass its reader to it.
As for boosting a field, the mechanics are quite simple:
schema = Schema(title=TEXT(field_boost=2.0), body=TEXT).
Add a sufficiently high boost to bring hashtag documents to the top and be sure to apply a single query on both body and hashtag fields.
Deciding between boosting and combining depends on whether you want all matching hashtag documents to always appear at the top, before any other matches. If so, combine. If instead you prefer to strike a balance in relevance, albeit with a stronger bias toward hashtags, boost.
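A minimal sketch of the boosting route applied to your schema, assuming a single MultifieldParser query over both fields; the boost value 3.0 is an arbitrary placeholder to tune:

from whoosh.fields import Schema, ID, TEXT, KEYWORD
from whoosh.qparser import MultifieldParser

schema = Schema(uid=ID(unique=True, stored=True),
                body=TEXT,
                hashtag=KEYWORD(lowercase=True, commas=True, scorable=True,
                                field_boost=3.0))  # arbitrary boost; tune as needed

# One query over both fields, so hashtag matches score higher than body-only matches.
qp = MultifieldParser(['body', 'hashtag'], schema=schema)
userquery = qp.parse('some search terms')  # hypothetical query string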

Extracting FASTA Moonlighting Protein Sequences with Python

I want to extract the FASTA files that contain the amino acid sequences from the Moonlighting Protein Database ( www.moonlightingproteins.org/results.php?search_text= ) via Python, since it's an iterative process that I'd rather learn how to program than do manually, because come on, we're in 2016. The problem is I don't know how to write the code, because I'm a rookie programmer :( . The basic pseudocode would be:
for protein_name in site: www.moonlightingproteins.org/results.php?search_text=:
    go to the uniprot option
    download the fasta file
    store it in a .txt file inside a given folder
Thanks in advance!
I would strongly suggest asking the authors for the database. From the FAQ:
I would like to use the MoonProt database in a project to analyze the
amino acid sequences or structures using bioinformatics.
Please contact us at bioinformatics#moonlightingproteins.org if you are
interested in using MoonProt database for analysis of sequences and/or
structures of moonlighting proteins.
Assuming you find something interesting, how are you going to cite it in your paper or your thesis?
"The sequences were scraped from a public webpage without the consent of the authors". Much better to give credit to the original researchers.
That's a good introduction to scraping, but back to your original question.
import requests
from lxml import html

# let's download one protein at a time, change 3 to any other number
page = requests.get('http://www.moonlightingproteins.org/detail.php?id=3')

# convert the html document to something we can parse in Python
tree = html.fromstring(page.content)

# get all table cells
cells = tree.xpath('//td')

for i, cell in enumerate(cells):
    if cell.text:
        # if we get something which looks like a FASTA sequence, print it
        if cell.text.startswith('>'):
            print(cell.text)
        # if we find a table cell which has UniProt in it,
        # let's print the link from the next cell
        if 'UniProt' in cell.text_content():
            if cells[i + 1].find('a') is not None and 'href' in cells[i + 1].find('a').attrib:
                print(cells[i + 1].find('a').attrib['href'])
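To cover the rest of the pseudocode (loop over proteins and store each FASTA sequence in a .txt file), here is a minimal extension of the above. It assumes the detail pages use sequential id values and uses a placeholder folder name and id range, so adjust both to the actual site:

import os
import requests
from lxml import html

os.makedirs('fasta_out', exist_ok=True)

# Hypothetical id range; adjust to however many entries the database actually has.
for protein_id in range(1, 11):
    page = requests.get('http://www.moonlightingproteins.org/detail.php?id=%d' % protein_id)
    tree = html.fromstring(page.content)
    for cell in tree.xpath('//td'):
        if cell.text and cell.text.startswith('>'):
            # store the FASTA sequence in its own .txt file
            with open(os.path.join('fasta_out', '%d.txt' % protein_id), 'w') as f:
                f.write(cell.text)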
